Solr/Lucene: Use less memory for Term Indices

Intro

This was done on Solr/Lucene 4.4, but it is also relevant for later versions.

Term indices live in memory. If you have a lot of documents and/or lots of indexed fields, those term indices will require a lot of memory. You can, however, do something to limit it.

Problem

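The term index is essentially the content of the .tip files in your Lucene index folder. Those files are held in memory almost 1-to-1, so their combined size on disk is a good estimate of the memory the term index needs.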
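To get a rough feel for that number on your own system, you can add up the .tip files on disk. The following is a minimal sketch, not part of the original setup: the class name `TipFileSizes` and the default path `data/index` are made up for illustration, and it only sees .tip files that exist as separate files (segments stored in compound .cfs files are not counted).

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Sums the on-disk size of all .tip files below an index directory. */
final class TipFileSizes {

    /** Returns the total size in bytes of all regular files ending in ".tip" under indexDir. */
    static long totalTipBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.walk(indexDir)) {
            return files
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".tip"))
                .mapToLong(p -> {
                    try {
                        return Files.size(p);
                    } catch (IOException e) {
                        // Streams cannot throw checked exceptions, so wrap it.
                        throw new UncheckedIOException(e);
                    }
                })
                .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default path; pass your own index directory as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "data/index");
        long bytes = totalTipBytes(indexDir);
        System.out.printf("%s: %.2f MB of .tip files%n", indexDir, bytes / (1024.0 * 1024.0));
    }
}
```

Running it before and after a change of postings format gives a quick, if approximate, indication of how much the term index shrank.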

A concrete case I have worked on:

An indexed id field of type string, with fairly long IDs that are all unique
A SolrCloud system where each Solr instance contains about 30 billion documents
Memory usage for the in-memory term index alone of about 10 GB

10 GB is a lot, so we wanted to reduce it.

Solution

BlockTreeTermsWriter, which is used by Lucene41PostingsFormat, takes two parameters called minTermBlockSize and maxTermBlockSize, with default values 25 and 48 respectively. Increasing those values reduces the term index size, because larger term blocks mean fewer blocks for the term index to point at. Lucene/Solr does not let you configure this out of the box, but you can do it yourself.

Create a new abstract postings format that lets you increase the term block sizes by a factor

 
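 // Package names as in Lucene 4.4; they may differ in later versions.
 import java.io.IOException;
 
 import org.apache.lucene.codecs.BlockTreeTermsWriter;
 import org.apache.lucene.codecs.FieldsConsumer;
 import org.apache.lucene.codecs.FieldsProducer;
 import org.apache.lucene.codecs.PostingsFormat;
 import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
 import org.apache.lucene.index.SegmentReadState;
 import org.apache.lucene.index.SegmentWriteState;
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
 /**
  * Same as {@link Lucene41PostingsFormat} except that
  * * minTermBlockSize is {@link BlockTreeTermsWriter#DEFAULT_MIN_BLOCK_SIZE} * factor (instead of just {@link BlockTreeTermsWriter#DEFAULT_MIN_BLOCK_SIZE})
  * * maxTermBlockSize is 2 * (minTermBlockSize - 1)
  */
 public abstract class Lucene41FactorPostingsFormat extends PostingsFormat {
  private static final Logger log = LoggerFactory.getLogger(Lucene41FactorPostingsFormat.class);
 
  // All reading and writing is delegated to a Lucene41PostingsFormat with larger term blocks
  private final Lucene41PostingsFormat delegate;
 
  public Lucene41FactorPostingsFormat(int factor) {
   // The name is recorded in each segment and used to look the format up again when reading
   super("Lucene41x" + factor);
   int minTermBlockSize = BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE * factor;
   int maxTermBlockSize = 2 * (minTermBlockSize - 1);
   log.info("{}({},{})", getName(), minTermBlockSize, maxTermBlockSize);
   delegate = new Lucene41PostingsFormat(minTermBlockSize, maxTermBlockSize);
  }
 
  @Override
  public String toString() {
   return delegate.toString();
  }
 
  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
   return delegate.fieldsConsumer(state);
  }
 
  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
   return delegate.fieldsProducer(state);
  }
 
 }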
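To see what block sizes the constructor above ends up with, the arithmetic can be run on its own. This is a small stand-alone sketch: the class name `TermBlockSizes` is made up, and the default minimum of 25 is hardcoded (copied from the value stated earlier) so that it runs without Lucene on the classpath.

```java
/** Reproduces the block-size arithmetic of the factor-based postings format. */
final class TermBlockSizes {

    // Copy of BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE, hardcoded for this sketch.
    static final int DEFAULT_MIN_BLOCK_SIZE = 25;

    /** minTermBlockSize = default minimum * factor */
    static int minBlockSize(int factor) {
        return DEFAULT_MIN_BLOCK_SIZE * factor;
    }

    /** maxTermBlockSize = 2 * (minTermBlockSize - 1) */
    static int maxBlockSize(int factor) {
        return 2 * (minBlockSize(factor) - 1);
    }

    public static void main(String[] args) {
        for (int factor : new int[] {1, 4, 16}) {
            System.out.println("factor " + factor
                + ": min=" + minBlockSize(factor)
                + ", max=" + maxBlockSize(factor));
        }
        // prints:
        // factor 1: min=25, max=48
        // factor 4: min=100, max=198
        // factor 16: min=400, max=798
    }
}
```

Factor 1 reproduces the Lucene defaults (25 and 48), which is a handy sanity check on the formula.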

Create concrete postings formats with fixed factors, e.g. factors 4 and 16

 
 /** 
  * {@link Lucene41FactorPostingsFormat} using factor 4
  */
 public class Lucene41x4PostingsFormat extends Lucene41FactorPostingsFormat {
 
  public Lucene41x4PostingsFormat() {
   super(4);
  }
 
 }

 
 /** 
  * {@link Lucene41FactorPostingsFormat} using factor 16
  */
 public class Lucene41x16PostingsFormat extends Lucene41FactorPostingsFormat {
 
  public Lucene41x16PostingsFormat() {
   super(16);
  }
 
 }
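One step that is easy to miss: Lucene looks up postings formats by name through Java's service-provider mechanism, so the concrete classes most likely need to be listed in a file named META-INF/services/org.apache.lucene.codecs.PostingsFormat inside the jar you make available to Solr. A minimal sketch of that file, assuming the classes live in a hypothetical package com.example.codecs:

```text
# Hypothetical package name; use the package your classes actually live in.
com.example.codecs.Lucene41x4PostingsFormat
com.example.codecs.Lucene41x16PostingsFormat
```

Only the concrete classes are listed, since the lookup needs a public no-argument constructor, which the abstract factor class does not have.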

Use those postings-formats instead of plain Lucene41PostingsFormat

In schema.xml

Declare the new field-types - e.g. for strings

 
 <fieldType name="stringx4" class="solr.StrField" sortMissingLast="true" postingsFormat="Lucene41x4" />
 <fieldType name="stringx16" class="solr.StrField" sortMissingLast="true" postingsFormat="Lucene41x16" />

Use the new field-types for your fields - e.g. an indexed id string field

 
 <field name="id" type="stringx16" indexed="true" stored="true" required="true"/>

You can make the change for any indexed field. We only did it for our id field.
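For a postingsFormat attribute on a field type to take effect, Solr 4.x also needs a schema-aware codec factory in solrconfig.xml. If your configuration does not already declare one, the entry usually looks like the fragment below; check it against the Solr version you run.

```xml
<!-- solrconfig.xml: lets field types in schema.xml choose their own postings format -->
<codecFactory class="solr.SchemaCodecFactory"/>
```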

You can make the change for existing Lucene indices (replicas in Solr) and continue searching and indexing in them. Lucene is still able to read the term indices of the existing segments (written with Lucene41PostingsFormat), but their size in memory will not be reduced. New segments in the index will be written with Lucene41xXXPostingsFormat, and their size in memory will be reduced. As merges occur, more and more segments will be written in the new postings format. If you optimize, all segments will be rewritten in the new postings format and you get the full benefit.

Consequences

In our concrete setup we saw the following consequences:

The 10 GB memory usage dropped to about 1.5 GB, a reduction of roughly 85%
No significant change in search response time (for the searches we do in practice)
Indexing became about 10% slower

The consequences you see will probably depend a lot on your concrete setup, so make sure to test thoroughly.

The change seems to matter much more for string fields than for e.g. int or long fields.

Tuesday, 3 February 2015
