Intro
This was done on Solr/Lucene 4.4, but it is also relevant for later versions.
Term indices live in memory. If you have a lot of documents and/or lots of indexed fields, those term indices will require a lot of memory. But you can do something to limit that.
Problem
The term index is basically the .tip files in your Lucene index-folder. They live almost 1-to-1 in memory.
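Since the .tip files map almost 1-to-1 to memory, summing their sizes on disk should give a rough estimate of how much memory the term index needs. Below is a minimal sketch using only the Java standard library. The class name TipSizeEstimator and the default path data/index are made up for illustration, and segments stored as compound files (.cfs) will not be counted because their .tip data is packed inside the compound file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

final class TipSizeEstimator {

    /** Sums the sizes (in bytes) of all regular files ending in .tip under indexDir. */
    static long totalTipBytes(Path indexDir) throws IOException {
        try (Stream<Path> files = Files.walk(indexDir)) {
            return files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith(".tip"))
                    .mapToLong(p -> {
                        try {
                            return Files.size(p);
                        } catch (IOException e) {
                            throw new UncheckedIOException(e);
                        }
                    })
                    .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real Lucene index folder as the first argument.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : "data/index");
        if (!Files.isDirectory(indexDir)) {
            System.out.println("Not a directory: " + indexDir);
            return;
        }
        long bytes = totalTipBytes(indexDir);
        System.out.printf("Total .tip size: %.2f MB%n", bytes / (1024.0 * 1024.0));
    }
}
```

Running it before and after a change of postings-format gives a quick way to compare term index sizes without attaching a profiler.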
A concrete case I have worked on
- An indexed id field of type string, with fairly long ids that are all unique
- SolrCloud system where each Solr contains about 30 billion documents
- Memory usage for in-memory term-index alone is about 10 GB
Solution
BlockTreeTermsWriter, used by Lucene41PostingsFormat, supports two parameters called minTermBlockSize and maxTermBlockSize, with default values 25 and 48 respectively. Increasing those values reduces the term index size, because the terms are grouped into fewer, larger blocks and the in-memory index has fewer blocks to point to. There is no out-of-the-box support for changing them in Lucene/Solr, but you can do it yourself.
Create a new abstract postings-format that lets you increase the term-block-sizes by a factor
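// Imports assume the Lucene 4.4 package layout and slf4j for logging
import java.io.IOException;

import org.apache.lucene.codecs.BlockTreeTermsWriter;
import org.apache.lucene.codecs.FieldsConsumer;
import org.apache.lucene.codecs.FieldsProducer;
import org.apache.lucene.codecs.PostingsFormat;
import org.apache.lucene.codecs.lucene41.Lucene41PostingsFormat;
import org.apache.lucene.index.SegmentReadState;
import org.apache.lucene.index.SegmentWriteState;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Same as {@link Lucene41PostingsFormat} except that
 *
 * minTermBlockSize is {@link BlockTreeTermsWriter#DEFAULT_MIN_BLOCK_SIZE} * <some factor>
 * (instead of just {@link BlockTreeTermsWriter#DEFAULT_MIN_BLOCK_SIZE})
 *
 * maxTermBlockSize is 2 * (minTermBlockSize - 1)
 */
public abstract class Lucene41FactorPostingsFormat extends PostingsFormat {

  private static final Logger log = LoggerFactory.getLogger(Lucene41FactorPostingsFormat.class);

  // All reading and writing is delegated to a Lucene41PostingsFormat with larger term blocks
  private final Lucene41PostingsFormat delegate;

  public Lucene41FactorPostingsFormat(int factor) {
    // The name is what gets recorded in the segments and referenced from schema.xml
    super("Lucene41x" + factor);
    int minTermBlockSize = BlockTreeTermsWriter.DEFAULT_MIN_BLOCK_SIZE * factor;
    int maxTermBlockSize = 2 * (minTermBlockSize - 1);
    log.info(getName() + "(" + minTermBlockSize + "," + maxTermBlockSize + ")");
    delegate = new Lucene41PostingsFormat(minTermBlockSize, maxTermBlockSize);
  }

  @Override
  public String toString() {
    return delegate.toString();
  }

  @Override
  public FieldsConsumer fieldsConsumer(SegmentWriteState state) throws IOException {
    return delegate.fieldsConsumer(state);
  }

  @Override
  public FieldsProducer fieldsProducer(SegmentReadState state) throws IOException {
    return delegate.fieldsProducer(state);
  }
}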
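To make the block-size formula concrete, here is a small standalone sketch that prints the resulting min/max term-block sizes for a few sample factors. It does not depend on Lucene: the constant 25 is copied from the default mentioned above, and the class name BlockSizes is made up for illustration.

```java
final class BlockSizes {

    // Default minTermBlockSize named in the text (25); copied here so the sketch runs without Lucene.
    static final int DEFAULT_MIN_BLOCK_SIZE = 25;

    /** minTermBlockSize = default * factor. */
    static int minBlockSize(int factor) {
        if (factor < 1) {
            throw new IllegalArgumentException("factor must be >= 1");
        }
        return DEFAULT_MIN_BLOCK_SIZE * factor;
    }

    /** maxTermBlockSize = 2 * (minTermBlockSize - 1). */
    static int maxBlockSize(int factor) {
        return 2 * (minBlockSize(factor) - 1);
    }

    public static void main(String[] args) {
        for (int factor : new int[] {1, 4, 16}) {
            System.out.println("factor " + factor
                    + ": min=" + minBlockSize(factor)
                    + ", max=" + maxBlockSize(factor));
        }
        // prints:
        // factor 1: min=25, max=48
        // factor 4: min=100, max=198
        // factor 16: min=400, max=798
    }
}
```

Note that factor 1 reproduces the defaults of 25 and 48, so the formula is a straight generalization of the stock settings.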
Create new postings-formats with concrete factors, e.g. postings-formats with factors 4 and 16
/**
 * {@link Lucene41FactorPostingsFormat} using factor 4
 */
public class Lucene41x4PostingsFormat extends Lucene41FactorPostingsFormat {

  public Lucene41x4PostingsFormat() {
    super(4);
  }
}

/**
 * {@link Lucene41FactorPostingsFormat} using factor 16
 */
public class Lucene41x16PostingsFormat extends Lucene41FactorPostingsFormat {

  public Lucene41x16PostingsFormat() {
    super(16);
  }
}
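Use those postings-formats instead of plain Lucene41PostingsFormat. Lucene looks postings-formats up by name through Java's service-provider mechanism, so the new classes must also be listed in a META-INF/services/org.apache.lucene.codecs.PostingsFormat file in the jar you deploy, and solrconfig.xml must use solr.SchemaCodecFactory as codec factory for the postingsFormat attribute in schema.xml to take effect.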
In schema.xml
- Declare the new field-types - e.g. for strings
<fieldType name="stringx4" class="solr.StrField" sortMissingLast="true" postingsFormat="Lucene41x4" />
<fieldType name="stringx16" class="solr.StrField" sortMissingLast="true" postingsFormat="Lucene41x16" />
<field name="id" type="stringx16" indexed="true" stored="true" required="true"/>
You can make the change for any indexed field. We only did it for our id field.
You can make the change for existing Lucene indices (replicas in Solr) and continue searching and indexing in them. Lucene is still able to read the term indices of the existing segments (written with Lucene41PostingsFormat), but their size in memory will not be reduced. New segments in the index will be written with Lucene41xXXPostingsFormat, and their size in memory will be reduced. As merging occurs, more and more segments will be written in the new postings-format. If you optimize, all segments will be rewritten in the new postings-format and you will get the full benefit.
Consequences
In our concrete setup we saw the following consequences
- The 10 GB memory usage dropped to about 1.5 GB (roughly an 85% reduction)
- No significant changes in search response-time (for the searches we do in practice)
- Indexing about 10% slower
The change seems to matter much more for string fields than for e.g. int or long fields.