Monday, May 26, 2014

Performance of AND-queries with uneven hits


Intro


Using Solr 4.4

Selected parts of my schema.xml
 
    <field name="_version_" type="dlng" indexed="false" stored="true" required="true" docValues="true"/>
    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <dynamicField name="*_str_ind_sto" type="string" indexed="true" stored="true"/>
    <dynamicField name="*_dstr_doc_sto" type="dstring" indexed="false" stored="true" required="true" docValues="true"/>
    <dynamicField name="*_str_sto" type="string" indexed="false" stored="true"/>
    <dynamicField name="*_int_sto" type="int" indexed="false" stored="true"/>
    <dynamicField name="*_lng_sto" type="long" indexed="false" stored="true"/>
    <dynamicField name="*_lng_ind_sto" type="long" indexed="true" stored="true"/>
    <dynamicField name="*_dlng_doc_sto" type="dlng" indexed="false" stored="true" required="true" docValues="true"/>
    <dynamicField name="*_dlng_doc_ind_sto" type="dlng" indexed="true" stored="true" required="true" docValues="true"/>
    <dynamicField name="*_txt_ind_sto_tp" type="text" indexed="true" stored="true" termVectors="true" termPositions="true"/>
    ...
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" />
    <fieldType name="dstring" class="solr.StrField" sortMissingLast="true" docValuesFormat="Disk"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/> 
    <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
        ...
    </fieldType>
All documents have the following fields
  • start_timestamp_dlng_doc_ind_sto (long, docvalue, indexed, stored)
  • end_timestamp_lng_ind_sto (long, indexed, stored)
  • no1_dlng_doc_ind_sto (long, docvalue, indexed, stored)
  • no2_dlng_doc_ind_sto (long, docvalue, indexed, stored)
  • 2 other int_sto fields
  • 2 other dlng_doc_ind_sto fields
  • 3 other lng_ind_sto fields
  • 2 other str_sto fields
  • 1 other lng_sto field
My collection contains about 5 billion documents with the following distribution
  • start_timestamp close to uniformly distributed over a month (December 2013)
  • no1 distributed over 5 million numbers
  • no2 distributed over millions of numbers - not in any way correlated with the distribution of no1
I want, for a certain period in time (a sub-period of December 2013), to find the unique no2's for a particular no1.

First obvious attempt


This can be achieved by doing a facet-search like this
 
q=no1_dlng_doc_ind_sto:(<some-number>) AND 
start_timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])&
facet=true&facet.field=no2_dlng_doc_ind_sto&facet.limit=-1&facet.offset=0&facet.mincount=1&facet.zeros=false&start=0&rows=0&distrib=true

Such a search hits about 500-3000 documents on the no1_dlng_doc_ind_sto part of the query and billions of documents on the start_timestamp_dlng_doc_ind_sto part of the query, depending on the length of the time-interval (all of December: 5 billion hits, half of December: 2.5 billion hits, etc.). The number of unique no2_dlng_doc_ind_sto values varies from 100 to 800.

I have a test looping for an hour, using a different value for no1_dlng_doc_ind_sto every time but the same time-interval (about 2/3 of December) all the time. For every loop it does
  • 1 concurrent search
  • 2 concurrent searches
  • 4 concurrent searches
  • 8 concurrent searches
This is in order to see how things behave when doing the searches in parallel. no1_dlng_doc_ind_sto values are never reused - neither inside a loop nor between loops. The test keeps track of average response-times; a sketch of the test-loop is shown below.
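
Just to make it concrete, here is a minimal sketch (plain Java) of what such a test-loop could look like. nextUnusedNo1(), doSearch() and recordResponseTime() are hypothetical stand-ins for what the real test does - picking fresh no1 values, issuing the Solr request and feeding the per-concurrency-level averages
 
    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    
    // Sketch of the test-loop only - helper methods are stand-ins
    public class ConcurrencySearchTest {
    
        public static void main(String[] args) throws Exception {
            long deadline = System.currentTimeMillis() + 60 * 60 * 1000L; // loop for an hour
            while (System.currentTimeMillis() < deadline) {
                for (int concurrency : new int[] { 1, 2, 4, 8 }) {
                    ExecutorService pool = Executors.newFixedThreadPool(concurrency);
                    List<Callable<Long>> searches = new ArrayList<Callable<Long>>();
                    for (int i = 0; i < concurrency; i++) {
                        final long no1 = nextUnusedNo1();
                        searches.add(new Callable<Long>() {
                            public Long call() throws Exception {
                                long start = System.currentTimeMillis();
                                doSearch(no1); // same time-interval every time
                                return System.currentTimeMillis() - start;
                            }
                        });
                    }
                    for (Future<Long> responseTime : pool.invokeAll(searches)) {
                        recordResponseTime(concurrency, responseTime.get());
                    }
                    pool.shutdown();
                }
            }
        }
    
        private static long no1Counter = 0;
        static synchronized long nextUnusedNo1() { return no1Counter++; } // stand-in: never reuses a value
        static void doSearch(long no1) throws Exception { /* issue the Solr request for this no1 here */ }
        static void recordResponseTime(int concurrency, long millis) { /* feed the per-level averages */ }
    }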

Output from the test when doing the query as stated above (average response-times in milliseconds, here and throughout)
1 concurrent average response-time: 6549
2 concurrent average response-time: 6350
4 concurrent average response-time: 9492
8 concurrent average response-time: 15149

Hmmm, I am really not impressed. I fear that Solr/Lucene actually finds the 500-3000 document-ids and the several billion document-ids using the indices on no1_dlng_doc_ind_sto and start_timestamp_dlng_doc_ind_sto respectively, and then intersects the two sets. This might not be the most efficient way of finding the documents.
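
To illustrate the fear: conceptually (and very simplified - this is not actual Lucene code) an AND-query can be executed by intersecting two sorted streams of document-ids, one per clause, like this
 
    import java.util.ArrayList;
    import java.util.List;
    
    // Conceptual sketch only: intersecting two sorted streams of doc-ids.
    // The merge itself can skip ahead cheaply, but the range-side stream
    // (rangeDocs) has to be produced at all, and producing it costs time
    // proportional to its own billions of hits.
    class DocIdIntersection {
        static List<Integer> intersect(int[] no1Docs, int[] rangeDocs) {
            List<Integer> result = new ArrayList<Integer>();
            int i = 0, j = 0;
            while (i < no1Docs.length && j < rangeDocs.length) {
                if (no1Docs[i] == rangeDocs[j]) {
                    result.add(no1Docs[i]);
                    i++;
                    j++;
                } else if (no1Docs[i] < rangeDocs[j]) {
                    i++;
                } else {
                    j++;
                }
            }
            return result;
        }
    }

The merge is not the problem - the problem is that the work done is proportional to the billions of hits on the time-interval side, not to the 500-3000 hits we actually end up with.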

Trying with filter-query


I do not know exactly what filter-queries do, so let's just try moving the start_timestamp_dlng_doc_ind_sto clause over as a filter-query
 
q=no1_dlng_doc_ind_sto:(<some-number>)&
fq=start_timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])&
facet=true&facet.field=no2_dlng_doc_ind_sto&facet.limit=-1&facet.offset=0&facet.mincount=1&facet.zeros=false&start=0&rows=0&distrib=true

Output from the test when doing this search
1 concurrent average response-time: 35658
2 concurrent average response-time: 35087
4 concurrent average response-time: 45384
8 concurrent average response-time: 65576

Ahhhh, that definitely did not help. I did not dive further into that.

Doing time-interval filter and faceting on client


Well, we could try to just request the documents having the correct value for no1_dlng_doc_ind_sto, and then do the filtering on start_timestamp_dlng_doc_ind_sto and the faceting on the client side
 
q=no1_dlng_doc_ind_sto:(<some-number>)&fl=no2_dlng_doc_ind_sto,start_timestamp_dlng_doc_ind_sto&start=0&rows=10000&distrib=true
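
The client-side part of this approach could look something like the following SolrJ sketch (URL, no1 value and interval bounds are placeholders; the "faceting" is reduced to counting distinct no2 values, which is all I need)
 
    import java.util.HashSet;
    import java.util.Set;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;
    
    // Sketch: fetch all docs for a given no1, then filter on the
    // time-interval and collect the distinct no2 values client-side
    public class ClientSideFilterAndFacet {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder URL
            SolrQuery query = new SolrQuery("no1_dlng_doc_ind_sto:(12345)"); // placeholder no1
            query.setFields("no2_dlng_doc_ind_sto", "start_timestamp_dlng_doc_ind_sto");
            query.setStart(0);
            query.setRows(10000);
            query.set("distrib", "true");
    
            long timeStart = 1385856000000L, timeEnd = 1387500000000L; // placeholder interval
            Set<Long> uniqueNo2s = new HashSet<Long>();
            QueryResponse rsp = server.query(query);
            for (SolrDocument doc : rsp.getResults()) {
                long ts = (Long) doc.getFieldValue("start_timestamp_dlng_doc_ind_sto");
                if (ts >= timeStart && ts <= timeEnd) { // the time-interval filtering, done client-side
                    uniqueNo2s.add((Long) doc.getFieldValue("no2_dlng_doc_ind_sto"));
                }
            }
            System.out.println("Unique no2 values: " + uniqueNo2s.size());
            server.shutdown();
        }
    }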

Output from the test when doing this search
1 concurrent average response-time: 3104
2 concurrent average response-time: 4430
4 concurrent average response-time: 6623
8 concurrent average response-time: 10738

Better, but I am still not really impressed. I profiled this one a little and found that it is mainly "slow" because it fetches the values to be returned from "slow" stored-field storage (even though all the values to be retrieved are available as docvalues - maybe someone should change that)

Doing time-interval filter by iterating docvalues on result-candidates


I would really like to see how this performs if you, on the server side, just use the index on no1_dlng_doc_ind_sto to find candidate docs, and then for each of those documents fetch the value of start_timestamp_dlng_doc_ind_sto (using docvalues) to filter out the docs that do not fit the time-interval. I hacked SolrIndexSearcher to do exactly that
 
q=no1_dlng_doc_ind_sto:(<some-number>)&
facet=true&facet.field=no2_dlng_doc_ind_sto&facet.limit=-1&facet.offset=0&facet.mincount=1&facet.zeros=false&start=0&rows=0&distrib=true

You do not get the entire picture from that query - remember that documents are still filtered with respect to the time-interval, but that filtering is hardcoded and not expressed in the URL. Basically I am running the Solrs with this hacky patch, starting them with the following params
  • -DDO_HARDCODED_FILTERING=true
  • -DHARDCODED_FILTERING_START_TIME=<TIME_START>
  • -DHARDCODED_FILTERING_END_TIME=<TIME_END>
The query in question hits the getDocListAndSetNC.branch#2 path, and I really do not know if it works on any of the other paths. Please note that this patch exists only to test how the approach would perform; it in no way expresses my suggestion for how it should be implemented if introduced as a real feature of Solr/Lucene. The core of the idea is sketched below.
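
The patch itself is not worth showing, but the core of the idea can be sketched as a delegating Lucene 4.x Collector that, per segment, pulls the docvalues for start_timestamp and only forwards the candidate docs that fall inside the interval (this is a simplification of what I did, and it assumes every segment has docvalues for the field)
 
    import java.io.IOException;
    import org.apache.lucene.index.AtomicReaderContext;
    import org.apache.lucene.index.NumericDocValues;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;
    
    // Sketch: wrap the normal result Collector and use the docvalues for
    // start_timestamp to drop candidate docs outside the time-interval
    public class DocValuesRangeFilteringCollector extends Collector {
        private final Collector delegate;
        private final String field;
        private final long lower, upper;
        private NumericDocValues timestamps;
    
        public DocValuesRangeFilteringCollector(Collector delegate, String field, long lower, long upper) {
            this.delegate = delegate;
            this.field = field;
            this.lower = lower;
            this.upper = upper;
        }
    
        @Override
        public void setScorer(Scorer scorer) throws IOException {
            delegate.setScorer(scorer);
        }
    
        @Override
        public void setNextReader(AtomicReaderContext context) throws IOException {
            // assumes the field has docvalues in every segment (no null check)
            timestamps = context.reader().getNumericDocValues(field);
            delegate.setNextReader(context);
        }
    
        @Override
        public void collect(int doc) throws IOException {
            long ts = timestamps.get(doc);
            if (ts >= lower && ts <= upper) {
                delegate.collect(doc); // only docs inside the interval reach the real collector
            }
        }
    
        @Override
        public boolean acceptsDocsOutOfOrder() {
            return delegate.acceptsDocsOutOfOrder();
        }
    }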

Output from the test when doing the search like that
1 concurrent average response-time: 449
2 concurrent average response-time: 573
4 concurrent average response-time: 776
8 concurrent average response-time: 1198

This is much better.

Using the existing Solr/Lucene solution :-)


Along the way, while testing all this, I asked on the solr-user@lucene.apache.org mailing-list whether this feature already exists in Solr/Lucene and how to activate it. Yonik Seeley pointed me to this page and suggested the following way of doing the search
 
q=no1_dlng_doc_ind_sto:(<some-number>)&
fq={!frange cache=false cost=150 v=start_timestamp_dlng_doc_ind_sto l=<TIME_START> u=<TIME_END>}&
facet=true&facet.field=no2_dlng_doc_ind_sto&facet.limit=-1&facet.offset=0&facet.mincount=1&facet.zeros=false&start=0&rows=0&distrib=true

Output from the test when doing the search like that
1 concurrent average response-time: 394
2 concurrent average response-time: 448
4 concurrent average response-time: 678
8 concurrent average response-time: 969

This seems a little better than my hacky patch. I am not sure you can trust the small differences in the measured response-times, but it seems at least as performant as what I did.

Looking at the code, it seems that querying in the way shown above actually does nearly the same thing as my hacky patch - just as a nice, generalized, established and (hopefully) well-tested feature. It uses FunctionRangeQuery.FunctionRangeCollector, which implements PostFilter and gets its values from LongFieldSource.getValues using LongDocValues through the FieldCache. That is great!
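
For completeness, issuing that query from SolrJ could look like this (URL, no1 value and interval bounds are placeholders). Because cache=false and the cost is 100 or above, and FunctionRangeQuery implements PostFilter, the filter is only evaluated against documents that already match the no1 query
 
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    
    // Sketch: the frange post-filter variant via SolrJ
    public class FrangePostFilterQuery {
        public static void main(String[] args) throws Exception {
            HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1"); // placeholder URL
            SolrQuery query = new SolrQuery("no1_dlng_doc_ind_sto:(12345)"); // placeholder no1
            query.addFilterQuery("{!frange cache=false cost=150"
                    + " v=start_timestamp_dlng_doc_ind_sto"
                    + " l=1385856000000 u=1387500000000}"); // placeholder bounds
            query.setFacet(true);
            query.addFacetField("no2_dlng_doc_ind_sto");
            query.setFacetLimit(-1);
            query.setFacetMinCount(1);
            query.setRows(0);
            // with mincount=1 and limit=-1 the facet value-count is the number of unique no2's
            System.out.println("Unique no2 values: "
                    + server.query(query).getFacetField("no2_dlng_doc_ind_sto").getValueCount());
            server.shutdown();
        }
    }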

Outro


It has been a journey, but I am very happy with it. I feel like I got confirmation that my thoughts on the best strategy for carrying out such a query were right. I got some hands-on experience implementing it as a proof-of-concept. The fact that it is already there is just a bonus that will save me a lot of work. And I learned about more advanced ways to express queries.