Saturday, January 22, 2011

Payloads with Solr

I started looking at Solr again recently - the last time I used it (as a user, not a developer) was at CNET years ago, when Solr was being developed and deployed in-house. Reading the Solr 1.4 Enterprise Search Server book, I was struck by how far Solr (post version 1.3) has come in terms of features since I last saw it.

Of course, using Solr is not that hard - it's just an HTTP-based API. What I really wanted to do was understand how to customize it for my needs, and since I learn best by doing, I decided to solve for some scenarios that are common at work. One such scenario is concept searching. I have written about this before, using Lucene payloads to provide a possible solution. This time, I decided to extend that solution to run on Solr.

Schema

Turns out that a lot of this functionality is already available (at least in the SVN version) in Solr. The default schema.xml contains a field definition for payload fields, as well as analyzer chain definitions, which I simply copied. I decided to use a simple schema for my experiments, adapted from the default Solr schema.xml file. My schema file (plex, for PayLoad EXtension) is shown below:

<?xml version="1.0" encoding="UTF-8" ?>
<!-- Source: solr/example/plex/conf/schema.xml -->
<schema name="plex" version="1.3">
  <types>
    <fieldType name="string" class="solr.StrField" 
      sortMissingLast="true" omitNorms="true"/>
    <fieldType name="text" class="solr.TextField" 
      positionIncrementGap="100" autoGeneratePhraseQueries="true">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.StopFilterFactory"
          ignoreCase="true" words="stopwords.txt" 
          enablePositionIncrements="true"/>
        <filter class="solr.WordDelimiterFilterFactory" 
          generateWordParts="1" generateNumberParts="1" 
          catenateWords="1" catenateNumbers="1" 
          catenateAll="0" splitOnCaseChange="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" 
          protected="protwords.txt"/>
        <filter class="solr.PorterStemFilterFactory"/>
      </analyzer>
    </fieldType>
    <fieldType name="payloads" stored="false" indexed="true"
      class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.DelimitedPayloadTokenFilterFactory" 
          delimiter="$" encoder="float"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="id" type="string" indexed="true" stored="true" 
      required="true" /> 
    <field name="url" type="string" indexed="false" stored="true" 
      required="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="keywords" type="text" indexed="true" stored="true" 
      multiValued="true"/>
    <field name="concepts" type="payloads" indexed="true" stored="true"/>
    <field name="description" type="text" indexed="true" stored="true"/>
    <field name="author" type="string" indexed="true" stored="true"/>
    <field name="content" type="text" indexed="true" stored="false"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <defaultSearchField>content</defaultSearchField>
  <solrQueryParser defaultOperator="OR"/>
  <similarity class="org.apache.solr.search.ext.MyPerFieldSimilarityWrapper"/>
</schema>

Ignore the <similarity> tag towards the bottom of the file for now. The schema describes a record containing a payload field called "concepts" of type "payloads", which is defined, along with its analyzer chain, in the types section of this file.
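To make the payload encoding concrete, here is a small stand-alone sketch (an illustration, not Solr or Lucene source) of what the float encoder does with a token such as "123456$12.0": it mirrors Lucene's PayloadHelper.encodeFloat, turning the number after the $ delimiter into 4 bytes (IEEE-754 bits, big-endian) stored as that term's payload.

```java
// Illustrative stand-in for Lucene's PayloadHelper.encodeFloat: the float
// encoder turns the score after the $ delimiter into the 4 payload bytes
// (IEEE-754 bits, big-endian) stored with the term.
public class PayloadEncoding {

  // Encode a score like 12.0f into the payload bytes for a token "123456$12.0".
  public static byte[] encodeFloat(float score) {
    int bits = Float.floatToIntBits(score);
    return new byte[] {
      (byte) (bits >>> 24), (byte) (bits >>> 16),
      (byte) (bits >>> 8),  (byte) bits
    };
  }

  public static void main(String[] args) {
    byte[] payload = encodeFloat(12.0f);
    for (byte b : payload) {
      System.out.printf("%02x ", b & 0xFF);  // 41 40 00 00
    }
    System.out.println();
  }
}
```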

Indexing

For my experiment, I just cloned the examples/solr instance into examples/plex, and copied the schema.xml file into it. Then I started the instance with the following command from the solr/examples directory:

sujit@cyclone:example$ java -Dsolr.solr.home=plex -jar start.jar

On another terminal, I deleted the current records (none to begin with, but you will need to do this for testing iterations), then added two records with payloads.

sujit@cyclone:tmp$ curl http://localhost:8983/solr/update?commit=true -d \
  '<delete><query>*:*</query></delete>'
sujit@cyclone:tmp$ curl http://localhost:8983/solr/update \
  -H "Content-Type: text/xml" --data-binary @upload.xml

The contents of upload.xml are shown below - it's basically two records, followed by a commit call (to make the data show up on the search interface) and an optimize call (not mandatory).

<update>
  <add allowDups="false">
    <doc>
      <field name="id">1</field>
      <field name="url">http://www.myco.com/doc-1.html</field>
      <field name="title">My First Document</field>
      <field name="keywords">keyword_1</field>
      <field name="keywords">keyword_2</field>
      <field name="concepts">123456$12.0 234567$22.4</field>
      <field name="description">Description for My First Document</field>
      <field name="author">Pig Me</field>
      <field name="content">This is the house that Jack built. It was a mighty \
        fine house, but it was built out of straw. So the wicked old fox \
        huffed, and puffed, and blew the house down. Which was just as well, \
        since Jack built this house for testing purposes.
      </field>
    </doc>
    <doc>
      <field name="id">2</field>
      <field name="url">http://www.myco.com/doc-2.html</field>
      <field name="title">My Second Document</field>
      <field name="keywords">keyword_3</field>
      <field name="keywords">keyword_2</field>
      <field name="concepts">123456$44.0 345678$20.4</field>
      <field name="description">Description for My Second Document</field>
      <field name="author">Will E Coyote</field>
      <field name="content">This is the story of the three little pigs who \
        went to the market to find material to build a house with so the \
        wily old fox would not be able to blow their houses down with some \
        random huffing and puffing.
      </field>
    </doc>
  </add>
  <commit/>
  <optimize/>
</update>

Searching

At this point, we still need to verify that the payload fields were correctly added, and that we can search using the payloads. Our requirement is that a payload search such as "concepts:123456" would return all records where such a concept exists, in descending order of the concept score.

Solr does not support such a search handler out of the box, but it is fairly simple to build one, by creating a custom QParserPlugin extension, and attaching it (in solrconfig.xml) to an instance of solr.SearchHandler. The relevant snippet from solrconfig.xml is shown below:

  <!-- Request Handler to do payload queries -->
  <queryParser name="payloadQueryParser" 
    class="org.apache.solr.search.ext.PayloadQParserPlugin"/>
  <requestHandler name="/concept-search" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="defType">payloadQueryParser</str>
    </lst>
  </requestHandler>

Here's the code for the PayloadQParserPlugin (modeled after example code in FooQParserPlugin in the Solr codebase). It is just a container for the inner PayloadQParser class which parses the incoming query and returns a PayloadTermQuery. The parser has rudimentary support for AND-ing and OR-ing multiple payload queries. For payload fields, we want to use only the payload scores for scoring, so we specify that in the PayloadTermQuery constructor.

// Source: src/java/org/apache/solr/search/ext/PayloadQParserPlugin.java
package org.apache.solr.search.ext;

import org.apache.commons.lang.StringUtils;
import org.apache.lucene.index.Term;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.payloads.AveragePayloadFunction;
import org.apache.lucene.search.payloads.PayloadTermQuery;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.common.util.NamedList;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.search.QParser;
import org.apache.solr.search.QParserPlugin;

/**
 * Parser plugin to parse payload queries.
 */
public class PayloadQParserPlugin extends QParserPlugin {

  @Override
  public QParser createParser(String qstr, SolrParams localParams,
      SolrParams params, SolrQueryRequest req) {
    return new PayloadQParser(qstr, localParams, params, req);
  }

  @Override
  public void init(NamedList args) {
  }
}

class PayloadQParser extends QParser {

  public PayloadQParser(String qstr, SolrParams localParams, SolrParams params,
      SolrQueryRequest req) {
    super(qstr, localParams, params, req);
  }

  @Override
  public Query parse() throws ParseException {
    BooleanQuery q = new BooleanQuery();
    String[] nvps = StringUtils.split(qstr, " ");
    for (int i = 0; i < nvps.length; i++) {
      String[] nv = StringUtils.split(nvps[i], ":");
      if (nv[0].startsWith("+")) {
        q.add(new PayloadTermQuery(new Term(nv[0].substring(1), nv[1]), 
          new AveragePayloadFunction(), false), Occur.MUST);
      } else {
        q.add(new PayloadTermQuery(new Term(nv[0], nv[1]), 
          new AveragePayloadFunction(), false), Occur.SHOULD);
      }
    }
    return q;
  }
}
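To make the clause syntax explicit, here is a dependency-free sketch of just the string handling that parse() performs (the Clause holder class and its names are mine, for illustration only): whitespace-separated field:term pairs, with a leading + mapping to Occur.MUST and everything else to Occur.SHOULD.

```java
import java.util.ArrayList;
import java.util.List;

// Dependency-free sketch of the clause parsing done in PayloadQParser.parse().
public class PayloadQueryParsing {

  // Minimal holder for one parsed clause (illustrative, not a Solr class).
  public static class Clause {
    public final String field;
    public final String term;
    public final boolean required;  // true => Occur.MUST, false => Occur.SHOULD
    public Clause(String field, String term, boolean required) {
      this.field = field;
      this.term = term;
      this.required = required;
    }
  }

  public static List<Clause> parse(String qstr) {
    List<Clause> clauses = new ArrayList<Clause>();
    for (String nvp : qstr.trim().split("\\s+")) {
      String[] nv = nvp.split(":", 2);
      if (nv.length != 2) {
        continue;  // skip malformed clauses instead of failing
      }
      if (nv[0].startsWith("+")) {
        clauses.add(new Clause(nv[0].substring(1), nv[1], true));
      } else {
        clauses.add(new Clause(nv[0], nv[1], false));
      }
    }
    return clauses;
  }

  public static void main(String[] args) {
    for (Clause c : parse("concepts:123456 +concepts:234567")) {
      System.out.println(c.field + ":" + c.term + " required=" + c.required);
    }
  }
}
```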

To deploy these changes, I ran the following commands at the root of the Solr project, then restarted the plex instance using the java -jar start.jar command shown above.

sujit@cyclone:solr$ ant dist-war
sujit@cyclone:solr$ cp dist/apache-solr-4.0-SNAPSHOT.war \
  example/webapps/solr.war

At this point, we are able to search for concepts with payload queries, via the URL for the custom handler we defined in solrconfig.xml.

http://localhost:8983/solr/concept-search/?q=concepts:234567\
  &version=2.2&start=0&rows=10&indent=on

We still need to tell Solr what order to return the result records in. By default, Solr uses DefaultSimilarity - we need it to use the payload scores for payload queries and DefaultSimilarity for all others. Currently, however, Solr supports only a single Similarity for a given schema - to get around that, I built a similarity wrapper that delegates by field name, similar to the PerFieldAnalyzerWrapper on the indexing side. I believe LUCENE-2236 addresses this in a much more elegant way; I will make the necessary change when that becomes available. Here is the code for the similarity wrapper class.

// Source: src/java/org/apache/solr/search/ext/MyPerFieldSimilarityWrapper.java
package org.apache.solr.search.ext;

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.search.DefaultSimilarity;
import org.apache.lucene.search.Similarity;

/**
 * A delegating Similarity implementation similar to PerFieldAnalyzerWrapper.
 */
public class MyPerFieldSimilarityWrapper extends Similarity {

  private static final long serialVersionUID = -7777069917322737611L;

  private Similarity defaultSimilarity;
  private Map<String,Similarity> fieldSimilarityMap; 
  
  public MyPerFieldSimilarityWrapper() {
    this.defaultSimilarity = new DefaultSimilarity();
    this.fieldSimilarityMap = new HashMap<String,Similarity>();
    this.fieldSimilarityMap.put("concepts", new PayloadSimilarity());
  }
  
  @Override
  public float coord(int overlap, int maxOverlap) {
    return defaultSimilarity.coord(overlap, maxOverlap);
  }

  @Override
  public float idf(int docFreq, int numDocs) {
    return defaultSimilarity.idf(docFreq, numDocs);
  }

  @Override
  public float lengthNorm(String fieldName, int numTokens) {
    Similarity sim = fieldSimilarityMap.get(fieldName);
    if (sim == null) {
      return defaultSimilarity.lengthNorm(fieldName, numTokens);
    } else {
      return sim.lengthNorm(fieldName, numTokens);
    }
  }

  @Override
  public float queryNorm(float sumOfSquaredWeights) {
    return defaultSimilarity.queryNorm(sumOfSquaredWeights);
  }

  @Override
  public float sloppyFreq(int distance) {
    return defaultSimilarity.sloppyFreq(distance);
  }

  @Override
  public float tf(float freq) {
    return defaultSimilarity.tf(freq);
  }
  
  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    Similarity sim = fieldSimilarityMap.get(fieldName);
    if (sim == null) {
      return defaultSimilarity.scorePayload(docId, fieldName, 
        start, end, payload, offset, length);
    } else {
      return sim.scorePayload(docId, fieldName, 
        start, end, payload, offset, length);
    }
  }
}

As you can see, the methods that take a field name switch between the default similarity implementation and the field specific ones. We have only one of these, the PayloadSimilarity, the code for which is shown below:

// Source: src/java/org/apache/solr/search/ext/PayloadSimilarity.java
package org.apache.solr.search.ext;

import org.apache.lucene.analysis.payloads.PayloadHelper;
import org.apache.lucene.search.DefaultSimilarity;

/**
 * Payload Similarity implementation. Uses Payload scores for scoring.
 */
public class PayloadSimilarity extends DefaultSimilarity {

  private static final long serialVersionUID = -2402909220013794848L;

  @Override
  public float scorePayload(int docId, String fieldName,
      int start, int end, byte[] payload, int offset, int length) {
    if (payload != null) {
      return PayloadHelper.decodeFloat(payload, offset);
    } else {
      return 1.0F;
    }
  }
}
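For the reverse direction, here is a small stand-alone sketch (again an illustration, not the Lucene source) of what PayloadHelper.decodeFloat does with the bytes that scorePayload receives: reassemble the big-endian payload bytes into IEEE-754 bits and convert them back to the original float score.

```java
// Illustrative stand-in for Lucene's PayloadHelper.decodeFloat: rebuild the
// float score from the 4 big-endian payload bytes passed to scorePayload.
public class PayloadDecoding {

  public static float decodeFloat(byte[] payload, int offset) {
    int bits = ((payload[offset]     & 0xFF) << 24)
             | ((payload[offset + 1] & 0xFF) << 16)
             | ((payload[offset + 2] & 0xFF) << 8)
             |  (payload[offset + 3] & 0xFF);
    return Float.intBitsToFloat(bits);
  }

  public static void main(String[] args) {
    // The payload bytes for 12.0f are 41 40 00 00 (IEEE-754, big-endian)
    byte[] payload = {0x41, 0x40, 0x00, 0x00};
    System.out.println(decodeFloat(payload, 0)); // prints 12.0
  }
}
```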

Once again, we deploy the Solr WAR file with this new class and restart the plex instance, and this time we can verify that we get back the records in the correct order.

http://localhost:8983/solr/concept-search/?q=concepts:123456\
  &fl=*,score&version=2.2&start=0&rows=10&indent=on

We need a quick check to verify that queries other than concept queries don't use our PayloadSimilarity. Our example concept payload scores are in the range 1-100, while the scores in the results for the URL below are in the range 0-1, indicating that DefaultSimilarity was used for this query, which is what we wanted to happen.

http://localhost:8983/solr/select/?q=huffing\
  &fl=*,score&version=2.2&start=0&rows=10&indent=on

References

The following resources were very helpful while developing this solution.

21 comments (moderated to prevent spam):

Milan Agatonovic said...

Hi Sujit, glad to see your posts about search again. Have you maybe had a chance to try elasticsearch?
We are evaluating the possible search engine implementations for our projects and it looks to me that elasticsearch would give us more than solr.

Thanks and best regards,
Milan

Unknown said...

Great article, Sujit. Solr and Lucene are great technologies as stand alone, however, they do not solve one problem -- the problem of access rights (ACL) resolution. If I have implemented role based access in a set of database tables and the solr results must be filtered using rules in the tables, the SQL-Solr hybrid code becomes tedious. So, I was recently looking at SPYNX SE free text engine of MySQL.

Sujit Pal said...

Thanks Srinivas. I haven't implemented ACL with Solr, so I will defer to your experience with it. Speaking for myself, my first thought would have been to use a database based approach like you described - basically pull the user role and convert it to a QueryFilter object which is appended to the Solr call for this user. Don't know much about SPYNX SE, is it as full featured as Lucene/Solr?

Sujit Pal said...

Thanks Lemel, I did not know about elasticsearch (or Sphinx, I guess I /have/ been living under a rock for the last few years), thanks for pointing me to it. From my initial look at elasticsearch, it seems to me that Solr and elasticsearch each have things to recommend them (for our use case). I will definitely look at both (probably not as a possible replacement for Solr, there is too much management buy-in at this point), but as a source of good ideas :-).

Neil said...

Hey Sujit, are you going to the Lucene Revolution conference?

Sujit Pal said...

Nope, too expensive (my company doesn't sponsor). I'll get by, though, the lucene/solr mailing lists have some very knowledgeable and helpful folks :-).

milandobrota said...

I have done something very similar to this and the payload boosts get applied properly, however if I provide any query-time boosts, they get ignored. I would want to have payload (index-time) boosts multiplied with query-time boosts. Can you, or anybody else point me in the right direction?

Sujit Pal said...

Hi milandobrota, I tried this recently and it appears to be impossible to do with the current Similarity API. Here is the pointer to the discussion on the Java user list. My current implementation assumes that the payload score will be used as the (unboosted) Lucene score, which works great if useSpanScores is false. If I enable it, the other weighting factors come into play. I can turn off most of them, but the field boost seems to have some component other than the document boost I am setting, which I am not able to isolate.

milandobrota said...

Also it seems like the query time analyzer specified in schema.xml is being ignored. Do you know how this could be fixed?

Sujit Pal said...

I've always used the same analyzer for query and index side, so I haven't ever used the feature of having different analyzers defined for query/index, so not sure.

Ravish said...

Nice post. Interesting you rank documents by concept score that is statically defined at index time here. A hybrid approach that uses this score along with the Similarity score will be interesting to play around with. It could be quite useful.

Sujit Pal said...

Yes, we have some plans along these lines also. The current Similarity implementation is a bit restrictive, hopefully the SimilarityProvider (which I understand is coming out in Lucene 4) would allow us to do this.

Ankit said...

Hi Sujit, I am a newbie to Solr... I am stuck adding a record with a multivalued field to Solr. I am inserting data through the Lily API, and using Solr for searching purposes.

"sessionEndTime": "2013-02-13 14:53:47",
"screens": [
  {
    "screenId": 1,
    "startTime": "2013-02-13 14:53:43",
    "endTime": "2013-02-13 14:53:47"
  },
  {
    "screenId": 2,
    "startTime": "2013-02-13 14:53:43",
    "endTime": "2013-02-13 14:53:47"
  },
  {
    "screenId": 1,
    "startTime": "2013-02-13 14:53:43",
    "endTime": "2013-02-13 14:53:47"
  },
  .....
]

My question is: how do I define a field in schema.xml for a screens record having {screenId, startTime, endTime}? Waiting for your reply.

Thanx & Regards
Ankit Ostwal

Sujit Pal said...

Hi Ankit, I think the answer would be "it depends". For example, do you want to search by screen.startTime and screen.endTime? If no, then just store it as a multiValued structured (maybe comma-separated) string field, ie "${screenId},${startTime},${endTime}".

If yes, when there is a match, do you want to show the entire record or just the screen record? If you want to show the individual screen record, pull the parent metadata into each individual screen record like this:

${screenId}, ${startTime}, ${endTime}, ${session_end_time}

Another option is to split out the record into a single parent and a set of child records and use a Solr grouping query (still rusty about how this works, btw, so this may be incorrect) to get results.

If you want to show the entire record, a (slightly heavyweight option) could be to build a custom Field Type (similar to Point) which could allow searching on startTime and endTime, and then declare the screens as a multiValued field of your custom FieldType.
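As a concrete sketch of the first (comma-separated) option above - the class and method names here are hypothetical, just to illustrate the packing - each screen becomes one value of a multiValued string field:

```java
// Hypothetical codec for packing one screen record into a single value of a
// multiValued Solr string field, and unpacking it again at display time.
public class ScreenFieldCodec {

  // Builds "${screenId},${startTime},${endTime}" as suggested above.
  public static String pack(int screenId, String startTime, String endTime) {
    return screenId + "," + startTime + "," + endTime;
  }

  // Splits back into [screenId, startTime, endTime]; the limit of 3 keeps
  // any commas inside endTime intact.
  public static String[] unpack(String fieldValue) {
    return fieldValue.split(",", 3);
  }

  public static void main(String[] args) {
    String packed = pack(1, "2013-02-13 14:53:43", "2013-02-13 14:53:47");
    System.out.println(packed); // prints 1,2013-02-13 14:53:43,2013-02-13 14:53:47
  }
}
```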

Anonymous said...

Against what version of solr and lucene are your files compiled?

Do you remember this? I cant compile because of code and structure differences.

Sujit Pal said...

I'm almost certain it's Solr/Lucene 3.2 - that's the version against which I built all these customizations.

Anonymous said...

Thanks for that. I see you are quite involved in lucene/solr. Do you have an example for payloads in solr 4?

Anonymous said...

Thanks for your help. I got it working now but I have still one question. Is it possible to search for more than one concepts?

So for instance you have this document indexed:


<doc>
  <field name="id">1</field>
  <field name="url">http://www.myco.com/doc-1.html</field>
  <field name="title">My First Document</field>
  <field name="keywords">keyword_1</field>
  <field name="keywords">keyword_2</field>
  <field name="concepts">123456$12.0 234567$22.4</field>
  <field name="description">Description for My First Document</field>
  <field name="author">Pig Me</field>
  <field name="content">This is the house that Jack built. It was a mighty \
    fine house, but it was built out of straw. So the wicked old fox \
    huffed, and puffed, and blew the house down. Which was just as well, \
    since Jack built this house for testing purposes.
  </field>
</doc>
There you have the concepts of 123456$12.0 234567$22.4

Can you now query concepts:123456 and 234567 ?

Sujit Pal said...

We haven't moved to Lucene/Solr 4 yet, the latest version I have this stuff working on is 3.5 (someone else did the porting, and while I am not absolutely certain if there were changes involved, I wasn't asked any questions, so I suspect that there were none). I have plans to port the stuff up to 4, but no timeline yet.

To answer your second question, I implemented a custom query parser to do this from Solr. It is described here:
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html

However, from within Lucene (or from within a custom Solr handler or component) you can simply compose multiple PayloadTermQuery objects into a BooleanQuery.

Unknown said...

With includeSpanScore=true, the scores were getting altered - I observed that the tf was somehow getting changed and became less than 1, so I overrode the tf method in MyPerFieldSimilarityWrapper.java:

@Override
public float tf(float freq) {
  if (freq < 1) freq = 1;
  return defaultSimilarity.tf(freq);
}

After this change, I could use the query-time boosting coherently with payload index-time boosts.

Sujit Pal said...

Thank you Sharad, this is very useful, will try this out.