Lucene Summit: Elsevier
Art was kind enough to let me attend a Lucene Summit he was holding in Windsor, and since it was close by I went. The summit started with brief introductions to some of the problems Knowledge Ontario (the organizer) is having with its digital search projects. Most were using Lucene for full-text indexing, and the talk revolved around related problems and products that seem to make things easier (Flamenco, Solr, etc.).
Elsevier, University of Toronto and Scopus
The first real session was hosted by some of Elsevier’s skunkworks developers. It was really nice to see that a vendor like Elsevier has a developer group (although small) that gets into quick prototyping and idea generation. This presentation was on the adaptation of Scopus to include local results. Here are some highlights:
- The project focused on distributed search, mixed content types and adaptation of the Scopus UI
- They looked at various search engines, including Mark Logic (an XML database), but settled on Lucene due to its low overhead and other advantages. Since it is open source, they were able to modify quite a few things to work the way they wanted
- The distributed indexes included a local index (built by the university) and the Scopus index (built by Elsevier). This let Elsevier stay out of the local process and leave it to the university. Even with servers in Toronto and Ohio, latency was low
- They added extended boolean. The query parser moves outwards from any boolean word it finds, grouping the words on either side, in the hope of parsing the query better. The example given was a user typing "Boston Red Sox and Toronto Blue Jays" (without quotes), who likely did not want the query to be read as Sox AND Toronto
- They handle any special operators that mean something to the engine, for example *, ?, etc.
- They start by checking whether it's a proper query with special operators. If not, they check for boolean words, and if it does not appear to be structured properly they move on to parsing it as natural language
- They added the ability to do pure NOT queries (everything that isn't something)
- The default operator is AND
- They added a proximity boost that boosts any documents with the query words near each other. The examples given appeared to use a proximity distance of 3 (~3).
- Example: Boston Red Sox --> (Boston AND Red AND Sox) OR (Boston Red)~3 OR (Red Sox)~3
- They tokenized fields that could have multiple values, for example author names. This was to prevent a record from matching when the query combines the first name of one author with the last name of another (example)
- They stated they plan on releasing their Lucene changes as open source, though there was no specific timeline
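To make the extended boolean and proximity boost ideas above a bit more concrete, here is a minimal Python sketch of the two query rewrites as I understood them. This is not Elsevier's code (which had not been released); the function names, the set of boolean words and the default distance of 3 are my own assumptions, and the output simply mirrors the notation used in the notes above. In stock Lucene query syntax a proximity search is written as a quoted phrase followed by ~N, so the parenthesized form here is illustrative only.

```python
# Hypothetical sketch of the two rewrites described in the talk.
# All names here are made up for illustration; this is not Elsevier's parser.

BOOLEAN_WORDS = {"and", "or", "not"}  # assumed list of boolean words


def group_around_operators(query: str) -> str:
    """Group the words on each side of a boolean word so the operator
    applies to whole groups rather than only to its two neighbours."""
    parts = []
    current = []
    for word in query.split():
        if word.lower() in BOOLEAN_WORDS:
            if current:
                parts.append("(" + " ".join(current) + ")")
                current = []
            parts.append(word.upper())
        else:
            current.append(word)
    if current:
        parts.append("(" + " ".join(current) + ")")
    return " ".join(parts)


def expand_with_proximity(query: str, distance: int = 3) -> str:
    """Rewrite a plain multi-word query as an AND clause plus one
    proximity clause per pair of adjacent words."""
    terms = query.split()
    if len(terms) < 2:
        return query.strip()
    strict = "(" + " AND ".join(terms) + ")"
    pairs = [f"({a} {b})~{distance}" for a, b in zip(terms, terms[1:])]
    return " OR ".join([strict] + pairs)


if __name__ == "__main__":
    print(group_around_operators("Boston Red Sox and Toronto Blue Jays"))
    print(expand_with_proximity("Boston Red Sox"))
```

Documents matching only the AND clause still come back, while those that also satisfy a proximity clause match more of the query and so should score higher, which is presumably where the boost comes from.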
The presentation was interesting, though probably most of all to those working with distributed indexes, as the project ran into quite a few problems, including local vs. global ranking and indexing differences.