Lucene Summit: Peel's Prairie Provinces

The next presentation at the Lucene Summit covered the trials and tribulations of the Peel’s Prairie Provinces project. Here are some notes:

Started with OCLC's SiteSearch
Outsourced digitization to OCLC and Olive, which deliver XML-based data that includes the pixel positions of words in the page images
Nice timeline view when doing a search (facet)
XML -> Cocoon -> XSL -> display - allows multiple display formats, including MARC output that other libraries can use to include the data (a way to give back to those who help)
For newspapers: XML -> Cocoon -> SVG -> Batik -> JPEG/PNG (martini project)
These workflows give some flexibility and i18n abilities but have some scalability and scope problems, especially the SVG portion
They index both forms of words that have special characters (with and without diacritics).
Store in an intermediate format, possibly to make it easier to index? It appeared to be more compressed than the native Olive data
Differences in newspapers versus monographs can cause difficulties. Newspapers tended to have poor metadata/images (no article data)
Grouped pages under their work, so a search with hits on multiple pages of a work brings up one hit listing those pages instead of multiple hits
Some of the facets could be done by pulling info as results came in, but this created a large performance hit
Moved to Solr with Cocoon - data mining, internationalization, better facet performance, and help with dynamic metadata
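
To make two of the notes above concrete (indexing words both with and without diacritics, and grouping page hits under their work), here is a minimal sketch in Python. It is not the project's actual code, which ran on Cocoon and later Solr. The tiny in-memory index, the function names, and the (work_id, page_number, text) record layout are all assumptions made for illustration.

```python
import unicodedata
from collections import defaultdict


def fold_diacritics(term):
    """Strip combining marks, e.g. 'métis' -> 'metis'."""
    decomposed = unicodedata.normalize("NFD", term)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_terms(text):
    """Emit both the original and the folded form of each lowercased token."""
    terms = set()
    for token in text.lower().split():
        terms.add(token)
        terms.add(fold_diacritics(token))
    return terms


def build_index(pages):
    """pages: iterable of (work_id, page_number, text); a hypothetical layout."""
    index = defaultdict(set)
    for work_id, page_no, text in pages:
        for term in index_terms(text):
            index[term].add((work_id, page_no))
    return index


def search(index, query):
    """Return {work_id: sorted page numbers}: one hit per work, not per page."""
    hits = defaultdict(list)
    for work_id, page_no in index.get(query.lower(), ()):
        hits[work_id].append(page_no)
    return {work: sorted(page_nos) for work, page_nos in hits.items()}


# Made-up sample pages, purely for illustration.
pages = [
    ("work-1", 1, "Les Métis de la Rivière Rouge"),
    ("work-1", 7, "Histoire des Métis"),
    ("work-2", 3, "Metis settlements"),
]
index = build_index(pages)
result = search(index, "metis")
```

Searching for the unaccented "metis" finds pages in both works, while the accented "Métis" only matches pages where the accent actually appears in the text. Either way, each work comes back once with its list of matching pages.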