Lucene Summit: Peel’s Prairie Provinces
The next presentation at the Lucene Summit covered the trials and tribulations of the Peel’s Prairie Provinces project. Here are some notes:
- Started with OCLC's SiteSearch
- Outsourced digitization to OCLC and Olive, which returns XML-based data that includes the pixel positions of words in the page images
- Nice timeline view when doing a search (facet)
- XML -> Cocoon -> XSL -> display. This allows multiple display formats, including MARC output that other libraries can use to include the data (a way to give back to those who help)
- For newspapers: XML -> Cocoon -> SVG -> Batik -> JPEG/PNG (martini project)
- These workflows give some flexibility and i18n abilities but have scalability and scope problems, especially the SVG portion
- They index both forms of words that have special characters (with and without diacritics).
- Data is stored in an intermediate format, apparently to make it easier to index (I was not sure about this). It appeared to be more compact than the native Olive data
- Differences in newspapers versus monographs can cause difficulties. Newspapers tended to have poor metadata/images (no article data)
- Pages are grouped under their work, so a search with hits on multiple pages of a work brings up one hit listing those pages instead of multiple separate hits
- Some of the facets could be done by pulling info as results came in, but this created a large performance hit
- Moved to Solr with Cocoon, which brought data mining, internationalization, and better facet performance, and helped with dynamic metadata
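
Two of the ideas in these notes are easy to picture in code: indexing a word both with and without its diacritics, and collapsing page-level hits into one result per work. The following is only a rough sketch in plain Python, not the project's actual code (which was built on Lucene/Solr and Cocoon). The function names `index_forms` and `group_hits_by_work`, and the `(work_id, page)` hit shape, are my own assumptions for illustration.

```python
import unicodedata
from collections import OrderedDict


def index_forms(word):
    """Return the forms of a word to index (hypothetical helper).

    The original form is always kept; a diacritic-free form is added
    only when it differs from the original.
    """
    original = unicodedata.normalize("NFC", word)
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if stripped == original:
        return [original]
    return [original, stripped]


def group_hits_by_work(page_hits):
    """Collapse page-level hits into one result per work (hypothetical helper).

    page_hits is an iterable of (work_id, page) pairs, an assumed shape.
    Works are returned in the order they were first seen.
    """
    grouped = OrderedDict()
    for work_id, page in page_hits:
        grouped.setdefault(work_id, []).append(page)
    return list(grouped.items())


if __name__ == "__main__":
    print(index_forms("Québec"))
    print(group_hits_by_work([("w1", 3), ("w2", 1), ("w1", 7)]))
```

In a real Lucene or Solr setup, both steps would more likely be handled by the analysis chain and the search engine's own grouping support than by hand-written code like this.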