The next presentation at the Lucene Summit covered the trials and tribulations of the Peel’s Prairie Provinces project. Here are some notes:

  • Started with OCLC's SiteSearch
  • Outsourced digitization to OCLC and Olive, which returns XML-based data with the pixel positions of words in the images
  • Nice timeline view when doing a search (facet)
  • XML -> Cocoon -> XSL -> display: allows multiple display formats, including MARC output that other libraries can use to include the data (a way to give back to those who help)
  • For newspapers: xml -> cocoon -> svg -> batik -> jpeg/png (martini project)
  • These workflows give some flexibility and i18n abilities but have some scalability and scope problems, especially the svg portion
  • They index both forms of words that have special characters (with and without diacritics).
  • Store the data in an intermediate format, apparently to make it easier to index (?). It appeared to be more compact than the native Olive data
  • Differences between newspapers and monographs can cause difficulties. Newspapers tended to have poor metadata and images (no article data)
  • Grouped pages under a work, so a search with hits on multiple pages of a work brings up one hit listing multiple pages instead of multiple hits
  • Some of the facets could be computed by pulling info as results came in, but this caused a large performance hit
  • Moved to Solr with Cocoon: this gave data mining, internationalization, and better facet performance, and helped with dynamic metadata
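
The note above about indexing both forms of words with special characters is easy to picture in code. The talk did not show how the project does this, so what follows is only a minimal sketch in Python using the standard `unicodedata` module. The function names `fold_diacritics` and `index_forms` are my own, hypothetical ones, not anything from the Peel project or from Lucene.

```python
import unicodedata


def fold_diacritics(word):
    """Strip combining marks so that, e.g., 'Métis' becomes 'Metis'."""
    # NFD splits an accented letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", word)
    # Drop the combining marks and keep everything else.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def index_forms(word):
    """Return the terms to index for one word (hypothetical helper).

    Always includes the original form; adds the folded form only when
    it differs, so unaccented words are not indexed twice.
    """
    original = unicodedata.normalize("NFC", word)
    folded = fold_diacritics(original)
    if folded == original:
        return [original]
    return [original, folded]


if __name__ == "__main__":
    for w in ["Métis", "prairie", "Québec"]:
        print(w, "->", index_forms(w))
```

Indexing both terms lets a search typed with or without the accent match the same page, at the cost of a somewhat larger index. Note that this approach only handles characters that decompose into a base letter plus combining marks; letters such as "ø" or "ß" would need an explicit mapping.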