Wednesday, July 18, 2007

An update on Indexing MODS records with SOLR

My goal in the Aquifer project is to make the machinery behind the indexing as transparent as possible. Indexing is almost invariably a mapping between the source data and the index representation. Here, MODS records are mapped to a SOLR input schema using an XSL transform, mods-solr.xsl (visible here). The complexity of the MODS document is reduced to a flat list of fields, such as title, name, subject, etc. Several sub-elements may be concatenated to create each of these fields.
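To give a feel for the kind of mapping described above, here is a minimal XSLT 1.0 sketch. It is not the project's mods-solr.xsl: the choice of MODS elements, the use of the record position as the id, and the "; " separator between title and subtitle are all my assumptions for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; NOT the actual mods-solr.xsl. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mods="http://www.loc.gov/mods/v3"
    exclude-result-prefixes="mods">

  <xsl:output method="xml" indent="yes"/>

  <!-- Wrap all records in a single Solr <add> message. -->
  <xsl:template match="/">
    <add>
      <xsl:apply-templates select="//mods:mods"/>
    </add>
  </xsl:template>

  <!-- One Solr <doc> per MODS record. -->
  <xsl:template match="mods:mods">
    <doc>
      <!-- Assumption: the record's position stands in for a real identifier. -->
      <field name="id"><xsl:value-of select="position()"/></field>
      <xsl:apply-templates select="mods:titleInfo"/>
      <xsl:apply-templates select="mods:name/mods:namePart"/>
      <xsl:apply-templates select="mods:subject/mods:topic"/>
    </doc>
  </xsl:template>

  <!-- Concatenate title and subTitle into one "title" field. -->
  <xsl:template match="mods:titleInfo">
    <field name="title">
      <xsl:value-of select="normalize-space(mods:title)"/>
      <xsl:if test="normalize-space(mods:subTitle) != ''">
        <xsl:text>; </xsl:text>
        <xsl:value-of select="normalize-space(mods:subTitle)"/>
      </xsl:if>
    </field>
  </xsl:template>

  <xsl:template match="mods:namePart">
    <field name="name"><xsl:value-of select="normalize-space(.)"/></field>
  </xsl:template>

  <xsl:template match="mods:topic">
    <field name="subject-topic"><xsl:value-of select="normalize-space(.)"/></field>
  </xsl:template>

</xsl:stylesheet>
```

Each template is one mapping rule, so anyone who can read XSL can see which MODS element feeds which index field.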
Here is an example of what goes into SOLR:

<add>
<doc>
<field name="id">1</field>
<field name="title">History of the Saginaw Valley,; its resources, progress and business interests</field>
<field name="name">Fox, Truman B</field>
<field name="type-of-resource">text</field>
<field name="origin">, Daily Courier Steam Job Print1868</field>
<field name="language"></field>
<field name="physical-description">text/xml;image/tiff; 80 p. 20 cm.; reformatted digital</field>
<field name="note">Advertising matter interspersed.</field>
<field name="subject-topic">History</field>
<field name="subject-geographic">Saginaw River Valley (Mich.)</field>
<field name="url">http://name.umdl.umich.edu/0885076</field>
<field name="access-condition">Where applicable, subject to copyright. Other restrictions on distribution may apply. Please go to http://www.umdl.umich.edu/ for more information.</field>
</doc>
</add>

There are problems: some fields, such as title and origin, pick up extraneous commas from the concatenation rules. A more interesting question is whether fields like access-condition should be indexed at all. But at least with the XSL, the rules are out in the open rather than buried within Java or some other language.
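One possible way to avoid the stray separators, sketched here in XSLT 1.0, is to select only the non-empty parts and emit the separator between them rather than after each one. The elements chosen for the origin field (placeTerm, publisher, dateIssued) and the ", " separator are my assumptions; the real transform may combine different elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch only; element choice and separator are assumptions. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:mods="http://www.loc.gov/mods/v3"
    exclude-result-prefixes="mods">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <add>
      <doc>
        <xsl:apply-templates select="//mods:mods/mods:originInfo"/>
      </doc>
    </add>
  </xsl:template>

  <xsl:template match="mods:originInfo">
    <field name="origin">
      <!-- Keep only parts that contain text, in document order. -->
      <xsl:for-each select="(mods:place/mods:placeTerm
                             | mods:publisher
                             | mods:dateIssued)[normalize-space(.) != '']">
        <!-- Separator goes BEFORE every part except the first,
             so an empty or missing part never leaves a dangling comma. -->
        <xsl:if test="position() &gt; 1">
          <xsl:text>, </xsl:text>
        </xsl:if>
        <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
    </field>
  </xsl:template>

</xsl:stylesheet>
```

The same pattern would apply to any field built by concatenation; only the select expression and the separator change.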

Further details of how SOLR uses this input are controlled by its schema.xml file (visible here).
The fields created by the transform must be defined there, along with copy information that allows a field to be indexed in more than one way. For example, the line

<copyField source="subject-hierarchic" dest="subject"/>

causes a hierarchical subject to also be indexed as a generic subject. More lines like this can cause all text to be indexed under a single default field, which is used when users do not specify a field in a query. Additional configuration controls how fields are parsed and how sentences and punctuation are handled, at the document or field level.
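As a rough illustration of these pieces fitting together, here is a small schema.xml fragment in the style of the Solr 1.x example schema. It is not the project's actual file: the schema name, the catch-all field called text, the field types, and the analyzer chain are my assumptions, and newer Solr releases may configure some of this differently.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative fragment only; names and types are assumptions. -->
<schema name="aquifer-mods-sketch" version="1.1">
  <types>
    <!-- Untokenized type for identifiers. -->
    <fieldType name="string" class="solr.StrField"/>
    <!-- Tokenized, lowercased type: this is where parsing and
         punctuation handling are decided for a field. -->
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>

  <fields>
    <!-- Fields produced by the transform. -->
    <field name="id" type="string" indexed="true" stored="true"/>
    <field name="title" type="text" indexed="true" stored="true"/>
    <field name="subject-topic" type="text" indexed="true" stored="true" multiValued="true"/>
    <!-- Targets that are filled only by copyField. -->
    <field name="subject" type="text" indexed="true" stored="false" multiValued="true"/>
    <field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
  </fields>

  <uniqueKey>id</uniqueKey>
  <!-- Field searched when a query names no field. -->
  <defaultSearchField>text</defaultSearchField>

  <!-- copyField rules are not chained, so each source field is
       copied to the catch-all field directly. -->
  <copyField source="subject-topic" dest="subject"/>
  <copyField source="title" dest="text"/>
  <copyField source="subject-topic" dest="text"/>
</schema>
```

With something like this in place, a query with no field prefix would search the catch-all field, while a query such as subject:History would search only the generic subject field.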

These files need more work, but they are not a bad starting point.
