Thursday, October 4, 2007

Chick bangs his head against a Date


With apologies to Mr Twain, I have always said, "I'd sooner date two jealous psychopaths than one MODS record." Nothing lately has made me feel any differently. The question is how to take the date information from a MODS record and get it indexed usefully into an acts_as_solr SOLR instance. Our approach is two-fold: translate the date information in the originInfo element, then run the result through the software package developed by the CDL that implements the erstwhile TEMPER date analysis standard.

Here is an example. Say I have a record dated 1900-1920. I want to index this in such a way that if a user chooses to find records between 1905 and 1910, it finds this one. So what do I do? I use the TEMPER software to expand the date range into a separate facet value for each year. This wildly duplicates the record when faceting by year, but it guarantees that the record will be found on any overlap with the desired date range.
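To make the idea concrete, here is a rough Python sketch of the per-year expansion and the overlap check it enables. This is not the TEMPER code itself; the function names `expand_years` and `matches` are made up for illustration, and I am assuming plain integer years.

```python
def expand_years(start, end):
    """Return one facet value per year in the inclusive range start..end."""
    if end < start:
        raise ValueError("end year precedes start year")
    return list(range(start, end + 1))


def matches(record_years, query_start, query_end):
    """True if any indexed year falls inside the inclusive query range."""
    return any(query_start <= year <= query_end for year in record_years)


# A record dated 1900-1920 is indexed under every year from 1900 to 1920.
record_years = expand_years(1900, 1920)

print(matches(record_years, 1905, 1910))  # True: the ranges overlap
print(matches(record_years, 1921, 1930))  # False: no shared year
```

The cost is the one mentioned above: a twenty-year span turns into twenty-one facet values for a single record.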

And you are probably asking how I got the date range. Take a look at this XSL snippet, taken from a transform that runs within the context of an originInfo element:

<xsl:element name="doc_date">
  <xsl:for-each select="dateIssued|dateCreated|copyrightDate">
    <!-- emit the date text, then a separator chosen by the point attribute -->
    <xsl:value-of select="."/>
    <xsl:choose>
      <xsl:when test="@point = 'start'">
        <xsl:text>-</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text> and </xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
</xsl:element>

This transform concatenates all the desired date elements and even nicely adds a dash after any date with the point='start' attribute (indicating that it begins a range, hopefully followed by the ending date in the next element). For example:
<originInfo>
  <dateIssued>[no date recorded on caption card]</dateIssued>
  <dateIssued encoding="marc" point="start">1900</dateIssued>
  <dateIssued encoding="marc" point="end">1920</dateIssued>
</originInfo>

becomes

<doc_date>[no date recorded on caption card] and 1900-1920 and </doc_date>

This is then nicely parsed into a range by the TEMPER software. I'd love to hear better theories on how to do this. Really! And I haven't even run all the tests on circa dates yet.
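For anyone who wants to play with the doc_date string without the TEMPER package, here is a minimal Python stand-in that pulls a year range out of it. To be clear, this is my own simplification and not how TEMPER works: the regular expressions and the `extract_year_range` name are assumptions for illustration, it only understands four-digit years, and it does nothing useful with circa dates.

```python
import re

# Assumed patterns: a "YYYY-YYYY" range, or failing that a lone four-digit year.
YEAR_RANGE = re.compile(r"\b(\d{4})\s*-\s*(\d{4})\b")
SINGLE_YEAR = re.compile(r"\b(\d{4})\b")


def extract_year_range(doc_date):
    """Return (start, end) found in a doc_date string, or None if no year appears."""
    found = YEAR_RANGE.search(doc_date)
    if found:
        return int(found.group(1)), int(found.group(2))
    found = SINGLE_YEAR.search(doc_date)
    if found:
        year = int(found.group(1))
        return year, year  # a single year is treated as a one-year range
    return None


print(extract_year_range("[no date recorded on caption card] and 1900-1920 and "))
# (1900, 1920)
```

The free-text part ("[no date recorded on caption card]") is simply skipped because it contains no four-digit year, which is about the level of sophistication you should expect from this sketch.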