Thursday, October 4, 2007
Chick bangs his head against a Date
With apologies to Mr Twain, I have always said, "I'd sooner date two jealous psychopaths than one MODS record." Nothing lately has made me feel any differently. The question is how to take the date information from a MODS record and get it indexed usefully into an acts_as_solr SOLR instance. Our approach is two-fold: translate the date information in the originInfo element, then run the result through the software package developed by the CDL that implements the erstwhile TEMPER date analysis standard.
Here is an example. Say I have a record dated 1900-1920. I want to index this in such a way that if a user chooses to find records between 1905 and 1910, it finds this one. So what do I do? I use the TEMPER software to expand the date range into a separate facet for each year. This wildly duplicates the records when faceted by year, but it guarantees that the record will be found on any overlap with the desired date range.
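To make the expansion idea concrete, here is a minimal Python sketch. It is not the CDL software and says nothing about how that package works internally; the function names expand_years and matches are hypothetical and exist only for illustration. It turns a start and end year into one value per year and then checks whether a query range overlaps the record.

```python
def expand_years(start, end):
    """Return one facet value per year in the inclusive range start..end."""
    if end < start:
        raise ValueError("end year precedes start year")
    return list(range(start, end + 1))


def matches(record_years, query_start, query_end):
    """True if any indexed year falls inside the inclusive query range."""
    return any(query_start <= year <= query_end for year in record_years)


# A record dated 1900-1920 is indexed under 21 separate years.
record_years = expand_years(1900, 1920)
print(len(record_years))                  # 21
print(matches(record_years, 1905, 1910))  # True
print(matches(record_years, 1921, 1930))  # False
```

The cost is visible in the first print: one record occupies 21 year facets, which is the duplication described above.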
And you are probably asking how I got the date range. Take a look at this xsl snippet, which is taken from an xsl transform within the context of an originInfo element:
<xsl:element name="doc_date">
  <xsl:for-each select="dateIssued|dateCreated|copyrightDate">
    <!-- xsl:value-of must be an empty element, so the separator logic follows it -->
    <xsl:value-of select="."/>
    <xsl:choose>
      <!-- a start date is joined to the next date with a dash -->
      <xsl:when test="@point = 'start'">
        <xsl:text>-</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text> and </xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
</xsl:element>
This code concatenates all the desired date elements and even nicely adds a dash after a date with the point='start' attribute (indicating that it begins a range, hopefully followed by the ending date in the next element). Every other date is followed by " and ".
E.G.
<originInfo>
  <dateIssued>[no date recorded on caption card]</dateIssued>
  <dateIssued encoding="marc" point="start">1900</dateIssued>
  <dateIssued encoding="marc" point="end">1920</dateIssued>
</originInfo>
becomes
<doc_date>[no date recorded on caption card] and 1900-1920 and </doc_date>
Which is then nicely parsed into a range by the TEMPER software. I'd love to hear better theories on how to do this. Really! And I haven't even done all the tests on circa yet.
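For anyone who wants to see roughly what "parsed into a range" means, here is a small Python sketch under loud assumptions: it is a plain regular expression, not the CDL package, and the names YEAR_RANGE and year_ranges are hypothetical. It only recognises four-digit years, optionally joined by a dash, and makes no attempt at circa dates or anything else the real software may handle.

```python
import re

# Assumed pattern: a four-digit year, optionally followed by a dash and a
# second four-digit year. This is an illustration, not the CDL rules.
YEAR_RANGE = re.compile(r"\b(\d{4})(?:\s*-\s*(\d{4}))?\b")


def year_ranges(doc_date):
    """Return (start, end) year tuples found in a doc_date string."""
    ranges = []
    for match in YEAR_RANGE.finditer(doc_date):
        start = int(match.group(1))
        # A lone year is treated as a one-year range.
        end = int(match.group(2)) if match.group(2) else start
        ranges.append((start, end))
    return ranges


text = "[no date recorded on caption card] and 1900-1920 and "
print(year_ranges(text))  # [(1900, 1920)]
```

The bracketed note and the trailing " and " are simply ignored, which is why the crude concatenation in the XSL above is good enough as input.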
3 comments:
What's wrong with duplicating the record in each of the year facets? Is there a technical cost? For a record with multiple values for subject or author, I wouldn't think twice about having the record appear under both facet values. Same goes for year in my book.
FYI: TEMPER is neither a standard nor a piece of software. From the IETF Internet-Draft: "TEMPER (TEMPoral Enumerated Ranges) is a simple date and time syntax for representing points, lists, and ranges of timestamps." The software in question is CDL's Date Normalization Utility, which can look in documents for possible dates, interpret them according to its rules, and output the dates in the TEMPER date format.
I find it unsatisfactory because I can imagine finding a list of records which I then facet. Each facet contains the same number of records as the original list, and I get the same set of records regardless of which facet I pick. I guess I should find out how likely that is, but I know from my analysis of the point='start' records that there are quite a few records with date ranges.
I just noticed the picture. Looks more like a fig to me. ;-)