Monday, December 17, 2007

Only the coolest searching feature ever...

Have you ever been trying to find something using search and, no matter what terms you try, you end up with the same records over and over again? Frustrating! What if you could check a box that said "Only show me things you haven't shown me before" (or maybe something a little shorter) and just that would happen? Each search would only show you things that you haven't seen before.
Well now the Aquifer portal has it.
Registered users of www.dlfaquifer.org who have enabled Remember my searches and records in their profile will see a new check box, Unseen only, which, when checked, causes all records presented to be ones you have not seen before. This feature is a work in progress; I'm still taking feedback on it and still smoothing off the edges. The ability to set marks (temporal or otherwise) would make this feature even better, but that is currently just a gleam in the developer's eye.
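For the curious, the idea under the hood is just a filter over the result set. Here is a rough sketch, with the model and column names made up for illustration (they are not the actual portal schema):

# Rough sketch of the "Unseen only" filter; SeenRecord and its columns
# are hypothetical names, not the real tables.
def unseen_only(user, records)
  seen_ids = SeenRecord.find(:all,
    :conditions => ['user_id = ?', user.id]).map { |s| s.record_id }
  records.reject { |record| seen_ids.include?(record.id) }
end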
A new look and feel for the user profile system is part of this release. In addition to their search history, users can see records they have seen as search results and records they have selected. A way of clearing this history has also been provided.

Please try it. If you couldn't think of a reason to register before, now you have one.

SRU Access ready for testing

SRU access to the Aquifer Portal is up and ready for testing. explain and searchRetrieve are working. Basic errors are caught and returned in a diagnostics element. The base address of the service is http://www.dlfaquifer.org/sru. Please test, and test again.
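If you want to poke at it from code, here is a minimal sketch using open-uri; the query and parameter values are just examples, not a list of supported indexes:

require 'open-uri'

base = 'http://www.dlfaquifer.org/sru'

# Fetch the explain record, then run an example searchRetrieve.
explain = open(base + '?operation=explain&version=1.1').read
results = open(base + '?operation=searchRetrieve&version=1.1' +
               '&query=michigan&maximumRecords=10').read
puts results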

I shook it down with some help from Ralph LeVan and an SRU test website he created (http://alcme.oclc.org/srw/SRUServerTester.html). Our service now passes most of the tests; we are failing a couple that have to do with error conditions that I have not quite figured out how to detect. Instead we just return zero hits.

The explain packet does not list all the potential indices yet, but I am holding off on that while Tom and I look into a more generic approach to the MODS SOLR interaction.

Tuesday, November 27, 2007

Testing is good


A lot of irons are in the fire right now, but after doing a medium-sized refactoring job this afternoon, I just have to say how happy I am that I have been assimilated into Ruby on Rails unit testing. I know testing has been available on every system I have ever used, but I have never found it so easy to get off the dime, do the testing, and stick with it. I now do a significant portion of my debugging while developing my tests. Having already built a pretty good suite for the forthcoming Ruby CQL parser, converting the xml generation to use Builder went really fast, despite some vexing debugging problems in RadRails (gotta figure that out later).
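For what it's worth, the Builder-based generation looks roughly like the sketch below; the field names are just illustrative stand-ins for our SOLR doc format:

require 'rubygems'
require 'builder'

# Build a SOLR <add><doc> packet with Builder; fields shown are examples.
xml = Builder::XmlMarkup.new(:indent => 2)
xml.add do
  xml.doc do
    xml.field('1', :name => 'id')
    xml.field('History of the Saginaw Valley', :name => 'title')
  end
end
puts xml.target!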
The picture is the eastern sky I saw this morning.

Thursday, October 4, 2007

Chick bangs his head against a Date


With apologies to Mr Twain, I have always said, "I'd sooner date two jealous psychopaths than one MODS record." Nothing lately has been making me feel any differently. The question is how to take the date information from a MODS record and get it indexed usefully into an acts_as_solr SOLR instance. Our approach is two-fold: translate the date information in the originInfo element, and run the result through the software package developed by the CDL that implements the erstwhile TEMPER date analysis standard.

Here is an example. Say I have a record dated 1900-1920. I want to index this in such a way that if a user chooses to find records between 1905 and 1910 it finds this one. So what do I do? I use the TEMPER software to expand the date range into a separate facet for each year. This wildly duplicates the records when faceted by year, but it guarantees the record will be found on any overlap with the desired date range.
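Here is a minimal sketch of the expansion idea. This is not the CDL TEMPER code, just an illustration of turning a parsed range into per-year facet values:

# Expand a parsed date range into one facet value per year, so a record
# dated 1900-1920 matches any query range that overlaps it.
def year_facets(start_year, end_year)
  (start_year..end_year).map { |year| year.to_s }
end

year_facets(1900, 1920).first(3)  # => ["1900", "1901", "1902"]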

And you are probably asking how I got the date range. Take a look at this xsl snippet, which is taken from an xsl transform and runs within the context of an originInfo element:

<xsl:element name="doc_date">
  <xsl:for-each select="dateIssued|dateCreated|copyrightDate">
    <xsl:value-of select="."/>
    <xsl:choose>
      <xsl:when test="@point = 'start'">
        <xsl:text>-</xsl:text>
      </xsl:when>
      <xsl:otherwise>
        <xsl:text> and </xsl:text>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:for-each>
</xsl:element>

This code concatenates all desired date elements and even nicely adds a dash after a date with the point='start' attribute (indicating that it begins a range, hopefully followed by the ending date in the next element).
For example,
<originInfo>
<dateIssued>[no date recorded on caption card]</dateIssued>
<dateIssued encoding="marc" point="start">1900</dateIssued>
<dateIssued encoding="marc" point="end">1920</dateIssued>
</originInfo>

becomes
"<doc_date>[no date recorded on caption card] and 1900-1920 and</doc_date>

Which is then nicely parsed into a range by the TEMPER software. I'd love to hear better theories on how to do this. Really! And I haven't even done all the tests on circa yet.

Tuesday, August 7, 2007

Sorting and indexing

I have made a few changes to the portal, mostly adding support for date, title and score sorting
as well as a few minor wording changes per Katherine. I still have not had a chance to go over
Advanced Searching as much as I would like, sometimes it seems to make sense, sometimes it does not.

I am rebuilding the indexes with some date sorting. I need to go over the dates used for sorting. Practically, I think
we need to have one sortable date per record, and I need an algorithm for picking that date.
I think the logic should be: go through the various date fields in the originInfo in some agreed-upon order,
first looking for one with a keyDate attribute and w3cdtf encoding;
failing that, look again through the list for any date with a keyDate value;
failing that, look again through the list for the first date.
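In code the selection might look something like this sketch; the hash keys are assumptions about how the originInfo dates get pulled out, not the final implementation:

# Sketch of the proposed sortable-date selection; 'dates' is assumed to be an
# array of hashes like {:value => '1900', :key_date => true, :encoding => 'w3cdtf'}.
def pick_sort_date(dates)
  dates.detect { |d| d[:key_date] && d[:encoding] == 'w3cdtf' } ||
    dates.detect { |d| d[:key_date] } ||
    dates.first
end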

I am running the date I get from above (actually at the moment just dateIssued) through a combination of
a standard Java date parser and, failing that, through the CDL TEMPER library that attempts to normalize the date.
Mostly from that I get some kind of date range. I have arbitrarily chosen the middle date of that range as the sorting
date for that record.

Any comments on the above process would be appreciated.

Another struggle was getting the acts_as_solr module to correctly sort the result sets. I finally found the solution in acts_methods.rb, where it documents that the fields configuration option passed to acts_as_solr can be an array of hashes. That finally got it working; prior to the change, everything was getting a '_t' tacked on to the end, which was breaking the date sorting. I still never got it to correctly order by score, but since the default is to sort by score, in that case I just don't pass in an :order specification when I call find_by_solr.
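For the record, the configuration ended up looking roughly like this; the field names and type symbols are illustrative, since the exact types depend on what the plugin's schema supports:

# Rough sketch of passing hashes in the :fields array so acts_as_solr does not
# tack '_t' onto everything; field names and types here are assumptions.
class ModsRecord < ActiveRecord::Base
  acts_as_solr :fields => [:title, :subject, {:title_sort => :string}, {:date_sort => :date}]
end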

Saturday, August 4, 2007

Headings

I met with Katherine and Kalika from Citrus design and had a good meeting. I am anxious to see the results of that. Ere that happens I press on. I spent the better part of Thursday working on user registration and login functionality, only to rip it out again in frustration after doing more research on my options. There are several plugins and gems out there that I think bear more research before proceeding, including one which would give us OpenID functionality.
Instead today, triggered by some conversations I had with Katherine and Jerry yesterday, I started working on some analysis of the subject headings. I believe that having some sort of hierarchical view of the collection is critical to our SEO aspirations, and I think the subject headings and dates are probably the lowest-threshold methodologies to get us there. To that end I produced a table of subject headings and a primitive viewer. It's useful in that it helped me find a few more small problems with the JSON xsl transform I am using and a number of configuration issues that remain with the subject headings, such as the lack of a unifying subject field that collects the various specific subject_topic fields etc. Also the normalizations required to make the headings useful should be improved and, more importantly, to follow my latest project mantra, made transparent. I am thinking right now of a list of regular expressions that can be viewed, and maybe a packager of some type as well that can bundle a collection of transforms to be applied to a given field. It's a nice idea anyway.
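To make the idea concrete, something like the sketch below is what I have in mind; the rules themselves are made up for illustration:

# A viewable, ordered list of normalization rules applied to a subject heading.
NORMALIZATIONS = [
  [/\s*--\s*/, ' -- '],  # standardize subdivision separators
  [/\.\s*\z/,  ''],      # drop trailing periods
  [/\s+/,      ' ']      # collapse runs of whitespace
]

def normalize_heading(heading)
  NORMALIZATIONS.inject(heading.strip) do |h, rule|
    h.gsub(rule[0], rule[1])
  end
end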
I'll have the headings work on the server Monday.

Wednesday, August 1, 2007

Cleaning up, moving forward

Found a few things wrong with the xsl now that I can look at the site more efficiently.
All subjects of a given type were being concatenated into one field. There was also a spurious
name field being generated when the title contained a nonSort element. Those have been fixed and the server is being rebuilt.
I'm working on a registration and login system now. I'd be farther along and have more time to spend on it if I hadn't mistyped something, after which every time I hit the application I'd get
We're sorry, but something went wrong.
We've been notified about this issue and we'll take a look at it shortly.
and there was nothing I could do about it. I lost undo after an Aptana restart, so I was about to pull all the code I had done. I think the problem was an end statement that was mistyped as ens or enS, but man, I could have used a better diagnostic than that. So it goes.


Tuesday, July 31, 2007

Record display

Yesterday was a plug-in deluge. Every step of the way I was trying to figure out how to do things someone else's way. So today, after a morning spent catching the server up with my laptop and rebuilding all its indexes and everything, it was a pleasure rendering records, just experiencing the joys of render :partial and simple programming.
A lot of questions came up though. How should fields germane to 'Search by selecting hyperlinked field values' be normalized? And what should they look like? I went ahead and did this for subjects, but records that are nothing but links start looking pretty ugly. Perhaps this is just a CSS issue.
And what about record-specific searches? Right now, I've gone ahead and done a briefer record display on a browse screen and a full record display, and a mods display for that matter.
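The view side of that is pleasantly dull, roughly the lines below; the partial and variable names are illustrative, not the actual templates:

<%# browse screen: one brief partial per record %>
<%= render :partial => 'records/brief', :collection => @records %>

<%# full record page %>
<%= render :partial => 'records/full', :locals => { :record => @record } %>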
I feel I am finally breaking through on the portal.
More tomorrow.

Monday, July 30, 2007

Portal Round 1

I worked today on bringing up the portal. I started from scratch with the two rails plug-ins acts_as_solr and will_paginate. I installed them locally on my laptop using the RadRails plugin installation and loaded them as svn externals.
I have elected to use the included solr distribution within the acts_as_solr plugin, although I cannot seem to run it from the rake task within the IDE. To run it I have to find the vendor/plugins/acts_as_solr/branches/release_0.9/solr directory within my Aptana workspace and then manually run the instance with java -Djetty.port=8982 -jar start.jar, which starts the jetty servlet container on the port that acts_as_solr deems the default for the development environment.
I retooled the mods-solr.xsl to use solr style field designations and to take advantage of the dynamic field names. Since I was thinking about it I re-did the JSON field names to follow similar conventions.

I spent most of the day banging my head against the following problems
  1. The little I could find on integrating acts_as_solr and will_paginate did not seem to work. I eventually settled on using WillPaginate::Collection, which seems to work pretty well (see the sketch after this list).
  2. Sorting result sets with acts_as_solr is beyond my googling. I still don't have it working but have learned a lot about the issue in the meantime.
    1. Syntactically the format is :order => " ", the discovery of which took me deep within the bowels of acts_as_solr. Errors about "title_t".keys method not found required the full tour.
    2. acts_as_solr as of release 0.9 does not use the standard method for declaration of sorting.
    3. After reading up on lucene, sorting of strings pretty much requires the existence of an untokenized field. This led me to the creation of a title_sort field of type alphaSortOnly. This is fine, but it required that I alter the supplied schema.xml, which causes me problems with svn since it is down in an svn externals directory. -- Also worth noting is that adding the untokenized form of the title added an eyeball estimate of ~20% to my runtime when indexing records. This should be pursued.
    4. The date sorting will require either a normalized UTC date or a string that we manufacture. Since we are plucking this data out of the MODS xml it may require a bit more work.
    5. Either way the sort still did not give me the expected results. I'm not fretting yet; I got pretty far today, so I'll just have to dig some more.
  3. I just started looking at facets but can imagine that will be pretty exciting as well.
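Here is roughly how the will_paginate integration mentioned in item 1 turned out; treat it as a sketch, since the method names on the acts_as_solr result object (docs, total_hits) are my best reading of that plugin:

# Wrap acts_as_solr results in a WillPaginate::Collection so the standard
# pagination helpers work; assumes the result responds to total_hits and docs.
results = ModsRecord.find_by_solr(params[:q],
            :offset => (page - 1) * per_page, :limit => per_page)

@records = WillPaginate::Collection.create(page, per_page, results.total_hits) do |pager|
  pager.replace(results.docs)
end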

Bring up the portal

I spent the last couple of days playing with acts_as_solr, the ruby plugin for solr. AAS relies upon the dynamic field capabilities of solr, which I ignored in my initial indexing work. Although I am still not convinced that I want to use it, AAS holds the promise of giving me faceted searching for very little work. The biggest problem with AAS is that it wants to do all my indexing for me. Which may be fine in the long run, but right now indexing is done by a java program. I am still not satisfied with the performance of ruby's xsl transform engines, though I have still not adequately tested the libxml code, mostly because I still can't get it to build on my Windows laptop (it took a while to get installed on ubuntu as well.)

Thursday, July 26, 2007

Deleting duplicate rows in mysql

I am posting this as a public service to myself on how to delete duplicate rows from a mysql table.
In my case I have a unique column id on each row. And I have an indexed column, call it match, that I use to determine duplication.
In short, so I can find it next time, here is what I do:

delete t2 from table1 as t1, table1 as t2 where t1.match = t2.match and t2.id > t1.id;


This works and takes about 2-4 minutes on a table with about 280K rows and 35K duplicates.

I've only tested this with MyISAM tables. I know this can be done in other ways that can be gleaned from the mysql documentation, but this way, when I can remember it, seems simplest.

Wednesday, July 18, 2007

An update on Indexing MODS records with SOLR

My goal in the Aquifer project is to make the machinery behind the indexing as transparent as possible. Indexing is almost invariably a mapping between the source data and the index representation. Here, MODS records are mapped to a SOLR input schema using an XSL transform, mods-solr.xsl (visible here). The complexity of the MODS document is reduced to a list of fields, such as title, name, subject, etc. Many sub-elements may be concatenated together to create these fields.
Here is an example of what goes into SOLR

<add>
<doc>
<field name="id">1</field>
<field name="title">History of the Saginaw Valley,; its resources, progress and business interests</field>
<field name="name">Fox, Truman B</field>
<field name="type-of-resource">text</field>
<field name="origin">, Daily Courier Steam Job Print1868</field>
<field name="language"><field name="physical-description">text/xml;image/tiff; 80 p. 20 cm.; reformatted digital</field>
<field name="note">Advertising matter interspersed.</field>
<field name="subject-topic">History</field>
<field name="subject-geographic">Saginaw River Valley (Mich.)</field>
<field name="url">http://name.umdl.umich.edu/0885076</field>
<field name="access-condition">Where applicable, subject to copyright. Other restrictions on distribution may apply. Please go to http://www.umdl.umich.edu/ for more information.</field>
</doc>
</add>

There are problems with some of the fields having extraneous commas from concatenation rules. And more interesting problems like whether fields like access-condition should be indexed at all. But at least with the XSL, the rules are not buried within Java or some other language.

Further details on how SOLR uses this input are controlled via its schema.xml file (visible here).
The fields created by the transform must be defined, as well as copy information, which allows fields to be indexed in more than one way. For example the line

<copyField source="subject-hierarchic" dest="subject"/>

causes a specific subject to also be indexed under the generic subject field. More lines like this can cause all text to be indexed under a single default field, which is used when users do not specify a field in a query. Additional configuration can control how fields are parsed and how sentences and punctuation are handled at the document or field level.

These files need more work but they are not a bad starting point.

Friday, July 6, 2007

Indexing cracked

After a couple of good days of wrestling with indexing using java and solr, I have gotten the recalcitrant beast working. More needs to be done, as always, but now I think all the pieces are in place and we can get to the real work. Indexing configuration needs a thorough once-over. I will take a crack at it tomorrow, but hopefully everyone will chime in with some ideas. As I am on vacation next week, maybe there will be some good ideas waiting for me when I get back. SOLR seems more than willing to handle a lot of fields and our faceting desires. We just need to coherently set it up.

Thursday, June 28, 2007

Status at the core

Not a bad time for a status update, since Tom and I spent a good portion of the day going over it.
While there remain a few decisions to be made, we have a general outline; this is pretty much the DLXS system
with different storage and languages for the various processing steps. See the graph on the wiki at

http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferCore/Basic+Data+Flow

  1. Crawling collections
    I have tested crawling with the DLXS harvester and a Ruby harvester; both are easy to implement and seem to work well. I have worked with Java-based versions in the past, and it seems that most of the work on them was done in the past also. The ruby-based implementation is very succinct and has been run on 100K records for testing. At the moment it seems reliable but a little bit slow, and it was extremely easy to integrate with the RubyOnRails application to deposit the xml straight into mysql.
  2. RawXml
    Storage is in a text column in a table. Initially I stored only the metadata section; subsequent work makes me want to reconsider that and store the envelope as well. Ruby scaffolding for this table is complete.
  3. CookedXml.
    Seems like there is some case to be made for doing an initial pre-processing step over the data. This will likely consist of generic and specific translations and mappings employing both XSL and regular expressions, plus other subsidiary code such as date mapping packages. Kat feels that is a critical part of portal quality. More work needs to be done in ruby on the table for this, but it should be the same as RawXML.
  4. Indexing.
    SOLR has been installed and is in initial testing. Seems like a nice simple wrapper to the lucene indexing facility. Tom has some reservations about deeply nested indexed fields, I am more optimistic. Seems like it will be simple to translate the DLXS XSL transforms into a MODS to SOLR template. This portion will be written in Java, I've run the tutorial and it seems fine.
  5. Pre-rendered data.
    I am of the opinion that the ruby code should have access to pre-processed mods records. I have been doing some experiments mapping to something resembling the DLXS mods-as-bibclass model. I have written test code that saves the preliminary format into JSON and reads it back into a Ruby object (see the sketch after this list). I need to start some timing tests. It may be that this data is only for ruby, and it would be quicker and simpler code to use Ruby object serialization, but since I will most likely be generating this data from Java and XSL it is easiest to use a more neutral format like JSON. The serialized data would include linked objects that implement the Asset Actions.
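A bare-bones sketch of the JSON round trip I have in mind follows; the field layout is made up for illustration:

require 'rubygems'
require 'json'

# Serialize a pre-rendered record to JSON and read it back into a Ruby object.
record = {
  'title'   => ['History of the Saginaw Valley'],
  'subject' => ['History', 'Saginaw River Valley (Mich.)']
}
File.open('record.json', 'w') { |f| f.write(record.to_json) }
parsed = JSON.parse(File.read('record.json'))
puts parsed['title'].first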
We have a few issues.
  1. Ruby XML handling through the REXML package seems too slow, but just tonight Tom managed to load a package based on libxml2 which purports to bring performance gains of 33 to 200 times. I have verified that the new code runs but have not yet had a chance to verify performance.
  2. We talked today about nested mods records and how best to handle them. We are interested in the immediate likelihood of seeing data of this form.
  3. I have been experimenting with a number of XSL transformations of my favorite kind (written by someone else). I have encountered both 1.0 and 2.0 transforms. This does not seem to be a problem, but others may know more.

Wednesday, June 20, 2007

Harvesting and mapping

It's hard to proceed without data, so yesterday and today were devoted to a simple ruby harvester. It is up and running and, right now, painfully slow. It was also tested on sites that had no errors, so robustness has not been broached. The good news is I can move on to indexing and display. The next specific task is to create an intermediate form of the mods record. This will be very similar to the BibClass implementation of MODS. Parsing the MODS data at runtime every time we want to show or index a record will be just too much overhead. I will probably go with either a simple ruby or json serialization of an array of arrays of fields. There will be mapping tables and filters that will translate between the various MODS sources and this layout. I think I will put these in the database so administrators can try applying various ones to new sources as they appear. I am definitely leaning towards json as the serialization method because it will make it much easier for other applications written in other languages to mine the data directly.

The biggest trick to getting the harvester working was divining how the resumption_token was used and then figuring out that a call to list_records with a resumption token cannot be accompanied by any other arguments.

res = client.list_records( :metadata_prefix => 'mods' )
return if res == nil

entries = res.entries
while true do
  entries.each do |entry|
    puts entry.header.identifier
    save_record entry
  end
  puts "done with first batch"

  # ask for the next batch using only the resumption token
  token = res.resumption_token
  puts "token=#{token}"

  res = client.list_records( :resumption_token => token )
  break if res == nil
  entries = res.entries
end

I have a problem that I did not figure out on this pass. It seems that getting the entries with entries = res.entries takes as long as the list_records call, making me think that I am fetching twice. I could just iterate over list_records, but then I am not sure if the resumption_token is where I need it. A bit of poking will probably reveal all.

Monday, June 18, 2007

Coding is underway

I think we had a great meeting at Michigan. Tom and I are going to begin working on the portal website immediately. While everything is subject to change at this point, we have decided to go ahead and write the web stuff using ruby on rails. We may use bits and pieces, so there will probably be Perl and Java code floating around in there too, but at the moment we are sold on the sirens' song of agile development. The honeymoon is far from over.