Tuesday, July 31, 2007

Record display

Yesterday was a plug-in deluge. Every step of the way I was trying to figure out how to do it someone else's way. So today, after a morning spent catching the server up with my laptop and rebuilding all its indexes and everything, it was a pleasure rendering records, just experiencing the joys of render partial and simple programming.
A lot of questions came up, though. How should fields germane to 'Search by selecting hyperlinked field values' be normalized? And what should they look like? I went ahead and did this for subjects, but records that are nothing but links start looking pretty ugly. Perhaps this is just a CSS issue.
And what about record-specific searches? Right now, I've gone ahead and done a briefer record display on a browse screen and a full record display, and a MODS display for that matter.
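The brief-versus-full split falls out of render partial naturally. A sketch of how the two displays might be wired up, with partial and variable names that are my own invention rather than the actual code:

```erb
<%# browse screen: one brief partial per record (Rails 1.x style) %>
<%= render :partial => "record_brief", :collection => @records %>

<%# full display: route the single record through a different partial %>
<%= render :partial => "record_full", :locals => { :record => @record } %>
```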
I feel I am finally breaking through on the portal.
More tomorrow.

Monday, July 30, 2007

Portal Round 1

I worked today on bringing up the portal. I started from scratch with the two Rails plug-ins acts_as_solr and will_paginate. I installed them locally on my laptop using the RadRails plugin installer and loaded them as svn externals.
I have elected to use the Solr distribution included within the acts_as_solr plugin, although I cannot seem to run it from the rake task within the IDE. To run it I have to find the vendor/plugins/acts_as_solr/branches/release_0.9/solr directory within my Aptana workspace and then manually run the instance with java -Djetty.port=8982 -jar start.jar, which starts the Jetty servlet container at the port that acts_as_solr deems the default for the development environment.
I retooled mods-solr.xsl to use Solr-style field designations and to take advantage of the dynamic field names. Since I was thinking about it, I redid the JSON field names to follow similar conventions.

I spent most of the day banging my head against the following problems:
  1. The little I could find on integrating acts_as_solr and will_paginate did not seem to work. I eventually settled on using WillPaginate::Collection, which seems to work pretty well.
  2. Sorting result sets with acts_as_solr is beyond my googling. I still don't have it working, but I have learned a lot about the issue in the meantime.
    1. Syntactically the format is :order => " ", the discovery of which took me deep within the bowels of acts_as_solr. Errors about a "title_t".keys method not found required the full tour.
    2. acts_as_solr, as of release 0.9, does not use the standard method for declaring sorting.
    3. After reading up on Lucene, sorting strings pretty much requires the existence of an untokenized field. This led me to create a title_sort field of type alphaSortOnly. This is fine, but it required that I alter the supplied schema.xml, which causes me problems with svn since it is down in an svn externals directory. -- Also worth noting: adding the untokenized form of the title added an eyeball estimate of ~20% to my indexing runtime. This should be pursued.
    4. Date sorting will require either a normalized UTC date or a string that we manufacture. Since we are plucking this data out of the MODS XML, it may require a bit more work.
    5. Either way, the sort still did not give me the expected results. I'm not fretting yet; I got pretty far today, so I'll just have to dig some more.
  3. I just started looking at facets but can imagine that will be pretty exciting as well.
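The WillPaginate::Collection approach from point 1 is roughly the following pattern: ask Solr for one page's worth of hits at the right offset, then hand the page and the total hit count to the collection. The real controller code and the find_by_solr option names are from memory and should be treated as assumptions; below them is a minimal stand-in class so the offset arithmetic can be seen and run without either gem installed.

```ruby
# In the app itself it would look roughly like (names are assumptions):
#
#   @records = WillPaginate::Collection.create(page, per_page) do |pager|
#     result = Record.find_by_solr(params[:q],
#                                  :offset => pager.offset,
#                                  :limit  => pager.per_page,
#                                  :order  => "title_sort asc")
#     pager.replace(result.docs)
#     pager.total_entries = result.total_hits
#   end
#
# A minimal stand-in mimicking the parts of WillPaginate::Collection used above:
class PagerSketch
  attr_reader :offset, :per_page, :current_page
  attr_accessor :total_entries

  def initialize(page, per_page)
    @current_page, @per_page = page, per_page
    @offset = (page - 1) * per_page   # rows Solr should skip for this page
    @items = []
  end

  def replace(items)   # receives just the current page of hits
    @items = items
  end

  def to_a
    @items
  end

  def total_pages
    (total_entries.to_f / per_page).ceil
  end
end

# 23 fake hits, page 2 at 10 per page: should yield rows 11..20
hits  = (1..23).to_a
pager = PagerSketch.new(2, 10)
pager.replace(hits[pager.offset, pager.per_page])
pager.total_entries = hits.size
```

The point of the pattern is that Solr only ever returns one page of documents, while the collection still knows the total count and can render the full pager.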

Bring up the portal

I spent the last couple of days playing with acts_as_solr, the Ruby plugin for Solr. AAS relies upon the dynamic field capabilities of Solr, which I ignored in my initial indexing work. Although I am still not convinced that I want to use it, AAS holds the promise of giving me faceted searching for very little work. The biggest problem with AAS is that it wants to do all my indexing for me. That may be fine in the long run, but right now indexing is done by a Java program. I am still not satisfied with the performance of Ruby's XSL transform engines, though I still have not adequately tested the libxml code, mostly because I still can't get it to build on my Windows laptop (it took a while to get installed on Ubuntu as well).

Thursday, July 26, 2007

Deleting duplicate rows in mysql

I am posting this as a public service to myself on how to delete duplicate rows from a mysql table.
In my case I have a unique column id on each row. And I have an indexed column, call it match, that I use to determine duplication.
In short, so I can find it next time, what I do is:

delete t2 from table1 as t1, table1 as t2 where t1.match = t2.match and t2.id > t1.id;


This works and takes about 2-4 minutes on a table with about 280K rows and 35K duplicates.

I've only tested this with MyISAM tables. I know this can be done other ways that can be gleaned from the MySQL documentation, but this way, when I can remember it, seems simplest.

Wednesday, July 18, 2007

An update on Indexing MODS records with SOLR

My goal in the Aquifer project is to make the machinery behind the indexing as transparent as possible. Indexing is almost invariably a mapping between the source data and the index representation. Here, MODS records are mapped to a SOLR input schema using an XSL transform, mods-solr.xsl (visible here). The complexity of the MODS document is reduced to a list of fields, such as title, name, subject, etc. Many sub-elements may be concatenated together to create these fields.
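For illustration, one such mapping rule might look something like the following. This is a hypothetical fragment in the spirit of mods-solr.xsl, not its actual contents; the mods namespace prefix and the concatenation of subTitle are assumptions:

```xml
<!-- flatten a MODS titleInfo into a single Solr "title" field -->
<xsl:template match="mods:titleInfo">
  <field name="title">
    <xsl:value-of select="mods:title"/>
    <xsl:if test="mods:subTitle">
      <xsl:text>; </xsl:text>
      <xsl:value-of select="mods:subTitle"/>
    </xsl:if>
  </field>
</xsl:template>
```

A rule like this is also where the extraneous punctuation mentioned below creeps in: if a sub-element is empty, the separator text is still emitted unless guarded.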
Here is an example of what goes into SOLR

<add>
<doc>
<field name="id">1</field>
<field name="title">History of the Saginaw Valley,; its resources, progress and business interests</field>
<field name="name">Fox, Truman B</field>
<field name="type-of-resource">text</field>
<field name="origin">, Daily Courier Steam Job Print1868</field>
<field name="language"></field>
<field name="physical-description">text/xml;image/tiff; 80 p. 20 cm.; reformatted digital</field>
<field name="note">Advertising matter interspersed.</field>
<field name="subject-topic">History</field>
<field name="subject-geographic">Saginaw River Valley (Mich.)</field>
<field name="url">http://name.umdl.umich.edu/0885076</field>
<field name="access-condition">Where applicable, subject to copyright. Other restrictions on distribution may apply. Please go to http://www.umdl.umich.edu/ for more information.</field>
</doc>
</add>

There are problems with some of the fields having extraneous commas from the concatenation rules, and more interesting problems, like whether fields such as access-condition should be indexed at all. But at least with the XSL, the rules are not buried within Java or some other language.

How SOLR uses this input is further controlled via its schema.xml file (visible here).
The fields created by the transform must be defined, as well as copy information, which allows fields to be indexed in more than one way. For example, the line

<copyField source="subject-hierarchic" dest="subject"/>

causes a hierarchic subject to also be indexed as a generic subject. More lines like this can cause all text to be indexed under a single default field, which is used when users do not specify a field in a query. Additional configuration can control how fields are parsed and how sentences and punctuation are handled at a document or field level.
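Concretely, such a catch-all could be a handful of copy lines feeding a default search field, in the style Solr's example schema.xml uses. The particular source field names here are just the ones produced by the transform above, and the exact set is an assumption:

```xml
<!-- hypothetical catch-all: funnel individual fields into one "text" field -->
<copyField source="title" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="subject-topic" dest="text"/>
<copyField source="note" dest="text"/>

<!-- unqualified queries then search the combined field -->
<defaultSearchField>text</defaultSearchField>
```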

These files need more work but they are not a bad starting point.

Friday, July 6, 2007

Indexing cracked

After a couple of good days of wrestling with indexing using Java and SOLR, I have gotten the recalcitrant beast working. More needs to be done, as always, but now I think all the pieces are in place and we can get to the real work. The indexing configuration needs a thorough once-over. I will take a crack at it tomorrow, but hopefully everyone will chime in with some ideas. As I am on vacation next week, maybe they'll have some good ideas by the time I get back. SOLR seems more than willing to handle a lot of fields and our faceting desires. We just need to set it up coherently.