Thursday, June 28, 2007

Status at the core

Not a bad time for a status update since Tom and I spent a good portion of the day going over it.
While there remain a few decisions to be made, we have a general outline: it is pretty much the DLXS system with different storage and languages for the various processing steps. See the diagram on the wiki at

http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferCore/Basic+Data+Flow

  1. Crawling collections
    I have tested crawling with the DLXS harvester and a Ruby harvester; both are easy to implement and seem to work well. I have worked with Java-based versions in the past, and it seems most of the work on them was also done in the past. The Ruby-based implementation is very succinct and has been run on 100K records for testing; at the moment it seems reliable, if a little slow, but it was extremely easy to integrate with the Ruby on Rails application to deposit the XML straight into MySQL.
  2. RawXml
    Storage is in a text column in a table. Initially I stored only the metadata section; subsequent work makes me want to reconsider that and store the envelope as well. The Ruby scaffolding for this table is complete (a sketch of the table appears after this list).
  3. CookedXml.
    There seems to be a case for an initial pre-processing step over the data. This will likely consist of generic and specific translations and mappings employing XSL, regular expressions, and other subsidiary code such as date-mapping packages (see the date sketch after this list). Kat feels that this is a critical part of portal quality. More work needs to be done in Ruby on the table for this, but it should be the same as RawXml.
  4. Indexing.
    SOLR has been installed and is in initial testing. It seems like a nice, simple wrapper around the Lucene indexing facility. Tom has some reservations about deeply nested indexed fields; I am more optimistic. It looks like it will be simple to translate the DLXS XSL transforms into a MODS-to-SOLR template. This portion will be written in Java; I've run the tutorial and it seems fine (a sketch of the update call follows the list).
  5. Pre-rendered data.
    I am of the opinion that the Ruby side should have access to pre-processed MODS records. I have been doing some experiments mapping to something resembling the DLXS MODS-as-BibClass model. I have written test code that saves the preliminary format as JSON and reads it back into a Ruby object (a round-trip sketch appears below). I need to start some timing tests. It may be that this data will only ever be read from Ruby, in which case Ruby object serialization would be quicker and simpler, but since I will most likely be generating this data from Java and XSL it is easiest to use a more neutral format like JSON. The serialized data would include linked objects that implement the Asset Actions.
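
Here is roughly what the RawXml table from item 2 looks like, sketched as a Rails migration. The table and column names are stand-ins rather than the actual scaffolding:

class CreateRawXmlRecords < ActiveRecord::Migration
  def self.up
    create_table :raw_xml_records do |t|
      t.column :oai_identifier, :string   # header identifier from the harvest
      t.column :source,         :string   # which collection the record came from
      t.column :xml,            :text     # the metadata section, or the whole envelope
      t.column :harvested_at,   :datetime
    end
    add_index :raw_xml_records, :oai_identifier
  end

  def self.down
    drop_table :raw_xml_records
  end
end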
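
For item 3, the regular-expression side of the cooking step might look something like this. The patterns and the normalize_date name are made up for illustration; we have not settled on a date-mapping package yet:

# Hypothetical sketch: coerce a few common date forms into yyyy-mm-dd.
def normalize_date(raw)
  case raw.strip
  when /\A(\d{4})\z/                        # "1907"
    "#{$1}-01-01"
  when /\A(\d{1,2})\/(\d{1,2})\/(\d{4})\z/  # "6/28/1907"
    sprintf("%04d-%02d-%02d", $3.to_i, $1.to_i, $2.to_i)
  when /\Ac(?:irca)?\.?\s*(\d{4})\z/i       # "circa 1907", "c. 1907"
    "#{$1}-01-01"
  else
    raw  # leave anything unrecognized for a human to look at
  end
end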
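
Item 4's SOLR step, whatever language it ends up in, boils down to posting XML at the update handler. A minimal sketch in Ruby, assuming the default port from the tutorial and hypothetical field names:

require 'net/http'

doc = <<XML
<add>
  <doc>
    <field name="id">oai:example:1</field>
    <field name="title">A sample title</field>
  </doc>
</add>
XML

http = Net::HTTP.new('localhost', 8983)
headers = { 'Content-Type' => 'text/xml' }
http.post('/solr/update', doc, headers)
http.post('/solr/update', '<commit/>', headers)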
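
And the round trip from item 5 is about this simple with the json gem. The field layout here is a stand-in, not the final format:

require 'rubygems'
require 'json'

# A record as an array of [field_name, values] pairs, BibClass style.
record = [ ['title', ['A sample title']], ['date', ['1907-01-01']] ]

serialized = record.to_json          # goes into the pre-rendered column
restored   = JSON.parse(serialized)  # comes back as plain Ruby arrays
puts restored.inspect
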
We have a few issues.
  1. Ruby XML handling through the REXML package seems too slow, but just tonight Tom managed to load a package based on libxml2, which purports to bring performance gains of 33 to 200 times. I have verified that the new code runs but have not yet had a chance to verify the performance (a side-by-side sketch follows this list).
  2. We talked today about nested MODS records and how best to handle them. We are interested in the immediate likelihood of seeing data of this form.
  3. I have been experimenting with a number of XSL transformations of my favorite kind (written by someone else). I have encountered both 1.0 and 2.0 transforms; this does not seem to be a problem, but others may know more.
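
On issue 1, the switch looks mostly mechanical. Here is the same title lookup both ways, using the libxml-ruby API as I understand it so far; the XPath is just an example:

require 'rexml/document'
require 'xml/libxml'

xml = File.read('record.xml')

# REXML: pure Ruby, and slow on large documents.
doc = REXML::Document.new(xml)
REXML::XPath.each(doc, '//titleInfo/title') { |node| puts node.text }

# libxml2: the same query through the C library.
doc = XML::Parser.string(xml).parse
doc.find('//titleInfo/title').each { |node| puts node.content }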

Wednesday, June 20, 2007

Harvesting and mapping

It's hard to proceed without data, so yesterday and today were devoted to a simple Ruby harvester. It is up and running and, right now, painfully slow. It was also tested only on sites that had no errors, so robustness has not been broached. The good news is I can move on to indexing and display. The next specific task is to create an intermediate form of the MODS record. This will be very similar to the BibClass implementation of MODS. Parsing the MODS data at runtime every time we want to show or index a record would be just too much overhead. I will probably go with either a simple Ruby or JSON serialization of an array of arrays of fields. There will be mapping tables and filters that translate between the various MODS sources and this layout. I think I will put these in the database so administrators can try applying various ones to new sources as they appear. I am definitely leaning towards JSON as the serialization method because it will make it much easier for applications written in other languages to mine the data directly (a sketch of the mapping idea follows).
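
To make the mapping-table idea concrete, a hypothetical sketch: a list of field-name/XPath pairs applied to a MODS record to produce the array-of-arrays layout, which then serializes to JSON. None of these paths or names are final:

require 'rexml/document'
require 'rubygems'
require 'json'

# Hypothetical mapping table; in practice these rows would live in the
# database so administrators can swap them in per source.
MAPPING = [
  ['title',   'titleInfo/title'],
  ['creator', 'name/namePart'],
  ['date',    'originInfo/dateIssued']
]

def flatten_record(xml)
  doc = REXML::Document.new(xml)
  MAPPING.map do |field, xpath|
    values = REXML::XPath.match(doc, "//#{xpath}").map { |node| node.text }
    [field, values]
  end
end

# flatten_record(mods_xml).to_json
#   => '[["title",["..."]],["creator",["..."]],["date",["..."]]]'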

The biggest trick to getting the harvester working was divining how the resumption_token was used, and then figuring out that a list_records call with a resumption_token cannot be accompanied by any other arguments.

res = client.list_records( :metadata_prefix => 'mods' )
return if res == nil

entries = res.entries
while true do
  # Save every record in the current batch.
  entries.each do |entry|
    puts entry.header.identifier
    save_record entry
  end
  puts "done with batch"

  # An empty token means the repository has nothing more for us.
  token = res.resumption_token
  puts "token=#{token}"
  break if token == nil || token.empty?

  # The resumption_token must be the only argument on follow-up calls.
  res = client.list_records( :resumption_token => token )
  break if res == nil
  entries = res.entries
end

I have a problem that I did not figure out on this pass. It seems that getting the entries with entries = res.entries takes as long as the list_records call itself, making me think that I am fetching twice. I could just iterate over list_records directly, but then I am not sure the resumption_token would be where I need it. A bit of poking will probably reveal all; a crude timing harness like the one below should at least confirm the symptom.
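
Minimal sketch with the standard Benchmark library, reusing the client from above. If the second number is close to the first, the response is probably being fetched or parsed lazily inside entries:

require 'benchmark'

res = nil
fetch_time = Benchmark.realtime { res = client.list_records( :metadata_prefix => 'mods' ) }
parse_time = Benchmark.realtime { res.entries }

puts "list_records: #{fetch_time}s  entries: #{parse_time}s"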

Monday, June 18, 2007

Coding is underway

I think we had a great meeting at Michigan. Tom and I are going to begin working on the portal website immediately. While everything is subject to change at this point, we have decided to go ahead and write the web stuff using Ruby on Rails. We may use bits and pieces of other things, so there will probably be Perl and Java code floating around in there too, but at the moment we are sold on the sirens' song of agile development. The honeymoon is far from over.