Thursday, June 28, 2007

Status at the core

Not a bad time for a status update since Tom and I spent a good portion of the day going over it.
While there remain a few decisions to be made, we have a general outline; it is pretty much the DLXS system
with different storage and languages for the various processing steps. See the diagram on the wiki at

http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferCore/Basic+Data+Flow

  1. Crawling collections
    I have tested crawling with the DLXS harvester and a Ruby harvester; both are easy to implement and seem to work well. I have worked with Java-based versions in the past, though it seems most of the work on them was also done in the past. The Ruby implementation is very succinct; it has been run against 100K records for testing and so far seems reliable, if a little slow, and it was extremely easy to integrate with the Ruby on Rails application to deposit the XML straight into MySQL.
  2. RawXml
    Storage is in a text column in a table. Initially I stored only the metadata section; subsequent work makes me want to reconsider that and store the envelope as well. The Ruby scaffolding for this table is complete.
  3. CookedXml.
    There seems to be a case to be made for an initial pre-processing step over the data. This will likely consist of generic and collection-specific translations and mappings, employing both XSL and regular expressions as well as subsidiary code such as date-mapping packages. Kat feels this is a critical part of portal quality. More work needs to be done in Ruby on the table for this, but it should be the same as RawXml.
  4. Indexing.
    SOLR has been installed and is in initial testing. It seems like a nice, simple wrapper around the Lucene indexing facility. Tom has some reservations about deeply nested indexed fields; I am more optimistic. It seems it will be simple to translate the DLXS XSL transforms into a MODS-to-SOLR template. This portion will be written in Java; I've run the tutorial and it seems fine.
  5. Pre-rendered data.
    I am of the opinion that the Ruby side should have access to pre-processed MODS records. I have been doing some experiments mapping to something resembling the DLXS MODS-as-bibclass model. I have written test code that saves the preliminary format as JSON and reads it back into a Ruby object. I need to start some timing tests. It may be that this data is only for Ruby, in which case Ruby object serialization would be the quicker and simpler code; but since I will most likely be generating this data from Java and XSL, it is easiest to use a more neutral format like JSON. The serialized data would include linked objects that implement the Asset Actions.
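To make step 1 concrete, here is a minimal sketch of the parsing side of the harvest: pulling each record's identifier and metadata section out of an OAI-PMH ListRecords response with REXML. The sample response is a hypothetical, trimmed-down payload (namespaces omitted for brevity); a real harvester would fetch pages from the repository's OAI endpoint and follow resumptionTokens.

```ruby
require 'rexml/document'

# Hypothetical, minimal ListRecords response; namespaces trimmed.
response = <<XML
<?xml version="1.0"?>
<OAI-PMH>
  <ListRecords>
    <record>
      <header><identifier>oai:example:1</identifier></header>
      <metadata><title>First record</title></metadata>
    </record>
    <record>
      <header><identifier>oai:example:2</identifier></header>
      <metadata><title>Second record</title></metadata>
    </record>
  </ListRecords>
</OAI-PMH>
XML

doc = REXML::Document.new(response)
records = []
doc.elements.each('OAI-PMH/ListRecords/record') do |rec|
  id = rec.elements['header/identifier'].text
  # Keep the metadata section as raw XML text, ready for a MySQL text column.
  xml = rec.elements['metadata'].to_s
  records << { identifier: id, raw_xml: xml }
end

records.each { |r| puts "#{r[:identifier]}: #{r[:raw_xml].length} bytes" }
```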
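For step 3, the regular-expression side of the pre-processing is easy to sketch: normalize free-text dates into a sortable year. normalize_date here is a hypothetical helper for illustration, not project code.

```ruby
# Hypothetical date-normalization pass of the kind the CookedXml step
# might apply; unmapped values return nil so they can be flagged for review.
def normalize_date(raw)
  case raw.strip
  when /\A(\d{4})\z/                   then $1   # "1907"
  when /\A[A-Za-z]+\.?\s+(\d{4})\z/    then $1   # "June 1907", "ca. 1907"
  when %r{\A\d{1,2}/\d{1,2}/(\d{4})\z} then $1   # "6/28/1907"
  end
end

['1907', 'June 1907', '6/28/1907', 'undated'].each do |d|
  puts "#{d.inspect} -> #{normalize_date(d).inspect}"
end
```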
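For step 4, whatever ends up doing the MODS-to-SOLR mapping ultimately has to emit SOLR's add-document envelope, which is just nested field elements. A sketch of building that envelope with REXML follows; the field names (id, title, subject) are placeholders, not a settled schema.

```ruby
require 'rexml/document'

# Build a SOLR <add><doc> document from a hash of fields; multi-valued
# fields become repeated <field> elements, as SOLR expects.
def solr_add_xml(fields)
  doc = REXML::Document.new
  d = doc.add_element('add').add_element('doc')
  fields.each do |name, values|
    Array(values).each do |v|
      f = d.add_element('field', 'name' => name.to_s)
      f.text = v.to_s
    end
  end
  doc.to_s
end

xml = solr_add_xml(id: 'oai:example:1', title: 'First record',
                   subject: ['Maps', 'Indiana'])
puts xml
# In production this document would be POSTed to the SOLR update handler.
```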
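The step 5 experiment — saving a bibclass-like record as JSON and reading it back into a Ruby object — looks roughly like the following. The record shape is illustrative, not the actual DLXS bibclass mapping.

```ruby
require 'json'
require 'ostruct'

# Illustrative record shape, including a linked-object stub for Asset Actions.
record = {
  'id'      => 'oai:example:1',
  'title'   => 'First record',
  'creator' => ['Smith, A.'],
  'assets'  => [{ 'action' => 'getThumbnail', 'url' => 'http://example.org/t/1' }]
}

serialized = JSON.generate(record)     # what would live in the pre-rendered store
restored   = JSON.parse(serialized)
obj        = OpenStruct.new(restored)  # cheap Ruby-object view of the data

puts obj.title
puts obj.assets.first['action']
```

Since the JSON is language-neutral, the same serialized data could just as easily be produced by the Java/XSL side and consumed here.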
We have a few open issues.
  1. Ruby XML handling through the REXML package seems too slow, but just tonight Tom managed to load a package based on libxml2 which purports to bring performance gains of 33 to 200 times. I have verified that the new code runs but have not yet had a chance to verify the performance.
  2. We talked today about nested MODS records and how best to handle them. We are interested in how likely we are to see data of this form in the near term.
  3. I have been experimenting with a number of XSL transformations of my favorite kind (written by someone else). I have encountered both 1.0 and 2.0 transforms. This does not seem to be a problem, but others may know more.
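One quick way to put numbers on the REXML concern in issue 1 is the stdlib Benchmark module; the libxml2-based parser could be timed the same way once it is confirmed working (not shown here, since it is not yet verified).

```ruby
require 'benchmark'
require 'rexml/document'

sample = '<record><title>A title</title><date>1907</date></record>'

# Time parsing many small records, roughly what the harvester churns through.
elapsed = Benchmark.realtime do
  1_000.times { REXML::Document.new(sample) }
end

puts format('REXML: parsed 1000 records in %.3f s', elapsed)
```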
