Chicks Bit Brick (Chick)

How do you know it's a country (2008-03-25)

I wanted to know whether a string contains a country name, so I found a list of countries; there are plenty out there, though finding one that includes England and names like that is hard. My plan was to build a single regular expression of the form /country1|country2|.../ and match it against the string. Out of curiosity I wondered how much faster that would be than keeping an array of regular expressions and testing each one against the string. My profound respect for regular expressions made me think some sort of clever tree would be compiled and the search would be quite fast. So I wrote a quick test in Ruby and found, much to my surprise, that <span style="font-weight: bold;">the single regular expression technique was about 4 times slower. 
</span>I haven't pursued it further (but did go with the array method) but here is the code if someone can tell me what I did wrong.<br /><br /><pre><br /># simple class to determine whether it's quicker to decide if a string has a country<br /># with a regex for each country or whether to build a single regex that uses 'or'<br /># to list every country<br /># could be missing something<br /># conclusion: much to my surprise array of regexes seems to be about 4 times faster<br />class RegexpTest<br /># this list is from http://www.iso.org/iso/iso3166_en_code_lists.txt<br /># with a few alterations<br />@@countries = [ 'AFGHANISTAN', 'ÅLAND ISLANDS', 'ALBANIA', 'ALGERIA', 'AMERICAN SAMOA', 'ANDORRA', 'ANGOLA',<br /> 'ANGUILLA', 'ANTARCTICA', 'ANTIGUA AND BARBUDA', 'ARGENTINA', 'ARMENIA', 'ARUBA', 'AUSTRALIA',<br /> 'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH', 'BARBADOS', 'BELARUS', 'BELGIUM',<br /> 'BELIZE', 'BENIN', 'BERMUDA', 'BHUTAN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA',<br /> 'BOUVET ISLAND', 'BRAZIL', 'BRITISH INDIAN OCEAN TERRITORY', 'BRUNEI DARUSSALAM', 'BULGARIA',<br /> 'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE', 'CAYMAN ISLANDS',<br /> 'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA', 'CHRISTMAS ISLAND',<br /> 'COCOS (KEELING) ISLANDS', 'COLOMBIA', 'COMOROS', 'CONGO', 'CONGO, THE DEMOCRATIC REPUBLIC OF THE',<br /> 'COOK ISLANDS', 'COSTA RICA', 'CÔTE D\'IVOIRE', 'CROATIA', 'CUBA', 'CYPRUS', 'CZECH REPUBLIC',<br /> 'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT', 'EL SALVADOR',<br /> 'ENGLAND', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA', 'ETHIOPIA', 'FALKLAND ISLANDS (MALVINAS)',<br /> 'FAROE ISLANDS', 'FIJI', 'FINLAND', 'FRANCE', 'FRENCH GUIANA', 'FRENCH POLYNESIA',<br /> 'FRENCH SOUTHERN TERRITORIES', 'GABON', 'GAMBIA', 'GEORGIA', 'GERMANY', 'GHANA', 'GIBRALTAR',<br /> 'GREECE', 'GREENLAND', 'GRENADA', 'GUADELOUPE', 'GUAM', 'GUATEMALA', 'GUERNSEY', 'GUINEA',<br /> 
'GUINEA-BISSAU', 'GUYANA', 'HAITI', 'HEARD ISLAND AND MCDONALD ISLANDS',<br /> 'HOLY SEE (VATICAN CITY STATE)', 'HONDURAS', 'HONG KONG', 'HUNGARY', 'ICELAND', 'INDIA',<br /> 'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISLE OF MAN', 'ISRAEL', 'ITALY', 'JAMAICA', 'JAPAN',<br /> 'JERSEY', 'JORDAN', 'KAZAKHSTAN', 'KENYA', 'KIRIBATI', 'KOREA', 'KOREA, REPUBLIC OF', 'KUWAIT',<br /> 'KYRGYZSTAN', 'LAO PEOPLE\'S DEMOCRATIC REPUBLIC', 'LATVIA', 'LEBANON', 'LESOTHO', 'LIBERIA',<br /> 'LIBYAN ARAB JAMAHIRIYA', 'LIECHTENSTEIN', 'LITHUANIA', 'LUXEMBOURG', 'MACAO',<br /> 'MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF', 'MADAGASCAR', 'MALAWI', 'MALAYSIA', 'MALDIVES',<br /> 'MALI', 'MALTA', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAURITANIA', 'MAURITIUS', 'MAYOTTE', 'MEXICO',<br /> 'MICRONESIA, FEDERATED STATES OF', 'MOLDOVA, REPUBLIC OF', 'MONACO', 'MONGOLIA', 'MONTENEGRO',<br /> 'MONTSERRAT', 'MOROCCO', 'MOZAMBIQUE', 'MYANMAR', 'NAMIBIA', 'NAURU', 'NEPAL', 'NETHERLANDS',<br /> 'NETHERLANDS ANTILLES', 'NEW CALEDONIA', 'NEW ZEALAND', 'NICARAGUA', 'NIGER', 'NIGERIA', 'NIUE',<br /> 'NORFOLK ISLAND', 'NORTHERN MARIANA ISLANDS', 'NORWAY', 'OMAN', 'PAKISTAN', 'PALAU',<br /> 'PALESTINIAN TERRITORY, OCCUPIED', 'PANAMA', 'PAPUA NEW GUINEA', 'PARAGUAY', 'PERU', 'PHILIPPINES',<br /> 'PITCAIRN', 'POLAND', 'PORTUGAL', 'PUERTO RICO', 'QATAR', 'REUNION', 'ROMANIA', 'RUSSIAN FEDERATION',<br /> 'RWANDA', 'SAINT BARTHÉLEMY', 'SAINT HELENA', 'SAINT KITTS AND NEVIS', 'SAINT LUCIA', 'SAINT MARTIN',<br /> 'SAINT PIERRE AND MIQUELON', 'SAINT VINCENT AND THE GRENADINES', 'SAMOA', 'SAN MARINO',<br /> 'SAO TOME AND PRINCIPE', 'SAUDI ARABIA', 'SENEGAL', 'SERBIA', 'SEYCHELLES', 'SIERRA LEONE',<br /> 'SINGAPORE', 'SCOTLAND', 'SLOVAKIA', 'SLOVENIA', 'SOLOMON ISLANDS', 'SOMALIA', 'SOUTH AFRICA',<br /> 'SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 'SPAIN', 'SRI LANKA', 'SUDAN', 'SURINAME',<br /> 'SVALBARD AND JAN MAYEN', 'SWAZILAND', 'SWEDEN', 'SWITZERLAND', 'SYRIAN ARAB REPUBLIC',<br /> 'TAIWAN, PROVINCE OF CHINA', 
'TAJIKISTAN', 'TANZANIA, UNITED REPUBLIC OF', 'THAILAND', 'TIMOR-LESTE',<br /> 'TOGO', 'TOKELAU', 'TONGA', 'TRINIDAD AND TOBAGO', 'TUNISIA', 'TURKEY', 'TURKMENISTAN',<br /> 'TURKS AND CAICOS ISLANDS', 'TUVALU', 'UGANDA', 'UKRAINE', 'UNITED ARAB EMIRATES', 'UNITED KINGDOM',<br /> 'UNITED STATES', 'UNITED STATES MINOR OUTLYING ISLANDS', 'URUGUAY', 'UZBEKISTAN', 'VANUATU',<br /> 'VENEZUELA', 'VIET NAM', 'VIRGIN ISLANDS, BRITISH', 'VIRGIN ISLANDS, U.S.', 'WALLIS AND FUTUNA',<br /> 'WESTERN SAHARA', 'YEMEN', 'ZAMBIA', 'ZIMBABWE']<br /><br /> # build both forms once: an array with one escaped regex per country,<br /> # and a single big alternation regex<br /> @@country_regexes = @@countries.collect { |c| Regexp.new( Regexp.escape( c ) ) }<br /> @@country_regex = Regexp.new( @@countries.collect { |c| Regexp.escape( c ) }.join( '|' ) )<br /><br /> # generate n random strings, about half of which are countries<br /> def RegexpTest.rand_string( n )<br /> a = Array.new<br /> n.times do<br /> s = ''<br /> if rand( 2 ) > 0<br /> s = @@countries[ rand( @@countries.length ) ]<br /> else<br /> x = rand(100)+10<br /> x.times do<br /> s << (rand(26)+65).chr # 65 is 'A', so these are uppercase letters<br /> end<br /> end<br /> a << s<br /> puts s<br /> end<br /> a<br /> end<br /> <br /> def RegexpTest.compare<br /> strings = RegexpTest.rand_string( 10000 )<br /> <br /> found = 0<br /> timer1 = Time.now<br /> strings.each do |string|<br /> @@country_regexes.each do |re|<br /> if re =~ string<br /> found += 1<br /> break<br /> end<br /> end<br /> end<br /> timer2 = Time.now<br /> delta = timer2 - timer1<br /><br /> puts "array of regexes found=#{found} in #{delta} seconds"<br /><br /> found = 0<br /> timer1 = Time.now<br /> strings.each do |string|<br /> found += 1 if @@country_regex =~ string<br /> end<br /> timer2 = Time.now<br /><br /> delta = timer2 - timer1<br /> puts "regex with big or found=#{found} in #{delta} seconds"<br /> <br /> end<br />end<br /><br />RegexpTest.compare<br /></pre>

An Aquifer Perspective on Code4Lib Conf 2008 (2008-03-04)

Code4lib: it's a small, well-run, tightly focused forum for programming efforts in support of
libraries. With a single track, primary talks limited to 20 minutes, and each day ending with an hour of 5-minute lightning talks, you can't help feeling in touch with what's going on in the library programming community. See <a href="http://code4lib.org/conference/2008/">http://code4lib.org/conference/2008/</a><br />The trend that stood out most for me was the number of projects implementing catalogs in one form or another. The software tools necessary to create a catalog are more robust and easier to use, a point borne out by our own project. Very little of the presentation time on these projects was spent bemoaning the difficulties of mastering the different metadata formats; the problems, whatever they are, seem secondary to the issues of getting real-time information on holdings, circulation status, etc. My sense is that the continuing movement towards keyword and faceted searching via SOLR/Lucene and similar engines is sufficient, in the minds of this group at least, to address concerns about how the quality of and differences in the underlying metadata affect the ability to search and present it. <br />Not that metadata issues are dead; two rather pointed talks on the ongoing RDA effort show that to be far from the case. My naïve view is that all metadata efforts would be immensely helped by, first, employing programmers on the metadata team up front and charging them with delivering, upon release of the specification, a reference implementation, any reference implementation. Secondly (and I can't quite believe I am saying this), more exuberant use of the specificity that XML can provide. As an example, Karen Coyle, one of the keynote speakers, described the inability to achieve closure on the issue of encoding the author and publisher when what appears on the cover page is wrong. Since this seems to be important, I say go for a little tag richness and provide an encoding for both pieces. 
What will make this work is that the programmer working on the reference implementation will need to know the preferred order for presenting this detail via DC or RSS, which is all that most of the world will ever see of it. It's the right solution: the information is not lost, and implementers will have a clue what the specification architects had in mind.<br />In short: great conference, lovely city, nice weather, a beer town of mythic proportions.

Discovery by a Thousand Facets (2008-01-05)

I have always thought myself to be a multi-faceted programmer; finally there is proof.<br /><br />The issue of combining facets has been on the table for quite some time now, and I have finally had some time to take a shot at it. The basic machinery underneath is now in place; I am not as convinced about the UI. This version has been installed on the Aquifer Portal Test System (soon to move to the production server), so try it and let me know what you think. <br />Now when you break down a search result screen and then select one of the displayed facets, it becomes a filter at the top of the facets sidebar. Further faceting and facet selection can add additional such filters, and the filters can be removed individually. This type of interface was borrowed at least partially from the Blacklight project. As suggested by Steve Toub, when a decade is selected the faceting window changes to year, suggesting at least some hierarchy there. When a filter is removed, the faceting window returns to the facet set from which the filter came; empirically that seemed the most logical behavior. <br />It might be possible to add a feature, via a button or something, that would allow selecting the NOT of a facet, i.e. filtering the search to only records that do not have a particular facet value. 
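In SOLR terms, the NOT of a facet would presumably come down to a filter query with a leading minus, which is SOLR's standard negation syntax. A minimal Ruby sketch of the idea; the `facet_filters` helper and the field and value names are all made up for illustration:

```ruby
# Sketch only: turn selected facets into SOLR filter-query strings.
# A selection is a [field, value, negated] triple; a negated selection
# gets the leading '-' that SOLR uses to exclude matching documents.
def facet_filters(selections)
  selections.map do |field, value, negated|
    "#{negated ? '-' : ''}#{field}:\"#{value}\""
  end
end

filters = facet_filters([['subject', 'History', false],
                         ['decade', '1900', true]])
puts filters.inspect
```

Each resulting string would then be handed to SOLR as an fq parameter, so a negated facet simply filters the result set the opposite way.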
I could try this if it seems like it would be a useful searching tool, but it might just be screen clutter.<br />One confusing aspect of faceting is that even when a particular facet is selected, the facet list continues to show other facets. This is because the records retrieved with the given filter most likely have other values in the facet list in addition to the one selected. Maybe this is obvious (it's pretty obvious to me), but this is the kind of issue I always want to check against the principle of least astonishment: if I get confused, users may be more so. But perhaps not.

Tag Administration (2008-01-02)

<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMTsyo4_v0pxINiH8pbFwKVdFpDPvMh2SNI_VdhCtRgXAMnVl8NFvYPMsjQ13wcILaWgLTNDDKLdQDikEHzFod3Pa-nRh1X9fGiolUkEJeE3mgOUKGubSFBAV4thHnhPy7NpY2J08y-Cxi/s1600-h/sunset25-2.jpg"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjMTsyo4_v0pxINiH8pbFwKVdFpDPvMh2SNI_VdhCtRgXAMnVl8NFvYPMsjQ13wcILaWgLTNDDKLdQDikEHzFod3Pa-nRh1X9fGiolUkEJeE3mgOUKGubSFBAV4thHnhPy7NpY2J08y-Cxi/s400/sunset25-2.jpg" alt="" id="BLOGGER_PHOTO_ID_5151033213845835810" border="0" /></a><br /><br />We'll start this off with the sunset on Christmas day. Nice, but pretty typical for this time of year. Ah yes, tag administration. I realized, as I was going through many of the problems Kat found in her latest testing, that most were due to lists inside the code not keeping up with decisions and changes made as the portal moves forward. After fixing a couple, for my <a href="http://m-w.com/dictionary/sake">sake</a> and the sake of future administrators, I went ahead and implemented a tag interface. 
Registered users who are administrators will have a new submenu option near logout that will take them to the tag administration subsystem. There are pages for adding, editing and deleting the field tags, and screens for controlling which tags display, and in what order, for the full and brief record views. There is one additional page listing the tags the mods-rubymods XSL transformation creates; some of these are for headings work and should not be included in displays, but it is nonetheless useful to see them without scanning through reams of XSL.<br />I have to say once again that Ruby on Rails made it quite easy to implement the drag and drop interface. The examples at script.aculo.us worked with minimal modification.

Only the coolest searching feature ever... (2007-12-17)

Have you ever been trying to find something using search and, no matter what terms you try, you end up with the same records over and over again? Frustrating! What if you could check a box that said, <span style="font-style: italic;">Only show me things you haven't shown me before</span>, or maybe something a little shorter, and just that would happen: each search would only show you things that you haven't seen before.<br />Well, now the Aquifer portal has it.<br />Registered users of <a href="http://www.dlfaquifer.org/">www.dlfaquifer.org</a> who have enabled <span style="font-style: italic;">Remember my searches and records</span> in their profile will see a new check box, <span style="font-style: italic;">Unseen only</span>, which, when checked, causes all records presented to be ones they have not seen before. This feature is a work in progress; I'm still taking feedback on it and still smoothing off the edges. 
The ability to set marks (temporal or otherwise) would make this feature even better, but that is currently just a gleam in the developer's eye. <br />A new look and feel for the user profile system is part of this release. In addition to their search history, users can see records they have seen as search results and records they have selected. A way of clearing this history has also been provided.<br /><br />Please try it. If you couldn't think of a reason to register before, now you can.

SRU Access ready for testing (2007-12-17)

SRU access to the Aquifer Portal is up and ready for testing. explain and searchRetrieve are working. Basic errors are caught and returned in a diagnostics element. The base address of the service is <a href="http://www.dlfaquifer.org/sru">http://www.dlfaquifer.org/sru</a>. Please test, and test again. <br /><br />I shook it down with some help from Ralph LeVan and an SRU test website he created (<a href="http://alcme.oclc.org/srw/SRUServerTester.html">http://alcme.oclc.org/srw/SRUServerTester.html</a>). Our service now passes most of the tests; we are failing a couple that have to do with error conditions I have not quite figured out how to detect. 
Instead we just return zero hits.<br /><br />The explain packet does not list all the potential indices yet, but I am holding off on that while Tom and I look into a more generic approach to the MODS-SOLR interaction.

Testing is good (2007-11-27)

<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCPi1iLeoPxJy9sgoP4tf9_iMm-R2iF46Gguc-Y1x4T8u6CgBhKiLAQPEijjB9zrvZS8eEY1UXicqG-uMgWDrgY17UBJXU02Rv34-C4Qa5RL74bKJBVkHnm9USGH5lGUuZnpj9wW6vkmZT/s1600-h/DSCN7710.JPG"><img style="float:left; margin:0 10px 10px 0;cursor:pointer; cursor:hand;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjCPi1iLeoPxJy9sgoP4tf9_iMm-R2iF46Gguc-Y1x4T8u6CgBhKiLAQPEijjB9zrvZS8eEY1UXicqG-uMgWDrgY17UBJXU02Rv34-C4Qa5RL74bKJBVkHnm9USGH5lGUuZnpj9wW6vkmZT/s400/DSCN7710.JPG" border="0" alt="" id="BLOGGER_PHOTO_ID_5137792366715831346" /></a><br />A lot of irons are in the fire right now, but after doing a medium-sized refactoring job this afternoon, I just have to say how happy I am to have been assimilated into Ruby on Rails unit testing. I know testing has been available on every system I have ever used, but I have never found it so easy to get off the dime, do the testing, and stick with it. I now do a significant portion of my debugging while developing my tests. 
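As an illustration of the habit, here is a minimal standalone test in the spirit described above. The portal's real suite is Rails Test::Unit; this sketch uses minitest (Test::Unit's modern descendant, bundled with current Rubies) and a made-up `xml_element` helper so it runs on its own:

```ruby
# A toy version of test-first XML generation: a trivial helper standing in
# for the Builder-based code, plus a test that pins down its behavior.
require 'minitest/autorun'

# hypothetical helper: wrap text in a named element
def xml_element(name, text)
  "<#{name}>#{text}</#{name}>"
end

class XmlElementTest < Minitest::Test
  def test_wraps_text_in_named_element
    assert_equal '<title>Saginaw</title>', xml_element('title', 'Saginaw')
  end
end
```

The point is less the helper than the rhythm: write the assertion, watch it fail, make it pass, and debug inside the test instead of the application.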
Having done a pretty good suite for the forthcoming Ruby CQL parser, converting the XML generation to use Builder went really fast, despite some vexing debugging problems in RadRails (gotta figure that out later).<br />The picture is the eastern sky I saw this morning.

Chick bangs his head against a Date (2007-10-04)

<a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ-p__1MoLIKNyTIt5MI9_3pIw-ZzRRlF5dR5E_Yua0UN69P-m7ZOiGtNCofH_MS1m1yJeRXOh3bof3WGXDCNavqwHP3Ojf39-SwjMRqBc1PkWpHG__Y1rnEM3toEof5_oejguT5iwISr7/s1600-h/IMG_0007.png"><img style="margin: 0pt 10px 10px 0pt; float: left; cursor: pointer;" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjZ-p__1MoLIKNyTIt5MI9_3pIw-ZzRRlF5dR5E_Yua0UN69P-m7ZOiGtNCofH_MS1m1yJeRXOh3bof3WGXDCNavqwHP3Ojf39-SwjMRqBc1PkWpHG__Y1rnEM3toEof5_oejguT5iwISr7/s200/IMG_0007.png" alt="" id="BLOGGER_PHOTO_ID_5117695215280871826" border="0" /></a><br />With apologies to Mr Twain, I have always said, "I'd sooner date two jealous psychopaths than one MODS record." Nothing lately has been making me feel any differently. The question is how to take the date information from a MODS record and get it indexed usefully into an acts_as_solr SOLR instance. Our approach is two-fold: translate the date information in the originInfo element, then run the result through the <a href="http://www.cdlib.org/inside/diglib/datenorm/">software package</a> developed by the <a href="http://www.cdlib.org/">CDL</a> that implements the erstwhile TEMPER date analysis standard.<br /><br />Here is an example: say I have a record dated 1900-1920. I want to index this in such a way that if a user chooses to find records between 1905 and 1910, it finds this one. So what do I do? 
I use the TEMPER software to expand the date range into a separate facet for each year. This wildly duplicates the records when faceted by year, but guarantees that a record will be found on any overlap with the desired date range.<br /><br />And you are probably asking how I got the date range. Take a look at this XSL snippet, which is taken from an XSL transform within the context of an originInfo element:<br /><br /><xsl:element name="doc_date"><br /> <xsl:for-each select="dateIssued|dateCreated|copyrightDate"><br /> <xsl:value-of select="."/><br /> <xsl:choose><br /> <xsl:when test="@point = 'start'"><br /> <xsl:text>-</xsl:text><br /> </xsl:when><br /> <xsl:otherwise><br /> <xsl:text> and </xsl:text><br /> </xsl:otherwise><br /> </xsl:choose><br /> </xsl:for-each><br /></xsl:element><br /><br />This code concatenates all desired date elements and even nicely adds a dash after a date with the point='start' attribute (indicating that it begins a range, hopefully followed by the ending date in the next element).<br />E.g.<br /><origininfo><br /><dateissued>[no date recorded on caption card]</dateissued><br /><dateissued encoding="marc" point="start">1900</dateissued><br /><dateissued encoding="marc" point="end">1920</dateissued><br /></origininfo><br /><br />becomes<br /><doc_date>[no date recorded on caption card] and 1900-1920 and</doc_date><br /><br />which is then nicely parsed into a range by the TEMPER software. I'd love to hear better theories on how to do this. Really! 
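For the record, the per-year expansion amounts to something like the following Ruby sketch. The real normalization is done by the CDL TEMPER package; `expand_year_range` and its regexes are hypothetical stand-ins:

```ruby
# Rough sketch: turn a normalized date string into one facet value per year,
# so a record dated 1900-1920 matches any query that overlaps that span.
def expand_year_range(date_string)
  case date_string
  when /(\d{4})\s*-\s*(\d{4})/   # a range like "1900-1920"
    ($1.to_i..$2.to_i).map { |y| y.to_s }
  when /(\d{4})/                 # a single four-digit year somewhere
    [$1]
  else
    []
  end
end

puts expand_year_range('[no date recorded on caption card] and 1900-1920 and').length
```

Each year then becomes its own facet value, which is exactly the duplication described above: wasteful, but it makes a 1905-1910 query find the 1900-1920 record.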
And I haven't even done all the tests on <span style="font-style: italic;">circa</span> yet.

Sorting and indexing (2007-08-07)

I have made a few changes to the portal, mostly adding support for date, title and score sorting, as well as a few minor wording changes per Katherine. I still have not had a chance to go over Advanced Searching as much as I would like; sometimes it seems to make sense, sometimes it does not.<br /><br />I am rebuilding the indexes with some date sorting, and I need to go over the dates used for sorting. Practically, I think we need to have one sortable date per record, and I need an algorithm for picking that date. I think the logic should be: go through the various date fields in the originInfo in some agreed-upon order, first looking for one with a keyDate attribute in w3cdtf encoding; failing that, look again through the list for any keyDate value; failing that, look again through the list for the first date.<br /><br />I am running the date I get from the above (actually, at the moment, just dateIssued) through a combination of a standard Java date parser and, failing that, the CDL temper library that attempts to normalize the date. Mostly from that I get some kind of date range, and I have arbitrarily chosen the middle date of that range as the sorting date for the record.<br /><br />Any comments on the above process would be appreciated.<br /><br />Another struggle was getting the acts_as_solr module to correctly sort the result sets. I finally found the solution in acts_methods.rb, where it documents that the configuration fields options passed to acts_as_solr can be an array of hashes. 
This finally got it working; prior to the change, everything was getting a '_t' tacked on to the end, which was breaking the date sorting. I still never got it to correctly order by score, but since sorting by score is the default, in that case I just don't pass an :order specification when I call find_by_solr.

Headings (2007-08-04)

I met with Katherine and Kalika from Citrus design and had a good meeting; I am anxious to see the result of that. Ere that happens, I press on. I spent the better part of Thursday working on user and login registration functionality, only to rip it out again in frustration after doing more research on my options. There are several plugins and gems out there that I think bear more research before proceeding, including one which would give us OpenID functionality.<br />Instead today, triggered by some conversations I had with Katherine and Jerry yesterday, I started working on some analysis of the subject headings. I believe that having some sort of hierarchical view of the collection is critical to our SEO aspirations, and I think the subject headings and dates are probably the lowest-threshold methodologies to get us there. To that end I produced a table of subject headings and a primitive viewer. It's useful in that it helped me find a few more small problems with the JSON XSL transform I am using, and a number of configuration issues that remain with the subject headings, such as the lack of a unifying subject field that collects the various specific subject_topic fields, etc. Also, the normalizations required to make the headings useful should be improved and, more importantly, to follow my latest project mantra, made transparent. 
I am thinking right now of a list of regular expressions that can be viewed, and maybe a packager of some type as well that can make a collection of transforms to be applied to a given field. It's a nice idea anyway.<br />I'll have the headings work on the server Monday.

Cleaning up, moving forward (2007-08-01)

Found a few things wrong with the XSL now that I can look at the site more efficiently. All subjects of a given type were being concatenated into one field, and there was also a spurious name field being generated when the title contained a nonSort element. Those have been fixed and the server is being rebuilt.<br />I'm working on a registration and login system now. I'd be farther along and have more time to spend on it if I hadn't mistyped something; every time I hit the application I'd get<br /><pre>We're sorry, but something went wrong.<br />We've been notified about this issue and we'll take a look at it shortly.</pre>and there was nothing I could do. I lost undo after an Aptana restart, so I was about to pull all the code I had done. I think the problem was an end statement that was mistyped as ens or enS, but man, I could have used a better diagnostic than that. So it goes.

Record display (2007-07-31)

Yesterday was a plug-in deluge. Every step of the way I was trying to figure out how to do it someone else's way. 
So today, after a morning spent catching the server up with my laptop and rebuilding all its indexes and everything, it was a pleasure rendering records, just experiencing the joys of render :partial and simple programming.<br />A lot of questions came up though. How should fields germane to 'Search by selecting hyperlinked field values' be normalized? And what should they look like? I went ahead and did this for subjects, but records that are nothing but links start looking pretty ugly. Perhaps this is just a CSS issue.<br />And what about record-specific searches? Right now, I've gone ahead and done a briefer record display on a browse screen and a full record display, and a MODS display for that matter.<br />I feel I am finally breaking through on the portal.<br />More tomorrow.

Portal Round 1 (2007-07-30)

I worked today on bringing up the portal. I started from scratch with the two Rails plug-ins acts_as_solr and will_paginate. I installed them locally on my laptop using the RadRails plugin installation and loaded them as svn externals.<br />I have elected to use the solr distribution included with the acts_as_solr plugin, although I cannot seem to run it from the rake task within the IDE. To run it I have to find the <span style="font-style: italic;">vendor/plugins/acts_as_solr/branches/release_0.9/solr</span> directory within my Aptana workspace and then manually run the instance with <span style="font-style: italic;">java -Djetty.port=8982 -jar start.jar</span>, which starts the Jetty servlet container at the port that acts_as_solr deems the default for the development environment. <br />I retooled the mods-solr.xsl to use solr-style field designations and to take advantage of the dynamic field names. 
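The dynamic-field convention boils down to a type suffix on the field name that a solr dynamicField pattern matches. A toy Ruby sketch of the naming side; the '_t' text suffix is the one acts_as_solr actually produces, while the other suffixes in this map are assumptions:

```ruby
# Map a logical field kind to the suffix a solr dynamicField pattern
# (e.g. <dynamicField name="*_t" .../>) would match. Only '_t' is taken
# from acts_as_solr behavior; '_facet' and '_dt' are illustrative.
SUFFIXES = { :text => '_t', :facet => '_facet', :date => '_dt' }

def dynamic_field(name, kind)
  "#{name}#{SUFFIXES[kind]}"
end

puts dynamic_field('title', :text)
puts dynamic_field('subject', :facet)
```

The appeal is that the schema never has to be touched when a new field appears: the suffix alone tells solr how to analyze it.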
Since I was thinking about it, I re-did the JSON field names to follow similar conventions.<br /><br />I spent most of the day banging my head against the following problems:<br /><ol><li>The little I could find on integrating acts_as_solr and will_paginate did not seem to work. I eventually settled on using WillPaginate::Collection, which seems to work pretty well.</li><li>Sorting result sets with acts_as_solr is beyond my googling. I still don't have it working, but have learned a lot about the issue in the meantime. <br /></li><ol><li>Syntactically the format is :order => "<fieldname> <asc|desc>", the discovery of which took me deep within the bowels of acts_as_solr. Errors about a "title_t".keys method not being found required the full tour.<br /></li><li>acts_as_solr, as of release 0.9, does not use the standard method for declaring sorting.<br /></li><li>After reading up on Lucene, sorting of strings pretty much requires the existence of an untokenized field. This led me to the creation of a title_sort field of type alphaSortOnly. This is fine, but it required that I alter the supplied schema.xml, which causes me problems with svn since it is down in an svn externals directory. -- Also worth noting: adding the untokenized form of the title added an eyeball estimate of ~20% to my runtime when indexing records. This should be pursued.<br /></li><li>The date sorting will require either a normalized UTC date or a string that we manufacture. 
Since we are plucking this data out of the MODS XML, it may require a bit more.</li><li>Either way, the sort still did not give me the expected results. I'm not fretting yet; I got pretty far today, so I'll just have to dig some more.</li></ol><li>I just started looking at facets, but I can imagine that will be pretty exciting as well.<br /></li></ol>

Bring up the portal (2007-07-30)

I spent the last couple of days playing with acts_as_solr, the Ruby plugin for solr. AAS relies upon the dynamic field capabilities of solr, which I ignored in my initial indexing work. Although I am still not convinced that I want to use it, AAS holds the promise of giving me faceted searching for very little work. The biggest problem with AAS is that it wants to do all my indexing for me, which may be fine in the long run; right now indexing is done by a Java program. I am still not satisfied with the performance of Ruby's XSL transform engines, though I have still not adequately tested the libxml code, mostly because I still can't get it to build on my Windows laptop (it took a while to get installed on Ubuntu as well).

Deleting duplicate rows in mysql (2007-07-26)

I am posting this as a public service to myself on how to delete duplicate rows from a mysql table.<br />In my case I have a unique column id on each row. 
And I have an indexed column, call it match, that I use to determine duplication.<br />In short, so I can find it next time, what I do is<br /><span style="font-style: italic;"><br />delete t2 from table1 as t1, table1 as t2 where t1.match = t2.match and t2.id > t1.id;</span><br /><br />This works and takes about 2-4 minutes on a table with about 280K rows and 35K duplicates.<br /><br />I've only tested this with MyISAM tables. I know this can be done other ways that can be gleaned from the mysql documentation, but this way, when I can remember it, seems simplest.Chickhttp://www.blogger.com/profile/08588519330476373054noreply@blogger.com4tag:blogger.com,1999:blog-7257416618848384204.post-21326787544844434062007-07-18T12:56:00.001-07:002007-07-30T21:21:18.243-07:00An update on Indexing MODS records with SOLRMy goal in the Aquifer project is to make the machinery behind the indexing as transparent as possible. Indexing is almost invariably a mapping between the source data and the index representation. Here, MODS records are mapped to a SOLR input schema using an XSL transform, mods-solr.xsl (visible <a href="http://dlf-aquifer.svn.sourceforge.net/viewvc/dlf-aquifer/java/trunk/mods-solr.xsl?view=markup">here</a>). The complexity of the MODS document is reduced to a list of fields, such as <span style="font-style: italic;">title, name, subject, etc</span>. Many sub-elements may be concatenated together to create these fields.<br />Here is an example of what goes into SOLR<br /><br /><add><br /><doc><br /><field name="id">1</field><br /><field name="title">History of the Saginaw Valley,; its resources, progress and business interests</field><br /><field name="name">Fox, Truman B</field><br /><field name="type-of-resource">text</field><br /><field name="origin">, Daily Courier Steam Job Print1868</field><br /><field name="language"></field><br /><field name="physical-description">text/xml;image/tiff; 80 p. 20 cm.; reformatted digital</field><br /><field name="note">Advertising matter interspersed.</field><br /><field name="subject-topic">History</field><br /><field name="subject-geographic">Saginaw River Valley (Mich.)</field><br /><field name="url">http://name.umdl.umich.edu/0885076</field><br /><field name="access-condition">Where applicable, subject to copyright. Other restrictions on distribution may apply. Please go to http://www.umdl.umich.edu/ for more information.</field><br /></doc><br /></add><br /><br />There are problems with some of the fields having extraneous commas from concatenation rules, and more interesting problems like whether fields like <span style="font-style: italic;">access-condition</span> should be indexed at all. But at least with the XSL, the rules are not buried within Java or some other language.<br /><br />Further details on how SOLR uses this input are controlled via its schema.xml file (visible <a href="http://dlf-aquifer.svn.sourceforge.net/viewvc/dlf-aquifer/java/trunk/schema.xml?view=markup">here</a>).<br />The fields created by the transform must be defined, as well as copy information, which allows fields to be indexed in more than one way. For example the line<br /><br /><copyField source="subject-hierarchic" dest="subject"/><br /><br />causes a hierarchic subject to also be indexed as a generic subject. More lines like this can cause all text to be indexed under a single default field, which is used when users do not specify a field in a query. 
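For illustration, a handful of such copy rules might look like this in a SOLR schema.xml. The source field names echo the post, but the "text" catch-all and these exact lines are a sketch, not the project's actual schema:

```xml
<!-- sketch only: route specialized subject fields into a generic one -->
<copyField source="subject-topic" dest="subject"/>
<copyField source="subject-geographic" dest="subject"/>
<!-- funnel major fields into a hypothetical catch-all default field -->
<copyField source="title" dest="text"/>
<copyField source="subject" dest="text"/>
```

With a default field like "text" declared in the query handler, a user query with no field prefix searches everything copied into it.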
Additional configuration can control how fields are parsed and how sentences and punctuation are handled on a document or field level.<br /><br />These files need more work but they are not a bad starting point.<br />Chickhttp://www.blogger.com/profile/08588519330476373054noreply@blogger.com0tag:blogger.com,1999:blog-7257416618848384204.post-54538671541337040772007-07-06T00:30:00.001-07:002007-07-30T21:20:46.924-07:00Indexing crackedAfter a couple of good days of wrestling with indexing using java and solr, I have gotten the recalcitrant beast working. More needs to be done as always, but now I think all the pieces are in place and we can get to the real work. Indexing configuration needs a thorough once over. I will take a crack at it tomorrow, but hopefully everyone will chime in with some ideas. As I am on vacation next week, maybe they'll have some good ideas ready when I get back. SOLR seems more than willing to handle a lot of fields and our faceting desires; we just need to set it up coherently.Chickhttp://www.blogger.com/profile/08588519330476373054noreply@blogger.com0tag:blogger.com,1999:blog-7257416618848384204.post-58705149717139583702007-06-28T23:46:00.000-07:002007-07-30T21:20:26.025-07:00Status at the coreNot a bad time for a status update, since Tom and I spent a good portion of the day going over it.<br />While there remain a few decisions to be made, we have a general outline; this is pretty much the DLXS system<br />with different storage and languages for the various processing steps. See the graph on the wiki at<br /><br />http://wiki.dlib.indiana.edu/confluence/display/DLFAquiferCore/Basic+Data+Flow<br /><br /><ol><li>Crawling collections<br />I have tested crawling with the DLXS harvester and a Ruby harvester; both are easy to implement and seem to work well. 
I have worked with Java based versions in the past; it seems that most of the work on them was done in the past also. The ruby based implementation is very succinct and has been run on 100K records for testing. At the moment it seems reliable, if a little bit slow, but it was extremely easy to integrate with the RubyOnRails application to deposit the xml straight into mysql.</li><li>RawXml<br />Storage is in a text column in a table. Initially I stored only the metadata section; subsequent work makes me want to reconsider that and store the envelope as well. Ruby scaffolding for this table is complete.<br /> </li><li>CookedXml<br />Seems like there is some case to be made for doing an initial pre-processing step over the data. This will likely consist of generic and specific translations and mappings employing both XSL and regular expressions, plus other subsidiary code such as date mapping packages. Kat feels that this is a critical part of portal quality. More work needs to be done in ruby on the table for this, but it should be the same as RawXml.<br /> </li><li>Indexing<br />SOLR has been installed and is in initial testing. It seems like a nice simple wrapper around the lucene indexing facility. Tom has some reservations about deeply nested indexed fields; I am more optimistic. It seems like it will be simple to translate the DLXS XSL transforms into a MODS-to-SOLR template. This portion will be written in Java; I've run the tutorial and it seems fine.<br /> </li><li>Pre-rendered data<br />I am of the opinion that the ruby side should have access to pre-processed mods records. I have been doing some experiments mapping to something resembling the DLXS mods-as-bibclass model. I have written test code that saves the preliminary format into JSON and reads that back into a Ruby object. I need to start some timing tests. It may be that this data is only for ruby, in which case it would be quicker and simpler to use Ruby object serialization. 
But since I will most likely be generating this data from Java and XSL, it is easiest to use a more neutral format like JSON. The serialized data would include linked objects that implement the Asset Actions.</li></ol> We have a few issues.<br /><ol><li>Ruby XML handling through the REXML package seems too slow, but just tonight Tom managed to load a package based on libxml2 which purports to bring performance gains of 33 to 200 times. I have verified that the new code runs but have not yet had a chance to verify the performance.<br /> </li><li>We talked today about nested mods records and how best to handle them. We are interested in the immediate likelihood of seeing data of this form.</li><li>I have been experimenting with a number of XSL transformations of my favorite kind (written by someone else). I have encountered both 1.0 and 2.0 transforms. This does not seem to be a problem but others may know more.</li></ol>Chickhttp://www.blogger.com/profile/08588519330476373054noreply@blogger.com0tag:blogger.com,1999:blog-7257416618848384204.post-14849391837766700832007-06-20T17:21:00.000-07:002007-06-20T22:40:42.406-07:00Harvesting and mappingIt's hard to proceed without data, so yesterday and today were devoted to a simple ruby harvester. It is up and running and, right now, painfully slow. It was also tested on sites that had no errors, so robustness has not been broached. The good news is I can move on to indexing and display. The next specific task is to create an intermediate form of the mods record. This will be very similar to the BibClass implementation of MODS. Parsing the MODS data at runtime every time we want to show or index a record will be just too much overhead. I will probably go with either a simple ruby or json serialization of an array of arrays of fields. There will be mapping tables and filters that will translate between the various MODS sources and this layout. 
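A minimal sketch of the JSON round-trip described above, assuming the intermediate form is an array of [fieldname, value] pairs; the field names and values here are illustrative, not the real MODS mapping:

```ruby
require 'json'

# hypothetical intermediate record: an array of [fieldname, value] arrays,
# the kind of flattened layout a MODS mapping step might emit
record = [
  ['title',   'History of the Saginaw Valley'],
  ['name',    'Fox, Truman B'],
  ['subject', 'History'],
  ['subject', 'Saginaw River Valley (Mich.)']
]

# serialize once at preprocess time...
serialized = record.to_json

# ...and read it back cheaply whenever we need to show or index the record
fields = JSON.parse(serialized)
fields.each { |name, value| puts "#{name}: #{value}" }
```

Because JSON is language-neutral, the same serialized blob could be mined from Java or Perl without any ruby-specific deserialization.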
I think I will put these in the database so administrators can try applying various ones to new sources as they appear. I am definitely leaning towards json as the serialization method because it will make it much easier for other applications written in other languages to mine the data directly.<br /><br />The biggest trick to getting the harvester working was divining how the resumption_token was used, and then figuring out that a call to list_records with that token cannot be accompanied by any other arguments.<br /><br /><pre><br />res = client.list_records( :metadata_prefix => 'mods' )<br />return if res == nil<br /><br />entries = res.entries<br />while true do<br />  entries.each do |entry|<br />    puts entry.header.identifier<br />    save_record entry<br />  end<br />  puts "done with this batch"<br /><br />  token = res.resumption_token<br />  puts "token=#{token}"<br /><br />  res = client.list_records( :resumption_token => token )<br />  break if res == nil<br />  entries = res.entries<br />end<br /></pre>I have a problem that I did not figure out on this pass. It seems that getting the entries with entries = res.entries takes as long as the list_records call, making me think that I am fetching twice. 
I could just iterate over list_records calls, but then I am not sure the resumption_token would be where I need it. A bit of poking will probably reveal all.Chickhttp://www.blogger.com/profile/08588519330476373054noreply@blogger.com1tag:blogger.com,1999:blog-7257416618848384204.post-1577375555887175002007-06-18T22:16:00.000-07:002007-06-18T22:19:42.230-07:00Coding is underwayI think we had a great meeting at Michigan. Tom and I are going to begin working on the portal website immediately. While everything is subject to change at this point, we have decided to go ahead and write the web stuff using ruby on rails. We may use bits and pieces, so there will probably be Perl and Java code floating around in there too, but at the moment we are sold on the sirens' song of agile development. The honeymoon is far from over.Chickhttp://www.blogger.com/profile/08588519330476373054noreply@blogger.com0
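Circling back to the harvester loop in the post above: the loop can be restructured so that each response's entries are read exactly once, with the exit condition being the absence of a resumption token rather than a nil response. This is only a sketch; FakeClient below is a hypothetical stand-in for the real OAI client, assumed to expose the same list_records interface shown in the post:

```ruby
# Hypothetical stand-in for the OAI client from the post: list_records
# returns a response exposing .entries and .resumption_token.
class FakeClient
  Header   = Struct.new(:identifier)
  Entry    = Struct.new(:header)
  Response = Struct.new(:entries, :resumption_token)

  def initialize(pages)
    @pages = pages  # maps a token (nil for the first page) to [ids, next_token]
  end

  def list_records(opts)
    ids, next_token = @pages[opts[:resumption_token]]
    Response.new(ids.map { |id| Entry.new(Header.new(id)) }, next_token)
  end
end

# Harvest loop: each response's entries are touched exactly once, and the
# loop stops when no resumption token comes back (end of harvest).
def harvest(client)
  harvested = []
  res = client.list_records(:metadata_prefix => 'mods')
  while res
    res.entries.each { |entry| harvested << entry.header.identifier }  # save_record would go here
    token = res.resumption_token
    break if token.nil? || token.empty?
    res = client.list_records(:resumption_token => token)
  end
  harvested
end

client = FakeClient.new(nil => [%w[id1 id2], 't1'], 't1' => [%w[id3], nil])
puts harvest(client).inspect  # → ["id1", "id2", "id3"]
```

Wrapping the real client in a loop like this also makes it easy to time the list_records call and the entries traversal separately, which would confirm or refute the double-fetch suspicion.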