Wednesday, June 20, 2007

Harvesting and mapping

It's hard to proceed without data, so yesterday and today were devoted to a simple Ruby harvester. It is up and running and, right now, painfully slow. It has also only been tested on sites that returned no errors, so robustness has not been broached. The good news is I can move on to indexing and display.

The next specific task is to create an intermediate form of the MODS record. This will be very similar to the BibClass implementation of MODS. Parsing the MODS data at runtime every time we want to show or index a record would be just too much overhead. I will probably go with either a simple Ruby or JSON serialization of an array of arrays of fields. There will be mapping tables and filters that translate between the various MODS sources and this layout. I think I will put these in the database so administrators can try applying various ones to new sources as they appear. I am definitely leaning towards JSON as the serialization method because it will make it much easier for applications written in other languages to mine the data directly.
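To make the array-of-arrays idea concrete, here is a minimal sketch of what such an intermediate form could look like. The mapping table, the element paths, the field names and the to_intermediate helper are all invented for illustration; the real layout and the BibClass-style field set are not decided yet, and the MODS parsing step itself is left out.

```ruby
require 'json'

# Hypothetical mapping table: source-specific MODS element paths on the left,
# field names in the common layout on the right. All names are made up.
MAPPING = {
  'titleInfo/title'       => 'title',
  'name/namePart'         => 'author',
  'originInfo/dateIssued' => 'date'
}

# Translate already-extracted [source_path, value] pairs into the common
# layout, dropping anything the mapping table does not know about.
def to_intermediate(pairs, mapping)
  pairs.map { |path, value| [mapping[path], value] }
       .reject { |field, _value| field.nil? }
end

# Sample input standing in for values pulled out of one MODS record.
extracted = [
  ['titleInfo/title', 'Harvesting and mapping'],
  ['name/namePart', 'A. Author'],
  ['name/namePart', 'B. Author'],
  ['physicalDescription/extent', '12 p.']
]

record = to_intermediate(extracted, MAPPING)

# Serialize once at harvest time; display and indexing read the JSON back.
json = JSON.generate(record)
puts json

restored = JSON.parse(json)
```

An array of [field, value] pairs, rather than a hash, keeps repeated fields (two authors here) and their order without any extra structure, and the JSON that comes out is readable from any language.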

The biggest trick to getting the harvester working was divining how the resumption_token is used, and then figuring out that a resumption token passed to list_records cannot be accompanied by any other arguments, such as the metadata prefix.

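    # First request: ask for every record in MODS format.
    res = client.list_records( :metadata_prefix => 'mods' )
    return if res == nil

    entries = res.entries
    while true do
      # Save every record in the current batch.
      entries.each do |entry|
        puts entry.header.identifier
        save_record entry
      end
      puts "done with batch"

      # A missing or empty token means there are no more batches.
      token = res.resumption_token
      puts "token=#{token}"
      break if token.nil? || token.to_s.empty?

      # Follow-up request: the token must be the only argument.
      res = client.list_records( :resumption_token => token )
      break if res == nil
      entries = res.entries
    end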

I have a problem that I did not figure out on this pass. It seems that getting the entries with entries = res.entries takes as long as the list_records call itself, making me think that I am fetching twice. I could just iterate over the list_records result directly, but then I am not sure the resumption_token will be where I need it. A bit of poking will probably reveal all.
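One way to poke at this without touching the network is a small stand-in for the client. The sketch below assumes the response object is Enumerable and carries its resumption token as a plain attribute; that is an assumption about the OAI library, not something verified here. FakeResponse, FakeClient and harvest are hypothetical names made up for the experiment. It shows the loop iterating over the response directly, with no separate entries array, and reading the token afterwards. It does not say where the time actually goes in the real harvester.

```ruby
# Stand-in for the harvester's response object (hypothetical). It is
# Enumerable and exposes a resumption token, as the real one is assumed to.
class FakeResponse
  include Enumerable
  attr_reader :resumption_token

  def initialize(ids, token)
    @ids = ids
    @resumption_token = token
  end

  def each(&block)
    @ids.each(&block)
  end
end

# Stand-in client (hypothetical) that serves two batches and then stops.
class FakeClient
  def list_records(opts)
    if opts[:resumption_token] == 'batch-2'
      FakeResponse.new(%w[oai:x:3], nil)
    else
      FakeResponse.new(%w[oai:x:1 oai:x:2], 'batch-2')
    end
  end
end

def harvest(client)
  seen = []
  res = client.list_records(:metadata_prefix => 'mods')
  while res
    # Single pass over the response itself, no intermediate entries array.
    res.each { |id| seen << id }

    # The token is an attribute of the response, so it is still readable
    # after iterating.
    token = res.resumption_token
    break if token.nil? || token.to_s.empty?

    # The token travels alone on follow-up requests.
    res = client.list_records(:resumption_token => token)
  end
  seen
end

harvested = harvest(FakeClient.new)
puts harvested.inspect
```

If the real response behaves the same way, timing res.each against the list_records call should show whether the slow step is the request or the walk over the records.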

1 comment:

Kat said...

I thought I was going to send you any data you needed in the meantime. I was under the impression that you wanted a zipped file of the records until we can make them harvestable by you (sooner than planned). I am a bit confused as to why you're starting with the harvester, then.