Tuesday, March 25, 2008

How do you know it's a country

I wanted to know whether a string contains a country name. So I got a list of countries; there are plenty out there, though finding one that contains England and names like that is hard. Anyway, I thought to myself, I'll just build a nice regular expression that looks like /country1|country2|.../ and check it against my string. Out of curiosity I wondered how much faster that would be than having an array of regular expressions and comparing each one against the string. My profound respect for regular expressions made me think some sort of clever tree would be compiled and the search would be quite fast. So I wrote a quick test in Ruby and found, much to my surprise, that the single regular expression technique was about 4 times slower. I haven't pursued it further (I went with the array method), but here is the code in case someone can tell me what I did wrong.


# simple class to determine whether it's quicker to decide if a string has a country
# with a regex for each country or whether to build a single regex that uses 'or'
# to list every country
# could be missing something
# conclusion: much to my surprise array of regexes seems to be about 4 times faster
class RegexpTest
# this list is from http://www.iso.org/iso/iso3166_en_code_lists.txt
# with a few alterations
@@countries = [ 'AFGHANISTAN', 'ÅLAND ISLANDS', 'ALBANIA', 'ALGERIA', 'AMERICAN SAMOA', 'ANDORRA', 'ANGOLA',
'ANGUILLA', 'ANTARCTICA', 'ANTIGUA AND BARBUDA', 'ARGENTINA', 'ARMENIA', 'ARUBA', 'AUSTRALIA',
'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH', 'BARBADOS', 'BELARUS', 'BELGIUM',
'BELIZE', 'BENIN', 'BERMUDA', 'BHUTAN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA',
'BOUVET ISLAND', 'BRAZIL', 'BRITISH INDIAN OCEAN TERRITORY', 'BRUNEI DARUSSALAM', 'BULGARIA',
'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE', 'CAYMAN ISLANDS',
'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA', 'CHRISTMAS ISLAND',
'COCOS (KEELING) ISLANDS', 'COLOMBIA', 'COMOROS', 'CONGO', 'CONGO, THE DEMOCRATIC REPUBLIC OF THE',
'COOK ISLANDS', 'COSTA RICA', 'CÔTE D\'IVOIRE', 'CROATIA', 'CUBA', 'CYPRUS', 'CZECH REPUBLIC',
'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT', 'EL SALVADOR',
'ENGLAND', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA', 'ETHIOPIA', 'FALKLAND ISLANDS (MALVINAS)',
'FAROE ISLANDS', 'FIJI', 'FINLAND', 'FRANCE', 'FRENCH GUIANA', 'FRENCH POLYNESIA',
'FRENCH SOUTHERN TERRITORIES', 'GABON', 'GAMBIA', 'GEORGIA', 'GERMANY', 'GHANA', 'GIBRALTAR',
'GREECE', 'GREENLAND', 'GRENADA', 'GUADELOUPE', 'GUAM', 'GUATEMALA', 'GUERNSEY', 'GUINEA',
'GUINEA-BISSAU', 'GUYANA', 'HAITI', 'HEARD ISLAND AND MCDONALD ISLANDS',
'HOLY SEE (VATICAN CITY STATE)', 'HONDURAS', 'HONG KONG', 'HUNGARY', 'ICELAND', 'INDIA',
'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISLE OF MAN', 'ISRAEL', 'ITALY', 'JAMAICA', 'JAPAN',
'JERSEY', 'JORDAN', 'KAZAKHSTAN', 'KENYA', 'KIRIBATI', 'KOREA', 'KOREA, REPUBLIC OF', 'KUWAIT',
'KYRGYZSTAN', 'LAO PEOPLE\'S DEMOCRATIC REPUBLIC', 'LATVIA', 'LEBANON', 'LESOTHO', 'LIBERIA',
'LIBYAN ARAB JAMAHIRIYA', 'LIECHTENSTEIN', 'LITHUANIA', 'LUXEMBOURG', 'MACAO',
'MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF', 'MADAGASCAR', 'MALAWI', 'MALAYSIA', 'MALDIVES',
'MALI', 'MALTA', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAURITANIA', 'MAURITIUS', 'MAYOTTE', 'MEXICO',
'MICRONESIA, FEDERATED STATES OF', 'MOLDOVA, REPUBLIC OF', 'MONACO', 'MONGOLIA', 'MONTENEGRO',
'MONTSERRAT', 'MOROCCO', 'MOZAMBIQUE', 'MYANMAR', 'NAMIBIA', 'NAURU', 'NEPAL', 'NETHERLANDS',
'NETHERLANDS ANTILLES', 'NEW CALEDONIA', 'NEW ZEALAND', 'NICARAGUA', 'NIGER', 'NIGERIA', 'NIUE',
'NORFOLK ISLAND', 'NORTHERN MARIANA ISLANDS', 'NORWAY', 'OMAN', 'PAKISTAN', 'PALAU',
'PALESTINIAN TERRITORY, OCCUPIED', 'PANAMA', 'PAPUA NEW GUINEA', 'PARAGUAY', 'PERU', 'PHILIPPINES',
'PITCAIRN', 'POLAND', 'PORTUGAL', 'PUERTO RICO', 'QATAR', 'REUNION', 'ROMANIA', 'RUSSIAN FEDERATION',
'RWANDA', 'SAINT BARTHÉLEMY', 'SAINT HELENA', 'SAINT KITTS AND NEVIS', 'SAINT LUCIA', 'SAINT MARTIN',
'SAINT PIERRE AND MIQUELON', 'SAINT VINCENT AND THE GRENADINES', 'SAMOA', 'SAN MARINO',
'SAO TOME AND PRINCIPE', 'SAUDI ARABIA', 'SENEGAL', 'SERBIA', 'SEYCHELLES', 'SIERRA LEONE',
'SINGAPORE', 'SCOTLAND', 'SLOVAKIA', 'SLOVENIA', 'SOLOMON ISLANDS', 'SOMALIA', 'SOUTH AFRICA',
'SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 'SPAIN', 'SRI LANKA', 'SUDAN', 'SURINAME',
'SVALBARD AND JAN MAYEN', 'SWAZILAND', 'SWEDEN', 'SWITZERLAND', 'SYRIAN ARAB REPUBLIC',
'TAIWAN, PROVINCE OF CHINA', 'TAJIKISTAN', 'TANZANIA, UNITED REPUBLIC OF', 'THAILAND', 'TIMOR-LESTE',
'TOGO', 'TOKELAU', 'TONGA', 'TRINIDAD AND TOBAGO', 'TUNISIA', 'TURKEY', 'TURKMENISTAN',
'TURKS AND CAICOS ISLANDS', 'TUVALU', 'UGANDA', 'UKRAINE', 'UNITED ARAB EMIRATES', 'UNITED KINGDOM',
'UNITED STATES', 'UNITED STATES MINOR OUTLYING ISLANDS', 'URUGUAY', 'UZBEKISTAN', 'VANUATU',
'VENEZUELA', 'VIET NAM', 'VIRGIN ISLANDS, BRITISH', 'VIRGIN ISLANDS, U.S.', 'WALLIS AND FUTUNA',
'WESTERN SAHARA', 'YEMEN', 'ZAMBIA', 'ZIMBABWE']

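  # NOTE: compare uses the two class variables below, but the listing as
  # posted never defines them. These definitions are a reconstruction;
  # Regexp.escape is an assumption that keeps '(' and '.' in names literal.
  # one regex per country
  @@country_regexes = @@countries.map { |c| Regexp.new( Regexp.escape( c ) ) }
  # a single regex that uses 'or' to list every country
  @@country_regex = Regexp.new( @@countries.map { |c| Regexp.escape( c ) }.join( '|' ) )

  # generate n random strings, about half of which are country names
  def RegexpTest.rand_string( n )
    a = Array.new
    n.times do
      s = ''
      if rand( 2 ) > 0
        s = @@countries[ rand( @@countries.length ) ]
      else
        # otherwise build 10 to 109 random uppercase letters
        x = rand(100)+10
        x.times do
          s << (rand(26)+65).chr # 65 is 'A'; the original 64 started at '@'
        end
      end
      a << s
    end
    a
  end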

  # time both approaches over the same 10000 random strings
  def RegexpTest.compare
    strings = RegexpTest.rand_string( 10000 )

    # approach 1: try each country's regex in turn, stop at the first match
    found = 0
    timer1 = Time.now
    strings.each do |string|
      @@country_regexes.each do |re|
        if re =~ string
          found += 1
          break
        end
      end
    end
    timer2 = Time.now
    delta = timer2 - timer1

    puts "array of regexes found=#{found} in #{delta} seconds"

    # approach 2: one regex that alternates over every country
    found = 0
    timer1 = Time.now
    strings.each do |string|
      found += 1 if @@country_regex =~ string
    end
    timer2 = Time.now

    delta = timer2 - timer1
    puts "regex with big or found=#{found} in #{delta} seconds"
  end
end

RegexpTest.compare
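
For anyone who wants to rerun the comparison on a smaller scale, here is a rough sketch using Ruby's standard Benchmark module and Regexp.union. The five-name sample list, the test strings and the repetition count are assumptions made up for illustration, not the data used above. Timings depend on the Ruby version and the machine, so no result is claimed here.

```ruby
require 'benchmark'

# Count strings matched by at least one regex in the array.
def count_with_array(strings, regexes)
  strings.count { |s| regexes.any? { |re| re =~ s } }
end

# Count strings matched by the single alternation regex.
def count_with_union(strings, regex)
  strings.count { |s| regex =~ s }
end

# Hypothetical sample; substitute the full country list for a real comparison.
sample = ['ALBANIA', 'CHAD', 'COCOS (KEELING) ISLANDS', 'ENGLAND', 'ZIMBABWE']

# One regex per name; Regexp.escape keeps '(' and ')' literal.
separate = sample.map { |c| Regexp.new(Regexp.escape(c)) }
# One alternation; Regexp.union escapes plain strings itself.
combined = Regexp.union(sample)

# Made-up test strings: half contain a sample name, half do not.
strings = ['VISIT ENGLAND SOON', 'QWERTYUIOP', 'COCOS (KEELING) ISLANDS', 'XYZZY'] * 2500

array_found = union_found = nil
array_time = Benchmark.realtime { array_found = count_with_array(strings, separate) }
union_time = Benchmark.realtime { union_found = count_with_union(strings, combined) }

puts "array of regexes found=#{array_found} in #{array_time} seconds"
puts "regex with big or found=#{union_found} in #{union_time} seconds"
```

Both counts should agree; if they don't, the two patterns are not matching the same things and the timing comparison is not meaningful.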

Tuesday, March 4, 2008

An Aquifer Perspective on Code4Lib Conf 2008

Code4lib
It's a small, well-run, tightly focused forum for programming efforts in support of libraries. With a single track, primary talks limited to 20 minutes, and each day ending with an hour of 5-minute lightning talks, you can't help feeling in touch with what's going on in the library programming community. See http://code4lib.org/conference/2008/
The trend that stood out most for me is the number of projects implementing catalogs in one form or another. The software tools necessary to create a catalog are more robust and easier to use, as borne out by our own project. Very little of the presentation time on these things was spent bemoaning the difficulties of mastering the different metadata formats; those problems, whatever they are, seem secondary to the issues of getting real-time information on holdings, circulation status, etc. My sense is that the continuing movement towards keyword and faceted searching via SOLR/Lucene and similar engines is sufficient, in the minds of this group at least, to address concerns about how the quality and differences of the underlying metadata affect the ability to search and present it.
Not that metadata issues are dead: two rather pointed talks on the ongoing RDA effort show that to be far from the case. My naïve view is that all metadata efforts would be immensely helped by employing programmers on the metadata team up front and charging them with delivering, upon release of the specification, a reference implementation, any reference implementation. Secondly (and I can't quite believe I am saying this), they would be helped by more exuberant use of the specificity that XML can provide. As an example, Karen Coyle, one of the keynote speakers, described the inability to achieve closure on the issue of encoding the author and publisher when what appears on the cover page is wrong. Since this seems to be important, I say go for a little tag richness and provide an encoding for both pieces. What will make this work is that the programmer working on the reference implementation will need to know the preferred order for presenting this detail via DC or RSS, which is all that most of the world will ever see of it. It's the right solution: the information is not lost, and implementers will have a clue what the specification architects had in mind.
In short: great conference, lovely city, nice weather, a beer town of mythic proportions.

Saturday, January 5, 2008

Discovery by a Thousand Facets

I have always thought myself to be a multi-faceted programmer; finally there is proof.

The issue of combining facets has been on the table for quite some time now, and I have finally had some time to take a shot at it. The basic machinery underneath is now in place, though I am not as convinced about the UI. This version has been installed on the Aquifer Portal Test System and will soon be on the production server. Try it and let me know what you think.
Now when you break down a search result screen and then select one of the displayed facets, it becomes a filter at the top of the facets sidebar. Further faceting and facet selection can add more such filters, and the filters can be removed individually. This type of interface was borrowed at least partially from the BlackLight project. As suggested by Steve Toub, when a decade is selected the faceting window changes to year, suggesting at least some hierarchy there. When a filter is removed, the faceting window returns to the facet set from which the filter came. Empirically that seemed the most logical behavior.
It might be possible to add the feature via a button or something that would allow selecting the NOT of a facet, i.e. filter the search to only records that do not have a particular facet value. I could try this if it seems like it would be a useful searching tool, but it might just be screen clutter.
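
As a sketch of how a stack of filters like this could map onto Solr: each selected facet becomes one fq (filter query) parameter, and a NOT filter is the same clause with a leading minus sign. The class name, field names and values below are hypothetical, and this is not the portal's actual code.

```ruby
# Hypothetical filter stack; each entry becomes one Solr fq parameter.
class FacetFilters
  Filter = Struct.new(:field, :value, :negated)

  def initialize
    @filters = []
  end

  # Add a filter; negated: true asks for records WITHOUT this facet value.
  def add(field, value, negated: false)
    @filters << Filter.new(field, value, negated)
    self
  end

  # Remove one filter, leaving the others in place.
  def remove(field, value)
    @filters.reject! { |f| f.field == field && f.value == value }
    self
  end

  # Render as Solr filter-query strings, quoting each value as a phrase.
  def to_fq
    @filters.map do |f|
      escaped = f.value.gsub(/(["\\])/) { "\\#{$1}" }
      "#{f.negated ? '-' : ''}#{f.field}:\"#{escaped}\""
    end
  end
end

filters = FacetFilters.new
filters.add('decade', '1920s').add('format', 'map', negated: true)
puts filters.to_fq.inspect

# Removing one filter leaves the other untouched.
filters.remove('decade', '1920s')
puts filters.to_fq.inspect
```

Keeping each filter as its own entry is what makes individual removal cheap: nothing has to be parsed back out of a combined query string.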
One confusing aspect of faceting is that even when a particular facet is selected, the facet list continues to show other facets. This is because the records retrieved with the given filter most likely have other values in the facet list in addition to the one selected. Maybe this is obvious. It's pretty obvious to me, but this is the kind of issue that I always want to check against the principle of least astonishment: if I get confused, users may be more so, but perhaps not.

Wednesday, January 2, 2008

Tag Administration



We'll start this off with the sunset on Christmas day. Nice, but pretty typical for this time of year. Ah yes, tag administration. I realized, as I was going through many of the problems Kat found in her latest testing, that most were due to lists inside the code not keeping up with decisions and changes made as the portal moves forward. After fixing a couple, for my sake and the sake of future administrators, I went ahead and implemented a tag interface. Registered users who are administrators will have a new submenu option near logout that takes them to the tag administration subsystem. There are pages for adding, editing and deleting the field tags, and screens for controlling which tags display, and in what order, for the full and brief record views. There is one additional page listing the tags that the mods-rubymods xsl transformation creates. Some of these are for headings work and should not be included in displays; nonetheless it is useful to see them without scanning through reams of xsl.
I have to say once again that Ruby on Rails made it quite easy to implement the drag and drop interface. The examples at scriptaculous worked with minimal modification.

Monday, December 17, 2007

Only the coolest searching feature ever...

Have you ever been trying to find something using search, and no matter what terms you try you end up with the same records over and over again? Frustrating! What if you could check a box that said, Only show me things you haven't shown me before, or maybe something a little shorter, and just that would happen? Each search would only show you things that you haven't seen before.
Well, now the Aquifer portal has it.
Registered users of www.dlfaquifer.org who have enabled Remember my searches and records in their profile will see a new check box, Unseen only, which, when checked, limits the records presented to ones that you have not seen before. This feature is a work in progress; I'm still taking feedback on it and still smoothing off the edges. The ability to set marks (temporal or otherwise) would make this feature even better, but is currently just a gleam in the developer's eye.
A new look and feel for the user profile system is part of this release. In addition to their search history, users can see records they have seen in search results and records they have selected. A way of clearing this history has also been provided.
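
The idea behind the check box can be sketched in a few lines of Ruby. This is only an illustration, not the portal's code: the helper name and the record ids are hypothetical, and a real system would keep the seen set per user in a database rather than in memory.

```ruby
require 'set'

# Hypothetical helper: drop records the user has already been shown,
# then remember the ones being shown now.
def unseen_only(result_ids, seen_ids)
  fresh = result_ids.reject { |id| seen_ids.include?(id) }
  seen_ids.merge(fresh)
  fresh
end

seen = Set.new
first  = unseen_only(%w[rec1 rec2 rec3], seen)
second = unseen_only(%w[rec2 rec3 rec4], seen)

puts first.inspect
puts second.inspect
```

The second search overlaps the first, so only the record that was not in the first result set comes back.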

Please try it. If you couldn't think of a reason to register before, now you can.

SRU Access ready for testing

SRU access to the Aquifer Portal is up and ready for testing. explain and searchRetrieve are working. Basic errors are caught and returned in a diagnostics element. The base address of the service is http://www.dlfaquifer.org/sru. Please test, and test.

I shook it down with some help from Ralph LeVan and an SRU test website he created (http://alcme.oclc.org/srw/SRUServerTester.html). Our service now passes most of the tests. We are failing a couple that have to do with error conditions that I have not quite figured out how to detect; instead we just return zero hits.
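
For anyone putting together test requests, here is a small sketch that builds explain and searchRetrieve URLs against the base address above. The operation, version, query and maximumRecords parameter names come from the SRU specification; the version value 1.1, the dc.title index and the query text are assumptions for illustration, since the explain packet is the authority on what the service actually supports.

```ruby
require 'uri'

# Build an SRU request URL from a base address and request parameters.
def sru_url(base, params)
  "#{base}?#{URI.encode_www_form(params)}"
end

base = 'http://www.dlfaquifer.org/sru' # base address given in the post

explain_url = sru_url(base, operation: 'explain', version: '1.1')
search_url  = sru_url(base,
                      operation: 'searchRetrieve',
                      version: '1.1',                  # assumed version
                      query: 'dc.title = "civil war"', # hypothetical CQL query
                      maximumRecords: 10)

puts explain_url
puts search_url
```

URI.encode_www_form takes care of escaping the spaces and quotes in the CQL query, which is the part most likely to go wrong when building these URLs by hand.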

The explain packet does not list all the potential indices yet, but I am holding off on that while Tom and I look into a more generic approach to the MODS SOLR interaction.

Tuesday, November 27, 2007

Testing is good


A lot of irons are in the fire right now, but after doing a medium-size refactoring job this afternoon, I just have to say how happy I am that I have been assimilated into Ruby on Rails unit testing. I know testing has been available on every system that I have ever used, but I have never found it so easy to get off the dime, do the testing and stick with it. I now do a significant portion of my debugging while developing my tests. Having done a pretty good suite for the forthcoming Ruby CQL parser, converting the XML generation to use Builder went really fast, despite some venial debugging problems in RadRails (gotta figure that out later).
The picture is the eastern sky I saw this morning.