Tuesday, March 25, 2008

How do you know it's a country

I wanted to know if a string contains a country. So I got a list of countries, there are plenty out there though finding one containing England and names like that is hard. Anyway I think to myself, I'll just build a nice regular expression that looks like /country1|country2|.../ and check it against my string. Out of curiosity I wondered, how much faster that would be than having an array of regular expressions and comparing each one against the string. My profound respect for regular expressions made me think some sort of clever tree would be compiled and the search would be quite fast. So I wrote a quick test in ruby and found much to my surprise that the single regular expression technique was about 4 times slower. I haven't pursued it further (but did go with the array method) but here is the code if someone can tell me what I did wrong.


# simple class to determine whether it's quicker to decide if a string has a country
# with a regex for each country or whether to build a single regex that uses 'or'
# to list every country
# could be missing something
# conclusion: much to my surprise array of regexes seems to be about 4 times faster
class RegexpTest
# this list is from http://www.iso.org/iso/iso3166_en_code_lists.txt
# with a few alterations
@@countries = [ 'AFGHANISTAN', 'ÅLAND ISLANDS', 'ALBANIA', 'ALGERIA', 'AMERICAN SAMOA', 'ANDORRA', 'ANGOLA',
'ANGUILLA', 'ANTARCTICA', 'ANTIGUA AND BARBUDA', 'ARGENTINA', 'ARMENIA', 'ARUBA', 'AUSTRALIA',
'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH', 'BARBADOS', 'BELARUS', 'BELGIUM',
'BELIZE', 'BENIN', 'BERMUDA', 'BHUTAN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA',
'BOUVET ISLAND', 'BRAZIL', 'BRITISH INDIAN OCEAN TERRITORY', 'BRUNEI DARUSSALAM', 'BULGARIA',
'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE', 'CAYMAN ISLANDS',
'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA', 'CHRISTMAS ISLAND',
'COCOS (KEELING) ISLANDS', 'COLOMBIA', 'COMOROS', 'CONGO', 'CONGO, THE DEMOCRATIC REPUBLIC OF THE',
'COOK ISLANDS', 'COSTA RICA', 'CÔTE D\'IVOIRE', 'CROATIA', 'CUBA', 'CYPRUS', 'CZECH REPUBLIC',
'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT', 'EL SALVADOR',
'ENGLAND', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA', 'ETHIOPIA', 'FALKLAND ISLANDS (MALVINAS)',
'FAROE ISLANDS', 'FIJI', 'FINLAND', 'FRANCE', 'FRENCH GUIANA', 'FRENCH POLYNESIA',
'FRENCH SOUTHERN TERRITORIES', 'GABON', 'GAMBIA', 'GEORGIA', 'GERMANY', 'GHANA', 'GIBRALTAR',
'GREECE', 'GREENLAND', 'GRENADA', 'GUADELOUPE', 'GUAM', 'GUATEMALA', 'GUERNSEY', 'GUINEA',
'GUINEA-BISSAU', 'GUYANA', 'HAITI', 'HEARD ISLAND AND MCDONALD ISLANDS',
'HOLY SEE (VATICAN CITY STATE)', 'HONDURAS', 'HONG KONG', 'HUNGARY', 'ICELAND', 'INDIA',
'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISLE OF MAN', 'ISRAEL', 'ITALY', 'JAMAICA', 'JAPAN',
'JERSEY', 'JORDAN', 'KAZAKHSTAN', 'KENYA', 'KIRIBATI', 'KOREA', 'KOREA, REPUBLIC OF', 'KUWAIT',
'KYRGYZSTAN', 'LAO PEOPLE\'S DEMOCRATIC REPUBLIC', 'LATVIA', 'LEBANON', 'LESOTHO', 'LIBERIA',
'LIBYAN ARAB JAMAHIRIYA', 'LIECHTENSTEIN', 'LITHUANIA', 'LUXEMBOURG', 'MACAO',
'MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF', 'MADAGASCAR', 'MALAWI', 'MALAYSIA', 'MALDIVES',
'MALI', 'MALTA', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAURITANIA', 'MAURITIUS', 'MAYOTTE', 'MEXICO',
'MICRONESIA, FEDERATED STATES OF', 'MOLDOVA, REPUBLIC OF', 'MONACO', 'MONGOLIA', 'MONTENEGRO',
'MONTSERRAT', 'MOROCCO', 'MOZAMBIQUE', 'MYANMAR', 'NAMIBIA', 'NAURU', 'NEPAL', 'NETHERLANDS',
'NETHERLANDS ANTILLES', 'NEW CALEDONIA', 'NEW ZEALAND', 'NICARAGUA', 'NIGER', 'NIGERIA', 'NIUE',
'NORFOLK ISLAND', 'NORTHERN MARIANA ISLANDS', 'NORWAY', 'OMAN', 'PAKISTAN', 'PALAU',
'PALESTINIAN TERRITORY, OCCUPIED', 'PANAMA', 'PAPUA NEW GUINEA', 'PARAGUAY', 'PERU', 'PHILIPPINES',
'PITCAIRN', 'POLAND', 'PORTUGAL', 'PUERTO RICO', 'QATAR', 'REUNION', 'ROMANIA', 'RUSSIAN FEDERATION',
'RWANDA', 'SAINT BARTHÉLEMY', 'SAINT HELENA', 'SAINT KITTS AND NEVIS', 'SAINT LUCIA', 'SAINT MARTIN',
'SAINT PIERRE AND MIQUELON', 'SAINT VINCENT AND THE GRENADINES', 'SAMOA', 'SAN MARINO',
'SAO TOME AND PRINCIPE', 'SAUDI ARABIA', 'SENEGAL', 'SERBIA', 'SEYCHELLES', 'SIERRA LEONE',
'SINGAPORE', 'SCOTLAND', 'SLOVAKIA', 'SLOVENIA', 'SOLOMON ISLANDS', 'SOMALIA', 'SOUTH AFRICA',
'SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 'SPAIN', 'SRI LANKA', 'SUDAN', 'SURINAME',
'SVALBARD AND JAN MAYEN', 'SWAZILAND', 'SWEDEN', 'SWITZERLAND', 'SYRIAN ARAB REPUBLIC',
'TAIWAN, PROVINCE OF CHINA', 'TAJIKISTAN', 'TANZANIA, UNITED REPUBLIC OF', 'THAILAND', 'TIMOR-LESTE',
'TOGO', 'TOKELAU', 'TONGA', 'TRINIDAD AND TOBAGO', 'TUNISIA', 'TURKEY', 'TURKMENISTAN',
'TURKS AND CAICOS ISLANDS', 'TUVALU', 'UGANDA', 'UKRAINE', 'UNITED ARAB EMIRATES', 'UNITED KINGDOM',
'UNITED STATES', 'UNITED STATES MINOR OUTLYING ISLANDS', 'URUGUAY', 'UZBEKISTAN', 'VANUATU',
'VENEZUELA', 'VIET NAM', 'VIRGIN ISLANDS, BRITISH', 'VIRGIN ISLANDS, U.S.', 'WALLIS AND FUTUNA',
'WESTERN SAHARA', 'YEMEN', 'ZAMBIA', 'ZIMBABWE']

# generate a random string that about half the time is a country
def RegexpTest.rand_string( n )
a = Array.new
n.times do
s = ''
if rand( 2 ) > 0
s = @@countries[ rand( @@countries.length ) ]
else
x = rand(100)+10
x.times do
s << (rand(26)+64).chr
end
end
a << s
puts s
end
a
end

def RegexpTest.compare
strings = RegexpTest.rand_string( 10000 )

found = 0
timer1 = Time.now
strings.each do |string|
@@country_regexes.each do |re|
if re =~ string
found += 1
break
end
end
end
timer2 = Time.now
delta = timer2 - timer1

puts "array of regexes found=#{found} in #{delta} seconds"

found = 0
timer1 = Time.now
strings.each do |string|
found += 1 if @@country_regex =~ string
end
timer2 = Time.now

delta = timer2 - timer1
puts "regex with big or found=#{found} in #{delta} seconds"

end
end

RegexpTest.compare

Tuesday, March 4, 2008

An Aquifer Perspective on Code4Lib Conf 2008

Code4lib
It's a small well run tightly focused forum for programming efforts in support of libraries. With a single track, primary talks limited to 20 minutes talks, then ending each day with an hour of 5 minute lightning talks you can't help feeling in touch with what's going on in the library programming community. See http://code4lib.org/conference/2008/
The trend that stood out most for me is the number of projects implementing catalogs in one form or another. The software tools necessary to create a catalog are more robust and easier to use, borne out by our own project. Very little of the presentation time on these things was spent bemoaning the difficulties of master the different metadata formats, the problems, whatever they are, seem secondary to the issues of getting real time information on holdings, circulation status, etc. My sense is that the continuing movement towards keyword and faceted searching via SOLR/Lucene and like engines is sufficient in the minds of this group (at least) to address concerns about how the quality and differences of the underlying metadata affect the ability to search and present it.
Not that metadata issues are dead, two rather pointed talks on the on-going RDA effort show that to be far from the case. My naïve view is that all metadata efforts would be immensely helped by employing programmers on the metadata team up front and charging them with delivering, upon release of the specification, a reference implementation, any reference implementation. Secondly, (and I can't quite believe I am saying this) more exuberant use of the specificity that xml can provide. As an example, Karen Coyle, one of the keynote speakers, described the inability to achieve closure on the issue of encoding the author and publisher when what appears on cover page is wrong. Since this seems to be important, I say, go for a little tag richness and provide an encoding for both pieces. What will make this work is that the programmer working on the reference implementation will need to know the preferred order for presenting this detail via DC or RSS which is all that most of the world will ever see of it. It's the right solution, the information is not lost but implementers will have a clue what the specification architects had in mind.
In short: great conference, lovely city, nice weather, a beer town of mythic proportions.

Saturday, January 5, 2008

Discovery by a Thousand Facets

I have always thought myself to be a multi-faceted programmer, finally there is proof.

The issue of combining facets has been on the table for quite some time now. I have finally had some time to take a shot at it. The basic machinery underneath is now in place, I am not as convinced about the UI. This version has been installed on the Aquifer Portal Test System, soon to the production server , try it and let me know what you think.
Now when you breakdown a search result screen then select one of the displayed facets, it becomes a filter at the top of the facets sidebar. Further faceting and facet selection can add additional such filters and the filters can be removed individually. This type of interface was borrowed at least partially from the BlackLight project. As suggested by Steve Toub, when a decade is selected the faceting window is changed to be year, suggesting at least some hierarchy there. When removing a filter the faceting window is returned to the facet set from which the filter came. Empirically that seemed the most logical behavior.
It might be possible to add the feature via a button or something that would allow selecting the NOT of a facet, i.e. filter the search to only records that do not have a particular facet value. I could try this if it seems like it would be a useful searching tool, but it might just be screen clutter.
One confusing aspect of faceting is that even when a particular facet is selected the facet list continues to show other facets. This is because the records retrieved with the given filter most likely have other values in the facet list in addition to the filter selected. Maybe this is obvious, it's pretty obvious to me but this is the kind of issue that I always want to check against the principle of least astonishment, if I get confused users may be more so, but perhaps not.

Wednesday, January 2, 2008

Tag Administration



We'll start this off with the sunset on Christmas day. Nice but pretty typical for this time of year. Ah yes, tag administration. I realized, as I was going through many of the problems Kat found in her latest testing, that most were due to lists inside the code not keeping up with decisions and changes made as the portal moves forward. After fixing a couple, for my sake and the sake of future administrators, I went ahead and implemented a tag interface. Registered users who are administrators will have a new submenu option near logout that will take them to the tag administration subsystem. There are pages for adding, editing and deleting the field tags, and screens for controlling which and in what order tags display for the full and brief record views. There is one additional page for listing the tags the mods-rubymods xsl transformation creates, some of these are for headings work and should not be included in displays, nonetheless it is useful to see them without scanning through reams of xsl.
I have to say once again that the ruby on rails made it quite easy to implement the drag and drop interface. The examples at scriptaculous worked with minimal modification.