Tuesday, March 25, 2008

How do you know it's a country

I wanted to know if a string contains a country. So I got a list of countries, there are plenty out there though finding one containing England and names like that is hard. Anyway I think to myself, I'll just build a nice regular expression that looks like /country1|country2|.../ and check it against my string. Out of curiosity I wondered, how much faster that would be than having an array of regular expressions and comparing each one against the string. My profound respect for regular expressions made me think some sort of clever tree would be compiled and the search would be quite fast. So I wrote a quick test in ruby and found much to my surprise that the single regular expression technique was about 4 times slower. I haven't pursued it further (but did go with the array method) but here is the code if someone can tell me what I did wrong.


# simple class to determine whether it's quicker to decide if a string has a country
# with a regex for each country or whether to build a single regex that uses 'or'
# to list every country
# could be missing something
# conclusion: much to my surprise array of regexes seems to be about 4 times faster
class RegexpTest
# this list is from http://www.iso.org/iso/iso3166_en_code_lists.txt
# with a few alterations
@@countries = [ 'AFGHANISTAN', 'ÅLAND ISLANDS', 'ALBANIA', 'ALGERIA', 'AMERICAN SAMOA', 'ANDORRA', 'ANGOLA',
'ANGUILLA', 'ANTARCTICA', 'ANTIGUA AND BARBUDA', 'ARGENTINA', 'ARMENIA', 'ARUBA', 'AUSTRALIA',
'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH', 'BARBADOS', 'BELARUS', 'BELGIUM',
'BELIZE', 'BENIN', 'BERMUDA', 'BHUTAN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA',
'BOUVET ISLAND', 'BRAZIL', 'BRITISH INDIAN OCEAN TERRITORY', 'BRUNEI DARUSSALAM', 'BULGARIA',
'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE', 'CAYMAN ISLANDS',
'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA', 'CHRISTMAS ISLAND',
'COCOS (KEELING) ISLANDS', 'COLOMBIA', 'COMOROS', 'CONGO', 'CONGO, THE DEMOCRATIC REPUBLIC OF THE',
'COOK ISLANDS', 'COSTA RICA', 'CÔTE D\'IVOIRE', 'CROATIA', 'CUBA', 'CYPRUS', 'CZECH REPUBLIC',
'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT', 'EL SALVADOR',
'ENGLAND', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA', 'ETHIOPIA', 'FALKLAND ISLANDS (MALVINAS)',
'FAROE ISLANDS', 'FIJI', 'FINLAND', 'FRANCE', 'FRENCH GUIANA', 'FRENCH POLYNESIA',
'FRENCH SOUTHERN TERRITORIES', 'GABON', 'GAMBIA', 'GEORGIA', 'GERMANY', 'GHANA', 'GIBRALTAR',
'GREECE', 'GREENLAND', 'GRENADA', 'GUADELOUPE', 'GUAM', 'GUATEMALA', 'GUERNSEY', 'GUINEA',
'GUINEA-BISSAU', 'GUYANA', 'HAITI', 'HEARD ISLAND AND MCDONALD ISLANDS',
'HOLY SEE (VATICAN CITY STATE)', 'HONDURAS', 'HONG KONG', 'HUNGARY', 'ICELAND', 'INDIA',
'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISLE OF MAN', 'ISRAEL', 'ITALY', 'JAMAICA', 'JAPAN',
'JERSEY', 'JORDAN', 'KAZAKHSTAN', 'KENYA', 'KIRIBATI', 'KOREA', 'KOREA, REPUBLIC OF', 'KUWAIT',
'KYRGYZSTAN', 'LAO PEOPLE\'S DEMOCRATIC REPUBLIC', 'LATVIA', 'LEBANON', 'LESOTHO', 'LIBERIA',
'LIBYAN ARAB JAMAHIRIYA', 'LIECHTENSTEIN', 'LITHUANIA', 'LUXEMBOURG', 'MACAO',
'MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF', 'MADAGASCAR', 'MALAWI', 'MALAYSIA', 'MALDIVES',
'MALI', 'MALTA', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAURITANIA', 'MAURITIUS', 'MAYOTTE', 'MEXICO',
'MICRONESIA, FEDERATED STATES OF', 'MOLDOVA, REPUBLIC OF', 'MONACO', 'MONGOLIA', 'MONTENEGRO',
'MONTSERRAT', 'MOROCCO', 'MOZAMBIQUE', 'MYANMAR', 'NAMIBIA', 'NAURU', 'NEPAL', 'NETHERLANDS',
'NETHERLANDS ANTILLES', 'NEW CALEDONIA', 'NEW ZEALAND', 'NICARAGUA', 'NIGER', 'NIGERIA', 'NIUE',
'NORFOLK ISLAND', 'NORTHERN MARIANA ISLANDS', 'NORWAY', 'OMAN', 'PAKISTAN', 'PALAU',
'PALESTINIAN TERRITORY, OCCUPIED', 'PANAMA', 'PAPUA NEW GUINEA', 'PARAGUAY', 'PERU', 'PHILIPPINES',
'PITCAIRN', 'POLAND', 'PORTUGAL', 'PUERTO RICO', 'QATAR', 'REUNION', 'ROMANIA', 'RUSSIAN FEDERATION',
'RWANDA', 'SAINT BARTHÉLEMY', 'SAINT HELENA', 'SAINT KITTS AND NEVIS', 'SAINT LUCIA', 'SAINT MARTIN',
'SAINT PIERRE AND MIQUELON', 'SAINT VINCENT AND THE GRENADINES', 'SAMOA', 'SAN MARINO',
'SAO TOME AND PRINCIPE', 'SAUDI ARABIA', 'SENEGAL', 'SERBIA', 'SEYCHELLES', 'SIERRA LEONE',
'SINGAPORE', 'SCOTLAND', 'SLOVAKIA', 'SLOVENIA', 'SOLOMON ISLANDS', 'SOMALIA', 'SOUTH AFRICA',
'SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 'SPAIN', 'SRI LANKA', 'SUDAN', 'SURINAME',
'SVALBARD AND JAN MAYEN', 'SWAZILAND', 'SWEDEN', 'SWITZERLAND', 'SYRIAN ARAB REPUBLIC',
'TAIWAN, PROVINCE OF CHINA', 'TAJIKISTAN', 'TANZANIA, UNITED REPUBLIC OF', 'THAILAND', 'TIMOR-LESTE',
'TOGO', 'TOKELAU', 'TONGA', 'TRINIDAD AND TOBAGO', 'TUNISIA', 'TURKEY', 'TURKMENISTAN',
'TURKS AND CAICOS ISLANDS', 'TUVALU', 'UGANDA', 'UKRAINE', 'UNITED ARAB EMIRATES', 'UNITED KINGDOM',
'UNITED STATES', 'UNITED STATES MINOR OUTLYING ISLANDS', 'URUGUAY', 'UZBEKISTAN', 'VANUATU',
'VENEZUELA', 'VIET NAM', 'VIRGIN ISLANDS, BRITISH', 'VIRGIN ISLANDS, U.S.', 'WALLIS AND FUTUNA',
'WESTERN SAHARA', 'YEMEN', 'ZAMBIA', 'ZIMBABWE']

# generate a random string that about half the time is a country
def RegexpTest.rand_string( n )
a = Array.new
n.times do
s = ''
if rand( 2 ) > 0
s = @@countries[ rand( @@countries.length ) ]
else
x = rand(100)+10
x.times do
s << (rand(26)+64).chr
end
end
a << s
puts s
end
a
end

def RegexpTest.compare
strings = RegexpTest.rand_string( 10000 )

found = 0
timer1 = Time.now
strings.each do |string|
@@country_regexes.each do |re|
if re =~ string
found += 1
break
end
end
end
timer2 = Time.now
delta = timer2 - timer1

puts "array of regexes found=#{found} in #{delta} seconds"

found = 0
timer1 = Time.now
strings.each do |string|
found += 1 if @@country_regex =~ string
end
timer2 = Time.now

delta = timer2 - timer1
puts "regex with big or found=#{found} in #{delta} seconds"

end
end

RegexpTest.compare

Tuesday, March 4, 2008

An Aquifer Perspective on Code4Lib Conf 2008

Code4lib
It's a small well run tightly focused forum for programming efforts in support of libraries. With a single track, primary talks limited to 20 minutes talks, then ending each day with an hour of 5 minute lightning talks you can't help feeling in touch with what's going on in the library programming community. See http://code4lib.org/conference/2008/
The trend that stood out most for me is the number of projects implementing catalogs in one form or another. The software tools necessary to create a catalog are more robust and easier to use, borne out by our own project. Very little of the presentation time on these things was spent bemoaning the difficulties of master the different metadata formats, the problems, whatever they are, seem secondary to the issues of getting real time information on holdings, circulation status, etc. My sense is that the continuing movement towards keyword and faceted searching via SOLR/Lucene and like engines is sufficient in the minds of this group (at least) to address concerns about how the quality and differences of the underlying metadata affect the ability to search and present it.
Not that metadata issues are dead, two rather pointed talks on the on-going RDA effort show that to be far from the case. My naïve view is that all metadata efforts would be immensely helped by employing programmers on the metadata team up front and charging them with delivering, upon release of the specification, a reference implementation, any reference implementation. Secondly, (and I can't quite believe I am saying this) more exuberant use of the specificity that xml can provide. As an example, Karen Coyle, one of the keynote speakers, described the inability to achieve closure on the issue of encoding the author and publisher when what appears on cover page is wrong. Since this seems to be important, I say, go for a little tag richness and provide an encoding for both pieces. What will make this work is that the programmer working on the reference implementation will need to know the preferred order for presenting this detail via DC or RSS which is all that most of the world will ever see of it. It's the right solution, the information is not lost but implementers will have a clue what the specification architects had in mind.
In short: great conference, lovely city, nice weather, a beer town of mythic proportions.