Tuesday, March 25, 2008

How do you know it's a country

I wanted to know if a string contains a country. So I got a list of countries, there are plenty out there though finding one containing England and names like that is hard. Anyway I think to myself, I'll just build a nice regular expression that looks like /country1|country2|.../ and check it against my string. Out of curiosity I wondered, how much faster that would be than having an array of regular expressions and comparing each one against the string. My profound respect for regular expressions made me think some sort of clever tree would be compiled and the search would be quite fast. So I wrote a quick test in ruby and found much to my surprise that the single regular expression technique was about 4 times slower. I haven't pursued it further (but did go with the array method) but here is the code if someone can tell me what I did wrong.


# simple class to determine whether it's quicker to decide if a string has a country
# with a regex for each country or whether to build a single regex that uses 'or'
# to list every country
# could be missing something
# conclusion: much to my surprise array of regexes seems to be about 4 times faster
class RegexpTest
# this list is from http://www.iso.org/iso/iso3166_en_code_lists.txt
# with a few alterations
@@countries = [ 'AFGHANISTAN', 'ÅLAND ISLANDS', 'ALBANIA', 'ALGERIA', 'AMERICAN SAMOA', 'ANDORRA', 'ANGOLA',
'ANGUILLA', 'ANTARCTICA', 'ANTIGUA AND BARBUDA', 'ARGENTINA', 'ARMENIA', 'ARUBA', 'AUSTRALIA',
'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH', 'BARBADOS', 'BELARUS', 'BELGIUM',
'BELIZE', 'BENIN', 'BERMUDA', 'BHUTAN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA',
'BOUVET ISLAND', 'BRAZIL', 'BRITISH INDIAN OCEAN TERRITORY', 'BRUNEI DARUSSALAM', 'BULGARIA',
'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE', 'CAYMAN ISLANDS',
'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA', 'CHRISTMAS ISLAND',
'COCOS (KEELING) ISLANDS', 'COLOMBIA', 'COMOROS', 'CONGO', 'CONGO, THE DEMOCRATIC REPUBLIC OF THE',
'COOK ISLANDS', 'COSTA RICA', 'CÔTE D\'IVOIRE', 'CROATIA', 'CUBA', 'CYPRUS', 'CZECH REPUBLIC',
'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT', 'EL SALVADOR',
'ENGLAND', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA', 'ETHIOPIA', 'FALKLAND ISLANDS (MALVINAS)',
'FAROE ISLANDS', 'FIJI', 'FINLAND', 'FRANCE', 'FRENCH GUIANA', 'FRENCH POLYNESIA',
'FRENCH SOUTHERN TERRITORIES', 'GABON', 'GAMBIA', 'GEORGIA', 'GERMANY', 'GHANA', 'GIBRALTAR',
'GREECE', 'GREENLAND', 'GRENADA', 'GUADELOUPE', 'GUAM', 'GUATEMALA', 'GUERNSEY', 'GUINEA',
'GUINEA-BISSAU', 'GUYANA', 'HAITI', 'HEARD ISLAND AND MCDONALD ISLANDS',
'HOLY SEE (VATICAN CITY STATE)', 'HONDURAS', 'HONG KONG', 'HUNGARY', 'ICELAND', 'INDIA',
'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISLE OF MAN', 'ISRAEL', 'ITALY', 'JAMAICA', 'JAPAN',
'JERSEY', 'JORDAN', 'KAZAKHSTAN', 'KENYA', 'KIRIBATI', 'KOREA', 'KOREA, REPUBLIC OF', 'KUWAIT',
'KYRGYZSTAN', 'LAO PEOPLE\'S DEMOCRATIC REPUBLIC', 'LATVIA', 'LEBANON', 'LESOTHO', 'LIBERIA',
'LIBYAN ARAB JAMAHIRIYA', 'LIECHTENSTEIN', 'LITHUANIA', 'LUXEMBOURG', 'MACAO',
'MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF', 'MADAGASCAR', 'MALAWI', 'MALAYSIA', 'MALDIVES',
'MALI', 'MALTA', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAURITANIA', 'MAURITIUS', 'MAYOTTE', 'MEXICO',
'MICRONESIA, FEDERATED STATES OF', 'MOLDOVA, REPUBLIC OF', 'MONACO', 'MONGOLIA', 'MONTENEGRO',
'MONTSERRAT', 'MOROCCO', 'MOZAMBIQUE', 'MYANMAR', 'NAMIBIA', 'NAURU', 'NEPAL', 'NETHERLANDS',
'NETHERLANDS ANTILLES', 'NEW CALEDONIA', 'NEW ZEALAND', 'NICARAGUA', 'NIGER', 'NIGERIA', 'NIUE',
'NORFOLK ISLAND', 'NORTHERN MARIANA ISLANDS', 'NORWAY', 'OMAN', 'PAKISTAN', 'PALAU',
'PALESTINIAN TERRITORY, OCCUPIED', 'PANAMA', 'PAPUA NEW GUINEA', 'PARAGUAY', 'PERU', 'PHILIPPINES',
'PITCAIRN', 'POLAND', 'PORTUGAL', 'PUERTO RICO', 'QATAR', 'REUNION', 'ROMANIA', 'RUSSIAN FEDERATION',
'RWANDA', 'SAINT BARTHÉLEMY', 'SAINT HELENA', 'SAINT KITTS AND NEVIS', 'SAINT LUCIA', 'SAINT MARTIN',
'SAINT PIERRE AND MIQUELON', 'SAINT VINCENT AND THE GRENADINES', 'SAMOA', 'SAN MARINO',
'SAO TOME AND PRINCIPE', 'SAUDI ARABIA', 'SENEGAL', 'SERBIA', 'SEYCHELLES', 'SIERRA LEONE',
'SINGAPORE', 'SCOTLAND', 'SLOVAKIA', 'SLOVENIA', 'SOLOMON ISLANDS', 'SOMALIA', 'SOUTH AFRICA',
'SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 'SPAIN', 'SRI LANKA', 'SUDAN', 'SURINAME',
'SVALBARD AND JAN MAYEN', 'SWAZILAND', 'SWEDEN', 'SWITZERLAND', 'SYRIAN ARAB REPUBLIC',
'TAIWAN, PROVINCE OF CHINA', 'TAJIKISTAN', 'TANZANIA, UNITED REPUBLIC OF', 'THAILAND', 'TIMOR-LESTE',
'TOGO', 'TOKELAU', 'TONGA', 'TRINIDAD AND TOBAGO', 'TUNISIA', 'TURKEY', 'TURKMENISTAN',
'TURKS AND CAICOS ISLANDS', 'TUVALU', 'UGANDA', 'UKRAINE', 'UNITED ARAB EMIRATES', 'UNITED KINGDOM',
'UNITED STATES', 'UNITED STATES MINOR OUTLYING ISLANDS', 'URUGUAY', 'UZBEKISTAN', 'VANUATU',
'VENEZUELA', 'VIET NAM', 'VIRGIN ISLANDS, BRITISH', 'VIRGIN ISLANDS, U.S.', 'WALLIS AND FUTUNA',
'WESTERN SAHARA', 'YEMEN', 'ZAMBIA', 'ZIMBABWE']

# generate a random string that about half the time is a country
def RegexpTest.rand_string( n )
a = Array.new
n.times do
s = ''
if rand( 2 ) > 0
s = @@countries[ rand( @@countries.length ) ]
else
x = rand(100)+10
x.times do
s << (rand(26)+64).chr
end
end
a << s
puts s
end
a
end

def RegexpTest.compare
strings = RegexpTest.rand_string( 10000 )

found = 0
timer1 = Time.now
strings.each do |string|
@@country_regexes.each do |re|
if re =~ string
found += 1
break
end
end
end
timer2 = Time.now
delta = timer2 - timer1

puts "array of regexes found=#{found} in #{delta} seconds"

found = 0
timer1 = Time.now
strings.each do |string|
found += 1 if @@country_regex =~ string
end
timer2 = Time.now

delta = timer2 - timer1
puts "regex with big or found=#{found} in #{delta} seconds"

end
end

RegexpTest.compare

4 comments:

Anonymous said...

Hello. This post is likeable, and your blog is very interesting, congratulations :-). I will add in my blogroll =). If possible gives a last there on my blog, it is about the Dieta, I hope you enjoy. The address is http://dieta-brasil.blogspot.com. A hug.

Masao said...

Hi,

Matching with a simple joined regex seems like sequential search...

You can optimise regex with TRI-based searching algorithm.
Below is a optimised regexp.
# I used Regexp::Assembler to build this, though the module itself is perl-based... ;)

/(?:S(?:A(?:INT (?:(?:VINCENT AND THE GRENADINE|KITTS AND NEVI)S|(?:PIERRE AND MIQUELO|MARTI)N|(?:HELEN|LUCI)A|BARTHÉLEMY)|O TOME AND PRINCIPE|(?:UDI ARABI|MO)A|N MARINO)|O(?:UTH (?:GEORGIA AND THE SOUTH SANDWICH ISLANDS|AFRICA)|LOMON ISLANDS|MALIA)|(?:VALBARD AND JAN MAYE|PAI)N|W(?:(?:ITZER|AZI)LAND|EDEN)|(?:LOV(?:AK|EN)I|RI LANK)A|E(?:YCHELLES|NEGAL|RBIA)|I(?:ERRA LEON|NGAPOR)E|YRIAN ARAB REPUBLIC|U(?:RINAME|DAN)|COTLAND)|M(?:A(?:C(?:EDONIA, THE FORMER YUGOSLAV REPUBLIC OF|AO)|L(?:A(?:YSIA|WI)|DIVES|TA|I)|R(?:SHALL ISLANDS|TINIQUE)|URIT(?:ANIA|IUS)|DAGASCAR|YOTTE)|O(?:N(?:T(?:ENEGRO|SERRAT)|GOLIA|ACO)|LDOVA, REPUBLIC OF|ZAMBIQUE|ROCCO)|ICRONESIA, FEDERATED STATES OF|YANMAR|EXICO)|C(?:O(?:NGO(?:, THE DEMOCRATIC REPUBLIC OF THE)?|(?:(?:COS (KEELING)|OK) ISLAND|MORO)S|(?:STA RIC|LOMBI)A)|A(?:M(?:BODIA|EROON)|YMAN ISLANDS|PE VERDE|NADA)|H(?:(?:RISTMAS ISLAN|A)D|I(?:LE|NA))|(?:ENTRAL AFRICAN|ZECH) REPUBLIC|(?:ROATI|UB)A|ÔTE D'IVOIRE|YPRUS)|B(?:O(?:(?:(?:SNIA AND HERZEGOVI|TSWA)N|LIVI)A|UVET ISLAND)|R(?:ITISH INDIAN OCEAN TERRITORY|UNEI DARUSSALAM|AZIL)|A(?:H(?:AMAS|RAIN)|NGLADESH|RBADOS)|E(?:L(?:ARUS|GIUM|IZE)|RMUDA|NIN)|U(?:R(?:KINA FASO|UNDI)|LGARIA)|HUTAN)|T(?:A(?:NZANIA, UNITED REPUBLIC OF|IWAN, PROVINCE OF CHINA|JIKISTAN)|U(?:RK(?:S AND CAICOS ISLANDS|MENISTAN|EY)|NISIA|VALU)|RINIDAD AND TOBAGO|O(?:KELAU|NGA|GO)|IMOR-LESTE|HAILAND)|N(?:E(?:THERLAND(?:S ANTILLE)?S|W (?:CALEDONIA|ZEALAND)|PAL)|OR(?:THERN MARIANA ISLANDS|FOLK ISLAND|WAY)|I(?:GER(?:IA)?|CARAGUA|UE)|A(?:MIBIA|URU))|A(?:(?:N(?:T(?:IGUA AND BARBUD|ARCTIC)|G(?:UIL|O)L|DORR)|(?:L(?:BAN|GER)|USTR(?:AL)?)I|R(?:GENTIN|MENI|UB)|MERICAN SAMO)A|(?:FGHANIST|ZERBAIJ)AN)|P(?:A(?:L(?:ESTINIAN TERRITORY, OCCUPIED|AU)|(?:PUA NEW GUINE|NAM)A|KISTAN|RAGUAY)|O(?:RTUGAL|LAND)|HILIPPINES|UERTO RICO|ITCAIRN|ERU)|G(?:U(?:A(?:DELOUPE|TEMALA|M)|INEA(?:-BISSAU)?|ERNSEY|YANA)|RE(?:E(?:NLAND|CE)|NADA)|E(?:ORGIA|RMANY)|A(?:MBIA|BON)|IBRALTAR|HANA)|L(?:I(?:(?:B(?:YAN ARAB JAMAHIRIY|ERI)|THUANI)A|ECHTENSTEIN)|A(?:O PEOPLE'S DEMOCRATIC REPUBLIC|TVIA)|E(?:BANON|SOTHO)|UXEMBOURG)|F(?:R(?:ENCH (?:SOUTHERN TERRITORIES|(?:POLYNESI|GUIAN)A)|ANCE)|A(?:LKLAND ISLANDS (MALVINAS)|ROE ISLANDS)|I(?:NLAND|JI))|U(?:NITED (?:(?:STATE(?:S MINOR OUTLYING ISLAND)?|ARAB EMIRATE)S|KINGDOM)|ZBEKISTAN|KRAINE|RUGUAY|GANDA)|H(?:O(?:LY SEE (VATICAN CITY STATE)|N(?:G KONG|DURAS))|EARD ISLAND AND MCDONALD ISLANDS|UNGARY|AITI)|E(?:(?:(?:QUATORIAL GUIN|RITR)E|(?:THIOP|STON)I)A|(?:L SALV|CU)ADOR|NGLAND|GYPT)|I(?:S(?:LE OF MAN|RAEL)|R(?:A[NQ]|ELAND)|ND(?:ONES)?IA|CELAND|TALY)|K(?:OREA(?:, REPUBLIC OF)?|(?:AZAKH|YRGYZ)STAN|IRIBATI|UWAIT|ENYA)|V(?:I(?:RGIN ISLANDS, (?:BRITISH|U.S.)|ET NAM)|ENEZUELA|ANUATU)|R(?:(?:USSIAN FEDERAT|EUN)ION|(?:OMANI|WAND)A)|D(?:OMINICA(?:N REPUBLIC)?|JIBOUTI|ENMARK)|W(?:ALLIS AND FUTUN|ESTERN SAHAR)A|J(?:A(?:MAICA|PAN)|ERSEY|ORDAN)|Z(?:IMBABWE|AMBIA)|ÅLAND ISLANDS|(?:YEME|OMA)N|QATAR)/


This optimized regex is complex but fastest one in my environment:

array of regexes found=5004 in 1.167118 seconds
regex with big or found=5004 in 3.910049 seconds
regex with optimiser found=5004 in 0.430012 seconds

Anonymous said...

The blog is marvelous and enthusiastic.First time i visit your blog and love it.I ll also tell my friends to visit your blog and read it........................

IT Solution

bathmate said...

good posting.i like it. thank u. :)-


bathmate