Tuesday, March 25, 2008

How do you know it's a country

I wanted to know if a string contains a country. So I got a list of countries, there are plenty out there though finding one containing England and names like that is hard. Anyway I think to myself, I'll just build a nice regular expression that looks like /country1|country2|.../ and check it against my string. Out of curiosity I wondered, how much faster that would be than having an array of regular expressions and comparing each one against the string. My profound respect for regular expressions made me think some sort of clever tree would be compiled and the search would be quite fast. So I wrote a quick test in ruby and found much to my surprise that the single regular expression technique was about 4 times slower. I haven't pursued it further (but did go with the array method) but here is the code if someone can tell me what I did wrong.


# simple class to determine whether it's quicker to decide if a string has a country
# with a regex for each country or whether to build a single regex that uses 'or'
# to list every country
# could be missing something
# conclusion: much to my surprise array of regexes seems to be about 4 times faster
class RegexpTest
# this list is from http://www.iso.org/iso/iso3166_en_code_lists.txt
# with a few alterations
@@countries = [ 'AFGHANISTAN', 'ÅLAND ISLANDS', 'ALBANIA', 'ALGERIA', 'AMERICAN SAMOA', 'ANDORRA', 'ANGOLA',
'ANGUILLA', 'ANTARCTICA', 'ANTIGUA AND BARBUDA', 'ARGENTINA', 'ARMENIA', 'ARUBA', 'AUSTRALIA',
'AUSTRIA', 'AZERBAIJAN', 'BAHAMAS', 'BAHRAIN', 'BANGLADESH', 'BARBADOS', 'BELARUS', 'BELGIUM',
'BELIZE', 'BENIN', 'BERMUDA', 'BHUTAN', 'BOLIVIA', 'BOSNIA AND HERZEGOVINA', 'BOTSWANA',
'BOUVET ISLAND', 'BRAZIL', 'BRITISH INDIAN OCEAN TERRITORY', 'BRUNEI DARUSSALAM', 'BULGARIA',
'BURKINA FASO', 'BURUNDI', 'CAMBODIA', 'CAMEROON', 'CANADA', 'CAPE VERDE', 'CAYMAN ISLANDS',
'CENTRAL AFRICAN REPUBLIC', 'CHAD', 'CHILE', 'CHINA', 'CHRISTMAS ISLAND',
'COCOS (KEELING) ISLANDS', 'COLOMBIA', 'COMOROS', 'CONGO', 'CONGO, THE DEMOCRATIC REPUBLIC OF THE',
'COOK ISLANDS', 'COSTA RICA', 'CÔTE D\'IVOIRE', 'CROATIA', 'CUBA', 'CYPRUS', 'CZECH REPUBLIC',
'DENMARK', 'DJIBOUTI', 'DOMINICA', 'DOMINICAN REPUBLIC', 'ECUADOR', 'EGYPT', 'EL SALVADOR',
'ENGLAND', 'EQUATORIAL GUINEA', 'ERITREA', 'ESTONIA', 'ETHIOPIA', 'FALKLAND ISLANDS (MALVINAS)',
'FAROE ISLANDS', 'FIJI', 'FINLAND', 'FRANCE', 'FRENCH GUIANA', 'FRENCH POLYNESIA',
'FRENCH SOUTHERN TERRITORIES', 'GABON', 'GAMBIA', 'GEORGIA', 'GERMANY', 'GHANA', 'GIBRALTAR',
'GREECE', 'GREENLAND', 'GRENADA', 'GUADELOUPE', 'GUAM', 'GUATEMALA', 'GUERNSEY', 'GUINEA',
'GUINEA-BISSAU', 'GUYANA', 'HAITI', 'HEARD ISLAND AND MCDONALD ISLANDS',
'HOLY SEE (VATICAN CITY STATE)', 'HONDURAS', 'HONG KONG', 'HUNGARY', 'ICELAND', 'INDIA',
'INDONESIA', 'IRAN', 'IRAQ', 'IRELAND', 'ISLE OF MAN', 'ISRAEL', 'ITALY', 'JAMAICA', 'JAPAN',
'JERSEY', 'JORDAN', 'KAZAKHSTAN', 'KENYA', 'KIRIBATI', 'KOREA', 'KOREA, REPUBLIC OF', 'KUWAIT',
'KYRGYZSTAN', 'LAO PEOPLE\'S DEMOCRATIC REPUBLIC', 'LATVIA', 'LEBANON', 'LESOTHO', 'LIBERIA',
'LIBYAN ARAB JAMAHIRIYA', 'LIECHTENSTEIN', 'LITHUANIA', 'LUXEMBOURG', 'MACAO',
'MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF', 'MADAGASCAR', 'MALAWI', 'MALAYSIA', 'MALDIVES',
'MALI', 'MALTA', 'MARSHALL ISLANDS', 'MARTINIQUE', 'MAURITANIA', 'MAURITIUS', 'MAYOTTE', 'MEXICO',
'MICRONESIA, FEDERATED STATES OF', 'MOLDOVA, REPUBLIC OF', 'MONACO', 'MONGOLIA', 'MONTENEGRO',
'MONTSERRAT', 'MOROCCO', 'MOZAMBIQUE', 'MYANMAR', 'NAMIBIA', 'NAURU', 'NEPAL', 'NETHERLANDS',
'NETHERLANDS ANTILLES', 'NEW CALEDONIA', 'NEW ZEALAND', 'NICARAGUA', 'NIGER', 'NIGERIA', 'NIUE',
'NORFOLK ISLAND', 'NORTHERN MARIANA ISLANDS', 'NORWAY', 'OMAN', 'PAKISTAN', 'PALAU',
'PALESTINIAN TERRITORY, OCCUPIED', 'PANAMA', 'PAPUA NEW GUINEA', 'PARAGUAY', 'PERU', 'PHILIPPINES',
'PITCAIRN', 'POLAND', 'PORTUGAL', 'PUERTO RICO', 'QATAR', 'REUNION', 'ROMANIA', 'RUSSIAN FEDERATION',
'RWANDA', 'SAINT BARTHÉLEMY', 'SAINT HELENA', 'SAINT KITTS AND NEVIS', 'SAINT LUCIA', 'SAINT MARTIN',
'SAINT PIERRE AND MIQUELON', 'SAINT VINCENT AND THE GRENADINES', 'SAMOA', 'SAN MARINO',
'SAO TOME AND PRINCIPE', 'SAUDI ARABIA', 'SENEGAL', 'SERBIA', 'SEYCHELLES', 'SIERRA LEONE',
'SINGAPORE', 'SCOTLAND', 'SLOVAKIA', 'SLOVENIA', 'SOLOMON ISLANDS', 'SOMALIA', 'SOUTH AFRICA',
'SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS', 'SPAIN', 'SRI LANKA', 'SUDAN', 'SURINAME',
'SVALBARD AND JAN MAYEN', 'SWAZILAND', 'SWEDEN', 'SWITZERLAND', 'SYRIAN ARAB REPUBLIC',
'TAIWAN, PROVINCE OF CHINA', 'TAJIKISTAN', 'TANZANIA, UNITED REPUBLIC OF', 'THAILAND', 'TIMOR-LESTE',
'TOGO', 'TOKELAU', 'TONGA', 'TRINIDAD AND TOBAGO', 'TUNISIA', 'TURKEY', 'TURKMENISTAN',
'TURKS AND CAICOS ISLANDS', 'TUVALU', 'UGANDA', 'UKRAINE', 'UNITED ARAB EMIRATES', 'UNITED KINGDOM',
'UNITED STATES', 'UNITED STATES MINOR OUTLYING ISLANDS', 'URUGUAY', 'UZBEKISTAN', 'VANUATU',
'VENEZUELA', 'VIET NAM', 'VIRGIN ISLANDS, BRITISH', 'VIRGIN ISLANDS, U.S.', 'WALLIS AND FUTUNA',
'WESTERN SAHARA', 'YEMEN', 'ZAMBIA', 'ZIMBABWE']

# generate a random string that about half the time is a country
def RegexpTest.rand_string( n )
a = Array.new
n.times do
s = ''
if rand( 2 ) > 0
s = @@countries[ rand( @@countries.length ) ]
else
x = rand(100)+10
x.times do
s << (rand(26)+64).chr
end
end
a << s
puts s
end
a
end

def RegexpTest.compare
strings = RegexpTest.rand_string( 10000 )

found = 0
timer1 = Time.now
strings.each do |string|
@@country_regexes.each do |re|
if re =~ string
found += 1
break
end
end
end
timer2 = Time.now
delta = timer2 - timer1

puts "array of regexes found=#{found} in #{delta} seconds"

found = 0
timer1 = Time.now
strings.each do |string|
found += 1 if @@country_regex =~ string
end
timer2 = Time.now

delta = timer2 - timer1
puts "regex with big or found=#{found} in #{delta} seconds"

end
end

RegexpTest.compare

6 comments:

Dieta said...

Hello. This post is likeable, and your blog is very interesting, congratulations :-). I will add in my blogroll =). If possible gives a last there on my blog, it is about the Dieta, I hope you enjoy. The address is http://dieta-brasil.blogspot.com. A hug.

Masao said...

Hi,

Matching with a simple joined regex seems like sequential search...

You can optimise regex with TRI-based searching algorithm.
Below is a optimised regexp.
# I used Regexp::Assembler to build this, though the module itself is perl-based... ;)

/(?:S(?:A(?:INT (?:(?:VINCENT AND THE GRENADINE|KITTS AND NEVI)S|(?:PIERRE AND MIQUELO|MARTI)N|(?:HELEN|LUCI)A|BARTHÉLEMY)|O TOME AND PRINCIPE|(?:UDI ARABI|MO)A|N MARINO)|O(?:UTH (?:GEORGIA AND THE SOUTH SANDWICH ISLANDS|AFRICA)|LOMON ISLANDS|MALIA)|(?:VALBARD AND JAN MAYE|PAI)N|W(?:(?:ITZER|AZI)LAND|EDEN)|(?:LOV(?:AK|EN)I|RI LANK)A|E(?:YCHELLES|NEGAL|RBIA)|I(?:ERRA LEON|NGAPOR)E|YRIAN ARAB REPUBLIC|U(?:RINAME|DAN)|COTLAND)|M(?:A(?:C(?:EDONIA, THE FORMER YUGOSLAV REPUBLIC OF|AO)|L(?:A(?:YSIA|WI)|DIVES|TA|I)|R(?:SHALL ISLANDS|TINIQUE)|URIT(?:ANIA|IUS)|DAGASCAR|YOTTE)|O(?:N(?:T(?:ENEGRO|SERRAT)|GOLIA|ACO)|LDOVA, REPUBLIC OF|ZAMBIQUE|ROCCO)|ICRONESIA, FEDERATED STATES OF|YANMAR|EXICO)|C(?:O(?:NGO(?:, THE DEMOCRATIC REPUBLIC OF THE)?|(?:(?:COS (KEELING)|OK) ISLAND|MORO)S|(?:STA RIC|LOMBI)A)|A(?:M(?:BODIA|EROON)|YMAN ISLANDS|PE VERDE|NADA)|H(?:(?:RISTMAS ISLAN|A)D|I(?:LE|NA))|(?:ENTRAL AFRICAN|ZECH) REPUBLIC|(?:ROATI|UB)A|ÔTE D'IVOIRE|YPRUS)|B(?:O(?:(?:(?:SNIA AND HERZEGOVI|TSWA)N|LIVI)A|UVET ISLAND)|R(?:ITISH INDIAN OCEAN TERRITORY|UNEI DARUSSALAM|AZIL)|A(?:H(?:AMAS|RAIN)|NGLADESH|RBADOS)|E(?:L(?:ARUS|GIUM|IZE)|RMUDA|NIN)|U(?:R(?:KINA FASO|UNDI)|LGARIA)|HUTAN)|T(?:A(?:NZANIA, UNITED REPUBLIC OF|IWAN, PROVINCE OF CHINA|JIKISTAN)|U(?:RK(?:S AND CAICOS ISLANDS|MENISTAN|EY)|NISIA|VALU)|RINIDAD AND TOBAGO|O(?:KELAU|NGA|GO)|IMOR-LESTE|HAILAND)|N(?:E(?:THERLAND(?:S ANTILLE)?S|W (?:CALEDONIA|ZEALAND)|PAL)|OR(?:THERN MARIANA ISLANDS|FOLK ISLAND|WAY)|I(?:GER(?:IA)?|CARAGUA|UE)|A(?:MIBIA|URU))|A(?:(?:N(?:T(?:IGUA AND BARBUD|ARCTIC)|G(?:UIL|O)L|DORR)|(?:L(?:BAN|GER)|USTR(?:AL)?)I|R(?:GENTIN|MENI|UB)|MERICAN SAMO)A|(?:FGHANIST|ZERBAIJ)AN)|P(?:A(?:L(?:ESTINIAN TERRITORY, OCCUPIED|AU)|(?:PUA NEW GUINE|NAM)A|KISTAN|RAGUAY)|O(?:RTUGAL|LAND)|HILIPPINES|UERTO RICO|ITCAIRN|ERU)|G(?:U(?:A(?:DELOUPE|TEMALA|M)|INEA(?:-BISSAU)?|ERNSEY|YANA)|RE(?:E(?:NLAND|CE)|NADA)|E(?:ORGIA|RMANY)|A(?:MBIA|BON)|IBRALTAR|HANA)|L(?:I(?:(?:B(?:YAN ARAB JAMAHIRIY|ERI)|THUANI)A|ECHTENSTEIN)|A(?:O PEOPLE'S DEMOCRATIC REPUBLIC|TVIA)|E(?:BANON|SOTHO)|UXEMBOURG)|F(?:R(?:ENCH (?:SOUTHERN TERRITORIES|(?:POLYNESI|GUIAN)A)|ANCE)|A(?:LKLAND ISLANDS (MALVINAS)|ROE ISLANDS)|I(?:NLAND|JI))|U(?:NITED (?:(?:STATE(?:S MINOR OUTLYING ISLAND)?|ARAB EMIRATE)S|KINGDOM)|ZBEKISTAN|KRAINE|RUGUAY|GANDA)|H(?:O(?:LY SEE (VATICAN CITY STATE)|N(?:G KONG|DURAS))|EARD ISLAND AND MCDONALD ISLANDS|UNGARY|AITI)|E(?:(?:(?:QUATORIAL GUIN|RITR)E|(?:THIOP|STON)I)A|(?:L SALV|CU)ADOR|NGLAND|GYPT)|I(?:S(?:LE OF MAN|RAEL)|R(?:A[NQ]|ELAND)|ND(?:ONES)?IA|CELAND|TALY)|K(?:OREA(?:, REPUBLIC OF)?|(?:AZAKH|YRGYZ)STAN|IRIBATI|UWAIT|ENYA)|V(?:I(?:RGIN ISLANDS, (?:BRITISH|U.S.)|ET NAM)|ENEZUELA|ANUATU)|R(?:(?:USSIAN FEDERAT|EUN)ION|(?:OMANI|WAND)A)|D(?:OMINICA(?:N REPUBLIC)?|JIBOUTI|ENMARK)|W(?:ALLIS AND FUTUN|ESTERN SAHAR)A|J(?:A(?:MAICA|PAN)|ERSEY|ORDAN)|Z(?:IMBABWE|AMBIA)|ÅLAND ISLANDS|(?:YEME|OMA)N|QATAR)/


This optimized regex is complex but fastest one in my environment:

array of regexes found=5004 in 1.167118 seconds
regex with big or found=5004 in 3.910049 seconds
regex with optimiser found=5004 in 0.430012 seconds

milf said...

black mold exposureblack mold symptoms of exposurewrought iron garden gatesiron garden gates find them herefine thin hair hairstylessearch hair styles for fine thin hairnight vision binocularsbuy night vision binocularslipitor reactionslipitor allergic reactionsluxury beach resort in the philippines

afordable beach resorts in the philippineshomeopathy for eczema.baby eczema.save big with great mineral makeup bargainsmineral makeup wholesalersprodam iphone Apple prodam iphone prahacect iphone manualmanual for P 168 iphonefero 52 binocularsnight vision Fero 52 binocularsThe best night vision binoculars here

night vision binoculars bargainsfree photo albums computer programsfree software to make photo albumsfree tax formsprintable tax forms for free craftmatic air bedcraftmatic air bed adjustable info hereboyd air bedboyd night air bed lowest pricefind air beds in wisconsinbest air beds in wisconsincloud air beds

best cloud inflatable air bedssealy air beds portableportables air bedsrv luggage racksaluminum made rv luggage racksair bed raisedbest form raised air bedsbed air informercialsbest informercials bed airmattress sized air beds

bestair bed mattress antique doorknobsantique doorknob identification tipsdvd player troubleshootingtroubleshooting with the dvd playerflat panel television lcd vs plasmaflat panel lcd television versus plasma pic the bestadjustable bed air foam The best bed air foam

hoof prints antique equestrian printsantique hoof prints equestrian printsBuy air bedadjustablebuy the best adjustable air bedsair beds canadian storesCanadian stores for air beds

migraine causemigraine treatments floridaflorida headache clinicdrying dessicantair drying dessicantdessicant air dryerpediatric asthmaasthma specialistasthma children specialistcarpet cleaning dallas txcarpet cleaners dallascarpet cleaning dallas

vero beach vacationvero beach vacationsbeach vacation homes veroms beach vacationsms beach vacationms beach condosmaui beach vacationmaui beach vacationsmaui beach clubbeach vacationsyour beach vacationscheap beach vacations

bob hairstylebob haircutsbob layeredpob hairstylebobbedclassic bobCare for Curly HairTips for Curly Haircurly hair12r 22.5 best pricetires truck bustires 12r 22.5

washington new housenew house houstonnew house san antonionew house venturanew houston house houston house txstains removal dyestains removal clothesstains removalteeth whiteningteeth whiteningbright teeth

jennifer grey nosejennifer nose jobscalebrities nose jobsWomen with Big NosesWomen hairstylesBig Nose Women, hairstyles

ranjali said...

The blog is marvelous and enthusiastic.First time i visit your blog and love it.I ll also tell my friends to visit your blog and read it........................

IT Solution

bathmate said...

good posting.i like it. thank u. :)-


bathmate

免費看台灣火辣貼圖區 said...

情色成人網站 -
情色成人論壇 -
18情色成人網 -
台灣18情色成人 -
AV情色成人網 -
情色成人圖片 -
情色成人聊天室 -
18禁情色成人影音 -
18情色成人 -
情色成人短片 -
卡通情色成人網 -
小瓢蟲情色成人動畫 -
情色成人娛樂 -
愛愛情色成人貼片區 -
69情色成人無碼 -
情色成人貼片 -
A383情色成人影音城 -
免費情色成人圖片 -
情色成人天下淫書 -
情色成人動畫 -
嘟嘟情色成人貼圖 -
免費下載情色成人短片 -
嘟嘟情色成人網站 -
十八禁情色成人卡漫 -
金瓶梅情色成人網 -
免費情色成人短片 -
情色成人哈啦聊天室 -
免費情色成人影片 -
台灣十八情色成人網 -
熊貓情色成人 -
情色成人自拍 -
情色成人小遊戲 -
台灣情色成人貼圖 -
情色成人黃色笑話 -
小弟弟情色成人娛樂網 -
免費情色成人小說 -
情色成人文章 -
情色成人18禁 -
免費18禁情色成人圖片 -
情色成人聊天 -