Sunday 2 July 2017

Computer Science Internationalization - Text Search

So, you have just written some Cool Code which will search for and find occurrences of specified text strings. You have access to Big Data text eg all the text in all public webpages. You will,of course, want to test your Cool Code. Letʼs perform some, seemingly, very simple tests.

Letʼs search for the word 'Scorpion'. Your code works just fine and hence finds all occurrences of the word 'Scorpion'.

Now test with the following two words.

  • Scorpion
  • Scorpion

Your Cool Code works fine as all I have done is applied some CSS styling, thus giving each of the two words differing appearance.

Now test you Cool Code with the following two words.

  • 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛
  • 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧

If you have only programmed for ASCII text then your now not so Cool Code will fail. These two words have differing appearance because they are not made up of the ASCII characters you are familiar with. These words use characters from the Unicode Math Alphanumeric Symbols block, U+1D400-1D4FF.

Should the Math Alphanumeric Symbols Scorpion be treated the same as the ASCII Scorpion wrt the search results of your code? In this context I think "Yes", most definitely. A person reading this blog, for example, will just perceive the word Scorpion whatever characters are used to write the word. The reader may well also visualise the insect with a "sting in the tail"😱

What of current working practice?

With twitter, a user has no means of changing text style within a tweet. It has thus become common to use Unicode Math Alphanumeric Symbols to change appearance. I could, for example, use Unicode Math Alphanumeric Symbols to emphasise a word (eg Scorpion) or phrase within a tweet. The meaning of the tweet remains the same.

Google returns the same number of search results whichever of the above forms of Scorpion I use. At time of writing this is "About 144,000,000 results". I deduce Google is treating ASCII Scorpion and Unicode Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 & 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 as equivalent.

Sogou 搜狗 is a Chinese search engine. Using Sogou: ASCII Scorpion returns 93,341 results, Math Alphanumeric Symbols 𝑆𝑐𝑜𝑟𝑝𝑖𝑜𝑛 returns 4,738, Math Alphanumeric Symbols 𝐒𝐜𝐨𝐫𝐩𝐢𝐨𝐧 returns 61. I think it evident that Sogou does not treat my three forms of Scorpion as equivalent.

I side with Google on this.

Here is a taster of what is happening in the behind the scenes technicalities of Unicode. Letʼs take just one of the Unicode Math Alphanumeric Symbols I have used, 𝐒 MATHEMATICAL BOLD CAPITAL S U+1D412. If you visit codepoints.net/U+1D412 you will see a wealth of information about this character. Of relevance to this blog is the Decomposition Mapping which is to the, oh so familiar, ASCII uppercase capital S. This Unicode information can be used to compute string equivalents which can then be used for search thus providing all relevant results.

The moral of this "Sting in the Tale" is: If you do not already know it, you must learn Unicode, it is essential.