Saturday 11 June 2011

Twitter i18n

Twitter has a number of internationalization (i18n) issues. I use Twitter directly from a browser. If you use a Twitter client, your experience could differ from mine: a client could, for instance, reconstruct Unicode domain names from their Punycode form. I deliberately use Twitter from a browser, without any Twitter-related browser extensions, so that I can see what is happening at the base level. Here are the i18n issues I have encountered so far.

Hashtags

If I use an ASCII hashtag then it works as expected. If, though, I use non-ASCII Unicode characters then Twitter does not recognize the text as a hashtag. For example, #loughborough works but #ラフバラ does not.

Sina's Microblog 新浪微博, unlike Twitter, does have Unicode hashtags. Twitter uses a single # character as a prefix to the text, whereas 新浪微博 uses a pair of # characters to bracket the text. By way of example, #ラフバラ# is a valid Unicode hashtag which I have used on my 新浪微博 page, weibo.com/andreschappo

ラフバラ is Loughborough written in Japanese Katakana.
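
To illustrate the difference, here is a minimal Python sketch of the two hashtag styles. It assumes a hashtag is simply a run of "word" characters; the pattern and function names are my own for illustration, and this is not how Twitter or 新浪微博 actually parse hashtags.

```python
import re

# Prefix style (Twitter): a single leading '#'.
# With re.ASCII, \w only matches [A-Za-z0-9_], so non-ASCII text is ignored.
ASCII_HASHTAG = re.compile(r"#(\w+)", re.ASCII)

# Same pattern without re.ASCII: \w also matches Unicode letters and digits.
UNICODE_HASHTAG = re.compile(r"#(\w+)")

# Bracket style (新浪微博): text enclosed by a pair of '#' characters.
BRACKETED_HASHTAG = re.compile(r"#([^#\s]+)#")


def find_hashtags(pattern, text):
    """Return the hashtag texts (without '#') that `pattern` finds in `text`."""
    return pattern.findall(text)


tweet = "#loughborough #ラフバラ"
print(find_hashtags(ASCII_HASHTAG, tweet))
print(find_hashtags(UNICODE_HASHTAG, tweet))
print(find_hashtags(BRACKETED_HASHTAG, "#ラフバラ#"))
```

An ASCII-only pattern reproduces the behaviour I see on Twitter: the Katakana hashtag is simply not found.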

IDNs

Several countries now have functioning IDN ccTLDs. One that went live recently is Korea's .한국. Let's take the new IDN for Songpa District Office, Seoul, Korea. It should show on Twitter as a live clickable link, i.e. 송파구청.한국. Twitter, though, does not process it as a valid web address, so it just shows as plain text, i.e. http://송파구청.한국/

If, though, I use an IDN with an ASCII TLD then Twitter recognizes it as a valid web address and displays it as a live clickable link. See twitter.com/andreschappo/...
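
For reference, every label of an IDN, including the TLD, has an ASCII (Punycode) form, so a fully Unicode domain name can be resolved in the same way as one with an ASCII TLD. A minimal sketch, assuming Python's built-in idna codec (which implements the older IDNA 2003 rules); the function name is hypothetical and this says nothing about Twitter's own code:

```python
def to_ascii_host(host: str) -> str:
    """Convert a (possibly Unicode) host name to its ASCII Punycode form."""
    # The idna codec encodes each dot-separated label separately and
    # leaves labels that are already ASCII unchanged.
    return host.encode("idna").decode("ascii")


for host in ("송파구청.한국", "송파구청.kr", "twitter.com"):
    print(host, "->", to_ascii_host(host))
```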

Auto Shortening

Twitter are rolling out an automatic link shortening service. Link shortening (currently) kicks in at 11 characters, and the character count includes the http:// prefix. http://t.co is exactly 11 characters and one cannot have a shorter link, so in effect all tweet links will be shortened. Some time since yesterday I was included in the rollout, and so my links are now auto shortened by Twitter. This is not a problem for ASCII links, but a new problem arises with IDN links: the Punycode form is displayed in the tweet instead of the Unicode form. See twitter.com/andreschappo/...

Instead of displaying xn--2e0b569ap6hmmg.kr, Twitter should display the Unicode form 송파구청.kr. The underlying href link, as created by Twitter, is http://t.co/tzlxTh4
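
Going from the Punycode form back to the Unicode form for display is a simple decode. A minimal sketch, again assuming Python's built-in idna codec, with a hypothetical function name:

```python
def display_host(ascii_host: str) -> str:
    """Return the Unicode form of a Punycode host name for display."""
    # Labels starting with 'xn--' are decoded; plain ASCII labels pass through.
    return ascii_host.encode("ascii").decode("idna")


print(display_host("xn--2e0b569ap6hmmg.kr"))
```

The href can keep pointing at the shortened link; only the text shown in the tweet needs the Unicode form.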

URLs

Let's now add a pathname part to a domain name. Add an ASCII pathname and all is fine. Take an ASCII domain name and add a non-ASCII Unicode pathname, though, and Twitter fails to recognize the full URL.

In a tweet I want http://ta.gd/러프버러 to appear as ta.gd/러프버러. Instead it appears as ta.gd러프버러. The ASCII domain name is processed as a link, but the Unicode pathname is not treated as part of the URL and is just displayed as plain text. One ends up with an incomplete URL.
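
A minimal Python sketch of what I would expect instead: match the whole URL, Unicode pathname included, and percent-encode the path as UTF-8 for the underlying link while still displaying the Unicode form. The regular expression is a deliberately simplistic assumption and the names are hypothetical; this is not Twitter's actual link detection.

```python
import re
from urllib.parse import quote, urlsplit, urlunsplit

# Simplistic assumption: a URL runs from http(s):// to the next whitespace,
# whatever script the path is written in.
URL_PATTERN = re.compile(r"https?://[^\s/]+(?:/\S*)?")


def find_urls(text):
    """Return every URL found in `text`, including non-ASCII path parts."""
    return URL_PATTERN.findall(text)


def to_ascii_url(url):
    """Percent-encode the path of `url` as UTF-8; the host is left untouched."""
    parts = urlsplit(url)
    # '%' is kept safe so an already-encoded path is not encoded twice.
    ascii_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, ascii_path, parts.query, parts.fragment)
    )


for url in find_urls("Testing http://ta.gd/러프버러 in a tweet"):
    print(url, "->", to_ascii_url(url))
```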