Thursday, 25 August 2016

Internationalizing Regular Expressions

The purpose of this post is to encourage everyone who is teaching or learning Regular Expressions (RegExp) to think international. Thinking international means thinking beyond ASCII and working in Unicode instead. Once you think in Unicode, your regular expressions can cover the writing systems of the whole world.

My RegExp teaching slides use ASCII only as a starting point and then progress to Unicode. I give one of my slides as an example.

There is a lot of information packed into this one slide, so it needs some explanation. The slide uses Unicode Chinese characters and Unicode emoji characters.
  • 人 is a Unicode Chinese character meaning person
  • 鸭 is a Unicode Chinese character meaning duck
  • 鸡 is a Unicode Chinese character meaning chicken
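As a minimal sketch of what matching these characters looks like in practice, the Python snippet below puts the three characters into a character class and then compares Unicode-aware matching with ASCII-only matching. It is not the content of the slide: the sample strings and variable names are invented here purely for illustration.

```python
import re

# The three Chinese characters explained above, used as a character class.
SLIDE_CHARS = "人鸭鸡"
char_class = re.compile(f"[{SLIDE_CHARS}]")

# Hypothetical sample text, made up for this example.
sample = "人 saw a 鸭 and a 鸡 near the pond"
found = char_class.findall(sample)
print(found)

# In Python 3, \w on a str pattern is Unicode-aware by default,
# so it matches Chinese characters as well as ASCII letters.
unicode_words = re.findall(r"\w+", "人鸭鸡 duck")

# With re.ASCII, \w shrinks back to [a-zA-Z0-9_] and the
# Chinese characters are no longer matched.
ascii_words = re.findall(r"\w+", "人鸭鸡 duck", flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```

The contrast between the last two results is the point of thinking beyond ASCII: the same pattern either sees the Chinese text or silently skips it, depending on whether the engine is working in Unicode.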
This slide also contains a cultural reference. Some time ago I came across a Weibo 微博 post about the visit to Hong Kong by the big floating yellow duck. The Weibo post had a photo of many people looking at the duck. The text of the Weibo post was:


When I saw this I thought it was so funny and very clever. It just would not work in English, but it works perfectly in Chinese. When writing my RegExp slides I remembered this Weibo post and thought it would make an excellent cultural connection. Thus my slide is internationalized by using Unicode and by incorporating a cultural reference. The use of Unicode is essential for internationalization. Incorporating a cultural reference is optional, but it adds an extra dimension that may make RegExp slides more interesting and encourage readers to explore the potential of internationalized Regular Expressions.

Note: I have tried to find the Weibo post again but without success, so unfortunately I cannot provide a reference.