Monday 20 February 2012

Language Characteristics

In this article I list some of the characteristics of natural languages and scripts as they are manifested and used in modern-day IT. Languages always have exceptions, so there will be some exceptions to these characteristics. I will not delve into linguistic technicalities such as the distinction between mora and syllable or between logogram and ideogram. I will take a broad-brush approach.

Arabic

  1. Arabic is written in the Arabic script
  2. Written from right to left
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms
  6. A Keyboard Mapping is sufficient in order to write Arabic
  7. The Arabic script is inherently cursive and hence is presented/displayed in its cursive form.
  8. Letters change shape according to their position within a word. These different shapes are named Initial, Medial, Final and Isolated forms. See en.wikipedia.org/wiki/Arabic_alphabet#Letter_forms
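
The direction and letter-form characteristics above can be inspected with Python's standard `unicodedata` module. This is a minimal sketch, not a shaping implementation: it uses the letter BEH (U+0628) as an arbitrary example and the compatibility "presentation form" code points U+FE8F to U+FE92, which exist in Unicode mainly for legacy encodings. In normal text only the base letter is stored, and the rendering engine picks the positional shape.

```python
import unicodedata

BEH = "\u0628"  # ARABIC LETTER BEH, the code point normally stored in text

# Bidirectional class: "AL" is a right-to-left Arabic letter, "L" is left-to-right.
bidi_arabic = unicodedata.bidirectional(BEH)
bidi_latin = unicodedata.bidirectional("b")
print(bidi_arabic, bidi_latin)

# U+FE8F..U+FE92 are the four compatibility presentation forms of BEH.
# Their decomposition looks like "<initial> 0628": a position tag plus the base letter.
forms = {}
for code_point in range(0xFE8F, 0xFE93):
    ch = chr(code_point)
    tag, base = unicodedata.decomposition(ch).split()
    forms[tag.strip("<>")] = (ch, base)

for position, (ch, base) in forms.items():
    print(f"{position:9} U+{ord(ch):04X} {unicodedata.name(ch)} -> U+{base}")

# Compatibility normalization folds every positional form back to the base letter.
folded = {unicodedata.normalize("NFKC", ch) for ch, _ in forms.values()}
```

All four positional forms fold back to the single stored letter, which is why searching and sorting work on the base code points rather than on the displayed shapes.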

Chinese

  1. Chinese is written in the Chinese script, which consists of hànzì (汉字) characters, of which there are tens of thousands
  2. Written from left to right. Once browsers implement CSS3 Writing Modes, we may well see some return to traditional vertical text in web pages. See dev.w3.org/csswg/css-writing-modes/#vertical-intro
  3. There is no space character separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+3002 IDEOGRAPHIC FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms
  6. An Input Method is required in order to write Chinese
  7. All characters, including punctuation, are monospaced. Thus, for example, the list-item separator in the text string "北京，南京，东京" is the single character U+FF0C FULLWIDTH COMMA. The text string "北京、南京、东京" uses the single character U+3001 IDEOGRAPHIC COMMA as the list-item separator.
  8. With respect to number of characters required to communicate, Chinese is much more compact than English. Given a sentence written in English, the same sentence written in Chinese would require far fewer characters. This compactness gives Chinese a significant advantage over English for IDNs and when microblogging.
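
The punctuation and compactness points above can be illustrated with a short Python sketch. The helper `split_items` and the sample strings are illustrative names chosen here, not part of any library. The sketch assumes that list items may be separated by an ASCII comma, U+FF0C or U+3001, and it compares character counts only for one place name, which is an anecdote rather than a measurement.

```python
import re
import unicodedata

# The two example strings from item 7, written with explicit escapes for the separators.
fullwidth_list = "北京\uFF0C南京\uFF0C东京"    # U+FF0C FULLWIDTH COMMA
ideographic_list = "北京\u3001南京\u3001东京"  # U+3001 IDEOGRAPHIC COMMA

# Assumption: these three characters are the only list separators we care about.
SEPARATORS = "[,\uFF0C\u3001]"

def split_items(text):
    """Split a list string on ASCII, fullwidth or ideographic commas."""
    return re.split(SEPARATORS, text)

# East Asian Width: "Na" narrow, "F" fullwidth, "W" wide.
for ch in (",", "\uFF0C", "\u3001", "\u3002", "北"):
    width = unicodedata.east_asian_width(ch)
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} width={width}")

# Compactness is about characters, not bytes: 北京 is 2 characters but 6 UTF-8 bytes.
chars_zh, chars_en = len("北京"), len("Beijing")
bytes_zh, bytes_en = len("北京".encode("utf-8")), len("Beijing".encode("utf-8"))
print(chars_zh, chars_en, bytes_zh, bytes_en)
```

Services that count characters (as microblogs typically do) reward the compactness, whereas anything limited by byte length sees a much smaller difference.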

English

  1. English is written in the Latin script
  2. Written from left to right
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Has uppercase and lowercase letter forms
  6. A Keyboard Mapping is sufficient in order to write English

Japanese

  1. Japanese is written in the Japanese scripts Kanji (漢字), Hiragana (ひらがな) and Katakana (カタカナ)
  2. Written from left to right. Once browsers implement CSS3 Writing Modes, we may well see some return to traditional vertical text in web pages. See dev.w3.org/csswg/css-writing-modes/#vertical-intro
  3. There is no space character separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+3002 IDEOGRAPHIC FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms. Uppercase is sometimes used for emphasis in English. Similarly, Katakana is sometimes used for emphasis.
  6. An Input Method is required in order to write Japanese
  7. In general, Japanese, like Chinese, is monospaced. The exception is that there are half-width forms of Katakana and some punctuation characters. The half-width forms are in the Unicode block Halfwidth and Fullwidth Forms (U+FF00 to U+FFEF).
  8. With respect to number of characters required to communicate, Japanese is much more compact than English. Given a sentence written in English, the same sentence written in Japanese would require far fewer characters. This compactness gives Japanese a significant advantage over English for IDNs and when microblogging.
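
The mix of scripts and the half-width Katakana forms described above can be explored with Python's `unicodedata` module. In this sketch the function `script_of` is a hypothetical helper that guesses the script from the Unicode character name, which is a rough heuristic rather than a proper script property lookup. The half-width sample string is ｶﾀｶﾅ, written with escapes so the code points are unambiguous.

```python
import unicodedata

def script_of(ch):
    """Rough script label derived from the start of the Unicode character name."""
    name = unicodedata.name(ch, "")
    if name.startswith("CJK UNIFIED IDEOGRAPH"):
        return "Kanji"
    if name.startswith("HIRAGANA"):
        return "Hiragana"
    if name.startswith(("KATAKANA", "HALFWIDTH KATAKANA")):
        return "Katakana"
    return "Other"

# One sentence fragment can mix all three scripts.
mixed = "漢字とカタカナ"
scripts = [script_of(ch) for ch in mixed]
print(list(zip(mixed, scripts)))

# Half-width Katakana KA TA KA NA from the Halfwidth and Fullwidth Forms block.
halfwidth = "\uFF76\uFF80\uFF76\uFF85"
widths_before = [unicodedata.east_asian_width(ch) for ch in halfwidth]  # "H" = halfwidth

# NFKC compatibility normalization maps them to the ordinary (wide) Katakana.
normalized = unicodedata.normalize("NFKC", halfwidth)
widths_after = [unicodedata.east_asian_width(ch) for ch in normalized]  # "W" = wide
print(halfwidth, widths_before, "->", normalized, widths_after)
```

Normalizing to NFKC before comparing or indexing text is one common way to treat the half-width and full-width spellings as the same word.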

Korean

  1. Korean is written in the Hangeul (한글) script
  2. Written from left to right
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Unicase, i.e. no uppercase and lowercase letter forms
  6. An Input Method is required in order to write Korean
  7. The individual Korean letters (jamo/자모) are grouped into and displayed as syllabic blocks, e.g. the individual jamo ㅎ ㅏ ㄴ ㄱ ㅜ ㄱ are combined to form the two Korean characters 한국
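
The grouping of jamo into syllable blocks can be sketched in Python. The function `compose_syllable` is an illustrative helper written for this example. It follows the arithmetic the Unicode Standard defines for precomposed Hangul syllables and expects conjoining jamo (U+1100 block), not the standalone compatibility jamo (U+3130 block) that are typically used when jamo are shown on their own. Combining those standalone jamo into syllables is the job of the Input Method.

```python
import unicodedata

# Constants from the Unicode Hangul syllable composition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
V_COUNT, T_COUNT = 21, 28

def compose_syllable(lead, vowel, tail=None):
    """Combine conjoining jamo (lead consonant, vowel, optional tail) into one syllable."""
    l_index = ord(lead) - L_BASE
    v_index = ord(vowel) - V_BASE
    t_index = ord(tail) - T_BASE if tail else 0
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)

han = compose_syllable("\u1112", "\u1161", "\u11AB")  # HIEUH + A + final NIEUN
guk = compose_syllable("\u1100", "\u116E", "\u11A8")  # KIYEOK + U + final KIYEOK
print(han + guk)

# Normalization performs the same mapping in both directions.
jamo = unicodedata.normalize("NFD", "한국")   # six conjoining jamo
recomposed = unicodedata.normalize("NFC", jamo)
print([f"U+{ord(ch):04X}" for ch in jamo], recomposed)

# Compatibility jamo are left untouched by NFC.
compat = "\u314E\u314F\u3134"  # standalone HIEUH, A, NIEUN
compat_nfc = unicodedata.normalize("NFC", compat)
```

The arithmetic and the NFC result agree, so either can be used to check that a sequence of conjoining jamo and a precomposed syllable represent the same text.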

Russian

  1. Russian is written in the Cyrillic (Кириллица) script
  2. Written from left to right
  3. The space character (U+0020 SPACE) is used as a separator between words and sentences
  4. The sentence terminator full stop is the Unicode character U+002E FULL STOP
  5. Has uppercase and lowercase letter forms
  6. A Keyboard Mapping is sufficient in order to write Russian
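
The case characteristic listed for each language can be checked mechanically. The sketch below is a simple heuristic using only built-in string methods: a text is treated as bicameral if its uppercase and lowercase conversions differ. The sample words and the helper name `has_case` are chosen here for illustration.

```python
# One sample word per language covered in this article.
samples = {
    "English": "Language",
    "Russian": "Кириллица",
    "Arabic": "العربية",
    "Chinese": "汉字",
    "Japanese": "ひらがな",
    "Korean": "한글",
}

def has_case(text):
    """True when the text contains letters with distinct upper and lower forms."""
    return text.upper() != text.lower()

cased = {language: has_case(text) for language, text in samples.items()}

for language, text in samples.items():
    print(f"{language:9} {text} upper={text.upper()} cased={cased[language]}")
```

Only the Latin and Cyrillic samples change under case conversion, so operations such as case-insensitive matching are no-ops for the unicase scripts.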