The codepoints and the UTF-8 bytes are all written in hexadecimal (hex). The binary bits are an intermediate form used for encoding and decoding.
We start with the following form, which is designed for encoding Unicode codepoints to UTF-8 and decoding UTF-8 back to Unicode codepoints.
The first thing we can do is fill in the fixed bits, which are defined by the encoding scheme. I have entered them in red to distinguish them from the variable bits.
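For readers who prefer text to a form, the fixed bits can be written out as bit templates. This is only a sketch: the names `TEMPLATES` and `free_bits` are my own for illustration, and `x` marks a variable bit.

```python
# Bit templates for each row of the form.
# '0' and '1' are the fixed bits; 'x' is a free (variable) bit.
TEMPLATES = {
    1: ["0xxxxxxx"],
    2: ["110xxxxx", "10xxxxxx"],
    3: ["1110xxxx", "10xxxxxx", "10xxxxxx"],
    4: ["11110xxx", "10xxxxxx", "10xxxxxx", "10xxxxxx"],
}


def free_bits(row):
    """Count the variable bits available in the given row (1 to 4 bytes)."""
    return sum(template.count("x") for template in TEMPLATES[row])


for row in TEMPLATES:
    print(row, "byte row:", " ".join(TEMPLATES[row]), "free bits:", free_bits(row))
```

Note that every byte after the first in a row starts with the fixed bits `10`, while the first byte's leading bits indicate how many bytes the row has.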
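So, how do we determine where the codepoints go on the form? We need to count the free bits in each row to determine the range of values it can accommodate. Each row's range starts just above the previous row's maximum, because a codepoint always goes in the shortest row that can hold it.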
- 1 byte row - 7 free variable bits giving a range of 0 ➔ 7F
- 2 byte row - 11 free variable bits giving a range of 80 ➔ 7FF
- 3 byte row - 16 free variable bits giving a range of 800 ➔ FFFF
- 4 byte row - 21 free variable bits giving a range of 10000 ➔ 1FFFFF (in practice the row stops at 10FFFF, the maximum value of a Unicode codepoint)
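The ranges above translate fairly directly into code. The following is a minimal sketch, not a production encoder: `encode_utf8` is a name I have made up for illustration, and it does not reject the surrogate range D800 ➔ DFFF, which a strict encoder would.

```python
def encode_utf8(cp):
    """Encode one codepoint (an int) to UTF-8 bytes by picking the row from its range."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("codepoint out of range: " + hex(cp))
    if cp <= 0x7F:
        # 1 byte row: 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 2 byte row: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 3 byte row: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 byte row: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(encode_utf8(0x0444).hex().upper())  # D184
print(encode_utf8(0x0057).hex().upper())  # 57
```

The OR with `0xC0`, `0xE0`, `0xF0` or `0x80` sets the fixed bits, and the shifts and the `0x3F` mask drop six variable bits at a time into each following byte.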
It is a Unicode convention to write codepoints with a minimum of four hex digits, so codepoints below U+1000 should be left-padded with zeroes. Hence my entries U+0444 and U+0057 rather than U+444 and U+57.
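As a small illustration of the padding convention, here is a sketch of a formatter. The function name `format_codepoint` is hypothetical; the padding itself uses Python's standard format specification.

```python
def format_codepoint(cp):
    """Write a codepoint as U+ followed by at least four uppercase hex digits."""
    # 04X pads with zeroes to a minimum width of 4; longer values are left untouched.
    return f"U+{cp:04X}"


print(format_codepoint(0x444))    # U+0444
print(format_codepoint(0x57))     # U+0057
print(format_codepoint(0x1F600))  # U+1F600
```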