UTF-16 sandbox (decode)
Convert a sequence of UTF-16 bytes into a Unicode code point, given the endianness.
Enter the bytes in hex (with or without the 0x prefix). UTF-16 uses 2 or 4 bytes per code point.
Endianness: Little Endian (LE)
Code point: U+1F389 🎉 (4 bytes)
Decimal: 127881
Bytes: 0x3C 0xD8 0x89 0xDF
Binary: 00111100 11011000 10001001 11011111
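As a rough illustration of the input format described above, the sketch below parses a string of hex bytes (with or without the 0x prefix) and prints each byte in binary. The function name `parse_hex_bytes` is hypothetical and not part of the sandbox itself; whitespace-separated tokens are an assumption about the input.

```python
def parse_hex_bytes(text: str) -> bytes:
    """Parse whitespace-separated hex bytes, each with or without a 0x prefix.

    Hypothetical helper for illustration; the sandbox's real parser is not shown.
    """
    values = []
    for token in text.split():
        value = int(token, 16)  # base 16 accepts an optional 0x prefix
        if not 0 <= value <= 0xFF:
            raise ValueError(f"not a single byte: {token!r}")
        values.append(value)
    return bytes(values)


data = parse_hex_bytes("0x3C 0xD8 0x89 0xDF")
# Render each byte as 8 binary digits, as in the Binary readout.
print(" ".join(f"{b:08b}" for b in data))
# prints: 00111100 11011000 10001001 11011111
```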
Step-by-step breakdown
- 01
Given endianness
Each code unit is read low-order byte first (Little Endian), so the byte pairs 3C D8 and 89 DF give the code units 0xD83C and 0xDF89.
Result: Little Endian (LE)
- 02
Identify the code unit count
4 bytes = a surrogate pair, used for a code point beyond the BMP. The high code unit is in 0xD800→0xDBFF, the low one in 0xDC00→0xDFFF.
Result: 2 code units · surrogate pair
- 03
Extract the useful bits from each surrogate
Subtract 0xD800 from the high surrogate (10 high bits) and 0xDC00 from the low surrogate (10 low bits): 0xD83C - 0xD800 = 0x03C and 0xDF89 - 0xDC00 = 0x389.
Result: 0000111100 | 1110001001
- 04
Reassemble the binary
Concatenate the 10 + 10 bits into a 20-bit value, then add 0x10000 to recover the code point.
Result: 00001111001110001001 (0xF389, before adding 0x10000)
- 05
Convert to a code point
After adding 0x10000, the value is 0x1F389, which is 127881 in decimal, i.e. U+1F389 in Unicode notation.
Result: U+1F389
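The five steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the sandbox's actual code: the function name `decode_utf16_code_point` and its `little_endian` parameter are hypothetical, and the input is assumed to hold exactly one code point (2 or 4 bytes).

```python
def decode_utf16_code_point(data: bytes, little_endian: bool = True) -> int:
    """Decode one code point from 2 or 4 UTF-16 bytes (hypothetical helper)."""
    if len(data) not in (2, 4):
        raise ValueError("UTF-16 uses 2 or 4 bytes per code point")

    # Step 01: read each 16-bit code unit using the given endianness.
    order = "little" if little_endian else "big"
    units = [int.from_bytes(data[i:i + 2], order) for i in range(0, len(data), 2)]

    # Step 02: one code unit is a BMP code point, two form a surrogate pair.
    if len(units) == 1:
        if 0xD800 <= units[0] <= 0xDFFF:
            raise ValueError("lone surrogate")
        return units[0]
    high, low = units
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")

    # Step 03: keep the 10 useful bits of each surrogate.
    high_bits = high - 0xD800
    low_bits = low - 0xDC00

    # Steps 04 and 05: concatenate the 10 + 10 bits, then add 0x10000.
    return ((high_bits << 10) | low_bits) + 0x10000


cp = decode_utf16_code_point(bytes([0x3C, 0xD8, 0x89, 0xDF]))
print(f"U+{cp:04X}", cp, chr(cp))
# prints: U+1F389 127881 🎉
```

The result can be cross-checked against Python's built-in codec: `bytes([0x3C, 0xD8, 0x89, 0xDF]).decode("utf-16-le")` yields the same single character.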