charset.school

UTF-16 sandbox (decode)

Convert a sequence of UTF-16 bytes into a Unicode code point, given the endianness.

Enter the bytes in hex, with or without the 0x prefix. UTF-16 uses 2 or 4 bytes per code point.
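
As an illustration of the input rule above, here is a minimal sketch of how such hex input could be parsed. The function name `parse_hex_bytes` and its error messages are hypothetical and not taken from the sandbox itself; it assumes the bytes are separated by whitespace.

```python
def parse_hex_bytes(text: str) -> bytes:
    """Parse whitespace-separated hex bytes, each with or without a 0x prefix."""
    values = []
    for token in text.split():
        # Drop an optional 0x / 0X prefix before parsing the digits.
        digits = token[2:] if token.lower().startswith("0x") else token
        value = int(digits, 16)  # raises ValueError on non-hex input
        if not 0 <= value <= 0xFF:
            raise ValueError(f"not a single byte: {token!r}")
        values.append(value)
    # UTF-16 encodes one code point as one or two 16-bit code units.
    if len(values) not in (2, 4):
        raise ValueError("UTF-16 needs 2 or 4 bytes per code point")
    return bytes(values)


print(parse_hex_bytes("0x3C 0xD8 0x89 0xDF").hex(" "))
```

Both `0x3C 0xD8 0x89 0xDF` and `3c d8 89 df` yield the same four bytes under this sketch.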

Endianness

Little Endian (LE)

U+1F389 🎉 (4 bytes)

Decimal

127881

Bytes

0x3C 0xD8 0x89 0xDF

Binary

00111100
11011000
10001001
11011111
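
The decimal, byte, and binary views above can be reproduced with Python's built-in `utf-16-le` codec. This is only a sketch for checking the values; the variable names are illustrative and the sandbox may compute them differently.

```python
code_point = 0x1F389

# Encode the character as UTF-16, little endian, with no byte order mark.
encoded = chr(code_point).encode("utf-16-le")

decimal_view = str(code_point)
bytes_view = " ".join(f"0x{b:02X}" for b in encoded)
binary_view = [f"{b:08b}" for b in encoded]

print(decimal_view)
print(bytes_view)
print("\n".join(binary_view))
```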

Step-by-step breakdown

  1. Given endianness

    Each 16-bit code unit is read low-order byte first (Little Endian), so the bytes 0x3C 0xD8 form the code unit 0xD83C and the bytes 0x89 0xDF form 0xDF89.

    Little Endian (LE)
  2. Identify the code unit count

    4 bytes means two 16-bit code units, i.e. a surrogate pair encoding a code point beyond the BMP (above U+FFFF). The high surrogate lies in the range 0xD800 to 0xDBFF, the low surrogate in 0xDC00 to 0xDFFF.

    2 code units · surrogate pair
  3. Extract the useful bits from each surrogate

    Subtract 0xD800 from the high surrogate to get the 10 high bits (0xD83C - 0xD800 = 0x03C), and 0xDC00 from the low surrogate to get the 10 low bits (0xDF89 - 0xDC00 = 0x389).

    0000111100 | 1110001001
  4. Reassemble the binary

    Concatenate the two 10-bit values into the 20-bit offset shown below (0x0F389). Adding 0x10000 to this offset recovers the code point.

    00001111001110001001
  5. Convert to a code point

    The 20-bit value is 0x0F389 (62345 in decimal). Adding 0x10000 gives 0x1F389, which is 127881 in decimal, i.e. U+1F389 in Unicode notation.

    U+1F389
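
The five steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the sandbox's actual implementation: the function name `decode_utf16_code_point` and its `byteorder` parameter are hypothetical, and the input is assumed to hold exactly one code point.

```python
def decode_utf16_code_point(data: bytes, byteorder: str = "little") -> int:
    """Decode 2 or 4 UTF-16 bytes into a single Unicode code point."""
    if len(data) not in (2, 4):
        raise ValueError("expected 2 or 4 bytes")

    # Step 1: read each 16-bit code unit using the given endianness.
    units = [
        int.from_bytes(data[i:i + 2], byteorder)
        for i in range(0, len(data), 2)
    ]

    # Step 2: one code unit is a BMP code point, two form a surrogate pair.
    if len(units) == 1:
        unit = units[0]
        if 0xD800 <= unit <= 0xDFFF:
            raise ValueError("lone surrogate")
        return unit

    high, low = units
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")

    # Step 3: extract the 10 useful bits from each surrogate.
    high_bits = high - 0xD800
    low_bits = low - 0xDC00

    # Step 4: concatenate into a 20-bit offset.
    offset = (high_bits << 10) | low_bits

    # Step 5: add 0x10000 to obtain the code point.
    return offset + 0x10000


result = decode_utf16_code_point(bytes([0x3C, 0xD8, 0x89, 0xDF]), "little")
print(f"U+{result:04X}", result)
```

One way to cross-check the result is Python's built-in codec: `bytes([0x3C, 0xD8, 0x89, 0xDF]).decode("utf-16-le")` returns the same character as `chr(result)`.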

Teaching tool. No tracking, no ads.

Developed by Florent Sorel