Comment by solardev 2 days ago

I don't understand the difference between a character, a code point, a glyph, and whatever else makes up a single "thing" in Unicode.

Rendello 2 days ago

That tripped me up too. The Unicode Core spec is quite good at explaining things and introduces some terminology you don't really hear outside the document. Chapter 2, General Structure, is worth reading in its entirety. I've linked some bits that might help:

> *2.2.3 Characters, Not Glyphs*

> The Unicode Standard draws a distinction between characters and glyphs. Characters are the abstract representations of the smallest components of written language that have semantic value. They represent primarily, but not exclusively, the letters, punctuation, and other signs that constitute natural language text and technical notation. [...] Letters in different scripts, even when they correspond either semantically or graphically, are represented in Unicode by distinct characters.

> Characters are represented by code points that reside only in a memory representation, as strings in memory, on disk, or in data transmission. The Unicode Standard deals only with character codes.

> *2.4 Code Points and Characters*

> The range of integers used to code the abstract characters is called the codespace. A particular integer in this set is called a code point. When an abstract character is mapped or assigned to a particular code point in the codespace, it is then referred to as an encoded character.
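To make the code point idea concrete, here is a minimal Python sketch. It relies on the fact that a Python 3 `str` is a sequence of code points, so `ord` and `chr` convert between an encoded character and its integer. The helper name `code_points` is just something I made up for illustration.

```python
import sys


def code_points(text: str) -> list[str]:
    """Return the U+XXXX notation for each code point in text."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Escapes are used so the example does not depend on how this post is encoded.
print(code_points("A\u00e9\u20ac"))  # ['U+0041', 'U+00E9', 'U+20AC']

# chr() goes the other way: integer in the codespace -> encoded character.
print(chr(0x1F600))

# The codespace tops out at U+10FFFF.
print(hex(sys.maxunicode))  # 0x10ffff
```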

> *2.5 Encoding Forms*

This deals with UTF-{8,16,32}, which is a tricky bit and tripped me up for a long time. If the document is too dense here, there's a lot of supplementary material online explaining the different forms. I've linked a Tom Scott video explaining UTF-8.
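If it helps, here is a rough Python sketch comparing how many bytes one code point takes in each encoding form. `encoded_sizes` is a hypothetical helper, and I'm using the little-endian codec names so Python doesn't prepend a byte order mark and skew the counts.

```python
def encoded_sizes(ch: str) -> dict[str, int]:
    """Bytes needed to store a single code point in each encoding form."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }


# One sample from each UTF-8 length class: 1, 2, 3 and 4 bytes.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))

# Above U+FFFF, UTF-16 needs two 16-bit units (a surrogate pair).
print("\U0001F600".encode("utf-16-be").hex())  # d83dde00
```

UTF-32 is always 4 bytes, UTF-16 is 2 or 4, and UTF-8 ranges from 1 to 4.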

---

The long and short of it is: the atomic unit of Unicode is the character, or more precisely the encoded character, which is an abstract character that has been assigned to a code point. A code point is an integer in the codespace (0 to 0x10FFFF), usually written in hex as U+XXXX. Unicode doesn't deal with glyphs or graphical representations, just characters and their properties (e.g. what is the character's name? what should this character do when uppercased?).

As you probably know, many characters can combine with others to form grapheme clusters, which may look like a single (abstract) character, but underneath consist of multiple (encoded) characters.

Every encoded character is associated with a code point, and those integers can be represented in three encoding forms (this sort of happened by accident): UTF-32 (just store the integer directly in 32 bits), UTF-16 (16-bit units were originally supposed to hold the integer directly, but there turned out to be too many code points, so it was extended with surrogate pairs), and UTF-8 (which uses one to four bytes per code point, so common characters like ASCII stay small).
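Here's a small Python sketch of the "one visible thing, several encoded characters" point, plus a couple of character properties, using the standard `unicodedata` module. `describe` is a made-up helper name. As far as I know the standard library doesn't split text into grapheme clusters, so this only shows the code point level.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str]]:
    """Pair each code point in text with its official character name."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]


# 'e' followed by a combining accent: renders as one "thing", stored as two.
decomposed = "e\u0301"
# NFC normalization composes the pair into the single precomposed character.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), describe(decomposed))
print(len(composed), describe(composed))

# Properties drive behavior too: uppercasing U+00DF yields two characters.
print("\u00df".upper())  # SS
```

The two strings render the same but compare unequal until you normalize them, which is a common source of bugs.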

[spec] https://www.unicode.org/versions/Unicode16.0.0/core-spec/

[2.2.3] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.4] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[2.5] https://www.unicode.org/versions/Unicode16.0.0/core-spec/cha...

[Tom Scott UTF-8] https://www.youtube.com/watch?v=MijmeoH9LT4