Dark Corners of Unicode (2015)
(eev.ee) 16 points by cratermoon 5 days ago
Previous discussion: https://news.ycombinator.com/item?id=13149705
And don't miss [this comment](https://news.ycombinator.com/item?id=13149912). The future is now!
Recently I compared Unicode handling in Rust, Swift and Go out of curiosity. Sharing it here in the hope someone finds it useful:
Get bytes representing utf8-encoding of string
Only ASCII characters encode to a single byte in UTF-8. Every other codepoint takes two to four bytes.
https://en.wikipedia.org/wiki/UTF-8#Description
Rust
line.bytes()
Swift
line.utf8
Go
[]byte(line) // copy of the string's bytes; indexing line[i] also yields bytes
// Go does not enforce that a string holds valid UTF-8; check with utf8.ValidString(line)
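To make the byte counts concrete, here is a minimal Go sketch using only the standard library. The helper name `utf8Bytes` and the sample strings are my own choices for illustration, not part of the comparison above.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// utf8Bytes (hypothetical helper) returns a copy of the bytes stored in s.
// Go strings are plain byte sequences, so nothing is re-encoded here.
func utf8Bytes(s string) []byte {
	return []byte(s)
}

func main() {
	// One sample each for a 1-, 2-, 3- and 4-byte encoding.
	samples := []string{"a", "\u00e9", "\u20ac", "\U0001F600"}
	for _, s := range samples {
		fmt.Printf("%q -> % x (%d bytes)\n", s, utf8Bytes(s), len(s))
	}

	// A Go string may hold arbitrary bytes, so validity is checked explicitly.
	fmt.Println(utf8.ValidString("ok"), utf8.ValidString("\xff"))
}
```

Note that `len(s)` on a Go string counts bytes, not characters, which is why it matches the length of the byte slice.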
Get Unicode codepoints of string
Most characters and emojis consist of a single codepoint, but some are made up of several.
Unless the input is guaranteed to contain only single-codepoint characters, iterating over codepoints is not a safe way to iterate over what users would consider characters.
A codepoint fits in 21 bits (the maximum is U+10FFFF) and is usually stored in a 32-bit integer (u32 or i32), though each language exposes it through a different API.
Rust
line.chars()
// https://doc.rust-lang.org/std/primitive.char.html
Swift
line.unicodeScalars
// https://developer.apple.com/documentation/swift/unicode/scalar
Go
[]rune(line)
// or iterate with range
for index, runeValue := range line {
	fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
}
// https://go.dev/blog/strings
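As a rough illustration of the gap between bytes and codepoints, the Go sketch below decodes a string consisting of "a" followed by a combining mark (U+0310). The helper name `codePoints` is an assumption of mine; the rest is standard library.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// codePoints (hypothetical helper) decodes s into its Unicode codepoints.
// Each invalid UTF-8 byte decodes to U+FFFD, the replacement character.
func codePoints(s string) []rune {
	return []rune(s)
}

func main() {
	s := "a\u0310" // "a" followed by COMBINING CANDRABINDU

	fmt.Println(len(s))                    // 3 (bytes)
	fmt.Println(utf8.RuneCountInString(s)) // 2 (codepoints)

	for _, r := range codePoints(s) {
		fmt.Printf("%U\n", r) // U+0061, then U+0310
	}
}
```

A reader still sees one character here, so neither the byte count nor the codepoint count matches what a user would call the length.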
Get extended grapheme clusters of string
An extended grapheme cluster is what a reader would actually consider to be a character. For example, this character consists of two codepoints but is one grapheme cluster: a̐
Rust
use unicode_segmentation::UnicodeSegmentation;
line.graphemes(true) // true selects extended grapheme clusters
Swift
for ch in line {
    print(ch)
}
// This is the default view - just iterate over string (or map, filter etc.)
// In Swift, a `Character` is a grapheme cluster.
// https://developer.apple.com/documentation/swift/string#Accessing-String-Elements
Go
// https://pkg.go.dev/github.com/rivo/uniseg
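Go's standard library has no grapheme segmentation, hence the link to a third-party package. To show what such a package has to do, here is a deliberately naive stand-in that only attaches nonspacing marks (category Mn) to the preceding rune. It is not the Unicode segmentation algorithm: it will split emoji ZWJ sequences, flags, Hangul jamo and more, so treat it as an illustration only. The function name `naiveClusters` is made up for this sketch.

```go
package main

import (
	"fmt"
	"unicode"
)

// naiveClusters (hypothetical helper) glues each nonspacing mark (Mn) onto
// the rune before it. This is NOT full grapheme cluster segmentation; it
// only handles the simple "base letter + combining marks" case.
func naiveClusters(s string) []string {
	var clusters []string
	for _, r := range s {
		if unicode.Is(unicode.Mn, r) && len(clusters) > 0 {
			clusters[len(clusters)-1] += string(r)
			continue
		}
		clusters = append(clusters, string(r))
	}
	return clusters
}

func main() {
	s := "a\u0310bc" // 4 codepoints, 3 user-perceived characters
	fmt.Println(len([]rune(s)), len(naiveClusters(s)))
}
```

For real input, a library that implements the full segmentation rules, such as the one linked above, is the safer choice.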
Normalize strings
A character like é can be represented in different forms: either as one codepoint (U+00e9) or as a combination of e + ◌́ (U+0065, U+0301).
Some characters are defined multiple times with different names: Ω can be found as "greek capital letter omega" (U+03a9) and as "ohm sign" (U+2126).
Normalization converts a string to use only one of those forms and is required to consistently compare strings.
Rust
use unicode_normalization::UnicodeNormalization;
line.nfc() // iterator over chars; collect::<String>() to get a String
line.nfd()
Swift
import Foundation
line.precomposedStringWithCanonicalMapping // NFC
line.decomposedStringWithCanonicalMapping // NFD
Go
// https://pkg.go.dev/golang.org/x/text/unicode/norm
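The Go standard library does not normalize strings, so the sketch below only demonstrates the problem normalization solves: strings that are canonically equivalent still compare unequal byte for byte. The helper name `sameBytes` is my own. Actual normalization would go through the linked package, which provides forms such as `norm.NFC` and `norm.NFD`.

```go
package main

import "fmt"

// sameBytes (hypothetical helper) reports whether a and b are byte-for-byte
// identical, which is exactly what == checks on Go strings.
func sameBytes(a, b string) bool {
	return a == b
}

func main() {
	composed := "\u00e9"    // é as a single codepoint
	decomposed := "e\u0301" // e + COMBINING ACUTE ACCENT

	fmt.Println(composed, decomposed)            // typically render the same
	fmt.Println(sameBytes(composed, decomposed)) // false
	fmt.Println(len([]rune(composed)), len([]rune(decomposed))) // 1 2

	// GREEK CAPITAL LETTER OMEGA vs OHM SIGN: same glyph, different codepoints.
	fmt.Println(sameBytes("\u03a9", "\u2126")) // false
}
```

Normalizing both sides to the same form before comparing is what makes these pairs compare equal.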
Remove diacritics
Removing diacritics can be considered a destructive form of normalization, which can be useful in some cases, such as accent-insensitive search.
Rust
use diacritics::remove_diacritics;
remove_diacritics(line)
Swift
line.applyingTransform(.stripDiacritics, reverse: false)
// and others to transform between alphabets etc.
// https://developer.apple.com/documentation/Foundation/StringTransform
Worth noting that the addition of the interlinear annotation characters was quite controversial, with many commenting that this simply is not plain text and as such does not belong in Unicode. I'm not clear on how it made it in anyway, but it sure seems like the Unicode Consortium now somewhat agrees, as while they haven't formally deprecated the characters, they have kind of discouraged their use.