Comment by fainpul a day ago

Recently I compared Unicode handling in Rust, Swift and Go out of curiosity. Sharing it here in the hope someone finds it useful:

Get the bytes of a string's UTF-8 encoding

Only ASCII characters map 1:1 to their UTF-8 encoding. Everything else expands to multiple bytes.

https://en.wikipedia.org/wiki/UTF-8#Description

  Rust
  line.bytes()  // iterator over bytes
  // or line.as_bytes() for a &[u8] slice

  Swift
  line.utf8

  Go
  []byte(line)
  // or index directly: line[i] is a byte
  // assumes line is valid UTF-8, which Go does not enforce

Get Unicode codepoints of string

Most characters and emojis consist of a single codepoint. Some are made up of multiple codepoints.

Unless the input is guaranteed to contain only single-codepoint characters, this is not a safe way to iterate over what users would consider characters.

A codepoint is a number in the range U+0000 to U+10FFFF, so it fits in 4 bytes. It is usually stored internally as a u32 or i32, but exposed to the programmer through a dedicated type.

  Rust
  line.chars()
  // https://doc.rust-lang.org/std/primitive.char.html

  Swift
  line.unicodeScalars
  // https://developer.apple.com/documentation/swift/unicode/scalar

  Go
  []rune(line)
  // or iterate with range
  for index, runeValue := range line {
    fmt.Printf("%#U starts at byte position %d\n", runeValue, index)
  }
  // https://go.dev/blog/strings
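A runnable Go sketch of a character that is made up of multiple codepoints (stdlib only; the example character is the a̐ used below):

```go
package main

import "fmt"

func main() {
	s := "a\u0310" // LATIN SMALL LETTER A + COMBINING CANDRABINDU, renders as a̐
	for i, r := range s {
		fmt.Printf("%#U starts at byte %d\n", r, i)
	}
	fmt.Println(len([]rune(s)), "codepoints in", len(s), "bytes") // 2 codepoints in 3 bytes
}
```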

Get extended grapheme clusters of string

What a reader would actually consider to be a character. E.g., this character consists of two codepoints but is one grapheme cluster: a̐

  Rust
  use unicode_segmentation::UnicodeSegmentation;
  line.graphemes(true)

  Swift
  for ch in line {
    print(ch)
  }
  // This is the default view - just iterate over string (or map, filter etc.)
  // In Swift, a `Character` is a grapheme cluster.
  // https://developer.apple.com/documentation/swift/string#Accessing-String-Elements

  Go
  // no standard library support; use the rivo/uniseg package:
  // https://pkg.go.dev/github.com/rivo/uniseg
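A stdlib-only Go sketch of why codepoint iteration is not enough here; the commented-out calls are from the uniseg package linked above:

```go
package main

import "fmt"

func main() {
	s := "a\u0310" // one grapheme cluster (a̐), but two codepoints
	// Ranging over the string yields each codepoint separately,
	// splitting what a reader perceives as a single character.
	for _, r := range s {
		fmt.Printf("%#U\n", r)
	}
	// With github.com/rivo/uniseg this yields the whole cluster instead:
	//   g := uniseg.NewGraphemes(s)
	//   for g.Next() { fmt.Println(g.Str()) }
}
```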

Normalize strings

A character like é can be represented in different forms: either as one codepoint (U+00e9) or as a combination of e + ◌́ (U+0065, U+0301).

Some characters are defined multiple times with different names: Ω can be found as "greek capital letter omega" (U+03a9) and as "ohm sign" (U+2126).

Normalization converts a string to use only one of those forms and is required to consistently compare strings.

  Rust
  use unicode_normalization::UnicodeNormalization;
  line.nfc()
  line.nfd()

  Swift
  line.precomposedStringWithCanonicalMapping
  line.decomposedStringWithCanonicalMapping

  Go
  // no standard library support; use golang.org/x/text/unicode/norm:
  // https://pkg.go.dev/golang.org/x/text/unicode/norm
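A stdlib-only Go sketch of why normalization matters for comparison; the commented-out call is from the x/text package linked above:

```go
package main

import "fmt"

func main() {
	composed := "\u00e9"    // é as a single codepoint
	decomposed := "e\u0301" // e followed by combining acute accent
	fmt.Println(composed == decomposed)         // false: different byte sequences
	fmt.Println(len(composed), len(decomposed)) // 2 3
	// golang.org/x/text/unicode/norm makes them comparable:
	//   norm.NFC.String(decomposed) == composed
}
```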

Remove diacritics

This can be considered a lossy form of normalization, useful e.g. for search, matching and generating ASCII identifiers.

  Rust
  use diacritics::remove_diacritics;
  remove_diacritics(line)

  Swift
  line.applyingTransform(.stripDiacritics, reverse: false)
  // and others to transform between alphabets etc.
  // https://developer.apple.com/documentation/Foundation/StringTransform
anonnon a day ago

You probably want ICU4X if you're working with Unicode in Rust. It's fast, has a tolerable overhead, and its lead developers have experience doing i18n work at Mozilla and Google and are involved with the Unicode Consortium.