Comment by chrismorgan 3 days ago
> indexing by bytes instead of UTF-8 code units
When the encoding is UTF-8 (which it is here), the code unit is the byte.
They called the fields byteStart and byteEnd, but more technically precise (no more or less accurate, just more precise) labels would be utf8CodeUnitStart and utf8CodeUnitEnd.
Sorry, I keep mixing these up. I meant bytes instead of scalars; scalars, I think, would be more natural to iterate over in most languages (at least the ones I use).
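To make the distinction concrete, here is a rough Python sketch of the three ways of counting discussed above. The sample string and the helper names (scalar_to_byte_offset, slice_by_bytes) are made up for illustration and are not part of any real API; the only assumption about byteStart/byteEnd is that they are offsets into the UTF-8 encoding of the text.

```python
# Sample text with 1-byte, 2-byte and 4-byte UTF-8 sequences (illustrative only).
text = "héllo 👋 wörld"

utf8 = text.encode("utf-8")

# UTF-8 code units are 8 bits wide, so counting code units is counting bytes.
utf8_code_units = len(utf8)

# Python's str is indexed by code point; for well-formed text (no lone
# surrogates) that is the same as counting Unicode scalar values.
scalars = len(text)

# UTF-16 code units are 16 bits wide; the emoji needs a surrogate pair.
utf16_code_units = len(text.encode("utf-16-le")) // 2


def scalar_to_byte_offset(s: str, scalar_index: int) -> int:
    """Hypothetical helper: convert a scalar index into a UTF-8 byte offset."""
    return len(s[:scalar_index].encode("utf-8"))


def slice_by_bytes(s: str, byte_start: int, byte_end: int) -> str:
    """Hypothetical helper: slice by UTF-8 byte offsets (byteStart/byteEnd style).

    Raises UnicodeDecodeError if an offset lands inside a multi-byte sequence.
    """
    return s.encode("utf-8")[byte_start:byte_end].decode("utf-8")


wave = text.index("👋")  # scalar index of the emoji
byte_start = scalar_to_byte_offset(text, wave)
byte_end = scalar_to_byte_offset(text, wave + 1)

print(utf8_code_units, scalars, utf16_code_units)
print(byte_start, byte_end, slice_by_bytes(text, byte_start, byte_end))
```

The same span has different start and end numbers depending on whether you count bytes, scalars, or UTF-16 code units, which is why the unit has to be spelled out somewhere. Byte offsets are cheap to apply to the encoded buffer, but a bad offset can land in the middle of a character, which is what the decode error in slice_by_bytes would surface.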