Comment by akoboldfrying

Comment by akoboldfrying 5 hours ago

1 reply

> Surely there's enough common short words like "a", "the", etc that that would be impractical?

Tokens don't have to correspond to words. The 2-character tokens "a " and " a" will cover all practical uses of the lowercase word "a". Yes, this does make some strings unrepresentable, such as the single-character string "a", but provided you have tokens "ab", "ba", "ac", "ca", etc., all other strings can be represented. In practice you won't have all such tokens, but this doesn't materially worsen the output provided the substrings that you cannot represent are all low-probability.

shawnz 5 hours ago

Ah yeah, factoring in the whitespace might make this a bit more practical