Comment by docmechanic

Comment by docmechanic 9 months ago

That’s only true if you tokenize words rather than characters. With character tokenization, the model can generate words outside the training vocabulary.

selfhoster11 9 months ago

All major tokenisers have explicit support for encoding arbitrary byte sequences. There's usually a consecutive range of tokens reserved for the bytes 0x00 to 0xFF, and you can use it to encode any novel UTF-8 words or structures, including emoji and characters that weren't part of the model's initial training, if you show it some examples.
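A minimal sketch of the byte-fallback idea, not the layout of any real tokeniser: the three word pieces, the `BYTE_OFFSET` constant and the `encode`/`decode` helpers are all made up for illustration. Known pieces get their own ids, and anything else is spelled out as UTF-8 bytes from a reserved 256-token range.

```python
# Toy vocabulary: a few "learned" pieces, then 256 byte tokens.
# The piece list and id layout are assumptions for illustration only.
PIECES = ["hello", " world", "token"]
BYTE_OFFSET = len(PIECES)  # byte value b is mapped to id BYTE_OFFSET + b


def encode(text: str) -> list[int]:
    """Greedy longest-match over PIECES, falling back to raw UTF-8 bytes."""
    ids = []
    i = 0
    while i < len(text):
        for piece in sorted(PIECES, key=len, reverse=True):
            if text.startswith(piece, i):
                ids.append(PIECES.index(piece))
                i += len(piece)
                break
        else:
            # No known piece starts here: emit one byte token per UTF-8 byte.
            for b in text[i].encode("utf-8"):
                ids.append(BYTE_OFFSET + b)
            i += 1
    return ids


def decode(ids: list[int]) -> str:
    """Turn ids back into text by concatenating the underlying bytes."""
    out = bytearray()
    for t in ids:
        if t < BYTE_OFFSET:
            out += PIECES[t].encode("utf-8")
        else:
            out.append(t - BYTE_OFFSET)
    return out.decode("utf-8")


text = "hello 🦙"           # the emoji is not in PIECES
ids = encode(text)
print(ids)                  # one piece id, then byte-token ids
print(decode(ids) == text)  # round-trips through the byte range
```

The point is only that the byte range makes the encoding total: every string has some token sequence, whether or not the model has ever seen that sequence in training.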


docmechanic 9 months ago

Pretty sure that we’re talking apples and oranges. Yes to the arbitrary byte sequences used by tokenizers, but that is not the topic of discussion. The question is whether the model will come up with words not in the training vocabulary. With word tokens it won’t, but with character tokens it can.
Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
“If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”
“If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary.”
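A small sketch of the distinction in the two quotes, using a made-up one-line corpus. The helper names `expressible_with_words` and `expressible_with_chars` are hypothetical; they only check whether an output could be spelled with each vocabulary, not whether a trained model would actually produce it.

```python
corpus = "the cat sat on the mat"

# Word-level vocabulary: every distinct whitespace-separated word.
word_vocab = set(corpus.split())
# Character-level vocabulary: every distinct character, including the space.
char_vocab = set(corpus)


def expressible_with_words(text: str) -> bool:
    """True if every word of `text` is itself a token in word_vocab."""
    return all(word in word_vocab for word in text.split())


def expressible_with_chars(text: str) -> bool:
    """True if every character of `text` is a token in char_vocab."""
    return all(ch in char_vocab for ch in text)


# "chat" never appears in the corpus, but c, h, a and t all do.
print(expressible_with_words("chat"))  # False
print(expressible_with_chars("chat"))  # True

# A character vocabulary still has limits: "d" and "g" were never seen.
print(expressible_with_chars("dog"))   # False
```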

- selfhoster11 8 months ago
  
  Those tokens won't come up during training, but LLMs are capable of in-context learning. If you give them some examples in the prompt of how to create new words/characters in this manner, they will be able to use those tokens at inference time. Show a model some examples of how to compose a Thai or Chinese sentence out of byte tokens, give it a description of the hypothetical Unicode range of a custom alphabet, and a sufficiently strong LLM will be able to just output the bytes, even though those codepoints don't technically exist.
  And like I said, single-byte tokens very much are a part of word tokenisers, or to be precise, their token selection. "Word tokeniser" is a misnomer in any case - they are word piece tokenisers. English is simple enough that word pieces can be entire words. With languages where you have numerous suffixes, prefixes, and even infixes as a part of one "word" (as defined by "one or more characters preceded or followed by a space" - because the truth is more complicated than that), you have not so much "word tokenisers" as "subword tokenisers". A character tokeniser is just a special case of a subword tokeniser where the length of each subword is exactly 1.
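The "special case" claim can be sketched with a toy greedy longest-match tokeniser. The function `greedy_tokenize` and both vocabularies are invented for illustration, and real subword tokenisers (BPE, WordPiece, unigram) choose their splits differently; the sketch only shows that restricting the vocabulary to length-1 pieces turns the same procedure into a character tokeniser.

```python
def greedy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Split `text` by repeatedly taking the longest piece found in `vocab`."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for n in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + n]
            if piece in vocab:
                tokens.append(piece)
                i += n
                break
        else:
            raise ValueError(f"no token covers {text[i]!r}")
    return tokens


# Hypothetical subword vocabulary: some multi-character pieces plus single letters.
subword_vocab = {"un", "break", "able", "u", "n", "b", "r", "e", "a", "k", "l"}
# Same vocabulary restricted to pieces of length exactly 1.
char_vocab = {piece for piece in subword_vocab if len(piece) == 1}

print(greedy_tokenize("unbreakable", subword_vocab))  # ['un', 'break', 'able']
print(greedy_tokenize("unbreakable", char_vocab))     # one token per character
```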
  

asdff 9 months ago

Why stop there? Just have it spit out the state of the bits on the hardware. English seems like a serious shackle for an LLM.


emaro 9 months ago

Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.


docmechanic 9 months ago

Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
“If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”
“If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary.”
