emaro (a day ago): Kind of, but character-based tokens make it a lot harder and more expensive to learn semantics.
docmechanic (18 hours ago):
Source: Generative Deep Learning by David Foster, 2nd edition, published in 2023. From “Tokenization” on page 134.
“If you use word tokens: …. will never be able to predict words outside of the training vocabulary.”
"If you use character tokens: The model may generate sequences of characters that form words outside the training vocabulary."