Comment by akoboldfrying 15 hours ago
Cool! It creates very plausible encodings.
> The Llama tokenizer used in this project sometimes permits multiple possible tokenizations for a given string.
Not having tokens be a prefix code is thoroughly unfortunate. Do the Llama team consider it a bug? I don't see how to rectify the situation without a full retrain, sadly.
I can't imagine they consider it a bug; it's a common and beneficial property of essentially every LLM today. You want to be able to represent common words with single tokens for efficiency, but you still need to be able to represent prefixes of those words in the cases where they occur separately.
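To make the ambiguity concrete, here's a minimal sketch with a hypothetical toy vocabulary (not the real Llama token set): once both a whole word and its pieces are tokens, a single string admits several valid token sequences, which is exactly what "not a prefix code" means here.

```python
# Toy illustration: a vocab containing both a whole word and its pieces,
# so one string has multiple valid tokenizations.
def all_tokenizations(s, vocab):
    """Enumerate every way to split s into tokens drawn from vocab."""
    if not s:
        yield []
        return
    for i in range(1, len(s) + 1):
        piece = s[:i]
        if piece in vocab:
            for rest in all_tokenizations(s[i:], vocab):
                yield [piece] + rest

vocab = {"un", "break", "able", "unbreak", "unbreakable"}  # hypothetical tokens
for toks in all_tokenizations("unbreakable", vocab):
    print(toks)
# ['un', 'break', 'able']
# ['unbreak', 'able']
# ['unbreakable']
```

In practice a BPE tokenizer's merge rules pick one canonical sequence deterministically when encoding, but all of the sequences above still decode back to the same string, which is why tools that sample token sequences can produce several plausible encodings of one text.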