akoboldfrying 15 hours ago

Cool! It creates very plausible encodings.

> The Llama tokenizer used in this project sometimes permits multiple possible tokenizations for a given string.

Not having tokens be a prefix code is thoroughly unfortunate. Does the Llama team consider it a bug? I don't see how to rectify the situation without a full retrain, sadly.

shawnz 11 hours ago

I can't imagine they consider it a bug; it's a common and beneficial property of essentially every LLM tokenizer today. You want to be able to represent common words with single tokens for efficiency, but at the same time you still need to be able to represent prefixes of those words in the cases where they occur separately.

  • akoboldfrying 10 hours ago

    I find this surprising, but I suppose it must be more efficient overall.

    Presumably parsing text into tokens is done in some deterministic way. If it is done by greedily taking the longest-matching prefix that is a token, then when generating text it should be possible to "enrich" tokens that are prefixes of other tokens with additional constraints to force a unique parse: E.g., if "e" is a token but "en" is too, then after generating "e" you must never generate a token that begins with "n". A text generated this way can be deterministically parsed by the greedy parser.
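
    A minimal sketch of that constraint idea (the names and the toy vocabulary are mine, not anything from the Llama tokenizer): after emitting a token t, every continuation c such that t+c is a longer token becomes a banned prefix of the following text, and a ban only partially consumed by the next token carries over to the one after it.

        # Toy vocabulary; a real one would come from the tokenizer.
        VOCAB = {"e", "n", "c", "o", "en", "enc"}

        def bans(tok, vocab=VOCAB):
            """Continuations that would make a greedy longest-match
            parser prefer a longer token than tok."""
            return {v[len(tok):] for v in vocab
                    if v != tok and v.startswith(tok)}

        def step(pending, tok, vocab=VOCAB):
            """Emit tok under the pending bans; returns the new pending
            bans, or None if tok would break the greedy parse."""
            nxt = set()
            for c in pending:
                if tok.startswith(c):
                    return None            # completed a banned continuation
                if c.startswith(tok):
                    nxt.add(c[len(tok):])  # ban carries into the next token
            return nxt | bans(tok, vocab)

        # bans("e") == {"n", "nc"}: after "e", anything starting with "n"
        # would be re-read by the greedy parser as "en" or "enc".
        assert step(step(set(), "e"), "n") is None
        assert step(step(set(), "e"), "o") == set()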

    Alternatively, it would suffice to restrict to a subset of tokens that are a prefix code. This would be simpler, but with lower coding efficiency.

    • shawnz 10 hours ago

      Regarding the first part: that's an interesting idea, although I worry it would bias the outputs in an unrealistic way. Then again, maybe it would only impact scenarios that would have otherwise been unparsable anyway?

      Regarding the second part: you'd effectively just be limiting yourself to single-character tokens in that case, which would drastically impact the LLM's output quality.

      • akoboldfrying 8 hours ago

        The first approach would only affect outputs that would have been otherwise unparseable.

        The second approach works with any subset of tokens that form a prefix code -- you effectively set the probability of all tokens outside this subset to zero (and rescale the remaining probabilities if necessary). In practice you would want to choose a large subset, which means you almost certainly want to avoid choosing any single-character tokens, since they can't coexist with tokens beginning with that character. (Choosing a largest-possible such subset sounds like an interesting subproblem to me.)
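
        A sketch of that subproblem (the names and the simple count objective are my choices; in practice you'd probably weight tokens by frequency instead): build a trie of the vocabulary. A prefix-free subset is an antichain of the prefix order, so at each trie node you either keep that node's token or the best antichain among its descendants, whichever is larger.

            END = object()  # sentinel key marking "a token ends here"

            def max_prefix_free(vocab):
                """Largest-by-count subset of vocab in which no token is
                a prefix of another."""
                trie = {}
                for tok in vocab:
                    node = trie
                    for ch in tok:
                        node = node.setdefault(ch, {})
                    node[END] = tok

                def best(node):
                    below = []
                    for key, child in node.items():
                        if key is not END:
                            below.extend(best(child))
                    if below:          # descendants tie or win; on a tie,
                        return below   # this keeps the longer tokens
                    tok = node.get(END)
                    return [tok] if tok is not None else []

                return best(trie)

            print(sorted(max_prefix_free(
                {"e", "en", "enc", "o", "op", "open", "x"})))
            # ['enc', 'open', 'x'] -- each shorter prefix is dropped
            # in favour of its longest extension.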

bonzini 10 hours ago

I think it's plausible that different languages would prefer different tokenizations. For example, in Spanish the plural of carro is carros, while in Italian it's carri. Maybe the LLM would prefer carr+o / carr+i in Italian and a single token in Spanish.

  • akoboldfrying 3 hours ago

    Certainly! What surprised me was that tokenizers are apparently deliberately designed to permit multiple ways of encoding the same string as tokens. I assumed this would lead to inefficiency, since training would not know whether it should favour outputting, say, se|same or ses|ame after "open", and would thus throw some weight on each. But provided there's a deterministic rule, like "always choose the longest matching token", this uncertainty goes away.
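
    To make that last point concrete, here is a toy greedy longest-match pass (my own sketch; Llama's BPE tokenizer actually applies learned merge rules rather than raw longest-match, so this only illustrates the determinism argument): se|same and ses|ame are both valid encodings of "sesame", but the greedy rule always yields the same one.

        def greedy_tokenize(text, vocab):
            out, i = [], 0
            while i < len(text):
                # Always take the longest token matching at position i.
                tok = max((v for v in vocab if text.startswith(v, i)),
                          key=len, default=None)
                if tok is None:
                    raise ValueError(f"unparseable at position {i}")
                out.append(tok)
                i += len(tok)
            return out

        vocab = {"open", "se", "ses", "same", "ame"}
        print(greedy_tokenize("opensesame", vocab))
        # ['open', 'ses', 'ame'] -- never ['open', 'se', 'same'],
        # even though that parse is also valid.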