Comment by shawnz 18 hours ago
Another fun application of combining LLMs with arithmetic coding is steganography. Here's a project I worked on a while back that effectively applies the technique in the opposite direction to construct a steganographic transformation: https://github.com/shawnz/textcoder
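Roughly, the idea is to treat the secret bits as if they were an already arithmetic-coded stream and "decompress" them against a language model's next-token distribution, so the bits steer token selection and the output reads as ordinary text. A minimal conceptual sketch of that direction (not the actual textcoder implementation: it uses GPT-2 as a small stand-in model, plain floats with no renormalization, and only shows embedding, not extraction):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Small stand-in model for illustration only.
    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def bits_to_fraction(bits):
        # Interpret the secret bit string as a binary fraction in [0, 1).
        return sum(int(b) / 2 ** (i + 1) for i, b in enumerate(bits))

    def embed(bits, prompt="The weather today", n_tokens=20):
        x = bits_to_fraction(bits)
        low, high = 0.0, 1.0  # current arithmetic-coding interval
        ids = tok.encode(prompt, return_tensors="pt")
        for _ in range(n_tokens):
            with torch.no_grad():
                logits = model(ids).logits[0, -1]
            probs = torch.softmax(logits, dim=-1)
            cdf = torch.cumsum(probs, dim=0)
            # Map x into the current interval and pick the token whose
            # probability slice contains it (one arithmetic-decoding step).
            target = (x - low) / (high - low)
            token = int((cdf < target).sum().item())
            token = min(token, probs.numel() - 1)
            # Narrow the interval to that token's slice for the next step.
            lo_p = float(cdf[token - 1]) if token > 0 else 0.0
            hi_p = float(cdf[token])
            low, high = low + (high - low) * lo_p, low + (high - low) * hi_p
            ids = torch.cat([ids, torch.tensor([[token]])], dim=1)
        return tok.decode(ids[0])

    print(embed("1011001110001111"))

Extraction would rerun the same model over the cover text and recover which probability slice each token fell into, hence the original bits; a real implementation also has to handle finite precision and the tokenization ambiguity discussed below.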
Cool! It creates very plausible encodings.
> The Llama tokenizer used in this project sometimes permits multiple possible tokenizations for a given string.
Not having tokens be a prefix code is thoroughly unfortunate. Do the Llama team consider it a bug? I don't see how to rectify the situation without a full retrain, sadly.
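To make the ambiguity concrete, here's a minimal sketch that searches a tokenizer's vocabulary for a piece that can also be written as two smaller pieces, i.e. two distinct token sequences that decode to the same string (assumes the Hugging Face transformers tokenizer API; the checkpoint name is illustrative and any BPE/SentencePiece vocabulary shows the same effect):

    from transformers import AutoTokenizer

    # Illustrative checkpoint; may require access approval on the Hub.
    tok = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
    vocab = tok.get_vocab()  # piece string -> token id

    def find_ambiguous_piece():
        # Look for a word-internal piece whose string also splits into
        # two smaller pieces that both exist in the vocabulary.
        for piece, pid in vocab.items():
            if piece.isalpha() and len(piece) >= 2:
                for i in range(1, len(piece)):
                    left, right = piece[:i], piece[i:]
                    if left in vocab and right in vocab:
                        return piece, [pid], [vocab[left], vocab[right]]
        return None

    piece, ids_a, ids_b = find_ambiguous_piece()
    print(piece, ids_a, ids_b)
    print(tok.decode(ids_a) == tok.decode(ids_b))  # same string, different token ids

Since BPE-style vocabularies deliberately contain both merged pieces and their fragments, any scheme that re-tokenizes decoded text has to pin down one canonical tokenization (or carry extra disambiguating information), and that isn't specific to Llama.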