Comment by dmitrygr 19 hours ago

"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.

underdeserver 18 hours ago

Technically correct, but a better benchmark would use a known compressor on an unknown set of inputs drawn from a real-world population (e.g. coherent English text).
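
Something like this (a sketch only; LZMA via Python's standard library standing in for the known compressor, and a repeated pangram standing in for fresh, unseen English):

    import lzma

    def compression_ratio(text: str) -> float:
        # The compressor is fixed and public, so its own size need not be
        # charged to anyone; what varies between runs is the unseen input.
        raw = text.encode("utf-8")
        return len(lzma.compress(raw, preset=9)) / len(raw)

    # Placeholder input; a real benchmark would sample fresh English text
    # that the compressor's authors have never seen.
    print(compression_ratio("The quick brown fox jumps over the lazy dog. " * 200))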

  • eru 14 hours ago

    Yes, definitely. Alas, it's just harder to run these kinds of challenges completely fairly and self-administered than the ones where you have a fixed text as the challenge and add the binary size of the decompressor.

paufernandez 18 hours ago

Yeah, but the xz algorithm isn't counted in the bytes either... Here the "program" is the LLM, much like your brain remembers things by encoding them in compressed form and then reconstructing them. It is a different type of compression: compression by "understanding", which requires some representation of the whole space of possible inputs. The comparison isn't fair to classical algorithms, yet that's how you can compress a lot more (for a particular language): by having a model of it.
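
A toy illustration of compression-by-modelling (an order-1 character model standing in for the LLM; nothing here is taken from the actual submission): an ideal coder driven by a predictive model pays about -log2 of the predicted probability per symbol, so the better the model, the fewer the bits.

    import math
    from collections import Counter, defaultdict

    def ideal_bits(text: str) -> float:
        # Train a bigram character model, then charge -log2 p(next char)
        # per character: the Shannon bound an arithmetic coder driven by
        # this model would approach.
        counts = defaultdict(Counter)
        for prev, ch in zip(text, text[1:]):
            counts[prev][ch] += 1
        bits = 8.0  # flat one byte for the first character
        for prev, ch in zip(text, text[1:]):
            bits += -math.log2(counts[prev][ch] / sum(counts[prev].values()))
        return bits

    sample = "a model of the language lets the coder spend fewer bits per symbol. " * 50
    print(ideal_bits(sample) / 8, "bytes ideal vs", len(sample), "bytes raw")

Note that the bigram table itself is not charged here, which is exactly dmitrygr's objection above: a fair comparison has to count the model too.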

FartyMcFarter 18 hours ago

True for competitions, but if your compression algorithm is general purpose then this matters less (within reason - no one wants to lug around a 1TB compression program).