Comment by dmitrygr 19 hours ago

"compressed size" does not seem to include the size of the model and the code to run it. According to the rules of Large Text Compression Benchmark, total size of those must be counted, otherwise a 0-byte "compressed" file with a decompressor containing the plaintext would win.

underdeserver 18 hours ago

Technically correct, but a better benchmark would use a known compressor on an unknown set of inputs drawn from a real-world population (e.g. coherent English text).
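
Something like this (a sketch only; LZMA via Python's standard library standing in for the known compressor, and a repeated pangram standing in for fresh, unseen English):

    import lzma

    def compression_ratio(text: str) -> float:
        # The compressor is fixed and public, so its own size need not be
        # charged to anyone; what varies between runs is the unseen input.
        raw = text.encode("utf-8")
        return len(lzma.compress(raw, preset=9)) / len(raw)

    # Placeholder input; a real benchmark would sample fresh English text
    # that the compressor's authors have never seen.
    print(compression_ratio("The quick brown fox jumps over the lazy dog. " * 200))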

  • eru 14 hours ago

    Yes, definitely. Alas, it's just harder to run these kinds of challenges completely fairly and self-administered than the ones where you have a fixed text as the challenge and add the binary size of the decompressor.

paufernandez 18 hours ago

Yeah, but the xz algorithm isn't counted in the bytes either... Here the "program" is the LLM, much like your brain remembers things by encoding them in compressed form and then reconstructing them. It is a different type of compression: compression by "understanding", which requires some representation of the whole space of possible inputs. The comparison isn't fair to classical algorithms, yet that's how you can compress a lot more (for a particular language): by having a model of it.
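
A toy illustration of compression-by-modelling (an order-1 character model standing in for the LLM; nothing here is taken from the actual submission): an ideal coder driven by a predictive model pays about -log2 of the predicted probability per symbol, so the better the model, the fewer the bits.

    import math
    from collections import Counter, defaultdict

    def ideal_bits(text: str) -> float:
        # Train a bigram character model, then charge -log2 p(next char)
        # per character: the Shannon bound an arithmetic coder driven by
        # this model would approach.
        counts = defaultdict(Counter)
        for prev, ch in zip(text, text[1:]):
            counts[prev][ch] += 1
        bits = 8.0  # flat one byte for the first character
        for prev, ch in zip(text, text[1:]):
            bits += -math.log2(counts[prev][ch] / sum(counts[prev].values()))
        return bits

    sample = "a model of the language lets the coder spend fewer bits per symbol. " * 50
    print(ideal_bits(sample) / 8, "bytes ideal vs", len(sample), "bytes raw")

Note that the bigram table itself is not charged here, which is exactly dmitrygr's objection above: a fair comparison has to count the model too.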

FartyMcFarter 18 hours ago

True for competitions, but if your compression algorithm is general purpose then this matters less (within reason - no one wants to lug around a 1TB compression program).