Comment by sltkr 16 hours ago

I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring, so the pifs might not work even in theory (besides being utterly impractical, of course).

On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.
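
To make that arithmetic concrete, here is a minimal sketch of such a scoring rule. The function and parameter names are made up for illustration and are not taken from any particular competition's rules:

```python
def competition_score(compressed_mb, decompressor_mb, static_data_mb):
    """Total size charged to an entry: everything needed to reproduce the input.

    Hypothetical scoring rule, assuming sizes are simply summed.
    """
    return compressed_mb + decompressor_mb + static_data_mb


# The example from the comment: 500 MB output, 1 MB binary, 100 MB dictionary.
print(competition_score(compressed_mb=500, decompressor_mb=1, static_data_mb=100))  # 601
```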

The relevance to this discussion is that the LLM weights would have to be included as static data, since the only way to regenerate them is from the initial training data, which is much larger than the resulting model. By comparison, pi-based compression is the other way around: since pi is a natural constant, if your decompressor requires (say) a trillion digits of pi, you could write a relatively small program (a few KB) to generate them. It would be terribly slow, but it wouldn't affect your compression ratio much.
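
As a rough illustration of how small such a generator can be, here is a sketch using Gibbons' unbounded spigot algorithm for the decimal digits of pi. This is one well-known approach, not necessarily what any real decompressor would use, and it gets slow quickly as the internal integers grow:

```python
from itertools import islice


def pi_digits():
    """Yield the decimal digits of pi one at a time (3, 1, 4, 1, 5, ...).

    Unbounded spigot algorithm (Gibbons); uses only big-integer arithmetic.
    """
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            # The next digit is settled: emit it and rescale the state.
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            # Not enough precision yet: consume one more term of the series.
            q, r, t, k, n, l = (
                q * k,
                (2 * q + r) * l,
                t * l,
                k + 1,
                (q * (7 * k + 2) + r * l) // (t * l),
                l + 2,
            )


print(list(islice(pi_digits(), 10)))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```

The whole generator is a few hundred bytes of source, which is the point: its size barely registers next to the data being compressed.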

meindnoch 2 hours ago

If we assume pi's digits to be uniformly random, then the expected offset of the first occurrence of a particular N-bit sequence is roughly 2^N. (This can be shown with a Markov-chain argument. Also note: we're working in binary.) So you've converted an N-bit value into an offset on the order of 2^N, which again takes about N bits to represent.
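
For the curious, here is a small sketch of that expectation under the same assumption (independent, fair random bits, which is not proven for pi). It uses the standard overlap formula for pattern waiting times, which gives the same result as the Markov-chain argument; the function names are made up for illustration:

```python
def expected_wait(pattern: str) -> int:
    """Expected number of fair random bits read until `pattern` first appears in full.

    Sum of 2^k over every k where the length-k prefix equals the length-k suffix.
    """
    n = len(pattern)
    return sum(2**k for k in range(1, n + 1) if pattern[:k] == pattern[-k:])


def expected_offset(pattern: str) -> int:
    """Expected start position of the first occurrence (wait minus pattern length)."""
    return expected_wait(pattern) - len(pattern)


print(expected_wait("01"))  # 4
print(expected_wait("11"))  # 6

# The k = N term always contributes 2^N, so for every N-bit pattern the wait
# lies between 2^N and 2^(N+1) - 2: the offset needs about N bits to write down.
for pattern in ("10110010", "11111111"):
    print(pattern, expected_offset(pattern), expected_offset(pattern).bit_length())
```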

dataflow 11 hours ago

> I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring

Fascinating. Do you know if this has been proven about any interesting number (that wasn't explicitly constructed to make this true)?

kadoban 11 hours ago

https://en.wikipedia.org/wiki/Normal_number has some examples and a bunch of info. Most of them are pretty artificial, but the concatenation-of-the-primes one is... at least interesting; it's not obvious (to me) that doing that would give you a normal number.

- _ache_ 7 hours ago
  
  Hmm, you are referring to rich numbers but pointing to normal numbers, so I will be the nerd who points out that every normal number is rich, but some rich numbers aren't normal.
  https://en.wikipedia.org/wiki/Disjunctive_sequence
  
  kadoban 6 hours ago
  
  Oh nice, thanks, I didn't know that one.
  
eru 15 hours ago

> I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring, so the pifs might not work even in theory (besides being utterly impractical, of course).

Well, either your program 'works', or you will have discovered a major new insight about pi.

> On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.

And that's the only way to do this fairly, if you are running a competition where you only have a single static corpus to compress.

It would be more interesting, and would make the results more useful, if the texts to be compressed were drawn from a wide probability distribution and entries were scored on, e.g., the average compressed length. Then you wouldn't necessarily need to include the size of the compressor and decompressor in the score.

Of course, it would be utterly impractical to sample gigabytes of new text each time you need to run the benchmark: humans are expensive writers. The only ways this could work would be to sample via an LLM, which is somewhat circular and wouldn't measure what you actually want the benchmark to measure, or to keep the benchmark text secret, which has its own problems.

netsharc 13 hours ago

You mentioning the concept of pi containing every substring makes me think of Borges' Library of Babel.

Ha, next: a compression algorithm that requires the user to first build an infinite library...

_ache_ 7 hours ago

Yep, the Library of Babel is a related topic. Wikipedia links to it under "See also" on the "Normal number" page.

> a compression algorithm that requires the user to first build an infinite library...

That kind of already exists: pifs. It's more of a joke, but the concept is already a joke, so...
https://github.com/philipl/pifs

charcircuit 16 hours ago

This only does 1 byte, so you only have to prove it contains the bits for 0-255.
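
For a finite claim like that, you don't need a proof about pi in general; you can just compute enough bits and look. Below is a rough sketch that computes the binary expansion of pi's fractional part with Machin's formula and finds the first bit offset of each byte value. The helper names, the 64 guard bits, and the 20000-bit search window are my own choices, and whether a real tool searches at bit or byte granularity is an assumption here:

```python
def arctan_inv(x: int, scale: int) -> int:
    """Return arctan(1/x) * scale, using the Taylor series in integer arithmetic."""
    total = term = scale // x
    x2 = x * x
    k, sign = 3, -1
    while term:
        term //= x2
        total += sign * (term // k)
        k += 2
        sign = -sign
    return total


def pi_fraction_bits(nbits: int) -> str:
    """First `nbits` binary digits of pi after the binary point, as a '0'/'1' string."""
    guard = 64  # extra bits to absorb truncation error (an assumed safety margin)
    scale = 1 << (nbits + guard)
    # Machin's formula: pi = 16*arctan(1/5) - 4*arctan(1/239)
    pi_scaled = 16 * arctan_inv(5, scale) - 4 * arctan_inv(239, scale)
    frac = (pi_scaled - 3 * scale) >> guard
    return format(frac, f"0{nbits}b")


def first_byte_offsets(bits: str) -> dict:
    """Map each byte value 0-255 to the bit offset of its first occurrence (-1 if absent)."""
    return {b: bits.find(format(b, "08b")) for b in range(256)}


bits = pi_fraction_bits(20000)
offsets = first_byte_offsets(bits)
missing = [b for b, off in offsets.items() if off < 0]
print("byte values not found:", missing)
print("largest first offset:", max(offsets.values()))
```

If the bits behave like random ones, you would expect every byte value to show up within the first few thousand bits, so `missing` should come out empty.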
