GuB-42 16 hours ago

I suspect that the offset of your input data in pi would take about as many bytes to write down as the input data itself, give or take a few bytes at most, regardless of how big the input is.

That is: no compression, but it won't make things worse either.

Unless the input data is the digits of pi, obviously, or the result of some computation involving pi.

  • gmuslera 4 hours ago

    What if, instead of the index of your full data, you store the indices of smaller blocks? Would I need, e.g., an 8-kilobyte or larger integer to store the offset of every possible 8k block?

    It is meant to be a joke anyway.

    • sumtechguy 2 hours ago

      That would 'work' to a point. But my gut guess is it would end up with bigger data.

      In most algorithms I have ever made, there are several places where your gains disappear. For me the dictionary lookup is where things come apart; sometimes it is the encoding of the bytes/blocks themselves.

      In your example you could find all of the possible 8k blocks out there in pi. But that set of offsets would be very large, so it will be tough to get your head around how it is working. And since your blocks do not cover the whole of pi space, you also probably need a dictionary or a function to hold them, or at least pointers to them.

      One way to tell if a compression alg is doing OK is to make the most minimal version of it and then scale it out. For example, start with a 4-bit/8-bit/16-bit value instead of 8k and see how much space it would take up. Sometimes scaling it up will let you get better gains (not always); that is where you get a pretty good idea whether it works or not. Just move from 1 byte to 2, then 4, and so on, to see if the alg works. That exercise also lets you see if there are different ways to encode the data that may help as well.
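
      Here is roughly what I mean, as a throwaway Python sketch (it leans on the mpmath library for the digits; the block sizes and the 200k-digit cutoff are just values I picked): look up every possible 2-, 3- and 4-digit block in the decimal expansion of pi and compare the block size with the number of digits its offset needs.

          from mpmath import mp, nstr

          mp.dps = 200_000                      # work with ~200k decimal digits of pi
          digits = nstr(mp.pi, mp.dps)[2:]      # "3.1415..." -> "1415..."

          for block_len in (2, 3, 4):
              offsets = []
              for value in range(10 ** block_len):
                  block = str(value).zfill(block_len)
                  idx = digits.find(block)
                  if idx >= 0:                  # every block should turn up at this depth
                      offsets.append(idx)
              avg = sum(len(str(o)) for o in offsets) / len(offsets)
              print(f"{block_len}-digit blocks: {len(offsets)}/{10 ** block_len} found, "
                    f"offsets need ~{avg:.2f} digits on average")

      If the argument upthread is right, the two numbers track each other: an n-digit block needs an offset of roughly n digits, so the scaled-up version gains nothing either.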

      I got nerd sniped about 3 decades ago on problems just like this. Still trying :)

  • noctune 5 hours ago

    Some patterns are bound to repeat, so I would assume the offset would be larger, no?

  • MrLeap 14 hours ago

    You could express the offset with scientific notation, tetration, and other big math number things. You probably don't need the whole offset number all at once!

    • GuB-42 14 hours ago

      Actually, you do.

      You can use all the math stuff like scientific notation, tetration, etc., but it won't help you make things smaller.

      Math notation is a form of compression. 10^9 is 1000000000, compressed. But the offset into pi is effectively a random number, and you can't compress random numbers no matter what technique you use, including math notation.

      This can be formalized and mathematically proven. The only thing wrong here is that pi is not actually a random number; but unless you are dealing with circles, it looks a lot like one, so while that part is unproven, I think it is a reasonable shortcut.
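
      For anyone who wants the formal version, the core of it is a plain counting (pigeonhole) argument, nothing specific to pi:

          2^N \;\text{strings of exactly } N \text{ bits}
          \quad\text{vs.}\quad
          \sum_{k=0}^{N-1} 2^k = 2^N - 1 \;\text{strings of fewer than } N \text{ bits}

      There are more N-bit strings than there are strictly shorter strings, so no lossless scheme can shrink every input; it can only shorten some inputs at the price of lengthening others.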

miki123211 5 hours ago

In that vein, pi is the true "illegal number"[1], as all the illegal content ever produced can eventually be found in the digits of pi.

I guess when God made the Universe, he filled it with CSAM, hate speech and unsolicited Viagra ads...

[1] https://en.wikipedia.org/wiki/Illegal_number

  • anthk 3 hours ago

    Pi is not bound to God, nor to the universe. It's just the ratio between the path through all the points around you at a fixed distance (the circumference) and twice that distance.

    • SAI_Peregrinus an hour ago

      Which means it depends on the geometry of spacetime. If spacetime is quantized such that the Euclidean norm doesn't apply, then the value of pi will be different. See the "Extremal values of pi" paper[1].

      [1] https://e.math.cornell.edu/people/Nikhil_Sahoo/files/pi_pape...
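
      A quick way to see that concretely (plain Python; the parametrization and the sample values of p are my own choices, so treat it as a sketch): measure the unit circle of the p-norm with that same p-norm and take perimeter over diameter.

          import math

          def pi_p(p, n=100_000):
              # First-quadrant arc of {(x, y): |x|^p + |y|^p = 1};
              # with x = cos(t)^(2/p), y = sin(t)^(2/p) we get x^p + y^p = 1.
              pts = [(math.cos(t) ** (2 / p), math.sin(t) ** (2 / p))
                     for t in (math.pi / 2 * i / n for i in range(n + 1))]
              # Quarter perimeter, measured with the p-norm itself.
              quarter = sum((abs(x2 - x1) ** p + abs(y2 - y1) ** p) ** (1 / p)
                            for (x1, y1), (x2, y2) in zip(pts, pts[1:]))
              return 4 * quarter / 2    # perimeter / diameter

          for p in (1, 1.5, 2, 3, 10):
              print(f"p = {p:>3}: pi_p ~= {pi_p(p):.5f}")

      It bottoms out at the familiar 3.14159... at p = 2 and rises to 4 as p moves toward 1 or infinity, which, as far as I understand, is the "extremal values" result the paper is about.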

      • anthk 23 minutes ago

        You can define pi without reference to the geometry of the universe. And it's still a constant, nothing arbitrary, unlike...

             rot13 spoiler
        
             gur raqvat bs Pbagnpg juvpu vf vzcbffvoyr
        
             rot13

sltkr 16 hours ago

I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring, so pifs might not work even in theory (besides being utterly impractical, of course).

On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.

The relevance to this discussion is that the LLM weights would have to be included as static data, since the only way to regenerate them is from the initial training data, which is much larger than the resulting model. By comparison, pi-based compression is the other way around: since pi is a natural constant, if your decompressor requires (say) a trillion digits of pi, you could write a relatively small program (a few kB) to generate them. It would be terribly slow, but it wouldn't affect your compression ratio much.
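
To make that last point concrete, here is a rough sketch of such a program in Python (a Gibbons-style unbounded spigot; it yields correct digits forever in principle, but it gets painfully slow long before a trillion digits):

    from itertools import islice

    def pi_digits():
        # Unbounded spigot for the decimal digits of pi (Gibbons-style).
        # Yields 3, 1, 4, 1, 5, 9, ... one digit at a time, forever.
        q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
        while True:
            if 4 * q + r - t < n * t:    # the next digit is pinned down
                yield n
                q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
            else:                        # otherwise consume another term of the series
                q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                    (q * (7 * k + 2) + r * l) // (t * l), l + 2)

    print("".join(str(d) for d in islice(pi_digits(), 30)))  # 314159265358979...

The whole thing is a couple hundred bytes of source, which is the sense in which "a trillion digits of pi" costs you almost nothing in the size computation.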

  • meindnoch 2 hours ago

    If we assume pi's digits to be uniformly random, then the expected offset of the first occurrence of a particular N-bit sequence is going to be ~2^N. (This can be proven using a Markov-chain argument. Also note: we're working in binary.) So you've converted an N-bit value into an offset on the order of 2^N, which again takes about N bits to represent.
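
    A quick Monte Carlo sanity check of that, with random bits standing in for pi's (assumed uniformly random) binary expansion; the pattern length and trial count below are arbitrary:

        import random

        N = 10
        pattern = "".join(random.choice("01") for _ in range(N))

        trials, total = 2000, 0
        for _ in range(trials):
            bits = ""
            while True:
                bits += "".join(random.choice("01") for _ in range(4096))
                idx = bits.find(pattern)
                if idx >= 0:
                    total += idx
                    break

        print(f"N = {N}, 2^N = {2 ** N}, average first-occurrence offset = {total / trials:.0f}")

    (The exact average depends a bit on how the pattern overlaps with itself, but it stays on the order of 2^N, which is the point.)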

  • dataflow 11 hours ago

    > I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring

    Fascinating. Do you know if this has been proven about any interesting number (that wasn't explicitly constructed to make this true)?

  • eru 15 hours ago

    > I'm going to be the nerd that points out that it has not been mathematically proven that pi contains every substring, so pifs might not work even in theory (besides being utterly impractical, of course).

    Well, either your program 'works', or you will have discovered a major new insight about Pi.

    > On a more serious note, as far as I understand these compression competitions require that static data is included in the size computation. So if you compress 1000 MB into 500 MB, but to decompress you need a 1 MB binary and a 100 MB initial dictionary, your score would be 500 + 100 + 1 = 601 MB, not 500 MB.

    And that's the only way to do this fairly, if you are running a competition where you only have a single static corpus to compress.

    It would be more interesting, and would make the results more useful, if the texts to be compressed were drawn from a wide probability distribution, and we then scored people on e.g. the average compressed length. Then you wouldn't necessarily need to include the size of the compressor and decompressor in the score.

    Of course, it would be utterly impractical to sample gigabytes of new text each time you need to run the benchmark: humans are expensive writers. The only ways this could work would be to sample via an LLM, which is somewhat circular and wouldn't measure what you actually want the benchmark to measure, or to try to keep the benchmark text secret, which has its own problems.

  • netsharc 13 hours ago

    You mentioning the concept of pi containing every substring makes me think of Borges' Library of Babel.

    Ha, next: a compression algorithm that requires the user to first build an infinite library...

    • _ache_ 7 hours ago

      Yep, the Library of Babel is a related topic. Wikipedia lists it under "See also" on the "Normal number" page.

      > a compression algorithm that requires the user to first build an infinite library...

      That kind of already exists: pifs. It's more of a joke, but the concept is already a joke, so...

      https://github.com/philipl/pifs

  • charcircuit 16 hours ago

    This only does 1 byte at a time, so you only have to prove that pi contains the bit patterns for 0-255.
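
    If I remember right, pifs indexes into pi's hexadecimal expansion, so for bytes that just means every two-hex-digit pair has to show up somewhere. No proof, but a quick empirical look with mpmath (the 20,000-digit cutoff is arbitrary):

        from mpmath import mp, floor

        n = 20_000                            # hex digits of pi's fractional part to scan
        mp.prec = 4 * n + 64                  # binary precision, plus guard bits
        hexdigits = format(int(floor((mp.pi - 3) * 16 ** n)), "x").zfill(n)

        missing = [b for b in range(256) if format(b, "02x") not in hexdigits]
        print("byte values not found:", missing)   # expect []; raise n if anything shows up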