Comment by eru
Yes.
When you train your neural network to minimise cross-entropy, that is literally the same as making it a better predictive building block for an arithmetic-coding data compressor. See https://en.wikipedia.org/wiki/Arithmetic_coding
See also https://learnandburn.ai/p/an-elegant-equivalence-between-llm...
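A minimal sketch of that equivalence (with a hypothetical fixed model standing in for a trained network's per-position predictions): the model's cross-entropy loss in bits per symbol is exactly the per-symbol length of the ideal arithmetic code driven by the same probabilities.

```python
import math

# Toy next-symbol model: fixed probabilities over a 4-symbol alphabet.
# (A trained network would emit these per position; this stand-in is
# purely for illustration.)
model_probs = {'a': 0.5, 'b': 0.25, 'c': 0.125, 'd': 0.125}

sequence = "aababcadaab"

# Cross-entropy of the model on this sequence, in bits per symbol.
cross_entropy_bits = -sum(math.log2(model_probs[s]) for s in sequence) / len(sequence)

# Ideal arithmetic-coding length: a coder driven by the same probabilities
# spends -log2 p(symbol) bits per symbol, plus O(1) overhead per message.
ideal_code_bits = sum(-math.log2(model_probs[s]) for s in sequence)

print(f"cross-entropy:      {cross_entropy_bits:.3f} bits/symbol")
print(f"ideal code length:  {ideal_code_bits:.1f} bits "
      f"({ideal_code_bits / len(sequence):.3f} bits/symbol)")
# The per-symbol numbers match: lowering the loss is the same as
# shortening the compressed output.
```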
Indeed, the KL-divergence can be seen as the extra bits you pay on average when you arithmetically encode a sample from a given distribution using symbol probabilities from an approximating distribution instead of the original one.
https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_diver...
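Concretely (with made-up numbers for illustration): coding samples from P with P's own probabilities costs the entropy H(P) bits per symbol, coding them with Q's probabilities costs the cross-entropy H(P, Q), and the gap is D_KL(P || Q).

```python
import math

# True distribution P and an approximation Q over the same alphabet.
P = {'a': 0.5, 'b': 0.25, 'c': 0.25}
Q = {'a': 0.4, 'b': 0.4, 'c': 0.2}

# Average bits per symbol when encoding samples drawn from P...
bits_with_P = -sum(p * math.log2(P[s]) for s, p in P.items())  # entropy H(P)
bits_with_Q = -sum(p * math.log2(Q[s]) for s, p in P.items())  # cross-entropy H(P, Q)

# ...and the difference is exactly the KL-divergence D_KL(P || Q).
kl = sum(p * math.log2(p / Q[s]) for s, p in P.items())

print(f"H(P)       = {bits_with_P:.4f} bits/symbol")
print(f"H(P, Q)    = {bits_with_Q:.4f} bits/symbol")
print(f"difference = {bits_with_Q - bits_with_P:.4f} bits/symbol")
print(f"D_KL(P||Q) = {kl:.4f} bits/symbol")
```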