Comment by EdSchouten 9 hours ago

Another thing that's worth investigating:

As the name implies, the node sizes of Prolly trees and geometric search trees are geometrically distributed. My question is: is that really the right distribution to use? The larger a node gets, the higher the probability that it gets mutated, which means that in a content-addressed storage system there will be more large objects than small ones. My gut feeling is that the distribution should be uniform, with the spread between the minimum and maximum sizes bounded by a small constant factor (2x? 4x?).

Some time ago I experimented with this by implementing a content-defined chunking algorithm that cuts inputs at locations where the value of a rolling hash is maximal, as opposed to finding offsets at which the first/last n bits of the hash are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code can be found here:

https://github.com/buildbarn/go-cdc

Would it be possible to model trees around this approach as well? If so, would that lead to better deduplication rates than Prolly/geometric search trees?