Comment by carsonfarmer 12 hours ago
You're right, we should dig into a comparison with Prolly trees in more depth. We actually have a lot of experience with Prolly trees, and have found in practice that you need to do many of the things that projects like Dolt have had to do to make them work well. With G-trees, by contrast, the basic implementation turns out to be quite nice (and extremely easy to reason about).
One of the biggest benefits of G-trees, in my mind, is their ease of implementation. Additionally, we did a lot of work to explore their statistical properties, which hasn't been published for Prolly trees (though in hindsight we have effectively done that analysis too, so we should probably write it up formally).
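For concreteness, the geometric distribution in both structures can fall out of a deterministic rank function over key hashes. Here is a minimal Go sketch of that idea (my own illustration, using SHA-256 and leading-zero counting; it is not the actual G-tree implementation):

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"math/bits"
)

// rank derives a level for a key by counting the leading zero bits of its
// hash. For a uniform hash, P(rank >= k) = 2^-k, so rank is Geometric(1/2);
// this is the kind of construction that gives G-tree (and Prolly tree)
// nodes their geometrically distributed sizes.
func rank(key []byte) int {
	h := sha256.Sum256(key)
	r := 0
	for _, b := range h {
		if b != 0 {
			return r + bits.LeadingZeros8(b)
		}
		r += 8
	}
	return r
}

func main() {
	for _, k := range []string{"alpha", "bravo", "charlie", "delta"} {
		fmt.Printf("%-8s rank=%d\n", k, rank(k))
	}
}
```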
Another thing that's worth investigating:
As the name implies, the sizes of nodes in Prolly trees and geometric search trees are geometrically distributed. My question is: is this really the right distribution to use? The larger a node gets, the higher the probability that it gets mutated. This means that in a content-addressed storage system, storage will accumulate more large objects than small ones. My gut feeling tells me that the distribution should be uniform, with the spread between min/max sizes bounded by a small constant factor (2x? 4x?).
Some time ago I experimented with this: I implemented a content-defined chunking algorithm that cuts inputs at locations where the value of a rolling hash is locally maximal, as opposed to finding offsets at which the first/last n bits of the hash are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code can be found here:
https://github.com/buildbarn/go-cdc
Would it be possible to model trees around this approach as well? If so, would that lead to better deduplication rates than Prolly/geometric search trees?
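For anyone who wants to play with the local-maximum idea, here is a toy Go sketch (my own simplified illustration, not the actual buildbarn/go-cdc algorithm; `window`, `posHash`, and the use of FNV are arbitrary choices made for brevity). A position becomes a cut point when its hash strictly dominates the hashes of the `window` positions on either side, which structurally rules out chunks smaller than window+1 bytes:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math/rand"
)

// window is how many positions on each side a cut point must dominate
// (an arbitrary choice; it sets the minimum and expected chunk sizes).
const window = 32

// posHash assigns each position a pseudo-random value derived from a few
// trailing bytes of context, standing in for a real rolling hash.
func posHash(data []byte, i int) uint64 {
	lo := i - 7
	if lo < 0 {
		lo = 0
	}
	h := fnv.New64a()
	h.Write(data[lo : i+1])
	return h.Sum64()
}

// cuts returns chunk boundaries: a cut is placed after position i when
// posHash is a strict maximum over the window positions on either side.
// Two cuts can never be closer than window+1 bytes, so the tiny chunks
// a geometric distribution produces cannot occur.
func cuts(data []byte) []int {
	hs := make([]uint64, len(data))
	for i := range data {
		hs[i] = posHash(data, i)
	}
	var out []int
	for i := window; i+window < len(hs); i++ {
		localMax := true
		for j := i - window; j <= i+window; j++ {
			if j != i && hs[j] >= hs[i] {
				localMax = false
				break
			}
		}
		if localMax {
			out = append(out, i+1)
		}
	}
	return append(out, len(data))
}

func main() {
	data := make([]byte, 1024)
	rand.New(rand.NewSource(1)).Read(data)
	prev := 0
	for _, c := range cuts(data) {
		fmt.Printf("chunk [%4d, %4d) size=%d\n", prev, c, c-prev)
		prev = c
	}
}
```

With this scheme the expected chunk size is roughly 2*window+1 bytes and nothing falls below window+1, which is much closer to the bounded-spread distribution speculated about above than a geometric one.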