Comment by bminor13 a year ago

In order to be in the same G-node, they'd need to have the same rank and be close in value (such that they were not "broken up" by an intermediate value of higher rank), right?

Brute-force search for adjacent values with the same rank seems possible, but guaranteeing that no intermediate higher-rank values exist may not be (for an attacker). Maybe one mitigation against this sort of attack is to search for extra higher-rank values to insert, breaking up large G-nodes?

This also assumes they know the hash function (if rank is chosen by a cryptographically secure hash); maybe salting values before hashing could also thwart these sorts of attacks?

yorwba a year ago

Generating arbitrarily many values of the minimum rank is very easy for an attacker. Since the rank is geometrically distributed with parameter p = 1 - 1/k and k ≥ 2, randomly sampling a value will give you one of minimum rank with probability p ≥ 1/2 and it only gets easier for larger k.
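
To make that concrete, here is a minimal sketch, assuming k = 2 and taking the rank to be the number of trailing zero bits of a SHA-256 hash (the paper's exact rank function may differ):

    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
    )

    // rank counts trailing zero bits of SHA-256(v); for k = 2 this is
    // geometrically distributed, with P(rank = 0) = 1/2.
    func rank(v []byte) int {
        h := sha256.Sum256(v)
        r := 0
        for i := len(h) - 1; i >= 0; i-- {
            if h[i] != 0 {
                return r + bits.TrailingZeros8(h[i])
            }
            r += 8
        }
        return r
    }

    func main() {
        hits, tries := 0, 100000
        for i := 0; i < tries; i++ {
            if rank([]byte(fmt.Sprintf("v%d", i))) == 0 {
                hits++ // minimum-rank value found: keep it for the attack
            }
        }
        fmt.Printf("minimum-rank fraction: %.3f\n", float64(hits)/float64(tries))
    }

Roughly every second candidate value already has the minimum rank, so an attacker can collect as many minimum-rank values as they like.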

If you want to break that up with dummy elements, you now have the problem of choosing those dummies in a history-independent manner efficiently.

But I think their recursive construction with G-trees of G-trees of ... might work if nodes with too many elements are stored as G-trees with a different, independent rank function (e.g. using a hash function with different salt). Producing many nodes with the same ranks should then get exponentially harder as the number of independent rank functions increases.
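
A sketch of what that could look like, with a per-depth salt prepended before hashing (the salting scheme here is just an assumption):

    package main

    import (
        "crypto/sha256"
        "fmt"
        "math/bits"
    )

    // saltedRank prefixes a per-depth salt before hashing, giving each
    // nesting level an independent, geometrically distributed rank.
    func saltedRank(depth byte, v []byte) int {
        h := sha256.Sum256(append([]byte{depth}, v...))
        r := 0
        for i := len(h) - 1; i >= 0; i-- {
            if h[i] != 0 {
                return r + bits.TrailingZeros8(h[i])
            }
            r += 8
        }
        return r
    }

    func main() {
        v := []byte("some key")
        // Forcing equal ranks across d independent levels costs roughly
        // 2^d tries per value, so the attack gets exponentially harder.
        for depth := byte(0); depth < 3; depth++ {
            fmt.Printf("depth %d: rank %d\n", depth, saltedRank(depth, v))
        }
    }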

  • carsonfarmer a year ago

    Yes, this exactly. Another really simple way to do this is to use alternating leading- and trailing-zero counts of the hash in your nested G-trees. Simple, and pretty effective.
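
    Roughly like this, as a sketch (not the code from the paper):

        package main

        import (
            "crypto/sha256"
            "fmt"
            "math/bits"
        )

        // Two cheap rank functions from one hash: leading zero bits
        // for even nesting depths, trailing zero bits for odd ones.
        func ranks(v []byte) (leading, trailing int) {
            h := sha256.Sum256(v)
            for _, b := range h {
                if b != 0 {
                    leading += bits.LeadingZeros8(b)
                    break
                }
                leading += 8
            }
            for i := len(h) - 1; i >= 0; i-- {
                if h[i] != 0 {
                    trailing += bits.TrailingZeros8(h[i])
                    break
                }
                trailing += 8
            }
            return
        }

        func main() {
            l, t := ranks([]byte("some key"))
            fmt.Println("even-depth rank:", l)
            fmt.Println("odd-depth rank: ", t)
        }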

    • yorwba a year ago

      Hmmm... if you need to go deeper (because 1/4 of all hashes have zero leading zeros and zero trailing zeros), you can generalize this by converting the hash into its run-length encoding to get a sequence of rank functions where finding values with the same rank for all rank functions is equivalent to finding hash collisions. Very nice.
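
      As a sketch (hypothetical code, just to illustrate the idea):

          package main

          import (
              "crypto/sha256"
              "fmt"
          )

          // runLengths run-length-encodes the hash's bit string. Using
          // run d as the rank at depth d means two values agree on every
          // rank only if their hashes are identical, i.e. a collision.
          func runLengths(h [32]byte) []int {
              var runs []int
              prev := (h[0] >> 7) & 1
              length := 0
              for i := 0; i < 256; i++ {
                  bit := (h[i/8] >> (7 - uint(i%8))) & 1
                  if bit == prev {
                      length++
                  } else {
                      runs = append(runs, length)
                      prev, length = bit, 1
                  }
              }
              return append(runs, length)
          }

          func main() {
              h := sha256.Sum256([]byte("some key"))
              runs := runLengths(h)
              fmt.Println("rank sequence:", runs[:8], "...")
          }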

      • carsonfarmer a year ago

        Whoah, I totally hadn't taken the thought experiment this far. This is fantastic! I'd like to explore this further, interested in a quick research chat/sync some time? My email is linked from the paper.

EdSchouten a year ago

Exactly. But my concern is that this is not any stronger than what Prolly trees already offer, which is why I'm disappointed that they're only mentioned under "related work" and not discussed or compared in more detail.

  • carsonfarmer a year ago

    You're right, we should compare against Prolly trees in more depth. We actually have a lot of experience with Prolly trees, and have found in practice that you need to do a lot of the things that folks like Dolt have had to do to make them work nicely. With G-trees, by contrast, the basic implementation turns out to be quite nice (and extremely easy to reason about).

    One of the biggest benefits of G-trees, in my mind, is their ease of implementation. Additionally, we did a lot of work to explore their statistical properties; a comparable analysis doesn't exist for Prolly trees (though in hindsight we have effectively done it, so we should probably write it up formally).

    • EdSchouten a year ago

      Another thing that's worth investigating:

      As the name implies, the node sizes of Prolly trees and geometric search trees are geometrically distributed. My question is: is this really the right distribution to use? The larger nodes get, the larger the probability that they get mutated. This means that in a content-addressed storage system, there will be more large objects than small ones. My gut feeling tells me that the distribution should be uniform, with the spread between min/max sizes bounded by a small constant factor (2x? 4x?).
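
      A quick simulation of the size-biased effect I mean (the parameter and the mutation model are just assumptions):

          package main

          import (
              "fmt"
              "math/rand"
          )

          // A uniformly random mutation lands in a node with probability
          // proportional to the node's size, so the rewritten objects are
          // size-biased samples: large nodes dominate the churn.
          func main() {
              rng := rand.New(rand.NewSource(1))
              const p = 0.5 // geometric node sizes with mean 1/p = 2
              sum, sumSq := 0, 0
              for i := 0; i < 100000; i++ {
                  s := 1
                  for rng.Float64() > p {
                      s++
                  }
                  sum += s
                  sumSq += s * s
              }
              fmt.Printf("mean node size:              %.2f\n",
                  float64(sum)/100000)
              fmt.Printf("mean size of a mutated node: %.2f\n",
                  float64(sumSq)/float64(sum))
          }

      With p = 0.5 the average node holds 2 elements, but the node a random mutation lands in holds 3 on average; a uniform size distribution with bounded spread would narrow that gap.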

      Some time ago I experimented with this: I implemented a content-defined chunking algorithm that chunks inputs at locations where the value of a rolling hash is maximal, as opposed to finding offsets at which the first/last n bits are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code can be found here:

      https://github.com/buildbarn/go-cdc
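
      Stripped to its core, the cutting rule looks roughly like this (a toy version, not the actual go-cdc code):

          package main

          import (
              "fmt"
              "math/rand"
          )

          // cutPoints cuts where h[i] is the strict maximum of its
          // surrounding window of radius w, instead of cutting where
          // the low n bits of a rolling hash are zero.
          func cutPoints(h []uint64, w int) []int {
              var cuts []int
              for i := w; i+w < len(h); i++ {
                  isMax := true
                  for j := i - w; j <= i+w; j++ {
                      if j != i && h[j] >= h[i] {
                          isMax = false
                          break
                      }
                  }
                  if isMax {
                      cuts = append(cuts, i)
                  }
              }
              return cuts
          }

          func main() {
              rng := rand.New(rand.NewSource(1))
              h := make([]uint64, 1000) // stand-in rolling-hash values
              for i := range h {
                  h[i] = rng.Uint64()
              }
              // Each position wins its window with probability 1/(2w+1),
              // so chunk sizes concentrate around 2w+1 = 65 here.
              fmt.Println("number of cuts:", len(cutPoints(h, 32)))
          }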

      Would it be possible to model trees around this approach as well? If so, would this lead to better deduplication rates than Prolly/geometric search trees?