Comment by EdSchouten 5 days ago

From "2 Related Work":

> None of these data structures can provide a non-probabilistic upper bound on the number of items per vertex. This hampers efficient implementation; and adversarial data suppliers can trivially produce n items in O(n) expected time that must all be stored in the same vertex.

And "4.3 Novel G-Trees":

> Our analysis confirms that — with high probability — the resulting trees are of logarithmic height and their G-nodes store O(k) items.
>
> The second key insight toward an efficient k-ary data structure is, paradoxically, that there is no need to be clever about it. Sorted linked lists are naïve, inefficient data structures, yet zip trees are efficient. We can be similarly naïve for our k-ary construction.
>
> We use a sorted linked list in which every node stores up to k items. We require all items to be stored as early in the list as possible; this is the simplemost way of achieving history-independence. In other words, the only node to store fewer than k items is the final node. We call such a list a k-list (see Figure 9).
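
For concreteness, here is how I read the k-list construction (a minimal sketch; Python and the names are mine, not the paper's):

```python
# A k-list packs the sorted items into nodes of exactly k items each,
# as early in the list as possible, so only the final node may hold
# fewer than k. The layout depends only on the item set, never on the
# insertion order, which is what makes it history-independent.
def k_list(items, k=4):
    s = sorted(items)
    return [s[i:i + k] for i in range(0, len(s), k)]

print(k_list([9, 1, 5, 3, 7, 2]))  # [[1, 2, 3, 5], [7, 9]]
print(k_list([2, 7, 1, 3, 9, 5]))  # same layout despite a different history
```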

To me it's not obvious why an adversarial data supplier can't produce items that all need to go into the same G-node. Sure, the list can be partitioned across n/k nodes to turn it into a k-ary data structure, but I don't see how that's better than a plain linked list.

So this makes me wonder: how is this approach any better than a Prolly tree? Prolly trees are mentioned twice in this article, but the comparison doesn't go into detail.

bminor13 4 days ago

In order to be in the same G-node, they'd need to have the same rank and be close in value (such that they are not "broken up" by a value of the next-highest rank), right?

Seems like a brute-force search for adjacent values with the same rank is possible, but guaranteeing that intermediate higher-rank values don't also exist may not be (for an attacker). Maybe one mitigation for this sort of attack is to search for extra higher-rank values to insert, to break up large G-nodes?

This also assumes they know the hash function (if the rank is derived from a cryptographically secure hash); maybe salting values before hashing could also thwart these sorts of attacks?

  • yorwba 16 hours ago

    Generating arbitrarily many values of the minimum rank is very easy for an attacker. Since the rank is geometrically distributed with parameter p = 1 - 1/k and k ≥ 2, randomly sampling a value will give you one of minimum rank with probability p ≥ 1/2 and it only gets easier for larger k.
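
    Roughly, assuming the rank counts trailing zero digits of the value's hash in base k (one plausible rank function; the paper's exact choice may differ):

    ```python
    import hashlib

    def rank(value: bytes, k: int = 16) -> int:
        # Trailing zero digits of the hash in base k: Geometric(1 - 1/k).
        h = int.from_bytes(hashlib.sha256(value).digest(), "big")
        r = 0
        while h % k == 0:
            r += 1
            h //= k
        return r

    # Random values hit the minimum rank with probability 1 - 1/k, so an
    # attacker can collect arbitrarily many of them in linear time.
    hits = sum(rank(str(i).encode()) == 0 for i in range(10_000))
    print(hits / 10_000)  # ~ 0.94 for k = 16
    ```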

    If you want to break that up with dummy elements, you now have the problem of choosing those dummies in a history-independent manner efficiently.

    But I think their recursive construction with G-trees of G-trees of ... might work if nodes with too many elements are stored as G-trees with a different, independent rank function (e.g. using a hash function with a different salt). Producing many nodes with the same ranks should then get exponentially harder as the number of independent rank functions increases.
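
    A sketch of what that could look like (the per-depth salting scheme is just my guess):

    ```python
    import hashlib

    def rank(value: bytes, depth: int, k: int = 16) -> int:
        # Salt the hash with the nesting depth so that every level of the
        # recursive construction gets its own, independent rank function.
        h = int.from_bytes(hashlib.sha256(bytes([depth]) + value).digest(), "big")
        r = 0
        while h % k == 0:
            r += 1
            h //= k
        return r

    # To overfill a node at depth d, an attacker now has to match d
    # independent rank functions simultaneously.
    print([rank(b"some item", d) for d in range(4)])
    ```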

    • carsonfarmer 12 hours ago

      Yes, this exactly. Another really simple way to do this is to use alternating leading- and trailing-zero counts of the hash in your nested G-trees. Simple, and pretty effective.
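
      Something like this, I assume (a sketch, not our actual implementation):

      ```python
      import hashlib

      def rank(value: bytes, depth: int, bits: int = 256) -> int:
          # Alternate between leading- and trailing-zero counts of the
          # hash depending on the nesting depth of the G-tree.
          h = int.from_bytes(hashlib.sha256(value).digest(), "big")
          if depth % 2 == 0:
              return bits - h.bit_length()  # leading zero bits
          return (h & -h).bit_length() - 1 if h else bits  # trailing zero bits
      ```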

      • yorwba 10 hours ago

        Hmmm... if you need to go deeper (because 1/4 of all hashes have zero leading zeros and zero trailing zeros), you can generalize this by converting the hash into its run-length encoding to get a sequence of rank functions, where finding values that agree on every rank function is equivalent to finding a hash collision. Very nice.
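
        In code, that generalization might look like this (a sketch):

        ```python
        import hashlib
        from itertools import groupby

        def rank_sequence(value: bytes) -> list[int]:
            # Run-length encode the hash bits; the run lengths, read from
            # the most significant end, act as ranks for successive nesting
            # levels. Agreeing on every rank pins down the whole bit string
            # (up to complement), i.e. amounts to a hash collision.
            h = int.from_bytes(hashlib.sha256(value).digest(), "big")
            bits = format(h, "0256b")
            return [len(list(run)) for _, run in groupby(bits)]

        print(rank_sequence(b"example")[:8])  # ranks for the first levels
        ```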

  • EdSchouten 4 days ago

    Exactly. But my concern is that this is not any stronger/better than what Prolly trees already offer, which is why I'm disappointed that they are mentioned under "related work", but not discussed/compared in more detail.

    • carsonfarmer 12 hours ago

      You're right, we should go into the comparison with Prolly trees in more depth. We actually have a lot of experience with Prolly trees, and have found in practice that you need to do a lot of the things that folks like Dolt have had to do to make them work nicely, whereas with G-trees the basic implementation turns out to be quite nice (and extremely easy to reason about).

      One of the biggest benefits of G-trees, in my mind, is their ease of implementation. Additionally, we did a lot of work to explore their statistical properties, which doesn't exist for Prolly trees (though in hindsight we have done this ourselves, so we should probably write it up formally).

      • EdSchouten 9 hours ago

        Another thing that's worth investigating:

        As the name implies, the node sizes of Prolly trees and geometric search trees are geometrically distributed. My question is: is this really the right distribution to use? The larger nodes get, the larger the probability that they get mutated. This means that in a content-addressed storage system, there will be more large objects than small ones. My gut feeling tells me that the distribution should be uniform, with the spread between min/max sizes bounded by a small constant factor (2x? 4x?).
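
        A toy simulation illustrates the size bias (a sketch; the parameters are arbitrary):

        ```python
        import math
        import random

        random.seed(0)
        p = 1 / 16  # node sizes ~ Geometric(p), mean 1/p = 16 items

        def node_size() -> int:
            # Inverse-transform sampling of a Geometric(p) size.
            return int(math.log(1.0 - random.random()) / math.log(1.0 - p)) + 1

        sizes = [node_size() for _ in range(100_000)]
        total = sum(sizes)

        # Mutating a uniformly random item rewrites the node holding it, so
        # a node is rewritten with probability proportional to its size.
        mean_all = total / len(sizes)
        mean_rewritten = sum(s * s for s in sizes) / total  # size-biased mean
        print(mean_all, mean_rewritten)  # rewritten nodes skew much larger
        ```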

        Some time ago I experimented with this: I implemented a content-defined chunking algorithm that cuts inputs at locations where the value of a rolling hash is maximal, as opposed to finding offsets at which the first/last n bits are zeros/ones. My observation was that this led to a 2-3% reduction in storage space usage. The source code can be found here:

        https://github.com/buildbarn/go-cdc
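
        The core idea, independent of the actual go-cdc code, can be sketched like this (the window size and the stand-in hash are placeholders):

        ```python
        import hashlib

        def max_based_cuts(data: bytes, window: int = 64) -> list[int]:
            # Cut after a position whose hash value is strictly greater than
            # that of every position in the next `window` bytes, instead of
            # cutting where the hash has n leading/trailing zero bits.
            def h(i: int) -> int:
                # Stand-in for a rolling hash over a small trailing window.
                return int.from_bytes(hashlib.blake2b(
                    data[max(0, i - 8):i + 1], digest_size=4).digest(), "big")

            cuts, i = [], 0
            while i + window < len(data):
                v = h(i)
                if all(v > h(j) for j in range(i + 1, i + window + 1)):
                    cuts.append(i + 1)
                    i += window  # acts as a minimum chunk length
                else:
                    i += 1
            return cuts

        data = b"".join(hashlib.sha256(bytes([b])).digest() for b in range(64))
        print(max_based_cuts(data))
        ```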

        Would it be possible to model trees around this approach as well? If so, would this lead to better deduplication rates than Prolly/geometric search trees?