Comment by tetha
Comment by tetha 9 hours ago
It reminds me of a game we played with students of data classification algorithms like ID3: How many yes/no questions do we need to uniquely identify everyone in this room?
With like 12 students, that's 4 bits, and it often ends up with 2-3 questions. It starts off with the obvious ones - man/woman/diverse, but then a realization comes in: An answer usually contains more information than just that one bit. If you have long hair, you're most likely a woman and/or a metalhead for example. That part will get shaken out later on.
And those thoughts make these browser fingerprinting techniques all the more scary: They contain a lot of information and that quickly cuts the possible amount of people down. Like, I'm a Linux Firefox user with a screen on the left. I wouldn't be suprised if that put me in a 5-6 digit bucket of people already.
> An answer usually contains more information than just that one bit.
That means there is less information in the question "do they have long hair?", not more. Asking "long hair?" and then "woman?" is probably, in most groups, roughly the same as just the first or second question alone. So the second question added much less than one bit of information because the answer is probably "yes". "Long hair" and then "metalhead" is the same, except that the answer to the second question is probably "no".
Yes/no questions on average contain the most information each when they partition the remaining possibilities 50:50. Then each answer gives you exactly one more bit. The closet you get to either a 100:0 or 0:100 yes:no split, the smaller the fraction of a bit you encode in the answer.
"Metalhead?" usually gives you lots of bits of information (probably 4 in an "average" group of 16 containing at least one metalhead) if the answer is "yes", but on average that's outweighed by the very high chance that the answer will be "no". If there are no metalheads or only metalheads, it gives you zero information.