Comment by donmcronald a day ago

> Is the idea that you find a unique file size across every single site on the internet?

Not really. I'm thinking more along the lines of a total page load. I probably don't understand it well enough, but consider something like connecting to facebook.com. It takes 46 HTTP requests.

Say (this is made up) 35 of those are async and transfer 2MB of data in total, the 36th is consistently a slow blocking request, requests 37-42 are synchronous requests of 17KB, 4KB, 10KB, 23KB, 2KB, and 7KB, and requests 43-46 are async (fired after 42) and send back 100KB in total.

If that synchronous block ends up being 6 separate TCP connections, that's a pretty distinct pattern when there isn't a lot of padding, especially if you combine it with a rule that it has to be preceded by a burst of about 35 connections transferring 2MB in total and followed by a burst of 4 connections transferring 100KB combined.
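
To make that concrete, here's a rough sketch of how a passive observer might check a list of observed connection sizes against that kind of signature. All the numbers are the made-up ones from above, and the tolerance and burst thresholds are arbitrary:

```python
# Hypothetical sketch: match observed per-connection byte counts against a
# crude "signature" built from the made-up example above.

SYNC_SIZES_KB = [17, 4, 10, 23, 2, 7]   # the six synchronous requests
TOLERANCE_KB = 2                         # allow a little jitter/padding

def looks_like_target(connections_kb):
    """connections_kb: ordered list of transfer sizes (KB) seen on the wire."""
    n = len(SYNC_SIZES_KB)
    for i in range(len(connections_kb) - n + 1):
        window = connections_kb[i:i + n]
        if all(abs(a - b) <= TOLERANCE_KB for a, b in zip(window, SYNC_SIZES_KB)):
            # Check the surrounding bursts: ~35 connections / ~2MB before,
            # ~4 connections / ~100KB after.
            before, after = connections_kb[:i], connections_kb[i + n:]
            if len(before) >= 30 and sum(before) > 1500 and sum(after) <= 150:
                return True
    return False

# A trace with the distinctive synchronous block in the middle
trace = [60] * 35 + [17, 4, 10, 23, 2, 7] + [25, 25, 25, 25]
print(looks_like_target(trace))  # True
```

Real traffic-analysis attacks are obviously more sophisticated than near-exact size matching, but the point is that the shape of the traffic alone gives you something to key on.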

I've always assumed there's potential to fingerprint connections like that, regardless of whether or not they're encrypted. For regular HTTPS traffic, if you built a visual of the above for a few different sites, you could probably make a good guess at which one someone is visiting just by looking at it.
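
Something like this, purely as an illustration: guess the site from a couple of coarse features of the trace. The profiles here are invented numbers, not real measurements:

```python
# Illustrative only: pick whichever known site profile is "closest" to the
# observed trace, using connection count and total bytes as crude features.

PROFILES = {
    "site-a": {"connections": 46, "total_kb": 2163},
    "site-b": {"connections": 12, "total_kb": 450},
    "site-c": {"connections": 80, "total_kb": 5200},
}

def guess_site(connections_kb):
    feats = {"connections": len(connections_kb), "total_kb": sum(connections_kb)}
    def distance(profile):
        return sum(abs(feats[k] - profile[k]) / max(profile[k], 1) for k in feats)
    return min(PROFILES, key=lambda name: distance(PROFILES[name]))

trace = [60] * 35 + [17, 4, 10, 23, 2, 7] + [25] * 4
print(guess_site(trace))  # "site-a"
```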

Dynamic content getting mixed in might be enough obfuscation, but for things like hidden services I think you'd be better off if everything got coalesced and chunked into a uniform size, so that all the guards and relays ever see is a stream of (e.g.) 100KB blocks. Then you could let the side building the circuit demand an arbitrary amount of padding from each relay.
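
Here's a toy version of that coalesce-and-chunk idea, just to show how fixed-size blocks plus padding hide the individual response sizes. The 100KB block size and everything else here is arbitrary:

```python
# Purely illustrative: coalesce variably sized responses into fixed-size
# blocks so an observer only ever sees uniform chunks.
BLOCK_KB = 100

def chunk_uniform(payload_sizes_kb):
    """Return (block_count, padding_kb) for a batch of payloads.

    Everything is concatenated and split into BLOCK_KB blocks; the final
    partial block is padded up to BLOCK_KB so no odd sizes leak.
    """
    total = sum(payload_sizes_kb)
    blocks, remainder = divmod(total, BLOCK_KB)
    if remainder:
        blocks += 1                      # pad the final partial block
    padding = blocks * BLOCK_KB - total
    return blocks, padding

blocks, padding = chunk_uniform([17, 4, 10, 23, 2, 7])
print(blocks, padding)  # 1 block of 100KB, 37KB of padding
```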

Again, I probably just don't understand how it works, so don't read too much into my reply.