Comment by donmcronald a day ago
Say there are only 2 sites on Tor. Site 'A' is plain text and has no pages over 1KB. You know this because it's public and you can go look at it. Site 'B' hosts memes which are mostly .GIFs that are 1MB+. You know this because it's also a public site.
If I were browsing one of those sites for an hour and you were my guard node, do you think you could make a good guess about which site I'm visiting?
I'm asking why that concept doesn't scale up. Why wouldn't it work to take the machine learning tools that are used to detect anomalous patterns in corporate networks and reverse them to detect expected patterns?
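To make the two-site thought experiment concrete, here's a toy sketch (my own illustration, not a real attack tool) of what a guard that only sees total bytes transferred could do. The sizes and threshold are invented for the hypothetical sites A and B above:

```python
# Toy illustration: a guard sees only the volume of traffic per session.
# Site A serves plain-text pages under 1 KB; site B serves 1 MB+ GIFs.
# With sizes that far apart, a single threshold "classifies" the session.

def guess_site(observed_bytes: int) -> str:
    """Guess which hypothetical site a session visited, by size alone."""
    return "A" if observed_bytes < 1024 else "B"

# Invented per-session byte counts a guard might observe.
sessions = [800, 350, 2_500_000, 1_800_000, 600]
guesses = [guess_site(n) for n in sessions]
print(guesses)  # ['A', 'A', 'B', 'B', 'A']
```

The scaling question is exactly whether this stays reliable once thousands of sites have overlapping size distributions instead of two cleanly separated ones.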
The point is that there aren't only two sites available on the clearnet. Is the idea that you'd have to find a unique file size across every single site on the internet?
My understanding (which may be totally wrong) is that some padding is added to requests so that exact packet sizes can't be correlated.
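For what that padding looks like in principle: Tor relays traffic in fixed-size cells (512 bytes in the original design), so payloads get rounded up to whole cells. A rough sketch, with the cell size taken as illustrative:

```python
# Sketch of fixed-cell padding: a 300-byte and a 500-byte payload both
# occupy one cell on the wire, hiding exact sizes at that granularity.

CELL_SIZE = 512  # illustrative; Tor's classic cell payload size

def padded_size(payload_bytes: int) -> int:
    """Bytes actually sent after rounding up to whole cells."""
    cells = -(-payload_bytes // CELL_SIZE)  # ceiling division
    return cells * CELL_SIZE

print(padded_size(300), padded_size(500))  # 512 512
print(padded_size(1_000_000))              # 1000448
```

Note that this hides *exact* packet sizes but not aggregate volume: a 1 MB GIF still produces roughly 1 MB of cells, which is why the two-site example above still seems distinguishable by total size.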