Comment by dpeckett a day ago

For TCP streams, syscall overhead isn't really a big issue: you can easily transfer large chunks of data in each write(). If you have TCP segmentation offload available you'll have no serious trouble pushing 100gbit/s. Also, if you're serving static content, don't forget sendfile().
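
A rough Go sketch of the static-content case (assuming Linux; the port and filename are placeholders). io.Copy from an *os.File to a *net.TCPConn hands the transfer off to sendfile(2) internally, so the payload stays in the kernel:

  // Serve a file over TCP in large, kernel-side chunks.
  package main

  import (
      "io"
      "log"
      "net"
      "os"
  )

  func main() {
      ln, err := net.Listen("tcp", ":8080") // placeholder port
      if err != nil {
          log.Fatal(err)
      }
      for {
          conn, err := ln.Accept()
          if err != nil {
              log.Fatal(err)
          }
          go func(c net.Conn) {
              defer c.Close()
              f, err := os.Open("static.bin") // placeholder filename
              if err != nil {
                  return
              }
              defer f.Close()
              // *net.TCPConn implements io.ReaderFrom, so io.Copy uses
              // sendfile(2) here instead of read()/write() round trips.
              if _, err := io.Copy(c, f); err != nil {
                  log.Print(err)
              }
          }(conn)
      }
  }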

UDP is a whole other kettle of fish; it gets very complicated to go above 10gbit/s or so. This is a big part of why QUIC really struggles to scale well for fat pipes [1]. sendmmsg/recvmmsg + UDP GRO/GSO will probably get you to ~30gbit/s, but beyond that it's a real headache. The issue is that UDP isn't stream-oriented, so you end up making a ton of little writes, and the kernel networking stack as of today does a pretty bad job with these workloads.
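
For the GSO part, a minimal sketch of what it looks like in Go (assumptions: Linux, a reasonably recent golang.org/x/sys/unix, and a 1200-byte segment size picked purely as an example). One write() hands the kernel ~64 KiB and lets it, or the NIC, do the per-packet segmentation:

  package main

  import (
      "log"
      "net"

      "golang.org/x/sys/unix"
  )

  func main() {
      // Placeholder peer address.
      conn, err := net.DialUDP("udp", nil, &net.UDPAddr{IP: net.IPv4(127, 0, 0, 1), Port: 9000})
      if err != nil {
          log.Fatal(err)
      }
      defer conn.Close()

      raw, err := conn.SyscallConn()
      if err != nil {
          log.Fatal(err)
      }
      // UDP_SEGMENT enables GSO on this socket: each write is split by the
      // kernel (or the NIC, with hardware UDP segmentation offload) into
      // 1200-byte datagrams.
      var soErr error
      if err := raw.Control(func(fd uintptr) {
          soErr = unix.SetsockoptInt(int(fd), unix.SOL_UDP, unix.UDP_SEGMENT, 1200)
      }); err != nil {
          log.Fatal(err)
      }
      if soErr != nil {
          log.Fatal(soErr)
      }

      // One syscall now carries ~53 datagrams' worth of payload rather than
      // one datagram per write().
      batch := make([]byte, 53*1200)
      if _, err := conn.Write(batch); err != nil {
          log.Fatal(err)
      }
  }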

FWIW even the fastest QUIC implementations cap out at <10gbit/s today [2].

Had a good fight writing a ~20gbit userspace UDP VPN recently. Ended up having to bypass the kernel's networking stack using AF_XDP [3].

I'm available for hire btw, if you've got an interesting networking project feel free to reach out.

1. https://arxiv.org/abs/2310.09423

2. https://microsoft.github.io/msquic/

3. https://github.com/apoxy-dev/icx/blob/main/tunnel/tunnel.go

johncolanduoni a day ago

Yeah, all agreed - the only addendum I'd add is for cases where you can't use large buffers because you don't have the data yet (e.g. realtime data streams or very short request/reply cycles). These end up having the same problems, but they aren't solvable by TCP or UDP segmentation offloads. This is where reduced syscall overhead (or, even better, kernel bypass) really shines for networking.

mastax 20 hours ago

I have a hard time believing that Google is serving YouTube over QUIC/HTTP3 at 10Gbit/s, or even 30Gbit/s.

  • johncolanduoni 19 hours ago

    These are per-connection bottlenecks, largely due to implementation choices in the Linux network stack. Even with vanilla Linux networking, vertical scale can get the aggregate bandwidth as high as you want if you don’t need 10G per connection (which YouTube doesn’t), as long as you have enough CPU cores and NIC queues.

    Another thing to consider: Google’s load balancers are all bespoke SDN and they almost certainly speak HTTP1/2 between the load balancers and the application servers. So Linux network stack constraints are probably not relevant for the YouTube frontend serving HTTP3 at all.