HTTP Caching, a Refresher
(danburzo.ro) 113 points by danburzo 16 hours ago
There was a recent discussion on X about this that had a couple of Cloudflare people chip in, including their CTO:
The highlight from that thread https://xcancel.com/dok2001/status/1989005141450846470#m
Good call! Honestly I just wanted to wrap it up before the holidays, but you’re right that a small section on Vary would have been useful.
Things like non-conforming caching services made me punt actual suggestions to a later article, as I wasn’t sure how my sense of the RFC interacted with the real world. HTTP Caching Tests seems like a great resource for this, but only includes Fastly out of the big providers, and it seems to be doing okay with Vary. https://cache-tests.fyi/
Vary is Very important.
> the cache MUST NOT use that stored response without revalidation unless all the presented request header fields nominated by that Vary field value match those fields in the original request
You’ll find that some have creative readings of MUST NOT.
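For anyone who wants that rule in concrete terms, here is a rough sketch of the matching check a cache would have to do before reusing a stored response. The function name `vary_matches` and the plain-dict representation of request headers are made up for illustration. It only trims whitespace and lowercases field names, whereas a real cache may normalize values further, so treat it as a simplification rather than a conforming implementation.

```python
def vary_matches(vary: str, stored_req: dict[str, str], new_req: dict[str, str]) -> bool:
    """Return True if the stored response may be reused for new_req.

    vary       -- the Vary field value from the stored response
    stored_req -- headers of the request that produced the stored response
    new_req    -- headers of the request being served now
    """
    # Field names nominated by Vary; header names are case-insensitive.
    names = [n.strip().lower() for n in vary.split(",") if n.strip()]
    if "*" in names:
        # "Vary: *" never matches, so the cache has to revalidate.
        return False
    stored = {k.lower(): v.strip() for k, v in stored_req.items()}
    new = {k.lower(): v.strip() for k, v in new_req.items()}
    # A field absent from one request only matches if it is absent from both.
    return all(stored.get(n) == new.get(n) for n in names)
```

A cache that skips this check would hand a gzip-encoded or wrong-language response to a client that never asked for it.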
As many have pointed out here, the nature of caching has changed in the current climate of ubiquitous HTTPS, and I want to add a paragraph or two about it. Is there a good summary somewhere that I could reference? What are the usual, most prevalent uses of HTTP intermediaries involving caches, besides CDNs and origin-controlled caches (e.g. Varnish)?
This is nothing new and doesn't add anything to the topic, so am I the only one who thinks this is just an attempt at boosting their SEO through HN?
It clearly notes that it's "a refresher", does not claim that it's novel research, and extensively links to the reference documents. It is, essentially, a review article (https://en.wikipedia.org/wiki/Review_article). And there's absolutely nothing wrong with that.
Hell, the author could probably have called it a primer and I think it'd have been fair.
I’m sorry you didn’t get anything out of it. I wasn’t operating at the edge of caching knowledge, just a person refreshing and clarifying for themselves how caching works. Some things were new to me, and after spending so much time with the RFC, I just thought others may benefit or, more selfishly, would point out errors or ways to make it better.
I mean, do those <meta> tags really suggest someone who’s into SEO? Call me stale but what I really want is validation :-)
A lot of this seems irrelevant these days with https everywhere.
It is not uncommon for enterprises to intercept HTTPS for inspection and logging. They may or may not also do caching of responses at the point where HTTPS is intercepted.
I previously experimented a bit with Squid Cache on my home network for web archival purposes, and set it up to intercept HTTPS. I then added the TLS certificate to the trust store on my client, and was able to intercept and cache HTTPS responses.
In the end, Squid Cache was a little bit inflexible in terms of making sure that the browsed data would be stored forever as was my goal.
This Christmas I have been playing with using mitmproxy instead. I previously used mitmproxy for some debugging, and found out now that I might be able to use it for archival by adding a custom extension written in Python.
It’s working well so far. I browse HTTPS pages in Firefox and I persist URLs and timestamps in SQLite and write out request and response headers plus response body to disk.
My main focus at the moment is archiving some video courses that I paid for in the past, so that even if the site I bought the courses from ceases operation I will still have those video courses. After I finish archiving the video courses, I will proceed to archiving other digital things I’ve bought like VST plugins, sample packs, 3D assets etc.
And after that I will give another shot at archiving all the random pages on the open web that I’ve bookmarked etc.
For me, archiving things by using an intercepting proxy is the best way. I have various manually organised copies of files from all over the place, both paid stuff and openly accessible things. But having a sort of Internet Archive of my own with all of the associated pages where I bought things and all the JS and CSS and images surrounding things is the dream. And at the moment it seems to be working pretty well with this mitmproxy + custom Python extension setup.
I am also aware of various existing web scrapers and internet archival systems for self-hosting and have tried a few of them. But for me the system I am building is the ideal.
Some of it is different, but the basics are still the same and still relevant. Just today I've been working with some of this.
I took a Django app that's behind an Apache server and added cache-control and vary headers using Django view decorators, and added Header directives to some static files that Apache was serving. This had 2 effects:
* Meant I could add mod_cache to the Apache server and have common pages cached and served directly from Apache instead of going back to Django. Load testing with vegeta ( https://github.com/tsenart/vegeta ) shows the server can now handle several times more simultaneous traffic than it could before.
* Meant users' browsers now cache all the CSS/JS. As users move between HTML pages, there is now often only 1 request the browser makes. Good for snappier page loads with less server load.
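To give a rough idea of the headers involved, here is a framework-agnostic sketch as a bare WSGI app. It is not the Django decorators or Apache Header directives from my setup. The `/static/` prefix, the max-age values and the `immutable` flag (which assumes fingerprinted file names) are illustrative assumptions only.

```python
def cache_headers(path: str) -> list[tuple[str, str]]:
    """Pick caching headers by path: long-lived for static assets, short for pages."""
    if path.startswith("/static/"):
        # Assumes asset file names change whenever their content changes.
        return [("Cache-Control", "public, max-age=31536000, immutable")]
    # Pages: cacheable by shared caches for a short time, keyed on encoding.
    return [
        ("Cache-Control", "public, max-age=60"),
        ("Vary", "Accept-Encoding"),
    ]


def app(environ, start_response):
    """Minimal WSGI app that attaches the caching headers to every response."""
    body = b"ok"
    headers = [
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(body))),
    ]
    headers += cache_headers(environ.get("PATH_INFO", "/"))
    start_response("200 OK", headers)
    return [body]
```

With headers like these, a shared cache in front of the app can answer repeat page requests on its own for a minute, and browsers can skip re-requesting static assets altogether.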
But yeah, updating especially the sections on public vs private caches with regards to HTTPS would be good.
CDNs manage user TLS certificates and that is one of the advantages of using them.
A node server could negotiate HTTPS close to the user, do caching stuff, and create another HTTPS connection to your local server (or reuse an existing one).
HTTPS everywhere, with your CDN in the middle.
At one point, with HTTP only, your ISP could run its own cache, large corporate IT networks could have a cache, etc., which was very efficient for caching but horrible for privacy. Now we have CDN edge caching etc., but nothing like the multi-layer caching that was available with HTTP.
As is traditional with most explanations of HTTP caching, it doesn't mention the Vary header. Although apparently some CDNs (e.g. Cloudflare) straight up ignore it for some reason [0].
[0] https://news.ycombinator.com/item?id=38346382