This is a good writeup, though I'm surprised no-one has mentioned the holy grail of caching with HTTP. That of course is good old RFC 2616: http://www.ietf.org/rfc/rfc2616.txt
There's an entire section in there devoted just to caching in HTTP. Very well worth reading in its entirety.
I really enjoyed your writing style. It didn't push its (successful!) attempts at humor and used them sparingly enough to be a joy when they were read.
I think one of the meta-takeaways is that understanding the fundamentals of web caching can help with your general CS knowledge ("There are only two hard problems in Computer Science: cache invalidation and naming things." -- Phil Karlton).
Looking at Apache, we see a few strategies:
* Include last-modified metadata
* Include content metadata (ETag/md5 of content)
* Include explicit expiration date
* Include a max-age
* Include metadata about who can cache (public/private/no-cache; e.g. "private" means the user's browser can cache but shared proxies cannot)
These approaches could be used when designing data flows with Memcache, Redis, etc.
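To make the list above concrete, here is a minimal sketch in Python of what those headers look like when built by hand. The function name `cache_headers` and its parameters are my own invention for illustration, not anything Apache exposes; it just shows one header per strategy.

```python
import hashlib
from email.utils import formatdate


def cache_headers(body: bytes, mtime: float, max_age: int,
                  shared: bool, now: float) -> dict:
    """Build one response header per strategy in the list (hypothetical helper)."""
    # "public" lets shared caches (proxies, CDNs) store the response;
    # "private" restricts it to the end user's own cache.
    scope = "public" if shared else "private"
    return {
        # Last-modified metadata, as an HTTP date in GMT.
        "Last-Modified": formatdate(mtime, usegmt=True),
        # Content metadata: here simply the md5 of the body, in quotes.
        "ETag": '"' + hashlib.md5(body).hexdigest() + '"',
        # Explicit expiration date.
        "Expires": formatdate(now + max_age, usegmt=True),
        # Max-age plus who may cache.
        "Cache-Control": f"{scope}, max-age={max_age}",
    }
```

The same ideas (a timestamp, a content hash, a TTL, a scope) carry over fairly directly to keys and values stored in Memcache or Redis.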
The best way I've seen for dealing with cache expiry, which the article does not talk about, is to use version numbers on assets. We found this to be especially important with JavaScript, CSS, etc. -- if all of that stuff doesn't expire at the same time, it can hose the layout of your site.
Also, there may be many layers of caching between you and the user: not only HTTP caching in the browser, but also any CDNs (Akamai, etc.) and sometimes even caching reverse proxies in corporations.
At my previous job, we handled the versioning with deployment-time rewriting of the asset URLs in the base page to include the version number (as tagged by the build software with branch name + build number).
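A rough sketch of that kind of deployment-time rewriting, in Python. This is not the actual tool we used; the regex, the `v` parameter name, and the `add_version` function are all assumptions made up for the example.

```python
import re
from urllib.parse import quote

# Matches src/href attributes pointing at static assets that do not
# already carry a query string or fragment. The extension list is an
# assumption; adjust it for your own site.
ASSET_RE = re.compile(
    r'(?P<attr>\b(?:src|href))="(?P<url>[^"?#]+\.(?:js|css|png|jpg|gif))"'
)


def add_version(html: str, version: str) -> str:
    """Append ?v=<version> to every matched asset URL in the page."""
    tag = quote(version, safe="")  # branch names may contain "/" etc.

    def repl(m: re.Match) -> str:
        return f'{m.group("attr")}="{m.group("url")}?v={tag}"'

    return ASSET_RE.sub(repl, html)
```

For example, `add_version('<script src="/js/app.js"></script>', "main-42")` gives `<script src="/js/app.js?v=main-42"></script>`. Because every asset gets the same tag on each deploy, the JS and CSS all get re-fetched together.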
That said, enabling browser side caching was a huge win for page speed on the site.
One thing I don't understand. If the server has asked the client to cache an image for a year, and the image is indeed updated in that time, is there some way of telling the client to download that image anyway?
I'd take it to Google, but I have no idea how I'd ask that in Google query form.
This is actually referenced in the article. The client can send back the Last-Modified date it has (in an If-Modified-Since header), and the server will either return a 304 (Not Modified) or the modified image if it is newer.
I read that, but if you say "this image won't change for exactly one year" and the client doesn't even request that resource from the server any more, how do you start that dialogue again?
pork has suggested adding a junk parameter to the end of a GET request, which should disrupt the cache; I'll need to read into this. I'm interested in optimizing web speed as much as possible, and caching has always been something I've understood poorly.
Yep, that's the problem with long expiration dates -- the client may never check again (that's what we wanted, right?). The workaround is to request a new url which restarts the process.
Separately, the easiest way to get started with all these optimizations is to run the page speed check online:
I've actually been playing around with this stuff all day, pretty much since my last comment above. I've enabled smarter caching on my website, replaced multiple image requests with a single spritesheet, optimized my images, and cleaned up my CSS file to remove unused code. Google's PageSpeed has been an invaluable tool, as well as webpagetest.org which breaks down the data in an intelligent way.
Turns out Google Analytics is actually doubling my page load time, but the data is too valuable to give up.
In HTTP, since it's stateless, you don't "tell" the client anything without it first asking. The usual way to bust the cache is to add a junk parameter to the end of a GET request.
I see. I assume you would change the HTML to say <img src="image.png?cache=no"> or something like that to force the browser to redownload it? What if the HTML page itself were cached for a year? Is there an Apache setting that can give a global "no caching" command, or something like that?
Yep, exactly -- not only can the images be cached, but the HTML too!
The ideal way to do it is have the "loader file" (index.html) only cached with last-modified date, so as soon as it changes the client is aware. The client requests the file each time, and is returned the full file or a simple Not Modified response.
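Here is a small sketch of that "full file or Not Modified" decision on the server side, in Python. The `respond` function and its return shape are hypothetical; a real server such as Apache does this for you, so treat it purely as an illustration of the logic.

```python
from datetime import datetime, timezone
from email.utils import formatdate, parsedate_to_datetime


def respond(if_modified_since, mtime: float, body: bytes):
    """Return (status, headers, body) for a conditional GET (hypothetical helper)."""
    # HTTP dates have one-second resolution, so drop fractional seconds.
    last_modified = datetime.fromtimestamp(int(mtime), tz=timezone.utc)
    headers = {"Last-Modified": formatdate(int(mtime), usegmt=True)}

    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        # Only compare when the parsed date carries a timezone.
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers, b""  # client's copy is still current

    return 200, headers, body  # first visit, or the file has changed
```

The client still makes a round trip each time, but a 304 carries no body, so it is cheap compared with re-sending the page.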
Within the file, you have references to permanently-cached, versioned resources (<img src="/images/foo.png?build=123" />). If the cache expiration is far enough away, the browser won't even issue the request to check for a new version.
Some caches (certain proxies in particular) won't cache URLs that contain query params, so you might use rewriting rules to change foo.png to foo.123.png. This rewriting is done automatically for you with Google's Page Speed module for Apache (mod_pagespeed).
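If you wanted to roll the foo.png -> foo.123.png scheme yourself, a minimal sketch in Python might look like the following. Both function names are made up for this example, and this is not how the Page Speed module does it internally; in practice the reverse mapping would usually live in a server rewrite rule.

```python
import posixpath
import re


def versioned_name(path: str, build: int) -> str:
    """Insert the build number before the extension: foo.png -> foo.123.png."""
    root, ext = posixpath.splitext(path)
    return f"{root}.{build}{ext}"


# A purely numeric segment immediately before the final extension.
_VERSIONED = re.compile(r"^(?P<root>.+)\.\d+(?P<ext>\.[A-Za-z0-9]+)$")


def original_name(path: str) -> str:
    """Map a versioned URL back to the file on disk; leave other paths alone."""
    m = _VERSIONED.match(path)
    return f"{m.group('root')}{m.group('ext')}" if m else path
```

One caveat with this naive pattern: a file that is genuinely named something like lib.2.js would be mistaken for a versioned URL, so you'd want a more distinctive marker if your asset names can look like that.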
It sounds like a much deeper subject than I first appreciated. I'll definitely read up on this. Using URL rewriting with caching is interesting, I've not seen that before.
I work with one lady who complains about a slow-loading jQuery slideshow, and smarter caching may very well be the solution (at least, after the first load).
I've experimented and managed to shave 2 seconds off a client's website upon reloads - that's significant! Still playing with it, but I've already learned a lot.
I know you say that in jest, but I would say that it is still good to know the underlying technology. It is much easier to debug issues in the future once you understand them.
Concur - I think that, once you have a certain amount of experience built up working in frameworks like Rails, you can only go so far before you hit a wall. At that point, it's necessary to start learning the protocols and the lower level stuff to advance your understanding and your craft.