How Google Taught Me to Cache and Cash-In (highscalability.com)
58 points by timf on Sept 13, 2009 | 24 comments



Following this guide blindly is a bad idea. Premature caching can be dangerous: you may end up caching things that get hit once, you may miss caching things that get hit a lot, and you may end up making your caching far more complex than it needs to be.

A better caching strategy is to collect data first (for example, which queries are used most) and then optimize based on that data. In other words, don't apply caching blindly; apply it by profiling your application. The same approach also works for testing how new caching strategies affect your cache and your database.
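
A minimal sketch of that kind of profiling, assuming you can wrap your query path somewhere central (all names here are hypothetical):

    import collections
    import time

    # Count how often each query runs and how long it takes, so you can
    # decide afterwards which results are actually worth caching.
    query_stats = collections.defaultdict(lambda: {"count": 0, "total_time": 0.0})

    def profiled_query(db, sql, params=()):
        start = time.perf_counter()
        rows = db.execute(sql, params).fetchall()
        elapsed = time.perf_counter() - start
        query_stats[sql]["count"] += 1
        query_stats[sql]["total_time"] += elapsed
        return rows

    def hottest_queries(n=10):
        # Rank by total time spent; these are the first candidates for caching.
        return sorted(query_stats.items(),
                      key=lambda item: item[1]["total_time"],
                      reverse=True)[:n]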


You let the cache expire stale objects and update it on db IO. What's the problem, exactly?


Updating the cache is far from easy. An example that showcases the non-triviality is MySQL's query cache: it invalidates every cached query that touches a table whenever that table is updated. If you get mostly reads, this works great, but if you get a lot of updates, the query cache becomes very inefficient, since entries are purged on every write. MySQL has a lot of information available that could be used to implement a smarter cache; unfortunately, this problem is far from easy to solve even with all that information at hand.
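
One "smarter" scheme people commonly bolt onto memcached is a per-table generation number baked into the cache key, so a write only has to bump one counter. A toy sketch (this is a generic pattern, not MySQL's actual mechanism):

    # Toy in-memory cache standing in for memcached; only get/set semantics assumed.
    cache = {}

    def table_generation(table):
        # Every table gets a generation counter; bumping it implicitly invalidates
        # all cached queries that were keyed against the old value.
        return cache.setdefault("gen:" + table, 0)

    def bump_generation(table):
        cache["gen:" + table] = table_generation(table) + 1

    def cached_query(db, sql, tables):
        gens = ",".join(str(table_generation(t)) for t in sorted(tables))
        key = "q:%s:%s" % (gens, sql)
        if key not in cache:
            cache[key] = db.execute(sql).fetchall()
        return cache[key]

    # On any write, bump the generations of the touched tables instead of
    # hunting down every affected cache entry.
    def on_write(tables):
        for t in tables:
            bump_generation(t)

Stale entries just stop being read and eventually fall out of a real cache's LRU; the trade-off is that unrelated queries on the same table get invalidated together.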

The bottom line: good luck expiring the cache on database updates ;)


What I did in Rails on a previous app was:

- Tie hooks into the logger so I can track exactly which ActiveRecord models are accessed in a given page view.

- Have 100% test coverage AND make sure the coverage pattern (i.e. frequency distribution of line execution) at least looks something like production AND use a subset of production data for testing.

When tests are run:

- Use the logger hooks to build the tree of partials included in each view, and models included in each view/partial (i.e. we got a User which had Friends and an Avatar which had a PicResource .. etc.), where the models are linked by foreign key relations, and pages/partials linked by inclusion. This is why you need 100% coverage & real data - you don't want to miss anything.

Then:

- Tie cache-busting code to ActiveRecord's after_save hook. When a table is updated, you go to every tree in which that model appears and go up until you hit the highest partial. You bust that partial and then you bust the top-level page. (Rough sketch of this step at the end of this comment.)

- If you want to tune this further (for expensive pages), you can differentiate between CREATE and UPDATE. On CREATE you need to bust the whole cache for those partials/pages. On UPDATE you only need to bust existing cached renders of those pages. So you can just store the N most-viewed pages/partials with those models, and keep their IDs around.

Why we did this:

I realize this isn't typical for a web app. Our app had to render binaries for embedded devices, so we had to compile everything into a single file before it went out the door. This was a highly interactive app, so pages changed all the time. Using this system, we could:

a) maintain quick response times by almost never rendering on page view (rendering was compute-intensive and required about 500ms per page)

b) pre-render often-used and often-changed pages

c) save our precious database from wear and tear

d) do all this WITHOUT harassing the page/content developers, who could throw together new Rails views/controllers using whatever AR models they pleased, without having to maintain onerous "cache-busting" lists or strict model-controller ties.

(I've omitted some details here, like dealing with cyclical dependencies .. no fun otherwise :)
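
For the curious, the busting step boils down to something like this (a language-agnostic sketch; the real thing hung off ActiveRecord's after_save, and every name below is made up):

    # model -> partials that render it, partial -> pages that include it;
    # both maps were built from the logger hooks during the instrumented test run.
    partials_by_model = {"User": {"user_card"}, "Avatar": {"user_card"}}
    pages_by_partial = {"user_card": {"profile_page", "friends_page"}}

    def bust(render_cache, model_name):
        # Walk up from the model to the partials that render it, bust each one,
        # then bust every top-level page that includes those partials.
        for partial in partials_by_model.get(model_name, ()):
            render_cache.pop(partial, None)
            for page in pages_by_partial.get(partial, ()):
                render_cache.pop(page, None)

    # Wired into the ORM's save callback, roughly: after_save -> bust(cache, record's class name)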


I wonder if anyone's gone the next step and written updates to the caches directly, leaving the database write out of the critical path.


Do you mean that when you get an update, you both write to the database and perform an update in the cache? If so, you would want it to be a transaction so that the cache and database are not out of sync.

Maybe a write cache, like hard drives with write-back caching: you first write the update to a cache (a queue of updates), and from there it gets written to both the read cache (cached HTML, for instance) and the database in one transaction.

But I think it sounds like more trouble than it's worth.


Transactions are unnecessary, unless you are looking for 100% consistency between cache and database at all times. Usually that's not the case, and if it is, you will have extremely serious problems scaling. All I'm saying is: if you have an update that you know will invalidate some cached pages, why invalidate them and then re-render them from the database once the update is committed, instead of updating them directly and lazily queueing the database write, taking load off the database (less pressure to commit instantly, and fewer reads)? In the Facebook inbox example, you know that a new message will increase the user's message count by one. Just update the cache directly with this new info. When/if it explodes, by all means, re-render everything.
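
For the inbox-count case, that's just an increment against the cache, something like this (assuming the python-memcached client and a running memcached; the key name is made up):

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])

    def bump_inbox_count(user_id):
        key = "inbox_count:%d" % user_id
        # incr returns None on a cache miss; in that case we just let the next
        # full page render repopulate the count from the database.
        return mc.incr(key)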


And then what happens when the server with your cache dies? :)


Update the cache instantly (critical path), update the db lazily queued?


We do this for the Twiddla sandboxes.

The key is that you have to be prepared for everything to evaporate if the server cycles its memory for whatever reason. In this specific case, we toss everything out of the sandboxes every 5 minutes & reset them anyway, so it's no great tragedy if they clean up after themselves from time to time.

For something with a ton of low-priority edits (like HN or Reddit where nobody sues you if their vote goes away), you could certainly get away with caching updates and only saving them out every once in a while.
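
A bare-bones version of that write-behind idea, using nothing but the standard library (purely illustrative, not Twiddla's code):

    import queue
    import threading

    cache = {}                 # reads are served from here
    pending = queue.Queue()    # durable writes waiting to be flushed

    def record_vote(item_id, delta):
        # Critical path: touch only the cache, then enqueue the database write.
        cache[item_id] = cache.get(item_id, 0) + delta
        pending.put((item_id, delta))

    def flush_worker(write_to_db):
        # Background thread: drain the queue and persist lazily.
        while True:
            item_id, delta = pending.get()
            write_to_db(item_id, delta)
            pending.task_done()

    # threading.Thread(target=flush_worker, args=(my_db_writer,), daemon=True).start()

Anything still sitting in the queue evaporates with the process, which is exactly the trade-off being described.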


Another idea: populate ('warm') the cache automatically after server restarts to get up to speed quickly.
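
Something as simple as replaying a list of known-hot queries at startup does the trick (hypothetical names, sketch only):

    HOT_QUERIES = [
        ("front_page", "SELECT * FROM links ORDER BY score DESC LIMIT 25"),
        ("top_users",  "SELECT * FROM users ORDER BY karma DESC LIMIT 25"),
    ]

    def warm_cache(cache, db):
        # Run the expensive queries once at boot so the first real visitors
        # hit a warm cache instead of stampeding the database.
        for key, sql in HOT_QUERIES:
            cache[key] = db.execute(sql).fetchall()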


One extra point: Don't Add Caching Yet.

1. wait until something presents itself as a bottleneck

2. optimize it until it's not a bottleneck anymore

3. wait until it presents itself as a bottleneck again anyway.

...Then add caching.

You'll never find the low-hanging yet dog-slow fruit if you put your aggressive caching scheme in place right off the bat. If you have a bunch of poorly optimized code running with a ton of caching to hide it from you and you suddenly have enough traffic to cause scaling pain, that's a big problem.

Speed it up. Then cache it.


Seems to mainly be about Reddit.

Didn't seem to answer the #1 question I had in my mind - how do they cache votes for logged-in users? Do they, say, cache a generic front page and then apply the user's prior voting choice via JS? Or do they cache the page only for the general public, and for logged-in users just cache their voting history and generate it on-the-fly?

However they do it, they do a damn good job. I'm always impressed by how responsive Reddit is, along with the other highly dynamic / high load sites like Digg, etc. They manage to stay up and running fine even with all that going on; but just a link on their front page to someone's blog, which should be almost static, crushes that server into goo. Shows the power of good design!


Use the source, Luke: http://code.reddit.com/


Thanks. I want to know the answer, but not so much that I'm willing to spend the several hours necessary to familiarise myself with a large foreign code base. I was hoping someone here could give a quick top-level explanation.

From looking around, though, I think they're caching the user's votes and then just assembling the page on the fly from cache fragments. You could do that pretty quickly. Might be wrong.


In the past my strategy was basically that.

Load a static page from cache wherever possible. If the user performs a vote, a submit, or what have you, write it to the database, but just perform the update on the user's local page in the DOM.


Yeah, that's what you'd do as the user makes a vote. But how would you load the page in the first place? The front page is different for each user, as they've likely voted up and down numerous links on it.

You could load a generic page and then apply their votes by AJAX or something but that doesn't strike me as being any more efficient for the DB than just generating the page.


For the most part, we were serving a generic page, except for prior actions done by the user. The only major difference between your page and my page, for example, is that article 5 might not show vote links at all if I had previously voted on it.

For the most part, we accomplished this through cookies. If the cookie exists, read in a hashmap of activities you've recently performed, and update the page DOM based on those. If the cookie doesn't exist, go ahead and do a full query from the db, repopulate the cache, and build the cookie.

That said, we noticed that in a lot of cases, that wasn't ideal, and we started rendering partial templates -- so instead of loading a whole page from cache, we would load page parts, basically one part for each article. This worked well, as each template was basically a pre-rendered static file, and we only had two basic states to work with -- whether the user had voted on this before or not.

Our new strategy then was to query just the indexed table for a user's recent votes, separately query an index for which articles to display, and load the templates for the to-be-displayed articles based upon the user's vote state.
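
Roughly, that assembly step looked like this (a sketch with made-up table and key names, not the actual code):

    def render_listing(cache, db, user_id):
        # One cheap indexed query for the user's recent votes...
        voted = {row[0] for row in db.execute(
            "SELECT article_id FROM votes WHERE user_id = ?", (user_id,))}
        # ...one for which articles to display...
        article_ids = [row[0] for row in db.execute(
            "SELECT article_id FROM front_page ORDER BY rank LIMIT 25")]
        # ...then stitch together pre-rendered fragments, two variants per article.
        parts = []
        for aid in article_ids:
            state = "voted" if aid in voted else "unvoted"
            parts.append(cache["article:%d:%s" % (aid, state)])
        return "".join(parts)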

That allowed the page to render a lot faster, but kept the box under more load. We suspect it would have scaled well enough (had the project not eventually fizzled out), but it wasn't ideal server-side, even though it delivered a vastly superior end-user experience.

YMMV.


Loading data from the db can be cheaper than actually RENDERING said data, so if you can push that clientside, you can save cycles.


You could load the user's votes into a cookie.


It would get too large quickly.

Check this out for a good explanation of why you want to keep cookie sizes as small as possible: http://yuiblog.com/blog/2007/03/01/performance-research-part...


Thanks for that link. Interesting how eBay and MySpace use cookie size with great abandon (vs. Amazon & Google, for example). Leaves me wondering whether they know what it's costing them, and if so, whether they're realizing significant savings in other performance figures.


This probably isn't how they do it, but you could serve a static page and fill in the customized information with AJAX.


For people familiar with Oracle, 11g has built-in caching called the result cache. It's transparent and works well. We saw a 400-800%+ increase in performance just by turning this feature on, without any change to our application. Also note that you can still use memcached on top of the result cache for even better performance/scalability.



