Hacker News
This is a story of caching (code.google.com)
95 points by terpua on Sept 9, 2010 | hide | past | favorite | 18 comments



... and then the customer called and asked why the graphs on the front page were wrong even though they clearly just edited hugetable.

You explain to them that it'll just take a little while to update, but the customer doesn't like that answer. The data needs to always be current.

Apparently, you need to flush parts of the cache as new data arrives. Unfortunately, you can't, as memcache is a strict key/value store. So you change how you name the cache keys and make them dependent on, say, the max(timestamp) of your hugetable.
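That key-naming trick might be sketched roughly like this, with a plain dict standing in for the memcache client (all names here are made up):

```python
# A plain dict stands in for the memcache client; names are made up.
cache = {}

def max_timestamp(db):
    # The cheap per-request query: SELECT MAX(timestamp) FROM hugetable.
    return max(row["timestamp"] for row in db)

def get_report(db):
    # The key embeds max(timestamp), so any new row silently makes
    # the old entry unreachable -- no explicit flush needed.
    key = "report:%d" % max_timestamp(db)
    if key not in cache:
        cache[key] = sum(row["value"] for row in db)  # the expensive query
    return cache[key]

db = [{"timestamp": 1, "value": 10}, {"timestamp": 2, "value": 20}]
print(get_report(db))  # 30, computed once and cached
db.append({"timestamp": 3, "value": 5})
print(get_report(db))  # 35, under a fresh key; the old entry just rots
```

The stale entries are never deleted; they simply stop being read and get evicted by memcached's LRU eventually.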

Load goes back up to 2 because all requests now still have to check the table.

But it's still not as bad.

Until the next phone call...


Or you could just update the cache when the data changes.
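A minimal write-through sketch of that idea, again with a dict standing in for memcache and all names made up:

```python
# Write-through sketch: the writer refreshes the cache, readers never wait.
# A dict stands in for memcache; all names are made up.
cache = {}

def write_row(db, row):
    db.append(row)
    # Recompute the cached aggregate at write time, so readers
    # always see current data and never pay for the recompute.
    cache["report"] = sum(r["value"] for r in db)

def get_report(db):
    if "report" not in cache:  # only cold on the very first read
        cache["report"] = sum(r["value"] for r in db)
    return cache["report"]

db = []
write_row(db, {"value": 10})
write_row(db, {"value": 20})
print(get_report(db))  # 30, served from cache and always current
```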


True, if it's possible.

Let's say that hugetable is some interface table filled by a different system you have no control over. You could add a trigger on the database that shells out to some script to clean the cache, but if that external tool adds rows one by one, that's really expensive (aside from the fact that this is NOT what triggers were invented for).

Or the data in hugetable depends on a lot of different components in your application. Then it's really hard to always be sure you invalidate the cache correctly, and there's bound to be some place where you'll forget.

In addition, invalidating the cache on write works counter to the pattern described in the tutorial that concentrates the caching around retrieval.

Don't get me wrong: I agree with the article. It's just never as easy as these tutorials make it seem.


In what way was this a tutorial? Seemed more like a parable meant to drive home the line about how "When they have questions, they ask the mailing list or read the faq again."

Each problem poses different challenges, and it doesn't seem fair to attempt to invalidate a given example by theorizing about hypothetical complexities the original authors never alluded to.


You misunderstood the intent of my possible continuation of the original post. I was not trying to invalidate it, but just noting what problems might (and do) arise.

I totally agree with everything the article says. Caching in general, and memcache especially, is awesome, but as always it's a trade-off. What you get in performance, you pay for in complexity.


The best way to deal with cache invalidation is not to invalidate at all (the smart-key-names technique). Otherwise you may miss some edge cases and/or add a lot of code smell.


> Apparently, you need to flush parts of the cache as new data arrives. Unfortunately, you can't, as memcache is a strict key/value store.

Huh? You can delete keys in memcached just fine.

> So you change how you name the cache keys and make them dependent on, say, the max(timestamp) of your hugetable.

Or you could use memcached's existing expiration support.
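The expiration approach amounts to something like this; a dict with expiry timestamps stands in for memcached's `set(key, value, time=ttl)`, and the names are made up:

```python
import time

# A dict with expiry timestamps, standing in for memcached's
# set(key, value, time=ttl); names are made up.
cache = {}  # key -> (value, expires_at)

def cache_set(key, value, ttl):
    cache[key] = (value, time.time() + ttl)

def cache_get(key):
    entry = cache.get(key)
    if entry is None:
        return None
    value, expires_at = entry
    if time.time() >= expires_at:
        del cache[key]  # expired entries behave like a miss
        return None
    return value

cache_set("report", 42, ttl=60)
print(cache_get("report"))  # 42 while fresh; None once the TTL passes
```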


Yes, you can delete keys, but you need to know the name of the key. You can't use any kind of tagging ("these keys are related to component A", and later "invalidate all entries related to component A").
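There is a common workaround for the missing tagging, though: keep a version counter per tag and bake it into the key names, so bumping the counter "invalidates" every key under that tag at once. A dict-based sketch with made-up names:

```python
cache = {}  # a plain dict standing in for memcache

def tag_version(tag):
    # The tag's current version lives in the cache itself.
    return cache.setdefault("tag_version:%s" % tag, 0)

def tagged_key(tag, name):
    return "%s:v%d:%s" % (tag, tag_version(tag), name)

def invalidate_tag(tag):
    # Bumping the version orphans every key built with the old one;
    # the server's LRU eventually evicts the unreachable entries.
    cache["tag_version:%s" % tag] = tag_version(tag) + 1

cache[tagged_key("componentA", "graph")] = "rendered graph"
invalidate_tag("componentA")
# tagged_key("componentA", "graph") now yields "componentA:v1:graph",
# so the old "componentA:v0:graph" entry is simply never read again.
```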

Using memcached's expiration is what the article does. But I was referring to the requirement of real-time data: if the source data changes, you might want to see the new data, not wait for the cached data to expire.


People normally kick off a background job to update the cache rather than waiting for it to expire. It still isn't instantly consistent, but it's usually close.
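That background-refresh pattern might look like this, with a thread standing in for a real job queue (all names made up):

```python
import threading

# A thread stands in for a real job queue; names are made up.
cache = {"report": 0}  # readers see this (stale) value meanwhile

def expensive_query(db):
    return sum(row["value"] for row in db)

def refresh_cache_async(db):
    # Recompute in the background instead of blocking a reader
    # or waiting for the entry to expire.
    job = threading.Thread(
        target=lambda: cache.update(report=expensive_query(db)))
    job.start()
    return job

db = [{"value": 10}, {"value": 20}]
job = refresh_cache_async(db)
job.join()  # joined here only so the result is visible immediately
print(cache["report"])  # 30
```

In practice nothing joins the thread; readers just keep serving the old value until the job lands the new one.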


So you make the app update the cache when the customer visits - now the newest version is available and it works.


For some reason, I started reading the story with the assumption that it was a "don't do it this way" tutorial, and I got very nervous towards the end. ("But that's exactly how I use memcache!")


hehe, all software documentation should come in 3 forms: Reference, Tutorial and Pop-up Book


_why actually proposed (on a few separate occasions) that there should be more computer books in the 80 page range. More like the Poignant Guide or Nobody Knows Shoes than a dead-tree, slow version of Google.


The real story is in how they push untested code into production just to see what happens. ;)


Heh - 'cause none of _us_ have ever done that, right? <looks around nervously for any colleagues reading>


It depends on what they meant by a load average of 20 - or specifically, what kind of workload it is. In some cases an LA of 20 is pretty much standard; in others you cannot even log in to the box anymore. If it was the second case... well, since nothing works anyway, someone might just as well push untested code into production ;)

Although "One day the Sysadmin realizes" is pretty bad. They should have had daily trends showing it coming days before it actually happened (unless it was some big release / marketing day).


Programmer and Sysadmin were either very lucky, or not working on anything important, or else they would have been fired or gone out of business. You can't just add caching and magically expect things to work. You have to think hard about expiration policies and test to make sure you aren't going to get wrong answers, or else you need to prove that wrong answers are ok.


"All programming is an exercise in caching."

-Terje Mathisen



