Yeah, I am the engineer mentioned in the article, and I agree the explanation doesn't really work. The pieces that ended up in the article itself don't add up to a coherent explanation. The fault for that is mostly mine. In hindsight, my original explanation was too long and too elaborate to be helpful. It's a good reminder that it is easy to go too far with an analogy and end up complicating the thing you were trying to simplify. Oh well, live and learn :)
There are more (coherent and to-the-point) details about PoolCounter in the prologue to PoolCounter.php in MediaWiki's source tree:
Wow, that's a nice technique, but in this case, isn't the rendered page invalidated by changes to the source, rather than over time?
I suppose, in this case you could use the time since invalidation as your input. The downside is that changes in the source aren't immediately reflected in the rendered output, especially for infrequently updated pages.
Basically a page view checks the cache and rebuilds if necessary. Thousands of page hits in the same second, before the build is over, start thousands of parallel rebuilds.
* Cache never automatically expires, but you do have some notion of staleness
* Whenever you request data, you get it from the cache (if it exists) and check for staleness, so your cached data needs to know when it was cached.
* Return the data as usual, but if the data was also stale, you fire off a worker to update the cached data.
* If you have lots of requests happening at the same time, you have a system for seeing if a worker already exists, to ensure that you only create one (for each piece of cached data).
* For the time it takes for the worker to complete, you have to be okay with serving stale data; in most cases this is fine.
There's an edge case missed here, which is what to do when the cache is empty (either because it's one of the first requests, or because the cached data has been evicted). That's up to you, depending on your use case. You can basically either return a default value, you can pre-warm your cache, or you can let the requests hang until the data is ready.
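The bullets above can be sketched roughly as follows. This is a minimal in-process sketch, not anything taken from MediaWiki: the class name, `rebuild`, `max_age_seconds` and `wait_for_refresh` are all made up for illustration, and a real setup would presumably keep the data in a shared cache rather than a dict. For the empty-cache edge case it picks the "return a default value" option.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """Serve cached values immediately; refresh stale ones with one worker per key."""

    def __init__(self, rebuild, max_age_seconds, clock=time.monotonic):
        self._rebuild = rebuild          # hypothetical callable: key -> fresh value
        self._max_age = max_age_seconds  # staleness threshold, not an expiry
        self._clock = clock              # injectable so staleness can be tested
        self._lock = threading.Lock()
        self._entries = {}               # key -> (value, cached_at)
        self._refreshing = {}            # key -> worker thread currently rebuilding

    def get(self, key, default=None):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                # Empty cache: return a default and start a rebuild.
                # Pre-warming or blocking the caller are the alternatives.
                self._start_refresh_locked(key)
                return default
            value, cached_at = entry
            if self._clock() - cached_at > self._max_age:
                self._start_refresh_locked(key)  # stale: refresh in the background
            return value                          # always answer from the cache

    def _start_refresh_locked(self, key):
        # Caller must hold self._lock.
        if key in self._refreshing:
            return                                # a worker already exists for this key
        worker = threading.Thread(target=self._refresh, args=(key,), daemon=True)
        self._refreshing[key] = worker
        worker.start()

    def _refresh(self, key):
        try:
            value = self._rebuild(key)            # slow part runs outside the lock
            with self._lock:
                self._entries[key] = (value, self._clock())
        finally:
            with self._lock:
                self._refreshing.pop(key, None)

    def wait_for_refresh(self, key):
        """Block until any in-flight refresh of `key` finishes (handy in tests)."""
        with self._lock:
            worker = self._refreshing.get(key)
        if worker is not None:
            worker.join()
```

With this shape, a burst of `get("Prince")` calls while a rebuild is in flight starts only one worker; everyone else gets the old value (or the default) straight away.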
Because it's usually triggered by the cache expiring: once the entry expires it's no longer available, so every page view attempts a rebuild. Otherwise the cached data and its timestamp would have to be stored separately. I store the cache in memcached with an expiration and detect when it expires.
You could queue an async job to rebuild what was stored in the cache right before the expiration time when you update / renew the expiration on the cached item?
It's a cute name, but there is existing research on this issue that calls it "the thundering herd problem", so I would recommend the name you can actually search for without problems.
16:49 UTC refers to when TMZ first reported Prince died. Prior to that, there had been plenty of reports that there were ambulances at Paisley Park. It is not strange that people would google either "paisley park" or "prince" and then follow the results to Prince's wiki page.
I don't know the exact timeline, but my understanding is that there were reports of a death in the area where Prince lived before it was known that it was Prince.
Look very closely at the section of the graph starting around 4:20 PM. There's a small but significant increase in hits before the big spike starts around 4:50.
How are Wikipedia articles kept consistent with each other? Say someone like Prince dies. His page will instantly change, seemingly while his portrait is still in the sky and the cannon fires.
But with certain people there's a variety of connected items that need referential integrity. For instance, I can imagine Prince being on one of those lists (e.g. highest grossing) that uses bold text for still-living artists. Office holders need to be moved from "incumbent" to a box with dates, and the new incumbent needs to be updated. And then there are text snippets in the present tense ("Prince and David Bowie are among the greatest living artists").
And then there are the corresponding pages in other languages.
> How are Wikipedia articles kept consistent with each other?
They aren't. There is no transactional/referential integrity on Wikipedia. When someone famous dies, a pretty common pattern is that first the death date is added, and then in a later edit someone gets around to changing the present tense verbs into past tense verbs ("Prince is a singer..." --> "Prince was a singer...").
Some material like that is generated by templates, so updating the templates updates it in many places.
A lot, however, is just done manually by the armies of volunteers that contribute to the site. More than a few people specialize in updating bits of minutiae like that.
I don't think WMF staff are credited enough for the work they do in keeping Wikipedia running. They seriously know how to scale, I think the only ones better than them are honestly Facebook and Twitter!
I don't know if I would go that far, but WMF also don't have nearly the resources of Facebook or Google or Amazon, and at their scale nothing is easy anymore.
It's possible to use stacks to 'cache' writes in scenarios like this.
Writes to the same object go on the same stack. To flush, iterate over the stacks: pop the top item from each, write it, and clear the stack.
It works miracles for ephemeral data like wikipedia edits.
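A rough sketch of the per-object stacks described above, assuming that only the newest value per object matters and that intermediate values can be thrown away. `CoalescingWriteBuffer`, `push`, `flush` and the `write` callback are hypothetical names, not from any particular library.

```python
from collections import defaultdict


class CoalescingWriteBuffer:
    """Buffer writes per object and persist only the most recent one."""

    def __init__(self, write):
        self._write = write               # hypothetical callable: (key, value) -> None
        self._stacks = defaultdict(list)  # key -> stack of pending values

    def push(self, key, value):
        self._stacks[key].append(value)

    def flush(self):
        """Write the top of each stack, drop the rest, return how many writes were skipped."""
        skipped = 0
        for key, stack in self._stacks.items():
            self._write(key, stack[-1])   # top of the stack is the newest value
            skipped += len(stack) - 1     # everything below it is discarded
        self._stacks.clear()
        return skipped
```

The trade-off is explicit here: every value below the top of a stack is never written, so this only fits data where losing intermediate states is acceptable.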
If you have extremely spiky load on your servers, stacks are also a great replacement for queues. Accept that during the deluge some portion of queries will time out and go unanswered. Instead of trying to process queries that are likely to time out, simply process the query at the top of the stack and don't waste time on the ones bound to fail.
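For the spiky-load case, a minimal sketch of serving newest-first and dropping requests that have already waited past their timeout. The class and method names are invented for illustration, and the single fixed `timeout_seconds` for all requests is an assumption.

```python
import time


class LifoRequestStack:
    """Hand out the newest request first; give up on ones that already timed out."""

    def __init__(self, timeout_seconds, clock=time.monotonic):
        self._timeout = timeout_seconds
        self._clock = clock        # injectable so the behaviour can be tested
        self._items = []           # (arrived_at, request), oldest at the bottom

    def push(self, request):
        self._items.append((self._clock(), request))

    def pop_fresh(self):
        """Return the newest request still within its timeout, or None."""
        if not self._items:
            return None
        arrived_at, request = self._items.pop()
        if self._clock() - arrived_at > self._timeout:
            # Even the newest request is too old, so everything below it is too.
            self._items.clear()
            return None
        return request
```

A FIFO queue under the same load would spend its time on the oldest requests, which are exactly the ones whose clients are most likely to have given up already.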
Varnish does handle the mass of logged-out cached requests. However, because there were so many page edits per second, the cache in Varnish would only be valid for about a second. Then a flood of logged-out users hit the servers at the same time, requesting the uncached page to be rendered. The PoolCounter extension keeps the web servers under control by throttling requests for page rendering.
Why not just set the min TTL to a few minutes, or even a minute, for anonymous users? Is there enough usefulness in ~seconds of delay on article edits vs minutes to warrant a much more complex design?
No, definitely not across a cluster (although that would be quite nifty). Even on a single node that would reduce the thundering herd effect substantially.
This is so impressive, to see behind the curtains of what has become the central repository of humanity's knowledge, during a moment of loss of one of humanity's greats.
Prince has provided millions (probably billions) of fans with moments of joy in their lives. The album "Purple Rain" pretty much was the bookmark of my freshman year at college. So many memories of that first year away from home come rushing back whenever I hear any of the songs from that album. I can vividly recall specific events, settings, and people for almost all of them, over three decades later.
I never saw Prince live, but talking to people who did, he commanded the stage and the audience like few others. Without exception they describe him as one of the best live performers they have ever seen.
I don't see why you bring up Bill Gates. His philanthropy is laudable and that stands on its own. I don't see how recognizing that Prince is probably up there with the best performing artists in human history takes anything away from that.
William Shakespeare didn't save as many lives as any physician of his time. Yet here we are, 400 years after his death, commemorating the man and his work.
Different people contribute differently to the betterment of mankind. That doesn't mean some contributors need to be censored away from popular culture just because you perceive their contributions to be not as important as others'.
I agree with your point, though I think many physicians at the time of Shakespeare may actually have had a negative score when it came to saving lives.
Just lol. Blind comment. He played and produced his first album himself: all 27 instruments, all production, composition, arrangement. Pretty sure he was under 20, too. Sign o' the Times hits you lyrically like Bob Dylan. I'm not sure what makes an artist to you, but Prince created some of the best art I've heard. But yes, beauty is in the eye of the beholder.
I totally agree with the sentiment of this comment (though I'm not sure how much marketing is involved.) Prince was very good at music and was, well, kind of a dick. I don't quite get the massive outpouring of grief that has ensued.
Well, I wasn't really responding to the Bill Gates part in particular. Gates was a bit of a megalomaniac with MS. But he's doing awesome stuff with the money now. As a person, I'm not aware of him ever being a jerk.
So if you gain lots of money via semi nefarious means what fraction do you have to dedicate to good works before the earlier wrong is cancelled?
I know that he didn't gas 6 million people but letting people buy their way out of moral debt with a fraction of the money they gained still seems horribly repugnant.
First, what moral debt? Second, I imagine the sum total of his humanitarian efforts is greater than the total charity there would have been if all of those dollars had remained in the pockets of each person who bought Windows 95 et al. So repugnant seems like a real stretch.
They mentioned 5M views within 24 hours of Michael Jackson's death. With over 3B Internet users out there, I am actually a little surprised how small the spike was. Did they only count English Wikipedia? Even so I am quite surprised. I would expect 10-20M at least. Similarly, many young people like myself have never heard of Prince, I had to look him up to find out who he truly was.
They recently overhauled it, [0] but back then the pageviews [1] wouldn't count mobile users. You can see the old stats for Jackson's page here: http://stats.grok.se/en/200906/Michael%20Jackson. The actual number is 5,875,404 views within that day (in whichever time zone) and is for the English version of the article specifically.
Thanks. Yeah, I think mobile viewers would be a substantial share, but the desktop count is still below my personal expectation. We are talking about 850M English-speaking Internet users :( Only 6M page views within 24 hours is really quite low.
As an "old" person (I'm 32) it blows my mind that younger generations might not be familiar with Prince's work. Unfortunately these "damn, I'm old!" moments keep cropping up more and more just lately. ;)
Given how short the spike is, I wonder if it's possibly due to a misconfiguration somewhere and a TZ offset is either getting misapplied or mis-corrected for?
Even at peak it's just ~800 hits per second, which shows how irrelevant the C10k problem is (yes, I know it's not exactly about hits per second, but still).
The C10k problem is absolutely not irrelevant, and certainly not because one page only saw 800rps during a worldwide event. Two things there:
(1) 800rps TO THAT PAGE is the metric. The entire rest of Wikipedia was still getting traffic, and as an educated guess I would estimate raw traffic to be on the order of magnitude of 3-4krps (across editing and views). They are quite open with operations and if I weren't mobile I could probably find the accurate answer.
(2) There are much higher traffic properties. I'm aware of one property beyond 200krps in aggregate.
If you had said "most people don't have to worry about C10k," then I'm on board. That's true. Irrelevant? Far from it.
And yes, query rate and connection count have a complicated relationship. You need three or four other metrics to explain their relationship, but raw query rate is a good yardstick for active connections when combined with quantiled request latency. (Not averaged.) Simple example: a 750ms 95th% page hit 10,000 times per second is almost certainly far > C10k because of the outliers.
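As a rough way to see how request rate and latency combine, Little's law says the average number of requests in flight equals the arrival rate times the mean time each request spends in the system. The sketch below applies it to a latency distribution given as buckets; the function names and the bucket figures in the usage note are made up for illustration, not measurements from any real site.

```python
def mean_latency(buckets):
    """Mean latency in seconds from (share_of_requests, latency_seconds) pairs.

    The shares are assumed to sum to 1.
    """
    return sum(share * latency for share, latency in buckets)


def in_flight_requests(requests_per_second, mean_latency_seconds):
    """Little's law: average concurrency = arrival rate * mean time in system."""
    return requests_per_second * mean_latency_seconds


def tail_share_of_concurrency(buckets, slow_threshold_seconds):
    """Fraction of in-flight requests that belong to buckets at or above the threshold."""
    total = mean_latency(buckets)
    slow = sum(share * latency for share, latency in buckets
               if latency >= slow_threshold_seconds)
    return slow / total
```

For example, with 95% of requests at 50 ms and 5% at 2 s, `in_flight_requests(800, mean_latency([(0.95, 0.05), (0.05, 2.0)]))` comes to about 118, and roughly two thirds of those in-flight requests are from the slow 5%. That is why an averaged latency hides most of what drives concurrency. Note this only counts requests being served; idle keep-alive or long-poll connections come on top.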
Now I will grant that C10k itself is somewhat irrelevant, yes, but not for the reason you are saying. It was defined in an age when 10,000 active connections was pretty surprising (didn't it come from FTP or some other heavy eyeball protocol?). These days with long poll apps, long-running protocols, and so on, millions of open connections are quite common at consumer scale. I find C1M far more interesting these days. C10M is still kinda nuts, but does exist in the magical world of metal and fiber and hot aisles and all those great things that nobody uses anymore (depressingly).
Ah. So my hunch was that the published number was what is making it through cache, and that's where my estimate comes from too. That sounds about what I'd expect for the cached side.
which you can see is not showing any substantial increase due to the passing of Prince. The big hole of two days that ends just before the news broke is due to wikimedia switching traffic to a second datacenter for two days, see http://blog.wikimedia.org/2016/04/18/wikimedia-server-switch...
The 10k connections case is related to the 10k hits case, because 10k hits is 10k connections, just short-lived ones. What I tried to say was: "if even such a popular Wikipedia page doesn't have a 10k hits problem, then for 99.99% of projects it will be a corner case, not an everyday routine". Sorry if it was worded badly.
C10k doesn't come up with Wikipedia because it doesn't do websockets. If it did, it would probably have millions of concurrent connections, far over 10k. I dare say 10k is a bit passé. WhatsApp can apparently do 2m/server.
Ah, that makes sense. Still, is 2m rps within the realm of possibility for BEAM with a high-end server grade processor and heavy use of actors/concurrency?
Okay, I can see where you're coming from but the sentence was "raining downpour in front of over a hundred million people." How does that NOT imply a live audience of a hundred million? I can't believe people are downvoting me for this. The sentence is absurd as written.
Downvotes aren't punishment. Up/down votes indicate "this comment ought to be more/less prominent on the page". The sentence was somewhat ambiguous and your misunderstanding is understandable. But your comment is not useful to the discussion because most people understood and in any case it's tangential.
Okay honestly am I the only one who reads "in front of a hundred million people" and thinks it actually means he was in front of a hundred million people? I don't see it as an ambiguity, it's simply a wrong statement. I'm okay with people being uninterested in (and even downvoting) a correction, but I'm baffled that anyone could read the sentence as anything other than incorrect. I mean, they even describe the weather of the event as if to make it sound even more impressive that all these people showed up!
I think they mentioned the weather because it's relevant to his reputation as a legendary live performer and it's part of a great story. [1] There's something almost magical about incredible live performances. I can't really describe it, but sometimes a show has the perfect mix of emotion and artistry and it's completely mesmerizing to be a part of, even as a spectator. Participating in that moment bonds millions of people together in a small way. Even if they have nothing else in common, they were able to experience that together. Prince's Super Bowl performance was one of those moments.
Thanks, I really do appreciate it. The thought that it could have meant televised audience never crossed my mind and surely wouldn't have for some other portion of the audience as well. I was out googling for largest concert size records to make sure I wasn't crazy.
https://wikitech.wikimedia.org/wiki/PoolCounter
TL;DR:
It's a limiter on how many workers start rendering the new page version when the old page version in cache has been invalidated.
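To make the TL;DR concrete, here is a minimal single-process sketch of that kind of limiter. It is not PoolCounter's actual implementation or API: `RenderLimiter`, `do_render` and `fallback` are hypothetical names, and it assumes a caller that cannot get a render slot within the timeout is content with a fallback such as the stale page.

```python
import threading


class RenderLimiter:
    """Allow at most N concurrent renders per page; others wait briefly, then fall back."""

    def __init__(self, max_workers_per_key, wait_timeout_seconds):
        self._max = max_workers_per_key
        self._timeout = wait_timeout_seconds
        self._lock = threading.Lock()
        self._slots = {}   # key -> semaphore; never pruned in this sketch

    def _semaphore(self, key):
        with self._lock:
            if key not in self._slots:
                self._slots[key] = threading.BoundedSemaphore(self._max)
            return self._slots[key]

    def render(self, key, do_render, fallback):
        slot = self._semaphore(key)
        if not slot.acquire(timeout=self._timeout):
            # Too many workers are already rendering this page: don't pile on.
            return fallback(key)
        try:
            return do_render(key)
        finally:
            slot.release()
```

During a spike on one article, only `max_workers_per_key` requests pay the rendering cost at any moment, and the rest either wait for a slot or are served whatever `fallback` returns.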