The impact of Prince’s death on Wikipedia

semi-extrinsic · on April 23, 2016

For others who were left scratching their heads at what exactly this pop-sci-explained PoolCounter mechanism actually is:

https://wikitech.wikimedia.org/wiki/PoolCounter

TL;DR:

It's a limiter on how many workers start rendering the new page version when the old page version in cache has been invalidated.
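To make the TL;DR concrete, here is a minimal in-process sketch of the general idea: cap how many workers may render a given page at once, and let the rest wait and reuse the result. It is not PoolCounter's actual protocol or API. The names `RenderLimiter` and `render` are hypothetical, and threads stand in for separate web workers.

```python
import threading
from collections import defaultdict


class RenderLimiter:
    """Allow at most `max_workers` concurrent renders per page key (hypothetical helper)."""

    def __init__(self, max_workers=1):
        self._guard = threading.Lock()
        # One semaphore per key, created lazily on first use.
        self._slots = defaultdict(lambda: threading.Semaphore(max_workers))
        self.cache = {}

    def get(self, key, render):
        # Fast path: a rendered copy is already cached.
        if key in self.cache:
            return self.cache[key]
        # Look up (or create) the semaphore for this key under a lock.
        with self._guard:
            slot = self._slots[key]
        with slot:
            # Re-check: another worker may have finished while we waited.
            if key in self.cache:
                return self.cache[key]
            value = render(key)
            self.cache[key] = value
            return value
```

With `max_workers=1`, a burst of simultaneous requests for the same uncached page results in one render; everyone else blocks briefly and then reads the cached copy.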

atdt · on April 24, 2016

Yeah, I am the engineer mentioned in the article, and I agree the explanation doesn't really work. The pieces of the explanation that ended up in the article itself don't add up to a coherent explanation. The fault for that is mostly mine. In hindsight, my original explanation was too long and too elaborate to be helpful. It's a good reminder that it is easy to go too far with an analogy and end up complicating the thing you were trying to simplify. Oh well, live and learn :)

There are more (coherent and to-the-point) details about PoolCounter in the prologue to PoolCounter.php in MediaWiki's source tree:

https://github.com/wikimedia/mediawiki/blob/1617e7822eaf7426...

And in a short blog post by Domas Mituzas, who is the original author of PoolCounter:

https://dom.as/2009/06/26/embarrassment/

hedgehog · on April 24, 2016

There is a stochastic approach that can be adapted to address this problem, I think I first saw it at IMVU in 2009 but conveniently Wikipedia has a good reference now: https://en.wikipedia.org/wiki/Cache_stampede#Probabilistic_e...

The advantage is less coordination is necessary and you should be able to get down to a single concurrent rerender per page.
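For readers who don't want to follow the link, here is a rough sketch of probabilistic early expiration for a TTL-based cache, following the formula in the linked Wikipedia section as I understand it: each reader independently decides to recompute slightly before expiry, with a probability that rises as expiry approaches. The function names, the dict-based cache, and the `ttl`/`beta` parameters are assumptions for illustration only.

```python
import math
import random
import time


def should_recompute(expiry, delta, beta=1.0, now=None, rng=random.random):
    """Return True if this reader should refresh the entry early.

    expiry: absolute time at which the entry expires
    delta:  how long the last recomputation took, in seconds
    beta:   values > 1 favour earlier recomputation, < 1 later
    """
    now = time.time() if now is None else now
    # rng() is in [0, 1); 1 - rng() keeps the log argument in (0, 1].
    jitter = -delta * beta * math.log(1.0 - rng())
    return now + jitter >= expiry


def fetch(cache, key, recompute, ttl, beta=1.0):
    """Read `key` from a dict-like cache, refreshing it probabilistically."""
    entry = cache.get(key)
    if entry is None or should_recompute(entry["expiry"], entry["delta"], beta):
        start = time.time()
        value = recompute(key)
        delta = time.time() - start
        cache[key] = {"value": value, "delta": delta, "expiry": time.time() + ttl}
        return value
    return entry["value"]
```

Because the jitter is never negative, an entry past its expiry is always recomputed; before expiry, slow-to-build entries (large `delta`) start refreshing earlier than cheap ones.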

DanWaterworth · on April 24, 2016

Wow, that's a nice technique, but in this case, isn't the rendered page invalidated by changes to the source, rather than over time?

I suppose, in this case you could use the time since invalidation as your input. The downside is that changes in the source aren't immediately reflected in the rendered output, especially for infrequently updated pages.

atdt · on April 24, 2016

Wow, fascinating -- thank you for the pointer!

homero · on April 24, 2016

Basically a page view checks the cache and rebuilds if necessary. Thousands of page hits in the same second, before the build is over, start thousands of parallel rebuilds.

taf2 · on April 24, 2016

Also known as the thundering herd...

homero · on April 24, 2016

Yeah I never really solved it while I was using memcached but I'm not Wikipedia

andrewingram · on April 24, 2016

The way i've seen it done is something like this:

* Cache never automatically expires, but you do have some notion of staleness.
* Whenever you request data, you get it from the cache (if it exists) and check for staleness, so your cached data needs to know when it was cached.
* Return the data as usual, but if the data was stale, you also fire off a worker to update the cached data.
* If you have lots of requests happening at the same time, you have a system for seeing if a worker already exists, to ensure that you only create one (for each piece of cached data).
* For the time it takes for the worker to complete, you have to be okay with serving stale data; in most cases this is okay.

There's an edge case missed here, which is what to do when the cache is empty (either because it's one of the first requests, or because the cached data has been evicted). That's up to you, depending on your use case. You can basically either return a default value, you can pre-warm your cache, or you can let the requests hang until the data is ready.
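A minimal sketch of the pattern described above, assuming an in-process dict as the cache and a background thread as the "worker". The class name, `loader`, and `max_age` are hypothetical. For the empty-cache edge case this sketch picks the "return a default value" option.

```python
import threading
import time


class StaleWhileRevalidateCache:
    """Serve cached data immediately; refresh stale entries with one worker per key."""

    def __init__(self, loader, max_age):
        self._loader = loader          # function: key -> fresh value
        self._max_age = max_age        # seconds before an entry counts as stale
        self._lock = threading.Lock()
        self._entries = {}             # key -> (value, stored_at)
        self._refreshing = set()       # keys that already have a worker running

    def get(self, key, default=None):
        with self._lock:
            entry = self._entries.get(key)
            if entry is None:
                # Empty cache: hand back the default and treat it as stale.
                value, stale = default, True
            else:
                value, stored_at = entry
                stale = time.monotonic() - stored_at > self._max_age
            # Start a worker only if none exists for this key.
            start_refresh = stale and key not in self._refreshing
            if start_refresh:
                self._refreshing.add(key)
        if start_refresh:
            threading.Thread(target=self._refresh, args=(key,), daemon=True).start()
        return value

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.monotonic())
        finally:
            with self._lock:
                self._refreshing.discard(key)
```

Callers never block on a rebuild: they get whatever is cached (possibly stale, possibly the default) and at most one refresh per key runs at a time.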

andrewingram · on April 24, 2016

I should've fixed the formatting whilst it was still editable :(

A former colleague of mine built a Django implementation of this pattern, which is pretty useful: https://github.com/codeinthehole/django-cacheback

homero · on April 24, 2016

I never looked into whether memcached will give you the creation date, I just let memcached expire it itself

forgotpwtomain · on April 24, 2016

Why isn't it a possibility to rebuild and invalidate the cache only when the rebuild is finished?

homero · on April 24, 2016

Because the rebuild is usually triggered by the cache entry expiring: once it has expired it's not available, and every page view attempts a rebuild. Otherwise the cached value and its timestamp would have to be stored separately. I store the value in memcached with an expiration and use the miss to detect that it expired.

forgotpwtomain · on April 25, 2016

You could queue an async job to rebuild what was stored in the cache right before the expiration time when you update / renew the expiration on the cached item?

hunter2_ · on April 24, 2016

+++ath0

zatkin · on April 23, 2016

+1 for whoever named it the "Michael Jackson problem". I'm going to start using that if I ever encounter the same issue myself.

encoderer · on April 23, 2016

Is this not the thundering herd problem?

bubuga · on April 24, 2016

> Is this not the thundering herd problem?

There's a wikipedia article on the thundering herd problem.

https://en.wikipedia.org/wiki/Thundering_herd_problem

The article needs a little love, though.

gerry_shaw · on April 24, 2016

As I understand it yes, but I think the "Michael Jackson problem" is a better name.

techdragon · on April 24, 2016

It's a cute name, but there is existing research on this issue that calls it "the thundering herd problem", so I would recommend the name you can actually search for without problems.

yrro · on April 24, 2016

The_ed17 · on April 23, 2016

That was Tim Starling:

https://wikitech.wikimedia.org/w/index.php?title=PoolCounter...

https://wikimediafoundation.org/wiki/User:Tim_Starling_(WMF)

RyJones · on April 24, 2016

American Idol caused widespread rearchitecting of SMSGW/SMSC. Fun times; voting was a very distributed DoS.

Buge · on April 23, 2016

Interesting how in the graph it looks like some people found out about 25 minutes before it was more publicly found out.

sigmar · on April 23, 2016

16:49 UTC refers to when TMZ first reported Prince died. Prior to that, there had been plenty of reports that there were ambulances at Paisley Park. It is not strange that people would google either "paisley park" or "prince" and then follow the results to Prince's wiki page.

fineIllregister · on April 23, 2016

I don't know the exact timeline, but my understanding is that there were reports of a death in the area where Prince lived before it was known that it was Prince.

The_ed17 · on April 23, 2016

I found a few references to stories about a 'medical situation' at Paisley Park that came out before TMZ's report—I'm assuming it was those reports.

CaptSpify · on April 23, 2016

I wonder if there's a crossover between celebrity reporters, and people who update Wikipedia on celebrities?

erelde · on April 23, 2016

I'm seeing an exponential curve; isn't that what we should expect?

duskwuff · on April 23, 2016

Look very closely at the section of the graph starting around 4:20 PM. There's a small but significant increase in hits before the big spike starts around 4:50.

It's easier to see on the full resolution graph:

https://upload.wikimedia.org/wikipedia/commons/f/f2/Prince_a...

lordnacho · on April 24, 2016

How are Wikipedia articles kept consistent with each other? Say someone like Prince dies. His page will instantly change, seemingly while his portrait is still in the sky and the cannon fires.

But with certain people there's a variety of connected items that need referential integrity. For instance, I can imagine Prince being on one of those lists (eg highest grossing) that has bold text for still living artist. For office holders, they need to be moved from "incumbent" to a box with dates and the new incumbent needs to be updated. And then there's text snippets that are in present tense ("Prince and David Bowie are among the greatest living artists").

And then there's the corresponding pages in other languages.

How's it done?

CydeWeys · on April 24, 2016

> How are Wikipedia articles kept consistent with each other?

They aren't. There is no transactional/referential integrity on Wikipedia. When someone famous dies a pretty common pattern is that first the death date is added, and then in a later subsequent edit someone gets around to changing the present tense verbs into past tense verbs ("Prince is a singer..." --> "Prince was a singer...").

I can tell you all about how category normalization is maintained, though. It starts with this process: https://en.wikipedia.org/wiki/Wikipedia:Categories_for_discu...

nullc · on April 24, 2016

Some material like that is generated by templates, so updating the templates updates it in many places.

A lot, however, is just done manually by the armies of volunteers that contribute to the site. More than a few people specialize in updating bits of minutia like that.

chris_wot · on April 23, 2016

I don't think WMF staff are credited enough for the work they do in keeping Wikipedia running. They seriously know how to scale, I think the only ones better than them are honestly Facebook and Twitter!

philwelch · on April 24, 2016

I don't know if I would go that far, but WMF also don't have nearly the resources of Facebook or Google or Amazon, and at their scale nothing is easy anymore.

dward · on April 23, 2016

Google as well?

chris_wot · on April 24, 2016

You know, Google are so much in my life I don't even notice them!

Google beats everyone :-) they are so far ahead that I didn't even consider mentioning them!

zappo2938 · on April 23, 2016

Can't caching a page with varnish and memcache handle this?

nothrabannosir · on April 23, 2016

They cache the bejeezus out of their pages. Problems come up when a lot of people want to edit a page, an inherently uncacheable operation.

_3u10 · on April 24, 2016

It's possible to use stacks to 'cache' writes in scenarios like this.

Writes to the same object go in the same stack, iterate over stacks, pop the first item, write it, clear the stack.

It works miracles for ephemeral data like wikipedia edits.

If you have extremely spiky load on servers, stacks are also a great replacement for queues. Accept that during the deluge some portion of queries will time out and go unanswered; instead of trying to process queries that are likely to time out, simply process the query on top of the stack and don't waste time processing the ones bound to fail.
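A small sketch of the per-object stack idea as I read it: pending writes to the same object pile up on one stack, and a flush writes only the newest one and throws the rest away. `CoalescingWriteBuffer` and the `write` callback are hypothetical names, and this is single-threaded for brevity.

```python
from collections import defaultdict


class CoalescingWriteBuffer:
    """Buffer writes per object and keep only the newest one on flush."""

    def __init__(self):
        self._stacks = defaultdict(list)   # key -> pending writes, newest last

    def push(self, key, value):
        self._stacks[key].append(value)

    def flush(self, write):
        """Call write(key, newest_value) once per object; return how many were written."""
        flushed = 0
        for key, stack in list(self._stacks.items()):
            if stack:
                write(key, stack.pop())    # the newest pending write wins
                stack.clear()              # older pending writes are discarded
                flushed += 1
        self._stacks.clear()
        return flushed
```

Note the trade-off: intermediate writes are dropped entirely, so this only fits data where nothing but the latest value matters.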

chris_wot · on April 24, 2016

Doesn't caching writes from multiple different servers potentially cause consistency and durability concerns?

I know MongoDB still haven't marked the bug [1] reported by Kyle Kingsbury [2] that found stale reads on all consistency and write concern levels...

1. https://jira.mongodb.org/plugins/servlet/mobile#issue/SERVER...

2. https://aphyr.com/posts/322-jepsen-mongodb-stale-reads

_3u10 · on April 25, 2016

Yes, but this is wikipedia, the entire premise of it is eventual consistency. The idea being that the most recent update to a page is the correct one.

chris_wot · on April 29, 2016

Good point.

Washuu · on April 23, 2016

Varnish does handle the mass of logged-out cached requests. However, because there were so many page edits per second, the cached copy in Varnish would only be valid for about a second. Then a flood of logged-out users hit the servers at the same time, requesting that the uncached page be rendered. The PoolCounter extension keeps the web servers under control by throttling requests for page rendering.

seanp2k2 · on April 24, 2016

Why not just set the min TTL to a few minutes, or even a minute, for anonymous users? Is there enough usefulness in ~seconds of delay on article edits vs minutes to warrant a much more complex design?

danielrhodes · on April 23, 2016

Correct me if I am wrong, but I thought Varnish has support for limiting concurrent backend fetches to the same resource.

brianwawok · on April 23, 2016

Across a cluster of varnishes? I think that limit is per varnish.

danielrhodes · on April 24, 2016

No, definitely not across a cluster (although that would be quite nifty). Even on a single node that would reduce the thundering herd effect substantially.

chris_wot · on April 24, 2016

Maybe that's a new feature request!

untog · on April 23, 2016

So... did you read the article?

_mhyx · on April 23, 2016

This is so impressive, to see behind the curtains of what has become the central repository of humanities knowledge, during a moment of loss of one of humanity's greats.

nymi · on April 24, 2016

Take note Robots at Facebook and Google. Monetizing humanities information through ads is not the only way.

conceit · on April 25, 2016

> the central repository of humanities knowledge

luckily, knowledge is decentralized

aaron695 · on April 24, 2016

> during a moment of loss of one of humanity's greats.

Compared to, lets say Bill Gates who's saved millions of lives?

Even artistically, I'm not sure Prince was up there in the top 1%

The power of marketing.....

ams6110 · on April 24, 2016

Prince has provided millions (probably billions) of fans with moments of joy in their lives. The album "Purple Rain" pretty much was the bookmark of my freshman year at college. So many memories of that first year away from home come rushing back whenever I hear any of the songs from that album. I can vividly recall specific events, settings, and people for almost all of them, over three decades later.

I never saw Prince live, but talking to people who did, he commanded the stage and the audience like few others. Without exception they describe him as one of the best live performers they have ever seen.

I don't see why you bring up Bill Gates. His philanthropy is laudable and that stands on its own. I don't see how recognizing that Prince is probably up there with the best performing artists in human history takes anything away from that.

bubuga · on April 24, 2016

William Shakespeare didn't save as many lives as any physician of the time. Yet here we are, 400 years after his death, commemorating the man and his work.

Different people contribute differently to the betterment of mankind. That doesn't mean some contributors need to be censored away from popular culture just because you perceive their contributions to be not as important as others'.

ZeroGravitas · on April 24, 2016

I agree with your point, though I think many physicians at the time of Shakespeare may actually have had a negative score when it came to saving lives.

http://historyworld.net/wrldhis/PlainTextHistoriesResponsive...

dizzy3gg · on April 24, 2016

Just lol. Blind comment. He played and produced his first album himself: all 27 instruments, all production, composition, arrangement. Pretty sure he was under 20 also. Sign o the Times, lyrically, hits you like Bob Dylan. I'm not sure what makes an artist to you, but Prince created some of the best art I've heard. But yes, beauty is in the eye of the beholder.

todaysaccount · on April 24, 2016

If you're serious, do you genuinely believe you know what you're talking about?

quaristice · on April 24, 2016

I totally agree with the sentiment of this comment (though I'm not sure how much marketing is involved.) Prince was very good at music and was, well, kind of a dick. I don't quite get the massive outpouring of grief that has ensued.

ams6110 · on April 24, 2016

"Kind of a dick" I think could easily apply to Gates as well, no?

quaristice · on April 24, 2016

Well, I wasn't really responding to the Bill Gates part in particular. Gates was a bit of a megalomaniac with MS. But he's doing awesome stuff with the money now. As a person I'm not aware of him ever being a jerk.

michaelmrose · on April 24, 2016

So if you gain lots of money via semi nefarious means what fraction do you have to dedicate to good works before the earlier wrong is cancelled?

I know that he didn't gas 6 million people but letting people buy their way out of moral debt with a fraction of the money they gained still seems horribly repugnant.

quaristice · on April 24, 2016

First, what moral debt? Second, I imagine the sum total of his humanitarian efforts are greater than the total charity if all of those dollars remained in the pockets of each person who bought windows 95 et al. So repugnant seems like a real stretch.

yeukhon · on April 23, 2016

They mentioned 5M views within 24 hours of Michael Jackson's death. With over 3B Internet users out there, I am actually a little surprised how small the spike was. Did they only count English Wikipedia? Even so I am quite surprised. I would expect 10-20M at least. Similarly, many young people like myself have never heard of Prince, I had to look him up to find out who he truly was.

cooper12 · on April 24, 2016

They recently overhauled it, [0] but back then the pageviews [1] wouldn't count mobile users. You can see the old stats for Jackson's page here: http://stats.grok.se/en/200906/Michael%20Jackson. The actual number is 5,875,404 views within that day (in whichever time zone) and is for the English version of the article specifically.

[0]: https://blog.wikimedia.org/2015/12/14/pageview-data-easily-a... [1]: https://en.wikipedia.org/wiki/Wikipedia:Pageview_statistics

yeukhon · on April 24, 2016

Thanks. Yeah, I think mobile viewers would be a substantial amount, but the desktop user count is still below my personal expectation. We are talking about 850M English-speaking Internet users :( only 6M page views within 24 hours is really quite low.

SmellyGeekBoy · on April 24, 2016

As an "old" person (I'm 32) it blows my mind that younger generations might not be familiar with Prince's work. Unfortunately these "damn, I'm old!" moments keep cropping up more and more just lately. ;)

sparkzilla · on April 23, 2016

Just out of interest, did you go directly to Wikipedia to find the information, or did you go to Google, which then led you to Wikipedia?

yeukhon · on April 24, 2016

Always Google, which leads me to Wikipedia. Almost literally every time...

rompic · on April 24, 2016

this is what was going on on twitter at this time: http://tweetfortat.net/timeploteventsTWAPPERKEEPERjacko.php5

cmdrfred · on April 23, 2016

What happened at 7:15?

teamhappy · on April 23, 2016

End of a news broadcast maybe (end of the 10 o'clock news on the west coast would work I think)

eterm · on April 23, 2016

Given how short the spike is, I wonder if it's possibly due to a misconfiguration somewhere and a TZ offset is either getting misapplied or mis-corrected for?

EugeneOZ · on April 23, 2016

Even at peak it's just ~800 hits per second - it shows how irrelevant the C10k problem is (yes, I know it's not exactly about hits per second, but still).

jsmthrowaway · on April 23, 2016

The C10k problem is absolutely not irrelevant, and certainly not because one page only saw 800rps during a worldwide event. Two things there:

(1) 800rps TO THAT PAGE is the metric. The entire rest of Wikipedia was still getting traffic, and as an educated guess I would estimate raw traffic to be on the order of magnitude of 3-4krps (across editing and views). They are quite open with operations and if I weren't mobile I could probably find the accurate answer.

(2) There are much higher traffic properties. I'm aware of one property beyond 200krps in aggregate.

If you had said "most people don't have to worry about C10k," then I'm on board. That's true. Irrelevant? Far from it.

And yes, query rate and connection count have a complicated relationship. You need three or four other metrics to explain their relationship, but raw query rate is a good yardstick for active connections when combined with quantiled request latency. (Not averaged.) Simple example: a 750ms 95th% page hit 10,000 times per second is almost certainly far > C10k because of the outliers.
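One way to put numbers on the rate-versus-connections relationship is Little's law: the average number of requests in flight equals the arrival rate times the mean time each request spends in the system. The sketch below is only illustrative. The function name and the mean latencies are assumptions, and the law needs the mean rather than the 95th percentile quoted above, which is exactly where slow outliers pull the figure up.

```python
def in_flight_requests(rate_per_second, mean_latency_seconds):
    """Little's law: average concurrency = arrival rate x mean time in system."""
    return rate_per_second * mean_latency_seconds


# Hypothetical figures for a page hit 10,000 times per second:
fast_mean = in_flight_requests(10_000, 0.25)   # mean latency of 250 ms
slow_mean = in_flight_requests(10_000, 1.5)    # mean dragged up to 1.5 s by outliers
print(fast_mean, slow_mean)
```

At the same request rate, the second case holds six times as many requests open at once, which is why request rate alone says little about connection count.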

Now I will grant that C10k itself is somewhat irrelevant, yes, but not for the reason you are saying. It was defined in an age when 10,000 active connections was pretty surprising (didn't it come from FTP or some other heavy eyeball protocol?). These days with long poll apps, long-running protocols, and so on, millions of open connections are quite common at consumer scale. I find C1M far more interesting these days. C10M is still kinda nuts, but does exist in the magical world of metal and fiber and hot aisles and all those great things that nobody uses anymore (depressingly).

yuvipanda · on April 24, 2016

https://grafana.wikimedia.org/dashboard/db/varnish-http-erro... (and grafana.wikimedia.org in general) has more stats. It stated "13.38 Million req/min", which I think is ~223,000 req/s (13.38 million divided by 60).

jsmthrowaway · on April 24, 2016

Ah. So my hunch was that the published number was what is making it through cache, and that's where my estimate comes from too. That sounds about what I'd expect for the cached side.

Nice find!

_joe · on April 24, 2016

The total number of requests that get through the cache to the application layer can be seen here

https://ganglia.wikimedia.org/latest/stacked.php?m=ap_rps&c=...

which you can see is not showing any substantial increase due to the passing of Prince. The big hole of two days that ends just before the news broke is due to wikimedia switching traffic to a second datacenter for two days, see http://blog.wikimedia.org/2016/04/18/wikimedia-server-switch...

EugeneOZ · on April 24, 2016

The 10k connections case is related to the 10k hits case, because 10k hits is 10k connections, just not long-lived ones. What I tried to say was: "if even such a popular Wikipedia page doesn't have a 10k hits problem, then for 99.99% of projects it will be a corner case, not everyday routine". Sorry if it was worded badly.

tim333 · on April 23, 2016

C10k doesn't come up with Wikipedia because it doesn't do websockets. If it did it would probably have millions of concurrent connections, far over 10k. I dare say 10k is a bit passe. Whatsapp can apparently do 2m/server.

Cyph0n · on April 23, 2016

2m per second? These guys must have squeezed every ounce of performance out of Beam.

js2 · on April 23, 2016

2m refers to concurrent connections, not rate.

Cyph0n · on April 23, 2016

Ah, that makes sense. Still, is 2m rps within the realm of possibility for BEAM with a high-end server grade processor and heavy use of actors/concurrency?

RobertKerans · on April 24, 2016

The Phoenix framework tests in October got up to 2m concurrent active (well, mostly just awake) connections, without timeouts.

http://www.phoenixframework.org/blog/the-road-to-2-million-w...

xrstf · on April 24, 2016

Finally a replacement for "Site got slashdotted": "Site got Prince'd". I like it.

tgb · on April 23, 2016

"He was ... known for, among many other things, ... a performance at Super Bowl XLI in a raining downpour in front of over a hundred million people."

Typo and/or I call bullshit.

Amorymeltzer · on April 23, 2016

This[1] seems to indicate viewership of the halftime show peaked at 140M.

1: https://web.archive.org/web/20090412054158/http://www.suntim...

tgb · on April 23, 2016

Okay, I can see where you're coming from but the sentence was "raining downpour in front of over a hundred million people." How does that NOT imply a live audience of a hundred million? I can't believe people are downvoting me for this. The sentence is absurd as written.

jessriedel · on April 23, 2016

Downvotes aren't punishment. Up/down votes indicate "this comment ought to be more/less prominent on the page". The sentence was somewhat ambiguous and your misunderstanding is understandable. But your comment is not useful to the discussion because most people understood and in any case it's tangential.

tgb · on April 23, 2016

Okay honestly am I the only one who reads "in front of a hundred million people" and thinks it actually means he was in front of a hundred million people? I don't see it as an ambiguity, it's simply a wrong statement. I'm okay with people being uninterested in (and even downvoting) a correction, but I'm baffled that anyone could read the sentence as anything other than incorrect. I mean, they even describe the weather of the event as if to make it sound even more impressive that all these people showed up!

kevan · on April 24, 2016

I think they mentioned the weather because it's relevant to his reputation as a legendary live performer and it's part of a great story. [1] There's something almost magical about incredible live performances. I can't really describe it, but sometimes a show has the perfect mix of emotion and artistry and it's completely mesmerizing to be a part of, even as a spectator. Participating in that moment bonds millions of people together in a small way. Even if they have nothing else in common, they were able to experience that together. Prince's Superbowl performance was one of those moments.

[1] http://www.maxim.com/entertainment/prince-2007-super-bowl-pe...

PhasmaFelis · on April 24, 2016

It's a poorly written sentence, but it's also fairly clear what they meant to say. It's polite to give people the benefit of the doubt.

The_ed17 · on April 24, 2016

I've reworded the sentence in the post—does it work better that way? Many thanks for the feedback, everyone. :-)

tgb · on April 24, 2016

Thanks, I really do appreciate it. The thought that it could have meant televised audience never crossed my mind and surely wouldn't have for some other portion of the audience as well. I was out googling for largest concert size records to make sure I wasn't crazy.

pasquinelli · on April 24, 2016

during the set he performed purple rain, probably his most famous song, and it was pouring down rain. that's why the weather is notable.

a hundred million people in one venue is certainly an unbelievable number though.

MrBra · on April 23, 2016

Maybe because it's not the best time to nitpick about that?

MrBra · on April 24, 2016

ahah, always keep a scientist mind, right? that's why you downvoted :D keep it up, scientist :)

Karunamon · on April 23, 2016

That's actually spot on if you count the TV viewing audience - 112M people as of the 2014 event.

dtparr · on April 23, 2016

That presumably includes the television audience.

gobakhan · on April 23, 2016

I would assume they are including the tv audience as well.