How we made editing Wikipedia twice as fast (wikimedia.org)
339 points by ecaron on Dec 30, 2014 | 132 comments



Great write-up. This will be a good point of reference for future debates concerning the value of selecting high-performance platforms for web applications. A common refrain among advocates of slower platforms is that computational performance does not matter because applications are invariably busy waiting on external systems such as databases. While that may be true in some cases, database query performance is often only a small piece of the overall performance puzzle. Blaming external systems is a too-convenient umbrella for avoiding profiling an application to discover where it's actually spending time. What do you do once you've made the external systems fast and your application is still squandering a hundred milliseconds in the database driver or ORM, fifty milliseconds in a low-performance request router, and another 250 milliseconds in a slow templater or JSON serializer?

Yes, three seconds for a page render is still uncomfortably slow, but it's substantially faster than the original implementation, and unsurprisingly frees up CPU for additional concurrent requests. It's a shame Wikimedia didn't have this platform available to them earlier.

Today web developers have many high-performance platform options that offer moderate to good developer efficiency. Those who use low-performance platforms may do their future selves a service by evaluating (comfortable) alternatives when embarking on new projects.


> This will be a good point of reference for future debates concerning the value of selecting high-performance platforms for web applications.

Keep in mind, Wikipedia is at a scale that most web applications will never, ever get to.

It's important to think about performance and scale, but it's not the only important trade-off engineers should concern themselves with.


The win here wasn't about scale, though. A tiny fraction of wikipedia users are logged in.

The win here was about individual page load time. And page load time is just as important, if not more so, for something new trying to vigorously grow as it is for the big sites.

(Disclaimer: HHVM alum.)


Yeah, this. We helped my boss' brother out with his wordpress ecommerce site.

His machine could easily handle the traffic thrown at it, but the page load time was slow. 2 seconds at best, with all machines involved being almost 100% idle. With various common tweaks such as caching, we got it down to about 800ms. We eventually replaced it with our own solution and got it to 50ms.

Scale never entered the picture because from our testing, we had the machinery in place to handle enough traffic to be wildly profitable. The user experience at 2 second page load times was vastly different from the user experience at 50ms though.


Could you describe what your own solution was?


It wasn't anything special. We used Java because it's what we knew, with a mix of spring and some home grown framework stuff that we've developed over the last 10 years.

MySQL over Postgres because that's what we had experience with at the time. Redis as both a cache and ephemeral store.


The win is _also_ about scale. The server CPU usage went down significantly - see http://bit.ly/1vqp1ki

So more requests can be served with a smaller number of servers.


Just curious, why did you use bit.ly there?



... just because I have that saved to use on IRC or anywhere else (like twitter) where a very long url would be inconvenient.

(I should've put a disclaimer there: I am one of the WMF engineers involved in the project).


Fyi, if you add a + to the end of any bitly URL, you'll see stats about the link. (Logged in users can set it to private but most times they don't)


On the other hand I find that the problems where performance is relevant are the problems that are technically interesting and rewarding to solve.

It's also not necessarily about scale, as in the number of users one needs to serve. There are lots and lots of problems where performance is very relevant: startups that can't afford to scale horizontally, projects in an industrial setting that have to be rock solid, real-time bidding (RTB) platforms, games in which dropped frames are a disaster, and the platforms on top of which we build stuff.

Quite the contrary: I believe too many software developers are focused on building front-ends to a database, and not necessarily because those problems address real needs.


> Keep in mind, Wikipedia is at a scale that most web applications will never, ever get to.

As I read the article I thought exactly this. Scaling is a nice problem to have. I keep playing around with Nginx, elasticsearch, and other tech that's supposed to help PHP's performance problem, but from personal experience, missing database indexes for complex joins have been the issue every time. I'm unlikely to get to use the tech I play with unless I switch jobs.

But back to the point, this is a great writeup, and I think a big step forward for both PHP and HHVM


The win here comes from the fact that the work being sped up is hugely CPU-bound.

This will make tiny MediaWiki installs just as much faster, too.


Wikipedia faces unique challenges, however, that many web applications do not. Almost the entire rendered page is user-generated content produced from hand-written wiki markup that must be parsed and rendered, and that uses complex nested templates. MediaWiki is basically a Turing-complete application platform. The Wikipedia page for Barack Obama transcludes 199[0] unique templates, with 585 total non-unique invocations, and that's not counting the templates transcluded by those templates. Some of Wikipedia's templates are so complex that they had to be rewritten in Lua.

I don't think Wikipedia is really a typical case. Most websites probably don't have the CPU burden Wikipedia faces.

[0]https://en.wikipedia.org/w/index.php?title=Barack_Obama&acti...


> MediaWiki is basically a Turing-complete application platform.

Perhaps MediaWiki should figure out how to use a weaker form of computation?


Oh, the Turing-completeness is new; it wasn't there in the past. Lua was adopted to improve template performance.


Actually, it's been around for years - it was accidental. ParserFunctions is a good example of a DSL accidentally achieving Turing-completeness, at which point people start doing awful, awful things in it because they can. So they went to Lua because a proper programming language is superior in every way to an accidental one.


You actually have to be really careful to design a language that's both useful and not Turing-complete. Regular expressions are one example of such a language, but they are not strong enough for templating.


Yeah. It wasn't actually intended to be Turing-complete, but then some excessively creative goddamn lunatic came up with a way to beat MediaWiki syntax with ParserFunctions into yielding a truly hideously inefficient "if" statement. Soooo they gave up and codified that so the CPUs wouldn't melt.


I'm afraid the way that markup has been used is exactly the way its use has been encouraged.


For our rather heavy Magento installation, we went from up to 6 sec loading times to about 600ms with HHVM without any fancy full page caching solutions. Thankfully Facebook decided to do something about it. It saved us.


Which version of Magento? This is something I'm suspecting we could really do with. (We're running 1.14 Enterprise.)


Six seconds? What were you doing that was so time-consuming?


Is the Magento codebase fully compatible with HHVM?


The HHVM team runs the test suites of popular PHP repositories from GitHub against HHVM; the results are available here: http://hhvm.com/frameworks/

According to that, 99% of Magento 2's tests pass.


I've been very happy with node as a platform for 3+ years now... though it tends to require a mindset of designing for scale rather than building a monolith: breaking workers out from services and spreading the data out.

I'd be more interested to see what changes, if any, they have made to their other server components. Are they using Nginx or Apache? What kinds of caching systems are they using? What is their database layout? Is their data sharded? These things, I think, are much more interesting, and at their scale at least as impactful.


There are some bits and pieces on https://wikitech.wikimedia.org/wiki/Main_Page


Thanks for sharing... I'm kind of surprised they are using MariaDB over something like Cassandra.


We are actually working on a storage service called RESTBase (https://www.mediawiki.org/wiki/RESTBase). RESTBase has pluggable table storage backends, starting with Cassandra. Other backends will scale this down for small installations.

The medium-term goal is to store all revisions as HTML, so that views and HTML-based saves can become storage-bound operations with low double-digit ms latencies. This will mostly eliminate the remaining 2-4% of cache misses, and should get us closer to actual DC fail-over without completely cold caches.

There is still a large amount of work to be done until all the puzzle pieces for this are in place, but we are hard at work to make it a reality. A major one, the bidirectional conversion between wikitext and HTML in Parsoid (https://www.mediawiki.org/wiki/Parsoid), is already powering VisualEditor and a bunch of other projects (https://www.mediawiki.org/wiki/Parsoid/Users). Watch this space ;)


Thanks for letting me know... I had some exposure to C* while at GoDaddy, and it would seem to be a great fit for what you are talking about...

GD is using it to store user-generated content, and it works very well, most responses (lookup and processing) were sub 20ms consistently under heavy load... (I'm hoping they write a DNS service that does similar).


They started with MySQL. Replacing that with MariaDB is a whole lot simpler (not to mention, less risky!) than replacing it with something completely different like Cassandra.


They started with PostgreSQL iirc, only later migrated to MySQL.


MariaDB is MySQL-compatible. Had they gone with Cassandra, they would have had to replace the PHP-based MediaWiki software with something else. So not only the database but the entire software stack would have had to be replaced as well.

Also, speaking from a few years' personal experience with HBase and Cassandra, support for non-Java languages on these two NoSQL databases is limited (though it is getting better).


I worked on a project that used C* with node.js and it worked out very well... cluster of node servers, with minimal processing over a cluster of C* servers... very fast response times under some really big load.

Though node.js tends to be very flexible in terms of wrapping a friendlier interface around a less friendly one.


Why? The vast majority of the requests they serve are cached at multiple layers above the DB. Switching to an entirely different DB would require rewriting a lot of code, would raise the learning curve for contributors (MediaWiki is FOSS), and would limit its use to large sites (MediaWiki is widely used by small sites).


I'm not familiar enough with their data structure to comment, but the concern would seem to be write performance... if they have to use sharding in an RDBMS like MySQL/MariaDB, then it's not too much harder to change databases at that point... not to mention improvements with distributed reads on a cache miss...


NoSQL databases were still in their infancy when Wikipedia was originally developed.


I understand that... but would have assumed they would have changed over time. Plenty of platforms have migrated backends to handle much larger load (twitter, facebook, etc, etc).


Wikimedia's resources are over two orders of magnitude smaller than those of the other top-10 web sites. It is only fairly recently that they have had more than a handful of paid developers. And they have a huge amount of data they would have to convert. In fact, I think they would have been up to it, but there simply were more pressing concerns.


I would also like to hear more about their infrastructure.

I do know they use Varnish as their cache server.

https://wikitech.wikimedia.org/wiki/Varnish


Several of the previous comments here have quoted a key, interesting fact from the submitted article: "Between 2-4% of requests can’t be served via our caches, and there are users who always need to be served by our main (uncached) application servers. This includes anyone who logs into an account, as they see a customized version."

That's my experience when I view Wikipedia. I am a Wikipedian who has been editing fairly actively this year, and I almost always view Wikipedia as a logged-in Wikipedian. I see the Wikimedia Foundation tracks the relevant statistics very closely and has devoted a lot of thought to improving the experience of people editing Wikipedia pages. I can't say that I've noticed any particular improvement in speediness from where I edit, and I have definitely seen some EXTREMELY long lags in edits being committed just in the past month, but maybe things would have been much worse if the technical changes this year described in this interesting article had not been made.

From where I sit at my keyboard, I still think the most important thing to do to change the user experience for Wikipedia editors is to change the editing culture a lot more, to emphasize collaboration in using reliable sources over edit-warring around fine points of Wikipedia tradition from the first decade of Wikipedia. But maybe I feel that way because I have worked as an editor in governmental, commercial, and academic editorial offices, so I've seen how grown-ups do editing. I think the Wikimedia Foundation is working on the issue of editing culture on Wikipedia too, but fixing that will be harder than fixing the technological problems of editing a huge wiki at scale. Human behavior is usually a tougher problem to solve than the scalability of software.

By the way, the article illustrates the role that for-profit corporations like Facebook can play in raising technical standards for everybody through direct assistance to nonprofit organizations like the Wikimedia Foundation that run large websites. That's a win-win for all of us users.


<< From where I sit at my keyboard, I still think the most important thing to do to change the user experience for Wikipedia editors is to change the editing culture a lot more, to emphasize collaboration in using reliable sources over edit-warring around fine points of Wikipedia tradition from the first decade of Wikipedia. >>

Completely agree with you. Also agree that fixing culture is often harder than fixing technological scaling problems!

All that said, kudos to Wikimedia Foundation for addressing the speed issues for uncached pages. Great work!


> and there are users who always need to be served by our main (uncached) application servers. This includes anyone who logs into an account, as they see a customized version of Wikipedia pages that can’t be served as a static cached copy

I keep hearing this, but it isn't true anymore. For something like Wikipedia, even when I'm logged in, 95% of the content is the same for everyone (the article body). You can still cache that on an edge server, and then use JavaScript to fill in the customizations afterwards. This gets you two wins: 1) the thing the person is most likely interested in (the article) loads quickly, and 2) your servers see a drastically reduced load because most of the content still comes from the cache.

The trade-off, of course, is complexity. Testing a split-cache setup is definitely harder and more time-consuming, as is developing for it. But given Wikipedia's page views, it would be totally worth it.
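Roughly, in TypeScript; the endpoint and element names here are made up for illustration, not how Wikipedia actually structures this:

    // Runs in the browser after the statically cached article HTML has rendered.
    // '/api/personal-bar' and 'personal-bar' are hypothetical names.
    async function applyPersonalization(): Promise<void> {
      try {
        const res = await fetch('/api/personal-bar', { credentials: 'same-origin' });
        if (!res.ok) return; // anonymous user or API error: keep the cached chrome
        const user: { name: string; notifications: number } = await res.json();

        const bar = document.getElementById('personal-bar');
        if (bar) {
          bar.textContent = `${user.name} (${user.notifications} notifications)`;
        }
      } catch {
        // Network failure: the cached, anonymous page is still fully usable.
      }
    }

    document.addEventListener('DOMContentLoaded', () => {
      void applyPersonalization();
    });

The cached article renders immediately from the edge, and only this small request has to reach the application servers.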


You don't necessarily need a split cache. Just serve some extra JavaScript to all users that only performs the extra work if a cookie is present.

(Also make sure the JavaScript sets a cookie, so that Wikipedia can fall back to the non-cached path if JavaScript isn't enabled.)
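Something like this, with hypothetical cookie names:

    // Only do the extra personalization work when a session cookie is present.
    // 'wikiSession' and 'jsCapable' are made-up cookie names.
    function hasCookie(name: string): boolean {
      return document.cookie.split('; ').some(c => c.startsWith(name + '='));
    }

    // Tell the server this client runs JavaScript, so it can keep serving the
    // cached page and rely on the client to fill in the dynamic bits.
    document.cookie = 'jsCapable=1; path=/; max-age=2592000';

    if (hasCookie('wikiSession')) {
      // Logged-in user: kick off the extra personalization request here.
    }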


You don't even need to use JavaScript - you can do it in the cache with edge-side includes:

https://www.varnish-cache.org/docs/3.0/tutorial/esi.html


After using ESI with Varnish for a project, I'd never do it again. Serving cached static pages with JavaScript that pulls in the dynamic parts is an easier solution to maintain and has better failure modes.


Interesting! Could you tell me more about the problems you found with ESI?


My offhand guess is that they want editors to always see the current version of the page whereas caches may be stale.


Yeah LiveJournal was doing that with memcached like 10 years ago.


Worth mentioning: phpng (PHP 7) has cut CPU time in half over the past year [1] (scroll down); I don't know what the memory situation is. I don't know whether HHVM has additional advantages over plain PHP, but the list of major benefits will certainly be smaller by the next major version.

[1] https://wiki.php.net/phpng


The main advantage is that HHVM is available now; PHPNG won't be stable until later this year at best. My code base isn't compatible with HHVM, so we're eagerly awaiting the stable release of NG, but damned if we didn't try to upgrade to HHVM.


The advantage of PHPNG is that it has a dedicated development team who will not go away when Facebook goes the way of Myspace. And, yes, I know it is open source and development can continue. My point still stands.


I once thought that Facebook would 'go the way of Myspace' too. But then I attended a USENIX LISA conference and I sat in on some of the Facebook presentations. Initially, I was dismissive of them (I'm too old to be in the social movement), but after seeing the advancements that Facebook are making with Open Compute and looking over some of their source code, I was very impressed. They had the best presentations out of everyone. It was all very surprising to me. And last but not least, the caliber and reputation of their technical staff is as good or better than the other big players. I came away with a much deeper respect for them.


This is cute wishful thinking, but what exactly makes you think that whatever company is behind PHPNG will outlive friggin' Facebook?


PHPNG is actually the official next version of PHP. If the PHPNG team is dead that means no active development is being done on PHP (period).


If PHPNG dies they simply could elect a different environment to be the official one. Just because PHPNG is "the future" right now doesn't mean the fate of the language is tied to it forever.


It's all well and good labelling HHVM the new official environment, but a lot of existing code does not work on HHVM.


Facebook's ties with the NSA spying ring are what make me think that. Oh, that and showing users pictures of their dead relatives surrounded by festive themes around Christmas time.

EDIT: I didn't think I would need to mention the obvious reason nhtechie there mentioned. That was my whole point after all, for the ones who understood it before downvoting, but whatever.


You're being downvoted because it's extremely foolish to think that Facebook will collapse due to the dead people on it. My theory is that people don't even know where to start on a rebuttal for such an argument given how absurd it is, so instead, they downvote.


There won't be a stable php7 release until late 2015


node.js doesn't have a v1.0 release yet, but I still use it...

I think we're reaching a point where proper test coverage makes using an unblessed release in production acceptable.


That analogy is a bit flawed: both Node.js and PHP advertise their stable, production-ready versions on their homepages. Node doesn't advertise v0.11+ anywhere it could be misconstrued as production-ready, and neither does PHP advertise NG as such. While you may be ready to take such risks, I can assure you most companies are not.


> Between 2-4% of requests can’t be served via our caches, and there are users who always need to be served by our main (uncached) application servers. This includes anyone who logs into an account, as they see a customized version

I run a similar site (95% read-only), and have been pondering whether it would make sense to use something like Varnish's Edge Side Includes (Like SSI, combining cached static page parts and generated dynamic page parts) -- I wonder if they've considered that and what the results would be like?


I've played with this configuration but never put it into production:

Varnish (with ESI enabled) -> Nginx (with memcached module enabled) -> PHP-FPM.

Most of the page is served by Varnish, with the ESI directives hitting the Nginx server, which serves the fragments from memcached if present; otherwise the PHP-FPM server is hit, returns the result, and sets the fragment in memcached with a low TTL.


There is an old proposal to use it, but it seems that not much effort was put into it.

https://phabricator.wikimedia.org/T34618 https://www.mediawiki.org/wiki/Requests_for_comment/Partial_...

Personally I doubt that it would help much – there are hardly any non-dynamic parts of the page in MediaWiki. (Everything is customizable using server-side configuration, by modifying magical pages, or both; and it can also depend on user permissions or preferences.)


I've run into a similar case. For me, using ESI defeated the whole purpose of the Varnish implementation, because the dynamically generated portion of the page had to rebuild the user session and bootstrap the framework, which, while faster than doing it for the entire page, was still not as fast as I would have liked for delivering the content to the browser.

In the end I ended up placing special HTML tags where the ESI tags would have been placed, and then making AJAX calls (on dom ready) to swap out the static content with the dynamic user-specific content.

It worked well for our needs.
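A rough sketch of that pattern in TypeScript; the data-dynamic attribute and the /fragment/ endpoint are stand-ins for whatever the real markers and URLs were:

    // Replace placeholder elements in the cached page with user-specific fragments.
    // e.g. <div data-dynamic="user-menu"></div> marks where an ESI tag would have gone.
    async function hydratePlaceholders(): Promise<void> {
      const placeholders = document.querySelectorAll<HTMLElement>('[data-dynamic]');
      await Promise.all(Array.from(placeholders).map(async el => {
        const res = await fetch(`/fragment/${el.dataset.dynamic}`, { credentials: 'same-origin' });
        if (res.ok) {
          el.innerHTML = await res.text(); // swap the static placeholder for dynamic HTML
        }
      }));
    }

    document.addEventListener('DOMContentLoaded', () => {
      void hydratePlaceholders();
    });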




It's great to see such a significant improvement, but it goes to show just how limiting the CGI-era architecture really is.

A modern persistent web app running in Python/Java/Ruby/etc. can perform preparatory work at startup in order to optimize for runtime efficiency.

A CGI or PHP app has to recreate the world at the beginning of every request. (Solutions exist to cache bytecode compilation for PHP, but the model is still essentially that of CGI.) Once your framework becomes moderately complex, the slowdown is painful.
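For illustration, a minimal Node/TypeScript sketch of the persistent model; the compileTemplates step is just a stand-in for whatever expensive bootstrap a real framework does:

    import * as http from 'http';

    // Done once at process start in the persistent model; a CGI-style app
    // would repeat the equivalent bootstrap work on every single request.
    function compileTemplates(): Map<string, (title: string) => string> {
      const compiled = new Map<string, (title: string) => string>();
      compiled.set('page', title => `<html><body><h1>${title}</h1></body></html>`);
      return compiled;
    }

    const templates = compileTemplates(); // paid once per process, not per request

    http.createServer((req, res) => {
      // Per-request work only: render with the already-compiled template.
      const render = templates.get('page')!;
      res.writeHead(200, { 'Content-Type': 'text/html' });
      res.end(render(req.url ?? '/'));
    }).listen(8080);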


If you write it right, you should even be able to cache the state of the whole execution right until your code sees the first byte that depends on the user request or the environment. Sort of like a copy-on-write, but it's copy-on-read.


BTW, what's the relationship between Facebook and Zend? Is PHP now 'owned' by Facebook?


No, PHP is not owned by Facebook, and it wasn't owned by Zend either. Zend has contributed a lot to PHP development (the phpng effort is largely sponsored by Zend), but that doesn't mean it "owns" PHP. Nobody really owns PHP. Facebook owns HHVM and Hack (which I guess was one of the reasons they created them: with an organization that large, it makes sense to have a platform that is predictable for them, i.e. owned by them). Some people from Facebook also contribute to PHP, including those working on the HHVM/Hack team.


Also note that the Zend Engine, confusingly, isn't made by Zend, it's just a component of PHP. Zend Technologies, Inc. came after the Zend Engine, but both were created by the same duo (Zeev Suraski and Andi Gutmans).



Facebook is doing a lot of awesome open-source work. They already have Open Compute and many projects on GitHub - https://github.com/facebook - and they sent a developer to MediaWiki to help with the migration to HHVM. I hope Facebook keeps open-sourcing their internal projects (in addition to contributing to existing ones)!


> Between 2-4% of requests can’t be served via our caches, and there are users who always need to be served by our main (uncached) application servers. This includes anyone who logs into an account, as they see a customized version of Wikipedia pages that can’t be served as a static cached copy,

Is this why we get logged out every 30 days, to boost cache hits for users who rarely need to be logged in? (It seems like every time I want to make an edit I have to login again.)


Curious why they're using squid and not varnish for caching. Weird how they're progressive with PHP but still sticking with the antiquated squid.


I think Gabriel was referring to a historical moment -- Wikimedia does use Varnish now.

"We currently use Varnish for serving bits.wikimedia.org, upload.wikimedia.org, text of pages retrieved from WMF projects, and various miscellaneous web services. Nothing uses Squid."

-- https://wikitech.wikimedia.org/wiki/Varnish


They said they were using Squid back in 2004; that doesn't necessarily mean they still are.

And "progressive" with PHP but not Squid? They're both technologies from the same era; Facebook just made PHP more usable.


Modern PHP isn't the PHP of 2004, HHVM or not.


Honest question: what kind of benchmark do others use for a 'reasonable' response time? Of course it depends entirely on the use case (rendering a video is hard to do in 500ms), but for user-facing stuff? At my previous startup we tried to stay within 500ms. Not saying this isn't a great improvement, but to me 3s still sounds quite long? (Not saying it's easy to do quicker!)


There's a difference between requests that the user "expects" to take a long time, and those that can never be fast enough. For example, POSTs, credit card transactions, and things like Wikipedia edits generally have lengthy forms prior to the request, and the user can tolerate a correspondingly-lengthy response time. I prefer 2s as a target for anything like that, and rely on a queue for asynchronously processing anything that takes longer.

For GET requests, particularly those reached by clicking a link from elsewhere on the site, faster is better... Luckily, many of these types of requests can leverage a cache.


It depends completely on the application. It's also usually best to focus on the tail end -- I've always found alerts on 95th or 99th percentile latency useful and easy to decide on thresholds for (ask yourself -- how slow can I tolerate this being for 1% or 5% of users?)
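For example, a quick nearest-rank percentile check over a sample of request latencies; a sketch, not production monitoring code:

    // Nearest-rank percentile of a sample of request latencies, in milliseconds.
    function percentile(samples: number[], p: number): number {
      const sorted = [...samples].sort((a, b) => a - b);
      const rank = Math.ceil((p / 100) * sorted.length);
      return sorted[Math.max(0, rank - 1)];
    }

    const latenciesMs = [120, 95, 300, 80, 2200, 150, 110, 480, 98, 105];
    console.log('p95:', percentile(latenciesMs, 95), 'ms'); // how slow for the worst 5%?
    console.log('p99:', percentile(latenciesMs, 99), 'ms');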


= How to make it take 0.0 seconds to 1.0 seconds.

Save early, before the Editor actually presses save. Only commit the change if the Editor actually presses save.

This will improve the Editor experience by making saves faster at the expense of CPU time. Predict well enough, or do enough processing client-side, and you won't need extra server-side CPU.
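A sketch of the client-side half of that idea in TypeScript; the endpoints and element IDs here are assumptions, not MediaWiki's actual API:

    // Autosave the edit box to a draft endpoint while the Editor types;
    // the real commit only happens on an explicit "Save" click.
    // '/draft', '/save', '#wpTextbox1' and '#wpSave' are assumed names.
    let draftTimer: ReturnType<typeof setTimeout> | undefined;

    function queueDraftSave(text: string): void {
      clearTimeout(draftTimer);
      draftTimer = setTimeout(() => {
        void fetch('/draft', { method: 'PUT', body: text, credentials: 'same-origin' });
      }, 2000); // wait for a pause in typing before shipping the draft
    }

    const box = document.querySelector<HTMLTextAreaElement>('#wpTextbox1');
    const saveButton = document.querySelector('#wpSave');

    if (box && saveButton) {
      box.addEventListener('input', () => queueDraftSave(box.value));
      saveButton.addEventListener('click', () => {
        // Explicit save: promote the current text to a real, committed edit.
        void fetch('/save', { method: 'POST', body: box.value, credentials: 'same-origin' });
      });
    }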

What is more precious to you? The human Editors or some dumb pieces of silicon?


When you don't have much funding, both are extremely important, and when you can't buy more silicon, you either need to optimise the code or let the human wait.


Sure.

However, the method I mentioned can be done with zero extra server load if done correctly. I've done this before with other editors, and it definitely did not take a team 12 months to complete.

It is a full-stack optimisation, however, from the UI all the way through the DB and the rest of the stack. Now that they're using only 10% load, there's plenty of room to breathe.

Saving a draft of what someone may have spent hours typing into a crash-prone browser is good UI practice regardless.


Again, "full stack optimisations" cost money. If there isn't the money to throw more machines at the problem (which is often, though not always, cheaper than dev time), then there's not likely to be money for extensive dev time either.

Saving a draft is indeed good UI practice, but best practice is never "regardless" in the real world, particularly in cash/manpower strapped non-profits.


There was over a year of dev effort put into this optimisation by a team of people.

I think my optimisation would cost less time and effort, and give better results.

It would also waste less Editor time.

Mediawiki already allows drafts to be saved.


Here's a video of the author's presentation at Scale Conf on migrating Wikipedia to HHVM: http://www.dev-metal.com/migrating-wikipedia-hhvm-scale-conf...


Are the Wikipedia server and network performance graphs public?


yes, you can access them on https://wikitech.wikimedia.org/



[flagged]


> So why is Wikipedia still begging for money? http://newslines.org/blog/stop-giving-wikipedia-money/

Personally, I actually went and donated to Wikipedia after reading this article because I had no idea that they ran the site with so little funding.

Go hawk your clickbait somewhere else.


As unpopular as the views in the grandparent post may be, I believe it to be an abuse of flagging-powers to flag-censor it into invisibility.

The WMF fundraising efforts should be defensible without resorting to viewpoint-based censorship. Note also that a number of involved Wikipedians have reservations about fundraising appeals optimized for revenue rather than truthfulness, as collected by another active Wikipedia critic here:

https://www.quora.com/How-well-is-Wikipedias-fundraising-pro...


Your link has a lot of good points (and I say this as someone who donated a substantial amount of money to Wikipedia this year). However, the parent's article was just of generally low quality with poor arguments.


You are free to waste your money as you wish. Just remember that less than 5% of your donation will go to hosting, and even less will go to the people who actually created the content. http://www.theregister.co.uk/2013/10/08/wikipedia_foundation...

That said, you do not seem to have a good understanding of the word "clickbait" and seem to be using it as an insult against content you do not like. A primary feature of clickbait is that the headline does not directly say what the article is about. Example: "You won't believe what this huge website is doing with your donations!" or "These do-nothing programmers get paid with this one weird trick!" The objective is to incite the reader's curiosity so that they will click on something they normally wouldn't. As you can plainly see, the link to my article tells readers exactly what to expect before clicking.

FYI there are quite a few other recent articles about this topic on the web: http://www.theregister.co.uk/2014/12/01/penniless_and_desper...


Clickbait also includes scandal-mongering, sensationalism, or yellow journalism as a whole. The Register fits that category very well and I didn't even have to open those 2 links to know that they are rubbish.

Personally, I don't care how many millions the directors of Wikimedia sit on, because Wikipedia is possibly the most useful online service we have. And not only that, but all the content in it is licensed under open licenses that allow for reuse [1], which is very relevant in a climate in which copyright extensions keep being granted. I also prefer to see them asking for donations instead of serving ads, because ad companies sell my data to the highest bidder and I don't necessarily want other parties to find out which topics I'm interested in.

Do you care about the internal expenses or the revenue of every company you interact with? If Wikimedia introduced a subscription model, they'd earn much more than they do right now. They don't have a subscription model, they don't serve ads, they aren't locking in content, they don't sell my traffic - all they do is ask nicely for donations about once a year, which is totally optional; it's not like you have to pay up. I really don't get what's wrong with that.

[1] http://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_con...


You shouldn't write anyone a free license to collect as much money as they can by monetizing the Wikipedia brand. The money has to actually serve a need and be spent in a way that benefits the reader.

And there have been germane questions asked about that, even by Wikimedia's own outgoing director:

http://www.dailydot.com/business/sue-gardner-log-rolling-cor...

The reason most of the Wikimedia Foundation's expenditure goes on salaries these days is that they have vastly expanded their staff. Staff expansion isn't bad, but it shouldn't be an end in itself, and the staff hired should be qualified and deliver value for money. And here even Jimmy Wales admitted that their cost/benefit ratio needs a lot of improvement:

https://en.wikipedia.org/w/index.php?title=User_talk:Jimbo_W...

The main thing to bear in mind is that donations are not needed to keep Wikipedia online and ad-free, as the fundraising banners say. There is more than enough money for that; it's a very small part of the Foundation's expenditure today.

It used to be different:

https://www.youtube.com/watch?v=WQR0gx0QBZ4#t=275


> Just remember that less that 5% of your donation will go to hosting

I don't understand why you find this surprising. I think the people who work on the site deserve to be well paid for their work, just like anybody else. I think the outreach efforts are interesting as well.

> and even less will go the people who actually created the content

This argument makes little sense to me: nobody who contributes content to Wikipedia has any expectation of ever being paid for it.

> That said, you do not seem to have a good understanding of the word "clickbait"

Your definition of clickbait is far too narrow IMO. Your article takes some facts about Wikipedia's finances and presents them with very little context in a way that is intended to elicit an emotional response from people who aren't familiar with the scale on which they operate.

Most egregiously, you present $50MM as though it were some huge amount of money, when in reality that's pocket change compared to the financial resources of most entities that host and maintain websites like Wikipedia.

And if you think "Stop donating to Wikipedia" isn't an inflammatory headline, you're deluding yourself. You wrote this article to piss a bunch of people off and get publicity for your new site: that's textbook clickbait.


I just donated to Wikipedia after reading some of your inane comments.


This is, possibly literally, the shittiest article I have ever read, and you should be ashamed for linking to it.


> you should be ashamed for linking to it.

He wrote the article, and that's his site.


Tell me any part that is factually incorrect.


> What is the point of a site saying they don’t want to show ads, then covering up 50% of the screen with a request for money? No serious for-profit site would consider giving up 50% of the page to an ad. It’s insane.

First of all, it's for like 1 or 2 weeks a year or whatever their fundraising period is. The donation request is not up there all the time. It is similar to how NPR raises money.

Second of all, Wikipedia is not a "serious for-profit site". I have no idea why you are making this comparison. Lets take a look at a "serious for-profit site" like nytimes.com or cnn.com. Oh look, there are ads everywhere, taking up a huge portion of the page. And they are there every single time you visit the site.

Stop trying to push your inferior competing site (newslines.org) with trash like this.


Also, donations are presumably less likely to have strings attached.


They should try that. I won't send them a penny until they fix their deletionist problem.

"We're really proud we burn books because we only burn the ones we feel are bad" isn't a philosophy I can stomach funding.


I wish they would delete fewer things too but that's not really what I'm thinking about so much as organizations demanding favorable coverage because of their advertising dollars.


There are no facts to argue with. Aside from the numbers from Wikipedia, of course – but the only thing you do with those is pass opinion on them. It's just a mean-spirited, slanderous hack-job.

I didn't realise that you were the founder of Newslines. I'll tell you what – someone at Wikipedia must have pissed you off something fierce, for you to have such a weirdly twisted and poisonous view of it, quite aside from the laughable idea that Newslines is "Wikipedia's only direct competitor".


$46M isn't all that much money compared with other websites as popular as wikipedia.

You are of course free to stop using wikipedia if you can't handle the fact that they ask for donations every now and then.


That's an opinion, not a fact. In fact, staff costs have more than doubled over the last year, while page views are stable. You can easily see this from the KPMG report: http://upload.wikimedia.org/wikipedia/foundation/e/e3/FINAL_...


> In fact staff cost have more than doubled over the last year, while page views are stable.

From the SCFs on Page 3 of your link, staff costs in FY2013 were $15.98M and staff costs in FY2014 were $19.98M. This is only an increase of 25%. I'll take your word on the traffic stats but your assertion that staff cost has more than doubled is easily disproven.


Sorry, for my error. They have doubled over the past few years, while page views are static.


> while page views are static.

They're static?

http://stats.wikimedia.org/EN/TablesPageViewsMonthlyAllProje...

Sept 2014 - 23.1B views

Sept 2013 - 17.9B

Sept 2012 - 19.1B

Sept 2011 - 15.8B

Sept 2010 - 14.5B

Sept 2009 - 11.8B

Sept 2008 - 10.6B


>>> They can’t very well make money on a site that was created through the free labor of its contributors

Two incorrect things right here:

a) there are lots of sites that run ads on community content, some even charge membership fees on top of that

b) only content is created by unpaid contributors. Hardware, bandwidth, admin team, software team, etc. all cost money. The fact that the content is free doesn't mean running Wikipedia is free - it's far from it. That's like saying making a talk show is free if it doesn't pay for the interviews.

>>> Money is not an issue of survival for Wikipedia

It's incorrect again - the fact that Wikipedia has money doesn't mean it doesn't need it. Just as the fact that you have enough air to breathe right now doesn't mean you won't die very quickly if the air flow stops, Wikipedia would die too if the money flow stopped. Fortunately, it does not, which is a great thing, and that in no way means the money flow is not vital - just as the abundance of air doesn't mean air is not vital for you.

>>> is in “survival” mode.

Here you engage in distortion, making it sound like Wikipedia claims it is "in survival mode". The claim is that donations are necessary for continued survival, not that there is already a financial crisis now. In fact, since the kind contributors continue to contribute, there is no crisis. This is a good thing.

Moreover, the only reason this is the case is that, in the past, people mostly did not listen to people like you. So you build your case on the fact that almost nobody would join your cause.

>>> Wikipedia’s core software is essentially unchanged since 2001

This is wildly untrue, to the point that it betrays complete unfamiliarity with the platform and its development. Which is all the more puzzling, given that most of the information about it is not that hard to find if desired.

>>> Yet all of the money spent on programmer salaries has produced no measurable change to the site’s quality.

This again is false. The very article you're reading belies this claim, and there are many more improvements (including the whole mobile platform, which of course did not exist in 2001).

>>> These grants have been described as “corrupt” by the WMF’s ex-director Sue Gardner. who said,

This is an obvious lie. Sue Gardner said that the process doesn't provide enough protection against corrupt practices, not that the grants are corrupt. The difference is like that between "this lock is not strong enough" and "you are a thief". It is bewildering that you distort the quote in the very sentence in which you provide it and expect the reader to miss it.

>>> Your donations are going to golden chairs.

This is false. Ask somebody who has been in Wikimedia office if there are any golden chairs there.

>>> I guess parks and libraries would be a lot less popular if you had panhandlers at the doors.

You would prefer Wikipedia to tax you and have corrupt politicians distribute the budgets instead of direct voluntary donations? That's a strange mindset, preferring to be forced to do something rather than having the choice to do it of one's own free will - or not do it, and write an article full of distortions and inaccuracies if so inclined.

I could spend more time pointing out more incorrect statements and fallacies in the text, but I think this is enough.


From your article:

>What is the point of a site saying they don’t want to show ads, then covering up 50% of the screen with a request for money?

This is an article you seriously recommend reading?


Continuing the quote... "No serious for-profit site would consider giving up 50% of the page to an ad. It’s insane."

No, the content-to-ad ratio on the screen of a commercial site would never be 50%.

https://imgur.com/i74b7He


Yeah, that would be absurd.

http://i.imgur.com/wkXl52I.png


Is there a factual problem? See the enclosed screenshot where the begging banner covered 50% of my 23" screen.


There is a substantial difference between a "Please give us money" served from the same servers that you are accessing, and a third-party advertisement loaded with tracking beacons, malicious cookies (and often evercookies) from the evil advertising industry.

The former is annoying. The latter is annoying and invasive.


Seeing the content of the page shifted down by 50% so you can beg for money leads to me not donating, but tweaking my filter rules.

ABP users can kill that banner by toggling on the "Fanboy's Annoyances" list.


Yes, I've seen the banners myself. Asking users directly for money means you're not worried about what advertisers, or other investors, will think about the content on your site. https://pbs.twimg.com/media/B5L1CR0CEAEN78C.jpg <- Sorry for the text-in-a-picture but the original thread is dead.


The behavior of the WP adminship surrounding that particular controversy has revealed a serious problem with their policies that they do not seem to want to fix: consensus is regarded as gospel.

It is physically impossible for Wikipedia to factually report on something that is being misrepresented by the mass media, because mass media consensus is taken as the only thing that matters in establishing verifiability per their policies. This means that if the media is misrepresenting something, so will Wikipedia, and any attempts to correct the record will be seen as "original research" and deleted.

Say what you want about that particular controversy, but the problem is real either way.


gg GG no re


Still slower than https://slimwiki.com


Ward Cunningham, now there's a guy who should have played on an nfl team.


I feel like 3 seconds to load a page is still slow. I assume that is time to build a page that doesn't hit cache, and that Wikipedia is using something like varnish to cache pages most of the time.

Still, 3 seconds to load a page feels like a slow page and should be a lot faster.


Did you even read the article, or just skip to the comments when you saw the graph? He spends a whole section talking about Wikipedia's aggressive caching strategy, and how 96-98% of the pages are served from Squid cache servers.

Also, you didn't even scroll down to the next graph, where he shows the average page load time (for logged-in, not anonymous, users) as 800 ms (which is obviously larger than the average page load time when served from cache). The first graph was the average amount of time it takes to save a Wikipedia edit.


3 seconds is page saving time for editors. Since wikipedia is a read-heavy site, I would expect that to take longer than loading a page, as writes will always be more expensive and less optimised.

By comparison, uncached page load time appears to be ~800ms


Wikipedia only caches the latest version of each page. Older versions have to be rebuilt on the fly every time.


Hmm, this sounds remarkably like a possible DoS vector.


I would be surprised if the servers/instances responsible for rebuilding old pages had any other responsibilities. If that's the case, DoS-ing by flooding old page versions would likely only bring down that particular site feature.



