Introducing the Wikimedia Enterprise API (wikimedia.org)
175 points by dfjorque on March 16, 2021 | 85 comments



With all due respect to Wikipedia for what it is, I believe their success is partly because of the 'altruistic' nature of their model. Sure, they should seek donations from huge companies like Google - which make tons of money off of their data - for the services they provide, but I feel like locking down the 'better' api to the public is not the way to go about it. It's just too often that a commercial offering disincentivizes bettering the 'free' product.

Wikipedia as the product of a public good foundation should be just that; by the public, for the public, and accessible to the public (including all access methods and API's).


Wikipedian here. We edit because we're contributing to free information. Free information means anyone can use it for any purpose, including commercially.

I think this move is great. I'd rather have this money go to the foundation, than ParseAPIco.


> Free information means anyone can use it ...

Of course anyone can use it commercially and for whichever reason they like. But by creating a walled 'premium' offering you are going against the premise of 'anyone can use it' since not anyone can use the commercial api.

Surely if Google or anyone wants the api so badly they're willing to pay for it, they should fund it. Why does that mean it needs to be 'locked'?


Google and co will essentially pay for the SLA, free access will be available through Wikimedia Cloud Services and I'm sure that the team is investigating how to make the API available to the wider public.


Devil's advocate: Google clearly already has a working pipeline to import and format Wikipedia data for its needs. Why would they stop using it and start paying Wikipedia? Will Wikipedia be able to build an enterprise API that's faster/cheaper/more reliable/more scalable than the internal one built by one of the world’s top engineering companies?

No doubt the enterprise API will add attractive value for smaller companies without the resources to process the raw dumps but I’m skeptical that this will convert Google et al into well-paying customers. Unless they start restricting the free dumps...


While Google already have a pipeline in place the bottleneck of that pipeline is on the Wikimedia end of things, this project addresses that. No more scraping and data dumps when they can stream changes over gRPC. WMF didn't build this without consultation with these big companies.


What is the difference between a funder and a paying customer?

They are both users who are paying for the service.

The difference is that funders have political power over the project beyond the scope of their own use.

Customers are always better and more fair to everyone than funders.


A service that's available to the 'public' is called a 'public service' even though it is funded by 'funders'. But a service that's available only to 'customers' is by definition not a 'public' service.


I'm another Wikipedian and donor, and I'm pissed about this.

Creating an enterprise product with enterprise funding means Wikimedia is directing energy and resources to serve enterprise alone. These APIs were likely demanded by larger tech firms, and I imagine they discussed with Wikimedia the possibility of creating a commercial service. Now they got what they wanted, it won't be a public resource, and its design won't be guided by the public either.

Now the lines are very blurry. A wiser strategy would have been to create a separate private entity.


> A wiser strategy would have been to create a separate private entity.

Unless I am mistaken, seems like they did do that?

"the foundation’s newly created subsidiary, Wikimedia LLC"

https://www.wired.com/story/wikipedia-finally-asking-big-tec...


see you in a year, after you hear stories about how wikipedia had to redact information that was sensitive to their valued customers


I haven't looked at the details, but a free usage tier with relatively low limits so hobbyists can use it would be acceptable. If you're pulling a lot of data out of a project like wikipedia you would at least be paying to keep the servers running.


The point is that's not an option. Everyone should get the public tier. And if you tried to pull a lot of data, you'd get throttled.


Totally, and if a big company like Google does not want to be throttled or they feel a need for a more advanced api, they can donate to Wikimedia for the purpose of creating such an api which will be available to everyone publicly.


Yes exactly. Google should pay whatever would be required such that it would be free to everyone.


> The new API is an opt-in product, meaning that everyone (including those companies) can continue to use the current publicly-available tools at no cost and no restriction. The ability to freely access the knowledge across all Wikimedia projects remains unaffected–it is core to our mission.

Not sure what you have to worry about.


>Not sure what you have to worry about.

>From op: It's just too often that a commercial offering just disincentives bettering the 'free' product.

Instead of the public foundation focusing on bettering and modernizing the public option, they are, and will continue to be, putting their efforts into creating a better api to be used solely for private purposes (albeit with good intentions).

Sure, it's opt-in to use the 'better' api. But that comes at the expense of this better api not being available publicly. So going forward by default the public is always going to be getting the inferior api, saving the 'better' ones for commercial use. That doesn't seem very opt-in-like to me...


The 'better' API they describe is just an easier way to scrape data; think structured JSON. The regular public doesn't scrape data, they visit for info once in a while, and some people use any of dozens of other apps to export data for offline use. Wikipedia wants to charge the freeloading corporates some money, so instead of facing the backlash over making scraping paid, they chose to charge these mega corps in a different way and avoid the hassle of commercial vs. non-commercial BS. If they don't provide a fancy API, and the paid one is no different from the free routes, corporates would rather use the free stuff than pay for the API.


The altruistic model seems to be doing not that well. The number of editors peaked in 2007 and has been going down since.

https://en.wikipedia.org/wiki/Wikipedia:Wikipedians#Number_o...


Going by the complaints people on this site have about editing wikipedia, it's not funding that's driving people away, it's the Stack-Overflow-like situation where certain editors are extremely quick to delete content and resist the submission of new content.


Is it possible that this effect is due to a lot of "low-hanging fruit" articles being essentially complete? Back in '07 there were a lot more empty pages to fill on topics of broad interest.



This is being discussed at https://en.wikipedia.org/wiki/User_talk:Guy_Macon/Wikipedia_... New Developments

I am pretty sure that they would welcome comments from Hacker News users on that page.


I can't quite follow your point. Yes, donations continue to grow. Are you trying to rebut him?


This thread (or rather, this whole topic) is more about funding/business model than anything else, yet gp points to the slight decline in the number of active editors that is quite easily explained by other factors (see other replies) as evidence that Wikipedia is “not doing that well” (certainly not financially, so in what way exactly and how is that relevant?). Now that’s something hard to follow.


The thread is about whether a for-profit service for Wikipedia is a good idea. The healthy finances show this is completely unnecessary for keeping the lights on, while the "slight" decline in editors (a huge drop relative to where it should be with continued exponential growth) shows that the health of the community is much more concerning. That suggests that the decision should be informed much more by its impact on community health than its impact on finances. The more Wikipedia becomes an organization monetizing the work of volunteers, with that money spent on cancerous overhead while the contributor tools languish for decades without meaningful improvement, the fewer people will want to volunteer.


Maybe people have moved on to wikia (now called fandom) where you can focus on a specific series or topic. The number of users seems to heavily favor fandom too.


And it's a big loss to the Internet, because unlike Wikipedia, there's AFAIK no easy way to download all the content despite most of it being under the same license, and because unlike Wikipedia, site functionality is significantly impaired unless you enable their obnoxious Javascript.


Maybe I should listen instead of problem solving, but....

Maybe they could alternate their fundraising banners with "contribute to an article" campaigns?


Over time, I believe more public non-profit sites will introduce this. Then for-profit sites. Until Google eventually pays for most of the valuable content it gets today for free.

I own multiple sites where I and my users work to produce valuable data (e.g “so so company reviews”, “Is tenet on Disney” and other data of that kind). And what does Google do? Scrape it all and display it on their page. As a result, the page links get millions of impressions but tens of clicks. Thus, the sites cannot be monetized. Any reasonable person knows this can’t go on for long before the free and open web comes crashing down or Google (and others like it) pays its due.


If Google scraping your sites is a bad thing, you want to set "nosnippet" tags on your page [0].

If Google scraping your sites is a good thing, then why are you complaining?

I hope Google never starts paying for the links. Once there is a precedent, this becomes an effective blocker for the new search engines, visualizers, and other exciting web search startups. A new search engine startup is not going to be able to establish a commercial relationship with every site on the web like Google could.

[0] https://developers.google.com/search/docs/advanced/appearanc...
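
For anyone wondering what setting "nosnippet" looks like in practice, here is a minimal sketch. It assumes the two documented forms of the directive, a robots meta tag in the page and an X-Robots-Tag HTTP response header; the helper names are made up for illustration and whether a given crawler honors the directive is up to that crawler.

```python
# Sketch only: two ways a site might signal "nosnippet" to crawlers
# that choose to honor it. Function names are hypothetical.

def nosnippet_meta_tag() -> str:
    """Return an HTML tag intended for the <head> of a page."""
    return '<meta name="robots" content="nosnippet">'


def nosnippet_headers() -> dict:
    """Return an HTTP response header carrying the same directive."""
    return {"X-Robots-Tag": "nosnippet"}


if __name__ == "__main__":
    print(nosnippet_meta_tag())
    print(nosnippet_headers())
```

The header form is handy when the same rule should apply to every response from a server, since it can be set once in the server configuration instead of in each page.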


The one issue I see with this is it is always Opt Out. I feel that google really should be lining up partners to opt-in. While I am sure there are reasons why Google believe they have the right (and a good case can be made), it always feels slightly entitled to just assume that people are OK with this being done to their content.

That being said, of all the sources, Wikipedia actively license their content in such a way that google are well within their rights to slurp it all down and serve it however they want.

Google is already effectively paying for links to news sites as part of the negotiations in Australia. And I agree that this will be a dampener on any competition, I think the era of "ask for forgiveness, rather than permission" needs to stop.


if you post information publicly on the internet, google is entitled to scrape it. you've opted in by publishing it.

if you want to specifically exclude one entity from accessing information that you've posted for anybody to see, i'm not sure how there's a way that could be "opt-in"


Google is entitled to scrape it, but are they entitled to display the content on their site, the results pages? Everything in the instant answers is content that deserves to be displayed on its creators page, along with whatever monetisation the creator chooses.


You could do this using a robots.txt file (assuming the scraper obeys it, of course).
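
A rough sketch of how a well-behaved scraper would interpret such a file, using Python's standard `urllib.robotparser`. The robots.txt content, the `can_crawl` helper, and the example.com URL are all made up for illustration; the file blocks one named crawler and allows everyone else.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: exclude one crawler, allow all others.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /

User-agent: *
Allow: /
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check whether user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("Googlebot", "SomeOtherBot"):
        print(agent, can_crawl(ROBOTS_TXT, agent, "https://example.com/page"))
```

Nothing here is enforced by the server; the check only matters if the scraper runs it and respects the answer.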


> And I agree that this will be a dampener on any competition, I think the era of "ask for forgiveness, rather than permission" needs to stop.

Does this mean that you think there should be less competition for Google?


I similarly require that producers of motion pictures say "nosteal" at some point in the opening credits otherwise I assume I am free to make copies of the film to share with the internet.


They do, don't you remember those FBI notices in the movies? https://mashable.com/2012/05/10/fbi-copyright-warnings/

And when you sign up for netflix or cable tv, there is an agreement you accept that you are not going to pirate.

Remember, the nosnippet does not have to be on every page -- you can put into robots.txt or HTTP header, so it is literally 1 line of configuration for most web servers.

Movie producers can only dream of stopping piracy that easily.


> They do, don't you remember those FBI notices in the movies?

Oh I'm sorry I don't have the ability to look for that, my system is only equipped to look for that specific string.

> And when you sign up for netflix or cable tv, there is an agreement you accept that you are not going to pirate.

Again my system doesn't read the TOS, does Google's?

> Remember, the nosnippet does not have to be on every page -- you can put into robots.txt or HTTP header, so it is literally 1 line of configuration for most web servers.

Remember they just have to add the string "nosteal" to the opening credits. That's a few minutes in final cut pro.

Also, if they forgot to add it or have some other issue I offer no public facing customer service whatsoever.


I think you are trying to claim that Google goes further than DVD or netflix, but this analogy is really not working for you.

DVDs have technological protection as well -- the CSS[0] system. So yes, if you don't want your movie to be pirated you need to explicitly enable this. This was probably harder than creating robots.txt too, there were NDAs and stuff involved.

The netflix requires logging in to access the content. If you add the same requirement, then Google is not going to take your snippets.

Unlike the string "nosteal", the robots.txt file is not a Google invention, it is as much part of the web standards as all other technologies.

If you want a website, you need a server which can support HTTP, HTML, CSS, links, robots.txt and so on. You can omit parts you don't need, but then you _may_ suffer the consequences -- without CSS your site will be ugly, and without robots.txt your site will be scraped by Google.

[0] https://en.wikipedia.org/wiki/Content_Scramble_System


The point is it doesn't matter how hard or how easy it is, Google has no entitlement to anyone else's labor or content and if they post content to their website in violation of copyright I don't think "he didn't say the magic word that stops us from stealing content" is a defence any reasonable judge should entertain.


> in violation of copyright ... defence any reasonable judge should entertain.

Now we are talking specifics! Are you implying that Google is violating the law? Given that the snippet showing has been going on for a long time and no one has sued Google for it yet, it does not seem to. Plus, there is the whole body of Fair Use law [0].

I personally love that I can take snippets from the random websites on the net, quote them in my posts, and not worry about copyright infringement. And if I can do this, why can't Google?

[0] https://ammori.org/2012/05/08/copyright-misunderstandings-an...


I would argue that the snippet is the thing of value being potentially abused, not the page.

So if I search for e.g. "specific breakdown of something something, in a unique breakdown format that only this website has", then the website owner has worked on, created unique/copyrighted material, and posted it on a page on their site, and Google just extracts that piece, then they might as well have "acquired" the right to host that piece of info on their search results "page".

Google "extracting" that crucial bit of info and essentially "hosting" it on their search results page could definitely be argued to be some sort of abuse of fair-use (and at this point - who is willing or big enough to take on Google on this to set a precedent? The EU, maybe? ). It's not like they're quoting a piece of a large text, they actively find the specific piece of juicy info that relates to your query and host it on their page instead of yours.


VHS/DVD's used to have these when they were around.


Movies are not publicly accessible. And they come with usage rights. If you don't publish your content for all, then define the usage properly.


> I own multiple sites where I and my users work to produce valuable data

How much do you pay your users for the content they generate?


Well nothing because as they said, they can’t monetize the site due to google snatching all the content :)


While Google can use Wikimedia for free, they do make financial contributions to the Wikimedia ecosystem. https://wikimediafoundation.org/news/2019/01/22/google-and-w...


I suspect this project was pushed by Google, to make importing wiki data to their knowledge graph more convenient for them.


Google already have their own knowledge graph that is much bigger than the Wikipedia graph, and they already scrape every Wikipedia page daily so they don't need a Wikipedia API.


For hot topics, a search engine wants to scrape every minute, not daily; now Wikimedia will provide them with such a feed.

Also, they need a team of engineers to support an infobox extractor; now this work will be done by Wikimedia.


wikipedia is a large input into their knowledge graph


You can actually download the whole wikipedia if you like.
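
To make that a bit more concrete, here is a minimal sketch of streaming pages out of a MediaWiki XML export with the standard library. The tiny in-memory sample stands in for a real dump; its element layout (`page`, `title`, `revision`, `text`) and namespace are assumptions modeled on the export format, and `iter_pages` is a made-up helper name.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for a real export file (assumed layout, not real data).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>Hello world</text></revision>
  </page>
  <page>
    <title>Another</title>
    <revision><text>More text</text></revision>
  </page>
</mediawiki>"""


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield (title, text) pairs without loading the whole file."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # free the finished page's children


if __name__ == "__main__":
    for page_title, page_text in iter_pages(io.BytesIO(SAMPLE)):
        print(page_title, len(page_text))
```

For a real compressed dump, the same function should work on a file object opened with `bz2.open(path, "rb")`, where `path` is wherever you saved the download.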


Great, just a shame it isn't more 'traditionally' transparent & democratised IMO. It claims no custom contracts, but it's enterprise 'contact our sales team' anyway, for example.

.proto on GitHub is nice, but no pricing, no public docs? This is probably great for Wikimedia coffers, but at the headline I hoped for new/improved Wikidata; instead it's.. different bordering on 'don't care'.


I guess the article is targeted towards the general public, not technical people

I found this page which has more technical details on what it actually is: https://www.mediawiki.org/wiki/Wikimedia_Enterprise

Also found out that it is open source: https://github.com/wikimedia/OKAPI


Those are linked from the article, (that's the .proto on GitHub I mentioned) but what're we going to do with that? (And why does everything have to be formatted like a wiki page..)

I mean, it's fine, I just got momentarily excited for something that the announcement isn't. I wanted to find a pricing page, free tier, API docs, etc. Like Wikidata but.. I don't want to say 'modernised', but made more accessible, and with APIs for higher level content like this rather than just rawer data.


I hope this is not the first step into a worrisome future.

It appears now that they are offering "read-only" access to existing data structured and packaged in a more convenient way.

How long before paying enterprises would like to be able to "update" content on a more efficient basis?

Perhaps Sony would like to add articles about movies that will be released soon, or as they are released? That is a pretty benign example.

Creating alerts that enterprises can subscribe to so that they will be informed if anyone adds any negative content would also be valuable.

These systems already exist in some manner, it would just make it more efficient and more common.


This just feels wrong, trying to push the new concept of open source that SV created in the past decade to the general public. Most wikipedians are still in early 00s idealism, and will push back against this "dual model" crap, and they will be right.


If this sets them on a self-sustaining path without having to rely on running highly conspicuous donation campaigns on Wikipedia, I think that's a wonderful thing.


Not so fast.

Imagine Wikimedia Enterprise becomes the #1 source of revenue for Wikimedia. Shortly after, people will see that Wikimedia is doing OK and become reluctant to open their wallets and donate.

Then, the top Wikimedia Enterprise customers will acquire leverage over Wikimedia and try to get Wikipedia curated to their convenience. Wikipedia articles will start being indistinguishable from advertisement.

Governments will intervene and want their share of influence too.

Top volunteers will start asking to be paid, many others will leave, some others will become critics of the project.

People will start being skeptical of Wikipedia because of their biased editorial line and then the project will be declared a failure, once everyone is angry and a beautiful project is torn apart by greed.


The Foundation has already considered this. https://meta.wikimedia.org/wiki/Wikimedia_Enterprise/FAQ#How...


Very cool. I hope Wikipedia sees success with this strategy!



Why would Google pay for this when they already crawled and are crawling whole Wikipedia and have complete index of it?

Better way for Wikipedia to earn extra revenue are affiliate links. A lot of people when they read and learn about some topic go to Amazon and buy a book about that topic. Wikipedia could embed book affiliate links and earn commission from book sales.


Affiliate links seem like a type of advert, or they at least share some of the arguments against implementing them on Wikipedia: https://en.wikipedia.org/wiki/Wikipedia:Funding_Wikipedia_th...


That sounds like a horribly perverse incentive for the world's main free and open source of information.


I mean, Wikipedia offers basic information about a topic; it's not like you are going to get deep insight unless you buy a book. And I bet Wikipedia has generated millions of book sales from, like I said, people who read an article on Wikipedia and then went to Amazon or Google to search for a book.


If Wikipedia starts to receive commission for selling books, the incentive changes from recommending the best book to recommending the one that pays them the most


It's not about "who pays the most" it's about the best book. Book authors compete against each other to write the best book and therefore generate more sales.

Imagine reading an article from Wikipedia and at the end of the article it says "If you want to learn more about this topic take a look at our recommended books" and link leads you to dedicated page which lists let's say top 20 most popular books about that topic.

It can even replace Goodreads in a sense that Wikipedia community can rate, review and recommend books to each other.


Perverse incentives for book authors too. You could create an industry generating books to cite on Wikipedia.


I do a lot of Wikipedia editing, and I'm fine with them monetizing an Enterprise API. But affiliate links are too close to advertising. IMHO one of the best features of Wikipedia is the lack of Ads, and if they start going down that route I'm outta there.


> Why would Google pay for this when they already crawled and are crawling whole Wikipedia and have complete index of it?

Google's not the only enterprise out there :) I believe Wikipedia's taxonomy is used by lots of people for ML purposes, for example.


But for ML you’d probably want to download the whole thing anyway, considering it’s only like 47GB. I doubt many people want to make a model on only soccer pages or something.


There’s an SLA involved in this service. Business people like SLAs. :)


> affiliate links

Sounds well-intentioned, but it would be immediately gamed by every unscrupulous entity and ruin Wikipedia.


Well, they are already a big donor:

https://wikimediafoundation.org/about/annualreport/2019-annu...

> Google Matching Gifts Program


If you work at Google, they will 1:1 match donations to virtually any non-profit, plus there's various charity drives where employees get to donate company money. So the match program can become a huge donor just off random Googlers donating.


The distinction doesn't really matter though, does it? Makes no difference if it's Googlers as opposed to Google itself.


Google has already set a precedent by agreeing to pay massive corporations for news. If they're willing to do that, why shouldn't they pay Wikimedia for all the content they use?


That will lead to many unexpected, possibly perverse, incentives.


Does anyone know how much the Wikimedia Enterprise API costs?


It's "enterprise", so you have to talk to a human and haggle. More info is located on the FAQ link in the article.


Are any details known yet about the format or structure of the API? I didn’t see anything in the article.



They didn't mention pricing, did they?



