Wikipedia and Internet Archive partner to fix 1M broken links on Wikipedia

jacquesm · on Oct 26, 2016

In the long term the internet archive will likely be the major supplier of references to Wikipedia. Webpages don't live forever, hopefully the internet archive does. It's an extremely valuable resource, the archive and wikipedia are amongst the most valuable digital assets we have.

danso · on Oct 26, 2016

I think a case could be made that the Archive should be the reference by default for citations -- that is, when a URL is cited, the Archive's snapshot API is triggered, and the archive URL is used as the direct link with the original URL being the "backup". Sites/pages change so much that it's more helpful to point the average user to what a page looked like when the reference was accessed and included. It may rob the original URL some traffic, but Wikipedia isn't meant to be a link-aggregator for external content.

The obvious roadblocks are:

- Websites, even content-focused ones, can be so complicated in terms of JS that the Archive might not capture it accurately.

- The Archive's policy is to honor robots.txt and other no-archive directives.

nothrabannosir · on Oct 26, 2016

It may rob the original URL some traffic,

That is an advantage. It reduces extraneous incentives to post links, which should be only about the information they provide and not about page views. At wikipedia's scale, and sensitivity to information purity, that's relevant.

See also the rel="nofollow" decision from a few years back.

eli · on Oct 27, 2016

It'd be a real shame if publishers decide to opt out of the archive to reclaim "lost" pageviews.

iotscale · on Oct 27, 2016

It's probably good that it's possible to opt-out of being archived but what annoys me is a useful domain changing hands and the new owners deciding to wipe its history.

thomasahle · on Oct 27, 2016

Surely that's not possible?? And there is no way to find the old archive, even under an alias?

lorenzhs · on Oct 27, 2016

It's not wiped (they don't delete anything) but if robots.txt forbids indexing then access to old archives is blocked. It can be quite aggravating, but at least the information is not lost---if robots.txt or archive.org's policy changes, it can be retrieved again.

JoshTriplett · on Oct 27, 2016

> - The Archive's policy is to honor robots.txt and other no-archive directives.

What bothers me is that they do so retroactively based on the current robots.txt, not the one contemporary with the archived content. So if a domain parker takes over a domain, and their robots.txt excludes everyone from every page (or everyone but Google), then archive.org no longer provides its archive of the old content.

jon-wood · on Oct 27, 2016

I'm not sure there's a good way to work around this - the alternative is to only respect the robots.txt as it was when the snapshot was taken, at which point once a confidential page is in the archive you can't (easily) remove it again.

ldjb · on Oct 27, 2016

Perhaps access to archived pages should only be blocked when the ia_archiver user agent specifically is denied in the robots.txt. That way they aren't inadvertently blocked by a generic robots.txt that denies everything (which sometimes occurs with parked domains), but there's still a way to deny the Wayback Machine if you really need to.

JoshTriplett · on Oct 27, 2016

This, exactly. Or, even better, use a separate user-agent ("ia_retroactive" or similar) for retroactive removals.

smsm42 · on Oct 27, 2016

Well, then everybody would claim the same right and you'd have to maintain full list of bots and keep it up to date. This doesn't sound scalable, * should still mean "everybody".

ldjb · on Oct 28, 2016

robots.txt only tells bots not to crawl the website. It doesn't say anything about indexing or archival of pages.

It would be perfectly possible to have the Wayback Machine respect the robots.txt and not have it crawl or archive any new pages, whilst making pages that have already been archived accessible unless a specific user agent has been denied.

Iv · on Oct 27, 2016

I would say that the default behavior should be to respect the robots.txt of the time of the snapshot and only revert archival of accidentally cached pages that were never intended to be public.

Sadly, it requires a bit of human intervention there.

lmm · on Oct 27, 2016

The internet archive wouldn't work if its crawler wasn't fully automated. You can't handle the whole internet in a way that requires human intervention.

smsm42 · on Oct 27, 2016

The problem for manual deletion via human intervention is that if you do it outside of robots.txt you then need to ensure the identity of the owner, which makes it much more complicated and costly.

smsm42 · on Oct 27, 2016

Maybe robots.txt could have clauses that say whether to apply entries retroactively. True, domain parkers could enable it, but I don't think most would, since it's extra work for no benefit - the point is usually not to erase history but to protect current site.

jpalomaki · on Oct 27, 2016

Since the number of pages referenced from Wikipedia is limited, the archiving could be also done by other parties. Maybe just setup a raspberry, xTB hard disk, download everything and then distribute over IPFS (or something else) (as backup in case Archive.org removes the pages).

To take into account Javascript etc, one could also capture a png snapshot of the page.

lucb1e · on Oct 27, 2016

That seems like an awful bug, is this by design?!

throwaway7767 · on Oct 27, 2016

Yes, it's by design. It's to stave off the inevitable flood of legal threats and takedown demands.

If someone requests content be taken off, they instruct them to update their robots.txt. The content is not removed but will not be shown through archive.org as long as the robots.txt exclusion is in place.

There was a court case where the plaintiff wanted to subpoena the Internet Archive for evidence (since the defendant had since blocked the content with robots.txt). They sent an expert to testify that complying with that kind of thing would be too much of a burden for them, and suggested that the court force the defendant to change their robots.txt. The court agreed.

simcop2387 · on Oct 27, 2016

By design from the last time I saw it discussed. The idea is that a change there could indicate that there was a mistake and the data shouldn't have been crawled for one reason or another. There's just no way to know in an automated fashion.

lucb1e · on Oct 27, 2016

I'm inclined to say that robots.txt should be ignored altogether and removal requests should be handled per individual case. But I guess that goes too far...

Still, if the domain owner changes, they should not be able to remove content from old archives. That's like being able to remove stuff from encyclopedias about a palace somewhere, just because you live in the place where the palace once stood.

They are The Internet Archive after all, it's logical to archive contents like the domain owner and robots.txt for a given point in time. A change of owner can be easily detected.

mhurron · on Oct 27, 2016

> removal requests should be handled per individual case

Are you going to fund the internet archive to handle that workload?

> if the domain owner changes, they should not be able to remove content from old archives.

Why not? If that work belongs to anyone, it is the current domain owner. Why does the fact that the internet archive happened to crawl it mean that suddenly they lose control of their information?

This is elevating the Internet Archive from 'hey it's cool someone made a copy of that while it was up and no one cared' 'because the Internet Archive crawled it the world has absolute rights to that information from now on, wishes of the owner be damned.'

JoshTriplett · on Oct 27, 2016

> If that work belongs to anyone, it is the current domain owner.

Not at all. A domain parker taking over a domain does not imply they have any rights over all the content that the previous owner of the domain posted.

mhurron · on Oct 27, 2016

Outside of any other information, it does imply that.

lucb1e · on Oct 27, 2016

> Are you going to fund the internet archive to handle that workload?

I figured someone would ask that. I don't have an immediate answer but it's a good question (upvote for that). My hopes are just that removal requests are not too frequent. But without current numbers (of number of pages hidden after-the-fact and current removal requests) this is guesswork.

They could charge for it perhaps? A dollar per request. Doesn't seem too unreasonable for something you mistakenly made available to the planet. It doesn't have to be per page, so if you made a million documents available all under example.com/hidden/ then hiding that folder is a simple action and costs just one dollar. You're paying them for their time.

In the Netherlands, if you want your personal information (e.g. phone number; email address) removed from a company's systems, you can request that and they must grant it if they have no reason to keep the data any longer. And you can make requests to see your data, etc. But the law allows for companies to charge for this and I've seen example amounts (I think around 3 euros) somewhere. It's a somewhat similar situation.

So I don't have a single good answer, but I think by-case is a better way to go (and worth thinking about, at least) than just using the current approach.

RandomInteger4 · on Oct 26, 2016

Wait, so every time someone makes a citation on wikipedia it causes Archive.org to archive that URL? This seems like it could be abused by a malicious agent, no?

danso · on Oct 27, 2016

Well, in my hypothetical world, the Wikipedia editor page would include a widget in which a user could submit a URL, and Wikipedia would generate the markup, including an archive URL. Deciding how to implement this so that it works efficiently will have some overlap with issues on that 2nd hardest computer science problem of cache invalidation.

cmdrfred · on Oct 27, 2016

Good idea, after that it's really just rate limiting and perhaps only looking up the same URL every X days. Perhaps a per domain limit as well. No problem they probably haven't already solved.

colejohnson66 · on Oct 27, 2016

There's already the citation tool in the default editor. Shouldn't be too hard to just fire off an AJAX request to the Archive when the citation is added, right?

ohyes · on Oct 27, 2016

Easy enough, but what do we call it?

The Wikiarchivator?

voxic11 · on Oct 26, 2016

You can already just tell archive.org to archive a URL. How would this be different?

smitherfield · on Oct 26, 2016

You could have a "reverse DDoS" where archive.org blocks requests from Wikipedia, removing the ability to add Wikipedia citations.

klodolph · on Oct 27, 2016

There's no such thing as a "reverse DDoS", that's just a regular denial of service attack.

cmdrfred · on Oct 27, 2016

A theoretical reverse DDoS: You somehow prevent traffic from reaching their load balancer so it automatically scales their AWS instances to zero.

toomuchtodo · on Oct 27, 2016

Your minimum instances would always be 1 in the autoscaling group associated with your ELBs (or now, ALBs).

Regardless, neither the Internet Archive nor Wikimedia use AWS or other cloud providers, as it would be prohibitively expensive. They both run their own infrastructure/ops.

cmdrfred · on Oct 27, 2016

Mostly a joke, just trying to think of what that term would mean.

simcop2387 · on Oct 27, 2016

I think a reverse DoS attack would be one that actually increases the capacity of the service. I'm not sure what scenario would allow that to happen though.

lmm · on Oct 27, 2016

Some streaming sites (e.g. 4 on Demand, IIRC) have a peer-to-peer element; you could reverse-DDOS those by running a lot of computers "seeding" (perhaps by leaving them on the page but at the end of the video) distributed all over the globe.

smitherfield · on Oct 27, 2016

What I meant was:

1. Get Wikipedia to send lots of requests to Archive.

2. Archive blocks requests from Wikipedia.

3. Wikipedia citations are disabled.

In other words getting Wikipedia to DDoS Archive, so that Archive's defense hurts Wikipedia.

A very silly scenario of course, just coming up with one for why an attacker might want to indirectly DDoS Archive via Wikipedia.

smsm42 · on Oct 27, 2016

What would probably happen between 1 and 2 is that archive.org notices spike in traffic from Wikipedia, talks to Wikipedia engineers, they find out who is generating all these requests, and block them from editing Wikipedia.

Though it's certainly possible to generate enough junk edit traffic to cause disruption on Wikipedia, but that's nothing new. It's the nature of Wikipedia as the resource - it trusts the internet community to be good on average. So far it worked.

jon-wood · on Oct 27, 2016

That's still not a DDoS, as its missing the Distributed element. Almost by definition if you can block all requests at source its not a DDoS, its just a regular old DoS.

HCIdivision17 · on Oct 26, 2016

I'll include the obligatory donation link below for what is effectively an internet utility. [0]

Also, take a look at downloading a copy of Wikipedia! You can get a full download with images for around 100GB (last I checked, which was about two years ago). It's great for if you ever think you'll need some technical info while away from live internet. I keep a copy on a hard drive that boots into linux, just in case (maybe I'll be on site with a customer and need some engineering notes - BAM taken care of!)

[0] https://archive.org/donate/

EDIT: I used XOWA, and I do not keep the wiki up to date, really. Note that the entire wiki history is huge, but a reasonably current snapshot is manageable (~100GB or so).

[~] http://xowa.org/home/wiki/Help/Download_XOWA.html

protomikron · on Oct 27, 2016

Actually I think 100G is not that much, if you consider what you get. Furthermore some time ago I estimated that you could store the entire world's tiles (raster data in 10m resolution) and OSM data on a SSD (<500G).

If you think about it, that might be a game changer. Maybe it might make sense to ship e.g. smart phones with wikipedia, OSM and satellite photo (e.g. Sentinel-2 is free data) on disk.

Granted, 10m resolution is not the state-of-the-art in aerial imagery - which is around ~0.1m (this means 10^4 times more data) - but it is reasonable to detect buildings and combined with OSM's vector data you have practically Google Maps in your pocket.

toomuchtodo · on Oct 27, 2016

> If you think about it, that might be a game changer. Maybe it might make sense to ship e.g. smart phones with wikipedia, OSM and satellite photo (e.g. Sentinel-2 is free data) on disk.

I don't think this is a sensible use of local phone storage, but I do think you'll see a lot more P2P edge cache nodes if/when IPFS takes off (OSM tiles are already served on the IPFS network).

It would be a simple matter of picking a VPS provider or hardware colo provider near expected heavy use, launching IPFS, and having it pin the relevant content locally.

anc84 · on Oct 27, 2016

How is caching handled for those tiles in IPFS? At OSM.org they get rerendered when the underlying data changes.

toomuchtodo · on Oct 27, 2016

https://github.com/ipfs/faq/issues/16

iotscale · on Oct 27, 2016

If we're considering just the availability, Google Maps has an option to cache a region for offline use, which works spectacularly good - given that you get your antenna on a good signal every once in a while.

takingflac · on Oct 27, 2016

I mostly use this because limited data plan, makes it use a lot less data if you already downloaded the area you are currently in before hand while you were still on wifi.

theGimp · on Oct 26, 2016

Thank you for posting the donation link. I have donated in the past and always think about donating when the Internet Archive is mentioned.

I just set up a monthly donation amount, and I encourage you to do the same if you believe in the Internet Archive's mission!

cmdrfred · on Oct 26, 2016

I have a pretty substantial amount of storage for things I have collected over the years. Podcasts, YouTube videos, Gifs even stuff from back when Flash animations were popular. All organized and curated. Because it's private I don't really have to worry about copyright laws as well. 6 disks in a raid 10 array with 2 hot spares. It's rather cumbersome now but hopefully the tech will get good enough to be able to maintain itself for a few hundred years before I die. I know some of that stuff will be lost to time, some of it you already can't find online. I call it the ark. I hope some day down the road I can blow some historians mind.

Perhaps, long from now, they might quote internet comments from sites like HN like we quote the Greek philosophers. "cmdrfred the elder once said...". I should make a habit of ripping the front page or so and comments for the ark.

schoen · on Oct 26, 2016

> Perhaps, long from now, they might quote internet comments from sites like HN like we quote the Greek philosophers. "cmdrfred the elder once said..."

Or Talmudic claims of authority in Judaism:

https://duckduckgo.com/?q=%22taught+in+the+name+of+rabbi%22&...

https://duckduckgo.com/?q=%22said+in+the+name+of+rabbi%22&ia...

Muslim ahadith also have the phenomenon of the chain of narration where people declare the provenance of the teaching:

https://en.wikipedia.org/wiki/Hadith_studies#Sanad_and_matn

The_ed17 · on Oct 26, 2016

I've been an editor on Wikipedia for years, and it's simply amazing how many web pages I referenced in 2008–09 have disappeared. Digital archivists have their hands full.

noufalibrahim · on Oct 27, 2016

Here's another piece of irony for you.

There was an article by Brewster about preserving the Internet in the scientific American which put the average lifespan of a URL at around 40 days. The said article now 404s and the only way to get it is through the wayback machine. :)

http://web.archive.org/web/19970504212157/http://www.sciam.c...

jacquesm · on Oct 27, 2016

Of course. The page is much older than 40 days...

noufalibrahim · on Oct 27, 2016

A little anecdote that might be interesting here.

I worked at The Archive for a little while and one of the projects I worked on was to unpack about 300 TB of crawl data from the defunct search company cuil.com. It was mostly decent quality data in a standard format and after some grinding, the whole thing was converted into warc files which the wayback machine could use to show the URLs. The end result was that about 60 billion URLs came "back onto the web".

During the work, I was looking at the stones rather than cathedral but after I left and thought about it in detail, it was very satisfying. I was reading the book "A Canticle For Leibowitz" at the time and the general theme of cycles of history was in my head. That dovetailed very well with the work I had done.

If you're interested, you can download the dumps over here https://archive.org/details/cuilcrawl

toomuchtodo · on Oct 27, 2016

Upvote for "A Canticle For Leibowitz" alone. Thanks for your service at the Internet Archive!

noufalibrahim · on Oct 28, 2016

It was a privilege to work there and definitely the best period of my career. So many wonderful people and experiences in such a short time. I personally consider Brewster Kahle a severely underrated hero of our age. The Archive's work is extremely valuable but I think the value will be appreciated only by a future generation.

As for the book, I think if The Archive had novel for a totem, it'd be "A Canticle For Leibowitz". Very much affected my world view when I read it and this thread has just kindled my interest again. :)

toomuchtodo · on Oct 28, 2016

If you don't have a paperback copy of A Canticle, I'd be happy to send you mine!

noufalibrahim · on Oct 30, 2016

That's very generous of you but I do have a treasured copy myself. Thanks! :)

colejohnson66 · on Oct 27, 2016

Whenever I cite a webpage, I try to archive the page at that moment and include the archive link in the citation. It's a bit more effort than just `<ref>[url]</ref>`, but it really is necessary.

dredmorbius · on Oct 27, 2016

I had to migrate my archived articles off Readability prior to the OEL at the end of September. I'd only used the service for a couple of years, and hit about a 5% bitrot rate, this on fairly significant articles.

One of the more curious cases was CSIRO (Australia's national science and research organisation) which seems to have not only deliberately purged a fair amount of data (Graham Turner's work specifically), but has a robots.txt in place which blocks archival by TIA. That strikes me as ... downright curious.

caf · on Oct 27, 2016

Which CSIRO site is that with the robots.txt? (it's not their main site).

I'm thinking that could be something to have an Opposition Senator bring up in Senate Estimates.

toomuchtodo · on Oct 27, 2016

Could you reply with CSIRO links blocked by robots.txt? I run my own instance of ArchiveTeam's ArchiveBot (which archives links provided regardless of robots.txt), and would be happy to put the content into cold storage.

dredmorbius · on Oct 27, 2016

I'll need to check my Readability dump.

toomuchtodo · on Oct 27, 2016

No rush; I'll bookmark this thread to check two weeks from now.

dredmorbius · on Oct 27, 2016

Found it. Apparently the robots.txt has been fixed:

404: http://www.csiro.au/en/Portals/Multimedia/CSIROpod/Growth-Li...

Now available: http://web.archive.org/web/20120508210658/http://www.csiro.a...

That's among the specific links which wasn't being served by TIA earlier.

toomuchtodo · on Oct 27, 2016

Awesome. I've emailed CSIRO to try to track down that podcast included in the article that was not archived.

dredmorbius · on Oct 27, 2016

Thanks, I really appreciate this.

dredmorbius@gmail.com if you happen to track that down.

toomuchtodo · on Oct 28, 2016

Emailed. Also, archiving all of the current version of csiro.au, just in case.

dredmorbius · on Oct 28, 2016

It seems to be climate, CO2, and limits-related work that is most prone to being censored.

toomuchtodo · on Oct 28, 2016

Not surprising based on the political climate in AU currently.

b2600 · on Oct 27, 2016

Wikipedia and The Internet Archive are examples to me of what makes the internet awe inspiring. (I can throw in Google search too, not Google Inc). Data (sometimes imperfect) is one query away on any internet connected device. Truly amazing in a way.

eriknstr · on Oct 27, 2016

The archive.is guy provides mirrors of rotten links to Wikipedia also, although not as the result of any official agreement with Wikipedia, just on his own initiative, which I think was nice of him.

Enclyclopedia Dramatica is generally not a reputable source of truth, being the site that it is, but while looking for some more information on archive.is mirroring of links from Wikipedia articles, I found an article on ED that I found interesting. It is heavily advocating one side of the story but at least it backs it up with some links, which is rather seldom on ED (most links on ED usually go to other pages on ED in my experience).

https://encyclopediadramatica.se/Archive.is

necessity · on Oct 27, 2016

It's amazing to me how Wikipedia ends up being a reasonably good website with such a cancerous community behind it.

caf · on Oct 27, 2016

Perhaps this is the genius of Wikipedia - it keeps many of those of a certain type of personality occupied amongst themselves while using the energy of their machinations to produce a product of wider social good.

cooper12 · on Oct 27, 2016

You know ED heavily dramatizes stuff right? (It's in the name) They also have a huge hate boner for Wikipedia in my experience. Anyway, I think it's unfair to call the community cancerous when in this case archive.is was spamming Wikipedia with bots. Spam from any website is spam, regardless of how useful, and in this case it's a severe breach of the community's trust. Anyway, archive.is was recently removed from the blacklist, so it's silly to paint the whole community as "cancerous", which is also a juvenile term to use.

TorKlingberg · on Oct 27, 2016

What are you referring to?

corobo · on Oct 27, 2016

I'm not necessarily agreeing with the OP here and I don't even know what the community is like but this seems a decent page to start[1]. I do like how Wikipedia keeps a page on it's own controversies - I mean it makes sense, but I like that they're open about it.

[1] https://en.wikipedia.org/wiki/List_of_Wikipedia_controversie...

smsm42 · on Oct 27, 2016

The thing with the links on wiki is that if they lead to obscure third-party site, the user would trust them at the level of trust for Wikipedia. But if Wikipedia community has no idea what this site is, they'd feel uncomfortable using these for linking. Especially if there's danger than in 10 years whoever runs this site gives up, loses the domain, some spammer or criminal buys it and gets all the naive people from Wikipedia coming to them and trusting them since Wikipedia referred them.

For archive.org it is a known, established and trusted organization. It's actually has an office within walking distance from Wikimedia offices, AFAIK :) - not that it is very important, just an interesting fact. The point is there's no reputation problem. But for site that is less known, there is.

I understand the frustration of people about not being trusted, but that's how it works - trust needs to be earned. I don't see any way to it but for whoever runs the bot to talk to Wikipedia community and earn their trust. Name-calling won't exactly be helpful here like some do here in comments. Shady practices used by whoever wrote the bot like using tons of IPs and not identifying the bot properly also doesn't help. You can't be sneaky and complain there's no trust at the same time.

shortformblog · on Oct 27, 2016

Excellent news. Should note that today is the 20th anniversary of the Internet Archive: https://blog.archive.org/2016/10/26/making-the-web-more-reli...

qwertyuiop924 · on Oct 27, 2016

I'm really glad this is happening. Wikipedia needs to clean up their broken links, and this could help the archive get a wider sampling of websites, so as to preserve more data.

Websites going offline is a huge problem. For example, the now-famous thread from which sleepsort originated (on 4chan's /prog/ textboard) isn't archived anywhere: textboard threads are immortal, so nobody thought to archive any threads until dis.4chan.org went down for good.

Thankfully, some bright spark managed to save the sqlite databases for most of the boards on dis to the Internet Archive, so I was able to track down the thread eventually.

pmiller2 · on Oct 26, 2016

This is a huge step forward for Wikipedia as an authoritative source of information. Glad to see this happening. :)

OT: I considered applying to the Internet Archive last time I was looking for work, but their office is too hard to commute to coming from the East Bay. :(

brudgers · on Oct 26, 2016

I agree that it's a big step forward for encyclopedias. Not just as a 'source of truth' but also in terms of automating away a lot of the routine editorial maintenance that needs to happen at Wikipedia's scale.

ideonexus · on Oct 27, 2016

This whole discussion reminds me of how all MySpace content was destroyed in a rash corporate decision years ago. Just like that, five years of the most popular social networking site on the World Wide Web and all its history were wiped out:

http://activehistory.ca/2013/06/myspace-is-cool-again-too-ba...

Unfortunately, the Internet Archive was only able to get the non-logged-in version of the site. All those loud, obnoxious profile pages users spent endless hours working on? We only have oral histories now to remember them.

smsm42 · on Oct 27, 2016

I wish I could get all my old horrible homepages back. There was time you'd have to torture me to admit I had anything to do with that, but now I would probably be proud of them again. It's history now.

caf · on Oct 27, 2016

It'd be great if StackOverflow approached the Internet Archive about doing the same for their broken links, too.

sengork · on Oct 27, 2016

Internet Archive should look into distributed models such as IPFS for storage of the archived sites.

toomuchtodo · on Oct 27, 2016

The IPFS team is working with the Internet Archive on this.

sengork · on Oct 27, 2016

Excellent, it may be a better way to provide spare bandwidth/storage similar to their https://archive.org/details/archiveteam-warrior

toomuchtodo · on Oct 27, 2016

For sure. ArchiveTeam has explored providing mirrors of the entire Archive [+], but IPFS is a perfect fit for the task.

[+] http://www.archiveteam.org/index.php?title=INTERNETARCHIVE.B...

felipesabino · on Oct 27, 2016

As I have clicked in several broken links already, I am wondering how many, absolute number or in percentage, per article are likely to be broken

I might be way off, but doesn't 1M seems like a low number for wikipedia size? What is that in percentage of total number of links? Does anyone know?

youdontknowtho · on Oct 27, 2016

Wow. I really love the internet archive as a project. This is a great usage. Looking forward to see how that will work out.

I wonder if they will publish a list of replaced links after the fact?

h1d · on Oct 27, 2016

What's blocking Wikipedia to just archive the referenced pages on edit?

It would be far more reliable than depending on Internet Archive when it may not have the page archived and more likely the time of the archive would differ from the time it was referenced.

It would cost some more disk space and bandwidth, which of course is already pressuring them but in turn would greatly improve usability and reliability.

digi_owl · on Oct 27, 2016

Likely some interpretation of how Wikipedia is not to be a primary source.

raverbashing · on Oct 27, 2016

One corner case that exists: a content is linked on Wikipedia, this content is taken down due to a copyright violation

(I suppose Archive.org would be asked to take the content down)

sp332 · on Oct 27, 2016

Archive.org will take content down for certain reasons, but they have a pretty broad copyright exemption as a non-profit archive.

torrent-of-ions · on Oct 27, 2016

Why does the headline says "to fix 1M broken links" but the article says it's already been done?

cooper12 · on Oct 27, 2016

Yeah that's a bit confusing. I'd attribute it to the "press release" nature of Wikimedia's blog where they mean that they have already partnered with the Internet Archive and are announcing it after the fact.

45h34jh53k4j · on Oct 27, 2016

(red heart)(yellow heart)(green heart)(blue heart) Internet Archive (red heart)(yellow heart)(green heart)(blue emoji)

There are fewer more noble pursuits than archiving the sum of human knowledge.

alecco · on Oct 27, 2016

On a side note, it makes me very sad how Wikipedia editors are often pushing some political agenda. I'm relying on it for less and less topics. Clearly nothing that can be affected by US politics or SJW-style controversies.