Download the Entire Wikimedia Database (wikimedia.org)
151 points by surround on March 6, 2021 | 74 comments



I'm so glad the download-entire-Wikipedia function continues to exist. That will help counter the "lost the entire library" problem of the city of Alexandria. To be fair, Wikipedia only has summaries, not the detailed material, but it's still important.


Kiwix offers downloadable material (including the full Wikipedia), in a format specifically designed for offline browsing.

https://wiki.kiwix.org/wiki/Content

Their Wikipedia bundle hadn't been updated for a while and had fallen out of date, but that seems to have been fixed now.


This is what makes the 256GB iPad worth it: you can carry the entire Wikipedia (with pictures) around with you. It's fantastic when you're on a plane or need to entertain yourself without Wi-Fi.


I "upgraded" the internal storage of my Kobo Glo HD (which is in fact using a 4GB MicroSD card as its onboard storage) to a 128GB. I can store the "all maxi" 82GB ZIM file on it. The Kobo firmware includes (but doesn't advertise) ZIM support, so I can browse Wikipedia on an eInk display and great battery life without Internet connectivity.

That would be a great post-apocalyptic knowledge archive, and it would be easy enough to recharge using a USB solar panel.


I've been wanting something like this for ages. I always thought the Kindle not having the full Wikipedia archive as an optional install was a massive missed opportunity, especially back when the Kindle had the keyboard at the bottom. It would literally have been the Hitchhiker's Guide as described in the books.


> The Kobo firmware includes (but doesn't advertise) ZIM support

This is one of the nicest hidden features/easter eggs I've ever come across. Thanks for sharing this!


This comment convinced me to buy a Kobo.

How's the search functionality in Wikipedia when used like this?


You open the ZIM file as if it was an eBook, then you can use the search feature within it.

Example: https://www.mobileread.com/forums/showthread.php?t=276219


Time to update XKCD: https://xkcd.com/548/


This sounds like an amazing idea. Thanks for the tip!


Or a Kobo with an external SD slot!


The link for this post has the full Wikipedia, just not in a very readable form. I believe the grandparent comment was saying that Wikipedia is a summary of original material such as books and scholarly articles.


It is pretty awesome that there are communities like /r/DataHoarder that are obsessed with backing up the collective knowledge of humanity.


There's also Archive Team, focused on preserving at-risk sites before they are taken offline.

https://wiki.archiveteam.org/


I'm not familiar with r/datahoarder, but if the name is any indication, it seems they are mostly centered on hoarding data, which I guess means just digital material? If so, I would much rather promote efforts like the Internet Archive that back up all kinds of things, not just digital data.


What does the Internet Archive back up that isn’t represented by 1s and 0s?


A ton of 35mm and 16mm film reels, vinyl, even wax recordings, physical books and a lot more. They make digital copies of them, but they archive the physical versions as well. Here's a selection of the movies: https://archive.org/details/moviesandfilms?tab=about



Promote both, rest easier.


There's not much evidence that the loss of the Library of Alexandria constituted much of a loss to human knowledge. This idea seems to come from Carl Sagan's Cosmos. While there are a lot of good things to say about him, he's not much of a historian.


I think it is almost certain that a vast amount of human knowledge was lost. The policy of the city of Alexandria was to copy every book that went through it, and give the copy back. As a result, they ended up with a spectacularly large collection of original materials.

The amount of ancient literature that is no longer available is vast. History, literature, you name it; so much has been lost. The odds are very good that at least a significant portion of it was in the Alexandrian library.


I'm going to ask my parents about this to see if they learned this idea in school. They didn't grow up in an English speaking country, and I don't think they're big fans of Sagan.


Maybe the fact that knowledge such as the rough circumference of the Earth was lost for a thousand years had something to do with it? Or is that false?


This is also a user privacy solution that could be applied in other areas. If one is only querying a local store of Wikipedia pages, then third-party observation of those queries is perhaps impossible, and so the "surveillance capital" "business model" may become infeasible. (Now this is perhaps a poor example because Wikipedia generally does not appear to be engaged in "surveillance capitalism". Maybe their straightforward provision of bulk data coupled with this fact is not a coincidence.)

But there are other databases besides Wikipedia that internet users query weekly, daily, hourly or even more frequently. Making incessant queries over the internet can be a rich source of data for a third-party observer, one that can form the foundation of "surveillance capitalism" revenue strategies. Whereas if users can download bulk data, and thereby avoid queries over the internet, in many cases they can avoid falling victim to "surveillance capitalism". Bulk data can be a viable privacy solution.


Could it also download sources, if available? That would be cool.


> I'm so glad the download-entire-Wikipedia function continues to exist. That will help counter the "lost the entire library" problem of the city of Alexandria. To be fair, Wikipedia only has summaries, not the detailed material, but it's still important.

Personally, I think Wikipedia's quality is too poor for that. Plus, it's digital, so when our civilization is at risk of "[losing] the entire library" it probably would have already lost the ability to maintain the computer systems to access Wikipedia dumps.


"Personally, I think Wikipedia's quality is too poor for that."

You should see some of the crap in the books at Alexandria: The world is flat and there are four elements and other bollocks. Obviously I'm taking the piss. The content is sometimes just as important as the factual accuracy of the content. For a given value of factual accuracy.

WP is written by people and holds a vast amount of stuff. It is flawed in my opinion in many ways but that is the human experience.

I live in a town called Yeovil in Somerset, UK. https://en.wikipedia.org/wiki/Yeovil . Several years ago I noticed an incorrect old name for the place and I tried to correct it. I appealed to the Domesday Book, which is considered quite authoritative hereabouts, but I linked to the only site I could find which sold copies of it. My edit was thrown out by the local Somerset editor rather than being fixed. I own a coffee cup bought from the local museum that lists the >60 spellings of this tiny town over the last 1500-odd years. The editor wouldn't accept that either: "original research", WTF! I didn't put the names on the mug - archaeologists, historians and a bunch of tax gatherers hired by a horde of Normans back in the day did that. OK, no they didn't - they scrawled stuff and the museum gathered together the scrawls and made my mug. I did one for a tourist shop on the Plymouth Barbican wrt the Mayflower complement, about 30 years ago. To be fair, I simply copied the names off the board near the Mayflower steps!

My point is that WP is what it is and you need to see it for that. It is both a store of knowledge and also a store of knowledge and blatant lies and everything in between ... about knowledge. It contains its own metadata and also omits vast amounts of it.

WP is without question in my mind absolutely magnificent but you do need to learn how and when to interpret it to fit in with your idea of factual - whatever that is.


> You should see some of the crap in the books at Alexandria: The world is flat and there are four elements and other bollocks. Obviously I'm taking the piss. The content is sometimes just as important as the factual accuracy of the content. For a given value of factual accuracy.

But that reflected the actual state of knowledge at the time, which is what you'd really want to study.

Let me put it this way: Wikipedia wouldn't even allow Wikipedia to be used as a source for one of its articles, because it's too unreliable: https://en.wikipedia.org/wiki/Wikipedia:Reliable_sources#Use...


> Let me put it this way: Wikipedia wouldn't even allow Wikipedia to be used as a source for one of its articles, because it's too unreliable

Er.. I think you're missing the point of citing sources. If an otherwise unverified claim could cite as its source another unverified claim, it would make citing sources meaningless. If a Wikipedia article wanted to cite a verified claim from another Wikipedia article that did have a verified source, it may as well just use that original source as its source too.

I think your example actually shows wikipedia in a very good light.


> Er.. I think you're missing the point of citing sources. If an otherwise unverified claim could cite as its source another unverified claim, it would make citing sources meaningless. If a Wikipedia article wanted to cite a verified claim from another Wikipedia article that did have a verified source, it may as well just use that original source as its source too.

1) The Wikipedia guideline doesn't mention that logic, 2) quite a lot of Wikipedia doesn't cite anything, and 3) it not unusual for a passage in Wikipedia to not actually be supported by the citation given (e.g. https://en.wikipedia.org/w/index.php?title=Special:WhatLinks...).

In any case, this is getting off topic. My point was that, if you're looking for an ark of cultural knowledge to survive some apocalypse, Wikipedia's a bad choice. IMHO, something like Project Gutenberg plus a newspaper archive would be about 1000% better, and take up far less space. If you've got space to spare, throw in Libgen. Wikipedia's not a replacement for its sources, and I pity the future scholar who would have to rely on it without being able to check them.

But unless your data storage is orders of magnitude more reliable than any current technology and you package your archive with an equally reliable computer to read it, the concept of a digital ark fails. If you don't do that, your archive will be unreadable, and an unreadable archive is useless.


The content on Wikipedia is really not that bad. Obviously a Wikipedia article will never be the final say on any specific subject, but it tends to do a pretty good job of aggregating sources and condensing them into a reader-friendly synopsis. This data is super valuable, if only for the sources alone.


> so when our civilization is at risk of "[losing] the entire library" it probably would have already lost the ability to maintain the computer systems to access Wikipedia dumps

But as long as it continues to exist, some future civilization could figure out how to read the data again, eventually. Just like we eventually discovered how to read ancient languages that were once forgotten.


> But as long as it continues to exist, some future civilization could figure out how to read the data again, eventually. Just like we eventually discovered how to read ancient languages that were once forgotten.

Eh, I think you're vastly underestimating how difficult that would be.

1. The media would have to last hundreds of years at least, when it's hoped modern archival media can last maybe fifty.

2. Even assuming the media did last, the new civilization would have to reverse engineer encoding on top of encoding on top of encoding (e.g. physical disk encoding, complex filesystems, file formats, character encodings). Our civilization already has trouble reading some old file formats, and our disks already have trouble reading their data (which is why they pack a ton of error correction information).

It took the Rosetta Stone to figure out how to read the encoding of Egyptian hieroglyphics, even though that language was still alive in the form of Coptic.

3. Then you're dealing with the probability that the hard disks future archaeologists find will even have a Wikipedia dump on them. That probability will be very small, given that very few people download these dumps.


If people still understand English in some form (a good bet; we still understand Latin, and English has more reach than Latin did at its peak), understanding charsets is pretty easy. Just assume it's a shift cipher.

As far as media goes, that's true, but it's a bit of a numbers game. After all, you only need one unusually well-preserved specimen. The Dead Sea Scrolls survived. Not to mention intentional preservation efforts. I know GitHub has its Arctic vault thing. There's even a copy of Wikipedia on the moon! https://meta.m.wikimedia.org/wiki/Wikipedia_to_the_Moon/Wrap...
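
As a toy illustration of the shift-cipher point (the sample data below is made up, nothing real recovered here): brute-force every byte offset and keep the one whose output contains the most common English letters. A rough Python sketch:

    # Toy illustration: treat an unknown single-byte text encoding as a
    # shift cipher and recover the offset by scoring candidates against
    # the most frequent English letters (plus space).
    COMMON = set("etaoinshrdlu ")

    def likely_shift(data):
        def score(shift):
            decoded = bytes((b + shift) % 256 for b in data)
            if any(b > 126 for b in decoded):
                return -1  # not plain ASCII under this shift
            return sum(chr(b) in COMMON for b in decoded)
        return max(range(256), key=score)

    plain = b"Wikipedia is a free online encyclopedia written by volunteers"
    garbled = bytes((b + 7) % 256 for b in plain)  # pretend this is the unknown encoding
    shift = likely_shift(garbled)
    print(shift, bytes((b + shift) % 256 for b in garbled).decode("ascii"))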


> If people still understand English in some form (a good bet; we still understand Latin, and English has more reach than Latin did at its peak), understanding charsets is pretty easy. Just assume it's a shift cipher.

IIRC, Coptic is directly descended from Ancient Egyptian, but the Rosetta Stone was still needed to decipher hieroglyphics.

It won't be as simple as you think. The problem will be more like: here's 10TB of partially corrupted binary data, find the text when you don't know the encoding (oh, and the text may be compressed with an algorithm you also don't know).


Back in 2014 I computed the PageRanks within English Wikipedia, thanks to their database dump. https://www.nayuki.io/page/computing-wikipedias-internal-pag...
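
For anyone curious, the core of that computation is just power iteration over the article link graph. A minimal sketch (not the code I actually used; the toy `example` graph and the 0.85 damping factor are just illustrative), assuming you've already extracted a page -> outgoing-links mapping from the dump:

    # Toy PageRank by power iteration over a {page: [linked pages]} dict.
    # Assumes the link graph was already parsed out of the dump.
    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for page, outgoing in links.items():
                if not outgoing:
                    continue  # dangling pages ignored in this toy version
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    if target in new_rank:
                        new_rank[target] += share
            rank = new_rank
        return rank

    example = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(sorted(pagerank(example).items(), key=lambda kv: -kv[1]))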


That's intriguing.

Curious if you ever compared how PageRanks correlate to traffic? (They make their per-page traffic available too.)

It would be interesting to see the largest disparities -- super-popular pages in visits but which don't have nearly as many internal Wikipedia links to them, versus unpopular pages but that have tons of internal Wikipedia links to them.


What page had the highest PageRank?


Here is a plaintext version of Nayuki's results, via the link they posted above:

https://www.nayuki.io/res/computing-wikipedias-internal-page...

Geographic coordinate system is first.


Maybe it would be better to separate content links from more technical links that come from standardized templates, which aren't really part of the article content in a certain sense.


The homepage.


Would be interesting to see the results if you compared that ranking to the most viewed page (maybe that is the homepage, though).


Cool. Were there any surprising results among the highest ranks?


You can also get it in a user-friendly format with the application Kiwix (https://www.kiwix.org/) if that's your use case. PC, phone, or server. You get subsets of the data, and images are smaller to save space.


Kiwix is still somewhat hit or miss when browsing. I'm not sure how it handles text parsing, but it either takes forever or doesn't return results, making it sort of unusable.

But it's still a fantastic and incredible piece of software. When it gets to the point where I can carry the full 60GB ZIM file with me seamlessly, it will change simple computing for low-broadband areas. Imagine the uses: a portable and versatile database that could accept JSON and HTML/CSS data to make your own offline encyclopedias!


It's useful to browse on a phone if you have limited mobile data, and the text-only English Wikipedia fits onto a modern micro SD card.


The complete archive with images is 82GB, well within the capacity of an affordable modern microSD card.


Kiwix includes a reference to Qt5Core.dll with an invalid signature; not sure if that's relevant to anybody.



Genuine question: why is bittorrent not being used for this?


I imagine because Wikimedia has lots of bandwidth (once your website gets to be a certain size, bandwidth gets sold very differently), new versions come out regularly, and there is a relatively small number of people who want every single one of these and keep them around to seed.


It would be interesting to have intermittent releases supplemented with diffs.


Because they wanted it to be easy for people to get?


It is available as a torrent


I wish Wikipedia would offer incremental downloads (e.g. rsync). That would make it much easier to host your own Wikipedia.


Wikipedia is always bugging me about donations, and yet here is a feature they could charge for, or at least use to hint at donating. It would be perfectly acceptable to charge here since abuse of this can rack up quite a bill. Maybe they don't pay as much as I do for outbound traffic on AWS, but still.


> would be perfectly acceptable to charge here since abuse of this can rack up quite a bill

Not according to Wikipedia. Wikipedia would much rather beg people from all corners of the world to donate than restrict access to their data. That's what a good, honest and well-meaning foundation does.

And yes, no sane person shuffling a lot of data around is using AWS because of their awful bandwidth pricing, Wikipedia included.


I guess a hint would be fine, but charging for access, even bulk access, feels quite contrary to the spirit of the project. It excludes huge ranges of people who cannot afford it or don't have access to Internet payment methods.

I suspect the traffic caused by this is minuscule compared to the overall traffic, anyway. But that's just a guess.


> Maybe they don't pay as much as I do for outbound traffic on AWS, but still

Almost certainly not. I doubt this is even a rounding error on their budget.

I'm not sure how this stuff works at a high level (IANA network engineer), but I think they have a peering presence at internet exchange points - https://www.peeringdb.com/asn/14907 - so I'm not sure they actually pay for bandwidth at all, at least on some routes.


Further down the page:

> Backup dumps of wikis which no longer exist [...] This includes, in particular, the Sept. 11 wiki.

There was a Sept. 11 wiki hosted by Wikimedia?



Wouldn't it be cool if Steam supported distributing an offline Wikipedia database? It's just a few gigs (depending on languages/images/etc.), so it fits the DLC model perfectly, and Steam already uses BitTorrent.


An IPFS cluster for Wikipedia like this https://collab.ipfscluster.io/ would be nice.


Is the media content (images, videos) downloadable?

When I follow the links I find 2012-2013 data, but maybe I missed something?

https://meta.wikimedia.org/wiki/Mirroring_Wikimedia_project_...


No. It starts to get prohibitive in terms of file size. There are 288 TB of uploaded media: https://commons.wikimedia.org/wiki/Special:MediaStatistics + https://en.wikipedia.org/wiki/Special:MediaStatistics . Much of that isn't used on English Wikipedia, but nonetheless.

I suspect if you had some genuinely good reason why you wanted it, and asked real nicely, and provided some way to transport the data, you might be able to make an arrangement of some sort to get it.


There should be some form of compilation of quality articles per domain, like history, sciences, etc.

In a way all articles should belong to a category...


Kiwix provides that. :)


No, articles aren't properly labelled by domain of science.

There are collections of articles, but, for example, there are no collections of all articles related to computer science or some part of Swedish history.


I see your point: what you are asking for is labelling the articles rather than grouping them in collections, right?

I think labelling is useful when you can download individual articles (by label) efficiently. I don't think it's practical to create torrents for each, nor efficient to ask users to scrape. IPFS could have helped if it weren't so slow in practice.

Also see https://wiki.kiwix.org/wiki/Content_in_all_languages


I guess Wikipedia should encourage all article editors to properly tag articles.

Dewey categories are a good way to label things, I guess.


One of the only things you can do to ensure lasting democracy today is to download the pages, with complete history, put them on a USB drive or microSD card, properly labelled, for you to keep offline, and just forget about it. You can do this as a consumer; it's easy. There's no harm in it; it's not some kind of private data such as personal photos or documents. If you end up forgetting or losing track of it, it really is no big deal. You just decided to download it when you saw it on Hacker News back in 2021, right?

My reason for calling this one of the only things you can do to ensure lasting democracy is that it is within the realm of physical possibility that, at some point, through some mechanism, the online version simply does not inform the public on some important public issue, whereas the history as you can download it today does. Though, I wouldn't speculate about what the mechanism might be or what kinds of subject.

At that point in a physical sense you could consult your offline copy on an airgapped PC or future equivalent and I think it would be impossible for any group of any kind to even know you were doing that let alone stop it.

How you might get the word out is another question, but having this personal capability is easy for the people here, as technical users and simple consumers. Indeed, the entire Internet was set up as a distributed network in case of nuclear attack, so the entire topology of the Internet is set up for you to do this easily today.

It's a click and a cheap flash drive or slightly more expensive microSD card away. You can take this step in less than 20 active minutes of your time, for less than $50 if you go with an external spinning disk drive (such as 1 terabyte), or $200 or so if you go with a microSD card. It doesn't really matter if the file ultimately fails; this is not a critical backup, just a nice-to-have. You could write the file's checksum onto the drive in marker so you can tell later whether it's still correct (as opposed to having bit errors).
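
For the checksum-in-marker step, something like this is all it takes. A minimal sketch in Python (the filename is hypothetical; SHA-256 via the standard hashlib is just one reasonable choice); note that a plain hash only tells you the copy went bad, it doesn't repair anything:

    # Compute a SHA-256 checksum of a large dump file in chunks, so you can
    # write the digest on the drive in marker and re-verify it years later.
    import hashlib

    def file_sha256(path, chunk_size=1024 * 1024):
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                digest.update(chunk)
        return digest.hexdigest()

    # Example (hypothetical filename):
    # print(file_sha256("enwiki-pages-meta-history.xml.bz2"))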

Maybe there is some file type that has a bit of redundancy (checksums) built in for long-term storage, since with such a large amount of data (several hundred gigabytes) I wouldn't be all that surprised if a few bits flipped over the course of several years in cold storage. But I don't know what kind of file type has any sort of redundancy or parity built into it that is supposed to protect against this. (Does anyone know?) Most likely the hash just wouldn't match what you wrote in pen on it, but it would still be usable.

Regarding choice of spinning disk or microsd card: I guess it's in the realm of what's possible in a physical sense that at some point people would have their personal property rummaged through by some group and a hard drive is pretty obvious and could be stolen or removed for that reason. (In a physical sense, not speculating about social or political developments that might lead to that.)

So for this reason, perhaps best would be to put it on a microSD card, even though it is quite a bit more expensive. Written once and then never used, I guess bit rot causes microSD cards to decay within a few years.[1] I don't know about spinning media, but I guess it's also about 5-10 years at least.[2]

You could put the microsd card under a postage stamp for example and put an important unrelated document into the envelope, which you would expect to keep for many years. Of course you could always end up accidentally discarding your envelope (while retaining its contents) but that risk shouldn't matter too much. In a physical sense it is possible for groups to x-ray all paperwork (such as envelopes as I just suggested) and a microsd card's electrical contacts are pretty obvious in an x-ray. (It looks like this [3]). I don't have any suggestion that works against this attack, which is within the realm of what's possible according to the laws of physics.

I'm not speculating on what social or political developments might possibly make anything like this necessary at some point in the future, but we still live in a world governed by the laws of physics so as technical professionals you have a huge leg up on most of the world. Spending $50 doing this today might save democracy tomorrow. You could also leave it as a time capsule however the storage longevity is not that long (between 5 and 20 years I guess), and in a physical sense, a time capsule is not particularly secure and would require instructions for someone else to figure out so it's not great in that sense.

So in terms of what you can do today, I would suggest just getting an external 1 terabyte usb drive ($50), downloading the dump together with history (20 active minutes), writing the checksum onto it in marker and just putting it somewhere. Obviously this small $50 investment is one you would hope never to have to use, but who knows, you might go down in history as the one who saved some small part of the world. Though, obviously, not in Wikipedia history.

[1] https://www.quora.com/What-is-the-longevity-of-a-sd-memory-c...

[2] https://serverfault.com/questions/986911/how-long-will-unuse...

[3] https://www.reddit.com/r/pics/comments/3b6bjw/i_xrayed_an_sd...


Are images/media from Wikimedia Commons included in these dumps?


No. It's just the wikitext (markup language) source of all pages.
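
If you want to pull that wikitext out yourself, a streaming parse of the pages-articles XML dump with the standard library is enough for a first pass. Rough sketch (the filename is hypothetical, and the {*} namespace wildcard needs Python 3.8+):

    # Stream page titles and wikitext out of a MediaWiki XML dump without
    # loading the whole file into memory. The namespace URI depends on the
    # export schema version, so we strip/wildcard it rather than hard-code it.
    import bz2
    import xml.etree.ElementTree as ET

    def iter_pages(path):
        with bz2.open(path, "rb") as f:
            for _, elem in ET.iterparse(f):
                tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace
                if tag == "page":
                    title = elem.findtext(".//{*}title")
                    text = elem.findtext(".//{*}text") or ""
                    yield title, text
                    elem.clear()  # release the page subtree as we go

    # Hypothetical filename:
    # for title, text in iter_pages("enwiki-latest-pages-articles.xml.bz2"):
    #     print(title, len(text))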


Well, not the entire DB, just the public parts. User passwords are not included ;)



