This is great. In 2008 Patrick Collison had a buddy who was using the then-undocumented iPhone internal APIs to make apps. They made a Wikipedia dump[0] so they could have the Hitchhiker's Guide to the Galaxy.
I was surprised that this was only 32GB, as is noted below. But if you figure the entire WP dump is ~27GB uncompressed (from a comment below), and if that figure covers all languages, you could download a specific language and then modules, e.g.
Top 10,000 articles, nature, mechanics, etc. This thing likely has a global stylesheet, so you are only downloading text. How much info do we really need?
Sure, 32GB isn't a lot comparatively, but even 24GB of text is a lot. 1GB is like 65k pages in a Microsoft Word doc.
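A quick back-of-envelope check of the pages-per-GB figure above, as a rough sketch. The 65k pages per GB number is taken from the comment; the 3,000 bytes per plain-text page is my own assumption (roughly 500 words), not something from the thread, and the function names are just for illustration.

```python
GB = 10**9  # decimal gigabyte, in bytes

def implied_bytes_per_page(pages_per_gb):
    """Average page size implied by a 'pages per GB' claim."""
    return GB / pages_per_gb

def pages_in(total_bytes, bytes_per_page):
    """Whole pages that fit in total_bytes at a given page size."""
    return total_bytes // bytes_per_page

# The "65k pages per GB" claim implies roughly 15 KB per page,
# which is closer to a formatted document than to raw text.
print(round(implied_bytes_per_page(65_000)))

# ASSUMPTION: about 3,000 bytes for one page of plain text.
print(pages_in(GB, 3_000))

# 24 GB at the comment's own rate of 65k pages per GB.
print(24 * 65_000)
```

Either way you slice it, 24GB of text works out to well over a million pages.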
??
I have an offline copy of the entire (en)wiki on my disk; it's <100GB for images and 12GB for articles compressed (~30GB with the entire edit history). All other languages might double that, still nowhere close to even one TB.
Wikipedia EN is 100GB[0] in an XML dump. You linked to the dump of the entire primary DB, which probably (and this is total speculation) is all users, editors, edits, languages, usage statistics and internal metrics.
I don't know if the quoted stat above includes pics, but I believe it is a text dump. You would have to read further to confirm. If they made their usage stats available, you could pull 10% of that data (10GB), which corresponds to the most frequented articles.
Edit: maybe you meant what you said and are in fact correct. From a user standpoint, the article text, supplementary material and some edits are likely the most important metrics. However, a massive dump of the entire site and infra would be about that large.
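The "pull the most frequented 10%" idea above could be sketched roughly like this. It assumes you already have per-article view counts and sizes from somewhere; the tuple layout, the function name and the sample numbers are all made up for illustration and do not come from any real dump.

```python
def pick_top_articles(articles, budget_bytes):
    """Greedily pick the most-viewed articles that fit in a byte budget.

    articles: iterable of (title, views, size_bytes) tuples (hypothetical layout).
    Returns (list of chosen titles, total bytes used).
    """
    chosen, used = [], 0
    # Visit articles from most to least viewed.
    for title, views, size in sorted(articles, key=lambda a: a[1], reverse=True):
        if used + size > budget_bytes:
            continue  # too big for the remaining space; try the next one
        chosen.append(title)
        used += size
    return chosen, used

# Made-up sample data: (title, views, size_bytes)
sample = [("A", 100, 6), ("B", 90, 5), ("C", 10, 3)]
print(pick_top_articles(sample, budget_bytes=10))
```

This is a simple greedy pass, not an optimal packing, but for a "top N articles" offline bundle that is probably good enough.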
[0] http://www.zdnet.com/article/encyclopedia-offline-wikipedia-...