
This is great. In 2008, Patrick Collison had a buddy who was using the then-undocumented iPhone internal APIs to make apps. They made a Wikipedia dump[0] so they could have the Hitchhiker's Guide to the Galaxy.

I was surprised that this was only 32GB, as noted below, but if you figure the entire WP dump is ~27GB uncompressed (from a comment below), and that figure covers all languages, you could download a specific language and then modules, e.g. the top 10,000 articles: nature, mechanics, etc.

This thing likely has a global stylesheet, so you are only downloading text. How much info do we really need?
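
To make the "specific language plus modules" idea concrete, here's a rough sketch (mine, not anything the linked project does) that streams the standard enwiki pages-articles XML dump and keeps only a whitelist of titles. The dump and title-list filenames are placeholders, and the export namespace version changes between dump releases:

  # Hypothetical sketch: pull a whitelist of article titles out of an enwiki XML dump.
  import bz2
  import xml.etree.ElementTree as ET

  NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace varies by dump version

  with open("top_10000_titles.txt", encoding="utf-8") as f:
      wanted = {line.strip() for line in f if line.strip()}

  kept = 0
  with bz2.open("enwiki-latest-pages-articles.xml.bz2", "rb") as dump, \
       open("subset.txt", "w", encoding="utf-8") as out:
      # iterparse streams the file, so the tens-of-GB dump never sits in memory at once
      for _, elem in ET.iterparse(dump):
          if elem.tag == NS + "page":
              title = elem.findtext(NS + "title")
              if title in wanted:
                  text = elem.findtext(f"{NS}revision/{NS}text") or ""
                  out.write(f"== {title} ==\n{text}\n\n")
                  kept += 1
              elem.clear()  # drop the processed page element to keep memory flat
  print(f"kept {kept} of {len(wanted)} requested articles")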

Sure, 32GB isn't a lot comparatively, but even 24GB of text is a lot. 1GB is something like 65k pages in a Microsoft Word doc.
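
(Rough check on that figure: 65k pages per GB works out to about 15KB per page, which sounds plausible for a formatted .doc; plain UTF-8 text at a few KB per page stretches even further.)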

[0] http://www.zdnet.com/article/encyclopedia-offline-wikipedia-...




To be clear, those sizes exclude pictures. Some articles are nearly useless without pictures.


Wikipedia is currently ~30-35TB in total. If storage trends continue, you'll be able to store/run the entire corpus locally in 3-5 years.


But how large will the corpus be in 3-5 years?


?? I have an offline copy of the entire (en)wiki on my disk; it's <100GB for images and 12GB for compressed articles (~30GB with the entire edit history). All other languages might double that, still nowhere close to even one TB.



Wikipedia EN is 100GB[0] in an XML dump. You linked to the dump of the entire primary DB, which probably (and this is total speculation) includes all users, editors, edits, languages, usage statistics, and internal metrics.

I don't know if the quoted stat above includes pics, but I believe it is a text dump. You would have to read further to confirm. If they made their usage stats available, you could pull the 10% of that data (10GB) that corresponds to the most frequented articles.

[0] https://en.m.wikipedia.org/wiki/Wikipedia:Size_of_Wikipedia

Edit: maybe you meant what you said and are in fact correct. From a user standpoint, the article text, supplementary material, and some edits are likely the most important parts. However, a massive dump of the entire site and infra would be about that large.


Does that include page histories and talk pages?



