Hacker News new | past | comments | ask | show | jobs | submit login

I have 20-40TB (pre-dedup) of PDFs - 8TB is a lot but not even close to the total number of PDFs available.



Just wondering what do you collect? Is it mainly mirroring things like libgen?

I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is.


Just wondering what do you collect?

I can't speak for the OP, but you can buy optical media of old out-of-print magazines scanned as PDFs.

I bought the entirety of Desert Magazine from 1937-1985. It arrived on something like 15 CD-ROMS.

I drag-and-dropped the entire collection into iBooks, and read them when I'm on the train.

(Yes, they're probably on archive.org for free, but this is far easier and more convenient, and I prefer to support publishers rather than undermine their efforts.)


Yep, a good bit of them are from sources like this :)


No torrents at all in this data, all publicly available/open access. Mostly scientific pdfs, and a good portion of those are scans not just text. So the actual text amount is probably pretty low compared to the total. But still, a lot more than 8TB of raw data out there. I bet the total number of PDFs is close to a petabyte if not more.


> I bet the total number of PDFs is close to a petabyte if not more.

That's a safe bet. I'v seen PDF's in the GBs from users treating it like a container format (which it is).


It's probably tens of petabytes if not more, if you count PDFs that'd be private. Invoices, order confirmations, contracts. There's just so so much.


Care to make it publicly available? Or is that not permitted on your dataset? Certainly, there’s a lot more PDFs out there than 8TB. I bet there’s a lot of redundancy in yours, but doesn’t dedup well because of all the images.


I have >10TB of magazines I've collected so far, and I could probably source another 50TB if I had the time. I'm working on uploading them, but I've had too much on my plate lately: https://en.magazedia.wiki/

There is a significant issue with copyright, though. I'll remove anything with a valid DMCA, but 99.9% of the world's historical magazine issues are now in IP limbo as their ownership is probably unknown. Most of the other .1% aren't overly concerned as distribution is their goal and their main income is advertising, not sales.


I think that would be legally iffy for the stuff like collections of old magazines that were purchased on CD/DVD and such :/




Consider applying for YC's Spring batch! Applications are open till Feb 11.

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: