I have 20-40TB (pre-dedup) of PDFs - 8TB is a lot but not even close to the tota...

sporedro · 2024-08-19T14:29:08 1724077748

Just wondering what do you collect? Is it mainly mirroring things like libgen?

I have a decent collection of ebooks/pdfs/manga from reading. But I can’t imagine how large a 20TB library is.

reaperducer · 2024-08-19T15:19:14 1724080754

> Just wondering what do you collect?

I can't speak for the OP, but you can buy optical media of old out-of-print magazines scanned as PDFs.

I bought the entirety of Desert Magazine from 1937-1985. It arrived on something like 15 CD-ROMs.

I drag-and-dropped the entire collection into iBooks, and read them when I'm on the train.

(Yes, they're probably on archive.org for free, but this is far easier and more convenient, and I prefer to support publishers rather than undermine their efforts.)

buildbot · 2024-08-19T16:03:16 1724083396

Yep, a good bit of them are from sources like this :)

buildbot · 2024-08-19T14:34:05 1724078045

No torrents at all in this data, all publicly available/open access. Mostly scientific pdfs, and a good portion of those are scans not just text. So the actual text amount is probably pretty low compared to the total. But still, a lot more than 8TB of raw data out there. I bet the total number of PDFs is close to a petabyte if not more.

tylerflick · 2024-08-19T14:53:49 1724079229

> I bet the total number of PDFs is close to a petabyte if not more.

That's a safe bet. I've seen PDFs in the GBs from users treating it like a container format (which it is).

Maxion · 2024-08-19T15:57:53 1724083073

It's probably tens of petabytes if not more, if you count PDFs that'd be private. Invoices, order confirmations, contracts. There's just so so much.

mehulashah · 2024-08-19T15:08:43 1724080123

Care to make it publicly available? Or is that not permitted on your dataset? Certainly, there are a lot more PDFs out there than 8TB. I bet there’s a lot of redundancy in yours, but it doesn’t dedup well because of all the images.

qingcharles · 2024-08-20T07:12:44 1724137964

I have >10TB of magazines I've collected so far, and I could probably source another 50TB if I had the time. I'm working on uploading them, but I've had too much on my plate lately: https://en.magazedia.wiki/

There is a significant issue with copyright, though. I'll remove anything with a valid DMCA, but 99.9% of the world's historical magazine issues are now in IP limbo as their ownership is probably unknown. Most of the other .1% aren't overly concerned as distribution is their goal and their main income is advertising, not sales.

buildbot · 2024-08-19T16:05:29 1724083529

I think that would be legally iffy for the stuff like collections of old magazines that were purchased on CD/DVD and such :/