a plain old website or a publishing house with distribution services and syndica...

jandrese · on June 9, 2021

At Internet scale it's not a lot of data. Most people who think they have big data don't.

Estimates I've seen put the total Scihub cache at 85 million articles totaling 77TB. That's a single 2U server with room to spare. The hardest part is indexing and search, but it's a pretty small search space by Internet standards.

andyxor · on June 9, 2021

The entire archive actually fits in a small desktop NAS (e.g. QNAP or Synology) with a few 14-18TB drives; you don't even need a server rack.

There is an existing index in SQL format distributed by libgen: https://www.reddit.com/r/scihub/comments/nh5dbu/a_brief_intr..., it is around 30GB uncompressed.

Those 851 torrents uncompressed would probably take half a petabyte of storage, but I guess for serving PDFs you could extract individual files on demand from the zip archives and (optionally) cache them. So the scihub "mirror" could run on a workstation or even a laptop with 32-64GB memory connected to a 100TB NAS over 1GbE, serving PDFs over VPN and using an unlimited traffic plan. The whole setup including workstation, NAS and drives would cost $5-7K.
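A minimal Python sketch of the "extract on demand and cache" idea, using only the standard library. The function name `read_pdf`, the cache size, and the archive and member names in the usage note are all hypothetical; nothing here reflects how any real mirror is built.

```python
import zipfile
from functools import lru_cache


@lru_cache(maxsize=1024)  # assumed cache size; tune to available memory
def read_pdf(archive_path: str, member_name: str) -> bytes:
    """Return one file's bytes from a zip archive without unpacking the rest.

    ZipFile reads the central directory, seeks to the requested member,
    and decompresses only that entry. Repeated requests for the same
    (archive, member) pair are answered from the in-memory LRU cache.
    """
    with zipfile.ZipFile(archive_path) as archive:
        return archive.read(member_name)
```

Something like `read_pdf("some_archive.zip", "some_paper.pdf")` would then touch only the requested entry, so the archives can stay compressed on the NAS.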

It's not a very difficult project and can be done DIY style, if you exclude the proxy part (which downloads papers using donated credentials). Of course it would still be as risky as running Scihub itself, which has a $15M lawsuit pending against it.

dredmorbius · on June 9, 2021

The entire Library of Congress books collection is on the order of 40 million items.

At 5 MB per book, this works out to about 200 TB of disk storage.

At about $12/TB, hosting the entire LoC collection would cost roughly $2,400 presently, with prices halving about every three years.
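The arithmetic above, as a small Python sketch. The inputs (40 million items, 5 MB each, $12/TB, halving every three years) are the figures from this comment; the function names are made up for illustration, and decimal units (1 TB = 1,000,000 MB) are assumed.

```python
def collection_size_tb(items: int, mb_per_item: float) -> float:
    """Total size in TB, using decimal units (1 TB = 1,000,000 MB)."""
    return items * mb_per_item / 1_000_000


def disk_cost_usd(size_tb: float, usd_per_tb: float) -> float:
    """Raw disk cost only: no chassis, power, or redundancy."""
    return size_tb * usd_per_tb


def projected_cost_usd(cost_now: float, years: float, halving_years: float = 3.0) -> float:
    """Cost after `years`, assuming prices halve every `halving_years`."""
    return cost_now * 0.5 ** (years / halving_years)


size = collection_size_tb(40_000_000, 5)   # 200.0 TB
cost = disk_cost_usd(size, 12)             # 2400.0 USD
in_six_years = projected_cost_usd(cost, 6) # 600.0 USD if the halving trend holds
```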

dredmorbius · on June 9, 2021

Note that $2,400 is disks alone. You'd obviously need chassis, power supplies, and racks. Though that's only 17 12 TB drives.

Factor in redundancy (I'd like to see triple-redundant storage on any given site, though since sites are redundant across each other, this might be forgoable). Access time and high demand are likely the big factors, though caching helps tremendously.
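A rough Python sketch of the drive count, with and without triple redundancy. It assumes 200 TB of data and 12 TB drives, treats redundancy as plain full copies, and ignores filesystem overhead; `drives_needed` is a hypothetical helper.

```python
import math


def drives_needed(size_tb: float, drive_tb: float, copies: int = 1) -> int:
    """Whole drives required to hold `copies` full replicas of the data."""
    return math.ceil(size_tb * copies / drive_tb)


single = drives_needed(200, 12)            # 17 drives, one copy
triple = drives_needed(200, 12, copies=3)  # 50 drives, three copies
```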

My point is that the budget is small and rapidly getting smaller. For one of the largest collections of written human knowledge.

There are some other considerations:

- If original typography and marginalia are significant, full-page scans are necessary. There's some presumption of that built into my 5 MB/book figure. I've yet to find a scanned book of > 200 MB (the largest I've seen is a scan of Charles Lyell's geology text, from Archive.org, at north of 100 MB), though there are graphics-heavy documents which can run larger.

- Access bandwidth may be a concern.

- There's a larger set of books ever published, with Google's estimate circa 2014 being about 140 million books.

- There are ~300k "conventionally published" books in English annually, and about 1-2 million "nontraditional" (largely self-published), via Bowker, the US issuer of ISBNs.

- LoC have data on other media types, and their own complete collection is in the realm of 140 million catalogued items (coinciding with Google's alternate estimate of total books, but unrelated). That includes unpublished manuscripts, maps, audio recordings, video, and other materials. The LoC website has an overview of holdings.

Published document scarcity is entirely imposed.

HWR_14 · on June 9, 2021

It still amazes me that 77TB is considered "small". Isn't that still in the $500-$1,000 range of non-redundant storage? Or if hosted on AWS, isn't that almost $1,900 a month if no one accesses it?

I know it's not Big Data(tm) big data, but it is a lot of data for something that can generate no revenue.

smichel17 · on June 9, 2021

> Isn't that still in the $500-$1,000 range of non-redundant storage?

Sure. Let's add redundancy and bump by an order of magnitude to give some headroom -- $5-10k is a totally reasonable amount to fundraise for this sort of application. If it were legal, I'm sure any number of universities would happily shoulder that cost. It's minuscule compared to what they're paying Elsevier each year.

HWR_14 · on June 9, 2021

Sorry. My point was it was a lot of money precisely because it cannot legally exist. If it could collect donations via a commercial payment processor, it could raise that much money from end users easily. Or grants from institutions. But in this case it seems like it has to be self-funded.

pbhjpbhj · on June 9, 2021

I'm prepared to accept "does generate no revenue" but "can generate no revenue" ...?

Perhaps some sort of MTurk or captcha-like tasks per access? Patr[e]ons? Donation drives? Micro-payments? Something else??

HWR_14 · on June 9, 2021

Oh, it could generate revenue if it was legal. But it is not, so it seems difficult.

dredmorbius · on June 9, 2021

For an institution, it's a rounding error.

AWS is not the cheapest bulk-storage hosting possible.

matthewdgreen · on June 10, 2021

Google already does a pretty good job with search. Sci-Hub really just needs to handle content delivery, instead of kicking you to a scientific publisher's paywall.

einpoklum · on June 9, 2021

If the sane price is an optional "Donate to keep this site going" link, then ok. But only free access, without authentication or payment, to scientific papers, is sane. IMHO.

munk-a · on June 9, 2021

Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

The cost of maintaining a free and open DB of scientific advances and publications would be so incredibly insignificant compared to both the value and the continued investment in those advancements.

jpeloquin · on June 10, 2021

> Might this be a case where the best resolution would be to have the government (which is at least partially funding nearly all of these papers) step in and add a ledger of papers as a proof of investment?

I feel that we're halfway there already and are gaining ground. Does Pubmed Central [0] (a government-hosted open access repository for NIH-funded work) count as a "ledger" like you're referring to? The NSF's site does a good job of explaining current US open access policy [1]. There are occasional attempts to expand the open access mandate by legislation, such as FASTR [2]. A hypothetical expansion of the open access mandate to apply to all works from /institutions/ that receive indirect costs, not just individual projects that receive direct costs, would open things up even more.

[0] https://www.ncbi.nlm.nih.gov/pmc/

[1] https://www.nsf.gov/pubs/2016/nsf16009/nsf16009.jsp#q1

[2] https://sparcopen.org/our-work/fastr/

einpoklum · on June 9, 2021

Well, some research venues (and publication venues) are not government-funded, and even if they are indirectly government funded, it's more of a sophistry than something which would make publishers hand over copies of the papers.

Also, a per-government ledger would not be super-practicable. But if, say, the US, the EU and China would agree on something like this, and implement it, and have a common ledger, then it would not be such a big leap to make it properly international. Maybe even UN-based.

That's a pretty big "if" though.

posterboy · on June 17, 2021

I share the sentiment insofar as free access would benefit my own sanity, except when it is about hoarding.

On the other hand, there is a slippery slope to decide what isn't scientific so much as to not be required open knowledge.

By the way, specialist knowledge and open knowledge are kind of a dichotomy. You would need to define the intersection of both. Suddenly you are looking at a patent system. Pay to quote, citation fees; news websites are already demanding this from Google, here in Germany, including Springer Press.

whimsicalism · on June 9, 2021

Libgen's coverage is definitely more shallow than scihub, but it is still pretty good.