
I think it's worth noting that EleutherAI is a grassroots collection of researchers, which distinguishes it from academia/industry labs.

As part of their work on democratizing AI, they're now hoping to replicate GPT-3 and release it for free (unlike OpenAI's API).

I would encourage everyone interested to join their discord server (https://discord.gg/BK2v3EJ) -- they're extremely friendly and I think it's a project worth contributing to.


How are they sourcing/funding the compute to train these massive models?


TFRC, a program that lets you borrow Google's TPUs when they're not being used. You can apply here: https://www.tensorflow.org/tfrc


Connor Leahy, who I think is a sort of BDFL figure for EleutherAI, mentioned in a Slatestarcodex online meetup I attended that Google donated millions of dollars worth of preemptible TPU credits to the project. There is a video of the meetup on YouTube somewhere. He struck me as a really smart kid with a lot of passion.


Haha Connor (although one of the main participants) definitely isn't a BDFL - we don't have any BDFLs :)

We don't really have much of a hierarchy at all - it's mostly just a collection of researchers of widely varying backgrounds all interested in ML research.


I'm not sure what a BDFL figure is, but Google does not give us millions of dollars. We are a part of TFRC, a program where researchers and non-profits can borrow TPUs when they're not being used. You could say that we are indirectly funded as a result, but it's nowhere near millions of dollars and it doesn't reflect any kind of special relationship with Google.


Benevolent dictator for life


EleutherAI has a very flat hierarchy; we do not have any BDFL-like figure.


They'll probably run it on the scientific clusters of various universities, or on collections of idle lab desktop machines. Both tend to sit idle a lot of the time, based on my experience at uni in Europe.


Any idea how large the dataset used to train GPT-3 was?


570GB of Common Crawl after filtering, though only 40% of the CC data was seen even once during training, and CC makes up only 60% of the training data. You could work through the math to get a rough size for GPT-3's training data, but it sounds like The Pile is of comparable size.
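
For what it's worth, here is a rough back-of-the-envelope sketch using only the figures in this thread (not official numbers; it treats the 60% share as applying to the data actually seen, which is a simplification):

    # Rough estimate of GPT-3's effective training-data size, using only the
    # figures quoted above (not official numbers).
    cc_filtered_gb = 570      # filtered Common Crawl
    cc_fraction_seen = 0.40   # ~40% of CC seen at least once during training
    cc_share_of_mix = 0.60    # CC is ~60% of the training data

    cc_seen_gb = cc_filtered_gb * cc_fraction_seen    # ~228 GB
    total_seen_gb = cc_seen_gb / cc_share_of_mix      # ~380 GB

    print(f"CC actually seen: ~{cc_seen_gb:.0f} GB")
    print(f"Implied total data seen: ~{total_seen_gb:.0f} GB")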


Yeah, the Pile is approximately the size of the GPT-3 training data, which is not a coincidence--one major reason we created the Pile (though certainly not the only one) was for our GPT-3 replication project.


I'm curious -- what does HN think of factor investing [0]? It has been shown to outperform the total market over long periods of time, and many new ETFs now target it. Does anyone here tilt towards small-cap value? Do you think those effects will last now that they're more widely known, or are the last 15 years evidence of them weakening? I've been looking into investing, but I'm probably going with a total world stock market fund. Part of the reason I find those factor ETFs less attractive is the higher associated fees as well as their more "active" look. I have trouble believing anyone who claims there is a way to consistently outperform the market while charging me for it.

[0] https://www.investopedia.com/terms/f/factor-investing.asp


Currently (at least for The-Eye) it's about IPFS's barrier to entry. I expect LibGen's case to be similar. Most people don't know about it, and even those who do, if they had to learn how IPFS works etc., would probably just try to find the book they're looking for elsewhere.


No need to conflate the frontend (the end-user interface that 'most people' use when trying to 'find the book they're looking for') with the mirroring/archiving backend (the distributed/p2p technology used to 'make sure LibGen never goes down').

The frontend would still be a user-friendly HTTP web-application (or collection of several) that pulls (portions of) the archive from the distributed/resilient backend to serve individual files to clients.

The backend can be a relatively obscure, geeky, post-BitTorrent p2p software like IPFS or Dat, as long as those willing to donate bandwidth/storage can run it on their systems. This is a vastly different audience from 'most people'.
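
To make the split concrete, here is a minimal sketch of the frontend side, assuming the backend is IPFS and that fetching through an HTTP gateway (public or self-hosted) is acceptable; the CID is a placeholder, not a real LibGen file:

    # Minimal sketch: a frontend pulling one file from an IPFS backend via an
    # HTTP gateway. The gateway choice and the CID are placeholders.
    import requests

    GATEWAY = "https://ipfs.io/ipfs/"   # or a self-hosted gateway

    def fetch_from_backend(cid: str) -> bytes:
        """Fetch a single file from the p2p backend over plain HTTP."""
        resp = requests.get(GATEWAY + cid, timeout=60)
        resp.raise_for_status()
        return resp.content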

The real question is which software's features best fit the backend use-case (efficiently hosting a very large and growing/evolving, IP-infringing dataset). Dat [1] has features to (1) update data and efficiently synchronize changes, and (2) efficiently provide random access into larger datasets. Two quite compelling advancements over BitTorrent for this use-case.

[1] https://docs.datproject.org/docs/faq#how-is-dat-different-th...


I am not fully aware of how IPFS operates, but wouldn't it at least solve the back-end mirroring? Front-end servers would then "only" need to access IPFS for continuous syncing of metadata (for search) and for fetching files upon user request.


True, I too find it not ideal, but surely having such a massive library available over it would increase the interest in lowering the barrier to entry?


Microsoft's Project Silica [0] may hopefully provide really long-term, large-capacity, archive-grade storage on Earth. I wonder what effects interstellar radiation would have on it.

[0] https://www.theverge.com/2019/11/4/20942040/microsoft-projec...


Glass is pretty inert, full stop. It would depend on the voxel size, but I imagine as long as you have more than a few hundred atoms per voxel/bit you will have survivability on the order of millennia, even in high-radiation environments. Someone would have to do the nuclear cross-section calculations to get a real bit error rate, but glass is very tough stuff.


There are ways to do so. The archive is made up of many, many torrents (I believe the database is updated monthly if not biweekly). If you have the storage/bandwidth for the whole 32 TB, please get in touch and I may be able to help you get the whole deal without too much hassle. Otherwise, just pick some torrents (it would be best to pick them based on torrent health, but there are too many to check manually) and try to keep seeding as much as possible.

EDIT: To find libgen's torrents health, check out this google sheet: https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...

Thanks frgtpsswrdlame for the heads up.


If LibGen can announce all of the torrents in a JSON payload with health metadata, seedboxes could consume it for automated seeding and prioritization. Check out ArchiveTeam's Warrior project JSON payload [1] for inspiration. It need not even be generated on demand; render it on a schedule and distribute it at known endpoints.

[1] https://warriorhq.archiveteam.org/projects.json
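
As a sketch of what consuming such a payload could look like (the endpoint URL and field names below are made up for illustration; LibGen publishes no such feed today):

    # Hypothetical consumer of a LibGen "torrents + health" JSON feed.
    # The URL and the schema (torrents/name/seeders) are assumptions.
    import requests

    FEED_URL = "https://example.org/libgen-torrents.json"

    def torrents_needing_seeders(max_torrents=20):
        """Return the torrents with the fewest seeders, most urgent first."""
        feed = requests.get(FEED_URL, timeout=30).json()
        torrents = feed["torrents"]
        torrents.sort(key=lambda t: t.get("seeders", 0))
        return torrents[:max_torrents]

    for t in torrents_needing_seeders():
        print(t["name"], t["seeders"])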


Actually there is now a google sheet which shows the health of the torrents so it should be easy to pick the most helpful torrents. It's linked in this post: reddit.com/e3yl23


I'm pretty surprised by the lack of seeders. Out of the 2438 torrents listed, a third have 0 seeders, another third have 1 seeder, and all but 5 have fewer than 10. Hopefully the publicity boosts those numbers.


From what I've heard a good chunk of people rotate their seeds for LibGen because their seedboxes can't handle all the connections for every torrent at once.


Is there some tool or documentation describing this practice?


I'm sure someone could get you the info to get set up as a seeder. For modern clients it's rather trivial to manage that many torrents. Get any decent modern CPU, 4 GB+ of RAM, and $560 in storage and you're off.


I think the problem is that, because of the size of each torrent and the fact that there are 1000 of them, it's difficult to seed everything at once, so people would rather seed sections at a time and rotate through them.

I'm not sure how people set up the rotation though; that can't be an incredibly common feature, but I could be wrong.


There are client features that prioritize torrents with a low seed/leech ratio in a sort of periodic fashion. It also partially auto-balances, because a swarm only needs a little more than a unity ratio injected into it to get itself fully replicated. So each torrent that gets chosen because of a low seed/leech ratio will inherently drop out of that criterion as soon as its swarm is self-sufficient.
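
A rotation along those lines is also easy to script yourself; here is a sketch only, assuming you already have per-torrent seeder/leecher counts from your client or from the spreadsheet linked above (the torrent names are placeholders):

    # Sketch of one rotation pass: seed the N torrents whose swarms look weakest.
    def pick_rotation(stats, n=50):
        """stats: list of dicts like {"name": ..., "seeders": int, "leechers": int}."""
        def weakness(t):
            # Few seeders and a low seed/leech ratio both indicate a weak swarm.
            return (t["seeders"], t["seeders"] / max(t["leechers"], 1))
        return sorted(stats, key=weakness)[:n]

    # Seed these for a while, then re-run with fresh stats and rotate.
    stats = [
        {"name": "libgen_part_000.torrent", "seeders": 0, "leechers": 3},
        {"name": "libgen_part_001.torrent", "seeders": 7, "leechers": 1},
    ]
    for t in pick_rotation(stats, n=1):
        print("seed next:", t["name"])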


Why doesn't someone maintain a single torrent containing a snapshot of the full archive at a given point in time, updated (say) monthly?

I want a full mirror, and ain't nobody got time to deal with 2000 torrents, many of which have no seeders. That's a really dumb way to run this particular railroad.


Because torrent clients can't handle that many pieces in a single torrent. There are algorithms involved that are super-linear, maybe even quadratic or worse, and they start causing trouble in the TB range.

Also, the UI for adding many torrents is much nicer than the UI for selecting a non-trivial subset of files inside a single torrent. And many parts of the ecosystem handle partial seeds poorly: clients that only seed a subset, and will do so for the foreseeable future without leeching any other parts, often get treated as leechers despite not really being leechers.

TL;DR: 2k torrent files are just a watch folder and a cp * watchfolder/ away from working. One fat 32 TB torrent, however, does not scale.
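
To put rough numbers on why the single fat torrent hurts (an illustration, not measurements from any particular client), remember that BitTorrent v1 stores a 20-byte SHA-1 hash per piece in the metainfo alone:

    # Piece-count back-of-the-envelope for a single 32 TB torrent.
    TB = 1024**4

    for piece_size_mib in (1, 4, 16):
        pieces = 32 * TB // (piece_size_mib * 1024**2)
        metainfo_mb = pieces * 20 / 1e6   # 20-byte SHA-1 hash per piece (v1)
        print(f"{piece_size_mib:>2} MiB pieces: {pieces:,} pieces, "
              f"~{metainfo_mb:.0f} MB of piece hashes")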


Thanks! I don't have 32TB free locally at the moment but I might soon. If and when that happens, I'll get in touch :)


There are groups working on data curation as well, though it is much harder. LibGen grows by about 230 GB per month, while SciMag grows by around 1.10 TB per month, and we should expect those numbers to increase in the future. The man-hours required to curate those databases may very well cost much more than the storage and bandwidth required to keep duplicates and incorrectly tagged files. In any case, as I said, there are people seriously interested in curating the LibGen database, though most efforts I know of are still in the earliest stages.


Do you know if they process PDFs to reduce file size?


A lot of the data is in the DjVu format, which is very efficient for scanned books.


This is an extremely important effort. The LibGen archive contains around 32 TB of books (by far the most common being scientific books and textbooks, with a healthy dose of non-STEM). The SciMag archive, backing up Sci-Hub, clocks in at around 67 TB [0]. This is invaluable data that should not be lost. If you want to contribute, here are a few ways to do so.

If you wish to donate bandwidth or storage, I personally know of at least a few mirroring efforts. Please get in touch with me over at legatusR(at)protonmail(dot)com and I can help direct you towards those behind this effort.

If you don't have storage or bandwidth available, you can still help. Bookwarrior has requested help [1] in developing an HTTP-based decentralizing mechanism for LibGen's various forks. Those with experience in software may help make sure those invaluable archives are never lost.

Another way of contributing is by donating bitcoin, as both LibGen [2] and The-Eye [3] accept donations.

Lastly, you can always contribute books. If you buy a textbook or book, consider uploading it (and scanning it, should it be a physical book) in case it isn't already present in the database.

In any case, this effort has a noble goal, and I believe people of this community can contribute.

P.S. The "Pirate Bay of Science" is actually LibGen, and I favor a title change (I posted it this way as to comply with HN guidelines).

[0] http://185.39.10.101/stat.php

[1] https://imgur.com/a/gmLB5pm

[2] bitcoin:12hQANsSHXxyPPgkhoBMSyHpXmzgVbdDGd?label=libgen, as found at http://185.39.10.101/, listed in https://it.wikipedia.org/wiki/Library_Genesis

[3] Bitcoin address 3Mem5B2o3Qd2zAWEthJxUH28f7itbRttxM, as found at https://the-eye.eu/donate/. You can also buy merchandise from them at https://56k.pizza/.


Sounds like anyone with a seedbox could donate some bandwidth and storage by leeching and then seeding part of it? It would be nice if there were a list of seeder/leecher counts (like TPB), or better yet a priority list of parts that need more seeders.

Edit: Found the other comment where you link to the seeding stats: https://docs.google.com/spreadsheets/d/1hqT7dVe8u09eatT93V2x...


Or better yet, an RSS feed that plays nicely with auto-retention and quota settings. It would just deliver you a bunch of parts that are in need of seeders, and you'd use your existing mechanism to help with them.


For important archives like this, maybe we need some sort of turn-key solution for the masses? Like a Raspberry Pi image that maintains a partial mirror. Imagine if one could buy an RPi and an external HD, burn the image, and connect it to some random WiFi network (at home, at work, at the library, etc.).
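
Something along these lines could even be the core of such an image -- a sketch only, assuming the health spreadsheet is exported locally as a CSV with (hypothetical) name, seeders and size_gb columns, and that the chosen parts are then handed to whatever torrent client the image ships with:

    # Sketch for a turn-key partial mirror: pick under-seeded parts that fit
    # the attached disk. CSV columns (name, seeders, size_gb) are assumed.
    import csv

    QUOTA_GB = 900   # e.g. a 1 TB external drive, leaving some headroom

    def choose_parts(csv_path, quota_gb=QUOTA_GB):
        with open(csv_path, newline="") as f:
            rows = list(csv.DictReader(f))
        rows.sort(key=lambda r: int(r["seeders"]))   # weakest swarms first
        chosen, used = [], 0.0
        for r in rows:
            size = float(r["size_gb"])
            if used + size <= quota_gb:
                chosen.append(r["name"])
                used += size
        return chosen, used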


I'm not hosting a copy of this at work (where we easily have 32TB on old hardware) since distributing it is copyright infringement. The same goes for my home connection.


Most people don't care. The chance anything at all bad will happen is so incredibly low.


This isn't even movies, where some large studios can send notices. I don't think publishing houses have the funds to send that many legal notices.

Books are a safe bet to pirate.


That's what people said about music and films too. You don't want to be the next Jammie Thomas.

This is an existential threat to the deep-pocketed likes of Elsevier et al. They will use the law to make an example of anyone too close to their sphere of influence. So if you are in the US or the EU, support the efforts of LibGen vocally and loudly, and contribute anonymously, but don't risk your neck to the extent that they can get a hold of you.

There are plenty of ways to support the effort safely though. Make sure people who wish to access scientific papers and books know where to go, and make sure your elected officials know about the need for publicly funded science to be published free of charge and open access (retroactively too).


I'm guessing a pretty significant minority of HN's users maintain offshore seedboxes to get other copyrighted content, and for them it might be pretty trivial to add partial peering of LibGen content.


I think a turn-key solution for people living outside the US/EU would still help the general health of the archive.


At least the large academic publishers are sitting on enormous stacks of cash, so that argument doesn't fly.


I just read the article and your comments here, and I'm a bit unsure what the difference from the Internet Archive is. Is it that the IA can archive the books but not make them public for legal reasons, while The-Eye is more focused on keeping them online and accessible no matter what?


Yes. It is extremely likely IA has the LibGen corpus archived, but darked (inaccessible), to prevent litigation.


There are quite a few such copies, on the 'just in case' principle.


> Lastly, you can always contribute books. If you buy a textbook or book, consider uploading it (and scanning it, should it be a physical book) in case it isn't already present in the database.

There's no easy solution for scanning physical books, is there?


There are providers [1] that will destructively scan the book for you and return a PDF. If you want to preserve the book, you're stuck using a scanning rig [2]. The Internet Archive will also non-destructively scan as part of Open Library [3], but they only permit one checkout at a time of scanned works, and the latency can be high between sending them a book and it becoming available. FYI, 600 DPI is preferred for archival purposes.

[1] http://1dollarscan.com/ (no affiliation, just a satisfied customer; they can't scan certain textbooks due to publisher threats of litigation)

[2] https://www.diybookscanner.org/

[3] https://openlibrary.org/help/faq


A big +1 for 1dollarscan.com. They've scanned many hundreds of books for me. The quality of the resulting PDFs is uniformly excellent, their turnaround time is fast, and their prices are cheap ($1 per 100 pages).

I've visited their office -- located in an inexpensive industrial district of San Jose -- on multiple occasions. They have a convenient process for receiving books in person.

I believe the owners are Japanese and the operation reminds me of the businesses I visited in Tokyo: quiet, neat, and über-efficient.


> quiet, neat, and über-efficient

I wish the same could be said for the Tokyo office I work in!


I will add a vote for bookscan.us, which I have been using since 2013 or so. Very reasonable prices and great service.


There are DIY book scanners (http://diybookscanner.org) and products such as the Fujitsu ScanSnap SV600. The SV600 has decent features like page-detection and finger-removal (I recommend using a pencil's eraser tip). I have personally used it to scan dozens of books, with satisfactory results.


I just saw a father who had to do it fully manually for his blind daughter. I shall show your comment to him.


Scanning with your phone is getting easier. At a minimum you can take a pic of each of the pages. Software can clean up the images, sorta. It's not ideal but it's better than nothing.


I remember when "cammed" books were bottom-tier and basically limited to things like 0day releases, even when done with an expensive DSLR. It's amazing how much camera technology has progressed since then; in less than a second you can get a high-resolution, extremely readable image of each page.

I used to participate in the "bookz scene", well over a decade ago. Raiding the local public libraries --- borrowing as many books as we could --- and having "scanparties" to digitise and upload them was incredibly fun, and we did it for the thrill, never thinking that one day almost all of our releases would end up in LibGen.


I found vFlat to be magical in cleaning up book scan images you took with your phone.

https://play.google.com/store/apps/details?id=com.voyagerx.s...


>This app is incompatible with your device.

my disappointment is immeasurable and my day is ruined


I use bookscan.us for this purpose: I mail the physical book to them and they send me a file a few days later for a very reasonable price.


Unfortunately it’s a destructive process.


Your local physical library may make a book scanner available. Mine does, with a posted 60-pages-at-a-time limit (though I don't know how this is enforced).


Mind explaining the origin of your 32 TB figure? I must be missing something enormous, but as far as I can tell the SciMag database dump is 9.3 GB, the LibGen non-fiction dump is 3.2 GB, and the LibGen fiction dump is 757 MB. That's a pretty huge divergence.

Source: http://gen.lib.rus.ec/dbdumps/


Oh, wait. I'm dumb. I see that your first link is a citation.

Continuing to be dense, why is there a difference between their "database dump" and the total of all the files they have?


The databases contain the metadata (authors, edition, ISBN, etc.) for the books.

Thus, 32 TB of books (over 2 million titles), 3.2 GB database.


Ah, that makes sense.

To make sure I'm understanding this correctly:

The Libgen Desktop application (which requires only a copy of the database) would then use the DB metadata to make LibGen locally searchable, and would only retrieve the individual books/papers on request?
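
In code terms, I imagine the flow looks something like this (a sketch only; the table and column names are illustrative rather than the actual dump schema, and the real dumps target a full SQL server rather than SQLite):

    # Sketch of the "search locally, fetch on demand" flow.
    # Table/column names are assumptions; check the real dump schema.
    import sqlite3

    def search_local(db_path, query):
        con = sqlite3.connect(db_path)
        rows = con.execute(
            "SELECT title, author, md5 FROM books WHERE title LIKE ?",
            (f"%{query}%",),
        ).fetchall()
        con.close()
        return rows   # each hit's md5 is then used to request the actual file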


I guess it's stunningly obvious to everyone else, but how are you certain the replacement isn't worse than the original system? I already see comments about the curation problem, for example. What's the point in making bad information (duplicate information, etc.) highly available? Why put so much faith in this donation strategy, i.e. donating bandwidth or donating money?


Associated video: https://www.youtube.com/watch?v=-ZNEzzDcllU [4:42 mins]


A quick guide for neuroglancer:

_spacebar_ on one area: fullscreen the area

L: change colors

X: remove colors

double click one neuron: select the neuron

ctrl-mousewheel: zoom in/out

right-click: center clicked area

A great thing to do is to choose a few neurons (double click on the colored parts in the top-left, top-right, bottom-right windows) and then view the automatic 3d model by using _spacebar_ on the bottom-left window. Press _spacebar_ again to return to the normal view.


I know the term democide was mostly coined by Rummel. Are you by chance referring to him? If so, could I ask for your sources? I know that Rummel's data has been heavily criticized by many historians. His work usually takes the largest estimate, which is fruitful if you want big titles (à la Black Book of X), but unfruitful if you are interested in the truth.


>unfruitful if you are interested in the truth.

Do you really think it's possible to find "the truth" of the numbers when it comes to things like democide?


No, no historian thinks he will find the definitive truth. My point is that there are valid historians (I only know about a few fields, but you can see which historians are valid by how much their peers respect their work) who at least try to get as close as possible to the probable number of, in this case, deaths. One example: taking a look at Rummel's data on the Soviet Union, he reports 60 million deaths between 1917 and 1987, which is laughable (he even claimed that was a low estimate!). That's the number reported by Solzhenitsyn, who is not considered a reliable source in contemporary Sovietology (probably not a source at all). Here is a relevant thread on r/AskHistorians if you're interested: https://old.reddit.com/r/AskHistorians/comments/3v5u2t/are_r...


Is the irony of saying Solzhenitsyn is unreliable while linking a Reddit thread as a better source lost on you?


I'm really sorry an AskHistorians thread is not good enough for you. Let me paste Jonathan Smele's remark about Rummel's work (Smele is known for his compiled bibliography on the Russian Revolution):

>A poorly researched, obsessively anti-Soviet polemical general survey. [1]

Unsurprisingly, you would have read that exact quote in the reddit thread I linked. Guess that was too much to ask for.

[1] Jonathan Smele, Russian Revolution and Civil War Annotated Bibliography


What's wrong with being anti-Soviet?


The point is that in this case Rummel doesn't even attempt to be objective. You can be biased, everyone is. A real historian, someone who is respected by his peers, actively tries to remove his bias from his work. It isn't really profitable (you earn much more by writing extremely biased pop-history) but some individuals, such as me, value it.


It's pretty hard to get an objective and accurate count here; unlike the Nazis, the Soviets (and I imagine the Chinese et al.) did not keep meticulous accounting even of people specifically executed, let alone of those keeling over in prison camps despite technically being sentenced only to imprisonment. Solzhenitsyn himself said, at least initially, that his numbers were extremely approximate, based on what personal research he could do, and that he hoped future historians would do better.

Having said that, 60 million killed in the USSR alone would seem highly suspicious; 60 million repressed sounds quite feasible. And as Stalin would acknowledge (the death of one is a tragedy, the death of millions is a statistic), in a way it doesn't really matter -- the communists killed as many people as the Nazis, if not significantly more, but because they usually did it to their own population, and weren't defeated by an invading army, they escaped anything resembling a Nuremberg Tribunal. To this day Nazi defenders keep quiet in decent society, but Soviet apologia is pretty rampant.


Well, in fact, they did. It's why pretty much every Sovietology book written pre-1991 is now rarely recommended. The Soviet archives were opened with the fall of the USSR, allowing historians to form a much more accurate view of what was happening. Sure, there are caveats (e.g. people freed from the gulag in a dying state), yet contemporary historians have been able to provide estimates based on the best data available. The thing with Solzhenitsyn is that he had no access to any kind of data, and as a refugee in the US it's clear he was extremely biased. Note that I'm not saying I don't understand why he did that; I'm just saying that, if you want a good estimate of the numbers related to the Gulag system, Solzhenitsyn is not a good source.

Regarding your last points: 60 million repressed is entirely dependent on how you define "repressed". Some people may argue that not having access to health care and education means being repressed. I'm just noting that it's not a very useful figure: it's already difficult to estimate the number of deaths, and estimating something as vague as the number of "repressed" people will be next to impossible. Regarding your claim that communists killed as many if not more people than the Nazis, I really dislike this kind of comparison. It is an extremely politicized topic and removes the human part of the statistics and the historical context. Even then, I would still disagree with your claim, unless you count Mao's Great Leap Forward as part of it, which is really a comparison between apples and oranges. I personally believe state-pursued, systematic genocides should be differentiated from famines (such as the Holodomor and the famines related to the Great Leap Forward), which were more a consequence of the regime's terrible management (plus an already low harvest in the case of the Holodomor). If you want to take a look, the Wikipedia page on excess mortality under Stalin is not too bad, especially the source it uses (Wheatcroft's Victims of Stalinism and the Soviet Secret Police).

I don't see Nazi defenders keeping quiet; it's enough to think of the growing "Identity Politics" movement. Soviet apologia rarely refers to the Stalinist period, especially considering Khrushchev's 1956 "De-Stalinization" speech. And if you only consider the USSR of Khrushchev and beyond, while definitely worth criticizing, I'm not sure it's even comparable to the Nazi regime.


Not really. Getting anything useful from the archives is still like pulling teeth, and quite a few documents still have not been declassified. Solzhenitsyn only had access to his own experience and to interviews with very few people (obviously, this research was not exactly looked upon with approval). But he was also the first to make note of it.

I mean specifically repressed through the state apparatus, not lacking Internet access. Of course there are different degrees of that repression here, from being executed on the spot, to dying of hunger and overwork, all the way down to being relocated 101+ kilometers from a big city.

I would very much count victims of the Holodomor (as well as of lesser-known famines in Central Asia etc.) and the Great Leap -- it does not matter to the dying whether the State actively wanted to kill them or just did not care the least bit whether they died or not. And there are studies from real historians showing that conditions in Soviet labor camps, during peacetime, were worse than in Nazi extermination camps (I am not sure if there are English copies online, but certainly Russian ones are accessible).

Absolute numbers may not mean all that much (although as far as exterminating their own citizenry goes, the communist Khmer Rouge have no equals), but it is still important to remember and never forget -- it's not a competition over who is more evil, it's about making sure that evil does not get repeated.

The USSR of Khrushchev and beyond -- it's also the place where workers demonstrating for decent pay and food for their children were fired upon by the army, where slight liberalization efforts in satellite countries were crushed with tanks, where dissidents had their brains fried in psychiatric hospitals, etc. etc. Sure, it was much better than the Reich in the 30s-40s, or the USSR itself in the same period, but that is also an apples-to-oranges comparison. Had the Reich been contained instead of defeated, chances are it would have evolved into something similar by then as well. But Soviet apologia usually does not make the distinction anyway.

