
I’d like to see a model where the effluent of the internet is intelligently filtered out of the pretraining data by LLM and human curation, with much more effort put into digitised archival sources, the entirety of books, and high-quality media transcripts. I imagine it would yield far better baseline output quality, with far less of the current “need” for (over)correction via ultimately disastrous RLHF masking.
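Roughly the curation step I'm picturing, as a sketch. The prompt, thresholds, and score_quality here are all made up; in practice the scorer would be a real LLM call, with a human pass over the borderline bucket:

    # Sketch of LLM-assisted pretraining-data filtering.
    QUALITY_PROMPT = (
        "Rate this text 0-10 for factual density, coherence, and "
        "writing quality. Reply with the number only.\n\n{doc}"
    )

    def score_quality(doc: str) -> float:
        # Placeholder heuristic so the sketch runs; in practice you'd
        # send QUALITY_PROMPT.format(doc=doc) to an LLM and parse the
        # numeric reply.
        words = doc.split()
        return min(10.0, len(set(words)) / max(len(words), 1) * 10)

    def filter_corpus(docs, keep_at=7.0, review_at=5.0):
        kept, review = [], []
        for doc in docs:
            score = score_quality(doc)
            if score >= keep_at:
                kept.append(doc)       # goes straight to pretraining
            elif score >= review_at:
                review.append(doc)     # queued for human curation
        return kept, review            # everything else is effluent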



I'd love to play with a version of GPT-4 fine-tuned on every science textbook written in the last few decades, every published science paper (not just preprints from arXiv), and everything generated by every large research institute. Think NASA, CERN, etc.

Or one tuned on every novel ever written, along with every screenplay.
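You can't touch GPT-4's weights yourself, but the closest home version of either of these is continued pretraining of an open model on the corpus. A rough sketch with Hugging Face transformers, where the gpt2 base and the corpus/ path are placeholders for whatever you'd actually use:

    # Continued pretraining of an open model on a plain-text corpus.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ds = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="science-lm", num_train_epochs=1),
        train_dataset=ds,
        # mlm=False gives the causal-LM objective (labels = shifted inputs)
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()

Note this is continued pretraining on raw text rather than instruction tuning, which is really what "fine-tuned on every textbook" amounts to.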


So a model fine-tuned on libgen?


Why not?


To be honest, I've been asking myself the same thing. Technically, the amount of "good quality" data in libgen is huge, way larger than the books3 dataset. However, it would probably run afoul of copyright. Then again, a huge amount of the data LLMs already train on is copyrighted.


Training on copyrighted data is arguably fair use in quite a few jurisdictions, to varying extents and with varying levels of precedent, and it's entirely legal for entities based in Japan.


Yes, but acquiring that data is itself illegal in almost all jurisdictions, since libgen is treated as a piracy website. If there were a pipeline to access books from Amazon or the Google Books project for training, it would be a different story.

Still, for certain languages, libgen and public piracy websites are the only sources of scientific or fiction material in digital form. My native language, for example, has no easily accessible e-books at all unless you go through illegal means.

I hope somebody undertakes the steps necessary to train on the entirety of libgen. The number of high-quality tokens in it should be substantial.


Google has the resources to train on Google Books, Google Scholar, and their crawled copy of the whole Internet. No clue what Bard is or isn't trained on, though.


I would gladly pay triple digits a month for exactly that.



