
I’d like to see a model where the effluent of the internet is intelligently filtered out of the pretraining data by LLM and human curation, with much more effort put into digitised archival sources, the entirety of books, and high-quality media transcripts. I imagine it would yield far better baseline output quality, with far less of the current “need” for (over)correction via ultimately disastrous RLHF masking.
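Roughly the curation step I'm picturing, as a sketch. The prompt, thresholds, and score_quality here are all made up; in practice the scorer would be a real LLM call, with a human pass over the borderline bucket:

    # Sketch of LLM-assisted pretraining-data filtering.
    QUALITY_PROMPT = (
        "Rate this text 0-10 for factual density, coherence, and "
        "writing quality. Reply with the number only.\n\n{doc}"
    )

    def score_quality(doc: str) -> float:
        # Placeholder heuristic so the sketch runs; in practice you'd
        # send QUALITY_PROMPT.format(doc=doc) to an LLM and parse the
        # numeric reply.
        words = doc.split()
        return min(10.0, len(set(words)) / max(len(words), 1) * 10)

    def filter_corpus(docs, keep_at=7.0, review_at=5.0):
        kept, review = [], []
        for doc in docs:
            score = score_quality(doc)
            if score >= keep_at:
                kept.append(doc)       # goes straight to pretraining
            elif score >= review_at:
                review.append(doc)     # queued for human curation
        return kept, review            # everything else is effluent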



I'd love to play with a version of GPT-4 fine-tuned on every science textbook written in the last few decades, every published science paper (not just preprints from arXiv), and everything generated by every large research institute. Think NASA, CERN, etc.

Or one tuned on every novel ever written, along with every screenplay.
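You can't touch GPT-4's weights yourself, but the closest home version of either of these is continued pretraining of an open model on the corpus. A rough sketch with Hugging Face transformers, where the gpt2 base and the corpus/ path are placeholders for whatever you'd actually use:

    # Continued pretraining of an open model on a plain-text corpus.
    from datasets import load_dataset
    from transformers import (AutoModelForCausalLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)

    tok = AutoTokenizer.from_pretrained("gpt2")
    tok.pad_token = tok.eos_token  # gpt2 ships without a pad token
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    ds = load_dataset("text", data_files={"train": "corpus/*.txt"})["train"]
    ds = ds.map(lambda b: tok(b["text"], truncation=True, max_length=512),
                batched=True, remove_columns=["text"])

    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="science-lm", num_train_epochs=1),
        train_dataset=ds,
        # mlm=False gives the causal-LM objective (labels = shifted inputs)
        data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
    )
    trainer.train()

Note this is continued pretraining on raw text rather than instruction tuning, which is really what "fine-tuned on every textbook" amounts to.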


So a model fine-tuned on libgen?


Why not?


To be honest, I've been asking myself the same thing. Technically, the amount of "good quality" data in libgen is huge, way larger than the books3 dataset. However, it would probably run afoul of copyright. Then again, a huge amount of the data LLMs already train on is copyrighted.


Training on copyrighted data is arguably fair use in quite a few jurisdictions, to varying extents and with varying levels of precedent, and it's entirely legal for entities based in Japan.


Yes, but acquiring that data is itself illegal in almost all jurisdictions, since libgen is treated as a piracy website. If there were a pipeline to access books from Amazon or the Google Books project for training, it would be a different story.

Still, for certain languages, libgen and public piracy websites are the only sources of scientific or fiction material in digital form. My native language, for example, has no easily accessible e-books at all unless you go through illegal means.

I hope somebody undertakes the steps necessary to train on the entirety of libgen. The number of high-quality tokens in it should be substantial.


Google has the resources to train on Google Books, Google Scholar, and their crawled copy of the whole Internet. No clue what Bard is or isn't trained on, though.


I would gladly pay triple digits a month for exactly that.



