At the same time, there are situations where humans are expected to provide sour...

fennecbutt · 2023-12-28T18:07:15.000000Z

The raw technology behind it literally cannot do that.

The model is fuzzy, it's the learning part, it'll never follow the rules to the letter the same as humans fuck up all the time.

But a model trained to be literate and parse meaning could be provided with the hard data via a vector DB or similar, it can cite sources from there or as it finds them via the internet and tbf this is how they should've trained the model.

But in order to become literate, it needs to read...and us humans reuse phrases etc we've picked up all the time "as easy as pie" oops, copyright.

ahepp · 2023-12-29T00:18:13.000000Z

I agree that the model being fuzzy is key aspect of an LLM. It doesn't sound like we're just talking about re-using phrases though. "Simple as pie" is not under copyright. We're talking about the "knowledge" that the model has obtained and in some cases spits out verbatim without attribution.

I wonder if there's any possibility to train the model on a wide variety of sources, only for language function purposes, then as you say give it a separate knowledge vector.

fennecbutt · 2023-12-29T02:10:14.000000Z

Sure, it definitely spits out facts, often not hallucinating. And it can reiterate titles and small chunks of copyright text.

But I still haven't seen a real example of it spitting out a book verbatim. You know where I think it got chunks of "copyright" text from GRRM's books?

Wikipedia. And https://gameofthrones.fandom.com/wiki/Wiki_of_Westeros, https://awoiaf.westeros.org/index.php/Main_Page, https://data.world/datasets/game-of-thrones all the god dammed wikis, databases etc based on his work, of which there are many, and of which most quote sections or whole passages of the books.

Someone prove to me that GPT can reproduce enough text verbatim that it makes it clear that it was trained on the original text first hand basis, rather than second hand from other sources.