This was mostly political guff about environmentalism and bias, but one thing I didn't know was that apparently larger models make it easier to extract training data.
> Finally, we note that there are risks associated with the fact that LMs with extremely large numbers of parameters model their training data very closely and can be prompted to output specific information from that training data. For example, [28] demonstrate a methodology for extracting personally identifiable information (PII) from an LM and find that larger LMs are more susceptible to this style of attack than smaller ones. Building training data out of publicly available documents doesn’t fully mitigate this risk: just because the PII was already available in the open on the Internet doesn’t mean there isn’t additional harm in collecting it and providing another avenue to its discovery. This type of risk differs from those noted above because it doesn’t hinge on seeming coherence of synthetic text, but the possibility of a sufficiently motivated user gaining access to training data via the LM. In a similar vein, users might query LMs for ‘dangerous knowledge’ (e.g. tax avoidance advice), knowing that what they were getting was synthetic and therefore not credible but nonetheless representing clues to what is in the training data in order to refine their own search queries
Shame they only gave that one graf. I'd like to know more about this. Again, miss me with the political garbage about "dangerous knowledge", the most concerning thing is the PII leakage as far as I can tell.
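For what it's worth, the extraction work they cite (the Carlini et al. training-data-extraction paper, if I remember right) mostly boils down to "generate a huge pile of samples, rank them by how suspiciously confident the model is about them, then eyeball the top of the list". Here's a toy sketch of that generate-then-rank loop. This is my own reconstruction for illustration, not their actual pipeline; GPT-2 via HuggingFace and the perplexity/zlib ratio are just stand-ins:

```python
# Toy sketch of the "generate, then rank by a memorisation signal" idea from the
# training-data extraction literature. My own reconstruction, not the cited paper's
# exact method; gpt2 and the perplexity/zlib ratio are stand-ins for illustration.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return float(torch.exp(loss))

def zlib_entropy(text: str) -> int:
    # crude proxy for how much "novel" content a string carries
    return len(zlib.compress(text.encode("utf-8")))

# 1. Sample a pile of unconditioned generations from the model.
start = torch.tensor([[tok.eos_token_id]])
samples = []
for _ in range(200):
    out = model.generate(start, do_sample=True, top_k=40, max_length=64,
                         pad_token_id=tok.eos_token_id)
    samples.append(tok.decode(out[0], skip_special_tokens=True))

# 2. Flag generations the model finds suspiciously easy (low perplexity) relative
#    to how incompressible they are: a hint that they may be regurgitated training
#    data rather than merely fluent text. Manual review of the top candidates
#    comes after this.
candidates = sorted((s for s in samples if len(s.split()) >= 5),
                    key=lambda s: perplexity(s) / zlib_entropy(s))
for text in candidates[:10]:
    print(repr(text[:120]))
```

As I understand it, the real attacks add conditioning on scraped internet prefixes and comparisons against a reference model on top of this, plus a lot of manual review, but the core loop really is about that simple.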
Is this a good or a bad thing? On one side we hear about "hallucination": you can't rely on the LLM, it is not like a search engine. On the other side we hear "it memorises PII".
Being able to memorise information is exactly what we demand when we ask for the top 5 countries in Europe by population or the height of Everest, but we don't want it in other contexts.
Is it conceivable that a model could leak PII that is present in the data set but extremely hard to detect there? For example, PII spread across very different documents in the corpus that aren't obviously related, but that the model could synthesize relatively easily?
That is more or less an understood fact even with models like Copilot & ChatGPT. With the amount of information being churned through, not all PII gets scrubbed, and these LLMs could well be running on unsanitized data - like a cached copy of the Web from Archive.org, Getty Images & the like.
I feel this is an unavoidable consequence of using LLMs. We cannot ensure all the data is free of identifying markers. I am not an expert on databases/data engineering, so please take this as an informed opinion.
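Scrubbing is also weaker in practice than it sounds, because a lot of it boils down to pattern matching. A throwaway illustration (my own, with deliberately naive regexes, not anyone's real redaction pipeline):

```python
# Deliberately naive PII scrubber, just to show why "the data was scrubbed" is a
# weaker guarantee than it sounds. The patterns here are illustrative, not a real
# redaction pipeline.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_PHONE = re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = US_PHONE.sub("[PHONE]", text)
    return text

doc = ("Reach Jane Doe at jane.doe@example.com or 555-867-5309. She is the only "
       "paediatric oncologist in Smallville, lives on Elm Street, and her badge "
       "number is 4471.")

print(scrub(doc))
# The email and phone number get masked, but the name, the uniquely identifying
# job-plus-town combination, the street, and the badge number all sail through:
# exactly the kind of scattered identifiers a model could later stitch together.
```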
Copilot has a ton of well-publicised examples of reproducing code verbatim, but I didn't realize it was as trivial as all that to go plumbing for it directly.