This was mostly political guff about environmentalism and bias, but one thing I didn't know was that apparently larger models make it easier to extract training data.
> Finally, we note that there are risks associated with the fact that LMs with extremely large numbers of parameters model their training data very closely and can be prompted to output specific information from that training data. For example, [28] demonstrate a methodology for extracting personally identifiable information (PII) from an LM and find that larger LMs are more susceptible to this style of attack than smaller ones. Building training data out of publicly available documents doesn’t fully mitigate this risk: just because the PII was already available in the open on the Internet doesn’t mean there isn’t additional harm in collecting it and providing another avenue to its discovery. This type of risk differs from those noted above because it doesn’t hinge on seeming coherence of synthetic text, but the possibility of a sufficiently motivated user gaining access to training data via the LM. In a similar vein, users might query LMs for ‘dangerous knowledge’ (e.g. tax avoidance advice), knowing that what they were getting was synthetic and therefore not credible but nonetheless representing clues to what is in the training data in order to refine their own search queries
Shame they only gave that one graf. I'd like to know more about this. Again, miss me with the political garbage about "dangerous knowledge", the most concerning thing is the PII leakage as far as I can tell.
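For what it's worth, the extraction work they cite (the Carlini et al. training-data-extraction paper, if I remember right) mostly boils down to "generate a huge pile of samples, rank them by how suspiciously confident the model is about them, then eyeball the top of the list". Here's a toy sketch of that generate-then-rank loop. This is my own reconstruction for illustration, not their actual pipeline; GPT-2 via HuggingFace and the perplexity/zlib ratio are just stand-ins:

```python
# Toy sketch of the "generate, then rank by a memorisation signal" idea from the
# training-data extraction literature. My own reconstruction, not the cited paper's
# exact method; gpt2 and the perplexity/zlib ratio are stand-ins for illustration.
import zlib
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per token
    return float(torch.exp(loss))

def zlib_entropy(text: str) -> int:
    # crude proxy for how much "novel" content a string carries
    return len(zlib.compress(text.encode("utf-8")))

# 1. Sample a pile of unconditioned generations from the model.
start = torch.tensor([[tok.eos_token_id]])
samples = []
for _ in range(200):
    out = model.generate(start, do_sample=True, top_k=40, max_length=64,
                         pad_token_id=tok.eos_token_id)
    samples.append(tok.decode(out[0], skip_special_tokens=True))

# 2. Flag generations the model finds suspiciously easy (low perplexity) relative
#    to how incompressible they are: a hint that they may be regurgitated training
#    data rather than merely fluent text. Manual review of the top candidates
#    comes after this.
candidates = sorted((s for s in samples if len(s.split()) >= 5),
                    key=lambda s: perplexity(s) / zlib_entropy(s))
for text in candidates[:10]:
    print(repr(text[:120]))
```

As I understand it, the real attacks add conditioning on scraped internet prefixes and comparisons against a reference model on top of this, plus a lot of manual review, but the core loop really is about that simple.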
Is this a good or a bad thing? On one side we hear about "hallucination": you can't rely on the LLM, it is not like a search engine. On the other side we hear "it memorises PII".
Being able to memorise information is exactly what we demand when we ask for the top 5 countries in Europe by population or the height of Everest, but we don't want it in other contexts.
Is it conceivable that a model could leak PII that is present in the data set but extremely hard to detect there? For example, PII spread across very different documents in the corpus that aren't obviously related, but that the model could synthesize relatively easily?
That is more or less an understood fact even with models like Copilot & ChatGPT. With the amount of information being churned through, not all PII gets scrubbed, and these LLMs could well be running on unsanitized data - like a cached copy of the Web from Archive.org, Getty Images & the like.
I feel this is an unavoidable consequence of using LLMs. We cannot ensure all the data is free of identifying markers. I am not an expert on databases/data engineering, so please take this as an informed opinion.
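Scrubbing is also weaker in practice than it sounds, because a lot of it boils down to pattern matching. A throwaway illustration (my own, with deliberately naive regexes, not anyone's real redaction pipeline):

```python
# Deliberately naive PII scrubber, just to show why "the data was scrubbed" is a
# weaker guarantee than it sounds. The patterns here are illustrative, not a real
# redaction pipeline.
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
US_PHONE = re.compile(r"\b(?:\+?1[\s.-]?)?\(?\d{3}\)?[\s.-]?\d{3}[\s.-]?\d{4}\b")

def scrub(text: str) -> str:
    text = EMAIL.sub("[EMAIL]", text)
    text = US_PHONE.sub("[PHONE]", text)
    return text

doc = ("Reach Jane Doe at jane.doe@example.com or 555-867-5309. She is the only "
       "paediatric oncologist in Smallville, lives on Elm Street, and her badge "
       "number is 4471.")

print(scrub(doc))
# The email and phone number get masked, but the name, the uniquely identifying
# job-plus-town combination, the street, and the badge number all sail through:
# exactly the kind of scattered identifiers a model could later stitch together.
```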
Copilot has a ton of well-publicised examples of reproducing code verbatim, but I didn't realize it was as trivial as all that to go plumbing for it directly.