If a weakness is common, then of course Copilot is going to suggest it. Copilot gives you popular responses, not correct ones. Yet if a weakness is common, it also means that human coders frequently make the same mistake.
The study's results are rather unsurprising and its conclusions are oft-repeated advice. As many have said, treat Copilot’s code in the same light you would treat a junior programmer’s code.
It is artificial intelligence, it just isn't artificial general intelligence, nor artificial general knowledge. LLMs are artificial linguistic intelligence. They are really good at linguistic operations that require intelligence, like summarizing long text, transforming disrespectful text into professional looking text, translation between languages to a certain degree, etc.
It is not possible to ask an LLM for factual knowledge, without providing it the source of the fact. Without a source of the fact, you can only ask an LLM to generate an answer to the question that is linguistically convincing. And they can do a really good job at that. They can accidentally encode factual knowledge by predicting the next word correctly, but that should be regarded as an accident.
That's the most accurate term I've heard to describe the situation. I think it could get worse though because when I've seen mediocre people work with mediocre people they generate sub-mediocre solutions through trying to be clever and failing spectacularly at it.
LLMs at the moment feel a bit like a buzzword-throwing Dilbertian pointy-haired boss. They use terms they heard somewhere and, with some luck, they use them in the proper context... but without actually understanding them.
Edit: And yeah, I think I know what you mean. The expectation (or hope) that collaboration between mediocre people produces some above-average result by means of some magical synergy effect very rarely works out in real life.
Less intelligent than average, given that 100 is[1] calibrated to be average. That assumes your use of the word intelligence takes the concept as a sliding scale, not a boolean is/isn't, which is implied by quoting IQ results.
The way the “I” in AI is usually used seems to me to imply achieving average or better, so the aim is mediocre & upwards. steve1977 is agreeing with an opinion that results so far are at best “up to average”, maybe not even that, so more like mediocre & downwards. The phrase Artificial Mediocracy does not seem at all unfair in this context.
The opinion that current models can at best achieve average results overall seems logical to me: they are essentially summarizing a large corpus of human output rather than having original thought. While the systems may “notice” links that average humans miss, because humans can't process that much data, which brings the average quality of their output up, they are similarly likely to latch onto bad common practise/understanding, bringing it back down again. Average results are mediocre results, by definition. Not bad, but not outstanding in any way.
--
[1] supposedly; there are strong views in some quarters about IQ being a flawed measure
I never thought of the I in AI as a comparison to a human of average intelligence. I always understood it means intelligence as in "capable of reasoning", regardless of whether it's "kinda dumb" or "super smart" - the same way we speak about animals not being intelligent, and are looking for "intelligent alien life" in space - the aliens might not be very smart, perhaps even totally dumb, but still intelligent. The same applies for AI, perhaps it doesn't match even the least performing humans, but it's still intelligent.
I guess my parent comment was led a bit by the fact that nowadays AI is often conflated with superhuman intelligence. You're certainly correct in that even a "dumb" AI could still be intelligent.
The interesting question is of course if that applies to LLMs or not. Are they actually intelligent or do they just look intelligent (and do we even have the means to answer those questions)?
>The interesting question is of course if that applies to LLMs or not. Are they actually intelligent or do they just look intelligent (and do we even have the means to answer those questions)?
It's not an interesting question. It's pretty meaningless.
Are birds really flying, or do they just look like they are flying (from the perspective of a bee)? Are planes really flying, or do they just look like they're flying?
This question is essentially thought-terminating in most contexts, as most if not all people can’t answer it, given how little we know about how humans work.
Nerds will also get hung up on this because they can’t stand the notion that any aspect of their job doesn’t require their immense intelligence.
For most contexts, “will this tool help me?” is a much more appropriate question. Anyone conflating the two is doing themselves a disservice.
A key problem is the many different readings of the word intelligent. I wouldn't call what we currently have "capable of reasoning", for instance, though that might not be the intent, and that is instead a property of "general intelligence". Of course that has linguistic issues too, as it makes general intelligence (artificial or otherwise) a subset of intelligence - i.e. more specific despite adding "general" to the name.
Part of the problem is that we have very vague notions of what "reasoning" actually means. If you mean simple deductive logic ... LLMs can often perform such operations today, albeit highly inconsistently. If you mean inductive reasoning and working through a problem from first principles, then they usually fail. The state of the art there is tricks to get the system to extract the assumptions and base knowledge and then work deductively.
But every time we have an advance in machine learning, we seem to redefine intelligent activity to be beyond that. At a certain point, what is left?
Humans are the only thing capable of reasoning. AI isn't capable of it, and it's very rare that an animal other than a human is capable of even the most basic reasoning.
Animals act on instinct that is hard-coded based on the probability of survival. AI essentially does the same thing: it follows hard-coded probabilities, not reason.
Technically they are. My comment was also meant to be a bit tongue-in-cheek of course (and, hopefully, obviously).
I wouldn't use a score like IQ to define a threshold of "intelligence" in absolute terms. By definition, if you can score somewhere on the IQ scale, you have some intelligence. Otherwise your IQ would probably be N/A? (not sure, never looked that deeply into IQ tests).
I guess the problem is that the term "intelligent" is ambiguous and overloaded.
So in one usage someone who is kinda dumb would still be intelligent, just less intelligent than others.
In another usage, we use the term to describe someone of above average intelligence (which is technically not really correct and actually not very intelligent).
From the main Wiki entry on IQ tests: “The raw score of the norming sample is usually (rank order) transformed to a normal distribution with mean 100 and standard deviation 15.”
So, you’re not too far off numerically. 80 IQ is only a -1.33 Z-score. So, 9th percentile. 91% of people score higher than 80 IQ.
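For anyone who wants to check that arithmetic, here is a minimal sketch using only Python's standard library. It assumes the mean-100, SD-15 normal distribution from the Wiki quote above; the function name iq_percentile is just illustrative.

```python
from statistics import NormalDist


def iq_percentile(iq, mean=100.0, sd=15.0):
    """Return (z_score, fraction of people scoring below `iq`).

    Assumes scores follow a normal distribution with the given mean and SD.
    """
    z = (iq - mean) / sd
    below = NormalDist(mu=mean, sigma=sd).cdf(iq)
    return z, below


z, below = iq_percentile(80)
print(f"z = {z:.2f}, below = {below:.1%}, above = {1 - below:.1%}")
```

For an IQ of 80 this comes out to a Z-score of about -1.33, with roughly 9% of people scoring below and 91% above, which matches the numbers quoted.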
My point is, it's not a comparison at all. Intelligence is a trait, you either have it or not. We also use the same word for "how smart you are", but that measurement doesn't change anything about AI being intelligent or not. It can be dumb, but intelligent.
Intelligence is IMO an inherently comparative measure.
It's not "on/off", it's "smarter" (than a pile of rocks, than a slug, than another human).
So, yeah, you can be a dumb human but you'd be a smart chimpanzee. But we want to be comparing apples with apples in the context of this topic.
When people say "AI", everyone implicitly assumes the comparison with human intelligence. So "AI" needs to be as smart as the average human to be actual AI. Ok, AGI, if you prefer that term.
There's a reason AI is moving farther and farther away and we're creating new, finer, terms like ML, shape recognition, etc.
I don't know anybody who implicitly assumes that the I in AI is a comparison, people around me understand it as a term for a set of traits, regardless of the level of performance. That's how the term AGI came to be, to distinguish between "intelligent but not necessarily like a human" (AI) and "generally intelligent [like a human]" (AGI).
Do you also think that "intelligent alien life in space" means comparable to humans? What if we find something capable of reasoning, abstract thinking, rationality, adaptability, etc - but much, much dumber than humans? That's intelligence, comparison to humans doesn't change anything about that.
https://en.m.wikipedia.org/wiki/Intelligence - how could there be animal intelligence if "intelligence" means comparable to humans? "Crows are intelligent but nowhere near human-level" - this statement wouldn't make any sense if you're right, but it actually does make sense, IMHO.
> There's a reason AI is moving farther and farther away and we're creating new, finer, terms like ML, shape recognition, etc.
That thing with AI is called moving goalposts. And the finer terms - yeah of course we need to be able to be specific about our software, doesn't mean that's not AI. We talked about shape recognition in neuropsychology for much longer than in AI, same for many more terms that will certainly be reused soon.
Democracy is a bit different. Hopefully the goal of a democratic government isn't to be intelligent but just to make people's lives good (better?). In theory, if people find their lives are getting worse, then they can replace the government. But, sure, there are many examples of it not working. Such as the government keeping gas prices low because the people want that, even when it's polluting the planet that the people live on.
I'm not sure you can claim that the essential functionality of something is the issue with something.
The whole idea of LLMs is that they choose the most likely token based on the tokens before, and then sometimes choose less likely tokens. But it's all based on likelihoods.
Probably there is a huge education part missing from this, if people aren't aware that this is how it works and think that any LLM can "creatively" come up with its own chain of tokens based on nothing.
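As a rough illustration of "mostly the likeliest token, sometimes a less likely one", here is a toy sketch of temperature sampling. The scores in toy_logits are made up for the example and are not real model output, and sample_next_token is a hypothetical helper, not any library's API.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, rng=random):
    """Pick the next token from a {token: score} mapping.

    temperature == 0 means greedy: always return the highest-scoring token.
    Higher temperatures flatten the distribution, so less likely tokens
    are picked more often.
    """
    if temperature == 0:
        return max(logits, key=logits.get)
    # Softmax with temperature, shifted by the max for numerical stability.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    top = max(scaled.values())
    weights = {tok: math.exp(s - top) for tok, s in scaled.items()}
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens], k=1)[0]


# Made-up scores for what might follow `password = ` in some codebase.
toy_logits = {'"hunter2"': 3.0, 'os.environ["PW"]': 1.5, "getpass()": 0.5}

print(sample_next_token(toy_logits, temperature=0))  # greedy choice
rng = random.Random(0)
print([sample_next_token(toy_logits, 1.0, rng) for _ in range(5)])
```

In this toy setup the hard-coded string wins most of the time simply because it has the highest score, which is the "popular, not correct" point in miniature.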
It would be very interesting to fine-tune Copilot on the code of people widely regarded in their communities as experts, to see how the suggestions would change.
I don’t think it is a tautology, but I can imagine a CVE scanner picking up older code with log4j where newer code may avoid that library altogether, just as an example.
Since there is more older code than newer code, would the LLM be susceptible to that?
This makes me wonder about training an LLM on one language and then fine tuning it for another. If you train over only, say, JavaScript, and then finetune for C, I imagine it will be quite bad at writing safe code, even if it makes the code look like C, because it didn't have to learn about freeing and such.
Similarly, would it pick up patterns from one language and keep them in the other? Maybe an LLM trained on Kotlin would be more likely to write functional code once fine-tuned for another language.
> It only means programmers commonly talk about it. This isn't the same thing as measuring incidence in production or distribution.
Copilot was primarily trained on GitHub projects, not on communication between programmers. Patterns that frequently show up in Copilot output are most likely prevalent on GitHub, which is a pretty good indicator that they're common in production code.
That’s no longer true. Copilot uses the same ChatGPT-3.5 model as, well, ChatGPT. If it were trained on just GitHub projects, the chat features wouldn’t work at all.
You're assuming that Copilot Chat and the regular completion are the same model. Do you have a source that says so? I'd assumed that they were two different models, since they're quite different tasks.
Footnote 1 on page 2 explicitly mentions the 3.5 model and the research in this paper is only about auto completion: https://arxiv.org/pdf/2306.15033.pdf
Lastly, OpenAI states on the original Codex page: “OpenAI Codex is a descendant of GPT-3; its training data contains both natural language and billions of lines of source code from publicly available sources, including code in public GitHub repositories.” - It included GitHub repos, but it never was only GitHub repos. https://openai.com/blog/openai-codex
As I said to the other commenter, I specifically avoided saying "only", I said "primarily", and I should have clarified that I meant "primarily fine-tuned". My point is simply that it is far more likely to spit out results that are patterned after GitHub than results that are patterned after any inter-programmer communications.
Also, I wasn't contesting that autocomplete uses GPT 3.5 as the base model, I was contesting the idea that it uses the same derivative model as chat.
You don't need to use the chat to test whether it was exclusively trained on GitHub or not.
I started typing this in my IDE in a Rust project I already had open:
// the following function prints out the best soccer players of all time, which teams they played for, and why they were great as a hashmap
fn soccer_greats() -> Hashmap<String,(Vec<String>, String)> {
let mut hm = HashMap::new();
it completed with the following:
hm.insert("Pele", (vec!["Santos", "New York Cosmos"], "He scored over 1000 goals in his career".to_string()));
hm.insert("Maradona", (vec!["Boca Juniors", "Barcelona", "Napoli"], "He won the world cup with Argentina".to_string()));
hm.insert("Messi", (vec!["Barcelona"], "He has won the Ballon d'Or 6 times".to_string()));
hm.insert("Ronaldo", (vec!["Barcelona", "Inter Milan", "Real Madrid", "AC Milan", "Corinthians"], "He won the world cup with Brazil".to_string()));
hm.insert("Zidane", (vec!["Juventus", "Real Madrid"], "He won the world cup with France".to_string()));
}
I don't believe that information is going to be on GitHub anywhere, but I could be mistaken.
You're addressing a straw man, I never claimed it was "exclusively" trained on GitHub. I said "primarily", though I should have been specific and said "primarily fine-tuned".
In the context of the person I replied to, the point is that it isn't made up primarily of a bunch of communications between programmers.
A junior programmer's code? This makes no sense. It's happening right in front of you. A junior programmer isn't going to write on my screen. I can just correct it right there, since I am currently holding the context in my head.
These "security weakness" examples are things like:
print("first user registered, role set to admin", user, password)
Nah, this stuff I can easily spot while I'm writing code. For a junior programmer, I'm going to be looking at design, and then at common specific mistakes. For Copilot it's writing in front of me. I can easily exclude anything that isn't obviously correct because I'm in the state right there.
It's a fantastic tool. If you go and use it and end up with `print(user_credentials)` I don't know what to tell you.
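To make the flagged pattern concrete, here is a rough sketch of that kind of suggestion next to a version that keeps the secret out of the output. The function names and the message format are made up for illustration and are not taken from the study.

```python
def leaky_audit_line(user, password):
    """The flagged pattern: the plaintext password lands in the log line."""
    return f"first user registered, role set to admin {user} {password}"


def safer_audit_line(user, password):
    """Same event, but the secret is never formatted into the line."""
    del password  # accepted only to mirror the leaky signature; never used
    return f"first user registered, role set to admin user={user}"


print(leaky_audit_line("alice", "s3cret"))
print(safer_audit_line("alice", "s3cret"))
```

The difference is easy to see with both lines side by side; the open question in this thread is whether it is still easy to see when the suggestion arrives in the middle of other work.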
You’ve had to watch out for “bad” programmers since the beginning of programming. Having been an external CS examiner for almost a decade now, I’m not too worried about LLMs in teaching because I’m not convinced they can do worse than what we’ve been doing thus far. I do think it’s a little frightening that a lot of freshly educated Computer Scientists will have to “unlearn” a good amount of what they were taught to actually write good code, but on the flip side I work in a place where a big part of our BI-related code is now written by people from social sciences because, well, they are easier to hire.
That’s how you end up with pipelines that can’t handle 1000 PDF documents, because they are simply designed not to scale beyond one or two documents. Because that’s what you get when you google program, or alternatively, when you use ChatGPT, and it’s fine… at least until it isn’t, but it’s not like you can’t already make a lucrative career “fixing” things once they stop being “good enough”. So I’m not sure things will really change.
If anything, I think LLMs will be decent in the hands of both junior and senior developers; it’s the mediocre developers who are in danger. At least with google programming they could easily tell if an SO answer or an article was from 20 years ago; that info isn’t readily available with LLMs. I fully expect to be paid well to clean up a lot of ChatGPT messes until the end of my career.