Security weaknesses of Copilot generated code in GitHub (arxiv.org)
130 points by belter on Oct 4, 2023 | 83 comments



If a weakness is common, then of course Copilot is going to suggest it. Copilot gives you popular responses not correct ones. Yet if a weakness is common, it also means that human coders frequently make the same mistake as well.

The study's results are rather unsurprising and its conclusions are oft-repeated advice. As many have said, treat Copilot's code in the same light you would treat a junior programmer's code.


> Copilot gives you popular responses not correct ones.

That also sums up most of the issues with LLMs in general in one sentence.


Which is why the term Artificial Intelligence is really a misnomer for LLMs. Artificial Mediocracy might be more fitting.


It is artificial intelligence, it just isn't artificial general intelligence, nor artificial general knowledge. LLMs are artificial linguistic intelligence. They are really good at linguistic operations that require intelligence, like summarizing long text, transforming disrespectful text into professional looking text, translation between languages to a certain degree, etc.

It is not possible to ask an LLM for factual knowledge, without providing it the source of the fact. Without a source of the fact, you can only ask an LLM to generate an answer to the question that is linguistically convincing. And they can do a really good job at that. They can accidentally encode factual knowledge by predicting the next word correctly, but that should be regarded as an accident.


That's the most accurate term I've heard to describe the situation. I think it could get worse though because when I've seen mediocre people work with mediocre people they generate sub-mediocre solutions through trying to be clever and failing spectacularly at it.


LLMs at the moment feel a bit like a buzzword-throwing Dilbertian pointy-haired boss. They use terms they heard somewhere and, with some luck, they use them in the proper context... but without actually understanding them.

Edit: And yeah, I think I know what you mean. The expectation (or hope) that collaboration of mediocre people results in some above-average end by means of some magical synergy effect very rarely works in real life.


A sci-fi series I read on occasion uses the term "artificial stupids".


Do you think people with IQ below 80 are not intelligent?


Less intelligent than average, given that 100 is¹ calibrated to be average. Assuming your use of the word intelligence takes the concept as a sliding scale not a boolean is/isn't which is implied by quoting IQ results.

The way the “I” in AI is usually used seems to me to imply achieving average or better, so the aim is mediocre & upwards. steve1977 is agreeing with an opinion that results so far are at best “up to average”, maybe not even that, so more like mediocre & downwards. The phrase Artificial Mediocracy does not seem at all unfair in this context.

The opinion that current models can at best achieve average results overall seems logical to me: they are essentially summarizing a large corpus of human output rather than having original thought. While the systems may “notice” links average humans don't, because humans can't process large amounts of data like that, bringing the average quality of their output up, they are similarly likely to latch onto bad common practice/understanding, bringing it back down again. Average results are mediocre results, by definition. Not bad, but not outstanding in any way.

--

[1] Supposedly; there are strong views in some quarters about IQ being a flawed measure.


I never thought of the I in AI as a comparison to a human of average intelligence. I always understood it means intelligence as in "capable of reasoning", regardless of whether it's "kinda dumb" or "super smart" - the same way we speak about animals not being intelligent, and are looking for "intelligent alien life" in space - the aliens might not be very smart, perhaps even totally dumb, but still intelligent. The same applies for AI, perhaps it doesn't match even the least performing humans, but it's still intelligent.


I guess my parent comment was led a bit by the fact that nowadays AI is often conflated with superhuman intelligence. You're certainly correct in that even a "dumb" AI could still be intelligent.

The interesting question is of course if that applies to LLMs or not. Are they actually intelligent or do they just look intelligent (and do we even have the means to answer those questions)?


>The interesting question is of course if that applies to LLMs or not. Are they actually intelligent or do they just look intelligent (and do we even have the means to answer those questions)?

It's not an interesting question. It's pretty meaningless.

Are birds really flying or do they look like they are flying (from the perspective of a bee)? Are planes really flying or do they look like they're flying?

"Mimic Intelligence" is not a real distinction.


This question is essentially thought-terminating in most contexts, as most if not all people can't answer it given how little we know about how humans work.

Nerds will also get hung up on this because they can’t stand the notion that any aspect of their job doesn’t require their immense intelligence.

For most contexts, “will this tool help me” is a much more appropriate question. Anyone conflating the two is doing themselves a disservice.


I don’t know if either question is more “appropriate”. One is more scientific and philosophical, the other is practical.

I mean a power drill is also helping me a lot, without being intelligent.


I think you're correct although I'd like to point out that a lot of different animals are capable of reasoning. For example: https://www.cnet.com/tech/watch-a-wild-crow-tackle-a-complex...


A key problem is the many different readings of the word intelligent. I wouldn't call what we currently have "capable of reasoning", for instance, though that might not be the intent and that is instead a property of "general intelligence". Of course that has linguistic issues too, as it makes general intelligence (artificial or otherwise) a subset of intelligence - i.e. more specific despite adding "general" to the name.


Part of the problem is that we have very vague notions of what "reasoning" actually means. If you mean simple deductive logic ... LLMs can often perform such operations today, albeit highly inconsistently. If you mean inductive reasoning and working through a problem from first principles, then they usually fail. With the state of the art, there are tricks to get the system to extract the assumptions and base knowledge and then work deductively.

But every time we have an advance in machine learning, we seem to redefine intelligent activity to be beyond that. At a certain point, what is left?


Humans are the only thing capable of reasoning. AI isn't capable of it, and it's very rare that an animal other than a human is capable of even the most basic reasoning.

Animals act on instinct that is hard-coded based on the probability of survival. AI essentially does the same thing: it follows hardcoded probabilities, not reason.


Technically they are. My comment was also meant to be a bit tongue-in-cheek of course (and, hopefully, obviously).

I wouldn't use a score like IQ to define a threshold of "intelligence" in absolute terms. By definition, if you can score somewhere on the IQ scale, you have some intelligence. Otherwise your IQ would probably be N/A? (not sure, never looked that deeply into IQ tests).


I'm not sure about your angle here, but I thought IQ was calibrated to have 100 as the average value?

So wouldn't 80 mean that someone is... kinda dumb?


I guess the problem is that the term "intelligent" is ambiguous and overloaded.

So in one usage someone who is kinda dumb would still be intelligent, just less intelligent than others.

In another usage, we use the term to describe someone of above average intelligence (which is technically not really correct and actually not very intelligent).


> So in one usage someone who is kinda dumb would still be intelligent, just less intelligent than others.

100 IQ = 50% of the population is smarter than this.

80 IQ = not sure what the percentage is, but >50% of the population is smarter than this.

If any of the tables here are to be believed:

https://en.wikipedia.org/wiki/IQ_classification#Historical_I...

then 80 IQ could mean 80% of the population is smarter.

I'll rephrase my "kinda dumb" to "dumb as rocks".


From the main Wiki entry on IQ tests: “The raw score of the norming sample is usually (rank order) transformed to a normal distribution with mean 100 and standard deviation 15.”

So, you’re not too far off numerically. 80 IQ is only a -1.33 Z-score. So, 9th percentile. 91% of people score higher than 80 IQ.
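
For reference, a quick sanity check of that percentile (a sketch assuming the usual mean-100/SD-15 normal model; needs scipy):

    from scipy.stats import norm

    # IQ is normed to mean 100, standard deviation 15
    z = (80 - 100) / 15        # -1.33
    pct = norm.cdf(z)          # ~0.091, i.e. roughly the 9th percentile
    print(f"z = {z:.2f}, percentile = {pct:.1%}")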


Kinda dumb, but still intelligent. The sibling comment explains it well.


https://news.ycombinator.com/item?id=37776800

If by "intelligent" you're comparing with other primates, too, sure :-)


My point is, it's not a comparison at all. Intelligence is a trait, you either have it or not. We also use the same word for "how smart you are", but that measurement doesn't change anything about AI being intelligent or not. It can be dumb, but intelligent.


There's little value in that statement.

Intelligence is IMO an inherently comparative measure.

It's not "on/off", it's "smarter" (than a pile of rocks, than a slug, than another human).

So, yeah, you can be a dumb human but you'd be a smart chimpanzee. But we want to be comparing apples with apples in the context of this topic.

When people say "AI", everyone implicitly assumes the comparison with human intelligence. So "AI" needs to be as smart as the average human to be actual AI. Ok, AGI, if you prefer that term.

There's a reason AI is moving farther and farther away and we're creating new, finer, terms like ML, shape recognition, etc.


I don't know anybody who implicitly assumes that the I in AI is a comparison, people around me understand it as a term for a set of traits, regardless of the level of performance. That's how the term AGI came to be, to distinguish between "intelligent but not necessarily like a human" (AI) and "generally intelligent [like a human]" (AGI).

Do you also think that "intelligent alien life in space" means comparable to humans? What if we find something capable of reasoning, abstract thinking, rationality, adaptability, etc - but much, much dumber than humans? That's intelligence; comparison to humans doesn't change anything about that.

https://en.m.wikipedia.org/wiki/Intelligence - how could there be animal intelligence if "intelligence" means comparable to humans? "Crows are intelligent but nowhere near human-level" - this statement wouldn't make any sense if you're right, but it actually does make sense, IMHO.

> There's a reason AI is moving farther and farther away and we're creating new, finer, terms like ML, shape recognition, etc.

That thing with AI is called moving goalposts. And the finer terms - yeah of course we need to be able to be specific about our software, doesn't mean that's not AI. We talked about shape recognition in neuropsychology for much longer than in AI, same for many more terms that will certainly be reused soon.


Artificial Average


Sums up the issues with democracy too, and a ton of other stuff


Educating the "low-hanging fruit" is much more effective in moving the average than piling on excellence.


Democracy is a bit different. Hopefully the goal of a democratic government isn't to be intelligent but just to make people's lives good (better?). In theory, if people find their lives are getting worse then they can replace the government. But, sure, there are many examples of it not working, such as the government making gas prices low because the people want that, even though it's polluting the planet that the people live on.


I'm not sure you can claim that the essential functionality of something is the issue with it.

The whole idea of LLMs is that they choose the most likely token based on the tokens before, and then sometimes choose less likely tokens. But it's all based on likelihoods.

Probably there is a huge education part missing from this, if people aren't aware that this is how it works and they think that any LLM can "creatively" come up with its own chain of tokens based on nothing.
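
Roughly, the sampling step looks something like this (a toy sketch with a made-up distribution, not any real model's implementation; real models apply the temperature to logits, which works out to the same reweighting):

    import random

    def sample_next_token(probs, temperature=0.8):
        # probs: candidate next tokens -> model-assigned probabilities.
        # temperature < 1 sharpens toward the most likely token,
        # temperature > 1 flattens the distribution (more "creative" picks).
        weights = {tok: p ** (1.0 / temperature) for tok, p in probs.items()}
        total = sum(weights.values())
        return random.choices(list(weights), [w / total for w in weights.values()])[0]

    # toy distribution over possible next tokens
    print(sample_next_token({"the": 0.5, "a": 0.3, "banana": 0.2}))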


With humans too


Evals do help to account for correctness when it comes to LLMs


I propose calling it artificial non-diligence.


It would be very interesting to fine-tune Copilot on the code of people widely regarded in their communities as experts, to see how the suggestions would change.


I wonder if LLMs are biased towards older, more insecure implementations because there is a higher volume of old code vs new code.

Same thing with the data it is trained on — not all code requires all levels of refinement. Most of the data is probably around average.


I'm not sure that code being newer inherently means it will be more secure


I don’t think it is a tautology, but I can imagine a CVE scanner picking up older code with log4j where newer code may avoid that library altogether, just as an example.

Since there is more old code than new code, would the LLM be susceptible to that?


Everyone still uses log4j after the fix. There aren't many full-featured alternatives... and the ones that exist probably contain unfixed bugs.


This makes me wonder about training an LLM on one language and then fine tuning it for another. If you train over only, say, JavaScript, and then finetune for C, I imagine it will be quite bad at writing safe code, even if it makes the code look like C, because it didn't have to learn about freeing and such.

Similarly, would it pick up patterns from one language and keep them in the other? Maybe an LLM trained on Kotlin would be more likely to write functional code when fine-tuned.


Given that dataset anomalies can result in LLM output corruption, I'm not convinced that cross-training like that would even work.


> Most of the data is probably around average.

I know this is not how distributions work, but I had to chuckle at the literal interpretation of this.


I'd say the data is pretty normal.


That's not how it's presented or how managers expect it to be used.


Brawndo is great for plants because it has electrolytes.


> Yet if a weakness is common, it also means that human coders frequently make the same mistake as well.

It only means programmers commonly talk about it. This isn't the same thing as measuring incidence in production or distribution.

Anyway, I'd argue the real question is "can the chatbot fix the code if requested to".


> It only means programmers commonly talk about it. This isn't the same thing as measuring incidence in production or distribution.

Copilot was primarily trained on GitHub projects, not on communication between programmers. Patterns that frequently show up in Copilot output are most likely prevalent on GitHub, which is a pretty good indicator that they're common in production code.


That’s no longer true. Copilot uses the same ChatGPT-3.5 model as, well, ChatGPT. If it were trained on just GitHub projects, the chat features wouldn’t work at all.


You're assuming that Copilot Chat and the regular completion are the same model. Do you have a source that says so? I'd assumed that they were two different models, since they're quite different tasks.


Footnote 1 on page 2 explicitly mentions the 3.5 model and the research in this paper is only about auto completion: https://arxiv.org/pdf/2306.15033.pdf

And this blog post states “beyond Codex”, again for auto completion: https://github.blog/2023-07-28-smarter-more-efficient-coding...

Lastly, OpenAI states on the original Codex page: “OpenAI Codex is a descendant of GPT-3; its training data contains both natural language and billions of lines of source code from publicly available sources, including code in public GitHub repositories.” - It included GitHub repos, but it never was only GitHub repos. https://openai.com/blog/openai-codex

Update: GitHub Community Manager confirms it here: https://github.com/orgs/community/discussions/56975#discussi...


As I said to the other commenter, I specifically avoided saying "only", I said "primarily", and I should have clarified that I meant "primarily fine-tuned". My point is simply that it is far more likely to spit out results that are patterned after GitHub than results that are patterned after any inter-programmer communications.

Also, I wasn't contesting that autocomplete uses GPT 3.5 as the base model, I was contesting the idea that it uses the same derivative model as chat.


You don't need to use the chat to test whether it was exclusively trained on GitHub or not.

I started typing this in my IDE in a rust project I already had open:

    // the following function prints out the best soccer players of all time,
    // which teams they played for, and why they were great as a hashmap
    fn soccer_greats() -> HashMap<String, (Vec<String>, String)> {
        let mut hm = HashMap::new();

it completed with the following:

    hm.insert("Pele", (vec!["Santos", "New York Cosmos"], "He scored over 1000 goals in his career".to_string()));
    hm.insert("Maradona", (vec!["Boca Juniors", "Barcelona", "Napoli"], "He won the world cup with Argentina".to_string()));
    hm.insert("Messi", (vec!["Barcelona"], "He has won the Ballon d'Or 6 times".to_string()));
    hm.insert("Ronaldo", (vec!["Barcelona", "Inter Milan", "Real Madrid", "AC Milan", "Corinthians"], "He won the world cup with Brazil".to_string()));
    hm.insert("Zidane", (vec!["Juventus", "Real Madrid"], "He won the world cup with France".to_string()));
    }

I don't believe that information is going to be on github anywhere, but i could be mistaken.


You're addressing a straw man, I never claimed it was "exclusively" trained on GitHub. I said "primarily", though I should have been specific and said "primarily fine-tuned".

In the context of the person I replied to, the point is that it isn't made up primarily of a bunch of communications between programmers.


You need to indent code with two spaces for it to render as-is.


A junior programmer's code? This makes no sense. It's happening right in front of you. A junior programmer isn't going to write on my screen. I can just correct it right here; I am currently holding the context in my head.

These "security weakness" examples are

     print("first user registered, role set to admin", user, password)
and

     pprint({"json":"somejunk", "classes": somefunc(user)})

Nah, this stuff I can easily spot while I'm writing code. For a junior programmer, I'm going to be looking at design, and then at common specific mistakes. For Copilot it's writing in front of me. I can easily exclude anything that isn't obviously correct because I'm in the state right there.

It's a fantastic tool. If you go and use it and end up with `print(user_credentials)` I don't know what to tell you.


The added complication is now you'll have to watch out for the junior+copilot combo, though it's a trade I personally am very willing to take.


You’ve had to watch out for “bad” programmers since the beginning of programming. Having been an external CS examiner for almost a decade now, I’m not too worried about LLMs in teaching because I’m not convinced they can do worse than what we’ve been doing this far. I do think it’s a little frightening that a lot of freshly educated Computer Scientists will have to “unlearn” a good amount of what they were taught to actually write good code, but on the flip side I work in a place where a big part of our BI related code is now written by people from social sciences because, well, they are easier to hire.

That’s how you end up with pipelines that can’t handle 1000 PDF documents, because they are simply designed not to scale beyond one or two documents. Because that’s what you get when you google program, or alternatively, when you use ChatGPT, and it’s fine… at least until it isn’t, but it’s not like you can’t already make a lucrative career “fixing” things once they stop being “good enough”. So I’m not sure things will really change.

If anything I think LLMs will be decent in the hands of both juniors and senior developers, it’s the mediocre developers who are in danger. At least with google programming they could easily tell if an SO answer or an article was from 20 years ago, that info isn’t readily available with LLMs. I fully expect to be paid well to clean up a lot of ChatGPT messes until the end of my career.


> The results show that (1) 35.8% of Copilot generated code snippets contain CWEs

What percent of non-Copilot generated public GitHub repos contain CWEs?

Edit: According to this study, Copilot generates C/C++ code with vulnerabilities, but at a lower rate than your average human coder: https://arxiv.org/pdf/2204.04741.pdf


"...The results show that (1) 35.8% of Copilot generated code snippets contain CWEs, and those issues are spread across multiple languages, (2) the security weaknesses are diverse and related to 42 different CWEs, in which CWE-78: OS Command Injection, CWE-330: Use of Insufficiently Random Values, and CWE-703: Improper Check or Handling of Exceptional Conditions occurred the most frequently, and (3) among the 42 CWEs identified, 11 of those belong to the currently recognized 2022 CWE Top-25. Our findings confirm that developers should be careful when adding code generated by Copilot (and similar AI code generation tools) and should also run appropriate security checks as they accept the suggested code..."


I wonder if it would be possible to rate the code used during the training phase. For example the code could go through various static analysis tools and the result would be assigned as metadata to the code being used to train the model. The final model would then know that a given pattern is flagged as problematic by some tool and could take this into account not just to suggest new snippets but also to suggest improvements of existing snippets. Though I suppose if it was that easy, they'd have done it already.
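
Something like this could work as a first pass (a sketch assuming a Python corpus and an off-the-shelf analyzer like bandit; the metadata format is made up):

    import json
    import subprocess

    def label_snippet(path):
        # Run a static analyzer over a training file and return its findings
        # as metadata that could be attached to the sample during training.
        result = subprocess.run(
            ["bandit", "-f", "json", path],
            capture_output=True, text=True,
        )
        findings = json.loads(result.stdout).get("results", [])
        return {
            "file": path,
            "issues": sorted({f["test_id"] for f in findings}),  # e.g. "B608" for string-built SQL
            "clean": not findings,
        }

    # labels = [label_snippet(p) for p in corpus_files]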


This is probably the next step for the LLM providers. They need to find ways to increase quality, and for code, there are many options. Perhaps code repos could get in on this too.


Wouldn't this train it to avoid detection more than to avoid bad patterns?


Yes, but presumably in the training data those two are quite correlated.


As always the statistic is useless without the human comparison. If it improves on human coders, no amount of gnashing and wailing will stop the layoffs.


They didn't improve on human truck drivers yet.


There's only one weakness specifically identified that I can see.

    print("new user", username, password)
Yeah, not best practice, but also pretty common for development if you wanted to check that everything is being passed to the correct function.


I don't know if it still does it, but it used to be that if you did something like

    NonQueryResult StoreUser(User user) {
        var sql = "INSERT...

It would use string interpolation to fill out the properties


Not best practice? That's a very generous way to describe storing plaintext passwords in logs. I've seen this in the wild too but that's no excuse.


> I've seen this in the wild too but that's no excuse.

See, the LLM also saw it in the wild...


That is the CWE that they identify, but the code seems to store the apparently unhashed password in the database on top of that?
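
For contrast, the pattern a reviewer would expect is to derive and store a salted hash, never the plaintext (a minimal sketch using Python's standard-library scrypt; the parameters are illustrative):

    import hashlib
    import os

    def hash_password(password: str) -> tuple[bytes, bytes]:
        # Store the salt and the derived key; never log or store the plaintext.
        salt = os.urandom(16)
        key = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
        return salt, key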


This is where one needs a hyphen :-)


A related headline could be "Security weaknesses of code produced by a junior developer". It says copilot in the product name -> it's not intended to replace the pilot's (aka developer's) brain.


Did they prompt it to consider security weaknesses?


They did not prompt at all. They used GitHub’s code search to find projects where the repo owner specified that the code was generated “by Copilot” and the authors took that at face value for all code in the project. Whether the code was actually suggested by Copilot is not at all analyzed in the paper. As such, the results are highly questionable.


That would be kind of wild. Imagine a world where whether your system was secure was just a matter of remembering to tell the AI agent "& also make it secure" before it writes your code.

(could be quite real!)


This would likely help a little bit. We've already seen LLMs improve performance on some tasks by being instructed to "think carefully" first; presumably this biases it towards parts of the training set that are higher quality.

But security ultimately requires comprehension, which is not something LLMs have.


Security 100% does not require comprehension in the philosophical sense


Not sure what the philosophical sense would be. I just mean that an awful lot of people treat security as "hack on the code until it doesn't obviously break", and that's the wrong mindset for practical security.


I’m not sure what you could possibly mean by LLMs don’t have comprehension if not speaking to philosophical debates of what comprehension means.

I think your point is otherwise right, but the correct answer is standard best practices, which is the easiest thing for bots to do.


I think it’s more likely that you would use a security graded bot.

It’s perfectly reasonable to not use secure code for a large number of use cases.


> also make it secure

[proceeds to simply refactor the same code]



