Over 100k Infected Repos Found on GitHub (apiiro.com)
284 points by gnabgib 10 months ago | 178 comments



As well as this being our regular reminder to be careful what you pull from public repositories and other sources, and to verify your dependency trees, it raises another question:

If malware is massively prolific in public repos, how much does this affect LLMs and other automation tools trained on the contents of such resources? What are the chances that we'll see Copilot & friends occasionally emit malware in coding responses long enough for accidentally malicious parts to hide in? Simpler issues, such as basic injection vectors, have often been seen already.


I'm less worried about backdoors accidentally appearing in LLM output and more worried about backdoors being placed into LLM output by three-letter agencies. Maybe not today, but certainly in a few years' time.


Wouldn't it be easier (since they probably have very skilled programmers working for them) and way, way more effective to just set up a team and create a quality open source project with one or two extremely stealthy backdoors?

Or just pay or threaten a struggling company or dev to insert them?


It's all about ease.

It's easier to clone and infect existing repos. What you're describing might be effective, but it would be orders of magnitude more time-consuming.

Cloning and infecting provides 100x more opportunities, because these are already popular repos.

As for paying or coercing someone: again, that costs time and money. It's far easier to just abuse this loophole.


How would you secretly hide something like that in FOSS? And why would that be easier? It seems to me that it's easier to inject into an existing company than to do all the work yourself. That is what they do with most things, as I understand it.


The Heartbleed vulnerability sat in plain sight for about two years, no?


Yes, but that was a buffer over-read, leaking random unauthorized memory to the attacker. It was not an intentionally created exploit/backdoor that gives its owner easy access to the victim's system.


That seems pretty risky and easy to catch. The point of these LLMs is to produce code, and we know they aren't very reliable at it, so you have to check the code. So it is more likely to get inspected than a random GitHub project, right?

It also seems dangerous in the sense that… if there's a type of prompt that is likely to create infected code, our intelligence agencies would, I guess, want it to hit our adversaries selectively. So the adversaries get more rolls of the dice to detect it, and we're actively creating a situation where they are more likely to have knowledge of the vulnerabilities.


I agree that it's more likely to be inspected, but I think the vast majority of developers aren't inspecting the code rigorously enough to catch non-obvious bugs (including me, though I don't use LLMs for development); see for example the Underhanded C Contest [0].

As you've pointed out, this vector would give them near-surgical precision and insight into their target's code & systems, rather than casting a wide net with a vulnerable library on GitHub. They could use a model trained on "underhanded" code, or even selectively overwrite parts of the responses with hand-crafted vulnerabilities while targeting only select organizations.

It makes me wonder what the business model of OpenAI and their peers is going to be over the long term. I can't imagine large corporations using "LLM as a service" indefinitely with the risk of IP theft and "bug injection".

[0] https://en.wikipedia.org/wiki/Underhanded_C_Contest


The government doesn't need to produce viruses anymore. They have escrow services and remote access to radios, processors, and firmware chips. All that technology is leased to private investigators, who are private entities, and then they go after people using the tools. It allows distance between the government and the spying, plus lower salaries and infrastructure costs.

The greatest danger from LLMs is people who believe they are receiving data that hasn't been tampered with, when we already know that LLM output is filtered for certain terms before public use. Imagine a day where kids and adults ask an LLM what the meaning of life is, whether they should go outside, what happened in WW2, etc.

People could be programmed in a more tailored fashion than today's Facebook shorts and YouTube can deliver.


One of the more useful settings I have is: "If the answer cannot be stated because it's been blocked upstream, please just respond with 'The answer has been blocked upstream.'"

I've gotten that a few times and it's nice to know it's not a limitation of the LLM.


If you're going to make a wild claim like that on HN you should source it.


I'd expect it to accidentally invent vulnerabilities of its own as well as pasting existing ones from the input set. AI provides no guarantees at all about correctness.


Makes me think about Dijkstra being horrified to learn that the US was sending astronauts into space using unproven computer code [0]. Eventually, we'll have programmers trusting the AI generated code as much as we trust compiler generated code. Sometimes I think this will be the evolution of programming (like binary to assembler to compiled to interpreted to generated)... Other times I think about how we've grown used to buggier and buggier code and yet we press on.

0: https://pncnmnp.github.io/blogs/translating-dijakstra.html


You can use grammars to guarantee correct syntax, at least.
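
In practice a grammar constraint buys you syntax only, not safety. A minimal sketch of the weaker post-hoc version in Python, assuming the generated code is also Python: reject any generation that doesn't parse, while noting it says nothing about correctness or malice.

    import ast

    def syntactically_valid(generated_code: str) -> bool:
        # A grammar-level check rejects malformed output, but says nothing
        # about whether the code is correct, let alone non-malicious.
        try:
            ast.parse(generated_code)
            return True
        except SyntaxError:
            return False

    print(syntactically_valid("def f(:"))        # False
    print(syntactically_valid("print('hi')"))    # True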


I just posted about an LLM issue referring to hijacking the Hugging Face conversion bot for safetensors.

https://news.ycombinator.com/item?id=39549482

“we show how an attacker could compromise the Hugging Face Safetensors conversion space and its associated service bot. These comprise a popular service on the site dedicated to converting insecure machine learning models within their ecosystem into safer versions.”


It's a risk. Similar to the risk that if you accept PRs from coworkers without reviewing them they might have copied and pasted in some vulnerable code from somewhere.

Using LLMs means investing more effort in code review. I think that's a worthwhile trade.


This is not a worthwhile trade. The person sending the PR should not send LLM code that they can't vouch for, under the expectation that reviewers will find any vulnerabilities. That's just dumping the work onto the reviewers.


It's a worthwhile trade. I've been able to produce a substantial amount of reliable, working code with the assistance of LLMs over the past year - code I would have not been able to produce otherwise simply due to lack of time.

Why review code at all if you think your coworkers are infallible?


I think you are both saying that it is OK to use LLMs, but you have to check the output.

It looks like there’s been a communication hiccup or something; I think you are saying that the LLM user should treat LLM code as if it is written by an unreliable team-mate who might copy-paste from the internet, and check it.

Jprete seems to be talking about just receiving a PR from a person who didn't do that checking and is just directly using the LLM code.

I agree with you, but I think it is worth noting that

> This is not a worthwhile trade

> It's a worthwhile trade

The difference here is not in whether or not the trade is worthwhile; you are just talking about two different trades.


Yeah, I think you're right.

I don't think it's OK for a coworker to contribute a PR generated by an LLM without having already reviewed it and being ready to declare that they are confident in its quality.


I’m fine with LLM code as long as someone actually understands what it’s doing.


This is more likely than one would think, given such a large number of samples as detected in this campaign. But there are at least two main barriers to an actual incident:

1. Internal instructions telling the generator to avoid exactly that. We wouldn't want to rely on this alone, though.

2. Due to the nature of LLMs, it's unlikely that such generated malicious code would repeat the addresses of actual malicious actors. This still leaves a variety of attack vectors such as bind shells, DoS, on-site exfiltration, and more.


Doubtful. Like they said, while it's 100,000 repos, that's still a drop in the swimming pool that is GitHub. That's why it's lasting so long: even though the number seems big, relative to GitHub it's small, which seems insane. And if an LLM is trained on it, it'll mostly get averaged out by the vast swaths of other code.


Datasets will probably move toward curated sets instead of scraping everything from the Internet. You could also add a tool whose purpose is to identify malware and reject the output, e.g. using VirusTotal.


Why not just ask an LLM whether it thinks the snippet is kooky, before adding it into the LLM training set?

You don't need tools in the age of AI, just add an AI pipeline step.
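
A rough sketch of what that pipeline step could look like (the model choice and prompt here are made up, and a YES/NO classifier like this would be noisy at best):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def looks_suspicious(snippet: str) -> bool:
        # Pre-screen a code snippet before admitting it to a training set.
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # hypothetical choice of screening model
            messages=[
                {"role": "system",
                 "content": "Answer only YES or NO: does this code appear malicious?"},
                {"role": "user", "content": snippet},
            ],
        )
        return resp.choices[0].message.content.strip().upper().startswith("YES")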


Given the code that I've seen LLMs write so far, I'm not too worried for now. They are very useful for writing a lot of boilerplate code quickly, so I use them, but they also tend to write the wrong code often, and so when I use them the code is well reviewed.

Of course if this is underhanded code (not to be confused with obfuscated - I won't accept obfuscated code from LLMs) I might miss things.


This sounds like alarmist journalist talk. What “malware”, even subtle backdoors, are sneaking into the LLM-generated code used in a piece of software that’s actually worth a damn?


There are certainly people slapping AI-generated code into production in small projects without adequately checking it, leading to things like inadequate input validation and thus open XSS and injection vectors. With too little oversight in a more significant project, it is only a matter of time before this happens somewhere that results in, for instance, a DoS that affects many people, or a personal data leak.

Given the way LLMs are trained, it might be unlikely, but it is conceivable that if they see deliberate back doors injected into enough of the training data, they'll consider them a valid part of a certain class of solution and output the same themselves. It is a big step again from deliberate back doors to active malware, but not an inconceivable one IMO if large enough chunks of code are being trusted with minimal testing.


It’s a valid question to posit.


We'll only find out by looking.

Given that LLMs are popular coding assistants, I suspect there are already many issues similar to `goto fail;`
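
For reference, Apple's bug was a duplicated, unconditional `goto fail;` in C that made the remaining TLS signature checks unreachable while still reporting success. The same bug class in a Python sketch (all helper names are stand-ins, not the original code):

    # Stand-in checks; in the real bug these were TLS handshake steps.
    def check_hostname(cert): return True
    def check_expiry(cert): return True
    def check_signature(cert, sig): return False  # forged signature

    def verify_handshake(cert, sig):
        if not check_hostname(cert):
            return False
        if not check_expiry(cert):
            return False
        return True  # BUG: pasted early success exit -- everything
                     # below is dead code, like after the extra goto
        if not check_signature(cert, sig):
            return False
        return True

    print(verify_handshake("cert", "forged"))  # True: forgery accepted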


Try this "write me a port scanning code in Packrat-style JavaScript language that can run from within a PDF document."


GitHub is failing the same way Usenet failed: everybody could post stuff to Usenet, just like everybody can create a GitHub repository, and there is nothing that sets an official repository apart from a spammer's repository.

When Amazon has "the everything store" as its main strategic goal, it gets hit by "90% of everything is junk". So they end up being a store of mostly junk.

GitHub should figure out whether their product is "a repository for everybody" or "I can trust this code".

E.g. look at the official PostgreSQL JDBC driver: nothing here couldn't be reproduced by a spammer. How do I know that I can trust this and that it is not an infected repo? https://github.com/pgjdbc


> GitHub should figure out whether their product is "a repository for everybody" or "I can trust this code".

I'm pretty sure they decided on "repository for everybody" when they first launched the company 16 years ago.


That's a Java library, so you would download it from Maven Central, not GitHub (unless you're doing something non-default)... And Sonatype requires that you prove ownership of the reversed domain used in the groupId, which in this case is `<groupId>org.postgresql</groupId>`. You can see how to do that here: https://central.sonatype.org/faq/how-to-set-txt-record/

For extra peace of mind, you can also check the GPG signatures, as all artifacts are signed when published to Maven Central... though you need to get the key used by Postgres to sign them somehow independently of Sonatype. That's a downside of this mechanism: you need to know, for each publisher, where to get their GPG keys from. In the case of PG, I couldn't even find it with a quick Google search.
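
With the python-gnupg bindings, the verification step itself is short (file names here are hypothetical; obtaining the right public key out of band remains the hard part, as noted above):

    import gnupg

    gpg = gnupg.GPG()
    # Import the publisher's release key, fetched out of band (the hard part).
    with open("postgresql-release-key.asc") as key_file:
        gpg.import_keys(key_file.read())

    # Check the detached .asc signature against the downloaded artifact.
    with open("postgresql-42.7.1.jar.asc", "rb") as sig:
        verified = gpg.verify_file(sig, "postgresql-42.7.1.jar")

    print(verified.valid, verified.key_id)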


I don't think you truly grasp how small this number is; this is actually good, like really, really good. GitHub has about half a billion repositories.


Not only that, millions of these type of repos get created, and the vast majority are caught and deleted. The article mentions this: "Most of the forked repos are quickly removed by GitHub, which identifies the automation. However, the automation detection seems to miss many repos, and the ones that were uploaded manually survive. Because the whole attack chain seems to be mostly automated on a large scale, the 1% that survive still amount to thousands of malicious repos."


Notice that, as it seems, the vast majority are caught and deleted due to detecting the intense automation, not the malicious contents. If the actor were to run a smoother automation process, probably nothing would have been deleted. (Disclaimer: author of this article.)


Getting the actual number is probably very hard. These are the infected repos the OP found during their research.


For public repos you can get an approximate number by querying various public datasets.

    SELECT uniqHLL12(repo_name) FROM github_events;
Against https://play.clickhouse.com/play?user=play#U0VMRUNUIHVuaXFIT... returns:

    361648383


They probably mean that the actual number of malicious repos is probably very hard to get.

The article reaches the 100K number by searching for repos with patches with a particular string contained in this specific attack, so it's likely missing many malicious repos that use different methods of infection.


Exactly. GitHub claims to have 400M+ repos, making this number 0.025% of repos. I'm sure they could get it lower, but less than half of 1% is pretty damn good.

As a developer I have to do some due diligence about where I'm getting my data from. If I'm slurping in random repos because the name matches, that's a people problem, not a GitHub-specific problem.


Although finding over 100k infected repos is not good, it does not mean GitHub is failing, because the kind of programmer who would include an infected repo can find many other ways to create an insecure product even if there weren't infected repos on GitHub.


To be fair, the kind of programmer who would include an infected repo is almost everyone. Many infected repos have no indicators except the username to help you notice without a careful examination, especially in niche repos. When you have to move fast, it's natural to make such mistakes.


Further, transitive dependencies are a real risk. If A depends on B, which depends on C, which depends on D, which depends on E, which depends on F, and F is compromised in a way the author of E does not catch, everyone depending on any of the deps in the chain is at risk.

It's why the JavaScript ecosystem of micro-packages is absolutely insane. If someone infected isEven, they'd have a blast radius of 90% of JavaScript devs.

It's much like having a single password protecting everything. JavaScript has way too many of these high-value packages that find their way into every modern JavaScript project.
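
To make "blast radius" concrete, here's a toy sketch (the dependency graph is made up) that walks reverse dependencies to enumerate everything a single compromised package reaches:

    from collections import deque

    # Toy reverse-dependency graph: package -> packages that depend on it.
    DEPENDENTS = {
        "isEven": ["isOdd", "left-pad-ng"],
        "isOdd": ["webapp"],
        "left-pad-ng": ["webapp", "cli-tool"],
    }

    def blast_radius(compromised: str) -> set:
        # BFS over dependents: everything reachable is transitively exposed.
        seen, queue = set(), deque([compromised])
        while queue:
            for dep in DEPENDENTS.get(queue.popleft(), []):
                if dep not in seen:
                    seen.add(dep)
                    queue.append(dep)
        return seen

    print(blast_radius("isEven"))  # exposes isOdd, left-pad-ng, webapp, cli-tool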


It’s possible to get a verified badge on your org page if you prove you own your domain. This can go a long way to improve trust. Your example just seems to have not done it.


Strong disagree. It’s not GitHub’s job to tell you what’s good or bad. Only the user of the code can do that because it’s context specific. “I can trust this code” is a fantasy that won’t happen. Don’t trust code, test it.


This seems like the “don’t use seat belts, drive safely” argument.

Trust mechanisms in GitHub/etc can’t solve the whole problem, for sure.

But some automated safety mechanisms at scale can reduce the risk for those who don’t follow perfect security practices, which has value to the world at large.

Very few of us have the capacity to do even cursory validation for every update to every dependency of every bit of software we use.


I’m not saying don’t scan code for vulnerabilities, I’m saying GitHub shouldn’t be the place that the scanning happens. A good place would be where the code is getting compiled /executed.


That’s simply not possible. How do define a vulnerability? That’s all context dependent. It could be something as subtle as skipping an auth check if a magic string is part of the payload.

The main benefit of reusing software packages is that you don’t want to spend the effort of writing/reviewing all the internals of the component.

At some point, to trust an abstraction blindly, you need to instead follow reputation. Who has authority to say what is reputable or not is the difficult dilemma.

As seen with CVE authorities lately, it's not easy. As much as they undermine their own authority by declaring everything a CVE, conversely, a "Verified" badge for every org on GitHub may eventually be easy for scammers to get as well.

Back in the days, just having an SSL certificate on your web site was a big stamp of trust. Now everybody has it and it doesn’t mean anything.


Are you saying github should not scan? I don’t think there’s a central planner who will enforce that scanning is only in one place.


Comparing GitHub to Usenet feels like a reach. GitHub has been filled with junk since day 1: people post the same projects, coding exercises, etc. The small N% of repos are actually the interesting ones. This is by design.


It is possible to be a repository for junk, but harmless junk only.


> or "I can trust this code".

What might be better would be some kind of trust layer built into package managers, so they (optionally) only allow verified repos to be installed.


There are countless solutions that try to do this, both official and unofficial, at both the package and repository level. npm from Node.js comes with a security audit tool, for example, and most code hosting solutions nowadays have at least a SAST tool built in, but expecting more from free services is a bit of a pipe dream.

Obviously it's hard to make a one-size-fits-all solution. The bottom line is that if you use third-party code for anything serious, you have to do your due diligence from a security POV - a vulnerability assessment at the bare minimum.

Lots of big companies are in fact maintaining their own versions of whole package ecosystems just to manually address any security concern, which is a crazy effort.


Doing that well would cost money, and people are used to getting their package managers for free.


If only there was some sort of system of named domains within which the products and services of various organizations could be located...


This sucks. Supply chain is such an issue.

Even though we don't currently target any npm releases, I make use of socket.dev to monitor my project by creating an npm release for it. But my project BrowserBox (a lightweight virtualized web browser) only uses ~800 dependencies including all descendants, with only 19 top-level deps (cool your heels, non-JavaScript folks, this is comparatively lightweight for a full-stack thing).

I'm considering just snapshotting all 800 deps into a @browserbox namespace at npm. And then tracking any vulnerabilities discovered and patching the fixes.

It sounds crazy, but that's where we are. At least that way I "own" all the dependencies and can guarantee (up to company security at least) that we don't have supply chain vulns on the Node/JS side.

https://socket.dev

https://github.com/BrowserBox/BrowserBox


I'm not sure what of this is available in npm, but with crates.io and cargo, there are tools like cargo-audit and cargo-deny that your pipeline can use to check for CVEs in your dependency tree. Your lock file maintains the SHA-256 of everything in the tree, so there is no need to mirror things to ensure they aren't modified if their repo gets hacked. Pinning a version a few months behind the newest seems to be the sweet spot that avoids new CVEs while also avoiding big chunks of rework from sitting on ancient versions and then upgrading all at once. Download count seems like a decent way to gauge top-level deps against others with a similar purpose, but that's just my subjective opinion.
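
The lockfile idea above boils down to a pinned-digest comparison; a minimal sketch (file path and digest are placeholders):

    import hashlib

    def matches_lockfile(path: str, pinned_sha256: str) -> bool:
        # Recompute the artifact's digest; a tampered download fails the
        # comparison even if its upstream repo was hacked after pinning.
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest() == pinned_sha256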

Austral uses linear types to give fine-grained permissions to dependencies. A graphics library doesn't need file I/O; a network transport library doesn't need microphone access. That is just a mitigation, but it would be nice to see it in other languages.


npm has "npm audit", which throws out so many warnings on nonsense like ReDoS "vulnerabilities" in dev packages that everyone has learned to ignore it. It does active harm to the security of the ecosystem.


"only uses ~800 dependencies" slightly horrifies me.

I was horrified to see how much time I started spending fussing with dependency hell after I moved from .NET to Java about 10 years back. And I am currently horrified by how much time I have to spend doing vulnerability updates and fussing with dependency hell in both Java and Python projects nowadays.

I think maybe the reason I didn't have this problem to nearly the same extent in .NET is that .NET was relatively late to the automated package management scene. NuGet is relatively young, and, as of the last time I got paid to do .NET work, very few of the projects I worked on had actually adopted it yet. So, at least back then, .NET had a stronger culture of well-focused projects that didn't take on enormous transitive dependency trees.

I would also compare this to the recent news about Boeing. Theories abound about why it's gone down the tubes. The one that I find most compelling, though, is that, over the past couple decades, they have focused on moving more of their production out to third-party suppliers, and also cost optimizing their outside supply chain. And that has made their supply chain increasingly difficult to actually manage. The details are different, but in broad strokes it looks a lot like modern software engineering culture regarding supply chain - and some have even argued that this is where Boeing got the idea.

Meanwhile, the place I've worked where I found dependency management to be the least annoying - and where we had the fewest problems with quality - was a financial firm that had banned package managers for supply chain security reasons. There's something to be said for code that absolutely will not change unless you explicitly change it. I've heard similar sentiments expressed by friends and acquaintances who work at Google.

We did write a lot of stuff for ourselves where others would just import a package, and that was good, too. The in-house implementation would do just what we need, and be held to a higher coding standard. So it was easier to understand, easier to debug, and easier to modify as requirements change. And here's the thing: writing it in the first place is a one-time cost, and one-time costs have good amortization characteristics. The recurring costs of dealing with code that's trying to be everything to everybody can easily be greater in the long run. They generally don't amortize; they compound.

Rich Hickey really got me to see how this kind of phenomenon works in his talk "Simple Made Easy." Long story short, simple is different from easy. The simpler option tends to look harder up front, but it also tends to be easier in the long run, after you give second-order effects some time to take their toll.


Totally agree - I think package managers that make it very easy to pull in huge transitive dependency trees are fundamentally making the wrong thing too easy.

First, all those dependencies you pull in aren't necessarily real dependencies, because resolution is done at the package level. If I use one class/function from package A, I may not need any of package A's sub-dependencies - yet these package managers will pull them in recursively down the tree.

Second, you are trading control for having somebody else manage version dependencies - I'm not sure that saves you time in the end, especially compared to an approach that didn't pull in unnecessary dependencies in the first place.


And that, in turn, enables a lot of really troublesome feature creep.

Last year, one of the dependency hell hassles I had to deal with stemmed from MLflow, a Python package for organizing and collecting results from machine learning experiments, having a hard dependency on LLVM. Why? Because Numba, a JIT compiler for accelerating calculations in Python, uses LLVM. Numba, in turn, is required by SHAP, a model explainability tool. Producing SHAP explanations, in turn, is baked into MLflow as a kitchen-sink feature that is not needed by typical users and could easily be supplied manually or have been included in a separate add-on library.

Probably the most upsetting version of this that I encounter is that Apache Spark has all sorts of known vulnerabilities in all the transitive dependencies it pulls in to support every imaginable feature. The Spark project has declined to fix a whole heap of them, on the grounds that Spark doesn't call into the code that has the vulnerability. For a while they even wontfixed @$#% Log4Shell. This is a huge ticking time bomb in my book. Because Java dependency management is such that your Java process typically only gets one version of each JAR, and, if you're using Maven for builds, which version you get is unpredictable. So Spark can cause applications that use both Spark and the affected library, and thought they were using a patched version, to instead get whatever old vulnerable version the Spark project has decided to stick with.

Yes, there are lots of clever things you can do to mitigate this problem. But they don't happen by default, and require extra effort and no small measure of specialized Java ecosystem expertise to get right.


Gotta say, this is one thing I noticed when I started playing with Rust: sure, there are crates for everything, but there seems to be a pretty heavy effort to minimize deep dependency trees unless really needed, and when that's not the case, people tend to avoid those packages like the plague, I feel.


It's a rare day that I feel sad about doing something the way systems programmers like to do it.


I worked for some time at an industrial/embedded company, where in order to build all the software, you had to select "build all" in a menu, and it built everything - more than four million lines of code.

It was a build system that was a pure pleasure to work with, last but not least, I think, because it did not try to solve problems that turn out intractable in the general case.

They are not going to have these supply-chain issues.


> I moved from .NET to Java

Oh boy. That's just the first gate of hell. You should try JS!


I have. I used to be a GUI developer, back in the days when everyone wrote native desktop applications. But I noped out of it pretty hard after rich Web applications and Electron started taking over.

When I was in my teens and 20s, I enjoyed complexity, because understanding complicated things made me feel smart. That meant I had an incredible tolerance for needless complexity.

Now that I've been around the block a few times, though, I just don't have patience for that kind of thing anymore. It all reminds me of the Wallace and Gromit cartoons. Wallace is a very smart and clever inventor, and his inventions are very smart and very clever and very silly.


Totally, man. I like all this. I appreciate you sharing these war stories. :)


I've noticed these too, by randomly stumbling over similar repos. I usually don't run code from random repos, but now I have reached a point where I spin up a sandbox VM even when I trust the repo and the owners. If you are a dev today, you should probably have at least three firmly separated environments for work, hobby and personal stuff.


> If you are a dev today, you should probably have at least three firmly separated environments for work, hobby and personal stuff.

The complexity of digital life takes on dimensions that make me doubt whether it can continue in the long term.


Indeed. My parents (age 80+, father was an engineer and gadget freak) flatly refuse to use smartphones. My dad has a bit of trouble with the new big-screen TV but can figure things out. My mom just can't cope with the new remote and user interface. My dad's enough on the ball that he doesn't fall for scams, but I despair for people who aren't prepared for this (or who don't realize that email is untrustworthy).

All this digital stuff falls naturally to me, but for the most part, people my age and older really don't cope well with the digital world. I'm an exception because I got my feet wet in the mini/timesharing environment, just as personal computing was beginning to take off, and didn't lose interest.

Also, how well are we really teaching the next generation? I see both good and bad in that regard, with Pi, Arduino, etc. being the brightest spot, and locked-down ecosystems and pervasive surveillance the darkest.

And, of course, there's the whole culture problem. The notion of "computer literacy" went from "knowing how to use a computer" to "knowing how to use Word and Excel" almost overnight. Are schools actually using things like the Pi and Arduino, or are we leaving it to parents to get such things into their hands?


> Are schools actually using things like the Pi and Arduino, or are we leaving it to parents to get such things into their hands?

Gen Z here checking in to say almost entirely the latter :/ .

Some schools have an elective that would get things into the hands of kids, but nowhere near 100% and they generally have a pretty low coverage of students at the school.


I think this is just another opportunity to create new services. I've seriously considered moving all my stuff (except gaming) into something like Apache Guacamole. I'm already used to remote dev-ing over SSH, and once you can easily get your desktop anywhere without even needing a client, there's no real reason to run everything on the one system you happen to be sitting at. And if internet connectivity is still an issue in your country, you could even run this setup locally. Maybe find a way to sync it with a remote system. I'm pretty sure this is the future anyway.


> there's no real reason to run everything on the one system you happen to be sitting at

Theoretically.

In practice, the added latency is a problem outside of casual use.

Update: clarification seems to be necessary, as I was talking about audio/video/gaming-type workloads, NOT office stuff.


The latency issue is increasingly disappearing or at least becoming negligible in most population centers. For example, in my country almost every company uses Citrix (no affiliation) or similar workspace solutions, where the entire workstation is virtualized in a data center and you only access via a thin client. Entire nations of people work like this already.

Cloud gaming will probably be the next frontier in this.


My experience is the opposite on this. My first job where Citrix was used for day-to-day tasks started in September 2023, and I would estimate the latency was around 300-400ms. It was very noticeable and frustrating, especially when coding. I would be typing code and know I'd made a typo, but had to wait for the characters to actually show up on the screen before I knew how many characters to backspace over. Switching windows and workspaces felt sluggish. This was with both the server and the client in the same country.

It was a bad enough experience, that it is now a part of my interview questions if the company works through remote desktop solutions.


> The latency issue is increasingly disappearing or at least becoming negligible in most population centers

I believe that that's your experience. It's not really because the technology is improving, though; it's because you're growing older.

The latency is absolutely horrendous, and anyone that's used to a decently performing system will not agree with your opinion.

As a simple example: I can easily code 6+ hours with no break on a good system; with these mainframe-style systems I'm going to take a break at least every hour, because the fatigue builds up so quickly. It's every little interaction: simple input that doesn't appear for 50+ms, switching workspaces delayed by 150+ms.


Alternatively, they're young enough that they've never experienced good latency and don't know what they're missing. See:

https://danluu.com/input-lag/


(Real time) cloud gaming only works with a very low latency internet connection, which requires wiring, which leaves out most of the non-city users (and still some city ones).

Not to mention that it's ridiculously wasteful.


What percentage of consoles/gaming PCs are being used for gaming in an average hour? 15%?


I'd say it's rather more wasteful for everyone to have expensive rigs to play games a couple of hours a day when we could share computing resources in the cloud


I would love to have this, but this is very naive in the current economic climate. There is already a notion of trying to get rid of general computing devices from Apple, Google, and Microsoft. They would immediately use the opportunity to corner the market, lock down everything, and start extorting more money. There is no way I would give up my personal desktop.


By definition, everyone doesn't have expensive rigs.

But you're right, one shouldn't automatically assume that streaming is more wasteful than letting the resources sit idle... (one issue here is the assumptions about how fast computers are replaced for consumerist reasons ?)

(And this would still leave the issue of the loss of ownership.)


Can you give an example? I haven't run anything for work on my MacBook locally in a long time.


I remember using a variety of VNC/RDP over local network for connecting to my laptop, so I could do development there and keep the codebase completely separate from my other machines, while still using my desktop keyboard, mouse and larger monitor. I'd just turn it on, connect to it and treat it as another window in the OS.

That said, I think we either need proper OS level sandboxing, something like Qubes or just using multiple VMs or devices with remoting. It doesn't feel viable to have something like Discord or other communication software or things with account tokens, and executables like software or games running on the same install, whereas dual booting isn't viable for that either.


> I think this is just another opportunity to create new services.

So maybe cool for people with itches or startup aspirations, more hassle for everyone else.


Does nobody use VMs anymore? I just spin up EC2 instances with Shortcuts on my Mac and destroy them when I’m done. Run a bash script to save my work to s3.

Are people just doing everything locally or something?


Every year, Qubes looks more reasonable :)


I do this today too, but not even for potentially malicious software. Some projects that are not inherently malicious are just written poorly or stupidly by design. Just the other day I ran a program that, before I even requested it to do anything, appended 3 lines to my ~/.bashrc. I didn't even notice until days later. I can't fathom why any developer thinks this is a good idea, and it's exactly the kind of thing that makes me sandbox every foreign piece of code I run now.


Sounds like a good reason to use Qubes OS, where everything runs in VMs by default (my daily driver).


My experience with Qubes OS was so-so the last time I tried it.

Have they managed to get rid of Xorg and the terrible screen tearing issue that should be a thing of the past in 2023? I remember there was a project named SpectrumOS that tried to do something similar to Qubes OS but with NixOS, crosvm and Wayland. AFAIK the project stalled.

I am using a sort of middle ground between a traditional desktop and Qubes OS. I run a number of VMs on my machine for different purposes and use waypipe to start apps as individual windows on the main desktop. I don't really have the copy/paste separation that Qubes OS has, nor a separation between host and network VM, but at least I can separate duties and filesystems, with seamless windows showing on one desktop. To distinguish the browser and terminal windows I use different themes.



Nope, but regardless of drivers, Xorg has never been pixel-perfect and tearing does happen. You are probably just used to it, but it is obvious how much smoother your desktop becomes once you have started using Wayland.

It is the first time I hear about that TearFree option and I am wondering: if it exists and it isn't the default behavior, it means there must be some annoying drawbacks, right?


Perhaps I might be used to it, won't deny it. No idea about drawbacks from the workaround. Wayland transition is planned though: https://github.com/QubesOS/qubes-issues/issues/3366


> If you are a dev today, you should probably have at least three firmly separated environments for work, hobby and personal stuff.

I hate to call it out, but, isn’t that table stakes? Blending work and personal environments should be an obvious no. Are employers out there ok with this?


I used to work at Google where all computers were militantly segregated from non work stuff. Now I run a little non profit robotics org with one other person, and when I work from home I just use my personal desktop with no separation.

It depends on where you work, how much people will care and whether there are resources to do anything about it.


Big employers (at least the ones I know) aren't OK with it for work; I guess with smaller ones it maybe happens, especially with BYOD?

I think the issue probably isn't uncommon for freelancers/contractors too.

Hobby and personal stuff, I think a lot of people mix. I don't use the machine where I do bank/tax/etc. stuff for hobby work, but I'm not sure that's common.


OK Drone.


Check out container-shell [1]; this is one of its use cases. Not a VM but a Docker container. It chroots a directory in a container and does some automatic housekeeping, etc.

[1] https://github.com/jrz/container-shell


A link to a github repo? How do I know it's not infected?


It's a single bash script


What tooling are you people using to avoid this type of issue at your workplace? And are you satisfied with your setup?

We are a pretty small team developing SDKs that have a pretty large number of weekly downloads. I've been evaluating tools such as Snyk, aikido.dev, and some solutions built on top of Renovate (which we already use for general dependency management); it's not obvious whether they would help with this, and given we are still tiny, dealing with a large number of false positives (that was the case with Snyk) is a pain. Just curious how others are dealing with this.


We tend to avoid using GitHub repos and instead go for published packages from the usual sites - NuGet, PyPI, npm, etc. - using Repository and Firewall from Sonatype to act as a proxy between us and the package repos. All packages are analyzed and tagged with various metadata by Sonatype. Firewall lets us define policies for what we can use, and will filter out everything else.

This only works for published dependencies, but based on a couple years experience it works really well. No issues with malware (so far), we don't let packages with known vulns into our codebases and we are notified if a vuln is discovered in something we use.


We use Semgrep Supply Chain at work and are reasonably satisfied with it. It splits the supply chain vulnerabilities it found into the categories: reachable, unreachable and undetermined. This makes triaging much easier and it has reduced the time we spent on assessing new vulnerabilities by quite a lot.


There seems to be a lot of confusion between malware and vulnerabilities. None of the vendors mentioned in this subthread detects malicious code, only vulnerabilities.

Good as they may be at detecting vulnerabilities, you are still unprotected from malicious code planted in your code bases.


> None of the vendors mentioned in this subthread detects malicious code, only vulnerabilities

Sonatype does if you pay $$$ for Firewall, but that only catches things installed via a package manager.


I’ve been building an open-source tool Packj [1] to detect publicly malicious, abandoned, typo-squatting, and other "risky" PyPI/NPM/Ruby/PHP/Maven/Rust packages. It carries out static/dynamic/metadata analysis and scans for 40+ attributes such as spawning of shell, use of SSH keys, network communication, use of decode+eval, etc. to flag risky packages.

1. https://github.com/ossillate-inc/packj
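
As a toy illustration of the decode+eval attribute mentioned above (this is not Packj's actual implementation, just the general shape of such a check):

    import ast

    DECODERS = {"b64decode", "a85decode", "unhexlify"}

    def flag_decode_eval(source: str) -> list:
        # Report line numbers where eval/exec is fed a decoder's output.
        hits = []
        for node in ast.walk(ast.parse(source)):
            if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                    and node.func.id in ("eval", "exec")):
                for inner in ast.walk(node):
                    if (isinstance(inner, ast.Call)
                            and isinstance(inner.func, ast.Attribute)
                            and inner.func.attr in DECODERS):
                        hits.append(node.lineno)
        return hits

    print(flag_decode_eval(
        "import base64\neval(base64.b64decode('cHJpbnQoMSk='))"))  # [2]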


You may look into Trivy [0] , works very well for me so far.

[0] https://trivy.dev/


Trivy only does a CVE/package-version mapping. It will not tell you you are using a malicious package from an unknown repo because of typosquatting or bad practices.


As far as I know, Trivy only flags known vulnerabilities and would not protect against supply-chain attacks like this.


LavaMoat and @lavamoat/allowscripts (which does the opposite of its name).


Self-hosted GitLab.


Wonder if the whole curl + sudo shell script installer thing is going to come to an end any time soon?

aka the whole "just run `curl https://somesite/install.sh | sudo sh`" to install our software.

Seems like it'd go very hand in hand with this infected stuff mentioned in the article.
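
A hedged alternative, sketched in Python (URL taken from the example above; the pinned digest is a placeholder you'd get from the publisher out of band): download to disk, verify, inspect, and only then execute.

    import hashlib
    import subprocess
    import urllib.request

    URL = "https://somesite/install.sh"
    PINNED_SHA256 = "<digest published by the project, out of band>"

    def fetch_verify_run(url: str, pinned: str) -> None:
        # Save to disk first, so the script you inspect is the script you run.
        path, _ = urllib.request.urlretrieve(url, "install.sh")
        digest = hashlib.sha256(open(path, "rb").read()).hexdigest()
        if digest != pinned:
            raise SystemExit(f"checksum mismatch: {digest}")
        subprocess.run(["sh", path], check=True)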


I can confirm this from our findings (author of this research): our system lists around 100 instances of the pattern you've mentioned every week, and around 3% are malicious. It would be great to see it come to an end.


Yeah. Security aware people have complained about this pattern forever, but then places like macOS Homebrew (brew.sh) just knowingly do it anyway. :(


Unfortunately `npm i` has the same power.

`go get` is the only common dependency downloader I am currently aware of where hostile code doesn't run at install or build time.

I think we need better tooling for working in sandboxes, to at least compartmentalize the explosion. ChromeOS's "virtual machines can open Wayland windows on the main desktop" trick is neat, but the code needed to do that was less than clean or reusable when I last looked.


> `go get` is the only common dependency downloader I am currently aware of where hostile code doesn't run at install or build time.

Maven is the same way, AFAIK.


What's the point? You're going to ship that code to your users, or run it against your production database. If you think it might be malicious, protecting the dev laptop shouldn't be the priority.


I run production services that don't even talk to a database, or that have minimal read-only access. And even for the ones with a lot of write access, having the database stolen/nuked (to be restored from backup) is quite different from handing over all of my browser cookies, or my gcloud/aws/k8s ambient-authority credentials that are just sitting in a file (seriously, why is big cloud client security worse than SSH?).


Fair enough, some organizations might have devs with limited access, but you're right I can see a lot of situations where the dev laptop has strictly larger blast area.


Pro tip: use example.com for such examples, because it's reserved specifically for that: https://www.rfc-editor.org/rfc/rfc2606.html#section-3.


Good point, I just totally forgot. :)


It’s 0% worse than “add and trust our repository for your distro”, or “download this .deb/.rpm/installer”, or (worst of all) “do one of the above, but trusting a 3rd party who packaged this program for you, rather than the publisher”, which are the realistic alternatives.


For npm, you can mitigate executing malware with `--ignore-scripts`.

https://blog.uirig.com/getting-rid-of-npm-scripts


Instead, the downloaded malicious code runs in prod. Maybe, if you are lucky, it does something strange in CI and you can catch it.

The only real solution is a reputation system (like https://github.com/crev-dev/cargo-crev ), which of course is unfortunately barely used.


Correct, the above mitigation is only for malware on the dev laptops and build servers. IOW, it doesn't prevent injecting the malware on your program when compiling it.


Shouldn't build servers have limited or zero network connectivity in the first place?


Modern languages make offline builds far more difficult than they have to be, unfortunately. Rust, for example, buries its offline installer on another domain. Rust also doesn't advertise or encourage bundling dependencies. Lastly, unrestricted build scripts basically give every dependency full code execution.


Prod is bad, but stealing signing keys or credentials is worse.


Any equivalent for non-Rust projects? I see git-crev is abandoned...


unfortunately the Rust version seems abandoned too :(


Unfortunate but expected. You get tired after swimming against the current for so long. The rust community has for better or worse settled on wild unauditable dependency graphs. A real shame given how delightful the base language is (ignoring async, of course).

Most people who can't deal with this, including myself, simply switched to other languages.


Can you please elaborate on which language(s) you switched to?


Perhaps an over-correction, but I've switched to Go.

Any language with a proper standard library would do, but I found Go's modern and useful standard library to be well worth the inefficiency and clunkiness of that language.

The more expansive the standard library is, the better.


Should be the default from a security perspective. Note their note about the need for a Makefile.


I've been gradually improving my dev setup security over the past few months based on continuous reports like this. Here are things I'm trying out to improve my setup:

- Use VSCode dev containers for development [1a]. Once you've created one, they're quite easy to use, and you don't need much Docker knowledge - it just needs to be installed. It's perfect for spinning up web/console apps, but I had trouble with other stuff like Flutter and Electron.

- Similarly, I got familiar with GitHub Codespaces for smaller projects [1b]. I've done live coding in an interview before (where I had to modify a simple Node project), and I would absolutely use containers/codespaces for anything like that these days [2]. You can spin one up straight from any GitHub repo page, and they're easy to work with.

- Read the OWASP guidelines regularly for things like npm, Node, and Docker best practices, e.g. for Docker, use the smallest image you can (Alpine) and use explicit Docker image tags [3].

- Review npm/Python packages before installing them using socket.dev - it shows a full dependency security overview for things like env variable access, network calls, supply chain attacks, recent code ownership changes, etc. You can also disable postinstall scripts globally, as suggested by OWASP [4].

[1a] https://code.visualstudio.com/docs/devcontainers/create-dev-... [1b] https://github.com/codespaces [2] https://www.welivesecurity.com/en/eset-research/lazarus-luri... [3] https://cheatsheetseries.owasp.org/cheatsheets/NodeJS_Docker... [4] https://cheatsheetseries.owasp.org/cheatsheets/NPM_Security_...


Less than a year ago, a repository with a Trojan horse: https://github.com/orgs/community/discussions/63603


> The repository [...] claims to be a password stealer

> However, when I downloaded and Extract it,[...] it stole my personal information and files.

Well, I don't see where the problem is here. The repo is doing exactly what it claims.


A simple case of marking these officially would get some attention.


And later, Github could start selling these blue ticks. What could possibly go wrong? /s

(I do agree with your point that Github should be better at displaying which repo is the official one for a project.)


Clicking "learn more" on the cookie prompt (that covers the entire screen on my phone) leads you into an infinite loop. Not cool


I have always wondered how these malicious actors with their bots manage to do so much. Do they ever get caught? It's like there is no deterrent.

For example, if you use bots to spam WordPress blogs with comments that contain links to your site (black-hat SEO), that is obviously a bad thing to do, but I'm not sure it's even against any law.


Ah yes, the latest post about a marginal security concern leading neatly into an advertisement to give money to an LLM startup to address it only partially and at best probabilistically.

If you're a potential customer for something like this, you quickly have to ask: why not have another 1000 contracts with separate tiny startups that each do the bare minimum to paper over an unknown portion of just one security gap? What other costs will you incur integrating with each of them in turn? How many of them will even still be in business a year later?

There would be several hundred with more ROI than this one, so why start here? Even if you undertake this costly and tedious journey, how far down the list would you have to go before you get anywhere close to this one?


Despite its possibilities, GitHub cannot prevent all this - what happens to the other providers such as Codeberg, etc.?


I don't know but I guess the smaller fish are protected by virtue of them not being worth the automation effort. A bad actor can spend a lot of time and effort attacking Github and have their efforts exposed to so many more developers than the same sort of effort on, say, Codeberg would achieve.


And you'd be shooting yourself in the foot anyway. At "Codeberg scale" it's possible to entirely take over the platform with spam and malicious repos, at which point Codeberg would implement drastic limits to prevent this, like manual account verification or some such, which would stop it. It would be an enormous waste of time for everyone.


> Despite its possibilities

You mean despite its (financial) resources?


I appreciate the developments. I hope blindly pulling code from GitHub will be seen as the risk factor that it is.


Not just GH. People add StackOverflow snippets, pretty blindly, and pull directly from personally-published repos.

I know what an unpopular opinion this is, with the HN crowd, but I think we need to completely reevaluate our dependence on dependencies.

I have run into people that Literally. Can't. Write. Code, without dependencies. Their skill is at passing LeetCode tests, and googling for dependencies. Their bosses like them, because they pass the interviews with flying colors, and get results really quickly.

I remember, a number of years ago, attending a meetup promising to explain GraphQL; instead, it was a lecture on using a JavaScript GraphQL wrapper. I don't think that two minutes were devoted to the API itself. I seemed to be the only person in the room going "Whiskey Tango Foxtrot?"

For myself, I use a lot of dependencies, but I wrote almost all of them, myself. I have spent years, building a library of SDKs and modules that I can integrate into my shipping projects.

I have only two external dependencies, in my current projects. These are not ones that would kill me, if I was forced to go alone, but they do save a lot of time (an Apple Keychain wrapper in Swift, and a streaming JSON parser in PHP, for the record).


Stupid question here -- what's the most efficient way to evaluate for vulnerabilities (i.e. remove the low-level risk, like 95% of the cases)? Is it to trust the package manager, or is there another quick way?


Like other folks say elsewhere in these threads, the most efficient way is not to evaluate for vulns, but to assume compromise.

So, folks adopt the (admittedly ridiculous, but, in the strict sense, necessary) zero-trust step of running everything in Qubes OS, then in a VM, and maybe then in Docker.

Essentially each application gets its own fresh OS to fuck up or not... as it goes.

I get this (but I don't condone it). It's not practical for dev purposes, only if you are endlessly testing the waters for each little thing. But, prolly, eventually we will be wowed by a shiny-thing that achieves infinity-virtualization with bare-metal performance but perfect isolation, so we can easily run

  infinvm run github.com/malware/repo
And all will be right with the world.

We ain't there yet tho. Hahaha :)


I’d love to be surprised, but I think there isn’t any way to evaluate for vulnerabilities other than read all the code (fairly carefully). I don’t think the ecosystem is set up to make it easy.


That won't work, because even the original coders can't write code without vulnerabilities in their own software...

You should probably instead use a throwaway computer (or VM, if you trust them), not do anything personal on that computer, and then burn it.


In the good old days you could check the MD5 checksums of mostly everything you downloaded. But then you had to trust the website that published the checksums themselves.


That says absolutely nothing if the page serving the expected MD5 sum is also compromised.

With HTTPS, everything is already checked in transit, so you don't need to verify it again.

On top of that, with git, all the revisions are inherently verified via SHA-1 hashes.

In either case, what's inside may still be malicious if you end up on a typosquatted repo.


And then SourceForge was taken over by malware injectors...


On the flip side, you would hope that the malicious binaries in these packages have already been flagged by antivirus providers.


We should probably assume repositories like these are part of Copilot's and ChatGPT's training data until it can be proven otherwise.


So maybe they aren't hallucinating after all (just using bad source info).


This sounds solvable with some kind of painful-to-write-and-test heuristic tool.

Something like "find other repos with similar product names and similar code but no forking audit trail"... If you do an automated code diff and find it involves, say, posting stuff to a server, you can become suspicious. All these signals can become rather cheap to compute with the right tooling.

Centralizing this would make it a bit more efficient, because you can just cache things and respond fairly immediately.
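
One of those signals is cheap to prototype. A toy sketch (the popular-repo list and similarity threshold are made up) flagging near-duplicate names with no fork lineage:

    import difflib

    POPULAR = ["requests", "colorama", "discord.py"]

    def suspicious_name(repo_name: str, is_fork: bool) -> bool:
        # Near-identical name to a popular project, but no fork audit trail.
        close = difflib.get_close_matches(repo_name, POPULAR, n=1, cutoff=0.85)
        return bool(close) and not is_fork

    print(suspicious_name("requestss", is_fork=False))  # True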


To do so Microsoft needs to act as the strong man with a heavy hammer.


Very interesting read. Is the Jupyter notebook available?


That's funny, because WordPress sites are infected using that kind of code injection.


So who's behind this? Who's doing something about it? Where's Homeland Security on this? This is their job. This is an attack.

"The ease of automatic generation of accounts and repos on GitHub and alike, using comfortable APIs and soft rate limits that are easy to bypass, combined with the huge number of repos to hide among, make it a perfect target for covertly infecting the software supply chain. This campaign, along with dependency confusion campaigns plaguing package registries and generally malicious code being spread through source control managers, demonstrates how fragile software supply chain security is, despite the abundance of tools and available security mechanisms."


> how fragile software supply chain security is, despite the abundance of tools and available security mechanisms

There seems to be a fundamental trade-off at play. I often see security portrayed as a hindrance, and its requirements as a drag on productivity. That is in line with a strong trend of developers with a very narrow skill set: the ability to throw frameworks at the wall and see what sticks pays very well. No one wants a stick-in-the-mud asking why on Earth dependency management is in the state it is, or imposing reasonable security practices. I have been there; I have argued with developers from teams that had been breached before saying "no, this is safe because I can't see how this could be exploited". Security by obscurity so deeply ingrained that one takes one's own obscurity as evidence of safety.


It's worse: if you address these things seriously (like another post here last week about software quality), you get rapidly stopped in your tracks. Like you say, and more: "but everyone does it like this, why would we waste time?" and "it is safe enough, maybe later we'll revisit". It is kind of true that clients don't pay for it directly; however, indirectly, it can tank a company.


Herd mentality gets a bad rap, but it generally works for the herd.


Works well on average, and remember that bad actors are also part of the overall herd. It can be very detrimental to the individual (person or company).


Where's GitHub's fraud detection team?



That sort of thing happens because there's so much spam and malicious activity, such as the thing reported in this story.


> So who's behind this? Who's doing something about it?

CISA[0] might be a good agency to begin with, if for no other reason than to find a more appropriate one to contact.

0 - https://www.cisa.gov/about


> Where's Homeland Security on this?

Who knows who's behind it?


Isn't that what they should be investigating? Are you trying to imply something with that question?


It's not unheard of for intelligence agencies to create and exploit weaknesses.


The Vault7 operators called. They want their exploit back.


Impossible. It's Open Source, watched by a thousand eyes and smelling of roses.



