Dozens of malicious PyPI packages discovered targeting developers (phylum.io)
754 points by louislang on Nov 2, 2022 | 320 comments



I think a proper way to solve this issue, not specific to Python but to languages running in a VM in general, would be to have some sort of language support where you specifically define what access rights/system resources you allow for any given dependency.

Example of defining project dependencies:

  {
    "apollo-client": {
      "version": "...",
      "access": ["fetch"] // only fetch allowed
    },
    "stringutils": {
       "version": "...",
       "access": [] // no system resources allowed for this dependency, own or transitive
    },
    ...
  }
It would probably require the language to limit monkey-patching of core primitives (such as Object.prototype in JavaScript), and it would be more cumbersome for the developer to define the permissions they grant to each dependency. These required permissions could be listed on the package site (e.g. npm or PyPI) and the developer would just copy-paste them when adding the dependency. But if you upgrade a dependency version and it now requires a permission that seems suspicious (e.g. "stringutils" needing "filesystem"), that would prompt the developer to stop and investigate, or, if it seems justified, add the permission to the "access" list.
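Python doesn't support anything like this today, but as a rough sketch of the flavor of enforcement such a manifest could drive, you could imagine an audit hook that attributes sensitive operations to the third-party package whose code triggered them. The manifest format, package attribution, and "network" capability name below are all hypothetical, and frame inspection like this is advisory at best, since a hostile package can work around it:

    import sys, sysconfig

    # Hypothetical manifest: package name -> capabilities it is allowed to use.
    MANIFEST = {
        "apollo_client": {"network"},
        "stringutils": set(),
    }
    SITE = sysconfig.get_paths()["purelib"]  # where third-party packages get installed

    def owning_package(frame):
        """Walk up the call stack and return the first third-party package found."""
        while frame is not None:
            path = frame.f_code.co_filename
            if path.startswith(SITE):
                rel = path[len(SITE):].lstrip("/\\")
                return rel.replace("\\", "/").split("/")[0]
            frame = frame.f_back
        return None  # only application code on the stack: not restricted here

    def enforce(event, args):
        # socket.connect is one of CPython's built-in audit events (PEP 578)
        if event == "socket.connect":
            pkg = owning_package(sys._getframe(1))
            if pkg is not None and "network" not in MANIFEST.get(pkg, set()):
                raise PermissionError(f"{pkg} is not allowed to use the network")

    sys.addaudithook(enforce)

Real per-dependency enforcement would need runtime support (exactly the monkey-patching problem above), but it gives a feel for what the "access" list could map to at runtime.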


At the beginning, the permissions aspect of deno[0] was actually one of the major selling points for me. The approach used there was to begin at zero and offer granular permission control, e.g. `--allow-read=data.csv`, for filesystem, network, etc. I would love to have this for, e.g., Python or npm packages.

[0]: https://deno.land/manual@v1.27.0/getting_started/permissions


Doesn't this only apply to the entire process, not the individual dependencies? Just confirming; Deno was my first thought with this, since it requires the developer to deliberately enable the permissions needed.


Yes, it applies to the whole process. It's incredibly hard to sandbox dependencies individually since you don't know how your code or other dependencies interact with them. If you want, you can run dependencies in a worker process and sandbox that more tightly, but that is quite a bit of work.


This is exactly what I've done for Membrane[0]. It's capability-based; even to get the time (and thus introduce non-determinism) you need a capability. Dependencies run as separate processes and everything is orthogonally persistent. It's a TypeScript/JavaScript system for personal automation built entirely within VSCode. Stay tuned, I'll be posting a video this week.

[0] https://membrane.io


Hi there! I just wanted to let you know that I read the blog posts, and membrane sounds extremely cool. Ambitious, though. If you don’t mind a small bit of feedback: it would be encouraging to potential users or testers to see some semi-regular posts related to development. It would also be great to see how membrane might work to build a tool using current APIs. I know there is video forthcoming, and perhaps this will be addressed. I didn’t look at the GitHub, where I probably could glean some additional info. But from my perspective, a development blog builds trust and anticipation. It’s also a great way to check your assumptions (or have them checked, rather).

Good luck with the project. I hope it delivers, because I’d love to use it. Signed up for the mailing list.


That’s exactly why I use strace-based sandboxing for ALL dependencies: https://github.com/ossillate-inc/packj


Phylum's extension framework is built on Deno for this exact reason. The ability to provide granular permissions was something we were really interested in.

Deno is a really cool project, imo.


Interesting read, thanks.


Check out OpenBSD's pledge(2): https://man.openbsd.org/pledge.2

It does exactly that (although on a per-process basis).

I don't think this kind of permission system can be retrofitted into an existing language without direct OS support, and probably not at the library level (you'd need something like per-page permissions which would get hairy real fast).


I think @jart has been porting it to Linux: https://justine.lol/pledge/


Indeed! If your dependencies are able to be command line programs that are shell scripted together, then you can in fact have an access policy on a per-dependency basis, using the pledge.com program linked on my website. So shell scripters rejoice.

But it gets better. If you build Python in the Cosmopolitan Libc repository:

    git clone https://github.com/jart/cosmopolitan
    cd cosmopolitan
    build/bootstrap/make.com -j8 o//third_party/python/python.com
Then you can use cosmo.pledge() directly from Python.

    $ o//third_party/python/python.com
    Python 3.6.14+ (Actually Portable Python) [GCC 9.2.0] on cosmo
    Type "help", "copyright", "credits" or "license" for more information.
    >>: import cosmo, socket
    >>: cosmo.pledge('stdio rpath wpath tty', None)
    >>: print('hi')
    hi
    >>: socket.socket()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/zip/.python/socket.py", line 144, in __init__
        _socket.socket.__init__(self, family, type, proto, fileno)
    PermissionError: [Errno 1] EPERM/1/Operation not permitted
Since we didn't put "inet" in the pledge, you can now be certain your PyPI deps aren't spying on you or uploading your bitcoin wallet to the cloud. You can even use os.fork() to rapidly put each dependency in its own process, then call cosmo.pledge() afterwards to grant each component of your app its own maximally restrictive policy.

Cosmopolitan Python also ports OpenBSD's unveil() system call to Linux. For example, to disallow all file system access, just call cosmo.unveil(None, None). You need a very recent version of Linux though. For instance, I use unveil() in production on GCE, but I had to apt install linux-image-5.18.0-0.deb11.4-cloud-amd64 in order for the Landlock LSM to be available for unveil().
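For anyone wanting to try the fork-then-pledge pattern described above, here is a rough sketch, assuming the cosmo.pledge()/cosmo.unveil() bindings shown above, with unveil(path, perms) mirroring OpenBSD's semantics; the promise strings are pledge(2) promises:

    import os
    import cosmo  # only available in the Cosmopolitan python.com build

    def run_restricted(worker, promises, unveils=()):
        """Run worker() in a forked child under its own pledge/unveil policy."""
        pid = os.fork()
        if pid == 0:  # child
            for path, perms in unveils:
                cosmo.unveil(path, perms)    # e.g. ('data', 'r')
            if unveils:
                cosmo.unveil(None, None)     # lock in the unveil list
            cosmo.pledge(promises, None)
            try:
                worker()
            finally:
                os._exit(0)
        os.waitpid(pid, 0)  # parent keeps its full privileges

    # a component that only needs stdio and read access under ./data
    run_restricted(lambda: print(open('data/input.txt').read()),
                   'stdio rpath', unveils=[('data', 'r')])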


I thought the threat model of pledge/unveil was to restrict a program that you are writing, but that you couldn't wrap it around another program in a safe way.

That is, you can protect your own program from doing network stuff because of incorrect input, but you can't use it to sandbox another program.

See this thread: https://marc.info/?t=162367803300003&r=1&w=2 and this mail about sandboxing: https://marc.info/?l=openbsd-tech&m=162367954705721&w=2


You can, if you use pledge() and unveil() on Linux. SECCOMP and Landlock use a monotonically decreasing permissions model. It's inherited across exec(). This is a good thing. OpenBSD devs don't need it because they built their own hermetic system. They're more afraid of having their servers compromised remotely than they are of programs they've installed locally. The tradeoff is you can't use pledge() and unveil() to build your own SSH server on Linux, since SSH needs to shed restrictions when launching a shell. But the benefit is you can safely leverage more code written by strangers on the Internet, which is what Linux is all about.


I thought of OpenBSD too since it includes Perl modules for OpenBSD::Pledge and OpenBSD::Unveil. Now I'm wondering if I can get something to work where these are used before importing CPAN modules to reduce the damage of potentially hostile modules.


I like this. I'd try to keep the permission sets as small, limited, and simple as possible though.


> I'd try to keep the permission sets as small and simple as possible though.

You've described OpenBSD in general. I recommend a deeper dive - it's fantastically refreshing, how simple yet functional an OS can be.


Same for FreeBSD. Incredibly code-stable and well-documented by Linux standards.

I often use the FreeBSD Handbook as an example of first-party documentation done right, and that's only possible because of deliberately limited "churn for churn's sake". The kinds of regular code rot and attrition that Linux suffers just do not take place on BSD systems, because if you contribute something new you're expected to make your thing mesh with the ecosystem; they don't tolerate the "I'm gonna churn everyone's environments because my version is 10% better than the existing one" that tends to take place on Linux.

As an example, the number of init systems that Ubuntu has gone through over the last 20 years is completely insane by BSD standards. They've gone from sysvinit to upstart to systemd. You don't have people ripping out the graphics and audio subsystems and doing total rewrites, either. It's nuts that a lot of Linux people don't even realize that it doesn't have to be that way - that's not "just how maintenance is supposed to work", that's a bunch of incredibly bad behavior on the parts of distros and Poettering specifically that has become highly normalized and overlooked. Maintenance shouldn't be breaking things like that. "We never break userland" doesn't have to be a suicide pact either - BSD has actually maintained a much more stable ABI than the Linux kernel without crystallizing a bunch of shitty bug-compatibility stuff like Linux is doing. It's all insanely stable and competent and professional by Linux standards.

If I was developing a product for true long-term support, with the minimum possible engineering effort devoted to fighting churn, BSD would be at the top of the list.


What attracted me to Linux so long ago is what ultimately drove me away. I now apply it only as needed because it's no longer simple ("stable" over the timescales I'm interested in) as a project, and it's not moving towards any type of consistency.

If you read "rolling stones gather no moss" to mean "keep moving or risk becoming obsolete", choose Linux. If you understand it as "change for change's sake prevents the achievement of mastery", go with a BSD.


And with all that beauty, FreeBSD is not something I would [nowadays] look into as the base OS for hosting my services and products. It means something, probably something about humans.


Netflix uses FreeBSD for their CDN/edge caches - it was consistently more performant both in benchmarks and with real-world workloads.


Yeah, that's quite interesting reading from them, some sort of specialized appliance really.

I, and the average Joe around me, am far from Netflix's task of packing bytes from disk to network. A simple 2-vCPU VPS serving 4 Gbit without being saturated at the system resource level is quite often much more than enough. Extra note: it's not even using kTLS.

Moreover, even for Netflix, who know FreeBSD inside and out, do you think (or have info on whether) they use FreeBSD as a base OS beyond the distribution level, for running applications/services in particular?

I quickly checked some of their repos, like https://github.com/Netflix/conductor, and it smells like they use containers/Docker, which doesn't work on FreeBSD, so I very much doubt it's the OS of choice for them there.


A resource access model for dependencies doesn't make much sense to me; there are basically only two things you want to gate for libraries: filesystem and network. And it's all-or-nothing. A library that needs network access may be legit today and, after an update, start exfiltrating data to a different URL. It seems easier to grep for fs and network calls in the library code than any of that.


Restricting a process to only be able to access an opt-in list of directories underneath the project directory would be useful. Assuming one uses a venv, all the dependencies are contained there. Then one might want some data folder. And you'd have at least prevented deps from scraping user-wide secrets.


You're describing a chroot jail. The key word there is "process". Dealing with process permissions is the OS's job.

If a language wants to deal in library security, it should strive to make static analysis possible. E.g.: the language guarantees that network and filesystem calls can only be made through a single function, statically, so I can audit that leftpad indeed doesn't make network calls.


This is essentially the fine-grained control the Java Security Manager enabled. But hardly anyone used it and it was deprecated sadly.


And it didn't really work in practice. Every callback or thread hop is a security vulnerability by default.


This is one of the projects we're working on (and open sourcing)!

Currently allows you to specify allowed resources during the package installation in a way very similar to what you've outlined [1].

The sandbox itself lives here [2] and can be integrated into other projects.

1. https://github.com/phylum-dev/cli/blob/main/extensions/npm/P...

2. https://github.com/phylum-dev/birdcage


I broadly agree, but if I'm reading your suggestion correctly I think that "access" list is too coarse-grained still! It looks like you're suggesting a predefined list of permissions that can be granted to a dependency, but... why not go even further? If the list of things you're passing in are references rather than strings, then you could do...

    const apollo = require('apollo-client', {fetch});
    const stringutils = require('stringutils');
That way, you can pass in at a very fine-grained level any object you want to, including things like `fetch` and `fs`. But then, you could just as easily pass in a custom `loggedFetch` module like so, that wraps `fetch` and logs all network requests made:

    const apollo = require('apollo-client', {fetch: loggedFetch});
This is the object-capability security model, and as a sibling commenter pointed out, LavaMoat basically does this for JavaScript dependencies. (This is also basically just lexical scope, by the way, if your language is strict enough - though most unfortunately aren't.)


I love this idea. But what about peer deps? And their dependencies? What if multiple deps share peer deps? Is there a simple way permissions could be resolved in the web of modules?

Also- tangential but I really wish people would stop using require and use ESM whenever possible. I cringe when I see `require` just as much as when I see `var` being used in 2022, but I’m not certain there’s never a good reason not to use ESM.


If you want to override the dependency of your dependency, then it's your dependency too - so if you were using the javascript approach up there you might have

    var bar = require("bar", {baz: loggingBaz})
    var foo = require("foo", {bar})
This is true even if you never otherwise use bar.

This is certainly how peer dependencies work in nix flakes, for instance, where you can say something like

   bar = //....
   bar.inputs.baz.follows = loggingBaz;
   foo = //....
   foo.inputs.bar.follows = bar;
Since nix flakes is a package/dependency manager, it generates a lockfile that you could inspect to see what all the dependencies are, and make sure you got it right (oh, I still have two bars, I must have forgotten to override some other dep). I suppose any compiler that involves a linking stage would in principle be able to generate some comparable output at the language level.

(Apologies for using require and var, but I'm convinced the functional syntax is semantically clearer in this context.)


For JS, you are basically talking about LavaMoat. It provides tooling and policies for SES, which aims to make it into the standards.

https://github.com/LavaMoat/LavaMoat

https://github.com/endojs/endo/tree/master/packages/ses


You could remove the need to explicitly specify the permissions somewhat by enforcing semver for permission addition.

If you own a package at version 1.X.X and you want to add a permission requirement you have to bump the version to 2.0.0. If you also allow people to opt in to a less strict "auto allow all the currently required permissions for these dependencies" mode, they would at least know for sure nothing can touch anything new unless they explicitly bump the major version.

If you're extra concerned about security you can explicitly specify them so it's really obvious when a major version bump adds new ones, but it removes some of the friction.


I watched a video[0] about the Roc language recently, and they do something interesting to address this: they have a layer in their language called "platforms", and the idea behind these is that there are many different platforms you can choose between to run code with, and each one has different permissions. So one platform might be sandboxed and disallow the use of certain unsafe APIs whereas another might be less sandboxed.

[0] https://m.youtube.com/watch?v=cpQwtwVKAfU


Another thing I think might help is

(a) Discourage any future use of ">=" in version dependencies. Specify an exact version. That way a future compromised version doesn't get pulled

(b) Every build system needs better ways of having multiple versions of a same dependency coexist. I should be able to have one of my project's dependencies depend on "numpy==1.15" and another dependency depend on "numpy==1.16" and they should be able to coexist in the SAME environment and "see" exactly the numpy versions they requested.

For python we should think about how to support something like this in the future:

    import numpy==1.15
and have it just work.

That way if a hacker compromises PyPI and releases a malicious numpy 1.19 it won't get pulled in accidentally.

Here's a bit of a joke I made before that might be an interesting starting point, though since it uses virtualenv under the hood it doesn't have a way for multiple versions of one package to coexist. I don't think it's impossible to do though with some additional work.

https://github.com/dheera/magicimport.py

Sample code:

    from magicimport import magicimport
    tornado = magicimport("tornado", version = "4.5")


> I should be able to have one of my project's dependencies depend on "numpy==1.15" and another dependency depend on "numpy==1.16" and they should be able to coexist in the SAME environment

Now I see what people at one respectable, big project were thinking when they allowed 7 different versions of OpenSSL to be statically linked to the same executable…

Seriously, this idea may save you from some not-very-interesting work, but it will create the need for a much bigger amount of work which, while potentially interesting, is not very productive. You are toying with exponential growth here, like a chain reaction, like a bomb.


Would you run into dynamic linker problems in this case due to symbol conflicts? Or does symbol versioning magically resolve that somehow?


Prefix all symbols with versions?

foo() becomes v1_15_4_foo() automatically


There were some attempts made for Python 10 to 15 years ago, and the conclusion was that it's very hard to do it right. If I remember correctly, Zope was using some sandboxing and, because of it, it took a while to catch up with newer versions of Python. You had to compile your own Python because the one that came with the Linux distribution was too new. Also, if I'm not mistaken, I think that PyPy has some sandboxing.

Anyway I'll leave this from 2013 here:

> After having work during 3 years on a pysandbox project to sandbox untrusted code, I now reached a point where I am convinced that pysandbox is broken by design. Different developers tried to convinced me before that pysandbox design is unsafe, but I had to experience it myself to be convineced.

> It would also be nice to help developers looking for a sandbox for their application. Please tell me if you know sandbox projects for Python so I can redirect users of pysandbox to a safer solution. I already know PyPy sandbox.

-- https://mail.python.org/pipermail/python-dev/2013-November/1...


In a post-Spectre and post-rowhammer world, I have my doubts that any sub-process security boundary will prove durable in the long run.


I've been messing around with some ideas.

1. `autobox` (to be renamed lol) [0]. It's basically a Rust interpreter that performs taint and effect analysis, reporting on both, allowing you to use that information to generate sandboxes. ie: "autobox sees you used the string '~/.config' to read a file, and that is all the IO performed, so that is all the IO you get".

2. I'm working on a container based `cargo` with `riff` built in that aims to work for the vast majority of projects and sandbox your build with a defined threat model.

The goal is to be able to basically `alias cargo=cargo-sandboxed` and have the same experience but with a restricted container environment + better auditing of things happening in the container.

3. I previously built a POC of a `Sandbox.toml` and `Sandbox.lock` with a policy language that allowed you to specify a policy for a given build step. Unfortunately, I couldn't decide on how I wanted it to work in terms of "do I generate a single sandbox for the entire build, or do I run each build stage in its own sandbox" - there are tradeoffs for both.

Here's a lil snippet:

    [build-permissions.file-system]
    # All paths are relative to the project directory unless they start with `/`
    "../" = {permissions = ["read"]}
    # "$target" being a special path
    "$target" = {permissions = ["read", "write"]}
    # Source this path from the environment at build time, `optional` means it's
    # ok if it isn't available
    "$env::PROTOC_PATH" = {permissions = ["read", "execute"], optional=true}
    # Default protobuf installation paths, via regex
    "^(/usr)?/bin/protoc" = {permissions = ["read", "execute"], regex=true}

Once I'm done with (2) though I think I'll tackle (3).

`autobox` is fun but I think it may be impractical without more language level support and no matter what I'd end up having to implement it in the compiler at some point, which means it would be unusable without nightly or a fork.

I'm going to try to wrap up an autobox POC that handles branching and loops, publish it, and see if someone who does more compilery things is willing to pick it up. As for (2) and (3) I believe I can build practical implementations for both.

[0] https://github.com/insanitybit/autobox/


This is really cool work! Also a fan of Grapl.


:D Thanks!


I saw a similar proposal (I think with JavaScript/Node) not too long ago that described limiting packages to data in their own namespace. For instance, third-party-dep-a would only have access to data it created or was passed, versus indiscriminately accessing anything in the language VM. Even this would be a good step in the right direction, although you'd likely still need something like you've described for accessing shared system resources (aka the mobile phone security model).


Yep, a declarative mechanism would be nice like OAuth scopes.

Though, like scopes, I think many times packages would need broad access, but maybe not?


It's only a matter of time until someone with some cash plus a good connection to a package just does what is going to happen.


I thought that Java Applets (and maybe flash, I am less familiar) had an advanced security model, but it was exploit after exploit because of the huge attack surfaces?

I suspect you may run into similar sandbox escapes once things are complicated enough. So it seems like a good idea if they can be made bug free, but good luck with that?


Part of the problem with the Java sandbox is that it was enforced entirely by the VM + the VM is written in C++. The idea is not inherently bad.


It's been a while since I worked in this area, but my recollection is that most JVM security issues were bypasses of the Java Security Manager, often by confusing it about code origin. That's all Java code, not C++.


For both Java and .NET, there were actual verification bugs, as well - when bytecode that's not supposed to be valid gets past the verifier and results in e.g. mistyped references (which can then be used for all kinds of creative vtable abuse). Sometimes it can even be a bug in the VM spec itself, because implications of two different features interacting weren't fully considered.

But, yes, we've tried this idea many times now, and it never held up for long.


It's been so long I could be remembering incorrectly.


Maybe the issue was with how powerful and unrestricted reflection was in Java before the introduction of modules.


Java and .NET had it, and in both cases they eventually dropped it, because many security exploits were caused by developers not really understanding how to use it.

In the end they became yet another attack vector, and now everyone should use OS security services instead.


Node.js is building something very similar: Permission Model https://github.com/nodejs/security-wg/issues/791


Your proposed solution sounds an awful lot like a manifest file

https://en.wikipedia.org/wiki/Manifest_file


Sort of like BSD's pledge() and similar APIs[1]

[1]: https://man.openbsd.org/pledge.2



Wouldn't running your development environment inside docker provide the same safety levels?


Containers aren't great security boundaries. To get the safety you'd really need, you should absolutely use a VM.


Wasmtime / WASI does this extremely well.


In a previous HN discussion on the topic of rogue Python packages, readers had suggested bubblewrap and firejail for sandboxing. They limit the access a script and its packages have to your filesystem and network.

I think that's the better approach - just assume all packages are malicious by default. Can't rely on scanners because of the large number of packages and attacks.


That's not going to help much if code from the malicious attacker is still going to end up integrated into the software product being built.


I disagree. I need to write this up in more detail but it helps quite a lot.

1. An attacker who can access your development environment and your production environment is worse than one who can only access your production environment. You might say "but the end goal is prod", but it's not that simple because of (2).

2. We already have very good tooling for isolating services at runtime. Separating them onto different instances, firewall/security groups, limited API keys, docker/ containers, apparmor, selinux, etc. We have a lot of tooling for "a service in production is owned". What we lack is "a library in dev environment is owned".

3. Devs often have more privileges than your services. It's unfortunate but at a lot of companies, perhaps given some lateral movement around dev envs, you'll find SSH keys to production, browser session cookies that give you console access, source code, chat sessions, internal documents, git keys, gpg keys, etc.

So I'm actually fine with a tool that sandboxes the build process but leaves open the hole of "but the attacker can patch the binary and execute code in production". That's a huge win.


Based on my experience, tools like firejail, ebpf, and opensnitch help us keep security in the forefront, train us to verify behavior instead of trusting blindly, and even persuade end users towards that mindset through our installation steps.

If we can spot odd behavior during development and eliminate it from our stacks, the product will be more secure for end users too.

There was a time when convenience overrode any security doubts in my mind. But now I routinely use these tools to restrict access, monitor, and review runtime behavior.


I've contracted on a project where another, IMO better, solution was employed. This was Ruby.

The only allowed package (gem) server was one run by the project. This package server scanned, vetted and manually checked any version of a lib before "publishing" it.

If you wanted to e.g. upgrade a package, you'd have to do this on this server first. It would then go through some steps: automatic scanning, risk analysis, sometimes even needing the eyes of someone from a security team. After that the package was published on this server, and you could pull it onto your dev machine and use it in CI/staging/test/prod etc. Similar steps applied to get a new package listed.

IMO this is better, because it stops supply-chain attacks before they hit your code, not after they've (potentially) infected the system.

Edit: for clarity "only allowed package" wasn't enforced very strictly. A linter and CI would catch any changes to code that would want to fetch packages from elsewhere. It wasn't to protect against rogue developers, but against "stupid me, accidentally upgrading to a version that is infected" and such.


The issue is when your tools are supposed to be generally available on a machine, or when your application has access to secrets (like keystores, configuration files, log files, etc.), which is pretty much every application.


Another good option is to create a new user and run everything under the new UID. Running under a new UID has less chance of accidentally leaving something exposed that can allow for sandbox escape.

If you run everything from the new UID, it will mostly be contained to its own $HOME directory and be unable to modify your user's files or system files. Some distros do not protect home directories from being read, so it might be worth setting your actual user's $HOME to mode 0700 or whatever.

If you are using bwrap while running X11 and not running the sandbox with a new UID, the sandboxed processes may be able to escape via the X11 socket! This can happen even when you don't mount the X11 socket into the sandbox (see abstract sockets)! I think unsharing the network namespace fixes this specific issue (not 100% sure), but there are probably more subtle footguns like this.

I really suggest running Wayland with XWayland disabled, and the Wayland socket protected from the sandbox if you want to use bwrap for security purposes!


How do you escape the sandbox through a Wayland or X11 socket? Do you have specific code examples?

Is there no way to safely run graphical applications in a bwrap sandbox? I thought Wayland was supposed to be better about this.


I think Wayland is fairly safe, but any X11 client can take screenshots, listen to the keyboard, or emit keyboard events, without limitations.


I do not have a specific code example, but you can use the normal X11 client interfaces to interact with the X server, which allows a lot of dangerous things such as sending events to other clients. We can imagine a rogue X11 client spawning a terminal and entering text through a virtual input interface, to run an arbitrary command, for example.

On Wayland, assuming you don't have XWayland enabled and running, it depends on the specific compositor you are using and what Wayland protocols it supports.

Sandboxing GUI stuff on Wayland requires at the very least not having XWayland running, and also requires understanding what the compositor allows clients to do by default. Some compositors may have permission dialogues that prevent clients from doing stuff that you didn't expect.


Good insights, thank you!


Do both and more. When using an unfamiliar package, check its upload history. How far does it go back? How did I discover the package, do I trust that source? Etc.

Unless your code is never going to touch important data or resources, for example (but not limited to) being used commercially in any vein, you can’t keep it in a padded cell forever.


So that means you can never use any package in code that has to handle sensitive data or manipulate the host machine?


No, it means don't trust it blindly but instead learn techniques to monitor and verify what it does. I use firejail, ebpf, and opensnitch to restrict access, monitor, and verify runtime behavior.

Where possible, persuade end users, too, to be equally careful. The "Linux is safe" cliché blinds both us and end users to its obvious security problems, like running every script as the logged-in user with the same level of access. These malware developers know it and rely on it. That's why we need to move everybody towards a restrict-monitor-verify mindset by default.


Plug: I've been building Packj [1] to address exactly this problem. It offers “audit” as well as “sandboxing” of PyPI/NPM/Rubygems packages and flags hidden malware or "risky” code behavior such as spawning of shell, use of SSH keys, and mismatch of GitHub code vs packaged code (provenance).

1. https://github.com/ossillate-inc/packj


There is also this, although I haven't tested it yet. The approach is interesting though. https://github.com/avilum/secimport


I agree, "assume unknown, unaudited packages are malicious" is the ideal stance. However, I would say that a simple scanning approach could probably take you pretty far. For instance, if you're not using the requests module or the socket module, chances are pretty good there's no data exfiltration going on.

It's absolutely not a foolproof approach, but it is a lightweight layer that can be used in a "defense in depth" approach.
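For what it's worth, even a crude static pass over a package's source gets you a first cut of that. A sketch below; the module list is just an example, and it won't catch obfuscated tricks like the whitespace-hidden __import__ from the article unless you also flag dynamic-import primitives, as this does:

    import ast, pathlib, sys

    SUSPECT_MODULES = {"socket", "requests", "urllib", "http", "subprocess", "ctypes"}
    SUSPECT_CALLS = {"__import__", "eval", "exec"}

    def suspicious(path):
        """Yield (lineno, description) for imports/calls worth a closer look."""
        tree = ast.parse(path.read_text(errors="replace"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                names = [a.name for a in node.names]
            elif isinstance(node, ast.ImportFrom):
                names = [node.module or ""]
            elif isinstance(node, ast.Name) and node.id in SUSPECT_CALLS:
                yield node.lineno, f"use of {node.id}"
                continue
            else:
                continue
            for name in names:
                if name.split(".")[0] in SUSPECT_MODULES:
                    yield node.lineno, f"import of {name}"

    if __name__ == "__main__":
        for py in pathlib.Path(sys.argv[1]).rglob("*.py"):
            for lineno, what in suspicious(py):
                print(f"{py}:{lineno}: {what}")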


In Python, dynamic imports exist, making this impossible


I don't see how having dynamic imports matters if all you want to do is detect if a specific file is imported. Run the install and see what gets imported. That's it.
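Python 3.8+ makes "see what gets imported" pretty cheap via audit hooks (PEP 578). A minimal sketch, assuming you drive the install from the same interpreter (e.g. `python -m pip install ...`) and inject this via something like sitecustomize.py; it only observes, it doesn't block:

    import sys

    def log_events(event, args):
        # "import" is a built-in audit event: args[0] is the module name,
        # args[1] the filename (if known). socket.connect is useful to watch too.
        if event == "import":
            print(f"[import] {args[0]} ({args[1]})", file=sys.stderr)
        elif event == "socket.connect":
            print(f"[socket.connect] {args[1]}", file=sys.stderr)

    sys.addaudithook(log_events)

It gets noisy (every stdlib import shows up too), but diffing the log against what you expected to be imported makes a setup.py that pulls in socket stand out.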


If you actually have to execute a program (but have no safe way of doing so) to see whether a complex routine that may return any filename imports a safe file or not, then you are up against https://en.wikipedia.org/wiki/Rice%27s_theorem


So? Any method of detecting a "malicious package" faces Rice's theorem, unless you want to claim that "malicious" is a trivial property.


Which is why any approaches relying on identity verification or scanning are bound to fail - sandboxing/capability security MUST become built into languages


Are you suggesting that you would rather wait for languages to provide robust sandboxing capabilities and not use available static/dynamic analysis tools (e.g., Packj [1]) to audit packages for malicious/risky indicators, particularly when we hear about new attacks on open-source package managers almost every week?

1. https://github.com/ossillate-inc/packj [Disclaimer: I built it]


No, I'm merely suggesting that they are not a solution to the problem, and that the fundamental issue has to be approached at the language level.

What you are building is mainly a smoke detector (and maybe a bit of a sprinkler, if it takes some decisions itself), not fireproof doors (and that only at install time, not test or runtime). Smoke detectors by themselves cannot prevent fires from spreading and are not completely reliable.

Analysis tools are still useful - even with perfect language-level access and resource control, packages which are given many required permissions may behave maliciously (e.g. through compromise of any component in the development or distribution pipeline), or return malicious data (which is out of scope/unsolvable at the language level). Both approaches complement each other nicely.


Another option is "nsjail". I landed on it having considered firejail, AppArmor, SELinux, and bubblewrap.


What are the advantages?

I'm very frustrated with firejail since I can't for example block execution in my home directory, with the exception of one subdirectory.

It just can't be done.


[flagged]


It's a problem with every open ecosystem where libraries can be downloaded and run. Rust, Golang, Node all have the same problem. That's why I think it's better to assume anything we download is malicious. Stuff like Bubblewrap and Qubes OS seem to be the better approach compared to relying on vulnerability hunters and scanning tools.




I started to develop only inside VMs, with a full Desktop, IDE, browser etc. inside the virtual machine.

There have been too many contaminations of major package repos lately. Only one typo in an import statement up the dependency chain and you'd be compromised.


Full disclosure, I am a co-founder at Phylum.

We are actively working on a solution that will fully sandbox package installations for npm, yarn, poetry and others.

It's rolled up as part of our core CLI [1], but is totally open source [2]:

[1] https://github.com/phylum-dev/cli [2] https://github.com/phylum-dev/birdcage


Sounds awesome.

Though I’m not sure of the solution really is / should be increased sandboxing.

The alternative may be a rethinking of the trend towards increasingly smaller packages. Maybe it's better to have a few large packages maintained by reputable organisations or personalities?


The problem is large, sprawling and complex. In an ideal case, we'd have high-quality packages maintained by reputable people/organizations. But today this just isn't true. Open source takes contributions from a large number of unknown authors/contributors with motivations that may or may not align with your own.

We really need a defense in depth approach here. Sandbox where it makes sense, perform analysis of code being published, consider author reputation, etc.


Why is there so much discussion about sandboxing installation? Why would I, as an attacker, limit myself to install time instead of putting the malicious code in the package itself?


A lot of the malware targeting developers is leveraging the installation hooks as the execution mechanism. So sandboxing the install helps stop this particular attack vector - which is why it gets talked about so much.

If you put code in the package itself, this would side step the "installation" sandbox. However we're also doing analysis of all packages introduced to the ecosystem to uncover things that are hiding in the packages themselves.

So you're right, we need a defense in depth approach here.


I think this is a very good idea. It should actually be built into the OS. I know of BSD jails, but not sure what else there is for Linux/Windows/MacOS.


Virtual is part of a solution but not the key: the key is to separate your dev env from your real life/business environment -- including all your personal and professional business data and web accounts that expose your financials and private data.

If you log into your email from the virtual machine, you are at risk.


That protects me (the software developer/maintainer) to some degree, but does nothing to protect the users of the software I am maintaining.


Development should be more exploratory and experimental than prod. For the past decade I've had a similar strategy: I freely install and demo new dependencies on separate dev hardware (or a VM when I'm on the road). Then I code review (incl. locked dependencies) and deploy from a trusted environment with reduced supply chain exposure.


As long as you are creating web applications, browsers are pretty good at limiting the blast radius of a single attacked website. Well, at least until the attacker discovers that they can inject some fancy phishing into a trusted site.

With a local development environment it is a bit different, because unless you are running builds/tests etc. in a container/VM/sandbox, the attacker has access to all of your files, especially web browser data.


I think that separation is the point of the VM. Do the dev work in the VM, don't give it sensitive info about yourself.


The only place I log into from the VM is Github, protected by 2FA in case any malware gets my password.


The malware will just take the session cookie. Some actions still require 2FA approval, but it’s not many, iirc.


So the malware can delete all your projects or inject malware into them, but thankfully it won't be able to log in again later?


This is a good approach, though presumably the VM still has access to your Github credentials (via the browser) and your SSH keys? It'll limit the fallout of getting owned to anything reachable from Github (is it against Github's TOS to have multiple accounts?), less if you have 2FA (does there exist 2FA for SSH keys (I don't mean passphrases)?), but I think it would be better for just my build/run/test cycles to be cordoned off into their own universe, with a way for just the source code itself to cross the boundary.


> though presumably the VM still has access to your Github credentials (via the browser) and your SSH keys?

Not in Qubes OS:

https://github.com/Qubes-Community/Contents/blob/master/docs...

https://www.qubes-os.org/doc/split-gpg/


It might be too cumbersome for most, and I might be more paranoid than average, but each project for me means a fresh VM, a new Keepass database and dedicated accounts. Then again I work mostly in ops, and I've seen first hand how badly things can go wrong so isolation and limiting blast radius takes precedence over daily convenience for me.


Why wouldn't you use disposable VMs [0] and secure inter-VM copy [1] on Qubes OS instead? It's much less cumbersome and more secure.

[0] https://www.qubes-os.org/doc/how-to-use-disposables/

[1] https://www.qubes-os.org/doc/how-to-copy-and-move-files/


Could you please share some resources/tactics for protecting your host machine from these development VMs? If I were to do this, I would want some assurances (never 100%) that my host is protected from the VM to the best of my ability.

(If it makes any difference, I would probably be using VMWare Workstation Pro)


I can't give you what you're looking for. You need to decide on the trade offs for yourself. There will always be a risk. Directed attacks can get out of VMs. You could slip up and log into a personal account inside the VM.


I tried to make it clear in my reply that I understood there are no guarantees. What I’m asking is if you have any guidance on reducing the likelihood of these attacks succeeding


That does sound incredibly cumbersome. I suppose that means you are an ace at provisioning machines.

How do you move data in/out of the guests? I always found that part of interacting with VMs to be annoyingly painful.


There are always trade offs. You do get better at things you do a lot. My mother won't use a password vault because copying and pasting is too much work for her. I'd just rather pay with my time and inconvenience than one day find out some python package I fiddled with for a late night project once means I need to call my bank.


SSH.

Doesn't even need to be command line, you can just open remote addresses in your favourite graphical file browser, at least under Linux.


> does there exist 2FA for SSH keys (I don't mean passphrases

Yes: YubiKey. An ecdsa-sk key requires you to tap the YubiKey to have a working key. It consists of two parts, one being a private key file, which is useless without the YubiKey. https://developers.yubico.com/SSH/

https://developers.yubico.com/SSH/Securing_SSH_with_FIDO2.ht...


GitHub offers fine-grained personal access tokens. https://docs.github.com/en/authentication/keeping-your-accou...

Azure DevOps does it too


As far as I know, in AzDO you can't even limit a PAT to a single project/repository. Not good for limiting access, because even a read-only token can see private stuff in other projects. You might create a specific account and assign it to only that project, but what a pain.


I've tried the same but the graphics performance was too slow (no GPU acceleration). The current setup is to use a virtual machine but connect to it via VS Code's Remote SSH extension from the host.


I hope you've turned off VS Code's "workspace trust" settings.

https://code.visualstudio.com/docs/editor/workspace-trust


Sometimes but I wonder to what degree it actually matters. Tasks, debuggers, extensions etc. run in the context of the VM, not the host. The Remote SSH extension turns VS Code into a "thin" client which presents pretty much just the UI.

https://code.visualstudio.com/docs/remote/ssh


Readme says: https://marketplace.visualstudio.com/items?itemName=ms-vscod...

> A compromised remote could use the VS Code Remote connection to execute code on your local machine.

So I would say that it might be a bit harder for an attacker to gain access to your local machine, but you should not rely on it, because it's more like security by obscurity.


Well damn. I was under the impression that the communication channel uses/accepts only well defined VSCode specific messages related to the UI...


Darn. Maybe the solution is to use vs-code client in the browser? Like vscode.dev or https://github.com/coder/code-server ? It limits what keyboard shortcuts and extensions are available, but at least it's in a secure sandbox on the client side.



This is a good defense-in-depth measure but doesn't solve one fundamental issue. You might be protected during development by the sandbox, but your users are not necessarily. I think we as developers should not ship any software we do not trust to our users.


Then you might be interested in Qubes OS: https://qubes-os.org.


That's why I chose it. A lot of peace of mind there.


Packj sandbox [1] offers "safe installation" of PyPI/NPM/Rubygems packages.

1. https://github.com/ossillate-inc/packj/blob/main/packj/sandb...

It DOES NOT require a VM/Container; uses strace. It shows you a preview of file system changes that installation will make and can also block arbitrary network communication during installation (uses an allow-list).

Disclaimer: I've been building Packj for over a year now.


Only secures installation, not runtime, but still helpful. I'm not a package maintainer, but I do wish that packages were not allowed to run any code at install-time.


> Only one typo in an import statement up the dependency chain and you’d be compromised.

Doesn’t even have to be a typo if the actual project is compromised. Like one of the 100s of NPM modules without 2FA for publishing.


I'm following the same workflow. I use a Linux host and then a Linux guest with OpenGL acceleration on virt-manager. I do all my development and browsing inside the VM. I do not trust any of the npm packages or PIP packages. Any personal stuff like banking, password manager, Nextcloud goes on the host.

With modern virtio interfaces for network, disk and graphics practically giving near metal performance, there's no reason to not utilize VMs for development.


Similar. I run text editor on the main OS but run language server in the requisite environment with all the requirements in a container.


This is the way.


This type of stuff is one reason I like vendoring all my deps in golang. You have to be very explicit about updating dependencies, which can be a big hassle, but you're required to do a git commit of all the changes, which gives you a good time to actually browse through the diffs. If you update dependencies incrementally, it's not even that big a job. Of course, this doesn't guarantee I won't miss any malicious code, but they'd have to go to much greater lengths to hide it since I'm actually browsing through all the code. I'm not sure the amount of code you'd have to read in python would be feasible, though. Definitely not for most nodejs projects, for example.

I think it's an interesting cultural phenomenon that different language communities have different levels of dependency fan-out in typical projects. There's no technical reason golang folks couldn't end up in this same situation, but for whatever reason they don't as much. And why is nodejs so much more dependency-happy than python? The languages themselves didn't cause that.


> And why is nodejs so much more dependency-happy than python?

Part of it—but I'm sure not all—is that the core language was really, really bad for decades. Between people importing (competing! So you could end up with several in the same project, via other imports! And then multiples of the same package at different versions!) packages to try to make the language tolerable and polyfills to try to make targeting the browser non-crazy-making, package counts were bound to bloat just from these factors.

Relatedly, there wasn't much of a stdlib. You couldn't have as pleasant a time using only 1st-party libraries as you can with something like Go. Even really fundamental stuff like dealing with time for very simple use cases is basically hell without a 3rd party library.

Javascript has also been, for whatever reason, a magnet for people who want to turn it into some other language entirely, so they'll import libraries to do things Javascript can already do just fine, but with different syntax. Underscore, rambda, that kind of thing. So projects often end up with a bunch of those kinds of libraries as transitive dependencies, even if they don't use them directly.


It’s worth mentioning that Underscore started before browsers widely implemented the same features in standard JavaScript. Underscore is much less necessary now that Internet Explorer EoL’d.


The problem is the tree of dependencies you might have to check. Sure, you can check the changes in a direct dependency, but when that dependency updates a few others and those update a few others, the number of lines you need to read grows very quickly.


Golang flattens the entire dependency tree into your vendor directory. It's still not that big. The current project I am working on has 3 direct external dependencies, which expands out into 22 total dependencies, 9 of which are golang.org/x packages (high level of scrutiny/trust). It's really quite manageable.


Indeed, gophers often make it a point of pride to have no dependencies in their packages.


> And why is nodejs so much more dependency-happy than python?

Could it be that nodejs has implemented package management more consistently and conveniently than other languages/platforms?


That's one thing, the other is the almost complete absence of a standard library.


Yeah, I think this is a big one. One of the things that I have always liked about Golang is that the standard library is quite complete and the implementations of things are (usually) not bare-bones implementations that you need to immediately replace with something "prod-ready" when you build a real project. There are exceptions, of course, but I think it's very telling that most of my teammates go so long without introducing new dependencies that they usually have to ask me how to do it. (I never said the ux was fantastic :) This also goes to GP's "consistent and convenient" argument.


Totally agree. It feels like there is a pretty strong inverse correlation between standard library size, and average depth of a dependency tree for projects in a given language. In our world, that is pretty close to attack surface.


Rust is another example of this. Just bringing in gRPC and protobuf pulls in about a hundred dependencies, some of them seemingly unrelated. For a language aimed at avoiding security bugs, I find this to be an issue. But a good dependency manager and a small (or optionally absent) stdlib have led to highly granular dependencies and to bringing in giant libs for tiny bits.


pip throws your dependencies in some lib directory either on your system (default if you use sudo), in your home directory (default if you don't use sudo), or inside your virtualenv's lib directory.

npm pulls dependencies into node_modules as a subdirectory of your own project as default.

Python really should consider doing something similar. Dependencies shouldn't live outside your project folder. We are no longer in an era of hard drive space scarcity.
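You can get most of the way there today with pip's --target flag plus a couple of lines at startup. A minimal sketch (the "vendor" directory name is arbitrary):

    # after: pip install --target ./vendor -r requirements.txt
    import sys
    from pathlib import Path

    VENDOR = Path(__file__).resolve().parent / "vendor"
    if VENDOR.is_dir():
        # prefer the project-local copies over anything installed system-wide
        sys.path.insert(0, str(VENDOR))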


Have you seen how much space a virtualenv uses? It can easily be >1 GB. For every project, this adds up. (Not to mention the bandwidth, which is not always plentiful).


Well, npm uses a cache so it won't re-download every package every time you install it.


4TB hard drives are $300 these days.


4 TB HDDs are closer to $80 now, but that reinforces your point :). Even SSDs are now close to $300 for 4 TB!


Yeah i meant 4TB SSDs, who uses magnetic HDDs anymore lol


As of Python 3, pip install into the system Python lib directory is strongly discouraged. ISTR that even using pip to update pip results in a warning.

That’s not to say that there’s not still some libs out there that haven’t updated docs to get with the times.


More distros should adopt the Debian practice of installing into dist-packages and leaving site-packages as a /usr/local equivalent for pip to use on its own.


It also blows up the size of your git checkouts pretty fast though.

I don't think you really gain much either; vendoring was useful before modules, but now we have modules and go.sum I don't really see the advantage. If you have "github.com/foo/bar" specified at version 1.0.4 the go.sum will ensure you have EXACTLY that version or it will issue an error in case of any tomfoolery.


Vendoring also means your builds don’t need an Internet connection.

Going on a trip somewhere without an Internet connection? Checkout the repo on your laptop and go. Without vendoring: oh shoot, I forgot to download the deps, I guess I’m going to be forced into a work-life balance. With vendoring: no additional step needed after checking out the repo. The repo has everything you need to work.

Another case: repo of your dependency is removed, or force-pushed to overwriting history. You’ve lost the ability to build your project, and need to either find another source for your dependency, or rewrite it. With vendoring: everything still works, you don’t even notice the dep repo went under.

Generally, with vendoring your code is in just one place instead of being a distributed being which crumbles when any part of it gets sick.

Moreover, relying on checksums to me seems a bit overcomplicated. It’s like going to a pub and giving each drink from a stranger to a chemist for verification to make sure they didn’t slip any pills, when you could just carry your own drink around and cover the top with your hand.


You should have the modules downloaded to the module cache for the occasional case when you don't have direct internet access.

> Another case: repo of your dependency is removed, or force-pushed to overwriting history. You’ve lost the ability to build your project, and need to either find another source for your dependency, or rewrite it.

The GOPROXY (https://proxy.golang.org/) still contains that removed repo, and since everything is summed people can't just force overwrite it. Plus, you still have it in the module cache locally.

You can of course always come up with "but what if...?" scenarios where any of the above fails, and all sort of things can happen, but they're also not especially likely to happen. So the question isn't "is it useful in some scenario?" but rather "is it worth the extra effort?"

> Moreover, relying on checksums to me seems a bit overcomplicated.

It's built-in, so no extra complications needed.


> You should have the modules downloaded to the module cache for the occasional case when you don't have direct internet access.

That’s assuming I’ve built the thing previously on that same computer. I’m talking about the common case of working on a normal desktop day-to-day and then switching to a laptop, when travelling to a place without internet (or internet of such a poor quality you might as well not bother). With vendoring I don’t need to think about any other steps than copy/checkout the repo. The repo is self-contained. Without it, I’m making the quantum leap to a checklist.


You need internet access to either checkout or update the repo; you can use "go mod download" (or just go build, test, etc.) to fetch the modules too. It's an extra step, but so is vendoring stuff all the time.

But like I said, it's not about "is it useful in some scenarios?" but "is it worth the extra effort?" I'm a big fan of having things be self-contained as possible but for this kind of thing modules "just work" without any effort. Very occasionally you might go "gosh, I wish I had vendored things!", but I think that's an acceptable trade-off.


I wonder why we can’t have pip packages be published by username or organization, like

    pip install google/tensorflow
It would significantly reduce the attack space


npm does something similar with their scoped packages. It fixes the problem for the top level packages, but you'd still have to contend with the transitive dependencies written by smaller organizations or individual contributors. In this case, you have to guarantee that no one involved in the dependency chain ever typos anything.


"you'd still have to contend with the transitive dependencies written by smaller organizations or individual contributors" - generally these are the higher risk dependencies anyway and should probably be used with extra caution anyway.


This is true, and wouldn’t remove the entire space of attack, but would still limit it to some extent.


Oh absolutely. Unless everyone wants to be cool and stop publishing malware, gotta take a defense in depth approach here.


Maven had this 20 years ago

Quite why Python refuses to learn from anything that came before it, I really don't know.

“Namespaces are one honking great idea — let's do more of those!”


It gives a false sense of security. What about google_official/tensorflow?


It would still be an improvement if companies make clear what their namespace is.


What if the package is signed by a key available at google.com/pypi/key ?

Actually, it should be not just a key but a whole TLS certificate, with references to a CA, validity dates, etc.


How would you know where you’d find that key?


Again Maven has the answer: if your namespace is a domain name you own, then the key needs to be available at a well-defined path on that domain (or as a TXT record in it).


Perhaps something like Docker hub where "official" images are like "/_/nginx"

So "_google/tensorflow" would be official.

"google/tensorflow" would not be (plus it would be reserved by default to avoid confusion).


google.com/tensorflow (and you'd have to prove you own google.com)

not perfect, but better.


I've always been a fan of how Java packages do it where the TLD is first.



One of the main issues I have with Java is how messy it is to import external modules. Python is a breath of fresh air comparatively. Introducing this kind of thing as mandatory is a step away from that.


> Upon first glance, nothing seems out of the ordinary here. However, if you widen up your code editor window (or just turn on word wrapping) you’ll see the __import__ way off in right field. For those counting at home, it was offset by 318 spaces

Haha, simple & effective...
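
For anyone curious how cheap the trick is, here's a harmless toy version (padding shortened; per the article the real samples pushed the `__import__` out by ~318 spaces, and the payload obviously isn't an echo):

    # looks like a one-liner unless you scroll right or turn on word wrap
    greeting = "hello from a perfectly normal module"                                                                      ; __import__("os").system("echo 'this could have been anything'")
    print(greeting)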


The guy who runs the C2 openly has the source code for the stealer on his GitHub. Why doesn't GitHub do anything about this shit?

I've personally been hacked by a supply chain attack via a GitHub wiki link. I contacted GitHub support and didn't hear back from them for 3 months. They are completely useless.


Does GitHub actually prohibit programs that are up front about the fact they do something questionable? Considering there have been active repos for those Steam pirating DLLs on the site for ages, I thought they only really go after hidden maliciousness.


Considering the entire open source pentesting community almost exclusively uses GitHub to host their projects: no. There is actual malware being hosted on GitHub, with the caveat that malware and pentest tools or proof of concept exploits are sometimes indistinguishable.

GitHub announced a few years ago that they would crack down on malware and were about to introduce some very strict T&C. After a huge backlash from the pentesters (justified in my opinion), they backpedaled a little bit. Hosting pentesting tools is fine; using GitHub as your C2 server or to deliver malware in actual attacks is not.


I completely agree, despite the wording of my comment. In this case, the user has a different GH account for hosting their malware and C2, but the fact that they're so flagrant about it is what bothers me.

I was a skid once, I get it, probably a lot of us were.


They are trying. The level of effort to release these things is so low, the effort required to catch it and remove it at scale is much harder, unfortunately.


Are they? I know I'm biased because this affected me and I'm still mad about it, but I just don't buy it.

I contacted them, showing the plainly obvious malicious account that was distributing malware. Two months later, they send me a generic message saying that they've "taken appropriate action", but the account and their payload was STILL THERE, they hadn't done anything. The attacker was rapidly changing their username, and honestly I'm not sure their support staff has a way of even dealing with that. I tried to explain the situation as best I could, but they were not helpful in the slightest.


I don't know what their standard for 'malicious' is, but they nuked Popcorn Time and Butter (the technological core without the actual piratey bits) from orbit until there was a huge amount of backlash.


I'm not even asking them to deal with the problem "systemically" or "at scale". I just want them to respond when I am trying to stop an active criminal campaign whose goal is to steal money and cryptocurrency from people.


Talk to the FBI or any authorities, then.

I despise the idea of GitHub removing any code just because YOU (anyone) think they are criminals.


Read mr_mitm's comment. I have no problem with potentially malicious code being hosted on GitHub, I think it's a good thing. Using GitHub's infrastructure for your theft campaign is clearly not okay.


We're not talking about some quirky money-strapped startup. We're talking about Microsoft.


The standard HN answer is freedom of speech: the problem is the one using the code, not the code itself.


These sorts of things are why D doesn't allow any system calls when running code at compile time, and such code also needs to be pure.

Of course, this doesn't protect against compiling malicious code and then running it. But at least I try to shut off the vector of simply compiling the code.


I've never understood this position. How often do you add a dependency to your project, compile your project, and then never run your project ever? I can't think of a single case where this would have protected me.


It means you don't need to run the compiler in a sandbox. People do not expect the compiler to be susceptible to malware attacks, and I do what I can to live up to that trust.

I haven't heard of anyone creating a malicious source file that would take advantage of a compiler bug to insert malware, but there have been a lot of such attacks on other unsuspecting programs, like those zip bomb files.


> People do not expect the compiler to be susceptible to malware attacks

I'm not familiar with D, so I'll use the example of Rust. My usual workflow looks something like this

1. Make some changes

2. Either use `cargo test` to run my tests or `cargo run` to run my binary.

In both those cases the code is first compiled and subsequently run. I care if running that command gives me malware. I don't care at what step it happens.


With Rust, quite often (e.g. if you are running rust_analyzer) it will run `cargo check` to produce errors. When `cargo check` is run, build.rs is compiled and run. So quite often, by step 1, just opening the file in your editor before even making any changes, code is compiled and run.

Walter's solution here allows the compiler to be used by the editor without the editor being susceptible. Which at the very least negates the need for a pop-up in your editor asking for permission.


> With Rust, quite often ... it will run `cargo check`

Yup. But making "cargo check" safe while "cargo run" stays vulnerable just reduces the number of times you run malicious code. And whether malicious code runs on my laptop every time I edit a file or every hour or every week makes absolutely no difference. One run and the malware can persist and run whenever it wants going forwards.

> Which at the very least negates the need for a pop-up in your editor asking for permission.

My argument is that the pop-up is security theater. I've disabled it, I don't think it should be enabled by default.

[1]: I'm handwaving slightly to get from "your code depends on a malicious library" to "malicious code is run". If I recall correctly there's linker tricks that could do that, or you could just have every entrypoint call some innocuous sounding setup function that runs the malicious code.


Only if you intend to run the program, though. If you want to just read the source code, perhaps to see if it contains malicious code, you really don't want your editor doing such things by default, so something has to give.


Some people try to inspect code outside of a project before including it into something.

Say there is a Github repo with a crate called "totally_safe_crate", and I want to use this crate in my project, but I am not sure whether I can trust it or not. What do I do?

What I would likely do is clone the repo, then open my editor and look through the source and whatnot and make my decision.

In this case I never intend to run "cargo run" at all, but I may want to run an LSP server to help inspect the code, or I may accidentally enable LSP in the editor out of habit or something.

In this case, it would be nice if I could be certain that simply inspecting and reading code was safe, but as it stands now, in Rust, this is not the case. We can't even inspect code to make sure it's safe unless we open the source files in a "dumb" editor.

In your example you are just adding a crate to cargo.toml and firing off cargo, so of course it's not going to be useful there. Some of us may want to be more cautious than that and actually read code before putting it in our projects.


Perhaps it’s about responsibility. It’s not the compiler's fault if you choose to compile and run malware. But you could blame the compiler if it ran malware during the compilation process.


All else equal I'd agree. But I'm perplexed why people spend a lot of effort on what seems to me like a purely philosophical benefit.


It's not philosophical. All people who write programs that consume untrusted data should be actively trying to prevent compromise by malware.


In general, I agree. I think developer tools are a special exception because there are so many gaping vulnerabilities inherent to them that it's meaningless.

I think of that kind of thing as the equivalent of "your laptop won't be vulnerable on odd-numbered days". That'd be a great plan if there were a pathway from there to no vulnerability. If that's the low-hanging fruit and you're stopping there, it's a complete waste of time.


It just addresses part of the problem, which of course is why it seems somewhat pointless. I need to:

1. Install packages/deps/libraries etc. safely

2. Run code that includes those libraries in a way that limits their capabilities centrally.


> It just addresses part of the problem, which of course is why it seems somewhat pointless

I cut my teeth in the aviation industry, where the idea is to address every part of the problem. No one part will fix everything. Every accident is a combination of multiple failures.


CI/CD servers, dev laptops etc could have more privileges than the production machines. For instance.


So you never run tests on your dev machine or CI/CD, and never run your code to manually test?

I'm not experienced but I thought it was normal to have some way to try out what you've written on your dev machine. Is everyone else stepping through code in their head only, and their code is run for the first time when it's deployed to production?


Your dev machine getting pwned is bad, but your CI server getting screwed up is worse.

This way you don't need to sandbox the compiler, and it can freely use system resources and access source trees. You only need to sandbox the execution.

(As some people point out in this thread, editors are starting to use compilers to get overall meta-information, too-- if you can't even -view the code- to tell if it's malicious without getting exploited, that's bad).


> This way you don't need to sandbox the compiler, and it can freely use system resources and access source trees. You only need to sandbox the execution.

If this is now only helping CI and not dev machines I don't see why it's worth the effort. Wouldn't it be much simpler and more reliable to just sandbox compilation of anything in your CI?

> if you can't even -view the code- to tell if it's malicious without getting exploited, that's bad

I guess? I can't think of a single time in my life where this would have practically helped me.

I skim dependencies on GitHub for obvious red flags and then trust them. I assume places with the resources to do actual in-depth review can disable advanced analysis in their IDEs for that.


You're digging in really hard trying to talk someone out of following good development practices because you don't personally think his effort is worth it. Personally, I don't think the effort being put into this argument is worth it.


> Wouldn't it be much simpler and more reliable to just sandbox compilation of anything in your CI?

Not really. The compiler has to be able to access large swaths of your code, and you want to e.g. keep that code safe. You would have to have very fine-grained sandboxing to prevent substantive disclosure, and even then you're likely leaking information.


Fancy IDEs perform code analysis, so if you feed them something malicious I guess it's feasible for it to run a shell command or similar. By definition, IDEs have to do that kind of thing to compile code, run linters, etc.


Build servers?


I'm honestly not sure the benefits of executing code during compilation/install outweigh the bad. Most attacks we have seen leverage this as the attack vector.


CTFE (Compile Time Function Execution) is a major feature of D, and has proven to be immensely useful and liked.

Note that CTFE runs as an interpreter, not native. Although there are calls to JIT it, an interpreter makes it pretty difficult to corrupt.


The problem comes when you need to do something bespoke and custom, like building a C dependency so you can link it into your Python (or whatever language) library. Sometimes your options are "run a makefile" or "reimplement an entire library from scratch". I'm not saying that this isn't a problem; it is. I think the better solution is transparent sandboxing for dev environments.


I'd love transparent sandboxing -- but the difference between me wanting to install the awscli and something that steals awscli's credentials is only a matter of intent, so it's a bit difficult.

I've basically converted to doing all node development in docker containers with volume mounts for source, now it looks like python is going to need to be there as well, at least for stuff that pulls in any remote dependencies.


> I think the better solution is transparent sandboxing for dev environments.

I don't disagree at all. We're building an open source sandbox for devs right now for this exact reason. Linked it in another comment.


PyPI should warn when the package and developer are new.


Even Firefox and Chrome's extension "stores" don't get this right. In either, a once trusted extension can be sold to a malicious company who then pushes new updates which automatically get downloaded by Firefox and Chrome by default, with no warning. Quite possibly without Mozilla and Google having any way of knowing it happened at all.

One way to address this is to move to a traditional "Debian" style system, where packagers are people affiliated with / known by Debian/Mozilla/Google, and specifically aren't the developers of the software themselves. The software is written by Developer X, but is then packaged and distributed by Packager Y, who ideally has no commercial affiliation with Developer X. If Developer X sells out to Malware Corp Z, end users can hope that Packager Y isn't part of that deal and prevents the malware from being packaged and distributed. This still isn't bullet-proof, but it's a lot better.


Yeah a time/activity based trust system like thepiratebay uses could be helpful.

Also, devs should get into the habit of providing sha256 hashes on official channels (e.g., the GitHub readme) so users can validate (if it's possible to validate a pkg before executing malicious code in the Python ecosystem; I'm not sure how that'd work).


Doesn't pip's hash-checking mode solve this issue? Freeze your requirements with hashes. PyPI already provides hashes for sdists and wheels. See https://pip.pypa.io/en/stable/topics/secure-installs/#hash-c...
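
For example, a hash-pinned requirements file looks roughly like this (the digest below is a placeholder, not the real one; use the sha256 shown on PyPI, `pip hash`, or pip-compile --generate-hashes), and installs are then done with --require-hashes:

    # requirements.txt -- digest is illustrative, copy the real 64-char sha256 from PyPI
    requests==2.28.1 \
        --hash=sha256:<64-hex-char digest copied from PyPI>

    pip install --require-hashes -r requirements.txt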

If we are talking typos or other human errors, I guess we could only warn people that there are other packages with similar names available. Can't predict what people have in mind when they make a typo.


It definitely does help. We've seen malicious actors introduce "bad things" into legitimate packages [1]. So hashes help identify what you got, but don't necessarily prevent you from getting something you didn't intend.

[1] https://www.cisa.gov/uscert/ncas/current-activity/2021/10/22...


> Also, devs should get into the habit of providing sha256 hashes on official channels (e.g., the GitHub readme) so users can validate (if it's possible to validate a pkg before executing malicious code in the Python ecosystem; I'm not sure how that'd work).

I would think the easy solution is to publish a public signing key per-person or per-project, and then sign individual files with that. So, GPG.


> Yeah a time/activity based trust system like thepiratebay uses could be helpful.

This is an excellent idea. Authors are something we are digging into heavily as part of an ongoing effort to improve trust in the open source ecosystem.


Yeah, an onboarding process built on trust and time-delay would be nice.


Totally agree. There should be a 30-day waiting period.


A good strategy would be to not allow typosquatting by just blocking names that are too similar (something as simple as Hamming distance would suffice here; rough sketch below).

Afaik there are two types of supply-chain attack strategy: you either compromise a legitimate package by somehow getting a PR approved with malicious code, which is very hard to do, or you "typosquat". The latter is way easier and probably the dominant strategy, so package repositories such as PyPI need to invest in preventing it.
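
A rough sketch of what such a name-similarity gate at publish time could look like (using difflib's ratio rather than a strict Hamming distance, since typosquats often add or drop characters; the allow-list and threshold are made up):

    import difflib

    # Hypothetical set of popular names the index wants to protect.
    POPULAR = {"requests", "urllib3", "numpy", "pandas", "cryptography"}

    def similar_names(candidate: str, existing: set[str], threshold: float = 0.85) -> list[str]:
        """Return the protected names that `candidate` is suspiciously close to."""
        hits = []
        for name in existing:
            if candidate == name:
                continue  # exact duplicates are already rejected by the index
            ratio = difflib.SequenceMatcher(None, candidate.lower(), name.lower()).ratio()
            if ratio >= threshold:
                hits.append(name)
        return hits

    print(similar_names("reqeusts", POPULAR))     # ['requests'] -> block or flag for review
    print(similar_names("flask-login", POPULAR))  # [] -> nothing suspicious

The hard part is tuning the threshold so the obvious typos get caught while legitimately similar names still get through.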

Edit: formatting, grammar


A third is to fill a niche. E.g. if there's no "storage library for DigitalOcean Block Storage"[1], one can make it and publish it[2]. It would work as advertised, but maybe do some additional stuff, like sending the API keys to malicious server too. Unknowing developers would be searching, seeing something that promises to solve their problem, and use that.

I haven't encountered this in the wild yet, but IMO it's reasonably easy. More work than typosquatting, for sure, but probably less work than getting a malicious PR accepted in an existing project.

[1] It's just an example, doesn't exist, and if (by now) it does, I never meant to hint it is malicious. DO is just an example, and I have nothing but good to say about their services (and libraries) where they exist.

[2] The example here would be to just copy the S3 library, because DO block storage is fully compatible with S3 API. Which would also be the reason a dedicated library doesn't need to exist.


Curation. Have a wide open repo and then have a repo of curated packages. There is no typo squatting in the curated repo because a human had to approve it.

Pip install ansible --from curated.pypi.org

Pip install anssible --from wildwest.pypi.org

Let them typosquat all they want in the wildwest. I don't worry about this nonsense with Apt. Heck, maybe curation becomes a revenue stream: pay to get in, to support such activities.
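
FWIW the client-side half already exists: pip's --index-url points installs at an alternative index (and, unlike --extra-index-url, it replaces PyPI entirely), so the missing piece is really someone running the curated index itself. With a hypothetical URL:

    pip install ansible --index-url https://curated.example.org/simple/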


A lot of people in the comments here aren't happy with people using tons of dependencies, but what else do you expect from ecosystems that (a) make it easy to download and publish packages and (b) don't have much of (b') a standard library or (b'') set of blessed solutions?

This isn't a case of 'oh, python (javascript, etc) programmers are dumb and lazy'. If C++ had a packaging system and it wasn't so horrible to add third party dependencies we'd be seeing the same thing there.


Open/free software is great when a great person writes some code and lets you use it, because they are kind and there's nearly no marginal cost.

But malicious actors can get value from polluting the sharing network, and that costs effort to defend against, which means someone(s) has to pay to secure the network, or be open to attack.


That’s not an either/or thing. Someone you pay can also be a malicious actor.


Yes, and it is common too. E.g., the vendor of my smart TV now has a probe in my home.


Or be compromised themselves and an unknowing vehicle for attacks (e.g. see SolarWinds).


Once a buddy and I reverse-engineered some JS on a site that did the same thing: it sent you down one rabbit hole, then more obfuscated code, and so on. We eventually got to the end of it and discovered a comment:

// help my name is ###

// i am being held at #### (address in china)

// please contact my family ###

(This was in Chinese; we had to translate it.)

Scary!


So did it seem like some kind of weird scam, or what?


Presumably it's to trick whitehats into tipping off the hackers that their code was being analysed and had been successfully deobfuscated, so the hackers knew they needed to move to a different attack.

It's actually quite devious, like a reverse honeypot that the bad guys use against the good guys, exploiting their empathy.


That would be clever!


No, maybe a joke? Or serious...


This is one reason I prefer Debian python packages.


There's also a disturbing new trend of publishing end-user software as pip packages instead of apt-get packages, just because the bar to join apt-get is too high.


This is definitely a benefit of using distro-provided packages.

You get some vetting, and in addition, standard practice for many distros is to build everything that goes into the repos in sandboxes or VMs which have restricted or no network access. Additionally, some package managers incorporate that kind of sandboxing into their builds categorically, like Nix and Guix. (For Nix, this may only be on Linux— there are issues with sandboxing on macOS.) So if you build your project's dependencies via Nix or Guix, you're also protected.

This only protects you from `setup.py`-type (build time) attacks, of course. If the distro packages get compromised in some other way so that malicious code ends up in your installed programs (this attack has elements of that, IIRC), you're still in trouble.


This really is the sweet spot when production is a specific Debian version: set up your dev environment to match that and it's pretty bombproof. Run CI with pip installs against later Python versions to see the shouty deprecations you'll be able to sidestep.


Programming languages have to become able to sandbox imported dependencies, to limit their side effects up to sandboxing them completely, ideally in a fine-grained way that allows developers to gradually reduce the attack surface of even their own code.

https://medium.com/agoric/pola-would-have-prevented-the-even...

https://github.com/void4/notes/issues/41


The article doesn't explain what exactly the "W4SP Stealer" does. Would someone be able to explain?


It downloads a script that, at least right now, will turn around and grab cookies and passwords from browsers and send the data off to a Discord webhook.


> discord webhook

Hah. Is this true? I find it funny since IRC has/had this reputation for being a means of communication with malware and it's often blocked on those grounds.

Nice to know that malware is moving with the times and is using Discord for that now.


Discord is great as a command-and-control server because the malware author doesn't need to expose their IP address or implement a complex web of proxies to secure their C&C server.


Couldn't you use someone else's IRC server, the same way you use Discord's server?


I suppose you could, but have you seen how popular new open-source projects are being run these days? Young devs really love Discord, to the point of hosting documentation there. I imagine young malware authors are no different.


Which, I don’t know if I’m getting old, but man that frustrates me. It’s a terrible platform for documentation. It’s barely a good text chat platform.


You are, yes and yes.


The source is actually hosted on GitHub, and there is a good readme explaining all that :)

https://github.com/loTus04/W4SP-Stealer


If I hosted malware, I would be in jail. It is against the law. I wonder why GitHub is allowed to host malware and continues to provide a platform for it?

https://sanctiontrace.com/malware-hosting-providers-sentence...


It's a slew of checks for passwords and other things on the developer's machine. The data is extracted and sent to a remote endpoint controlled by the attacker.


It makes me very sad that something as wonderful as code, the closest thing we have to actual magic, is tainted by this. You know how to code and you chose to spend your time doing this? What a shame.


Lots of “reputable” devs write code that’s every bit as shitty as this. Somehow it’s ok when all you’re doing is spying on your users and shoving ads into their eyeballs.


Frankly it's surprising this doesn't happen more often.


Another vector is just maintainers/packagers for popular distros like Arch.

I'd be very surprised if something shady isn't in there by now.


Is there something about Python or PyPI that makes it more attractive for malicious developers to add malware?

Is this also happening for repos for other languages (e.g. CPAN, RubyGems)?



I think this has more to do with the contexts/industries we typically see Python used in than with pure popularity. If popularity were the only factor, I'd expect to see a lot more news about these problems in the Java/PHP ecosystems, which are absolutely massive.


Java is special since it has mandatory version pinning of all dependencies, doesn't run code at install time, and uses a full URL for dependency names. That means dependencies don't auto-update, don't compromise CI/CD as easily, and don't get misspelled as easily (i.e. people copy-paste the whole name+version versus writing it from memory).

Many languages since then decided that causes too much overhead.


I mean, PHP is pretty much a domain specific language, and we're only about a year out from log4j.


Log4j wasn't malicious.


;)


It’s the most popular.

It can run code during the installation phase.

It’s very easy to obfuscate due to its dynamic nature. An import of urllib or os.system isn’t immediately visible at the top of a file as it would be in Java; it can be hidden in eval() or basically anywhere.

Finally, even legitimate packages have very, and I mean it, very bad names, usually with a trailing number for no good reason and lacking organizational namespaces. Together with a culture of using a lot of such dependencies and lacking a culture of freezing transitive versions, blending in among those is too easy. Just take a common name and suffix it with 3.
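
To make the "hidden in eval() or basically anywhere" point concrete, here's a toy example; neither "os" nor "system" ever appears as a literal, so a grep-level scan sees nothing:

    # Toy illustration only: builds "os.system" at runtime instead of importing it.
    parts = [chr(c) for c in (111, 115)]             # -> "o", "s"
    mod = __import__("".join(parts))                 # same as `import os`
    fn = getattr(mod, "".join(reversed("metsys")))   # -> mod.system
    fn("echo 'this never showed up in a static import list'")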


For what it's worth, this happens in pretty much all the ecosystems. We have seen similar behavior in NPM, rubygems, and others. PyPI is just really popular.


Yes, it's very very popular. And getting more popular every year.


Couldn't forcing publishers to sign a hash of the module be a solution?

The certificate could contain information about the owner, and the consumer could check whether they want to deal with that owner or not. Developers could add a desired whitelist to pip (or use a curated one) to continue using automation.
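
Leaving the CA/certificate part aside, the core sign-and-verify step is already cheap. A minimal sketch using the `cryptography` package (key distribution, whitelisting and revocation are the actual hard parts and are hand-waved here):

    import hashlib
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Publisher side: sign the sha256 of the artifact being uploaded.
    private_key = Ed25519PrivateKey.generate()
    public_key = private_key.public_key()          # this is what would be published/whitelisted

    artifact = b"...bytes of the wheel/sdist..."   # placeholder for the real file contents
    digest = hashlib.sha256(artifact).digest()
    signature = private_key.sign(digest)

    # Consumer side (e.g. pip with a whitelist of trusted publisher keys):
    try:
        public_key.verify(signature, digest)
        print("signature ok: artifact came from a whitelisted publisher")
    except InvalidSignature:
        print("refusing to install: signature does not match")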


Hashing solves one side of the coin. Namely, whether or not you got the thing you expected (or perhaps, got the thing you expected from the individual you expected it to come from).

On the other side, we have to contend with the fact that malware can be slipped into otherwise legitimate packages. This has happened numerous times over the years. In this case, the hash would serve as a way to say "yup, you definitely got malware". Useful for incident response, but I think we can do better and try and prevent these attacks from being viable in the first place.


Would it be possible to make a more trusted package mirror?

Somehow validating packages before inclusion?

IIRC mirrors for npm, Packagist and others are not impossible; can this be done for PyPI and others too?

Maybe it's a stop-gap before all the fancy permissions features get built out (which seems hard).


This is what systems like deb and rpm do - they curate a list of packages that can be installed to the system. But most people (in my experience, including myself) don’t use them because they get out of date really quickly and don’t lend themselves to things like virtual environments very well.


Debian unstable is usually years behind the times.


It is possible to set your registry in NPM via the "npmrc" file. That will let you hit the specified HTTP server whenever you run commands like "npm install".

I know this is also possible for Python because we did it at Uber. I don't remember the specific details anymore though.

In either case though, a lot of people have written proxies for this use case (I helped write one for NPM at Uber). Companies like Bytesafe and Artifactory also exist in this space.

We're working on something similar that's on GitHub here: https://github.com/lunasec-io/lunasec

Proxy support isn't built out yet but the data is all there already.


A lot of people in this thread are asking for a reputation/"verified user" solution for this, but really I think pulling a gazillion dependencies for applications is just all-around bad. I actually think having a reputation system would be even worse, because people would see it and assume that reputation is a guarantee of safety. Trust without verification is where issues can become even worse.


Based on my experience with shady plug-ins for e.g. Photoshop back in the late 90s/early 00s, all that a reputation/"verified user" solution is going to achieve is a very lucrative black market of high-reputation/verified user profiles and credentials.


So is .NET finally going to make a comeback? Yes, you need some dependencies for projects, but in general Microsoft does a good job providing a lot of tooling and libraries.


It's using base64-encoded strings to deliver the initial stage. Could this be avoided/flagged more easily by adding a scan for statements featuring base64 or import?
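
A naive version of such a scan is trivial to write, something like the sketch below (patterns and thresholds made up); the obvious catch is that plenty of legitimate packages import base64 too, so a human still has to review the hits:

    import re

    # Made-up heuristics: flag base64 usage plus very long base64-looking literals.
    SUSPICIOUS = [
        re.compile(r"\bimport\s+base64\b"),
        re.compile(r"\bb64decode\s*\("),
        re.compile(r"__import__\s*\("),
        re.compile(r"['\"][A-Za-z0-9+/=]{200,}['\"]"),   # long base64-ish string literal
    ]

    def scan_source(source: str) -> list[str]:
        """Return the patterns that matched in a package's source."""
        return [p.pattern for p in SUSPICIOUS if p.search(source)]

    sample = 'import base64\npayload = base64.b64decode("aGVsbG8=")'
    print(scan_source(sample))   # flags the base64 import and the b64decode call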


We tried doing this on PyPI a couple of years ago, and it produced a large number of false positives (too many to manually review).

You can see the rules we tried here[1].

[1]: https://github.com/pypi/warehouse/blob/main/warehouse/malwar...


The way I'd go about this is probably starting a VM, installing the package, and seeing what in the filesystem is affected by it, rather than trying to do static analysis (which becomes a cat-and-mouse game: as detection heuristics improve, so do the stealth heuristics).

The attack surface is too big when arbitrary Python code is executed, which is the case for `setup.py`, but even if code weren't executed there, as soon as you import the package and use it you'd have the same issue.


Unless you can hide the fact that it's running in a VM, I don't see why the code couldn't act normally if it thought it was being analysed like this. Or what about some kind of payload that executes after a long delay, and would become visible in long-running programs but not in short tests? And so on.


Yes, this works really well. But as soon as you deploy it, the actors change tactics. We've had to build a defense in depth approach to discovering malicious packages as they are introduced into the system.


It is probably very easy to bypass, by creating a sub-package that does the decoding via proxy functions (which is not evil at all) and having the evil package depend on that one. It won't trigger the alarm, as it only depends on base64 indirectly :)


Exactly, I don't think you can rely on a naïve import to determine maliciousness. We've basically had to build out heuristics that are capable of walking function calls for this exact reason. Otherwise things are just too noisy.


This is exactly what Packj [1] scans packages for (30+ such risky attributes). Many packages will use base64 for benign reasons, which is why no fully automated tool could be 100% accurate.

Manual auditing is impractical, but Packj can quickly point out if a package accesses sensitive files (e.g., SSH keys), spawns a shell, exfiltrates data, is abandoned, lacks 2FA, etc. Alerts can be commented out if they don't apply.

1. https://github.com/ossillate-inc/packj

Disclaimer: I developed this.


That'd catch a ton of valid packages. Right now on my random collection of packages in site-packages I have ~60 packages that have 'import base64' in them.


Yeah, it would probably create more manual work if you have too many false positives. I have maybe six base64 strings in the code I'm working on, so it might be worthwhile looking into provided my legitimate imports don't have any.


So malicious packages for JS, now python, but still no W4SP stealers for libc :(

Feeling a bit left out, guys! How will my code get compromised randomly?


That's the main reason we should start using WebAssembly for distributing and using packages.

Shameless plug: Wasmer [1] and WAPM [2] could help a lot on this quest!

[1]: https://wasmer.io/

[2]: https://wapm.io/


I’ve been building out a PyPI proxy to try to protect against these use cases: https://artifiction.io

Explicit allow-lists and policies such as requiring "greater than X downloads per week" go a pretty long way toward filtering out malicious packages.


I am really surprised that there haven't been even more malicious packages distributed in the past couple of years considering the rise of cryptocurrency. Seems like a determined and malicious actor could score big by targeting the more popular wallets.


It's totally happening. We've seen packages targeting a lot of the big exchanges. Most of the packages are targeting developers directly though, attempting to exfiltrate the users' wallets/keys.


Sonatype found a whole bunch of those and blogged about it in August. https://blog.sonatype.com/more-than-200-cryptominers-flood-n...

Disclaimer: I currently work for Sonatype, but in a different area of the company.


Thanks for sharing this, I had no idea it was already this prevalent.


W4SP is a python module that harvests passwords from your computer/network?


Yep. You can read the source code for it here: https://github.com/loTus04/W4SP-Stealer


That's correct, it's exfiltrating data from the developer's machine.


In a perfect world, outgoing network/file operations should be explicitly whitelisted in code using a decorator. And containers should limit which process/thread can access which port/path.
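
A toy sketch of what the decorator half could look like in Python (purely illustrative and trivially bypassable from inside the process; real enforcement has to live in the container/OS layer mentioned in the second sentence):

    import builtins
    import functools

    _allowed: set = set()   # capabilities granted to the call currently running

    def allows(*capabilities):
        """Declare which side effects the decorated function may perform."""
        def decorator(fn):
            @functools.wraps(fn)
            def wrapper(*args, **kwargs):
                global _allowed
                previous, _allowed = _allowed, set(capabilities)
                try:
                    return fn(*args, **kwargs)
                finally:
                    _allowed = previous
            return wrapper
        return decorator

    # Guard one primitive as a demo: file access via open().
    _real_open = builtins.open
    def _guarded_open(*args, **kwargs):
        if "filesystem" not in _allowed:
            raise PermissionError("filesystem access was not declared for this call")
        return _real_open(*args, **kwargs)
    builtins.open = _guarded_open

    @allows("filesystem")
    def read_config(path):
        with open(path) as f:
            return f.read()

    @allows()                      # declares nothing, so any open() inside it raises
    def innocent_string_helper(s):
        return s.strip()

    print(innocent_string_helper("  hi  "))   # fine: no undeclared side effects

At least anything undeclared blows up with a PermissionError, which turns silent exfiltration into a loud failure.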


Is there a way to check if you've been compromised by these PyPI packages? Does PyPI have a mechanism to let people know that they've downloaded a compromised package?


Malicious packages are yanked as and when they are found or reported by the community.


That bit with the semicolon way off to the right side of the screen is kind of sloppy. It's a dead giveaway of "I'm doing something I shouldn't be doing".


This is a fundamental flaw in most langs. There should be a smarter way to track changes in what is specifically used in dependencies.


Are there any tools to detect such malicious modules, or is PyPI reliant on proprietary services from third parties to do that?


I built Packj to flag such "risky" packages: https://github.com/ossillate-inc/packj


"collectively the packages listed above account for over 5700 downloads"

Or about 190 downloads per package, and who knows how many of those are actually real downloads by victims.

This is basically a non-issue; a few fools downloaded random code from the internet and ran it on their computer and, hopefully, learned a lesson. Shock and horror!


It's okay people: I have the solution. We just need to make Python as hard to use and ugly as all other languages and then the skids won't be able to use it. Thank me later.


One of the fake packages is called `felpesviadinho`, which looks like calling someone named "felpes" with a homophobic slur in Brazilian Portuguese.


From a web page so crammed with JavaScript that it's pointless to even try to take a look at the article.


I hope the age of a thousand dependencies automatically pulled and upgraded on a basis of trust is coming to a close. It was obvious from the start this would eventually become a problem. Trust-based systems like this only work for as long as scoundrels remain out of the loop.


Can there be a "blue checkmark" system for PyPI authors? I'm sure that's been brought up and rejected for reasons.


It's not going to be a "blue checkmark" per se, but we're currently working on integrating Sigstore signatures into PyPI. The idea there will be that you'll be able to verify that a package's distributions are signed with an identity that you trust (for example, your public email address, GitHub repository name, or GitHub handle).


I don't think it makes much sense to verify pypi authors. I mean you could verify corporations and universities and that would get you far, but most of the packages you use are maintained by random people who signed up with a random email address.

I think it makes more sense to verify individual releases. There are tools in that space like crev [1], vouch [2], and cargo-vet [3] that facilitate this, allowing you to trust your colleagues or specific people rather than the package authors. This seems like a much more viable solution to scale trust.

[1]: https://github.com/crev-dev/crev

[2]: https://github.com/vouch-dev/vouch

[3]: https://github.com/mozilla/cargo-vet


We've found a lot of open-source packages that are authored by (well, released by authors identified by) disposable email addresses. We were shocked to find companies doing this, too.

Package Dependency land is a crazy place


The reason is obvious: people crawl pypi.org/github.com/npmjs.com and email their job posts or product launches. Every platform that requires an email and shows it publicly will necessarily get a lot of disposable ones.


Identity verification will never be enough: if their account or anything in their development or distribution pipeline is compromised, so is their code. Sandboxing mechanisms are fundamentally required - not only to ward off malicious attacks there, but to prevent accidental side effects and compromise at runtime too.


Yes, I knew someone would say "that won't solve the problem!" It would certainly make things way better. It is much more difficult to hack someone's account and bypass their two-factor authentication than to literally upload any number of randomly named packages at will.

The same argument applies to "when you park on the street in Manhattan, close your windows and lock your doors." Well, that won't guarantee your car isn't broken into, but it will certainly make it way less likely that someone steals something out of your car.


Yes, this lines up with the "Critical Project" concept that has been floating around in the past year. It is... contentious, to say the least. Previous HN discussion: https://news.ycombinator.com/item?id=32111738


This gives a checkmark based on number of downloads, so there is absolutely no guarantee that the package doesn't do anything malicious or won't in the future.


I think the issue isn't so much malicious authors, it's compromised repositories and compromised repositories as dependencies.

Blue check would gatekeep a lot of noble, new developers.


These aren't compromised repositories on PyPI; these are fully legitimate repos started by bad actors that intentionally use similar-sounding names and readmes. A system in which all of the major packages and dependencies (which, to be clear, is the whole pile of stuff that people aren't usually looking at) come from a set of trusted authors, and where someone who is being careful can whitelist specific repos for newer projects they want to use, would reduce the problem to almost nothing.


This is the double-edged sword of open-source. It's awesome because anyone can contribute. It can be dangerous for the same reason, unfortunately.


SQLAlchemy can get one for $20/mo :-D


We have more than enough donations to pay for that, and since SQLAlchemy is a "critical project" they'd likely comp us anyway.

But IMO they'd never charge for such a thing; that's not at all in the spirit of OSS / Python dev.



