Hacker News
Malicious software libraries found in PyPI posing as well known libraries (gov.sk)
475 points by nariinano on Sept 15, 2017 | 245 comments



Ok, here's some ugly backstory on this: This problem has been known for a while, yet both the pypi devs and the python security team decided to ignore it.

Last year someone wrote his thesis describing python typosquatting and standard library name squatting: http://incolumitas.com/2016/06/08/typosquatting-package-mana...

However after that the packages used in this thesis - the most successful one being urllib2 - weren't blocked, they were deleted. Benjamin Bach was able to register urllib2 afterwards. Benjamin and I decided that we'd now try to register as many stdlib names as possible.

See also: https://www.pytosquatting.org/


This is a scary attack. One partial mitigation is to use a firewall (e.g., Amazon VPC network ACLs) to restrict outbound network traffic to a small number of known addresses like well-known repos. I can't think of a good reason why code in any well-behaved application should be allowed to make random outbound network calls.
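
As a rough illustration of that kind of egress lockdown with a VPC network ACL (the ACL ID and CIDR below are placeholders, not real values), a boto3 sketch might look like this:

  import boto3

  ec2 = boto3.client("ec2")

  # Allow HTTPS egress only to the known mirror's address range...
  ec2.create_network_acl_entry(
      NetworkAclId="acl-0123456789abcdef0",   # placeholder ACL ID
      RuleNumber=100,
      Protocol="6",                           # TCP
      RuleAction="allow",
      Egress=True,
      CidrBlock="203.0.113.0/24",             # placeholder: your mirror's CIDR
      PortRange={"From": 443, "To": 443},
  )

  # ...and deny all other outbound traffic with a later rule number.
  ec2.create_network_acl_entry(
      NetworkAclId="acl-0123456789abcdef0",
      RuleNumber=200,
      Protocol="-1",                          # all protocols
      RuleAction="deny",
      Egress=True,
      CidrBlock="0.0.0.0/0",
  )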

I think it's also on app developers to rethink the culture of randomly grabbing packages to build applications quickly. This is already a security problem even with approved repos. Having a rat's nest of packages makes it hard to upgrade quickly when those repos post updates to address vulnerabilities.

Edit: Removed confusing statement about return connections


>well-known repos

I think it's clear at this point that you should be using internal mirrors of both programming language and OS package repos, so there is no need for build or production machines (other than those responsible for syncing the mirrors) to have outbound internet access at all.
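
As a small sketch of the client side of that (assuming a Linux host and a purely hypothetical internal mirror URL), pointing pip at the mirror is a one-key config file:

  import configparser
  import pathlib

  # Placeholder URL for an internal PyPI mirror (e.g. devpi or bandersnatch).
  MIRROR = "https://pypi.mirror.internal/simple/"

  # Write ~/.config/pip/pip.conf so pip only talks to the internal index.
  cfg = configparser.ConfigParser()
  cfg["global"] = {"index-url": MIRROR}

  conf_path = pathlib.Path.home() / ".config" / "pip" / "pip.conf"
  conf_path.parent.mkdir(parents=True, exist_ok=True)
  with conf_path.open("w") as f:
      cfg.write(f)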


> One partial mitigation is to use a firewall (e.g., Amazon VPC network ACLs) to restrict outbound network traffic to a small number of known addresses like well-known repos.

That breaks down very quickly with the combination of public CDNs and TLS. I suppose you could do SNI-based firewalling, but that is a bit ugly, and afaik you can't do that easily with common firewalls (like netfilter).


As I mentioned above, some of this goes back to application design. Ideally, if the system is layered correctly, the application components that own data should not talk to the outside world except in very narrowly circumscribed ways.

Unfortunately most real world systems that have internet access are created in far from ideal conditions.


This. If it was that easy we'd already be doing it on all our desktops and servers.


This is probably a good mitigation anyway, but it's not enough: a malicious package could include an (obfuscated) API key that allows it to exfil whatever data it wants to an s3 bucket or similar.

Unless you're super gung-ho about using Service Control Policies _and_ your service doesn't otherwise need to access s3, there isn't really a way to block this.


> I can't think of a good reason why code in any well-behaved application should be allowed to make random outbound network calls

If your application implements webhooks, then there's a valid use case? I'm in complete agreement however that you should default to deny and open up as required.


For security, you should call outbound webhooks using a server dedicated to that purpose, where the main server and webhook caller are connected through an async message passing system such as RabbitMQ. That way you can maintain strict firewall rules on the main server.
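
A minimal sketch of that split, assuming RabbitMQ with the pika client (1.x API) and a hypothetical queue name:

  import json
  import pika
  import requests

  QUEUE = "outbound_webhooks"  # illustrative queue name

  # Main server: only ever writes to the local broker, never to the internet.
  def enqueue_webhook(channel, url, payload):
      channel.basic_publish(
          exchange="",
          routing_key=QUEUE,
          body=json.dumps({"url": url, "payload": payload}),
      )

  # Dedicated webhook box: the only host whose firewall allows outbound HTTP.
  def run_caller():
      conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
      ch = conn.channel()
      ch.queue_declare(queue=QUEUE, durable=True)

      def on_message(ch, method, properties, body):
          event = json.loads(body)
          requests.post(event["url"], json=event["payload"], timeout=10)
          ch.basic_ack(delivery_tag=method.delivery_tag)

      ch.basic_consume(queue=QUEUE, on_message_callback=on_message)
      ch.start_consuming()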


An AWS SQS write-only queue for external-facing servers is what we do. What worries me is not servers... It's admin/dev machines...


>both the pypi devs and the python security team decided to ignore it.

Without trying to downplay the seriousness, I think that's a less-than-charitable take on the bind they felt they were in. It also sounds like you think there's a crack team of dedicated devs sitting around waiting for something to work on.


When I read things like,

https://caremad.io/posts/2013/07/packaging-signing-not-holy-...

I get the impression they don't care about security at all. They seem like children plugging their ears and shouting "nah nah nah" while putting repo users at risk. They've obviously done nothing since that post was made four years ago.

In contrast, Maven central requires signing. Unsurprisingly, Maven central doesn't have typosquatting problems. That's not a coincidence. That's also a strong reason why Java still dominates the enterprise.

If you (the reader) are a PyPi/NPM user, I challenge you to watch this,

https://www.youtube.com/watch?v=pBJafU0p_Nk

and tell me why you shouldn't use a repository manager like Sonatype Nexus which validates package signatures, checks licenses, and does vulnerability scanning.

By themselves, package signatures aren't the holy grail? That's not the point. Security is achieved in layers. Too bad they aren't mature enough to understand that.


>I get the impression they don't care about security at all. They seem like children

I had the absolute pleasure of working with dstufft for 6 or 7 months, years ago, and learned a ton from his breadth of knowledge and willingness to teach (and I had more than a decade of experience at the time). I can appreciate being unhappy with the situation and not agreeing with the approach, but personal attacks and assuming ill intent don't help. It is, of course, possible the developers actually do pay attention to the details and just see the issue better than you do.

I read that link at the time and I do wonder how you would address the issues he raises. The concern seems to be not so much an unwillingness to apply security as actually thinking through the underlying issues. Which is worse: an open but dangerous PyPI, or one with a bunch of security theater around keys and hashes and other stuff everyone pays lip service to but never actually checks?


Typosquatting and package signatures are separate issues. Package signing only prevents typosquatting insofar as either the user or some intermediate layer resolves the typo to the intended package. If someone was going to this effort, they'd probably go to the effort of double-checking the package name before installation anyway.

PyPi needs moderators to sit in the middle and remove anything that is obviously malicious, whether the packages are signed or not. Bad guys can sign packages just as easily as good guys.

Software should also be used to correct likely typos, perhaps including checking against a blacklist of known-bad package hashes, before the package is installed.

Yes, these approaches are imperfect, but they are better than doing nothing. "Perfect is the enemy of good".
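
As a toy version of the "correct likely typos" idea, an install wrapper could flag names that are one edit away from a well-known package (the popular-package list here is just a stand-in):

  import difflib

  # Tiny stand-in for "the N most-downloaded packages on PyPI".
  POPULAR = {"requests", "urllib3", "django", "numpy", "setuptools"}

  def check_name(requested):
      """Return a warning if the name looks like a typo of a popular package."""
      if requested in POPULAR:
          return None  # exact match, nothing suspicious
      close = difflib.get_close_matches(requested, POPULAR, n=1, cutoff=0.85)
      if close:
          return "'{}' is not '{}' -- possible typosquat, confirm before installing".format(
              requested, close[0])
      return None

  print(check_name("urlib3"))    # flags the near-miss
  print(check_name("urllib3"))   # None: exact match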


> needs moderators

Like Anaconda and Enthought? And countless internal departments? Or are you suggesting folks donate to the PSF and they hire a team?


I'm suggesting that whoever currently has admin rights on the PyPi packaging servers, and it's someone, take responsibility for this and physically remove the typosquatting libs from the lookup mechanism. "We need donations before we can do that" doesn't pass muster as far as I'm concerned; leaving this unaddressed is an existential issue for PyPI.

There are already privately-maintained repositories and that's great, but IMO it's not an excuse for PyPi to leave this vulnerability open.


Ah, so you're going to refuse to use PyPI until someone volunteers. I'm not certain that's an existential issue for PyPI. I plan to continue using it.


> In contrast, Maven central requires signing.

forgive my ignorance and my lack of 56 minutes to watch the entire youtube video, but who are the identities behind these signatures? The blog post you reference discusses the problem both of users signing their own packages (anyone can make a signature and any malicious package author can point people at a maliciously-owned signature as well) as well as having a central key (used by organizations with employees and known contributors, does not scale to pypi's model).

I'm also ignorant of a vulnerability scanner for Python (haven't looked). Does such a tool exist and have you proposed it as part of pypi's infrastructure? I am sure they'd be interested in that.

I'm not sure how the license file of a product impacts the issue of it being malware or not.

Took a look at https://www.sonatype.com/ and it appears to be a closed-source, commercial product - it seems to have a database of vulnerabilities in some format, but to rely on hashes of some kind. I'm not sure how that would work against arbitrary Python source code, but again, I am ignorant. I would encourage you to write a comprehensive rebuttal to the blog post you refer towards.


>who are the identities behind these signatures?

https://maven.apache.org/guides/mini/guide-central-repositor...

"we require you to provide PGP signatures for all your artifacts (all files except checksums), and distribute your public key to a key server like http://pgp.mit.edu."

>anyone can make a signature

The article flip flops on this.

any hacker can do it

it's too much burden for developers

>any malicious package author can point people at a maliciously-owned signature as well

Anyone can also verify ownership of the key before accepting packages signed by it. This is something professionals do. This is something institutions do. This is something three letter agencies do.

>Does such a tool exist

There are CVEs for python. One could even scan a repository using those. Java has prebuilt tools for this. OWASP has the dependency-check plugin for Maven. Nexus uses the same information in their repository health checks.
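
For a rough sense of what even a homegrown check could look like (the advisory file and its format here are hypothetical; a real tool would pull from the CVE feeds mentioned above):

  import json
  import pkg_resources

  # Hypothetical local advisory list: {"package-name": ["vulnerable version", ...]}
  with open("advisories.json") as f:
      advisories = json.load(f)

  # Compare every installed distribution against the advisory list.
  for dist in pkg_resources.working_set:
      vulnerable = advisories.get(dist.project_name.lower(), [])
      if dist.version in vulnerable:
          print("WARNING: {} {} matches a known advisory".format(
              dist.project_name, dist.version))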

>have you proposed it as part of pypi's infrastructure? I am sure they'd be interested in that.

Why would I? Given their response to signed packages, I would expect a response along the lines of "Too much burden. Too hard. Not perfect. Not worth it. Security theater. Go away. Ur dumb."

>I'm not sure how the license file of a product impacts the issue of it being malware or not.

It's one of those nice features of good repository management. Do python packages even list licenses? I mean, I assume they would, but then, they actively resist implementing other basic things which I would just assume they could do.

Licenses change over time. Some enterprises treat GPL like a virus. Knowing ReactJS changes from Apache to BSD + Patents in a new version is as important to someone in the business as knowing if a package is compromised.

>it appears to be a closed-source, commercial product

Nexus OSS is open source, Nexus Professional is commercially licensed. The latter has a few nice features the former does not. Both can manage PyPi, NPM, Ruby, Docker, Maven, and Nuget repos to name a few.

https://www.sonatype.com/nexus-repository-oss

>I would encourage you to write a comprehensive rebuttal to the blog post you refer towards

It's easier to fool people than to convince them that they have been fooled. -- Mark Twain


> Do python packages even list licenses?

Of course they do and it goes into the package classifiers.


> https://caremad.io/posts/2013/07/packaging-signing-not-holy-...

Thanks, that's actually a great article that explains very well why you can't just throw signatures at the problem and claim that fixes everything.

As other commenters have pointed out, the reason Maven central doesn't have this problem has nothing to do with signatures, and everything to do with the fact that all new packages must undergo manual review, which is unfortunately a solution that doesn't scale. (See the "Linux Has Packaging Signing, Let’s Steal Theirs" section from the article you linked.)


Package signing doesn't achieve anything without a trust model behind it, which is exactly what that post states. Too many people go "we need to add some crypto to this thing!" without developing a threat model and that ends up making the crypto pointless wankery to act as a security blanket without actually solving any problems.

Maven Central, to my knowledge, does not have typo squatting problems because Sonatype has a manual review process for all new projects. It has absolutely nothing to do with the fact that they allow projects to upload PGP signatures and it could not have anything to do with that, because PGP does not provide any mechanism to prevent that.

For example, there may be `urllib3` which is a valid project that must be signed by key X. We'll ignore how a tool like pip would find out that key X is the right key (although this is actually the most important part of a package signing solution) and just grant that we've solved that problem. Someone then comes and registers another project, `urlib3` which must be signed by key Y. The attack that is being described here is that a user would erroneously say ``pip install urlib3`` when they meant to type ``pip install urllib3`` and pip would then fetch that and download the package and install it. I think it is pretty obvious that signing doesn't help here, because pip doesn't know that the user really wanted urllib3 and not urlib3, so it can only determine that urlib3 is supposed to be signed by key Y (which of course, the hypothetical malicious person controlling urlib3 would have), fetch the package and verify its signature.

There is only one tried and true method for preventing this kind of human-introduced error collision (aka typo squatting) across the board, and that is manual review of all new projects. The problem with manual review then becomes one of scale. There are as of this time of writing 117,226 unique projects on PyPI with an average growth of around 100 new projects a day. In addition there are zero full time developers or operations or support people working on PyPI. There is one part time paid person (me), plus my unpaid time, plus one other part time unpaid developer/ops person who do the vast bulk of the work. There is simply not enough available bandwidth to process 100 new projects every day and to validate them for typo squatting/confusion possibilities.

Beyond that, there are a number of possible heuristic-based approaches that can try to reduce the chance of this happening, such as using Levenshtein distance, Unicode confusables, attempting to develop "reputation", etc. Most of these are either so broad as to catch a lot of projects which are not typo squatting but are real, actual different things, or so narrow as to be trivially defeated. That's not to say they aren't worthwhile or that there isn't an idea that would make sense, but focusing on that has not been a priority for a largely volunteer-based organization because there are lower hanging fruit that are more impactful, and because at the end of the day, without a manual review system, individual end users are still ultimately responsible for ensuring they're asking for the correct thing (and even beyond that, they're responsible for ensuring that the thing they're asking be installed is something that satisfies their own security constraints).
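
For what it's worth, a toy illustration of the confusables/normalization style of heuristic might look like the sketch below (the "registry" is a stand-in; a real check would use the Unicode confusables tables and a proper distance metric rather than NFKC plus separator stripping):

  import unicodedata

  def canonical(name):
      # Case-fold, NFKC-normalize, and strip separators so visually similar
      # spellings collapse to the same key. Crude, but shows the shape of it.
      folded = unicodedata.normalize("NFKC", name).lower()
      return folded.replace("-", "").replace("_", "").replace(".", "")

  existing = {canonical(n): n for n in ("urllib3", "requests", "django")}  # toy registry

  def looks_like_squat(new_name):
      owner = existing.get(canonical(new_name))
      return owner is not None and owner != new_name

  print(looks_like_squat("url-lib3"))  # True: collides with urllib3 once normalized
  print(looks_like_squat("urllib3"))   # False: it's the project itself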

Security is achieved by layering multiple secure systems on top of each other, not by randomly rubbing crypto on things because it makes you feel good to have crypto involved.


>For example, there may be `urllib3` which is a valid project that must be signed by key X. Someone then comes and registers another project, `urlib3` which must be signed by key Y.

Key X is on the company-approved key list, key Y is not. Your argument just fell apart.

>The problem with manual review then becomes one of scale. There are as of this time of writing 117,226 unique projects on PyPI with an average growth of around 100 new projects a day.

You're not dealing with projects. You're dealing with keys. It's not one key per project. It's one key per contributor. This has the added bonus that if a contributor goes rogue, you can revoke the one key and all the suspect projects are invalidated at once.

>There is one part time paid person (me), plus my unpaid time, plus one other part time unpaid developer/ops person who do the vast bulk of the work.

Sonatype has turned this into a rather nice business. It's not a volunteer project for them. You expect me to believe it's impossible despite solid examples to the contrary?

>at the end of the day without a manual review system individual end users are still ultimately responsible for ensuring they're asking for the correct thing

Blaming the victims.

>Security is achieved by layering multiple secure systems on top of each other, not by randomly rubbing crypto on things because it makes you feel good to have crypto involved.

It's also not achieved by doing absolutely nothing at all.


> You're not dealing with projects. You're dealing with keys. It's not one key per project. It's one key per contributor.

My rough guess is that for the Python community, these are roughly proportional; there are a lot of different people maintaining approximately one library each, not a small number of people (or companies) maintaining large parts of the ecosystem. There's nothing directly like org.apache for Python.


And yet in every typo squatting case, there's multiple projects leading back to a single contributor. It's almost like noticing a little known nobody sneaking to the front of the Pareto distribution would be a huge red flag.


By "every typo squatting case" do you just mean researchers demonstrating the viability of the attack against various systems? A system that successfully defends against researchers but not against actual genuine attackers would be worse than useless. If I actually wanted to pull off an attack without anyone noticing for as long as possible, I'd just target a single package whose maintainer is on vacation.

I think the only way your key-signing mechanism would actually solve the problem is if we made it actively hard for new developers to upload projects to PyPI without a long vetting process. Some projects work this way (Debian, notably; I've had upload rights for a few Debian packages for years and still don't feel ready to apply for full access), but I think it's a poor match for PyPI's actual goal.


Your arguments and exaggerated claims are silly. I can point to Maven Central all day long. They're doing it right. They don't have these problems.

You know who does this sort of thing? Politicians. They can't just look at a working system, single payer for instance, and copy it. No, they have to make silly arguments about why it will never work, despite a concrete, working example, right in front of their own eyes.


Donald already pointed out that the key difference in Maven Central is a manual review process, not package signing.

If Python introduced manual review of new packages, it would either need a massive amount of resources that no-one is offering to provide, or it would immediately be a huge bottleneck on people making new packages, which the community doesn't want to do.


>the key difference in Maven Central is a manual review process

Lipstick on the pig, still covered in mud.

The key difference is the regular occurrence of malware finding its way into PyPi and NPM due to the lack of multilayered security on those repos.

You guys keep trying to prop up the straw man that ONLY package signing is needed. It's not. It's a start. Nobody is making that argument but you. You not only repeatedly beat that dead horse, but you carry it to the illogical extreme that package signing is somehow harmful. Not only do you see no value in that layer of security, but you actively resist any talk or attempts at implementing it.

Meanwhile, your repo is infested with hackers and malware. Big surprise.


Hyperbole and insults. Now you're just trolling. If I see a cockroach in the kitchen, I kill it and spray. I don't rip out the walls or move house.


That sounds like you're relying on blacklisting the 'bad guy' key that's uploading all the malicious packages. Any half-way competent bad guy will generate a new key for each package (or try to steal keys with some good reputation), so it won't work.


> You're not dealing with projects. You're dealing with keys. It's not one key per project. It's one key per contributor. This has the added bonus that if a contributor goes rogue, you can revoke the one key and all the suspect projects are invalidated at once.

you cannot locate said rogue contributor without regularly manually reviewing 117,226 packages.


>you cannot locate said rogue contributor without regularly manually reviewing 117,226 packages.

Herd immunity. Someone is out there reviewing it. Most users won't need to lift a finger beyond verifying signatures.


> Herd immunity. Someone is out there reviewing it.

More likely everyone assumes someone else is reviewing it, and nobody actually does.


We definitely do where I work. I'm really, really surprised to hear they do not where you work. A multibillion dollar company like Amazon, who runs half the internet with AWS, does not verify dependencies? Wow. That's breathtaking.


If 100k packages are already security audited by the community, then what's the issue? They send them to dstufft, he takes them down (which of course does not actually happen, because nobody is auditing most packages). As mentioned elsewhere, most contributors to pypi have only one package, so the notion of "find one rogue package == dozens of untrustworthy packages removed in one swoop" doesn't really exist (esp. because a rogue agent would be making one account / key per package just to avoid this kind of detection!)


> Key X is on the company approved key list, key y is not. Your argument just fell apart.

A minuscule number of people are going to bother to do something like approve keys. Security for the minority can already be achieved by those companies mandating their developers use DevPI and mirroring trusted projects from PyPI to DevPI (or similar system).

Complicating the system further for something that, for practical purposes, does not improve the security of the vast bulk of people is not a trade off we're willing to make. Package signing will come to PyPI, likely in the form of TUF which is strictly superior to the trust model provided by PGP for package signing. It hasn't done so because nobody has had the time to do it yet.

What you seem to be missing about my statement, both in the blog post and here, is not that package signing isn't worthwhile, but that a lot of people like yourself seem to think that all you need to do is add signatures to a system and suddenly, poof, it's secure! That viewpoint is common among inexperienced developers or people who don't commonly think too hard about how secure systems are designed/made.

The reality of the situation is that adding signatures is painfully easy, but without a coherent trust model backing those signatures you've achieved nothing but adding more complexity. Determining a trust model (particularly one that works for the majority) is the hard part, and you can't just wave your hand and wish it better.

> Sonatype has turned this into a rather nice business. It's not a volunteer project for them. You expect me to believe it's impossible despite solid examples to the contrary?

Is it impossible to turn PyPI into a business? I don't suspect it is, no. However I don't want to do that because my personal risk tolerance doesn't have room for giving up a stable job with health benefits for something that may or may not fail. Others are free to try that if they want of course, but given the lack of people stepping forward to do that, it doesn't seem like anyone else is interested either.

> Blaming the victims.

Stating reality. PyPI is not a curated repository and end users are responsible for their own security while using it. If they wish to outsource that responsibility there are a number of Linux distributions that are happy to do that for them, as well as companies like Enthought and Continuum Analytics who provide curated repositories.

> It's also not achieved by doing absolutely nothing at all.

Good thing we're not doing nothing at all then. Luckily for the Python community we have actual experts and not armchair cryptographers who fail to understand even the basic fundamentals of developing secure software.


>Complicating the system further for something that, for practical purposes, does not improve the security of the vast bulk of people is not a trade off we're willing to make.

This is the weakest argument. Are Python devs somehow dumber than Java devs? Are they dumber than Android devs? Are they dumber than iOS devs? Everyone knows how to sign a dependency/app/project except python devs? I don't believe that. I honestly think that's the most insulting aspect of this argument.

The rest of this post seems to have turned to hand waving and personal attacks, so I won't bother responding to that. I'm just glad I got to share this perspective with you. Once you cool down, I hope you look harder at the problem. All I care about is improved security. I'm not here for the imaginary internet points.


> This is the weakest argument. Are Python devs somehow dumber than Java devs? Are they dumber than Android devs? Are they dumber than iOS devs? Everyone knows how to sign a dependency/app/project except python devs? I don't believe that. I honestly think that's the most insulting aspect of this argument.

Nope, I think they're perfectly capable of signing things. I also think it's silly to ask them to do that when the proposed system hasn't been designed to provide any benefit. Properly designing that system is hard, and 99% of people who go "just use PGP!" or "just use X" have spent exactly zero amount of time doing that. Particularly when the proposed solution doesn't actually solve the problem at hand (though it does solve other problems if it's correctly designed).

Ultimately your "suggestions" are nothing new, they're the same generic, cargo culting, suggestions that folks who haven't looked really hard at the problem tend to make.


I appreciate the proactive approach.

Is your project the author of the packages identified by NBU? If so:

(1) Why is the tracking pingback obfuscated?

(2) Why does the code include a cheeky hello instead of a link to https://www.pytosquatting.org/ ?

(3) Why is there not a visible warning when installing one of these packages?

=================

edit:

Reading through the linked blog post [0], it appears these researchers used different code that DID provide a visible warning and a cleartext pingback. It also collected command history and hardware information.

[0] http://incolumitas.com/2016/06/08/typosquatting-package-mana...


We're not the authors of those packages. But we own many others.

1. We're not obfuscating pingbacks.

2./3. We're raising an exception with an explanation and a link.

Just look at the code of one of our packages: https://pypi.python.org/pypi/codecs

The research in 2016 was done by someone else. The kinda crazy thing is: Some of the package names he used were made available again after that instead of being blocked... And now we own them.


Man. You're right - that's a mess.


And just to be super clear, this is the code. Hard from scary or obfuscated:

  # urllib_request here is presumably the package's py2/py3 compatibility alias
  # for urllib2 / urllib.request.
  html = urllib_request.urlopen(
      "https://www.pytosquatting.org/pingback/pypi/{}/".format(package_name)
  )
  raise Exception(
      "This is a bogus package that should not be installed\n\n"
      "Please read https://www.pytosquatting.org"
  )


"Hard from scary or obfuscated" Maybe you are typing on mobile?

"Far from scary or obfuscated" reads more clearly.


> Being total jerks, we have a pingback in the setup.py of all packages

Ugh. Yes, you are being jerks. The ethical way to collect statistics would be to ask the victim to click the pingback link.


Package managers seem to be an increasingly popular attack vector. It's only luck that none of the attacks have been particularly malicious yet. Considering how many package manager downloads go to a server in a datacenter, a widely distributed malicious package could control a botnet with extremely high throughput, or wreak havoc on any databases it comes into contact with.

It's only a matter of time before something like this happens. A big part of the problem is that application package managers, like pip or npm, are far less sophisticated than those of operating systems, like aptitude or yum. It needs to be easy for developers to open source their code, and to mark dependencies with precise commit hashes, but the download also needs to be secure and verifiable. There are many difficult tradeoffs to consider in terms of usability, centralization, security and trust.
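
pip does already support one piece of this: hash-pinned requirements files, enforced when installing with --require-hashes. A rough sketch of generating such a line from an artifact you've reviewed (names and paths below are just examples):

  import hashlib

  def pinned_requirement(name, version, artifact_path):
      # Hash the exact sdist/wheel you reviewed; pip will refuse anything else
      # when installing with `pip install --require-hashes -r requirements.txt`.
      with open(artifact_path, "rb") as f:
          digest = hashlib.sha256(f.read()).hexdigest()
      return "{}=={} --hash=sha256:{}".format(name, version, digest)

  # e.g. pinned_requirement("requests", "2.18.4", "requests-2.18.4.tar.gz")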


This is why we are working on integrating TUF (The Update Framework) signing into the OCaml OPAM package manager. See https://github.com/hannesm/conex-paper/blob/master/paper.pdf for the talk from last year. There's one more iteration required on the implementation before we're happy with it, but we are aiming to get this live on the OPAMv2 package repository some time in 2018 for all the publicly available OCaml packages.

OPAMv2 also exposes sufficient hooks during the build process for using OS sandboxing during builds, and disconnecting network access/etc. It would be nice to factor this out to be more OS independent (e.g. for all the `unshare` tricks on Linux, or the sexp-format for sandboxing on OSX) in the future.


Another fun fact to consider is that with many package formats, you can execute arbitrary code at install time so if a malicious package can get into a repository, it's very likely to start compromising systems quickly.

Whilst a package manager repo compromise would be the biggest bang in terms of attack, compromising the credentials of the developers of popular libraries would be an easier attack (and indeed is already happening https://twitter.com/chrispederick/status/892768218162487300)


I was recently very surprised to see an expensive security scanner which builds apps as the service account apparently without sandboxing. A package manager which executes code or which is exploitable would give you access to what is very likely an interesting account & data, and in at least some cases might not leave many clues behind. PyPI, etc. at least has centralization and immutable versions but who knows what's serving some random repo, tarball, etc. which might not even be a direct dependency?


With Python you can execute arbitrary code at import time... you don't even have to get to the install process. I've seen packages on Pypi that try to "sudo apt-get install ..." when the setup.py file is imported.

Also, even without sudo there's absolutely nothing stopping you (for example) downloading a cryptocurrency miner, or DDOS tool, or something, and starting it up to run in the background.


Doubly so since when installed on a server very often it will be done as "sudo pip install ....." (/s/pip/other-package-manager/ as needed)


I almost never see this. Even on systems that are only running a single python project, I only ever see folks use virtualenv. The only time I ever see things installed with sudo is when the package is being installed in a docker container.


> virtualenv

Running "pip install" as a user that has access only to the virtualenv directory is sounding like a good strategy.


This used to be super common, though. I see it all the time in legacy apps.


Yes – to the degree that it's uncommon now that's because many people in the community spent years loudly advising against it.


It's also very possible that a far more malicious attack has happened, but not been discovered.


I'm always fascinated by the amount of trust being exhibited by the developers of some node projects I've seen. Their projects have an order of magnitude more dependencies than I'm used to - and at the other end of each one is someone publishing some small module to npm with an unknown amount of review. I feel safe(r) installing dependencies from apt because I know the processes the Debian community follows before packages are included in the official repos.


More needs to be done by package managers to warn end users.

One scenario that worries me is where apps age and use popular trusted dependencies (e.g. gems on Github).

When those gems stop being maintained but need to be updated to work (say with latest OSX) - it's common to quickly look at the latest forks available and select the one that now works correctly - but without a detailed inspection of the new code it's potentially kryptonite for a production datacenter.


Package managers are providing (in most cases) a free service, so it's hard to see a strong case for them providing more services here.

The problem is one of scale. npm has over 500,000 packages, so no manual review will address their scale over the whole repository.

Until the developer market shows that they'll pay for a more secure service (e.g. package signed, reviews done etc) I doubt much will change.


It always worries me when I install a well-known or large package from npm and it ends up downloading dozens of dependencies maintained by disparate and unaccountable github users.


Maybe most of those 500,000 packages simply shouldn't be trusted.

There's a precedent for curated subsets of package ecosystems. Stackage for Haskell is an example, although it doesn't have security as the primary goal.

I don't think we should focus on actual audits of packages. Just checking that packages seem basically credible seems like a better approach because it's doable.


Credibility is an easier check but still tricky. Many of the packages are uploaded by anonymous or pseudonymous authors, so there's no easy way to even tie that to an IRL identity, let alone check for credibility.

I'd agree that a curated small package repository would be a better way to address the problem, but the market doesn't seem very interested in that as a solution.


I think it might just not have happened yet. The npm community can be pretty creative and enthusiastic!

I don't think IRL identities are necessary for what I imagine. It's more like establishing a basic set of packages that have been around, have communities of committers, reverse dependencies, etc.

Maybe we would even make a starting assumption that the transitive closure of dependencies originating with a set of high profile packages are "approved".

I'm thinking aloud but I think there could be a reasonably pragmatic way to get this started...


So, apologies for being a bit cynical here, but I don't see this one being addressed any time soon.

it's been at least 5 years since npm started getting scrutiny relating to security weaknesses https://blog.andyet.com/2012/03/08/compromising-the-integrit...

and 4 years since Rubygems was compromised http://blog.rubygems.org/2013/01/31/data-verification.html

and yet, I don't see substantial movement relating to package security and trustability in these repos. To be clear, I'm not suggesting these two are any worse than others; they're just large repos which have had incidents in the past.

The problem here (to my view) is that increasing the security of package repos will slow down releases (additional checks take time) and cost money (additional security, hosting etc), and until there's market demand for those services, they won't happen.


I'm skeptical too. But if I think like a sci-fi writer I can vaguely imagine ways for it to actually happen. That open source maintenance happens at all is pretty remarkable, so I think this thing, with an appropriate concept and some good tools (with emojis in their command line output), is at least vaguely plausible...


I don't see any reason it can't be done, either, in theory, and that's with the manual approach. Fancier ideas are viable too, but 500k is still relatively tiny and a manually tractable number. Incremental reviews starting today by many coordinating groups in the node community would take a while to complete, maybe a few years, but with some sensible ordering heuristics like e.g. the most downloaded first, or the most suspicious names first, some value could be produced quickly. But it won't happen, package vetting isn't really a value in these communities. (And that might not really be a bad thing, at least for now...)


If there was a business/enterprise offering with extra security I'm sure they'd have a long list of people who would sign up and happily pay for it.


Those services already exist, e.g. https://www.sourceclear.com/. Whilst I hope they're doing well, I don't think they've made significant in-roads into the volume of people using open source software library repos.


It needs to be part of PyPi directly; I would have a difficult time trusting an unknown third party.


The conda package manager is free and generally feels like a professional package manager like yum or apt-get.


Yes, Conda [0] is a package manager designed by Continuum Analytics (now Anaconda, Inc.) to support their Anaconda Distribution [1]. The distribution is free, and the Conda client is open source. However, Anaconda sells several enterprise products, including an on-premise Conda server ("Anaconda Repository").

In general, Conda does more package verification than pip, and the packages in the Anaconda distribution are more thoroughly vetted than PyPi. Conda-Forge [2] provides an escape hatch for less-vetted community code.

[0] https://conda.io/docs/index.html [1] https://www.anaconda.com/distribution/ [2] https://conda-forge.org/


Five hundred thousand packages, am I reading that right? I don't believe the top ten OS package managers combined would reach that number. Either this is a typo or it's crazy.


Not a typo, http://www.modulecounts.com/ has the details. npm is adding 497/day at the moment.


This [1] is npm growth compared to anything else. God this can't be safe nor sane...

[1] https://imgur.com/a/enjvR


The left-pad disaster has been predicted well in advance...


There's nothing especially awful about left-pad being its own package, the disaster was because a huge number of developers were betting on npm to somehow be highly available (despite being donated by its admins at no cost and with no committed SLA) rather than vendoring their deps.


Vendoring thousands of tiny libs is even worse. Trusting many lesser-known, tiny libs is riskier than trusting a few big, well-known ones.

Also, they are not vetted, and there are many more opportunities for an attacker to sneak in a backdoored lib on the edge of the dependency graph.

Finally, due to vendoring there's no way to receive timely drop-in security fixes for all dependencies from a trusted source.


One can both vendor and use the package manager to fetch updates. Just add the node-modules directory to your VCS.

The thing with node is that AFAIK it requires you to have libraries for what in most languages would be in the standard library. Maybe someone should start a "stdnode" project where the most popular / successful libraries for generic tasks are integrated into a dependable, maintained de-facto standard library, with an eye on quality and sanity, and community / Joyent funding.


JS is the most popular language on GitHub by far, and npm is a public site where anyone can instantly upload as many new packages as they want...


JS devs usually make packages that have single functions....


Yet another attack vector that doesn't exist at all in Linux distributions but invented by language package managers, sadly.

They solved the issue 2 decades ago by heavily vetting packages before accepting them into repositories. Users are allowed to add and use packages from 3rd party repositories.

Maybe the solution to this is creating curated repositories based on publicly open ones and using them by default (and requiring opt-in for other repositories). Conda for Python and Stackage for Haskell seem like relevant solutions.


There's a certain amount of work (and therefore money) required to do this. That incremental difference is small for a well designed application, but someone must actually vet and curate the contents of the repo. That tends to slow down execution, leading to scenarios where, a year or two ago, the docker package from the canonical "trusty" repo was hopelessly behind the "real" docker, since docker was evolving so quickly and trusty was by design moving slowly.

Each commit that went into trusty required a team to submit and a team to approve. That costs money. ;-)


It is a matter of distribution and release policy and not an inherent limitation of the model.

Stable/lts/enterprise distributions have other concerns like preventing regressions and configuration or behavior changes during lifetime of release.

Rolling distributions like Arch and OpenSuse Tumbleweed on the other hand can move a lot faster but still provide basic vetting wrt security and sanity of new/updated packages.


The Linux distribution approach to package management ("we'll package everything ourselves!") simply doesn't scale.


That's a feature. You want a set of vetted and curated packages you can trust, and you want to receive security updates on them.


But developers also want a way to get software without getting it blessed by Debian and waiting months/years for a distro release. That's why repositories like PyPI exist and are in very widespread use.

Distro repositories are a great example of 'secure for ideal users'. They give you security if you can put up with a small selection of software and older versions. In practice, we end up working around distro repositories by installing stuff with pip, or PPAs, or downloaded from websites.


I don't understand why they even try. Debian stable seems to have an almost arbitrary selection of outdated ruby and python libraries; at this point that hardly seems worth the effort. Sure, I get the idea, but it obviously doesn't work in practice. Their security methodology also seems heavily flawed: backporting security fixes to older versions is neither scalable nor particularly reliable. I sincerely doubt that Debian can provide adequate security to its almost 50,000 packages. If the security community invested the same amount of resources that they invest in finding flaws in iOS or Chrome, nothing would be left of Debian but a pile of smoking ashes.


> Debian stable seems to have an almost arbitrary selection of outdated ruby and python libraries

Yet Amazon and other big tech companies have a very similar process of packaging open source software for internal use and relying on "outdated" libraries.

> I sincerely doubt that Debian can provide adequate security to its almost 50,000 packages.

There's a security tracker where you can see how quickly packages are assigned CVEs and patched - sometimes even before the upstream patch is ready.


Sure it's nice (and easier) to use the distro's package management system, but it often just isn't up to date enough. You end up using things that are a while out of date and may have security flaws as a result.


> using things that are a while out of date and may have security flaws as a result

On the contrary, on distributions that perform security updates the level of security of a package can only increase over time.

It might sound obvious, but vulnerabilities are created in new releases, while vulns in existing packages can only be found and fixed, not created.

(Of course I'm talking only about vulnerabilities here and excluding removal of obsoleted crypto or addition of new security features)


I would just like to point out that a "fix" for a vulnerability does occasionally introduce others.


This is incorrect for rolling release distros. Even Ubuntu, which is not a rolling release, is fairly quick to update. CentOS on the other hand can be like pulling teeth. I'm going to be glad to stop dealing with CentOS 6.


> Ubuntu is fairly quick to update

For popular packages perhaps, but for many more obscure and niche python packages Ubuntu is often several releases behind, and that is if a package even exists to begin with.


> Yet another attack vector that doesn't exist at all in Linux distributions but invented by language package managers, sadly.

https://www.schneier.com/blog/archives/2008/05/random_number...

A.K.A., the Debian openssl Fiasco.

Just one example of distros fucking up the packages from upstream and causing major havoc.


You are cherry-picking one example involving a library that had a plethora of vulnerabilities from upstream. Contrast it with reviewing and maintaining 50k+ packages, managing thousands of CVEs every year, sometimes even writing security patches before upstream.

Also the project pioneered reproducible builds and implemented build hardening for most packages.


None of that, nor the good that they have done, negates the fact that distributions can fuck up too. And regardless of whether OpenSSL had a plethora of bugs from upstream, this one wasn't one of them. It was only in Debian, because of changes the project had made. Just because it's packaged in Debian by Debian maintainers doesn't mean you're immune to these kind of issues. They're arguably less likely but you'd need to do a comprehensive study of all packages in the repo to get to some usable statistic.


So the PyPi issue isn't a bug. It is an attack by hostiles. There are always bugs, but this is a person packaging malware, probably as practice for packaging worse stuff that runs at build time. If package maintainers have time to respond to valgrind reports on their package, they have time to check strace on the installer. Edited: longer rant.


> So the PyPi issue isn't a bug. It is an attack by hostiles.

Arguably. The issue with typosquatting on PyPi has been known and demonstrated for a long time, but nothing has been done about it. Considering there are ways of closing this attack vector, even though it would require some serious work, I'd consider this a bug. It's just a bug that's being exploited now.


This is a totally different issue. Distro package management isn't perfect, but you don't have to worry about a random malicious individual squatting on "opensssl" and including compromised code. That's a whole different ball game than a bug inadvertently introduced during a backport.


> Yet another attack vector that doesn't exist at all in Linux distributions but invented by language package managers, sadly

Not really, PPAs (and equivalents like copr or obs or aur etc etc) are mostly vulnerable to similar problems. People do want to install upstream software for various reasons, blaming language package managers for the reduced security of that is imho disingenuous.


Read my comment carefully. Though I've used terms loosely you'll see I'm actually talking about package repositories and their inclusion policies.


> Yet another attack vector that doesn't exist at all in Linux distributions but invented by language package managers, sadly.

Language package managers solve the problem that we have neither a universal package format that works across all programming language requirements and all sorts of OSes, nor the time to create an OS-specific package for every OS.


Solving that problem is orthogonal to vetting repositories.

There is nothing wrong with inventing their own solution if they're solving other problems, what is wrong is not learning from previous examples and fucking up creating problems that have been already solved in the process.


It looks like the code phones home to a server in China:

  IP: 121.42.217.44
  Decimal: 2032851244
  Hostname: 121.42.217.44
  ASN: 37963
  ISP: Hangzhou Alibaba Advertising Co.,Ltd.
  Organization: Hangzhou Alibaba Advertising Co.,Ltd.
  Services: None detected
  Type: Broadband
  Assignment: Static IP
  Continent: Asia
  Country: China
  State/Region: Zhejiang
  City: Hangzhou
  Latitude: 30.2936 (30° 17′ 36.96″ N)
  Longitude: 120.1614 (120° 9′ 41.04″ E)


When you go to this address http://121.42.217.44:8080/

"Hi bro :)

Welcome Here!

Leave Messages via HTTP Log Please :)"


I wonder what would happen if the return payload had some data that would trigger the GFoC


This to me is the nightmare scenario. Well one of the two, the other one being that a developer of an obscure library I use has their password to PyPI compromised and a bad actor uploads a backdoored version of the library.

Fundamentally, the reason this is different from how things like Linux distros work is that Linux distros have maintainers who are in charge of making sure every new update to one of their packages is legit. I am sure you can try to sneak malicious code in, but it isn't going to be easy.

I am not advocating that PyPI (and npm) adopt the same model. That would be too restrictive. But maybe just showing the number of downloads isn't the best way to assess whether the package is legit. Perhaps some kind of built-in review system would be nice.


A review system unfortunately isn't likely to be practicable with current development models. npm alone has over 500,000 packages (http://www.modulecounts.com/) so even a one time review isn't going to happen.

If people want a more trusted solution the likely outcome is that they'll need to use a smaller more static set of libraries and then either do the audits themselves, or outsource that to a 3rd party.

Ofc with current speeds of change and deployments, it doesn't seem likely that many companies will adopt that model.


> npm alone has over 500,000 packages (http://www.modulecounts.com/) so even a one time review isn't going to happen.

But at least the modules with the most downloads (webpack, react, or stuff like left-pad) could be vetted, and especially npm could implement a 2-or-more person model - basically, everyone with publish access can upload a new artifact, but to actually have it distributed to endusers, a second person would be required to sign off.


That's the thing. I worry less about popular packages. I can check that Django's GitHub repo links to PyPI and vice versa. But a random package to parse DSN's? I don't know it from Adam. I want to use it, and lots of others do too, but not everyone is going to review it. Maybe just a button on the package that says "I found insecure code!" Would be good.


> That's the thing. I worry less about popular packages. I can check that Django's GitHub repo links to PyPI and vice versa.

I worry about the most popular, and among those especially the small and next-to-unmaintained ones. Just think back to the left-pad disaster that broke builds all over the world and imagine it was not a deleted package but an update containing malware. I assume there are lots of such "hidden gems" where the maintainer has gone away... the consequences of hacking just one improperly secured account are severe.


Indeed it's not impossible to do (although full code review would be expensive/tricky/slow).

The fact it hasn't been done despite the obvious risks indicates how much demand there is for this feature...


Full code review, while not bad, is probably not going to stop attackers. Often I can barely understand what the dev intended with non-hostile code. Something more along the lines of strace on the install, and running down any connects or execs, would help. Runtime is different.


> they'll need to use a smaller more static set of libraries

That's what stable Linux distributions do.


Indeed they do, and you can get some libs for Node/ruby etc there; however most companies, from what I've seen, choose the option of using direct access to npm/rubygems etc.


> Fundamentally, the reason this is different from how things like Linux distros work is that Linux distros have maintainers who are in charge of making sure every new update to one of their packages is legit.

How is that different?


Because the person who pushes the code to the public repo is not the same person who makes sure it isn't malicious. You have a review process. Nothing is stopping me right now from creating a PyPI package called Django2.0 and having some poor souls download it. Or creating a tiny but useful utility, having it become popular, then introducing an update with a backdoor.


This isn't, in any way, a new problem. I did a presentation on this topic for OWASP AppSecEU 2015 (https://www.youtube.com/watch?v=Wn190b4EJWk&list=PLpr-xdpM8w...) and when doing the research for that I encountered cases of repo attacks and compromise.

IME the problem will continue unless the customers (e.g. companies making use of the libraries hosted) are willing to pay more for a service with higher levels of assurance.

The budget required to implement additional security at scale is quite high, and probably not a good match with a free (at point of use) service.


If someone here wants to build a business around this, count me in for NPM (high willingness to pay) or PyPi (lower WTP).

Here's an idea: make it similar to Kickstarter, where customers can commit a certain amount of funds towards a specific package. If the package doesn't "tilt" in a certain amount of time, the money goes back. Otherwise you vet a point release and add it to your repo. You could offer subscriptions to keep packages updated or handle each update as its own project (with presumably lower costs if a recent release has been audited). Handling dependencies is key, and left as an exercise for the reader.


One thing to consider if you're going to provide a service like this:

What happens if a vulnerability nevertheless sneaks through?

Then whoever did the vetting could conceivably get sued. So they might want to take out insurance or try to protect themselves from lawsuits in some other way -- all of which is likely to make such a service even more expensive.


It has to be constrained to something reasonable. You can't guarantee the software is safe, but you can guarantee it is published by someone who is who they say they are, similar to EV certificates for domains. You can also refuse to publish packages with intentionally-confusing names.


"you can guarantee it is published by someone who is who they say they are"

Can you? Positively identifying people seems a pretty tricky and easily screwed up business.

ID's can be forged, and a web of trust requires, well, trust.

I guess such a service could say something like "we got this person's ID (and/or address)" or "here's this key's web of trust", and that would probably be a bit better than what we have today (which is virtually nothing), but it would still be a far cry from "guaranteeing it is published by someone who is who they say they are".


EV certs have a complex verification process that can involve sending a physical representative from the company down to the place of business to confirm its presence/existence.

Bitcoin trading platforms have shown that compliance with AML/KYC regulations can be performed virtually by manual verification of a valid government ID, timestamped photo, handwritten note, and other mechanisms.

A company offering this service would go outside of the keyserver and verify the ID independently. It'd be much more of a "notarized packages" paradigm rather than just "published by 1337PyHax0r-88".

It is true that even extensive manual verification processes dependent on government-issued IDs can be faked, but there's a much higher bar involved.


I'm sure companies would pay for it. The service needs to be part of the main package service, not some third party.


Interesting. If you think that npm/Rubygems/PyPI are leaving a load of money on the table, why do you think they haven't introduced those services so far...


ISTM we're just talking about running an alternate, more restrictive registry? npm etc. don't have to play any part in that. This service could be offered by anyone: IBM could do it.


Indeed IBM or anyone else could do this, but they're not, which implies a lack of demand.


Because their mission isn't to generate income like a traditional business. But if the income went back to the foundations, like Python Foundation, I think that would make sense.


But income can be also used to help finance their main mission. Obviously they seem to operate fine without strong reasons to expand revenue streams, but I feel like they ignore an opportunity to create improvements for just about everyone.


Anaconda gives a healthy amount to open source, either by donations to foundations like NumFOCUS or paying salaries of contributors. Is that what you're looking for?


npm is a commercial organisation, they offer paid subscriptions but don't offer a curated package signed option...


Sort of a critical feature they are missing


I think a more Linux-like approach to package repos is better - a curated package repository run by volunteers in maintainership roles. Then you have a human being verifying the upstream and keeping malware out, and get more consistency across packages as a bonus. If you want your package added it's as simple as sending an email and provides a new avenue for people to contribute to the success of the ecosystem as package maintainers.

When you make the next big thing, consider this approach.


Maybe you're right, but I see one possible downside that is quite important.

I have encountered the case "the package has an important bugfix but is not yet published on PyPI" way more than once or twice.

With the intermediate maintainers, that's going to get worse.

I believe namespaces and signatures are the way to go, with a special privileged namespace for curated, widely known packages (e.g. SciPy or Django) - a little like on Docker Hub, where curated mainstream images are just "debian" or "python" but anyone can upload e.g. "jdoe/debian" if they need some customization.


PyPI should also run a build to audit behavior, which would be fairly easy to implement. A submitted package would simply fail if it accesses the network or privileged files during the build, unless unique needs are called out in a spec file.

I do wish that `--user` was the default for pip.

It is also a pity that trivial Debian bugs like this one, which block adoption of non-sudo pip installs, go ignored.

https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=839155

Although Debian/Ubuntu default to --user for pip, people resort to sudo because the standard user bin directory isn't in the default PATH due to a regression.

I may start a project to create an apparmor/selinux wrapper for pip to audit and restrict access to sensitive resources. I actually have a fairly heavyweight version in place in my build pipeline to detect new dependencies: I add the files/network resources that a build accesses outside of the testing stage to the build artifacts. But it wouldn't be cross-platform enough for Windows/Mac.
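To make that concrete, here is a minimal sketch of the lighter-weight kind of audit I mean (not my actual pipeline): install an already-downloaded sdist under strace and flag any network syscalls the build makes. It assumes Linux with strace installed, and the sdist path is a placeholder.

  # Hypothetical sketch, not my real pipeline: flag network activity during a
  # package build. Assumes Linux with strace installed; pip's own downloads
  # are avoided by installing a local sdist with --no-index.
  import os
  import subprocess
  import sys
  import tempfile

  def audit_sdist(sdist_path):
      with tempfile.TemporaryDirectory() as workdir:
          logfile = os.path.join(workdir, "trace.log")
          # Trace network-related syscalls of pip and every child process
          # (setup.py runs as a child during the build step).
          cmd = [
              "strace", "-f", "-e", "trace=network", "-o", logfile,
              sys.executable, "-m", "pip", "install",
              "--no-index", "--no-deps", "--disable-pip-version-check",
              "--target", os.path.join(workdir, "site"), sdist_path,
          ]
          subprocess.run(cmd, check=False)
          with open(logfile) as fh:
              return [line for line in fh if "connect(" in line]

  if __name__ == "__main__":
      hits = audit_sdist(sys.argv[1])
      if hits:
          print("Network activity during build:")
          sys.stdout.writelines(hits)
          sys.exit(1)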


1. This would require PyPI to provide computing resources to build the packages (for all OSes, if the package contains native code). And then, malicious packages would just detect the build environment and avoid exhibiting the unwanted behavior. I think I read that it's not unusual for malware to detect that it's being run in a VM and do nothing suspicious, to complicate detection and analysis.

2. I don't think `pip install --user` adds any significant security. A little bit - sure, but not much. A trivial injection into ~/.*shrc or ~/.profile (I don't think anyone would notice the file was changed until it's too late) would result in full system compromise on the next login and sudo invocation. Same goes if you have ~/bin or ~/.local/bin (or anything user-writeable) in $PATH.

And even with non-root access, malicious software can do a lot of undesirable things (e.g. send spam or steal user data).

---

I believe signature-based trust (with mandatory code signing) is the way to go. On the very first `pip install`, ask: "The package not-expect (1.2.3) is signed by The Spanish Inquisition (key: ...) and was audited by The Python Developers (key: ...). Have you verified the keys, and do you trust a) this vendor, b) this auditor, or c) both?" The answer then gets recorded in ~/.pip (for this machine) and ./requirements.txt or ./setup.{py,cfg} (for distribution), so future installations don't ask anything.

To get a non-interactive mode (for CI or similar), one must either pass something like --insecure-skip-signature-checks (so they mean it) or pre-supply all the trusted keys.

(Not ideal, of course - just a quick idea. Surely, it has a lot of rough edges to polish.)
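For concreteness, here is a rough sketch of the bookkeeping side of that idea (entirely hypothetical - no such hook exists in pip today): record each vendor/auditor key the user accepts in a per-machine file, and in non-interactive mode refuse anything that isn't already trusted unless checks are explicitly skipped.

  # Hypothetical sketch of the proposed trust store; nothing like this exists
  # in pip, and the file layout is invented for illustration.
  import json
  import os

  TRUST_FILE = os.path.expanduser("~/.pip/trusted-keys.json")

  def load_trust():
      try:
          with open(TRUST_FILE) as fh:
              return json.load(fh)
      except FileNotFoundError:
          return {}

  def save_trust(trust):
      os.makedirs(os.path.dirname(TRUST_FILE), exist_ok=True)
      with open(TRUST_FILE, "w") as fh:
          json.dump(trust, fh, indent=2)

  def is_trusted(package, vendor_key, auditor_key,
                 interactive=True, skip_checks=False):
      if skip_checks:                     # --insecure-skip-signature-checks
          return True
      trust = load_trust()
      entry = trust.get(package, {"vendors": [], "auditors": []})
      if vendor_key in entry["vendors"] or auditor_key in entry["auditors"]:
          return True
      if not interactive:                 # CI without pre-supplied keys: fail
          return False
      answer = input("%s is signed by %s and audited by %s. Trust? [y/N] "
                     % (package, vendor_key, auditor_key))
      if answer.strip().lower() == "y":
          entry["vendors"].append(vendor_key)
          entry["auditors"].append(auditor_key)
          trust[package] = entry
          save_trust(trust)
          return True
      return False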


On point 1: yes, they would have to compile, but that is not a huge barrier, and it would also improve module quality, just as CI/CD does.

On point 2: even if you ignore the much larger attack surface from running ALL installs as the root user, consider the one-shot opportunities gained by throwing away the protections of capabilities(7):

  # capsh --print -- -c 'pip list > /dev/null'
  Current: = cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read+ep

And if --user was the norm, it would be trivial to write an apparmor/selinux policy to protect files like ~/.*shrc or ~/.profile

It is just the basic principle of least privilege. Heck if irssi can bother with an apparmor profile, the maintainers of pip or the package should be able to.


With this system, one could pull a Volkswagen.


I don't think there are any maintainers who verify upstream code; they only manage packages and updates. Which is actually safer to do without maintainers, completely automatically, as it eliminates the huge attack surface a maintainer introduces.


Debian Developers are responsible for package quality. Most package only well-known software that they are familiar with or whose code they have read. Some do more thorough security audits.

Some security-sensitive packages are maintained by teams to share the workload.


They do a lot that could be and sometimes even is automated, but that doesn't help with security, only weakens it. Which is my point. It's better to automate package generation and trust fewer people, mainly the authors, not introduce maintainers into it.


Authors don't generate distro packages because there are too many distros, and each distro needs to make changes that have nothing to do with code. Maintainers are necessary, and a package signed by the author is a non-starter.

However, source code can be signed and then used to make a package signed by a distro.


Think of it instead of allowing access to the system to maintain packages, we allow people to submit code that generates packages.


Of course you can do this - all packages are basically just wrappers around upstream code. But you still need someone to maintain the wrapper, and they have to check every new code release to see if there's something in the wrapper that has to change. And there are multiple distros. There's no getting away from maintainers with traditional linux distros.

Code package management is different. The author writes their software specifically to conform to the one code package management system. There's no wrapper glue needed, so you don't need a maintainer. Just release your new code and it fits into the system, and other code/tools/etc can just pick it up and use it.

This works if you constantly update all the software you use everywhere, and is pretty much guaranteed to become a nightmare if you don't. CPAN is probably the most mature software package management system in existence and it's still a nightmare if you don't keep a private repo and tightly manage releases, and you absolutely need a maintainer.


To be clear, what I'm suggesting is to generate those wrappers automatically, instead of maintaining them manually. A script can visit a release page daily, parse it and check for updates. If there is a new upstream release, it can generate a wrapper and let the build system do the rest, produce binaries, test them, etc. When things break, the code needs to be fixed, but it's definitely very far from every release. And you don't have to trust the maintainer of that script anymore or even have a separate maintainer, everything could be reviewed on pull requests with only a small group of people having commit rights to the repository.
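As a rough illustration of how small that daily check can be (the PyPI JSON endpoint is real, but the wrapper-generation script and the example version are made up):

  # Sketch of the daily upstream check; generate-wrapper.sh and the example
  # version are placeholders, not a real distro tool.
  import json
  import subprocess
  import urllib.request

  def latest_pypi_version(name):
      url = "https://pypi.org/pypi/%s/json" % name
      with urllib.request.urlopen(url) as resp:
          return json.loads(resp.read().decode("utf-8"))["info"]["version"]

  def check_package(name, packaged_version):
      upstream = latest_pypi_version(name)
      if upstream != packaged_version:
          # Regenerate the packaging wrapper and hand it to the build system;
          # a human only has to review the resulting pull request.
          subprocess.run(["./generate-wrapper.sh", name, upstream], check=True)
          print("update needed: %s %s -> %s" % (name, packaged_version, upstream))

  if __name__ == "__main__":
      check_package("requests", "2.18.3")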


That's basically how packages are maintained today; they just don't have as much automation, and there are a lot fewer packages as a result. If you made one package for every software update and had to review each one, you'd spend a lot more time reviewing.

Trust isn't an issue in reviewed/maintained repos because you have eyeballs on everything. When anyone can just ship an app/library and release it automatically you get these malicious software issues.


> I don't think there are any maintainers that verify upstream code, they only manage packages and updates.

Even that minimal amount of work is enough to prevent an attack as ridiculous as typosquatting.



That one is even more malicious: it uploads the contents of your ~/.bash_history and your system profile. But at least it notifies you afterwards...


Yes, but they do filter bash_history client side, only transmitting pip-related commands. They did this to find additional common typos. The relevant code:

  def get_command_history():
    if os.name == 'nt':
      # handle windows
      # http://serverfault.com/questions/95404/
      #is-there-a-global-persistent-cmd-history
      # apparently, there is no history in windows :(
      return ''
  
    elif os.name == 'posix':
      # handle linux and mac
      cmd = 'cat {}/.bash_history | grep -E "pip[23]? install"'
      return os.popen(cmd.format(os.path.expanduser('~'))).read()


Both Anaconda (for Python, https://docs.anaconda.com/anaconda/packages/pkg-docs) and Microsoft (for R, https://mran.microsoft.com/) have "reviewed and audited" collections of packages for their languages. That's part of what you pay for when you buy support for the open source tools.



While there is no public announcement from the PSF yet, I sent an email to the python-dev mailing list at least to announce the issue but also try to discuss how to mitigate/prevent it.

https://mail.python.org/pipermail/python-dev/2017-September/...

Honestly, I am impressed by how quickly the information spread! The National Security Authority of Slovakia contacted the PSRT 10 days ago. All packages were removed 1h10m after we got their email. We were still discussing how to communicate about this issue when they published an advisory. A few hours after the advisory was published, I saw the information on IRC, Twitter, LWN, etc. I didn't expect that the advisory would be published so quickly. FYI, last week there was also a CPython sprint attended by more than 20 Python core developers; we were busy discussing Python enhancements.


I guess that's because it's not a surprise. This has come up before, and it's basically unavoidable with the way PyPI is designed to work: if you see an unclaimed name, you can put whatever you want there.


I see your point, but I don't think that it needs to be a surprise to be announced or made well-known. No one is surprised that Microsoft releases loads of patches every Patch Tuesday, but they still publicize it and make it well known and easy for people to find out about.


The regex they have for identifying fake/harmful packages is wrong.

`pip list --format=legacy | egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib) '`

This incorrectly lists `urllib3` or the `cryptography` package for example, which are perfectly valid packages.

[UPDATE]

Read "tobltobs" comment below. I incorrectly removed a trailing space from the regex.


Not for me. There is a space at the end, between the closing bracket and the apostrophe. Maybe you removed this space when you corrected the smart apostrophes.


You're right. It seems I did remove the space. When I put it back in it doesn't print anything.


Conda users: Here's a script that runs this check against each environment:

https://gist.github.com/osteele/198b50a2a208e5bc7e5fb8d010cf...


I believe urllib3 is built-in. So if you have installed it from PyPI you've gotten a malicious version.


urllib and urllib2 are built-in for Python 2, and were merged and reorganized as just urllib in Python 3. urllib3 is a third-party module.


This is correct. In general, though, most packages don't rely on urllib3 directly, but on `requests`, which uses urllib3 but provides a friendlier API and built-in SSL cert verification.


It's not generally true that built-in packages which also appear on PyPI are malicious.

Many batteries-included packages are also maintained outside of CPython. This is because (1) in many cases they existed outside prior to being included in CPython, and (2) they can experiment with new features before those land in the CPython version of the package.



xml should be added to this list.

https://pypkg.com/pypi/xml/f/setup.py


pip list --format=legacy | cut -d' ' -f1 | xargs egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib)$'


When running that command, I get output like this:

  grep: alabaster: No such file or directory
  grep: appdirs: No such file or directory
  grep: arandr: No such file or directory
for dozens and dozens of packages. Are those errors benign?


No, jastr's command is wrong.


Whoops, remove the xargs


"Success of the attack relies on negligence of the developer"

How about the people who run package managers accept their enormous responsibility? urllib vs urllib2, one is a virus? Sorry, but that is not "negligence of the developer".


The least they can do is create an alias system for common libs or disallow some lib names.

Another easy thing to implement would be a popularity check: "This package was only installed nnn times. Did you mean xxx, or do you want to proceed with the installation of yyy by author dev@g00gle.com?"

Email verification is a must.


Managing the supply chain is one of the basic principles of good engineering. Not properly vetting your sources is negligence. The problem, of course, is that computers are really good at amplifying work, including mistakes. So a small mistake, like a typo, can have a catastrophic impact, like injecting malware that takes over the whole system.


There are over 100,000 packages and PyPI is run by volunteers. This is not practical.

PyPI is not a curated distribution.


there are many ways to reduce the likelihood of malicious packages. not all of them require active curation. some can be systemic.


How about a Levenshtein distance threshold for new package names to be accepted? I.e., only allow names that are different enough from the existing set to avoid typos (or whatever errors we are trying to guard against).


You don't need a strict ban for this to work either. Maybe just an end-user warning if distance < N and the relative popularity of the two modules is very high. You could also allow users or organizations to explicitly whitelist some names.
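Rough sketch of what that could look like (the threshold, whitelist, and popularity numbers are all made up):

  # Sketch of the proposed check; the threshold, whitelist, and popularity
  # numbers are assumptions, not anything PyPI actually does.
  def levenshtein(a, b):
      # Classic dynamic-programming edit distance.
      prev = list(range(len(b) + 1))
      for i, ca in enumerate(a, 1):
          cur = [i]
          for j, cb in enumerate(b, 1):
              cur.append(min(prev[j] + 1,                  # deletion
                             cur[j - 1] + 1,               # insertion
                             prev[j - 1] + (ca != cb)))    # substitution
          prev = cur
      return prev[-1]

  POPULAR = {"urllib3": 40000000, "requests": 60000000}    # made-up download counts
  WHITELIST = set()                                        # names an org explicitly allows

  def warn_if_squatting(name, downloads, max_distance=2, popularity_ratio=100):
      if name in WHITELIST or name in POPULAR:
          return None
      for existing, existing_downloads in POPULAR.items():
          if (levenshtein(name, existing) < max_distance
                  and existing_downloads > popularity_ratio * max(downloads, 1)):
              return "Warning: %r is very close to the much more popular %r" % (
                  name, existing)
      return None

  print(warn_if_squatting("urlib3", downloads=50))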


Any method of software distribution that is not rooted in cryptographic author verification against a fine-grained, user-manageable trust store should have been put below the sanity waterline 20 years ago.


Here's something that contributes to typosquatting: the lack of responsiveness by package management organizations to claims on orphaned or unmaintainable packages.

People who upload packages often leave organizations, which are then stuck with a package they can't update because the password went with the person and the email reset link points to a now-defunct address.

Petitioning the package management team is sometimes fruitless, forcing a needless new instance of typosquatting.


I have found the PyPI group of people to be very helpful in these cases. As an organization, you should also probably have more than one owner for your packages; that way, unless two people leave, things aren't orphaned. We have gone as far as to have a 'meta-user' that is on all packages. It is only ever used to recover a fully abandoned package.


I understand you are trying to be helpful, and of course you are right, but the fact is that sometimes things fall between the cracks, especially in, say, hard-pressed startups.

There are so many shoulds in the world that don't make it to dids, it reminds me of the joke about the salesman trying to sell farming improvement techniques and being turned down by the old farmer, who says, "Son, I don't farm half as good as I know how to already."

Unfortunately, I have not found the PyPI group as helpful as you have. Perhaps I have been looking in the wrong places.


UPDATE: I was helped out by a very nice person from PyPI, so kudos to them.


Part of my dislike for the Node ecosystem in particular (and I am sure others have a similar problem) is that the dependency trees are super complex.

Because packages tend to be small and numerous, and each of those has its own dependencies, you can end up with hundreds of packages installed, which is simply impractical to review manually.

It's not Node, but we do in fact manually review each package we use for our language, because the dependency tree in this ecosystem is small enough to make that feasible and worthwhile. Each and every package is a possible attack vector, whether intentionally or just because it's poorly written, and we can't simply ignore that because it's the done thing and "the community reviews them".


I bet there are quite a few malicious NPM packages that we do not know about.

Is Node used in government and military solutions? If so, then the npm ecosystem is likely targeted by state actors, and it is a sitting duck.


State actors do not limit themselves to government and military targets; many of them target civilians for all sorts of purposes.


I once tried to upload a package called "requirements.txt" (since people do pip install requirements.txt all the time forgetting the -r).

Pypi actually blocks that name from being a package!


Here is the general problem with dependencies:

When a dependency changes, all the projects that directly depend on it should get notified immediately and their maintainers should rush to test the new changes, to see if they break anything.

There is no shortcut around this, because if B1, B2, ... Bn depend on A1, the consequences may be different for each Bk.

The only real secure optimization that can be done is noticing that some of the Bk use A1 in exactly the same limited way, and thus making an intermediate A1b that depends on A1, which those Bk then depend on. These "projection" builds could be automated, e.g. from the set of A1 methods the B's actually call.

Anyway, this is how iOS does it: before iOS 11 comes out to users, they release a beta to all developers, and they even fix bugs in the beta before releasing to the public.

Without beta testing periods, you get laziness and people just auto-accepting whatever came out.

There should be an "alpha release" feature in git where maintainers put out the next version to be tested by everyone who depends on it. THIS FEATURE SHOULD NOTIFY THE MAINTAINERS SUBSCRIBED TO THE REPO. THE BUILD ITSELF SHOULD GET ISSUES AND RATINGS FROM MAINTAINERS AS THEY TEST THE NEW BUILD. And releases should not be too frequent.

This is the way to prevent bad things from happening. But that also means that the deeper the dependency is, the more levels this process could take to propagate to end-users.


I think we need a system to prevent this instead of the wild west that PyPI has become. For example: developer signatures checked against a community rating. If someone does `pip install`, pip would look up the developer signature of the package and check a community rating verifying that this is a developer who has offered legit packages in the past. It's not foolproof, but it would go a long way towards solving this.
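Something like the following, perhaps (the ratings endpoint and threshold are invented purely for illustration; nothing like this exists in pip):

  # Invented-for-illustration sketch; the ratings service does not exist.
  import json
  import urllib.request

  RATING_URL = "https://ratings.example.org/developer/%s"   # hypothetical service
  MIN_RATING = 3.5                                          # arbitrary threshold

  def developer_rating(key_fingerprint):
      with urllib.request.urlopen(RATING_URL % key_fingerprint) as resp:
          return json.loads(resp.read().decode("utf-8"))["rating"]

  def allow_install(package, signer_fingerprint):
      rating = developer_rating(signer_fingerprint)
      if rating < MIN_RATING:
          print("Refusing %s: signer %s has community rating %.1f"
                % (package, signer_fingerprint, rating))
          return False
      return True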


That sounds easy to defeat. Make some mundane but legit packages (maybe one of those "$X but without the pointless complexity" packages), gain trust, and once trust is established, start uploading typosquatting packages.

Knowing today's internet, programmers from cheap-labour nations (India & Co.) would soon start offering "trusted PyPi accounts" for sale on hacker forums.


Yes, but keys can also be revoked, providing a way to mitigate this.


You could add the ability for well known members to vet newbie developers, maybe by signing their key. And now you have re-invented web of trust.


I was thinking more that the ratings would handle this, rather than having to sign keys. Ratings by well-respected and vetted members would carry more weight.


This is interesting in conjunction with the recent post about Python's popularity, because that popularity may be the very weakness exploited here [1.]. It's easy to use, install, and get libraries for anything, and apparently also libraries for infecting your machine :(.

[1.] https://news.ycombinator.com/item?id=15249348


The problem is that PyPI's vetting process is completely nonexistent. This has happened many times in the past; the last instance I remember, someone uploaded a few libraries called "bs4" and the like.



Would it be possible to have a general package manager (like apt) as a reusable base for the individual language-specific package managers? I know that npm, pip, gem, etc. all do some additional stuff, but at the core they all do the same thing (pull packages from a repo, run some post-install steps, resolve dependencies, maybe in some cases even check whether the package is legit). So we could implement and audit that once and then just reuse it, as we do with many other libraries, e.g. for image processing.


+1 to this. There's no need to reinvent the wheel a million times. At the very least, have a shared standard for how to do packaging.


The packaging technology is not the issue here, it's about the repository. If you made an apt repo where anyone could claim an unused name and start uploading packages, you'd have exactly the same issue.


See also: PackageKit.


This has been a known problem for a while. For example, when we (RhodeCode) had an installer based on pip, we actually rolled our own PyPI index. Hosting one yourself is very easy, and there are nice existing projects that allow it. It solved the problem of deploying when PyPI wasn't available, sped up our test installer builds, and gave us total control over the packages we ship.


Maybe packages should be signed by several trusted maintainers. Or, since PyPI packages sometimes list a source code link on GitHub, along those lines there could be a process to prove ownership of some known online identity, Keybase style. Unpopular packages could also be flagged, especially ones that have a near twin that is much more popular. There are many solutions.


Unless your package manager enforces signatures and you trust the person who signed the package, this is an attack vector for you.

That includes Java (Maven), Ruby (Gems, Bundler), Node (npm), Haskell (Stack), etc.

Installing code via package managers is the coder's equivalent of opening an exe sent to you in an email.

Code downloaded from the internet is not to be trusted.


Package signing is no silver bullet.

Signing packages helps against typosquatting about as much as SSL certificates help against phishing. Or in other words, not at all, especially if we don't have the certificates rooted in real world identities (like EV SSL certs).


I thought Maven enforces signatures? Though that doesn't fully mitigate the risk as you still have to trust the signer.


Signatures are good, but do not help in this case (typo-squatting)


Looks like they missed one:

https://pypkg.com/pypi/xml/f/setup.py

Dork: site:https://pypkg.com intext:"just toy, no harm"


Hooray for the "wild west" model of package repositories.

Come back maintainers & packagers, all is forgiven!


This is why I am not a huge fan of using package managers. I like to understand the code we put into our platform and vet it, and not have it change under us automatically after that, but review changes manually before accepting them.

I felt a bit curmudgeonly, but at https://qbix.com/platform we have a responsibility to keep all our apps secure. I wanted to use repos for each package and manually git pull or hg pull them when they changed.

I was finally convinced by our developers to just use package managers with version pinning. Honestly it's really hard to avoid package managers, especially for all the newer functionality such as Payment Requests or Web Push. Luckily there is version pinning.

We want our clients to feel secure that we vetted ALL the code that went into the platform. So our package.json (and composer.json) use version pinning. We'd rather take a bug report and manually fix it than get NO bug report and have a SHTF moment.


To see if you have any of these deps on your Python path:

pip list --format=legacy | egrep -e '^acqusition$' -e '^apidev-coop$' -e '^bzip$' -e '^crypt$' -e '^django-server$' -e '^pwd$' -e '^setup-tools$' -e '^telnet$' -e '^urlib3$' -e '^urllib$'

To see if you have any projects in a given directory that require them:

cat $(find /path/to/dir -name 'requirements.txt') | egrep -e '^acqusition==' -e '^apidev-coop==' -e '^bzip==' -e '^crypt==' -e '^django-server==' -e '^pwd==' -e '^setup-tools==' -e '^telnet==' -e '^urlib3==' -e '^urllib=='


pip list --format=legacy | cut -d' ' -f1 | egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib)$'


well shit, I guess I should have followed up on this after I noticed it 2 months ago.

https://twitter.com/JustinAzoff/status/881163562739277824



Whoa urlib & urllib3. Those are pretty popular packages, especially to newbies. Hundreds of websites that teach web-scraping use those libraries.

I wonder what would be an effective form of protection against such attack vectors.

Do digitally signed certificates fit into this usage scenario??


> Do digitally signed certificates fit into this usage scenario??

No, because either the package author would have to sign them, in which case you have to choose to trust each package author, or the repository would sign them, in which case there would be no improvement for this current issue, since the repo would sign the fake packages as well.


Any way in which blockchain technology can be used? Like, the transaction becomes the act of the author uploading the code and the repo and user verify the transaction in some form?


Nope. You can't solve phishing with technological means. You have to curate either a whitelist or blacklist.

The best way to handle this is whitelists of trusted package maintainers and/or code authors.


To be fair, you can do a lot with simple heuristics. New package + large number of downloads || new package || new package author = show users a warning message before installing the package.


Ohh ok. Thanks. Great to know.


If you can't trust the package author at all then you shouldn't use their package.



To check a few different requirements.txt files (this will look 3 folders deep):

find . -maxdepth 3 -name requirements.txt | xargs egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib)'


To avoid some false positives:

pip list --format=legacy | cut -d' ' -f1 | egrep '^(acqusition|apidev-coop|bzip|crypt|django-server|pwd|setup-tools|telnet|urlib3|urllib)$'


I wonder if it's worthwhile having a check that compares the closeness of a new name to existing popular packages and, if it's close, does some extended vetting.


Anyone know if this is also an issue for Java? I've used Maven repository for ages, and I know many big cos depend on it.


It's less of an issue, but it still could be an issue.

Artifacts deployed to Maven Central are required to be signed with a PGP key and are only supposed to come from approved hosts. I don't know how strictly that is enforced or how hard it is to become a host, but at least there is some kind of process.

Maven Central also doesn’t allow the removal of artifacts after they've been published, and every artifact requires a unique version and name. And the names are namespaced. So you don't have the issues that you see with npm, where someone can pull a package and break everything people are using, and then some third party can come in and publish anything under the exact same name.

Is this model perfectly secure? No, you still have to trust that the artifact was signed by a non-malicious person from a host that was not compromised.


It's absolutely an issue. I'm pretty sure no one is looking at every jar file added to maven to see if there's an issue.

In your POM file do you have a checksum?


Tiny open source project for this: https://github.com/williamforbes/pypi_hacked_names


Maybe gov.sk should be vouched for too, I mean what's the chain of trust here? Why should I trust anyone?


Do not forget their password, which worked for a couple of years: nbuSR123 ...


Dry run?


The "malicious" code at the end of the advisory looks like nothing more than a beacon announcing it was installed?

  edit:
  get current working directory
  get username
  get hostname
  concatenate the last 3 together
  obfuscate(/encrypt?) this string
  send the result as a http request to 121.42.217.44 (the value of the base64 string)


# Welcome Here! :)

# just toy, no harm :)


Some of this could be helped by intelligent naming of packages. If something is called urllib, name the package urllib, because that's what people are going to look for.


Curious to know whether something similar is happening for Scala.


How likely is it that npm and other package managers that do not use digital signatures by default are unaffected?


Python needs a way to run 2to3 during package installation that doesn't use setup.py (i.e., via setup.cfg or wheels). As it stands now, you have the hassle of building a release four times if you want to support all combos of Py2, Py3, and 32-bit and 64-bit platforms. The absence of 2-to-3 migration support in the safer alternatives is why I stick with setup.py.


No, what you need to do is fix your package's code to work on Python 2 and 3 without running 2to3 on it. The only case where this doesn't work is if you have binary extensions - but then you need separate wheels in any case.


I'm glad Go completely sidestepped the name rush induced by this type of package manager (composer, cpan, rvm, pypi, npm…). Just provide a URL. Done.


...which is even less secure.


May I ask why? If anything, it's more secure, since you know exactly who's publishing what.

Yes, it might put a higher burden on the publisher if they don't host on github/gitlab, etc.

But it strips the "magic" part and makes sure the dev knows where the code is coming from.


With packages identified by full URLs, it's more likely that you'll make a typo, misremember part of the URL, search for it on a search engine and pick a fork instead of the right one, or paste one from Stack Overflow or another forum that is plain wrong or only looks legitimate due to Unicode tricks. A DNS MITM/hijack can also be used to inject a backdoor. Or the expiration of a legitimate domain.


You can also make a typo when all you need is a package name - as long as human beings have to type things out, that's going to be a problem. On the other hand, with a URL you can actually inspect the code directly and (if it's hosted on GitHub or somewhere similar) see whether it's starred, forked, or has any issues. It's not a case of URLs being less secure; it's just a tradeoff that pushes some of the security work to the community itself, rather than to a dedicated staff of curators.

And since most package managers eventually resolve packages to a URL somewhere, the issues you mention are probably present in other package managers, albeit hidden behind abstractions.


> You can also make a typo when all you need is a package name

"packagename" instead of a full URL is quite a difference. And you are not addressing the other risks.

> On the other hand, with a URL, you can actually inspect the code directly

You can do that with most package managers as they show you the upstream URL.

Expecting every developer and every system engineer to verify every package and every dependency they install is not "just a tradeoff". It's simply impossible.

> since most package managers eventually resolve packages to a URL somewhere, the issues you mention are probably present in other package managers

Some check for the SSL certificate, some use package signing (e.g. APT). Also if the pypi domain expires everybody will know, unlike a random library.


>"packagename" instead of a full URL is quite a difference. And you are not addressing the other risks.

It is more characters, and therefore easier to misspell, but a URL also gives you a domain and probably a namespace for the developer, each of which can act as indicators of trustworthiness and help disambiguate packages with the same or similar names.

If you can't double-check your spelling for a package name, or you just pick the first Google result or paste from SO, then you deserve what you get. Domain hijacking, MITM, Unicode shenanigans and such are real risks, but they are not risks of URLs as package identifiers per se, so much as risks of distributing packages over the internet, which most if not all package managers do anyway.

>You can do that with most package managers as they show you the upstream URL.

But if you don't have to deal with the URL, chances are you won't, and it's less likely you'll bother to follow it. I'm arguing that, if URLs are dangerous because of their length, then package names alone are dangerous because of their abstraction. I know that I can probably trust including "https://github.com/symfony/symfony" but "symfony" or even "symfony/symfony" alone tells me nothing useful.

>Expecting every developer and every system engineer to verify every package and every dependency they install is not "just a tradeoff". It's simply impossible.

True, but Linus' Law is still basically the security model that's supposed to underpin open source software, even if it's proven not to scale as well as assumed. Someone, somewhere has to know the code is safe, and that's either you, someone you trust, or (as is likely the case with most developers) someone you just assume exists.

>Some check for the SSL certificate, some use package signing (e.g. APT). Also if the pypi domain expires everybody will know, unlike a random library.

There's no reason a package manager using URLs can't also require package servers (which, let's face it, are probably going to be Github and Bitbucket in almost all cases) or maintainers to do something similar. Or at the very least put out warnings the way browsers do about invalid or untrusted certificates or unknown domains. You would lose the freedom of the "wild west" model in its purest form but still not be tied down to a single source of authority.


I'm all for security but this hit a nerve with me: "Success of the attack relies on negligence of the developer, or system administrator, who does not check the name of the package thoroughly."

Package managers need to do more. If they offered an enterprise version with a monthly/annual subscription that could be invoiced, they would get enterprises on board; enterprises are concerned about security and will pay. Developers like us will help encourage it. I'd rather not see third-party "secure" package managers; make this part of PyPI and send the funding to the Python Software Foundation. They are seeking donations, but that doesn't work well with businesses. Make it a monthly/yearly service.


What it says is to check the package name, not to go through the whole source code.



