Typosquatting programming language package managers (incolumitas.com)
486 points by xrstf on June 8, 2016 | 143 comments



We've gotten flak from package developers submitting new packages to Package Control [0] because all additions to the default channel are hand reviewed. Part of this process is to prevent accidentally close package names, to try and encourage collaboration and to encourage developers to actually explain what their package does and how to use it.

My hope is to be automating a large amount of the review in the next few months, however I think this is a good argument for never having it be fully automatic. Having a human sanity check submissions isn't a terrible idea if we can keep the workload down.

Certainly this doesn't prevent a malicious author from posting a legitimate package and then changing the contents to be malicious, but that can be somewhat solved by turning off automatic updates.

[0] https://packagecontrol.io


Hey Will,

Thanks for keeping Package Control high quality, I know it's highly appreciated :-)


Another grateful Package Control user here.


Sonatype has a manual review process as well before allowing new projects to deploy to Maven Central. [1][2]

One step to mitigate things like this as well would be to have some sort of "crowd-sourcing" command in the package manager program... like "npm flag coffe-script" or something like that to alert repository maintainers of possible issues.

[1]: http://central.sonatype.org/pages/ossrh-guide.html

[2]: http://central.sonatype.org/articles/2014/Feb/27/why-the-wai...


Typosquatting can be flagged automatically (for later review by a human) using Levenshtein distance.
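To make that concrete, here is a minimal sketch of flagging by edit distance. The function names and the threshold of 2 are assumptions for illustration, not anything a real registry is known to use.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the chars match)
            ))
        prev = cur
    return prev[-1]


def flag_for_review(new_name: str, existing: list, max_distance: int = 2) -> list:
    """Return existing names that are close to new_name without being identical."""
    return [name for name in existing
            if 0 < levenshtein(new_name, name) <= max_distance]


# A swapped pair of letters counts as two substitutions, hence the threshold of 2.
print(flag_for_review("reqeusts", ["requests", "flask"]))  # ['requests']
```

Anything the function returns would go to a human queue rather than being rejected outright.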



Keep fighting the good fight.


> Certainly this doesn't prevent a malicious author from posting a legitimate package and then changing the contents to be malicious, but that can be somewhat solved by turning off automatic updates.

Perhaps you could make this safer by adding an automatic check for how much the package has changed since the last version? And at least warn the user when they want to update?


I don't know how much checking for how much the package has changed would help. You wouldn't need to change much to exploit - one line that downloads and executes code from somewhere would do it.


Perhaps list all new packages to the community and require/request validation or flagging by the community, along with listing similar package names.


> In the thesis itself, several powerful methods to defend against typo squatting attacks are discussed. Therefore they are not included in this blog post.

http://incolumitas.com/data/thesis.pdf section 5 "Practical implications". Just wanted to point out that in case you skipped it it's worth a read, some interesting proposals there that are worth discussing with package manager maintainers.

I particularly like the preemptive approach of auto-blacklisting common typos by simply monitoring the number of times a specific nonexistent package is requested over time (5.10). So if a lot of people regularly attempt to install the nonexistent package "reqeusts", it could signal that it's a common typo and should be blacklisted to prevent malicious use in the future. False positives could always be sorted out manually by communicating with the package manager maintainers.
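A rough sketch of what that monitoring could look like on the index side. `MissCounter` and its threshold are hypothetical names and values for illustration, not taken from the thesis.

```python
from collections import Counter


class MissCounter:
    """Counts requests for names that do not exist in the index (hypothetical helper)."""

    def __init__(self, threshold: int = 100):
        self.threshold = threshold  # assumed cut-off; a real index would tune this
        self.misses = Counter()

    def record_miss(self, name: str) -> None:
        """Call this whenever a lookup for `name` ends in a 404."""
        self.misses[name] += 1

    def reserved_names(self) -> set:
        """Names requested often enough that registration should need review."""
        return {name for name, count in self.misses.items()
                if count >= self.threshold}


counter = MissCounter(threshold=3)
for requested in ["reqeusts", "reqeusts", "asdfgh", "reqeusts"]:
    counter.record_miss(requested)
print(counter.reserved_names())  # {'reqeusts'}
```

A real deployment would presumably count distinct clients rather than raw requests, since raw counts are easy to inflate.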


You'd Bayesian that.

- The package name is something a lot of people regularly attempt to install, but it doesn't exist (per above)

- The package name is 1-2 chars off from the name of another package which has more than X downloads

- The package is frequently installed then uninstalled in a short time


Reminds me of the quote, 'there are only two hard things in computer science: naming things, cache invalidation and off-by-one errors.'

I think that this clearly falls under the heading 'naming issue.' People know what they want, but do not enter it properly.

I can't think of a 100% solution off-hand, which isn't surprising, because it's a hard problem.

pmontra's suggestion to use typo blacklisting ain't a bad idea. Maybe some sort of reputation-per-name could help?


Sure it's not an off-by-one[-key] error? :)


Banks have a similar problem when people write cheques or set up standing orders. You have to put a name and the account number.

I wonder if you could do something similar here - enter the name of the package and a code of some sort. I haven't thought this through in a lot of detail.


Banks generally solve the issue with simple classic checksumming methods that guarantee that any number with a typo or swapped neighbouring characters will always result in an invalid number.
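For illustration, the mod-97 check digits used in IBANs are one scheme of this kind. The sketch below applies the same arithmetic to a bare digit string. It is not the full IBAN procedure (which also rearranges the string and encodes letters), and the function names are made up. Because 97 is prime and larger than 9 * 9, any single wrong digit and any swap of two different neighbouring digits changes the remainder.

```python
def add_check_digits(account: str) -> str:
    """Append two check digits so that the whole number is 1 modulo 97."""
    check = 98 - (int(account + "00") % 97)
    return f"{account}{check:02d}"


def is_valid(number: str) -> bool:
    """True if the digit string carries consistent mod-97 check digits."""
    return number.isdigit() and int(number) % 97 == 1


good = add_check_digits("12345678")
print(is_valid(good))                                # True
# Swap the first two digits of the protected number: the check now fails.
print(is_valid(good[1] + good[0] + good[2:]))        # False
```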

That doesn't work with arbitrary names because they are, well, arbitrary.


Why not? Central repositories could require that all names be at least a certain Levenshtein distance apart from one another.

This could get mildly annoying every once in a while when there are legitimate non-clashing names. A better metric/typo recognition technique is probably possible. Or else some manual process for requesting exceptions (maybe with a tiny fee to help fund the overall project) would also address this problem.

EDIT: Just downloaded and read the thesis abstract. The author actually suggests the first idea: "The analytical part generates ideas for countermeasures that allow repository maintainers or users to detect typosquatting attacks in the future. For this purpose potential typosquatting candidates could be generated for each legitimate package name with the help of the Levenshtein distance algorithms or Bayesian networks. Another option that can be considered is the Metaphone algorithm."


"Sorry, the otherwise 100% valid and reasonable name you've selected for your project is invalid because an algorithm has determined it is arbitrarily too close to this other unrelated project. Try again."

Who would use that?


"The project title has been flagged due to similarity with an existing name. Your submission has been sent for moderator review".

Package managers have humans to deal with edge cases (removing malicious packages, investigating package errors, etc.) and this is no different. It wouldn't significantly increase their burden because only a small fraction of package names should require human validation.


Or just refer to packages by 2 names.

    Maintainer/PackageName
It solves so many problems, this included.


This is all half of a much larger problem, which is package identification. Perl 6 specced out[1] quite a bit of a future system to handle a lot of this, and I believe a lot of it is now implemented. A few things you need to consider:

- Maintainership can change over time.

- Multiple people may trade off releasing a package, but it's still the same package.

- There may be multiple repos (consider you may want to run a local company repo for non-redistributable modules).

I imagine in the end, one of the better approaches to the installation name typo problem might be to scan the code for what packages are required (utilizing as much specific information as possible), and confirming that exists as a local package that can be installed or offering to install it. Package installers should be able to take a source file or files, and install modules listed within. This won't solve all cases (dynamically determined and loaded modules may be a problem still), but it will solve quite a bit of them.
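As a sketch of the "scan the code" idea, here it is in Python rather than Perl 6. It walks a source file's syntax tree, collects top-level imports, and reports those missing from a known set. `unknown_modules` and the `known` set are assumptions for illustration. Note that import names do not always match distribution names, so a real tool would need a mapping between the two.

```python
import ast


def imported_modules(source: str) -> set:
    """Collect the top-level module names imported by a piece of Python source."""
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            for alias in node.names:
                found.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            # level > 0 means a relative import, which is local to the project
            if node.module and node.level == 0:
                found.add(node.module.split(".")[0])
    return found


def unknown_modules(source: str, known: set) -> list:
    """Imports that are neither installed nor present in the (assumed) index."""
    return sorted(imported_modules(source) - set(known))


code = "import os\nimport reqeusts\nfrom flask import Flask\nfrom . import local\n"
print(unknown_modules(code, {"os", "requests", "flask"}))  # ['reqeusts']
```

As the comment says, dynamically loaded modules would slip past a static scan like this.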

1: http://design.perl6.org/S11.html#Versioning


Those are some good points, and I guess in my head I'm thinking of how Github does repos on their site as my "example".

Github allows transferring of repos to another "namespace" (username), and will even forward requests from the old one to the new one for a while (how long I'm not sure...)

Thinking about it a bit more that kind of "mutability" might not be the best idea in a package manager...

Still, I think the namespaces can help more than they hurt if the platform is designed with them in mind, as even "namespace-less" systems still suffer from some of those issues like wanting to rename a package or split it up into multiple smaller packages.


I'm not arguing for no namespaces, much the opposite. I'm arguing that the whole way most languages implement modules is fairly haphazard, and that that leads to this problem. If you review the link I included previously, you can see some examples of how you could definitively specify a particular module version. E.g.

    use OldDog:name<Dog>:auth<cpan:JRANDOM>:ver<1.2.1>;
This would use Dog from the CPAN repository, author JRANDOM, and version 1.2.1, and namespace it as OldDog. You could also just "use Dog;" to use the canonical Dog package from the canonical sources (in order). If we could just point our package manager at this source code and it could determine "Hmm, you have a Dog module of that version, but not that author and repo, and you have a Dog module from that repo and author but not that version. Looks like we need to install it." that would leave us in a much better place, both for code using definitive versions of packages, and admins/programmers installing packages and making sure they get the right one, if it's been defined.


A different maintainer per major/minor version number is probably common enough of a requirement that it should absolutely be considered in the scheme.

For a while I bumped into projects that tried to follow the old Linux model of even/odd version numbers for telegraphing API stability. Long term support and backported security enhancements are another case where maybe the guys working on new functionality are exactly the wrong people to take responsibility.


One imagines that "Maintainer" could be typoed as e.g. "Maintaner" just as easily as "PackageName" could be "PackagName".


But then the attacker would need to register a ton of packages that match other popular packages under their namespace which can set off some alarms. (I guess "solve" was a bit too strong of a word to use there...)

There could also be some other cool tricks you could apply (This is the first time you are installing a package from "Maintaner", would you like to continue?)
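A minimal sketch of that first-time prompt, assuming a client that keeps a local set of maintainers it has installed from before. `install`, `trusted` and `confirm` are hypothetical names, and nothing is actually downloaded here.

```python
def install(package: str, trusted: set, confirm) -> bool:
    """Decide whether to go ahead with 'maintainer/name'; ask on first contact.

    `confirm` is any callable that shows a question and returns True or False.
    """
    maintainer = package.partition("/")[0]
    if maintainer not in trusted:
        question = (f'This is the first time you are installing a package '
                    f'from "{maintainer}", would you like to continue?')
        if not confirm(question):
            return False
        trusted.add(maintainer)  # remember the decision for next time
    return True


trusted = {"jashkenas"}
decline = lambda question: False
print(install("jashkenas/coffeescript", trusted, decline))  # True, already known
print(install("jashkneas/coffescript", trusted, decline))   # False, user declined
```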


An attacker would only need to register the equivalent of the package under attack. Other packages would continue to error out harmlessly as they did before.

The maintainer-level confirmation could be of slight assistance to advanced users, but it's no panacea.


That gets into issues with needing to either support multiple individual maintainers for a single package, or require any multi-maintainer package to create an organization they'll all work under, and use the org name. And since the org name is likely to be the name of the package, you're back at square 1.

For example, on the Python Package Index five people have authorization to publish a new Django release. Creating a "Django" org namespace wouldn't help, since someone could typo the org name and hit a squatted malicious version (and that's almost certainly what it would end up being; our github org is named "django").


I guess that would work, as long as you require PackageName to be unique across all Maintainers.


Though this would obviate the most compelling argument for namespacing, which is to allow exactly that.


    jashkenas/coffeescript
could easily be mistaken for:

    jashkneas/coffescript


When you think about it, how different is the destructive potential of an npm/pip install from curl | bash that (some) people tend to froth at the mouth about?

It's pretty mind blowing how big of a blindspot package installers are. I guess running everything inside e.g. a Docker container/VM would be a partial interim solution for the paranoid?


> When you think about it, how different is the destructive potential of an npm/pip install from curl | bash that (some) people tend to froth at the mouth about?

It's a bit better - there is only one possible source of compromise rather than everyone on the network path. Given that npm/pip likely keep archives of all packages uploaded, it would be much harder (perhaps impossible) to attack someone secretly this way, at least in the long term.

Good package managers require signing of uploads (e.g. maven central requires every package to have a GPG signature; Debian goes further, and requires your key to be signed by an existing member of the organization). If the client checks the signatures you end up with a system that's perhaps actually secure.


Signing is definitely part of the answer but there's still the question of trust.

A signed package doesn't really tell you that much. In the best case scenario it tells you the package you're installing in fact came from developer X and contains code Y (which you kinda already know since you have the source code). This works as long as you know and trust developer X, or did your due diligence reading through the code (which you can already do today).

I can't think of an end solution that wouldn't have to rely on network effects and social proof, which strikes me as rather fragile. Maybe formal verification and AI can help, but that's a long way off (?)


For me they're very similar. I actually did a talk last year for OWASP AppsecEU where I started with the curl|bash bit and pointed out where rubygems/npm etc aren't really a lot better in some ways

https://www.youtube.com/watch?v=Wn190b4EJWk


Nice talk! Sounds like there's no silver bullet...

I'm curious to hear your opinion about a combination of digital signing with e.g. keybase/blockchain + reputation system, a sandboxed development environment (mitigates the "short con" risk) and a sandboxed production environment, with the minimum set of permissions required to operate (as well as auditing of course).

Call me pessimistic but I don't see developers taking on the extra friction given the status quo. Though a major data breach or two might change things, as I'm sure we'll find out sooner or later.


I'm a fan of the approach of personally submitting projects to the repository maintainer (e.g. through GitHub issues), and having the maintainer personally approve them.

It does raise the barrier to entry, but it would prevent typosquatting and regular namesquatting.

EDIT: Does any major package manager provide a "did you mean" functionality, offering a list of actual package names similar to what you typed?


then the maintianer must have perfcet sigth and never ovrelook even one tpyo :]

and then also have perfect memory of all packages and notice that similarly named package is too (for some value of "too") similarly named to some already existing one... even if e.g. both are a correct dictionary word.


APT does and others probably do too, but it obviously only gives suggestions when the package you entered doesn't exist.


Right, it's only useful if you've prevented typosquatting.

Which Debian has, because submitting a new package is a much more involved process than sudo apt-get publish.


That's a massive burden on the poor person who has to ok the package - especially at NPM's scale, for example.


We believe npm's scale is a direct result of having the lowest ceremony to publish a package. Turning the dial in the direction we did has pros and cons.


Well, ideally you'd set up some sort of system where multiple people work on managing a repository, similar to maybe how linux distributions package applications and libraries.


NPM's scale is the exception, rather than the rule.


Or someone needs to approve suspiciously named packages.


How do you determine what is a suspicious package without reviewing every new package by hand?


You could base it off edit distance with all the other packages. If the distance is too close, then it needs manual approval.


After watching this awesome Defcon talk https://www.youtube.com/watch?v=YqxaKGA9Lnc I wondered if there were any use cases for bit/typo squatting in crypto. This is a pretty cool one! Not crypto but interesting nonetheless :)


Probably the maintainers of the package managers know which typos their users do, because of the 404s in the logs or equivalent errors. A preventive action could be starting to blacklist any name resolving to 404. If somebody eventually tries to upload a package in the blacklist, a maintainer should check the code and whitelist the name. Obviously people can be very crative with typos and with squattinq and there is no real protection against mistakes.


Might it work to mandate that the name of an uploaded package have a minimum levenshtein distance (or similar calculation) from the names of all the existing packages? Then you wouldn't have to worry about maintaining a blacklist.


That would mean that, for example on crates.io, you couldn't create a `libm`, because `libc` is already very popular. I don't think that works.


The default approach would stop automated attacks; there is no reason why the repository couldn't whitelist libm after review.


True, Levenshtein isn't the best algorithm for the purpose. Is there an algorithm that takes key proximity into account? Like, 'libm' and 'libc' are sufficiently different to preclude typos, but 'lib[n/j/k]' or 'lib[x/d/f/v]' are not?
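One way to sketch this is a weighted edit distance in which substituting a neighbouring key is cheaper than substituting a distant one. Everything here is an assumption for illustration: a US QWERTY letter layout with a rough half-key stagger per row, a cost of 0.5 for neighbours, and the function names.

```python
ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
OFFSETS = [0.0, 0.5, 1.0]  # rough horizontal stagger of each row (assumption)
POS = {ch: (r, c + OFFSETS[r])
       for r, row in enumerate(ROWS) for c, ch in enumerate(row)}


def adjacent(a: str, b: str) -> bool:
    """True if two letters sit next to each other on the assumed layout."""
    if a == b or a not in POS or b not in POS:
        return False
    (r1, x1), (r2, x2) = POS[a], POS[b]
    return abs(r1 - r2) <= 1 and abs(x1 - x2) <= 1.0


def typo_distance(a: str, b: str) -> float:
    """Levenshtein distance where a neighbouring-key substitution costs 0.5."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [float(i)]
        for j, cb in enumerate(b, start=1):
            sub = 0.0 if ca == cb else (0.5 if adjacent(ca, cb) else 1.0)
            cur.append(min(prev[j] + 1,         # delete ca
                           cur[j - 1] + 1,      # insert cb
                           prev[j - 1] + sub))  # substitute
        prev = cur
    return prev[-1]


print(typo_distance("libm", "libc"))  # 1.0, 'm' and 'c' are far apart
print(typo_distance("libm", "libn"))  # 0.5, 'm' and 'n' are neighbours
```

With a review threshold somewhere between 0.5 and 1.0, `libm` would pass next to `libc` while `libn` would be held for review.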


Key proximity on which of the hundreds of keyboard layouts?


Good question... I'd imagine your standard QWERTY makes up a sizeable majority of programmers, but then I have no data to back that up... :)


It seems a good idea.

I used the Ruby code at the beginning of http://stackoverflow.com/questions/16323571/measure-the-dist... to calculate the distance between the package names at page 60 of the thesis and their typos. The maximum is 2.

I checked some similar package names from a Gemfile.lock of a project of mine. Unfortunately the two gems hike and hirb are also at distance 2. Probably many short names are close with this metric.

A combination of the two approaches could be ok: knowing that a name was blacklisted should be an indicator that it's not a good name, whatever its distance from any other name, plus an approval of the maintainers for distance 2.

But a blacklist could generate another type of squatting, with people trying to pre-blacklist perfectly legit names. Only one thing is sure: there is more work to do for the maintainers and this extra friction is not good.

Edit: the distance suffers from the same problem.


Surely some troll would deploy a fleet of machines that flood package indexes with requests to available names, effectively blacklisting entire dictionaries and eventually all short names.


Yeah, this is what I came to think too. I mentioned it in another comment. Somebody suggested to use a distance indicator, but trolls could attack that too.


> Obviously people can be very crative with typos and with squattinq and there is no real protection against mistakes.

I see what you did.


This seems like pretty unethical research to me.

Also, it doesn't point out that the bigger threat is that this is wormable.


The doc (http://incolumitas.com/data/thesis.pdf) does have a short section on ethics, but IMHO it completely misses the point of the ethical concerns in running unauthorized, non-sandboxed code on devices you don't own. Instead it justifies the research by saying the threat cannot be shown unless the vulnerability is exploited, which is true, but that fact does not justify the research.

The acknowledgements mention 2 of the university advisers and a PyPi admin consented to the "notification program".

Still, people with good intentions have been prosecuted and convicted for less. I would be very concerned for this student.


There was no actual intrusion, so this feels like fair game to me. Especially since mitigating a very possible attack vector is a direct result of running the experiment. Still, hopefully the researchers got an IRB to sign off on the experiment setup...


Well, there was a small intrusion. It reports back a filtered command history (including just package install commands), the hardware info, and the list of installed modules (along with regular info, like system type, if there are admin privileges, etc). That's not nothing, but it is fairly benign. I was worried about the command history until I saw it was filtered, and that mostly allayed my misgivings.


The research got computers to execute code on them without authorization and extracted information from them.

That is a crime under the CFAA in the USA. Not sure what it is in Germany/EU.


"Your honor, my client created and published a software library. The so-called victims here wrote code that specifically referenced my client's software library, by name mind you. My client in no way compelled or solicited the victims to do so. Now how can that be called 'without authorization?'"


> "Your honor, my client created and published a software library. The so-called victims here wrote code that specifically referenced my client's software library, by name mind you. My client in no way compelled or solicited the victims to do so. Now how can that be called 'without authorization?'"

The prosecuting attorney is going to tell a jury of twelve of your non-technical "peers" that it is hacking.

Your client can either go to trial for seventeen thousand two hundred and eighty nine counts of felony hacking, and risk half a million years in prison, or they can plea bargain to 5 years in prison and a felony on their record.

Or your client can hang himself, but I'm pretty sure a federal prosecutor counts that as a win too.


Yup. I'm surprised an advisor would sign off on this thesis.


That's an interesting point. Is there any case law on negligence versus malice?


What about Android apps (like Facebook) that collect phone numbers, contact lists, geolocation data, record sound without any user authorization?


Yea, this would never get past my university's ethics department. I'm actually surprised he was allowed to do this. Maybe it's partially due to the fact our ethics department is also worried about liability.


Perhaps this could have been made cleaner by relying on the package manager for download counts only, and then demonstrating the code execution scenario on research machines only. If you wanted to avoid actually downloading anything to the user's machine (after all, they expect a 404 in this case, not a package, even a harmless one) you'd perhaps need the cooperation of the repo admins to a greater extent.

Anyway, this is all part of why I always try to build inside a container, or at least in a virtualenv where I don't need to sudo the install.


Yeah I wouldn't want to find myself in court hearing

>17000 computers were forced to execute [unauthorized] arbitrary code

Certainly a crime in the US, not sure about Germany.

Nice execution though!


I'm not so sure - were they forced? Could you take the maintainer of `requests` to court too? If someone types `pip install reqeusts` and gets something they maybe didn't expect, did you really force them?


Are you asking if the maintainer of 'requests' decides to spy on computers and phone home information?

What packages do this?


Not a lawyer, I'm just picking nits. It seems to me when you pip install a package, you are saying "download <this thing> and run its setup.py file". What if requests did something you didn't like, something simple like write a new directory or change the name of a certain file. Could you sue over that? Where is the distinction?


No one would be suing. This would be criminal.

I was thinking that a simple way this would be illegal in the US would be

"[accessing] a computer without authorization or exceeds authorized access, and thereby obtains information from any protected computer"

See a2C here: https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act#C...

I'd assume you can make a decent case that the person only authorized the installation of a piece of software, not the gathering of identifying information.

IP addresses can be used as identifying information especially when paired with a timestamp.

Being an American citizen living in the US I would not want my name on this paper.


Ah fair enough, that makes more sense. It's definitely an unethical experiment, glad my name isn't on it either.


I wonder about the legality. It looks to me like he isn't technically responsible, since he didn't access any computer without authorization himself.

If I intentionally leave an infected USB drive on the ground, and someone picks it up and sticks it into their computer, am I liable?

Seems like it could go either way.


Part of the problem is the many packages that require sudo permissions to install - IMHO that should be an exceptional case, but it isn't.


Packages often require sudo in order to install to the global interpreter - it's a security hazard otherwise. Imagine a Python package which overrides the sys module. If it didn't require sudo, anyone could install it and compromise Python for everyone else (or, for instance, compromise setuid programs).

The two solutions here are user-local packages (pip --user, for example) and virtual environments.


And 'npmjs.org' is misspelled as 'npmsjs.org' in the introduction. Nice.


Wow, this is a very good study and explanation of what typo squatting is, and I really liked how he proved its effectiveness.

I wonder what kind of steps we can take to prevent this risk.


I think we will have to rely on crypto hash in some form. Similar to download checksum. It won't be convenient, but it will be safe(r).


That doesn't really save you from typos


I was thinking something along the line of a mandatory hash/checksum along with the name of the software you are trying to install from a package manager. It does not have to be very long, just enough to avoid common collisions.
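A sketch of how such a name-plus-checksum could work, assuming the tag is simply the first few hex characters of the SHA-256 of the package name. The `name-tag` format, the tag length and the function names are all made up for illustration.

```python
import hashlib

TAG_LEN = 6  # assumed length; short enough to type, long enough to catch typos


def name_with_tag(name: str) -> str:
    """Return 'name-tag', where tag is derived from the name itself."""
    tag = hashlib.sha256(name.encode("utf-8")).hexdigest()[:TAG_LEN]
    return f"{name}-{tag}"


def resolve(spec: str) -> str:
    """Return the package name if its tag matches, otherwise raise ValueError."""
    name, sep, tag = spec.rpartition("-")
    if not sep or len(tag) != TAG_LEN:
        raise ValueError("expected a spec of the form name-tag")
    if name_with_tag(name) != spec:
        raise ValueError("tag does not match name; possible typo")
    return name


spec = name_with_tag("requests")
print(resolve(spec))  # requests
```

Someone who mistypes the name but copies the tag gets an error instead of the squatted package. The cost is that nobody can install from memory any more.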


Instead of blacklisting, why not respond with a "You requested package ABD, but we think you might mean package ABC. Enter 'yes' to continue or anything else to start over."

That way authors can continue to use any name they want, and the emphasis is on letting installers know that they might be installing the wrong package.
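A sketch of that installer-side check, using `difflib.get_close_matches` from the Python standard library for the similarity part. The download counts, the 10x popularity ratio and the function name are assumptions for illustration.

```python
import difflib


def install_warning(requested: str, downloads: dict,
                    ratio: int = 10, cutoff: float = 0.8):
    """Return a warning if a similarly named, far more popular package exists."""
    others = [name for name in downloads if name != requested]
    for candidate in difflib.get_close_matches(requested, others, n=3, cutoff=cutoff):
        if downloads[candidate] >= ratio * downloads.get(requested, 0):
            return (f"You requested package {requested!r}, but we think you "
                    f"might mean package {candidate!r}. Enter 'yes' to "
                    f"continue or anything else to start over.")
    return None  # nothing suspicious, install without prompting


# Made-up download counts for the example.
downloads = {"requests": 1_000_000, "reqeusts": 12, "flask": 500_000}
print(install_warning("reqeusts", downloads))
print(install_warning("requests", downloads))  # None
```

Only the less popular of two similar names triggers a prompt, so automated installs of the well-known package are left alone.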


"You requested package ABD, but we think you might mean package ABC. Enter 'yes' to continue or anything else to start over."

That'll be fun to automate around in puppet or ansible.


I hope you're using a local package cache for puppet or ansible or even specifying via hash (think git commit)
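Pinning by hash boils down to comparing a digest of the downloaded archive against a value recorded when the dependency was last reviewed. pip's hash-checking mode does this for requirements files. The sketch below shows only the comparison itself, with a made-up function name and made-up archive bytes.

```python
import hashlib
import hmac


def matches_pin(archive: bytes, pinned_sha256: str) -> bool:
    """True if the archive's SHA-256 digest equals the pinned hex digest."""
    actual = hashlib.sha256(archive).hexdigest()
    return hmac.compare_digest(actual, pinned_sha256.lower())


archive = b"print('hello')\n"              # stand-in for a downloaded package file
pin = hashlib.sha256(archive).hexdigest()  # recorded once, at review time
print(matches_pin(archive, pin))           # True
print(matches_pin(archive + b"x", pin))    # False
```

With a pin in place, a typo in the name fetches bytes that fail the check, provided the pin was taken from the right package in the first place.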


But if ABD and ABC are both package names in the system, then in order to present that warning we have to do some sort of resolution process to determine whether one is typosquatting.

Now that there's a strategy for finding fakers: 1) You have an attacker-defender arms race. The attacker will always be one step ahead of the defender. 2) You have the extra burden of keeping up in this race, otherwise your security feature is a facade. At best, this is useless. At worst, it lulls your users into a false sense of security.


I feel like "pick the more popular package" is a good enough solution in this case.


Cool. Attacker-defender race is on!

As attacker, my next strategy is to create a bunch of agents (<10K should be enough) to download my typo packages.

Your move, defender ;)

But seriously, my point has less to do with the particular tactics of the adversaries and more to do with how the proposed strategy of automatically detecting potential typos invites gaming.


Perfect, if each of those 10K hosts downloads the library 100 times you can now typo-attack the zope.event (working in python) library, which gets ~100 downloads per day, many of which are automated and so invulnerable to your attack. Your attack vector gets you, we'll say 1 new hit every 2 days at most, and likely only one a week or so (according to some math, on `requst` vs. requests)


We need operating system vendors to give us a mechanism for easily creating and managing sandboxed dev environments.

One's dev environment should be a place where remote code execution is a high probability and we need better tools to partition that from high value data.


This only seems to be an issue for languages where packages reside in a global namespace, like Python, Rust etc.

I think most languages these days are a bit smarter and avoid this beginner mistake (for various reasons).


This is obviously not true. If `serde` resided at `erickt/serde` (as the counterproposal for Rust would've had it), I could create `erict/serde` or `erick-t/serde` or any other variations of erickt's handle.

The only way this is 'solved' is if some third party authority hands out top level names and refuses to register names that are similar to other names for some definition of similar. The number of levels between top level and package name is irrelevant.


Well, you could also solve it by saying that the post slash names are unique. ie. There can't exist zardeh/serde if erickt/serde already exists. Then the author-name works as a logical checksum, and you aren't any worse off than you were with a global namespace.


The purpose of a namespace is to make it possible to disambiguate two otherwise identical identifiers. If you force package names to be unique across all namespaces, then you don't have namespaces at all, you just have a single global namespace where you're forced to prepend an author name to the package name.


I know, I wasn't suggesting this as a namespacing solution, but instead a typo-prevention one.


That reduces the likelihood of success (erick-t/srede requires 2 typos) but doesn't eliminate the possibility.


True, but two simultaneous and specific typos is much, much less likely than a single one.


The name is just one part of the problem.

There's another solution (like debian does), auditing what the package itself does, so that you don't allow malicious code into the repository.


You are obviously wrong.

While attacking a single package would be possible, covering any interesting amount of "typo"-space would require registering huge amounts of namespaces.

If package manager developers are smart, the allocation of namespaces is also handled externally and associated with some cost (e. g. domain names).

Therefore these kinds of attacks become impractical.


While a package manager could require something like a domain name to authenticate, it's much more common for them to require something with a much lower barrier to entry, like a GitHub account. I don't agree that this design decision means they are 'not smart' (nor do I think having a single namespace is a 'beginner mistake,' but whatever).

Package managers like these approach social networks, which has many advantages but carries the disadvantage of opening users to attacks that resemble social network phishing attacks. We could mitigate this by rolling back to package managers with higher barriers to entry, but I think that is not likely to happen.

You clearly would prefer to use a more adjudicated, managed package manager, with a higher barrier to publish and stronger rules about naming. That's a reasonable thing to want, but it would be better if you didn't act like people who want something which conflicts with that goal are stupid.


this is yet another reason why i really wished rust had gone for namespaced packaging on crates.io. i like so many of the decisions the rust team made, but not this one.


This is incorrect. Package repositories with namespacing are just as vulnerable to these attacks.


Wrong.


Say that a popular package lives at `jack/foo`. An attacker needs only register `jakc` and create a package `foo`, and now anyone typing `blah install jakc/foo` is owned. There's a reason why "namespacing" isn't listed under the "Defenses against typo squatting" section.


Just read my other reply.


Ruby and JS package managers are un-namespaced as well.


Julia too, though there is a central list. I haven't done any tests for this kind of thing.


couldn't you register a typo namespace?


Yes, but then you'd need to also register a ton of packages under that namespace.

That's something that can be flagged for manual review before it gets too far.


maybe I don't understand the namespaces...

but if you are targeting a package `someuser/popularpackage` can you not just register your own malicious `popularpackage` under a typo namespace like `smoeuser`?


Yes, but my thought was it gives a bit more "data" to work with on the package manager's side.

They can see someone registering popular package names under something with a similar namespace and can flag them for manual review (which can be done for namespace-less packages, but there will be much more noise), they can apply things like "This is the first time you are installing a package from 'smoeuser' would you like to continue?", or even require adding a specific namespace "out of band" depending on how paranoid it wants to be.
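A minimal sketch of that kind of registry-side flagging, assuming a hypothetical in-memory `registry` that maps each namespace to the set of package names it publishes. The function names and the edit-distance threshold of 2 are illustrative choices, not any real package manager's API:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def suspicious_namespaces(new_ns, new_pkg, registry, max_dist=2):
    """Return existing namespaces that are spelled like new_ns and already
    publish a package with the same name as new_pkg."""
    return [ns for ns, pkgs in registry.items()
            if ns != new_ns
            and new_pkg in pkgs
            and levenshtein(ns, new_ns) <= max_dist]


# Hypothetical registry contents, for illustration only.
registry = {"someuser": {"popularpackage"}, "other": {"popularpackage"}}
flagged = suspicious_namespaces("smoeuser", "popularpackage", registry)
```

A swapped pair of letters such as `someuser` / `smoeuser` counts as two substitutions under plain Levenshtein distance, which is why the threshold here is 2 rather than 1. Anything in `flagged` would go to manual review rather than being rejected outright.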


  > "This is the first time you are installing a package 
  > from 'smoeuser' would you like to continue?"
You don't need package namespacing for this. All package repositories already require a registered account to publish a package.


That's a good point, but honestly I wouldn't be able to tell you the account names of any of the packages i use regularly.

And unless the account name of the package maintainers is brought front-and-center, you aren't necessarily going to know it shouldn't be different until it's too late.


A user doesn't need to be able to recognize the account name, that's the purpose of your aforementioned prompt. Let's consider the possible scenarios for installing "foo/bar":

  I. I've installed anything from the author "foo" before 
     on this machine, implying that I trust "foo".
    A. On a system with namespaced packages, I attempt to 
       install "fpp/bar". I've never installed anything 
       from the author "fpp" before, so I get a prompt.
    B. On a system without namespaced packages, I attempt 
       to install "bsr".
      1. If "bsr" is by an author I trust, then it will be 
         installed. This will be confusing, but is not a 
         security vulnerability, because this author is 
         already running code on my machines.
      2. If "bsr" is by an author I don't trust, then I get 
         a prompt, as in scenario I.A.
  II. I've never installed anything from the author "foo" 
      before on this machine.
    A. On a system with namespaced packages, I attempt to 
       install "fpp/bar". The system prompts me, as in 
       scenario I.A., but because I expect this prompt I 
       don't bother reading it and blindly accept it. The 
       prompt does reiterate the name of the author, but if 
       I didn't catch the typo the first time, there's 
       little chance I'll catch the typo this time. 
       Remember: the value of the prompt is not the 
       reiteration of the name, it's in its unexpected 
       nature, because research has repeatedly shown
       that users, even power users, do not bother 
       reading routine prompts (this is why, e.g., Chrome 
       no longer allows users to bypass the enormously 
       scary warning page that appears when a secure site 
       has a certificate error). My system gets owned.
    B. On a system without namespaced packages, I attempt 
       to install "bsr". The system prompts me, as in 
       scenario I.A., but because I expect this prompt I 
       don't bother reading it and blindly accept it. My 
       system gets owned.
A more complete version of the solution that you're proposing would be to have an actual implementation of a web of trust, but even that doesn't solve all the security problems inherent to package repositories.
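The prompt logic in those scenarios can be sketched roughly as follows. This is an illustration only: `trusted` stands for a hypothetical per-machine set of authors already installed from, and `confirm` is a stand-in for whatever interactive prompt the tool would show.

```python
def install(author, package, trusted, confirm):
    """Install author's package, prompting only for never-before-seen authors.

    trusted: set of author names already installed from on this machine.
    confirm: callable taking a message and returning True or False.
    Returns True if the install proceeds.
    """
    if author not in trusted:
        message = "This is the first time you are installing a package from '%s'. Continue?" % author
        if not confirm(message):
            return False
        trusted.add(author)
    # (actual download and installation of `package` would happen here)
    return True


trusted = {"foo"}
# Scenario I.A: "foo" is trusted, so the typo "fpp" triggers an unexpected prompt.
declined = install("fpp", "bar", trusted, confirm=lambda msg: False)
# Scenario II.A: a user who expects the prompt accepts it blindly.
accepted = install("fpp", "bar", set(), confirm=lambda msg: True)
```

The code makes the weakness visible: in scenario II the outcome depends entirely on what `confirm` returns, and a routine prompt tends to get a routine "yes".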


Did anyone else find it surprising that the number of total requests (45334) is so much higher than the number of unique total requests (17289)? It is more than twice the number of unique requests!

Possible explanations:

* Perhaps many of those are automated build systems, which would also explain the high number of systems with admin access (for example, if you use travis without docker, every build runs in a clean vm with admin access).

* People download one package and install it multiple times? Seems unlikely

Any other ideas?


I think he forgot to define a baseline (could be wrong, I didn't read the paper). He should have generated a few packages with a completely innocent name (and maybe some packages with just a GUID as a name) to see how many downloads / installs they get too.


The person who ran the line,

sudo pip install lumpy (instead of numpy)

Ran it again because it 'didn't work'


In the case of python (not sure about the other package managers) if a valid package requires the hacked package, each project that requires that valid package will download and install the hacked package separately if you're using virtual environments. Also if you're using docker you reinstall everything when your requirements file changes.


Numerous developers and/or building multiple servers behind a single IP address aka NAT. It's pretty common.


Automated testing, continuous integration/delivery, et cetera download and install packages pretty often. If the typo is made in the requirements.txt or package.json or what have you, the error can be repeated very often, up to and including production.


with npm there should at least be an option which prompts for Y/N/A when a package has a preinstall hook.

but even this just tries to sweep the problem under the carpet. you could still, for example, have a requests package which just installs the request package and works as expected, but sends the request/response to your own server from time to time, e.g. only when http basic auth is used.


It is possible to disable install hooks at install time by running npm install with --ignore-scripts.

You can also make this the default, with npm config set ignore-scripts true (and then --ignore-scripts false at install time if you wish to run them).
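For a prompt like the one suggested above, a tool first has to detect whether a package declares install-time hooks. A rough sketch, assuming the package's `package.json` text is already in hand; the helper name is made up, and only the `preinstall`, `install` and `postinstall` script keys are checked:

```python
import json

# Script names npm runs as part of installing a package.
INSTALL_HOOKS = ("preinstall", "install", "postinstall")


def install_hooks(package_json_text):
    """Return the install-time scripts declared in a package.json document."""
    manifest = json.loads(package_json_text)
    scripts = manifest.get("scripts", {})
    return {name: scripts[name] for name in INSTALL_HOOKS if name in scripts}


# Hypothetical manifest, for illustration only.
example = '{"name": "x", "scripts": {"preinstall": "node setup.js", "test": "mocha"}}'
hooks = install_hooks(example)
```

If `hooks` is non-empty, the tool could show the commands and ask before running them. As noted above, this only covers install-time execution, not code that runs when the package is later imported.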


Solaris did (does?) this - "this package contains installation scripts which run as superuser" or words to that effect. Unfortunately I never found a way to inspect the scripts directly so it wasn't all that helpful.


Maybe this is overly naive, but when I make a typo in the Google search bar, it doesn't even search for my typo-ed term (even if it would have gotten some hits), it searches for what I actually meant to type. Can't package managers have a similar feature?


The main problem is when you really did mean to search for the typo term. There's no inherent problem in two packages having similar names.

Consider the following:

requests - a python package for making HTTP requests.

requestr - a python package for a fictional startup that allows you to send requests to your nearest and dearest.

Given they both could be typos of each other:

1) How do we determine which one to use? What if someone accidentally also tries "requestd", somewhere between the two ?

2) How do we apply the principle of least surprise - I asked to install requests, and everything installed just fine, but now I can't import it?!


    $ pip install requestr

    Package "requestr": did you mean "requests"? [Y/n]
    (reason for this warning: similar spelling and requests is much more popular)

    Pass --no-spell-warnings to disable this feature.
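A rough sketch of the check behind such a warning, using `difflib` from the Python standard library. The download counts, the similarity cutoff and the "100x more popular" rule are made-up assumptions for illustration; pip has no such feature as far as this sketch is concerned.

```python
from difflib import get_close_matches

# Hypothetical download counts; a real index would supply these.
DOWNLOADS = {"requests": 1000000, "requestr": 120, "numpy": 900000}


def suggest(name, downloads=DOWNLOADS, cutoff=0.75, factor=100):
    """Return a similarly spelled, much more popular package name, or None."""
    own = downloads.get(name, 0)
    others = [n for n in downloads if n != name]
    similar = get_close_matches(name, others, n=5, cutoff=cutoff)
    # Only warn when the look-alike is at least `factor` times more popular.
    popular = [c for c in similar if downloads[c] >= factor * max(own, 1)]
    return max(popular, key=downloads.get) if popular else None


hint = suggest("requestr")
if hint is not None:
    print('Package "requestr": did you mean "%s"? [Y/n]' % hint)
```

Because the warning only fires in one direction (from the obscure name toward the popular one), installing `requests` itself never prompts, which keeps the prompt unexpected enough to be read.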


So last week my client discovered there's a gem named bunlder... sigh


There is a gem called bundle which does nothing except prevent a typosquat

https://rubygems.org/gems/bundle Total downloads 1,800,600

Source (empty) at https://github.com/will/bundle and interesting README.

https://rubygems.org/gems/bundler Total downloads 92,116,090

That's almost 2% of bundler's downloads.


I think the authors here missed an opportunity for even more effective squatting like that: cases where the name you import, name you type at the command line, or name you commonly call the package by is different from the name in the repository.

In Python, "pytables" (should be "tables") and "skimage" (should be "scikit-image") come to mind.


Yeah. I think it's becoming a reflex for programmers when they get an import error like:

    Error: Cannot find module 'x'
to quickly type:

    npm install x


My gem has a good download/LOC ratio.


I thank my stars every time I get a "Package not found" error due to a typo, because I'm reminded that it could have been much worse.


Trying to parse the title made my head hurt. It should be "Typosquatting software package names" or something.


The homebrew model where packages and changes to packages are reviewed takes care of this problem quite nicely.


Ouch. This really hurts. So hard to protect against human error.


Glad to hear bower is stated to be safe in this regard :)


Really cool applied research. If I get the time, I'll check out the author's thesis.


I'm confused.. is it 17 computers or 17000 computers? inconsistent use of decimals in this article.


17000. In Europe a common decimal format is #.###,##. See here: https://en.wikipedia.org/wiki/Decimal_mark#Examples_of_use


17,000. The author is from a country that uses . to delimit thousands.



