Incident Report: Inadvertent Private Repository Disclosure (github.com/blog)
187 points by jamesfryman on Oct 28, 2016 | 41 comments



We received an email from Github yesterday informing us that one of our repositories had been accessed by a third party due to this issue. While it's not a fun notification to receive, it definitely made our general security paranoia feel justified – we're lucky that from the get-go we've held to best practices around keeping secrets out of the codebase. Obviously we still dedicated time as a team to go over our repository history with a fine-toothed comb for anything that could potentially be a vulnerability, as we take this very seriously.

One of our engineers came up with a useful script to grab all unique lines from the history of the repository and sort them according to entropy. This helps lift to the top any access keys or passwords which may have been committed at any point.
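
Roughly (this is just an illustrative sketch, not the actual script), the "unique lines from history" half might look like this in Ruby:

    # Minimal sketch: collect the unique content lines from the entire history,
    # dropping commit metadata and diff headers.
    history = `git log -p --all`.lines.map(&:scrub)      # scrub invalid bytes from binary diffs
    content = history.select { |l| l =~ /\A[+-][^+-]/ }  # keep added/removed lines only
                     .map    { |l| l[1..-1].strip }      # drop the leading +/- marker
                     .reject(&:empty?)
                     .uniq

Ranking those lines by entropy is the part discussed further down the thread.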

I think this is a great example to illustrate the tough edges of security to less experienced engineers. Github will most likely never let something like this happen to you, but on the off-chance that they do it's great to be prepared. Additionally, the response from Github was very well received. No excuses, just a thorough explanation of what happened.

I also can't help but mention that we're hiring, if you'd like to work at an organization that values security and data privacy very highly. :) usebutton.com/join-us


I'm curious, how did they calculate entropy? My first thought was to do something with Huffman encoding.


I wrote the script in question and actually used a simple Shannon entropy value. (http://codereview.stackexchange.com/questions/868/calculatin...). It worked well enough to help rule out several problem spaces.
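
For reference, the character-frequency formulation boils down to something like this minimal Ruby sketch (not the actual script; `entropy.rb` is just a hypothetical filename):

    # Shannon entropy of a string's characters, in bits per character.
    def shannon_entropy(str)
      chars = str.chars
      return 0.0 if chars.empty?
      counts = Hash.new(0)
      chars.each { |c| counts[c] += 1 }
      counts.values.reduce(0.0) do |sum, count|
        p = count.to_f / chars.length
        sum - p * Math.log2(p)                 # H = -sum(p * log2(p))
      end
    end

    # Rank unique stdin lines by entropy, e.g.:  git log -p --all | ruby entropy.rb
    lines = ARGF.readlines.map(&:scrub).map(&:chomp).reject(&:empty?).uniq
    puts lines.sort_by { |l| shannon_entropy(l) }.last(10).reverse

High-entropy lines (hex keys, base64 blobs) float to the top; minified or generated files produce false positives, so it's a triage aid rather than a detector.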


Would you mind posting the script? I'd love to run it against our codebase and see what it comes up with.

It might be a fun thing to open source as part of a "I've inherited a project, what now?" toolkit that helps you decide what to fix.


Sure. It's a simple tool but the concept could be augmented toward something like the scenario you described.

https://gist.github.com/jasonmoo/06691c8fea09b62aa35235fc93e...


IIRC Instagram released a plugin for Bandit (the OpenStack static analyzer for Python) that does this.


I thought the idea was interesting so here is a little PoC in Ruby:

    require 'facets'  # provides Array#entropy
    # Treat each line of each matching source file as an array of chars,
    # then print the ten highest-entropy lines.
    lines = Dir['**/*.rb', '**/*.py', '**/*.cpp'].map { |f| File.read(f).lines.map(&:chars) }.inject(&:+)
    puts lines.sort_by(&:entropy).map(&:join).last(10).reverse
Using: http://www.rubydoc.info/github/rubyworks/facets/Array%3Aentr...


Is it easy to modify this script to run over all lines that have ever existed in the repository history?

For example, could you pipe the output of `git log -p --all` through this and filter out all the commit hashes somehow?


Yup, just use:

    lines = `git log -p --all`.lines.map(&:chars)
So I found that `git grep /.+/ $(git rev-list --all)` is a better way to get the content of all the files: https://gist.github.com/Dorian/e1514535c3c5036cf327ce61eb34a...

But actually a hex-number regexp might be far more accurate than the entropy (e.g. secrets are often long hex numbers); a rough sketch combining both ideas follows below.

I tried it and it yields interesting results: https://gist.github.com/28110f0b8105db11e8973d1d0be85259
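
Combining the two (grep over every revision plus a long-hex-string pattern), a rough Ruby sketch could look like this; the 32-character minimum is an arbitrary cut-off:

    # Flag long hex strings (a common shape for API keys, tokens and other
    # secrets) in every revision of the repository.
    revs = `git rev-list --all`.split
    hits = revs.flat_map do |rev|
      # `git grep <pattern> <rev>` prints "rev:path:matching line"
      `git grep -I -i -E "[0-9a-f]{32,}" #{rev}`.lines.map do |line|
        line.scrub.split(":", 3).last.strip
      end
    end
    puts hits.uniq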


I got the other end of that email today, saying that an account in my organization had inadvertently downloaded private repos from another customer when fetching from one of our own. Fortunately for GitHub/that user, it was almost certainly our automated provisioning system so we never had any idea and whatever it was never made it anywhere interesting.

The email was kind of funny though, part of it was effectively "if you have this data pretty please delete it without looking at it". I'm sure that's the best they can do, but it still made me chuckle.


Of the very small number of repos affected, there are now two of us reporting in this thread that it affected us :). And I had approximately the same response: well, I don't keep secrets in the repository, so it's not that big of a deal. I'd rather the source not get shared with the world, but shit happens and they owned up to it right away. If that source were valuable enough, I'd be hosting it on-prem with encrypted off-site backups.


Github takes security seriously; this disclosure post is proof of that.


This honest report is a good example of transparency.


One of the most striking things about this report is the scale that GitHub has now reached: the whole incident apparently lasted only 10 minutes, but during that time 17 million requests were sent to their git proxy.

It's obviously unfortunate in this case, since even a relatively small and quickly fixed bug affecting a tiny proportion of requests still had serious consequences.

However, it's a remarkable achievement (if also a little terrifying for the software development industry from a single-point-of-failure perspective).


I approve of the handling, but this just underscores why you want self-hosted instances.


Does it? Except for very sophisticated organizations, I doubt it.

You don't hear about intrusions into self-hosted source repositories. Not because there are fewer, but because they likely don't have the security infrastructure in place to know that they ever happened.


Also, there is very little incentive for them to advertise that they've been compromised, whereas Github has a duty to disclose to their clients that they've been compromised.


> You don't hear about intrusions into self-hosted source repositories. Not because there are fewer,

[Edit: Multiple downvotes within moments of each other do not make calling out the above speculation any less justified.]


If you have private repositories that must never become public, no matter what (which isn't true for most users of private repositories), then:

Given that git provides many transports and ways to push commits around, I agree. If you have to be safe, there's no reason to use github or a self-hosted git collaboration service (gitlab, etc) on a publicly accessible server, regardless of access control measures. If you really need to have sources on a remote machine, you can limit the potential damage by only sharing archives of a certain revision, without history.

I know many will dismiss it, but if you're serious about the repositories being private, then the most accessible you make them is an on-premises gitlab instance: local to the company network, not reachable from the public network, and, if you want, only reachable after dialing into a VPN first. Then, to be safe, you null-route everything but the VPN traffic on the connected off-premises developer machines.

Access keys get stolen, just like SSH keys are, so you need to use a VPN service that requires additional security like the use of OTP key generators or similar measures.

This probably sounds like a hassle in the day and age of people just going for the comfort of private github or gitlab repositories, but it's what companies have been doing for almost 20 years as standard practice.

You cannot consider any git repository on a publicly reachable server, even your own root server, a safe place to keep private code. CIOs would argue against that practice for good reason. The same CIOs require work laptops to encrypt all data.

If you don't need to be that serious, then an incident like this should be planned and accounted for as part of using such hosting, and shouldn't be a big deal.


We've had a couple of VPS provider admin panel compromises, Shellshock and other showstopper remote vulnerabilities for privately administered servers that don't get constant professional attention, etc. It's possible, but not obvious.


Don't you mean on-premise?


Datacentres have outstanding track records, and if you secure your box correctly there are few ways to compromise it. On-premise will either be incredibly costly or missing key protections or infrastructure.

Self-hosted git (through the many installable git servers or raw git) running on a correctly sized box is almost certainly the way to go


This github vulnerability has nothing to do with an insecure box. It has to do with bad application logic, which can happen anywhere - self-hosted or not.


That is true. But, if I'm running my own instance, the probability is that it doesn't matter if someone else gets access via bad application logic. Everybody is probably employed by the same company.

It's a difference of degree: compromising my self-hosted or on-premise server means that somebody already in my employ has more access than they should. If I'm a small organization, that probably doesn't matter. If I'm a big organization, I probably have an IT staff to deal with this and the people involved are still "nominally" under my control.

The github mistake means that people completely unrelated to my repository can get access.


Well, the code can simply be made public by bad application logic, which is why I thought you were talking about on-premise, where the intranet seals off outsiders.


Leaking private repositories is one thing, but if you have a private build server that pulls and runs scripts, you could be in for a bad time even if it ended up pulling a random public repository, should that repository's build script be malicious... hmmm...


In retrospect of course it's always easy to criticise, but still, the diff is really cringeworthy.

The deleted code is very specific-looking. Nobody writes that just casually or out of ignorance. Also, it was what was in use in production.

It's very naive to just go and replace that with nice-looking, shorter code.

Key lessons:

- Understand what you are deleting

- Treat production code as sacred

- Add reasonably extensive comments for delicate code (as the original one). Git commit messages aren't enough.

- Try out infrastructure changes on production-like staging servers. I really doubt they did this properly, given that they say the "majority" of the 17M requests failed.


How did they become aware of the bug so quickly (<10 minutes)? Unless I'm missing something from the report, it doesn't say.


The bug triggered a flood of errors.

> The impact of this bug for most queries was a malformed response, which errored and caused a near immediate rollback.


we have a lot of internal tools that allow us to see quickly when things aren't behaving as expected after deploying


Interesting that they don't mention expanding the information being logged to make the multiple joins they had to do unnecessary or more deterministic.


Next step: set up a development system?!

Surely they do some end-to-end testing?


They state in the post that of the 17 million requests to their git-proxy server, only 230 could be identified as successful responses containing incorrect data/repos, or roughly 0.0013%.

I don't know of anyone who would recommend creating tests, even integration tests, that hammer a service to check whether something like a thousandth of one percent of requests returns invalid data. If anything, a script hammering a service that (in a dev or QA environment) probably has much less data in its database and file stores, and much less protection (like load balancing and caching) than it would in production, would generate more false positives than it would catch genuine data-disclosure regressions.


But the overwhelming majority of requests failed with errors. The happy-path was not tested either.


Normally perhaps not, but if you host other people's IP, a leak like this can have major economic consequences for the organisations which trusted you with the security of their code.

To me it looks like poor design. I would expect private repos to be hosted completely independently and in isolation, with more secure and thoroughly audited code on a longer release cycle (LTS?), after the code has been well tested on the public free repos.

It is not excessive if you consider the potential value of the private repos that github has control over. They already do something similar for the enterprise edition. It leaves a bad taste that smaller customers are not treated with similar caution.


Props to github for the disclosure. And congratulations to gitlab for probably getting a nice boost in on-premise support contracts :)


Github Enterprise is on-premise too.

I don't know that this would necessarily make you want to make both the change to self-hosting and the change of platform.


Because a private repository has one critical characteristic (staying private), and they failed to deliver on it. Moving on-prem doesn't fix that failure, it just mitigates the fallout.


It seems highly unlikely this commit made it into a GitHub Enterprise release.


We'll never know, which is a problem unto itself.


The only private repository is one you created on your own computer and didn't push to github.



