Very cool that they are able to change BGP advertisements from ChatOps, achieve convergence and mitigate the attack in all of 4 minutes, that is some insane engineering.
I had a similar reaction. I had to double check the timestamps when I first read them. That this was all handled so fast is extremely impressive to me.
Meh, or just block UDP to your networks that have no reason to run UDP. Every carrier will do upstream ACLs these days. 5 years ago that wasn't the case. These days, they all do. Some free, some charge.
re: chat ops vs a web page. It's just a single BGP advertisement -- big whoop. Chatops is just hipster famous right now.
A snarky reply like this comes up every time there's discussion of a DDOS, but it ignores the fact that there is some point that has to filter that UDP traffic, and if that point is saturated, the DDOS still worked. Mitigating attacks of this size isn't a firewall rule or a support ticket with your ISP.
Snark or not, the traffic is filtered upstream before your handoff. If you pick your carriers well, there's not a problem. Many carriers have turned upstream filters into a product. NTT's DPS Lite springs to mind.
This just comes down to experience and knowing how to build a network. I'd think that Github would have people knowing how to architect this. They've been through a few DDoS before.
Edit: It looks like Github uses NTT for traffic. Hello Github Netops person, you need to call your sales rep and turn on DPS Lite. It's like $100 per 10gig port and you get full ACLs. Telia, another one of your carriers, will do this too. At least they have for me. Level3 though? Lol, kick that sorry network to the curb.
Also, get another /22 allocation so you can at least separate out your DC-origin traffic from your customer traffic.
It's interesting that on HN, some experience doesn't get rewarded, just because there's some rather opinionated language. The majority of voters are barely exposed to these kinds of overwhelming attacks, leaving practical analysis to be buried unless it's got some big company name drop to legitimize it.
Maybe the experience isn't rewarded because of said opinionated language?
I mean someone can be right, but that doesn't give that person the right to be dismissive. Treating ignorance with disdain isn't going to make anyone smarter.
Am I old-fashioned to raise an eyebrow when I discover that Memcached servers are running visible to the public Internet? This strikes me as approximately as bizarre as having a database server that accepts connections from the public Internet.
In my day, such back-end services were either simply not connected to the Internet (connected via a private network to the application services), firewalled, or at the very least, configured to listen for and respond exclusively to connections from known front-end or application services.
Is this sort of deployment architecture falling out of favor? My casual observation is that cloud architectures—at least the ones I've seen employed by small organizations—are more comfortable than I am with services running with public IPs. What is going on? Am I misunderstanding this in some way?
No, it's not out of favor. There are a lot of unqualified people out there pushing buttons on cloud providers dashboards and not caring about security (or not even understanding that it's an issue) though.
When it's easier to just open up a server to the wide world than it is to learn how to connect safely, you'll always get a lot of people doing it.
It's simpler to just click services on AWS and get a public IP to connect to. Drop-policy firewalls like AWS security groups are hard to configure and debug. Managing network interfaces and binding to specific interfaces instead of others is hard and causes hanging connections.
Those are the excuses I dealt with when I took over the current IT department. By now, only haproxy accepts public connections. Everything else is firewalled to the office at most.
I wonder if it's time for providers like Amazon to provide configs by default that block all ports besides TCP 22, 80 and 443. You want to do other stuff? Configure a firewall. Don't know how? Hire somebody who does. This scenario with cheap insecure things being put out on the internet repeats again and again. IoT, PaaS, etc.
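To make the "block everything besides TCP 22, 80 and 443" idea concrete, here is a rough sketch of a default-deny check. The names `DEFAULT_ALLOWED_TCP` and `is_allowed` are made up for illustration; this is not how any provider's security groups are actually implemented.

```python
# Hypothetical default policy: only these inbound TCP ports are open.
DEFAULT_ALLOWED_TCP = frozenset({22, 80, 443})


def is_allowed(protocol: str, port: int, extra_allowed=frozenset()) -> bool:
    """Default-deny check: a packet passes only if it matches an allow rule.

    extra_allowed holds (protocol, port) pairs the operator opened on purpose.
    """
    rule = (protocol.lower(), port)
    if rule in extra_allowed:
        return True
    return rule[0] == "tcp" and port in DEFAULT_ALLOWED_TCP


# memcached's UDP port stays closed unless somebody explicitly opens it.
print(is_allowed("tcp", 443))    # True
print(is_allowed("udp", 11211))  # False
```

The point of the design is that exposing something like UDP 11211 takes a deliberate extra rule instead of happening by accident.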
It's interesting you say this, as that's pretty much exactly how Lightsail (Amazon's easy-mode VM thing) works by default. Public IP, ports 22 and 80 open. I'm guessing for a good chunk of users, that default config is all they need.
This is the entire Internet we're talking about, of course there will be a few misconfigured servers. It's more surprising that there are only a thousand.
> firewalled, or at the very least, configured to listen for and respond exclusively to connections from known front-end or application services.
Combine this with staying on top of vulnerabilities, and this is really all you can hope for from a host standpoint. What is changing are the days of perimeter defense. The Zero Trust model is really the best path forward, and the only way to implement security in relation to the IoT.[1][2]
Google's "Beyond Corp" initiative [1] discourages trusted networks and VPNs in favor of secure services on public networks. By trusting the network to provide a level of security, you are more likely to be vulnerable to escalation attacks by bad actors that are able to access your private networks. You're also more likely to encourage legitimate users to set up workarounds that result in secure network breaches. Typically they use an Identity aware proxy in front of the service, but services can have a public view as well.
To answer your second question, I work for an open source non-profit software company, and we run some of our Jenkins servers, which do continuous integration builds, publicly available so that community contributors and users can see build failures. Google has a number of open source projects that probably have similar goals.
Many open-source applications (especially Java-based) use a public-facing Jenkins server for running and distributing nightly and PR builds. Nowadays, this is usually handled by hosted CI (Travis or GitLab), but there are still some who prefer good old Jenkins.
This is a great example of why it's important to pick secure defaults when writing software, especially software that is often deployed on high bandwidth servers or cloud instances. If no listening interfaces are specified then the default should be to exit with an error, not listen on everything!
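A minimal sketch of what "exit with an error instead of listening on everything" could look like in a server's startup code. The function and exception names are hypothetical and not taken from memcached or any real project.

```python
import ipaddress


class NoListenAddressError(ValueError):
    """Raised when the operator did not say where to listen."""


def resolve_listen_addresses(configured):
    """Return the addresses to bind to, refusing to guess when none are given."""
    if not configured:
        raise NoListenAddressError(
            "no listen address configured; refusing to bind to all interfaces"
        )
    # ip_address() raises ValueError on malformed input, which also stops startup.
    return [ipaddress.ip_address(addr) for addr in configured]


def binds_everywhere(addresses) -> bool:
    """True if any address is a wildcard such as 0.0.0.0 or ::."""
    return any(addr.is_unspecified for addr in addresses)
```

With this shape, binding to `0.0.0.0` is still possible, but only when someone writes it down explicitly, and `binds_everywhere` gives the startup code a place to log a loud warning.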
I also wonder if you can store something in a memcached cache that looks like a valid request, then reflect that with the source IP of another memcached server and let them burn each other out...
And it's going to take a while for the new version to propagate to a released version, then to distributions, then to customer images and scripts, etc.
Is there any legitimate reason to spoof a source IP? I don't think there is. Why don't ISPs block any traffic with a source IP that isn't in their network? And then the rest of us could block any ISPs that don't do that.
1. It may be difficult/expensive to arrange for the correct set of source subnets to be available at the points where filtering needs to be done. Motivation to perform egress filtering fails to overcome this cost threshold.
2. Fear that some customers are actually (probably without realizing) relying on alien source address traffic being routed. Therefore filtering that traffic would result in unhappy customers and support workload.
In our network over the years I've come across several instances where it turned out we were (erroneously) relying on one of our upstream providers routing traffic with source IP from another provider's network. Since policy-based source IP selection on outbound traffic is quite tricky to setup and get right, I can imagine that ISPs would take the easy way out and just pass the traffic.
That sounds like a negative externality that ISPs get to be lazy about and save money on by shoving the burden onto the cloudflares of the world. It’s really hard to dispose of hazardous waste when manufacturing things, but we force manufacturers to pay for the negative externalities. We should probably start thinking about the internet in the same way we think about the environment.
Spoofing is in the eye of the beholder. A router first and foremost routes packets toward the right destination; there is no such thing as a "spoofed source IP" without context. Policy about what traffic is allowed to come from what pipe is always error-prone and increases complexity.
If I understand the article's point, carriers pay for the egress traffic that causes DDoSes, and that cost plus the cost of the generated ill will outweighs the cost of filtering, whose price has fallen and continues to fall.
Personally, I think that if the article author is correct, then I wonder if this is one of those high-level long-term decisions that companies appear absolutely incapable of making. (In my experience, short-term gains are way overvalued at the cost of long-term loss, generally, especially when it is hard to directly determine the costs/benefits involved.)
Let's rephrase the question - Is there any reason consumer ISPs don't follow BCP38?
There is almost no reason whatsoever for clients to spoof their public IP address. Obviously, there are reasons to SNAT at the carrier level for load balance or routing purposes.
No good reason except it's for the health of the Internet.
And it doesn't cost any significant amount of money except initial configuration and automation. The "CPU power" to add an ACL on interfaces is negligible.
Please define better. If (say) an IPFS item is cached by 3-5 nodes each of which is on a private home internet connection it is trivial to DDOS each of those and thus effectively censor the content. On the other hand, it is decidedly nontrivial to DDOS Cloudflare.
Now that we are moving away from net neutrality, can we not get ISPs to do DDOS protection so that we don't need specialised services like Cloudflare to be layered on top of simple sites?
There are, fundamentally, two different kinds of attacks:
- Volumetric attacks like this one, mostly reflection.
- Application level attacks like SYN floods or protocol-specific attacks.
Defending against both costs a LOT of money.
Volumetric attacks are dealt with at the network edge using rate limits and router ACLs. They're really easy to identify and block, but the point is that you need more bandwidth than the attacker in order to successfully do so. With attacks in the terabits-per-second range, this gets expensive.
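For readers who haven't seen one, an edge rate limit is often modeled as a token bucket. The sketch below is a toy software version with made-up names and numbers; real routers do this in hardware, and it only helps if the link in front of the limiter is not already saturated.

```python
class TokenBucket:
    """Toy token-bucket limiter: refills at `rate` bytes/s up to `burst` bytes."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes  # start with a full bucket
        self.last = 0.0            # timestamp of the previous packet, in seconds

    def allow(self, packet_bytes: int, now: float) -> bool:
        """Return True if the packet fits the budget at time `now`."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False  # over budget: drop the packet


# Hypothetical limit for one UDP service: 1000 bytes/s with a 1500-byte burst.
bucket = TokenBucket(rate_bytes_per_s=1000, burst_bytes=1500)
print(bucket.allow(1400, now=0.0))  # True, the burst allowance covers it
print(bucket.allow(1400, now=0.0))  # False, only 100 bytes of budget remain
```

Passing `now` in explicitly keeps the example deterministic; a real implementation would read a monotonic clock.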
Application-level attacks are harder to execute since there's no amplification and you need more bandwidth to pull it off, but they're much harder to block, too. They exhaust the server software's capacity by mimicking a real client. Common examples are SYN or HTTP floods.
When you get hit by a DDoS attack, you have two choices:
- Filter the attack and block the offending traffic without affecting legitimate requests. This is hard, and most companies can't do this. They need to have someone like Akamai on retainer and dynamically reroute traffic like GitHub did.
- Declare bankruptcy and announce a blackhole route to your upstream providers (taking down the host in question, but protecting the rest of your network).
When you host custom applications that can't be scaled out or cached, DDoS mitigation is especially hard since you cannot just throw more servers at it like CloudFlare does.
Most services we host use proprietary binary UDP protocols, which is unfortunate, since UDP is easy to spoof and even experienced DDoS mitigation companies have trouble filtering it. Our customers get hit by DDoS attacks 24/7, so blackholing is not an option.
We had to build our own line-rate filtering appliances in order to handle the ever-increasing number of application-level DDoS attacks, by reverse engineering the binary protocols and building custom filtering and flow tracking.
All of this costs a huge amount of money, and most ISPs simply lack the resources to do this.
Happy to answer questions, but I'm going home right now, so it may take a few hours :-)
(Nitrado is a leading hosting provider specializing in online gaming, both for businesses/studios and regular customers, so we're dealing with DDoS attacks on a regular basis. We got hit with the same memcached attacks as GitHub and CloudFlare, and it was the largest attack in our company history. Ping me if you want to talk.)
ISPs absolutely could, but having worked near this space previously, it really isn't as easy as it sounds, both the detection and the mitigation, and ISPs are not particularly equipped to handle it themselves right now. There's a lot of money to be made there, though.
Being naive here, wouldn't a massive help be to not focus on detection of DoS/DDoS attacks but instead to focus on validating that IP addresses come from within the range of addresses being served by the ISP?
It strikes me that this would prevent a massive number of amplification attacks.
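The check being described is essentially BCP38-style source address validation. A rough sketch of the logic, using documentation-only prefixes as stand-ins for an ISP's customer ranges (the prefixes and function names are assumptions for illustration; real deployments do this in router ACLs or with uRPF, not in Python):

```python
import ipaddress


def build_source_filter(customer_prefixes):
    """Return a function that accepts only sources inside the given prefixes."""
    networks = [ipaddress.ip_network(p) for p in customer_prefixes]

    def permits(source_ip: str) -> bool:
        addr = ipaddress.ip_address(source_ip)
        return any(addr in net for net in networks)

    return permits


# Hypothetical customer ranges served by this ISP edge.
permits = build_source_filter(["192.0.2.0/24", "198.51.100.0/24"])

print(permits("192.0.2.10"))   # True, a legitimate customer source
print(permits("203.0.113.5"))  # False, a source this edge should never emit
```

A packet forged to carry the victim's address fails the second check and would be dropped before it ever reaches a reflector.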
Lol, CloudFlare is what's breaking the web if you need to have a stupidly complicated JavaScript engine enabled and accessible to a webpage you don't trust (and can't trust) to be able to access the said webpage.
Based on how it's done, you can't check first if the page hidden behind Cloudflare is something you'd want to enable JavaScript for, because Cloudflare will not let you see the HTML code of the page without enabling JavaScript for it first.
Well, that's too bad. Bad actors ruined it for you.
We make things more annoying for VPN traffic because it's 99% bad actors. Every time someone is up to no good on our services, they're behind Tor/VPN.
It's simple cost/benefit analysis. If you think a business should bend to every single whim someone might have, then you haven't built much of one.
Making someone run Javascript so they can click on a captcha? Worth the loss of a few pennies because someone's angry about it on HN.
You need to conveniently ignore why people use Cloudflare to say that Cloudflare is breaking the internet. Ideally, nobody would have to use it, but that isn't reality.
This happens over a normal cable internet connection (as someone mentioned below, it is probably when a website is in IUAM) and it is not related to captchas.
Cloudflare provides some page with JS code that computes something and then proceeds to the correct page if it is computed correctly.
So I can either enable JS both for the cloudflare interstitial page and for the target website, or I'm not able to access the website.
I'm not angry, I just close the website/go back to search results. I will not allow the browser to run random JS code from the target website for no reason, just because the interstitial page requires it.
Still, if someone requires computing random challenges in JavaScript in order to gain access to a web page, they're breaking the web. JavaScript is still an optional addon.
"... you need to have a stupidly complicated JavaScript engine enabled and accessible to a webpage..."
Does anyone have an example of this webpage?
Unless I am engaging in e-commerce, I do not run a browser JavaScript engine. I rarely if ever encounter a webpage that truly "requires" one. GitHub certainly does not require JavaScript for me to use it via www.
"We've also designed the new checks to not block search engine crawlers, your existing whitelists, and other pre-vetted traffic. As a result, enabling I'm Under Attack Mode will not negatively impact your SEO or known legitimate visitors."
"What's also cool is that data on attack traffic that doesn't pass the automatic checks is fed back into CloudFlare's system to further enhance our traditional protections."
"[P]re-vetted traffic"?
Does this mean they are whitelisting certain IP addresses?
GoogleBot can make hundreds of requests and double digit parallel connections, as frequently as they like, but a single user making one request and one connection is blocked because they are not enabling Javascript?
This does not sound like an intelligent filter.
"[K]nown legitimate visitors"?
What exactly does this mean? How do they "know" a visitor is "legitimate"?
"[A]ttack traffic that doesn't pass the automatic checks..."
Is it possible that non-attack traffic could fail the checks?
What about a single request from a single IP that does not pass the checks because the user does not have JavaScript enabled?
Does the IP address end up on some blacklist?
I have seen Cloudflare reject connections based on certain user agent strings, a header that everyone knows is user-configurable, arbitrary and not a reliable indicator of anything meaningful.
This despite volumes of "legitimate" traffic from same source preceding it. Pick wrong user agent string and suddenly the source becomes "illegitimate".
It would be interesting to know what "checks" the Javascript in question is performing.
I think Cloudflare just needs to reject some percentage of all connections to reduce load on the website. The algorithm to decide which to accept/reject is meaningless as long as they hit the required reject percentage.
I don't believe you need to enable these options in Cloudflare; they are optional. I believe it may be different over Tor, where you have to do a captcha. Certainly a couple of years ago, I didn't have any of these convenience features enabled, as they were optional.
According to [0] that is around 1/400th of total internet traffic per second. This raises the question: who has that kind of botnet at their disposal and why are they targeting Github?
Edit: The attacker didn't need nearly that kind of bandwidth to execute this attack. See [1]
I assume this explains the apparent hugeness of the attack:
> The vulnerability via misconfiguration described in the post is somewhat unique amongst that class of attacks because the amplification factor is up to 51,000, meaning that for each byte sent by the attacker, up to 51KB is sent toward the target.
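Some back-of-the-envelope arithmetic on that amplification factor. The 51,000 figure is the upper bound from the quote; the 1 Tbps target is a hypothetical round number picked for illustration, not a measurement from this incident.

```python
def attacker_bandwidth_bps(target_bps: float, amplification: float) -> float:
    """Bandwidth the attacker must send so reflectors emit target_bps in total."""
    if amplification <= 0:
        raise ValueError("amplification must be positive")
    return target_bps / amplification


AMPLIFICATION = 51_000  # best case for the attacker, per the quote above
TARGET_BPS = 1e12       # hypothetical 1 Tbps flood

needed = attacker_bandwidth_bps(TARGET_BPS, AMPLIFICATION)
print(f"{needed / 1e6:.1f} Mbps")  # 19.6 Mbps
```

So in the best case for the attacker, a terabit-scale flood needs only tens of megabits per second of spoofed requests. Real-world factors are presumably lower, since not every reflector returns the maximum response.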
During some analysis we did notice that at least some cloud providers give instances public IPs (with no network-level ACLs) by default, and some Linux distributions have memcached listening for UDP traffic and binding to `0.0.0.0` as soon as it's installed. The unfortunate combination of these results in the machine being vulnerable to being used as an amplification vector in these attacks.
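If you want to check your own machines for that combination, here is a rough audit sketch. It only looks at memcached's `-l` (listen address) and `-U` (UDP port, `0` disables UDP) command-line options. Everything else is an assumption: the helper names are made up, the defaults are parameters because they differ between versions and distribution packages, and forms such as `addr:port` are not handled and are conservatively treated as public.

```python
import ipaddress


def _is_loopback(host: str) -> bool:
    """True for localhost or a loopback IP; unknown names count as public."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False


def audit_memcached_args(args, udp_on_by_default=True):
    """Return a list of findings for a memcached argument list."""
    listen_hosts = []  # empty means -l was not given (assumed: all interfaces)
    udp_port = None    # None means -U was not given
    i = 0
    while i < len(args):
        if args[i] == "-l" and i + 1 < len(args):
            listen_hosts = args[i + 1].split(",")
            i += 2
        elif args[i] == "-U" and i + 1 < len(args):
            udp_port = int(args[i + 1])
            i += 2
        else:
            i += 1

    udp_enabled = udp_on_by_default if udp_port is None else udp_port != 0
    public = not listen_hosts or not all(_is_loopback(h) for h in listen_hosts)

    findings = []
    if public:
        findings.append("listens on a non-loopback address")
    if udp_enabled:
        findings.append("UDP is enabled")
    if public and udp_enabled:
        findings.append("usable as a UDP reflector unless firewalled")
    return findings


print(audit_memcached_args(["-m", "64"]))                    # three findings
print(audit_memcached_args(["-l", "127.0.0.1", "-U", "0"]))  # []
```

An empty list is not proof of safety; it just means this particular combination is absent from the arguments.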
This is one area where I really disagree with the default behavior of Debian and its derivatives: starting a service immediately after installation and not having a firewall enabled by default.
If I install a service on CentOS/RHEL/Fedora it is disabled by default, if I start the service firewalld will block traffic until I have explicitly enabled a rule to allow it (or explicitly stopped and disabled the firewalld service).
Does this prevent people from making poor decisions, like just blindly starting the service without reading the configuration file, or disabling firewalld/enabling a rule without checking the configuration first? No, it doesn't - but that small hurdle at least prevents people from inadvertently turning on a service and opening it up to the world just by installing a package.
Thankfully that is one thing they do pretty well in most cases, though there are some (apache2, for example) that do listen on all interfaces by default. While even services like Apache may have secure configurations by default, it can often be installed by other programs that link or copy their own configuration files into the apache config directory - and then all you need is a PHP vulnerability or whatever.
Did you receive any cooperation from those cloud providers in using ACLs at their network edge to drop that traffic (that they should’ve been blocking in the first place)?
> Who is not filtering outbound UDP traffic from their memcached instances?
This is of course the wrong way to do it -- you need to filter inbound UDP to your memcached instances so you don't waste your resources generating the responses, and also so you don't accidentally fragment the responses and only drop the first fragment outbound.
I disagree, due to separation of responsibilities. Having run both an ISP and a hosting company, you have to filter traffic at your edge that can impact external resources (just as ISPs block outbound NetBios and SMTP traffic on port 25/tcp).
Yes, the server or instance customer should be doing this. But they’re not, because poor security practices are an externality, not a cost they sustain.
Security is more important than developer velocity, but users pay the bills.
You're confused if you think it's the clouds that are misconfigured here. The issue is the ISPs allowing the spoofed traffic going towards the memcached servers.
So the problem here is that a number of UDP packets were sent from somewhere (with a small bandwidth) that had a spoofed source address. They were then sent to the reflection servers which produced more/bigger UDP packets that did not have a spoofed source address.
So the attacker only needs to find somewhere on the internet that is capable of generating spoofed packets. They needed a lot of places that had a reflection server, but the requirements for the spoofing was much smaller.
In other words, you would have to prevent 99.9% of the internet from being able to spoof source addresses before you fixed this problem.
And taking down UDP services is just as much folly. It only takes 1,000 servers with 100 Mbps upload streams to wipe out any single load balancer. There are at least that many root name servers.
I see. For anyone else who doesn't have any background in this attack: memcached is an open source general purpose cache that uses sockets to cache data. From what I gather, the attack here was possible because Github engineers accidentally left the memcached port open. So the attackers were able to spam memcached with large requests, and memcached responds immediately with the full contents of the cached memory (assuming, of course, that the client is localhost).
> The memcache protocol was never meant to be exposed to the Internet, but there are currently more than 50,000 known vulnerable systems exposed at the time of this writing. By default, memcached listens on localhost on TCP and UDP port 11211 on most versions of Linux, but in some distributions it is configured to listen to this port on all interfaces by default.
> From what I gather, the attack here was possible because Github engineers accidentally left the memcached port open.
That is incorrect.
The attackers made requests that were forged to have the sender IP address of Github to multiple public memcached instances. Memcached then responds back to Github instead of the attacker.
This is documented in more detail in the Cloudflare vulnerability report[0]
I don't think it was GitHub's memcached instances. It was other public instances that, when sent spoofed network requests, ended up sending traffic back towards GitHub's network.
From what I understand the attack originates from publicly exposed memcached servers configured to support udp and that have no authentication requirements:
- put a large object in a key
- construct a memcached "get" request for that key
- forge the IP address of the udp request to point to that of the target/victim server
- memcached sends the large object to the target/victim
Multiply by thousands of exposed memcached servers.
Yes and there are a lot of attacks of very, very large sizes going on. Over the last few days we've mitigated some huge attacks. Luckily, everyone is working together to rate limit and clean up this problem.
Side observation: kudos to Sam Kottler for level-headed acknowledgement of the business impact of an incident like this to Github’s clientele, and appearing to own it. Well done, sir.
These attacks are often described as denial of service attacks, but I wonder if many of them aren't employed as cover for an intrusion attempt. Is it possible that intrusive traffic could be mixed in with such an attack?
A DoS attack is, by literal definition, an attempt to overwhelm a host until it is forced to _deny service_ to valid user requests. Are there intrusion techniques that both bring down the server and break into it at the same time? I'm not a security expert, but that doesn't seem like it makes a whole lot of sense to me.
Maybe in the milliseconds between packet swarms, or immediately before or after? Just seems like a lot of resources to pour into an attack that was defeated in a few minutes. To what end?
There's validity to the approach of sending packet swarms to cover intrusion attempts but the traffic levels were more than a small amount of cover. It is possible that it was designed as a smokescreen and someone's calculations were wildly incorrect. No one's safe from off by one errors :>
What does an incident like this cost to Github in terms of the extra capacity added? I guess the potential loss of business is way higher, but still very curious about the magnitude.