I've done some googling before asking here: can anybody explain why Linode is so often targeted like this? We moved Cronitor off Linode in spring 2015. During the Christmas holiday when they suffered a two-week DDoS, I thought of the family time I would have been missing that year doing a crash migration to AWS, had we not migrated when we did. I have to imagine this has been horrible for their business.
I would use Linode if I needed to lease computational power, because it is still a great value vs AWS, but I could not run a high availability service there. It would feel like professional malpractice at this point.
I am also surprised, as this is not the first time I am reading about it on HN. Linode seems to be highly vulnerable to certain attacks, as we have seen in the past. I hope they will fix it and provide a permanent solution, as I was hoping to use them as part of my network, but I see more and more signals that they can't handle serious traffic. Hopefully they will redesign their infrastructure to handle it.
I am with all of you who are affected by this. I am looking forward to them resolving this soon.
I'd be ok with paying twice what I'm currently paying if they could solve this problem. We're with Linode because it would cost us about 10x as much to run on AWS, and we can't justify that.
My understanding is that Atlanta in particular has some poor upstreams, which is making our job pretty difficult there. Notice that it's almost always Atlanta that's getting hit. I would suggest just using other datacenters or making sure your high availability model spans several DCs (which I would suggest at any hosting provider, really).
As others have said, maybe it's to affect a customer. However, the hosting industry has a lot of roots in the adult content industry, and those guys didn't have many ethical guidelines. It was not uncommon for hosting companies to DDoS each other to drive a competitor out of business. Not saying that is the case here, but perhaps?
I've been wondering about this lately. Is it really feasible for a small (one man?) team to keep master-master MySQL replication over WAN running smoothly?
If you want things to work smoothly, dual-master, single-active is the way to go.
If you use MySQL's read_only flag and application users don't have the SUPER privilege, you can easily prevent writes to the wrong server: set read_only = 1 in my.cnf and manually set it to 0 on exactly one of the masters. Use the read_only flag to drive automation for which server to send writes to.
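A minimal sketch of the moving parts, assuming the standard read_only variable (nothing here is specific to any one setup):

    -- both masters start with read_only = 1 from my.cnf; make exactly one writable:
    SET GLOBAL read_only = 0;        -- run on the active master only (needs SUPER)

    -- application users without SUPER are rejected on the passive side, and
    -- failover scripts / health checks can poll this to pick the write target:
    SELECT @@global.read_only;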
Manual failover is: set the old server read_only, kill existing connections (the read_only state is cached per connection), wait for replication to catch up, then set read_only = 0 on the new server. You can make a script to do this with one button, but I wouldn't recommend making it autonomous: flapping between servers is disruptive and could lead to data inconsistency if you switch while replication is behind, and data inconsistency is usually way worse than write downtime until someone logs in to flip the switch.
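Roughly, the one-button script boils down to something like this (a sketch, not a drop-in script):

    -- 1. on the old active master: stop accepting new writes
    SET GLOBAL read_only = 1;
    -- 2. kill existing app connections (their read_only state was cached at connect):
    --    SHOW PROCESSLIST; then KILL <id> for each application thread
    -- 3. on the new master: wait until replication has fully caught up
    SHOW SLAVE STATUS;               -- Seconds_Behind_Master should be 0
    -- 4. then open the new master for writes
    SET GLOBAL read_only = 0;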
Try to have half your slaves off each master, so if a master is down, you still have 50% capacity. (I saw some patches from Google a while ago to keep binary logs in sync between masters and make switching masters easy; if that's available, you may be able to have slaves just follow the current active master.)
If you have budget for it, an extra slave off each master can be helpful: you can cron them to shut down MySQL, tar up the data directory, and restart. If you untar that on a new slave, it'll continue replicating from that point in time. If you rotate out the backups, you also have some ability to restore data from the past if there is a bad update.
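Something like this cron job is all it takes (paths and service names are placeholders; adjust for your distro):

    #!/bin/sh
    # stop MySQL on the backup slave, tar the data directory, restart
    BACKUP_DIR=/backups/mysql
    STAMP=$(date +%Y%m%d)

    service mysql stop
    tar -czf "$BACKUP_DIR/mysql-$STAMP.tar.gz" -C /var/lib mysql
    service mysql start              # the slave resumes replicating where it left off

    # keep a week of rotated backups for point-in-time restores after a bad update
    find "$BACKUP_DIR" -name 'mysql-*.tar.gz' -mtime +7 -delete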
I'm a one-man operation keeping a master-slave setup with manual failover, and it's been pretty smooth sailing once I got it set up. I don't know how much more complex master-master would be.
Same here, and as long as you understand how MySQL replication works, it's not too much effort to deal with. Performing the initial sync without downtime is a bit tricky, but it can be done with a well-designed database and some thought. Basically you need to at least temporarily make the bulk of your data read-only, so that you can do most of the data transfer while things are running, and then only briefly lock tables on the source server for long enough to copy the stuff that has changed since the dump and grab the binlog position. Then you copy that stuff over to the slave as well, update the slave to the correct position, and start the slave.
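If the bulk of the data is InnoDB, a rough shortcut for the same idea is a consistent dump that records the binlog position for you (this is a generic sketch, not necessarily what I did):

    # dump without long locks (InnoDB only) and embed the master coordinates:
    mysqldump --single-transaction --master-data=2 --all-databases > seed.sql
    # load seed.sql on the slave, read the binlog file/position from the comment
    # near the top of the dump, then:
    #   CHANGE MASTER TO MASTER_LOG_FILE='...', MASTER_LOG_POS=...;
    #   START SLAVE;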
That's master/slave, but to get master/master, all you need to do is start a slave on the original master and point it at the current master position on the original slave (which should be static, since it isn't yet accepting any queries directly). There are posts out there that walk through this.
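The second half is just one CHANGE MASTER TO run on the original master; something like this, with hostnames and coordinates as placeholders (the coordinates come from SHOW MASTER STATUS on the original slave):

    CHANGE MASTER TO
      MASTER_HOST='original-slave.example.com',   -- placeholder
      MASTER_USER='repl',
      MASTER_PASSWORD='...',
      MASTER_LOG_FILE='mysql-bin.000042',         -- from SHOW MASTER STATUS on the slave
      MASTER_LOG_POS=120;
    START SLAVE;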
Once it's running, as long as you're not running auto-increment inserts or other queries that can conflict on both servers at the same time without taking appropriate precautions, it should chug away without any intervention.
If something does go wrong, you can often figure it out by looking at the slave status, fixing the inconsistency manually, skipping the bad query, and then starting the slave. If not, you can always just re-synchronize from scratch. Or even better, run your databases off of an LVM volume and take regular snapshots (i.e. snapshot, make a tarball of the snapshot of the MySQL directory, then remove the snapshot). That will give you a consistent backup, even with the server running. On an SSD, the temporary added latency probably won't be noticed, especially if done off-peak. Then if anything goes wrong, you can restore from the snapshot, and it should catch back up to the master from the snapshot's position automatically (as long as your expire_logs_days setting in my.cnf is longer than the time since the snapshot was taken).
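The snapshot cycle is short; a sketch, assuming an LVM layout like /dev/vg0/mysql (names, sizes, and paths are made up, and mount options depend on the filesystem):

    # create a snapshot, tar it up, then drop the snapshot
    lvcreate --snapshot --size 5G --name mysql-snap /dev/vg0/mysql
    mount -o ro /dev/vg0/mysql-snap /mnt/mysql-snap
    tar -czf /backups/mysql-$(date +%Y%m%d).tar.gz -C /mnt/mysql-snap .
    umount /mnt/mysql-snap
    lvremove -f /dev/vg0/mysql-snap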
No need to take production offline when using the Percona XtraBackup tool to set up the slave. It's super easy to use, and I've done it multiple times on databases in the hundreds of gigabytes.
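For reference, the basic flow looks roughly like this (a sketch; paths are placeholders, and details vary between XtraBackup versions):

    xtrabackup --backup  --target-dir=/backups/base   # copy files while MySQL keeps running
    xtrabackup --prepare --target-dir=/backups/base   # apply the logs so the copy is consistent
    # copy /backups/base into the new slave's datadir, fix ownership, start mysqld,
    # then CHANGE MASTER TO the coordinates recorded in xtrabackup_binlog_info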
That does look like a great tool! If we were all-InnoDB I'd probably try switching over to it instead of LVM snapshot-based backups right now (mostly because it can do incremental backups). We have a bunch of large MyISAM tables though (MyISAM used because the tables are read-only, so read speed is the only real consideration), so those would have to be handled separately. I could always xtrabackup all the InnoDB stuff and then just file-copy the MyISAM tables separately, since I know they won't be changing.
It's pretty much the same. You almost never want writes on both sides (not in a failover plan, anyway), so as long as you have a switch for which side receives the writes, it's simple.
Or allow writes in both datacenters with randomized tokens as keys. If you need datacenter affinity for certain events, use one of the token bytes to encode the author datacenter. Updates that don't have to land in order can be written in an eventually consistent manner. Write a feed of changes in each datacenter and have the peers consume this update feed. Voilà, partition-tolerant master-master with failover.
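As a toy example of the token scheme (the datacenter-to-byte mapping here is made up):

    # first byte of the key encodes the datacenter that accepted the write
    DC_BYTE=01                                   # e.g. 01 = Atlanta, 02 = Dallas
    TOKEN="${DC_BYTE}$(openssl rand -hex 15)"    # 16 bytes total, hex-encoded
    echo "$TOKEN"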
This is necessary but not sufficient to prevent issues. Sure, it will prevent auto-increment key collisions, but unless you're using strict sessions everywhere you can run into other key consistency problems. For example, if one master deletes a row while another updates it, you'll end up with a missing key and stopped replication on the first one (it depends on the replication mode as well).
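For context, what's presumably being referred to here is the standard interleaved auto-increment configuration, which makes the two masters hand out non-overlapping IDs:

    # my.cnf on master A
    auto_increment_increment = 2
    auto_increment_offset    = 1

    # my.cnf on master B
    auto_increment_increment = 2
    auto_increment_offset    = 2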
Doesn't this work only for an append-only structure?
If I'm updating existing records, and the MySQL master at Site A gets updated, then goes down before Site B is updated.. I've got an inconsistent setup.
I've been thinking about master-master MySQL replication recently, as we have a system that's duplicated and taken offline each summer (to run a summer camp), and I'm looking for a way to sync changes in it back to the main 'live' MySQL database.
I do the same with tree nodes; it works extremely well. Two active masters and one slave configured as a master, and conflicts are non-existent since I found out about this "trick".
We've moved off Vultr. Too much network downtime for no known reason, plus hard reboots on our servers causing loss of data. A friend has had similar experiences. Their support was also poor.
We were sort of in the same boat as you last year. We were already prepping our failover and it was about 95% ready to go when the DDoS started over our Christmas break (and our senior guy who did all of our deploys was out of communication). December 23rd was a difficult 10-hour remote day, but we got things finished up and could relax afterwards.
What in that article makes you think that? I don't see it. They do say "we'll have to upgrade Xen nodes", but they don't mention the DDoS or link them.
The Xen update was a scheduled thing -- I got an email about it weeks ago (I had one Linode I hadn't moved to KVM), and it was already scheduled for this weekend.
That said, I don't disagree that the attackers might be trying to distract the team while they exploit that... though I don't see what it gets them compared to quietly exploiting the XEN issues before they were common knowledge on 9/8.
XSAs are released alarmingly frequently. I've always wondered if KVM is really more secure or if it just gets less scrutiny (certainly there's a class of qemu vulnerabilities that impact both hypervisors).
I've noted the AWS security bulletins[0] list nearly every Xen advisory with "AWS customers' data and instances are not affected by these issues, and there is no customer action required."
It would appear you'd need to go back quite a few months of being unpatched to find a genuine issue, unless something about Amazon's mitigations doesn't apply universally.
Amazon and other big cloud providers get Xen security fixes first; there was an HN discussion about it. So they've probably already implemented the fix.
I disagree with people saying these types of attacks can't be prevented if you switched hosts. I'm sure Google + Cloudflare[0] would keep your website online. AWS too, if you had the cash.
The amount of distributed traffic happening right now against linode would probably only represent a 5% increase in traffic to a popular Google product. At least you know they have the expertise. Nothing against the very smart and talented linode engineers, but the two companies are on very different levels of traffic engineering.
Attacks are rarely targeted to the hosting providers. They usually target a specific customer.
Google/AWS probably have 100 times the capacity (and redundancy and architecture reliability and failover and awesomeness) of Linode. That means that, first, they can't be taken down easily, and second, a DDoS is limited to a small subset of the infrastructure and doesn't bleed into every customer and service.
As for traditional hosting companies (OVH and the like): when you're being DDoSed, they'll null-route your IP space (i.e. they advertise your IPs as dont-exist-on-the-internet-anymore). The traffic is dropped while in transit on the internet because it can't go anywhere; it doesn't reach the hosting company anymore.
Note: being null-routed means your site and all your services are off the internet and thus effectively dead.
As for Cloudflare: they have many locations all around the world and they can absorb a lot of traffic, to the point that they themselves cannot be DDoSed. They have active monitoring and mitigation against common attacks and known malicious sources, which may stop the attack without you even knowing about it.
When you're under attack, you can block subnets/ASes/countries in the Cloudflare settings, or request a challenge/captcha from every visitor. Cloudflare will reject visitors (with or without challenging them) at their edge locations before any traffic can get to you. It is very effective in my experience.
Generally speaking, the only way to stop a DDoS is to stop it before it reaches your datacenters, so you need help from your ISP/provider/CDN.
Edit: the attack that took Linode down last Christmas was against Linode itself and not a specific customer. Part of the mitigation included Linode moving its critical services behind Cloudflare :D
> As for traditional hosting companies (OVH and the like): when you're being DDoSed, they'll null-route your IP space (i.e. they advertise your IPs as dont-exist-on-the-internet-anymore). The traffic is dropped while in transit on the internet because it can't go anywhere; it doesn't reach the hosting company anymore.
OVH hasn't been doing this for a while; they put a beefy DDoS protection setup in place for this exact reason - it was way too easy to take someone down for hours.
Online.net also includes protection (plus paid upgrades).
At least here in Europe, the big hosting providers are all switching to providing included protection for all their customers, at least for the traffic-intensive attacks which hurt everyone.
It's not hard to compare this to brush fires. They happen periodically, and only the big trees tend to survive them. Linode is getting pretty unlucky here, but I would imagine that all the small (and even medium-sized) hosting providers are going to succumb eventually. Is the end game just going to be Google vs. Amazon?
I'd love to see a network infrastructure and transport protocol that's more resistant to many (D)DoS attacks, because it seems like things will only worsen if it never becomes more difficult for people to attack others' servers online.
If application- and transmission-protocol-level DoS vectors are fixed, then you're left with just the raw "lots of traffic" volumetric attacks, which means your attacker has to have a lot of compromised hosts (or the right compromised hosts with lots of bandwidth). I'd say that's a reality that would be easier to handle, because you've raised the bar from anyone who can develop or use a script, deploy it to a few low-power systems, and exploit protocol shortcomings, to only those who have a bunch of higher-powered systems.
The smaller hosting companies may still very well go out of the game if the problem worsens, even if most DoS vectors do end up being mitigated. I don't know how I would respond to that at this moment, but hopefully it doesn't come to that. It's already tough to find a decent hosting company, in my experience.
I don't get the hate towards Linode here on Hacker News. I've been their client for a couple of years now and I find them an excellent VPS provider: excellent uptime and performance at a pretty good price. AWS has a few outages every year. Google just had one last week. Azure sucks balls. So why the hate? Is it because it competes with some Y Combinator startups?
which was caused by our own maintenance several weeks apart (the root cause description is really quite good).
I think the distinction people make implicitly is a 25 minute outage versus 8 hours. DDoS attacks suck, but they're just standard these days. As a customer though, any source of (network) outage usually has the same outcome: "my site is unreachable (and I don't care why)".
The reason we (and AWS and others) offer multiple datacenters/zones within a <1ms boundary (a compute "region") is so you can build a highly available app that can fail over with minimal degradation. For customers that were using the App Engine flexible environment with regional spreading turned on, only some of their instances were affected, but their apps shouldn't have skipped a beat.
Linode is good at what they do, but any customer in Atlanta just had to wait this entire event out.
I don't follow this too closely, so this is just wild speculation from me:
But could it simply be the severity of the attacks? I keep seeing comments about a two-week DDoS attack last Christmas - that's something I would be shocked to see Google/AWS succumb to. Not that Google/AWS attacks don't happen, I just can't imagine them being down for ~2 weeks.
(I imagine it was just one datacenter from Linode, not the entire service, fwiw)
They weren't down for the entire two weeks, but various datacenters went up and down for hours, then things quieted down for a few days, then it was back again, then another hit, stretching across two weeks.
One thing that made it take so long was that their upstream ISPs at some of the datacenters were themselves unable to handle the DDoS, so they had to switch ISPs, which took a while.
I don't see Google/AWS as easy to attack; but I'm not sure why similar tier players like DigitalOcean aren't being hit -- or maybe they're just less transparent about things, or are actually a smaller target (didn't think they were?).
No, it's because at one time the community actually liked and recommended them, then got burned, and acts accordingly.
Linode has a bunch of great features, but I've seen them get hacked a half dozen times over really silly things and suffer more DDoSes than you would be happy with, and frankly I have had interactions with their management (just online) that I was sorry to have had.
You can also read a bunch of implications from former employees about their management, but feel free to discount that given how often ex-employees are a bit pissed.
I've always wondered: while in similar cases GCE/AWS can handle the traffic, isn't that traffic billable? So while you will probably not be affected by the DDoS, aren't the costs going to cut your head off?
I'm running the maths right now.. and you could convince me to take down my side project by just having a server outside their network put wget in a loop targeted at my S3 resources.
>Update - We have been experiencing a catastrophic DDoS attack which is being spread across hundreds of different IP addresses in rapid succession, making mitigation extremely difficult. We are currently working with our upstreams to implement more complete mitigation.
Well, I had Linode shortlisted for an upcoming project. I hate to take them off the list because it is not their fault, but I don't want this kind of unreliability.
I don't understand this line of reasoning. It's not like DDoS attacks are some kind of 0-day failure mode that nobody has seen before.
Would you also say "it is not their fault" if their uplink provider had a fiber cut and they didn't have redundant uplinks? I'm guessing not: it's well understood that as a service provider you need to plan for this kind of unavailability and pay more money for redundant links. So it seems really weird to have this double standard for a different kind of availability failure mode.
Just like network availability or datacenter power availability, you need to invest technical and financial resources into DDoS defenses if you want to be resilient to incidents. If you don't do that as a hosting provider, I definitely won't feel sad for you.
There are only a handful of environments that can sustain a large, coordinated DDoS attack. Can you sink 10-20 Gb/s of traffic forever? Not cost-effectively.
At work, my www servers get short DDoSes on a regular basis; on our 10G hosts, 10G+ attacks are livable (outgoing TCP throughput goes down because incoming ACKs are part of the traffic that gets dropped when total inbound is above the NIC capacity). We have some newer boxes with 2x10G; I'd imagine those should be able to handle 20G of attack, but I haven't noticed. (I usually only check for a DDoS if external monitoring shows an unexpected failure.)
That's for volumetric attacks (UDP reflection); TLS handshaking can eat all the CPU way before we run out of network :(
Usually it is, and according to Linode it is in this case too.
Edit: not sure what alternate reality the downvoter lives in, but the vast majority of attacks these days are just "dumb" packet floods or even easier-to-filter reflection attacks. (Linode clarified on IRC that this one is a mix of DNS and NTP traffic.)
But hey, go on and find me a layer 7 attack that'll take down entire datacenters :)
One difference between Linode and providers like AWS is that the typical deployment architecture on Linode still exposes customer VPSes to direct L3 internet traffic whereas on a best-practices AWS deployment that is almost never the case.
I'd imagine it's easier to filter out bogus L3 traffic when the vast majority of your target IP space comes with explicit configuration as to what sort of L7 application traffic is acceptable.
You can't really compare AWS to Linode. AWS has hundreds of Gb/s of transit bandwidth, so they can easily absorb big attacks. They also have a backbone network, which allows them to increase the surface area of an attack, which increases the available bandwidth.
It's actually not that easy to filter "bogus" traffic. In the hosting world, especially cloud, you have thousands of customers doing whatever they want; who knows what is bogus or not. And even if you can filter it at your edge routers, your transit links are still going to be getting slammed. The filtering needs to be done upstream in the ISP network, and that is usually a manual process, as no one supports BGP FlowSpec at the moment.
RTBH is the best way to defend if you don't have the bandwidth to absorb it.
Block DNS/NTP in the security group => Problem solved (unless it only filters traffic at the instance input)
Put an ELB in front of the services, the ELB only listening to port 80/443, roll out the ELB publicly, roll out new instances only accessible privately, kill old instances being DDoSed => Problem solved => Repeat for all other services, they shouldn't be publicly accessible in the first place.
Ain't saying it's easy but there are some options to help mitigate the attack.
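For example, the security-group part is a couple of CLI calls (the group ID and CIDR here are placeholders; security groups are default-deny inbound, so "blocking" really means making sure no rule allows that traffic):

    # drop any ingress rules that allow DNS/NTP reflection traffic to reach instances
    aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol udp --port 53  --cidr 0.0.0.0/0
    aws ec2 revoke-security-group-ingress --group-id sg-0123456789abcdef0 \
        --protocol udp --port 123 --cidr 0.0.0.0/0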
Atlanta is their smallest datacenter IIRC with their lowest bandwidth capacity.
I would still recommend them as we have had great luck with their Dallas DC. It seems to be the most resilient of them all in terms of random network outages and also DDOS attacks.
However, the one in December did take them offline for a few hours, after which they apparently implemented more DDoS protection. Keep in mind this could happen to anyone (DigitalOcean, Vultr, etc.), and mostly their mitigation technique seems to be to kill your VM until the attack is over.
>the one in December did take them offline for a few hours
That is a bit of an understatement. Our servers were down every couple of hours for a week, and the attack bounced around to pretty much all of their datacenters.
Sorry I should have clarified. Our servers were down for a total of 4-5 hours spread across a week or so. After the first incident it only took a few minutes for them to come online each time, although I hear others had varying downtimes while they mitigated that DDOS.
Thankfully it was over Christmas time so our clients were mostly offline.
How was the recent attack in Dallas? Over Labor Day weekend, Atlanta went down for about 25 minutes until they squashed it. Then I saw Dallas came under attack. How did it handle the DDoS this time?
We had a few seconds of latency tacked onto our normal response time for about 40 mins but everything stayed up. Here is a graph from our monitoring tool (grafana with WorldPing) http://tomschlick-screenshots.s3.amazonaws.com/RKRIJQ9Q
Very interesting. Looks like Dallas can still handle a DDoS better. I got news that Atlanta is increasing its bandwidth 6x real soon; it's scheduled for this month. They handled this DDoS pretty well, although we actually had downtime. Adding 6x bandwidth should make a huge difference. Not sure which datacenter should be my primary after the 6x upgrade, though. Oh wait, Linode just responded to me: they are recommending Atlanta over Dallas for DDoS after the 6x upgrade. Hmmmmm.
I have no evidence to support this theory, but I believe that Linode is not an outlier with regard to frequent DDoS attacks. What makes this company special seems to be how it communicates with its customers when it's under attack.
This leads me to wonder: How much do other providers leave customers in the dark?
Can anyone recommend a good article that explains how attacks like these work, and what is required to stop them?
Also, we're on Heroku and they advertise DDoS mitigation as a feature, but "mitigation" sounds non-committal and I'm curious how they'd fare against a similar attack.
It's non-committal because at some point, when you have enough zombie hosts properly distributed all over the world attacking you, your only defence is to have more bandwidth than the attackers. If your peers can't filter out the traffic before it hits your network and it simply saturates your pipes, there's nothing you can do inside the company anymore.
Cisco has the easiest to grasp fundamentals [1], but to really understand DDOS attacks you'll have to dig down into learning about how TCP works and network layers.
Mitigation sounds non-committal because it is. I don't personally have any experience with mitigating a DDoS attack on Heroku, so I'm not qualified to talk about their preparedness, but DDoS mitigation varies quite a bit [2][3], running the entire gamut from "block a single IP address" to "there are hundreds of IPs rotating and attacking". So the answer to that would be: it depends on who you've pissed off or who's feeling particularly nasty that day.
Linode is a cost-effective solution compared to AWS. But the security and DDoS issues could raise eyebrows and create confidence issues with customers. This has happened even after they posted about enhanced DDoS mitigation strategies like procuring more bandwidth.
I really feel bad for these folks. Does anyone know if they have a DDoS mitigation strategy other than RTBH with their transit providers? I would have thought that after the 2015 attack they would have looked into traffic scrubbing with something like Arbor Networks or Prolexic. I understand that these are not cheap, and Linode's margins, like those of many other hosting providers, are probably thin, but I would think it would pay for itself in one or two attacks by minimizing the customer churn an event like that causes.
I remember a few years ago when I moved my Linode from Fremont to Atlanta to avoid the frequent outages. I've never had show-stopping issues with Linode and the customer service has always been fast and responsive. Now though, I'm thinking of moving to their Frankfurt datacenter.
But now I think I need to set up failover with another VPS provider. What's a recommended alternative? Is DigitalOcean the next best choice after Linode?
Yes, the FBI. They're fantastic at it and part of their job is helping businesses recover from compromise and going after the attackers. However, they're overworked government employees with not enough resources.
The FBI can probably monitor both sides of those nodes (tap the data centre?) if they're in the USA. So can't they monitor all the nodes' clients, then do something like block returning traffic and look for timing or other metadata in the re-request from the command server?
DigitalOcean has had a few DDoS attacks targeting their SFO1 datacenter over the past few months, but fortunately each one seemed to disappear in under half an hour.
When you host the DNS records, and in many cases the sites too, for some 60-70 million domains, someone on the internet wants to attack someone at GoDaddy all the time. There isn't really a time when they aren't being attacked somehow, and I'd presume it's much the same for Linode.
The means of really combating a DDoS are costly and extensive, which is why most use a service like Prolexic or Silverline, typically along with the massive infrastructure that comes with it. With anything less than n-by-40Gb disposable internet pipes, preferably as regionally distributed as you are, you can and will be smitten at will.