This is similar to how we direct traffic to MySQL at Eventbrite. We also have multiple MySQL clusters with complex topologies for different services. Every server that can access MySQL has a local HAProxy that it connects to. This local HAProxy is checking the health of a pool of dedicated HAProxy servers. Those HAProxy servers are hitting a script run by xinetd on the MySQL servers to determine whether a server should be in its pool. Here is where we differ the most from GitHub. Instead of having two backends per frontend, we only have one, but the health check itself is context-aware. In addition to taking its own lag into account, it also queries HAProxy to determine the number of healthy servers in its backend pool. As the percentage of healthy servers goes down, the threshold for declaring itself unhealthy based on lag goes up. Additionally, there are some extra safeguards to make sure we never dip below a certain number of servers in the pool. We use variations on this approach for other services like Solr and Redis as well.
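A rough sketch of the general shape of that kind of check (heavily simplified, with invented hostnames, ports, credentials and thresholds; not the actual script):

    #!/usr/bin/env python3
    # Rough sketch of a context-aware replication check run by xinetd on a
    # MySQL replica. Everything here (hostnames, ports, credentials, thresholds)
    # is invented for illustration.
    import csv
    import io
    import sys
    import urllib.request

    import pymysql  # assumes a MySQL driver is installed

    BASE_LAG_THRESHOLD = 5                           # seconds tolerated when the pool is healthy
    STATS_URL = "http://haproxy.internal:8404/;csv"  # HAProxy stats endpoint (illustrative)
    BACKEND = "mysql_replicas"
    MIN_HEALTHY_FRACTION = 0.25                      # never help shrink the pool below this

    def replication_lag():
        conn = pymysql.connect(host="127.0.0.1", user="check", password="secret")
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()
        if not row or row["Seconds_Behind_Master"] is None:
            return None                              # replication broken or not configured
        return int(row["Seconds_Behind_Master"])

    def healthy_fraction():
        # Fraction of servers HAProxy currently considers UP in our backend.
        raw = urllib.request.urlopen(STATS_URL, timeout=2).read().decode()
        rows = csv.DictReader(io.StringIO(raw.lstrip("# ")))
        servers = [r for r in rows
                   if r["pxname"] == BACKEND and r["svname"] not in ("FRONTEND", "BACKEND")]
        if not servers:
            return 1.0
        return sum(1 for r in servers if r["status"].startswith("UP")) / len(servers)

    def main():
        lag = replication_lag()
        fraction = healthy_fraction()
        # The sicker the pool, the more lag we tolerate before pulling ourselves out,
        # and below a floor we refuse to pull ourselves out at all.
        threshold = BASE_LAG_THRESHOLD / max(fraction, 0.01)
        ok = lag is not None and lag <= threshold
        if ok or fraction <= MIN_HEALTHY_FRACTION:
            sys.stdout.write("HTTP/1.0 200 OK\r\n\r\nreplica ok\r\n")
        else:
            sys.stdout.write("HTTP/1.0 503 Service Unavailable\r\n\r\nreplica unhealthy\r\n")

    if __name__ == "__main__":
        main()

Since xinetd connects the script's stdin/stdout to the socket, writing a plain HTTP response to stdout is enough for HAProxy's httpchk to consume.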
So, I see more and more organizations using "chatops". Are they really delegating control of their infrastructure out to Slack/HipChat? If not, how does authentication/authorization work?
At CloudFlare we use a self-hosted Hipchat that is accessible only via the corp VPN.
However, it's important to note that this is primarily information flowing inbound, not instructions going outbound.
Additionally, disabling a Hipchat account is near the top of the security incident response team's checklist.
If a laptop is lost or a machine is compromised, the ability for those accounts to continue broadcasting information of any kind is immediately neutralised.
The risks are real, but one can mitigate them by:
1. Securing access to the chat environment (self-hosting, VPN only, different auth to this than to the rest of the infrastructure)
2. Not having chatops send instructions to other systems
3. Having well practised plans for disabling accounts in response to incidents
4. All of the above is no replacement for access control and auditing on all other systems
I'm not going to speak on behalf of the security team, but the idea of chatops sending instructions to systems is pretty horrifying, whether those instructions originate from a machine or a person.
It's bad enough to imagine the scenario where someone's chat account is compromised and the attacker gains access to the VPN to use the chat: social engineering via chat is likely to be more successful than via other mediums (people trust chat). Hence the emphasis on ensuring those accounts are disabled very quickly.
So you make chatops secure by not having chatops, it sounds like.
Being able to gather information in a chat room is helpful, but I don't think it's fair to call it chatops unless you can act on that information through chat, too.
To use examples from the blog post, both querying for the slaves in a pool and marking one down for maintenance, all through chat.
Interestingly, the first operators to employ this strategy were botnet owners! Infected machines reach out to C2 servers over IRC and listen for commands from their operators.
Maybe there are more lessons devops could learn from botnets. Why mess about with ansible and the like when you could just infect all your machines with a worm? :)
They are delegating their control to a program running on their own servers that listens to commands given in Slack/Hipchat. I know Hubot has an ACL system, but I'm not sure if it integrates with LDAP, etc... I'm pretty sure it maintains its own groups in its database and relies on Slack/Hipchat vouching for identities.
Right.. The moment you allow Slack/Hipchat to vouch for your identity, you've effectively delegated control of your infrastructure -- an attack on Slack/Hipchat becomes a vector into your infrastructure.
If you could bind a Slack identity to LDAP, would that fix it? I've never really thought of Slack as a security threat directly, but you're correct. If you trust voltagex@slack to act on behalf of voltagex@internal.domain, you've just crossed a security boundary with no checks whatsoever (does Hubot et al. have audit logging?)
With what jzelinskie has described the program actually running the commands could do auth in a number of ways, including ones that completely avoid the chat system.
Sure, in theory, but I've struggled to find a single example of such an implementation.
It seems like you either have to delegate authN to your chat service, with authZ handled by hubot-auth, or find a way to get your chat client to pass a token along with every chatops request.
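For what it's worth, the second option doesn't need anything fancy. A rough sketch of the idea (names, key handling and the token format are entirely made up): the user fetches a short-lived token from an internal service they've authenticated to over the VPN, pastes it alongside the chat command, and the command runner verifies it without trusting the chat service at all.

    # Hypothetical sketch of the "token alongside the chatops request" option.
    # The chat service only carries the request; identity comes from the token,
    # which an internal service issues after the user authenticates to it.
    import hashlib
    import hmac
    import time

    SECRET = b"shared-with-the-internal-token-service"  # illustrative; use real key management

    def make_token(username, ttl=300):
        # Issued by the internal token service, not by the bot or the chat client.
        expiry = str(int(time.time()) + ttl)
        sig = hmac.new(SECRET, ("%s|%s" % (username, expiry)).encode(), hashlib.sha256).hexdigest()
        return "%s|%s|%s" % (username, expiry, sig)

    def verify_token(token):
        try:
            username, expiry, sig = token.split("|")
        except ValueError:
            return None
        expected = hmac.new(SECRET, ("%s|%s" % (username, expiry)).encode(), hashlib.sha256).hexdigest()
        if not hmac.compare_digest(sig, expected) or time.time() > int(expiry):
            return None
        return username  # internal identity, independent of the chat handle

    def run_chatops(command, args, token):
        user = verify_token(token)
        if user is None:
            return "invalid or expired token"
        if command not in ("mysql_pool_status", "mysql_mark_down"):  # whitelist, never raw shell
            return "unknown command"
        # ...check the user's role for this command, log for audit, then execute...
        return "would run %s for %s" % (command, user)

It's clunkier than letting hubot-auth decide, but a compromised chat account on its own no longer gets you anything.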
At GitLab we think that when you are self-hosted there should still be role-based access control. We selected Cog as our chatops framework for that reason (https://gitlab.com/gitlab-org/omnibus-gitlab/issues/1412). It has fine-grained command permissions, users/groups/roles, and audit logging.
Besides ensuring that users are restricted, it is also important that a bug in a script can't do too much damage. Where Hubot runs every script in the same namespace, Cog separates the coordinator from the scripts. It uses containers to do this, so you can write scripts in languages other than JavaScript. Apart from that, Cog has pretty sweet Unix-like piping between commands.
What are the pros/cons of doing this in HAProxy natively?
Could you just use something like a health-check on a cluster system like zookeeper/etcd/consul/keepalived/... to simply peek at the lag time, and then mark a replica as unhealthy and regenerate your HAProxy config?
Just curious as I have very little experience with HAProxy. I do everything with consul right now and it works very well (previously I used keepalived + vrrp, or pacemaker/corosync but no multicast "in the cloud.").
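Something like the following is what I'm picturing (a rough sketch; thresholds and credentials are made up), with consul-template watching the service and regenerating the HAProxy config. Consul script checks just use the exit code: 0 is passing, 1 is warning, anything else is critical.

    #!/usr/bin/env python3
    # Rough sketch of a consul script check for replication lag (thresholds and
    # credentials invented). Exit codes follow consul's convention:
    # 0 = passing, 1 = warning, anything else = critical.
    import sys

    import pymysql  # assumes a MySQL driver is installed

    WARN_SECONDS, CRIT_SECONDS = 5, 30

    def main():
        conn = pymysql.connect(host="127.0.0.1", user="check", password="secret")
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                row = cur.fetchone()
        finally:
            conn.close()
        lag = row["Seconds_Behind_Master"] if row else None
        if lag is None:
            print("replication not running")
            sys.exit(2)
        print("replication lag: %ss" % lag)
        sys.exit(0 if lag < WARN_SECONDS else 1 if lag < CRIT_SECONDS else 2)

    if __name__ == "__main__":
        main()

The main difference from letting HAProxy do the checking natively is that every state change means regenerating the config and reloading HAProxy, whereas the httpchk/xinetd approach reacts within the check interval without touching the config.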
This is cool. I don't think it would be useful in my case though.
99% of the time the cause of lag I see is some heavy inserts on the master. These hit the slaves at the same time so this wouldn't mitigate that issue.
It got me wondering: what is the common cause of slave lag for other people?
- say one of the slaves actually has some monitoring system break and now runs out of disk space, and loses the ability to write locally, so the lag will go up and/or the health check will see that its replication is failing, and it will automatically be taken out of the pool.
- say you run a site where a user has a list of songs they like. some users have larger libraries than others, but there are a few outliers that have 100x the number of songs. you're growing fast and you have some code paths that are not optimized for this, and some features that folks barely use that you don't spend time on. one of these users goes on your site and uses this feature and now you have 1 slave that is lagging because the overly intense query landed on it.
- say you have a high traffic site, or you are in a datacenter where you are sharing networking gear with a very high traffic neighbor, and the datacenter has oversubscribed the networking gear, so there is no headroom for bandwidth. now you have slave lag if your replication is passing through a switch at saturation. maybe this is only present at some times of the day, on some slaves, so they will be taken out of the pool and servers with better network paths are prioritized.
You have to be careful to tune your xinetd service config, increasing the cps and per_source limits according to the load that your pool of servers generates with this proxy forwarding.
Assuming that all servers that need to connect to MySQL have this local TCP 3306 proxy setup, each time you grow your pool of servers the load of incoming check requests increases as well.
xinetd is quite nasty: with the default limits it will stop responding to requests for something like 10 seconds once those limits are hit. This can effectively "block" your whole database connection flow, especially when you restart services in batches during a rolling update, and under some other circumstances...
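For reference, the limits in question live in the xinetd service stanza, something like this (service name, port and values are illustrative, not a recommendation):

    service mysqlchk
    {
        disable        = no
        socket_type    = stream
        port           = 9200
        wait           = no
        user           = nobody
        server         = /usr/local/bin/mysqlchk
        only_from      = 10.0.0.0/8
        # The default of cps = 50 10 (50 connections/sec, then a 10 second
        # penalty) is what causes the "stops responding" behaviour above.
        instances      = UNLIMITED
        per_source     = UNLIMITED
        cps            = 10000 1
    }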
It would be interesting to know GitHub's policies on persistent MySQL connections. If apps hold long lived connections, even if a server was deemed unhealthy, I don't think there would be anything to force the app to reconnect and find a new slave?
In addition, they could also have used agent checks to let each MySQL server decide its own weight. This way, busier servers could receive less traffic until the situation improves.
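For anyone unfamiliar: you add "agent-check agent-port ... agent-inter ..." to the server line, HAProxy polls a tiny TCP agent on each backend, and the agent's reply adjusts the server's state or weight. A rough sketch (port and scaling invented; here the weight just tracks load average, but it could be anything):

    #!/usr/bin/env python3
    # Rough sketch of an HAProxy agent-check responder (port and scaling invented).
    # HAProxy connects periodically; whatever we write back ("up", "down", "drain",
    # or a weight like "up 75%") is applied to this server.
    import os
    import socketserver

    def current_weight():
        # Shed traffic as the 1-minute load average climbs past the core count,
        # but never drop below 10% so the server stays in rotation.
        load1 = os.getloadavg()[0]
        cores = os.cpu_count() or 1
        pct = max(10, min(100, int(100 * cores / max(load1, 0.1))))
        return "up %d%%\n" % pct

    class AgentHandler(socketserver.BaseRequestHandler):
        def handle(self):
            self.request.sendall(current_weight().encode())

    if __name__ == "__main__":
        socketserver.TCPServer(("0.0.0.0", 9300), AgentHandler).serve_forever()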
As a note, this is pretty much the only way to run Galera (the multi-master MySQL variant) from HAProxy: you need to provide a custom HTTP service that checks that the cluster node in question is ready to serve requests. Galera's flow control feature likely does a lot of what the "lag" checks here accomplish.
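The usual shape of that custom HTTP service (percona-clustercheck is the well-known shell version) is roughly the following sketch, with invented port and credentials; it returns 200 only when wsrep_local_state reports 4 (Synced):

    #!/usr/bin/env python3
    # Rough sketch of a Galera readiness check for HAProxy's httpchk (port and
    # credentials invented). 200 only when the node reports wsrep_local_state = 4
    # (Synced), i.e. it is actually ready to serve requests.
    from http.server import BaseHTTPRequestHandler, HTTPServer

    import pymysql  # assumes a MySQL driver is installed

    def node_is_synced():
        conn = pymysql.connect(host="127.0.0.1", user="check", password="secret")
        try:
            with conn.cursor() as cur:
                cur.execute("SHOW STATUS LIKE 'wsrep_local_state'")
                row = cur.fetchone()
        finally:
            conn.close()
        return row is not None and row[1] == "4"

    class CheckHandler(BaseHTTPRequestHandler):
        def do_GET(self):
            ok = node_is_synced()
            self.send_response(200 if ok else 503)
            self.end_headers()
            self.wfile.write(b"synced\n" if ok else b"not ready\n")

    if __name__ == "__main__":
        HTTPServer(("0.0.0.0", 9200), CheckHandler).serve_forever()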
Er, so, are they using a separate database application to abstract the MySQL operations? They don't mention it in the article. Not abstracting your database leads to major issues from unusual load and operations, and there are a lot of performance, portability, release engineering, etc. wins to be had. I don't see what other companies are doing, so I'm not sure how common it really is.
It allows a dedicated application layer to handle database communication and changes in a more intelligent way than a proxy/load balancer can: it helps account for scaling issues, batches transactions, retries on failure/inconsistency, throttles, caches, applies advanced logic, keeps schema/application changes resilient and consistent, segregates the network, restricts operations, etc. Nobody seems to implement it unless there's a wide range of applications with different needs, or until things break.
BTW, if managing this sort of infrastructure excites you, we're hiring! https://jobs.lever.co/eventbrite/cd7d6b95-0226-4e03-b9fa-c15...