How to properly manage SSH keys for server access (paepper.com)
373 points by mpaepper on Sept 26, 2020 | 182 comments



>Every developer needs access to some servers for example to check the application logs.

I fundamentally disagree with this. I've been writing software a long time and I used to demand server access so I could tail logs, creating exactly the problem this article talks about resolving (good read, btw). But I can't help wondering why we keep teaching this mindset. We have log analysis tools available on pretty much every cloud now. Docker has an AWS log driver, a GCP log driver... etc. Backend developers should make a conscious decision about how to ship log events out of the box and into something searchable and indexable that you can derive metrics from.

I ran a cloud infrastructure group that vehemently said "No" to any dev requesting SSH access. Not because they couldn't have it, but because we didn't have access either. We created our platform without the ability to modify it. Infrastructure as code. The only way to change it is to redeploy the stack.

You'd be surprised at how much less stress there is when you can have alerts on log events from your application when things break, instead of a support call or a support ticket. Proactive > reactive debugging. I also understand not all shops are at that level of maturity. I'd love it if the community as a whole stopped teaching people to treat their app and server as a second home, as a pet that must be nurtured. Obviously these are my opinions, and the article itself addresses how to handle SSH keys for server access in a logical way; my only issue is why create that mess in the first place?
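For instance, with Docker's awslogs driver that decision is a couple of flags at run time (region, log group and image name all illustrative):

  # ship container stdout/stderr straight to CloudWatch Logs
  docker run --log-driver=awslogs \
      --log-opt awslogs-region=us-east-1 \
      --log-opt awslogs-group=my-app \
      my-image

After that, nobody needs a shell on the host to read the application's logs.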


What if you encounter an nginx bug, or a kernel bug? At some point, when you reach a low enough level you will need some deeper investigation tools and finally access to the machine to get enough data and fix the problem.


Then that would be a sysadmin / DevOps responsibility rather than your application developers. Or at least you'd have those guys involved with the investigation.

But honestly, how often is a web application bug due to a kernel bug?


a) How can there be a "DevOps responsibility rather than your application developers"? Isn't the whole idea of the word "DevOps" to eliminate such distinctions?

b) In my experience, the application developer is held responsible for the application's behavior in production. In the luckiest .01% of scenarios, there might be an infrastructure engineer with appropriate permissions and free time trolling the Slack support channel at the moment you report the issue. Otherwise, 99.99% of the time infrastructure will not acknowledge or investigate anything complicated or subtle with just one service owner complaining. The infrastructure group, organizationally, is graded on shipping new platform features and on coarse KPIs for the performance of the platform as a whole; nobody is getting paid to investigate the weird bugs of some application team somewhere.


> Isn't the whole idea of the word "DevOps" to eliminate such distinctions?

No, it doesn't eliminate such distinctions. My view of DevOps is more about ensuring that automation is used as much as possible to meet objectives.

It's definitely not about making everyone a homogeneous developer unit that can work on every problem.

People come in all shapes and sizes, some are more competent with certain things than others, others have a lot more experience with certain things. That's aside from the whole preference thing - not everyone wants to or has an interest in managing infrastructure.

Maybe when you have a handful of developers and a small set of infrastructure, that's fine - but at a certain point you start to require more and more specialised knowledge. Yes, even when you're all-in on Cloud and using all the SAAS/PAAS products out there.

>Otherwise 99.99% of the time infrastructure will not acknowledge or investigate anything complicated or subtle with just one service owner complaining.

Yeah, that's an organisational problem from the sound of it.


Everyone having their own idea of what devops means is problematic.

I think it’s best to consider the original source which is this talk from Flickr: https://youtu.be/LdOe18KhtT4

("10+ Deploys Per Day: Dev and Ops Cooperation at Flickr").

It talks directly about joining developers and operations into the same team - later, Patrick Debois would refer to this as DevOps, and a year later the first DevOpsDays in Ghent was organised (also by Debois).


Thanks for reminding me that someone else remembers this. It seems like it took no more than 5-6 years for everyone to simply adopt the term as a replacement for sysadmin with no change in operational practices. Coincident with that it seems most infra engineers started calling themselves SREs to distance themselves from the diluted concept.


Yep. It's more of a concept of cooperation than anything else. Which, since it's not a concrete thing, makes it harder for people to understand. But then there's specific practices that arose out of trying to drive best practices in Ops at the same time (like IaC, II, CattleVsPets, automation, etc) so now DevOps "means" a jumble of slightly related things.

We really need some new terminology.


Is that the original devops talk, which is like the origin of the devops "movement"? Very cool, did not know it had such a clear origin.


At my company, a subset of developers have ssh access, if other developers need something that requires ssh access they work with someone who has it. But unless you are a very early startup, I don't see why every developer would need ssh access.


Yeah, ever since devops became a thing they’ve placed themselves in this position where they’re somehow better than the application developers, even though until a few years ago those same developers were doing the exact same things.

I swear, sysadmins were annoying as an application developer, but devops is something else.

People with one year of actual work experience get hired as devops, and have all the privileges I would need to fix their mistakes, but I can’t, because I’m an ‘application developer’. So instead you end up teaching them how to do their job.

I’m not salty at all.


DevOps is a set of practices not a role. I do Ops, practicing DevOps, and I serve my programmers. My job is to ensure they stay happy. If they're not happy about something in the production pipeline that's on me. I work hard to make sure that I'm their Jesus Christ for all things infrastructure.

If your programmers aren't delighted with you, I'd say you, the Ops person, are not practicing DevOps, or you have a buy-in problem with DevOps practices at an organization level.


I love this take.

My previous role had my title as "DevOps Engineer" but it always rubbed me the wrong way. I was just an Operations Engineer with a focus on making my developers' jobs easier, in any way I could. Having that as my North Star kept me honest about the work I was doing versus considering the role more like Operations Engineer v2.0.

In Silicon Valley, at least, DevOps seemed to be (seems to be?) sort of in vogue; I think it's important to keep its core qualities of bridging Development and Operations in mind, as opposed to just shifting an existing position's title in an attempt to attract talent.


Preach!!

And this should extend throughout the organization. If Architecture or Security or any other group is making your life miserable, they too should be DevOps'ing, working closely with you, caring about your frustrations that only they can fix. Sadly there are still so many silos left to break up.


Agree, a job title of “DevOps Engineer” is an organizational smell for me.

Most people with such a title are actually something like “Automation Engineers”, “Infrastructure Engineers”, “Operations Engineers”, “Site Reliability Engineers”, etc, that are involved in a DevOps “process”, “initiative”, “culture”, etc.


Ah the classic ivory tower argument where some “other class” of engineers are universally inept, but not “my class!”

You can write the same screed full of generalizations from the perspective of any job title: a devops person would lament the fresh-out-of-bootcamp “application developers” who have no idea how systems work together so write SQL queries that retrieve a million rows, one at a time. “Works on my local!”


Pretty sure GP was bristling at the reverse happening. We must keep the developers from screwing up the important computers.

Saying the emperor has no clothes is not ivory tower thinking.


I completely agree. Access to those things should be given to those qualified to work with them, not based on an arbitrary role designation.


I’m sorry you’ve had some bad experiences but not everyone is like that.

However, it's also misguided to assume that specialities don't exist. You can have infrastructure guys, developers, security folk; there will be overlap between the roles, but it's impossible to be a master of every trade.

I agree that arrogance is an unpleasant trait, but arrogance can take many forms: rudeness to colleagues, or overestimating one's own technical capabilities in adjacent fields.


I've yet to find many organisations that split those responsibilities up enough. This might work for 1% of orgs, wait no - 1% of tech orgs, but everyone else needs something better.


I can't vouch for your experience but as a sysadmin myself I've found 100% of the companies I've worked for have had dedicated sysadmins ;)

I'm being a little flippant here though, I am aware that developers are often asked to wear the sysadmin hat too.


Here's a funny thing: as of two days after this post was created, pairing hasn't been mentioned once in the entire thread. If this thread is any indication, maybe it's developers who have an incomplete understanding of DevOps.


That should be the exception, rather than the rule.


And it is deeply frustrating when you run up against one of these exceptions and need to wade through some bureaucracy before you can investigate further.


It would seem to me that this is the perfect time to pull in someone with more production experience. Perhaps they can use the existing tools to pull logs, or analyse it in some way. Maybe they've seen it before and already know the fix.

Giving everyone production SSH experience is, in my experience, a way to run into all sorts of weirdness, not to mention endless frustration.

In a modern automated infrastructure, that box is likely a container running on a virtual machine that's ephemeral and can (and probably will) go away at any moment for any number of reasons - maybe CD kicked off a new deployment, or maybe the load changed and the instance was selected for scale-down, or maybe our spot bid for that AZ isn't sufficient to keep the instance around, or maybe you being SSHed in and poking around impacted the health check, and so it's being killed for not performing right.

There are many other problems, too - lots of applications are built in such a way that there's simply no alternative to secrets (passwords, API tokens, keys) for reaching other systems, particularly third-party systems. So production boxes have production secrets, which you probably don't want to share with everyone.

Giving everyone SSH access so they can, in theory, take nginx/kernel dumps as needed tends to imply giving superuser rights, which means they can do whatever they like.

So, yes, pull in someone else - find some way to try and reproduce the problem NOT on production. If that fails, perhaps there's a way to grab enough detail or pull additional logs or network captures to identify the issue. If that fails, well okay, let's SSH in - but we need to coordinate that to ensure that instance doesn't go away, and doesn't impact production while you do it.


The point people are trying to make is that if you are at the scale where a kernel bug or an nginx bug is borking your app, it's not the developer's job to go poking around the system for a fix. It's the devops/infra people's job. In my world, if you want to investigate an nginx bug... "docker run -it nginx:latest /bin/bash" and go for it... find the issue, reproduce it, then fix it in the pipeline and deploy again. You didn't touch production at all. If your debugging relies on being ON PRODUCTION, you don't suffer from the scale that would justify being on there in the first place.


Not all bugs are sufficiently cheaply reproducible outside the environment in which they are observed. It seems silly to tie your hands behind your back when you could just inspect what the computer is doing and then fix it.


Why can't your developers be "devops people"?


I think developers can be devops also - but they are different skills you need to learn and keep up. Someone good at, say, nodejs or python data science may not be the best at CUDA build compilation on CentOS. And being good at both makes you less good at each unless you're working 18hrs a day to keep up with everything.

There is also the case of ratios. An organization probably needs more developers in specific areas than DevOps, so with dedicated DevOps you could concentrate similar work from across several teams to a dedicated DevOps team that knows that work very well.


I've already written a response to this elsewhere in the thread, but developers are not all equal.

You can't hire twenty developers that all have the same skill/inclinations, the same interests, the same experience.

That's not to say that a DevOps Engineer is some super 10x rockstar developer - no, they're going to have the same variations on skill, interests, experience, etc.

It depends on your environment, but there's so much different tech once you count the entire stack, that I don't think it's reasonable to expect any one person to be an expert on all of it, or even a lot of it.


Sure, but "not all developers can be devops engineers" doesn't necessarily imply "none of your developers should have server access".


Lets go back to the original core assertion for the thread -

> > Every developer needs access to some servers for example to check the application logs.

> I fundamentally disagree with this.

So,

Developers shouldn't be reaching for SSH access to check logs.

If you're encountering problems that you can't diagnose through the existing logs, then you should probably be involving at least one other person - someone who has that production experience, who might have some additional knowledge about the problem.

If, and only if, you've exhausted other avenues - then reach out for SSH access. But it should be a last resort, not the first resort. Plus, anyone SSHing into production boxes should really be very familiar with how production is configured. You can do more harm than good by poking around on a production box, completely unaware that you're causing alarms and outages elsewhere - say, because taking a memory dump of nginx caused in-flight requests to get timeouts and so forth. The people with that experience are generally the DevOps/Infrastructure folks, since they're the ones who deal with production all day and are going to get the pages if something goes wrong with it.


Some of your developers might have that experience even if they were not hired as devops engineers, or they might be able to consult with their colleagues who are developers who have that experience but aren't devops engineers. I think it is usually better to let people work in the way that is most efficient for them. Also, this doesn't necessarily have to apply to production systems.


> Some of your developers might have that experience even if they were not hired as devops engineers, or they might be able to consult with their colleagues who are developers who have that experience but aren't devops engineers. I think it is usually better to let people work in the way that is most efficient for them.

Yeah, perhaps. It'll depend on the circumstances, right.

I've got a reasonable amount of GSuite and Exchange experience, and same for Active Directory. I'm reasonably confident that I can work my way around those and do most of what I need to do without breaking it.

I needed some GSuite groups set up, some folks added to them, and a GSuite OAuth application set up and some values passed back and forth to do some integration. The only way to do all of that is with full GSuite Administrator rights.

Now, I could ask why they don't just give all Devops folks GSuite admin rights, it'd be much easier (for me) and I could do my job more efficiently.

The response is going to be something along the lines of:

> You don't need that access most of the time.

> The times you do need that access it's often for a limited time or scope, and to resolve a specific problem.

> For now, it's better that you work with someone who is responsible for that stuff on a day-to-day basis to do those things.

This is, in my opinion, pretty reasonable. Sure, it meant more delay until someone was available to do the GSuite configuration, and we needed to jump on a call to pass IDs back and forth and test it out. But it got done, and it wasn't overly burdensome.

They didn't hire me as a GSuite, Exchange or AD Admin - they already have folks to handle that. That I have that knowledge and experience is still useful for the company - I know exactly what to request be done, and we can talk on the same level about it. Heck, if it turns out that I need this every day and I'm constantly going back and forth with them on setting things up - then I might get that access, but it'll probably come with an explicit requirement that I use it in specific ways, that I follow their processes/procedures, and that I keep them informed on when I'm using it and why.


>What if you encounter an nginx bug, or a kernel bug?

That is the responsibility of system administrators. Application developers have no business on a production machine. If your sysadmins don't have the technical skills to diagnose these problems, they are incompetent and must be replaced.


> If your sysadmins don't have the technical skills to diagnose these problems, they are incompetent and must be replaced.

The actual result of this is that the sysadmins are not replaced, and the application developers end up in an emergency conference call at 3am to tell the sysadmins which buttons to click on the production environment, since they’re not allowed access themselves.


I spent several years at a large multinational cloud provider that gave developers and QA access to production systems and customer PII. That all changed after the company was bought by SAP and operations were integrated. I am amazed that engineers think this is acceptable. It is bad business practice, compromises security, and illegal in some jurisdictions.


If developers aren’t exposed to the deficiencies in their systems, they have no incentive to reduce SRE pages and triage. Build resilient code with quality documentation and you don’t have to attend a 3am conf call.

DevOps is not a role or role segregation, it’s about aligning incentives and outcomes across functions in an org (hopefully through collaboration, tooling, and knowledge transfer).

The caveat is that if your org is fundamentally broken, none of the above applies or works and it’s all lipstick on a pig.


I fundamentally agree with your fundamental disagreement with the article, but then I also fundamentally disagree with:

> ship log events out of the box and into something searchable, indexable, and can derive metrics

That's just a way to spend a ton of money. There's really not a reason to ship or index logs. Just leave them where they are produced and instead of shipping them to a central thing that inevitably becomes an expensive bottleneck, instead develop ways to access them in place (other than SSH) and frameworks for large-scale distributed searching where you push the predicate out to the machines where the logs exist.


That's fine if you don't have many servers and treat them as pets[1], but once you start scaling sideways, auto-scaling, or using Docker/Kubernetes or any of the other tooling around build pipelines or clustering, you quickly run into issues if those logs aren't streamed to a central repository.

This is the age old problem of simple problems being harder to solve at scale. Running a blog on Digital Ocean is very different to running a million dollar application on AWS.

[1] https://medium.com/@Joachim8675309/devops-concepts-pets-vs-c...


People always -- always -- pull cattle-vs-pets on me after I advocate for this position, but I've slaughtered more machines than any of you. The problem is that you are treating your logs as pets when the truth is the logs are also cattle. Virtually all debug logs will pass through their life cycle without being read, so indexing them is just a flagrant waste of energy.


> I've slaughtered more machines than any of you.

I'd rather not engage in pointless unprovable arguments, if you don't mind.

> The problem is that you are treating your logs as pets when the truth is the logs are also cattle. Virtually all debug logs will pass through their life cycle without being read, so indexing them is just a flagrant waste of energy.

You shouldn't have debugging enabled on production systems unless you have an interim process that filters out debug messages before they get indexed (and thus you can toggle which logs get indexed there rather than reconfiguring / redeploying all of your application nodes).

Also nobody is suggesting logs should be treated as "pets". You still want to purge older logs; however, the problem is you cannot always replicate reported errors, so if you don't have those log messages captured then you're out of luck.

Don't get me wrong, there is a certain allure to the traditional method of systems administration - I've been on both sides of the fence - but central logging services have so many other benefits such as security (tamper proof logs, users don't require SSH), ease of use, persistent logging, etc. The only real downside is cost but that quickly becomes absorbed in your pricing plan when customers start asking for SLAs.


I have to agree with the other guy. Going too far into "Don't touch the machine" leads you to ridiculous places. Designing your system to be instrumented is invaluable, and sometimes that means logging on to a machine. Should you minimize this? Yes, but not to the point you invest tens of thousands or more into a way to not use SSH.


Since we already know you can put a more or less adequate elasticsearch cluster on DO for $60 per month, I don’t think that argument holds a lot of water.

By the time you want centralized logs, you are probably already spending much more than the logging will cost on infrastructure.


An ES cluster might be easy to install, but all the other chores around your logs, along with learning the ES search language and its quirks, are far from trivial. I personally spent months building a viable solution on premise and it's not something I want to do again. It's way less hassle to access the server directly for daily logs and just move them during the night to some archival storage.


Nobody said "you should never log into the machine!" We just said reading log files shouldn't require SSH access.

What you're doing here is constructing a straw man argument while agreeing with the same point I was making.


Hold up, "never log into the machine" is exactly what the discussion is about, with not logging into machines for log access being one aspect of it.


I'm guessing you skim read most of the replies after the OP? The OP did touch upon it when describing their own architecture (albeit he wasn't actually suggesting that should be how everyone operates) but everything after that has been more narrowly focused. This particular branch of the discussion was specifically discussing log access:

> > ship log events out of the box and into something searchable, indexable, and can derive metrics

> That's just a way to spend a ton of money. There's really not a reason to ship or index logs...


They're explicitly talking about not logging into machines for log access. One of the ways to make it seem a reasonable suggestion is the broader topic of not logging into machines at all, ever, and why that may be a good thing.


> They're explicitly talking about not logging into machines for log access.

That's what I said. ;)

> One of the ways to make it seem a reasonable suggestion is the broader topic of not logging into machines at all, ever, and why that may be a good thing.

Except none of that was being discussed. Only log access.


The conversation involves both.


No it doesn't. By your own admission:

> They're explicitly talking about not logging into machines for log access.

Why can't you just admit that you didn't read the thread properly, rather than insulting all of our intelligence with these piss-poor mental gymnastics where you redefine the context of what people had very clearly written?


Because I read the thread? How many times do you want me to repeat myself? They're talking about the specific case of not logging in for log access, but that is partially justified by the wider desire to never have people log in at all. You can separate them but the original conversation didn't.


> You can separate them but the original conversation didn't.

The conversation you're replying to, however, did. I know this for a fact because I was involved in that conversation and I made that distinction myself ;)


> I've slaughtered more machines than any of you.

I guarantee that you haven't, as the number of people that can make that claim in relation to my background is vanishingly small and I know almost all of them by name.

I centralize logs. Not only does it make more sense for administration at scale, it's invaluable for security reasons and assists in compliance by providing a controllable guaranteed audit trail.

You may have a valid argument about cost here for some applications, but it's unwise to make arrogant claims you cannot back up.


> I've slaughtered more machines than any of you.

Not only is that vanishingly unlikely to be true with this crowd it’s also not at all the kind of rhetoric that they’ll listen to either.


When I read that comment I was thinking about our auto-scaling nodepools on GKE that have been running for a couple of years and scale up/down every business day by 20-40 nodes or so depending on load... but I am not sure I get to take credit for slaughtering them.


So that's the new replacement for pointless uptime boasting.... Except even less provable.


The logs in my case are audit trails required for compliance and for investigation by 3rd parties.


This way you lose them every time something goes wrong with the infra, or when your fleet autoscales. Or, if you do blue-green deployment with infra, on every deployment.

Your storage/indexing money spend scales with your company. If you have < 10 hosts you can probably handle logs with a cost that's a rounding error with something like loki+grafana.

If you have a bigger environment and more people, the cost will be more significant, but you'll also likely have more revenue to justify it.


> There's really not a reason to ship or index logs.

Proper Information Security policy is one very good reason. If a box is compromised, you don't want the logs being tampered with.

You ship the logs to a central, dedicated server so that a clear audit trail can be established.


Shipping software logs is just useless bureaucracy. Don't confuse it with shipping firewall/ssh/rdp logs, which is an absolute must for telling how/when you were pwned. A hacker will just disable app/db logs, update their user to admin, dump the db and proceed with the cryptofuckery.


Eh? In my experience logs are some of the most valuable data produced by software. Debugging, experiments, software quality, auth trails, analytics, customer support, privacy, security, audit/compliance and abuse are all seriously useful applications for logs. Throwing them away is like dumping gold.


Yes, nobody said throw away application logs. That's dumb.


I imagine basically any form of compliance would require centralizing logs in some form


and for audit. Audit/Compliance is one of the worst yet valid reasons to introduce complexity, redundancy and all sorts of costs into projects. It's also why some people earn a living out of it purely.


Authentication, authorization and access logs typically. Generic application logs can often be treated differently.


Inception of fundamental disagreements ...

On the point: what do you do if the machine is not there anymore? Your model fails immediately. Also, some processes operate on the assumption that the "machine" will stand-up() -> do-something() -> die(), which your model also doesn't support. Not to be messing with you, but I don't think there is a "right" model, just whatever works for us. I do like your model for immediate accessibility of the logs.


In production systems, when something goes wrong it is very common to ask customers/the support team for logs. In most cases the customer needs SSH access or some role to collect this info. By the time they get to those, the logs are gone. In my experience, to really scale a service you need OOB logging tools and automation to perform actions around errors in logs.


You could just have an SSHable account without shell privileges, only able to run grep or some small set of controlled utilities, and with filesystem access restricted to read-only access to the log dirs.

Anyone who needs access could remotely run a command that would return the results of whatever permitted query they submitted.


I've seen centralised logging scale to FANG level, so I really wouldn't worry about bottlenecking.


Thank you, I agree that this should be avoided as much as possible in general.

Actually, in my company, we are pushing the logs to an external service, so I guess that line I wrote there was not totally thought out. ;)

However, it still might be useful for certain people to check servers for health or other metrics and you might not always have everything as perfect as you wish. I guess you already stated that yourself.

So in general I'd agree. Good comment!


Health is a function of the application, not its host. A health check endpoint is paramount to ensuring the app is healthy. I still disagree with having _any_ access to a "box". Local dev, console log, deployed debug? Better make sure you are logging events and not nonsense. Actionable events with request tracing (preferably). But yeah, it was a good article. Bitwarden is something I've used to share privileged keys before, but the whole signing stuff was the right way to go.

Also, if you aren’t on “cloud”, odds are you are still using something like Kubernetes or DC/OS or Swarm or the like. If you aren’t then well, wordpress sites aren’t really in the same ballpark technically. (Joke, Wordpress sites get traffic, some lots of traffic, I don’t discriminate against the PHP tribe).


I see your point, but you are incorrect, and probably reasoning too much from your current position. In many new companies practicality is very important. There are maybe no resources to set up a giant set of utilities that replaces the info you would normally get by SSHing to a server. In a new company you want to make this a gradual process. The security aspect is always important, but it's a balance. You cannot over-engineer.


Agreed, and what you're saying is really the proper way to do things in a cloud-native environment where all hosts are ephemeral, containerized apps are idempotent, and especially a serverless environment where no one has access to the invisible host.

However, for anything shy of full serverless, just s/developer/ops and you still have to solve for the same problem. At the end of the day you're always going to want someone to be able to get a shell on the box for troubleshooting some critical issue. If there is a box.


> Backend developers should make a conscious decision on how to ship log events out of the box

Great hand-waving! What are you going to do if this log streaming process stops working, and cannot restart? Kill the machine and lose the logs? That might be OK for your webapp; it could be trouble for more serious applications.

> You’d be surprised at how much less stress there is when you can have alerts on log events from your application when things break instead of a support call or a support ticket.

Giving SSH access doesn't mean you don't have event logs.

> stopped teaching people to treat their app and server as a second home, as a pet, that must be nurtured

In our case, nobody is treating anything as a pet. Our servers are mostly spot/preemptibles, but we still have SSH access. Because it makes it really easy to debug if things do go badly wrong. From being able to see htop output, to being able to upload custom binaries to investigate an issue, or to smoke test a code change... it's a huge time saver.

Give me SSH or give me death.


We also manage our whole infrastructure automatically via Ansible, but if there's a problem that goes beyond an error in the deployed code, it's still necessary to access a machine via SSH in order to debug it. Logs are arguably something that can be easily centralized, as most applications support logging via syslog and there are a lot of great tools available for it like Graylog.


Sure, having a privileged key you use with ansible to actually do the thing... but we’re talking about giving the shop access to the box. Not your team of highly trained infrastructure engineers (or bob, who has all his playbooks on a thumb drive for his office). If ansible and non-cloud is your deal, an ssh key (or two: deployments) is fine. I’m just against giving others access to things they could break or worse, leave the door open...


IMHO, using ansible to manage keys is a way simpler process. You won't miss things because your inventory file has everything.


The main counterpoint is what do you do when the logging system goes down at the same time there's an incident? Do you then just rely purely on metrics/customer reports to try to debug?


When you’re tailing logs on a server there’s still a “logging system“ (your ssh connection, tail -f, a disk with finite space) that can still “fail”. Probably in either case you’d work to fix logging and you’d hope that the system is simple enough to fix.


Logging systems are annoyingly complicated and there are a lot of moving pieces that can fail. Way more than reading a file over SSH.

This kind of access shouldn’t be your go-to but it’s important to have a battleshort.


What's complicated about a syslog server? I suppose some applications like to spew out multiline logs (which are something you just can't process outside the source while maintaining sanity) but seriously, when you actually consider these things while writing the application, you don't need all that much to have something workable.
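For the simple case, forwarding is a couple of lines of rsyslog configuration on each client (hostname illustrative):

  # /etc/rsyslog.conf - forward all logs to the collector (@@ = TCP, @ = UDP)
  *.* @@logs.example.com:514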


I was going to make a note that an rsyslog collector would be my only exception to this because of how simple it is, but you still have to be careful: by default rsyslog is okay with losing logs in-flight.

But I would be lying if I said the typical logging setup was this simple. It's basically all remixes of ELK, and while they're fantastic when they work, there are many moving parts that can individually fail.


I think the point is that there's often a lot more to server access than just looking at logs. The article offers a solution to that.


My application runs on my customers' server(s). I don't generally have any access to anything.


Why not store the public key in a user's LDAP profile? Then when they log in, sshd can pull it with the AuthorizedKeysCommand option in its config. As it's in AD/LDAP, you can let them manage their own public keys via a simple portal. When a user quits or is fired, their account is disabled and this will stop the key from being used.
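A rough sketch of the wiring, assuming the openssh-lpk schema's sshPublicKey attribute and a hypothetical helper script (base DN, bind settings etc. come from your ldap.conf):

  # /etc/ssh/sshd_config
  AuthorizedKeysCommand /usr/local/bin/ldap-ssh-keys %u
  AuthorizedKeysCommandUser nobody

  #!/bin/sh
  # /usr/local/bin/ldap-ssh-keys: print the user's public keys, one per line
  ldapsearch -x -LLL "(uid=$1)" sshPublicKey | sed -n 's/^sshPublicKey: //p'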


The blast radius of storing SSH keys in LDAP is very big: if your LDAP is down, you cannot SSH into your servers anymore.

To overcome this issue, you end up storing a set of public keys in the servers themselves, thereby going back to where you started.


A common way to implement this is through SSSD which can cache keys locally when the LDAP server is not responding.

https://access.redhat.com/documentation/en-us/red_hat_enterp...
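With SSSD, the sshd side is just (per the Red Hat docs above):

  # /etc/ssh/sshd_config
  AuthorizedKeysCommand /usr/bin/sss_ssh_authorizedkeys
  AuthorizedKeysCommandUser nobody

and SSSD serves its locally cached copy of the keys when LDAP is unreachable.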


If LDAP is down in an LDAP environment, you cannot authenticate anyway.

Our systems people get local accounts on machines for disasters like LDAP-down. Everyone else is LDAP-only, including ssh keys.

This is nowhere close to "where we started". Now we have a handful of privileged accounts and centralized auth management across thousands of other accounts.

Centralized logging and centralized auth are pretty much mandatory above some size. Without them, you literally do not know who is doing what.


A way to work around this is to normally work with short-term certificates (24h), and to allow your systems people to generate longer-term certificates (one month) that are stored encrypted on their laptop or a USB key.

That way you get both emergency access in case LDAP is broken, as well as a way to make sure old personal access gets revoked after one month.
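In plain OpenSSH terms that's just two signing calls with different validity windows (CA path, identities and principals illustrative):

  # day-to-day certificate, expires in 24 hours
  ssh-keygen -s ca_key -I alice -n alice -V +24h id_ed25519.pub

  # emergency certificate, expires in four weeks, stored encrypted offline
  ssh-keygen -s ca_key -I alice-emergency -n alice -V +4w id_ed25519.pub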


Yep. I'm a big fan of Hashicorp Vault; that's the next step.

There are of course also DR considerations for that, as always, it is turtles all the way down.


You can have backdoor keys for the root account that you configure during provisioning. The use of these keys/account would trigger a security alert and only be for break glass scenarios. Other situations would use LDAP stored keys for authn/authz and LDAP stored sudo rules for additional authz.


Well, absolutely there has to be an alternative way in, be it a serial console or a dedicated admin user who does not have keys in LDAP. With sshd you would specify a local key instead of LDAP for that specific user, for example.


Where I work if LDAP (AD) is down, the world stops. In 16 years AD "got stupid" once. So this is a good solution for us.


Yep, pretty neat solution.

This occasionally pops up at reddit and the "key in LDAP" way feels surprisingly unknown/uncommon. Many use Ansible, but it requires guaranteed cleanup of revoked keys...


I’m using Ansible for this for my personal servers on a small scale, but revocation is pretty easy for me. I have all keys I want to distribute in my Playbook and I remove all authorized keys from the server and write only the ones in the playbook.


Well, the simple objection would be that the more remote network services you include in admin login management, the more likely you're going to have a problem when you least can afford it.

SSH configs in distros are moving towards not even doing rDNS lookups (which generally just delay logins for the timeout period when there's a problem).

That doesn't mean some hybrid solution doesn't have merit though.


If you go this route, why not go all the way and do Kerberos SSO so you don't need to fiddle with ssh keys at all.


That only works smoothly in a Windows environment. Sure you can do it from a Linux workstation too. But it's like a square peg in a round hole. Only on Windows Kerberos works so well it's natural to use it.


I have been using OpenSSH and OpenLDAP to do password authentication for decades and just recently discovered this as an option. It is pretty simple to set up: basically just extend your LDAP schema with the public key attribute and then tell OpenSSH to use it.


I'm not sure I'd call AD or its portal simple...


If we're talking about internal servers, is an existing LDAP/AD infrastructure so uncommon? There would be almost no additional work to implement this (depending on the size..)

The discussed signing flow probably works better with cloud infrastructure. Afaik it's one of the ways Hashicorp's Vault can be used for SSH.


I think that if you already have all users in an authz/authn system then anything will feel easy compared to the alternative, but I'd definitely not call them simple.


A while ago, I asked here: "Ask HN: What do you use for SSH key management of teams?" (https://news.ycombinator.com/item?id=24157180)

And in this blog entry I am summarizing what I learned and what I think is a very good approach now.


Did you forget to add the link to the blog entry?



Oh sorry, I didn't realise you submitted it.

Thanks!


This is good. There's not much information around on this topic, and often roles seem only vaguely hinted at.

The lab I work at recently had a "worry" during recent attacks on UK science infrastructure that someone might have compromised an unknown private key someone might have stored on a third-party server. The institutional response was to switch to password-based ssh only for a month (so lots of people used sshpass...) and then revoke - at the bastion - every single key listed in every single authorized_keys for every one of thousands of users in the system, before turning key access on again.

I've wondered why they didn't switch to an ssh certificate system. My best guess is the complication, maintenance, and risk profile of setting up a secure key-signing system to do this signing automatically - presumably this acts as a single point-of-failure for compromising the entire network.


I've not seen any rationale why sites haven't done something sensible. Sensible, like allowing automation without storing your password to use sshpass, for instance; then if you insist on MFA, that can be done when issuing an ephemeral certificate. "Certificate" might bring back bad memories of Globus et al, but that was different, and I haven't seen it articulated.

The general response to those compromises has been a disaster area. Especially as (in the absence of any post mortem, as far as I know) the attack presumably involved the general password-spraying that was happening around then. Users were made to look responsible and suffer because systems had local privilege escalations due to poor management -- which, to be fair, might be due to relying on cluster vendors who "take security very seriously", as they say. That has been around the top of the threat list for decades too, it's just that it's now much larger scale. Then you got private keys purged that might be used to access systems on other sites with less risk than typing passwords on insecure systems -- arbitrarily deleting users' data.


I have given up on OpenSSH's built-in certificate handling because it does not properly support revocation.

It's not possible to have 100% of your keys on a short duration, so you have to cook up a revocation system.

That's not my idea of a fun time. Auditing keys on disk is much less trouble than properly managing CRLs.


Why not use smart cards? Then the key can't be copied, only physically stolen - but you'd need the PIN as well, and it would be noticed that it's missing.


So the solution to the problem brought up by the author - having to regularly deploy a new authorized_keys file to servers - is to instead regularly deploy a new revocation document to servers. I don't see how this is any different or a solution to the primary problem discussed. It just adds the additional effort of having to sign and revoke certificates.

We use a cronjob to regularly fetch, verify and deploy updated authorized_keys on servers.
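Concretely, something along these lines (URL, signature scheme and paths all illustrative):

  #!/bin/sh
  set -e
  # fetch the canonical key list and its detached signature
  curl -fsS https://keys.example.com/authorized_keys -o /tmp/ak.new
  curl -fsS https://keys.example.com/authorized_keys.sig -o /tmp/ak.sig
  # refuse to install anything that doesn't verify
  gpg --verify /tmp/ak.sig /tmp/ak.new
  install -m 0600 -o deploy /tmp/ak.new /home/deploy/.ssh/authorized_keys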


The primary difference is that you can have short-lived certificates, but ssh-keys are eternal. If you fail to remove one, it sticks around and may come back to bite you if compromised at a later point. Certificates expire, after which they’re useless to an attacker.


One of the nice options in recent versions of OpenSSH is the "expiry-time" key option you can put in authorized_keys. There are also other options handy for restricting the usage of the key; for example, you can limit the key to be used only as a jump host. One can combine them so the users on the jump host will not be able to execute any command, and not even able to edit the authorized_keys file and remove the "expiry-time" option.
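For example, a key limited in both time and capability might look like this in authorized_keys (date and key illustrative):

  # usable only until the given date, no pty/commands, forwarding only
  expiry-time="20201231",restrict,port-forwarding ssh-ed25519 AAAAC3... alice@laptop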


I think that you're overreacting a tiny bit. One thing the OP's solution enables is rapid onboarding, quick/automatable generation for machine-to-machine roles, and the ability to actually do some RBAC (if that matters to you). To make their system fail safe, they do the exact same thing you are, except they only need to synchronize the block-list.

I don’t know about you, but at least at my org, waiting for tens of thousands of machines to sync an authorized keys file when doing a ton of onboarding or a very very selective off-boarding... I’d rather minimize the amount of machines I touch and doing that via an authorized_keys push via ansible is untenable at scale or volume.


I'm just taking the article's main point at face value. Certificates help with enabling access quickly (fast onboarding) but revocation needs every server to be informed about the change, since OpenSSH doesn't do OCSP.

If a single file change is untenable at scale or volume, how does your organization manage version deployments, or even something small like an emergency reconfiguration, without collapsing or having to cope until "patch tuesday"? What you describe sounds untenable to me.


I came to the exact same conclusion as you.

I can see this working nicely in relaxed companies that are comfortable waiting for the signed keys to expire (but those companies probably also won't like the inconvenience of expiring keys).

With that said, there is an advantage, in that a revoked key will eventually expire even if the cronjob breaks.


In terms of the problem of removing SSH public keys from authorized_keys, etc., we solve that problem with Ansible. If a developer leaves the company, he/she is removed and that Ansible task is applied to all the servers. Works fine too.


This. The author presents a different solution which is equally not scalable without automation and suggests you automate his new layer instead of just using ansible to automate handling of ssh keys.


And certificate revocation lists (CRLs) are potentially more brittle in my experience, too.

It’s pretty easy to audit a server for a key in a file or a key file on disk.


Hi, I'm new at using ansible. This is the module you're referring to right?

https://docs.ansible.com/ansible/latest/collections/ansible/...


Actually we wrote our own, just leveraging built-in modules within Ansible, but that one looks good. Give it a shot, and loop through a data source, e.g. a variable.
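A minimal version of that loop with the linked module (variable and user names illustrative):

  - name: manage developer SSH keys
    ansible.posix.authorized_key:
      user: deploy
      key: "{{ item.public_key }}"
      state: "{{ item.status }}"  # 'present' or 'absent'
    loop: "{{ developer_keys }}"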


Can't some Hacker News reader start a company that can do this right so I don't have to think about it?

It would be really nice to have a solution to this problem that doesn't require the user to go to first principles involving Alice, Bob, and Charlie.



Hashicorp vault? You probably have more secrets than SSH to protect.

If you're on AWS, also consider high-value-add built-ins like EC2 Instance Connect or SSM Session Manager, so you can manage host access via IAM.


Actually we did!

Disclaimer: main developer and co-founder here.

We started theoapp [1] in mid 2018 and we introduced it at FOSDEM 2019 [2]. We then launched AuthKeys [3], a company that offers it as SaaS.

From a technical standpoint, we chose a different approach than that described by this thread's O.P. More specifically, we implemented our service as a client/server suite that integrates with the standard sshd AuthorizedKeysCommand feature. Rationale and more technical details can be found at the links below.

We've been in "stealth mode" until now (this is our very first public post!), while running the service with selected customers since December 2019.

Recent developments will soon allow us to widen the product’s offer with a self-hosted solution of AuthKeys.

Feedback from the HN community is more than welcome: we'd like to hear whether the website is effective and clear enough on explaining how AuthKeys works [4]. We realized the topic comes with a certain degree of complexity for some potential users while others really want to quickly spot how the internals work and how we managed to make it secure to trust and easy to use.

We’ll be happy to answer any questions on the matter, and to offer promo codes to HN readers in case you wish to give it a go.

[1] https://github.com/theoapp/ [2] https://archive.fosdem.org/2019/schedule/event/theo_keys_man... [3] https://authkeys.io/ [4] https://authkeys.io/how-it-works



Solved all my headaches and saved a lot of time managing my (and teammates) SSH keys with this service: https://authkeys.io/


Nice, looks like they pretty much automated what I described in my article as a service.


Hi! Thanks for looking at AuthKeys, company co-founder here. The technical approach is slightly different: Theoapp, the open-source core behind AuthKeys, leverages the AuthorizedKeysCommand [1] sshd feature instead of using a CA to sign public keys. macno's post at [2] covers it with more context and references.

[1] https://linux.die.net/man/5/sshd_config [2] https://news.ycombinator.com/item?id=24620842


Smallstep has done a few blog posts on doing this simply using their tools.


For our users we use Kerberos authentication with AD, rather than SSH keys. Once a user has a TGT on their desktop, ticket forwarding takes care of SSH SSO from there.

The same TGT can also be used for other neat things like secure and transparently automounted homes and other directories via NFS4 or CIFS.


Right, but people seem remarkably reluctant to use the facility which is just there, and it might even be proscribed for systems that aren't "joined to AD", for unexplained reasons. If ephemeral certificates are also used, your ticket can presumably cover getting them too. It's probably not an option for systems with off-site users, though, since sites won't expose their AD systems or put something in front of them.


Kerberos (sssd-ad) backed authentication for SSH is really the best.

You no longer have to deal with SSH keys whatsoever and all the management that goes with them: When users get their access revoked on AD, they get their SSH access revoked as well. You can have group based authorization (only those in the SRE group can access this class of QA endpoints), so when dozens of people a month are being added and removed from the various groups, you don't have to worry about giving them keys/access. They can SSO from their laptops, so all they have to do is open PuTTY and they can connect away without even typing their usernames and passwords. etc.

Lots of these new generation "devops" and "full-stack developers" haven't had the experience of AD and Kerberos, so they spend all this time, blog posts, money, etc. to reinvent the wheel.

Sad really.


That's great until you work for a company that bought Macs for everyone for their design and upper-management likes to keep it that way.


You can do it on a Mac. I wouldn't recommend binding Macs anymore, since Apple broke FileVault for AD accounts in High Sierra (AD accounts don't get the secure token by default, which is needed to unlock the drive).

But since Catalina there's now a great Kerberos SSO plugin that you can push through MDM. Previously this was known as enterprise connect but only available from Apple professional services.


With an SSH jump box you could also centrally manage access, revoke instantly, and not have to sign users' keys repetitively.


this.

if you have access to all your machines through the public internet you're doing it wrong (ie stop solving a problem that you should not have in the first place).

The proper way to do this is to have a bastion host (or a jump host) where you strictly control access. Someone leaves you revoke their access. (extra: have the access be protected via 2FA that depends on the person having an active [LDAP/corp] account)

There is another layer to this where the access to the bastion is allowed only from the corporate network (ie you need to VPN into the corpnet to be able to access the jump host). You leave, you no longer can access the bastion.

the ultimate level to this is that you should not ssh willy nilly into your production hosts (to the degree that this should not even be possible). you should have a solution for pushing the logs + instrumentation (ie metrics) that makes it so that you don't need to do this in production.
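Client-side, the bastion pattern is painless these days with ProxyJump (hostnames illustrative):

  # ~/.ssh/config
  Host prod-*
      ProxyJump bastion.example.com

so every connection to a production host is transparently routed through the bastion, where access is controlled and logged.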


I think I understand how signing keys removes the need to update every server when adding a user to the system, but it seems like that comes at the price of having to update them all when someone leaves to revoke their certificates. What is the benefit of pulling revoked certificates to all servers periodically vs pulling authorized_keys files? Is it possible to work around this at all, e.g. conceive a system that eliminates all needs to push to servers? Is an online lookup like LDAP the way to go there?


If you use certs, then you also get Certificate revocation lists. You don't just trust the CA, you also trust the CA's CRL.

https://en.wikipedia.org/wiki/Certificate_revocation_list
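In OpenSSH terms, that's the RevokedKeys option pointing at a KRL (key revocation list), roughly:

  # /etc/ssh/sshd_config
  RevokedKeys /etc/ssh/revoked_keys.krl

  # revoke a certificate by serial number (needs the CA public key)
  echo 'serial: 42' > /tmp/revoke.spec
  ssh-keygen -k -u -f /etc/ssh/revoked_keys.krl -s /etc/ssh/ca.pub /tmp/revoke.spec

Servers still need to learn about new revocations, but the KRL is tiny and cheap to distribute.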


The certificates have an expiration date, so if something goes wrong on that end, at least it will expire after some time. Also, this takes care of role-based access - which you might not have with an authorized_keys file solution?


Certificates should have an expiration date; any system implemented with this pattern should expire them after a couple of hours.

If manual, I would even consider doing it every week with the caveat that it would be a large attack vector.


Hashicorp Vault is able to issue short-lived SSH certificates. It's pretty cool and you don't have to deal with revocations.
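Roughly like this, with Vault's SSH secrets engine (mount path, role name and TTL are whatever you configured):

  vault write -field=signed_key ssh/sign/dev \
      public_key=@"$HOME/.ssh/id_ed25519.pub" ttl=30m \
      > ~/.ssh/id_ed25519-cert.pub

The returned certificate expires on its own, so there's no revocation list to push around.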


Once it reaches this level of complexity, one might consider setting up some kind of centralised authentication system. You can hook openssh to pull public keys off openldap, and possibly add some additional checks on it (like group membership).

This way when a developer leaves the company, you just remove their entry in the LDAP DIT and all and every access is removed.


I saw SSH keys and was ready to get on my moral high horse about SSH certs, only to read the article and be talked down from it. Nice guide!


Related (IMO a somewhat better explanation of why): https://smallstep.com/blog/use-ssh-certificates/ (note: you can also use Hashicorp Vault for the same)


Luckily for my team at least, we're using NixOS for everything so the `authorized_keys` are part of the system's configuration, which is all defined in code. This means we don't have the problem of any machine's configuration going out of sync with the rest.


Is the updated system configuration applied on all machines automatically?

I haven’t used Nix personally but I’ve used Guix - I want to go further with it and I’m curious how you’re managing this.


Yes. We're using NixOps which does this by default, though of course it can be configured otherwise.


What about having different access roles for certain people to certain servers?

Is that covered in the NixOS approach as well?

Sounds interesting!


What happens if a user manually places an ssh key in ~/.ssh/ ? Do you have a way to automatically remove those?


You can just modify the sshd_config not to look at the per-user authorized_keys file.
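i.e. something like:

  # /etc/ssh/sshd_config
  AuthorizedKeysFile none

after which keys dropped into ~/.ssh/authorized_keys are simply ignored.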


Anyone have a good solution for doing this with host keys? I’m not big on TOFU. It’d be nice if cloud-init would generate some keys and pass those off to a CA to sign (maybe vault?). That’s my rough idea, but I’m curious what others have come up with.


Am I mistaken? The author notes under alternative solutions that the alternative solution is a single point of failure and that this is a problem.

Is the CA not a single point of failure also?

I agree with one user's comment: ansible can solve all of this too...


Of course, if your CA is taken over, you are in big trouble.

However, the attack surface here is much smaller, given you can keep this on a local machine which is not even connected to a network if you like.

The single point of failure for central management solutions is in having them as a running service - if that goes down, you are in trouble.

And that is far more likely than your CA becoming compromised somehow.

Also, in practice you can run two or more CAs (we do that), so you can deprecate a full CA if the need arises.


The CA is not a single point of failure in the traditional sense where a single machine going down has a large impact.


How does ansible solve this?


Playbooks to update all keys on all machines. Run one script and they are all updated.


"Every developer needs access to some servers for example to check the application logs"

What year is this? 2010? No, application logs shouldn't reside on servers, and Devs shouldn't need access to servers either.


I’ve seen a variety of solutions to this working on different products but by far the best I have worked with so far is OS Login from Google with their Identity Aware Proxy product.

It allows developers to manage their own certificates (adding a new machine or rotating a key) whilst allowing us to use the GCP IAM tooling to grant access to certain machines - all without hosting bastion servers ourselves or exposing the servers themselves to the public internet.

As it's based on PAM, it's also been relatively painless to integrate with other functions like audit logging.

If you’re using GCP already - I’d highly recommend it!


Maybe I've missed it, but in the first setup (without roles), where on the server is it defined which user someone can SSH into the machine as now? There's only this global setting:

  TrustedUserCAKeys /etc/ssh/ca.pub
Where can I define that a user can log in e.g. as "john" on a server? Is this the '-I USER_ID' part when creating the cert? If so that would mean that a certificate is bound to one username only and that every user needs his own account with exactly that name on every server he has access to, right?


Article content aside, I’m a big fan of the mobile site design.


Very nice! Thank you for breaking it down in such a succinct and straight to the point way, this is definitely something to bookmark.


Suppose I grant a user access to system X for 2 weeks via a cert. When this user then requires 8 hours of access to system Y, can I just provide an additional cert with this claim and have the users ssh client figure it all out?

Or does this scenario either require the user juggling certs, or me generating certs containing all concurrent claims?


Great question, I haven't tested this, yet.

One thing I'm sure would work: if the user has generated two separate public-private key pairs, you can sign two different certs.

Not so sure about having several certs for the same key.

If I was in that situation, I would probably generate a new cert which contains the concurrent claims and is short-lived, but we also don't have extremely many different roles.


Depending on your scale, this can be made much easier and simpler - just manage the keys using something like ansible. Got a new developer? Push out the key to the systems they need access to. Developer leaving? Remove the key from the systems. Configure the ssh server config to not use keys in home dirs.


I have team members that run on Windows, using private-public keys with Putty and Pageant for passphrases. When you read "put the third certificate" under ".ssh" directory, how would you do it on windows?


Windows 10 has native OpenSSH built in now; it has a lot more features than PuTTY, and won't ever have the long lag time for features like certificates.


%USERPROFILE%\.ssh\


Thanks, in my case that directory didn't exist until I looked up how to properly start ssh-agent under Windows.


I prefer AWS SSM, or something like Teleport which offloads user management to AWS or an IdP. The CA still needs management and special handling and policies if you want to be compliant.
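For reference, with the SSM agent and IAM permissions in place, shell access is just (instance ID illustrative):

  aws ssm start-session --target i-0123456789abcdef0

with no inbound SSH port or key distribution involved.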


Annoyingly SSM still requires managing ssh keys and users somehow; it really only takes care of the network layer. I have been thinking of combining SSM with EC2 Instance Connect to deal with that issue, but haven't gotten around to actually implementing it.

Also SSM has some annoying glitches like e.g. https://github.com/aws/amazon-ssm-agent/issues/274

Edit: found this blog post that shows how the two can work together https://skorfmann.com/blog/aws-ecs-instance-connect-meets-aw...


> Annoyingly SSM still requires managing ssh keys and users somehow

No, it doesn't require managing SSH keys. SSM can work with SSH keys but they're not required. The downside I found is that all users connect with the same uid/gid, which does not work great for bastion hosts where user profiles are required (e.g. to connect to the EKS API). You could work your way around this with document permissions per user, but it's not great.

The user management is handled by IAM.


To be more specific, SSH keys (and their management) are required for SSH access, something which I took as a given considering the thread. Sure, the rest of SSM is not reliant on SSH or SSH keys in any other way.


Love the idea, but revocation becomes the problem now.


It's even better if you force a hardware two-factor authentication before you grant the ssh cert.


OpenSSH also supports KRLs (key revocation lists), which can be used to more efficiently manage revoked certificates.


Why not just use something like Foxpass? It’s well worth the money!


how does one manage new user creation dynamically? is the expectation here that there is a shared user that is preconfigured on the target host?


We use authorized_keys and that file is managed by ansible across our servers. Adding/removing users consists of updating a yaml file (list of tuples {username, public key, status}) and running the playbook.
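So the playbook's data source looks something like (names and keys illustrative):

  developer_keys:
    - { username: alice, public_key: "ssh-ed25519 AAAAC3... alice", status: present }
    - { username: bob, public_key: "ssh-ed25519 AAAAC3... bob", status: absent }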


When do you need to access a server via SSH?



