KernelCare – Kernel updates without rebooting the system (kernelcare.com)
45 points by yiedyie on May 8, 2014 | 34 comments



LWN has recently covered different methods of doing this that SuSE and Red Hat have developed.

The initial kGraft submission: https://lwn.net/Articles/596854/

The first kpatch submission: http://lwn.net/SubscriberLink/597407/cd7861dacf2be0e9/

Back in 2008 they covered how ksplice works: https://lwn.net/Articles/280058/


Not that this isn't a cool product, but isn't it important to know your server can reboot?


I really agree.

You need to be able to reboot a server to patch it. If it's so important that your server never reboots, it shouldn't be accessible by the general public.

Assuming that you run some sort of website or internet service, you need to be able to deal with failing hardware/software anyway. Either you're big enough to have redundant servers or you're small enough that a reboot won't matter.

I have similar feelings towards backups and OS upgrades. You should be able to do a clean install and deploy your "stuff" on a blank server or VM somewhat quickly. If you rely 100% on, say, VMware snapshots, you're doing something wrong.


In many cases, you do not need to be able to reboot a server to patch it. We patch the kernel regularly, allowing our customers to be protected without the hassle of rebooting.

Applying patches to the kernel to fix security vulnerabilities is orthogonal to knowing you can reboot your server. Rebooting servers is about more than "will it come back up"; it also includes all the headache of scheduling and the risk of what happens if it doesn't come right back up. But, because servers can and do crash, hardware needs to be upgraded, etc., you certainly need to know you can bring your system back up from a cold state.

Applying a hot fix means you are immediately protected from a vulnerability. Few companies, unless they are running at massive scale, have an infrastructure that allows random servers to reboot (good on all those that do, regardless of scale!). Instead, most companies have systems and processes that make rebooting painful, especially for the IT guys who wind up working at 10pm on Saturday so they don't affect most people. And, until the reboot window opens up (whenever that is), your server is vulnerable.

Disclaimer: I work on the Oracle Ksplice team.


If you're sufficiently big and have many, many servers, a full reboot cycle on all systems may take enough time that the ability to patch without having to reboot is even more valuable than for small companies.


Sure, on a schedule. If a security patch comes out mid-day, it would be nice to be able to apply it without downtime.


If downtime is a concern in any way, wouldn't you have at least two servers and thus the ability to do rolling reboots?
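(For what it's worth, a rolling reboot doesn't need much machinery. Below is a minimal Python sketch of the idea under made-up assumptions: the lb-ctl drain/enable commands, the hostnames, and the /healthz endpoint are placeholders for whatever your load balancer and health checks actually are.)

    #!/usr/bin/env python3
    # Rough sketch of a rolling reboot: take one node out of rotation,
    # reboot it, wait until it answers health checks, then move on.
    import subprocess
    import time
    import urllib.request

    SERVERS = ["web1.example.com", "web2.example.com"]  # hypothetical hosts

    def drain(host):
        # Hypothetical: tell the load balancer to stop sending traffic to host.
        subprocess.run(["lb-ctl", "drain", host], check=True)

    def undrain(host):
        subprocess.run(["lb-ctl", "enable", host], check=True)

    def healthy(host):
        try:
            with urllib.request.urlopen(f"http://{host}/healthz", timeout=5) as r:
                return r.status == 200
        except OSError:
            return False

    for host in SERVERS:
        drain(host)
        subprocess.run(["ssh", host, "sudo", "reboot"])  # connection will drop
        time.sleep(60)                                   # give it time to come back
        while not healthy(host):
            time.sleep(10)
        undrain(host)
        print(f"{host} rebooted and back in rotation")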


This is being marketed to shared web hosts with cPanel. This is a monolithic server architecture, so everything (web, mail, database, ftp, etc etc) runs on one server for each account. Hence, if you take the server down for a reboot, your customer is completely denied service until it comes back up.


Sometimes downtime is in that grey area where you'd like to avoid it but it's not worth doubling your hardware costs.

Or maybe your other server is already down for maintenance. Or you want both servers to be patched as soon as possible - simultaneously would be best.


Of course it can. It's just not necessary to reboot with a solution like that.


Until you reboot and it doesn't :)


And if you do reboot frequently but then at one point it doesn't work?

You either have a long downtime or have a secondary ready to take over. I understand you're talking about avoiding reliance on a false sense of confidence; that's a good rule of thumb, but it shouldn't get in the way of better solutions:

Test the secondary more often, and automate the recovery, and you'll cover more failure cases.

You don't have the resources for a secondary? Then you're too small and this tool isn't for you either; keep it simple, and just reboot.

Moreover, in some contexts it is not desirable to open a maintenance window until designated moments like weekends. This means that you could leave your system vulnerable for up to a whole week.


I agree with what you're saying about having better recovery, but I think you need some balance between the two.

I'd rather look through a week's worth of changes to work out what borked the restart than two years' worth!


>And if you do reboot frequently but then at one point it doesn't work?

Then you restore from yesterday's backup. It's probably still on-site. If a kernel live patch fails, you might not notice for a long time.


My point is you should always be prepared that a reboot doesn't work, since it could happen at any time between your maintenance windows.

Frequently exercising a feature is a good idea, especially when you depend on that feature. That's why exercising backup restores is important; you want that to work when you need it. Do you really need a production machine to boot when you need it? I think that's the wrong target to optimize for, given that that failure mode is already covered by backups and a cold/hot secondary, as long as you do exercise them frequently.

That said, you do have a point about whether a live patch is reliable, but this doesn't have anything to do with whether increasing the average uptime of a host is a good or bad idea w.r.t. ensuring that the machine can start or not.

TBH I wouldn't personally feel very comfortable using this live update thing, but I have no experience with it. I'd be curious to know more though.


Shouldn't one design their systems so that a single machine rebooting is not fatal?

Generally I make my servers reboot at least once every 24 hours.


Depends on the environment. If you are a big enough business, then yes this would be a good idea. Otherwise, if you are small and don't have much money for infrastructure then perhaps not.

Though I have to ask: why do you reboot your servers every 24 hours?


I was gonna say 24 hours is short. I think a good time is about ~1 month. I know HP-UX suggests 2 weeks, and that OS has a history of 10-15 year uptimes; stupid Y2K bug killing server uptimes.


> "Shouldn't one design their systems so that a single machine rebooting is not fatal?"

In an ideal world yes. Not everybody has that opportunity though.

> "Generally I make my servers reboot at least once every 24 hours."

That can't be true. There's Belkin routers out there with greater uptimes. Christ, I think even my old Windows ME PC had longer uptimes than that!

No sane sysadmin would reboot a server multiple times a day. If nothing else, it's just a needless waste of electricity and any other resources it might drain during the power cycle (eg network IO if a SAN hosted VM).

I'm amazed at just how much some people will exaggerate figures to make a point....


What if reproducibility is more important than uptime?

How does memory fragmentation affect performance after more than 24 hours?

What happens when your server has to restart from a cold cache scenario?

What happens when your server is down?

What about that setting that a sysadmin manually applied to the server 6 months ago to fix some issue and isn't saved in the server config?

By forcing a condition people try to avoid you get good at dealing with those situations.

How often do your database servers actually fail over?

How confident are you in your code and systems to actually fail over properly?

How do you prime the caches when your memcached servers reboot?


> What if reproducibility is more important than uptime?

Reboots aren't something you need to reproduce multiple times a day

> How does memory fragmentation affect performance after more than 24 hours?

If this is a massive issue then your server daemon is piss poor. However I think you're just clutching at straws here.

> What happens when your server has to restart from a cold cache scenario?

That doesn't mean you have to reboot your server several times a day

> What happens when your server is down?

It would only be down because you're rebooting the bloody thing :p

> What about that setting that a sysadmin manually applied to the server 6 months ago to fix some issue and isn't saved in the server config?

You don't need to reboot a live server several times a day to apply and test server config.

> By forcing a condition people try to avoid you get good at dealing with those situations.

Advocating repeatedly breaking live servers as practice to know what to do when they break accidentally is the dumbest thing I've read in a while. I don't need to repeatedly walk in front of a bus to learn not to walk in front of a bus. If you need to practice test scenarios then do so on a test system - that's what they're there for!

> How often do your database servers actually fail over?

Through my own negligence? They haven't yet. Through dodgy code rushed live by our developers; more often than I'd like to admit.

> How confident are you in your code and systems to actually fail over properly?

Code: not very. But I don't manage that. Systems: very confident. But I do sane load and disaster testing on test systems; and monitoring and logging on all systems to highlight potential issues before they completely snap.

> How do you prime the caches when your memcached servers reboot?

We're not stupid enough to reboot all our live infrastructure (and their redundancies) when they're in use. Let alone to do it multiple times a day.

--------------------------------------------

You're not convincing me that you need to reboot your servers multiple times a day; if anything, you're just convincing me that you don't have a proper dev and test infrastructure in place. And that's far more dangerous than any of the other issues you or I've raised thus far.


How does this relate to ksplice/kpatch/kgraft?


It's in the FAQ.

>>Q: Are you using same technology as the one from Oracle, Red Hat or Suse?

>>No. Our technology was fully developed in house, and uses different methods to generate the patches as well as to apply them. We believe that our method of generating patches is significantly more efficient than what we have seen or read to date from other vendors.

But here's the concerning thing:

>>Q: Do you apply all the patches from the newest kernels?

>>We apply only security patches. Sometimes we might decide to apply patches for critical bugs.

That sounds like the kernel you run will slowly diverge from upstream. Different security teams make different decisions. I would value this service more if it came with the blessing and cooperation of my distro's security and kernel teams.


Unfortunately critical bugs can also be major security issues in many ways.


Agreed. Although the policy implies that if that is the case, then a patch would be made available for those bugs.


The site is pretty light on details; I'd want to know more about how it works before moving from kSplice...


We can download the kernel module. It seems so simple. http://patches.kernelcare.com/kmod_kcare.tar.gz


Yes, but we still need to make sure that on the next restart the kernel is patched.
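(One rough way to sanity-check that is sketched in Python below: naively compare the running kernel version against the kernel images sitting in /boot. Paths and version schemes differ per distro, so treat this as an illustration rather than a drop-in check.)

    #!/usr/bin/env python3
    # Live patches protect the running kernel, but the image on disk still
    # has to be current so the next boot isn't vulnerable. This compares the
    # running kernel version with the vmlinuz-* images found in /boot.
    import glob
    import platform

    running = platform.release()  # e.g. "3.13.0-24-generic"
    installed = [p.split("vmlinuz-", 1)[1] for p in glob.glob("/boot/vmlinuz-*")]

    print(f"running kernel:    {running}")
    print(f"installed kernels: {', '.join(sorted(installed)) or 'none found'}")

    if running not in installed:
        print("warning: the running kernel has no matching image in /boot")
    elif max(installed) != running:
        # naive string comparison; a real check would parse versions properly
        print("note: a newer kernel image is installed and will be used on next boot")
    else:
        print("the running kernel is the newest installed image")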


Looks good. Our servers stay up for years at a time with KSplice, but now that Oracle owns it and it's not available to new customers, I think we might have to find something new soon.


The drawings are pretty shoddy and amateurish.

Doesn't inspire too much confidence in me...


It does to me, because real programmers cannot draw ;)

But jokes aside, they just didn't spend their money on an artist; that does not imply that the product is of the same quality as the drawings.


I don't know why you are in gray, but this is a valid argument.

If I am about to entrust some company with access to the kernel on my live boxes, you bet I'd be looking very closely at who the hell they are and be expecting them to be a reputable and stable entity. These silly illustrations would've been OK if they were on RedHat's site, but if you are just starting up, I would strongly suggest putting (much) more effort into establishing your credibility.


I hope you can see the irony of your position when you're talking about the Linux kernel.


If you are referring to the Tux, then you are missing the point. The submitted project has very ambitious goals but comes from an unknown source with no history. It also makes a deliberate effort to look silly, which is the exact opposite of what they should be doing if they are aiming at actual production use in live environments.



