Mine was about NFS. One or two years ago, I was tasked to solve the following is...

mjw1007 · on Sept 15, 2017

That server behaviour change is an excellent illustration of why "backwards compatible" isn't a simple black-and-white concept.

atsaloli · on Sept 15, 2017

Wow. Impressive, but I sure wish that drive and creative energy could have been put to getting off RHEL 4... My hat's off to you.

kakwa_ · on Sept 15, 2017

There were various reasons why it was difficult to get off RHEL 4.

It was a system with a lot of cruft accumulated over time, and a lot of domain-specific applications that needed to be ported to a newer environment (new OSes, migrating from Qt3 to Qt4/Qt5 and other newer libraries, millions of lines of code). They were actually planning to move to a newer base system when I finally left the company. We did do it for other parts of the system, and it was a multi-year process.

I also have another horror story about NFS and RHEL 4 (from nearly the same time as the first one).

I was tasked with updating a small internal infrastructure used by another project (replacing an old Active Directory, an old Exchange server, adding a few other services, renewing the hardware, reworking the backups, etc.).

For the most part, I managed to completely replace the old stuff with newer things (postfix + dovecot, openldap, samba4, CentOS 7, bind, bacula, everything hosted in VMs).

There were a few things that were too hard to migrate without blocking the users too much.

So, to at least migrate away from a 10+ year old server, I chose to simply convert it into a VM.

This server was hosting various things: 300 subversion repositories, a custom tracker using PHP 4, a crazily deployed viewvc (apache listened on 10 different ports, with some static pages pointing to the different ports), probably some other stuff I didn't know about, and, you guessed it, NFS.

A lot of stuff relied on this particular NFS server (lots of scripts with the server name and paths hardcoded).

Also, the company had decided that it was completely unacceptable to risk losing one line of code on any developer desktop. So they put their homes on this NFS server (and if you are wondering: compiling code on an old NFS server, over a 100MB/s network, with 10 other devs doing the same thing, is just miserable...). The amount of data (~1TB) was quite large for a server that old. And yeah, it was also acting as a NIS server.

So I started preparing the migration:

* booting from a livecd

* assigning a temporary IP to the new server

* creating the partitions by hand

* rsyncing the system from the live, old system

* and a few other things like installing grub

(basically a Gentoo installation minus the compilation and package selection parts)
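For readers who want a rough picture of the rsync step, here is a minimal sketch in Python that only builds the command line and does not run anything. The helper name `build_clone_command`, the host name, the target path and the exclude list are all illustrative assumptions, not the exact command used at the time.

```python
# Hypothetical helper: builds (but does not run) an rsync command that
# pulls a live system's root filesystem into a freshly partitioned target.

# Pseudo-filesystems and scratch directories that should not be copied.
# This list is an assumption; adjust it to the actual system.
PSEUDO_FS = ["/proc/*", "/sys/*", "/dev/*", "/tmp/*", "/mnt/*"]


def build_clone_command(old_host, target_root, excludes=PSEUDO_FS):
    """Return the rsync argument list for copying old_host:/ to target_root."""
    # -a: archive mode, -H: preserve hard links,
    # --numeric-ids: keep raw uid/gid numbers instead of mapping by name,
    # --delete: remove files on the target that vanished from the source
    #           (useful for the final synchronization pass).
    cmd = ["rsync", "-aH", "--numeric-ids", "--delete"]
    for pattern in excludes:
        cmd.append("--exclude=" + pattern)
    cmd.append("root@" + old_host + ":/")
    # Normalise the destination so it always ends with exactly one slash.
    cmd.append(target_root.rstrip("/") + "/")
    return cmd
```

Calling `build_clone_command("oldserver.example", "/mnt/newroot")` returns a list that could be passed to `subprocess.run`; running it a second time during the downtime window only transfers what changed since the first pass.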

Then I checked the services, and everything was working correctly.

So I scheduled a final downtime in the late afternoon for the final synchronization, and at the scheduled hour I shut down all the services except ssh and did the final rsync. I finally shut down the old server and switched the new server from its temporary IP to the old one.

A final check showed me that most services seemed to basically work.

I arrived the next morning, and everyone was panicking: the NFS server was behaving badly. I logged on to the server and saw that the partition used by NFS was mounted read-only. I rebooted the server, and again the partition became read-only.

I urgently shut down the new VM and restarted the old server.

Then I investigated the issue and finally found a likely candidate: https://access.redhat.com/security/cve/CVE-2006-3468

The good thing was that, given it was a CVE, I managed to find an exploit, and it did reproduce the issue on my system.

The issue was triggered because some desktops were not shut down during the migration: they saw the old server disappear and a new one reappear under the same IP. Consequently, they were sending file handles from the old server to the new server. As described in the CVE, the NFS server didn't handle bogus file handles correctly and remounted the partition read-only.
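For contrast, a healthy server normally just rejects a handle it doesn't recognise, and the client sees a "Stale file handle" error (ESTALE). A minimal client-side sketch of checking for that, in Python; the function name `is_stale_handle` and the injectable `stat` parameter are illustrative assumptions, and this is not the exploit mentioned above.

```python
import errno
import os


def is_stale_handle(path, stat=os.stat):
    """Return True if accessing path fails with ESTALE (stale NFS handle).

    The stat parameter exists only so the check can be exercised without
    a real NFS mount; by default it is os.stat.
    """
    try:
        stat(path)
    except OSError as exc:
        # ESTALE is what a client normally gets when the server no longer
        # recognises the file handle the client is holding.
        return exc.errno == errno.ESTALE
    # The call succeeded, so the handle is still valid.
    return False
```

Any other error (a missing file, a permission problem) returns False, so the function only flags the specific stale-handle case.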

I updated the kernel, planned another maintenance window, and everything went fine.

A final note: the installed kernel version was 2.6.9-41, and the one that fixed the issue was 2.6.9-42. If this machine had been updated ONCE in its life, the migration would have gone smoothly...
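A pre-migration check along these lines could have flagged it. This is only a sketch in Python: the function names are made up for illustration, the parsing assumes RHEL-style release strings such as "2.6.9-41.EL", and the fixed version is simply the one quoted above.

```python
import re


def parse_release(release):
    """Turn a release string like '2.6.9-42.EL' into (2, 6, 9, 42)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognised kernel release: " + release)
    return tuple(int(part) for part in match.groups())


def is_older_than(running, fixed):
    """Return True if the running kernel predates the fixed one."""
    # Tuples of ints compare numerically, so 2.6.9-100 is correctly
    # treated as newer than 2.6.9-42 (a plain string compare would not).
    return parse_release(running) < parse_release(fixed)
```

On a live machine the running release can be obtained with `platform.release()` and passed as the first argument, with "2.6.9-42" as the second.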

I left this company a few months later, with somewhat of a deep hatred for NFS ^^.

atsaloli · on Sept 16, 2017

Whoa. You're brave. This is what I love about sysadmins. They keep the show on the road, even in trying circumstances.