
Mine was about NFS.

One or two years ago, I was tasked with solving the following issue:

Installing an obsolete RHEL 4 on a brand new computer, with all the driver issues that entails. Virtualization was not an option.

The solution I chose was to "backport" the latest RHEL 5 kernel to RHEL 4. After a few weeks of work repackaging the kernel, adapting the mkinitrd script, and a lot of headaches around the install ISO (it was actually the hardest part, with a lot of hacks around anaconda), I finally managed to get a working server.

Then a few days later, I was notified of a regression.

A somewhat crazy application that managed configuration files on various devices of this particular infrastructure had stopped working.

Digging into this application, I discovered that it was "pushing" the updated conf files through NFS. More exactly, it notified a service on the targeted device, and then this device mounted an NFS share from the RHEL 4 server, retrieved the conf file, and unmounted the NFS share (I told you, "crazy").

The targeted devices were using RH 7 (RH, not RHEL, I'm talking about the one with a 2.4 kernel).

Strangely enough, mounting the share by hand and doing an ls showed that the files were indeed present, and there were no permission issues.

Reading the source code a little further and looking at quite old Qt versions (the service on the device was Qt based), I finally managed to find which line of code was not working: it was a simple readdir().

So I wrote a simple C program that just did the readdir, and I finally managed to reproduce the bug: readdir was indeed not listing the files present in the directory.
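A minimal sketch of what such a test program could look like (this is not the original code; the function name `list_dir` is made up for illustration). It walks a directory with readdir() and prints the inode number next to each name, which is the value an NFS client derives from the fileid:

```c
#include <dirent.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Print every entry readdir() returns for `path`, with its inode number.
 * Returns the number of entries other than "." and "..", or -1 on error.
 * list_dir is a hypothetical helper name, not taken from the original program. */
int list_dir(const char *path)
{
    DIR *dir = opendir(path);
    if (dir == NULL) {
        fprintf(stderr, "opendir(%s): %s\n", path, strerror(errno));
        return -1;
    }

    int count = 0;
    int err = 0;
    struct dirent *entry;

    for (;;) {
        errno = 0;                 /* readdir() returns NULL on both EOF and error */
        entry = readdir(dir);
        if (entry == NULL) {
            err = errno;           /* 0 means end of directory */
            break;
        }
        if (strcmp(entry->d_name, ".") == 0 || strcmp(entry->d_name, "..") == 0)
            continue;
        printf("%llu\t%s\n", (unsigned long long)entry->d_ino, entry->d_name);
        count++;
    }

    closedir(dir);
    if (err != 0) {
        fprintf(stderr, "readdir(%s): %s\n", path, strerror(err));
        return -1;
    }
    return count;
}
```

Calling `list_dir()` on the mounted share and comparing the count with what ls shows is enough to tell whether readdir is dropping entries.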

But that was not helping me much... So I decided to do some network captures, and everything looked OK. I did the same network captures with the old RHEL 4 kernel, and they looked exactly the same.

After hours of staring at my screen with the two captures side by side, I finally spotted a subtle difference. The fileids (64 bits) were padded with zeros in the upper 32 bits with the old kernel (ex: 0x0000000054456F97), and this was not the case with the new kernel (ex: 0xA489097654456F97).

Basically, the old RH 7 was violating the NFS RFC by only handling 32-bit fileids (by the way, the RFC predates RH 7 by 5 years).

As it was impossible to upgrade the old RH 7, I finally backported the "bug" into my "RHEL 5 kernel on RHEL 4" by adding a small two-line patch to the RHEL 5 kernel that ensured 32-bit fileids.
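To make the 32-bit constraint concrete, here is a userspace sketch. It is not the actual kernel patch: the function names are hypothetical, and XOR-folding the upper half into the lower half is just one assumed way of squashing a 64-bit fileid so that it survives a client that only keeps 32 bits.

```c
#include <stdint.h>

/* True when the upper 32 bits of a 64-bit fileid are all zero, i.e. the
 * value is unchanged when stored in a 32-bit field. */
int fileid_fits_32(uint64_t fileid)
{
    return (fileid >> 32) == 0;
}

/* Hypothetical squashing step: XOR the upper 32 bits into the lower 32 bits
 * and zero the upper half. Values that already fit are returned unchanged.
 * Distinct fileids can collide after folding; that is the price of the hack. */
uint64_t fileid_fold_32(uint64_t fileid)
{
    return (uint32_t)(fileid ^ (fileid >> 32));
}
```

With this, `fileid_fold_32(0xA489097654456F97)` gives `0xF0CC66E1`, which a 32-bit-only client can store without losing bits.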




That server behaviour change is an excellent illustration of why "backwards compatible" isn't a simple black-and-white concept.


Wow. Impressive, but I sure wish that drive and creative energy could have been put to getting off RHEL 4... My hat's off to you.


There were various reasons why it was difficult to get off RHEL 4.

It was a system with a lot of cruft accumulated over time, and a lot of domain-specific applications that needed to be ported to a newer environment (new OSes, migrating from Qt3 to Qt4/Qt5 and other newer libraries, millions of lines of code). They were actually planning to move to a newer base system when I finally left the company. We did it for other parts of the system, and it was a process of several years.

I also have another horror story about NFS and RHEL 4 (from nearly the same time as the first one).

I was tasked with updating a small internal infrastructure used by another project (replacing an old Active Directory, an old Exchange server, adding a few other services, renewing the hardware, reworking the backups, etc).

For the most part, I managed to completely replace the old stuff with newer things (postfix+dovecot, openldap, samba4, CentOS 7, bind, bacula, everything hosted in VMs).

There were a few things that were too hard to migrate without blocking the users too much.

So, to at least migrate away from a 10+ year old server, I made the choice to just convert it into a VM.

This server was hosting various things: 300 subversion repositories, a custom tracker using php 4, a crazily deployed viewvc (apache listened on 10 different ports, with some static pages pointing to the different ports), probably some other stuff I didn't know about, and, you guessed it, NFS.

There was a lot of stuff relying on this particular NFS server (lots of scripts with the server name and paths hardcoded).

Also, the company had decided that it was completely unacceptable to risk losing one line of code on any developer desktop. So they put their homes on this NFS server (And if you are wondering, compiling code on an old NFS server, over a 100MB/s network, with 10 other devs doing the same thing, it is just miserable...). The amount of data (~1TB) was quite large for a server that old. And yeah, it was also acting as a NIS server.

So I started preparing the migration:

* booting the new server from a live CD

* assigning it a temporary IP

* creating the partitions by hand

* rsyncing the system from the live, old system

* and a few other things like installing grub

(basically a Gentoo installation minus the compilation and package selection parts)

Then I checked the services, and everything was working correctly.

So I scheduled a final downtime in the late afternoon for the final synchronization, and at the scheduled hour I shut down all the services except ssh and did the final rsync. I finally shut down the old server and switched the new server from its temporary IP to the old one.

A final check showed me that most services seemed to basically work.

I arrived the next morning, and everyone was panicking: the NFS server was behaving badly. I logged on to the server and saw that the partition used by NFS was mounted read-only. I rebooted the server, and again the partition became read-only.

I urgently shut down the new VM and restarted the old server.

Then I investigated the issue and finally found a likely candidate: https://access.redhat.com/security/cve/CVE-2006-3468

The good thing was that, since it was a CVE, I managed to find an exploit, and it did reproduce the issue on my system.

The issue was triggered because some desktops had not been shut down during the migration: they saw the old server disappear and a new one reappear under the same IP. Consequently, they were sending file handles from the old server to the new server. As described in the CVE, the NFS server didn't handle bogus file handles correctly and remounted the partition read-only.
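For contrast, a server that handles an unknown file handle correctly rejects it, and the client sees the request fail with ESTALE ("Stale file handle"). A small sketch of how a client-side check could detect that condition, assuming a POSIX system; `probe_stale` is a hypothetical helper name:

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

/* Probe `path` with stat().
 * Returns 0 if the call succeeds, 1 if it fails with ESTALE (the handle the
 * client holds is no longer valid on the server), -1 for any other error.
 * probe_stale is a made-up name for illustration. */
int probe_stale(const char *path)
{
    struct stat st;

    if (stat(path, &st) == 0)
        return 0;

    int err = errno;               /* save errno before calling anything else */
    if (err == ESTALE) {
        fprintf(stderr, "%s: stale NFS file handle, remount needed\n", path);
        return 1;
    }
    fprintf(stderr, "%s: %s\n", path, strerror(err));
    return -1;
}
```

A return value of 1 on a mounted share is the expected symptom on desktops that stayed up across a server swap like this one.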

I updated the kernel, planned another maintenance window, and everything went fine.

A final note: the installed kernel version was 2.6.9-41, and the one that fixed the issue was 2.6.9-42. If this machine had been updated ONCE in its life, the migration would have gone smoothly...

I left this company a few months later, with somewhat of a deep hatred for NFS ^^.


Whoa. You're brave. This is what I love about sysadmins. They keep the show on the road, even in trying circumstances.



