How we spent two weeks hunting an NFS bug in the Linux kernel (about.gitlab.com)
313 points by fanf2 on Nov 28, 2018 | 67 comments



I spend my days chasing bugs like this in the FreeBSD kernel, and make heavy use of dtrace. I expect that using something like bpftrace(1) might have accelerated their debugging as compared to inserting stack traces and prints...

(1): http://www.brendangregg.com/blog/2018-10-08/dtrace-for-linux...
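
For example, a sketch of the kind of one-liner I mean (assuming bpftrace is installed and that the nfs_getattr kprobe exists on your kernel):

  # count kernel stacks leading into nfs_getattr, to see what is issuing all the GETATTRs
  bpftrace -e 'kprobe:nfs_getattr { @stacks[kstack] = count(); }'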


We're getting there! :) I've asked my team to balance learning K8s with learning lower-level debugging tools like bcc and its cohort.


That's awesome. I see too many younger staff who know the hotness but not the fundamentals, and not enough employers who care. I've had to strace a K8s cluster before, where it turned out the problem was in kube-dns.

We are relearning the same things we got wrong in the 90s once we had nice libraries and middleware, i.e. forgetting the platform we're building on.


Out of curiosity, what do you do that you get to do that? Sounds fascinating---I've had lots of fun with dtrace (and, more recently, bcc).


I work for Netflix on Open Connect, the Netflix CDN. We run a modified FreeBSD on our CDN nodes. My job involves improving performance and scalability in the FreeBSD kernel. See https://medium.com/netflix-techblog/serving-100-gbps-from-an... Very much like the NFS article author describes, I also tend to find "naive" solutions that end up being polished a lot before landing upstream.


It's amazing how much we take the power of free and open source software for granted these days.

Imagine this same scenario if GitLab was using a closed source operating system. Would they have been able to track this down? Quite unlikely, but maybe if they were even more persistent and got lucky. Would they have been able to fix it? Absolutely not. They'd be at the mercy of the vendor.


Just to give a picture of how similar situations play out in closed-source environments: they would have a key account manager whose life they would turn into hell. Depending on the size of GitLab's business relative to the software provider's other customers, that would have more or less effect on software development. Reactions might range from having a fix in 2 days, plus on-site engineer/consultant visits, plus inviting the corresponding GitLab manager to a $100+/person dinner, to not even getting a response, to getting an angry call from the provider's manager that the account will be closed if GitLab doesn't behave.

So GitHub, owned by Microsoft, might not even notice such a bug for long, since an engineer would be able to get it fixed with an email or two, while the pre-GitHub-purchase GitLab might have gotten no fix at all.

In that regard one might argue that closed source doesn't completely destroy the ability to solve problems, but open source certainly helps balance out the odds for different competitors.


This very much depends on how big you are and how much you're willing to spend. At a medium-size all-Microsoft shop with 100+ MSDN licensees, I found a bug in WINCE7's handling of "structured exceptions", and eventually got them to acknowledge it but they never fixed it.

(I got as far as disassembling their DLLs to point to the exact problem)


Yeah, I don't really buy the parent comment's insinuation that Microsoft GitHub would be more likely to fix it than GitHub previously. My company has shipped hundreds of thousands of devices with a Microsoft OS on them, and they virtually never fixed anything we reported. At least one bug was very serious, and we thought they would have to fix it, but it just didn't happen.

What's hundreds of thousands to the hundreds of millions they ship every year?


I think you misread the comment. Assuming that NFS was sold by NFS-Co, GitHub, as part of Microsoft, would be a big enough customer to get NFS-Co to fix the bug quickly. However, GitLab would be a tiny NFS-Co customer, and so the bug would have gone unfixed. The difference is in the size of who reports the bug.


> I don't really buy the parent comment's insinuation that Microsoft GitHub would be more likely to fix it than GitHub previously

If Linux were a proprietary product, then post-acquisition GitHub would get a fix from the Linux developers because Microsoft is big. It's not that GitHub would provide anything; they would get something due to the size of the Microsoft imperial stamp.


Supposedly if you buy support incidents they should do something. I don’t know if that is actually true in practice though.


> but open source certainly helps balancing out the odds for different competitors.

Assuming all competitors have engineers who are equally competent in the various technologies involved in debugging this (at a glance: filesystem operations, strace, wireshark, Linux kernel compilation and modules, Google Cloud Platform...) and who know how to contact and approach the open-source maintainers.

... and most importantly: have the time to dedicate to such a debugging task.


Or another example in closed source:

We had the source code for an API that hooked into a proprietary library. We found a bug in the library. I don't think we had a support contract, and the issue was affecting production. Fixing it could have entailed some decompiling of the library, identifying the bad function, writing a workaround, and shoving it all into a new library. But I didn't have the expertise for all that, so instead I hacked up the API with a different workaround, essentially killing off some functionality, which avoided the bug. The application worked again, and we went on with life.

Another example: a proprietary extension to a tool did data replication. Under certain circumstances, replication would fail, and the loss of data meant we would have to full-sync all data, taking up to four days. We reported the bug to the company. They determined it was a "minor error" and said the fix would arrive in the next release, in six months. So we identified a workaround (add caching, monitor for potential service disruption, restart services to re-connect networks before the cache emptied) and implemented it until the fix could be delivered.

Regardless of who fixes the bug or how, the amount of time and money you invest in the fix matters. If a workaround saves you time and money by deferring the cost of the fix, that's often an acceptable solution. In this case, if the issue was affecting customers in production, blocking 'git gc' just for affected customers may have been a perfectly good workaround while whoever owned the NFS Client code figured out and implemented a fix.
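
For instance, a sketch of what that workaround could look like (hypothetical repository path; not what GitLab actually did), since automatic repacking can be switched off per repository with plain git config:

  # disable automatic repacking for one affected repository
  git -C /path/to/repo.git config gc.auto 0
  git -C /path/to/repo.git config gc.autoPackLimit 0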


Representing such an example of work in a job application / resume / interview would be more valuable to me than a college degree. Due diligence and persistence — in the face of real-world, difficult, hundreds-of-moving-parts technical issues — are worth every penny.

EDIT: Yes, college degrees require due diligence and persistence, but they offer no indication of the willingness to exercise those skills _after_ college. This work does.


i recently graduated and started work.

academic achievements, relative to actual work, are a drop of piss in an ocean. my experience at university actually made me lose respect for academics.

edit: whoever is downvoting is romanticizing the achievements of scientists of yore, or thinks that MIT is the norm.

no, for the most part it's publish or die. i've heard professors refer to students as "harvest" and laugh while copying slides off of google. i've seen professors lie their way into grants.

all that, contrasted with how the industry actually works, what it needs, and what it actually requires universities to produce.

yeah, i've become an achievement-oriented cynic. titles truly only make me think less of a person if that's all they have to impress with.


You should name your university. So others can avoid it. Because not all are like that.


i don't believe there is sense in naming it so people can scapegoat it and solve things by avoiding one bad university, because i think the BA/MA system is a road paved with good intentions, but leading us to hell.

naming any singular entity would just make us think it's them to blame, and i believe the problem is endemic.

i once sat at a table with PhD students and complained about the quality of introductory courses, where i was sternly put back in my place with "a university does not prepare for work! it prepares for research!"

i told him someone should tell that to all the students enrolling in CS in hopes of a career.

"not all" may be true, but "most" is more likely. maybe it's this cynicism.

solving this isn't easy. i would just like to open a school myself and offer guidance / support for people struggling as i did back then.

and my advice to most people who want to pursue CS is to do it via an apprenticeship and later approach a technical university.

the drama/pity is that this is a process that starts at 17, when we are most clueless.


> "a university does not prepare for work! it prepares for research!"

This is what universities have always believed they are for, PhD courses in particular. There used to be a separate category of school that was both technical and employment focused; in the UK these were called "polytechnics", in the US they would be things like the Texas Agricultural and Mechanical College. For complex reasons, due to both the class system and the accidents of history that caused a lot of startup founders to come from places like Stanford, they have become "unfashionable".


that's actually interesting!

i have a feeling they will be making a comeback with a vengeance, but maybe not in our time.

today i just do what i can when a youth comes to me for advice.


Nice write up. It reminded me of the time I spent several weeks once trying to get OpenBGPD to work on OpenBSD.

First I tried getting some test VMs up and talking to each other. When I couldn't get that to work, I set up a few physical boxes to test it out. When that didn't work either, I started debugging the code. A few straces and some routine C debugging later, I found a bug that would prevent any BGP connection from ever establishing.

A quick post on the OpenBSD listserv and the problem was fixed within a day. (Wow, that was almost 10 years ago?! How time flies.)

https://github.com/openbsd/src/commit/13fba73cec6be16d64c86e...

We ultimately went with VyOS (back then called Vyatta) and Quagga but it felt good to find a bug like this.

Most of the work went into confirming that there was an actual bug. Finding out where the bug was and fixing it was relatively trivial.


Great post! In my own experience working with NFS version 4 servers, we discovered several bugs that have actually been fixed in the latest kernel versions. The unfortunate thing is that most enterprises still run old CentOS / Red Hat release kernels that, although stable, lack several of these fixes.


I don't have a lot of experience with NFS aside from a few machines that don't see insane use, but it's surprising to me how v4 implementations seem to introduce such instability. I had an experience a few years ago with a Mac client where quitting vim would cause a kernel panic. NFS v3 did fine.


Red Hat usually backports those sorts of fixes to their old-looking kernel versions.


NFS support there is normally fairly current, as that's the reason you pay for a RHEL licence: it's the hotfix should you run into a bug like this.


Excellent write-up, I like how they briefly summarized each section so you knew what to expect (I find it helps with understanding). Including the false path is very nice as well since those are very common when debugging.


Completely agree: finding and fixing the bug is commendable enough, but being so thorough as to provide a free educational lesson to everyone else about professional debugging is the kind of self-promotional material I definitely want to read.


GitLab has a strong engineering team. I appreciate this article. For those with experience, what's the best approach to introducing a documentation / "writing up a post-mortem culture" into a company that traditionally doesn't value these things?


At GitLab we really care about our culture and core values [1]. As freddie said below: "Start doing it, celebrate it, reward it". If you are not sure how to make the first move, I would like to say that transparency is what really pushes everyone forward.

Start iterating on transparency; it may be hard, but you will see great results and it will make everyone around you collaborate much more.

You probably heard of the event [2] which occurred almost 2 years ago - people are still talking about it and we are really happy and impressed to see everyone, including us, learning from that experience.

I hope this non-technical suggestion helps you think about a solution to your question. If your team is not used to this kind of openness, they will eventually like the positive feedback from the community (we see that as a small iteration :-)). The comment section at [2] may be an extra source of motivation.

Have a nice day,

Djordje - Community Advocate at GitLab

[1] https://about.gitlab.com/handbook/values/#transparency

[2] https://about.gitlab.com/2017/02/01/gitlab-dot-com-database-...


Start doing it. Celebrate it. Reward it. Keep doing it.

Like all culture changes it isn't complicated, just hard.


It depends very much on your current scope of influence. It can be hard to change a whole company, but easier to start with yourself and your team. Try doing one and see how it is received.


We had been experiencing a similar bug, which we reported to Red Hat after being stumped. It started occurring out of nowhere, but it would happen in only 0.01% of the jobs we launched into a batch farm. We launch about 15k batch jobs a day, and that was enough to be a problem.

Before a job was launched, a daemon pre-staged some job contents (logfiles, env, etc.) and started writing out a job summary file. Then the job would start and continue writing to one of the files, which would become corrupted.

It ended up being this bug: https://www.spinics.net/lists/linux-nfs/msg41335.html


Too bad that kind of thing is difficult to catch using static analysis.


Git on NFS has proved to be full of annoyances. Though I haven't seen any Git repo integrity issues on NFS, normal Git operations can be so slow on NFS that it's infuriating (especially when your repository is sufficiently large). Do you like that fancy Bash prompt showing 'git status' for the repository you currently work in? Forget about it if you're on NFS. Or get used to waiting a couple of seconds after each Bash command while it's blocked on 'git status'. The solution is simply to avoid NFS altogether and work with Git repos on a local filesystem.
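
(If you are stuck with a repo on NFS anyway, one partial mitigation, assuming the stock git-prompt.sh and a reasonably recent git, is to turn off the expensive checks for that repository:

  # skip the dirty-state check in the prompt and untracked-file scans in 'git status'
  git config bash.showDirtyState false
  git config status.showUntrackedFiles no

It doesn't make Git on NFS fast, it just stops the prompt from paying the price on every command.)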

To add insult to injury, this is an example of how people like to work with Git repositories at our company:

* Clone a Git repo into $HOME to work with it on different Linux hosts. $HOME is an NFS automount so that you have the same home environment on any host you log in to.

* $HOME is also exposed to Windows desktop machines via SMB. So convenient, right? You can edit source code in your favorite Windows IDE now!

Imagine their surprise when they make yet another Git commit with garbage in it: CR/LF and file mode bits are all messed up. Sometimes a file change on Windows takes a long time to propagate to NFS, or, worse yet, there can be some garbage at the end of the file. Combine this with the common practice of committing with 'git commit -am' without even looking at the diff and you get a recipe for disaster.


I wish that git for windows would just change the default line endings to \n. Most modern editors should also make this shift as a default, or, if there's no \r\n combination in an existing doc, just use \n.
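
You can get most of the way there yourself in the meantime; roughly the settings I mean (double-check them against your setup):

  # convert CRLF to LF on commit and keep LF on checkout
  git config --global core.autocrlf input
  git config --global core.eol lf

  # or, per repository, a .gitattributes line that travels with the repo:
  # * text=auto eol=lf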

I generally set this as the default, and sometimes forget on a new machine... I tend to prefer tools/programs that work on Windows, Mac and Linux, even if they're not quite as good, so it matters less where I am. I use a Windows keyboard on the Mac and change the mapping... the only gotcha is when I need ^C in a terminal on the Mac; the muscle memory screws me up sometimes, switching from working at home (Mac or Linux) to working at work (Windows).

Some quirkiness with git's bash on Windows (my default shell) gets me sometimes too.


I love these posts! Once or twice I was lucky enough to have a somewhat similar intense debug puzzle to solve myself. Wonderful times :) strace is a godsend!


I always wonder why Linux doesn't seem to have any kind of tests. How can they afford not to have regression tests for bugs they fixed? How do they know that this bug fix didn't break anything? What does "never break userspace" even mean if there is no way to check whether userspace has been broken?



Never break userspace means that if you break userspace, that's a bug. The effort spent on preventing bugs is another matter.


I guess the user is the test.


To be honest, NFS is usually more pain than it is worth. (But hey, at least it's not iSCSI.)

Yes, please, let the default not allow me to unmount a filesystem from a server that died.


Totally disagree. You have to understand how NFS caching affects your application and what kind of performance and security/availability you need. At that point you can make NFS work for you. You also have to make good decisions about how you implement it. This type of use case (long open-file waits with concurrent access) is a nightmare with any shared filesystem. Most approaches to dealing with possibly stale content are shoulder shrugs.


> This type of use case (long open-file waits with concurrent access) is a nightmare with any shared filesystem. Most approaches to dealing with possibly stale content are shoulder shrugs.

Well, it is possible to handle correctly, e.g. Lustre. Lustre, however, is very complex compared to NFS, so there's absolutely a price to be paid.

NFS implements close-to-open consistency, which is much weaker than full cache coherency (again, e.g. Lustre).


If you want a network shared filesystem, then design your data for it.


I think NFS requires some knowledge of how you want to use it. The defaults seem reasonable, but I also get confused by all the v3/v4 differences etc. I use it at home to connect everything to my NAS and it works very well for me. My export is this:

  /mnt 192.168.8.0/24(rw,sync,insecure,no_subtree_check,crossmnt,all_squash,anonuid=0,anongid=100)

and my client config with autofs is this:

  rxd01 -fstype=nfs4,ro,soft,noatime,nodiratime,intr,rsize=65536,wsize=65536,nosuid,tcp,allow_other 192.168.8.3:/mnt/rxd01

If you know what each option does, how your network is set up and what your server/clients are capable of, you will eventually find the right settings, but it's not a good OOB experience.
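
For anyone copying these, a rough gloss of the less obvious options as I understand them from exports(5) and nfs(5); double-check against your distro's man pages:

  # server (exports):
  #   sync              reply only after changes reach stable storage (safer, slower)
  #   insecure          accept client source ports above 1023
  #   no_subtree_check  skip subtree checking; more reliable when files get renamed
  #   crossmnt          expose filesystems mounted underneath the export
  #   all_squash + anonuid/anongid  map every client user to the given uid/gid
  # client (autofs map):
  #   soft              return an error after retries instead of hanging forever
  #   intr              allow signals to interrupt NFS waits (a no-op on modern kernels)
  #   rsize/wsize       requested max read/write transfer sizes; the server may lower them
  #   nosuid            ignore setuid/setgid bits on the mount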


Does anyone know why NFS is such a pain? In the past (10 years ago) I just assumed I was doing it wrong and stopped using it, and have not used it since.


In my experience, the quality of NFS client implementations varies significantly between different operating systems. We made _heavy_ use of NFS for home directories and application backing stores at the University where I used to work, and it was a very good experience -- but this was on Solaris 10 (and later, OpenSolaris) machines. We had heavy NFS client use on many multi-user machines (shell servers, Sun Ray servers, etc) and didn't see reliability problems. On the odd occasion that we needed to reboot the file server for updates, clients would pause and then resume promptly after the server rebooted.

Towards the end of my tenure there, I gave a Linux desktop a try. The NFS experience was amazingly bad by comparison; lots of issues with locking, with becoming disconnected (often until a reboot) from NFS servers, odd performance issues, reliability issues with the automounter, etc.

In the last few months I have tried the NFS client on my current Linux desktop again, thinking things might have improved -- they have, I guess, but not by much. It's still pretty easy for the client to get into a hung state if there's too much packet loss, or if the file server reboots, or whatever. I have to imagine that not enough people are really using Linux NFS clients in anger to drive fixing the issues with it. There is often no escape from the Quality Death Spiral.


NFS requires long admin experience and tuning for each use case. The only comment I can agree with you on is that the linux automounter is lackluster. We used BSD amd for many years with good success.


> NFS requires long admin experience and tuning for each use case.

Depends on what you're going to do with it. For something like sharing home directories, it works well enough.

The defaults are usually pretty decent. There's unfortunately a lot of obsolete NFS tuning advice hanging around on the internet that seems to get cargo culted over and over again.

Like the advice to set some specific rsize/wsize values because the default is too small, oblivious to the fact that the NFS protocol allows the client and server to negotiate maximum sizes, and that at least the Linux client and server have taken advantage of this negotiation mechanism for the past two decades or so.
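
If in doubt, look at what a live mount actually negotiated instead of trusting an old blog post, e.g.:

  # show the effective mount options, including the negotiated rsize/wsize
  nfsstat -m
  # or, with more detail per mount:
  grep -A3 'fstype nfs' /proc/self/mountstats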


Well enough isn't good enough for production. Having worked in high volume, highly available environments for years, SOHO scenarios are not good examples. Real world issues with complex NFS environments (mixed nfs3/4 + krb5p and multiple OS'es + automounters) or pNFS and gluster require more than tuning mount options. Tuning NFS for a latency averse and throughput intensive application operating on large netcdf and hdf5 file hierarchies is a worthy example.


> Well enough isn't good enough for production.

How informative.

> Having worked in high volume, highly available environments for years, SOHO scenarios are not good examples.

FWIW, I wasn't talking about SOHO. At least in my experience, defaults work well for home & shared work dirs for O(10k) users (not all simultaneously active, though). HA is a pain, though, if you want to DIY, I'll grant you that.

> Real world issues with complex NFS environments (mixed nfs3/4 + krb5p and multiple OS'es + automounters)

Complex? Sounds like a pretty standard NFS environment.

> or pNFS and gluster require more than tuning mount options.

Yeah, no personal experience there. What did you have to do there?

We did have a clustered NFS appliance for HPC use a decade or so ago. People like to complain how Lustre is a beast to run, but IME Lustre has been smooth sailing compared to the grief that POS gave us. But that wasn't really the fault of the NFS protocol per se, it was just the architecture as well as the implementation of that appliance was crap, particularly so for HPC.


* A routine buffer tweak is SOHO speak.

* Defaults don't work, especially with mixed NFS 3/4 on Linux across mixed 1/10 Gb segment subnet boundaries. Try it and get back to me. You will DoS your file service.

* krb5p standard? First I've heard of it. FreeBSD won't do krb5p at NFSv4 vanilla against a Linux NFS server.

* Would never do it again. Gluster is a shit storm of problems, but nice when it works.


It's an extremely old design, designed primarily for read-sharing and with its concurrency features retrofitted. It worked absolutely fine for network-booting diskless Sun workstations in the 90s but its failure modes are just too annoying for modern usage.

Trying to do anything like a database (and the 'git gc' process described is exactly that, a tiny database) over NFS requires the use of very specific techniques to get right.

Sibling commentator has it right - for unreliable WAN networks S3 offers far better semantics, because it's not quite a filesystem.


POSIX semantics and networks just don't play well together. Notice that binary storage systems these days avoid filesystem integration (S3, etc.).


It depends.

Firstly, NFS only really works reliably when your network has harmonised UIDs/GIDs. That's the first pain point. This normally means LDAP/AD or shipping /etc/passwd around (_shudders_). Also you need to squash root, otherwise people who are local root can do lots of naughty things.

Then you have to make sure that your mountpoint doesn't go away, because stale file handles are a pain in the arse.

Then you have file locking, which causes loads of other pain as well. Most people turn that off.

After that it's mostly alright.

nfsv4 has certain things that are good (pNFS, kerberos, etc) but support was not that great.


It is very good at doing exactly what you don’t want though.


NFS open file handle semantics are quite an annoyance.

The recommended way to perform atomic writes on POSIX is the create-write-fsync-rename-fsyncdir[0] dance. But that replaces the original file which causes ESTALE for all readers on NFS servers that don't support "delete on last close"[1] semantics.
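
For concreteness, a minimal sketch of that dance (Python just for brevity; error handling omitted, data is bytes):

  import os

  def atomic_replace(path, data):
      tmp = path + ".tmp"
      fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
      try:
          os.write(fd, data)
          os.fsync(fd)              # 1. flush the new file's contents
      finally:
          os.close(fd)
      os.rename(tmp, path)          # 2. atomically replace the old file
      dfd = os.open(os.path.dirname(path) or ".", os.O_RDONLY)
      try:
          os.fsync(dfd)             # 3. persist the directory entry (the rename itself)
      finally:
          os.close(dfd)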

This breaks the common pattern where readers can continue reading slightly stale data from the unlinked file while writers update the data atomically. In other words, it makes it much harder to do filesystem concurrency correctly, which is already hard enough.

A practical case where I'm seeing it is on Amazon's EFS. Updating thumbnails occasionally results in torn images because the server tries to send a stale file.

[0] https://danluu.com/file-consistency/ [1] http://nfs.sourceforge.net/#faq_d2


A tangential question, the post links to an earlier post[1] saying that GitLab itself doesn't use NFS anymore, pointing out that they migrated to Gitaly.

But ultimately Gitaly will need to do a local FS operation, so there's still the problem of ensuring HA for a given repository. GitHub solved this by writing their own replication layer on top of Git[2], but what's GitLab doing? Manually sharding repos on local FS's that are RAID-ed with frequent backups?

1. https://about.gitlab.com/2018/09/12/the-road-to-gitaly-1-0/

2. https://githubengineering.com/introducing-dgit/


We are working on Gitaly HA. You can check out the Epic here:

https://gitlab.com/groups/gitlab-org/-/epics/289


So since redundancy & horizontal scaling are goals of Gitaly HA am I to understand that right now GitLab.com is run on some ad-hoc setup like what I described, and you can lose data if you're unlucky enough with a machine or two disks going down at the same time?


The Red Hat BZ is "Access Denied", so I can't see if it's fixed in RHEL & CentOS yet:

https://bugzilla.redhat.com/show_bug.cgi?id=1648482

:(


You can sign up for an account to see the bug report status. The patch has not yet been backported.


I have an account. Why would that make any difference?

RH BZs are (or used to be) public by default, unless they're manually changed, e.g. for security-related things.


It’s still in NEW - don’t see any chatter yet


Great writing effortlessly makes you feel smarter. Great story.


Really nice. I have more respect for GitLab now. That's a great write-up, and it led me to read some of their other nice reports too.

It's not exactly new for NFS to have cache coherency "surprises". But it should have "close-to-open" coherency at least, and the bug found by GitLab fails even that.

Here's an anecdote.

A Mac client talking to Samba on Linux. The client deletes random files that the client isn't even looking at, but which happen to be changed on the server around the time the client looks at the directory containing those files.

I am not joking. Randomly deleting files it's not even reading.

It delayed a product rollout for about 8 months. I was sure there must be a flaw in some file-updating code somewhere in the application code running on Linux. What else would make files updated by rename-over disappear once every few weeks? Surely the usual tmpfile-fsync-rename dance was durable on Linux, on ext4? It must have been a silly, embarrassing error in the application code, right? Calling unlink() with the wrong string or something.

But no, application was fine. Libraries were fine. And the awful bugs in VMware Fusion's file sharing were not to blame this time. (Ahem, another anecdote...)

It only happened every few weeks. A random file would disappear and be noticed. A web application would be told to update a file, and it'd spontaneously complain that the file was gone. It wasn't reproducible until we went all-out on trying to make it happen more often. But they kept disappearing.

Things like invoice data files and edited documents. Once every few weeks, for no obvious reason. Not happy. And not safe to deploy.

Eventually, we found a very old bug in Emacs which deletes the file that's being saved in rare circumstances that only manifest when file attributes change at the wrong moment, which does happen with the weird and wonderful Mac SMB client's way of caching attributes. We thought we'd found the cause with great relief, and could proceed to rollout. Until after a few weeks, another file disappeared. No!

It took weeks of tracing, reproducing, and learning new debugging tools (like auditd running permanently) to rule out faults in (1) the application code and libraries, (2) Linux itself, (3) Samba, (4) tools used on the Mac when viewing a directory, and viewing and editing files.

Nope, it wasn't a bug in application code after all. There weren't any faulty calls or wrong strings. Logging would have caught them. Linux rename() was fine, not to blame. It wasn't a durability problem on power loss (the reason you need fsync with rename). Nor VMware disk image snapshots, even though other bugs were spotted with those. Nor was it the Emacs bug although that was a surprise to find.

The reproducer turned out to be "run cat a lot on the Mac, on a file which isn't being changed at all, while repeatedly updating another file on Linux in the same directory, using rename to update. Watch the updated file disappear eventually".
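
In shell terms it was roughly this (paths made up from memory):

  # on the Mac, over SMB: repeatedly read an unrelated file in the shared directory
  while true; do cat /Volumes/share/a > /dev/null; done

  # on the Linux side: repeatedly update another file in the same directory via rename
  while true; do cp template /srv/share/b.tmp && mv /srv/share/b.tmp /srv/share/b; done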

auditd showed Samba was doing the deletes, so I suspected a crazy bug in Samba and had to work quite hard to convince myself Samba was only doing what it was told by the client. I hoped it was Samba, because that's open source and I can fix that.

No, it was an astonishingly crappy bug called "delete random files once in a blue moon, hahaha!" in the Mac SMB client, which happened to occasionally be used to look in the same directory, which happened to be shared over Samba for convenience to look at it.

The confirmation of cause was from watching the SMB protocol, looking at Samba logs set to maximum verbosity, and lots of reading.

atq2119 says: "Imagine this same scenario if GitLab was using a closed source operating system. Would they have been able to track this down?"

I think I've had an experience like that - the above bug in the Mac SMB client. (Seriously, deleting random files.)

Googling reveals similar-sounding bugs at least two versions of OSX later. Yuck. I have no idea how to meaningfully get these things fixed or usefully reported. And I've had enough to stop caring anyway. The workaround is "force it to use SMB v1" (ye olde anciente). I can imagine the cause is something trivial in directory caching; it's probably just a few lines to fix.

I'm certain if the Linux client had a bug like that, it would be fixed very quickly, and probably backported by the big distros. I'm certain a Linux SMBFS developer would have been very helpful. And, there's a fairly good chance I could have fixed it myself and submitted the patch - probably less work than finding the cause, in this instance.

As it is, I don't think I could have found the culprit if I couldn't look at the Samba source to understand in detail what was going on in the SMB network protocol, or if I didn't have excellent tracing tools in Linux to find which process was responsible for stray deletions (i.e. not my application code, but Samba, which was doing as requested).


Great war story. Agree with need for access to source.

--

During the early Java WORA culture wars, Bill Joy's wisdom about NFS has always stuck with me:

Interoperability is hard.

Despite having access to source code, a stable spec, working reference implementations, testing suites, and aggressive evangelism, getting everyone's NFS implementations to interoperate was a major challenge.

https://en.wikipedia.org/wiki/Network_File_System

--

I continue to think the authoritative history of NFS would make a seminal textbook. A useful guide for the younguns about to embark on grand new world-changing adventures. Many, many other protocols (DNS, TCP, HL7, CORBA...) have faced the same challenges. But my hunch is that NFS is a superset, hitting every pain point.



