How to ruin Linus's vacation

Did you know...?

LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net.

By Jonathan Corbet
July 19, 2011

It's all Hugh's fault.

Linus was all set to release the final 3.0 kernel when Hugh Dickins showed up on the list with a little problem: occasionally a full copy of the kernel source tree fails because one of the files found therein vanishes temporarily. What followed was a determined bug-chasing exercise which demonstrates how subtle and tricky some of our core code has become. The problem has been found and squashed, but there may be more.

A bit of background might help in understanding what was happening here. The 2.6.38 release included the dcache scalability patches; this code uses a number of tricks to avoid taking locks during the process of looking up file names. For the right kind of workload, the "RCU walk" method yields impressive performance improvements. But that only works if all of the relevant directory entry ("dentry") structures are in the kernel's dentry cache and the lookup process does not race with other CPUs which may be making changes on the same path. Whenever such a situation is encountered, the lookup process will fall back to the older, slower algorithm which requires locking each dentry.

The dentry cache (dcache) is a highly dynamic data structure, with dentries coming and going at all times. So one CPU might be removing a dentry at the same time that another is using it to look up a name. Chaos is avoided through the use of read-copy-update (RCU) to manage the removal of dentries; a dentry may be removed from the cache, but, if the thread using that dentry for lookup got a reference to it before its removal, the structure itself will continue to exist for as long as that thread needs it. The same should be true of the inode structure associated with that dentry.

Hugh tracked the problem down to a bit of code in walk_component():

	err = do_lookup(nd, name, path, &inode);
	/* ... */
	if (!inode) {
		path_to_nameidata(path, nd);
		terminate_walk(nd);
		return -ENOENT;
	}

If do_lookup() returns a null inode pointer, walk_component() assumes that a "negative dentry" has been encountered. Negative dentries are kept in the dentry cache to record the fact that a specific name does not exist; they are an important performance-enhancing feature in the Linux virtual filesystem layer. To see an example, run any simple program under strace and watch how many system calls return with ENOENT; lookups on nonexistent files happen frequently. What Hugh determined was that this inode pointer was coming back null even though the file exists, leading the code to believe that a negative dentry had been found and causing the "briefly vanishing file" problem.

Hugh must have looked at this code for some time before concluding that the kernel must be removing the dentry from the dcache at just the wrong time during the lookup process. As described above, the dentry itself continues to exist after its removal from the cache, but that does not mean that it is unchanged: the removal process sets its d_inode pointer to NULL. (It's worth noting that this behavior goes against normal RCU practice, which calls for the structure to be preserved unmodified until the last reference is known to be gone). Hugh concluded that this null pointer was being picked up later by the lookup process, causing walk_component() to conclude that the file does not exist when all that had happened was the removal of a dentry from the cache. His problem report included a patch causing the lookup code to check much more carefully when the inode pointer comes up null.

Linus acknowledged the problem but didn't like the fix which, he thought, was too specific to one particular situation. He proposed an alternative: just don't set d_inode to NULL; that would keep the inode pointer from picking up that value later. Al Viro posted an alternative fix which changed dcache behavior in less subtle ways, and worried about the possibility of introducing other weird bugs:

I'm not entirely convinced that it's a valid optimization in the first place (probably is, but I'm seriously scared by the complexity we already have there), and I'm really not fond of the idea of dealing with whatever subtle crap we might discover with Linus' patch. Again, dcache is not in a healthy shape right now; at this point dumb and straightforward is, IMO, better than subtle and risking to step on toes of very odd code out there...

Once we are done with code audit, sure, I'm fine with ->d_inode being kept until dentry is actually freed. Any code that relies on that thing being cleared is asking for trouble and should be rewritten anyway. The only thing is, it needs to be found before we rewrite it...

Linus didn't like Al's fix either; it threatened to force slow lookups when negative dentries are involved. The discussion of the patches went on at some length; in the process of trying to find the safest way to fix this subtle bug the participants slowly came to the realization that they did not actually know what was happening. After looking at things closely, Linus threw up his hands and admitted he didn't understand it:

So how could Hugh's NULL inode ever happen in the first place? Even with the current sources? It all looks solid to me now that I look at all the details.

As it happens, Linus's exposition was enough to point Hugh at the real problem. Just as the process of transiting through a specific dentry is almost complete, do_lookup() makes a call to __follow_mount_rcu(), whose job is to redirect the lookup process if it is passing through a mount point. The inode pointer is passed to __follow_mount_rcu() separately; Hugh noticed that this function was doing the following:

	*inode = path->dentry->d_inode;

In other words, the inode pointer is being re-fetched from the dentry structure; this assignment happens regardless of whether the dentry represents a mount point. That is the true source of the problem: if the dentry has been removed from the dcache after the lookup process gained a reference, d_inode will be NULL. So __follow_mount_rcu() will zero a pointer which had pointed to a valid inode, causing later code to think that the file does not exist at all.

Linus posted a fix for the real problem along with his now-famous Google+ posting saying that he was delaying the 3.0 release for a day just in case:

We have a patch, we understand the problem, and it looks ObviouslyCorrect(tm), but I don't think I want to release 3.0 just a couple of hours after applying it.

Linus delayed the release despite the inconvenient fact that it will push the 3.1 merge window into his planned vacation. That was a well-placed bit of caution on his part: the ObviouslyCorrect(tm) patch had YetAnotherSubtleBug(tm) in it. A fixed version of the patch exists, and this particular bug should, at this point, be history.

There is a sobering conclusion to be drawn from this episode, though. The behavior of the dentry cache is, at this point, so subtle that even the combined brainpower of developers like Linus, Al, and Hugh has a hard time figuring out what is going on. These same developers are visibly nervous about making changes in that part of the kernel. Our once approachable and hackable kernel has, over time, become more complex and difficult to understand. Much of that is unavoidable; the environment the kernel runs in has, itself, become much more complex over the last 20 years. But if we reach a point where almost nobody can understand, review, or fix some of our core code, we may be headed for long-term trouble.

Meanwhile, we should be able to enjoy a 3.0 release (and a 2.6.39 update) without mysteriously vanishing files. One potential short-term problem remains, though: given that the next merge window will push into Linus's vacation, there is a distinct chance that he might be more than usually grumpy with maintainers who get their pull requests in late. Wise subsystem maintainers may want to be ready to go when the merge window opens.

Index entries for this article
Kernel	Dentry cache

(Log in to post comments)

How to ruin Linus's vacation

Posted Jul 19, 2011 20:00 UTC (Tue) by cesarb (subscriber, #6266) [Link]

> The behavior of the dentry cache is, at this point, so subtle that even the combined brainpower of developers like Linus, Al, and Hugh has a hard time figuring out what is going on. [...] But if we reach a point where almost nobody can understand, review, or fix some of our core code, we may be headed for long-term trouble.

I wonder if we are reaching the point where we would need to write formal proofs for parts of the kernel code.

How to ruin Linus's vacation

Posted Jul 19, 2011 20:48 UTC (Tue) by smurf (subscriber, #17840) [Link]

Unfortunately, any VFS operation needs to be formally described not as in "X happens", but as in "there is an arbitrarily complex state Y which is transformed into the almost-identical state Y', and the only relevant difference between Y and Y' is X". (There may be non-relevant differences, e.g. cache state.)

Add in the fact that the kernel is reentrant (formal descriptions for concurrent processes? Dream on) and has the aforementioned caching and RCU and whatnot (so for each file there are multiple valid pre- and postconditions), and you're in for a _real_ treat.

I very much doubt that anybody can manage this for any specific non-trivial testcase, much less in general.

How to ruin Linus's vacation

Posted Jul 20, 2011 3:24 UTC (Wed) by bfields (subscriber, #19510) [Link]

Well, it's also true that you can't write purely "formal" proofs for most mathematical theorems. And yet, mathematics gets done, because people can write perfectly good proofs in ordinary language.

And in fact anyone that writes non-trivial code probably does form in their head at least a hand-wavy proof of its correctness. If those actually got written down, it would probably help clarify thinking and avoid some bugs. But that doesn't happen for the same reason that nobody writes documentation.

An example of an exception: Documentation/filesystems/directory-locking.

How to ruin Linus's vacation

Posted Jul 20, 2011 4:05 UTC (Wed) by viro (subscriber, #7872) [Link]

Yeah, well... you forgot to add "and actually read" to conditions... Exhibit A: people adding hardlinks to directories or equivalents thereof, despite the aforementioned example of documentation ;-/

We do need such writeups, of course. If nothing else, writing them tends to find holes - see e.g. ->d_lock mess discussion on fsdevel lately. There the locking order had been fscked in head (not transitive, for one thing), but locks outside of that set had mostly avoided bad trouble. Trying to write the proof of correctness hadn't been fun (and what I've got still relies on unverified assumptions about the things filesystem code does not do; verifying those has already caught a bunch of really broken things), but it helped to catch rather nasty stuff. Simply by reasoning about the properties of counterexample - i.e. "what would a deadlock have to look like". It's math, like any other...

FWIW, I wonder what backgrounds people have - in my case, it's geometry and topology and _that_ has certainly helped to acquire many mental habits useful for that kind of work...

How to ruin Linus's vacation

Posted Jul 20, 2011 23:02 UTC (Wed) by bfields (subscriber, #19510) [Link]

Yeah. Especially for CS students, something like the classic first point-set topology course might give good experience with that kind of proof-or-counterexample mode of problem solving. I think that's rare, unfortunately, at least outside a few countries with very rigorous math programs?

(Like some others, I'm a refugee from mathematics, coming late to this after getting a PhD (commutative algebra and some algebraic topology). Not a particularly smart career path, but fun in its own way.)

How to ruin Linus's vacation

Posted Jul 21, 2011 23:34 UTC (Thu) by fuhchee (guest, #40059) [Link]

"there is an arbitrarily complex state Y which is transformed into the almost-identical state Y', and the only relevant difference between Y and Y' is X"

That sounds like the AI Frame Problem.

How to ruin Linus's vacation

Posted Jul 19, 2011 20:54 UTC (Tue) by kleptog (subscriber, #1183) [Link]

I'm not sure about complete proofs, but we do need better tools for testing certain assertions with respect to race conditions.

It happens frequently that a cache introduced for performance actually contains some race condition. One way of checking this is to make the cache throw away entries almost immediately, this has a way of stress testing certain failure modes. I wonder if that would have helped here.

In any case, testing assertions in the face of race conditions is something we could all use. There is software proving software that tries but the false positive rate is still too high. I truly hope that in the future we will have good tools for this, because complexity is only getting worse.

How to ruin Linus's vacation

Posted Jul 19, 2011 23:03 UTC (Tue) by Ben_P (guest, #74247) [Link]

It's also important to note than when working in languages with the freedom that C gives you, determining accurate life times and reference counts can be hellish if very strict coding conventions are not adhered to.

How to ruin Linus's vacation

Posted Jul 19, 2011 21:57 UTC (Tue) by tshow (subscriber, #6411) [Link]

The trouble with formal proofs is that the logical errors and bad assumptions made when writing the code tend to be propagated into the formal proofs, where they typically remain equally unnoticed. Or worse, the proof is fine, but the code deviates in subtle ways due to abstraction mismatches and model simplification.

At least as I understand the state of the art in formal code proofs, the only place they work is places where they aren't particularly useful; places where things like timing, instruction reordering, hardware errors and concurrency are not considerations.

How to ruin Linus's vacation

Posted Jul 19, 2011 22:13 UTC (Tue) by smoogen (subscriber, #97) [Link]

I wonder if this is anything related to a set of "crashes" I have had with the 3.0.0 kernels (the 2.6.39 works fine so maybe not)

https://bugzilla.redhat.com/show_bug.cgi?id=722472

I noticed that at least in rc7 it came out with a new error message versus slowing down until the deadlock indicator starts spewing "hard-locks"

How to ruin Linus's vacation

Posted Jul 28, 2011 17:12 UTC (Thu) by pebolle (subscriber, #35204) [Link]

No, it's not related. That bug has to do - as far as I can tell right now - with lockdep noticing there's recursive locking going on but not being told that this is a case of expected nested locking. Well, I think the nesting is expected ...

Anyhow, the message lockdep prints, includes the neat line:
*** DEADLOCK ***

It took me a few days before I noticed that the kernel actually still was running quite OK after that disturbing message.

By the way, the reason you ran into this recently seems to be that systemd-30 is (apparently) the first program in wide use that triggers this.

How to ruin Linus's vacation

Posted Jul 19, 2011 22:22 UTC (Tue) by dgm (subscriber, #49227) [Link]

While reading the article I had this quote in my mind all the time:

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style.

How to ruin Linus's vacation

Posted Jul 19, 2011 22:33 UTC (Tue) by neilbrown (subscriber, #359) [Link]

Yes, it is a great quote. But experience shows that it is wrong ... so where is the flaw?

I think the reality is that writing really clever code actually makes you smarter - by solving a difficult problem you can learn something useful. So having written (or just read and understood) really clever code, you become smart enough to at least make an attempt at debugging it.

Maybe.

How to ruin Linus's vacation

Posted Jul 19, 2011 22:55 UTC (Tue) by dgm (subscriber, #49227) [Link]

Maybe. But when you find yourself hunting that bug that makes no sense, you should question if you were as clever as you thought when writing the code.

How to ruin Linus's vacation

Posted Jul 20, 2011 0:26 UTC (Wed) by neilbrown (subscriber, #359) [Link]

I have a lot of experience hunting bugs that make no sense, sometimes in my own code and sometimes in other people's. I usually find it harder to find those bugs in my own code, and it certainly isn't because my code is particularly "clever".

The common pattern that I find in "bugs that make no sense" is that I am believing something that isn't true. Once I identify and challenge that belief, the bug becomes obvious. The reason that these bugs are harder to find in my own code is because I believe my code is correct (in general, and in lots of specifics), so I don't question it so closely. When I'm reading other people's code I have less belief that it is correct, so less hindrance to finding the bugs.

Put another way: finding a bug in my code is both a success and an admission of failure. Finding a bug in someone else's code is pure success. This sounds like a strong case for pair-programming!

How to ruin Linus's vacation

Posted Jul 20, 2011 2:25 UTC (Wed) by elanthis (guest, #6227) [Link]

This is the same reason why authors of prose have editors. Your internalized assumptions make it so your eyes can pass right over "obviously" bad constructs and not notice.

It's also why "check your assumptions" is the first rule of debugging I teach to new coders. :)

How to ruin Linus's vacation

Posted Jul 20, 2011 2:44 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Indeed.

This is also why I find debugging with print and assert statements to be much more helpful than single stepping, in the vast majority of cases. Single stepping was more useful to me when I didn't necessarily understand all the language constructs. That was 20-25 years ago, though.

Nowadays, if I have some weird bug, it's because something I think is true is not, or some other "invisible-to-me" error. A judiciously placed print statement (perhaps guarded by an 'if' to filter out the noise) makes it easy for me to prove or disprove my assumptions about what's happening in the code, though, without disrupting any of the code around it. I can compare working cases to non-working cases easily and in a batch-wise manner after the fact. Very useful.

How to ruin Linus's vacation

Posted Jul 21, 2011 19:19 UTC (Thu) by nas (subscriber, #17) [Link]

This is way the "Rubber Duckie Test" is effective.

How to ruin Linus's vacation

Posted Jul 22, 2011 19:58 UTC (Fri) by smurf (subscriber, #17840) [Link]

IMHO you mean "This is *why ".

Please enlighten us what this test means; not everybody is damiliar with the idiom.

How to ruin Linus's vacation

Posted Jul 23, 2011 9:45 UTC (Sat) by smurf (subscriber, #17840) [Link]

*familiar. *grumble*

How to ruin Linus's vacation

Posted Jul 23, 2011 21:38 UTC (Sat) by neilbrown (subscriber, #359) [Link]

I hadn't heard of it either .. but search engines are our friends.

http://lmgtfy.com/?q=Rubber+Duckie+Test

Pick the wikipedia link on Rubber Duck Debugging.

How to ruin Linus's vacation

Posted Jul 25, 2011 18:08 UTC (Mon) by Lennie (subscriber, #49641) [Link]

I just knew the "janitor effect".

I presume the name comes from working late because you can not figure out a problem. Then the janitor comes and he can sees you are kind of frustrated and asks what the problem is.

You try to explain it to the janitor in simple terms and then you understand your problem and can fix it.

Then you go home happy I guess.

The ducky is just a way to try and induce that state of mind with an inanimate object where you are trying to explain the problem to someone else in simple terms.

How to ruin Linus's vacation

Posted Jul 21, 2011 8:34 UTC (Thu) by dgm (subscriber, #49227) [Link]

I have another quote for you:

"Oh Lord it's hard to be humble
when you're perfect in every way.
I can't wait to look in the mirror
cause I get better looking each day.
To know me is to love me
I must be a hell of a man.
Oh Lord it's hard to be humble
but I'm doing the best that I can."

-- Mac Davis

;-)

How to ruin Linus's patch

Posted Jul 19, 2011 23:18 UTC (Tue) by neilbrown (subscriber, #359) [Link]

I'm trying to understand Linus' patch.

Moving the assignment to *inode from the start to the end makes lots of sense.

The other mucking about with seqcount doesn't make any sense to me at all.

What exactly is the read_seqcount_retry on path->dentry->d_seq trying to protect? That dentry is the mounted-on dentry so it is pinned and cannot be renamed or deleted.

About the only thing I can think of that might need to be protected against at this point is the mount being unmounted - but if that were the goal of the code I would have expected a comment to that effect, and it doesn't seem like the right place for it anyway..

Any ideas?

How to ruin Linus's patch

Posted Jul 19, 2011 23:47 UTC (Tue) by viro (subscriber, #7872) [Link]

No, you are right - the check is BUG_ON() misspelled. umount *can't* happen
at that point - we are holding vfsmount_lock all the way through RCU
walk.

How to ruin Linus's vacation

Posted Jul 25, 2011 8:57 UTC (Mon) by mlankhorst (subscriber, #52260) [Link]

> It's all Hugh's fault.
Blame the messenger, not the person who originally introduced the bug?

How to ruin Linus's vacation

Posted Jul 25, 2011 16:23 UTC (Mon) by bronson (subscriber, #4806) [Link]

It was obviously a joke.