How to ruin Linus's vacation
Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net. |
It's all Hugh's fault.
Linus was all set to release the final 3.0 kernel when Hugh Dickins showed up on the list with a little problem: occasionally a full copy of the kernel source tree fails because one of the files found therein vanishes temporarily. What followed was a determined bug-chasing exercise which demonstrates how subtle and tricky some of our core code has become. The problem has been found and squashed, but there may be more.
A bit of background might help in understanding what was happening here. The 2.6.38 release included the dcache scalability patches; this code uses a number of tricks to avoid taking locks during the process of looking up file names. For the right kind of workload, the "RCU walk" method yields impressive performance improvements. But that only works if all of the relevant directory entry ("dentry") structures are in the kernel's dentry cache and the lookup process does not race with other CPUs which may be making changes on the same path. Whenever such a situation is encountered, the lookup process will fall back to the older, slower algorithm which requires locking each dentry.
The dentry cache (dcache) is a highly dynamic data structure, with dentries coming and going at all times. So one CPU might be removing a dentry at the same time that another is using it to look up a name. Chaos is avoided through the use of read-copy-update (RCU) to manage the removal of dentries; a dentry may be removed from the cache, but, if the thread using that dentry for lookup got a reference to it before its removal, the structure itself will continue to exist for as long as that thread needs it. The same should be true of the inode structure associated with that dentry.
Hugh tracked the problem down to a bit of code in walk_component():
err = do_lookup(nd, name, path, &inode); /* ... */ if (!inode) { path_to_nameidata(path, nd); terminate_walk(nd); return -ENOENT; }
If do_lookup() returns a null inode pointer, walk_component() assumes that a "negative dentry" has been encountered. Negative dentries are kept in the dentry cache to record the fact that a specific name does not exist; they are an important performance-enhancing feature in the Linux virtual filesystem layer. To see an example, run any simple program under strace and watch how many system calls return with ENOENT; lookups on nonexistent files happen frequently. What Hugh determined was that this inode pointer was coming back null even though the file exists, leading the code to believe that a negative dentry had been found and causing the "briefly vanishing file" problem.
Hugh must have looked at this code for some time before concluding that the kernel must be removing the dentry from the dcache at just the wrong time during the lookup process. As described above, the dentry itself continues to exist after its removal from the cache, but that does not mean that it is unchanged: the removal process sets its d_inode pointer to NULL. (It's worth noting that this behavior goes against normal RCU practice, which calls for the structure to be preserved unmodified until the last reference is known to be gone). Hugh concluded that this null pointer was being picked up later by the lookup process, causing walk_component() to conclude that the file does not exist when all that had happened was the removal of a dentry from the cache. His problem report included a patch causing the lookup code to check much more carefully when the inode pointer comes up null.
Linus acknowledged the problem but didn't like the fix which, he thought, was too specific to one particular situation. He proposed an alternative: just don't set d_inode to NULL; that would keep the inode pointer from picking up that value later. Al Viro posted an alternative fix which changed dcache behavior in less subtle ways, and worried about the possibility of introducing other weird bugs:
Once we are done with code audit, sure, I'm fine with ->d_inode being kept until dentry is actually freed. Any code that relies on that thing being cleared is asking for trouble and should be rewritten anyway. The only thing is, it needs to be found before we rewrite it...
Linus didn't like Al's fix either; it threatened to force slow lookups when negative dentries are involved. The discussion of the patches went on at some length; in the process of trying to find the safest way to fix this subtle bug the participants slowly came to the realization that they did not actually know what was happening. After looking at things closely, Linus threw up his hands and admitted he didn't understand it:
As it happens, Linus's exposition was enough to point Hugh at the real problem. Just as the process of transiting through a specific dentry is almost complete, do_lookup() makes a call to __follow_mount_rcu(), whose job is to redirect the lookup process if it is passing through a mount point. The inode pointer is passed to __follow_mount_rcu() separately; Hugh noticed that this function was doing the following:
*inode = path->dentry->d_inode;
In other words, the inode pointer is being re-fetched from the dentry structure; this assignment happens regardless of whether the dentry represents a mount point. That is the true source of the problem: if the dentry has been removed from the dcache after the lookup process gained a reference, d_inode will be NULL. So __follow_mount_rcu() will zero a pointer which had pointed to a valid inode, causing later code to think that the file does not exist at all.
Linus posted a fix for the real problem along with his now-famous Google+ posting saying that he was delaying the 3.0 release for a day just in case:
Linus delayed the release despite the inconvenient fact that it will push the 3.1 merge window into his planned vacation. That was a well-placed bit of caution on his part: the ObviouslyCorrect(tm) patch had YetAnotherSubtleBug(tm) in it. A fixed version of the patch exists, and this particular bug should, at this point, be history.
There is a sobering conclusion to be drawn from this episode, though. The behavior of the dentry cache is, at this point, so subtle that even the combined brainpower of developers like Linus, Al, and Hugh has a hard time figuring out what is going on. These same developers are visibly nervous about making changes in that part of the kernel. Our once approachable and hackable kernel has, over time, become more complex and difficult to understand. Much of that is unavoidable; the environment the kernel runs in has, itself, become much more complex over the last 20 years. But if we reach a point where almost nobody can understand, review, or fix some of our core code, we may be headed for long-term trouble.
Meanwhile, we should be able to enjoy a 3.0 release (and a 2.6.39 update)
without mysteriously vanishing files. One potential short-term problem
remains, though: given that the next merge window will push into Linus's
vacation, there is a distinct chance that he might be more than usually
grumpy with maintainers who get their pull requests in late. Wise
subsystem maintainers may want to be ready to go when the merge window
opens.
Index entries for this article | |
---|---|
Kernel | Dentry cache |
(Log in to post comments)
How to ruin Linus's vacation
Posted Jul 19, 2011 20:00 UTC (Tue) by cesarb (subscriber, #6266) [Link]
I wonder if we are reaching the point where we would need to write formal proofs for parts of the kernel code.
How to ruin Linus's vacation
Posted Jul 19, 2011 20:48 UTC (Tue) by smurf (subscriber, #17840) [Link]
Add in the fact that the kernel is reentrant (formal descriptions for concurrent processes? Dream on) and has the aforementioned caching and RCU and whatnot (so for each file there are multiple valid pre- and postconditions), and you're in for a _real_ treat.
I very much doubt that anybody can manage this for any specific non-trivial testcase, much less in general.
How to ruin Linus's vacation
Posted Jul 20, 2011 3:24 UTC (Wed) by bfields (subscriber, #19510) [Link]
Well, it's also true that you can't write purely "formal" proofs for most mathematical theorems. And yet, mathematics gets done, because people can write perfectly good proofs in ordinary language.
And in fact anyone that writes non-trivial code probably does form in their head at least a hand-wavy proof of its correctness. If those actually got written down, it would probably help clarify thinking and avoid some bugs. But that doesn't happen for the same reason that nobody writes documentation.
An example of an exception: Documentation/filesystems/directory-locking.
How to ruin Linus's vacation
Posted Jul 20, 2011 4:05 UTC (Wed) by viro (subscriber, #7872) [Link]
We do need such writeups, of course. If nothing else, writing them tends to find holes - see e.g. ->d_lock mess discussion on fsdevel lately. There the locking order had been fscked in head (not transitive, for one thing), but locks outside of that set had mostly avoided bad trouble. Trying to write the proof of correctness hadn't been fun (and what I've got still relies on unverified assumptions about the things filesystem code does not do; verifying those has already caught a bunch of really broken things), but it helped to catch rather nasty stuff. Simply by reasoning about the properties of counterexample - i.e. "what would a deadlock have to look like". It's math, like any other...
FWIW, I wonder what backgrounds people have - in my case, it's geometry and topology and _that_ has certainly helped to acquire many mental habits useful for that kind of work...
How to ruin Linus's vacation
Posted Jul 20, 2011 23:02 UTC (Wed) by bfields (subscriber, #19510) [Link]
Yeah. Especially for CS students, something like the classic first point-set topology course might give good experience with that kind of proof-or-counterexample mode of problem solving. I think that's rare, unfortunately, at least outside a few countries with very rigorous math programs?(Like some others, I'm a refugee from mathematics, coming late to this after getting a PhD (commutative algebra and some algebraic topology). Not a particularly smart career path, but fun in its own way.)
How to ruin Linus's vacation
Posted Jul 21, 2011 23:34 UTC (Thu) by fuhchee (guest, #40059) [Link]
"there is an arbitrarily complex state Y which is transformed into the almost-identical state Y', and the only relevant difference between Y and Y' is X"That sounds like the AI Frame Problem.
How to ruin Linus's vacation
Posted Jul 19, 2011 20:54 UTC (Tue) by kleptog (subscriber, #1183) [Link]
It happens frequently that a cache introduced for performance actually contains some race condition. One way of checking this is to make the cache throw away entries almost immediately, this has a way of stress testing certain failure modes. I wonder if that would have helped here.
In any case, testing assertions in the face of race conditions is something we could all use. There is software proving software that tries but the false positive rate is still too high. I truly hope that in the future we will have good tools for this, because complexity is only getting worse.
How to ruin Linus's vacation
Posted Jul 19, 2011 23:03 UTC (Tue) by Ben_P (guest, #74247) [Link]
How to ruin Linus's vacation
Posted Jul 19, 2011 21:57 UTC (Tue) by tshow (subscriber, #6411) [Link]
At least as I understand the state of the art in formal code proofs, the only place they work is places where they aren't particularly useful; places where things like timing, instruction reordering, hardware errors and concurrency are not considerations.
How to ruin Linus's vacation
Posted Jul 19, 2011 22:13 UTC (Tue) by smoogen (subscriber, #97) [Link]
https://bugzilla.redhat.com/show_bug.cgi?id=722472
I noticed that at least in rc7 it came out with a new error message versus slowing down until the deadlock indicator starts spewing "hard-locks"
How to ruin Linus's vacation
Posted Jul 28, 2011 17:12 UTC (Thu) by pebolle (subscriber, #35204) [Link]
Anyhow, the message lockdep prints, includes the neat line:
*** DEADLOCK ***
It took me a few days before I noticed that the kernel actually still was running quite OK after that disturbing message.
By the way, the reason you ran into this recently seems to be that systemd-30 is (apparently) the first program in wide use that triggers this.
How to ruin Linus's vacation
Posted Jul 19, 2011 22:22 UTC (Tue) by dgm (subscriber, #49227) [Link]
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
Brian W. Kernighan and P. J. Plauger in The Elements of Programming Style.
How to ruin Linus's vacation
Posted Jul 19, 2011 22:33 UTC (Tue) by neilbrown (subscriber, #359) [Link]
I think the reality is that writing really clever code actually makes you smarter - by solving a difficult problem you can learn something useful. So having written (or just read and understood) really clever code, you become smart enough to at least make an attempt at debugging it.
Maybe.
How to ruin Linus's vacation
Posted Jul 19, 2011 22:55 UTC (Tue) by dgm (subscriber, #49227) [Link]
How to ruin Linus's vacation
Posted Jul 20, 2011 0:26 UTC (Wed) by neilbrown (subscriber, #359) [Link]
The common pattern that I find in "bugs that make no sense" is that I am believing something that isn't true. Once I identify and challenge that belief, the bug becomes obvious. The reason that these bugs are harder to find in my own code is because I believe my code is correct (in general, and in lots of specifics), so I don't question it so closely. When I'm reading other people's code I have less belief that it is correct, so less hindrance to finding the bugs.
Put another way: finding a bug in my code is both a success and an admission of failure. Finding a bug in someone else's code is pure success. This sounds like a strong case for pair-programming!
How to ruin Linus's vacation
Posted Jul 20, 2011 2:25 UTC (Wed) by elanthis (guest, #6227) [Link]
It's also why "check your assumptions" is the first rule of debugging I teach to new coders. :)
How to ruin Linus's vacation
Posted Jul 20, 2011 2:44 UTC (Wed) by jzbiciak (guest, #5246) [Link]
This is also why I find debugging with print and assert statements to be much more helpful than single stepping, in the vast majority of cases. Single stepping was more useful to me when I didn't necessarily understand all the language constructs. That was 20-25 years ago, though.
Nowadays, if I have some weird bug, it's because something I think is true is not, or some other "invisible-to-me" error. A judiciously placed print statement (perhaps guarded by an 'if' to filter out the noise) makes it easy for me to prove or disprove my assumptions about what's happening in the code, though, without disrupting any of the code around it. I can compare working cases to non-working cases easily and in a batch-wise manner after the fact. Very useful.
How to ruin Linus's vacation
Posted Jul 21, 2011 19:19 UTC (Thu) by nas (subscriber, #17) [Link]
How to ruin Linus's vacation
Posted Jul 22, 2011 19:58 UTC (Fri) by smurf (subscriber, #17840) [Link]
Please enlighten us what this test means; not everybody is damiliar with the idiom.
How to ruin Linus's vacation
Posted Jul 23, 2011 9:45 UTC (Sat) by smurf (subscriber, #17840) [Link]
How to ruin Linus's vacation
Posted Jul 23, 2011 21:38 UTC (Sat) by neilbrown (subscriber, #359) [Link]
http://lmgtfy.com/?q=Rubber+Duckie+Test
Pick the wikipedia link on Rubber Duck Debugging.
How to ruin Linus's vacation
Posted Jul 25, 2011 18:08 UTC (Mon) by Lennie (subscriber, #49641) [Link]
I presume the name comes from working late because you can not figure out a problem. Then the janitor comes and he can sees you are kind of frustrated and asks what the problem is.
You try to explain it to the janitor in simple terms and then you understand your problem and can fix it.
Then you go home happy I guess.
The ducky is just a way to try and induce that state of mind with an inanimate object where you are trying to explain the problem to someone else in simple terms.
How to ruin Linus's vacation
Posted Jul 21, 2011 8:34 UTC (Thu) by dgm (subscriber, #49227) [Link]
"Oh Lord it's hard to be humble
when you're perfect in every way.
I can't wait to look in the mirror
cause I get better looking each day.
To know me is to love me
I must be a hell of a man.
Oh Lord it's hard to be humble
but I'm doing the best that I can."
-- Mac Davis
;-)
How to ruin Linus's patch
Posted Jul 19, 2011 23:18 UTC (Tue) by neilbrown (subscriber, #359) [Link]
Moving the assignment to *inode from the start to the end makes lots of sense.
The other mucking about with seqcount doesn't make any sense to me at all.
What exactly is the read_seqcount_retry on path->dentry->d_seq trying to protect? That dentry is the mounted-on dentry so it is pinned and cannot be renamed or deleted.
About the only thing I can think of that might need to be protected against at this point is the mount being unmounted - but if that were the goal of the code I would have expected a comment to that effect, and it doesn't seem like the right place for it anyway..
Any ideas?
How to ruin Linus's patch
Posted Jul 19, 2011 23:47 UTC (Tue) by viro (subscriber, #7872) [Link]
at that point - we are holding vfsmount_lock all the way through RCU
walk.
How to ruin Linus's vacation
Posted Jul 25, 2011 8:57 UTC (Mon) by mlankhorst (subscriber, #52260) [Link]
Blame the messenger, not the person who originally introduced the bug?
How to ruin Linus's vacation
Posted Jul 25, 2011 16:23 UTC (Mon) by bronson (subscriber, #4806) [Link]