A nasty file corruption bug - fixed
In an attempt to explain what was going on, your editor will once again employ his rather dubious artistic skills. To that end, readers are kindly requested to look at the diagram to the right and suspend enough disbelief to imagine that it represents a page in memory - a page containing interesting data, and which represents an equivalent set of blocks found within a file on the disk. The distinction between the page and its component blocks is important, which is why the dotted lines divide up the page. A 4096-byte page in memory is likely represented by eight 512-byte disk blocks (which are, most likely, merged back together by the drive, but we'll pretend that isn't happening).
There are a couple of different kernel data structures which contain information about this page, making the diagram a bit more complicated:
The page may be mapped into one or more process address spaces. For each such mapping, there will be a page table entry (PTE) which performs the translation between the user-space virtual address and the physical address where the page actually lives. There is also some other information in the PTE, including a "dirty" bit. When the application modifies the page, the processor will set the dirty bit, allowing the operating system to respond by (for example) writing the page back to its backing store. Note that, if there are multiple PTEs pointing to a single page, they may well disagree on whether the page is dirty or not. The only way to know for sure is to scan all existing PTEs and see if any of them are marked dirty.
The kernel maintains a separate data structure known as the system memory map; it contains one struct page for every physical page known to exist. This structure contains a number of interesting bits of information, including a pointer to the page's backing store (if any), a data structure allowing the associated PTEs to be found relatively easily, and a set of page flags. One of those flags is a dirty bit - another flag which notes that the page is in need of writing to its backing store. (For those following closely, it may be worth pointing out that the red arrow pointing to the page does not actually exist as a pointer field; it is implicit in the structure's position within the memory map).
Finally, there is another set of structures which may be associated with this page:
The "buffer head" (or "bh") goes back to the earliest days of Linux. It can be thought of as a mapping between a disk block and its copy in memory. The bh is not central to Linux memory management in the way it once was, but a number of filesystems still use it to handle their disk I/O tracking. Note that there is not necessarily a bh structure for every block found within a page; if a filesystem has reason to believe that only some blocks need writing, it does not need to create bh structures for the rest. Among other things, the bh structure contains yet another dirty flag.
With all of these different flags representing what is essentially the same information, it is not entirely surprising that some confusion eventually came about. The maintenance of redundant data structures can be a challenge in any setting, and the kernel environment adds difficulties of its own.
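The three places where dirty state lives can be modeled in a few lines of Python. This is only an illustrative sketch, not kernel code: the class names `PTE`, `BufferHead`, and `Page`, their fields, and the helper `any_pte_dirty()` are hypothetical stand-ins for the structures described above.

```python
from dataclasses import dataclass, field
from typing import List

BLOCKS_PER_PAGE = 4096 // 512  # eight 512-byte blocks in a 4096-byte page


@dataclass
class PTE:
    """One mapping of the page into a process address space."""
    dirty: bool = False  # set by the processor on a write through this mapping


@dataclass
class BufferHead:
    """Tracks the in-memory copy of one disk block."""
    block_index: int
    dirty: bool = False


@dataclass
class Page:
    """Toy counterpart of struct page."""
    dirty: bool = False                                       # the page flag
    ptes: List[PTE] = field(default_factory=list)             # how the PTEs are found
    buffers: List[BufferHead] = field(default_factory=list)   # may cover only some blocks


def any_pte_dirty(page: Page) -> bool:
    """PTEs can disagree, so every one of them has to be checked."""
    return any(pte.dirty for pte in page.ptes)


# Two processes map the page; one of them writes to it.
page = Page(ptes=[PTE(), PTE()],
            buffers=[BufferHead(i) for i in range(BLOCKS_PER_PAGE)])
page.ptes[1].dirty = True
```

At this point only one PTE knows about the write; neither the page flag nor any buffer head has been updated, which is the kind of disagreement the rest of the story turns on.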
Deep within the kernel, there is a function called set_page_dirty(); it is used by the memory management code when it notices (via a PTE or a direct application operation) that a page is in need of writeback. Among other things, it copies the dirty bit from the page table entries into the page structure. If the page is part of a file, set_page_dirty() will call back into the relevant filesystem - but only if said filesystem has provided the appropriate method. Many filesystems do not provide a set_page_dirty() callback, however; for these filesystems, the kernel will, instead, traverse the list of associated bh structures and mark each of them dirty.
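The dispatch just described might be sketched as follows. This is a toy Python model under stated assumptions, not the kernel's implementation: the `Page` and `BufferHead` classes and the `fs_set_page_dirty` attribute are hypothetical, and only the name set_page_dirty() comes from the text.

```python
from typing import Callable, Optional


class BufferHead:
    def __init__(self) -> None:
        self.dirty = False


class Page:
    def __init__(self, nr_buffers: int,
                 fs_set_page_dirty: Optional[Callable[["Page"], None]] = None) -> None:
        self.dirty = False
        self.buffers = [BufferHead() for _ in range(nr_buffers)]
        # Hypothetical hook: the filesystem's own method, if it provides one.
        self.fs_set_page_dirty = fs_set_page_dirty


def set_page_dirty(page: Page) -> None:
    """Toy model: record dirtiness in the page, then notify the filesystem."""
    page.dirty = True
    if page.fs_set_page_dirty is not None:
        # The filesystem supplied a callback; let it decide what to do.
        page.fs_set_page_dirty(page)
    else:
        # Fallback: walk the buffer heads and mark each of them dirty.
        for bh in page.buffers:
            bh.dirty = True


# A filesystem without a callback: every buffer head ends up dirty.
plain = Page(nr_buffers=8)
set_page_dirty(plain)

# A filesystem with a callback: the buffer heads are left to the filesystem.
calls = []
custom = Page(nr_buffers=8, fs_set_page_dirty=calls.append)
set_page_dirty(custom)
```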
And that is where the problem comes in. The filesystem may well have noticed that a block represented by a given bh was dirty and started I/O on it before the set_page_dirty() call. When the I/O is complete, the filesystem clears the dirty flag in the bh. If the set_page_dirty() call comes while the I/O on the block is active, the filesystem will not notice the fact that the block's data may have changed after it was written. Instead, the block will be marked clean, even though what was written does not correspond to what is currently in memory. File corruption results.
Linus's fix is simple. When the virtual memory subsystem decides that it is time to write a page, a new call to set_page_dirty() is made, so that all buffer heads are marked dirty by the time the filesystem's writepage() method is called. As a result, all blocks of the page will be written; testers have confirmed that it makes the file corruption problems go away. The patch has gone into the mainline git repository; it should show up in the next 2.6.19 stable update as well.
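The effect of the fix can be illustrated with a small simulation. This is a simplified sketch, not the actual patch: the `Page` class, the `vm_write_page()` and `scenario()` helpers, and the `redirty_first` switch are hypothetical, and the model covers only one situation, a block modified in memory whose buffer head is clean when writeback begins.

```python
class Page:
    def __init__(self, nr_blocks: int) -> None:
        self.memory = [0] * nr_blocks         # current contents, one value per block
        self.bh_dirty = [False] * nr_blocks   # one buffer-head dirty flag per block
        self.pte_dirty = False


def writepage(page: Page, disk: list) -> None:
    """Toy filesystem writepage(): write only blocks whose bh is dirty."""
    for i, dirty in enumerate(page.bh_dirty):
        if dirty:
            page.bh_dirty[i] = False
            disk[i] = page.memory[i]


def vm_write_page(page: Page, disk: list, redirty_first: bool) -> None:
    """Toy VM writeback path, with or without the extra redirtying step."""
    if redirty_first:
        # Stand-in for the added set_page_dirty() call: mark every bh dirty.
        page.bh_dirty = [True] * len(page.bh_dirty)
    page.pte_dirty = False
    writepage(page, disk)


def scenario(redirty_first: bool) -> bool:
    """Return True if the disk matches memory after writeback."""
    disk = [0] * 8
    page = Page(8)
    # The application modifies block 3 through its mapping. The PTE is
    # dirty, but the buffer head for that block is (still) marked clean.
    page.memory[3] = 42
    page.pte_dirty = True
    vm_write_page(page, disk, redirty_first)
    return disk == page.memory


lost_without_fix = not scenario(redirty_first=False)
safe_with_fix = scenario(redirty_first=True)
```

Without the redirtying step, the modified block is skipped and the disk copy stays stale; with it, every block is written, at the cost of some I/O on blocks that may not have changed.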
The longer-term solution is to continue pushing buffer heads out of the kernel's I/O paths. As Linus puts it:
I think ext3 is terminally crap by now. It still uses buffer heads in places where it really really shouldn't, and as a result, things like directory accesses are simply slower than they should be. Sadly, I don't think ext4 is going to fix any of this, either.
Ted Ts'o responds that a fix for ext4 could yet happen, but it involves other filesystems as well. The ext3 filesystem is probably going to stay with buffer heads, however, meaning that the kernel will have to continue to work with them indefinitely.
Finally, this story illustrates just how hard it can be to track down and
fix certain kinds of kernel bugs. Early in the process it was hard for the
interested developers to reproduce the problem, so they had to rely on the
initial reporters to try out various patches. Those reporters stuck with
the process, building and testing a lot of kernels before the
problem was flushed out. They deserve much of the credit for the
resolution of this problem.
Index entries for this article | |
---|---|
Kernel | Buffer heads |
Kernel | Debugging |
Kernel | Filesystems/ext3 |
A nasty file corruption bug - fixed
Posted Jan 1, 2007 8:26 UTC (Mon) by arjan (subscriber, #36785) [Link]
I hope the distros will provide updates quickly since this seems to affect all 2.6 kernel versions out there...
A nasty file corruption bug - fixed
Posted Jan 1, 2007 9:57 UTC (Mon) by bronson (subscriber, #4806) [Link]
On the other hand, because nobody noticed this bug for four years, I don't think another week or two will cause anyone much trouble.
A nasty file corruption bug - fixed
Posted Jan 1, 2007 10:08 UTC (Mon) by arjan (subscriber, #36785) [Link]
people noticed.. in hindsight. Suddenly a series of db4 reports show up with people saying they see this regularly and it's now gone away with the fix...
A nasty file corruption bug - fixed
Posted Jan 1, 2007 21:23 UTC (Mon) by bgoglin (subscriber, #7800) [Link]
Nobody noticed previously because the bug was hidden. But some changes in 2.6.19 (dirty page balancing, causing writeback to happen earlier) revealed the bug, making it occur much more frequently. Everybody using 2.6.19 should probably downgrade to an earlier kernel or use an upcoming 2.6.19.2 with the fix.
A nasty file corruption bug - fixed
Posted Jan 1, 2007 21:29 UTC (Mon) by ber (subscriber, #2142) [Link]
With Cyrus imapd, especially within the Kolab Server we saw file corruptions which could be related to mmap problems. It occurs rarely enough that we do not have a testcase. Details at kolab/issue840. So I would welcome patches for older kernels and referable information on how long this bug has been in there.
A nasty file corruption bug - fixed
Posted Jan 1, 2007 23:05 UTC (Mon) by dlang (guest, #313) [Link]
It seems that this bug has been in the kernel since at least the 2.5 timeframe; the change in 2.6.19 just made it far easier to hit.
A nasty file corruption bug - fixed
Posted Jan 2, 2007 4:51 UTC (Tue) by iabervon (subscriber, #722) [Link]
This, of course, leaves out three-quarters of the story, in which quite a number of people, including Linus, found a number of things which were confusing or actual bugs, but weren't actually the real issue. There was quite a bit of argument about whether dirty bits on pages or page tables were getting lost in complicated situations in the VM (including Linus finding something that probably was a bug, and probably would cause the right sort of corruption, but fixing it didn't solve the problem), but it turned out not to be the issue at all.
I'm not sure I actually completely follow what was going on, but I think it's a bit more subtle than the article concludes. If the PTE is already dirty, further writes don't lead to set_pte_dirty() being called. But the buffer heads may be cleaned by the filesystem after the PTE is initially marked dirty and before later writes. Then, when the page is finally done, the buffer heads are already marked clean, so they're skipped. Linus finally found that, when the bug triggered, the kernel was deciding to write out the page, at a point where there was no activity, and then doing nothing because all of the buffer heads were clean.
(Linus had previously thought the issue was that, somewhere, a dirty bit was getting cleared when I/O was completed rather than when I/O started. If you clear the dirty bit when I/O is completed, you'd lose any writes which happen during I/O. But he couldn't find anywhere this was happening, because the real issue was different.)
A nasty file corruption bug - fixed
Posted Jan 2, 2007 5:51 UTC (Tue) by rganesan (guest, #1182) [Link]
I agree with this comment that the article does not tell the full story. In particular, I don't think the statement "When the I/O is complete, the filesystem clears the dirty flag in the bh." is correct. I believe the filesystem clears the dirty flag in the bh when the I/O is started.
A nasty file corruption bug - fixed
Posted Jan 5, 2007 20:28 UTC (Fri) by riel (subscriber, #3142) [Link]
You are correct. Dirty bits are cleared when I/O is started, so the application can dirty the page again while the disk I/O happens, without the kernel forgetting that the page was dirtied again.
A nasty file corruption bug - fixed
Posted Jan 2, 2007 9:37 UTC (Tue) by kay (subscriber, #1362) [Link]
The article may be a little confusing about this, but it states clearly: "If the set_page_dirty() call comes while the I/O on the block is active, the filesystem will not notice the fact that the block's data may have changed after it was written."
Kay
A nasty file corruption bug - fixed
Posted Jan 2, 2007 18:34 UTC (Tue) by iabervon (subscriber, #722) [Link]
But I don't think that's actually true. If the I/O on the block is active, it has already cleared the bh's dirty bit (because the rule is that you clear dirty bits when you decide to write out data, not when you finish, to plug exactly the race you're talking about), and therefore set_page_dirty() will set it and things will be okay. I think this was Linus's second-to-last theory (something was cleaning a buffer after it sent the data to the disk), but it turned out not to be the problem.
The issue is if the page gets written out after set_page_dirty() but before the last write to the page, because the VM didn't redirty buffers in dirty pages when more writes came in. After getting the concurrent dirtying case correct, it essentially missed the case of writes to a clean part of a dirty page.
A nasty file corruption bug - fixed
Posted Jan 2, 2007 21:29 UTC (Tue) by flewellyn (subscriber, #5047) [Link]
I might be naive in asking this, but why are buffer-heads still used at all? Obviously, filesystems were meant to transition away from using them for flushing, so what are they still used for?
Also, I might again be naive in asking, but why not patch all filesystems to not use them for flushing, if doing so is incorrect?
A nasty file corruption bug - fixed
Posted Jan 2, 2007 23:02 UTC (Tue) by dlang (guest, #313) [Link]
at some point you still need a pointer to each disk block of data, and that is what the bh is supposed to be used for (per Linus).
there are several good reasons for not just going in and changing all filesystems to not use them for flushing:
1. not everyone agrees with Linus (Andrew M for example)
2. it would be a very invasive set of changes to the filesystems, which would introduce their own risk of new bugs.
3. many people who agree with Linus that bh should not be used for flushing are also not sure of exactly what should be done to eliminate this (and how much of the new code should be filesystem neutral and how much should be specific to each filesystem)
There is still a race ?
Posted Jan 3, 2007 5:01 UTC (Wed) by mikov (guest, #33179) [Link]
Linus says that "it still has a tiny tiny race (see the comment), but I bet nobody can really hit it in real life anyway, and I know several ways
to fix it, so I'm not really _that_ worried about it."
This worries me a bit. Things that are never supposed to happen usually
happen first :-) Are they planning to fix that race ?
There is still a race ?
Posted Jan 6, 2007 4:07 UTC (Sat) by Lovechild (guest, #3592) [Link]
The following is my take, seeing as I'm a retard baby compared to actual kernel hackers I might be wrong.
If it's a strictly theoretical race and the fix means an overhead, it's often left with a comment to say 'here be dragons' so that if someone actually manages to hit it with a valid test case then it can be fixed. No need to endure overhead here and there for things that happen only in theory; it all adds up, you know. Also, adding code tends to cause more bugs to appear in subtle ways and adds to the complexity of reading and working with the codebase.
There is still a race ?
Posted Jan 6, 2007 17:11 UTC (Sat) by i3839 (guest, #31386) [Link]
I believe I read that this race, if it happened, would cause a writeout to happen twice, instead of only once. It wouldn't cause a writeout to be dropped, so this race can't cause corruption.
There is still a race ?
Posted Jan 8, 2007 13:43 UTC (Mon) by jzbiciak (guest, #5246) [Link]
I interpreted the comment to mean "don't let the fact there's a tiny race here stop you from trying out this intermediate, incomplete patch. I know how to fix the race, and presumably anything in its final form would do so."
Couple of clarifications
Posted Jan 13, 2007 6:11 UTC (Sat) by Nick (guest, #15060) [Link]
The article is quite good, but there may have been one thing unclear or not exactly right (to me, at least).
Firstly, there was no bug in 2.6.18 or earlier. Two bugs were introduced with the dirty shared mmap accounting patches: one was that pte dirty information would be thrown away, the other was removal of some vital lock coverage that exposed a race.
Secondly, the actual problem was not IO started before set_page_dirty() being called. As other people have noted, the buffers will be marked clean _before_ the IO starts, and set_page_dirty will redirty all buffers including the ones currently under IO.
The main problem was very simple: ptes were getting their dirty bits cleared without transferring that dirtiness into the page. Now this *appeared* to be safe, because that was only happening when we wanted to clean the dirty information, before starting page writeback. Now if the filesystem had previously cleaned some buffers, many filesystems will not write them out again when doing this page writeback.
Data is lost when the memory represented by one of these "clean" buffers has actually been modified via this pte.
Before the page dirty tracking patches went in, such a situation would also see the writeout of such a buffer to be skipped (because the dirty state is only in the pte, and not known to the pagecache). The difference is: that dirty info in the pte does not get chucked away -- it will get transferred to the page (and its buffers) either with msync, or when that memory is unmapped (munmap or exit).
Was that even slightly understandable or helpful? :)