A kernel change breaks GlusterFS
Did you know...? LWN.net is a subscriber-supported publication; we rely on subscribers to keep the entire operation going. Please help out by buying a subscription and keeping LWN on the net. |
Linus Torvalds has railed frequently and loudly against kernel developers breaking user space. But that rule is not ironclad; there are exceptions. As Linus once noted:
The story of how a kernel change caused a GlusterFS breakage shows that there are sometimes unfortunate twists to those exceptions.
The kernel change and its consequences
GlusterFS is a widely-used, free, scale-out, distributed filesystem that is available on Linux and a number of other UNIX-like systems. GlusterFS was initially developed by Gluster, Inc., but since Red Hat acquired that company in 2011, it has mainly driven work on the filesystem.
GlusterFS's problems sprang from an ext4 filesystem patch by Fan Yong that addressed a long-standing issue in ext4's support for the readdir() API by widening the "directory offset" values used by the API from 32 to 64 bits. That change was needed to reliably support readdir() traversals in large directories; we'll discuss those changes and the reasons for making them in a companion article. One point from that discussion is worth making here: these "offset" values are in truth a kind of cookie, rather than a true offset within a directory. Thus, for the remainder of this article, we'll generally refer to them as "cookies". Fan's patch made its way into the mainline 3.4 kernel (released in May 2012), but appears also to have been ported into the 3.3.x kernel that was released with Fedora 17 (also released in May 2012).
Fan's patch solved a problem for ext4, but inadvertently created one for GlusterFS servers that use ext4 as their underlying storage mechanism. However, nobody reported problems in time to cause the patch to be reconsidered. The symptom on affected systems, as noted in a July 2012 Red Hat bug report, was that using readdir() to scan a directory on a GlusterFS system would end up in an infinite loop in some cases.
The cause of the problem—as detailed by Anand Avati in a recent (March 2013) discussion on the ext4 mailing list—is that GlusterFS makes some assumptions about the "cookies" used by the readdir() API. In particular, although these values are 64 bits long, the GlusterFS developers noted that only the lower 32 bits were used, and so decided to encode some additional information—namely the index of the Gluster server holding the file—inside their own internal version of the cookie, according to this formula:
final_d_off = (ext4_d_off * MAX_SERVERS) + server_idx
This GlusterFS internal cookie is exchanged in the 64-bit cookie that is passed in NFSv3 readdir() requests between GlusterFS clients and front-end servers. (An ASCII art diagram posted in the mailing list thread by J. Bruce Fields clarifies the relationship of the various GlusterFS components.) The GlusterFS internal cookie allows the server to easily encode the identify of the GlusterFS storage server that holds a particular directory. This scheme worked fine as long as only 32 bits were used in the ext4 readdir() cookies (ext4_d_off), but promptly blew up when the cookies switched to using 64 bits, since the multiplication caused some bits to be lost from the top end of ext4_d_off.
An August 2012 gluster.org blog post by Joe Julian pointed out that the problem affected not only Fedora 17's 3.3 kernel, but also the kernel in Red Hat's Enterprise Linux distribution, because the kernel change had been backported into the much older 2.6.32 distribution kernel supplied in RHEL 6.3 and later. The recommended workaround was either to downgrade to an earlier kernel version that did not include the patch or to reformat the GlusterFS bricks (the fundamental storage unit on a GlusterFS node) to use XFS instead of ext4. (Using XFS rather than ext4 had already been recommended practice when using GlusterFS.) Needless to say, neither of these solutions was easily practicable for some GlusterFS users.
Mitigating GlusterFS's problem
In his March 2013 mail, Anand bemoaned the fact that the manual pages gave no indication that the readdir() API "offsets" were cookies rather than something like a conventional file offset whose range might bounded. Indeed, the manual pages rather hinted towards the latter interpretation. (That, at least, is a problem that is now addressed.) Anand went on to request a fix to the problem:
But, as the ext4 maintainer, Ted Ts'o, noted, Fan's patch addressed a real problem that affected well-behaved applications that did not make mistaken assumptions about the value returned by telldir(). Adding a mount option that nullified the effect of that patch would affect all programs using a filesystem and penalize those well-behaved applications by exposing them to the problem that the patch was designed to fix.
Ted instead proposed another approach: a per-process setting that allowed an application to request the older readdir() cookie semantics. The advantage of that approach is that it provides a solution for applications that misuse the cookie without penalizing applications that do the right thing. This solution could, he said, take the form of an ext4-specific ioctl() operation employed immediately after calling opendir(). Anand thought that should be a workable solution for GlusterFS. The requisite patch does not yet seem to have appeared, but one supposes that it will be written and submitted during the 3.10 merge window, and possibly backported into earlier stable kernels.
So, a year after the ext4 kernel change broke GlusterFS, it seems that a (kernel) solution will be found to address GlusterFS's difficulties. In passing, it's probably fair to mention that one reason that the (proposed) fix took so long in coming was that the GlusterFS developers initially thought they might be able to work around the kernel change by making changes in GlusterFS. However, it ultimately turned out to be impossible to exchange both a full 64-bit readdir() cookie and a GlusterFS storage server ID in the NFS readdir() requests exchanged between GlusterFS clients and front-end servers.
Summary: the meta-problem
In the end, the GlusterFS breakage might have been avoided. Ted's proposed fix could have been rolled out at the same time as Fan's patch, so as to minimize any disruptions for GlusterFS users. Returning to Linus's quote at the beginning of this article puts us on the trail of a deeper problem.
"If there's nobody around to see it, did it really break?
"
was Linus's rhetorical question. The problem is that this is a test whose
results can be rather arbitrary. Sometimes, as was the case in the implementation
of EPOLLWAKEUP, a kernel change that causes a minor breakage
in a user-space application that is doing strange things will be reverted
or modified because it is fortuitously spotted by someone close to the
development scene—namely, a kernel developer who notices a
misbehavior on their desktop system.
However, other users may be so far from the scene of change that it can be a considerable time before they see a problem. By the time those users detect a user-space breakage, the corresponding stable kernel may already be several release cycles in the past. One can easily imagine that few kernel developers are running a GlusterFS node on their development systems. Conversely, one can imagine that most users of GlusterFS are running production environments where stability and uptime are critical, and testing an -rc kernel is neither practical nor a high priority.
Thus, a rather important user-space breakage was missed—one that, if it had been detected, would almost certainly have triggered modification or reversion of the relevant patches, or stern words from Linus in the face of any resistance to making such changes. And, certainly, this is not a one-off case. Your editor did not need to look too far to find another example, where a change in the way that POSIX message queue limits are enforced in Linux 3.5 led to a report of breakage in a database engine nine months later.
The "if there's nobody around to see it" metric requires that someone is looking. That is of course a strong argument that the developers of user-space applications such as GlusterFS who want to ensure that their applications keep working on newer kernels must vigilantly and thoroughly test -rc kernels. Clearly that did not happen.
However, it seems a little unfair to place the blame solely on user
space. The ext4 modifications that affected GlusterFS clearly represented a
change to the kernel-user-space ABI (and for reasons that we describe in
our follow-up article, that change was clearly necessary). In cases such as
this (and the POSIX message queue change), perhaps even more caution was
warranted when making the change. At the very least, a loud announcement in
the commit message that the kernel changes represented a change to the ABI
would have been helpful; that might have jogged some reviewers to think
about the possible implications and resulted in the ext4 changes
being made in a way that minimized problems for GlusterFS. A greater
commitment on both sides to improving the documentation would also be
helpful. It's notable that even after deficiencies in the documentation
were mentioned as a contributing factor to GlusterFS problem, no-one sent a
patch to improve said documentation. All in all, it seems that parties on
both sides of the ABI could be doing a better job.
Index entries for this article | |
---|---|
Kernel | Filesystems/ext4 |
Kernel | User-space API |
(Log in to post comments)
A kernel change breaks GlusterFS
Posted Mar 27, 2013 21:14 UTC (Wed) by bfields (subscriber, #19510) [Link]
There's a proposed solution for glusterfs (with an excellent changelog) at http://review.gluster.org/#change,4711.
A kernel change breaks GlusterFS
Posted Mar 27, 2013 22:25 UTC (Wed) by nix (subscriber, #2304) [Link]
A kernel change breaks GlusterFS
Posted Mar 27, 2013 23:37 UTC (Wed) by rvolgers (subscriber, #63218) [Link]
A kernel change breaks GlusterFS
Posted Mar 27, 2013 21:38 UTC (Wed) by dvdeug (subscriber, #10998) [Link]
A kernel change breaks GlusterFS
Posted Mar 29, 2013 5:19 UTC (Fri) by alankila (guest, #47141) [Link]
A kernel change breaks GlusterFS
Posted Mar 29, 2013 9:48 UTC (Fri) by dvdeug (subscriber, #10998) [Link]
A kernel change breaks GlusterFS
Posted Mar 27, 2013 22:06 UTC (Wed) by jengelh (subscriber, #33263) [Link]
A kernel change breaks GlusterFS
Posted Mar 27, 2013 23:32 UTC (Wed) by bronson (subscriber, #4806) [Link]
A kernel change breaks GlusterFS
Posted Mar 27, 2013 23:43 UTC (Wed) by bronson (subscriber, #4806) [Link]
Sounds like XFS also uses 32 bit cookies but uses some additional memory to ensure it doesn't suffer the collision problem.
A kernel change breaks GlusterFS
Posted Mar 28, 2013 0:35 UTC (Thu) by bfields (subscriber, #19510) [Link]
Traditionally the cookie is a byte-offset into some linear representation of the directory. If you do that, then in practice a "64 bit" cookie in practice probably isn't going to exceed 2^32 until you have millions of entries.
ext4 is unusual in that it uses a hash of the filename as the cookie--so the 64-bit cookies are effectively random 64-bit sequences, and really do use the high bits.
(Well, OK, actually they're limited to 63 bits, but you get the idea.)
A kernel change breaks GlusterFS
Posted Mar 28, 2013 9:20 UTC (Thu) by paulj (subscriber, #341) [Link]
A kernel change breaks GlusterFS
Posted Mar 28, 2013 9:52 UTC (Thu) by dlang (guest, #313) [Link]
The latest NFS protocol specifies that the token passed over the wire is a 64 bit value (earlier versions specified a 32 bit value)
Other network filesystems have similar specifications.
In fact, from a little googling, it looks like the POSIX spec says that these cookies are of type 'long'. this makes changing it to something like a variable length string like you would need to use the filename extremely hard.
A kernel change breaks GlusterFS
Posted Mar 28, 2013 10:32 UTC (Thu) by paulj (subscriber, #341) [Link]
a) How to get a stable order over fixed-length IDs (dir size limited to the ID space then).
b) Assuming you can't get that, how would you design sane APIs that still allowed iteration of directories?
For b, it's names. However, I still don't understand why 'a' isn't possible. See my comment on the companion ext4 article about using a ring for the ID space.
A kernel change breaks GlusterFS
Posted Mar 28, 2013 16:29 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
A kernel change breaks GlusterFS
Posted Mar 27, 2013 23:06 UTC (Wed) by zlynx (guest, #2285) [Link]
This is exactly like C programmers who complain when their undefined behavior changes. Like reading a[-1] or depending on signed integer wrap or reading a double through an integer pointer.
It never should have worked. It never did work except on your one compiler and machine combination. And changing it isn't the compiler's problem.
Back to kernel examples, if some user space program begins relying on the number of columns when parsing the proc files is that the kernel's problem? No. That is just a badly written program.
A kernel change breaks GlusterFS
Posted Mar 28, 2013 0:57 UTC (Thu) by ringerc (subscriber, #3071) [Link]
A kernel change breaks GlusterFS
Posted Mar 28, 2013 10:55 UTC (Thu) by mkerrisk (subscriber, #1978) [Link]
But there *was* no ABI change. A 64-bit value continued to hold 64 bits.
The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter. One can argue that it was a questionable decision, but given the problem they were trying to solve, and the constraints on how much information they could pass in the cookie sent over NFSv3, it wasn't a completely insane thing to do, given the observed kernel behavior.
This is exactly like C programmers who complain when their undefined behavior changes.
The analogy doesn't really hold. For C, there is a very carefully defined standard that thoroughly specifies behavior and notes the cases where behavior is undefined. For much of the kernel API, there is nothing like such precise documentation/specification. This leaves user-space programmers trying to make guesses about what is or is not permissible, and that is exactly the hole that the Gluster folk fell into. And as noted by another commenter, the Samba folk fell into the same hole. The fact that two independent groups fell into the same hole is quite telling, in my view.
Back to kernel examples, if some user space program begins relying on the number of columns when parsing the proc files is that the kernel's problem? No. That is just a badly written program.
I think that's a weak example to support your argument, because the advice that one should parse /proc defensively is reasonably well known. And don't get me wrong, your argument is reasonable, but I think it's far from definitive.
Returning to my point about documented standards versus "the behavior as (it appears to be) implemented"... The Linux kernel violates standards in a number of places, and when it comes down to contradictions between documented behavior (man pages and standards) versus existing implementation, Linus always firmly plumps for the latter (unless the existing behavior is causing actual pain to user space).
And take a look at the EPOLLWAKEUP example referred to in the article. In that case, the problem was that a program was setting random bits in the epoll API that formerly had no meaning. The application had *no* good reason to set those bits, because they had absolutely no effect (and unfortunately there was no kernel check to give an error when that was done). When someone tried to give those bits a meaning, the application broke. The response was not to say: user-space, go fix your stupid application; that would just inflict pain on thousands of users as their binaries break. Instead the response was: we'll need to modify this kernel patch in such a way that it does not break user-space (and that changed _decreased_ the usability of the kernel feature that was added). The argument in that case that the kernel should change was much weaker than the argument would have been for accommodating GlusterFS, if the GlusterFS problem had actually been detected in time
You can't have it both ways. Linus pretty consistently goes one way, and I can see his point (though I've disagreed with some specific cases in the past).
A kernel change breaks GlusterFS
Posted Mar 28, 2013 17:54 UTC (Thu) by jimparis (guest, #38647) [Link]
There are two things they could have done that would have made it less insane:
- Ask the kernel developers if it's OK to assume high bits are zero
- Verify the assumption with an assert (instead of missing files and going into infinite loops)
A kernel change breaks GlusterFS
Posted Mar 29, 2013 17:24 UTC (Fri) by giraffedata (guest, #1954) [Link]
The question is where you consider the definition of the ABI to be. Is it some documented standard ("this is a 64-bit field"), or is it "the behavior as (it appears to be) implemented" ("only 32 bits are ever used in this field")? The GlusterFS folks clearly took it to be the latter.
According to what you wrote, they did more than that. They looked at the man page, which is the closest Linux comes to an ABI specification. They noted the language in there, which could have said "continuation cookie" or something, but actually uses the word "offset." This suggests that it is a very old interface from days before we understood layering the way we do now and that it is (or at least has to emulate) a byte offset within the stored representation of the directory. They probably noticed that the C type of the member is the one used for file offsets. That means only in a huge directory could it have nonzero upper bits.
It's not clear to me how they disposed of the possibility of a huge directory, but that's probably beside the point.
But I have to say I assume the developers knew they were taking a risk. Common sense would have told them that modern filesystems, especially the ones that haven't been invented yet, don't have such simple implementations of directories and their developers might well use the upper 32 bits freely. They decided to trade the future breakage for all the present benefits truncating the offset gave them . We tend to remember those tradeoffs differently after payment becomes due.
A kernel change breaks GlusterFS
Posted Apr 6, 2013 16:56 UTC (Sat) by jra (subscriber, #55261) [Link]
It's interesting how it happened though. The cookie returned from telldir() is defined as a 'long', not a fixed length type. Back in the day on simpler filesystems, this used to be the index into the directory.
We were lazy and just naturally assumed it would always be such, and back in the 32-bit days a long fit into a 32-bit DOS protocol search directory field, so we just stuffed it in there.
Modern CIFS/SMB/SMB[2|3] use a last-filename character string to restart a search, not an integer index, so newer clients never run into this problem. It's only old DOS clients that use the 32-bit index protocol request. So when the underlying systems went from 32-bit to 64-bit, we mostly didn't notice (everyone was running Windows rather than DOS, plus the kernel still didn't use the top 32-bits of the now 64-bit long return from telldir()).
Then the kernel changed, the top 32-bit started to be used and old DOS clients broke. Oops. As it's old DOS clients we still haven't fixed this (there really aren't that many still out there, if there were the NAS vendors would have been screaming for a fix long before now).
So what happened ? Ignorance, lazyness and errors on our part - mostly. However, if the telldir() interface had returned a deliberately opaque struct as a cookie instead of an integer that was expected to be an index, then we probably wouldn't have made this error.
So I'd argue internal semantics of telldir() changed (from index to cookie), but the ABI didn't (the cookie was just hidden inside what used to be the index).
Still, I don't think the kernel should change. We just need to fix our crappy userspace code and learn our lesson in future :-).
A kernel change breaks GlusterFS
Posted Apr 12, 2013 13:07 UTC (Fri) by meuh (guest, #22042) [Link]
For Samba at least it's certainly just a bug in our code.
Or it's a bug in the API: why not using an off_t type ?
It's interesting how it happened though. The cookie returned from telldir() is defined as a 'long', not a fixed length type. Back in the day on simpler filesystems, this used to be the index into the directory.
telldir() returns a long, but readdir() returns a struct dirent that, under Linux (see readdir(3)) hold an off_t d_off field. d_off might be 64bits wide even on a 32bits system with support for Large File Support (LFS) (compiled with -D_FILE_OFFSET_BITS=64 -D_LARGE_FILE_SOURCE=1), while long is still 32bits wide. So after the ext4 cookie extension one would be afraid to see that d_off could hold a value bigger than the one returned by telldir(3).
Hopefully readdir(3) is implemented using getdents(2) which returns an unsigned long on all configurations.
You might want to read my other comment regarding extending to 64bits something that's going to be a 32bits value anyway.
The ext4 change 'breaks' Samba too
Posted Mar 28, 2013 2:43 UTC (Thu) by abartlet (subscriber, #3928) [Link]
We see the same thing, and have had to blacklist a number of our testsuites from our automated testing, because when an older diretory searching/enumeration protocol is used, we mapped the readdir() cookie onto a 4 byte element in that protocol. These tests in turn started spinning in an infinite loop due to the truncation.
Now, the fix for us is presumably to have some other in-memory mapping from 64-bit back to 32-bit for the cookies Samba clients have obtained. Naturally we also would need to avoid a DoS by allocation of an infinite number of mapped cookies, and other challenges, which along with the real-world use case for the interface being old DOS clients, means we haven't got around to it yet.
I only mention this to indicate that this was noticed beyond GlusterFS.
Andrew Bartlett
Samba Team
The ext4 change 'breaks' Samba too
Posted Mar 28, 2013 7:54 UTC (Thu) by Lumag (subscriber, #22579) [Link]
The ext4 change 'breaks' Samba too
Posted Mar 28, 2013 13:34 UTC (Thu) by bfields (subscriber, #19510) [Link]
Once handed out to an NFS client, you don't know when the client may use a cookie; could be much later, could be after we've rebooted. So that internal mapping would have to be kept forever, on disk (to survive reboots).
You could do it, but I think you'd quickly start feeling like you were doing a ton of work that the filesystem should really be doing for you.
SMB might make this easier, I don't know.
It might be possible to extend the NFS protocol to use some kind of directory-read pointer with a more limited lifetime. But that won't solve anybody's problem today, at least not until we get the IETF a time machine. (Four days left for someone to publish an RFC addressing the paradoxes inherent to time-travelling standards processes.)
The tree makes noise, but we need a test to listen for it
Posted Mar 30, 2013 17:30 UTC (Sat) by davecb (subscriber, #1574) [Link]
Ones like this we would have considered either "uninterpreted opaque cookie" (good) or "must be zero" (bad). Finding the latter usually resulted in version-number changes, weird backward-compatibility tests to see if anyone had stolen the upper half for anything and debates about how to change the documentation.
As a side effect, we tried to write a static test or shared-library test we could run against any apps that were being compatibility-tested by their vendors, and we recorded the difference in our porting database, so that if another vendor or an old SunOS system used it, it would get fixed in a port.
The latter, IMHO, was the really valuable part: if Hockey-PUX had the bug, we'd refrain from reproducing it on applications ported to Solaris. Linux rarely has such bugs: other vendors (including ourselves) weren't as fortunate.
--dave
A kernel change breaks GlusterFS
Posted Mar 28, 2013 19:35 UTC (Thu) by xorbe (guest, #3165) [Link]
A kernel change breaks GlusterFS
Posted Mar 28, 2013 19:52 UTC (Thu) by dlang (guest, #313) [Link]
A kernel change breaks GlusterFS
Posted Mar 28, 2013 21:02 UTC (Thu) by Cyberax (✭ supporter ✭, #52523) [Link]
A kernel change breaks GlusterFS
Posted Apr 4, 2013 19:23 UTC (Thu) by aakef (subscriber, #38030) [Link]
Bernd
A kernel change breaks GlusterFS
Posted Apr 5, 2013 6:20 UTC (Fri) by mkerrisk (subscriber, #1978) [Link]
Bernd,
It was not my intention to assign blame, and I'm sorry if the article gave that impression. The change that Fan Yong (or something like it) made was obviously needed. And, at some level, I could imagine that the FS developers didn't see this as an ABI change. My point was a general one: even things that may not look like ABI changes may in fact be so, and developers as a group perhaps need to be even more vigilant about that possibility. (The message queue change is a similar kind of case. One could easily have thought of it as *not* being an ABI change, but it was.)
Thanks,
Michael