
That massive filesystem thread


By Jonathan Corbet
March 31, 2009
Long, highly-technical, and animated discussion threads are certainly not unheard of on the linux-kernel mailing list. Even by linux-kernel standards, though, the thread that followed the 2.6.29 announcement was impressive. Over the course of hundreds of messages, kernel developers argued about several aspects of how filesystems and block I/O work on contemporary Linux systems. In the end (your editor will be optimistic and say that it has mostly ended), we had a lot of heat - and some useful, concrete results.

One can only pity Jesper Krogh, who almost certainly didn't know what he was getting into when he posted a report of a process which had been hung up waiting for disk I/O for several minutes. All he was hoping for was a suggestion on how to avoid these kinds of delays - which are a manifestation of the famous ext3 fsync() problem - on his server. What he got, instead, was to be copied on the entire discussion.

Journaling priority

One of the problems is at least somewhat understood: a call to fsync() on an ext3 filesystem will force the filesystem journal (and related file data) to be committed to disk. That operation can create a lot of write activity which must be waited for. But contemporary I/O schedulers tend to favor read operations over writes. Most of the time, that is a rational choice: there is usually a process waiting for a read to complete, but writes can be done asynchronously. A journal commit is not asynchronous, though, and it can cause a lot of things to wait while it is in progress. So it would be better not to put journal I/O operations at the end of the queue.

In fact, it would be better not to make journal operations contend with the rest of the system at all. To that end, Arjan van de Ven has long maintained a simple patch which gives the kjournald thread realtime I/O priority. According to Alan Cox, this patch alone is sufficient to make a lot of the problems go away. The patch has never made it into the mainline, though, because Andrew Morton has blocked it. This patch, he says, does not address the real problem, and it causes a lot of unrelated I/O traffic to benefit from elevated priority as well. Andrew says the real fix is harder:

The bottom line is that someone needs to do some serious rooting through the very heart of JBD transaction logic and nobody has yet put their hand up. If we do that, and it turns out to be just too hard to fix then yes, perhaps that's the time to start looking at palliative bandaids.
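
For the curious, "realtime I/O priority" refers to the CFQ I/O scheduler's realtime class; Arjan's patch simply puts kjournald into that class from within the kernel. The same mechanism is visible from user space through the ioprio_set() system call, as in this sketch (an illustration of the mechanism only, not the patch itself):

    /* Put the calling process into the realtime I/O class - roughly
     * what Arjan's patch does for kjournald, but from user space. */
    #include <stdio.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define IOPRIO_CLASS_SHIFT 13   /* Linux I/O-priority encoding */
    #define IOPRIO_CLASS_RT    1    /* the realtime class */
    #define IOPRIO_WHO_PROCESS 1

    int main(void)
    {
        int prio = (IOPRIO_CLASS_RT << IOPRIO_CLASS_SHIFT) | 0;

        /* who = 0 means "the calling process" */
        if (syscall(SYS_ioprio_set, IOPRIO_WHO_PROCESS, 0, prio) < 0) {
            perror("ioprio_set");
            return 1;
        }
        /* I/O submitted from here on is served ahead of all
         * normal-priority requests by the CFQ scheduler. */
        return 0;
    }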

Bandaid or not, this approach has its adherents. The ext4 filesystem has a new mount option (journal_ioprio) which can be used to set the I/O priority for journaling operations; it defaults to something higher than normal (but not realtime). More recently, Ted Ts'o has posted a series of ext3 patches which set the WRITE_SYNC flag on some journal writes. That flag marks the operations as synchronous, which will keep them from being blocked by a long series of read operations. According to Ted, this change helps quite a bit, at least when there is a lot of read activity going on. The ext3 changes have not yet been merged for 2.6.30 as of this writing (none of Ted's trees have), but chances are they will go in before 2.6.30-rc1.
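
The flavor of that change, paraphrased rather than copied from the actual patches, is a different flag at block-I/O submission time:

    /* Paraphrased illustration, not Ted's literal patch: journal
     * blocks are submitted as synchronous writes so that the I/O
     * scheduler does not park them behind a long queue of reads. */
    submit_bh(WRITE_SYNC, bh);    /* previously: submit_bh(WRITE, bh) */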

data=ordered, fsync(), and fbarrier()

The real problem, though, according to Ted, is the ext3 data=ordered mode. That is the mode which makes ext3 relatively robust in the face of crashes, but, says Ted, it has done so at the cost of performance and the encouragement of poor user-space programming. He went so far as to express his regrets for this behavior:

All I can do is apologize to all other filesystem developers profusely for ext3's data=ordered semantics; at this point, I very much regret that we made data=ordered the default for ext3. But the application writers vastly outnumber us, and realistically we're not going to be able to easily roll back eight years of application writers being trained that fsync() is not necessary, and actually is detrimental for ext3.
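
The pattern that application writers were "trained" out of is the careful atomic-update sequence, which looks something like this sketch (error handling abbreviated, names illustrative):

    /* The careful atomic-replace idiom: write the new version to a
     * temporary file, force it to disk, then rename it over the old
     * file.  On ext3 data=ordered, the fsync() can stall for
     * seconds - which is why applications learned to skip it. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int careful_replace(const char *path, const char *tmp,
                        const void *buf, size_t len)
    {
        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0666);

        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len ||
            fsync(fd) < 0) {     /* force the data out before renaming */
            close(fd);
            unlink(tmp);
            return -1;
        }
        if (close(fd) < 0)
            return -1;
        return rename(tmp, path);   /* atomic replacement */
    }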

The only problem here is that not everybody believes that ext3's behavior is a bad thing - at least, with regard to robustness. Much of this branch of the discussion covered the same issues raised by LWN in Better than POSIX? a couple of weeks before. A significant subset of developers do not want the additional robustness provided by ext3 data=ordered mode to go away. Matthew Garrett expressed this position well:

But you're still arguing that applications should start using fsync(). I'm arguing that not only is this pointless (most of this code will never be "fixed") but it's also regressive. In most cases applications don't want the guarantees that fsync() makes, and given that we're going to have people running on ext3 for years to come they also don't want the performance hit that fsync() brings. Filesystems should just do the right thing, rather than losing people's data and then claiming that it's fine because POSIX said they could.

One option which came up a couple of times was to extend POSIX with a new system call (called something like fbarrier()) which would enforce ordering between filesystem operations. A call to fbarrier() could, for example, cause the data written to a new file to be forced out to disk before that file could be renamed on top of another file. The idea has some appeal, but Linus dislikes it:

Anybody who wants more complex and subtle filesystem interfaces is just crazy. Not only will they never get used, they'll definitely not be stable...

So rather than come up with new barriers that nobody will use, filesystem people should aim to make "badly written" code "just work" unless people are really really unlucky. Because like it or not, that's what 99% of all code is.

And that is almost certainly how things will have to work. In the end, a system which just works is the system that people will want to use.
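
For the record, the rejected interface would have been used something like this; it is entirely hypothetical, since fbarrier() never existed outside the mailing-list discussion:

    /* HYPOTHETICAL - fbarrier() was only ever a proposal. */
    extern int fbarrier(int fd);

    /* ... after writing the new data to fd ... */
    fbarrier(fd);              /* order: data before later metadata */
    close(fd);
    rename(tmp, path);         /* may not be committed before the data -
                                  but nobody has to wait for a flush */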

relatime

Meanwhile, another branch of the conversation revisited an old topic: atime updates. Unix-style filesystems traditionally track the time that each file was last accessed ("atime"), even though, in reality, there is very little use for this information. Tracking atime is a performance problem, in that it turns every read operation into a filesystem write as well. For this reason, Linux has long had a "noatime" mount option which would disable atime updates on the indicated filesystem.

As it happens, though, there can be problems with disabling atime entirely. One of them is that the mutt mail client uses atime to determine whether there is new mail in a mailbox. If the time of last access is prior to the time of last modification, mutt knows that mail has been delivered into that mailbox since the owner last looked at it. Disabling atime breaks this mechanism. In response to this problem, the kernel added a "relatime" option which causes atime to be updated only if the previous value is earlier than the modification time. The relatime option makes mutt work, but it, too, turns out to be insufficient: some distributions have temporary-directory cleaning programs which delete anything which hasn't been used for a sufficiently long period. With relatime, files can appear to be totally unused, even if they are read frequently.
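
The mutt check is simple to express in code; a sketch of the comparison (mutt's real implementation is more involved):

    /* Is there unseen mail?  If the mailbox was modified more recently
     * than it was last read, yes.  Under noatime, st_atime stops
     * advancing, so a mailbox with delivered mail looks new forever. */
    #include <sys/stat.h>

    int has_new_mail(const char *mailbox)
    {
        struct stat st;

        if (stat(mailbox, &st) < 0)
            return -1;
        return st.st_mtime > st.st_atime;
    }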

If relatime could be made to work, the benefits could be significant; the elimination of atime updates can get rid of a lot of writes to the disk. That, in turn, will reduce latencies for more useful traffic and will also help to avoid disk spin-ups on laptops. To that end, Matthew Garrett posted a patch to modify the relatime semantics slightly: it allows atime to be updated if the previous value is more than one day in the past. This approach eliminates almost all atime updates while still keeping the value close to current.
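
Rendered in user-space form, the resulting test amounts to something like the following sketch (the kernel code differs in detail):

    /* Update atime only when skipping it could mislead someone: it is
     * older than mtime/ctime (the original relatime rule, which keeps
     * mutt working), or more than a day stale (Matthew Garrett's
     * addition, which keeps temporary-directory cleaners working). */
    #include <time.h>
    #include <sys/stat.h>

    int relatime_need_update(const struct stat *st, time_t now)
    {
        if (st->st_atime < st->st_mtime || st->st_atime < st->st_ctime)
            return 1;
        if (now - st->st_atime >= 24 * 60 * 60)
            return 1;
        return 0;
    }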

This patch was proposed for merging, and more: it was suggested that relatime should be made the default mode for filesystems mounted under Linux. Anybody wanting the traditional atime behavior would have to mount their filesystems with the new "strictatime" mount option. This idea ran into some immediate opposition, for a couple of reasons. Andrew Morton didn't like the hardwired 24-hour value, saying, instead, that the update period should be given as a mount option. This option would be easy enough to implement, but few people think there is any reason to do so; it's hard to imagine a use case which requires any degree of control over the granularity of atime updates.

Alan Cox, instead, objected to the patch as an ABI change and a standards violation. He tried to "NAK" the patch, saying that, instead, this sort of change should be done by distributors. Linus, however, said he doesn't care; the relatime change and strictatime option were the very first things he merged when he opened the 2.6.30 merge window. His position is that the distributors have had more than a year to make this change, and they haven't done so. So the best thing to do, he says, is to change the default in the kernel and let people use strictatime if they really need that behavior.

For the curious, Valerie Aurora has written a detailed article about this change. She doesn't think that the patch will survive in its current form; your editor, though, does not see a whole lot of pressure for change at this point.

I/O barriers

Suppose you are a diligent application developer who codes proper fsync() calls where they are necessary. You might think that you are then protected against data loss in the face of a crash. But there is still a potential problem: the disk drive may lie to the operating system about having written the data to persistent media. Contemporary hardware performs aggressive caching of operations to improve performance; this caching will make a system run faster, but at the cost of adding another way for data to get lost.

There is, of course, a way to tell a drive to actually write data to persistent media. The block layer has long had support for barrier operations, which cause data to be flushed to disk before more operations can be initiated. But the ext3 filesystem does not use barriers by default because there is an associated performance penalty. With ext4, instead, barriers are on by default.

Jeff Garzik pointed out one associated problem: a call to fsync() does not necessarily cause the drive to flush data to the physical media. He suggested that fsync() should create a barrier, even if the filesystem as a whole is not using barriers. In that way, he says, fsync() might actually live up to the promise that it is making to application developers.

The idea was not controversial, even though people are, as a whole, less concerned with caches inside disk drives. Those caches tend to be short-lived, and they are quite likely to be written even if the operating system crashes or some other component of the system fails. So the chances of data loss at that level are much smaller than they are with data in an operating system cache. Still, it's possible to provide a higher-level guarantee, so Fernando Luis Vazquez Cao posted a series of patches to add barriers to fsync() calls. And that is when the trouble started.

The fundamental disagreement here is over what should happen when an attempt to send a flush operation to the device fails. Fernando's patch returned an ENOTSUPP error to the caller, but Linus asked for it to be removed. His position is that there is nothing that the caller can do about a failed barrier operation anyway, so there is no real reason to propagate that error upward. At most, the system should set a flag noting that the device doesn't support barriers. But, says Linus, filesystems should cope with what the storage device provides.

Ric Wheeler, instead, argues that filesystems should know if barrier operations are not working and be able to respond accordingly. Says Ric:

One thing the caller could do is to disable the write cache on the device. A second would be to stop using the transactions - skip the journal, just go back to ext2 mode or BSD like soft updates.

Basically, it lets the file system know that its data integrity building blocks are not really there and allows it (if it cares) to try and minimize the chance of data loss.

Alan Cox also jumped into this discussion in favor of stronger barriers:

Throw and pray the block layer can fake it simply isn't a valid model for serious enterprise computing, and if people understood the worst cases, for a lot of non enterprise computing.

Linus appears to be unswayed by these arguments, though. In his view, filesystems should do the best they can and accept what the underlying device is able to do. As of this writing, no patches adding barriers to fsync() have been merged into the mainline.

Related to this is the concept of laptop mode. It has been suggested that, when a system is in laptop mode, an fsync() call should not actually flush data to disk; flushing the data would cause the drive to spin up, defeating the intent of laptop mode. The response to I/O barrier requests would presumably be similar. Some developers oppose this idea, though, seeing it as a weakening of the promises provided by the API. This looks like a topic which could go a long time without any real resolution.

Performance tuning

Finally, there was some talk about trying to make the virtual memory subsystem perform better in general. Part of the problem here has been recognized for some time: memory sizes have grown faster than disk speeds. So it takes a lot longer to write out a full load of dirty pages than it did in the past. That simple dynamic is part of the reason why writeout operations can stall for long periods; it just takes that long to funnel gigabytes of data onto a disk drive. It is generally expected that solid-state drives will eventually make this problem go away, but it is also expected that it will be quite some time, yet, before those drives are universal.

In the meantime, one can try to improve performance by not allowing the system to accumulate as much data in need of writing. So, rather than letting dirty pages stay in cache for (say) 30 seconds, those pages should be flushed more frequently. Or the system could adjust the percentage of RAM which is allowed to be dirty, perhaps in response to observations about the actual bandwidth of the backing store devices. The kernel already has a "percentage dirty" limit, but some developers are now suggesting that the limit should be a fixed number of bytes instead. In particular, that limit should be set to the number of bytes which can be flushed to the backing store device in (say) one second.
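
As it happens, 2.6.29 added a dirty_bytes knob alongside the older percentage-based one, so a byte-based policy can at least be prototyped from user space. A sketch, assuming the writeback bandwidth has already been measured by some external means:

    /* Cap dirty memory at roughly one second's worth of measured
     * writeback bandwidth, using /proc/sys/vm/dirty_bytes (added in
     * 2.6.29).  The bandwidth figure is an assumed input; the kernel
     * does not export it directly. */
    #include <stdio.h>

    int set_dirty_limit(unsigned long long bytes_per_sec)
    {
        FILE *f = fopen("/proc/sys/vm/dirty_bytes", "w");

        if (!f)
            return -1;
        fprintf(f, "%llu\n", bytes_per_sec);  /* one second's worth */
        return fclose(f);
    }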

Nobody objects to the idea of a better-tuned virtual memory subsystem. But there is some real disagreement over how that tuning should be done. Some developers argue for exposing the tuning knobs to user space and letting the distributors work it out. Andrew is a strong proponent of this approach:

I've seen you repeatedly fiddle the in-kernel defaults based on in-field experience. That could just as easily have been done in initscripts by distros, and much more effectively because it doesn't need a new kernel. That's data.

The fact that this hasn't even been _attempted_ (afaik) is deplorable. Why does everyone just sit around waiting for the kernel to put a new value into two magic numbers which userspace scripts could have set?

The objections to that approach follow these lines: the distributors cannot get these numbers right; in fact, they are not really even inclined to try to get them right. The proper tuning values tend to change from one kernel to the next, so it makes sense to keep them with the kernel itself. And the kernel should be able to get these things right if it is doing its job at all. Needless to say, Linus argues for this approach, saying:

We should aim to get it right. The "user space can tweak any numbers they want" is ALWAYS THE WRONG ANSWER. It's a cop-out, but more importantly, it's a cop-out that doesn't even work, and that just results in everybody having different setups. Then nobody is happy.

Linus has suggested (but not implemented) one set of heuristics which could help the system to tune itself. Neil Brown also has a suggested approach, based on measuring the actual performance of the system's storage devices. Fixing things at this level is likely to take some time; virtual memory changes always do. But some smart people are starting to think about the problem, and that's an important first step.

That, too, could be said for the discussion as a whole. There are clearly a lot of issues surrounding filesystems and I/O which have come to the surface and need to be discussed. The Linux kernel community as a whole needs to think through the sort of guarantees (for both robustness and performance) it will offer to its users and how those guarantees will be fulfilled. As it happens, the 2009 Linux Storage & Filesystems Workshop begins on April 6. Many of these topics are likely to be discussed there. Your editor has managed to talk his way into that room; stay tuned.




That massive filesystem thread

Posted Apr 1, 2009 0:36 UTC (Wed) by bojan (subscriber, #14302) [Link]

I love Linus' so called reality check:

> You may wish that was what they did, but reality is that "open(filename, O_TRUNC | O_CREAT, 0666)" thing.

Which is exactly what ext4 already works around. In line with reality.

> Harsh, I know. And in the end, even the _good_ applications will decide that it's not worth the performance penalty of doing an fsync(). In git, for example, where we generally try to be very very very careful, 'fsync()' on the object files is turned off by default.

Ah, thinking of doing fsync() after all, are we?

> Why? Because turning it on results in unacceptable behavior on ext3.

Chuckle :-)

And then, the real reality:

> Now, admittedly, the git design means that a lost new DB file isn't deadly, just potentially very very annoying and confusing - you may have to roll back and re-do your operation by hand, and you have to know enough to be able to do it in the first place.

Meaning, make your apps in such a way that an odd crash here and there cannot take out the whole thing.

That massive filesystem thread

Posted Apr 1, 2009 3:39 UTC (Wed) by ajross (guest, #4563) [Link]

Don't be snide, it wrecks the S/N ratio of this site. No doubt you've already made yourself heard in the other flame wars on the subject.

And to be fair, there's a difference in designing around "the odd crash here and there" and a 30 Second Window of Doom for every file creation.

That massive filesystem thread

Posted Apr 1, 2009 4:19 UTC (Wed) by bojan (subscriber, #14302) [Link]

> Don't be snide, it wrecks the S/N ratio of this site. No doubt you've already made yourself heard in the other flame wars on the subject.

What is your point here exactly? That I should not post because you may not like reading it? If you are a moderator of the site, please feel free to remove my post.

I make no apologies for my snideness - I think it was well deserved. Essentially, just because one file system does something in an idiotic way, we should now drop a perfectly good system call. Shouldn't we instead FIX what's broken so that all system calls and all file systems can be used as designed?

Similarly, we have seen heaps of new system calls introduced into Linux in recent times (dup3 and friends + other, backup related stuff from Ulrich Drepper), which all have to do with files. Why? Because they were needed. No complaints there. I thought the deal was that they would never get used? (see, being snide again).

> And to be fair, there's a difference in designing around "the odd crash here and there" and a 30 Second Window of Doom for every file creation.

And to be fair, there is difference in designing around complete system lockups for a number of seconds and committing data when required.

That massive filesystem thread

Posted Apr 1, 2009 8:34 UTC (Wed) by nix (subscriber, #2304) [Link]

Those new system calls (notably *at()) are present because 1) they fill a real hole in the API without which it is *impossible* to read files in particular directories (e.g. deeply nested ones) in a threaded app, and because 2) they allow real security holes in e.g. 'rm -r' to be fixed. Also they already existed in Solaris (hence the horrible misnaming of some of them, also inherited from Solaris). They are new syscalls hence there are no problems with people seeing old examples of their misuse.

They're not really intended for use by everyman, anyway.

The problem with what one might call the fsync() RANDOMLY_LOSE option is that it is something which must be used by everyman to avoid data loss, which if you get it wrong there is no sign unless you lose power at exactly the right time, and which nearly all programs you might clap eyes on other than Emacs have historically got wrong, and which many utility programs *cannot* get right no matter what, because there's no way they can tell if the data they are operating on is 'important', and thus should be fsync()ed, or not. (Sure, you could add a new command-line option to tell them, but that option is not in POSIX so portable applications can't rely on it for a long long time).

That's a big difference.

That massive filesystem thread

Posted Apr 1, 2009 10:24 UTC (Wed) by bojan (subscriber, #14302) [Link]

> They're not really intended for use by everyman, anyway.

You are kidding, right? dup3() is not for general use?

> That's a big difference.

Look, I'm not really bent on a particular mechanism of actually making sure that programmers have a reliable interface for doing this. Using fsync() before close() is the only portable solution now, but it is far from optimal. I think there is very little doubt about that. And we all know it sucks to high heaven on ext3 in ordered mode.

I don't know what the best way is: new call, some kind of flag to open that says O_ALWAYSDATABEFOREMETADATA, rename2(), close_with_magic() or whatever. But, saying that application programmers cannot grok this kind of stuff is just not true. They can and they will, only if given the tools. Just like they did dup3() and friends (and as you point out, there is little danger of misuse - these are new calls).

As I said many times before, overloading the current pattern with non-portable behaviour is dangerous, because it provides a false sense of robustness and ties one to a particular FS and kernel. If we can get POSIX updated so that rename() actually means "always data before metadata, but don't put it on disk now", then it may even fly. But I don't know how that's going to make guarantees retroactively, when even Linux features file systems that don't do that (e.g. ext3 in writeback mode).

Also, having things like delayed allocation, where metadata can legitimately be committed before data, is really useful. Most short lived temporary files will never see disk platters, therefore making things faster and disks last longer. Meaning, keeping the old cruft around ain't that bad.

As for utility programs that are called from scripts, you can use dd with conv=fsync or conv=fdatasync in your pipe to commit files to disk today. On FreeBSD, they already have a standalone fsync program for that. Yeah, I know. It sucks. But your usual tools don't have to make any decisions on fsync()-ing - you can.

That massive filesystem thread

Posted Apr 1, 2009 18:09 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

By your logic, we should never fix bugs. Remember the 25 year old readdir bug? Don't you agree it was good to fix that? What if a program, somewhere, depended on that behavior? In reality, programs use rename for atomic replacement. POSIX doesn't say anything about guarantees after a hard system crash, and it's just disingenuous to think that by punishing application authors by giving them as little robustness as possible, you're doing them some kind of portability favor.

That massive filesystem thread

Posted Apr 1, 2009 20:55 UTC (Wed) by bojan (subscriber, #14302) [Link]

I will just answer this one comment, so that nobody gets "offended".

Quite the opposite. I'm all for fixing bugs and giving application programmers the _right_ tools for the job. If some Linux developers took a second to lift their noses out of the specifics of Linux and actually looked around, this could be fixed for _everyone_, not just for some Linux specific file systems. That is my point, in case you didn't get it by now.

That massive filesystem thread

Posted Apr 1, 2009 21:37 UTC (Wed) by man_ls (guest, #15091) [Link]

It is a worthless effort. Each filesystem must keep its house clean. Why invent a new system call which cannot (by necessity) be honored by ext2, or by ext4 without a journal? Everything is now working fine in ext3, and if it doesn't work right in ext4, people will just look for a different filesystem.

Reading that Linus was not pulling from Mr Ts'o's trees made me suspicious. Well, now that Ts'o's commit rights have been officially revoked, I think the whole discussion is moot. I wonder if the next ext4 head maintainer will learn from this painful experience and just do the right thing.

ext4 trees

Posted Apr 1, 2009 21:46 UTC (Wed) by corbet (editor, #1) [Link]

I'm confused. The article said that Ted's trees had not been pulled yet. In fact, that happened today; a bunch of ext4 work went into the mainline, including a number of patches which increase robustness for applications which don't use fsync(). I dunno what you were trying to link to, but it didn't work. I've not seen anything about revocation of commit rights. (It's hard to "revoke commit rights" in a distributed system in any case; at worst you can refuse to pull from somebody else's repository.)

Maybe it's an April 1 post that went over my head?

Recursive linking

Posted Apr 2, 2009 6:21 UTC (Thu) by man_ls (guest, #15091) [Link]

Sorry, it was a foreigner's stupid attempt at an April Fools' prank :D I was hoping that the recursive link would give it away, but maybe it was too plausible altogether.

Will try to do better next time :D)

That massive filesystem thread

Posted Apr 1, 2009 22:38 UTC (Wed) by bojan (subscriber, #14302) [Link]

Just a few points, so please don't get offended. I apologise in advance to all sensitive LWN readers for any injury caused by this post.

> Why invent a new system call which cannot (by necessity) be honored by ext2, or ext4 without a journal?

Even if there was some kind of magical law that said that you could not order commits on the non-journaled file system this way, it can always be trivially implemented through - wait for it - fsync(), which has acceptable performance characteristics on such file systems.

> Everything is working now fine in ext3

Sure. Except fsync(), which locks the whole system for a few seconds. Hopefully, this will get fixed (or at least its effect reduced) as a result of the hoopla.

> Well, now that Ts'o's commit rights have been officially revoked I think that the whole discussion is moot.

Now you are really making a fool of yourself.

That massive filesystem thread

Posted Apr 2, 2009 23:16 UTC (Thu) by anton (subscriber, #25547) [Link]

The problem with what one might call the fsync() RANDOMLY_LOSE option is that it is something which must be used by everyman to avoid data loss, which if you get it wrong there is no sign unless you lose power at exactly the right time, and which nearly all programs you might clap eyes on other than Emacs have historically got wrong
s/other than/including/. However, I don't agree that this application behaviour is wrong; if the application wants to jump through hoops to get a little bit of extra safety on low-quality file systems, that's ok, but if it doesn't, that's also ok. It's up to the users to choose which applications they run and on which file system.

The end of LWN comment dialog?

Posted Apr 1, 2009 5:31 UTC (Wed) by ncm (guest, #165) [Link]

This is what the beginning of the end for unmoderated LWN commenting looks like: "Please be polite." "You can't make me." It's really astonishing that LWN has lasted this long. It's not an accident. bojan, you are striking a beautiful, fragile object with a hammer. If you don't understand how destructive you are being, please stop and think until you do understand it.

The end of LWN comment dialog?

Posted Apr 1, 2009 6:07 UTC (Wed) by bojan (subscriber, #14302) [Link]

If you read my original post in this thread, you will find that I am pointing at inconsistencies of what Linus describes as reality check. So, I ridicule (among other things) his conclusion that: ext3 sucks at doing fsync(), hence we should drop fsync().

What exactly is not polite about that? Is sarcasm now verboten on LWN? I see plenty of it. Daily.

In a post not so long ago, someone accused me of hiding behind Ted's authority (although I actually used documentation to support my case - which many don't bother to read, of course). This time, I point out what to me is nonsense coming from an even bigger authority, but that's no good either. I'm not sure what position of mine would satisfy fragile sensibilities here. Only silence, I guess.

This time I was being accused of making snide remarks. So, I replied to ajross using his terminology, although I do not actually agree with that qualification (which you can see from my sarcastic: "see, being snide again" remark) and I should have used "so called snideness" in my reply instead. I am really just being sarcastic, because we are all supposed to rally behind the high priest or something.

Sure, Linus is a genius, but that doesn't mean that whatever he says is beyond criticism. And, I do not see how I am not being polite by exercising criticism with a hint of sarcasm.

What is it exactly that you have the issue with in my posts? What exactly is impolite?

Yup. It's the beginning of the end.

Posted Apr 1, 2009 7:54 UTC (Wed) by khim (subscriber, #9252) [Link]

If you read my original post in this thread, you will find that I am pointing at inconsistencies of what Linus describes as reality check.

Nope. You are being a 100% smart-ass. Linus's reality check is not inconsistent. It's a description of reality, and reality is not consistent. When was it ever? There are different factors, and in different but quite real situations, different factors prevail.

So, I ridicule (among other things) his conclusion that: ext3 sucks at doing fsync(), hence we should drop fsync().

That's a different facet of reality. When you consider reality from the kernel developer's POV, what the applications are doing is your "unchangeable fact", your "speed of light"; when you consider reality from the application developer's POV, what the kernel does is the "unchangeable fact" and you should deal with it. This is true even if the kernel developer and the application developer are the same person. You can only think differently if your application is designed to be used only "in-house" and you can always guarantee control over both kernel and userspace - and git was not designed to be used only "in-house"...

And, I do not see how I am not being polite by exercising criticism with a hint of sarcasm.

You are exercising ignorance with a hint of sarcasm. That's different.

Yup. It's the beginning of the end.

Posted Apr 1, 2009 8:29 UTC (Wed) by bojan (subscriber, #14302) [Link]

> When you consider reality from kernel developer POV what the applications are doing is your "unchangeable fact", your "speed of light", when you consider reality from application developer POV what the kernel does is "unchangeable fact" and you should deal with it.

Let me review.

When another Unix kernel (or Linux) holds your data in buffers and commits metadata only (because it is allowed to), you, as an application developer, deal with it by ignoring that fact.

And, when your file system does crazy things with the perfectly good system call, you also ignore it as a kernel developer.

WOW, is that now the new "very special relativity"? We pick whichever behaviour is the most narrow to a specific file system and go with that?

Yup. It's the beginning of the end.

Posted Apr 1, 2009 14:22 UTC (Wed) by drag (guest, #31333) [Link]

> When another Unix kernel (or Linux) holds your data in buffers and commits metadata only (because it is allowed to), you, as an application developer, deal with it by ignoring that fact.

POSIX allows you never to write data to disk at all. That will make your file system very fast. After all you can have a POSIX-compliant file system that operates off of ramdisk quite easily.

POSIX file system access is designed to describe the interface layer between userland and the file system. The actual integration between the file system and the hardware, as well as the internals of the file system itself, is left up to the developer of the OS.

It is as if you suddenly discovered that a network service provided by an Apache-based web app uses SSL so badly that all usernames and passwords are transmitted over the Web in plain text... then you complain about it, and the developer tells you that his application's behavior is allowed by TCP/HTTP/SSL and that you should be changing your password with each usage, like people who use his app correctly do. Then he emails you some documentation from a security expert that says you should change your password frequently, and that many other protocols like telnet or ftp send your username and password over the network in plain text.

Yup. It's the beginning of the end.

Posted Apr 1, 2009 16:10 UTC (Wed) by foom (subscriber, #14868) [Link]

This is starting to get very repetitive... all these points have been made already at least once in one of the other article's threads. I'd like to suggest that it might be in everyone's interest to move on to more useful pastimes than rehashing the same arguments over and over again every time there's an update on the subject.

sticks & stones

Posted Apr 2, 2009 23:17 UTC (Thu) by xoddam (subscriber, #2322) [Link]

> In a post not so long ago, someone accused me of hiding behind Ted's authority

I plead guilty and I apologise. That was immediately after replying to someone else's post the gist of which was "Ted wrote ext2 and ext3 in the first place, he is therefore above criticism." It concluded with the words "Know your place", which got me riled.

[proverb: in the midst of great anger, never answer anyone's letter]

Your words were not so condescending but they had much the same emphasis: all ur filesystems are belong to POSIX (not users) 'cos POSIX is the law, and by the way Ted's interpretation is the only correct one because he's the primary implementor.

I hope you understand where I was coming from. Forgive me.

sticks & stones

Posted Apr 2, 2009 23:56 UTC (Thu) by bojan (subscriber, #14302) [Link]

Nothing to forgive. All is perfectly fine. I enjoy a robust discussion.

The end of LWN comment dialog?

Posted Apr 8, 2009 0:05 UTC (Wed) by jschrod (subscriber, #1646) [Link]

Well, I just decided to give you feedback, from someone who has been subscribed to LWN quite a bit longer than you and who did not participate in this topic after you took over all its discussion threads: you showed that LWN really needs a KILL-file feature where one can put a poster; you, in particular. Others have succinctly explained why, no need to repeat this.

But your self-righteousness obviously doesn't allow you to understand this. Luckily, there are still some discussion threads where you don't try to take over. I hope the likes of you will remain few on LWN in the future; this is not Slashdot, after all.

The end of LWN comment dialog?

Posted Apr 1, 2009 15:46 UTC (Wed) by GreyWizard (guest, #1026) [Link]

The comment from ajross above was not some gentle plea for polite discourse. What he actually said included this: "No doubt you've already made yourself heard in the other flame wars on the subject." A more accurate summary would have been, "Be polite you jerk."

People get nasty in the comments here all the time. If there's something beautiful and fragile here it's already in a thousand jagged pieces. But people hector one another about being polite all the time too. That also wrecks the signal-to-noise ratio and solves nothing.

The end of LWN comment dialog?

Posted Apr 4, 2009 9:05 UTC (Sat) by jospoortvliet (guest, #33164) [Link]

Hmmm. Just say whatever your brain produces, and if somebody has a problem
with what comes out, it's on their own plate.

Living in a country where that mode of thinking is the norm, I can tell you it also has disadvantages... If only because the resulting hurt feelings can muddy the discussion more than you might think. Besides, it chases people away who would otherwise have contributed constructively - it's not acceptable behavior in all cultures. Ever wondered why the FOSS community is still predominantly western, despite many smart developers in countries like India?

A little decency now and then doesn't hurt. I know people who, knowing how blunt they can be, ask someone else to read certain emails before sending them. After all, the reality is that people DO have feelings.

The end of LWN comment dialog?

Posted Apr 5, 2009 3:34 UTC (Sun) by GreyWizard (guest, #1026) [Link]

Pardon me for saying so but you have not understood what I wrote. I did not urge anyone to "say whatever [their] brain produces" or anything equivalent. Nor did I issue a rallying cry against decency. Elevating the level of discourse would be a wonderful thing and if you've got an effective suggestion for how to do so I would be glad to read it.

But saying "be polite you jerk" merely drags things even further down into the muck.

The end of LWN comment dialog?

Posted Apr 5, 2009 12:43 UTC (Sun) by jospoortvliet (guest, #33164) [Link]

I disagree on your argument that saying 'be polite you jerk' merely drags things down, for two reasons.

First of all, some people don't notice that their behavior is unnecessarily impolite. Pointing it out can help them (if they are willing to be reasonable in the first place). Never pointing out somebody's failures will make them fail forever.

Second, it shows you care about being polite. If others show they care too, a culture of 'you should be polite' can be maintained. As you might have noticed from the differences between FOSS communities, culture is important and heavily influential. And it can be changed.

Some things to note:
- people DO care about what others think of them. No matter how much they scream 'no I don't', they do. It is our nature.
- people should know their arguments are not supported by being mean - it is the other way around.
- I agree that a 'be polite you jerk' might not always be the best way to correct someone. A personal mail can do more. However, it won't show up in public (unless an apology is made), thus it does little to influence others who might think it is acceptable behavior because the guy got away with it. Of course, giving a good example is better than anything else.
- Of course discussing without end whether somebody was polite enough or not muddies the discussion and lowers the SNR.

The end of LWN comment dialog?

Posted Apr 5, 2009 15:42 UTC (Sun) by GreyWizard (guest, #1026) [Link]

Both of your arguments could reasonably be applied to a comment that says "please be polite" but both fail for "be polite you jerk." Someone who is accidentally rude is much more likely to respond to the "you jerk" part than the "be polite" part. The contradiction immediately destroys the credibility of the person posting because he or she is not willing to be held to the standard set for others.

A truly polite request for more courtesy might help but it's difficult to be sure because such things are quite rare. Giving in to the temptation to scold even just a little makes the comment worse than useless. Unless you are absolutely certain you can do it right it's better to focus on substantive issues and avoid appointing yourself a courtesy cop.

The end of LWN comment dialog?

Posted Apr 5, 2009 16:20 UTC (Sun) by jospoortvliet (guest, #33164) [Link]

True, if you meant to point out that 'be polite YOU JERK' isn't very effective, I agree. I do however think that it's better than nothing. Nothing changes nothing.

The end of LWN comment dialog?

Posted Apr 5, 2009 16:27 UTC (Sun) by GreyWizard (guest, #1026) [Link]

It's better to change nothing than to make the situation worse.

The end of LWN comment dialog?

Posted Apr 5, 2009 17:11 UTC (Sun) by jospoortvliet (guest, #33164) [Link]

Maybe. Probably for this one response. However, as I pointed out, there are wider repercussions to be expected from such behavior, and it is worth showing, as a community, that we disapprove of such ways of communicating. Even if that is done in a rather unfriendly manner.

On re-reading the thread, I think you are right in that ajross was more impolite than bojan, which often leads to a downward spiral and isn't helpful... bojan's post wasn't that far off from the normal tone on this site.

Anyway. This went pretty far off-topic, and I think we mostly agree. For as far as we don't, we at least agree on that ;-)

That massive filesystem thread

Posted Apr 1, 2009 6:27 UTC (Wed) by njs (guest, #40338) [Link]

> Meaning, make your apps in such a way that an odd crash here and there cannot take out the whole thing.

Well, yes, it's a nice goal. The problem is that *you can't* without calling fsync. When the guy who wrote the system calls it "very very annoying and confusing", then it's not really a great example of how we can make all our apps more awesome and usable in general. Unfortunately.

(During the whole ext4 discussion I spent some time trying to figure out how to abuse Ted's patches to create a transactional system that doesn't require rewriting whole databases on every write, and uses rename(2) for its write barrier instead of fsync(2). But I think block layer reordering makes it impossible. Maybe if there were an ioctl to trigger an immediate roll-over of the journal transaction.)

That massive filesystem thread

Posted Apr 1, 2009 7:15 UTC (Wed) by bojan (subscriber, #14302) [Link]

> The problem is that *you can't* without calling fsync.

fsync() sucks because it is a "commit now" thing. Not everyone wants to commit now - I fully understand that. I'm a notebook user and I don't want my disk being spun up unnecessarily. But, current semantics are what they are, so ignoring them is looking for trouble elsewhere. Sucks - yes, but living in denial doesn't help either. And, as you say, not a great way to make our apps more awesome. Just a necessary evil right now. Some of it can be avoided with backup files, but the underlying badness will persist.

It would be nice to have a system call that guarantees "data before metadata, but not necessarily now", so that other systems interested in it may also implement it. Then the apps could comfortably rely on this when renaming files over the top of other ones. I was even thinking that we should ask POSIX to standardise that fsync(-fd) means exactly that (because fd is always positive, but we use int for it, which can also have negative values), but this may confuse things even more and is probably stupid.

Sure, some Linux file systems will continue making it more comfortable even with the undefined order of current semantics, which will please users (BTW, this is really interesting: http://lwn.net/Articles/326583/). But, the long term solution in the world of Unix should probably be a bit more inclusive.

PS. To be fair to fsync(), it is an indispensable tool for databases, so making it work as fast as possible is most definitely a good thing. What ext3 in ordered mode does with it is an abomination.

That massive filesystem thread

Posted Apr 1, 2009 7:50 UTC (Wed) by ebiederm (subscriber, #35028) [Link]

POSIX/UNIX semantics guarantees that renames are atomic.

POSIX/UNIX semantics do not make guarantees about the filesystem state after an OS crash.

Not having to do fsck after a filesystem crash gives the illusion that the filesystem is not corrupted.

It turns out that, at least with extN, after a crash we see filesystem states that are illegal during normal operation. That is, despite not needing to run fsck, the filesystem was corrupted.

It would be nice if there were a filesystem that could guarantee that, whenever fsck does not need to be run, the visible state of the filesystem is:
- A legal state for the filesystem in normal operation.
- One in which everything that was fsynced is available.

Does anyone know of a journaling filesystem that guarantees not to give me a corrupt filesystem if fsck does not need to be run?

That massive filesystem thread

Posted Apr 1, 2009 8:05 UTC (Wed) by mjthayer (guest, #39183) [Link]

Yes, fsync does look like too blunt an instrument for many purposes. Your particular problem could be solved, though, if the system owner (i.e. you) were able to decide whether fsync should be honoured fully, partly, or not at all, rather than having one filesystem do it and another not. Having said that, you as the system owner are also in a position to choose a filesystem that works well with the behaviour you need...

That massive filesystem thread

Posted Apr 1, 2009 8:39 UTC (Wed) by nix (subscriber, #2304) [Link]

fsync() sucks not because it is a 'commit now' operation (that would be fbarrier()) but because it is a 'commit and force to disk now' operation.

Actually on many OSes it's a 'start a background force to disk now and return before it's done' operation; on Linux it's a 'lob it at the disk controller so it can cache it instead' operation. Still not necessarily useful (although that is changing to optionally emit a barrier to the disk controller too.)

(Speaking as the owner of an Areca RAID card with a quarter-gig of battery-backed cache, using non-RAIDed filesystems purely as an fs-cache storage area, I *like* the ability to turn off barriers: all they do is slow my system down with no reliability gain at all.)

That massive filesystem thread

Posted Apr 1, 2009 8:55 UTC (Wed) by bojan (subscriber, #14302) [Link]

By commit now, I meant force to disk. I think that was clear from the "disk being spun up unnecessarily" bit.

fbarrier vs. fsync

Posted Apr 1, 2009 12:02 UTC (Wed) by butlerm (subscriber, #13312) [Link]

"fbarrier(fd)" is not a "commit now" operation - that would make it
indistinguishable from fsync. It is a "commit data before metadata"
request.

The real technical problem here is that from the application perspective,
the meta data update must take place immediately, i.e. before the system
call returns. However, from a recovery perspective, it is highly desirable
that the persistent meta data state not be committed until after the data
has been committed. Unless a filesystem maintains two versions of its
metadata (a la soft updates), that is an unusually difficult requirement to
meet without serious performance problems.

The alternative that I would really like to see is undo records for a few
critical operations like rename replacement, such that the physical data /
meta data ordering requirements are removed, and on recovery the filesystem
un-does rename replacements where the replacement data has not been
committed to disk. That replaces the ideal of point-in-time recovery with
the more practical ideal of consistent version recovery.

That massive filesystem thread

Posted Apr 1, 2009 12:17 UTC (Wed) by butlerm (subscriber, #13312) [Link]

Or in other words, fbarrier is a completely different kind of barrier than
the one that the "barrier=1" mount option requests. The latter is a low
level block I/O write barrier usually implemented with a full write cache
flush (barring some sort of battery backup), the former is a data before
meta data barrier.

That massive filesystem thread

Posted Apr 2, 2009 20:23 UTC (Thu) by iabervon (subscriber, #722) [Link]

Linus actually overstated git's use of fsync(). There are three relevant cases:

  • git writes to a brand new filename, and then updates an existing file to contain the new filename instead of an old filename. It will optionally do a fsync() between these two operations.
  • git writes all of the data from several existing files to a single new file, and then removes the existing files. It will always do a fsync() between these two operations.
  • git updates an existing file by writing to a temporary file and renaming over the existing file. It will never do a fsync() between these two operations.

That is, git relies on the assumption that a rename() is atomic with respect to the disk and dependent on all operations previously issued on the inode that is being renamed. It uses fsync() only to make sure that operations to different files happen in the order that it wants.

Now, obviously, if you want to be really sure to keep some data, write it once and never replace it at all. That'll do a good job of protecting against everything, including bugs where you do something like "open(), fsync(), close(), rename()" but forget or mess up the "write()". Obviously, this isn't an option for a lot of situations, however, but it's what git does for the most important data.

I/O barriers and LVM

Posted Apr 1, 2009 7:11 UTC (Wed) by TRS-80 (guest, #1804) [Link]

Looks like LVM finally got support for barriers in 2.6.29 (a year after first submitted) although only for linear targets.

I/O barriers and LVM

Posted Apr 1, 2009 13:40 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

Nifty; thanks. This change alone is making me consider using a kernel.org kernel, something I haven't done in years.

I/O barriers and LVM

Posted Apr 1, 2009 19:03 UTC (Wed) by sbergman27 (guest, #10767) [Link]

"""
This change alone is making me consider using a kernel.org kernel, something I haven't done in years.
"""

Are you saying that distro kernels already do this? If so, I'm not suggesting that you are wrong. Just interested in more info.

I/O barriers and LVM

Posted Apr 1, 2009 19:08 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

No, they don't do this. LVM's lack of barrier support has resulted in my not using LVM in some circumstances. If kernel.org kernels grow LVM barrier support and are otherwise stable, it might be worth using them until distribution kernels are updated to include the functionality.

I/O barriers and LVM

Posted Apr 1, 2009 13:44 UTC (Wed) by masoncl (subscriber, #47138) [Link]

Unfortunately, they don't actually get the barriers down to the device. There's a bug in the implementation...hopefully it'll get fixed up in 2.6.30 and then backported for 2.6.29-stable

(Credit to Eric Sandeen for tracking it down).

That massive filesystem thread

Posted Apr 1, 2009 7:38 UTC (Wed) by bakterie (guest, #37541) [Link]

How do filesystems other than ext3/4 do it? I haven't seen mentions of
other filesystems in the discussions, but then again I haven't looked very
hard. One of my computers is using reiserfs. How does reiserfs do it?

That massive filesystem thread

Posted Apr 1, 2009 11:17 UTC (Wed) by masoncl (subscriber, #47138) [Link]

More or less exactly the same as ext3 ;) The reiserfs data=ordered mode is very similar to ext3's.

That massive filesystem thread

Posted Apr 2, 2009 0:37 UTC (Thu) by davecb (subscriber, #1574) [Link]

How do filesystems other than ext3/4 do it?

Well, the Unix v6 filesystem implemented in-order writes, as did 4.x BSD and the other pre-journaled filesystems. POSIX allows reordering to make coalescence easy, as a lot of research was being done at that time to get better performance.

A colleague at ICL (hi, Ian!) did his masters at UofT on that, and found you could get a performance improvement and still preserve correctness by using what I'd characterize as tsort(1), which worked better than BSD/Solaris soft updates.

--dave

That massive filesystem thread

Posted Apr 1, 2009 8:09 UTC (Wed) by mjthayer (guest, #39183) [Link]

Perhaps the atime thing could also be massaged a bit by keeping strict atime behaviour, but giving the updates a really low priority - leaving them in the cache to accumulate for a while, perhaps anything up to an hour, and trying to do them when the directory is to be updated anyway. I'm sure that if some system crash causes a loss of atime data, few people will be very sad.

That massive filesystem thread

Posted Apr 1, 2009 13:58 UTC (Wed) by TRS-80 (guest, #1804) [Link]

People were suggesting the same thing with renames - queue them up until the data is written to disk, but apparently it's too complex (BSD FFS softupdates being proof of this).

That massive filesystem thread

Posted Apr 2, 2009 6:40 UTC (Thu) by butlerm (subscriber, #13312) [Link]

'softupdates' is proof of the complexity of keeping two fully generalized
versions of the meta data around at all times - not only that but
converting back and forth between the two versions on demand.

You can't really "queue" a rename without doing something comparable to what softupdates does, because the rename has to take immediate effect from the application perspective. To do that, somewhere there must be a layer to keep track of the difference between the user-visible metadata and the committed metadata. If the differences are sufficiently general, that is a major problem. If one wants high-performance rename replacements, rename undo is much more practical.

It would be practical to update atimes on a low priority basis, with the
caveat that a lot of memory may be consumed holding metadata blocks around
until the atime updates are complete. In addition, on a system under
sufficient load, moving I/O to a low priority thread doesn't really help
anyway.

Rename undo

Posted Apr 2, 2009 14:01 UTC (Thu) by xoddam (subscriber, #2322) [Link]

> If one wants high performance rename replacements, rename undo is much more practical.

I'm intrigued, but not satisfied. Telling the journal that a metadata change is 'committed' means that the post-crash-recovery state will reflect the change (journal replay).

Surely the only satisfactory way to commit data before committing the metadata change is to delay *all* journal commits in-order until after the relevant file data is written in place, or to journal the data itself.

For performance reasons it's probably much saner not to journal most data, especially for random access within large files, but I'm thinking that if it makes sense to allocate-on-commit to preserve the in-order semantics of atomic rename, it might also make good sense to special-case data journalling for newly-written (created or truncated) files when they are renamed (perhaps only for small files, and allocate-on-commit larger ones as users will likely expect a delay).

Having the ability to unwind a specific kind of metadata change seems very confusing. I fear that winding back a rename could well result in a violation of expected in-order semantics w.r.t. metadata after crash recovery. Or might it be possible to wind back an entire 'transaction', all other metadata changes since the rename included?

Rename undo

Posted Apr 2, 2009 18:13 UTC (Thu) by butlerm (subscriber, #13312) [Link]

You are right, if you want guaranteed preservation of in-order semantics
there are no alternatives other than journalling the data or delaying all
journal commits until the corresponding data has been written. Both
options are available (e.g. data=journal, and data=ordered), and both have
serious performance problems. Of course, if that is really what you need,
then the price is worth paying.

"data=writeback" is the current alternative which doesn't make any pretense
to the preserving in-order semantics of data and meta-data after a crash.
You get a snapshot of your meta-data at a certain point of time, but the
data may be trashed.

Rename undo is a much less severe compromise to in-order semantics after a
crash. It is not point in time recovery, it is consistent version recovery.
That can have some unexpected side effects, but none remotely as severe as
losing the data completely.

In the case you mention, if you write a new version, rename it over the old
one, change the security permissions on the replacement, and then the
system crashes, you are not going to get the new (unwritten) data, the new
inode, and the new permissions; you are going to get the old inode, the old
data, and the old permissions. The permissions go with the inode (and the
data), not the directory entry. That is what you want. The old inode (and
the old data) has to be kept around until the data for the new inode is
completely on disk; otherwise you cannot undo the rename replacement after
the fact.

That massive filesystem thread

Posted Apr 1, 2009 19:41 UTC (Wed) by Steve_Baker (guest, #265) [Link]

Perhaps instead they could just invert the meaning of the 'A' file
attribute when a filesystem has been mounted with the noatime or relatime
option to force strict atime updates for files so marked. That way you
can mount your filesystem(s) noatime and only put the A attribute on your
mailboxes and you're done.

per-inode noatime

Posted Jun 10, 2009 10:06 UTC (Wed) by pjm (guest, #2080) [Link]

One can more or less achieve that by applying chattr +A to the whole filesystem and then chattr -A on specific files. Note that ‘chattr +A’ on a directory is inherited by any files created in that directory (though not by existing files moved into it).
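
For the curious, the same 'A' attribute can be toggled from C via the
inode-flags ioctl that chattr itself uses; a minimal sketch (error
handling trimmed):

    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>   /* FS_IOC_GETFLAGS, FS_NOATIME_FL */

    /* Set the per-inode noatime flag ("chattr +A") on path. */
    int set_noatime(const char *path)
    {
        int fd = open(path, O_RDONLY);
        int flags, ret = -1;

        if (fd < 0)
            return -1;
        if (ioctl(fd, FS_IOC_GETFLAGS, &flags) == 0) {
            flags |= FS_NOATIME_FL;
            ret = ioctl(fd, FS_IOC_SETFLAGS, &flags);
        }
        close(fd);
        return ret;
    }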

That massive filesystem thread

Posted Apr 4, 2009 7:42 UTC (Sat) by dirtyepic (guest, #30178) [Link]

Wouldn't mutt keep telling you there are new messages an hour after you've read them, then?

That massive filesystem thread

Posted Apr 4, 2009 7:44 UTC (Sat) by dirtyepic (guest, #30178) [Link]

Never mind, you said cache, didn't you?

That massive filesystem thread

Posted Apr 1, 2009 8:59 UTC (Wed) by lmb (subscriber, #39048) [Link]

The key issue with the performance penalty is that application writers intend fsync() to apply to their file(s), but it actually forces a filesystem-wide barrier. Fixing that, and selectively syncing just the relevant part of the journal, should help with the performance issues.

(A possible extension then might be to have fsyncl(), which accepts a list of fds to sync at the same time, but it is not strictly required.)

Or, of course, to get application writers to use more async IO.
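
fsyncl() is purely hypothetical; no such system call exists. A trivial
userspace stand-in just loops over the descriptors, which of course gains
none of the batching a real implementation could:

    #include <stddef.h>
    #include <unistd.h>

    /* Hypothetical fsyncl(): sync a list of fds "together".  As a loop
     * of fsync() calls it still pays the full per-call cost; a real
     * syscall could commit one journal transaction covering them all. */
    int fsyncl(const int *fds, size_t n)
    {
        for (size_t i = 0; i < n; i++)
            if (fsync(fds[i]) != 0)
                return -1;  /* errno set by fsync() */
        return 0;
    }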

When to flush()?

Posted Apr 1, 2009 11:22 UTC (Wed) by rvfh (guest, #31018) [Link]

I remain convinced that whether a file is critical, normal, or expendable should be an open() option.

The write-to-disk policy would thus be per-file, but it would remain the kernel's decision to flush whatever needs flushing whenever it deems necessary.

When to flush()?

Posted Apr 1, 2009 12:20 UTC (Wed) by RobSeace (subscriber, #4435) [Link]

You mean like open(O_SYNC)? Or, do you mean something more than that is required?

When to flush()?

Posted Apr 1, 2009 13:19 UTC (Wed) by rvfh (guest, #31018) [Link]

> You mean like open(O_SYNC)?

Yes, but with more granularity. O_SYNC means write everything immediately, whereas you might want to give the kernel some time to organise the reads/writes more efficiently, along the lines of the sketch below:
* O_CRITICAL: 1 second
* O_EXPENDABLE: 30 seconds
* default: 5 seconds
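
Neither flag exists; a sketch of the proposed interface, with the flag
values and deadlines invented for illustration:

    #include <fcntl.h>

    /* Hypothetical flags -- not in any kernel.  The values are invented
     * and could clash with real O_* bits; this only shows the interface. */
    #define O_CRITICAL   0x10000000  /* flush dirty data within ~1 second */
    #define O_EXPENDABLE 0x20000000  /* flushing within ~30 seconds is fine */

    int open_config(const char *path)
    {
        /* A config file: important, so ask for the short flush window. */
        return open(path, O_WRONLY | O_CREAT | O_CRITICAL, 0644);
    }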

Totally superfluous.

Posted Apr 2, 2009 14:14 UTC (Thu) by xoddam (subscriber, #2322) [Link]

The application developer who has the choice of using these options doesn't know whether the data is expendable or not, and the end-user doesn't care about the implementation details.

Things go strangely pear-shaped when the most irrelevant, trivial data (e.g. GNOME configs, when we're only using GNOME because it's a default someone else chose) goes missing or gets corrupted.

I most definitely don't care if GNOME forgets where I put a window or two. But I do care if it fails to start.

What we end-users want (I wear a developer hat much of the time but I'm *always* a user) is not to be annoyed by the things we don't care about. O_EXPENDABLE and its ilk are an invitation for corner-cases to bite end-users. End-users don't deserve such treatment.

Totally superfluous.

Posted Apr 2, 2009 14:30 UTC (Thu) by rvfh (guest, #31018) [Link]

> The application developer who has the choice of using these options doesn't know whether the data is expendable or not

How then are they expected to know when to flush?

And anyway, do we really not know which files are important and which are not?
Examples:
* pid file, browser cache: don't care
* conf file, document, code: care
* database file: care a lot

But I do thank you for challenging this idea ;-) Please feel free to give counter-examples and -arguments.

use of atime

Posted Apr 1, 2009 13:28 UTC (Wed) by ballombe (subscriber, #9523) [Link]

At least Debian's popularity-contest (http://popcon.debian.org) uses atime to tell whether a package was used recently or not.
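
A sketch of that heuristic (the path and the 30-day cutoff are
illustrative; popularity-contest's real logic is more involved):

    #include <stdio.h>
    #include <time.h>
    #include <sys/stat.h>

    /* Guess from its atime whether a file was used recently. */
    static int used_recently(const char *path, time_t window)
    {
        struct stat st;

        if (stat(path, &st) != 0)
            return 0;
        return time(NULL) - st.st_atime < window;
    }

    int main(void)
    {
        printf("%d\n", used_recently("/usr/bin/perl", 30 * 24 * 3600));
        return 0;
    }

Mounting with noatime defeats exactly this kind of check, which is the
point being made here.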

use of atime

Posted Apr 1, 2009 14:25 UTC (Wed) by knobunc (guest, #4678) [Link]

http://beta.linuxfoundation.org/news-media/blogs/browse/2...
--
Alternatively, you can use chattr +A to set the noatime flag on all files and directories where you don’t want atime updates, and then clear the flag for the Unix mbox files where you care about the atime updates. Since the noatime flag is inherited by default, you can get this behaviour by running chattr +A /mntpt right after the filesystem is first created and mounted; all files and directories created in that file system will have the noatime flag inherited.
--

fsync() and disk flushes

Posted Apr 2, 2009 19:15 UTC (Thu) by iabervon (subscriber, #722) [Link]

Linus pointed out that, in his experience, the only way you can possibly lose data that has gotten to a drive's write cache is to suddenly lose power. And if you suddenly lose power, in his experience, the drive is actually much more likely to wipe out some arbitrary track of data from the disk than it is to have anything in the write cache and lose it. That is, if you want to be really certain that you don't lose data, you need to have a couple seconds of power for your disk drive after its input goes idle, or you might lose some data completely regardless of anything the application that writes it could possibly do.

fsync() and disk flushes

Posted Apr 2, 2009 23:28 UTC (Thu) by anton (subscriber, #25547) [Link]

> And if you suddenly lose power, in his experience, the drive is actually much more likely to wipe out some arbitrary track of data from the disk than it is to have anything in the write cache and lose it.

While I have experienced drives that damage sectors or tracks on power loss, I consider these drives faulty; and with such drives the problem does not seem to be limited to drives that are trying to write something at the time. However, most drives don't wipe out arbitrary data in my experience.

But I have tested two drives with a test program for out-of-order writing, and found that they both wrote data several seconds out of order with a certain access sequence. If we don't see more frequent problems from this, that's probably because the disks don't optimize accesses as aggressively as some people imagine.
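
anton's actual test program isn't shown in the thread; here is a hedged
sketch of one way such a probe might work (the device path, offsets, and
block size are assumptions, and the test destroys whatever is on the
device):

    #define _GNU_SOURCE  /* O_DIRECT */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        /* DESTRUCTIVE: point this at a scratch device only. */
        int fd = open("/dev/sdX", O_WRONLY | O_DIRECT);
        void *buf;

        if (fd < 0 || posix_memalign(&buf, 4096, 4096) != 0)
            return 1;

        /* Each pass writes seq near the start of the disk first, then
         * far away.  After cutting power, a larger seq at the far
         * offset than at offset 0 means the drive made the later
         * write durable before the earlier one. */
        for (uint64_t seq = 1; ; seq++) {
            memcpy(buf, &seq, sizeof seq);
            pwrite(fd, buf, 4096, 0);
            pwrite(fd, buf, 4096, (off_t)1 << 32);  /* ~4GiB in */
        }
    }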

fsync() and disk flushes

Posted Apr 2, 2009 23:31 UTC (Thu) by xoddam (subscriber, #2322) [Link]

I haven't crashed many disks but my limited experience is similar. If you get to the point where data loss is caused by the hardware, it is likely to be trashing a whole lot more than the contents of the cache at shutdown. The solution to this problem is RAID; journals solve a different problem altogether.

fsync() and disk flushes

Posted Apr 3, 2009 0:02 UTC (Fri) by nix (subscriber, #2304) [Link]

RAID doesn't really solve sudden-power-loss situations: in fact RAID-5 in
particular can make it much worse (turning small-range corruption into
apparent scattershot corruption).

A UPS, or battery backing, is the answer (well, it moves the failure
point: with a UPS, the UPS must fail before you lose data; with battery
backing, you often have to lose the battery first and then power, which is
likely to happen because you often have no idea the battery has failed
until it's too late).

In conclusion: we all suck, our data is doomed, the Second Law shall
triumph and Sod and Murphy shall dance above our mangled filesystems.

fsync() and disk flushes

Posted Apr 4, 2009 0:01 UTC (Sat) by giraffedata (guest, #1954) [Link]

The answer is RAID and UPS, but not that way. The RAID goes over the UPS; e.g. a mirror of two disk drives, each with its own UPS.

Such redundancy also makes it possible to test the UPS regularly and avoid the problem of two dead batteries when the external power fails.

The UPS doesn't count if you don't test, measure, and/or replace its battery regularly.

fsync() and disk flushes

Posted Apr 3, 2009 23:49 UTC (Fri) by giraffedata (guest, #1954) [Link]

It's hard to believe there are disk drives out there (not counting an occasional broken one) that write trash over random areas as they power down. Disk drives I have seen have a special circuit to disconnect and park the head the moment voltage begins to drop. It has to park the head because you can't let the head land on good recording surface, and it has to cut off the write current because otherwise it's dragging a writing head all the way across the disk, pretty much guaranteeing the disk will never come back. I believe it's a simple circuit that doesn't involve any controller intelligence.

There is a related failure mode where the drive's client loses power and in its death throes ends up instructing the drive to trash itself while the drive still has enough power to operate normally. I've heard that's not unusual, and it's the best argument I know for a UPS that powers a system long enough for it to shut down cleanly.

fsync() and disk flushes

Posted Apr 27, 2009 6:24 UTC (Mon) by bersl2 (guest, #34928) [Link]

> It's hard to believe there are disk drives out there (not counting an occasional broken one) that write trash over random areas as they power down. Disk drives I have seen have a special circuit to disconnect and park the head the moment voltage begins to drop. It has to park the head because you can't let the head land on good recording surface, and it has to cut off the write current because otherwise it's dragging a writing head all the way across the disk, pretty much guaranteeing the disk will never come back. I believe it's a simple circuit that doesn't involve any controller intelligence.

> There is a related failure mode where the drive's client loses power and in its death throes ends up instructing the drive to trash itself while the drive still has enough power to operate normally. I've heard that's not unusual, and it's the best argument I know for a UPS that powers a system long enough for it to shut down cleanly.

One of these happened to me. $DEITY as my witness, I will never run an important system without a UPS again.

Bonus: The drive was a Maxtor. Serves me right.
Double bonus: That still wasn't traumatic enough to compel me to make backups.

fsync() and disk flushes

Posted Apr 27, 2009 10:43 UTC (Mon) by nix (subscriber, #2304) [Link]

You don't need a UPS. A battery-backed disk controller is just as good
(and perhaps better because the battery failing doesn't take your machine
down if the power is otherwise OK, while the UPS failing *does*).

Put hardware in between RAM and disk, with features to work in between these

Posted Apr 17, 2009 0:05 UTC (Fri) by hozelda (guest, #19341) [Link]

Is there a reason why solid-state drives aren't used on PCs as a cache level ahead of the main disk drives? They could even be made replaceable (it's just a cache, anyway). Wouldn't they improve the performance/correctness trade-offs enough for many use cases?


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds