
POSIX v. reality: A position on O_PONIES


September 9, 2009

This article was contributed by Valerie Aurora

Sure, programmers (especially operating systems programmers) love their specifications. Clean, well-defined interfaces are a key element of scalable software development. But what is it about file systems, POSIX, and when file data is guaranteed to hit permanent storage that brings out the POSIX fundamentalist in all of us? The recent fsync()/rename()/O_PONIES controversy was the most heated in recent memory but not out of character for fsync()-related discussions. In this article, we'll explore the relationship between file systems developers, the POSIX file I/O standard, and people who just want to store their data.

In the beginning, there was creat()

Like many practical interfaces (including HTML and TCP/IP), the POSIX file system interface was implemented first and specified second. UNIX was written beginning in 1969; the first release of the POSIX specification for the UNIX file I/O interface (IEEE Standard 1003.1) was released in 1988. Before UNIX, application access to non-volatile storage (e.g., a spinning drum) was a decidedly application- and hardware-specific affair. Record-based file I/O was a common paradigm, growing naturally out of punch cards, and each kind of file was treated differently. The new interface was designed by a few guys (Ken Thompson, Dennis Ritchie, et alia) screwing around with their new machine, writing an operating system that would make it easier to, well, write more operating systems.

As we know now, the new I/O interface was a hit. It turned out to be a portable, versatile, simple paradigm that made modular software development much easier. It was by no means perfect, of course: a number of warts revealed themselves over time, not all of which were removed before the interface was codified into the POSIX specification. One example is directory hard links, which permit the creation of a directory cycle - a directory that is a descendant of itself - and its subsequent detachment from the file system hierarchy, resulting in allocated but inaccessible directories and files. Recording the time of last access - atime - turns every read into a tiny write. And don't forget the apocryphal quote from Ken Thompson when asked if he'd do anything differently if he were designing UNIX today: "If I had to do it over again? Hmm... I guess I'd spell 'creat' with an 'e'". (That's the creat() system call to create a new file.) But overall, the UNIX file system interface is a huge success.

POSIX file I/O today: Ponies and fsync()

Over time, various more-or-less portable additions have accreted around the standard set of POSIX file I/O interfaces; they have been occasionally standardized and added to the canon - revelations from latter-day prophets. Some examples off the top of my head include pread()/pwrite(), direct I/O, file preallocation, extended attributes, access control lists (ACLs) of every stripe and color, and a vast array of mount-time options. While these additions are often debated and implemented in incompatible forms, in most cases no one is trying to oppose them purely on the basis of not being present in a standard written in 1988. Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.

Why, then, does the topic of when file system data is guaranteed to be "on disk" suddenly turn file systems developers into pedantic POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes down to this: Waiting for data to actually hit disk before returning from a system call is a losing game for file system performance. As the most extreme example, the original synchronous version of the UNIX file system frequently used only 3-5% of the disk throughput. Nearly every file system performance improvement since then has been primarily the result of saving up writes so that we can allocate and write them out as a group. As file systems developers, we are going to look for every loophole in fsync() and squirm our way through it.

Fortunately for the file systems developers, the POSIX specification is so very minimal that it doesn't even mention the topic of file system behavior after a system crash. After all, the original FFS-style file systems (e.g., ext2) can theoretically lose your entire file system after a crash, and are still POSIX-compliant. Ironically, as file systems developers, we spend 90% of our brain power coming up with ways to quickly recover file system consistency after a system crash! No wonder file systems users are irked when we define file system metadata as important enough to keep consistent, but not file data - we take care of our own so well. File systems developers have magnanimously conceded, though, that on return from fsync(), and only from fsync(), and only on a file system with the right mount options, the changes to that file will be available if the system crashes after that point.

At the same time, fsync() is often more expensive than it absolutely needs to be. The easiest way to implement fsync() is to force out every outstanding write to the file system, regardless of whether it is a journaling file system, a COW file system, or a file system with no crash recovery mechanism whatsoever. This is because it is very difficult to map backward from a given file to the dirty file system blocks needing to be written to disk in order to create a consistent file system containing those changes. For example, the block containing the bitmap for newly allocated file data blocks may also have been changed by a later allocation for a different file, which then requires that we also write out the indirect blocks pointing to the data for that second file, which changes another bitmap block... When you solve the problem of tracing specific dependencies of any particular write, you end up with the complexity of soft updates. No surprise then, that most file systems take the brute force approach, with the result that fsync() commonly takes time proportional to all outstanding writes to the file system.

So, now we have the following situation: fsync() is required to guarantee that file data is on stable storage, but it may perform arbitrarily poorly, depending on what other activity is going on in the file system. Given this situation, application developers came to rely on what is, on the face of it, a completely reasonable assumption: rename() of one file over another will either result in the contents of the old file, or the contents of the new file as of the time of the rename(). This is a subtle and interesting optimization: rather than asking the file system to synchronously write the data, it is instead a request to order the writes to the file system. Ordering writes is far easier for the file system to do efficiently than synchronous writes.
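
To make the pattern concrete, here is a minimal sketch of the application-side idiom being described - the names "config" and save_config() are purely illustrative, and error handling is abbreviated. Note that nothing in it waits for the disk; it relies only on the rename() landing after the data:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Replace "config" with newly written "config.tmp", relying on the
     * file system to order the data writes before the rename(). */
    int save_config(const char *buf, size_t len)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        if (write(fd, buf, len) != (ssize_t)len) {  /* write the new contents */
            close(fd);
            unlink("config.tmp");
            return -1;
        }
        close(fd);
        /* The hope: after a crash, "config" contains either the old data or
         * the new data - i.e., the data above reaches disk before the rename. */
        return rename("config.tmp", "config");
    }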

However, the ordering effect of rename() turns out to be a file system specific implementation side effect. It only works when changes to the file data in the file system are ordered with respect to changes in the file system metadata. In ext3/4, this is only true when the file system is mounted with the data=ordered mount option - a name which hopefully makes more sense now! Up until recently, data=ordered was the default journal mode for ext3, which, in turn, was the default file system for Linux; as a result, ext3 data=ordered was all that many Linux application developers had any experience with. During the Great File System Upheaval of 2.6.30, the default journal mode for ext3 changed to data=writeback, which means that file data will get written to disk when the file system feels like it, very likely after the file's metadata specifying where its contents are located has been written to disk. This not only breaks the rename() ordering assumption, but also means that the newly renamed file may contain arbitrary garbage - or a copy of /etc/shadow, making this a security hole as well as a data corruption problem.

Which brings us to the present day fsync/rename/O_PONIES controversy, in which many file systems developers argue that applications should explicitly call fsync() before renaming a file if they want the file's data to be on disk before the rename takes effect - a position which seems bizarre and random until you understand the individual decisions, each perfectly reasonable, that piled up to create the current situation. Personally, as a file systems developer, I think it is counterproductive to replace a performance-friendly implicit ordering request in the form of a rename() with an impossible to optimize fsync(). It may not be POSIX, but the programmer's intent is clear - no one ever, ever wrote "creat(); write(); close(); rename();" and hoped they would get an empty file if the system crashed during the next 5 minutes. That's what truncate() is for. A generalized "O_PONIES do-what-I-want" flag is indeed not possible, but in this case, it is to the file systems developers' benefit to extend the semantics of rename() to imply ordering so that we reduce the number of fsync() calls we have to cope with. (And, I have to note, I did have a real, live pony when I was a kid, so I tend to be on the side of giving programmers ponies when they ask for them.)
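
For comparison, this is roughly the sequence that the fsync()-before-rename position described above asks applications to adopt - a sketch using the same illustrative names as before, with the one added (and potentially very expensive) synchronous call:

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Same pattern, with the explicit flush that the fsync()-before-rename
     * position requires.  Error checks omitted for brevity. */
    int save_config_fsync(const char *buf, size_t len)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, buf, len);
        fsync(fd);            /* wait for the file data to reach stable storage */
        close(fd);
        return rename("config.tmp", "config");
    }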

My opinion is that POSIX and most other useful standards are helpful clarifications of existing practice, but are not sufficient when we encounter surprising new circumstances. We criticize applications developers for using folk-programming practices ("It seems to work!") and coming to rely on file system-specific side effects, but the bare POSIX specification is clearly insufficient to define useful system behavior. In cases where programmer intent is unambiguous, we should do the right thing, and put the new behavior on the list for the next standards session.





POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 15:54 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]

Great article. Everyone listen to Valerie.

Great article!

Posted Sep 9, 2009 16:36 UTC (Wed) by dwheeler (guest, #1216) [Link]

I agree, great article. Thanks for putting this in perspective.

I appreciate the side comment that something like this should be submitted to the next version of the POSIX standard. Standards authors try to document the expectations of users and implementers, and sometimes they omit something important. It'd be hard to nail down the exact expectation, but it'd be worth it.

hail czar

Posted Sep 9, 2009 17:05 UTC (Wed) by ncm (guest, #165) [Link]

I think we can shorten that to, simply, "Everyone listen to Valerie."

hail czar

Posted Sep 9, 2009 20:07 UTC (Wed) by sbergman27 (guest, #10767) [Link]

I don't usually do "Me too!". But in this case, I will make an exception. Absolutely, listen to Valerie, Goddess of Sanity in the world of Linux/Unix filesystems. And who I've always suspected was Stephen Tweedie's long-lost twin sister, separated at birth.

hail Ivanova!

Posted Sep 10, 2009 13:58 UTC (Thu) by liljencrantz (guest, #28458) [Link]

And keep in mind the Linux file system mantra:

«Valerie is always right.
I will listen to Valerie.
I will not ignore Valerie's recommendations.
Valerie is god.»

(Stolen from B5)

hail Ivanova!

Posted Sep 14, 2009 21:16 UTC (Mon) by roelofs (guest, #2599) [Link]

«Quoters of B5 are always right.
I will listen to quoters of B5.
I will not ignore the recommendations of those who quote B5.
JMS is god.»

;-)

(Apologies to JMS and LWN for the off-topic drivel...)

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 16:04 UTC (Wed) by nix (subscriber, #2304) [Link]

> Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.

That's decidedly optional, isn't it? So failing link() on directories isn't a conformance violation anyway.

(If it wasn't for NFS, I suspect a much wider violation would be failure to support the seekdir()/telldir() horror show.)

POSIX v. reality: A position on O_PONIES

Posted Sep 18, 2009 21:02 UTC (Fri) by jch (guest, #51929) [Link]

> Thats decidedly optional, isn't it? So failing link() on directories isn't a conformance violation anyway.

Indeed. According to the 2001 edition:

> Upon successful completion, link() shall mark for update the st_ctime field of the file. Also, the st_ctime and st_mtime fields of the directory that contains the new entry shall be marked for update.

POSIX v. reality: A position on O_PONIES

Posted Sep 20, 2009 19:47 UTC (Sun) by nix (subscriber, #2304) [Link]

Um, I think you copied the wrong section of the standard ;) Also 2001 is kind of out of date now.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 16:10 UTC (Wed) by jonth (guest, #4008) [Link]

An article in the best Reithian tradition: It informed, educated and entertained. Thanks, Valerie.

Only with a UPS

Posted Sep 10, 2009 2:02 UTC (Thu) by ncm (guest, #165) [Link]

We can only complain of what was omitted, not what was said. What was omitted, as is unfortunately always omitted from presentations by file system experts howsoever brilliant, is mention of the crucial distinction between crashes and power drops. Disks being the way they are, none of the above really applies to power drops. If power to the drive drops, your file system can offer you no guarantee, O_PONIES support notwithstanding.

Anywhere that a power drop is overwhelmingly more likely than a system crash, which includes Most of the Known World, the whole discussion is more or less moot. That does not make the discussion moot overall, though, because people who care about their data can move themselves into the Rest of Known World by providing a few seconds' UPS backing for the drive. We just need to make clear which part of the world we're talking about.

Only with a UPS

Posted Sep 10, 2009 6:14 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

Could you elaborate more on this distinction?

Only with a UPS

Posted Sep 10, 2009 17:34 UTC (Thu) by ncm (guest, #165) [Link]

There are two concerns. First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector.

Second, more subtle but probably more important, drives lie about what is physically on disk. To look good on benchmarks, they tell the controller that sectors have been physically copied to the platter while they are still only in buffer RAM in the drive -- up to several megabytes' worth. A few seconds after the last controller operation, these writes have drained to the disk. Before that, there's no guessing which have been written and which haven't, and blocks the system meant to write first may be written last. As a consequence, after powerup the file system sees blocks that are supposed to have important metadata in them with, instead, whatever was left there.

Only with a UPS

Posted Sep 10, 2009 18:05 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

Okay, but I guess what I'm asking is, what do you classify as a crash, if power drops are not included?

Only with a UPS

Posted Sep 10, 2009 19:19 UTC (Thu) by ncm (guest, #165) [Link]

Back about 1998, a Windows user told me that, for him, Windows "hardly ever crashes". Further questioning revealed that he defined "crash" as "I have to re-install". Lockups, a multiple-daily event, didn't count. Generally, though, by "crash" we mean the system stops responding to events, and must be re-started; usually this is a software failure, although all manner of hardware faults can cause it. When these happen, the disk has plenty of time to drain its buffers. Usually the software fault has not caused any disk writes with crazy parameters.

OS developers don't count power drops among crashes because those aren't their fault. That's commendable, because when they say "crash" they mean something they accept responsibility for.

Only with a UPS

Posted Sep 10, 2009 22:20 UTC (Thu) by flewellyn (subscriber, #5047) [Link]

Ah, that makes sense.

Handling power drops, to me, seems to be a matter of impossibility, at least as long as disks lie about when writes actually complete.

Only with a UPS

Posted Sep 11, 2009 8:29 UTC (Fri) by jschrod (subscriber, #1646) [Link]

I'd say a crash happens any time that my system stands still because of some kernel oops, or any time I have to press the reset button because something hangs beyond redemption. (The latter being more often the case in my environment.)

Only with a UPS

Posted Sep 10, 2009 18:18 UTC (Thu) by aliguori (subscriber, #30636) [Link]

> Second, more subtle but probably more important, drives lie about what is physically on disk.

Not really. You can make a disk tell you when data is actually on the platter vs in the write cache. Furthermore, most "enterprise" drives have battery-backed write caches that guarantee enough power for the write caches to be flushed.

Only with a UPS

Posted Sep 10, 2009 19:06 UTC (Thu) by ncm (guest, #165) [Link]

> You can make a disk tell you when data is actually on the platter

No, you can ask a disk to tell you. It might even be honest about it if you never get the buffer too full. The commercial incentives to lie for the sake of benchmarks are extremely strong. Drives that don't lie cost a lot more, and are slower. Honesty is an extra-cost option. If you don't pay for honesty (few do) you won't get it.

Honesty usually costs a lot more than a UPS.

Only with a UPS

Posted Sep 10, 2009 19:12 UTC (Thu) by dlang (guest, #313) [Link]

I have yet to see a drive that includes battery backup in it, and I work at a place that spends millions of dollars a year on enterprise grade storage

I think this is a myth like the drives that use platter energy to power themselves to write their buffer.

if you can point to a drive that includes a battery backup on the drive please post a link to it.

Only with a UPS

Posted Sep 10, 2009 19:22 UTC (Thu) by ncm (guest, #165) [Link]

I suspect aliguori is referring to disk-array boxes, not drives.

Only with a UPS

Posted Sep 11, 2009 16:24 UTC (Fri) by nix (subscriber, #2304) [Link]

Even then... it's just occurred to me that while I know my Areca RAID controller has its own battery-backed cache, I have no idea whether it's asked the drives it controls to turn off *their* internal write cache... (Obviously we don't want that cache, as it's not battery-backed in any way.)

Only with a UPS

Posted Sep 11, 2009 14:41 UTC (Fri) by anton (subscriber, #25547) [Link]

> First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector.

In my experiments on cutting power on disk drives while writing, the drives did not corrupt sectors. I have seen IBM and Maxtor drives corrupt sectors under more unusual power fluctuation circumstances; maybe that's a reason why you can no longer buy drives from IBM or Maxtor; Hitachi (IBM successor) and Seagate-Maxtor (not Seagate proper) are certainly on my don't-buy list.

And a modern file system can protect against the corruption of a single sector:

E.g., in a journaling file system, that sector is either in the log or in the permanent storage. If it's in the log, just stop the replay when you encounter the sector. If it's in permanent storage, then you will notice that the replay write fails, and the file system can remap the sector/block to a working one (or the drive might remap it transparently on the replay write, or might just perform the write on the original sector; in these cases the file system has nothing to do). Of course, if the file system performs only meta-data journaling, then it will likely not notice corrupt data (because it is not accessed during replay), but apparently neither the file system maintainer nor the user (or whoever decided to use a meta-data journaling file system) cares about data anyway, so that's ok.

In a copy-on-write file system, the sector either contains the root of the file system, or it contains something written after the last root. In the latter case these blocks are unreachable anyway after recovery (unless there is also an intent log, in which case the discussion above applies). If the root is affected, then on recovery the youngest alternative root is read, giving us the latest consistent state of the file system.

> Second, more subtle but probably more important, drives lie about what is physically on disk.

In the experiments mentioned above, when the drive had write caching enabled (the default on PATA and SATA drives), the drives not only reported completion right away but, worse, also reordered the writes (so using barriers or turning off write caching is essential for every kind of consistency).

With write caching disabled, the results of my experiments (both in performance and in what was on disk after powering off) are consistent with the theory that the drive reports the completion of writes only after the sector hits the platter and (with the program I used) consequently only wrote the sectors in order.

BTW, it's not just the drive manufacturers that default to fast rather than safe; the Linux kernel developers do a similar thing (with a much smaller performance incentive) when they disable barriers by default, when they turned ext3 from data=journal to data=ordered (letting data=journal rot), and recently to data=writeback (although that may be just to make ext3 as bad as ext4 so people will not switch back). Hmm, are Solaris or BSD developers less cavalier about their users' data?

On the subject: UPSs and computer PSUs can fail, too. Better to recommend dual power supplies with dual UPSs; double failures should be relatively rare.

Only with a UPS

Posted Sep 10, 2009 12:02 UTC (Thu) by xilun (guest, #50638) [Link]

On the other hand, I've already written highly intrusive file system software (in the form of specialized data reordering for HFS+, to be able to non-destructively resize this fs) and tested it something like 20 times on non-trivial fs content by unplugging the power cord of the computer in the middle of a resize operation, without experiencing a single case of data corruption (and the fs was also always at least quickly recoverable, though this wasn't even needed for read-only operations to work properly).

I know that my statistical sample is too small (and worse, this was 6 years ago and I don't know if hard drives today are of the same quality), but anyway my first guess is that if the software is careful enough and the hardware of decent quality, the risk of massive data corruption due to a power failure is not too high (at least in the absence of bad system design, like using RAID 5/6 in a power-unsafe context).

Only with a UPS

Posted Sep 10, 2009 14:13 UTC (Thu) by Cato (guest, #7643) [Link]

Since we are trading anecdotes, here's mine: http://lwn.net/Articles/350072/ - loss of thousands of files and LVM metadata corruption on a PC using ext3 on top of LVM.

This PC was frequently reset accidentally by the user pressing the power button, which caused at least one data loss event within one year. Since disabling write caching (and a couple of other changes) I've not had any data loss on this PC, but it's probably too early to be sure these changes have fixed the problem.

FWIW, I believe that at least on this setup, disabling write caching helps avoid ext3 and LVM corruption.

Only with a UPS

Posted Sep 11, 2009 5:28 UTC (Fri) by magnus (subscriber, #34778) [Link]

In my experience, system hangs are much more common than power outages for desktop systems.

In the past I've had to reboot due to X server hangs (probably problems in the display driver), oopses due to unstable hardware (memory mainly) and sometimes soft hangs like losing connection to an NIS or NFS server or getting PAM misconfigured and not having a prompt to work from.

Only with a UPS

Posted Sep 19, 2009 20:30 UTC (Sat) by efexis (guest, #26355) [Link]

Alt+Printscreen+U. Always press it before a reboot; if the kernel hasn't oopsed, it can save your data :-)

Only with a UPS

Posted Sep 20, 2009 7:51 UTC (Sun) by Cato (guest, #7643) [Link]

There are some other handy Magic SysRq (i.e. Alt-PrintScreen) keystrokes as well: http://en.wikipedia.org/wiki/Magic_SysRq_key#.22Raising_E...

Only with a UPS

Posted Sep 21, 2009 7:20 UTC (Mon) by efexis (guest, #26355) [Link]

Although I wouldn't recommend using S(ync) as the third option for rebooting the system, after terminating processes etc. If the system's becoming unstable, syncing the drives is the very first thing I'd want to do. I prefer the order S-E-I-U-B. AFAIA, an S before U is redundant as buffers are written out as part of the remount-ro process, so a separate sync() isn't needed (if anyone knows otherwise please correct me).

But of all those, the U is the most important, as if it succeeds it will protect your filesystem. You may end up with some left over temp files as tasks that didn't receive the terminate request signal didn't clean up after themselves, but this is usually not too great a cost.

Alex

Only with a UPS

Posted Sep 22, 2009 6:13 UTC (Tue) by Cato (guest, #7643) [Link]

Yes, personally I prefer the mnemonic Raising Skinny Elephants Is Utterly Boring.

Only with a UPS

Posted Sep 22, 2009 9:28 UTC (Tue) by efexis (guest, #26355) [Link]

Does the R do much? If you're rebooting/etc anyway... if the kernel's able to trap the Alt+SysRq+R, then it can trap the S/E/I/U/B keys too? Or is there another reason for it?

Only with a UPS

Posted Sep 12, 2009 0:35 UTC (Sat) by spitzak (guest, #4593) [Link]

You are wrong.

While the power is still running, and the disk is spinning and working perfectly, EXT4 has *already* stored information on it that says the file that the atomic rename() went to is empty. The disk is in the wrong state! It is irrelevant whether a power failure may further damage the data!

Only with a UPS

Posted Sep 12, 2009 6:22 UTC (Sat) by ncm (guest, #165) [Link]

You rather miss the point. Given reliable storage -- i.e., storage that doesn't lie about what's reached disk, or has enough battery backup to make sure it gets there eventually -- it's possible to write a reliable file system. Without that, it doesn't matter how well done the file system is; a power drop can corrupt it. If you want safety against power drops, you need both.

Power loss -> no guarantee?

Posted Sep 17, 2009 11:11 UTC (Thu) by forthy (guest, #1525) [Link]

This is wrong. Consider a log-structured, checksummed file system like NILFS. It gathers all writes, writes them out in one go, and checksums every chunk it writes. What happens when power is lost during that write? The checksum is wrong. The last update before it isn't touched, so the file system will revert to that last update. All is hunky-dory, all ponies still there, no data lost except the last update - which is the guarantee of such a file system: you can only depend on data being on disk once the transaction containing it was completely written to disk. And note: writing one sector to a hard disk takes a few microseconds nowadays, so the drive can detect a power outage and stop writing before it randomly scrambles a sector - it might not complete everything, but it can avoid leaving a garbled sector.

On the other argument: In the part of the world where I live (Munich), power outages are far less frequent than crashes. Our file server had some CPU problems two years ago and crashed about once a week. Thanks to the stability of ReiserFS, no data loss occurred during the half year until we found the root cause and replaced the CPUs. Even when not including hardware defects, I definitely have more crashes than power outages. Frequent power outages happen in poor countries with third-world infrastructure.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 16:22 UTC (Wed) by njd27 (subscriber, #5770) [Link]

> And, I have to note, I did have a real, live pony when I was a kid, so I tend to be on the side of giving programmers ponies when they ask for them.

The trouble of course, as any student of ponydynamics can tell you, is that once the kernel developers start handing out ponies as the solution to any difficult problem, the workload in having to deal with mucking them out might grow exponentially. Hence the reluctance. Worse might be better.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 17:11 UTC (Wed) by drag (guest, #31333) [Link]

That's why they should lie about the ponies. Say they are not going to give them ponies, but give them anyway; that way only smart people notice the herd, and everybody else forgets about it and lives with the better file system.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 19:43 UTC (Wed) by martinfick (subscriber, #4455) [Link]

Hmm, seems to me they quietly gave the ponies years ago, and they are now recalling them.

Given and denied

Posted Sep 9, 2009 21:59 UTC (Wed) by man_ls (guest, #15091) [Link]

Indeed. And now they are saying it was in fact O_UNICORNS, so they were not really given to anyone.

Given and denied

Posted Sep 9, 2009 22:59 UTC (Wed) by Tara_Li (guest, #26706) [Link]

O_UNICORNS aren't mythical - O_VIRGINS are.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 16:46 UTC (Wed) by dannf (subscriber, #7105) [Link]

Excellent/concise summary of this debate, thanks!

what's needed is a application barrier() call

Posted Sep 9, 2009 18:14 UTC (Wed) by dlang (guest, #313) [Link]

rather than using rename to imply ordering, what is needed is a barrier() call that an application can use to say 'write everything before this point before you write anything after this point'

barriers are starting to exist deeper inside the storage stack, but there is no way for the application writer to invoke them.

what's needed is a application barrier() call

Posted Sep 9, 2009 18:36 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

rename ought to imply a barrier because there are no useful cases for a non-barrier rename, and plenty of cases for a rename-with-barrier. Forcing application authors to use a separate call will simply allow them to forget it, and will ensure that existing applications are still unsafe. A barrier-on-rename scheme adds safety with no loss in expressiveness, and is a performance increase over fsync, which is, well, synchronous.

what's needed is a application barrier() call

Posted Sep 9, 2009 18:52 UTC (Wed) by dlang (guest, #313) [Link]

I am not claiming that rename should not imply a barrier.

I am stating that rename should not be necessary to create a barrier.

don't require application writers to add renames if they wouldn't otherwise need them just because they want a barrier.

what's needed is a application barrier() call

Posted Sep 9, 2009 19:06 UTC (Wed) by quotemstr (subscriber, #45331) [Link]

Ah, I see. fbarrier is an interesting construct. You'd still need the rename trick in order to prevent observers from seeing inconsistent files at runtime. A program modifying /etc/passwd should prevent other applications from seeing an incompletely-written file. One could do this with locks, but a simpler mechanism is rename.

what's needed is a application barrier() call

Posted Sep 9, 2009 20:55 UTC (Wed) by dlang (guest, #313) [Link]

correct, barrier() would not be a replacement for rename(), rename could (and probably should) imply a barrier.
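
To make the proposal concrete, here is a purely hypothetical usage sketch; fbarrier() does not exist in POSIX or Linux, and the name and semantics here are only what this sub-thread is suggesting - an application that wants two writes ordered, but not flushed:

    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical: fbarrier(fd) would ask the kernel to order all writes
     * issued before this call ahead of all writes issued after it, without
     * waiting for anything to reach disk (unlike fsync()).  It does not
     * exist today, so this sketch will not link. */
    extern int fbarrier(int fd);

    void append_record(int data_fd, int index_fd,
                       const char *rec, size_t rec_len,
                       const char *idx, size_t idx_len)
    {
        write(data_fd, rec, rec_len);   /* 1. the record itself              */
        fbarrier(data_fd);              /* 2. order, but don't flush         */
        write(index_fd, idx, idx_len);  /* 3. the index entry pointing at it */
        /* After a crash, the index may lag behind the data, but it should
         * never point at a record that was not written. */
    }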

History (re: what's needed is a application barrier() call

Posted Sep 10, 2009 16:18 UTC (Thu) by davecb (subscriber, #1574) [Link]

Rename was one of the two Unix V6 system calls which were documented as being necessary and sufficient to allow one to do atomicity and locking: the other was open(... O_EXCL|O_CREAT). The latter atomically creates files, the former atomically changes them, including atomically making them cease to exist. I rather expect Thompson and Ritchie would be bemused by some of the discussion to date (;-)

--dave

History (re: what's needed is a application barrier() call

Posted Sep 10, 2009 20:18 UTC (Thu) by aegl (subscriber, #37581) [Link]

"Rename was one of the two Unix V6 system calls"

Nope. "rename" wasn't in V6 (see http://minnie.tuhs.org/UnixTree/V6/usr/sys/ken/sysent.c.html).

"the other was open(... O_EXCL|O_CREAT)"

V6 open(2) didn't have all those fancy O_* options. You just got the choice of FREAD, FWRITE, or both.

Applications in the V6 era typically used "link(2)" as their locking primitive (create a randomly named tempfile, then link that to a statically named lockfile. If the link call succeeds, you own the lock. If you get EEXIST, then someone else does).
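
For readers who haven't run into it, a minimal sketch of that link()-based locking idiom (the path names are made up and error handling is abbreviated):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Early-UNIX-style lock: link() either succeeds atomically or fails
     * with EEXIST, so whoever links first owns the lock. */
    int take_lock(void)
    {
        char tmp[64];
        snprintf(tmp, sizeof(tmp), "/var/lock/mylock.%d", (int)getpid());

        int fd = open(tmp, O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return -1;
        close(fd);

        int ret = link(tmp, "/var/lock/mylock");  /* atomic: succeed or EEXIST */
        unlink(tmp);                              /* temp name no longer needed */
        return ret;                               /* 0 = we own the lock */
    }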

History (re: what's needed is a application barrier() call

Posted Sep 10, 2009 21:03 UTC (Thu) by davecb (subscriber, #1574) [Link]

Thanks! I used V6, but you are entirely correct; the open/rename calls indeed came later.

--dave

what's needed is a application barrier() call

Posted Sep 9, 2009 20:02 UTC (Wed) by vaurora (subscriber, #38407) [Link]

That's one of the interesting parts about Featherstitch - it exports a useful write-ordering interface to userland that cannot be used to stall the system or otherwise break things, as with a "normal" transaction-style interface. If you let userland do "transaction_start(); write(); write(); transaction_commit();" you have all sorts of issues with the app waiting too long (or never doing it) to commit, interesting interactions with other transactions, potential deadlocks, etc.

So, you have a hierarchy something like:

fsync() << barrier() << depends_on()

Where the further to the left you go, the worse it performs, and the further to the right, the harder it is to implement in the file system. I need to write about Featherstitch sometime - some amazing difficulties there.

what's needed is a application barrier() call

Posted Sep 9, 2009 20:53 UTC (Wed) by dlang (guest, #313) [Link]

I'm not proposing full transactions (which to me imply rollback if the transaction is aborted)

simply

write1 write2 write3 barrier write4 write5

will guarantee that writes 1-3 will hit the disk before writes 4 and 5 but says nothing about the ordering or timing of the two separate sets.

this is _much_ easier to implement.

what's needed is a application barrier() call

Posted Sep 9, 2009 21:49 UTC (Wed) by njs (subscriber, #40338) [Link]

Yes, that sort of ordering is what she's talking about. (The Featherstitch papers are worth checking out.)

IIUC, the trickiness in implementing it is that you need to keep track of which writes depend on which other writes, and which intermediate states are allowed, but then you have to keep non-dependent writes as disentangled as possible, to avoid the slowdowns and soft-updates craziness that she describes in the original article -- but you can't be *too* clever about it, or your accounting overhead will become a bottleneck.

Featherstitch's solution involves careful optimizations and giant graphs in kernel memory.

what's needed is a application barrier() call

Posted Sep 9, 2009 23:44 UTC (Wed) by dlang (guest, #313) [Link]

my gut feeling is that they are trying too hard.

if you have a way to prevent memory blocks that have been submitted for I/O from being changed before the I/O is completed this is just a matter of prohibiting reordering in the device stack.

this would be overkill for what is being asked for (the user cares about ordering the changes to their file, this would order changes to the entire device), but it would do the job without having to worry about tracing dependencies

if the device stack were to mark all buffers it has pending as COW when it gets a barrier() call, this doesn't seem that hard to do.

now, it does open up the possibility of running out of memory and having problems writing something to swap (until the currently pending writes complete), but if you do the barrier on a per-partition basis this shouldn't be that bad (this assumes that the device stack can easily tell what partition pending writes would go to)

what's needed is a application barrier() call

Posted Sep 10, 2009 6:51 UTC (Thu) by njs (subscriber, #40338) [Link]

I dunno, I'm not an expert, though "prohibiting reordering in the device stack" raises a big red flag for me -- reordering is pretty much the entire way you get speed out of spinning-disk hard drives!

But whether my analysis is accurately pin-pointing the problem or not, with all respect, I think I'll trust Val's word over yours that there *is* a problem :-).

what's needed is a application barrier() call

Posted Sep 10, 2009 7:01 UTC (Thu) by dlang (guest, #313) [Link]

it's only prohibiting reordering around the barrier, not all reordering.

Val is definitely right about the difficulty of doing it the best possible way; as I noted, my approach would run some risk of the COW causing out-of-memory problems, but I suspect that it's a 90% solution.

what's needed is a application barrier() call

Posted Sep 10, 2009 7:35 UTC (Thu) by njs (subscriber, #40338) [Link]

But now you're synchronizing all writes by all processes everywhere on the disk, not just the two blocks that actually got written to in this one file somewhere. If apps used that sort of barrier() too often then I think you'd end up with disk throughput that looks like you mounted with -o sync -- very little reordering would ever be allowed.

I guess this is complicated by the question of when dirty blocks get flushed, and in how large batches; maybe it's solvable. But I don't think memory is the main concern, at least.

what's needed is a application barrier() call

Posted Sep 10, 2009 7:48 UTC (Thu) by dlang (guest, #313) [Link]

you are absolutely correct, that's why I say it's only a 90% solution, but it may be something that gives you most of the benefit for a fraction of the effort.

an alternative to the application barrier() call

Posted Sep 11, 2009 15:05 UTC (Fri) by anton (subscriber, #25547) [Link]

> write1 write2 write3 barrier write4 write5
> will guarantee that writes 1-3 will hit the disk before writes 4 and 5 but says nothing about the ordering or timing of the two separate sets.

An alternative would be to just extend POSIX logical ordering guarantees (as visible by other processes) to the post-recovery state. That would mean that the file system would implicitly put a barrier between any of the writes in your example.

The question is: how much would this guarantee cost compared to what you have in mind? In a copy-on-write filesystem it could cost very little, if anything. The file system could still perform the user writes in any order (all of them, not just a subset), but just would never commit a write for which the earlier writes have not been performed yet. For journaled file systems the reasoning is more complex, but I believe that in the usual case (writing new data) the cost is also very small.

The benefits of this guarantee are that it makes programming easier, and especially testing easier: If your files are always consistent as seen by other processes, they will also be consistent in case of a crash or power outage; no need to pull the power plug in order to test the crash resilience of your application.

an alternative to the application barrier() call

Posted Sep 11, 2009 16:29 UTC (Fri) by dlang (guest, #313) [Link]

you don't want to put an implicit barrier between any two writes because it would prevent a lot of very useful write merging and reordering (so the performance cost would be very high)

since most writes are less than a sector, multiple writes would be even more expensive for a COW system

An alternative to the application barrier() call

Posted Sep 11, 2009 17:06 UTC (Fri) by anton (subscriber, #25547) [Link]

Barriers certainly don't prevent write merging. Why would they? A barrier just means that logically later writes are not committed before logically earlier writes, but they can become visible at the same time. So you can merge as many writes across barriers as you want.

Ok, your formulation of barriers excludes the same-time option, but apart from the lower performance, how could an external observer tell whether two logical writes happened one after another or at the same time? Once they are both committed, there is no difference.

As my posting explains, they also don't prevent reordering of physical data writes, they only restrict which sets of writes are committed by a commit.

Multiple small writes can be merged together into a large one.

BTW, most writes probably happen through libc buffers, and are typically larger than one sector (unless most of your files are smaller than one sector).

An alternative to the application barrier() call

Posted Sep 11, 2009 19:26 UTC (Fri) by dlang (guest, #313) [Link]

I have seen a LOT of code that does

work...work
write one line, or a couple words of a line
work..work
write a little more
etc

enforcing a barrier between all of these writes would kill you

remember that you don't know the storage stack below you, what you submit as one write may be broken up into multiple writes, and you have no guarantee of what order those multiple writes could be done in (think a raid array where your write spans drives as one example)

as a result a barrier needs to prohibit merging across the barrier as well as just reordering across the barrier.

An alternative to the application barrier() call

Posted Sep 13, 2009 17:46 UTC (Sun) by anton (subscriber, #25547) [Link]

Code that writes a few characters here and a few characters there usually uses the FILE * based interface, which performs user-space buffering and then typically performs write() (or somesuch) calls of 4k or 8k at a time; just strace one of these programs. That's done to reduce the system call overhead. But even if such programs perform a write() for each of the application writes, having barriers between each of them does not kill performance, because a sequence of such writes can be merged.

Concerning the block device below, if that does not heed the block device barriers or other block device ordering mechanisms that the file system requests, then you get no guarantee at all of any consistency on crash/power failure. It's not just that merged writes won't work; your style of merge-preventing barriers won't work, either, and neither will the guarantees that fsync()/fdatasync() are supposed to provide; that's because all of them require that the block device ordering mechanism(s) that the file system uses actually work, and all of them will produce inconsistent states if the writes happen in an order that violates the ordering requests. So, if you want any consistency guarantees at all, you need an appropriate block device, and then you can implement mergeable writes just as well as anything else.

As for an array where a write spans drives, implementing a barrier or other ordering mechanism on the array level certainly requires something more involved than just doing barriers on the individual block devices, but the device has to provide these facilities, or you can forget about crash consistency on that device (i.e., just don't use it).

An alternative to the application barrier() call

Posted Sep 13, 2009 20:23 UTC (Sun) by dlang (guest, #313) [Link]

my point is that enforcing a barrier through all these layers can be expensive (on a multi-disk array you would need to make sure that one disk has completed its work before submitting the write to the next disk)

this isn't always needed, so don't try to do it for every write (and I've straced a lot of code that does lots of write() calls)

do it when the programmer says that it's important. 99+% of the time it won't be (either the result is not significantly more usable after a crash if only part of the file is there, or this really is performance-sensitive enough to risk it)

you would be amazed at the amount of risk that people are willing to take to get performance. talk to the database gurus at MySQL or postgres about the number of people they see disabling f*sync on production databases in the name of speed.

An alternative to the application barrier() call

Posted Sep 14, 2009 22:16 UTC (Mon) by anton (subscriber, #25547) [Link]

Fortunately writes on the file system level can be merged across file system barriers, resulting in few barriers that have to be passed to the block device level. So there is no need to pass a block device barrier down for every file system barrier.

And since it is possible to implement these implicit barriers between each write efficiently (by merging writes), why burden programmers with inserting explicit file system barriers? Look at how long it took the Linux kernel hackers to use block device barriers in the file system code. Do you really expect application developers to do it at all? And if they did, how would they test it? This has the same untestability properties as asking application programmers to use fsync.

Concerning the risk-loving performance freaks, they will use the latest and greatest file system by Ted Ts'o instead of one that offers either implicit or explicit barriers, but of course they will not use fsync() on that file system :-).

BTW, if you also implement block device writes by avoiding overwriting live sectors and by using commit sectors, then you can implement mergeable writes at the block device level, too (e.g., for making them cheaper in an array). However, the file system will not request a block device barrier often, so there is no need to go to such complexity (unless you need it for other purposes, such as when your block device is a flash device).

An alternative to the application barrier() call

Posted Sep 20, 2009 5:22 UTC (Sun) by runekock (subscriber, #50229) [Link]

> Fortunately writes on the file system level can be merged across file system barriers, resulting in few barriers that have to be passed to the block device level.

But what about eliminating repeated writes to the same place? Take this contrived example:

repeat 1000 times:
write first byte of file A
write first byte of file B

A COW file system may well be able to merge the writes, but it would require a lot of intelligence for it to see that most of the writes could actually be skipped. And a traditional file system would be even worse off.

An alternative to the application barrier() call

Posted Sep 20, 2009 18:38 UTC (Sun) by anton (subscriber, #25547) [Link]

For a copy-on-write file system that example would be easy: do all the writes in memory (in proper order), and when the system decides that it's time to commit the stuff to disk, just do a commit of the new logical state to disk (e.g., by writing the first block of each of file A and file B and the respective metadata to new locations, and finally a commit sector that makes the new on-disk state visible).

An update-in-place file system (without journal) would indeed have to perform all the writes in order to have the on-disk state reflect one of the logical POSIX states at all times (assuming that there are no repeating patterns in the two values that are written; if there are, it is theoretically possible to skip the writes between two equal states).

an alternative to the application barrier() call

Posted Sep 12, 2009 0:39 UTC (Sat) by spitzak (guest, #4593) [Link]

barriers do not force write ordering. If you do "write 1, barrier, write 2", all that is required is that anybody looking at the file will see *one* of these three states: nothing, write 1, or write 1+2. But that does not imply that all three states have to somehow exist; if only nothing and write 1+2 ever exist, you have fulfilled the requirements.

An actual implementation may be "allocate temporary space, write 2, write 1, make the file point at temporary space". Notice that write 2 is done BEFORE write 1, but we have fulfilled the requirements of barrier.

what's needed is a application barrier() call

Posted Sep 10, 2009 1:43 UTC (Thu) by ras (subscriber, #33059) [Link]

The problem can be stated simply enough. User land application writers need to be able to do atomic updates, and (orthogonally) a way to be sure information is on disk - in a single way that works across all file systems, without special libraries or having to change their code. That means those interfaces must be in POSIX, with the added complication that all existing file systems we have, and all those we can envisage in the future, must be able to implement them efficiently.

Maybe write barriers are just the interface we need, maybe not. What we need is a Paul McKenney to do for this problem what Paul did for a very similar problem in multi-CPU memory architectures. He wrote the RCU stuff and in the process became intimately familiar with the problem and all possible solutions. He then worked for years to get a standard set of functions that solved the problem to be added to the C standard library. Where is your white knight in shining armour when you need them? Who knows, maybe our knight will be female for a change.

One thing hasn't changed though, and that is the ability of the female of our species to create work. Now I feel compelled to learn about Featherstitch.

what's needed is a application barrier() call

Posted Sep 10, 2009 11:18 UTC (Thu) by nix (subscriber, #2304) [Link]

> He then worked for years to get a standard set of functions that solved the problem to be added to the C standard library.

Obviously I missed something, but userspace RCU isn't in glibc and certainly isn't in POSIX.

what's needed is a application barrier() call

Posted Sep 10, 2009 12:19 UTC (Thu) by ras (subscriber, #33059) [Link]

> Obviously I missed something, but userspace RCU isn't in glibc
> and certainly isn't in POSIX.

Paul describes his efforts so far here: http://www.rdrop.com/users/paulmck/scalability/paper/CPP-...

what's needed is a application barrier() call

Posted Sep 10, 2009 13:37 UTC (Thu) by nix (subscriber, #2304) [Link]

Ah, thanks, great stuff. I didn't realise the 'threads cannot be implemented in a library' standardization fix had got as far as this. I'd encountered bits of this (quick_exit()), but not realised why they were useful.

(And Paul's slides are the best explanation of the problem I've ever seen. Slide 26 is particularly good ;} )

what's needed is a application barrier() call

Posted Sep 10, 2009 11:19 UTC (Thu) by nix (subscriber, #2304) [Link]

btw, even I found your female crack to be cringeworthy, and I'm male. Less of the stupid sexist jokes, please.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 18:26 UTC (Wed) by iabervon (subscriber, #722) [Link]

I think you're a bit off about the origin of the rename() pattern; I don't think it's normally about worrying about system crashes (because, historically, when your system crashes, you lose everything that hasn't been backed up to tape; and POSIX doesn't say that, during a system crash, the kernel won't simply write garbage over all of the data that was carefully written to permanent storage with fsync() anyway). It's about two things:

1) concurrent readers; they shouldn't sometimes see an /etc/passwd without half of the users, a mail spool without half of the messages, etc. It would be a problem if, when any user changes their password, there's a chance that other users will get random login failures. It would also be a problem if, while making modifications to messages in your inbox, a program that just looked at the state of the inbox might see messages disappear only to reappear a little while afterwards. Concurrent reads also include that all-important tape backup, which really really shouldn't catch some intermediate state of /etc/passwd.

2) application crashes; if an application is trying to make a modification to a file, and it is forced to exit before it is done, the old version should be left there. The application may be forced to exit by a segfault (bug in the application), being terminated by the user, being terminated by the system, a clean system shutdown or filesystem umount/remount, etc. The application may also, due to a bug, go into an infinite loop such that it would never actually write the rest of the unchanged content, and would have to be killed such that a different program can get exclusive control over writing it.

Unlike the system crash case, these cases are both required to have particular, internally visible, behaviors by POSIX, and POSIX conformance tests have something to test. Also, applications can (and sometimes do) get regression testing which tries to ensure that they work in conditions which would provoke bad behavior. Unless people develop applications on systems which panic frequently, they're not likely to consider the system crash case, probably won't do anything special in order to make sure that it behaves in some way they want, and certainly won't test that it actually does behave the way they expect. (In fact, POSIX notes that any test involving a system crash can be treated as a quality-of-implementation issue rather than a correctness issue.)

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 18:36 UTC (Wed) by alexl (subscriber, #19068) [Link]

I think there is some basic misunderstanding from a lot of kernel people on exactly why applications use rename() to save a file. It does not (at least did not initially) have anything to do with what happens on a system crash.

Instead, it's all about fixing race conditions. What if two applications write to the same file at the same time, or one reads a file at the same time it is replaced? If I use rename to save, I am guaranteed by the POSIX specification (unless the system crashes) that any reader of the saved filename at *any* time atomically gets either the full old file or the full new file. There is never any partial or non-existing file. Similarly, if two apps write the file you only ever get one of the two files written, never some mixup of them.

So, the use of rename is basic in how we replace a file. This means everyone is doing it, and it seems natural that filesystems would use the knowledge of this pattern to efficiently and safely implement crash-safe file replacement. Such a thing is not guaranteed by POSIX, but should probably be guaranteed by "good implementations" of unix filesystems. Especially since the safe alternative (fsync) is far less performant due to its semantics being more than we need.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 19:19 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]

I strongly endorse the parent comment. This is how I was taught to write robust Unix software back in the 80s (talk into the good ear, sonny), and the reason had nothing to do with the accidental properties of the ext3 file system, which had not yet been invented. I recall discussing what I thought were oddities of the elm mailer with its author, Dave Taylor, and he explained all the paranoid hoops he jumped through to try to assure that the user would not lose his/her mail despite application crashes, disk-full conditions and the like. The rename pattern was a part of the solution, along with lots of consistency checking.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 20:20 UTC (Wed) by job (guest, #670) [Link]

My thought exactly. A rename is first and foremost something that guarantees atomicity, not file system consistency. The latter is a (welcome) side effect that (Linux) programs have come to rely on.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 22:40 UTC (Wed) by njs (subscriber, #40338) [Link]

I would be shocked if you could find a filesystem developer who isn't aware of exactly what POSIX requires of rename (i.e. atomicity with respect to other processes viewing the same fs concurrently), and why it works the way it does.

Our standards keep rising, though, and these days people actually care about what happens to data over crashes -- and POSIX's primitives *suck* for this, unless you are writing a giant database with dedicated storage.

In this discussion, whenever kernel folks talk about developers coming to depend on rename's atomicity, I'm pretty sure they're talking about its atomicity with respect to crashes. (For instance, I believe Subversion's backend format uses atomic-rename for reliability over crashes, because fsync is just untenable.)

crash vs. power drop

Posted Sep 10, 2009 2:15 UTC (Thu) by ncm (guest, #165) [Link]

I suspect that many readers, perhaps most, are equating "crash" with "power drop".

How often do any of us see crashes on production systems, nowadays? Things being the way they are, power drops are overwhelmingly more likely on your typical desktop or rack installation, just because a UPS is an extra expense and crippling data loss isn't especially likely even without one. "All of the above" is meaningful only if power doesn't drop unexpectedly, and, sadly, that needs to be repeated every time.

crash vs. power drop

Posted Sep 10, 2009 6:12 UTC (Thu) by njs (subscriber, #40338) [Link]

Good point, though with two caveats: 1) Even if, as you've claimed, power drops can cause corruption to sectors that are in the middle of being written (it seems plausible), it isn't clear to me how frequently this happens. Your average drive probably spends most of its time seeking, for instance. 2) In principle, there's no reason a fs couldn't maintain guarantees even over events like that, though I don't know whether any current ones do.

crash vs. power drop

Posted Sep 10, 2009 17:50 UTC (Thu) by ncm (guest, #165) [Link]

Scragged sectors are a risk, but not the main risk, of power drops. What's worse is blocks still in buffer RAM on the drive that the system thought were already physically on disk, because the drive told it so.

http://lwn.net/Articles/352002/

When the drives lie to the file system, the file system can't guarantee anything. Most drives (i.e. the ones you have) lie. Drives that don't lie cost way more, and are slower; it's much cheaper to add some battery backup, and even some redundancy, than to buy the expensive drives. There's no point in discussing fine points of data ordering if you haven't got one or the other.

crash vs. power drop

Posted Sep 11, 2009 6:27 UTC (Fri) by zmi (guest, #4829) [Link]

> Second, more subtle but probably more important, drives lie about what is physically on disk.

That's why in a RAID you really *must* turn off the drive's write cache.
I've tried to explain that in the XFS FAQ:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_w...
and also in the questions below that one.

In short: I've got new 2TB WD drives with 64MB of cache each, which we intend to use in a RAID. Take 16 of these drives and that adds up to 1024MB (1GB) of write cache. So in the worst case:
1) you've got a UPS, but your power supply fails and the PC/server loses power;
2) the drives' caches are full, so up to 1GB of data that the filesystem believed was on disk is lost - and there's a *very* high chance that lots of metadata is among those cached writes;
3) each of the 16 drives could write "half sectors", effectively destroying both the previous and the new content.

In all this discussion, it would have been worth noting that if you really *care* about your data, you *must* turn off the drive write cache. Yes, power failures are not that frequent in countries with a good power supply. Still, I use a UPS, and in the last half year I had:
a) my daughter playing around and turning the server's power off;
b) a dead power supply in my workstation.
So even with a UPS, "drive write cache off" is a must. Simply put
hdparm -W0 /dev/sda
in your boot scripts.

Note that this still only helps with 1) and 2); for problem 3) there's nothing anybody but the disk manufacturers can do. I must say I have no evidence of ever having hit that problem anywhere. It might be what happened in cases of "strange filesystem problems" after a crash, but you can't tell for sure.

As for the rename: really, there should only ever be the chance of ending up with either the old file or the new one, and the filesystem should take care of this even in crash situations.

Note: In Linux you can tune writeback behaviour in /etc/sysctl.conf:
# start writeback at around 16MB, max. 250MB
vm.dirty_background_bytes = 16123456
vm.dirty_bytes = 250123456
# older kernels had this:
#vm.dirty_background_ratio = 5
#vm.dirty_ratio = 10
# expire dirty data after 10 seconds (default: 30s) and wake the
# flusher every second (default: 5s)
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100

Note that dirty_bytes/dirty_ratio is the point at which new writes block once the cache holds that many dirty bytes. On systems with 8GB of RAM or more, the default ratios could leave you with gigabytes of dirty cache.
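
For what it's worth, after editing /etc/sysctl.conf the settings above can be applied to the running system without a reboot:

    sysctl -p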

Sorry for putting it all in one post, but I hope it gives people who care about their data some tunings to start with.

mfg zmi

crash vs. power drop

Posted Sep 11, 2009 11:40 UTC (Fri) by Cato (guest, #7643) [Link]

Thanks for explaining all that.

On the topic of writing 'half sectors' due to a power drop: the author of http://lwn.net/Articles/351521/ has done quite a lot of tests on various hard drives and generally found that they usually don't do this, though some did. He has a useful program that can test any drive for this behaviour, though it's mostly intended to test for out-of-order writes due to caching - I believe only some drives lie about this.

crash vs. power drop

Posted Sep 14, 2009 14:13 UTC (Mon) by nye (guest, #51576) [Link]

This is why barriers *should* be a good thing.

If all you care about is benchmark results, there's an obvious incentive for the drive to claim falsely that some data has been physically written to disk.

On the other hand, there's far less incentive to lie about barriers. If all you're saying is that '*if and when* this data has been written, then that other data will have also been written', you can still happily claim that it's all done when it isn't, without breaking the barrier commitment.

When you have that commitment, it's possible to build a transaction system upon it which works even under the assumption that the drive will lie. You're not going to achieve the full benchmark speed, but it's going to be far better than turning off the cache.

Of course, whether drive manufacturers see it that way is another matter. Is there any data on whether drives actually honour write barriers? It would be interesting to see if there are indeed drives that aren't expensive enough to report accurately on when data has been written, but still honour the barrier request.

crash vs. power drop

Posted Sep 14, 2009 15:06 UTC (Mon) by ncm (guest, #165) [Link]

This is where Scott Adams's "Which is more likely" principle is useful. We just frame the question, thus:

Drive manufacturer A can sell almost as many drives made this way as made that way, but "that way" costs more development time and might make the drive come out a little slower in benchmarks. Some purchase decisions depend on claiming it's made "that way". The manufacturer can actually make it "that way", or can just say it did without doing so. Which is more likely?

crash vs. power drop

Posted Sep 14, 2009 16:12 UTC (Mon) by nye (guest, #51576) [Link]

A fair point, but in all things there's a strong economic incentive to claim that a product does something it doesn't if that will make more people buy it, and yet most products don't claim to do something which is simply, factually untrue.

Usually when it's a feature that either works or doesn't - so it's not a subjective measurement - it's likely that a product does at least technically do what it says it does.

Presumably the argument is that the manufacturers aren't specifically claiming a particular feature, but the disk is behaving in a particular way that just happens to be not what the user expected - so they're not technically lying. This does seem to weaken the idea that they're doing it to improve the chances of people buying it though, if they're not stating it as a feature.

Just out of interest, I've just spent a while trying to see if I can find out what the difference is between the Samsung HE502IJ and HD502IJ - two drives which are identical on paper, but one is sold as 'RAID-class'. Neither are even remotely expensive enough not to lie about their actions, so what's the difference? Well, some forum post claims that one has a 'rotational vibration sensor', whatever that means.

In conclusion, people who try to sell you things are all liars and cheats, and I intend to grow a beard and live out the rest of my days as a hermit, never having to worry about these things again. Perhaps I shall raise yaks.

crash vs. power drop

Posted Sep 15, 2009 12:52 UTC (Tue) by Cato (guest, #7643) [Link]

I believe that TLER (time limited error recovery) is one characteristic of enterprise drives - simply means fewer retries so that the RAID controller or OS gets a drive failure more quickly, and knows to use redundancy to complete the I/O.

crash vs. power drop

Posted Sep 15, 2009 12:49 UTC (Tue) by Cato (guest, #7643) [Link]

Maybe the best way round this is to make sure that performance benchmarks always include a 'reliability benchmark' that detects drives which are lying about writes to hard disk.

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 8:24 UTC (Thu) by alexl (subscriber, #19068) [Link]

I also think it's unlikely that kernel people are unaware of the atomicity of rename. However, almost every post, including this one, avoids mentioning why rename is actually used, and instead offers handwavy notions about application authors using rename because we somehow got hooked on it via ext3 behaviour, which is far from the truth.

I'm all for a reasonable API addition to implement O_PONIES, and would implement support for it in the stuff I work on (glib, gio, etc.) the second it was available. However, all existing applications already use rename() to do atomic replacement for reasons unrelated to system crashes, so why not just make all those applications work without additional changes, at little cost in performance?

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 8:52 UTC (Thu) by dlang (guest, #313) [Link]

Define the 'little cost in performance' that you (and everyone else) would be willing to lose.

Doing an fsync on ext3 (which is what the ext maintainers believe is necessary to get data safely to disk) can take several seconds. If you want a rename to provide that sort of guarantee, you need to be willing to pay that sort of cost for every rename.

ext3 never provided the guarantees that people think it did. It just happened to work if you didn't crash too soon after doing a rename.

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 9:04 UTC (Thu) by alexl (subscriber, #19068) [Link]

Here we go again...

NO NO NO NO. We do not need/want the file to be fsynced.

Why do people keep repeating this fallacy? We all know that fsync is expensive, and don't want to use it, or something with similar semantics.

What we want is something that gives us the natural behavior of rename() replacement (atomically getting either the old or the new file) and extends it to the system crash case. That does not imply an fsync; it only requires that the data for the new file reach the disk before the metadata does. This is much cheaper than an fsync because it does not require the data to be written immediately, only that the write of the metadata be delayed until the data has been written. Hence "little cost in performance", at least relative to fsync.

And then you write "ext3 never provided the guarantees that people think it did" when my whole point has been that everyone gives this as the reason people use rename when it's not actually the reason! I am well aware that rename() does not give me system crash safety; I use it for other reasons. However, I *would* like it if this common operation, which was in use for decades before ext3 was written, were also recognized by ext3 and made even more useful (even though this is in no way guaranteed by POSIX).

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 16:37 UTC (Thu) by nye (guest, #51576) [Link]

I have noticed a tendency in this discussion (I don't mean the responses to this article, but the overall discussion) that the 'POSIX-fundamentalist' faction is unwilling or unable to accept that saying 'I want A to happen iff B happens' is *not* the same as saying 'I want a guarantee that A and B happen'.

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 20:38 UTC (Thu) by HelloWorld (guest, #56129) [Link]

If you want both the write and the rename to happen, you'd have to fsync() the file *and* the directory. Which means that the open(), write(), fsync(the_file), close(), rename() sequence provides exactly the semantics you describe.
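
For illustration, a minimal sketch in C of the fully durable variant (the file names are invented, the temporary and final names are assumed to share a directory, and error checking is dropped):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    void durable_replace(const char *buf, size_t len)
    {
        int fd = open("config.tmp", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        write(fd, buf, len);
        fsync(fd);                        /* force the new data to disk      */
        close(fd);

        rename("config.tmp", "config");   /* atomically replace the old file */

        int dirfd = open(".", O_RDONLY);  /* sync the containing directory   */
        fsync(dirfd);                     /* so the rename itself survives   */
        close(dirfd);                     /* a crash                         */
    }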

POSIX v. reality: A position on O_PONIES

Posted Sep 21, 2009 1:52 UTC (Mon) by efexis (guest, #26355) [Link]

That's the point though... that's /not/ what people in the discussion want, or are asking for.

POSIX v. reality: A position on O_PONIES

Posted Sep 21, 2009 13:45 UTC (Mon) by nye (guest, #51576) [Link]

*speechless*

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:01 UTC (Thu) by forthy (guest, #1525) [Link]

I really wonder why all this "data=ordered" stuff is said to cost performance. If implemented right, it must improve performance. All you want to do is the following: push data into the write buffer, push metadata into the metadata write buffer, and push freed blocks into the freed-blocks buffer (but don't actually free them). When your buffers are full, there are no free blocks left, or a timer expires, do the following:

  1. Write out data.
  2. Write out metadata (first to journal, then to the actual file system).
  3. Actually free the blocks from the freed block list

You only have to write data once - new files go to newly allocated blocks which don't appear in the metadata when you write them (they are still marked as free in the on-disk data). For files with in-place writes, we usually don't care (there are many race conditions for writing in-place, so the general usage pattern is not to do that if you care about your data). For crash-resilient systems, you want to write your metadata twice (once into a journal, once into the file system), order it (ordered metadata updates), or use a COW/log structured file system, where you write a new file system root (snapshot) on every update round. While you are writing data from your buffers, open up new buffers for the OS to be used as buffers for the next round (double-buffering strategy). This double buffering should be a common part of the FS layer, because it will be used in all major file systems.

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:41 UTC (Thu) by dlang (guest, #313) [Link]

The problem with your approach is that various pieces (including the hard drive itself) will re-order anything in their buffers to shorten the total time it takes to get everything in the buffer to disk.

That is why barriers are needed: to tell the device not to reorder across the barrier.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 20:21 UTC (Wed) by kjp (guest, #39639) [Link]

The article was great other than the extremely fast and loose comment about 2.6.30 and writeback. From the linked article:

"Ted Ts'o has mitigated that problem somewhat, though, by adding in the same safeguards he put into ext4. In some situations (such as when a new file is renamed on top of an existing file), data will be forced out ahead of metadata."

Please clarify or contradict that before you give us poor app developers any more heart trouble... thanks.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 21:08 UTC (Wed) by vaurora (subscriber, #38407) [Link]

My understanding is that the patches you are talking about only decrease the likelihood of the rename() data loss problem in modes other than ext3's data=ordered. E.g.:

commit e7c8f5079ed9ec9e6eb1abe3defc5fb4ebfdf1cb
Author: Theodore Ts'o <tytso@mit.edu>
Date: Fri Apr 3 01:34:49 2009 -0400

ext3: Add replace-on-rename hueristics for data=writeback mode

In data=writeback mode, start an asynchronous flush when renaming a
file on top of an already-existing file. This lowers the probability
of data loss in the case of applications that attempt to replace a
file via using rename().

---

If you aren't sure, I recommend using ext3 with the explicit data=ordered option until you've had the opportunity to sit down and understand the data=writeback and/or ext4 semantics.
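
For example (the device and mount point here are invented), making that choice explicit in /etc/fstab looks something like:

    /dev/sda2  /home  ext3  defaults,data=ordered  0  2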

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 20:33 UTC (Wed) by kjp (guest, #39639) [Link]

Also, in reply to "no one ever, ever wrote "creat(); write(); close(); rename();" and hoped they would get an empty file if the system crashed during the next 5 minutes. "

I have an example of when this actually can be useful (however to be clear, I am very much in favor of rename being safe by default).

A web interface needs to display a text dump from a daemon to a user. The daemon writes out /tmp/daemon-dump.tmp, and renames it to /tmp/daemon-dump.txt. It does this since concurrent users in the web site may still be accessing the previous version. However, on a crash it doesn't matter what happens to this file since it is always refreshed by a web process before starting display.

In a nutshell, the rename here is for IPC concurrency only and not related to any crash consistency. However, it is FAR saner to add an fcntl flag (say, O_EAT_MY_DATA_SPEED_TOO_IMPORTANT) for these speed freaks than to brick files or entire computers....

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 21:00 UTC (Wed) by jzbiciak (guest, #5246) [Link]

I posit that there should rarely be a good reason to run fsck on /tmp. Keeping the contents of /tmp across a crash may be useful for forensics (i.e. "why did we crash?") and maybe rescuing a couple of things, but otherwise you should be able to jettison it if there are any issues with the filesystem holding it, so long as /tmp is its own filesystem. (That said, my RHEL4 box has a /tmp/lost+found. Hmmm.)

I'd further posit that nobody hopes for the empty file across a system crash. (If they did, they'd unlink() it just after opening it and before writing to it so that the file effectively evaporates on a crash. Ok, it might show up in lost+found, but only root can play there.) In your example, the programmer simply doesn't care. The programmer hopes for peak performance and doesn't really care if that means the file has garbage if the system crashes. If the contents were perfectly preserved without slowing the program down, I doubt the programmer in your example would care.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 21:03 UTC (Wed) by sbergman27 (guest, #10767) [Link]

> Keeping the contents of /tmp across a crash may be useful for forensics (ie. "why did we crash?") and maybe rescuing a couple things,

But doing so is far more likely to cause problems than to help solve them.

POSIX v. reality: A position on O_PONIES

Posted Sep 9, 2009 22:20 UTC (Wed) by iabervon (subscriber, #722) [Link]

Actually, systems that want that probably really want to mount tmpfs on /tmp. After a system crash, they could be sure that the directory accounts for all of the changes and the exact order of operations, including the contents being wiped out after all of those operations. In your example, there's no need to write the file to disk at all, except to free up RAM.

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 8:58 UTC (Thu) by lysse (guest, #3190) [Link]

> O_EAT_MY_DATA_SPEED_TOO_IMPORTANT

... O_PONYCRAP ?

POSIX v. reality: A position on O_PONIES

Posted Sep 17, 2009 16:04 UTC (Thu) by forthy (guest, #1525) [Link]

O_RACE_HORSE | O_ON_STEROIDS

That's obvious, isn't it?

POSIX v. reality: A position on O_PONIES

Posted Sep 10, 2009 15:04 UTC (Thu) by MisterIO (guest, #36192) [Link]

I agree with the article. It also seems a bit strange that the kernel developers, who are famous for criticizing the gcc folks for their C-standard-obedient "insane" optimizations, then tend to become "pedantic POSIX-quoting fundamentalists".

The question of course is...

Posted Sep 10, 2009 16:28 UTC (Thu) by ikm (subscriber, #493) [Link]

How does one get a real, live pony?

POSIX ain't what you think it is....

Posted Sep 11, 2009 0:48 UTC (Fri) by faramir (subscriber, #2327) [Link]

The author states that the problem with POSIX is that it codifies errors. That is probably true, but I believe the author is not accurately relaying what POSIX has codified either. In particular, the case of making hard links to directories is exaggerated. I don't have access to the 1988 document at the moment, but the current version treats the issue as follows:

...If path1 names a directory, link() shall fail unless the process has appropriate privileges and the implementation supports using link() on directories...

So what does that mean? It says that IF the process is sufficiently privileged (traditionally running as root) AND the implementation elects to support it, THEN it is not an error for a link to actually be created to a directory. Any program that actually wants to work properly on all POSIX systems can NEVER require such an operation to work, even if running as root. That's a pretty lukewarm endorsement of the functionality.

You might ask why it's even in there at all. The original UNIX systems (all versions) had no atomic mkdir() system call. You used mknod() to make a TOTALLY empty directory and then used link() to create the standard . and .. entries in the newly created directory. The mkdir program was setuid root, and if you wanted to create a directory from within a program you had to fork() and exec() that program. Specifying link() in this way allowed UNIX versions which had traditionally made directories this way to get POSIX certification while in no way encouraging implementations of this type. Frankly, I think that was a good decision: leniency for historical practice, but encouraging more reasonable implementations in the future. Users don't run their own programs as root, so there was never any danger of people making directory loops to screw up your system.
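
For illustration only, the historical sequence looked roughly like the sketch below; it needs root, the directory name is invented, and modern kernels will simply refuse the link() calls on directories:

    #include <sys/stat.h>
    #include <unistd.h>

    void old_style_mkdir(void)
    {
        mknod("newdir", S_IFDIR | 0755, 0);  /* a totally empty directory  */
        link("newdir", "newdir/.");          /* hand-create the "." entry  */
        link(".", "newdir/..");              /* hand-create the ".." entry */
    }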

Directory hard links

Posted Sep 11, 2009 6:40 UTC (Fri) by salimma (subscriber, #34460) [Link]

> ... a number of warts revealed themselves over time, not all of which were removed before the interface was codified into the POSIX specification. One example is directory hard links, which permit the creation of a directory cycle - a directory that is a descendant of itself - and its subsequent detachment from the file system hierarchy, resulting in allocated but inaccessible directories and files.

How ironic that Apple chose to implement this feature in HFS+ ...

Directory hard links

Posted Sep 11, 2009 19:06 UTC (Fri) by foom (subscriber, #14868) [Link]

Apple did forbid the creation of recursive directory hard links, however.

Directory hard links

Posted Sep 12, 2009 0:21 UTC (Sat) by cras (guest, #7000) [Link]

Speaking of HFS+ and hard links: last I heard, they were implemented in a way that required adding a file (or link) to a special hidden directory. Once you had created a few thousand hard links, performance got worse and worse until the system was rebooted (the directory gets wiped at startup). I wonder if that will ever get fixed.

truncate()

Posted Sep 12, 2009 0:26 UTC (Sat) by cras (guest, #7000) [Link]

This is actually the first time I've consciously noticed the existence of the truncate() call. I must have seen it before when doing "man ftruncate", but it never registered in my mind. Is it actually used by real programs? In over 10 years of programming on Linux I've never had a need for it. If anything, it sounds potentially dangerous to use (TOCTOU race).
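
For illustration of the race being alluded to (the path is invented, and whether it matters depends on who else can replace the file):

    #include <fcntl.h>
    #include <unistd.h>

    void shrink_by_path(void)
    {
        /* Path-based: racy if another process can rename() a different
           file over this name before the call runs. */
        truncate("/var/spool/job.dat", 0);
    }

    void shrink_by_fd(void)
    {
        /* Descriptor-based: operates on the file you actually opened. */
        int fd = open("/var/spool/job.dat", O_WRONLY);
        if (fd >= 0) {
            ftruncate(fd, 0);
            close(fd);
        }
    }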

POSIX v. reality: A position on O_PONIES

Posted Sep 12, 2009 0:46 UTC (Sat) by spitzak (guest, #4593) [Link]

Yay! Finally some sense in talking about this. Please everybody understand what this article is saying, and if you disagree realize you are wrong.

I would like to go further and really extend POSIX to get the atomic operations everybody really wants. I would actually redefine the flags sent by creat() (O_CREAT|O_WRONLY|O_TRUNC) to be this "ponies" flag with the following rules:

1. If the file already exists, then other programs either see the old file or the contents of the new file when close() was called on it.

2. If the file does not already exist then other programs either see no such file or they see the file with the contents when close() was called on it.

3. If the program crashes or exits without calling close() then it is exactly as if nothing ever happened.

3a. It may be useful to add some new call that "forgets" the file, i.e. it is "closed" in the same way as when the program exits.

This would avoid the need to use a temporary file and then rename it to get real atomic writes. And it would not require a new flag, because all programs using the creat() flags already act exactly as if it worked this way.
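
For illustration (the file name is invented), the ordinary write path that would pick up the proposed behaviour unchanged is simply:

    #include <fcntl.h>
    #include <unistd.h>

    void save_prefs(const char *buf, size_t len)
    {
        /* Under the proposal above, this unmodified code would become an
           atomic replace: readers see the old "prefs" until close()
           succeeds, and a crash before close() leaves the old file intact. */
        int fd = creat("prefs", 0644);  /* == open(..., O_CREAT|O_WRONLY|O_TRUNC) */
        write(fd, buf, len);
        close(fd);
    }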

POSIX v. reality: A position on O_PONIES

Posted Sep 12, 2009 3:50 UTC (Sat) by cras (guest, #7000) [Link]

I don't think close() is the proper checkpoint here. I think it would already break POSIX guarantees, and it's quite likely that it would break assumptions made by some programs. Not all programs close() the file once they're done with it; after all, there are fewer syscalls to make if you want to append a few lines to a file you have already written once.

POSIX v. reality: A position on O_PONIES

Posted Sep 14, 2009 18:20 UTC (Mon) by spitzak (guest, #4593) [Link]

I don't think POSIX guarantees *anything* until close() is called, so anybody relying on seeing what is there is relying on non-POSIX behavior.

That said, I'm fairly certain that any lseek() on the file can be a trigger that this behavior is not wanted and that the partial file should become visible at that point. Of course no guarantee until close() is called...

This has no effect on pipes. So at absolute worst, you can write a logfile writer, so instead of "foo > logfile" you write "foo | write-old-style logfile".

I very much believe the improvements to atomicity from this so vastly outweigh any incompatibility problems that they are irrelevant.

POSIX v. reality: A position on O_PONIES

Posted Sep 13, 2009 19:40 UTC (Sun) by njs (subscriber, #40338) [Link]

IIUC, with your suggestion, running 'long_running_process > logfile' would make it impossible to track progress in the logfile, because the logfile would not become visible until after long_running_process exited. And if your box crashed, then the logfile would be impossible to read, period. (Think of checking /var/log/Xorg.0.log to debug X crashes...)

POSIX v. reality: A position on O_PONIES

Posted Sep 14, 2009 18:15 UTC (Mon) by spitzak (guest, #4593) [Link]

Most such software does ">>" or O_APPEND or otherwise opens the file with different flags, so it would act normally. Otherwise you are correct that something would have to be done; I believe any call to lseek() could indicate that you want the previous behavior, and it would be easy to alter software to do this, or to add some syntax to the shell for it.

It is also possible that the file could revert to the previous behavior after it reaches a certain size.

I still feel the benefits outweigh any compatibility problems. By far the majority of programs writing files with creat() act as though it works exactly as I describe.

Atomicity

Posted Sep 17, 2009 8:17 UTC (Thu) by Nicolas.Boulay (guest, #59722) [Link]

fsync() is a way of declaring a kind of absolute priority over any performance-motivated ordering.

rename() is a kind of trick to minimize the problem of an empty file after a power failure.

But what an application writer really wants is a fast file system that provides _atomicity_: either the previous file state or the new content as of the last sys_write(), and nothing else.

Instead of fsync(), I think we really need an fdone(), which should be a kind of "wait for the transaction to complete" rather than "flush everything quickly". If fdone() takes too long, I could use threads; if fdone() takes time, it's in the service of bandwidth optimization. One great Linux optimization for systems without important data is to map fsync() to a no-op - then everything flies :)

Is it costly to get the behavior of open()/write()/rename() for a single sys_write()?

Atomicity

Posted Sep 25, 2009 3:39 UTC (Fri) by xoddam (subscriber, #2322) [Link]

We already have an API for atomicity in POSIX. It is called rename().

Rename is *not* a 'kind of trick'. By specification, it is guaranteed to be atomic in the face of concurrent readers on a working system. Unfortunately the specification has nothing to say about it with respect to unclean shutdown.

Extending the atomicity of rename() so that it still applies in the face of a successful recovery (such as a journal replay) after an unclean shutdown is perfectly logical.

Atomicity

Posted Oct 26, 2009 10:09 UTC (Mon) by Nicolas.Boulay (guest, #59722) [Link]

You completely forget the case where the file is too big to be copied.

KB is OK; MB is not.

It's typical of any database work. In that case, rename() is of no use.

Atomicity

Posted Oct 30, 2009 4:30 UTC (Fri) by xoddam (subscriber, #2322) [Link]

Database implementors have many choices for implementing data stores and any transactional semantics that they need.

Databases traditionally use very large files because their implementors have chosen to re-implement filesystem functionality at the low level for performance reasons.

Most often they use their own journalling implementations and fsync(). This is of course legitimate. But using filesystem-level rename to provide atomicity would also be perfectly reasonable.

The size of the renamed and replaced file is an implementation detail only. Rename doesn't impose a requirement to copy large hunks of data only to throw it away. The unit of replacement might be a btree node, for example.

Nothing forces an implementor to use large files for any particular purpose.


Copyright © 2009, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds