POSIX v. reality: A position on O_PONIES
Sure, programmers (especially operating systems programmers) love their specifications. Clean, well-defined interfaces are a key element of scalable software development. But what is it about file systems, POSIX, and when file data is guaranteed to hit permanent storage that brings out the POSIX fundamentalist in all of us? The recent fsync()/rename()/O_PONIES controversy was the most heated in recent memory, but not out of character for fsync()-related discussions. In this article, we'll explore the relationship between file systems developers, the POSIX file I/O standard, and people who just want to store their data.
In the beginning, there was creat()
Like many practical interfaces (including HTML and TCP/IP), the POSIX file system
interface was implemented first and specified second. UNIX was
written beginning in 1969; the first release of the POSIX
specification for the UNIX file I/O interface (IEEE Standard 1003.1)
was released in 1988. Before UNIX, application access to non-volatile
storage (e.g., a spinning drum) was a decidedly application- and
hardware-specific affair. Record-based file I/O was a common paradigm,
growing naturally out of punch cards, and each kind of file was treated
differently. The new interface was designed by a few guys (Ken
Thompson, Dennis Ritchie, et alia) screwing around with their new
machine, writing an operating system that would make it easier
to, well, write more operating systems.
As we know now, the new I/O interface was a hit. It turned out to be a
portable, versatile, simple paradigm that made modular software
development much easier. It was by no means perfect, of course: a
number of warts revealed themselves over time, not all of which were
removed before the interface was codified into the POSIX
specification. One example is directory hard links, which permit the
creation of a directory cycle - a directory that is a descendant of
itself - and its subsequent detachment from the file system hierarchy,
resulting in allocated but inaccessible directories and files.
Recording the time of last access - atime - turns every read
into a tiny write. And don't forget the apocryphal quote from Ken
Thompson when asked if he'd do anything differently if he were
designing UNIX today: "If I had to do it over again? Hmm... I guess
I'd spell 'creat' with an 'e'." (That's the creat() system call to
create a new file.) But overall, the UNIX file system interface is a
huge success.
POSIX file I/O today: Ponies and fsync()
Over time, various more-or-less portable additions have accreted around the standard set of POSIX file I/O interfaces; they have been occasionally standardized and added to the canon - revelations from latter-day prophets. Some examples off the top of my head include pread()/pwrite(), direct I/O, file preallocation, extended attributes, access control lists (ACLs) of every stripe and color, and a vast array of mount-time options. While these additions are often debated and implemented in incompatible forms, in most cases no one is trying to oppose them purely on the basis of not being present in a standard written in 1988. Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.
Why, then, does the topic of when file system data is guaranteed to be
"on disk" suddenly turn file systems developers into pedantic
POSIX-quoting fundamentalists? Fundamentally (ha), the problem comes
down to this: waiting for data to actually hit disk before returning
from a system call is a losing game for file system performance. As
the most extreme example, the original synchronous version of the UNIX
file system frequently used only 3-5% of the disk throughput. Nearly
every file system performance improvement since then has been
primarily the result of saving up writes so that we can allocate and
write them out as a group. As file systems developers, we are going
to look for every loophole in fsync() and squirm our way through it.
Fortunately for the file systems developers, the POSIX specification
is so very minimal that it doesn't even mention the topic of file
system behavior after a system crash. After all, the original
FFS-style file systems (e.g., ext2) can theoretically lose your entire
file system after a crash, and are still POSIX-compliant. Ironically,
as file systems developers, we spend 90% of our brain power coming up
with ways to quickly recover file system consistency after a system
crash! No wonder file systems users are irked when we define file
system metadata as important enough to keep consistent, but not file
data - we take care of our own so well. File systems developers have
magnanimously conceded, though, that on return from fsync(), and only
from fsync(), and only on a file system with the right mount options,
the changes to that file will be available if the system crashes after
that point.
At the same time, fsync() is often more expensive than it absolutely
needs to be. The easiest way to implement fsync() is to force out
every outstanding write to the file system, regardless of whether it
is a journaling file system, a COW file system, or a file system with
no crash recovery mechanism whatsoever. This is because it is very
difficult to map backward from a given file to the dirty file system
blocks needing to be written to disk in order to create a consistent
file system containing those changes. For example, the block
containing the bitmap for newly allocated file data blocks may also
have been changed by a later allocation for a different file, which
then requires that we also write out the indirect blocks pointing to
the data for that second file, which changes another bitmap block...
When you solve the problem of tracing specific dependencies of any
particular write, you end up with the complexity of soft updates. No
surprise, then, that most file systems take the brute-force approach,
with the result that fsync() commonly takes time proportional to all
outstanding writes to the file system.
So, now we have the following situation: fsync() is required to
guarantee that file data is on stable storage, but it may perform
arbitrarily poorly, depending on what other activity is going on in
the file system. Given this situation, application developers came to
rely on what is, on the face of it, a completely reasonable
assumption: rename() of one file over another will result in either
the contents of the old file, or the contents of the new file as of
the time of the rename(). This is a subtle and interesting
optimization: rather than asking the file system to synchronously
write the data, it is instead a request to order the writes to the
file system. Ordering writes is far easier for the file system to do
efficiently than synchronous writes.
However, the ordering effect of rename() turns out to be a file
system-specific implementation side effect. It only works when changes
to the file data in the file system are ordered with respect to
changes in the file system metadata. In ext3/4, this is only true when
the file system is mounted with the data=ordered mount option - a name
which hopefully makes more sense now! Until recently, data=ordered was
the default journal mode for ext3, which, in turn, was the default
file system for Linux; as a result, ext3 data=ordered was all that
many Linux application developers had any experience with. During the
Great File System Upheaval of 2.6.30, the default journal mode for
ext3 changed to data=writeback, which means that file data will get
written to disk when the file system feels like it, very likely after
the file's metadata specifying where its contents are located has been
written to disk. This not only breaks the rename() ordering
assumption, but also means that the newly renamed file may contain
arbitrary garbage - or a copy of /etc/shadow - making this a security
hole as well as a data corruption problem.
Which brings us to the present-day fsync/rename/O_PONIES controversy,
in which many file systems developers argue that applications should
explicitly call fsync() before renaming a file if they want the file's
data to be on disk before the rename takes effect - a position which
seems bizarre and random until you understand the individual
decisions, each perfectly reasonable, that piled up to create the
current situation. Personally, as a file systems developer, I think it
is counterproductive to replace a performance-friendly implicit
ordering request in the form of a rename() with an
impossible-to-optimize fsync(). It may not be POSIX, but the
programmer's intent is clear - no one ever, ever wrote
"creat(); write(); close(); rename();" and hoped they would get an
empty file if the system crashed during the next 5 minutes. That's
what truncate() is for. A generalized "O_PONIES do-what-I-want" flag
is indeed not possible, but in this case, it is to the file systems
developers' benefit to extend the semantics of rename() to imply
ordering so that we reduce the number of fsync() calls we have to cope
with. (And, I have to note, I did have a real, live pony when I was a
kid, so I tend to be on the side of giving programmers ponies when
they ask for them.)
My opinion is that POSIX and most other useful standards are helpful clarifications of existing practice, but are not sufficient when we encounter surprising new circumstances. We criticize application developers for using folk-programming practices ("It seems to work!") and coming to rely on file system-specific side effects, but the bare POSIX specification is clearly insufficient to define useful system behavior. In cases where programmer intent is unambiguous, we should do the right thing, and put the new behavior on the list for the next standards session.
Index entries for this article:
Kernel: Filesystems
GuestArticles: Aurora (Henson), Valerie
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 15:54 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]
Great article. Everyone listen to Valerie.
Great article!
Posted Sep 9, 2009 16:36 UTC (Wed) by dwheeler (guest, #1216) [Link]
I agree, great article. Thanks for putting this in perspective.
I appreciate the side comment that something like this should be submitted to the next version of the POSIX standard. Standards authors try to document the expectations of users and implementers, and sometimes they omit something important. It'd be hard to nail down the exact expectation, but it'd be worth it.
hail czar
Posted Sep 9, 2009 17:05 UTC (Wed) by ncm (guest, #165) [Link]
hail czar
Posted Sep 9, 2009 20:07 UTC (Wed) by sbergman27 (guest, #10767) [Link]
hail Ivanova!
Posted Sep 10, 2009 13:58 UTC (Thu) by liljencrantz (guest, #28458) [Link]
«Valerie is always right.
I will listen to Valerie.
I will not ignore Valerie's recommendations.
Valerie is god.»
(Stolen from B5)
hail Ivanova!
Posted Sep 14, 2009 21:16 UTC (Mon) by roelofs (guest, #2599) [Link]
«Quoters of B5 are always right.
I will listen to quoters of B5.
I will not ignore the recommendations of those who quote B5.
JMS is god.»
;-)
(Apologies to JMS and LWN for the off-topic drivel...)
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 16:04 UTC (Wed) by nix (subscriber, #2304) [Link]
> Similarly, there is relatively little debate about refusing to conform to some of the more brain-dead POSIX details, such as the aforementioned directory hard link feature.

That's decidedly optional, isn't it? So failing link() on directories isn't a conformance violation anyway.
(If it wasn't for NFS, I suspect a much wider violation would be failure to support the seekdir()/telldir() horror show.)
POSIX v. reality: A position on O_PONIES
Posted Sep 18, 2009 21:02 UTC (Fri) by jch (guest, #51929) [Link]
Indeed. According to the 2001 edition:
> Upon successful completion, link() shall mark for update the st_ctime field of the file. Also, the st_ctime and st_mtime fields of the directory that contains the new entry shall be marked for update.
POSIX v. reality: A position on O_PONIES
Posted Sep 20, 2009 19:47 UTC (Sun) by nix (subscriber, #2304) [Link]
Kind of out of date now.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 16:10 UTC (Wed) by jonth (guest, #4008) [Link]
Only with a UPS
Posted Sep 10, 2009 2:02 UTC (Thu) by ncm (guest, #165) [Link]
We can only complain of what was omitted, not what was said. What was omitted, as is unfortunately always omitted from presentations by file system experts howsoever brilliant, is mention of the crucial distinction between crashes and power drops. Disks being the way they are, none of the above really applies to power drops. If power to the drive drops, your file system can offer you no guarantee, O_PONIES support notwithstanding.

Anywhere that a power drop is overwhelmingly more likely than a system crash, which includes Most of the Known World, the whole discussion is more or less moot. That does not make the discussion moot overall, though, because people who care about their data can move themselves into the Rest of the Known World by providing a few seconds' UPS backing for the drive. We just need to make clear which part of the world we're talking about.
Only with a UPS
Posted Sep 10, 2009 6:14 UTC (Thu) by flewellyn (subscriber, #5047) [Link]
Only with a UPS
Posted Sep 10, 2009 17:34 UTC (Thu) by ncm (guest, #165) [Link]
Second, more subtle but probably more important, drives lie about what is physically on disk. To look good on benchmarks, they tell the controller that sectors have been physically copied to the platter while they are still only in buffer RAM in the drive -- up to several megabytes' worth. A few seconds after the last controller operation, these writes have drained to the disk. Before that, there's no guessing which have been written and which haven't, and blocks the system meant to write first may be written last. As a consequence, after powerup the file system sees blocks that are supposed to have important metadata in them with, instead, whatever was left there.
Only with a UPS
Posted Sep 10, 2009 18:05 UTC (Thu) by flewellyn (subscriber, #5047) [Link]
Only with a UPS
Posted Sep 10, 2009 19:19 UTC (Thu) by ncm (guest, #165) [Link]
Back about 1998, a Windows user told me that, for him, Windows "hardly ever crashes". Further questioning revealed that he defined "crash" as "I have to re-install". Lockups, a multiple-daily event, didn't count. Generally, though, by "crash" we mean the system stops responding to events, and must be re-started; usually this is a software failure, although all manner of hardware faults can cause it. When these happen, the disk has plenty of time to drain its buffers. Usually the software fault has not caused any disk writes with crazy parameters.

OS developers don't count power drops among crashes because those aren't their fault. That's commendable, because when they say "crash" they mean something they accept responsibility for.
Only with a UPS
Posted Sep 10, 2009 22:20 UTC (Thu) by flewellyn (subscriber, #5047) [Link]
Handling power drops, to me, seems impossible, at least as long as disks lie about when writes actually complete.
Only with a UPS
Posted Sep 11, 2009 8:29 UTC (Fri) by jschrod (subscriber, #1646) [Link]
Only with a UPS
Posted Sep 10, 2009 18:18 UTC (Thu) by aliguori (subscriber, #30636) [Link]
Not really. You can make a disk tell you when data is actually on the platter vs in the write cache. Furthermore, most "enterprise" drives have battery-backed write caches that guarantee enough power for the write caches to be flushed.
Only with a UPS
Posted Sep 10, 2009 19:06 UTC (Thu) by ncm (guest, #165) [Link]
> You can make a disk tell you when data is actually on the platter

No, you can ask a disk to tell you. It might even be honest about it if you never get the buffer too full. The commercial incentives to lie for the sake of benchmarks are extremely strong. Drives that don't lie cost a lot more, and are slower. Honesty is an extra-cost option. If you don't pay for honesty (few do) you won't get it.
Honesty usually costs a lot more than a UPS.
Only with a UPS
Posted Sep 10, 2009 19:12 UTC (Thu) by dlang (guest, #313) [Link]
I think this is a myth like the drives that use platter energy to power themselves to write their buffer.
if you can point to a drive that includes a battery backup on the drive please post a link to it.
Only with a UPS
Posted Sep 10, 2009 19:22 UTC (Thu) by ncm (guest, #165) [Link]
Only with a UPS
Posted Sep 11, 2009 16:24 UTC (Fri) by nix (subscriber, #2304) [Link]
controller has its own battery-backed cache, I have no idea whether it's
asked the drives it controls to turn off *their* internal write cache...
(Obviously we don't want that cache, as it's not battery-backed in any
way.)
Only with a UPS
Posted Sep 11, 2009 14:41 UTC (Fri) by anton (subscriber, #25547) [Link]
> First, if power drops during a physical write operation, that sector is scragged. If it was writing metadata, you have serious problems with whatever files that metadata describes, if anything points to that sector.

In my experiments on cutting power on disk drives while writing, the drives did not corrupt sectors. I have seen IBM and Maxtor drives corrupt sectors under more unusual power fluctuation circumstances; maybe that's a reason why you can no longer buy drives from IBM or Maxtor; Hitachi (IBM's successor) and Seagate-Maxtor (not Seagate proper) are certainly on my don't-buy list.
And a modern file system can protect against the corruption of a single sector:
E.g., in a journaling file system, that sector is either in the log or in the permanent storage. If it's in the log, just stop the replay when you encounter the sector. If it's in permanent storage, then you will notice that the replay write fails, and the file system can remap the sector/block to a working one (or the drive might remap it transparently on the replay write, or might just perform the write on the original sector; in these cases the file system has nothing to do). Of course, if the file system performs only meta-data journaling, then it will likely not notice corrupt data (because it is not accessed during replay), but apparently neither the file system maintainer nor the user (or whoever decided to use a meta-data journaling file system) cares about data anyway, so that's ok.
In a copy-on-write file system, the sector either contains the root of the file system, or it contains something written after the last root. In the latter case these blocks are unreachable anyway after recovery (unless there is also an intent log, in which case the discussion above applies). If the root is affected, then on recovery the youngest alternative root is read, giving us the latest consistent state of the file system.
> Second, more subtle but probably more important, drives lie about what is physically on disk.

In the experiments mentioned above, when the drive had write caching enabled (the default on PATA and SATA drives), the drives not only reported completion right away, but worse, also reordered the writes (so using barriers or turning off write caching is essential for every kind of consistency).
With write caching disabled, the results of my experiments (both in performance and in what was on disk after powering off) are consistent with the theory that the drive reports the completion of writes only after the sector hits the platter and (with the program I used) consequently only wrote the sectors in order.
BTW, it's not just the drive manufacturers that default to fast rather than safe; the Linux kernel developers do a similar thing (with a much smaller performance incentive) when they disable barriers by default, turned ext3 from data=journal to data=ordered (letting data=journal rot), and recently moved to data=writeback (although that may be just to make ext3 as bad as ext4 so people will not switch back). Hmm, are Solaris or BSD developers less cavalier about their users' data?
On the subject: UPSs and computer PSUs can fail, too. Better recommend dual power supplies with dual UPSs; double failures should be relatively rare.
Only with a UPS
Posted Sep 10, 2009 12:02 UTC (Thu) by xilun (guest, #50638) [Link]
I know that my statistical sample is too small (and worse, this was 6 years ago and I don't know if hard drives today are of the same quality), but anyway my first guess is that if the software is careful enough and the hardware of decent quality, the risk of massive data corruption due to a power failure is not too high (at least in the absence of bad system design, like using RAID 5/6 in a power-unsafe context).
Only with a UPS
Posted Sep 10, 2009 14:13 UTC (Thu) by Cato (guest, #7643) [Link]
This PC was frequently reset accidentally by the user pressing the power button, which caused at least one data loss event within one year. Since disabling write caching (and a couple of other changes) I've not had any data loss on this PC, but it's probably too early to be sure these changes have fixed the problem.
FWIW, I believe that at least on this setup, disabling write caching helps avoid ext3 and LVM corruption.
Only with a UPS
Posted Sep 11, 2009 5:28 UTC (Fri) by magnus (subscriber, #34778) [Link]
In the past I've had to reboot due to X server hangs (probably problems in the display driver), oopses due to unstable hardware (memory mainly) and sometimes soft hangs like losing connection to an NIS or NFS server or getting PAM misconfigured and not having a prompt to work from.
Only with a UPS
Posted Sep 19, 2009 20:30 UTC (Sat) by efexis (guest, #26355) [Link]
Only with a UPS
Posted Sep 20, 2009 7:51 UTC (Sun) by Cato (guest, #7643) [Link]
Only with a UPS
Posted Sep 21, 2009 7:20 UTC (Mon) by efexis (guest, #26355) [Link]
But of all those, the U is the most important: if it succeeds, it will protect your filesystem. You may end up with some leftover temp files, as tasks that didn't receive the terminate-request signal didn't clean up after themselves, but this is usually not too great a cost.
Alex
Only with a UPS
Posted Sep 22, 2009 6:13 UTC (Tue) by Cato (guest, #7643) [Link]
Only with a UPS
Posted Sep 22, 2009 9:28 UTC (Tue) by efexis (guest, #26355) [Link]
Only with a UPS
Posted Sep 12, 2009 0:35 UTC (Sat) by spitzak (guest, #4593) [Link]
While the power is still running, and the disk is spinning and working perfectly, EXT4 has *already* stored information on it that says the file that the atomic rename() went to is empty. The disk is in the wrong state! It is irrelevant whether a power failure may further damage the data!
Only with a UPS
Posted Sep 12, 2009 6:22 UTC (Sat) by ncm (guest, #165) [Link]
You rather miss the point. Given reliable storage -- i.e., doesn't lie about what's reached disk, or has enough battery backup to make sure it gets there, eventually -- it's possible to write a reliable file system. Without, it doesn't matter how well done the file system is, a power drop can corrupt it. If you want safety against power drops, you need both.
Power loss -> no guarantee?
Posted Sep 17, 2009 11:11 UTC (Thu) by forthy (guest, #1525) [Link]
This is wrong. Consider a log-structured, checksummed file system like NILFS. It gathers all writes, writes them out in one go, and checksums every chunk it writes. What happens when power is lost during that write? The checksum is wrong. The last update before it isn't touched, so the file system will revert to this last update. All is hunky dory, all ponies still there, no data lost except the last update - which is the guarantee of such a file system: you can only depend on data being on disk where the transaction was completely written to disk. And note: writing one sector to a hard disk takes a few microseconds nowadays, so the drive can detect a power outage and stop writing before it randomly scrambles a sector - it might not complete everything, but leaving a garbled sector is possible to avoid.
On the other argument: In the part of the world where I live (Munich), power outages are far less frequent than crashes. Our file server had some CPU problems two years ago and crashed about once a week. Thanks to the stability of ReiserFS, no data loss occurred during the half year until we found the root cause and replaced the CPUs. Even when not including hardware defects, I definitely have more crashes than power outages. Frequent power outages happen in poor countries with third-world infrastructure.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 16:22 UTC (Wed) by njd27 (subscriber, #5770) [Link]
> And, I have to note, I did have a real, live pony when I was a kid, so I tend to be on the side of giving programmers ponies when they ask for them.

The trouble of course, as any student of ponydynamics can tell you, is that once the kernel developers start handing out ponies as the solution to any difficult problem, the workload in having to deal with mucking them out might grow exponentially. Hence the reluctance. Worse might be better.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 17:11 UTC (Wed) by drag (guest, #31333) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 19:43 UTC (Wed) by martinfick (subscriber, #4455) [Link]
Given and denied
Posted Sep 9, 2009 21:59 UTC (Wed) by man_ls (guest, #15091) [Link]
Indeed. And now they are saying it was in fact O_UNICORNS, so they were not really given to anyone.
Given and denied
Posted Sep 9, 2009 22:59 UTC (Wed) by Tara_Li (guest, #26706) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 16:46 UTC (Wed) by dannf (subscriber, #7105) [Link]
what's needed is a application barrier() call
Posted Sep 9, 2009 18:14 UTC (Wed) by dlang (guest, #313) [Link]
barriers are starting to exist deeper inside the storage stack, but there is no way for the application writer to invoke them.
what's needed is a application barrier() call
Posted Sep 9, 2009 18:36 UTC (Wed) by quotemstr (subscriber, #45331) [Link]
rename ought to imply a barrier because there are no useful cases for a non-barrier rename, and plenty of cases for a rename-with-barrier. Forcing application authors to use a separate call will simply allow them to forget it, and will ensure that existing applications are still unsafe. A barrier-on-rename scheme adds safety with no loss in expressiveness, and is a performance increase over fsync, which is, well, synchronous.
what's needed is a application barrier() call
Posted Sep 9, 2009 18:52 UTC (Wed) by dlang (guest, #313) [Link]
I am stating that rename should not be necessary to create a barrier.
don't require application writers to add renames if they wouldn't otherwise need them just because they want a barrier.
what's needed is a application barrier() call
Posted Sep 9, 2009 19:06 UTC (Wed) by quotemstr (subscriber, #45331) [Link]
Ah, I see. fbarrier is an interesting construct. You'd still need the rename trick in order to prevent observers from seeing inconsistent files at runtime. A program modifying /etc/passwd should prevent other applications from seeing an incompletely-written file. One could do this with locks, but a simpler mechanism is rename.
what's needed is a application barrier() call
Posted Sep 9, 2009 20:55 UTC (Wed) by dlang (guest, #313) [Link]
History (re: what's needed is a application barrier() call
Posted Sep 10, 2009 16:18 UTC (Thu) by davecb (subscriber, #1574) [Link]
Rename was one of the two Unix V6 system calls which were documented as being necessary and sufficient to allow one to do atomicity and locking: the other was open(... O_EXCL|O_CREAT). The latter atomically creates files, the former atomically changes them, including atomically making them cease to exist. I rather expect Thompson and Ritchie would be bemused by some of the discussion to date (;-)--dave
History (re: what's needed is a application barrier() call
Posted Sep 10, 2009 20:18 UTC (Thu) by aegl (subscriber, #37581) [Link]
Nope. "rename" wasn't in V6 (see http://minnie.tuhs.org/UnixTree/V6/usr/sys/ken/sysent.c.html).
"the other was open(... O_EXCL|O_CREAT)"
V6 open(2) didn't have all those fancy O_* options. You just got the choice of FREAD, FWRITE, or both.
Applications in V6 era typically used "link(2)" as their locking primitive (create a randomly named tempfile, then link that to a statically named lockfile. If the link call succeeds, you own the lock. If you get EEXIST, then someone else does).
History (re: what's needed is a application barrier() call
Posted Sep 10, 2009 21:03 UTC (Thu) by davecb (subscriber, #1574) [Link]
the open/rename indeed come later.
--dave
what's needed is a application barrier() call
Posted Sep 9, 2009 20:02 UTC (Wed) by vaurora (subscriber, #38407) [Link]
So, you have a hierarchy something like:
fsync() << barrier() << depends_on()
where what's to the left performs worse and what's to the right is more difficult to implement in the file system. I need to write about Featherstitch sometime - some amazing difficulties there.
what's needed is a application barrier() call
Posted Sep 9, 2009 20:53 UTC (Wed) by dlang (guest, #313) [Link]
simply

write1 write2 write3 barrier write4 write5

will guarantee that writes 1-3 hit the disk before writes 4 and 5, but says nothing about the ordering or timing of the two separate sets.

this is _much_ easier to implement.
what's needed is a application barrier() call
Posted Sep 9, 2009 21:49 UTC (Wed) by njs (subscriber, #40338) [Link]
Yes, that sort of ordering is what she's talking about. (The Featherstitch papers are worth checking out.)

IIUC, the trickiness in implementing it is that you need to keep track of which writes depend on which other writes, and which intermediate states are allowed, but then you have to keep non-dependent writes as disentangled as possible, to avoid the slowdowns and soft-updates craziness that she describes in the original article -- but you can't be *too* clever about it, or your accounting overhead will become a bottleneck.
Featherstitch's solution involves careful optimizations and giant graphs in kernel memory.
what's needed is a application barrier() call
Posted Sep 9, 2009 23:44 UTC (Wed) by dlang (guest, #313) [Link]
if you have a way to prevent memory blocks that have been submitted for I/O from being changed before the I/O is completed, this is just a matter of prohibiting reordering in the device stack.
this would be overkill for what is being asked for (the user cares about ordering the changes to their file, this would order changes to the entire device), but it would do the job without having to worry about tracing dependencies.
if the device stack were to mark all buffers it has pending as COW when it gets a barrier() call, this doesn't seem that hard to do.
now, it does open up the possibility of running out of memory and having problems writing something to swap (until the currently pending writes complete), but if you do the barrier on a per-partition basis this shouldn't be that bad (this assumes that the device stack can easily tell what partition pending writes would go to)
what's needed is a application barrier() call
Posted Sep 10, 2009 6:51 UTC (Thu) by njs (subscriber, #40338) [Link]
But whether my analysis is accurately pin-pointing the problem or not, with all respect, I think I'll trust Val's word over yours that there *is* a problem :-).
what's needed is a application barrier() call
Posted Sep 10, 2009 7:01 UTC (Thu) by dlang (guest, #313) [Link]
Val is definitely right about the difficulty of doing it the best possible way; as I noted, my approach would run some possibility of the COW causing out-of-memory problems, but I suspect that it's a 90% solution.
what's needed is an application barrier() call
Posted Sep 10, 2009 7:35 UTC (Thu) by njs (subscriber, #40338) [Link]
I guess this is complicated by the question of when dirty blocks get flushed, and in how large batches; maybe it's solvable. But I don't think memory is the main concern, at least.
what's needed is an application barrier() call
Posted Sep 10, 2009 7:48 UTC (Thu) by dlang (guest, #313) [Link]
an alternative to the application barrier() call
Posted Sep 11, 2009 15:05 UTC (Fri) by anton (subscriber, #25547) [Link]
> write1 write2 write3 barrier write4 write5
> will guarantee that writes 1-3 will hit the disk before writes 4 and 5 but says nothing about the ordering or timing of the two separate sets.

An alternative would be to just extend POSIX logical ordering guarantees (as visible by other processes) to the post-recovery state. That would mean that the file system would implicitly put a barrier between any of the writes in your example.
The question is: how much would this guarantee cost compared to what you have in mind? In a copy-on-write filesystem it could cost very little, if anything. The file system could still perform the user writes in any order (all of them, not just a subset), but just would never commit a write for which the earlier writes have not been performed yet. For journaled file systems the reasoning is more complex, but I believe that in the usual case (writing new data) the cost is also very small.
The benefits of this guarantee are that it makes programming easier, and especially testing easier: If your files are always consistent as seen by other processes, they will also be consistent in case of a crash or power outage; no need to pull the power plug in order to test the crash resilience of your application.
an alternative to the application barrier() call
Posted Sep 11, 2009 16:29 UTC (Fri) by dlang (guest, #313) [Link]
since most writes are less than a sector, multiple writes would be even more expensive for a COW system
An alternative to the application barrier() call
Posted Sep 11, 2009 17:06 UTC (Fri) by anton (subscriber, #25547) [Link]
Barriers certainly don't prevent write merging. Why would they? A barrier just means that logically later writes are not committed before logically earlier writes, but they can become visible at the same time. So you can merge as many writes across barriers as you want.

OK, your formulation of barriers excludes the same-time option, but apart from the lower performance, how could an external observer tell whether two logical writes happened one after another or at the same time? Once they are both committed, there is no difference.
As my posting explains, they also don't prevent reordering of physical data writes, they only restrict which sets of writes are committed by a commit.
Multiple small writes can be merged together into a large one.
BTW, most writes probably happen through libc buffers, and are typically larger than one sector (unless most of your files are smaller than one sector).
An alternative to the application barrier() call
Posted Sep 11, 2009 19:26 UTC (Fri) by dlang (guest, #313) [Link]
work...work
write one line, or a couple words of a line
work..work
write a little more
etc
enforcing a barrier between all of these writes would kill you
remember that you don't know the storage stack below you, what you submit as one write may be broken up into multiple writes, and you have no guarantee of what order those multiple writes could be done in (think a raid array where your write spans drives as one example)
as a result a barrier needs to prohibit merging across the barrier as well as just reordering across the barrier.
An alternative to the application barrier() call
Posted Sep 13, 2009 17:46 UTC (Sun) by anton (subscriber, #25547) [Link]
Code that writes a few characters here and a few characters there usually uses the FILE * based interface, which performs user-space buffering and then typically performs write() (or somesuch) calls of 4k or 8k at a time; just strace one of these programs. That's done to reduce the system call overhead. But even if such programs perform a write() for each of the application writes, having barriers between each of them does not kill performance, because a sequence of such writes can be merged.

Concerning the block device below: if that does not heed the block device barriers or other block device ordering mechanisms that the file system requests, then you get no guarantee at all of any consistency on crash/power failure. It's not just that merged writes won't work; your style of merge-preventing barriers won't work either, and neither will the guarantees that fsync()/fdatasync() are supposed to provide. That's because all of them require that the block device ordering mechanism(s) that the file system uses actually work, and all of them will produce inconsistent states if the writes happen in an order that violates the ordering requests. So, if you want any consistency guarantees at all, you need an appropriate block device, and then you can implement mergeable writes just as well as anything else.
As for an array where a write spans drives, implementing a barrier or other ordering mechanism on the array level certainly requires something more involved than just doing barriers on the individual block devices, but the device has to provide these facilities, or you can forget about crash consistency on that device (i.e., just don't use it).
An alternative to the application barrier() call
Posted Sep 13, 2009 20:23 UTC (Sun) by dlang (guest, #313) [Link]
this isn't always needed, so don't try to do it for every write (and I've straced a lot of code that does lots of write() calls)
do it when the programmer says that it's important. 99+% of the time it won't be (either the result is not significantly more usable after a crash if only part of the file is there, or this really is performance-sensitive enough to risk it)
you would be amazed at the amount of risk that people are willing to take to get performance. talk to the database gurus at MySQL or postgres about the number of people they see disabling f*sync on production databases in the name of speed.
An alternative to the application barrier() call
Posted Sep 14, 2009 22:16 UTC (Mon) by anton (subscriber, #25547) [Link]
Fortunately writes on the file system level can be merged across file system barriers, resulting in few barriers that have to be passed to the block device level. So there is no need to pass a block device barrier down for every file system barrier.

And since it is possible to implement these implicit barriers between each write efficiently (by merging writes), why burden programmers with inserting explicit file system barriers? Look at how long the Linux kernel hackers needed to use block device barriers in the file system code. Do you really expect application developers to do it at all? And if they did, how would they test it? This has the same untestability properties as asking application programmers to use fsync().
Concerning the risk-loving performance freaks, they will use the latest and greatest file system by Ted Ts'o instead of one that offers either implicit or explicit barriers, but of course they will not use fsync() on that file system :-).
BTW, if you also implement block device writes by avoiding overwriting live sectors and by using commit sectors, then you can implement mergeable writes at the block device level, too (e.g., for making them cheaper in an array). However, the file system will not request a block device barrier often, so there is no need to go to such complexity (unless you need it for other purposes, such as when your block device is a flash device).
An alternative to the application barrier() call
Posted Sep 20, 2009 5:22 UTC (Sun) by runekock (subscriber, #50229) [Link]
But what about eliminating repeated writes to the same place? Take this contrived example:
repeat 1000 times:
write first byte of file A
write first byte of file B
A COW file system may well be able to merge the writes, but it would require a lot of intelligence for it to see that most of the writes could actually be skipped. And a traditional file system would be even worse off.
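For concreteness, the contrived example can be sketched in C (the helper name and loop structure are made up for illustration). Logically, only the last value written to each first byte matters; an update-in-place filesystem that preserved every intermediate POSIX state would nevertheless have to perform all 2000 physical writes:

```c
#include <unistd.h>
#include <fcntl.h>   /* for the open() flags in the usage example */

/* 1000 alternating one-byte overwrites of the first byte of two
 * files, as in the example above. Only the final value of each byte
 * is logically observable once everything is committed. */
static void hammer_first_bytes(int fd_a, int fd_b)
{
    for (int i = 0; i < 1000; i++) {
        char a = (i % 2) ? 'B' : 'A';
        char b = (i % 2) ? '1' : '0';
        pwrite(fd_a, &a, 1, 0);   /* overwrite byte 0 of file A */
        pwrite(fd_b, &b, 1, 0);   /* overwrite byte 0 of file B */
    }
}
```

After the loop, file A's first byte is 'B' and file B's is '1'; every intermediate state is invisible to a reader that only looks afterwards, which is exactly the optimization opportunity being debated.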
An alternative to the application barrier() call
Posted Sep 20, 2009 18:38 UTC (Sun) by anton (subscriber, #25547) [Link]
For a copy-on-write file system that example would be easy: do all the writes in memory (in proper order), and when the system decides that it's time to commit the stuff to disk, just do a commit of the new logical state to disk (e.g., by writing the first block each of file A and file B and the respective metadata to new locations, and finally a commit sector that makes the new on-disk state visible).

An update-in-place file system (without a journal) would indeed have to perform all the writes in order to have the on-disk state reflect one of the logical POSIX states at all times (assuming that there are no repeating patterns in the two values that are written; if there are, it is theoretically possible to skip the writes between two equal states).
an alternative to the application barrier() call
Posted Sep 12, 2009 0:39 UTC (Sat) by spitzak (guest, #4593) [Link]
An actual implementation may be "allocate temporary space, write 2, write 1, make the file point at temporary space". Notice that write 2 is done BEFORE write 1, but we have fulfilled the requirements of barrier.
what's needed is an application barrier() call
Posted Sep 10, 2009 1:43 UTC (Thu) by ras (subscriber, #33059) [Link]
Maybe write barriers are just the interface we need, maybe not. What we need is a Paul McKenney to do for this problem what Paul did for a very similar problem in multi-CPU memory architectures. He wrote the RCU stuff and in the process became intimately familiar with the problem and all possible solutions. He then worked for years to get a standard set of functions that solved the problem added to the C standard library. Where is your white knight in shining armour when you need one? Who knows, maybe our knight will be female for a change.
One thing hasn't changed though, and that is the ability of the female of our species to create work. Now I feel compelled to learn about Featherstitch.
what's needed is an application barrier() call
Posted Sep 10, 2009 11:18 UTC (Thu) by nix (subscriber, #2304) [Link]
> He then worked for years to get a standard set of functions that solved the problem to be added to the C standard library.

Obviously I missed something, but userspace RCU isn't in glibc and certainly isn't in POSIX.
what's needed is an application barrier() call
Posted Sep 10, 2009 12:19 UTC (Thu) by ras (subscriber, #33059) [Link]
> and certainly isn't in POSIX.
Paul describes his efforts so far here: http://www.rdrop.com/users/paulmck/scalability/paper/CPP-...
what's needed is an application barrier() call
Posted Sep 10, 2009 13:37 UTC (Thu) by nix (subscriber, #2304) [Link]
(And Paul's slides are the best explanation of the problem I've ever seen. Slide 26 is particularly good ;} )
what's needed is an application barrier() call
Posted Sep 10, 2009 11:19 UTC (Thu) by nix (subscriber, #2304) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 18:26 UTC (Wed) by iabervon (subscriber, #722) [Link]
1) concurrent readers; they shouldn't sometimes see an /etc/passwd without half of the users, a mail spool without half of the messages, etc. It would be a problem if, when any user changes their password, there's a chance that other users will get random login failures. It would also be a problem if, while making modifications to messages in your inbox, a program that just looked at the state of the inbox might see messages disappear only to reappear a little while afterwards. Concurrent reads also include that all-important tape backup, which really really shouldn't catch some intermediate state of /etc/passwd.
2) application crashes; if an application is trying to make a modification to a file, and it is forced to exit before it is done, the old version should be left there. The application may be forced to exit by a segfault (bug in the application), being terminated by the user, being terminated by the system, a clean system shutdown or filesystem umount/remount, etc. The application may also, due to a bug, go into an infinite loop such that it would never actually write the rest of the unchanged content, and would have to be killed such that a different program can get exclusive control over writing it.
Unlike the system crash case, these cases are both required to have particular, internally visible, behaviors by POSIX, and POSIX conformance tests have something to test. Also, applications can (and sometimes do) get regression testing which tries to ensure that they work in conditions which would provoke bad behavior. Unless people develop applications on systems which panic frequently, they're not likely to consider the system crash case, probably won't do anything special in order to make sure that it behaves in some way they want, and certainly won't test that it actually does behave the way they expect. (In fact, POSIX notes that any test involving a system crash can be treated as a quality-of-implementation issue rather than a correctness issue.)
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 18:36 UTC (Wed) by alexl (subscriber, #19068) [Link]
Instead, it's all about fixing race conditions. What if two applications write to the same file at the same time, or one reads a file at the same time it is replaced? If I use rename to save, I am guaranteed by the POSIX specification (unless the system crashes) that any reader of the saved filename at *any* time atomically gets either the full old file, or the full new file. There is never any partial or non-existing file. Similarly, if two apps write the file, you only ever get one of the two files written, never some mixup of them.
So, the use of rename is basic to how we replace a file. This means everyone is doing it, and it seems natural that filesystems would use their knowledge of this pattern to efficiently and safely implement crash-safe file replacement. Such a thing is not guaranteed by POSIX, but should probably be guaranteed by "good implementations" of unix filesystems. Especially since the safe alternative (fsync) is far slower, its semantics guaranteeing more than we need.
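For reference, the replace-via-rename pattern this whole thread revolves around looks roughly like this. This is a minimal sketch: error handling is abridged, the temp-file naming scheme is made up, and the fsync() line is exactly the contested cost (drop it and some filesystems may commit the rename before the data, leaving a truncated or empty file after a crash):

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Replace "path" with "data" so that concurrent readers always see
 * either the complete old file or the complete new one; POSIX
 * guarantees rename() atomicity for live readers, while crash safety
 * is exactly what this thread argues about. */
static int replace_file(const char *path, const char *data, size_t len)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);   /* hypothetical naming scheme */

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    if (write(fd, data, len) != (ssize_t)len) {  /* short-write handling elided */
        close(fd);
        unlink(tmp);
        return -1;
    }
    /* The contested step: data must reach disk before the rename is
     * committed, or a crash can leave a zero-length file. */
    if (fsync(fd) < 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    close(fd);
    return rename(tmp, path);    /* atomic for concurrent readers */
}
```

What the O_PONIES discussion wants is the same crash semantics without the explicit, potentially multi-second fsync().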
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 19:19 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 20:20 UTC (Wed) by job (guest, #670) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 22:40 UTC (Wed) by njs (subscriber, #40338) [Link]
Our standards keep rising, though, and these days people actually care about what happens to data over crashes -- and POSIX's primitives *suck* for this, unless you are writing a giant database with dedicated storage.
In this discussion, whenever kernel folks talk about developers coming to depend on rename's atomicity, I'm pretty sure they're talking about its atomicity with respect to crashes. (For instance, I believe Subversion's backend format uses atomic-rename for reliability over crashes, because fsync is just untenable.)
crash vs. power drop
Posted Sep 10, 2009 2:15 UTC (Thu) by ncm (guest, #165) [Link]
How often do any of us see crashes on production systems, nowadays? Things being the way they are, power drops are overwhelmingly more likely on your typical desktop or rack installation, just because a UPS is an extra expense and crippling data loss isn't especially likely even without one. "All of the above" is meaningful only if power doesn't drop unexpectedly, and, sadly, that needs to be repeated every time.
crash vs. power drop
Posted Sep 10, 2009 6:12 UTC (Thu) by njs (subscriber, #40338) [Link]
crash vs. power drop
Posted Sep 10, 2009 17:50 UTC (Thu) by ncm (guest, #165) [Link]
http://lwn.net/Articles/352002/
When the drives lie to the file system, the file system can't guarantee anything. Most drives (i.e. the ones you have) lie. Drives that don't lie cost way more, and are slower; it's much cheaper to add some battery backup, and even some redundancy, than to buy the expensive drives. There's no point in discussing fine points of data ordering if you haven't got one or the other.
crash vs. power drop
Posted Sep 11, 2009 6:27 UTC (Fri) by zmi (guest, #4829) [Link]
physically on disk.
That's why in a RAID you really *must* turn off the drive's write cache.
I've tried to explain that in the XFS FAQ:
http://xfs.org/index.php/XFS_FAQ#Q:_What_is_the_problem_w...
and also in the questions below that one.
Short: I've got a new 2TB WD drive with 64MB cache, and we intend to use them
in a RAID. Take 16 of these drives and it adds up to 1024MB (1GB) of write cache.
So in the worst situation,
1) you've got an UPS, but your power supply fails and the PC/server is out
of power
2) drives have their caches full, so up to 1GB of data is lost that the
filesystem believed was on disk. There's a *very* high chance that
lots of metadata is included in the cached writes.
3) each of the 16 drives could write "half sectors", effectively destroying
the previous and the actual content.
In all this discussion, it would have been worth noting that if you really
*care* about your data, you *must* turn off the drive write cache. Yes,
power failures are not so often in countries with good power supply. Still,
I use a UPS and in the last half year, had
a) my daughter playing around turning the power of the server off
b) a dead power supply in my workstation
and so, even with a UPS, "drive write cache off" is a must. Simply put a
hdparm -W0 /dev/sda
in your boot scripts.
Note that still this only helps in 1) and 2), but for problem 3) there's
nothing anybody but the disk manufacturers can do. I must say that I have
no evidence of ever having had that problem somewhere. It might be what
happened when there are "strange filesystem problems" after a crash, but
you can't tell for sure.
As for the rename: Really, there should only ever be the chance of having
either the old file or the new one, and the filesystem should care about
this even for crash situations.
Note: In Linux you can tune writeback behaviour in /etc/sysctl.conf:
# start writeback at around 16MB, max. 250MB
vm.dirty_background_bytes = 16123456
vm.dirty_bytes = 250123456
# older kernels had this:
#vm.dirty_background_ratio = 5
#vm.dirty_ratio = 10
# write blocks to disk after 1 second (default: 3000ms)
vm.dirty_expire_centisecs = 1000
vm.dirty_writeback_centisecs = 100
Note that dirty_bytes/dirty_ratio makes new writes block once the cache has
that many bytes to write. On systems with 8GB RAM or more, you could end up
having gigabytes of disk cache.
Sorry for putting all in one post, but I hope it helps people who care
about their data to have some tunings to start with.
mfg zmi
crash vs. power drop
Posted Sep 11, 2009 11:40 UTC (Fri) by Cato (guest, #7643) [Link]
On the topic of writing 'half sectors' due to a power drop: the author of http://lwn.net/Articles/351521/ has done quite a lot of tests on various hard drives, and generally found that they usually don't do this, though some have. He has a useful program that can test any drive for this behaviour, though it's mostly intended to test for out-of-order writes due to caching - I believe only some drives lie about this.
crash vs. power drop
Posted Sep 14, 2009 14:13 UTC (Mon) by nye (guest, #51576) [Link]
If all you care about is benchmark results, there's an obvious incentive for the drive to claim falsely that some data has been physically written to disk.
On the other hand, there's far less incentive to lie about barriers. If all you're saying is that '*if and when* this data has been written, then that other data will have also been written', you can still happily claim that it's all done when it isn't, without breaking the barrier commitment.
When you have that commitment, it's possible to build a transaction system upon it which works even under the assumption that the drive will lie. You're not going to achieve the full benchmark speed, but it's going to be far better than turning off the cache.
Of course, whether drive manufacturers see it that way is another matter. Is there any data on whether drives actually honour write barriers? It would be interesting to see if there are indeed drives that aren't expensive enough to report accurately on when data has been written, but still honour the barrier request.
crash vs. power drop
Posted Sep 14, 2009 15:06 UTC (Mon) by ncm (guest, #165) [Link]
Drive manufacturer A can sell almost equally as many drives made this way as that way, but "that way" costs more development time and might make it come out a little slower in benchmarks. Some purchase decisions depend on claiming it's made "that way". The manufacturer can make it "that way" or just say it is, but not. Which is more likely?
crash vs. power drop
Posted Sep 14, 2009 16:12 UTC (Mon) by nye (guest, #51576) [Link]
Usually when it's a feature that either works or doesn't - so it's not a subjective measurement - it's likely that a product does at least technically do what it says it does.
Presumably the argument is that the manufacturers aren't specifically claiming a particular feature, but the disk is behaving in a particular way that just happens to be not what the user expected - so they're not technically lying. This does seem to weaken the idea that they're doing it to improve the chances of people buying it though, if they're not stating it as a feature.
Just out of interest, I've just spent a while trying to see if I can find out what the difference is between the Samsung HE502IJ and HD502IJ - two drives which are identical on paper, but one is sold as 'RAID-class'. Neither are even remotely expensive enough not to lie about their actions, so what's the difference? Well, some forum post claims that one has a 'rotational vibration sensor', whatever that means.
In conclusion, people who try to sell you things are all liars and cheats, and I intend to grow a beard and live out the rest of my days as a hermit, never having to worry about these things again. Perhaps I shall raise yaks.
crash vs. power drop
Posted Sep 15, 2009 12:52 UTC (Tue) by Cato (guest, #7643) [Link]
crash vs. power drop
Posted Sep 15, 2009 12:49 UTC (Tue) by Cato (guest, #7643) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 8:24 UTC (Thu) by alexl (subscriber, #19068) [Link]
I'm all for a reasonable API addition to implement O_PONIES, and would implement support for it in the stuff I work on (glib, gio, etc) the second it was available. However, all existing applications already use rename() for atomic replacement for reasons unrelated to system crashes, so why not just make all these applications work without additional changes? At little cost in performance.
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 8:52 UTC (Thu) by dlang (guest, #313) [Link]
doing an fsync on ext3 (which is what the ext maintainers believe is necessary to get data safely to disk) can take several seconds. if you want a rename to provide that sort of guarantee, you need to be willing to pay that sort of cost for every rename.
ext3 never provided the guarantees that people think it did. it just happened to work if you didn't crash too soon after doing a rename.
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 9:04 UTC (Thu) by alexl (subscriber, #19068) [Link]
NO NO NO NO. We do not need/want the file to be fsynced.
Why do people keep repeating this fallacy? We all know that fsync is expensive, and don't want to use it, or something with similar semantics.
What we want is something that takes the natural behavior of rename() replacement (atomically get either the old or the new file) and extends it to a system crash. This does not imply an fsync, but rather that the data for the new file is on disk before the metadata is. This is much cheaper than an fsync because it does not require the data to be written immediately; rather, the write of the metadata has to be delayed until the data has been written. Thus "little cost in performance", at least relative to fsync.
And then you write "ext3 never provided the guarantees that people think it did" when my whole point has been about how everyone gives this reason for why people use rename when it's not actually the reason! I am well aware that rename() does not give me system crash safety; I use it for other reasons. However, I *would* like it if this common operation, in use for decades before ext3 was written, were also recognized by ext3 and made even more useful (even though this is in no way guaranteed by POSIX).
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 16:37 UTC (Thu) by nye (guest, #51576) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 20:38 UTC (Thu) by HelloWorld (guest, #56129) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 21, 2009 1:52 UTC (Mon) by efexis (guest, #26355) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 21, 2009 13:45 UTC (Mon) by nye (guest, #51576) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 16:01 UTC (Thu) by forthy (guest, #1525) [Link]
I really wonder why all this "data=ordered" stuff is said to cost performance. If implemented right, it must improve performance. All you want to do is the following: Push data into the write buffer. Push metadata into the metadata write buffer. Push freed blocks into the freed blocks buffer (but don't actually free them). If your buffers are full, there's no free block around any more, or a timer expires, do the following:
- Write out data.
- Write out metadata (first to journal, then to the actual file system).
- Actually free the blocks from the freed block list
You only have to write data once - new files go to newly allocated blocks which don't appear in the metadata when you write them (they are still marked as free in the on-disk data). For files with in-place writes, we usually don't care (there are many race conditions for writing in-place, so the general usage pattern is not to do that if you care about your data). For crash-resilient systems, you want to write your metadata twice (once into a journal, once into the file system), order it (ordered metadata updates), or use a COW/log structured file system, where you write a new file system root (snapshot) on every update round. While you are writing data from your buffers, open up new buffers for the OS to be used as buffers for the next round (double-buffering strategy). This double buffering should be a common part of the FS layer, because it will be used in all major file systems.
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 16:41 UTC (Thu) by dlang (guest, #313) [Link]
that is why barriers are needed, to tell the device not to reorder across the barrier.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 20:21 UTC (Wed) by kjp (guest, #39639) [Link]
"Ted Ts'o has mitigated that problem somewhat, though, by adding in the same safeguards he put into ext4. In some situations (such as when a new file is renamed on top of an existing file), data will be forced out ahead of metadata."
Please clarify or contradict that before you give us poor app developers any more heart trouble... thanks.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 21:08 UTC (Wed) by vaurora (subscriber, #38407) [Link]
commit e7c8f5079ed9ec9e6eb1abe3defc5fb4ebfdf1cb
Author: Theodore Ts'o <tytso@mit.edu>
Date: Fri Apr 3 01:34:49 2009 -0400
ext3: Add replace-on-rename hueristics for data=writeback mode
In data=writeback mode, start an asynchronous flush when renaming a
file on top of an already-existing file. This lowers the probability
of data loss in the case of applications that attempt to replace a
file via using rename().
---
If you aren't sure, I recommend using ext3 with the explicit data=ordered option until you've had the opportunity to sit down and understand the data=writeback and/or ext4 semantics.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 20:33 UTC (Wed) by kjp (guest, #39639) [Link]
I have an example of when this actually can be useful (however to be clear, I am very much in favor of rename being safe by default).
A web interface needs to display a text dump from a daemon to a user. The daemon writes out /tmp/daemon-dump.tmp, and renames it to /tmp/daemon-dump.txt. It does this since concurrent users in the web site may still be accessing the previous version. However, on a crash it doesn't matter what happens to this file since it is always refreshed by a web process before starting display.
In a nutshell, the rename here is for IPC concurrency only and not related to any crash consistency. However, it is FAR more sane to add an fcntl flag saying (O_EAT_MY_DATA_SPEED_TOO_IMPORTANT) for these speed freaks than to brick files or entire computers....
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 21:00 UTC (Wed) by jzbiciak (guest, #5246) [Link]
I posit that there should rarely be a good reason to run fsck on /tmp. Keeping the contents of /tmp across a crash may be useful for forensics (ie. "why did we crash?") and maybe rescuing a couple things, but otherwise you should be able to jettison it if there are any issues with the filesystem holding it, so long as /tmp is in its own filesystem. (That said, my RHEL4 box has a /tmp/lost+found. Hmmm.)
I'd further posit that nobody hopes for the empty file across a system crash. (If they did, they'd unlink() it just after opening it and before writing to it so that the file effectively evaporates on a crash. Ok, it might show up in lost+found, but only root can play there.) In your example, the programmer simply doesn't care. The programmer hopes for peak performance and doesn't really care if that means the file has garbage if the system crashes. If the contents were perfectly preserved without slowing the program down, I doubt the programmer in your example would care.
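The unlink-after-open trick mentioned above can be sketched in a few lines of C (the function name here is made up for illustration): the data remains reachable through the open descriptor, but the file has no name, so after a crash there is nothing left to show up half-written.

```c
#include <fcntl.h>
#include <unistd.h>

/* Open a scratch file and immediately remove its name. The storage
 * stays allocated and usable through the returned fd, and is
 * reclaimed automatically when the fd is closed -- or on a crash,
 * when at worst an orphan lands in lost+found. */
static int open_evaporating(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd < 0)
        return -1;
    unlink(path);   /* name gone; data lives until the fd is closed */
    return fd;
}
```

This is exactly why a programmer who truly wanted "nothing survives a crash" semantics would reach for this pattern rather than hoping for an empty file.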
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 21:03 UTC (Wed) by sbergman27 (guest, #10767) [Link]
But doing so is far more likely to cause problems than to help solve them.
POSIX v. reality: A position on O_PONIES
Posted Sep 9, 2009 22:20 UTC (Wed) by iabervon (subscriber, #722) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 8:58 UTC (Thu) by lysse (guest, #3190) [Link]
... O_PONYCRAP ?
POSIX v. reality: A position on O_PONIES
Posted Sep 17, 2009 16:04 UTC (Thu) by forthy (guest, #1525) [Link]
O_RACE_HORSE | O_ON_STEROIDS
That's obvious, isn't it?
POSIX v. reality: A position on O_PONIES
Posted Sep 10, 2009 15:04 UTC (Thu) by MisterIO (guest, #36192) [Link]
The question of course is...
Posted Sep 10, 2009 16:28 UTC (Thu) by ikm (subscriber, #493) [Link]
POSIX ain't what you think it is....
Posted Sep 11, 2009 0:48 UTC (Fri) by faramir (subscriber, #2327) [Link]
...If path1 names a directory, link() shall fail unless the process has appropriate privileges and the implementation supports using link() on directories...
So what does that mean? It says that IF the process is sufficiently privileged (traditionally running as root) AND the implementation elects to do so THEN it is not an error for a link to actually be created for a directory. Any program that actually wants to work properly on all POSIX system can NEVER require such an operation to work even if running as root. That's a pretty lukewarm endorsement of the functionality.
You might ask why it's even in there at all. The original UNIX systems (all versions) had no atomic mkdir() system call. You used mknod() to make a TOTALLY empty directory and then used link() to create the standard . and .. entries in the newly created directory. The mkdir program was setuid root, and if you wanted to create a directory from within a program
you had to do a fork() and exec() the program. Specifying link() in this way allowed UNIX versions which had traditionally made directories this way to get POSIX certification while in no way encouraging implementations of this type. Frankly, I think that was a good decision. Leniency for historical practice, but encouraging more reasonable implementations in the future. Users don't run their own programs as root so there was never any danger of people making directory loops to screw up your system.
Directory hard links
Posted Sep 11, 2009 6:40 UTC (Fri) by salimma (subscriber, #34460) [Link]
... a number of warts revealed themselves over time, not all of which were removed before the interface was codified into the POSIX specification. One example is directory hard links, which permit the creation of a directory cycle - a directory that is a descendant of itself - and its subsequent detachment from the file system hierarchy, resulting in allocated but inaccessible directories and files.
How ironic that Apple chose to implement this feature in HFS+ ...
Directory hard links
Posted Sep 11, 2009 19:06 UTC (Fri) by foom (subscriber, #14868) [Link]
Directory hard links
Posted Sep 12, 2009 0:21 UTC (Sat) by cras (guest, #7000) [Link]
adding a file (or link) to some special hidden directory. Once you had made some thousand hard links, performance became progressively worse until the system was rebooted (the directory gets wiped at startup). I wonder if that ever got fixed.
truncate()
Posted Sep 12, 2009 0:26 UTC (Sat) by cras (guest, #7000) [Link]
it before when doing "man ftruncate", but it never registered in my mind. Is it actually used by any real programs? In over ten years of programming on Linux I've never needed it. If anything, it sounds potentially dangerous to use (a TOCTOU race).
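For what it's worth, the TOCTOU hazard attaches to truncate()-by-path, since the path is re-resolved at call time; ftruncate() acts on a descriptor you already hold, so the target cannot be swapped out from under you. A minimal sketch (the function name is my own, for illustration):

```python
import os
import tempfile

def shrink_open_file(fd, new_size):
    """Truncate via the already-open descriptor.

    os.ftruncate() operates on fd, so there is no window between a
    check and the truncation in which another process could replace
    the file behind the name -- the race that truncate(path) invites.
    """
    os.ftruncate(fd, new_size)

# Demonstration: shrink a ten-byte file down to four bytes.
_dir = tempfile.mkdtemp()
_path = os.path.join(_dir, "log")
_fd = os.open(_path, os.O_CREAT | os.O_RDWR, 0o600)
os.write(_fd, b"0123456789")
shrink_open_file(_fd, 4)              # keep only the first 4 bytes
remaining = os.fstat(_fd).st_size
os.close(_fd)
os.unlink(_path)
os.rmdir(_dir)
```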
POSIX v. reality: A position on O_PONIES
Posted Sep 12, 2009 0:46 UTC (Sat) by spitzak (guest, #4593) [Link]
I would like to go further and really extend POSIX to get the atomic operations everybody really wants. I would actually redefine the flags sent by creat() (O_CREAT|O_WRONLY|O_TRUNC) to be this "ponies" flag with the following rules:
1. If the file already exists, then other programs either see the old file or the contents of the new file when close() was called on it.
2. If the file does not already exist then other programs either see no such file or they see the file with the contents when close() was called on it.
3. If the program crashes or exits without calling close(), then it is exactly as if nothing ever happened.
3a. It may be useful to add a new call that "forgets" the file, i.e., it is "closed" in the same way as when the program exits.
This would avoid the need to write a temporary file and then rename it to get real atomic writes. And it would not need a new flag, because all programs using the creat() flags already act exactly as if it worked this way.
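The temporary-file-plus-rename idiom that this proposal aims to make unnecessary can be sketched as follows. This is a simplified illustration: the ".tmp" naming scheme is hypothetical, and real code would also handle collisions and clean up the temporary on error:

```python
import os
import tempfile

def atomic_write(path, data):
    """Replace path's contents so that readers see either the old
    contents or the complete new contents, never a mix."""
    tmp = path + ".tmp"  # illustrative naming; real code should be careful here
    fd = os.open(tmp, os.O_CREAT | os.O_WRONLY | os.O_TRUNC, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)      # push the data to disk before the rename
    finally:
        os.close(fd)
    os.rename(tmp, path)  # POSIX guarantees the replacement is atomic

# Demonstration: overwrite a file and check only the new contents survive.
_dir = tempfile.mkdtemp()
_path = os.path.join(_dir, "config")
atomic_write(_path, b"old contents")
atomic_write(_path, b"new contents")
with open(_path, "rb") as f:
    result = f.read()
leftover_tmp = os.path.exists(_path + ".tmp")
os.unlink(_path)
os.rmdir(_dir)
```

The fsync() before the rename is the step at the heart of the O_PONIES debate: the rename itself is atomic with respect to concurrent readers, but without the fsync() the ordering across a crash is up to the file system.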
POSIX v. reality: A position on O_PONIES
Posted Sep 12, 2009 3:50 UTC (Sat) by cras (guest, #7000) [Link]
it's quite likely that it'll break assumptions made by some programs. Not all programs close() a file once they're done with it; after all, there are fewer syscalls to make if you want to append a few lines to a file you already wrote once.
POSIX v. reality: A position on O_PONIES
Posted Sep 14, 2009 18:20 UTC (Mon) by spitzak (guest, #4593) [Link]
That said, I'm fairly certain that any lseek() on the file could serve as a trigger that this behavior is not wanted, and that the partial file should become visible at that point. Of course there is no guarantee until close() is called...
This has no effect on pipes. So at absolute worst, you can write a logfile writer: instead of "foo > logfile" you write "foo | write-old-style logfile".
I very much believe that the improvements to atomicity from this so vastly outweigh any incompatibility problems that the latter are irrelevant.
POSIX v. reality: A position on O_PONIES
Posted Sep 13, 2009 19:40 UTC (Sun) by njs (subscriber, #40338) [Link]
POSIX v. reality: A position on O_PONIES
Posted Sep 14, 2009 18:15 UTC (Mon) by spitzak (guest, #4593) [Link]
It is also possible that the file could revert to the previous behavior after a certain size is reached.
I still feel the benefits outweigh any compatibility problems. By far the majority of programs writing files with creat() act as though it works exactly as I describe.
Atomicity
Posted Sep 17, 2009 8:17 UTC (Thu) by Nicolas.Boulay (guest, #59722) [Link]
rename() is a kind of trick to minimize the problem of an empty file after a power failure.
But what an application writer really wants is a fast file system that provides _atomicity_: that means he wants either the previous file state or the new content as of the last write(), and nothing else.
Instead of fsync(), I think we really need an fdone(), which would mean "wait for the transaction to complete" rather than "flush everything quickly".
If fdone() is too slow, I could use threads; if fdone() takes time, it's for bandwidth optimization. One great Linux optimization for systems without important data is to map fsync() to a no-op; then everything flies :)
Is it costly to get the behavior of open()/write()/rename() for a single write()?
Atomicity
Posted Sep 25, 2009 3:39 UTC (Fri) by xoddam (subscriber, #2322) [Link]
Rename is *not* a 'kind of trick'. By specification, it is guaranteed to be atomic in the face of concurrent readers on a working system. Unfortunately the specification has nothing to say about it with respect to unclean shutdown.
Extending the atomicity of rename() so that it still applies in the face of a successful recovery (such as a journal replay) after an unclean shutdown is perfectly logical.
Atomicity
Posted Oct 26, 2009 10:09 UTC (Mon) by Nicolas.Boulay (guest, #59722) [Link]
KB is OK, MB is not.
That's typical in any database work. In that case, rename() has no use.
Atomicity
Posted Oct 30, 2009 4:30 UTC (Fri) by xoddam (subscriber, #2322) [Link]
Databases traditionally use very large files because their implementors have chosen to re-implement filesystem functionality at a low level for performance reasons.
Most often they use their own journalling implementations and fsync(). This is of course legitimate. But using filesystem-level rename to provide atomicity would also be perfectly reasonable.
The size of the renamed and replaced file is an implementation detail only. Rename doesn't impose a requirement to copy large hunks of data only to throw them away. The unit of replacement might be a btree node, for example.
Nothing forces an implementor to use large files for any particular purpose.