
Barriers and journaling filesystems

By Jonathan Corbet
May 21, 2008
Journaling filesystems come with a big promise: they free system administrators from the need to worry about disk corruption resulting from system crashes. It is, in fact, not even necessary to run a filesystem integrity checker in such situations. The real world, of course, is a little messier than that. As a recent discussion shows, it may be even messier than many of us thought, with the integrity promises of journaling filesystems being traded off against performance.

A filesystem like ext3 works by maintaining a journal on a dedicated portion of the disk. Whenever a set of filesystem metadata changes are to be made, they are first written to the journal - without changing the rest of the filesystem. Once all of those changes have been journaled, a "commit record" is added to the journal to indicate that everything else there is valid. Only after the journal transaction has been committed in this fashion can the kernel do the real metadata writes at its leisure; should the system crash in the middle, the information needed to safely finish the job can be found in the journal. There will be no filesystem corruption caused by a partial metadata update.

There is a hitch, though: the filesystem code must, before writing the commit record, be absolutely sure that all of the transaction's information has made it to the journal. Just doing the writes in the proper order is insufficient; contemporary drives maintain large internal caches and will reorder operations for better performance. So the filesystem must explicitly instruct the disk to get all of the journal data onto the media before writing the commit record; if the commit record gets written first, the journal may be corrupted. The kernel's block I/O subsystem makes this capability available through the use of barriers; in essence, a barrier forbids the writing of any blocks after the barrier until all blocks written before the barrier are committed to the media. By using barriers, filesystems can make sure that their on-disk structures remain consistent at all times.

There is another hitch: the ext3 and ext4 filesystems, by default, do not use barriers. The option is there, but, unless the administrator has explicitly requested the use of barriers, these filesystems operate without them - though some distributions (notably SUSE) change that default. Eric Sandeen recently decided that this was not the best situation, so he submitted a patch changing the default for ext3 and ext4. That's when the discussion started.

Andrew Morton's response tells a lot about why this default is set the way it is:

Last time this came up lots of workloads slowed down by 30% so I dropped the patches in horror. I just don't think we can quietly go and slow everyone's machines down by this much...

There are no happy solutions here, and I'm inclined to let this dog remain asleep and continue to leave it up to distributors to decide what their default should be.

So barriers are disabled by default because they have a serious impact on performance. And, beyond that, the fact is that people get away with running their filesystems without using barriers. Reports of ext3 filesystem corruption are few and far between.

It turns out that the "getting away with it" factor is not just luck. Ted Ts'o explains what's going on: the journal on ext3/ext4 filesystems is normally contiguous on the physical media. The filesystem code tries to create it that way, and, since the journal is normally created at the same time as the filesystem itself, contiguous space is easy to come by. Keeping the journal together will be good for performance, but it also helps to prevent reordering. In normal usage, the commit record will land on the block just after the rest of the journal data, so there is no reason for the drive to reorder things. The commit record will naturally be written just after all of the other journal log data has made it to the media.

That said, nobody is foolish enough to claim that things will always happen that way. Disk drives have a certain well-documented tendency to stop cooperating at inopportune times. Beyond that, the journal is essentially a circular buffer; when a transaction wraps off the end, the commit record may be on an earlier block than some of the journal data. And so on. So the potential for corruption is always there; in fact, Chris Mason has a torture-test program which can make it happen fairly reliably. There can be no doubt that running without barriers is less safe than using them.

Anybody can turn on barriers if they are willing to take the performance hit. Unless, of course, their filesystem is based on an LVM volume (as certain distributions do by default); it turns out that the device mapper code does not pass through or honor barriers. But, for everybody else, it would be nice if that performance cost could be reduced somewhat. And it seems that might be possible.

The current ext3 code - when barriers are enabled - performs a sequence of operations like this for each transaction:

  1. The log blocks are written to the journal.
  2. A barrier operation is performed.
  3. The commit record is written.
  4. Another barrier is executed.
  5. Metadata writes begin at some later point.
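
Expressed as a rough sketch (this is not the actual ext3/jbd code; the helper functions below are hypothetical placeholders standing in for the real buffer and barrier plumbing), the sequence looks something like this:

/* Hypothetical helpers standing in for the real journaling plumbing. */
struct transaction;
void write_journal_blocks(struct transaction *t);
void issue_barrier(void);
void write_commit_record(struct transaction *t);
void queue_metadata_writeback(struct transaction *t);

/* One transaction commit with barriers enabled; the steps match the list above. */
static void commit_transaction(struct transaction *t)
{
        write_journal_blocks(t);      /* 1: log blocks go to the journal */
        issue_barrier();              /* 2: they must reach the media first */
        write_commit_record(t);       /* 3: commit record marks them valid */
        issue_barrier();              /* 4: commit record must reach media too */
        queue_metadata_writeback(t);  /* 5: real metadata writes happen later */
}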

On ext4, the first barrier (step 2) can be omitted because the ext4 filesystem supports checksums on the journal. If the journal log data and the commit record are reordered, and if the operation is interrupted by a crash, the journal's checksum will not match the one stored in the commit record and the transaction will be discarded. Chris Mason suggests that it would be "mostly safe" to omit that barrier with ext3 as well, with a possible exception when the journal wraps around.
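
A minimal sketch of that replay-time check, with hypothetical names (jbd2's real commit-block format and checksum code differ in detail):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical commit record layout and checksum helper for illustration only. */
struct commit_record {
        uint32_t checksum;        /* checksum over the transaction's log blocks */
};

uint32_t checksum_of(const void *log_blocks, size_t len);   /* assumed helper */

/* During journal replay: a transaction whose stored checksum does not match
 * the log blocks found on disk was interrupted before it was fully written,
 * so it is discarded instead of being replayed. */
static int transaction_is_valid(const struct commit_record *commit,
                                const void *log_blocks, size_t len)
{
        return checksum_of(log_blocks, len) == commit->checksum;
}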

Another idea for making things faster is to defer barrier operations when possible. If there is no pressing need to flush things out, a few transactions can be built up in the journal and all shoved out with a single barrier. There is also some potential for improvement by carefully ordering operations so that barriers (which are normally implemented as "flush all outstanding operations to media" requests) do not force the writing of blocks which do not have specific ordering requirements.
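
As a rough sketch of the batching idea (again with hypothetical names; this is not a description of the actual jbd code):

/* Hypothetical journal state and helpers for illustrating barrier batching. */
struct journal;
int  pending_transactions(struct journal *j);
int  flush_requested(struct journal *j);   /* fsync(), timer, memory pressure... */
void write_pending_log_blocks(struct journal *j);
void write_pending_commit_records(struct journal *j);
void issue_barrier(void);

#define BATCH_LIMIT 8   /* arbitrary illustrative threshold */

/* Let several transactions accumulate in the journal, then push them all out
 * behind a single pair of barriers instead of one pair per transaction. */
static void maybe_commit(struct journal *j)
{
        if (pending_transactions(j) < BATCH_LIMIT && !flush_requested(j))
                return;                    /* keep accumulating */

        write_pending_log_blocks(j);
        issue_barrier();                   /* one barrier covers the whole batch */
        write_pending_commit_records(j);
        issue_barrier();
}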

In summary: it looks like the time has come to figure out how to make the cost of barriers palatable. Ted Ts'o seems to feel that way:

I think we have to enable barriers for ext3/4, and then work to improve the overhead in ext4/jbd2. It's probably true that the vast majority of systems don't run under conditions similar to what Chris used to demonstrate the problem, but the default has to be filesystem safety.

Your editor's sense is that this particular dog is now wide awake and is likely to bark for some time. That may disturb some of the neighbors, but it's better than letting somebody get bitten later on.




Barriers and journaling filesystems

Posted May 21, 2008 17:32 UTC (Wed) by allesfresser (guest, #216) [Link]

So, for those that wish to enable them, barriers apparently are turned on by giving
"barrier=1" as an option to the mount(8) command, either on the command line or in /etc/fstab:

mount -t ext3 -o barrier=1 <device> <mount point>

(along with whatever other options you desire)
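
The corresponding /etc/fstab entry would look something like this (the device, mount point and other fields here are placeholders; only the barrier=1 option matters):

/dev/sdXN  /mnt/data  ext3  defaults,barrier=1  0  2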

If this is in error, please feel free to correct.

Enabling barriers

Posted May 21, 2008 17:45 UTC (Wed) by corbet (editor, #1) [Link]

Sigh. I actually meant to put that into the article, but it slipped my mind in the Wednesday haze. Sorry. Yes, "-o barrier=1" is what you want. The mount(8) man page, it seems, is a little out of date...

Enabling barriers

Posted May 21, 2008 20:13 UTC (Wed) by pr1268 (subscriber, #24648) [Link]

Since which Kernel release has -o barrier=1 been an available option for mount(8)? Although I'm certain the kernel I'm running now (vanilla 2.6.25.4) would support barriers, I'm just a little curious how long this has been available. Thanks!

Enabling barriers

Posted May 21, 2008 20:17 UTC (Wed) by corbet (editor, #1) [Link]

Some quick grepping on my kernel tree disk suggests that ext3 got the barrier option in 2.6.9.

Enabling barriers since 2.6.9 and hdparm question

Posted May 21, 2008 20:36 UTC (Wed) by pr1268 (subscriber, #24648) [Link]

Ahhh. The "big" one (kernel release, that is). Thank you for the swift reply (gee, that was quick!). :-)

Now another question for anyone: What is the effect of toggling the disk drive's write-caching feature with hdparm -W {0,1} /dev/devicename? The man page for hdparm(8) doesn't say anything about that, and just now querying my four disks (two IDEs and two SATAs) shows this enabled for all four. I've never fiddled with this setting, but I'm convinced it runs automatically based on vendor/drive/capability (all four are late-model Seagate drives FWIW). Thanks again!

Enabling barriers since 2.6.9 and hdparm question

Posted May 22, 2008 11:33 UTC (Thu) by zmi (guest, #4829) [Link]

> What is the effect of toggling the disk drive's write-caching feature

Turning off write caches of disks and the RAID controller slows writing 
badly. Examples of a recent HP ML350 server with 3x 146GB 10k SAS disks:
- disk cache and RAID write cache OFF: 65MB/s
- disk cache and RAID write cache ON: 145MB/s
and that is on _sequential_ writes. We didn't measure _random_ writes, but 
the system felt like an old 386 then. Really unusable.

Enabling barriers since 2.6.9 and hdparm question

Posted May 22, 2008 13:40 UTC (Thu) by NAR (subscriber, #1313) [Link]

Interestingly we've had a different experience - turning off the write cache actually improved performance. I'm not sure why, but the whole system was somewhat complicated and we had data loss. Turning off the write cache solved the data loss problem and had this surprising side-effect too.

What about LVM?

Posted May 21, 2008 18:05 UTC (Wed) by robertlemmen (guest, #12997) [Link]

so is there any work going on towards making the barriers work on lvm volumes? 

regards  robert

What about LVM?

Posted May 22, 2008 2:49 UTC (Thu) by snitm (guest, #4031) [Link]

Yes, but only single disk DM targets (e.g. linear), see: http://lkml.org/lkml/2008/2/15/125

Unfortunately, this patch hasn't been pushed upstream and the DM maintainer (agk) hasn't
really commented on when it might.

What about LVM?

Posted May 27, 2008 18:06 UTC (Tue) by shapr (subscriber, #9077) [Link]

Sad, I now want to rip out my LVM setup and go back to straight-up ext3.

I've had some random unexplained file corruption at power-loss, even with the safety-first features of ext3 enabled. I wonder if ext3 on LVM would explain it.

What about LVM?

Posted May 27, 2008 19:32 UTC (Tue) by nix (subscriber, #2304) [Link]

LVM without snapshots only uses the linear target.

What about LVM?

Posted Aug 29, 2019 16:07 UTC (Thu) by daveburton (guest, #134115) [Link]

Eleven years later... is this fixed? Or does LVM still preclude use of barriers?

What about LVM?

Posted Aug 30, 2019 5:17 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

Quick googling reveals support was added in 2.6.33 in 2009:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/...

Barriers and journaling filesystems

Posted May 21, 2008 18:17 UTC (Wed) by raven667 (subscriber, #5198) [Link]

This is all well and good but it would, I think, be most effective if drives had their own battery-backed write-back cache which can be pinned by the OS to the journal. If the feature can be made persistent, by committing the data to flash or something on drive shutdown, then the data doesn't even need to be committed to the disk. It seems to me that this kind of design could remove the inherent penalties associated with journaling using proper write barriers. This kind of technology could be used for transactional databases as well by making the transaction record work at memory and interface speed rather than be limited by the rotational speed of the drives. This should reduce contention on the platters by removing one of the more constant sources of activity.

Barriers and journaling filesystems

Posted May 21, 2008 18:42 UTC (Wed) by jwb (guest, #15467) [Link]

Or you could just use a CF device for your journal.

Barriers and journaling filesystems

Posted May 22, 2008 5:05 UTC (Thu) by jzbiciak (guest, #5246) [Link]

Except that flash has a limited number of rewrites before it's toast?

Barriers and journaling filesystems

Posted May 22, 2008 9:39 UTC (Thu) by ekj (guest, #1524) [Link]

Would you please stop whipping this long-dead horse ?

Typical flash-memory today is rated for 1M writes. There is internal wear-leveling that ensures that this is the number of writes to the ENTIRE module (i.e. it is impossible to wear out flash faster by writing repeatedly to the same location).

So, even a SMALL 1GB flash-memory requires the writing of a minimum of 1000TB worth of data
before it'll start failing. (or another order of magnitude if it's a 1M flash-module)

To put this in perspective, if you are writing 24x365 to the module at a constant speed of
1MB/s, then you'll wear it out in 31 years. If you write constantly at 10MB/s, you wear it out
in 3 years.
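
(Spelling out that arithmetic: 1 GB x 1,000,000 cycles = 10^15 bytes, roughly 1000 TB of total writes; at 1 MB/s that is 10^9 seconds, or about 31.7 years, and at 10 MB/s it is about 3.2 years.)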

For most uses, this simply isn't a concern. Most computers, even file-servers, don't write constantly to disk 24x365, and even for those that do, the journal is metadata-only by default, so only writes that change the filesystem structure generate load on the journal at ALL.

For those extremely rare servers that DO a gargantuan amount of writes that change the filesystem structure, there's a simple cure: buy a slightly LARGER flash-module.

Because of the wear-leveling a flash-module that is 4 times the size will sustain 4 times the
amount (measured in GB) of data written before the cells start reaching their limit.

A server that -MUST- journal more than a PETABYTE of data before the end of its lifespan can also afford to buy 16GB or 64GB of flash-storage rather than a paltry 1GB.

Barriers and journaling filesystems

Posted May 22, 2008 11:51 UTC (Thu) by SimonKagstrom (guest, #49801) [Link]

You are right, but it's also important to look up the specs of the CompactFlash card before buying it. How they actually do wear-levelling differs, and I've heard of brands which perform wear-levelling only within regions of the disk - not over the entire disk.

In such cases, you can still wear out the flash in "reasonable" times. There are also some brands (I know of SiliconSystems) which allow reading CF-internal wear-levelling data (spare blocks, number of erases etc), although I'm not sure if there is any non-proprietary software to actually use this.

// Simon

Barriers and journaling filesystems

Posted May 22, 2008 12:44 UTC (Thu) by joern (guest, #22392) [Link]

> Typical flash-memory today is rated for 1M writes. There is internal wear-leveling that ensures that this is the number of writes to the ENTIRE module (i.e. it is impossible to wear out flash faster by writing repeatedly to the same location).

This happens to be wrong on both counts.  Typical flash-memory today is rated at either 10k or 100k writes - the lower number being for MLC flashes, which are cheaper and therefore can be expected in your average cheap medium from Fry's.

Far worse, the normal wear leveling scheme does _not_ cover the complete device.  It covers
some part of it, typically 16M or so.  The next part is also wear-leveled in itself, but not
wrt. any other part of the device.  Therefore having a really hot area like a 32M journal is
comparable to disabling the wear leveling for the device completely.  After 10k journal wraps
you're depending on pure luck.

The horse may be locked away in a broken shed, but it's still kicking.

[ To be fair, some expensive devices are far far better.  Some expensive devices are just that
- more expensive.  So do your own QA to be sure. ]

Barriers and journaling filesystems

Posted May 22, 2008 15:13 UTC (Thu) by drag (guest, #31333) [Link]

> This happens to be wrong on both counts.  Typical flash-memory today is rated at either 10k or 100k writes - the lower number being for MLC flashes, which are cheaper and therefore can be expected in your average cheap medium from Fry's.

Well you wouldn't buy the cheapest thing you can find when you want to use it in a server, right? So you make sure you get the 'high endurance' versions of the drives with SLC NAND chips and make sure that you go through a cycle of swapping it out every year or so.

The thing is that people are actually using flash to help speed up disk access in their datacenters.  This isn't the first time I've heard of people doing this sort of thing.

Personally I work with a lot of flash media. The cheaper stuff. I haven't been here long, but I've talked to people that have been working here for 20 years. (of course they haven't been using CF cards that long). Nobody has yet seen any sort of flash media failing that they can remember. The actual physical connections (the plastic holes for the pins get malformed) get all screwed up before any actual data ever gets corrupted.


Where do you get your numbers for the 16M wear leveling? Typically you're dealing with media that is at least 512 megs and soon you'll have a hard time finding new stuff that is under 2 gigs. I am doubtful that only 16 megs out of 2 gigs is going to be wear leveled.

Barriers and journaling filesystems

Posted May 22, 2008 15:24 UTC (Thu) by joern (guest, #22392) [Link]

> Personally I work with a lot of flash media. The cheaper stuff. I haven't been here long, but I've talked to people that have been working here for 20 years. (of course they haven't been using CF cards that long). Nobody has yet seen any sort of flash media failing that they can remember. The actual physical connections (the plastic holes for the pins get malformed) get all screwed up before any actual data ever gets corrupted.

I know of reports.  Ext3 on CF seems to be fairly good at corrupting stuff, particularly data
that is stored close to the journal.  Whether the cards in question used SLC or MLC I don't
know.  The pesky thing about them is that vendors hardly ever publish information at all.

> Where do you get your numbers for the 16M wear leveling? Typically you're dealing with media that is at least 512 megs and soon you'll have a hard time finding new stuff that is under 2 gigs. I am doubtful that only 16 megs out of 2 gigs is going to be wear leveled.

http://www.linuxconf.eu/2007/papers/Engel.pdf
Mainly based on the smartmedia spec and some reverse engineering.  I didn't do the latter
myself, though.

Barriers and journaling filesystems

Posted May 23, 2008 2:21 UTC (Fri) by drag (guest, #31333) [Link]

> Whether the cards in question used SLC or MLC I don't know.  The pesky thing about them is
that vendors hardly ever publish information at all.


Ya. There is only a handful of people that actually make flash media. Maybe four companies in
total, I forget. 

Everybody that sells that flash media to end users uses a hodgepodge of different sources for different products. Cheaper folks will mix in different flashes for different production runs of the same product... We ran into this problem with Kingston shipping devices that came in sizes anywhere from 470 megs on up for their 512 meg products.

So I'd stay far away from vendors that don't publish details about their products for anything
serious. 

Barriers and journaling filesystems

Posted Jun 12, 2008 14:42 UTC (Thu) by salimma (subscriber, #34460) [Link]

Not commenting on whether the 16MB information is correct or not, but grandparent's point is not that only 16MB gets wear-leveled; it's that for the purpose of wear-leveling, the drive is treated as a series of 16MB blocks, each of which is wear-leveled within itself.

(The wear-leveling circuitry would then be much simpler -- imagine a parallel series of, say, 8-bit adders, compared to a 64-bit adder made up of 8-bit adders)

Barriers and journaling filesystems

Posted May 22, 2008 22:30 UTC (Thu) by brouhaha (subscriber, #1698) [Link]

> i.e. it is impossible to wear out flash faster by writing repeatedly to the same location

Actually that's not true. Many (possibly most) large flash memory cards have the wear leveling done in sections, so it's possible to wear out one section before the others.

The vendors tend to be fairly secretive about the details of their wear-leveling algorithms.

Barriers and journaling filesystems

Posted May 23, 2008 2:29 UTC (Fri) by drag (guest, #31333) [Link]

What would be great would be to get flash manufacturers to, optionally, allow the OS to access the flash media more directly as an MTD, which reflects the true nature of flash media.

This way Linux folks can take advantage of MTD-specific file systems that can handle wear leveling in a very effective and open manner. (and probably get better performance, to boot)

(running MTD file systems on top of Block-to-MTD emulation in software on top of MTD-to-Block emulation in hardware on top of MTD flash seems self-defeating..)

This way the 'industrial' flash people using flash for performance reasons on Unix systems can get the most benefit, while their Windows-using counterparts can continue to use that stuff to speed up swap file access and application pre-caching in Vista using the block emulation hardware.

wearing out Flash memory

Posted May 23, 2008 3:07 UTC (Fri) by sbishop (guest, #33061) [Link]

Part of the trouble is that people confuse Flash memory with devices implemented using Flash.
A location within a Flash memory chip, for example, will certainly wear out faster if it's
written to repeatedly.  The chips themselves do absolutely no wear leveling.  But, of course,
it would be insane to build a Flash-based device without built-in wear-leveling logic and CRC
checks, which may have been the reason for the "it is impossible to wear out flash faster by
writing repeatedly" comment.

By the way, I work for a memory manufacturer, and it's my job to do reliability testing on
this stuff.  My co-workers and I have all come to hate Flash.  It is expected that the chips
will wear out, and transient failures are okay.  The controllers are expected to deal with
these issues; it's the nature of the beast.  So what does "working" mean?!  Oh, and the state machine of each one of these *#$%!@ things is different.  I miss DRAM...

Barriers and journaling filesystems

Posted May 29, 2008 18:22 UTC (Thu) by mcortese (guest, #52099) [Link]

> Or you could just use a CF device for your journal.

But then how do you guarantee the synchronization between the data written to the HD and the journal written to the flash? The whole issue here is to avoid any reordering that would spoil the journaling strategy. You are merely moving the problem from flushing within a device, to syncing two devices!

Barriers and journaling filesystems

Posted May 21, 2008 20:15 UTC (Wed) by evgeny (subscriber, #774) [Link]

> it would be I think most effective if drives had their own battery backed write-back cache

This would be as controversial as using battery modules for some RAID controllers. Many folks
don't like this idea - if we talk reliability, a UPS is a must. Then an extra battery could
only be a source of extra trouble.

Barriers and journaling filesystems

Posted May 21, 2008 22:28 UTC (Wed) by iabervon (subscriber, #722) [Link]

Having a UPS is great until the cat turns it off or the battery ages to the point where power
fluctuations cause it to reset or somebody turns off the computer's power switch or the power
supply burns out. Even if the outside power situation is well-protected, there's value to
having the drive store enough power to finish with its buffers and spin down carefully and
such.

Barriers and journaling filesystems

Posted May 22, 2008 18:42 UTC (Thu) by amikins (guest, #451) [Link]

> Having a UPS is great until the cat turns it off

I'm NOT the only one that's been plagued by this? I've since learned to make cat-proof covers
for my UPS buttons...

Barriers and journaling filesystems

Posted May 22, 2008 20:03 UTC (Thu) by v13 (guest, #42355) [Link]

Why would anyone not like the battery backed cache?

Have you ever considered the effects of not having to do a disk write 
when doing synchronous disk operations? Journals and databases perform a 
*lot* faster on BB controllers. (based on my experience).

Consider not having to do any safety-related writes to the disk at all. Even barriers have zero overhead.

Barriers and journaling filesystems

Posted May 21, 2008 18:54 UTC (Wed) by magila (guest, #49627) [Link]

> There is also some potential for improvement by carefully ordering operations so that
barriers (which are normally implemented as "flush all outstanding operations to media"
requests) do not force the writing of blocks which do not have specific ordering requirements.

SCSI actually provides a convenient mechanism for doing this. You can declare a command as
being "Ordered" which will prevent it from being reordered relative to any other commands
which have also been declared as such. Unfortunately AFAIK ATA provides no such niceties. I
also don't know how widely supported this feature is by the various SCSI HBA drivers (I
suspect not very) but it is an option to explore.

Barriers and journaling filesystems

Posted Jul 22, 2014 22:44 UTC (Tue) by zlynx (guest, #2285) [Link]

Some SATA devices support FUA (Force Unit Access).

Barriers and journaling filesystems

Posted May 21, 2008 19:50 UTC (Wed) by jengelh (subscriber, #33263) [Link]

XFS has barriers enabled by default since 2.6.17, and it still seems to perform well (on the last server where I put it without disabling barriers).

^1 http://lkml.org/lkml/2006/5/22/278

XFS / LVM

Posted May 21, 2008 20:29 UTC (Wed) by tarvin (guest, #4412) [Link]

Do XFS' barriers work if the file system is created on an LVM2 logical volume?

XFS / LVM

Posted May 21, 2008 20:47 UTC (Wed) by nix (subscriber, #2304) [Link]

device-mapper doesn't pass the barriers down, so no. (Actually it asserts 
that it doesn't support barriers, which is safer: the fs can take that 
into account rather than firing off barriers which are just ignored).

(Maybe this has changed recently, but this was the state of affairs a 
month or so back.)

XFS / LVM

Posted May 21, 2008 21:10 UTC (Wed) by jengelh (subscriber, #33263) [Link]

Correct.

[ 51.404922] Filesystem "dm-0": Disabling barriers, not supported by the underlying device
[27138.731115] Filesystem "dm-1": Disabling barriers, not supported by the underlying device

Even though I know, from before I started using dm-crypt, that only one of sda and sdb lacks barrier support at the hardware level.

Barriers and journaling filesystems

Posted May 22, 2008 20:29 UTC (Thu) by mikachu (guest, #5333) [Link]

Disabling barriers on xfs increases performance very very much in some situations, especially
when deleting a directory tree with many small files, eg the linux-2.6 tree. With barriers it
takes something like 2-3 minutes and without barriers around 20 seconds. (These numbers are
from my memory).

Battery backed caches / 'hdparm -W 0'

Posted May 21, 2008 19:58 UTC (Wed) by tarvin (guest, #4412) [Link]

On a system with ext3 filesystems and barrier=0 (because of LVM usage), am I at risk in the
following situations?

a. The system's storage controller has a battery
   backed cache.

b. The system uses plain local disks, but 
   'hdparm -W 0 <device>'
   is run at boot time.

Battery backed caches / 'hdparm -W 0'

Posted May 22, 2008 12:34 UTC (Thu) by i3839 (guest, #31386) [Link]

Disabling the write cache avoids this whole barrier problem, because writes can't be reordered then. Performance will be much worse too, so it's better to enable the write cache and enable barriers. (Because writing with the write cache enabled is faster, the chance you'll lose data while writing is also smaller. ;-)

'hdparm -W 0'

Posted May 22, 2008 19:10 UTC (Thu) by tarvin (guest, #4412) [Link]

I'm glad to hear that turning off the write cache helps. I've done that for years on hosts
which house data which are important to me, if the hosts don't have battery-backed cache.

No doubt, I'm probably losing performance, but I've never found it unacceptable.

'hdparm -W 0'

Posted May 24, 2008 20:13 UTC (Sat) by giraffedata (guest, #1954) [Link]

To be precise: ordering of write commands, ordering of hardening to disk, and write caching (in the drive) are all separate things. With write caching turned off, the drive can still reorder the write commands and write the data to the platters in random order. But Linux will always know when the data has hit the platter and won't initiate the commit record until the journal data it covers is on the platter.

Linux can have multiple writes in progress at the disk drive (sent to the drive but the drive hasn't responded) at the same time, independent of write caching.

Barriers and journaling filesystems

Posted May 21, 2008 20:04 UTC (Wed) by nix (subscriber, #2304) [Link]

Of course, md also doesn't support barriers for raid5/raid6, which means 
that using RAID actually *reduces* reliability in some respects. This is 
somewhat irritating, but also hard to fix: barriering all the constituent 
devices when a barrier comes in from the higher level isn't good enough on 
raid5, because if one drive loses power before the other you still have 
data corruption.

(In practice this too is rare enough that it takes explicit torture tests 
to trip it unless you're very unlucky.)

Barriers and journaling filesystems

Posted May 24, 2008 9:10 UTC (Sat) by Xman (guest, #10620) [Link]

I don't see the problem with the power failure scenario. If you encounter a barrier at an MD
device, I'd think you would essentially not submit IO to the underlying devices until such
time as all I/O's prior to the barrier had completed. In that scenario, at the first sign of a
power failure affecting *any* of the drives, I'd start reporting I/O errors and be free of any
guarantees whatsoever.

Barriers and journaling filesystems

Posted May 21, 2008 20:37 UTC (Wed) by hpp (subscriber, #4756) [Link]

I manage transactional systems for a living (first IBM MQ, now relational databases including
Sybase, DB2 and Oracle).  I am sorry to say the current filesystem approach is still somewhat
optimistic, even in the presence of barriers; and that commercial transactional systems do
additional work that ext3/ext4 do not yet do, but that can be required for some disk failures.
Sadly, implementing this correctly may make the filesystems even slower.

One issue has to do with disk failures (say on a power failure) and a partial write.  If transaction A is committed and uses a partial disk block, then transaction B cannot safely be written to the same disk block - a power failure could lead to the block being partially written and losing the data from committed transaction A.  On the other hand, always using a new disk block for each transaction would eat up log space really quickly.  One solution is to use ping-pong blocks (three are required if transactions can span a block).

Another issue has to do with when transactions are written to disk.  As the article indicates,
delaying writes may help if multiple concurrent and independent transactions can be written to
disk at the same time; but waiting to write data to disk is bad if one application is doing
most of the writes, as it would not run as quickly as possible.  Some databases allow this to
be tuned on the fly (e.g. mincommit in DB2), but that is not desirable for a filesystem; the
kernel should use heuristics to figure this out as a workload is running.

Next, you want to carefully tune: the size of the log buffer in memory and the  total size of
the transaction log on disk (i.e. when do we wrap round).  In databases you also care about
how much log space a single transaction can use and how much other concurrent log activity
(from other transactions) may occur between activity and commit, but until we expose
filesystem transactions to userspace we can safely ignore this.

Summary: filesystem implementors ought to talk to database implementors. I'm sure both groups
can teach each other a lot, but in this area databases are still quite a bit ahead of
ext3/ext4.

Barriers and journaling filesystems

Posted May 22, 2008 10:06 UTC (Thu) by Fats (guest, #14882) [Link]

"Summary: filesystem implementors ought to talk to database implementors. I'm sure both groups can teach each other a lot, but in this area databases are still quote a bit ahead of ext3/ext4."

The purpose of a journal is not to be sure that everything is written to disk when you do a write. It's to be sure that the file system is always in a consistent state, so you don't need a very expensive fsck and don't risk losing other data than what was being written. If you need to be sure that something is written to disk you have to use the fsync function in your code.
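
As a minimal userspace sketch of that last point (plain POSIX calls; the path and record format are placeholders):

#include <fcntl.h>
#include <unistd.h>

/* Append a record and make sure it has actually reached stable storage:
 * write() alone may leave the data in the page cache and in the drive's
 * write cache; fsync() asks the kernel (and, with working barriers/flushes,
 * the drive) to push it to the media before returning. */
int append_record(const char *path, const void *buf, size_t len)
{
        int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
        if (fd < 0)
                return -1;
        if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
                close(fd);
                return -1;
        }
        return close(fd);
}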

greets,
Staf.

Barriers and journaling filesystems

Posted May 24, 2008 9:14 UTC (Sat) by Xman (guest, #10620) [Link]

fsync *still* isn't going to help you much if I/O reordering is allowed.

Barriers and journaling filesystems

Posted May 24, 2008 9:30 UTC (Sat) by Fats (guest, #14882) [Link]

AFAIK fsync explicitly tells the hard drive to perform all outstanding IOs and then returns.
So, of course, if your hard drive lies to you, you are screwed.
Don't know if LVM is broken here also.

Barriers and journaling filesystems

Posted May 24, 2008 18:00 UTC (Sat) by Xman (guest, #10620) [Link]

fsync will block until the outstanding requests have been sync'd to disk, but it doesn't
guarantee that subsequent I/O's to the same fd won't potentially also get completed, and
potentially ahead of the I/O's submitted prior to the fsync. In fact it can't make such
guarantees without functioning barriers.

Barriers and journaling filesystems

Posted May 24, 2008 19:48 UTC (Sat) by Fats (guest, #14882) [Link]

Sure, my comment was in response to hpp and what I wanted to say is that user land code has to take care of transactions as defined in relational databases, and that fsync is the tool to use for this.
A journaled file system only takes care that the file system stays in a consistent state, so no expensive fsck (with its possible loss of data) is needed. Files open for writing may lose some of the last written data if no fsync was performed. To keep the file system consistent, barriers are used to guarantee a certain order of the writes. This limited guarantee allows file systems to be faster than relational databases.

greets,
Staf.

Barriers and journaling filesystems

Posted May 23, 2008 17:05 UTC (Fri) by jlokier (guest, #52227) [Link]

Journalling commits are more like "asynchronous delayed commits" from a database point of view, when fsync() isn't used.  They protect the integrity of the filesystem structure itself, and are not used for application-level transactional changes.  Sometimes that weaker kind of commit is fine, and the performance gain is large.

fsync() makes it more like a standard database commit, where the data is supposed to be secure
before the call returns.

This is one area where traditional databases can learn from filesystems.  There are some
things where you don't actually need a database to commit quickly - that can take as long as
it needs to batch and optimise I/O.  All you need then is consistent rollback.  For example,
databases which hold cached calculations are like this.

Your point about partial writes on power failure and not using overlapping blocks (will
sectors do?) is valid, and I would like to know more about what the database professionals
have discovered about what exactly is and isn't safe.  For example, can failure to write
sector N+1 corrupt sector N written a long time ago?  Is the "failure block size" larger than
a single sector when doing O_DIRECT (when that really works)?  Is it larger than a
filesystem/blockdev block size when not using O_DIRECT?  What's the reason Oracle uses
different journal block sizes on different OSes?

I think the filesystem implementors do know about that effect.  Journal entries are finished
with a commit block, to isolate the commit into its own block, which is not touched by the
next transaction.  I think your two/three ping-pong blocks correspond to the journal's finite
wraparound length on a filesystem - do say more if that's not so.

Lying drives

Posted May 21, 2008 20:50 UTC (Wed) by ncm (guest, #165) [Link]

Then, of course, we have drives that claim to support barrier operations, but lie about it whenever traffic goes up. They look way better in benchmarks. Most likely whatever drive you have on your own system is one of those. Drives that don't lie are low-volume products, so cost a lot more.

My conclusion is that it doesn't matter much what sort of barriers the system asks of a drive; the best you can do is have the OS send blocks to the drive in the right order. Then, your only hope is battery backup, which actually helps only if your system stops writing to disk at least a few seconds before the drive itself loses power. Distressingly many UPS-backed machines don't actually do a controlled shutdown when the battery gets low.

Note that if the head's writing when the power drops, that sector (and possibly several following it) is toast. The notion of using the motor as a generator to keep the electronics sane for a few milliseconds is a fun idea, but not actually implemented in real drives.

Barriers and journaling filesystems

Posted May 22, 2008 10:29 UTC (Thu) by perrymg (subscriber, #39684) [Link]

"if the commit record gets written first, the journal may be corrupted."

Would placing the Journal into its own mirrored device suffice?
(Assuming of course that only one mirror could fail at a time.)

Wouldn't a mirrored Journal perform faster on (re)mount, especially if journal_data was in
use? 

With LVM, an LV has the option to be mirrored (as well as striped); could a separate mirrored LV be used just for the Journal, to avoid concerns about the lack of barriers on LVM?

Barriers and journaling filesystems

Posted May 22, 2008 13:35 UTC (Thu) by perrymg (subscriber, #39684) [Link]

I do see that a Linux system crash would still cause the Journal to become corrupted, mirrored
Journal or not. Just 2 copies of a bad Journal at the point in time of the crash!

But for disk related errors, mirroring at least the Journal would seem to make sense to me.
Mirroring the whole disk would achieve the same thing, but obviously more data would have to
be mirrored.

I guess for Enterprise-level users all disks would be in RAID arrays, and hence their only real fear is that of a Linux system crash.




Barriers and journaling filesystems

Posted May 22, 2008 13:50 UTC (Thu) by perrymg (subscriber, #39684) [Link]

Sorry for the multiple posts, but this topic is fascinating and important.

If enabling barriers has a performance impact, then putting the Journal onto a separate disk
with barriers enabled (Mirrored or not), and leaving the FS data on a disk with barriers
disabled may be a short term compromise for some users?

Barriers and journaling filesystems

Posted May 22, 2008 17:20 UTC (Thu) by jordanb (guest, #45668) [Link]

Why not insert a barrier *only* when the commit record lands before the journaled data on the disk? I can't imagine that happens too often, so only a few barriers would be inserted. It shouldn't have a terrible impact on performance, but it *would* help deal with the edge case of a journal write wrapping around the commit record and leaving the file system inconsistent.

Barriers and journaling filesystems

Posted May 23, 2008 16:55 UTC (Fri) by jlokier (guest, #52227) [Link]

A couple of clarifications.
> ...contiguous space is easy to come by. Keeping the journal together will be good for performance, but it also helps to prevent reordering. In normal usage, the commit record will land on the block just after the rest of the journal data, so there is no reason for the drive to reorder things. The commit record will naturally be written just after all of the other journal log data has made it to the media.

That helps only with the first barrier, before the commit block. There's a better way to eliminate that barrier, which is to put a checksum in the commit block, and ext4 does do that. You still need the second barrier, somewhere after the commit block, because it orders the journal write against writes elsewhere on the disk - those are never contiguous.

> Disabling the write cache avoids this whole barrier problem, because writes can't be reordered then

It's not clear if disabling the write cache is enough, when ext3 is mounted with barrier=0 (the current default). That stops the disk from reordering writes, but the kernel elevator is still able to reorder writes, when barrier=0, before sending them to the disk. Setting barrier=1 has the dual effect of telling the kernel not to reorder requests around barrier writes, and ideally passing that constraint to the disk as well.

> Disabling barriers on xfs increases performance very very much in some situations, especially when deleting a directory tree with many small files, eg the linux-2.6 tree. With barriers it takes something like 2-3 minutes and without barriers around 20 seconds. (These numbers are from my memory).

That suggests a flaw in the way XFS implements deletions. There is no reason to require so many barriers. The only thing which should be able to cause a high rate of barriers is a high rate of fsync() calls (which aren't done in this case) or the journal being too small.

Barriers and journaling filesystems

Posted May 24, 2008 20:31 UTC (Sat) by giraffedata (guest, #1954) [Link]

> It's not clear if disabling the write cache is enough, when ext3 is mounted with barrier=0 (the current default). That stops the disk from reordering writes, but the kernel elevator is still able to reorder writes, when barrier=0, before sending them to the disk. Setting barrier=1 has the dual effect of telling the kernel not to reorder requests around barrier writes, and ideally passing that constraint to the disk as well.

But isn't it impossibly naive for ext3 to assume writes it submits to the block layer get realized on disk in the order submitted? I assume designers weren't that naive and, when working without barriers, ext3 withholds writes from the block layer until every prerequisite write has completed.

The value of barriers is supposed to be that ext3 doesn't have to let the queue run dry, with its attendant throughput slowdown. Ext3 can submit writes for before the commit record, the commit record, and after the commit record, with barriers placed appropriately in the stream, and the block layer will take care of enforcing the required ordering.

The fact that "write completed" doesn't imply the data is persistent across a disk drive power failure (and furthermore the gaining of that persistence isn't in any particular order) is an orthogonal issue. Which code deals with it depends upon whether ext3 uses barriers or not. (And ISTR the block layer doesn't provide any mechanism separate from barriers to deal with it, so if you don't use barriers=1, it doesn't get dealt with).

Barriers and journaling filesystems

Posted May 24, 2008 9:18 UTC (Sat) by Xman (guest, #10620) [Link]

I gotta say, I'm dubious of the notion that most of the time things turn out just fine with the
ordering of journal writes. If that were true, enabling barriers wouldn't impose a 30%
performance penalty (unless they really are being generated far more often than they need be,
in which case, you have two problems ;-).

Barriers and journaling filesystems

Posted May 24, 2008 19:50 UTC (Sat) by giraffedata (guest, #1954) [Link]

> Keeping the journal together will be good for performance, but it also helps to prevent reordering. In normal usage, the commit record will land on the block just after the rest of the journal data, so there is no reason for the drive to reorder things.

I don't see how that's true even usually. Even a classic elevator algorithm sweeps backward as well as forward. And if I were a disk drive with multiple blocks to write on the same track, I would not wait for the lowest numbered block to come around and write the track in order; I'd start writing whatever is under the head right now.

Barriers and journaling filesystems

Posted May 24, 2008 23:40 UTC (Sat) by dlang (guest, #313) [Link]

but on modern disks you really don't know where the borders of a track are. it's really just
an array of blocks

Barriers and journaling filesystems

Posted May 25, 2008 2:47 UTC (Sun) by giraffedata (guest, #1954) [Link]

> but on modern disks you really don't know where the borders of a track are. it's really just an array of blocks

I don't know what "but" refers to here; I made a statement about "if I were a disk drive," and the disk drive definitely knows where the borders of the tracks are. And though I have no actual evidence of it, I would be very surprised if the disk drive did not use that knowledge to optimize performance.

The idea suggested in the article that ext3 tends to get ordered writes (to the platters) because it usually does journal updates in logical block number order seems to assume that the disk drive does writes from cache to platter in logical block number order. I can't believe that it would do that, even usually, because it would be so much slower than optimum.

Barriers and journaling filesystems

Posted May 25, 2008 3:27 UTC (Sun) by dlang (guest, #313) [Link]

the drive will write an entire track at a time (they have done so for years)

but you have no way of knowing if the journal that you allocated spans a track boundary (since you don't know where the boundary is), and you also don't know if the drive has re-allocated any blocks to avoid bad spots on the disk in the middle of your journal.

either one of these will destroy the atomic nature of writes to the journal

Barriers and journaling filesystems

Posted May 25, 2008 6:40 UTC (Sun) by giraffedata (guest, #1954) [Link]

Whether it writes the whole track doesn't matter. It's the order in which it writes the blocks. Does it wait for the beginning of the track to come around and write from start to finish (average time - 1.5 revolutions)? Or does it write starting now (average time 1 revolution)? I presume it's the latter.

Nobody's said anything about ext3 knowing where the blocks are physically, so I don't know why you bring that up. The article just says ext3 updates the journal in logical block number order and suggests that means they tend to go the platter in the same order.

And I've been saying I don't believe that.

Barriers and journaling filesystems

Posted May 25, 2008 10:53 UTC (Sun) by dlang (guest, #313) [Link]

I misunderstood what you were saying. I thought that you were saying that if the entire
journal was on the same track it would get written all at once, even if it wrapped over the
end of the journal.

Barriers and journaling filesystems

Posted May 27, 2008 5:48 UTC (Tue) by perrymg (subscriber, #39684) [Link]

This article has generated a lot of comments, and I for one would like to see a follow on
article that takes many of these comments into account and filters the real facts out of them.

Thanks for the great work, keep it coming.

Barriers and journaling filesystems

Posted Jul 28, 2012 1:38 UTC (Sat) by carot (guest, #85987) [Link]

I want to ask a question about barriers: when is a barrier operation performed?
Only for filesystem metadata changes, or for user data changes, or for both?

Barriers and journaling filesystems

Posted Jul 10, 2014 7:58 UTC (Thu) by 07Srivathsan (guest, #97808) [Link]

Hi all,

I am using the Debian 6 operating system (kernel: 2.6.32-5-686).
The partitions created are of type ext3.
The machine turns off frequently due to UPS failures, and in such cases I see file system corruption.
To reduce the file system corruption, many have suggested enabling barriers when mounting the partitions.
By default, barriers are disabled for ext3 partitions.
Will there be any side effects or serious issues from enabling barriers on an ext3 partition?
If you have any suggestions, please reply.

Barriers and journaling filesystems

Posted Jul 10, 2014 13:30 UTC (Thu) by roblucid (guest, #48964) [Link]

Enabling the barrier by default has been done on openSUSE for years, precisely because data integrity was valued over benchmark results.

The only issue with enabling it was the poorer showing in comparative benchmark reviews, where the reporters just use defaults without any care for fair comparisons, or even for reporting on default differences.

What's the "reorder" mean in this sentence?

Posted Jul 21, 2014 12:46 UTC (Mon) by leafonsword (guest, #97971) [Link]

> contemporary drives maintain large internal caches and will reorder operations for better performance...

I have a question about the above sentence: what does "reorder" mean here?

Suppose the normal order is: journal 1, data 1; journal 2, data 2.
Which of the following reorderings is meant:
1. journal 2, data 2; journal 1, data 1
2. data 1, journal 1; data 2, journal 2

What's the "reorder" mean in this sentence?

Posted Jul 22, 2014 6:33 UTC (Tue) by dlang (guest, #313) [Link]

Unless you issue commands to the drive to force the order, the drive could do anything it wants, e.g.:

journal 2, data 1, data 2, journal 1

What's the "reorder" mean in this sentence?

Posted Sep 11, 2014 10:18 UTC (Thu) by K28.5 (guest, #98820) [Link]

leafonsword: this doesn't refer to Journal vs Data.
Drive caches are generally involved in re-ordering I/Os by their logical block address (LBA) on the disk. Writes or reads in a randomly scattered sequence would be re-ordered by their LBA (i.e. into ascending or descending order), and unrelated requests for blocks in a similar location grouped together, so that they can (on a spinning disk) be executed with one smooth movement of the head(s), saving time.
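
For instance, writes queued for LBAs 7000, 12, 3500 and 90 might be serviced by the drive as 12, 90, 3500, 7000, regardless of the order in which the host issued them.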


Copyright © 2008, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds