
Btrfs aims for the mainline


By Jonathan Corbet
January 7, 2009
The Btrfs filesystem has been under development for the last year or so; for much of that time, it has been widely regarded as the most likely "next generation filesystem" for Linux. But, before it can claim that title, Btrfs must stabilize and find its way into the mainline kernel. Btrfs developer Chris Mason has been saying for a while that he thinks the code will come together more quickly if it is merged relatively soon, even if it is not yet truly ready for production use. General experience with kernel development tends to support this position: in-tree code gets more review, testing, and fixes than out-of-tree code. So the development community as a whole has been reasonably supportive of a relatively early Btrfs merge.

In our last Btrfs episode, Andrew Morton suggested that a 2.6.29 merge be targeted. Chris would like that to happen; to that end, he has posted a version of Btrfs for consideration. Unsurprisingly, that posting has already increased the amount of attention being paid to this code, with the result that Chris quickly got a list of things to fix. Most of those have now been addressed, but there are a few remaining issues which could still impede the merging of Btrfs in this development cycle. This article will look at the potential roadblocks.

One of those is the user-space API. Btrfs brings with it a whole set of new ioctl() calls, none of which have been seriously reviewed or even documented. These calls perform functions like creating snapshots, initiating defragmentation, creating or resizing subvolumes, adding devices to the volume set, etc. Interestingly, there has been no real complaint about the volume-management features of Btrfs in general. But the interface to features like that needs close scrutiny; normally, user-space APIs cannot be broken once they are merged into the mainline. There has been some talk of making an exception for Btrfs, since there is little chance of systems becoming dependent on a specific interface before Btrfs is production-ready.

Still, once distributions start shipping Btrfs tools - to help testers if nothing else - an API change would cause pain. Any potential for this kind of pain would make API changes very hard to do. So Linux may well end up being stuck with the early Btrfs API. Given that at least one developer thinks that this API needs a serious rework, this issue could turn out to be a serious roadblock indeed.
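To make the concern concrete, here is a minimal user-space sketch of how a management operation of this kind is driven through ioctl(). The command number, structure layout, field names, and paths below are invented for illustration; they are not the actual Btrfs interface, only the general shape of the calls whose details are still in flux:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/ioctl.h>

    /* Invented stand-ins for the still-unsettled Btrfs ioctl interface. */
    struct demo_vol_args {
        long long fd;           /* fd of the subvolume to snapshot */
        char name[4088];        /* name of the new snapshot */
    };
    #define DEMO_IOC_SNAP_CREATE _IOW(0x94, 1, struct demo_vol_args)

    int main(void)
    {
        struct demo_vol_args args;
        int src = open("/mnt/btrfs/home", O_RDONLY);  /* subvolume to snapshot */
        int dst = open("/mnt/btrfs", O_RDONLY);       /* where the snapshot will appear */

        if (src < 0 || dst < 0) {
            perror("open");
            return 1;
        }

        memset(&args, 0, sizeof(args));
        args.fd = src;
        strcpy(args.name, "home-snap-20090107");

        /* Ask the filesystem to create the snapshot. */
        if (ioctl(dst, DEMO_IOC_SNAP_CREATE, &args) < 0)
            perror("ioctl");

        close(src);
        close(dst);
        return 0;
    }

Whatever numbers and structures do get merged will, in practice, become part of the kernel's ABI the moment tools start using them; that is why reviewers want them scrutinized now.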

Then, there is the issue of the special-purpose locking primitives used in Btrfs. To understand this discussion, it's worth looking at the locking function used within Btrfs:

    int btrfs_tree_lock(struct extent_buffer *eb)
    {
	int i;

	if (mutex_trylock(&eb->mutex))
	    return 0;
	for (i = 0; i < 512; i++) {
	    cpu_relax();
	    if (mutex_trylock(&eb->mutex))
		return 0;
	}
	cpu_relax();
	mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb));
	return 0;
    }

The lock in question is a mutex, but it is being acquired in an interesting way. If the lock is held by another process, this function will poll it up to 512 times, without sleeping, in the hope that it will become available quickly. Should that happen, the lock can be acquired without sleeping at all. After 512 unsuccessful attempts, the function will finally give up and go to sleep.

Chris justifies this behavior this way:

Btrfs is using mutexes to protect the btree blocks, and btree searching often hits hot nodes that are always in cache. For these nodes, the spinning is much faster, but btrfs also needs to be able to sleep with the locks held so it can read from the disk and do other complex operations.

For btrfs, dbench 50 performance doubles with the unconditional spin, mostly because that workload is almost all in ram. For 50 procs creating 4k files in parallel, the spin is 30-50% faster. This workload is a mixture of disk bound and CPU bound.

That kind of performance increase seems worth going for. In fact, it reflects a phenomenon which has been observed in other situations as well: even when sleeping locks are used, performance often improves if a processor spins for a while in the hope that a contended lock will become available. If the lock can be acquired without sleeping, then the overhead associated with putting the process to sleep and waking it up can be avoided. Beyond that, though, there is the fact that the process seeking to acquire the lock is probably well represented in the CPU's cache. Allowing that process to continue to run will, if the lock can be acquired quickly, almost certainly lead to better system performance.

For this reason, the adaptive realtime locks patch was developed last year, though it never found its way into the mainline. In response to the Btrfs discussion, Peter Zijlstra proposed a spinning mutex patch which is intended to provide the same benefits as the special Btrfs locking function, but for more general use and without the addition of magic constants. In Peter's patch, an attempt to acquire a contended lock will spin for as long as the process holding that lock is actually running on a CPU. If the lock holder goes to sleep, any process trying to acquire the lock also goes to sleep. The heuristic seems to make sense, though detailed benchmarks have not been posted. The patch was received reasonably well, though Linus has insisted that some changes be made.
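In rough pseudo-kernel form, the heuristic amounts to something like the sketch below. This is not Peter's actual patch: the lock->owner field and the owner_is_running() helper are assumptions made for illustration, and the memory-ordering and race-handling details a real implementation needs are omitted.

    /* Sketch of owner-aware adaptive spinning; illustrative only. */
    #include <linux/mutex.h>
    #include <linux/sched.h>

    void adaptive_mutex_lock(struct mutex *lock)
    {
        struct task_struct *owner;

        while (!mutex_trylock(lock)) {
            owner = lock->owner;        /* assumed field: current lock holder */

            /* Holder not running on a CPU? Spinning cannot help. */
            if (!owner || !owner_is_running(owner)) {
                mutex_lock(lock);       /* slow path: sleep until it is free */
                return;
            }
            cpu_relax();
        }
        /* mutex_trylock() succeeded: the lock was acquired without sleeping. */
    }

Compared to the Btrfs loop above, the spin is bounded by the lock holder's behavior rather than by a magic iteration count.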

So a more general spinning mutex may well find its way into the mainline. Whether it will go in for 2.6.29 is not clear, though. Developers tend to like their core locking primitives to be reasonably well tested; merging something which was developed toward the end of the merge window could be a hard sell. Until something like that happens, Chris is uninterested in removing his special locking function:

But, if anyone working on adaptive mutexes is looking for a coder, tester, use case, or benchmark for their locking scheme, my hand is up. Until then, this is my for loop, there are many like it, but this one is mine.

Finally, there is the question of the name. Some reviewers have suggested that the filesystem should be merged with a name which makes it clear that it's not meant for production use - "btrfsdev," for example. Chris is resistant to that idea, noting that, unlike existing filesystems, Btrfs is known to be new and has no reputation for stability. He has stated his willingness to make the change, though, if it is truly considered to be necessary. Bruce Fields pointed out that calling it "Btrfs" from the beginning could possibly burn future developers who boot an old kernel (with a non-production Btrfs) after switching to a newer, production-ready version of the filesystem.

All of this adds up to an uncertain fate for Btrfs in 2.6.29; there are a fair number of open issues and it's late in the merge window. Of course, Btrfs could be merged after 2.6.29-rc1; since it is a completely new subsystem, it won't cause regressions. But if Linus concludes that there are enough loose ends in the current Btrfs code, he may just decide to give it one more development cycle before bringing it into the mainline. So, while nobody seems to doubt that Btrfs will go in, the question of when remains open.

(With any luck, we hope to have an authoritative article on Btrfs for this page in the near future, once the author - you know who you are! - gets it written. Stay tuned.)

Index entries for this article
Kernel: Btrfs
Kernel: Filesystems/Btrfs
Kernel: Locking mechanisms/Mutexes



Hand-waving about a phased sleep

Posted Jan 7, 2009 17:27 UTC (Wed) by BrucePerens (guest, #2510) [Link]

Pardon me for engaging in some hand-waving, and not really knowing what the kernel developers have tried to do about this issue over time.

Isn't the problem here that sleep and the following process wakeup are expensive? In the case that sleep might be necessary only for a short time, a phased sleep makes more sense in the context of multiprocessing or hyperthreads. Halt the processor, but leave the process context in place in the processor. Don't move the process context out of the processor until some time has passed. Handle wakeups during this period by restarting the processor.

First, you stop banging on the lock in a tight loop, and that's important because atomic operations make expensive use of inter-processor cache coordination, potentially keeping other processors from doing useful work for some time. Also, you get to put the CPU core to useful work if the stopped processor is actually a hyperthread.

Bruce

Hand-waving about a phased sleep

Posted Jan 7, 2009 17:55 UTC (Wed) by johntb86 (subscriber, #53897) [Link]

That's an interesting idea, but as far as I know to wake up the processor you'd need an IPI, so it's not exactly cheap. You'd also need to create a timer interrupt to wake up the processor if it's not woken up by some other method, so this could become a pretty expensive way to have the processor do no work for a while.

Hand-waving about a phased sleep

Posted Jan 7, 2009 18:26 UTC (Wed) by BrucePerens (guest, #2510) [Link]

Perhaps we need hardware support to make this work properly. A timed halt instruction, or a low-level atomic primitive that can wake a processor waiting for an address.

Hand-waving about a phased sleep

Posted Jan 7, 2009 21:29 UTC (Wed) by jlokier (guest, #52227) [Link]

x86 does have a primitive to wait on an address: monitor/mwait. It waits until the monitored location is modified by another processor.

I read that it's a bit slow for this sort of thing. In principle, if it were fast (and it could be with MESI caches), it would be very well suited to replace all spinning waits.
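For the curious, a bare-bones wait-on-address loop along those lines might look like the sketch below. It assumes ring-0 execution on a CPU that supports monitor/mwait, uses GCC inline assembly, and ignores feature detection, interrupts, and power-state hints; it only shows the shape of the primitive.

    /* Illustrative only: wait, without spinning, for *addr to change from "old". */
    static void wait_for_word_change(volatile unsigned long *addr, unsigned long old)
    {
        while (*addr == old) {
            /* Arm the monitor on the cache line holding *addr. */
            asm volatile("monitor" : : "a"(addr), "c"(0UL), "d"(0UL));

            /* Re-check after arming so a racing store is not missed. */
            if (*addr != old)
                break;

            /* Halt until that line is written (or an interrupt arrives). */
            asm volatile("mwait" : : "a"(0UL), "c"(0UL));
        }
    }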

Hand-waving about a phased sleep

Posted Jan 8, 2009 3:03 UTC (Thu) by nevets (subscriber, #11875) [Link]

Actually, what Peter Zijlstra has done (and what we do in the -rt patch) is not to spin on any atomic operations. We simply spin and check a variable. For instance, when we fail to get the lock, we go into a loop, checking the lock owner and whether that owner is still running. No atomic operations are involved. There are a few races in different implementations, but none of those races are detrimental. A missed race just means we may sleep when we could have gotten the lock (the current behavior). When the lock owner changes, or the current owner goes to sleep, we try to retake the lock. If the owner is asleep, we then schedule.

In the -rt patch, we are a bit more conservative and avoid these races, but we still do not spin on any atomics. Thus, we do not slow down the bus traffic of other CPUs.

If API is in flux, how about not exposing it for now?

Posted Jan 7, 2009 20:06 UTC (Wed) by proski (subscriber, #104) [Link]

If the btrfs API is in flux, how about not exposing it for now? It's not like the filesystem cannot be used without that API. There are also more flexible ways of communicating with the kernel, such as configfs.

If API is in flux, how about not exposing it for now?

Posted Jan 7, 2009 20:48 UTC (Wed) by masoncl (subscriber, #47138) [Link]

Unfortunately, the ioctls are required to use the filesystem. The part that is most in flux is the system used to mount multi-device filesystems.

This is just one ioctl, and keeping it around shouldn't be a problem for compatibility.

If API is in flux, how about not exposing it for now?

Posted Jan 8, 2009 17:11 UTC (Thu) by nas (subscriber, #17) [Link]

Why can't those ioctls be marked as experimental? It's very difficult to design the perfect API up front. Also, I don't see any need to rename the filesystem. Just merge the darn thing already and mark it experimental. I've got over 1 TB in my desktop machine (not really a whole lot of space nowadays) and I'm willing to devote some space to testing. People who compile their own kernels should be able to deal with filesystem format changes.

Btrfs aims for the mainline

Posted Jan 7, 2009 20:36 UTC (Wed) by jwb (guest, #15467) [Link]

I'd like to hear how this lock-spinning loop behaves on a system with 1024 CPUs. I'm sure it's swell on a dual-core, shared-cache, UMA machine but it sounds like it could practically grind to a halt on a many-way NUMA machine.

Btrfs aims for the mainline

Posted Jan 7, 2009 21:41 UTC (Wed) by martinfick (subscriber, #4455) [Link]

Wouldn't it make spinning more worthwhile the more CPUs you have running? The more other CPUs running, the more likely the lock will be released sooner than later, making the spinning instead of sleeping more likely to pay off, right?

Btrfs aims for the mainline

Posted Jan 7, 2009 23:32 UTC (Wed) by jzbiciak (guest, #5246) [Link]

I don't know that it'd cause things to grind to a halt on a NUMA machine under normal circumstances. The cost of migrating lines goes up, sure, but you only get extra traffic when the busywaiter arrives just ahead of a lock-release.

I'm sure it could cause some interesting thundering-herd behavior, though, if there are a lot of people waiting on the lock to release. If a large number of CPUs are in the busywaiting loop when the lock releases, things get really fun, I suppose.

Btrfs aims for the mainline

Posted Jan 7, 2009 21:32 UTC (Wed) by MisterIO (guest, #36192) [Link]

Isn't it actually trying to acquire the lock 513 times?

Btrfs aims for the mainline

Posted Jan 7, 2009 23:21 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

I assume (but haven't checked) that the kernel's mutex functions do not have the 0-on-success behaviour seen in pthread_mutex_trylock(3p).

Btrfs aims for the mainline

Posted Jan 7, 2009 23:28 UTC (Wed) by tialaramex (subscriber, #21167) [Link]

Oh, or did you mean 513 as distinct from the 512 claimed by the author? No, this is the usual C idiom for doing something N times...

for (int k = 0; k < N; ++k) { /* do something */ }

It won't do them N+1 times because after the Nth iteration k is N and will fail the condition (k < N)

(Actually mine is a slightly more modern idiom, requiring C9X semantics, and it has the K&R increment style, but it amounts to the same thing)

Btrfs aims for the mainline

Posted Jan 7, 2009 23:25 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Are you commenting on the fact that the loop iterates 512 times in addition to the one try outside the loop? The way I read the following, it tries once, and if it fails, it tries another 512 times:

The lock in question is a mutex, but it is being acquired in an interesting way. If the lock is held by another process, this function will poll it up to 512 times, without sleeping, in the hope that it will become available quickly. Should that happen, the lock can be acquired without sleeping at all. After 512 unsuccessful attempts, the function will finally give up and go to sleep.

So, yeah, it tries 513 times. I guess it's hair-splitting as to exactly how many times it tries. I think the point was "try several times just in case."

Btrfs aims for the mainline

Posted Jan 8, 2009 4:16 UTC (Thu) by pr1268 (subscriber, #24648) [Link]

Why not just get rid of the outer (top) if-statement altogether, and move the cpu_relax() statement to after the inner if-statement?
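For reference, a sketch of what that restructuring might look like (illustrative only, not a proposed patch):

    int btrfs_tree_lock_restructured(struct extent_buffer *eb)
    {
        int i;

        /* Try the lock first, then back off briefly before each retry. */
        for (i = 0; i < 512; i++) {
            if (mutex_trylock(&eb->mutex))
                return 0;
            cpu_relax();
        }
        /* Still contended: block until the lock becomes available. */
        mutex_lock_nested(&eb->mutex, BTRFS_MAX_LEVEL - btrfs_header_level(eb));
        return 0;
    }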

Btrfs aims for the mainline

Posted Jan 9, 2009 17:57 UTC (Fri) by giraffedata (guest, #1954) [Link]

Why not just get rid of the outer (top) if-statement altogether, and move the cpu_relax() statement to after the inner if-statement?

In at least some systems, that would probably cost an extra instruction or two in the most common case -- that the lock is available immediately. Lock algorithms are usually written this way.

Incidentally, as long as we're counting, it's 514. mutex_lock_nested() will necessarily try to acquire the lock before going to sleep.

Btrfs aims for the mainline

Posted Jan 8, 2009 4:25 UTC (Thu) by MisterIO (guest, #36192) [Link]

Yeah, I didn't mean to suggest I'd caught some huge error, but if you add the first attempt to the 512 in the loop, that's 513!

Btrfs aims for the mainline

Posted Jan 8, 2009 0:00 UTC (Thu) by lmb (subscriber, #39048) [Link]

I have to admit that the idea of merging yet another RAID/volume management implementation, this time within a filesystem, does not make me very happy.

However, the design of btrfs - its ability to have subvolumes etc - clearly shows the path forward. In a way, the block versus filesystem boundary blurs, and stops making sense. It would be interesting to consider if filesystems could be stacked like this too, so that "raid1" would simply become a filesystem consuming two containers/objects and providing a new object, on which a new filesystem can then be mounted ...

Even if that is crazy talk, we really need more convergence here, not proliferation.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 1:57 UTC (Thu) by khim (subscriber, #9252) [Link]

I have to admit that the idea of merging yet another RAID/volume management implementation, this time within a filesystem, does not make me very happy.

Actually, I cannot imagine sane volume management outside of the filesystem. For example, here I have 4 HDDs in my system. What do I really want?
1. Keep most of the data on just one drive (for movies from my own DVDs).
2. Keep the rest in RAID-5 form (for movies in games and such: a PITA to reinstall, but it can be done if needed).
3. Keep my own personal files (1% of total size or so) duplicated 4 times (on 4 HDDs).
Pretty easy and simple requirements, right? Yet totally unachievable with the usual LVM/filesystem separation. Currently btrfs cannot support this mode of operation either, but potentially it's doable...

P.S. Actually, I got the idea after reading in the GFS paper: "Users can specify different replication levels for different parts of the file namespace." I was dumbfounded when I read this: this is exactly what I need from a normal filesystem - why was it never actually done? There are two answers:
1. It's harder to do in a normal filesystem (GFS works with huge chunks, so fragmentation is not an issue).
2. RAID/volume management is separated from the filesystem - there is just not enough information in the filesystem to make it happen!

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 2:54 UTC (Thu) by Da_Blitz (guest, #50583) [Link]

You can do some of the things you mentioned with LVM.

To keep most of the data on one partition you create a new VG with vgcreate -p 1 <name> <dev> <dev>; -p limits the maximum number of physical volumes for the VG.

While you can't do RAID 5 in LVM (something I assume could be patched in), you can do RAID 1 using LVM, with support for migrating the data from one drive to another should you remove or fail a drive; use lvcreate --mirrors 3 <volume name> <pvname> (or -m).

This will create one partition that is mirrored to 3 other drives.

At the end you would have 2 volume groups: one for your DVDs that contains one LV partition, and another volume group that includes all 4 drives and is replicated 4 times by LVM.

I would pay for RAID 5 support in LVM, as I could then specify different levels of replication for different partitions like you suggested.

I don't need volume groups!

Posted Jan 8, 2009 3:48 UTC (Thu) by khim (subscriber, #9252) [Link]

Yes, that's more or less how I do it today. But it's insane: I have a perfectly capable system developed to automate tasks, it can do billions of operations per second, and yet it cannot automagically reallocate space for me? Pathetic. Plus I want ALL metainformation kept on ALL drives: it's a small amount of data, after all, and even if I can restore all the files from DVDs it's certainly easier to do if I know exactly what was stored on that broken drive!

Sure, I can develop a lot of palliatives (I do ls -lR regularly and store it on a RAIDed partition), but it's all ugly and stupid.

I don't need volume groups!

Posted Jan 8, 2009 5:43 UTC (Thu) by Ze (guest, #54182) [Link]

Yes, that's more or less how I do it today. But it's insane: I have a perfectly capable system developed to automate tasks, it can do billions of operations per second, and yet it cannot automagically reallocate space for me? Pathetic. Plus I want ALL metainformation kept on ALL drives: it's a small amount of data, after all, and even if I can restore all the files from DVDs it's certainly easier to do if I know exactly what was stored on that broken drive! Sure, I can develop a lot of palliatives (I do ls -lR regularly and store it on a RAIDed partition), but it's all ugly and stupid.

So what you really want is one logical volume with arbitrary tags on block allocation and replication on a per-file basis.

The Tux3 approach of keeping metadata in normal files strikes me as the nicest way to map metadata onto multiple physical volumes.

Perhaps in the future we'll handle proper separation of concerns between the different layers instead of munging them all into one.

The real bitch with that, though, is backward compatibility, since we'd effectively be splitting existing filesystems into two separate concerns: a block format and a filesystem format composed of those blocks.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 11:51 UTC (Thu) by lmb (subscriber, #39048) [Link]

Sure; you want redundancy, encryption etc being managed at the file/directory object level. That makes perfect sense, and is exactly what I was hinting at by being able to stack "filesystems".

But we need a general solution for that, not yet another (third) one inside a single component. Or, rather, maybe we need to start with that now, and then phase out or at least deprecate the past.

But what we definitely don't need is three or more permanent solutions to this problem.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 8, 2009 16:58 UTC (Thu) by roblucid (guest, #48964) [Link]

> 1. Keep most of the data on just one drive (for movies from my own DVDs).
> 2. Keep the rest in RAID-5 form (for movies in games and such: a PITA to reinstall, but it can be done if needed).
> 3. Keep my own personal files (1% of total size or so) duplicated 4 times (on 4 HDDs).
> Pretty easy and simple requirements, right? Yet totally unachievable with the usual LVM/filesystem separation. Currently btrfs cannot support this mode of operation either, but potentially it's doable...

What's wrong with using partitions and bind mounts?

3) A partition that's mirrored on all disks
2) A RAID 5 (yuck!) using partitions on 3 disks
1) A partition for your data disk

You can bind mount stuff to be at convenient points in the filesystem hierarchy. That is "specifying different methods of replication for different parts of the namespace".

(Frankly, if you have 4 disks, I'd rather stripe over lower-level mirrors with RAID 10 and accept the lower capacity for the performance and reliability benefits of avoiding RAID 5 in a 3-disk configuration.)

Using a RAID layer to do RAID
An LVM layer to provide logical volumes (caveat on FS barriers)
A filesystem layer to handle filesystem structure and journalling

would appear to be a logical structure, though it does require initial planning.

Perhaps you want to be able to expand the allotted space given over to RAID 1, RAID 5, RAID 10, etc.?

Couldn't that be done, by using LVM type block devices, used by the RAID layer which is then exposed to filesystems, so they can grow/shrink their capacity, as chunks of disk are (de)allocated to partitions?

Call me a cynic if you like, but pushing every feature you could want into one layer - the filesystem, which would then become a generic dynamic disk management system - would appear to produce a very complicated, monolithic block of code. If sane implementations would then use generic layers, then you're really back to where you started.

That's the setup I have - and HATE it

Posted Jan 8, 2009 18:03 UTC (Thu) by khim (subscriber, #9252) [Link]

Perhaps you want to be able to expand the allotted space given over to RAID 1, RAID 5, RAID 10, etc.?

I want to forget the words "RAID", "LVM", and related. Forever. I want sane options. Like:
A. Store data cheaply (say, $0.10/GB) but unreliably: a single disk failure and be ready to redownload/reinstall.
B. Store data reliably but expensively ($0.40/GB): up to three disks can fail without any problems.
C. Some intermediate version: cheap and reliable ($0.12/GB to $0.15/GB), OK with a single disk failure (if it happens the OS will of course restore the status quo if possible), but sloooooow (still much faster than DVD).
Just like filesystems were invented to make manual management of data on a single disk unnecessary, I want something to hide all this RAID/LVM/etc stuff from me. Whether it will be btrfs or a stack of other technologies, I don't care, as long as I have a nice simple option list in the "Save As..." dialog.

Couldn't that be done, by using LVM type block devices, used by the RAID layer which is then exposed to filesystems, so they can grow/shrink their capacity, as chunks of disk are (de)allocated to partitions?

Maybe. But then it'll need huge, very complex schemes to make it work well as a whole. This is the "microkernel vs. monolithic kernel" discussion all over again.

Call me a cynic if you like, but pushing every feature you could want into one layer - the filesystem, which would then become a generic dynamic disk management system - would appear to produce a very complicated, monolithic block of code. If sane implementations would then use generic layers, then you're really back to where you started.

Huh? Why "monolithic block of code"??? I'm perfectly happy with separating of functions - different filesystems are free to use the same implementation of RAID, LVM, etc - if their authors decide it's the best way to do things. Just as long as it's not exposed to userspace (or at least to user).

That's the setup I have - and HATE it

Posted Jan 8, 2009 18:56 UTC (Thu) by dlang (guest, #313) [Link]

what you want isn't anything resembling a traditional filesystem. what you want is something like the 'object based storage' things that are being discussed (but you want something far more complex than what has been proposed, let alone implemented or accepted).

defining the redundancy for each file as it is saved will also require changes to every single program out there, which is very unlikely to happen.

if you are willing to deal with different directories having different redundancy options, then what you want is doable today, with no kernel changes. it just needs userspace tools written to make it easier to deal with.

That's the setup I have - and HATE it

Posted Jan 8, 2009 20:11 UTC (Thu) by roblucid (guest, #48964) [Link]

> defining the redundancy for each file as it is saved will also require
> changes to every single program out there, which is very unlikely to happen.

Probably a lot of ppl still expect filesystems to be fast, and having every node in the hierarchy and every file use a different form of backing storage depending on redundancy requirements... *sucks through teeth* sounds expensive to me.

The "I want to know nothing" and just have it managed by some kind of Storage management system that takes care of details, does sound a better requirement to me.

Actually I don't think "every program" would need modifying, as when files are created, they can inherit the characteristics of the parent directory, as would new sub-directories.

That's the setup I have - and HATE it

Posted Jan 8, 2009 20:26 UTC (Thu) by roblucid (guest, #48964) [Link]

> Huh? Why "monolithic block of code"??? I'm perfectly happy with
> separating of functions - different filesystems are free to use the same
> implementation of RAID, LVM, etc - if their authors decide it's the best
> way to do things. Just as long as it's not exposed to userspace (or at
> least to user).

Well, you seemed to imply it by saying you couldn't imagine it being done outside of the filesystem.

Permitting layers, I could imagine some specialised "meta" filesystem being feasible, that would give you a name space, that lets you tag directories and files with storage characteristics using more traditional type file systems as backing stores for bulk data storage. The real files might end up in a file hierarchy, spread over a number of volumes a bit like http proxy caches, to provide manageable chunks of RAID1, RAID0, RAID5, RAID10 storage, which can be increased (and freed) on demand by the Disk Management System.

The only thing is, how many ppl would actually need that? And if it were provided, how many would ever use funky "Save As" options in applications, compared to the number who would whinge about excessive options being confusing and unclean in their precious GUI?

That's the setup I have - and HATE it

Posted Jan 9, 2009 18:43 UTC (Fri) by giraffedata (guest, #1954) [Link]

Permitting layers, I could imagine some specialised "meta" filesystem being feasible, that would give you a name space, that lets you tag directories and files with storage characteristics using more traditional type file systems as backing stores for bulk data storage.

I believe you're describing object storage. The lower of those layers is the object layer. But it differs from a traditional filesystem in that the only names the objects (files) have are made up by the system (essentially, inode numbers).

In most of the work on object storage, that layer actually lives in a hardware unit separate from the one with the POSIX filesystem image, but it could be in the kernel between the filesystem drivers and the block device drivers as well (if it hasn't been done already).

But we do need to ask whether there is a need to have more than one future filesystem type with this kind of storage function before we put a lot of effort into making a reusable layer.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 16, 2009 0:02 UTC (Fri) by topher (guest, #2223) [Link]

What you're asking for is quite doable right now, using RAID+LVM (actually, only RAID is needed).

1. Keep most of the data on just one drive (for movies from my own DVDs).
2. Keep the rest in RAID-5 form (for movies in games and such: PITA to reinstall but can be done if needed).
3. Keep my own personal files (1% of total size or so) duplicated 4 times (on 4 HDDs).

Assuming 4 500GB Disks:
Drive 1: 10GB partition, RAID1; 490GB partition, stand alone
Drive 2: 10GB partition, RAID1; 490GB partition, RAID5
Drive 3: 10GB partition, RAID1; 490GB partition, RAID5
Drive 4: 10GB partition, RAID1; 490GB partition, RAID5

The 10GB partitions are all part of a 4x replicated RAID1 for your personal files. For additional redundancy across the system, put /boot and / on that RAID1 also, install GRUB on each disk, and you can lose any disk and still boot the system. The stand-alone 490GB partition is for your movies. The 3 490GB partitions in the RAID5 are for the rest of your stuff.

It's not required, but you could make use of LVM on top of those to more easily split things out as desired.

What you're asking for in a couple of your other posts, however, is not a simple thing. It doesn't fit well with how computers work in general. You seem to be saying you want to just save something and have the computer magically understand that it's "important" or "not important" or "kind of important" and know what that means. But computers don't do that. Someone has to tell them what each of those categories means, and how they're defined. And since it's your data, it's going to be hard for someone else to do that.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 17, 2009 21:34 UTC (Sat) by speedster1 (guest, #8143) [Link]

What you're asking for in a couple of your other posts, however, is not a simple thing. It doesn't fit well with how computers work in general. You seem to be saying you want to just save something and have the computer magically understand that it's "important" or "not important" or "kind of important" and know what that means. But computers don't do that. Someone has to tell them what each of those categories means, and how they're defined. And since it's your data, it's going to be hard for someone else to do that.

I don't think this automatic classification of "importance" is really the killer feature khim is pining for! There are a couple of key things that are painful or impossible with current LVM+RAID (which khim does know about and is currently using, for lack of a better alternative):

  1. allocation of space among the areas of differing redundancy
  2. ability to store metadata in a higher-redundancy area than the corresponding normal data

If the normal filesystem on the example's 10GB partition of high-redundancy storage fills up, your app will get an out-of-space error and you will have to hack around with LVM tools to try to reclaim some space from that big RAID 5 partition. Are you at all confident that a typical user could shrink it safely? I think a typical user would end up having to buy more disks, even though there was lots of unused space on the existing disks!

In a smart filesystem that handled levels of redundancy internally, you would not have this problem at all. The filesystem would have one big pool of storage, and would create additional regions of high redundancy as needed.

I know it is possible to put filesystem journals on separate partitions, but I don't think point #2 is even possible with current filesystems.

Actually RAID/volume management is superlimited when not in filesystem...

Posted Jan 18, 2009 0:33 UTC (Sun) by Kamilion (subscriber, #42576) [Link]

... Y'know, you've basically just described ZFS on OpenSolaris. Check it out.

There's already a FUSE module for Linux. If someone has bothered reimplementing the ZFS filesystem with FUSE, shouldn't it be much easier to use that as the basis of a cleanroom GPL2 kernel implementation of ZFS?

Btrfs aims for the mainline

Posted Jan 8, 2009 20:52 UTC (Thu) by masoncl (subscriber, #47138) [Link]

Andrew Morton recently asked about why the device management in the FS makes sense. The thread covered a bunch of ground:

http://article.gmane.org/gmane.comp.file-systems.btrfs/1764

The most important reason in my mind is connecting the transaction domains of the storage management with the rest of the filesystem. It makes it much more efficient to move storage around without having to wonder if the FS is currently using it.

Btrfs aims for the mainline

Posted Jan 9, 2009 0:29 UTC (Fri) by roblucid (guest, #48964) [Link]

Interesting.

Presumably there are similar issues with the "array of blocks/bytes" idea when it comes to disks themselves. It shows with flash, where it would be nice to let the device know which blocks are unused. Perhaps also with bad block reallocation on disks, which might have unpleasant consequences for journal files (without barriers).

It used to be that disks were always damn near full, so you didn't really gain anything by not doing a raw partition copy when transferring data from one disk to another.

Btrfs aims for the mainline

Posted Jan 15, 2009 6:25 UTC (Thu) by angelortega (guest, #1306) [Link]

I've translated this article into Spanish here:
http://triptico.com/docs/lwn_313682.html


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds