
Linux Storage and Filesystem Workshop, day 2


By Jonathan Corbet
April 8, 2009
The second and final day of the Linux Storage and Filesystem Workshop was held in San Francisco, California on April 7. Conflicting commitments kept your editor from attending the entire event, but he was able to participate in sessions on solid-state device support, storage topology information, and more.

Supporting SSDs

The solid-state device topic was the most active discussion of the morning. SSDs clearly stand to change the storage landscape, but it often seems that nobody has yet figured out just how things will change or what the kernel should do to make the best use of these devices. Some things are becoming clearer, though. The kernel will be well positioned to support the current generation of SSDs. Supporting future products, though, is going to be a challenge.

Matthew Wilcox, who led the discussion, started by noting that Intel SSDs are able to handle a large number of operations in parallel. The parallelism is so good, in fact, that there is little or no advantage in delaying operations. I/O requests should be submitted immediately; the block I/O subsystem shouldn't even attempt to merge adjacent requests. This advice was diluted a bit later on, but the core message is clear: when driving an SSD, the kernel should focus on getting out of the way and processing operations as quickly as possible.

It was asked: how do these drives work internally? This would be nice to know; the better informed the kernel developers are, the better job they can do of driving the devices. It seems, though, that the firmware in these devices - the part that, for now, makes Intel devices work better than most of the alternatives - is laden with Valuable Intellectual Property, and not much information will be forthcoming. Solid-state devices will be black boxes for the foreseeable future.

In any case, current-generation Intel SSDs are not the only type of device that the kernel will have to work with. Drives will differ greatly in the coming years. What the kernel really needs to know is a few basic parameters: what kind of request alignment works best, what request sizes are fastest, etc. It would be nice if the drives could export this information to the operating system. There is a mechanism by which this can be done, but current drives are not making much information available.

One clear rule holds, though: bigger requests are better. They might perform better in the drive itself, but, with high-quality SSDs, the real bottleneck is simply the number of requests which can be generated and processed in a given period of time. Bundling things into larger requests will tend to increase the overall bandwidth.

A related rule has to do with changes in usage patterns. It would appear that the Intel drives, at least, observe the requests issued by the computer and adapt their operation to improve performance. In particular, they may look at the typical alignment of requests. As a result, it is important to let the drive know if the usage pattern is about to change - when the drive is repartitioned and given a new filesystem, for example. The way to do this, evidently, is to issue an ATA "secure erase" command.

From there, the conversation moved to discard (or "trim") requests, which are used by the host to tell the drive that the contents of specific blocks are no longer needed. Judicious use of trim requests can help the drive in its garbage collection work, improving both performance and the overall life span of the hardware. But what constitutes "judicious use"? Doing a trim when a new filesystem is made is one obvious candidate. When the kernel initializes a swap file, it trims the entire file at the outset since it cannot contain anything of use. There is no controversy here (though it's amusing to note that mkfs does not, yet, issue trim commands).
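As a concrete illustration of the mkfs case, here is a minimal user-space sketch (not something presented at the workshop) of how a filesystem-creation tool might discard the entire contents of a device before laying down new structures, using the BLKDISCARD ioctl available in recent kernels; the device path is just a placeholder.

    /*
     * Sketch: discard a whole block device before running mkfs on it.
     * Assumes a kernel and drive that support BLKDISCARD; the device
     * path below is hypothetical.
     */
    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/fs.h>           /* BLKDISCARD, BLKGETSIZE64 */

    int main(void)
    {
        const char *dev = "/dev/sdb";   /* hypothetical target device */
        uint64_t range[2];              /* { offset, length } in bytes */
        uint64_t size;
        int fd = open(dev, O_RDWR);

        if (fd < 0) {
            perror("open");
            return 1;
        }
        if (ioctl(fd, BLKGETSIZE64, &size) < 0) {
            perror("BLKGETSIZE64");
            return 1;
        }
        range[0] = 0;
        range[1] = size;
        /* Tell the drive that none of the current contents matter. */
        if (ioctl(fd, BLKDISCARD, &range) < 0)
            perror("BLKDISCARD");
        close(fd);
        return 0;
    }

A mkfs implementation could do exactly this before writing its metadata, leaving the drive free to treat every block as unused.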

But what about when the drive is repartitioned? It was suggested that the portion of the drive which has been moved from one partition to another could be trimmed. But that raises an immediate problem: if the partition table has been corrupted and the "repartitioning" is really just an attempt to restore the drive to a working state, trimming that data would be a fatal error. The same is true of using trim in the fsck command, which is another idea which has been suggested. In the end, it is not clear that using trim in either case is a safe thing to do.

The other obvious place for a trim command is when a file is deleted; after all, its data clearly is no longer needed. But some people have questioned whether that is a good time to do so. Data recovery is one issue; sometimes people want to be able to get back the contents of an erroneously-deleted file. But there is also a potential performance issue: on ATA drives, trim commands cannot be issued as tagged commands. So, when a trim is performed, all normal operations must be brought to a halt. If that happens too often, the throughput of the drive can suffer. This problem could be mitigated by saving up trim operations and issuing them all together every few minutes. But it's not clear that the real performance impact is enough to justify this effort. So some benchmarking work will be needed to try to quantify the problem.
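To make the batching idea concrete, here is a rough sketch (hypothetical code, not from any filesystem; all names are made up) of how deleted extents might be queued and merged so that trim commands go out in occasional large batches rather than one per deleted file:

    /* Hypothetical batching of trim ranges. */
    #include <stddef.h>
    #include <stdbool.h>

    struct extent {
        unsigned long long start;       /* sectors */
        unsigned long long len;
    };

    #define MAX_PENDING 128

    static struct extent pending[MAX_PENDING];
    static size_t npending;

    /* Queue one deleted extent, merging with an adjacent range if possible. */
    bool queue_trim(unsigned long long start, unsigned long long len)
    {
        size_t i;

        for (i = 0; i < npending; i++) {
            if (pending[i].start + pending[i].len == start) {
                pending[i].len += len;          /* extend upward */
                return true;
            }
            if (start + len == pending[i].start) {
                pending[i].start = start;       /* extend downward */
                pending[i].len += len;
                return true;
            }
        }
        if (npending == MAX_PENDING)
            return false;                       /* caller should flush first */
        pending[npending].start = start;
        pending[npending].len = len;
        npending++;
        return true;
    }

    /* Called every few minutes (or when the queue fills) to issue the batch. */
    void flush_trims(void (*issue_trim)(unsigned long long start,
                                        unsigned long long len))
    {
        size_t i;

        for (i = 0; i < npending; i++)
            issue_trim(pending[i].start, pending[i].len);
        npending = 0;
    }

The benchmarking question is then whether the win from fewer queue drains outweighs the bookkeeping and the delay in telling the drive about free space.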

An alternative which was suggested was to not use trim at all. Instead, a similar result could be had by simply reusing the same logical block numbers over and over. A simple-minded implementation would always just allocate the lowest-numbered free block when space is needed, thus compressing the data toward the front end of the drive. There are a couple of problems with this approach, though, starting with the fact that a lot of cheaper SSDs have poor wear-leveling implementations. Reusing low-numbered blocks repeatedly will wear those drives out prematurely. The other problem is that allocating blocks this way would tend to fragment files. The cost of fragmentation is far less than with rotating storage, but there is still value in keeping files contiguous. In particular, it enables larger I/O operations, and, thus, better performance.
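For illustration only, the "simple-minded implementation" might look like the bitmap allocator below (hypothetical code); it keeps data packed toward the front of the device, which is exactly what wears out cheap drives and interleaves the blocks of different files:

    /* Toy first-fit block allocator: always hand out the lowest free block. */
    #include <limits.h>

    #define NBLOCKS 4096
    static unsigned char bitmap[NBLOCKS / CHAR_BIT];    /* 1 bit per block */

    static int  test_bit(unsigned int b)  { return bitmap[b / CHAR_BIT] &   (1u << (b % CHAR_BIT)); }
    static void set_bit(unsigned int b)   {        bitmap[b / CHAR_BIT] |=  (1u << (b % CHAR_BIT)); }
    static void clear_bit(unsigned int b) {        bitmap[b / CHAR_BIT] &= ~(1u << (b % CHAR_BIT)); }

    /* Return the lowest-numbered free block, or -1 if the device is full. */
    int alloc_block(void)
    {
        unsigned int b;

        for (b = 0; b < NBLOCKS; b++) {
            if (!test_bit(b)) {
                set_bit(b);
                return (int)b;
            }
        }
        return -1;
    }

    /* A freed low-numbered block becomes the very next one handed out,
     * concentrating writes (and wear) at the front of the device. */
    void free_block(unsigned int b)
    {
        clear_bit(b);
    }
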

There was a side discussion on how the kernel might be able to distinguish "crap" drives from those with real wear-leveling built in. There's actually some talk of trying to create value-neutral parameters which a drive could use to export this information, but there doesn't seem to be much hope that the vendors will ever get it right. No drive vendor wants its hardware to self-identify as a lower-quality product. One suggestion is that the kernel could interpret support for the trim command as an indicator that it's dealing with one of the better drives. That led to the revelation that the much-vaunted Intel drives do not, currently, support trim. That will change in future versions, though.

A related topic is a desire to let applications issue their own trim operations on portions of files. A database manager could use this feature to tell the system that it will no longer be interested in the current contents of a set of file blocks. This is essentially a version of the long-discussed punch() system call, with the exception that the blocks would remain allocated to the file. De-allocating the blocks would be correct at one level, but it would tend to fragment the file over time, force journal transactions, and make O_DIRECT operations block while new space is allocated. Database developers would like to avoid all of those consequences. So this variant of punch() (perhaps actually a variant of fallocate()) would discard the data, but keep the blocks in place.
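What such an interface might look like to an application is sketched below; the FALLOC_FL_DISCARD_KEEP_BLOCKS flag is invented purely for illustration and is not a real (or even proposed) kernel flag.

    /*
     * Hypothetical use of the "discard the data, keep the blocks" operation
     * by a database.  FALLOC_FL_DISCARD_KEEP_BLOCKS is a made-up flag.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>

    #ifndef FALLOC_FL_DISCARD_KEEP_BLOCKS
    #define FALLOC_FL_DISCARD_KEEP_BLOCKS 0x80      /* hypothetical */
    #endif

    int drop_table_segment(int fd, off_t offset, off_t len)
    {
        /*
         * Declare the current contents of this range to be garbage,
         * letting the filesystem pass a trim down to the drive, while
         * keeping the blocks allocated so that later O_DIRECT writes
         * need no new allocations or journal transactions.
         */
        if (fallocate(fd, FALLOC_FL_DISCARD_KEEP_BLOCKS, offset, len) < 0) {
            perror("fallocate");
            return -1;
        }
        return 0;
    }
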

From there, the discussion went to the seemingly unrelated topic of "thin provisioning." This is an offering from certain large storage array vendors; they will sell an array which claims to be much larger than the amount of storage actually installed. When the available space gets low, the customer can buy more drives from the vendor. Meanwhile, from the point of view of the system, the (apparently) large array has never changed.

Thin provisioning providers can use the trim command as well; it lets them know that the indicated space is unused and can be allocated elsewhere. But that leads to an interesting problem if trim is used to discard the contents of some blocks in the middle of the file. If the application later writes to those blocks - which are, theoretically, still in place - the system could discover that the device is out of space and fail the request. That, in turn, could lead to chaos.

The truth of the matter is that thin provisioning has this problem regardless of the use of the trim command. Space "allocated" with fallocate() could turn out to be equally illusory. And if space runs out when the filesystem is trying to write metadata, the filesystem code is likely to panic, remount the filesystem read-only, and, perhaps, bring down the system. So thin provisioning should be seen as broken currently. What's needed to fix it is a way for the operating system to tell the storage device that it intends to use specific blocks; this is an idea which will be taken back to the relevant standards committees.
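The guarantee that thin provisioning undermines is easy to see in code; here is a minimal sketch (the path and size are arbitrary) of the reservation an application believes it has made after a successful posix_fallocate():

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/data/dbfile", O_CREAT | O_WRONLY, 0600);
        int err;

        if (fd < 0) {
            perror("open");
            return 1;
        }
        /* Reserve 1GB; posix_fallocate() returns an error number directly. */
        err = posix_fallocate(fd, 0, 1024 * 1024 * 1024);
        if (err) {
            fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
            return 1;
        }
        /*
         * POSIX now promises that writes into this range will not fail
         * for lack of space on the filesystem - but a thinly-provisioned
         * array can still run out of physical storage underneath it.
         */
        close(fd);
        return 0;
    }
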

Finally, there was some discussion of the CFQ I/O scheduler, which has a lot of intelligence that is not needed for SSDs. There's a way to bypass CFQ for some SSD operations, but CFQ still adds an approximately 3% performance penalty compared to the no-op I/O scheduler. That kind of cost is bearable now, but it's not going to work for future drives. There is real interest in being able to perform 100,000 operations per second - or more - on an SSD. That kind of I/O rate does not leave much room for system overhead. So, at some point, we're going to see a real effort to streamline the block I/O paths to ensure that Linux can continue to get the best out of solid-state devices.
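For reference, the per-device scheduler can already be switched away from CFQ through sysfs; the snippet below (device name hypothetical, and normally done with a one-line shell echo instead) selects the no-op scheduler for one drive:

    #include <stdio.h>

    int main(void)
    {
        /* Writing a scheduler name to this attribute selects it. */
        FILE *f = fopen("/sys/block/sda/queue/scheduler", "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        if (fputs("noop", f) == EOF)
            perror("fputs");
        fclose(f);
        return 0;
    }
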

Storage topology

Martin Petersen introduced the storage topology issue by talking about the coming 4K-sector drives. The sad fact is that, for all the talk of SSDs, rotating storage will be with us for a while yet. And the vendors of disk drives intend to shift to 4-kilobyte sectors by 2011. That leads to a number of interesting support problems, most of which were covered in this LWN article in March. In the end, the kernel is going to have to know a lot more about I/O sizes and alignment requirements to be able to run future drives.

To that end, Martin has prepared a set of patches which export this information to the system. The result is a set of directories under /sys/block/<drive>/topology which provide the sector size, needed alignment, optimal I/O size, and more. There's also a "consistency flag" which tells the user whether any of the other information actually matches reality. In some situations (a RAID mirror made up of drives with differing characteristics, for example), it is not possible to provide real information, so the kernel has to make something up.

There was some wincing over this use of sysfs, but the need for this kind of information is clear. So these patches will probably be merged into the 2.6.31 kernel.
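A user-space consumer (mkfs, mkswap, the LVM tools) could then read those values and align its on-disk structures accordingly. The sketch below assumes the attribute names and location that eventually shipped under /sys/block/<drive>/queue/, which may differ in detail from the patch set described above; the device name is a placeholder.

    #include <stdio.h>

    static long read_attr(const char *dev, const char *attr)
    {
        char path[256];
        long val = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/sys/block/%s/queue/%s", dev, attr);
        f = fopen(path, "r");
        if (!f)
            return -1;
        if (fscanf(f, "%ld", &val) != 1)
            val = -1;
        fclose(f);
        return val;
    }

    int main(void)
    {
        const char *dev = "sda";        /* hypothetical drive */

        printf("logical block size:  %ld\n", read_attr(dev, "logical_block_size"));
        printf("physical block size: %ld\n", read_attr(dev, "physical_block_size"));
        printf("minimum I/O size:    %ld\n", read_attr(dev, "minimum_io_size"));
        printf("optimal I/O size:    %ld\n", read_attr(dev, "optimal_io_size"));
        return 0;
    }
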

readdirplus()

There was also a session on the proposed readdirplus() system call. This call would function much like readdir() (or, more likely, like getdents()), but it would provide file metadata along with the names. That, in turn, would avoid the need for a separate stat() call and, hopefully, speed things considerably in some situations.

Most of the discussion had to do with how this new system call would be implemented. There is a real desire to avoid the creation of independent readdir() and readdirplus() implementations in each filesystem. So there needs to be a way to unify the internal implementation of the two system calls. Most likely that would be done by using only the readdirplus() function if a filesystem provides one; this callback would have a "no stat information needed" flag for the case when normal readdir() is being called.

The creation of this system call looks like an opportunity to leave some old mistakes behind. So, for example, it will not support seeking within a directory. There will also probably be a new dirent structure with 64-bit fields for most parameters. Beyond that, though, the shape of this new system call remains somewhat cloudy. Somebody clearly needs to post a patch.
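Pending that patch, a purely hypothetical sketch of the sort of interface being discussed might look like the following; neither the structure nor the prototype is real, but it captures the points above: stat data returned with each name, 64-bit fields, and an offset that is an opaque cookie rather than a seekable position.

    #include <stdint.h>
    #include <sys/stat.h>
    #include <sys/types.h>

    struct dirent_plus {                    /* hypothetical */
        uint64_t    d_ino;
        uint64_t    d_off;                  /* opaque cookie; not seekable */
        uint16_t    d_reclen;
        uint8_t     d_type;
        struct stat d_stat;                 /* what a separate stat() would return */
        char        d_name[];               /* NUL-terminated name */
    };

    /*
     * Fill 'buf' with as many entries as fit; return the number of bytes
     * used, 0 at end of directory, or -1 on error.  A "names only" flag
     * would let plain readdir() share the same filesystem callback.
     */
    ssize_t readdirplus(int dirfd, void *buf, size_t bufsize, unsigned int flags);
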

Conclusion

And there ends the workshop - at least, the part that your editor was able to attend. There were a number of storage-related sessions which, beyond doubt, covered interesting topics, but it was not possible to be in both rooms at the same time (though, with luck, your editor will soon receive another attendee's notes from those sessions). The consensus among the attendees was that it was a highly successful and worthwhile event; the effects should be seen to ripple through the kernel tree over the next year.

Index entries for this article
Kernel: Block layer
Kernel: Filesystems/Workshops
Conference: Storage and Filesystem Workshop/2009



Intel drives and TRIM

Posted Apr 8, 2009 17:01 UTC (Wed) by willy (subscriber, #9762) [Link]

A quick clarification:

"That led to the revelation that the much-vaunted Intel drives do not, currently, support trim. That will change in future versions, though."

It's a property of the firmware; the drive I am testing TRIM with uses a special firmware build on current hardware. I am not able to comment on when or whether firmware that supports TRIM will be available for drives that people already own.

From what I've read on public websites, this appears to be true for other manufacturers too.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 8, 2009 20:04 UTC (Wed) by jzbiciak (guest, #5246) [Link]

I wonder if SSDs will ever gain an additional level of hierarchy, such as an NVRAM layer over the flash layer. "Hot write" zones could then live in the NVRAM layer and only be committed to flash infrequently. By "hot write," I'm thinking of stuff such as the journal on a journaled filesystem.

Aggressive use of TRIM when committing entries out of the journal would make it easier to reap blocks within this faster level of hierarchy, and would make the drive less sensitive to the size of the journal. That is, the journal could be much larger than the NVRAM size, but you'd still get the benefit if the *active* part of the journal fit in the NVRAM.

Having such a buffer should also make it much easier to wear-level the drive, pushing writes out of the NVRAM LRU only as needed, rotating among all the pieces of flash. The NVRAM would also allow the SSD to buffer requests (and mark writes as complete!) while it's in the middle of erasing sectors in the flash.

Sure, it'd be expensive, but I imagine it'd fly like a bat outta hell.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 8, 2009 22:06 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Also, how long until flash drives go the "WinModem" route, exposing a raw interface and putting all the smarts in an OS driver? I guess you still need to have some minimal disk emulation to get booted far enough to load such a driver...

Linux Storage and Filesystem Workshop, part 2

Posted Apr 9, 2009 0:43 UTC (Thu) by mattdm (subscriber, #18) [Link]

As I understand it, that'd be a very good thing for Linux — just like Linux software RAID is usually better than the "fraid" options.

Winmodem-like solid state storage

Posted Apr 10, 2009 18:34 UTC (Fri) by giraffedata (guest, #1954) [Link]

> Also, how long until flash drives go the "WinModem" route, exposing a raw interface and putting all the smarts in an OS driver?

> As I understand it, that'd be a very good thing for Linux

But in the Winmodem route, the raw interface is a secret and specific to a small subset of devices, and the manufacturer doesn't write any Linux drivers. That's all pretty bad for Linux, isn't it?

On the other hand, assuming flash storage lasts (doesn't get replaced by a different solid state storage technology that has different characteristics), I do expect it to eventually grow an interface optimized for driving flash (i.e. low level) rather than use an interface designed for disk drives, and then we could get the advantages of running more of the logic in the client and less in the storage device.

Winmodem-like solid state storage

Posted Apr 11, 2009 19:00 UTC (Sat) by dwmw2 (subscriber, #2063) [Link]

"But in the Winmodem route, the raw interface is a secret and specific to a small subset of devices, and the manufacturer doesn't write any Linux drivers. That's all pretty bad for Linux, isn't it?"
A WinModem is mostly just a sound card — we do actually know how to drive most of them. That's not the problem at all.

The problem with WinModems is that modem algorithms better than about v.32 are covered by patents. And while a lot of stuff in that situation has still found its way into software projects in the Free World, we still don't have a decent modem implementation. Someone needs to do a Free World fork of spandsp, perhaps? And/or pick up the late Tony Fisher's work on v.32bis and v.34.

For flash, the situation is different. While there are plenty of patents flying around, they mostly cover the ways in which you make a flash device pretend to be a block device — and the beauty of exposing flash directly to the operating system is that you don't need that gratuitous extra layer any more. You can have a file system which knows about flash, and is designed to operate directly, and optimally, on it.

The recent TRIM work goes some way to fixing the most obvious disadvantage of the extra layer, but the fact remains that you still have your real file system running on top of another pseudo-filesystem which is pretending to be a block device. And you can never attempt to debug or improve that lower layer.

The Linux kernel has two file systems for real flash already, and more are in the works. I'd very much like to see direct access to the flash being permitted by these devices. I'm confident that we can do better than anything they can do inside their little black box.

Winmodem-like solid state storage

Posted Apr 11, 2009 20:25 UTC (Sat) by giraffedata (guest, #1954) [Link]

Thanks for elucidating the patent angle.

But that really just raises another issue. With flash storage having its novel foibles, I find it hard to believe there aren't patents covering the various things you have to do to make it useful. Method for storing data on flash without wearing out hot spots? Method for extracting small blocks of data from flash quickly?

I'm confident that we can do better than anything they can do inside their little black box.

So you're saying there is innovation to be done. That means there's something for someone to monopolize with a patent.

At least with a black box, the patents are all paid for as part of acquiring the box. In contrast, when you need a patent license to run some code in your own Linux system, it usually means the code is useless.

So it still looks to me like Winmodem-style flash storage would be a bane to Linux and free software.

Winmodem-like solid state storage

Posted Apr 11, 2009 23:09 UTC (Sat) by jzbiciak (guest, #5246) [Link]

> So it still looks to me like Winmodem-style flash storage would be a bane to Linux and free software.

Nonsense. There are plenty of machines out there that have just plain NAND or NOR flash hooked up to the CPU and Linux reads/writes these effectively. The issue is that currently we only see that in the embedded space, and it's typically just enough flash to hold the code for whatever that machine is supposed to do. For example, be a WiFi router or a set-top box or a cell phone.

What I'd like to see is something I can get off the shelf at my local Computer Mart (or on the web) that plugs into my PC and gives me raw flash. Instead of focusing on "right sized" and "small" and "maximizing battery life", it instead can be a bank of parallel flash such as what Intel's SSD disks are, but with a raw interface. We can then use our existing flash filesystems and infrastructure to drive those in a desktop and laptop space, rather than just the netbook/smart-phone/smart-router space.

Now, these (potentially) massively parallel performance oriented disks will need additional software support. You want something akin to RAID striping across the media along with maybe some redundancy in addition to wear leveling. That's just enhancements on top of our existing wear leveling filesystems and infrastructure.

The only real issue is that once you give raw flash to the OS and put the smarts in the OS, it'll be harder for dual-boot systems to communicate on the same media, because the likelihood that $VENDOR's Windows driver organizes the disk the way Linux does is slim to none unless $VENDOR works with the Linux community also.

Winmodem-like solid state storage

Posted Apr 12, 2009 1:21 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

"The only real issue is that once you give raw flash to the OS and put the smarts in the OS, it'll be harder for dual-boot systems to communicate on the same media, because the likelihood that $VENDOR's Windows driver organizes the disk the way Linux does is slim to none unless $VENDOR works with the Linux community also."
I see two reasons why that wouldn't be a problem, in practice.

Firstly, we've never had many problems working with "foreign" formats. We cope with NTFS, HFS and various bizarre crappy "Software RAID" formats, amongst other things. That includes the special on-flash formats like the NAND Flash Translation Layer used on the M-Systems DiskOnChip devices, which has been supported for about a decade. Are you suggesting that hardware vendors take Linux less seriously now than they did ten years ago, and that we'd have a harder time working out how to interoperate? Remember, documenting the on-medium format doesn't necessarily give away all the implementation details like algorithms for wear levelling, etc. — that's why M-Systems were content to give us documentation, all that time ago.

Secondly, interoperability at that level isn't a showstopper. It's nice to have, admittedly, but I'm not going to lose a lot of sleep if I can't mount my Windows or MacOS file system under Linux. It's the native functionality of the device under Linux that I care about most of the time.

Of course, I see no reason why the device vendors should be pushing their own "speshul" formats anyway — the hard drive vendors don't. But I'm not naïve enough to think that they won't try.


Imagine a world where every hard drive you buy is actually more like a NAS. You can only talk a high-level protocol like CIFS or NFS to it; you can't access the sectors directly. Each vendor has their own proprietary file system on it internally, implemented behind closed doors by the same kind of people who write BIOSes. You have no real information about what's inside, and can't make educated decisions on which products to buy. Having made your choice you can't debug it, you can't optimise it for your own use case, you can't try to recover your data when things go wrong, and you sure as hell can't use btrfs on it. All you can do is pray to the deity of your choice, then throw the poxy thing out the window when it loses your data.

If the above paragraph leaves you in a cold sweat, it was intended to. That's the kind of dystopia I see in my head, when we talk about SSDs without direct access to the flash.

Winmodem-like solid state storage

Posted Apr 12, 2009 1:46 UTC (Sun) by giraffedata (guest, #1954) [Link]

> What I'd like to see is something I can get off the shelf at my local Computer Mart (or on the web) that plugs into my PC and gives me raw flash.

If a PCIe expansion socket is sufficient, several companies are now selling that. I remember IBM demonstrating last year a prototype storage server composed of a bunch of Linux systems with Fusion-IO PCI Express cards for storage. It broke some kind of record.

In that system, the flash storage still appeared as a block device, but it did it at the Linux block device interface instead of at the SCSI physical interface.

Winmodem-like solid state storage

Posted Apr 16, 2009 18:53 UTC (Thu) by wmf (guest, #33791) [Link]

Fusion io is not raw flash since the driver contains a sophisticated FTL that cannot be disabled. In theory they could release an MTD driver, but they're not going to.

Winmodem-like solid state storage

Posted Apr 16, 2009 20:15 UTC (Thu) by giraffedata (guest, #1954) [Link]

> Fusion io is not raw flash since the driver contains a sophisticated FTL that cannot be disabled. In theory they could release an MTD driver, but they're not going to.

Is the driver you're talking about a Linux kernel module? An object code only one?

What is MTD?

All this has happened before...

Posted Apr 16, 2009 21:55 UTC (Thu) by wmf (guest, #33791) [Link]

> Is the driver you're talking about a Linux kernel module? An object code only one?

Yep.

> What is MTD?

How you access raw flash. See also UBIFS Raw flash vs. FTL devices.

Winmodem-like solid state storage

Posted Apr 19, 2009 14:11 UTC (Sun) by oak (guest, #2786) [Link]

> What I'd like to see is something I can get off the shelf at my local
> Computer Mart (or on the web) that plugs into my PC and gives me raw
> flash. Instead of focusing on "right sized" and "small" and "maximizing
> battery life", it instead can be a bank of parallel flash such as what
> Intel's SSD disks are, but with a raw interface. We can then use our
> existing flash filesystems and infrastructure to drive those in a desktop
> and laptop space

Unlike block based file systems like ext[234], the existing flash file systems are designed for very small file systems. E.g. JFFS2 keeps the whole file system metadata in RAM and is unusable in GB sized file systems.

However, the newly merged UBIFS promises to work much better:
* http://lwn.net/Articles/275706/
* http://www.linux-mtd.infradead.org/doc/ubifs.html#L_scala...

There's no usage data on how well it performs with desktop and server loads, though.

Winmodem-like solid state storage

Posted Apr 19, 2009 14:28 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

" Unlike block based file systems like ext[234], the existing flash file systems are designed for very small file systems. E.g. JFFS2 keeps the whole file system metadata in RAM and is unusable in GB sized file systems."
Very true — although we put a lot of effort in to make JFFS2 better for OLPC with its 1GiB of NAND flash. It mounts in 6 seconds or so, and we reduced the RAM usage by a significant amount too. But still, JFFS2 was designed in the days of 32MiB or so of NOR flash, and definitely isn't intended to scale up to the kind of sizes we're seeing now.

UBIFS is much more promising, but as you correctly observe is not yet proven for desktop or server workloads. I'm actually keen to get btrfs working on raw flash, too.

The point is that with stuff done in software, we can do better; whether we do better or not today is a different, and less interesting issue.

After all, we can always implement the same "pretend to be a block device" kind of thing to tide us over in the short term, if we need to. We have three or four such translation layers in Linux already, and more on the way.

Winmodem-like solid state storage

Posted Apr 12, 2009 0:42 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

"So you're saying there is innovation to be done. That means there's something for someone to monopolize with a patent.

"At least with a black box, the patents are all paid for as part of acquiring the box. In contrast, when you need a patent license to run some code in your own Linux system, it usually means the code is useless."

That's a very pessimistic viewpoint. If you truly believe that the patent system is so broken and abused that it prevents all innovation, I'd recommend a career in goat-herding. You obviously wouldn't want to be involved in any form of innovative software development — either Free Software or otherwise.

Thankfully, I don't think it's a valid viewpoint either — as broken as the patent system is, I don't think it's time to throw in the towel just yet.

Winmodem-like solid state storage

Posted Apr 12, 2009 1:33 UTC (Sun) by giraffedata (guest, #1954) [Link]

> > So you're saying there is innovation to be done. That means there's something for someone to monopolize with a patent.
> >
> > At least with a black box, the patents are all paid for as part of acquiring the box. In contrast, when you need a patent license to run some code in your own Linux system, it usually means the code is useless.
>
> That's a very pessimistic viewpoint. If you truly believe that the patent system is so broken and abused that it prevents all innovation, ...

But I said the opposite. I suggested someone would do the innovation. And then patent it. It is not pessimistic to expect an inventor to patent his invention; they do it all the time, even for trivial inventions.

Patents seem to be anathema to the Linux world. I thought you said patents are the reason Linux and Winmodems don't get along; I'm just trying to complete the analogy.

Winmodem-like solid state storage

Posted Apr 12, 2009 2:05 UTC (Sun) by dwmw2 (subscriber, #2063) [Link]

"But I said the opposite. I suggested someone would do the innovation. And then patent it. It is not pessimistic to expect an inventor to patent his invention; they do it all the time, even for trivial inventions."
Then we need to make sure we get there first, patent it ourselves and license the patent appropriately for use in Free Software.

What's the alternative? To always assume that someone will have got there first, and that any software development that's even remotely innovative will fall foul of a patent and thus, in your words, be "useless"?

That's what I meant when I said it "prevents innovation" — I mean it prevents innovation for us, if we always assume everything interesting will already be patented. And that part of the discussion isn't really specific to modems or SSDs, is it? It applies right across the board.

"Patents seem to be anathema to the Linux world. I thought you said patents are the reason Linux and Winmodems don't get along; I'm just trying to complete the analogy."
Modems are a special case, because you need to implement precisely the patented algorithms in order to communicate with another modem using the affected standards.

For flash storage, you don't have to do that; you have a lot more flexibility to come up with something that isn't affected by patents. A closer analogy might be audio/video compression — where the Free Software world was able to come up with the patent-free Ogg and Theora codecs.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 9, 2009 8:28 UTC (Thu) by viiru (subscriber, #53129) [Link]

> Aggressive use of TRIM when committing entries out of the journal would
> make it easier to reap blocks within this faster level of hierarchy, and
> would make the drive less sensitive to the size of the journal. That is,
> the journal could be much larger than the NVRAM size, but you'd still get
> the benefit if the *active* part of the journal fit in the NVRAM.

Aggressive use of TRIM on the journal could also mean TRIMming the journal completely on a clean mount, or even after replaying it on an unclean mount (some care needs to be taken here, though). I think this could help wear leveling on laptops and netbooks quite a bit (they boot often), and those are currently the place where the biggest advantages of SSDs are.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 16, 2009 18:43 UTC (Thu) by wmf (guest, #33791) [Link]

> I wonder if SSDs will ever gain an additional level of hierarchy, such as an NVRAM layer over the flash layer. "Hot write" zones could then live in the NVRAM layer and only be committed to flash infrequently.
If you're willing to pay $50/GB, the STEC ZeusIOPS and TMS RamSAN provide something like this. For affordable SSDs I expect DRAM caches to range between tiny and nonexistent.

Linux Storage and Filesystem Workshop, part 2

Posted Apr 17, 2009 22:10 UTC (Fri) by nix (subscriber, #2304) [Link]

But why? RAM caches are not tiny-to-nonexistent for inexpensive server hardware RAID cards, for instance (Areca has a 256MB cache in its 4-port cards, rising to 2GB, I think, in the huge ones).

Now 256MB may not be immense, but it's surely not tiny either.

Unerase

Posted Apr 8, 2009 21:29 UTC (Wed) by Felix_the_Mac (guest, #32242) [Link]

"sometimes people want to be able to get back the contents of an erroneously-deleted file."

Isn't this desire/expectation just a relic of having been exposed to poorly implemented/insecure file systems in the past? I.e. DOS!

Do we really want to design this in?

Unerase

Posted Apr 9, 2009 12:24 UTC (Thu) by prl (guest, #44893) [Link]

If it is needed, then it needs to be designed in. Properly, with well-understood version control and security (arguably this should be done in user space). And if it *isn't* (whether filesystem-wide or per-file), the OS should be able to scrub a file from the media beyond any reasonable hope of recovery. What we *don't* want is the DOS/FAT style of "well, you might get it back if you're lucky and buy this add-on utility".

One of the problems with letting the device firmware handle this is knowing just how effectively a deleted block has actually been deleted. If the OS has access to the raw hardware, then the user actually gets to control the precise level of undeletability, which strikes me as being what we want.

Unerase

Posted Apr 10, 2009 0:50 UTC (Fri) by aigarius (guest, #7329) [Link]

Users delete important file to which they have no backups. Happens all the time. A very simple way to undelete stuff is a very welcome feature in any operating system. It could be an option in a filesystem, enabled by a parameter.

Intel SSD vs. cheap netbook SSDs

Posted Apr 8, 2009 21:37 UTC (Wed) by osma (subscriber, #6912) [Link]

When Linux SSD performance is discussed Intel SSD hardware always comes up, which is natural since it represents the next generation of SSD storage.

What I'm interested in, though, is not the expensive, big and fast SSDs like the Intel offering but the much cheaper and smaller Flash drives used in netbooks and the like. I'm writing this on an Eee 901 with two SSDs, a relatively fast 4GB system drive and a slower 16GB drive for user data. My understanding is that Linux performance on these drives is not quite optimal.

I currently use ext2 filesystems with the noop scheduler on my Eee, as this seems to be one of the better (at least faster) configurations. What I'd like is a fast, reliable and power-efficient filesystem/scheduler combination for a netbook. I don't have specific complaints about the current state of affairs, but I'd be surprised if there wasn't room for improvements considering that the current filesystems and schedulers were designed with spinning disks in mind.

Is there any development effort currently underway that would improve performance specifically on cheap netbook SSDs, or will the work discussed in the article be relevant for all kinds of SSDs?

Intel SSD vs. cheap netbook SSDs

Posted Apr 8, 2009 21:44 UTC (Wed) by corbet (editor, #1) [Link]

The intent, certainly, is to try to perform as well as possible on the full range of devices.

Repartition and trim

Posted Apr 9, 2009 11:59 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

For heaven's sake, don't allow the data on the device to be splatted just because some unfortunate system manager makes a mistake with fdisk! (I've been there with a plain vanilla disk - after getting over being very scared, I worked out what I did wrong, rewrote the partition table right, and lost no data).

Surely the right answer is just to add a trim command into fdisk and other partition table tools.

Open hardware opportunity?

Posted Apr 9, 2009 12:19 UTC (Thu) by NRArnot (subscriber, #3033) [Link]

Is the "valuable intellectual property" being built into SDDs simply a load of firmware fixing to make the things usable by current and historical operating systems by pretending to be disks? Or is there stuff in there which really does require the sort of realtime or other intense attention that a general purpose operating system can't efficiently provide?

If the former, how about some hardware person designs a fully open PCIX board populated with a load of flash memory and appropriate (minimalist) interfaces? I'm not a hardware engineer, but I suspect that such a board might be quite a simple thing to design. Then implement all the code for using it as a storage device in linux drivers, kernel and filesystems. Could such a thing end up outperforming even the best SSD manufacturers' firmware? And might some genuinely key bit of intellectual property end up being invented here first, and GPLed?

OK I'm dreaming, now shoot me down.

Open hardware opportunity?

Posted Apr 9, 2009 20:31 UTC (Thu) by neli (guest, #51380) [Link]

The SSD hardware/firmware does more than wear levelling alone. I.e. just providing access to an array of flash devices isn't the same thing - consider sector sizes, for example - while also providing those insane read/write speeds.

Open hardware opportunity?

Posted Apr 10, 2009 18:48 UTC (Fri) by giraffedata (guest, #1954) [Link]

> how about some hardware person designs a fully open PCIX board populated with a load of flash memory and appropriate (minimalist) interfaces? ... Then implement all the code for using it as a storage device in linux drivers, kernel and filesystems.

I recently saw a list of four companies doing that. One I remember is FusionIO. (Not actually PCIX, though -- PCI Express (PCIe), which is probably what you meant). In addition to allowing more efficient use by the OS than SSDs, it's also cheaper -- less waste.

It's the best solution for many problems, but SSDs are going to be essential for a long, long time because they're easier to integrate into existing systems.

Open hardware opportunity?

Posted Apr 16, 2009 22:02 UTC (Thu) by wmf (guest, #33791) [Link]

> If the former, how about some hardware person designs a fully open PCIX board populated with a load of flash memory and appropriate (minimalist) interfaces? I'm not a hardware engineer, but I suspect that such a board might be quite a simple thing to design.
The mass market doesn't want this, so it would be more expensive than a traditional SSD. It's a cool idea, though.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 9, 2009 16:59 UTC (Thu) by MisterIO (guest, #36192) [Link]

Aren't the two phrases "I/O requests should be submitted immediately; the block I/O subsystem shouldn't even attempt to merge adjacent requests" and "Bundling things into larger requests will tend to increase the overall bandwidth" in contrast?

Linux Storage and Filesystem Workshop, day 2

Posted Apr 9, 2009 20:31 UTC (Thu) by jzbiciak (guest, #5246) [Link]

It would seem that way. The main thing is that you don't want excessive scheduling between unrelated requests (which is what CFQ and the elevator algorithms do), whereas requests that are naturally large blocks - such as a large streaming read or write - should still go to the drive as large blocks.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 9, 2009 17:31 UTC (Thu) by iabervon (subscriber, #722) [Link]

In the spirit of this particular workshop, you shouldn't call it "receiving another attendee's notes", you should call it "rebuilding your RAGE".

(Where, of course, a "RAGE" is a Redundant Array of Grumpy Editors)

Linux Storage and Filesystem Workshop, day 2

Posted Apr 14, 2009 2:52 UTC (Tue) by xoddam (subscriber, #2322) [Link]

SOL (Snigger Out Loud).

Linux Storage and Filesystem Workshop, day 2

Posted Apr 10, 2009 19:04 UTC (Fri) by giraffedata (guest, #1954) [Link]

> they will sell an array which claims to be much larger than the amount of storage actually installed.

This is a really poor description of thin provisioning. For those who didn't get it: thin provisioning means the storage server allows you to create volumes whose total size exceeds the actual storage capacity of the system. So you don't buy disk drives to back the unused portion of those volumes.

> But that leads to an interesting problem if trim is used to discard the contents of some blocks in the middle of the file. If the application later writes to those blocks - which are, theoretically, still in place - the system could discover that the device is out of space and fail the request. That, in turn, could lead to chaos.

That's no more interesting than the simpler problem that when you go to extend a sequential file, the write request fails even though the filesystem has space available.

These systems work only if you can manage the system in such a way that running out of actual storage space is about as rare as a power supply failure. That means keeping a large amount of unused space at all times and monitoring consumption rates.

I think to do it right, the storage system would probably also have to slow down as it approaches full so as to protect itself from a runaway storage consumer.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 17, 2009 21:10 UTC (Fri) by willy (subscriber, #9762) [Link]

> That's no more interesting than the simpler problem that when you go to extend a sequential file, the write request fails even though the filesystem has space available.
This was in the context of an application having used the posix_fallocate() call. If that call succeeds, the application is guaranteed to be able to use the storage that has been so allocated. Thin Provisioning breaks this, and it's far from clear how to fix it.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 18, 2009 22:10 UTC (Sat) by giraffedata (guest, #1954) [Link]

> > That's no more interesting than the simpler problem that when you go to extend a sequential file, the write request fails even though the filesystem has space available.
>
> This was in the context of an application having used the posix_fallocate() call. If that call succeeds, the application is guaranteed to be able to use the storage that has been so allocated. Thin Provisioning breaks this, and it's far from clear how to fix it.

No, posix_fallocate() doesn't guarantee you can use the storage. Your write to it could fail, for example, due to a media defect. What posix_fallocate() guarantees is that a write won't fail because there is no space left in the filesystem. That's the same guarantee you get from mkfs that if you create a 1G filesystem, you can put (about) 1G of files in it. So it's equally interesting that with thin provisioning, your fallocated space within a file may be unusable as that your filesystem space may be unusable for extending the file.

With thin provisioning, the filesystem storage medium is a volume in the storage server, which does in fact have space for the write in question. What doesn't have space is the server's pool of backing storage, and the effect of that is a write to certain sectors of the volume fails.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 20, 2009 14:35 UTC (Mon) by willy (subscriber, #9762) [Link]

The text from the standard says:

> DESCRIPTION
>
> The posix_fallocate() function shall ensure that any required storage for regular file data starting at offset and continuing for len bytes is allocated on the file system storage media. If posix_fallocate() returns successfully, subsequent writes to the specified file data shall not fail due to the lack of free space on the file system storage media.

I stand by my assertion that it is currently impossible to provide this with Thin Provisioning without actually writing to every single block.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 20, 2009 16:05 UTC (Mon) by giraffedata (guest, #1954) [Link]

That's entirely consistent with what I wrote. There is space in the filesystem. (If there weren't, the write would fail with ENOSPC and statfs() would show no space).

The only problem is that writes to certain sectors of the storage medium fail for reasons out of the scope of POSIX.

There's layering going on here.

Linux Storage and Filesystem Workshop, day 2

Posted Apr 11, 2009 0:27 UTC (Sat) by neilbrown (subscriber, #359) [Link]

readdirplus()

> The creation of this system call looks like an opportunity to leave some old mistakes behind. So, for example, it will not support seeking within a directory.

There is some irony here. The name "readdirplus" comes from NFS - NFSv3 introduced a READDIRPLUS protocol request for exactly the same reason.

And to support NFS, you absolutely need to be able to seek within a directory. It is conceivable that some .x version of NFSv4 might remove this requirement. But until we agree to stop supporting NFSv3, we need to be able to seek within a directory.


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds