The future is bright for Linux filesystems
In a recent article, Linux File Systems: Ready for the Future?, Henry Newman expands on what he feels are shortcomings in current GNU/Linux filesystems. Specifically, he believes current Linux filesystem technology cannot meet the demands that massive implementations of 100TB or larger require. He states he received some emotional responses trying to either refute his information or impugn his character, although those comments do not show on either of the article's pages. This prompted me to get the real scoop on how Linux filesystem technology is trying to keep pace with the ever-growing need for storage space.
Most of the native filesystems for Linux do have an issue with massive size. In fact, a report from 2006 highlighted this issue. The main Linux.org server running ext3 experienced a failure at the RAID level, requiring fsck to repair the ext3 filesystem. The repair took more than a week to complete. Imagine a company having hundreds of terabytes of commercial data unavailable for a week, when some companies can lose hundreds of thousands of dollars per minute of downtime! Demand for storage space is growing fast, and many current Linux filesystems are not quite keeping up.
At the recent Linux Collaboration Summit in Austin, Texas, the Linux Foundation stressed this issue again and highlighted several projects that aim to address this looming problem. The Linux Foundation's Linux Filesystem Weather Forecast covers much of the up-and-coming technology. Instead of simply rehashing the information offered there, I wanted to delve deeper into the question of whether Linux simply cannot be used where single instances of massive filesystems are necessary.
XFS - the current "champion"
In mid-1999, SGI announced it would release XFS to open source developers, giving GNU/Linux one of its first journaling filesystems. Among its features, XFS supports 64-bit addressing with the possibility of a 9 exabyte filesystem, up to 4GB individual extents, high bandwidth, and other high-performance features.
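As a rough check on where a figure like "9 exabytes" comes from, the arithmetic can be sketched as follows. This assumes the limit corresponds to a signed 64-bit byte offset (63 usable bits), which is an assumption made for illustration and is not stated in the announcement; the helper name `max_bytes` is hypothetical.

```python
# Assumption: the "9 exabyte" figure corresponds to a signed 64-bit byte
# offset, i.e. 2**63 addressable bytes. This is an illustration only.

EXABYTE = 10 ** 18   # decimal exabyte (EB)
EXBIBYTE = 2 ** 60   # binary exbibyte (EiB)


def max_bytes(address_bits: int) -> int:
    """Number of bytes addressable with the given number of address bits."""
    return 2 ** address_bits


limit = max_bytes(63)
print(f"{limit / EXABYTE:.2f} EB, or {limit // EXBIBYTE} EiB")
```

Under the same assumption, a full unsigned 64-bit byte offset would give twice that amount.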
Newman, however, does not feel XFS supports the features necessary for massive filesystems (hundreds of terabytes) for extremely high I/O. "XFS allocation for data and metadata is still a problem as far as I am concerned," he states. "The metadata areas are not aligned with RAID strips and allocation units are FAR too small but better than ext. As for repair given problems for allocation it takes a LONG time but again better than ext." From his second article, he further clarifies this issue, "Almost all high-performance file systems that work at large scale do this automatically, and metadata regions are always aligned to RAID stripes, as metadata is often separated from the data on different devices so the RAID stripes can be aligned with the allocation of the data and the metadata."
Why is aligning the metadata with the RAID striping important? "[Y]ou must fsck the whole file system... Since the metadata is spread through the file system and all the metadata must be checked, given a variety of issues from power to RAID hiccups to undetected or mis-corrected errors, there is going to be a great deal of head-seeking involved and read-modify write on the RAID devices, since the metadata areas are not aligned with RAID stripe applications."
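The cost of misalignment can be illustrated with a simplified model. The sketch below is not taken from any filesystem or RAID implementation; it assumes a parity RAID layout in which any write covering only part of a full stripe forces a read-modify-write, and the function name and parameters are hypothetical.

```python
def stripe_write_cost(offset: int, length: int,
                      stripe_unit: int, data_disks: int) -> tuple[int, int]:
    """Return (full_stripes, partial_stripes) touched by one write.

    Simplified model: a full stripe is stripe_unit * data_disks bytes.
    A stripe that is only partially covered by the write is counted as
    partial, which on parity RAID typically means a read-modify-write.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    full_stripe = stripe_unit * data_disks
    end = offset + length
    first = offset // full_stripe
    last = (end - 1) // full_stripe

    full = partial = 0
    for stripe in range(first, last + 1):
        start = stripe * full_stripe
        covered = min(start + full_stripe, end) - max(start, offset)
        if covered == full_stripe:
            full += 1
        else:
            partial += 1
    return full, partial


# 64 KiB stripe unit across 4 data disks: a 256 KiB full stripe.
KIB = 1024
print(stripe_write_cost(0, 256 * KIB, 64 * KIB, 4))     # aligned write
print(stripe_write_cost(4096, 256 * KIB, 64 * KIB, 4))  # same write, shifted by 4 KiB
```

In this model the aligned write lands on one full stripe, while the same write shifted by 4 KiB touches two stripes and fills neither.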
So does XFS handle this kind of feature? For quite a while, XFS has offered the ability to tune its stripe unit when creating the filesystem using the "sunit=value" option within mkfs.xfs. Martin Petersen from Oracle states further, "The stuff that queries MD/LVM for stripe unit/stripe size has been in XFS for a while. For hardware RAID there is no non-proprietary way to obtain the information from the device. So whoever runs mkfs on a hardware RAID device must manually specify the geometry using the sunit and swidth parameters."
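For a hardware RAID device, the manual geometry Petersen describes comes down to a small calculation. The sketch below assumes that mkfs.xfs takes `sunit` and `swidth` in 512-byte units and that `swidth` is the stripe unit multiplied by the number of data-bearing disks; the device path, the 64 KiB chunk size, and the helper names are hypothetical, and the script only prints a command rather than running it.

```python
SECTOR = 512  # mkfs.xfs sunit/swidth are assumed to be in 512-byte units


def xfs_geometry(chunk_bytes: int, data_disks: int) -> tuple[int, int]:
    """Return (sunit, swidth) for a RAID chunk size and data-disk count."""
    if chunk_bytes <= 0 or chunk_bytes % SECTOR:
        raise ValueError("chunk size must be a positive multiple of 512 bytes")
    if data_disks < 1:
        raise ValueError("need at least one data disk")
    sunit = chunk_bytes // SECTOR
    swidth = sunit * data_disks
    return sunit, swidth


def mkfs_command(device: str, chunk_bytes: int, data_disks: int) -> str:
    """Build (but do not execute) an mkfs.xfs command line."""
    sunit, swidth = xfs_geometry(chunk_bytes, data_disks)
    return f"mkfs.xfs -d sunit={sunit},swidth={swidth} {device}"


# Hypothetical array: 64 KiB chunks, 8 data disks (parity disks excluded).
print(mkfs_command("/dev/sdX", 64 * 1024, 8))
```

Parity disks are left out of the count on purpose, since they do not add to the width of data in a stripe.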
Petersen goes on to reference the new SCSI Block Commands version 3 (SBC-3), which will include the ability for array firmware to push RAID geometry information to the OS. He sees two issues with this so far: SBC-3 is a work in progress, making it a moving target, and he has not seen any firmware that implements the geometry feature. Keeping GNU/Linux portable means using open standards, and that means relying on non-proprietary mechanisms to automate alignment. Better alignment automation will come once manufacturers implement SBC-3 more widely and the feature is fleshed out in XFS.
Newman also contends that XFS does not allow large enough block sizes for massive filesystems. Christoph Hellwig, a kernel and XFS developer, took exception to this claim. "It does huge allocations because ... allocations sizes don't have anything to do with filesystem block size. It is explicitly designed for parallel operation with lots of parallel writers and high CPU counts, etc. It now has an extremely fast fsck and it is used in large-scale production HPC and file serving setups in the near petabyte range today, with up to hundreds of CPU in single system image setups, and produces high iops and high streaming bandwidth as well as very good filesystem aging behaviour."
Eric Sandeen from Red Hat states, "A 100 million inode filesystem is not that uncommon on xfs, and some of the tests he proposes are probably in daily use at SGI customers." Hellwig adds, "For streaming I/O workloads it doesn't matter anymore. The direct to bio I/O path mitigates any blocksize impact." The filestreaming feature in XFS does work towards this end, helping to keep a single file from being broken up across a RAID array among other performance gains.
Regardless of whether XFS fully answers Newman's criteria, there are several new filesystems on the horizon that will move FOSS further into this massive size arena.
ext4 will be the latest implementation of the native ext filesystem for Linux. Judging from the initial plans and development, ext4 is a large leap from ext3 with significant improvements. The total filesystem size could reach up to 1 exabyte, making it a prime candidate for massive implementations. ext4 developers do warn that the filesystem hasn't been tested extensively at massive sizes, so more work is in progress to allow it to scale successfully. Journal data will have checksums to increase integrity and performance. fsck will run much faster since unused portions of the disk won't have to be checked. There are plenty of other features; you can follow many of them on the project's features wiki page.
When I posed the issues Newman brought up, most turned out to be either answered already in ext4 or works in progress. Andreas Dilger from Sun states the RAID stripe alignment feature "is actually done in ext4 with mballoc." He does, however, add, "The ext4/mballoc code does NOT align the metadata to RAID boundaries, though this is being worked on also." Touching on block size, Dilger says, "What Henry is misunderstanding here is that the filesystem blocksize isn't necessarily the maximum unit for space allocation. I agree we could do this more efficiently (e.g. allocate an entire 128MB block group at a time for large files), but we haven't gotten there yet."
Block protection (a.k.a. T10 DIF) boosts data integrity in filesystems by adding 8 bytes to each 512-byte block of data. These consist of 2 bytes of CRC, 2 bytes for an application tag, and 4 bytes for a reference tag. Martin Petersen adds to this, "The way DIF works is that you add a checksum to the I/O when it is submitted. If there's a mismatch, the HBA or the drive will reject the I/O." ext4 developers already wish to include this in ext4, and XFS and BTRFS work fine with T10 DIF now.
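The 8-byte layout described above can be sketched in a few lines. This is a minimal illustration, not the kernel's or any HBA's implementation: it assumes the guard CRC uses the 16-bit polynomial 0x8BB7 with a zero initial value and that the three fields are stored big-endian, and the function names are hypothetical.

```python
import struct

T10_DIF_POLY = 0x8BB7  # assumed 16-bit CRC polynomial for the guard tag


def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16, MSB first, initial value 0, no final XOR."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ T10_DIF_POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def protection_tuple(sector: bytes, app_tag: int, ref_tag: int) -> bytes:
    """Build the 8 bytes: 2-byte CRC, 2-byte application tag, 4-byte reference tag."""
    if len(sector) != 512:
        raise ValueError("expected a 512-byte sector")
    return struct.pack(">HHI", crc16_t10dif(sector), app_tag, ref_tag)


def verify(sector: bytes, tuple_bytes: bytes) -> bool:
    """Return True if the stored CRC matches the sector contents."""
    guard, _app_tag, _ref_tag = struct.unpack(">HHI", tuple_bytes)
    return guard == crc16_t10dif(sector)


sector = bytes(range(256)) * 2            # 512 bytes of sample data
pi = protection_tuple(sector, app_tag=0, ref_tag=1234)
print(len(pi), verify(sector, pi))

corrupted = bytes([sector[0] ^ 0x01]) + sector[1:]
print(verify(corrupted, pi))              # a flipped bit no longer matches
```

The mismatch in the last line corresponds to the case Petersen describes, where the HBA or drive would reject the I/O.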
BTRFS is licensed under the GPL and is also being written specifically to answer the issue of large-scale installations. Among other features, extent-based BTRFS touts dynamic inode allocation, subvolumes, online and fast offline filesystem checks, and a theoretical 18 exabyte maximum size. Early alphas of BTRFS were announced in June 2007, and version 0.15 was released on May 29th, 2008. Tools for multiple devices have been included since 0.14, and there are plans to expand them.
RAID stripe alignment and splitting data and metadata are possible future developments. Miguel Filipe states that BTRFS "does allow to specify to have different replication/striping policies for metadata and data." The ability to align to RAID stripes is already being discussed by developers in the BTRFS IRC channel, according to Filipe. Chris Mason from Oracle adds, "Btrfs can duplicate metadata via the internal raid1 and raid10 code. There is no code today in btrfs to force data and metadata to different devices, but the disk format has the bits it needs to make that happen." Concerning block sizes, Mason says that "larger [block] sizes will soon work again once better btree locking code is in place."
The last filesystem, HAMMER, is currently being written by Matt Dillon for DragonFly BSD rather than for Linux. Even so, its maximum size of half an exabyte certainly qualifies it for this discussion. I asked Dillon if HAMMER will be portable, and he replied, "Yes, but it will not remove the work involved in actually porting the filesystem. To properly connect the filesystem to an operating system requires connecting the front-end vnode ops and the back-end IO subsystem." He continues, "I expect the structure of HAMMER's source files to change after the release as people begin porting it to other environments. I hope to support all such ports in the primary source tree with minimal code pollution."
Dillon did say the current HAMMER version is not yet tuned for performance, as its low-level storage allocator still needs improvement. Other features in HAMMER are planned for performance optimization. "My intention is to depend heavily on HAMMER's reblocker to optimize the filesystem's data and meta-data layout. The reblock basically runs through the B-Tree and reallocates the B-Tree nodes and data in order to make things contiguous within the blockmap (and, ultimately, on the physical disk). The idea is to have a cron job run the reblocker for a fixed period of time daily, resulting in an optimal long-term disk layout."
HAMMER also has the ability to scale with additional work from interested developers. Dillon claims, "With regard to RAID striping, while the current HAMMER allocator has no real concept of RAID striping, people interested in pursuing that area can make adjustments to the low level allocator and reblocker to optimize the layout for RAID. I expect that a very large stripe size would have to be used... probably the HAMMER large-block size, which is 8MB, to best optimize the layout."
I have attempted to answer Henry Newman's contentions concerning high-performance I/O on GNU/Linux and to give some insight into the true future of Linux filesystems for massive installs. Linux has proved itself time and again as much more than a mere "desktop replacement for Windows," as Newman quipped in his first article. XFS is a proven technology for these kinds of environments today, especially with the features added within the past couple of years. New filesystems are already on the horizon, ready to step in and scale well. Looking from the outside, Newman can be excused for not realizing that developers already recognize many of the issues he has brought up and that projects are already underway, yet I trust he no longer dismisses XFS and GNU/Linux outright. I hope this initiates even more discussion that will lead to further improvements. My thanks go to Henry Newman and the various developers who suffered my questions.