Kernel development

Brief items

Kernel release status

The current 2.6 prepatch remains 2.6.17-rc6; no prepatches have been released over the past week. A few dozen fixes have been merged into the mainline repository since -rc6 was released, but the pace has slowed considerably. The 2.6.17 final release may well happen in the near future.

The current -mm tree is 2.6.17-rc6-mm2. Recent changes to -mm include a new statistics infrastructure for the memory management subsystem, virtualized namespaces for SYSV interprocess communications primitives, and some lock validator work.

Kernel development news

Quote of the week

I think the interesting point is how we're moving away from the "global development" model (ie everything breaks at the same time between 2.4.x and 2.6.x), and how the fact that we're trying to maintain a more stable situation may well mean that we'll see more of the "local development" model where a specific subsystem goes through a development series, but where stability requirements mean that we must not allow it to disturb existing users.

And even more interestingly (at least to me), the question might become one of "how does that affect the tools and build and configuration infrastructure", and just the general flow of development.

-- Linus Torvalds

Ext3 for large filesystems

Linux supports a wide variety of filesystems. While it is true that the Linux VFS layer treats all filesystems equally, the ext3 filesystem is certainly the first among equals. Ext3 is the default choice for a large majority of distributions; it can thus be found on vast numbers of installed Linux systems. If any filesystem were to be named the Linux filesystem, it would be ext3.

Ext3 is based on decades of experience with Unix filesystems. As a result, it is relatively straightforward to understand and highly reliable in its operation. It is, however, also showing its age in a number of ways. One of those is the maximum size of the underlying device it can handle. This limit is a mere 8 TB. That is enough to hold most of our mail spools - even before spam filtering - but it is a limit which is already affecting some users. With the size of contemporary disks, the creation of an 8 TB array is not an entirely outlandish thing to do now, and it will only become easier over time.

There are a couple of reasons for this limit. One of them is the use of 32-bit block numbers within the filesystem - and signed 32-bit numbers at that. The ext3 code can only track 2 gigablocks, which, using a 4K block size, sets the limit at 8 TB. Switching to an unsigned type can double that limit, but that only pushes back the problem by about one year. Clearly, larger block numbers are required.
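The arithmetic behind these limits is easy to check. What follows is a minimal sketch; the helper name max_device_bytes() is invented for illustration and does not appear in the ext3 code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: the largest device size, in bytes, which can be
 * addressed when block numbers have block_bits usable bits and each
 * block holds block_size bytes.
 */
static uint64_t max_device_bytes(unsigned int block_bits, uint64_t block_size)
{
    return ((uint64_t)1 << block_bits) * block_size;
}
```

A signed 32-bit block number leaves 31 usable bits, so max_device_bytes(31, 4096) yields 2^43 bytes, or 8 TB; a full 32 bits doubles that to 16 TB.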

The other problem has to do with how ext3 tracks the blocks associated with any given file. The ext3 inode structure contains an array of fifteen 32-bit pointers; the first twelve of those pointers contain the indexes of the first twelve blocks of the file. Thus, with a filesystem using 4K blocks, the first twelve pointers can describe a file of up to 48KB in length. If the file exceeds that length, an "indirect block" is created. This block is a big array of block pointers, holding the indexes for the next 1024 blocks; the 13th pointer in the inode structure tracks the location of this indirect block. Should that space not suffice, the 14th pointer is used for a double-indirect block - a block holding pointers to indirect blocks. Finally, the 15th pointer will be used for a triple-indirect block if need be.
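The pointer layout described above can be sketched as a small function which reports how many levels of indirection are needed to reach a given block of a file. This is an illustration only, not the ext3 code; the names NDIR_BLOCKS and indirection_depth() are made up for the example, and 32-bit (four-byte) block pointers are assumed.

```c
#include <stdint.h>

#define NDIR_BLOCKS 12u    /* direct pointers held in the inode (hypothetical name) */

/*
 * Return 0 if logical block lblock is reached through a direct pointer,
 * 1, 2, or 3 if it needs a single-, double-, or triple-indirect block,
 * and -1 if it lies beyond what the fifteen pointers can describe.
 */
static int indirection_depth(uint64_t lblock, uint32_t block_size)
{
    uint64_t ptrs = block_size / 4;    /* 32-bit pointers per block */
    uint64_t span = ptrs;              /* blocks covered at this depth */
    int depth;

    if (lblock < NDIR_BLOCKS)
        return 0;
    lblock -= NDIR_BLOCKS;

    for (depth = 1; depth <= 3; depth++) {
        if (lblock < span)
            return depth;
        lblock -= span;
        span *= ptrs;
    }
    return -1;
}
```

With 4K blocks, block 11 is still direct, block 12 is the first to need the indirect block, and block 1036 (12 + 1024) is the first to need the double-indirect block.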

This arrangement is not too different from how Unix systems structured filesystems two decades or more ago. It imposes a per-file maximum size of about 4 TB - big, but perhaps limiting for today's hot applications (such as comprehensive, nationwide telephone call archival). It works well for small files but, as files get larger, this organization becomes increasingly inefficient. Keeping a pointer to every single block is expensive, both in terms of space usage and the time it can take to locate a specific file block. Since larger filesystems will tend to hold larger files, this overhead becomes increasingly limiting over time.
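The 4 TB figure follows from counting the blocks which the fifteen pointers can reach. A rough sketch, assuming four-byte block pointers (the function name is invented for illustration):

```c
#include <stdint.h>

/*
 * Hypothetical helper: the number of file blocks reachable through
 * twelve direct pointers plus one single-, one double-, and one
 * triple-indirect block.
 */
static uint64_t max_mapped_blocks(uint32_t block_size)
{
    uint64_t ptrs = block_size / 4;    /* 32-bit pointers per block */

    return 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs;
}
```

For 4K blocks the triple-indirect term dominates: 1024^3 blocks of 4K each is 4 TB. A file of that size also needs over a billion block pointers, or roughly 4 GB of metadata, which is the overhead that extents are meant to reduce.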

A solution to these problems can be found in the extents and 48-bit support patch set. These patches have been posted by Mingming Cao; many other developers - especially Alex Tomas - have worked on them as well. They change the way files are stored to make things more efficient, and to allow the filesystem to index the blocks on larger devices.

The core of the patch is the support for extents. An extent is simply a set of blocks which are logically contiguous within the file and also on the underlying block device. Most contemporary filesystems put considerable effort into allocating contiguous blocks for files as a way of making I/O operations faster, so blocks which are logically contiguous within the file often are also contiguous on-disk. As a result, storing the file structure as extents should result in significant compression of the file's metadata, since a single extent can replace a large number of block pointers. The reduction in metadata should enable faster access as well.

An ext3 filesystem mounted with the extents option enabled will handle files stored in the old way, using block pointers, as always. New files will be created using extents, however. In these files, the fifteen-pointer array described above is overlaid with a new data structure. There is a short header, followed by a few occurrences of this structure:

    struct ext3_extent {
        __le32  ee_block;     /* first logical block extent covers */
        __le16  ee_len;       /* number of blocks covered by extent */
        __le16  ee_start_hi;  /* high 16 bits of physical block */
        __le32  ee_start;     /* low 32 bits of physical block */
    };

Here, ee_block is the index (within the file, not on disk) of the first block covered by this extent. The number of blocks in the extent is stored in ee_len, and the pointer to the first of those blocks (on disk, now) lives in the combination of ee_start and ee_start_hi. By storing physical block numbers this way, ext3 can handle 48-bit block numbers - enough to index a 1024 PB device when 4K blocks are used. That should be enough to last for a couple of years or so.
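As an illustration of how these fields fit together, here is a userspace sketch which uses plain fixed-width types in place of the kernel's little-endian types and assumes that the fields have already been converted to host byte order. The helper names extent_start() and extent_map() are invented for this example and are not taken from the patch.

```c
#include <stdint.h>

/* Userspace stand-in for the on-disk structure, in host byte order. */
struct extent {
    uint32_t ee_block;     /* first logical block extent covers */
    uint16_t ee_len;       /* number of blocks covered by extent */
    uint16_t ee_start_hi;  /* high 16 bits of physical block */
    uint32_t ee_start;     /* low 32 bits of physical block */
};

/* Combine the two halves into one 48-bit physical block number. */
static uint64_t extent_start(const struct extent *ex)
{
    return ((uint64_t)ex->ee_start_hi << 32) | ex->ee_start;
}

/*
 * Translate a logical (in-file) block into a physical (on-disk) block.
 * Returns 1 and stores the result in *pblock if the extent covers
 * lblock; returns 0 otherwise.
 */
static int extent_map(const struct extent *ex, uint32_t lblock,
                      uint64_t *pblock)
{
    if (lblock < ex->ee_block || lblock - ex->ee_block >= ex->ee_len)
        return 0;
    *pblock = extent_start(ex) + (lblock - ex->ee_block);
    return 1;
}
```

One 12-byte extent with ee_len set to 1000 describes what would otherwise take 1000 four-byte block pointers.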

For files with few extents, all of the information can be stored within the on-disk inode itself. As the number of extents grows, however, the available space runs out. In that case, a form of indirect blocks is used; the in-inode extents array describes ranges of blocks holding extents arrays of their own. The tree of indirect extents blocks can grow to an essentially unlimited depth, allowing the filesystem to represent even very large, highly-fragmented files.

Beyond extents, relatively little had to be done to prepare ext3 for 48-bit block addressing. The signed, 32-bit block numbers are gone, having been converted to the larger sector_t type. Some reserved space in the ext3 superblock has been grabbed to store the high 16 bits of some global block counts. Much of the tracking of free blocks within the filesystem is done using block numbers relative to the beginning of the block group, so that code did not need to change much at all. A few tweaks to the journaling code were required for it to be able to handle the larger block numbers.
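To see why group-relative block numbers insulate the free-block code from the wider type, consider this sketch. It assumes 32768 blocks per group (what one 4K bitmap block can describe) and ignores any offset for the first data block; the structure and function names are hypothetical.

```c
#include <stdint.h>

/* Hypothetical pair: a block group index plus an offset within that group. */
struct group_pos {
    uint64_t group;     /* which block group */
    uint32_t offset;    /* block number relative to the group start */
};

/* Split an absolute block number into (group, offset). */
static struct group_pos split_block(uint64_t block, uint32_t blocks_per_group)
{
    struct group_pos pos;

    pos.group = block / blocks_per_group;
    pos.offset = (uint32_t)(block % blocks_per_group);
    return pos;
}

/* Rebuild the absolute block number from (group, offset). */
static uint64_t join_block(struct group_pos pos, uint32_t blocks_per_group)
{
    return pos.group * blocks_per_group + pos.offset;
}
```

However large the absolute block number becomes, the offset within the group stays small, so code which works only with offsets never sees a value which needs more than 32 bits.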

The end result is an enhancement to the ext3 filesystem which enables it to work with much larger devices. Existing filesystems can use the new features immediately with no dump-and-restore cycle. It would appear to be (nearly) universally agreed that these changes turn ext3 into a better filesystem. Whether that better filesystem should still be called ext3 is controversial, but that is a subject for another article.

Time for ext4?

As described in the preceding article, patches which add extents and 48-bit capability to the ext3 filesystem have been circulated for wider review. Everybody seems to agree that these changes are good and should be part of the Linux kernel. Well, almost everybody agrees. But the way in which these features get in has become the inspiration for an extended discussion on how filesystem development should work.

Some developers, most prominently Jeff Garzik, have expressed concerns about merging these changes into ext3; they would rather see a new ext4 filesystem created for new features. There are a number of reasons put forward for doing things this way. First and foremost, perhaps, is the fact that using the extents/48-bit features results in filesystems which are no longer backward compatible. If a system administrator enables extents on a filesystem, a special "incompatible feature" flag will be set in the filesystem superblock. Thereafter, it will no longer be possible to mount that filesystem with any older kernel which does not recognize that flag. Until now, it has generally been possible to mount ext3 filesystems on older kernels - even those which only support ext2 (with one ugly exception involving a distributor which was heavily pushing SELinux features).
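The logic behind an "incompatible feature" flag can be sketched in a few lines. The flag names and bit values below are invented for illustration and should not be read as the real on-disk constants.

```c
#include <stdint.h>

/* Hypothetical feature bits; real filesystems define their own values. */
#define FEAT_INCOMPAT_OLD_A    0x0001u
#define FEAT_INCOMPAT_OLD_B    0x0002u
#define FEAT_INCOMPAT_EXTENTS  0x0040u

/*
 * A kernel may mount a filesystem only if it understands every
 * incompatible feature bit set in the superblock.
 */
static int can_mount(uint32_t sb_incompat, uint32_t supported)
{
    return (sb_incompat & ~supported) == 0;
}
```

An older kernel whose supported mask lacks the extents bit will refuse the mount as soon as that bit shows up in the superblock, whatever the filesystem is called.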

The overall effect of all these changes on filesystem stability is also a concern. Filesystems are important, and users tend to take a very dim view of "upgrades" which introduce bugs or impact performance. As Linus puts it:

For me, the biggest cost tends to actually be support. A stable filesystem that is used by thousands and thousands of people and that isn't actually developed outside of just maintaining it IS A REALLY GOOD THING TO HAVE.

The incorporation of major features into ext3 certainly takes it out of the "just maintaining it" realm.

As more features are added, the filesystem code (which must support filesystems both with and without that feature) gets more complicated. In particular, one sees increasing amounts of code which looks like:

    if (has_this_fancy_feature)
        do_it_the_fancy_way();
    else
        do_it_the_old_boring_way();

Such code can be harder to follow, and it tends not to isolate the new feature code as nicely as one might like. If, instead, one were to put the new features into a new filesystem, a lot of these conditionals could be taken out.

Finally, it is said that the need to be so careful about backward compatibility is a drag on filesystem development. By separating development filesystems from those which are meant to be stable, the developers can push forward with the new capabilities they would like to implement. For practical examples, consider the separation of ext2 and ext3, the separation of the SMB and CIFS filesystems, and the creation of libata rather than shoehorning serial ATA support into the old ATA drivers.

Needless to say, the ext3 developers have their own take on all of this. A filesystem with the new features will not work on older kernels regardless of whether it is called ext3 or ext4. Since a feature like extents must be explicitly enabled by the system administrator (assuming the distributor does not quietly do it for them), nobody should be surprised by a filesystem which no longer works on older systems. Pushing the new features into an ext4 would simply slow their uptake without buying much else.

While some think that splitting out development into a new filesystem will ease code maintenance, others are less sure. In particular, there is worry that bugs fixed in one of the filesystems may not get fixed in the other.

It has been noted, repeatedly, that users very much like to be able to get new features into their filesystems without having to back up and restore the whole thing. The transition from ext2 to ext3 is a clear example of how this can work; if moving to ext3 had required restoring the filesystem from scratch, ext3 would have been adopted much more slowly, and less universally, than it was. As this example shows, however, putting new features into a new ext4 filesystem would not necessarily preclude this sort of upgrade.

The ext3 developers also point out that they have been working on that filesystem for many years and have not yet created big problems for the Linux user base. They have, they feel, earned a certain amount of trust. So they would rather move ahead with some features which have been put together with great care and extensive review rather than cloning ext3 into ext4 and starting something new.

An attempt to guess how all this might settle out could start with these words from Linus:

Quite frankly, at this point, there's no way in hell I believe we can do major surgery on ext3. It's the main filesystem for a lot of users, and it's just not worth the instability worries unless it's something very obviously transparent.

Yet another point of view comes from Andrew Morton:

All that being said, Linux's filesystems are looking increasingly crufty and we are getting to the time where we would benefit from a greenfield start-a-new-one. That new one might even be based on reiser4 - has anyone looked? It's been sitting around for a couple of years.

As reiser4 shows, getting a truly new filesystem into the kernel isn't necessarily an easy thing to do. It may well not happen before large numbers of users start running into the current limits of ext3. So the current set of enhancements will probably find its way in - though what the resulting filesystem will be called is still not entirely clear.

64-bit resources

"Resource" is the term used within the Linux kernel for a specific set of I/O-related hardware resources - I/O memory and ports, in particular. Device drivers allocate specific resources with functions like request_region(), but, underneath that layer, Linux has a set of generic resource allocation utilities. And at the core of that code is struct resource, which tracks individual resource allocations. A read of /proc/iomem or /proc/ioports is really just dumping out one of those resource data structures.

Since the resource management code was added by Linus at the beginning of the 2.3 development cycle, the unsigned long type has been used to track actual resource values. That worked at the time, but it can be problematic on 32-bit machines which have I/O memory resources at high addresses. If a memory region is located out of the 32-bit range, the resource management code can no longer deal with it.
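The failure mode can be demonstrated in userspace. In this sketch a 32-bit integer type stands in for unsigned long as found on a 32-bit machine; the type and function names are made up for the example, and len is assumed to be nonzero.

```c
#include <stdint.h>

/* Stand-in for "unsigned long" on a 32-bit machine. */
typedef uint32_t ulong32_t;

/*
 * Return 1 if the region [start, start + len - 1] survives being
 * stored in 32-bit variables, 0 if either end would be truncated.
 */
static int fits_in_32bit_resource(uint64_t start, uint64_t len)
{
    uint64_t end = start + len - 1;

    return (uint64_t)(ulong32_t)start == start &&
           (uint64_t)(ulong32_t)end == end;
}
```

A region starting at 0x100000000 fails the check: stored in a 32-bit variable, its start address would silently become zero.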

The solution, of course, is to start using 64-bit numbers to track resource allocations. Vivek Goyal (along with others) has been working for some time on a set of patches which makes this change. Those patches have been fixed up by Greg Kroah-Hartman and, by all appearances, are set for merging once the 2.6.18 development cycle starts.

Introducing new typedefs into the kernel is generally frowned upon, but this patch creates resource_size_t anyway. Early in the patch series, this type is just unsigned long; only when the type has propagated through the source is it changed to a 64-bit value. There is a configuration option controlling whether 64-bit resources are used; interestingly, 64-bit is the default, and the old 32-bit mode is marked "experimental."

The result of the change is that the prototypes for some commonly-used functions change:

    struct resource *request_region(resource_size_t start,
                                    resource_size_t n,
                                    const char *name);
    void release_region(resource_size_t start, resource_size_t n);

    struct resource *request_mem_region(resource_size_t start,
                                        resource_size_t n,
                                        const char *name);
    void release_mem_region(resource_size_t start, resource_size_t n);

Most driver code will be unaffected by these changes; simple constant resource locations will still work, and, in many cases, the bus layer handles the details of resource allocation anyway. But, in cases where a driver is directly storing or working with resource locations, the new type will have to be used.
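What "using the new type" might look like can be sketched in userspace. Everything here other than the name resource_size_t is an assumption made for the example: the USE_32BIT_RESOURCES switch stands in for the kernel configuration option, and struct demo_dev and demo_format_range() are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical build switch standing in for the configuration option. */
#ifdef USE_32BIT_RESOURCES
typedef uint32_t resource_size_t;
#else
typedef uint64_t resource_size_t;
#endif

/* A made-up driver structure which stores a resource location. */
struct demo_dev {
    resource_size_t mmio_start;
    resource_size_t mmio_len;
};

/*
 * Format the device's memory range as "start-end" in hexadecimal.
 * The casts to unsigned long long keep the format string correct
 * whichever width resource_size_t turns out to have.
 */
static int demo_format_range(const struct demo_dev *dev, char *buf,
                             size_t size)
{
    resource_size_t end = dev->mmio_start + dev->mmio_len - 1;

    return snprintf(buf, size, "%llx-%llx",
                    (unsigned long long)dev->mmio_start,
                    (unsigned long long)end);
}
```

Code which instead keeps such a value in an unsigned long, or prints it with a plain %lx, is the kind which would need attention after the change.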

Patches and updates

Kernel trees

Andrew Morton 2.6.17-rc6-mm2

Ingo Molnar 2.6.17-rc6-rt1

Ingo Molnar 2.6.17-rc6-rt3

Core kernel code

Roland McGrath utrace: new modular infrastructure for user debug/tracing

Matt Helsley Task watchers: Task Watchers

Badari Pulavarty VFS fileop cleanups by collapsing AIO and vector IO

Development tools

Mark Hounschell RT exec for exercising RT kernel capabilities

Junio C Hamano GIT 1.4.0

Catalin Marinas Stacked GIT 0.10

Catalin Marinas Kernel memory leak detector 0.7

Linus Torvalds suspend-to-ram debugging patches

Device drivers

Jeff Garzik Promise 'stex' driver

Greg KH Reworked 64bit resource patches

Tejun Heo rolled up patch for new Power Management for libata

Documentation

Diego Calleja Updated sysctl documentation take #2

Filesystems and block I/O

Mingming Cao extents and 48bit ext3

Dave Hansen Read-only bind mounts

Memory management

Peter Zijlstra mm: tracking dirty pages -v6

Christoph Lameter Zoned VM counters V2

Christoph Lameter Zoned VM counters V4

Christoph Lameter Light weight event counters V3

Networking

Gerrit Renker net: RFC 3828-compliant UDP-Lite support

Sridhar Samudrala in-kernel sockets API

Security-related

kang RSBAC 1.2.7 Released

Amy Griffis audit: path-based rules

Virtualization and containers

Kirill Korotaev IPC namespace

dlezcano@fr.ibm.com [patch 0/6] [Network namespace] introduction

Miscellaneous

Thomas Jarosch ipt_ACCOUNT 1.6 released

Page editor: Jonathan Corbet