Kernel development [LWN.net]

Kernel release status

The current stable 2.6 kernel is 2.6.17.3, released on June 30. It was a single-fix release for a denial of service vulnerability in the netfilter SCTP connection tracking code. One day earlier, 2.6.17.2 had been released with a relatively large set of important fixes. The SCTP fix can also be found in 2.6.16.23.

The current 2.6 prepatch is 2.6.18-rc1, released by Linus on July 5. A summary of changes can be found in a separate article below. Also available are the short-form changelog (too bulky to be included with Linus's announcement) and the long-form changelog.

The current -mm tree is 2.6.17-mm6. Recent changes to -mm include some extensions to the read-copy-update API, some "massive" CPU scheduler cleanup work, the removal of a number of old (OSS) sound drivers, and a set of patches shrinking the inode structure. A great many patches have been removed from -mm as they have found their way into 2.6.18-rc1.

Quote of the week

I like colorized diffs, but let's face it, those particular color choices will make most people decide to pick out their eyes with a fondue fork. And that's not good. Digging in your eye-sockets with a fondue fork is strictly considered to be bad for your health, and seven out of nine optometrists are dead set against the practice.

So in order to avoid a lot of blind git users, please apply this patch.

-- Linus Torvalds

Looking forward to 2.6.18

Your editor, having returned from an all-too-short vacation, was faced with the prospect of looking over the 4500 (and counting) patches merged for the 2.6.18-rc1 release. Much of what has been merged is the usual set of fixes and updates, but some more user- and developer-visible patches have gone in as well. The user-visible patches include:

The new core time system has finally found its way into the mainline; it was covered here in January, 2005, but has evolved considerably since then.
New device drivers for SMSC LAN911x Ethernet chipsets, ZyDAS ZD1211-based wireless LAN adapters, Myricom Myri-10G interfaces, CS553x NAND flash controllers, Amstrad E3 Delta flash controllers, Abit uGuru hardware monitoring chips, NS LM70 temperature sensors, a number of Echoaudio sound cards, and more.
Generic support for hardware random number generators has been added, along with drivers for a long list of generators.
The Philips Webcam driver has seen a massive update which adds image decompression support (without legal issues this time), support for a number of new devices, and many improvements.
A large set of NFS patches has been merged, adding, among other things, direct I/O support.
A netlink interface for network bridge management.
A netfilter connection tracking helper for the SIP protocol.
The TCP Low Priority, TCP Compound, and TCP Veno congestion control algorithms.
A new mechanism for attaching SELinux labels to network packets. There is also a new set of hooks allowing SELinux to regulate the kernel key management subsystem.
Extended attribute support in the JFFS2 filesystem.
A number of kernel include files have been cleaned up to make it easier to include them into user-space applications.
PCI devices now export an "enable" attribute via sysfs. The main purpose for the new attribute is to allow the X server to enable and disable devices without doing direct I/O memory access.
The swapless page migration patches have been merged, easing the movement of pages between NUMA nodes. There is also a new move_pages() system call which can be used to determine where pages reside and possibly move them to a new node.
The TCP segmentation offload code has been updated and improved. There is a new "generic segmentation offload" layer which can emulate TSO in software; evidently this approach yields some of the performance benefits of TSO on hardware which does not support segmentation offloading.
The default disk I/O scheduler is now the "completely fair queueing" (CFQ) scheduler.
A massive set of serial ATA changes has been merged, including a new error handler, rewritten programmed I/O support, native command queueing (NCQ) support (which should improve performance considerably), and hotplug support.
Priority-inheriting futexes have been merged into the mainline.
SMPnice, a set of scheduler heuristic changes meant to improve handling of low-priority processes on SMP systems, has been merged.

Internal API changes visible to kernel developers include:

The generic IRQ layer has been merged. The SA_* flags to request_irq() have been renamed; the new prefix is IRQF_. A long series of patches has converted in-tree drivers over to the new names; the old names are scheduled for removal in January, 2007.
64-bit resources are now supported. This change affects a number of users of the resource management API.
The kernel lock validator has gone in, along with a number of fixes for potential deadlocks found by the validator.
At long last, the devfs subsystem has been removed.
An API and support for the Intel I/OAT DMA engine.
The skb_linearize() function has been reworked, and no longer has a GFP flags argument. There is also a new skb_linearize_cow() function which ensures that the resulting SKB is writable.

Network drivers should no longer manipulate the xmit_lock spinlock in the net_device structure; instead, the following new functions should be used:

     void netif_tx_lock(struct net_device *dev);
     void netif_tx_lock_bh(struct net_device *dev);
     void netif_tx_unlock(struct net_device *dev);
     void netif_tx_unlock_bh(struct net_device *dev);
     int netif_tx_trylock(struct net_device *dev);
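
The locking pattern can be modeled in user space. What follows is a minimal sketch, not kernel code: the structure and every model_-prefixed name are hypothetical stand-ins, a C11 atomic flag substitutes for the kernel spinlock, and the sketch assumes that only the trylock variant reports a result to its caller.

```c
#include <stdatomic.h>

/* Hypothetical user-space stand-in for the relevant part of struct net_device. */
struct model_net_device {
    atomic_flag tx_lock;        /* models the transmit lock drivers should not touch directly */
    unsigned long tx_reclaimed; /* illustrative state protected by the lock */
};

static void model_init(struct model_net_device *dev)
{
    atomic_flag_clear(&dev->tx_lock);
    dev->tx_reclaimed = 0;
}

/* Analogue of netif_tx_lock(): spin until the transmit lock is held. */
static void model_tx_lock(struct model_net_device *dev)
{
    while (atomic_flag_test_and_set_explicit(&dev->tx_lock, memory_order_acquire))
        ; /* busy-wait, as a spinlock would */
}

/* Analogue of netif_tx_unlock(): release the transmit lock. */
static void model_tx_unlock(struct model_net_device *dev)
{
    atomic_flag_clear_explicit(&dev->tx_lock, memory_order_release);
}

/* Analogue of netif_tx_trylock(): returns 1 if the lock was taken, 0 if it was busy. */
static int model_tx_trylock(struct model_net_device *dev)
{
    return !atomic_flag_test_and_set_explicit(&dev->tx_lock, memory_order_acquire);
}

/* Driver-style helper: do cleanup work only if the transmit path is idle right now. */
static int model_try_reclaim(struct model_net_device *dev)
{
    if (!model_tx_trylock(dev))
        return 0;               /* transmit path is busy; the caller retries later */
    dev->tx_reclaimed++;
    model_tx_unlock(dev);
    return 1;
}
```

The point of the wrappers is that driver code names the operation it wants rather than the lock field itself, which leaves the core free to change how the lock is stored.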

The long-deprecated inter_module API has finally been removed altogether.
A new kernel API providing access to the "inotify" functionality has been added.
The old scsi_request infrastructure has been removed, since there are no longer any in-tree drivers which use it.
The include file <linux/usb_input.h> is now <linux/usb/input.h>.
The VFS get_sb() filesystem method has a new prototype:
```
     int (*get_sb)(struct file_system_type *fstype, int flags,
                   const char *dev_name, void *data,
                   struct vfsmount *mnt);
```
The mnt parameter is new; it allows the filesystem to receive a pointer to the target mount point structure. The mount point should be associated with the superblock in the get_sb() method with a call to:
```
     int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb);
```
The return value of get_sb() has also been changed to an int error status. The various get_sb_*() convenience functions have had the same changes applied. The purpose of all this work is to allow NFS to share superblocks across mount points.
The statfs() superblock operation has a new prototype:
```
     int (*statfs)(struct dentry *dentry, struct kstatfs *stats);
```
The old struct super_block pointer is now a dentry pointer instead.
Some functions have been added to make it easy for kernel code to allocate a buffer with vmalloc() and map it into user space. They are:
```
     void *vmalloc_user(unsigned long size);
     void *vmalloc_32_user(unsigned long size);
     int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
                             unsigned long pgoff);
```
The first two functions are forms of vmalloc() which obtain memory intended to be mapped into user space; among other things, they zero the entire range to avoid leaking data. vmalloc_32_user() allocates low memory only. A call to remap_vmalloc_range() will complete the job; it will refuse, however, to remap memory which has not been allocated with one of the two functions above.
The read-copy-update API is now accessible only to GPL-licensed modules. The deprecated function synchronize_kernel() has also been removed.
There is a new strstrip() library function which removes leading and trailing white space from a string.
A new WARN_ON_ONCE macro will test a condition and complain if that condition evaluates true - but only once per boot.
A number of crypto API changes have been merged, the biggest being a change to most algorithm-specific functions to take a pointer to the crypto_tfm structure, rather than the old "context" pointer. This change was necessary to support parameterized algorithms.
There is a new make target "headers_install". Its purpose is to install a set of kernel headers useful for libraries and user-space tools. A limited set of headers is installed, and those headers are sanitized on their way to the destination directory. It is hoped that distributors will use this mechanism to set up kernel headers for inclusion from user space in the future.
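
Two of the smaller helpers in the list above are easy to illustrate in user space. The sketch below is not the kernel source: strstrip_sketch(), WARN_ON_ONCE_SKETCH, warnings_emitted, and check_length() are hypothetical names, and the code only shows one plausible way to get the described behavior (in-place stripping of white space, and a warning that fires once per call site).

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/*
 * Strip white space in place: trailing white space is cut off by writing
 * a terminator, and the returned pointer skips any leading white space.
 */
static char *strstrip_sketch(char *s)
{
    size_t len = strlen(s);

    while (len > 0 && isspace((unsigned char)s[len - 1]))
        len--;
    s[len] = '\0';

    while (*s && isspace((unsigned char)*s))
        s++;
    return s;
}

/* Counts emitted warnings; exists only so the behavior can be observed. */
static int warnings_emitted;

/*
 * Complain the first time the condition is true at this call site, then
 * stay silent. Each expansion gets its own static flag.
 */
#define WARN_ON_ONCE_SKETCH(cond)                                   \
    do {                                                            \
        static int warned_already;                                  \
        if ((cond) && !warned_already) {                            \
            warned_already = 1;                                     \
            warnings_emitted++;                                     \
            fprintf(stderr, "warning: %s at %s:%d\n",               \
                    #cond, __FILE__, __LINE__);                     \
        }                                                           \
    } while (0)

/* Example call site: a negative length is suspicious but not fatal. */
static void check_length(int len)
{
    WARN_ON_ONCE_SKETCH(len < 0);
}
```

Because strstrip_sketch() may return a pointer into the middle of the buffer, a caller that allocated the buffer needs to keep the original pointer around for freeing.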

As of this writing, the 2.6.18 merge window has closed, so there probably will not be a whole lot of additions to the above list.

Time for ext4

A few weeks ago, this page looked at possible additions to the ext3 filesystem and the question of whether the time had come to freeze ext3 and put new features into a new ext4 filesystem instead. The ext2/3 filesystem developers have now responded to that discussion with a clear answer: they will be moving on to ext4.

More specifically, a new filesystem will be created under fs/ext4 in the kernel source. Said filesystem will register itself as "ext3dev," in an attempt to make it crystal clear that it is a development filesystem, not suitable for the storage of data which one actually wishes to keep. New feature work - especially changes which change on-disk formats and prevent interoperation with current ext3 implementations - will go into this new filesystem, while ext3 will continue to receive bug fixes and some safe improvements. Throughout this process, the new filesystem will retain its ability to work with the current ext3 format.

Sometime in the future, ext3dev will be declared stable and renamed "ext4." Once the last bugs have been shaken out, this filesystem will lose its "experimental" designation and users will be encouraged to upgrade. Since support for ext3 formats will be there, this upgrade should be an easy process, with no backup-and-restore step or downtime required. Further in the future, the ext3 code may be removed and ext4 would transparently handle ext3 filesystems as well.

There seems to be little opposition to this approach, so it would appear that things will happen this way. Since the addition of a new, experimental filesystem carries little regression risk, the creation of ext4 and the addition of some new features (extents, for example) could yet happen for 2.6.18.

The 2006 Linux File Systems Workshop

July 5, 2006

This article was contributed by Valerie Henson

The Linux file systems community met in Portland in June 2006 to discuss the next 5 years of file system development in Linux. Organized by Val Henson, Zach Brown, and Arjan van de Ven, and sponsored by Intel, Google, and Oracle, the Linux File Systems Workshop brought together thirteen Linux file systems developers and experts to share data and brainstorm for three days. Our goal was to discuss the direction of Linux file systems development during the next 5 years, with a focus on disruptive technologies rather than incremental improvements. Our goal was not to design one new file system to rule them all, but to come up with several useful new file system architecture ideas (which may or may not reuse existing file system code). To stay focused, we explicitly ruled out discussion of the design of distributed or clustered file systems, with the exception of how they impact local file system design. We came out of the workshop with broad agreement on the problems facing Linux file systems, several exciting file system architecture ideas, and a commitment to working together on the next generation of Linux file systems.

The Problem

Why do we need a Linux file systems workshop, when all seems well in Linux file systems land? Disks purr gently along, larger and fatter than ever before, but still essentially the same. I/O errors are an endangered species, more rumor than fact, and easily corrected with a simple fsck. The "df" command returns a comforting 50% free on most of your file systems. You chuckle gently as you read old file system man pages with directions for tuning inode/block ratios. Sure, that 32-bit file system size limit is looming somewhere over the horizon, but a quick patch to change the size of your block pointers is all you need and you'll be back in business again. After all, file systems are a solved problem, right? Right?

If computer hardware never changed, we kernel developers would have nothing better to do than argue about the optimal scheduling algorithm and flame each other's coding style. Unfortunately, hardware has this terrible habit of changing frequently, drastically, and worst of all, exponentially. File systems are especially vulnerable to changes in hardware because of their long-lived nature. Much of operating systems software can be changed at will given a simple system reboot. But file systems - and their on-disk data layouts - live on and on.

What has changed in hardware that affects file systems? Let's start with some simple, unavoidable facts about the way disks are evolving. Everyone knows that disk capacity is growing exponentially, doubling every 9-18 months. But what about disk bandwidth and seek time? At the last Storage Networking World conference, Seagate presented some details of their hard disk road map for the next 7 years (see page 16 of the slides [PDF]). Their predictions for 3.5 inch hard disks are summarized in the following table.

Parameter	2006	2009	2013	Improvement
Capacity (GB)	500	2000	8000	16x
Bandwidth (Mb/s)	1000	2000	5000	5x
Read seek time (ms)	8	7.2	6.5	1.2x

In summary, over the next 7 years, disk capacity will increase by 16 times, while disk bandwidth will increase only 5 times, and seek time will barely budge! Today it takes a theoretical minimum of 4,000 seconds, or a bit over 1 hour, to read an entire disk sequentially (in reality, it's longer due to a variety of factors). In 2013, it will take a minimum of 12,800 seconds, or about 3.5 hours, to read an entire disk - an increase of more than 3 times. Random I/O workloads are even worse, since seek times are nearly flat. A workload that reads, e.g., 10% of the disk non-sequentially will take much longer on our 8TB 2013-era disk than it did on our 500GB 2006-era disk.
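
The arithmetic behind these figures can be written out as a short sketch. It assumes decimal units (1 GB = 8,000 megabits), which is what the numbers above imply. The random-read estimate rests on two further assumptions that are not in the Seagate data: a hypothetical fixed request size, and one full seek per request with transfer time ignored. The function names are illustrative.

```c
/* Theoretical minimum time, in seconds, for one sequential pass over a disk. */
static unsigned long long sequential_read_seconds(unsigned long long capacity_gb,
                                                  unsigned long long bandwidth_mbps)
{
    /* GB -> megabits (x 8 bits x 1000), then divide by megabits per second. */
    return capacity_gb * 8ULL * 1000ULL / bandwidth_mbps;
}

/*
 * Seek-bound estimate, in seconds, for reading a percentage of the disk in
 * fixed-size random requests. Assumes every request costs one average seek.
 */
static double random_read_seconds(unsigned long long capacity_gb,
                                  unsigned int percent,
                                  unsigned long long request_kb,
                                  double seek_ms)
{
    unsigned long long kb_to_read = capacity_gb * 1000000ULL * percent / 100ULL;
    unsigned long long requests = kb_to_read / request_kb;

    return (double)requests * seek_ms / 1000.0;
}
```

With the table's numbers, sequential_read_seconds(500, 1000) gives 4,000 and sequential_read_seconds(8000, 5000) gives 12,800. With a hypothetical 4 KB request size, reading 10% of the disk comes out at 100,000 seconds for the 2006 disk (8 ms seeks) and 1,300,000 seconds for the 2013 disk (6.5 ms seeks), a factor of 13.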

Another interesting change in hardware is the rate of increase in capacity versus the rate of reduction in I/O errors per bit. In order for a disk to have the same overall number of I/O errors, every time capacity doubles, the per-bit I/O error rate must halve. Needless to say, this isn't happening, so I/O errors are actually more common even though the per-bit error rate has dropped.
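
The scaling argument can be made concrete with a small sketch. It assumes the expected number of errors over a whole disk is proportional to capacity times the per-bit error rate; the function name is illustrative, and the 4x per-bit improvement in the usage note below is a hypothetical figure, not a measured one.

```c
/*
 * Factor by which the expected number of I/O errors per disk changes when
 * capacity grows by capacity_growth and the per-bit error rate shrinks by
 * per_bit_rate_reduction (both expressed as "times" factors).
 */
static double error_count_ratio(double capacity_growth,
                                double per_bit_rate_reduction)
{
    return capacity_growth / per_bit_rate_reduction;
}
```

error_count_ratio(2.0, 2.0) is 1.0, the break-even case described above. If capacity grew 16 times while the per-bit rate improved only 4 times, each disk would see 4 times as many errors.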

These are only a few of the changes in disk hardware that will occur over the next decade. What do these changes mean for file systems? First, fsck will take a lot longer in absolute terms, because disk capacity is larger, but disk bandwidth is relatively smaller, and seek time is relatively much larger. Fsck on multi-terabyte file systems today can easily take 2 days, and in the future it will take even longer! Second, the increasing number of I/O errors means that fsck is going to happen a lot more often - and journaling won't help. Existing file systems simply weren't designed with this kind of I/O error frequency in mind.

These problems aren't theoretical - they are already affecting systems that you care about. Recently, the main server for Linux kernel source, kernel.org, suffered file system corruption from a failure at the RAID level. It took over a week for fsck to repair the (ext3) file system, when it would have taken far less time to restore from backup.

The workshop

Now that the stage is set, we'll move on to what happened at the 2006 Workshop. The coverage has been split into the following pages:

Day 1, devoted mostly to understanding the current state of the art: file system repair, disk errors, lessons learned from existing file systems, and major filesystem architectures.
Days 2 and 3, concerned with the way forward: interesting ideas, near-term needs, and development plans.

Patches and updates

Linus Torvalds Linux v2.6.18-rc1 Jul 06

Chris Wright Linux 2.6.17.2 Jun 30

Chris Wright Linux 2.6.17.3 Jun 30

Andrew Morton 2.6.17-mm6 Jul 03

Andrew Morton 2.6.17-mm5 Jul 03

Andrew Morton 2.6.17-mm4 Jul 03

Andrew Morton 2.6.17-mm3 Jul 03

Ingo Molnar 2.6.17-rt3 Jul 03

Greg KH Linux 2.6.16.23 Jul 03

Håvard Skinnemoen AVR32 architecture support Jul 03

Haavard Skinnemoen AVR32 architecture patch against Linux 2.6.18-rc1 available Jul 06

Arjan van de Ven sLeAZY FPU feature Jul 03

Nigel Cunningham Suspend2 - Request for review & inclusion in -mm Jul 03

Peter Williams sched: Add SCHED_BGND (background) scheduling policy Jul 05

Shailabh Nagar [PATCH] per-task delay accounting taskstats interface: control exit data through cpumasks Jul 06

Badari Pulavarty [2.6.17-mm6 PATCH 0/3] VFS fileop cleanups by collapsing AIO and vector IO Jul 06

Junio C Hamano GIT 1.4.1 Jul 03

Samo Pogacnik LTT patch for 2.6.16 Jul 03

Greg KH change netdevice to use struct device instead of struct class_device Jul 04

Linsys Contractor Amit S. Kale NetXen: ethernet nic driver Jul 05

Theodore Ts'o Proposal and plan for ext2/3 future development work Jul 03

Mingming Cao extents and 48bit ext3/4 patches Jul 03

Theodore Ts'o Inode slimming patches, part 1 Jul 03

Trond Myklebust FSCACHE support for AFS and NFS Jul 06

Adrian Bunk OSS driver removal, 2nd round Jul 03

Arjan van de Ven make more file_operation structs static Jul 03

Christoph Lameter Reduce MAX_NR_ZONES and remove useless zones. Jul 04

Herbert Xu GSO: Generic Segmentation Offload Jul 03

Evgeniy Polyakov Netchannels: progress report. Jul 03
