Speculating on page faults
LWN.net needs you! Without subscribers, LWN would simply not exist. Please consider signing up for a subscription and helping to keep LWN publishing |
Improving the performance of the kernel is generally a good thing to do; that is why many of our best developers have put considerable amounts of time into optimization work. One area which has recently seen some attention is in the handling of soft page faults. As the course of this work shows, though, performance problems are not always where one thinks they might be; sometimes it's necessary to take a step back and reevaluate the situation, possibly dumping a lot of code in the process.
Page faults can be quite expensive, especially those which must be resolved by reading data from disk. On a typical system, though, there are a lot of page faults which do not require I/O. A page fault happens because a specific process does not have a valid page table entry for the needed page, but that page might already be in the page cache, in which case handling the fault is just a matter of fixing the page table entry and increasing the page's reference count; this happens often with shared pages or those brought in via the readahead mechanism. Faults for new anonymous pages (application data and stack space, mostly), instead, can be handled through the allocation of a zero-filled page. In either case, the fault is quickly taken care of with no recourse to backing store required.
In many workloads, this kind of "soft" fault happens much more often than hard faults requiring actual I/O. So it's important that they be executed quickly. Various developers had concluded that the kernel was, in fact, not handling this kind of fault quickly enough, and they identified the use of the mmap_sem reader/writer semaphore as the core of the problem. Contention wasn't the issue in this case - only a reader lock is required for page fault handling - but the cache line bouncing caused by continual acquisition of the lock was killing performance. As the number of cores in systems increases, this kind of problem can only get worse.
In response, Hiroyuki Kamezawa posted the first speculative page fault patch back in November. The core idea behind the patch was to try to handle page faults without taking mmap_sem at all. Doing so invites race conditions; in particular, the vm_area_struct (VMA) structure which controls the memory mapping can change while the work is in progress. So the speculative fault code must handle the fault, then check for concurrent changes and, if necessary, redo the work the older, slower way. That's the "speculative" part: doing the work in a lockless mode in the hope that the world will not change in the meantime and force that work to be done again.
The speculative page fault code must also ensure that no changes which could create real trouble (such as freeing the VMA outright) can happen while the fault is being handled. To this end, various versions of the patch have tried techniques like adding reference counts to the VMA structure or using read-copy-update with the red-black tree code (which is used to locate the VMA covering a specific address within an address space) to defer changes while the speculative code is doing its thing.
This work yielded some real results: the number of page faults per unit
time that the system could handle, when running a fault-heavy benchmark,
approximately doubled. The idea clearly had merit, but Peter Zijlstra felt that Kamezawa-san's patches
"weren't quite crazy enough
". He set out to rectify this
problem with a speculative page
fault patch of his own, which saw a number of revisions. Peter's
approach included the addition of speculative page table locks and the use
of RCU to manage VMA structures. The result was a patch which would
"sometimes boot" and which seemed to improve performance.
This is about when Linus got involved, pointing out some problems with Peter's patch, concluding:
Peter agreed with this conclusion, noting that he'd never thought it was ready yet.
It was in further discussion that Linus, looking at a profile of page fault handling activity, noticed something funny: the real overhead seemed to be in spinlock operations, which shouldn't be involved in the x86-optimized rwsem implementation at all. It turns out that said optimization was only applied to 32-bit systems; on 64-bit builds, reader/writer semaphores were using the generic, semaphore-based implementation. That meant that they were rather slower than they needed to be.
So Linus tossed out a new rwsem implementation based on the x86 exchange-and-add (xadd) instruction with a typical warning:
Kamezawa-san bravely tried the code anyway, and got an interesting result. His pets and his beer both came through without trauma - and the page fault performance was better than with his speculative fault patch. Peter, too, ran some tests against his own speculative code; those results showed that the rwsem change tripled page fault performance. His speculative fault patch improved performance by just a tiny bit more than that, and the two together a little more yet. But the rwsem patch is a small and clear fix, while the speculative page fault patch is large, widespread, scary, and with known problems. So nobody really disputed Peter's conclusion:
As of this writing, nobody has posted a final version of the rwsem patch.
Linus has noted that there are things which can be improved with it, but it
would be fairly typical for him to leave that work to others. But, one
assumes, some version of this patch will be waiting in the wings when the
2.6.34 merge window opens. It will be a clear demonstration that
low-hanging performance fruit exists even in our highly-optimized kernel;
one need only think to look in the right place.
Index entries for this article | |
---|---|
Kernel | Memory management/Scalability |
Kernel | Semaphores |
(Log in to post comments)
Speculating on page faults
Posted Jan 12, 2010 23:49 UTC (Tue) by dorado2 (guest, #53970) [Link]
Speculating on page faults
Posted Jan 13, 2010 9:18 UTC (Wed) by intgr (subscriber, #39733) [Link]
to x86_64?
Why did the x86 and x86_64 implementations differ so significantly in
the first place?
Speculating on page faults
Posted Jan 13, 2010 12:01 UTC (Wed) by mlankhorst (subscriber, #52260) [Link]
arch-neutral mechanism that used a spinlock, which is of course a lot
slower.
Speculating on page faults
Posted Feb 9, 2010 5:59 UTC (Tue) by yuhong (guest, #57183) [Link]
Another success of many eyes
Posted Jan 13, 2010 14:05 UTC (Wed) by dwheeler (guest, #1216) [Link]
Here's another example where different people tackled the problem in different ways, ending up with a good final result.
Speculating on page faults
Posted Jan 13, 2010 16:46 UTC (Wed) by eparis123 (guest, #59739) [Link]
Speculating on page faults
Posted Feb 15, 2010 22:35 UTC (Mon) by jengelh (subscriber, #33263) [Link]