The problem with prefetch
Over time, software developers tend to learn that micro-optimization efforts are generally not worthwhile, especially in the absence of hard data pointing out a specific problem. Performance problems are often not where we think they are, so undirected attempts to tweak things to make them go faster can be entirely ineffective. Or, indeed, they can make things worse. That is a lesson that the kernel developers have just relearned.
At the kernel level, performance often comes down to cache behavior. Memory references which must actually be satisfied by memory are extremely slow; good performance requires that needed data be in a CPU cache much of the time. The kernel goes out of its way to use cache-hot memory when possible; there has also been some significant work put into tasks like reordering structures so that fields that are commonly accessed together are found in the same cache line. As a general rule, these optimizations have helped performance in measurable ways.
Cache misses are often unavoidable, but it is sometimes possible to attempt to reduce their cost. If the kernel knows that it will be accessing memory at a particular location in the near future, it can use a CPU-specific prefetch instruction to begin the process of bringing the data into cache. This instruction is made available to kernel code via the generic prefetch() function; developers have made heavy use of it. Consider, for example, this commonly-used macro from <linux/list.h>:
    #define list_for_each(pos, head) \
            for (pos = (head)->next; prefetch(pos->next), pos != (head); \
                 pos = pos->next)
This macro (in a number of variants) is used to traverse a linked list. The idea behind the prefetch() call here is to begin the process of fetching the next entry in the list while the current entry is being processed. Hopefully by the time the next loop iteration starts, the data will have arrived - or, at least, it will be in transit. Linked lists are known to be cache-unfriendly data structures, so it makes sense that this type of optimization can help to speed things up.
Except that it doesn't - at least, not on x86 processors.
Andi Kleen may have been the first to question this optimization when he tried to remove the prefetches from list operations last September. His patch generated little discussion, though, and apparently fell through the cracks. Recently, Linus did some profiling on one of his favorite workloads (kernel builds) and found that the prefetch instructions were at the top of the ranking. Performing the prefetching cost time, and that time was not being repaid through better cache behavior; simply removing the prefetch() calls made the build go faster.
Ingo Molnar, being Ingo, jumped in and did a week's worth of research in an hour or so. Using perf and a slightly tweaked kernel, he was able to verify that using the prefetch instructions caused a performance loss of about 0.5%. That is not a headline-inspiring performance regression, certainly, but this is an optimization which was supposed to make things go faster. Clearly something is not working the way that people thought it was.
Linus pointed out one problem at the outset: his test involved a lot of traversals of singly-linked hlist hash table lists. Those lists tend to be short, so there is not much scope for prefetching; in fact, much of the time, the only prefetch attempted used the null pointer that indicates the end of the list. Prefetching with a null pointer seems silly, but it's also costly: evidently every such prefetch on x86 machines (and, seemingly, ARM as well) causes a translation lookaside buffer miss and a pipeline stall. Ingo measured this effect and came to the conclusion that each null prefetch cost about 20 processor cycles.
Clearly, null prefetches are a bad idea. It would be nice if the CPU would simply ignore attempts to prefetch using a null pointer, but that's not how things are, so, as is often the case, one ends up trying to solve the problem in software instead. Ingo did some testing with a version of prefetch() which would only issue prefetch instructions for non-null pointers; that version did, indeed, perform better. But it still performed measurably worse than simply skipping the prefetching altogether.
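A null-guarded wrapper of the kind Ingo tested might look like the following sketch (the name is made up; `__builtin_prefetch()` is the GCC intrinsic behind the kernel's `prefetch()` on most architectures):

```c
#include <stddef.h>

/* Hypothetical null-guarded variant: only issue the prefetch
 * instruction for non-null pointers, avoiding the TLB miss and
 * pipeline stall that a prefetch of address 0 causes on x86.
 * Note the test itself costs a compare and branch per call, which
 * is part of why even this version lost to no prefetch at all. */
static inline void prefetch_nonnull(const void *p)
{
        if (p)
                __builtin_prefetch(p);
}
```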
CPU designers are well aware of the cost of waiting for memory; they have put a great deal of effort into minimizing that cost whenever possible. Among other things, contemporary CPUs have their own memory prefetch units which attempt to predict which memory will be wanted next and start the process of retrieving it early. One thing Ingo noticed in his tests is that, even without any software prefetch operations, the number of prefetch operations run by the CPU was about the same. So the hardware prefetcher was busy during this time - and it was doing a better job than the software at deciding what to fetch. Throwing explicit prefetch operations into the mix, it seems, just had the effect of interfering with what the hardware was trying to do.
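The pattern the hardware prefetcher thrives on can be illustrated with a trivial sketch (illustrative only, not kernel code):

```c
#include <stddef.h>

/* A sequential array walk: the stride-detecting hardware prefetcher
 * can run ahead of this loop with no software hints at all, which
 * is why adding explicit prefetch instructions here mostly just
 * competes with what the hardware is already doing. */
static long sum_array(const long *v, size_t n)
{
        long s = 0;
        for (size_t i = 0; i < n; i++)
                s += v[i];
        return s;
}
```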
Ingo summarized his results this way:
One immediate outcome from this work is that, for 2.6.40 (or whatever it ends up being called), the prefetch() calls have been removed from linked list, hlist, and sk_buff list traversal operations - just like Andi Kleen tried to do in September. Chances are good that other prefetch operations will be removed as well. There will still be a place for prefetch() in the kernel, but only in specific situations where it can be clearly shown to help performance. As with other low-level optimizations (likely() comes to mind), tossing in a prefetch because it seems like it might help is often not the right thing to do.
One other lesson to be found in this experience is that numbers matter.
Andi was right when he wanted to remove these operations, but he did not
succeed in getting his patch merged. One could come up with a number of
reasons why things went differently this time, starting with the fact that
Linus took an interest in the problem. But it's also true that
performance-oriented patches really need to come with numbers to show that
they are achieving the desired effect; had Andi taken the time to quantify
the impact of his change, he would have had a stronger case for merging
it.
The problem with prefetch
Posted May 24, 2011 15:53 UTC (Tue) by MattBBaker (subscriber, #28651) [Link]
prefetch and buffer bloat
Posted May 24, 2011 16:15 UTC (Tue) by grunch (subscriber, #16603) [Link]
I wonder if it's a similar cause.
Thanks for the details Jonathan!
prefetch and buffer bloat
Posted May 24, 2011 16:22 UTC (Tue) by martinfick (subscriber, #4455) [Link]
prefetch and buffer bloat
Posted May 24, 2011 16:50 UTC (Tue) by nye (guest, #51576) [Link]
prefetch and buffer bloat
Posted May 24, 2011 17:11 UTC (Tue) by martinfick (subscriber, #4455) [Link]
prefetch and buffer bloat
Posted Jun 2, 2011 21:54 UTC (Thu) by jch (guest, #51929) [Link]
Bufferbloat is not about reducing the amount of buffers in routers; it is about designing algorithms to make sure that routers only use as much of their buffers as necessary, and getting the router vendors to deploy such algorithms.
--jch
prefetch and buffer bloat
Posted May 24, 2011 17:27 UTC (Tue) by nicooo (guest, #69134) [Link]
I doubt it's a marketing issue. Having their current clients telling their potential clients that the service sucks can't be good for business.
prefetch and buffer bloat
Posted May 24, 2011 22:58 UTC (Tue) by Lennie (subscriber, #49641) [Link]
So again you have manufacturers doing non-real-world tests (which might have been good tests a long time ago) for marketing purposes and optimising for that case.
prefetch and buffer bloat
Posted May 31, 2011 20:43 UTC (Tue) by marcH (subscriber, #57642) [Link]
... up to a size after which throughput does not get better. Yet we can sometimes see buffer sizes way past this point (e.g. 1 second), which proves that some manufacturers do not bother trying to optimize anything at all.
prefetch and buffer bloat
Posted May 31, 2011 21:05 UTC (Tue) by dlang (guest, #313) [Link]
since they have the ram anyway, and buffers that are too small can cause problems. the logic then follows 'why not just use the ram in the device as a buffer'
this causes other problems, but these other problems were not well described until recently.
prefetch and buffer bloat
Posted May 24, 2011 17:53 UTC (Tue) by clugstj (subscriber, #4020) [Link]
The problem with prefetch on Old Processors (e.g. P3s)
Posted May 24, 2011 18:13 UTC (Tue) by ds2horner (subscriber, #13438) [Link]
Lack of hardware support for providing the metrics (like the counters used for the new hardware), and the long kernel compile times make it quite unlikely that those who would benefit from using prefetch will be able to provide the compelling statistics sufficient to have x86 family specific prefetch optimizations retained in the kernel.
And the majority of kernel developers with their powerful hardware have moved on...
I do appreciate that there are some (I did not record or remember names) who do regression testing on the old hardware. Perhaps one of these heroes will pick up the challenge.
As much as software prefetch has now fallen out of favour, I (perhaps naively) expected that, in the Pentium III era, tests were run that demonstrated an improvement on that hardware.
hopefully Andi Kleen's approach will be accepted now.
Posted May 24, 2011 21:37 UTC (Tue) by ds2horner (subscriber, #13438) [Link]
His approach would leave the optimizations in place for those CPUs that could benefit; so K7 and others would have the option.
Thanks for the foresight Andi.
The problem with prefetch
Posted May 24, 2011 19:07 UTC (Tue) by alvieboy (guest, #51617) [Link]
Cache is a very complex thing - actually cache + MMU + branchpredictor + multimaster buses will eventually eat most of your chip, space-wise.
Now, to the subject:
Prefetching is useful when you actually know you'll be doing sequential accesses (and still only if you do it in the background). This is almost never true for the data cache, unless you're doing block copies. Unfortunately, the way DMA engines are designed does not allow for easy off-processor copies, because set-up, check-end and tear-down of the DMA engine cost too much (DMA can also work memory <-> memory, not only device <-> memory).

Now, the cache controller cannot (not that I know of) fetch a cache line while another line is being fetched (DDR does not have multiple ports, and I don't think you can invalidate a cache line while it's being fetched). This means that you have to trust the logic inside your cache manager, and *eventually* give it some hints about what you are trying to do. Technically speaking these "hints" may exist on some buses, like AMBA and Wishbone, so that latencies are mitigated by sequentially transferring large chunks. I don't know x86 internals well enough to say whether CPU<->cache takes a similar approach.
The biggest problem is that the pipeline stalls while the cache is filling up or writing back. I believe you can still use the pipeline and execution units while the cache is being filled, but again memory is so slow (latency- and bandwidth-wise) compared to the processor that you'll eventually need to access it again, thus halting the processing.
CPU Chips could be 4x smaller if we had proper RAM (faster and multiport). This is the truth. Perhaps multidimensional chips will bring us this, but I sincerely doubt it.
Alvie
The problem with prefetch
Posted May 24, 2011 21:11 UTC (Tue) by dgm (subscriber, #49227) [Link]
Expensive and fast RAM close to the processor... hummm... isn't that the cache?
The problem with prefetch
Posted May 24, 2011 21:25 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link]
That's why it's not used even on supercomputers.
The problem with prefetch
Posted May 24, 2011 23:48 UTC (Tue) by pphaneuf (guest, #23480) [Link]
The NEC SX-4 had 16 gigabytes of SRAM as its main memory. It didn't really have any CPU cache (only something like 32KB of instruction cache). That was a nice way of not having to deal with cache coherence issues in an SMP system.
It also had a 256 bytes wide memory bus (compared to the typical 64 bits).
Serious hardware, that. :-)
The problem with prefetch
Posted May 25, 2011 10:03 UTC (Wed) by kruemelmo (guest, #8279) [Link]
The problem with prefetch
Posted May 25, 2011 10:23 UTC (Wed) by dgm (subscriber, #49227) [Link]
The problem with prefetch
Posted May 25, 2011 11:36 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]
Power consumption is 123KVA for 32-core version, performance is around 2GFLOPS per core for vector operations.
The problem with prefetch
Posted May 25, 2011 19:50 UTC (Wed) by wazoox (subscriber, #69624) [Link]
The problem with prefetch
Posted May 25, 2011 22:05 UTC (Wed) by pphaneuf (guest, #23480) [Link]
Yeah, I don't know about the power consumption, but while it would put out a good deal of heat, the SX-4 was the first of the SX series to be air-cooled. I'm not sure if the SX-3 used SRAM or DRAM (it was before my time), but that one was water-cooled.
While the vector performance was amazing, it was pretty sluggish for scalar stuff, so we didn't even use it for compiling, it was too slow.
Not saying that this is the end-all, be-all (the SX-5 traded in the SRAM for DRAM, and in exchange got 128 gigabytes of it, which was a whole lot in 1999 or 2000!), but just pointing out that it's been used on supercomputers (and I've used them), and it didn't require liquid nitrogen.
Here is a photo of one of the four SX-5 I helped look after.
The problem with prefetch
Posted May 25, 2011 11:49 UTC (Wed) by nye (guest, #51576) [Link]
Wikipedia says "SRAM is more expensive, but faster and significantly less power hungry (especially idle) than DRAM. It is therefore used where either bandwidth or low power, or both, are principal considerations". This is also what they taught in my degree course; are you certain you're not confused?
The problem with prefetch
Posted May 25, 2011 12:58 UTC (Wed) by Cyberax (✭ supporter ✭, #52523) [Link]
DRAM uses capacitors and only needs to be refreshed from time to time.
Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power.
The problem with prefetch
Posted May 25, 2011 13:49 UTC (Wed) by nye (guest, #51576) [Link]
> DRAM uses capacitors and only needs to be refreshed from time to time.
All of the resources I can find describing SRAM state that the power used continuously while idle is trivial compared to the power needed to constantly refresh DRAM, and hence it's only during heavy utilisation that its power consumption can get *up to* that of DRAM.
>Our embedded guys tell me that SRAM also is quite power-hungry during reads and writes, so high-frequency SRAMs consume significantly more power than DRAM. Sometimes very significantly more power
I'm sure what you're saying is true in your case, but how do you reconcile that with the fact that every other comparison between the two disagrees?
The only idea I can come up with is that you're comparing SRAM running *much faster* than DRAM so it's an apples-to-oranges comparison(?)
The problem with prefetch
Posted May 25, 2011 16:38 UTC (Wed) by bronson (subscriber, #4806) [Link]
Yep.
> and in idle mode continuously seeps energy
Not if made out of CMOS (and these days everything is CMOS). At steady state the only power loss in SRAM is gate and substrate leakage. Negligible.
> DRAM uses capacitors and only needs to be refreshed from time to time.
Yeah, 50 times a second. It adds up to quite a bit of power. Plus DRAMs have all the substrate leakage of the SRAM.
I'm pretty sure you're comparing supercomputer SRAMs from the early 80s, clocked to the limits of their lives, with low power modern DRAMs. If you compare equivalent parts you'll see that DRAMs are smaller but burn a lot more power.
The problem with prefetch
Posted May 24, 2011 21:18 UTC (Tue) by oak (guest, #2786) [Link]
-> Is there an article about downsides of likely() usage?
likely()
Posted May 24, 2011 21:55 UTC (Tue) by corbet (editor, #1) [Link]
A quick look in the LWN kernel index turns up an article from 2004, an article from 2006, and one from last December.
likely()
Posted May 25, 2011 13:10 UTC (Wed) by oak (guest, #2786) [Link]
I typically use unlikely() in my code just to annotate ifs in error check&logging macros. If errors aren't unlikely, I think the performance of un/likely() is the least of my issues...
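A macro along those lines, outside the kernel, might be written as follows (a sketch; `check_err` is a hypothetical name, and `unlikely()` is shown expanded to the GCC intrinsic the kernel uses):

```c
#include <stdio.h>

/* The kernel's unlikely(): a hint that the condition is almost
 * always false, so the compiler lays out the common (non-error)
 * path as straight-line code. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical error-check-and-log macro of the kind described:
 * the error branch is annotated as the cold path. */
#define check_err(cond, msg) \
        do { \
                if (unlikely(cond)) \
                        fprintf(stderr, "error: %s\n", (msg)); \
        } while (0)
```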
likely()
Posted May 25, 2011 15:30 UTC (Wed) by SLi (subscriber, #53131) [Link]
likely()
Posted Jun 1, 2011 4:20 UTC (Wed) by AdamRichter (guest, #11129) [Link]
The problem with branch prediction
Posted May 24, 2011 21:43 UTC (Tue) by davecb (subscriber, #1574) [Link]
Branch prediction gave us a bit of extra performance with a few code bases, but the older and better the code, the less we saw. My favorite example is Samba, so a Smarter Colleague[tm] and I looked at what was actually happening. Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around. The branch was very often far enough that we hit a different i-cache line. Since we didn't have a way of hinting what line we'd hit, we'd slow down whenever it wasn't trivial-to-predict straight-line code.
The better and older the code, the less we would get the next i-cache line sitting waiting for us, and the slower we'd run. Grungy straight-line FORTRAN benefited fine.
I don't recollect ever seeing an actual slowdown, but we rarely could see the predicted benefits from branch prediction.
I'd venture as much as a five cent bet we'll see the same with the intel architecture.
--dave
I think it's about some particular case of branch prediction...
Posted May 25, 2011 6:10 UTC (Wed) by khim (subscriber, #9252) [Link]
To fill the pipeline on a contemporary CPU you need 30-50 instructions in flight. Without branch prediction that's just impossible to do. If you disable branch prediction on a contemporary CPU the slowdown is crippling. Sadly only Intel engineers can give you numbers (because there is no way to disable it on a retail CPU) - and they are not telling.
I think it's about some particular case of branch prediction...
Posted May 25, 2011 10:58 UTC (Wed) by mingo (subscriber, #31122) [Link]
It's relatively easy to measure the cost of branch misses in certain cases, such as using 'perf stat --repeat N' (the branch miss rate will be measured by default) and a testcase that uses a pseudo-RNG, so the same workload can be run with both random and non-random data and the two compared.
And yes, missing branches is crippling to performance: a 3% branch miss rate can cause a 5% total execution slowdown, and a 20% miss rate can already double the runtime of a workload. (!)
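A testcase in that spirit — the same work driven by predictable versus shuffled data — can be as simple as this sketch (hypothetical, not Ingo's actual test):

```c
#include <stddef.h>

/* Sum the values above a threshold.  Run on sorted input the
 * branch is almost perfectly predicted; run on shuffled input it
 * misses roughly half the time, and under 'perf stat' both the
 * branch-miss rate and the runtime difference become visible. */
static long sum_above(const int *v, size_t n, int threshold)
{
        long sum = 0;
        for (size_t i = 0; i < n; i++)
                if (v[i] > threshold)
                        sum += v[i];
        return sum;
}
```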
The problem with outguessing the CPU
Posted May 25, 2011 7:16 UTC (Wed) by alex (subscriber, #1355) [Link]
The problem with branch prediction
Posted May 29, 2011 21:50 UTC (Sun) by giraffedata (guest, #1954) [Link]
Turns out most branches were around either short runs of legitimately conditional code or debug macros. In those cases it didn't matter if we set the prediction to correctly predict we'd branch around.
Why not? I can see there might not be any prefetching advantage because you're branching to something that is already in cache, but you can still do a lot of other execution of the instructions while still working on a prior one.
The branch was very often far enough we hit a different i-cache line. Since we didn't have a way of hinting what line we'd hit, ...
The line you'd hit is completely determined by the target in the branch instruction, isn't it?
The problem with prefetch
Posted May 25, 2011 6:16 UTC (Wed) by bronson (subscriber, #4806) [Link]
The problem with prefetch
Posted May 25, 2011 21:18 UTC (Wed) by vonbrand (guest, #4458) [Link]
Somebody told me there are three kinds of C compilers: dumb ones (they just disregard register), smart ones (they heed register if possible), and very smart ones (they assign registers better than whatever the programmer could hint at, and just overlook register). Current compilers are mostly in the third group.
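For illustration (not from the comment), the hint in question:

```c
/* 'register' is only a request; a modern compiler does its own
 * allocation and will normally ignore it.  About the only thing it
 * still enforces is that you may not take the variable's address. */
static long sum_n(int n)
{
        register long s = 0;    /* hint, almost certainly ignored */
        for (register int i = 1; i <= n; i++)
                s += i;
        return s;
}
```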
The problem with prefetch
Posted Jun 2, 2011 13:17 UTC (Thu) by marcH (subscriber, #57642) [Link]
The problem with prefetch
Posted Jun 7, 2011 12:04 UTC (Tue) by etienne (guest, #25256) [Link]
The problem with prefetch
Posted May 25, 2011 13:00 UTC (Wed) by rilder (guest, #59804) [Link]
Yes, will older processors be further penalized?
Posted May 25, 2011 16:26 UTC (Wed) by ds2horner (subscriber, #13438) [Link]
In Andi Kleen's patch (LWN article linked to in the main article), he attempted to have the prefetch remain for list processing on CPUs (namely the K7 family) that would benefit from it.
His approach was to change the prefetch calls to list_prefetch calls and make list_prefetch a no-op on most architectures, mapping it to prefetch via CONFIG_LIST_PREFETCH for MK7 only (presumably with more to be added if they would benefit).
But my main question was, other than the K7 where there is historical "evidence" that this CPU benefited, who would now do the justification for other older processors (like the P3s)?
The problem with prefetch
Posted May 25, 2011 17:57 UTC (Wed) by branden (guest, #7029) [Link]
"Sometimes, getting your patch in is just a matter of waiting for somebody else to reimplement it." -- Jonathan Corbet, https://lwn.net/Articles/51615/
The problem with prefetch
Posted May 25, 2011 23:13 UTC (Wed) by dashesy (guest, #74652) [Link]
So it was useful for me at least :)
The problem with prefetch
Posted May 26, 2011 0:49 UTC (Thu) by cesarb (subscriber, #6266) [Link]
I think I am seeing a common thread between this article and the recent undefined behavior article.
The other article: the compiler will do things you did not expect.
This article: the hardware will do things you did not expect.
HW prefetcher is smarter than you think
Posted May 26, 2011 1:22 UTC (Thu) by csd (subscriber, #66784) [Link]
Needless to say, we discarded any notion of using prefetch at the time.
The problem with prefetch
Posted May 26, 2011 2:26 UTC (Thu) by darthscsi (guest, #8111) [Link]
There is a large literature on how far in advance prefetches need to be to do any good.
The problem with prefetch
Posted May 26, 2011 16:38 UTC (Thu) by RogerOdle (subscriber, #60791) [Link]
I have not used Linux extensively in embedded in the past but the recent changes in the 2.6.39 kernel that bring more real time support make linux even more attractive in this area. One thing that commercial RTOS's allow is for applications to lock portions of the cache to hold highly reused code like DSP algorithms. I have not looked into how this is done in Linux or if it can be done at all. But partitioning control of the cache is an important feature in a small set of performance sensitive applications.
It has not been possible to use Linux for some of the applications I have been involved with in the past because of latency issues but Linux is constantly changing and 2.6.39 opens up possibilities that were out of reach before.
The problem with prefetch
Posted Jun 6, 2011 6:52 UTC (Mon) by hyoshiok (guest, #75469) [Link]
The problem with prefetch
Posted Jun 16, 2011 8:32 UTC (Thu) by tcucinotta (guest, #69261) [Link]
> So the conclusion is: prefetches are absolutely toxic,
> even if the NULL ones are excluded.
If I understand well, the main source of the "toxicity" here is that the lists are (in most cases very) short when used within hash tables. So, if one wanted to keep some highly reusable/useful helper macro(s) for iterating over items, perhaps it would be worth having two versions of them: one for likely-long lists, the other for likely-short lists.
However, I also suspect that, whilst for relatively "empty" hash tables those lists will likely contain only 1 item, for highly "full" tables those lists can easily become more and more crowded, so the developer's hint (likely short vs likely long) would easily be workload/scenario-dependent.
So, the final question would be whether there's a scenario to be considered "more likely" (or worth to be optimized) by developers than others.
Just my 2 cents.
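The suggestion above could be sketched like this (hypothetical names; the real kernel macros live in <linux/list.h>):

```c
/* Sketch of the two-macro idea: a plain traversal for the short
 * hash chains where prefetching demonstrably hurts, and a
 * prefetching one for lists the developer expects to be long.
 * The list is circular, so prefetching (head)->next at the final
 * test still touches a valid node. */
struct list_node {
        struct list_node *next;
};

#define list_for_each_short(pos, head) \
        for ((pos) = (head)->next; (pos) != (head); (pos) = (pos)->next)

#define list_for_each_long(pos, head) \
        for ((pos) = (head)->next; \
             __builtin_prefetch((pos)->next), (pos) != (head); \
             (pos) = (pos)->next)
```

As the commenter notes, which variant is right depends on how full the hash table gets, so the hint is inherently workload-dependent.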
You just need to know and remember in your local brain cache this prefetch problem
Posted Jul 12, 2017 10:16 UTC (Wed) by kazan417 (guest, #117591) [Link]
Again an unobvious technology behavior, which you just need to know and remember in your local brain cache.
And we have many of them (unobvious behaviors) every day in modern technology, because technologies are now super complex.