How 3.6 nearly broke PostgreSQL
In mid-September, the 3.6 kernel appeared to be stabilizing nicely. Most of the known regressions had been fixed, the patch volume was dropping, and Linus was relatively happy. Then Nikolay Ulyanitsky showed up with a problem: the pgbench PostgreSQL benchmark ran 20% slower than under 3.5. The resulting discussion shows just how hard scalability can be on contemporary hardware and how hard scheduling can be in general.
Borislav Petkov was able to reproduce the problem; a dozen or so bisection iterations later he narrowed down the problem to this patch, which was duly reverted. There is just one little problem left: the offending patch was, itself, meant to improve scheduler performance. Reverting it fixed the PostgreSQL regression, but at the cost of losing an optimization that improves things for many (arguably most) other workloads. Naturally, that led to a search to figure out what the real problem was so that the optimization could be restored without harmful effects on PostgreSQL.
What went wrong
The kernel's scheduling domains mechanism exists to optimize scheduling decisions by modeling the costs of moving processes between CPUs. Migrating a process from one CPU to a hyperthreaded sibling is nearly free; cache is shared at all levels, so the moved process will not have to spend time repopulating cache with its working set. Moving to another CPU within the same physical package will cost more, but mid-level caches are still shared, so such a move is still much less expensive than moving to another package entirely. The current scheduling code thus tries to keep processes within the same package whenever possible, but it also tries to spread runnable processes across the package's CPUs to maximize throughput.
The problem that the offending patch (by Mike Galbraith) was trying to solve comes from the fact that the number of CPUs built into a single package has been growing over time. Not too long ago, examining every processor within a package in search of an idle CPU for a runnable process was a relatively quick affair. As the number of CPUs in a package increases, the cost of that search increases as well, to the point that it starts to look expensive. The current scheduler's behavior, Mike said at the time, could also result in processes bouncing around the package excessively. The result was less-than-optimal performance.
Mike's solution was to organize CPUs into pairs; each CPU gets one "buddy" CPU. When one CPU wakes a process and needs to find a processor for that process to run on, it examines only the buddy CPU. The process will be placed on either the original CPU or the buddy; the search will go no further than that even if there might be a more lightly loaded CPU elsewhere in the package. The cost of iterating over the entire package is eliminated, process bouncing is reduced, and things run faster. Meanwhile, the scheduler's load balancing code can still be relied upon to distribute the load across the available CPUs in the longer term. Mike reported significant improvements in tbench benchmark results with the patch, and it was quickly accepted for the 3.6 development cycle.
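The buddy scheme can be sketched as a toy model; this is an illustration of the idea only, not the kernel's actual code, and all names here are invented:

```python
# Toy model of the "buddy CPU" wake-up placement (illustrative only;
# the real scheduler code is considerably more involved).

def buddy(cpu):
    """Pair CPU 0 with 1, 2 with 3, and so on."""
    return cpu ^ 1

def place_woken_task(waking_cpu, idle_cpus):
    """Choose a CPU for a newly woken task.

    Only the waking CPU and its buddy are considered, so the search
    cost stays constant no matter how many CPUs share the package.
    """
    if waking_cpu in idle_cpus:
        return waking_cpu
    b = buddy(waking_cpu)
    if b in idle_cpus:
        return b
    # Neither is idle: run on the waking CPU anyway, preempting its
    # current task -- the case that hurt the PostgreSQL benchmark.
    return waking_cpu

print(place_woken_task(2, {3, 5, 7}))  # buddy CPU 3 is idle -> 3
print(place_woken_task(2, {5, 7}))     # neither idle -> 2, a preemption
```

The third branch is the interesting one: with the full-package search, the task might have found an idle CPU elsewhere; with the buddy scheme, it preempts instead.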
So what is different about PostgreSQL that caused it to slow down in response to this change? It seems to come down to the design of the PostgreSQL server and the fact that it does a certain amount of its own scheduling with user-space spinlocks. Carrying its own spinlock implementation evidently does yield performance benefits for the PostgreSQL project, but it also makes the system more susceptible to problems resulting from scheduler changes in the underlying system. In this case, restricting the set of CPUs on which a newly-woken process can run increases the chance that it will end up preempting another PostgreSQL process. If the new process needs a lock held by the preempted process, it will end up waiting until the preempted process manages to run again, slowing things down. Possibly even worse is that preempting the PostgreSQL dispatcher process — also more likely with Mike's patch — can slow the flow of tasks to all PostgreSQL worker processes; that, too, will hurt performance.
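The pathology can be shown with a toy model (names invented; PostgreSQL's real s_lock spins briefly and then backs off to sleeping, but the spinning phase is where the damage happens):

```python
def spin_acquire(holder, running, max_spins=1000):
    """Toy model of waiting on a user-space spinlock.

    'holder' is the process holding the lock (or None); 'running' is
    the set of processes currently on a CPU.  A running holder drops
    the lock almost immediately; a preempted holder leaves the waiter
    spinning until it gives up, burning CPU the whole time.
    """
    spins = 0
    while holder is not None and spins < max_spins:
        spins += 1
        if holder in running:   # the holder is on a CPU, so it will
            holder = None       # release the lock promptly
    return spins

# Holder still running somewhere: the waiter gets the lock quickly.
assert spin_acquire("worker-1", running={"worker-1"}) == 1
# Holder preempted (more likely under the buddy patch): the waiter
# spins to its cap, accomplishing nothing until the holder runs again.
assert spin_acquire("worker-1", running=set()) == 1000
```

The kernel cannot see that the spinning process is waiting on the preempted one, which is exactly why user-space spinlocks are so sensitive to preemption behavior.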
Making things better
What is needed is a way to gain the benefits of Mike's patch without making things worse for PostgreSQL-style loads. One possibility, suggested by Linus, is to try to reduce the cost of searching for an idle CPU instead of eliminating the search outright. It appears that there is some low-hanging fruit in this area, but it is not at all clear that optimizing the search, by itself, will solve the entire problem. Mike's patch eliminates that search cost, but it also reduces movement of processes around the package; a fix that only addresses the first part risks falling short in the end.
Another possibility is to simply increase the scheduling granularity, essentially giving longer time slices to running processes. That will reduce the number of preemptions, making it less likely that PostgreSQL processes will step on each other's toes. Increasing the granularity does, indeed, make things better for the pgbench load. There may be some benefit to be had from messing with the granularity, but it is not without its risks. In particular, increasing the granularity could have an adverse effect on desktop interactivity; there is no shortage of Linux users who would consider that to be a bad trade.
Yet another possibility is to somehow teach the scheduler to recognize processes — like the PostgreSQL dispatcher — that should not be preempted by related processes if it can be avoided. Ingo Molnar suggested investigating this idea.
The problem, of course, is the dragons. The O(1) scheduler, used by Linux until the Completely Fair Scheduler (CFS) was merged for 2.6.23, had, over time, accumulated no end of heuristics and hacks designed to provide the "right" kind of scheduling for various types of workloads. All these tweaks complicated the scheduler code considerably, making it fragile and difficult to work with — and they didn't even work much of the time. This complexity inspired Con Kolivas's "staircase deadline scheduler" as a much simpler solution to the problem; that work led to the writing (and merging) of CFS.
Naturally, CFS has lost a fair amount of its simplicity since it was merged; contact with the real world tends to do that to scheduling algorithms. But it is still relatively free of workload-specific heuristics. Opening the door to more of them now risks driving the scheduler in a less maintainable, more brittle direction where nothing can be done without a significant chance of creating problems in unpredictable places. It seems unlikely that the development community wants to go there.
A potentially simpler alternative is to let the application itself tell the scheduler that one of its processes is special. PostgreSQL could request that its dispatcher be allowed to run at the expense of one of its own workers, even if the normal scheduling algorithm would dictate otherwise. That approach reduces complexity, but it does so by pushing some of the cost into applications. Getting application developers to accept that cost can be a challenge, especially if they are interested in supporting operating systems other than Linux. As a general rule, it is far better if things just work without the need for manual intervention of this type.
In other words, nobody yet knows how this problem will be solved. There are several interesting ideas to pursue, but none is an obvious solution. Further research is clearly called for.
One good point in all of this is that the problem was found before the final 3.6 kernel shipped. Performance regressions have a way of hiding, sometimes for years, before they emerge to bite some important workload. Eventually, tools like Linsched may help to find more of these problems early, but we will always be dependent on users who will perform this kind of testing with workloads that matter to them. Without Nikolay's 3.6-rc testing, PostgreSQL users might have had an unpleasant surprise when this kernel was released.
Posted Oct 3, 2012 1:01 UTC (Wed) by zlynx (guest, #2285) [Link]
My thought when I read about this problem was that PostgreSQL is causing it with its user-space locking, and so PostgreSQL needs to fix it. Perhaps it should pin its processes to particular cores, ensuring that it spreads out across all the cores. It could also nice its worker processes while leaving the dispatcher at normal. Or it could negative-nice (would that be making the process mean?) its dispatcher.
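That suggestion can be sketched as a placement plan; this is purely hypothetical (PostgreSQL does not do this), and the names are invented for illustration:

```python
def placement_plan(cpus, worker_nice=5):
    """Sketch of the pin-and-nice idea: one worker per core at lowered
    priority, while the dispatcher runs anywhere at nice 0 so that it
    wins any contest for CPU time against its own workers.
    (Hypothetical; not anything PostgreSQL actually implements.)"""
    plan = {"dispatcher": {"cpus": set(cpus), "nice": 0}}
    for cpu in cpus:
        plan["worker-%d" % cpu] = {"cpus": {cpu}, "nice": worker_nice}
    return plan

plan = placement_plan([0, 1, 2, 3])
# Applying such a plan in each process would use os.sched_setaffinity()
# for the CPU set and os.nice() for the priority adjustment.
```

The cost of this approach is that the application second-guesses the scheduler's load balancing, which is precisely what the kernel developers would rather it not have to do.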
Posted Oct 3, 2012 4:30 UTC (Wed) by fdr (guest, #57064) [Link]
http://archives.postgresql.org/pgsql-hackers/2012-06/msg0...
This actually uses futexes indirectly, in my understanding from the post, and it's not the most awful thing for Linux.
Linux futexes may not be a bad idea (though they're not clearly a huge improvement), but s_lock does okay and has existed an awfully long time (with iteration), so there's some inertia there.
Also, I think I oversimplified the parent's post; he could have simply meant that something about s_lock is not very good, as opposed to "self-rolled user-space spinlocks? That's ridiculous!" (as a straw man). It's possible s_lock could be improved; however, it was fine before and doesn't clearly seem at fault right now. The suggestion of having to manhandle the platform-specific scheduler also seems excessive to me. There may be a better solution to serve everyone; somehow I can't see PostgreSQL's spinlocks being one-of-a-kind in this pathology, but I haven't attempted to prove that.
Posted Oct 3, 2012 18:56 UTC (Wed) by jberkus (guest, #55561) [Link]
So there are a couple of major problems with the idea that this issue should be fixed in the PostgreSQL code:
1. PostgreSQL is not the only application with its own spinlock implementation. There is, for example, Oracle.
2. While we might be able to change locking in future versions of PostgreSQL, we can't change locking in past ones.
Even if the next version of PostgreSQL (9.3) has a modified locking implementation which doesn't hit the issues in 3.6, the number of people running older versions of PostgreSQL will far outnumber the folks running the latest version for quite some time. What you'd be saying to all of those folks is, effectively, "don't upgrade Linux".
So any solution which hinges on "make these modifications to PostgreSQL" will instead result in PostgreSQL users deciding to stay on old versions of the kernel. If this problem is equally bad for Oracle, you might even see Red Hat refusing to deploy a version based on 3.6.
There's also the fact that PostgreSQL's locking implementation is complex and quite highly tuned. So the idea that the PostgreSQL project could make changes which wouldn't result in a worse regression in a few months is optimistic. Implementing things like futex support could take literally years.
Posted Oct 3, 2012 19:00 UTC (Wed) by corbet (editor, #1) [Link]
FWIW, the discussion in kernelland was based on the assumption that this regression was the kernel's problem. Nobody there suggested telling the PostgreSQL developers to come up with a new locking scheme.
Posted Oct 5, 2012 10:28 UTC (Fri) by amit.kucheria (subscriber, #59246) [Link]
Consolidating work onto fewer cores[1] is something we're actively looking into.
[1] https://blueprints.launchpad.net/linaro-power-kernel/+spe...
Posted Oct 3, 2012 8:07 UTC (Wed) by Zenith (guest, #24899) [Link]
Would a better solution than the one suggested by Mike not be to have three cores in each set, and then make the sets overlap?
So say we have a 4-core CPU (labelled A, B, C, and D) that then gets divided into two sets X = (A, B, C) and Y = (A, C, D).
That way the load can migrate from one set (X) of CPUs into another set (Y), provided that the load migration code gets lucky in selecting set Y.
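The overlapping-sets idea above can be modeled in a few lines; this is a toy illustration of the commenter's example, not anything in the kernel:

```python
# Toy version of the overlapping-sets idea for a 4-core CPU.
CPUS = ["A", "B", "C", "D"]
SETS = {"X": {"A", "B", "C"}, "Y": {"A", "C", "D"}}

def reachable(start, sets):
    """CPUs a task can eventually drift to: any CPU sharing a set with
    one it can already reach, applied until nothing new is added."""
    seen = {start}
    changed = True
    while changed:
        changed = False
        for members in sets.values():
            if seen & members and not members <= seen:
                seen |= members
                changed = True
    return seen

# Because A and C belong to both sets, load on B can reach D: first
# within set X, then across the overlap into set Y.
print(sorted(reachable("B", SETS)))   # ['A', 'B', 'C', 'D']
```

With disjoint sets, `reachable()` would stop at the set boundary; the overlap is what lets load spread across the whole package over time.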
Posted Oct 3, 2012 12:03 UTC (Wed) by ajb (guest, #9694) [Link]
However, in your example, A and C have 3 buddies, but B and D have only two. It seems like the scheduling algorithm would be just as efficient if they all had three, and there would be a better chance that migrations would be beneficial.
Posted Oct 6, 2012 8:13 UTC (Sat) by eternaleye (guest, #67051) [Link]
I've found this to be a far clearer explanation than the Wikipedia article: http://pl.atyp.us/wordpress/index.php/2007/12/the-kautz-g...
In learning about them myself, I wrote a perl script to generate a Kautz graph: http://ix.io/36j/
It takes the degree as the first argument and the dimension as the second, and outputs a list of edges as <from> <to> tuples, one per line.
I still need to write the Kautz::Graph class (stubbed in the file above) to embody a full set of the Kautz nodes with the same degree and dimension parameters, and see about modifying it to generate dot so it can make a nice graphic.
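For readers without the linked script, a Python sketch of the same construction (my own take, not the commenter's perl; it follows the standard definition of the Kautz graph):

```python
from itertools import product

def kautz_edges(degree, dimension):
    """Generate the Kautz graph K(degree, dimension): vertices are
    strings of length dimension+1 over degree+1 symbols in which no
    two consecutive symbols are equal; vertex s0..sN has an edge to
    s1..sN,t for every symbol t != sN."""
    symbols = range(degree + 1)
    vertices = [v for v in product(symbols, repeat=dimension + 1)
                if all(a != b for a, b in zip(v, v[1:]))]
    edges = [(v, v[1:] + (t,))
             for v in vertices
             for t in symbols if t != v[-1]]
    return vertices, edges

verts, edges = kautz_edges(2, 1)   # K(2,1)
print(len(verts), len(edges))      # 6 12
```

K(d, n) has (d+1)·dⁿ vertices, each with out-degree d, which is what makes these graphs attractive for connecting many nodes with short paths and few links per node.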
Posted Oct 6, 2012 8:26 UTC (Sat) by eternaleye (guest, #67051) [Link]
Well, one more thing to remedy.
Posted Oct 3, 2012 12:40 UTC (Wed) by andresfreund (subscriber, #69562) [Link]
The thing you could describe as a dispatcher, the "postmaster" process that forks to create individual backends after receiving a connection request, doesn't do any relevant locking. So preempting it should be mostly harmless.
There are other processes, though, which might show something like the effect described above: namely the 'wal writer', 'checkpointer' (9.2+; included in bgwriter before), and bgwriter processes. Especially if you have a write-intensive concurrent workload, interactions between individual connection backends and the wal writer can get contended, so it's somewhat likely that preemption might get painful there.
Postgres is *far* from the only application doing its own spinlock implementation and contending on the locks every now and then. I am pretty sure that if the patch had made its way into 3.6, more performance regression reports would have come in.
Posted Oct 11, 2012 21:29 UTC (Thu) by bbulkow (guest, #87167) [Link]
As a high-performance server author (currently http://Aerospike.com/ ), I am working hard to avoid context-switch overhead in a few places. We are getting hurt by core scheduling issues; we often recommend turning off hyperthreading.
I have tried signalling to the OS the heuristics I think will make my server go faster. I've used sched_setscheduler(), which seemed to be effective before CFS but makes no difference in "modern" kernels.
As Linux becomes more beloved as a server OS, what's so bad about having known tuning parameters or API calls to change the scheduling heuristics? Is sched_setscheduler() to be avoided?
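For reference, the interface the commenter mentions is available without leaving Python, which wraps the same system call; this Linux-only snippet shows the one policy change an unprivileged process can normally make:

```python
import os

# os.sched_setscheduler() wraps the sched_setscheduler(2) system call
# (Linux-only).  SCHED_BATCH marks a process as throughput-oriented:
# the scheduler assumes it is CPU-bound and penalizes it slightly at
# wakeup.  Unlike the real-time policies (SCHED_FIFO, SCHED_RR), it
# requires no special privileges to set on the current process (pid 0).
os.sched_setscheduler(0, os.SCHED_BATCH, os.sched_param(0))
print(os.sched_getscheduler(0) == os.SCHED_BATCH)
```

Under CFS the ordinary-policy knobs are deliberately few; whether that minimalism is a virtue or a limitation is essentially the question this comment is asking.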