
Toward a smarter OOM killer


By Jonathan Corbet
November 4, 2009
The Linux memory management code does its best to ensure that memory will always be available when some part of the system needs it. That effort notwithstanding, it is still possible for a system to reach a point where no memory is available. At that point, things can grind to a painful halt, with the only possible solution (other than rebooting the system) being to kill off processes until a sufficient amount of memory is freed up. That grim task falls to the out-of-memory (OOM) killer. Anybody who has ever had the OOM killer unleashed on a system knows that it does not always pick the best processes to kill, so it is not surprising that making the OOM killer smarter is a recurring theme in Linux virtual memory development.

Before looking at the latest attempt to improve the OOM killer, it is worth mentioning that it is possible to configure a Linux system in a way which all but guarantees that the OOM killer will never make an appearance. OOM situations are caused by the kernel's willingness to overcommit memory. As a general rule, processes only use a portion of the address space they have allocated, so limiting allocations to the total amount of RAM and swap space on the system would lead to underutilization of system memory. But that limitation can be imposed on systems which can never be allowed to go into an OOM state; simply set the vm.overcommit_memory sysctl knob to 2. Individual processes are much more likely to see allocation failures in this mode, but the system as a whole will not overcommit its resources.
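
For reference, the knob is exposed to userspace as /proc/sys/vm/overcommit_memory. A minimal sketch of setting it programmatically (this is ordinary procfs usage, not anything from the patch; it can equally be done with sysctl -w as root):

    /* Minimal sketch: switch the system to strict commit accounting by
     * writing "2" to vm.overcommit_memory via procfs. Needs root. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/vm/overcommit_memory", "w");
        if (!f) {
            perror("/proc/sys/vm/overcommit_memory");
            return 1;
        }
        fputs("2\n", f);   /* 0 = heuristic, 1 = always, 2 = never overcommit */
        fclose(f);
        return 0;
    }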

Most systems will allow overcommitted memory, though, because the alternative is too limiting. Overcommit works almost always, but the threat of a day when the Firefox developers add one memory leak too many always looms. When that sad occasion comes to be, it would be nice if the OOM killer would target that leaky Firefox process instead of, say, the X server and PostgreSQL. Many attempts have been made to add smarts to the OOM killer over the years; there's also a means by which the system administrator can steer the OOM killer toward or away from specific processes. But manual configuration is only suitable for certain, relatively static workloads; for the rest, the OOM killer often proves less discriminating than one would like.

The latest attempt to fix the OOM killer comes from Hiroyuki Kamezawa. This patch makes a number of fundamental changes to how OOM victims are selected; the result is an OOM killer which is smarter in some ways, but which also takes a somewhat different approach to choosing its targets.

One of the factors that the current OOM killer takes into account, naturally, is the amount of memory being used by each process. But the measure used (mm->total_vm) is somewhat crude: it penalizes processes using a lot of shared memory and says little about how much physical memory the process is using. Hiroyuki's patch tries to move away from total_vm in most situations, looking at the actual resident set size (RSS) and possibly taking into account the amount of swap space used as well.
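
For illustration only - this is not the kernel's actual badness() code - the shift being described amounts to scoring tasks on what they actually hold in RAM (and perhaps swap) rather than on their total virtual size:

    /* Illustrative sketch, not the real kernel heuristic: the old scheme
     * scored a task by its total virtual size, which penalizes shared
     * mappings; the approach described above scores it by resident pages,
     * optionally adding the pages it has pushed out to swap. */
    struct task_mem {
        unsigned long total_vm;    /* all mapped pages, including shared */
        unsigned long rss;         /* pages resident in RAM */
        unsigned long swap_pages;  /* pages this task has in swap */
    };

    static unsigned long old_score(const struct task_mem *t)
    {
        return t->total_vm;
    }

    static unsigned long new_score(const struct task_mem *t, int count_swap)
    {
        unsigned long score = t->rss;
        if (count_swap)
            score += t->swap_pages;
        return score;
    }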

Figuring in swap usage is controversial. A program which is using a lot of swap is clearly putting pressure on memory, but, if that program has been mostly swapped out, killing it will not immediately free much RAM. Eventually other processes can be shifted into the newly-freed swap space, but it might make more sense to just do away with those other processes at the outset. Even so, Hiroyuki's patch, for now, will figure in swap space if specific constraints do not force the use of other criteria.

One constraint which can change the calculation is when the memory shortage is specific to low memory - the region of memory which can be directly addressed by the kernel. When a low-memory allocation is required, nothing else will do, so there is little value in killing processes which are not hogging low-memory pages. With Hiroyuki's patch, the VM subsystem tracks how much low memory each process is using as a separate statistic. If the OOM situation is caused by an attempt to allocate low memory, the OOM killer's "badness" function will focus on processes holding large amounts of low memory.

The current OOM killer makes an attempt to target "fork bomb" processes by adding half of each child's "badness" value to its parent. A process with a lot of children will thus have a high badness and will come under the OOM killer's baleful gaze sooner. The problem here, of course, is that some processes legitimately have lots of children - the session manager for the user's desktop environment is a good example. Killing gnome-session is likely to free substantial amounts of memory, but the user's gratitude may be surprisingly limited.

The patch changes the fork bomb detector significantly. The new code counts only the child processes which have been running for less than a specific amount of time (five minutes in the posted patch). If one process has newborn children which make up at least 1/8 of the processes on the system, that process is deemed to be a fork bomb; it is duly rewarded with a spot at the top of the OOM killer's short list.
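
The test, as described, reduces to a simple ratio check; a sketch with invented names (the real patch works on kernel task structures):

    /* Sketch of the fork-bomb criterion described above; all names here
     * are invented for illustration. */
    #define FORK_BOMB_AGE_SECS  (5 * 60)  /* a child younger than this is "newborn" */

    /* nr_young_children: this process's children that started less than
     * FORK_BOMB_AGE_SECS ago (counted elsewhere). The process is flagged
     * when those children amount to at least 1/8 of all processes. */
    static int looks_like_fork_bomb(int nr_young_children, int total_processes)
    {
        return nr_young_children * 8 >= total_processes;
    }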

Finally, the current OOM killer tries to kill newly-created processes, while allowing long-running processes to continue. Hiroyuki feels that this approach creates a loophole for long-running processes which slowly leak memory. That web browser may have been running for a long time and is thus a high-value process, but it has been dropping memory on the floor for that long time and is also the cause of the problem. So the new code changes the calculation to look at how long it has been since the process has expanded its virtual memory size. A process which has been running for a long time, but which has not grown in that time, will look better than one which has been expanding.
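
Again purely as a sketch of the idea (none of these names exist in the patch or the kernel), the age-versus-growth distinction amounts to asking how long a task has gone without growing its address space:

    #include <time.h>

    /* Invented structure: a task that has not expanded its virtual size
     * recently looks like a better citizen than one that keeps growing,
     * regardless of how long either has been running. */
    struct vm_history {
        time_t last_expansion;   /* when total_vm last grew */
    };

    /* Larger return value == less attractive OOM victim. */
    static double quiet_seconds(const struct vm_history *h, time_t now)
    {
        return difftime(now, h->last_expansion);
    }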

There seems to be little disagreement with the idea that the OOM killer needs a rework, but not everybody is sold on this approach yet. It looks like a very large change, which makes some people nervous. It also shifts the focus of the OOM killer's attention in a significant way: the current heuristics were designed to be as unsurprising to the user as possible, while the new ones are focused more strongly on freeing RAM quickly. But, given that the existing heuristics are still clearly producing plenty of surprises, perhaps a more goal-oriented approach makes sense.

(Naturally, no article on the OOM killer is complete without a link to this 2004 comment from Andries Brouwer).



Toward a smarter OOM killer

Posted Nov 4, 2009 16:37 UTC (Wed) by holstein (guest, #6122) [Link]

Would it be possible to send some of the intelligence needed to userspace?

I know this can lead to a chicken-and-egg situation (after all, we have no more memory left now), but sometimes simple things could "fix" the memory problem with an operation specific to the current system.

Let's take, for example, a web server: if it is OOMing, it may be because of a runaway apache child eating all memory. Or, more probably, lots of child processes each eating just a bit too much memory, a lot of the time.

What if we could specify "run this script before OOM": in this case, restarting the webserver would be better than simply killing it...

In many cases, if the sysadmin could specify a specific action, or at least specific processes to act on first, this would make for more predictable behavior, no?

Toward a smarter OOM killer

Posted Nov 4, 2009 16:51 UTC (Wed) by mjthayer (guest, #39183) [Link]

Like an event of some sort to userspace that OOM is likely to occur soon?

Toward a smarter OOM killer

Posted Nov 4, 2009 17:00 UTC (Wed) by johill (subscriber, #25196) [Link]

Toward a smarter OOM killer

Posted Nov 4, 2009 17:11 UTC (Wed) by mjthayer (guest, #39183) [Link]

Do you know if that ever got merged?

Toward a smarter OOM killer

Posted Nov 4, 2009 23:24 UTC (Wed) by johill (subscriber, #25196) [Link]

I don't think anything close to that got merged, no. I think people are still fighting over how to do that with cgroups or not or something like that, but I have no idea really.

Toward a smarter OOM killer

Posted Nov 4, 2009 17:02 UTC (Wed) by holstein (guest, #6122) [Link]

Yes, exactly.

For a server, that would let the sysadmin log the event, react to it, etc.

For a desktop, one can imagine a popup warning the user, perhaps with a list of memory-hog processes. This would let the user use the session support of Firefox to restart their browsing session...

In many cases, restarting one guilty process afresh can be enough to prevent an OOM killing spree.

Toward a smarter OOM killer

Posted Nov 4, 2009 17:22 UTC (Wed) by mjthayer (guest, #39183) [Link]

One of my favourite grouses applies here though - at least on my system, I can be sure that a swap-to-death will start long before the OOM killer does, or before any notification would occur. For me, that would be something much more urgent to fix.

I think it might be doable by making the algorithm that chooses pages to swap out smarter, so that each process is guaranteed a certain amount of resident memory (depending on the number of other processes and logged-in users and probably a few other factors), and a process that tries to hog memory ends up swapping out its own pages when other running processes drop to their guaranteed minimum. If I ever have time, I will probably even try coding that up.

death by swap

Posted Nov 4, 2009 18:12 UTC (Wed) by jabby (guest, #2648) [Link]

Indeed. Many systems are installed with way too much swap space, such that swap thrashing begins killing performance before the OOM killer can be invoked. So, this is a plea to other people to

STOP USING the old "TWICE RAM" guideline!

That dates back to when 64MB of RAM was considered "beefy". I seriously had to provision a server last night with 16GB of physical memory and the customer wanted a 32GB swap partition!! Seriously?! If your system is still usable after you're 2 to 4 gigs into your swap, I'd be shocked.

death by swap

Posted Nov 4, 2009 18:21 UTC (Wed) by mjthayer (guest, #39183) [Link]

I think Ubuntu still do that by default :) Yes, the workaround for swapping to death is allocating no or little swap. I would hope, though, that my "algorithm" above would reduce the need for workarounds by making sure that all non-hogging processes can have their working set in real memory.

Not sure that 32GB of swap would be appropriate even then though...

death by swap

Posted Nov 6, 2009 11:40 UTC (Fri) by patrick_g (subscriber, #44470) [Link]

>>> I think Ubuntu still do that by default :)

Yes, Ubuntu still do that by default... despite many bug reports like mine.
Note that my bug report is old, very old (pre-Gutsy time) => https://bugs.launchpad.net/ubuntu/+source/partman-auto/+bug/134505
No reaction at all from the Ubuntu devs... very discouraging.

death by swap

Posted Nov 4, 2009 18:31 UTC (Wed) by clugstj (subscriber, #4020) [Link]

"If your system is still usable after you're 2 to 4 gigs into your swap, I'd be shocked."

It all depends on the workload.

death by swap

Posted Nov 4, 2009 18:36 UTC (Wed) by ballombe (subscriber, #9523) [Link]

If you use suspend-to-disk, you need a large amount of swap-space anyway.

death by swap

Posted Nov 5, 2009 6:54 UTC (Thu) by gmaxwell (guest, #30048) [Link]

Or using tmpfs for /tmp

death by swap

Posted Nov 5, 2009 11:11 UTC (Thu) by quotemstr (subscriber, #45331) [Link]

Which really should be the default. /var/tmp is for large temporary files.

death by swap

Posted Nov 5, 2009 18:25 UTC (Thu) by khc (guest, #45209) [Link]

How does tmpfs for /tmp help suspend to disk?

death by swap

Posted Nov 5, 2009 18:46 UTC (Thu) by nix (subscriber, #2304) [Link]

I think his point was that using tmpfs for /tmp needs a large amount of
swap :)

death by swap

Posted Nov 4, 2009 18:49 UTC (Wed) by knobunc (guest, #4678) [Link]

I agree with you... unless you want to hibernate. In which case you need at least the same amount of swap as RAM. But if you actually want to use that swap whilst running, you'd need a bit more. So now you are back to somewhere between 1x and 2x RAM.

death by swap

Posted Nov 4, 2009 19:59 UTC (Wed) by zlynx (guest, #2285) [Link]

Doesn't the user-space hibernate still exist? Didn't that allow hibernate to a file, with compression and everything?

I don't know, I haven't run a Linux laptop in almost a year now since an X.org bug killed my old laptop by overheating it.

death by swap

Posted Nov 5, 2009 18:22 UTC (Thu) by nix (subscriber, #2304) [Link]

tuxonice allows that too. But just because you *can* hibernate to a file
doesn't mean you *must*.

death by swap

Posted Nov 4, 2009 21:43 UTC (Wed) by drag (guest, #31333) [Link]

I don't know how the particular algorithm works, but I really doubt that the
amount of swap space available dictates how likely the Linux kernel is to
use swap, or how much of it it wants to use.

I expect the only time it'll want to use swap in a busy system is if the
active amount of used memory exceeds the amount of main memory.

death by swap

Posted Nov 4, 2009 22:54 UTC (Wed) by mjthayer (guest, #39183) [Link]

The problem with the amount of swap space is that the more there is, the longer it takes until the OOM killer is fired up. If you have 2GB of swap, then it will be delayed by at least the time it takes to write a gig or so of data to the disk, page by page with pauses for running the swapping thread in-between each page, and your system will be unusable until then.

The algorithm is roughly as follows.

* Assign each process a quota of main memory, e.g. by dividing the total available by the number of users with active running processes, and giving each process an equal share of the quota of the user running it.
* When a page of memory is to be evicted to the swap file, make sure either that it has not been accessed for a certain minimum length of time, or that the process owning it is over its quota, or that it is owned by the process on whose behalf it is to be swapped out. If not, search for another page to evict.

This should mean that if a process starts leaking memory badly or whatever, after a while it will just loop evicting its own pages and not trouble the other processes on the system. It should also mean that all not-too-large processes on the system should stay reasonably snappy, making it easier to find and kill the out-of-control process.
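
A minimal sketch of the eviction rule being proposed here (names and structure invented for illustration; nothing like this exists in the kernel today):

    /* Only evict a page if it is cold, or its owner is over its quota,
     * or its owner is the very process whose allocation forced the
     * eviction. All names here are made up. */
    struct page_info {
        long          owner_pid;
        unsigned long idle_secs;   /* time since the page was last accessed */
    };

    /* Stub: a real implementation would compare the owner's resident set
     * with its share of RAM, divided up as described above. */
    static int over_quota(long pid)
    {
        (void)pid;
        return 0;
    }

    static int may_evict(const struct page_info *page, long requesting_pid,
                         unsigned long min_idle_secs)
    {
        if (page->idle_secs >= min_idle_secs)
            return 1;    /* cold page: fair game */
        if (over_quota(page->owner_pid))
            return 1;    /* owner has exceeded its share */
        if (page->owner_pid == requesting_pid)
            return 1;    /* a hog ends up evicting its own pages */
        return 0;        /* part of a protected working set */
    }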

death by swap

Posted Nov 4, 2009 22:56 UTC (Wed) by mjthayer (guest, #39183) [Link]

Oh yes, and of course processes with a working set smaller than their quota will not hog the unused part when memory gets tight.

death by swap

Posted Nov 7, 2009 1:21 UTC (Sat) by giraffedata (guest, #1954) [Link]

If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted. You're also wasting the swap I/O it's doing. After a while, after other processes have had a chance to progress, you can swap them out and give the first process the memory it needs. If you can't do that because it's run amok and simply demands more memory than you can afford, that's when you kill that process.

Algorithms for this were popular in the 1970s for batch systems. Unix systems were born as interactive systems where the idea of not dispatching a process at all for ten seconds was less palatable than making the user kill some stuff or reboot, but with Unix now used for more diverse things, I'm surprised Linux has never been interested in long term scheduling to avoid page thrashing.

death by swap

Posted Nov 7, 2009 3:39 UTC (Sat) by tdwebste (guest, #18154) [Link]

On embedded devices I have constructed processing states with runit to control the running processes. This simple but effective long-term scheduling to avoid out-of-memory/swapping situations works well when you know in advance what processes will be running on the device.

death by swap

Posted Nov 7, 2009 10:01 UTC (Sat) by dlang (guest, #313) [Link]

if the program you are stealing the memory from needs it to make progress, you would be right.

but back in the 70's they realized that most of the time most programs don't use all their memory at any one time. so the odds are pretty good that the page of ram that you swap out will not be needed right away.

and the poor programming practices that are common today make this even more true

death by swap

Posted Nov 7, 2009 17:06 UTC (Sat) by giraffedata (guest, #1954) [Link]

> but back in the 70's they realized that most of the time most programs don't use all their memory at any one time. so the odds are pretty good that the page of ram that you swap out will not be needed right away.

I think you didn't follow the scenario. We're specifically talking about a page that is likely to be needed right away. It's a page that the normal page replacement policy would have left alone because it expected it to be needed soon -- primarily because it was accessed recently.

But the proposed policy would steal it anyway, because the process that is expected to need it is over its quota and the policy doesn't want to harm other processes that aren't.

What was known in the 70s was that at any one time, a program has a subset of memory it accesses a lot, which was dubbed its working set. We knew that if we couldn't keep a process' working set in memory, it was wasteful to run it at all. It would page thrash and make virtually no progress. Methods abounded for calculating the working set size, but the basic idea of keeping the working set in memory or nothing was constant.

death by swap

Posted Nov 9, 2009 9:02 UTC (Mon) by mjthayer (guest, #39183) [Link]

> If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted.
I suppose I see three cases here. One is that the page was part of the process' working set at an earlier point in time, but no longer is. In that case swapping it out is the right thing to do. The other is that the process is in control, but its working set is bigger than the available memory. Then I agree that there is a good case for putting it on hold until enough memory is available, although that is a non-trivial problem which is somewhat outside of the scope of what I am trying to do. And the third case is the one that I am interested in - a runaway process which will eventually be OOMed. In this case, the quota will stop it from trampling on the working set of every other process in memory in the meantime.

While we are on the subject, does anyone reading this know where RSS quotas are handled in the current kernel code? I was able to find the original patches enabling them, but the code seems to have changed out of recognition since then.

death by swap

Posted Nov 9, 2009 12:34 UTC (Mon) by hppnq (guest, #14462) [Link]

You may want to look at Documentation/cgroups/memory.txt. Otherwise, it seems there is no way to enforce RSS limits. Rik van Riel wrote a patch a few years ago but it seems to have been dropped.

Personally, I would hate to think that my system spends valuable resources managing runaway processes. ;-)

** Encouragement encouragement encouragement **

Posted Nov 13, 2009 22:32 UTC (Fri) by efexis (guest, #26355) [Link]

I (for one) would be most interested in your work. The systems I manage are very binary in whether they behave or not, because I have configured them to behave (i.e., given how much memory is available, I decide how much to give to database query caching etc., so everything just works). I try to keep the swap file around 384MB whether the system has 1GB or 8GB of RAM, because that's a nice size for swapping out stuff you don't need to keep in memory, but using disk as virtual RAM is just way too slow; I'd prefer processes be denied memory requests than have them granted at the cost of slowing the whole system down. But all in all, because everything is set up for the amount of memory available, the only time I get into OOM situations is when there is a runaway process (I manage systems hosting small numbers of database-driven websites; some of them may be developed on Windows systems and then moved to the Linux system, and most are written in PHP, which has a very low bar of entry, so developers often do not have a clue when it comes to writing scalable code).

So, what I would want is something that assumes that most of the system is being well behaved, but will quickly chop off anything that is not, and will stop the badly behaved stuff from dragging the well behaved stuff down with it. The well behaved stuff quite simply doesn't need managing; that's my job. The badly behaved stuff needs taking care of quickly, which your idea seems to address *perfectly* (it's not often you read someone's ideas and your brain flips: "that's -exactly- what I need").

How would I find out if you do get the chance to hammer out the code that achieves this? Is there a non-LKML route to watch this (please don't say Twitter :-p )?

** Encouragement encouragement encouragement **

Posted Nov 16, 2009 13:45 UTC (Mon) by mjthayer (guest, #39183) [Link]

Ahem, I haven't thought that far ahead yet :) So far it is one of a few small projects I have lined up for whenever I have time, but I was posting here in order to get some feedback from wiser minds than my own before I made a start.

death by swap

Posted Nov 16, 2009 13:50 UTC (Mon) by mjthayer (guest, #39183) [Link]

>> If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted.
>I suppose I see three cases here. One is that the page was part of the process' working set at an earlier point in time, but no longer is. In that case swapping it out is the right thing to do. The other is that the process is in control, but its working set is bigger than the available memory. Then I agree that there is a good case for putting it on hold until enough memory is available, although that is a non-trivial problem which is somewhat outside of the scope of what I am trying to do. And the third case is the one that I am interested in - a runaway process which will eventually be OOMed. In this case, the quota will stop it from trampling on the working set of every other process in memory in the meantime.
Actually case 2 could be handled to some extent by lowering the priority of a process that kept on swapping for too long.

death by swap

Posted Nov 4, 2009 23:49 UTC (Wed) by jond (subscriber, #37669) [Link]

On the other hand, if you set vm.overcommit_memory to 2, you almost certainly want
lots of swap, especially in cheapo VMs. There's a whole raft of programs
that you cannot start with, say, 256MB RAM and little swap without overcommit.
Mutt and irssi are two that spring to mind. Lots of swap lets you
"overcommit" with the risk being you end up swapping rather than you end up
going on a process killing spree.

death by swap

Posted Nov 6, 2009 8:46 UTC (Fri) by iq-0 (subscriber, #36655) [Link]

You can still set the vm.overcommit_ratio to something higher. I think in most cases vm.overcommit_memory=2 with vm.overcommit_ratio=1000 is saner than using vm.overcommit_memory=0.
The only reason this couldn't be a sane default is that on systems with 32MB an overcommit_ratio of 1000% is still too small (but then, if you have 32MB and no swap, you're probably still better off with this limit).

death by swap

Posted Nov 18, 2009 16:12 UTC (Wed) by pimlottc (guest, #44833) [Link]

irssi can't start on a system with 256MB RAM and no swap? Seriously?

death by swap

Posted Nov 5, 2009 17:52 UTC (Thu) by sbergman27 (guest, #10767) [Link]

When I was running our 60-desktop XDMCP/NX server on 12GB of memory, performance was fine even though swap space usage was usually 8+ GB. We ran this way for months with no complaints. Just because you're using a lot of swap space doesn't mean you are paging excessively. Note that if a page is paged out and then brought back into memory, it stays written in swap to save writing it again if that page gets paged out again. You can't tell a whole lot about how much swap you are *really* using by looking at the swap-used number. sysstat monitoring and sar -W are more useful than the swap-used number for assessing swapping.

I do use the twice-RAM rule. I'd rather the system get slow than crash or have the OOM killer running loose on it.

death by swap

Posted Nov 5, 2009 21:14 UTC (Thu) by anton (subscriber, #25547) [Link]

I have seen several cases where a process slowly consumed more and more memory, but apparently always had a small working set, so it eventually consumed all the swap space and the OOM killer killed it (sometimes it killed other processes first, though). The machine was so usable during this that I did not notice that anything was amiss until some process was missing. IIRC one of these cases happened on a machine with 24GB RAM and 48GB swap; there it took several days until the swap space was exhausted.

death by swap

Posted Nov 6, 2009 6:35 UTC (Fri) by motk (subscriber, #51120) [Link]

CPU states: 73.2% idle, 16.4% user, 10.4% kernel, 0.0% iowait, 0.0% swap
Memory: 64G real, 2625M free, 62G swap in use, 40G swap free

You were saying? :)

death by swap

Posted Nov 6, 2009 8:43 UTC (Fri) by mjthayer (guest, #39183) [Link]

I take it you don't have any out-of-control processes allocating memory there, do you? I wasn't saying it doesn't work in the general case :)

death by swap

Posted Nov 18, 2009 16:47 UTC (Wed) by pimlottc (guest, #44833) [Link]

> STOP USING the old "TWICE RAM" guideline!
You know, I've been hearing this lately, but the problem is there seems to be no consensus on what the guideline should be. Some swear by no swap at all, while others say running without at least some is dangerous. No one seems to agree on what an appropriate amount is. Until there is a new accepted rule of thumb, everyone will keep using the old one, even if it's wrong.

death by swap

Posted Nov 18, 2009 18:37 UTC (Wed) by dlang (guest, #313) [Link]

there used to be a technical reason for twice ram.

nowadays it depends on your system and how you use it.

if you use the in-kernel suspend code, the act of suspending will write all your ram into the swap space, so swap must be > ram

if you don't use the in-kernel suspend code you need as much swap as you intend to use. How much swap you are willing to use depends very much on your use case. for most people a little bit of swap in use doesn't hurt much, and by freeing up additional ram it results in an overall faster system. for other people the unpredictable delays in applications due to the need to pull things from swap are unacceptable. In any case, having a lot of swap activity is pretty much unacceptable for anyone.

note that if you disable overcommit you need more swap or allocations (like when a large program forks) will fail, so you need additional swap space > the max memory footprint of any process you intend to allow to fork (potentially multiples of this). With overcommit disabled I could see you needing swap significantly higher than 2x ram in some conditions.

my recommendation is that if you are using a normal hard drive (usually including the SSD drives that emulate normal hard drives), allocate a 2G swap partition and leave overcommit enabled (and that's probably a lot larger than you will ever use)

if you are using a system that doesn't have a normal hard drive (usually this sort of thing has no more than a few gigs of flash as its drive) you probably don't want any swap, and definitely want to leave overcommit on.

death by swap

Posted Nov 19, 2009 16:46 UTC (Thu) by nye (guest, #51576) [Link]

>my recommendation is that if you are using a normal hard drive (usually including the SSD drives that emulate normal hard drives), allocate a 2G swap partition and leave overcommit enabled (and that's probably a lot larger than you will ever use)

FWIW, I agree, except that I'd make it a file instead of a partition - it's just as fast, and it leaves some flexibility just in case.

I use a 2GB swapfile on machines ranging from 256MB to 8GB of RAM - it may be overkill but that much disk space costs next to nothing. I wouldn't want to set it higher, because if I'm really using swap to that extent, the machine's probably past the point of usability anyway.

death by swap

Posted Nov 19, 2009 13:29 UTC (Thu) by makomk (guest, #51493) [Link]

On the other hand, too small a swap partition can also cause "death by swap". Once the swap partition fills up, the only way to free up memory is to start evicting code from RAM - and that utterly kills the responsiveness of the system, as less and less RAM becomes available for the running apps' code. The system enters a kind of death spiral, where the more memory the application allocates, the more slowly it and every other application runs.

death by swap

Posted Nov 19, 2009 18:33 UTC (Thu) by dlang (guest, #313) [Link]

I don't understand how that is any different than thrashing the swap.

in both cases you have to read from disk to continue, the only difference is if you are reading from the swap space or the initial binary (and since both probably require seeks, it's not even a case of random vs sequential disk access)

death by swap

Posted Dec 5, 2009 17:19 UTC (Sat) by misiu_mp (guest, #41936) [Link]

That was a description of thrashing.
I would presume that executables do not make up much of the used memory, so reusing their pages will probably not gain much.

Thrashing is what happens when processes get their pages continuously swapped in and out as the system schedules them to run. That's when everything grinds to a halt, because each context switch or memory access needs to swap out some memory in order to make room for other memory to be read in from the swap or the binary.
That can happen when the total working set (actively used memory) of the busy processes exceeds the amount of RAM, or, more realistically, when the swap (nearly) runs out so there is nowhere to evict unused pages to free up RAM - leaving space for only small chunks to run at a time.
Usually (in my desktop experience) soon after that the OOM killer starts kicking in, which causes the system to thrash even more (the OOM killer has needs too), and it takes hours for it to be done.
When it happens I usually have no choice but to reboot, losing some data, so for me the OOM killer has been useless and over-commitment the root of all evil.

Using up swap alone does not affect performance much if you don't access what's in swap. If you continuously do that - that's thrashing.

Toward a smarter OOM killer

Posted Nov 5, 2009 13:45 UTC (Thu) by hppnq (guest, #14462) [Link]

What could work for you, is to run a dummy process (allocate memory as you like) and have that killed first by the OOM (use Evgeniy Polyakov's patch), so it would 1) notify the administrator that the system has run into this problem, and 2) free up enough memory so something can actually be done about it.

Just as an exercise, of course. ;-)

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 4, 2009 20:54 UTC (Wed) by vadim (subscriber, #35271) [Link]

All this tweaking IMO just shows that the OOM killer is a fundamentally bad
idea. It's based on the very wrong philosophy that a good thing to do
about a problem is to ignore its existence, as if that would make it go
away.

Of course it doesn't. Reality is what it is, there's a finite amount of
memory and it's not possible to really use more than what's available. So
after the kernel pretends there's more memory than there really is, it
must deal with the consequences of being unable to uphold that promise and
has to kill some process that may not have anything to do with the memory
problems.

Besides the whole mess the kernel has to get involved in due to this strange
way of doing things, it means applications can't sanely handle memory
shortage, because the kernel won't let them.

The whole reason for the existence of this strange system seems to be that
some applications allocate more memory than they use. But I think there
are much better things to do about that.

First, any allocated but yet unused memory could be used for data that can
always be freed when the application tries to use the memory, such as disk
cache.

Second, I think the effort would be much better spent on writing some sort
of program (valgrind patch?) that would show which applications are
allocating memory they don't use, how much and where.

Fixing the application to allocate only what's required would have
several benefits: it would stop wasting memory on systems without an OOM
killer; it would remove the need for the OOM killer, as overcommit would
stop providing any benefit; it would improve the general stability of
Linux, since without an OOM killer the kernel would stop doing stupid
things like killing database servers; and it would give programmers some
incentive to sanely handle out-of-memory conditions (since currently they
can't even if they want to).

At the very least I think overcommit should be off by default.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 4, 2009 21:47 UTC (Wed) by drag (guest, #31333) [Link]

Applications absolutely need to be notified if they are running out of RAM.

They need to know it and need to be able to react to it.

Now whether or not applications actually do respond is something else
entirely. But you _MUST_ give application developers the chance to "do the
right thing", even if you don't expect them do to do it.

So send out-of-memory errors to applications and hope that they use them, and
have the OOM killer for when they don't.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 4, 2009 22:30 UTC (Wed) by vadim (subscriber, #35271) [Link]

The OOM killer is never needed. Here's how things work:

With overcommit:

1. Application mallocs 200MB when there are 150MB available. Kernel hopes
it doesn't actually use it, and lets it happen anyway.
2. Applications work for a while.
3. At some point (not necessarily while running the app that malloced the
200MB), kernel realizes: crap, I'm out of memory. Process needs a page,
but there's no memory that can be freed and swap is full. Got to kill
something to make room. It picks a process and kills it. With SIGKILL.

Some process dies with no chance to react to it. It can't react because it
can get killed in the middle of absolutely anything, even something like a
for(;;); which doesn't allocate any memory. There's no way for it to react
sanely.

Without overcommit:
1. Application mallocs 200MB with 150MB free. Kernel says "nope, there
isn't that much" and malloc returns NULL.
2. At that point application can decide what to do about that. It may
abort, or refuse to open a document too large for memory but keep running,
decide it can work with a smaller internal cache, etc. If it's badly
written it doesn't check the malloc return value and crashes on the null
pointer, but even then, the application that goes is precisely the one
that wanted too much memory.
3. OOM killer isn't needed, because the kernel never gets to the "crap,
I'm out of memory" stage.

Done this way there's no need to hope for anything. You simply don't allow
the situation overcommit gets into to happen in the first place.
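
On the application side, the no-overcommit path above is just ordinary malloc() checking; a trivial example (what to do on failure is, of course, the application's choice):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        size_t want = 200UL * 1024 * 1024;   /* 200MB */
        char *buf = malloc(want);

        if (buf == NULL) {
            /* With vm.overcommit_memory=2 this failure can show up here,
             * up front, and the program can degrade gracefully instead
             * of being SIGKILLed later by the OOM killer. */
            fprintf(stderr, "not enough memory, falling back to a smaller cache\n");
            return 1;
        }
        memset(buf, 0, want);                /* actually touch the pages */
        free(buf);
        return 0;
    }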

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 0:17 UTC (Thu) by madscientist (subscriber, #16861) [Link]

Unfortunately you're simplifying things a lot by only considering malloced memory. The system uses memory for many more things than an explicit malloc. The traditional reason that disabling overcommit is so painful is not that it causes malloc() to fail, but that it causes fork/exec problems. Consider what happens on fork: you get a complete copy of the parent process INCLUDING ALL ITS HEAP. Obviously that's painful, especially since 98% of the time you'll just turn around and call exec anyway, and then you won't need all that RAM. So what the system does is provide COW behavior... but what happens if you DON'T exec, and you start writing memory? Now the kernel has to come up with all that RAM it pretended that the child process had, that wasn't really committed. There's no way to catch that failure either.
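
The fork() half of the problem can be seen with a toy program along these lines (illustrative only): with overcommit enabled the fork succeeds cheaply, and any failure arrives later, without an error return, while the child dirties its copy of the heap.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        size_t sz = 512UL * 1024 * 1024;     /* a "large" parent heap */
        char *heap = malloc(sz);
        if (heap == NULL)
            return 1;
        memset(heap, 1, sz);                 /* make the pages real */

        pid_t pid = fork();                  /* COW: cheap with overcommit */
        if (pid == 0) {
            /* The child does NOT exec. Writing to the heap forces the
             * kernel to deliver the copied pages it only promised at
             * fork() time; if it cannot, there is no error return here -
             * the OOM killer simply picks a victim. */
            memset(heap, 2, sz);
            _exit(0);
        }
        waitpid(pid, NULL, 0);
        free(heap);
        return 0;
    }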

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 9:54 UTC (Thu) by epa (subscriber, #39769) [Link]

Which is why fork/exec, while having a nice conceptual simplicity and a long Unix tradition, really needs to be replaced by a 'run child process' call that wouldn't double the memory usage for an instant only to throw it away again on exec.

(vfork() as in classical BSD is one answer, but still a bit crufty IMHO: rather than a special kind of fork that you can only use before exec, better to just say what you mean and have fork+exec be a single call.)

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 9:55 UTC (Thu) by epa (subscriber, #39769) [Link]

Ah... and I see that posix_spawn(3) does exist. Now we just need to fix userspace to use it.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 18:25 UTC (Thu) by nix (subscriber, #2304) [Link]

posix_spawn*() is a bloody abomination. Nobody uses it because *despite*
being horrendously complicated it is *still* not flexible enough for
things that regular applications do all the time. And it never will be:
you'd have to implement a Turing-complete interpreter in there to approach
the flexibility of the fork/exec model...

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 6, 2009 8:55 UTC (Fri) by epa (subscriber, #39769) [Link]

At the moment if a 500Mbyte process forks itself the kernel has no idea whether it's about to exec() something else, in which case almost all those pages in the child process will be discarded, or if the child is going to continue on its way, in which case the pages are going to be needed and may well be written to. That ambiguity leads to a default policy of allowing the fork to succeed, but when that turns out to be the wrong judgement the OOM killer has to run.

It would be better for applications to give the kernel more clues about their intention, so the kernel can make better decisions on memory management.

I agree that posix_spawn, like almost anything that comes out of a committee, is a complicated monster. Perhaps a better answer would be to refine the distinction between fork() and vfork(), or to introduce a new fork-like call fork_intend_to_exec_soon(). Then the kernel could know that for an ordinary fork() it has to be cautious and check all the required memory is available, while fork_intend_to_exec_soon() has the current optimistic behaviour.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 6, 2009 13:43 UTC (Fri) by nix (subscriber, #2304) [Link]

fork_intend_to_exec_soon() should be the default, because *most* forks are
rapidly followed by exec()s. Whatever you choose, getting it used by much
software would be a long slow slog :/

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 6, 2009 19:14 UTC (Fri) by dlang (guest, #313) [Link]

and if an application misuses this and does fork_intend_to_exec_soon() and then doesn't exec soon, what would the penalty be?

if applications can misuse this without a penalty they will never get it right (especially when using it wrong will let their app keep running in cases where the fork would otherwise fail)

but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 8, 2009 21:21 UTC (Sun) by epa (subscriber, #39769) [Link]

I don't think it matters much if a few slightly-buggy applications use the wrong variant. If 90% of userspace including the most important programs such as shells passes the right hint to the kernel, the kernel can make better decisions than it does now, and the need for the OOM killer will be reduced. It's a similar situation with raw I/O, for example: a disk-heavy program such as a database server might know that it will scan through a large file just once. Ordinarily this file's contents might clog up the page cache and evict more useful things. To help get more consistent performance, apps can be coded to hint to the kernel that it needn't bother to cache a particular I/O request. The default is still to cache it, and it's not catastrophic if one or two userspace programs haven't been tuned to use the new fancy hinting mechanism.
> but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.
Very true, but of course there's no way for the kernel to know this. I expect most apps would prefer the fork to either succeed for sure, or fail at once if not enough memory can be guaranteed. There may be a few where optimistically hoping for the best and perhaps killing a random process later is the ideal behaviour.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 8, 2009 23:32 UTC (Sun) by nix (subscriber, #2304) [Link]

Yeah, but forking to exec immediately afterwards is the common case. If
you make that some weird nonportable new variant, 90% of programs are
never going to use it, and none of the rest will until some considerable
time has passed (time for this call to percolate down into the kernel and
glibc --- and try getting this call past Ulrich, ho ho.)

(Anyway, we *have* fork_to_exec_soon(). It's called vfork().)

posix_spawn is stupid as a system call

Posted Nov 5, 2009 13:10 UTC (Thu) by helge.bahmann (subscriber, #56804) [Link]

how many parameters do you want to expend for a "unified fork+exec" system
call?

fork + set*uid/set*gid + exec
fork + chroot + chdir + exec
fork + close(write_side_of_pipe/read_side_of_pipe) + dup2() + exec
fork + open(arbitrary set of files) + exec
fork + sched_setscheduler/sched_setparam/sched_setaffinity... + exec
fork + personality + exec
fork + setpgid + exec

not to mention the various clone flags (fs namespace etc.) and any wild
combination of the above

I think I have needed all of the above in various circumstances, sometimes
two or three things between fork+exec at a time

bonus question: how many more parameters do you want to add to a "combined
fork/exec" syscall to make it future proof for other things that might
need to be done before the new process image is executed?

posix_spawn is stupid as a system call

Posted Nov 5, 2009 13:45 UTC (Thu) by madscientist (subscriber, #16861) [Link]

Exactly. The number of useful and important things you can, want, and need to do between a fork and an exec is far too great to make it one call, in all but the simplest cases. I had this _exact_ argument with a Windows API proponent who felt that fork+exec was crazy (not related to OOM issues, just having more than one call). But the number of flags and arguments you'd need to replicate that behavior is insane and you'd STILL run into situations where you can't do what you want.

Fork+exec definitely has its downsides in some of the technical implementation requirements, but from a higher level language perspective it's brilliant.

posix_spawn is stupid as a system call

Posted Nov 6, 2009 9:07 UTC (Fri) by epa (subscriber, #39769) [Link]

You're right, others pointed out the same thing; no single system call can handle all the things you might want to set up in the child process before exec()ing.

But that said, why does the whole child process (including, potentially, a complete copy of its parent's core pages, all ready to be written to) need to be created just to set a few uids or open some files? Perhaps it would work better to first prepare a new process structure, then set uids and open files for it, and as the last stage breathe life into it by giving a file to exec(). For example

    pid_t child = new_waiting_process();
    // Now child is an entry in the process table, but it is not running.
    // Use the p_ variants of some system calls to set things up for
    // this child process.
    p_setuid(child, uid);
    p_close(child, 0);
    p_open(child, "infile");
    // Finished setup, start it running.
    p_exec_and_start(child, "/bin/cat");
    wait(child);
This would give almost the same flexibility, but without the need to overcommit memory. The kernel would just need to create a new process in a not-runnable state, and the p_whatever system calls allow performing operations on another process rather than yourself. (Of course they would only allow manipulating your own not-yet-started child process, except perhaps for root.)

A process created with new_waiting_process() would inherit its parent's file descriptors, current directory, environment and so on as for fork(), but it would not inherit the parent's core.

posix_spawn is stupid as a system call

Posted Nov 6, 2009 10:07 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link]

The idea in itself is workable, but the number of system calls you have to
duplicate is _huge_. It would perhaps be easier to create an "almost
empty" process image (with at least one stack and executable code page set
up) in suspended state, and then use ptrace or something similar to inject
system calls into the new process image -- this is tricky, but at least
the kernel is not burdened with an exploding number of system calls.

Alternatively, you could also provide a "fork" variant that explicitly
declares which pages of the address space are to be COWed into the new
process (if you are extra-smart, all you ever need to COW are the stack
pages, but calling library functions before execve is probably going to
spoil that -- but then, finding out which pages a library requires is by
no means easier, so you have to exercise a lot of discipline).

Might be an interesting research project to attempt any of the above in
Linux :)

posix_spawn is stupid as a system call

Posted Nov 6, 2009 13:51 UTC (Fri) by nix (subscriber, #2304) [Link]

You could reduce the set of necessary syscalls to one:

int masquerade_as (pid_t pid)

which issues syscalls in 'pid' instead of the current process. ('pid' is a
process you'd be allowed to ptrace, so immediate children are permitted).
This is a per-thread attribute, and passing a pid of 0 flips back to the
parent again.

Then all you need is this (ignoring error checking just as the OP did,
what a horrible name that new_waiting_process() has got, vvfork() would
surely be better):

pid_t child = new_waiting_process();
masquerade_as (child);
setuid(uid);
close(0);
open("infile");
// Finished setup, start it running.
execve ("/bin/cat", "/bin/cat", environ);
masquerade_as (0);
wait(child);

Note the subtleties here: execution always continues after execve()
because the execve() was done to another process image. Non-syscalls are
very dangerous to run because they might update userspace storage in the
wrong process: we'd really need support for this in libc for it to be
usable.

(In practice this latter constraint destroys the whole idea no matter how
good it might be: Ulrich would say no, as he does to every idea anyone
else originates. Personally I suspect this idea sucks in any case :) )

posix_spawn is stupid as a system call

Posted Nov 8, 2009 21:26 UTC (Sun) by epa (subscriber, #39769) [Link]

From a purist point of view, all these 'new' calls are generalizations of the existing ones taking an extra pid argument, so they can just replace them, with the old ones provided by the C library; of course in the real world there is such a thing as backward compatibility :-p.

posix_spawn is stupid as a system call

Posted Nov 8, 2009 23:34 UTC (Sun) by nix (subscriber, #2304) [Link]

Yeah, breaking the entire installed base of Linux apps would probably be a
*bad* move :) I think, if you wanted to do this, you'd have to introduce a
huge pile of new syscalls and reimplement the old ones as thin wrappers
(inside the kernel so as not to force everyone to upgrade glibc) calling
the new ones.

posix_spawn is stupid as a system call

Posted Nov 23, 2009 15:08 UTC (Mon) by jch (guest, #51929) [Link]

This is analogous to the *at system calls (openat, fstatat, ...) that have been introduced in Linux and included in the latest revision of POSIX.

A suggestion

Posted Nov 12, 2009 5:17 UTC (Thu) by jlmassir (guest, #48904) [Link]

Maybe then the solution to this problem would be:

1. Never allow overcommit when calling malloc
2. Allow overcommit on fork/exec, but kill the child process if it tries to
write to more than 10% of its virtual size.

This way, buggy programs that malloc too much memory and never use it
would have to be fixed, and fork bombs would be killed, while still
allowing system calls between fork and exec.

What do you think?

A suggestion

Posted Nov 14, 2009 20:52 UTC (Sat) by Gady (guest, #1141) [Link]

Killing the child process if it uses more than 10% is kinda cruel. There are no rules against the child doing that. What should be done in that case is to actually allocate the memory, and only kill the child if that cannot be done.

A suggestion

Posted Nov 15, 2009 20:03 UTC (Sun) by jlmassir (guest, #48904) [Link]

Killing a child if there is no memory for a fork-exec is kinda cruel.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 18:27 UTC (Thu) by nix (subscriber, #2304) [Link]

OK, so granted that... how do you plan to prevent processes allocating
stack space? A process with a lot of threads, mostly idle, could easily be
using gigabytes for stack address space, all potentially allocatable, but
only actually be using a tiny fraction of that (4K out of every 8Mb chunk,
say).

So overcommit doesn't just break programs that use fork/exec under high
load, forcing failure far sooner than necessary: it breaks programs that
use threads in the same way. Doesn't leave much, does it...

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 18:28 UTC (Thu) by nix (subscriber, #2304) [Link]

That is to say, *disabling* overcommit breaks these things. I hate negated
emphatic terms :/

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 6, 2009 9:10 UTC (Fri) by epa (subscriber, #39769) [Link]

Somebody else will have to suggest a possible answer to the stack space problem :-(. It might not be possible to turn off overcommit entirely for desktop systems. But anything that can be done to make overcommit happen less often - or, equally, to make strict allocation usable for a normal workload - narrows the gap between what the kernel promises and what it can deliver, and makes the OOM killer less likely to run.

(Doesn't a process have some way to specify the max. stack size that it will use for each thread?)

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 0:20 UTC (Thu) by mikov (guest, #33179) [Link]

Overcommit is primarily about copy-on-write, not malloc(). For example, the kernel cannot predict how much actual memory will be needed after a fork(). Are you suggesting that when a 500MB process does a fork, the kernel should reserve another 500MB?

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 0:56 UTC (Thu) by JoeBuck (subscriber, #2330) [Link]

Exactly. On a desktop Linux system, you might have a Firefox instance you've been using for ages and it's up to 1.5 gigabytes virtual memory. You have other processes running and your swap is mostly full, so that there's only another 0.5G available. Then you download a film clip and you want to fire up totem to view it. Firefox does a fork, followed by exec. But you don't have 1.5 additional gigabytes. Solaris would refuse to do the fork, even though you don't really need that additional 1.5G: you might dirty one page before doing an exec of totem, which is much smaller. Linux and AIX will issue a loan.

Solaris works around this problem by recommending that developers use posix_spawn rather than fork followed by exec, however they didn't add this call until Solaris 10.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 1:02 UTC (Thu) by mikov (guest, #33179) [Link]

Very interesting. How does Solaris deal with mapping shared libraries without overcommitting?

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 23, 2009 15:19 UTC (Mon) by jch (guest, #51929) [Link]

> How does Solaris deal with mapping shared libraries without overcommitting?

Shared libraries are backed by filesystem data, so a read-only map of a shared library does not involve overcommit.

why not posix_spawn()?

Posted Nov 5, 2009 8:02 UTC (Thu) by Cato (guest, #7643) [Link]

So why doesn't Linux have something like posix_spawn() in standard distros (doesn't seem to be in Ubuntu at least but must be in some kernel builds: http://linux.die.net/man/3/posix_spawn )?

Fork/exec really works well with smaller processes (as in the original Unix tools / shell pipeline approach), but forking a 1.5 GB Firefox process is insane...

It's got to the point that I can't click on mailto: links any more in Firefox (currently 850 MB resident memory) because it will take so long to fork and then exec Thunderbird.

why not posix_spawn()?

Posted Nov 5, 2009 8:30 UTC (Thu) by mikov (guest, #33179) [Link]

I can assure you that it is not the fork() that is causing the delay. In any modern Unix fork() doesn't actually copy anything - it creates a copy-on-write mapping of the process memory, which is a very fast operation (relatively speaking). This is where overcommit comes into play.

why not posix_spawn()?

Posted Nov 5, 2009 18:28 UTC (Thu) by khc (guest, #45209) [Link]

actually the page table entries are still copied, which can take a measurable amount of time even when the parent process is only using hundreds of MB. Linux does have a posix_spawn, but it's implemented with fork/exec so it's not useful for this problem anyway.

why not posix_spawn()?

Posted Nov 5, 2009 20:44 UTC (Thu) by mikov (guest, #33179) [Link]

fork() doesn't copy all page tables - for shared memory mappings it defers copying them until fault time

Additionally, perhaps shared page tables will eventually improve on that. Does anybody know what the status of those patches is?

But anyway, I don't think that the slow starting of Thunderbird from Firefox, which the GP commented on, is caused by fork() copying the page tables.

why not posix_spawn()?

Posted Nov 5, 2009 23:21 UTC (Thu) by Cato (guest, #7643) [Link]

You're right about the original problem, which must have been something else - just retested with mailto: link from a large Firefox process and it's fine.

why not posix_spawn()?

Posted Nov 6, 2009 1:17 UTC (Fri) by khc (guest, #45209) [Link]

I am not sure if it copies the entire page table, but the effect is noticeable: http://hxbc.us/software/fork_bench.c (try running it with different c and n)

why not posix_spawn()?

Posted Nov 6, 2009 6:12 UTC (Fri) by mikov (guest, #33179) [Link]

You are absolutely right. Here are my results:
malloc(1MB) 3235 ms 0.161750 ms/iter
malloc(100MB) 390 ms 3.900000 ms/iter
malloc(500MB) 1663 ms 16.630000 ms/iter
malloc(1024MB) 3329 ms 33.290000 ms/iter

With a heap of 1G it takes 33ms to do a fork() on my machine, which to me is surprisingly long (although not that surprising when you consider the sheer size of the page tables with 4KB pages). While, as I said initially, it would definitely not be noticeable for interactive process creation, it is significant. The much maligned "slow" process creation on Windows is much faster for sure...

I did run a couple more tests, though, which improve the picture. First I confirmed that the page tables of shared memory mappings really are not copied. I replaced the malloc() with mmap( MAP_ANON | MAP_SHARED ):
mmap(1MB) 1204 ms 0.120400 ms/iter
mmap(100MB) 1201 ms 0.120100 ms/iter
mmap(500MB) 1231 ms 0.123100 ms/iter
mmap(1024MB) 1229 ms 0.122900 ms/iter
As you can see, there is no relation between fork() speed and mapping size.

Then I restored the malloc(), but replaced the fork() with vfork():
vfork+malloc(1MB) 102 ms 0.010200 ms/iter
vfork+malloc(100MB) 107 ms 0.010700 ms/iter
vfork+malloc(500MB) 105 ms 0.010500 ms/iter
vfork+malloc(1000MB) 106 ms 0.010600 ms/iter

The last result is really encouraging (and actually not surprising). Even though everybody seems to hate vfork(), for the case we are discussing (fork of a huge address space + exec), it should solve all problems, removing the need for the clumsy posix_spawn(), while preserving all the flexibility of fork(). Beat that, Windows!
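A minimal sketch of the vfork()+exec() pattern being discussed: the parent's address space is borrowed rather than copied, so the cost is independent of how large the parent's heap is. The child must call only an exec function or _exit() after vfork().

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = vfork();
    if (pid < 0) {
        perror("vfork");
        return 1;
    }
    if (pid == 0) {
        /* child: exec immediately; do not touch parent state */
        execlp("echo", "echo", "spawned via vfork", (char *)NULL);
        _exit(127);                  /* only reached if exec fails */
    }
    waitpid(pid, NULL, 0);           /* parent resumes once the child execs */
    return 0;
}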

Any good reasons why vfork() should be avoided?

why not posix_spawn()?

Posted Nov 6, 2009 19:12 UTC (Fri) by khc (guest, #45209) [Link]

You are right that the speed of fork() is seldom noticeable in a GUI program, but it bites me all the time in daemons (a big daemon wanting to launch many processes, one by one, to do some tasks). vfork() is too limiting because all you can do afterward is exec(), but sometimes you do want the extra flexibility that posix_spawn() can provide.

I have to admit that I have never checked to see if posix_spawn fits my need, though. Since I only care about linux and posix_spawn on linux is the same as fork()/.../exec(), it's useless for me anyway.

why not posix_spawn()?

Posted Nov 6, 2009 19:28 UTC (Fri) by mikov (guest, #33179) [Link]

I am not sure what you mean. Unless I am missing something, vfork() is much more flexible and easier to use than posix_spawn().

If your purpose is to call exec() after fork(), you should just be able to mechanically replace all fork()s with vfork()s and get a big boost.

why not posix_spawn()?

Posted Nov 6, 2009 22:55 UTC (Fri) by cmccabe (guest, #60281) [Link]

> Any good reasons why vfork() should be avoided?

The manual page installed on my system says that copy-on-write makes vfork unnecessary. It concludes with "it is rather unfortunate that Linux revived this specter from the past." :)

However... it seems like the results you've posted show quite a substantial performance gain for vfork + exec as opposed to fork + exec, for processes with large heaps.

Maybe the "preferred" way to do this on Linux would be using clone(2)??

C.

why not posix_spawn()?

Posted Nov 6, 2009 23:23 UTC (Fri) by cmccabe (guest, #60281) [Link]

> Any good reasons why vfork() should be avoided?

Ah. Found it.

https://www.securecoding.cert.org/confluence/display/secc...

> Due to the implementation of the vfork() function, the parent process is
> suspended while the child process executes. If a user sends a signal to
> the child process, delaying its execution, the parent process (which is
> privileged) is also blocked. This means that an unprivileged process can
> cause a privileged process to halt, which is a privilege inversion
> resulting in a denial of service.

clone(CLONE_VM) + exec might be the win...

Colin
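A hedged sketch of that clone()-based idea: share the parent's address space (no page-table copy) and add CLONE_VFORK so the parent blocks until the child has exec'd and cannot race it on the shared memory. This is roughly how a native posix_spawn() could be built; the stack size and command are illustrative.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (64 * 1024)

static int child_fn(void *arg)
{
    char **argv = arg;
    execvp(argv[0], argv);           /* replaces the shared image */
    _exit(127);                      /* only reached if exec fails */
}

int main(void)
{
    char *argv[] = { "echo", "spawned via clone", NULL };
    char *stack = malloc(STACK_SIZE);
    if (!stack) { perror("malloc"); return 1; }

    /* The stack grows down on most architectures: pass its top. */
    pid_t pid = clone(child_fn, stack + STACK_SIZE,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, argv);
    if (pid < 0) { perror("clone"); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    return 0;
}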

Memory required for fork()

Posted Nov 5, 2009 1:04 UTC (Thu) by vomlehn (guest, #45588) [Link]

Exactly. The kernel reserves as much virtual memory for the child as the parent has. If overcommit is disabled, a parent with more memory than half of CommitLimit will not be able to fork() successfully. It will still be able to vfork(), though.
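A small sketch of that effect (sizes illustrative; assumes a 64-bit machine running with vm.overcommit_memory=2 and a CommitLimit that the second copy would exceed): fork() must commit a duplicate of the parent's writable memory and fails with ENOMEM, while vfork() commits nothing new.

#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    size_t size = 2UL * 1024 * 1024 * 1024;   /* illustrative: 2 GB */
    char *buf = malloc(size);                 /* charged against CommitLimit */
    if (!buf) { perror("malloc"); return 1; }
    memset(buf, 1, size);                     /* touch the pages */

    pid_t pid = fork();                       /* needs another 2 GB of commit */
    if (pid < 0)
        printf("fork failed: %s\n", strerror(errno));
    else if (pid == 0)
        _exit(0);
    else
        waitpid(pid, NULL, 0);

    pid = vfork();                            /* borrows the address space */
    if (pid == 0)
        _exit(0);                             /* vfork child may only exit or exec */
    printf("vfork %s\n", pid < 0 ? "failed" : "succeeded");
    return 0;
}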

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 3:11 UTC (Thu) by smoogen (subscriber, #97) [Link]

I thought most modern OSes share the fundamental problem of allowing more allocation than actually exists. They either put in place some sort of OOM killer or they lock up when RAM really runs out. [I have fuzzy memories of seeing something like this on old SunOS and HP-UX boxes long ago. And Windows has the lockup issue.] I am not sure what happens with Mac OS X when it runs out.

Now the question is: why do some of these do this? Is it a basic assumption of every non-embedded OS to be sloppy with memory? Is it POSIX? And if every system were set up with overcommit turned off, how much would break?

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 5:19 UTC (Thu) by mikov (guest, #33179) [Link]

It is not sloppiness. The OS cannot predict the future, so it can either be optimistic or pessimistic with memory. This is a very deliberate design choice. It turns out that in practice being optimistic is much, much more efficient.

Here is what Linus had to say about that in 1995:
http://groups.google.com/group/comp.os.linux.development....

The fundamentals still apply.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 12, 2009 9:50 UTC (Thu) by rlhamil (guest, #6472) [Link]

While optimistic overcommit might be _statistically_ better, it's not deterministic enough
for my liking. And my liking is that everything other than /dev/random is _totally_
deterministic (neglecting external input of course).

(I'd argue that overcommit-by-default is an invitation to denial of service attack, and, if
likely victims were more or less predictable, might be a "covert channel" as well.)

Solaris doesn't do overcommit, but does also offer MAP_NORESERVE, so that individual
mmap() operations can opt out of a reserve, in which case a write to a private mapping
(copy-on-write from a file) can cause the process to receive SIGSEGV or SIGBUS; see

http://docs.sun.com/app/docs/doc/816-5167/mmap-2?l=en&...
(the online version of the mmap() man page for Solaris 10)

I think that all that's missing is:
* a system call to turn on or off similar behavior for heap and/or stack, and to
turn on or off _implicit_ MAP_NORESERVE on all private mappings for that process
and its subsequently forked children (reset on exec)
* a shared library feature to implement system policy specifying which executables
should be subject to overcommit, with a settable default for all not explicitly specified
* an OS default of no overcommit
* no OOM killer needed

Distros could supply default policy that opted for overcommit on chronically hoggish
(and typically not critical to system integrity) apps such as browsers. People might e.g.
not mind their browser dying a few more times than it would anyway, but might be very
glad to be sure that their X server (desktop user) or database server process was safe from
nondeterministic behavior possibly triggered by another process.

That gets overcommit out of the OS, and pushes the decisions into user space. A process
could always override policy with the system calls, but it would have to know what it was
doing to do that.

The only limitation with implementing the defaults for an executable in the dynamic linker
is that it wouldn't be able to allow overcommit for static executables. If that was a serious
limitation, a new mechanism would be needed to push the policy settings into the kernel,
and execve() (or equivalents) would have to implement them, which is IMO more comprehensive
but otherwise uglier.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 21:49 UTC (Thu) by anton (subscriber, #25547) [Link]

If an application does not handle ENOMEM gracefully, it's better to run it in overcommit mode. Hopefully it will never actually use all that memory, then it will be better off than if it got ENOMEM. If it gets OOM-killed, it won't be worse off. And being able to allocate large amounts of memory without using it makes writing programs quite a bit simpler in some cases.

OTOH, if an application is written to deal with ENOMEM gracefully, it's better not to overcommit memory for this application, to give it ENOMEM if there is no more committable memory, and then there is no need to OOM kill such an application (instead, one of the other overcommitting applications can be killed).

I have written this up in more detail; there I suggested making it depend on the process. In the meantime I have learned about the MAP_NORESERVE flag, which makes it possible to do this per-allocation. However, since the OOM-Killer kills a process, not an allocation, it's probably better to use MAP_NORESERVE either on all or no allocation in a process; but how to tell this to functions that call mmap only indirectly (malloc(), fopen() etc.)?
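For reference, a minimal MAP_NORESERVE allocation looks like this (size illustrative). The trade-off is exactly the one described above: nothing is reserved up front, so a later write can kill the process (SIGSEGV on Linux, SIGSEGV/SIGBUS on Solaris) if memory has truly run out at that moment.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t size = 512UL * 1024 * 1024;        /* illustrative: 512 MB */

    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    /* Pages are only charged as they are touched; this write is where
     * the process would die if memory were actually exhausted. */
    memset(p, 0, 4096);
    printf("mapped %zu bytes without reserving backing store\n", size);
    munmap(p, size);
    return 0;
}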

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 4, 2009 22:17 UTC (Wed) by paragw (guest, #45306) [Link]

I agree with what you said about OOM (actually overcommit) being a bad idea.

It is based on the fundamentally flawed philosophy of handing out resources without accurately accounting for how many resources are actually there - that cannot be a good idea under any circumstances.

Plus it encourages sloppy application code. And the applications do not get any notification if the kernel runs out of memory - this means it is game over without any warning.

I really like how the Solaris VM handles this - if you ask for memory it will reserve that much swap as backing store; if it fails to allocate that much swap, applications get a NULL back from malloc(). So all you have to do is have a good amount of swap space - disk space is relatively cheap, so that is not a big deal. Applications then can still ask for excess memory as long as address space and swap space are there - whatever doesn't get used remains in swap.

I once tried turning off overcommit and the first program to fail was Java - it needed a 1 GB code cache if I recall correctly. But programs can be fixed eventually if the kernel enforces a strict allocation policy.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 9, 2015 16:48 UTC (Fri) by cortana (subscriber, #24596) [Link]

> I really like how Solaris VM handles this - if you ask for memory it will reserve that much swap as backing store - if it fails to allocate that much swap, applications get a NULL back from malloc().

AIUI this is what Linux does with overcommit_memory=2. What I'm trying to find out (and this is why I'm digging up this old thread) is how Solaris deals with the possibility of a process with a large amount of private memory calling fork. Presumably those pages have to be reserved, but if they can't be, does fork fail (as on Linux with overcommit_memory=2)?

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 11, 2015 23:48 UTC (Sun) by nix (subscriber, #2304) [Link]

In that case, fork fails, annoying Emacs users doing M-! no end.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 9, 2015 20:25 UTC (Fri) by dlang (guest, #313) [Link]

> I really like how Solaris VM handles this - if you ask for memory it will reserve that much swap as backing store - if it fails to allocate that much swap, applications get a NULL back from malloc(). So all you got to do is have good amount of swap space - disk space is relatively cheaper so that is not a big deal.

so what do you do on a system that doesn't have a huge amount of disk available?

I've had systems with 128 GB of RAM and 160 GB of (very high-speed) disk.

High-end laptops with SSDs can easily have cases where RAM is a significant percentage of the total disk space.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 9, 2015 20:50 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]

> Applications then can still ask for excess memory as long as address space and swap space are there - whatever doesn't get used remains in swap.
This is a very flawed idea. In reality swap is nearly useless for overcommitted machines.

The problem is, once you start actually _using_ the swap, everything simply stops. The IO bandwidth and latency are not enough for anything reasonable.

Pretty much the only valid use-case for swap is to off-load inactive applications and free RAM for those who need it.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 9, 2015 21:14 UTC (Fri) by dlang (guest, #313) [Link]

yep, I build my servers without swap, or with just a little bit. I'd rather that they crash and let the backup take over than to keep running in such a horribly degraded mode.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 11, 2015 23:51 UTC (Sun) by nix (subscriber, #2304) [Link]

Swap is also worthwhile for offloading things that are part of active applications but which simply aren't in the working set (a surprising number of applications have quite a lot of long-term-inactive private pages in the heap, etc).

I'd agree about its essential uselessness for other purposes, though. With modern CPU versus disk speeds, you hit the thrashing wall very, very fast.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 12, 2015 0:25 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Yep. However, even that is dangerous - we've had problems with swapped-out JVM heap. It's normally OK until a full garbage collection is triggered. Then it's suddenly not OK.

In fact, for us it would be great if it were possible to simply freeze a swapped-out application if its swap IO bandwidth exceeds some sane value, and then un-freeze it once there is enough RAM to bring all of its pages back in from swap.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 12, 2015 22:49 UTC (Mon) by renox (guest, #23785) [Link]

I remember reading a paper where the researchers kept an in-RAM 'bookmark' for each swapped-out page to improve performance in this kind of situation. Unfortunately this requires patching Linux's memory manager to cooperate with the GC, and the patch was rejected.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 12, 2015 22:56 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]

Is it this paper http://dl.acm.org/citation.cfm?id=1065028 ( http://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf ) ? I remember reading it.

As is common for GC algorithms, this optimization is probabilistic. It works most of the time, but sometimes blows up spectacularly.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 13, 2015 9:06 UTC (Tue) by renox (guest, #23785) [Link]

> Is it this paper http://dl.acm.org/citation.cfm?id=1065028 ( http://people.cs.umass.edu/~emery/pubs/f034-hertz.pdf ) ? I remember reading it.

Yes, this is the paper (sorry for not posting the link previously; I had lost it).

> As it's common for GC algorithms, this optimization is probabilistic. It works most of the time, but sometimes blows up spectacularly.

Well, that's not specific to GCs; it's quite widespread in computing: virtual memory itself and caches are 'probabilistic' optimizations.
So unless you need a hard real-time system, or the 'blow up spectacularly' part is too frequent, this isn't a problem.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 12, 2015 1:20 UTC (Mon) by flussence (subscriber, #85566) [Link]

So Solaris can't use all your RAM for programs unless you pre-allocate swap partitions of equal size? I assume that's just an annoying default, otherwise it'd render diskless NFS nodes useless.

The existence of OOM is one of the few really stupid things in Linux

Posted Jan 12, 2015 9:09 UTC (Mon) by paulj (subscriber, #341) [Link]

Solaris got swap over NFS to work reasonably well, precisely because of diskless NFS, I think.

The existence of OOM is one of the few really stupid things in Linux

Posted Nov 5, 2009 9:15 UTC (Thu) by Felix.Braun (guest, #3032) [Link]

Sorry vadim, you may be right, but that message from Andries quoted at the bottom of the article puts it just so much more entertainingly.

Toward a smarter OOM killer

Posted Nov 4, 2009 21:38 UTC (Wed) by Shewmaker (guest, #1126) [Link]

My experience with scientific computing is that even if the OOM killer picks the correct process, the system is then in a strange state where something else will not work correctly. We often end up rebooting a system like this so that it is back in a known good state. More recently, we have been disabling memory overcommit.

Though it is less flexible, disabling overcommit allows us to see and fix problems more quickly. It may be a system problem (e.g. a log file growing to fill a RAM-based filesystem on a diskless node) or an application that didn't realize it was using twice as much memory as it intended.

Toward a smarter OOM killer

Posted Nov 5, 2009 4:21 UTC (Thu) by zooko (guest, #2589) [Link]

Thanks for the reminder to make sure that I have vm.overcommit_memory=2 on my workstations
and servers.

Toward a smarter OOM killer

Posted Nov 5, 2009 13:40 UTC (Thu) by sean.hunter (guest, #7920) [Link]

Killing gnome-session is always something I as a user will be grateful for.

Toward a smarter OOM killer

Posted Nov 6, 2009 19:17 UTC (Fri) by dlang (guest, #313) [Link]

One thing that would be very nice to have show up in top (and similar tools) is how much RAM _would_ be needed to satisfy all possible COW splits, so that admins could get an idea of how overcommitted the kernel currently is.

I wouldn't be surprised to find that much of the time the ratio of overcommit rises significantly shortly before the OOM killer kicks in.
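Not quite the per-process COW figure asked for, but /proc/meminfo already exposes a system-wide approximation: Committed_AS (the total address space the kernel has promised) against CommitLimit. A minimal reader, as a sketch:

#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    if (!f) { perror("/proc/meminfo"); return 1; }

    char line[256];
    long commit_limit = 0, committed_as = 0;
    while (fgets(line, sizeof line, f)) {
        /* sscanf leaves the value untouched on non-matching lines */
        sscanf(line, "CommitLimit: %ld kB", &commit_limit);
        sscanf(line, "Committed_AS: %ld kB", &committed_as);
    }
    fclose(f);

    printf("CommitLimit:  %ld kB\n", commit_limit);
    printf("Committed_AS: %ld kB\n", committed_as);
    if (commit_limit > 0)
        printf("ratio: %.2f\n", (double)committed_as / commit_limit);
    return 0;
}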

Per-process memory limits?

Posted Nov 8, 2009 3:19 UTC (Sun) by etrusco (guest, #4227) [Link]

Not exactly related, but since the article talks about "badly behaved Firefox", it would be kind of nice to have syscalls to set a VM (or RSS, etc) size limit for a process... (and trigger a callback, or suspend or kill the process)

(Sure, an interested application could limit its own heap usage through its memory allocator, but this couldn't stop a DoS from a virus/code injection.)

Per-process memory limits?

Posted Nov 8, 2009 11:56 UTC (Sun) by nix (subscriber, #2304) [Link]

We *have* setrlimit(). It would be nice if you could trigger actions other
than 'die, fiend!' on exceeding a limit, but it's not essential: you could
have a parent monitoring process which spots the kill and takes
appropriate action (though figuring out which limit the child exceeded, or
whether it exceeded a limit at all or was just randomly killed by the
admin, might be harder).
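For reference, the existing mechanism looks like this: a minimal sketch capping the process's address space at an illustrative 256 MB, after which allocations past the cap fail with ENOMEM instead of pushing the whole system toward OOM.

#include <stdio.h>
#include <stdlib.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = {
        .rlim_cur = 256UL * 1024 * 1024,   /* soft limit: 256 MB */
        .rlim_max = 256UL * 1024 * 1024,   /* hard limit */
    };
    if (setrlimit(RLIMIT_AS, &rl) != 0) { perror("setrlimit"); return 1; }

    void *p = malloc(512UL * 1024 * 1024); /* exceeds the cap */
    printf("512 MB malloc %s\n", p ? "succeeded" : "failed as expected");
    free(p);
    return 0;
}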

Per-process memory limits?

Posted Nov 9, 2009 9:22 UTC (Mon) by hppnq (guest, #14462) [Link]

It would be nice if you could trigger actions other than 'die, fiend!' on exceeding a limit

Some attempts to exceed a limit specified using setrlimit are met with an error rather than a kill, and I think all the signals delivered can be handled if appropriate measures are taken. Whether that is a good idea is another question. ;-)

Per-process memory limits?

Posted Nov 11, 2009 20:30 UTC (Wed) by oak (guest, #2786) [Link]

It would also be nice if setrlimit() actually worked properly...

...for something other than the VmSize limit, which is useless for processes
that mmap() files.

Per-process memory limits?

Posted Nov 12, 2009 5:51 UTC (Thu) by etrusco (guest, #4227) [Link]

Sorry, I meant a different limit for each process; AFAICT setrlimit/sysctl work globally?

Per-process memory limits?

Posted Nov 12, 2009 5:58 UTC (Thu) by etrusco (guest, #4227) [Link]

Oops, never mind. I should have RTFM before posting :-$

Per-process memory limits?

Posted Nov 12, 2009 6:38 UTC (Thu) by etrusco (guest, #4227) [Link]

(continuing my monologue...)
It would be nice, nonetheless, to be able to set the limits from a different process (e.g. a window manager) instead of resorting to a launcher and having to define the limit up front...

Per-process memory limits?

Posted Nov 14, 2009 12:11 UTC (Sat) by efexis (guest, #26355) [Link]

You can do this using the cgroups resource counters / memory resource controller. This lets you see memory usage estimates for each group (they're estimates because of shared libraries, etc.), change the limits on the fly, and also change which group a process belongs to (new processes belong to the group of their parent process by default).
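A hedged sketch of that interface as it looked with the v1 memory controller, assuming it is mounted at /sys/fs/cgroup/memory (mount points and group names vary by distribution, and this needs appropriate privileges): create a group, give it a 512 MB limit, and move the current process into it.

#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

static int write_file(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f) { perror(path); return -1; }
    fprintf(f, "%s\n", value);
    return fclose(f);
}

int main(void)
{
    const char *grp = "/sys/fs/cgroup/memory/demo";   /* assumed mount point */
    char buf[32];

    if (mkdir(grp, 0755) != 0) perror("mkdir");       /* may already exist */

    write_file("/sys/fs/cgroup/memory/demo/memory.limit_in_bytes",
               "536870912");                          /* 512 MB */

    snprintf(buf, sizeof buf, "%d", getpid());
    write_file("/sys/fs/cgroup/memory/demo/tasks", buf);

    printf("now limited to 512 MB of RAM (group %s)\n", grp);
    return 0;
}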


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds