Toward a smarter OOM killer
The Linux memory management code does its best to ensure that memory will always be available when some part of the system needs it. That effort notwithstanding, it is still possible for a system to reach a point where no memory is available. At that point, things can grind to a painful halt, with the only possible solution (other than rebooting the system) being to kill off processes until a sufficient amount of memory is freed up. That grim task falls to the out-of-memory (OOM) killer. Anybody who has ever had the OOM killer unleashed on a system knows that it does not always pick the best processes to kill, so it is not surprising that making the OOM killer smarter is a recurring theme in Linux virtual memory development.
Before looking at the latest attempt to improve the OOM killer, it is worth mentioning that it is possible to configure a Linux system in a way which all but guarantees that the OOM killer will never make an appearance. OOM situations are caused by the kernel's willingness to overcommit memory. As a general rule, processes only use a portion of the address space they have allocated, so limiting allocations to the total amount of RAM and swap space on the system would lead to underutilization of system memory. But that limitation can be imposed on systems which can never be allowed to go into an OOM state; simply set the vm.overcommit_memory sysctl knob to 2. Individual processes are much more likely to see allocation failures in this mode, but the system as a whole will not overcommit its resources.
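Concretely, strict accounting is controlled through the usual sysctl interface; a configuration sketch (the 80% ratio below is an arbitrary example - the kernel's default overcommit_ratio is 50):

```shell
# Disable overcommit: commit limit = swap + overcommit_ratio% of RAM
sysctl -w vm.overcommit_memory=2
sysctl -w vm.overcommit_ratio=80   # count 80% of RAM toward the limit

# Make the setting persistent across reboots
echo 'vm.overcommit_memory = 2' >> /etc/sysctl.conf
echo 'vm.overcommit_ratio = 80' >> /etc/sysctl.conf

# Observe the resulting commit limit and current commitment
grep -E 'CommitLimit|Committed_AS' /proc/meminfo
```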
Most systems will allow overcommitted memory, though, because the alternative is too limiting. Overcommit works almost always, but the threat of a day when the Firefox developers add one memory leak too many always looms. When that sad occasion comes to be, it would be nice if the OOM killer would target that leaky Firefox process instead of, say, the X server and PostgreSQL. Many attempts have been made to add smarts to the OOM killer over the years; there's also a means by which the system administrator can steer the OOM killer toward or away from specific processes. But manual configuration is only suitable for certain, relatively static workloads; for the rest, the OOM killer often proves less discriminating than one would like.
The latest attempt to fix the OOM killer comes from Hiroyuki Kamezawa. This patch makes a number of fundamental changes to the selection of OOM victims. The result is an OOM killer which is smarter in some ways, but which takes a somewhat different approach to the selection of its victims.
One of the factors that the current OOM killer takes into account, naturally, is the amount of memory being used by each process. But the measure used (mm->total_vm) is somewhat crude: it penalizes processes using a lot of shared memory and says little about how much physical memory the process is using. Hiroyuki's patch tries to move away from total_vm in most situations, looking at the actual resident set size (RSS) and possibly taking into account the amount of swap space used as well.
Figuring in swap usage is controversial. A program which is using a lot of swap is clearly putting pressure on memory, but, if that program has been mostly swapped out, killing it will not immediately free much RAM. Eventually other processes can be shifted into the newly-freed swap space, but it might make more sense to just do away with those other processes at the outset. Even so, Hiroyuki's patch, for now, will figure in swap space if specific constraints do not force the use of other criteria.
One constraint which can change the calculation is when the memory shortage is specific to low memory - the region of memory which can be directly addressed by the kernel. When a low-memory allocation is required, nothing else will do, so there is little value in killing processes which are not hogging low-memory pages. With Hiroyuki's patch, the VM subsystem tracks how much low memory each process is using as a separate statistic. If the OOM situation is caused by an attempt to allocate low memory, the OOM killer's "badness" function will focus on processes holding large amounts of low memory.
The current OOM killer makes an attempt to target "fork bomb" processes by adding half of each child's "badness" value to its parent. A process with a lot of children will thus have a high badness and will come under the OOM killer's baleful gaze sooner. The problem here, of course, is that some processes legitimately have lots of children - the session manager for the user's desktop environment is a good example. Killing gnome-session is likely to free substantial amounts of memory, but the user's gratitude may be surprisingly limited.
The patch changes the fork bomb detector significantly. The new code counts only the child processes which have been running for less than a specific amount of time (five minutes in the posted patch). If one process has newborn children which make up at least 1/8 of the processes on the system, that process is deemed to be a fork bomb; it is duly rewarded with a spot at the top of the OOM killer's short list.
Finally, the current OOM killer tries to kill newly-created processes, while allowing long-running processes to continue. Hiroyuki feels that this approach creates a loophole for long-running processes which slowly leak memory. That web browser may have been running for a long time and is thus a high-value process, but it has been dropping memory on the floor for that long time and is also the cause of the problem. So the new code changes the calculation to look at how long it has been since the process has expanded its virtual memory size. A process which has been running for a long time, but which has not grown in that time, will look better than one which has been expanding.
There seems to be little disagreement with the idea that the OOM killer needs a rework, but not everybody is sold on this approach yet. It looks like a very large change, which makes some people nervous. It also shifts the focus of the OOM killer's attention in a significant way: the current heuristics were designed to be as unsurprising to the user as possible, while the new ones are focused more strongly on freeing RAM quickly. But, given that the existing heuristics are still clearly producing plenty of surprises, perhaps a more goal-oriented approach makes sense.
(Naturally, no article on the OOM killer is complete without a link to this 2004 comment from Andries Brouwer.)
Toward a smarter OOM killer
Posted Nov 4, 2009 16:37 UTC (Wed) by holstein (guest, #6122) [Link]
I know this can lead to a chicken-and-egg situation (after all, we have no more memory left now), but sometimes simple things could "fix" the memory problem with an operation specific to the current system.
Take a web server, for example: if it is OOMing, it may be because of a runaway Apache child eating all the memory. Or, more probably, lots of child processes each eating just a bit too much memory, much of the time.
What if we could specify "run this script before the OOM killer": in this case, restarting the web server would be better than simply killing it...
In many cases, if the sysadmin could specify a specific action, or at least specific processes to act on first, this would make for more predictable behavior, no?
Toward a smarter OOM killer
Posted Nov 4, 2009 16:51 UTC (Wed) by mjthayer (guest, #39183) [Link]
Toward a smarter OOM killer
Posted Nov 4, 2009 17:00 UTC (Wed) by johill (subscriber, #25196) [Link]
Toward a smarter OOM killer
Posted Nov 4, 2009 17:11 UTC (Wed) by mjthayer (guest, #39183) [Link]
Toward a smarter OOM killer
Posted Nov 4, 2009 23:24 UTC (Wed) by johill (subscriber, #25196) [Link]
Toward a smarter OOM killer
Posted Nov 4, 2009 17:02 UTC (Wed) by holstein (guest, #6122) [Link]
For a server, that would let the sysadmin log the event, react to it, etc.
For a desktop, one can imagine a popup warning the user, perhaps with a list of memory-hog processes. This will let the user use the session support of Firefox to restart its browsing session...
In many cases, restarting one guilty process anew can be enough to prevent an OOM killing spree.
Toward a smarter OOM killer
Posted Nov 4, 2009 17:22 UTC (Wed) by mjthayer (guest, #39183) [Link]
I think it might be doable by making the algorithm that chooses pages to swap out smarter, so that each process is guaranteed a certain amount of resident memory (depending on the number of other processes and logged-in users, and probably a few other factors), and a process that tries to hog memory ends up swapping out its own pages when other running processes drop to their guaranteed minimum. If I ever have time, I will probably even try coding that up.
death by swap
Posted Nov 4, 2009 18:12 UTC (Wed) by jabby (guest, #2648) [Link]
STOP USING the old "TWICE RAM" guideline!
That dates back to when 64MB of RAM was considered "beefy". I seriously had to provision a server last night with 16GB of physical memory and the customer wanted a 32GB swap partition!! Seriously?! If your system is still usable after you're 2 to 4 gigs into your swap, I'd be shocked.
death by swap
Posted Nov 4, 2009 18:21 UTC (Wed) by mjthayer (guest, #39183) [Link]
Not sure that 32GB of swap would be appropriate even then though...
death by swap
Posted Nov 6, 2009 11:40 UTC (Fri) by patrick_g (subscriber, #44470) [Link]
> I think Ubuntu still do that by default :)

Yes, Ubuntu still does that by default... despite many bug reports like mine.
Note that my bug report is old, very old (pre-Gutsy time) => https://bugs.launchpad.net/ubuntu/+source/partman-auto/+bug/134505
No reaction at all from the Ubuntu devs....very discouraging.
death by swap
Posted Nov 4, 2009 18:31 UTC (Wed) by clugstj (subscriber, #4020) [Link]
It all depends on the workload.
death by swap
Posted Nov 4, 2009 18:36 UTC (Wed) by ballombe (subscriber, #9523) [Link]
death by swap
Posted Nov 5, 2009 6:54 UTC (Thu) by gmaxwell (guest, #30048) [Link]
Or using tmpfs for /tmp
death by swap
Posted Nov 5, 2009 11:11 UTC (Thu) by quotemstr (subscriber, #45331) [Link]
death by swap
Posted Nov 5, 2009 18:25 UTC (Thu) by khc (guest, #45209) [Link]
death by swap
Posted Nov 5, 2009 18:46 UTC (Thu) by nix (subscriber, #2304) [Link]
swap :)
death by swap
Posted Nov 4, 2009 18:49 UTC (Wed) by knobunc (guest, #4678) [Link]
death by swap
Posted Nov 4, 2009 19:59 UTC (Wed) by zlynx (guest, #2285) [Link]
I don't know, I haven't run a Linux laptop in almost a year now since an X.org bug killed my old laptop by overheating it.
death by swap
Posted Nov 5, 2009 18:22 UTC (Thu) by nix (subscriber, #2304) [Link]
doesn't mean you *must*.
death by swap
Posted Nov 4, 2009 21:43 UTC (Wed) by drag (guest, #31333) [Link]
amount of swap space will dictate how likely or how much swap space the Linux kernel wants to use.
I expect the only time it'll want to use swap in a busy system is if the active amount of used memory exceeds the amount of main memory.
death by swap
Posted Nov 4, 2009 22:54 UTC (Wed) by mjthayer (guest, #39183) [Link]
The algorithm is roughly as follows.
* Assign each process a contingent of main memory, e.g. by dividing the total available by the number of users with active running processes, and giving each process an equal share of the contingent of the user running it.
* When a page of memory is to be evicted to the swap file, make sure that it has either not been accessed for a certain minimum length of time, or that the process owning it is over its contingent, or that it is owned by the process on whose behalf it is to be swapped out. If not, search for a new page to evict.
This should mean that if a process starts leaking memory badly or whatever, after a while it will just loop evicting its own pages and not trouble the other processes on the system. It should also mean that all not-too-large processes on the system should stay reasonably snappy, making it easier to find and kill the out-of-control process.
death by swap
Posted Nov 4, 2009 22:56 UTC (Wed) by mjthayer (guest, #39183) [Link]
death by swap
Posted Nov 7, 2009 1:21 UTC (Sat) by giraffedata (guest, #1954) [Link]
If you get to the point that you're stealing a page from a process simply because that process is over its quota of real memory, you should steal ALL that process' pages. It can't fit its working set into memory, so it isn't going to make decent progress, so the memory you do give it is wasted. You're also wasting the swap I/O it's doing. After a while, after other processes have had a chance to progress, you can swap them out and give the first process the memory it needs. If you can't do that because it's run amok and simply demands more memory than you can afford, that's when you kill that process.

Algorithms for this were popular in the 1970s for batch systems. Unix systems were born as interactive systems where the idea of not dispatching a process at all for ten seconds was less palatable than making the user kill some stuff or reboot, but with Unix now used for more diverse things, I'm surprised Linux has never been interested in long term scheduling to avoid page thrashing.
death by swap
Posted Nov 7, 2009 3:39 UTC (Sat) by tdwebste (guest, #18154) [Link]
On embedded devices I have constructed processing states with runit to control the running processes. This simple but effective long-term scheduling to avoid out-of-memory/swapping works well when you know in advance what processes will be running on the device.
death by swap
Posted Nov 7, 2009 10:01 UTC (Sat) by dlang (guest, #313) [Link]
but back in the 70's they realized that most of the time most programs don't use all their memory at any one time. so the odds are pretty good that the page of ram that you swap out will not be needed right away.
and the poor programming practices that are common today make this even more true
death by swap
Posted Nov 7, 2009 17:06 UTC (Sat) by giraffedata (guest, #1954) [Link]
> but back in the 70's they realized that most of the time most programs don't use all their memory at any one time. so the odds are pretty good that the page of ram that you swap out will not be needed right away.
I think you didn't follow the scenario. We're specifically talking about a page that is likely to be needed right away. It's a page that the normal page replacement policy would have left alone because it expected it to be needed soon -- primarily because it was accessed recently.
But the proposed policy would steal it anyway, because the process that is expected to need it is over its quota and the policy doesn't want to harm other processes that aren't.
What was known in the 70s was that at any one time, a program has a subset of memory it accesses a lot, which was dubbed its working set. We knew that if we couldn't keep a process' working set in memory, it was wasteful to run it at all. It would page thrash and make virtually no progress. Methods abounded for calculating the working set size, but the basic idea of keeping the working set in memory, or nothing, was constant.
death by swap
Posted Nov 9, 2009 9:02 UTC (Mon) by mjthayer (guest, #39183) [Link]
I suppose I see three cases here. One is that the page was part of the process' working set at an earlier point in time, but no longer is. In that case swapping it out is the right thing to do. The second is that the process is under control, but its working set is bigger than the available memory. Then I agree that there is a good case for putting it on hold until enough memory is available, although that is a non-trivial problem which is somewhat outside the scope of what I am trying to do. And the third case is the one that I am interested in - a runaway process which will eventually be OOMed. In this case, the quota will stop it from trampling on the working set of every other process in memory in the meantime.
While we are on the subject, does anyone reading this know where RSS quotas are handled in the current kernel code? I was able to find the original patches enabling them, but the code seems to have changed out of recognition since then.
death by swap
Posted Nov 9, 2009 12:34 UTC (Mon) by hppnq (guest, #14462) [Link]
You may want to look at Documentation/cgroups/memory.txt. Otherwise, it seems there is no way to enforce RSS limits. Rik van Riel wrote a patch a few years ago but it seems to have been dropped.

Personally, I would hate to think that my system spends valuable resources managing runaway processes. ;-)
** Encouragement encouragement encouragement **
Posted Nov 13, 2009 22:32 UTC (Fri) by efexis (guest, #26355) [Link]
So, what I would want is something that assumes that most of the system is being well behaved, but will quickly chop off anything that is not, and will stop the badly behaved stuff from dragging the well behaved stuff down with it. The well behaved stuff quite simply doesn't need managing; that's my job. The badly behaved stuff needs taking care of quickly, by something that your idea seems to reflect *perfectly* (it's not often you read someone's ideas and your brain flips "that's -exactly- what I need").
How would I find out if you do get the chance to hammer out the code that achieves this? Is there a non-LKML route to watch this (please don't say Twitter :-p )?
** Encouragement encouragement encouragement **
Posted Nov 16, 2009 13:45 UTC (Mon) by mjthayer (guest, #39183) [Link]
death by swap
Posted Nov 16, 2009 13:50 UTC (Mon) by mjthayer (guest, #39183) [Link]
>I suppose I see three cases here. One is that the page was part of the process' working set at an earlier point in time, but no longer is. In that case swapping it out is the right thing to do. The other is that the process is in control, but it's working set is bigger than the available memory. Then I agree that there is a good case for putting it on hold until enough memory is available, although that is a non-trivial problem which is somewhat outside of the scope of what I am trying to do. And the third case is the one that I am interested in - a runaway process which will eventually be OOMed. In this case, the quota will stop it from trampling on the working set of every other process in memory in the meantime.
Actually case 2 could be handled to some extent by lowering the priority of a process that kept on swapping for too long.
death by swap
Posted Nov 4, 2009 23:49 UTC (Wed) by jond (subscriber, #37669) [Link]
lots of swap, especially in cheapo VMs. There's a whole raft of programs that you cannot start with, say, 256MB RAM and little swap without overcommit. Mutt and irssi are two that spring to mind. Lots of swap lets you "overcommit" with the risk being that you end up swapping rather than going on a process-killing spree.
death by swap
Posted Nov 6, 2009 8:46 UTC (Fri) by iq-0 (subscriber, #36655) [Link]
The only reason this couldn't be a sane default is that on systems with 32MB an overcommit_ratio of 1000% is still too small (but still, if you have 32MB and no swap, you're probably still better off with this limit).
death by swap
Posted Nov 18, 2009 16:12 UTC (Wed) by pimlottc (guest, #44833) [Link]
death by swap
Posted Nov 5, 2009 17:52 UTC (Thu) by sbergman27 (guest, #10767) [Link]
fine even though swap space usage was usually 8+ GB. We ran this way for months with no complaints. Just because you're using a lot of swap space doesn't mean you are paging excessively. Note that if a page is paged out and then brought back into memory, it stays written in swap to save writing it again if that page gets paged out again. You can't tell a whole lot about how much swap you are *really* using by looking at the swap-used number. sysstat monitoring and sar -W are more useful than the swap-used number for assessing swapping.
I do use the twice-RAM rule. I'd rather the system get slow than crash or have the OOM killer running loose on it.
death by swap
Posted Nov 5, 2009 21:14 UTC (Thu) by anton (subscriber, #25547) [Link]
I have seen several cases where a process slowly consumed more and more memory, but apparently always had a small working set, so it eventually consumed all the swap space and the OOM killer killed it (sometimes it killed other processes first, though). The machine was so usable during this, that I did not notice that anything was amiss until some process was missing. IIRC one of these cases happened on a machine with 24GB RAM and 48GB swap; there it took several days until the swap space was exhausted.
death by swap
Posted Nov 6, 2009 6:35 UTC (Fri) by motk (subscriber, #51120) [Link]
Memory: 64G real, 2625M free, 62G swap in use, 40G swap free
You were saying? :)
death by swap
Posted Nov 6, 2009 8:43 UTC (Fri) by mjthayer (guest, #39183) [Link]
death by swap
Posted Nov 18, 2009 16:47 UTC (Wed) by pimlottc (guest, #44833) [Link]
> STOP USING the old "TWICE RAM" guideline!

You know, I've been hearing this lately, but the problem is there seems to be no consensus on what the guideline should be. Some swear by no swap at all, while others say running without at least some is dangerous. No one seems to agree on what an appropriate amount is. Until there is a new accepted rule of thumb, everyone will keep using the old one, even if it's wrong.
death by swap
Posted Nov 18, 2009 18:37 UTC (Wed) by dlang (guest, #313) [Link]
nowadays it depends on your system and how you use it.
if you use the in-kernel suspend code, the act of suspending will write all your ram into the swap space, so swap must be > ram.
if you don't use the in-kernel suspend code, you need as much swap as you intend to use. How much swap you are willing to use depends very much on your use case. For most people a little bit of swap in use doesn't hurt much and, by freeing up additional ram, results in an overall faster system. For other people the unpredictable delays in applications, due to the need to pull things from swap, are unacceptable. In any case, having a lot of swap activity is pretty much unacceptable for anyone.
note that if you disable overcommit you need more swap, or allocations (like when a large program forks) will fail; you need additional swap space greater than the max memory footprint of any process you intend to allow to fork (potentially multiples of this). With overcommit disabled I could see you needing swap significantly higher than 2x ram in some conditions.
my recommendation is that if you are using a normal hard drive (usually including the SSD drives that emulate normal hard drives), allocate a 2G swap partition and leave overcommit enabled (and that's probably a lot larger than you will ever use).
if you are using a system that doesn't have a normal hard drive (usually this sort of thing has no more than a few gig of flash as its drive) you probably don't want any swap, and you definitely want to leave overcommit on.
death by swap
Posted Nov 19, 2009 16:46 UTC (Thu) by nye (guest, #51576) [Link]
FWIW, I agree, except that I'd make it a file instead of a partition - it's just as fast, and it leaves some flexibility just in case.
I use a 2GB swapfile on machines ranging from 256MB to 8GB of RAM - it may be overkill but that much disk space costs next to nothing. I wouldn't want to set it higher, because if I'm really using swap to that extent, the machine's probably past the point of usability anyway.
death by swap
Posted Nov 19, 2009 13:29 UTC (Thu) by makomk (guest, #51493) [Link]
death by swap
Posted Nov 19, 2009 18:33 UTC (Thu) by dlang (guest, #313) [Link]
in both cases you have to read from disk to continue, the only difference is if you are reading from the swap space or the initial binary (and since both probably require seeks, it's not even a case of random vs sequential disk access)
death by swap
Posted Dec 5, 2009 17:19 UTC (Sat) by misiu_mp (guest, #41936) [Link]
I would presume that executables do not make up much of the used memory, so reusing their pages will probably not be much gain.
Thrashing is what happens when processes get their pages continuously swapped in and out as the system schedules them to run. That's when everything grinds to a halt, because each context switch or memory access needs to swap out some memory in order to make room for other memory to be read in from the swap or the binary.
That can happen when the total working set (actively used memory) of the busy processes exceeds the amount of RAM, or, more realistically, when the swap (nearly) runs out so there is nowhere to evict unused pages to free up RAM - leaving space for only small chunks to run at a time.
Usually (in my desktop experience) soon after that the OOM killer starts kicking in, which causes the system to thrash even more (as the OOM killer has needs too), and it takes hours for it to be done.
When it happens I usually have no choice but to reboot, losing some data, so for me the OOM killer has been useless and overcommitment the root of all evil.
Using up swap alone does not affect performance much, if you don't access what's on the swap. If you continuously do that - that's thrashing.
Toward a smarter OOM killer
Posted Nov 5, 2009 13:45 UTC (Thu) by hppnq (guest, #14462) [Link]
What could work for you is to run a dummy process (allocate memory as you like) and have that killed first by the OOM killer (use Evgeniy Polyakov's patch), so it would 1) notify the administrator that the system has run into this problem, and 2) free up enough memory so something can actually be done about it.

Just as an exercise, of course. ;-)
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 4, 2009 20:54 UTC (Wed) by vadim (subscriber, #35271) [Link]
idea. It's based on the very wrong philosophy that a good thing to do
about a problem is to ignore its existence, as if that would make it go
away.
Of course it doesn't. Reality is what it is, there's a finite amount of
memory and it's not possible to really use more than what's available. So
after the kernel pretends there's more memory than there really is, it
must deal with the consequences of being unable to uphold that promise and
have to kill some process that may not have anything to do with the memory
problems.
Besides the whole mess the kernel has get involved in due to this strange
way of doing things, it means applications can't sanely handle memory
shortage, because the kernel won't let them.
The whole reason for the existence of this strange system seems to be that some applications allocate more memory than they use. But I think there are much better things to do about that.

First, any allocated but as-yet-unused memory could be used for data that can always be freed when the application tries to use the memory, such as disk cache.

Second, I think the effort would be much better spent on writing some sort of program (valgrind patch?) that would show which applications are allocating memory they don't use, how much, and where.

Fixing the applications to allocate only what's required would have benefits: it would stop them wasting memory on systems without an OOM killer; it would remove the need for the OOM killer, as overcommit would stop providing any benefit; it would improve the general stability of Linux, as without an OOM killer the kernel would stop doing stupid things like killing database servers; and it would give programmers some incentive to sanely handle out-of-memory conditions (since currently they can't even if they want to).

At the very least I think overcommit should be off by default.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 4, 2009 21:47 UTC (Wed) by drag (guest, #31333) [Link]
They need to know it and need to be able to react to it.

Now whether or not applications actually do respond is something else entirely. But you _MUST_ give application developers the chance to "do the right thing", even if you don't expect them to do it.

So send out-of-memory errors to applications and hope that they use them, and you have the OOM killer if they don't.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 4, 2009 22:30 UTC (Wed) by vadim (subscriber, #35271) [Link]
With overcommit:

1. Application mallocs 200MB when there are 150MB available. The kernel hopes it doesn't actually use it, and lets it happen anyway.

2. Applications work for a while.

3. At some point (not necessarily while running the app that malloced the 200MB), the kernel realizes: crap, I'm out of memory. A process needs a page, but there's no memory that can be freed and swap is full. Got to kill something to make room. It picks a process and kills it. With SIGKILL.

Some process dies with no chance to react, because it can get killed in the middle of absolutely anything, even something like a for(;;); which doesn't allocate any memory. There's no way for it to react sanely.

Without overcommit:

1. Application mallocs 200MB with 150MB free. The kernel says "nope, there isn't that much" and malloc returns NULL.

2. At that point the application can decide what to do. It may abort, or refuse to open a document too large for memory but keep running, or decide it can work with a smaller internal cache, etc. If it's badly written it doesn't check the malloc return value and crashes on the null pointer, but even then, the application that goes is precisely the one that wanted too much memory.

3. The OOM killer isn't needed, because the kernel never gets to the "crap, I'm out of memory" stage.

Done this way there's no need to hope for anything. You simply don't allow the situation overcommit gets into to happen in the first place.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 0:17 UTC (Thu) by madscientist (subscriber, #16861) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:54 UTC (Thu) by epa (subscriber, #39769) [Link]
(vfork() as in classical BSD is one answer, but still a bit crufty IMHO: rather than a special kind of fork that you can only use before exec, better to just say what you mean and have fork+exec be a single call.)
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:55 UTC (Thu) by epa (subscriber, #39769) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:25 UTC (Thu) by nix (subscriber, #2304) [Link]
being horrendously complicated it is *still* not flexible enough for things that regular applications do all the time. And it never will be: you'd have to implement a Turing-complete interpreter in there to approach the flexibility of the fork/exec model...
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 8:55 UTC (Fri) by epa (subscriber, #39769) [Link]
It would be better for applications to give the kernel more clues about their intention, so the kernel can make better decisions on memory management.
I agree that posix_spawn, like almost anything that comes out of a committee, is a complicated monster. Perhaps a better answer would be to refine the distinction between fork() and vfork(), or to introduce a new fork-like call fork_intend_to_exec_soon(). Then the kernel could know that for an ordinary fork() it has to be cautious and check all the required memory is available, while fork_intend_to_exec_soon() has the current optimistic behaviour.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 13:43 UTC (Fri) by nix (subscriber, #2304) [Link]
rapidly followed by exec()s. Whatever you choose, getting it used by much software would be a long slow slog :/
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 19:14 UTC (Fri) by dlang (guest, #313) [Link]
if applications can misuse this without a penalty they will never get it right (especially when using it wrong will let their app keep running in cases where the fork would otherwise fail)
but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 8, 2009 21:21 UTC (Sun) by epa (subscriber, #39769) [Link]
I don't think it matters much if a few slightly-buggy applications use the wrong variant. If 90% of userspace including the most important programs such as shells passes the right hint to the kernel, the kernel can make better decisions than it does now, and the need for the OOM killer will be reduced. It's a similar situation with raw I/O, for example: a disk-heavy program such as a database server might know that it will scan through a large file just once. Ordinarily this file's contents might clog up the page cache and evict more useful things. To help get more consistent performance, apps can be coded to hint to the kernel that it needn't bother to cache a particular I/O request. The default is still to cache it, and it's not catastrophic if one or two userspace programs haven't been tuned to use the new fancy hinting mechanism.
> but forget the fork then exec situation for a moment and consider the real fork situation. for a large app, most of the memory will never get modified by either process, and so even there it will almost never use the 2x memory that has been reserved for it.
Very true, but of course there's no way for the kernel to know this. I expect most apps would prefer the fork to either succeed for sure, or fail at once if not enough memory can be guaranteed. There may be a few where optimistically hoping for the best and perhaps killing a random process later is the ideal behaviour.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 8, 2009 23:32 UTC (Sun) by nix (subscriber, #2304) [Link]
you make that some weird nonportable new variant, 90% of programs are
never going to use it, and none of the rest will until some considerable
time has passed (time for this call to percolate down into the kernel and
glibc --- and try getting this call past Ulrich, ho ho.)
(Anyway, we *have* fork_to_exec_soon(). It's called vfork().)
posix_spawn is stupid as a system call
Posted Nov 5, 2009 13:10 UTC (Thu) by helge.bahmann (subscriber, #56804) [Link]
call?
fork + set*uid/set*gid + exec
fork + chroot + chdir + exec
fork + close(write_side_of_pipe/read_side_of_pipe) + dup2() + exec
fork + open(arbitrary set of files) + exec
fork + sched_setscheduler/sched_setparam/sched_setaffinity... + exec
fork + personality + exec
fork + setpgid + exec
not to mention the various clone flags (fs namespace etc.) and any wild
combination of the above
I think I have needed all of the above in various circumstances, sometimes
two or three things between fork+exec at a time
bonus question: how many more parameters do you want to add to a "combined
fork/exec" syscall to make it future proof for other things that might
need to be done before the new process image is executed?
posix_spawn is stupid as a system call
Posted Nov 5, 2009 13:45 UTC (Thu) by madscientist (subscriber, #16861) [Link]
Fork+exec definitely has its downsides in some of the technical implementation requirements, but from a higher level language perspective it's brilliant.
posix_spawn is stupid as a system call
Posted Nov 6, 2009 9:07 UTC (Fri) by epa (subscriber, #39769) [Link]
You're right, others pointed out the same thing; no single system call can handle all the things you might want to set up in the child process before exec()ing.
But that said, why does the whole child process (including, potentially, a complete copy of its parent's core pages, all ready to be written to) need to be created just to set a few uids or open some files? Perhaps it would work better to first prepare a new process structure, then set uids and open files for it, and as the last stage breathe life into it by giving a file to exec(). For example:
pid_t child = new_waiting_process();
// Now child is an entry in the process table, but it is not running.
// Use the p_ variants of some system calls to set things up for
// this child process.
p_setuid(child, uid);
p_close(child, 0);
p_open(child, "infile");
// Finished setup, start it running.
p_exec_and_start(child, "/bin/cat");
wait(child);
This would give almost the same flexibility, but without the need to overcommit memory. The kernel would just need to create a new process in a not-runnable state, and the p_whatever system calls allow performing operations on another process rather than yourself. (Of course they would only allow manipulating your own not-yet-started child process, except perhaps for root.)
A process created with new_waiting_process() would inherit its parent's file descriptors, current directory, environment and so on as for fork(), but it would not inherit the parent's core.
posix_spawn is stupid as a system call
Posted Nov 6, 2009 10:07 UTC (Fri) by helge.bahmann (subscriber, #56804) [Link]
duplicate is _huge_. It would perhaps be easier to create an "almost
empty" process image (with at least one stack and executable code page set
up) in suspended state, and then use ptrace or something similar to inject
system calls into the new process image -- this is tricky, but at least
the kernel is not burdened with an exploding number of system calls.
Alternatively, you could also provide a "fork" variant that explicitly
declares which pages of the address space are to be COWed into the new
process (if you are extra-smart, all you ever need to COW are the stack
pages, but calling library functions before execve is probably going to
spoil that -- but then, finding out which pages a library requires is by
no means easier, so you have to exercise a lot of discipline).
Might be an interesting research project to attempt any of the above in
Linux :)
posix_spawn is stupid as a system call
Posted Nov 6, 2009 13:51 UTC (Fri) by nix (subscriber, #2304) [Link]
int masquerade_as (pid_t pid)
which issues syscalls in 'pid' instead of the current process. ('pid' is a
process you'd be allowed to ptrace, so immediate children are permitted).
This is a per-thread attribute, and passing a pid of 0 flips back to the
parent again.
Then all you need is this (ignoring error checking just as the OP did,
what a horrible name that new_waiting_process() has got, vvfork() would
surely be better):
pid_t child = new_waiting_process();
masquerade_as (child);
setuid(uid);
close(0);
open("infile", O_RDONLY);
// Finished setup, start it running.
char *const argv[] = { "/bin/cat", NULL };
execve ("/bin/cat", argv, environ);
masquerade_as (0);
wait(child);
Note the subtleties here: execution always continues after execve()
because the execve() was done to another process image. Non-syscalls are
very dangerous to run because they might update userspace storage in the
wrong process: we'd really need support for this in libc for it to be
usable.
(In practice this latter constraint destroys the whole idea no matter how
good it might be: Ulrich would say no, as he does to every idea anyone
else originates. Personally I suspect this idea sucks in any case :) )
posix_spawn is stupid as a system call
Posted Nov 8, 2009 21:26 UTC (Sun) by epa (subscriber, #39769) [Link]
posix_spawn is stupid as a system call
Posted Nov 8, 2009 23:34 UTC (Sun) by nix (subscriber, #2304) [Link]
*bad* move :) I think, if you wanted to do this, you'd have to introduce a
huge pile of new syscalls and reimplement the old ones as thin wrappers
(inside the kernel so as not to force everyone to upgrade glibc) calling
the new ones.
posix_spawn is stupid as a system call
Posted Nov 23, 2009 15:08 UTC (Mon) by jch (guest, #51929) [Link]
A suggestion
Posted Nov 12, 2009 5:17 UTC (Thu) by jlmassir (guest, #48904) [Link]
1. Never allow overcommit when calling malloc
2. Allow overcommit on fork/exec, but kill the child process if it tries to
write to more than 10% of its virtual size.
This way, buggy programs that malloc too much memory and never use them
would be fixed and fork bombs would be killed, while still allowing one to do
system calls between fork and exec.
What do you think?
A suggestion
Posted Nov 14, 2009 20:52 UTC (Sat) by Gady (guest, #1141) [Link]
A suggestion
Posted Nov 15, 2009 20:03 UTC (Sun) by jlmassir (guest, #48904) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:27 UTC (Thu) by nix (subscriber, #2304) [Link]
stack space? A process with a lot of threads, mostly idle, could easily be
using gigabytes for stack address space, all potentially allocatable, but
only actually be using a tiny fraction of that (4K out of every 8Mb chunk,
say).
So overcommit doesn't just break programs that use fork/exec under high
load, forcing failure far sooner than necessary: it breaks programs that
use threads in the same way. Doesn't leave much, does it...
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 18:28 UTC (Thu) by nix (subscriber, #2304) [Link]
emphatic terms :/
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 6, 2009 9:10 UTC (Fri) by epa (subscriber, #39769) [Link]
(Doesn't a process have some way to specify the max. stack size that it will use for each thread?)
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 0:20 UTC (Thu) by mikov (guest, #33179) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 0:56 UTC (Thu) by JoeBuck (subscriber, #2330) [Link]
Exactly. On a desktop Linux system, you might have a Firefox instance you've been using for ages and it's up to 1.5 gigabytes virtual memory. You have other processes running and your swap is mostly full, so that there's only another 0.5G available. Then you download a film clip and you want to fire up totem to view it. Firefox does a fork, followed by exec. But you don't have 1.5 additional gigabytes. Solaris would refuse to do the fork, even though you don't really need that additional 1.5G: you might dirty one page before doing an exec of totem, which is much smaller. Linux and AIX will issue a loan.
Solaris works around this problem by recommending that developers use posix_spawn rather than fork followed by exec; however, they didn't add this call until Solaris 10.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 1:02 UTC (Thu) by mikov (guest, #33179) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 23, 2009 15:19 UTC (Mon) by jch (guest, #51929) [Link]
Shared libraries are backed by filesystem data, so a read-only map of a shared library does not involve overcommit.
why not posix_spawn()?
Posted Nov 5, 2009 8:02 UTC (Thu) by Cato (guest, #7643) [Link]
Fork/exec really works well with smaller processes (as in the original Unix tools / shell pipeline approach), but forking a 1.5 GB Firefox process is insane...
It's got to the point that I can't click on mailto: links any more in Firefox (currently 850 MB resident memory) because it will take so long to fork and then exec Thunderbird.
why not posix_spawn()?
Posted Nov 5, 2009 8:30 UTC (Thu) by mikov (guest, #33179) [Link]
why not posix_spawn()?
Posted Nov 5, 2009 18:28 UTC (Thu) by khc (guest, #45209) [Link]
why not posix_spawn()?
Posted Nov 5, 2009 20:44 UTC (Thu) by mikov (guest, #33179) [Link]
Additionally, perhaps shared page tables will eventually improve on that. Does anybody know what the status of those patches is?
But anyway, I don't think that the slow starting of Thunderbird from Firefox, which the GP commented on, is caused by fork() copying the page tables.
why not posix_spawn()?
Posted Nov 5, 2009 23:21 UTC (Thu) by Cato (guest, #7643) [Link]
why not posix_spawn()?
Posted Nov 6, 2009 1:17 UTC (Fri) by khc (guest, #45209) [Link]
why not posix_spawn()?
Posted Nov 6, 2009 6:12 UTC (Fri) by mikov (guest, #33179) [Link]
malloc(1MB) 3235 ms 0.161750 ms/iter
malloc(100MB) 390 ms 3.900000 ms/iter
malloc(500MB) 1663 ms 16.630000 ms/iter
malloc(1024MB) 3329 ms 33.290000 ms/iter
With a heap of 1G it takes 33ms to do a fork() on my machine, which to me is surprisingly long (although not that surprising when you consider the sheer size of the page tables with 4KB pages). While, as I said initially, it would definitely not be noticeable for interactive process creation, it is significant. The much-maligned "slow" process creation on Windows is much faster for sure...
I did run a couple of more tests though, which improve the situation. First I confirmed that the page tables of shared memory mappings really are not copied. I replaced the malloc() with mmap( MAP_ANON | MAP_SHARED ):
mmap(1MB) 1204 ms 0.120400 ms/iter
mmap(100MB) 1201 ms 0.120100 ms/iter
mmap(500MB) 1231 ms 0.123100 ms/iter
mmap(1024MB) 1229 ms 0.122900 ms/iter
As you can see, there is no relation between fork() speed and mapping size.
Then I restored the malloc(), but replaced the fork() with vfork():
vfork+malloc(1MB) 102 ms 0.010200 ms/iter
vfork+malloc(100MB) 107 ms 0.010700 ms/iter
vfork+malloc(500MB) 105 ms 0.010500 ms/iter
vfork+malloc(1000MB) 106 ms 0.010600 ms/iter
The last result is really encouraging (and actually not surprising). Even though everybody seems to hate vfork(), for the case we are discussing (fork of a huge address space + exec), it should solve all problems, removing the need for the clumsy posix_spawn(), while preserving all the flexibility of fork(). Beat that, Windows!
Any good reasons why vfork() should be avoided?
why not posix_spawn()?
Posted Nov 6, 2009 19:12 UTC (Fri) by khc (guest, #45209) [Link]
I have to admit that I have never checked to see if posix_spawn fits my need, though. Since I only care about linux and posix_spawn on linux is the same as fork()/.../exec(), it's useless for me anyway.
why not posix_spawn()?
Posted Nov 6, 2009 19:28 UTC (Fri) by mikov (guest, #33179) [Link]
If your purpose is to call exec() after fork(), you should just be able to mechanically replace all forks() with vforks() and get a big boost.
why not posix_spawn()?
Posted Nov 6, 2009 22:55 UTC (Fri) by cmccabe (guest, #60281) [Link]
The manual page installed on my system says that copy-on-write makes vfork unnecessary. It concludes with "it is rather unfortunate that Linux revived this specter from the past." :)
However... it seems like the results you've posted show quite a substantial performance gain for vfork + exec as opposed to fork + exec, for processes with large heaps.
Maybe the "preferred" way to do this on Linux would be using clone(2)??
C.
why not posix_spawn()?
Posted Nov 6, 2009 23:23 UTC (Fri) by cmccabe (guest, #60281) [Link]
Ah. Found it.
https://www.securecoding.cert.org/confluence/display/secc...
> Due to the implementation of the vfork() function, the parent process is
> suspended while the child process executes. If a user sends a signal to
> the child process, delaying its execution, the parent process (which is
> privileged) is also blocked. This means that an unprivileged process can
> cause a privileged process to halt, which is a privilege inversion
> resulting in a denial of service.
clone(CLONE_VM) + exec might be the win...
Colin
Memory required for fork()
Posted Nov 5, 2009 1:04 UTC (Thu) by vomlehn (guest, #45588) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 3:11 UTC (Thu) by smoogen (subscriber, #97) [Link]
Now the question is why do some of these do this? Is it a basic assumption of every non-embedded OS to be sloppy with memory? Is it POSIX? And if every system were set up with overcommit turned off, how much would break?
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 5:19 UTC (Thu) by mikov (guest, #33179) [Link]
Here is what Linus had to say about that in 1995:
http://groups.google.com/group/comp.os.linux.development....
The fundamentals still apply.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 12, 2009 9:50 UTC (Thu) by rlhamil (guest, #6472) [Link]
for my liking. And my liking is that everything other than /dev/random is _totally_
deterministic (neglecting external input of course).
(I'd argue that overcommit-by-default is an invitation to denial of service attack, and, if
likely victims were more or less predictable, might be a "covert channel" as well.)
Solaris doesn't do overcommit, but does also offer MAP_NORESERVE, so that individual
mmap() operations can opt out of a reserve, in which case a write to a private mapping
(copy-on-write from a file) can cause the process to receive SIGSEGV or SIGBUS; see
http://docs.sun.com/app/docs/doc/816-5167/mmap-2?l=en&...
(the online version of the mmap() man page for Solaris 10)
I think that all that's missing is:
* a system call to turn on or off similar behavior for heap and/or stack, and to
turn on or off _implicit_ MAP_NORESERVE on all private mappings for that process
and its subsequently forked children (reset on exec)
* a shared library feature to implement system policy specifying which executables
should be subject to overcommit, with a settable default for all not explicitly specified
* an OS default of no overcommit
* no OOM killer needed
Distros could supply default policy that opted for overcommit on chronically hoggish
(and typically not critical to system integrity) apps such as browsers. People might e.g.
not mind their browser dying a few more times than it would anyway, but might be very
glad to be sure that their X server (desktop user) or database server process was safe from
nondeterministic behavior possibly triggered by another process.
That gets overcommit out of the OS, and pushes the decisions into user space. A process
could always override policy with the system calls, but it would have to know what it was
doing to do that.
The only limitation with implementing the defaults for an executable in the dynamic linker
is that it wouldn't be able to allow overcommit for static executables. If that was a serious
limitation, a new mechanism would be needed to push the policy settings into the kernel,
and execve() (or equivalents) would have to implement them, which is IMO more comprehensive
but otherwise uglier.
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 21:49 UTC (Thu) by anton (subscriber, #25547) [Link]
If an application does not handle ENOMEM gracefully, it's better to run it in overcommit mode. Hopefully it will never actually use all that memory; then it will be better off than if it got ENOMEM. If it gets OOM-killed, it won't be worse off. And being able to allocate large amounts of memory without using it makes writing programs quite a bit simpler in some cases.
OTOH, if an application is written to deal with ENOMEM gracefully, it's better not to overcommit memory for this application, to give it ENOMEM if there is no more committable memory, and then there is no need to OOM kill such an application (instead, one of the other overcommitting applications can be killed).
I have written this up in more detail; there I suggested making it depend on the process. In the meantime I have learned about the MAP_NORESERVE flag, which makes it possible to do this per-allocation. However, since the OOM-Killer kills a process, not an allocation, it's probably better to use MAP_NORESERVE either on all or no allocation in a process; but how to tell this to functions that call mmap only indirectly (malloc(), fopen() etc.)?
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 4, 2009 22:17 UTC (Wed) by paragw (guest, #45306) [Link]
It is based on the fundamentally flawed philosophy of handing out resources without accurately accounting for how many resources are actually there - that cannot be a good idea under any circumstance.
Plus it encourages sloppy application code. And then the applications do not get any notification if the kernel runs out of memory - this means it is game over without any warning.
I really like how Solaris VM handles this - if you ask for memory it will reserve that much swap as backing store - if it fails to allocate that much swap, applications get a NULL back from malloc(). So all you got to do is have good amount of swap space - disk space is relatively cheaper so that is not a big deal. Applications then can still ask for excess memory as long as address space and swap space are there - whatever doesn't get used remains in swap.
I once tried turning off overcommit and first program to fail was java - it needed 1Gb code cache if I recall correctly. But programs can be fixed eventually if kernel enforces strict allocation policy.
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 9, 2015 16:48 UTC (Fri) by cortana (subscriber, #24596) [Link]
AIUI this is what Linux does with overcommit_memory=2. What I'm trying to find out (and this is why I'm digging up this old thread) is how Solaris deals with the possibility of a process with a large amount of private memory calling fork. Presumably those pages have to be reserved, but if they can't be, does fork fail (as on Linux with overcommit_memory=2)?
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 11, 2015 23:48 UTC (Sun) by nix (subscriber, #2304) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 9, 2015 20:25 UTC (Fri) by dlang (guest, #313) [Link]
so what do you do on a system that doesn't have a huge amount of disk available?
I've had systems with 128G of ram and 160G of (very high speed) disk.
High-end laptops with SSDs can easily have cases where RAM is a significant percentage of the total disk space
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 9, 2015 20:50 UTC (Fri) by Cyberax (✭ supporter ✭, #52523) [Link]
This is a very flawed idea. In reality swap is nearly useless for overcommitted machines.
The problem is, once you start actually _using_ the swap, everything simply stops. The IO bandwidth and latency are not enough for anything reasonable.
Pretty much the only valid use-case for swap is to off-load inactive applications and free RAM for those who need it.
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 9, 2015 21:14 UTC (Fri) by dlang (guest, #313) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 11, 2015 23:51 UTC (Sun) by nix (subscriber, #2304) [Link]
I'd agree about its essential uselessness for other purposes, though. With modern CPU versus disk speeds, you hit the thrashing wall very, very fast.
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 12, 2015 0:25 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]
In fact, for us it'd be great if it was possible to simply freeze a swapped-out application if its swap IO bandwidth exceeds some sane value. Then un-freeze it once there is enough RAM to bring all of its pages back in from swap.
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 12, 2015 22:49 UTC (Mon) by renox (guest, #23785) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 12, 2015 22:56 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link]
As it's common for GC algorithms, this optimization is probabilistic. It works most of the time, but sometimes blows up spectacularly.
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 13, 2015 9:06 UTC (Tue) by renox (guest, #23785) [Link]
Yes, this is the paper (sorry for not posting the link previously; I had lost it).
> As it's common for GC algorithms, this optimization is probabilistic. It works most of the time, but sometimes blows up spectacularly.
Well, that's not specific to GCs; it's quite widespread in computing: virtual memory itself and caches are 'probabilistic' optimizations.
So unless you need a hard real-time system, or the 'blow up spectacularly' part is too frequent, this isn't a problem.
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 12, 2015 1:20 UTC (Mon) by flussence (subscriber, #85566) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Jan 12, 2015 9:09 UTC (Mon) by paulj (subscriber, #341) [Link]
The existence of OOM is one of the few really stupid things in Linux
Posted Nov 5, 2009 9:15 UTC (Thu) by Felix.Braun (guest, #3032) [Link]
Toward a smarter OOM killer
Posted Nov 4, 2009 21:38 UTC (Wed) by Shewmaker (guest, #1126) [Link]
My experience with scientific computing is that even if the OOM killer picks the correct process, the system is then in a strange state where something else will not work correctly. We often end up rebooting a system like this so that it is back in a known good state. More recently, we have been disabling memory overcommit.
Though it is less flexible, disabling overcommit allows us to see and fix problems more quickly. It may be a system problem (e.g. a log file growing to fill a RAM-based filesystem on a diskless node) or an application that didn't realize it was using twice as much memory as it intended.
Toward a smarter OOM killer
Posted Nov 5, 2009 4:21 UTC (Thu) by zooko (guest, #2589) [Link]
and servers.
Toward a smarter OOM killer
Posted Nov 5, 2009 13:40 UTC (Thu) by sean.hunter (guest, #7920) [Link]
Toward a smarter OOM killer
Posted Nov 6, 2009 19:17 UTC (Fri) by dlang (guest, #313) [Link]
I wouldn't be surprised to find that much of the time the ratio of overcommit rises significantly shortly before the OOM killer kicks in.
Per-process memory limits?
Posted Nov 8, 2009 3:19 UTC (Sun) by etrusco (guest, #4227) [Link]
(Sure, an interested application could limit its own heap usage through its memory allocator, but this couldn't stop a DoS from a virus/code injection.)
Per-process memory limits?
Posted Nov 8, 2009 11:56 UTC (Sun) by nix (subscriber, #2304) [Link]
than 'die, fiend!' on exceeding a limit, but it's not essential: you could
have a parent monitoring process which spots the kill and takes
appropriate action (though figuring out which limit the child exceeded, or
whether it exceeded a limit at all or was just randomly killed by the
admin, might be harder).
Per-process memory limits?
Posted Nov 9, 2009 9:22 UTC (Mon) by hppnq (guest, #14462) [Link]
> It would be nice if you could trigger actions other than 'die, fiend!' on exceeding a limit
Some attempts to exceed a limit specified using setrlimit are met with an error rather than a kill, and I think all the signals delivered can be handled if appropriate measures are taken. Whether that is a good idea is another question. ;-)
Per-process memory limits?
Posted Nov 11, 2009 20:30 UTC (Wed) by oak (guest, #2786) [Link]
...For something else than VmSize limit which is useless for processes
that mmap() files.
Per-process memory limits?
Posted Nov 12, 2009 5:51 UTC (Thu) by etrusco (guest, #4227) [Link]
Per-process memory limits?
Posted Nov 12, 2009 5:58 UTC (Thu) by etrusco (guest, #4227) [Link]
Per-process memory limits?
Posted Nov 12, 2009 6:38 UTC (Thu) by etrusco (guest, #4227) [Link]
It would be nice, nonetheless, to be able to set the limits from a different process (e.g. a window manager) instead of resorting to a launcher and having to define the limit upfront...
Per-process memory limits?
Posted Nov 14, 2009 12:11 UTC (Sat) by efexis (guest, #26355) [Link]