
Still waiting for swap prefetch

It has been almost two years since LWN covered the swap prefetch patch. This work, done by Con Kolivas, is based on the idea that if a system is idle, and it has pushed user data out to swap, perhaps it should spend a little time speculatively fetching that swapped data back into any free memory that might be sitting around. Then, when some application wants that memory in the future, it will already be available and the time-consuming process of fetching it from disk can be avoided.
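In outline the policy is tiny. Here is a minimal userspace simulation of the idea (an illustration only, with mock state and helper functions - none of this is code from Con's patch): while the system is idle and free page frames are available, bring swapped pages back in, most recently evicted first, and stop the moment either condition fails.

/*
 * Illustrative sketch of the swap prefetch policy, simulated in
 * userspace. The swap map, free-frame count, and idleness test are
 * all mock stand-ins.
 */
#include <stdio.h>
#include <stdbool.h>

#define NPAGES 8

static int swapped[NPAGES] = {0, 1, 2, 3, 4, 5, 6, 7};  /* newest eviction last */
static int nswapped = NPAGES;
static int free_pages = 5;          /* pretend 5 page frames are free */
static int idle_ticks = 10;         /* pretend the system stays idle a while */

static bool system_is_idle(void)  { return idle_ticks-- > 0; }
static bool memory_to_spare(void) { return free_pages > 0; }

static void prefetch_one(void)
{
    int page = swapped[--nswapped]; /* most recently evicted first */
    free_pages--;                   /* it now occupies a free frame */
    printf("prefetched page %d from swap\n", page);
}

int main(void)
{
    /* the core loop: do work only while it costs nobody anything */
    while (nswapped > 0 && system_is_idle() && memory_to_spare())
        prefetch_one();
    printf("%d pages still in swap, %d frames free\n", nswapped, free_pages);
    return 0;
}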

The classic use case for this feature is a desktop system which runs memory-intensive daemons (updatedb, say, or a backup process) during the night. Those daemons may shove a lot of useful data to swap, where it will languish until the system's user arrives, coffee in hand, the next morning. Said user's coffee may well grow cold by the time the various open applications have managed to fault in enough memory to function again. Swap prefetch is intended to allow users to enjoy their computers and hot coffee at the same time.

There is a vocal set of users out there who will attest that swap prefetch has made their systems work better. Even so, the swap prefetch patch has languished in the -mm tree for almost all of those two years with no path to the mainline in sight. Con has given up on the patch (and on kernel development in general):

The window for 2.6.23 has now closed and your position on this is clear. I've been supporting this code in -mm for 21 months since 16-Oct-2005 without any obvious decision for this code forwards or backwards.

I am no longer part of your operating system's kernel's world; thus I cannot support this code any longer. Unless someone takes over the code base for swap prefetch you have to assume it is now unmaintained and should delete it.

It is an unfortunate thing when a talented and well-meaning developer runs afoul of the kernel development process and walks away. We cannot afford to lose such people. So it is worth the trouble to try to understand what went wrong.

Problem #1 is that Con chose to work in some of the trickiest parts of the kernel. Swap prefetch is a memory management patch, and those patches always have a long and difficult path into the kernel. It's not just Con who has run into this: Nick Piggin's lockless pagecache patches have been knocking on the door for just as long. The LWN article on Wu Fengguang's adaptive readahead patches appeared at about the same time as the swap prefetch article - and that was after your editor had stared at them for weeks trying to work up the courage to write something. Those patches were only merged earlier this month, and, even then, only after many of the features were stripped out. Memory management is not an area for programmers looking for instant gratification.

There is a reason for this. Device drivers either work or they do not, but the virtual memory subsystem behaves a little differently for every workload which is put to it. Tweaking the heuristics which drive memory management is a difficult process; a change which makes one workload run better can, unpredictably, destroy performance somewhere else. And that "somewhere else" might not surface until some large financial institution somewhere tries to deploy a new kernel release. The core kernel maintainers have seen this sort of thing happen often enough to become quite conservative with memory management changes. Without convincing evidence that the change makes things better (or at least does no harm) in all situations, it will be hard to get a significant change merged.

In a recent interview Con stated:

Then along came swap prefetch. I spent a long time maintaining and improving it. It was merged into the -mm kernel 18 months ago and I've been supporting it since. Andrew [Morton] to this day remains unconvinced it helps and that it 'might' have negative consequences elsewhere. No bug report or performance complaint has been forthcoming in the last 9 months. I even wrote a benchmark that showed how it worked, which managed to quantify it!

The problem is that, as any developer knows, "no bug reports" is not the same as "no bugs." What is needed in a situation like this is not just testimonials from happy desktop users; there also needs to be some sort of sense that the patch has been tried out in a wide variety of situations. The relatively self-selecting nature of Con's testing community (more on this shortly) makes that wider testing harder to achieve.

A patch like swap prefetch will require a certain amount of support from the other developers working in memory management before it can be merged. These developers have, as a whole, not quite been ready to jump onto the prefetch bandwagon. A concern which has been raised a few times is that the morning swap-in problem may well be a sign of a larger issue within the virtual memory subsystem, and that prefetch mostly serves as a way of papering over that problem. And it fails to even paper things completely, since it brings back some pages from swap, but doesn't (and really can't) address file-backed pages which will also have been pushed out. The conclusion that this reasoning leads to is that it would be better to find and fix the real problem rather than hiding it behind prefetch.

The way to address this concern is to try to get a better handle on what workloads are having problems so that the root cause can be addressed. That's why Andrew Morton says:

To attack the second question we could start out with bug reports: system A with workload B produces result C. I think result C is wrong for <reasons> and would prefer to see result D.

and why Nick Piggin complains:

Not talking about swap prefetch itself, but every time I have asked anyone to instrument or produce some workload where swap prefetch helps, they never do.

Fair enough if swap prefetch helps them, but I also want to look at why that is the case and try to improve page reclaim in some of these situations (for example standard overnight cron jobs shouldn't need swap prefetch on a 1 or 2GB system, I would hope).

There have been a few attempts to characterize workloads which are improved by swap prefetch, but the descriptions tend toward the vague and hard to reproduce. This is not an easy situation to write a simple benchmark for (though Con has tried), so demonstrating the problem is a hard thing to do. Still, if the prefetch proponents are serious about wanting this code in the mainline, they will need to find ways to better communicate information about the problems solved by prefetch to the development community.
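To make that concrete, here is the general shape such a benchmark can take (a sketch, not Con's actual benchmark; WORKING_SET and HOG are assumptions to be sized against the RAM and swap of the machine under test): fault in a "desktop" working set, push it out with a large "overnight" allocation, leave the machine idle so a prefetching kernel has something to do, then time the morning fault-in.

/*
 * Rough sketch of a morning-swap-in benchmark. Run once on a stock
 * kernel and once with swap prefetch; the interesting number is the
 * final fault-in time.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#define WORKING_SET (512UL << 20)   /* "desktop apps"; size against real RAM */
#define HOG        (2048UL << 20)   /* "overnight cron jobs"; ditto */

static double touch_secs(char *buf, size_t len)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < len; i += 4096)   /* one write per page */
        buf[i]++;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}

int main(void)
{
    char *apps = malloc(WORKING_SET);
    char *hog = malloc(HOG);
    if (!apps || !hog)
        return 1;

    memset(apps, 1, WORKING_SET);   /* fault in the "interactive" set */
    memset(hog, 1, HOG);            /* pressure: push it out to swap */
    free(hog);

    sleep(300);                     /* idle time a prefetching kernel could use */
    printf("morning fault-in: %.2f s\n", touch_secs(apps, WORKING_SET));
    free(apps);
    return 0;
}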

Communications with the community have been an occasional problem with Con's patches. Almost uniquely among kernel developers, Con chose to do most of his work on his own mailing list. That has resulted in a self-selected community of users which is nearly uniformly supportive of Con's work, but which, in general, is not participating much in the development of that work. It is rare to see patches posted to the ck-list which were not written by Con himself. The result was the formation of a sort of cheerleading squad which would occasionally spill over onto linux-kernel demanding the merging of Con's patches. This sort of one-way communication was not particularly helpful for anybody involved. It failed to convince developers outside of ck-list, and it failed to make the patches better.

This dynamic became actively harmful when ck-list members (and Con) continued to push for inclusion of patches in the face of real problems. This behavior came to the fore after Con posted the RSDL scheduler. RSDL restarted the whole CPU scheduling discussion and ended up leading to some good work. But some users were reporting real regressions with RSDL and were being told that those regressions were to be expected and would not be fixed. This behavior soured Linus on RSDL and set the stage for Ingo Molnar's CFS scheduler. Some (not all) people are convinced that Con's scheduler was the better design, but refusal to engage with negative feedback doomed the whole exercise. Some of Con's ideas made it into the mainline, but his code did not.

The swap prefetch patches appear to lack any obvious problems; nobody is reporting that prefetch makes things worse. But the ck-list members pushing for its inclusion (often with Con's encouragement) have not been providing the sort of information that the kernel developers want to see. Even so, while a consensus in favor of merging this patch has not formed, there are some important developers who support its inclusion. They include Ingo Molnar and David Miller, who says:

There is a point at which it might be wise to just step back and let the river run its course and see what happens. Initially, it's good to play games of "what if", but after several months it's not a productive thing and slows down progress for no good reason.

If a better mechanism gets implemented, great! We can easily replace the swap prefetch stuff at such time. But until then swap prefetch is what we have, and it has sat long enough in -mm with no major problems to merge it.

So swap prefetch may yet make it into the mainline - that discussion is not, yet, done. If we are especially lucky, Con will find a way to get back into kernel development, where his talents and user focus are very much in need. But this sort of situation will certainly come up again. Getting major changes into the core kernel is not an easy thing to do, and, arguably, that is how it should be. If the process must make mistakes, they should probably happen on the side of being conservative, even if the occasional result is the exclusion of patches that end up being helpful.




Still waiting for swap prefetch

Posted Jul 25, 2007 22:55 UTC (Wed) by mgb (guest, #3226) [Link]

"for example standard overnight cron jobs shouldn't need swap prefetch on a 1 or 2GB system"

Now I've been programming since magnetic drums were hip so I may be a bit confused here but it seems to me that I can remember a time less than a decade ago when a system didn't need a couple of gigs to run Linux well.

It seems that Linux may still be a little more efficient than Vista on most loads (but not Firefox), but back in the nifty nineties Linux was a _lot_ more efficient than Windows. In short, Linux has been getting worse faster than Windows.

Still waiting for swap prefetch

Posted Jul 25, 2007 23:04 UTC (Wed) by elanthis (guest, #6227) [Link]

_Linux_ has been getting worse, or the shit that people use on their systems is getting worse?

I run 2GB in my home machine, but I don't think I've ever seen the memory peak over 1GB of actual usage. Over 60% of that is just file caches and the like - unnecessary but nice performance enhancements, basically.

Now, there are some things that we use these days that eat up a shit load of memory. My mail client has to track some 40,000 messages among all my folders. Firefox has to deal with Youtube videos. I have Tracker indexing my file system. I have more data in a single file than a hard-drive 15 years ago could even hold.

There is a lot of fat in the current desktop system, that is for sure. Ubuntu and Red Hat both have this obsession with Python, which is about as bloaty a language as you can get, and then they write crappy apps in Python that would be slow as molasses even in C because they use crappy algorithms.

Still waiting for swap prefetch

Posted Jul 25, 2007 23:20 UTC (Wed) by xorbe (guest, #3165) [Link]

When I have 2GB of ram in my home desktop system, I never ever want to see Linux drop binaries from ram, and swap them back in from the binary text file. Or see it drop data pages to increase hdd cache in memory.

Still waiting for swap prefetch

Posted Jul 25, 2007 23:40 UTC (Wed) by johnkarp (guest, #39285) [Link]

Even when the disk-cache pages are more frequently used than the program-text pages? If I understand things correctly, a program has to wait ~9ms when the data it needs is on the disk, whether it's program text or data. Also, programs often have pages that are *never* used after startup; the sooner those pages are evicted in favor of useful things the better.

Still waiting for swap prefetch

Posted Jul 25, 2007 23:47 UTC (Wed) by sjlyall (guest, #4151) [Link]

When I have 2GB of ram in my home desktop system, I never ever want to see Linux drop binaries from ram, and swap them back in from the binary text file. Or see it drop data pages to increase hdd cache in memory.

Turn off swap then if you feel that way. I've got no problem with my system swapping out the getty copies which will never be used and instead using the RAM to cache my mail files.

I have a bunch of stuff that includes running programs, program data and files that the CPU has to deal with. I want everything to go as fast as possible, so if the CPU wants some of that data it should be able to get it as quickly as possible, which means it should be in RAM whenever it can be. The better the kernel does this, the faster everything will run.

Still waiting for swap prefetch

Posted Aug 7, 2007 2:33 UTC (Tue) by xorbe (guest, #3165) [Link]

Even with swap off, the kernel can drop program text and page it back in from the backing binary on disk -- all without a swap file.

OpenOffice comes with a "pagein" utility.

Posted Jul 26, 2007 0:14 UTC (Thu) by dmarti (subscriber, #11625) [Link]

Why should the kernel have to guess what to page back in when you could use the "pagein" from OpenOffice, or something like it? Here's a simple example.
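(The original example was a shell function; here is the same idea sketched in C instead - an illustration, not OpenOffice's actual pagein. It maps each file named on the command line and asks the kernel to read it in.)

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    for (int i = 1; i < argc; i++) {
        int fd = open(argv[i], O_RDONLY);
        struct stat st;

        if (fd < 0)
            continue;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            void *p = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                /* hint the whole mapping in, rather than touching
                 * every page by hand; the page cache keeps the pages
                 * after munmap() */
                posix_madvise(p, st.st_size, POSIX_MADV_WILLNEED);
                munmap(p, st.st_size);
            }
        }
        close(fd);
    }
    return 0;
}

Run it against the binaries and libraries you care about (e.g. "pagein /usr/bin/firefox") to warm the page cache for their text; it does nothing for anonymous pages in swap, which is the distinction drawn in the reply below.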

OpenOffice comes with a "pagein" utility.

Posted Jul 26, 2007 0:39 UTC (Thu) by corbet (editor, #1) [Link]

Interesting. It's doing something different, though: it's indiscriminately bringing in pages from the program's text area. I can't see how it could do anything about process anonymous pages which have been shoved into swap. Somebody who really wanted to enjoy their coffee hot would want to use both pagein (for text) and prefetch (for data) - and find some currently unknown way to balance the two.

IWFM

Posted Jul 26, 2007 16:13 UTC (Thu) by dmarti (subscriber, #11625) [Link]

Subjectively, for me, running that shell function seems to get rid of the lag on alt-tab to Firefox, Gimp, or OpenOffice first thing in the morning.

What would be better, for a desktop system, would be a "goodnight" script that would run housekeeping tasks such as updatedb and logrotate, then page back in whatever the user had been working on, then suspend.
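A rough sketch of that flow, with everything assumed (the housekeeping commands, the file list for the pagein-style tool sketched earlier in this thread, and writing "mem" to /sys/power/state to suspend are all placeholders for one particular setup):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 1. run the housekeeping that would otherwise evict everything */
    system("updatedb");
    system("logrotate /etc/logrotate.conf");

    /* 2. fault the user's working set back in; "pagein" is the sketch
     *    above, and the file list is purely illustrative */
    system("pagein /usr/bin/firefox /usr/lib/firefox/libxul.so");

    /* 3. suspend to RAM until morning */
    FILE *f = fopen("/sys/power/state", "w");
    if (f) {
        fputs("mem", f);
        fclose(f);
    }
    return 0;
}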

explicit pagein

Posted Jul 27, 2007 22:12 UTC (Fri) by giraffedata (guest, #1954) [Link]

To avoid the manual step (either good morning or good night), I run this in a cron job every hour all day. It doesn't matter that the interactive programs get paged in uselessly several times during the night; even if done right in the middle of updatedb, they just get paged right out again and the cost is insignificant.

I also send messages through some response-time-critical daemons regularly, to "loosen them up." I.e. page in executables, anonymous pages, etc.

Still waiting for swap prefetch

Posted Jul 27, 2007 0:05 UTC (Fri) by jschrod (subscriber, #1646) [Link]

Then you will be surprised.

When you get 2GB (I have that much), just start Eclipse on a larger project while Firefox and Thunderbird are running, and watch your memory get used. And that's without even starting a VM (Xen or VMware)...

Still waiting for swap prefetch

Posted Jul 25, 2007 23:50 UTC (Wed) by kune (guest, #172) [Link]

If Firefox is really faster on Windows than on Linux, then it should be easy to hack some benchmarks in Javascript to prove it. Nobody will be able to fix it if they cannot measure the performance. And Internet bandwidth has to be taken out of the equation.

Sure, you can write bad code in any language, but in my experience Python is not particularly slower than other script languages. Here is a benchmark that appears to validate that experience: http://www.timestretch.com/FractalBenchmark.html Another one is here: http://acker3.ath.cx/wordpress/archives/7

Lua might be something to look at.

Of course you will always be able to write faster code in C, but this will take you some more time.

Still waiting for swap prefetch

Posted Jul 25, 2007 23:53 UTC (Wed) by JoeBuck (subscriber, #2330) [Link]

Besides, for a GUI program "written in Python" all the expensive code is actually in libraries that are written in C or C++, with a thin binding to allow calls from Python. yum isn't slow because of Python, but because it does way too many system calls and I/O.

Still waiting for swap prefetch

Posted Jul 26, 2007 0:09 UTC (Thu) by briangmaddox (guest, #39279) [Link]

"Of course you will always be able to write faster code in C, but this will take you some more time."

Ya had me until you said this. Why stop at C, when it could be written in ASM? And heck, how do you know the assembler will generate fast code, better do it in hex instead.

I would have thought that after all these years people would know more about computer science and programming than to troll the "C is always faster than everything else" line.

Still waiting for swap prefetch

Posted Jul 26, 2007 5:15 UTC (Thu) by wilreichert (guest, #17680) [Link]

"Ya had me until you said this. Why stop at C, when it could be written in ASM? And heck, how do you know the assembler will generate fast code, better do it in hex instead."

Hex? No thanks, i prefer to hack the macro assembler and control the logic gates on my cpu directly.

Still waiting for swap prefetch

Posted Jul 26, 2007 9:46 UTC (Thu) by nix (subscriber, #2304) [Link]

The thing to do is to generate the assembler and then munge it with a horrible perl script.

(Hey, ghc does it, it must be good! :) )

Still waiting for swap prefetch

Posted Jul 26, 2007 12:04 UTC (Thu) by dcoutts (subscriber, #5387) [Link]

Shh! You're not supposed to tell people about that, it's called the Evil Mangler for a good reason.

Still waiting for swap prefetch

Posted Jul 26, 2007 13:51 UTC (Thu) by nix (subscriber, #2304) [Link]

Having just watched it use fifteen minutes of CPU time (on the ghc lexer, natch) I think it needs bringing into the light so it can be optimized by some hardier soul than I :)

program speed vs programming language

Posted Jul 27, 2007 23:06 UTC (Fri) by giraffedata (guest, #1954) [Link]

Why stop at C, when it could be written in ASM?

That's not a natural progression. Code compiled from C is often faster than that compiled from assembly language, for the same reason that a computer can land an airplane more smoothly than a human. Even code compiled from C by a naive compiler (e.g. gcc -O0) is unlikely to be slower than code compiled from assembly language. C is that low-level a language.

how do you know the assembler will generate fast code, better do it in hex instead

We do know that. The assembler will generate code that is not only the same speed as that generated by the hex editor, but is actually the same code. That's the definition of assembly language.

I would have thought that after all these years that people would learn more about computer science and programming than to troll the "C is always faster than everything else" line.

The only line I saw was, "C is always faster than Python." And it is, isn't it?

program speed vs programming language

Posted Jul 31, 2007 12:08 UTC (Tue) by liljencrantz (guest, #28458) [Link]

Depends on how you define your scope. I've seen situations where people solve the same problem in different languages, and because they have to spend so much time to do _anything_ in a low-level language, they are forced to choose a dumb algorithm, whereas people coding in a high-level language can spend more time on the high-level logic and can therefore choose a fast algorithm.

Still waiting for swap prefetch

Posted Jul 26, 2007 9:44 UTC (Thu) by nix (subscriber, #2304) [Link]

Benchmarks in Javascript will mostly show the performance of the JS interpreter and things it can block on, i.e. it's not a complete monitoring tool by any means.

Still waiting for swap prefetch

Posted Jul 26, 2007 7:46 UTC (Thu) by pointwood (guest, #2814) [Link]

"Ubuntu and Red Hat both have this obsession with Python, which is about as bloaty a language as you can get, and then they write crappy apps in Python that would be slow as molasses even in C because they use crappy algorithms."

Could you elaborate on that? I'm not much of a coder, but it would be interesting to hear what the problem with python is since I have considered playing with it.

As for Ubuntu and Red Hat, I bet they are open to improvements to their code, though you don't say which apps you mean. Your comment lacks facts.

Still waiting for swap prefetch

Posted Jul 26, 2007 8:28 UTC (Thu) by rsidd (subscriber, #2582) [Link]

There's nothing wrong with python (or other high-level languages), and it worked fine on linux systems 10 years ago. C is faster only for CPU-bound tasks, which hardly any system tasks are. C may be less memory intensive (it depends on what libraries you're using, what your coding style is, and so on), but with C you have to be careful about all kinds of bugs and security holes (buffer overflows, memory leaks, etc.) that even experts get bitten by and that don't occur with high-level languages.

Still waiting for swap prefetch

Posted Jul 26, 2007 12:23 UTC (Thu) by rwmj (subscriber, #5474) [Link]

Oh dear no, there's _plenty_ wrong with Python. Its dynamic typing nature means that simple objects have huge amounts of baggage that they have to carry around, mostly never used.

C sucks for programming too, for the reasons you outline.

But guess what folks! C and Python are not the only programming languages in the world!! You won't believe it, but other programming languages have been invented.

My personal fave at the moment is OCaml. About as fast as C, statically typed, no buffer overflows, small memory footprint, access to Perl & Python libraries, and loads of native libs.

Rich.

Still waiting for swap prefetch

Posted Jul 27, 2007 11:56 UTC (Fri) by IkeTo (subscriber, #2122) [Link]

> C and Python are not the only programming languages in the world!! You won't believe it, but other programming languages have been invented.

> My personal fave at the moment is OCaml. About as fast as C, statically typed, no buffer overflows, small memory footprint, access to Perl & Python libraries, and loads of native libs.

Two suggestions. Suggestion 1: start lobbying people all around to start using OCaml: universities, companies, etc. If you are successful you have a bunch of people who *know* what it is (currently the people with that characteristic are so few that it doesn't matter). Or choose suggestion 2: start implementing some *real cool* thing in OCaml, making sure that developing the equivalent thing (e.g., with the same performance, flexibility, etc.) is so expensive that nobody can do it *because they chose a different language*. Then you serve as an example showing others the real benefit of the language.

Before you do so, accept that languages like C and Python are the ones known by the people who work on creating applications. The problem is, money talks. If the industry needs C and Python, that's what university courses teach, that's what people know, and that's what applications will be written in.

Still waiting for swap prefetch

Posted Aug 2, 2007 7:40 UTC (Thu) by renox (guest, #23785) [Link]

My problem with Ocaml is its syntax and its functional mindset: I tried once to learn Ocaml and I disliked the syntax, plus the PDF book I used insisted on solving everything in a functional way, which is strange as Ocaml is said to support both imperative and functional styles. Why the book insisted so much on the functional style is beyond me. Blech.

So to be successful, Ocaml would need 2 things:
1) Replace the current default syntax with a better one. There is already an alternative syntax for Ocaml (so apparently I'm not the only one who doesn't like the default syntax); it's quite a bit better, and the syntax of F# (an Ocaml clone for .Net) looks even better.
2) Improve tutorial books to teach both imperative style and functional style, without such a blatant bias towards the functional style; it has its place, but not for everything.

Somehow I doubt that will happen, so Ocaml is bound to stay out of the limelight.

Still waiting for swap prefetch

Posted Jul 26, 2007 15:15 UTC (Thu) by arjan (subscriber, #36785) [Link]

> There's nothing wrong with python

... except that even a simple "hello world" seems to take 40 megabytes of memory. It's not a few cpu cycles that kill you in performance; it's the enormous overhead that even simple programs get....

Still waiting for swap prefetch

Posted Jul 26, 2007 17:11 UTC (Thu) by vmole (guest, #111) [Link]

That is so much crap.
$ cat hello.py
import time
print "hello, world"
time.sleep(30)

% python hello.py &
[1] 25243
hello, world
$ ps l -C python
F   UID   PID  PPID PRI  NI    VSZ   RSS WCHAN  STAT TTY        TIME COMMAND
0  1000 25446  2688  17   0   3844  2256 -      S    pts/0      0:00 python hello.py

So that's 2M resident, 4M virtual. And it exaggerates a simple "hello world" program, because I had to import the time module for sleep() so I could run ps. Most of it is the interpreter itself, so multiple python programs don't actually take 4M each.

Python isn't perfect - what language is? But don't make stuff up, it just ruins your argument.

Still waiting for swap prefetch

Posted Jul 26, 2007 19:01 UTC (Thu) by nevyn (subscriber, #33129) [Link]

Let's try longer than 30 secs, eh?

% ps ax -o cmd:50,vsz,rss,stime | egrep '[p]ython'
/usr/bin/python -E /usr/sbin/setroubleshootd       808024 12104 Jul06
python /usr/share/system-config-printer/applet.py  232764  6436 Jul06
/usr/bin/python -E /usr/bin/sealert -s             411728 18524 Jul06
python /usr/libexec/revelation-applet --oaf-activa 458328 55056 Jul06
python /usr/lib64/gdesklets/gdesklets-daemon       453980 65768 Jul23
python ./hpssd.py                                  171220  1176 Jul11

...that's 2_536_044 KB VSZ and 159_064 KB RSS, and as you can see I've rebooted gdesklets recently (it was roughly double that, I think).

Plus I'm not running puppet/yum-updatesd atm. And for instance "revelation-applet" is just a text entry box 95% of the time.

I appreciate that the huge VSZ numbers are (hopefully) mostly unused shared libs. etc. but half a GB is still a lot for the OS to manage for a text box, and 40+MB of RSS for a simple GUI is far from "so much crap".

For comparison my webserver uses two processes with a VSZ of about 12MB each and RSS of about 1MB each, and I'd prefer that to be smaller.

Still waiting for swap prefetch

Posted Jul 26, 2007 19:24 UTC (Thu) by vmole (guest, #111) [Link]

You said "a simple hello world" program, and that's what I tested. Comparing GUI programs to a webserver is irrelevant. What you've mostly demonstrated is that GNOME/GTK is huge. Blaming that on Python doesn't seem to make sense. I'd bet a Perl implementation of the same programs would be equally huge.

Still waiting for swap prefetch

Posted Jul 26, 2007 9:37 UTC (Thu) by nix (subscriber, #2304) [Link]

The page cache is not an `unnecessary but nice performance enhancement'. The text pages of all your binaries are sitting in there while they run, as is all your other mmap()ed stuff. If your page cache was empty you'd not be able to run any userspace code to speak of.

Still waiting for swap prefetch

Posted Jul 26, 2007 13:22 UTC (Thu) by cate (subscriber, #1359) [Link]

_Linux_ has been getting worse, or the shit that people use on their systems is getting worse?

Check the size of a Linux kernel, and you will see that there is a bigger increase (in percent) over 5 years than in the memory used by mozilla. We have more memory, so we can afford to use it more inefficiently.

Still waiting for swap prefetch

Posted Jul 26, 2007 16:19 UTC (Thu) by khim (subscriber, #9252) [Link]

I was barely able to boot Linux on a 2MB system back then. Today I need 8MB for that. Netscape was quite usable on a 12MB Linux system (8MB was too slow); today you can barely even start Firefox on a 48MB system. The huge size of the Linux kernel in real life is mostly page tables, so I think Linux and Firefox grew in unison...

Still waiting for swap prefetch

Posted Jul 30, 2007 8:35 UTC (Mon) by gouyou (guest, #30290) [Link]

We have more memory

Not really: take a look at the OLPC, or at the new development on the embedded device scene (Nokia N770/N800, OpenMoko)...

Still waiting for swap prefetch

Posted Aug 3, 2007 4:04 UTC (Fri) by bersl2 (guest, #34928) [Link]

"We have more memory, so we can afford to use it more inefficiently."

That's badly worded. It would be more correct to say that memory efficiency has been sacrificed for time efficiency.

Still waiting for swap prefetch

Posted Jul 25, 2007 23:56 UTC (Wed) by ronaldcole (guest, #1462) [Link]

Let's clarify... in a GNU/Linux system, the GNU part is getting bigger and slower. I can still happily load and run Debian 4.0 on a DX4/100 with 32M of ram and a 2G hdd. Can't do anything that involves a GUI, though.

Still waiting for swap prefetch

Posted Jul 26, 2007 0:03 UTC (Thu) by tomsi (subscriber, #2306) [Link]

It seems that Linux may still be a little more efficient than Vista on most loads (but not Firefox), but back in the nifty nineties Linux was a _lot_ more efficient than Windows. In short, Linux has been getting worse faster than Windows.

I am not sure I agree with that. I am writing this on a Dell laptop with a 1GHz CPU and 512MB of memory. It is a pain to use on Windows XP, and I wouldn't dare try to put Vista on it. But the latest version of Ubuntu trundles along quite nicely. There are a few applications that struggle, like Eclipse, but they struggled 3 years ago too. Tom

Bloat

Posted Jul 27, 2007 0:16 UTC (Fri) by jmorris42 (guest, #2203) [Link]

> I am writing this on a Dell laptop with a 1GHz CPU and 512MB of memory.

You say that like you think that is a small machine. It shouldn't be, and that is the problem.

I remember running early GNOME-based Red Hat distributions on 300MHz machines with 128MB and it being totally usable. Bumped it to 256MB and could run Win95 under VMware. (Gave Windows 64MB.)

Now look at the pitiful state of bloat we have. Anaconda pretty much requires 512MB now unless you install in text mode. Red Hat won't even support a RHEL machine without a full GB. And I know ps's memory stats aren't all that reliable but look at the memory and CPU needs to bring up a GNOME or KDE desktop and a browser.

And speaking of insane, take a look at conglomerate. I have a test machine loaded up with F7 (Athlon64 2800 with 512MB) and tried using conglomerate to open up the Fedora 7 comps.xml file. After several minutes of constant thrashing went by as the poor machine exhausted all memory and swap, the OOM killer finally put it out of its misery. This is a 0.9 release, so I'm assuming this sort of memory performance isn't considered a show stopper bug. More likely none of the devels (who by definition have big enough machines to compile a hog like Conglomerate) noticed the memory consumption.

I think the problem probably stems from the same source as Windows' bloat. The developers no longer feel the pain. Ten years ago many key developers on every part of the stack from kernel to desktop (browser excluded prior to Mozilla becoming viable...) were private individuals, with a fair number outside the US/western Europe. The upshot of that was that they had hardware closer to a 'typical' machine. Now most of the key developers are using machines bought with OPM, i.e. corporate developer 'workstations' with big honking specs.

Perhaps the key developers should be given a $1,500 budget to buy their workstation with and then given access via ssh to a 'compile host' so they don't have to sit and spin waiting for compiles but DO have to sit and spin waiting on OO.o or Firefox. This would motivate them to care about resource consumption in more realistic desktop environments and not just care how well their quad Opteron monster squeezes the last percentage from all of its CPUs.

Bloat

Posted Jul 29, 2007 21:35 UTC (Sun) by njs (subscriber, #40338) [Link]

>Perhaps the key developers should be given a $1,500 budget to buy their workstation with

Err... you can get 2GB of memory for < $100 these days. From a glance at dell.com, right now $550 low-end desktops come with 1GB of memory default, and $750 ones with 2GB. There are still real cases where memory is limited (embedded devices, OLPC, people living in non first-world countries, ...), but your scale seems a bit miscalibrated.

Bloat

Posted Aug 6, 2007 20:49 UTC (Mon) by happycube (guest, #42855) [Link]

The real problem is with older hardware - it's impossible to get more than 512MB in an i815-based P3, for instance.

Bloat

Posted Aug 9, 2007 2:28 UTC (Thu) by jmorris42 (guest, #2203) [Link]

I really hate this notion that a three year old computer should be tossed in the trash as so obsolete there is no use it can be put to. Linux used to be a good way to get good use out of older hardware. Not anymore. Now you need hardware equal to the minimum Windows baseline - and, before Vista shipped, greater than it.

And just throwing hardware at the problem doesn't make it go away. Having 2GB of RAM will make it livable, but hard drives aren't getting all that much faster. Paging in enough of OO.o and all the libraries it needs to get to mapping the initial window means looking at a throbber almost as long on a hot new monster PC as on an older one. The same goes for all the disk thrashing involved in logon as multi-megabyte blobs of libraries and executables are mapped in to provide what should be small crap like battery indicators and CPU speed monitor widgets in the menu bar.

Having more resources is no excuse for sloppy and wasteful practices. And if we want our stuff to be an option for the coming world of smart phones, flash based laptops (without swap) and the embedded world we need to be thinking about getting our act together now.

non-sequitur

Posted Jul 26, 2007 1:14 UTC (Thu) by qu1j0t3 (guest, #25786) [Link]

Memory 'needs' increase exponentially in a Moore's-Law-like process that is entirely unrelated to Linux. You can still run Linux on tiny systems (I know people running 2.6 on systems smaller than 32MB), which is not true of any recent version of Windows.

Your conclusion simply does not follow from Nick Piggin's quite reasonable postulate.

Still waiting for swap prefetch

Posted Jul 26, 2007 1:56 UTC (Thu) by sobdk (guest, #38278) [Link]

Now I've been programming since magnetic drums were hip so I may be a bit confused here but it seems to me that I can remember a time less than a decade ago when a system didn't need a couple of gigs to run Linux well.

I would say it still does. I'm writing this as we speak from a PIII 700 MHz with 256 MB of RAM and, honestly, Linux runs great. I even use Firefox, which tends to eat up 50% of my RAM according to top, but still no complaints. In fact, even with Firefox using half my RAM I'm still not even touching swap.

Interestingly enough, when I read some of the posts on lkml claiming that they had 1GB of RAM and swap prefetch drastically improved their workload, all I could think was "what on earth are these people running?" Until a couple of months ago my dev machine at work only had 512 MB of RAM and had all kinds of nasty things like Firefox, beagle, and Lotus Notes running, but my swap was rarely used. Conversely, on days when I was forced to boot it into Windows XP for one reason or another, I couldn't leave an application minimized for more than 10 minutes before Windows decided to swap it out. Waiting 1 minute every time I switch apps on Windows is enough to make me go crazy and be thankful that Linux has a sane swapping algorithm.

Still waiting for swap prefetch

Posted Jul 26, 2007 2:41 UTC (Thu) by Richard_J_Neill (guest, #23093) [Link]

Hmm - I've just upgraded to a 64-bit desktop machine. The principal reason is so that firefox can address > 4GB of swap (!). Really, FF crashes on me about once a week, because although I have 2GB RAM and 6GB of swap, firefox manages to malloc() 4 GB! No idea where it is going - admittedly I tend to have about 200 tabs open, but that alone shouldn't be the problem.

Still waiting for swap prefetch

Posted Jul 26, 2007 14:51 UTC (Thu) by lysse (guest, #3190) [Link]

Well, 200 tabs at about 2Mb per tab would use up your 4Gb quite effectively...

Off by an order of magnitude?

Posted Jul 26, 2007 20:29 UTC (Thu) by martinfick (subscriber, #4455) [Link]

200 * 2MB = 400MB != 4GB

Off by an order of magnitude?

Posted Jul 27, 2007 6:04 UTC (Fri) by man_ls (guest, #15091) [Link]

Let's make it 20 MB then per tab. If each page is three screens high, and my screen is 1680x1050, with 4 bytes per pixel (3 colors + alpha):

1680 x 1050 (pixels/screen) x 4 (bytes/pixel) x 3 (screens) = 20672 MB.

We do want the pages to be cached when we switch tabs, don't we?

Off by three orders of magnitude!

Posted Jul 27, 2007 6:06 UTC (Fri) by man_ls (guest, #15091) [Link]

That should read 20672 KB, sorry.

Off by an order of magnitude?

Posted Jul 27, 2007 14:36 UTC (Fri) by Los__D (guest, #15263) [Link]

Yeah we do, but does it really cache a bitmap of the page? That would seem a bit silly.

Bitmap cache

Posted Jul 27, 2007 21:57 UTC (Fri) by man_ls (guest, #15091) [Link]

Yeah we do, but does it really cache a bitmap of the page?
I don't know exactly, but I would say it does. Loading a couple of big JPEG files takes quite a bit longer than changing tabs between them.
That would seem a bit silly
Why, exactly? If you have the RAM to spare, it seems to be as good as any other use. I seem to remember some discussion on LWN about Firefox caching even pages in the history.

When people complain that "Firefox eats up 2 GB" on a 4 GB machine, it gives the wrong impression. Firefox runs fine on my 128 MB laptop, and memory seldom goes above 80 MB.

Bitmap cache

Posted Jul 28, 2007 13:38 UTC (Sat) by Los__D (guest, #15263) [Link]

I would never (never EVER) suggest using JPEG for computer graphics, but caching in PNG would seem smarter than a bitmap, and not really that much slower. Of course, when you only have 3-4 tabs, it wouldn't really matter, but as the count goes up, you can much better afford the CPU than the memory. Plus, the CPU cost is only paid when you change tabs; the memory usage is constant.

I'm not completely sure how uncompressed PNG would handle the JPEGs on the page, but I couldn't imagine it getting worse (memory wise) than a bitmap, just equal. (I've been known to be naive in my expectations of formats, though.)

The browser (toolkit?) could of course also do some kind of checking of available memory, and change algorithm from that, but maybe that is an unneeded complexity.

I don't know, maybe it's easiest/simplest and mostly correct to just keep the rendered pages as is in memory, but for some usage patterns, it does become a huge waste of memory.

Bitmap cache

Posted Jul 28, 2007 13:51 UTC (Sat) by man_ls (guest, #15091) [Link]

I would never (never EVER) suggest using JPEG for computer graphics
I wouldn't either. My little experiment with JPEG images involved decompressing a JPEG image vs. caching an uncompressed bitmap. Caching images in a lossy format would be ludicrous.

But caching in a lossless format such as PNG isn't such a good idea either. An important aspect of caches is that you should store an artifact which you already have, not one you have to generate. If you have to compress a bitmap to PNG before caching it, you are wasting a lot of CPU time just to generate a cache which you might as well never use again.

An example: you download a JPEG image, then uncompress it to show it, and then you compress it to PNG before caching it in memory. Messy.

Bitmap cache

Posted Jul 28, 2007 14:00 UTC (Sat) by Los__D (guest, #15263) [Link]

It really depends on the cost of CPU vs. the cost of memory.

For most usage patterns, I agree, but for those who like to have 30+ tabs open (for some reason), caching the uncompressed bitmap is a _big_ waste.

The question is whether the best approach wouldn't be to just forget or limit those caches for that kind of usage. At least the bitmap approach is asking for disk thrashing.

Bitmap cache

Posted Oct 9, 2007 18:49 UTC (Tue) by Blaisorblade (guest, #25465) [Link]

A while ago I read a comment from an OLPC developer on exactly this issue; this is a summary of his comments as I recall them.
Basically, an application can store an uncompressed bitmap in the X server. And Firefox caches all images this way! That's the problem. Probably avoiding the caching would be even faster (due to no swapping).

IMHO (and this is my opinion; I don't recall his) a reasonable solution would be to extend the X protocol to allow caching compressed images in the most reasonable formats (especially JPEG). Then the X server can keep a cache of the most-shown decoded images.

Since the JPEG compression ratio is huge, there would be a huge gain from this.

Off by an order of magnitude?

Posted Aug 9, 2007 1:51 UTC (Thu) by lysse (guest, #3190) [Link]

Quite right. Sorry. Oops.

Still waiting for swap prefetch

Posted Jul 28, 2007 3:43 UTC (Sat) by roelofs (guest, #2599) [Link]

Really, FF crashes on me about once a week, because although I have 2GB RAM and 6GB of swap, firefox manages to malloc() 4 GB!

Geez, what are you doing to the poor thing??

No idea where it is going - admittedly I tend to have about 200 tabs open, but that alone shouldn't be the problem.

OK, that might do it... I have two FF windows containing a total of 9 tabs, and it's been running since early March. Total memory usage (per ps(1)): 197M allocated, 157M resident. (Of course, there's also X11 pixmaps, as was noted in another LWN article a while back. I've forgotten how to check that, but even if FF were the sole X client--which it's not, by far--that's only another 194M allocated and 46M resident.)

Greg

Still waiting for swap prefetch

Posted Aug 1, 2007 2:23 UTC (Wed) by set (guest, #4788) [Link]

xrestop will show you the X memory usage, such as pixmaps.

Still waiting for swap prefetch

Posted Aug 1, 2007 16:43 UTC (Wed) by zlynx (guest, #2285) [Link]

One of these days someone needs to add the amount of CPU time X spends per client to xrestop.

Still waiting for swap prefetch

Posted Aug 3, 2007 23:45 UTC (Fri) by roelofs (guest, #2599) [Link]

xrestop will show you the X memory usage, such as pixmaps.

Danke, sir! That's the command I couldn't recall.

Greg

Still waiting for swap prefetch

Posted Jul 26, 2007 2:14 UTC (Thu) by h2 (guest, #27965) [Link]

>I can remember a time less than a decade ago when a system didn't need a couple of gigs to run Linux well.

Don't worry, it's still fine: running a 2.6.22.1 kernel on old hardware, with a light desktop (xfce4) and lots of little server things - nfs, vnc, streaming, etc. - the system is using... 90 MB of RAM, out of 192 MB total. I originally had 512 in it but it never got remotely close to using it, so I pulled it out and put back in small sticks I had lying around. Once in a while it uses swap, a tiny bit, but not often.

For a full desktop like kde or gnome, 256 MB is what you need to avoid swapping.

And more for dev work, vmware, heavy graphics, and so on. Just because you put in 2 gigs doesn't mean you use them; this box has 2 and it's using 3/4 of a gig with 50 or so apps open on 8 virtual desktops. But fire up a vmware or vbox install or two and you're getting closer to 2 gigs.

Still waiting for swap prefetch

Posted Jul 26, 2007 2:21 UTC (Thu) by butlerm (subscriber, #13312) [Link]

Isn't this precisely the sort of thing that should be controlled from user space? i.e. let the kernel provide the mechanism and let a user space daemon implement the policy?

For example, couldn't one create a virtual file called /proc/<pid>/swapin that when written to, fetches all the swapped data for that process back into ram?

The advantage, of course, would be that no existing users would be affected, and the optimal swap prefetch policy could be adjusted according to the application.
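To make the proposal concrete, here is a sketch of how a userspace policy daemon might drive such an interface. To be clear, /proc/<pid>/swapin does not exist; the path and its write-to-trigger semantics are assumptions taken from the suggestion above.

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* hypothetical: any write to /proc/<pid>/swapin triggers the fetch */
static int swapin(pid_t pid)
{
    char path[64];
    FILE *f;

    snprintf(path, sizeof(path), "/proc/%d/swapin", (int)pid);
    f = fopen(path, "w");
    if (!f)
        return -1;
    fputs("1", f);
    return fclose(f);
}

int main(void)
{
    /* a real daemon would watch for idle time and pick the pids of the
     * interactive session; prefetching our own session leader is just
     * a placeholder */
    return swapin(getsid(0)) == 0 ? 0 : 1;
}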

Still waiting for swap prefetch

Posted Jul 26, 2007 6:32 UTC (Thu) by sitaram (guest, #5959) [Link]

well I already do something like that :-)

All my cron jobs finish by around 7am, so at about 8am (I usually come in to work around 8:15), cron runs "swapoff -a; swapon -a".

Seems to be OK. I guess one day I will leave FF running with a gazillion tabs and it will get OOM-killed on the swapoff command, but it hasn't happened yet...

Still waiting for swap prefetch

Posted Jul 26, 2007 8:41 UTC (Thu) by tialaramex (subscriber, #21167) [Link]

What you've described isn't quite the Right Thing™ though.

Eliminating swap with swapoff will force the kernel to bring back anonymous pages that had been pushed out, but not pages which are associated with existing storage, such as (almost all) program text, and mmap'd data files.

It's obviously possible to arrange for the kernel to notice that there's a lot of free RAM and a lot of pages, anonymous or otherwise, that aren't in RAM, and fix that. Unfortunately merely having a good idea isn't enough in Free Software, and in the kernel it's not even enough to have a good idea and an implementation.

One worry which swap prefetch people might be able to reassure me on - does it understand that if I have 2GB of as-yet unused RAM and a fresh 2GB mmap() I don't necessarily want it to try to pull all those pages off disk? I.e. does it know the difference between pages that have been pushed out and those which were simply never needed yet?
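As an aside on that last question: mincore(2) cannot make exactly that distinction, but it does show userspace the raw residency picture. In this sketch a fresh anonymous mapping simply reports as not resident, with nothing behind it to pull in; only the touched page is resident.

#define _DEFAULT_SOURCE     /* for mincore() and MAP_ANONYMOUS */
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    size_t pagesz = sysconf(_SC_PAGESIZE);
    size_t len = 16 * pagesz;
    unsigned char vec[16];

    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    p[0] = 1;                     /* touch only the first page */

    if (mincore(p, len, vec) == 0)  /* one byte per page; bit 0 = resident */
        for (int i = 0; i < 16; i++)
            printf("page %2d: %s\n", i,
                   (vec[i] & 1) ? "resident" : "not resident");
    munmap(p, len);
    return 0;
}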

Still waiting for swap prefetch

Posted Jul 29, 2007 23:27 UTC (Sun) by dlang (guest, #313) [Link]

But if the system pulls the pages off disk while it is idle, and puts those pages at the end of the LRU list so that they are the first thing thrown away if you need more memory, how does it hurt you to pull them in and never use them?

Still waiting for swap prefetch

Posted Jul 27, 2007 12:52 UTC (Fri) by IkeTo (subscriber, #2122) [Link]

> The advantage, of course, would be that no existing users would be
> affected, and the optimal swap prefetch policy could be adjusted according
> to the application.

The disadvantages are, of course, that (1) few programs will use that interface, since it is too system dependent, (2) few programs will use that interface at the time it matters, since developers usually cannot know exactly when their program will need a lot of memory (who knows that a crazy user will create ten thousand bookmarks in his browser and call "expand all"...), and (3) a large program with a VM footprint even larger than the available RAM will fail (or cause the whole system to run very slowly) the moment it calls that interface - and, most unluckily, only on some small system that the developer doesn't have access to.

RDSL and ignoring feedback

Posted Jul 26, 2007 2:38 UTC (Thu) by zlynx (guest, #2285) [Link]

But some users were reporting real regressions with RSDL and were being told that those regressions were to be expected and would not be fixed. This behavior soured Linus on RSDL and set the stage for Ingo Molnar's CFS scheduler. Some (not all) people are convinced that Con's scheduler was the better design, but refusal to engage with negative feedback doomed the whole exercise.

I was following most of that on LKML as it happened, and the way that I saw it was that a guy testing RSDL was reporting the fact that his X server now got 25% CPU instead of 75% as a regression.

Con did respond. He said the scheduler was fair, that was the design, and that he (the tester) could renice X to -10 or -15 if he wished.

I don't see how else Con could respond to that. RSDL was supposed to be fair. Giving X 75% isn't fair. There's just no way to resolve those two things.

RDSL and ignoring feedback

Posted Jul 26, 2007 8:29 UTC (Thu) by jospoortvliet (guest, #33164) [Link]

Indeed. Imho Corbet should've mentioned this properly - Con did get negative comments, but those were entirely silly. Complaining about the fairness of a fair scheduler???

He might also have mentioned that, by design, it is extremely unlikely to find negative consequences of using swap-prefetch (which explains why there are no real bug reports), and that the whole updatedb issue might be solved by other means, but that doesn't go for all the things swap-prefetch helps.

For example, start OO.o on a low-mem machine, work in it, close it. Now you've got 60 MB of free RAM. Swap prefetch will start filling that with your swapped-out pages pretty quickly, and if those pages are firefox or some other app (which is likely, as swap-prefetch starts with the most important data), you're pretty happy. There is NO other way of doing this than swap-prefetch.

I think the main reason swap prefetch didn't make it in is that those who have to decide over it have very very bulky hardware, and their employers are very afraid it MIGHT in some weird way influence the 1024-cpu linux deployments, and they don't care about the desktop at all.

RDSL and ignoring feedback

Posted Jul 26, 2007 13:35 UTC (Thu) by corbet (editor, #1) [Link]

Indeed. Imho Corbet should've mentioned this properly - Con did get negative comments, but those were entirely silly. Complaining about the fairness of a fair scheduler???

That's just the sort of approach which created trouble for SD/RSDL. If people see regressions with their workloads, stamping a "100% certified fair!" label on it will not make them feel better about it. You have to address these problems; if you are unwilling to do so, your code will not make it into the kernel.

CFS is also a "fair" scheduler, but it has not drawn the same sort of complaints - though it will be interesting to see what happens as the testing community gets larger. As I understand it, the CFS brand of "fairness" takes a longer-term view, allowing tasks to get their "fair" share even if they sleep from time to time. That helps to prevent the sort of regressions seen with SD.

The real key, though, is what happens when things go wrong. There will certainly be people reporting scheduler issues over the 2.6.23 cycle. Ingo and the other CFS hackers could certainly dismiss them as "entirely silly," seeing as the scheduler is "completely fair," after all. But they won't do that. Instead, they will do their best to understand and solve the problems. That is why CFS is in the kernel, and SD is not.

RDSL and ignoring feedback

Posted Jul 26, 2007 14:40 UTC (Thu) by nevyn (subscriber, #33129) [Link]

That is a very bad analogy. For a "fair" scheduler, change requests along the lines of "app X does Y and gets an unfair amount of CPU" should be dealt with promptly. Change requests like "app X does Y, and I want it to get an unfair amount of CPU" should be solved another way. We didn't drop SELinux because random uid=0 processes could no longer re-write my /etc/shadow file, because that was a desired result.

Looking at it another way: if you create holes in the algorithm so that Xorg can walk through them, then so can any other piece of code. The correct fix was pointed out: if Xorg needs more CPU time than "normal", then just tell the kernel that using the interface specially provided for that purpose (i.e. nice).

IMNSHO CFS got into the kernel because Ingo wrote it, and basically no other reason. That's not entirely a bad thing, but it's not entirely a good thing either and doing "unbiased" reporting pretending otherwise isn't helping anyone.

RDSL and ignoring feedback

Posted Jul 26, 2007 17:38 UTC (Thu) by erwbgy (subscriber, #4104) [Link]

"unbiased" reporting pretending otherwise isn't helping anyone.

Of course it is not unbiased reporting - we pay good money for Jon's opinion. That's what an editorial is. Jon explained quite clearly, and even-handedly in my opinion, why he came to the conclusion he did. Agree with him or not (and I certainly don't always), but criticising his reporting because you don't agree with his conclusions is not helping anyone.

RDSL and ignoring feedback

Posted Jul 26, 2007 18:26 UTC (Thu) by nevyn (subscriber, #33129) [Link]

Maybe that sounded worse than I wanted it to. I didn't mean that Jon was biased against my viewpoint; quite the opposite - it seemed more like he'd tried to "present both sides of an argument" instead of just saying what he thought was right.

But maybe he really did/does believe that the "reported problem" was a regression and needed fixing, at which point I guess I'll just have to disagree.

RDSL and ignoring feedback

Posted Jul 27, 2007 21:41 UTC (Fri) by jospoortvliet (guest, #33164) [Link]

Indeed. To be honest, I don't get why he restated that. I mean, it's lovely to try and fix things if they come up, but this was and is impossible to fix. And I'm sure Ingo won't try to fix it either. If someone complains X gets not enough CPU (because it gets 1/10th if 10 heavy processes are running), he will tell the person to renice X. Just like Con did. After all, it's what a fair scheduler does.

Someone complaining about it simply doesn't understand it - a fair scheduler WILL lead to regressions. No way around it, period. It's unfair to attack Con on this one, imho, and again - I really don't see how Corbet can say these things.

BTW, that's not to say Corbet is stupid or anything negative; I just think he's wrong here. Or there is something I utterly do NOT understand (and if that's the case, I hope he can explain).

RDSL and ignoring feedback

Posted Jul 27, 2007 21:51 UTC (Fri) by corbet (editor, #1) [Link]

If somebody's workload works on 2.6.N, but fails on 2.6.N+1, it's a regression. It doesn't matter if life is better for a lot of other folks, or whether you call it "fair," it's still a regression. Regressions are bad news. And yes, CFS does do a better job with that sort of workload.

RDSL and ignoring feedback

Posted Jul 27, 2007 22:21 UTC (Fri) by zlynx (guest, #2285) [Link]

It may technically be a regression for that user, but if a change improves things for more other users than it hurts, I call that progress.

It may be a case of two steps forward, one step back, but still progress and a good thing, not bad news.

Regressions and progress

Posted Jul 27, 2007 22:41 UTC (Fri) by corbet (editor, #1) [Link]

Here's a message from Linus from a couple of weeks ago; I had considered it for the quote of the week:
So we don't fix bugs by introducing new problems. That way lies madness, and nobody ever knows if you actually make any real progress at all. Is it two steps forwards, one step back, or one step forward and two steps back? Different people will give different answers.

That's why regressions are _so_ much more important than new bugfixes. Because it's much more important to make slow but _steady_ progress, and have people know things improve (or at least not "deprove"). We don't want any kind of "brownian motion development".

Regressions and progress

Posted Jul 27, 2007 23:37 UTC (Fri) by zlynx (guest, #2285) [Link]

Linus' statement says regressions are much more important. But that doesn't specify how much more important. And when the benefits of the change build up, they override how important a regression is.

Ingo's scheduler does not work as well as Con's on many 3D applications, like games. That's a regression from Con's work. Ingo's doesn't always schedule games as well as mainline either (Transgaming went to some work to make Cedega share work between processes and sleep enough to fake out the 2.6 scheduler).

If I decide to complain about the regression (which I won't) should Linus hold up merging CFS until Ingo can meet my demand that the new thing work just like the old thing? (Renicing Cedega is just *too* hard! Poor me!)

Another example: I've complained that 4K stacks aren't ready to be the default (stacking enough device mappers crashes), but I strongly suspect that the kernel is going to change the default to 4K no matter what I think.

Regressions and progress

Posted Oct 9, 2007 21:57 UTC (Tue) by Blaisorblade (guest, #25465) [Link]

This Linus quote is also from a couple of years ago, about drivers (when somebody fixed ACPI and broke suspend for most stuff). And he's a maintainer, and he must have this opinion (or the community should replace him).

What matters is "how deep do you need to stack DM" (is it a real problem) and "4k is a default (other choices are kept, so *you* will make the 8k choice, which is mostly worse)".

That has been a known problem for years, and the device mapper can be changed to be non-recursive; see the LWN articles about the change in link resolution from recursive to iterative to understand what I mean - it's the same kind of fix. Technically, it is a tail-call optimization to reduce stack depth.
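
To illustrate the transformation (a minimal, made-up sketch - not actual device-mapper or VFS code, and all names here are invented): a recursive walk down a chain of stacked layers burns one stack frame per layer, while rewriting the same tail call as a loop runs in constant stack, no matter how deeply the devices are stacked.

 #include <stdio.h>

 struct layer {
         struct layer *below;    /* next device down, NULL at the bottom */
 };

 /* recursive walk: one stack frame per layer */
 static struct layer *bottom_recursive(struct layer *l)
 {
         if (!l->below)
                 return l;
         return bottom_recursive(l->below);     /* the tail call */
 }

 /* the tail call turned into a loop: constant stack usage */
 static struct layer *bottom_iterative(struct layer *l)
 {
         while (l->below)
                 l = l->below;
         return l;
 }

 int main(void)
 {
         struct layer c = { NULL }, b = { &c }, a = { &b };

         printf("both walks agree: %d\n",
                bottom_recursive(&a) == bottom_iterative(&a));
         return 0;
 }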

After reading the discussion of Con's community management, and thinking about Reiser4, I think that Linux is not about politics but about communities - or rather, social networks of developers, their influence, and the filtering the community does (that's a lot of academic buzzwords).

That said, it's known that many problems (including VM and, I think, scheduling) are computationally hard, so whichever solution you choose will have weak points (in some cases that's a theorem - you can prove that for each lossless compressor there is a file the compressor expands). The point is how hard the regression is.
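
(The counting argument, for the curious - an illustrative stand-alone program, not from any of the projects discussed here: a lossless compressor must be injective, and there are 2^n bit-strings of length n but only 2^n - 1 strings strictly shorter than n, so no compressor can shrink every input.)

 #include <stdio.h>

 int main(void)
 {
         for (int n = 1; n <= 8; n++) {
                 unsigned long inputs  = 1UL << n;        /* 2^n strings of length n */
                 unsigned long shorter = (1UL << n) - 1;  /* 2^0 + ... + 2^(n-1) */

                 printf("n=%d: %lu inputs, only %lu possible shorter outputs\n",
                        n, inputs, shorter);
         }
         return 0;
 }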

Distributions had to be fixed not to renice X to -10 (as was usual for 2.4) when 2.6 came out. A stable kernel cannot require such a big change. I can fix my X startup script (well, working my way through X startup is not fun, even for an experienced Linux developer like me), but the average Ubuntu user cannot (I know tens of such users, switching from Windows because Vista sucks and Linux had Beryl - they are good physics students).

RDSL and ignoring feedback

Posted Jul 28, 2007 12:08 UTC (Sat) by jospoortvliet (guest, #33164)

I find that hard to believe if we're talking about the same person
complaining. His problem can never be fixed by CFS, unless CFS would
automatically renice his X, or introduce unfairness in some
other way.

CFS will cause regressions, because it doesn't do unfair scheduling -
which is what users have come to expect. There is no way around it.

Besides, CFS does worse on 3D gaming than SD and mainline, and people
will complain about that as well.

Note that I'm happy CFS got into mainline; as far as I can tell, it has a
superior design. It's just that the stated reasoning for the choice
doesn't work for me...

Maybe this is worth reading, if you didn't already.
http://osnews.com/story.php/18350/Linus-On-CFS-vs.-SD
(don't forget the OTHER SIDE of the story ;-) )

CFS and SD internals, design

Posted Jul 26, 2007 15:08 UTC (Thu) by mingo (subscriber, #31122)

As I understand it, the CFS brand of "fairness" takes a longer-term view, allowing tasks to get their "fair" share even if they sleep from time to time.

Correct, and i call this concept "sleeper fairness".

The simplest way to describe it is via a specific example: on my box, if i run glxgears, it uses exactly 50% of CPU time. If i boot into the SD scheduler and start a CPU hog in parallel to the glxgears task, the two tasks share the CPU: the CPU hog will get ~60% of CPU time, glxgears will get ~40% of CPU time. If i boot CFS, both tasks will get exactly 50% of CPU time.

I've described this mechanism and other internal details in another thread already, but i think it makes sense to paste that reply here too:

wait_runtime is a scheduler-internal metric that shows how much out-of-balance this task's execution history is compared to what execution time it could get on a "perfect, ideal multi-tasking CPU". So if wait_runtime gets negative that means it has spent more time on the CPU than it should have. If wait_runtime gets positive that means it has spent less time than it "should have". CFS sorts tasks in an rbtree with this value as a key and uses this value to choose the next task to run. (with lots of additional details - but this is the raw scheme.) It will pick the task with the largest wait_runtime value. (i.e. the task that is most in need of CPU time.)
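
(As a rough illustration of that pick-next logic - a toy user-space sketch, not the kernel's code: the real scheduler keeps tasks in an rbtree, while this sketch uses a sorted list, and the task names and numbers are invented.)

 #include <stdio.h>

 struct task {
         const char *name;
         long wait_runtime;      /* positive: owed CPU time; negative: overdrawn */
         struct task *next;      /* list kept sorted, largest wait_runtime first */
 };

 static struct task *runqueue;

 /* insert so the list stays ordered by descending wait_runtime */
 static void enqueue(struct task *t)
 {
         struct task **p = &runqueue;

         while (*p && (*p)->wait_runtime > t->wait_runtime)
                 p = &(*p)->next;
         t->next = *p;
         *p = t;
 }

 /* CFS-style pick: the head is the task most in need of CPU time */
 static struct task *pick_next(void)
 {
         struct task *t = runqueue;

         if (t)
                 runqueue = t->next;
         return t;
 }

 int main(void)
 {
         struct task hog  = { "cpu-hog",  -200, NULL };
         struct task x    = { "Xorg",      350, NULL };
         struct task gear = { "glxgears",  150, NULL };

         enqueue(&hog);
         enqueue(&x);
         enqueue(&gear);

         /* Xorg has run least relative to its fair share, so it goes first */
         for (struct task *t; (t = pick_next()); )
                 printf("run %s (wait_runtime=%ld)\n", t->name, t->wait_runtime);
         return 0;
 }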

This mechanism and implementation is basically not comparable to SD in any way, the two schedulers are so different. Basically the only common thing between them is that both aim to schedule tasks "fairly" - but even the definition of "fairness" is different: SD strictly considers time spent on the CPU and on the runqueue, CFS takes time spent sleeping into account as well. (and hence the approach of "sleep average" and the act of "rewarding" sleepy tasks, which was the main interactivity mechanism of the old scheduler, survives in CFS. Con was fundamentally against sleep-average methods. CFS tried to be a no-tradeoffs replacement for the existing scheduler and the sleeper-fairness method was key to that.)

This (and other) design differences and approaches - not surprisingly - produced two completely different scheduler implementations. Anyone who has tried both schedulers will attest to the fact that they "feel" differently and behave differently as well.

Due to these fundamental design differences the data structures and algorithms are necessarily very different, so there was basically no opportunity to share code (besides the scheduler glue code that was already in sched.c), and there's only 1 line of code in common between CFS and SD (out of thousands of lines of code):

  * This idea comes from the SD scheduler of Con Kolivas:
  */
 static inline void sched_init_granularity(void)
 {
         unsigned int factor = 1 + ilog2(num_online_cpus());

This boot-time "ilog2()" tuning based on the number of CPUs available is a tuning approach i saw in SD and i asked Con whether i could use it in CFS. (to which Con kindly agreed.)

CFS and SD internals, design

Posted Jul 28, 2007 12:30 UTC (Sat) by jospoortvliet (guest, #33164)

Thanks, Ingo - an excellent explanation. I wonder if you could elaborate on
the following you wrote:

This mechanism and implementation is basically not comparable to SD in
any way, the two schedulers are so different. Basically the only common
thing between them is that both aim to schedule tasks "fairly" - but even
the definition of "fairness" is different: SD strictly considers time
spent on the CPU and on the runqueue, CFS takes time spent sleeping into
account as well. (and hence the approach of "sleep average" and the act
of "rewarding" sleepy tasks, which was the main interactivity mechanism
of the old scheduler, survives in CFS. Con was fundamentally against
sleep-average methods. CFS tried to be a no-tradeoffs replacement for the
existing scheduler and the sleeper-fairness method was key to that.)

How does this work, and affect fairness? I mean, can you tell a bit more
about the difference between SD and CFS in this area? (I'm pretty
interested, that's all)

CFS and SD internals, design

Posted Sep 18, 2007 9:25 UTC (Tue) by mingo (subscriber, #31122)

How does this work, and affect fairness? I mean, can you tell a bit more about the difference between SD and CFS in this area? (I'm pretty interested, that's all)

The practical difference is noticeable for something like the X server: Xorg is often a "sleepy" process, but it's important that when it does run it gets its full share of CPU time. With a "runners only" fairness model it will receive less CPU time than with a "sleepers considered too" (CFS) fairness model.
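
(A toy calculation may make the contrast concrete - the numbers are invented, not measurements: suppose that over a 100ms window Xorg sleeps for 60ms and is runnable for 40ms, competing with one CPU hog that never sleeps.)

 #include <stdio.h>

 int main(void)
 {
         int window = 100, xorg_sleep = 60;
         int xorg_runnable = window - xorg_sleep;

         /* "runners only" (SD-style): only runnable time is split fairly,
          * so Xorg shares its 40ms runnable window equally with the hog */
         int runners_only = xorg_runnable / 2;

         /* "sleepers too" (CFS-style): sleeping still earns wait_runtime,
          * so Xorg wakes with a positive balance, preempts the hog, and
          * can use up to all of its runnable time */
         int sleepers_too = xorg_runnable;

         printf("runners-only fairness: Xorg gets ~%dms of %dms\n",
                runners_only, window);
         printf("sleeper fairness:      Xorg gets up to ~%dms of %dms\n",
                sleepers_too, window);
         return 0;
 }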

RDSL and ignoring feedback

Posted Jul 26, 2007 22:44 UTC (Thu) by bronson (subscriber, #4806)

SwPr didn't get in because some very rich employers don't want to destabilize their NUMA monsters? Er, doesn't this sound really unlikely?

The article mentions that Linus and some others feel that SwPr is just papering over a more fundamental problem. So, why not spend time trying to fix the fundamental problem before hacking around it?

That's a rhetorical question... there could be a number of reasons: the root cause is too complex to be understood, the proper fix is worse than SwPr, etc. I just think that Linus & crew would like to see someone attempt to fix the real problem before resorting to a SwPr hack. If a proper fix is attempted and proves unwieldy, then I bet SwPr will jump a lot higher in a number of kernel devs' priority queues.

> ...entirely silly. Complaining about the fairness of a fair scheduler?

They weren't complaining about the fairness; they were complaining about the quality. Is a 100% fair scheduler actually the best scheduler? Probably not.

RDSL and ignoring feedback

Posted Jul 28, 2007 12:35 UTC (Sat) by jospoortvliet (guest, #33164)

No, but a 100% fair scheduler is the only way to ensure you won't have
stalls and other problems brought on by unfairness. You can't have your
cake and eat it too.

RDSL and ignoring feedback

Posted Jul 28, 2007 20:41 UTC (Sat) by bronson (subscriber, #4806)

Sorry, I don't quite understand... are you saying that a 100% fair scheduler actually *is* the best scheduler? If so, then would you have any evidence/research to back this up? I'm genuinely interested.

My uneducated view: in schedulers, as with government and parenting, 100% fairness is unattainable and probably paradoxical. The best policy may or may not be the most fair policy. They're simply disconnected.

RDSL and ignoring feedback

Posted Aug 2, 2007 8:00 UTC (Thu) by renox (guest, #23785)

The thing this complaint shows is that a "fair scheduler" by itself is not good enough for the end-user desktop.
If you have an application APP that is important to you, you renice it so it gets lots of CPU - fine. But if that application then sends a lot of work to the X server (it could be any server, really), a kind of "priority inversion" happens, where APP is slowed down because the X server doesn't have a big enough CPU share.

It's quite difficult to solve. The only way would be some mechanism to transfer APP's "CPU token" to the server doing work on its behalf. If it is a multi-threaded server that uses a different thread for each client, then maybe the kernel could understand what's happening and boost the corresponding server thread's priority accordingly. But with a non-threaded server, I don't see how it could be solved: even if APP says "give my CPU token to server X", how would server X be able to report, or even know, that it is currently working for client APP and not for another client?
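
(To make the idea concrete - this is purely hypothetical, no such kernel interface exists, and every name below is invented: in the threaded-server case the comment considers tractable, the token transfer would look something like priority inheritance.)

 #include <stdio.h>

 struct sched_entity {
         const char *name;
         int cpu_share;          /* arbitrary units of CPU entitlement */
 };

 /* the client lends part of its share to the thread serving its request */
 static void donate(struct sched_entity *from, struct sched_entity *to,
                    int amount)
 {
         from->cpu_share -= amount;
         to->cpu_share += amount;
 }

 int main(void)
 {
         struct sched_entity app  = { "APP (reniced)",   60 };
         struct sched_entity xsrv = { "X server thread", 10 };

         donate(&app, &xsrv, 30);        /* request in flight */
         printf("during request: %s=%d, %s=%d\n",
                app.name, app.cpu_share, xsrv.name, xsrv.cpu_share);

         donate(&xsrv, &app, 30);        /* request completed */
         printf("after request:  %s=%d, %s=%d\n",
                app.name, app.cpu_share, xsrv.name, xsrv.cpu_share);
         return 0;
 }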

Still waiting for Godot.

Posted Jul 26, 2007 7:27 UTC (Thu) by error27 (subscriber, #8346)

You know things are bad when Corbet starts breaking out the Beckett puns.

Still waiting for swap prefetch

Posted Jul 27, 2007 12:32 UTC (Fri) by IkeTo (subscriber, #2122)

> A concern which has been raised a few times is that the morning swap-in
> problem may well be a sign of a larger issue within the virtual memory
> subsystem, and that prefetch mostly serves as a way of papering over that
> problem. And it fails to even paper things completely, since it brings back
> some pages from swap, but doesn't (and really can't) address file-backed
> pages which will also have been pushed out. The conclusion that this
> reasoning leads to is that it would be better to find and fix the real
> problem rather than hiding it behind prefetch.

My interpretation: there are 3 classes of systems:

1. Those that have loads of memory and very few big-memory programs, and as such never write anything to swap. Swap prefetch is, of course, a no-op on such systems.
2. Those that have loads of memory and very few big-memory programs, but where the swap is still being written. Swap prefetch improves the performance of such systems, but developers would ask, "why the hell did it happen in the first place?"
3. Those that do not have enough memory to run all the big-memory programs they use, where swap prefetch naturally helps, as expected.

So because of the unclear reason that (2) happens, and perhaps because systems in class (1) might see regressions and need to turn prefetching off manually via a kernel option (never mind that no such regression is currently known after 18 months of testing), systems in class (3) have to suffer? After all, prefetch is not something done only for swap. A block I/O subsystem without prefetch also has horrible performance, and we have had prefetch there for ages. So why does swap have to be treated differently? Should we really expect an application that has been swapped out (for whatever reason) to perform much worse than an application being loaded for the first time, just because prefetching happens when the application is first loaded but not when it has been pushed to swap by uncontrollable memory pressure? Is it such a surprise that swap prefetch is needed anyway?

And what is the consequence for (2) if swap can be prefetched? Does it mean there is no way to detect that such a problem exists? Of course not: the kernel keeps page-fault counters, if developers care to write a script and collect the statistics for each running process. The only thing that will happen is that people will no longer be so unhappy about the problem, because it won't hit their bottom line: enjoying hot coffee while working in the morning. And the end result is, unsurprisingly, less attention to the problem. But... is it really such a bad thing, after all, that a hard problem can be put aside because it no longer causes serious user dismay?
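
(For reference, a sketch of the sort of script meant here - illustrative only, with minimal error handling: it walks /proc and prints each process's minflt/majflt counters, fields 10 and 12 of /proc/<pid>/stat; majflt counts the faults that had to wait for disk.)

 #include <stdio.h>
 #include <string.h>
 #include <dirent.h>
 #include <ctype.h>

 int main(void)
 {
         DIR *proc = opendir("/proc");
         struct dirent *de;

         if (!proc)
                 return 1;

         while ((de = readdir(proc)) != NULL) {
                 char path[64], buf[1024], state, *p;
                 long ppid, pgrp, session, tty_nr, tpgid;
                 unsigned long flags, minflt, cminflt, majflt;
                 FILE *f;

                 if (!isdigit((unsigned char)de->d_name[0]))
                         continue;
                 snprintf(path, sizeof(path), "/proc/%s/stat", de->d_name);
                 f = fopen(path, "r");
                 if (!f)
                         continue;
                 /* the comm field may contain spaces; skip past its ')' */
                 if (fgets(buf, sizeof(buf), f) &&
                     (p = strrchr(buf, ')')) != NULL &&
                     sscanf(p + 2, "%c %ld %ld %ld %ld %ld %lu %lu %lu %lu",
                            &state, &ppid, &pgrp, &session, &tty_nr, &tpgid,
                            &flags, &minflt, &cminflt, &majflt) == 10)
                         printf("pid %5s: minflt=%lu majflt=%lu\n",
                                de->d_name, minflt, majflt);
                 fclose(f);
         }
         closedir(proc);
         return 0;
 }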

Perhaps I'm understanding something really wrong.

Great article, thanks!

Posted Jul 28, 2007 2:54 UTC (Sat) by wruppert (subscriber, #3041)

Thanks for explaining both sides of this issue so clearly. I had not been following it, and then read the interview about Con leaving kernel development; I was wondering what the other side of the story was. Great job!

Great article, thanks!

Posted Jul 28, 2007 12:33 UTC (Sat) by jospoortvliet (guest, #33164)

Imho this one isn't entirely fair either, but Corbet and I and others are
still fighting it out, so don't take my word for it.


Copyright © 2007, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds