The case of the overly anonymous anon_vma

This article brought to you by LWN subscribers

Subscribers to LWN.net made this article — and everything that surrounds it — possible. If you appreciate our content, please buy a subscription and make the next set of articles possible.

By Jonathan Corbet
April 13, 2010

During the stabilization phase of the kernel development cycle, the -rc releases typically happen about once every week. 2.6.34-rc4 is a clear exception to that rule, coming nearly two weeks after the preceding -rc3 release. The holdup in this case was a nasty regression which occupied a number of kernel developers nearly full time for days. The hunt for this bug is a classic story of what can happen when the code gets too complex.

Sending email to linux-kernel can be an intimidating prospect for a number of reasons, one of which being that one never knows when a massive thread - involving hundreds of messages copied back to the original sender - might result. Borislav Petkov's 2.6.34-rc3 bug report was one such posting. In this case, though, the ensuing thread was in no way inflammatory; it represents, instead, some of the most intensive head-scratching which has been seen on the list for a while.

The bug, as reported by Borislav, was a null pointer dereference which would happen reasonably reliably after hibernating (and restarting) the system. It was quickly recognized as being the same as another bug report filed the same day by Steinar H. Gunderson, though this one did not involve hibernation. The common thread was null pointer dereferences provoked by memory pressure. The offending patch was identified by Linus almost immediately; it's worth taking a look at what that patch did.

Way back in 2004, LWN covered the addition of the anon_vma code; this patch was controversial at the time because the upcoming 2.6.7 kernel was still expected to be an old-style "stable, no new features" release. This patch, a 40-part series which fundamentally reworked the virtual memory subsystem, was not seen as stable material, despite Linus's attempt to characterize it as an "implementation detail." Still, over time, this code has proved solid and has not been changed significantly since - until now.

The problem solved by anon_vma was that of locating all vm_area_struct (VMA) structures which reference a given anonymous (heap or stack memory) page. Anonymous pages are not normally shared between processes, but every call to fork() will cause all such pages to be shared between the parent and the new child; that sharing will only be broken when one of the processes writes to the page, causing a copy-on-write (COW) operation to take place. Many pages are never written, so the kernel must be able to locate multiple VMAs which reference a given anonymous page. Otherwise, it would not be able to unmap the page, meaning that the page could not be swapped out.

The reverse mapping solution originally used in 2.6 proved to be far too expensive, necessitating a rewrite. This rewrite introduced the anon_vma structure, which heads up a linked list of all VMAs which might reference a given page. So a fork() also causes every VMA in the child process which contains anonymous pages to be added to a the list maintained in the parent's anon_vma structure. The mapping pointer in struct page points to the anon_vma structure, allowing the kernel to traverse the list and find all of the relevant VMA structures.

This diagram, from the 2004 article, shows how this data structure looks:

This solution scaled far better than its predecessor, but eventually the world caught up. So Rik van Riel set out to make things faster, writing this patch, which was merged for 2.6.34. Rik describes the problem this way:

In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed, this leads to a system with a million anonymous pages in the same anon_vma, each of which is mapped in just one of the 1000 processes. However, the current rmap code needs to walk them all, leading to O(N) scanning complexity for each page.

Essentially, by organizing all anonymous pages which originated in the same parent under the same anon_vma structure, the kernel created a monster data structure which it had to traverse every time it needed to reverse-map a page. That led to the kernel scanning large numbers of VMAs which could not possibly reference the page, all while holding locks. The result, says Rik, was "catastrophic failure" when running the AIM benchmark.

Rik's solution was to create an anon_vma structure for each process and to link those together instead of the VMA structures. This linking is done with a new structure called anon_vma_chain:

    struct anon_vma_chain {
	struct vm_area_struct *vma;
	struct anon_vma *anon_vma;
	struct list_head same_vma;
	struct list_head same_anon_vma;
    };

Each anon_vma_chain entry (AVC) maintains two lists: all anon_vma structures relevant to a given vma (same_vma), and all VMAs which fall within the area covered by a given anon_vma structure (same_anon_vma). It gets complicated, so some diagrams might help. Initially, we have a single process with one anonymous VMA:

Here, "AV" is the anon_vma structure, and "AVC" is the anon_vma_chain structure seen above. The AVC links to both the anon_vma and VMA structures through direct pointers. The (blue) linked list pointer is the same_anon_vma list, while the (red) pointer is the same_vma list. So far, so simple.

Imagine now that this process forks, causing the VMA to be copied in the child; initially we have a lonely new VMA like this:

The kernel needs to link this VMA to the parent's anon_vma structure; that requires the addition of a new anon_vma_chain:

Note that the new AVC has been added to the blue list of all VMAs referencing a given anon_vma structure. The new VMA also needs its own anon_vma, though:

Now there's yet another anon_vma_chain structure linking in the new anon_vma. The new red list has been expanded to contain all of the AVCs which reference relevant anon_vma structures. As your editor said, it gets complicated; the diagram for the 1000-child scenario which motivated this patch will be left as an exercise for the reader.

When the fork() happens, all of the anonymous pages in the area point back to the parent's anon_vma structure. Whenever the child writes to a page and causes a copy-on-write, though, the new page will map back to the child's anon_vma structure instead. Now, reverse-mapping that page can be done immediately, with no need to scan through any other processes in the hierarchy. That makes the lock contention go away, making benchmarkers happy.

The only problem is that embarrassing oops issue. Linus, Rik, Borislav, and others chased after it, trying no end of changes. For a while, it seemed that a bug causing excessive reuse of anon_vma structures when VMAs were merged could be the problem, but fixing the bug did not fix this oops. Sometimes, changing VMA boundaries with mprotect() could cause the wrong anon_vma to be used, but fixing that one didn't help either. The reordering of chains when they were copied was also noted as a problem...but it wasn't the problem.

Linus was clearly beginning to wonder when it might all end: "Three independent bugs found and fixed, and still no joy?" He repeatedly considered just reverting the change outright, but he was reluctant to do so; the solution seemed so tantalizingly close. Eventually he developed another hypothesis which seemed plausible. An anonymous page shared between parent and child would initially point to the parent's anon_vma:

But, if both processes were to unmap the page (as could happen during system hibernation, for example), then the child referenced it first, it could end up pointing to the child's anon_vma instead:

If the parent mapped the page later, then the child unmapped it (by exiting, perhaps), the parent would be left with an anonymous page pointing to the child's anon_vma - which no longer exists:

Needless to say, that is a situation which is unlikely to lead to anything good in the near future.

The fix is straightforward; when linking an existing page to an anon_vma structure, the kernel needs to pick the one which is highest in the process hierarchy; that guarantees that the anon_vma will not go away prematurely. Early testing suggests that the problem has indeed been fixed. In the process, three other problems have been fixed and Linus has come to understand a tricky bit of code which, if he has his way, will soon gain some improved documentation. In other words, it would appear to be an outcome worth waiting for.

Index entries for this article
Kernel	anon_vma
Kernel	Memory management/Object-based reverse mapping

(Log in to post comments)

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 16:26 UTC (Tue) by clugstj (subscriber, #4020) [Link]

I am in awe of Linus' debugging abilities!

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 17:31 UTC (Tue) by Wummel (subscriber, #7591) [Link]

It is even more awesome as he seems just to be looking at the source and guesses what could be wrong with it (or does Linus use printf() or even kgdb nowadays for debugging?).

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 17:49 UTC (Tue) by dlang (guest, #313) [Link]

it was primarily looking at the source and thinking about what was happening.

In the process three other bugs were found and fixed, and the section of code got significant documentation and cleanup improvements.

Linus was unable to reliably replicate the bug, so printf, kgdb, etc could not be used.

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 18:36 UTC (Tue) by brouhaha (subscriber, #1698) [Link]

"He fixes radios by thinking!"

-- someone whose radio was repaired by Richard Feynman

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 18:59 UTC (Tue) by iabervon (subscriber, #722) [Link]

The part I found most impressive was when he identified that something was crashing in the first iteration of a loop by looking at the register contents and assembly decode from the oops report. Where other people would have needed to look at memory in a debugger, he could just look at registers and infer where the bad value was coming from.

The case of the overly anonymous anon_vma

Posted Apr 14, 2010 4:11 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Actually, wasn't that Boris that did the initial assembly dump interpretation?

It's actually not so hard once you've done it a couple times. I've had to chase down bugs in our C compiler at work (ah, the joys of running internal alpha builds!).

The case of the overly anonymous anon_vma

Posted Apr 14, 2010 4:44 UTC (Wed) by iabervon (subscriber, #722) [Link]

The message I'm thinking of is <alpine.LFD.2.00.1004061220270.3487@i5.linux-foundation.org> from Apr 6 at 15:35; Linus looks at Steinar Gunderson's disassembly, where %rax is a kernel pointer and %rbx is not %rax+20 like it would be after running the loop, implying that "anon_vma->head.next" is NULL, not some other anon_vma_chain entry. Boris had worked out that there was some problem in the list previously, but Linus identified that the pointer from the anon_vma was bad, rather than the list further down being corrupt. This turned out to be important to the actual bug, which had to do with the anon_vma associated with the page being gone (and its memory reused) rather than some other anon_vma in the vma's chain being gone or something messing up the list. Boris picked up the stuff you'd get from a debugger that had registers but not core; Linus picked up an important detail that one would only normally get from a debugger by inspecting memory.

The case of the overly anonymous anon_vma

Posted Apr 14, 2010 4:57 UTC (Wed) by jzbiciak (guest, #5246) [Link]

Ah, ok. I hadn't seen that email. Makes sense.

The case of the overly anonymous anon_vma

Posted Apr 15, 2010 20:09 UTC (Thu) by Felix (subscriber, #36445) [Link]

I think you mean this mail from Linus
http://groups.google.co.id/group/linux.kernel/msg/130d3ea...

(it has a slightly different message id+time but seems to fit)

The case of the overly anonymous anon_vma

Posted Apr 15, 2010 20:17 UTC (Thu) by iabervon (subscriber, #722) [Link]

Actually, that's a similar message, but not the one I was thinking of. In that one, he says "So again, I can show that..." I was looking at the first time, and this is the second time. The one I'm thinking of is http://groups.google.com/group/linux.kernel/msg/f9c7ca848... and has a more complete explanation of the middle steps.

The case of the overly anonymous anon_vma

Posted Apr 14, 2010 19:00 UTC (Wed) by mrfredsmoothie (guest, #3100) [Link]

That's precisely the point of the Linus "don't rely on a tool" philosophy of bug-fixing.

If your code cannot be reasoned about by a human, then it cannot be maintained by a human, either.

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 17:42 UTC (Tue) by Trelane (subscriber, #56877) [Link]

Awesome. *This* is why I subscribe to lwn!

The case of the overly anonymous anon_vma

Posted Apr 14, 2010 11:37 UTC (Wed) by sorpigal (guest, #36106) [Link]

I hereby register one (1) classic "Me too," reply.

The case of the overly anonymous anon_vma

Posted Apr 14, 2010 15:04 UTC (Wed) by gwittenburg (guest, #5080) [Link]

+1, same here. Thanks, Jonathan!

The case of the overly anonymous anon_vma

Posted Apr 15, 2010 12:07 UTC (Thu) by Darkmere (subscriber, #53695) [Link]

I just resubscribed. Still somewhat starving hacker, but it's better than nothing. Now to register a company account once I snipe a creditcard from someone ;)

The case of the overly anonymous anon_vma

Posted Apr 13, 2010 20:22 UTC (Tue) by ewen (subscriber, #4772) [Link]

The fix is straightforward; when linking an existing page to an anon_vma structure, the kernel needs to pick the one which is highest in the process hierarchy; that guarantees that the anon_vma will not go away prematurely.

That fix makes me wonder about the "daemon startup" situation, where the initial process fork()s at least once (sometimes twice), and then the parent process exits. I assume that the structure in the parent of the hierachy is being chosen with the expectation that it'd be longest lived. But in the "daemon startup" case the child ends up being longest lived. Hopefully the other actions to becoming a daemon, such as disassociating itself with parent process (reparenting to process 1) and becoming a new process group mean that the relevant structures get migrated appropriately down into the child.

I'm also left wondering whether a simpler change from "linked list" to, eg, something indexed by page address, would have improved the situation dramatically without the same degree of code complexity, and hence bugs. (Eg, O(n) over 1M pages is non-trivial, O(log n) over 1M pages is almost trivial.)

Ewen

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 20:33 UTC (Sat) by efexis (guest, #26355) [Link]

"eg, something indexed by page address"

That's what the page table is. But you can only index a fixed number of things from that, once you have a variable number of things (as there being any number of processes sharing that page), that's then another dimension, and can thus not be represented using a single dimension of the page address alone. A linked list might be slow for searching in, but in this case it looks like most operations are going to be either inserts (eg, at fork() time), deletes (at CoW or process termination time), and I guess possibly an action required on the whole group, although I couldn't suggest what. These seem like they'd all about as cheap on a linked list as you can get.

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 21:08 UTC (Sat) by ewen (subscriber, #4772) [Link]

I had in mind replacing the linked list with, eg, a hash (with buckets, probably), which is O(n) average case, or some sort of tree structure, which is O(log n) average case. Inserts are quick on a linked list if you don't bother to maintain any order, but searching and deletes are not; for most other structures inserts are a bit slower, but you get faster searching and deletes.

Still if the tricky details of the new solution have been sorted out then there's no point in changing back now. I guess time will tell.

Ewen

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 21:46 UTC (Sat) by efexis (guest, #26355) [Link]

Inserts are also quick if you do maintain order, simply by maintaining a pointer to the end of the list. You've only two pointers to update for the list (end of list and pointer to end of list) or am I forgetting something? Deletes could be sped up by making it double linked, and I can't think of why you'd ever need to perform a search... even if you needed to find a particular processes own structure for a page, seems like you'd start the search from that processes page table which can be done in a determinate amount of time, rather than from your own and then through the linked list.

I don't know what other property you could be looking for, other than memory address or process id, that you could use as the value for an index or hash. If you were collecting a stat (such as the process that's using the page that runs the most, which could be a useful thing to know) you'd have to iterate through the lot. Maintaining an index of that value is loads more work that I can't see paying off. What other value, other than page address and process id, would you want to search on?

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 22:21 UTC (Sat) by ewen (subscriber, #4772) [Link]

If you always insert at the start of the list, or always insert at the end of the list, then the order that you are maintaining is "order inserted" (either newest at start or newest at end). This isn't necessarily what one means by an ordered list (except if you actually want a stack or a queue, both of which are "ordered by time"). Inserting anywhere else in the list, to order by a data value, requires finding that location, eg, page address, which requires either knowing the pointer to the immediately prior page address in the list, or going and finding same (O(n)). (As you suggest deletes could be quick if you make the process track a pointer to the relevant entry in the link list, and use a double linked list, something I'd overlooked.)

From what I can see "find by page address" is the only primitive that actually matters here (so you can find the references to that page); "find by process id" seems easiest done starting with the processes structures. And the problem that we started with was a 1,000,000-long linked list which referenced multiple pages held by multiple processes. So any structure which made it faster to find the entries for the appropriate page would help, providing its other overhead wasn't too high.

Ewen

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 20:45 UTC (Sat) by efexis (guest, #26355) [Link]

"That fix makes me wonder about the "daemon startup" situation"

I don't think that matters (if I've interpreted what's said above correctly), as the page is unlikely to be unlinked from all the processes between the daemon fork()s taking place, as those initial processes are going to be pretty short lived. When the top process terminates, the next process down the list would have to take on the links, which will require some work, but no more than what would have to be done if this was done earlier in the game if you knew that the child was going to live longer. The only way it would save time would be if the memory was unlinked (eg, paged out), then paged back in and linked to the shorter running process - picking the longest running process would be more ideal. But yeah, for a process to be running long enough for this to happen probably rules out the ability to predict which is going to end sooner.

Anyone completed the exercise?

Posted Apr 15, 2010 6:08 UTC (Thu) by nikanth (guest, #50093) [Link]

the diagram for the 1000-child scenario which motivated this patch will be left as an exercise for the reader.
Has anyone completed the exercise. ;-)
Probably it shouldn't be daunting using the right tool + script. Even a video may be possible. :)

Anyone completed the exercise?

Posted Apr 16, 2010 4:27 UTC (Fri) by pjm (guest, #2080) [Link]

I think the point is just that standard tools like to avoid nodes overlapping each other, so if you have a million nodes on screen then each node must occupy only about 1 pixel, and you wouldn't be able to get any information about the problem other than a million sure is a big number, isn't it?. But it wasn't really a serious suggestion for understanding the bug, which could be illustrated with just one or two child processes.

(I was hoping to be able to point people to a certain impressive tool I've seen for interactively exploring large graphs, but I see that the code isn't yet publicly available. However, it's expected to be redeveloped and GPL'd in a year or two, under the name Dunnart.)

The case of the overly anonymous anon_vma

Posted Apr 15, 2010 22:02 UTC (Thu) by i3839 (guest, #31386) [Link]

Is it just me who's wondering whether all this complexity is worth it?

What was it needed for again? To have a reverse mapping? Isn't that mostly useful for swapping and not much else? It seems silly to always add the overhead when it's almost never needed. Why not only have the overhead when actually swapping or whenever it's needed? Then this whole mess disappears from the common case and the complex stuff can be separated and isolated.

I have to read more code and think more about it, preferably with a clear mind.

The case of the overly anonymous anon_vma

Posted Apr 23, 2010 13:41 UTC (Fri) by rilder (guest, #59804) [Link]

Even I feel the same. The scenario "In a workload with 1000 child processes and a VMA with 1000 anonymous pages per process that get COWed," -- may not arise in a normal desktop workload. In such as case won't this introduce additional overhead, unless this feature is introduced as a CONFIG variable which I don't think it is.

The case of the overly anonymous anon_vma

Posted Apr 23, 2010 16:17 UTC (Fri) by i3839 (guest, #31386) [Link]

Not only that, if the anon_vma isn't used for file caches, but only for anonymous pages then the whole thing seems dubious when swap is disabled.

In addition to that, anon_vma apparently didn't solve the problem well, but instead of replacing it with something that does, more kludges are added on top of it. I got the feeling that there's some much more elegant solution waiting for someone to find it.

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 20:59 UTC (Sat) by efexis (guest, #26355) [Link]

More complex doesn't necessarily mean less efficient. Having to search through all other pages belonging to all other processes, just to see who, if anyone, is sharing it with you, incures considerably more overhead. There're probably more reasons you'd want to do this than just at page swapout time too, maybe dealing with process memory limits - knowing who to charge the memory usage to, or with NUMA systems becoming more common, knowing whether it's worth moving a page to memory closer to the core running the majority of processes accessing it. Sure it's more complex to us, processing this extra information in our heads, but that little extra time can save a lot of time in the long run.

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 22:48 UTC (Sat) by i3839 (guest, #31386) [Link]

You're forgetting the memory overhead (including cache misses), not everything is pure CPU cycles. Besides that, this adds a cost to most memory operations, even if it turns out it was never necessary. Why slow down the common case to speed up special cases in rare situations?

I don't have time to look further into the details, at least not this month. Hopefully next month.

The case of the overly anonymous anon_vma

Posted Apr 26, 2010 3:03 UTC (Mon) by efexis (guest, #26355) [Link]

"I don't have time to look further into the details"

Well my guess is that they did, and that code that mostly slowed the memory manager down for savings only in a corner case would have been thrown out by Linus, as is often the way.

The case of the overly anonymous anon_vma

Posted Apr 26, 2010 19:09 UTC (Mon) by i3839 (guest, #31386) [Link]

I hope you're wrong. :-)

This seems more a gradual development, with every thing on its own making sense at the time, but together still going in the wrong direction. It for sure isn't a big enough problem yet.

It would be nice to know what the reverse mapping is used for besides swapping. It seems that the reverse mapping of files is a solved problem, but that all this complexity is only for anonymous memory. Which can only go away is you have swap enabled, or some obscure option like memory hotplug. (In the case of hibernation you need to scan all pages anyway, so no need for this either.) So I guess that my main complaint is that this is done even when not needed.

For swap you only need a reverse mapping for pages that might get swapped out soon. I wonder if it's possible to create that mapping only for inactive pages instead of all of them all the time. Alternatively, swapping out can be done per-process, then there's no need for a reverse map. Shared pages are swapped out less quickly, but that's usually better anyway.

The swapping system needs an overhaul anyway, it's very slow currently. This way both the VM and the swap systems can be improved.

Improving the current code may not be easy, but it for sure is possible.

The case of the overly anonymous anon_vma

Posted Apr 30, 2010 20:02 UTC (Fri) by efexis (guest, #26355) [Link]

Hmm... I dunno, I would have thought that this isn't so much only needed for anonymous memory, but this may be the only reason why it's needed for anonymous memory, iyswim... that in other uses there are other reasons for it too, and so it can actually be simpler to just have it, than have it in some places and not in others. Anywhere you have copy-on-write pages it's going to be important, as the way you achieve that is to mark the pages read only. Someone tried to write to it, it causes a memory protection fault, which jumps into the VM code giving it the address of memory that the fault occured while trying to write. So where do you start if you don't have a back link to get from that address to the structures belonging to those using the page? You'd end up just doing loads of searching instead. The same goes for when it's something like memory mapped files, you need to know when the pages have been modified so you know it has to at some point be written back to disk. If you can't look up what's going on from the address it's going on at, things get way more complicated. Seems this is just basic accounting :-/

The case of the overly anonymous anon_vma

Posted May 1, 2010 12:17 UTC (Sat) by i3839 (guest, #31386) [Link]

It is. Reverse mapping for file backed pages doesn't need anon_vma stuff, it's a lot simpler.

You don't need a reverse mapping to do COW, when you get a pagefault you know the virtual address which caused it.

Reverse mapping is needed to find all virtual pages belonging to a certain physical one. That isn't needed often for anonymous memory.

The case of the overly anonymous anon_vma

Posted May 3, 2010 16:51 UTC (Mon) by efexis (guest, #26355) [Link]

You're right about the page fault bit, my mistake, a reference count would be sufficient there, the reverse map would be for changing the backing for purposes of swap or migration in a NUMA or HIGHMEM system where quick knowledge page<-->active sets is required.
Reverse mapping for file backed pages don't need anon_vma stuff obviously, because they're not anon. What they use isn't simpler though, as, for example, a library will often be shared by more address spaces than an anon page, it makes more sense to use a search tree (at least this used to be the case, according to a slightly dated early 2.6.x book detailing it. If it's now simpler than this, that would show kernel devs in fact simplifying things, not adding complexity).
If you use none of these, go into your kernel conf and disable CONFIG_SWAP, CONFIG_NUMA and any CONFIG_HIGHMEM entries. If nothing else uses it, it'll either be removed, or worst case, a few greps will show you where the functions are being used elsewhere to throw a few #ifdef's around (I imagine embedded comunity would have interest in already doing this).

The case of the overly anonymous anon_vma

Posted Apr 24, 2010 21:15 UTC (Sat) by efexis (guest, #26355) [Link]

Fantastic. That last diagram is like a punchline, I wasn't completely with it until that last one which triggered what I refer to as a "cascade of realisation", which is like a nice brain massage :-) much appreciated.

Not so appreciated though was the pointing to a 2004 article that I remember reading like it was just erm... 200err... 2007? Way to make me feel old for the first time ever ;-p