The rest of the vmsplice() exploit story

By Jonathan Corbet
March 4, 2008

Back in February, LWN published a discussion of the vmsplice() exploit which showed how the failure to check permissions for a read operation led to a buffer overflow within the kernel. Subsequently, a linux-kernel reader pointed out that the article stopped short of a complete explanation: this is not an ordinary buffer overflow exploit. Travel schedules and such prevented the writing of an immediate followup, but your editor would still like to tell the full story. So this article picks up where the last one left off and describes how the vmsplice() exploit makes use of this buffer overflow to take over the system.

When vmsplice() is being used to feed data from memory into a pipe, the function charged with making it all happen is vmsplice_to_pipe(), found in fs/splice.c. It declares a couple of arrays of interest:

    struct page *pages[PIPE_BUFFERS];
    struct partial_page partial[PIPE_BUFFERS];

PIPE_BUFFERS, remember, is 16 on exploitable configurations. Both of these arrays are passed into get_iovec_page_array(), which, as described in the previous article, makes a call to get_user_pages() to fill in the pages array. As a result of the failure to check whether the calling application is allowed to read the requested region of memory, get_user_pages() will overflow the pages array, writing far more than PIPE_BUFFERS pointers into it. These are, however, pointers to legitimate kernel data structures; it remains to be seen how this overflow enables the attacker to take control of the system.

The partial array is also passed into get_iovec_page_array(); it describes the portion of each page which should be written into the pipe. To that end, a loop like this is run immediately after returning from get_user_pages():

    for (i = 0; i < error; i++) {
	const int plen = min_t(size_t, len, PAGE_SIZE - off);
	partial[buffers].offset = off;
	partial[buffers].len = plen;
	/* ... */
    }

Since full pages are being written in this case, the calculated offset will be zero, and the length will be PAGE_SIZE (4096). The value of error is the return value from get_user_pages(); that will be the number of pages actually mapped: 46, in the case of the exploit. Remember that the partial array is also dimensioned to hold 16 entries, so this loop will overflow that array as well.

Both of these arrays are declared, one right after the other, in vmsplice_to_pipe(). A quick test by your editor suggests that the partial array will be placed below pages in memory, so, once partial is overflowed, the loop will start overwriting pages instead. So the pages array will end up containing alternating values of zero and 4096 rather than the real struct page pointers it had before. (It's worth noting that the exploit still works if the arrays are placed in the opposite order, since the overflow causes code down the line to think that pages is larger than it really is.)
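
The effect of that loop can be modeled in user space without performing any real overflow. What follows is a minimal sketch, not kernel code. It assumes a 32-bit system, treats each partial[] entry as just two 32-bit words (offset and len; the real structure may carry more), and assumes that pages[] sits immediately above partial[] in the stack frame. The names simulate_partial_overflow(), simulated_pages_slot(), and dump_pages() are invented for this illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define PIPE_BUFFERS      16
#define SIM_PAGE_SIZE     4096u
#define NR_MAPPED         46   /* pages mapped by get_user_pages() in the exploit */

/* Assumption: each partial[] entry is two 32-bit words, offset then len. */
#define WORDS_PER_PARTIAL 2
/* Assumption: pages[] begins right where partial[] ends. */
#define PAGES_BASE        (PIPE_BUFFERS * WORDS_PER_PARTIAL)
/* The flat model is big enough for all 46 writes, so nothing here overflows. */
#define FRAME_WORDS       (NR_MAPPED * WORDS_PER_PARTIAL)

/* Fill the modeled stack frame the way the kernel loop fills partial[]. */
void simulate_partial_overflow(uint32_t frame[FRAME_WORDS])
{
    for (int i = 0; i < NR_MAPPED; i++) {
        frame[i * WORDS_PER_PARTIAL]     = 0;             /* partial[i].offset */
        frame[i * WORDS_PER_PARTIAL + 1] = SIM_PAGE_SIZE; /* partial[i].len */
    }
}

/* Read back what a (32-bit) pages[] slot holds in the model. */
uint32_t simulated_pages_slot(const uint32_t frame[FRAME_WORDS], int slot)
{
    return frame[PAGES_BASE + slot];
}

/* Print every modeled pages[] slot. */
void dump_pages(const uint32_t frame[FRAME_WORDS])
{
    for (int slot = 0; slot < PIPE_BUFFERS; slot++)
        printf("pages[%2d] = %u\n", slot,
               (unsigned)simulated_pages_slot(frame, slot));
}
```

Calling simulate_partial_overflow() on a buffer of FRAME_WORDS words and then dump_pages() shows the even-numbered pages[] slots holding zero and the odd-numbered slots holding 4096. Entries 16 and above of the modeled partial[] are what land on top of pages[].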

Once all this has happened, control returns to vmsplice_to_pipe() - the overflow is not big enough to have overwritten the return address. A call to splice_to_pipe() is supposed to finish the job, but something interesting happens there. Toward the beginning of this function, this test is made:

    if (!pipe->readers) {
	send_sig(SIGPIPE, current, 0);
	if (!ret)
	    ret = -EPIPE;
	break;
    }
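
The same "no readers" condition is visible from user space with an ordinary write(). The sketch below is only an illustration of that behavior: it uses write() rather than vmsplice(), and write_to_readerless_pipe() is a name invented here. It ignores SIGPIPE for the duration of the call so that the failure shows up in errno instead of terminating the process.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Write one byte into a pipe whose read end has already been closed.
 * Returns the errno from the failed write, 0 if the write unexpectedly
 * succeeded, or -1 if the pipe could not be created.
 */
int write_to_readerless_pipe(void)
{
    int fds[2];

    if (pipe(fds) != 0)
        return -1;

    /* Ignore SIGPIPE so the error is reported through errno. */
    void (*old_handler)(int) = signal(SIGPIPE, SIG_IGN);

    close(fds[0]);                  /* no readers remain on this pipe */

    ssize_t n = write(fds[1], "x", 1);
    int err = (n < 0) ? errno : 0;

    close(fds[1]);
    signal(SIGPIPE, old_handler);   /* restore the previous disposition */
    return err;
}
```

On a POSIX system the function is expected to return EPIPE, the user-space counterpart of the -EPIPE assignment in the kernel excerpt above.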

Looking back at the exploit code, we see that it closes the read side of the pipe before calling vmsplice(). So splice_to_pipe() will quit almost immediately. On its way out, however, it does this:

    while (page_nr < spd_pages)
	page_cache_release(spd->pages[page_nr++]);

The call to get_user_pages() will have locked each of the relevant pages into memory to allow the kernel to work with them; this is the cleanup code which goes back and unlocks the pages which will not be used. But remember that the pointers in the pages array have been overwritten, and are now either zero or 4096. What would normally happen here is a kernel oops, since those are not legitimate addresses. The exploit code has done something tricky, though: using some special mmap() calls, it has created some anonymous memory at the bottom of its address space.

Directly dereferencing user-space addresses while running in kernel mode is frowned upon for a number of reasons; it can blow up in a number of ways. But, if the address is valid and the relevant page is resident in memory, direct access to user-space memory will work. So, when the kernel starts to work with the addresses that it thinks are struct page pointers, it does not get any sort of fault; instead, it gets the data placed in that memory by the exploit. Needless to say, that data has been arranged carefully.

The Linux kernel normally manages each page as an independent object. There are times, however, when pages are grouped into larger units, called "compound pages." This generally happens when physically contiguous allocations larger than one page are needed by the kernel; when this happens, a compound page is passed back to the caller. These pages are special in that they must be split back apart when they are released back into the system, and there may be other cleanup work to do. So compound pages have an attribute not found on normal pages: a destructor which is called when the page is freed.

So, if we look at how the exploit sets up its low-memory page structures, we see:

    pages[0]->flags    = 1 << PG_compound;
    pages[0]->private  = (unsigned long) pages[0];
    pages[0]->count    = 1;
    pages[1]->lru.next = (long) kernel_code;

When the kernel looks for a page structure at user-space address zero, it will find something which looks like a compound page. The destructor (stored in the lru.next field of the second page structure) is set to kernel_code(), a function defined within the exploit itself. Since the count is set to one, the call to page_cache_release() (which decrements that count) will conclude that there are no further references and, since the page looks like a compound page, the destructor will be called. At this point, the exploit has arbitrary code running in kernel mode, and the show is truly over. This code just sets the process's uid to zero (giving it root access), then engages in some assembly-language trickery to return immediately to user space, shorting out the rest of the cleanup process.
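
The release path described above can be sketched as a small user-space model. This is a simplification under stated assumptions, not the kernel's implementation: struct page_sim keeps only the fields the exploit sets, the destructor is stored in a typed lru_next field instead of a raw list pointer, SIM_PG_COMPOUND is an arbitrary bit number, and every function name is invented for the illustration. The "destructor" here merely counts its invocations.

```c
#include <assert.h>
#include <stddef.h>

#define SIM_PG_COMPOUND 14  /* hypothetical bit number, for this model only */

struct page_sim;
typedef void (*compound_dtor_sim)(struct page_sim *);

/* Only the fields the exploit fills in; the real struct page has more. */
struct page_sim {
    unsigned long flags;
    int count;
    struct page_sim *private;    /* head page of the compound page */
    compound_dtor_sim lru_next;  /* stands in for lru.next */
};

int sim_dtor_calls;

/* Stand-in for the exploit's kernel_code(): just records that it ran. */
void attacker_dtor_sim(struct page_sim *page)
{
    (void)page;
    sim_dtor_calls++;
}

/* Model of the release: drop one reference, run the destructor at zero. */
void page_cache_release_sim(struct page_sim *page)
{
    if (page->flags & (1UL << SIM_PG_COMPOUND)) {
        struct page_sim *head = page->private;

        if (--head->count == 0) {
            /* The destructor lives in the second page structure. */
            compound_dtor_sim dtor = head[1].lru_next;
            dtor(head);
        }
        return;
    }
    --page->count;               /* ordinary page: no destructor involved */
}

/* Arrange two fake page structures the way the exploit does. */
void setup_fake_pages(struct page_sim pages[2])
{
    pages[0].flags    = 1UL << SIM_PG_COMPOUND;
    pages[0].private  = &pages[0];
    pages[0].count    = 1;
    pages[0].lru_next = NULL;

    pages[1].flags    = 0;
    pages[1].private  = NULL;
    pages[1].count    = 0;
    pages[1].lru_next = attacker_dtor_sim;
}
```

After setup_fake_pages(), a single call to page_cache_release_sim() on the first structure takes the count from one to zero and invokes the attacker-supplied function, which is the whole trick in miniature.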

There are a couple of interesting implications from all of this. One, clearly, is that this exploit is not something which was bashed out by a script kiddie somewhere. It was written by somebody who understands low-level kernel code quite well and who is able to use that understanding to escalate an apparent information-disclosure vulnerability into a full code execution problem. It is, clearly, a mistake to underestimate those who write exploits, not all of whom immediately make their works known to the development community. One also should not assume that they have not already written exploits for other, still unfixed bugs.

Also worth noting is the fact that ordinary buffer overflow protection may well not have been effective against this vulnerability. The return address on the stack was not overwritten, and no exploit code was put in data areas. This episode has caused a renewed interest in technical security measures in the kernel. These measures are good, but it would be a mistake to think that they will fix the problem. What is really needed is stronger review of patches with security in mind; it is not yet clear to your editor that this review is happening.

The rest of the vmsplice() exploit story

Posted Mar 5, 2008 2:02 UTC (Wed) by iabervon (subscriber, #722)

I wouldn't be surprised if this was the result of some tool that assembles exploits out of
constraint violations. It wouldn't be hard to have a program that lists exploits for cases
where the kernel thinks that some particular data structure is in memory that's either
provided by the userspace or in user address space, which could pick up on what line of what
function gets an oops in the zero page. If somebody's got such a program, it would just be a
matter of noticing that a bad value causes an oops, and running the exploit generator.

Someone not a script kiddie clearly wrote the tricky part of this exploit, but may have
written it to exploit an entirely different bug, and left it somewhere that someone entirely
different could find it to generate a quick proof that the oops that came up with a simple
invalid input was actually exploitable.

The rest of the vmsplice() exploit story

Posted Mar 5, 2008 11:12 UTC (Wed) by epa (subscriber, #39769)

    if (!pipe->readers) {
	send_sig(SIGPIPE, current, 0);
	if (!ret)
	    ret = -EPIPE;
	    break;
    }

Is the indentation in this code extract correct?

The rest of the vmsplice() exploit story

Posted Mar 5, 2008 12:32 UTC (Wed) by Los__D (guest, #15263)

Heh, you've got good eyes! :)

The rest of the vmsplice() exploit story

Posted Mar 5, 2008 14:00 UTC (Wed) by jzbiciak (guest, #5246)

Ironically, "break;" is the only correctly indented line in that loop body. (That is, if you go with the Linux kernel standard 8 character indent.)

Indentation

Posted Mar 5, 2008 14:48 UTC (Wed) by corbet (editor, #1)

The indentation of the break line was clearly wrong (and different from the real code), I fixed it.

As for indent depth, I routinely shorten it in code samples to make the result fit in the browser window. The original code uses full-tab indents.

Indentation

Posted Mar 5, 2008 19:14 UTC (Wed) by jzbiciak (guest, #5246)

I did notice the consistent 4-character indents elsewhere.  I was being a tad tongue-in-cheek
because of the apparent irony.  (It was clear that the 'break' statement was the odd man out.)


Cheers,

--Joe

The rest of the vmsplice() exploit story

Posted Mar 5, 2008 21:22 UTC (Wed) by PaXTeam (guest, #24616)

good job Jon, now only one thing is missing: the behaviour on 32 bit vs. 64 bit archs (in
practice that'd be i386 vs. amd64). the issue here becomes clear when one looks at struct
partial_page and realizes that its first two members are int, not long, therefore when
treating them as a struct page *, the userland address the kernel will go to isn't a mere NULL
or PAGE_SIZE anymore (something mmap_min_addr could have protected against) but a high enough
address that makes it indefensible.

The rest of the vmsplice() exploit story

Posted Mar 6, 2008 14:29 UTC (Thu) by fuhchee (guest, #40059)

> Also worth noting is the fact that ordinary buffer overflow protection may 
> well have not been effective against this vulnerability. The return address
> on the stack was not overwritten, and no exploit code was put in data 
> areas.

Has there been any talk about extending NX (no-execute) style page
protection to within kernel space itself, to prevent it from executing
code residing in user-space pages?

The rest of the vmsplice() exploit story

Posted Mar 6, 2008 20:05 UTC (Thu) by spender (guest, #23067)

The UDEREF feature of PaX prevents the kernel from accessing userland memory directly and has
been doing so for 2 years now, close to a year before the vulnerability class ever became
public.  It makes use of segmentation on x86 to accomplish this, so due to Linus' rules it
will never be accepted into the mainline kernel.

-Brad

The rest of the vmsplice() exploit story

Posted Mar 6, 2008 20:11 UTC (Thu) by spender (guest, #23067)

If you're interested, I had posted this information earlier regarding UDEREF to some mailing
lists, courtesy of the PaX Team:
http://grsecurity.net/~spender/uderef.txt

-Brad