
Many more words on volatile ranges


By Michael Kerrisk
November 5, 2012

The volatile ranges feature provides applications that cache large amounts of data that they can (relatively easily) re-create—for example, browsers caching web content—with a mechanism to assist the kernel in making decisions about which pages can be discarded from memory when the system is under memory pressure. An application that wants to assist the kernel in this way does so by informing the kernel that a range of pages in its address space can be discarded at any time, if the kernel needs to reclaim memory. If the application later decides that it would like to use those pages, it can request the kernel to mark the pages nonvolatile. The kernel will honor that request if the pages have not yet been discarded. However, if the pages have already been discarded, the request will return an error, and it is then the application's responsibility to re-create those pages with suitable data.

Volatile ranges, take 12

John Stultz first proposed patches to implement volatile ranges in November 2011. As we wrote then, the proposed user-space API was via POSIX_FADV_VOLATILE and POSIX_FADV_NONVOLATILE operations for the posix_fadvise() system call. Since then, he has submitted at least another eleven iterations of his patch, incorporating feedback and new ideas into each iteration. Along the way, feedback from Dave Chinner caused John to revise the API, so that a later patch series instead used the fallocate() system call, with two new flags, FALLOCATE_FL_MARK_VOLATILE and FALLOCATE_FL_UNMARK_VOLATILE.
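
For readers unfamiliar with the proposed interface, the following sketch shows roughly how an application might use it. The two FALLOCATE_FL_* flags exist only in the patch series, not in mainline kernel headers, so the program is illustrative rather than buildable on a stock system; the tmpfs path is likewise just an example.

    /*
     * Sketch of the fallocate()-based volatile-range interface from the
     * patch series. FALLOCATE_FL_MARK_VOLATILE and
     * FALLOCATE_FL_UNMARK_VOLATILE are not in mainline headers, so this
     * builds only against a patched tree.
     */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #ifndef FALLOCATE_FL_MARK_VOLATILE
    #error "needs headers from the volatile-ranges patch series"
    #endif

    int main(void)
    {
        size_t len = 16 * 4096;
        /* /dev/shm is a tmpfs mount on most systems; the path is just
           an example. */
        int fd = open("/dev/shm/cache-example", O_RDWR | O_CREAT, 0600);
        if (fd == -1 || ftruncate(fd, len) == -1)
            exit(EXIT_FAILURE);

        char *cache = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (cache == MAP_FAILED)
            exit(EXIT_FAILURE);

        /* ... fill cache[0..len-1] with re-creatable data ... */

        /* The range is specified as file offsets, not pointers; this
           is the inconvenience Taras Glek complains about below. */
        if (fallocate(fd, FALLOCATE_FL_MARK_VOLATILE, 0, len) == -1)
            perror("mark volatile");

        /* Before reusing the data, unmark it; per the proposal, an
           error return means the pages were already discarded and the
           contents must be regenerated. */
        if (fallocate(fd, FALLOCATE_FL_UNMARK_VOLATILE, 0, len) == -1) {
            /* ... regenerate the cached data ... */
        }
        return 0;
    }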

The volatile ranges patches were also the subject of a discussion at the 2012 Linux Kernel Summit memcg/mm minisummit. What became clear there was that few of the memory management developers were familiar with John's patch set, and he appealed for more review of his work, since there were some implementation decisions that he didn't feel sufficiently confident to make on his own. As ever, getting sufficient review of patches is a challenge, and the various iterations of John's patches are a good case in point: several iterations received little or no substantive feedback.

Following the memcg/mm minisummit, John submitted a new round of patches, in an attempt to move this work further forward. His latest patch set begins with a lengthy discussion of the implementation and outlines a number of open questions.

The general design of the API is largely unchanged, with one notable exception. During the memcg/mm minisummit, John noted that repeatedly marking pages volatile and nonvolatile could be expensive, and he was interested in ideas about how the kernel could perform those operations more efficiently. Taras Glek (a Firefox developer) and others suggested an idea that could side-step the question of how to make the kernel operations more efficient: if a process attempts to access a volatile page that has been discarded from memory, the kernel could generate a SIGBUS signal for the process. This would allow a process that wants to briefly access a volatile page to avoid the expense of bracketing the access with calls to unmark the page as volatile and then mark it as volatile once more.

Instead, the process would access the data, and if it received a SIGBUS signal, it would know that the data at the corresponding address needs to be re-created. The SIGBUS signal handler can obtain the address of the memory access that generated the signal via one of its arguments. Given that information, the signal handler can notify the application that the corresponding address range must be unmarked as volatile and repopulated with data. Of course, an application that doesn't want to deal with signals can still use the more expensive unmark/access/mark approach.
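
The signal-handling side of this scheme requires only standard POSIX machinery: a handler installed with SA_SIGINFO receives the faulting address in the si_addr field of its siginfo_t argument. The following minimal sketch shows the general pattern; the recovery step is only hinted at, since it depends on the volatile-ranges API the application uses.

    #define _GNU_SOURCE
    #include <setjmp.h>
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static sigjmp_buf recover;
    static void * volatile fault_addr;

    /* With SA_SIGINFO set, the handler's siginfo_t argument carries
       the faulting address in si_addr. */
    static void sigbus_handler(int sig, siginfo_t *info, void *ctx)
    {
        (void)sig; (void)ctx;
        fault_addr = info->si_addr;
        siglongjmp(recover, 1);
    }

    int main(void)
    {
        struct sigaction sa;
        memset(&sa, 0, sizeof(sa));
        sa.sa_sigaction = sigbus_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGBUS, &sa, NULL);

        if (sigsetjmp(recover, 1)) {
            /* A purged volatile page was touched. At this point the
               application would unmark the affected range, regenerate
               the data under fault_addr, and retry the access. */
            printf("purged page at %p\n", (void *) fault_addr);
            return 1;
        }

        /* ... access pages that were previously marked volatile;
           touching a purged page lands us in the branch above ... */
        return 0;
    }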

There are still a number of open questions regarding the API. As noted above, following Dave Chinner's feedback, John revised the interface to use the fallocate() system call instead of posix_fadvise(), only to have it suggested by other memory management maintainers at the memcg/mm minisummit that posix_fadvise() or madvise() would be better. The latest implementation still uses fallocate(), though John thinks his original approach of using posix_fadvise() is slightly more sensible. In any case, he is still seeking further input about the preferred interface.

The volatile ranges patches currently only support mappings on tmpfs filesystems, and marking or unmarking a range volatile requires the use of the file descriptor corresponding to the mapping. In his mail, John explained that Taras finds the file-descriptor-based interface rather cumbersome:

In talking with Taras Glek, he pointed out that for his needs, the fd based interface is a little annoying, as it requires having to get access to tmpfs file and mmap it in, then instead of just referencing a pointer to the data he wants to mark volatile, he has to calculate the offset from start of the mmap and pass those file offsets to the interface.

John acknowledged that an madvise() interface would be nice, but it raises some complexities, because madvise() could be applied to any part of the process's address space. John wondered, for example, what semantics could be attached to volatile ranges that are applied to anonymous mappings (when pages are duplicated via copy-on-write, should the duplicated page also be marked volatile?) or to file mappings on non-tmpfs filesystems. Therefore, the latest patch series provides only the file-descriptor-based interface.

John has considered a number of other subtle details in the volatile ranges implementation. For example, if a large page range is marked volatile, should the kernel perform a partial discard of pages in the range when under memory pressure, or discard the entire range? In John's estimation, discarding a subset of the range probably destroys the "value" of the entire range. So the approach taken is to discard volatile ranges in their entirety.

Then there is the question of how to treat volatile ranges that overlap and volatile ranges that are contiguous. Overlapping ranges are coalesced into a single range (which means they will be discarded as a unit). Contiguous ranges are slightly different. The current behavior is to merge them if neither range has yet been discarded. John notes that coalescing in these circumstances may not be desirable: since the application marked the ranges volatile in separate operations, it may not necessarily wish to see both ranges discarded together.

But at this point a seeming oddity of the current implementation intervenes: the volatile ranges implementation deals with address ranges at byte granularity rather than at the page level. It is possible to mark (say) a page and a half as volatile. The kernel will only discard complete volatile pages, but, if a set of contiguous sub-page ranges covering an entire page is marked volatile, then coalescing the contiguous ranges allows the page to be discarded if necessary. In response to this and various other points in John's lengthy mail, Neil Brown wondered if John was:

trying to please everyone and risked pleasing no-one… For example, allowing sub-page volatile region seems to be above and beyond the call of duty. You cannot mmap sub-pages, so why should they be volatile?

John responded that it seemed sensible from a user-space point of view to allow sub-page marking and it was not too complex to implement. However, the use case for byte-granularity volatile ranges is not obvious from the discussion. Given that the goal of volatile ranges is to assist the kernel in freeing up what would presumably be a significant amount of memory when the system is under memory pressure, it seems unlikely that a process would make multiple system calls to mark many small regions of memory volatile.

Neil also questioned the use of signals as a mechanism for informing user space that a volatile range has been discarded. The problem with signals, of course, is that their asynchronous nature means that they can be difficult to deal with in user-space applications. Applications that handle signals incorrectly can be prone to subtle race errors, and signals do not mesh well with some other parts of the user-space API, such as POSIX threads. John replied:

Initially I didn't like the idea, but have warmed considerably to it. Mainly due to the concern that the constant unmark/access/mark pattern would be too much overhead, and having a lazy method will be much nicer for performance. But yes, at the cost of additional complexity of handling the signal, marking the faulted address range as non-volatile, restoring the data and continuing.

There are a number of other unresolved implementation decisions concerning the order in which volatile range pages should be discarded when the system is under memory pressure, and John is looking for input on those decisions.

A good heuristic is required for choosing which ranges to discard first. The complicating factor here is that a volatile page range may contain both frequently and rarely accessed data. Thus, using the least recently used page in a range as a metric in the decision about whether to discard a range could cause quite recently used pages to be discarded. The Android ashmem implementation (upon which John's volatile ranges work is based) employed an approach to this problem that works well for Android: volatile ranges are discarded in the order in which they are marked volatile, and, since applications are not supposed to touch volatile pages, the least-recently-marked-volatile order provides a reasonable approximation of least-recently-used order.

But the SIGBUS semantics described above mean that an application could continue to access a memory region after marking it as volatile. Thus, the Android approach is not valid for John's volatile range implementation. In theory, the best solution might be to evaluate the age of the most recently used page in each range and then discard the range with the oldest most recently used page; John suspects, however, that there may be no efficient way of performing that calculation.

Then there is the question of the relative order of discarding for volatile and nonvolatile pages. Initially, John had thought that volatile ranges should be discarded in preference to any other pages on the system, since applications have made a clear statement that they can recover if the pages are lost. However, at the memcg/mm minisummit, it was pointed out that there may be pages on the system that are even better candidates for discarding, such as pages containing streamed data that is unlikely to be used again soon (if at all). The question of how to derive good heuristics for deciding the relative order of volatile pages versus various kinds of nonvolatile pages remains unresolved.

One other issue concerns NUMA systems. John's latest patch set uses a shrinker-based approach to discarding pages, which allows for an efficient implementation. However, as was discussed at the memcg/mm minisummit, shrinkers are not currently NUMA-aware. As a result, when one node on a multi-node system is under memory pressure, volatile ranges on another node might be discarded, which would throw data away without relieving memory pressure on the node where that pressure is felt. This issue remains unresolved, although some ideas have been put forward about possible solutions.

Volatile anonymous ranges

In the thread discussing John's patch set, Minchan Kim raised a somewhat different use case that has some similar requirements. Whereas John's volatile ranges feature operates only on tmpfs mappings and requires the use of a file-descriptor-based API, Minchan expressed a preference for an madvise() interface that could operate on anonymous mappings. And whereas John's patch set employs its own address-range-based data structure for recording volatile ranges, Minchan proposed that volatility could be implemented as a new VMA attribute, VM_VOLATILE, and madvise() would be used to set that attribute. Minchan thinks his proposal could be useful for user-space memory allocators.

With respect to John's concerns about copy-on-write semantics for volatile ranges in anonymous pages, Minchan suggested volatile pages could be discarded so long as all VMAs that share the page have the VM_VOLATILE attribute. Later in the thread, he said he would soon try to implement a prototype for his idea.

Minchan proved true to his word, and released a first version of his prototype, quickly followed by a second version, where he explained that his RFC patch complements John's work by introducing:

new madvise behavior MADV_VOLATILE and MADV_NOVOLATILE for anonymous pages. It's different with John Stultz's version which considers only tmpfs while this patch considers only anonymous pages so this cannot cover John's one. If below idea is proved as reasonable, I hope we can unify both concepts by madvise/fadvise.

Minchan detailed his earlier point about user-space memory allocators by saying that many allocators call munmap() when freeing memory that was allocated with mmap(). The problem is that munmap() is expensive: a series of page table entries must be cleaned up, and the VMA must be unlinked. By contrast, madvise(MADV_VOLATILE) only needs to set a flag in the VMA.
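
To make the claimed cost difference concrete, here is a rough sketch of what an allocator's free and reuse paths might look like under Minchan's proposal. The MADV_VOLATILE and MADV_NOVOLATILE constants come from his RFC patch and are not in mainline headers; how a purge is reported back to the caller is one of the open API questions, so the error-return convention below is only an assumption borrowed from the description of the fallocate()-based interface.

    #include <stddef.h>
    #include <sys/mman.h>

    /* MADV_VOLATILE and MADV_NOVOLATILE exist only in Minchan's RFC
       patch; mainline <sys/mman.h> does not define them. */
    #ifndef MADV_VOLATILE
    #error "needs headers from the MADV_VOLATILE RFC patch"
    #endif

    /* Hypothetical free path: instead of munmap() (page-table
       teardown, VMA unlink), just flag the VMA so the kernel may
       reclaim the pages under memory pressure. */
    static void chunk_free(void *chunk, size_t len)
    {
        madvise(chunk, len, MADV_VOLATILE);
        /* ... push chunk onto the allocator's free list ... */
    }

    /* Hypothetical reuse path: clear the flag before handing the
       chunk out again. The error-return-means-purged convention is
       an assumption, not part of the published RFC. */
    static void *chunk_reuse(void *chunk, size_t len)
    {
        if (madvise(chunk, len, MADV_NOVOLATILE) != 0) {
            /* pages were discarded: treat contents as undefined */
        }
        return chunk;
    }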

However, Andrew Morton raised some questions about Minchan's use case:

Presumably the userspace allocator will internally manage memory in large chunks, so the munmap() call frequency will be much lower than the free() call frequency. So the performance gains from this change might be very small.

The whole point of the patch is to improve performance, but we have no evidence that it was successful in doing that! I do think we'll need good quantitative testing results before proceeding with such a patch, please.

Paul Turner also expressed doubts about Minchan's rationale, noting that the tcmalloc user-space memory allocator uses the madvise(MADV_DONTNEED) operation when discarding large blocks from free(). That operation informs the kernel that the pages can be (destructively) discarded from memory; if the process tries to access the pages again, they will either be faulted in from the underlying file, for a file mapping, or re-created as zero-filled pages, for the anonymous mappings that are employed by user-space allocators. Of course, re-creating the pages zero-filled is normally exactly the desired behavior for a user-space memory allocator. In addition, MADV_DONTNEED is cheaper than munmap() and has the further benefit that no system call is required to reallocate the memory. (The only potential downside is that process address space is not freed, but this tends not to matter on 64-bit systems.)
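
The MADV_DONTNEED semantics that Paul describes are easy to demonstrate on any Linux system; in the anonymous-mapping case, the discarded page simply comes back zero-filled on the next access:

    #define _GNU_SOURCE
    #include <assert.h>
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = 4096;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        assert(p != MAP_FAILED);

        p[0] = 42;                        /* dirty the page */
        madvise(p, len, MADV_DONTNEED);   /* destructively discard it */

        /* The mapping remains valid, but the old contents are gone:
           the next access faults in a fresh zero-filled page. */
        printf("%d\n", p[0]);             /* prints 0, not 42 */

        munmap(p, len);
        return 0;
    }

The contrast with Minchan's proposal is that MADV_DONTNEED discards the pages immediately and unconditionally, whereas a volatile range would lose its contents (and incur the repopulating page faults) only if the kernel actually came under memory pressure.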

Responding to Paul's point, KOSAKI Motohiro pointed out that the use of MADV_DONTNEED in this scenario is sometimes the source of significant performance problems for the glibc malloc() implementation. However, he was unsure whether or not Minchan's patch would improve things.

Minchan acknowledged Andrew's questioning of the performance benefits, noting that his patch was sent out as a request for comment; he agreed with the need for performance testing to justify the feature. Elsewhere in the thread, he pointed to some performance measurements that accompanied a similar patch proposed some years ago by Rik van Riel; looking at those numbers, Minchan believes that his patch may provide a valuable optimization. At this stage, he is simply looking for some feedback about whether his idea warrants some further investigation. If his MADV_VOLATILE proposal can be shown to yield benefits, he hopes that his approach can be unified with John's work.

Conclusion

Although various people have expressed an interest in the volatile ranges feature, its progress towards the mainline kernel has been slow. That certainly hasn't been for want of effort by John, who has been steadily refining his well-documented patches and sending them out for review frequently. How that progress will be affected by Minchan's work remains to be seen. On the positive side, Minchan—assuming that his own work yields benefits—would like to see the two approaches integrated. However, that effort in itself might slow the progress of volatile ranges toward the mainline.

Given the user-space interest in volatile ranges, one supposes that the feature will eventually make its way into the kernel. But clearly, John's work, and eventually also Minchan's complementary work, could do with more review and input from the memory management developers to reach that goal.


Would it be usable for GCs?

Posted Nov 5, 2012 15:47 UTC (Mon) by renox (guest, #23785) [Link]

I think that VM-aware GCs would be a nice feature ( http://lambda-the-ultimate.org/node/2391 ) but the patch to implement the needed part in Linux was rejected, so I wonder if volatile ranges would be usable for this?

IMHO no, but I'm not 100% sure.

Volatile whether you like it or not

Posted Nov 5, 2012 18:46 UTC (Mon) by epa (subscriber, #39769) [Link]

This could be a generalization of the OOM killer. When memory is tight, some unlucky process is chosen and any pages it hasn't used recently are marked as volatile. It keeps running but if it tries to access one of those pages then boom! This might be a bit better than just killing it immediately. I would guess that many user-space applications leak memory, in the sense that they allocate pages, use them for a bit and then never touch them again.

Volatile whether you like it or not

Posted Nov 5, 2012 20:23 UTC (Mon) by scottwood (guest, #74349) [Link]

The downside to that is you could still have time bombs lurking after the out-of-memory situation is cleared up. And if you notify the task that it had better clean up and restart, that could cause the bad memory to be touched when it otherwise wouldn't have...

Volatile whether you like it or not

Posted Nov 5, 2012 20:25 UTC (Mon) by jimparis (guest, #38647) [Link]

> This could be a generalization of the OOM killer. When memory is tight, some unlucky process is chosen and any pages it hasn't used recently are marked as volatile. It keeps running but if it tries to access one of those pages then boom! This might be a bit better than just killing it immediately.

That's a cool idea, but potentially a security risk via information leak -- if you use a lot of memory on a system and trigger the OOM killer, you can now determine how another process is behaving or treating its input by whether it goes boom or not. I can only think of contrived examples at the moment, but attackers are more clever.

Many more words on volatile ranges

Posted Nov 5, 2012 20:10 UTC (Mon) by martinfick (subscriber, #4455) [Link]

How about using swap heuristics when deciding which volatile pages to drop: if the volatile page would be swapped, drop it instead? That is assuming these volatile pages can't be swapped in the first place? If they can, then I am not sure how to drop them once they are in swap?

Many more words on volatile ranges

Posted Nov 5, 2012 20:18 UTC (Mon) by mikemol (guest, #83507) [Link]

I wonder if the count of minor faults, the recency of minor faults, and the duration of allocation could be combined to produce a useful heuristic.

Many more words on volatile ranges

Posted Nov 5, 2012 22:18 UTC (Mon) by dgc (subscriber, #6611) [Link]

The issue that I've found people don't fully grok w.r.t. volatile pages is that it doesn't just change the cache state of the page. When a volatile page is removed from memory, it *punches a hole* in the file - it physically discards the data from the file. IOWs, it's not just removed from memory, it's also removed from the underlying storage.

This key behaviour isn't mentioned at all in the article - it only talks about page cache and memory management. In reality, the volatile pages API is an asynchronous, memory-demand driven hole punch operation. This is a real filesystem operation, and the focus on tmpfs seems to blind people as to what the true data integrity ramifications of the operation are.

Fundamentally, the difference between current fadvise operations and volatile pages is that fadvise() never changes the contents of the file, just how it is cached and read. Every time you read that data it is guaranteed to be the same, even if it was removed from cache. If you mark that same range as volatile pages, there is no guarantee the data is the same next time you read it because the page may have been removed from both the page cache and the underlying storage. IOWs, VOLATILE is not an advisory operation - it is a _data manipulation operation_ that destroys data.

That's why I said fallocate() is more appropriate than fadvise() - fadvise() only affects in-memory page cache behaviour, and has no implications for data integrity. Removing volatile pages from memory is an asynchronous hole punching operation, and hole punching is something we use fallocate() for. :)

-Dave

Many more words on volatile ranges

Posted Nov 6, 2012 2:19 UTC (Tue) by foom (subscriber, #14868) [Link]

Yowch. Um, but, is that really the *desired* behavior? I'm rather surprised anyone would actually want that.

I can only think of real use-cases for volatile anonymous pages, and thus having an API that only works on files, instead, seems rather odd.

Many more words on volatile ranges

Posted Nov 6, 2012 20:42 UTC (Tue) by dgc (subscriber, #6611) [Link]

Yes, it is the desired behaviour. It's for file based caches, and being able to quickly mark sections of the cache regions that can be reclaimed if space is needed. The application then has to verify that a volatile section is intact if it wants to use it again.

The typical use case is that cache expiry simply marks regions of the cache as volatile, then reclaim is controlled wholly by the kernel memory pressure. Reuse of an expired cache entry does verification and if it is intact it gets marked "unvolatile" again, and the cycle goes around.

When you are doing file backed caching, then "reclaim" means freeing the backing store of the range in the file. i.e. punching a hole in the file.

-Dave

Many more words on volatile ranges

Posted Nov 6, 2012 22:28 UTC (Tue) by foom (subscriber, #14868) [Link]

I'm really surprised to hear that. As I said, I can't really imagine use-cases for auto-punching holes in a file-based cache. Would a web-browser use it for its disk-based cache? I dunno, why bother?

Besides, filesystems would all need to be updated to support volatile files. And the existence of files whose data might disappear at any point without anyone touching them seems a radical new feature for a filesystem, which seems to me like it could cause all sorts of problems.

But volatile anonymous memory pages? Yes please! That's certainly of use...which is probably why everyone is confused by this!

Many more words on volatile ranges

Posted Nov 7, 2012 0:49 UTC (Wed) by dgc (subscriber, #6611) [Link]

> As I said, I can't really imagine use-cases for auto-punching holes
> in a file-based cache.

That's no reason to say there aren't any. I gave a few when I first suggested that fallocate() be used instead. Marking parts of files as volatile can help optimise large scale cache management (e.g. for HSMs, SSD file caches, squid, etc) - it's not a phone/desktop web browser cache that I'm thinking of here.

> Besides, filesystems would all need to be updated to support volatile
> files.

Just like the page cache needs to be updated to support it, eh? :)

Besides, this is just confusing implementation with API. The API needs to support it from the start as we can't change that over time. Filesystem implementation can be done later, as well as change over time, so we don't need to have that up front....

> And the existence of files whose data which might disappear at any point
> without anyone touching them seems a radical new feature for a
> filesystem, which seems to me like it could cause all sorts of problems.

When you use your filesystem as an access cache for some other data, this is exactly the expected behaviour. Only right now, the cache application causes files to disappear at random points in time. Volatile ranges on files just moves a common cache management mechanism into the filesystem so it can be done when the filesystem needs it to be done....

-Dave.

Many more words on volatile ranges

Posted Nov 7, 2012 1:05 UTC (Wed) by neilbrown (subscriber, #359) [Link]

> causes files to disappear at random points in time.

Having files disappear spontaneously makes sense to me. A 'file' is a natural unit of caching. There is a clear distinction between the 'file' and the 'name', so that you can unlink a cache file even while it is in use, and the process using it will not lose out.

Having arbitrary blocks in the middle of a file disappear spontaneously is not something that I am so comfortable with. There is no 'natural unit' (so John had to invent 'ranges' and worry about semantics for merging etc) and there is no 'object/name' distinction so you have to think carefully about races between access and discard.

I would really like it if the whole 'volatile data' thing could be done with files. Files get marked as 'volatile' and the filesystem can unlink them as desired. One problem is that open/mmap/close is a whole lot slower than any single system call, and definitely slower than a simple memory access that might (but usually doesn't) cause SIGBUS.

Maybe an madvise style interface that works for ranges in anonymous memory, and some sort of per-file interface for filesystems when a shared cache is required.

I'm not sure that one size can fit all.

Many more words on volatile ranges

Posted Nov 7, 2012 8:04 UTC (Wed) by dgc (subscriber, #6611) [Link]

> Having arbitrary blocks in the middle of a file disappear
> spontaneously it not something that I am so comfortable with.

Fundamentally, HSMs make blocks disappear from files spontaneously. And those blocks come back when you try to read them. IOWs, the filesystem is basically a namespace with a great big data cache in front of some kind of slower storage.

Volatile ranges turn HSM space management on its head - instead of moving data to tape when you run out of space, we can do it pre-emptively and mark the duplicated data ranges as volatile. When the filesystem runs out of space, it can just punch out the volatile ranges and everything continues quickly rather than blocking waiting for the HSM to move data out to tape.

Then when you add range based hot data tracking as the method of selecting what parts of the files are copied to tape and marked volatile, you've got quite a neat way of automatically managing the filesystem space that doesn't impact performance when space runs low or the HSM moves frequently accessed data to tape mistakenly...

Big picture - we've got lots of infrastructure on the way for doing interesting things with our storage stack - the only thing missing is the application that ties them all together....

> There is no 'natural unit' (so John had to invent 'ranges' and worry
> about semantics for merging etc)

There doesn't need to be a natural unit. In reality, it is a filesystem block, but having a tracking structure is necessary regardless of unit. Using the mapping tree proved impractical for various reasons, and the simplest solution was to use its own tree. Volatile ranges on files are not bad because we have no generic range tree library in the kernel that could be used for tracking them....

> and there is no 'object/name'
> distinction so you have to think carefully about races between access
> and discard.

Same for any method of tracking volatile ranges... :)

> I'm not sure that one size can fit all.

Probably not - the anonymous memory usage is recent, but I think it's separate to filebacked volatile regions which is what John's original proposal was for. Lumping them together as equivalent functionality is not really correct....

-Dave.

Many more words on volatile ranges

Posted Nov 7, 2012 1:31 UTC (Wed) by foom (subscriber, #14868) [Link]

> Only right now, the cache application causes files to disappear at random points in time. Volatile ranges on files just moves a common cache management mechanism into the filesystem so it can be done when the filesystem needs it to be done....

Yes, and having filesystems able to disappear (parts of) files all by themselves with no application involvement seems like a *major* change, and seems rather scary to me.

I mean, in the intended usage, the application itself expects its data to disappear, sure. But, I'm wondering about other knock-on effects of these sorts of files being able to exist. Will I, as admin, be able to easily tell that some files are "disappear-y"? New feature added to "ls"? How can I tell how much space is used by such data? New fields in "df"?

What sorts of controls over who can mark data like that will there be? Can it cause a security issue for data to disappear in the middle of a file unexpectedly? Maybe clearing volatile-ness on file ownership or permissions change fixes that?

I dunno...it just seems like so much complication versus in-memory volatility that it doesn't seem worth it. And, worse, pinning it to fallocate instead of something like madvise makes the API so much more fiddly to use for a simple in-memory case.

Many more words on volatile ranges

Posted Nov 7, 2012 6:29 UTC (Wed) by dlang (guest, #313) [Link]

This is a new way to punch holes in files, but the concept of files with holes in them is far from new.

This leads to a couple obvious answers

> What sorts of controls over who can mark data like that will there be? Can it cause a security issue for data to disappear in the middle of a file unexpectedly? Maybe clearing volatile-ness on file ownership or permissions change fixes that?

Tie this to the ability to modify/truncate the file and you are not adding any new possibilities, just new ways to trigger the possibilities (someone who can modify the file can truncate it, write a new file missing some data, etc)

> How can I tell how much space is used by such data?

the same way you would find out how much space is used by other sparse files today.

Many more words on volatile ranges

Posted Nov 8, 2012 3:04 UTC (Thu) by foom (subscriber, #14868) [Link]

The new thing isn't the holes after they've been punched, it's that holes can be pre-marked, such that they will be punched by the kernel some undetermined amount of time in the future. Perhaps 2 months later, after rebooting 20 times, and plugging the disk into another computer.

> Tie this to the ability to modify/truncate the file and you are not adding any new possibilities, just new ways to trigger the possibilities (someone who can modify the file can truncate it, write a new file missing some data, etc)

But without taking extra preventative measures, the ability to ever once *have* *had* permission to modify a file might then result in the ability to modify the file (by zeroing out some blocks) any arbitrary time in the future.

> the same way you would find out how much space is used by other sparse files today.

But these new files aren't sparse immediately; volatile data does use up actual space, until it gets dropped on the floor. That's a brand new type of thing.

Many more words on volatile ranges

Posted Nov 8, 2012 7:47 UTC (Thu) by dlang (guest, #313) [Link]

> But without taking extra preventative measures, the ability to ever once *have* *had* permission to modify a file might then result in the ability to modify the file (by zeroing out some blocks) any arbitrary time in the future.

remember that permission checks are only made when the file is opened, so even without this you could hold the filehandle open and erase blocks at any time.

Yes, this can now happen even after the program has exited (more below)

> But these new files aren't sparse immediately; volatile data does use up actual space, until it gets dropped on the floor. That's a brand new type of thing.

you can do one of two things by default (and either one is defensible)

1. show the size as it currently occupies disk space

2. show the size as if the holes had been reclaimed by the filesystem

I would probably do #1, because with sparse files, you already have a situation where the file size on disk can change significantly, so this is not that different.

to cover the rest of the first problem, and the corner cases of the second problem, there will need to be a utility to report on what's been tagged as being volatile, but that tool will probably just be needed for odd, corner cases. The existing options for dealing with sparse files should cover the 'normal' needs

Costs to re-fill discarded data

Posted Nov 6, 2012 2:48 UTC (Tue) by gdt (subscriber, #6284) [Link]

One thing which would be useful is marking the volatile memory allocation with a cost. For example, a DNS forwarder cache can gain greatly from volatile memory, as we don't have to guesstimate how much of the machine's memory we should leave to other processes, but simply use the lot knowing that a machine which is used for purposes other than only DNS forwarding will see memory pressure. However, re-filling the forwarder isn't free, it takes packet transmission and reception, so we should only start discarding the forwarder cache after other lower-hanging fruit.

Moreover, you want to be able to change the cost. Take a streaming video application: you want to pull in as much video in advance of viewing as you can, and you don't want to discard viewed memory as the user may rewind. There are three costs there: viewed video has the lowest cost (the user probably won't rewind), impendingly-viewed video has a high cost (discarding the data will lead to user-visible artifacts), unviewed video which won't be seen for a minute or more has a low cost. The interesting thing here is that the costs change all the while.

Many more words on volatile ranges

Posted Nov 6, 2012 10:16 UTC (Tue) by dgm (subscriber, #49227) [Link]

Doing this all in memory seems too complex. Probably I'm missing something, but why not do it with files? Create a file to store the cached data. Then, when the data is needed again test to see if it's still there or has been discarded by the kernel. The data would never touch any backing store, living in RAM until discarded.

The concept could even be extended to disks, with files that are automatically removed by the kernel if the free space on the file system goes below certain threshold.

It seems so obvious that I have the feeling that this already does exist, doesn't it?

Many more words on volatile ranges

Posted Nov 6, 2012 20:54 UTC (Tue) by dgc (subscriber, #6611) [Link]

> Doing this all in memory seems too complex. Probably I'm missing
> something, but why not do it with files?

That's the thing - it is doing this with files. People are focussing on memory/page cache behaviour because the initial target was tmpfs files.

> The concept could even be extended to disks, with files that are
> automatically removed by the kernel if the free space on the file
> system goes below certain threshold.

It already does that, through using the hole punching interface.

Indeed, this is exactly the reason I originally suggested it needs to be fallocate() based. There are good reasons for allowing the filesystem to track volatile ranges to allow discards to be done when the filesystem reaches ENOSPC thresholds. Once again, the focus on tmpfs has tended to make people think purely of "this is only for page cache pages" and that ignores the wider usefulness it has for disk based caching applications.

-Dave.

Many more words on volatile ranges

Posted Nov 7, 2012 18:07 UTC (Wed) by dgm (subscriber, #49227) [Link]

> That's the thing - it is doing this with files. People are focussing on memory/page cache behaviour because the initial target was tmpfs files.

Well, maybe I didn't express myself right. What I meant was _whole_ files. As neilbrown pointed out, they are a more "natural" unit of caching at the application level.

Many more words on volatile ranges

Posted Nov 15, 2012 23:44 UTC (Thu) by mm7323 (subscriber, #87386) [Link]

I think there is much value here.

It seems to me that a forgetful file system is the perfect model for a volatile cache. You can use the last access-time and file size as a metric for simple LRU removal together with the classic open(), read(), write(), mmap(), unmap(), close(), unlink() interfaces for access.

The open() syscall can trivially increment a reference count on some file and prevent the content being reclaimed while open. Open() can resolve within the kernel whether to succeed and return a file descriptor, or fail in the case that the file has already been reclaimed (e.g. due to memory pressure) and return a suitable error code such as ENOENT.

Once a valid descriptor has been obtained, read() and write() can trivially access the file contents, or mmap() could be used to further increase the reference count and create a memory mapping. Once the reference count is again zero, such volatile files within the filesystem would again be eligible to be reclaimed and removed at any time.

I think the major benefits of this would be that the user-space interface is traditional and easy to understand and there is no need to handle signals or actually use mmap() or munmap() to benefit from such a system. As you suggest, there is also the notion of a useful unit - a file. Discarding individual pages is a nonsense to userspace, whereas the idea that a file may at some point atomically be deleted is much easier to grasp and use without bugs. Files also allow applications to decide what is a useful unit to keep/lose, simply by deciding what to store within some file.

This obviously maps well to a browser's cache and similar such applications, though it could be argued that open() and mmap() are too heavy to use when backing a malloc() type allocator unless used at a very coarse level, in which case the benefits of reclaim could be reduced (it becomes all or nothing). That said, dropping the volatile files of a process may be a kind first stage which keeps a system running before the oom_killer steps in.
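
If such a "forgetful" filesystem existed, the access pattern it implies might look like the following sketch; the /volatile mount point and the regeneration helper are invented for illustration, since no such filesystem exists:

    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    /* Hypothetical: the application's rebuild logic for a cache
       entry that the kernel has reclaimed. */
    static void regenerate_cache_entry(const char *path)
    {
        int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
        if (fd != -1) {
            /* ... write the regenerated data ... */
            close(fd);
        }
    }

    static int open_cache_entry(const char *path)
    {
        /* In the proposed model, a successful open() pins the file
           against reclaim; ENOENT means it has already been dropped. */
        int fd = open(path, O_RDONLY);
        if (fd == -1 && errno == ENOENT) {
            regenerate_cache_entry(path);      /* rebuild, then retry */
            fd = open(path, O_RDONLY);
        }
        return fd;
    }

    /* usage: int fd = open_cache_entry("/volatile/cache/entry-0042"); */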

Many more words on volatile ranges

Posted Nov 6, 2012 14:51 UTC (Tue) by ibukanov (subscriber, #3942) [Link]

The difference between volatile ranges and MADV_DONTNEED is that MADV_DONTNEED does not allow freeing the memory if one disables overcommit, right? However, if overcommit is enabled, then AFAICS the only benefit of volatile ranges is that they allow freeing page table entries. Are there other use cases?

Many more words on volatile ranges

Posted Nov 7, 2012 1:48 UTC (Wed) by Trelane (subscriber, #56877) [Link]

> it seems unlikely that a process would make multiple system calls to mark many small regions of memory volatile.

Auto-freeing memory allocator? On call to free(), put in the linked list & mark volatile. If it's still there when you want it back, great. Otherwise, re-allocate.

Many more words on volatile ranges

Posted Nov 7, 2012 11:52 UTC (Wed) by juliank (guest, #45896) [Link]

Most memory allocators don't do this on a page level though, but rather on a span of pages. Currently they use MADV_DONTNEED, but something that doesn't zero memory is more useful.

Many more words on volatile ranges

Posted Nov 7, 2012 12:54 UTC (Wed) by Trelane (subscriber, #56877) [Link]

Ah, thanks!

Many more words on volatile ranges

Posted Nov 9, 2012 17:01 UTC (Fri) by mezcalero (subscriber, #45103) [Link]

One use case for volatile ranges is the PulseAudio memory handling. In PA each client process and the server process set up a memory mapped region that they grant other processes read (but not write) access to, which is then used to pass sample data between each other. Large parts of this memory mapped area contain mostly cached samples. It would be highly desirable if we could drop those if the system gets into memory pressure. After all PA is seldom the reason why people start up their computers, but only one part of the stack that is auxiliary to what they actually want to do.

I believe getting SIGBUS handling for the dropped ranges would be a big step towards the right direction. That way we could drop the caches nicely and repopulate them easily when we need them again. However to be really useful the userspace interface to this would need some non-trivial upgrading: the problem with all signal based logic is that there can only be a single handler installed for a signal per-process. That basically means that different components of an application will fight about who "owns" the SIGBUS signal handler. Think of an app that includes mozilla code (for example by embedding some mozilla lib, or by actually being firefox) and libpulse at the same time. Both the mozilla code and libpulse would like to get SIGBUS for the client memory ranges each manage, but only one can actually install the handler. As long as this API problem is not fixed volatile ranges are only useful for a small subset of problems.

What we need is an API so that userspace can install SIGBUS (and SIGSEGV, ..) handlers for specific memory regions only. This could probably be implemented without any kernel support, but it would have to be done on the level of glibc because otherwise foreign code will hardly be willing to accept it.

Anyway, the point I am trying to make here is that to solve this problem properly just hacking the kernel is not enough. As much work has to be done in userspace to get the APIs right, because sigaction() alone as it stands now is simply not enough.
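
The per-region dispatching that mezcalero asks for can at least be prototyped in user space: a registry maps address ranges to callbacks, and a single process-wide SA_SIGINFO handler routes faults by si_addr. The sketch below is hypothetical and ignores locking, handler chaining, and the cross-library coordination problem described above:

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stddef.h>
    #include <string.h>

    typedef void (*region_cb)(void *fault_addr);

    struct region { char *start; size_t len; region_cb cb; };

    #define MAX_REGIONS 64
    static struct region regions[MAX_REGIONS];
    static size_t nregions;

    /* Single process-wide handler that routes a fault to the owner
       of the affected range. The callback runs in signal context, so
       it must be async-signal-safe, and it must make the address
       accessible again (remap and repopulate) before returning, or
       the access will simply fault again. */
    static void dispatch_sigbus(int sig, siginfo_t *info, void *ctx)
    {
        (void)ctx;
        char *addr = info->si_addr;
        for (size_t i = 0; i < nregions; i++)
            if (addr >= regions[i].start &&
                addr < regions[i].start + regions[i].len) {
                regions[i].cb(addr);
                return;
            }
        /* Not one of ours: fall back to the default action. */
        signal(sig, SIG_DFL);
        raise(sig);
    }

    void register_sigbus_region(void *start, size_t len, region_cb cb)
    {
        static int installed;
        if (!installed) {
            struct sigaction sa;
            memset(&sa, 0, sizeof(sa));
            sa.sa_sigaction = dispatch_sigbus;
            sa.sa_flags = SA_SIGINFO;
            sigaction(SIGBUS, &sa, NULL);
            installed = 1;
        }
        if (nregions < MAX_REGIONS)
            regions[nregions++] = (struct region){ start, len, cb };
    }

As the comment notes, making foreign libraries actually use such a registry (rather than installing their own sigaction() handler and clobbering everyone else's) is the hard part, which is why a glibc-level solution is suggested.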

Many more words on volatile ranges

Posted Nov 10, 2012 22:19 UTC (Sat) by zlynx (guest, #2285) [Link]

As I understand it SIGBUS is only one of two ways to discover that a volatile page has gone missing. The other method requires making a system call.

Unless I am missing something about PulseAudio? Perhaps this read-only memory sharing makes it impossible to use the system call to mark the volatile range as in-use?


Copyright © 2012, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds