timerfd() and system call review

By Jonathan Corbet
August 14, 2007

One of the fundamental principles of Linux kernel development is that user-space interfaces are set in stone. Once an API has been made available to user space, it must, for all practical purposes, be supported (without breaking applications) indefinitely. There have been times when this rule has been broken, but, even in the areas known for trouble (sysfs, for example), the number of times that the user-space API has been broken has remained relatively small.

Now consider the timerfd() system call, which was added to the 2.6.22 kernel. The purpose of this call is to allow an application to obtain a file descriptor to use with timer events, eliminating the need to use signals. The system call prototype, as found in 2.6.22, is:

    long timerfd(int fd, int clockid, int flags, struct itimerspec *utimer);

If fd is -1, a new timer file descriptor will be created and returned to the application. Otherwise, a timer will be set using the given clockid for the time specified in utimer. The TFD_TIMER_ABSTIME flag can be set to indicate that an absolute timer expiration is needed; otherwise the specified time is relative to the current time. The flags argument can also be used to request a repeating timer.

There is another aspect to the timerfd() API, though: a read on the timer file descriptor will return an integer value saying how many times the timer has fired since the previous read. If no timer expirations have happened, the read() call will block. In the 2.6.22 kernel, the returned value was 32 bits (on all architectures). It has since been decided that a 64-bit value would have been more appropriate, and a patch making that change has been merged for 2.6.23. The 2.6.22.2 stable update also contained the API change.

That is not the full story, though. Michael Kerrisk, while writing manual pages for the new system call, encountered a couple of other shortcomings with the interface. In particular, it is not possible to ask the system for the amount of time remaining on a timer. Other timer-related system calls allow for this sort of query, either as a separate operation or when changing a timer. Michael thought that the timerfd() system call should work similarly to those which came before.

Michael has now posted a patch fixing up the timerfd() interface. With this patch, the system call would now look like this:

    long timerfd(int fd, int clockid, int flags, struct itimerspec *utimer,
                 struct itimerspec *outmr);

The new outmr pointer must be NULL when the file descriptor is first being created. In any other context, it will be used to return the amount of time remaining at any timerfd() call. So user space can query a timer non-destructively by calling timerfd() with a NULL value for utimer. If both timer pointers are non-NULL, the timer will be set to utimer, with its previous value being returned in outmr.

This is, of course, an entirely incompatible change to an API which has already been exported to user space; any code which is using timerfd() now will break if the patch is merged. By the rules, such a change should not be merged, but it appears that there is a good chance that the rules will be bent this time around. One can argue that, in a real sense, the API has not yet been made available to user space: there has been no glibc release which supports timerfd(). The number of applications using this system call must be quite low - if, in fact, there are any at all. So a change at this point, especially if it can get into 2.6.23, will improve the interface without actually causing any user-space pain.

Fixing timerfd() might still be possible. But there is no denying that we would be better off if we could eliminate this kind of API problem before it gets into a stable kernel release and possibly has to be supported for many years. Therein lies the real problem: system calls (and other user-space API features) are being added to the kernel at a high rate, but review of these changes tends to lag behind. Given the difficulty of fixing user-space API mistakes, it would seem that the review standards for API additions should be especially high. Causing that to happen will not be easy, though; reviewer attention is a scarce resource throughout the free software community.

An idea which has been raised in the past is to explicitly mark new user-space interfaces as being in a volatile "beta" state. For as long as the API remains in that state, the kernel developers are free to change it. Applications would, during this period, rely on the API at their peril. This idea has been rejected in the past, though; it is seen as a way of avoiding proper thought ahead of merging a new API into the kernel. Assuming that view still holds, another way will have to be found.

One part of the solution might well be seen in how the timerfd() problems came to light. Michael has demonstrated something your editor has also encountered a number of times: one of the best ways to find shortcomings in an API is to attempt to document it comprehensively. If the kernel community were to resolve that it would not merge user-space API features in the absence of complete documentation, it might just provide the necessary incentive to get that last review pass done.

This idea seems likely to come up at next month's kernel summit (for which a preliminary agenda has just been posted). How it will be received is anybody's guess; writing documentation appears to be a task so challenging that even kernel hackers fear to try it. This challenge may be worth taking up, though, if the reward is fewer long-lasting user-space API problems in the future.

timerfd() and system call review

Posted Aug 14, 2007 18:30 UTC (Tue) by mhelsley (guest, #11324) [Link]

"one of the best ways to find shortcomings in an API is to attempt to document it comprehensively"

True. However it's not nearly as good when the person involved in writing the code implementing the API also writes the documentation -- that does not strain underlying assumptions in the way that thorough review and proper documentation processes tend to.

So perhaps once the API is in glibc and documented by another party it could be considered "stable".

timerfd() and system call review

Posted Aug 15, 2007 1:24 UTC (Wed) by mkerrisk (subscriber, #1978) [Link]

"one of the best ways to find shortcomings in an API is to attempt to document it comprehensively"

Agreed. However, I've been trying to encourage kernel developers to supply the beginnings of a man page that I then review. Even that is a very fruitful process, when it happens. But the ideal is of course as you suggest a much better review and documentation process involving kernel developers.

> So perhaps once the API is in glibc and documented by another party it could be considered "stable".

There are many problems with this idea: some APIs never make it to glibc; sometimes glibc provides a wrapper that modifies the API; sometimes documentation does not arrive for a very long time...

timerfd() and system call review

Posted Aug 18, 2007 18:16 UTC (Sat) by landley (guest, #6789) [Link]

Two points:

1) Before a third party can document an API, they have to learn how to
use it, which is a chicken and egg problem (especially if you're trying
to be thorough). If nothing else it's insanely time consuming.

2) I don't pay much attention to glibc, I pay attention to uClibc. I'll
happily document what uClibc implements, and ignore the rest because if
uClibc doesn't implement it, it really can't be all that important.

timerfd() and system call review

Posted Aug 19, 2007 7:09 UTC (Sun) by mkerrisk (subscriber, #1978) [Link]

> 1) Before a third party can document an API, they have to learn how to use it, which is a chicken and egg problem (especially if you're trying to be thorough). If nothing else it's insanely time consuming.

Doing it that way would be. Obviously efficiently written documentation needs to be collaborative, either written by the developer, and improved via critique from peers and/or individuals well versed in writing documentation, or written by a third party with help from the developer, who explains the API.

timerfd() and system call review

Posted Aug 20, 2007 5:20 UTC (Mon) by landley (guest, #6789) [Link]

You didn't follow the saga of my attempts to document the subset of sysfs
used to populate /dev. Responses I got included:

A) Contradictory information from different developers.
B) Corrections consisting of "that's wrong" with no hint about the
approved way to do it.
C) Being repeatedly told I was an idiot and not worth their time.
D) Questioning why anyone would want to document this when someone's
already written a program using it.
E) Being repeatedly told "there is no stable API", I.E. outright
resistance to documenting this area because they didn't want to lose the
freedom to change it on a whim.

I also got some useful information, but both of the developers I need to
talk to are essentially spam-blocking me now. Oh well.

Also, although development and debugging parallelize just fine, editorial
functions don't. This is why you generally don't have multiple
maintainers whose jurisdictions overlap unless there's a clear hierarchy
of who reports to who. Writing documentation to be read by end users has
a significant editorial function.

Rob

timerfd() and system call review

Posted Aug 15, 2007 15:18 UTC (Wed) by dougg (subscriber, #1894) [Link]

In the 9 years that I have designed, implemented, supported and documented one particular kernel API not one of the thousands of emails that I have received concerning that API was an offer to write documentation.

I did notice that the glibc folks removed some documentation from the header file that defines the API. I'm not aware that they put that documentation anywhere else. And someone recently noted the discrepancy between the glibc distributed header and the kernel driver header. The solution proposed was to remove the documentation from the driver header as well.

Now I get to sit back and watch someone else go through a similar process with the bsg driver. And that driver is going to be released with an API that has pending changes (at least 6 months old) held up due to kernel bureaucracy.

timerfd() and system call review

Posted Aug 18, 2007 17:34 UTC (Sat) by landley (guest, #6789) [Link]

Which API are you referring to?

(Someone who writes documentation.)

timerfd() and system call review

Posted Aug 14, 2007 18:49 UTC (Tue) by bronson (subscriber, #4806) [Link]

> it is seen as a way of avoid proper thought ahead of merging a new API into the kernel.

The problem with this is that no amount of proper thinking will shake out everything. Comprehensive documentation goes a long way, certainly, but some problems will only come to light once people actually start using an interface.

The kernel team is really good at following convention and knocking bad patches down; I don't see any reason to believe that a beta period would turn into a crutch to avoid proper thinking. Are there any other reasons a beta period would be bad? It just seems like common sense to me!

timerfd() and system call review

Posted Aug 14, 2007 19:00 UTC (Tue) by musicon (guest, #4739) [Link]

Perhaps allowing interface changes to remain in "beta" through the first released kernel containing them, and then "final" in the next kernel release.

Eg, timerfd is beta in .22, final in .23?

timerfd() and system call review

Posted Aug 14, 2007 19:33 UTC (Tue) by nix (subscriber, #2304) [Link]

Certainly having a system where syscalls *don't* automatically transition
out of beta state is a recipe for disaster: the lesson of
CONFIG_EXPERIMENTAL is that a lot of them will simply never transition at
all :(

timerfd() and system call review

Posted Aug 17, 2007 17:40 UTC (Fri) by giraffedata (guest, #1954) [Link]

I agree. If you have a "We're not committed to this and don't stand behind it yet" status, things stay in that status a long time and then there's no distinction between function in that status and not.

I think it should be like the 5-second rule for reclaiming food dropped on the floor. You have one release to change your mind and redo a user interface before it becomes set in stone.

The effectiveness of that would depend entirely upon how well the rule can be communicated to users.

timerfd() and system call review

Posted Aug 28, 2007 17:31 UTC (Tue) by efexis (guest, #26355) [Link]

So you have a ratification process, where the next (or next+1) release either ratifies the API and it becomes 'stable', alters it (and it remains 'experimental'), or removes it. This stops anything sitting in the experimental state for too long, as developers have to make the improvements or formalise it to keep it in the kernel.

timerfd() and system call review

Posted Aug 15, 2007 1:32 UTC (Wed) by mkerrisk (subscriber, #1978) [Link]

But in the absence of a formalized review process, this still won't fix the problem.

timerfd() and system call review

Posted Aug 18, 2007 18:19 UTC (Sat) by landley (guest, #6789) [Link]

Don't confuse formalizing with automating. Adding bureaucratic
procedures to collaborative volunteer development really doesn't improve
matters. (You can't do this unless you've filled in the proper forms.
Isn't this a fun hobby?)

Having infrastructure so that something automatically times out if not
dealt with, and people can check what timeouts are pending, that might
help.

timerfd() and system call review

Posted Aug 19, 2007 7:52 UTC (Sun) by mkerrisk (subscriber, #1978) [Link]

I doubt that timeouts are enough. Too many things just get by because no-one notices (in time).

By a formalized process, I was suggesting something like a sign-off from one (or preferably more) individuals who might not necessarily be kernel developers, but must be well versed in the Unix system call APIs, who would:

1) Consider the design of the API, in particular aspects such as:
   - Generality of the design: is it too tailored towards a single purpose? Could it be generalized to suit a wider range of uses?
   - Simplicity; e.g., is the API overly complex? Complexity often hints at bad design.
   - Consistency with similar APIs; e.g., if this API takes similar arguments to an existing API, does it interpret them in a similar way to that API? (This might seem obvious, but there are certainly some glaring examples where new Linux syscalls have failed to follow this simple idea.)
   - Integration with existing APIs; e.g., could this API perhaps be better written as something that leverages existing features of the existing system call API? timerfd() is an interesting case in point. One of the things I have now begun to wonder is whether it would be feasible to have tight integration of timerfd with the POSIX timers API, so that all that is required is a simplified timerfd() call that takes only a clockid argument and creates a timerfd descriptor and returns a timer_t * which is then manipulated using the traditional timer_settime(), timer_gettime(), etc.
2) Verify that the API has undergone sufficient testing, either by examining the coverage of the test programs provided by the API developer, or by writing programs of their own (in fact I'd say the latter is in any case necessary).
3) Verify that the API has been well documented, either by the developer, or by third parties (working in collaboration with the developer).

timerfd() and system call review

Posted Aug 20, 2007 5:08 UTC (Mon) by landley (guest, #6789) [Link]

I could see a separate API list, for discussing JUST API issues and not
implementation, which stuff could get cc'd to the way I cc stuff to
linux-doc. (Things just get buried and lost on linux-kernel.)

But adding more layers of bureaucracy seldom improves a process, and
inflicting bureaucracy on volunteers makes them go away. (We added
signed-off-by for _tracking_ purposes, in response to a clear and present
danger the nature of which was a lawsuit.)

Adding additional layers of verification and certification before
something can be merged is never how Linux has done stuff. Linus still
accepts some patches directly, when you can get his attention...

timerfd() and system call review

Posted Aug 14, 2007 20:46 UTC (Tue) by asamardzic (guest, #27161) [Link]

Great to see that lots of things that once required messing with signals are getting exposed through some kind of file descriptor, not only in Linux but in other Unix flavors as well. It really feels like the "Unix way"; let's hope SUS will catch up and standardize similar syscalls someday...

timerfd() and system call review

Posted Aug 15, 2007 0:49 UTC (Wed) by gdt (subscriber, #6284) [Link]

IETF requires two independent implementations before approving a protocol. Something similar should be true for system calls: used by two independent applications. There's a lot of narrow thinking exposed in some system calls, and this would widen that. I've had the joy of using the TASKSTATS feature in a way not foreseen by its author, and until recently it wouldn't compile from user space at all since its header file was omitted.

timerfd() and system call review

Posted Aug 16, 2007 11:45 UTC (Thu) by nix (subscriber, #2304) [Link]

But a syscall can't easily be used by any applications until it's in glibc
or some other library, and once it is you can't change it because it's
exposed to applications so you can't tell who might be using it.

(Most of these new syscalls are Linux-specific, anyway, and thus likely to
see comparatively little use outside of abstraction layers like glibc.
POSIX still remains king.)

timerfd() and system call review

Posted Aug 15, 2007 1:58 UTC (Wed) by mkerrisk (subscriber, #1978) [Link]

Jon, thanks for the article! One point not made in the article is that in 2.6.22, the timerfd() API is broken (this I also discovered while working on the man page). In 2.6.22, it was intended that read() from a timerfd() file descriptor would return a 4-byte value, but a bug meant that only the least significant byte was returned. So the 2.6.22 interface is in any case unusable. (The fix for this problem went in with the switch to 8-byte reads.)

> If the kernel community were to resolve that it would not merge user-space API features in the absence of complete documentation, it might just provide the necessary incentive to get that last review pass done.

Given the number of bugs and interface problems I've noticed while working on man pages, I think this would be a hugely effective step. This will also be the subject of a presentation that I'll be making at linuxconf.eu, which precedes this year's Kernel Summit. Arnd Bergmann will cover some related ground at the conference, talking about "How to not invent kernel interfaces".

timerfd() and system call review

Posted Aug 15, 2007 16:45 UTC (Wed) by davecb (subscriber, #1574) [Link]

The Multicians did the whole mutation thing somewhat
better than we do in the Unix world... they tended to
document first, by writing white papers to get their
algorithms discussed.

After they delivered, they then froze the parameter
list part of the APIs, but they version-numbered structures
passed as parameters, so one could
- add new elements to the end (a compatible change)
- retire old parameters (version change required), or
- change precision of parameter values (version change required).

I've used the same trick on Unix/Linux to avoid having to
have "flag days" and allow several of us (Hi, Edsel!) to
develop in parallel, even when we were changing interfaces
with wild abandon.
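
One way the version-numbered structure trick might look in C is sketched below. Every name here (timer_req_v1, timer_req_v2, timer_req_to_ns) is invented for illustration and is not taken from Multics or from any kernel interface. The function's parameter list stays frozen while version 2 of the structure changes the precision of the value it carries, which is one of the cases listed above as requiring a version change.

```c
#include <stdint.h>
#include <string.h>

/* Version 1: whole seconds only. The version field always comes first. */
struct timer_req_v1 {
    uint32_t version;      /* == 1 */
    uint32_t seconds;
};

/* Version 2: precision changed to nanoseconds, so the version is bumped. */
struct timer_req_v2 {
    uint32_t version;      /* == 2 */
    uint64_t nanoseconds;
};

/*
 * The parameter list never changes: callers pass a pointer to whichever
 * structure version they were built against. Returns the requested
 * interval in nanoseconds, or -1 for a version this code does not know.
 */
int64_t timer_req_to_ns(const void *req)
{
    uint32_t version;

    /* Read the leading version field without assuming the full layout. */
    memcpy(&version, req, sizeof(version));

    switch (version) {
    case 1: {
        const struct timer_req_v1 *r = req;
        return (int64_t)r->seconds * 1000000000LL;
    }
    case 2: {
        const struct timer_req_v2 *r = req;
        return (int64_t)r->nanoseconds;
    }
    default:
        return -1;   /* unknown version: refuse rather than guess */
    }
}
```

Old callers keep passing version 1 structures and continue to work, so no "flag day" is needed when version 2 appears.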

--dave

timerfd() and system call review

Posted Aug 15, 2007 12:20 UTC (Wed) by clugstj (subscriber, #4020) [Link]

"The purpose of this call is to allow an application to obtain a file descriptor to use with timer events, eliminating the need to use signals."

Why is this going into the kernel? It is trivial to achieve the same effect in userspace with pipe() and a thread sleeping in pthread_cond_timedwait().

timerfd() and system call review

Posted Aug 15, 2007 16:28 UTC (Wed) by khim (subscriber, #9252) [Link]

Speed, I presume. Almost anything can be done in userspace (UML is a proof); the question is "what does it cost". A solution with pipe() and a thread will be many times slower...

timerfd() and system call review

Posted Aug 15, 2007 19:01 UTC (Wed) by njs (subscriber, #40338) [Link]

You, uh, want to have one thread per pending timer?

The real solution avoiding timerfd is to write a proper main loop like the ones in glib, Qt, Twisted Python, libevent, etc., that puts timers on a heap and uses the delay until the timer at the head of the heap as the timeout on one's blocking syscall (select, epoll, kqueue, whatever).

This is all a truly fantastic pain in the butt, though, esp. once you bring in other events like signals, process handling (waitpid), etc. Even worse, it's not composable -- if you have libraries that need to do IO, getting them integrated with each other and with your main loop is almost impossible. Again there are pure-userspace solutions possible in principle (e.g. abstracted wrappers over multiple event loops like liboop), but in practice it remains a huge issue. This is one of the reasons we still don't have a really decent async dns resolver library, for instance.

I don't know how much of a help these Linux-specific solutions will be in the long run, but being able to wrap all event sources into fds via timerfd and signalfd and so on, and combine multiple fds into a single fd (for purposes of event selection) via epoll, certainly has the *potential* to simplify all these messes significantly. Maybe we'll even figure out eventually whether this or kqueue is better.

timerfd() and system call review

Posted Aug 15, 2007 22:34 UTC (Wed) by pphaneuf (guest, #23480) [Link]

My thoughts exactly, except that I'd point out that a "timer signaled over a pipe library" would probably have a single thread, which would use a heap for the timers. Kind of silly, having a thread whose goal is to sleep all the time, basically, but hey, you've got to do what you've got to do...

timerfd() and system call review

Posted Aug 16, 2007 10:18 UTC (Thu) by IkeTo (subscriber, #2122) [Link]

I think the "one thread per timer" might be part of the problem, but not the whole problem. The bigger part is that to allow programming in such a style, all your events (i.e., waiting for input) must signal that condition when the input comes. What it means is that every fd you wait for must be in its own thread (creating a lot of headaches to prevent race conditions and deadlocks), or else you must use pselect() or epoll() or whatever, which subsumes the need for the timer thread anyway. On the other hand, I still don't quite know what the best use case of timerfd() is. It clearly is geared towards pselect() or epoll(). But that simply replaces the user-mode code to manage the timers with some kernel-mode code to manage the timers via file descriptors. I can see the kernel-mode approach as more wasteful: it requires a timer structure and a file structure for each timer, while a userland approach would require just an entry in a priority queue.

timerfd() and system call review

Posted Aug 16, 2007 11:48 UTC (Thu) by nix (subscriber, #2304) [Link]

It might be more wasteful, but it's a lot easier to write the userspace
code. Memory is cheap in the quantities we're talking about here (apps
that use millions of timers simultaneously are going to be very rare).

And having *everything* be an fd would finally realize one of the goals of
the Unix world since its creation :)

timerfd() and system call review

Posted Aug 16, 2007 12:43 UTC (Thu) by IkeTo (subscriber, #2122) [Link]

> but it's a lot easier to write the userspace code

But when "userspace code" means library code, this is going to be hard to sell. After all, the application developer sees none of that. Can you imagine a version of, say, GLib that implements its event loop using the timerfd() interface? Personally, I can't.

timerfd() and system call review

Posted Aug 30, 2007 23:19 UTC (Thu) by nix (subscriber, #2304) [Link]

Actually I'd expect this to be mostly used by libraries. Currently
libraries have the problem that signal disposition is process-global and
can't be reset without interfering with other libraries, which is
ameliorated by signalfd. Also, major libraries like glib *can*
realistically include system-dependent code without being too annoying: it
only has to go into glib, rather than into all its users (and glib already
supports some Linux-specific interfaces anyway: indeed in a sense that's
part of its raison d'etre).

timerfd() and system call review

Posted Aug 28, 2007 21:58 UTC (Tue) by renox (guest, #23785) [Link]

>>And having *everything* be an fd would finally realize one of the goals of the Unix world since its creation :) <<

Given that Plan9 has been much more thorough than Unix in the 'everything is a file' way, I wonder how they solved this issue?

timerfd() and system call review

Posted Aug 30, 2007 23:23 UTC (Thu) by nix (subscriber, #2304) [Link]

Plan 9 has `notes' instead of signals, but it looks like they too were
`call this function automatically' things rather than being reified into
fds. Surprising. (However, notes are plan9ish in another way: they're
strings, not integers.)

timerfd() and system call review

Posted Aug 16, 2007 13:20 UTC (Thu) by pphaneuf (guest, #23480) [Link]

The use for timerfd is one of integration, I think. It's now possible to make an epoll fd, put all of your things in it as well as your timers, and just give back that single fd to an application, telling it to just call a specific function whenever it becomes readable. Note that in those integration situations (a classical example of which being an asynchronous DNS resolver), you don't get to specify the expiration of the select (or similar) call, or at least, not without complicating the interface.

Of course, it can currently be faked with a thread, either entirely (use a single pipe, and put your whole select loop in a thread, managing the timers as well) or in part (if you have epoll, you can put the fds in it, and use a thread just for the timers). You avoid the race conditions and deadlocks by only doing the minimal amount in the thread, just enough to simulate epoll or timerfd.

What's nice is that with each improvement (epoll and then timerfd), you can make that integration simpler and less complicated (running a thread in the background involves having to deal with untimely termination, making sure to block all signals and other such details). It's also possible to implement the same interface for the application whether you're faking it or not, so you can have some autoconf tests for epoll and timerfd, and then apply the right amount of emulation.

As for the wastefulness, the timer structure is also more or less the same as that "entry" you'd put in a priority queue, and if you use pipes to fake things, you end up having not one, but two file structures, not to mention an almost entirely useless buffer.

timerfd() and system call review

Posted Aug 17, 2007 11:49 UTC (Fri) by Octavian (guest, #7462) [Link]

> Why is this going into the kernel? It is trivial to achieve the same effect in userspace with pipe() and a thread sleeping in pthread_cond_timedwait().

The pipe() mechanism for getting synchronous timer events with epoll()/select() is considered a 'trick', I suppose (one which I've used in my own projects too). So, in the end, we at least no longer need to remember this trick; that's an achievement.

timerfd() and system call review

Posted Aug 16, 2007 15:25 UTC (Thu) by mrfredsmoothie (guest, #3100) [Link]

> one of the best ways to find shortcomings in an API is to attempt to document it comprehensively.

Hmmm. Another way would be to require someone to write a non-trivial program which uses the API to do something useful.

timerfd() and system call review

Posted Aug 17, 2007 9:01 UTC (Fri) by addw (guest, #1771) [Link]

> Another way would be to require someone to write a non-trivial program which uses the API to do something useful.

And make that non-trivial program part of the available documentation.

One of the most frustrating things in working with FLOSS is that often (but by no means always) the documentation is awful. It is generally a set of notes written by the author (who entirely understands what it is all about), so someone coming new to the problem has a really hard time getting to see how everything fits together.

A non-trivial program would be great for system calls; something similar needs to be available for other FLOSS components.

timerfd() and system call review

Posted Aug 20, 2007 8:25 UTC (Mon) by mkerrisk (subscriber, #1978) [Link]

I'm open to suggestions about specific pages in man-pages that need non-trivial example programs. (Over time an increasing number of example programs have been added.)

timerfd() and system call review

Posted Aug 18, 2007 17:32 UTC (Sat) by landley (guest, #6789) [Link]

Since I have an interest in this documentation stuff I thought I'd see if this travel fund might be willing to send me to Cambridge (or failing that get some cheap tickets through the University of Texas). Unfortunately, the summit's website says "Participation in the summit is by invitation only".

What's the point of telling us about an event we can't attend? If we were invited, presumably we'd get emails about the schedule...

timerfd() and system call review

Posted Sep 26, 2007 6:27 UTC (Wed) by ury (guest, #47805) [Link]

why not just open("/proc/sys/timerfd...", .. ) or some similar
to create timerfd descriptor?

maybe some ioctl can be used to control timer (or sysfs )