
Papers from the Real Time Linux Workshop


By Jake Edge
October 14, 2009

There are far too many interesting Linux and free software conferences these days, so it would be difficult—really, impossible—to attend them all. Slides and videos of the talks can help fill in the gaps, but, for conferences with a more academic bent, the papers that are the basis of the presentations can give an even more detailed look. The papers from the recently concluded Real Time Linux Workshop are a good example; this article will briefly look at a few of them.

Myths and Realities of Real-Time Linux Software Systems

This paper [PDF] can serve as an introduction to realtime for those who are not familiar with what that means. Author Kushal Koolwal starts with the basics: defining realtime, describing various kinds of latencies, and looking at hard vs. soft realtime, before moving into a few myths. Koolwal then looks at realtime in Linux, focusing on the PREEMPT_RT patchset. In a few short pages, this paper will give the reader a good foundation in realtime and the trade-offs necessary to support it.

Finding origins of latencies using Ftrace

Ftrace developer Steven Rostedt describes how to use ftrace to find unexpected and/or unacceptable latencies, which may be a barrier to realtime processing, in his paper [PDF]. Ftrace is a relatively new tool in the kernel that provides various kinds of tracing information and has some facilities that can be used specifically for tracking down latency issues. Tracers like irqsoff, preemptoff, and wakeup (along with some variants) capture information while the kernel is running in specific modes (i.e. with interrupts disabled, preemption turned off, etc.).

Rostedt's paper gives a fairly detailed look at the tracers, how to enable them, what they do, and the output they produce. While these latency tracers are active, they record things like kernel functions called or trace event points encountered, while looking for the maximum time spent in the latency-causing modes. Seeing what the kernel was doing when the latency exceeded expectations can lead a developer to the specific cause, and perhaps to a way to reduce the latency. Rostedt mentions the JACK "audio connection kit" developers as early adopters of latency tracing, noting that they found both kernel bugs and JACK bugs that were causing excess latency.

Towards Linux as a Real-Time Hypervisor

Jan Kiszka reported [PDF] on experiments using Linux as a hypervisor for realtime processing. Using KVM and QEMU, he measured the latency in both the host and guest operating systems under a number of different scenarios. One of the more obvious means to increase the responsiveness of the guest is to raise the priority of the QEMU threads and to put them into a realtime scheduling class. But that can lead to starving host OS processes that the guest is waiting on, which could lead to deadlock or other undesirable behavior.

The paper reports measurements of average and maximum latency, presented as latency histograms, under different conditions: a baseline test in both the host and the guest; the priority and scheduling-class changes applied to the guest; a lowered priority on QEMU's asynchronous I/O (AIO) threads; and a PREEMPT_RT kernel on the host. In addition, Kiszka describes a "paravirtualized scheduling" approach that allows the guest to send the host information on spinlock usage, so that the host scheduler can adjust the priorities of guest processes for more efficient use of the CPUs while avoiding priority inversions.

ARM Fast Context Switch Extension for Linux

The organization of the ARMv5 cache can cause performance problems that may preclude its use for realtime tasks. The cache is based on virtual memory addresses and, since Linux processes share the same range of virtual addresses, each context switch requires invalidating the cache. Depending on the CPU type, memory speed, and the program's data access pattern, the cost of reloading a process's data from main memory can be on the order of 200 microseconds—too much for many time-critical applications.

One alternative is to share a flat address space between all of the processes, but then the memory protection provided by separate address spaces is lost. Gilles Chanteperdrix and Richard Cochran describe [PDF] another approach for doing context switches that preserves the memory protections without sacrificing the cache at every context switch. They use the ARM Fast Context Switch Extension (FCSE) and partition the virtual address space into separate 32MB chunks so that processes do not have overlapping address ranges. The FCSE process ID allows for up to 128 such slots, of which 96 fit in the 3GB available for non-kernel addresses. The translation lookaside buffer (TLB) must still be flushed on context switches to enforce memory protection, but the data and instruction caches are preserved.

The actual implementation required reducing the number of available processes to 95. A limit of 95 (or even 128) processes, along with the 32MB address-space restriction, is unacceptable for many embedded applications, so the authors added a "best effort" mode that eliminates those restrictions, but cannot guarantee that it won't have to do cache flushes on some context switches. They report that average latencies for their test cases were reduced by roughly half in the "guaranteed" mode, and by roughly one-quarter in "best effort" mode, compared to the standard Linux kernel.

Design and Implementation of Node Order Protocol

Distributed systems often use "time division multiple access" (TDMA) to coordinate access to a shared communications medium (e.g. a shared bus or wireless frequencies). But TDMA requires a reliable means to synchronize the clocks on the various systems, and that synchronization consumes some of the shared bandwidth simply for timekeeping. The authors, Li Chanjuan, Nicholas McGuire, and Zhou Qingguo, propose [PDF] a different protocol, the Node Order Protocol (NOP), that avoids much of the complexity and bandwidth waste of TDMA.

As its name implies, NOP relies on a consistent ordering of the nodes in the network. It also requires that nodes monitor each other to detect a faulty node that is not correctly following the ordering scheme. The advantages, according to the authors, are that NOP is much easier to implement and validate than protocols with complex synchronization requirements, that it does not lose bandwidth to temporal padding, and that its error detection is much simpler and bounded in time.

Use of cookies in real-time system development

One last paper to mention is the scholarly-sounding, if tongue-in-cheek, look at cookie consumption and "the positive impact on the real-time Linux community we were able to observe". The authors, M. Gleixner and M. McGuire, look at various cookie protocols—with code—and conclude that uni-directional protocols are best for real-time Linux development: "Though greedy protocols have been discussed in the past, we found that considering these has negative impacts on developers long term and thus are deprecated."

The slides for some of the presentations are available on the Open Source Automation Development Lab (OSADL) web site. Quite a few more papers than we looked at here are available as well. While the papers can't really replace the experience of attending, there is much of interest for those who are looking for more information on realtime in Linux.




ARM Fast Context Switch Extension for Linux

Posted Oct 15, 2009 9:57 UTC (Thu) by meuh (guest, #22042) [Link]

This would explain why, in Windows CE, Microsoft chose to allocate each Windows CE application a 32M "slot" of virtual memory and to load DLLs at fixed addresses inside this "slot" (a loaded DLL has a fixed virtual memory region reserved in all running processes). This way, DLLs are located at the same virtual address regardless of the process.

DLLs' code is kept in the cache, but applications' code is not.

More: http://blogs.msdn.com/hegenderfer/archive/2007/08/31/slay...

ARM Fast Context Switch Extension for Linux

Posted Oct 19, 2009 11:10 UTC (Mon) by etienne_lorrain@yahoo.fr (guest, #38022) [Link]

Having the memory cache deal in virtual memory also has its advantages: it means that each *application* has the whole memory cache available.
Obviously, the context switch then has to save and restore the cache, but after all that is the memory which is most often used by the application, and it would be worth it to restart the application with its cache "hot".
I do not know if the cache memory can be saved/restored on that ARM processor, and some optimisation is probably needed if the application does not use some of the cache lines or is short-lived, but this whole virtual-memory cache system may be an improvement in some cases.

ARM Fast Context Switch Extension for Linux

Posted Oct 27, 2009 7:46 UTC (Tue) by robbe (guest, #16131) [Link]

I may be missing something, but restoring the *whole* cache at context
switch will not magically be faster than restoring *all* individual
cache lines on first (re)use. So you are essentially trading a huge
context-switch delay for fewer initial cache misses. Overall this
should at best perform equally well (if each restored cache line is
used at least once), and typically worse (if there are cache lines
that won't be used again), compared to the normal, lazy strategy.

Unless there is some win from doing bulk memory transfers. But
individual cache lines should be wide enough to capture that win too.


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds