
A look at ftrace


By Jake Edge
March 13, 2009

There are quite a variety of tracing options for Linux, with SystemTap being the most prominent, but that particular solution has yet to become easily usable, at least partly due to its many dependencies on user-space configuration and tools. Another choice, which originally came from work on realtime Linux, is ftrace. Ftrace is a self-contained solution, requiring no user-space tools or support, that is useful for tracking down problems—not only in the kernel, but in its interactions with user space as well.

The name ftrace comes from "function tracer", which was its original purpose, but it can do more than that. Various additional tracers have been added to look at things like context switches, how long interrupts are disabled, how long it takes for high-priority tasks to run after they have been woken up, and so on. Its genesis in the realtime tree is evident in the tracers so far available, but ftrace also includes a plugin framework that allows new tracers to be added easily.

In a suitably configured kernel—one with CONFIG_FUNCTION_TRACER enabled—accessing ftrace is done through the debug filesystem. Typically, it is mounted this way:

    # mount -t debugfs nodev /debug
That creates a /debug/tracing subdirectory, which is used both to control ftrace and to retrieve output from the tool.

One chooses a particular tracer to use by writing to /debug/tracing/current_tracer—possibly after querying the tracers available by reading /debug/tracing/available_tracers. On a recent Fedora rawhide kernel, checking the tracers available results in:

    # cat /debug/tracing/available_tracers
    power wakeup irqsoff function sysprof sched_switch initcall nop
Tracing is then enabled or disabled by writing to /debug/tracing/tracing_enabled. Using the function tracer would be done as follows:
    # echo function >/debug/tracing/current_tracer
    # echo 1 >/debug/tracing/tracing_enabled
    ...some commands or activity to trace...
    # echo 0 >/debug/tracing/tracing_enabled
This produces a trace of each kernel function called, along with a timestamp, which will allow a kernel hacker to see what is going on inside the kernel.

Output from ftrace can be read from one of several files in the tracing directory:

  • trace - contains human-readable output of the trace

  • latency_trace - output from the same trace, but organized so that system latency issues can be diagnosed, also in human-readable format

  • trace_pipe - contains the same output as trace, but is meant to be piped into a command. Unlike the other two, reading from this file consumes the output.

The ring buffer used by ftrace has a fixed size (governed by the buffer_size_kb file), so trace or latency_trace entries will eventually be overwritten.
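
Since reading from trace_pipe consumes the entries, it works well as the input to a filtering pipeline while a trace is running. A minimal sketch (assuming debugfs is mounted on /debug and a tracer is already enabled):

    # cat /debug/tracing/trace_pipe | grep sys_open >/tmp/opens.txt &
    ...some commands or activity to trace...
    # kill %1

Here the grep output accumulates in /tmp/opens.txt until the background pipeline is killed; entries read this way are removed from the ring buffer, so they will not show up later in trace.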

Here are examples of the output from a function trace. The header helps to decode the various fields in the trace. This is what the output from trace looks like (heavily edited for brevity's sake):

    # tracer: function
    #
    #           TASK-PID    CPU#    TIMESTAMP  FUNCTION
    #              | |       |          |         |
               bash-3330  [000]   147.799029: sys_open <-syscall_call
               bash-3330  [000]   147.799030: do_sys_open <-sys_open
               bash-3330  [000]   147.799030: getname <-do_sys_open
               bash-3330  [000]   147.799031: kmem_cache_alloc <-getname
And this is the latency_trace output (similarly edited):
    # tracer: function
    #
    function latency trace v1.1.5 on 2.6.29-0.215.rc7.fc11.i586
    --------------------------------------------------------------------
     latency: 0 us, #120119/5425477, CPU#0 | (M:desktop VP:0, KP:0, SP:0 HP:0 #P:2)
        -----------------
        | task: -0 (uid:0 nice:0 policy:0 rt_prio:0)
        -----------------

    #                  _------=> CPU#
    #                 / _-----=> irqs-off
    #                | / _----=> need-resched
    #                || / _---=> hardirq/softirq
    #                ||| / _--=> preempt-depth
    #                |||| /
    #                |||||     delay
    #  cmd     pid   ||||| time  |   caller
    #     \   /      |||||   \   |   /
        bash-3330    0.... 3531221us : sys_open (syscall_call)
        bash-3330    0.... 3531222us : do_sys_open (sys_open)
        bash-3330    0.... 3531222us : getname (do_sys_open)
        bash-3330    0.... 3531223us : kmem_cache_alloc (getname)
Each line in the two output formats shows one function call, with one level of backtrace, along with the process name, process ID, and the CPU the call was made on. The latency_trace output also indicates whether interrupts have been disabled, whether a reschedule has been requested, whether the call is running in interrupt context, and whether preemption has been disabled. The timestamp in the latency_trace output is relative to the start of the trace, in microseconds; the space after the time, and before the colon, is a field that gets set to either '!' or '+' to call attention to especially long delays (in the example output it is always blank). Unfortunately, the header is a little misleading about where the "delay" and "caller" fields point.

The sysctl kernel.ftrace_enabled governs whether function entry is recorded as part of the trace. It can be turned on by using either of the following commands:

    # sysctl kernel.ftrace_enabled=1 
    or 
    # echo 1 >/proc/sys/kernel/ftrace_enabled
Without that, many of the tracers are essentially pointless.

Half-a-dozen different tracers are described in the voluminous Documentation/ftrace.txt that comes in the kernel source tree. In addition to the function tracer, there is the sched_switch tracer that tracks and reports on context switches in the kernel, showing process wakeups along with when they get scheduled. Each trace entry has a timestamp along with the priorities and states of the affected processes.
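
Switching tracers is just a matter of writing a different name into current_tracer. A quick sketch of capturing a few seconds of context-switch activity with the sched_switch tracer (again assuming debugfs is mounted on /debug):

    # echo sched_switch >/debug/tracing/current_tracer
    # echo 1 >/debug/tracing/tracing_enabled
    # sleep 5
    # echo 0 >/debug/tracing/tracing_enabled
    # cat /debug/tracing/trace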

The nop tracer is not a tracer at all, but can be used to clear whatever tracer is currently active. There are another seven tracers in the mainline that have yet to make it into the documentation. In addition, there are even more tracers that are targeted for the 2.6.30 kernel.
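
So, for example, clearing the currently active tracer is just:

    # echo nop >/debug/tracing/current_tracer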

There are four tracers that look for the maximum time spent in a particular state, recording that maximum (in the tracing_max_latency file) along with a trace of the functions called during that state. irqsoff looks for the longest time spent with interrupts disabled, while preemptoff looks for the maximum time spent with preemption turned off. Combining the two gives the preemptirqsoff tracer. The final tracer, wakeup, looks for the maximum latency between the highest-priority process being woken up and it actually getting scheduled.
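
As an illustration, a session with the irqsoff tracer might look something like the following sketch (assuming it appears in available_tracers and debugfs is mounted on /debug; writing a zero to tracing_max_latency clears the previously recorded maximum):

    # echo irqsoff >/debug/tracing/current_tracer
    # echo 0 >/debug/tracing/tracing_max_latency
    # echo 1 >/debug/tracing/tracing_enabled
    ...let the system run for a while...
    # echo 0 >/debug/tracing/tracing_enabled
    # cat /debug/tracing/tracing_max_latency
    # cat /debug/tracing/latency_trace

The latency_trace file then holds the function trace leading up to the longest interrupts-off section seen while the tracer was active.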

Each of those helps by reducing the amount of trace data that a kernel hacker needs to wade through. For kernels that are configured with CONFIG_DYNAMIC_FTRACE, there is another way to filter the trace data. Dynamic ftrace allows the user to select kernel functions that are either included in or excluded from the tracing information collected. The available_filter_functions file lists the functions that can be filtered, while writing function names to set_ftrace_filter or set_ftrace_notrace will include or exclude those functions. Those lists can be appended to by using the standard >> shell redirection operator. In addition, there is support for wildcards so that:

    echo 'hrtimer_*' >/debug/tracing/set_ftrace_filter
will only gather tracing information from high-resolution timer functions.
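
The filter and exclusion lists can be combined. A brief sketch (the particular function names here are just examples, but the files are as described above):

    # echo 'hrtimer_*' >/debug/tracing/set_ftrace_filter
    # echo ktime_get >>/debug/tracing/set_ftrace_filter
    # echo '*spin_lock*' >/debug/tracing/set_ftrace_notrace
    # echo >/debug/tracing/set_ftrace_filter

The first two commands restrict tracing to the high-resolution timer functions plus ktime_get(), the third excludes the spinlock functions from whatever is being traced, and writing an empty string clears the filter list again.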

Other tracers available in the mainline kernel (i.e. what will become 2.6.29) include:

  • power - traces the CPU power state transitions

  • function_graph - similar to the function tracer, but traces both function entry and exit, which allows a call graph to be generated (see the sketch after this list)

  • sysprof - periodically (governed by sysprof_sample_period) generates stack traces for the running process or kernel thread

  • initcall - traces the entry and exit of initialization function calls during system boot

  • branch - traces the outcome of the kernel's likely()/unlikely() branch-prediction annotations

  • hw-branch-tracer - uses the Branch Trace Store (BTS) feature of x86 CPUs to trace branch execution

  • mmiotrace - traces memory-mapped I/O
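
The function_graph tracer mentioned above is driven the same way as the others; a minimal sketch (assuming a kernel built with CONFIG_FUNCTION_GRAPH_TRACER and debugfs mounted on /debug):

    # echo function_graph >/debug/tracing/current_tracer
    # echo 1 >/debug/tracing/tracing_enabled
    ...some commands or activity to trace...
    # echo 0 >/debug/tracing/tracing_enabled
    # cat /debug/tracing/trace

Instead of a flat list of calls, the trace file then shows function entries and exits indented by call depth, with the time spent in each function noted alongside.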

The production of human-readable output directly from the kernel recently led to some hard questions from Andrew Morton: "Why on earth do we keep on putting all these pretty-printers and pretty-parsers into the kernel? I mean, how hard is it for userspace to read a text file, do some basic substitutions and print it out again?" But others argue that it is precisely the ease of getting a readable trace directly from the kernel that makes ftrace so useful.

Morton's argument is that the fixed format is more difficult for scripts to parse and that the output is English-only. He contends that some kind of API would simplify things. Pointing to the latency_trace output as an example, he said:

[...] now you need to think about how this interface would have been designed if we'd decided to access it with something smarter than `cat'.

I mean, look at it. All the multi-space column lining upping, the unnecessary "us" annotation, the strange symbol(symbol) thing, etc. Plus it would have been more self-describing. Right now, your parser has to either assume that the second character of "0d..1" is "irqs-off", or it has to learn how to follow ascii art lines.

Rather than needing some kind of user-space tool to interpret the ftrace output, a kernel hacker can quickly get what they need from the kernel itself. As Ingo Molnar pointed out, there may be no build environment available on the target machine, so having the trace output available directly is useful: "Self-sufficient kernel instrumentation is a key concept to usability". Further, he said, the output has been designed to be scriptable, but if that is not sufficient, there are options to produce raw output. Overall, he sees kernel pretty-printing as a needed feature:

IMO pretty-printing in the kernel should not be made a religious argument but a technical argument. Pretty-printing is a tool, and as a tool it sometimes can be used for silly purposes, and sometimes it can be used for helpful purposes. You seem to argue that doing it in the kernel is always silly - and i strongly disagree with that and consider it an extreme-end position - for the reasons outlined above.

The options that Molnar refers to are accessed via the trace_options file, which lists the current option settings when read and can be written to change them. Three of the options, raw, hex, and bin (along with several that determine which fields are included), control the format of the output. These can produce output in formats that may be easier for some tools to use. Not requiring any tool to get a useful trace, though, is seen as a very important part of ftrace, at least by some.
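
Toggling those options is done in the usual ftrace way; a short sketch (an option is enabled by echoing its name into trace_options and disabled again by prefixing the name with "no"):

    # cat /debug/tracing/trace_options
    # echo bin >/debug/tracing/trace_options
    # echo nobin >/debug/tracing/trace_options

The first command lists the current settings, the second switches the trace output to the binary format, and the third switches back to the default human-readable format.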

There are lots more options and features of ftrace than are covered here. The aforementioned ftrace.txt document is a good place to start for those interested in more details.

Beyond those tracers already in the mainline, there are another handful being readied for 2.6.30. This includes a new event tracer, which allows tracing of various scheduling, locking, and interrupt-handling events. Continuous statistics gathering for workqueues and branch prediction is included as well.

Ftrace is a useful tool that can provide excellent diagnostic information from within a running kernel. It is not a general-purpose tool, nor is it geared to people unfamiliar with the innards of the kernel, but it certainly has its uses. There is quite a bit of activity going on with ftrace these days, with numerous patches and enhancements floating around on linux-kernel. While it may not reduce DTrace envy directly, it is a tool that many kernel hackers are using today.




A look at ftrace

Posted Mar 13, 2009 23:48 UTC (Fri) by mb (subscriber, #50428) [Link]

Thanks Jake for an excellent article!

debugfs mount point

Posted Mar 14, 2009 1:22 UTC (Sat) by gregkh (subscriber, #8) [Link]

Traditionally debugfs is supposed to be mounted at /sys/kernel/debug with the following command:

mount -t debugfs none /sys/kernel/debug/

That's the recommended place to put it.

debugfs mount point

Posted Mar 16, 2009 19:33 UTC (Mon) by nevets (subscriber, #11875) [Link]

Although I do mention /sys/kernel/debug in ftrace.txt, I find that too much to type. But perhaps I should change my general instructions to be...

mount -t debugfs none /sys/kernel/debug/
ln -s /sys/kernel/debug /debug

debugfs mount point

Posted Mar 17, 2009 13:38 UTC (Tue) by tao (subscriber, #17563) [Link]

Hype? Not creating a spurious, mostly unused, directory in the root namespace seems a perfectly good reason...

debugfs mount point

Posted Mar 17, 2009 15:42 UTC (Tue) by vapier (guest, #15768) [Link]

i think you misread ... he said "type", not "hype"

debugfs mount point

Posted Mar 17, 2009 16:04 UTC (Tue) by nevets (subscriber, #11875) [Link]

Thanks, I was confused by what he meant by "Hype".

But to his question about spurious directories in root. I still have on many systems an empty /opt. How's that different? I think seeing a /debug directory is quite obvious to what it is. Are you that worried about an inode that is seldom used?

The only ones that ever complain about /debug are a few kernel developers that know that /sys/kernel/debug was made for that purpose. The majority of people that use ftrace seem to mount the debugfs on /debug.

Typing "cat /debug/tracing/trace" is much faster than typing
"cat /sys/kernel/debug/tracing/trace".

debugfs mount point

Posted Mar 19, 2009 3:01 UTC (Thu) by daniel (guest, #3181) [Link]

I prefer /debug over /sys/kernel/debug/. After all, this symbol only needs to be there for debugging, if present then it is unmistakable why it is there, and there is no real chance of collision.

debugfs mount point

Posted Mar 17, 2009 15:45 UTC (Tue) by vapier (guest, #15768) [Link]

personally, i prefer to avoid "nodev" when possible as overuse can result in `mount` output that isnt completely obvious at first ... yes, that info is also available later on, but us english readers scan from left to right and get bored fairly quickly :)

$ mount -t debugfs debugfs /sys/kernel/debug
$ mount | grep debugfs
debugfs on /sys/kernel/debug type debugfs (rw,noexec,nosuid,nodev)

Are "fixed in stone" kernel ABIs being created here?

Posted Mar 19, 2009 15:02 UTC (Thu) by sbishop (guest, #33061) [Link]

Reading this over makes me wonder if the userspace-visible ABIs exposed by these tracers are fixed in stone. Or are they intended to be only for kernel developers, users beware?

Are "fixed in stone" kernel ABIs being created here?

Posted Mar 19, 2009 18:33 UTC (Thu) by chantecode (subscriber, #54535) [Link]

They are not supposed to be fixed in stone, because they are in a debugfs namespace. But some care can be taken. For instance the sysprof tracer is used by a userspace program.

ftrace not working

Posted Jul 5, 2012 20:58 UTC (Thu) by SwamiS (guest, #85527) [Link]

I am trying to setup ftrace on a linux based android platform. I did the initial setups by enabling these below configs.

CONFIG_FUNCTION_TRACER
CONFIG_FUNCTION_GRAPH_TRACER
CONFIG_STACK_TRACER
CONFIG_DYNAMIC_FTRACE

Built it and trying to see if it is working. But, I don't get any outputs from "trace". Inside the shell, I am trying the following. Can you please help find out what is missing to make it work?

mount -t debugfs debugfs /sys/kernel/debug/
cd /sys/kernel/debug/tracing/

/sys/kernel/debug/tracing # echo 1 > /proc/sys/kernel/ftrace_enabled
/sys/kernel/debug/tracing # echo function > current_tracer
/sys/kernel/debug/tracing # echo 1 > tracing_enabled
/sys/kernel/debug/tracing # echo 1 > tracing_on
/sys/kernel/debug/tracing # cat trace
# tracer: function
#
# TASK-PID CPU# TIMESTAMP FUNCTION
# | | | | |

ftrace not working

Posted Jul 16, 2012 20:11 UTC (Mon) by SwamiS (guest, #85527) [Link]

Solution:
Tried multiple things and it was not working. Finally, got to know that removing the "STRICT_MEMORY_RWX" kernel config helped solve the problem and I was able to do function tracing.

Trace Call using ftrace

Posted Jun 4, 2014 10:03 UTC (Wed) by 11189209 (guest, #97355) [Link]

Hi,

I have written a client/server program and want to trace the write()/send() calls using ftrace.
How can I do that?

Please guide.


Copyright © 2009, Eklektix, Inc.
This article may be redistributed under the terms of the Creative Commons CC BY-SA 4.0 license
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds