How old is our kernel?
April 2005 was a bit of a tense time in the kernel development community. The BitKeeper tool which had done so much to improve the development process had suddenly become unavailable, and it wasn't clear what would replace it. Then Linus appeared with a new system called git; the current epoch of kernel development can arguably be dated from then. The opening event of that epoch was commit 1da177e4, the changelog of which reads:
Let it rip!
The community did, indeed, let it rip; some 180,000 changesets have been added to the repository since then. Typically, hundreds of thousands of lines of code are changed with each three-month development cycle. A while back, your editor began to wonder how much of the kernel had actually been changed: how much of our 2.6.33-to-be kernel dates back to 2.6.12-rc2, which was tagged at the opening of the git era? Was there anything left of the kernel we were building in early 2005?
Answering this question is a simple matter of bashing out some ugly scripts and dedicating many hours of processing time. In essence, the "git blame" command can be used to generate an annotated version of a file which lists the last commit to change each line of code. The lines attributed to each commit ID can be counted, then associated with major version releases. At the end of the process, one has a simple table showing the percentage of the current kernel code base which was created for each major release since 2.6.12. Here's what it looks like:
In summary: just over 31% of the kernel tree dates
back to 2.6.12, and has not been
modified since then. Our kernel may be changing quickly, but parts of it
have not changed at all for nearly five years. Since then, we see a steady
stream of changes, with more recent kernels being more strongly represented
than the older ones. That curve will partly be a result of the general
increase in the rate of change over time; 2.6.13 had fewer than 4,000
commits, while 2.6.33 will have almost 11,000. Still, one has to wonder
what happened with 2.6.20 (5,000 commits) to cause that
release to represent less than 2% of the total code base.
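The tallying step described above can be sketched in a few lines of Python. This is a minimal illustration and not the scripts actually used for the article: it assumes that `git blame --line-porcelain` supplies the per-line commit IDs and that `git describe --contains` names the first tag containing each commit. The function names are invented for the example.

```python
import re
import subprocess
from collections import Counter

# In --line-porcelain output every source line is introduced by a header
# of the form "<40-hex commit id> <orig line> <final line> [<count>]".
# The source text itself is always prefixed with a tab, so it cannot match.
HEADER = re.compile(r"^([0-9a-f]{40}) \d+ \d+")


def count_commits(porcelain: str) -> Counter:
    """Count how many lines of one file are attributed to each commit."""
    counts = Counter()
    for line in porcelain.split("\n"):
        match = HEADER.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def release_of(described: str) -> str:
    """Reduce a name such as 'v2.6.13-rc1~65^2~3' to '2.6.13'."""
    match = re.match(r"v(\d+\.\d+\.\d+)", described)
    return match.group(1) if match else "unknown"


def blame_file(repo: str, path: str) -> Counter:
    """Run git blame on one file and tally its lines per commit."""
    result = subprocess.run(
        ["git", "-C", repo, "blame", "--line-porcelain", "--", path],
        capture_output=True, encoding="utf-8", errors="replace", check=True,
    )
    return count_commits(result.stdout)


def first_release(repo: str, commit: str) -> str:
    """Name the release whose tags first contain the given commit."""
    result = subprocess.run(
        ["git", "-C", repo, "describe", "--contains", commit],
        capture_output=True, encoding="utf-8", errors="replace",
    )
    # Commits not yet contained in any tag produce no output: "unknown".
    return release_of(result.stdout.strip())


def percentages(lines_per_release: dict) -> dict:
    """Convert per-release line counts into percentages of the total."""
    total = sum(lines_per_release.values())
    return {rel: 100.0 * n / total for rel, n in lines_per_release.items()}
```

Keeping the parsing separate from the git invocations means the counting logic can be checked on canned blame output without a multi-hour run over the full tree.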
Much of the really old material is interspersed with newer lines in many files; comments and copyright notices, in particular, can go unchanged for a very long time. The 2.6.12 top-level makefile set VERSION=2 and PATCHLEVEL=6, and those lines have not changed since; the next line (SUBLEVEL=33) was changed in December.
There are interesting conclusions to be found at the upper end of the graph as well. Using this yardstick, 2.6.33 is the smallest development cycle we have seen in the last year, even though this cycle will have replaced some code added during the previous cycles. 4.2% of the code in 2.6.33 was last touched in the 2.6.33 cycle, while each of the previous four kernels (2.6.29 through 2.6.32) still represents more than 5.5% of the code to be shipped in 2.6.33.
Another interesting exercise is to look for entire files which have not been touched in five years. Given the amount of general churn and API change which has happened over that time, files which have not changed at all have a good chance of being entirely unused. Here is a full list of files which are untouched since 2.6.12 - all 1062 of them. Some conclusions:
- Every kernel tarball carries around drivers/char/ChangeLog, which is
mostly dedicated to documenting the mid-90's TTY exploits of Ted
Ts'o. There is only one change since 1998, and that was in 2001.
Files like this may be interesting from a historical point of view,
but they have little relevance to current kernels.
- Unsurprisingly, the documentation directory contains a great deal of
material which has not been updated in a long time. Much of it need
not change; the means by which one configures an ISA Sound Blaster
card is pretty much as it always was - assuming one can find such a
card and an ISA bus to plug it into. Similarly, Klingon language
support (Documentation/unicode.txt), Netwinder support, and such have
not seen much development activity recently, so the documentation can
be deemed to be current, if not particularly useful. All told,
41% of the documentation directory dates back to 2.6.12. There was a
big surge of
documentation work in 2.6.32; without that, a larger percentage of
this subtree would look quite old.
- Some old interfaces haven't changed in a long time, resulting in a lot
of static files in include/.
<linux/sort.h> declares sort(), which is used
in a number of places. <include/fcdevice.h> declares
alloc_fcdev(), and includes a warning that "This file will get
merged with others RSN." Much of the sunrpc interface has remained
static for a long time as well.
- Ancient code abounds in the driver tree, though, perhaps
unsurprisingly, old header files are much more common than old C
files. The ISDN driver tree has been quite static.
- Much of sound/oss has not been touched for a long time
and must be nicely filled with cobwebs by now; there hasn't been much
of a reason to touch the OSS code for some time.
- net/decnet/TODO contains a "quick list of things that need finishing
off"; it, too, hasn't been changed in the git era. One wonders how the
DECnet hackers are doing on that list...
So which subsystem is the oldest? This plot looks at the kernel subsystems (as defined by top-level directories) and gives the percentage of 2.6.12 code in each:
The youngest subsystem, unsurprisingly, is tools/, which did not exist prior to 2.6.29. Among subsystems which did exist in 2.6.12, the core kernel, with about 13% code dating from that release, is the newest. At the other end, the sound subsystem is more than 45% 2.6.12 code - the highest in the kernel. For those who are curious about the age distribution in specific subsystems, this page contains a chart for each.
In summary: even in a code base which is evolving as rapidly as the kernel, there is a lot of code which has not been touched - even by coding style or white space fixes - in the last five years. Code stays around for a long time.
(For those who would like to play with this kind of data, the scripts used have been folded into the gitdm repository at git://git.lwn.net/gitdm.git).
Note: this article has been edited to fix an error which overstated
the amount of 2.6.12 code remaining in the full kernel.
How old is our kernel?
Posted Feb 17, 2010 16:30 UTC (Wed) by nix (subscriber, #2304) [Link]
Blank lines
Posted Feb 17, 2010 23:04 UTC (Wed) by xoddam (subscriber, #2322) [Link]
If it is a valid, apples-to-apples comparison to compare the line count without blank lines (debatable, but I think it is in this case: it's all Linux kernel code after all), I reckon it's equally valid to compare it with them.
Blank lines
Posted Feb 18, 2010 10:16 UTC (Thu) by hummassa (guest, #307) [Link]
I'd suggest stripping any and all comments, blank lines and strings from the
measurements, and comparing. It does not seem too difficult to do.
Blank lines
Posted Feb 18, 2010 14:11 UTC (Thu) by nix (subscriber, #2304) [Link]
Blank lines
Posted Feb 19, 2010 10:33 UTC (Fri) by hummassa (guest, #307) [Link]
But if you are measuring the age of the code, I would weed out anything that
ends up generating the same object code. Just my opinion there.
Blank lines
Posted Feb 26, 2010 8:51 UTC (Fri) by efexis (guest, #26355) [Link]
Documentation
Posted Feb 17, 2010 16:31 UTC (Wed) by dw (guest, #12017) [Link]
David
(once upon a time, a youth who found it therapeutic to sit up all night reformatting "t-philes" :))
How old is our kernel?
Posted Feb 17, 2010 17:03 UTC (Wed) by joey (guest, #328) [Link]
historical kernel versions and rerun the analysis on that to see where the
code in the big spike on your graph really comes from.
Older trees
Posted Feb 17, 2010 17:09 UTC (Wed) by corbet (editor, #1) [Link]
I've thought about doing that; clearly, there would be more work involved, and I was interested in the five-year horizon for now. We have good history through the BitKeeper era, which could easily extend the view back a few years. Prior to that, of course, it's a big mess, though we do have per-release resolution for the most part.
But, yes, wouldn't it be interesting to know how much of 2.4.x we're still running?
Older trees
Posted Feb 17, 2010 17:57 UTC (Wed) by eli (guest, #11265) [Link]
I am picturing a different graph that I think might tell us more about kernel development than the bar graphs above.
The Y-axis would be the absolute number of lines of code, and the X-axis would be kernel releases.
Each kernel release would have a line that started at its release number. You'd have the total number of lines in the kernel marked for 2.6.12.
Then you'd draw a line to the 2.6.13 release to show how many lines of 2.6.12 remained in 2.6.13, and you'd start a new line for 2.6.13 showing the total number of lines in 2.6.13. Each new release would add a new line on the graph. It should look a bit like strata layers or something.
Over time, you should see some patterns in how the releases get replaced. If one release was particularly badly done, we'd see it start out with a large number of lines of code at its release, and see it rapidly squeezed to a small number of lines of code by later releases.
I wonder if there is such a thing as a "code half-life"...
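The data behind such a graph could be assembled along these lines. This is only a sketch under stated assumptions: it supposes a "git blame" survey has been run against each release, yielding a table of how many lines in that release originate from each earlier one. The input layout, the function names and the numbers in the usage example are all made up for illustration.

```python
def survival_series(blame_by_release, releases):
    """For each origin release, list how many of its lines survive in
    that release and in every later one (one curve per release)."""
    series = {}
    for i, origin in enumerate(releases):
        series[origin] = [
            blame_by_release[later].get(origin, 0) for later in releases[i:]
        ]
    return series


def releases_to_halve(counts):
    """Number of releases until at most half of the original lines remain,
    or None if that point has not been reached yet."""
    initial = counts[0]
    for steps, remaining in enumerate(counts):
        if remaining * 2 <= initial:
            return steps
    return None


# Illustrative numbers only, not measurements from the kernel.
surveys = {
    "A": {"A": 100},
    "B": {"A": 80, "B": 50},
    "C": {"A": 40, "B": 45, "C": 30},
}
curves = survival_series(surveys, ["A", "B", "C"])
print(curves["A"], releases_to_halve(curves["A"]))
```

Each list in the result is one line on the proposed graph; a release that was quickly rewritten would show a small value from `releases_to_halve`.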
Older trees
Posted Feb 19, 2010 22:32 UTC (Fri) by aegl (subscriber, #37581) [Link]
removal follows a "half-life" curve. We still have 58.6% of the git origin
(2.6.12-rc2) code present in the current kernel (it only makes up 30.5% of
the current code because the kernel is almost twice as large now).
Here's a graph showing growth of the kernel, and decline of the original
code:
http://www.kernel.org/pub/linux/kernel/people/aegl/codede...
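The two percentages quoted above can be checked against the "almost twice as large" remark with a little arithmetic. If 58.6% of the original lines survive and those same lines are 30.5% of the current tree, the ratio of the two fractions is the growth factor. A small sketch (the function name is invented for the example):

```python
def growth_factor(surviving_fraction, share_of_current):
    """Current size divided by original size.

    surviving * original == share * current, so
    current / original == surviving / share.
    """
    return surviving_fraction / share_of_current


factor = growth_factor(0.586, 0.305)
print(round(factor, 2))  # 1.92
```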
Older trees
Posted Feb 22, 2010 17:04 UTC (Mon) by aegl (subscriber, #37581) [Link]
http://www.kernel.org/pub/linux/kernel/people/aegl/codest...
The lowest line is the 2.6.12-rc2 git origin. Count up from there
to 2.6.32 at the top. Scripts were run with current tip of "linus"
tree at v2.6.33-rc8-113-gf8b55f2 so it doesn't take into account the
320 lines of code added and 149 deleted over the weekend.
Visually there does seem to be an inflection point around 2.6.27
where we slowed down at deleting old code (perhaps because there
was so much new code to be deleted instead?)
Older trees
Posted Feb 22, 2010 17:40 UTC (Mon) by nix (subscriber, #2304) [Link]
I wonder if the inflection point can be attributed to the staging tree?
(That's certainly a lot of new crap^Wcode to be deleted...)
Older trees
Posted Feb 26, 2010 0:26 UTC (Fri) by robert_s (subscriber, #42402) [Link]
Older trees
Posted Feb 26, 2010 14:44 UTC (Fri) by eli (guest, #11265) [Link]
'Course, now I need to stare at it for an hour looking for all the interesting things it's trying to tell me. ;)
Older trees
Posted Feb 17, 2010 18:31 UTC (Wed) by alex (subscriber, #1355) [Link]
best I could Google today though an archive of discussions about it:
http://kerneltrap.org/node/13996
Older trees
Posted Feb 17, 2010 18:35 UTC (Wed) by corbet (editor, #1) [Link]
Such things exist, yes. And, indeed, I've grabbed copies of them over time. Lots of old stuff in git://git.kernel.org/pub/scm/linux/kernel/git/davej/history.git, for example. Eventually I'll see what I can do about trawling through it all.
WoPDaSD 2010
Posted Feb 17, 2010 18:57 UTC (Wed) by PO8 (guest, #41661) [Link]
I'm on the Program Committee for the 5th Workshop on Public Data about Software Development (WoPDaSD 2010), and the paper deadline is coming up in March. I would really love to see you and/or other readers of LWN get this kind of data and analysis together as a workshop paper and submit it there. We don't get so many submissions from outside academia, and that's a shame—I'm confident that this work would be quite well-received.
WoPDaSD 2010
Posted Feb 17, 2010 20:50 UTC (Wed) by ajross (guest, #4563) [Link]
WoPDaSD 2010
Posted Feb 18, 2010 0:02 UTC (Thu) by felixfix (subscriber, #242) [Link]
"The full name of the compiler is "Compiler Language With No Pronounceable Acronym", which is, for obvious reasons, abbreviated "INTERCAL"."
WoPDaSD 2010
Posted Feb 18, 2010 7:48 UTC (Thu) by PO8 (guest, #41661) [Link]
Older trees
Posted Feb 17, 2010 19:25 UTC (Wed) by marineam (guest, #28387) [Link]
I'm guessing writing a smarter script would be faster. :-P
Older trees
Posted Feb 17, 2010 22:24 UTC (Wed) by dlang (guest, #313) [Link]
once this is done the combined archive can be treated as a single archive and I expect that the scripts used for this report could be used as-is (although it will obviously take longer)
IIRC, the historical git archive goes all the way back to the 0.0x days (although not without gaps)
David Lang
Older trees
Posted Feb 17, 2010 22:31 UTC (Wed) by corbet (editor, #1) [Link]
That's davej's repository, yes. I have it. The tools will require some tweaks to work well with that data source, but it's all certainly doable.
Older trees
Posted Feb 17, 2010 22:38 UTC (Wed) by viro (subscriber, #7872) [Link]
Older trees
Posted Feb 23, 2010 0:34 UTC (Tue) by Aissen (subscriber, #59976) [Link]
I contacted the original author about 3 months ago and built a tree using his ocaml program. It seems to gather data from dave, tglx and linus' tree.
If anyone is interested, I can forward the ~210k archive of the program building the tree.
How old is our kernel?
Posted Feb 17, 2010 17:26 UTC (Wed) by jzbiciak (guest, #5246) [Link]
Quite the opposite is true, though. The core kernel has the lowest percentage, weighing in at 13%! The drivers directory, on the other hand, is way out at about 30%.
The math doesn't work for me, though. How can the whole kernel be at 41% if only one of the subsystems (sound) is noticeably above 41%, Documentation is tied at 41%, and the rest are below? Even fs, net and include (which are near 40%, but below it) should bring the average down. And isn't drivers (which weighs in at ~30%) pretty much where the vast majority of code lives?
Is there some other half-million line dreadnaught hiding that's not captured in the subsystem graph, anchoring the average at 41%?
How old is our kernel?
Posted Feb 17, 2010 19:58 UTC (Wed) by corbet (editor, #1) [Link]
Hmm...something may have confused the full-kernel numbers. I'm checking (again) now. I'm quite convinced about the individual subsystem numbers, though. Stay tuned.
Yep, I blew it
Posted Feb 17, 2010 20:18 UTC (Wed) by corbet (editor, #1) [Link]
The 41% number was wrong; the article has been edited with the correct data. For the curious: the bug was a combination of the tool descending into the .git directory (which has no history of its own) and it attributing unknown lines to the oldest release. That essentially inflated the number of "old" lines in the full kernel by about 2 million - enough to make a big difference.
Please accept my apologies for the screwup. I'd taught the program to avoid .git a while back, but I thought it was an efficiency improvement only; I didn't realize it had messed up the full-kernel numbers. Embarrassing.
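For anyone writing a similar tool, the two problems described above can be guarded against fairly simply. The following is a hedged sketch, not the fix that went into the actual scripts: it prunes .git while walking the tree and keeps lines with no known origin in a bucket of their own instead of crediting them to the oldest release. The function names and the "unknown" label are assumptions made for the example.

```python
import os
from collections import Counter


def source_files(root):
    """Yield every file under root, never descending into .git."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from entering .git.
        if ".git" in dirnames:
            dirnames.remove(".git")
        for name in filenames:
            yield os.path.join(dirpath, name)


def tally(attributions):
    """Sum line counts per release from (release, lines) pairs.

    A release of None means the origin could not be determined; those
    lines are reported separately rather than counted as old code.
    """
    totals = Counter()
    for release, lines in attributions:
        totals[release if release is not None else "unknown"] += lines
    return dict(totals)
```

Reporting the "unknown" bucket alongside the real releases makes an error of this kind visible in the output rather than hiding it in the oldest bar of the chart.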
Yep, I blew it
Posted Feb 17, 2010 21:30 UTC (Wed) by jcm (subscriber, #18262) [Link]
No worries!
Posted Feb 17, 2010 22:10 UTC (Wed) by jzbiciak (guest, #5246) [Link]
I guess I wasn't too far off (in orders of magnitude) on the number of unaccounted-for lines in the data, although I did undercall it by a factor of 4.
Man, the kernel's gotten huge! I actually tried to download the latest kernel and untar it in a 300MB partition, and it b0rked out, filling the disk only part-way into the "drivers/" directory. Next time I'll do it on a machine with a more generous disk space allotment. :-)
How old is our kernel?
Posted Feb 17, 2010 17:48 UTC (Wed) by gouyou (guest, #30290) [Link]
I'm expecting that there is a much better chance to have some changes made on new code from the last release than on code introduced 5 releases ago.
How old is our kernel?
Posted Feb 17, 2010 18:51 UTC (Wed) by creemj (subscriber, #56061) [Link]
Don't knock old ISA cards!
Posted Feb 17, 2010 18:39 UTC (Wed) by cruff (subscriber, #7201) [Link]
Much of it need not change; the means by which one configures an ISA Sound Blaster card is pretty much as it always was - assuming one can find such a card and an ISA bus to plug it into.
Funny you mentioned ISA SoundBlaster cards. I just powered up my 16-year-old Alpha AXPpci33-based system to see if a newer release of a Linux distribution would still run on it. It contains a SoundBlaster 64 ISA card. This system is a real monster! :-) 166 MHz CPU, 96 MBytes memory. It feels pathetically slow these days.
How old is our kernel?
Posted Feb 17, 2010 19:53 UTC (Wed) by Hanno (guest, #41730) [Link]
Will these results lead to removal of old cruft?
How old is our kernel?
Posted Feb 18, 2010 2:23 UTC (Thu) by neilbrown (subscriber, #359) [Link]
Further proof that you cannot observe a system without changing it....
How old is our kernel?
Posted Feb 18, 2010 7:57 UTC (Thu) by nix (subscriber, #2304) [Link]
(there are some things not even kudos can do)
How old is our kernel?
Posted Feb 19, 2010 12:11 UTC (Fri) by hmh (subscriber, #3838) [Link]
drivers/block/floppy.c
1 files changed, 618 insertions(+), 601 deletions(-)
How old is our kernel?
Posted Feb 18, 2010 8:40 UTC (Thu) by awils1 (guest, #48857) [Link]
(Heh, and here I was thinking I was the only crazy-reformatter -- it may have not been t-philes, but I'm guilty of reformatting some of the GNU documentation.)
How old is our kernel?
Posted Feb 18, 2010 11:22 UTC (Thu) by error27 (subscriber, #8346) [Link]
He he.
How old is our kernel?
Posted Feb 18, 2010 15:28 UTC (Thu) by mezcalero (subscriber, #45103) [Link]
It would be really interesting to see the directories beneath drivers/ showing up in these stats at the same level as sound/.
How old is our kernel?
Posted Feb 19, 2010 0:34 UTC (Fri) by jengelh (subscriber, #33263) [Link]
Size
Posted Feb 19, 2010 0:49 UTC (Fri) by corbet (editor, #1) [Link]
The early git repositories - one file per object - were truly huge. There was a lot of griping at the time. Obviously, things have gotten a lot better since.
How old is our kernel?
Posted Feb 25, 2010 22:15 UTC (Thu) by ariveira (guest, #57833) [Link]
Argument from Linus back in git early days was that disk space
is cheap.
Others come up with the whole xdelta pack thing later.
Loose objects, packfile, deltification and the cost of disk space
Posted Mar 2, 2010 1:10 UTC (Tue) by jnareb (subscriber, #46500) [Link]
There was no deltification at all iirc Argument from Linus back in git early days was that disk space is cheap. Others come up with the whole xdelta pack thing later.
Actually, packfiles and deltification (LibXDiff, not xdelta) were, from what I remember and understand, originally introduced because of network bandwidth (which is much more costly than disk space) and the I/O performance of using a single mmapped file instead of a very large number of loose objects.
How old is our kernel?
Posted Feb 19, 2010 5:03 UTC (Fri) by aegl (subscriber, #37581) [Link]
All of the code in the current tree is attributed to just 131,681 commits by "git blame". So, not counting the merges, it would appear that about 23% of all commits are ultimately completely superseded (or just plain reverted).
Changing Licence
Posted Feb 25, 2010 0:54 UTC (Thu) by Felix_the_Mac (guest, #32242) [Link]
With the ownership of each line of code traced to its author, and an agreed definition of what constitutes original work (rather than just a minor change to somebody else's work) it would be possible to determine who held rights in each file.
If, for example, the GPL2 became unenforceable in China or India, and it was decided that a license change to GPL2++ was required, then permission could be sought from identified license holders and the code belonging to those who were unavailable or unwilling to give assent could be rewritten.
Any chance of increasing the resolution?
Posted Feb 25, 2010 19:46 UTC (Thu) by KGranade (guest, #56052) [Link]
How were the moved/renamed files accounted for ?
Posted Feb 26, 2010 8:46 UTC (Fri) by phdm (guest, #56884) [Link]
How were the moved/renamed files accounted for ?
Posted Feb 26, 2010 14:28 UTC (Fri) by corbet (editor, #1) [Link]
Git tracks renames nicely and retains the history for the moved files. So no, a rename does not, alone, cause any lines in the file to be considered to be changed.
How were the moved/renamed files accounted for ?
Posted Feb 26, 2010 16:33 UTC (Fri) by phdm (guest, #56884) [Link]
How were the moved/renamed files accounted for ?
Posted Feb 26, 2010 18:31 UTC (Fri) by dlang (guest, #313) [Link]
this can also track when parts of a file are copied to another file.
How were the moved/renamed files accounted for ?
Posted Mar 1, 2010 3:42 UTC (Mon) by phdm (guest, #56884) [Link]
I have now discovered that 'git log --follow' does.
How were the moved/renamed files accounted for ?
Posted Mar 2, 2010 1:13 UTC (Tue) by jnareb (subscriber, #46500) [Link]
In my experience, 'git log -M' did never give the full log of a single file. I have now discovered that 'git log --follow' does.
Actually, the problem is that in "git log -M filename" the filename part is a path limiter, and it is applied (for history simplification) before rename detection; that is why you need "git log --follow filename". "git log -M" (no pathspec) or "git log -M directory" should work as expected.
How were the moved/renamed files accounted for ?
Posted Mar 2, 2010 12:47 UTC (Tue) by nye (guest, #51576) [Link]
This seems like a good, specific example of one of those usability issues people are always handwaving about.