Who wrote 2.6.20?

[Posted February 21, 2007 by corbet]

Time recently published an article entitled "Getting rich off those who work for free" which, among other things, talked about free software this way:

"Open-source, volunteer-created computer software like the Linux operating system and the Firefox Web browser have also established themselves as significant and lasting economic realities."

It is not uncommon to see Linux referred to as a volunteer-created system, as opposed to the corporate-sponsored, proprietary alternatives. There has been little research, however, into how much work on Linux is truly "volunteer" - done on a hacker's spare, unpaid time. In general, the assumption that Linux is created by volunteers is simply accepted.

Determining the real provenance of free software can be a daunting task. There is a wealth of information available for those who look, however. In an attempt to shine some light in this area, your editor hacked up some scripts to do a lot of digging around in the kernel git repository. The idea was that, by looking at who is putting changes into the kernel, we can get a sense for where our source is coming from.

Who got patches into 2.6.20

This study looked at the stream of patches that changed the 2.6.19 kernel into the current 2.6.20 release. There were, as it turns out, 4,983 non-merge changesets in this release, contributed by 741 different developers. (Merge changesets mark where the contents of other repositories were pulled into the mainline, but they do not carry any code changes, so the analysis skipped them.) These patches added 286,439 lines of code and removed 159,812 others, for a total growth of 126,627 lines over the 2.6.20 development cycle.

Your editor's scripts looked over every non-merge commit in 2.6.20. For each, the developer listed as the "author" was given credit for the patch. This approach is not entirely fair, since one developer will, in some cases, be submitting code written by a group of people. In general, though, there is no easy way of getting around this problem - the true breakdown of authorship of a joint work simply is not available in the mainline repository. Your editor believes that this inaccuracy affects the accounting of a relatively small portion of the patches merged into the mainline.

Beyond that, how one generates statistics from a patch stream is an interesting question. How does one measure the productivity of programmers? One possibility is to look at the number of changesets merged. By that metric, this is the list of the most prolific contributors to 2.6.20:

Developers with the most changesets
Al Viro	241	4.8%
Andrew Morton	92	1.8%
Jiri Slaby	92	1.8%
Adrian Bunk	87	1.7%
Gerrit Renker	79	1.6%
Josef Sipek	79	1.6%
Avi Kivity	68	1.4%
Tejun Heo	67	1.3%
Patrick McHardy	63	1.3%
Ralf Baechle	61	1.2%
Randy Dunlap	59	1.2%
Alan Cox	58	1.2%
Mariusz Kozlowski	57	1.1%
Andrew Victor	53	1.1%
Paul Mundt	52	1.0%
Stefan Richter	49	1.0%
David S. Miller	48	1.0%
Russell King	44	0.9%
Benjamin Herrenschmidt	44	0.9%
Akinobu Mita	43	0.9%

Looking at patch counts rewards developers who put in large numbers of small patches. Al Viro's patches include a vast number of code annotations (to enable better checking with sparse), include file fixups, etc. Many of the changes are small - many do not affect the resulting kernel executable at all - but there are a lot of them. Even so, as the biggest contributor, Al generated less than 5% of the total changesets added to the kernel. The top 20 contributors, all together, generated 28% of the total changesets in 2.6.20.
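The changeset tally can be sketched in a few lines of Python. This is not the script your editor used; it is a minimal illustration which assumes that one author name per non-merge changeset has already been extracted (for example, with something like `git log --no-merges --pretty=format:%an v2.6.19..v2.6.20`). The function name `changeset_shares` and the toy input are made up for the example.

```python
from collections import Counter


def changeset_shares(authors):
    """Return (author, count, percent) rows, most prolific first.

    `authors` holds one author name per non-merge changeset.
    """
    counts = Counter(authors)
    total = sum(counts.values())
    return [
        (name, n, round(100.0 * n / total, 1))
        for name, n in counts.most_common()
    ]


# Toy input standing in for real `git log` output (not the 2.6.20 data).
sample = ["Al Viro"] * 3 + ["Jiri Slaby"] * 2 + ["Avi Kivity"]
for name, n, pct in changeset_shares(sample):
    print(f"{name}\t{n}\t{pct}%")
```

Each changeset counts once regardless of its size, which is exactly why this metric favors long series of small patches.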

One could make the argument that a better way to look at the problem is by the number of lines affected by a patch. In this way, a contributor's portion of the whole will not depend on whether it has been split into a long series of small patches or not. On the other hand, simply renaming a file can make it look like a developer has touched a large amount of code. Be that as it may, by looking at lines changed (defined as the greater of the number of lines added or removed by each individual changeset), one gets a table like this:

Developers with the most changed lines
Jeff Garzik	20712	6.0%
Patrick McHardy	15024	4.3%
Jiri Slaby	13917	4.0%
Avi Kivity	11726	3.4%
Andrew Victor	9710	2.8%
Amit S. Kale	9537	2.7%
Stephen Hemminger	9120	2.6%
Geoff Levand	8396	2.4%
Michael Chan	8307	2.4%
Chris Zankel	8099	2.3%
Mauro Carvalho Chehab	7390	2.1%
Adrian Bunk	6138	1.8%
Yoshinori Sato	5232	1.5%
Al Viro	4981	1.4%
Benjamin Herrenschmidt	4588	1.3%
Thierry MERLE	4549	1.3%
Dan Williams	4516	1.3%
Jonathan Corbet	3924	1.1%
Gerrit Renker	3857	1.1%
Jiri Kosina	3805	1.1%

Jeff Garzik comes out on top of this particular measurement by virtue of having deleted the long-unmaintained floppy tape subsystem. Patrick McHardy's work includes a number of additions to the netfilter subsystem, Jiri Slaby did a great deal of driver cleanup work, Avi Kivity was the contributor of the KVM virtualization code, and Andrew Victor contributed a number of ARM-related patches and the Atmel AT91 i2c driver. (The contributions made by other authors can be found by searching out their name in the 2.6.20 short-form changelog).
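For the curious, the "lines changed" definition used above (the greater of lines added or removed, taken per changeset) might be computed along these lines. This is a sketch under assumptions, not your editor's actual script: it assumes input shaped like the output of `git log --no-merges --numstat --format=@%an`, where each changeset starts with an `@`-prefixed author line followed by tab-separated `added`, `removed`, and `path` fields. The function names are hypothetical.

```python
from collections import defaultdict


def parse_changesets(log_text):
    """Turn numstat-style text into [author, added, removed] records."""
    changesets = []
    for line in log_text.splitlines():
        if line.startswith("@"):
            # Assumed marker: "@" followed by the author name.
            changesets.append([line[1:], 0, 0])
        elif changesets:
            fields = line.split("\t")
            # Binary files show "-" in both counts and are skipped here.
            if len(fields) == 3 and fields[0].isdigit() and fields[1].isdigit():
                changesets[-1][1] += int(fields[0])
                changesets[-1][2] += int(fields[1])
    return changesets


def lines_changed_by_author(log_text):
    """Sum max(added, removed) over each author's changesets."""
    totals = defaultdict(int)
    for author, added, removed in parse_changesets(log_text):
        totals[author] += max(added, removed)
    return dict(totals)


# Invented sample; the names and paths are placeholders.
sample_log = "\n".join([
    "@Alice Example", "",
    "10\t2\tdrivers/foo.c",
    "0\t30\tdrivers/old.c",
    "@Bob Example", "",
    "5\t5\tfs/bar.c",
    "-\t-\tfirmware/blob.bin",
    "@Alice Example", "",
    "1\t0\tMAINTAINERS",
])
print(lines_changed_by_author(sample_log))
```

Taking the maximum per changeset, rather than adding the two counts, keeps a line that was merely modified from being counted twice.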

Most of the developers in the above list got there by adding code to the kernel. It can be said, however, that the true heroes in the development community are those who remove code and make the kernel smaller. The developers who were best at removing more code than they added were:

Developers with the most lines removed
Jeff Garzik	19862	12.4%
Chris Zankel	5608	3.5%
Adrian Bunk	5528	3.5%
Arnd Bergmann	2224	1.4%
Linus Torvalds	1739	1.1%
Atsushi Nemoto	1425	0.9%
Thierry MERLE	911	0.6%
David Gibson	878	0.5%
Dominik Brodowski	528	0.3%
Stefan Richter	509	0.3%

Once again, Jeff Garzik's removal of ftape comes out on top, by far. Chris Zankel cleaned up the Xtensa architecture, removing a number of files in the process. Adrian Bunk worked on the ftape removal, got rid of the frame diverter code, removed an old, broken block driver, and generally performed cleanups all over the tree. Mr. Bunk is, in fact, the bane of old code; over the last year (since 2.6.16) he has removed a full 127,000 lines from the kernel source tree. Arnd Bergmann got rid of a bunch of syscall*() macros. Linus Torvalds removed the broken x86 stack unwinder code.

Finally, one could look at a different measure entirely: the number of patches signed off by each developer. A Signed-off-by: line is an indication that the person involved believes that the code is suitable for merging into the kernel; it implies that some degree of attention has been paid to the patch. Authors sign off their code, as do the subsystem maintainers who pass it up the chain. The top signers-off in 2.6.20 were:

Developers with the most signoffs
Andrew Morton	1422	13.7%
Linus Torvalds	1366	13.2%
David S. Miller	483	4.7%
Jeff Garzik	331	3.2%
Greg Kroah-Hartman	269	2.6%
Al Viro	241	2.3%
Paul Mackerras	232	2.2%
Andi Kleen	177	1.7%
Mauro Carvalho Chehab	170	1.6%
Russell King	166	1.6%
Adrian Bunk	120	1.2%
Arnaldo Carvalho de Melo	119	1.1%
Ralf Baechle	117	1.1%
James Bottomley	109	1.1%
Patrick McHardy	96	0.9%
Jiri Slaby	94	0.9%
Avi Kivity	87	0.8%
Josef Sipek	79	0.8%
Paul Mundt	78	0.8%
Gerrit Renker	78	0.8%

There were a total of 10,354 signoff lines in the 2.6.20 patch stream, so each changeset, on average, was signed off just over two times. It is interesting that Linus, who ultimately merges every patch, signed off only 1,366 of them: 13% of all signoff lines, or roughly 27% of the changesets. It seems that most patches, these days, go directly into the mainline from subsystem repositories without a signoff from Linus or Andrew. Most of the other names on that list, with just a few exceptions, are the maintainers of subsystem or architecture trees.
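Counting signoffs is mostly a matter of pattern matching on commit messages. The following is a rough sketch, assuming the full message of each changeset is available as a string; the regular expression and the names `SIGNOFF_RE` and `count_signoffs` are illustrative choices, not anything taken from your editor's scripts.

```python
import re
from collections import Counter

# Matches lines such as "Signed-off-by: Some Name <addr@example.com>".
SIGNOFF_RE = re.compile(
    r"^[ \t]*Signed-off-by:[ \t]*(.+?)[ \t]*<[^>\n]*>[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def count_signoffs(messages):
    """Count Signed-off-by lines per name across commit messages."""
    counts = Counter()
    for message in messages:
        counts.update(SIGNOFF_RE.findall(message))
    return counts


# Invented commit messages for illustration.
messages = [
    "net: fix thing\n\n"
    "Signed-off-by: Alice Example <alice@example.com>\n"
    "Signed-off-by: Bob Maintainer <bob@example.org>\n",
    "fs: cleanup\n\n"
    "Signed-off-by: Bob Maintainer <bob@example.org>\n",
]
counts = count_signoffs(messages)
print(counts.most_common())

# The article's average: 10,354 signoff lines over 4,983 changesets.
print(round(10354 / 4983, 2))
```

The last line reproduces the "just over two" figure from the totals quoted in the text.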

Who paid them

So now we have a sense for who got their fingers on the code which went into 2.6.20. But one interesting question still has not been answered: to what extent was that code contributed by volunteers (or "hobbyists")? Finding an answer to that question is somewhat trickier than looking at who wrote the patches, mostly because very few developers say "I wrote this on behalf of my employer."

The approach taken by your editor was relatively simplistic, but, perhaps, the best that is practical. Any patch whose author's given email address indicates a corporate affiliation is assumed to have been developed by an employee of that corporation. So any patch posted by somebody with an ibm.com email address is accounted as having been done by an IBM employee. Things are complicated by the fact that many people who work for companies do not use corporate addresses; it is not unheard-of for companies to have policies explicitly prohibiting code contributions associated with their domains. Your editor has coped with this problem by filling in the relevant developer's affiliation whenever it is known to him; in some cases, the developer was asked for this information.

This method has the effect of crediting all of an employee's work to his or her employer. In many cases, the situation is probably more complicated than that; one assumes, for example, that a certain kernel hacker's employer has not directed him to hack on Battle for Wesnoth. When looking only at kernel code, however, crediting all work to the employer is probably relatively safe.
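In code, the attribution rule might look something like the sketch below. It is only an illustration of the approach as described: the per-developer override table, the function name `employer_for`, and every address shown are hypothetical, and only the ibm.com example comes from the text above.

```python
# Corporate domains mapped to employers; entries are illustrative.
DOMAIN_EMPLOYERS = {
    "ibm.com": "IBM",
    "redhat.com": "Red Hat",
}

# Hand-maintained overrides for developers whose affiliation is known
# but not visible in their address (hypothetical entries).
KNOWN_DEVELOPERS = {
    "dev@example.org": "Example Corp",
    "hobbyist@example.net": "(None)",
}


def employer_for(email):
    """Guess the employer behind a patch author's email address."""
    email = email.strip().lower()
    if email in KNOWN_DEVELOPERS:
        return KNOWN_DEVELOPERS[email]
    domain = email.rpartition("@")[2]
    # Match the domain itself or any subdomain of it.
    for suffix, employer in DOMAIN_EMPLOYERS.items():
        if domain == suffix or domain.endswith("." + suffix):
            return employer
    return "(Unknown)"


print(employer_for("someone@us.ibm.com"))
print(employer_for("hobbyist@example.net"))
print(employer_for("random@example.com"))
```

Checking the override table first means that a known affiliation always wins over whatever the address suggests.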

Using this approach, the top sources of changesets were:

Top changeset contributors by employer
(Unknown)	1244	25.0%
Red Hat	636	12.8%
(None)	383	7.7%
IBM	368	7.4%
Novell	295	5.9%
Linux Foundation	261	5.2%
Intel	178	3.6%
Oracle	126	2.5%
Google	97	1.9%
University of Aberdeen	79	1.6%
HP	78	1.6%
Qumranet	71	1.4%
Nokia	67	1.3%
SGI	64	1.3%
Astaro	63	1.3%
MIPS Technologies	61	1.2%
SANPeople	53	1.1%
Miracle Linux	43	0.9%
MontaVista	41	0.8%
Broadcom	39	0.8%

Looking instead at the number of lines of code changed, the results become:

Top lines changed by employer
(Unknown)	66154	19.0%
Red Hat	44527	12.8%
(None)	38099	11.0%
IBM	25244	7.3%
Astaro	15306	4.4%
Linux Foundation	13638	3.9%
Qumranet	12108	3.5%
Novell	11930	3.4%
Intel	11652	3.4%
SANPeople	9888	2.8%
NetXen	9607	2.8%
Sony	8497	2.4%
Broadcom	8349	2.4%
Tensilica	8195	2.4%
Nokia	5581	1.6%
MontaVista	4394	1.3%
University of Aberdeen	4324	1.2%
LWN.net	3975	1.1%
Secretlab	3370	1.0%
HP	3211	0.9%

[Note that these tables have been updated once since the article was originally published; the curious can see what the original versions looked like.]

In these tables, the line marked "(Unknown)" is exactly that: patches for which the existence of a supporting employer could not be determined. The line marked "(None)", instead, indicates the patches from developers known to be working on their own time.

Either way, the results come out about the same: at least 65% of the code which went into 2.6.20 was created by people working for companies. If the entire "unknown" group turns out to be developers working on a volunteer basis - an unlikely result - then just over 1/3 of the 2.6.20 patch stream was written by volunteers. The real number will be lower, but it still shows that a significant portion of the code we run is written by developers who are donating their time.

One year's worth of changes

Looking at a single kernel release is instructive, but it can also be deceptive. The relatively short release cycle used by the kernel project makes it fairly easy for prolific developers to see few of their patches go into a specific release. In an attempt to gain a longer-term perspective, your editor forced his suffering system to crank through the entire history from 2.6.16 (released almost exactly one year ago) to the present. Some 28,000 non-merge changesets have been added to the mainline (by 1,961 developers) over that time, replacing 1.26 million lines of old code with 2.01 million lines of new code - the kernel grew by 754,000 lines.

The developers who touched the most lines over that time were:

Developers with the most changed lines
Adrian Bunk	134021	5.3%
Jeff Garzik	87847	3.5%
Andrew Vasquez	75195	3.0%
Mauro Carvalho Chehab	68568	2.7%
David Teigland	46607	1.9%
Ralf Baechle	38559	1.5%
David S. Miller	35958	1.4%
Andrew Victor	35594	1.4%
Bryan O'Sullivan	33901	1.4%
Paul Mundt	27041	1.1%
Dave Kleikamp	26615	1.1%
Lennert Buytenhek	25192	1.0%
Haavard Skinnemoen	24372	1.0%
Ben Dooks	23207	0.9%
Patrick McHardy	23175	0.9%
Ingo Molnar	22456	0.9%
James Bottomley	22205	0.9%
David Howells	19168	0.8%
Jiri Slaby	18335	0.7%
Divy Le Ray	17909	0.7%

The results for employers were:

Top lines changed by employer
(Unknown)	740990	29.5%
Red Hat	361539	14.4%
(None)	239888	9.6%
IBM	200473	8.0%
QLogic	91834	3.7%
Novell	91594	3.6%
Intel	78041	3.1%
MIPS Technologies	58857	2.3%
Nokia	39676	1.6%
SANPeople	36038	1.4%
SteelEye	36021	1.4%
Freescale	35034	1.4%
Linux Foundation	34163	1.4%
MontaVista	30211	1.2%
Simtec	26166	1.0%
Atmel	25975	1.0%
HP	23714	0.9%
SGI	22057	0.9%
Oracle	21251	0.8%
Open Grid Computing	20505	0.8%

The end result of all this is that a number of the widely-expressed opinions about kernel development turn out to be true. There really are thousands of developers - at least, almost 2,000 who put in at least one patch over the course of the last year. Linus Torvalds is directly responsible for a very small portion of the code which makes it into the kernel. Contemporary kernel development is spread out among a broad group of people, most of whom are paid for the work they do. Overall, the picture is of a broad-based and well-supported development community.

There are many other interesting things to be learned by looking at the kernel's development history. Expect more articles along these lines as your editor finds the time to improve his scripts.

why skip the merge commits?

Posted Feb 21, 2007 2:13 UTC (Wed) by dlang (guest, #313) [Link]

in many cases the merges involve a significant amount of effort and skill to do right. I'm curious why they were stripped out of the results?

If I understand git correctly a merge commit will only happen when there's something interesting. if linus pulls from a subsystem tree that is based directly on his latest version it will do a fast-forward, not a merge.

why skip the merge commits?

Posted Feb 21, 2007 2:35 UTC (Wed) by corbet (editor, #1) [Link]

There's no useful information in the merges - at least, for the depth I have gone to so far. They do carry information on the path patches took into the kernel, but they do not, themselves, carry any code changes. If you look at the short-form 2.6.20 changelog, you'll see a lot of lines like:

      Merge git://git.kernel.org/.../bunk/trivial
      Merge git://git.kernel.org/.../sfrench/cifs-2.6
      Merge master.kernel.org:/.../gregkh/driver-2.6
      Merge master.kernel.org:/.../gregkh/pci-2.6
      Merge master.kernel.org:/.../gregkh/usb-2.6

They only indicate that the trees came together at that point.

why skip the merge commits?

Posted Feb 21, 2007 5:03 UTC (Wed) by dlang (guest, #313) [Link]

my git-foo isn't good enough to know the answer to this, so I'll ask what's probably a dumb question

I know that the vast majority of merges are non-events like this, what happens when there is a conflict in the merge?

I thought the changes were recorded as part of the merge, the only other option would be for a merge, followed by the changes needed to make it work, and this would seem to cause problems for things like bisect

why skip the merge commits?

Posted Feb 21, 2007 19:01 UTC (Wed) by iabervon (subscriber, #722) [Link]

For all commits, what is recorded is the resulting state and the commit(s) which went into it. In order to determine if there were conflicts, you just try merging the inputs yourself and see if it's trivial or not. Of course, you can't tell if the person who actually did the merge used some special strategy which knew how to do the merge without conflicts. If your try didn't give conflicts, you should also compare the result against the commit, because it's possible that the person fixed stuff that didn't get flagged as a conflict (e.g., the two branches added the same function in different places, and the person removed one copy when the compiler complained).

In a sense, all merges are events (otherwise, you get a fast-forward), but an external observer can never really tell how much of the event was done by the committer and how much was done by software. Who knows, somebody might have a secret special sparse-based C source merger.

why skip the merge commits?

Posted Feb 21, 2007 20:17 UTC (Wed) by dlang (guest, #313) [Link]

the question is, should merge events be ignored, or can code changes take place as part of the merge event.

if so then we'll need to update the scripts to account for this when corbet releases them in a week or so.

why skip the merge commits?

Posted Feb 21, 2007 21:20 UTC (Wed) by iabervon (subscriber, #722) [Link]

Code changes really shouldn't be part of a merge event. Even resolving conflicts is really a matter of "not changing" stuff in some sense.

Who wrote 2.6.20?

Posted Feb 21, 2007 2:44 UTC (Wed) by pr1268 (subscriber, #24648) [Link]

Using lines of code as a metric is pure evil. Sorry for venting, but I've learned and read that LOC is the single most misused and abused metric in all of software engineering.

However, I do respect and appreciate the hard work our editor has done. I assume there was no easier way to quantify and qualify the data above into meaningful information which accurately represents the state of authorship of the Linux Kernel. Is that a fair assessment? Finally, is there a correlation between the quantity of patches in a particular functional section of the Kernel (i.e. virtualization, filesystems, network device drivers, etc.) with whatever company has a vested interest in ensuring that functionality adds value to the company's Linux product(s)?

Thank you, Jon, for this research.

Who wrote 2.6.20?

Posted Feb 21, 2007 2:51 UTC (Wed) by corbet (editor, #1) [Link]

As I noted in the article, measuring these things is hard, and I agree that lines-of-code is of limited utility. Still, there's some information there, so I thought it was worth a look.

Delving into the various kernel subsystems is an area of future research. I did some quick-and-dirty runs which suggest that the representation of the various companies does not change as much as one might expect from one subsystem to another. It also looks like the "hobbyist" contribution to the core parts of the kernel is just as high as in, say, the driver tree. I will be looking at this more in the future.

Who wrote 2.6.20?

Posted Feb 21, 2007 6:20 UTC (Wed) by jamesm (guest, #2273) [Link]

grepping for names and email addresses in the kernel source is sometimes useful (try grep -ri davem /usr/src/kernel for example).

If not SLOC, then what?

Posted Feb 21, 2007 6:49 UTC (Wed) by ldo (guest, #40946) [Link]

"LOC is the single most misused and abused metric in all of software engineering."

So what? What's the alternative?

Who wrote 2.6.20?

Posted Feb 21, 2007 16:23 UTC (Wed) by richardl@redhat.com (guest, #31678) [Link]

LOC is a perfectly valid metric as long as you normalize against language, etc. In this case, LOC is used as a relative metric. The effort required to produce 100 LOC in C for the kernel is different from the effort required to produce 100 LOC in, say, Ruby for a webapp -- but that's not what the editor is doing here.

I'd be interested in hearing why you think LOC is "pure evil." I think it all depends on how you use it.

Who wrote 2.6.20?

Posted Feb 21, 2007 16:46 UTC (Wed) by lmb (subscriber, #39048) [Link]

LoC changed is difficult though. For example, I could iterate 100 times trying to get a single line of code right. But then, software metrics are hard.

One suggestion for a possibly interesting metric, so that I don't have to code it myself:

Annotate the whole of the tree: Who last changed which line? Number of lines * age = Author score.

This can then be extended to a historical score: who contributed how many lines of code, and how long did they remain in the tree before being removed/changed? Developers changing their own code would get accumulated, so this is essentially neutral.
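A rough sketch of the first half of this idea, for anybody who wants to try it: assuming the per-line author and last-change date have already been pulled out of blame output (`git blame --line-porcelain` reports both for every line), the score is just the summed age of each author's surviving lines. The function name `author_scores` and the sample data are made up.

```python
from collections import defaultdict
from datetime import date


def author_scores(blamed_lines, today):
    """Sum the age in days of every surviving line, per author.

    `blamed_lines` is an iterable of (author, last_changed_date) pairs,
    one pair per line currently in the tree.
    """
    scores = defaultdict(int)
    for author, last_changed in blamed_lines:
        scores[author] += (today - last_changed).days
    return dict(scores)


# Invented data: two recent lines from one author, one old line from another.
lines = [
    ("alice", date(2007, 1, 1)),
    ("alice", date(2007, 2, 1)),
    ("bob", date(2006, 2, 21)),
]
print(author_scores(lines, today=date(2007, 2, 21)))
```

Summing per-line ages is the same as "number of lines * age" when each line carries its own age, so one long-lived line can outweigh several new ones.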

LOC metric

Posted Feb 23, 2007 1:23 UTC (Fri) by giraffedata (guest, #1954) [Link]

"...as long as you normalize against language, etc. In this case, LOC is used as a relative metric. The effort required to produce 100 LOC in C for the kernel is different from the effort required to produce 100 LOC in, say, Ruby for a webapp"

I saw a study long ago that had the remarkable result that there is nothing to normalize here. It was looking specifically at the cost to develop and test new software, and found that 100 LOC costs the same regardless of the language or subject. What I've seen is consistent with that.

The study did find a few variables that added precision to a LOC-based estimate. With modification of existing code, there were some measurements of the code base that helped. I think number of files touched added precision too.

Who wrote 2.6.20?

Posted Feb 24, 2007 11:05 UTC (Sat) by bockman (guest, #3650) [Link]

Well, for one thing, you can often accomplish something equivalent with 1000 lines of dumb code or with 300 lines of very smart code. Most of the programming effort goes into figuring out the 'commonalities' between potential code blocks and writing customizable code (loops, routines, classes, templates) that exploits said commonalities. But the more time a developer spends on this kind of exercise, the shorter the final code turns out.

I don't say that LOC measurements are meaningless. Just that they are statistics and should not be used outside of this context (for instance, they should not be used to measure the productivity of a developer or even a team).

Ciao
-----
FB

Who wrote 2.6.20?

Posted Mar 1, 2007 21:00 UTC (Thu) by jboorn (guest, #43808) [Link]

So what. You can write really slow naive brute force code for some problem with 300 lines. Or you can use a fancy complicated algorithm that takes 1000 lines of code, but is much faster.

In this case the code is for the same project and I think using lines of code within a project is good enough for the analysis sought here.

It is a bit annoying to see the same pointless argument about lines of code counts come up. Sure it is possible to find examples of code that is smaller and as efficient (or more efficient) than a given larger implementation. But, that does not exclude the existence of larger code that is more desirable for a given project based on a metric other than executable size.

LOC is quite ok...

Posted Feb 21, 2007 21:25 UTC (Wed) by nettings (subscriber, #429) [Link]

"Using lines of code as a metric is pure evil. "

wrong. absolute lines-of-code counts are certainly bogus as a measure for productivity, but the purpose of this article was to find a relative measure of where commits come from.
unless you can demonstrate that corporate-backed hackers produce a significantly different amount of functionality or utility per line of code (which would introduce a systemic error), the method is perfectly valid, because the inherent bogosity of LOC measurements will level out.

LOC is quite ok...

Posted Mar 3, 2007 17:36 UTC (Sat) by jzbiciak (guest, #5246) [Link]

Also, LOC is only meaningful if the output of the measurement isn't an input into future productivity. If coders are incentivized by their KLOC numbers (either directly, such as through wages and promotions, or indirectly through ego boosting), then KLOC can quickly become meaningless.

LOC metrics

Posted Feb 21, 2007 23:32 UTC (Wed) by man_ls (guest, #15091) [Link]

LOC is a perfectly valid metric; all metrics can be abused, and LOC have suffered more than their due, but well understood and with a little effort (e.g. removing blanks and comments) they are very useful.

Laird and Brennan said it well: LOC are like square meters for an apartment. Sure, 160 m^2 in Madrid are not comparable directly to 160 m^2 in rural Teruel. And even in the same city, if you compare the price of m^2 for luxury attics with old basements you are probably going to make a bad decision. But if you are going to buy a house, you had better know how many m^2 it has, instead of relying on subjective impressions of size.

In this case, what do you propose measuring? Function points? In case you don't know, when you don't have direct fp counts from construction data, you backfire them from... lines of code, by applying a coefficient.

LOC metrics

Posted Feb 23, 2007 0:00 UTC (Fri) by giraffedata (guest, #1954) [Link]

"But if you are going to buy a house, you had better know how many m^2 it has, instead of relying on subjective impressions of size."

I'd say just the opposite. If you're looking at the house, your subjective impression of size is what really counts. The square meters in the listing are a cheap estimate -- cheaper than visiting the house -- of how spacious it is.

And so it is with LOC. If you're asking what it would cost to duplicate the development of 2.6.20 from 2.6.19, getting a bunch of professionals to look at the function and give their impression of how many person-hours it would take would be a lot better than counting LOC, but LOC is much cheaper. And history shows that the quality of the estimate you get by multiplying by LOC is quite acceptable.

Who wrote 2.6.20?

Posted Feb 25, 2007 15:55 UTC (Sun) by kingdon (guest, #4526) [Link]

To his credit, Jon gave higher praise to deleting code than writing it.

So although I agree that a naive attitude of "more lines of code means the developers are working harder/better" is dead wrong, I wouldn't tar this analysis with that brush.

non-merge changesets

Posted Feb 21, 2007 2:53 UTC (Wed) by smitty_one_each (subscriber, #28989) [Link]

Request term definition, please.

non-merge changesets

Posted Feb 21, 2007 3:08 UTC (Wed) by corbet (editor, #1) [Link]

See other comment...a merge changeset just marks the intersection of two trees, but carries no changes itself. Guess I should see if I can add a sentence to clarify the article...

Who wrote 2.6.20?

Posted Feb 21, 2007 4:47 UTC (Wed) by bcs (guest, #27943) [Link]

The difference in Linux Foundation's rank between the 2.6.20 changes and changes over the past year caught my interest. Out of curiosity, is there a simple explanation for that discrepancy? TIA.

Who wrote 2.6.20?

Posted Feb 21, 2007 8:55 UTC (Wed) by seyman (subscriber, #1172) [Link]

"Out of curiosity, is there a simple explanation for that discrepancy?"

I suspect this is due to the fact that the Linux Foundation is just one month old. It was created on Jan 21, 2007 from the merger of OSDL and the Free Standards Group.

Who wrote 2.6.20?

Posted Feb 21, 2007 12:22 UTC (Wed) by bcs (guest, #27943) [Link]

That makes sense, but the number of lines contributed from Linux Foundation employees is different between the 2.6.20 table and the year-long table, and OSDL doesn't appear at all, so I assumed the "Linux Foundation" entry included OSDL's old numbers as well.

Who wrote 2.6.20? LKML Traffic

Posted Feb 21, 2007 5:13 UTC (Wed) by PaulDickson (guest, #478) [Link]

A couple of years ago I looked at the traffic on the Linux-Kernel Mailing List. It too supported the claim that development work had moved to "business hours".

Who wrote 2.6.20? LKML Traffic and patches

Posted Mar 2, 2007 5:32 UTC (Fri) by lacostej (guest, #2760) [Link]

I second this. Having a look at when the emails or commits are produced in local time (not email|git server time) might give an interesting estimate of whether the work was done during work or leisure hours. Following this number over time might be even more interesting.
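Something like the following could be a starting point. It assumes author timestamps in the form that `git log --pretty=format:%ai` prints (local wall-clock time followed by a UTC offset), and it assumes that "business hours" means 09:00 to 17:00, Monday to Friday, which is of course debatable. The function names are made up.

```python
from datetime import datetime


def is_business_hours(timestamp, start_hour=9, end_hour=17):
    """True if the author's local time falls on a weekday work slot.

    `timestamp` looks like "2007-02-21 14:35:00 +0100"; the hour is the
    author's own wall-clock hour, since the offset is kept, not applied.
    """
    local = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S %z")
    return local.weekday() < 5 and start_hour <= local.hour < end_hour


def business_hours_share(timestamps):
    """Fraction of timestamps that fall inside business hours."""
    if not timestamps:
        return 0.0
    hits = sum(is_business_hours(ts) for ts in timestamps)
    return hits / len(timestamps)


# Invented timestamps: a weekday afternoon, a Saturday, a weekday night.
stamps = [
    "2007-02-21 14:35:00 +0100",
    "2007-02-24 11:05:00 +0000",
    "2007-02-21 02:13:00 -0800",
]
print(business_hours_share(stamps))
```

The recorded time is only as good as the clock and timezone on the author's machine, so this would be an estimate at best.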

Who wrote 2.6.20?

Posted Feb 21, 2007 6:16 UTC (Wed) by dambacher (subscriber, #1710) [Link]

Very interesting statistics that is!
One thing I personally found remarkable: to see that Broadcom is "sponsoring" kernel work. In the past they were not well known for good linux adoption (at least to me using their network/wifi devices). But that may have changed.

Who wrote 2.6.20?

Posted Feb 21, 2007 6:55 UTC (Wed) by drag (guest, #31333) [Link]

It's probably mostly driver code for hardware they want to sell on servers or in embedded systems, where it's worth their time to contribute.

I really doubt that their position on 'IP' has changed any in regards to consumer grade stuff.

It may also be partly due to the fact that large corporations have decentralized management, where the left hand may disagree entirely with the right, while in the meantime the left foot is quietly contributing code.

Who wrote 2.6.20?

Posted Feb 21, 2007 9:52 UTC (Wed) by johill (subscriber, #25196) [Link]

Broadcom appears to have quite separate internal groups, so while, for example, the wired networking group is doing tg3 and b44, bcm43xx is entirely a volunteer effort.

Who wrote 2.6.20?

Posted Feb 21, 2007 11:49 UTC (Wed) by gouyou (guest, #30290) [Link]

It also looks like they are making a significant part of the hardware for the OLPC project.

Who wrote 2.6.20?

Posted Feb 21, 2007 14:35 UTC (Wed) by dwmw2 (subscriber, #2063) [Link]

It also looks like they are making a significant part of the hardware for the OLPC project.

Er, Broadcom? Not so.

Who wrote 2.6.20?

Posted Feb 21, 2007 15:57 UTC (Wed) by gouyou (guest, #30290) [Link]

My bad, I was remembering the mail from Theo de Raadt about OLPC and NDAs where he was talking about Broadcom. Confused it in my mind with Marvell ...

Broadcom and Linux

Posted Feb 22, 2007 15:02 UTC (Thu) by massimiliano (subscriber, #3048) [Link]

This is not kernel related, but it is Linux related anyway...

A Broadcom employee ported the Mono JIT to the MIPS architecture, because they needed it, and of course they were going to use it on Linux.

Who wrote 2.6.20?

Posted Feb 21, 2007 9:39 UTC (Wed) by simlo (guest, #10866) [Link]

How can the kernel community take code from people working from an "unknown" employer? Who has the copyright then? Is it something he does on his own time or is it the employer's code? Who is entitled to add GPL to the code?

Who wrote 2.6.20?

Posted Feb 21, 2007 9:45 UTC (Wed) by schutz (subscriber, #3760) [Link]

People who code during their free time ? Self-employed people ?

Who wrote 2.6.20?

Posted Feb 21, 2007 9:49 UTC (Wed) by simlo (guest, #10866) [Link]

> People who code during their free time ?
They are under "(None)" already

> Self-employed people ?
Should be either under "(None)" or under the name of their tiny company.

Who wrote 2.6.20?

Posted Feb 21, 2007 9:53 UTC (Wed) by johill (subscriber, #25196) [Link]

I think the point is that Jon simply can't know, in most cases, whether someone is affiliated with a tiny company or working on their own. In the cases where he did know, he distinguished (companies are listed, and "(None)" is listed), but where he doesn't know, that's reflected by "(unknown)".
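The three-way split described here can be sketched as a lookup with a fallback. Everything named below (the tables, the addresses, the `classify` function) is hypothetical and only illustrates the distinction; it is not how the article's scripts are known to work.

```python
# Hypothetical lookup tables; a real study would have to build these by hand.
KNOWN_EMPLOYERS = {"example-corp.com": "Example Corp"}
KNOWN_VOLUNTEERS = {"hobbyist@example.org"}

def classify(email):
    """Return an employer name, "(None)" for a known volunteer,
    or "(unknown)" when nothing is known about the address."""
    email = email.strip().lower()
    if email in KNOWN_VOLUNTEERS:
        return "(None)"
    domain = email.rsplit("@", 1)[-1]
    # Unlisted domains fall through to "(unknown)" rather than being guessed.
    return KNOWN_EMPLOYERS.get(domain, "(unknown)")
```

The point of the fallback is that "(unknown)" records missing knowledge, while "(None)" is a positive statement that the developer has no employer behind the work.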

Who wrote 2.6.20?

Posted Feb 21, 2007 11:12 UTC (Wed) by fozzy (guest, #7022) [Link]

A few quick comments:

First great work Jon!

I wonder if a "Sponsored by" type addition that could be used in the MAINTAINERS file would make this sort of analysis in the future easier. The emails could then be matched to sponsor - maybe even defining a special "myself" sponsor for those doing the work privately. I'm sure a lot of companies would be happy for the recognition it would bring. However, as a kernel user rather than a contributor, maybe such a suggestion is culturally inappropriate.

Do you plan on making the scripts available so others can slice and dice the numbers without having to be such a git-foo expert?

Again, thanks for such interesting analysis.

Releasing the scripts

Posted Feb 21, 2007 13:56 UTC (Wed) by corbet (editor, #1) [Link]

I guess I don't see any reason why I couldn't make my scripts available - it would be a rather more straightforward affair than releasing the site code...:) It may take a week or so (I have a lot of other things to do), but I'll try to get that done. Be warned that they are not a thing of beauty, though...

Releasing the scripts

Posted Feb 23, 2007 9:49 UTC (Fri) by PhilHannent (guest, #1241) [Link]

It could end up like GIT and really take off.

It's something I would like to see on a monthly basis and perhaps with added charting. An interested party could develop it further for you and you could still put the results on the site.

Sounds great to me.

Releasing the scripts

Posted Mar 2, 2007 0:09 UTC (Fri) by turpie (guest, #5219) [Link]

The problem with this idea is that it may encourage people to produce longer code rather than efficient code so that they can get a higher score.

edu contributions?

Posted Feb 21, 2007 10:12 UTC (Wed) by bkoz (guest, #4027) [Link]

Interesting. Last year I did a quick scan of gcc/gnome/firefox/kernel looking specifically for educational contributors, or at least people contributing from .edu domains. (I realize this is not super accurate, but it gave a pointer about what institutions were contributing or had contributed.)

This might be a way to categorize some of the "unknown" or "none" bits from your tables.

Any chance you could break this down as well? (I'd noticed an impressively large Australia and China contingent, which doesn't seem to be showing up in your analysis)
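A rough sketch of such a domain-based breakdown, assuming a list of author email addresses is already at hand. The helper names are made up for illustration, and the rule used (a final "edu" label, or "edu"/"ac" directly before a two-letter country code) is only a heuristic that will miss some institutions and mislabel others.

```python
from collections import Counter

def _labels(email):
    """Split the domain part of an address into lowercase labels."""
    return email.rsplit("@", 1)[-1].strip().lower().split(".")

def is_educational(email):
    """Heuristic: x.edu, or x.edu.cc / x.ac.cc for a two-letter country code."""
    labels = _labels(email)
    if labels[-1] == "edu":
        return True
    return len(labels) >= 3 and len(labels[-1]) == 2 and labels[-2] in ("edu", "ac")

def country_code(email):
    """Two-letter final label (e.g. 'au', 'cn'), or None for generic domains."""
    last = _labels(email)[-1]
    return last if len(last) == 2 else None

def count_by_country(emails):
    """Tally addresses by country-code suffix, skipping generic domains."""
    return Counter(c for c in map(country_code, emails) if c is not None)
```

As the comment itself concedes, an address says where the mail came from, not who paid for the work, so the result is a pointer rather than a measurement.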

The companies are volunteering

Posted Feb 21, 2007 15:17 UTC (Wed) by avik (guest, #704) [Link]

One way to look at it, is that the companies that employ the contributors
are volunteering the code. It's very different for a company to
contribute engineering work and for an individual to contribute their
spare time, but it is still a voluntary contribution.

Who wrote 2.6.20?

Posted Feb 21, 2007 16:40 UTC (Wed) by charris (guest, #13263) [Link]

It might also be interesting to try tabulating the contributors by sex. My impression, unsupported by any statistics, is that most of the women who contribute to the kernel work for IBM.

to be fair

Posted Feb 21, 2007 17:03 UTC (Wed) by ccyoung (guest, #16340) [Link]

to be fair (and unduly complicated) gnu and Xorg participation should be merged into this. for example, Novell has put a lot of cycles into X, a contribution no less relevant.

to be fair

Posted Feb 22, 2007 4:42 UTC (Thu) by k8to (guest, #15413) [Link]

If this was intended to measure the contributions of companies to free software as a whole, sure. But this article had a much narrower (and more achievable) scope.

Linux means two things, really. Sometimes people mean the kernel, sometimes people mean "that bunch of mostly-the-same operating systems we call Linux". This article was about the former.

Linux OS or Linux kernel

Posted Feb 23, 2007 1:35 UTC (Fri) by giraffedata (guest, #1954) [Link]

But it's still a good point that the article presents itself as a response to claims such as the one quoted:

Open-source, volunteer-created computer software like the Linux operating system and the Firefox Web browser ...

which almost certainly refer to the whole Linux operating system package, with the GNU stuff, Xorg, KDE, etc., etc.

Numbers for the Linux kernel certainly help to answer the question posed, but it's worth at least pointing out that the kernel is probably one of the less representative samples one could make of the operating system.

Who wrote 2.6.20?

Posted Feb 22, 2007 13:15 UTC (Thu) by jpmcc (guest, #2452) [Link]

Dispelling the perception that Linux is cobbled together by a large cadre of lone hackers working in isolation, the individual in charge of managing the Linux kernel said that most Linux improvements now come from corporations.

From Linux now a corporate beast, Joab Jackson GCN, 07/19/04

Who wrote 2.6.20?

Posted Feb 22, 2007 13:43 UTC (Thu) by sepreece (guest, #19270) [Link]

It would be interesting (and probably a lot harder) to do similar numbers for all the patches submitted (rather than accepted), and an "impact" or "futility" scoring, comparing submissions to acceptances.

Google?

Posted Feb 22, 2007 18:52 UTC (Thu) by jvotaw (subscriber, #3678) [Link]

I don't see Google listed but I think Daniel Phillips and Andrew Morton both work there. Maybe others, too.

But the really cool thing, to me, is how long the tail on this is -- very few people have contributed more than 1% of the code. It's truly a community effort.

-Joel

Google?

Posted Feb 24, 2007 15:42 UTC (Sat) by corbet (editor, #1) [Link]

Thanks - you drew my attention to the biggest inaccuracy in the original set of tables - akpm's work had just automatically been put into the Linux Foundation pile. The tables have been updated with that error fixed; I was also able to prune back the "unknown" category a bit.

How many professionals (in paid time) develop Free Software?

Posted Feb 23, 2007 15:06 UTC (Fri) by ber (subscriber, #2142) [Link]

| There has been little research, however, into how much work on Linux is
| truly "volunteer" - done on a hacker's spare, unpaid time. In general,
| the assumption that Linux is created by volunteers is simply accepted.

True and while our editor actually examines Linux and not the operating
system around it, I would like to expand the hypothesis to Free Software
and GNU/Linux in general.

A few years ago I first looked at the problem
and the only backed-up number I found was from

[Lakhani et al. 2002]
Karim Lakhani, Bob Wolf, Jeff Bates and Chris DiBona Hacker Survey v0.73,
24.6.2002, Boston Consulting Group http://www.osdn.com/bcg

You can estimate from it that about 40% of the stable Free Software
(they have pulled their sample from) was developed in paid time.
To do this you can look at the participating people (25% professionals
in paid time) and how much they contribute (twice as many hours)
and end up with about 40%. Given that someone spending more hours
could be more effective, the effect could be even higher.
Of course the sample has systematic errors, like that groups that have had
their own infrastructure like GNU or BSD are probably underrepresented.
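The arithmetic behind the 40% figure can be written out as a short sketch. The function name is an assumption; the inputs (25% of participants paid, each putting in twice the hours of an unpaid participant) are the ones quoted above.

```python
def paid_time_share(paid_fraction, hours_ratio):
    """Share of total hours done in paid time, when paid participants
    work hours_ratio times as many hours as unpaid ones."""
    paid_hours = paid_fraction * hours_ratio      # 0.25 * 2 = 0.50
    unpaid_hours = (1.0 - paid_fraction) * 1.0    # 0.75 * 1 = 0.75
    return paid_hours / (paid_hours + unpaid_hours)

# 0.50 / (0.50 + 0.75) = 0.50 / 1.25, i.e. about 40%
share = paid_time_share(0.25, 2)
```

The estimate weights people by hours only; it does not model the possibility, mentioned above, that longer hours also make a contributor more effective per hour.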

I have also mentioned the number in my paper from 2004:
http://intevation.de/~bernhard/publications/200408-hmd/20...
which got published in a peer-reviewed magazine (German only).

Great article

Posted Feb 23, 2007 19:03 UTC (Fri) by Gady (guest, #1141) [Link]

The open source community does painfully little self observation, or understanding of where it stands in society. We need more articles like this!

How does he find the time?

Posted Feb 26, 2007 4:21 UTC (Mon) by Max.Hyre (subscriber, #1054) [Link]

I notice our esteemed editor shows up on one of the lists. He obviously has a serious side interest in human cloning research. :-)

Who wrote 2.6.20?

Posted Mar 1, 2007 18:03 UTC (Thu) by shaitand (guest, #43800) [Link]

Doesn't that unfairly credit employers? The example you gave seemed to imply there might be small pieces the employers didn't pay for but what if that joeb@us.california.freemont.viavoice.office12.joesdesk.ibm.com doesn't get paid by IBM for ANY of the code he contributes to the kernel? Maybe he works on viavoice for IBM and writes kernel code as a hobby and IBM just signed off on it?

A developer of kernel quality probably works for a large firm that is open source friendly enough to sign off on it. But that doesn't mean they are actually paid by that company to code for the kernel. An ibm.com email address is as likely to designate someone working on project x as someone IBM is paying for their kernel contributions.

Who wrote 2.6.20?

Posted Mar 1, 2007 19:46 UTC (Thu) by hv76 (guest, #43803) [Link]

People who work for companies like IBM know when they can use their corporate email and when not!

These companies are big enough to have proper procedures/rules that define this.

Who wrote 2.6.20?

Posted Mar 1, 2007 22:59 UTC (Thu) by tap (guest, #43813) [Link]

I tried looking just at authors, using the Mercurial mirror of the git repository, and got slightly different results.

I counted 4769 non-merge changesets, vs your 4983. For the top 20 developers by changesets, mine are almost the same. I have Alan Cox with 60 changesets vs your 58. He has two with a redhat email address, I bet you missed those.

Who wrote 2.6.20?

Posted Mar 2, 2007 21:33 UTC (Fri) by kolyshkin (guest, #34342) [Link]

Thanks a lot for such an interesting article! But how have you counted all this? Perhaps publishing your scripts would make much sense, since we are all in the open source world :)

I have also mocked up a pipe of commands to count those changesets. This is what I ended up with (for SWsoft, the company that pays me to work on OpenVZ):

The problem here is that the number is not the same as yours. See, the old version of the "Top changeset contributors by employer" table contained SWsoft (the company that pays me) with 37 changesets. In the new version of the table SWsoft is no longer there (it dropped out of the top 20).

The only way I can come up with your result, 37, is to exclude Dmitry Mishin's 4 patches.

Who wrote 2.6.20?

Posted Mar 2, 2007 21:43 UTC (Fri) by kolyshkin (guest, #34342) [Link]

In fact, the previous error (if it's your error, not mine) is not that big.

The big one is the first table, the number of changesets by Josef Sipek. You got 79 for him, but there are 29 more patches by a "different" author, Josef "Jeff" Sipek. That makes the number 108, and the second position.

The bare command line I have used, if anybody wants to repeat it, is

It is stupid and does not account for "different" authors -- I noticed that "manually".

Of course, first you need to clone the Linux 2.6 source git tree:

mkdir linux-2.6
cd linux-2.6
git-clone git://git2.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
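The "different author" problem with the two spellings of Josef Sipek can be reduced by counting on the email address instead of the display name. The sketch below is not the command line used above; it assumes name/email pairs have been exported first (for instance with `git log --no-merges --pretty=format:'%an|%ae' v2.6.19..v2.6.20`), and the address in the example is a placeholder. It also assumes both spellings share one address, which the numbers above do not confirm.

```python
from collections import Counter, defaultdict

def count_by_email(commits):
    """commits: iterable of (author_name, author_email), one pair per changeset.
    Returns (changeset count per email, set of name spellings per email)."""
    counts = Counter()
    names = defaultdict(set)
    for name, email in commits:
        key = email.strip().lower()   # the email, not the name, identifies the author
        counts[key] += 1
        names[key].add(name)
    return counts, names

# Placeholder data shaped like the case above: 79 + 29 changesets.
sample = [("Josef Sipek", "jsipek@example.org")] * 79 \
       + [('Josef "Jeff" Sipek', "jsipek@example.org")] * 29
counts, names = count_by_email(sample)
total = counts["jsipek@example.org"]   # 108
```

This does nothing for the opposite case of one person using several addresses, such as a personal and a corporate one; that still needs a hand-maintained alias table.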

Who wrote 2.6.20?

Posted Mar 7, 2007 16:30 UTC (Wed) by paort (guest, #43933) [Link]

Contemporary kernel development is spread out among a broad group of people, most of whom are paid for the work they do.

I was wondering what the source for that is. In the article you have data showing that most contributions come from non-volunteers, but that does not mean they are the majority of contributors. We could have a small number of paid coders doing most of the job and a lot of volunteers doing small portions. Do you happen to have the absolute numbers for paid and volunteer coders contributing to the kernel?
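The distinction between the two measures can be made concrete with a small sketch. The data below is invented purely to show how the share of contributors and the share of contributions can diverge; it says nothing about the real 2.6.20 numbers, and the function name is an assumption.

```python
def paid_shares(per_author):
    """per_author: dict of author -> (changesets, is_paid).
    Returns (paid share of contributors, paid share of changesets)."""
    total_authors = len(per_author)
    total_changes = sum(n for n, _ in per_author.values())
    paid_authors = sum(1 for _, paid in per_author.values() if paid)
    paid_changes = sum(n for n, paid in per_author.values() if paid)
    return paid_authors / total_authors, paid_changes / total_changes

# Invented example: one paid developer with 90 changesets, three volunteers with 10 in total.
example = {"a": (90, True), "b": (5, False), "c": (3, False), "d": (2, False)}
by_people, by_changes = paid_shares(example)   # 0.25 of contributors, 0.9 of changesets
```

In this made-up case paid work accounts for most changesets while paid developers are a minority of contributors, which is exactly the scenario the question raises.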