Open Source--It's in the Genes

What happens when you release 500,000 human genomes as open source? This.

DNA is digital. The three billion chemical bases that make up the human genome encode data not in binary, but in a quaternary system, using four compounds—adenine, cytosine, guanine, thymine—to represent four genetic "digits": A, C, G and T. Although this came as something of a surprise in 1953, when Watson and Crick proposed an A–T and C–G pairing as a "copying mechanism for genetic material" in their famous double helix paper, it's hard to see how hereditary information could have been transmitted efficiently from generation to generation in any other way. As anyone who has made photocopies of photocopies is aware, analog systems are bad at loss-free transmission, unlike digital encodings. Evolution of progressively more complex structures over millions of years would have been much harder, perhaps impossible, had our genetic material been stored in a purely analog form.
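To make the "quaternary" idea concrete, here is a minimal Python sketch, not a description of how any real sequencing tool stores data. It assumes an arbitrary mapping of the four bases to two bits each (A=00, C=01, G=10, T=11); the function names are made up for illustration. It shows that a sequence survives a round trip through the encoding without loss, that the A–T and C–G pairing lets one strand be rebuilt from the other, and roughly how much raw storage three billion bases would need at two bits per base.

```python
# Hypothetical 2-bit encoding; the choice A=0, C=1, G=2, T=3 is arbitrary.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = {bits: base for base, bits in BASE_TO_BITS.items()}

# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def pack(seq: str) -> int:
    """Pack a DNA string into one integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | BASE_TO_BITS[base]
    return value


def unpack(value: int, length: int) -> str:
    """Recover a DNA string of the given length from its packed form."""
    bases = []
    for _ in range(length):
        bases.append(BITS_TO_BASE[value & 0b11])
        value >>= 2
    return "".join(reversed(bases))


def complement(seq: str) -> str:
    """Return the paired strand, base by base."""
    return "".join(COMPLEMENT[base] for base in seq)


def genome_megabytes(n_bases: int) -> float:
    """Raw size in megabytes at two bits per base, with no compression."""
    return n_bases * 2 / 8 / 1_000_000


if __name__ == "__main__":
    seq = "GATTACA"
    print(unpack(pack(seq), len(seq)))      # lossless round trip
    print(complement(seq))                  # the paired strand
    print(genome_megabytes(3_000_000_000))  # raw size of ~3 billion bases
```

Under these assumptions, a whole genome fits in well under a gigabyte, which is one way of seeing why copying it reliably is a digital problem rather than an analog one.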

Although the digital nature of DNA was known more than half a century ago, it was only after many years of further work that this quaternary data could be extracted at scale. The Human Genome Project, where laboratories around the world pieced together the three billion bases found in a single human genome, was completed in 2003, after 13 years of work, for a cost of around $750 million. Since then, however, the cost of sequencing genomes has fallen; in fact, it has plummeted even faster than Moore's Law would predict for semiconductors. A complete human genome can now be sequenced for a few hundred dollars, with sub-$100 services expected soon.

As costs have fallen, new services have sprung up offering to sequence—at least partially—anyone's genome. Millions have sent samples of their saliva to companies like 23andMe in order to learn things about their "ancestry, health, wellness and more". It's exciting stuff, but there are big downsides to using these companies. You may be giving a company the right to use your DNA for other purposes. That is, you are losing control of the most personal code there is—the one that created you in the boot-up process we call gestation. Deleting sequenced DNA can be hard.

That's bad enough, but it gets worse. Because the DNA of all your relatives is similar to yours to varying degrees, when you have your genome sequenced, you are effectively giving away part of their DNA too. Whether they agree or not, they lose their genetic anonymity, which may have serious and unforeseen consequences. In the US, police are using genetic information that has been made public by individuals to find partial matches of DNA from a crime scene. By building and exploring family trees on a massive scale, the police can narrow their investigations down to a few suspects to help them pinpoint the criminal.
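The "varying degrees" can be given rough numbers. The sketch below is illustrative only and is not how any police or genealogy service actually works. It uses the textbook rule of thumb that the expected share of DNA in common halves with each step of relatedness, and the function names and relationship table are hypothetical. Real matches vary around these averages.

```python
import math

# Degree of relatedness for some common relationships (textbook averages).
# The expected shared fraction of DNA is 0.5 ** degree.
DEGREE = {
    "parent or child": 1,
    "full sibling": 1,
    "grandparent": 2,
    "aunt or uncle": 2,
    "first cousin": 3,
    "second cousin": 5,
}


def expected_shared_fraction(degree: int) -> float:
    """Average fraction of DNA shared with a relative of this degree."""
    return 0.5 ** degree


def guess_degree(shared_fraction: float) -> int:
    """Rough inverse: estimate the degree from an observed shared fraction."""
    return round(-math.log2(shared_fraction))


if __name__ == "__main__":
    for relation, degree in DEGREE.items():
        print(f"{relation}: {expected_shared_fraction(degree):.3%}")
    # A partial match sharing about 3% of DNA points to roughly
    # second-cousin distance under this simplified model.
    print(guess_degree(0.03))
```

Even a distant match narrows the search to one family tree, which is why one person's upload affects relatives who never consented.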

Just as software code can be open source rather than proprietary, so there are publicly funded genomic sequencing initiatives that make their results available to all. One of the largest projects, the UK Biobank (UKB), involves 500,000 participants. Any researcher, anywhere in the world, can download complete, anonymized data sets, provided they are approved by the UKB board. One important restriction is that they must not try to re-identify any participant—something that would be relatively easy to do given the extremely detailed clinical history that was gathered from volunteers along with blood and urine samples. Investigators asked all 500,000 participants about their habits, and examined them for more than 2,000 different traits, including data on their social lives, cognitive state, lifestyle and physical health.

Given the large number of genomes that need to be sequenced, the first open DNA data sets from UKB are only partial, although the plan is to sequence all genomes more fully in due course. These smaller data sets allow what is called "genotyping", which provides a rough map of a person's DNA and its specific properties. Even this partial sequencing provides valuable information, especially when it is available for large numbers of people. As an article in Science points out, it is not just the size and richness of the open data sets that make the UK Biobank unique; it is also the thoroughgoing nature of the sharing that is required from researchers:

Researchers around the world can freely delve into the UKB data and rapidly build on one another's work, resulting in unexpected dividends in diverse fields, such as human evolution. In a crowdsourcing spirit rare in the hypercompetitive world of biomedical research, groups even post tools for using the data without first seeking credit by publishing in a journal.

The benefits from applying open-source methodology to half a million genomes are significant and growing by the day. About 7,000 researchers have registered to use UKB data on 1,400 projects, and more than 600 papers have been published. It is leading to rapid advances that are simply not possible when the DNA is proprietary. And as with open source, doing good brings benefits:

"The U.K. is getting all of the world's best brains" to study its citizens, says Ewan Birney, director of the EMBL European Bioinformatics Institute in Hinxton, U.K., and a member of the UKB's steering committee. The U.K. focus is also the project's chief downside, as it explores just one slice of humanity: northern Europeans. It holds data for only about 20,000 people of African or Asian descent, for example. Yet as new papers appear every few days, researchers say the UKB remains a shining example of the power of curiosity unleashed. "It's the thing we always dreamed of," [president and director of the Broad Institute in Cambridge, Massachusetts] Lander says.

It's the classic "given enough eyeballs, all bugs are shallow". By open-sourcing the genomic code of 500,000 of its citizens, the UK is getting the top DNA hackers in the world to find the "bugs": the variants that are associated with medical conditions. Finding them will improve our understanding of those conditions and may well lead to the development of new treatments. The advantages are so obvious, it's a wonder people use anything else. A bit like open source.

Glyn Moody has been writing about the internet since 1994, and about free software since 1995. In 1997, he wrote the first mainstream feature about GNU/Linux and free software, which appeared in Wired. In 2001, his book Rebel Code: Linux And The Open Source Revolution was published. Since then, he has written widely about free software and digital rights. He has a blog, and he is active on social media: @glynmoody on Twitter.
