By Jupyter--Is This the Future of Open Science?

on March 4, 2019

Taking the scientific paper to the next level.

In a recent article, I explained why open source is a vital part of open science. As I pointed out, alongside a massive failure on the part of funding bodies to make open source a key aspect of their strategies, there's also a similar lack of open-source engagement with the needs and challenges of open science. There's not much that the Free Software world can do to change the priorities of funders. But, a lot can be done on the other side of things by writing good open-source code that supports and enhances open science.

People working in science potentially can benefit from every piece of free software code—the operating systems and apps, and the tools and libraries—so the better those become, the more useful they are for scientists. But there's one open-source project in particular that already has had a significant impact on how scientists work—Project Jupyter:

Project Jupyter is a set of open-source software projects that form the building blocks for interactive and exploratory computing that is reproducible and multi-language. The main application offered by Jupyter is the Jupyter Notebook, a web-based interactive computing platform that allows users to author documents that combine live code, equations, narrative text, interactive dashboard and other rich media.

Project Jupyter was spun-off from IPython in 2014 by Fernando Pérez. Although it began as an environment for programming Python, its ambitions have grown considerably. Today, dozens of Jupyter kernels exist that allow other languages to be used. Indeed, the project itself speaks of supporting "interactive data science and scientific computing across all programming languages". As well as this broad-based support for programming languages, Jupyter is noteworthy for its power. It enables users to create and share documents that contain live code, equations, visualizations and narrative text. Uses include data cleaning and transformation, numerical simulation, statistical modeling, data visualization and machine learning.

In a way, Project Jupyter is the ultimate scientific tool, since it can be used in any discipline and for multiple purposes. As an article in the Atlantic rightly put it, it also can be thought of as the scientific paper taken to the next level by exploiting the possibilities of digital technology. A key aspect is that it's interactive—readers can use the embedded code to explore the data and carry out limitless "what ifs". It's such an obvious idea, you may wonder why it hasn't been done before. And the answer is that it has, notably in the form of Mathematica from Wolfram Research.

Mathematica is an innovative and powerful program with one huge flaw: it's proprietary. As such, it suffers all the usual downsides, one of which more or less disqualifies it for science: you can't check the code. That means you don't really know why it produces the results it does; you just have to take it on trust. That's not science; that's voodoo.

Its closed-source nature means that Mathematica can't tap into the community of users in the same way open-source projects can. Whatever advantages Mathematica once had, it's only a matter of time before open-source alternatives like Jupyter surpass it. Indeed, it's interesting that the Google Trends comparison of searches for Mathematica and searches for Jupyter show that interest in the latter is rising, while Google searches for the former are falling. It's an inexact metric, of course, but the overall trends are clear: Mathematica, like Microsoft Windows, is the past, and Jupyter, like GNU/Linux, is the future.

It's not just about the familiar dynamics of open-source development. There's a key reason why Jupyter has beaten Mathematica, as the academic Paul Romer explained in a perceptive post:

Mathematica failed, despite technical accomplishments, because the norms of its developers clashed so obviously with the norms of its intended users. Jupyter is succeeding because the norms of the community that is developing it are aligned with the norms of its users.

As well as its culture, there's another aspect of Jupyter that makes it a perfect fit for open science. On the page listing dozens of Jupyter notebooks—all freely accessible—there's a section titled "Reproducible academic publications":

This section contains academic papers that have been published in the peer-reviewed literature or pre-print sites such as the ArXiv that include one or more notebooks that enable (even if only partially) readers to reproduce the results of the publication.

Coupled with the transparency of the underlying code, this ability for anybody to check the logic and results of a publication is a real breakthrough in open science. At the moment, most academic papers can be read only superficially. In theory, anyone could set about reproducing the final conclusions—at least, provided the relevant datasets are freely available. Few will take the trouble to do so though, because there are no academic incentives to expend all that time and energy. With papers published not as static documents, but as dynamic Jupyter notebooks with full open datasets, it is possible to check the results properly, as well as to plug in other datasets or tweak the underlying assumptions. In this way, Jupyter notebooks are the perfect marriage of open source, open access and open data. This is exactly how open science should work, but until now almost never does.

The power and flexibility of the Jupyter environment make it a strong foundation for open-ended experimentation of the kind the Free Software community relishes. Moreover, coding in this domain could have a major impact on scientists using the notebook format and on the science they produce. That combination of satisfying intellectual challenge with real-world practical benefits makes it a perfect candidate for open-source coders looking for new and meaningful challenges.

Glyn Moody has been writing about the internet since 1994, and about free software since 1995. In 1997, he wrote the first mainstream feature about GNU/Linux and free software, which appeared in Wired. In 2001, his book Rebel Code: Linux And The Open Source Revolution was published. Since then, he has written widely about free software and digital rights. He has a blog, and he is active on social media: @glynmoody on Twitter.