
7 Reproducibility Problems Using Jupyter Notebooks for BioMed


Introduction

In computational biology and biomedical science, Jupyter tools such as Jupyter Notebook and JupyterLab (Jupyter notebooks for short) are popular for computational workflows, since they let you explore, bundle, and share code (usually Python or R) in an easy and interactive manner. In principle, this is not just beneficial for fast and easy experimentation, but also for scientific reproducibility, since it makes sharing code and code documentation a relatively painless process.

Or so it would seem. In practice, the contribution of Jupyter notebooks to reproducibility in fields like computational biology and biomedical science depends a great deal on how the practitioner actually designs and implements the notebook itself. It turns out that this goes wrong surprisingly often. A recent study by Sheeba Samuel and Daniel Mietchen, available as a preprint on arXiv, tested the methodological reproducibility of a large number of Python-based Jupyter notebooks in computational biomed. By methodological reproducibility, they meant that they did not verify the results, but simply tested whether the methods (i.e. the code) could be rerun at all – in other words, whether the notebook would run as advertised in the first place.

Using an automated process, Samuel and Mietchen tested 9625 Jupyter notebooks from 1117 GitHub repositories, associated with 1419 publications listed in PubMed. Only 4169 of these (43.45%) included actual dependency files so that they could be executed properly. Of those, over 35% failed to install anyway, and astonishingly, over 84% of the remainder – the notebooks whose dependencies could at least be installed completely – still failed to execute due to an exception. In other words, the methodological reproducibility rate in computational biomed appears to be shockingly low. The question immediately arises: why might this be?

Figure: Notebook exceptions in computational biomed by type and frequency; from Samuel and Mietchen (2022)

As is probably familiar to many researchers, it is not actually that easy to guarantee reproducibility – even ‘just’ of the methodological kind – despite the use of tools such as Jupyter notebooks. It is easy to make mistakes in the setup, implementation, and documentation of your computational workflow, and bad processes are likely to lead to problems with reproducibility down the line. As the study cited above shows, most researchers in closely related fields like computational biology and biomed are likely to be confronted with these issues. For this reason, we have identified some of the mistakes most frequently made in these fields when aiming for methodological reproducibility with Jupyter notebooks, and we suggest ways to address them. Here they are:

1. Using outdated Python (or other tools)

One common problem for reproducibility arises from the use of versions of Python that are at the edge of, or even beyond, their support window. The long time it takes to move from experiment to peer review and publication is often to blame for this, rather than inadequate precautions on the part of researchers, but it nonetheless constitutes a real issue. In the Samuel and Mietchen study, more than a thousand notebooks had commits in 2021 but used versions of Python as old as 3.4 (which reached end of life in 2019) and 2.7 (in 2020), as well as many unspecified versions of Python that are probably older. It is important to use the most up-to-date version of Python whenever possible, since an old version falling out of support is likely to cause problems with execution by the time you actually get to publication of your work. The same applies, mutatis mutandis, to any other tooling used in the process. Keep it fresh!
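
A simple habit that helps here is to record, in the first cell of the notebook, exactly which interpreter and key library versions it was last run with. A minimal sketch (the libraries listed are just examples of whatever your notebook actually imports):

    # First notebook cell: record the environment this notebook was last executed with.
    import sys
    import platform

    print("Python:", sys.version)             # an outdated interpreter shows up immediately
    print("Platform:", platform.platform())   # OS and architecture used for the run

    # Also record the versions of the main libraries used further down.
    import numpy
    import pandas
    print("numpy:", numpy.__version__)
    print("pandas:", pandas.__version__)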

2. Failing to declare dependencies

As mentioned in the introduction, more than half of the Python-based Jupyter notebooks studied for computational biomed failed to declare dependencies properly. To run your code in the future, colleagues or referees will need not just your code, but also any modules and libraries your code made use of. Failing to include a manifest of these dependencies makes it very difficult to figure out what is required to make the code work, virtually guaranteeing a failure of reproducibility.

The best way to address this is by using automated dependency managers. Pip, for example, handles the installation and updating of Python libraries and packages, and can generate a requirements.txt in which the dependencies are neatly and concisely stated (related tools such as Pipenv manage a Pipfile for the same purpose). Conda can be used similarly, creating an environment.yml file that in turn enables easy containerisation, e.g. using Docker. In this way, you make it possible for others to run the code with the exact dependencies required in the future, so that it will continue to execute as intended.
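
As an illustrative sketch of what such a dependency file can look like (the environment name, package names, and versions below are hypothetical and should be replaced with whatever your notebook actually uses):

    # environment.yml – illustrative example only.
    # Export your real environment with:  conda env export --from-history > environment.yml
    name: biomed-analysis
    channels:
      - conda-forge
    dependencies:
      - python=3.12
      - numpy=1.26
      - pandas=2.2
      - matplotlib=3.8
      - pip
      - pip:
          - some-pypi-only-package==1.0   # hypothetical PyPI-only dependency

The pip equivalent is a requirements.txt generated with pip freeze > requirements.txt, ideally trimmed to the packages the notebook actually imports.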

3. Improper documentation of your notebooks

As mentioned again and again in guides to reproducibility in computational biology and biomed, it is paramount to document your work properly. This seems obvious, but many researchers still lack thoroughness on this front, both because of the time pressures of research and because of its trial-and-error experimental nature. Frequent mistakes include waiting until a finished result is obtained before documenting anything, failing to document intermediate results along the way, leaving out (seeming) dead ends or failures, and failing to note the seeds used for randomisation.

Each of these omissions hinders the ability of other researchers not just to reproduce the results using the code, but more importantly, to understand the choices you made, why you made them, and what can be learned from the process. What’s more, you may find your own research time scattered between many other duties, in which case good documentation along the way, including of dead ends already explored, can save considerable time and provide much-needed clarity.

The best way to address this is to document each experiment as you go, preferably using standardised formats, so that both you and others can easily figure out what you did and why. Archiving intermediate results helps as well, as does recording the seeds that make the randomisation process give the outcomes you share. Providing good comments in the code is as helpful in the computational life sciences as it is in software engineering, and should not be underestimated as a source of reproducibility. This goes especially for explanations of the ‘why’: your colleagues will want to know why you made the choices you did and the reasoning behind the experiments. After all, it is the analysis rather than the bare results that constitutes scientific progress.
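
For the random seeds in particular, a few lines at the top of the notebook go a long way. A minimal sketch, assuming NumPy and Python's built-in random module are the sources of randomness in your analysis:

    # Fix and record the random seeds so that stochastic steps can be replayed exactly.
    import random
    import numpy as np

    SEED = 20220601  # hypothetical value – record whichever seed produced the published results
    random.seed(SEED)
    np.random.seed(SEED)                # seeds the legacy global NumPy state
    rng = np.random.default_rng(SEED)   # preferred: an explicit Generator passed to downstream code

    print("Random seed used for this run:", SEED)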

4. Poor data storage

Another problem frequently mentioned in guides like the one above is poor data storage and presentation. Naturally, you will want to enhance your publication with useful data visualisations such as plots and graphs and provide summaries of important data and results. But colleagues wanting to reproduce your results will also want easy and practical access to the underlying raw data. Failing to store this data in raw format is not uncommon, leaving it up to others to carry out the laborious and error-prone process of reconstructing it from the description. Similarly, researchers all too often fail to include stored outputs of the data values they subsequently analyse or summarise for the purposes of presentation. Having to reconstruct these from existing code is an unpleasantly clunky process and hinders reproducibility. As suggested in this guide to reproducible practices in computational biology and biomed, it is better to store such outputs directly in easily accessible files with clear naming conventions, making it easier for your colleagues to see the data you based your conclusions on.
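
A small sketch of what storing outputs directly can look like in practice (the directory, file, and column names here are purely hypothetical):

    # Save intermediate and summary outputs alongside the notebook, in open, text-based
    # formats with descriptive, dated file names. All names below are illustrative.
    from pathlib import Path
    import pandas as pd

    results_dir = Path("results")
    results_dir.mkdir(exist_ok=True)

    # The raw values underlying a figure or summary table in the paper.
    expression_summary = pd.DataFrame(
        {"gene": ["BRCA1", "TP53"], "mean_expression": [12.4, 8.7]}
    )
    expression_summary.to_csv(results_dir / "2022-06-01_expression_summary.csv", index=False)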

In general, the core rule is to share and explain as much data as you can, making sure that the data is accessible and usable and that it is clear how it is organised. There are many different ways of approaching this, and standardisation is still somewhat in its infancy.

5. Bad coding practices in the notebooks

In section 3 above, we already mentioned the lack of internal commenting as a common poor coding practice that hinders reproducibility in the computational bio(medical) sciences. But there are more such bad practices. Repeatedly copy-pasting the same code with minor variations for the sake of experimentation is another common example: it makes the notebooks very difficult to read afterwards and causes tremendous problems if there is a bug or a flaw in the design. It’s bad enough having to fix such things yourself, but imagine having to do so in someone else’s notebook where the same code segment was copied 32 times, each time with a few changes to one line…
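
The usual remedy is to factor the shared logic into a small, parameterised function and make the variation an explicit argument. A sketch with hypothetical names, using a simple normalisation step as the stand-in for the repeated code:

    # One function replaces the pasted variants; the variation becomes a parameter.
    import numpy as np
    import pandas as pd

    def normalise_counts(counts: pd.DataFrame, method: str = "cpm") -> pd.DataFrame:
        """Normalise a count matrix; 'method' captures what the copy-pasted edits used to change."""
        cpm = counts.div(counts.sum(axis=0), axis=1) * 1e6
        if method == "cpm":
            return cpm
        if method == "log-cpm":
            return np.log1p(cpm)
        raise ValueError(f"Unknown normalisation method: {method}")

    # One call per experimental variant instead of one pasted block per variant:
    # normalised = normalise_counts(raw_counts, method="log-cpm")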

A similar issue is poor presentation within the notebook itself: for example, putting everything into a single cell, making it unclear which cell performs which logical step in the experiment or analysis, or running cells out of order. It does not help if you know that, to get the desired results, one should run the sixth cell before the third and the fourth one twice, but your colleagues do not (and by the time of publication, you probably won’t either).
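
A quick way to catch out-of-order execution is to restart the kernel and run the whole notebook top to bottom before sharing it, which can also be done from the command line (the notebook file name here is hypothetical):

    # Execute the notebook from a clean kernel, cell by cell, top to bottom.
    jupyter nbconvert --to notebook --execute analysis.ipynb --output analysis-executed.ipynb

If this run fails on your own machine, it will certainly fail on your colleagues’.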

Try to make your code structure follow the logical and analytical steps of your experimental process. Use Markdown cells for comments and key insights, as much for yourself as for your colleagues. Don’t cram everything into one cell; give each segment of code room to breathe and stand on its own. Import all libraries in the first cell, rather than scattering them about the notebook, and declare all variables at the start. You will find that this doesn’t just aid reproducibility, but also clarity of thought, making your work better.

Figure: Putting all imports in the first cells – example from Tara Boyle

6. Lacking effective version control methods

Most researchers in computational biology and biomed are by now probably familiar with Git and repository sites such as GitHub and GitLab as key tools for enabling version control of scientific code. Even so, the Samuel and Mietchen study found that, more often than one would expect, the relevant code was not to be found in the main branch of the respective GitHub repository, making it difficult to tell which version should be used for reproduction. It is important to maintain good project management throughout; GitHub’s repository structure on its own is not an adequate substitute for it.

It is also important to make sure that the code is clearly citable, and that the repository clearly describes what the project and the code are about and what you are attempting to achieve with them. This means having, at a minimum, a license file stating the licensing permissions, a readme file with essential context for the project, and a citation file explaining how to cite the code and your work in general. There is an array of third-party services that integrate with GitHub, such as Zenodo and Figshare, which allow you to mint Digital Object Identifiers (DOIs) as unique and persistent identifiers for your code. These make citation – and therefore reproducibility – much easier to accomplish.
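
A citation file can be as simple as a CITATION.cff in the repository root, which GitHub recognises and offers through a ‘Cite this repository’ option. A minimal sketch in which every value is a placeholder:

    # CITATION.cff – all values below are placeholders.
    cff-version: 1.2.0
    message: "If you use this code, please cite it as below."
    title: "Example analysis notebooks for our biomed study"
    authors:
      - family-names: "Doe"
        given-names: "Jane"
    version: "1.0.0"
    doi: "10.5281/zenodo.1234567"   # e.g. a DOI minted via Zenodo's GitHub integration
    date-released: "2022-06-01"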

Although by now somewhat old, the 2016 PLOS Comp Bio guide to using Git for reproducibility may still be helpful to you for further practical tips.

7. Lacking the right tools for the computational workflow

With the ever-increasing demands in computational biology and biomed for proper reproducibility of research code, data, and other materials, the overhead of the scientific workflow gets larger by the day. Many researchers spend more time than they would like – and more than is necessary – on wrestling with process issues like version control, configuration and maintenance of tools and libraries, containerisation of dependencies, and managing data storage. All this time spent on the overhead of computational science is time not spent on experimentation, collaboration, and analysis, and that is a waste from the viewpoint of scientific progress.

It is exactly to make managing these practical concerns around scientific reproducibility easier that we have developed Nuvolos, a complete cloud-based platform for research, collaboration, and reproducibility. Nuvolos allows you to run all your preferred tools, including Jupyter notebooks in your favourite configuration, directly in the browser. The platform is custom-built around the computational scientific workflow you are familiar with in biology and biomed, and everything is containerised from the beginning to ensure easy sharing, versioning, and reproducibility. High-performance data storage and elastic scalability are provided as a matter of course. Nuvolos makes backing up and versioning entire bundles of code, data, and applications such as Jupyter a matter of a few mouse-clicks, and you can invite colleagues or referees and share your research with them just as easily. Nuvolos even provides native support for education.

Built by computational scientists for computational scientists, Nuvolos is dedicated to saving you time and effort on the overhead of the computational scientific workflow. It enables you to meet the highest contemporary standards for scientific reproducibility by joining experimentation, project management, data storage, version control, and the sharing of results in a single platform, so that instead of a nightmare of scattered processes and laborious project management, the demands of reproducibility become a stimulus to experiment and collaborate more. Curious? Give our free trial a go and find out for yourself how much time you can save.

Conclusion

We hope you found this list useful for avoiding common problems with reproducibility in computational biology and biomed, in particular when using Jupyter notebooks. As shown in the study cited in the introduction, these problems are widespread, and at least a few of them will probably look familiar. With Nuvolos, we attempt to make scientific progress faster and easier by reducing the burdens and overhead involved in the computational scientific process. For more information, check out our free trial or read our case studies to discover how institutions like the computational biology department at the Université de Lausanne use Nuvolos to aid teaching, research and reproducibility.