
Surviving the Reproducibility Crisis in Computational Science


Pitfalls and solutions for reproducible research

One of the biggest trends in computational science has been the vast increase in available data, driven by new methods and advances in the underlying technology, from machine learning to high-performance computing. These improved techniques for generating and processing data enable rapid progress in research fields that depend on computational and data science (and there are few that don’t). Research datasets have grown to sizes that would have been impossible to work with in the past, and there are simply far more of them than ever before.

But with big data comes great responsibility. In the era of huge datasets, the elephant in the room is reproducibility: scientific practice depends on it to establish the legitimacy of its claims, yet the big data revolution has made it harder to achieve than ever. The result is a reproducibility crisis across the sciences, and computational science is affected as much as any discipline.

Defining reproducibility

So what is reproducibility? In the literature, the terms ‘reproducibility’, ‘replicability’ and ‘repeatability’ are defined in different ways and sometimes used interchangeably. In one common scheme, repeatability means the same team getting the same results with the same methods, replicability means a different team getting the same results with the same methods, and reproducibility means a different team using different methods to arrive at the same result. But authors and institutions often define these terms very differently. One way of approaching this problem comes from Goodman and colleagues: treat reproducibility as an umbrella term for the general construct – the need for transparency and the possibility of confirmation to provide scientific credibility – and subdivide it into distinct subtypes.

The most relevant subtype for this discussion is what they call ‘results reproducibility’, which corresponds to what others often call ‘replication’: “the conduct of an independent study whose procedures are as closely matched to the original experiment as possible”. Since computational science tends to be largely deterministic, this type of reproducibility (or replication) is a common standard in the field, and it is here that data size and complexity can throw a spanner in the works.
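
To make this concrete, here is a minimal Python sketch (an illustration of the principle, not code from any study discussed here) of why results reproducibility is a natural standard for deterministic computation: once the random seed is fixed, an independent rerun of the same procedure on the same data returns an identical result.

import random

def noisy_mean(values, seed=42):
    """Estimate a mean after adding reproducible simulated noise."""
    rng = random.Random(seed)          # a fixed seed makes the 'noise' repeatable
    noisy = [v + rng.gauss(0, 0.1) for v in values]
    return sum(noisy) / len(noisy)

data = [1.0, 2.0, 3.0, 4.0]
run_1 = noisy_mean(data)
run_2 = noisy_mean(data)               # an 'independent' rerun of the same procedure
assert run_1 == run_2                  # identical, because the computation is deterministic
print(run_1)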

How big data makes reproducibility difficult

As Roger Peng puts it in a widely cited article in Science, “replication is the ultimate standard by which scientific claims are judged”, but with the “collection of large and complex data sets the notion of replication can be murkier”. There are several reasons why the era of big data has made replication harder rather than easier. Two stand out in particular: computing constraints and data sharing standards.

Take computing constraints first. Many important large datasets have been made publicly available to support scientific inquiry: good examples are the RCSB Protein Data Bank in molecular biology and the Sloan Digital Sky Survey in astronomy. This is great for enabling further research, but the computational resources required to regenerate such datasets independently are beyond most researchers, and often beyond entire institutions. Even where resources are not the limiting factor, replicating experiments or computations correctly may depend on a particular computing environment: as complexity increases, it becomes ever more important to use the same operating system, libraries, and software versions as the original researchers in order to get the same results. Together, these computing constraints form the first obstacle to reproducibility.
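
As a rough illustration of how the environment side can be documented, the following Python sketch records the operating system, interpreter, and library versions alongside the results; the package names and output file are placeholder assumptions, not a prescribed format.

import json
import platform
import sys
from importlib import metadata

def environment_manifest(packages=("numpy", "pandas")):    # placeholder package list
    """Collect platform and package-version information into a dictionary."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None                           # not installed in this environment
    return {
        "os": platform.platform(),
        "python": sys.version,
        "packages": versions,
    }

# Write the manifest next to the results so it can be shared along with them.
with open("environment.json", "w") as fh:
    json.dump(environment_manifest(), fh, indent=2)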

Of course, scientists generally agree that data used in research should, in principle, be available to others for the purposes of reproducibility, at least as part of the publication process. But this is easier said than done. Requesting and obtaining data can be a time-consuming process, and even when data is shared, it can be difficult to use: the data may be there, but the necessary code is missing, or the computational environment is not documented – and all of these are essential for reproduction to succeed. This sharing problem is the second major obstacle to reproducibility.

The reproducibility spectrum: degrees of reproducibility – diagram from Peng’s article.

Of course, these problems don’t usually arise on purpose. Researchers are busy people with other priorities – especially since the ‘publish or perish’ environment of academic institutions does not directly reward good sharing practices.

Best practices and pitfalls for conducting reproducible research

So what can be done about this? Just as there are two main causes of the reproducibility crisis in computational science, there are two main ways of addressing it: changing the norms and improving the tools. Both are increasingly important for doing computational science, or research in any field that depends on it. Fortunately, a lot is being done to make these changes possible.

On the norms side, we see repeated calls for best practices that enable replication: making research data, code, and any algorithms used available, and documenting the computational environment explicitly. In other words, reproducible research artifacts make for reproducible research findings. Leading journals such as Science have tightened their publication requirements over the last decade, for example by making availability of all code used a precondition for (online) manuscript publication. In the future, full availability of research artifacts is likely to become a standard requirement. Another initiative is the FAIR Principles, which set norms for the production and dissemination of digital assets and which underpin the development of the European Open Science Cloud.
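
One lightweight way to act on these recommendations is to publish a manifest of checksums alongside the code and data, so that anyone attempting a replication can verify they are working from exactly the artifacts used in the original study. The Python sketch below is illustrative only, and the file names are placeholders.

import hashlib
import json
from pathlib import Path

def sha256(path, chunk_size=1 << 20):
    """Stream a file in 1 MB chunks and return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

artifacts = ["data/measurements.csv", "analysis.py"]        # placeholder artifact paths
manifest = {name: sha256(name) for name in artifacts if Path(name).exists()}

with open("artifact_manifest.json", "w") as fh:
    json.dump(manifest, fh, indent=2)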

But however good these intentions are, researchers still need ways to make preparing and sharing research artifacts smoother and less time-consuming. Technology can help. Some researchers use containerization (e.g. via SingularityCE or Apptainer) to replicate the computational environment on the operating-system side and pin library versions. For code, there is version control with Git, typically hosted on GitHub. But each of these addresses only part of the problem: Git is designed for code rather than data, and even Git Large File Storage (LFS) soon reaches its limits. Containerization tools also differ in how well they support reproducibility, owing to variations in container runtime practices and potential host-to-container incompatibilities.
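
As a simple illustration of that code-versus-data mismatch, the sketch below scans a project directory for files above roughly 100 MB (around the commonly cited per-file limit on GitHub); such files are usually better deposited in a dedicated data repository than committed to Git. The threshold and the directory scanned are illustrative assumptions.

from pathlib import Path

SIZE_LIMIT = 100 * 1024 * 1024         # ~100 MB, an illustrative threshold

def oversized_files(root="."):
    """Yield (path, size in bytes) for files under root that exceed SIZE_LIMIT."""
    for path in Path(root).rglob("*"):
        if path.is_file() and path.stat().st_size > SIZE_LIMIT:
            yield path, path.stat().st_size

for path, size in oversized_files():
    print(f"{path}: {size / 1e6:.0f} MB – consider a data repository rather than Git")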

Nuvolos: a cloud platform for reproducible computational science

This is where Nuvolos comes in. Nuvolos provides collaborative workspaces, hosted in the cloud, that allow computational scientists to run their favorite research applications natively in the browser. These workspaces can be shared with team members as well as with outsiders such as referees, students, and third-party researchers. In practice, this means you can carry out your research or data analysis in Nuvolos from beginning to end, and ensuring reproducibility after the work is done becomes a simple step rather than a chore: data, code, tables, applications, and everything else can be captured as snapshots and shared with collaborators in just a few clicks.

Ensuring reproducibility and easy distribution of research artifacts with Nuvolos.

Nuvolos saves working scientists a lot of time while enabling reproducibility every step of the way. To try it out, book a demo here.