Table of contents
- Understanding reproducibility
- Sharing your code
- Code and project collaboration
- Dataset reproducibility
- Why Git is not always the solution
- Addendum: code conversion
Computational science helps to reproduce complex physical and digital systems as simulated environments. In doing so, computational research supports and facilitates progress in biotech, engineering, financial studies, and other fields. However, the reproducibility of computational research remains a hot topic in scientific debate.
Beyond concerns such as biased selection of research topics, tampering with hypotheses, and faulty interpretation of results, ensuring the validity of computational research also means confronting technical challenges. The ethical and epistemological problems must be addressed at the level of scientific training, but the technical challenges can be solved directly. This short guide explains how.
You cannot trust results that emerge from a black box and that can only be generated once. That’s why reproducibility, reliability, and quality go together.
Why you need to care about reproducibility in computational research
Reproducibility is gradually becoming a prerequisite for publishing an academic paper in computational research. It was first imposed by prominent journals. They wanted to retain their positions as knowledge exchange hubs.
A few weeks ago, the White House required that research papers funded by the U.S. government be made available online promptly and freely by the end of 2025. Data that underlies those publications must also be made available.
Of course, as a researcher you also want to sustain your reputation and facilitate knowledge transfer.
Moreover, you and your team are just as interested in documenting your experimental path as any external community. Doing so allows you to reuse your past projects and bring your future ones to a successful result faster. New insights and valuable research findings are generated through knowledge accumulation.
The three ‘Re-’s of computational research validity
Research reproducibility is often used interchangeably with research replicability. To understand the difference, we need to go back to all three “re-” problems. Computational research, some experts say, must be:
- “repeatable (the original team of researchers can reliably produce the same result using the same experimental setup),
- replicable (a different team can produce the same result using the original setup),
- and reproducible (a different team can produce the same result using a different experimental setup).” 1
You can see that, of all three, reproducibility seems to be the trickiest one. But what exactly should you make reproducible? To answer this question, we can break a computational research project into four major elements 2:
Let us talk about the practical implications of this list.
In any research, hypotheses and new theories must be embedded into the existing research. This is done by a systematic literature review that precedes the empirical part. Post-mortem, once your empirical experiment is reproduced correctly, the reproducer must arrive at the same conclusions, and your concepts should get confirmation. For the algorithms, the implication is that they need to be described in a clear and language-agnostic manner.
The purely technical problems start with code and dataset sharing. We will go through all of them, keeping in mind not only sharing for the purpose of getting your research published, but also sharing for collaboration and knowledge transfer.
Sharing your code
Even while you are still working on your research, sharing is caring – your colleagues may already be interested in what you are working on and how. So how best to do it?
Sharing is more than code hosting
If you have been keeping your code on your local machine, you have several options for making it accessible to others.
A shared folder on your PC or a remote server may be a cheap and quick solution, but it becomes cumbersome when you need to add more users. Cloud hosting is much simpler in this regard.
Yet sharing does not equal copy-pasting. The problems emerge when someone else tries to execute your code. The pitfalls range from easy-to-solve issues, such as a different installed language version, missing libraries, browser issues, and configuration issues, to serious ones, such as mismatched computing power, missing scalability tools, operating system incompatibility, and so on.
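Many of the easy-to-solve pitfalls above can at least be diagnosed if the original environment is written down next to the code. The following is a minimal sketch, assuming the project is written in Python; the function name `environment_snapshot` and the choice of fields are illustrative, not a standard.

```python
import json
import platform
import sys
from importlib import metadata


def environment_snapshot():
    """Collect basic facts about the interpreter, OS, and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installations that report no name
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "packages": dict(sorted(packages.items())),
    }


if __name__ == "__main__":
    # Print the snapshot as JSON so it can be saved alongside the shared code.
    json.dump(environment_snapshot(), sys.stdout, indent=2)
```

Someone who later fails to run the code can compare their own snapshot with the saved one and see which versions differ. This does not address the more serious pitfalls, such as computing power or operating system incompatibility.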
Why containers are not enough
Most of the problems can be solved using containerization. A container would “wrap up” your code and data into a complete environment that can be run independently.
However, not every data scientist or computational researcher wants to take a deep dive into software development practices: you definitely have other tasks on your list!
Besides, containers won’t transfer original computing power to a new machine. This means that if you created your code on high-performance hardware, replicating it on a less powerful one may easily fail.
Code and project collaboration
Code collaboration goes beyond code sharing and, consequently, poses additional challenges.
We already mentioned that user management can take some effort on your part. Cloud tools such as Dropbox, Google Colab, and others make this much easier.
Tracking changes is another issue. Even if you work alone, sometimes you want to go back to a previous version if the last one has not proven successful. When more than one person works on the same project, this may get tricky. Dropbox, for instance, saves all versions of one file, but as soon as it has been renamed or duplicated, the version history is spoiled.
You could create many files for the same piece of an experiment – one per person and version – but then you end up with a total chaos of files and wouldn’t easily know which code has worked best.
Change tracking or code versioning tools allow you to pick up the best-performing code chunks and assemble them into the perfect working code. They also ensure that you can always reverse any changes.
Code versioning tools play a crucial role in reproducibility. They allow others to reuse your findings in a different setup. These tools not only host your code but also help the next team to keep track of the differences between their setup (e.g., using different libraries) and yours. They enable all collaborators to rely on the same core code, regardless of whether this happens during the original experiment or during its reproduction.
Dataset reproducibility
Dataset reproducibility in computational research is another headache. Computational research datasets tend to be huge, which may already limit your choice of hosting options. There are a few other things to keep in mind as well.
As with code sharing, user management needs to be simple and transparent. You need to be able to control who has access, why, and when, and to be able to easily onboard colleagues, referees, and so forth. This can pose technical challenges.
Loading data into running models
Every time you run your code, you need to load your data into it. This means that you need a storage solution for which your programming language offers a connector. The solution must allow you to load large volumes of data in chunks, be reasonably quick, and place no limit on how much you can load with one query.
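To illustrate what loading in chunks means in practice, here is a minimal sketch that assumes the data sits in a local CSV file with a header row. The function names, the default chunk size, and the column name used in the usage note are illustrative; a real project would more likely use its storage solution's own connector.

```python
import csv
from itertools import islice


def read_in_chunks(path, chunk_size=10_000):
    """Yield lists of row dictionaries, each holding at most chunk_size rows."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                break
            yield chunk


def column_mean(path, column, chunk_size=10_000):
    """Compute the mean of a numeric column without loading the whole file."""
    total, count = 0.0, 0
    for chunk in read_in_chunks(path, chunk_size):
        total += sum(float(row[column]) for row in chunk)
        count += len(chunk)
    return total / count if count else float("nan")
```

Calling `column_mean("measurements.csv", "value")` on a hypothetical file would keep at most 10,000 rows in memory at any time, however large the file is.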
The code versioning tools market is oversaturated, but you won’t find the same variety of offerings for data versioning tools.
Meanwhile, in many research areas fresh data is critical. Different data may lead to different results, and so may different approaches to processing it. You need to be able to trace any change in your research outcome back to its origin. This makes your results more transparent to external observers, who can be sure that you did not tamper with your data to make the outcome fit your original hypotheses.
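One simple building block for tracing such changes is a content hash: if the hash recorded with a result no longer matches the data file, the data has changed since that result was produced. The sketch below assumes the dataset is a single local file; the function names are illustrative, and a hash only tells you that something changed, not what.

```python
import hashlib


def file_sha256(path, block_size=1 << 20):
    """Hash a file block by block so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def has_changed(path, recorded_hash):
    """Return True if the file no longer matches the hash recorded earlier."""
    return file_sha256(path) != recorded_hash
```

Storing the value returned by `file_sha256` next to every published figure or table gives an external observer a cheap way to check that they are looking at the same data.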
When new data means new code
When you update your dataset, it may mean only adding a few rows, but may also be a complete replacement. In certain cases, you can even decide to change your method of data collection and sampling. This will lead to changes in your dataset structure.
As a consequence, changes in the code – or even the algorithm! – may be necessary. That means that change tracking should work for both code and dataset as a bundle.
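Treating code and dataset as a bundle can be sketched as a manifest that records a hash for every code and data file and derives one identifier from all of them. This is only an illustration under simple assumptions (local files, hypothetical names such as `bundle_manifest` and `bundle_id`), not a replacement for a real versioning tool.

```python
import hashlib
import json


def sha256_of(path, block_size=1 << 20):
    """Hash one file block by block."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def bundle_manifest(code_paths, data_paths):
    """Map every code and data file to its hash and derive one bundle ID."""
    manifest = {
        "code": {path: sha256_of(path) for path in sorted(code_paths)},
        "data": {path: sha256_of(path) for path in sorted(data_paths)},
    }
    # The ID covers file names and contents of code and data together.
    canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["bundle_id"] = hashlib.sha256(canonical).hexdigest()
    return manifest
```

Because the identifier depends on both parts, a change to either the script or the dataset yields a new `bundle_id`, so a result labeled with that identifier points to one exact combination of code and data.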
Why Git is not always the solution
While software engineers rely on Git for good reason, data scientists, ML specialists, and computational researchers may miss a few critical functionalities.
How Git works
Git is a distributed version control system. Popular hosting platforms such as GitHub, GitLab, and Bitbucket are built around it. You can also set up your own Git server if you want.
Git not only saves all changes, it breaks the process of saving them into multiple stages. This allows teams to control which changes should be added and which should be rejected.
While proposed changes are under review, new ones may already be under development. How do you keep them all apart and still have working code? This is solved through branches: parallel strands of your code’s evolution that can later be merged back together.
Sounds cool, doesn’t it?
Git is for data-agnostic code
Git technology does not track changes in datasets. The only way it can capture those is to track changes in the data files that you keep in your Git repository.
However, when you need to revert changes in both code and dataset, you need to do it manually for all files involved (CSV files, code scripts, etc.), which can lead to missing an important change. It also takes a lot of your time!
The reason why Git does not capture the whole bundle of changes – data changes and consequent code changes – is that it was made primarily for software developers. In software development, the code needs to be as data-agnostic as possible.
By contrast, data scientists and computational researchers need to focus on the data.
Git adds complexity
Git emerged as a technology for open-source projects but its recent development makes it more fit for commercial use. This is not bad; it only means that Git aims to prevent any harmful code changes from getting into production systems. Pull requests, revisions, merges, branches, etc. all ensure commercial safety.
But what researchers need is to experiment and test new ideas quickly. Of course, systematic change tracking does not hurt as long as you can focus on the main process. As a Git beginner, though, you may find yourself confronted with a steep learning curve, for which you may not have the time or capacity.
Addendum: code conversion
Occasionally, you may want to reuse algorithms in a different programming language than they were originally written in. This is rarely the case for someone who wants to make their own research reproducible, but may apply rather to those who want to ensure reproducibility of computational research done by others. While this is something of an edge case, it is another technical challenge to reproducibility that is worth briefly talking about.
Popular data science languages
Quite a few programming languages can be used in computational studies. Most data scientists, machine learning specialists, and computational research scholars stick to one of the following:
Once you join a new team or want to switch to a different language to overcome some limitations of the original, you will need to get your algorithms working in it. There are a few options for how you can do that.
Wrap code in other code
For some languages, you can use a wrapper. A wrapper is a library in one programming language that gives access to code written in another. It allows you to program in the target language and call the old code from the new code through this library, without rewriting it.
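The wrapper libraries described above usually call the old code in-process. A cruder form of wrapping that works for almost any language is to run the old code as an external program. The sketch below assumes, purely for illustration, that the old code can be started from the command line, reads JSON from standard input, and writes JSON to standard output; the function name `call_legacy` is hypothetical.

```python
import json
import subprocess


def call_legacy(command, payload, timeout=60):
    """Send payload as JSON to an external program and parse its JSON reply."""
    completed = subprocess.run(
        command,
        input=json.dumps(payload),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError if the external program fails
    )
    return json.loads(completed.stdout)
```

With an R script that follows this convention, a call might look like `call_legacy(["Rscript", "model.R"], {"values": [1, 2, 3]})`, where `model.R` is a hypothetical file name. Starting a new process for every call is slow, so this approach suits coarse-grained steps rather than functions called in a tight loop.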
Converting code from one language to another
This is the bigger challenge: to recreate your algorithm in the target programming language. Particularly if you originally used a proprietary or largely menu-driven tool, such as MATLAB, IBM SPSS, SAS, or Stata, you may need to write much of your code from scratch.
For open-source programming languages, the conversion may be less painful. For instance, R and Python are comparatively easy to convert into one another manually: both are high-level, dynamically typed languages, and much of their everyday syntax looks similar.
For some programming languages, you can try an automated code conversion tool. It does most of the translation work for you, although the result will certainly need some final polishing on your side. If you have been following updates from OpenAI, you may know that it has developed Codex, a code-generation model that can also be used to translate code between languages. At the time of writing, access is through a waiting list. Facebook’s TransCoder AI is already available and focuses on conversion between Python, C++, and Java.
The last option, which requires a lot of programming and software development experience, is to use a transcompiler, or transpiler. It translates source code from one programming language into another. Unfortunately, you have to provide a transpiler with the translation rules manually.
Pitfalls of code conversion
Experienced data scientists will know that source code conversion means more than just replacing commas with semicolons.
The main challenge is to correctly reproduce the existing logic. Keep in mind that the goal is not to recreate the old code, but the algorithm behind it. Those are not identical, for the simple reason that languages differ in their paradigms and default behavior. R, for example, has strong functional-programming roots and is built around vectors, whereas Python is a general-purpose, multi-paradigm language with a strong object-oriented flavor.
This implies that you cannot simply translate line by line. You need to go back to your algorithm description, which you hopefully documented in a clear and concise manner. Starting from there, you can write new code, leaning on the existing code where both languages share the same features.
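Two small differences between R and Python show why a literal translation can silently change results. R indexes from 1 and includes both ends of a range such as `x[2:4]`, while Python indexes from 0 and excludes the end of a slice. R’s `sd()` divides by n - 1, while some Python functions, such as `statistics.pstdev` or NumPy’s `std` with its default settings, divide by n. The helper `r_style_slice` below is illustrative and only handles the simple case where the start does not exceed the stop.

```python
import statistics


def r_style_slice(values, start, stop):
    """Emulate R's values[start:stop] for start <= stop: 1-based, inclusive."""
    if start < 1 or stop < start:
        raise ValueError("expected 1 <= start <= stop")
    return values[start - 1:stop]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# A literal translation of R's x[2:4] picks different elements in Python.
literal = data[2:4]                    # two elements: positions 3 and 4 in R terms
faithful = r_style_slice(data, 2, 4)   # three elements, as R would return

# The sample standard deviation (n - 1) matches R's sd();
# the population standard deviation (n) does not.
sample_sd = statistics.stdev(data)
population_sd = statistics.pstdev(data)
```

Neither variant is wrong in itself. The point is that the new code must follow the documented algorithm, which should state which definition is meant.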
We have pointed out the main technical pitfalls that can await you on your way to completely reproducible computational research. We have also outlined some possible solutions, along with the difficulties that these solutions bring in turn.
Keeping all those considerations in mind, we prepared for you a short checklist of technical requirements for reproducible research. Any solution you use must be able to tick all these boxes if it is to enable reproducibility without causing more problems down the line.
A checklist for reproducibility in computational research
- Code and dataset accessibility, during the project and also post-mortem
- Simple change tracking
- Dataset change tracking
- Ability to reverse changes accurately
- User access management
- Scalable computational power and data storage
These functionalities ensure that your efforts won’t be wasted and your results remain transparent for external audiences and reusable for yourself.
Nuvolos: a cloud platform for reproducible research
Naturally, those challenges can be solved by using a solution that is built to address all these issues. This means a cloud platform specialised in computational research that enables reproducibility, scalability, change tracking, and collaboration all in one.
Fortunately, there is a solution that can do all that: Nuvolos, a platform for collaborative computational workspaces. It was built by computational scientists to address the technical challenges of reproducibility in computational research, and to give you your valuable research time back by solving these problems for you. Start a free trial now to explore the main features. Alternatively, check our documentation to learn more about Nuvolos and what it can do.