Table of contents
If you’re a data scientist or computational researcher, you probably know how to do code version control. But what about data versioning? Here’s where many computational scientists run into problems. What’s good for code is not necessarily good for data. This article explains the differences and how to solve your data versioning problems.
Data Versioning vs Code Versioning in Git
Data science prediction models and machine learning models are shaped by the data that you use to develop them. For this reason, adequate dataset version control becomes a pressing issue.
Version control as we know it – which is really mainly Git, if we’re being honest – relies on tracking changes in plain text files. In the background, it simply treats your code as plain text and compares the characters in the latest version with the previous one. When someone wants to get the newest code from Git, they download the history of changes with it.
Of course, it is also possible to store your data directly in the same Git repo where your code resides. But data science and machine learning datasets tend to be huge. Apart from the fact that Git-based hosting services may impose restrictions on the size of your repository, downloading the whole history of your data files becomes a problem: it makes cloning very slow, especially for someone who needs to clone your code repository for the very first time, since they receive not just the current data but every past version of it. The biggest issue, however, is that Git can only meaningfully track changes in data stored as text. For binary formats – images, audio, video – it keeps a full copy of every version, with no sensible diffs.
How Alternative Methods for Data Versioning Work
When this issue first came up, a solution was quickly found: datasets were moved out of the Git repositories and replaced with so-called pointers. A pointer is a hash generated from the contents of the file. Since a hash depends only on the content, not the file name, it changes exactly when the data changes – renaming a file does not affect it. Pointers reside inside tiny, lightweight files.
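The idea can be illustrated in a few lines of Python. This is a minimal sketch of the principle, not how Git LFS or DVC is implemented internally: the pointer is just a hash of the file's bytes, so identical content always yields the same pointer, while changing a single byte produces a new one.

```python
import hashlib

def content_hash(data: bytes) -> str:
    """Return a SHA-256 hex digest of the raw file contents."""
    return hashlib.sha256(data).hexdigest()

# The hash depends only on the bytes, not on any file name:
original = content_hash(b"column_a,column_b\n1,2\n")
renamed  = content_hash(b"column_a,column_b\n1,2\n")  # same bytes, "different file"
edited   = content_hash(b"column_a,column_b\n1,3\n")  # a single byte changed

assert original == renamed   # identical content -> identical pointer
assert original != edited    # any change -> a brand-new pointer
```

Committing this short digest instead of the data itself is what keeps the Git history small.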
Pointers are then again treated as text files and can be Git-versioned like any other file in your Git repo.
So where does the actual data go? The essential thing to understand is that the actual data is stored as snapshots in a separate location – a local cache or a remote storage backend – outside the Git history.
When you check out a particular version of your project, the matching snapshot is fetched on demand, based on its pointer. This is done by either Git LFS or DVC, the two open-source technologies that have established themselves as the most popular data versioning tools.
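In Git LFS, for instance, the pointer file that actually gets committed to your repo is just three lines of text (the oid and size below are illustrative placeholders, not real values):

```
version https://git-lfs.github.com/spec/v1
oid sha256:4c7d0e3a... (a full 64-character hexadecimal digest in a real pointer)
size 132738144
```

This tiny text file is versioned by Git like any other source file; the large dataset it points to lives in LFS storage.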
Keep two things in mind: first, these methods are technologies for enabling version control – like Git itself – and not actual storage solutions. Second, they only work for data versioning on top of Git, although DVC has a few features that function without it.
Let’s have a closer look at both.
Git LFS is an extension of Git. It links changes in your code with changes in your data and tracks both of them. After installing Git LFS locally, you do need to specify explicitly which files you want versioned. Once done, you can save both types of changes (code and data) with a single command in the command line.
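A typical first-time setup looks like this (the file patterns and names are illustrative; adapt them to your project):

```shell
# One-time setup inside the repository
git lfs install                 # enable the Git LFS hooks locally
git lfs track "*.csv" "*.h5"    # tell LFS which file patterns to version
git add .gitattributes          # the tracking rules themselves are Git-versioned

# From here on, code and data changes are committed together
git add data/training.csv train.py
git commit -m "Update training data and model script"
git push                        # pointers go to Git, file contents to LFS storage
```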
If you host your code on Bitbucket, GitHub, or GitLab, it is easy to use Git LFS, as they provide both the necessary storage and the backend. However, GitHub imposes tough usage limits that are calculated in a somewhat tricky way – you can hit them faster than you think!
If you do not use any of these three services, you can set up your own Git LFS server. Currently, there is no easy out-of-the-box solution for that: first, you need to provide your own server (on hardware or in the cloud), and second, you need to implement one of the available Git LFS server technologies. Fortunately, there are various open-source options: simply choose the one you prefer.
If you go for a personal Git LFS server, you can set up a custom remote storage as well, so you do not need to use the same server as your storage solution.
Whether this option is attractive depends on how much flexibility you want and whether you have enough capacity and knowledge to bother going through all of this. While doing it yourself can work, it involves considerable complexity and overhead. On the other hand, using GitHub or similar solutions subjects you to usage limits.
DVC is a Python library. It, too, aims at matching changes in the code with the changes in the data. For every dataset, you’d have a data.dvc file (raw.dvc, training.dvc, you name it) in your project folder that contains the hash associated with the current snapshot of that dataset. This file will be Git versioned if you use Git.
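Concretely, putting a dataset under DVC control looks like this (file names and the hash below are illustrative):

```shell
dvc add data/training.csv    # hashes the file, moves it into the DVC cache,
                             # and writes a small pointer file next to it
git add data/training.csv.dvc data/.gitignore
git commit -m "Track training data with DVC"

# The generated training.csv.dvc is a tiny YAML file along these lines:
#   outs:
#   - md5: d3b07384d113edec49eaa6238ad5ff00   # illustrative hash
#     size: 10485760
#     path: training.csv
```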
Saving changes in the code and in the data involves two different workflows. If changes to .dvc files and code changes are committed together, they will appear connected, although this need not be the case. For machine learning projects, for example, you could save a new data version together with a new model version, which allows linking them. But if you ‘only’ do data science or analytics, there is no built-in guarantee that a given code version corresponds to a given data version, since code and data change independently.
To deal with this problem, DVC offers a feature called experiment management, which isolates related changes in data, code, and/or model configuration as an experiment. However, experiments cannot be saved to Git directly.
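A sketch of that workflow, using DVC's experiment commands (the experiment and branch names are hypothetical):

```shell
dvc exp run                            # run the pipeline; record code, data and params as one experiment
dvc exp show                           # compare recorded experiments in a table
dvc exp branch exp-1a2b3 better-model  # promote a chosen experiment to a regular Git branch
```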
DVC compensates for this core weakness with a few other features that can be useful for machine learning specialists, such as tracking your model metrics – ROC and AUC – for precise model evaluation and plotting these metrics using DVC.
DVC also tries to provide some basic MLOps functionality. It allows you to build pipelines. Those are not ETL processes in this case, but rather continuous delivery automation. A pipeline is a recipe for a workflow that you use often, including code scripts that you run and configuration files that you use. The pipeline glues them together so that their sequence does not have to be run manually every time you need it.
Stages inside the pipeline may generate an output that then needs to be inserted into the next stage. DVC ensures that this happens and that the correct data is passed over.
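A minimal dvc.yaml for a two-stage pipeline might look like this (the scripts and file names are hypothetical):

```yaml
stages:
  prepare:
    cmd: python prepare.py        # produces the cleaned dataset
    deps:
      - prepare.py
      - data/raw.csv
    outs:
      - data/clean.csv            # the output of this stage...
  train:
    cmd: python train.py
    deps:
      - train.py
      - data/clean.csv            # ...is a tracked input of the next one
    outs:
      - model.pkl
```

Running `dvc repro` then re-executes only the stages whose dependencies have actually changed.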
One nice thing about DVC is that you can store all your data on your local machine where DVC is also installed and it will still work for data versioning. Alternatively, it is possible to set up remote storage, by selecting one of the multiple options.
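Configuring a remote is a one-off step (the bucket URL is a placeholder):

```shell
dvc remote add -d myremote s3://my-bucket/dvc-store   # -d makes it the default remote
dvc push    # upload snapshots from the local cache to the remote
dvc pull    # fetch the snapshots matching the currently checked-out version
```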
So Which Data Versioning Tool is the Best?
As you can see, data versioning technology can get quite intricate, and there is a lot to remember. But don’t worry: we put everything together in a straightforward comparison table.
We compare them with a Git-only approach, so that you can better understand the relative benefits of data versioning tools as well as their drawbacks – primarily the overhead and complexity, which cost you time.
As you will see, there is one more solution, which we added on the right side: Nuvolos, our own integrated computational science platform.
How to Use Nuvolos for Data Versioning
Nuvolos was made for computational researchers and data scientists by people from the field, in order to solve the common problems of data versioning, hardware complexity and overhead, and to reduce the time spent on maintaining tools and methods. Nuvolos combines scalable computational power with a flexible UI and encapsulates a wide range of programming languages and data science applications.
Nuvolos’ approach to data versioning is different from the well-worn solutions described above. With Nuvolos, you can create different instances of your project to test different methods or to branch out different paths as your project evolves. Since Nuvolos allows you to save the state of your project instance at any time, you can fully track and access the history of your work.
This method also seamlessly joins data versioning to code versioning. Nuvolos simply takes one snapshot of the entire environment: code, data and application stack! You never need to revert changes, as you can simply go back to a past state, duplicate it, and start working again from there. And unlike with DVC experiments, Git is a first-class citizen for Nuvolos: you can effortlessly sync Nuvolos to Git.
How Nuvolos and Git Work Together
Nuvolos and Git can easily be used together.
Any Git-hosted project can be cloned into Nuvolos, even a private one: the platform generates a public/private key pair for each user at signup. Users can add the public key to any of their Git repositories to grant access from Nuvolos; the private key is configured as a variable in their environment.
Nuvolos provides consistent, systematic support for Git operations across all applications. Some applications even come with a GUI for it, e.g. VS Code, JupyterLab, R, or MATLAB, which, thanks to Nuvolos’ cloud-native technology, you can simply run in your browser any time. This means that with Nuvolos, you do not have to choose between data version control and using Git: the platform can serve as a standalone version control technology for both code and data, without making Git integration a headache.
We hope to have shed some light on the most common approaches to data versioning and what you should consider before using them. Our approach at Nuvolos is to offer the best of both worlds for code and data versioning, by allowing users to save their entire environment at will and offering first-class integration with Git. In this way, Nuvolos is purpose-built to save you time and effort on tooling overhead and on organising your work. You can try it out now with our free trial.