Table of contents
- Why Follow the Same Structure With All Your Data Science Projects?
- How to Structure a Machine Learning or Data Science Project
- What Tools Can Help You Manage Your Data Science Project?
- Data Science Project Folder Templates
- An Alternative to a Git-Based Structure: Nuvolos
What could be worse than trying to keep a brilliant idea in your head while scrolling through your file chaos?
Data science projects need discipline. This starts with a straightforward folder structure for your project.
Why Follow the Same Structure With All Your Data Science Projects?
From the very beginning to a successful end – e.g., publishing your results – a properly structured project ensures smooth progress.
Experiments and model training require multiple iterations through a similar process. If you lose track of what you’ve done during the previous iteration, you can easily mess up the next one.
Quite often, you might have a few collaborators. Moreover, you may have a team lead or a manager responsible for the project and budget allocation. That’s why you need transparency: visibility into who has done what. Without a consistent project structure, even a technically perfect change-tracking tool will not capture changes correctly.
Besides, any new team member will have an easier time when all your projects have a standard structure.
By introducing such a standard, you set up an effortlessly repeatable process of sharing, whether for publication or collaboration purposes.
How to Structure a Machine Learning or Data Science Project
Where to start? For one, it’s not only about the folder structure.
The decision you need to make is how you group or isolate your pieces of code (or notebooks), your data files, and supplementary files. You will be asking yourself questions like: “Should I put all my code into one file/notebook? How do I organise my data files?”
Of course your project will have not only “code/notebook” and “data” files. We’ll take a deep dive into it in the practical part. But let’s discuss a general approach first.
General Hints and Tips
You can continue reshaping your files and folder structure on the go, but you need a general framework that fits the process you follow.
First, consider whether you can break this process into independent stages, such as data collection, data processing, model prototyping, model testing, model refinement, etc. Further, think about whether certain steps should be isolated, e.g. if they can be reused, or whether they may or may not change in the future.
This will help you separate code chunks into scripts that can be run independently and have standalone functionality. Finally, you can group the scripts into folders, following a simple rule: “One stage = one folder”.
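To make the rule concrete, here is a minimal sketch that scaffolds such a layout; the stage names are illustrative assumptions, not a prescribed set:

```python
from pathlib import Path

# Illustrative stage names - adapt them to your own process.
STAGES = ["data_collection", "data_processing", "modelling", "model_testing"]

def scaffold(project_root: str) -> Path:
    """Create one folder per stage, following "one stage = one folder"."""
    root = Path(project_root)
    for stage in STAGES:
        (root / stage).mkdir(parents=True, exist_ok=True)
    return root

scaffold("my_project")
```

Running this once gives you an empty skeleton you can fill stage by stage.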
How many stages you have depends on multiple factors.
First, your data affects this to a certain extent. If your raw data needs little preprocessing and can be easily collected, one script or notebook may cover it. In the opposite case, you may need to build a complex pipeline and add some automation.
Second, practice shows that machine learning projects and data science projects differ in complexity. Data science models are often fine-tuned manually and, after successful testing, can be considered final. By contrast, machine and deep learning models involve extensive training and testing stages and require more iterations through the same cycle.
Third, it matters a great deal whether your data science or machine learning project will remain standalone or be embedded into a bigger system – for example, whether your model should run permanently and supply another system or application with its output. In such cases, you need to think about deployment automation, and the code you use for experimenting must be clearly isolated from the code that will be linked externally.
Which Folders Should You Include?
Let’s get more specific and try to create a simple checklist for your data science or machine learning project.
Data Folder
We just talked about stages, but there is one important exception: consider keeping all your data in a separate place.
The reason is that you will need to refer to this folder path inside many of your code scripts. It is way easier if you do not need to invent a new path every time you want to load or export your data.
However, you should have subfolders that contain data needed or generated at different stages of your project.
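One way to implement this is a single module that defines the data paths once, so every script imports the same constants instead of inventing its own; the folder and file names below are illustrative:

```python
from pathlib import Path

# One place that defines where data lives; every script imports these.
DATA_DIR = Path("data")
RAW_DIR = DATA_DIR / "raw"              # data as collected, never edited
PROCESSED_DIR = DATA_DIR / "processed"  # ETL output, ready for modelling

for folder in (RAW_DIR, PROCESSED_DIR):
    folder.mkdir(parents=True, exist_ok=True)

# Any script can now build paths without hard-coding them:
input_file = RAW_DIR / "measurements.csv"  # hypothetical file name
```

If the data location ever moves, you change one line instead of hunting through every script.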
Data Processing Folder
As mentioned before, put everything that prepares data for modelling into one folder. This not only allows for quick navigation in your project repository but also prevents you from mixing your models with your extract-transform-load (ETL) processes. ETL processes must deliver a consistent output that can be ingested directly into models; keeping them separate prevents unwanted effects on the model output.
If you still feel that you need a folder for quick prototypes and if you work with Jupyter, feel free to create as many experimental notebooks as you want. Just make sure to keep your chaos locked in one folder and do not forget to integrate the best working code snippets into the final model version.
Model Folder
Models are the core of your project structure. They are also the ultimate goal of the whole experiment and deserve a separate folder.
We’d like to emphasise that this folder must be the place for well-structured models. You can have subfolders for models in different maturity phases, e.g. draft, trained, serialised, etc.
For computational researchers, this folder belongs to the publication package that is to be shared with the reviewers.
Documentation Folder
Documentation files help you and your collaborators understand the main milestones of your project and how they were implemented. It is good practice to keep at least one simple Readme file in your project.
In this folder, you can include everything that helps another person get your models up and running, things to remember about your datasets, and other notes.
Optionally, you can keep published studies and other literature here.
Automation Scripts Folder
For big projects and projects deployed to production systems, we recommend having a separate folder for your automation scripts.
Examples of such scripts are deployment pipelines and model training or testing scripts.
Separating automation code from the main part allows you to reuse it with your next project as well as to “turn off” the automation if you do not need it any longer.
When you have processes that take a long time to run and/or need to run in batches, orchestration helps you manage them. A few open-source technologies are available for this purpose, including the Python-based ZenML and Luigi. The former is even made specifically for machine learning orchestration, e.g. for model training. Because these tools are Python-based, you do not need to go to an external tool to set pipelines up; it can be done even in a Jupyter notebook. And if you keep this part disentangled from the main project, you can later switch to a different orchestration framework.
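The core idea behind such frameworks – a task declares its output and is skipped when that output already exists – can be sketched with the standard library alone. This is not Luigi’s actual API, only the principle, and the pipeline stages are invented:

```python
from pathlib import Path

def run_once(output: Path, task) -> None:
    """Run `task` only if its output file is missing - the batch
    orchestration principle used by tools like Luigi."""
    if output.exists():
        print(f"skip: {output} is up to date")
        return
    output.write_text(task())

# Hypothetical two-stage pipeline: extract, then train.
run_once(Path("extracted.txt"), lambda: "raw records")
run_once(Path("trained.txt"), lambda: "model weights")
run_once(Path("extracted.txt"), lambda: "raw records")  # skipped on rerun
```

Re-running the whole pipeline is cheap, because completed stages are detected by their outputs and skipped.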
Reporting and Data Visualisation Folder
If you have any external audience or stakeholders already at the early stage of your progress and need to deliver updates to them, it makes sense to have a dedicated place for this too. In this folder, you can store code or notebooks that are designed for generating reports or data visualisations.
What Tools Can Help You Manage Your Data Science Project?
Is This Too Much?
If this list of project elements feels a bit long and does not make you enthusiastic, let us talk about tools that can help you to organise your data science or machine learning project.
You may have wondered what we mean by reusing your project. When your code and data are well-organised, you can duplicate one of your existing projects and replace only selected pieces instead of building the same structure from scratch.
The following tools can help with this.
Cookiecutter
Cookiecutter recreates a project’s folder structure from a template without copying any of your files. It is useful if you want to assemble the project contents manually but still follow the same template every time.
Configuration Management
Your models usually have different parameters that you vary with every experiment or when you get new input data. One way to deal with this is to create multiple models. Another is to separate the permanent part of the code from the variable parameters and treat the latter as configuration.
This can be done with the Python Hydra package or other configuration management tools.
Isolating configurable parameters has implications for your project structure. You’ll need to create a folder to store all configuration and supplementary scripts that “insert” configurable parameters into your models.
We recommend having separate subfolders for model configuration and other configurations, e.g. ETL processes.
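A standard-library sketch of this separation – Hydra does it far more conveniently with composed YAML files, but the principle is the same; the file name and parameters are illustrative:

```python
import json
from pathlib import Path

# Variable parameters live in a config file, not in the model code.
Path("config").mkdir(exist_ok=True)
Path("config/model.json").write_text(
    json.dumps({"learning_rate": 0.01, "epochs": 20})
)

def train(learning_rate: float, epochs: int) -> str:
    # Permanent part of the code: unchanged between experiments.
    return f"trained for {epochs} epochs at lr={learning_rate}"

params = json.loads(Path("config/model.json").read_text())
print(train(**params))
```

A new experiment then only needs a new config file; the model code stays untouched.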
Dependency Management
When someone else wants to run your project, they need to ensure that all libraries (packages) your code depends on are installed in their environment.
The classic way is to create a ‘requirements.txt’ file at the root level of your project and list all dependencies there, i.e. the packages – ideally with their versions – that do not ship with the base installation. A more modern way is to let a tool such as Poetry manage dependencies, but the operating principle is the same: you run an install command in a CLI tool, and your dependency manager goes through the list you have prepared.
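For illustration, such a file might look like this (the package names and version pins are examples, not a recommendation):

```
pandas==2.1.4
scikit-learn==1.4.0
hydra-core==1.3.2
```

Running `pip install -r requirements.txt` then installs exactly these versions into the environment.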
Converting Your Project Into an Application
The summit of your automation efforts would be to turn your code into an independent application. By “application” we simply mean a software artefact that runs your code and that you can interact with, typically through a CLI tool.
For this, you need to group your scripts – automation, orchestration, and data preprocessing – into one folder and create an initialisation script for your application.
By convention, this folder is usually called `src` (= “source code”).
Furthermore, you can customise your application by creating its own command reference. It can be any command that activates only certain scripts and – optionally – provides arguments for them.
For instance, you can create a command that runs your ETL processes on your time-series data and takes a time period as an argument. A popular open-source tool for writing such commands is Make; a Makefile stores your command reference.
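A Makefile for the hypothetical ETL command above might look like this (the script paths and variable names are assumptions):

```makefile
.PHONY: etl train

etl:
	python src/etl.py --start=$(START) --end=$(END)

train:
	python src/train.py
```

You would then run, e.g., `make etl START=2023-01 END=2023-06` to process a specific period.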
Data Versioning Management
As explained in our previous article, Git is not suitable for data versioning.
You can use manual versioning and add a timestamp or another indicator to your data file names. However, there is a data versioning tool called DVC that works much like Git-based code versioning: it tracks changes in your data and links them to the stages of your project.
You can use DVC and Git-based technology at the same time. You only need to disentangle data versioning from the source code versioning as described in the DVC documentation.
Data Science Project Folder Templates
Based on our discussion so far – and, of course, our experience as a computational research platform provider – we have a few suggested templates for structuring your data science or machine learning project folders.
We offer two project structure templates, one for data scientists and one for machine learning specialists. Both were made with Git-based versioning in mind and include an application folder.
It means that you can easily push this project to Git, and your collaborators can clone it from there and run it using a CLI tool.
The machine learning template contains additional folders that store data and scripts for each of the three model development stages: model training, testing, and validation.
Data Science Project Template
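As a rough sketch of the kind of layout we mean (the folder names are illustrative, not the exact template):

```
project/
├── data/
│   ├── raw/
│   └── processed/
├── etl/
├── notebooks/
├── models/
├── docs/
├── config/
├── src/
├── Makefile
└── requirements.txt
```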
Machine Learning Project Template
You may have noticed that for every “functional” folder or script – including datasets, ETL processes, and models – you need automation scripts, documentation, and configuration.
Follow this principle to keep your project easily readable and transparent.
An Alternative to a Git-Based Structure: Nuvolos
Perhaps this structure is a bit too complex for some projects.
As we explained when talking about data versioning problems, Git may require a lot of effort without adding convenience to your data science project.
How about an alternative? Meet Nuvolos, a computational research platform for rapid prototyping of data science and machine learning models.
Data Storage and Versioning
You can keep your data directly in your Nuvolos-based project as files, in a Snowflake database that comes together with it, or in external storage.
If you store data as files, you do not need a separate data versioning tool: Nuvolos snapshots (described below) capture your data together with your code.
If you store data in Snowflake, you can view it directly in Nuvolos using our Tables feature.
How Nuvolos Works
In Nuvolos, there is a hierarchy of concepts that inherit the best Git-versioning ideas for computational research. First, we have organisations, the highest level in the hierarchy. They represent your institution or company.
On the next level, you have workspaces: those are your projects within the same organisation. For every workspace, you can create multiple instances: those are analogous to branches in Git. Instances allow you and your collaborators to have the same start but follow separate paths in developing your models.
Every instance contains a complete project, with all folders and data. You can also imagine it as a designated machine that encapsulates a particular approach.
Within an instance, you can create snapshots: complete copies of your branch made at a certain point in time, with all your code and data.
Snapshots allow you to reverse changes you have made to code and data in one leap, or to generate a new instance.
Applications in Nuvolos
Instead of creating your application manually, you can use Nuvolos’ Application feature. You can still customise your applications to a certain extent using Make and continue using Hydra for configuration management.
However, as a browser-based solution, Nuvolos spares you a lot of incompatibility issues.
How to Use Nuvolos With Git
Nuvolos natively offers support for Git operations in all applications. Some of them, such as Visual Studio Code, even come with a GUI for this. In other cases, you can use a private key pair that is automatically generated in your account to enable access to Git from Nuvolos.
Nuvolos is a cloud-based platform. Scaling computing power or storage is just a click away.
This means that you can access and run your models from anywhere – all you need is internet access.
Besides working on different branches – instances – of the same project yourself, you can of course invite your collaborators to edit the same instance you are working on.
Well-structured data science projects are transparent, reproducible, and easy to collaborate on.
Good discipline means some additional effort at the beginning, but once everything is in place, you can focus on developing and testing new ideas instead of searching for where you stopped yesterday.
We wish you good luck with your next project!