Data Science Platform: Why Buy And Not Build?

We’ve been following the evergreen discussion on whether you should buy or build your data science platform. 

Although we do not disagree with the frequently proposed solution, some arguments in the discussion lack substantial background. 

We offer a neutral judgement and provide a few insider tips from our experience. 

What Is a Data Science Platform?

First, let’s distinguish a “data science platform” from other entities with similar features. By a data science platform, we mean more than just big data storage with a crude GUI on top of it. Nor is it just a data science tool used for writing and testing algorithms.

What we are talking about could also be called a computational research platform. Such platforms provide at least the following core functionalities:

  • Cloud data storage
  • Connector to this storage
  • Tools for writing and testing algorithms
  • Computing power
  • Version/source code control
  • Collaboration
  • Reproducibility 

A data science platform enables both scholarly researchers and private sector specialists to keep their projects entirely in one workspace, without splitting different stages and provisioning them separately. 

Why Does Everyone Recommend Buying?

If you plan or already run a project requiring a lot of computational power, precise tracking of model changes, easy scaling and team collaboration, you have surely looked for buy-vs-build comparisons. And you may have noticed that many of them are written by vendors. Can they be objective?

No doubt, buying is cheaper in the short term. Vendors share development costs across all their customers. Therefore, they can offer you their products for only a fraction of the total cost.

However, if we treat the expenditure not as wasted money but as an investment, some arguments for building emerge.

Why Build Your Data Science Platform?

Building your own platform can have direct advantages as well as help to avoid risks associated with buying. 

Direct Advantages

When you use the platform later, it can be crucial to understand how it was built and how it works. With that knowledge, you can track down bugs and incompatibility issues more easily. This is harder when you are dealing with a black box.

Moreover, your support and development teams are much closer to your data science team. Although much depends on your organizational culture, internal communication tends to be faster than negotiating with a vendor.

It also means that you can change the platform once your needs change. By contrast, a vendor may not have the capacity to anticipate every customer’s requirements.

Time for the first disappointment: the essential prerequisite for enjoying those advantages is that you have enough internal resources.

Risks Associated With Buying

Being dependent on your vendor may have a certain impact on your projects. 

Keeping up with updates is one of the tricky issues. You may have to readjust your processes to match platform changes and keep your projects running. The opposite can also happen: your vendor may have difficulty keeping up with recent tech trends.

You also have to rely on the vendor’s support team to resolve bugs. 

Vendor lock-in can even result in unplanned effort and cost, for instance, if the vendor ships updates incompatible with your processes or decides to leave the market. In such cases, you’d bear the cost of finding a new vendor and migrating to their platform.

Figure: Cost development when building a platform

Customization Is King

However, if you do have the resources, time, and expertise, you can enjoy the main advantage of owning a data science platform. Customization is entirely in your hands!

A lack of customization may result in additional time spent finding workarounds and, consequently, in employee frustration, distraction, and delayed project deliveries.

Why might even good vendors miss a feature you need? While building their platforms, vendors iterate with a limited number of stakeholders to gather requirements. Even giant enterprises cannot take into account the needs of every potential customer. Their goal is to find a common denominator suitable for the majority.

Pursuing a catch-all strategy may lead either to feature oversaturation or to feature minimalism. Either way, the platform may lack the specific features that you need.

Have we convinced you that building your own data science platform is better? Let’s check how strong your dedication is!

Platform Development: Phases, Cost, and Manpower

Building a data science platform is a convoluted, iterative process. The user interface is not the biggest problem nowadays. Several of the most popular tools for writing and running algorithms are open source, including Jupyter, RStudio, and VS Code. It is possible to integrate them into the platform and add a custom GUI on top.

We would like to point you to less obvious pitfalls. 

Upfront Costs: Manpower and Infrastructure

The minimum team would include two developers and two data engineers. The total cost varies depending on multiple factors. The point is that you start paying wages well before the platform becomes usable and starts to pay for itself.

Apart from this, you will need to either buy or rent the necessary infrastructure upfront. You may need separate infrastructure for development and production.

If you manage your hardware on your own, this adds servicing costs and the risk of hardware failure.

Cost Accumulation

Another negative factor is that your costs will accumulate. You do not develop one piece of your platform and then move on to the next one. You make multiple iterations. With each iteration, you add new features and refine existing ones. But it also means that you still need the same expertise and the same infrastructure, plus some new expertise and new infrastructure.
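To make the accumulation tangible, here is a minimal sketch in Python. Every figure in it (phase lengths, team sizes, wages, infrastructure spend) is a hypothetical placeholder rather than data from a real project. The only thing taken from the text is the pattern: each iteration keeps the earlier team and infrastructure and adds to them.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    """One development iteration; all figures are hypothetical placeholders."""
    months: int
    team_size: int           # people on the payroll during this iteration
    infra_per_month: float   # infrastructure spend per month


def cumulative_cost(iterations, monthly_wage):
    """Return the running total of wages plus infrastructure after each iteration."""
    totals = []
    running = 0.0
    for it in iterations:
        wages = it.months * it.team_size * monthly_wage
        infra = it.months * it.infra_per_month
        running += wages + infra
        totals.append(running)
    return totals


# Hypothetical plan: start with the four-person minimum, then grow the team
# and the infrastructure with every iteration.
plan = [
    Iteration(months=3, team_size=4, infra_per_month=2_000),
    Iteration(months=9, team_size=5, infra_per_month=3_000),
    Iteration(months=18, team_size=6, infra_per_month=4_000),
]

totals = cumulative_cost(plan, monthly_wage=8_000)
for number, total in enumerate(totals, start=1):
    print(f"after iteration {number}: {total:,.0f}")
```

Replace the placeholders with your own estimates. The useful part is not the absolute numbers but how quickly the running total grows before the platform delivers any value.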

Scalable Storage

While making your data science platform scalable, remember that scalability covers more than computing power. Scalable storage is an important feature, too.

Not all scalability is the same. Some databases scale well for time-series data, and some do not. Some file storage is optimized for huge files but becomes inefficient when you store vast numbers of small files.
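One way to see why small files can hurt is a deliberately simplified model in which storage is allocated in fixed-size units. The 1 MiB allocation unit and the file sizes below are assumptions chosen for illustration; real storage systems differ, and many suffer from per-file metadata overhead rather than wasted space.

```python
def allocated_bytes(file_size, unit_size):
    """Bytes actually reserved when space is handed out in whole units."""
    units = -(-file_size // unit_size)  # ceiling division on integers
    return units * unit_size


def storage_efficiency(file_sizes, unit_size):
    """Share of the reserved space that holds real data (1.0 means no waste)."""
    used = sum(file_sizes)
    reserved = sum(allocated_bytes(size, unit_size) for size in file_sizes)
    return used / reserved


KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

UNIT = 1 * MIB  # hypothetical allocation unit

many_small_files = [10 * KIB] * 1000   # a thousand 10 KiB files
one_huge_file = [10 * GIB]             # a single 10 GiB file

print(f"small files: {storage_efficiency(many_small_files, UNIT):.2%}")
print(f"huge file:   {storage_efficiency(one_huge_file, UNIT):.2%}")
```

Under these assumptions the huge file uses its reserved space fully, while the small files use only about one percent of theirs. Testing candidate storage with your own typical file sizes is more reliable than any such model.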

What Is a Realistic Timeline?

In our experience, a bare-bones setup can be finished within three months. But don’t get too excited: progress inevitably slows down as you go.

After the first year, you will probably end up with a prototype, which is not bad at all! It includes a solution for computing power, data storage, and essentials such as networking, firewalls, web servers, and user access management.

After that, you need another year and a half of active platform usage to fix the bugs and refine the available features.

That makes it almost three years in total.

No doubt a data science platform is a powerful asset for your organization, but it requires a medium-term investment.

What Are The Main Perils?

Considering all of the above, there may, first, simply be too many decisions to make. Second, wrong decisions may lead to delays and the inability to finish your data science, ML, or computational research projects on time.

Last but not least, you’ll need to remember the maintenance cost.

So, Buying? 

If you are convinced by neither the buying nor the building scenario, look for a platform that has two key characteristics:

  • flexibility 
  • a matching niche profile

If the platform is highly customizable and highly specialized – for your needs, indeed! – then you can get the best out of it. 

We built Nuvolos for data scientists and computational researchers. We are a collaborative workspace for those who need to turn their ideas into reality quickly. Nuvolos is perfect for rapid prototyping but is not limited to that. It allows you to keep track of all prototype versions and continue developing the one you want to focus on. 

With us, your team can work on the project from the very start to the mature phase and host it on Nuvolos during the entire lifecycle. Confirmed by our customers!