If you are working in any data science or machine learning field, you’ve definitely heard of Amazon SageMaker. You may be wondering what it can do, and what use cases it is good for – and when it’s not. Starting with a beginner’s guide overview, we discuss the pros and cons of Amazon SageMaker and explain why high costs and burdensome design make it less than ideal for getting your prototyping under way quickly.
Amazon SageMaker: A Beginner’s Guide
When using SageMaker, it becomes clear quickly that it is primarily a visual extension of using Jupyter Notebooks. From a UI perspective, it is essentially a Jupyter Notebook service delivered through the browser, with a managed service interface to access it (the AWS Management Console). Any processes you run in your notebooks happen in AWS instances.
To use Amazon SageMaker, you’ll therefore need an AWS account. On top of that, as is typical for AWS services, you’ll need an IAM role (more on this below) and a subscription according to their pricing plan, at least if you want to avoid the limitations of their free tier.
In some respects, the product is quite flexible. It offers an API, SDK, and CLI for interacting with the service, and it has a Git integration that allows you to clone notebooks into Amazon SageMaker and sync them to a Git repo. The GUI can be extended for code versioning as part of this integration as well. Be warned, though: this Git integration will also be necessary if you ever stop using AWS services. Like most Amazon software, SageMaker cannot be used apart from the wider AWS ecosystem, so be prepared for a substantial risk of lock-in. Whether this is worth it depends entirely on your use case and requirements, as we will discuss further below. First, however, a few more technical requirements need to be explained.
Identity and Access Management (IAM)
To start using Amazon SageMaker, an AWS account is not enough: you also need to create an IAM user. Your AWS user is your root user, and your IAM users – of which you can have several – are subsidiary accounts created for specific purposes, such as running SageMaker. But even an IAM user is not enough. You additionally need to create an IAM role and attach certain policies to it. Policies, a general principle of the AWS ecosystem, describe which users can do what in which services. You will need the right IAM permissions whenever you create a notebook instance, create or manage S3 buckets (see below), create or manage scheduled jobs, and so forth. For larger companies with a team to handle internal IT concerns, such role management makes sense: it restricts access to those employees who need it, and the burden is reduced when using a larger number of AWS services, since a few policies will cover them all. Of course, this again increases the risk of vendor lock-in. For smaller companies like startups, the IAM aspect can be quite a hassle – never mind for solo users.
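To make the role-and-policy mechanics concrete, here is a minimal sketch of the trust policy that allows the SageMaker service to assume an IAM role. The policy shape is the standard AWS trust-policy format; the role name in the comments is a hypothetical example, and the actual boto3 call (which requires credentials) is shown only in comments:

```python
import json

# Trust policy allowing the SageMaker service to assume the role.
# This is the standard AWS trust-policy document shape.
sagemaker_trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "sagemaker.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}

# With AWS credentials configured, one would then create the role
# (not executed here; "my-sagemaker-role" is a hypothetical name):
# iam = boto3.client("iam")
# iam.create_role(
#     RoleName="my-sagemaker-role",
#     AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy),
# )
print(json.dumps(sagemaker_trust_policy, indent=2))
```

Permission policies (e.g. for S3 access) are then attached to this role separately, which is exactly the bookkeeping that grows with every extra AWS service you pull in.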
In terms of data storage, Amazon SageMaker only works with S3 buckets. You cannot connect it to any other input source or output destination. When your model or other job runs successfully for the first time, a new S3 bucket is created automatically in the background. Of course, S3 buckets are reasonably fit for purpose when dealing with non-textual data, and are popular for a reason, but one does have to keep this strict limitation in mind if you want storage flexibility.
If you have textual data and strongly prefer SQL, you can query your S3 storage directly from your Jupyter notebook using the boto3 Python package (for instance with SQL-like expressions via S3 Select). There are no particular limitations for textual data: any existing Python or R package will run normally in SageMaker.
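As a sketch of what such a query looks like, here are the request parameters one would pass to boto3's `select_object_content` to run SQL over a CSV object in S3. The bucket name, object key, and column names are invented for illustration; the actual call requires AWS credentials and is shown only in comments:

```python
# Hypothetical bucket, key, and columns; the request shape matches
# boto3's s3.select_object_content for SQL queries over a CSV object.
select_params = {
    "Bucket": "my-training-data",      # assumed bucket name
    "Key": "datasets/reviews.csv",     # assumed object key
    "ExpressionType": "SQL",
    "Expression": "SELECT s.label, s.text FROM s3object s WHERE s.label = 'positive'",
    "InputSerialization": {"CSV": {"FileHeaderInfo": "USE"}},
    "OutputSerialization": {"CSV": {}},
}

# In a notebook with credentials configured (not executed here):
# s3 = boto3.client("s3")
# response = s3.select_object_content(**select_params)
# for event in response["Payload"]:
#     if "Records" in event:
#         print(event["Records"]["Payload"].decode())
```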
If you work with image data, you can use one of the pre-built algorithms based on the open-source libraries MXNet and TensorFlow, or take advantage of yet another AWS service: Ground Truth. MXNet and Ground Truth also help with processing video files. For audio files, PyTorch scripts need to be adapted to run in SageMaker.
In this way, the AWS ecosystem ‘invites’ you to use yet another piece of Amazon software, the somewhat dramatically named Amazon SageMaker Ground Truth.
Amazon SageMaker Ground Truth
As machine learning experts know, training datasets are often labelled before they are given to the future model to practice its prediction skills on. Data labelling improves training outcomes and pushes the final model quality higher. Labels play the role of the output variables that the model needs to predict: by labelling, you specify the correct output before you train the model, which also lets you evaluate model accuracy later. Data labelling is often a burdensome manual activity, since human knowledge is still the foundation of ground truth when it comes to algorithms.
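To illustrate how labels serve as ground truth for evaluation, here is a minimal sketch of computing accuracy against human-provided labels after training. The labels and predictions are invented toy values:

```python
def accuracy(true_labels, predicted_labels):
    """Fraction of predictions that match the human-provided labels."""
    assert len(true_labels) == len(predicted_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy example: labels produced by human annotators vs. model output.
ground_truth = ["cat", "dog", "cat", "cat"]
predictions  = ["cat", "dog", "dog", "cat"]
print(accuracy(ground_truth, predictions))  # → 0.75
```

The better and more consistent the labels, the more trustworthy this kind of evaluation becomes, which is why labelling quality matters so much.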
Amazon SageMaker Ground Truth is essentially a labelling service: using it with SageMaker, you can use algorithms for programmatic data labelling or let your own workforce perform labelling by hand. More controversially, however, it is connected to Amazon’s notorious Mechanical Turk service, where individuals are paid pennies at a time for the elaborate manual work of labelling and identifying inputs to machine learning models. Mechanical Turk has been the source of many horror stories in the industry, so this is one part of the AWS ecosystem one might want to avoid on ethical grounds if nothing else.
Organizing and Scheduling Notebooks
Now for a brief overview of the meat of the matter: working with notebooks in Amazon SageMaker. In the Notebook section of the GUI, you can create your notebook instances. Inside each instance, using the native Jupyter or JupyterLab GUI, you can create a folder structure in which to store your Jupyter notebooks. At any given time, you can only access the notebooks of the instance you are currently working in. The purpose of the notebooks is to support algorithm prototyping.
Next, you can create a training or processing job in their respective sections of the SageMaker GUI. You can pull in a Jupyter notebook from any of your notebook instances to use as your job. However, training and processing jobs will run on separate instances, distinct from the notebook instances (and from each other). Both jobs can be scheduled using the SageMaker GUI. If you want to create more complex pipelines with particular conditions, however, you need – believe it or not – yet another AWS service: AWS Lambda.
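To give a feel for the Lambda route, here is a sketch of the kind of handler you would deploy for a scheduled pipeline step. The job name is hypothetical, the trigger would be an EventBridge/CloudWatch schedule rule, and the actual boto3 call (which requires credentials) is shown only in comments:

```python
import time

def lambda_handler(event, context):
    """Sketch of a scheduled Lambda that kicks off a SageMaker training job.

    In a real deployment this function would be triggered by an
    EventBridge/CloudWatch schedule rule; the job name below is a
    hypothetical example.
    """
    job_name = f"nightly-training-{int(time.time())}"
    # With credentials and a full job definition in place (not executed here):
    # sm = boto3.client("sagemaker")
    # sm.create_training_job(TrainingJobName=job_name, RoleArn=..., ...)
    return {"started_job": job_name}

print(lambda_handler({}, None))
```

Note that this is yet another deployable artefact to write, test, and maintain, on top of the notebook and job instances themselves.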
Training jobs can also be created automatically using the SageMaker Python package. In this case, you run a Jupyter notebook in the notebook instance where you specify which training job should be created and which AWS instance size you need.
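As an illustration of what such a specification involves, here is the rough shape of a training-job configuration, the kind of request the SageMaker Python SDK builds for you under the hood. All names, ARNs, and paths here are invented placeholders, not working values:

```python
# Hypothetical names and ARNs throughout; the structure mirrors the
# SageMaker CreateTrainingJob request that the Python SDK assembles.
training_job_config = {
    "TrainingJobName": "demo-training-job",
    "RoleArn": "arn:aws:iam::123456789012:role/my-sagemaker-role",  # assumed
    "AlgorithmSpecification": {
        "TrainingImage": "<algorithm-container-image-uri>",  # placeholder
        "TrainingInputMode": "File",
    },
    "ResourceConfig": {  # this is where the AWS instance size is chosen
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 10,
    },
    "InputDataConfig": [],  # would list the S3 channels holding training data
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},  # assumed
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
}
```

Every one of these fields is something you must get right, and the IAM role, the S3 paths, and the instance type each tie back to a separately configured (and billed) AWS resource.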
Now we must talk about model deployment. By this we mean that your model runs on a long-term or permanent basis and its results are consumed by a certain audience.
The end users are not meant to access Amazon SageMaker directly. It does, however, provide endpoints for remote access as well as some rudimentary containerisation options for re-use elsewhere. Of course, in the latter case you will have to host the models yourself on another server with sufficient capacity. In both scenarios, you will need to set up an entry point of some kind for the end users, be it another API, a frontend, or some other integration, something you are also responsible for building yourself.
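For the endpoint route, here is a sketch of the request your own integration layer would make to get predictions out of a deployed model. The endpoint name and payload are invented, and the actual call through boto3's `sagemaker-runtime` client (which requires credentials and a live endpoint) is shown only in comments:

```python
import json

# Hypothetical endpoint name and feature payload; the parameter shape
# matches boto3's sagemaker-runtime invoke_endpoint call.
payload = {"features": [5.1, 3.5, 1.4, 0.2]}
invoke_params = {
    "EndpointName": "my-model-endpoint",   # assumed endpoint name
    "ContentType": "application/json",
    "Body": json.dumps(payload),
}

# With credentials and a deployed endpoint (not executed here):
# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**invoke_params)
# prediction = response["Body"].read()
```

Wrapping this call in an API or frontend that your end users can actually reach is, again, your own responsibility.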
The Limitations of the Amazon SageMaker Ecosystem
Summed up, the Amazon SageMaker ecosystem can be represented by the following diagram:
As you can see, the difficulty is the large number of different (and costly) services and the complexity of the total setup, which requires considerable overhead in time and effort from whoever is responsible for managing the total pipeline. You are also completely dependent on each piece of the AWS ecosystem, so that there is little flexibility in choosing different combinations or approaches. Let’s run through each step and consider the costs:
First, you need to pay for S3 buckets to store training data, testing/validation data, data you want predictions for, and the output of your models. Although using the S3 GUI is free, accessing data through an API incurs additional costs, typically calculated per request.
Second, you need to pay for the computing power used to run your notebooks. The price depends on the type and size of the instance, and you pay for the entire time your notebook instances are left running.
Third, you need to pay for using Amazon SageMaker Ground Truth. The cost depends on multiple factors as Ground Truth includes a few very different services.
Fourth, the AWS Lambda scheduler needs to be paid for.
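The cost components above can be sketched as a back-of-the-envelope estimator. All rates here are invented placeholders, not actual AWS prices; consult the AWS pricing pages for real numbers:

```python
def estimate_monthly_cost(notebook_hours, instance_rate_per_hour,
                          s3_gb, s3_rate_per_gb,
                          api_requests, rate_per_1k_requests,
                          labelling_cost=0.0, lambda_cost=0.0):
    """Rough monthly total across the four cost components above."""
    compute = notebook_hours * instance_rate_per_hour
    storage = s3_gb * s3_rate_per_gb
    requests = (api_requests / 1000) * rate_per_1k_requests
    return compute + storage + requests + labelling_cost + lambda_cost

# Placeholder rates, NOT real AWS prices:
total = estimate_monthly_cost(
    notebook_hours=160, instance_rate_per_hour=0.25,
    s3_gb=50, s3_rate_per_gb=0.023,
    api_requests=100_000, rate_per_1k_requests=0.005,
)
print(round(total, 2))  # → 41.65
```

The point of the sketch is not the total but the number of independent dials: each term is a separate AWS service with its own pricing model to track.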
More important perhaps than the monetary cost, however, is the cost in time and attention needed to set up, configure, manage, and correct all the different services and components of the total pipeline. Especially for smaller companies, institutions that do not have large or sophisticated ops departments, or solo users, this may end up making little sense.
All this is not to say the costs of the service are unreasonable per se: that depends on your specific use case and what you expect to get out of the whole setup. It does show, however, that what at first appears to be a fairly straightforward AWS service actually brings with it a host of hidden complexity and costs, not to mention dependence on the AWS ecosystem and its vagaries in general.
When (Not) To Use Amazon SageMaker
To get the most out of Amazon SageMaker, the prerequisites are that you are comfortable working with other AWS offerings and knowledgeable in APIs, containerisation technology, or general software development. Your team’s skills need to go beyond machine learning and data science. You have to be willing to manually control access as well as compute power, and to set up, configure, and manage a larger chain of interlocking AWS services, each with its own features, requirements, and costs. For the machine learning use cases of larger companies that can manage their IT services accordingly, this might make sense, although one must still keep vendor lock-in into the AWS ecosystem in mind.
For smaller companies, less well-funded institutions, solo users, and essentially anyone who just wants to get going with prototyping or a simple general-purpose pipeline, Amazon SageMaker is probably not the right choice. Its management UI and setup are burdensome and complicated, and there is too much to configure and maintain. Its complexity and feature oversaturation make it slow to set up and slow to get going with. For example, why separate ‘actual’ notebooks from training and data-processing jobs if you only want to do some prototyping?
The big costs are really in time and effort, in other words the engineering or management time involved in the AWS administration and the services themselves (not to mention needing to adapt existing code specifically to use with Amazon SageMaker). For some use cases, the returns on these efforts may outweigh the costs, but we suspect that often they will not. For those users who need a cloud-based machine learning solution that saves them time and effort and gets their work going with a minimum of hassle and overhead, we designed Nuvolos.
Why Nuvolos Gets You More, Faster
The cumbersome user management, complex computing cost calculations, and deep embedding in the AWS universe can make Amazon SageMaker impractical for those working in education and public research, or for anyone who just needs to prototype quickly. Nuvolos, the platform for computational and data science, was designed by and for researchers, and therefore keeps the needs of most scientific use cases in mind.
First and foremost, Nuvolos is not restricted to running Python and R, but offers a wide range of applications directly in the browser. If you can run it on Linux, you can run it in Nuvolos, with all dependencies included. What’s more, Nuvolos comes with integrated data storage and offers integrations with S3 buckets as well as, for example, Dropbox, meaning you are not limited to Amazon-supported solutions.
Secondly, with Nuvolos instances are combinations of code, data, and applications with all the necessary dependencies wrapped up in containers. This means you do not need to do the work of containerisation or set up anything: you can just get going with your project. Of course, for collaboration and reproducibility, the platform offers one-click sharing of all research artefacts directly in the UI.
Needless to say, Nuvolos provides plenty of options for granular user access and resource management, but with an easily usable and understandable UI integrated directly into the platform. Granular control of your cloud computing power works even for running notebooks: no need to stop and restart your processes!
In short, for most use cases, we believe Nuvolos offers a wider range of application and data storage options than Amazon SageMaker, without the ecosystem lock-in. Most importantly, it’s accessible and easy to use, meaning you can get going with your work and focus on the pipeline and the results, not on overhead and configuration.
Sound interesting? Check out our free trial today.