Model deployment is often the domain of IT professionals and cloud infrastructure experts who understand how to securely and reliably host model endpoints that scale with usage demand. Thankfully, Amazon SageMaker is fully managed and handles all the underlying infrastructure, allowing developers and data scientists like you and me, who are not IT experts, to use simple APIs to host secure, low-latency, and highly scalable model endpoints.
In this blog post, I’ll share an end-to-end guide on how to host a MAX optimized model endpoint using MAX Serving and Amazon SageMaker. Here are the steps we’ll follow:
- Download a pre-trained RoBERTa model from HuggingFace
- Upload the model to Amazon S3 so that Amazon SageMaker and the MAX Serving container have access to it.
- Pull the latest MAX Serving container image and push it to Amazon Elastic Container Registry (Amazon ECR)
- Create an Amazon SageMaker model and deploy it to a specified instance type. We’ll use an Amazon EC2 c6i.4xlarge instance, on which MAX Engine can deliver up to 2.6x faster performance than TensorFlow
- Invoke the endpoint to test it
- Clean up AWS resources
If you’re just getting started with MAX, I also recommend reading this getting started blog post on how to optimize models and run inference with MAX, as well as this blog post on evaluating MAX Engine performance and accuracy.
Where can I get this example? All the code in this blog post is available as a runnable Jupyter Notebook on GitHub.
Step 0: Setup
I ran this example on an Amazon SageMaker notebook instance, which you can create from AWS Console > Amazon SageMaker > Notebook > Notebook instances > Create notebook instance. Follow the steps to create a new notebook instance with the default Amazon SageMaker execution role.
After the notebook instance is up and running you’ll get access to a hosted Jupyter notebook client. Choose the conda_tensorflow2_p310 conda environment since we’ll need TensorFlow to save our model in the TensorFlow saved model format.
Note: This is our development instance only. SageMaker will spin up a separate, dedicated instance for model hosting, as we’ll see in Step 4.
If you want to run this entire workflow on another system, such as your laptop or an Amazon EC2 instance, make sure that you have permissions to access resources in your AWS account. The IAM managed policy AmazonSageMakerFullAccess grants all the necessary permissions. See AWS documentation for more details.
Next, download the example from GitHub. You can clone the entire repository or just get the Jupyter Notebook.
Now we’re ready to walk through the steps in the Jupyter Notebook.
Step 1: Download a pre-trained RoBERTa model from HuggingFace
The first step is to download the model we want to serve. Let’s start with some basic imports, create boto3 and SageMaker sessions, and retrieve the role, bucket name, account number, and region, which are all required by SageMaker to manage the deployment for us.
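For reference, the setup boils down to a few calls like the following. This is a minimal sketch, shown as a function because the calls require AWS credentials and the sagemaker SDK; the dictionary keys are just illustrative names.

```python
# A sketch of the session setup; nothing is invoked here since the
# calls need AWS credentials and the sagemaker SDK installed.
def sagemaker_context():
    import boto3
    import sagemaker  # imported lazily so the sketch stays importable offline

    session = sagemaker.Session()
    return {
        "session": session,
        "role": sagemaker.get_execution_role(),   # the notebook's execution role
        "bucket": session.default_bucket(),       # default SageMaker S3 bucket
        "region": session.boto_region_name,
        "account": boto3.client("sts").get_caller_identity()["Account"],
    }
```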
Next, I define a function to download and save the RoBERTa sentence classification model.
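As a rough sketch of that function, assuming the transformers and TensorFlow packages are available: the model ID and save path below are hypothetical stand-ins for the notebook's actual choices, so treat them as placeholders.

```python
# Hedged sketch: download a RoBERTa classifier from HuggingFace and export
# it in the TensorFlow SavedModel format. Not invoked here because it
# needs network access plus the transformers and tensorflow packages.
def download_and_save_model(save_dir="model-repository/roberta/1/model.savedmodel"):
    import tensorflow as tf
    from transformers import TFAutoModelForSequenceClassification

    # Hypothetical model id -- swap in the model the notebook actually uses.
    model = TFAutoModelForSequenceClassification.from_pretrained(
        "cardiffnlp/twitter-roberta-base-sentiment"
    )
    # Export as a SavedModel, the format the MAX Serving container loads.
    tf.saved_model.save(model, save_dir)
```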
The MAX Serving container is based on the NVIDIA Triton Inference Server, and it expects models to reside in the specific layout seen below. See the docs for more info. It also expects a config.pbtxt file that tells the server to use the MAX Engine backend for high-performance inference instead of the default backend.
Output:
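The layout and config.pbtxt can be sketched like this; the config fields (model name and backend value) are illustrative assumptions, so check the MAX Serving docs for the exact settings your model needs:

```python
import os

# Build a minimal Triton-style model repository:
#   model-repository/roberta/config.pbtxt
#   model-repository/roberta/1/model.savedmodel/
model_dir = os.path.join("model-repository", "roberta")
os.makedirs(os.path.join(model_dir, "1", "model.savedmodel"), exist_ok=True)

# Illustrative config.pbtxt -- the backend name is an assumption here.
config = 'name: "roberta"\nbackend: "max"  # run with the MAX Engine backend\n'
with open(os.path.join(model_dir, "config.pbtxt"), "w") as f:
    f.write(config)

print(sorted(os.listdir(model_dir)))  # → ['1', 'config.pbtxt']
```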
Step 2: Upload the model to Amazon S3 so that Amazon SageMaker and the MAX Serving container have access to it
Now that we have the saved model in the format expected by Amazon SageMaker and MAX Serving, we need to compress it into a tar.gz file and upload it to Amazon S3. You can upload it to any bucket that you and SageMaker have access to. In this example I choose the default bucket and capture the path in model_uri.
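A minimal sketch of the compress-and-upload step, assuming the model repository from Step 1 lives in a local model-repository directory; the upload call is shown commented out since it requires AWS credentials, and the key prefix is a hypothetical name.

```python
import os
import tarfile

# "model-repository" is a placeholder; point it at the directory built in Step 1.
src_dir = "model-repository"
os.makedirs(os.path.join(src_dir, "roberta", "1"), exist_ok=True)  # placeholder layout

# Compress the repository into the model.tar.gz archive SageMaker expects.
with tarfile.open("model.tar.gz", "w:gz") as tar:
    tar.add(src_dir, arcname=".")

# Uploading requires AWS credentials, e.g. (hypothetical key prefix):
# model_uri = sagemaker.Session().upload_data("model.tar.gz", key_prefix="max-serving")
print(os.path.exists("model.tar.gz"))  # → True
```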
You can verify that the model is in Amazon S3 on the AWS console:
Step 3: Pull the latest MAX Serving container image and push it to Amazon Elastic Container Registry (Amazon ECR)
Amazon SageMaker expects the container image used to host models to be in a private Amazon Elastic Container Registry (Amazon ECR) repository. Modular provides a pre-built container image, public.ecr.aws/modular/max-serving-de, so we must first pull the image to our system and then push it to our private ECR repository. Note: If your development instance has a different architecture than your deployment instance, be sure to choose the right tag when pulling the MAX Serving container. You can find the tags for all platforms here: https://gallery.ecr.aws/modular/max-serving-de
Here’s what’s happening in each line of code.
Create a new repository called sagemaker-max-serving.
- aws ecr create-repository --repository-name {registry_name}
Pull the MAX Serving container hosted by Modular.
- docker pull {max_serving_image_uri}
Tag the MAX Serving container with the name that matches the repository we created.
- docker tag {max_serving_image_uri} {image}
Log in to Amazon ECR in your region.
- For aws-cli v1.x: $(aws ecr get-login --no-include-email --region {region})
- For aws-cli v2.x: aws ecr get-login-password --region {region} | docker login --username AWS --password-stdin {f'{account}.dkr.ecr.{region}.amazonaws.com'}
Finally, push a copy of the MAX Serving container to your ECR repository.
- docker push {image}
You can head over to Amazon ECR to confirm that your image is now available for Amazon SageMaker.
Step 4: Create an Amazon SageMaker model and deploy to specified instance type
In this example I’ll deploy our RoBERTa model to an Amazon EC2 c6i.4xlarge instance. If you head over to our performance dashboard, you can see that MAX Engine can deliver up to 2.6x faster performance than TensorFlow for this model on this EC2 instance.
In the above code, we first create a Model using Amazon SageMaker SDK and specify attributes including:
- Model path on Amazon S3
- Path to MAX Serving container image in Amazon ECR
- IAM Role
- Model name (optional)
With the model created, you can deploy it to the specified endpoint with a single Amazon SageMaker API call.
Notice that I specify initial_instance_count=1. You can specify a higher number to load-balance a large volume of requests, or head over to AWS Console > Amazon SageMaker > Inference > Endpoint Configuration > the endpoint configuration that was just created and add scaling policies that automatically increase or reduce the number of instances based on traffic.
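Putting the create and deploy calls together, a hedged sketch looks like the following; the endpoint name is a placeholder, and the function is not invoked here since it requires AWS credentials and creates billable resources.

```python
# Sketch of creating a SageMaker Model and deploying it to an endpoint.
def deploy_max_endpoint(image, model_uri, role, endpoint_name="max-serving-roberta"):
    from sagemaker.model import Model  # lazy import: needs the sagemaker SDK

    model = Model(
        image_uri=image,       # MAX Serving image in your private ECR repo
        model_data=model_uri,  # model.tar.gz path on S3 from Step 2
        role=role,             # IAM execution role
    )
    return model.deploy(
        initial_instance_count=1,        # raise to load-balance more traffic
        instance_type="ml.c6i.4xlarge",  # SageMaker's ml.* variant of c6i.4xlarge
        endpoint_name=endpoint_name,
    )
```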
You can also confirm that the endpoint was created and is operational on the AWS Console.
You can also check Amazon CloudWatch for inference request logs.
Step 5: Invoke the endpoint to test it
With our MAX Serving endpoint hosted and operational, let’s send some inference requests to it!
To keep things simple, MAX Serving only runs model inference in this example; I did not set it up to do any pre- or post-processing steps such as tokenization or conversion of IDs back to labels. Instead, I’ll do those steps in the notebook. You can alternatively include the pre- and post-processing steps in the MAX Serving container, and I’ll demonstrate that in an upcoming blog post. Let us know on Discord if you want to see more content on deployment.
I’ll use the boto3 client to invoke the endpoint with our payload from above, get the response, and find the most confident classification.
Output:
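As a sketch of the request and response handling: the payload follows the KServe/Triton v2 JSON format, and the tensor name, token IDs, logits, and label map below are all illustrative assumptions that must match your model's config.pbtxt and tokenizer. The invoke helper is not called here since it needs AWS credentials and a live endpoint.

```python
import json
import math

# Build a KServe/Triton v2 style request body; tensor name, shape, and the
# sample token IDs are illustrative assumptions.
payload = json.dumps({
    "inputs": [{
        "name": "input_ids",
        "shape": [1, 4],
        "datatype": "INT32",
        "data": [0, 31414, 232, 2],  # hypothetical tokenizer output
    }]
})

def invoke(endpoint_name, body):
    import boto3  # requires AWS credentials, so this is not called here
    client = boto3.client("sagemaker-runtime")
    response = client.invoke_endpoint(
        EndpointName=endpoint_name, ContentType="application/json", Body=body
    )
    return json.loads(response["Body"].read())

# Post-processing done in the notebook: softmax the logits and pick the
# most confident label. Logits and label map are stand-ins.
logits = [1.2, 3.4, 0.5]
exps = [math.exp(x) for x in logits]
probs = [e / sum(exps) for e in exps]
labels = ["negative", "neutral", "positive"]
print(labels[probs.index(max(probs))])  # → neutral
```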
Step 6: Clean up AWS resources
On AWS you only pay for what you use, which means when you are done using services, you have to clean up resources. You can run the following commands to delete the endpoint, endpoint config, model, Amazon S3 artifacts, and Amazon ECR repository we created.
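A sketch of those cleanup calls with boto3, wrapped in a function because they require AWS credentials; all the resource names are placeholders for what you created earlier.

```python
# Delete the endpoint, endpoint config, model, S3 artifacts, and ECR repo.
# Not invoked here -- the arguments are placeholders.
def cleanup(endpoint_name, model_name, bucket, prefix, repository_name):
    import boto3  # lazy import so the sketch stays importable offline

    sm = boto3.client("sagemaker")
    sm.delete_endpoint(EndpointName=endpoint_name)
    sm.delete_endpoint_config(EndpointConfigName=endpoint_name)
    sm.delete_model(ModelName=model_name)

    # Remove the model artifacts from S3 and delete the ECR repository.
    boto3.resource("s3").Bucket(bucket).objects.filter(Prefix=prefix).delete()
    boto3.client("ecr").delete_repository(repositoryName=repository_name, force=True)
```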
Conclusion
AWS offers many options to host models for inference, including Amazon EC2, Amazon Elastic Kubernetes Service (EKS) for container orchestration, and fully managed services such as Amazon SageMaker, which we discussed in this blog post. Each approach has benefits depending on how much flexibility and control you want over your deployments.
For most data scientists and developers who are not already IT professionals or MLOps experts, Amazon SageMaker provides a great balance of ease of use, flexibility, and scalability, making it easy to experiment with models and deploy them quickly.
I hope you enjoyed this walkthrough! I sure enjoyed writing it. The Jupyter notebook with all the code in this blog post is available on GitHub. Check it out and share your feedback on Discord!
Until next time! 🔥
Additional resources:
- Get started with downloading MAX
- Download and run MAX examples on GitHub
- Head over to the MAX docs to learn more about MAX Engine APIs and the Mojo programming manual
- Join our Discord community
- Contribute to discussions on the Mojo and MAX GitHub
- Report feedback, including issues, on our Mojo and MAX GitHub tracker