Amazon SageMaker multi-model endpoints (MMEs) provide a scalable and cost-effective way to deploy a large number of machine learning (ML) models. It gives you the ability to deploy multiple ML models in a single serving container behind a single endpoint. From there, SageMaker manages loading and unloading the models and scaling resources on your behalf based on your traffic patterns. You will benefit from sharing and reusing hosting resources and a reduced operational burden of managing a large quantity of models.
In November 2022, MMEs added support for GPUs, which allows you to run multiple models on a single GPU device and scale GPU instances behind a single endpoint. This satisfies the strong MME demand for deep neural network (DNN) models that benefit from accelerated compute with GPUs. These include computer vision (CV), natural language processing (NLP), and generative AI models. The reasons for the demand include the following: