Amazon SageMaker Serverless Inference is a purpose-built inference option that makes it easy for you to deploy and scale machine learning (ML) models. It provides a pay-per-use model, which is ideal for services where endpoint invocations are infrequent and unpredictable. Unlike a real-time hosting endpoint, which is backed by a long-running instance, compute resources for serverless endpoints are provisioned on demand, thereby eliminating the need to choose instance types or manage scaling policies.
The following high-level architecture illustrates how a serverless endpoint works. A client invokes an endpoint, which is backed by AWS managed infrastructure.
However, serverless endpoints are prone to cold starts in the order of seconds, and is therefore more suitable for intermittent or unpredictable workloads.
To help determine whether a serverless endpoint is the right deployment option from a cost and performance perspective, we have developed the SageMaker Serverless Inference Benchmarking Toolkit, which tests different endpoint