Machine learning (ML) models, such as large language models (LLMs) and foundation models (FMs), are growing fast in size year over year, and they need faster and more powerful accelerators, especially for generative AI. AWS Inferentia2 was designed from the ground up to deliver higher performance while lowering the cost of LLM and generative AI inference.
In this post, we show how the second generation of AWS Inferentia builds on the capabilities introduced with AWS Inferentia1 and meets the unique demands of deploying and running LLMs and FMs.
The first generation of AWS Inferentia, a purpose-built accelerator launched in 2019, is optimized to accelerate deep learning inference. AWS Inferentia helped ML users reduce their inference costs and improve their prediction throughput and latency. With AWS Inferentia1, customers saw up to 2.3x higher throughput and up to 70% lower cost per inference than comparable inference-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances.
AWS Inferentia2,
