Recent developments in deep learning have led to increasingly large models such as GPT-3, BLOOM, and OPT, some of which already exceed 100 billion parameters. Although larger models tend to be more powerful, training them requires significant computational resources. Even with advanced distributed training libraries like FSDP and DeepSpeed, it’s common for training jobs to require hundreds of accelerator devices for several weeks or months at a time.
In late 2022, AWS announced the general availability of Amazon EC2 Trn1 instances powered by AWS Trainium, a purpose-built machine learning (ML) accelerator optimized to provide a high-performance, cost-effective, and massively scalable platform for training deep learning models in the cloud. Trn1 instances are available in several sizes (see the following table), with up to 16 Trainium accelerators per instance.
Instance Size
Trainium Accelerators
Accelerator Memory (GB)