Modern model pre-training often calls for larger cluster deployments to reduce time and cost. At the server level, such training workloads demand faster compute and more memory. As models grow to hundreds of billions of parameters, they require a distributed training mechanism that spans multiple nodes (instances).
In October 2022, we launched Amazon EC2 Trn1 Instances, powered by AWS Trainium, the second-generation machine learning accelerator designed by AWS. Trn1 instances are purpose-built for high-performance deep learning model training while offering up to 50% cost-to-train savings over comparable GPU-based instances. To bring training time down from weeks to days, or from days to hours, and to distribute a large model’s training job, we can use an EC2 Trn1 UltraCluster, which consists of densely packed, co-located racks of Trn1 compute instances, all interconnected by non-blocking petabyte-scale networking. It is our largest UltraCluster to date, offering 6