Modern model pre-training often calls for larger cluster deployment to reduce time and cost. At the server level, such training workloads demand faster compute and increased memory allocation. As models grow to hundreds of billions of parameters, they require a distributed training mechanism that spans multiple nodes (instances).
In October 2022, we launched Amazon EC2 Trn1 Instances, powered by AWS Trainium, which is the second generation machine learning accelerator designed by AWS. Trn1 instances are purpose built for high-performance deep learning model training while offering up to 50% cost-to-train savings over comparable GPU-based instances. In order to bring down training time from weeks to days, or days to hours, and distribute a large model’s training job, we can use an EC2 Trn1 UltraCluster, which consists of densely packed, co-located racks of Trn1 compute instances all interconnected by non-blocking petabyte scale networking. It is our largest UltraCluster to date, offering 6

Continue reading



At FusionWeb, we aim to look at the future through the lenses of imagination, creativity, expertise and simplicity in the most cost effective ways. All we want to make something that brings smile to our clients face. Let’s try us to believe us.