Starting today, you can use Amazon SageMaker Autopilot to tackle regression and classification tasks on large datasets of up to 100 GB. You can also now provide your datasets as either CSV or Apache Parquet files.
Businesses are generating more data than ever, and demand is growing accordingly for insights from these large datasets to shape business decisions. However, successfully training state-of-the-art machine learning (ML) algorithms on datasets of this size can be challenging. Autopilot automates this process and provides a seamless experience for running automated machine learning (AutoML) on datasets of up to 100 GB.
Autopilot automatically subsamples large datasets to fit within the maximum supported size, preserving the rare class when the classes are imbalanced. Class imbalance is an important problem to be aware of in ML, especially when dealing with large datasets. Consider a fraud detection dataset where only a small fraction of transactions is expected