In this post, we show how to split a machine learning (ML) dataset into training, validation, and test datasets with Amazon SageMaker Data Wrangler, using minimal to no code.
Data used for ML is typically split into the following datasets:
Training – Used to train an algorithm or ML model. The model iteratively uses the data and learns to provide the desired result.
Validation – Holds out data that the model does not train on. You can use a validation set to periodically measure model performance during training and to tune the model's hyperparameters. Validation datasets are optional.
Test – Used on the final trained model to assess its performance on unseen data. This helps determine how well the model generalizes.
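To make the three roles concrete, here is a minimal sketch of a randomized split in plain Python. It only illustrates the concept and is not how Data Wrangler performs the split internally. The function name `split_dataset`, the 80/10/10 ratios, and the fixed seed are assumptions chosen for the example.

```python
import random


def split_dataset(rows, train_frac=0.8, val_frac=0.1, seed=42):
    """Shuffle rows and split them into train, validation, and test lists.

    Hypothetical helper for illustration; the fractions are assumptions.
    Whatever remains after the train and validation slices becomes the test set.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")

    shuffled = list(rows)
    # A seeded generator keeps the split reproducible across runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


# Example: 100 records split 80/10/10.
train, validation, test = split_dataset(range(100))
print(len(train), len(validation), len(test))
```

Because the validation dataset is optional, passing `val_frac=0` in this sketch yields an empty validation list and leaves the remainder for testing.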
Data Wrangler is a capability of Amazon SageMaker that helps data scientists and data