We'll start by preparing our data: ingesting it from its source and splitting it into training, validation and test splits.
Our data could reside in many different places (databases, files, etc.) and exist in different formats (CSV, JSON, Parquet, etc.). For our application, we'll load the data from a CSV file to a Pandas DataFrame using the read_csv function.
Here is a quick refresher on the Pandas library.
    import pandas as pd
    # Data ingestion
    DATASET_LOC = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/dataset.csv"
    df = pd.read_csv(DATASET_LOC)
    df.head()
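If the data lived in one of the other formats mentioned above, the same pattern would apply with a different Pandas reader. As a minimal sketch (the `load_dataset` helper and its `READERS` mapping are hypothetical names used for illustration, not part of this lesson's code), we could pick the reader from the file extension:

```python
from pathlib import Path

import pandas as pd

# Hypothetical mapping from file extension to a Pandas reader function.
READERS = {
    ".csv": pd.read_csv,
    ".json": pd.read_json,
    ".parquet": pd.read_parquet,  # requires a Parquet engine such as pyarrow
}


def load_dataset(path):
    """Load a tabular file into a DataFrame, choosing the reader by extension."""
    suffix = Path(str(path)).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"Unsupported file format: {suffix!r}")
    return READERS[suffix](path)
```

Calling `load_dataset(DATASET_LOC)` should behave like the `pd.read_csv` call above, since the location ends in `.csv`.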
In our data engineering lesson, we'll look at how to continually ingest data from more complex sources (e.g. data warehouses).
Next, we need to split our training dataset into train and val data splits. Together with the test split, each split plays a different role:
Train split: here the model has access to both inputs (features) and outputs (labels) to optimize its internal weights.
Validation split: here the model does not use the labels to optimize its weights. Instead, we use the validation performance to tune training hyperparameters such as the learning rate.
Test split: this is our best measure of how the model may behave on new, unseen data that comes from a distribution similar to our training dataset.
Tip
For our application, we will have a training dataset to split into train and val splits and a separate testing dataset for the test set. While we could have one large dataset and split it three ways, it's a good idea to keep a separate test dataset. Over time, our training data may grow, and a test split carved out of it would look different every time. That would make it difficult to compare models against one another.
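For comparison, if a separate test dataset were not available, the three splits could be carved out of a single dataset with two successive calls to `train_test_split`. The sketch below uses made-up toy data and illustrative split sizes (the `toy_` names and the 60/20/20 proportions are assumptions, not this lesson's setup):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only (not the lesson's dataset).
toy_df = pd.DataFrame({
    "text": [f"project {i}" for i in range(100)],
    "tag": ["nlp"] * 50 + ["cv"] * 30 + ["mlops"] * 20,
})

# First hold out 20% of the rows as the test split.
toy_train_val, toy_test = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df.tag, random_state=1234
)

# Then take 25% of the remainder (20% of the total) as the validation split.
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.25, stratify=toy_train_val.tag, random_state=1234
)

print(len(toy_train), len(toy_val), len(toy_test))
```

Fixing `random_state` keeps the splits reproducible for a given dataset, but the rows in each split would still change as soon as the dataset itself grows, which is the drawback described in the tip.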
We can view the class counts in our dataset by calling the pandas.Series.value_counts function on the tag column:
    from sklearn.model_selection import train_test_split
    # Value counts
    df.tag.value_counts()
For our multi-class task (where each project has exactly one tag), we want to ensure that the data splits have similar class distributions. We can achieve this by specifying how to stratify the split by using the stratify keyword argument with sklearn's train_test_split() function.
Creating proper data splits
What are the criteria we should focus on to ensure proper data splits?
    # Split dataset
    test_size = 0.2
    train_df, val_df = train_test_split(df, stratify=df.tag, test_size=test_size, random_state=1234)
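To see what the `stratify` argument changes, here is a small sketch on made-up, imbalanced toy labels (the data and the `plain_val` / `strat_val` names are assumptions for illustration). It draws one validation split without stratification and one with it, then prints the class proportions of each:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy, imbalanced labels for illustration only: 80% nlp, 15% cv, 5% mlops.
toy_df = pd.DataFrame({"tag": ["nlp"] * 80 + ["cv"] * 15 + ["mlops"] * 5})

# Plain random split: class proportions in the validation split can drift.
_, plain_val = train_test_split(toy_df, test_size=0.2, random_state=0)

# Stratified split: class proportions follow the full dataset.
_, strat_val = train_test_split(
    toy_df, test_size=0.2, stratify=toy_df.tag, random_state=0
)

print(plain_val.tag.value_counts(normalize=True))
print(strat_val.tag.value_counts(normalize=True))
```

With stratification, the 20-row validation split keeps the 80/15/5 proportions of the toy data, whereas the plain split offers no such guarantee and may leave a rare class under-represented or missing.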
How can we validate that our data splits have similar class distributions? We can view the frequency of each class in each split:
    # Train value counts
    train_df.tag.value_counts()
Before we view our validation split's class counts, recall that our validation split is only test_size of the entire dataset. So we need to adjust the value counts so that we can compare it to the training split's class counts.
    # Validation (adjusted) value counts
    # Scale by the train-to-validation size ratio (4 when test_size = 0.2)
    val_df.tag.value_counts() * int((1-test_size) / test_size)
These adjusted counts look very similar to our train split's counts. Now we're ready to explore our dataset!
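Scaling the validation counts by an integer factor works here because the train-to-validation ratio is a whole number for `test_size = 0.2`. For other split sizes, one alternative is to compare normalized class frequencies instead. The helper below is a hypothetical sketch (the name `compare_class_distributions` is not from this lesson) that puts both distributions side by side:

```python
import pandas as pd


def compare_class_distributions(train_tags, val_tags):
    """Return per-class proportions for both splits and their absolute gap."""
    table = pd.DataFrame({
        "train": train_tags.value_counts(normalize=True),
        "val": val_tags.value_counts(normalize=True),
    }).fillna(0.0)  # a class missing from one split gets a proportion of 0
    table["abs_diff"] = (table["train"] - table["val"]).abs()
    return table
```

For the splits above, this would be called as `compare_class_distributions(train_df.tag, val_df.tag)`. Small values in the `abs_diff` column suggest that the two splits have similar class distributions.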
To cite this content, please use:
    @article{madewithml,
        author       = {Goku Mohandas},
        title        = {Preparation - Made With ML},
        howpublished = {\url{https://madewithml.com/}},
        year         = {2023}
    }