Data preprocessing can be categorized into two types of processes: preparation and transformation. We'll explore common preprocessing techniques and then we'll preprocess our dataset.
Warning
Certain preprocessing steps are global (don't depend on our dataset, ex. lower casing text, removing stop words, etc.) and others are local (constructs are learned only from the training split, ex. vocabulary, standardization, etc.). For the local, dataset-dependent preprocessing steps, we want to ensure that we split the data first before preprocessing to avoid data leaks.
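To make the split-first rule concrete, here is a minimal sketch of a local preprocessing step (standardization) where the statistics are learned from the training split only and then reused on the test split. The toy data, the 80/20 split and the variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy feature matrix (assumed data)

# Split first ...
train_X, test_X = X[:80], X[80:]

# ... then learn the "local" constructs from the training split only
mean = train_X.mean(axis=0)
std = train_X.std(axis=0)

# Apply the same training statistics to every split
train_X_scaled = (train_X - mean) / std
test_X_scaled = (test_X - mean) / std
```

The test split will not have exactly mean 0 and std 1 after scaling, which is expected since none of its values were used to compute the statistics.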
Preparing the data involves organizing and cleaning the data.
Performing SQL joins with existing data tables to organize all the relevant data you need into one view. This makes working with our dataset a whole lot easier.
```sql
SELECT * FROM A
INNER JOIN B ON A.id = B.id
```
Warning
We need to be careful to perform point-in-time valid joins to avoid data leaks. For example, Table B may have features for objects in Table A that were not available at the time inference would have been needed.
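One way to sketch a point-in-time valid join in pandas is `pandas.merge_asof`, which matches each row with the most recent row in the other table at or before its timestamp. The tables, column names (`request_time`, `feature_time`, `feature`) and values below are hypothetical.

```python
import pandas as pd

# Hypothetical tables: A holds prediction requests, B holds feature snapshots
A = pd.DataFrame({
    "id": [1, 1, 2],
    "request_time": pd.to_datetime(["2023-01-05", "2023-01-20", "2023-01-10"]),
})
B = pd.DataFrame({
    "id": [1, 1, 2],
    "feature_time": pd.to_datetime(["2023-01-01", "2023-01-15", "2023-01-12"]),
    "feature": [0.1, 0.2, 0.3],
})

# merge_asof requires both frames to be sorted by their time keys
A = A.sort_values("request_time")
B = B.sort_values("feature_time")

# For each request, take the latest feature value recorded at or before the request time
joined = pd.merge_asof(
    A, B,
    left_on="request_time", right_on="feature_time",
    by="id", direction="backward",
)
print(joined)
```

The request for `id=2` gets a missing feature value because its only snapshot was recorded after the request, which is exactly the leak a plain inner join on `id` would have introduced.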
First, we'll have to identify the rows with missing values and once we do, there are several approaches to dealing with them.
omit samples with missing values (if only a small subset are missing them)
```python
# Drop a row (sample) by index
df.drop([4, 10, ...])
# Conditionally drop rows (samples)
df = df[df.value > 0]
# Drop samples with any missing feature (keep only the complete rows)
df = df[~df.isnull().any(axis=1)]
```
omit the entire feature (if too many samples are missing the value)
```python
# Drop a column (feature)
df.drop(["A"], axis=1)
```
fill in missing values for features (using domain knowledge, heuristics, etc.)
```python
# Fill in missing values with mean
df.A = df.A.fillna(df.A.mean())
```
missing values may not always seem "missing" (ex. 0, null, NA, etc.)
```python
# Replace zeros with NaNs
import numpy as np
df.A = df.A.replace({"0": np.nan, 0: np.nan})
```
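Before choosing among the approaches above, it helps to quantify how much is actually missing. The following is a small sketch on a made-up dataframe; the column names and values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy dataframe with missing values in both columns (assumed data)
df = pd.DataFrame({"A": [1.0, np.nan, 3.0, np.nan], "B": ["x", "y", None, "z"]})

missing_per_feature = df.isnull().sum()          # count of missing values per column
missing_fraction = df.isnull().mean()            # fraction of missing values per column
rows_with_missing = df[df.isnull().any(axis=1)]  # samples with at least one missing value

print(missing_per_feature)
print(rows_with_missing)
```

A feature with a high missing fraction is a candidate for being dropped entirely, while a handful of incomplete rows can usually be omitted or filled in.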
```python
# Ex. Feature value must be within 2 standard deviations
df[np.abs(df.A - df.A.mean()) <= (2 * df.A.std())]
```
Feature engineering involves combining features in unique ways to draw out signal.
```python
# Input (use bracket notation: attribute assignment does not create a new column)
df["C"] = df.A + df.B
```
Tip
Feature engineering can be done in collaboration with domain experts that can guide us on what features to engineer and use.
Cleaning our data involves applying constraints to make it easier for our models to extract signal from the data.
```python
# Resize
import cv2
dims = (width, height)  # cv2.resize expects dsize as (width, height)
resized_img = cv2.resize(src=img, dsize=dims, interpolation=cv2.INTER_LINEAR)
```
```python
# Lower case the text
text = text.lower()
```
Transforming the data involves feature encoding and engineering.
don't blindly scale features (ex. categorical features)
standardization: rescale values to mean 0, std 1
```python
# Standardization
import numpy as np
x = np.random.random(4)  # values between 0 and 1
print("x:\n", x)
print(f"mean: {np.mean(x):.2f}, std: {np.std(x):.2f}")
x_standardized = (x - np.mean(x)) / np.std(x)
print("x_standardized:\n", x_standardized)
print(f"mean: {np.mean(x_standardized):.2f}, std: {np.std(x_standardized):.2f}")
```
min-max: rescale values between a min and max
```python
# Min-max
import numpy as np
x = np.random.random(4)  # values between 0 and 1
print("x:", x)
print(f"min: {x.min():.2f}, max: {x.max():.2f}")
x_scaled = (x - x.min()) / (x.max() - x.min())
print("x_scaled:", x_scaled)
print(f"min: {x_scaled.min():.2f}, max: {x_scaled.max():.2f}")
```
binning: convert a continuous feature into categorical using bins
```python
# Binning
import numpy as np
x = np.random.random(4)  # values between 0 and 1
print("x:", x)
bins = np.linspace(0, 1, 5)  # bins between 0 and 1
print("bins:", bins)
binned = np.digitize(x, bins)
print("binned:", binned)
```
and many more!
allows for representing data efficiently (maintains signal) and effectively (learns patterns, ex. one-hot vs embeddings)
label: unique index for categorical value
```python
# Label encoding
label_encoder.class_to_index = {
    "attention": 0,
    "autoencoders": 1,
    "convolutional-neural-networks": 2,
    "data-augmentation": 3,
    ... }
label_encoder.transform(["attention", "data-augmentation"])
```
one-hot: representation as binary vector
```python
# One-hot encoding
one_hot_encoder.transform(["attention", "data-augmentation"])
```
embeddings: dense representations capable of representing context
```python
# Embeddings
self.embeddings = nn.Embedding(
    embedding_dim=embedding_dim, num_embeddings=vocab_size)
x_in = self.embeddings(x_in)
print(x_in.shape)
```
and many more!
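Since the `label_encoder` and `one_hot_encoder` objects in the snippets above are not defined here, the following self-contained sketch shows what the two encodings produce for the same inputs. The helper functions `label_encode` and `one_hot_encode` are hypothetical stand-ins written with NumPy, not the encoders used in the lesson.

```python
import numpy as np

tags = ["attention", "autoencoders", "convolutional-neural-networks", "data-augmentation"]
class_to_index = {tag: i for i, tag in enumerate(tags)}

def label_encode(labels):
    """Map each label to its unique integer index."""
    return np.array([class_to_index[label] for label in labels])

def one_hot_encode(labels):
    """Map each label to a binary vector with a single 1 at its index."""
    indices = label_encode(labels)
    one_hot = np.zeros((len(indices), len(class_to_index)), dtype=int)
    one_hot[np.arange(len(indices)), indices] = 1
    return one_hot

labels = ["attention", "data-augmentation"]
encoded = label_encode(labels)
one_hot = one_hot_encode(labels)
print(encoded)  # [0 3]
print(one_hot)  # [[1 0 0 0], [0 0 0 1]]
```

Label encoding is compact but implies an ordering between classes, while one-hot encoding avoids that at the cost of one dimension per class.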
autoencoders: learn to encode inputs for compressed knowledge representation
principal component analysis (PCA): linear dimensionality reduction to project data into a lower-dimensional space.
```python
# PCA
import numpy as np
from sklearn.decomposition import PCA
X = np.array([[-1, -1, 3], [-2, -1, 2], [-3, -2, 1]])
pca = PCA(n_components=2)
pca.fit(X)
print(pca.transform(X))
print(pca.explained_variance_ratio_)
print(pca.singular_values_)
```
counts (ngram): sparse representation of text as a matrix of token counts; useful if feature values have lots of meaningful, separable signal.
```python
# Counts (ngram)
from sklearn.feature_extraction.text import CountVectorizer
y = [
    "acetyl acetone",
    "acetyl chloride",
    "chloride hydroxide",
]
vectorizer = CountVectorizer()
y = vectorizer.fit_transform(y)
print(vectorizer.get_feature_names_out())  # get_feature_names() was removed in scikit-learn 1.2
print(y.toarray())
# 💡 Repeat above with char-level ngram vectorizer
# vectorizer = CountVectorizer(analyzer='char', ngram_range=(1, 3)) # uni, bi and trigrams
```
similarity: similar to count vectorization but based on similarities in tokens
We'll often want to retrieve feature values for an entity (user, item, etc.) over time and reuse the same features across different projects. To ensure that we're retrieving the proper feature values and to avoid duplication of effort, we can use a feature store.
Curse of dimensionality
What can we do if a feature has lots of unique values but not enough data points for each unique value (ex. URL as a feature)?
We can encode our data with hashing or by using its attributes instead of the exact entity itself. For example, representing a user by their location and favorites as opposed to using their user ID, or representing a webpage with its domain as opposed to the exact URL. These methods effectively decrease the total number of unique feature values and increase the number of data points for each.
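Both ideas from the answer can be sketched with the standard library alone. The helper names (`hash_bucket`, `domain_of`), the bucket count and the example URLs are assumptions for illustration; `hashlib` is used instead of Python's built-in `hash` because the latter is salted per process for strings and so is not stable across runs.

```python
import hashlib
from urllib.parse import urlparse

NUM_BUCKETS = 1000  # assumed number of hash buckets

def hash_bucket(value, num_buckets=NUM_BUCKETS):
    """Map an arbitrary string to one of num_buckets stable integer buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def domain_of(url):
    """Represent a webpage by its domain instead of the exact URL."""
    return urlparse(url).netloc

urls = ["https://example.com/a", "https://example.com/b"]
buckets = [hash_bucket(url) for url in urls]
domains = [domain_of(url) for url in urls]
print(buckets)
print(domains)  # ['example.com', 'example.com']
```

Hashing caps the number of unique values at `NUM_BUCKETS` (at the cost of collisions), while the domain attribute groups many distinct URLs under one value.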
For our application, we'll be implementing a few of these preprocessing steps that are relevant for our dataset.
```python
import json
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
import re
```
We can combine existing input features to create new meaningful signal for helping the model learn. However, there's usually no simple way to know if certain feature combinations will help or not without empirically experimenting with the different combinations. Here, we could use a project's title and description separately as features but we'll combine them to create one input feature.
```python
# Input
df["text"] = df.title + " " + df.description
```
Since we're dealing with text data, we can apply some common text preprocessing operations. Here, we'll be using Python's built-in regular expressions library re and the Natural Language Toolkit nltk.
```python
nltk.download("stopwords")
STOPWORDS = stopwords.words("english")
```
```python
def clean_text(text, stopwords=STOPWORDS):
    """Clean raw text string."""
    # Lower
    text = text.lower()

    # Remove links (done first, while URLs are still intact)
    text = re.sub(r"http\S+", "", text)

    # Remove stopwords
    pattern = re.compile(r"\b(" + r"|".join(stopwords) + r")\b\s*")
    text = pattern.sub("", text)

    # Spacing and filters
    text = re.sub(r"([!\"'#$%&()*\+,-./:;<=>?@\\\[\]^_`{|}~])", r" \1 ", text)  # add spacing
    text = re.sub("[^A-Za-z0-9]+", " ", text)  # remove non alphanumeric chars
    text = re.sub(" +", " ", text)  # remove multiple spaces
    text = text.strip()  # strip white space at the ends
    return text
```
Note
We could definitely try to include emojis, punctuation, etc. because they do carry a lot of signal for the task, but it's best to simplify the initial feature set to just what we think is most influential and then slowly introduce other features and assess their utility.
Once we've defined our function, we can apply it to each row of our dataframe's text column via pandas.Series.apply.
```python
# Apply to dataframe
original_df = df.copy()
df.text = df.text.apply(clean_text)
print(f"{original_df.text.values[0]}\n{df.text.values[0]}")
```
Warning
We'll want to introduce less frequent features as they become more frequent or encode them in a clever way (ex. binning, extract general attributes, common n-grams, mean encoding using other feature values, etc.) so that we can mitigate the feature value dimensionality issue until we're able to collect more data.
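One simple way to sketch this idea is to group infrequent feature values under a shared placeholder until enough data has been collected for them. The toy values, the `MIN_COUNT` threshold and the `"other"` placeholder are assumptions.

```python
import pandas as pd

# Toy categorical feature with a long tail of rare values (assumed data)
tags = pd.Series(["nlp", "nlp", "nlp", "cv", "cv", "graphs", "audio"])
MIN_COUNT = 2  # assumed minimum number of samples for a value to keep its own category

counts = tags.value_counts()
frequent = counts[counts >= MIN_COUNT].index

# Keep frequent values as-is and collapse the rest into a single bucket
tags_grouped = tags.where(tags.isin(frequent), other="other")
print(tags_grouped.tolist())
```

As a rare value accumulates more samples and crosses the threshold, it automatically gets its own category the next time this step is run.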
We'll wrap up our cleaning operation by removing columns (pandas.DataFrame.drop) and rows with null tag values (pandas.DataFrame.dropna).
```python
# DataFrame cleanup
df = df.drop(columns=["id", "created_on", "title", "description"], errors="ignore")  # drop cols
df = df.dropna(subset=["tag"])  # drop nulls
df = df[["text", "tag"]]  # rearrange cols
df.head()
```
We need to encode our data into numerical values so that our models can process them. We'll start by encoding our text labels into unique indices.
```python
# Label to index
tags = train_df.tag.unique().tolist()
num_classes = len(tags)
class_to_index = {tag: i for i, tag in enumerate(tags)}
class_to_index
```
Next, we can use the pandas.Series.map function to map our class_to_index dictionary on our tag column to encode our labels.
```python
# Encode labels
df["tag"] = df["tag"].map(class_to_index)
df.head()
```
We'll also want to be able to decode our predictions back into text labels. We can do this by creating an index_to_class dictionary and using that to convert encoded labels back into text labels.
```python
def decode(indices, index_to_class):
    return [index_to_class[index] for index in indices]
```
```python
index_to_class = {v: k for k, v in class_to_index.items()}
decode(df.head()["tag"].values, index_to_class=index_to_class)
```
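The encode and decode steps can be checked end to end on a small, made-up series of tags. The sketch below also shows a behavior of `pandas.Series.map` worth keeping in mind: labels that are not keys in the dictionary (for example, a tag that never appeared in the training split) become NaN rather than raising an error.

```python
import pandas as pd

# Toy training labels (assumed data)
train_tags = pd.Series(["nlp", "cv", "nlp", "mlops"])
class_to_index = {tag: i for i, tag in enumerate(train_tags.unique().tolist())}
index_to_class = {i: tag for tag, i in class_to_index.items()}

# Round trip: text labels -> indices -> text labels
encoded = train_tags.map(class_to_index)
decoded = [index_to_class[i] for i in encoded]

# Labels absent from the dictionary map to NaN, which is easy to miss
unseen = pd.Series(["nlp", "audio"]).map(class_to_index)
print(encoded.tolist())  # [0, 1, 0, 2]
print(unseen.isnull().tolist())  # [False, True]
```

Checking for nulls after mapping is a cheap guard against silently training or evaluating on unlabeled rows.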
Next we'll encode our text as well. Instead of using a random dictionary, we'll use a tokenizer that was used for a pretrained LLM (scibert) to tokenize our text. We'll be fine-tuning this exact model later when we train our model.
Here is a quick refresher on attention and Transformers.
```python
import numpy as np
from transformers import BertTokenizer
```
The tokenizer will convert our input text into a list of token ids and a list of attention masks. The token ids are the indices of the tokens in the vocabulary. The attention mask is a binary mask indicating the position of the token indices so that the model can attend to them (and ignore the pad tokens).
```python
# Bert tokenizer
tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased", return_dict=False)
text = "Transfer learning with transformers for text classification."
encoded_inputs = tokenizer([text], return_tensors="np", padding="longest")  # pad to longest item in batch
print("input_ids:", encoded_inputs["input_ids"])
print("attention_mask:", encoded_inputs["attention_mask"])
print(tokenizer.decode(encoded_inputs["input_ids"][0]))
```
Note that we use padding="longest" in our tokenizer function to pad our inputs to the longest item in the batch. This becomes important when we use batches of inputs later and want to create a uniform input size, where shorter text sequences will be padded with zeros to meet the length of the longest input in the batch.
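To see what padding to the longest item does without downloading a tokenizer, here is a minimal NumPy sketch of the same idea. The `pad_to_longest` helper, the token ids in `batch` and the pad id of 0 are assumptions for illustration; the real tokenizer handles all of this internally.

```python
import numpy as np

def pad_to_longest(sequences, pad_id=0):
    """Pad variable-length id sequences to the longest one and build the attention mask."""
    max_len = max(len(seq) for seq in sequences)
    input_ids = np.full((len(sequences), max_len), pad_id, dtype=int)
    attention_mask = np.zeros((len(sequences), max_len), dtype=int)
    for i, seq in enumerate(sequences):
        input_ids[i, :len(seq)] = seq       # real tokens
        attention_mask[i, :len(seq)] = 1    # 1 = attend, 0 = ignore (padding)
    return input_ids, attention_mask

batch = [[102, 2268, 103], [102, 2268, 1904, 30111, 103]]  # made-up token ids
input_ids, attention_mask = pad_to_longest(batch)
print(input_ids)
print(attention_mask)
```

The shorter sequence is padded with two zeros, and the matching zeros in the attention mask tell the model to ignore those positions.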
We'll wrap our tokenization into a tokenize function that we can use to tokenize batches of our data.
```python
def tokenize(batch):
    tokenizer = BertTokenizer.from_pretrained("allenai/scibert_scivocab_uncased", return_dict=False)
    encoded_inputs = tokenizer(batch["text"].tolist(), return_tensors="np", padding="longest")
    return dict(ids=encoded_inputs["input_ids"], masks=encoded_inputs["attention_mask"], targets=np.array(batch["tag"]))
```
```python
# Tokenization
tokenize(df.head(1))
```
We'll wrap up by combining all of our preprocessing operations into a single function. This way we can easily apply it to different datasets (training, inference, etc.).
```python
def preprocess(df, class_to_index):
    """Preprocess the data."""
    df["text"] = df.title + " " + df.description  # feature engineering
    df["text"] = df.text.apply(clean_text)  # clean text
    df = df.drop(columns=["id", "created_on", "title", "description"], errors="ignore")  # clean dataframe
    df = df[["text", "tag"]]  # rearrange columns
    df["tag"] = df["tag"].map(class_to_index)  # label encoding
    outputs = tokenize(df)
    return outputs
```
```python
# Apply
preprocess(df=train_df, class_to_index=class_to_index)
```
To cite this content, please use:
```bibtex
@article{madewithml,
    author       = {Goku Mohandas},
    title        = { Preprocessing - Made With ML },
    howpublished = {\url{https://madewithml.com/}},
    year         = {2023}
}
```