Subscribe to our newsletter
📬 Receive new lessons straight to your inbox (once a month) and join 40K+ developers in learning how to responsibly deliver value with ML.
Our goal is to learn a model \(\hat{y}\) that models \(y\) given \(X\) . You'll notice that neural networks are just extensions of the generalized linear methods we've seen so far but with non-linear activation functions since our data will be highly non-linear.
| \(N\) | total numbers of samples |
| \(D\) | number of features |
| \(H\) | number of hidden units |
| \(C\) | number of classes |
| \(W_1\) | 1st layer weights \(\in \mathbb{R}^{DXH}\) |
| \(z_1\) | outputs from first layer \(\in \mathbb{R}^{NXH}\) |
| \(f\) | non-linear activation function |
| \(a_1\) | activations from first layer \(\in \mathbb{R}^{NXH}\) |
| \(W_2\) | 2nd layer weights \(\in \mathbb{R}^{HXC}\) |
| \(z_2\) | outputs from second layer \(\in \mathbb{R}^{NXC}\) |
| \(\hat{y}\) | prediction \(\in \mathbb{R}^{NXC}\) |
(*) bias term (\(b\)) excluded to avoid crowding the notations
We'll set our seeds for reproducibility.
1
2 | import numpy as np
import random
|
1 | SEED = 1234
|
1
2
3 | # Set seed for reproducibility
np.random.seed(SEED)
random.seed(SEED)
|
I created some non-linearly separable spiral data so let's go ahead and download it for our classification task.
1
2 | import matplotlib.pyplot as plt
import pandas as pd
|
1
2
3
4
5 | # Load data
url = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/spiral.csv"
df = pd.read_csv(url, header=0) # load
df = df.sample(frac=1).reset_index(drop=True) # shuffle
df.head()
|
1
2
3
4
5 | # Data shapes
X = df[["X1", "X2"]].values
y = df["color"].values
print ("X: ", np.shape(X))
print ("y: ", np.shape(y))
|
1
2
3
4
5 | # Visualize data
plt.title("Generated non-linear data")
colors = {"c1": "red", "c2": "yellow", "c3": "blue"}
plt.scatter(X[:, 0], X[:, 1], c=[colors[_y] for _y in y], edgecolors="k", s=25)
plt.show()
|
We'll shuffle our dataset (since it's ordered by class) and then create our data splits (stratified on class).
1
2 | import collections
from sklearn.model_selection import train_test_split
|
1
2
3 | TRAIN_SIZE = 0.7
VAL_SIZE = 0.15
TEST_SIZE = 0.15
|
1
2
3
4
5 | def train_val_test_split(X, y, train_size):
"""Split dataset into data splits."""
X_train, X_, y_train, y_ = train_test_split(X, y, train_size=TRAIN_SIZE, stratify=y)
X_val, X_test, y_val, y_test = train_test_split(X_, y_, train_size=0.5, stratify=y_)
return X_train, X_val, X_test, y_train, y_val, y_test
|
1
2
3
4
5
6
7 | # Create data splits
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
X=X, y=y, train_size=TRAIN_SIZE)
print (f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print (f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print (f"Sample point: {X_train[0]} → {y_train[0]}")
|
In the previous lesson we wrote our own label encoder class to see the inner functions but this time we'll use scikit-learn LabelEncoder class which does the same operations as ours.
1 | from sklearn.preprocessing import LabelEncoder
|
1
2 | # Output vectorizer
label_encoder = LabelEncoder()
|
1
2
3
4 | # Fit on train data
label_encoder = label_encoder.fit(y_train)
classes = list(label_encoder.classes_)
print (f"classes: {classes}")
|
1
2
3
4
5
6 | # Convert labels to tokens
print (f"y_train[0]: {y_train[0]}")
y_train = label_encoder.transform(y_train)
y_val = label_encoder.transform(y_val)
y_test = label_encoder.transform(y_test)
print (f"y_train[0]: {y_train[0]}")
|
1
2
3
4 | # Class weights
counts = np.bincount(y_train)
class_weights = {i: 1.0/count for i, count in enumerate(counts)}
print (f"counts: {counts}\nweights: {class_weights}")
|
We need to standardize our data (zero mean and unit variance) so a specific feature's magnitude doesn't affect how the model learns its weights. We're only going to standardize the inputs X because our outputs y are class values.
1 | from sklearn.preprocessing import StandardScaler
|
1
2 | # Standardize the data (mean=0, std=1) using training data
X_scaler = StandardScaler().fit(X_train)
|
1
2
3
4 | # Apply scaler on training and test data (don't standardize outputs for classification)
X_train = X_scaler.transform(X_train)
X_val = X_scaler.transform(X_val)
X_test = X_scaler.transform(X_test)
|
1
2
3 | # Check (means should be ~0 and std should be ~1)
print (f"X_test[0]: mean: {np.mean(X_test[:, 0], axis=0):.1f}, std: {np.std(X_test[:, 0], axis=0):.1f}")
print (f"X_test[1]: mean: {np.mean(X_test[:, 1], axis=0):.1f}, std: {np.std(X_test[:, 1], axis=0):.1f}")
|
Before we get to our neural network, we're going to motivate non-linear activation functions by implementing a generalized linear model (logistic regression). We'll see why linear models (with linear activations) won't suffice for our dataset.
1 | import torch
|
1
2 | # Set seed for reproducibility
torch.manual_seed(SEED)
|
We'll create our linear model using one layer of weights.
1
2 | from torch import nn
import torch.nn.functional as F
|
1
2
3 | INPUT_DIM = X_train.shape[1] # X is 2-dimensional
HIDDEN_DIM = 100
NUM_CLASSES = len(classes) # 3 classes
|
1
2
3
4
5
6
7
8
9
10 | class LinearModel(nn.Module):
def __init__(self, input_dim, hidden_dim, num_classes):
super(LinearModel, self).__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, num_classes)
def forward(self, x_in):
z = self.fc1(x_in) # linear activation
z = self.fc2(z)
return z
|
1
2
3 | # Initialize model
model = LinearModel(input_dim=INPUT_DIM, hidden_dim=HIDDEN_DIM, num_classes=NUM_CLASSES)
print (model.named_parameters)
|
We'll go ahead and train our initialized model for a few epochs.
1 | from torch.optim import Adam
|
1
2
3 | LEARNING_RATE = 1e-2
NUM_EPOCHS = 10
BATCH_SIZE = 32
|
1
2
3 | # Define Loss
class_weights_tensor = torch.Tensor(list(class_weights.values()))
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)
|
1
2
3
4
5 | # Accuracy
def accuracy_fn(y_pred, y_true):
n_correct = torch.eq(y_pred, y_true).sum().item()
accuracy = (n_correct / len(y_pred)) * 100
return accuracy
|
1
2 | # Optimizer
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
|
1
2
3
4
5
6
7 | # Convert data to tensors
X_train = torch.Tensor(X_train)
y_train = torch.LongTensor(y_train)
X_val = torch.Tensor(X_val)
y_val = torch.LongTensor(y_val)
X_test = torch.Tensor(X_test)
y_test = torch.LongTensor(y_test)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21 | # Training
for epoch in range(NUM_EPOCHS):
# Forward pass
y_pred = model(X_train)
# Loss
loss = loss_fn(y_pred, y_train)
# Zero all gradients
optimizer.zero_grad()
# Backward pass
loss.backward()
# Update weights
optimizer.step()
if epoch%1==0:
predictions = y_pred.max(dim=1)[1] # class
accuracy = accuracy_fn(y_pred=predictions, y_true=y_train)
print (f"Epoch: {epoch} | loss: {loss:.2f}, accuracy: {accuracy:.1f}")
|
Now let's see how well our linear model does on our non-linear spiral data.
1
2
3 | import json
import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_fscore_support
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23 | def get_metrics(y_true, y_pred, classes):
"""Per-class performance metrics."""
# Performance
performance = {"overall": {}, "class": {}}
# Overall performance
metrics = precision_recall_fscore_support(y_true, y_pred, average="weighted")
performance["overall"]["precision"] = metrics[0]
performance["overall"]["recall"] = metrics[1]
performance["overall"]["f1"] = metrics[2]
performance["overall"]["num_samples"] = np.float64(len(y_true))
# Per-class performance
metrics = precision_recall_fscore_support(y_true, y_pred, average=None)
for i in range(len(classes)):
performance["class"][classes[i]] = {
"precision": metrics[0][i],
"recall": metrics[1][i],
"f1": metrics[2][i],
"num_samples": np.float64(metrics[3][i]),
}
return performance
|
1
2
3
4
5 | # Predictions
y_prob = F.softmax(model(X_test), dim=1)
print (f"sample probability: {y_prob[0]}")
y_pred = y_prob.max(dim=1)[1]
print (f"sample class: {y_pred[0]}")
|
1
2
3 | # # Performance
performance = get_metrics(y_true=y_test, y_pred=y_pred, classes=classes)
print (json.dumps(performance, indent=2))
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14 | def plot_multiclass_decision_boundary(model, X, y):
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 101), np.linspace(y_min, y_max, 101))
cmap = plt.cm.Spectral
X_test = torch.from_numpy(np.c_[xx.ravel(), yy.ravel()]).float()
y_pred = F.softmax(model(X_test), dim=1)
_, y_pred = y_pred.max(dim=1)
y_pred = y_pred.reshape(xx.shape)
plt.contourf(xx, yy, y_pred, cmap=plt.cm.Spectral, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.RdYlBu)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
|
1
2
3
4
5
6
7
8
9 | # Visualize the decision boundary
plt.figure(figsize=(12,5))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_multiclass_decision_boundary(model=model, X=X_train, y=y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_multiclass_decision_boundary(model=model, X=X_test, y=y_test)
plt.show()
|
Using the generalized linear method (logistic regression) yielded poor results because of the non-linearity present in our data yet our activation functions were linear. We need to use an activation function that can allow our model to learn and map the non-linearity in our data. There are many different options so let's explore a few.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26 | # Fig size
plt.figure(figsize=(12,3))
# Data
x = torch.arange(-5., 5., 0.1)
# Sigmoid activation (constrain a value between 0 and 1.)
plt.subplot(1, 3, 1)
plt.title("Sigmoid activation")
y = torch.sigmoid(x)
plt.plot(x.numpy(), y.numpy())
# Tanh activation (constrain a value between -1 and 1.)
plt.subplot(1, 3, 2)
y = torch.tanh(x)
plt.title("Tanh activation")
plt.plot(x.numpy(), y.numpy())
# Relu (clip the negative values to 0)
plt.subplot(1, 3, 3)
y = F.relu(x)
plt.title("ReLU activation")
plt.plot(x.numpy(), y.numpy())
# Show plots
plt.show()
|
The ReLU activation function (\(max(0,z)\)) is by far the most widely used activation function for neural networks. But as you can see, each activation function has its own constraints so there are circumstances where you'll want to use different ones. For example, if we need to constrain our outputs between 0 and 1, then the sigmoid activation is the best choice.
In some cases, using a ReLU activation function may not be sufficient. For instance, when the outputs from our neurons are mostly negative, the activation function will produce zeros. This effectively creates a "dying ReLU" and a recovery is unlikely. To mitigate this effect, we could lower the learning rate or use alternative ReLU activations, ex. leaky ReLU or parametric ReLU (PReLU), which have a small slope for negative neuron outputs.
Now let's create our multilayer perceptron (MLP) which is going to be exactly like the logistic regression model but with the activation function to map the non-linearity in our data.
It's normal to find the math and code in this section slightly complex. You can still read each of the steps to build intuition for when we implement this using PyTorch.
Our goal is to learn a model \(\hat{y}\) that models \(y\) given \(X\). You'll notice that neural networks are just extensions of the generalized linear methods we've seen so far but with non-linear activation functions since our data will be highly non-linear.
Step 1: Randomly initialize the model's weights \(W\) (we'll cover more effective initialization strategies later in this lesson).
1
2
3
4
5 | # Initialize first layer's weights
W1 = 0.01 * np.random.randn(INPUT_DIM, HIDDEN_DIM)
b1 = np.zeros((1, HIDDEN_DIM))
print (f"W1: {W1.shape}")
print (f"b1: {b1.shape}")
|
Step 2: Feed inputs \(X\) into the model to do the forward pass and receive the probabilities. First we pass the inputs into the first layer.
1
2
3 | # z1 = [NX2] · [2X100] + [1X100] = [NX100]
z1 = np.dot(X_train, W1) + b1
print (f"z1: {z1.shape}")
|
Next we apply the non-linear activation function, ReLU (\(max(0,z)\)) in this case.
1
2
3 | # Apply activation function
a1 = np.maximum(0, z1) # ReLU
print (f"a_1: {a1.shape}")
|
We pass the activations to the second layer to get our logits.
1
2
3
4
5 | # Initialize second layer's weights
W2 = 0.01 * np.random.randn(HIDDEN_DIM, NUM_CLASSES)
b2 = np.zeros((1, NUM_CLASSES))
print (f"W2: {W2.shape}")
print (f"b2: {b2.shape}")
|
1
2
3
4 | # z2 = logits = [NX100] · [100X3] + [1X3] = [NX3]
logits = np.dot(a1, W2) + b2
print (f"logits: {logits.shape}")
print (f"sample: {logits[0]}")
|
We'll apply the softmax function to normalize the logits and obtain class probabilities.
1
2
3
4
5 | # Normalization via softmax to obtain class probabilities
exp_logits = np.exp(logits)
y_hat = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
print (f"y_hat: {y_hat.shape}")
print (f"sample: {y_hat[0]}")
|
Step 3: Compare the predictions \(\hat{y}\) (ex. [0.3, 0.3, 0.4]) with the actual target values \(y\) (ex. class 2 would look like [0, 0, 1]) with the objective (cost) function to determine loss \(J\). A common objective function for classification tasks is cross-entropy loss.
(*) bias term (\(b\)) excluded to avoid crowding the notations
1
2
3
4 | # Loss
correct_class_logprobs = -np.log(y_hat[range(len(y_hat)), y_train])
loss = np.sum(correct_class_logprobs) / len(y_train)
print (f"loss: {loss:.2f}")
|
Step 4: Calculate the gradient of loss \(J(\theta)\) w.r.t to the model weights.
The gradient of the loss w.r.t to $$ W_2 $$ is the same as the gradients from logistic regression since $\(hat{y} = softmax(z_2)\).
The gradient of the loss w.r.t \(W_1\) is a bit trickier since we have to backpropagate through two sets of weights.
1
2
3
4
5
6 | # dJ/dW2
dscores = y_hat
dscores[range(len(y_hat)), y_train] -= 1
dscores /= len(y_train)
dW2 = np.dot(a1.T, dscores)
db2 = np.sum(dscores, axis=0, keepdims=True)
|
1
2
3
4
5 | # dJ/dW1
dhidden = np.dot(dscores, W2.T)
dhidden[a1 <= 0] = 0 # ReLu backprop
dW1 = np.dot(X_train.T, dhidden)
db1 = np.sum(dhidden, axis=0, keepdims=True)
|
Step 5: Update the weights \(W\) using a small learning rate \(\alpha\). The updates will penalize the probability for the incorrect classes (\(j\)) and encourage a higher probability for the correct class (\(y\)).
1
2
3
4
5 | # Update weights
W1 += -LEARNING_RATE * dW1
b1 += -LEARNING_RATE * db1
W2 += -LEARNING_RATE * dW2
b2 += -LEARNING_RATE * db2
|
Step 6: Repeat steps 2 - 4 until model performs well.
1
2
3
4
5
6
7 | # Convert tensors to NumPy arrays
X_train = X_train.numpy()
y_train = y_train.numpy()
X_val = X_val.numpy()
y_val = y_val.numpy()
X_test = X_test.numpy()
y_test = y_test.numpy()
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51 | # Initialize random weights
W1 = 0.01 * np.random.randn(INPUT_DIM, HIDDEN_DIM)
b1 = np.zeros((1, HIDDEN_DIM))
W2 = 0.01 * np.random.randn(HIDDEN_DIM, NUM_CLASSES)
b2 = np.zeros((1, NUM_CLASSES))
# Training loop
for epoch_num in range(1000):
# First layer forward pass [NX2] · [2X100] = [NX100]
z1 = np.dot(X_train, W1) + b1
# Apply activation function
a1 = np.maximum(0, z1) # ReLU
# z2 = logits = [NX100] · [100X3] = [NX3]
logits = np.dot(a1, W2) + b2
# Normalization via softmax to obtain class probabilities
exp_logits = np.exp(logits)
y_hat = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
# Loss
correct_class_logprobs = -np.log(y_hat[range(len(y_hat)), y_train])
loss = np.sum(correct_class_logprobs) / len(y_train)
# show progress
if epoch_num%100 == 0:
# Accuracy
y_pred = np.argmax(logits, axis=1)
accuracy = np.mean(np.equal(y_train, y_pred))
print (f"Epoch: {epoch_num}, loss: {loss:.3f}, accuracy: {accuracy:.3f}")
# dJ/dW2
dscores = y_hat
dscores[range(len(y_hat)), y_train] -= 1
dscores /= len(y_train)
dW2 = np.dot(a1.T, dscores)
db2 = np.sum(dscores, axis=0, keepdims=True)
# dJ/dW1
dhidden = np.dot(dscores, W2.T)
dhidden[a1 <= 0] = 0 # ReLu backprop
dW1 = np.dot(X_train.T, dhidden)
db1 = np.sum(dhidden, axis=0, keepdims=True)
# Update weights
W1 += -1e0 * dW1
b1 += -1e0 * db1
W2 += -1e0 * dW2
b2 += -1e0 * db2
|
Now let's see how our model performs on the test (hold-out) data split.
1
2
3
4
5
6
7
8 | class MLPFromScratch():
def predict(self, x):
z1 = np.dot(x, W1) + b1
a1 = np.maximum(0, z1)
logits = np.dot(a1, W2) + b2
exp_logits = np.exp(logits)
y_hat = exp_logits / np.sum(exp_logits, axis=1, keepdims=True)
return y_hat
|
1
2
3
4 | # Evaluation
model = MLPFromScratch()
y_prob = model.predict(X_test)
y_pred = np.argmax(y_prob, axis=1)
|
1
2
3 | # # Performance
performance = get_metrics(y_true=y_test, y_pred=y_pred, classes=classes)
print (json.dumps(performance, indent=2))
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29 | def plot_multiclass_decision_boundary_numpy(model, X, y, savefig_fp=None):
"""Plot the multiclass decision boundary for a model that accepts 2D inputs.
Credit: https://cs231n.github.io/neural-networks-case-study/
Arguments:
model {function} -- trained model with function model.predict(x_in).
X {numpy.ndarray} -- 2D inputs with shape (N, 2).
y {numpy.ndarray} -- 1D outputs with shape (N,).
"""
# Axis boundaries
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 101),
np.linspace(y_min, y_max, 101))
# Create predictions
x_in = np.c_[xx.ravel(), yy.ravel()]
y_pred = model.predict(x_in)
y_pred = np.argmax(y_pred, axis=1).reshape(xx.shape)
# Plot decision boundary
plt.contourf(xx, yy, y_pred, cmap=plt.cm.Spectral, alpha=0.8)
plt.scatter(X[:, 0], X[:, 1], c=y, s=40, cmap=plt.cm.RdYlBu)
plt.xlim(xx.min(), xx.max())
plt.ylim(yy.min(), yy.max())
# Plot
if savefig_fp:
plt.savefig(savefig_fp, format="png")
|
1
2
3
4
5
6
7
8
9 | # Visualize the decision boundary
plt.figure(figsize=(12,5))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_multiclass_decision_boundary_numpy(model=model, X=X_train, y=y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_multiclass_decision_boundary_numpy(model=model, X=X_test, y=y_test)
plt.show()
|
Now let's implement the same MLP in PyTorch.
We'll be using two linear layers along with PyTorch Functional API's ReLU operation.
1
2
3
4
5
6
7
8
9
10 | class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, num_classes):
super(MLP, self).__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, num_classes)
def forward(self, x_in):
z = F.relu(self.fc1(x_in)) # ReLU activation function added!
z = self.fc2(z)
return z
|
1
2
3 | # Initialize model
model = MLP(input_dim=INPUT_DIM, hidden_dim=HIDDEN_DIM, num_classes=NUM_CLASSES)
print (model.named_parameters)
|
1
2
3 | # Define Loss
class_weights_tensor = torch.Tensor(list(class_weights.values()))
loss_fn = nn.CrossEntropyLoss(weight=class_weights_tensor)
|
1
2
3
4
5 | # Accuracy
def accuracy_fn(y_pred, y_true):
n_correct = torch.eq(y_pred, y_true).sum().item()
accuracy = (n_correct / len(y_pred)) * 100
return accuracy
|
1
2 | # Optimizer
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
|
1
2
3
4
5
6
7 | # Convert data to tensors
X_train = torch.Tensor(X_train)
y_train = torch.LongTensor(y_train)
X_val = torch.Tensor(X_val)
y_val = torch.LongTensor(y_val)
X_test = torch.Tensor(X_test)
y_test = torch.LongTensor(y_test)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21 | # Training
for epoch in range(NUM_EPOCHS*10):
# Forward pass
y_pred = model(X_train)
# Loss
loss = loss_fn(y_pred, y_train)
# Zero all gradients
optimizer.zero_grad()
# Backward pass
loss.backward()
# Update weights
optimizer.step()
if epoch%10==0:
predictions = y_pred.max(dim=1)[1] # class
accuracy = accuracy_fn(y_pred=predictions, y_true=y_train)
print (f"Epoch: {epoch} | loss: {loss:.2f}, accuracy: {accuracy:.1f}")
|
1
2
3 | # Predictions
y_prob = F.softmax(model(X_test), dim=1)
y_pred = y_prob.max(dim=1)[1]
|
1
2
3 | # # Performance
performance = get_metrics(y_true=y_test, y_pred=y_pred, classes=classes)
print (json.dumps(performance, indent=2))
|
1
2
3
4
5
6
7
8
9 | # Visualize the decision boundary
plt.figure(figsize=(12,5))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_multiclass_decision_boundary(model=model, X=X_train, y=y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_multiclass_decision_boundary(model=model, X=X_test, y=y_test)
plt.show()
|
Let's look at the inference operations when using our trained model.
1
2 | # Inputs for inference
X_infer = pd.DataFrame([{"X1": 0.1, "X2": 0.1}])
|
1
2
3 | # Standardize
X_infer = X_scaler.transform(X_infer)
print (X_infer)
|
1
2
3
4
5 | # Predict
y_infer = F.softmax(model(torch.Tensor(X_infer)), dim=1)
prob, _class = y_infer.max(dim=1)
label = label_encoder.inverse_transform(_class.detach().numpy())[0]
print (f"The probability that you have {label} is {prob.detach().numpy()[0]*100.0:.0f}%")
|
So far we have been initializing weights with small random values but this isn't optimal for convergence during training. The objective is to initialize the appropriate weights such that our activations (outputs of layers) don't vanish (too small) or explode (too large), as either of these situations will hinder convergence. We can do this by sampling the weights uniformly from a bound distribution (many that take into account the precise activation function used) such that all activations have unit variance.
You may be wondering why we don't do this for every forward pass and that's a great question. We'll look at more advanced strategies that help with optimization like batch normalization, etc. in future lessons. Meanwhile you can check out other initializers here.
1 | from torch.nn import init
|
1
2
3
4
5
6
7
8
9
10
11
12
13 | class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, num_classes):
super(MLP, self).__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.fc2 = nn.Linear(hidden_dim, num_classes)
def init_weights(self):
init.xavier_normal(self.fc1.weight, gain=init.calculate_gain("relu"))
def forward(self, x_in):
z = F.relu(self.fc1(x_in)) # ReLU activation function added!
z = self.fc2(z)
return z
|
A great technique to have our models generalize (perform well on test data) is to increase the size of your data but this isn't always an option. Fortunately, there are methods like regularization and dropout that can help create a more robust model.
Dropout is a technique (used only during training) that allows us to zero the outputs of neurons. We do this for dropout_p% of the total neurons in each layer and it changes every batch. Dropout prevents units from co-adapting too much to the data and acts as a sampling strategy since we drop a different set of neurons each time.
1 | DROPOUT_P = 0.1 # percentage of weights that are dropped each pass
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15 | class MLP(nn.Module):
def __init__(self, input_dim, hidden_dim, dropout_p, num_classes):
super(MLP, self).__init__()
self.fc1 = nn.Linear(input_dim, hidden_dim)
self.dropout = nn.Dropout(dropout_p) # dropout
self.fc2 = nn.Linear(hidden_dim, num_classes)
def init_weights(self):
init.xavier_normal(self.fc1.weight, gain=init.calculate_gain("relu"))
def forward(self, x_in):
z = F.relu(self.fc1(x_in))
z = self.dropout(z) # dropout
z = self.fc2(z)
return z
|
1
2
3
4 | # Initialize model
model = MLP(input_dim=INPUT_DIM, hidden_dim=HIDDEN_DIM,
dropout_p=DROPOUT_P, num_classes=NUM_CLASSES)
print (model.named_parameters)
|
Though neural networks are great at capturing non-linear relationships they are highly susceptible to overfitting to the training data and failing to generalize on test data. Just take a look at the example below where we generate completely random data and are able to fit a model with \(2*N*C + D\) (where N = # of samples, C = # of classes and D = input dimension) hidden units. The training performance is good (~70%) but the overfitting leads to very poor test performance. We'll be covering strategies to tackle overfitting in future lessons.
1
2
3
4 | NUM_EPOCHS = 500
NUM_SAMPLES_PER_CLASS = 50
LEARNING_RATE = 1e-1
HIDDEN_DIM = 2 * NUM_SAMPLES_PER_CLASS * NUM_CLASSES + INPUT_DIM # 2*N*C + D
|
1
2
3
4
5 | # Generate random data
X = np.random.rand(NUM_SAMPLES_PER_CLASS * NUM_CLASSES, INPUT_DIM)
y = np.array([[i]*NUM_SAMPLES_PER_CLASS for i in range(NUM_CLASSES)]).reshape(-1)
print ("X: ", format(np.shape(X)))
print ("y: ", format(np.shape(y)))
|
1
2
3
4
5
6
7 | # Create data splits
X_train, X_val, X_test, y_train, y_val, y_test = train_val_test_split(
X=X, y=y, train_size=TRAIN_SIZE)
print (f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print (f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print (f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print (f"Sample point: {X_train[0]} → {y_train[0]}")
|
1
2
3
4
5 | # Standardize the inputs (mean=0, std=1) using training data
X_scaler = StandardScaler().fit(X_train)
X_train = X_scaler.transform(X_train)
X_val = X_scaler.transform(X_val)
X_test = X_scaler.transform(X_test)
|
1
2
3
4
5
6
7 | # Convert data to tensors
X_train = torch.Tensor(X_train)
y_train = torch.LongTensor(y_train)
X_val = torch.Tensor(X_val)
y_val = torch.LongTensor(y_val)
X_test = torch.Tensor(X_test)
y_test = torch.LongTensor(y_test)
|
1
2
3
4 | # Initialize model
model = MLP(input_dim=INPUT_DIM, hidden_dim=HIDDEN_DIM,
dropout_p=DROPOUT_P, num_classes=NUM_CLASSES)
print (model.named_parameters)
|
1
2 | # Optimizer
optimizer = Adam(model.parameters(), lr=LEARNING_RATE)
|
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21 | # Training
for epoch in range(NUM_EPOCHS):
# Forward pass
y_pred = model(X_train)
# Loss
loss = loss_fn(y_pred, y_train)
# Zero all gradients
optimizer.zero_grad()
# Backward pass
loss.backward()
# Update weights
optimizer.step()
if epoch%20==0:
predictions = y_pred.max(dim=1)[1] # class
accuracy = accuracy_fn(y_pred=predictions, y_true=y_train)
print (f"Epoch: {epoch} | loss: {loss:.2f}, accuracy: {accuracy:.1f}")
|
1
2
3 | # Predictions
y_prob = F.softmax(model(X_test), dim=1)
y_pred = y_prob.max(dim=1)[1]
|
1
2
3 | # # Performance
performance = get_metrics(y_true=y_test, y_pred=y_pred, classes=classes)
print (json.dumps(performance, indent=2))
|
1
2
3
4
5
6
7
8
9 | # Visualize the decision boundary
plt.figure(figsize=(12,5))
plt.subplot(1, 2, 1)
plt.title("Train")
plot_multiclass_decision_boundary(model=model, X=X_train, y=y_train)
plt.subplot(1, 2, 2)
plt.title("Test")
plot_multiclass_decision_boundary(model=model, X=X_test, y=y_test)
plt.show()
|
It's important that we experiment, starting with simple models that underfit (high bias) and improve it towards a good fit. Starting with simple models (linear/logistic regression) let's us catch errors without the added complexity of more sophisticated models (neural networks).
To cite this content, please use:
1
2
3
4
5
6 | @article{madewithml,
author = {Goku Mohandas},
title = { Neural networks - Made With ML },
howpublished = {\url{https://madewithml.com/}},
year = {2023}
}
|