Now that we have our data prepared, we can start training our models to optimize on our objective. Ideally, we would start with the simplest possible baseline and slowly add complexity to our models:
1. Since we have four classes, we might expect a random model to be correct around 25% of the time, but recall that not all of our classes have equal counts.
2. We could build a list of common words for each class and, if a word in the input matches a word in the list, predict that class.
3. We could start with a simple term frequency (TF-IDF) model and then move on to embeddings with CNNs, RNNs, Transformers, etc.
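To make the first baseline concrete, here is a minimal sketch of how class imbalance changes what "random" means. The label names and counts below are hypothetical placeholders, not the course dataset's actual distribution.

```python
from collections import Counter

# Hypothetical label counts for four classes (illustrative only, not the real dataset).
label_counts = Counter({"class_a": 400, "class_b": 300, "class_c": 200, "class_d": 100})
total = sum(label_counts.values())
priors = {label: count / total for label, count in label_counts.items()}

# Uniform guessing: each class is picked with probability 1/k, so accuracy is 1/k.
uniform_acc = sum(p * (1 / len(priors)) for p in priors.values())

# Guessing in proportion to class frequency: accuracy is the sum of squared priors.
stratified_acc = sum(p ** 2 for p in priors.values())

# Always predicting the most frequent class: accuracy is the largest prior.
majority_acc = max(priors.values())

print(f"uniform={uniform_acc:.2f} stratified={stratified_acc:.2f} majority={majority_acc:.2f}")
```

With imbalanced classes, the majority-class predictor is usually the chance-level number worth beating, since it can score well above 25% without learning anything.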
We're going to skip straight to step 3, developing a complex model, because this task involves unstructured data and rule-based systems are not well suited to it. And with the increased adoption of large language models (LLMs) as a proven model architecture for NLP tasks, we'll fine-tune a pretrained LLM on our dataset.
Iterate on the data
Instead of using a fixed dataset and iterating on the models, we could keep the model constant and iterate on the dataset. This is useful to improve the quality of our datasets.
With the rapid increase in data (unstructured) and model sizes (ex. LLMs), it's becoming increasingly difficult to train models on a single machine. We need to be able to distribute our training across multiple machines in order to train our models in a reasonable amount of time. And we want to be able to do this without having to:
To address all of these concerns, we'll use Ray Train to create a training workflow that can scale across multiple machines. While there are many options for distributed training, such as PyTorch Distributed Data Parallel (DDP), Horovod, etc., none of them let us scale across machines as easily, and with as few changes to our single-machine training code, as Ray does.
Primer on distributed training
With distributed training, there is a head node that's responsible for orchestrating the training process, and worker nodes that are responsible for training the model and communicating results back to the head node. From a user's perspective, Ray abstracts away all of this complexity and we can simply define our training functionality with minimal changes to our code (as if we were training on a single machine).
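As a rough illustration of the idea (not Ray's actual mechanics), the toy sketch below simulates data-parallel training in plain Python: the data is split into shards, each simulated worker computes a gradient on its shard, and the gradients are averaged before the shared weight is updated. The model, data, and helper names (`gradient`, `split`) are all assumptions made for this example.

```python
# Toy data-parallel training for a 1-D linear model y = w * x with squared error loss.

def gradient(w, shard):
    """Mean gradient of (w*x - y)^2 with respect to w over one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)

def split(data, num_workers):
    """Split data into equally sized contiguous shards (assumes an even split)."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated from y = 2x
w, lr, num_workers = 0.0, 0.01, 2

for step in range(200):
    shards = split(data, num_workers)
    # Each simulated worker computes a gradient on its own shard
    # (sequentially here; a real framework would run these in parallel).
    worker_grads = [gradient(w, shard) for shard in shards]
    # Aggregation: average the per-worker gradients, then update the shared weight.
    avg_grad = sum(worker_grads) / num_workers
    w -= lr * avg_grad

print(round(w, 3))
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, which is why the distributed update matches what a single machine would compute.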
In this lesson, we're going to fine-tune a pretrained large language model (LLM) using our labeled dataset. The specific class of LLMs we'll be using is called BERT. BERT models are encoder-only models and are the gold standard for supervised NLP tasks. However, you may be wondering how the (much larger) LLMs created for generative applications (GPT 4, Falcon 40B, Llama 2, etc.) fare.
We chose the smaller BERT model for our course because it's easier to train and fine-tune. However, the workflow for fine-tuning the larger LLMs is quite similar. They do require much more compute, but Ray abstracts away the scaling complexities involved with that.
Note
All the code for this section can be found in our separate benchmarks.ipynb notebook.
You'll need to first sign up for an OpenAI account and then grab your API key from here.
```python
import openai
openai.api_key = "YOUR_API_KEY"
```
We'll first load our training and inference data into dataframes.
```python
import pandas as pd
```
```python
# Load training data
DATASET_LOC = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/dataset.csv"
train_df = pd.read_csv(DATASET_LOC)
train_df.head()
```
```python
# Unique labels
tags = train_df.tag.unique().tolist()
tags
```
```python
# Load inference dataset
HOLDOUT_LOC = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/holdout.csv"
test_df = pd.read_csv(HOLDOUT_LOC)
```
We'll define a few utility functions to make the OpenAI request and to store our predictions. While we could perform batch prediction by loading samples until the context length is reached, we'll just predict one sample at a time, since there aren't many data points and each prediction then stays independent of the other samples (it won't change if you insert new data, etc.). We'll also add some reliability in case we overload the endpoints with too many requests at once.
```python
import json
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns; sns.set_theme()
from sklearn.metrics import precision_recall_fscore_support
import time
from tqdm import tqdm
```
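For reference, the batching alternative mentioned earlier (which this lesson does not use) could be sketched as a greedy packer. This is only an illustration: `pack_batches` and `max_chars` are hypothetical names, and a character budget stands in for a real token count, which would need the model's tokenizer.

```python
def pack_batches(items, max_chars):
    """Greedily group stringified items so a batch does not exceed max_chars.

    A single item longer than max_chars still gets a batch of its own.
    """
    batches, current, current_len = [], [], 0
    for item in items:
        text = str(item)
        # Start a new batch when adding this item would exceed the budget.
        if current and current_len + len(text) > max_chars:
            batches.append(current)
            current, current_len = [], 0
        current.append(text)
        current_len += len(text)
    if current:
        batches.append(current)
    return batches
```

The trade-off is that a batched prompt makes each prediction depend on which other samples share its batch, which is the reason given above for predicting one sample at a time.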
We'll first define what a sample call to the OpenAI endpoint looks like. We'll pass in:
- system_content that has information about how the LLM should behave.
- assistant_content for any additional context it should have for answering our questions.
- user_content that has our message or query to the LLM.
- model, which specifies the model we want to send our request to.
We can pass all of this information in through the openai.ChatCompletion.create function to receive our response.
```python
# Query OpenAI endpoint
system_content = "you only answer in rhymes"  # system content (behavior)
assistant_content = ""  # assistant content (context)
user_content = "how are you"  # user content (message)
response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0613",
    messages=[
        {"role": "system", "content": system_content},
        {"role": "assistant", "content": assistant_content},
        {"role": "user", "content": user_content},
    ],
)
print(response.to_dict()["choices"][0].to_dict()["message"]["content"])
```
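The messages list above is just a list of role/content dictionaries, so it can be assembled by a small helper. The sketch below uses a hypothetical `build_messages` function that, unlike the call above, skips any role whose content is empty. That is a design choice of this example, not something the API is known to require.

```python
def build_messages(system_content="", assistant_content="", user_content=""):
    """Assemble the chat messages list, skipping roles with empty content."""
    candidates = [
        ("system", system_content),
        ("assistant", assistant_content),
        ("user", user_content),
    ]
    return [{"role": role, "content": content} for role, content in candidates if content]


messages = build_messages(
    system_content="you only answer in rhymes",
    user_content="how are you",
)
print(messages)
```

The resulting list could then be passed as the `messages` argument in place of the inline list.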
Now, let's create a function that can predict tags for a given sample.
```python
def get_tag(model, system_content="", assistant_content="", user_content=""):
    try:
        # Get response from OpenAI
        response = openai.ChatCompletion.create(
            model=model,
            messages=[
                {"role": "system", "content": system_content},
                {"role": "assistant", "content": assistant_content},
                {"role": "user", "content": user_content},
            ],
        )
        predicted_tag = response.to_dict()["choices"][0].to_dict()["message"]["content"]
        return predicted_tag
    except (openai.error.ServiceUnavailableError, openai.error.APIError) as e:
        # Signal failure to the caller so it can retry
        return None
```
```python
# Get tag
model = "gpt-3.5-turbo-0613"
system_context = f"""
    You are a NLP prediction service that predicts the label given an input's title and description.
    You must choose between one of the following labels for each input: {tags}.
    Only respond with the label name and nothing else.
    """
assistant_content = ""
user_context = "Transfer learning with transformers: Using transformers for transfer learning on text classification tasks."
tag = get_tag(model=model, system_content=system_context, assistant_content=assistant_content, user_content=user_context)
print(tag)
```
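Since the model replies with free text, the returned string is not guaranteed to be one of our labels even though the prompt asks for that. One way to guard against this, sketched here with a hypothetical `normalize_tag` helper and placeholder label names, is to map the raw response back onto the known label set.

```python
def normalize_tag(raw, valid_tags, default=None):
    """Map a raw model response onto one of the valid tags, or return default."""
    if raw is None:
        return default
    # Trim whitespace, surrounding quotes, and trailing periods; compare case-insensitively.
    cleaned = raw.strip().strip("\"'.").lower()
    lookup = {tag.lower(): tag for tag in valid_tags}
    if cleaned in lookup:
        return lookup[cleaned]
    # Fall back to the first valid tag mentioned anywhere in the response.
    for key, tag in lookup.items():
        if key in cleaned:
            return tag
    return default


example_tags = ["label-a", "label-b"]  # placeholder labels for illustration
print(normalize_tag(" Label-A. ", example_tags))
```

In this lesson's setting it might be called as `normalize_tag(tag, tags)`. What to do with responses that match no label (the `default`) is left open here.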
Next, let's create a function that can predict tags for a list of inputs.
```python
# List of dicts w/ {title, description} (just the first 3 samples for now)
samples = test_df[["title", "description"]].to_dict(orient="records")[:3]
samples
```
```python
def get_predictions(inputs, model, system_content, assistant_content=""):
    y_pred = []
    for item in tqdm(inputs):
        # Convert item dict to string
        user_content = str(item)
        # Get prediction
        predicted_tag = get_tag(
            model=model, system_content=system_content,
            assistant_content=assistant_content, user_content=user_content)
        # If error, try again after pause (repeatedly until success)
        while predicted_tag is None:
            time.sleep(30)  # could also do exponential backoff
            predicted_tag = get_tag(
                model=model, system_content=system_content,
                assistant_content=assistant_content, user_content=user_content)
        # Add to list of predictions
        y_pred.append(predicted_tag)
    return y_pred
```
```python
# Get predictions for a list of inputs
get_predictions(inputs=samples, model=model, system_content=system_context)
```
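The retry loop in get_predictions sleeps for a fixed 30 seconds, and its comment notes that exponential backoff is an alternative. A minimal sketch of that alternative follows. The `retry_with_backoff` helper, its parameters, and the 10% jitter are assumptions for illustration, and unlike the loop above it gives up after `max_retries` instead of retrying forever.

```python
import random
import time


def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call fn() until it returns a non-None value, backing off exponentially between attempts."""
    for attempt in range(max_retries + 1):
        result = fn()
        if result is not None:
            return result
        if attempt == max_retries:
            break
        # Delay doubles on each attempt, capped at max_delay, plus up to 10% jitter
        # so that many clients do not retry at the same instant.
        delay = min(max_delay, base_delay * 2 ** attempt)
        sleep(delay + random.uniform(0, delay * 0.1))
    return None
```

Inside the prediction loop, this could wrap the existing call, for example `retry_with_backoff(lambda: get_tag(model=model, system_content=system_content, user_content=user_content))`, with the caller deciding how to handle a final `None`.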