Distributed training - Made With ML by Anyscale

Distributed training

Training models on our prepared data to optimize on our objective.
Goku Mohandas

Intuition

Now that we have our data prepared, we can start training our models to optimize on our objective. Ideally, we would start with the simplest possible baseline and slowly add complexity to our models:

  1. Start with a random (chance) model.

    Since we have four classes, we may expect a random model to be correct around 25% of the time but recall that not all of our classes have equal counts.

  2. Develop a rule-based approach using if-else statements, regular expressions, etc.

    We could build a list of common words for each class and if a word in the input matches a word in the list, we can predict that class.

  3. Slowly add complexity by addressing limitations and motivating representations and model architectures.

    We could start with a simple term frequency (TF-IDF) model and then move on to embeddings with CNNs, RNNs, Transformers, etc.

  4. Weigh tradeoffs (performance, latency, size, etc.) between performant baselines.
  5. Revisit and iterate on baselines as your dataset grows and new model architectures are developed.
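
As a rough illustration of steps 1 and 2, the sketch below estimates how often a chance model would be correct when class counts are unequal, and applies a toy keyword rule. The label counts and keyword lists are made up for illustration and are not taken from the course dataset; `rule_based_tag` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical label counts (illustrative only, not the real dataset).
labels = (
    ["natural-language-processing"] * 40
    + ["computer-vision"] * 30
    + ["other"] * 20
    + ["mlops"] * 10
)

counts = Counter(labels)
total = sum(counts.values())
priors = {tag: n / total for tag, n in counts.items()}

# Expected accuracy when guessing uniformly at random over the classes.
uniform_acc = sum(p * (1 / len(priors)) for p in priors.values())

# Expected accuracy when guessing in proportion to class frequency.
proportional_acc = sum(p * p for p in priors.values())

# Accuracy when always predicting the most common class.
majority_acc = max(priors.values())

# Step 2: a toy rule-based classifier with hypothetical keyword lists.
KEYWORDS = {
    "computer-vision": {"image", "yolo", "cnn"},
    "natural-language-processing": {"text", "transformers", "bert"},
    "mlops": {"deployment", "monitoring", "pipeline"},
}

def rule_based_tag(text, default="other"):
    """Return the first class whose keyword list overlaps the input words."""
    words = set(text.lower().split())
    for tag, keywords in KEYWORDS.items():
        if words & keywords:
            return tag
    return default

print(uniform_acc, proportional_acc, majority_acc)
print(rule_based_tag("Comparison between YOLO and RCNN"))
```

With imbalanced classes, always predicting the majority class can beat uniform guessing, which is why a chance baseline should account for the label distribution.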

We're going to skip straight to step 3 of developing a complex model because this task involves unstructured data and rule-based systems are not well suited for this. And with the increasing adoption of large language models (LLMs) as a proven model architecture for NLP tasks, we'll fine-tune a pretrained LLM on our dataset.

Iterate on the data

Instead of using a fixed dataset and iterating on the models, we could keep the model constant and iterate on the dataset. This is useful to improve the quality of our datasets.

  • remove or fix data samples (false positives & negatives)
  • prepare and transform features
  • expand or consolidate classes
  • incorporate auxiliary datasets
  • identify unique slices to boost

Distributed training

With the rapid increase in data (unstructured) and model sizes (e.g., LLMs), it's becoming increasingly difficult to train models on a single machine. We need to be able to distribute our training across multiple machines in order to train our models in a reasonable amount of time. And we want to be able to do this without having to:

  • set up a cluster by individually (and painstakingly) provisioning compute resources (CPU, GPU, etc.)
  • write complex code to distribute our training across multiple machines
  • worry about communication and resource utilization between our different distributed compute resources
  • worry about fault tolerance and recovery from our large training workloads

To address all of these concerns, we'll use Ray Train to create a training workflow that can scale across multiple machines. While there are many options for distributed training, such as PyTorch Distributed Data Parallel (DDP), Horovod, etc., none of them let us scale across different machines as easily, or with as few changes to our single-machine training code, as Ray does.

Primer on distributed training

With distributed training, there will be a head node that's responsible for orchestrating the training process, while the worker nodes are responsible for training the model and communicating results back to the head node. From a user's perspective, Ray abstracts away all of this complexity and we can simply define our training functionality with minimal changes to our code (as if we were training on a single machine).
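
To make the division of labor more concrete, here is a minimal single-process simulation of data-parallel training, one common way the work is split. Each simulated worker computes a gradient on its own shard, and the results are averaged before the shared weight is updated. This is a simplified picture and not Ray's API: the function names, the toy linear model and the data are all illustrative assumptions, and real systems typically synchronize gradients over the network rather than in a Python loop.

```python
def shard(data, num_workers):
    """Split the data into num_workers roughly equal shards (round-robin)."""
    return [data[i::num_workers] for i in range(num_workers)]

def worker_gradient(w, shard_data):
    """Gradient of the mean squared error of y = w * x on one shard."""
    n = len(shard_data)
    return sum(2 * (w * x - y) * x for x, y in shard_data) / n

def distributed_step(w, shards, lr):
    """One training step: gather per-shard gradients, average, update."""
    total = sum(len(s) for s in shards)
    # Weight each worker's gradient by its shard size so the result
    # matches the gradient computed over the full dataset.
    grad = sum(worker_gradient(w, s) * len(s) for s in shards) / total
    return w - lr * grad

# Toy dataset generated from y = 3x.
data = [(float(x), 3.0 * x) for x in range(1, 9)]
shards = shard(data, num_workers=4)

w = 0.0
for _ in range(200):
    w = distributed_step(w, shards, lr=0.01)

print(round(w, 4))
```

Because the averaged gradient equals the full-dataset gradient, the update is the same as single-machine training, which is why the training code itself needs so few changes.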

Generative AI

In this lesson, we're going to be fine-tuning a pretrained large language model (LLM) using our labeled dataset. The specific class of LLMs we'll be using is called BERT. BERT models are encoder-only models and are the gold standard for supervised NLP tasks. However, you may be wondering how all the (much larger) LLMs created for generative applications (GPT-4, Falcon 40B, Llama 2, etc.) fare.

We chose the smaller BERT model for our course because it's easier to train and fine-tune. However, the workflow for fine-tuning the larger LLMs is quite similar. They do require much more compute but Ray abstracts away the scaling complexities involved with that.

Note

All the code for this section can be found in our separate benchmarks.ipynb notebook.

Set up

You'll need to first sign up for an OpenAI account and then grab your API key from here.

    import openai
    openai.api_key = "YOUR_API_KEY"

Load data

We'll first load our training and inference data into dataframes.

    import pandas as pd

    # Load training data
    DATASET_LOC = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/dataset.csv"
    train_df = pd.read_csv(DATASET_LOC)
    train_df.head()

       id  created_on           title                                               description                                        tag
    0   6  2020-02-20 06:43:18  Comparison between YOLO and RCNN on real world...  Bringing theory to experiment is cool. We can ...  computer-vision
    1   7  2020-02-20 06:47:21  Show, Infer & Tell: Contextual Inference for C...   The beauty of the work lies in the way it arch...  computer-vision
    2   9  2020-02-24 16:24:45  Awesome Graph Classification                        A collection of important graph embedding, cla...  other
    3  15  2020-02-28 23:55:26  Awesome Monte Carlo Tree Search                     A curated list of Monte Carlo tree search pape...  other
    4  25  2020-03-07 23:04:31  AttentionWalk                                       A PyTorch Implementation of "Watch Your Step: ...  other

    # Unique labels
    tags = train_df.tag.unique().tolist()
    tags

    ['computer-vision', 'other', 'natural-language-processing', 'mlops']

    # Load inference dataset
    HOLDOUT_LOC = "https://raw.githubusercontent.com/GokuMohandas/Made-With-ML/main/datasets/holdout.csv"
    test_df = pd.read_csv(HOLDOUT_LOC)

Utilities

We'll define a few utility functions to make the OpenAI request and to store our predictions. While we could perform batch prediction by loading samples until the context length is reached, we'll just predict one sample at a time since there aren't too many data points and this gives us more deterministic behavior (each prediction is unaffected if you insert new data, etc.). We've also added some reliability in case we overload the endpoints with too many requests at once.
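
For reference, the batching alternative mentioned above could look roughly like the sketch below, which greedily packs samples into batches under a size budget. This is not what the lesson does: `pack_batches` is a hypothetical helper, and it uses character count as a crude stand-in for the model's token-based context length.

```python
def pack_batches(samples, max_chars):
    """Greedily group samples so each batch's text stays within max_chars."""
    batches, current, used = [], [], 0
    for sample in samples:
        size = len(str(sample))
        # Start a new batch when adding this sample would exceed the budget.
        if current and used + size > max_chars:
            batches.append(current)
            current, used = [], 0
        current.append(sample)
        used += size
    if current:
        batches.append(current)
    return batches

samples = ["a" * 40, "b" * 40, "c" * 40, "d" * 100]
batches = pack_batches(samples, max_chars=100)
print([len(batch) for batch in batches])
```

A sample that is larger than the budget on its own still gets a batch to itself, so nothing is dropped. The tradeoff is that which samples share a request now depends on their neighbors, which is the coupling the one-at-a-time approach avoids.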

    import json
    from collections import Counter
    import matplotlib.pyplot as plt
    import seaborn as sns; sns.set_theme()
    from sklearn.metrics import precision_recall_fscore_support
    import time
    from tqdm import tqdm

We'll first define what a sample call to the OpenAI endpoint looks like. We'll pass in:

  • system_content that has information about how the LLM should behave.
  • assistant_content for any additional context it should have for answering our questions.
  • user_content that has our message or query to the LLM.
  • model should specify which specific model we want to send our request to.

We can pass all of this information in through the openai.ChatCompletion.create function (available in versions of the openai package before 1.0) to receive our response.
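
The messages argument is just a list of role/content dictionaries, so it can be assembled and inspected without calling the API. A minimal sketch follows, where `build_messages` is a hypothetical helper that is not part of the openai package. Unlike the lesson's code, which always sends the assistant message even when it is empty, this version skips roles with empty content.

```python
def build_messages(system_content="", assistant_content="", user_content=""):
    """Assemble the chat messages list, skipping roles with empty content."""
    candidates = [
        ("system", system_content),
        ("assistant", assistant_content),
        ("user", user_content),
    ]
    return [
        {"role": role, "content": content}
        for role, content in candidates
        if content
    ]

messages = build_messages(
    system_content="you only answer in rhymes",
    user_content="how are you",
)
print(messages)
```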

    # Query OpenAI endpoint
    system_content = "you only answer in rhymes"  # system content (behavior)
    assistant_content = ""  # assistant content (context)
    user_content = "how are you"  # user content (message)
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo-0613",
        messages=[
            {"role": "system", "content": system_content},
            {"role": "assistant", "content": assistant_content},
            {"role": "user", "content": user_content},
        ],
    )
    print(response.to_dict()["choices"][0].to_dict()["message"]["content"])

    I'm doing just fine, so glad you ask,
    Rhyming away, up to the task.
    How about you, my dear friend?
    Tell me how your day did ascend.

Now, let's create a function that can predict tags for a given sample.

    def get_tag(model, system_content="", assistant_content="", user_content=""):
        try:
            # Get response from OpenAI
            response = openai.ChatCompletion.create(
                model=model,
                messages=[
                    {"role": "system", "content": system_content},
                    {"role": "assistant", "content": assistant_content},
                    {"role": "user", "content": user_content},
                ],
            )
            predicted_tag = response.to_dict()["choices"][0].to_dict()["message"]["content"]
            return predicted_tag

        except (openai.error.ServiceUnavailableError, openai.error.APIError) as e:
            return None

    # Get tag
    model = "gpt-3.5-turbo-0613"
    system_context = f"""
        You are a NLP prediction service that predicts the label given an input's title and description.
        You must choose between one of the following labels for each input: {tags}.
        Only respond with the label name and nothing else.
        """
    assistant_content = ""
    user_context = "Transfer learning with transformers: Using transformers for transfer learning on text classification tasks."
    tag = get_tag(model=model, system_content=system_context, assistant_content=assistant_content, user_content=user_context)
    print(tag)

    natural-language-processing

Next, let's create a function that can predict tags for a list of inputs.

    # List of dicts w/ {title, description} (just the first 3 samples for now)
    samples = test_df[["title", "description"]].to_dict(orient="records")[:3]
    samples

    [{'title': 'Diffusion to Vector',
      'description': 'Reference implementation of Diffusion2Vec (Complenet 2018) built on Gensim and NetworkX. '},
     {'title': 'Graph Wavelet Neural Network',
      'description': 'A PyTorch implementation of "Graph Wavelet Neural Network" (ICLR 2019) '},
     {'title': 'Capsule Graph Neural Network',
      'description': 'A PyTorch implementation of "Capsule Graph Neural Network" (ICLR 2019).'}]

    def get_predictions(inputs, model, system_content, assistant_content=""):
        y_pred = []
        for item in tqdm(inputs):
            # Convert item dict to string
            user_content = str(item)

            # Get prediction
            predicted_tag = get_tag(
                model=model, system_content=system_content,
                assistant_content=assistant_content, user_content=user_content)

            # If error, try again after pause (repeatedly until success)
            while predicted_tag is None:
                time.sleep(30)  # could also do exponential backoff
                predicted_tag = get_tag(
                    model=model, system_content=system_content,
                    assistant_content=assistant_content, user_content=user_content)

            # Add to list of predictions
            y_pred.append(predicted_tag)

        return y_pred

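
The fixed 30-second pause could be swapped for exponential backoff, as the inline comment suggests. Below is a minimal sketch of that idea, assuming the wrapped call signals failure by returning None (as get_tag does). `retry_with_backoff` and its parameters are hypothetical and not part of the lesson's code. Note that, unlike the loop above, this version gives up after a fixed number of retries.

```python
import random
import time

def retry_with_backoff(fn, max_retries=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Call fn until it returns a non-None value, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        result = fn()
        if result is not None:
            return result
        if attempt == max_retries:
            break
        # Exponential backoff: base_delay, 2x, 4x, ... capped at max_delay,
        # plus up to 10% random jitter so retries don't all line up.
        delay = min(base_delay * 2 ** attempt, max_delay)
        sleep(delay + random.uniform(0, 0.1 * delay))
    return None

# Example with a stand-in for get_tag that fails twice before succeeding.
responses = iter([None, None, "mlops"])
waits = []  # record the requested delays instead of actually sleeping
tag = retry_with_backoff(lambda: next(responses), sleep=waits.append)
print(tag, len(waits))
```

Passing the sleep function in as a parameter keeps the helper easy to test without real delays. In practice it could wrap a call such as `lambda: get_tag(model=model, system_content=system_content, user_content=user_content)`.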
    # Get predictions for a list of inputs
    get_predictions(inputs=samples, model=model, system_content=system_context)

100%|██████████| 3/3 [00:01