An Introduction
In this post, we’ll explore:
What a Transformer model is
The step-by-step build process of a transformer model
A detailed breakdown of the code, with explanations and the technologies used
What is a Transformer Model?
Transformer-based models are built on the transformer neural network architecture, which relies on attention rather than recurrence to model relationships within a sequence.
These models, first introduced in 2017, have become the foundation for language models like GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers).
The Financial Times has a fantastic explainer if you want to learn more.
How Transformers Differ from Other ML Models
Attention Mechanism: Unlike traditional sequential models (e.g., RNNs), transformers use a mechanism called "self-attention" to process all parts of the input simultaneously, allowing them to capture long-range dependencies more effectively (a minimal sketch follows this list).
Parallelisation: Transformers can process entire sequences in parallel, making them much faster to train compared to sequential models.
Scalability: They can be easily scaled to handle larger datasets and more complex tasks by increasing model size and computational resources.
Transfer Learning: Pre-trained transformer models can be fine-tuned for specific tasks with relatively small amounts of task-specific data, making them highly versatile.
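To make the self-attention idea concrete, here is a minimal, single-head sketch of scaled dot-product attention in PyTorch. The tensor sizes are arbitrary, and real transformers use several heads with learned projections inside every layer, so treat this purely as an illustration of the mechanism rather than production code.

import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch_size, seq_len, d_model); w_q/w_k/w_v: learned projection matrices
    q = x @ w_q                                      # queries
    k = x @ w_k                                      # keys
    v = x @ w_v                                      # values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # how much each token attends to every other token
    weights = F.softmax(scores, dim=-1)              # attention weights sum to 1 for each token
    return weights @ v                               # weighted mix of the value vectors

# Toy example: one sentence of 5 tokens with 8-dimensional embeddings (assumed sizes)
x = torch.randn(1, 5, 8)
w_q, w_k, w_v = (torch.randn(8, 8) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)   # torch.Size([1, 5, 8])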
Practical Example
Let's consider how a transformer model might process the sentence: "The cat sat on the mat because it was tired."
Input Processing:
The sentence is broken down into tokens (words or subwords).
Each token is converted into a numerical representation (embedding).
Self-Attention:
The model looks at each word in relation to every other word in the sentence.
For "it," the model might give high attention to "cat" and "tired," understanding that "it" refers to the cat and explains why it sat.
Contextual Understanding:
The model builds a rich contextual representation of each word, taking into account its relationships with other words in the sentence.
Output Generation:
If asked to complete the sentence "The cat sat on the mat because...," the model would use its understanding of context to generate a plausible ending, like "it was tired."
This process allows transformer models to understand and generate human-like text, translate between languages, answer questions, and perform various other language-related tasks with remarkable accuracy.
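If you want to see this behaviour for yourself, a pre-trained BERT model from Hugging Face can return its attention weights. The snippet below is exploratory rather than definitive: which layers and heads link "it" to "cat" varies by model, so treat the printout as something to poke at rather than a guaranteed result.

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased', output_attentions=True)

sentence = "The cat sat on the mat because it was tired."
inputs = tokenizer(sentence, return_tensors='pt')
tokens = tokenizer.convert_ids_to_tokens(inputs['input_ids'][0])

with torch.no_grad():
    outputs = model(**inputs)

# Average the attention across heads in the final layer, looking from the position of "it"
it_index = tokens.index('it')
attention = outputs.attentions[-1][0].mean(dim=0)   # (seq_len, seq_len)
for token, weight in zip(tokens, attention[it_index]):
    print(f"{token:>10}  {weight.item():.3f}")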
Why PyTorch?
PyTorch is a popular open-source machine learning library, known for its simplicity and power in building deep learning models.
In the context of LLMs, PyTorch provides the necessary tools and abstractions to handle the often complex mathematical operations, model architectures, and optimisation algorithms.
This allows developers to focus on high-level design rather than getting bogged down in low-level detail.
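As a tiny illustration of what PyTorch handles for you, autograd works out gradients automatically; that is why the training loop later in this post needs no hand-derived calculus. The numbers here are arbitrary.

import torch

# A toy "model": y = w * x + b, with parameters that require gradients
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)

loss = (w * x + b - 10.0) ** 2   # squared error against a target of 10
loss.backward()                  # autograd fills in w.grad and b.grad

print(w.grad, b.grad)            # gradients computed automatically: -18.0 and -6.0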
The Build Process
The best way to learn is to practise. The following steps will guide you through building your own transformer model.
A simple transformer model build consists of the following steps:
Tokenisation
Model Definition
Data Preparation for Training
Training the Model
Evaluation
Technologies Used
PyTorch: A machine learning framework for building and training models.
Hugging Face Transformers: A library of pre-trained NLP models and tokenisers.
Google Colab: A cloud-based Jupyter notebook environment.
Streamlit: A framework for creating interactive web applications.
Detailed Breakdown of the Code
1. Tokenisation
Explanation: Tokenising the text ensures that it is in a format that the model can process. This is crucial for training, as the numerical representations of the text (token IDs) are what the model uses to learn and make predictions.
A tokeniser is the first component in the chain and ingests the initial input into the model. This code initialises a BERT tokenizer, tokenises the input text "Hello, how are you?", and converts the tokens into numerical IDs.
from transformers import BertTokenizer
# Initialize the tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Example input text
text = "Hello, how are you?"
# Tokenize the input text
tokens = tokenizer.tokenize(text)
# Convert tokens to token IDs
token_ids = tokenizer.convert_tokens_to_ids(tokens)
print(tokens)
print(token_ids)
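As a quick sanity check (not part of the original snippet), the IDs can be turned back into text with tokenizer.decode, the same call used later to read the model's output.

# Convert the IDs back into text (round trip)
decoded = tokenizer.decode(token_ids)
print(decoded)   # "hello, how are you?" (BERT's uncased tokeniser lowercases the text)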
2. Model Definition
Explanation: The model definition lays the foundation for how the transformer processes the input data and generates the output.
It includes the embedding layer to convert token IDs into dense vectors, transformer layers to process these vectors, and a linear layer to generate the final output.
This code defines a simple transformer model with an embedding layer, transformer layers, and a final linear layer for generating the output. The forward method processes the source and target sequences through these layers.
import torch
import torch.nn as nn

# Define the transformer model with encoder and decoder
class SimpleTransformerWithDecoder(nn.Module):
    def __init__(self, vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers):
        super(SimpleTransformerWithDecoder, self).__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.transformer = nn.Transformer(d_model, nhead, num_encoder_layers, num_decoder_layers)
        self.fc = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt):
        # Scale the embeddings by sqrt(d_model), as in the original transformer paper
        src = self.embedding(src) * torch.sqrt(torch.tensor(self.d_model, dtype=torch.float32))
        tgt = self.embedding(tgt) * torch.sqrt(torch.tensor(self.d_model, dtype=torch.float32))
        src = src.permute(1, 0, 2)  # Transformer expects (seq_len, batch_size, d_model)
        tgt = tgt.permute(1, 0, 2)
        output = self.transformer(src, tgt)
        output = output.permute(1, 0, 2)  # Convert back to (batch_size, seq_len, d_model)
        output = self.fc(output)
        return output
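If you want to try the class out straight away, here is one way it might be instantiated. The hyperparameter values below are assumptions chosen to keep the demo small; the vocabulary size simply matches the BERT tokeniser initialised earlier.

# Hypothetical hyperparameters for a small demo model
vocab_size = tokenizer.vocab_size   # 30522 for bert-base-uncased
d_model = 128                       # embedding/hidden size (assumed)
nhead = 8                           # number of attention heads (assumed)
num_encoder_layers = 2              # assumed
num_decoder_layers = 2              # assumed

model = SimpleTransformerWithDecoder(vocab_size, d_model, nhead, num_encoder_layers, num_decoder_layers)

# Quick sanity check with a dummy batch of token IDs
dummy = torch.randint(0, vocab_size, (2, 10))   # (batch_size, seq_len)
print(model(dummy, dummy[:, :-1]).shape)        # torch.Size([2, 9, 30522])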
3. Data Preparation for Training
Explanation: Preparing the data ensures that it is in the correct format for training the model. This includes tokenising the text and creating batches of data for efficient training.
This code defines a TextDataset class that tokenises the input texts and prepares them for the model. It handles padding and truncation to ensure all sequences are of the same length.
from torch.utils.data import DataLoader, Dataset

# Define a simple dataset class
class TextDataset(Dataset):
    def __init__(self, texts, tokenizer, max_length):
        self.texts = texts
        self.tokenizer = tokenizer
        self.max_length = max_length

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = self.texts[idx]
        encoded_text = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            return_tensors='pt',
            truncation=True
        )
        return encoded_text['input_ids'].squeeze(), encoded_text['attention_mask'].squeeze()
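To feed this dataset into the training step that follows, you could wrap it in a DataLoader along these lines. The toy sentences, max_length and batch_size are placeholders for illustration, not part of the original code.

# Hypothetical toy corpus (placeholder data for illustration only)
texts = [
    "I love building models with PyTorch.",
    "Transformers process sequences in parallel.",
    "The cat sat on the mat because it was tired.",
]

max_length = 16   # assumed maximum sequence length
dataset = TextDataset(texts, tokenizer, max_length)
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)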
4. Training the Model
Explanation: Training the model allows it to learn from the input data. By adjusting its parameters, the model improves its ability to generate accurate and coherent text.
This step is essential for developing a model that can understand and generate human-like language.
This code defines a train function that trains the transformer model using the tokenised data. The optimiser updates the model parameters, and the loss function measures the error between the predicted and actual outputs.
import torch.optim as optim
import torch.nn.functional as F

# Define a function to train the model
def train(model, dataloader, num_epochs, learning_rate, vocab_size):
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    criterion = nn.CrossEntropyLoss()
    for epoch in range(num_epochs):
        model.train()
        for src, _ in dataloader:
            # Shift the sequence: predict token t+1 from the tokens up to t
            tgt_input = src[:, :-1]
            tgt_output = src[:, 1:]
            optimizer.zero_grad()
            output = model(src, tgt_input)
            loss = criterion(output.reshape(-1, vocab_size), tgt_output.reshape(-1))
            loss.backward()
            optimizer.step()
        print(f"Epoch {epoch+1}, Loss: {loss.item()}")

# Example usage
num_epochs = 10
learning_rate = 0.001
vocab_size = 30522  # Vocabulary size of bert-base-uncased
train(model, dataloader, num_epochs, learning_rate, vocab_size)
5. Evaluation and Inference
Explanation: Evaluating the model allows us to test its performance and ensure that it generates coherent and contextually appropriate text.
This step is crucial for validating the model's effectiveness and making necessary adjustments.
This code defines an evaluate function that generates predictions for new input text. The input text is tokenised, passed through the model, and the output tokens are decoded to produce the final text.
# Define a function to evaluate the model
def evaluate(model, tokenizer, text, max_length=10):
    model.eval()
    encoded_text = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=max_length,
        padding='max_length',
        return_tensors='pt',
        truncation=True
    )
    input_ids = encoded_text['input_ids']
    tgt_input = input_ids[:, :-1]
    with torch.no_grad():
        output = model(input_ids, tgt_input)
    output_tokens = output.argmax(dim=-1)
    decoded_output = tokenizer.decode(output_tokens[0], skip_special_tokens=True)
    return decoded_output

# Example usage
input_text = "I love"
predicted_text = evaluate(model, tokenizer, input_text)
print(f"Input Text: {input_text}")
print(f"Predicted Continuation: {predicted_text}")
What Else You Need to Consider
Other considerations if you were to make an app:
User Interface (UI): Create a user-friendly interface with Streamlit.
Visualisations: Use Matplotlib to visualise training loss and model performance (a rough sketch follows below).
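Here is a rough sketch of how those two pieces might fit together in a Streamlit app. The loss values are placeholders, and the prompt box would need wiring up to the evaluate function from earlier; treat it as a starting point rather than a finished UI.

import streamlit as st
import matplotlib.pyplot as plt

st.title("Build Your Own Transformer")

# Plot training loss per epoch (placeholder numbers; substitute the losses
# you record inside the train() loop)
losses = [2.9, 2.1, 1.6, 1.3, 1.1]
fig, ax = plt.subplots()
ax.plot(range(1, len(losses) + 1), losses)
ax.set_xlabel("Epoch")
ax.set_ylabel("Loss")
st.pyplot(fig)

# Prompt box: wire this up to the evaluate() function from the previous section
prompt = st.text_input("Enter a prompt", "I love")
st.write(f"Prompt received: {prompt}")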
An Example
The prototype below walks a user through the creation process of an LLM. It also includes additional functions to demonstrate what else is possible with libraries like Hugging Face.
Practical Considerations
Keep in mind:
Resources: Training requires substantial data and computing power.
Pre-trained Models: Fine-tuning existing models (e.g., from Hugging Face) is often more efficient.
Hyperparameters: Model performance heavily depends on settings like learning rate and model size. Experimentation is key.
These factors significantly impact the model's effectiveness and efficiency in real-world applications.
In Conclusion
This prototype demonstrates how to build a transformer model from scratch, using PyTorch to define, train, and evaluate it. Maybe you'll be the next OpenAI.
Now it’s your turn. Give it a try!
The future is not far away. Remember, it’s Almost Tomorrow.