Build an LLM from scratch
Large Language Models (LLMs) function primarily on the statistical relationships between words, phrases, and sentences in their training data. LLMs are trained on vast amounts of text and learn to predict the next word in a sentence given the previous words. This involves identifying patterns and relationships between words and sentences: the model relies on the frequency and co-occurrence of words to generate contextually appropriate responses. For example, if "dog" frequently appears with "bark," the model learns this association. The main objective during training is to master the statistical properties of language.
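To make the idea concrete, here is a minimal sketch (not a real LLM) of next-word prediction from co-occurrence counts; the toy corpus and the predict_next helper are illustrative only.
from collections import Counter, defaultdict

# Toy corpus; a real model is trained on billions of sentences
corpus = [
    "the dog barks at night",
    "a dog barks when excited",
    "the dog sleeps all day",
]

# Count which word follows which (bigram statistics)
following = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        following[prev][nxt] += 1

def predict_next(word):
    # Return the continuation seen most often in training, if any
    return following[word].most_common(1)[0][0] if following[word] else None

print(predict_next("dog"))  # -> "barks", the most frequent follower of "dog"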
Step 1: Plan and design the LLM
- Standard model design.
- The LLM has the ability to re-train itself, to hit the re-train button (no human required).
- The LLM is constantly re-fed its training data and told to rework and improve it, using prompt engineering to choose facts and statistics (no human required). New data is also added to a separate directory.
- The trainer is in the model. The LLM reworks its training code as well to produce a better model. Developing the trainer means developing the LLM's ability to distinguish differences correctly: better from worse, yes from no, successful compile versus errors, red from blue. Given two copies, the LLM must choose which is better and update its training data (see the sketch after this list).
- The demonstrator must be resource-light enough for the LLM to perform these tasks.
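A minimal sketch of such a self-improvement loop, assuming hypothetical generate_rework and is_better model methods and a retrain callable; none of these are real library APIs.
import random

def self_improvement_loop(model, training_data, retrain, rounds=10):
    """Sketch of the loop above; the model methods and retrain are hypothetical stand-ins."""
    for _ in range(rounds):
        idx = random.randrange(len(training_data))
        original = training_data[idx]
        reworked = model.generate_rework(original)   # model rewrites one sample (assumed method)
        # The model itself judges which copy is better: better/worse, compiles/errors, and so on
        if model.is_better(reworked, original):      # assumed evaluation method
            training_data[idx] = reworked
        model = retrain(model, training_data)        # assumed trainer entry point
    return model, training_data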
Step 2: Eval Space
Evaluation is key. At its most basic this is human evaluation, but more important are the tools that give the LLM the ability to test, proof, and rework training data. For instance, a code compiler returns an error or a successful compilation, providing an evaluation of code. The LLM has a tool to run its generated code, get an evaluation such as errors, and go back and work on it. Once the code passes compilation, it updates the training data. Perhaps a training-data compiler could do something similar. The aim is to provide the model with more and more tools to better rework its training data.
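As a concrete illustration, here is a sketch of a compile-check loop that uses Python's built-in compile function as the evaluator; model.generate_code and model.revise are assumed, hypothetical methods.
def compile_check_loop(model, task, training_data, max_attempts=5):
    """Use the compiler as an evaluator: rework generated code until it compiles."""
    code = model.generate_code(task)                      # hypothetical generation method
    for _ in range(max_attempts):
        try:
            compile(code, "<generated>", "exec")          # Python built-in syntax check
        except SyntaxError as err:
            code = model.revise(code, feedback=str(err))  # hypothetical revision method
            continue
        # The code passed compilation, so fold it back into the training data
        training_data.append({"task": task, "code": code})
        return code
    return None  # gave up after max_attempts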
Synthesize new data. There are factorial limits to the amount of data that can be synthesized (for example, eight-word sentences and every combination of them), but again the real question is evaluation. Perhaps a simulation space that mimics real-world physics could be a universal space for performing evaluation. The simulation space can bounce responses off physics and is designed to evaluate, ground, and improve synthetic data by testing it. For instance, we can put enough physics together to test wing designs. The LLM would update its training data with the improved wing designs and then debate the design with the user, arguing its decision with facts and figures.
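A rough sketch of that generate-simulate-keep loop; the simulate callable and the model's propose_wing_design method are placeholders, not real libraries.
def evaluate_in_simulation(model, simulate, training_data, n_candidates=20):
    """Score synthetic wing designs in a physics simulator and keep the best one."""
    best_design, best_score = None, float("-inf")
    for _ in range(n_candidates):
        design = model.propose_wing_design()   # hypothetical synthesis method
        score = simulate(design)               # hypothetical simulator, e.g. a lift-to-drag ratio
        if score > best_score:
            best_design, best_score = design, score
    # Only designs that survive the simulation are folded back into the training data
    training_data.append({"design": best_design, "score": best_score})
    return best_design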
The focus shifts from the model to the trainer program. A model is only as good as the sophistication of its evaluation.
Limitations: what is the current weather? The LLM cannot know this unless it is re-trained constantly, so function calls are used to supplement the LLM. For example, if I want to book a flight, the LLM connects to the API system of the flight operator and automates the booking using function calling. To know the weather at this moment, the LLM calls an authoritative server that relays the information. However, it is important to work within the system of model engineering and not revert to function calling as a fix.
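A minimal sketch of the function-calling pattern, assuming a hypothetical weather endpoint and a hypothetical model that can emit tool calls (request_tool_call and respond are not real library APIs).
import json
import urllib.request

def get_current_weather(city):
    # Hypothetical weather endpoint; substitute a real provider's API
    url = f"https://weather.example.com/v1/current?city={city}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)

TOOLS = {"get_current_weather": get_current_weather}

def answer_with_tools(model, question):
    """If the model decides it needs live data, run the tool call and feed the result back."""
    call = model.request_tool_call(question)   # hypothetical: returns {'name': ..., 'arguments': ...} or None
    if call is None:
        return model.respond(question)         # hypothetical plain-generation method
    result = TOOLS[call["name"]](**call["arguments"])
    return model.respond(question, tool_result=result)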
Make the LLM
- Get the training datasets. Sources: Common Crawl, Wikipedia, books, articles, forums, and public datasets (e.g., Project Gutenberg).
- Preprocess the dataset (a short sketch of these steps follows this list):
- Tokenization: Split text into tokens (words, subwords, or characters).
- Normalization: Lowercase text, remove special characters, handle contractions, etc.
- Filtering: Remove non-text content, duplicates, and overly long or short texts.
- Encoding: Convert tokens to numerical representations using a tokenizer.
- Choose an architecture for your LLM: transformer-based models (e.g., GPT, BERT). Parameters: define the model size (number of layers, heads, hidden units).
- Training
- Evaluation
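A small sketch of the preprocessing steps above (normalization, filtering, deduplication, and encoding) using the GPT-2 tokenizer; the length thresholds are illustrative.
import re
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def normalize(text):
    # Lowercase and strip special characters, keeping basic punctuation
    text = text.lower()
    return re.sub(r"[^a-z0-9 .,!?'-]", " ", text)

def keep(text, min_words=5, max_words=1000):
    # Filter out overly short or overly long texts (thresholds are illustrative)
    n = len(text.split())
    return min_words <= n <= max_words

def encode(texts):
    cleaned = [normalize(t) for t in texts if keep(t)]
    cleaned = list(dict.fromkeys(cleaned))         # remove duplicates, preserving order
    return [tokenizer.encode(t) for t in cleaned]  # convert tokens to numerical IDs

print(encode(["The quick brown dog barks at night!", "Too short."]))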
Use GPT2 tools:
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

# Define configuration
config = GPT2Config(
    vocab_size=50257,
    n_positions=1024,
    n_ctx=1024,
    n_embd=768,
    n_layer=12,
    n_head=12,
    n_inner=3072,
    activation_function='gelu',
    resid_pdrop=0.1,
    embd_pdrop=0.1,
    attn_pdrop=0.1,
    layer_norm_epsilon=1e-5,
    initializer_range=0.02,
)

# Initialize tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.add_special_tokens({'pad_token': '[PAD]'})

# Prepare dataset
def load_dataset(file_path, tokenizer):
    return TextDataset(
        tokenizer=tokenizer,
        file_path=file_path,
        block_size=128,
    )

train_dataset = load_dataset("path/to/train.txt", tokenizer)
val_dataset = load_dataset("path/to/val.txt", tokenizer)

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer,
    mlm=False,
)

# Initialize model
model = GPT2LMHeadModel(config)
model.resize_token_embeddings(len(tokenizer))

# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_dir='./logs',
)

# Create trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
)
trainer.train()

# Save the model
model.save_pretrained("./trained_model")
tokenizer.save_pretrained("./trained_model")
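After training, a quick sanity check is to load the saved model and sample a continuation; the prompt below is just an example.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("./trained_model")
tokenizer = GPT2Tokenizer.from_pretrained("./trained_model")
model.eval()

prompt = "The purpose of evaluation is"  # example prompt
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=40,
        do_sample=True,
        top_p=0.9,
        pad_token_id=tokenizer.pad_token_id,
    )
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))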
Another, from scratch:
import torch
import torch.nn as nn
from transformers import PreTrainedModel, PretrainedConfig, Trainer, TrainingArguments
from datasets import load_dataset
from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Define the model architecture
class NewLM(PreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        # Note: this minimal architecture has no positional embeddings
        self.embedding = nn.Embedding(config.vocab_size, config.hidden_size)
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(
                d_model=config.hidden_size,
                nhead=config.num_heads,
                dim_feedforward=config.intermediate_size,
                dropout=config.hidden_dropout_prob
            ),
            num_layers=config.num_hidden_layers
        )
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size)

    def forward(self, input_ids, attention_mask=None, labels=None):
        x = self.embedding(input_ids)
        x = x.permute(1, 0, 2)  # nn.TransformerEncoder expects (seq_len, batch, hidden)
        seq_len = x.size(0)
        # Causal mask so each position can only attend to earlier positions
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        # src_key_padding_mask marks padded positions with True
        key_padding_mask = (attention_mask == 0) if attention_mask is not None else None
        x = self.transformer(x, mask=causal_mask, src_key_padding_mask=key_padding_mask)
        x = x.permute(1, 0, 2)  # back to (batch, seq_len, hidden)
        logits = self.lm_head(x)
        if labels is None:
            return {"logits": logits}
        # Shift so each position predicts the next token; padded labels (-100) are ignored
        shift_logits = logits[:, :-1, :].contiguous()
        shift_labels = labels[:, 1:].contiguous()
        loss = nn.CrossEntropyLoss()(shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1))
        return {"loss": loss, "logits": logits}

# Create a custom configuration
class NewLMConfig(PretrainedConfig):
    model_type = "new_lm"

    def __init__(
        self,
        vocab_size=30000,
        hidden_size=256,
        num_hidden_layers=6,
        num_heads=8,
        intermediate_size=1024,
        hidden_dropout_prob=0.1,
        max_position_embeddings=512,
        **kwargs
    ):
        super().__init__(**kwargs)
        self.vocab_size = vocab_size
        self.hidden_size = hidden_size
        self.num_hidden_layers = num_hidden_layers
        self.num_heads = num_heads
        self.intermediate_size = intermediate_size
        self.hidden_dropout_prob = hidden_dropout_prob
        self.max_position_embeddings = max_position_embeddings

# Train tokenizer
def train_tokenizer(texts):
    tokenizer = Tokenizer(models.BPE())
    tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
    tokenizer.train_from_iterator(texts, trainer)
    # Register the special tokens so that padding and masking work on the fast tokenizer
    return PreTrainedTokenizerFast(
        tokenizer_object=tokenizer,
        pad_token="[PAD]",
        unk_token="[UNK]",
        cls_token="[CLS]",
        sep_token="[SEP]",
        mask_token="[MASK]",
    )

# Load and preprocess data
dataset = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)  # drop empty lines

texts = dataset["text"]

# Train tokenizer
tokenizer = train_tokenizer(texts)

# Tokenize dataset
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, max_length=512, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True, remove_columns=dataset.column_names)

# Initialize model
config = NewLMConfig(vocab_size=len(tokenizer))
model = NewLM(config)

# Set training arguments
training_args = TrainingArguments(
    output_dir="./results",
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=8,
    save_steps=10_000,
    save_total_limit=2,
    prediction_loss_only=True,
    logging_dir='./logs',
)

# Define data collator
def data_collator(features):
    input_ids = torch.stack([torch.tensor(f["input_ids"]) for f in features])
    attention_mask = torch.stack([torch.tensor(f["attention_mask"]) for f in features])
    labels = input_ids.clone()
    labels[attention_mask == 0] = -100  # ignore padded positions in the loss
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Create trainer and train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
    data_collator=data_collator,
)
trainer.train()

# Save the model and tokenizer
model.save_pretrained("./new_lm")
tokenizer.save_pretrained("./new_lm")
Rework Training Data
- Load the data.
- Initialize the LLM.
- Create a loop to process the data.
- In each iteration, select a random piece of data.
- Use the LLM to generate a new version of the data.
- Replace the original data with the generated data.
- Repeat until all data has been processed.
import random
import torch
from transformers import BertTokenizer, BertForMaskedLM

# Load your data
data = ["example sentence 1", "example sentence 2", ...]

# Initialize the LLM (BERT)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased").to(device)
model.eval()

# Loop through the data
for i in range(len(data)):
    # Select a random piece of data
    idx = random.randint(0, len(data) - 1)
    input_text = data[idx]

    # Tokenize the input text and mask a random (non-special) token
    inputs = tokenizer(input_text, return_tensors="pt").to(device)
    candidates = [j for j, token in enumerate(inputs["input_ids"][0]) if token.item() not in tokenizer.all_special_ids]
    masked_index = random.choice(candidates)
    inputs["input_ids"][0][masked_index] = tokenizer.mask_token_id

    # Generate a new version of the data
    with torch.no_grad():
        outputs = model(**inputs)
    predictions = outputs.logits
    predicted_index = torch.argmax(predictions[0, masked_index]).item()

    # Rebuild the text from tokens, with the predicted token at the masked position
    new_ids = inputs["input_ids"][0].clone()
    new_ids[masked_index] = predicted_index
    new_text = tokenizer.decode(new_ids, skip_special_tokens=True)

    # Replace the original data with the generated data
    data[idx] = new_text

# Print the modified data
print(data)
When complete, retrain the model and repeat. After the loop exits, run the model training script. Turn model loading into a function so the new model can be reloaded, and repeat endlessly. Throw new training data into the directory it uses, or keep two directories, an orig directory and a rework directory.
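One way to wire this together, as a sketch: load_model, rework_data, and train_model stand for the loading, rework, and training code above and are hypothetical wrappers, not real library calls.
from pathlib import Path

ORIG_DIR = Path("data/orig")       # new training data gets dropped here
REWORK_DIR = Path("data/rework")   # the loop writes reworked data here

def endless_retrain_cycle(load_model, rework_data, train_model):
    """Sketch of the endless cycle; the three callables wrap the scripts above."""
    model = load_model("./trained_model")           # hypothetical wrapper around the loading code
    while True:
        rework_data(model, ORIG_DIR, REWORK_DIR)    # hypothetical wrapper around the BERT rework loop
        train_model(REWORK_DIR, "./trained_model")  # hypothetical wrapper around the training script
        model = load_model("./trained_model")       # reload the freshly trained model and repeat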
RefinedWeb is a massive dataset; here is how to train a Large Language Model (LLM) with it.
Prerequisites:
- Hardware: You'll need a powerful machine with a large GPU (e.g., NVIDIA V100 or A100) and sufficient GPU memory (at least 16 GB).
- Software: Install the following:
- Python 3.8 or later
- PyTorch 1.11 or later
- Hugging Face Transformers library (e.g., transformers==4.12.0)
- datasets library (e.g., datasets==1.18.0)
- Dataset: Download the RefinedWeb dataset (around 600 billion tokens in the public release); it is available on the Hugging Face Hub as tiiuae/falcon-refinedweb.
Procedure:
Step 1: Prepare the dataset
- Unzip the RefinedWeb dataset and store it in a directory (e.g., refinedweb_data).
- Use the datasets library to load the dataset:
import datasets
# The public RefinedWeb release on the Hugging Face Hub is 'tiiuae/falcon-refinedweb';
# its text column is named 'content'
dataset = datasets.load_dataset('tiiuae/falcon-refinedweb', split='train')
Step 2: Preprocess the data
- Tokenize the dataset using a tokenizer (e.g., BertTokenizer from Hugging Face):
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def tokenize_function(examples):
    # RefinedWeb stores its text in the 'content' column
    return tokenizer(examples['content'], truncation=True, max_length=512)
dataset = dataset.map(tokenize_function, batched=True)
Step 3: Create a custom dataset class
- Create a custom dataset class to handle the RefinedWeb dataset:
import torch

class RefinedWebDataset(torch.utils.data.Dataset):
    def __init__(self, dataset, tokenizer):
        self.dataset = dataset
        self.tokenizer = tokenizer

    def __getitem__(self, idx):
        # The dataset was already tokenized in Step 2; RefinedWeb is unlabeled web
        # text, so language-modeling labels are created by the data collator below
        example = self.dataset[idx]
        return {
            'input_ids': example['input_ids'],
            'attention_mask': example['attention_mask'],
        }

    def __len__(self):
        return len(self.dataset)
Step 4: Create a data loader
- Create a data loader from the custom dataset class, using a masking collator to produce language-modeling labels:
from transformers import DataCollatorForLanguageModeling

# The collator pads each batch and applies random masking to create MLM labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)
batch_size = 32
data_loader = torch.utils.data.DataLoader(
    RefinedWebDataset(dataset, tokenizer),
    batch_size=batch_size,
    shuffle=True,
    collate_fn=data_collator,
)
Step 5: Define the model and optimizer
- Define a PyTorch model (e.g., a transformer-based architecture like BERT or RoBERTa) with a language-modeling head, since RefinedWeb has no classification labels:
import torch.nn as nn
import torch.optim as optim
from transformers import BertForMaskedLM

class MyLLM(nn.Module):
    def __init__(self):
        super(MyLLM, self).__init__()
        # A masked-language-modeling head suits the unlabeled RefinedWeb text
        self.transformer = BertForMaskedLM.from_pretrained('bert-base-uncased')

    def forward(self, input_ids, attention_mask=None, labels=None):
        # The underlying model returns the LM loss when labels are provided
        return self.transformer(input_ids=input_ids, attention_mask=attention_mask, labels=labels)

model = MyLLM()
optimizer = optim.Adam(model.parameters(), lr=1e-5)
Step 6: Train the model
- Train the model using the data loader and optimizer:
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

for epoch in range(5):  # Train for 5 epochs
    model.train()
    total_loss = 0
    for batch in data_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        # The collator supplies 'labels', so the model returns the masked-LM loss directly
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"epoch {epoch}: average loss {total_loss / len(data_loader):.4f}")
You may need to adjust the hyperparameters, model architecture, and training procedure based on your specific requirements. Training a large language model can take several days or even weeks.