Step 1: Plan and design the LLM
Standard model design
The LLM has the ability to re-train itself, i.e. to hit the re-train button on its own (no human required).
The LLM constantly re-works its training data in order to improve it (no human required).
Note: focus on the LLM's ability to distinguish correctly between alternatives: better from worse, yes from no, a successful compile from errors, red from blue, and so on.
The demonstrator must be resource-light enough for the LLM to perform these tasks.
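A minimal Python sketch of this loop, assuming the demonstrator supplies three callables (generate_candidates, passes_eval, retrain); these names are hypothetical hooks for illustration, not an existing API:

def self_improvement_cycle(model, training_data, generate_candidates, passes_eval, retrain, eval_tools, rounds=3):
    # Hypothetical sketch: the three callables stand in for the demonstrator's own components.
    for _ in range(rounds):
        # 1. The LLM re-works / extends its own training data.
        candidates = generate_candidates(model, training_data)
        # 2. The eval space separates better from worse, e.g. a successful compile from errors.
        accepted = [c for c in candidates if passes_eval(c, eval_tools)]
        # 3. Accepted examples extend the dataset.
        training_data = training_data + accepted
        # 4. The LLM hits its own re-train button, no human required.
        model = retrain(model, training_data)
    return model, training_data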
Step 2: Eval Space
The tools that give the LLM the ability to test, verify, and rework training data, for instance a code compiler or a training data compiler.
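As one concrete example of such a tool, the sketch below uses Python's built-in compile() as a "code compiler" that labels a candidate training example as compiling successfully or not; the sample snippets are made up.

def compiles_cleanly(code_text):
    """Return True if the candidate snippet compiles without a SyntaxError."""
    try:
        compile(code_text, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

print(compiles_cleanly("def add(a, b):\n    return a + b"))  # True
print(compiles_cleanly("def add(a, b) return a + b"))        # False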
Step 3: Make the LLM
Get the training datasets. Sources: Common Crawl, Wikipedia, books, articles, forums, and public datasets (e.g., Project Gutenberg).
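For example, one public dataset can be pulled with the Hugging Face datasets library; wikitext is used here purely as a stand-in for the sources listed above.

from datasets import load_dataset

raw = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
raw_texts = [r["text"] for r in raw if r["text"].strip()]  # drop empty records
print(raw_texts[0][:200])                                   # peek at the first document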
Preprocess the dataset (a code sketch follows after this list):
Tokenization: Split text into tokens (words, subwords, or characters).
Normalization: Lowercase text, remove special characters, handle contractions, etc.
Filtering: Remove non-text content, duplicates, and overly long or short texts.
Encoding: Convert tokens to numerical representations using a tokenizer.
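A sketch of that pipeline, applied to the raw_texts list loaded above; the normalization rules and length thresholds are arbitrary illustrative choices.

import re
from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def normalize(text):
    text = text.lower()                               # lowercase
    text = re.sub(r"[^a-z0-9\s.,!?'\"-]", " ", text)  # strip special characters
    return re.sub(r"\s+", " ", text).strip()          # collapse whitespace

def preprocess(raw_texts):
    texts = [normalize(t) for t in raw_texts]
    texts = [t for t in dict.fromkeys(texts) if 20 <= len(t) <= 2000]  # dedupe, drop too short/long
    return [tokenizer.encode(t) for t in texts]       # tokens -> numerical IDs

encoded = preprocess(raw_texts)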
Choose an architecture for your LLM: transformer-based models (e.g., GPT, BERT). Parameters: define the model size (number of layers, attention heads, hidden units).
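A sketch of defining a small GPT-style model from scratch with Hugging Face transformers; the sizes chosen here are illustrative only, not a recommendation.

from transformers import GPT2Config, GPT2LMHeadModel

config = GPT2Config(
    vocab_size=50257,   # matches the GPT-2 tokenizer
    n_positions=512,    # maximum sequence length
    n_layer=6,          # number of transformer layers
    n_head=8,           # attention heads per layer
    n_embd=512,         # hidden size (must be divisible by n_head)
)
model = GPT2LMHeadModel(config)
print(f"{model.num_parameters():,} parameters")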
Training: train the model on the encoded dataset.
Evaluation: measure loss / perplexity on held-out data (a full script sketch follows after the import line below).
# Hugging Face building blocks used in the sketch below
from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling
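A minimal end-to-end sketch wiring these imports into a training and evaluation run; the file paths train.txt / eval.txt and all hyperparameters are placeholders, and TextDataset is deprecated in recent transformers releases but still works for a small demo.

import math

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

# Placeholder plain-text files holding the preprocessed training and held-out text.
train_dataset = TextDataset(tokenizer=tokenizer, file_path="train.txt", block_size=128)
eval_dataset = TextDataset(tokenizer=tokenizer, file_path="eval.txt", block_size=128)
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

model = GPT2LMHeadModel(GPT2Config(n_layer=6, n_head=8, n_embd=512))

training_args = TrainingArguments(
    output_dir="llm-demo",
    num_train_epochs=1,
    per_device_train_batch_size=4,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)

trainer.train()
metrics = trainer.evaluate()                      # reports eval_loss on the held-out set
print("perplexity:", math.exp(metrics["eval_loss"]))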