Step 1: Plan and design the LLM
Standard model design
The LLM has the ability to retrain itself, i.e. to hit the retrain button on its own (no human required).
The LLM is continually re-fed its own training data and told to rework and improve it, using prompt engineering to select facts and statistics (no human required). New data is also added to a separate directory.
The trainer is part of the model. The LLM also reworks its training code to produce a better model. Developing the trainer means improving the LLM's ability to distinguish correctly: better from worse, yes from no, a successful compile from errors, red from blue.
The demonstrator must be resource-light enough for the LLM to perform these tasks itself.
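The self-administered retrain loop of Step 1 can be sketched as below. Every name here (rework_data, train, retrain_loop) is a hypothetical stand-in; no real LLM training happens in this toy, it only shows the shape of the loop.

```python
# Minimal sketch of the self-administered retrain loop; all functions are
# hypothetical stand-ins, not a real training pipeline.

def rework_data(data):
    # Stand-in for the LLM rewriting its own training data via prompt
    # engineering: here it just strips whitespace and drops empty entries.
    return [text.strip() for text in data if text.strip()]

def train(data):
    # Stand-in for the trainer; the toy "model" is scored by usable examples.
    return {"score": len([t for t in data if t.strip()])}

def retrain_loop(data, rounds=3):
    model = train(data)
    for _ in range(rounds):
        candidate_data = rework_data(data)
        candidate = train(candidate_data)
        # The trainer's core skill: judging better from worse, and keeping
        # the candidate only when it is at least as good.
        if candidate["score"] >= model["score"]:
            model, data = candidate, candidate_data
    return model
```

The key design point is the comparison step: the loop only commits a reworked dataset when the trainer judges it no worse than the current one.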
Step 2: Eval Space
The tools that give the LLM the ability to test, proof, and rework training data: for instance, a code compiler or a training-data compiler. A-versus-B thinking.
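One concrete form of this A-versus-B thinking, sketched here under the assumption that the training data is code: of two candidate snippets, prefer the one that compiles over the one that raises a syntax error.

```python
# A-versus-B check for code training data: a successful compile beats errors.

def compiles_ok(source: str) -> bool:
    try:
        compile(source, "<candidate>", "exec")
        return True
    except SyntaxError:
        return False

def pick_better(a: str, b: str) -> str:
    # Keep candidate A only if it compiles; otherwise fall back to B.
    return a if compiles_ok(a) else b
```

For prose training data the same pattern applies, with the compiler swapped for whatever checker (fact lookup, scoring model) the eval space provides.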
Make the LLM
Get the training datasets: Sources: Common Crawl, Wikipedia, books, articles, forums, public datasets (e.g., Project Gutenberg).
Preprocess the dataset:
Tokenization: Split text into tokens (words, subwords, or characters).
Normalization: Lowercase text, remove special characters, handle contractions, etc.
Filtering: Remove non-text content, duplicates, and overly long or short texts.
Encoding: Convert tokens to numerical representations using a tokenizer.
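The four preprocessing steps above can be sketched as a toy pipeline. Tokenization here is a naive whitespace split and the vocabulary is built on the fly; a real pipeline would use a subword tokenizer.

```python
import re

# Toy preprocessing pipeline: normalize -> filter -> tokenize -> encode.

def preprocess(texts, min_len=5, max_len=1000):
    # Normalization: lowercase and strip special characters.
    cleaned = (re.sub(r"[^a-z0-9\s']", " ", t.lower()) for t in texts)
    # Filtering: remove duplicates and overly short or long texts.
    seen, kept = set(), []
    for t in cleaned:
        t = " ".join(t.split())
        if min_len <= len(t) <= max_len and t not in seen:
            seen.add(t)
            kept.append(t)
    # Tokenization: split each text into word tokens.
    tokenized = [t.split() for t in kept]
    # Encoding: convert tokens to numerical ids with a growing vocabulary.
    vocab = {}
    encoded = [[vocab.setdefault(tok, len(vocab)) for tok in toks]
               for toks in tokenized]
    return encoded, vocab
```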
Choose architecture for your LLM. Transformer-based models (e.g., GPT, BERT), Parameters: Define model size (number of layers, heads, hidden units).
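To help choose the model size, a rough parameter-count estimate for a GPT-style transformer can be computed from the layer count and hidden size. This formula is an approximation: it ignores biases and layer norms, and the vocabulary and context defaults below are GPT-2's.

```python
# Rough parameter count for a GPT-style transformer (biases and layer
# norms ignored; defaults are GPT-2's vocabulary and context length).

def gpt_param_estimate(n_layer, n_embd, vocab_size=50257, n_ctx=1024):
    embeddings = vocab_size * n_embd + n_ctx * n_embd
    # Per layer: attention projections (~4*d^2) + MLP (~8*d^2).
    per_layer = 12 * n_embd ** 2
    return embeddings + n_layer * per_layer
```

For GPT-2 small (12 layers, 768 hidden units) this lands near the published ~124M parameters, which is a useful sanity check when defining your own sizes.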
Training
Evaluation
Use GPT-2 tools:

from transformers import GPT2Config, GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments, TextDataset, DataCollatorForLanguageModeling

# data is the list of training texts; new_text is the model-generated rework of data[idx]
# Replace the original data with the generated data
data[idx] = new_text
# Print the modified data
print(data)
When the rework is complete, retrain the model and repeat. After the loop exits, run the model-training script. Turn model loading into a function, reload the new model, and repeat endlessly. Place new training data in the directory the trainer uses, or keep two directories: an orig directory and a rework directory.
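The two-directory endless cycle might be wired up as below. The directory names and load_model are hypothetical placeholders, not a real training script.

```python
from pathlib import Path

# Sketch of the two-directory, endless retrain cycle; paths and load_model
# are placeholder assumptions, not a real pipeline.

ORIG_DIR = Path("data/orig")      # original training data
REWORK_DIR = Path("data/rework")  # LLM-reworked or newly synthesized data

def load_model(checkpoint="model/latest"):
    # Model loading as a function, so each cycle can reload the new model.
    return {"checkpoint": checkpoint}

def training_cycle():
    # Gather texts from both directories (missing directories are skipped).
    texts = [p.read_text() for d in (ORIG_DIR, REWORK_DIR) if d.exists()
             for p in d.glob("*.txt")]
    # ... the model-training script would run on `texts` here ...
    return load_model()

# The endless loop itself would be: while True: model = training_cycle()
```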
Provide the model with more tools and means to better rework its training data, or even to synthesize new data.
A simulation space which mimics real-world physics could serve as a universal space for performing evaluation, since the direction of the models is self-administered by the LLM; drop them into an NPC space such as https://github.com/AkshitIreddy/Interactive-LLM-Powered-NPCs
The simulation space based on real world physics and threads from the real world is designed to ground the evaluation and spur synthetic data creation.
The focus shifts from the model to the trainer program.