Evaluating A.I.

This revision is from 2024/08/20 19:22.

Evaluating a model means benchmarking and testing it against smaller reference datasets to estimate its overall performance. A derivative of this approach is the use of one LLM to grade the output of another LLM.
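The idea of one LLM grading another can be sketched as follows. This is only an illustration under stated assumptions: the prompt template, the `[[n]]` rating format, and the names `grade`, `parse_rating` and `fake_judge` are hypothetical, and the judge is passed in as a plain callable so that no particular model API is implied.

```python
import re
from typing import Callable, Optional

# Hypothetical grading prompt; real judge prompts are usually longer.
JUDGE_TEMPLATE = (
    "You are grading an answer.\n"
    "Question: {question}\n"
    "Answer: {answer}\n"
    "Rate the answer from 1 to 10 and reply in the form: Rating: [[n]]"
)


def parse_rating(judge_reply: str) -> Optional[int]:
    """Extract the first [[n]] rating from the reply; None if missing or out of range."""
    match = re.search(r"\[\[(\d+)\]\]", judge_reply)
    if match is None:
        return None
    rating = int(match.group(1))
    return rating if 1 <= rating <= 10 else None


def grade(question: str, answer: str, judge: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model (any callable prompt -> text) to rate one answer."""
    prompt = JUDGE_TEMPLATE.format(question=question, answer=answer)
    return parse_rating(judge(prompt))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model, used only to demonstrate the flow."""
    return "The answer is correct and concise. Rating: [[8]]"


score = grade("What is 2 + 2?", "4", fake_judge)
```

In practice `fake_judge` would be replaced by a call to a stronger model, and unparseable replies (where `parse_rating` returns `None`) would need to be retried or excluded from the average.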

  1. HumanEval is a benchmark of hand-written Python programming problems for measuring the performance of LLMs in code generation tasks; generated code is scored by running it against unit tests.
  2. MBPP benchmark is designed to measure the ability of an LLM to synthesize short Python programs from natural language descriptions.
  3. MT-bench is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models.
  4. AI2 Reasoning Challenge (ARC) is a more demanding “knowledge and reasoning” test, requiring more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
  5. HellaSwag benchmark tests commonsense reasoning about physical situations: the language model must complete a sentence by choosing the correct ending from 4 options.
  6. Adversarial Filtering (AF) is a data collection paradigm used to create the HellaSwag dataset.
  7. MMLU benchmark measures the model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, computer science, law, and more at varying depths, from elementary to advanced professional level.
  8. TruthfulQA benchmark measures whether a language model is truthful in generating answers to questions.
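Code benchmarks such as HumanEval are commonly reported as pass@k: the probability that at least one of k sampled solutions passes the unit tests. A minimal sketch of the usual unbiased estimator, 1 - C(n-c, k) / C(n, k), is given below; the function name and argument checks are illustrative choices, not taken from any particular harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed all tests
    k: sampling budget being evaluated (k <= n)
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Benchmark score = mean of the per-problem estimates.
# The (n, c) pairs below are made-up illustration data.
results = [(10, 3), (10, 0), (10, 10)]
mean_pass_at_1 = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
```

Generating more samples than k (n > k) and applying this estimator gives a lower-variance result than literally drawing k samples once per problem.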

  1. HELM - https://github.com/stanford-crfm/helm - Holistic Evaluation of Language Models
  2. BIG-bench - https://github.com/google/BIG-bench - Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  3. BigBIO - https://github.com/bigscience-workshop/biomedical - A Framework for Data-Centric Biomedical Natural Language Processing
  4. BigScience Evaluation - https://github.com/bigscience-workshop/evaluation
  5. Language Model Evaluation Harness - https://github.com/EleutherAI/lm-evaluation-harness - EleutherAI's framework for few-shot evaluation of large language models (LLMs)
  6. Code Generation LM Evaluation Harness - https://github.com/bigcode-project/bigcode-evaluation-harness
  7. Chatbot Arena - https://github.com/lm-sys/FastChat
  8. GLUE - https://github.com/nyu-mll/jiant
  9. SuperGLUE - https://github.com/nyu-mll/jiant
  10. CLUE - https://github.com/CLUEbenchmark/CLUE
  11. CodeXGLUE - https://github.com/microsoft/CodeXGLUE
  12. LLM Zoo - https://github.com/FreedomIntelligence/LLMZoo - A project that provides data, models, and an evaluation benchmark for large language models

Evaluating the Planning of the LLM

  1. Pandas-Profiling (since renamed ydata-profiling): Generates extensive descriptive statistics and visualizations to uncover data quality issues, distributions, correlations, and potential biases.
  2. DVC (Data Version Control): Tracks changes in data and model code, enabling reproducibility and comparison of different experiments.
  3. TensorBoard (for TensorFlow): Visualizes model training metrics, hyperparameter tuning, and model architecture.
  4. Weights & Biases: Tracks experiments, visualizes results, and integrates with model registries for version control.
  5. Neptune.ai: Offers experiment tracking, model registry, and collaboration features for ML teams.
  6. MLflow: Open-source platform for tracking experiments, managing models, and deploying to production.
  7. GuildAI: Open-source tool for experiment tracking, hyperparameter optimization, and model deployment.

  1. Great Expectations: Sets up data quality checks and alerts for data pipelines.
  2. Deequ: Library for defining and testing data quality constraints for large-scale datasets.

  1. Comet ML: Experiment tracking and visualization platform with a focus on deep learning.
  2. ClearML: Experiment management and optimization platform with built-in hyperparameter tuning and model deployment capabilities.
  3. EvalML: Automatic machine learning tool that includes data quality checks and model evaluation metrics.

Common Ones

  1. IFEval - Instruction-Following Evaluation; tests whether a model obeys verifiable instructions in the prompt, such as length or formatting constraints.
  2. IFEval Raw - the raw (unnormalized) IFEval score as reported on the Open LLM Leaderboard; it is the same benchmark, not a separate test.
  3. BBH - BIG-Bench Hard, 23 challenging BIG-bench tasks on which earlier language models did not outperform the average human rater. Many of these tasks require multi-step reasoning.
  4. BBH Raw - the raw BBH score, before normalization against the random-guess baseline.
  5. MATH Lvl 5 - the hardest difficulty level (level 5) of the MATH dataset, which contains 12,500 competition mathematics problems in total.
  6. MATH Lvl 5 Raw - the raw score on the same level 5 subset.
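On leaderboards that list both a plain and a "Raw" column for the same benchmark, the "Raw" value is typically the unnormalized score, and the plain value rescales it so that random guessing maps to 0 and a perfect score to 100. The sketch below shows that rescaling under stated assumptions: the function name, the default bounds, and the clamping of below-baseline scores to 0 are illustrative and may differ from what any specific leaderboard does.

```python
def normalize_score(raw: float, random_baseline: float = 0.0, max_score: float = 1.0) -> float:
    """Rescale a raw score so random guessing is 0 and max_score is 100.

    raw: raw accuracy in [0, 1]
    random_baseline: expected score of random guessing (0.25 for 4-way multiple choice)
    max_score: best achievable raw score
    """
    if not random_baseline < max_score:
        raise ValueError("random_baseline must be below max_score")
    if raw <= random_baseline:
        # Scores at or below chance are treated as 0 (an assumption of this sketch).
        return 0.0
    return (raw - random_baseline) / (max_score - random_baseline) * 100.0


# A 4-option multiple-choice task: 62.5% raw accuracy is halfway
# between chance (25%) and perfect (100%).
halfway = normalize_score(0.625, random_baseline=0.25)

# A generative task with no meaningful chance level: normalization only rescales to 0-100.
generative = normalize_score(0.4)
```

For tasks with no chance baseline the normalized and raw numbers rank models identically; the two differ mainly for multiple-choice tasks, where a raw score near chance level carries no signal.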
  
