Evaluating A.I.

Benchmarking tests the model against smaller proxy datasets to estimate its overall performance. A related approach is to use one LLM to grade another LLM's answers (LLM-as-a-judge).
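
As a rough sketch of the LLM-as-a-judge idea, the snippet below asks a grader model to score an answer from the model under test on a 1-10 scale. The OpenAI client, the judge model name, and the grading prompt are illustrative assumptions, not part of any particular benchmark.

```python
# Minimal LLM-as-a-judge sketch (judge model name and prompt are assumptions).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "question on a scale of 1-10 for helpfulness and correctness. "
    "Reply with only the number.\n\n"
    "Question: {question}\n\nAnswer: {answer}"
)

def judge(question: str, answer: str, judge_model: str = "gpt-4o") -> int:
    """Ask a grader LLM to score another model's answer."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,
    )
    # Real judges parse the reply more defensively; this assumes a bare number.
    return int(response.choices[0].message.content.strip())

# Example: score one (question, answer) pair produced by the model under test.
print(judge("What is the capital of France?", "Paris is the capital of France."))
```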

  1. HumanEval is a benchmark of hand-written programming problems for measuring the functional correctness of LLM-generated code, usually reported as pass@k (see the sketch after this list).
  2. The MBPP benchmark is designed to measure the ability of LLMs to synthesize short Python programs from natural-language descriptions.
  3. MT-bench is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models.
  4. AI2 Reasoning Challenge (ARC) is a more demanding “knowledge and reasoning” test, requiring more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
  5. The HellaSwag benchmark tests commonsense reasoning about physical situations: the model must complete a sentence by choosing the correct continuation from four options (a likelihood-scoring sketch also follows this list).
  6. Adversarial Filtering (AF) is a data collection paradigm used to create the HellaSwag dataset.
  7. The MMLU benchmark measures a model’s multitask accuracy. It covers 57 tasks, including elementary mathematics, computer science, and law, at difficulty levels ranging from elementary to advanced professional.
  8. The TruthfulQA benchmark measures whether a language model is truthful in generating answers to questions, i.e. whether it avoids reproducing common misconceptions.
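
HumanEval and MBPP results are usually reported as pass@k: generate n samples per problem, count the c that pass the unit tests, and estimate the probability that at least one of k samples would pass. The sketch below implements the unbiased estimator from the HumanEval paper; the sample counts in the usage lines are made up for illustration.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n - c, k) / C(n, k),
    computed stably as a running product."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Example: 200 samples per problem, 37 of them pass the unit tests.
print(pass_at_k(n=200, c=37, k=1))   # 0.185
print(pass_at_k(n=200, c=37, k=10))  # higher, since any of 10 tries may pass
```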
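
Multiple-choice benchmarks such as ARC, HellaSwag, and MMLU are often scored by comparing the model's likelihood of each candidate continuation rather than by free-form generation. Below is a minimal sketch using Hugging Face transformers; the model name and toy question are placeholders, and real harnesses add few-shot prompts and length normalization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder: substitute the model under evaluation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def option_logprob(context: str, option: str) -> float:
    """Sum of log-probabilities the model assigns to `option` after `context`.
    Assumes tokenizing context + option keeps the context tokens as a prefix,
    which typically holds for BPE tokenizers when the option starts with a space."""
    ctx_len = tokenizer(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(context + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # predictions for tokens 1..N-1
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(ctx_len - 1, full_ids.shape[1] - 1)
    )

# Toy multiple-choice item: pick the option the model scores as most likely.
question = "Question: The sun rises in the\nAnswer:"
options = [" east.", " west."]
scores = [option_logprob(question, o) for o in options]
print(options[scores.index(max(scores))])
```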

  1. HELM - https://github.com/stanford-crfm/helm - Holistic Evaluation of Language Models
  2. BIG-bench - https://github.com/google/BIG-bench - Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
  3. BigBIO - https://github.com/bigscience-workshop/biomedical - A Framework for Data-Centric Biomedical Natural Language Processing
  4. BigScience Evaluation - https://github.com/bigscience-workshop/evaluation
  5. Language Model Evaluation Harness - https://github.com/EleutherAI/lm-evaluation-harness - EleutherAI's framework for few-shot evaluation of language models (usage sketch after this list)
  6. Code Generation LM Evaluation Harness - https://github.com/bigcode-project/bigcode-evaluation-harness
  7. Chatbot Arena - https://github.com/lm-sys/FastChat
  8. GLUE - https://github.com/nyu-mll/jiant
  9. SuperGLUE - https://github.com/nyu-mll/jiant
  10. CLUE - https://github.com/CLUEbenchmark/CLUE
  11. CodeXGLUE - https://github.com/microsoft/CodeXGLUE
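
The EleutherAI harness listed above can run many of the benchmarks from the first list directly. The sketch below assumes its 0.4-style Python API (`simple_evaluate`); the model name is a placeholder, and the current interface is documented in the repository's README.

```python
# Sketch of running benchmarks with EleutherAI's lm-evaluation-harness.
# Assumes the 0.4.x Python API; check the repository README for the current interface.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                            # Hugging Face backend
    model_args="pretrained=gpt2",          # placeholder model
    tasks=["hellaswag", "arc_challenge"],  # benchmarks from the list above
    num_fewshot=0,
    batch_size=8,
)
print(results["results"])  # per-task accuracy and related metrics
```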
  
