Evaluating A.I.
This revision is from 2024/08/20 19:22.
Evaluating a model means benchmarking and testing it against smaller, representative datasets to estimate its overall performance. A derivative of this approach is the use of one LLM to grade the output of another.
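The two ideas above, scoring against a small reference dataset and letting a second model act as grader, can be sketched as follows. This is only an illustration under stated assumptions: `toy_model`, `toy_judge`, and the three-item dataset are hypothetical stand-ins, not part of any real benchmark or library, and exact-match scoring is just one of several possible metrics.

```python
from typing import Callable, Iterable, List, Tuple


def exact_match_accuracy(model: Callable[[str], str],
                         dataset: Iterable[Tuple[str, str]]) -> float:
    """Fraction of prompts whose answer equals the reference
    after trimming whitespace and lowercasing."""
    total = 0
    correct = 0
    for prompt, reference in dataset:
        total += 1
        if model(prompt).strip().lower() == reference.strip().lower():
            correct += 1
    return correct / total if total else 0.0


def judged_score(model: Callable[[str], str],
                 judge: Callable[[str, str], float],
                 prompts: List[str]) -> float:
    """Mean score assigned by a judge callable.

    In practice the judge would be a second LLM prompted with a rubric;
    here it is any function mapping (prompt, answer) to a float in [0, 1].
    """
    scores = [judge(prompt, model(prompt)) for prompt in prompts]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical stand-ins so the sketch runs without any model API.
def toy_model(prompt: str) -> str:
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


def toy_judge(prompt: str, answer: str) -> float:
    # A real judge would assess quality; this one only penalizes refusals.
    return 0.0 if answer == "unknown" else 1.0


toy_dataset = [
    ("2+2", "4"),
    ("capital of France", "paris"),
    ("largest planet", "Jupiter"),
]

accuracy = exact_match_accuracy(toy_model, toy_dataset)
judge_mean = judged_score(toy_model, toy_judge, [p for p, _ in toy_dataset])
```

Keeping the model and the judge as plain callables means the same harness works whether they wrap a local model, a remote API, or a test stub.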
- HumanEval is a benchmark for measuring the performance of LLMs in code generation tasks; it consists of hand-written Python programming problems whose generated solutions are checked against unit tests.
- The MBPP benchmark is designed to measure the ability of LLMs to synthesize short Python programs from natural-language descriptions.
- MT-bench is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models.
- AI2 Reasoning Challenge (ARC) is a more demanding “knowledge and reasoning” test, requiring more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI.
- The HellaSwag benchmark tests commonsense reasoning about physical situations: the language model must complete a sentence by choosing the most plausible of four candidate endings.
- Adversarial Filtering (AF) is a data collection paradigm used to create the HellaSwag dataset.
- MMLU benchmark measures the model’s multitask accuracy. The test covers 57 tasks including elementary mathematics, computer science, law, and more at varying depths, from elementary to advanced professional level.
- The TruthfulQA benchmark measures whether a language model is truthful in generating answers to questions. (It should not be confused with TriviaQA, which is a trivia question-answering dataset.)
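Two scoring rules cover most of the benchmarks listed above. Code benchmarks in the style of HumanEval are commonly reported as pass@k, the probability that at least one of k sampled solutions passes the unit tests. Multiple-choice benchmarks such as HellaSwag and MMLU are reported as plain accuracy over the chosen options. The sketch below assumes the per-option scores (for example log-likelihoods) have already been computed by some model; the function names and sample numbers are illustrative, not taken from any benchmark's official harness.

```python
from math import comb
from typing import Sequence


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of samples generated, c: number that passed the tests,
    k: sample budget being evaluated. Computes 1 - C(n-c, k) / C(n, k).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every subset of size k must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def multiple_choice_accuracy(option_scores: Sequence[Sequence[float]],
                             gold_indices: Sequence[int]) -> float:
    """Accuracy when the model's choice is the highest-scoring option."""
    if len(option_scores) != len(gold_indices):
        raise ValueError("one gold index is needed per question")
    if not gold_indices:
        return 0.0
    correct = 0
    for scores, gold in zip(option_scores, gold_indices):
        predicted = max(range(len(scores)), key=lambda i: scores[i])
        if predicted == gold:
            correct += 1
    return correct / len(gold_indices)


# Illustrative numbers only: 10 samples for one problem, 3 of which passed.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)

# Two four-option questions with made-up log-likelihood scores.
mc_acc = multiple_choice_accuracy(
    [[-1.0, -0.5, -2.0, -3.0], [-0.2, -0.9, -1.5, -1.1]],
    [1, 2],
)
```

A benchmark-level pass@k is then the mean of `pass_at_k` over all problems.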