Benchmarks

Evaluation Benchmarks

Standardized datasets used to compare the general capabilities of language models on a common footing.

Datasets

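  • MMLU (Massive Multitask Language Understanding): Multiple-choice questions testing general knowledge across 57 subjects (STEM, humanities, social sciences, etc.).
  • GSM8K: Grade-school math word problems. Commonly used to test multi-step, Chain-of-Thought reasoning.
  • HumanEval / MBPP: Coding capabilities. Generated Python programs are checked against test cases.
  • BIG-bench: A large, collaboratively built collection of diverse tasks.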
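
As a rough illustration of how a score on a benchmark like GSM8K might be computed, the sketch below compares the final number in each model output against the final number in the reference answer and reports accuracy. The function names (extract_final_number, exact_match_accuracy) and the sample strings are hypothetical, and real evaluation harnesses apply more careful answer normalization than this.

```python
import re

# Matches integers and decimals, allowing thousands separators (e.g. "1,500").
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in the text as a float, or None."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def exact_match_accuracy(predictions, references):
    """Fraction of predictions whose final number equals the reference's."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not references:
        return 0.0
    correct = 0
    for pred, ref in zip(predictions, references):
        expected = extract_final_number(ref)
        if expected is not None and extract_final_number(pred) == expected:
            correct += 1
    return correct / len(references)


# Made-up examples; GSM8K reference solutions end with "#### <answer>".
predictions = [
    "She has 5 + 7 = 12 apples. The answer is 12.",
    "So the total is 1,500 dollars.",
    "I think it is 8.",
]
references = ["#### 12", "#### 1500", "#### 9"]

print(exact_match_accuracy(predictions, references))
```

Taking the last number in the output is a simplification: it lets the model write out its reasoning first, but it can misread outputs that mention another number after the answer.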