Evaluation Benchmarks
Standardized datasets used to compare the general capabilities of language models.
Datasets
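- MMLU (Massive Multitask Language Understanding): Multiple-choice questions testing general knowledge across 57 subjects in STEM, the humanities, the social sciences, and more.
- GSM8K (Grade School Math 8K): Grade-school math word problems that require multi-step reasoning; commonly used to evaluate chain-of-thought prompting.
- HumanEval / MBPP (Mostly Basic Python Problems): Coding capabilities, measured by whether generated Python code passes unit tests.
- BIG-bench (Beyond the Imitation Game Benchmark): A large, collaboratively built collection of diverse tasks.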
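As a rough illustration of how a benchmark such as GSM8K might be scored, the sketch below extracts the final number from a model's chain-of-thought output and compares it with the reference answer. It relies on the GSM8K convention that each reference solution ends with `#### <answer>`. The function names (`extract_final_number`, `gsm8k_reference`, `exact_match_accuracy`) and the sample strings are hypothetical and made up for this example; real evaluation harnesses typically apply more careful answer normalization.

```python
import re

# Matches integers and decimals, optionally negative, with optional
# thousands separators (e.g. "18", "-3.5", "1,203").
NUMBER_PATTERN = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number found in text (commas removed), or None."""
    matches = NUMBER_PATTERN.findall(text)
    if not matches:
        return None
    return matches[-1].replace(",", "")


def gsm8k_reference(answer_field):
    """Return the final answer from a GSM8K-style solution ending in '#### <answer>'."""
    return answer_field.split("####")[-1].strip().replace(",", "")


def exact_match_accuracy(predictions, references):
    """Fraction of predictions that equal their reference exactly (string comparison)."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references)


# Made-up model outputs and reference solutions, for illustration only.
model_outputs = [
    "She sells 9 eggs at 2 dollars each. Adding the daily totals gives 18.",
    "The total cost is 1,203 dollars.",
    "I am not sure.",
]
reference_solutions = [
    "9 * 2 = 18 #### 18",
    "1000 + 203 = 1203 #### 1,203",
    "2 + 3 = 5 #### 5",
]

predictions = [extract_final_number(o) for o in model_outputs]
references = [gsm8k_reference(r) for r in reference_solutions]
print(f"accuracy = {exact_match_accuracy(predictions, references):.3f}")
```

Taking the last number in the output is a simple heuristic: it assumes the model states its final answer at the end of its reasoning, and it compares strings, so `18.0` and `18` would not match. Multiple-choice benchmarks such as MMLU can reuse `exact_match_accuracy` directly on predicted and reference option letters.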