# MMLU

This is a large-scale, multi-task assessment composed of multiple-choice questions from a wide range of knowledge domains. The test covers the humanities, the social sciences, the hard sciences, and other important areas of study, spanning 57 tasks including basic mathematics, American history, computer science, and law. To achieve high accuracy on this test, a model must possess broad world knowledge and strong problem-solving ability. Dataset Link

## Experimental Setup

- Split: `test`
- Total number: 13985
- 0-shot
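
The runs below were produced with evalscope. For reference, a 0-shot MMLU evaluation can be launched roughly as follows. This is a minimal sketch assuming a recent evalscope release: the model ID is only a placeholder, and the exact `TaskConfig` fields may differ across versions.

```python
# Minimal sketch of a 0-shot MMLU run with evalscope.
# Assumes a recent evalscope release; the model ID is a placeholder.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen/Qwen-7B-Chat',        # placeholder ModelScope model ID
    datasets=['mmlu'],                # the MMLU benchmark adapter
    dataset_args={
        'mmlu': {'few_shot_num': 0},  # 0-shot, as in the first table below
    },
)

run_task(task_cfg=task_cfg)
```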

## Experimental Results

| Model | Revision | Precision | Humanities | STEM | Social Science | Other | Weighted Avg | Target | Delta |
|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4111 | 0.3807 | 0.5233 | 0.504 | 0.4506 | - | - |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4439 | 0.374 | 0.5524 | 0.5458 | 0.4762 | - | - |
| chatglm2-6b | v1.0.12 | fp16 | 0.3834 | 0.3413 | 0.4708 | 0.4445 | 0.4077 | 0.4546 (CoT) | -4.69% |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5435 | 0.5087 | 0.7227 | 0.6471 | 0.5992 | 0.614 | -1.48% |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4005 | 0.3547 | 0.4953 | 0.4796 | 0.4297 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.4371 | 0.3887 | 0.5579 | 0.5437 | 0.4778 | - | - |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3146 | 0.3037 | 0.4134 | 0.3885 | 0.3509 | - | - |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.5326 | 0.5397 | 0.7184 | 0.6859 | 0.6102 | - | - |
| Qwen-7B | v1.1.6 | bf16 | 0.387 | 0.4 | 0.5403 | 0.5139 | 0.4527 | - | - |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4322 | 0.4277 | 0.6088 | 0.5778 | 0.5035 | - | - |
- Target: the officially declared score of the model on this dataset
- Delta: the difference between the weighted average score and the target score
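
The Delta column is just that subtraction, expressed in percentage points. For example, for chatglm2-6b in the 0-shot table above:

```python
# Delta for chatglm2-6b (0-shot), using the figures from the table above.
weighted_avg = 0.4077  # evalscope weighted average over the 57 MMLU tasks
target = 0.4546        # officially declared score (CoT)

delta = weighted_avg - target  # -0.0469
print(f'{delta:+.2%}')         # prints: -4.69%
```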

Settings: (Split: `test`, Total number: 13985, 5-shot)

| Model | Revision | Precision | Humanities | STEM | Social Science | Other | Weighted Avg | Avg | Target | Delta |
|---|---|---|---|---|---|---|---|---|---|---|
| Baichuan2-7B-Base | v1.0.2 | fp16 | 0.4295 | 0.398 | 0.5736 | 0.5325 | 0.4781 | 0.4918 | 0.5416 (official) | -4.98% |
| Baichuan2-7B-Chat | v1.0.4 | fp16 | 0.4344 | 0.3937 | 0.5814 | 0.5462 | 0.4837 | 0.5029 | 0.5293 (official) | -2.64% |
| chatglm2-6b | v1.0.12 | fp16 | 0.3941 | 0.376 | 0.4897 | 0.4706 | 0.4288 | 0.4442 | - | - |
| chatglm3-6b-base | v1.0.1 | fp16 | 0.5356 | 0.4847 | 0.7175 | 0.6273 | 0.5857 | 0.5995 | - | - |
| internlm-chat-7b | v1.0.1 | fp16 | 0.4171 | 0.3903 | 0.5772 | 0.5493 | 0.4769 | 0.4876 | - | - |
| Llama-2-13b-ms | v1.0.2 | fp16 | 0.484 | 0.4133 | 0.6157 | 0.5809 | 0.5201 | 0.5327 | 0.548 (official) | -1.53% |
| Llama-2-7b-ms | v1.0.2 | fp16 | 0.3747 | 0.3363 | 0.4372 | 0.4514 | 0.3979 | 0.4089 | 0.453 (official) | -4.41% |
| Qwen-14B-Chat | v1.0.6 | bf16 | 0.574 | 0.553 | 0.7403 | 0.684 | 0.6313 | 0.6414 | 0.646 (official) | -0.46% |
| Qwen-7B | v1.1.6 | bf16 | 0.4587 | 0.426 | 0.6078 | 0.5629 | 0.5084 | 0.5151 | 0.567 (official) | -5.2% |
| Qwen-7B-Chat-Int8 | v1.1.6 | int8 | 0.4697 | 0.4383 | 0.6284 | 0.5967 | 0.5271 | 0.5347 | 0.554 (official) | -1.93% |
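
In this table, Delta appears to be computed against the Avg column rather than the Weighted Avg (e.g., 0.4918 - 0.5416 ≈ -4.98% for Baichuan2-7B-Base). The 5-shot runs differ from the 0-shot sketch above only in the few-shot setting; again a minimal sketch with a placeholder model ID, assuming a recent evalscope release:

```python
# Same as the 0-shot sketch above, switched to the 5-shot setting.
from evalscope import TaskConfig, run_task

task_cfg = TaskConfig(
    model='qwen/Qwen-7B-Chat',        # placeholder ModelScope model ID
    datasets=['mmlu'],
    dataset_args={
        'mmlu': {'few_shot_num': 5},  # 5 in-context examples per question
    },
)

run_task(task_cfg=task_cfg)
```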