🏆 Chatbot Arena Leaderboard
| Blog | GitHub | Paper | Dataset | Twitter | Discord |
This leaderboard is based on the following three benchmarks.
- Chatbot Arena – a crowdsourced, randomized battle platform. We use 100K+ user votes to compute Elo ratings (a simplified rating-update sketch follows this list).
- MT-Bench – a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
- MMLU (5-shot) – a test to measure a model’s multitask accuracy on 57 tasks.
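For intuition, here is a minimal sketch of the rating update behind the Arena: a single online Elo pass over pairwise battle records. The battle tuples, K-factor, and initial rating below are illustrative assumptions, not the production pipeline; the leaderboard's actual ratings come from the notebook referenced in the Code note below.

```python
from collections import defaultdict

def compute_elo(battles, k=4, scale=400, base=10, init=1000):
    """Online Elo over pairwise battles.

    Each battle is (model_a, model_b, winner), where winner is
    "model_a", "model_b", or "tie". This data format and the
    constants are assumptions for illustration.
    """
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (sa - ea)
        ratings[model_b] += k * ((1 - sa) - (1 - ea))
    return dict(ratings)

# Hypothetical battle log; real votes come from the Arena platform.
battles = [
    ("gpt-4", "vicuna-13b", "model_a"),
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("gpt-4", "alpaca-13b", "tie"),
]
print(compute_elo(battles))
```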
💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are mostly computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. Last updated: November 2023.
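As a rough illustration of single-answer grading, the sketch below parses judge verdicts into 1–10 ratings and averages them into an MT-bench-style score, assuming the judge ends each review with a bracketed verdict such as `Rating: [[7]]`. The sample judge outputs here are hypothetical; the actual grading pipeline is fastchat.llm_judge.

```python
import re
import statistics

# Hypothetical judge reviews for one model's answers; in practice the
# judge (e.g. GPT-4) reviews each turn and emits a final verdict.
judge_outputs = [
    "The answer is helpful and accurate. Rating: [[8]]",
    "The second turn drifts off-topic. Rating: [[6]]",
]

def parse_rating(text):
    """Extract a rating from a verdict like 'Rating: [[7]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None

ratings = [r for r in map(parse_rating, judge_outputs) if r is not None]
score = statistics.mean(ratings)  # averaged over all graded turns
print(f"MT-bench-style score: {score:.2f}")
```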
Model | ⭐ Arena Elo rating | 📈 MT-bench (score) | MMLU | License |
---|---|---|---|---|
GPT-4-Turbo | 1210 | 9.32 | | Proprietary |
GPT-4 | 1159 | 8.99 | 86.4 | Proprietary |
Claude-1 | 1146 | 7.9 | 77 | Proprietary |
Claude-2 | 1125 | 8.06 | 78.5 | Proprietary |
Claude-instant-1 | 1106 | 7.85 | 73.4 | Proprietary |
GPT-3.5-turbo | 1103 | 7.94 | 70 | Proprietary |
WizardLM-70b-v1.0 | 1093 | 7.71 | 63.7 | Llama 2 Community |
Vicuna-33B | 1090 | 7.12 | 59.2 | Non-commercial |
OpenChat-3.5 | 1070 | 7.81 | 64.3 | Apache-2.0 |
Llama-2-70b-chat | 1065 | 6.86 | 63 | Llama 2 Community |
WizardLM-13b-v1.2 | 1047 | 7.2 | 52.7 | Llama 2 Community |
zephyr-7b-beta | 1042 | 7.34 | 61.4 | MIT |
MPT-30B-chat | 1031 | 6.39 | 50.4 | CC-BY-NC-SA-4.0 |
Vicuna-13B | 1031 | 6.57 | 55.8 | Llama 2 Community |
QWen-Chat-14B | 1030 | 6.96 | 66.5 | Qianwen LICENSE |
falcon-180b-chat | 1024 | | 68 | Falcon-180B TII License |
zephyr-7b-alpha | 1024 | 6.88 | | MIT |
CodeLlama-34B-instruct | 1022 | | 53.7 | Llama 2 Community |
Guanaco-33B | 1021 | 6.53 | 57.6 | Non-commercial |
Llama-2-13b-chat | 1021 | 6.65 | 53.6 | Llama 2 Community |
Mistral-7B-Instruct-v0.1 | 1008 | 6.84 | 55.4 | Apache 2.0 |
Llama-2-7b-chat | 1001 | 6.27 | 45.8 | Llama 2 Community |
Vicuna-7B | 994 | 6.17 | 49.8 | Llama 2 Community |
PaLM-Chat-Bison-001 | 991 | 6.4 | | Proprietary |
ChatGLM3-6B | 970 | | | Apache-2.0 |
Koala-13B | 955 | 5.35 | 44.7 | Non-commercial |
GPT4All-13B-Snoozy | 925 | 5.41 | 43 | Non-commercial |
MPT-7B-Chat | 918 | 5.42 | 32 | CC-BY-NC-SA-4.0 |
ChatGLM2-6B | 918 | 4.96 | 45.5 | Apache-2.0 |
RWKV-4-Raven-14B | 915 | 3.98 | 25.6 | Apache 2.0 |
Alpaca-13B | 893 | 4.53 | 48.1 | Non-commercial |
OpenAssistant-Pythia-12B | 884 | 4.32 | 27 | Apache 2.0 |
ChatGLM-6B | 871 | 4.5 | 36.1 | Non-commercial |
FastChat-T5-3B | 863 | 3.04 | 47.7 | Apache 2.0 |
StableLM-Tuned-Alpha-7B | 833 | 2.75 | 24.4 | CC-BY-NC-SA-4.0 |
Dolly-V2-12B | 810 | 3.28 | 25.7 | MIT |
LLaMA-13B | 789 | 2.61 | 47 | Non-commercial |
WizardLM-30B | | 7.01 | 58.7 | Non-commercial |
Vicuna-13B-16k | | 6.92 | 54.5 | Llama 2 Community |
WizardLM-13B-v1.1 | | 6.76 | 50 | Non-commercial |
Tulu-30B | | 6.43 | 58.1 | Non-commercial |
Guanaco-65B | | 6.41 | 62.1 | Non-commercial |
OpenAssistant-LLaMA-30B | | 6.41 | 56 | Non-commercial |
WizardLM-13B-v1.0 | | 6.35 | 52.3 | Non-commercial |
Vicuna-7B-16k | | 6.22 | 48.5 | Llama 2 Community |
Baize-v2-13B | | 5.75 | 48.9 | Non-commercial |
XGen-7B-8K-Inst | | 5.55 | 42.1 | Non-commercial |
Nous-Hermes-13B | | 5.51 | 49.3 | Non-commercial |
MPT-30B-Instruct | | 5.22 | 47.8 | CC-BY-SA 3.0 |
Falcon-40B-Instruct | | 5.17 | 54.7 | Apache 2.0 |
H2O-Oasst-OpenLLaMA-13B | | 4.63 | 42.8 | Apache 2.0 |
Visit our HF space for more analysis!
If you want to see more models, please help us add them.
Acknowledgment
We thank Kaggle, MBZUAI, AnyScale, and HuggingFace for their generous sponsorship.