🏆 Chatbot Arena Leaderboard
| Blog | GitHub | Paper | Dataset | Twitter | Discord |
This leaderboard is based on the following three benchmarks.
- Chatbot Arena – a crowdsourced, randomized battle platform. We use 100K+ user votes to compute Elo ratings (a simplified rating-update sketch follows this list).
- MT-Bench – a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
- MMLU (5-shot) – a test to measure a model’s multitask accuracy on 57 tasks.
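For intuition, here is a minimal sketch of the rating update behind the Arena: a single online Elo pass over pairwise battle records. The battle tuples, K-factor, and initial rating below are illustrative assumptions, not the production pipeline; the leaderboard's actual ratings come from the notebook referenced in the Code note below.

```python
from collections import defaultdict

def compute_elo(battles, k=4, scale=400, base=10, init=1000):
    """Online Elo over pairwise battles.

    Each battle is (model_a, model_b, winner), where winner is
    "model_a", "model_b", or "tie". This data format and the
    constants are assumptions for illustration.
    """
    ratings = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the Elo model.
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (sa - ea)
        ratings[model_b] += k * ((1 - sa) - (1 - ea))
    return dict(ratings)

# Hypothetical battle log; real votes come from the Arena platform.
battles = [
    ("gpt-4", "vicuna-13b", "model_a"),
    ("vicuna-13b", "alpaca-13b", "model_a"),
    ("gpt-4", "alpaca-13b", "tie"),
]
print(compute_elo(battles))
```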
💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 10) are computed by fastchat.llm_judge. The MMLU scores are mostly computed by InstructEval. Higher values are better for all benchmarks. Empty cells mean not available. Last updated: November 2023.
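As a rough illustration of single-answer grading, the sketch below parses judge verdicts into 1–10 ratings and averages them into an MT-bench-style score, assuming the judge ends each review with a bracketed verdict such as `Rating: [[7]]`. The sample judge outputs here are hypothetical; the actual grading pipeline is fastchat.llm_judge.

```python
import re
import statistics

# Hypothetical judge reviews for one model's answers; in practice the
# judge (e.g. GPT-4) reviews each turn and emits a final verdict.
judge_outputs = [
    "The answer is helpful and accurate. Rating: [[8]]",
    "The second turn drifts off-topic. Rating: [[6]]",
]

def parse_rating(text):
    """Extract a rating from a verdict like 'Rating: [[7]]'."""
    match = re.search(r"\[\[(\d+(?:\.\d+)?)\]\]", text)
    return float(match.group(1)) if match else None

ratings = [r for r in map(parse_rating, judge_outputs) if r is not None]
score = statistics.mean(ratings)  # averaged over all graded turns
print(f"MT-bench-style score: {score:.2f}")
```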
Model | ⭐ Arena Elo rating | 📈 MT-bench (score) | MMLU | License |
---|---|---|---|---|
GPT-4-Turbo | 1210 | 9.32 | | Proprietary |
GPT-4 | 1159 | 8.99 | 86.4 | Proprietary |
Claude-1 | 1146 | 7.9 | 77 | Proprietary |
Claude-2 | 1125 | 8.06 | 78.5 | Proprietary |
Claude-instant-1 | 1106 | 7.85 | 73.4 | Proprietary |
GPT-3.5-turbo | 1103 | 7.94 | 70 | Proprietary |
WizardLM-70b-v1.0 | 1093 | 7.71 | 63.7 | Llama 2 Community |
Vicuna-33B | 1090 | 7.12 | 59.2 | Non-commercial |
OpenChat-3.5 | 1070 | 7.81 | 64.3 | Apache-2.0 |
Llama-2-70b-chat | 1065 | 6.86 | 63 | Llama 2 Community |
WizardLM-13b-v1.2 | 1047 | 7.2 | 52.7 | Llama 2 Community |
zephyr-7b-beta | 1042 | 7.34 | 61.4 | MIT |
MPT-30B-chat | 1031 | 6.39 | 50.4 | CC-BY-NC-SA-4.0 |
Vicuna-13B | 1031 | 6.57 | 55.8 | Llama 2 Community |
QWen-Chat-14B | 1030 | 6.96 | 66.5 | Qianwen LICENSE |
falcon-180b-chat | 1024 | | 68 | Falcon-180B TII License |
zephyr-7b-alpha | 1024 | 6.88 | | MIT |
CodeLlama-34B-instruct | 1022 | | 53.7 | Llama 2 Community |
Guanaco-33B | 1021 | 6.53 | 57.6 | Non-commercial |
Llama-2-13b-chat | 1021 | 6.65 | 53.6 | Llama 2 Community |
Mistral-7B-Instruct-v0.1 | 1008 | 6.84 | 55.4 | Apache 2.0 |
Llama-2-7b-chat | 1001 | 6.27 | 45.8 | Llama 2 Community |
Vicuna-7B | 994 | 6.17 | 49.8 | Llama 2 Community |
PaLM-Chat-Bison-001 | 991 | 6.4 | | Proprietary |
ChatGLM3-6B | 970 | | | Apache-2.0 |
Koala-13B | 955 | 5.35 | 44.7 | Non-commercial |
GPT4All-13B-Snoozy | 925 | 5.41 | 43 | Non-commercial |
MPT-7B-Chat | 918 | 5.42 | 32 | CC-BY-NC-SA-4.0 |
ChatGLM2-6B | 918 | 4.96 | 45.5 | Apache-2.0 |
RWKV-4-Raven-14B | 915 | 3.98 | 25.6 | Apache 2.0 |
Alpaca-13B | 893 | 4.53 | 48.1 | Non-commercial |
OpenAssistant-Pythia-12B | 884 | 4.32 | 27 | Apache 2.0 |
ChatGLM-6B | 871 | 4.5 | 36.1 | Non-commercial |
FastChat-T5-3B | 863 | 3.04 | 47.7 | Apache 2.0 |
StableLM-Tuned-Alpha-7B | 833 | 2.75 | 24.4 | CC-BY-NC-SA-4.0 |
Dolly-V2-12B | 810 | 3.28 | 25.7 | MIT |
LLaMA-13B | 789 | 2.61 | 47 | Non-commercial |
WizardLM-30B | | 7.01 | 58.7 | Non-commercial |
Vicuna-13B-16k | | 6.92 | 54.5 | Llama 2 Community |
WizardLM-13B-v1.1 | | 6.76 | 50 | Non-commercial |
Tulu-30B | | 6.43 | 58.1 | Non-commercial |
Guanaco-65B | | 6.41 | 62.1 | Non-commercial |
OpenAssistant-LLaMA-30B | | 6.41 | 56 | Non-commercial |
WizardLM-13B-v1.0 | | 6.35 | 52.3 | Non-commercial |
Vicuna-7B-16k | | 6.22 | 48.5 | Llama 2 Community |
Baize-v2-13B | | 5.75 | 48.9 | Non-commercial |
XGen-7B-8K-Inst | | 5.55 | 42.1 | Non-commercial |
Nous-Hermes-13B | | 5.51 | 49.3 | Non-commercial |
MPT-30B-Instruct | | 5.22 | 47.8 | CC-BY-SA 3.0 |
Falcon-40B-Instruct | | 5.17 | 54.7 | Apache 2.0 |
H2O-Oasst-OpenLLaMA-13B | | 4.63 | 42.8 | Apache 2.0 |
Visit our HF space for more analysis!
If you want to see more models, please help us add them.
Acknowledgment
We thank Kaggle, MBZUAI, AnyScale, and HuggingFace for their generous sponsorship.