Chatbot Arena Leaderboard

🏆 Chatbot Arena Leaderboard

Blog | GitHub | Paper | Dataset | Twitter | Discord

This leaderboard is based on the following three benchmarks.

  • Chatbot Arena – a crowdsourced, randomized battle platform. We use 100K+ user votes to compute Elo ratings.
  • MT-Bench – a set of challenging multi-turn questions. We use GPT-4 to grade the model responses.
  • MMLU (5-shot) – a test to measure a model’s multitask accuracy on 57 tasks.

💻 Code: The Arena Elo ratings are computed by this notebook. The MT-bench scores (single-answer grading on a scale of 1 to 10) are computed by fastchat.llm_judge. The MMLU scores are mostly computed by InstructEval. Higher values are better for all benchmarks. An empty cell means the score is not available. Last updated: November 2023.
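To make the Elo idea concrete, here is a minimal sketch of an online Elo update over pairwise votes. It is not the notebook referenced above: the function name `compute_elo`, the vote format, and the constants (`k`, `scale`, `base`, `init_rating`) are illustrative assumptions based on the conventional Elo formula.

```python
from collections import defaultdict


def compute_elo(battles, k=4.0, scale=400.0, base=10.0, init_rating=1000.0):
    """Sequentially update Elo ratings from pairwise battles.

    `battles` is an iterable of (model_a, model_b, winner) tuples, where
    `winner` is "model_a", "model_b", or "tie". This input format and all
    default constants are assumptions made for this sketch.
    """
    ratings = defaultdict(lambda: init_rating)
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        # Expected score of model_a under the standard logistic Elo model.
        expected_a = 1.0 / (1.0 + base ** ((rb - ra) / scale))
        if winner == "model_a":
            score_a = 1.0
        elif winner == "model_b":
            score_a = 0.0
        elif winner == "tie":
            score_a = 0.5
        else:
            raise ValueError(f"unknown winner label: {winner!r}")
        # The two updates are equal and opposite, so total rating is conserved.
        delta = k * (score_a - expected_a)
        ratings[model_a] = ra + delta
        ratings[model_b] = rb - delta
    return dict(ratings)


if __name__ == "__main__":
    # Hypothetical votes, for illustration only.
    votes = [
        ("model-x", "model-y", "model_a"),
        ("model-y", "model-z", "tie"),
        ("model-z", "model-x", "model_b"),
    ]
    for name, rating in sorted(compute_elo(votes).items(), key=lambda kv: -kv[1]):
        print(f"{name}: {rating:.1f}")
```

Because this update is sequential, the final ratings depend on the order in which votes are processed. With many votes, it can help to average over shuffled orderings or to fit all ratings jointly.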

| Model | ⭐ Arena Elo rating | 📈 MT-bench (score) | MMLU | License |
| --- | --- | --- | --- | --- |
| GPT-4-Turbo | 1210 | 9.32 | | Proprietary |
| GPT-4 | 1159 | 8.99 | 86.4 | Proprietary |
| Claude-1 | 1146 | 7.9 | 77 | Proprietary |
| Claude-2 | 1125 | 8.06 | 78.5 | Proprietary |
| Claude-instant-1 | 1106 | 7.85 | 73.4 | Proprietary |
| GPT-3.5-turbo | 1103 | 7.94 | 70 | Proprietary |
| WizardLM-70b-v1.0 | 1093 | 7.71 | 63.7 | Llama 2 Community |
| Vicuna-33B | 1090 | 7.12 | 59.2 | Non-commercial |
| OpenChat-3.5 | 1070 | 7.81 | 64.3 | Apache-2.0 |
| Llama-2-70b-chat | 1065 | 6.86 | 63 | Llama 2 Community |
| WizardLM-13b-v1.2 | 1047 | 7.2 | 52.7 | Llama 2 Community |
| zephyr-7b-beta | 1042 | 7.34 | 61.4 | MIT |
| MPT-30B-chat | 1031 | 6.39 | 50.4 | CC-BY-NC-SA-4.0 |
| Vicuna-13B | 1031 | 6.57 | 55.8 | Llama 2 Community |
| QWen-Chat-14B | 1030 | 6.96 | 66.5 | Qianwen LICENSE |
| falcon-180b-chat | 1024 | | 68 | Falcon-180B TII License |
| zephyr-7b-alpha | 1024 | 6.88 | | MIT |
| CodeLlama-34B-instruct | 1022 | | 53.7 | Llama 2 Community |
| Guanaco-33B | 1021 | 6.53 | 57.6 | Non-commercial |
| Llama-2-13b-chat | 1021 | 6.65 | 53.6 | Llama 2 Community |
| Mistral-7B-Instruct-v0.1 | 1008 | 6.84 | 55.4 | Apache 2.0 |
| Llama-2-7b-chat | 1001 | 6.27 | 45.8 | Llama 2 Community |
| Vicuna-7B | 994 | 6.17 | 49.8 | Llama 2 Community |
| PaLM-Chat-Bison-001 | 991 | 6.4 | | Proprietary |
| ChatGLM3-6B | 970 | | | Apache-2.0 |
| Koala-13B | 955 | 5.35 | 44.7 | Non-commercial |
| GPT4All-13B-Snoozy | 925 | 5.41 | 43 | Non-commercial |
| MPT-7B-Chat | 918 | 5.42 | 32 | CC-BY-NC-SA-4.0 |
| ChatGLM2-6B | 918 | 4.96 | 45.5 | Apache-2.0 |
| RWKV-4-Raven-14B | 915 | 3.98 | 25.6 | Apache 2.0 |
| Alpaca-13B | 893 | 4.53 | 48.1 | Non-commercial |
| OpenAssistant-Pythia-12B | 884 | 4.32 | 27 | Apache 2.0 |
| ChatGLM-6B | 871 | 4.5 | 36.1 | Non-commercial |
| FastChat-T5-3B | 863 | 3.04 | 47.7 | Apache 2.0 |
| StableLM-Tuned-Alpha-7B | 833 | 2.75 | 24.4 | CC-BY-NC-SA-4.0 |
| Dolly-V2-12B | 810 | 3.28 | 25.7 | MIT |
| LLaMA-13B | 789 | 2.61 | 47 | Non-commercial |
| WizardLM-30B | | 7.01 | 58.7 | Non-commercial |
| Vicuna-13B-16k | | 6.92 | 54.5 | Llama 2 Community |
| WizardLM-13B-v1.1 | | 6.76 | 50 | Non-commercial |
| Tulu-30B | | 6.43 | 58.1 | Non-commercial |
| Guanaco-65B | | 6.41 | 62.1 | Non-commercial |
| OpenAssistant-LLaMA-30B | | 6.41 | 56 | Non-commercial |
| WizardLM-13B-v1.0 | | 6.35 | 52.3 | Non-commercial |
| Vicuna-7B-16k | | 6.22 | 48.5 | Llama 2 Community |
| Baize-v2-13B | | 5.75 | 48.9 | Non-commercial |
| XGen-7B-8K-Inst | | 5.55 | 42.1 | Non-commercial |
| Nous-Hermes-13B | | 5.51 | 49.3 | Non-commercial |
| MPT-30B-Instruct | | 5.22 | 47.8 | CC-BY-SA 3.0 |
| Falcon-40B-Instruct | | 5.17 | 54.7 | Apache 2.0 |
| H2O-Oasst-OpenLLaMA-13B | | 4.63 | 42.8 | Apache 2.0 |

Visit our HF space for more analysis!

If you want to see more models, please help us add them.

Acknowledgment

We thank Kaggle, MBZUAI, AnyScale, and HuggingFace for their generous sponsorship.
