The first public LLM comparison based on authentic user prompts and expert human evaluation.

The leaderboard ranks LLMs across multiple prompt categories.

Rank | LLM | Total | Brainstorming | Closed QA | Generation | Open QA | Rewrite |
---|---|---|---|---|---|---|---|
1 | GPT-4 | 81.75 | 89.74 | 78.95 | 81.87 | 74.10 | 92.31 |
2 | WizardLM 13B V1.2 | 79.56 | 76.92 | 73.68 | 80.31 | 77.11 | 92.31 |
3 | LLaMA 2 70B Chat | 78.97 | 88.46 | 68.42 | 81.35 | 69.88 | 84.62 |
4 | GPT-3.5 Turbo | 76.79 | 73.08 | 68.42 | 80.31 | 73.49 | 76.92 |
5 | Vicuna 33B V1.3 | 74.21 | 82.05 | 47.37 | 71.50 | 70.48 | 76.92 |
6 | Guanaco 13B | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 |

Toloka compares and ranks popular LLMs across multiple prompt categories, using Guanaco 13B as the baseline; the baseline is fixed at 50.00 in every category.
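
As a rough illustration of how the table can be read, the sketch below hard-codes the scores from the leaderboard, then looks up the top model or models in each category and any score that falls below the Guanaco 13B baseline. The names `SCORES`, `leaders`, and `below_baseline` are hypothetical helpers written for this example; they are not part of any Toloka tooling.

```python
# Scores copied verbatim from the leaderboard table.
CATEGORIES = ["Total", "Brainstorming", "Closed QA", "Generation", "Open QA", "Rewrite"]

SCORES = {
    "GPT-4": [81.75, 89.74, 78.95, 81.87, 74.10, 92.31],
    "WizardLM 13B V1.2": [79.56, 76.92, 73.68, 80.31, 77.11, 92.31],
    "LLaMA 2 70B Chat": [78.97, 88.46, 68.42, 81.35, 69.88, 84.62],
    "GPT-3.5 Turbo": [76.79, 73.08, 68.42, 80.31, 73.49, 76.92],
    "Vicuna 33B V1.3": [74.21, 82.05, 47.37, 71.50, 70.48, 76.92],
    "Guanaco 13B": [50.00, 50.00, 50.00, 50.00, 50.00, 50.00],
}


def leaders(category):
    """Return every model that holds the highest score in a category (ties included)."""
    idx = CATEGORIES.index(category)
    best = max(row[idx] for row in SCORES.values())
    return [model for model, row in SCORES.items() if row[idx] == best]


def below_baseline(baseline="Guanaco 13B"):
    """Return (model, category, score) for each score lower than the baseline's."""
    base = SCORES[baseline]
    return [
        (model, category, row[i])
        for model, row in SCORES.items()
        for i, category in enumerate(CATEGORIES)
        if row[i] < base[i]
    ]


if __name__ == "__main__":
    for category in CATEGORIES:
        print(category, "->", ", ".join(leaders(category)))
    print("Below baseline:", below_baseline())
```

Read this way, the table shows GPT-4 leading most columns, WizardLM 13B V1.2 leading Open QA and tying GPT-4 on Rewrite, and Vicuna 33B V1.3 as the only model scoring below the baseline in any category (Closed QA).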