Question 1

What makes the Toloka leaderboard different?

Accepted Answer

If you’re choosing a model for business applications, you want to compare model output on realistic examples. Toloka’s goal is to measure human preferences for LLM output. Our prompts are extracted from real conversations with ChatGPT, and expert human assessments are quality-controlled for best accuracy. The Toloka LLM Leaderboard gives you:

Human preferences relevant to downstream applications
Comparison of the top 5 LLMs by category for practical business decisions
The most accurate human evaluation available

More details about our evaluation process: https://toloka.ai/blog/llm-leaderboard/

Question 2

What makes our evaluation reliable?

Accepted Answer

Quality control includes 3 stages:

Human experts are tested, trained, and certified to perform evaluation tasks with specific guidelines on harmlessness, truthfulness, and helpfulness of model responses.
To make ratings objective, we use an overlap of 3 with Dawid-Skene aggregation: each comparison is evaluated by 3 experts, and their judgments are aggregated into a single verdict.
Each expert’s individual accuracy is continually monitored by comparing their judgments with the majority vote.

Unlike other leaderboards, we do not use crowdsourced ratings or LLM-generated ratings.
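
As a rough illustration of stages 2 and 3, the sketch below aggregates three expert judgments per comparison and then scores each expert against the aggregated verdict. To stay short it substitutes a plain majority vote for Dawid-Skene aggregation, so it is not the aggregation method named above. The function names, expert names, and labels are hypothetical and are not taken from Toloka's pipeline.

```python
from collections import Counter, defaultdict


def majority_vote(judgments):
    """Return the most common label for each comparison.

    judgments: list of (comparison_id, expert_id, label) tuples.
    Assumption: two possible labels and an overlap of 3, so a strict
    majority always exists. This is a simplification, not Dawid-Skene.
    """
    labels_by_comparison = defaultdict(list)
    for comparison_id, _expert_id, label in judgments:
        labels_by_comparison[comparison_id].append(label)
    return {
        comparison_id: Counter(labels).most_common(1)[0][0]
        for comparison_id, labels in labels_by_comparison.items()
    }


def expert_accuracy(judgments, verdicts):
    """Share of each expert's judgments that agree with the aggregated verdict."""
    agreed = Counter()
    total = Counter()
    for comparison_id, expert_id, label in judgments:
        total[expert_id] += 1
        if label == verdicts[comparison_id]:
            agreed[expert_id] += 1
    return {expert_id: agreed[expert_id] / total[expert_id] for expert_id in total}


# Hypothetical judgments: 3 comparisons, each rated by the same 3 experts.
judgments = [
    ("c1", "ann", "model"), ("c1", "bob", "model"), ("c1", "eve", "baseline"),
    ("c2", "ann", "baseline"), ("c2", "bob", "baseline"), ("c2", "eve", "baseline"),
    ("c3", "ann", "model"), ("c3", "bob", "baseline"), ("c3", "eve", "baseline"),
]
verdicts = majority_vote(judgments)
accuracy = expert_accuracy(judgments, verdicts)
```

In this toy data the verdicts are "model" for c1 and "baseline" for c2 and c3; one expert agrees with every verdict, and the other two each disagree once. A real monitoring setup would presumably track these rates over many more comparisons before drawing conclusions about an expert.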

Question 3

How are the scores calculated for each LLM?

Accepted Answer

We collect user prompts written for ChatGPT, run the models on these prompts, and use human evaluation to score the responses. Then we calculate the percentage of prompts where the model scored better than the baseline (Guanaco 13B).
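
The score described above is a win rate against the baseline, which can be sketched as follows. The function name and verdict labels are hypothetical. The answer does not say how ties are handled, so the `tie_weight` parameter (half a win per tie) is purely an assumption made for this example.

```python
def win_rate(verdicts, tie_weight=0.5):
    """Percentage of prompts on which the model beat the baseline.

    verdicts: one aggregated label per prompt, each "model", "baseline",
    or "tie". Counting a tie as half a win is an assumption of this
    sketch, not something stated in the source.
    """
    if not verdicts:
        raise ValueError("no verdicts to score")
    points = 0.0
    for verdict in verdicts:
        if verdict == "model":
            points += 1.0
        elif verdict == "tie":
            points += tie_weight
    return round(100.0 * points / len(verdicts), 2)


# Hypothetical example: 7 wins, 2 losses, and 1 tie over 10 prompts.
example = ["model"] * 7 + ["baseline"] * 2 + ["tie"]
score = win_rate(example)  # (7 + 0.5) / 10 = 75.0
```

Under this assumption a model that ties with the baseline on every prompt would score 50.00, which matches the value shown for Guanaco 13B in the table, though the source does not confirm that this is how that value arises.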

Question 4

Can you perform in-depth evaluation of an LLM in a specific area?

Accepted Answer

We can develop custom evaluations. Please reach out to us at https://toloka.ai/talk-to-us/.
We'd be happy to discuss your evaluation needs.

Question 5

How are models selected for the leaderboard?

Accepted Answer

We select popular models from the Hugging Face Hub.

Rank	LLM	Total	Brainstorming	Closed QA	Generation	Open QA	Rewrite
1	GPT-4	81.75	89.74	78.95	81.87	74.10	92.31
2	WizardLM 13B V1.2	79.56	76.92	73.68	80.31	77.11	92.31
3	LLaMA 2 70B Chat	78.97	88.46	68.42	81.35	69.88	84.62
4	GPT-3.5 Turbo	76.79	73.08	68.42	80.31	73.49	76.92
5	Vicuna 33B V1.3	74.21	82.05	47.37	71.50	70.48	76.92
6	Guanaco 13B	50.00	50.00	50.00	50.00	50.00	50.00
