A leading no-code chatbot platform provider, Chatfuel powers over a billion conversations monthly for hundreds of businesses, including brands like Netflix and Adidas. Their conversational LLM-based chatbots automate business communication with consumers on Facebook Messenger, Instagram, WhatsApp, and websites, in roles from support to customer retention and upselling.
When you're putting a chatbot in charge of your sales funnel and customer support, you rely on exceptional quality. However, LLM-driven chatbots are prone to a couple of major problems:
Chatfuel aims to guarantee high-quality responses from their chatbots. Toloka provides a deep evaluation framework to get a clear picture of chatbot quality, an essential step to keep Chatfuel's customers satisfied and uphold the brand's reputation in today's digital communication landscape.
Toloka has been an invaluable partner in our pursuit of improving our AI-driven customer support solutions. Their expertise and collaboration have allowed us to understand the weak points within our GPT-based customer support bots, enabling us to address critical shortcomings and enhance the overall user experience. Toloka's insights and suggestions for refining our custom GPT model have also been instrumental in fine-tuning its performance and accuracy. Now, we are better equipped to provide more efficient and effective chatbots, making them an indispensable part of our journey towards AI excellence.
— Oleg Krasikov, Head of Product, Chatfuel
The main goal of our evaluation was to benchmark the performance of Chatfuel's chatbot pipeline, powered by ChatGPT, against the current industry-leading compact open-source model, Mistral-7B-Instruct-v0.1.
The primary challenge was to develop a robust evaluation framework capable of adapting to Chatfuel's specific domain of support chatbots. We catered to a variety of chatbot scenarios, from managing open-ended customer conversations to offering precise answers from company knowledge bases.
The Toloka team adapted our deep evaluation framework to this domain and designed a flexible evaluation pipeline that will extend to a broader range of Chatfuel's chatbot applications in the future.
In the initial stages of our evaluation, we considered a range of informative metrics, including Completeness, Conciseness, Operator Interaction, Text Quality, and First Response Resolution.
We further refined our metrics to focus on the overall success of Chatfuel's support chatbot model. Three metrics — Helpfulness, Truthfulness, and Tone — were identified as most impactful in enhancing user satisfaction and ensuring effective communication.
Each metric was carefully assessed using a pointwise approach, allowing Chatfuel to make cross-metric comparisons and track absolute performance over time. Our flexible framework left room to enhance the sensitivity of the tests where necessary by adding a pairwise variation for each metric.
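The difference between pointwise and pairwise assessment can be sketched roughly as follows. The rating scale, scores, and votes below are illustrative assumptions for the sketch, not Chatfuel's actual data or setup:

```python
from statistics import mean

# Pointwise: each response gets an absolute score on a fixed scale,
# so results are comparable across metrics and over time.
pointwise_scores = {  # illustrative 1-5 expert ratings
    "helpfulness": [4, 5, 3, 4],
    "truthfulness": [5, 5, 4, 5],
}
averages = {metric: mean(scores) for metric, scores in pointwise_scores.items()}

# Pairwise: evaluators pick the better of two candidate responses per query,
# which stays sensitive even when absolute scores saturate near the top.
pairwise_votes = ["A", "A", "tie", "B", "A"]  # illustrative per-query preferences
win_rate_a = pairwise_votes.count("A") / len(pairwise_votes)

print(averages, win_rate_a)
```

Pointwise scores support tracking absolute performance over time; the pairwise variation is the sensitivity boost mentioned above.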
Chatfuel's production pipeline, powered by ChatGPT, was benchmarked against Mistral-7B-Instruct built into a RAG pipeline. This comparison gauges effectiveness relative to an open-source model that any third party could use to build a chatbot from publicly available sources.
Approach: Assessed by human experts with backgrounds in customer support.
Methodology:
Approach: Also evaluated by specialized human experts.
Methodology: Evaluators examine whether the chatbot response statements are factually correct and in complete alignment with Chatfuel's established knowledge base.
Challenges: Truthfulness can be difficult to assess, especially in nuanced customer support scenarios. This metric requires evaluators with a high degree of precision and skill.
Approach: Streamlined classification via highlighting using state-of-the-art language models.
Methodology: Tone evaluation analyzes how well the chatbot's responses match the desired communicative style, be it formal, friendly, or otherwise. The tone should be consistent and appropriate to the context. We used the following tone labels:
Challenges: To automate tone assessment, we carefully balanced quality and cost of the language model to maintain high evaluation standards without incurring excessive expenses.
Helpfulness
We used a paired assessment where evaluators directly compared output from Chatfuel's model and Mistral-7B-Instruct and chose the most helpful response for each query. Our evaluators showed a strong preference for the Chatfuel model (54.6% wins compared to 12.4% for Mistral-7B-Instruct).
Bootstrap resampling confirmed the statistical significance of these results, indicating that Chatfuel's win rate in helpfulness is significantly higher than both the tie rate and the loss rate. This suggests the Chatfuel model is much more robust in terms of helpfulness compared to Mistral-7B-Instruct-v0.1.
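A percentile-bootstrap significance check of a win rate can be sketched like this. The vote counts are constructed to mirror the reported 54.6% / 12.4% split with an assumed sample size of 1,000; the resampling scheme is a generic illustration, not the study's exact procedure:

```python
import random

random.seed(0)

# Illustrative per-query outcomes: "win" = Chatfuel preferred, "loss" = Mistral preferred.
votes = ["win"] * 546 + ["tie"] * 330 + ["loss"] * 124

def bootstrap_ci(votes, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap CI for the win-rate minus loss-rate difference."""
    diffs = []
    for _ in range(n_resamples):
        sample = random.choices(votes, k=len(votes))  # resample with replacement
        diffs.append((sample.count("win") - sample.count("loss")) / len(sample))
    diffs.sort()
    lo = diffs[int(alpha / 2 * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

lo, hi = bootstrap_ci(votes)
# If the interval excludes 0, the win rate is significantly higher than the loss rate.
print(f"95% CI for win-loss difference: [{lo:.3f}, {hi:.3f}]")
```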
Relevance
Chatfuel's model also outperformed Mistral-7B-Instruct on relevance, with a significantly higher share of responses in the Relevant category and lower shares in the Neutral, Partially Relevant, and Irrelevant categories, as confirmed by statistical tests. This points to a consistently relevant response strategy whose answers directly address the user's question.
Key takeaways
In our truthfulness evaluation of Chatfuel's model and Mistral-7B-Instruct built into the RAG pipeline, we analyzed four categories: Supported, Neutral, Unsupported, and Contradicts. The results provide insightful comparisons between the two models.
The points below summarize the bootstrap confidence intervals for the differences in response ratios between the Chatfuel model and the Mistral-7B-Instruct model, taking into account the paired origin of the data. These intervals help determine whether the observed differences in the performance metrics of the two models are statistically significant.
Key takeaways:
In summary, while the Chatfuel model shows a stronger alignment with the knowledge base in terms of support and fewer contradictions, both models exhibit challenges in dealing with unsupported information, indicating a need for more robust integration with and reliance on the knowledge base.
The tone distribution analysis for Chatfuel and Mistral provides valuable insights into the tone of voice employed by each model in their interactions.
Taking an innovative approach, we trained a specialized n-gram language model on the annotated tone labels to analyze tone sequences in responses from both the Chatfuel and Mistral-7B-Instruct-v0.1 models. The analysis revealed distinct patterns, illustrated in the figure below:
This difference highlights Chatfuel's tendency to start conversations with empathy before providing instruction, whereas Mistral frequently initiates interactions with a friendly tone, suggesting a more consistently personable approach throughout the conversation.
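The idea of an n-gram model over tone labels can be sketched with bigram transition counts. The tone sequences and label names below are illustrative assumptions, not the annotated study data:

```python
from collections import Counter, defaultdict

# Illustrative tone-label sequences, one per conversation (not the study's data).
sequences = [
    ["empathy", "instructive", "formal"],
    ["empathy", "instructive", "instructive"],
    ["friendly", "instructive", "friendly"],
]

# Count bigram transitions between consecutive tone labels.
transitions = defaultdict(Counter)
for seq in sequences:
    for prev, cur in zip(seq, seq[1:]):
        transitions[prev][cur] += 1

# Normalize counts into transition probabilities: P(next tone | current tone).
probs = {
    prev: {cur: n / sum(counts.values()) for cur, n in counts.items()}
    for prev, counts in transitions.items()
}
print(probs["empathy"])  # which tone tends to follow an empathetic opening
```

Patterns like "empathy is almost always followed by instruction" fall straight out of such a transition table, which is the kind of sequence-level signal the tone analysis surfaced.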
Key takeaways
The table below contrasts the predominant tones and interaction styles of the Chatfuel and Mistral chatbot models, highlighting their respective focuses in user engagement:
| Aspect | Chatfuel | Mistral |
|---|---|---|
| Primary Tone | Instructive 📘 and Formal 🎩 | Instructive 📘 and Friendly 😊 |
| Interaction Style | Starts with Empathy 🤗, then Clear Guidance 🧭 | Consistently Friendly 😊, followed by Guidance 🧭 |
| Design Focus | Balances Empathy with Structured Directive Interactions 🏗️ | Prioritizes Personable Engagement, then Instructive Guidance 🌐 |
Toloka's deep evaluation focused on Helpfulness, Truthfulness, and Tone to get at the heart of model quality for Chatfuel's support chatbots. The results revealed that the Chatfuel model excels in providing relevant and empathetic responses to user requests.
Chatfuel gained valuable insights and benefits from the evaluation cycle:
The Chatfuel team can adapt the same evaluation framework to enhance their e-commerce chatbots and other products as needed. Most importantly, Chatfuel customers can be confident that they are getting an outstanding support chatbot experience.
Learn more about Toloka Deep Evaluation: