Problem:
AI hallucination is one of the biggest open problems in the field.
On January 16th, Apple paused its AI feature that generated short summaries of news stories because it kept making mistakes. The feature pushed notification alerts that looked like real headlines but contained wrong information. At first Apple held off, saying it would simply add a note indicating that the summaries were AI-generated; when complaints continued, it turned the feature off entirely for news apps.
Solution:
It is now possible to evaluate LLM hallucination rates in specific contexts, particularly when summarizing documents or working with verifiable source material.
What:
A public leaderboard that ranks Large Language Models (LLMs) based on their tendency to hallucinate (generate false or unsupported information) when summarizing documents. Key points:
- The leaderboard is maintained by Vectara using their Hughes Hallucination Evaluation Model (HHEM).
- It currently uses version 2.1 of HHEM (previous rankings used version 1.0).
- The leaderboard is regularly updated to reflect both improvements in their evaluation model and updates to the LLMs being tested.
- It's available both on their platform and on Hugging Face.
Their goal is to provide a standardized way to compare different LLMs' reliability in terms of generating factually consistent summaries without introducing information that wasn't present in the source material.
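Because the HHEM model itself is published on Hugging Face, you can score your own source/summary pairs locally. The sketch below assumes the usage pattern documented on the vectara/hallucination_evaluation_model model card (its remote code exposes a predict() helper); double-check the card, as the exact API may have changed.

```python
# pip install transformers torch
from transformers import AutoModelForSequenceClassification

# Load HHEM-2.1-open; trust_remote_code pulls in the model's own predict() helper.
model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (source document, candidate summary).
pairs = [
    ("The company reported revenue of $10M in Q3.",
     "Revenue reached $10M in the third quarter."),       # supported by the source
    ("The company reported revenue of $10M in Q3.",
     "Revenue doubled to $20M in the third quarter."),     # hallucinated detail
]

# Scores fall in [0, 1]; higher means the summary is better supported by the source.
scores = model.predict(pairs)
for (_, summary), score in zip(pairs, scores):
    print(f"{float(score):.3f}  {summary}")
```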
Who:
Vectara is a company that provides AI-powered search and retrieval technology, particularly in the field of RAG (Retrieval Augmented Generation). They've gained attention in the AI community for creating tools to evaluate LLM performance, most notably their Hughes Hallucination Evaluation Model (HHEM), which is used to assess how often AI models generate false or unsupported information.
Methodology:
The researchers developed a model to detect hallucinations in LLM outputs, using datasets from factual consistency research. They then tested various LLMs by having each one summarize 1,000 short documents (primarily from the CNN / Daily Mail corpus) at temperature 0. Of these, 831 documents were summarized by every model; the rest were refused by at least one model due to content filters.
They measured two key metrics:
1. Factual consistency rate (accuracy in summarizing only facts present in the source document)
2. Answer rate (percentage of documents the model was willing to summarize)
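To make these two metrics concrete, here is a rough sketch of the evaluation loop just described: summarize each document at temperature 0, skip refusals, score each summary against its source, and aggregate. It assumes an OpenAI-compatible Python client, a hypothetical hhem_score(source, summary) helper (e.g. a thin wrapper around the HHEM model shown earlier), and a 0.5 consistency cut-off; the prompt and threshold are illustrative assumptions, not Vectara's published settings.

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def summarize(doc: str, model: str = "gpt-4o") -> str | None:
    """Ask the LLM for a summary at temperature 0; return None if it refuses."""
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic decoding, as in the leaderboard setup
        messages=[{
            "role": "user",
            "content": ("Provide a concise summary of the following passage, "
                        "using only information contained in it:\n\n" + doc),
        }],
    )
    text = (resp.choices[0].message.content or "").strip()
    # Crude refusal check; the real pipeline also runs into provider content filters.
    return None if not text or text.lower().startswith("i can't") else text


def evaluate(documents: list[str]) -> tuple[float, float]:
    """Return (answer_rate, factual_consistency_rate) over a set of documents."""
    answered, consistent = 0, 0
    for doc in documents:
        summary = summarize(doc)
        if summary is None:
            continue  # refused documents count against the answer rate only
        answered += 1
        # hhem_score(): hypothetical helper wrapping the HHEM model shown earlier.
        if hhem_score(doc, summary) >= 0.5:  # assumed cut-off for "consistent"
            consistent += 1
    answer_rate = answered / len(documents)
    factual_consistency_rate = consistent / answered if answered else 0.0
    return answer_rate, factual_consistency_rate
```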
They chose to evaluate summarization consistency rather than general factual accuracy because:
- It allows direct comparison between the model's output and the source material.
- Detecting hallucinations in ad-hoc questions is impossible without knowing each LLM's training data.
- This approach serves as a good proxy for model truthfulness.
- It's particularly relevant for RAG (Retrieval Augmented Generation) systems, where LLMs primarily function as summarizers of search results.
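That last point translates almost directly into code: in a RAG pipeline, the retrieved passages play the role of the source document and the generated answer plays the role of the summary. A minimal sketch, reusing the hypothetical hhem_score helper from above:

```python
def grounded_answer_score(retrieved_passages: list[str], answer: str) -> float:
    """Score a RAG answer against the passages it was supposed to be grounded in."""
    context = "\n\n".join(retrieved_passages)  # treat the retrieved set as the source
    return hhem_score(context, answer)         # hypothetical helper defined earlier


# Example: flag answers that drift from the retrieved evidence
# (same assumed 0.5 cut-off as above).
# if grounded_answer_score(passages, answer) < 0.5:
#     print("Possible hallucination: answer not supported by retrieved passages.")
```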
Sources:
This ranking was also featured in The Visual Capitalist
Question to you:
Aren't you surprised NOT to see Claude or Perplexity at the top of the ranking??
Indeed, it's a bit disappointing, and the same goes for Mistral. Perplexity isn't on the list because it relies on one of the models above: with the Pro version you can choose between Claude, Sonar Large, GPT-4o, Sonar Huge, Grok-2, and o1.