AI software is able to recognize and react to patterns, but it has one important caveat: the software will be wrong some amount of the time. The art of designing with AI is an exercise in deciding how much “wrongness” we're willing to accept and having systems and processes in place for when it happens.
From Google Gemini to Claude 3 to Meta's Llama 3, it's no secret that the major AI companies are in a race for superiority. Large Language Models (LLMs), with their billions of parameters, have evolved far beyond simple text prediction; they're now capable of handling complex reasoning tasks. With each new release, creators tout their model's dominance through performance charts showing how it excels over rivals on specific, often carefully chosen, benchmarks.
Benchmarks have become the way we describe which LLM is superior. While benchmarks are useful, relying too heavily on their scores can lead to false expectations about performance on real-world tasks.
A model scoring 82% on a benchmark doesn’t guarantee that it will perform at that level in the complex, dynamic tasks your business demands. In fact, performance can vary greatly based on context. This article will explore the benefits and limitations of benchmarks and offer a broader perspective on evaluating LLMs in the real world.
Benchmarks for LLMs come from standardized performance tests run on large datasets containing hundreds of thousands, or even millions, of pairs of prompts or problem statements and their correct answers.
The datasets used for performance tests vary in structure. Some include multiple-choice questions, where models are graded on the number of correct answers, while others are open-ended and evaluated on their similarity to a predefined response. For more complex tasks, like producing code, models are assessed on the correctness of the output or whether the code can run successfully.
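To make those grading styles concrete, here is a minimal sketch in Python of how exact-match, similarity-based, and execution-based scoring might each look. The examples and hard-coded model outputs are made up, and none of this mirrors any specific benchmark's actual harness.

```python
import difflib

# 1. Multiple choice: grade by exact match against the answer key.
def grade_multiple_choice(model_answer: str, correct_answer: str) -> bool:
    return model_answer.strip().upper() == correct_answer.strip().upper()

# 2. Open-ended: grade by similarity to a predefined reference response.
#    (Real benchmarks use metrics like ROUGE or model-based judges; a simple
#    character-overlap ratio stands in for those here.)
def grade_open_ended(model_answer: str, reference: str, threshold: float = 0.7) -> bool:
    similarity = difflib.SequenceMatcher(None, model_answer.lower(), reference.lower()).ratio()
    return similarity >= threshold

# 3. Code generation: grade by whether the output runs and passes a test.
def grade_code(model_code: str, test_input: int, expected_output: int) -> bool:
    namespace = {}
    try:
        exec(model_code, namespace)           # run the generated code
        return namespace["solve"](test_input) == expected_output
    except Exception:
        return False                          # code that crashes scores zero

# Hard-coded stand-ins for whatever an LLM would actually return:
print(grade_multiple_choice("B", "B"))                           # True
print(grade_open_ended("The cat sat on the mat.",
                       "A cat was sitting on the mat."))         # depends on the threshold
print(grade_code("def solve(x):\n    return x * 2", 3, 6))       # True
```

The point of the sketch is simply that each grading style embeds a judgment call: what counts as a match, how similar is similar enough, and which test cases the generated code must pass.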
This field of evaluation has about as many ways to “grade” LLM performance as there are:
Tasks they can perform: coding, summarizing, translating, answering questions.
Industries they're applied to, from general-purpose use to more specialized fields like creativity, medicine, or banking.
Defining a “correct” answer becomes more challenging as the complexity of both the data and the goal increases. In these cases, correctness can be highly subjective and often reflects the biases of those who created or curated the original benchmark datasets.
For example, analyzing images presents more challenges for evaluation than working with structured data like spreadsheets. Similarly, complex tasks such as summarizing an article are more challenging to evaluate accurately than straightforward objectives, like counting hospital readmissions.
There are about as many different benchmarks as there are types of AI tasks, from summarization to analysis to reasoning and beyond. A few of the more popular benchmarks, like graduate-level reasoning (GPQA), multilingual math (MGSM), and common knowledge, can be seen in the chart below from Anthropic's Claude 3 release.
These charts are filled with overly complicated metrics that most decision-makers shouldn't be expected to understand. The technical jargon — like N-shot, which refers to how many correct examples the model sees before testing, and CoT (Chain of Thought), where the model is instructed to reason more verbosely — might seem impressive, but it distracts from what matters most: real-world performance.
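For readers curious what that jargon looks like in practice, here is a tiny illustrative sketch of how an evaluation harness might assemble an “N-shot” prompt with an optional chain-of-thought instruction. The questions and wording are made up, not taken from any real benchmark.

```python
# Hypothetical few-shot prompt construction; examples are illustrative only.
examples = [
    ("What is 12 + 7?", "19"),
    ("What is 45 - 9?", "36"),
    ("What is 8 * 6?", "48"),
]

def build_prompt(question: str, n_shot: int = 3, chain_of_thought: bool = False) -> str:
    parts = []
    for q, a in examples[:n_shot]:        # "N-shot": include N solved examples
        parts.append(f"Q: {q}\nA: {a}")
    if chain_of_thought:                  # "CoT": ask the model to show its reasoning
        parts.append(f"Q: {question}\nA: Let's think step by step.")
    else:
        parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_prompt("What is 23 + 58?", n_shot=3, chain_of_thought=True))
```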
We're tempted to interpret this as “bigger is better.”
If you see that GPT-4 scores 52% on math problem-solving and Claude 3 scores 60%, it's tempting to assume that Claude 3 is better for those use cases.
However, benchmarks don’t give you the full picture of a model’s performance.
While benchmarks are useful because they give us a shared language for comparing LLMs, they can be misleading. Just because a model scores high on standard benchmarks doesn't mean it's truly better for your company-specific tasks. Here are the three biggest reasons LLM benchmarks can be misleading today:
Several studies have shown that the datasets used to calculate popular LLM benchmarks for medical applications and common sense reasoning include errors.
In other words, the evaluation datasets that pair each prompt with a “correct” answer aren't always so correct. Think of these datasets this way: given prompt X, the correct answer is Y.
Now, we give our LLM the prompt X and see what answer it gives us. Then, we repeat this on thousands or even millions of examples before measuring what percent of the time the model performed as expected relative to the answer key.
If the answer key is wrong, our score is misleading. Maybe the model's answer was technically wrong, but it matched an equally wrong entry in the answer key, so we unfairly credit the model with doing well. Or, on the flip side, the model is right, but the benchmark's answer is wrong, and we unfairly penalize it.
This results in inflated, or simply inaccurate, benchmarks for these LLMs.
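To see how answer-key errors skew a score, here is a small back-of-the-envelope simulation with assumed numbers (an 80% true accuracy, a 5% error rate in the key, and an occasional lucky match between a wrong answer and a wrong key). It is purely illustrative, not drawn from any real benchmark.

```python
import random

random.seed(0)

NUM_QUESTIONS = 100_000
TRUE_MODEL_ACCURACY = 0.80   # how often the model is actually right (assumed)
KEY_ERROR_RATE = 0.05        # fraction of answer-key entries that are wrong (assumed)

measured_correct = 0
for _ in range(NUM_QUESTIONS):
    model_is_right = random.random() < TRUE_MODEL_ACCURACY
    key_is_right = random.random() < (1 - KEY_ERROR_RATE)

    if model_is_right and key_is_right:
        measured_correct += 1      # genuinely correct, correctly credited
    elif model_is_right and not key_is_right:
        pass                       # model is right, but the bad key penalizes it
    elif not model_is_right and not key_is_right:
        # A wrong model can occasionally match a wrong key and get undeserved
        # credit; assume that happens 10% of the time for this sketch.
        if random.random() < 0.10:
            measured_correct += 1

measured_accuracy = measured_correct / NUM_QUESTIONS
print(f"True accuracy:     {TRUE_MODEL_ACCURACY:.1%}")
print(f"Measured accuracy: {measured_accuracy:.1%}")   # drifts away from the true value
```

With these assumed numbers, the reported score drifts several points below the model's true accuracy; with a different mix of errors it could just as easily drift above it, which is exactly why a published score should be treated as an estimate rather than a measurement.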
Many of the LLM benchmarks we see today are extremely esoteric. For example, the GPQA (Graduate-Level Google-Proof Q&A) benchmark highlighted in Anthropic's Claude 3 release contains hundreds of doctorate-level biology, physics, and chemistry questions. However, most people are looking to LLMs for tasks like drafting email responses or summarizing meeting notes.
These hyper-specific benchmarks don't give us a clear understanding of a model's performance on the tasks that actually matter, and they make it even more challenging for organizations to determine whether a model will help them achieve their goals.
Some benchmarks likely contain information that these models were trained on, so scoring well becomes more an exercise in memorization than in comprehension.
Memorizing and associating words together doesn’t mean a model can solve new and complex problems in different contexts. This becomes an issue if your intention is to place the model in new, real-world applications.
While standardized benchmarks are useful for evaluating AI models, they fall short when it comes to predicting performance on specific, real-world tasks. For example, a model that scores high in the “common sense reasoning” benchmark might still struggle with unique customer service interactions.
Rather than relying solely on LLM benchmarks, think about how you’d evaluate a human performing the same task. If you're integrating an LLM-based solution into customer service, consider metrics like time to resolution compared to a human operator and measure performance against your specific KPIs. These KPIs should reflect the true value the AI brings, just as they do for human employees.
These datasets, benchmarks, and KPIs tailored to your company’s specific needs become the gold standard for evaluating AI models. Testing a model against these parameters will give you insight into its real-world performance, which will likely differ from standardized benchmark results.
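As a sketch of what that kind of internal evaluation might look like, the snippet below scores a model against a handful of hypothetical customer-service cases and computes two example KPIs: resolution rate and average response time. The test cases, the call_model stub, and the matching logic are all placeholders you would replace with your own data, model calls, and success criteria.

```python
import time

# Hypothetical internal test set: real customer inquiries paired with the
# resolutions your team considers acceptable. Structure is illustrative only.
internal_test_set = [
    {"inquiry": "My invoice was charged twice this month.",
     "acceptable_resolutions": ["duplicate charge refunded", "billing corrected"]},
    {"inquiry": "How do I reset my account password?",
     "acceptable_resolutions": ["password reset link sent", "reset instructions provided"]},
]

def call_model(inquiry: str) -> str:
    """Stand-in for whatever LLM you are evaluating (API call, local model, etc.)."""
    return "duplicate charge refunded"   # placeholder response

def evaluate(test_set) -> dict:
    resolved, latencies = 0, []
    for case in test_set:
        start = time.perf_counter()
        response = call_model(case["inquiry"])
        latencies.append(time.perf_counter() - start)
        # KPI 1: did the response land on an acceptable resolution?
        if any(r in response.lower() for r in case["acceptable_resolutions"]):
            resolved += 1
    return {
        "resolution_rate": resolved / len(test_set),              # compare to your human baseline
        "avg_response_seconds": sum(latencies) / len(latencies),  # compare to human time to resolution
    }

print(evaluate(internal_test_set))
```

The numbers this produces only mean something relative to your existing baseline — how often and how quickly human operators resolve the same inquiries — which is precisely the comparison standardized benchmarks cannot make for you.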
Ultimately, benchmarks can be useful when comparing multiple LLMs, but it’s important to recognize their limitations. While a model that ranks high across benchmarks may indicate general competence, benchmarks don’t always translate to real-world success.
Even top-performing models can be wrong, and when models score just a few points apart, it’s often a coin toss as to which will better suit your business needs.
The real measure of value is how well a model performs in your specific business context. Instead of focusing solely on benchmark scores, consider internal metrics and KPIs that reflect the actual tasks at hand. This allows you to evaluate the model’s impact on business performance and identify when human oversight or intervention may be needed.
About the Author
Cal Al-Dhubaib is a globally recognized data scientist and AI strategist in trustworthy artificial intelligence, as well as the Head of AI and Data Science at Further, a data, cloud, and AI company focused on helping make sense of raw data.