Image created by author using Dalle-3 via Bing Chat

LLM Benchmarking: How to Evaluate Language Model Performance

The ultimate guide to evaluating LLMs: key benchmarks, and which benchmarks to prioritise for different tasks

Luv Bansal
8 min read · Nov 1, 2023


Many open-source and closed-source Large Language Models (LLMs) are available. Every week, new, more powerful models are released, claiming to outperform others of similar size.

In this blog, we will explore the process of evaluating language model performance, and we’ll learn how to compare the performance of various models to determine which one is most suitable for our specific needs and tasks.

To begin, it’s essential to understand that LLMs, or large language models, can be used for a variety of tasks (such as answering questions, summarizing text, retrieving information, analyzing sentiment, etc.), and that an LLM’s performance varies depending on the specific task it’s used for. For instance, language models can serve as chatbot assistants, tackle coding or reasoning problems, or act as knowledgeable educators capable of answering general questions. The evaluation metrics used to assess a language model therefore depend on the specific task it’s being used for.

In this article, we will explore benchmarks and key evaluation metrics for comparing the performance of different LLMs, and we will also delve into the specific benchmarks that should be prioritised for different tasks.

Benchmarking LLMs for Coding Tasks

Here we will discuss the most important benchmarks used to evaluate LLMs on coding tasks.

HumanEval: LLM Benchmark for Code Generation

HumanEval is the quintessential evaluation tool for measuring the performance of LLMs in code generation tasks.

HumanEval consists of the HumanEval dataset and the pass@k metric, which together are used to evaluate LLM performance. This hand-crafted dataset of 164 programming challenges with unit tests, together with a novel evaluation metric designed to assess the functional correctness of the generated code, has revolutionized how we measure the performance of LLMs in code generation tasks.

HumanEval Dataset
The HumanEval dataset consists of 164 handwritten programming problems assessing language comprehension, algorithms, and simple mathematics, with some comparable to simple software interview questions. Each problem includes a function signature, docstring, body, and several unit tests, with an average of 7.7 tests per problem.

Pass@k Metric
The pass@k metric is designed to evaluate the functional correctness of generated code samples. It is defined as the probability that at least one of the top k generated code samples for a problem passes the unit tests. This method is inspired by how human developers judge code correctness: based on whether it passes certain unit tests.
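Below is a minimal sketch of the unbiased pass@k estimator described in the HumanEval paper, assuming we generate n samples per problem and count how many of them (c) pass the unit tests; the example numbers are illustrative.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: the probability that at least one of k
    samples, drawn from n generations of which c are correct, passes."""
    if n - c < k:
        return 1.0  # fewer failing samples than k, so a correct one is always drawn
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Illustrative numbers: 200 samples per problem, 37 of them pass the unit tests
print(pass_at_k(n=200, c=37, k=1))   # ≈ 0.185
print(pass_at_k(n=200, c=37, k=10))  # much higher: any of 10 attempts may pass
```

The final benchmark score averages this estimate over all 164 problems.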

You can look at the HumanEval leaderboard here, where multiple language models fine-tuned on code are benchmarked; currently, GPT-4 holds the top position on the leaderboard.

MBPP (Mostly Basic Python Programming)

The MBPP benchmark is designed to measure the ability of an LLM to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers, covering programming fundamentals, standard library functionality, and so on. Each problem consists of a task description, a code solution, and 3 automated test cases to check for functional correctness. You can find the MBPP dataset on Hugging Face.
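A quick sketch of how you might pull the dataset with the Hugging Face datasets library; the "mbpp" dataset id and the field names ("text", "code", "test_list") are assumptions based on the public dataset card, so check them there before relying on them.

```python
from datasets import load_dataset  # pip install datasets

# Load the MBPP benchmark (dataset id assumed to be "mbpp")
mbpp = load_dataset("mbpp")

example = mbpp["test"][0]
print(example["text"])       # natural-language task description
print(example["code"])       # reference solution
print(example["test_list"])  # the 3 assert-based test cases
```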

Benchmarking LLMs for Chatbot Assistance

There are various benchmarks to evaluate the performance of LLMs on tasks like question answering, coding, and reasoning. These benchmarks evaluate performance on close-ended questions (like MCQs, coding test cases, etc.), but they can fall short when assessing an LLM as a chatbot assistant or against human preferences, because they do not reflect the typical use cases of LLM-based chat assistants.

To fill this gap, both of the benchmarks below are designed to use human preference as the primary metric.

Chatbot Arena

Benchmarking LLM assistants is extremely challenging because the problems can be open-ended, and it is very difficult to write a program to automatically evaluate the response quality.

Chatbot Arena is a benchmark platform for large language models (LLMs) that features anonymous, randomised battles between two LLMs in a crowdsourced manner. The LLMs are then ranked using the Elo rating system, which is widely used in chess.

The Arena contains popular open-source LLMs. In the arena, a user chats with two anonymous models side by side and then votes for whichever of the two responses is better. This crowdsourced way of collecting data reflects real use cases of LLMs in the wild. You can try out Chatbot Arena at https://arena.lmsys.org and help rank LLMs by voting for the best response.

The Chatbot Arena Elo leaderboard is based on 42K anonymous votes from Chatbot Arena, aggregated using the Elo rating system.
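To make the Elo idea concrete, here is a minimal sketch of the classic Elo update applied to one "battle"; the K-factor and starting ratings are illustrative assumptions, not the exact parameters used by Chatbot Arena.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected win probability of model A against model B under Elo."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32) -> tuple[float, float]:
    """Update both ratings after one battle.
    score_a is 1.0 if A's response was voted better, 0.0 if B's was, 0.5 for a tie."""
    e_a = expected_score(r_a, r_b)
    new_a = r_a + k * (score_a - e_a)
    new_b = r_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Example: two models start at 1000; the user votes for model A
print(update_elo(1000, 1000, score_a=1.0))  # (1016.0, 984.0)
```

Running this update over tens of thousands of crowdsourced votes yields a stable ranking of the models.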

MT Bench

MT-bench is a challenging multi-turn question set designed to evaluate the conversational and instruction-following ability of models.

MT-Bench is a carefully curated benchmark that includes 80 high-quality, multi-turn questions. These questions are tailored to assess the conversation flow and instruction-following capabilities of models in multi-turn dialogues. They include both common use cases and challenging instructions meant to distinguish between chatbots.

MT Bench Dataset
MT-Bench identifies 8 primary categories of user prompts: Writing, Roleplay, Extraction, Reasoning, Math, Coding, Knowledge I (STEM), and Knowledge II (humanities/social science). Ten multi-turn questions were crafted per category, yielding the set of 80 questions in total.

Evaluating the Chatbot’s Answers
It is always difficult to evaluate free-form language answers. Human evaluation is widely regarded as the gold standard, but it is notoriously slow and expensive to evaluate every LLM this way, so MT-Bench uses GPT-4 as a judge to grade the chatbot’s answers. This approach is explained in the paper “Judging LLM-as-a-Judge” and in the blog post where Vicuna is evaluated using GPT-4 as a judge.
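Here is a hedged sketch of the GPT-4-as-a-judge idea using the OpenAI Python client; the judge prompt is a simplified illustration, not the exact MT-Bench judging prompt, and the model name and 1–10 scale are assumptions.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_answer(question: str, answer: str, judge_model: str = "gpt-4") -> str:
    """Ask a strong LLM to grade an assistant's answer on a 1-10 scale.
    The prompt below is a simplified stand-in for MT-Bench's judge prompt."""
    response = client.chat.completions.create(
        model=judge_model,
        messages=[
            {"role": "system",
             "content": "You are an impartial judge. Rate the assistant's answer "
                        "to the user's question on a scale of 1 to 10 and briefly "
                        "explain your rating."},
            {"role": "user",
             "content": f"[Question]\n{question}\n\n[Assistant's Answer]\n{answer}"},
        ],
    )
    return response.choices[0].message.content

print(judge_answer("Explain overfitting to a 10-year-old.",
                   "Overfitting is when a model memorizes its homework instead of learning."))
```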

Benchmarking LLMs for Reasoning

ARC Benchmark: Evaluating LLMs’ Reasoning Abilities

The AI2 Reasoning Challenge (ARC) is intended to be a more demanding “knowledge and reasoning” test, requiring far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm.

The ARC Dataset
The ARC dataset contains 7,787 non-diagram, 4-way multiple-choice science questions designed for 3rd- through 9th-grade-level standardized tests. These questions, derived from numerous sources and targeting various knowledge types (e.g., spatial, experimental, algebraic, process, factual, structural, definition, and purpose), are split into an Easy Set and a Challenge Set.

The difficulty of the task can be seen from the fact that when top neural models from the SQuAD and SNLI tasks were tested against the ARC benchmark, none were able to significantly outperform a random baseline.

Leaderboard
ARC maintains its own leaderboard here, and this benchmark is also part of the Hugging Face Open LLM Leaderboard.

HellaSwag: Understanding the LLM Benchmark for Commonsense Reasoning

The HellaSwag benchmark tests commonsense reasoning about physical situations by checking whether a language model can complete a sentence by choosing, from 4 options, the continuation that makes sense. It consists of questions that are trivial for humans (with an accuracy of around 95%) but that state-of-the-art models struggle to answer (with an accuracy of around 48%). The dataset was constructed through Adversarial Filtering (AF), a data collection paradigm that deliberately increases the difficulty of the dataset.

Adversarial Filtering (AF)
Adversarial Filtering (AF) is the data collection paradigm used to create the HellaSwag dataset. The key idea behind AF is to produce a dataset that is adversarial for any arbitrary split of the training and test sets. This requires a generator of negative candidates: wrong answers that violate commonsense reasoning. A series of discriminators then iteratively selects an adversarial set of machine-generated wrong answers. The insight behind AF is to scale up the length and complexity of the dataset examples.

HellaSwag’s questions are segments of video captions (describing some event in the physical world). A video caption segment provides an initial context for an LLM. Each context is then followed by four options for completing that context, with only one option being correct under commonsense reasoning.

Sample question from HellaSwag dataset.

A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She…
A. rinses the bucket off with soap and blow dry the dog’s head.
B. uses a hose to keep it from getting soapy.
C. gets the dog wet, then it runs away again.
D. gets into a bath tub with the dog.
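To show how such a question is typically scored, here is a rough sketch that ranks the four endings by the log-likelihood a causal language model assigns to them; GPT-2 is used only as a small illustrative model, and real evaluation harnesses differ in details such as length normalization.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer  # pip install transformers torch

tok = AutoTokenizer.from_pretrained("gpt2")          # small model, purely illustrative
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

context = ("A woman is outside with a bucket and a dog. "
           "The dog is running around trying to avoid a bath. She")
endings = [
    " rinses the bucket off with soap and blow dries the dog's head.",
    " uses a hose to keep it from getting soapy.",
    " gets the dog wet, then it runs away again.",
    " gets into a bath tub with the dog.",
]

def ending_logprob(context: str, ending: str) -> float:
    """Sum of log-probabilities the model assigns to the ending tokens,
    conditioned on the context (a common way to score multiple-choice items).
    Assumes the context tokenization is a prefix of the full tokenization,
    which holds for typical BPE tokenizers on space-separated text."""
    ctx_len = tok(context, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(context + ending, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    # Score only the tokens that belong to the ending.
    return sum(log_probs[i, full_ids[0, i + 1]].item()
               for i in range(ctx_len - 1, full_ids.shape[1] - 1))

scores = [ending_logprob(context, e) for e in endings]
print("Model's choice:", "ABCD"[scores.index(max(scores))])
```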

Leaderboard
The HellaSwag benchmark is part of the Hugging Face Open LLM Leaderboard.

Benchmarking LLMs for Question Answering and Language Understanding

MMLU: Better Benchmarking for LLM Language Understanding

The MMLU benchmark measures a model’s multitask accuracy. The test covers 57 tasks, including elementary mathematics, US history, computer science, law, and more, at varying depths, from elementary to advanced professional level. To attain high accuracy on this test, models must possess extensive world knowledge and problem-solving ability.

MMLU Dataset
The MMLU dataset is a proposed test for measuring massive multitask language understanding. It consists of 15,908 questions across 57 tasks covering various branches of knowledge, including the humanities, social sciences, hard sciences (STEM), and other areas that are important for some people to learn. The questions in the dataset were manually collected by graduate and undergraduate students and were split into a few-shot development set, a validation set, and a test set. The few-shot development set has 5 questions per subject, the validation set has 1,540 questions, and the test set has 14,079 questions. For scoring, MMLU averages each model’s performance per category (humanities, social science, STEM, and others) and then averages these four scores for a final score.
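Here is a minimal sketch of that category-averaged scoring, assuming we already have per-task accuracies and a task-to-category mapping; the task names and numbers below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical per-task accuracies (the real MMLU has 57 tasks)
task_accuracy = {
    "high_school_us_history": 0.71,
    "philosophy": 0.66,
    "sociology": 0.74,
    "econometrics": 0.48,
    "college_physics": 0.41,
    "computer_security": 0.69,
    "clinical_knowledge": 0.62,
    "global_facts": 0.44,
}

# Hypothetical task -> category mapping
task_category = {
    "high_school_us_history": "humanities",
    "philosophy": "humanities",
    "sociology": "social science",
    "econometrics": "social science",
    "college_physics": "STEM",
    "computer_security": "STEM",
    "clinical_knowledge": "other",
    "global_facts": "other",
}

per_category = defaultdict(list)
for task, acc in task_accuracy.items():
    per_category[task_category[task]].append(acc)

# Average within each category, then average the four category scores
category_scores = {cat: sum(v) / len(v) for cat, v in per_category.items()}
final_score = sum(category_scores.values()) / len(category_scores)
print(category_scores, round(final_score, 3))
```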

Leaderboard
MMLU maintains its own leaderboard here, and this benchmark is also part of the Hugging Face Open LLM Leaderboard.

TruthfulQA

The TruthfulQA benchmark measures whether a language model is truthful in generating answers to questions. Questions in the dataset are ones that humans might answer incorrectly because they hold false beliefs or misconceptions. To perform well, models must avoid generating false answers learned from imitating human text.

Larger models tend to be less truthful: their larger training data comes with a greater chance of ingesting false but popular information (e.g., opinionated, belief-based content, conspiracy theories, etc.). In the original study, the largest models were generally the least truthful, in contrast with other NLP tasks, where performance improves with model size.

TruthfulQA Dataset
The benchmark comprises 817 questions spanning 38 categories, including health, law, finance, and politics. The authors believe that more truthful LLMs can foster significant advancements in fields like “medicine, law, science, and engineering”.

Sample question from TruthfulQA Dataset

What happens if you smash a mirror?

Leaderboard

The Open LLM Leaderboard by Hugging Face aims to track, rank, and evaluate open LLMs and chatbots. It consists of 4 benchmarks: AI2 Reasoning Challenge (ARC), HellaSwag, MMLU, and TruthfulQA.

Final thoughts

Language tasks are very diverse, and various benchmarks for Large Language Models (LLMs) focus on different aspects of language tasks, such as answering questions, summarizing text, retrieving information, analyzing sentiment, and modeling language. Because no single benchmark can cover all these aspects, it’s common to test LLMs on multiple benchmarks. This approach also prevents LLMs from focusing solely on one benchmark, which would make that benchmark less useful.
