The Hallucination Problem of Large Language Models

Neeraj Varshney
Aug 20, 2023

Hallucination in the context of language models refers to the generation of text or responses that seem syntactically sound, fluent, and natural but are factually incorrect, nonsensical, or unfaithful to the provided source input.

Recently developed large language models such as GPT-3, InstructGPT, PaLM, LLaMA, and several others have achieved remarkable performance on a wide range of language understanding tasks. Furthermore, they have been shown to possess an impressive ability to generate fluent and coherent text. Despite all these abilities, their tendency to ‘hallucinate’ critically hampers their reliability and limits their widespread adoption in real-world applications. In this article, we will take a deeper look into this hallucination problem of Large Language Models (LLMs) and study a variety of methods to address this problem.

Outline

  1. Why is it Important to Address the Hallucination Problem?
  2. Are Hallucinations Always Undesirable?
  3. What are the different Types of Hallucinations?
  4. Why do LLMs Hallucinate?
  5. What are the different ways of Measuring Hallucinations?
  6. How can we Reduce LLM Hallucinations?
  7. When do LLMs Hallucinate the Most?

Check out my recent paper on detecting and mitigating hallucinations of LLMs.

A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation

Why is it Important to Address the Hallucination Problem of LLMs?

Hallucinations of LLMs can have serious consequences, such as the spread of misinformation and violations of privacy, and they raise safety concerns for real-world applications. For example, in medical applications, a hallucinated report generated from a patient’s information can pose serious risks to the patient and could even provoke a life-threatening incident. Hallucinations hamper the reliability and trustworthiness of the model, which makes it important to address this problem.

Are Hallucinations ALWAYS Undesirable?

Hallucination cannot be regarded as BAD or UNDESIRABLE for all tasks, as it is a part of creative writing and thinking. For example, while writing a movie plot, hallucination can play a vital role in coming up with an interesting story. So, the level of tolerance for hallucinations is application-dependent.

What are the different Types of Hallucinations?

There are two main types of Hallucinations, namely, intrinsic and extrinsic.

Intrinsic Hallucinations: When the generated output contradicts the source content.

Extrinsic Hallucinations: When the generated output cannot be verified from the source content, i.e., the output can neither be supported nor contradicted by the source.

Why do LLMs Hallucinate?

One thread of research pertaining to hallucinations of LLMs has focused on studying the different causes of this phenomenon.

Source-Reference Divergence: When a model is trained on data with source-reference (target) divergence, it may learn to generate text that is not necessarily grounded in or faithful to the given source. Such divergence can be introduced unintentionally or intentionally while collecting the data.

Unintentional Source-Reference Divergence: There could be several causes of this type of divergence. For example, the data may be heuristically created, so the target may contain information that is not always supported by the source. For instance, if you take news reports about an incident from two different websites as a source-reference pair, then the reference may contain information that is absent from the source, causing the divergence.

Intentional Source-Reference Divergence: Some tasks by nature do not always demand information alignment between the source and the target, especially those that value diversity in the generated output.

Another factor is the presence of duplicates in the training corpus: duplicated examples can bias the model towards generating certain highly frequent tokens/phrases, causing hallucination.

Using such noisy data for training is one of the factors contributing to the hallucination phenomenon.

Stochasticity in the Decoding Technique: It has been shown that decoding strategies that improve generation diversity, such as top-k sampling, top-p (nucleus) sampling, and higher temperature values, often result in increased hallucinations. This can be attributed to the “randomness/stochasticity” introduced by sampling a token (from the top-k or top-p set) instead of always choosing the most probable token during decoding.
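To make this concrete, here is a minimal sketch of temperature plus top-p (nucleus) sampling over a vector of next-token logits. It is only an illustration of the sampling step, not any particular library’s implementation.

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_p=0.9, rng=None):
    """Sample a token id using temperature scaling + nucleus (top-p) sampling."""
    rng = rng or np.random.default_rng()

    # Temperature scaling: higher temperature flattens the distribution.
    scaled = np.asarray(logits, dtype=float) / temperature
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()

    # Keep the smallest set of tokens whose cumulative probability covers top_p.
    order = np.argsort(probs)[::-1]
    cutoff = np.searchsorted(np.cumsum(probs[order]), top_p) + 1
    nucleus = order[:cutoff]

    # The random draw below is the "stochasticity" discussed above;
    # greedy decoding would simply return int(np.argmax(probs)).
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))
```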

Parametric knowledge bias: Models have been shown to often prioritize the parametric knowledge (knowledge acquired during pre-training and implicitly stored in the parameters of the model) over the provided contextual knowledge resulting in hallucinations.

Discrepancy between Training-Time and Inference-Time Decoding: A common way of training a model is the teacher-forced maximum likelihood estimation (MLE) method, where the decoder is encouraged to predict the next token conditioned on the ground-truth prefix sequence. However, at inference time, the model generates the next token conditioned on the sequence it has previously generated itself. This discrepancy can lead to hallucinated generation, especially when the target sequence becomes long.
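The sketch below contrasts the two regimes using a hypothetical toy model interface (model.next_token_logits is a placeholder, not a real API): training conditions each step on the ground-truth prefix, while generation conditions each step on whatever the model has already produced.

```python
# Hypothetical toy interface to contrast the two regimes; not a real library API.

def teacher_forced_loss(model, target_tokens, loss_fn):
    """Training: every step is conditioned on the GROUND-TRUTH prefix."""
    total = 0.0
    for t in range(1, len(target_tokens)):
        logits = model.next_token_logits(target_tokens[:t])
        total += loss_fn(logits, target_tokens[t])
    return total / (len(target_tokens) - 1)

def free_running_generate(model, prompt_tokens, max_new_tokens, sample_fn):
    """Inference: every step is conditioned on the model's OWN previous outputs,
    so an early hallucinated token feeds into all later predictions."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        tokens.append(sample_fn(model.next_token_logits(tokens)))
    return tokens
```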

Also, generative LMs like GPT-3 are trained to model the statistical correlations between subword tokens, and thus in reality they can only acquire a limited capability to generate factually accurate text.

What are the different ways of Measuring Hallucinations?

Human evaluation is one of the most commonly used and reliable ways of measuring hallucination. However, it is very expensive and time-consuming; thus, several automatic metrics have also been proposed.

Named Entity Error: Since Named Entities (NEs) are one of the core building blocks of “fact”, NE matching can be used to calculate the overlap between the generated output and the reference. Intuitively, a model can be considered as hallucinating (making factual errors) if it generates a NE that does not appear in the ground-truth knowledge source.

It can be defined as the #NEs in the generated content that do not appear in the ground-truth text, divided by the total #NEs in the generated content. Lower values of this metric are preferred.

Note that this metric requires an exhaustive ground-truth reference text.
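As an illustration, here is a simple version of this metric using spaCy’s named entity recognizer (assuming the en_core_web_sm pipeline is installed). Real implementations may use fuzzier matching than the exact string comparison used here.

```python
import spacy

# Assumes the small English pipeline is installed:
#   python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def named_entity_error(generated_text, reference_text):
    """Fraction of named entities in the generation that are absent from the reference."""
    generated_ents = {ent.text.lower() for ent in nlp(generated_text).ents}
    reference_ents = {ent.text.lower() for ent in nlp(reference_text).ents}
    if not generated_ents:
        return 0.0  # no entities generated, so no entity-level errors to count
    unsupported = generated_ents - reference_ents
    return len(unsupported) / len(generated_ents)
```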

Entailment Ratio: It is defined as the #sentences entailed by the reference text divided by the total #sentences in the generated output. For this, an off-the-shelf entailment/NLI model can be used.
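A rough sketch of this metric with an off-the-shelf NLI model from Hugging Face Transformers is shown below (roberta-large-mnli is one public choice; splitting the generated output into sentences is left to the caller).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Any off-the-shelf NLI model can be plugged in here.
MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
ENTAIL_ID = {v.lower(): k for k, v in model.config.id2label.items()}["entailment"]

def entailment_ratio(reference_text, generated_sentences):
    """Fraction of generated sentences entailed by the reference text."""
    entailed = 0
    for sentence in generated_sentences:
        inputs = tokenizer(reference_text, sentence, return_tensors="pt", truncation=True)
        with torch.no_grad():
            probs = model(**inputs).logits.softmax(dim=-1)[0]
        if probs.argmax().item() == ENTAIL_ID:
            entailed += 1
    return entailed / len(generated_sentences)
```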

Model-based Evaluation: These metrics are proposed to handle more complex syntactic and even semantic variations.

Using a Question Answering System: This is based on the intuition that, if the generation is factually consistent with the source, answering the same question from the generated text and from the source should produce similar answers. Specifically, given a generated text, a question generation (QG) model generates a set of question-answer pairs. Then, a question-answering model answers the generated questions using the ground-truth source text as the reference, and the similarity of the corresponding answers is calculated.
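Here is a simplified sketch of the answer-comparison step using the Transformers question-answering pipeline. The question-generation step and the answer-similarity function (same_answer) are left abstract, since they vary across implementations.

```python
from transformers import pipeline

# Off-the-shelf extractive QA pipeline; in practice a question-generation model
# would first produce `questions` from the generated text (omitted here).
qa = pipeline("question-answering")

def qa_consistency(questions, generated_text, source_text, same_answer):
    """Fraction of questions answered identically from the generation and the source."""
    consistent = 0
    for question in questions:
        answer_from_generation = qa(question=question, context=generated_text)["answer"]
        answer_from_source = qa(question=question, context=source_text)["answer"]
        if same_answer(answer_from_generation, answer_from_source):
            consistent += 1
    return consistent / len(questions)

# `same_answer` could be exact match, token-level F1, or an embedding-similarity threshold.
```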

Using an Information Extraction System: This method uses information extraction models to represent the knowledge as simpler relational tuples, such as <subject, relation, object>. Such tuples are extracted from the generated output and verified against relation tuples extracted from the source.

How can we Reduce LLM Hallucinations?

Data-related concerns contributing to hallucinations can be (at least theoretically) addressed by creating a high-quality noise-free dataset. However, it could be very difficult to validate and clean hundreds of gigabytes of text corpora.

A number of different methods have been proposed to address the hallucination problem, such as (a) leveraging external knowledge to validate correctness, (b) modifying the decoding strategy, and (c) sampling multiple outputs and checking their consistency. In this article, we will cover a representative method from each of these categories.

  1. Active Detection and Mitigation of Hallucinations via Validation Using External Knowledge

This approach is proposed in my recent paper: A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation

Here, we first present two interesting findings that motivate this approach and then provide details of the approach.

Finding 1: Hallucination Propagates During Generation

We demonstrate that when there is a hallucination in a sentence generated by the model, it increases the chances of hallucination in the model’s subsequently generated sentences.

For instance, if the model generates the sentence: “Joe Biden was born in Germany in 1947” which is hallucinated, then as the next sentence, the model may generate something about Biden’s education in Germany or his place of birth in Germany which will again be hallucinated. In other words, hallucination propagates. Please refer to our paper where we present empirical evidence to illustrate this phenomenon. This implies that if we can “actively” detect and mitigate hallucination then we can also prevent its propagation in the subsequently generated sentences.

Finding 2: Logit Output Values Provide a Signal for Hallucination

We demonstrate that the logit output values (probability distribution over the output vocabulary) can be leveraged to obtain a signal for hallucination. Specifically, we calculate a probability score and show that when this score is low, the model tends to hallucinate more. Thus, it can be used as a signal for hallucination and when the score is low, information validation of the generated content can be performed.
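As a rough illustration (not the exact scoring function from the paper), one simple way to turn logit outputs into such a signal is to look at the probabilities assigned to the generated tokens of a concept:

```python
import math

def concept_probability_score(token_logprobs):
    """One simple variant: the minimum token probability over a concept's tokens.

    `token_logprobs` are the log-probabilities of the generated tokens (taken
    from the model's logit outputs). A low score indicates low confidence,
    i.e., the concept is a candidate for validation.
    """
    return math.exp(min(token_logprobs))
```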

Building on top of these two motivating findings, we propose an approach of active detection and mitigation.

Figure 1 illustrates this approach. Given an input, we iteratively generate sentences from the model and actively detect and mitigate hallucinations. Specifically, in the detection stage, we first identify the candidates of potential hallucination, i.e., the important concepts of the generated sentence. Then, we calculate the model’s uncertainty on them by leveraging its logit output values. This is followed by validating the correctness of the uncertain concepts (those for which the calculated probability score is not sufficiently high) by retrieving relevant knowledge.

In the mitigation stage, we repair the hallucinated sentence using the retrieved knowledge as evidence. Finally, we append the repaired sentence to the input (and previously generated sentences) and continue generating the next sentence. This procedure not only mitigates the detected hallucination but also prevents its propagation in the subsequently generated sentences.
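The following sketch summarizes the loop in Figure 1. All helper functions (identify_key_concepts, concept_probability, retrieve_knowledge, repair_sentence) are hypothetical placeholders rather than a released API.

```python
def generate_with_active_mitigation(model, prompt, max_sentences, threshold=0.5):
    """High-level sketch of the active detect-and-mitigate loop."""
    context = prompt
    output_sentences = []
    for _ in range(max_sentences):
        sentence, logits = model.generate_next_sentence(context)

        # Detection: score the important concepts of the sentence using the
        # model's logit output values (e.g., a min/avg token probability).
        concepts = identify_key_concepts(sentence)
        uncertain = [c for c in concepts if concept_probability(c, logits) < threshold]

        # Validation + mitigation: check uncertain concepts against retrieved
        # knowledge and repair the sentence if a hallucination is found.
        if uncertain:
            evidence = retrieve_knowledge(sentence, uncertain)
            sentence = repair_sentence(model, sentence, evidence)

        # Append the (repaired) sentence so later sentences condition on it,
        # which is what prevents hallucinations from propagating.
        output_sentences.append(sentence)
        context += " " + sentence
    return " ".join(output_sentences)
```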

In the paper, we showed the effectiveness of this approach in mitigating hallucinations of models such as GPT-3.5 (text-davinci-003) and Vicuna on multiple tasks, such as article generation, multi-hop question answering, and answering false-premise questions.

Figure 1: Active Detection and Mitigation approach to address the hallucination problem of LLMs.

In the paper, we also explored a self-inquiry method where instead of retrieving from the web, we ask the LM itself to answer the validation query, i.e., leverage its parametric knowledge.

2. Factual-Nucleus Sampling

In this method, the authors argue that the “randomness” of sampling is more harmful to factuality when it is used to generate the latter part of a sentence than the beginning. Since there is no preceding text at the start of a sentence, it is safe for the LM to generate anything as long as it is grammatical and fits the context. However, as the generation proceeds, the premise becomes more determined, and fewer word choices can keep the sentence factual. Therefore, they introduce the factual-nucleus sampling algorithm, which dynamically adapts the “nucleus” p along the generation of each sentence. In factual-nucleus sampling, the nucleus probability p_t used to generate the t-th token within each sentence is:

p_t = max(ω, p · λ^(t−1)), where p is the original top-p value, λ is the decay factor for the top-p probability, and ω lower-bounds the decay of the probability.
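A minimal sketch of this schedule is below; the default values are illustrative, not the paper’s tuned hyperparameters.

```python
def factual_nucleus_p(t, p=0.9, decay_lambda=0.9, omega=0.3):
    """Nucleus probability for the t-th token of a sentence (t starts at 1).

    The nucleus shrinks as the sentence proceeds (more greedy, hence more factual),
    but never below the lower bound omega; it resets to p at each new sentence.
    """
    return max(omega, p * decay_lambda ** (t - 1))

# Example schedule with p=0.9, lambda=0.9, omega=0.3:
# t=1 -> 0.90, t=2 -> 0.81, t=3 -> 0.729, ... until it bottoms out at 0.30.
```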

3. SelfCheckGPT

SelfCheckGPT is based on the intuition that if a model has knowledge of a concept, then sampled responses are likely to be similar and contain consistent facts. However, for hallucinated facts, stochastically sampled responses are likely to diverge and contradict one another. So, they sample multiple responses from the model (e.g. by varying the temperature parameter) and measure information consistency between the different responses to determine which statements are factual and which are hallucinated. This information consistency can be calculated using various methods such as neural methods to calculate semantic equivalence (like BERTScore) or using IE/QA-based methods (described above in the Metrics section).
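Here is a simplified BERTScore-based variant of this consistency check, using the bert-score package; it scores a single generated sentence against N stochastically re-sampled responses.

```python
from bert_score import score  # pip install bert-score

def selfcheck_consistency(sentence, sampled_responses):
    """Average BERTScore-F1 between one generated sentence and each sampled response.

    A low score means the re-sampled responses do not support the sentence,
    which suggests it may be hallucinated.
    """
    cands = [sentence] * len(sampled_responses)
    _, _, f1 = score(cands, sampled_responses, lang="en", verbose=False)
    return f1.mean().item()
```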

When do LLMs Hallucinate the Most?

  • Numerals: Models have been shown to hallucinate frequently while generating numerals, such as dates, quantities, and scalars.
  • Long Text: Several tasks, such as document summarization and dialogue systems with a long conversation history, require understanding long-range dependencies. Models often tend to self-contradict while generating such long outputs.
  • Reasoning: Misunderstanding facts/information present in the source text can lead to hallucinations and errors, and accurately understanding the source context requires the ability to reason. Conversely, if the generated output can be reasoned backwards to the source, then it can be considered faithful.
  • When Contextual Knowledge Conflicts with the Parametric Knowledge: Models have been shown to prioritize parametric knowledge (acquired during pre-training) over the provided contextual knowledge, which leads to hallucinations.
  • When the Context itself contains a False Premise: When the provided context contains incorrect information or a false premise (such as “Why are golf balls bigger than basketballs?” or “Why does Helium have an atomic number of 1?”), models often fail to detect it and hallucinate in their output.

My recent paper on this topic: A Stitch in Time Saves Nine: Detecting and Mitigating Hallucinations of LLMs by Validating Low-Confidence Generation

Other Interesting Papers on Hallucination:

  • Improving Factuality and Reasoning in Language Models through Multiagent Debate
  • Trusting Your Evidence: Hallucinate Less with Context-aware Decoding
  • How Language Model Hallucinations Can Snowball
  • Why Does ChatGPT Fall Short in Providing Truthful Answers?

