Stanford study challenges assumptions about language models: Larger context doesn’t mean better understanding 

A study released this month by researchers from Stanford University, UC Berkeley and Samaya AI has found that large language models (LLMs) often fail to access and use relevant information given to them in longer context windows.

In language models, a context window refers to the length of text a model can process and respond to in a given instance. It can be thought of as a working memory for a particular text analysis or chatbot conversation.
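
As a rough illustration (not any particular model's actual implementation), the sketch below trims input to a fixed token budget; whitespace splitting stands in for a real subword tokenizer, and the 100,000-token limit is only an example figure.

```python
def fit_to_context(text: str, max_tokens: int = 100_000) -> str:
    """Keep only as much text as fits in the model's context window.
    Whitespace splitting is a crude stand-in for a subword tokenizer."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text
    # Anything beyond the window is simply never seen by the model.
    return " ".join(tokens[:max_tokens])
```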

The study caught widespread attention last week after its release because many developers and other users experimenting with LLMs had assumed that the trend toward larger context windows would continue to improve LLM performance and usefulness across various applications.

If an LLM could take an entire document or article as input for its context window, the conventional thinking went, the LLM could provide perfect comprehension of the full scope of that document when asked questions about it. 

Assumptions around context window flawed

LLM companies like Anthropic have fueled excitement around the idea of longer context windows, where users can provide ever more input to be analyzed or summarized. Anthropic just released a new model called Claude 2, which provides a huge 100K-token context window, and said it can enable new use cases such as summarizing long conversations or drafting memos and op-eds.

But the study shows that some assumptions around the context window are flawed when it comes to the LLM’s ability to search and analyze it accurately. 

The study found that model performance was highest “when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts. Furthermore, performance substantially decreases as the input context grows longer, even for explicitly long-context models.”
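
The effect can be pictured with a small test harness of the kind the paper describes. The code below is only a hedged sketch, not the study's own: the passage that actually contains the answer is slid to different positions among distractor passages, and accuracy is recorded at each position. The `ask_model` callable and the passages are hypothetical placeholders.

```python
def build_prompt(question: str, relevant: str, distractors: list[str], position: int) -> str:
    """Place the answer-bearing passage at a chosen index among distractors,
    then append the question."""
    passages = distractors[:position] + [relevant] + distractors[position:]
    numbered = "\n\n".join(f"Document {i + 1}: {p}" for i, p in enumerate(passages))
    return f"{numbered}\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(ask_model, question: str, answer: str,
                         relevant: str, distractors: list[str]) -> dict[int, bool]:
    """Query the model with the relevant passage at each possible position
    and record whether the expected answer appears in the response."""
    results = {}
    for pos in range(len(distractors) + 1):
        response = ask_model(build_prompt(question, relevant, distractors, pos))
        results[pos] = answer.lower() in response.lower()
    return results
```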

Last week, industry insiders like Bob Wiederhold, COO of vector database company Pinecone, cited the study as evidence that stuffing entire documents into a context window for tasks like search and analysis won’t be the panacea many had hoped for.

Semantic search preferable to document stuffing

Vector databases like Pinecone help developers increase LLM memory by searching for relevant information to pull into the context window. Wiederhold pointed to the study as evidence that vector databases will remain viable for the foreseeable future, since the study suggests semantic search provided by vector databases is better than document stuffing. 
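
In practice, that retrieval step looks roughly like the sketch below: embed the question, score document chunks by cosine similarity, and place only the top matches into the prompt. This is a generic pattern under stated assumptions, not Pinecone's API; the `embed` function is a placeholder for whatever embedding model a developer chooses.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in any sentence-embedding model or embedding API."""
    raise NotImplementedError

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    """Score pre-chunked document text by cosine similarity to the question
    and return the k best-matching chunks to put in the context window."""
    q = embed(question)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        similarity = float(np.dot(q, c) / (np.linalg.norm(q) * np.linalg.norm(c)))
        scored.append((similarity, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:k]]

# The retrieved chunks, not the whole document, become the model's context:
# context = "\n\n".join(top_k_chunks(question, chunks, k=3))
```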

Stanford University’s Nelson Liu, the study’s lead author, agreed that if you try to inject an entire PDF into a language model’s context window and then ask questions about the document, a vector database search will generally be more efficient.

“If you’re searching over large amounts of documents, you want to be using something that’s built for search, at least for now,” said Liu. 

Liu cautioned, however, that the study isn’t necessarily claiming that sticking entire documents into a context window won’t work. Results will depend on the sort of content in the documents the LLMs are analyzing. Language models are bad at differentiating among many passages that are closely related or all appear relevant, Liu explained, but they are good at finding the one thing that is clearly relevant when most other things are not.

“So I think it’s a bit more nuanced than ‘You should always use a vector database, or you should never use a vector database’,” he said.

Language models’ best use case: Generating content

Liu said his study assumed that most commercial applications are operating in a setting where they use some sort of vector database to help return multiple possible results into a context window. The study found that having more results in the context window didn’t always improve performance. 

As a specialist in language processing, Liu said he was surprised that people were thinking of using a context window to search for content, or to aggregate or synthesize it, although he said he could understand why people would want to. He said people should continue to think of language models as best used to generate content, and search engines as best to search content. 

“The hope that you can just throw everything into a language model and just sort of pray it works, I don’t think we’re there yet,” he said. “But maybe we’ll be there in a few years or even a few months. It’s not super clear to me how fast this space will move, but I think right now, language models aren’t going to replace vector databases and search engines.”