Intelligent Document Processing with AWS AI Services and Amazon Bedrock

ODSC - Open Data Science
Oct 27, 2023

Companies in sectors like healthcare, finance, legal, retail, and manufacturing frequently handle large numbers of documents as part of their day-to-day operations. These documents often contain vital information that drives timely decision-making, which is essential for ensuring top-tier customer satisfaction and reducing customer churn. Traditionally, extracting data from documents has been a manual process, making it slow, error-prone, costly, and challenging to scale, and it pulls staff away from more critical activities. While the industry has achieved some amount of automation through traditional OCR tools, these methods have proven brittle, expensive to maintain, and a source of technical debt. With Intelligent Document Processing (IDP) leveraging artificial intelligence (AI), extracting data from large volumes of documents with differing types and structures becomes efficient and accurate, enabling timely, high-quality business decision-making while curbing overall expenses.

Additionally, generative AI with large language models (LLMs) is adding powerful capabilities to IDP solutions, often bridging gaps that existed even with highly trained ML models. In this article, I briefly discuss the various phases of IDP and how generative AI is being utilized to augment existing IDP workloads or develop new ones.

Phases of Intelligent Document Processing

At a high level, a standard IDP workflow consists of four phases — Document Storage, followed by Document Classification, Extraction, and Enrichment. Note that not every use case implements all of these phases, but most common use cases combine more than one of them, with classification and extraction being the most popular. The following diagram shows how we visualize these IDP phases. In this article, we will focus mainly on the document classification and extraction phases and the AI components and mechanisms involved. For each phase, we will briefly discuss why it matters and then dive deeper into its implementation with generative AI and machine learning, using cloud-based AI services such as Amazon Comprehend, Amazon Textract, and LLMs via Amazon Bedrock. We will also discuss implementation details with the popular open-source LangChain Python library.

Figure 1: Phases of Intelligent Document Processing

Document Classification

A common challenge when dealing with large volumes of documents is identifying the documents, sorting them into categories, and then processing them further. This is traditionally a manual or heuristics-based process, but as the volume of documents grows these methods become bottlenecks to business processes and outcomes. The core idea behind this phase is automating categorization using AI: documents that are otherwise unknown, or that could not previously be sorted neatly into distinct categories, are classified and labeled in an automated fashion. For document classification, we utilize Amazon Comprehend custom classifier models.

With Amazon Comprehend, we get the ability to train a custom document classification model that is tailored specifically to identifying the documents in your use case. The model takes both the layout and the content (text) of a document into consideration to perform classification. You can train a single model to classify up to 1,000 different classes, and documents can be in PDF, JPG, PNG, DOCX, or TXT format. Documents can be single-page or multi-page; for multi-page documents, the model gives you a page-level document class along with confidence scores.
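
Once a custom classifier model is trained and deployed to a real-time endpoint, classifying a document is a single API call. The following is a minimal sketch using boto3; the endpoint ARN and the extracted document text are placeholders for your own resources.

```
import boto3

comprehend = boto3.client("comprehend")

# Hypothetical endpoint ARN of a trained Amazon Comprehend custom classifier
ENDPOINT_ARN = "arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/idp-demo"

# Text previously extracted from the document (e.g., via Amazon Textract)
doc_text = "..."

# Classify the document text against the custom classifier endpoint
response = comprehend.classify_document(Text=doc_text, EndpointArn=ENDPOINT_ARN)

# Each candidate class comes back with a confidence score; pick the top one
top_class = max(response["Classes"], key=lambda c: c["Score"])
print(top_class["Name"], top_class["Score"])
```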

Classification of documents with LLMs is another emerging technique that organizations may adopt to augment their classification pipeline. It is useful when newer documents that the model was not trained on show up in the pipeline, or when an existing trained classifier classifies a document with extremely low confidence, which is a very common situation in ML workloads as business processes evolve. Here is what a possible implementation of document classification with Amazon Bedrock and the Anthropic Claude v1 model looks like. We use Amazon Textract's document extraction abilities with LangChain to get the text from the document, and then use prompt engineering to identify the possible document category. LangChain uses Amazon Textract's DetectDocumentText API to extract text from printed, scanned, or handwritten documents.

Figure 2: Low-score classification routed to LLM for classification

```
import boto3

from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

# Amazon Bedrock runtime client used to invoke the Claude model
bedrock = boto3.client("bedrock-runtime")

# Extract the document's text with Amazon Textract via LangChain
loader = AmazonTextractPDFLoader("./samples/document.png")
document = loader.load()

# Prompt the model with the candidate classes and the document text
template = """Given a list of classes, classify the document into one of these classes. Skip any preamble text and just give the class name.<classes>INVOICE, BANK_STATEMENT, HEALTH_PLAN, PRESCRIPTION</classes>
<document>{doc_text}</document>
<classification>"""

prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")
llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)

class_name = llm_chain.run(document[0].page_content)
print(f"The provided document is = {class_name}")
—-
The provided document is = HEALTH_PLAN
```

Note that we are using the anthropic.claude-v1 model here, which has a context window of 100,000 tokens. This action can also be performed with the Anthropic Claude v2 model, which also has a 100,000-token context window; however, the cost may vary greatly based on which model you choose. In general, the Anthropic Claude Instant and Claude v1 models are great general-purpose models and are appropriate for a large number of use cases. The following code demonstrates a tracking mechanism that shows the token counts of the input text and the generated output.

```
prompt = PromptTemplate(template=template, input_variables=["doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")

# Count the tokens in the document text before sending it to the model
num_in_tokens = bedrock_llm.get_num_tokens(document[0].page_content)
print(f"Our prompt has {num_in_tokens} tokens \n\n=========================")

llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
class_name = llm_chain.run(document[0].page_content)

# Count the tokens in the generated answer
num_out_tokens = bedrock_llm.get_num_tokens(class_name)
print(f"Our output has {num_out_tokens} tokens \n\n=========================")
print(f"Total tokens = {num_in_tokens + num_out_tokens}")
—-
Our prompt has 1295 tokens
=========================
Our output has 9 tokens
=========================
Total tokens = 1304
```

Document Extraction

Document extraction is perhaps the most popular phase of the IDP workflow. Most use cases I encounter in my day-to-day work with IDP are heavily skewed towards various mechanisms for extracting all sorts of information from a wide variety of documents: birth certificates, mortgage deeds, tax forms, resumes, medical trial reports, health insurance documents, ID documents (passports, licenses, etc.), marketing materials, newspaper clips, and the list goes on. As you can imagine, document formats and types vary hugely from use case to use case, and often within a single use case, so there are many different ways the information in these documents can be extracted. It is important to keep in mind that there isn't yet a "one size fits all" AI model capable of extracting from all these documents in the desired format. The good news is that there are many purpose-built as well as general-purpose models that can help us achieve our goals with document extraction.

Amazon Textract provides a number of general-purpose features such as forms extraction, table extraction, queries, and layout. These features work on just about any document and are great for general-purpose extraction, such as pulling data from forms or documents containing tabular data. Amazon Textract also provides purpose-built AI models such as AnalyzeID, which recognizes and extracts from US identity documents, AnalyzeExpense, which reads and extracts invoices and receipts, and AnalyzeLending, which identifies and extracts mortgage and lending documents. You can read more about how these models work in the Amazon Textract documentation, but in this section we will focus on LLM-based document extraction techniques, specifically the two most common uses: template-based normalized key-value entity extraction and document Q&A with large language models.
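
For context, here is a minimal sketch of calling these general-purpose features directly with boto3; the sample file name and the query text are placeholders, and in practice you would parse the returned blocks rather than just print them.

```
import boto3

textract = boto3.client("textract")

# Read the document bytes (hypothetical sample file)
with open("./samples/document.png", "rb") as f:
    doc_bytes = f.read()

# Ask Textract for forms, tables, and an answer to a natural-language query
response = textract.analyze_document(
    Document={"Bytes": doc_bytes},
    FeatureTypes=["FORMS", "TABLES", "QUERIES"],
    QueriesConfig={"Queries": [{"Text": "What is the applicant's date of birth?"}]},
)

# The response is a list of blocks (KEY_VALUE_SET, TABLE, QUERY_RESULT, etc.)
for block in response["Blocks"]:
    if block["BlockType"] == "QUERY_RESULT":
        print(block["Text"])
```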

Template-based normalized extractions

In almost all IDP use cases, the extracted data is eventually sent to a downstream system for further processing or analytics. It becomes increasingly important to have normalized output from the AI-based extraction so that the consumption mechanism and integration code are easy to maintain, more reliable, and less brittle. A set of deterministic keys, as opposed to non-deterministic ones, can make a big difference in your integration code. For example, consider a use case that involves extracting an applicant's first name, last name, and date of birth. The documents involved may or may not have these fields clearly labeled as "First name", "Last Name", and "Date of birth", and the labels can come in many different flavors.

Figure 3: Document extraction actual output vs. desired output

Notice the difference between the actual output and the desired output. The desired output follows a more deterministic schema, which makes our post-processing logic simpler, easier to maintain, and less expensive to develop. This is achievable with a large language model, as demonstrated in the code below.

```
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

loader = AmazonTextractPDFLoader("./samples/document.png")
document = loader.load()
full_text = document[0].page_content

# The normalized schema we want the model's output to conform to
output_template = {
    "first_name": {"type": "string", "description": "The applicant's first name"},
    "last_name": {"type": "string", "description": "The applicant's last name"},
    "dob": {"type": "string", "description": "The applicant's date of birth"},
}

template = """You are a helpful assistant. Please extract the following details from the document and format the output as JSON using the keys. Skip any preamble text and generate the final answer.<details>
{details}
</details>
<keys>
{keys}
</keys>
<document>
{doc_text}
</document>
<final_answer>"""

details = "\n".join([f"{key}: {value['description']}" for key, value in output_template.items()])
keys = "\n".join(output_template.keys())

prompt = PromptTemplate(template=template,
                        input_variables=["details", "keys", "doc_text"])
bedrock_llm = Bedrock(client=bedrock, model_id="anthropic.claude-v1")  # same Bedrock runtime client as before
llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
output = llm_chain.run({"doc_text": full_text, "details": details, "keys": keys})
print(output)
—-
{
  "first_name": "John",
  "last_name": "Doe",
  "dob": "26-JUN-1981"
}
```

In this case, merely by crafting our prompt with clear instructions for the LLM, we were able to not only extract the entities but also have the output generated in a format that conforms to our own schema. This is significant from a post-processing standpoint, since our integration code becomes much simpler and more deterministic. LangChain also provides a built-in, more modular way to create these formatting instructions. Here is what the code looks like.

````
from langchain.output_parsers import ResponseSchema, StructuredOutputParser

# Build a ResponseSchema for each key in our normalized output template
response_schemas = []
for key, value in output_template.items():
    schema = ResponseSchema(name=key,
                            description=value['description'],
                            type=value['type'])
    response_schemas.append(schema)

output_parser = StructuredOutputParser.from_response_schemas(response_schemas)
format_instructions = output_parser.get_format_instructions()
print(format_instructions)
—-
The output should be a markdown code snippet formatted in the following schema, including the leading and trailing "```json" and "```":
```json
{
"first_name": string // The applicant's first name
"last_name": string // The applicant's last name
"dob": string // The applicant's date of birth
}
```
````

We can then simply use these pre-generated formatting instructions (in the variable format_instructions) with our LLM chain as follows.

```
template = """
You are a helpful assistant. Please extract the following details from the document and strictly follow the instructions described in the format instructions to format the output. Skip any preamble text and generate the final answer.<details>
{details}
</details>
<format_instructions>
{format_instructions}
</format_instructions>
<document>
{doc_text}
</document>
<final_answer>"""

# Rebuild the prompt so it includes the format_instructions variable
prompt = PromptTemplate(template=template,
                        input_variables=["details", "format_instructions", "doc_text"])
llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)
output = llm_chain.run({"doc_text": full_text,
                        "details": details,
                        "format_instructions": format_instructions})

# Parse the model's markdown/JSON response into a Python dict
parsed_output = output_parser.parse(output)
print(parsed_output)
```

Document Q&A with Retrieval Augmented Generation (RAG)

The second most common approach to extracting information from documents is a chat-like question-and-answering method. It entails prompting the LLM with the content of the document and a question, in the hope of getting an answer from within the document (if one is available). However, there are a couple of issues that need proactive mitigation: preventing model hallucinations and staying within the model's limited token context window.

Large pre-trained language models exhibit state-of-the-art results on many NLP tasks thanks to the factual knowledge stored in their parameters, but their precision in manipulating that knowledge is limited, causing suboptimal performance on knowledge-intensive tasks. To mitigate this, an approach known as Retrieval Augmented Generation (RAG) has proven very useful. This method helps the LLM answer the question precisely while staying within the context of the document, discourages hallucinations, and keeps large documents within the LLM's token context limits. The idea of RAG is to gather only the parts of the document that are semantically closest to the question being asked. This is achieved by chunking the document's content into smaller parts, generating vector embeddings of them, and storing the embeddings in a vector database. The vector database can then perform similarity search, or maximal marginal relevance (MMR) search, to gather the most relevant chunks from the document. Once these relevant parts are obtained, a full context is crafted along with the question, using some prompt engineering, to get the most accurate answer out of the model.
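
To make the retrieval step concrete, here is a minimal sketch, assuming a FAISS vector store (vector_db) has already been built as in the full example below, that contrasts plain similarity search with MMR search in LangChain.

```
question = "What is the per-person pharmacy out-of-pocket?"

# Plain similarity search: the k chunks closest to the question embedding
similar_chunks = vector_db.similarity_search(question, k=3)

# MMR search: balances relevance to the question with diversity among the chunks
mmr_chunks = vector_db.max_marginal_relevance_search(question, k=3, fetch_k=10)

# Either result set can be joined into the context passed to the LLM
context = "\n".join(chunk.page_content for chunk in mmr_chunks)
```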

Figure 4: Retrieval Augmented Generation with Document embeddings and vector db

In the preceding figure, we use the Titan Embeddings G1 — Text model via Amazon Bedrock to generate embeddings of the text chunks. Much of the RAG mechanism can be performed with LangChain's built-in modules, including chunking, generating embeddings, loading them into a vector database of your choice, and then performing RAG-based Q&A over the document's content using the vector database as a retriever. The following code gives a glimpse of a possible implementation of such a mechanism: it retrieves the most relevant chunks, builds the context, and augments the prompt with it. (LangChain's RetrievalQA chain can also do the relevance search, context building, and prompt augmentation internally.) We use the FAISS vector store as our vector database in this example.

```
import boto3

from langchain.embeddings import BedrockEmbeddings
from langchain.vectorstores import FAISS
from langchain.document_loaders import AmazonTextractPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.llms import Bedrock
from langchain.prompts import PromptTemplate
from langchain.chains import LLMChain

bedrock = boto3.client("bedrock-runtime")

# Extract per-page text from the document with Amazon Textract
loader = AmazonTextractPDFLoader("./samples/document.png")
document = loader.load()

# Split the text into chunks of roughly 400 characters
text_splitter = RecursiveCharacterTextSplitter(chunk_size=400,
                                               separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
                                               chunk_overlap=0)
texts = text_splitter.split_documents(document)

# Generate embeddings for each chunk and store them in a FAISS vector store
embeddings = BedrockEmbeddings(client=bedrock,
                               model_id="amazon.titan-embed-text-v1")
vector_db = FAISS.from_documents(documents=texts,
                                 embedding=embeddings)
retriever = vector_db.as_retriever(search_type='mmr',
                                   search_kwargs={"k": 3})

template = """
Answer the question as truthfully as possible strictly using only the provided text, and if the answer is not contained within the text, say "I don't know". Skip any preamble text and reasoning and give just the answer.<text>{document}</text>
<question>{question}</question>
<answer>"""
prompt = PromptTemplate(template=template,
                        input_variables=["document", "question"])
bedrock_llm = Bedrock(client=bedrock,
                      model_id="anthropic.claude-v1")
llm_chain = LLMChain(prompt=prompt, llm=bedrock_llm)

# Retrieve the most relevant chunks and build the context for the prompt
question = "What is the per-person pharmacy out-of-pocket?"
relevant_docs = retriever.get_relevant_documents(question)
full_context = "\n".join(doc.page_content for doc in relevant_docs)

answer = llm_chain.run(document=full_context, question=question)
print(f"Answer is = {answer.strip()}")
—-
Answer is = $6,000
```

In this example, we extracted per-page text from the document with Amazon Textract, then split the pages into chunks of roughly 400 characters. Subsequently, we generated vector embeddings of the chunks, stored them in a FAISS vector store, and used that vector store as a retriever for our RAG-based question-and-answering mechanism. Refer to this GitHub repository for a full set of Python notebooks that explain the process step by step in detail.

Conclusion

In this post, we discussed the various phases of intelligent document processing and explored some of the ways generative AI with Amazon Bedrock is used to either augment or enhance an IDP workflow. There are a number of other types of extractions that we did not cover in this article such as summarization, self-querying table Q&A, standardization, and so forth. You can explore more about each of these types of extractions using the Python notebooks in the GitHub repository mentioned above. If you are already using an IDP workflow for your use case, generative AI gives a whole new world of possibilities by augmenting your workflow with LLM capabilities. If you are in the decision-making phase of an IDP-based workflow for your use case, it is worth exploring all the different ways generative AI can add value.

About the author

Anjan Biswas is a Senior AI Specialist Solutions Architect at Amazon Web Services (AWS). Anjan specializes in computer vision, NLP, and generative AI technologies and spends much of his time working on Intelligent Document Processing (IDP) use cases. He has over 16 years of experience building large-scale enterprise systems across the supply chain, retail, tech, and healthcare industries, and is passionate about data science and machine learning.

Originally posted on OpenDataScience.com

