How to use audio data in LlamaIndex with Python

Learn how to incorporate audio files into LlamaIndex and build an LLM-powered query engine in this step-by-step tutorial.

LlamaIndex is a flexible data framework for connecting custom data sources to Large Language Models (LLMs). With LlamaIndex, you can easily store and index your data and then query it with LLMs.

LLMs only work with textual data, so to process audio files with LLMs we first need to transcribe them into text.

Luckily, LlamaIndex provides an AssemblyAI integration through Llama Hub that lets you load audio data with just a few lines of code:

from llama_hub.assemblyai.base import AssemblyAIAudioTranscriptReader

reader = AssemblyAIAudioTranscriptReader(file_path="./my_file.mp3")

docs = reader.load_data()

Let's learn how to use this data reader step-by-step. For this, we create a small demo application with an LLM-powered query engine that lets you load audio data and ask questions about your data.

Getting Started

Create a new virtual environment:

# Mac/Linux:
python3 -m venv venv
. venv/bin/activate

# Windows:
python -m venv venv
.\venv\Scripts\activate.bat

Install LlamaIndex, Llama Hub, and the AssemblyAI Python package:

pip install llama-index llama-hub assemblyai

Set your AssemblyAI API key as an environment variable named ASSEMBLYAI_API_KEY. You can get a free API key here.

# Mac/Linux:
export ASSEMBLYAI_API_KEY=<YOUR_KEY>

# Windows:
set ASSEMBLYAI_API_KEY=<YOUR_KEY>
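
If you prefer to configure the key from Python instead of your shell (for example, in a notebook), a minimal sketch is to set the environment variable before creating the reader; the placeholder value here is just an example:

import os

# Set the key for this process only, before the reader is created
os.environ["ASSEMBLYAI_API_KEY"] = "<YOUR_KEY>"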

Use the AssemblyAIAudioTranscriptReader

To load and transcribe audio data into documents, import the AssemblyAIAudioTranscriptReader. It needs at least the file_path argument with an audio file specified as a URL or a local file path. You can read more about the integration in the official Llama Hub docs.

from llama_hub.assemblyai.base import AssemblyAIAudioTranscriptReader

audio_file = "https://storage.googleapis.com/aai-docs-samples/sports_injuries.mp3"
# or a local file path: audio_file = "./sports_injuries.mp3"

reader = AssemblyAIAudioTranscriptReader(file_path=audio_file)

docs = reader.load_data()

After loading the data, the transcribed text is stored in the document's text attribute.

print(docs[0].text)
# Runner's knee. Runner's knee is a condition ...

The metadata attribute contains the full JSON response of the AssemblyAI API with additional information:

print(docs[0].metadata)
# {'language_code': <LanguageCode.en_us: 'en_us'>,
#  'punctuate': True,
#  'format_text': True,
#  …
# }
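
Because the metadata mirrors the API response, you can pull individual fields out of it. A small sketch, assuming the response includes the audio_duration field that AssemblyAI transcripts typically contain:

# Read a single field from the metadata dictionary
# (returns None if the key is not present)
duration = docs[0].metadata.get("audio_duration")
print(duration)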

Tip: The default configuration of the document loader returns a list with only one document, which is why we access the first document in the list with docs[0]. However, you can use a different TranscriptFormat that splits the text, for example by sentences or paragraphs, and returns multiple documents. You can read more about the TranscriptFormat options here.

from llama_hub.assemblyai.base import TranscriptFormat

reader = AssemblyAIAudioTranscriptReader(
    file_path="./your_file.mp3",
    transcript_format=TranscriptFormat.SENTENCES,
)

docs = reader.load_data()
# Now it returns a list with multiple documents
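
With TranscriptFormat.SENTENCES, each sentence becomes its own document, so you can iterate over the returned list. A quick sketch:

# Print the first few sentence-level documents
for doc in docs[:3]:
    print(doc.text)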

Apply a Vector Store Index and a Query Engine

Now that you have loaded the transcribed text into LlamaIndex documents, you can easily ask questions about the spoken data. For example, you can apply a model from OpenAI with a Query Engine.

For this, you also need to set your OpenAI API key as an environment variable:

# Mac/Linux:
export OPENAI_API_KEY=<YOUR_OPENAI_KEY>

# Windows:
set OPENAI_API_KEY=<YOUR_OPENAI_KEY>
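
Before moving on, it can help to verify that both keys are visible to your Python process. A quick optional check:

import os

# Fail early if either key is missing from the environment
assert os.environ.get("ASSEMBLYAI_API_KEY"), "ASSEMBLYAI_API_KEY is not set"
assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"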

Now, you can create a VectorStoreIndex and a query engine from the documents you loaded in the previous step.

The metadata needs to be smaller than the text chunk size, and since it contains the full JSON response with extra information, it is quite large. For simplicity, we just remove it here:

from llama_index import VectorStoreIndex

# Metadata needs to be smaller than chunk size
# For simplicity we just get rid of it
docs[0].metadata = {}

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is a runner's knee?")
print(response)
# Runner's knee is a condition characterized by ...
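
Transcription and embedding both cost time and API calls, so you may want to persist the index to disk and reload it later instead of rebuilding it on every run. A minimal sketch using LlamaIndex's storage utilities (the persist_dir path is just an example):

from llama_index import StorageContext, load_index_from_storage

# Save the index to a local directory
index.storage_context.persist(persist_dir="./storage")

# Later: reload the index without transcribing or re-embedding
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
query_engine = index.as_query_engine()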

Conclusion

This tutorial explained how to use the AssemblyAI data reader for LlamaIndex. You learned how to transcribe audio files and load the transcribed text into LlamaIndex documents, and how to create a Query Engine to ask questions about your spoken data.

Below is the complete code:

from llama_index import VectorStoreIndex
from llama_hub.assemblyai.base import AssemblyAIAudioTranscriptReader

audio_file = "https://storage.googleapis.com/aai-docs-samples/sports_injuries.mp3"

reader = AssemblyAIAudioTranscriptReader(file_path=audio_file)
docs = reader.load_data()

docs[0].metadata = {}

index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query("What is a runner's knee?")
print(response)

If you enjoyed this article, feel free to check out some others on our blog.

Alternatively, check out our YouTube channel for learning resources on AI, like our Machine Learning from Scratch series.