Ask Questions to an Audio File using OpenAI

In this article, I’ll explain how we can pass an audio file to an LLM; here, I’m using OpenAI as our LLM.

Many people, including podcast lovers, prefer audio and video tutorials over reading, since listening seems more effective for them than going through a book, an e-book, or an article. It is also quite common that, after a certain period of time, we forget some portions of a tutorial. To get those insights back, re-watching or re-listening is the only option, which can be very time-consuming.

So, the best solution is to build a small AI-based application, with just a few lines of code, that can analyze the audio and answer all the questions the user asks.

Here, generative AI could be the best option, but the problem is that we can’t pass audio to it directly, since the model is text-based. Let’s dive into this article to understand, step by step, how we can make this work.


High-level steps

To execute the solution end-to-end, we need the following components/libraries:

Audio to Text Generator

  • For transcript generation, we will be using AssemblyAI

Embedding Generator

  • For generating the embeddings, we will be using OpenAIEmbeddings

Vector Database

  • Chroma will be used as an in-memory database for storing the vectors

Large Language Model

  • OpenAI will be used as the LLM

All of these are wrapped under a library called LangChain, so we will be relying heavily on it as well.

First of all, we need to grab the keys as shown below:

Get An OpenAI API Key

To get the OpenAI key, go to https://openai.com/, log in, and then generate a key from the API keys page.


Get An AssemblyAI API Key

To get the AssemblyAI key, go to AssemblyAI | Account, log in, and then copy the API key shown on your account page.


Install Packages

Install these packages:

assemblyai>=0.17.0
openai>=0.28.0
sentence-transformers>=2.2.2
langchain>=0.0.278
chromadb>=0.4.8
tiktoken>=0.5.1
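
If you keep these pins in a requirements.txt file (just a convention assumed here, not something from the original setup), they can all be installed in one go:

pip install -r requirements.txt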

Import Required Packages

After installing these dependencies, import the packages below:

from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI
from langchain.document_loaders import AssemblyAIAudioTranscriptLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import Chroma
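
Both the AssemblyAI loader and the OpenAI wrappers read their API keys from environment variables, so set those before running the rest of the code. Here is a minimal sketch; the variable names below are the commonly used defaults, so adjust them if your setup differs:

import os

# Set the keys grabbed in the previous steps.
# OPENAI_API_KEY is read by OpenAIEmbeddings/ChatOpenAI,
# ASSEMBLYAI_API_KEY by the AssemblyAI SDK behind the loader.
os.environ["OPENAI_API_KEY"] = "<your-openai-api-key>"
os.environ["ASSEMBLYAI_API_KEY"] = "<your-assemblyai-api-key>"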

Transcribe Audio

The next step is to extract text from the audio file. Here is the sample audio I’ve used for this article:

doc = AssemblyAIAudioTranscriptLoader(
    file_path="https://storage.googleapis.com/aai-docs-samples/nbc.mp3"
).load()

Here is what the document looks like:

[Document(page_content=" Load time, a new president and new congressional makeup. Same old partisan divides, right? Yes and no. There's the traditional red blue divide you're very familiar with, but there's a lot more below the surface going on in both parties…….", metadata={'language_code': <LanguageCode.en_us: 'en_us'>, 'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3', 'punctuate': True, 'format_text': True, 'dual_channel': None, 'webhook_url': … 26028, 'end': 26386, 'confidence': 0.94482}, {'text': 'are', 'start': 26418, 'end': 26566, 'confidence': 0.7851}, {'text': 'what', 'start': 26588, 'end': 26726, 'confidence': 0.99999}, …

The default metadata is quite verbose, so let’s trim it down to just what we need using the code below:

doc[0].metadata = {"audio_url": doc[0].metadata["audio_url"]}
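
To confirm that only the audio URL is left, you can print the trimmed metadata:

print(doc[0].metadata)
# {'audio_url': 'https://storage.googleapis.com/aai-docs-samples/nbc.mp3'}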

Chunk The Text

I’m using a chunk size of 700 characters here, but you can change this number.

text_splitter = RecursiveCharacterTextSplitter(chunk_size=700, chunk_overlap=0)
texts = text_splitter.split_documents(doc)

texts will look like this:

[Document(page_content=" Load time, a new president and new congressional makeup. Same old partisan divides, right?…, Document(page_content="supporters of former President Donald Trump. We're going to call them the Trump Republican. Another 17% …, Document(page_content="Republicans are firmly against compromising with Biden in order to gain consensus on legislation, as y…, Document(page_content="make it easier on yourself to form a governing coalition, something the Biden White House may want to think about. When we come back…
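
Before moving on, you can also quickly check how many chunks the splitter produced:

print(len(texts))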

Generate Embeddings And Save To Database

In this step, we will generate embeddings for the above chunks, store them in Chroma, and set up the chat model using the lines below:

db = Chroma.from_documents(texts, OpenAIEmbeddings())
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
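
If you want to sanity-check the vector store before wiring up the QA chain, a plain similarity search works; the query string below is just an illustrative example:

# Retrieve the two chunks closest to the query and preview them.
results = db.similarity_search("partisan divide", k=2)
for r in results:
    print(r.page_content[:100])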

Query And Get The Response

This step creates a RetrievalQA chain; passing our query to it will give us an answer.

chain = RetrievalQA.from_chain_type(llm, retriever=db.as_retriever(search_type="mmr", search_kwargs={'fetch_k': 3}))
query = "What this audio file is all about?"
chain({"query": query})

Here is the output:

{'query': 'What this audio file is all about?', 'result': 'The audio file discusses the current political landscape in the United States, specifically focusing on the divisions within the Democratic and Republican parties. It mentions the emergence of four political parties within these two major parties and discusses their differing views on compromising with President Biden to pass legislation.'}
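
Since the chain is reusable, you can keep asking follow-up questions against the same transcript. The questions below are just illustrative examples based on the content we saw above:

# Hypothetical follow-up questions; swap in anything relevant to your audio.
follow_up_queries = [
    "How many political groups does the speaker describe?",
    "Do Trump Republicans want to compromise with Biden?",
]
for q in follow_up_queries:
    response = chain({"query": q})
    print(q, "->", response["result"])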

You can see how easy it was to read the audio, generate the transcript, and get our questions answered.
