Integrating Pinecone With OpenAI And LlamaIndex

In this article, I’ll walk you through building an end-to-end solution using Pinecone, OpenAI, and LlamaIndex.

Before we dive in, here is a quick overview of each of these components:

Pinecone

Pinecone is a cloud-native vector database designed for storing and querying high-dimensional vectors. It provides fast, efficient search over vector embeddings through a simple API, with no infrastructure to manage. It is a strong option when you need low-latency query results at the scale of billions of vectors.

OpenAI

OpenAI models can be used to generate embeddings as well as text completions. By combining OpenAI’s models with Pinecone, we get model-generated embeddings together with efficient vector storage and retrieval.

LlamaIndex

LlamaIndex is a framework that lets developers connect diverse data sources to LLMs such as OpenAI’s models, and it provides tools for augmenting LLM applications with that data.

Now that we have a high-level idea of the foundational parts, let’s dive into the implementation.

First, we need an OpenAI API key.

Get An OpenAI API Key

To get the OpenAI key, go to https://openai.com/, log in, and generate an API key from your account.

Once you have the key, set it in an environment variable. Below is the code to do this:

import os
os.environ["OPENAI_API_KEY"] = "PASTE_YOUR_KEY_HERE"
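
If you would rather not paste the key into your code, you can set it in your shell and only check for it at startup. The helper below is a minimal sketch of that idea; require_env is a hypothetical name of my own, not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        # Failing early is easier to debug than an authentication error later on.
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

Calling require_env("OPENAI_API_KEY") at the top of the script would then stop with a readable error if the key is missing.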

Preparing The Data

Next, we need to load our data into a pandas DataFrame. Your data source does not have to be the same as mine. I took my data from Hugging Face; it contains contextual information along with questions and answers.

Let’s load the dataset:

from datasets import load_dataset
dataset = load_dataset("lmqg/qa_harvesting_from_wikipedia", split='train')

Here comes the most critical part: deciding which columns your application actually needs. Here are mine:

data = dataset.to_pandas()[['id', 'context', 'title']]
data.drop_duplicates(subset='context', keep='first', inplace=True)
data.head()
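
To make the de-duplication step concrete, here is a plain-Python sketch of what keeping the first row per context means. It assumes rows are simple dictionaries with made-up values, and drop_duplicate_contexts is a hypothetical helper; the article itself relies on pandas for this.

```python
def drop_duplicate_contexts(rows):
    """Keep the first row for each distinct 'context' value, preserving order."""
    seen = set()
    unique_rows = []
    for row in rows:
        if row["context"] in seen:
            continue  # a row with this paragraph was already kept
        seen.add(row["context"])
        unique_rows.append(row)
    return unique_rows


# Toy rows (made up): two questions share the same paragraph.
rows = [
    {"id": "1", "context": "Paragraph A", "title": "T1"},
    {"id": "2", "context": "Paragraph A", "title": "T1"},
    {"id": "3", "context": "Paragraph B", "title": "T2"},
]
unique = drop_duplicate_contexts(rows)
print([r["id"] for r in unique])  # ['1', '3']
```

In a question-answering dataset the same paragraph can appear once per question, so without this step the same text would likely be embedded and stored several times.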

In my example, context holds the text paragraphs that the questions are based on.

At this point, we are ready to create documents from our data. This is where LlamaIndex comes in: we use it to transform the data frame into a list of Document objects.

Note that each document contains the text passage, a unique id, and an extra field for the article title.

from llama_index import Document
docs = []

for i, row in data.iterrows():
    docs.append(Document(
        text=row['context'],
        doc_id=row['id'],
        extra_info={'title': row['title']}
    ))
docs[1]

This is what my second document looks like:

Document(id_='54770', embedding=None, metadata={'title': 'Federal government of the United States'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, hash='f32b9025c1c70c02dcb58f98bad47d786e13108a532e797e95a5717f9e851f7b', text='The full name of the republic is "United States of America". No other name appears in the Constitution, and this is the name that appears on money, in treaties, and in legal cases to which it is a party (e.g. "Charles T. Schenck v. United States"). The terms "Government of the United States of America" or "United States Government" are often used in official documents to represent the federal government as distinct from the states collectively. In casual conversation or writing, the term "Federal Government" is often used, and the term "National Government" is sometimes used. The terms "Federal" and "National" in government agency or program names generally indicate affiliation with the federal government (e.g. Federal Bureau of Investigation, National Oceanic and Atmospheric Administration, etc.). Because the seat of government is in Washington, D.C., "Washington" is commonly used as a metonym for the federal government.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

Next, we will use SimpleNodeParser, which converts the list of Document objects into nodes, the basic units that LlamaIndex uses for indexing and querying.

from llama_index.node_parser import SimpleNodeParser
parser = SimpleNodeParser()
nodes = parser.get_nodes_from_documents(docs)

Here is what a node looks like:

TextNode(id_='f161b106-fd34-4e9d-8efb-2d29b2226064', embedding=None, metadata={'title': 'Federal government of the United States'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='54766', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'title': 'Federal government of the United States'}, hash='6001a46aa3511e7aa15ebbf791fba716efcf0bab530ca0ec4c29a4fa5dad0a42'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='793ddc20-4630-41fb-b174-fe846e990c7b', node_type=<ObjectType.TEXT: '1'>, metadata={}, hash='aeeda3e3df54773474bedee7b309e3724aa4ee61be6eddc8f8db0d6da05bdfd1')}, hash='e9c11ab5cd1c9c87dd9c1cd19a607e24743fa203c99aff00b4144998c9a26d57', text='The government of the United States of America is the federal government of the republic of fifty states that constitute the United States, as well as one capital district, and several other territories. The federal government is composed of three distinct branches: legislative, executive, and judicial, whose powers are vested by the U.S. Constitution in the Congress, the President, and the federal courts, including the Supreme Court, respectively. The powers and duties of these branches are further defined by acts of Congress, including the creation of executive departments and courts inferior to the Supreme Court.', start_char_idx=0, end_char_idx=623, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')
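
The node above carries a NEXT relationship and character offsets, which hints that a long document gets split into several consecutive pieces. The sketch below illustrates the general idea of overlapping chunks. It is a simplified, word-based stand-in and not the parser’s actual algorithm; split_into_chunks and its default sizes are my own assumptions.

```python
def split_into_chunks(text, chunk_size=50, overlap=10):
    """Split text into word-based chunks; consecutive chunks share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap keeps a sentence that straddles a boundary visible in both neighbouring chunks, which tends to help retrieval.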

Setting Up Pinecone

To get started with Pinecone, visit https://app.pinecone.io/ and set up your account. There are two ways to create an index in Pinecone: through code and through the website user interface.

To create an index through the website, click the Create Index button. A dialog opens where you can fill in the required details.

If you prefer to create the index in code, here is how. The snippet assumes your Pinecone API key and environment are available in the PINECONE_API_KEY and PINECONE_ENVIRONMENT environment variables:

import pinecone  

# create connection
pinecone.init(
    api_key=os.environ['PINECONE_API_KEY'],
    environment=os.environ['PINECONE_ENVIRONMENT']
)

# create index
index_name = 'sh-index'
if index_name not in pinecone.list_indexes():
    pinecone.create_index(
        index_name,
        dimension=1536,
        metric='cosine'
    )

# connect to the index
pinecone_index = pinecone.Index(index_name)
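
The index above is created with metric='cosine' and dimension=1536, which matches the length of the vectors produced by text-embedding-ada-002. As a rough illustration of what the cosine metric measures, here is a small pure-Python sketch. Pinecone computes this on its side; cosine_similarity is only a hypothetical helper for building intuition.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)
```

Because the score depends only on direction, a vector and a scaled copy of it are treated as identical. The dimension check mirrors why the index dimension has to match the embedding model.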

Once the index is created, it appears in the Pinecone console under the name you chose (sh-index in this example).

Generate Embedding And Index Data

If everything went well, your index is now ready to accept data.

Before we build the index, we need to initialize two more objects: the StorageContext and the ServiceContext. They configure the storage backend and the embedding model, respectively.

Finally, we initialize GPTVectorStoreIndex, which serves as the storage and retrieval interface for our document embeddings in Pinecone’s vector database.

from llama_index import GPTVectorStoreIndex, StorageContext, ServiceContext
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.vector_stores import PineconeVectorStore

# wrap the Pinecone index created above so LlamaIndex can write to it
vector_store = PineconeVectorStore(pinecone_index=pinecone_index)

# set up storage
storage_context = StorageContext.from_defaults(
    vector_store=vector_store
)
# set up the index/query process, i.e. the embedding model (and completion model, if used)
embed_model = OpenAIEmbedding(model='text-embedding-ada-002', embed_batch_size=200)
service_context = ServiceContext.from_defaults(embed_model=embed_model)

index = GPTVectorStoreIndex.from_documents(
    docs, storage_context=storage_context,
    service_context=service_context
)
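
The embed_batch_size=200 setting means texts are sent to the embedding model in groups rather than one request per text. The sketch below shows the batching idea on its own, in plain Python. The batched helper is hypothetical and is not how the library is implemented internally.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items`, each with at most `batch_size` elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# 450 placeholder texts split into batches of up to 200.
texts = [f"passage {i}" for i in range(450)]
print([len(batch) for batch in batched(texts, 200)])  # [200, 200, 50]
```

Larger batches mean fewer round trips to the API, at the cost of bigger individual requests.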

Query Your Data

It’s query time :)

Just a few more lines of code and we are ready to test our setup.

query_engine = index.as_query_engine()
res = query_engine.query("What is federal government?")
print(res)

The federal government refers to the governing body of the United States of America, which is composed of three branches: legislative, executive, and judicial. These branches have specific powers and responsibilities that are outlined in the U.S. Constitution. The federal government is responsible for making and enforcing laws at the national level, as well as overseeing the operation of executive departments and lower courts.
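
Conceptually, a query like this has two stages: find the stored passages closest to the question, then hand them to the LLM as context. The toy sketch below illustrates that flow with hand-made two-dimensional vectors. It is a simplification under my own assumptions (top_k, build_prompt, the prompt wording, and k=2 are all made up), not a description of LlamaIndex internals.

```python
import heapq


def top_k(query_vec, stored, k=2):
    """Return the k (score, text) pairs whose vectors best match the query.

    Vectors are assumed to be unit length, so the dot product equals cosine similarity.
    """
    scored = (
        (sum(q * v for q, v in zip(query_vec, vec)), text)
        for text, vec in stored
    )
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


def build_prompt(question, passages):
    """Join the retrieved passages into a context block followed by the question."""
    context = "\n\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy store: (text, unit-length vector) pairs standing in for real embeddings.
stored = [
    ("The federal government has three branches.", [1.0, 0.0]),
    ("Pinecone stores vectors.", [0.0, 1.0]),
    ("Congress is the legislative branch.", [0.8, 0.6]),
]
hits = top_k([1.0, 0.0], stored, k=2)
prompt = build_prompt("What is federal government?", [text for _, text in hits])
print(prompt)
```

In the real pipeline, the similarity search runs inside Pinecone and the prompt goes to an OpenAI completion model.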

I hope you learned something from this article. If anything is unclear, I recommend watching the complete explanation on my channel.

