Using Open-Source Models With OpenAI Models

Are you looking to use open-source models alongside OpenAI’s models? There are many reasons to go open-source: open-source models are free and transparent, and they give you the flexibility to customize them.

In this article, I’ll walk you through the steps required to use an open-source model for embedding along with OpenAI’s LLM.

Here is the list of all the major tools and libraries we are going to use to build our solution; a quick install sketch follows the list.

OpenAI

We will be using OpenAI’s LLM to answer the questions.

Chroma

We will be using an in-memory implementation of Chroma DB to store our embeddings.

Llama-Index

We will be using this framework to integrate all the diverse data sources with the LLM.
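
Before moving on, install the dependencies. This walkthrough relies on the legacy llama-index API (pre-0.10, where ServiceContext still exists), so the version pin below is an assumption you may need to adjust:

pip install "llama-index<0.10" chromadb sentence-transformers openai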

Let’s get started by grabbing the OpenAI API key as shown below.

Get An OpenAI API Key

To get the OpenAI key, you need to go to https://openai.com/, log in, and then grab the keys from the API keys section of your account.


Once you have the key, set it as an environment variable. Below is the code to do this.

import os

# Expose the key to the OpenAI client that llama_index uses under the hood
os.environ["OPENAI_API_KEY"] = "PASTE_YOUR_KEY_HERE"

Import Packages

Here is the list of packages we need to pull in before we proceed further.

import chromadb
from llama_index import SimpleDirectoryReader, VectorStoreIndex, ServiceContext, StorageContext
from llama_index.vector_stores import ChromaVectorStore
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.llms import OpenAI

Read Data

I’m reading all the text files placed under my Store directory, but feel free to load a single file as well by using the proper file loader, as shown after the snippet below.

docs = SimpleDirectoryReader(
    'Store',
    required_exts=[".txt"],
    filename_as_id=True
).load_data()
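
For example, to load one specific file instead of a whole directory, you can pass input_files (the path below is hypothetical):

# Load a single file; input_files takes a list of paths
docs = SimpleDirectoryReader(
    input_files=["Store/Homelessness.txt"],
    filename_as_id=True
).load_data()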

Here is what my first document looks like.

Document(
    id_='Store\\Homelessness.txt',
    embedding=None,
    metadata={
        'file_path': 'Store\\Homelessness.txt',
        'file_name': 'Homelessness.txt',
        'file_type': 'text/plain',
        'file_size': 1200,
        'creation_date': '2023-12-04',
        'last_modified_date': '2023-10-11',
        'last_accessed_date': '2023-12-07'
    },
    excluded_embed_metadata_keys=[
        'file_name',
        'file_type',
        'file_size',
        'creation_date',
        'last_modified_date',
        'last_accessed_date'
    ],
    excluded_llm_metadata_keys=[
        'file_name',
        'file_type',
        'file_size',
        'creation_date',
        'last_modified_date',
        'last_accessed_date'
    ],
    relationships={},
    hash='f85d1f91029eeb8f3766fac96503c5074193bfbab3f5068afb98c36225f008d8',
    text="Homelessness or houselessness — also known as a state of being unhoused or unsheltered — is the condition of lacking stable, safe, and functional housing. The general category includes disparate situations, including: … therefore, in most cities, only estimated homeless populations are known.\n",
    start_char_idx=None,
    end_char_idx=None,
    text_template='{metadata_str}\n\n{content}',
    metadata_template='{key}: {value}',
    metadata_seperator='\n'
)
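
Before going further, you can run a quick sanity check on what was loaded:

# How many documents were picked up, and what does the first one contain?
print(len(docs))
print(docs[0].metadata)
print(docs[0].text[:100])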

Setup Database

Now that the data is ready, we can set up our Chroma DB. Here, myCollection is the name of the collection that will store our data inside the database. Keep in mind that a collection is created only if it does not already exist in the database.

Once our vector store is ready, we associate it with a storage context, which is required to create an index.

db = chromadb.EphemeralClient()
chroma_collection = db.get_or_create_collection("myCollection")
vStore = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vStore)
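
The EphemeralClient keeps everything in memory, so the embeddings vanish when the process exits. If you want them to survive restarts, Chroma’s PersistentClient is a drop-in swap (the path below is just an assumption):

# Optional: persist the collection to disk instead of keeping it in memory
db = chromadb.PersistentClient(path="./chroma_store")
chroma_collection = db.get_or_create_collection("myCollection")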

Generate Embeddings

Here comes the important part: for generating the embeddings, we will be using an open-source model from HuggingFace named sentence-transformers/all-MiniLM-L6-v2.

embedding_model = HuggingFaceEmbedding(model_name='sentence-transformers/all-MiniLM-L6-v2')
service_context = ServiceContext.from_defaults(embed_model=embedding_model)
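
Note that ServiceContext.from_defaults falls back to OpenAI’s LLM for answering questions, which is why we only override the embedding model. If you want to pin the LLM explicitly, here is a sketch using the OpenAI class we imported earlier (the model name is an assumption):

# Make the answering LLM explicit instead of relying on the default
service_context = ServiceContext.from_defaults(
    embed_model=embedding_model,
    llm=OpenAI(model="gpt-3.5-turbo")
)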

Create Index

To create an index, we need the docs, the embedding model, and the vector store. As we have all the bits ready, we are good to go and create one, as shown below.

vStoreIndex = VectorStoreIndex.from_documents(
    docs,
    storage_context=storage_context,
    service_context=service_context
)
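
To confirm that the embeddings actually landed in Chroma, you can check the collection’s count:

# Number of vectors now stored in the collection
print(chroma_collection.count())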

Setup Engine and Get Response

This is the last part, wherein we need to instantiate our query engine and pass user queries to it.

engine = vStoreIndex.as_query_engine()
response = engine.query("What is homelessness?")
response

Here is the generated sample response.

Response(
    response='Homelessness refers to the condition of lacking stable, safe, and functional housing. It encompasses various situations, such as living on the streets, moving between temporary shelters, residing in boarding houses without proper amenities, or having no permanent place to live. Homelessness can also include internally displaced persons who are forced to leave their homes due to civil conflict and become refugees within their own country. The legal status of homeless individuals can vary depending on the location. Homelessness is often associated with poverty, and accurately counting and addressing the needs of homeless populations can be challenging.',
    source_nodes=[
        NodeWithScore(
            node=TextNode(
                id_='b5a68706-1835-4583-a4c6-2d1c9010cc8a',
                embedding=None,
                metadata={
                    'file_type': 'text/plain',
                    'file_size': 1200,
                    'creation_date': '2023-12-04',
                    'last_modified_date': '2023-10-11',
                    'last_accessed_date': '2023-12-07',
                    ...
                },
                ...
            ),
            ...
        )
    ]
)
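
Besides reading the raw repr, you can also inspect programmatically which chunks grounded the answer via source_nodes:

# Print the similarity score and source file for each retrieved chunk
for src in response.source_nodes:
    print(src.score, src.node.metadata.get("file_name"))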

I hope you now have an idea of how to integrate an open-source embedding model with OpenAI’s model. If anything is still unclear, I would recommend you watch my video, wherein I have explained every bit of it.

