Vector databases for Azure Open AI Embeddings Storage

Pavankumar
1y
7k
0
2

Article

For any application, the data where it is being stored is critical. So far, we have been storing the data in a structured format(tables) and an unstructured format(json), and now we have a new thing called a vector database.

What are these Vector Databases?

A vector database is a type of database that stores data as high-dimensional vectors. Each vector has a certain number of dimensions, which can range from tens to thousands, depending on the complexity and granularity of the data. In short, the data or the text is turned into a decimal array and stored in the backend.

SQL or Oracle does not support these Vector databases; below is the list of databases where these Vector data can be stored.

Azure Cognitive Search
Chroma
CosmosDB
DuckDB
Milvus
Pinecone
Postgres
Qdrant
Redis
SQLite
Weaviate

Among the above list of vector databases, let's discuss on few databases like Postgres, Qdrant, and Cosmos with MongoDB VCore

Postgres

Recommend to use Postgres in the case have knowledge of the Postgres database, as it eases the storing or using multiple databases like storing structured data in structured db and vector data in another database.

Postgres supports the vector data stored under the flexible server type, which is called pgvector. Make sure your local machine has outbound communication to port 5432 to access the Postgres database.

Data retrieval from Postgres can be achieved by using npgsql, Dapper, or EntityFramework.

The bare minimum to create Postgres is cheaper compared to other databases.

Below is the sample reference of Vector datatype in Postgres, which is used to store the embedding data.

Below is the sample query which can be used to retrieve the data from Postgres.

string prompt = "seller name?";
var promptEmbeddings = await _openAI.Get.embeddingsAsync(prompt, null);
var promptSql = $"SELECT payload FROM contracttable ORDER BY embedding <-> '[{string.Join(',', promptEmbeddings)}]' LIMIT 5";

Pros

Postgres db is already being used by multiple users, so adaption to use the database would require less minimal effort.
Cost is also less compared to other vector db.

Qdrant

Qdrant is an open-source vector database, which we can use by deploying the docker image to Kubernetes for HA, or a quick deployment we can deploy to Azure Container Instance(ACI).

In case we try to ACI Qdrant db, make sure you mount the data to Azure blob storage because in case we save the data on the container, then in case of restart of the pod the data will be lost, and we need to store it again. In the case of C# developer, we can use the nuget of ZeroLevel to access the API's of Qdrant.

Qdrant comes with an Enterprise Licensing Production environment, which can be used for the production environment. We can deploy to Kubernetes in case we want to have HA of the Qdrant db.

In Qdrant, we have concepts like below.

Collections: A single Qdrant can contain multiple collections; we can use the API endpoints to view the list of collections, collection name wise or delete the collection. These collections are nothing but database-like when compared in the relational database world, which contains the data; here in the qdrant, we store the data as points.

Below is the sample from the Postman service, which calls the Qdrant image, which has been deployed as an Azure Container Instance and uses the swagger endpoints to fetch the list of collections.

Replace the {{apiurl}} with the ACI instance IP address.

Payload: In simple terms, the payload is a json format structure data that has the capability to store any additional information

Eg: In case of any Citation where you need to return the file name or path of the file from where the data has been fetched, we can the payload additional properties to store the value reterived during the response.

Points: The points are the central entity that Qdrant operates with. A point is a record consisting of a vector and an optional payload.

Note. In case of nuget package not working, please try to refer the github and build a new package, as the Qdrant APIs are updated.

Pros

We can deploy the image to the container, due to which it is cloud agnostic, and in case of data migration to another, we can easily migrate it with minimum effort.
Can be deployed into Kubernetes along with any backend application, which helps in the reduction of cost infrastructure.
Since it is exposed as API, we can access the database using any programming language.

Cons

It comes with enterprise license costing and has the option to store additional properties with payload, supports filtering, etc.

Swagger: https://ui.qdrant.tech/?v=v1.2.x#/
Nuget: https://www.nuget.org/packages/ZeroLevel/1.0.5#usedby-body-tab

Azure Cosmos DB with Mongo DB VCore

We can create Mongo DB V Core with Azure Cosmos DB. Currently, this is in preview, and compared to the other Vectord Database, it is a bit expensive; we can create Mongo DB Atlas to store the database as well, which is the cloud version of Mongo DB.

Cons: Preview and a bit expensive.

Pros: A available nuget package can be used to easily connect with the database.

Below is the reference from the Azure Calculator which I have put across for the vector databases for Qdrant, which is being deployed as Azure Container Instance, Azure Cosmo DB with Mongo DB VCore, and Postgres. I have taken the bare minimal tiers, and you can see that Qdrant and Postgres beat Azure Mongo DB Core as it is very expensive compared to the other two services.

Sqllite

In the case of the Vector database being stored offline at the device level, we can use Sqllite.

Pros: It is a lightweight db and can be used in required to store data at the device level.

Cons: Online access to the database is not available.