
Exploring AI and Vector Search in Azure Cosmos DB for MongoDB vCore

Microsoft has introduced vector search functionality in Azure Cosmos DB for MongoDB vCore. This feature lets developers run similarity searches on high-dimensional data directly in Cosmos DB, which is particularly useful in retrieval-augmented generation (RAG) applications, recommendation systems, image and document retrieval, and similar scenarios. I am also participating in the Cosmos DB hackathon to explore how we can use this technology for retrieval-augmented generation.

In this article, we will explore the details of this new functionality and its use cases, and walk through a sample implementation using JavaScript (Node.js).

What is a Vector Store?

A vector store (or vector database) is designed to store and manage vector embeddings. These embeddings are mathematical representations of data in a high-dimensional space. Each dimension corresponds to a feature of the data, and tens of thousands of dimensions might be used to represent sophisticated data. Words, phrases, entire documents, images, audio, and other types of data can all be vectorized. In simpler terms, a vector embedding is a list of numbers that represents complex data within a multi-dimensional space.

Example

  
    Pen: [0.6715,0.5562,0.3566,0.9787]
  

Now we can represent a pen within a multi-dimensional space and then utilize vector search algorithms to perform a similarity search, retrieving the closest matching elements.
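To make the idea of "closest matching" concrete, here is a minimal sketch of cosine similarity, one common way to compare two embeddings. The pen vector comes from the example above; the pencil and banana vectors are invented purely for illustration and are not real model output.

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|).
// Values close to 1 mean the two vectors point in nearly the same direction.
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same number of dimensions');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const pen = [0.6715, 0.5562, 0.3566, 0.9787]; // from the example above

// Hypothetical embeddings, made up for this sketch
const pencil = [0.65, 0.58, 0.33, 0.95];
const banana = [0.05, 0.91, 0.88, 0.02];

console.log('pen vs pencil:', cosineSimilarity(pen, pencil));
console.log('pen vs banana:', cosineSimilarity(pen, banana));
```

With these made-up numbers, the pencil scores higher than the banana, so a similarity search for "pen" would return the pencil first.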

How does a Vector Index Work?

In a vector store, vector search algorithms are used to index and query embeddings. Vector indexing is a technique used in machine learning and data analysis to search and retrieve information from large datasets efficiently. Some well-known algorithms include:

  • Flat Indexing

  • Hierarchical Navigable Small World (HNSW)

  • Inverted File (IVF) Indexes

  • Locality Sensitive Hashing (LSH) Indexes
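As a rough illustration of the simplest option in this list, flat indexing compares the query against every stored vector and keeps the k closest. The sketch below uses Euclidean distance and a tiny in-memory array; the item ids and vectors are invented for the example, and this is not how Cosmos DB implements its indexes. The other algorithms in the list trade some accuracy for speed by avoiding this exhaustive scan.

```javascript
// Euclidean (L2) distance between two vectors of equal length.
function euclideanDistance(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) {
    const diff = a[i] - b[i];
    sum += diff * diff;
  }
  return Math.sqrt(sum);
}

// Flat (brute-force) search: score every item, sort by distance, keep the top k.
function flatSearch(items, queryVector, k) {
  return items
    .map(item => ({ id: item.id, distance: euclideanDistance(item.vector, queryVector) }))
    .sort((x, y) => x.distance - y.distance)
    .slice(0, k);
}

// Hypothetical two-dimensional data, for illustration only
const items = [
  { id: 'a', vector: [0, 0] },
  { id: 'b', vector: [1, 1] },
  { id: 'c', vector: [5, 5] },
  { id: 'd', vector: [0.9, 1.2] },
];

const nearest = flatSearch(items, [1, 1], 2);
console.log(nearest.map(hit => hit.id)); // ids of the two closest items
```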

Vector search enables you to find similar items based on their data characteristics, rather than exact matches on a specific property field. It is useful for applications such as:

  • Searching for similar text

  • Finding related images

  • Making recommendations

  • Detecting anomalies

Integrated Vector Database in Azure Cosmos DB for MongoDB vCore

The Integrated Vector Database in Azure Cosmos DB for MongoDB vCore enables you to efficiently store, index, and query high-dimensional vector data directly within your Cosmos DB instance. Both transactional data and vector embeddings are stored together in Cosmos DB, which eliminates the need to transfer data to a separate vector store and incur additional costs. It works in two steps:

1. Vector Index Creation

To perform a vector similarity search over vector properties in your documents, you’ll first need to create a vector index. This index allows efficient querying based on vector characteristics.
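A minimal sketch of what index creation might look like from Node.js is shown here. It assumes the createIndexes command with a cosmosSearch key and cosmosSearchOptions, as described in the Azure Cosmos DB for MongoDB vCore documentation. The collection name (orders) and field name (contentVector) match the sample later in this article; the index name, numLists value, and the 1536 dimensions (the output size of the text-embedding-ada-002 model) are assumptions to adjust for your own setup.

```javascript
// Builds a createIndexes command for a vector index.
// 'vector-ivf' and 'COS' (cosine similarity) are one possible configuration.
function buildVectorIndexCommand(collectionName, vectorField, dimensions) {
  return {
    createIndexes: collectionName,
    indexes: [
      {
        name: `${vectorField}_vector_index`, // illustrative index name
        key: { [vectorField]: 'cosmosSearch' },
        cosmosSearchOptions: {
          kind: 'vector-ivf',
          numLists: 1,        // assumption: a small dataset; tune for larger ones
          similarity: 'COS',
          dimensions: dimensions,
        },
      },
    ],
  };
}

// Sends the command through a connected MongoDB driver Db object.
async function createVectorIndex(db, collectionName, vectorField, dimensions) {
  return db.command(buildVectorIndexCommand(collectionName, vectorField, dimensions));
}

const indexCommand = buildVectorIndexCommand('orders', 'contentVector', 1536);
console.log(JSON.stringify(indexCommand, null, 2));
```

Keeping the command builder separate from the call to db.command makes it easy to inspect or log the index definition before running it against a live cluster.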

2. Vector Search

Once your data is inserted into your Azure Cosmos DB for MongoDB vCore database and collection, and your vector index is defined, you can perform a vector similarity search against a targeted query vector. 
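The query itself can be sketched as an aggregation pipeline. The example below assumes the $search stage with the cosmosSearch operator from the Azure Cosmos DB for MongoDB vCore documentation; the helper names are hypothetical, the contentVector field matches the sample later in this article, and the query vector would normally come from the same embedding model used for the stored documents.

```javascript
// Builds an aggregation pipeline that returns the k documents whose
// vector field is most similar to queryVector, along with a similarity score.
function buildVectorSearchPipeline(queryVector, vectorField, k) {
  return [
    {
      $search: {
        cosmosSearch: {
          vector: queryVector,
          path: vectorField,
          k: k,
        },
        returnStoredSource: true,
      },
    },
    {
      $project: {
        similarityScore: { $meta: 'searchScore' },
        document: '$$ROOT',
      },
    },
  ];
}

// Runs the pipeline against a MongoDB driver Collection object.
async function vectorSearch(collection, queryVector, vectorField, k) {
  return collection.aggregate(buildVectorSearchPipeline(queryVector, vectorField, k)).toArray();
}

// Placeholder query vector, for illustration only; a real one has as many
// dimensions as the vector index was created with.
const examplePipeline = buildVectorSearchPipeline([0.1, 0.2, 0.3], 'contentVector', 3);
console.log(JSON.stringify(examplePipeline, null, 2));
```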

What is Vector Search?

Vector search, also known as similarity search or nearest neighbor search, is a technique used to find objects that are similar to a given query object in a high-dimensional space. Unlike traditional search methods that rely on exact matches, vector search leverages the concept of distance between points in a vector space to find similar items. This is particularly useful for unstructured data, such as images, audio, and text embeddings.

Benefits of Vector Search in Cosmos DB

  1. Efficient similarity searches: Enables fast and efficient searches on high-dimensional vectors, making it ideal for recommendation engines, image search, and natural language processing tasks.

  2. Scalability: Leverages the scalability of Cosmos DB to handle large datasets and high query volumes.

  3. Flexibility: Integrates seamlessly with existing MongoDB APIs, enabling developers to utilize familiar tools and libraries.

Use Cases

  1. Recommendation systems: Providing personalized recommendations based on user behavior and preferences

  2. Image and video retrieval: Searching for images or videos that are visually similar to a given input

  3. Natural Language Processing: Finding documents or text snippets that are semantically similar to a query text

  4. Anomaly Detection: Identifying unusual patterns in high-dimensional data

Setting Up Vector Search in Cosmos DB

Prerequisites

  • An Azure account with an active subscription

  • Azure Cosmos DB for MongoDB vCore configured for your workload

Detailed Step-By-Step Guide and Sample Code Written in JavaScript (Node.js)

  1. Create a Cosmos DB account

    • Navigate to the Azure portal.

    • Search for Azure Cosmos DB and select the MongoDB (vCore) option.

    • Follow the prompts to create your Cosmos DB account.

  2. Configure your database

    • Create a database and a collection where you’ll store your vectors.

    • Ensure that the collection is appropriately indexed to support vector operations. Specifically, you’ll need to create an index on the vector field.

  3. Insert vectors into the collection.

    • Vectors can be stored as arrays of numbers in your MongoDB documents. 

  4. Set up your project

    • Create a new Node.js project (e.g., using Visual Studio or Visual Studio Code).

    • Import the necessary MongoDB and Azure OpenAI modules (the mongodb and @azure/openai packages).

  5. Connect to the database using the Mongo client.

  6. Insert data

    • The code below shows how to load order data from a local JSON file and insert it with embeddings stored in the contentVector field.

  7. Generate vector embeddings by using the Azure OpenAI client's getEmbeddings() method.

Here is the complete code for your reference.

  
    const fs = require('fs/promises');
    const { MongoClient } = require('mongodb');
    const { OpenAIClient, AzureKeyCredential } = require('@azure/openai');

    // Set up the MongoDB client
    const dbClient = new MongoClient(process.env.AZURE_COSMOSDB_CONNECTION_STRING);

    // Set up the Azure OpenAI client
    const aoaiClient = new OpenAIClient(
        'https://' + process.env.AZURE_OPENAI_API_INSTANCE_NAME + '.openai.azure.com/',
        new AzureKeyCredential(process.env.AZURE_OPENAI_API_KEY));

    // Name of the embeddings model deployment in Azure OpenAI.
    // Assumption: the original sample used this variable without defining it;
    // the environment variable name here is illustrative.
    const embeddingsDeploymentName = process.env.AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT_NAME;

    async function main() {
        try {
            await dbClient.connect();
            console.log('Connected to MongoDB');
            const db = dbClient.db('order_db');

            // Load order data from a local JSON file
            console.log('Loading order data');
            const orderRawData = '<local json file>';
            const orderData = JSON.parse(await fs.readFile(orderRawData, 'utf8'))
                                  .map(order => cleanData(order));
            await insertDataAndGenerateEmbeddings(db, orderData);
        } catch (error) {
            console.error('An error occurred:', error);
        } finally {
            await dbClient.close();
        }
    }

    // Placeholder: cleanData was called but not shown in the original sample.
    // This stand-in returns the order unchanged; replace it with your own cleanup logic.
    function cleanData(order) {
        return order;
    }

    // Insert data into the database and generate embeddings
    async function insertDataAndGenerateEmbeddings(db, data) {
        const orderCollection = db.collection('orders');
        await orderCollection.deleteMany({});

        // Build the operations one at a time: bulkWrite needs plain objects (not promises),
        // and sequential calls let the rest period in generateEmbeddings space out the requests.
        const operations = [];
        for (const order of data) {
            operations.push({
                insertOne: {
                    document: {
                        ...order,
                        contentVector: await generateEmbeddings(JSON.stringify(order))
                    }
                }
            });
        }

        const result = await orderCollection.bulkWrite(operations);
        console.log(`${result.insertedCount} orders inserted`);
    }

    // Generate embeddings
    async function generateEmbeddings(text) {
        // getEmbeddings expects an array of input strings
        const embeddings = await aoaiClient.getEmbeddings(embeddingsDeploymentName, [text]);
        await new Promise(resolve => setTimeout(resolve, 500)); // Rest period to avoid rate limiting on Azure OpenAI
        return embeddings.data[0].embedding;
    }

    main();
  

Note: Remember to replace the placeholders (Cosmos DB connection string, Azure OpenAI key, endpoint, and embeddings deployment name) with actual values.

Managing Costs

To manage costs effectively when using vector search in Cosmos DB:

  1. Optimize indexes: Ensure that only necessary fields are indexed.

  2. Monitor usage: Use Azure Monitor to track and analyze usage patterns.

  3. Auto-scale: Configure auto-scaling to handle peak loads efficiently without over-provisioning resources.

  4. Data partitioning: Partition your data appropriately to ensure efficient querying and storage.

Conclusion

The introduction of vector search functionality in Azure Cosmos DB for MongoDB vCore opens up new possibilities for building advanced AI and machine learning applications. By leveraging this feature, developers can implement efficient similarity searches, enabling a wide range of applications from recommendation systems to anomaly detection. With the provided JavaScript sample, you can get started with integrating vector search into your Cosmos DB-based applications.

For more detailed documentation, visit the Azure Cosmos DB documentation.