
Simple Demo Of Vector Database With Qdrant — Image Search

In my previous article, I demonstrated the power of a Vector Database using Qdrant for semantic search.

However, semantic text search is not the only trick a vector database (VDB) has up its sleeve; it can do more.

Today, I want to show how the VDB enables image-to-image search.

Suppose I have an e-commerce website selling a variety of products, from T-shirts and books to cars.

In addition to allowing users to search by text, I want to enable them to search by image.

Something like this:


Image search result

The idea behind this architecture is very simple:

  1. Product Creation (Indexing): When a new product is created, we use an AI model to generate an embedding (a numerical vector) of the product image and store it in the VDB.

  2. Search (Querying): When a user uploads a query image, we generate an embedding for that image using the same model and compare it against the stored vectors in the VDB. The VDB then returns the Top N most similar images.

That’s all.

The AI model I’m using is CLIP. CLIP (Contrastive Language-Image Pre-training), developed by OpenAI, is a neural network designed to understand the relationship between visual concepts and natural language.

Below is the product creation screen of my dummy e-commerce page. Once I click ‘Create Product’, the backend uses the CLIP model to embed the image.

Here is a walkthrough of the source code in Python. (I am using ‘clip-ViT-B-32’ for a good balance of speed and accuracy.)

import io
from PIL import Image
from sentence_transformers import SentenceTransformer

# Load the CLIP model once at startup
clip_model = SentenceTransformer('clip-ViT-B-32')

def get_image_embedding(image_bytes):
    # Decode the raw bytes into a PIL image and encode it with CLIP
    image = Image.open(io.BytesIO(image_bytes))
    embedding = clip_model.encode(image)
    return embedding.tolist()

We initialize the CLIP model; I chose to use ‘clip-ViT-B-32’.

Then we get the image bytes (I pre-stored the product image bytes in a SQL Server table; it’s up to you how you obtain them) and pass them to the model’s built-in encode method, which returns the embedding of the image.
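
If you don’t have the bytes pre-stored in a database, reading them straight from a file works just as well for testing. A minimal, hypothetical example (the file name is a placeholder):

# Hypothetical example: load the image bytes from disk instead of SQL Server
with open('product_photo.jpg', 'rb') as f:
    image_bytes = f.read()

embedding = get_image_embedding(image_bytes)
print(len(embedding))  # clip-ViT-B-32 produces a 512-dimensional vector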

from qdrant_client.models import PointStruct

# Store the embedding together with the product metadata
qdrant_client.upsert(
    collection_name=Config.QDRANT_COLLECTION,
    points=[
        PointStruct(
            id=product_id,
            vector=embedding,
            payload={
                "product_id": product_id,
                "product_name": product_name,
                "product_description": product_description
            }
        )
    ]
)

After that, we store the embedding vector in the collection along with other product metadata such as the product ID, name, and description.
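
Note that the collection must exist before you can upsert into it. If you haven’t created it yet, a one-time setup along these lines should work (a sketch, assuming a local Qdrant instance; the vector size must match the model output, which is 512 for ‘clip-ViT-B-32’, and Cosine is a common distance choice for CLIP embeddings):

from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance

qdrant_client = QdrantClient(url="http://localhost:6333")  # adjust to your setup

# One-time setup; the size must match the embedding dimension
qdrant_client.create_collection(
    collection_name=Config.QDRANT_COLLECTION,
    vectors_config=VectorParams(size=512, distance=Distance.COSINE),
)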

Your collection payload will look like this:


For the image search functionality, we simply repeat the embedding process for the user’s uploaded image and query the collection.

# Read image bytes and generate embedding
image_bytes = search_image.read()
search_embedding = get_image_embedding(image_bytes)

# Search the VDB with the embedded query image
search_results = qdrant_client.query_points(
    collection_name=Config.QDRANT_COLLECTION,
    query=search_embedding,
    limit=5,
    with_payload=True
).points


In my demo, I return the top 5 images from the VDB that are most similar to the query image.
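
Each returned point carries its similarity score and the payload we stored at indexing time, so turning the hits into a result page is straightforward. A minimal sketch:

# Render the matched products (each point has a score and the stored payload)
for point in search_results:
    print(
        point.payload["product_name"],
        point.payload["product_description"],
        f"similarity: {point.score:.4f}",
    )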

That’s all; this is how we can perform very powerful image-to-image search by leveraging the behaviour of the VDB.

Other image embedding models

You may have noticed something unexpected in the search results. In my demo, the query image was a green sports car. However, the second and third results were a blue car and a red car (with ~69% similarity), while the green car came in fourth with a slightly lower score (68.86%).

By common sense, the green car should rank higher. Why did this happen?


This is because the CLIP model I’m using excels at shape, object type, composition, and scene context, but is less focused on exact colors, textures, and fine details.

Considering we want to provide the best experience for users visiting our e-commerce website, and a user searching with a green car probably wants to see green cars, we can enhance the accuracy of our image-to-image search feature by using another model.

ResNet50 is a classic Convolutional Neural Network (CNN) that is excellent at capturing textures, edges, and colors. By using ResNet50 to re-embed our images into a separate collection we can factor in these visual properties.

I wrote a script named import_images_visual.py to re-embed all my images using ResNet50.

Below is the main code that uses ResNet50 to re-embed my existing images into another collection, so the embeddings also factor in properties such as colour.

visual_embedder = VisualEmbedder()
embedding = visual_embedder.encode_image(image_bytes)

def encode_image(self, image_bytes):
    # Load image
    image = Image.open(io.BytesIO(image_bytes))

    # Convert to RGB if necessary (flatten any transparency onto a white background)
    if image.mode in ('RGBA', 'LA', 'P'):
        background = Image.new('RGB', image.size, (255, 255, 255))
        if image.mode == 'P':
            image = image.convert('RGBA')
        background.paste(image, mask=image.split()[-1] if image.mode in ('RGBA', 'LA') else None)
        image = background
    elif image.mode != 'RGB':
        image = image.convert('RGB')

    # Preprocess (resize, crop, normalize) and add a batch dimension
    input_tensor = self.preprocess(image)
    input_batch = input_tensor.unsqueeze(0)

    # Generate embedding without tracking gradients
    with torch.no_grad():
        embedding = self.model(input_batch)

    # Flatten and convert to list
    embedding = embedding.squeeze().numpy().tolist()

    return embedding

The most critical part of this logic is the pre-processing block. By converting all images (including transparent PNGs) into a standardized RGB format on a white background, it ensures the model isn’t confused by transparency artifacts or varying color modes. This creates a consistent “baseline” for color detection.

By using torch.no_grad(), the code efficiently extracts the raw mathematical representation of the image’s visual features and flattens them into a list, making it ready for comparison in a vector database like Qdrant.

Unlike “semantic” models that focus on what an object is (e.g., “a car”), this code focuses on what the object looks like. Because it preserves the multi-channel RGB data during processing, the resulting embedding is very sensitive to pixel-level information.
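
The snippet above references self.model and self.preprocess, which are set up when the VisualEmbedder is constructed. Here is a minimal sketch of what that initialization could look like, assuming torchvision’s pretrained ResNet50 with the final classification layer removed (so the forward pass returns the 2048-dimensional pooled feature vector) and the standard ImageNet preprocessing:

import torch
import torch.nn as nn
from torchvision import models, transforms

class VisualEmbedder:
    def __init__(self):
        # Pretrained ResNet50; swap the classifier for an identity layer so the
        # forward pass returns the 2048-d pooled feature vector
        self.model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.model.fc = nn.Identity()
        self.model.eval()

        # Standard ImageNet preprocessing (resize, centre-crop, normalize)
        self.preprocess = transforms.Compose([
            transforms.Resize(256),
            transforms.CenterCrop(224),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406],
                                 std=[0.229, 0.224, 0.225]),
        ])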

Below you can see a comparison of the image-to-image search results from the CLIP model versus the ResNet50 model. The ResNet50 model returns more accurate results.


Hybrid Scoring

However, relying 100% on a visual model like ResNet50 has its risks. For instance, searching for a green car might accidentally return a green apple because the model sees a similar color and curved shape, even though the objects are completely different.

To solve this, we can implement Hybrid Scoring. By combining the strengths of both models, we ensure the search remains semantically correct while respecting visual details. You can design a weighted formula like this:

Final Score = (0.6 × CLIP Semantic Score) + (0.4 × ResNet50 Visual Score)
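
Here is a rough sketch of how that weighting could be wired up, assuming the same product IDs are stored in both a CLIP collection and a ResNet50 collection (the collection names below are placeholders):

def hybrid_search(image_bytes, top_k=5):
    # Query both collections, each with the embedding from its own model
    clip_hits = qdrant_client.query_points(
        collection_name="products_clip",        # placeholder name
        query=get_image_embedding(image_bytes),
        limit=20,
        with_payload=True,
    ).points
    visual_hits = qdrant_client.query_points(
        collection_name="products_visual",      # placeholder name
        query=visual_embedder.encode_image(image_bytes),
        limit=20,
        with_payload=True,
    ).points

    # Final Score = 0.6 * CLIP semantic score + 0.4 * ResNet50 visual score
    # (in practice you may want to normalize the two score ranges first)
    visual_by_id = {p.id: p.score for p in visual_hits}
    combined = []
    for p in clip_hits:
        score = 0.6 * p.score + 0.4 * visual_by_id.get(p.id, 0.0)
        combined.append((score, p.payload))

    combined.sort(key=lambda x: x[0], reverse=True)
    return combined[:top_k]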

By combining the semantic depth of CLIP with the visual precision of ResNet50, you can build a search experience that truly “understands” both what a user is looking for and what it should look like. This hybrid approach transforms a simple e-commerce site into a powerful, intuitive discovery engine that keeps customers engaged and finding exactly what they need.

You can download my demo code from my Github repo.