How to Implement Cosine Similarity Using Embeddings in Python Step by Step

Introduction

In modern Artificial Intelligence and Data Science applications, measuring similarity between pieces of data is a fundamental requirement. One of the most widely used techniques for this purpose is cosine similarity, especially when working with text embeddings.

Cosine similarity is a mathematical metric used to measure how similar two vectors are, based on the angle between them rather than their magnitude.

Embeddings, on the other hand, are numerical representations of data (such as text, images, or audio) in vector form. These vectors capture semantic meaning, allowing machines to compare and analyze data effectively.

In practical terms:

  • Embeddings convert data into vectors

  • Cosine similarity compares those vectors

  • The result indicates how closely related two inputs are

This concept is widely used in search engines, recommendation systems, and AI-powered applications.

Detailed Explanation with Examples

Understanding Cosine Similarity

Cosine similarity calculates the cosine of the angle between two vectors. The closer the value is to 1, the more similar the vectors are. A value of 0 means the vectors are orthogonal (unrelated), while -1 means they point in opposite directions.

Formula

Cosine Similarity is calculated as:

cos(θ) = (A · B) / (||A|| × ||B||)

Where:

  • A and B are vectors

  • “·” represents dot product

  • ||A|| and ||B|| represent magnitudes of the vectors
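The formula translates directly into NumPy. The sketch below is a minimal implementation (the function name cosine_sim is just for illustration), using np.dot for the dot product and np.linalg.norm for the magnitudes:

```python
import numpy as np

def cosine_sim(a, b):
    # Dot product of A and B divided by the product of their magnitudes
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1, 2, 3])
b = np.array([1, 2, 4])
print(cosine_sim(a, b))  # ≈ 0.9915
```

Because the vectors point in nearly the same direction, the result is close to 1.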

Example in Python

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Example vectors as 2D arrays, since scikit-learn expects shape (n_samples, n_features)
vector1 = np.array([[1, 2, 3]])
vector2 = np.array([[1, 2, 4]])

# cosine_similarity returns a 2D array; element [0][0] is the score for this pair
similarity = cosine_similarity(vector1, vector2)
print(similarity[0][0])

Using Text Embeddings

In real-world AI applications, embeddings are generated using models such as OpenAI's embedding models, Sentence Transformers, or other NLP libraries.

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

# Load a pre-trained sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ["AI is powerful", "Artificial Intelligence is very powerful"]
embeddings = model.encode(sentences)  # one embedding vector per sentence

# Wrap each embedding in a list to form the 2D input cosine_similarity expects
similarity = cosine_similarity([embeddings[0]], [embeddings[1]])
print(similarity[0][0])

This approach allows systems to understand semantic similarity rather than just matching keywords.

Real-Life Examples and Scenarios

Scenario 1: Semantic Search

Search engines use cosine similarity to return results that are meaningfully related, not just keyword matches.

Example:

  • Query: "best way to learn AI"

  • Results include articles about machine learning and deep learning
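The ranking step can be sketched with cosine similarity directly. In the toy example below, hand-made vectors stand in for real embeddings (in practice they would come from a model such as Sentence Transformers), and the documents are ranked by similarity to the query:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Toy vectors standing in for real embeddings
query = np.array([[0.9, 0.1, 0.3]])
documents = np.array([
    [0.8, 0.2, 0.4],  # article on machine learning
    [0.1, 0.9, 0.2],  # article on cooking
    [0.7, 0.3, 0.5],  # article on deep learning
])

# One similarity score per document; higher means more relevant
scores = cosine_similarity(query, documents)[0]

# Rank document indices from most to least similar to the query
ranking = np.argsort(scores)[::-1]
print(ranking)
```

The two AI-related documents rank above the unrelated one, even though no keywords were compared.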

Scenario 2: Recommendation Systems

E-commerce and streaming platforms recommend items based on the similarity between a user's preferences and the available content.

Scenario 3: Chatbots and AI Assistants

AI systems compare user queries with stored knowledge to generate relevant responses.

Real-World Use Cases

  • Semantic search engines

  • Document similarity detection

  • Recommendation systems (Netflix, Amazon-like platforms)

  • Plagiarism detection tools

  • Chatbots and conversational AI systems

Advantages and Disadvantages

Advantages

  • Captures semantic meaning effectively

  • Works well with high-dimensional data

  • Widely supported in AI and machine learning libraries

Disadvantages

  • Requires vector generation (embeddings)

  • Performance depends on embedding quality

  • Computational cost can increase with large datasets

Comparison Table

Feature           | Cosine Similarity           | Euclidean Distance
Basis             | Angle between vectors       | Distance between points
Scale Sensitivity | Not affected by magnitude   | Affected by magnitude
Best Use Case     | Text similarity, embeddings | Geometric distance problems
Output Range      | -1 to 1                     | 0 to infinity
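The scale-sensitivity difference can be verified directly. The sketch below compares a vector with a scaled copy of itself using scikit-learn's cosine_similarity and euclidean_distances:

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances

a = np.array([[1, 2, 3]])
b = a * 10  # same direction, ten times the magnitude

# Cosine similarity ignores magnitude: identical direction gives 1.0
print(cosine_similarity(a, b)[0][0])

# Euclidean distance grows with magnitude even though the direction is unchanged
print(euclidean_distances(a, b)[0][0])
```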

Summary

Cosine similarity, when combined with embeddings, forms the backbone of many modern AI systems. It enables applications to compare data based on meaning rather than exact matches, making it highly effective for tasks such as semantic search, recommendation systems, and natural language processing. By leveraging Python libraries and pre-trained embedding models, developers can efficiently implement scalable and intelligent similarity-based solutions in real-world applications.