Understanding Partitioning And Partition Key In Azure Cosmos DB

Jignesh Trivedi

Introduction

Azure Cosmos DB is a globally distributed, multi-model database service. It is a NoSQL database used to store unstructured data. It is also a high-performance, non-normalized database that supports automatic scaling, high availability, and low latency. Azure Cosmos DB evolved from the Microsoft DocumentDB service and is a horizontally scalable database that supports multiple APIs and multiple data models. It is globally distributed, which means that it is available in every Azure region (multiple geographic locations) and allows us to replicate data to as many data centers and regions as we wish.

Partitioning in Azure Cosmos DB is used to scale individual containers in a database so that they meet the performance requirements of our application. Cosmos DB creates logical partitions by dividing the items in a container into distinct subsets. Logical partitions are created based on the value of a "partition key" that is associated with every item in the container. All items with the same partition key value belong to the same logical partition.

For example, suppose a container holds items with the properties EmployeeId, Code, departmentId, and others. EmployeeId is unique for each item in the container. If EmployeeId is used as the partition key and there are 10,000 unique EmployeeId values, then 10,000 logical partitions are created for the container.
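
The grouping described above can be illustrated with a small Python sketch. It is only a local simulation of how items fall into logical partitions, not how Cosmos DB is implemented; the helper name group_by_partition_key and the sample values are made up for illustration.

```python
from collections import defaultdict


def group_by_partition_key(items, key_property):
    """Group items into simulated logical partitions by one property's value."""
    partitions = defaultdict(list)
    for item in items:
        partitions[item[key_property]].append(item)
    return dict(partitions)


# Hypothetical sample items using the property names from the example above.
employees = [
    {"EmployeeId": "E1", "Code": "A", "departmentId": "201"},
    {"EmployeeId": "E2", "Code": "B", "departmentId": "201"},
    {"EmployeeId": "E3", "Code": "C", "departmentId": "305"},
]

by_employee = group_by_partition_key(employees, "EmployeeId")
by_department = group_by_partition_key(employees, "departmentId")

print(len(by_employee))    # one logical partition per unique EmployeeId
print(len(by_department))  # fewer, larger logical partitions
```

A unique property such as EmployeeId yields one logical partition per item, while a shared property such as departmentId yields fewer partitions that each hold more items.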

Additionally, each item has an item ID that is unique within its logical partition. The combination of item ID and partition key uniquely identifies the item and forms the item's index. Selecting a partition key is therefore important, as it affects our application's performance.

A container is the fundamental unit of scalability. Its data and throughput are partitioned based on the partition key that we specify. A logical partition also defines the scope of database transactions.

Physical partitions

A container is scaled by distributing its data across a large number of logical partitions. Cosmos DB maps one or more logical partitions to a physical partition, which consists of a set of replicas. Each replica hosts an instance of the Azure Cosmos DB database engine, and the replica set makes the data within a physical partition durable, highly available, and consistent. Together, the replicas of a physical partition support the throughput allocated to that physical partition.

A partition key that does not distribute requests evenly can create "hot" partitions, which results in inefficient use of provisioned throughput and higher cost. Logical partitions are defined by the partition key, but physical partitions are managed by the system itself. We cannot control the count, size, or placement of physical partitions. We can control the number of logical partitions and the distribution of data by selecting a proper partition key.
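
To make the idea of a hot partition concrete, here is a minimal Python sketch that measures how unevenly a list of requests is spread over partition key values. The function names and the request data are hypothetical; in practice the numbers would come from our own workload metrics.

```python
from collections import Counter


def request_share(partition_keys):
    """Return the fraction of requests that hit each partition key value."""
    counts = Counter(partition_keys)
    total = sum(counts.values())
    return {key: count / total for key, count in counts.items()}


def hottest(partition_keys):
    """Return the busiest partition key value and its share of all requests."""
    shares = request_share(partition_keys)
    key = max(shares, key=shares.get)
    return key, shares[key]


# Hypothetical workload: 100 requests keyed by a low-cardinality country code.
requests = ["IN"] * 80 + ["US"] * 15 + ["UK"] * 5

key, share = hottest(requests)
print(key, share)  # a single key value receives most of the traffic
```

When one key value receives most of the requests, as in this made-up workload, the partition that holds it becomes the bottleneck while the throughput available to the others goes unused.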

How to select partition key

Consider the following points when choosing a partition key.

A single logical partition has an upper storage limit of 10 GB
We can set a Cosmos DB container's throughput between 400 and 100,000 request units per second (RU/s). Requests to the same partition cannot exceed the defined throughput, so it is important to pick the right partition key
Choose a partition key that is frequently used as a filter in our queries
Choose a partition key that has a wide range of values. This reduces the amount of data stored in each logical partition and lets throughput be distributed among multiple logical partitions
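
The checks above can be roughed out before a container is created. The following Python sketch estimates, for a candidate partition key, how many distinct values it has and how large the biggest logical partition would be. It assumes item size can be approximated by the length of its JSON text, and it uses the 10 GB figure quoted above as a default parameter that should be checked against the current service limits; the helper names are made up for illustration.

```python
import json
from collections import defaultdict

# Storage limit quoted in the article; an assumption to verify, not a constant from an SDK.
LOGICAL_PARTITION_LIMIT_BYTES = 10 * 1024**3


def partition_sizes(items, key_property):
    """Approximate bytes stored per partition key value (JSON text length)."""
    sizes = defaultdict(int)
    for item in items:
        sizes[item[key_property]] += len(json.dumps(item).encode("utf-8"))
    return dict(sizes)


def summarize(items, key_property, limit=LOGICAL_PARTITION_LIMIT_BYTES):
    """Summarize cardinality and the largest logical partition for a candidate key."""
    sizes = partition_sizes(items, key_property)
    largest_key = max(sizes, key=sizes.get)
    return {
        "distinct_values": len(sizes),
        "largest_key": largest_key,
        "largest_bytes": sizes[largest_key],
        "within_limit": sizes[largest_key] <= limit,
    }


# Hypothetical sample data.
items = [
    {"Id": "1", "zipcode": "364002"},
    {"Id": "2", "zipcode": "364002"},
    {"Id": "3", "zipcode": "380001"},
]
print(summarize(items, "zipcode"))
```

Running the same summary for several candidate properties on a representative sample gives a quick comparison of how each one would spread the data.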

Synthetic partition key

It is good practice to choose a partition key with many distinct values. The goal is to distribute our workload and data evenly across the partition key values. If no such property exists in our data, we can create an artificial partition key, known as a "synthetic partition key". There are several ways to create a synthetic partition key for our container.

Create partition key with a random suffix

We can create many distinct partition key values by appending a random suffix to the end of the partition key. Distributing items this way lets us perform write operations in parallel across partitions.

Example

If our partition key represents a zip code, we might choose a random number between 300 and 800 and append it to the zip code as a suffix.

{
"Id" : "1",
"partitionKey": "364002-753"
}
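
A minimal Python sketch of the random-suffix approach follows. The function names are hypothetical, and the 300 to 800 range is taken from the example above. The second helper lists every key value that a zip code may have been written under, since the suffix used for a particular item is not known at read time.

```python
import random


def with_random_suffix(zipcode, low=300, high=800):
    """Build a synthetic partition key by appending a random number to the zip code."""
    return f"{zipcode}-{random.randint(low, high)}"


def all_suffixed_keys(zipcode, low=300, high=800):
    """List every key value the zip code may have been written under."""
    return [f"{zipcode}-{n}" for n in range(low, high + 1)]


item = {"Id": "1", "partitionKey": with_random_suffix("364002")}
print(item["partitionKey"])              # e.g. 364002-753; the suffix varies per call
print(len(all_suffixed_keys("364002")))  # number of partitions a read may have to check
```

Writes spread over many partition key values, but finding all items for one zip code means querying every possible suffix.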

Create partition key with pre-calculated suffixes

Appending a random suffix greatly improves write throughput, but it makes reading a specific item difficult because we don't know which suffix was used. This problem can be resolved by using a pre-calculated suffix instead.

Consider the previous example, where a container uses a zip code as the partition key. Now suppose each item has a date property and we often run queries to find items by date. We can combine the zip code with the date to make the partition key.

{
"Id" : "1",
"date" : "2019-11-23",
"partitionKey": "364002_20191123"
}
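
Because the suffix is derived from data we already have, the same key can be recomputed at read time. A minimal Python sketch, assuming the date is stored in ISO format as in the example above (the function name zip_date_key is made up for illustration):

```python
from datetime import date


def zip_date_key(zipcode, day):
    """Build the partition key from the zip code and the date, e.g. 364002_20191123."""
    return f"{zipcode}_{day:%Y%m%d}"


item = {"Id": "1", "date": "2019-11-23"}
item["partitionKey"] = zip_date_key("364002", date.fromisoformat(item["date"]))
print(item["partitionKey"])
```

A reader that knows the zip code and the date can rebuild the key and target a single logical partition.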

Combine multiple properties to make partition key

We can also create a partition key by combining multiple properties of our data. Consider the previous example, where a container used a zip code as a partition key. Now, suppose each item has a departmentId that we want to access and we often run queries to find items by departmentId. We can combine zip code and departmentId properties to make a partition key.

{
"Id" : "1",
"departmentId" : "201",
"zipcode" : "364002",
"partitionKey": "364002-201"
}
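
The same idea generalizes to any set of properties. The helper below is a hypothetical Python sketch that joins the chosen property values with a separator. It assumes the separator does not occur inside the values, otherwise two different combinations could produce the same key.

```python
def composite_key(item, properties, separator="-"):
    """Join the values of the given properties into one partition key value."""
    return separator.join(str(item[name]) for name in properties)


item = {"Id": "1", "departmentId": "201", "zipcode": "364002"}
item["partitionKey"] = composite_key(item, ["zipcode", "departmentId"])
print(item["partitionKey"])
```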

Summary

Azure Cosmos DB stores data across multiple physical partitions, and it creates one default partition when a database is created
Once a partition reaches its upper size limit, Cosmos DB creates another partition
The partition key is the property or path within our documents that is used to distribute data
Items with the same partition key value are logically grouped together and stored in the same physical partition
It is recommended to specify the partition key when performing CRUD operations, as this helps improve performance
The ideal partition key is one that is frequently used as a filter in our queries
Be careful when selecting the partition key. If it does not have many distinct values, the data is concentrated in very few partitions and queries are directed at those same partitions
In a multi-tenant application, TenantId is a good choice for the partition key
Pick a partition key that has many distinct values to avoid "hot partitions"
Use a synthetic partition key whenever required