How To Design And Query Data In Cosmos DB

Debasis Saha
5y
288k
0
11

Article

Introduction

Welcome back to the second article in an overview of CosmosDB – start from scratch article series. In the previous article, we discuss related to the basic concepts of the No-SQL Database include an overview of Cosmos DB, the Hierarchy structure of a Cosmos DB, etc. We also discuss how to create Cosmos DB using Azure Portal and the migration steps from any other database to Cosmos DB with the help of different tools. Now, in this article, we will discuss the design process related to the Cosmos DB. We also discussed related to different types of Query in Cosmos DB. If you want to read the previous article of this series, then please click on the below links:

Cosmos DB - An Overview (Part 1)

As we all know, Cosmos DB is a document database that can be run in SQL API or MongoDB API modes. So, we can perform the different types of Query and can interact with the database at the grain of schema-free documents. Documents of a similar type are organized into a collection. In this article, we will discuss how to design documents or collection to fulfill the business requirements. Also, we will discuss how to use the SQL API to build applications that use these documents. So, at the end of this article, the reader can be able to get an idea about the below points:

The basic concept related to the document models in Azure Cosmos DB
Concept of Partitioning Collections in Azure Cosmos DB
Query Data in SQL API Based Cosmos DB Database

Overview of Documents and Collection with Relations

As we discussed in the previous article, Cosmos DB supports the document database model. Maybe the document database model is new to some of us, especially related to the relational database model. So, first, let's compare the logical architecture of the document database along with the relational database like Microsoft SQL Server.

Document Database	Relational Database
Collections	Tables
Documents	Rows
Properties	Columns
Denormalized	Normalized
Schema-less	Declarative Schema
No Referential Integrity	Referential Integrity

The logical unit of data in a document database is a document – which is roughly equivalent to a relational row. Every document might have a different number of properties and associated values. Other than requiring that each document in the database has a unique identifier, there are no constraints on the schema of a document; each document in a document database might have a different set of properties. In Cosmos DB, a group of documents is stored together in a collection; collections are typically used to group a set of documents with common characteristics—for example, purchase orders, or personnel records—but there’s no requirement that all the documents in a collection represent the same type of object, or that they have the same schema.

In every document database, documents represent a serialized version of a programming object. As a result, sometimes it allows some repetition of data (or denormalizing) to provide simple query functionality and better performance. Many document model-based database systems like MongoDB always force simple queries because they do not support JSON operations in queries. But Cosmos DB SQL API does support a form of a JOIN syntax which will be discussed later in this article.

Documents Structure in the Document Database

In a Cosmos DB SQL API based database, normally documents are stored in JavaScript Object Notation (JSON) format. JSON is an industry-standard format for exchanging structured data. It is mainly derived from a subset of the JavaScript type system. Most of the programming languages include libraries that are used parsing JSON objects into native objects. The SQL API includes supports for GeoJSON – an extension to JSON used for encoding geospatial information – such as geographical latitude and longitude.

There are some key points which need to be remembered for Document structure in the document database:

Documents are stored as JSON format data.
Document Database supports GeoJSON for geometry related data
Each document must be a total size of 2 MB or less than 2 MB.
If the document contains binary blobs data and its size is above 2 MB, then need to store it as attachments.
Each document contains some default system properties like _rid, _etag, _ts, and _self.
It is automatically added against each document (as a GUID) if the application or source data does not have it.

Attachments

Although a single document should be no larger than 2 MB, you can associate binary blobs with a document by using an attachment. An attachment is a pointer record from a document inside the database to a blob stored elsewhere. You are permitted to create attachments linked to blobs in the Cosmos DB blob store up to a maximum total of 2 GB per database account—or you create attachments to blobs of unlimited size held external to the database, in services such as Azure Blob Store, or Microsoft OneDrive. The _media property of the attachment holds a URI to the blob location.

Document System Generated Properties

When you add a document to a Cosmos DB SQL API database, several system-generated properties are added to the document. The names of all system-generated properties are prefixed with an underscore (_),

Property	Purpose
_rid	Internal Unique Identifier
_etag	Used for concurrency control
_ts	The timestamp of the last update to the document expressed as seconds elapsed since January 1, 1970
_self	A unique URI for the document

Besides, all documents in a SQL API database must have an ID property. If you don’t specify an ID property when you create a document, the database will automatically generate an ID using a globally unique identifier (GUID).

Partitioning Collections in Cosmos DB

Cosmos DB supports logical partitioning and physical partitioning to provide predictable performance for data at any scale.

Logical partitioning

When we create a collection, our primary design focus is to select a partitioning key. A partitioning key defines a logical partition that is a grouping of all the data associated with each partitioning key value. The logical boundary that is defined by a partitioning key determines the scope of unique keys and atomic transactions in the collection; a unique key or transaction cannot extend beyond a single logical partition.

Physical partitioning

When we have assigned a partitioning key and a throughput level for a collection, Cosmos DB manages the provisioning and scaling of your collection automatically. Behind the scenes, Cosmos DB stores our data in physical partitions; a physical partition comprises a block of disk space, and associated CPU and memory resources—the amount of disk, CPU, and memory varies based on the amount of throughput provisioned for your collection.

How To Design And Query Data In Cosmos DB

Although all the data associated with a single partitioning key is stored in a single physical partition, physical partitions do not map one-to-one with partitioning key values; many partitioning keys are assigned to each physical partition. Partitioning key values are assigned to physical partitions using a pseudo-random hash of the partitioning key value. In some cases, a physical partition might contain data for more than one collection. By splitting partitions as they become full, Cosmos DB supports collections of unlimited size. This automatic scaling behavior is applied to any collection—regardless of whether a partition key is defined—that is provisioned with 1,000 RU/s or greater throughput. Collections with Fixed scaling grow to their specified size limit; collections with Unlimited scaling grow without restriction.

Perform Different Types of Queries on SQL-API Based Cosmos DB

The Cosmos DB SQL API supports a dialect of SQL for querying documents using SELECT statements that will be familiar if you have written SELECT statements in a relational database using an ANSI SQL compliant database engine. The SQL API returns results in the form of JSON documents. All queries are executed in the context of a single collection. A SQL API SELECT Clause comprised the following clauses as mentioned in the picture.

The following example is a SELECT statement that returns all the documents in a collection. The result of this query is a JSON array containing the documents.

SELECT * FROM p

In the below example, the query uses projection to return the Title and Name properties from each document in the collection. When you use projection, the FROM clause is mandatory.

SELECT p.Title, p.Name FROM p

In this example, the query uses property name syntax to return the Title and Name properties from each document in the collection. When you use property name syntax, the FROM clause is mandatory.

SELECT p["Title"], p["Name"] FROM p

In this example, the query uses projection to return the Title property, and the FirstName and LastName properties from the Name subdocument from each document in the collection.

SELECT p.Title, p.Name.FirstName, p.Name.LastName FROM p
SELECT p["Title"], p["Name"]["FirstName"] FROM p

In this example, the first query uses a WHERE clause to return all the documents where the CurrentAddress.State property equals WA. The second example uses the IN keyword to return records in multiple states,

//one state
SELECT * FROM p WHERE p.CurrentAddress.State = "WA"
//three states
SELECT * FROM p WHERE p.CurrentAddress.State IN ("WA","VA","CA")

Use Join

Although the SQL API query language supports the JOIN clause in SELECT queries, the JOIN clause has reduced functionality when compared to JOIN clauses in ANSI SQL. ANSI SQL supports different types of join—inner, left outer, right outer, full outer, and cross—between rows in either the same table or different tables. By contrast, the SQL API supports only cross joins; the ON clause used in ANSI SQL to describe how two tables are connected is not available. Furthermore, you cannot use the SQL API JOIN clause to join two documents; you can only join inside a document, between a document and its subdocuments.

SELECT p.Name, p.CurrentAddress.City AS CurrentCity, h.City AS EarlierCity
FROM p
JOIN h IN p.AddressHistory

Aggregate Functions

We can use aggregate functions to summarize data in SELECT queries; you place aggregate functions in the SELECT clause. The SQL API query language supports the following aggregate functions:

COUNT
SUM
AVG
MAX
MIN

Although the syntax of aggregate functions is similar to ANSI SQL, unlike ANSI SQL the SQL API query language does not support the GROUP BY clause; you cannot generate subtotals for different values of the same property in a single query. We can include more than one aggregate function in the SELECT clause of your queries.

In the following example, the query returns the average, maximum, and the sum of the _ts property of the documents in a collection, in addition to a count of all the documents in the collection:

SELECT AVG(c._ts) AS avg, MAX(c._ts) AS max, SUM(c._ts) AS sum, COUNT(1) AS count
FROM c

Indexing Function

Indexes are used in relational and nonrelational database systems to improve the performance of the database when responding to equality, range, or ordered queries by providing fast access to property values. An index consumes space in the database, in addition to the space needed to store the data. While indexing improves read performance, it slows write performance because any indexes must be maintained in parallel with changes to the database.

All documents in a collection are automatically indexed by partition key and id; this is the primary index.

By default, Cosmos DB automatically creates secondary indexes on all the documents added to a collection, based on the settings in the collection’s default index policy. You might find that the default policy is sufficient for your needs, but you can modify the default index policy in several ways to tune the performance of your collection. In the Azure Portal Data Explorer, you view and edit the index policy for a collection from the Scale and Settings blade.

Conclusion

In this article, we discussed the document structure of the Cosmos DB Database. Also, we discussed the Partitioning concept in Cosmos DB and can perform different types of Query on Cosmos DB. Any suggestions or feedback or query related to this article are most welcome. Also, if you want to know how to perform CRUD operations using C# then visit the below-mentioned link for the related article:

Crud Operations in Cosmos DB using Asp.Net Core