MongoDB Aggregations (Day 14)

Article

Before reading this article, I highly recommend reading the following previous parts of the series:

Introduction

Aggregation functions perform operations on groups of documents and return the computed result. Aggregation works mainly to group the data of multiple documents and perform various operations on the grouped data and returns a single or multiple results.

Types of aggregate functions

MongoDB performs aggregate operations in one of the following three ways.

Single-purpose aggregate methods and commands.
Pipeline.
Map Reduce

First, we will create a “Demo” collection and insert the following data into that collection.

{
"_id": ObjectId("55dc6c94b32228b6ef8753c0"),
"Name": "Pankaj Choudhary",
"Age": 21,
"Salary": 25000
}
{
"_id": ObjectId("55dc6ca8b32228b6ef8753c1"),
"Name": "Sandeep Jangid",
"Age": 22,
"Salary": 27000
}
{
"_id": ObjectId("55dc6cb9b32228b6ef8753c2"),
"Name": "Rahul Prajapat",
"Age": 23,
"Salary": 37000
}
{
"_id": ObjectId("55dc6ccab32228b6ef8753c3"),
"Name": "Sanjeev Baldia",
"Age": 22,
"Salary": 28000
}
{
"_id": ObjectId("55dc6cdcb32228b6ef8753c4"),
"Name": "Narendra Sharma",
"Age": 25,
"Salary": 25000
}
{
"_id": ObjectId("55dc6cf5b32228b6ef8753c5"),
"Name": "Nitin Yadav",
"Age": 28,
"Salary": 35000
}
{
"_id": ObjectId("55dc6d09b32228b6ef8753c6"),
"Name": "Omveer Choudhary",
"Age": 32,
"Salary": 37000
}

First, we will read about single-purpose aggregates.

Single-Purpose aggregate methods and commands

As implied by its name, single-purpose aggregate methods are used for specific aggregation operations on sets of data. Single-purpose aggregate methods are less complex but with limited scope compared to pipeline and map reduce operations. Single-purpose aggregate methods and commands provide straightforward semantics for common data processing options.

The following are some important single-purpose operations.

Count

The Count operation takes a number of documents and depending on the match query returns the count of the documents. In MongoDB, the count command and cursor.count() method are used to do the count operations.

Example 1

db.Demo.count()

Output

7

In this example Demo.count() is a cursor method that returns the count of all the documents present in the Demo collection.

Example 2

db.runCommand({count:'Demo',query:{Salary:{$gt:30000}}})

Output

{ "n" : 3, "ok" : 1 }

In this example, the count command takes all the documents in which the value of the Salary field is greater than 30000 and returns the count.

Distinct

The Distinct operation takes a document and depending on the match query returns the unique values for a field. In MongoDB the cursor.distinct() method and the distinct command performs the distinct operations.

Example 1

db.Demo.distinct("Salary")

Output

[ 25000, 27000, 37000, 28000, 35000 ]

In this example, we use the distinct cursor method that returns the distinct value for the “Salary” field.

Example 2

db.runCommand({distinct:'Demo',key:"Salary",query:{Salary:{$gt:30000}}})

Output

{
"values":
[
37000,
35000
],
"stats":
{
"n": 3,
"nscanned": 0,
"nscannedObjects": 7,
"timems": 598,
"planSummary": "COLLSCAN"
},
"ok": 1
}

In this example the distinct command returned the distinct values for the “Salary” field where the value of the “Salary” field is greater than 30000.

Group

The Group operation takes a number of documents and depending on the match query creates a group of fields grouped by their value and finally returns an array of documents with the computed result for each group. In MongoDB, the group command and the cursor.group() methods do the group operations.

Example 1

db.Demo.group({key:{Age:1},reduce:function(cur,result){result.Salary+=cur.Salary},initial:{Salary:0}})

Output

[
{
"Age": 21,
"Salary": 25000
},
{
"Age": 22,
"Salary": 55000
},
{
"Age": 23,
"Salary": 37000
},
{
"Age": 25,
"Salary": 25000
},
{
"Age": 28,
"Salary": 35000
},
{
"Age": 32,
"Salary": 37000
}]

In this example, we used the group cursor method. This method groups the documents on the basis of the value of the “Age” field and computes the sum of Salary for each group.

Example 2

db.runCommand({group:{ns:'Demo',key:{Salary:1},cond:{Salary:{$lt:30000}},$reduce:function(cur,resu
lt){result.Age+=cur.Age},initial:{Age:10}}})

Output

{
"retval":
[
{
"Salary": 25000,
"Age": 56
}, {
"Salary": 27000,
"Age": 32
}, {
"Salary": 28000,
"Age": 32
}
],
"count": NumberLong(4),
"keys": NumberLong(3),
"ok": 1
}

In this example, we use the group command. This command groups the documents on the basis of the value of the Salary field and computes the sum of the Age field for each group.

Aggregate Pipeline

Pipeline means the possibility to execute an operation on some input and use the output as the input for the next command and so on. An aggregate pipeline is a framework modeled on the concept of data processing pipelines. In aggregate pipelines, documents enter a multi-stage pipeline that transforms the documents into an aggregated result. The pipeline operation uses the match query to fetch the exact documents and grouping to generate the group of documents. Pipeline operations provide tools for sorting documents by specified fields. A pipeline operation contains many stages, like filters that operate like queries and the transformation of documents that modify the form of output. Mainly, aggregate commands operate on a single collection and pass the entire collection into the aggregation pipeline.

Aggregate pipelines are an alternative to map-reduce. It may be an appropriate choice for aggregate operations because map-reduce are unnecessarily complex. But pipelines have some limited behavior on values types and result size.

Syntax

db.Collection_Name.aggregate({Pipeline expression})

Parameter

Pipeline expression: Pipeline expressions specify the transformation to apply on the input documents. A pipeline expression only operates on the current documents of the pipeline stage.

In an aggregate pipeline, documents pass through many stages and each stage transforms the documents into another form. It is not recommended that each pipeline stage produce a document with respect to each document. Some stages might reduce some documents and might generate new documents.

A pipeline aggregate contains a number of stages, each stage takes some documents as input and does operations on these documents and generates output. The following are the possible stages in an aggregate pipeline:

Stage	Description
$project	Select specific fields from a collection
$match	Specifies the selection criteria, to reduce the amount of documents
$group	Used to divide the documents into various groups
$sort	Sort the documents
$skip	Used to skip a number of documents
$limit	Defines the number of documents in an output result
$unwind	Unwinds the documents using arrays
$redact	Reshapes each document by restricting the content for each document
$out	It is the last stage of a pipeline that writes the resulting documents of the aggregate pipeline to a collection.

Let's see an example to understand the concept of aggregate pipelines.

Example 1

db.Demo.aggregate([{$group:{_id:"$Age",Salary_Is:{$sum:"$Salary"}}}])

Output

{
"_id": 32,
"Salary_Is": 37000
}
{
"_id": 25,
"Salary_Is": 25000
}
{
"_id": 23,
"Salary_Is": 37000
}
{
"_id": 22,
"Salary_Is": 55000
}
{
"_id": 28,
"Salary_Is": 35000
}
{
"_id": 21,
"Salary_Is": 25000
}

In this example, we group the documents by the field “Age” and for each group, we calculate the sum of Salary. The following is the equivalent SQL query for this example.

Select Age, sum(Salary) from Demo group by Age.

Example 2

db.Demo.aggregate
([{
$group: {
_id: "$Age",
Sum_Salary: {
$sum: "$Salary"
},
Avg_Salary: {
$avg: "$Salary"
},
Min
_Salary: {
$min: "$Salary"
},
Max_Salary: {
$max: "$Salary"
}
}
}])

Output

{
"_id": 32,
"Sum_Salary": 37000,
"Avg_Salary": 37000,
"Min_Salary": 37000,
"Max_Salary": 37000
}
{
"_id": 25,
"Sum_Salary": 25000,
"Avg_Salary": 25000,
"Min_Salary": 25000,
"Max_Salary": 25000
}
{
"_id": 23,
"Sum_Salary": 37000,
"Avg_Salary": 37000,
"Min_Salary": 37000,
"Max_Salary": 37000
}
{
"_id": 22,
"Sum_Salary": 55000,
"Avg_Salary": 27500,
"Min_Salary": 27000,
"Max_Salary": 28000
}
{
"_id": 28,
"Sum_Salary": 35000,
"Avg_Salary": 35000,
"Min_Salary": 35000,
"Max_Salary": 35000
}
{
"_id": 21,
"Sum_Salary": 25000,
"Avg_Salary": 25000,
"Min_Salary": 25000,
"Max_Salary": 25000
}

In this example we group documents by “Age” field and calculate the sum, average, minimum and maximum salary. The following is the equivalent query in SQL.

Select Age, sum(Salary), avg(Salary), max(Salary), min(Salary) from Demo group by Age.

Example 3

db.Demo.aggregate
([{
$group:
{
_id: "$Age",
Sum_Salary:
{
$sum: "$Salary"
},
Avg_Salary:
{
$avg: "$Salary"
},
Min
_Salary:
{
$min: "$Salary"
},
Max_Salary:
{
$max: "$Salary"
}
}
},
{
$match:
{
_id:
{
$gt: 21,
$lt: 27
}
}
}])

Output

{

    "_id": 25,

    "Sum_Salary": 25000,

    "Avg_Salary": 25000,

    "Min_Salary": 25000,

    "Max_Salary": 25000

}

{

    "_id": 23,

    "Sum_Salary": 37000,

    "Avg_Salary": 37000,

    "Min_Salary": 37000,

    "Max_Salary": 37000

}

{

    "_id": 22,

    "Sum_Salary": 55000,

    "Avg_Salary": 27500,

    "Min_Salary": 27000,

    "Max_Salary": 28000

}
In this example, documents are passed through two stages. The first stage is “group”. In that stage, we group documents by the “Age” field and calculate the sum, average, minimum and maximum salary. The second stage is the “match”, that stage filters all the documents depending on the value of the Age field.

The following query also generate the same output

db.Demo.aggregate
([{
$match:
{
Age:
{
$gt: 21,
$lt: 27
}
}
}, {
$group:
{
_id: "$Age",
Sum_Salary:
{
$sum: "$Salary"
},
Avg_Salary:
{
$avg: "$Salary"
},
Min_Salary:
{
$min: "$Salary"
},
Max_Salary:
{
$max: "$Salary"
}
}
}])

The following is the equivalent query in SQL.

Select Age, sum(Salary), avg(Salary), max(Salary), min(Salary) from Demo Where Age>21 And Age<27 group by Age

Example 4

db.Demo.aggregate
([{
$limit: 3
},
{
$group:
{
_id: "$Age",
Sum_Salary:
{
$sum: "$Salary"
},
Avg_Salary:
{
$avg: "
$Salary "
},
Min_Salary:{$min:"
$Salary "},Max_Salary:{$max:"
$Salary "}
}
}])

Output

{
"_id": 23,
"Sum_Salary": 37000,
"Avg_Salary": 37000,
"Min_Salary": 37000,
"Max_Salary": 37000
}
{
"_id": 22,
"Sum_Salary": 27000,
"Avg_Salary": 27000,
"Min_Salary": 27000,
"Max_Salary": 27000
}
{
"_id": 21,
"Sum_Salary": 25000,
"Avg_Salary": 25000,
"Min_Salary": 25000,
"Max_Salary": 25000
}

In this example, documents are passed through two stages. The first stage is “limit” that selects the top 3 documents from the collection and pass these documents to the second stage. The second stage groups the documents by the “Age” field and calculates the sum, average, minimum and maximum salary.

Important Point

The order of the stages in the aggregate pipeline is very important. Like db.Demo.aggregate($limit,$group) and db.Demo.aggregate($group,$limit) the methods don’t provide the same result.

Now we will execute the previous method but interchange the order of the “limit” and “group” stages.

db.Demo.aggregate
([{
$group:
{
_id: "$Age",
Sum_Salary:
{
$sum: "$Salary"
},
Avg_Salary:
{
$avg: "$Salary"
},
M
in_Salary:
{
$min: "$Salary"
},
Max_Salary:
{
$max: "$Salary"
}
}
}, {
$limit: 3
}])

Output

{
"_id": 32,
"Sum_Salary": 37000,
"Avg_Salary": 37000,
"Min_Salary": 37000,
"Max_Salary": 37000
}
{
"_id": 25,
"Sum_Salary": 25000,
"Avg_Salary": 25000,
"Min_Salary": 25000,
"Max_Salary": 25000
}
{
"_id": 23,
"Sum_Salary": 37000,
"Avg_Salary": 37000,
"Min_Salary": 37000,
"Max_Salary": 37000
}

We can see that the result of this method and previous method is not the same, so the selection of the order of the stages is very important. This selection may change the expected result.

Example 5

db.Demo.aggregate
([{
$group:
{
_id: "$Age",
Sum_Salary: {
$sum: "$Salary"
},
Avg_Salary: {
$avg: "$Salary"
},
M
in_Salary: {
$min: "$Salary"
},
Max_Salary: {
$max: "$Salary"
}
}
}, {
$project: {
Sum_Salary: 1,
Max_Salary: 1,
_id: 0
}
}, {
$limit: 4
}])

Output

{
"Sum_Salary": 37000,
"Max_Salary": 37000
}
{
"Sum_Salary": 25000,
"Max_Salary": 25000
}
{
"Sum_Salary": 37000,
"Max_Salary": 37000
}
{
"Sum_Salary": 55000,
"Max_Salary": 28000
}

In this example documents passed through the following three stages. The first stage is “group” as described in the previous examples. The second stage is “project”. In that stage, we only select the Sum_Salary and Max_Salary fields to display. The third stage is “limit”, in that stage we select the top 4 documents from the result of the previous stage (project).

Example 6

db.Demo.aggregate([{
$group: {
_id: "$Age",
Sum_Salary: {
$sum: "$Salary"
},
Avg_Salary: {
$avg: "$Salary"
},
M
in_Salary: {
$min: "$Salary"
},
Max_Salary: {
$max: "$Salary"
}
}
}, {
$project: {
Sum_Salary: 1,
Max_Salary: 1,
_id: 0
}
}, {
$limit: 4
}, {
$skip: 2
}])

Output

{
"Sum_Salary": 37000,
"Max_Salary": 37000
}
{
"Sum_Salary": 55000,
"Max_Salary": 28000
}

This example is the same as the previous example but we add an extra stage, “skip”. So the result of the “limit” stage is passed to the “skip” stage and this stage removes the top 2 documents from the collection and prints the remaining documents.

Example 7

db.Demo.aggregate([{
$group: {
_id: "$Age",
Sum_Salary: {
$sum: "$Salary"
},
Avg_Salary: {
$avg: "$Salary"
},
M
in_Salary: {
$min: "$Salary"
},
Max_Salary: {
$max: "$Salary"
}
}
}, {
$project: {
Sum_Salary: 1,
Max_Salary: 1,
_id: 0
}
}, {
$limit: 4
}, {
$sort: {
Sum_Salary: 1
}
}])

Output

{
"Sum_Salary": 25000,
"Max_Salary": 25000
}
{
"Sum_Salary": 37000,
"Max_Salary": 37000
}
{
"Sum_Salary": 37000,
"Max_Salary": 37000
}
{
"Sum_Salary": 55000,
"Max_Salary": 28000
}

This example is similar to Example 6, but in this example, we use an extra stage, “sort". That stage retrieves the documents from the output of the “limit” stage and sorts the documents depending on the value of the Sum_Salary field in ascending order.

Example 8

db.Demo.aggregate([{
$group: {
_id: "$Age",
Sum_Salary: {
$sum: "$Salary"
},
Avg_Salary: {
$avg: "$Salary"
},
M
in_Salary: {
$min: "$Salary"
},
Max_Salary: {
$max: "$Salary"
}
}
}, {
$project: {
Sum_Salary: 1,
Max_Salary: 1,
_id: 0
}
}, {
$redact: {
$cond: {
if: {
$gt: ["$Sum_Salary", 30000]
},
then: "$$PRUNE",
else: "$$DESCEND"
}
}
}])

Output

{

    "Sum_Salary": 25000,

    "Max_Salary": 25000

}

{

    "Sum_Salary": 25000,

    "Max_Salary": 25000

}
In this example, we use a redact stage. This stage restricts the contents of the documents based on information stored in the documents themselves. In the preceding $redact stage we use two system variables. The first variable is ”$$PRUNE”, this variable excludes all the fields present in the current document or embedded at the document level. The second variable is “$$DESCEND” that returns all the fields of the current documents level, excluding embedded level documents. So the preceding query excludes all the documents where the value of “Sum_Salary” is greater than 30000.

Map-Reduce

Map-Reduce is another way to do aggregation. Map-Reduce is a combination of two operations. The first part of Map-Reduce is “Map”, that processes each document and retrieves one or more objects for each input document. The second part of Map-Reduce is “Reduce” that combines the result of the Map operations. Map-Reduce handles large amounts of data into aggregate results.

How Map-Reduce operation Works

For Map-Reduce operations, MongoDB provides the mapReduce database command. In a Map-Reduce operation, MongoDB first applies the map operation on all the documents that match the query condition. The result of the map operation generates the key-value pairs. For those keys that have more than one value, MongoDB applies reduce operations that collect and condense the aggregate result. Then MongoDB stores the results in a collection. Map-Reduce also contains an optional function, finalize. We can pass the output of the reduce operation to the finalize function to further condense the results of the aggregate operation.

MongoDB uses a JavaScript function in map-reduce aggregation. Map-Reduce aggregates may return the documents as output or may write the result to a collection. Map-Reduce uses the JavaScript functions that provide high flexibility.

Let us look at some examples for Map-Reduce aggregations.

Example 1

To perform the the mapReduce function for any collection, we must perform the following three steps. Now we will understand these procedures using the “Demo” collection.

Step 1

Define the map function.

function()
{
emit(this.Age,this.Salary);
} ;

In the function “this” defines the current collection name. This function maps the Salary to Age for each document and emits the Age and Salary pairs.

Step 2

Define the reduce function.

function(key,values)
{
return Array.sum(values);
}

Define the reduce function with the two arguments, key, and values. The values parameter is an array type that is emitted by the map function and contains salary values and grouped by Age. This function reduces the values array to a sum of its elements.

Step 3

Output Collection.

{ out:"My_Coll"}

This operation stores the result in the “My_Coll” collection. If the collection already exists then the contents of the collection will replace the contents of the mapReduce function.

The combined form of these three steps is:

db.Demo.mapReduce(function()
{
emit(this.Age, this.Salary);
},
function(key, values)
{
return Array.sum(v alues);
},
{
query:
{
Age:
{
$gt: 20
}
},
out: "My_Coll"
})

So when we execute the preceding query MongoDB returns the following outputs as acknowledgement.

{
"result" : "My_Coll",
"timeMillis" : 4,
"counts" :
{
"input" : 7,
"emit" : 7,
"reduce" : 1,
"output" : 6
},
"ok" : 1
}

The preceding result defines a number of input documents, number of emit documents, number of reduce documents, number of outputs and time to execute the mapReduce command.

The result of the preceding query will be stored in the “My_Coll” collection so we can use the find() method to retrieve the result from the “My_Coll” collection.

db.My_Coll.find().pretty()

Output

{
"_id": 21,
"value": 25000
}
{
"_id": 22,
"value": 55000
}
{
"_id": 23,
"value": 37000
}
{
"_id": 25,
"value": 25000
}
{
"_id": 28,
"value": 35000
}
{
"_id": 32,
"value": 37000
}

Example 2

Step 1

var map_func = function()
{
emit(this.Age, this.Salary);
};

Step 2

var reduce_func = function(key, values)
{
return Array.sum(values);
};

Step 3

db.Demo.mapReduce(map_func, reduce_func,
{
query:
{
Age:
{
$gt: 23
}
},
out: "My_Coll"
})

Output

{
"result": "My_Coll",
"timeMillis": 4,
"counts":
{
"input": 3,
"emit": 3,
"reduce": 0,
"output": 3
},
"ok": 1
}

The following is the data of My_Coll collection:

db.My_Coll.find().pretty()

Output

{
"_id": 25,
"value": 25000
}
{
"_id": 28,
"value": 35000
}
{
"_id": 32,
"value": 37000
}

In this example, we will define the map and reduce functions in two variables and use both variables in the mapReduce function. We also define the query for collections that define the selection criteria.

Example

db.Demo.mapReduce(
function()
{
emit(this.Age, this.Salary);
},
function(key, values)
{
return Array.avg(v alues);
}, {
query:
{
Age:
{
$gt: 20
},
Salary:
{
$gte: 24000
}
},
out: "My_Coll",
limit: 4,
sort:
{
Salary: 1
}
})

Output

{
"result": "My_Coll",
"timeMillis": 5,
"counts":
{
"input": 4,
"emit": 4,
"reduce": 1,
"output": 3
},
"ok": 1
}
}

The following is the data of My_Coll collection:

db.My_Coll.find().pretty()

Output

{
"_id": 21,
"value": 25000
}
{
"_id": 22,
"value": 27500
}
{
"_id": 25,
"value": 25000
}

Today, we read the aggregation operation in MongoDB. Aggregation is an important part of MongoDB that computes the records and filters out the results.

Thanks for reading this article!

Next Article >> MongoDB Replication (Day 15)