Aggregation Pipeline In MongoDB

Introduction

Aggregation operations are important in any type of database, whether SQL or NoSQL. To perform an aggregation, MongoDB groups values from multiple documents together and then performs operations on the grouped data to return computed results. SQL, in the same way, uses aggregate functions to return a single value calculated from the values in a column.

MongoDB has three ways to perform aggregation: the aggregation pipeline, the map-reduce function, and the single-purpose aggregation methods.

In this article, we will focus on the aggregation pipeline. I'll try to cover each major part of it using simple examples. We will be writing mongo shell commands to perform aggregation.

Aggregation Pipeline

MongoDB's aggregation framework is based on the concept of data processing pipelines, similar to pipelines in the UNIX world. At the very beginning is the collection. Its documents are piped through a series of stages, each stage transforming the documents it receives, and at the end we get a result set.

A collection can be processed through different stages, such as $project, $match, $group, and $sort. A stage can appear more than once in a pipeline.
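
To make the pipe analogy concrete, here is a minimal plain-JavaScript sketch. It is not how MongoDB is implemented: runPipeline, keepSmart, sortByPrice, and the sample phone documents are all hypothetical names made up for illustration. The only point is that the output of each stage becomes the input of the next.

```javascript
// Hypothetical helper: run documents through an array of stage functions,
// feeding each stage's output into the next one.
function runPipeline(docs, stages) {
  return stages.reduce((current, stage) => stage(current), docs);
}

// Illustrative stages written as plain functions (not MongoDB operators).
const keepSmart = (docs) => docs.filter((d) => d.phone_type === 'smart');
const sortByPrice = (docs) => [...docs].sort((a, b) => a.price - b.price);

// Made-up sample data.
const phones = [
  { brand_name: 'A', phone_type: 'smart', price: 300 },
  { brand_name: 'B', phone_type: 'basic', price: 50 },
  { brand_name: 'C', phone_type: 'smart', price: 200 },
];

const result = runPipeline(phones, [keepSmart, sortByPrice]);
console.log(result.map((d) => d.brand_name)); // [ 'C', 'A' ]
```

Swapping the order of the two functions in the array changes the order in which they run, which is exactly how the order of stages matters in a real pipeline.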

The main stages in a pipeline are:

  • $project: selects and reshapes data
  • $match: filters data
  • $group: aggregates data
  • $sort: sorts data
  • $skip: skips data
  • $limit: limits data
  • $unwind: expands array fields into separate documents

Let's try to visualize the aggregation with an example. Don't worry about the syntax. I will be explaining it soon.
    db.mycollection.aggregate([{
        $match: {
            'phone_type': 'smart'
        }
    }, {
        $group: {
            '_id': '$brand_name',
            total: {
                $sum: '$price'
            }
        }
    }])

Here, the $match stage filters the documents of the collection. Then, in the next stage of the pipeline, the remaining documents get grouped and we get the final result set.

Preparing Dummy Data

To run Mongo Shell commands, we need a database and some dummy records. So, let’s create our database and a collection.
    // Switch to the mydb database (it is created on the first insert)
    use mydb
    dept = ['IT', 'Sales', 'HR', 'Admin'];
    for (i = 0; i < 10; i++) {
        db.mycollection.insertOne({ // mycollection is the collection name
            '_id': i,
            'emp_code': 'emp_' + i,
            'dept_name': dept[Math.round(Math.random() * 3)],
            'experience': Math.round(Math.random() * 10)
        });
    }
The commands above insert ten dummy documents into a collection named mycollection in the mydb database.
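
One detail of the data generation is worth a note. Math.round(Math.random() * 3) picks the first and last department about half as often as the two in the middle, because rounding gives the end values a smaller range. For dummy data this is harmless. If an even spread is wanted, the plain-JavaScript sketch below shows one possible alternative. pickRandom is a hypothetical helper, and the documents are built in memory only to inspect their shape.

```javascript
const dept = ['IT', 'Sales', 'HR', 'Admin'];

// Hypothetical helper: pick an element so that every index is equally likely.
function pickRandom(items) {
  return items[Math.floor(Math.random() * items.length)];
}

// Build ten documents with the same shape as the dummy records.
const docs = [];
for (let i = 0; i < 10; i++) {
  docs.push({
    _id: i,
    emp_code: 'emp_' + i,
    dept_name: pickRandom(dept),
    experience: Math.round(Math.random() * 10),
  });
}
console.log(docs[0]); // values differ from run to run
```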

Syntax
    db.mycollection.aggregate([{
        $match: {
            'phone_type': 'smart'
        }
    }, {
        $group: {
            '_id': '$brand_name',
            total: {
                $sum: '$price'
            }
        }
    }])
The syntax is straightforward. The aggregate function takes an array as its argument, and each element of the array is one stage of the pipeline.

In the above example, we have passed two stages: $match, which filters the documents, and $group, which groups the remaining documents and produces the final result set.
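
As a rough mental model of what these two stages compute, the same logic can be written in plain JavaScript over an in-memory array. This is only a sketch: the phone documents and brand names are made up, and MongoDB does the work on the server rather than with filter and Map.

```javascript
// Made-up sample documents.
const phones = [
  { brand_name: 'Acme', phone_type: 'smart', price: 300 },
  { brand_name: 'Acme', phone_type: 'smart', price: 250 },
  { brand_name: 'Bolt', phone_type: 'basic', price: 40 },
  { brand_name: 'Bolt', phone_type: 'smart', price: 500 },
];

// Stage 1, like $match: keep only documents whose phone_type is 'smart'.
const matched = phones.filter((doc) => doc.phone_type === 'smart');

// Stage 2, like $group with $sum: one output document per brand_name.
const totals = new Map();
for (const doc of matched) {
  totals.set(doc.brand_name, (totals.get(doc.brand_name) || 0) + doc.price);
}
const grouped = [...totals].map(([brand, total]) => ({ _id: brand, total }));

console.log(grouped);
// [ { _id: 'Acme', total: 550 }, { _id: 'Bolt', total: 500 } ]
```

The sketch returns groups in first-seen order. In MongoDB, the order of $group output should not be relied on unless a $sort stage follows.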

Stages Of Pipeline
  1. $project
    In the $project stage, we can add a key, remove a key, or reshape a key. There are also some simple functions that we can use on a key: $toUpper, $toLower, $add, $multiply, etc.

    Let’s use $project to reshape the documents that we have created.
        db.mycollection.aggregate([{
            $project: {
                _id: 0,
                'department': {
                    $toUpper: '$dept_name'
                },
                'new_experience': {
                    $add: ['$experience', 1]
                }
            }
        }])
    In this aggregate query, we are projecting the documents. _id: 0 hides the _id field, which is otherwise included by default. A new key named department is created from the existing dept_name field, converted to upper case.

    Note that the field name dept_name is prefixed with a '$' sign. This tells MongoDB that the string refers to a field of the input document rather than being a literal value. Another new field named new_experience is created by using $add to add 1 to the existing experience field. Each output document then contains only the department and new_experience fields.
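
For intuition, the projection above behaves roughly like a map over the documents. The following plain-JavaScript sketch uses two made-up employee documents. It is an analogy, not MongoDB's implementation.

```javascript
// Made-up documents with the same fields as the dummy data.
const employees = [
  { _id: 0, emp_code: 'emp_0', dept_name: 'Sales', experience: 4 },
  { _id: 1, emp_code: 'emp_1', dept_name: 'hr', experience: 9 },
];

// Only the keys listed here survive, which mirrors _id: 0 plus two computed fields.
const projected = employees.map((doc) => ({
  department: doc.dept_name.toUpperCase(), // like $toUpper: '$dept_name'
  new_experience: doc.experience + 1,      // like $add: ['$experience', 1]
}));

console.log(projected);
// [ { department: 'SALES', new_experience: 5 },
//   { department: 'HR', new_experience: 10 } ]
```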

  2. $match
    It works like the WHERE clause in SQL to filter out records. We might want to match for two reasons: to aggregate only a portion of the documents, or to search for particular parts of the result set after we do the grouping. Let's say that, in our collection, we want to aggregate only the documents whose department is Sales. The query will be:
        db.mycollection.aggregate([{
            $match: {
                dept_name: 'Sales'
            }
        }])
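
A simple equality $match can be pictured as a filter. The sketch below is plain JavaScript with a hypothetical matches helper that handles equality conditions only. The real $match accepts the full MongoDB query syntax, which this sketch does not attempt to cover.

```javascript
// Made-up documents.
const employees = [
  { emp_code: 'emp_0', dept_name: 'Sales', experience: 4 },
  { emp_code: 'emp_1', dept_name: 'HR', experience: 9 },
  { emp_code: 'emp_2', dept_name: 'Sales', experience: 1 },
];

// Hypothetical helper: true when every key in `query` equals the document's value.
function matches(doc, query) {
  return Object.keys(query).every((key) => doc[key] === query[key]);
}

const sales = employees.filter((doc) => matches(doc, { dept_name: 'Sales' }));
console.log(sales.map((doc) => doc.emp_code)); // [ 'emp_0', 'emp_2' ]
```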

  3. $group
    As the name suggests, $group groups the documents based on some key. Let's say we want to group employees by their department name and find the number of employees in each department.
        db.mycollection.aggregate([{
            $group: {
                _id: '$dept_name',
                no_of_employees: {
                    $sum: 1
                }
            }
        }])
    Here, _id is the key for grouping. I have created a new key named no_of_employees and used $sum with a value of 1 to count the documents in each group.

    Let's improve this query so that the grouping key has a descriptive name in the output.
        db.mycollection.aggregate([{
            $group: {
                _id: {
                    'department': '$dept_name'
                },
                no_of_employees: {
                    $sum: 1
                }
            }
        }])
    Let's say we want to group documents on more than one key. All we need to do is specify all of the keys in the _id field.
        db.mycollection.aggregate([{
            $group: {
                _id: {
                    'department': '$dept_name',
                    'year_of_experience': '$experience'
                },
                no_of_employees: {
                    $sum: 1
                }
            }
        }])
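
Grouping on a compound key can be sketched in plain JavaScript as follows. The sample documents are made up, and serializing the key with JSON.stringify is just one convenient way to bucket equal keys in this sketch. It says nothing about how MongoDB compares keys internally.

```javascript
// Made-up documents.
const employees = [
  { dept_name: 'IT', experience: 3 },
  { dept_name: 'IT', experience: 3 },
  { dept_name: 'IT', experience: 7 },
  { dept_name: 'HR', experience: 3 },
];

const counts = new Map();
for (const doc of employees) {
  // The compound _id, serialized so that equal keys land in the same bucket.
  const id = { department: doc.dept_name, year_of_experience: doc.experience };
  const key = JSON.stringify(id);
  const group = counts.get(key) || { _id: id, no_of_employees: 0 };
  group.no_of_employees += 1; // like $sum: 1
  counts.set(key, group);
}

const grouped = [...counts.values()];
console.log(grouped);
```

Two documents fall into the same group only when both the department and the experience match, so the four sample documents produce three groups.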

  4. $sort
    $sort sorts the data in ascending or descending order, as needed. Let's say we want to find the number of employees in each department and list the departments in ascending order of name.
        db.mycollection.aggregate([{
            $group: {
                _id: '$dept_name',
                no_of_employees: {
                    $sum: 1
                }
            }
        }, {
            $sort: {
                _id: 1
            }
        }])
    For descending order, use -1. Here, in $sort, I have used the _id field because in the first stage of the aggregation I used $dept_name as the _id of each group.
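
The meaning of 1 and -1 can be illustrated with an ordinary comparator. In this plain-JavaScript sketch, sortBy is a hypothetical helper and the grouped documents are made up. It handles a single sort field only.

```javascript
// Made-up output of a $group stage.
const grouped = [
  { _id: 'Sales', no_of_employees: 3 },
  { _id: 'Admin', no_of_employees: 2 },
  { _id: 'IT', no_of_employees: 5 },
];

// Hypothetical helper: sort by one field; direction is 1 (ascending) or -1 (descending).
function sortBy(docs, field, direction) {
  return [...docs].sort((a, b) => {
    if (a[field] < b[field]) return -direction;
    if (a[field] > b[field]) return direction;
    return 0;
  });
}

console.log(sortBy(grouped, '_id', 1).map((d) => d._id));  // [ 'Admin', 'IT', 'Sales' ]
console.log(sortBy(grouped, '_id', -1).map((d) => d._id)); // [ 'Sales', 'IT', 'Admin' ]
```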

  5. $skip and $limit
    $skip and $limit work just like skip and limit do in a simple find. It rarely makes sense to skip and limit unless we sort first; otherwise, the order of the documents, and so the result, is not defined.

    We first skip documents and then limit the ones that remain.

    Here is an example.
        db.mycollection.aggregate([{
            $group: {
                _id: '$dept_name',
                no_of_employees: {
                    $sum: 1
                }
            }
        }, {
            $sort: {
                _id: 1
            }
        }, {
            $skip: 2
        }, {
            $limit: 1
        }])
    Documents are grouped, then sorted. After that, we skip two documents and limit the result to only one.
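
On an already sorted array, skipping and limiting amount to taking a slice. The plain-JavaScript sketch below, with made-up grouped documents, also shows why the order of the two stages matters.

```javascript
// Made-up output of the $group and $sort stages.
const sorted = [
  { _id: 'Admin', no_of_employees: 2 },
  { _id: 'HR', no_of_employees: 1 },
  { _id: 'IT', no_of_employees: 4 },
  { _id: 'Sales', no_of_employees: 3 },
];

const skip = 2;
const limit = 1;

// Like $skip followed by $limit: drop two documents, keep the next one.
const page = sorted.slice(skip).slice(0, limit);
console.log(page); // [ { _id: 'IT', no_of_employees: 4 } ]

// Reversing the order: keep one document, then drop two, leaves nothing.
const reversed = sorted.slice(0, limit).slice(skip);
console.log(reversed); // []
```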

  6. $first and $last
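    Now that we know how sorting works in the aggregation pipeline, we can learn about $first and $last. These are accumulators used inside $group: they return the first and the last value in each group, in the order in which the pipeline processes the documents. The result is only predictable when the documents reach $group in a defined order, for example after a $sort stage.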
        db.mycollection.aggregate([{
            $group: {
                _id: '$dept_name',
                no_of_employees: {
                    $sum: 1
                },
                first_record: {
                    $first: '$emp_code'
                }
            }
        }])
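
To see why ordering matters for $first and $last, here is a plain-JavaScript sketch that sorts made-up documents by experience and then records the first and last emp_code seen in each department. The last_record key is an illustrative addition that does not appear in the query above.

```javascript
// Made-up documents.
const employees = [
  { emp_code: 'emp_3', dept_name: 'IT', experience: 8 },
  { emp_code: 'emp_1', dept_name: 'HR', experience: 2 },
  { emp_code: 'emp_7', dept_name: 'IT', experience: 1 },
  { emp_code: 'emp_5', dept_name: 'IT', experience: 5 },
];

// Sort first so that "first" and "last" have a defined meaning.
const ordered = [...employees].sort((a, b) => a.experience - b.experience);

const groups = new Map();
for (const doc of ordered) {
  const group = groups.get(doc.dept_name);
  if (!group) {
    // First document seen for this department: it is both first and last so far.
    groups.set(doc.dept_name, {
      _id: doc.dept_name,
      first_record: doc.emp_code,
      last_record: doc.emp_code,
    });
  } else {
    group.last_record = doc.emp_code; // keeps being overwritten until the end
  }
}

const firstLast = [...groups.values()];
console.log(firstLast);
// [ { _id: 'IT', first_record: 'emp_7', last_record: 'emp_3' },
//   { _id: 'HR', first_record: 'emp_1', last_record: 'emp_1' } ]
```

With this ordering, first_record is the least experienced employee of each department and last_record the most experienced one.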

  7. $unwind
    In MongoDB, documents can contain arrays, and it is not easy to group on something within an array. $unwind takes each document apart: it outputs one copy of the document for every element of the array, with the array field replaced by that single element. This lets us do grouping calculations on the array values.

    Let's say we have a document like this.
        {
            a: somedata,
            b: someotherdata,
            c: [arr1, arr2, arr3]
        }

    After $unwind on 'c', we will get three documents.
        {
            a: somedata,
            b: someotherdata,
            c: arr1
        }
        {
            a: somedata,
            b: someotherdata,
            c: arr2
        }
        {
            a: somedata,
            b: someotherdata,
            c: arr3
        }
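
The same transformation can be sketched in plain JavaScript with flatMap. The unwind function here is a hypothetical helper that assumes the named field is always an array. The string values stand in for somedata, arr1, and so on.

```javascript
const doc = { a: 'somedata', b: 'someotherdata', c: ['arr1', 'arr2', 'arr3'] };

// Hypothetical helper: one output document per element of the array in `field`.
// Assumes every document has an array in that field.
function unwind(docs, field) {
  return docs.flatMap((d) =>
    d[field].map((element) => ({ ...d, [field]: element }))
  );
}

const unwound = unwind([doc], 'c');
console.log(unwound);
// [ { a: 'somedata', b: 'someotherdata', c: 'arr1' },
//   { a: 'somedata', b: 'someotherdata', c: 'arr2' },
//   { a: 'somedata', b: 'someotherdata', c: 'arr3' } ]
```
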
  8. Aggregation Expressions

    Let's look at some expressions that are very common in SQL and have an equivalent in MongoDB.

    1. $sum adds up values for each group. We have already seen an example of it.
    2. $avg works just like $sum, except that it calculates the average for each group.
    3. $min finds the minimum value in each group.
    4. $max finds the maximum value in each group.
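
As a rough illustration of what these four expressions compute per group, here is a plain-JavaScript sketch over made-up documents. The output key names total, average, minimum, and maximum are arbitrary choices for this example.

```javascript
// Made-up documents.
const employees = [
  { dept_name: 'IT', experience: 2 },
  { dept_name: 'IT', experience: 8 },
  { dept_name: 'HR', experience: 5 },
];

// Collect the experience values of each department.
const byDept = new Map();
for (const { dept_name, experience } of employees) {
  const values = byDept.get(dept_name) || [];
  values.push(experience);
  byDept.set(dept_name, values);
}

const stats = [...byDept].map(([dept, values]) => {
  const sum = values.reduce((total, v) => total + v, 0);
  return {
    _id: dept,
    total: sum,                   // like $sum: '$experience'
    average: sum / values.length, // like $avg: '$experience'
    minimum: Math.min(...values), // like $min: '$experience'
    maximum: Math.max(...values), // like $max: '$experience'
  };
});

console.log(stats);
// [ { _id: 'IT', total: 10, average: 5, minimum: 2, maximum: 8 },
//   { _id: 'HR', total: 5, average: 5, minimum: 5, maximum: 5 } ]
```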

Further Reading

The following links are useful for learning more about aggregation in MongoDB.

https://docs.mongodb.com/manual/aggregation/
https://docs.mongodb.com/v3.0/applications/aggregation/
https://docs.mongodb.com/v3.2/reference/sql-aggregation-comparison/


Conclusion

I have not covered every topic in aggregation, but this article should help you get started with aggregation in MongoDB, both in your projects and for your own learning. I have attached the mongo shell commands for your reference.