This blog post will give a basic introduction to MongoDB's aggregation framework.
What is Aggregation in MongoDB:
Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the mongod instance simplifies application code and limits resource requirements.
Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents.
This is similar to the GROUP BY clause in relational databases (RDBMS).
We use the aggregation framework in MongoDB to compute values such as the sum, average (mean), minimum, and maximum of the numeric fields in the documents of a collection. MongoDB 7.0 and later can also compute the median.
Basic Syntax of aggregation:
The general form of an aggregation call is this:
db.collection.aggregate([ { <stage 1> }, { <stage 2> }, ... ])
Here, collection is the name of the collection. We call its aggregate() method and pass an array of stages, each of which performs one aggregation operation.
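As a minimal sketch of this syntax, the snippet below builds a one-stage pipeline. The collection name orders and the field status are made up for illustration and do not come from any real data set. The pipeline itself is plain JavaScript data, so it can be defined first and then passed to aggregate() in the mongo shell.

```javascript
// A pipeline is an array of stage documents, applied in order.
// "orders" and "status" are hypothetical names used only for this example.
const shippedOrders = [
  { $match: { status: "shipped" } } // keep only documents whose status is "shipped"
];

// In the mongo shell, the pipeline would be run like this:
//   db.orders.aggregate(shippedOrders)
```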
The following are the most commonly used aggregation operations (pipeline stages) in MongoDB:
- $project - Reshapes each document, for example by including, excluding, renaming, or computing fields, and can promote nested fields to the top level. Each input document maps to exactly one output document (1:1 mapping). Use it to pass along only the fields that are needed in the next stage of the pipeline, or to limit the final output to the required fields.
- $match - Filters the documents so that only those matching the specified condition(s) pass to the next pipeline stage. Put $match as early in the pipeline as possible to minimize the amount of processing in later stages.
- $group - Groups documents by some specified expression and outputs to the next stage a document for each distinct grouping. The output documents contain an _id field which contains the distinct group by key. The output documents can also contain computed fields that hold the values of some accumulator expression grouped by the $group's _id field. $group does not order its output documents.
- $sort - Sorts all input documents and returns them to the pipeline in sorted order.
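To see how the four stages above fit together, here is a small sketch of a pipeline that uses each of them once. The orders collection and its fields (status, customerId, amount) are assumptions made for this example only.

```javascript
// Hypothetical "orders" documents are assumed to look like:
//   { customerId: "c1", status: "shipped", amount: 25 }
const totalsPerCustomer = [
  // 1. Filter early so later stages process fewer documents.
  { $match: { status: "shipped" } },
  // 2. One output document per customerId, with the summed amount.
  { $group: { _id: "$customerId", totalAmount: { $sum: "$amount" } } },
  // 3. Largest totals first ($group itself does not order its output).
  { $sort: { totalAmount: -1 } },
  // 4. Reshape: expose the group key as "customer" and drop _id.
  { $project: { _id: 0, customer: "$_id", totalAmount: 1 } }
];

// In the mongo shell:
//   db.orders.aggregate(totalsPerCustomer)
```

The order matters here: $match comes first to cut down the input, and $sort comes after $group because the grouped totals do not exist before that stage.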
Aggregation Pipeline framework:
The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms them into aggregated results. Each stage transforms the documents as they pass through the pipeline. Pipeline stages do not need to produce one output document for every input document; e.g., some stages may generate new documents or filter out documents. Pipeline stages can appear multiple times in the pipeline.
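The pipeline idea can be imitated in a few lines of plain JavaScript. This is only an analogy to show how documents flow from stage to stage and why the number of documents can change along the way. It is not how MongoDB implements the pipeline, and the sample data and function names are invented for the example.

```javascript
// Invented sample documents.
const docs = [
  { city: "Pune", qty: 5 },
  { city: "Pune", qty: 0 },
  { city: "Delhi", qty: 2 },
  { city: "Delhi", qty: 7 }
];

// Each "stage" is a function from an array of documents to an array of documents.
// Behaves roughly like a $match stage: may output fewer documents than it receives.
const matchInStock = (input) => input.filter((d) => d.qty > 0);

// Behaves roughly like a $group stage with a $sum accumulator:
// one output document per distinct city.
const groupByCity = (input) => {
  const totals = new Map();
  for (const d of input) {
    totals.set(d.city, (totals.get(d.city) || 0) + d.qty);
  }
  return [...totals].map(([city, total]) => ({ _id: city, total }));
};

// Feed the output of each stage into the next one.
const stages = [matchInStock, groupByCity];
const result = stages.reduce((current, stage) => stage(current), docs);
// result: [ { _id: "Pune", total: 5 }, { _id: "Delhi", total: 9 } ]
```

Four documents go in, three survive the filter, and two come out of the grouping step.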
The aggregation pipeline has some limits on memory use and result size:
- Each pipeline stage has a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB produces an error. To process larger data sets, use the allowDiskUse option, which lets aggregation pipeline stages write data to temporary files.
- If a single document containing the complete result set exceeds the 16 MB document size limit, the aggregate command produces an error. This can be avoided by having the command return a cursor or by storing the results in a collection.
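The two workarounds above can be sketched as follows. The allowDiskUse option and the $out stage are part of MongoDB; the collection names orders and customer_totals and the field names are assumptions for this example.

```javascript
// Hypothetical pipeline over an "orders" collection.
const customerTotals = [
  { $group: { _id: "$customerId", totalAmount: { $sum: "$amount" } } },
  { $sort: { totalAmount: -1 } }
];

// Workaround 1: let memory-hungry stages spill to temporary files.
const options = { allowDiskUse: true };
// In the mongo shell:
//   db.orders.aggregate(customerTotals, options)

// Workaround 2: write the results to a collection instead of returning them.
// $out must be the last stage; "customer_totals" is an invented collection name.
const customerTotalsToCollection = [...customerTotals, { $out: "customer_totals" }];
// In the mongo shell:
//   db.orders.aggregate(customerTotalsToCollection, options)
```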
Aggregation Expressions:
The following are the most commonly used aggregation expressions in MongoDB's aggregation framework:
- $sum: Calculates and returns the sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key. This is an accumulator operator used in the $group stage.
- $avg: Returns the average value of the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key. $avg ignores non-numeric values. This is an accumulator operator used in the $group stage.
- $min: Returns the lowest value that results from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator used in the $group stage.
- $max: Returns the highest value that results from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator used in the $group stage.
- $push: Returns an array of all the values that result from applying an expression to each document in a group of documents that share the same group by key. Duplicate values are kept. This is an accumulator operator used in the $group stage.
- $addToSet: Returns an array of all the unique values that result from applying an expression to each document in a group of documents that share the same group by key. The order of the elements in the output array is unspecified. This is an accumulator operator used in the $group stage.
- $first: Returns the value that results from applying an expression to the first document in a group of documents that share the same group by key. The result is only meaningful when the documents are in a defined order, for example after a $sort stage. This is an accumulator operator used in the $group stage.
- $last: Returns the value that results from applying an expression to the last document in a group of documents that share the same group by key. The result is only meaningful when the documents are in a defined order, for example after a $sort stage. This is an accumulator operator used in the $group stage.
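Here is a sketch of how these expressions might appear together in a single $group stage. The orders collection and its fields (customerId, amount, item, orderDate) are assumptions for the example. The $sort stage comes first so that $first and $last have a defined order to work with.

```javascript
// Hypothetical "orders" documents are assumed to look like:
//   { customerId: "c1", item: "pen", amount: 25, orderDate: ISODate("...") }
const orderStats = [
  // Oldest orders first, so $first / $last mean "earliest" / "latest".
  { $sort: { orderDate: 1 } },
  {
    $group: {
      _id: "$customerId",                      // group by key
      totalAmount: { $sum: "$amount" },        // sum of amounts per customer
      averageAmount: { $avg: "$amount" },      // average amount per customer
      smallestOrder: { $min: "$amount" },      // lowest amount
      largestOrder: { $max: "$amount" },       // highest amount
      allItems: { $push: "$item" },            // every item, duplicates kept
      distinctItems: { $addToSet: "$item" },   // unique items, order unspecified
      firstOrderDate: { $first: "$orderDate" },
      lastOrderDate: { $last: "$orderDate" }
    }
  }
];

// In the mongo shell:
//   db.orders.aggregate(orderStats)
```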
Special Pipeline Stage Aggregation Operator - $unwind:
Applies to documents that contain an array field. $unwind deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.
The $unwind stage has the following prototype form:
{ $unwind: <field path> }
To specify a field path, prefix the field name with a dollar sign $ and enclose in quotes.
Behaviors
$unwind has the following behaviors:
- If a value in the field specified by the field path is not an array, db.collection.aggregate() generates an error in MongoDB versions before 3.2. Later versions treat the value as a single-element array.
- If you specify a path for a field that does not exist in an input document, the pipeline ignores the input document and will not output documents for that input document.
- If the field holds an empty array ([]) in an input document, the pipeline ignores the input document and will not output documents for that input document.
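A small example may make this easier to picture. The first part below is the $unwind stage as it would be written for the mongo shell. The second part is a plain JavaScript imitation of the array, missing-field, and empty-array cases, written only for illustration. The inventory collection, the sizes field, and the unwind() helper are all assumptions of this sketch, and the helper does not try to reproduce how MongoDB handles non-array values.

```javascript
// The stage itself: the field path is the field name prefixed with $ and quoted.
const unwindSizes = [{ $unwind: "$sizes" }];
// In the mongo shell:
//   db.inventory.aggregate(unwindSizes)

// Illustrative imitation in plain JavaScript (not MongoDB's implementation).
function unwind(docs, field) {
  return docs.flatMap((doc) => {
    const value = doc[field];
    // Missing field or empty array: no output for this input document.
    if (value === undefined || (Array.isArray(value) && value.length === 0)) {
      return [];
    }
    if (!Array.isArray(value)) {
      throw new Error("non-array values are not covered by this sketch");
    }
    // One output document per element; the array is replaced by the element.
    return value.map((element) => ({ ...doc, [field]: element }));
  });
}

// Invented sample documents.
const inventory = [
  { _id: 1, item: "ABC1", sizes: ["S", "M", "L"] },
  { _id: 2, item: "EFG", sizes: [] },
  { _id: 3, item: "IJK" }
];

const unwound = unwind(inventory, "sizes");
// unwound:
//   { _id: 1, item: "ABC1", sizes: "S" }
//   { _id: 1, item: "ABC1", sizes: "M" }
//   { _id: 1, item: "ABC1", sizes: "L" }
```

Only the first document produces output. The documents with an empty array or without the sizes field are dropped.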
In the next post, I will show how to write queries using aggregation operators and how to compute the sum, average, minimum, and maximum from the documents of a collection.