Saturday, October 18, 2014

MongoDB: Aggregation Framework


This blog post will give a basic introduction to MongoDB's aggregation framework.

What is Aggregation in MongoDB:
Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the mongod instance simplifies application code and limits resource requirements.
Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents.
This is similar to the GROUP BY clause in the RDBMS world.
We use the aggregation framework in MongoDB to compute values such as the sum, average, minimum, and maximum of numeric fields across the documents of a collection.
Basic Syntax of aggregation:
The general form of an aggregation call is:
                           db.collection.aggregate([ { <stage> }, ... ])
Here, we call the aggregate() method on the collection and pass it an array of one or more stage documents, each describing one aggregation operation.
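As a minimal sketch of that call shape, the snippet below builds a one-stage pipeline as a plain JavaScript array. The collection name `orders` and the field `status` are hypothetical, and the guard around `db` is only there so the snippet can also be loaded outside the mongo shell.

```javascript
// Hypothetical example: the "orders" collection and "status" field are assumptions.
// A pipeline is an array of stage documents, each holding exactly one stage operator.
const pipeline = [
  { $match: { status: "shipped" } }
];

// Inside the mongo shell, `db` refers to the current database.
// Outside the shell `db` is undefined, so the call is skipped.
if (typeof db !== "undefined") {
  db.orders.aggregate(pipeline).forEach(function (doc) {
    printjson(doc);
  });
}
```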
The following are the most commonly used aggregation operations in MongoDB:
  • $project - Reshapes each document, for example by including, excluding, or renaming fields, or by promoting nested fields to the top level. It produces one output document per input document (1:1 mapping). Use it to pass along only the fields needed in the next stage of the pipeline, or to trim the final output to the fields that are required.
  • $match - Filters the documents so that only those matching the specified condition(s) pass to the next pipeline stage. Put the $match criteria at the beginning of the pipeline to minimize the amount of processing further down the pipe.
  • $group - Groups documents by a specified expression and outputs one document per distinct group to the next stage. Each output document contains an _id field holding the distinct group-by key. The output documents can also contain computed fields that hold the values of accumulator expressions, grouped by the $group's _id field. $group does not order its output documents.
  • $sort - Sorts all input documents and returns them to the pipeline in sorted order.
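The four stages above can be chained in one pipeline. The following is a sketch only: the `orders` collection and the `status`, `customerId`, and `amount` fields are hypothetical names chosen for illustration.

```javascript
// Hypothetical schema: orders { customerId, status, amount }.
const pipeline = [
  // 1. Filter early so later stages process fewer documents.
  { $match: { status: "shipped" } },
  // 2. One output document per customer, with the summed amount.
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  // 3. $group does not order its output, so sort explicitly (descending).
  { $sort: { total: -1 } },
  // 4. Reshape: expose the group key as "customer" and hide _id.
  { $project: { _id: 0, customer: "$_id", total: 1 } }
];

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.orders.aggregate(pipeline).forEach(function (doc) {
    printjson(doc);
  });
}
```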

    Aggregation Pipeline framework:

   The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms them into aggregated results.

   The MongoDB aggregation pipeline consists of stages. Each stage transforms the 
    documents as they pass through the pipeline. Pipeline stages do not need to produce
   one output document for every input document; e.g., some stages may generate new
   documents or filter out documents. Pipeline stages can appear multiple times in the 
   pipeline.

   The aggregation pipeline has some limitations on memory use and result size:
  
  • Each pipeline stage is limited to 100 megabytes of RAM. If a stage exceeds this limit, MongoDB produces an error. To handle larger data sets, set the allowDiskUse option, which lets pipeline stages write data to temporary files.
  • If the aggregate command returns the complete result set as a single document and that document exceeds 16 MB, the command produces an error. To avoid this, have the command return a cursor or store the results in a collection.
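Both workarounds can be sketched as follows. The `orders` collection, its fields, and the output collection name `customer_totals` are assumptions made for this example; `allowDiskUse` is passed in the options document of aggregate(), and the `$out` stage is one way to store results in a collection.

```javascript
// Hypothetical schema: orders { customerId, amount }.
const pipeline = [
  { $group: { _id: "$customerId", total: { $sum: "$amount" } } },
  { $sort: { total: -1 } }
];

// Let memory-hungry stages spill to temporary files on disk.
const options = { allowDiskUse: true };

// Variant that writes the results to a collection instead of returning them.
// "customer_totals" is a hypothetical output collection name.
const pipelineToCollection = pipeline.concat([{ $out: "customer_totals" }]);

// Runs only inside the mongo shell, where `db` is defined.
if (typeof db !== "undefined") {
  db.orders.aggregate(pipeline, options);             // results come back as a cursor
  db.orders.aggregate(pipelineToCollection, options); // results are stored in customer_totals
}
```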

  Aggregation Expressions:

  The following are the most commonly used aggregation expressions in MongoDB's aggregation framework:

  • $sum: Calculates and returns the sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.
  • $avg: Returns the average of the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key. $avg ignores non-numeric values. This is an accumulator operator and is available in the $group stage only.
  • $min: Returns the lowest value that results from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.
  • $max: Returns the highest value that results from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.
  • $push: Returns an array of all values that result from applying an expression to each document in a group of documents that share the same group by key. Duplicate values are kept. Available only in the $group stage.
  • $addToSet: Returns an array of all unique values that result from applying an expression to each document in a group of documents that share the same group by key. The order of the elements in the output array is unspecified. Available only in the $group stage.
  • $first: Returns the value that results from applying an expression to the first document in a group of documents that share the same group by key. The result is only meaningful when the documents are in a defined order, so sort them before the $group stage. Available only in the $group stage.
  • $last: Returns the value that results from applying an expression to the last document in a group of documents that share the same group by key. The result is only meaningful when the documents are in a defined order, so sort them before the $group stage. Available only in the $group stage.
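To make the accumulators concrete, here is a small sketch that pairs a $group stage with a plain-JavaScript emulation of what it computes. The sample documents and the `item` and `qty` field names are hypothetical, and `groupByItem` is an illustrative helper, not MongoDB's implementation; it assumes every `qty` is numeric.

```javascript
// Hypothetical sample documents, already in the order they reach $group.
const docs = [
  { item: "pen", qty: 5 },
  { item: "pen", qty: 15 },
  { item: "ink", qty: 10 },
  { item: "pen", qty: 5 }
];

// The $group stage whose result the helper below emulates.
const groupStage = {
  $group: {
    _id: "$item",
    total: { $sum: "$qty" },
    average: { $avg: "$qty" },
    lowest: { $min: "$qty" },
    highest: { $max: "$qty" },
    allQty: { $push: "$qty" },
    uniqueQty: { $addToSet: "$qty" },
    firstQty: { $first: "$qty" },
    lastQty: { $last: "$qty" }
  }
};

// Illustrative emulation in plain JavaScript (assumes numeric qty values).
function groupByItem(input) {
  const valuesByKey = {};
  input.forEach(function (doc) {
    (valuesByKey[doc.item] = valuesByKey[doc.item] || []).push(doc.qty);
  });
  return Object.keys(valuesByKey).map(function (key) {
    const values = valuesByKey[key];
    const total = values.reduce(function (a, b) { return a + b; }, 0);
    return {
      _id: key,
      total: total,
      average: total / values.length,
      lowest: Math.min.apply(null, values),
      highest: Math.max.apply(null, values),
      allQty: values, // duplicates kept, like $push
      uniqueQty: values.filter(function (v, i) { return values.indexOf(v) === i; }),
      firstQty: values[0],
      lastQty: values[values.length - 1]
    };
  });
}

const result = groupByItem(docs);
```

For the "pen" group the emulation yields a total of 25, a lowest value of 5, a highest value of 15, three pushed values, and two unique values.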

   Special Pipeline Stage Aggregation Operator - $unwind:


Applies to documents that contain an array field. $unwind deconstructs an array field from the input documents and outputs one document for each element. Each output document is the input document with the value of the array field replaced by that element.

The $unwind stage has the following prototype form:

                                         { $unwind: <field path> }
To specify a field path, prefix the field name with a dollar sign ($) and enclose it in quotes.

Behaviors

$unwind has the following behaviors:

  • If a value in the field specified by the field path is not an array, db.collection.aggregate() generates an error.

  • If you specify a path for a field that does not exist in an input document, the pipeline ignores the input document and will not output documents for that input document.

  • If the field holds an empty array ([]) in an input document, the pipeline ignores the input document and will not output documents for that input document.
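The three behaviors listed above can be illustrated with a small plain-JavaScript emulation. The `unwind` helper, the sample documents, and the `sizes` field are hypothetical and only model the behaviors described here; this is not MongoDB's implementation.

```javascript
// Illustrative emulation of the $unwind behaviors described above.
function unwind(docs, field) {
  const out = [];
  docs.forEach(function (doc) {
    if (!(field in doc)) {
      return; // field missing: the input document produces no output
    }
    const value = doc[field];
    if (!Array.isArray(value)) {
      throw new Error("$unwind: value of '" + field + "' is not an array");
    }
    // An empty array never enters this loop, so it produces no output either.
    value.forEach(function (element) {
      const copy = Object.assign({}, doc); // shallow copy of the input document
      copy[field] = element;               // array replaced by a single element
      out.push(copy);
    });
  });
  return out;
}

// Hypothetical input documents.
const input = [
  { _id: 1, item: "shirt", sizes: ["S", "M", "L"] },
  { _id: 2, item: "hat", sizes: [] },
  { _id: 3, item: "scarf" }
];

// The stage being emulated; note the quoted, $-prefixed field path.
const unwindStage = { $unwind: "$sizes" };

const output = unwind(input, "sizes");
```

Only the document with _id 1 contributes output here: one document each for "S", "M", and "L". The documents with an empty array or no `sizes` field are dropped.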

In the next post, I will show how to write queries using the aggregation operators and how to compute the sum, average, minimum, and maximum over the documents of a collection.









