originalthinker: October 2014

This blog post will give a basic introduction to MongoDB's aggregation framework.

What is Aggregation in MongoDB:

Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the mongod instance simplifies application code and limits resource requirements.

Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents.

This is similar to the GROUP BY clause in the relational (RDBMS) world.

We use the aggregation framework in MongoDB to compute values such as the sum, average (mean), minimum, and maximum of numeric fields in the documents of a collection.

Basic Syntax of aggregation:

The general form of an aggregation call is:

db.collection.aggregate([ { <stage 1> }, { <stage 2> }, ... ])

Here, 'collection' is the name of the collection. We call its 'aggregate' method and pass an array of one or more aggregation operations (stages), which are applied in order.

The following are the most commonly used aggregation operations in MongoDB:

$project - Reshapes each document by including, excluding, renaming, or computing fields. The mapping is 1:1, meaning one output document for each input document. Use it to pass on only the fields that are needed in the next stage of the pipeline, or to restrict the output of the aggregation to the fields that are required.

$match - Filters the documents to pass only the documents that match the specified condition(s) to the next pipeline stage. It is the best tool for optimizing a pipeline: put the $match criteria at the beginning of the pipeline to minimize the amount of processing further down the pipe.

$group - Groups documents by some specified expression and outputs to the next stage a document for each distinct grouping. The output documents contain an _id field which contains the distinct group by key. The output documents can also contain computed fields that hold the values of some accumulator expression grouped by the $group's _id field. $group does not order its output documents.

$sort - Sorts all input documents and returns them to the pipeline in sorted order.
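The four stages above can be combined into a single pipeline. The following is a minimal sketch in Python, assuming the pymongo driver and a hypothetical grades collection whose documents have type, student_id and score fields (these field names are illustrative, not taken from a real schema). Only the pipeline itself is built, so the block runs without a server.

```python
# Each stage is a dict whose single key is the stage operator.
# The field names (type, student_id, score) are assumed for illustration.
pipeline = [
    # Filter early so later stages see fewer documents.
    {"$match": {"type": "exam"}},
    # One output document per distinct student_id.
    {"$group": {"_id": "$student_id", "avg_score": {"$avg": "$score"}}},
    # $group does not order its output, so sort explicitly.
    {"$sort": {"avg_score": -1}},
    # Keep only the computed field (and _id, which is included by default).
    {"$project": {"avg_score": 1}},
]

# Stage operators, in the order the documents pass through them.
stage_names = [next(iter(stage)) for stage in pipeline]
print(stage_names)

# With pymongo and a running mongod, the call would look like:
#   results = db.grades.aggregate(pipeline)
```

The order of the list is the order of execution, which is why $match comes first here.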

Aggregation Pipeline framework:

The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the documents into aggregated results.

The MongoDB aggregation pipeline consists of stages. Each stage transforms the documents as they pass through the pipeline. Pipeline stages do not need to produce one output document for every input document; e.g., some stages may generate new documents or filter out documents. Pipeline stages can appear multiple times in the pipeline.
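To make the idea of stages concrete, here is a rough emulation in plain Python. It is not how MongoDB implements the pipeline; the helper names (match, group_avg, sort_by) and the sample documents are made up for illustration. It only shows that each stage takes documents in and passes documents on, and that the counts need not match.

```python
from collections import defaultdict

def match(docs, predicate):
    """Like $match: pass through only documents that satisfy the predicate."""
    return [d for d in docs if predicate(d)]

def group_avg(docs, key, field):
    """Like $group with $avg: one output document per distinct key value."""
    buckets = defaultdict(list)
    for d in docs:
        buckets[d[key]].append(d[field])
    return [{"_id": k, "avg": sum(v) / len(v)} for k, v in buckets.items()]

def sort_by(docs, field, descending=False):
    """Like $sort: return all documents ordered by one field."""
    return sorted(docs, key=lambda d: d[field], reverse=descending)

# Hypothetical sample data.
grades = [
    {"student_id": 0, "type": "exam", "score": 80},
    {"student_id": 0, "type": "quiz", "score": 50},
    {"student_id": 0, "type": "exam", "score": 90},
    {"student_id": 1, "type": "exam", "score": 70},
]

exams = match(grades, lambda d: d["type"] == "exam")   # 4 docs in, 3 out
averages = group_avg(exams, "student_id", "score")     # 3 docs in, 2 out
result = sort_by(averages, "avg", descending=True)
print(result)
```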

The aggregation pipeline has some limitations on value types and result size. These limits are:

Pipeline stages have a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB will produce an error. This can be avoided with the allowDiskUse option, which enables aggregation pipeline stages to write data to temporary files.

If the single document that contains the complete result set exceeds the 16 MB document size limit, the aggregate command produces an error. This can be avoided by having the command return a cursor or store the results in a collection.
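Both workarounds can be sketched from Python. This assumes pymongo, whose aggregate method accepts allowDiskUse as a keyword option, and uses the $out stage to write results to a collection. The student_totals collection name and the FakeCollection stand-in are hypothetical, and the stand-in exists only so the call shape can be shown without a running mongod.

```python
def run_large_aggregation(collection, pipeline):
    """Run a pipeline with allowDiskUse so stages may spill to temporary files.

    `collection` is assumed to be a pymongo Collection (or anything with a
    compatible aggregate method); nothing here connects to a server.
    """
    return collection.aggregate(pipeline, allowDiskUse=True)

def with_output_collection(pipeline, target):
    """Return a copy of the pipeline with a final $out stage, so results are
    stored in the `target` collection instead of one large result document."""
    return list(pipeline) + [{"$out": target}]

class FakeCollection:
    """Stand-in that records the call; not part of pymongo."""
    def aggregate(self, pipeline, **options):
        self.last_call = (pipeline, options)
        return iter([])

pipeline = [{"$group": {"_id": "$student_id", "total": {"$sum": "$score"}}}]
fake = FakeCollection()
run_large_aggregation(fake, with_output_collection(pipeline, "student_totals"))
print(fake.last_call)
```

With a real collection object in place of the stand-in, the same two helpers would apply unchanged.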

Aggregation Expressions:

The following are the most commonly used aggregation expressions in MongoDB's aggregation framework:

$sum: Calculates and returns the sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.

$avg: Returns the average value of the numeric values that result from applying a specified expression to each document in a group of documents that share the same group by key. $avg ignores non-numeric values. This is an accumulator operator and is available in the $group stage only.

$min: Returns the lowest value that results from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.

$max: Returns the highest value that results from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.

$push: Returns an array of all values that result from applying an expression to each document in a group of documents that share the same group by key. This is an accumulator operator and is available in the $group stage only.

$addToSet: Returns an array of all unique values that result from applying an expression to each document in a group of documents that share the same group by key. The order of the elements in the output array is unspecified. This is an accumulator operator and is available in the $group stage only.

$first: Returns the value that results from applying an expression to the first document in a group of documents that share the same group by key. The result is meaningful only when the documents are in a defined (sorted) order. Available only in the $group stage.

$last: Returns the value that results from applying an expression to the last document in a group of documents that share the same group by key. The result is meaningful only when the documents are in a defined (sorted) order. Available only in the $group stage.
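The accumulators are easier to compare side by side on one group. The sketch below is a plain Python approximation of what each one would compute for the values of a single group, taken in the order the documents reach the $group stage. The accumulate helper and the sample scores are made up for illustration, and it only handles numeric values.

```python
def accumulate(values):
    """Approximate the $group accumulators for one group's numeric values."""
    return {
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "push": list(values),        # every value, duplicates kept
        "addToSet": set(values),     # unique values, order not guaranteed
        "first": values[0],          # depends on the incoming document order
        "last": values[-1],          # depends on the incoming document order
    }

# Hypothetical scores for one student, in arrival order.
scores = [70, 90, 70, 85]
summary = accumulate(scores)
for name, value in summary.items():
    print(name, value)
```

Note how push keeps the repeated 70 while addToSet does not, and how first and last would change if the documents arrived in a different order.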

Special Pipeline Stage Aggregation Operator - $unwind:

Applies to documents that contain an array field. It deconstructs an array field from the input documents to output a document for each element. Each output document is the input document with the value of the array field replaced by the element.

The $unwind stage has the following prototype form:

{ $unwind: <field path> }

To specify a field path, prefix the field name with a dollar sign $ and enclose in quotes.

Behaviors

$unwind has the following behaviors:

If a value in the field specified by the field path is not an array, db.collection.aggregate() generates an error.

If you specify a path for a field that does not exist in an input document, the pipeline ignores the input document and will not output documents for that input document.

If the array holds an empty array ([]) in an input document, the pipeline ignores the input document and will not output documents for that input document.
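The three behaviors listed above can be mimicked in a few lines of plain Python. This is only an emulation for a top-level field, with a made-up unwind helper and sample documents; it is not MongoDB's implementation.

```python
def unwind(docs, field):
    """Emulate the $unwind behaviors described above for a top-level field."""
    out = []
    for doc in docs:
        if field not in doc:
            continue  # missing field: no output for this document
        value = doc[field]
        if not isinstance(value, list):
            raise TypeError("value of %r is not an array" % field)
        for element in value:  # empty array: the loop body never runs
            # Copy the document, replacing the array with one element.
            out.append({**doc, field: element})
    return out

# Hypothetical input documents.
docs = [
    {"_id": 1, "tags": ["a", "b"]},
    {"_id": 2, "tags": []},
    {"_id": 3},
]
unwound = unwind(docs, "tags")
print(unwound)
```

Only the first document produces output, one document per array element; the other two are silently dropped.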

In the next post, I will show how to write queries using aggregation operators and how to compute the sum, average, minimum, and maximum over the documents of a collection.

This post covers the installation of MongoDB, how to create a config file, and how to start MongoDB.

To install MongoDB, go to the following URL and select the appropriate installation file for the operating system you are using:

http://docs.mongodb.org/manual/installation/

In my case, I am using 64-bit Windows 8, but my MongoDB installation is the 32-bit version.

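I advise using the 32-bit installation of MongoDB for this kind of local experimentation, because in my experience the drivers for languages such as Python work very well with the 32-bit version, and I ran into a lot of issues with the 64-bit version of Python. Keep in mind that 32-bit builds of MongoDB can hold only about 2 GB of data, so they are suitable for development and testing only. Also, use Python 2.x, not 3.x.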

Once the download is complete, run the setup and install MongoDB in a directory of your choice. I have chosen the C: drive.

Starting MongoDB:

MongoDB can be started as a Windows service. But before that, we need to create the data directory (dbpath), the config file, and the log file.

In fact, the best way is to set dbpath and logpath in the config file and pass the config file when starting the mongod instance, which I will explain.

Create a config file named mongodb.conf and put it in the folder C:\wamp\bin\mongodb\mongodb_win32\conf (the path may be different on your machine). Just make sure that the config file lives under the directory where MongoDB is installed. Also, take care with the path separator: on Windows it is '\' and on Unix it is '/'.

Add the following lines to mongodb.conf:

# mongodb.conf

# data lives here
dbpath = C:\wamp\bin\mongodb\mongodb_win32\data\db\

# where to log
logpath = C:\wamp\bin\mongodb\mongodb_win32\logs\mongodb.log
logappend=true

# only run on localhost for development
bind_ip = 127.0.0.1

port = 27017
rest = true

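To run the mongod instance from the command prompt (cmd), make sure to open it in administrator mode; on Linux and Unix, use super user mode.

Now, open the command prompt as administrator and start mongod, pointing it at the config file with the --config option (that is, mongod --config followed by the path to mongodb.conf):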

The mongod instance is up and running.

Now, open another command prompt as administrator and run the mongo shell:

Here, we have used the mongo command to connect to the server, and we are connected to the test database.

By default, mongo looks for a database server listening on port 27017 on the localhost interface. To connect to a server on a different port or interface, use the --port and --host options.
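From a driver, the same host and port are usually expressed as a connection URI. A minimal sketch, assuming the standard mongodb:// URI form; the mongo_uri helper is hypothetical, and port 27018 is just an example of a non-default port.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a mongodb:// connection URI from a host and a port.

    The defaults mirror what the mongo shell uses when no options are given.
    """
    return "mongodb://%s:%d/" % (host, port)

default_uri = mongo_uri()
custom_uri = mongo_uri("127.0.0.1", 27018)
print(default_uri)
print(custom_uri)

# With pymongo installed, such a URI could be passed to the client:
#   client = pymongo.MongoClient(default_uri)
```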

The mongo shell session uses the test database by default. At any time, issue 'db' at the mongo prompt to report the name of the current database.

To get the list of databases, use 'show dbs'.

To switch to a database, issue the 'use <db>' command, where <db> is the name of the database, for example 'use students':

Now we are connected to the students database.

MongoDB is a document database that stores JSON-like documents. A database holds a set of collections, a collection holds a set of documents, and a document is a set of key-value pairs.

To list the collections of a database, issue the 'show collections' command:

The collection present in the database 'students' is 'grades'. We will use this collection for all the operations that follow.

Next come some basic operations using MongoDB queries, such as finding data inside a collection and counting the number of documents in a collection:

To find the data, we use the operation db.collection.find()

For our example, we will use db.grades.find()

The documents returned by the command are shown as key:value pairs. For example, "student_id" is a key whose values are 0, 1, 2, and so on.

To find the number of documents in a collection, use db.collection.count()

The number of documents in the grades collection is 600.
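What find() and count() do can be pictured with an in-memory list standing in for the collection. The find and count helpers below are plain Python written for illustration, not the driver's API, and the sample documents (including the type and score fields) are invented. Only simple equality conditions are handled.

```python
def find(docs, query=None):
    """Return the documents whose fields equal every key/value in `query`.

    With no query, every document matches, like db.collection.find().
    """
    query = query or {}
    return [d for d in docs if all(d.get(k) == v for k, v in query.items())]

def count(docs, query=None):
    """Return how many documents match, like db.collection.count()."""
    return len(find(docs, query))

# Hypothetical stand-in for a few documents of the grades collection.
grades = [
    {"student_id": 0, "type": "exam", "score": 54.6},
    {"student_id": 0, "type": "quiz", "score": 31.9},
    {"student_id": 1, "type": "exam", "score": 74.2},
]

print(count(grades))
print(find(grades, {"student_id": 0, "type": "exam"}))
```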

In the next post, I will show some more operations, such as finding a specific document and using the aggregation framework to sum and average data. I will also show how to write a file using pymongo, a MongoDB driver for Python.


Saturday, October 18, 2014

MongoDB: Aggregation Framework

Friday, October 17, 2014

MongoDB Installation and running