Saturday, October 18, 2014

MongoDB: Aggregation Framework


This blog post will give a basic introduction to MongoDB's aggregation framework.

What is Aggregation in MongoDB:
Aggregations are operations that process data records and return computed results. MongoDB provides a rich set of aggregation operations that examine and perform calculations on the data sets. Running data aggregation on the mongod instance simplifies application code and limits resource requirements.
Like queries, aggregation operations in MongoDB use collections of documents as an input and return results in the form of one or more documents.
This is similar to the GROUP BY clause in the RDBMS world.
We use the aggregation framework in MongoDB to compute values such as the sum, average, minimum, or maximum of the numeric data in the documents of a collection.
Basic Syntax of aggregation:
The general form of an aggregation call is:
                           db.collection.aggregate([ { <stage> }, ... ])
Here, we call the aggregate() method on the collection and pass it an array of one or more pipeline stages, each of which performs an aggregation operation.
The following are the most commonly used aggregation stages in MongoDB:
  • $project - Reshapes each document in the stream, for example by adding new fields or removing existing ones. It outputs one document for each input document (1:1 mapping). Use it to pass along only the fields that are needed in the next stage of the pipeline, or to limit the final output to the required fields.
  • $match - Filters the documents so that only those matching the specified condition(s) pass to the next pipeline stage. Put $match as early in the pipeline as possible to minimize the amount of processing further down the pipe.
  • $group - Groups documents by a specified expression and outputs to the next stage one document for each distinct grouping. The output documents contain an _id field that holds the distinct group-by key. They can also contain computed fields that hold the values of accumulator expressions evaluated for each group. $group does not order its output documents.
  • $sort - Sorts all input documents and returns them to the pipeline in sorted order.
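
To make the four stages concrete, here is a minimal sketch of a pipeline written as plain Python data, the form a driver such as pymongo accepts. The field names score and avg_score and the helper build_pipeline are assumptions made for illustration; only student_id comes from the grades collection discussed in this blog.

```python
def build_pipeline(min_score):
    """Return a list of aggregation stages as plain dicts (no server needed)."""
    return [
        # $match first, so later stages process fewer documents
        {"$match": {"score": {"$gte": min_score}}},
        # one output document per distinct student_id
        {"$group": {"_id": "$student_id", "avg_score": {"$avg": "$score"}}},
        # $group does not order its output, so sort explicitly
        {"$sort": {"avg_score": -1}},
        # keep only the fields we want in the final output
        {"$project": {"_id": 0, "student_id": "$_id", "avg_score": 1}},
    ]


pipeline = build_pipeline(50)
for stage in pipeline:
    print(stage)
```

With pymongo and a running server, a list like this would typically be passed as db.grades.aggregate(pipeline).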

    Aggregation Pipeline framework:

   The aggregation pipeline is a framework for data aggregation modeled on the concept of data processing pipelines. Documents enter a multi-stage pipeline that transforms the
   documents into aggregated results.

   The MongoDB aggregation pipeline consists of stages. Each stage transforms the 
    documents as they pass through the pipeline. Pipeline stages do not need to produce
   one output document for every input document; e.g., some stages may generate new
   documents or filter out documents. Pipeline stages can appear multiple times in the 
   pipeline.
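
   The idea of stages can be sketched in plain Python, without a MongoDB server. This is only an imitation of the concept, not MongoDB's implementation; the helper names and the sample documents are made up for illustration.

```python
def match_stage(predicate):
    """Build a stage that filters documents (fewer outputs than inputs)."""
    def stage(docs):
        return [d for d in docs if predicate(d)]
    return stage


def group_sum_stage(key, field):
    """Build a stage that emits one new document per distinct key."""
    def stage(docs):
        totals = {}
        for d in docs:
            totals[d[key]] = totals.get(d[key], 0) + d[field]
        return [{"_id": k, "total": v} for k, v in totals.items()]
    return stage


def run_pipeline(docs, stages):
    """Feed the output of each stage into the next one."""
    for stage in stages:
        docs = stage(docs)
    return docs


docs = [
    {"student_id": 0, "score": 40},
    {"student_id": 0, "score": 60},
    {"student_id": 1, "score": 90},
    {"student_id": 1, "score": 10},
]
result = run_pipeline(
    docs,
    [match_stage(lambda d: d["score"] >= 40), group_sum_stage("student_id", "score")],
)
print(result)
```

   Four documents go in, the filter drops one, and the grouping stage produces two new documents, which shows that a stage need not emit one document per input.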

   The aggregation pipeline has some limitations on value types and result size. These limits
   are:
  
  • Each pipeline stage has a limit of 100 megabytes of RAM. If a stage exceeds this limit, MongoDB produces an error. To handle large data sets, use the allowDiskUse option, which enables aggregation pipeline stages to write data to temporary files.
  • If the aggregate command returns its complete result set as a single document and that document exceeds the 16 MB size limit, the command produces an error. This can be avoided by having the command return a cursor or store the results in a collection.
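
  A rough sketch of how both workarounds might be prepared from Python follows. The helper prepare_large_aggregation and the collection name grade_totals are hypothetical; allowDiskUse and the $out stage are the MongoDB features referred to above.

```python
def prepare_large_aggregation(pipeline, out_collection=None):
    """Return (stages, options) for an aggregation that may be large.

    Hypothetical helper: it only builds data and does not talk to a server.
    """
    stages = list(pipeline)  # copy, so the caller's list is not modified
    if out_collection is not None:
        # $out must be the last stage; it writes the results to a collection
        stages.append({"$out": out_collection})
    # lets stages spill to temporary files instead of failing at the RAM limit
    options = {"allowDiskUse": True}
    return stages, options


stages, options = prepare_large_aggregation(
    [{"$group": {"_id": "$student_id", "total": {"$sum": "$score"}}}],
    out_collection="grade_totals",
)
print(stages)
print(options)
```

  With pymongo, these would typically be used as db.grades.aggregate(stages, **options).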

  Aggregation Expressions:

  The following are the most commonly used aggregation expressions in MongoDB's aggregation framework:

  • $sum: Calculates and returns the sum of all the numeric values that result from applying a specified expression to each document in a group of documents that share the same group-by key. This is an accumulator operator and is available in the $group stage only.
  • $avg: Returns the average of the numeric values that result from applying a specified expression to each document in a group of documents that share the same group-by key. $avg ignores non-numeric values. This is an accumulator operator and is available in the $group stage only.
  • $min: Returns the lowest value that results from applying an expression to each document in a group of documents that share the same group-by key. This is an accumulator operator and is available in the $group stage only.
  • $max: Returns the highest value that results from applying an expression to each document in a group of documents that share the same group-by key. This is an accumulator operator and is available in the $group stage only.
  • $push: Returns an array of all the values that result from applying an expression to each document in a group of documents that share the same group-by key. Duplicate values are kept. Available only in the $group stage.
  • $addToSet: Returns an array of all the unique values that result from applying an expression to each document in a group of documents that share the same group-by key. The order of the elements in the output array is unspecified. Available only in the $group stage.
  • $first: Returns the value that results from applying an expression to the first document in a group of documents that share the same group-by key. The result is only meaningful when the documents are in a defined order, for example after a $sort stage. Available only in the $group stage.
  • $last: Returns the value that results from applying an expression to the last document in a group of documents that share the same group-by key. The result is only meaningful when the documents are in a defined order, for example after a $sort stage. Available only in the $group stage.
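
  To see what each accumulator yields for one group, here is a small plain-Python imitation. The function accumulate and the sample scores are invented for illustration, and the sketch assumes a non-empty list of numeric values; in MongoDB the order of the $addToSet result is unspecified, whereas this sketch happens to keep first-seen order.

```python
def accumulate(values):
    """Imitate the $group accumulators for the values of a single group."""
    unique = []
    for v in values:
        if v not in unique:  # keep each value once, like $addToSet
            unique.append(v)
    return {
        "sum": sum(values),
        "avg": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
        "push": list(values),   # every value, duplicates included
        "addToSet": unique,     # unique values only
        "first": values[0],     # depends on the incoming order
        "last": values[-1],     # depends on the incoming order
    }


scores = [70, 90, 70, 50]  # scores of one hypothetical student
summary = accumulate(scores)
for name, value in summary.items():
    print(name, value)
```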

   Special Pipeline Stage Aggregation Operator -  $unwind:


Applies to documents that contain an array field. $unwind deconstructs an array field from the input documents to output one document for each element. Each output document is the input document with the value of the array field replaced by the element.

The $unwind stage has the following prototype form:

                                         { $unwind: <field path> }
To specify a field path, prefix the field name with a dollar sign $ and enclose in quotes.

Behaviors

$unwind has the following behaviors:

  • If a value in the field specified by the field path is not an array, db.collection.aggregate() generates an error.

  • If you specify a path for a field that does not exist in an input document, the pipeline ignores the input document and will not output documents for that input document.

  • If the field holds an empty array ([]) in an input document, the pipeline ignores the input document and will not output documents for that input document.
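
The three behaviors listed above can be imitated in a few lines of plain Python. This is a sketch of the semantics as described in this post, not MongoDB's own code, and newer MongoDB releases may treat non-array values differently. The function unwind and the tags field are made up for illustration.

```python
def unwind(docs, field):
    """Imitate $unwind: one output document per array element."""
    out = []
    for doc in docs:
        if field not in doc:
            continue  # missing field: the document is dropped
        value = doc[field]
        if not isinstance(value, list):
            # non-array value: reported as an error, as described above
            raise TypeError("field %r is not an array" % field)
        for element in value:  # an empty array yields no output
            out.append({**doc, field: element})
    return out


docs = [
    {"_id": 1, "tags": ["a", "b"]},
    {"_id": 2},               # no tags field
    {"_id": 3, "tags": []},   # empty array
]
unwound = unwind(docs, "tags")
print(unwound)
```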

In the next post, I will show how to write queries using aggregation operators and how to compute the sum, average, minimum, and maximum over the documents of a collection.










Friday, October 17, 2014

MongoDB Installation and running


This post covers installing MongoDB, creating a config file, and starting MongoDB.

To install MongoDB, go to the following URL and select the appropriate installation file for the operating system you are using:

http://docs.mongodb.org/manual/installation/

In my case, I am using 64-bit Windows 8, but my MongoDB build is 32-bit.

In my experience, drivers written in languages such as Python worked well with the 32-bit build, and I ran into many issues with 64-bit Python, so I used the 32-bit installation of MongoDB together with Python 2.x rather than 3.x. Note that 32-bit builds of MongoDB are limited to about 2 GB of data, so they are suitable only for development and testing.

Once the download is complete, run the setup and install MongoDB in a directory of your choice. I chose the C: drive.

Starting MongoDB:

MongoDB can be started as a Windows service. But before that, we need to create the data directory (dbpath), the config file, and the log file.

In fact, the best way is to set dbpath and logpath in the config file and pass the config file when starting the mongod instance, as explained here.



Create a config file named mongodb.conf and put it in the folder C:\wamp\bin\mongodb\mongodb_win32\conf, which might be different on your machine. Just make sure that the conf file sits inside the MongoDB installation folder, the one that contains the \bin directory. Also, take care with the path separator: on Windows it is '\' and on Unix it is '/'.

Add the following lines to mongodb.conf:

# mongodb.conf

# data lives here
dbpath = C:\wamp\bin\mongodb\mongodb_win32\data\db\

# where to log
logpath = C:\wamp\bin\mongodb\mongodb_win32\logs\mongodb.log
logappend=true

# only run on localhost for development
bind_ip = 127.0.0.1                                                             

port = 27017
rest = true

To run the mongod instance from cmd, open the command prompt in administrator mode; on Linux and Unix, use superuser mode.

Now, open the command prompt as administrator and start mongod, passing it the config file with the --config option.




The mongod instance is up and running.

Now, open another command prompt as administrator and run the mongo command to start the shell.


Here, we have used the mongo command to connect to the server, and we are connected to the test database.


By default, mongo looks for a database server listening on port 27017 on the localhost interface. To connect to a server on a different port or interface, use the --port and --host options.
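
From Python, the same host and port choice is usually expressed as a connection string. The helper below is a hypothetical sketch that only builds the string; it does not connect to anything.

```python
def mongo_uri(host="localhost", port=27017):
    """Build a MongoDB connection string for the given host and port."""
    return "mongodb://{}:{}/".format(host, port)


print(mongo_uri())                    # default: local server, port 27017
print(mongo_uri("127.0.0.1", 27018))  # a server on a different port
```

With pymongo installed, such a string would typically be passed to MongoClient.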

The mongo shell session uses the test database by default. At any time, issue the 'db' command in the mongo shell to report the name of the current database.

To get the list of databases, use 'show dbs'.



To switch to a database, issue the 'use db' command, where db is the name of the database, for example 'use students'.



Now we are connected to the students database.

MongoDB stores JSON-like documents. A database holds a set of collections. A collection holds a set of documents. A document is a set of key-value pairs.

To list the collections of a database, issue the 'show collections' command.

The collection present in the database 'students' is 'grades'. We will use this collection for all the operations on the database.

Some basic operations using MongoDB queries, such as finding data inside a collection and counting the number of documents in a collection:

To find the data, we use the operation db.collection.find()

For our example, we will use db.grades.find()




The command returns documents made up of key:value pairs; for example, "student_id" is a key whose values are 0, 1, 2, and so on.

To find the number of documents in a collection, use db.collection.count()




The number of documents in the grades collection is 600.

In the next post, I will show some more operations, such as finding a specific document and using the aggregation framework to sum and average data. I will also show how to write a file using pymongo, a MongoDB driver for Python.
