NoSQL Inside SQL with Java, Spring, Hibernate, and PostgreSQL

There are many benefits to schema-less NoSQL datastores, but there are always trade-offs. The primary gift the NoSQL movement has given us is the variety of options we now have for data persistence. With NoSQL we no longer must try to shoehorn everything into a relational model. Now the challenge is in deciding which persistence model fits best with each domain in a system and then combining those models in a cohesive way. The general term to describe this is Polyglot Persistence and there are many ways to accomplish it. Let’s walk through how you can combine a regular SQL model with a key-value NoSQL model using Java, Spring, Hibernate, and PostgreSQL.

This article covers the pieces of a simple web application which uses regular SQL and PostgreSQL’s hstore for key value pairs. This method is a mix of NoSQL inside SQL. One benefit of this approach is that the same datastore can be used for both the SQL and the NoSQL data.

In this example the server technologies will be Java, Spring, and Hibernate. (The same thing can also be done with Rails, Django, and many other technologies.) To add Hibernate support for hstore I found a fantastic blog post about “Storing sets of key/value pairs in a single db column with Hibernate using PostgreSQL hstore type”. I won’t go through that code here, but you can find everything in the GitHub repo for my demo project.

This demo app uses Maven to define the dependencies. Embedded Jetty is started via a plain ‘ole Java application that sets up Spring MVC. Spring is configured via Java Config for the main stuff, the web stuff, and the database stuff.

The client technologies will be jQuery and Bootstrap and there is a strict separation between the client and server via RESTful JSON services. The whole client-side is in a plain ‘ole HTML file. Via jQuery / Ajax the client communicates to JSON services exposed via a Spring MVC Controller.

Ok. Now onto the NoSQL inside SQL stuff. This application stores “Contacts” that have a name but also can have many “Contact Methods” (e.g. phone numbers and email addresses). The “Contact Methods” are a good use of a schema-less, key-value pair column because it avoids the cumbersome alternatives: putting that information into a separate table or trying to create a model object that has all of the possible “Contact Methods”. So let’s take a look at the simple Contact entity:

package com.jamesward.model;

import net.backtothefront.HstoreUserType;
import org.hibernate.annotations.Type;
import org.hibernate.annotations.TypeDef;

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.Id;
import java.util.HashMap;
import java.util.Map;

@Entity
@TypeDef(name = "hstore", typeClass = HstoreUserType.class)
public class Contact {

    @Id
    @GeneratedValue
    public Integer id;

    @Column(nullable = false)
    public String name;

    @Type(type = "hstore")
    @Column(columnDefinition = "hstore")
    public Map<String, String> contactMethods = new HashMap<String, String>();
}

If you are familiar with Hibernate / JPA then most of this should look pretty familiar to you. The new / interesting stuff is the contactMethods property. It is a Map<String, String> and it uses PostgreSQL’s hstore datatype. In order for that to work, the type has to be defined and the columnDefinition set. Thanks again to Jakub Głuszecki for putting together the HstoreHelper and HstoreUserType that make this possible.
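Under the hood, a user type like this has two jobs: render the Map as hstore’s literal text format on the way into the database, and parse it back into a Map on the way out. Here is a rough, self-contained sketch of that translation (an illustration only, with no escaping handled; this is not the actual HstoreUserType code):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Rough sketch of what an hstore user type must do: convert between a
// Java Map and hstore's text literal, e.g. {phone=555-1212} <-> "phone"=>"555-1212".
// Illustration only; keys/values containing quotes or commas need escaping.
public class HstoreSketch {

    // Serialize a Map to an hstore literal: "k1"=>"v1", "k2"=>"v2"
    public static String toHstore(Map<String, String> map) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : map.entrySet()) {
            if (sb.length() > 0) sb.append(", ");
            sb.append('"').append(e.getKey()).append("\"=>\"").append(e.getValue()).append('"');
        }
        return sb.toString();
    }

    // Parse an hstore literal back into a Map (naive split, no escaping)
    public static Map<String, String> fromHstore(String hstore) {
        Map<String, String> map = new LinkedHashMap<String, String>();
        if (hstore == null || hstore.trim().isEmpty()) return map;
        for (String pair : hstore.split(", ")) {
            String[] kv = pair.split("=>");
            map.put(strip(kv[0]), strip(kv[1]));
        }
        return map;
    }

    private static String strip(String s) {
        s = s.trim();
        return s.substring(1, s.length() - 1); // drop surrounding quotes
    }
}
```

The real implementation also has to handle quoting and escaping inside keys and values, which is where a tested helper like the HstoreHelper earns its keep.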

Now the rest is simple because it’s just plain Hibernate / JPA. Here is the ContactService that does the basic query and updates:

package com.jamesward.service;

import com.jamesward.model.Contact;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import javax.persistence.criteria.CriteriaQuery;

import java.util.List;

@Service
@Transactional
public class ContactServiceImpl implements ContactService {

    @PersistenceContext
    EntityManager em;

    @Override
    public void addContact(Contact contact) {
        em.persist(contact);
    }

    @Override
    public List<Contact> getAllContacts() {
        CriteriaQuery<Contact> c = em.getCriteriaBuilder().createQuery(Contact.class);
        c.from(Contact.class);
        return em.createQuery(c).getResultList();
    }

    @Override
    public Contact getContact(Integer id) {
        return em.find(Contact.class, id);
    }

    @Override
    public void addContactMethod(Integer contactId, String name, String value) {
        Contact contact = getContact(contactId);
        contact.contactMethods.put(name, value);
    }
}
Now that you understand how it all works, check out a live demo on Heroku.

If you want to run this app locally or on Heroku, first grab the source code and continue working inside the newly created project directory:



$ git clone 
$ cd spring_hibernate_hstore_demo 

To run locally:

  1. Setup your PostgreSQL database to support hstore by opening a psql connection to it:
    $ psql -U username -W -h localhost database
  2. Then enable hstore:
     => create extension hstore;
     => \q
  3. Build the app (depends on having Maven installed):
    $ mvn package 
  4. Set the DATABASE_URL environment variable to point to your PostgreSQL server:
    $ export DATABASE_URL=postgres://username:password@hostname/databasename
  5. Start the app:
    $ java -cp target/classes:target/dependency/* com.jamesward.Webapp
  6. Try it out

Cool! Now you can run it on the cloud with Heroku. Here is what you need to do:

  1. Install the Heroku Toolbelt
  2. Login to Heroku:
    $ heroku login
  3. Create a new app:
    $ heroku create
  4. Add Heroku Postgres:
    $ heroku addons:add heroku-postgresql:dev
  5. Tell Heroku to set the DATABASE_URL environment variable based on the database that was just added (replace YOUR_HEROKU_POSTGRESQL_COLOR_URL with your own):
    $ heroku pg:promote YOUR_HEROKU_POSTGRESQL_COLOR_URL
  6. Open a psql connection to the database:
    $ heroku pg:psql
  7. Enable hstore support in your database:
     => create extension hstore;
     => \q
  8. Deploy the app:
    $ git push heroku master
  9. View the app on the cloud:
    $ heroku open

Fantastic! Let me know if you have any questions.

James Ward


James Ward is a Principal Platform Evangelist. He frequently presents at conferences around the world such as JavaOne, Devoxx, and many other Java get-togethers. Along with Bruce Eckel, James co-authored First Steps in Flex.

This article is by James Ward.

The Promise of Big Data Analytics and Insights

Cognitive Computing and Big Data are transforming the way businesses, governments, and organizations solve their most complex problems. Data is all around us, but making sense of that data is challenging. It comes from everywhere and in many shapes and formats, structured and unstructured: from the Internet, mobile devices, social media, GPS, RFID, and elsewhere, sometimes in neat and tidy columns and tables, and sometimes far more random. It is often too big and too chaotic for organizations to leverage.

The Enterra Cognitive Reasoning Platform™ (CRP) solves these challenges by combining the speed and accuracy of computational computing with the adaptive and predictive capabilities of human reasoning. It automates the analysis of vast amounts of data to uncover previously unknown connections, conclusions, inferences, and deductions that improve decision making across industries or domains.

Transforming Industry

In consumer products and retailing, Cognitive Computing is helping manufacturers and retailers identify precise sensory profiles of consumers, so that manufacturers can create, market, and recommend tasty new products. At the same time, these new technologies are helping companies identify and create the Digital Path to Purchase that drives purchasing by informing, educating, persuading, tracking and recommending goods and services.

In the supply chain, Cognitive Computing and its ability to analyze extraordinary volumes of data is improving operations, reducing waste, enhancing promotions, and ultimately delivering products more efficiently to manufacturers, retailers, and consumers.

In life sciences, Cognitive Computing and Big Data are revolutionizing our understanding of how genetics impact the onset and progression of disease, accelerating the discovery of drugs to treat and cure disease at the molecular level.

The Enterra CRP is helping some of the world’s leading brands and organizations realize the power of artificial intelligence and use it to transform their operations, supply chains, marketing, and new product development. Within industries and across domains, Enterra is solving some of today’s most complex business challenges to create a world that is better informed by the data around us.


Enterra Solutions

A Cognitive Computing company that specializes in Big Data Analytics and Insights.

What does “Big Data” mean to Database Marketing?

Over the last year the term “Big Data” has been used with increasing frequency and is now part of the discussion in the budget battles taking place in many industries and organizations.  For me, it all started with hearing the word “Hadoop” back in 2010.  Who forgets a word like that?  You won’t.  The second time I heard it, I Googled it and then had a big smile on my face.  “Hadoop” and “Big Data” both mean more value for data, data integration and analytics.  It takes the value of data and data monetization to a new level.  Someday you will thank whoever coined the term “Big Data” and will hear the word Hadoop and not laugh.  There’s nothing worse than sitting in a meeting with IT and being asked “What are your thoughts on Big Data?” and having no opinion.  Why should it matter to you?  Because it is all in the path to a bigger and now validated budget.  It also does not hurt to sound like you know what you are talking about.

“Big Data” budget for everyone

The buzzword in the budget wars right now is “Big Data”.  Everyone wants a piece of the “Big Data” budget – Marketing, HR, Compliance, Finance, and IT.  I also read that there is $200M earmarked for “Big Data” in the U.S. government budget this year.  Now that the word is out, the possibility for the application of “Big Data” is being discovered everywhere.  You could not ask for more exposure around data and funding for data integration and analytics.  For years it was exceedingly hard to justify the expense for data, data integration, data warehouses and analysis tools during budget reviews.  It was often too technical for Marketing Executives to fully appreciate, and there was little understanding of its value.

A faster path to direct answers with data

A decade ago, there was little interest in “Customer Intelligence” – and no one had coined the term back then.  We thought customer information was cutting edge but couldn’t find anyone to look at the reports because they were about “customers” not “prospects”.  In the 90’s, data integration and matching consisted of homegrown processes that did variations of exact name, phone number and address match strings, and IT would shut us down when we tried to add more business rules into the matching algorithms.  Employees spent hours manually matching records by looking the businesses up on the Internet, which we thought was a revolutionary data matching research tool.  There was the thought that data gave direct answers, and there was no concept of what lay between having data and getting results.  CDI, data governance, data stewards, and analytics were not part of our vocabulary back then.  We did those jobs, but there were no cool words coined yet to describe our technical jobs.  We were in marketing, dealing with marketing executives and an IT organization that thought we were “marketing” and not technical like they were.

Feel comfortable saying “Big Data”

Here is a cheat sheet for mastering “Big Data” and feeling comfortable saying Hadoop.  Big Data is all kinds of data.  Examples of types of Big Data: POS and billing transactions, trade, customer interactions via sales, customer service, chat sessions, email, social media, events, and CRM records.  Customer preferences and behavior, customer usage, news, financial reports, hardware, networks and systems, product sensors, system level logs, phone records, employee results and most importantly, third party data.  The list is endless and really, any kind of data will work.

Getting back to Hadoop.  It is one of the many new types of technology that solve the problem of Big Data for businesses.  Think of it as a 3D data warehouse, where data does not have to be flat and structured.  Instead of a data refresh and the limitations of storage, data can just be added, showing changes over time.  Imagine being able to track a sale made in 2007 to what the customer looks like in 2012 and all of the behavioral data that is in between.  There is finally an application that will give us a multi-dimensional view of data.  In simplest terms, Hadoop[i] is a high-performance distributed data storage and processing system accessible through open source software.  The system stores data on multiple servers and runs parallel processes against the data across servers and then combines the information to provide results.
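That “store on many servers, process in parallel, combine the results” description is essentially the MapReduce model Hadoop popularized. As a toy, single-process sketch of the idea (a word count, the canonical example; the names here are hypothetical and nothing below is Hadoop-specific):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy illustration of the processing model described above: each "server"
// counts words in its own shard of the data, then the partial results are
// combined into a final answer. Hadoop does this at cluster scale.
public class MiniMapReduce {

    // The "map" phase: one server counts words in its local shard.
    static Map<String, Integer> countShard(String shard) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String word : shard.split("\\s+")) {
            if (word.isEmpty()) continue;
            Integer n = counts.get(word);
            counts.put(word, n == null ? 1 : n + 1);
        }
        return counts;
    }

    // The "reduce" phase: merge the per-server partial counts.
    static Map<String, Integer> combine(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new HashMap<String, Integer>();
        for (Map<String, Integer> partial : partials) {
            for (Map.Entry<String, Integer> e : partial.entrySet()) {
                Integer n = total.get(e.getKey());
                total.put(e.getKey(), (n == null ? 0 : n) + e.getValue());
            }
        }
        return total;
    }

    public static Map<String, Integer> wordCount(List<String> shards) {
        List<Map<String, Integer>> partials = new java.util.ArrayList<Map<String, Integer>>();
        for (String shard : shards) partials.add(countShard(shard)); // parallel on a real cluster
        return combine(partials);
    }
}
```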

Initiate the discussion in your next meeting

Still wondering why “Big Data” matters to you?  Who has more experience as business users of data, data integration and analytics than database marketing professionals?  Your experience and skills are now in demand across the enterprise.  The term “Big Data” is relevant right now, sounds impressive, is technical and is new (not really).  It has been around for years to anyone in the data world, but it is new to those that approve the budgets.  So go grab the words, and in your next meeting ask “What does Big Data mean to your department, organization or company?”  You could be really impressive and follow it up with “Any thoughts on deploying Hadoop across the Enterprise?”

[i] Cloudera.  (2011).  Cloudera Ten Common Hadoopable Problems Whitepaper.

This article is by Rebecca Croucher.

What’s the difference between Business Intelligence and Big Data?

I’m often asked the following question:

What is the difference between Business Intelligence and Big Data?

Before getting into my approach to answering that question, let’s be clear on what we’re talking about. When most people say “Business Intelligence” they’re talking about the class of products that have been implemented in most organizations, not the actual information or knowledge that is derived from the use of these systems. When it comes to Big Data, most people aren’t quite sure what they are talking about…some are talking about the size of the data, some are referring to the approach to analysis and others are talking about the process as a whole.

One of the problems that exists today is ensuring everyone understands what big data is and isn’t. While that is a discussion for another time, in this post I’ll simply say that Big Data isn’t something you buy, implement, configure and start using like you would do with Business Intelligence systems. It’s much more complicated than that.

Back to the original question: What is the difference between BI and Big Data?

I’ve never really been able to answer that question as fully as I’ve wanted to because most people aren’t willing to sit down and listen to me walk through the history of business intelligence and big data and explain their differences.

After many attempts at finding a succinct way to describe the differences, I finally figured out that most people don’t care about the technical differences or the history. Most people just want a ‘sound bite’ answer so I came up with this response:

Business Intelligence helps find answers to questions you know. Big Data helps you find the questions you don’t know you want to ask.

When I answer the question this way, I tend to get a nod of the head and a response similar to “…well that sounds really complicated!”.

It is complicated. That’s the difference between Business Intelligence and Big Data. You don’t have easy, well-defined reports and answers with big data like you do with BI. You don’t have a single system to implement and manage with big data like you do with BI.  Don’t get me wrong…BI systems aren’t “simple”, and the thought and planning that needs to go into them is very detailed, but BI and Big Data are completely different.

Business Intelligence systems have their place in business. They deliver neat, well-designed answers to neat, well-designed questions. Nothing wrong with that…but most businesses don’t have neat, well-designed questions these days. In fact, most organizations don’t really know what questions they need to be asking.

That’s the difference between BI and Big Data.


Eric D. Brown, D.Sc.

Eric is a technology and marketing consultant with an interest in using technology and data to solve real-world business problems. In recent years, he has combined sentiment analysis, natural language processing and big data approaches to build innovative systems and strategies to solve interesting problems.

Storing polymorphic classes in MongoDb using C#

NoSQL databases like Mongo give us the advantage of storing our classes directly in the data store without worrying too much about schemas, saving us from the object-relational impedance mismatch. A common scenario that arises from storing classes is how to handle inheritance hierarchies.

In this post I will discuss how MongoDB handles polymorphic classes, or inheritance. I am using the official C# driver for MongoDB.

To start with, the first thing to know is that MongoDB fully supports polymorphic classes: all the classes in your class hierarchies can be part of the same Mongo collection. It does this with a concept called type discriminators.

Type Discriminator

The way Mongo distinguishes between the various types in a hierarchy is by including a field named ‘_t’, called the type discriminator, as shown below.


Let’s consider a simple class hierarchy to illustrate the concept.


The point here is that while saving the data I am always going to use the base class, as shown below, and type discriminators will help in serializing and de-serializing the actual type.

var writeConcernResult = _dataContext.ContentCollection.Save<ContentBase>(content);

There are multiple ways in which we can distinguish the type, and hence Mongo provides two built-in type discriminator conventions:

  • ScalarDiscriminatorConvention : in this case the ‘_t’ field contains the type name by default, as shown in the screenshot above.
  • HierarchicalDiscriminatorConvention : this convention comes into play when you designate one of the classes in the hierarchy as a root class, as shown below:

[BsonDiscriminator(RootClass = true)]
public abstract class ContentBase

Additionally, you need to specify the known types using the BsonKnownTypes attribute so that the correct type is created when de-serializing the content to objects.
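Conceptually, the discriminator machinery boils down to a registry from the stored discriminator value to a concrete type, consulted when documents are read back. Here is a minimal sketch of that idea (written in Java for brevity; the real logic lives inside the C# driver’s BSON serialization layer, and the names below are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of what a scalar type discriminator does conceptually: the stored
// document carries a "_t" value, and deserialization looks the concrete
// class up in a registry of known types. Illustration only.
public class DiscriminatorSketch {

    private final Map<String, Class<?>> knownTypes = new HashMap<String, Class<?>>();

    // Registering a class is what [BsonKnownTypes] accomplishes declaratively.
    public void register(Class<?> type) {
        knownTypes.put(type.getSimpleName(), type);
    }

    // Analogous to GetDiscriminator: what gets written into "_t" when saving.
    public String discriminatorFor(Class<?> actualType) {
        return actualType.getSimpleName();
    }

    // Analogous to GetActualType: map the stored "_t" value back to a class.
    public Class<?> actualTypeFor(String discriminator) {
        Class<?> type = knownTypes.get(discriminator);
        if (type == null) throw new UnsupportedOperationException(discriminator);
        return type;
    }
}
```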

Below is how the type discriminator is stored in this case.


Notice how the whole hierarchy is stored as an array.

Custom Type Discriminator Convention

You also have the option of writing your own custom type discriminator convention. Using a custom convention, you can change the way the type discriminator is stored, or what is stored.

For our example we will just change the name of the element which stores the type discriminator, i.e. from ‘_t’ to ‘_contentType’. This may actually be required when you are working with different Mongo drivers.

public class ContentTypeDiscriminatorConvention : IDiscriminatorConvention
{
    public string ElementName
    {
        get { return "_contentType"; }
    }

    public Type GetActualType(MongoDB.Bson.IO.BsonReader bsonReader, Type nominalType)
    {
        var bookmark = bsonReader.GetBookmark();
        string typeValue = string.Empty;
        if (bsonReader.FindElement(ElementName))
        {
            typeValue = bsonReader.ReadString();
            bsonReader.ReturnToBookmark(bookmark);
        }
        else
        {
            throw new NotSupportedException();
        }
        return Type.GetType(typeValue);
    }

    public MongoDB.Bson.BsonValue GetDiscriminator(Type nominalType, Type actualType)
    {
        return actualType.Name;
    }
}

Below are the results.



This article is by Jagmeet Singh.

Working with Geospatial support in MongoDB: the basics

A project I’m working on requires storage of and queries on Geospatial data. I’m using MongoDB, which has good support for Geospatial data, at least good enough for my needs. This post walks through the basics of inserting and querying Geospatial data in MongoDB.

First off, I’m working with MongoDB 2.4.5, the latest. I initially tried this out using 2.2.3 and it wasn’t recognizing the 2dsphere index I set up, so I had to upgrade.

MongoDB supports storage of Geospatial types, represented as GeoJSON objects, specifically the Point, LineString, and Polygon types. I’m just going to work with Point objects here.

Once Geospatial data is stored in MongoDB, you can query for:

  • Inclusion: Whether locations are included in a polygon
  • Intersection: Whether locations intersect with a specified geometry
  • Proximity: Querying for points nearest other points

You have two options for indexing Geospatial data:

  • 2d : Calculations are done based on flat geometry
  • 2dsphere : Calculations are done based on spherical geometry

As you can imagine, 2dsphere is more accurate, especially for points that are further apart.

In my example, I’m using a 2dsphere index, and doing proximity queries.

First, create the collection that’ll hold a point. I’m planning to work this into the Sculptor code generator so I’m using the ‘port’ collection which is part of the ‘shipping’ example MongoDB-based project.

> db.createCollection("port")
{ "ok" : 1 }

Next, insert records into the collection including a GeoJSON type, point. According to MongoDB docs, in order to index the location data, it must be stored as GeoJSON types.

> db.port.insert( { name: "Boston", loc : { type : "Point", coordinates : [ 71.0603, 42.3583 ] } }) 
> db.port.insert( { name: "Chicago", loc : { type : "Point", coordinates : [ 87.6500, 41.8500 ] } })  

> db.port.find()  

{ "_id" : ObjectId("51e47b4588ecd4e8dedf7185"), "name" : "Boston", "loc" : { "type" : "Point", "coordinates" : [  71.0603,  42.3583 ] } }
{ "_id" : ObjectId("51e47ee688ecd4e8dedf7187"), "name" : "Chicago", "loc" : { "type" : "Point", "coordinates" : [  87.65,  41.85 ] } } 

The coordinates above, as with all coordinates in MongoDB, are in longitude, latitude order.

Next, we create a 2dsphere index, which supports geolocation queries over spherical spaces.

> db.port.ensureIndex( { loc: "2dsphere" })

Once this is set up, we can issue location-based queries, in this case using the ‘geoNear’ command:

> db.runCommand( { geoNear: 'port', near: {type: "Point", coordinates: [87.9806, 42.0883]}, spherical: true, maxDistance: 40000})
{
     "ns" : "Shipping-test.port",
     "results" : [
          {
               "dis" : 38110.32969523317,
               "obj" : {
                    "_id" : ObjectId("51e47ee688ecd4e8dedf7187"),
                    "name" : "Chicago",
                    "loc" : {
                         "type" : "Point",
                         "coordinates" : [
                              87.65,
                              41.85
                         ]
                    }
               }
          }
     ],
     "stats" : {
          "time" : 1,
          "nscanned" : 1,
          "avgDistance" : 38110.32969523317,
          "maxDistance" : 38110.32969523317
     },
     "ok" : 1
}

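As a sanity check on the “dis” value above (meters), the spherical distance geoNear computes can be approximated with the haversine formula. A quick sketch (the Earth radius here is an assumption, so it lands close to, but not exactly on, MongoDB’s 38110.33):

```java
// Back-of-the-envelope check of the "dis" value above using the haversine
// formula. Coordinates are (longitude, latitude) pairs, as in MongoDB.
// The Earth radius is an assumption; MongoDB's constant may differ slightly.
public class HaversineCheck {

    static final double EARTH_RADIUS_M = 6371000.0;

    static double distanceMeters(double lon1, double lat1, double lon2, double lat2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_M * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
    }

    public static void main(String[] args) {
        // Query point vs the stored Chicago point
        double d = distanceMeters(87.9806, 42.0883, 87.65, 41.85);
        System.out.println(d); // roughly 38 km, in line with geoNear's 38110.33
    }
}
```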
A similar query using ‘find’ and the ‘$near’ operator fails, however. Note that it queries a ‘port’ field rather than the indexed ‘loc’ field, which is most likely why no index is found:

> db.port.find( { "port" : { $near : { $geometry : { type : "Point", coordinates: [87.9806, 42.0883] } }, $maxDistance: 40000 } } )
error: {
     "$err" : "can't find any special indices: 2d (needs index), 2dsphere (needs index),  for: { port: { $near: { $geometry: { type: \"Point\", coordinates: [ 87.9806, 42.0883 ] } }, $maxDistance: 40000.0 } }",
     "code" : 13038
}

This article is by Ron Smith.

Use Cases Of MongoDB

MongoDB is a relatively new contender in the data storage circle compared to giants like Oracle and IBM DB2, but it has gained huge popularity with its distributed key-value store, MapReduce calculation capability and document-oriented NoSQL features.

MongoDB has been rightfully acclaimed as the “Database Management System of the Year” by DB-Engines.

Along with these features, MongoDB has numerous advantages compared to traditional RDBMSs.  As a result, lots of companies are vying to employ the MongoDB database. Here is a look at some real-world use cases, where organisations are adopting it, if not entirely, then at least as an addition to their existing databases.


Aadhar

Aadhar is an excellent example of a real-world use case of MongoDB. In recent times, there has been some controversy revolving around the CIA’s non-profit venture capital arm, In-Q-Tel, backing the company which developed MongoDB. Putting aside the controversy, let’s look at MongoDB’s role in Aadhar.

India’s Unique Identification project, aka Aadhar, is the world’s biggest biometrics database. Aadhar is in the process of capturing demographic and biometric data of over 1.2 billion residents. Aadhar has used MongoDB as one of its databases to store this huge amount of data. MongoDB was among several database products, apart from MySQL, Hadoop and HBase, originally procured for running the database search. Here, MySQL is used for storing demographic data and MongoDB is used to store images. Reportedly, MongoDB has nothing to do with the “sensitive” data.


Shutterfly

Shutterfly is a popular Internet-based photo sharing and personal publishing company that manages a store of more than 6 billion images with a transaction rate of up to 10,000 operations per second. Shutterfly is one of the companies that transitioned from Oracle to MongoDB.

During the evaluation at the time of transitioning to MongoDB, it became apparent that a non-relational database would be better suited to Shutterfly’s data needs, thereby possibly improving programmer productivity as well as performance and scalability.

Shutterfly considered a wide variety of alternate database systems, including Cassandra, CouchDB and BerkeleyDB, before settling on MongoDB. Shutterfly has installed MongoDB for metadata associated with uploaded photos. For those parts of the application which require a richer transactional model, like billing and account management, the traditional RDBMS is still in place.

So far, Shutterfly is happy with its decision to transition to MongoDB, and this is verified by the statement of Kenny Gorman (Data Architect at Shutterfly): “I am a firm believer in choosing the correct tool for the job, and MongoDB was a nice fit, but not without compromises.”


MetLife

MetLife is a leading global provider of insurance, annuities and employee benefit programs. They serve about 90 million customers and hold leading market positions in the United States, Japan, Latin America, Asia, Europe and the Middle East. MetLife uses MongoDB for “The Wall”, an innovative customer service application that provides a consolidated view of MetLife customers, including policy details and transactions. The Wall is designed to look and function like Facebook and has improved customer satisfaction and call centre productivity. The Wall brings together data from more than 70 legacy systems and merges it into a single record. It runs across six servers in two data centres and presently stores about 24 terabytes of data. MongoDB-based applications are part of a series of Big Data projects that MetLife is working on to transform the company and bring technology, business and customers together.


eBay

eBay is an American multinational internet consumer-to-consumer corporation, headquartered in San Jose. eBay has a number of projects running on MongoDB for search suggestions, metadata storage, cloud management and merchandizing categorization.

The above is just a hint at the companies using MongoDB. Here is a comprehensive list of all the companies using MongoDB. Most of the companies in this list use MongoDB as their primary database.

This article is by bigdata.

How MongoDB’s Journaling Works

I was working on a section on the gooey innards of journaling for The Definitive Guide, but then I realized it’s an implementation detail that most people won’t care about. However, I had all of these nice diagrams just lying around.


So, how does journaling work? Your disk has your data files and your journal files, which we’ll represent like this:

When you start up mongod, it maps your data files to a shared view. Basically, the operating system says: “Okay, your data file is 2,000 bytes on disk. I’ll map that to memory address 1,000,000-1,002,000. So, if you read the memory at memory address 1,000,042, you’ll be getting the 42nd byte of the file.” (Also, the data won’t necessarily be loaded until you actually access that memory.)

This memory is still backed by the file: if you make changes in memory, the operating system will flush these changes to the underlying file. This is basically how mongod works without journaling: it asks the operating system to flush in-memory changes every 60 seconds.

However, with journaling, mongod makes a second mapping, this one to a private view. Incidentally, this is why enabling journaling doubles the amount of virtual memory mongod uses.

Note that the private view is not connected to the data file, so the operating system cannot flush any changes from the private view to disk.

Now, when you do a write, mongod writes this to the private view.

mongod will then write this change to the journal file, creating a little description of which bytes in which file changed.

The journal appends each change description it gets.

At this point, the write is safe. If mongod crashes, the journal can replay the change, even though it hasn’t made it to the data file yet.

The journal will then replay this change on the shared view.

Then mongod remaps the shared view to the private view. This prevents the private view from getting too “dirty” (having too many changes from the shared view it was mapped from).

Finally, at a glacial speed compared to everything else, the shared view will be flushed to disk. By default, mongod requests that the OS do this every 60 seconds.

And that’s how journaling works. Thanks to Richard, who gave the best explanation of this I’ve heard (Richard is going to be teaching an online course on MongoDB this fall, if you’re interested in more wisdom from the source).
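The whole write path described above (write to the private view, append a description to the journal, replay onto the shared view, remap) can be condensed into a toy model. Plain byte arrays stand in for the memory-mapped files here; this is an illustration of the flow, not mongod’s actual code:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of the journaling flow described above. Byte arrays stand in
// for the memory-mapped views; a list of (offset, bytes) records stands in
// for the journal file. Illustration only, not mongod's implementation.
public class JournalingSketch {

    static class JournalEntry {
        final int offset;
        final byte[] bytes;
        JournalEntry(int offset, byte[] bytes) { this.offset = offset; this.bytes = bytes; }
    }

    byte[] sharedView = new byte[64];          // backed by the data file
    byte[] privateView = sharedView.clone();   // where writes land first
    List<JournalEntry> journal = new ArrayList<JournalEntry>();

    // A write goes to the private view and is described in the journal.
    void write(int offset, byte[] bytes) {
        System.arraycopy(bytes, 0, privateView, offset, bytes.length);
        journal.add(new JournalEntry(offset, bytes)); // the write is now recoverable
    }

    // The journal is later replayed onto the shared view, which the OS
    // eventually flushes to the data file (every 60 seconds by default).
    void replayJournal() {
        for (JournalEntry e : journal) {
            System.arraycopy(e.bytes, 0, sharedView, e.offset, e.bytes.length);
        }
        journal.clear();
        privateView = sharedView.clone(); // the "remap" step
    }
}
```

Note how a crash after `write` but before `replayJournal` loses nothing: the journal entries alone are enough to reconstruct the change.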

Kristina Chodorow


Software engineer at Google and author of several O’Reilly books on MongoDB.

This article is by Kristina Chodorow.

12 Months with MongoDB

As previously blogged, Wordnik is a heavy user of 10gen’s MongoDB. One year ago today we started the investigation to find an alternative to MySQL to store, find, and retrieve our corpus data. After months of experimentation in the non-relational landscape (and running a scary number of nightly builds), we settled on MongoDB. To mark the one-year anniversary of what ended up being a great move for Wordnik, I’ll summarize how the migration has worked out for us.



The primary driver for migrating to MongoDB was performance. We had issues with MySQL for both storage and retrieval, and both were alleviated by MongoDB. Some statistics:

  • Mongo serves an average of 500k requests/hour for us (that does include nights and weekends). We typically see 4x that during peak hours
  • We have > 12 billion documents in Mongo
  • Our storage is ~3TB per node
  • We easily sustain an insert speed of 8k documents/second, often bursting to 50k/sec
  • A single java client can sustain 10MB/sec read over the backend (gigabit) network to one mongod. Four readers from the same client pull 40MB/sec over the same pipe
  • Every type of retrieval has become significantly faster than our MySQL implementation:

– example fetch time reduced from 400ms to 60ms
– dictionary entries from 20ms to 1ms
– document metadata from 30ms to 0.1ms
– spelling suggestions from 10ms to 1.2ms

One wonderful benefit of Mongo’s built-in caching is that taking our memcached layer out actually sped up calls by 1-2ms/call under load. This also frees up many GB of RAM. We clearly cannot fit all our corpus data in RAM, so the 60ms average for examples includes disk access.



We’ve been able to add a lot of flexibility to our system since we can now efficiently execute queries against attributes deep in the object graph. You’d need to design a really ugly schema to do this in MySQL (although it can be done). Best of all, by essentially building indexes on object attributes, these queries are blazingly fast.
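To show what a query against a deep attribute looks like, here is a small pure-Python stand-in for Mongo’s dot-notation matching. The documents and field names are invented for illustration, not Wordnik’s actual schema.

```python
# Illustration of Mongo-style dot-notation queries on nested documents.
# Pure-Python stand-in; the data and field names are invented.

def get_path(doc, dotted_path):
    """Walk a nested dict along a dotted path, e.g. 'citation.source.year'."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return None
        current = current[part]
    return current

def find(collection, query):
    """Return documents whose dotted-path fields equal the query values."""
    return [doc for doc in collection
            if all(get_path(doc, path) == value
                   for path, value in query.items())]

words = [
    {"word": "mongo", "citation": {"source": {"name": "blog", "year": 2010}}},
    {"word": "sql",   "citation": {"source": {"name": "manual", "year": 2005}}},
]

matches = find(words, {"citation.source.year": 2010})
print([d["word"] for d in matches])  # ['mongo']
```

In real MongoDB the equivalent is `db.words.find({"citation.source.year": 2010})`, and an index created on that dotted path is what makes such queries fast.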

Other benefits:

  • We now store our audio files in MongoDB’s GridFS. Previously we used a clustered file system so files could be read and written from multiple servers. This created a huge amount of complexity from the IT operations point of view, and it meant that system backups (database + audio data) could get out of sync. Now that they’re in Mongo, we can reach them anywhere in the data center with the same mongo driver, and backups are consistent across the system.
  • Capped collections. We keep trend data inside capped collections, which have been wonderful for keeping datasets from unbounded growth.
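Capped collections preserve insertion order and automatically evict the oldest documents once their fixed size is reached. The behavior is roughly that of a bounded deque; here is a Python sketch with an invented cap and invented field names:

```python
from collections import deque

# A capped collection keeps insertion order and evicts the oldest
# documents once it hits its fixed size -- similar to a bounded deque.
trend_data = deque(maxlen=3)   # cap of 3 documents (size is invented)

samples = [("09:00", 120), ("10:00", 340), ("11:00", 280), ("12:00", 410)]
for hour, count in samples:
    trend_data.append({"hour": hour, "requests": count})

# The oldest entry ("09:00") has been dropped automatically.
print([d["hour"] for d in trend_data])  # ['10:00', '11:00', '12:00']
```

This is why capped collections suit trend data: the dataset can never grow past its cap, with no application-side pruning code.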



Of course, storing all your critical data in a relatively new technology has its risks. So far, we’ve done well from a reliability standpoint. Since April, we’ve had to restart Mongo twice. The first restart was to apply a patch on 1.4.2 (we’re currently running 1.4.4) to address some replication issues. The second was due to an outage in our data center. More on that in a bit.



Maintainability is one challenge for a new player like MongoDB. The administrative tools are pretty immature when compared with a product like MySQL. There is a blurry hand-off between engineering and IT operations for this product, which is something worth noting. Luckily for all of us, there are plenty of hooks in Mongo to allow for good tools to be built, and without a doubt there will be a number of great applications to help manage Mongo.

The size of our database has required us to build some tools to help maintain Mongo, which I’ll be talking about at MongoSV in December. The bottom line is yes: you can run and maintain MongoDB, but it is important to understand the relationship between your server and your data.

The outage we had in our data center caused a major panic. We lost our DAS device during heavy writes to the server; this caused corruption on both master and slave nodes. The master was busy flushing data to disk while the slave was applying operations via the oplog. When the DAS came back online, we had to run a repair on our master node, which took over 24 hours. The slave was compromised yet operable; we were able to promote it to master while repairing the other system.

Restoring from tape was an option, but keep in mind that even a fast tape drive takes a long time to recover 3TB of data, to say nothing of losing the data written between the last backup and the outage. Luckily we didn’t have to go down this path. We also had an in-house incremental backup + point-in-time recovery tool, which we’ll be making open source before MongoSV.

Of course, there have been a few surprises in this process, and some lessons worth sharing.

Data size


At the MongoSF conference in April, I whined about MongoDB’s 4x disk space requirements. Later, the 10gen folks pointed out how collection-level padding works in Mongo; for our scenario (hundreds of collections with an average of 1GB of padding per collection) we were wasting a ton of disk on this alone. We were also able to embed a number of objects in subdocuments and drop indexes, which got our storage costs under control: now only about 1.5-2x that of our former MySQL deployment.



There are operations that will lock MongoDB at the database level. When you’re serving hundreds of requests a second, this can cause requests to pile up and create lots of problems. We’ve done the following optimizations to avoid locking:

  • If updating a record, we always query the record before issuing the update. That gets the object into RAM, so the update operates as fast as possible. The same logic has been added for master/slave deployments, where the slave can be run with "--pretouch", which causes a query on the object before the update is applied
  • Multiple mongod processes. We have split up our database to run in multiple processes based on access patterns.
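The query-before-update pattern above can be sketched with a toy store. This is pure Python standing in for mongod’s page cache; the class and record names are invented for illustration.

```python
# Sketch of the "query before update" pattern: reading a record first
# pulls it into RAM, so the subsequent (locking) update is as short as
# possible. Pure-Python stand-in for a mongod process; names invented.

class ToyStore:
    def __init__(self, records):
        self.disk = records    # slow storage
        self.ram = {}          # page cache / working set

    def query(self, key):
        # A read faults the record into RAM (the "pretouch").
        if key not in self.ram:
            self.ram[key] = self.disk[key]
        return self.ram[key]

    def update(self, key, changes):
        # The update holds the lock; it is cheap only if the record
        # is already in RAM, expensive if it must hit disk first.
        paged_in = key in self.ram
        record = self.ram[key] if paged_in else self.disk[key]
        record.update(changes)
        self.ram[key] = record
        self.disk[key] = record
        return paged_in

store = ToyStore({"doc1": {"views": 1}})
store.query("doc1")                        # warm the record first
fast = store.update("doc1", {"views": 2})
print(fast)  # True: record was already in RAM when the lock was taken
```

Without the preceding `query`, the update would have to fault the record in from disk while holding the lock, which is exactly the pile-up the Wordnik team describes.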

In summary, life with MongoDB has been good for Wordnik. Our code is faster, more flexible and dramatically smaller. We can code up tools to help out the administrative side until other options surface.

Hope this has been informative and entertaining; you can always see MongoDB in action via our public API.


Tony Tam


Strongly opinionated generalist, Swagger committer and VP at News Inc. Wordnik

This article is by Tony Tam.

Interacting with MongoDB using Rails 3 and MongoMapper

MongoDB is an open-source document-oriented database in the vein of CouchDB. I had been wanting to try this kind of database on a Rails project for a while. After reading this nice tutorial today, I decided to take some time to create a sample Rails 3 app and put it on GitHub.

I chose to use MongoMapper, a Ruby object mapper for Mongo. MongoMapper uses ActiveModel and lets you interact with a MongoDB database in a very ActiveRecord-like way.

Hope this sample app will help you get started with MongoDB!

François Lamontagne


I live in the city of Trois-Rivières (Three Rivers) in Quebec. I am a freelancer and I specialize in Web development. Ruby Fleebie

This article is by François Lamontagne.