Big Data & Hadoop

So one of the things I had to pick up very quickly since joining MTI was what Big Data is all about and how different vendors are approaching the management of ‘Big Data’….

If you try to Google “Big Data” you get so many websites blasting you with technical gobbledygook…. all the big infrastructure companies (IBM, Microsoft, EMC) and even the consulting companies (Accenture, McKinsey) are guilty of trying to ‘over explain’ the concept!! It was quite frustrating that I couldn’t find one major player in the market that could explain what Big Data is in layman’s terms!! Not even a single simple paragraph without any of their marketing/technical bullsh!t spin….

It’s interesting that the best explanation of Big Data came from Intel…… who aren’t a storage vendor (after all, you need storage for ‘big data’), aren’t a server manufacturer (ok, so not totally true, but I’m talking about the HPs/Dells of this world) and aren’t a consulting company trying to sell loads of ‘Professional Services’……. Have a read of their whitepaper here:


So, Big Data…… what is it?!? Well, from what I can gather it’s a general term for the explosion of information that has occurred over the years with the greater use of the internet, social media, electronic communication, data gathering, etc….. It’s a vast amount of unstructured information of different varieties, which companies are having trouble connecting together to make any business use of – i.e. an asset that they can’t utilise or analyse!

Big Data is characterised by the 3 Vs: Volume, Variety and Velocity……. the best explanation I found of these Vs was on the SAS website:

  • Volume – Many factors contribute to the increase in data volume – transaction-based data stored through the years, text data constantly streaming in from social media, increasing amounts of sensor data being collected, etc. In the past, excessive data volume created a storage issue. But with today’s decreasing storage costs, other issues emerge, including how to determine relevance amidst the large volumes of data and how to create value from data that is relevant.
  • Variety – Data today comes in all types of formats – from traditional databases to hierarchical data stores created by end users and OLAP systems, to text documents, email, meter-collected data, video, audio, stock ticker data and financial transactions. By some estimates, 80 percent of an organization’s data is not numeric! But it still must be included in analyses and decision making.
  • Velocity – According to Gartner, velocity “means both how fast data is being produced and how fast the data must be processed to meet demand.” RFID tags and smart metering are driving an increasing need to deal with torrents of data in near-real time. Reacting quickly enough to deal with velocity is a challenge to most organizations.

This is where Big Data analytics comes into play……. it’s a technology-enabled strategy for gaining a better understanding of the data held by a company, a more accurate insight into a customer/partner/business, and ultimately a competitive advantage. By having the ability to process and analyse real-time data (or stored data), companies can uncover hidden patterns, unknown correlations and other useful information in order to make decisions faster, monitor emerging trends, rapidly change direction, and jump on new business opportunities!


Ok, so that pretty much sounded like marketing bullsh!t, so I have to apologise for writing all that……. but basically it’s all about tapping into the large amount of information that people are able to get their hands on and analysing it in order to extract something useful which will benefit you! It’s amazing how many jobs there are in the market for ‘big-data analysts’ or ‘data scientists’, not to mention the number of vendors jumping on the bandwagon!

One of the articles I read on Intel’s whitepaper mentioned a very interesting fact about data growth…. that it took “from the dawn of civilization to 2003 to create 5 exabytes of information, we now create that same volume in just two days! By 2012, the digital universe of data will grow to 2.72 zettabytes (ZB) and will double every two years to reach 8 ZB by 2015.”
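To sanity-check that quote, here is a quick back-of-the-envelope sketch of what “doubling every two years” means in numbers. The function name and the assumption of smooth exponential growth are mine, not Intel’s; only the 2.72 ZB starting figure and the two-year doubling period come from the quote.

```python
def projected_size_zb(start_zb, start_year, target_year, doubling_years=2.0):
    """Size after smooth exponential growth that doubles every `doubling_years`.

    Assumption for this sketch: growth is continuous, so half a doubling
    period multiplies the size by the square root of 2.
    """
    elapsed = target_year - start_year
    return start_zb * 2 ** (elapsed / doubling_years)


if __name__ == "__main__":
    # Starting point taken from the quote: 2.72 ZB in 2012.
    for year in (2012, 2013, 2014, 2015):
        print(year, round(projected_size_zb(2.72, 2012, year), 2), "ZB")
```

For 2015 this works out at roughly 7.7 ZB, which is in line with the “8 ZB by 2015” in the quote once you allow for rounding.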


One name that keeps getting mentioned is Hadoop……. What the hell is Hadoop?? Fortunately Googling Hadoop gave a better result which was easier to digest!

(Interesting fact: Hadoop is actually named after a toy elephant that the programmer’s son had!)

The Apache Hadoop project is an open-source software framework (written in Java) that supports data-intensive distributed applications…. it provides reliable, scalable, distributed computing, where the application is divided into many small fragments of work, each of which may be executed or re-executed on any node in the cluster. In addition, it provides a distributed file system that stores data on the compute nodes, providing very high aggregate bandwidth across the cluster.
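To get my head around the “many small fragments of work” idea, I find a toy version helps. The sketch below is plain Python, not the Hadoop API: a local thread pool stands in for the cluster’s nodes, and all the function names are made up for illustration. It splits some text into fragments, counts words in each fragment independently, re-runs a fragment if its worker fails, and then merges the partial results (the map/reduce pattern that Hadoop is known for).

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_fragments(lines, fragment_size):
    """Divide the input into small, independent fragments of work."""
    return [lines[i:i + fragment_size] for i in range(0, len(lines), fragment_size)]


def count_words(fragment):
    """The 'map' step: count words in one fragment, independently of the rest."""
    counts = Counter()
    for line in fragment:
        counts.update(line.lower().split())
    return counts


def run_with_retry(task, fragment, max_attempts=3):
    """Re-execute a fragment if the (simulated) worker raises RuntimeError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return task(fragment)
        except RuntimeError:
            if attempt == max_attempts:
                raise


def word_count(lines, fragment_size=2, workers=4):
    fragments = split_into_fragments(lines, fragment_size)
    # Assumption: the thread pool plays the role of the nodes in a cluster.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(lambda f: run_with_retry(count_words, f), fragments)
        # The 'reduce' step: merge the per-fragment results into one answer.
        total = Counter()
        for partial in partial_counts:
            total.update(partial)
    return total


if __name__ == "__main__":
    sample = [
        "big data is big",
        "hadoop handles big data",
        "data data everywhere",
    ]
    print(word_count(sample).most_common(2))
```

The point of the design is that no fragment depends on any other, so it doesn’t matter which worker picks up which fragment, or whether one has to be run twice. A real Hadoop job would also move the computation to wherever the data is stored, which this toy version doesn’t attempt.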

Basically the Hadoop stack is fast becoming the best approach to unstructured data analytics. The complete technology stack includes common utilities, a distributed file system, analytics and data storage platforms, and an application layer that manages distributed processing, parallel computation, workflow, and configuration management.

If you want more information, then the best sources are Hadoop’s own website and Intel’s whitepaper.

(Another interesting fact: supposedly – and I guess not surprisingly – one of the biggest Hadoop clusters in the world is used at Facebook, which has over 100PB of data!)


So given that VMware’s motto is to “Virtualise everything” in a “software defined datacentre”, it comes as no surprise that they are trying to get people who are looking at the Hadoop stack to stick it on VMware. I mean, it does make sense in some way to stick a Hadoop cluster into a virtualised environment…. companies don’t need to data-crunch every hour of the day (ok, some do), but sticking it on VMware allows you to use ‘elastic scaling’ on the Hadoop cluster as and when more resources are required to crunch through the data! It also lets you make use of a cloud computing approach with a self-service consumption model!

In addition, the ability to share the infrastructure with non-big data resources makes sense – due to how VMs are isolated from each other, you can have your Hadoop cluster running alongside your other business application workloads…..


Anyways, I’m still learning on the job…… but at least I now know enough to talk my way out of a situation if a client ever asks the same questions I asked at the start of this post! =)


One comment on “Big Data & Hadoop”

  1. Thank you for this. Finally, somebody cuts through the buzzword-filled obfuscation and gets to the issue. Helpful starting point. Thanks again.

