A lap around Big Data with Microsoft HDInsight


Big Data is synonymous with the three Vs: Volume, Velocity & Variety. From traditional e-commerce systems to modern social networks, the way all of these systems store and conserve data depends on this platform. Let's check a scenario of modern e-commerce analytics after integration with Big Data.

[Image: bigdata]

[Image: ecommerce]

  • A Big Data platform typically works by first storing data in clusters, then processing it through MapReduce workflows. A MapReduce job executes by mapping the input data into independent chunks that are processed by appropriate algorithms; the output of the Map phase then moves to the Shuffle/Sort phase, and finally the output of the Shuffle phase arrives at the Reduce phase as input.
  • Let's check a typical Big Data MapReduce workflow (a minimal code sketch follows the figures below).

[Image: storedata]

[Image: processdata]

[Image: MR]
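
For concreteness, here is a minimal word-count sketch of that Map -> Shuffle/Sort -> Reduce flow, written against the standard Hadoop Java MapReduce API. The class names are illustrative; the input and output directories come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: split each input line into words and emit (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE); // handed to the Shuffle/Sort phase
      }
    }
  }

  // Reduce phase: receives (word, [1, 1, ...]) after shuffle/sort and sums.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class); // optional local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Run it as hadoop jar wordcount.jar WordCount <input> <output>; between the two phases the framework sorts and groups the emitted (word, 1) pairs so that each reduce call sees every count for one word.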

  • Microsoft's Big Data platform works in exactly the same way: it is a collaborative solution with Hortonworks named Microsoft HDInsight, which greatly simplifies running complex batch scripts. Let's cover a little insight into the HDInsight/Hadoop ecosystem.

[Image: HDinsight]

  • Microsoft's Big Data platform unveils solutions all the way from storing data in HDFS, to query processing with Hive, up to implementing Business Intelligence analytics with Excel PowerPivot, SSAS & SSRS.

[Image: MSBigdata]

  • Storing data into HDFS: petabytes to zettabytes of data can be stored in HDFS clusters by means of a Name Node followed by Data Nodes; in Azure HDInsight each Data Node is integrated with a worker role and compute cluster. Alternatively, you can leverage Azure Blob Storage, which is built from a Front End (attaches the OAuth/security layer for authentication), a Partition layer (for mapping to Azure Queue, Table & Blob storage), and a Stream layer (three-way replication for highly available, scaled-out data streams). A short HDFS code sketch follows the figure below.

[Image: HDFS]
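
To make the Name Node/Data Node interaction concrete, the sketch below writes a small file into HDFS and reads it back through Hadoop's org.apache.hadoop.fs.FileSystem API. The Name Node URI and file path are hypothetical; on HDInsight the same API can equally be backed by Azure Blob Storage instead of native HDFS.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.set("fs.default.name", "hdfs://namenode:9000"); // hypothetical Name Node
    FileSystem fs = FileSystem.get(conf);

    // Write: the Name Node records the metadata; the blocks land on Data Nodes.
    Path file = new Path("/demo/orders.txt");
    FSDataOutputStream out = fs.create(file, true); // true = overwrite
    out.writeBytes("orderId,amount\n1,42.50\n");
    out.close();

    // Read the file back through the same filesystem abstraction.
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
    String line;
    while ((line = in.readLine()) != null) {
      System.out.println(line);
    }
    in.close();
    fs.close();
  }
}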

  • For programming on HDInsight, you can opt for the Java, C#, F#, .NET, JavaScript, or LINQ to Hive APIs, which let you write code against Hadoop ecosystem components including Pig, Hive, Mahout, Cascading, and Pegasus.

[Image: hdinsight_API]

[Image: Microsoft's Hadoop Vision]


An Introduction to Hadoop, MapReduce, Hive, HBase, Sqoop on Windows Azure


In today's Hadoop world, MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for processes that need to analyse the whole dataset in a batch operation, especially for ad-hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

                Traditional RDBMS             MapReduce
Data size:      Gigabytes                     Petabytes
Access:         Interactive and batch         Batch
Updates:        Read and write many times     Write once, read many times
Structure:      Static schema                 Dynamic schema
Integrity:      High                          Low
Scaling:        Nonlinear                     Linear

  • Another difference between MapReduce & an RDBMS is the amount of structure in the datasets they operate on. Structured data is data organised into entities that have a defined format, such as XML documents or database tables conforming to a particular predefined schema. This is the realm of the RDBMS.
  • Semi-structured data is looser: though there may be a schema, it is often ignored, so it may serve only as a guide to the structure of the data.
  • Unstructured data does not have any particular internal structure, for example plain text or image data.
  • MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
  • Relational data is often normalized to retain its integrity & remove redundancy.
  • MapReduce is a linearly scalable programming model. The programmer's task is to write a Map function and a Reduce function (with the Shuffle in between), each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or of the cluster they are operating on, so they can be used unchanged for a small dataset and for a massive one.
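
Stated as signatures, in the conventional Hadoop formulation, that contract is a pair of typed key-value transforms (the Kn/Vn stand for arbitrary key and value types):

map:    (K1, V1)       -> list(K2, V2)
reduce: (K2, list(V2)) -> list(K3, V3)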

 

  • Apache Hadoop & the Hadoop ecosystem on the Windows Azure platform (Azure HDInsight):
  • Common: A set of components & interfaces for distributed filesystems & general I/O (serialization, Java RPC, persistent data structures).
  • Avro: A serialization system for efficient, cross-language persistent data storage.
  • MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
  • HDFS: A distributed filesystem that runs on large clusters of commodity machines.
  • Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • Hive: A distributed data warehouse. Hive manages data stored in HDFS & provides batch-style computations & ETL via HQL.
  • HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries (a point-query sketch follows this list).
  • ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives, such as distributed locks, that can be used to build distributed applications.
  • Sqoop: A tool for efficiently moving data between an RDBMS & HDFS (from SQL Server/SQL Azure/Oracle to HDFS and vice-versa).
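
As referenced in the HBase entry above, here is a minimal point-query sketch using the classic HBase Java client; the table name, column family, and row key are all hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePointQuery {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Classic HTable client; newer HBase versions obtain a Table
    // from a ConnectionFactory instead.
    HTable table = new HTable(conf, "orders"); // hypothetical table

    // Write one cell: row key "order-1", column family "d", qualifier "amount".
    Put put = new Put(Bytes.toBytes("order-1"));
    put.add(Bytes.toBytes("d"), Bytes.toBytes("amount"), Bytes.toBytes("42.50"));
    table.put(put);

    // Point query: fetch that single row by key, with low latency.
    Result result = table.get(new Get(Bytes.toBytes("order-1")));
    byte[] amount = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("amount"));
    System.out.println("amount = " + Bytes.toString(amount));

    table.close();
  }
}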

Let's check how to create a Hadoop cluster on Windows Azure HDInsight at http://www.hadooponazure.com:

[Image: HadoopAzureCluster]

 

[Image: HadoopCluster]

 

  • Check out the Interactive Console on Hadoop on Azure to execute Pig Latin scripts or Hive data warehousing queries (a programmatic Hive sketch follows the figure below).

[Image: Console]
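
The console runs these queries interactively; as a sketch of the programmatic equivalent, the same kind of HiveQL warehousing query can be submitted from Java through Hive's JDBC driver, assuming a reachable HiveServer endpoint (the host, table, and columns below are hypothetical).

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuery {
  public static void main(String[] args) throws Exception {
    // Classic HiveServer1 JDBC driver; HiveServer2 instead uses
    // org.apache.hive.jdbc.HiveDriver with a jdbc:hive2:// URL.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection con =
        DriverManager.getConnection("jdbc:hive://clusterhost:10000/default", "", "");
    Statement stmt = con.createStatement();

    // A hypothetical warehousing query over a table of page views;
    // Hive compiles it down to MapReduce jobs behind the scenes.
    ResultSet rs = stmt.executeQuery(
        "SELECT country, COUNT(*) AS hits FROM pageviews GROUP BY country");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }
    con.close();
  }
}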
