What’s new in Azure SDK 2.5 & Visual Studio 2013 Update 4


Recently, after spending enough time with Azure Stream Analytics, it's time to move on with Azure .NET development, and a new version of the Azure SDK has been published. Let's have a quick overview of the latest Azure SDK.

First of all, let's download the SDK from the Web Platform Installer (WebPI) console, listed as 'Microsoft Azure SDK 2.5 for .NET (VS 2013)'.

[Image: Web Platform Installer showing 'Microsoft Azure SDK 2.5 for .NET (VS 2013)']

This edition adds a few new components, such as:

i) EnvironmentTools.VS.msi

ii) HiveODBC32.msi

iii) HiveODBC64.msi

iv) Microsoft.Azure.HDInsightTools-x64.msi

v) Microsoft.Azure.HDInsightTools-x86.msi

and so on.

[Image: new components installed with Azure SDK 2.5]

Now, after installing SDK 2.5, let's start with Visual Studio 2013.

[Image: Visual Studio 2013 with Azure SDK 2.5 installed]

Expand 'QuickStart' under 'Cloud' and start exploring the options to create App Service, Compute, and Data Service projects directly from VS 2013/2012 itself.

[Image: QuickStart templates under Cloud in Visual Studio]

 

The default 'DataBlobStorage1' sample project is created in VS to create a blob container, create a block blob or page blob, upload a new blob, and delete a blob (all the basic CRUD operations on blobs over REST).
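For reference, here is a minimal sketch of those blob CRUD operations using the .NET storage client library (the Microsoft.WindowsAzure.Storage package that ships with the SDK); the connection string, container, and blob names below are placeholders to fill in:

using System;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobStorageDemo
{
    static void Main()
    {
        // Placeholder connection string: swap in your storage account name and key.
        var account = CloudStorageAccount.Parse(
            "DefaultEndpointsProtocol=https;AccountName=<account>;AccountKey=<key>");
        CloudBlobClient client = account.CreateCloudBlobClient();

        // Create the blob container if it doesn't exist yet.
        CloudBlobContainer container = client.GetContainerReference("democontainer");
        container.CreateIfNotExists();

        // Upload a new block blob (UploadFromFile works for local files too).
        CloudBlockBlob blockBlob = container.GetBlockBlobReference("hello.txt");
        blockBlob.UploadText("Hello from Azure SDK 2.5!");

        // Create a page blob; its size must be a multiple of 512 bytes.
        CloudPageBlob pageBlob = container.GetPageBlobReference("demo.vhd");
        pageBlob.Create(512 * 1024);

        // Read back, then delete both blobs.
        Console.WriteLine(blockBlob.DownloadText());
        blockBlob.DeleteIfExists();
        pageBlob.DeleteIfExists();
    }
}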

[Image: DataBlobStorage1 sample project in Visual Studio]

Next, a major improvement is the Azure HDInsight integration in Visual Studio, with which you can now run your custom Hive queries against HDFS on HDInsight clusters. Let's create a sample Hive query file in VS 2013.

Let's move to the 'HDInsight' tab on the left side of VS's New Project menu and select 'Hive Application' to start with a new Hive QL file. For this demo, I am selecting the Hive Sample from VS.

[Image: HDInsight project templates in Visual Studio]

 

On selecting the Hive sample, I am able to open the sample Hive queries 'weblogAnalysis.hql' and 'sensordataAnalysis.hql' for the Azure HDInsight cluster.

Here goes a sample weblogAnalysis.hql:

DROP TABLE IF EXISTS weblogs;
-- create table weblogs on space-delimited website log data.
-- In this sample we will use the default container. You could also use 'wasb://[container]@[storage account].blob.core.windows.net/Path/To/Data/' to access the data in other containers.
CREATE EXTERNAL TABLE IF NOT EXISTS weblogs(s_date date, s_time string, s_sitename string, cs_method string, cs_uristem string,
cs_uriquery string, s_port int, cs_username string, c_ip string, cs_useragent string,
cs_cookie string, cs_referer string, cs_host string, sc_status int, sc_substatus int,
sc_win32status int, sc_bytes int, cs_bytes int, s_timetaken int )
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE LOCATION '/HdiSamples/WebsiteLogSampleData/SampleLog/'
TBLPROPERTIES ('skip.header.line.count'='2');
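Note that the table is declared EXTERNAL: Hive only maps the schema onto the log files already sitting at the given location, so dropping the table does not delete the underlying data. The 'skip.header.line.count'='2' property simply tells Hive to ignore the two header lines at the top of the sample log file.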

 

Before proceeding with live Hive queries, we need to make sure that the Azure HDInsight cluster is already provisioned; it might be a plain Hadoop HDI cluster, an HBase HDI cluster, or a Storm HDI cluster on which to build Hive tables.

[Image: sensordataAnalysis.hql open in Visual Studio]

A new option has come out for Azure HDI clusters to add custom PowerShell scripts while provisioning a cluster through the Azure portal. Other new additions to HDInsight are support for R (official CRAN packages) and Apache Spark on the HDInsight HDFS cluster, which will be covered with a demo next.

A lap around Big Data with Microsoft HDInsight


Big Data is synonymous with the three Vs: Volume, Velocity, and Variety. From traditional e-commerce systems to modern social networks, every system's data retention depends on this platform. Let's check a scenario of modern e-commerce analytics after integration with Big Data.

[Image: the three Vs of Big Data]

[Image: e-commerce analytics with Big Data]

  • A Big Data platform typically works by first storing data in clusters, then processing it through MapReduce workflows: the input data is mapped as independent chunks processed by appropriate algorithms; the output of the Map phase then moves to the Shuffle/Sort phase, and finally the Shuffle output feeds the Reduce phase as input (a minimal sketch follows this list).
  • Let's check a typical Big Data MapReduce workflow.
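To make the Map, Shuffle/Sort, and Reduce phases concrete, here is a minimal word-count sketch in C# written Hadoop-streaming style: the mapper and reducer read lines from stdin and emit tab-separated key-value pairs on stdout, while the framework's shuffle/sort groups the mapper output by key in between. The class name and command-line convention are illustrative only, not part of any SDK:

using System;

// Hadoop-streaming-style word count: run as "WordCount map" or "WordCount reduce".
// The shuffle/sort between the two phases is performed by the MapReduce framework,
// which groups and sorts the mapper's (word, 1) pairs by key before the reducer runs.
class WordCount
{
    static void Main(string[] args)
    {
        if (args.Length > 0 && args[0] == "map") Map(); else Reduce();
    }

    // Map phase: emit (word, 1) for every word on every input line.
    static void Map()
    {
        string line;
        while ((line = Console.ReadLine()) != null)
            foreach (var word in line.Split(new[] { ' ', '\t' },
                     StringSplitOptions.RemoveEmptyEntries))
                Console.WriteLine("{0}\t1", word.ToLowerInvariant());
    }

    // Reduce phase: input arrives sorted by key, so identical words are adjacent;
    // sum the counts for each run of equal keys and emit one total per word.
    static void Reduce()
    {
        string currentWord = null;
        long count = 0;
        string line;
        while ((line = Console.ReadLine()) != null)
        {
            var parts = line.Split('\t');
            if (parts[0] != currentWord)
            {
                if (currentWord != null) Console.WriteLine("{0}\t{1}", currentWord, count);
                currentWord = parts[0];
                count = 0;
            }
            count += long.Parse(parts[1]);
        }
        if (currentWord != null) Console.WriteLine("{0}\t{1}", currentWord, count);
    }
}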

[Image: storing data in clusters]

[Image: processing data with MapReduce]

[Image: MapReduce workflow]

  • Microsoft's Big Data platform works exactly the same way: it is a collaborative solution with Hortonworks named Microsoft HDInsight, which greatly simplifies running complex batch jobs. Let's get a little insight into the HDInsight/Hadoop ecosystem.

[Image: HDInsight/Hadoop ecosystem]

  • Microsoft's Big Data platform spans solutions from storing data in HDFS, to query processing in Hive, up to implementing Business Intelligence analytics with Excel PowerPivot, SSAS, and SSRS.

[Image: Microsoft Big Data platform]

  • Storing data in HDFS: petabytes to zettabytes of data are stored in HDFS clusters by means of a Name Node followed by Data Nodes; in Azure HDInsight each Data Node is integrated with worker roles and a compute cluster. Alternatively, you can leverage Azure Blob Storage, which is built from a Front End layer (attaches the OAuth/security layer for authentication), a Partition layer (maps to Azure queue, table, and blob storage), and a Stream layer (three-replica high availability for scaled-out data streams).

[Image: HDFS architecture]

  • For programming on HDInsight, you can opt for the Java, C#, F#, .NET, JavaScript, or LINQ to Hive APIs, which let you code against Hadoop ecosystem components including Pig, Hive, Mahout, Cascading, and Pegasus.

[Image: HDInsight programming APIs]

[Image: Microsoft's Hadoop vision]

An Introduction to Hadoop, MapReduce, Hive, HBase, Sqoop on Windows Azure


In today's Hadoop world, MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for processes that need to analyze a whole dataset in a batch operation, especially for ad-hoc analysis. An RDBMS is good for point queries and updates, where the dataset has been indexed to deliver low-latency retrieval and update of a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

              Traditional RDBMS             MapReduce

Data size:    Gigabytes                     Petabytes
Access:       Interactive and batch         Batch
Updates:      Read and write many times     Write once, read many times
Structure:    Static schema                 Dynamic schema
Integrity:    High                          Low
Scaling:      Nonlinear                     Linear

  • Another difference between MapReduce and an RDBMS is the amount of structure in the datasets they operate on. Structured data is organized into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS.
  • Semi-structured data is looser: though there may be a schema, it is often ignored, so the schema may be used only as a guide to the structure of the data.
  • Unstructured data does not have any particular internal structure: for example, plain text or image data.
  • MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
  • Relational data is often normalized to retain its integrity and remove redundancy.
  • MapReduce is a linearly scalable programming model. The programmer writes a Map function and a Reduce function (with the framework's Shuffle in between), each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or of the cluster they operate on, so they can be used unchanged for a small dataset and for a massive one (as in the word-count sketch above).

 

  • Apache Hadoop & the Hadoop ecosystem on the Windows Azure platform (Azure HDInsight):
  • Common: A set of components and interfaces for distributed filesystems and general I/O (serialization, Java RPC, persistent data structures).
  • Avro: A serialization system for efficient, cross-language persistent data storage.
  • MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
  • HDFS: A distributed filesystem that runs on large clusters of commodity machines.
  • Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • Hive: A distributed data warehouse. Hive manages data stored in HDFS and provides batch-style computations and ETL through its query language, HQL.
  • HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
  • ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives, such as distributed locks, that can be used to build distributed applications.
  • Sqoop: A tool for efficiently moving data between RDBMSs and HDFS (from SQL Server/SQL Azure/Oracle to HDFS and vice versa).

Let's check how to create a Hadoop cluster on Windows Azure HDInsight at http://www.hadooponazure.com:

[Image: creating a Hadoop cluster on www.hadooponazure.com]

 

[Image: Hadoop cluster dashboard]

 

  • Check out the Interactive Console in Hadoop on Azure to execute Pig Latin scripts or Hive data warehousing queries.

[Image: the Interactive Console]
