An Overview of HDInsight (Hadoop + HBase) with Integrated PowerShell and R


Recently, while starting work on predictive analytics with machine learning and R, I felt the need to integrate Azure HDInsight HBase with Azure ML features. In this demo, we'll go through a few basic operations on HDInsight (Hadoop) on Azure using Azure PowerShell 0.8.6.

To start, we first need to create an Azure storage account, which must be in the same datacenter as the HDInsight cluster (e.g. Southeast Asia for this demo).

[Image: StorageAccount]

You also need to create a blob container and a storage context object in order to copy raw data (e.g. clickstream data, log data, machine-sensor data) from the local drive to the Azure storage account.

[Image: StorageAcc]

To copy data from the local drive to the Azure storage container, use the following script.

[Image: CopyDataToBlob]

Next, we need to provision the HDInsight cluster by executing the following script.

[Image: ProvisioningCluster]

On executing the script, cluster provisioning starts and passes through the accept, configuring, and provisioning phases. You need to enter the username and password manually.

[Image: HDInsightProvision]

[Image: ClusterProvisioned]

Next, check the Azure management portal after a few minutes to confirm that provisioning has started.

[Image: Portal]

Details of HDInsight cluster provisioning, along with running HQL queries, are stored in my GitHub repository. You can get it here.

HBase columnar storage is now available as part of the Hadoop cluster offerings in HDInsight, so while provisioning a cluster from the portal, you need to choose the corresponding cluster type: HBase or Hadoop.

[Image: HBase]

Both cluster types (HBase and Hadoop) of HDInsight 3.1 are based entirely on Hortonworks HDP 2.1 clusters, which contain the following versions of the Hadoop components.

  • Apache Hadoop 2.4
  • Apache HBase 0.98.0
  • Apache Pig 0.12.1
  • Apache Hive 0.13.0
  • Apache Tez 0.4
  • Apache ZooKeeper 3.4.5
  • Hue 2.3.1
  • Storm 0.9.1
  • Apache Oozie 4.0.0
  • Apache Falcon 0.5
  • Apache Sqoop 1.4.4
  • Apache Knox 0.4
  • Apache Flume 1.4.0
  • Apache Accumulo 1.5.1
  • Apache Phoenix 4.0.0
  • Apache Avro 1.7.4
  • Apache Mahout 0.9.0
  • Third party components:
    • Ganglia 3.5.0
    • Ganglia Web 3.5.7
    • Nagios 3.5.0

In the big data analytics world, one of the most expressive languages now supported by Azure ML is R. You can install the official R distribution for Windows, Linux, or OS X, and for project work, use an R IDE.

R Packages:

R packages are self-contained units of R functionality that can be invoked as functions. A good analogy would be a .jar file in Java. There is a vast library of R packages available for a very wide range of operations, from statistical operations and machine learning to rich graphic visualization and plotting. Every package consists of one or more R functions. An R package is a reusable entity that can be shared and used by others. R users can install the package that contains the functionality they are looking for and start calling the functions in the package. A comprehensive list of these packages can be found on the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/.
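
As a minimal sketch of how a package is loaded and its functions called, the snippet below uses the stats package, which ships with every R installation. The install.packages line is left commented out because it needs network access to CRAN, and randomForest there is only an example package name; the heights vector is made up for illustration.

```r
# One-time download of a contributed package from CRAN (needs network access).
# install.packages("randomForest")

# Attach a package so its functions can be called by name.
# stats ships with base R and is normally attached already.
library(stats)

# Call functions once the package is attached.
heights <- c(150, 160, 170, 180)   # made-up sample values
avg <- mean(heights)               # mean() comes from base
spread <- sd(heights)              # sd() comes from stats

# List everything the attached package exports.
exports <- ls("package:stats")
```

Browsing the exports vector is a quick way to see which functions a package makes available before reading its documentation.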

Data Modelling with R:

Regression: In statistics, regression is a classic technique for identifying the relationship between two or more variables by fitting a straight line to the variable values. That relationship helps to predict the variable's value for future events. For example, any variable y can be modeled as a linear function of another variable x with the formula y = mx + c. Here, x is the predictor variable, y is the response variable, m is the slope of the line, and c is the intercept. Sales forecasting for products or services and stock price prediction can be done with regression. R provides regression via the lm function, which is available in R by default.
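
The formula y = mx + c can be tried out with a short sketch like the one below. The data values and the names sales, m, and c0 are made up for illustration; only lm, coef, and predict are R's own functions.

```r
# Toy data lying roughly on the line y = 2x + 1 (values are made up).
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 9.0, 10.8)
sales <- data.frame(x = x, y = y)

# Fit y = m*x + c by ordinary least squares.
fit <- lm(y ~ x, data = sales)

m  <- coef(fit)[["x"]]             # slope of the fitted line
c0 <- coef(fit)[["(Intercept)"]]   # intercept (named c0 to avoid masking c())

# Predict the response for a predictor value not seen in the data.
predicted <- predict(fit, newdata = data.frame(x = 6))
```

Calling summary(fit) afterwards prints the coefficients together with their standard errors and the R-squared of the fit.
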
Classification: This is a machine-learning technique for labeling a set of observations based on training examples. With it, we can classify observations into one or more labels. Predicting the likelihood of sales, online fraud detection, and cancer classification (in medical science) are common applications of classification. Google Mail uses this technique to classify e-mails as spam or not. Classification can be done with functions such as glm, glmnet, ksvm, svm, and randomForest in R.
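
A small two-label example, sketched here with glm (logistic regression) since it needs no extra packages. The visits data frame, its column names, and the 0.5 threshold are assumptions chosen for illustration, not part of any real dataset.

```r
# Made-up training data: hours of site activity and whether a purchase
# happened (1) or not (0). The two labels overlap, so the fit is well defined.
visits <- data.frame(
  hours  = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0),
  bought = c(0,   0,   0,   1,   0,   1,   0,   1,   1,   1)
)

# Logistic regression: models the probability of the label 1.
model <- glm(bought ~ hours, data = visits, family = binomial)

# Probability of label 1 for two new observations.
new_obs <- data.frame(hours = c(1, 4.5))
prob <- predict(model, newdata = new_obs, type = "response")

# Convert probabilities to labels with a 0.5 threshold (an arbitrary choice).
label <- ifelse(prob > 0.5, 1, 0)
```

The other functions named above follow the same train-then-predict pattern, but they live in contributed packages that have to be installed first.
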
Clustering: This technique is about organizing similar items from a given collection into groups. User segmentation and image compression are the most common applications of clustering. Market segmentation, social network analysis, organizing computer clusters, and astronomical data analysis are further applications. Google News uses these techniques to group similar news items into the same category. Clustering can be done with the kmeans, dist, pvclust, and Mclust functions in R; knn, although often listed alongside them, is a nearest-neighbor classifier rather than a clustering method.
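
To make the grouping idea concrete, here is a minimal kmeans sketch. The customers matrix and its two columns are invented for the example, and the choice of two clusters is an assumption that happens to match how the toy data was built.

```r
set.seed(42)  # kmeans chooses random starting centers

# Made-up customers described by (age, monthly spend); two obvious groups.
customers <- rbind(
  c(22, 150), c(25, 180), c(23, 160), c(24, 170),
  c(51, 820), c(48, 790), c(53, 860), c(50, 800)
)
colnames(customers) <- c("age", "spend")

# Partition the rows into 2 clusters, keeping the best of 10 random starts.
km <- kmeans(customers, centers = 2, nstart = 10)

groups  <- km$cluster   # cluster id assigned to each row
centers <- km$centers   # mean (age, spend) of each cluster
```

On real data the columns usually need scaling first (for example with scale()), otherwise the column with the largest range dominates the distance calculation.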

Recommendation: Recommendation algorithms power recommender systems, which are among the most immediately recognizable machine learning techniques in use today. Web content recommendations may include similar websites, blogs, videos, or related content. Recommending online items can also help with cross-selling and up-selling. We have all seen online shopping portals that attempt to recommend books, mobiles, or any other items sold on the Web based on the user's past behavior. Amazon is a well-known e-commerce portal that reportedly generates 29 percent of its sales through recommendation systems. Recommender systems can be implemented with the Recommender() function from the recommenderlab package in R.


About Anindita
Anindita Basak works as a Big Data Cloud Consultant at Microsoft. She has worked in multiple MNCs as a Developer and Senior Developer on Microsoft Azure, data platform, IoT, BI, data visualization, data warehousing, ETL, and of course the Hadoop platform. She has worked as both an FTE and a v- employee in the Azure platform teams at Microsoft. She is passionate about .NET, Java, Python, and data science. She is also an active big data and cloud trainer and would love to share her experience in the IT training industry. She is an author, forum contributor, blogger, and technical reviewer of various books on Big Data Hadoop, HDInsight, IoT, data science, SQL Server PDW, and Power BI.
