An OverView of HDInsight (Hadoop+HBase) with Integrated PowerShell along with R

Recently, while started the work with Predictive Analytic s with Machine Learning & R , felt the necessity of integration of Azure HDInsight-HBase with Azure ML features. In this demo, we ‘ll go through few basic understandings of operations on HDInsight(Hadoop) on Azure with PowerShell 0.8.6.

To start with, first we need to create an azure storage account which must be in same datacenter (e.g SouthEast Asia for this demo) of HDInsight cluster.



You need also create a blob container & storage context object in order to copy raw data (e.g Click Stream data, log data, machine-sensor data) to local drive to azure storage account.




To Copy data from local drive to Azure Storage container , use the following script.




Next, we need to provision the HDInsight cluster , for that need to execute the following script.


Upon, executing the script, the cluster provisioning is started from accept, configuring , provisioning phase. You need to assign the username & password manually.




Next, check in Azure management portal after few mins, the provisioning have been started.


Details of HDInsight cluster provisioning along with running HQL queries is stored in my github repository. You can get it here.

Now, HBase columnar storage is available as a part of hadoop cluster from HDInsight offerings, so while provisioning cluster from portal , you need the corresponding cluster type – HBase or Hadoop.


Both of cluster type(either HBase or Hadoop) of HDInsight 3.1 is completely based of pure Hortonworks HDP 2.1 clusters which contains the hadoop components of the following version.

  • Apache Hadoop 2.4
  • Apache HBase 0.98.0
  • Apache Pig 0.12.1
  • Apache Hive 0.13.0
  • Apache Tez 0.4
  • Apache ZooKeeper 3.4.5
  • Hue 2.3.1
  • Storm 0.9.1
  • Apache Oozie 4.0.0
  • Apache Falcon 0.5
  • Apache Sqoop 1.4.4
  • Apache Knox 0.4
  • Apache Flume 1.4.0
  • Apache Accumulo 1.5.1
  • Apache Phoenix 4.0.0
  • Apache Avro 1.7.4
  • Apache Mahout 0.9.0
  • Third party components:
    • Ganglia 3.5.0
    • Ganglia Web 3.5.7
    • Nagios 3.5.0


    For Big Data analytics world , one of the most fine-grained language that supports now with Azure ML is R. You can install R official packages for Windows, Linux & OS X, also for official project perspective , use R IDE.

    R Packages:

    R packages are self-contained units of R functionality that can be invoked as functions. A good analogy would be a .jar file in Java. There is a vast library of
    R packages available for a very wide range of operations ranging from statistical operations and machine learning to rich graphic visualization and plotting. Every package will consist of one or more R functions. An R package is a re-usable entity that can be shared and used by others. R users can install the package that contains the functionality they are looking for and start calling the functions in the package. A comprehensive list of these packages can be found at called Comprehensive R Archive Network (CRAN).

    Data Modelling with R:

    Regression: In statistics, regression is a classic technique to identify the scalar relationship between two or more variables by fitting the state line on the
    variable values. That relationship will help to predict the variable value for future events. For example, any variable y can be modeled as linear function
    of another variable x with the formula y = mx+c. Here, x is the predictor variable, y is the response variable, m is slope of the line, and c is the
    intercept. Sales forecasting of products or services and predicting the price of stocks can be achieved through this regression. R provides this regression
    feature via the lm method, which is by default present in R.
    Classification: This is a machine-learning technique used for labeling the set of observations provided for training examples. With this, we can classify
    the observations into one or more labels. The likelihood of sales, online fraud detection, and cancer classification (for medical science) are common
    applications of classification problems. Google Mail uses this technique to classify e-mails as spam or not. Classification features can be served by glm,
    glmnet, ksvm, svm, and randomForest in R.
    Clustering: This technique is all about organizing similar items into groups from the given collection of items. User segmentation and image
    compression are the most common applications of clustering. Market segmentation, social network analysis, organizing the computer clustering,
    and astronomical data analysis are applications of clustering. Google News uses these techniques to group similar news items into the same category.
    Clustering can be achieved through the knn, kmeans, dist, pvclust, and Mclust methods in R.

    Recommendation: The recommendation algorithms are used in recommender systems where these systems are the most immediately recognizable machine learning techniques in use today. Web content recommendations may include similar websites, blogs, videos, or related content. Also, recommendation of online items can be helpful for cross-selling and up-selling. We have all seen online shopping portals that attempt to recommend books, mobiles, or any items that can be sold on the Web based on the user’s past behavior. Amazon is a well-known e-commerce portal that generates 29 percent of sales through recommendation systems. Recommender systems can be implemented via Recommender()with the recommenderlab package in R.



A Quick Walk-through on Azure Storage(SQL, NoSQL, NewSQL)

It’s always imaginable that developers are always flexible to proceed with relational databases while migrating an existing on-premise app to Azure platform while leveraging best possible architectural guidelines on migration to cloud. But, still forth in real-time cases , typical scenarios like suboptimal performance, high expenses, or worse case scenario because, NOSQL db can handle some tasks more efficiently than relational databases can. In few enterprise cases, it’s encountered a critical data storage problem, as because NOSQL solution implementation have been better off before deploying its app to production.

Moreover, there’s no single best data management choice for all data storage tasks, different data management solutions are optimized for different tasks. Let’s have a quick walk-through on various data storage option models supported on Microsoft Azure.




Let’s start first by four types of NOSQL db supported now Azure.

  • Key/value pair databases: store a single serialized object for each key value. They’re good for storing large volumes of data in situations where you want to get one item for a given key value and you don’t have to query based on other properties of the item.
  • Azure Blob Storage : It’s also a key/value based data storage which is same like as file system in functionality where you could search a file based on it’s folder/file name as key not file content as key. Blob offers read-write storage options (aka Block Blob) for storing large media files as well as for standard streaming purpose facilitates the usage of VHDs as Page Blob(aka Azure Drive).
  • Azure Table Storage : A standard key-value pair based NOSQL storage option prevailed from Azure storage inception phase. Each value is called an entity (similar to a row, identified by a partition key and row key) and contains multiple properties (similar to columns, but not all entities in a table have to share the same columns). Querying on columns other than the key is extremely inefficient and should be avoided.
  • Document Databases : Popular key/value databases in which the values are documents. “Document” here isn’t used in the sense of a Word or an Excel document but means a collection of named fields and values, any of which could be a child document. For example, in an order history table, an order document might have order number, order date, and customer fields, and the customer field might have name and address fields. The database encodes field data in a format such as XML, YAML, JSON, or BSON, or it can use plain text. One feature that sets document databases apart from other key/value databases is the capability they provide to query on nonkey fields and define secondary indexes, which makes querying more efficient. This capability makes a document database more suitable for applications that need to retrieve data on the basis of criteria more complex than the value of the document key.

Example : Mongo DB.

  • Column-family databases : key/value pair based data storage enables to structure data based on collections of columns called ‘Column families‘. For example, a population database consists of one group of column called ‘Persons’ (containing firstname, middlename, lastname) , one group for person’s address & another for profile info. The database can then store each column family in a separate partition while keeping all of the data for one person related to the same key. You can then read all profile information without having to read through all of the name and address information as well.

Example : Cassendra , Apache HBase (in preview supported with HDInsight as NOSQL Blob Storage)

  • Graph databases : Stores data in form of objects & relationships.The purpose of a graph database is to enable an application to efficiently perform queries that traverse the network of objects and the relationships between them. For example, the objects might be employees in a human resources database, and you might want to facilitate queries such as “find all engineers who directly or indirectly work for Product Manager.”

Example : Neo4j Graph Database.

Compared with relational databases, the NoSQL options offer far greater scalability and are more cost effective for storage and analysis of unstructured data. The tradeoff is that they don’t provide the rich querying and robust data integrity capabilities of relational databases. NoSQL options would work well for IIS log data, which involves high volume with no need for join queries. NoSQL options would not work so well for banking transactions, which require absolute data integrity and involve many relationships to other account-related data.

  • A brief about NewSQL : Combines the scalability features of NOSQL along with distributed querying & transactional integrity of OldSQL.
  • The first type of NewSQL systems are completely new database platforms. These are designed to operate in a distributed cluster of shared-nothing nodes, in which each node owns a subset of the data. Though many of the new databases have taken different design approaches, there are two primary categories evolving. The first type of system sends the execution of transactions and queries to the nodes that contain the needed data. SQL queries are split into query fragments and sent to the nodes that own the data. These databases are able to scale linearly as additional nodes are added.
  • General-purpose databases
    These maintain the full functionality of traditional databases, handling all types of queries. These databases are often written from scratch with a distributed architecture in mind, and include components such as distributed concurrency control, flow control, and distributed query processing. This includes Google Spanner, Clustrix, FoundationDB, NuoDB,TransLattice, ActorDB,andTrafodion.
    In-memory databases
    The applications targeted by these NewSQL systems are characterized as having a large number of transactions that (1) are short-lived (i.e., no user stalls), (2) touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins), and (3) are repetitive (i.e. executing the same queries with different inputs).
    These NewSQL systems achieve high performance and scalability by eschewing much of the legacy architecture of the original IBM System R design, such as heavyweight recovery or concurrency control algorithms.
    Example systems in this category are:VoltDB, Pivotal‘s SQLFire and GemFire XD, SAP HANA.
    Example : NuoDB is supported in Azure as NewSQL.
    • Key Points to Consider while choosing the Data Storage Options :

Data semantic

What is the core data storage and data access semantic (are you storing relational or unstructured data)?
Unstructured data such as media files fits best in Blob storage; a collection of related data such as products, inventories, suppliers, customer orders, etc., fits best in a relational database.
Query support

How easy is it to query the data?
What types of questions can be efficiently asked?
Key/value data stores are very good at getting a single row when given a key value, but they are not so good for complex queries. For a user-profile data store in which you are always getting the data for one particular user, a key/value data store could work well. For a product catalog from which you want to get different groupings based on various product attributes, a relational database might work better.
NoSQL databases can store large volumes of data efficiently, but you have to structure the database around how the app queries the data, and this makes ad hoc queries harder to do. With a relational database, you can build almost any kind of query.
Functional projection

Can questions, aggregations, and so on be executed on the server?
If you run SELECT COUNT(*) from a table in SQL, the DBMS will very efficiently do all the work on the server and return the number you’re looking for. If you want the same calculation from a NoSQL data store that doesn’t support aggregation, this operation is an inefficient “unbounded query” and will probably time out. Even if the query succeeds, you have to retrieve all the data from the server and bring it to the client and count the rows on the client.
What languages or types of expressions can be used?
With a relational database, you can use SQL. With some NoSQL databases, such as Azure Table storage.

Ease of scalability

How often and how much will the data need to scale?
Does the platform natively implement scale-out?
How easy is it to add or remove capacity (size and throughput)?
Relational databases and tables aren’t automatically partitioned to make them scalable, so they are difficult to scale beyond certain limitations. NoSQL data stores such as Azure Table storage inherently partition everything, and there is almost no limit to adding partitions. You can readily scale Table storage up to 200 terabytes, but the maximum database size for Azure SQL Database is 500 gigabytes. You can scale relational data by partitioning it into multiple databases, but setting up an application to support that model involves a lot of programming work.
Instrumentation and Manageability

How easy is the platform to instrument, monitor, and manage?
You need to remain informed about the health and performance of your data store, so you need to know up front what metrics a platform gives you for free and what you have to develop yourself.

How easy is the platform to deploy and run on Azure? PaaS? IaaS? Linux?
Azure Table storage and Azure SQL Database are easy to set up on Azure. Platforms that aren’t built-in Azure PaaS solutions require more effort.
API Support

Is an API available that makes it easy to work with the platform?
The Azure Table Service has an SDK with a .NET API that supports the .NET 4.5 asynchronous programming model. If you’re writing a .NET app, the work to write and test the code will be much easier for the Azure Table Service than for a key/value column data store platform that has no API or a less comprehensive one.
Transactional integrity and data consistency

Is it critical that the platform support transactions to guarantee data consistency?
For keeping track of bulk emails sent, performance and low data-storage cost might be more important than automatic support for transactions or referential integrity in the data platform, making the Azure Table Service a good choice. For tracking bank account balances or purchase orders, a relational database platform that provides strong transactional guarantees would be a better choice.
Business continuity

How easy are backup, restore, and disaster recovery?
Sooner or later production data will become corrupted and you’ll need an undo function. Relational databases often have more fine-grained restore capabilities, such as the ability to restore to a point in time. Understanding what restore features are available in each platform you’re considering is an important factor to consider.

If more than one platform can support your data workload, how do they compare in cost?
For example, if you use ASP.NET Identity, you can store user profile data in Azure Table Service or Azure SQL Database. If you don’t need the rich querying facilities of SQL Database, you might choose Azure Table storage in part because it costs much less for a given amount of storage.

%d bloggers like this: