HDInsight | Anindita's Blog

A lap around the latest PowerBI annoucements , Socrata OData Feed & RealTime Fast Streaming Data Analytics

February 10, 2015 1 Comment

Last month, 27th february 2015 , there are some new awesome features connected with Microsoft PowerBI, lets have a quick look at this, first of all , in this release , the powerbi comes out of office 365 & Microsoft Office veils & you can now connect your data not only from Excel workbooks /Azure but also from PowerBI Designer files, Sendgrid, SalesForce CRM, Microsoft SQL Server Analysis Service, Azure Stream Analytics(private preview).

In the first demo, I’ve collected real time data from White House Visitors Records directory using OData feed by Socrata api using this link http://open.whitehouse.gov/OData.svc/p86s-ychb from Excel -> PowerQuery-> OData Feed or Excel-> Data-> OData Feed option.

Next, import data into PowerPivot table & build out the linked tables to put out the powerview dashboard.

Also, you can sign up for PowerBI public preview dashboard here , but make sure that the preview is now available for users in United States only.

The PowerMap tour is compiled along with the latest features introduced as Custom Maps in PowerMap & rich set of effects. The powermap tour on White House Visitors records index analysis is available on Youtube.

Upload the excel PowerView Dashboard workbook on PowerBI public preview portal & you can view the amazing experience including PowerQ&A without the environment of Office 365.

In new powerbi public preview portal , lots of option by which you can import data like SQL Server Analysis Service, Excel workbook, PowerBI Designer files, SendGrid, SalesForce CRM, Microsoft Dynamics, Marketo, GitHub, ZenDesk etc.

The new powerbi designer file is available for free download by this link & some spectacular views have been introduced in the designer preview like Tree charts, Gauge, Combo, Tabular etc.

In the next demo, I extracted real time 9-1-1 call records index data from http://data.seattle.gov/ & analysed 911 call records index over 2 days , possible report locations, types of reports all over US & of course over greater Seattle Area.

Filed under Microsoft PowerBI, Microsoft PowerBI Visuals Tagged with Azure Data Analytics, Excel DAX, HDInsight, Microsoft PowerBI, MobileBI, PowerBI Public Preview, PowerMap, PowerQ&A, PowerQuery

A Quick Walk-through on Azure Storage(SQL, NoSQL, NewSQL)

August 6, 2014 Leave a comment

It’s always imaginable that developers are always flexible to proceed with relational databases while migrating an existing on-premise app to Azure platform while leveraging best possible architectural guidelines on migration to cloud. But, still forth in real-time cases , typical scenarios like suboptimal performance, high expenses, or worse case scenario because, NOSQL db can handle some tasks more efficiently than relational databases can. In few enterprise cases, it’s encountered a critical data storage problem, as because NOSQL solution implementation have been better off before deploying its app to production.

Moreover, there’s no single best data management choice for all data storage tasks, different data management solutions are optimized for different tasks. Let’s have a quick walk-through on various data storage option models supported on Microsoft Azure.

Let’s start first by four types of NOSQL db supported now Azure.

Key/value pair databases: store a single serialized object for each key value. They’re good for storing large volumes of data in situations where you want to get one item for a given key value and you don’t have to query based on other properties of the item.

Azure Blob Storage : It’s also a key/value based data storage which is same like as file system in functionality where you could search a file based on it’s folder/file name as key not file content as key. Blob offers read-write storage options (aka Block Blob) for storing large media files as well as for standard streaming purpose facilitates the usage of VHDs as Page Blob(aka Azure Drive).

Azure Table Storage : A standard key-value pair based NOSQL storage option prevailed from Azure storage inception phase. Each value is called an entity (similar to a row, identified by a partition key and row key) and contains multiple properties (similar to columns, but not all entities in a table have to share the same columns). Querying on columns other than the key is extremely inefficient and should be avoided.

Document Databases : Popular key/value databases in which the values are documents. “Document” here isn’t used in the sense of a Word or an Excel document but means a collection of named fields and values, any of which could be a child document. For example, in an order history table, an order document might have order number, order date, and customer fields, and the customer field might have name and address fields. The database encodes field data in a format such as XML, YAML, JSON, or BSON, or it can use plain text. One feature that sets document databases apart from other key/value databases is the capability they provide to query on nonkey fields and define secondary indexes, which makes querying more efficient. This capability makes a document database more suitable for applications that need to retrieve data on the basis of criteria more complex than the value of the document key.

Example : Mongo DB.

Column-family databases : key/value pair based data storage enables to structure data based on collections of columns called ‘Column families‘. For example, a population database consists of one group of column called ‘Persons’ (containing firstname, middlename, lastname) , one group for person’s address & another for profile info. The database can then store each column family in a separate partition while keeping all of the data for one person related to the same key. You can then read all profile information without having to read through all of the name and address information as well.

Example : Cassendra , Apache HBase (in preview supported with HDInsight as NOSQL Blob Storage)

Graph databases : Stores data in form of objects & relationships.The purpose of a graph database is to enable an application to efficiently perform queries that traverse the network of objects and the relationships between them. For example, the objects might be employees in a human resources database, and you might want to facilitate queries such as “find all engineers who directly or indirectly work for Product Manager.”

Example : Neo4j Graph Database.

Compared with relational databases, the NoSQL options offer far greater scalability and are more cost effective for storage and analysis of unstructured data. The tradeoff is that they don’t provide the rich querying and robust data integrity capabilities of relational databases. NoSQL options would work well for IIS log data, which involves high volume with no need for join queries. NoSQL options would not work so well for banking transactions, which require absolute data integrity and involve many relationships to other account-related data.

A brief about NewSQL : Combines the scalability features of NOSQL along with distributed querying & transactional integrity of OldSQL.
The first type of NewSQL systems are completely new database platforms. These are designed to operate in a distributed cluster of shared-nothing nodes, in which each node owns a subset of the data. Though many of the new databases have taken different design approaches, there are two primary categories evolving. The first type of system sends the execution of transactions and queries to the nodes that contain the needed data. SQL queries are split into query fragments and sent to the nodes that own the data. These databases are able to scale linearly as additional nodes are added.
General-purpose databases

These maintain the full functionality of traditional databases, handling all types of queries. These databases are often written from scratch with a distributed architecture in mind, and include components such as distributed concurrency control, flow control, and distributed query processing. This includes Google Spanner, Clustrix, FoundationDB, NuoDB,TransLattice, ActorDB,andTrafodion.

In-memory databases

The applications targeted by these NewSQL systems are characterized as having a large number of transactions that (1) are short-lived (i.e., no user stalls), (2) touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins), and (3) are repetitive (i.e. executing the same queries with different inputs).

These NewSQL systems achieve high performance and scalability by eschewing much of the legacy architecture of the original IBM System R design, such as heavyweight recovery or concurrency control algorithms.

Example systems in this category are:VoltDB, Pivotal‘s SQLFire and GemFire XD, SAP HANA.

Example : NuoDB is supported in Azure as NewSQL.
- Key Points to Consider while choosing the Data Storage Options :

Data semantic

What is the core data storage and data access semantic (are you storing relational or unstructured data)?
Unstructured data such as media files fits best in Blob storage; a collection of related data such as products, inventories, suppliers, customer orders, etc., fits best in a relational database.
Query support

How easy is it to query the data?
What types of questions can be efficiently asked?
Key/value data stores are very good at getting a single row when given a key value, but they are not so good for complex queries. For a user-profile data store in which you are always getting the data for one particular user, a key/value data store could work well. For a product catalog from which you want to get different groupings based on various product attributes, a relational database might work better.
NoSQL databases can store large volumes of data efficiently, but you have to structure the database around how the app queries the data, and this makes ad hoc queries harder to do. With a relational database, you can build almost any kind of query.
Functional projection

Can questions, aggregations, and so on be executed on the server?
If you run SELECT COUNT(*) from a table in SQL, the DBMS will very efficiently do all the work on the server and return the number you’re looking for. If you want the same calculation from a NoSQL data store that doesn’t support aggregation, this operation is an inefficient “unbounded query” and will probably time out. Even if the query succeeds, you have to retrieve all the data from the server and bring it to the client and count the rows on the client.
What languages or types of expressions can be used?
With a relational database, you can use SQL. With some NoSQL databases, such as Azure Table storage.

Ease of scalability

How often and how much will the data need to scale?
Does the platform natively implement scale-out?
How easy is it to add or remove capacity (size and throughput)?
Relational databases and tables aren’t automatically partitioned to make them scalable, so they are difficult to scale beyond certain limitations. NoSQL data stores such as Azure Table storage inherently partition everything, and there is almost no limit to adding partitions. You can readily scale Table storage up to 200 terabytes, but the maximum database size for Azure SQL Database is 500 gigabytes. You can scale relational data by partitioning it into multiple databases, but setting up an application to support that model involves a lot of programming work.
Instrumentation and Manageability

How easy is the platform to instrument, monitor, and manage?
You need to remain informed about the health and performance of your data store, so you need to know up front what metrics a platform gives you for free and what you have to develop yourself.
Operations

How easy is the platform to deploy and run on Azure? PaaS? IaaS? Linux?
Azure Table storage and Azure SQL Database are easy to set up on Azure. Platforms that aren’t built-in Azure PaaS solutions require more effort.
API Support

Is an API available that makes it easy to work with the platform?
The Azure Table Service has an SDK with a .NET API that supports the .NET 4.5 asynchronous programming model. If you’re writing a .NET app, the work to write and test the code will be much easier for the Azure Table Service than for a key/value column data store platform that has no API or a less comprehensive one.
Transactional integrity and data consistency

Is it critical that the platform support transactions to guarantee data consistency?
For keeping track of bulk emails sent, performance and low data-storage cost might be more important than automatic support for transactions or referential integrity in the data platform, making the Azure Table Service a good choice. For tracking bank account balances or purchase orders, a relational database platform that provides strong transactional guarantees would be a better choice.
Business continuity

How easy are backup, restore, and disaster recovery?
Sooner or later production data will become corrupted and you’ll need an undo function. Relational databases often have more fine-grained restore capabilities, such as the ability to restore to a point in time. Understanding what restore features are available in each platform you’re considering is an important factor to consider.
Cost

If more than one platform can support your data workload, how do they compare in cost?
For example, if you use ASP.NET Identity, you can store user profile data in Azure Table Service or Azure SQL Database. If you don’t need the rich querying facilities of SQL Database, you might choose Azure Table storage in part because it costs much less for a given amount of storage.

Filed under Azure SQL Database / SQL Server, Microsoft Azure Tagged with Azure NOSQL, Cassendra, Couchbase, Graph DB, HBase, HDInsight, MongoDB

Big Data Analytics- PowerBI (PowerQuery & PowerMap) dashboards using HDInsight,Facebook data using Office 365

October 31, 2013 Leave a comment

In order to work on Big Data Analytics (ClickStream,Sentiment,RealTime) , its very important to work with PowerBI (PowerQuery & PowerMap) using Office 365. Microsoft Power Query Preview is an Excel add-in that enhances the self-service Business Intelligence experience in Excel by simplifying data discovery and access. Power Query enables users to easily discover, combine, and refine data for better analysis in Excel.

Power Map Preview for Excel 2013 allows you to plot geographic and temporal data visually, analyze that data in 3D, and create interactive tours to share with others.

Download PowerQuery addin for Excel 2013 from here & PowerMap addins for Excel 2013 from here.

Let me show you a brief overview about Power Query after installing the addins for excel. open the PowerQuery tab & check the data importing options.

Under Data sources option on PowerQuery , you will find options to import data from HDFS, Azure HDInsight , even from Facebook too.
For this demo, I have selected , HDFS (Hadoop distributed File System) & Azure HDInsight option.
Lets first create a HDInsight cluster in portal.

Next , provide final configuration values(Azure Storage Account Vaults [ASV] details) in order to create the cluster.

Lets integrate the HDInsight cluster data with PowerQuery of Excel. Choose ‘Data Sources‘ as ‘Azure HDInsight Cluster‘.

On Query Editor, specify the hdfs table data by selecting appropriate query & click on ‘done‘.

Next, click on PowerView in order to create an interactive dashboard with PowerView using PowerQuery from HDFS data of Azure Storage Vault(ASV).

To work with PowerMap for Office 365: after installing the powermap addin , lets choose the ‘Map‘ option from ‘insert’ tab of excel 2013.

Created a sample report using World Bank’s data on Income Basis from Windows Azure Marketplace which has been imported into hdfs.

Created a video tour of the PowerMap dashboard using World Bank’s statistical data of countries by income level. Access it here.

Filed under Azure HDInsight, Microsoft Azure, Microsoft PowerBI Visuals Tagged with Big Data Analytics, BISM, HDFS, HDInsight, PowerBI for Office 365, PowerMap, PowerQuery, PowerView, SQL Server 2014 CTP2

Anindita's Blog

A lap around the latest PowerBI annoucements , Socrata OData Feed & RealTime Fast Streaming Data Analytics

A Quick Walk-through on Azure Storage(SQL, NoSQL, NewSQL)

Big Data Analytics- PowerBI (PowerQuery & PowerMap) dashboards using HDInsight,Facebook data using Office 365

Archives

Categories

Like on Facebook

The Cloud

Recent Posts

Follow me on Twitter

Blog Traffic

Blog Stats

Follow Blog via Email

Proud to be an Indiblogger

Most Valuable Blogger

Anindita's Blog

A lap around the latest PowerBI annoucements , Socrata OData Feed & RealTime Fast Streaming Data Analytics

Share this:

A Quick Walk-through on Azure Storage(SQL, NoSQL, NewSQL)

Share this:

Big Data Analytics- PowerBI (PowerQuery & PowerMap) dashboards using HDInsight,Facebook data using Office 365

Share this:

Archives

Categories

Like on Facebook

The Cloud

Recent Posts

Follow me on Twitter

Blog Traffic

Blog Stats

Follow Blog via Email

Proud to be an Indiblogger

Most Valuable Blogger