Hadoop Administration: Installation Scripts for Apache Hadoop (2.6.0) on Ubuntu Utopic Unicorn (14.10) as a Multi-Node Cluster


I recently published a quick step-by-step guide on deploying an Apache Hadoop (2.6.0) single-node cluster on an Ubuntu Utopic Unicorn (14.10) image; you can get the full installation video here.

 

Here, the full deployment details of an Apache Hadoop (2.6.0) multi-node cluster setup are provided. The primary hardware requirements for running the setup are:

1. VMware Player/Workstation (if Windows/Linux) or VMware Fusion (if OS X)

2. More than 4 GB of RAM for primary OS

3. More than 60 GB of Disk space

4. Intel VT-X capable processor.

5. Ubuntu/CentOS/Red Hat/SUSE OS image (as guest OS)

Now, the step-by-step multi-node Hadoop clustering scripts are provided.

 

Check the IP address of the master and of each slave node:

$ ifconfig

Namenode  > hadoopmaster > 192.168.23.132

Datanodes > hadoopslave1 > 192.168.23.133
          > hadoopslave2 > 192.168.23.134
          > hadoopslave3 > 192.168.23.135

Clone the Hadoop single-node cluster as hadoopmaster.

Hadoopmaster Node

$ sudo gedit /etc/hosts

192.168.23.132   hadoopmaster
192.168.23.133   hadoopslave1
192.168.23.134   hadoopslave2
192.168.23.135   hadoopslave3

$ sudo gedit /etc/hostname

hadoopmaster

$ cd /usr/local/hadoop/etc/hadoop

$ sudo gedit core-site.xml

Replace localhost with hadoopmaster.
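For reference, after this change the fs.default.name property in core-site.xml should look roughly like the following (a sketch, assuming the single-node clone used port 9000 as in the single-node section later in this post):

<property>
<name>fs.default.name</name>
<value>hdfs://hadoopmaster:9000</value>
</property>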

$ sudo gedit hdfs-site.xml

Replace the replication value with 3 (the number of datanodes).
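The replication property should then read as follows:

<property>
<name>dfs.replication</name>
<value>3</value>
</property>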

$ sudo gedit yarn-site.xml

Add the following configuration:

<configuration>
<property>
<name>yarn.resourcemanager.resource-tracker.address</name>
<value>hadoopmaster:8025</value>
</property>
<property>
<name>yarn.resourcemanager.scheduler.address</name>
<value>hadoopmaster:8030</value>
</property>
<property>
<name>yarn.resourcemanager.address</name>
<value>hadoopmaster:8050</value>
</property>
</configuration>

$ sudo gedit mapred-site.xml.template

Replace the property name mapreduce.framework.name with mapred.job.tracker.

Replace the value yarn with hadoopmaster:54311.
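With both replacements applied, the resulting property should look roughly like this:

<property>
<name>mapred.job.tracker</name>
<value>hadoopmaster:54311</value>
</property>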

$ sudo rm -rf /usr/local/hadoop/hadoop_data

Shut down the hadoopmaster node.

Clone Hadoopmaster Node as hadoopslave1, hadoopslave2, hadoopslave3

Hadoopslave Nodes (the following configuration should be done on each slave node)

$ sudo gedit /etc/hostname

hadoopslave<nodenumberhere>

$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/datanode

$ sudo chown -R trainer:trainer /usr/local/hadoop

$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Remove the namenode directory property section (the slaves run only datanodes).
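As a rough sketch, a slave's hdfs-site.xml would then keep only the replication factor and the datanode directory (assuming the clone's config uses the dfs.data.dir naming shown in the single-node section later in this post, and the datanode path created above):

<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<!-- property name may differ (e.g. dfs.datanode.data.dir) depending on the clone's config -->
<name>dfs.data.dir</name>
<value>file:///usr/local/hadoop/hadoop_data/hdfs/datanode</value>
</property>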

reboot all nodes

Hadoopmaster Node

$ sudo gedit /usr/local/hadoop/etc/hadoop/masters

hadoopmaster

$ sudo gedit /usr/local/hadoop/etc/hadoop/slaves

Remove localhost and add:

hadoopslave1
hadoopslave2
hadoopslave3

$ sudo gedit /usr/local/hadoop/etc/hadoop/hdfs-site.xml

Remove the datanode directory property section (the master runs only the namenode).
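Correspondingly, a sketch of the master's hdfs-site.xml after this edit (again assuming the clone's property naming, with the namenode path created in the next step):

<property>
<name>dfs.replication</name>
<value>3</value>
</property>
<property>
<!-- property name may differ (e.g. dfs.namenode.name.dir) depending on the clone's config -->
<name>dfs.name.dir</name>
<value>file:///usr/local/hadoop/hadoop_data/hdfs/namenode</value>
</property>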

$ sudo mkdir -p /usr/local/hadoop/hadoop_data/hdfs/namenode

$ sudo chown -R trainer:trainer /usr/local/hadoop

$ sudo ssh-copy-id -i ~/.ssh/id_dsa.pub trainer@hadoopmaster

$ sudo ssh-copy-id -i ~/.ssh/id_dsa.pub trainer@hadoopslave1

$ sudo ssh-copy-id -i ~/.ssh/id_dsa.pub trainer@hadoopslave2

$ sudo ssh-copy-id -i ~/.ssh/id_dsa.pub trainer@hadoopslave3

$ ssh hadoopmaster

$ exit

$ ssh hadoopslave1

$ exit

$  ssh hadoopslave2

$ exit

$ ssh hadoopslave3

$ exit

$ hadoop namenode -format

$ start-all.sh

$ jps (check on the master and on all 3 datanodes)
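If everything started cleanly, jps should report roughly the following daemons (process IDs will differ):

On hadoopmaster: NameNode, SecondaryNameNode, ResourceManager, Jps
On each hadoopslave: DataNode, NodeManager, Jps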


To check the Hadoop web consoles, open:

http://<hadoopmaster-ip-address>:8088/ (YARN cluster)
http://<hadoopmaster-ip-address>:50070/ (namenode)
http://<hadoopmaster-ip-address>:50090/ (checkpoint namenode)
http://<hadoopmaster-ip-address>:50075/ (datanode)

 

Installation Commands of Apache Hadoop 2.6.0 as Single Node Pseudo-Distributed mode on Ubuntu 14.10 (Step by Step)


$ sudo apt-get update

$ sudo apt-get install default-jdk

$ java -version

$ sudo apt-get install ssh

$ sudo apt-get install rsync

$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa

$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys

$ wget -c http://mirror.olnevhost.net/pub/apache/hadoop/common/current/hadoop-2.6.0.tar.gz

$ sudo tar -zxvf hadoop-2.6.0.tar.gz

$ sudo mv hadoop-2.6.0 /usr/local/hadoop

$ update-alternatives --config java

$ sudo gedit ~/.bashrc

#Hadoop Variables
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin
export PATH=$PATH:$HADOOP_HOME/sbin
export HADOOP_MAPRED_HOME=$HADOOP_HOME
export HADOOP_COMMON_HOME=$HADOOP_HOME
export HADOOP_HDFS_HOME=$HADOOP_HOME
export YARN_HOME=$HADOOP_HOME
export HADOOP_COMMON_LIB_NATIVE_DIR=$HADOOP_HOME/lib/native
export HADOOP_OPTS="-Djava.library.path=$HADOOP_HOME/lib"

Now apply the variables.

$ source ~/.bashrc
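As an optional sanity check, confirm the hadoop binary is now on the PATH:

$ hadoop version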

There are a number of configuration files within the Hadoop folder that require editing:

  • mapred-site.xml
  • yarn-site.xml
  • core-site.xml
  • hdfs-site.xml
  • hadoop-env.sh

The files can be found in /usr/local/hadoop/etc/hadoop/. First, copy the mapred-site.xml template file over and then edit it.

mapred-site.xml

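A minimal way to make the copy (assuming the default paths used above) is:

$ cp /usr/local/hadoop/etc/hadoop/mapred-site.xml.template /usr/local/hadoop/etc/hadoop/mapred-site.xml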

Next, go to the following path.

$ cd /usr/local/hadoop/etc/hadoop

Add the following text between the <configuration> tags of mapred-site.xml.

<property>
<name>mapreduce.framework.name</name>
<value>yarn</value>
</property>

yarn-site.xml

Add the following text between the <configuration> tags.

<property>
<name>yarn.nodemanager.aux-services</name>
<value>mapreduce_shuffle</value>
</property>

core-site.xml

Add the following text between the <configuration> tags.
<property>
<name>fs.default.name</name>
<value>hdfs://localhost:9000</value>
</property>

hdfs-site.xml

Add the following text between the <configuration> tags.

<property>
<name>dfs.replication</name>
<value>1</value>
</property>

<property>
<name>dfs.name.dir</name>
<value>file:///home/hadoopuser/hadoopspace/hdfs/namenode</value>
</property>

<property>
<name>dfs.data.dir</name>
<value>file:///home/hadoopuser/hadoopspace/hdfs/datanode</value>
</property>

Note that other locations can be used for HDFS storage by separating the values with commas, e.g.

file:///home/hadoopuser/hadoopspace/hdfs/datanode, /disk2/Hadoop/datanode, …
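As a sketch, with a hypothetical second disk mounted at /disk2 the property could look like this:

<property>
<name>dfs.data.dir</name>
<!-- /disk2/hadoop/datanode is a hypothetical second location -->
<value>file:///home/hadoopuser/hadoopspace/hdfs/datanode,file:///disk2/hadoop/datanode</value>
</property>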

hadoop-env.sh

Add an entry for JAVA_HOME (the same JDK path used in ~/.bashrc):

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64

$ mkdir -p /home/hadoopuser/hadoopspace/hdfs/namenode

$ mkdir -p /home/hadoopuser/hadoopspace/hdfs/datanode

$ sudo chown hadoopuser:hadoopuser -R /usr/local/hadoop

Next, format the namenode:

$ hdfs namenode -format

Issue the following commands.

$ start-dfs.sh
$ start-yarn.sh


Issue the jps command and verify that the following jobs are running:

$ jps
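On a pseudo-distributed single node, jps should list roughly the following (process IDs will differ):

NameNode
DataNode
SecondaryNameNode
ResourceManager
NodeManager
Jps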

At this point, Hadoop has been installed and configured. To open the web interfaces, type in a terminal:

firefox http://localhost:50070 (namenode)

firefox http://localhost:50075 (datanode)

firefox http://localhost:50090 (checkpoint namenode)

firefox http://localhost:8088 (YARN cluster)


An Introduction to Hadoop, MapReduce, Hive, HBase, Sqoop on Windows Azure


In today's Hadoop world, MapReduce can be seen as a complement to an RDBMS. MapReduce is a good fit for processes that need to analyse the whole dataset in a batch operation, especially for ad-hoc analysis. An RDBMS is good for point queries or updates, where the dataset has been indexed to deliver low-latency retrieval and update times for a relatively small amount of data. MapReduce suits applications where the data is written once and read many times, whereas a relational database is good for datasets that are continually updated.

             Traditional RDBMS              MapReduce

Data size:   Gigabytes                      Petabytes

Access:      Interactive and batch          Batch

Updates:     Read and write many times      Write once, read many times

Structure:   Static schema                  Dynamic schema

Integrity:   High                           Low

Scaling:     Nonlinear                      Linear

  • Another difference between MapReduce & an RDBMS is the amount of structure in the datasets that they operate on. Structured data is data that is organised into entities that have a defined format, such as XML documents or database tables that conform to a particular predefined schema. This is the realm of the RDBMS.
  • Semi-structured data is looser; though there may be a schema, it is often ignored and may be used only as a guide to the structure of the data.
  • Unstructured data does not have any particular internal structure, for example, plain text or image data.
  • MapReduce works well on unstructured or semi-structured data, since it is designed to interpret the data at processing time. In other words, the input keys and values for MapReduce are not an intrinsic property of the data; they are chosen by the person analyzing the data.
  • Relational data is often normalized to retain its integrity & remove redundancy.
  • MapReduce is a linearly scalable programming model. The programmer writes a map function and a reduce function, connected by a shuffle, each of which defines a mapping from one set of key-value pairs to another. These functions are oblivious to the size of the data or the cluster they are operating on, so they can be used unchanged for a small dataset and for a massive one.

 

  • Apache Hadoop & Hadoop Ecosystem on Windows Azure Platform (Azure HDInsight):
  • Common: A set of operations & interfaces for distributed filesystems & general I/O (serialization, Java RPC, persistent data structures)
  • Avro: A serialization system for efficient, cross-language persistent data storage.
  • MapReduce: A distributed data processing model and execution environment that runs on large clusters of commodity machines.
  • HDFS: A distributed filesystem that runs on large clusters of commodity machines.
  • Pig: A data flow language and execution environment for exploring very large datasets. Pig runs on HDFS and MapReduce clusters.
  • Hive: A distributed data warehouse. Hive manages data stored in HDFS & provides batch-style computations & ETL via HQL.
  • HBase: A distributed, column-oriented database. HBase uses HDFS for its underlying storage and supports both batch-style computations using MapReduce and point queries.
  • ZooKeeper: A distributed, highly available coordination service. ZooKeeper provides primitives such as distributed locks that can be used to build distributed applications.
  • Sqoop: A tool for efficiently moving data between RDBMS & HDFS (from SQL Server/SQL Azure/Oracle to HDFS and vice versa).

Let's create a Hadoop cluster on Windows Azure HDInsight at http://www.hadooponazure.com:


  • Check out the Interactive Console on Hadoop on Azure to execute Pig Latin scripts or Hive data warehousing queries.
