
Howto: Setting Up a Multi-Node Hadoop Cluster Under Ubuntu
This tutorial explains how to set up an Apache Hadoop cluster running on several Ubuntu machines. The Hadoop framework consists of a distributed file system (HDFS) and an implementation of the MapReduce computing paradigm.



Figure 1 gives an overview of the cluster which will be set up during this tutorial. It consists of one Ubuntu machine which will be the master and n Ubuntu machines which will be the slaves. In the figure each of the machines is divided into the HDFS layer shown at the bottom and the MapReduce layer shown at the top.

The HDFS layer consists of several data nodes which store the data files. These nodes are managed by one name node; one of its management tasks is keeping track of where files are stored. In order to increase the fault tolerance of HDFS, there exist replicas of all stored files on different data nodes as well as a secondary name node which will take the place of the name node if it fails. In the cluster shown in figure 1 the master runs the name node as well as the secondary name node. Data nodes are run on each slave and on the master, too.

The MapReduce layer consists of a job tracker which splits a MapReduce job into several map and reduce tasks which are executed on the different task trackers. In order to reduce network traffic, the task trackers run on the same machines as the data nodes. The job tracker could run on an arbitrary machine, but in the cluster shown in figure 1 it is run on the master, too.

This tutorial was tested with the following software versions:
 * Ubuntu Linux 12.04.3
 * Java version 1.8.0-ea
 * Hadoop 1.2.1

Prerequisites
The following required software has to be installed on the master and on all slaves.

Java 8
In order to install Java, two further packages have to be installed first. Therefore, execute:
sudo apt-get install software-properties-common
sudo apt-get install python-software-properties

Then, the repository containing a build of Oracle Java 8 has to be registered and the local list of available software packages has to be updated:
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update

Finally, Java 8 can be installed:
sudo apt-get install oracle-java8-installer
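The installation can be verified by printing the installed Java version; the exact version string depends on the installed build:
java -version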

SSH Server
First check if SSH is already running by executing the following for each machine which should be involved in the cluster, where address.of.machine is the address of the master or of a slave:
ssh address.of.machine
If the login on one of the machines fails, you have to install an SSH server on that machine by following http://www.cyberciti.biz/faq/ubuntu-linux-openssh-server-installation-and-configuration/

When the Hadoop cluster is started up later on, the master will establish SSH connections to all slaves, and the passphrase would have to be entered several times. In order to avoid this, SSH public key authentication can be used. Therefore, a key pair consisting of a public and a private key has to be created on the master,
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
the public key has to be appended to the list of authorized keys,
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
and the public key has to be distributed to all slaves:
ssh-copy-id -i ~/.ssh/id_rsa.pub address.of.slave

Finally, the SSH setup can be tested by connecting from the master to the master itself and to all slaves:
ssh address.of.machine
When connecting to one of the machines for the first time, a message appears telling that the authenticity of the host can not be established and asking if the connection should be continued. The question has to be answered with yes.

Install Hadoop
After fulfilling the prerequisites, Hadoop can be installed on all machines. First, the corresponding Hadoop release has to be downloaded from https://hadoop.apache.org/releases.html. In this tutorial, this can be done by executing:
wget http://mirror.synyx.de/apache/hadoop/common/hadoop-1.2.1/hadoop_1.2.1-1_x86_64.deb

The installation is started by running:
sudo dpkg --install hadoop_1.2.1-1_x86_64.deb
The success of the installation can be checked by executing hadoop, which should print the help message of this command.
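In addition, the installed release can be queried on each machine; the following command is expected to report version 1.2.1:
hadoop version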

Configure Hadoop
The configuration of Hadoop starts with setting up its environment. Therefore, some environment variables have to be set in the file hadoop-env.sh on all machines; most importantly, JAVA_HOME has to point to the Java installation.
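A minimal sketch of such an environment setup, assuming the configuration folder /etc/hadoop of the .deb installation and the install path of the Oracle Java 8 package used above (adjust both to the actual installation):
# /etc/hadoop/hadoop-env.sh
# Point Hadoop to the Java installation (path used by the oracle-java8-installer package).
export JAVA_HOME=/usr/lib/jvm/java-8-oracle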

The address of the machine where the secondary name node should be started can be defined in the file masters on the master. By default it is the same machine where the name node runs. In this tutorial this value is kept unchanged.
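For illustration, the file contains one address per line; the default entry keeps the secondary name node on the machine where the HDFS daemons are started:
localhost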

The machines running the data nodes and task trackers are defined in the file slaves on the master. The address of each machine is added on a separate line.
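Following the layout from figure 1, where the master runs a data node as well, the file could look as follows (the addresses are placeholders for the actual machine addresses):
address.of.master
address.of.slave1
address.of.slave2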

Now you have to create three further configuration files in the Hadoop configuration folder (/etc/hadoop for the .deb installation):
 * general Hadoop configurations are defined in core-site.xml:

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://address.of.master:8020</value>
    <description>The name of the default file system. Either the literal string "local" or a host:port for NDFS.</description>
    <final>true</final>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/data/hadoop/tmp</value>
    <description>A base for other temporary directories.</description>
  </property>
  <property>
    <name>fs.checkpoint.dir</name>
    <value>/data/hadoop/tmp/dfs/namesecondary</value>
    <description>Determines where on the local filesystem the DFS secondary name node should store the temporary images to merge. If this is a comma-delimited list of directories then the image is replicated in all of the directories for redundancy.</description>
  </property>
</configuration>

 
 * HDFS specific settings are defined in hdfs-site.xml:

<configuration>
  <property>
    <name>dfs.name.dir</name>
    <value>/data/hadoop/namenode</value>
    <description>Determines where on the local filesystem the DFS name node should store the name table. If this is a comma-delimited list of directories then the name table is replicated in all of the directories, for redundancy.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <value>/data/hadoop/datadir</value>
    <description>Determines where on the local filesystem a DFS data node should store its blocks. If this is a comma-delimited list of directories, then data will be stored in all named directories, typically on different devices. Directories that do not exist are ignored.</description>
    <final>true</final>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
    <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified at create time. This number should be smaller than the number of data nodes.</description>
  </property>
</configuration>

 
 * MapReduce specific settings are defined in mapred-site.xml:

<configuration>
  <property>
    <name>mapred.system.dir</name>
    <value>/mapred/mapredsystem</value>
    <description>The directory where MapReduce stores control files.</description>
    <final>true</final>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>address.of.master:9000</value>
    <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
    <final>true</final>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/data/hadoop/mapreddir</value>
    <description>The local directory where MapReduce stores intermediate data files. May be a comma-separated list of directories on different devices in order to spread disk i/o. Directories that do not exist are ignored.</description>
    <final>true</final>
  </property>
  <property>
    <name>mapred.temp.dir</name>
    <value>/data/hadoop/tmp/mapred/temp</value>
    <description>A shared directory for temporary files.</description>
  </property>
</configuration>
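The local directories referenced in these files may have to exist and be writable for the user that runs the Hadoop daemons. A minimal sketch, assuming the paths from the configuration above and a placeholder user hadoop, executed on every machine:
sudo mkdir -p /data/hadoop/tmp /data/hadoop/namenode /data/hadoop/datadir /data/hadoop/mapreddir
sudo chown -R hadoop:hadoop /data/hadoop   # replace hadoop:hadoop with the actual user and group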

Running the Hadoop cluster
Before Hadoop can be run, the HDFS has to be formatted. Therefore, the following command has to be run on the master:
hadoop namenode -format

Thereafter, the Hadoop cluster can be started by executing start-all.sh on the master. This script first runs start-dfs.sh which starts the HDFS daemons in the following order:
 * 1) the name node is run on the machine where start-dfs.sh is executed
 * 2) all data nodes are run on the machines defined in the slaves file
 * 3) secondary name nodes are run on the machines defined in the masters file

Then start-mapred.sh is run which starts the MapReduce daemons in the following order:
 * 1) the job tracker is run on the machine where start-mapred.sh is executed
 * 2) all task trackers are run on the machines defined in the slaves file

Hadoop provides three web interfaces:
 * 1) http://address.of.master:50070/ shows the information of the name node
 * 2) http://address.of.master:50030/ shows the information of the job tracker
 * 3) http://address.of.master:50060/ shows the information of all task trackers
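In addition to the web interfaces, the running daemons can be listed on each machine with the jps tool shipped with the JDK; depending on the machine's role, the output should contain NameNode, SecondaryNameNode, DataNode, JobTracker, and TaskTracker processes:
jps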

The Hadoop cluster can be stopped by executing stop-all.sh on the master, which first runs stop-mapred.sh and then stop-dfs.sh.

Additional Sources
https://github.com/renepickhardt/metalcon/wiki/setupZookeeperHadoopHbaseTomcatSolrNutch

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/