Saturday, April 27, 2013

Hadoop on Multi Node

This tutorial provides a step-by-step way to install multi-node Hadoop on Ubuntu 10.04.

1. Install single-node Hadoop on all nodes before following this tutorial.

2. If you followed my single-node tutorial and installed Ubuntu 10.04 on VMware Workstation, you can clone your single-node machine using the method shown in the figure below.

[Figure: cloning the single-node virtual machine in VMware Workstation]

3. Name the cloned virtual machine slave. The name can be changed from System > Administration > Users and Groups.
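If the graphical tool is not available, the machine name can also be changed from a terminal. A minimal sketch, assuming the hduser account has sudo rights:
  • sudo nano /etc/hostname # replace the old name with slave
  • sudo reboot # so the new hostname takes effect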

1. Networking of Master and Slave

1.1 All nodes should be accessible from each other on the network.
1.2 Add the IP addresses of both the master and slave machines to /etc/hosts on all machines using the following commands.
  • ifconfig # on all machines to know their IP.
  • sudo nano /etc/hosts
Add the following lines to the file (substitute your own IP addresses):
  • 192.168.216.135 master
  • 192.168.216.136 slave
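To confirm that the entries work, ping each machine by name from the other (assuming the IP addresses above match your own setup):
  • ping -c 3 master
  • ping -c 3 slave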

2. SSH Access

SSH access should be enabled from the master to the slave so that jobs can be transferred from the master to the slave and vice versa.
2.1 On the master, add the master's public RSA key to the slave's authorized keys using the following commands.
  • su - hduser
  • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
Now check that passwordless SSH works with the following commands.
  • ssh master
  • ssh slave
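If ssh-copy-id reports that no key exists, the single-node setup may not have generated one; a passwordless RSA key can be created first (a sketch, run as hduser on the master):
  • ssh-keygen -t rsa -P "" -f $HOME/.ssh/id_rsa
  • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave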

3. Configuration of Hadoop

3.1 Add the name of the master to the conf/masters file on the master node only, using the following commands.
  • su - hduser
  • cd /home/hadoop/conf
  • nano masters
Add the name of the master, in this case:
  • master
and then

  • nano slaves
Add the names of the slaves, in this case:
  • master
  • slave
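You can print both files to make sure the hostnames were saved correctly:
  • cat masters
  • cat slaves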
3.2 On all machines, change the Hadoop configuration files as follows, just as in the single-node installation.

In conf/core-site.xml (only the value line needs to change):

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system. A URI whose scheme and authority determine the FileSystem implementation. The uri's scheme determines the config property (fs.SCHEME.impl) naming the FileSystem implementation class. The uri's authority is used to determine the host, port, etc. for a filesystem.</description>
</property>

In conf/mapred-site.xml (only the value line needs to change):

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs at. If "local", then jobs are run in-process as a single map and reduce task.</description>
</property>

In conf/hdfs-site.xml:

Set the replication value (dfs.replication) to 2, which was 1 earlier.
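For reference, the relevant property block in conf/hdfs-site.xml should end up looking roughly like this (the description text may differ in your copy):

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication. The actual number of replications can be specified when the file is created. The default is used if replication is not specified in create time.</description>
</property>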

3.3 Format the NameNode

  • /home/hadoop/bin/hadoop namenode -format
3.4 If the NameNode does not format successfully, delete the contents of the Hadoop temp folder:
  • cd /app/hadoop/tmp
  • rm -R dfs
  • rm -R mapred
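Stale data from the single-node setup may also be present on the slave, so it is safest (assuming the same /app/hadoop/tmp path from the single-node tutorial) to clear the temp folder on every node and then format again from the master:
  • rm -R /app/hadoop/tmp/dfs # on every node
  • rm -R /app/hadoop/tmp/mapred # on every node
  • /home/hadoop/bin/hadoop namenode -format # on the master only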

4. Starting the Multi-Node Hadoop Cluster

4.1 Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.
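You can check which Java daemons came up with the jps command (part of the JDK). Since both master and slave are listed in conf/slaves, the master should show a NameNode, a SecondaryNameNode and a DataNode, while the slave should show only a DataNode:
  • jps # run on master and on slave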

4.2 Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
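After this, jps on the master should additionally list a JobTracker and a TaskTracker, and the slave should list a TaskTracker. To shut the cluster down, stop MapReduce first and then HDFS:
  • bin/stop-mapred.sh # on the JobTracker machine
  • bin/stop-dfs.sh # on the NameNode machine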
