Saturday, April 27, 2013

Hadoop on Multi Node

This tutorial provides a step-by-step guide to installing multi-node Hadoop on Ubuntu 10.04.

1. Install single-node Hadoop on all nodes before starting this tutorial.

2. If you followed my single-node tutorial and installed Ubuntu 10.04 on VMware Workstation, you can clone your single-node machine as shown in the figure below.

[Figure: cloning the single-node virtual machine in VMware Workstation]

3. Name the cloned virtual machine slave. This can be changed from System > Administration > Users and Groups.

1. Networking of Master and Slave

1.1 All nodes should be accessible from each other on the network. 
1.2 Add the IP addresses of both the master and the slave to the hosts file on every machine using the following commands.
  • ifconfig # on all machines to know their IP.
  • sudo nano /etc/hosts
Add the following lines to the file:
  • 192.168.216.135 master
  • 192.168.216.136 slave
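To confirm that the new host names resolve on every machine, you can ping each node by name (a quick connectivity check using the names added above):
  • ping -c 3 master
  • ping -c 3 slave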

2. SSH Access

SSH access should be enabled from the master to the slave so that jobs can be transferred from the master to the slave and vice versa.
2.1 Add the master's public RSA key to the slave's authorized keys using the following commands (run as hduser on the master).
  • su - hduser
  • ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
Now check that SSH works with the following commands.
  • ssh master
  • ssh slave
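If passwordless SSH is set up correctly, the following runs a command on the slave and completes without prompting for a password (a simple extra check, run as hduser on the master):
  • ssh slave uptime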

3. Configuration of Hadoop

3.1 Add the name of the master to the conf/masters file, on the master node only, using the following commands.
  • su - hduser
  • cd /home/hadoop/conf
  • nano masters
Add the name of the master, in this case:
  • master
and then open the slaves file:

  • nano slaves
Add the names of the slaves, in this case:
  • master
  • slave
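To double-check both files, you can print them (run as hduser on the master; the paths are the ones used in this tutorial):
  • cat /home/hadoop/conf/masters   # should contain: master
  • cat /home/hadoop/conf/slaves    # should contain: master and slave, one per line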
3.2 Change the Hadoop configuration files on all machines as follows, just as they were changed during the single-node installation.

In conf/core-site.xml:

Only the value line needs to change (from hdfs://localhost:54310 in the single-node setup to hdfs://master:54310):

<property>
  <name>fs.default.name</name>
  <value>hdfs://master:54310</value>
  <description>The name of the default file system.  A URI whose
  scheme and authority determine the FileSystem implementation.  The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class.  The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>

In conf/mapred-site.xml:

Only the value line needs to change (from localhost:54311 to master:54311):

<property>
  <name>mapred.job.tracker</name>
  <value>master:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at.  If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>

In conf/hdfs-site.xml:

Set the dfs.replication value to 2; it was 1 in the single-node setup.
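For reference, after the change the dfs.replication property in conf/hdfs-site.xml should look like this (it is the same property shown in the single-node tutorial below; only the value differs):

<property>
  <name>dfs.replication</name>
  <value>2</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>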

3.3 Format the NameNode (run this on the master node)

  • /home/hadoop/bin/hadoop namenode -format
3.4 If the NameNode does not format successfully, delete the dfs and mapred folders inside the Hadoop temp directory:
  • cd /app/hadoop/tmp
  • rm -R dfs
  • rm -R mapred
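After removing the old dfs and mapred data, run the format command from step 3.3 again:
  • /home/hadoop/bin/hadoop namenode -format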

4. Starting the Multi-Node Hadoop Cluster

4.1 Run the command bin/start-dfs.sh on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on the machine you ran the previous command on, and DataNodes on the machines listed in the conf/slaves file.

4.2 Run the command bin/start-mapred.sh on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on the machine you ran the previous command on, and TaskTrackers on the machines listed in the conf/slaves file.
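To confirm that the cluster is up, you can run jps on each node (a quick check; because the master is also listed in conf/slaves, it runs a DataNode and TaskTracker in addition to the master daemons):
  • jps   # on the master: NameNode, SecondaryNameNode, JobTracker, DataNode, TaskTracker, Jps
  • jps   # on the slave: DataNode, TaskTracker, Jps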

Monday, April 1, 2013

Install Hadoop on Single Node

In this tutorial I will give line-by-line instructions on how to set up Hadoop on a single node. First we install Hadoop on a single node, then move to multiple nodes in the coming tutorials.

Prerequisites:

1. Java

* Add the Java PPA, update the package lists, and install Java using the following commands.
  • sudo add-apt-repository ppa:webupd8team/java
  • sudo apt-get update
  • sudo apt-get install oracle-java7-installer
* Now check the installed Java version:
  • java -version

2. Python environment.

Set up the Python environment using the following command (this package provides the add-apt-repository tool used above).
  • sudo apt-get install python-software-properties

3. Create a new user for Hadoop.

Create a dedicated user for Hadoop. All Hadoop files will be owned by this user, which keeps the Hadoop installation separate from other accounts on the machine.
  • sudo addgroup hadoop
  • sudo adduser --ingroup hadoop hduser
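A quick way to verify the new user and group (an optional check):
  • id hduser   # "hadoop" should appear as the user's group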

4. Configure SSH

SSH is required for Hadoop to communicate between the master and slave nodes. Configure it using the following commands.
  • sudo apt-get install openssh-server
  • su - hduser
  • ssh-keygen -t rsa -P ""
  • cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Check that SSH is working with the following command.
  • ssh localhost

5. Disable IPv6

Sometimes, due to the IPv6 configuration, the Hadoop master node is not able to communicate with the slaves. It is better to disable IPv6 in the system control file using the following commands.
* Open the system control file in the nano editor.
  • sudo nano /etc/sysctl.conf
* Append the following lines to the end of the system control file.
  • # disable ipv6
  • net.ipv6.conf.all.disable_ipv6 = 1
  • net.ipv6.conf.default.disable_ipv6 = 1
  • net.ipv6.conf.lo.disable_ipv6 = 1

    * Restart the system for the settings to take effect. Check whether IPv6 is disabled with the following command; the output should be 1 when IPv6 is disabled.
    • cat /proc/sys/net/ipv6/conf/all/disable_ipv6

    All prerequisites are now complete.

    We can now start the installation of Hadoop on a single-node cluster.

    1. Download Hadoop from the Apache Hadoop releases page.

    * Extract the tar file, rename the extracted folder to hadoop, and save it on the Desktop.
    * Move it to the /home folder so that the same path can also be used on the slave nodes.
    • cd Desktop
    • sudo mv hadoop /home
    * Change the owner of this folder to hduser with the following commands.
    • cd /home
    • sudo chown -R hduser:hadoop /home/hadoop
    * Open the .bashrc file with the following commands to set the Hadoop and Java environment variables.
    • su - hduser
    • nano $HOME/.bashrc
    * Add the following lines to the end of the file.
    # Set Hadoop-related environment variables

    export HADOOP_HOME=/home/hadoop
    # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
    export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    # Some convenient aliases and functions for running Hadoop-related commands
    unalias fs &> /dev/null
    alias fs="hadoop fs"
    unalias hls &> /dev/null
    alias hls="fs -ls"
    # If you have LZO compression enabled in your Hadoop cluster and
    # compress job outputs with LZOP (not covered in this tutorial):
    # Conveniently inspect an LZOP compressed file from the command
    # line; run via:
    #
    # $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
    #
    # Requires installed 'lzop' command.
    #
    lzohead () {
        hadoop fs -cat $1 | lzop -dc | head -1000 | less
    }
    # Add Hadoop bin/ directory to PATH
    export PATH=$PATH:$HADOOP_HOME/bin
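    Reload the file so that the new variables take effect in the current shell, and confirm that the hadoop command is on the PATH (a quick sanity check, assuming the paths set above):
    • source $HOME/.bashrc
    • hadoop version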

    2. Configure Hadoop script file

    * The only environment variable we have to configure for Hadoop in this tutorial is JAVA_HOME. Run the following commands.
    • su - hduser
    • cd /home/hadoop/conf
    • nano hadoop-env.sh
    * and set JAVA_HOME as follows:
    • # The java implementation to use.  Required.
    • export JAVA_HOME=/usr/lib/jvm/java-7-oracle
    3. Create a temporary directory for Hadoop.
    A temp folder is created so that Hadoop can store its temporary files there.
    • exit
    • sudo mkdir -p /app/hadoop/tmp
    • sudo chown hduser:hadoop /app/hadoop/tmp
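    A quick way to confirm that the ownership change took effect (purely a sanity check):
    • ls -ld /app/hadoop/tmp   # owner and group should be hduser and hadoop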

    4. Set Configuration files

    All Hadoop configuration files are set in this section. Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
    In file conf/core-site.xml:
    • su - hduser
    • cd /home/hadoop/conf
    • nano core-site.xml
    Add the following between the configuration tags:
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/app/hadoop/tmp</value>
      <description>A base for other temporary directories.</description>
    </property>

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:54310</value>
      <description>The name of the default file system.  A URI whose
      scheme and authority determine the FileSystem implementation.  The
      uri's scheme determines the config property (fs.SCHEME.impl) naming
      the FileSystem implementation class.  The uri's authority is used to
      determine the host, port, etc. for a filesystem.</description>
    </property>

    In file conf/mapred-site.xml:
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:54311</value>
      <description>The host and port that the MapReduce job tracker runs
      at.  If "local", then jobs are run in-process as a single map
      and reduce task.
      </description>
    </property>
    In file conf/hdfs-site.xml:
    <property>
      <name>dfs.replication</name>
      <value>1</value>
      <description>Default block replication.
      The actual number of replications can be specified when the file is created.
      The default is used if replication is not specified in create time.
      </description>
    </property>

    5. Formatting the HDFS filesystem via the NameNode
    • /home/hadoop/bin/hadoop namenode -format
    The output should include a message saying that the storage directory has been successfully formatted.

    6. Start the single-node cluster

    • /home/hadoop/bin/start-all.sh 
    The output shows the NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker daemons being started.


    Hadoop is now running on this machine. Check that it is working using the following commands.

    • cd /home/hadoop
    • jps

    The jps output lists the Hadoop daemons running on this machine (NameNode, DataNode, SecondaryNameNode, JobTracker, and TaskTracker) along with their process IDs.
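    A healthy single-node setup produces output similar to the following (the process IDs are illustrative and will differ on your machine; the important part is that all five daemons are listed):

    2287 NameNode
    2349 DataNode
    2421 SecondaryNameNode
    2489 JobTracker
    2554 TaskTracker
    2610 Jps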