Thursday, February 6, 2014
Running Your Own Python Code in Hadoop
1. Create a mapper script file in Python (a minimal sketch of both scripts is shown after the streaming command below).
- su - hduser
- nano mapper.py
2. Create a reducer file.
- nano reducer.py
- bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
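For reference, the two scripts can be the classic streaming word-count pair. The following is only a minimal sketch along the lines of the standard Hadoop streaming word-count example, not the only way to write them.
mapper.py:
#!/usr/bin/env python
# Read lines from standard input and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
reducer.py:
#!/usr/bin/env python
# Sum the counts for each word. Hadoop sorts the mapper output by key
# before it reaches the reducer, so equal words arrive consecutively.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
Remember to make both scripts executable (chmod +x /home/hduser/mapper.py /home/hduser/reducer.py), otherwise the streaming job will not be able to launch them.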
Running the "WordCount" MapReduce Job in Hadoop 1.0.3
- Create a folder to store the input files; words will be counted from these files. For the current setup we have three books in plain text format.
- su - hduser
- mkdir /tmp/sandhu
- cd /tmp/sandhu
- ls -l
- /home/hadoop/bin/start-all.sh
- cd /home/hadoop
- bin/hadoop dfs -copyFromLocal /tmp/sandhu /home/hduser/sandhu
- bin/hadoop dfs -ls /home/hduser/sandhu
- bin/hadoop jar hadoop*examples*.jar wordcount /home/hduser/sandhu /home/hduser/sandhu-output
- bin/hadoop dfs -cat /home/hduser/sandhu-output/part-r-00000 # a sample of this output is shown after this list
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
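For orientation, the part-r-00000 file printed by the cat command above is a plain text file of tab-separated word/count pairs, sorted by key. A hypothetical excerpt (the actual words and counts depend entirely on your input books) would look like:
Project	12
The	1204
adventure	17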
Tuesday, August 6, 2013
Basic Meaning of Big Data
For many days now, students and my juniors have been asking about the basics of Big Data and how it is related to Cloud Computing, so I thought I would write an article explaining what Big Data means.
As the name suggests, Big Data deals with amounts of data so huge that they cannot be processed using simple methods and tools. For example, modern high-energy physics experiments, such as DZero, typically generate more than one terabyte of data per day. The famous social network Facebook serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content. The main question is how to process this much data in a reasonable amount of time. Data collected from these sources is only loosely linked, so making decisions from it is complex and time consuming. Today's conventional databases cannot process data unless they know the exact relations between terms. Nowadays organizations collect data from many different sources and through many different methods. For example, a laptop company can collect data about a product from social networking sites such as Facebook and Twitter, from laptop-related blogs, and even from online sites that sell laptops. Data collected from so many different sources cannot be processed by conventional databases to reach proper decisions: it is too big, moves too fast, or does not fit the structure of a conventional database. To gain value from this data, we must follow a new approach, and this new approach is known as Big Data.
Seeing the need for Big Data, on March 29, 2012 the American government announced the "Big Data Research and Development Initiative", and Big Data became national policy for the first time. A definition of Big Data has also been given by Gartner: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (I will explain the three V's of Big Data in detail in coming posts).
Basically, the set of techniques used for getting useful information out of very large, unstructured data sets is known as Big Data. This data can be of any type. I hope this gives you the very basic meaning of Big Data. Stay tuned; I will post more detail about Big Data architecture, the three V's, and so on in subsequent posts.
Rajinder Sandhu
Saturday, April 27, 2013
Hadoop on a Multi-Node Cluster
1. Install single-node Hadoop on all nodes before starting this tutorial.
2. If you followed my single-node tutorial and installed Ubuntu 10.04 on VMware Workstation, you can clone your single-node machine using the method shown in the figure below.

3. Name the cloned virtual machine "slave". This can be changed from System > Administration > Users and Groups.
1. Networking of Master and Slave
- ifconfig # on all machines to know their IP.
- sudo nano /etc/hosts # on both master and slave, add the following lines
- 192.168.216.135 master
- 192.168.216.136 slave
2. SSH Access
- su - hduser
- ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
- ssh master
- ssh slave
3. Configuration of Hadoop
- su - hduser
- cd /home/hadoop/conf
- nano masters
- master
- nano slaves
- master
- slave
- cd /app/hadoop/tmp # on all nodes, clear the data left over from the single-node setup
- rm -R dfs
- rm -R mapred
- /home/hadoop/bin/hadoop namenode -format # on the master only
4. Starting the multi-node Hadoop cluster
Run
- bin/start-dfs.sh
on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on that machine, and DataNodes on the machines listed in the conf/slaves file.
Then run
- bin/start-mapred.sh
on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on that machine, and TaskTrackers on the machines listed in the conf/slaves file.
Monday, April 1, 2013
Install Hadoop on Single Node
Prerequisites:
1. Java
* Update the system package sources and install Java using the following commands.
- sudo add-apt-repository ppa:webupd8team/java
- sudo apt-get update
- sudo apt-get install oracle-java7-installer
- java -version
2. Python environment.
Set up the Python environment using the following command.
- sudo apt-get install python-software-properties
3. Create a new user for Hadoop.
Create a dedicated user for Hadoop. We will give all Hadoop-related permissions to this user so that the Hadoop installation stays secure and isolated.
- sudo addgroup hadoop
- sudo adduser --ingroup hadoop hduser
4. Configure SSH
SSH is required for Hadoop to communicate between the master and slave nodes. Configure it using the following commands.
- sudo apt-get install openssh-server
- su - hduser
- ssh-keygen -t rsa -P ""
- cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- ssh localhost
5. Disable IPv6
Sometimes, due to the IPv6 configuration, the Hadoop master node is not able to communicate with the slaves, so it is better to disable IPv6 in the system control file using the following commands.
* Open the system control file in the nano editor.
- sudo nano /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
* Restart the system for the settings to take effect. Check whether IPv6 is disabled with the following command; the output should be 1 when IPv6 is disabled.
- cat /proc/sys/net/ipv6/conf/all/disable_ipv6
All prerequisites are now complete.
We can start with the installation of Hadoop on a single-node cluster.
1. Download Hadoop from the Apache Hadoop releases page and extract it as a folder named "hadoop" on the Desktop.
- cd Desktop
- sudo mv hadoop /home
- cd /home
- sudo chown -R hduser:hadoop /home/hadoop
- su - hduser
- nano $HOME/.bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
2. Configure Hadoop script file
- su - hduser
- cd /home/hadoop/conf
- nano hadoop-env.sh
- # The java implementation to use. Required.
- export JAVA_HOME=/usr/lib/jvm/java-7-oracle
- exit
3. Create the Hadoop temporary directory
- sudo mkdir -p /app/hadoop/tmp
- sudo chown hduser:hadoop /app/hadoop/tmp
4. Set Configuration files
All Hadoop configuration files are set in this section. Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
- su - hduser
- cd /home/hadoop/conf
- nano core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
5. Format the HDFS filesystem via the NameNode
- /home/hadoop/bin/hadoop namenode -format
6. Start the single-node cluster
- /home/hadoop/bin/start-all.sh
- cd /home/hadoop
- jps
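If everything started correctly, jps should list the five Hadoop 1.x daemons plus Jps itself, along the lines of the following (the process IDs here are only placeholders and will differ on your machine):
2287 NameNode
2349 DataNode
2601 SecondaryNameNode
2985 JobTracker
3247 TaskTracker
3312 Jps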