Thursday, February 6, 2014
Running Your Own Python Code in Hadoop
1. Create a mapper script file in Python (a minimal sketch of both scripts is shown after the streaming command below).
- su - hduser
- nano mapper.py
2. Create a reducer file.
- nano reducer.py
- bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/mapper.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output
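For reference, the two scripts can be the classic streaming word-count pair. The following is only a minimal sketch along the lines of the standard Hadoop streaming word-count example, not the only way to write them.
mapper.py:
#!/usr/bin/env python
# Read lines from standard input and emit "word<TAB>1" for every word.
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print('%s\t%s' % (word, 1))
reducer.py:
#!/usr/bin/env python
# Sum the counts for each word. Hadoop sorts the mapper output by key
# before it reaches the reducer, so equal words arrive consecutively.
import sys

current_word = None
current_count = 0

for line in sys.stdin:
    word, count = line.strip().split('\t', 1)
    count = int(count)
    if word == current_word:
        current_count += count
    else:
        if current_word is not None:
            print('%s\t%s' % (current_word, current_count))
        current_word = word
        current_count = count

if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
Remember to make both scripts executable (chmod +x /home/hduser/mapper.py /home/hduser/reducer.py), otherwise the streaming job will not be able to launch them.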
Running the "WordCount" MapReduce Job in Hadoop 1.0.3
- Create a folder to store the input files; words will be counted from these files. For the current setup we have three books in plain text format.
- su - hduser
- mkdir /tmp/sandhu
- cd /tmp/sandhu
- ls -l
- /home/hadoop/bin/start-all.sh
- cd /home/hadoop
- bin/hadoop dfs -copyFromLocal /tmp/sandhu /home/hduser/sandhu
- bin/hadoop dfs -ls /home/hduser/sandhu
- bin/hadoop jar hadoop*examples*.jar wordcount /home/hduser/sandhu /home/hduser/sandhu-output
- bin/hadoop dfs -cat /home/hduser/sandhu-output/part-r-00000 # a sample of this output is shown after this list
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon
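For orientation, the part-r-00000 file printed by the cat command above is a plain text file of tab-separated word/count pairs, sorted by key. A hypothetical excerpt (the actual words and counts depend entirely on your input books) would look like:
Project	12
The	1204
adventure	17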
Tuesday, August 6, 2013
Basic Meaning of Big Data
For many days now, students and my juniors have been asking about the basics of Big Data and how it is related to Cloud Computing, so I thought I would write an article explaining what Big Data means.
As the name suggests, Big Data deals with amounts of data so huge that they cannot be processed using simple methods and tools. For example, modern high-energy physics experiments, such as DZero, typically generate more than one terabyte of data per day. The famous social network Facebook serves 570 billion page views per month, stores 3 billion new photos every month, and manages 25 billion pieces of content. The main question is how to process this much data in a reasonable amount of time. Data collected from these sources is only loosely linked, so making decisions from it is complex and time consuming. Today's conventional databases cannot process data unless they know the exact relations between terms. Nowadays organizations collect data from many different sources and through many different methods. For example, a laptop company can collect data about a product from social networking sites such as Facebook and Twitter, from laptop-related blogs, and even from online sites that sell laptops. Data collected from so many different sources cannot be processed by conventional databases to reach proper decisions: it is too big, moves too fast, or does not fit the structure of a conventional database. To gain value from this data, we must follow a new approach, and this new approach is known as Big Data.
Seeing the need for Big Data, on March 29, 2012 the American government announced the "Big Data Research and Development Initiative", and Big Data became national policy for the first time. A definition of Big Data has also been given by Gartner: "Big Data are high-volume, high-velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization" (I will explain the three V's of Big Data in detail in coming posts).
Basically, the set of techniques used for getting useful information out of very large, unstructured data sets is known as Big Data. This data can be of any type. I hope this gives you the very basic meaning of Big Data. Stay tuned; I will post more detail about Big Data architecture, the three V's, and so on in subsequent posts.
Rajinder Sandhu
Saturday, April 27, 2013
Hadoop on a Multi-Node Cluster
1. Install single-node Hadoop on all nodes before starting this tutorial.
2. If you followed my single-node tutorial and installed Ubuntu 10.04 on VMware Workstation, you can clone your single-node machine using the method shown in the figure below.

3. Name the cloned virtual machine "slave". This can be changed from System > Administration > Users and Groups.
1. Networking of Master and Slave
- ifconfig # on all machines to know their IP.
- sudo nano /etc/hosts # on both master and slave, add the following lines
- 192.168.216.135 master
- 192.168.216.136 slave
2. SSH Access
- su - hduser
- ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@slave
- ssh master
- ssh slave
3. Configuration of Hadoop
- su - hduser
- cd /home/hadoop/conf
- nano masters
- master
- nano slaves
- master
- slave
- cd /app/hadoop/tmp # on all nodes, clear the data left over from the single-node setup
- rm -R dfs
- rm -R mapred
- /home/hadoop/bin/hadoop namenode -format # on the master only
4. Starting the multi-node Hadoop cluster
Run
- bin/start-dfs.sh
on the machine you want the (primary) NameNode to run on. This will bring up HDFS with the NameNode running on that machine, and DataNodes on the machines listed in the conf/slaves file.
Then run
- bin/start-mapred.sh
on the machine you want the JobTracker to run on. This will bring up the MapReduce cluster with the JobTracker running on that machine, and TaskTrackers on the machines listed in the conf/slaves file.
Monday, April 1, 2013
Install Hadoop on Single Node
Prerequisites:
1. Java
* Update the system package sources and install Java using the following commands.
- sudo add-apt-repository ppa:webupd8team/java
- sudo apt-get update
- sudo apt-get install oracle-java7-installer
- java -version
2. Python environment.
Set up the Python environment using the following command.
- sudo apt-get install python-software-properties
3. Create a new user for Hadoop.
Create a dedicated user for Hadoop. We will give all Hadoop-related permissions to this user so that the Hadoop installation stays secure and isolated.
- sudo addgroup hadoop
- sudo adduser --ingroup hadoop hduser
4. Configure SSH
SSH is required for Hadoop to communicate between the master and slave nodes. Configure it using the following commands.
- sudo apt-get install openssh-server
- su - hduser
- ssh-keygen -t rsa -P ""
- cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
- ssh localhost
5. Disable IPv6
Sometimes, due to the IPv6 configuration, the Hadoop master node is not able to communicate with the slaves, so it is better to disable IPv6 in the system control file using the following commands.
* Open the system control file in the nano editor.
- sudo nano /etc/sysctl.conf
# disable ipv6
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
* Restart the system for the settings to take effect. Check whether IPv6 is disabled with the following command; the output should be 1 when IPv6 is disabled.
- cat /proc/sys/net/ipv6/conf/all/disable_ipv6
All prerequisites are now complete.
We can start with the installation of Hadoop on a single-node cluster.
1. Download Hadoop from the Apache Hadoop releases page and extract it as a folder named "hadoop" on the Desktop.
- cd Desktop
- sudo mv hadoop /home
- cd /home
- sudo chown -R hduser:hadoop /home/hadoop
- su - hduser
- nano $HOME/.bashrc
# Set Hadoop-related environment variables
export HADOOP_HOME=/home/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
# Some convenient aliases and functions for running Hadoop-related commands
unalias fs &> /dev/null
alias fs="hadoop fs"
unalias hls &> /dev/null
alias hls="fs -ls"
# If you have LZO compression enabled in your Hadoop cluster and
# compress job outputs with LZOP (not covered in this tutorial):
# Conveniently inspect an LZOP compressed file from the command
# line; run via:
#
# $ lzohead /hdfs/path/to/lzop/compressed/file.lzo
#
# Requires installed 'lzop' command.
#
lzohead () {
hadoop fs -cat $1 | lzop -dc | head -1000 | less
}
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin
2. Configure Hadoop script file
- su - hduser
- cd /home/hadoop/conf
- nano hadoop-env.sh
- # The java implementation to use. Required.
- export JAVA_HOME=/usr/lib/jvm/java-7-oracle
- exit
3. Create the Hadoop temporary directory
- sudo mkdir -p /app/hadoop/tmp
- sudo chown hduser:hadoop /app/hadoop/tmp
4. Set Configuration files
All Hadoop configuration files are set in this section. Add the following snippets between the <configuration> ... </configuration> tags in the respective configuration XML file.
In file conf/core-site.xml:
- su - hduser
- cd /home/hadoop/conf
- nano core-site.xml
<property>
  <name>hadoop.tmp.dir</name>
  <value>/app/hadoop/tmp</value>
  <description>A base for other temporary directories.</description>
</property>
<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:54310</value>
  <description>The name of the default file system. A URI whose
  scheme and authority determine the FileSystem implementation. The
  uri's scheme determines the config property (fs.SCHEME.impl) naming
  the FileSystem implementation class. The uri's authority is used to
  determine the host, port, etc. for a filesystem.</description>
</property>
In file conf/mapred-site.xml:
<property>
  <name>mapred.job.tracker</name>
  <value>localhost:54311</value>
  <description>The host and port that the MapReduce job tracker runs
  at. If "local", then jobs are run in-process as a single map
  and reduce task.
  </description>
</property>
In file conf/hdfs-site.xml:
<property>
  <name>dfs.replication</name>
  <value>1</value>
  <description>Default block replication.
  The actual number of replications can be specified when the file is created.
  The default is used if replication is not specified in create time.
  </description>
</property>
5. Format the HDFS filesystem via the NameNode
- /home/hadoop/bin/hadoop namenode -format
6. Start the single-node cluster
- /home/hadoop/bin/start-all.sh
- cd /home/hadoop
- jps
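If everything started correctly, jps should list the five Hadoop 1.x daemons plus Jps itself, along the lines of the following (the process IDs here are only placeholders and will differ on your machine):
2287 NameNode
2349 DataNode
2601 SecondaryNameNode
2985 JobTracker
3247 TaskTracker
3312 Jps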