Big data from rajsandhu1989
Researcher, CLOUDS Lab., Computing and Information Systems, The University of Melbourne, Australia.
Sunday, June 8, 2014
Wednesday, February 12, 2014
Thursday, February 6, 2014
Running Own Written Python Code in Hadoop
This post lists the steps required to run your own Python code on a Hadoop v1.0.3 cluster.
1. Create a mapper Python Script file.
- su - hduser
- nano mapper.py
Write the following code in mapper.py and save it.
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for reducer.py
        #
        # tab-delimited; the trivial word count is 1
        # (the parenthesized form works in Python 2 and Python 3)
        print('%s\t%s' % (word, 1))
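The mapper above splits on whitespace only, so "God." and "God" are counted as two different words. A minimal sketch of a mapper that normalizes words first is shown below. It assumes Python 3, and the names `WORD_RE`, `tokenize` and `run_mapper` as well as the definition of a word are illustrative choices, not part of the original script.

```python
import io
import re
import sys

# Assumption: a "word" is a run of ASCII letters, digits or apostrophes.
WORD_RE = re.compile(r"[A-Za-z0-9']+")


def tokenize(line):
    """Return the lower-cased words of one line with punctuation removed."""
    return [w.lower() for w in WORD_RE.findall(line)]


def run_mapper(lines, out):
    """Write one 'word<TAB>1' record per word, like mapper.py does."""
    for line in lines:
        for word in tokenize(line):
            out.write("%s\t%d\n" % (word, 1))


# Small in-memory demonstration instead of reading STDIN.
demo = io.StringIO()
run_mapper(["God is God. I am I"], demo)
sys.stdout.write(demo.getvalue())
```

In a streaming job the last three lines would be replaced by `run_mapper(sys.stdin, sys.stdout)`.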
2. Create a reducer file.
- nano reducer.py
Write the following code in reducer.py and save it.
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from mapper.py
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word;
# nothing is printed when the input was empty
if current_word is not None:
    print('%s\t%s' % (current_word, current_count))
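The reducer relies on equal keys arriving next to each other. The same idea can be sketched with `itertools.groupby`, which groups consecutive records with the same key. This is an alternative sketch rather than the script used in this post, and the helper names `parse` and `reduce_counts` are made up for the example.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab in the line
        try:
            value = int(count)
        except ValueError:
            continue  # count was not a number
        yield word, value


def reduce_counts(lines):
    """Sum the counts of consecutive records that share the same word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


# The input must already be sorted by word, as Hadoop guarantees for a reducer.
sorted_input = ["God\t1\n", "God\t1\n", "I\t1\n", "I\t1\n", "am\t1\n", "is\t1\n"]
for word, total in reduce_counts(sorted_input):
    print("%s\t%d" % (word, total))
```

Because `groupby` only merges neighbouring records, unsorted input would produce several partial counts for the same word, which is exactly the behaviour of the hand-written loop.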
3. Test your code (cat data | map | sort | reduce)
I recommend testing mapper.py and reducer.py locally before using them in a MapReduce job. Otherwise a job may complete successfully and still produce no result data, or results you did not expect. The scripts are called directly below, so make them executable first:
- chmod +x /home/hduser/mapper.py /home/hduser/reducer.py
# very basic test
hduser@ubuntu:~$ echo "God is God I am I" | /home/hduser/mapper.py
God 1
is 1
God 1
I 1
am 1
I 1
# the order of the reducer output depends on the locale that sort uses
hduser@ubuntu:~$ echo "God is God I am I" | /home/hduser/mapper.py | sort -k1,1 | /home/hduser/reducer.py
am 1
God 2
I 2
is 1
hduser@ubuntu:~$ cat /tmp/sandhu/pg20417.txt | /home/hduser/mapper.py
The 1
Project 1
Gutenberg 1
EBook 1
of 1
[...]
(you get the idea)
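The cat data | map | sort | reduce pipeline can also be imitated inside a single Python process, which makes it easy to compare the result with an independent count. The sketch below is only a local approximation of what Hadoop does. The function names are hypothetical, and the sort uses Python's string ordering instead of `sort -k1,1`.

```python
from collections import Counter


def map_words(line):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def reduce_sorted(pairs):
    """Reduce step: sum counts of neighbouring pairs with the same word."""
    current, total = None, 0
    for word, count in pairs:
        if word == current:
            total += count
        else:
            if current is not None:
                yield current, total
            current, total = word, count
    if current is not None:
        yield current, total


def simulate(lines):
    """Run map, sort by key and reduce in memory."""
    mapped = [pair for line in lines for pair in map_words(line)]
    mapped.sort(key=lambda pair: pair[0])  # plays the role of `sort -k1,1`
    return list(reduce_sorted(mapped))


sample = ["God is God I am I"]
result = simulate(sample)
print(result)

# Cross-check against a plain in-memory word count.
expected = sorted(Counter(" ".join(sample).split()).items())
print(result == expected)
```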
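4. Running the Python Code on Hadoop
Run the job from the Hadoop installation directory. The -file options ship both scripts with the job, so they do not have to exist at the same path on every node.
- bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -file /home/hduser/mapper.py -mapper /home/hduser/mapper.py -file /home/hduser/reducer.py -reducer /home/hduser/reducer.py -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output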
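When the job has to be started from a script, the streaming command can be assembled as an argument list. The sketch below only builds and prints the command and does not run it. The function name `streaming_command` is hypothetical. The jar file name is an assumption for a 1.0.3 install, because the wildcard used on the command line is not expanded when no shell is involved. The `-file` options are an addition that ships the scripts with the job.

```python
def streaming_command(jar, mapper, reducer, input_path, output_path):
    """Build the argument list for a Hadoop streaming job."""
    return [
        "bin/hadoop", "jar", jar,
        "-file", mapper, "-mapper", mapper,      # ship and use the mapper
        "-file", reducer, "-reducer", reducer,   # ship and use the reducer
        "-input", input_path,
        "-output", output_path,
    ]


cmd = streaming_command(
    # Assumed jar name; check contrib/streaming/ in your installation.
    jar="contrib/streaming/hadoop-streaming-1.0.3.jar",
    mapper="/home/hduser/mapper.py",
    reducer="/home/hduser/reducer.py",
    input_path="/user/hduser/gutenberg/*",
    output_path="/user/hduser/gutenberg-output",
)
print(" ".join(cmd))
```

The list can be passed to `subprocess.check_call(cmd)` with the Hadoop installation directory as the working directory.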
Running "WordCount" Map Reduce Job in Hadoop 1.0.3
This post explains the steps required to run the WordCount MapReduce job on Hadoop v1.0.3.
1. Create a folder to store the input files whose words will be counted. The current setup uses three books in plain text format.
- su - hduser
- mkdir /tmp/sandhu
2. Copy the three files to the /tmp/sandhu folder and check them with the following commands.
- cd /tmp/sandhu
- ls -l
output will look like:
3. Start the Hadoop Cluster:
- /home/hadoop/bin/start-all.sh
4. Before we run the actual MapReduce job, we first have to copy the files from our local file system to Hadoop’s HDFS.
- cd /home/hadoop
- bin/hadoop dfs -copyFromLocal /tmp/sandhu /home/hduser/sandhu
Check that the files were copied to HDFS correctly with the following command.
- bin/hadoop dfs -ls /home/hduser/sandhu
output will look like:
5. Now, we actually run the WordCount example job.
- bin/hadoop jar hadoop*examples*.jar wordcount /home/hduser/sandhu /home/hduser/sandhu-output
Output will be like:
6. Retrieve the job result from HDFS
- bin/hadoop dfs -cat /home/hduser/sandhu-output/part-r-00000
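The result file contains one word and its count per line, separated by a tab. A short sketch for picking the most frequent words out of such lines is shown below. The sample data is invented for illustration and the function names are hypothetical. In practice the lines would come from the saved output of the -cat command.

```python
import heapq


def parse_counts(lines):
    """Yield (word, count) pairs from 'word<TAB>count' lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if sep:
            yield word, int(count)


def top_words(lines, n=3):
    """Return the n pairs with the highest counts."""
    return heapq.nlargest(n, parse_counts(lines), key=lambda pair: pair[1])


# Invented sample in the format of part-r-00000.
sample = ["God\t2\n", "I\t2\n", "am\t1\n", "is\t1\n", "the\t5\n"]
for word, count in top_words(sample, n=2):
    print(word, count)
```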
7. Hadoop web interfaces
- http://localhost:50070/ – web UI of the NameNode daemon
- http://localhost:50030/ – web UI of the JobTracker daemon
- http://localhost:50060/ – web UI of the TaskTracker daemon