This post lists the steps required to run your own Python code on a Hadoop v1.0.3 cluster.
1. Create a mapper Python script file.
- su - hduser
- nano
Write the following code in the file and save it.
#!/usr/bin/env python
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # split the line into words
    words = line.split()
    # increase counters
    for word in words:
        # write the results to STDOUT (standard output);
        # what we output here will be the input for the
        # Reduce step, i.e. the input for the reducer;
        # tab-delimited; the trivial word count is 1
        print('%s\t%s' % (word, 1))
2. Create a reducer file.
- nano
Write the following code in the reducer file and save it.
#!/usr/bin/env python
import sys

current_word = None
current_count = 0
word = None

# input comes from STDIN
for line in sys.stdin:
    # remove leading and trailing whitespace
    line = line.strip()
    # parse the input we got from the mapper
    word, count = line.split('\t', 1)
    # convert count (currently a string) to int
    try:
        count = int(count)
    except ValueError:
        # count was not a number, so silently
        # ignore/discard this line
        continue
    # this IF-switch only works because Hadoop sorts map output
    # by key (here: word) before it is passed to the reducer
    if current_word == word:
        current_count += count
    else:
        if current_word:
            # write result to STDOUT
            print('%s\t%s' % (current_word, current_count))
        current_count = count
        current_word = word

# do not forget to output the last word if needed!
if current_word == word:
    print('%s\t%s' % (current_word, current_count))
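The reducer only works because its input arrives sorted by word, so all lines for one word are adjacent. As a rough sketch of that same idea, the grouping can also be written with itertools.groupby from the standard library. The helper names parse and reduce_counts are made up for this illustration and are not part of Hadoop; the sample lines stand in for sorted mapper output.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs from tab-delimited lines, skipping bad ones."""
    for line in lines:
        word, sep, count = line.strip().partition('\t')
        if not sep:
            # no tab on this line, so it is not mapper output
            continue
        try:
            yield word, int(count)
        except ValueError:
            # count was not a number, discard the line
            continue


def reduce_counts(lines):
    """Sum the counts of adjacent equal words; input must be sorted by word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


if __name__ == '__main__':
    # hypothetical sample, already sorted by word as Hadoop would deliver it
    sample = ["God\t1", "God.\t1", "I\t1", "I\t1", "am\t1", "is\t1"]
    for word, total in reduce_counts(sample):
        print('%s\t%s' % (word, total))
```

Passing sys.stdin instead of the sample list should make this behave like the reducer script above. If the input is not sorted, the same word shows up in several groups and is printed more than once, which is exactly the failure the sort step prevents.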
3. Test your code (cat data | map | sort | reduce)
I recommend testing your mapper and reducer scripts locally before using them in a MapReduce job. Otherwise your jobs might complete successfully but produce no result data at all, or not the results you expected.
# very basic test
hduser@ubuntu:~$ echo "God is God. I am I" | /home/hduser/
God 1
is 1
God. 1
I 1
am 1
I 1
hduser@ubuntu:~$ echo "God is God. I am I" | /home/hduser/ | sort -k1,1 | /home/hduser/
God 1
God. 1
I 2
am 1
is 1
(line order may differ with your sort locale)
hduser@ubuntu:~$ cat /tmp/sandhu/pg20417.txt | /home/hduser/
The 1
Project 1
Gutenberg 1
EBook 1
of 1
(you get the idea)
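To go one step beyond eyeballing the output, the local pipeline result can be compared against counts computed independently. The following is only a sketch under that assumption: it uses collections.Counter as the reference, and the function names expected_counts, parse_output and output_matches are invented for this example. The reducer lines in the demo are typed in by hand, not captured from a real run.

```python
from collections import Counter


def expected_counts(text):
    """Reference word counts, using the same whitespace split as the mapper."""
    return dict(Counter(text.split()))


def parse_output(lines):
    """Turn tab-delimited 'word<TAB>count' lines into a {word: count} dict."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        word, count = line.split('\t', 1)
        counts[word] = int(count)
    return counts


def output_matches(text, reducer_lines):
    """True if the reducer output agrees with the reference counts."""
    return parse_output(reducer_lines) == expected_counts(text)


if __name__ == '__main__':
    text = "God is God. I am I"
    # hand-written stand-in for what the map | sort | reduce pipeline prints
    reducer_lines = ["God\t1", "God.\t1", "I\t2", "am\t1", "is\t1"]
    print(output_matches(text, reducer_lines))
```

Because the mapper splits on whitespace only, "God." with its trailing period is a different word from "God", and the reference counts treat it the same way.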
4. Running the Python Code on Hadoop
- bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar -mapper /home/hduser/ -reducer /home/hduser/ -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output