Hadoop is a distributed computing framework that has become a de facto standard for storing and processing data at massive scale. Hadoop's core has two layers: a distributed storage layer, HDFS, and a distributed computation (processing) layer, MapReduce. Both HDFS and MapReduce are written in Java, so working with them directly means writing Java programs; in particular, data processing requires MapReduce programs written against the Java API. That API is of little use to non-Java programmers or to legacy code. To open up data processing on Hadoop to non-Java programmers, Hadoop provides a generic utility called Hadoop Streaming, which lets them write MapReduce programs in the language of their choice: Python, PHP, Perl, R, Shell, Scala, Ruby, C, C++, etc.
Hadoop Streaming is the integration of Hadoop, standard I/O (stdin/stdout), and an external executable (PHP, Perl, Python, etc.).
The data flow is as follows:
HDFS ==> stdin [external mapper program] ==> stdout ==> [shuffle] ==> stdin [external reducer program] ==> stdout ==> HDFS
This is very good for legacy code, statisticians, and other non-Java programmers: they can easily integrate their existing code with Hadoop, and the computation runs in parallel.
Note: the compiler or interpreter for the external program must be installed on every cluster node. Languages such as Python, Perl, and Shell are already bundled with Unix/Linux; other languages must be installed on all cluster nodes. Conceptually, this works like Unix pipes.
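Because the model really is just Unix pipes, you can sketch the entire mapper/shuffle/reducer flow on a single machine before submitting a job. In this sketch (the input text is made up for illustration), `tr` stands in for the mapper, `sort` plays the role of the shuffle, and `uniq -c` stands in for the reducer:

```shell
# Word count as a plain Unix pipeline:
#   tr      - mapper: emit one word per line
#   sort    - shuffle: group identical words together
#   uniq -c - reducer: count each group
printf 'hello world\nhello hadoop\n' \
  | tr -s ' ' '\n' \
  | sort \
  | uniq -c
```

This is exactly the shape Hadoop Streaming gives you, except that Hadoop runs many mapper and reducer processes in parallel across the cluster.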
This API ships as a single jar, available in the contrib/streaming directory under HADOOP_HOME (in Hadoop 1.x; newer releases place hadoop-streaming*.jar elsewhere, e.g. under share/hadoop/tools/lib).
Python + Hadoop:
Usage:
cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input /streaming/news -output /streaming/pystream -mapper 'python /home/naga/bigdata/streaming/mapper.py' -reducer 'python /home/naga/bigdata/streaming/reducer.py'
Mapper.py:
import sys

# Mapper: emit one word per line from standard input.
for line in sys.stdin:
    for word in line.split():
        sys.stdout.write(word + "\n")
Reducer.py:
import sys

# Reducer: count occurrences of each word arriving on standard input.
wordcount = {}
for line in sys.stdin:
    word = line.strip()
    if word:
        wordcount[word] = wordcount.get(word, 0) + 1
for word in wordcount:
    sys.stdout.write(word + "\t" + str(wordcount[word]) + "\n")
Perl + Hadoop:
Usage:
cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input /streaming/news -output /streaming/plstream -mapper 'perl /home/naga/bigdata/streaming/mapper.pl' -reducer 'perl /home/naga/bigdata/streaming/reducer.pl'
Mapper.pl:
#!/usr/bin/perl
# Mapper: emit one word per line from standard input.
while ($line = <STDIN>)
{
    # Split on runs of whitespace to avoid emitting empty tokens.
    @words = split(/\s+/, $line);
    foreach $word (@words)
    {
        print $word, "\n" if length $word;
    }
}
Reducer.pl:
#!/usr/bin/perl
# Reducer: count occurrences of each word arriving on standard input.
%wordcount = ();    # a hash is initialized with (), not {} (a hash ref)
while ($word = <STDIN>)
{
    chomp($word);
    next unless length $word;
    $wordcount{$word}++;
}
while (($key, $value) = each %wordcount)
{
    print $key, "\t", $value, "\n";
}
PHP + Hadoop:
Usage:
cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input /streaming/news -output /streaming/phpstream -mapper 'php /home/naga/bigdata/streaming/mapper.php' -reducer 'php /home/naga/bigdata/streaming/reducer.php'
Mapper.php
<?php
// Mapper: emit one word per line from standard input.
// fgets() returns false at end of input, so compare with !== false.
while (($line = fgets(STDIN)) !== false)
{
    // Split on runs of whitespace, dropping empty tokens (explode(" ", ...)
    // would leave the trailing newline attached to the last word).
    $words = preg_split('/\s+/', $line, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
    {
        echo $word . "\n";
    }
}
?>
Reducer.php
<?php
// Reducer: count occurrences of each word arriving on standard input.
$wordcount = array();
while (($line = fgets(STDIN)) !== false)
{
    $word = trim($line);
    if ($word === '')
    {
        continue;
    }
    if (array_key_exists($word, $wordcount))
    {
        $wordcount[$word]++;
    }
    else
    {
        $wordcount[$word] = 1;
    }
}
foreach ($wordcount as $key => $value)
{
    echo $key . "\t" . $value . "\n";
}
?>
We can also mix languages freely: the mapper in Perl and the reducer in Python (or vice versa), the mapper in PHP and the reducer in a shell script, and so on.