Tuesday, June 21, 2016

Running Kafka Producers and Consumers in Secured Environment

Apache Kafka can run in secured mode using Kerberos authentication.
To run Kafka producers and consumers against a secured cluster, we need to follow these steps:

Step 1. Create a properties file, say kafka.properties, with the following lines:

security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka

Step 2. Create a JAAS configuration file, say kafka.jaas, with the authentication details:

KafkaClient {
    com.sun.security.auth.module.Krb5LoginModule required
    useTicketCache=true;
};

Step 3. Add the kafka.jaas file to the Kafka environment via the KAFKA_OPTS environment variable in the shell (for both the producer and the consumer):

export KAFKA_OPTS="-Djava.security.auth.login.config=/path/to/kafka.jaas"

Step 4. Start a kafka-console-producer:

kafka-console-producer --broker-list <brokers list> --topic <topic name> --producer.config kafka.properties

Step 5. Start a kafka-console-consumer:

kafka-console-consumer --bootstrap-server <brokers list> --topic <topic name> --consumer.config kafka.properties --new-consumer

Step 6. Type some sample messages in the producer console and verify that they appear in the consumer console.

Note: provide the brokers list as brokerhost:port,brokerhost:port,...
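The two client settings from Step 1 are plain Java-style key=value properties; most Kafka client libraries accept the same keys as a map. As a small illustration (the parser below is a hypothetical helper, not part of any Kafka tooling — the console tools read the file themselves via --producer.config), here is a Python sketch that parses such a properties file into a dict:

```python
# Minimal parser for a Java-style .properties file (illustrative sketch;
# skips blank lines and # comments, splits on the first '=').
def parse_properties(text):
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props

# the exact contents of kafka.properties from Step 1
kafka_properties = """
security.protocol=SASL_PLAINTEXT
sasl.kerberos.service.name=kafka
"""

props = parse_properties(kafka_properties)
print(props["security.protocol"])           # SASL_PLAINTEXT
print(props["sasl.kerberos.service.name"])  # kafka
```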

Thursday, August 21, 2014


Hadoop Streaming - A Utility from Hadoop for Non-Java Programmers to Write MapReduce Programs

Hadoop is a distributed computing framework. It has become a de facto standard for data storage and processing at massive scale. Hadoop core has two layers: a distributed storage layer [HDFS] and a distributed computation or processing layer [MapReduce]. Both HDFS and MapReduce are written in Java, so to work with them directly, people have to write programs in Java; in particular, data processing means writing MapReduce programs in Java. This MapReduce Java API is not useful for non-Java programmers or for legacy code. To open up data processing on Hadoop to non-Java programmers, Hadoop provides a generic utility called Hadoop Streaming, which allows them to write MapReduce programs in the language of their choice: Python, PHP, Perl, R, Shell, Scala, Ruby, C, C++, etc.

Hadoop Streaming is the integration of Hadoop, standard I/O (stdin/stdout), and an external executable (PHP, Perl, Python, etc.).

The data flow is as follows:

HDFS ==> stdin ==> [external program for Mapper] ==> stdout ==> [Shuffle/Sort] ==> stdin ==> [external program for Reducer] ==> stdout ==> HDFS
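This flow can be simulated in plain Python. The sketch below (an illustration of the concept, not Hadoop code) chains three stages that each consume and emit lines, exactly as the external executables do; the shuffle is modeled as a sort so that equal keys become adjacent:

```python
# Simulation of the streaming data flow: map -> shuffle/sort -> reduce.

def mapper(lines):
    # emit one word per input line, like the streaming mapper
    for line in lines:
        for word in line.split():
            yield word

def shuffle(words):
    # the framework's shuffle phase: sort so equal keys are adjacent
    return sorted(words)

def reducer(words):
    # count occurrences of each key, like the streaming reducer
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    for word, count in counts.items():
        yield "%s\t%d" % (word, count)

input_lines = ["hadoop streaming demo", "hadoop demo"]
output = list(reducer(shuffle(mapper(input_lines))))
print(output)  # ['demo\t2', 'hadoop\t2', 'streaming\t1']
```

In a real job, the framework performs the shuffle/sort between the two external programs; only the mapper and reducer are user code.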

This is very good for legacy code, statisticians, and other non-Java programmers: they can easily integrate their code with Hadoop, and the computation runs in parallel.

Note: The compiler or interpreter for the external program must be installed on all cluster nodes. Interpreters for languages like Python, Perl, and Shell are already bundled with Unix/Linux; for other languages, we need to install them on every node. Conceptually, this works like Unix pipes.

For this API, we have a single jar, available in the contrib/streaming directory under HADOOP_HOME.

Python + Hadoop:

Usage:
Cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input /streaming/news -output /streaming/pystream -mapper 'python /home/naga/bigdata/streaming/mapper.py' -reducer 'python /home/naga/bigdata/streaming/reducer.py'

Mapper.py:

import sys

# read all lines from stdin and emit one word per line
lines = sys.stdin.readlines()
for line in lines:
    words = line.split()
    for word in words:
        sys.stdout.write(word + "\n")

Reducer.py:

import sys

# count occurrences of each word arriving on stdin
lines = sys.stdin.readlines()
wordcount = {}
for line in lines:
    word = line.strip()
    if word in wordcount:
        wordcount[word] = wordcount[word] + 1
    else:
        wordcount[word] = 1
for word in wordcount:
    sys.stdout.write(word + "\t" + str(wordcount[word]) + "\n")
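Before submitting a streaming job, the classic local check is the shell pipeline cat input | python mapper.py | sort | python reducer.py. Here is a hedged Python sketch of that harness: it writes mapper and reducer scripts equivalent to the ones above into a temporary directory and pipes sample input through them (the sort between the two stages stands in for Hadoop's shuffle):

```python
import os
import subprocess
import sys
import tempfile

# mapper/reducer source equivalent to the listings above
MAPPER = r"""import sys
for line in sys.stdin:
    for word in line.split():
        sys.stdout.write(word + "\n")
"""

REDUCER = r"""import sys
wordcount = {}
for line in sys.stdin:
    word = line.strip()
    wordcount[word] = wordcount.get(word, 0) + 1
for word in wordcount:
    sys.stdout.write(word + "\t" + str(wordcount[word]) + "\n")
"""

def run_pipeline(text):
    # emulate: cat input | python mapper.py | sort | python reducer.py
    with tempfile.TemporaryDirectory() as d:
        mapper_py = os.path.join(d, "mapper.py")
        reducer_py = os.path.join(d, "reducer.py")
        with open(mapper_py, "w") as f:
            f.write(MAPPER)
        with open(reducer_py, "w") as f:
            f.write(REDUCER)
        mapped = subprocess.run([sys.executable, mapper_py], input=text,
                                capture_output=True, text=True).stdout
        shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
        reduced = subprocess.run([sys.executable, reducer_py], input=shuffled,
                                 capture_output=True, text=True).stdout
        return reduced

print(run_pipeline("hello streaming world\nhello world\n"))
```

If the local pipeline produces correct counts, the same scripts can be submitted with the hadoop jar command shown above.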

Perl + Hadoop:

Usage:
Cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input /streaming/news -output /streaming/plstream -mapper 'perl /home/naga/bigdata/streaming/mapper.pl' -reducer 'perl /home/naga/bigdata/streaming/reducer.pl'

Mapper.pl:

#!/usr/bin/perl

# read all lines from stdin and emit one word per line
@lines = <STDIN>;
foreach $line (@lines)
{
    @words = split(/\s+/, $line);
    foreach $word (@words)
    {
        print $word, "\n";
    }
}

Reducer.pl:

#!/usr/bin/perl

# count occurrences of each word arriving on stdin
@words = <STDIN>;
%wordcount = ();

foreach $word (@words)
{
    chomp($word);
    if (exists $wordcount{$word})
    {
        $wordcount{$word}++;
    }
    else
    {
        $wordcount{$word} = 1;
    }
}
while (($key, $value) = each %wordcount)
{
    print $key, "\t", $value, "\n";
}

PHP + Hadoop:

Usage:
Cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar -input /streaming/news -output /streaming/phpstream -mapper 'php /home/naga/bigdata/streaming/mapper.php' -reducer 'php /home/naga/bigdata/streaming/reducer.php'

Mapper.php

<?php
// read lines from stdin and emit one word per line
while (($line = fgets(STDIN)) !== false)
{
    $words = preg_split('/\s+/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word)
    {
        echo $word . "\n";
    }
}
?>

Reducer.php

<?php
// count occurrences of each word arriving on stdin
$wordcount = array();
while (($line = fgets(STDIN)) !== false)
{
    $word = trim($line);
    if (array_key_exists($word, $wordcount))
    {
        $wordcount[$word]++;
    }
    else
    {
        $wordcount[$word] = 1;
    }
}
foreach ($wordcount as $key => $value)
{
    echo $key . "\t" . $value . "\n";
}
?>

We can also mix languages: for example, the mapper in Perl and the reducer in Python (or vice versa), the mapper in PHP and the reducer in shell script, and so on.
