Thursday, August 21, 2014


Hadoop Streaming - An Utility from Hadoop for Non Java Programmers to write MapReduce Programs

Hadoop is a distributed computing framework that has become a de facto standard for data storage and processing at massive scale. Hadoop core has two layers: a distributed storage layer [HDFS] and a distributed computation or processing layer [MapReduce]. Both HDFS and MapReduce are written in Java, so to work with them directly people have to write Java programs; in particular, data processing requires writing MapReduce programs in Java. This Java-only MapReduce API is not convenient for non-Java programmers or for legacy code. To open up data processing on Hadoop to non-Java programmers, Hadoop provides a generic utility called Hadoop Streaming, which allows them to write MapReduce programs in languages of their choice such as Python, PHP, Perl, R, Shell, Scala, Ruby, C, C++, etc.

Hadoop Streaming is the integration of Hadoop, standard I/O (stdin/stdout), and an external executable (PHP, Perl, Python, etc.).

The data flow is as follows:

HDFS ==> stdin of [external program for Mapper] ==> stdout ==> [Shuffle/Sort] ==> stdin of [external program for Reducer] ==> stdout ==> HDFS

This is very good for legacy code, statisticians, and other non-Java programmers: they can easily integrate their code with Hadoop, and the computation runs in parallel.

Note: the compiler or interpreter of the external program must be installed on all cluster nodes. Languages like Python, Perl, and Shell are already bundled with Unix/Linux; other languages have to be installed on every node. The concept works like Unix pipes.

For this API there is a single jar, available in the contrib/streaming directory under HADOOP_HOME.

Python + Hadoop:

Usage:
 Cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar --input /streaming/news --output /streaming/pystream --mapper 'python /home/naga//bigdata/streaming/mapper.py' --reducer 'python /home/naga//bigdata/streaming/reducer.py'

Mapper.py:

import sys

lines = sys.stdin.readlines()
for line in lines:
    words = line.split()
    for word in words:
        sys.stdout.write(word + "\n")

Reducer.py:

import sys

lines = sys.stdin.readlines()
wordcount = {}
for line in lines:
    if line in wordcount:
        wordcount[line] = wordcount[line] + 1
    else:
        wordcount[line] = 1
for word in wordcount:
    sys.stdout.write(word.strip() + "\t" + str(wordcount[word]) + "\n")
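Because the mapper and reducer are just programs that read stdin and write stdout, the whole data flow can be simulated locally in plain Python before submitting a job. A minimal sketch (sample input invented), with a sort standing in for Hadoop's shuffle of mapper output:

```python
def mapper(lines):
    # Emit one word per output record, mirroring mapper.py above.
    for line in lines:
        for word in line.split():
            yield word

def shuffle(mapped):
    # Hadoop's shuffle phase sorts mapper output so that equal keys
    # arrive at the reducer together.
    return sorted(mapped)

def reducer(words):
    # Count the words, mirroring reducer.py above.
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

lines = ["hadoop streaming demo", "hadoop demo"]
result = reducer(shuffle(mapper(lines)))
print(result)  # {'demo': 2, 'hadoop': 2, 'streaming': 1}
```

On a real cluster the sorted shuffle also lets the reducer aggregate incrementally, since all identical keys arrive together; the in-memory dictionary above keeps the sketch as simple as reducer.py.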

Perl + Hadoop:

Usage:
Cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar --input /streaming/news --output /streaming/plstream --mapper 'perl /home/naga//bigdata/streaming/mapper.pl' --reducer 'perl /home/naga//bigdata/streaming/reducer.pl'

Mapper.pl:

#!/usr/bin/perl

@lines = <STDIN>;

foreach $line(@lines)
{
@words = split(/\s/, $line);
foreach $word(@words)
{
print $word, "\n";
}
}

Reducer.pl:

#!/usr/bin/perl

@words = <STDIN>;
%wordcount = ();

foreach $word(@words)
{
chomp($word);
if (exists $wordcount{$word})
{
$count = $wordcount{$word};
$count++;
$wordcount{$word} = $count;
}
else
{
$wordcount{$word} = 1;
}
}
while(($key, $value) = each %wordcount)
{
print $key, "\t", $value, "\n";
}

PHP + Hadoop:

Usage:
cmd> hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-1.0.3.jar --input /streaming/news --output /streaming/phpstream --mapper 'php /home/naga//bigdata/streaming/mapper.php' --reducer 'php /home/naga//bigdata/streaming/reducer.php'

Mapper.php

<?php
while(($file = fgets(STDIN)) !== false)
{
$words = explode(" ", $file);
foreach($words as $word)
{
echo $word . "\n";
}
}

?>

Reducer.php

<?php
$wordcount = array();
while(($myfile = fgets(STDIN)) !== false)
{
$file = trim($myfile);
if(array_key_exists($file, $wordcount))
{
$count = $wordcount[$file];
$count++;
$wordcount[$file] = $count;
}
else
{
$wordcount[$file] = 1;
}
}
foreach($wordcount as $key => $value)
{
echo $key . "\t". $value . "\n";
}

?>

We can mix Hadoop with Perl, PHP, Python, etc. in any combination: for example, mapper code in Perl and reducer code in Python (or vice versa), mapper code in PHP and reducer code in Shell script, and so on.


Hive Partitioning and Bucketing

Hive Partitioning:

In Hive, partitioning is used to avoid scanning the entire table for queries with filters (fine-grained queries), which improves job response times. Only the files from the partitions selected by the where clause are processed. The partition columns must appear in the where clause; otherwise the entire table is scanned.

In Hive, we have two types of partitioning:

  1. Static Partitioning
  2. Dynamic Partitioning

Static Partitioning:

*************creating table using single partition column*************

create table india (name string, age int, place string) partitioned by (state string) row format delimited fields terminated by '\t';

load data local inpath '/path of the file in local system(Linux)/filename' into table india partition (state='AP');
                                          [OR]
load data inpath '/path of the file in HDFS/filename' into table india partition (state='AP');



******** creating table using multiple partition columns***********

create table population(name string, age int, place string) partitioned by (state string, dist string) row format delimited fields terminated by '\t';

load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='AP', dist='PKM');

load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='AP', dist='GNT');

load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='KA', dist='Mandya');

load data local inpath '/path of the file in local system(Linux)/filename' into table population partition (state='KA', dist='Sivmoga');

                                              [OR]

load data inpath '/path of the file in HDFS/filename' into table population partition (state='AP', dist='PKM');

load data inpath '/path of the file in HDFS/filename' into table population partition (state='AP', dist='GNT');

load data inpath '/path of the file in HDFS/filename' into table population partition (state='KA', dist='Mandya');

load data inpath '/path of the file in HDFS/filename' into table population partition (state='KA', dist='Sivmoga');


select * from population;   => Direct HDFS call
select * from population where state='AP'; => Direct HDFS call
select * from population where state='AP' and dist='GNT'; => Direct HDFS call
select * from population where state='AP' and age > 30; => MapReduce Job
select avg(age) from population where state='AP' and dist='GNT' group by age;  => MapReduce Job
select avg(age) from population group by place;  => MapReduce Job
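Partition pruning works because Hive stores each partition of the table as its own HDFS directory, so a filter on the partition columns selects directories instead of scanning every file. A hedged sketch of the idea (warehouse path is the usual Hive default, shown here as an assumption):

```python
# Each partition of `population` maps to a directory named after its
# partition-column values. Warehouse path is illustrative.
partitions = [
    {"state": "AP", "dist": "PKM"},
    {"state": "AP", "dist": "GNT"},
    {"state": "KA", "dist": "Mandya"},
    {"state": "KA", "dist": "Sivmoga"},
]

def partition_dir(p):
    # Mirrors Hive's <warehouse>/<table>/state=<v>/dist=<v> layout.
    return "/user/hive/warehouse/population/state=%s/dist=%s" % (p["state"], p["dist"])

# WHERE state='AP' prunes the scan down to only the AP directories:
pruned = [partition_dir(p) for p in partitions if p["state"] == "AP"]
print(pruned)
```

This is why `where state='AP'` above can be answered with a direct HDFS call over two directories, while a filter on a non-partition column like `age` still needs a MapReduce job over the selected files.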

Dynamic Partitioning:

In dynamic partitioning, we have two modes:

  1. Strict Mode - requires at least one static partition column
  2. Non Strict Mode - no static partition column required

By default, Hive dynamic partitioning is enabled in strict mode.
There are limits on the number of dynamic partitions per node and per cluster:
by default, the maximum number of partitions per node is 100,
and the maximum number of partitions per cluster is 1000.

Strict Mode:

create table part_stocks(sdate string, open double, high double, low double, close double, volume bigint, adj_close double) partitioned by (stock string, market string) row format delimited fields terminated by '\t';

In the above table, we have two partition columns:
stock => dynamic partition column
market => static partition column

Increase the number of dynamic partitions per node (and per cluster) if necessary:
set hive.exec.max.dynamic.partitions.pernode=300

insert into table part_stocks partition(stock, market='NYSE') select sdate, open, high, low, close, volume, adj_close, stock from stocks;

Non Strict Mode:

Hive dynamic partitioning in non-strict mode requires setting "hive.exec.dynamic.partition.mode" to nonstrict:
set hive.exec.dynamic.partition.mode=nonstrict

create table part_stocks(market string, sdate string, open double, high double, low double, close double, volume bigint, adj_close double) partitioned by (stock string) row format delimited fields terminated by '\t';

In the above table, we have one partition column:
stock => dynamic partition column

Increase the number of dynamic partitions per node (and per cluster) if necessary:
set hive.exec.max.dynamic.partitions.pernode=300

insert into table part_stocks partition(stock) select market, sdate, open, high, low, close, volume, adj_close, stock from stocks;

Note: Partitioning is good for improving query response times, but over-partitioning creates problems for both HDFS (too many small files) and MapReduce (too many map tasks).
*******************************************************************

Hive Bucketing:

To avoid over-partitioning in Hive, it is better to use bucketing or a combination of partitioning and bucketing:

**** enable the bucketing **********
By default bucketing is disabled in Hive; enable it using the following parameter:
set hive.enforce.bucketing = true;

****creating table with a dynamic partition and using the bucketing concept******

CREATE TABLE stocks_bucketed(market string, stock string, open double, high double, low double, close double, volume bigint, adj_close double) PARTITIONED BY(sdate STRING) CLUSTERED BY(stock) INTO 5 BUCKETS row format delimited fields terminated by '\t';

describe stocks_bucketed;

insert into table stocks_bucketed partition(sdate) select market, stock, open,high,low,close,volume,adj_close,sdate from stocks;
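With CLUSTERED BY(stock) INTO 5 BUCKETS, each row is assigned to one of five bucket files per partition by hashing the clustering column modulo the bucket count. A hedged sketch of the assignment (the string hash below mimics Java's String.hashCode, which Hive uses for string columns; stock symbols are illustrative):

```python
# Sketch of bucket assignment for CLUSTERED BY(stock) INTO 5 BUCKETS.
NUM_BUCKETS = 5

def java_string_hash(s):
    # Java's String.hashCode: h = 31*h + ch, as a signed 32-bit int.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def bucket_for(stock):
    # Negative hashes are masked to a non-negative bucket index,
    # matching the usual (hash & Integer.MAX_VALUE) % numBuckets.
    return (java_string_hash(stock) & 0x7FFFFFFF) % NUM_BUCKETS

for stock in ["AAPL", "IBM", "GE"]:
    print(stock, bucket_for(stock))
```

Because the assignment is deterministic, all rows for one stock land in the same bucket file, which is what makes bucketed sampling and bucketed joins possible.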

Mahout - Machine Learning Framework on top of Hadoop

Mahout Introduction:

It is a Machine Learning framework on top of Apache Hadoop.
It has a list of distributed and non-distributed algorithms.
Mahout runs in Local Mode (non-distributed) and Hadoop Mode (distributed).
To run Mahout in distributed mode, install Hadoop and set the HADOOP_HOME environment variable.
To run Mahout in local mode, set the environment variable MAHOUT_LOCAL (e.g. to the Mahout directory path).

It supports three types of Machine Learning Algorithms

1. Classification
2. Clustering
3. Recommendation

Mahout Installation:

Download the Mahout binaries from the Apache Mahout website;
Extract the Mahout tar file to a directory;
Copy the directory to some location like /yourpath/mahout-0.x.x/
export MAHOUT_HOME=/yourpath/mahout-0.x.x/

Mahout runs in Two Modes

1. In Distributed Mode i.e. On Hadoop
2. In Local Mode i.e. Non distributed mode.

Running Mahout in Distributed Mode:

export HADOOP_HOME= Path of the Hadoop Installation Directory
export PATH=$PATH:$MAHOUT_HOME/bin
Type mahout

We will see info that Mahout is running on Hadoop, along with a list of available algorithms.
Mahout is now ready to run algorithms; the following are examples for classification, clustering, etc.

Classification Example:

Running Naive Bayes classifier by using 20 news group data set:

Download the data from the 20 newsgroups site. It has two sets of data, for training and testing.

Data Preparation:

mahout prepare20newsgroups -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 --outputDir 20news/train -p 20news-bydate-train
mahout prepare20newsgroups -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 --outputDir 20news/test -p 20news-bydate-test/

hadoop fs -rmr /20news
hadoop fs -mkdir /20news
hadoop fs -put train/ /20news/
hadoop fs -put test/ /20news/
hadoop fs -rmr /20news/model

Train the classifier on the training data set with custom attribute values like ng, source, etc.:
mahout trainclassifier -i /20news/train/ -o /20news/model -type bayes -ng 1 -source hdfs

Test the classifier on the testing data set using the model created in the training phase:
mahout testclassifier -m /20news/model -d /20news/test -type bayes -ng 1 -source hdfs
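Conceptually, the train step above estimates per-class word statistics from the labeled training documents, and the test step classifies held-out documents with them. A hedged toy sketch of that idea (plain Naive Bayes with invented two-class data, not the 20 newsgroups set or Mahout's exact implementation):

```python
import math
from collections import Counter, defaultdict

# Invented toy training data: (label, document text).
train = [("sports", "ball game score"), ("sports", "game win"),
         ("tech", "code compile bug"), ("tech", "code release")]

# "Training": count words per class, as the trainclassifier step
# does at scale over HDFS.
word_counts = defaultdict(Counter)
class_totals = Counter()
for label, text in train:
    for w in text.split():
        word_counts[label][w] += 1
        class_totals[label] += 1

def classify(text):
    # "Testing": pick the class with the highest log-probability,
    # using add-one (Laplace) smoothing for unseen words.
    vocab = {w for c in word_counts.values() for w in c}
    best, best_lp = None, float("-inf")
    for label in word_counts:
        lp = sum(math.log((word_counts[label][w] + 1) /
                          (class_totals[label] + len(vocab)))
                 for w in text.split())
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(classify("code bug"))  # tech
```

Mahout's bayes/cbayes trainers add weighting and normalization on top of this, but the train-then-score shape is the same.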

Running Complementary Naive Bayes classifier by using 20 news group data set:

hadoop fs -rmr /20news
hadoop fs -mkdir /20news
hadoop fs -put train/ /20news/
hadoop fs -put test/ /20news/
hadoop fs -rmr /20news/model

mahout trainclassifier -i /20news/train/ -o /20news/model -type cbayes -ng 1 -source hdfs
mahout testclassifier -m /20news/model -d /20news/test -type cbayes -ng 1 -source hdfs

Clustering Example:

Running Dirichlet clustering algorithm with Reuters data set

Prepare the Reuters data set:
Extract the data set:
mahout org.apache.lucene.benchmark.utils.ExtractReuters  reuters-in (Downloaded data set)  reuters-out

Generate sequence files from the extracted Reuters data set.
This requires Mahout to run in local mode, so set MAHOUT_LOCAL to the Mahout home directory:
export MAHOUT_LOCAL=<path to Mahout home directory>
mahout seqdirectory -i reuters-out/ -o reuters-seq -c UTF-8 -ow

Turn Mahout back to Hadoop mode:
export MAHOUT_LOCAL=

Load the data into hdfs under a particular directory

hadoop fs -rmr /genpact/clustering
hadoop fs -mkdir /genpact/clustering
hadoop fs -put reuters-seq/ /genpact/clustering/

Generating the term vectors from sequence files.

mahout seq2sparse -i /genpact/clustering/reuters-seq/ -o /genpact/clustering/reuters-out-seqdir-sparse-fkmeans --maxDFPercent 85 --namedVector

Run the Dirichlet Clustering algorithm:

mahout dirichlet -i /genpact/clustering/data/reuters-sparse/tfidf-vectors -o /genpact/clustering/dirichlet/reuters-dirichlet -k 20 -ow -x 20 -a0 2 -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution -mp org.apache.mahout.math.DenseVector -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Run the KMeans Clustering algorithm:

mahout kmeans -i /genpact/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c /genpact/reuters-kmeans-clusters -o /genpact/reuters-kmeans/ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Run the Fuzzy KMeans Clustering algorithm:

mahout fkmeans -i /genpact/clustering/reuters-out-seqdir-sparse-fkmeans/tfidf-vectors/ -c /genpact/clustering/reuters-kmeans-clusters -o /genpact/reuters-kmeans/ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Dump the clustering results:

mahout clusterdump -s /genpact/clustering/reuters-kmeans/clusters-*-final -d /genpact/clustering/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir /genpact/clustering/reuters-kmeans/clusteredPoints
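Several of the commands above pass CosineDistanceMeasure as the distance function over tf-idf vectors. A short sketch of what that distance computes (vectors are illustrative):

```python
import math

# Cosine distance, as used by the kmeans/fkmeans commands above
# (-dm org.apache.mahout.common.distance.CosineDistanceMeasure):
# distance = 1 - cosine similarity of the two vectors.
def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

print(cosine_distance([1.0, 0.0], [1.0, 0.0]))  # 0.0: same direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))  # 1.0: orthogonal
```

Cosine distance ignores vector magnitude, which suits text clustering: two documents about the same topic cluster together even if one is much longer than the other.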

Logistic Regression:
Download data sets like donut.csv or adult.csv, etc.

mahout trainlogistic --passes 100 --rate 50 --lambda 0.001 --input /home/hadoopz/naga/datasets/adult.csv --output adults.model --target salgroup --categories 2 --predictors age sex hours-per-week --types n n n

mahout runlogistic --input /home/hadoopz/naga/datasets/adult.test --model adults.model --auc --scores --confusion > clasifieddata

Running Latent Dirichlet allocation (Topic Modeling):

hadoop fs -rmr /genpact/lda
mahout lda -i /genpact/reuters-out-seqdir-sparse-kmeans/tf-vectors -o /genpact/lda -ow -k 10 -x 20
mahout org.apache.mahout.clustering.lda.LDAPrintTopics -i /genpact/lda/state-20/ -d /genpact/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -w 10

Enabling and Configuring Capacity Scheduler

Hadoop Schedulers:

Hadoop has 4 types of schedulers

  1. FIFO (By default)
  2. Capacity Scheduler
  3. Fair Scheduler
  4. HOD Scheduler (Hadoop On Demand)

In Hadoop, jobs are submitted to queues, and queues submit jobs to the Hadoop cluster. By default there is a single queue, named default.

Configuring Capacity Scheduler:

To configure Capacity Scheduler, we need to do the following things:

1. Change/Add some of the configuration parameters in mapred-site.xml
2. Change/Add queues information in the capacity-scheduler.xml
3. Change/Add acls information in the mapred-queue-acls.xml

Assume 4 queues for the cluster, named india, usa, aus, and eng.

1. Changes to mapred-site.xml

  <configuration>

    <property>
      <name>mapred.job.tracker</name>
      <value>hostname:9001</value>
      <description>This tells where the JobTracker is running</description>
    </property>

    <property>
      <name>mapred.reduce.tasks</name>
      <value>2</value>
      <description>This number tells how many reducers to run</description>
    </property>

    <property>
      <name>mapred.jobtracker.taskScheduler</name>
      <value>org.apache.hadoop.mapred.CapacityTaskScheduler</value>
      <description>This overrides the default FIFO scheduler with the Capacity Scheduler</description>
    </property>

    <property>
      <name>mapred.queue.names</name>
      <value>india,usa,eng,aus</value>
      <description>These are the newly created queue names</description>
    </property>
 
    <property>
      <name>mapred.acls.enabled</name>
      <value>true</value>
      <description>This enables access control lists</description>
    </property>

    <property>
      <name>mapred.job.queue.name</name>
      <value>india</value>
      <description>Specify the specific queue to submit the job instead of submitting to default queue.</description>
    </property>
  </configuration>

2. Changes to capacity-scheduler.xml

    <configuration>

    <!-- system limit, across all queues -->

      <property>
        <name>mapred.capacity-scheduler.maximum-system-jobs</name>
        <value>3000</value>
        <description>Maximum number of jobs in the system which can be initialized,
        concurrently, by the CapacityScheduler.
        </description>
      </property>


    <!-- queue: india -->
      <property>
        <name>mapred.capacity-scheduler.queue.india.capacity</name>
        <value>20</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.india.supports-priority</name>
        <value>false</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.india.minimum-user-limit-percent</name>
        <value>20</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.india.user-limit-factor</name>
        <value>10</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.india.maximum-initialized-active-tasks</name>
        <value>200000</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.india.maximum-initialized-active-tasks-per-user</name>
        <value>100000</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.india.init-accept-jobs-factor</name>
        <value>100</value>
      </property>

    <!-- queue: usa -->
      <property>
        <name>mapred.capacity-scheduler.queue.usa.capacity</name>
        <value>30</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.usa.supports-priority</name>
        <value>false</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.usa.minimum-user-limit-percent</name>
        <value>20</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.usa.user-limit-factor</name>
        <value>1</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.usa.maximum-initialized-active-tasks</name>
        <value>200000</value>
      </property>
 
      <property>
        <name>mapred.capacity-scheduler.queue.usa.maximum-initialized-active-tasks-per-user</name>
        <value>100000</value>
      </property>
 
      <property>
        <name>mapred.capacity-scheduler.queue.usa.init-accept-jobs-factor</name>
        <value>10</value>
      </property>

    <!-- queue: eng -->
      <property>
        <name>mapred.capacity-scheduler.queue.eng.capacity</name>
        <value>30</value>
      </property>
 
      <property>
        <name>mapred.capacity-scheduler.queue.eng.supports-priority</name>
        <value>false</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.eng.minimum-user-limit-percent</name>
        <value>20</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.eng.user-limit-factor</name>
        <value>1</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.eng.maximum-initialized-active-tasks</name>
        <value>200000</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.eng.maximum-initialized-active-tasks-per-user</name>
        <value>100000</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.eng.init-accept-jobs-factor</name>
        <value>10</value>
      </property>

    <!-- queue: aus -->
      <property>
        <name>mapred.capacity-scheduler.queue.aus.capacity</name>
        <value>20</value>
      </property>
 
      <property>
        <name>mapred.capacity-scheduler.queue.aus.supports-priority</name>
        <value>false</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.aus.minimum-user-limit-percent</name>
        <value>20</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.aus.user-limit-factor</name>
        <value>20</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.aus.maximum-initialized-active-tasks</name>
        <value>200000</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.aus.maximum-initialized-active-tasks-per-user</name>
        <value>100000</value>
      </property>

      <property>
        <name>mapred.capacity-scheduler.queue.aus.init-accept-jobs-factor</name>
        <value>10</value>
      </property>
</configuration>
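The queue capacities configured above (india 20, usa 30, eng 30, aus 20) are percentages of cluster capacity and must sum to 100. A quick sketch of how capacity translates to guaranteed task slots (the cluster size is a hypothetical value, not from this configuration):

```python
# Queue capacities from the capacity-scheduler.xml above; they are
# percentages of total cluster capacity and must total 100.
capacities = {"india": 20, "usa": 30, "eng": 30, "aus": 20}
assert sum(capacities.values()) == 100

# With a hypothetical 200 map slots in the cluster, each queue's
# guaranteed share is its capacity percentage of the total.
TOTAL_SLOTS = 200
shares = {q: TOTAL_SLOTS * c // 100 for q, c in capacities.items()}
print(shares)  # {'india': 40, 'usa': 60, 'eng': 60, 'aus': 40}
```

Idle capacity in one queue can be borrowed by busy queues up to their user-limit-factor, which is why the per-queue user-limit-factor values above matter.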
       
       
3. Changes to mapred-queue-acls.xml

<configuration>

<property>
  <name>mapred.queue.india.acl-submit-job</name>
  <value>*</value>
  <description> Comma separated list of user and group names that are allowed
    to submit jobs to the 'india' queue. The user list and the group list
    are separated by a blank. For e.g. user1,user2 group1,group2.
    If set to the special value '*', it means all users are allowed to
    submit jobs. If set to ' '(i.e. space), no user will be allowed to submit
    jobs.

    It is only used if authorization is enabled in Map/Reduce by setting the
    configuration property mapred.acls.enabled to true.

    Irrespective of this ACL configuration, the user who started the cluster and
    cluster administrators configured via mapreduce.cluster.administrators can submit jobs.
  </description>
</property>

<property>
  <name>mapred.queue.india.acl-administer-jobs</name>
  <value>*</value>
  <description> Comma separated list of user and group names that are allowed
    to view job details, kill jobs or modify job's priority for all the jobs
    in the 'india' queue. The user list and the group list
    are separated by a blank. For e.g. user1,user2 group1,group2.
    If set to the special value '*', it means all users are allowed to do
    this operation. If set to ' '(i.e. space), no user will be allowed to do
    this operation.

    It is only used if authorization is enabled in Map/Reduce by setting the
    configuration property mapred.acls.enabled to true.

    Irrespective of this ACL configuration, the user who started the cluster and
    cluster administrators configured via
    mapreduce.cluster.administrators can do the above operations on all the jobs
    in all the queues. The job owner can do all the above operations on his/her
    job irrespective of this ACL configuration.
  </description>
</property>

</configuration>

Once all the above changes are done, run start-all.sh to start the Hadoop cluster.
Visit the Hadoop MapReduce cluster administration page at http://hostname:50030/jobtracker.jsp
The Scheduling Information section of that page gives complete info about every configured queue.

****************************************Done*************************************

Installing and Running Lipstick for managing Apache Pig Workflows

1.  Introduction to Lipstick:
Lipstick combines a graphical depiction of a Pig workflow with information about the job as it executes, giving developers insight that previously required a lot of sifting through logs (or a Pig expert) to piece together.

2.  Installing and Running Lipstick:
The following steps are required to install lipstick:
Step1:  Download the source code from git repository
Cmd> git clone https://github.com/Netflix/Lipstick.git
Create an environment variable LIPSTICK_HOME=<path to home of lipstick dir> in the .bashrc file
Step2:  Installing runtime dependencies
Install graphviz --> a rich set of graph drawing tools
Ubuntu: sudo apt-get install graphviz
CentOS: sudo yum install graphviz

Running Lipstick Locally

Step3:  Start the server
cmd> cd $LIPSTICK_HOME
cmd> ./gradlew debug -PwithHadoop
cmd> cd quickstart/
cmd> ../example1

Running Lipstick on Hadoop Cluster

Step4:  Install the MySQL database
Create the database called lipstick with the root user and password xxxxxxxx
Create the lipstick properties file called lipstick.properties under the /etc/ directory
content of lipstick.properties is:
dataSource.driverClassName=com.mysql.jdbc.Driver
dataSource.username=xxxxx
dataSource.password=xxxxx
dataSource.dbCreate=update
dataSource.url=jdbc:mysql://localhost:3306/lipstick?useUnicode=true&characterEncoding=utf8&autoReconnect=true

Step5:  cmd> cd $LIPSTICK_HOME
cmd> ./gradlew
This will create the lipstick-x.x.war file in the build directory of the Lipstick home

Step6:  Install Apache Tomcat to deploy the Lipstick war

Step7:  Increase the allowed war file size in Tomcat to 100 MB by modifying webapps/manager/WEB-INF/web.xml (the default allowed war size is 50 MB, but the Lipstick war is larger than 50 MB)

Step8:  Copy all the Lipstick jars from its build directory to the Hadoop lib directory
Cmd> cp $LIPSTICK_HOME/build/*.jar $HADOOP_HOME/lib
Run this command: hadoop jar lipstick-console-0.6-SNAPSHOT.jar  -Dlipstick.server.url=http://hadoop:8080/lipstick-1.0

Step9:  We will see the Lipstick web interface at http://hadoop:8080/lipstick-1.0/
Step 8 will take you to the grunt shell; execute Pig Latin statements and watch the Lipstick web interface

*******************Installation Part is done ********************************