Thursday, August 21, 2014

Mahout - Machine Learning Framework on top of Hadoop

Mahout Introduction:

Mahout is a machine learning framework that runs on top of Apache Hadoop.
It provides both distributed and non-distributed algorithms.
Mahout runs in Local Mode (non-distributed) or Hadoop Mode (distributed).
To run Mahout in distributed mode, install Hadoop and set the HADOOP_HOME environment variable.
To run Mahout in local mode, set the environment variable MAHOUT_LOCAL to any non-empty value (for example, the Mahout directory path).
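As a sketch, the two modes can be selected with environment variables like this (all paths here are placeholders, not part of any standard layout):

```shell
# Mode-selection sketch; MAHOUT_HOME and HADOOP_HOME values are placeholders.
export MAHOUT_HOME=/yourpath/mahout-0.x.x
export PATH=$PATH:$MAHOUT_HOME/bin

# Distributed (Hadoop) mode: HADOOP_HOME set, MAHOUT_LOCAL unset
export HADOOP_HOME=/yourpath/hadoop
unset MAHOUT_LOCAL

# Local mode: any non-empty MAHOUT_LOCAL overrides Hadoop mode
export MAHOUT_LOCAL=$MAHOUT_HOME
```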

It supports three types of machine learning algorithms:

1. Classification
2. Clustering
3. Recommendation

Mahout Installation:

Download the Mahout binaries from the Apache Mahout website.
Extract the Mahout tar file.
Move the extracted directory to a location such as /yourpath/mahout-0.x.x/
export MAHOUT_HOME=/yourpath/mahout-0.x.x/
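Put together, the installation steps might look like the following (the version number and mirror URL are assumptions; substitute the release you actually downloaded):

```shell
# Hypothetical install sketch; adjust the version and paths to your setup.
MAHOUT_VERSION=0.9
# Download and unpack (uncomment on a machine with network access):
# wget https://archive.apache.org/dist/mahout/${MAHOUT_VERSION}/mahout-distribution-${MAHOUT_VERSION}.tar.gz
# tar -xzf mahout-distribution-${MAHOUT_VERSION}.tar.gz -C /yourpath/
export MAHOUT_HOME=/yourpath/mahout-distribution-${MAHOUT_VERSION}
export PATH=$PATH:$MAHOUT_HOME/bin
echo "$MAHOUT_HOME"
```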

Mahout runs in Two Modes

1. Distributed Mode, i.e. on Hadoop
2. Local Mode, i.e. non-distributed

Running Mahout in Distributed Mode:

export HADOOP_HOME=<path to the Hadoop installation directory>
export PATH=$PATH:$MAHOUT_HOME/bin
Run the mahout command.

It prints a message confirming that Mahout is running on Hadoop, along with a list of the available algorithms.
Mahout is then ready to run those algorithms; the sections below give examples for classification, clustering, and more.

Classification Example:

Running the Naive Bayes classifier on the 20 Newsgroups data set:

Download the 20 Newsgroups data set. It contains two subsets, one for training and one for testing.
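For example (the mirror URL below is an assumption; any copy of the "bydate" archive works):

```shell
# Hedged sketch: fetch and unpack the 20 Newsgroups "bydate" archive.
NEWS_URL=http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
ARCHIVE=${NEWS_URL##*/}    # 20news-bydate.tar.gz
# Uncomment on a machine with network access; unpacking yields the
# 20news-bydate-train/ and 20news-bydate-test/ directories used below.
# wget -q "$NEWS_URL" && tar -xzf "$ARCHIVE"
echo "$ARCHIVE"
```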

Data Preparation:

mahout prepare20newsgroups -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 --outputDir 20news/train -p 20news-bydate-train
mahout prepare20newsgroups -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 --outputDir 20news/test -p 20news-bydate-test/

Load the prepared data into HDFS (run these from inside the local 20news directory so that train/ and test/ resolve):

hadoop fs -rmr /20news
hadoop fs -mkdir /20news
hadoop fs -put train/ /20news/
hadoop fs -put test/ /20news/
hadoop fs -rmr /20news/model

Train the classifier on the training data set, with attributes such as the n-gram size (-ng) and the data source (-source):
mahout trainclassifier -i /20news/train/ -o /20news/model -type bayes -ng 1 -source hdfs

Test the classifier on the test data set, using the model created during the training phase:
mahout testclassifier -m /20news/model -d /20news/test -type bayes -ng 1 -source hdfs

Running the Complementary Naive Bayes classifier on the 20 Newsgroups data set:

hadoop fs -rmr /20news
hadoop fs -mkdir /20news
hadoop fs -put train/ /20news/
hadoop fs -put test/ /20news/
hadoop fs -rmr /20news/model

mahout trainclassifier -i /20news/train/ -o /20news/model -type cbayes -ng 1 -source hdfs
mahout testclassifier -m /20news/model -d /20news/test -type cbayes -ng 1 -source hdfs

Clustering Example:

Running the Dirichlet clustering algorithm on the Reuters data set

Prepare the Reuters data set by extracting it (here reuters-in is the directory containing the downloaded data set, and reuters-out receives the extracted files):
mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-in reuters-out

Generate sequence files from the extracted Reuters data set.
This step requires Mahout to run in local mode, so set MAHOUT_LOCAL to a non-empty value such as the Mahout home directory:
export MAHOUT_LOCAL=<path to the Mahout home directory>
mahout seqdirectory -i reuters-out/ -o reuters-seq -c UTF-8 -ow

Switch Mahout back to Hadoop mode by clearing the variable:
export MAHOUT_LOCAL=

Load the data into HDFS under a dedicated directory:

hadoop fs -rmr /genpact/clustering
hadoop fs -mkdir /genpact/clustering
hadoop fs -put reuters-seq/ /genpact/clustering/

Generate the term vectors from the sequence files:

mahout seq2sparse -i /genpact/clustering/reuters-seq/ -o /genpact/clustering/reuters-out-seqdir-sparse-fkmeans --maxDFPercent 85 --namedVector

Run the Dirichlet clustering algorithm (adjust the -i path to match the tfidf-vectors directory produced by the seq2sparse step above):

mahout dirichlet -i /genpact/clustering/data/reuters-sparse/tfidf-vectors -o /genpact/clustering/dirichlet/reuters-dirichlet -k 20 -ow -x 20 -a0 2 -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution -mp org.apache.mahout.math.DenseVector -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Run the KMeans Clustering algorithm:

mahout kmeans -i /genpact/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c /genpact/reuters-kmeans-clusters -o /genpact/reuters-kmeans/ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Run the Fuzzy KMeans Clustering algorithm:

mahout fkmeans -i /genpact/clustering/reuters-out-seqdir-sparse-fkmeans/tfidf-vectors/ -c /genpact/clustering/reuters-kmeans-clusters -o /genpact/reuters-kmeans/ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering

Dump the clustering results:

mahout clusterdump -s /genpact/clustering/reuters-kmeans/clusters-*-final -d /genpact/clustering/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir /genpact/clustering/reuters-kmeans/clusteredPoints

Logistic Regression:
Download a data set such as donut.csv or adult.csv.

mahout trainlogistic --passes 100 --rate 50 --lambda 0.001 --input /home/hadoopz/naga/datasets/adult.csv --output adults.model --target salgroup --categories 2 --predictors age sex hours-per-week --types n n n

mahout runlogistic --input /home/hadoopz/naga/datasets/adult.test --model adults.model --auc --scores --confusion > classifieddata

Running Latent Dirichlet allocation (Topic Modeling):

hadoop fs -rmr /genpact/lda
mahout lda -i /genpact/reuters-out-seqdir-sparse-kmeans/tf-vectors -o /genpact/lda -ow -k 10 -x 20
mahout org.apache.mahout.clustering.lda.LDAPrintTopics -i /genpact/lda/state-20/ -d /genpact/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -w 10
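The introduction also lists recommendation as a supported algorithm type. A minimal item-based recommendation sketch, assuming the 0.x recommenditembased driver and its flag names (verify these against your Mahout version), could look like:

```shell
# Hedged sketch: item-based collaborative filtering with Mahout's
# recommenditembased driver; flag names are assumed from the 0.x CLI.
# Input lines are userID,itemID,rating triples.
cat > ratings.csv <<'EOF'
1,101,5.0
1,102,3.0
2,101,4.0
2,103,5.0
3,102,2.0
EOF

# The cluster steps are guarded so the sketch is harmless without Hadoop/Mahout.
if command -v hadoop >/dev/null && command -v mahout >/dev/null; then
  hadoop fs -mkdir /recs
  hadoop fs -put ratings.csv /recs/
  mahout recommenditembased -i /recs/ratings.csv -o /recs/output \
    --similarityClassname SIMILARITY_COSINE --numRecommendations 5
fi
```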
