Mahout Introduction:
Apache Mahout is a machine learning framework built on top of Apache Hadoop.
It provides both distributed and non-distributed algorithms.
Mahout runs in local mode (non-distributed) and Hadoop mode (distributed).
To run Mahout in distributed mode, install Hadoop and set the HADOOP_HOME environment variable.
To run Mahout in local mode, set the environment variable MAHOUT_LOCAL to the Mahout directory path.
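The mode switch above comes down to two environment variables; a minimal sketch (the /opt paths are placeholders, not from this guide):

```shell
# Any non-empty MAHOUT_LOCAL makes the mahout launcher run locally; unset
# it (with HADOOP_HOME set) to run on Hadoop. Paths are placeholders.
export MAHOUT_LOCAL=/opt/mahout
echo "local mode: MAHOUT_LOCAL='${MAHOUT_LOCAL}'"

unset MAHOUT_LOCAL
export HADOOP_HOME=/opt/hadoop
echo "hadoop mode: HADOOP_HOME='${HADOOP_HOME}'"
```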
It supports three types of machine learning algorithms:
1. Classification
2. Clustering
3. Recommendation
Mahout Installation:
Download the Mahout binaries from the Apache Mahout website.
Extract the Mahout tar file.
Copy the extracted directory to some location like /yourpath/mahout-0.x.x/
export MAHOUT_HOME=/yourpath/mahout-0.x.x/
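A quick sanity check that MAHOUT_HOME points at a real install; /yourpath/mahout-0.x.x is the placeholder from the steps above, not a real path:

```shell
# Sanity-check the install location; the path is the placeholder used above.
export MAHOUT_HOME=/yourpath/mahout-0.x.x
if [ -x "$MAHOUT_HOME/bin/mahout" ]; then
  echo "mahout launcher found at $MAHOUT_HOME/bin/mahout"
else
  echo "no launcher at $MAHOUT_HOME/bin/mahout -- check the extract location"
fi
```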
Mahout runs in two modes:
1. Distributed mode, i.e. on Hadoop
2. Local mode, i.e. non-distributed
Running Mahout in Distributed Mode:
export HADOOP_HOME=<path to the Hadoop installation directory>
export PATH=$PATH:$MAHOUT_HOME/bin
Type mahout.
You should see a message that Mahout is running on Hadoop, followed by a list of the available algorithms.
Mahout is now ready to run algorithms; the following are examples for classification, clustering, and more.
Classification Example:
Running the Naive Bayes classifier on the 20 Newsgroups data set:
Download the 20 Newsgroups data. It has two sets of data, one for training and one for testing.
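A sketch of fetching and unpacking the "bydate" split; the download URL is an assumption, so confirm it on the 20 Newsgroups page before relying on it:

```shell
# Fetch the "bydate" split of 20 Newsgroups. The URL is an assumption --
# verify it on the dataset page. Degrades gracefully when offline.
URL=http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
if command -v curl >/dev/null 2>&1 && curl -fsSLO "$URL"; then
  tar -xzf 20news-bydate.tar.gz && echo "extracted 20news-bydate"
  # yields 20news-bydate-train/ and 20news-bydate-test/
else
  echo "could not download -- fetch 20news-bydate.tar.gz manually"
fi
```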
Data Preparation:
mahout prepare20newsgroups -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 --outputDir 20news/train -p 20news-bydate-train
mahout prepare20newsgroups -a org.apache.mahout.vectorizer.DefaultAnalyzer -c UTF-8 --outputDir 20news/test -p 20news-bydate-test/
hadoop fs -rmr /20news
hadoop fs -mkdir /20news
hadoop fs -put train/ /20news/
hadoop fs -put test/ /20news/
hadoop fs -rmr /20news/model
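Before training, it can help to confirm the HDFS layout; a guarded sketch that assumes a running cluster and degrades gracefully without one:

```shell
# List the uploaded data; prints a hint instead when no Hadoop is on PATH.
if command -v hadoop >/dev/null 2>&1; then
  hadoop fs -ls /20news     # expect train/ and test/ subdirectories
else
  echo "hadoop not on PATH -- start HDFS and set HADOOP_HOME first"
fi
```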
Train the classifier on the training data set, with attribute values such as -ng (n-gram size) and -source:
mahout trainclassifier -i /20news/train/ -o /20news/model -type bayes -ng 1 -source hdfs
Test the classifier on the test data set using the model created in the training phase:
mahout testclassifier -m /20news/model -d /20news/test -type bayes -ng 1 -source hdfs
Running the Complementary Naive Bayes classifier on the 20 Newsgroups data set:
hadoop fs -rmr /20news
hadoop fs -mkdir /20news
hadoop fs -put train/ /20news/
hadoop fs -put test/ /20news/
hadoop fs -rmr /20news/model
mahout trainclassifier -i /20news/train/ -o /20news/model -type cbayes -ng 1 -source hdfs
mahout testclassifier -m /20news/model -d /20news/test -type cbayes -ng 1 -source hdfs
Clustering Example:
Running the Dirichlet clustering algorithm on the Reuters data set
Prepare the Reuters data set:
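The Reuters-21578 corpus has to be fetched first; a hedged sketch — the UCI mirror URL is an assumption, so confirm it before use:

```shell
# Fetch Reuters-21578 into reuters-in/. The mirror URL is an assumption;
# if the download fails, place the SGML files there by hand.
URL=http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
mkdir -p reuters-in
if command -v curl >/dev/null 2>&1 && curl -fsSL "$URL" | tar -xz -C reuters-in 2>/dev/null; then
  echo "reuters-in populated"
else
  echo "could not download -- place the SGML files in reuters-in/ manually"
fi
```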
Extract the data set:
mahout org.apache.lucene.benchmark.utils.ExtractReuters reuters-in reuters-out
(reuters-in is the directory containing the downloaded data set)
Generate sequence files from the extracted Reuters data set.
This requires Mahout to run in local mode, so set MAHOUT_LOCAL:
export MAHOUT_LOCAL=<path to the Mahout home directory>
mahout seqdirectory -i reuters-out/ -o reuters-seq -c UTF-8 -ow
Switch Mahout back to Hadoop mode by clearing MAHOUT_LOCAL:
export MAHOUT_LOCAL=
Load the data into HDFS under a particular directory:
hadoop fs -rmr /genpact/clustering
hadoop fs -mkdir /genpact/clustering
hadoop fs -put reuters-seq/ /genpact/clustering/
Generate the term vectors from the sequence files:
mahout seq2sparse -i /genpact/clustering/reuters-seq/ -o /genpact/clustering/reuters-out-seqdir-sparse-fkmeans --maxDFPercent 85 --namedVector
Run the Dirichlet Clustering algorithm:
mahout dirichlet -i /genpact/clustering/data/reuters-sparse/tfidf-vectors -o /genpact/clustering/dirichlet/reuters-dirichlet -k 20 -ow -x 20 -a0 2 -md org.apache.mahout.clustering.dirichlet.models.DistanceMeasureClusterDistribution -mp org.apache.mahout.math.DenseVector -dm org.apache.mahout.common.distance.CosineDistanceMeasure
Run the KMeans Clustering algorithm:
mahout kmeans -i /genpact/reuters-out-seqdir-sparse-kmeans/tfidf-vectors/ -c /genpact/reuters-kmeans-clusters -o /genpact/reuters-kmeans/ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering
Run the Fuzzy KMeans Clustering algorithm:
mahout fkmeans -i /genpact/clustering/reuters-out-seqdir-sparse-fkmeans/tfidf-vectors/ -c /genpact/clustering/reuters-kmeans-clusters -o /genpact/reuters-kmeans/ -dm org.apache.mahout.common.distance.CosineDistanceMeasure -x 10 -k 20 -ow --clustering
Dump the clustering results:
mahout clusterdump -s /genpact/clustering/reuters-kmeans/clusters-*-final -d /genpact/clustering/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -b 100 -n 20 --evaluate -dm org.apache.mahout.common.distance.CosineDistanceMeasure --pointsDir /genpact/clustering/reuters-kmeans/clusteredPoints
Logistic Regression:
Download a data set such as donut.csv or adult.csv:
mahout trainlogistic --passes 100 --rate 50 --lambda 0.001 --input /home/hadoopz/naga/datasets/adult.csv --output adults.model --target salgroup --categories 2 --predictors age sex hours-per-week --types n n n
mahout runlogistic --input /home/hadoopz/naga/datasets/adult.test --model adults.model --auc --scores --confusion > classifieddata
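The trainlogistic flags above assume a CSV with a header row naming the target and predictor columns; a tiny made-up example of that shape:

```shell
# A minimal CSV of the shape trainlogistic expects: a header row first,
# then one record per line. Column names and values here are made up;
# sex is encoded numerically to match the "--types n n n" flags above.
cat > tiny.csv <<'EOF'
salgroup,age,sex,hours-per-week
0,39,1,40
1,50,0,60
EOF
head -1 tiny.csv   # the names used with --target and --predictors
```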
Running Latent Dirichlet Allocation (topic modeling):
hadoop fs -rmr /genpact/lda
mahout lda -i /genpact/reuters-out-seqdir-sparse-kmeans/tf-vectors -o /genpact/lda -ow -k 10 -x 20
mahout org.apache.mahout.clustering.lda.LDAPrintTopics -i /genpact/lda/state-20/ -d /genpact/reuters-out-seqdir-sparse-kmeans/dictionary.file-0 -dt sequencefile -w 10