Ruby Version Of Programs In "A Programmer's Guide To Data Mining"
I have written the Ruby version of the programs in the online book "A Programmer's Guide To Data Mining". The book's website is http://guidetodatamining.com/. This is my first small step into the world of analytics and now I feel I am a member of the exclusive Numerati club.
Though the book is on data mining, it is a good entry level resource for machine learning. It simplifies and de-mystifies data analytics for the programmer using Python examples. I translated the programs into Ruby for my own learning, as well as for further simplification and de-mystification with easy language idioms.
I have uploaded the programs to my GitHub repository: https://github.com/mh-github/gtdm-r
Size data, as given by cloc.pl, is:
-------------------------------------------------------------------------------
Language                     files          blank        comment           code
-------------------------------------------------------------------------------
Ruby                            20            333            896           2583
-------------------------------------------------------------------------------
Given below is a brief note on each program.
Chapter 2
filteringdata.rb
---- Gives a list of recommendations
---- First finds the nearest neighbor,
---- based on Manhattan distance
---- Input : hard coded in the file.
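The heart of the program is the Manhattan distance between two users' rating hashes, summed over the items both have rated. A minimal sketch (the method name and sample ratings here are illustrative, not the repository's exact code):

```ruby
# Manhattan distance between two users' rating hashes, counting only
# the items both users have rated. Names and data are illustrative.
def manhattan(rating1, rating2)
  shared = rating1.keys & rating2.keys
  shared.sum { |item| (rating1[item] - rating2[item]).abs }
end

hailey   = { "Broken Bells" => 4, "Deadmau5" => 1 }
veronica = { "Broken Bells" => 3, "Deadmau5" => 4, "Norah Jones" => 5 }
manhattan(hailey, veronica)  # => 4, i.e. |4 - 3| + |1 - 4|
```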
filteringdataPearson.rb
---- Gives implementation of Pearson Correlation Coefficient calculation
---- Input : hard coded in the file.
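The calculation follows the familiar single-pass approximation formula for Pearson's r, applied to the items both users rated. A sketch under those assumptions, not the repository's exact code:

```ruby
# Pearson correlation via the single-pass approximation formula,
# computed over the items both users rated. Returns 0.0 when there
# is no overlap or no variance.
def pearson(rating1, rating2)
  shared = rating1.keys & rating2.keys
  n = shared.length
  return 0.0 if n.zero?
  x = shared.map { |k| rating1[k].to_f }
  y = shared.map { |k| rating2[k].to_f }
  sum_x  = x.sum
  sum_y  = y.sum
  sum_xy = x.zip(y).sum { |a, b| a * b }
  denom_x = Math.sqrt(x.sum { |v| v * v } - sum_x**2 / n)
  denom_y = Math.sqrt(y.sum { |v| v * v } - sum_y**2 / n)
  return 0.0 if denom_x.zero? || denom_y.zero?
  (sum_xy - sum_x * sum_y / n) / (denom_x * denom_y)
end

clara  = { "Blues Traveler" => 4.75, "Norah Jones" => 4.5, "Phoenix" => 5.0 }
robert = { "Blues Traveler" => 4.0,  "Norah Jones" => 3.0, "Phoenix" => 5.0 }
pearson(clara, robert)  # ≈ 1.0: these ratings are perfectly linearly related
```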
recommender.rb
---- Has implementation of Recommender class
---- Computes nearest neighbor based on Pearson method
---- Input : hard coded in the file plus folder BX-dump
------ (BX-dump has files from the Book-Crossing dataset compiled by Cai-Nicolas Ziegler,
with 278,858 users rating 271,379 books in CSV files)
Chapter 3
recommender3.rb
---- Extends recommender program from previous chapter
---- Makes predictions based on Weighted Slope One
---- Input : data hard coded in the program plus folder ml-100k
------ (ml-100k has the MovieLens dataset from the GroupLens Research Project
at the University of Minnesota, with user ratings of movies)
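Weighted Slope One boils down to two steps: precompute, for every pair of items, the average deviation between their ratings; then predict by combining deviations weighted by how many users back each one. A sketch with made-up ratings (the method names and data are illustrative, not recommender3.rb itself):

```ruby
# deviations: for each item pair, the average amount by which ratings
# of one item exceed ratings of the other, plus how many users rated both.
def deviations(ratings)
  dev  = Hash.new { |h, k| h[k] = Hash.new(0.0) }
  freq = Hash.new { |h, k| h[k] = Hash.new(0) }
  ratings.each_value do |user_ratings|
    user_ratings.each do |item, r|
      user_ratings.each do |item2, r2|
        next if item == item2
        dev[item][item2]  += r - r2
        freq[item][item2] += 1
      end
    end
  end
  dev.each { |item, row| row.each_key { |j| row[j] /= freq[item][j] } }
  [dev, freq]
end

# Predict a user's rating for item, weighting each deviation by its support.
def slope_one_predict(user_ratings, item, dev, freq)
  num = den = 0.0
  user_ratings.each do |j, r|
    next unless dev.key?(item) && dev[item].key?(j)
    num += (dev[item][j] + r) * freq[item][j]
    den += freq[item][j]
  end
  den.zero? ? nil : num / den
end

ratings = {
  "Amy"   => { "Taylor Swift" => 4, "PSY" => 3,   "Whitney Houston" => 4 },
  "Ben"   => { "Taylor Swift" => 5, "PSY" => 2 },
  "Clara" => { "PSY" => 3.5, "Whitney Houston" => 4 },
  "Daisy" => { "Taylor Swift" => 5, "Whitney Houston" => 3 }
}
dev, freq = deviations(ratings)
slope_one_predict(ratings["Daisy"], "PSY", dev, freq)  # => 2.625
```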
Chapter 4
nearestNeighborClassifier.rb
---- Implementation of nearest neighbor classifier based on Manhattan Distance
---- Input : Files athletesTrainingSet.txt, athletesTestSet.txt,
irisTrainingSet.data, irisTestSet.data,
mpgTrainingSet.txt, mpgTestSet.txt
------ Iris Data Set : Arguably the most famous data set used in data mining.
It was used by Sir Ronald Fisher back in the 1930s. The Iris Data Set
consists of 50 samples for each of three species of Irises (Iris Setosa,
Iris Virginica, and Iris Versicolor). The data set includes measurements
for two parts of the Iris flower: the sepal (the green covering of the
flower bud) and the petals.
mpg Data Set : Modified version of another widely used data set, the
Auto Miles Per Gallon data set from Carnegie Mellon University. It was initially
used in the 1983 American Statistical Association Exposition. In the modified
version of the data, we are trying to predict mpg, which is a discrete
category (with values 10, 15, 20, 25, 30, 35, 40, and 45) using the attributes
cylinders, displacement, horsepower, weight, and acceleration.
nearestNeighborClassifierUnitTest.rb
---- Unit tests for nearestNeighborClassifier
---- Input : File athletesTestSet.txt and hard coded data
Chapter 5
crossValidation.rb
---- A classifier is built from the files sharing the bucketPrefix,
excluding the file with testBucketNumber. dataFormat is a string that
describes how to interpret each line of the data files.
---- Input : folder mpgData
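The ten-bucket split behind 10-fold cross-validation is simple: shuffle the lines, then deal them round-robin into ten groups of near-equal size. A sketch (the helper name is an assumption; the real program writes each bucket out as a numbered file sharing a common prefix):

```ruby
# Shuffle the data lines and deal them round-robin into num_buckets
# groups, so every line lands in exactly one bucket.
def buckets(lines, num_buckets = 10)
  shuffled = lines.shuffle(random: Random.new(1))  # fixed seed for repeatability
  groups = Array.new(num_buckets) { [] }
  shuffled.each_with_index { |line, i| groups[i % num_buckets] << line }
  groups
end

buckets((1..25).to_a).map(&:length)  # => [3, 3, 3, 3, 3, 2, 2, 2, 2, 2]
```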
pimaKNN.rb
---- Implementation of k-Nearest Neighbor Algorithm
---- Input : Folders pimaSmall, pima
------ (data files from Pima Indians Diabetes Data Set from the
U.S. National Institute of Diabetes and Digestive and Kidney Diseases)
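The k-nearest-neighbor idea itself fits in a few lines: rank the training instances by distance and let the k closest vote on the class. A sketch using Manhattan distance and made-up two-feature points, not the Pima data pipeline itself:

```ruby
# Classify a vector by majority vote among its k nearest training
# instances under Manhattan distance. Data and labels are made up.
def knn(k, training, vector)
  nearest = training.min_by(k) do |features, _label|
    features.zip(vector).sum { |a, b| (a - b).abs }
  end
  votes = Hash.new(0)
  nearest.each { |_features, label| votes[label] += 1 }
  votes.max_by { |_label, count| count }.first
end

training = [
  [[1.0, 1.0], :healthy],  [[1.2, 0.9], :healthy],
  [[5.0, 5.0], :diabetic], [[5.5, 4.8], :diabetic]
]
knn(3, training, [1.1, 1.0])  # => :healthy
```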
Chapter 6
naiveBayes.rb
---- Implementation of the Naive Bayes classifier
---- Input : folder house-votes
------ (contains files of the Congressional Voting Records Data Set,
available from the UCI Machine Learning Repository; a version
formatted for these programs is available on the book's website)
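At its core, the classifier picks the hypothesis h maximizing P(h) times the product of P(attribute value | h). A sketch with invented priors and conditional probabilities (not values learned from the voting-records files):

```ruby
# Naive Bayes: score each class as prior * product of conditional
# probabilities for the instance's attribute values; pick the best.
def classify(instance, priors, cond)
  priors.max_by do |klass, prior|
    instance.each_with_index.reduce(prior) do |p, (value, i)|
      p * cond[klass][i].fetch(value, 0.0)
    end
  end.first
end

priors = { democrat: 0.5, republican: 0.5 }
cond = {
  democrat:   [{ yes: 0.9, no: 0.1 }, { yes: 0.2, no: 0.8 }],
  republican: [{ yes: 0.1, no: 0.9 }, { yes: 0.7, no: 0.3 }]
}
classify([:yes, :no], priors, cond)  # => :democrat (0.36 vs 0.015)
```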
naiveBayesDensityFunction.rb
---- Implementation of the Naive Bayes classifier using probability density function
---- Input : folder house-votes
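The density-function variant lets naive Bayes handle continuous attributes: instead of counting discrete values, it scores each value with the normal probability density built from the mean and sample standard deviation of the training data:

```ruby
# Normal (Gaussian) probability density function:
#   pdf(x) = e^(-(x - mean)^2 / (2 * sd^2)) / (sqrt(2 * pi) * sd)
def pdf(mean, sd, x)
  Math.exp(-((x - mean)**2) / (2.0 * sd**2)) / (Math.sqrt(2 * Math::PI) * sd)
end

pdf(0.0, 1.0, 0.0)  # ≈ 0.3989, the standard normal's peak
```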
Chapter 7
bayesSentiment.rb
---- Implements a naive Bayes approach to text classification.
trainingdir holds the training data: each subdirectory of
trainingdir is titled with the name of a classification
category, and those subdirectories in turn contain the text files for that category.
---- Input : folder review_polarity_buckets and file stopwords25.txt
bayesText.rb
---- Input : folder 20news-bydate and file stopwords0.txt
------ Usenet newsgroup posts data from http://qwone.com/~jason/20Newsgroups/
We are using the 20news-bydate dataset,
which is also available on the book's website.
Chapter 8
hierarchicalClusterer.rb
---- Example code for hierarchical clustering
---- Uses a priority queue, plus a dendrogram-printing algorithm by David Eppstein
---- Input : file dogs.csv
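Agglomerative clustering starts with every point in its own cluster and repeatedly merges the two closest clusters. The real program speeds up the closest-pair search with a priority queue; this linear scan over 1-D points, measuring distance between cluster means, is just a sketch:

```ruby
# Merge the two clusters with the closest means until only `target`
# clusters remain. Points and the distance measure are simplified.
def hierarchical(points, target)
  clusters = points.map { |p| [p] }
  while clusters.length > target
    a, b = clusters.combination(2).min_by do |c1, c2|
      (c1.sum / c1.length - c2.sum / c2.length).abs
    end
    clusters.delete(a)
    clusters.delete(b)
    clusters << a + b
  end
  clusters
end

hierarchical([1.0, 1.2, 5.0, 5.1], 2)  # => [[5.0, 5.1], [1.0, 1.2]]
```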
kmeans.rb
---- Implementation of the K-means algorithm
---- Input : file dogs.csv
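K-means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. A simplified 1-D sketch (a fixed iteration count stands in for a convergence test, and a centroid that attracts no points is silently dropped):

```ruby
# Lloyd's algorithm on 1-D points: assign points to nearest centroid,
# recompute centroids as cluster means, repeat.
def kmeans(points, centroids, iterations = 10)
  iterations.times do
    clusters = points.group_by do |p|
      centroids.min_by { |c| (p - c).abs }
    end
    centroids = clusters.values.map { |ps| ps.sum / ps.length }
  end
  centroids.sort
end

kmeans([1.0, 1.1, 0.9, 8.0, 8.2, 7.8], [0.0, 10.0])  # converges near [1.0, 8.0]
```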
kmeansPlusPlus.rb
---- Implementation of the K-means++ algorithm
---- Input : file dogs.csv
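What distinguishes k-means++ is only the seeding: after the first random centre, each subsequent centre is sampled with probability proportional to its squared distance from the nearest centre chosen so far, which spreads the seeds apart. A 1-D sketch of that step (names are assumptions):

```ruby
# k-means++ seeding: pick the first centre at random, then pick each
# later centre by roulette-wheel selection weighted by squared
# distance to the nearest centre already chosen.
def plus_plus_seeds(points, k, rng = Random.new)
  centroids = [points.sample(random: rng)]
  while centroids.length < k
    # squared distance from each point to its nearest existing centre
    weights = points.map { |p| centroids.map { |c| (p - c)**2 }.min }
    target = rng.rand * weights.sum
    index = 0
    while (target -= weights[index]) > 0 && index < points.length - 1
      index += 1
    end
    centroids << points[index]
  end
  centroids
end

plus_plus_seeds([1.0, 2.0, 9.0, 10.0], 2, Random.new(7))  # two well-separated seeds
```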