Friday, 27 March 2015

Ruby Version Of Programs In "A Programmer's Guide To Data Mining"

I have written the Ruby version of the programs in the online book "A Programmer's Guide To Data Mining". The book's website is This is my first small step into the world of analytics and now I feel I am a member of the exclusive Numerati club.

Though the book is on data mining, it is a good entry-level resource for machine learning. It simplifies and demystifies data analytics for the programmer using Python examples. I translated the programs into Ruby for my own learning, and to simplify and demystify them further using Ruby's easy idioms.

I have uploaded the programs to my GitHub repository,
Size data, as given by is:
Language  files  blank  comment  code
Ruby         20    333      896  2583

Given below is a brief note on each program.

Chapter 2
---- Give list of recommendations
---- first find nearest neighbor
---- based on Manhattan distance
---- Input : hard coded in the file.
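
As a rough illustration of the steps above (not the repository code), here is a minimal sketch of a Manhattan-distance recommender. The user names, ratings, and method names are made up for the example. Note that two users with no rated items in common get a distance of 0 here, which a fuller version would need to handle.

```ruby
# Hypothetical ratings, hard coded in the same spirit as the original program.
RATINGS = {
  "Ann" => { "Blues Traveler" => 3.5, "Norah Jones" => 4.5, "Phoenix" => 5.0 },
  "Bob" => { "Blues Traveler" => 3.0, "Norah Jones" => 4.5, "Phoenix" => 4.0,
             "Slightly Stoopid" => 4.5 },
  "Cat" => { "Blues Traveler" => 1.0, "Norah Jones" => 2.0, "The Strokes" => 5.0 }
}.freeze

# Sum of absolute rating differences over the items both users rated.
def manhattan(r1, r2)
  (r1.keys & r2.keys).sum { |item| (r1[item] - r2[item]).abs }
end

# Name of the other user with the smallest Manhattan distance.
def nearest_neighbor(user, ratings)
  (ratings.keys - [user]).min_by { |other| manhattan(ratings[user], ratings[other]) }
end

# Items the nearest neighbor rated that the user has not, highest rated first.
def recommend(user, ratings)
  neighbor_ratings = ratings[nearest_neighbor(user, ratings)]
  neighbor_ratings.reject { |item, _| ratings[user].key?(item) }
                  .sort_by { |_, score| -score }
end

p recommend("Ann", RATINGS)
```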

---- Gives implementation of Pearson Correlation Coefficient calculation
---- Input : hard coded in the file.
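
For reference, a minimal sketch of a Pearson calculation over two users' rating hashes, using the single-pass form of the formula. The method name and the choice to return 0.0 when the coefficient is undefined are assumptions for this example, not necessarily what the repository does.

```ruby
# Pearson correlation over the items both users rated.
# Returns 0.0 when there are no common items or one user has no variance.
def pearson(r1, r2)
  common = r1.keys & r2.keys
  n = common.size
  return 0.0 if n.zero?

  xs = common.map { |k| r1[k].to_f }
  ys = common.map { |k| r2[k].to_f }
  sum_x  = xs.sum
  sum_y  = ys.sum
  sum_xy = xs.zip(ys).sum { |x, y| x * y }
  sum_x2 = xs.sum { |x| x * x }
  sum_y2 = ys.sum { |y| y * y }

  # Clamp at zero so rounding error cannot produce a negative square root.
  var_x = [sum_x2 - sum_x**2 / n, 0.0].max
  var_y = [sum_y2 - sum_y**2 / n, 0.0].max
  denominator = Math.sqrt(var_x) * Math.sqrt(var_y)
  return 0.0 if denominator.zero?

  (sum_xy - sum_x * sum_y / n) / denominator
end
```

Users whose ratings rise and fall together score close to 1, and users with opposite tastes score close to -1.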

---- Has implementation of Recommender class
---- Computes nearest neighbor based on Pearson method
---- Input : hard coded in the file plus folder BX-dump
     (BX-dump has the Book-Crossing data set collected by Cai-Nicolas Ziegler:
      CSV files with 278,858 users rating 271,379 books)

Chapter 3
---- Extends recommender program from previous chapter
---- Makes predictions based on Weighted Slope One
---- Input : data hard coded in the program plus folder ml-100k
     (ml-100k has MovieLens dataset from GroupLens Research Project
      at the University of Minnesota --
      has user ratings of movies).
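
To give a feel for Weighted Slope One, here is a minimal sketch under the assumption that ratings are held in a nested hash of user => item => rating. The method names and sample data are illustrative only. It first computes the average deviation between every pair of items, then predicts a rating as the frequency-weighted average of (deviation + the user's known rating).

```ruby
# Average rating difference dev[i][j] between items i and j, and
# freq[i][j], the number of users who rated both.
def compute_deviations(ratings)
  dev  = Hash.new { |h, k| h[k] = Hash.new(0.0) }
  freq = Hash.new { |h, k| h[k] = Hash.new(0) }
  ratings.each_value do |user_ratings|
    user_ratings.each do |item, rating|
      user_ratings.each do |other, other_rating|
        next if item == other
        freq[item][other] += 1
        dev[item][other]  += rating - other_rating
      end
    end
  end
  dev.each { |item, row| row.each_key { |other| row[other] /= freq[item][other] } }
  [dev, freq]
end

# Predicted rating of `item` for a user, or nil if no item pair overlaps.
def predict(user_ratings, item, dev, freq)
  numerator = 0.0
  denominator = 0
  user_ratings.each do |other, rating|
    next unless freq.key?(item) && freq[item].key?(other)
    count = freq[item][other]
    numerator   += (dev[item][other] + rating) * count
    denominator += count
  end
  denominator.zero? ? nil : numerator / denominator
end

# Illustrative data only.
ratings = {
  "Amy"   => { "Taylor Swift" => 4.0, "PSY" => 3.0, "Whitney Houston" => 4.0 },
  "Ben"   => { "Taylor Swift" => 5.0, "PSY" => 2.0 },
  "Clara" => { "PSY" => 3.5, "Whitney Houston" => 4.0 },
  "Daisy" => { "Taylor Swift" => 5.0, "Whitney Houston" => 3.0 }
}
dev, freq = compute_deviations(ratings)
puts predict(ratings["Ben"], "Whitney Houston", dev, freq) # 3.375
```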

Chapter 4
---- Implementation of nearest neighbor classifier based on Manhattan Distance
---- Input : Files athletesTrainingSet.txt, athletesTestSet.txt,
        mpgTrainingSet.txt, mpgTestSet.txt
------ Iris Data Set : Arguably the most famous data set used in data mining.
    It was used by Sir Ronald Fisher back in the 1930s. The Iris Data Set
    consists of 50 samples for each of three species of Irises (Iris Setosa,
    Iris Virginica, and Iris Versicolor). The data set includes measurements
    for two parts of the Iris's flower: the sepal (the green covering of the flower
    bud) and the petals.
------ mpg Data Set : Modified version of another widely used data set, the
    Auto Miles Per Gallon data set from Carnegie Mellon University. It was initially
    used in the 1983 American Statistical Association Exposition. In the modified
    version of the data, we are trying to predict mpg, which is a discrete
    category (with values 10, 15, 20, 25, 30, 35, 40, and 45) using the attributes
    cylinders, displacement, horsepower, weight, and acceleration.

---- Unit tests for nearestNeighborClassifier
---- Input : File athletesTestSet and hard coded data

Chapter 5
---- Builds a classifier from the files with the bucketPrefix,
     excluding the file with testBucketNumber. dataFormat is a string that
     describes how to interpret each line of the data files.
---- Input : folder mpgData
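
The bucket idea above can be sketched as follows. This is a simplified illustration, not the repository code: the buckets are in-memory arrays of [label, features] rows rather than files, and the method name and block interface are assumptions made for the example.

```ruby
# buckets: array of buckets, each an array of [label, features] rows
# (stand-ins for the bucket files on disk).
# The block receives (training_rows, features) and returns a predicted label.
# Returns the fraction of rows classified correctly across all folds.
def cross_validate(buckets)
  correct = 0
  total = 0
  buckets.each_index do |test_bucket_number|
    # Train on every bucket except the one held out for testing.
    training = buckets.each_with_index
                      .reject { |_, i| i == test_bucket_number }
                      .flat_map { |bucket, _| bucket }
    buckets[test_bucket_number].each do |label, features|
      correct += 1 if yield(training, features) == label
      total += 1
    end
  end
  correct.to_f / total
end
```

Any classifier can be plugged in through the block, so the same loop works for the nearest neighbor and naive Bayes programs alike.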

---- Implementation of k-Nearest Neighbor Algorithm
---- Input : Folders pimaSmall, pima
------ (data files from Pima Indians Diabetes Data Set from the
        U.S. National Institute of Diabetes and Digestive and Kidney Diseases)
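
A minimal sketch of k-nearest-neighbor classification with a simple majority vote, assuming training rows of the form [label, features]. The names, the default k of 3, and the sample data are made up for illustration, and the sketch skips normalization of the attributes.

```ruby
# Manhattan distance between two equal-length numeric vectors.
def manhattan_distance(a, b)
  a.zip(b).sum { |x, y| (x - y).abs }
end

# Label held by the majority of the k training rows closest to `features`.
def knn_classify(training, features, k = 3)
  neighbors = training.min_by(k) { |_, row| manhattan_distance(row, features) }
  neighbors.group_by(&:first).max_by { |_, rows| rows.size }.first
end

# Hypothetical training rows.
training = [
  ["neg", [1.0, 1.0]], ["neg", [1.5, 2.0]],
  ["pos", [5.0, 5.0]], ["pos", [6.0, 5.5]], ["pos", [5.5, 6.5]]
]
puts knn_classify(training, [1.2, 1.1])
```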

Chapter 6
---- Implementation of the Naive Bayes classifier
---- Input : folder house-votes
------ (contains files of the Congressional Voting Records Data Set,
        available from the Machine Learning Repository. A version in a
        form that our programs can use is available at the book's website).
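
To show the core of the method, here is a minimal Naive Bayes sketch for categorical attributes such as y/n votes. The row format, method names, and sample rows are assumptions for the example, and no smoothing is applied, so a value never seen with a class gives that class a probability of zero.

```ruby
# rows: array of [label, [attribute values]].
# Returns class counts and value_counts[label][column][value].
def train_naive_bayes(rows)
  class_counts = Hash.new(0)
  value_counts = Hash.new { |h, l| h[l] = Hash.new { |c, i| c[i] = Hash.new(0) } }
  rows.each do |label, values|
    class_counts[label] += 1
    values.each_with_index { |v, i| value_counts[label][i][v] += 1 }
  end
  [class_counts, value_counts]
end

# Label maximizing P(label) * product of P(value | label) over all columns.
def classify_naive_bayes(values, class_counts, value_counts)
  total = class_counts.values.sum.to_f
  class_counts.keys.max_by do |label|
    prior = class_counts[label] / total
    values.each_with_index.reduce(prior) do |prob, (v, i)|
      prob * value_counts[label][i][v] / class_counts[label]
    end
  end
end

# Hypothetical voting records with two votes each.
rows = [
  ["democrat", %w[y n]], ["democrat", %w[y y]],
  ["republican", %w[n y]], ["republican", %w[n n]], ["republican", %w[y y]]
]
class_counts, value_counts = train_naive_bayes(rows)
puts classify_naive_bayes(%w[y n], class_counts, value_counts)
```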

---- Implementation of the Naive Bayes classifier using probability density function
---- Input : folder house-votes
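
For numeric attributes, the conditional probability is replaced by a Gaussian probability density. A minimal sketch of the two helpers involved is shown here. The method names are made up, and the use of the sample standard deviation (dividing by n - 1) is an assumption for this example.

```ruby
# Gaussian probability density of x for a given mean and standard deviation.
def gaussian_pdf(x, mean, sd)
  exponent = -((x - mean)**2) / (2.0 * sd**2)
  Math.exp(exponent) / (Math.sqrt(2 * Math::PI) * sd)
end

# Mean and sample standard deviation of one attribute within one class.
def mean_and_sample_sd(values)
  mean = values.sum.to_f / values.size
  variance = values.sum { |v| (v - mean)**2 } / (values.size - 1)
  [mean, Math.sqrt(variance)]
end
```

During classification, gaussian_pdf(value, mean, sd) takes the place of the counted P(value | label) term for each numeric column.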

Chapter 7
---- Implements a naive Bayes approach to text classification
     trainingdir is the training data. Each subdirectory of
     trainingdir is titled with the name of the classification
     category -- those subdirectories in turn contain the text files for that category.
---- Input : folder review_polarity_buckets and file stopwords25.txt
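
A minimal sketch of the text classification idea, with an in-memory hash of category => texts standing in for the trainingdir subdirectories. The stop word list, method names, add-one smoothing, and use of log probabilities are choices made for this example. Class priors are left out, which assumes the categories are roughly the same size.

```ruby
# Hypothetical stand-in for a stop word file.
STOPWORDS = %w[the a an and of to is it].freeze

# Lowercase words with stop words removed.
def tokenize(text)
  text.downcase.scan(/[a-z']+/).reject { |w| STOPWORDS.include?(w) }
end

# documents: { category => [text, ...] }
def train_text_classifier(documents)
  counts = documents.transform_values do |texts|
    texts.flat_map { |t| tokenize(t) }
         .each_with_object(Hash.new(0)) { |w, h| h[w] += 1 }
  end
  vocabulary = counts.values.flat_map(&:keys).uniq
  { counts: counts, vocabulary_size: vocabulary.size }
end

# Category with the highest sum of log P(word | category), add-one smoothed.
def classify_text(model, text)
  words = tokenize(text)
  model[:counts].keys.max_by do |category|
    word_counts = model[:counts][category]
    total = word_counts.values.sum
    words.sum do |w|
      Math.log((word_counts[w] + 1.0) / (total + model[:vocabulary_size]))
    end
  end
end

# Hypothetical training texts.
documents = {
  "pos" => ["a great fun movie", "great acting and a fun plot"],
  "neg" => ["a dull boring movie", "boring plot and dull acting"]
}
model = train_text_classifier(documents)
puts classify_text(model, "great fun")
```

Summing logs instead of multiplying probabilities avoids underflow when a document has many words.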

---- Input : folder 20news-bydate and file stopwords0.txt
------ Usenet newsgroup posts data from
       We are using the 20news-bydate dataset,
       and it is also available on the book website.

Chapter 8
---- Example code for hierarchical clustering
---- Uses a priority queue and print dendrogram algorithm by David Eppstein
---- Input : file dogs.csv

---- Implementation of the K-means algorithm
---- Input : file dogs.csv
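
As an illustration of the K-means loop (assign each point to its nearest centroid, then move each centroid to the mean of its points), here is a minimal sketch. The initial centroids are passed in to keep the example deterministic, and the loop stops when no assignment changes. Both choices, the method names, and the sample points are assumptions for this example rather than a description of the repository code.

```ruby
def euclidean(a, b)
  Math.sqrt(a.zip(b).sum { |x, y| (x - y)**2 })
end

# Component-wise mean of a non-empty list of points.
def mean_point(points)
  points.transpose.map { |column| column.sum.to_f / column.size }
end

# Returns [centroids, assignments], where assignments[j] is the index of
# the centroid that points[j] belongs to.
def kmeans(points, centroids, max_iterations = 100)
  assignments = nil
  max_iterations.times do
    new_assignments = points.map do |p|
      centroids.each_index.min_by { |i| euclidean(p, centroids[i]) }
    end
    break if new_assignments == assignments

    assignments = new_assignments
    centroids = centroids.each_index.map do |i|
      members = points.each_index.select { |j| assignments[j] == i }
                      .map { |j| points[j] }
      # Keep the old centroid if no point was assigned to it.
      members.empty? ? centroids[i] : mean_point(members)
    end
  end
  [centroids, assignments]
end

# Hypothetical two-cluster data.
points = [[1.0, 1.0], [1.5, 2.0], [1.0, 1.5], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]]
centroids, assignments = kmeans(points, [points[0], points[3]])
p assignments
```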

---- Implementation of the K-means++ algorithm
---- Input : file dogs.csv
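
K-means++ differs from plain K-means only in how the initial centroids are picked: each new centroid is drawn at random, favoring points far from the centroids chosen so far. The sketch below uses the common formulation that weights each point by its squared distance to the nearest existing centroid; the repository code may weight differently. The method names and the injectable random number generator are assumptions for the example.

```ruby
def squared_distance(a, b)
  a.zip(b).sum { |x, y| (x - y)**2 }
end

# Picks k initial centroids from points. The first is chosen uniformly at
# random; each later one with probability proportional to its squared
# distance from the nearest centroid already chosen.
def kmeans_pp_centroids(points, k, rng = Random.new)
  centroids = [points[rng.rand(points.size)]]
  while centroids.size < k
    weights = points.map { |p| centroids.map { |c| squared_distance(p, c) }.min }
    threshold = rng.rand * weights.sum
    cumulative = 0.0
    chosen = points.zip(weights).find { |_, w| (cumulative += w) > threshold }
    # Fall back to the last point if rounding leaves nothing selected.
    centroids << (chosen ? chosen.first : points.last)
  end
  centroids
end
```

The resulting centroids can then be handed to an ordinary K-means loop as its starting positions.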
