Friday, 2 October 2015

3 Basic Analytics Algorithms: The long, the short and the applications

Data science, or analytics, combines methodologies from statistics and machine learning. The three basic analytics algorithms that a beginner data scientist comes across are:
  • linear regression
  • k-nn
  • k-means
Linear regression comes from the statistics heritage and k-nn / k-means from the machine learning heritage.

A defining practice in machine learning is to divide the data into two sets: the training data set and the test data set. The training data is a subset of the data with known results, and it is used to fit the model. The data scientist then runs the fitted model on the test data set, compares its predictions with the known results there, and uses that comparison to improve the model.
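As a rough illustration of such a split, here is a minimal sketch using R's built-in iris data. The 70/30 ratio, the seed and the names train_idx, train_set and test_set are my own choices for the example, not something the sources prescribe.

```r
# Hold-out split on R's built-in iris data (150 rows).
set.seed(42)                                    # make the random split reproducible
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))   # 70% of the row indices, chosen at random

train_set <- iris[train_idx, ]    # rows used to fit a model
test_set  <- iris[-train_idx, ]   # held-out rows, used only to evaluate the model

nrow(train_set)   # number of training rows
nrow(test_set)    # number of test rows
```

The two sets do not overlap, so the test set gives an honest picture of how the model behaves on data it has not seen.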

In this post, I have compiled the description from online resources, the R syntax and typical examples of each. Note: There is plenty of material both online and in books, so I am giving the extracts of the algorithms verbatim from the sources.

Linear Regression
The Long
Linear regression is a statistical procedure for predicting the value of a dependent variable from an independent variable when the relationship between the variables can be described with a linear model.

A linear regression equation can be written as Yp = mX + b, where Yp is the predicted value of the dependent variable, X is the value of the independent variable, m is the slope of the regression line, and b is the Y-intercept of the regression line.
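To make m and b concrete, here is a small sketch that computes them with the usual least-squares formulas for a single predictor and checks them against lm(). The toy numbers are made up for illustration.

```r
# Toy data, made up for illustration: y is roughly 2 * x + 1.
x <- c(1, 2, 3, 4, 5)
y <- c(3.1, 4.9, 7.2, 9.0, 10.8)

# Least-squares estimates for one predictor:
m <- cov(x, y) / var(x)       # slope of the regression line
b <- mean(y) - m * mean(x)    # Y-intercept of the regression line
y_p <- m * x + b              # predicted values Yp = mX + b

# lm() should give the same intercept and slope.
fit <- lm(y ~ x)
coef(fit)
```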

In statistics, linear regression is a method of estimating the conditional expected value of one variable y given the values of some other variable or variables x. The variable of interest, y, is conventionally called the "dependent variable". The terms "endogenous variable" and "output variable" are also used. The other variables x are called the "independent variables". The terms "exogenous variables" and "input variables" are also used. The dependent and independent variables may be scalars or vectors. If the independent variable is a vector, one speaks of multiple linear regression.

A linear regression model is typically stated in the form y = α + βx + ε

The right hand side may take other forms, but generally comprises a linear combination of the parameters, here denoted α and β. The term ε represents the unpredicted or unexplained variation in the dependent variable; it is conventionally called the "error" whether it is really a measurement error or not. The error term is conventionally assumed to have expected value equal to zero, as a nonzero expected value could be absorbed into α. It is also assumed that ε is independent of x.

The short
fitted.regression <- lm(Weight ~ Height, data=heights.weights)
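The heights.weights data frame above does not ship with R, so the sketch below uses the built-in women data set (columns height and weight) as a stand-in. The names fit, new_heights and predicted are illustrative only.

```r
# Fit weight as a linear function of height on R's built-in 'women' data.
fit <- lm(weight ~ height, data = women)

coef(fit)    # intercept and slope of the fitted line

# Predict the weight for a few new heights (same units as the data set).
new_heights <- data.frame(height = c(60, 66, 72))
predicted <- predict(fit, newdata = new_heights)
predicted
```

Calling summary(fit) gives the standard errors and the R-squared of the fit.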

Predictions: i) Predicting the sale price of a house based on its area.[DDS] ii) Predicting the amount of page views for a web site.[MLFH]

K-Nearest Neighbors (k-NN)
The Long
The KNN or k-nearest neighbors algorithm is one of the simplest machine learning algorithms and is an example of instance-based learning, where new data are classified based on stored, labeled instances. More specifically, the distance between the stored data and the new instance is calculated by means of some kind of a similarity measure. This similarity measure is typically expressed by a distance measure such as the Euclidean distance, cosine similarity or the Manhattan distance. In other words, the similarity to the data that was already in the system is calculated for any new data point that you input into the system. Then, you use this similarity value to perform predictive modeling. Predictive modeling is either classification, assigning a label or a class to the new instance, or regression, assigning a value to the new instance. Whether you classify or assign a value to the new instance depends, of course, on how you compose your model with KNN.

The k-nearest neighbor algorithm extends this basic idea: after the distance from the new point to all stored data points has been calculated, the distance values are sorted and the k nearest neighbors are determined. The labels of these neighbors are gathered and a majority vote or weighted vote is used for classification or regression purposes. In other words, the higher the score for a certain data point that was already stored, the more likely that the new instance will receive the same classification as that of the neighbor. In the case of regression, the value that will be assigned to the new data point is the mean of its k nearest neighbors.
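The procedure just described can be sketched in a few lines of base R. This is a bare-bones illustration assuming Euclidean distance and an unweighted vote; the function names and the toy data are hypothetical, and this is not how the class package implements it.

```r
# Indices of the k stored points closest to new_point (Euclidean distance).
nearest_neighbors <- function(train_x, new_point, k) {
  diffs <- sweep(train_x, 2, new_point)   # subtract new_point from every row
  dists <- sqrt(rowSums(diffs^2))         # distance to each stored point
  order(dists)[seq_len(k)]                # sort distances, keep the k smallest
}

# Classification: majority vote among the k nearest labels.
knn_classify <- function(train_x, train_labels, new_point, k = 3) {
  votes <- table(train_labels[nearest_neighbors(train_x, new_point, k)])
  names(votes)[which.max(votes)]
}

# Regression: mean of the k nearest values.
knn_regress <- function(train_x, train_y, new_point, k = 3) {
  mean(train_y[nearest_neighbors(train_x, new_point, k)])
}

# Toy data, made up for illustration: two well-separated groups.
train_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(6, 6), c(6, 7), c(7, 6))
train_labels <- c("a", "a", "a", "b", "b", "b")
train_y <- c(10, 12, 14, 50, 52, 54)

knn_classify(train_x, train_labels, c(1.5, 1.5), k = 3)   # label of the nearby group
knn_regress(train_x, train_y, c(1.5, 1.5), k = 3)         # mean of the 3 nearest values
```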

The short
knn(train, test, cl, k=3)
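For a fuller picture of the arguments, here is a sketch of the same call on the built-in iris data, using knn() from the class package (a recommended package included in standard R distributions). The split size, the seed and the variable names are my own choices.

```r
library(class)   # provides knn()

set.seed(1)
train_idx <- sample(nrow(iris), 100)     # 100 random rows for training

train <- iris[train_idx, 1:4]            # numeric features of the training rows
test  <- iris[-train_idx, 1:4]           # numeric features of the held-out rows
cl    <- iris$Species[train_idx]         # known class labels of the training rows

predicted <- knn(train, test, cl, k = 3) # predicted species for each test row

# Compare predictions with the true species of the held-out rows.
actual <- iris$Species[-train_idx]
table(predicted, actual)
accuracy <- mean(predicted == actual)
accuracy
```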

Classification: US Senators vote analysis: Do senators from different parties mix when clustered by roll-call vote records? [MLFH]

K-Means
The Long
Clustering is the process of partitioning a group of data points into a small number of clusters. For instance, the items in a supermarket are clustered in categories (butter, cheese and milk are grouped in dairy products). Of course this is a qualitative kind of partitioning. A quantitative approach would be to measure certain features of the products, say percentage of milk and others, and products with high percentage of milk would be grouped together. In general, we have n data points x_i, i = 1...n, that have to be partitioned into k clusters. The goal is to assign a cluster to each data point. K-means is a clustering method that aims to find the positions μ_i, i = 1...k, of the cluster centers that minimize the total squared distance from the data points to their assigned cluster center.

Lloyd's algorithm, commonly known as the k-means algorithm, is used to solve the k-means clustering problem and works as follows. First, decide the number of clusters k. Then:
1. Initialize the center of the clusters
2. Attribute the closest cluster to each data point
3. Set the position of each cluster to the mean of all data points belonging to that cluster
4. Repeat steps 2-3 until convergence
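The four steps above can be sketched in base R as follows. This is a teaching sketch under my own assumptions (random data points as initial centers, squared Euclidean distance, stop when assignments no longer change); the function name lloyd_kmeans and the toy data are hypothetical, and R's own kmeans() is the one to use in practice.

```r
lloyd_kmeans <- function(x, k, max_iter = 100) {
  # Step 1: initialise the centers with k randomly chosen data points.
  centers <- x[sample(nrow(x), k), , drop = FALSE]
  assignment <- rep(0L, nrow(x))

  for (iter in seq_len(max_iter)) {
    # Step 2: assign each point to its closest center.
    # d[i, j] is the squared distance from point i to center j.
    d <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
    new_assignment <- max.col(-d, ties.method = "first")

    # Step 4: converged once no point changes cluster.
    if (all(new_assignment == assignment)) break
    assignment <- new_assignment

    # Step 3: move each center to the mean of the points assigned to it.
    for (j in seq_len(k)) {
      members <- x[assignment == j, , drop = FALSE]
      if (nrow(members) > 0) centers[j, ] <- colMeans(members)
    }
  }
  list(centers = centers, cluster = assignment)
}

# Toy data, made up for illustration: two blobs around (0, 0) and (5, 5).
set.seed(7)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

result <- lloyd_kmeans(x, k = 2)
result$centers   # one row per cluster center
result$cluster   # cluster index for each of the 40 points
```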

The short
kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))
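As a usage sketch, here is kmeans() applied to the four numeric columns of the built-in iris data. Choosing 3 centers and nstart = 20 is my own assumption for the example; the species column is used only afterwards, to see how the clusters line up with the known labels.

```r
set.seed(123)                 # the starting centers are random
x <- iris[, 1:4]              # numeric features only

# Try 20 random starts and keep the solution with the lowest within-cluster sum of squares.
fit <- kmeans(x, centers = 3, iter.max = 10, nstart = 20)

fit$centers        # position of each cluster center
fit$size           # number of points in each cluster
fit$tot.withinss   # total within-cluster sum of squared distances

# Cross-tabulate the clusters against the known species.
table(cluster = fit$cluster, species = iris$Species)
```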

Clustering: i) Based on emails, cluster people and discover the relationships among these individuals.[GTDM] ii) Analyze students' results and predict their performance so that academic planners can make effective decisions.[1]

DDS -- Doing Data Science by Cathy O’Neil and Rachel Schutt, 2013, O’Reilly Media.
MLFH -- Machine Learning for Hackers by Drew Conway and John Myles White, 2012, O’Reilly Media.
