Thursday 26 March 2015

Key Points From A Programmer's Guide To Data Mining

Preface
To me, the best programmers are empty cups, who constantly explore new technology (noSQL, node.js, whatever) with open minds. Mediocre programmers have surrounded their minds with cities of delusion -- C++ is good, Java is bad, PHP is the only way to do web programming. MySQL is the only database to consider. My hope is that you will find some of the ideas in this book valuable and I ask that you keep a beginner's mind when reading it. As Shunryu Suzuki says:
In the beginner's mind there are many possibilities,
In the expert's mind there are few.

Chapter 1 : Intro to data mining & how to use this book
Years ago, in that small town, our friends helped us find stuff.
We used experts to help us find stuff.
We also used the thing itself to help us find stuff.

Data mining is not just about recommending stuff to us, or about helping merchants sell more stuff.
Data mining is focused on finding patterns in data.

This book follows a learn-by-doing approach. Instead of passively reading the book, I encourage you to work through the exercises and experiment with the Python code I provide. Experimenting around, code hacking, and trying out methods with different data sets is the key to really gaining an understanding of the techniques.

This book is intended more as a quick, gritty, hands-on introduction designed to give you a basic foundation of data mining techniques.

Why - why does this matter?
There's lots of stuff out there (movies, music, books, rice cookers). There's going to be a huge growth in the amount of stuff out there. The problem with having all this stuff available is finding the stuff that is relevant to us. Of all the movies out there, what movie should I watch? What's the next book I should read? This problem of identifying relevant stuff is what data mining is about.

Chapter 2 : Collaborative filtering
The recommendation method we are looking at in this chapter is called collaborative filtering. It's called collaborative because it makes recommendations based on other people -- in effect, people collaborate to come up with recommendations. It works like this. Suppose the task is to recommend a book to you. I search among other users of the site to find one that is similar to you in the books she enjoys. Once I find that similar person I can see what she likes and recommend those books to you.

So the first step is to find someone similar.

The easiest distance to compute is what is called Manhattan Distance or cab driver distance.

You may recall the Pythagorean Theorem from your distant educational past. Here, we are going to figure out the straight-line, as-the-crow-flies distance, which we call Euclidean distance.

Manhattan Distance and Euclidean Distance work best when there are no missing values.

We can generalize the Manhattan distance and Euclidean distance to what is called the Minkowski Distance Metric.

When
-- r = 1 : The formula is Manhattan Distance
-- r = 2 : The formula is Euclidean Distance
-- r = ∞ : Supremum Distance

The greater the r, the more a large difference in one dimension will influence the total difference.
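
Here is a minimal Python sketch of the Minkowski metric (my own illustration, not the book's code; the ratings dictionaries are made-up examples). With r = 1 it reduces to Manhattan Distance and with r = 2 to Euclidean Distance.

def minkowski(rating1, rating2, r):
    """Minkowski distance computed over the items both users rated.
    r = 1 gives Manhattan Distance, r = 2 gives Euclidean Distance."""
    distance = 0
    common = False
    for key in rating1:
        if key in rating2:
            distance += abs(rating1[key] - rating2[key]) ** r
            common = True
    return distance ** (1.0 / r) if common else 0  # 0 if no ratings in common

# hypothetical users rating bands on a 1-5 scale
hailey = {"Broken Bells": 4, "Deadmau5": 1, "Norah Jones": 4}
jordyn = {"Broken Bells": 4.5, "Deadmau5": 4, "Norah Jones": 5}
print(minkowski(hailey, jordyn, 1))  # Manhattan Distance
print(minkowski(hailey, jordyn, 2))  # Euclidean Distance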

The Pearson Correlation Coefficient is a measure of correlation between two variables. It ranges between -1 and 1 inclusive. 1 indicates perfect agreement, -1 indicates perfect disagreement.

Other than perhaps looking complex, the problem with the formula above (Pearson) is that the algorithm to implement it would require multiple passes through the data. Fortunately for us algorithmic people, there is an alternative formula, which is an approximation of Pearson.

We can implement it using a single-pass algorithm.
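
A sketch of that single-pass approximation in Python (my own code, assuming the same kind of ratings dictionaries as above); all of the sums are accumulated in one loop over the items both users rated.

from math import sqrt

def pearson(rating1, rating2):
    """Single-pass approximation of the Pearson Correlation Coefficient."""
    sum_x = sum_y = sum_xy = sum_x2 = sum_y2 = n = 0
    for key in rating1:
        if key in rating2:
            x, y = rating1[key], rating2[key]
            n += 1
            sum_x += x
            sum_y += y
            sum_xy += x * y
            sum_x2 += x ** 2
            sum_y2 += y ** 2
    if n == 0:
        return 0
    denominator = (sqrt(sum_x2 - (sum_x ** 2) / n) *
                   sqrt(sum_y2 - (sum_y ** 2) / n))
    return (sum_xy - (sum_x * sum_y) / n) / denominator if denominator else 0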

Cosine similarity ignores 0-0 matches. It is defined as

cos(x, y) = (x · y) / (||x|| × ||y||)

where · indicates the dot product and ||x|| indicates the length of the vector x.
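
A small sketch of cosine similarity in Python (mine, not the book's); ratings missing from one of the dictionaries are treated as zeros, so 0-0 matches contribute nothing to the dot product.

from math import sqrt

def cosine_similarity(rating1, rating2):
    """cos(x, y) = (x . y) / (||x|| * ||y||) over the union of rated items."""
    keys = set(rating1) | set(rating2)
    dot = sum(rating1.get(k, 0) * rating2.get(k, 0) for k in keys)
    length_x = sqrt(sum(v ** 2 for v in rating1.values()))
    length_y = sqrt(sum(v ** 2 for v in rating2.values()))
    if length_x == 0 or length_y == 0:
        return 0
    return dot / (length_x * length_y)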

Which similarity measure to use?
-- If your data is dense (almost all attributes have non-zero values) and the magnitude of the attribute value is important, use distance measures such as Euclidean or Manhattan.
-- If the data is subject to grade-inflation (different users may be using different scales) use Pearson.
-- If the data is sparse consider using Cosine Similarity.

The problem is that zero values tend to dominate any measure of distance. So the solution of adding zeros is no better than the original distance values. One workaround people have used is to compute -- in some sense -- an 'average' distance: compute the distance using only the songs the two users rated in common, and divide by the number of songs they rated in common.

Again, Manhattan and Euclidean work spectacularly well on dense data, but if the data is sparse, it may be better to use Cosine Similarity.

Weirdness
The problem is that we are relying on a single "most similar" person. Any quirk that person has is passed on as a recommendation. One way of evening out those quirks is to base our recommendations on more than one person who is similar to our user. For this we can use the k-nearest neighbor approach.

In the k-nearest neighbor approach to collaborative filtering we use k most similar people to determine recommendations. The best value for k is application specific -- you will need to do some experimentation.

Suppose I would like to make recommendations for Ann and am using k-nearest neighbor with k=3. The question is how I can determine how much influence each person should have. If there is a Pie of Influence (TM), how big a slice should I give each person? If I add up the Pearson scores I get 2. Sally's share is 0.8/2, or 40%. Eric's share is 35% (0.7/2) and Amanda's share is 25% (0.5/2). Each person's influence is proportional to their Pearson score.
Projected rating = sum of (each person's rating x influence).
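
A tiny sketch of that calculation (my own code). The Pearson scores 0.8, 0.7, and 0.5 are the ones from the example (Amanda's 0.5 is implied by the shares adding up to 2); the ratings of the item are made up.

def projected_rating(neighbors):
    """neighbors: list of (pearson_score, rating) pairs for the k nearest users.
    Each person's influence is their share of the total Pearson score."""
    total = sum(score for score, _ in neighbors)
    return sum((score / total) * rating for score, rating in neighbors)

# Sally (0.8), Eric (0.7), Amanda (0.5) with hypothetical ratings 4, 5, 5
print(projected_rating([(0.8, 4), (0.7, 5), (0.5, 5)]))  # influences 40%, 35%, 25%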

Chapter 3 : Collaborative filtering -- implicit ratings and item-based filtering
Explicit ratings are when the user herself explicitly rates the item.
For implicit ratings, we don't ask users to give any ratings -- we just observe their behavior.
Another implicit rating is what the customer actually buys.

Problems with explicit ratings
Problem 1 : People are lazy and don't rate items.
Problem 2 : People may lie or give only partial information. They can lie directly -- giving inaccurate ratings or lie via omission -- providing only partial information.
Problem 3 : People don't update their ratings.

Recall that I said my purchase of the book Anticancer: A New Way of Life was a gift to my cousin. If we mine my purchase history a bit more we would see that I bought this book before. In fact, in the last year I purchased multiple copies of three books. One can imagine that I am making these multiple purchases not because I am losing the books, or because I am losing my mind and forgetting that I read them. The most rational reason is that I liked the books so much that I am, in a sense, recommending them to others by giving them as gifts. So we can gain a substantial amount of information from a person's purchase history.

Latency can be a major drawback of neighbor-based recommendation systems.

In user-based collaborative filtering, we're comparing a user with every other user to find the closest matches. There are two main problems with this approach.

1. Scalability. As we have just discussed, the computation increases as the number of users increases. User-based methods work fine for thousands of users, but scalability gets to be a problem when we have a million users.

2. Sparsity. Most recommendation systems have many users and many products but the average user rates a small fraction of the total products. For example, Amazon carries millions of books but the average user rates just a handful of books.

In user-based filtering we had a user, found the most similar person (or users) to that user and used the ratings of that similar person to make recommendations. In item-based filtering, ahead of time we find the most similar items, and combine that with a user's rating of items to generate a recommendation.

If we were doing user-based filtering we would determine the similarity between rows. For item-based filtering we are determining the similarity between columns.

User-based filtering is also called memory based collaborative filtering. Why? Because we need to store all the ratings in order to make recommendations.

Item-based filtering is also called model-based collaborative filtering. Why? Because we don't need to store all the ratings. We build a model representing how close every item is to every other item!

Adjusted Cosine Similarity
We also already talked about grade inflation where a user gives higher ratings than expected. To compensate for this grade inflation we will subtract the user's average rating from each rating. This gives us the adjusted cosine similarity formula.

Adjusted Cosine Similarity is a Model-Based Collaborative Filtering Method. As mentioned a few pages back, one advantage of these methods compared to memory-based ones is that they scale better. For large data sets, model-based methods tend to be fast and require less memory.

Slope One
Another popular algorithm for item-based collaborative filtering is Slope One. A major advantage of Slope One is that it is simple and hence easy to implement. Slope One was introduced in the paper "Slope One Predictors for Online Rating-Based Collaborative Filtering" by Daniel Lemire and Anna Maclachlan. This is an awesome paper and well worth the read.

There are actually several Slope One algorithms. I will present the weighted Slope One algorithm.

You can consider Slope One to be in two parts. First, ahead of time (in batch mode, in the middle of the night or whenever) we are going to compute what is called the deviation between every pair of items. In the second phase we actually make predictions.
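
A compact sketch of both phases (my own simplification of weighted Slope One, not the book's code). The first function computes the average deviation between every pair of items ahead of time; the second combines those deviations with one user's existing ratings to predict an unrated item.

def compute_deviations(data):
    """data: {user: {item: rating}}.
    Returns {(item1, item2): (average deviation of item1 over item2, count)}."""
    freqs, devs = {}, {}
    for ratings in data.values():
        for item1, r1 in ratings.items():
            for item2, r2 in ratings.items():
                if item1 == item2:
                    continue
                key = (item1, item2)
                freqs[key] = freqs.get(key, 0) + 1
                devs[key] = devs.get(key, 0) + (r1 - r2)
    return {key: (devs[key] / freqs[key], freqs[key]) for key in devs}

def slope_one_predict(user_ratings, deviations, target_item):
    """Weighted Slope One prediction of target_item for one user."""
    numerator = denominator = 0
    for item, rating in user_ratings.items():
        if (target_item, item) in deviations:
            deviation, count = deviations[(target_item, item)]
            numerator += (deviation + rating) * count
            denominator += count
    return numerator / denominator if denominator else None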

Chapter 4 : Content Based Filtering & Classification
In previous chapters we talked about some of the difficulties of collaborative filtering including problems with data sparsity and scalability. Another problem is that recommendation systems based on collaborative filtering tend to recommend already popular items -- there is a bias towards popularity.

The difference in scale among attributes is a big problem for any recommendation system.

The solution is normalization!

One common method of normalization involves having the values of each feature range from 0 to 1.

The problem with the standard score is that it is greatly influenced by outliers. Because of this problem with the mean, the standard score is often modified.

To compute the Modified Standard Score you replace the mean in the formula by the median (the middle value) and replace the standard deviation by what is called the absolute standard deviation.

Modified Standard Score:

(each value - median) / (absolute standard deviation)
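
A sketch of both normalization methods mentioned here (my own code, not the book's): min-max scaling of each feature to the range 0 to 1, and the Modified Standard Score using the median and the absolute standard deviation.

def min_max(values):
    """Rescale values to the range 0..1 (assumes the values are not all identical)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def modified_standard_scores(values):
    """(value - median) / absolute standard deviation."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    median = ordered[mid] if len(ordered) % 2 else (ordered[mid - 1] + ordered[mid]) / 2
    asd = sum(abs(v - median) for v in values) / len(values)  # absolute standard deviation
    return [(v - median) / asd for v in values]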

We should normalize when
1. our data mining method calculates the distance between two entries based on the values of their features.
2. the scale of the different features is different (especially when it is drastically different -- for ex., the scale of asking price compared to the scale of the number of bedrooms).

In linear algebra, a vector is a quantity that has magnitude and direction. Various well-defined operations can be performed on vectors, including adding and subtracting vectors and scalar multiplication.

In data mining, a vector is simply a list of numbers that represent the attributes of an object. The example on the previous page represented attributes of a song as a list of numbers. Another example would be representing a text document as a vector -- each position of the vector would represent a particular word and the number at that position would represent how many times that word occurred in the text.

Once we define attributes this way, we can perform vector operations (from linear algebra) on them.

A classifier is a program that uses an object's attributes to determine what group or class it belongs to!

A classifier uses a set of objects that are already labeled with the class they belong to. It uses that set to classify new, unlabeled objects.

Classifiers can be used in a wide range of applications. The following lists just a few.

Twitter Sentiment Classification

A number of people are working on classifying the sentiment (a positive or negative opinion) in tweets. This can be used in a variety of ways. For example, if Axe releases a new underarm deodorant, they can check whether people like it or not. The attributes are the words in the tweet.

Classification for Targeted Political Ads

This is called microtargeting. People are classified into such groups as "Barn Raisers", "Inner Compass", and "Hearth Keepers." Hearth Keepers, for example, focus on their family and keep to themselves.

Health and the Quantified Self
It's the start of the quantified self explosion. We can now buy simple devices like the Fitbit and the Nike Fuelband. Intel and other companies are working on intelligent homes that have floors that can weigh us, keep track of our movements and alert someone if we deviate from normal. Experts are predicting that in a few years we will be wearing tiny compu-patches that can monitor dozens of factors in real time and make instant classifications.

Automatic identification of people in photos
There are apps now that can identify and tag your friends in photographs. (And the same techniques apply to identifying people walking down the street using public video cams.) Techniques vary but some of them use attributes like the relative position and size of a person's eyes, nose, jaw, etc.

Targeted Marketing
Similar to political microtargeting. Instead of a broad advertising campaign to sell my expensive Vegas time share luxury condos, can I identify likely buyers and market just to them? Even better if I can identify subgroups of likely buyers and I can really tailor my ads to specific groups.

Heads Up on Normalization
In this chapter we talked about the importance of normalizing data. This is critical when attributes have drastically different scales (for example, income and age). In order to get accurate distance measurements, we should rescale the attributes so they all have the same scale.

While most data miners use the term 'normalization' to refer to this rescaling, others make a distinction between 'normalization' and 'standardization'. For them, normalization means scaling values so they lie on a scale from 0 to 1. Standardization, on the other hand, refers to scaling an attribute so the average (mean or median) is 0, and other values are deviations from this average (standard deviation or absolute standard deviation).

The list is endless
-- classifying people as terrorist or nonterrorist
-- automatic classification of email (hey, this email looks pretty important; this is regular email; this looks like spam)
-- predicting medical clinical outcomes
-- identifying financial fraud (for ex., credit card fraud)

Chapter 5 : Further Explorations in Classification -- Evaluating algorithms and kNN.
In evaluating any data mining algorithm, if our test is a subset of our training data the results will be optimistic and often overly optimistic. So that doesn't seem like a great idea.

Leave-One-Out
In the machine learning literature, n-fold cross validation (where n is the number of samples in our data set) is called leave-one-out. We already mentioned one benefit of leave-one-out -- at every iteration we are using the largest possible amount of our data for training. The other benefit is that it is deterministic.

10-fold cross validation is called non-deterministic because when we run the test again we are not guaranteed to get the same result. In contrast, the leave-one-out method is deterministic. Every time we use leave-one-out on the same classifier and the same data we will get the same result. That is a good thing!

The main disadvantage of leave-one-out is the computational expense of the method.
The problem with the leave-one-out evaluation method is that necessarily all the test sets are non-stratified since they contain only one instance. In sum, while leave-one-out may be appropriate for very small datasets, 10-fold cross validation is by far the most popular choice.
So far, we have been evaluating our classifier by computing the percent accuracy. That is,

(number of test cases correctly classified) / (total number of test cases)
Sometimes we may want a more detailed picture of the performance of our classification algorithm, and one such detailed visualization is a table called the confusion matrix. The rows of the confusion matrix represent the actual class of the test cases; the columns represent what our classifier predicted.

The name confusion matrix comes from the observation that it is easy for us to see where our algorithm gets confused.

The diagonal of the confusion matrix represents instances that were classified correctly. It is easy to inspect the matrix to get an idea of what type of errors our classifier is making.

Kappa Statistic
The Kappa statistic compares the performance of a classifier to that of a classifier that makes predictions based solely on chance.

Generate another confusion matrix that will represent the results of a random classifier (a classifier that makes random predictions).

K = (P(c) - P(r)) / (1 - P(r))

where P(c) is the accuracy of the real classifier and P(r) is the accuracy of the random one.
A commonly cited scale on how to interpret Kappa
< 0 : less than chance performance
0.01 - 0.20 : slightly good performance
0.21 - 0.40 : fair performance
0.41 - 0.60 : moderate performance
0.61 - 0.80 : substantially good performance
0.81 - 1.00 : near perfect performance
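
A one-line sketch of the Kappa formula above (my code); the two accuracies in the example are made-up numbers.

def kappa(p_c, p_r):
    """p_c: accuracy of the real classifier; p_r: accuracy of the random classifier."""
    return (p_c - p_r) / (1 - p_r)

# hypothetical: our classifier is 85% accurate, a random classifier would be 35% accurate
print(kappa(0.85, 0.35))  # about 0.77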

Improvements to the Nearest Neighbor Algorithm!
One problem with the nearest neighbor algorithm occurs when we have outliers.

kNN
One way to improve our nearest neighbor approach is instead of looking at one nearest neighbor we look at a number of nearest neighbors -- k nearest neighbors or kNN. Each neighbor will get a vote and the algorithm will predict that the instance will belong to the class with the highest number of votes.

So when we are trying to predict a discrete class (marathoners, gymnasts, or basketball players, for example) we can use the voting method. The class with the most votes will be the one assigned to the instance. If there is a tie the predicted class will be selected randomly from the classes that are tied. When we are trying to predict a numeric value like how many stars a person will give the band Funky Meters we can apportion influence from the nearest neighbors to compute a distance-weighted value.

Because I want the rating of the closest person to be weighted more heavily in the final value than those of the other neighbors, the first step is to convert the distance measure so that the larger the number, the closer that person is. We can do this by computing the inverse of the distance (that is, 1 over the distance).

Finally, we are going to multiply each person's influence and rating and sum the results.
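
A sketch of that distance-weighted prediction (my own code): convert each distance to its inverse, turn the inverses into proportional influences, and sum influence times rating. The distances and ratings in the example are made up.

def weighted_knn_prediction(neighbors):
    """neighbors: list of (distance, rating) pairs for the k nearest people.
    Closer people (smaller distances) get proportionally more influence."""
    inverses = [(1.0 / distance, rating) for distance, rating in neighbors]
    total = sum(inverse for inverse, _ in inverses)
    return sum((inverse / total) * rating for inverse, rating in inverses)

# hypothetical: the three nearest people are at distances 10, 15, 20 and rated the band 5, 4, 3
print(weighted_knn_prediction([(10, 5), (15, 4), (20, 3)]))  # about 4.2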

Picking a good algorithm makes a significant difference. However, if you are trying to solve a practical problem (rather than publish a research paper) it might not be worth your while to spend a lot of time researching and tweaking algorithms. You will perhaps get more bang for your buck -- or a better return on your time -- if you concentrate on getting more data.

People have used kNN classifiers for
-- recommending items at Amazon
-- assessing consumer credit
-- classifying land cover using image analysis
-- recognizing faces
-- classifying the gender of people in images
-- recommending web pages
-- recommending vacation packages

Chapter 6 : Probability and Naive Bayes
With the nearest neighbor algorithms, it is difficult to quantify confidence about a classification. With classification methods based on probability -- Bayesian methods -- we can not only make a classification but we can make probabilistic classifications -- this athlete is 80% likely to be a basketball player, this patient has a 40% chance of getting diabetes in the next five years, the probability of rain in Las Cruces in the next 24 hours is 10%.

Nearest Neighbor approaches are called lazy learners. They are called this because when we give them a set of training data, they just basically save -- or remember -- the set. Each time such a classifier classifies an instance, it goes through the entire training dataset. If we have 100,000 music tracks in our training data, it goes through all 100,000 tracks each time it classifies an instance.

Bayesian methods are called eager learners. When given a training set, eager learners immediately analyze the data and build a model. When they want to classify an instance, they use this internal model. Eager learners tend to classify instances faster than lazy learners.

The ability to make probabilistic classifications, and the fact that they are eager learners, are two advantages of Bayesian methods.

Prior probability is denoted by P(h) -- the probability of hypothesis h.

P(h|D) -- the probability of the hypothesis h given some data D.

P(A|B) = P(A ∩ B) / P(B)

P(h), the probability that some hypothesis is true, is called the prior probability of h. (Before we have any evidence).
P(h|d) is called the posterior probability of h. After we observe some data d what is the probability of h? It is also called conditional probability.

In our quest to build a Bayesian classifier we will need two additional probabilities, P(D) and P(D|h).
P(D) is the probability that some training data will be observed.
P(D|h) is the probability that some data value holds given the hypothesis.

Bayes Theorem describes the relationship between P(h), P(h|D), P(D) and P(D|h)
P(h|D) = P(D|h) P(h) / P(D)
This theorem is the cornerstone of all Bayesian methods. Usually in data mining we use this theorem to decide among alternative hypotheses. Given the evidence, is this person a gymnast, a marathoner, or a basketball player? Given the evidence, will this person buy Sencha tea or not? To decide among alternatives we compute the probability for each hypothesis.
Once we compute all these probabilities, we will pick the hypothesis with the highest probability. This is called the maximum a posteriori hypothesis, or hMAP.

Why do we need Bayes Theorem?

For many real world problems it is very difficult to compute P(h|D) directly. So Bayes Theorem provides a strategy for computing P(h|D) when it is hard to do so directly.

Naive Bayes
Most of the time we have more evidence than just a single piece of data. To compute the probability of a hypothesis given all the evidence, we simply multiply the individual probabilities.

The Naive Bayes Classifier code consists of two components, one for training and one for classifying.

Training
The output of training needs to be:
-- a set of prior probabilities
-- a set of conditional probabilities
Classifying
Once we have trained the classifier, we want to classify various instances.

Estimating Probabilities
The probabilities in Naive Bayes are really estimates of the true probabilities. True probabilities are those obtained from the entire population. However, giving the test to everyone is nearly impossible. We can estimate that probability by selecting a random, representative sample of the population.

When a probability is 0, it dominates the Naive Bayes calculation -- it doesn't matter what the other values are. Another problem is that probabilities based on a sample produce a biased underestimate of the true probability.

P(x | y) = nc / n
Here n is the total number of instances of class y in the training set; nc is the total number of instances of class y that have the value x.

The problem we have is when nc equals zero. We can eliminate this problem by changing the formula to
P(x | y) = (nc + mp) / (n + m)

m is a constant called the equivalent sample size, and p is the prior estimate of the probability (often 1 divided by the number of possible values of the attribute). The method for determining the value of m varies.
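
A sketch of that smoothed estimate (my code). The example numbers are made up: nc = 0 of n = 3 instances have the value, the attribute has two possible values so p = 0.5, and m is set to 3.

def m_estimate(nc, n, p, m):
    """Smoothed estimate of P(x | y): (nc + m*p) / (n + m)."""
    return (nc + m * p) / (n + m)

print(m_estimate(nc=0, n=3, p=0.5, m=3))  # 0.25 instead of a hard zero
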
Numbers
You probably noticed that I changed from the numerical data I used in all the nearest neighbor approaches to categorical data for the naive Bayes formula. By "categorical data" we mean that the data is put into discrete categories. For example, we divide people up by how they voted for a particular bill: the people who voted 'yes' go in one category and the people who voted 'no' go in another. Or we might categorize musicians by the instrument they play. So all saxophonists go in one bucket, all drummers in another, all pianists in another, and so on. These categories do not form a scale.

For Bayesian approaches we count things -- how many occurrences are there of people who are sedentary -- and it may not be initially obvious how to count things that are on a scale -- for example, something like grade point average.

There are two approaches
Method 1: Making categories
One solution is to make categories by discretizing the continuous attribute.
Method 2 : Gaussian distributions!

Standard deviation describes the range of scattering. If all the values are bunched up around the mean, the standard deviation is small; if the values are scattered, the standard deviation is large.

Gaussian distribution is just a highfalutin term for the normal distribution. The function that describes this distribution is called the Gaussian function or bell curve. Most of the time the Numerati (aka data miners) assume attributes follow a Gaussian distribution. This means that about 68% of the instances in a Gaussian distribution fall within 1 standard deviation of the mean and 95% of the instances fall within 2 standard deviations of the mean.
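
A sketch of how a continuous attribute can be handled this way (my own code): plug the attribute's sample mean and standard deviation for a class into the Gaussian probability density function, and use that density in place of a counted probability.

from math import sqrt, pi, exp

def gaussian_pdf(x, mean, sd):
    """Probability density of x under a normal distribution with the given mean and sd."""
    return (1.0 / (sqrt(2 * pi) * sd)) * exp(-((x - mean) ** 2) / (2 * sd ** 2))

# hypothetical: P(gpa = 3.7 | class) where that class has mean gpa 3.3 and sd 0.4
print(gaussian_pdf(3.7, 3.3, 0.4))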

kNN
k Nearest Neighbors is a reasonable choice when the training set is large. kNN is extremely versatile and used in a large number of fields, including recommendation systems, proteomics (the study of the entire protein set of an organism), the interaction among proteins, and image classification.

Advantages of kNN
-- simple to implement
-- does not assume the data has any particular structure -- a good thing!

Main disadvantage of kNN
-- a large amount of memory is needed to store the training set.

Advantages of Bayes
-- simple to implement (just counting things)
-- need less training data than many other methods
-- a good method to use if you want something that performs well and is fast.

Main disadvantages of Bayes
It cannot learn interactions among features. For example, it cannot learn that I like foods with cheese and I like foods with rice but I do not like foods with both.

If events are independent we can determine their joint probability (the probability that they both occurred) by multiplying the individual probabilities together.

So, for Bayes to work we need to use attributes that are independent, but most real-world problems violate that condition. What we are going to do is just assume that they are independent! We are using the magic wand of sweeping things under the rug (TM) -- and ignoring this problem. We call it naive Bayes because we are naively assuming independence even though we know the attributes are not independent. It turns out that naive Bayes works really, really well even with this naive assumption.

Chapter 7 : Naive Bayes and Text
"Structured data" -- data where instances are described by a set of attributes (for example, a row in a table might describe a car by a set of attributes including miles per gallon, the number of cylinders and so on).
Unstructured data includes things like email messages, twitter messages, blog posts, and newspaper articles. These types of things (at least at first glance) do not seem to be neatly represented in a table.

So how can I create an automatic text classification system? We will use naive Bayes methods. We start with a training data set and, since we are now interested in unstructured text, this data set is called the training corpus. Each entry in the corpus we will call a document, even if it is a 140-character Twitter post. Each document is labeled with its class.

Each post is labeled in some way as a 'favorable' review or 'unfavorable' and we are going to train our classifier using this corpus of labeled documents.

When we start with labeled training data it is called 'supervised learning.' Text classification is an example of supervised learning.

Learning from unlabeled text is called unsupervised learning. One example of unsupervised learning is clustering.

There is also semi-supervised learning where the system learns from both labeled and unlabeled data. Often the system is bootstrapped using labeled data and then in subsequent learning makes use of unlabeled data.

Training Phase
For each word wk in the vocabulary we are going to compute the probability of that word occurring given each hypothesis: P(wk | hi).

We are going to compute this as follows. For each hypothesis:
1. combine the documents tagged with that hypothesis into one text file.
2. count how many word occurrences there are in the file. This time, if there are 500 occurrences of the word 'the', we are going to count it 500 times. Let's call this n.
3. For each word in the vocabulary wk, count how many times that word occurred in the text. Call this nk.
4. For each word in the vocabulary wk, compute

P(wk | hi) = (nk + 1) / (n + |Vocabulary|)
Naive Bayes Classification Phase
Once we have completed the training phase we can classify documents using the formula

hMAP = arg max h∈H P(D | h) P(h)
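
A compact sketch of both phases for text (my own code, not the book's). Training builds the priors P(hi) and the add-one smoothed word probabilities from the formula above; classification sums the log probabilities (to avoid numeric underflow from multiplying many small numbers) and picks the most probable hypothesis.

from collections import Counter
from math import log

def train(corpus):
    """corpus: {hypothesis: list of documents, each a list of words}.
    Returns priors P(h), word probabilities P(w | h), and the vocabulary."""
    vocabulary = {word for docs in corpus.values() for doc in docs for word in doc}
    total_docs = sum(len(docs) for docs in corpus.values())
    priors, word_probs = {}, {}
    for h, docs in corpus.items():
        priors[h] = len(docs) / total_docs
        counts = Counter(word for doc in docs for word in doc)
        n = sum(counts.values())
        word_probs[h] = {w: (counts[w] + 1) / (n + len(vocabulary)) for w in vocabulary}
    return priors, word_probs, vocabulary

def classify(document, priors, word_probs, vocabulary):
    """Return the maximum a posteriori hypothesis for a document (a list of words)."""
    best_h, best_score = None, float("-inf")
    for h in priors:
        score = log(priors[h]) + sum(log(word_probs[h][w]) for w in document if w in vocabulary)
        if score > best_score:
            best_h, best_score = h, score
    return best_h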

Throw out the 200 most frequent words in English. Removing these words cuts down the size of our text by about half. Plus, it doesn't look like removing these words will have any impact on our ability to categorize texts. Indeed, data miners have called such words 'words without any content' and fluff words.

These words that we remove are called 'stop words'. We have a list of such words, the 'stop word list', and remove these words from the text in a preprocessing step. We remove these words because 1) it cuts down on the amount of processing we need to do, and 2) it does not negatively impact the performance of our system -- as the noise argument suggests, removing them might even improve performance.

Naive Bayes and Sentiment Analysis
The goal of sentiment analysis is to determine the writer's attitude (or opinion).
One common type of sentiment analysis is to determine the polarity of a review or comment (positive or negative) and we can use a Naive Bayes Classifier for this task.

Chapter 8 : Clustering
After we train a classifier, it selects a label from a set of labels it acquired during the training phase -- it knows the possible labels. In clustering, by contrast, there are no predefined labels. The system divides a set of instances into clusters or groups based on some measure of similarity.

There are two main types of clustering algorithms.
k-means clustering
We tell the algorithm how many clusters to make.

hierarchical clustering
We don't specify how many clusters to make. Instead the algorithm starts with each instance in its own cluster. At each iteration of the algorithm it combines the two most similar clusters into one. It repeatedly does this until there is only one cluster.

To determine the 'closest clusters' we use a distance formula. But we have some choices in how we compute the distance between two clusters, which leads to different clustering methods.

Single-linkage clustering
We define the distance between two clusters as the shortest distance between any member of one cluster to any member of the other.

Complete-linkage clustering
We define the distance between two clusters as the greatest distance between any member of one cluster to any member of the other.

Average-linkage Clustering
We define the distance between two clusters as the average distance between any member of one cluster to any member of the other.

Coding a hierarchical clustering algorithm
For our task of building a hierarchical clusterer, we will put the clusters in a priority queue. The priority will be the shortest distance to a cluster's nearest neighbor.
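
A small sketch of the priority-queue idea using Python's heapq (mine, and a simplification: it queues candidate pairs keyed by distance rather than whole cluster objects, but the principle is the same -- the closest pair is always at the front).

import heapq

# hypothetical pairwise distances between four starting clusters
distances = {("A", "B"): 2.0, ("A", "C"): 5.5, ("A", "D"): 6.0,
             ("B", "C"): 4.5, ("B", "D"): 7.0, ("C", "D"): 1.5}

queue = []
for pair, dist in distances.items():
    heapq.heappush(queue, (dist, pair))  # priority = distance between the two clusters

nearest_distance, nearest_pair = heapq.heappop(queue)
print(nearest_pair, nearest_distance)    # ('C', 'D') 1.5 -- merge these two first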

The basic k-means algorithm is:
1. select k random instances to be the initial centroids.
2. REPEAT
3. assign each instance to the nearest centroid.
4. update centroids by computing mean of each cluster.
5. UNTIL centroids don't change (much).
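
A bare-bones sketch of that loop (my own code, for points given as tuples of numbers and Euclidean distance; the book's version is more general). The comments follow the five steps above.

import random
from math import dist  # Python 3.8+

def kmeans(points, k, max_iterations=100):
    """points: list of coordinate tuples. Returns the final centroids."""
    centroids = random.sample(points, k)                        # 1. k random instances
    for _ in range(max_iterations):                             # 2. REPEAT
        clusters = [[] for _ in range(k)]
        for p in points:                                        # 3. assign to nearest centroid
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = []
        for i, cluster in enumerate(clusters):                  # 4. mean of each cluster
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])              # keep an empty cluster's centroid
        if new_centroids == centroids:                          # 5. UNTIL centroids don't change
            break
        centroids = new_centroids
    return centroids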

For you computer science geeks:
K-means is an instance of the Expectation Maximization (EM) Algorithm, which is an iterative method that alternates between two phases. We start with an initial estimate of some parameter. In the K-means case we start with an estimate of the centroids. In the expectation (E) phase, we use this estimate to place points into their expected cluster. In the Maximization (M) phase we use these expected values to adjust the estimate of the centroids.

The k-means clustering algorithm is a hill-climbing style algorithm: there is no guarantee that it will find the optimal division of the data into clusters. At the start of the algorithm we select an initial set of centroids randomly, which is much like picking a random starting spot. Then, based on this initial set, we optimize the clusters, finding a local optimum.

The final clusters are heavily dependent on the selection of the initial centroids.

SSE or Scatter
To determine the quality of a set of clusters we can use the sum of the squared error (SSE). This is also called scatter. Here is how to compute it: for each point, square the distance from that point to its centroid, then add all those squared distances together.
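
A sketch of the SSE computation (mine), assuming clusters of coordinate tuples and their centroids as produced by a k-means run like the one above.

from math import dist  # Python 3.8+

def sse(clusters, centroids):
    """clusters: list of lists of points; centroids[i] is the centroid of clusters[i]."""
    return sum(dist(p, centroids[i]) ** 2
               for i, cluster in enumerate(clusters)
               for p in cluster)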

k-means++
A major weakness in k-means is in the first step where it randomly picks k of the datapoints to be the initial centroids. It is the random part that is the problem. Because it is random, sometimes the initial centroids are a great pick and lead to near optimal clustering. Other times the initial centroids are a reasonable pick and lead to good clustering. But sometimes -- again, because we pick randomly -- the initial centroids are poor, leading to non-optimal clustering. The k-means++ algorithm fixes this defect by changing the way we pick the initial centroids. Everything else about k-means remains the same.

k-means++ -- selecting the initial set of centroids
1. Initially, the set of initial centroids is empty.
2. Select the first centroid randomly from the data points as before.
3. Until we have k initial centroids:
a. Compute the distance, D, between each datapoint (dp) and its closest centroid. This distance is D(dp).
b. With a probability proportional to D(dp), select one datapoint at random to be a new centroid and add it to the set of centroids.
c. REPEAT
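
A sketch of that initialization (my own code). It follows the description above, picking each new centroid with probability proportional to D(dp); random.choices does the weighted random pick, and an already-chosen point has distance 0, so it can never be picked again.

import random
from math import dist  # Python 3.8+

def kmeans_pp_init(points, k):
    """Pick k initial centroids for k-means++ from a list of coordinate tuples."""
    centroids = [random.choice(points)]                              # first centroid at random
    while len(centroids) < k:
        d = [min(dist(p, c) for c in centroids) for p in points]     # D(dp) for each datapoint
        centroids.append(random.choices(points, weights=d, k=1)[0])  # pick proportional to D(dp)
    return centroids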
