K-Means Clustering
K-means clustering is one of the simplest and most popular unsupervised machine learning algorithms. Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It is defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.
Cluster analysis can be performed on the basis of features, where we try to find subgroups of samples that share similar feature values, or on the basis of samples, where we try to find subgroups of features that behave similarly across samples.
In contrast to supervised learning, clustering is considered an unsupervised learning method because we have no ground truth: there are no true labels against which to compare the clustering output and measure its performance. We only try to uncover the structure of the data by grouping the data points into distinct subgroups. In general, unsupervised algorithms such as K-means make inferences from datasets using only the input vectors, without referring to known or labelled outcomes.
A cluster is a collection of data points grouped together because of certain similarities. We have to define a target number k, which refers to the number of centroids needed in the dataset. A centroid is the (imaginary or real) location that represents the center of a cluster.
Each data point is allocated to exactly one cluster so as to minimize the in-cluster sum of squares. Put simply, the K-means algorithm identifies k centroids and then allocates every data point to the nearest one, keeping the clusters as compact as possible. The "means" in K-means refers to averaging the data, that is, finding the centroid.
To process the learning data, the K-means algorithm starts with a first group of randomly selected centroids, which serve as the starting points of the clusters, and then performs iterative calculations to optimize the positions of the centroids.
It stops creating and optimizing clusters when:
- The centroids have stabilized, meaning their values no longer change between iterations, because the clustering has converged.
- The defined number of iterations has been reached.
K-means is an iterative algorithm that tries to partition the dataset into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to only one group. It tries to make the data points within a cluster as similar as possible while keeping the clusters as different (far apart) as possible.
It assigns data points to clusters such that the sum of the squared distances between the data points and their cluster's centroid is at a minimum. The less variation we have within clusters, the more homogeneous (similar) the data points are within the same cluster.
The K-means algorithm works as follows:
- Specify the required number of clusters K.
- Initialize the centroids by shuffling the dataset and then randomly selecting K data points as centroids, without replacement.
- Iterate until the centroids stop changing, i.e. the assignment of data points to clusters no longer changes:
  - Compute the squared distance between each data point and every centroid.
  - Assign each data point to the cluster of its closest centroid.
  - Recompute each cluster's centroid as the average of all the data points that belong to it.
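The steps above can be sketched directly in NumPy. This is a minimal illustration; the function name, convergence check, and random seed are my own choices, not from any particular library, and it does not handle the corner case of a cluster becoming empty.

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # Squared distance from every point to every centroid: shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        # Assign each point to its closest centroid
        labels = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned points
        # (note: a cluster could end up empty, which this sketch ignores)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break  # assignments have stabilized
        centroids = new_centroids
    return centroids, labels
```

The double loop of "assign, then average" is the whole algorithm; everything else (k-means++, multiple restarts) is refinement of the random initialization.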
Step 1: Import libraries
- Pandas (reading and writing tabular data).
- NumPy (efficient numerical computation).
- Matplotlib (data visualization).
Step 2: Generate random data
Step 3: Use Scikit-Learn
Step 4: Find the centroid
Step 5: Testing the algorithm
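Those five steps might look as follows. This is a sketch, not the original tutorial's code: the synthetic two-blob data, the variable names, and the choice of k=2 are all illustrative.

```python
# Step 1: import libraries
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs anywhere
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Step 2: generate random data with two obvious groups
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(-5, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

# Step 3: use Scikit-Learn to fit K-means with k=2
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 4: find the centroids
print(kmeans.cluster_centers_)  # one row per cluster

# Step 5: test the algorithm on new, unseen points
print(kmeans.predict([[-6.0, -6.0], [6.0, 6.0]]))

# Visualize the clusters and their centroids
df = pd.DataFrame(X, columns=["x", "y"])
plt.scatter(df["x"], df["y"], c=kmeans.labels_, s=10)
plt.scatter(*kmeans.cluster_centers_.T, c="red", marker="x", s=100)
plt.savefig("clusters.png")
```

With blobs this well separated, the fitted centroids land near (-5, -5) and (5, 5), and the two test points fall into different clusters.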
K-means is an efficient method. However, we need to specify the number of clusters in advance, and the final result is sensitive to the initialization and often terminates at a local optimum. A practical approach is to compare the outcomes of multiple runs with different values of k and choose the best one based on a predefined criterion. In general, a larger k decreases the error but increases the risk of overfitting.
K-means is a very popular algorithm, used in a variety of applications such as market segmentation, document clustering, image segmentation, and image compression. The goals when we undertake a cluster analysis are usually to:
- Get a meaningful perception of the structure of the data we’re dealing with.
- Cluster, then predict: build different models for different clusters when we believe there is wide variation in the behaviors of the different clusters.
Cluster analysis doesn't have a solid evaluation metric that we can use to compare the outcomes of different clustering algorithms. Moreover, since K-means requires k as an input and doesn't learn it from the data, there is no single right answer for the number of clusters in any given problem. Domain knowledge and intuition sometimes help, but often they don't settle the question. In the cluster-then-predict procedure, however, we can evaluate how well the models perform for different values of K, since the clusters are used in the downstream modeling.
There are two metrics that may give us some intuition about k:
- Elbow method
- Silhouette analysis
The elbow method gives us an idea of what a good number of clusters k would be, based on the sum of squared errors (SSE) between data points and their assigned clusters' centroids. We pick the value of k at the spot where the SSE starts to flatten out, forming an elbow.
The graph above shows that k=2 is a good choice. Sometimes it's hard to figure out a good number of clusters, because the curve is monotonically decreasing and may not show an elbow, i.e. an obvious point where it starts flattening out.
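In code, the SSE for each k is available as scikit-learn's `inertia_` attribute. A short sketch on synthetic two-blob data (illustrative, not the data behind the graph above):

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated blobs, so the elbow should appear at k=2
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

sse = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_  # sum of squared distances to the closest centroid

for k in sse:
    print(k, round(sse[k], 1))
```

The SSE collapses when going from one cluster to two and barely improves afterwards, which is exactly the elbow shape described above.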
Silhouette analysis may be used to determine the degree of separation between clusters. For each sample i:
- Calculate its average distance to all other data points in the same cluster (a_i).
- Calculate its average distance to all data points in the nearest neighboring cluster (b_i).
- Compute the coefficient as s_i = (b_i - a_i) / max(a_i, b_i).
The coefficient values range in the interval [-1, 1].
- If it is 0, the sample is very close to the neighboring clusters.
- If it is 1, the sample is far away from the neighboring clusters.
- If it is -1, the sample has been assigned to the wrong cluster.
Therefore, we want the coefficients to be as large as possible, close to 1, to have good clusters.
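scikit-learn implements this as `silhouette_score` (the mean coefficient over all samples). A brief sketch on synthetic data where the true number of blobs is 2:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-5, 1, (100, 2)), rng.normal(5, 1, (100, 2))])

scores = {}
for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean s_i over all samples

for k in scores:
    print(k, round(scores[k], 3))
```

For this data, k=2 yields the highest mean coefficient; splitting either blob further forces some samples close to a neighboring cluster and drags the score down.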
The K-means algorithm is good at capturing the structure of the data when the clusters have a roughly spherical shape. It tries to construct a nice spherical region around each centroid, which means that as soon as the clusters have a complicated geometric shape, K-means does a poor job of clustering the data. It also doesn't allow data points that are far away from each other to share the same cluster, even when they obviously belong together.
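One quick way to see this limitation (an illustrative experiment, not from the original text) is to run K-means on scikit-learn's two-moons dataset, whose two groups are obvious to the eye but far from spherical:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-circles: clearly two groups, but not spherical blobs
X, y_true = make_moons(n_samples=300, noise=0.05, random_state=0)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Agreement with the true moons (1.0 would be a perfect recovery)
ari = adjusted_rand_score(y_true, labels)
print(round(ari, 3))
```

K-means splits the plane into two half-planes here, slicing through both moons, so the adjusted Rand index stays well below 1; density-based methods such as DBSCAN recover this shape much better.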