A cluster is a group of data points, whereas a clustering algorithm assigns data points to groups. Clustering is a method of unsupervised learning. Generally, data points within a group share similar properties or features, while data points in different groups are highly dissimilar to one another.
In the figure above, (a) shows the unstructured data and (b) shows the clustered data.
Clustering models are usually of two types:
- Hard Clustering
- Soft Clustering
In hard clustering, each data point is either assigned to a cluster completely or not assigned at all; that is, a data point either belongs to a cluster or it does not.
In soft clustering, each data point is assigned a probability or likelihood of belonging to each cluster. In other words, instead of receiving a single label, every data point gets a membership score for each cluster, and is associated with clusters according to those scores.
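The contrast can be sketched in a few lines. This is a minimal illustration, assuming scikit-learn is available; KMeans stands in for a hard-clustering algorithm and GaussianMixture for a soft one, but they are not the only options.

```python
# Sketch: hard vs. soft assignment on toy 2-D data (illustrative only).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two well-separated blobs of 50 points each.
X = np.vstack([rng.normal(0, 0.5, (50, 2)),
               rng.normal(4, 0.5, (50, 2))])

# Hard clustering: each point gets exactly one cluster label.
hard_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Soft clustering: each point gets a probability per cluster, summing to 1.
probs = GaussianMixture(n_components=2, random_state=0).fit(X).predict_proba(X)
```

Here `hard_labels` contains a single integer per point, while each row of `probs` holds one probability per cluster.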
Types of clustering algorithms
There are several types of clustering algorithms in machine learning:
- Centroid models
- Connectivity models
- Density models
- Distribution models
Let us study each of them in detail.
Centroid models are iterative clustering models in which clusters are formed according to the closeness of data points to cluster centroids. These algorithms are efficient but sensitive to initialization and to outliers. In models of this type, the number of clusters must be specified beforehand, so some prior knowledge of the dataset is important.
In these models, each centroid is placed so that the distance between the data points in a cluster and its center is minimized. The models run iteratively to find a local optimum, and they organize the data into non-hierarchical (flat) clusters.
K-Means is the most widely used centroid-based clustering algorithm.
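The iterative update at the heart of K-Means can be sketched directly. This is a minimal, from-scratch version of one Lloyd iteration, assuming only NumPy; a production implementation (such as scikit-learn's `KMeans`) adds smarter initialization and convergence checks.

```python
# Sketch of the update K-Means repeats: assign each point to its nearest
# centroid, then move each centroid to the mean of its assigned points.
import numpy as np

def kmeans_step(X, centroids):
    # Squared Euclidean distance from every point to every centroid.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    # Recompute each centroid; keep the old one if its cluster went empty.
    new_centroids = np.array([
        X[labels == j].mean(axis=0) if (labels == j).any() else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
centroids = X[rng.choice(len(X), 2, replace=False)]  # init from data points
for _ in range(10):  # repeat until (approximately) converged
    labels, centroids = kmeans_step(X, centroids)
```

The loop runs a fixed number of iterations for simplicity; real implementations stop when the assignments no longer change.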
The basic idea of connectivity models is somewhat similar to that of centroid models: clusters are defined by the closeness of data points. These models are based on the notion that nearby data points are more similar to each other than to points that are far away.
Connectivity models follow two approaches. The first (agglomerative) starts by placing every data point in its own cluster and then merges the closest clusters step by step. The second (divisive) starts with all data points in a single cluster and then splits it recursively as the distance within clusters increases. The choice of distance function is left to the user. These models are easy to understand and interpret, but they lack scalability for large datasets.
Hierarchical clustering and its variants are the best-known examples of connectivity models.
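The agglomerative (bottom-up) approach described above can be sketched as follows. This is an illustrative example assuming scikit-learn; the linkage criterion shown is one of several possible choices.

```python
# Sketch of agglomerative clustering: start from singleton clusters and
# merge the closest pairs until only the requested number remain.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.default_rng(2)
# Two well-separated blobs of 40 points each.
X = np.vstack([rng.normal(0, 0.4, (40, 2)),
               rng.normal(5, 0.4, (40, 2))])

# 80 singleton clusters are merged pairwise until 2 clusters are left.
labels = AgglomerativeClustering(n_clusters=2,
                                 linkage="average").fit_predict(X)
```

Unlike centroid models, the merge history also yields a full dendrogram, which is what makes these models easy to interpret.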
In the distribution-based approach, data points are grouped according to the probability that they belong to the same distribution, most commonly a normal (Gaussian) distribution. With a fixed number of distributions, data points are assigned to the distributions so that the likelihood of the data is maximized.
This model works well on synthetic data and on clusters of different sizes. As the distance from a distribution's center increases, the probability that a point belongs to that cluster (distribution) decreases.
However, this model can overfit if no limit is placed on its complexity. Distribution-based models also assume a specific mathematical model underlying the data, which is a strong assumption for some data distributions.
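A distribution-based model can be sketched with a Gaussian mixture fitted by expectation-maximization. This is an illustrative example assuming scikit-learn; after fitting, the estimated distribution parameters (means and mixing weights) can be read back directly.

```python
# Sketch of distribution-based clustering: fit a 2-component Gaussian
# mixture and inspect the estimated distribution parameters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
# Two Gaussian blobs with deliberately different spreads.
X = np.vstack([rng.normal([0, 0], 0.5, (60, 2)),
               rng.normal([4, 4], 1.5, (60, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)
# gm.means_   -> estimated centers of the two distributions
# gm.weights_ -> estimated mixing proportions (sum to 1)
```

Because each component has its own covariance, the differently sized clusters are modeled naturally, which is the strength noted above.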
Density-based clustering models connect data points into clusters based on density. These models search the data space for regions of varying density of data points and separate the regions accordingly. This allows clusters of arbitrary shape, provided the dense areas are connected. These algorithms do not assign outliers to any cluster.
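The density-based behavior, including the treatment of outliers, can be sketched with DBSCAN. This is an illustrative example assuming scikit-learn; the `eps` and `min_samples` values are chosen for this toy data and usually need tuning in practice.

```python
# Sketch of density-based clustering: dense regions become clusters,
# while isolated points are marked as noise (label -1).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (50, 2)),
               rng.normal(5, 0.3, (50, 2)),
               [[2.5, 2.5]]])  # one lone point between the two blobs

labels = DBSCAN(eps=0.6, min_samples=5).fit_predict(X)
```

The two dense blobs each form a cluster, while the isolated point receives the noise label `-1` rather than being forced into either cluster.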