The following article was written by Naina Chaturvedi in 2021 on Medium and can be read in full at https://medium.datadriveninvestor.com/cluster-analysis-using-python-part-1-4ceee387d79a
Clustering is a technique for dividing a population or set of data points into groups (clusters), so that points in the same cluster are more similar to each other than to points in other clusters. It helps uncover the intrinsic groupings in unlabeled data.
Applications of Clustering:
- Market Segmentation: grouping people who have the same purchasing behaviour, discovering new customer segments for marketing, etc.
- News: grouping related news stories together
- Search Engines: grouping similar results
- Social Network Analysis
- Image Segmentation
- Anomaly detection
- Insurance fraud detection, etc.
There are various clustering techniques/methods, such as:
1. Partition Clustering: Also known as the centroid-based method. The intuition is that each cluster is characterized and represented by a central vector (the centroid), and each data point is assigned to the cluster whose centroid is closest to it. The distance can be measured in several ways (Manhattan, Euclidean or Minkowski distance), and the cluster centers are calculated so that points lie closer to their own cluster's centroid than to any other centroid.
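K-means is a widely used partitioning method and can be summarized as:
i. Choose the number of clusters K and divide the data into K initial groups
ii. Find the centroid of each group in the current partition
iii. Calculate the distance from each point to every centroid
iv. Assign each point to the group whose centroid is nearest
v. After re-grouping/re-allotting the points, find the new centroid of each cluster
vi. Repeat steps iii to v until the assignments stop changing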
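The K-means steps above can be sketched in a few lines of plain Python. This is only a minimal illustration, not a production implementation: the function name `kmeans`, the round-robin starting partition and the sample points are all assumptions made for the example, and Euclidean distance is used. The loop keeps re-assigning points and recomputing centroids until the assignments stop changing.

```python
import math


def kmeans(points, k, max_iter=100):
    """Minimal k-means sketch for points given as tuples (needs len(points) >= k)."""
    # Step i: start from an arbitrary partition into k groups. Round-robin is
    # an assumption chosen here only to keep the example deterministic.
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Steps ii and v: the centroid is the coordinate-wise mean of each group.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty group keeps its previous centroid
                centroids[c] = tuple(sum(xs) / len(members) for xs in zip(*members))
        # Steps iii and iv: assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # nothing moved, so the partition is stable
            break
        labels = new_labels
    return labels, centroids


# Made-up sample data: two visibly separated groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Note that K-means can settle on different partitions depending on where it starts, so in practice it is usually run from several starting points.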
2. Density-Based Clustering: It takes density into consideration instead of distance to a center. It identifies regions of the dataset where points are densely packed and connects these high-density areas into clusters, which allows clusters of arbitrary shape. Points in sparse regions are typically treated as noise.
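To make the density idea concrete, here is a small sketch in the style of DBSCAN, one well-known density-based algorithm (the article itself does not name a specific one). The function name `dbscan`, the parameters `eps` (neighbourhood radius) and `min_pts` (minimum neighbours for a dense point), and the sample data are assumptions chosen for illustration.

```python
import math


def dbscan(points, eps, min_pts):
    """Return a cluster id (0, 1, ...) per point, or -1 for noise."""

    def neighbours(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1  # point i is dense enough to start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a dense point joins as border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbours(j)
            if len(nbrs) >= min_pts:  # j is itself dense: keep expanding through it
                queue.extend(nbrs)
    return labels


# Made-up data: two dense squares and one isolated point.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11),
        (50, 50)]
density_labels = dbscan(data, eps=1.5, min_pts=3)
print(density_labels)
```

Unlike K-means, the number of clusters is not given up front; it falls out of the density parameters, and the isolated point is reported as noise.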
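3. Hierarchical Clustering: It’s a technique in which the dataset is organized into nested clusters that form a tree-like structure (a dendrogram). There are two approaches: Agglomerative (bottom-up), which starts with every point as its own cluster and repeatedly merges the closest clusters, and Divisive (top-down), which starts with one cluster and repeatedly splits it.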
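The agglomerative (bottom-up) approach can be sketched as follows. This is a simplified illustration that assumes single linkage (the distance between two clusters is the distance between their closest members); the function name `agglomerative`, the stopping rule `n_clusters` and the sample points are assumptions for the example.

```python
import math


def agglomerative(points, n_clusters):
    """Bottom-up clustering; returns clusters (lists of point indices) and the merge history."""
    clusters = [[i] for i in range(len(points))]  # every point starts as its own cluster

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    merges = []  # order of merges, i.e. the tree read from the leaves upward
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest linkage distance.
        a, b = min(
            ((x, y) for x in range(len(clusters)) for y in range(x + 1, len(clusters))),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        merges.append((clusters[a], clusters[b]))
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters, merges


# Made-up data: two close pairs and one distant point.
pts = [(0, 0), (0, 1), (5, 5), (5, 6.5), (20, 20)]
clusters, merges = agglomerative(pts, n_clusters=3)
print(clusters)
print(merges)
```

Running the loop all the way down to a single cluster would give the full tree; stopping early, as here, corresponds to cutting the tree at a chosen level.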
4. Distribution Model-Based Clustering: It’s a technique which uses probability as its metric. The data points are grouped based on their likelihood of belonging to the same probability distribution (Gaussian and binomial distributions are commonly used).
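As an illustration of the probabilistic view, the sketch below fits a mixture of one-dimensional Gaussians with the expectation-maximization (EM) procedure, a common way to fit such models. It is a simplified example under stated assumptions: the function names, the initial means, the fixed number of iterations and the sample data are all made up for illustration.

```python
import math


def gaussian_pdf(x, mean, var):
    # Density of a normal distribution with the given mean and variance.
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, init_means, n_iter=50):
    """Fit a 1-D Gaussian mixture with EM, starting from the given means."""
    k = len(init_means)
    means = list(init_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        # E-step: probability that each point was generated by each component.
        resp = []
        for x in data:
            dens = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: re-estimate each component from its weighted points.
        for c in range(k):
            rc = sum(r[c] for r in resp)
            weights[c] = rc / len(data)
            means[c] = sum(r[c] * x for r, x in zip(resp, data)) / rc
            var = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, data)) / rc
            variances[c] = max(var, 1e-6)  # guard against a collapsing variance
    return weights, means, variances, resp


# Made-up data drawn around 1.0 and 5.0.
values = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.2, 4.8, 5.1, 4.9]
weights, means, variances, resp = fit_gmm_1d(values, init_means=[0.0, 6.0])
hard_labels = [row.index(max(row)) for row in resp]  # most likely component per point
print(means)
print(hard_labels)
```

Each point gets a probability for every component; taking the most likely component turns these probabilities into ordinary cluster labels.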
5. Fuzzy Clustering: It’s a technique in which a data point can belong to more than one cluster, with a degree of membership in each. It’s used where clusters are likely to overlap heavily, such as in biometrics, image segmentation, etc.
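The idea of partial membership can be sketched with a one-dimensional version of fuzzy c-means, a commonly used fuzzy clustering algorithm (the article does not name a specific one). The function name, the fuzziness parameter `m` (which must be greater than 1), the starting centers and the data are assumptions for illustration only.

```python
def fuzzy_c_means_1d(data, init_centers, m=2.0, n_iter=50):
    """Return cluster centers and, per point, a membership degree in each cluster."""
    centers = list(init_centers)
    memberships = []
    for _ in range(n_iter):
        # Membership update: closer centers get a larger share; each row sums to 1.
        memberships = []
        for x in data:
            dists = [max(abs(x - c), 1e-12) for c in centers]  # avoid division by zero
            row = [
                1.0 / sum((d_c / d_j) ** (2.0 / (m - 1.0)) for d_j in dists)
                for d_c in dists
            ]
            memberships.append(row)
        # Center update: mean of the data weighted by membership ** m.
        for c in range(len(centers)):
            w = [row[c] ** m for row in memberships]
            centers[c] = sum(wi * x for wi, x in zip(w, data)) / sum(w)
    return centers, memberships


# Made-up data: two groups plus one point (3.0) sitting exactly between them.
samples = [1.0, 1.2, 0.8, 3.0, 5.0, 5.2, 4.8]
centers, memberships = fuzzy_c_means_1d(samples, init_centers=[0.0, 6.0])
print(centers)
print([round(u, 3) for u in memberships[3]])  # memberships of the in-between point
```

Points near a center belong almost entirely to that cluster, while the in-between point is shared roughly equally between the two, which is the behaviour that distinguishes fuzzy clustering from a hard assignment.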