Clustering#

Clustering is the process of grouping the data points that are similar to each other. It’s an unsupervised learning technique.

How KMeans works (intuition):#

  1. Select K random points as the initial centroids.

  2. Assign every point to its closest centroid, then recalculate each centroid as the mean of the points assigned to it.

  3. Repeat till the centroids converge (a minimal sketch of these steps follows the list).
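A minimal from-scratch sketch of these steps, using NumPy on a generic 2-D feature matrix (it ignores edge cases such as empty clusters, so it is illustrative rather than production code):

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. select K random points as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # 2. assign each point to its nearest centroid...
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # ...then recompute each centroid as the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # 3. repeat until the centroids stop moving
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels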

Random initialization problem#

  1. Random initialization doesn't always work well: a bad choice of starting centroids can lead to a poor final clustering.

  2. To solve this problem, use k-means++ initialization (see the snippet below), which spreads the initial centroids apart before the usual iterations begin.
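In scikit-learn the initialization strategy is just the init parameter of KMeans; a quick comparison sketch (nothing is fitted here, it only shows the two options):

from sklearn.cluster import KMeans

# plain random initialization: final quality depends heavily on the starting centroids
km_random = KMeans(n_clusters=5, init='random', n_init=10, random_state=0)

# k-means++ initialization: spreads the initial centroids apart before iterating
km_pp = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=0)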

Selecting number of clusters (Elbow method)#

Within-Cluster Sum of Squares (WCSS) is the sum of squared distances from every point to its nearest cluster centroid. For K = 1 it is very high; as K increases it keeps decreasing, and it reaches 0 when every point is its own cluster. We have to make a call on the best value of K by looking at how the curve flattens as K grows: once adding another cluster no longer reduces WCSS by a substantial amount, that is the ideal number of clusters.
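Formally, if the clusters are \(C_1, \dots, C_K\) with centroids \(\mu_1, \dots, \mu_K\), the quantity plotted is

\[ \text{WCSS} = \sum_{k=1}^{K} \sum_{x \in C_k} \lVert x - \mu_k \rVert^2 \]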

In short: we pick the K where the WCSS curve bends the most, i.e. where its (discrete) second derivative is largest.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('Mall_Customers.csv')
X = df.iloc[:, [3, 4]].values  # Annual Income (k$) and Spending Score (1-100)
df.head()
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
from sklearn.cluster import KMeans

fig = plt.figure(figsize=(10, 8))
WCSS = []
for i in range(1, 11):
    clf = KMeans(n_clusters=i, init='k-means++', max_iter=300, n_init=10, random_state=0)
    clf.fit(X)
    WCSS.append(clf.inertia_) # inertia is another name for WCSS

plt.plot(range(1, 11), WCSS)
plt.title('The Elbow Method')
plt.ylabel('WCSS')
plt.xlabel('Number of Clusters')
plt.show()
[Output figure: "The Elbow Method" plot of WCSS versus number of clusters]
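The "maximize the second derivative" idea can also be applied numerically to the WCSS list computed above; a small sketch (the exact elbow remains a judgement call, so treat this as a sanity check rather than a rule):

second_diff = np.diff(WCSS, n=2)           # discrete second difference of the WCSS curve
elbow_k = int(np.argmax(second_diff)) + 2  # +2 maps the diff index back to the corresponding K
print(f"Estimated elbow at K = {elbow_k}")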
clf = KMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10,  random_state=0)
y_kmeans = clf.fit_predict(X)
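For later comparison with the Minibatch results, the clusters found by plain KMeans can be visualized with the same scatter-plot pattern used further below; a condensed sketch:

fig = plt.figure(figsize=(10, 8))
for cluster, color in enumerate(['red', 'green', 'blue', 'yellow', 'cyan']):
    plt.scatter(X[y_kmeans == cluster, 0], X[y_kmeans == cluster, 1],
                color=color, s=60, edgecolors='black', label=f'Cluster {cluster + 1}')
plt.scatter(clf.cluster_centers_[:, 0], clf.cluster_centers_[:, 1],
            color='magenta', s=100, edgecolors='black', label='Centroid')
plt.legend()
plt.title('Clusters using KMeans')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()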

Problem in KMeans#

While KMeans is a good algorithm, its time complexity is poor. KMeans runs in \(O(n \cdot K \cdot I \cdot f)\), where n is the number of records, K the number of clusters, I the number of iterations, and f the number of features per record. On datasets with more than a few hundred thousand points this quickly becomes impractically slow.
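To get a feel for the numbers (purely illustrative values): with n = 100,000 records, K = 10 clusters, I = 300 iterations and f = 10 features, a single run already involves on the order of \(10^5 \cdot 10 \cdot 300 \cdot 10 = 3 \times 10^9\) distance terms, and scikit-learn repeats the whole run n_init times.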

Minibatch KMeans#

Main features of Minibatch KMeans are:

  1. Instead of using the entire dataset at once, it operates in batches.

  2. Uses a gradient-descent-style centroid update, which is much faster than recomputing every centroid over the entire dataset as KMeans does.

How it works: it takes a small batch of the data and finds centroids for that minibatch. For the next batch it starts from the centroids found so far and updates them with a gradient-descent-style step. Because each update only touches a small batch instead of the whole dataset, this makes the algorithm dramatically faster on large inputs; a rough sketch of the per-batch update follows.
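A minimal sketch of that per-batch update in NumPy (a simplification of what scikit-learn's MiniBatchKMeans actually does; counts tracks how many points each centroid has seen so its learning rate can shrink over time):

import numpy as np

def minibatch_kmeans_step(X_batch, centroids, counts):
    # assign each point in the batch to its nearest current centroid
    distances = np.linalg.norm(X_batch[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # nudge each assigned centroid towards its point with a per-centroid learning rate
    for x, j in zip(X_batch, labels):
        counts[j] += 1
        lr = 1.0 / counts[j]  # learning rate shrinks as the centroid sees more points
        centroids[j] = (1 - lr) * centroids[j] + lr * x
    return centroids, counts

Calling this repeatedly on random batches of the data, starting from randomly chosen centroids, gives the full algorithm.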

Gotchas of using Minibatch KMeans#

  1. While the performance is good, the quality of the result may or may not be: it depends heavily on the initial conditions.

  2. The result may differ somewhat from what full KMeans produces on the same data.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv('Mall_Customers.csv')
X = df.iloc[:, [3, 4]].values
df.head()
   CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0           1    Male   19                  15                      39
1           2    Male   21                  15                      81
2           3  Female   20                  16                       6
3           4  Female   23                  16                      77
4           5  Female   31                  17                      40
from sklearn.cluster import MiniBatchKMeans

clf = MiniBatchKMeans(n_clusters=5, init='k-means++', max_iter=300, n_init=10,  random_state=0)
y_minikmeans = clf.fit_predict(X)
fig = plt.figure(figsize=(10, 8))
plt.scatter(X[y_minikmeans == 0, 0], X[y_minikmeans == 0, 1], color='red', s=60, label='Cluster 1', edgecolors='black')
plt.scatter(X[y_minikmeans == 1, 0], X[y_minikmeans == 1, 1], color='green', s=60, label='Cluster 2', edgecolors='black')
plt.scatter(X[y_minikmeans == 2, 0], X[y_minikmeans == 2, 1], color='blue',s=60, label='Cluster 3', edgecolors='black')
plt.scatter(X[y_minikmeans == 3, 0], X[y_minikmeans == 3, 1], color='yellow', s=60, label='Cluster 4', edgecolors='black')
plt.scatter(X[y_minikmeans == 4, 0], X[y_minikmeans == 4, 1], color='cyan', s=60, label='Cluster 5', edgecolors='black')
# cluster centres
plt.scatter(clf.cluster_centers_[:, 0], clf.cluster_centers_[:, 1], color='magenta', s=100, label='Centroid',edgecolors='black')
plt.legend()
plt.title('Clusters using Minibatch KMeans')
plt.ylabel('Spending Score (1-100)')  # X[:, 1] is the Spending Score column
plt.xlabel('Annual Income (k$)')      # X[:, 0] is the Annual Income column
plt.show()
[Output figure: "Clusters using Minibatch KMeans" scatter plot with five clusters and their centroids]

Types of Hierarchical Clustering:#

  1. Agglomerative: Bottom-up approach. Initially, each point is its own cluster; clusters are then merged step by step.

  2. Divisive: Top-down approach. Initially, all points form a single cluster, which is then split recursively.

Agglomerative Clustering:#

  1. Make each point a single cluster.

  2. Take the two closest clusters and merge them into one cluster.

  3. Repeat step 2 till only one cluster is left.

While choosing the closest clusters, there are multiple linkage criteria to choose from:

  1. Single linkage: distance between the two closest points of the clusters.

  2. Average linkage: average distance between all pairs of points.

  3. Centroid linkage: distance between the cluster centroids.

  4. Complete linkage: distance between the farthest points, etc.

All the merge information is stored in a data structure called a Dendrogram, where you can set a distance threshold and cut the tree to get the required number of clusters; a short sketch using scipy follows.
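A minimal sketch of agglomerative clustering on the same Mall_Customers feature matrix X used above, building and plotting the dendrogram with scipy:

import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# build the full merge history; 'ward' merges the pair of clusters that
# increases the total within-cluster variance the least
Z = linkage(X, method='ward')

fig = plt.figure(figsize=(10, 8))
dendrogram(Z)
plt.title('Dendrogram')
plt.xlabel('Data points')
plt.ylabel('Merge distance')
plt.show()

Cutting the tree at a chosen height (for example with scipy.cluster.hierarchy.fcluster, or with sklearn's AgglomerativeClustering and a fixed n_clusters) then yields the final cluster labels.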

Image Quantization#

Image Quantization is a lossy compression technique that reduces the number of distinct colors used in an image, which can shrink the file size significantly.

We can use KMeans clustering to compress the image: cluster the pixel color values into 5 clusters, then redraw the image using only those 5 colors.

import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(14, 10))
image = plt.imread("../images/naruto.png")
plt.title("Original Image")
plt.imshow(image)
plt.show()
[Output figure: "Original Image"]
# The image array has shape (height, width, channels); this PNG has 4 channels (RGBA).
# Flatten it to 2D so that each row is one pixel's color vector.
x, y, z = image.shape
image = image.reshape(x*y, z)
image.shape
(291500, 4)
from sklearn.cluster import KMeans

clf = KMeans(n_clusters=5)
clf.fit(image)
KMeans(n_clusters=5)

Now cluster_centers_ holds 5 representative colors, and we'll use only those colors to redraw the entire image. The labels_ array assigns every pixel to a cluster, so indexing cluster_centers_ with labels_ gives each pixel its cluster's color.

centers = clf.cluster_centers_
labels = clf.labels_
fig = plt.figure(figsize=(14, 10))
plt.title("Compressed Image")
plt.imshow(centers[labels].reshape(x, y, z))
plt.show()
[Output figure: "Compressed Image", redrawn with 5 colors]

Outliers#

Not all data can be trusted: datasets often contain erroneous points, introduced for example by human or conversion errors, and such points are called Outliers. Certain models are sensitive to outliers (e.g. Linear Regression), so their presence will degrade the performance of your machine learning model.

Outlier removal#

Outlier removal is the process of removing such points to make your model better. One application of KMeans clustering is outlier removal. We can use the following algorithm to achieve that:

  1. Cluster the data using KMeans.

  2. Find the \(n\) points farthest from their cluster centroids.

  3. If any of those \(n\) points are significantly far from their centroid, remove them.

# create a fake dataset 
import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import make_blobs


X, y = make_blobs(100, centers=1)  # 100 points drawn from a single blob
# visualize the data

fig = plt.figure(figsize=(10, 8))
def plot_clusters(X, y):
    for label in np.unique(y):
        plt.scatter(X[y==label][:, 0], X[y==label][:, 1], label=label)
    plt.legend()
    plt.xlabel("Dimension 1")
    plt.ylabel("Dimension 1")
    plt.title("Fake Data")
plot_clusters(X, y)
[Output figure: "Fake Data" scatter plot of the generated blob]

The next step is to use KMeans to identify the cluster centroid. The mathematician inside you might be inclined to ask why not just find the centroid directly by averaging the x and y values. In our single-blob example that would work, but imagine having 5 clusters: we don't know the cluster assignments beforehand, which is exactly why we use KMeans!

from sklearn.cluster import KMeans

clf = KMeans(n_clusters=1)
clf.fit(X)
centroids = clf.cluster_centers_
fig = plt.figure(figsize=(10, 8))
plt.scatter(centroids[0][0], centroids[0][1], label="Centroid", color="magenta")
plot_clusters(X, y)
[Output figure: the blob with its KMeans centroid marked in magenta]
distances = clf.transform(X)  # distance of every point to each cluster center
fig = plt.figure(figsize=(10, 8))

plt.scatter(centroids[0][0], centroids[0][1], label="Centroid", color="magenta")

sorted_indices = list(reversed(np.argsort(distances.ravel())))[:5]  # indices of the 5 points farthest from the centroid
plot_clusters(X, y)

plt.scatter(X[sorted_indices][:, 0], X[sorted_indices][:, 1], color="red", edgecolor="green", 
            s=100, label="Extreme Values")
plt.legend()
plt.show()
[Output figure: the blob with the 5 extreme values highlighted]

We can treat the 5 extreme points as outliers. Now we'll remove them and refit the model.

X = np.delete(X, sorted_indices, axis=0) # important to mention axis=0
y = np.delete(y, sorted_indices, axis=0)
clf = KMeans(n_clusters=1)
clf.fit(X)

fig = plt.figure(figsize=(10, 8))
centroids = clf.cluster_centers_
plt.scatter(centroids[0][0], centroids[0][1], label="Centroid", color="magenta")
plot_clusters(X, y)
[Output figure: the cleaned data with the recomputed centroid]