k-means with Three Different Distance Metrics by Using Feature Extraction

We will apply feature extraction to the Iris data set and compare three different distance metrics. The appeal of the k-means algorithm lies in its computational efficiency and ease of use. Distance metrics measure how similar data objects are, which underpins data mining tasks such as classification and clustering.

DISTANCE METRICS

1. Basic Euclidean Distance Metric:

Euclidean distance is the square root of the sum of squared differences between the coordinates of a pair of objects.

$Distance=\sqrt{\sum_{k=1}^m \left(x_{ik}-x_{jk}\right)^2}$
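As a quick check of the formula, here is a minimal sketch that uses only the standard library. The function name `euclidean_distance` and the two sample points are illustrative assumptions, not part of the notebook.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared coordinate differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Two hypothetical (petal_length, petal_width) points.
p = (1.4, 0.2)
q = (4.4, 4.2)
d_pq = euclidean_distance(p, q)

# Distance of p from the origin, i.e. sqrt(petal_length**2 + petal_width**2).
d_origin = euclidean_distance(p, (0.0, 0.0))
print(d_pq, d_origin)
```

The second value is the quantity stored in the `euclidean` column for a flower with petal length 1.4 and petal width 0.2.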

import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
# Extracted feature: Euclidean distance of (petal_length, petal_width) from the origin
df['euclidean']=np.sqrt(df.petal_length**2+df.petal_width**2)
df.head()

1.1. Find the cluster center:

Each data point is assigned to the cluster whose center is nearest to it. We assume 3 clusters. The center of a cluster with N points is the mean of its points: $Center (x,y)=\left(\frac{1}{N}\sum_{i=1}^N x_i,\ \frac{1}{N}\sum_{i=1}^N y_i\right)$
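The center formula and the nearest-center rule can be sketched in a few lines of plain Python. The helper names (`centroid`, `nearest_center`) and the small point lists are hypothetical, chosen only to mimic (petal_width, petal_length) pairs.

```python
import math

def centroid(points):
    """Coordinate-wise mean of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(len(points[0])))

def nearest_center(point, centers):
    """Index of the center with the smallest Euclidean distance to point."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centers)), key=lambda c: dist(point, centers[c]))

# Hypothetical (petal_width, petal_length) points for three groups.
groups = [
    [(0.2, 1.4), (0.2, 1.3), (0.4, 1.5)],
    [(1.3, 4.0), (1.5, 4.5)],
    [(2.0, 5.5), (2.4, 6.0)],
]
centers = [centroid(g) for g in groups]
assigned = nearest_center((1.6, 4.8), centers)
print(centers, assigned)
```

Here the new point (1.6, 4.8) is closest to the second center, so it receives index 1.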

plt.plot(df.iloc[:,3:4],df.iloc[:,2:3],"o",color='black')
plt.annotate('Setosa', xy=(0.6,1.5), xytext=(1,1.55),
            arrowprops=dict(facecolor='blue', shrink=0.15),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="b", ec="m", alpha=1)
plt.text(df[df.species=='setosa']['petal_width'].mean(), df[df.species=='setosa']['petal_length'].mean(), "setosa", ha="center", va="center", size=2,bbox=bbox_Circle)
plt.annotate('versicolor', xy=(1.7,4.2), xytext=(2,4.2),
            arrowprops=dict(facecolor='orange', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="orange", ec="m", alpha=1)
plt.text(df[df.species=='versicolor']['petal_width'].mean(), df[df.species=='versicolor']['petal_length'].mean(), "versicolor", ha="center", va="center", size=2,bbox=bbox_Circle)
plt.annotate('virginica', xy=(1.5,5.7), xytext=(0.8,5.7),
            arrowprops=dict(facecolor='g', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="g", ec="m", alpha=1)
plt.text(df[df.species=='virginica']['petal_width'].mean(), df[df.species=='virginica']['petal_length'].mean(), "virginica", ha="center", va="center", size=2,bbox=bbox_Circle)
plt.show()

The three clusters are plotted in three different colors.

# Use the 'hue' argument to provide a factor variable
g=sns.lmplot( x="petal_width", y="petal_length", data=df, fit_reg=False, hue='species', legend=False)
plt.annotate('Setosa', xy=(0.6,1.5), xytext=(1,1.55),
            arrowprops=dict(facecolor='blue', shrink=0.15),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="blue", ec="b", alpha=1)
plt.text(df[df.species=='setosa']['petal_width'].mean(), df[df.species=='setosa']['petal_length'].mean(), "setosa", ha="center", va="center", size=5,bbox=bbox_Circle)
plt.annotate('versicolor', xy=(1.7,4.2), xytext=(2,4.2),
            arrowprops=dict(facecolor='orange', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="orange", ec="b", alpha=1)
plt.text(df[df.species=='versicolor']['petal_width'].mean(), df[df.species=='versicolor']['petal_length'].mean(), "versicolor", ha="center", va="center", size=5,bbox=bbox_Circle)
plt.annotate('virginica', xy=(1.5,5.7), xytext=(0.8,5.7),
            arrowprops=dict(facecolor='g', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="green", ec="b", alpha=1)
plt.text(df[df.species=='virginica']['petal_width'].mean(), df[df.species=='virginica']['petal_length'].mean(), "virginica", ha="center", va="center", size=5,bbox=bbox_Circle)

# Move the legend to an empty part of the plot
plt.legend(loc='lower right')


# Plot the Euclidean feature against the row index to see the clusters.
df['x'] = range(1, len(df) + 1)
g=sns.lmplot( x='x', y="euclidean", data=df, fit_reg=False, hue='species', legend=False)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="b", ec="b", alpha=1)
plt.text(df[df.species=='setosa']['x'].mean(), df[df.species=='setosa']['euclidean'].mean(), "setos", ha="center", va="center", size=5,bbox=bbox_Circle)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="orange", ec="b", alpha=1)
plt.text(df[df.species=='versicolor']['x'].mean(), df[df.species=='versicolor']['euclidean'].mean(), "versic", ha="center", va="center", size=5,bbox=bbox_Circle)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="g", ec="b", alpha=1)
plt.text(df[df.species=='virginica']['x'].mean(), df[df.species=='virginica']['euclidean'].mean(), "virgin", ha="center", va="center", size=5,bbox=bbox_Circle)

plt.annotate("",
              xy=(0, 5.345), xycoords='data',
              xytext=(150, 5.345), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate("",
              xy=(0, 2.5), xycoords='data',
              xytext=(150, 2.5), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate('8 points should be in virginica region instead of versicolor region', xy=(128, 4.7), xytext=(130, 3.5),
            arrowprops=dict(facecolor='blue', shrink=0.15),
            )
bbox_props = dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9)
plt.text(100, 1.5, "setosa", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 4, "versicolor", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 6.5, "virginica", ha="center", va="center", size=15,bbox=bbox_props)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="m", ec="m", alpha=0.25)
plt.text(126, 4.6, "virginica", ha="center", va="center", size=15,bbox=bbox_Circle)
plt.show()

# Create 3 bins (one per cluster, k=3) from the per-species maxima to build the confusion matrix
bins = [0, df[df.species=='setosa']['euclidean'].max(), df[df.species=='versicolor']['euclidean'].max(), 
        df[df.species=='virginica']['euclidean'].max()]
df['binned'] = pd.cut(df['euclidean'], bins)
df = df.copy()
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])
df.head()

confusion_matrix=df.groupby(['species']).sum().iloc[:,6:9]
confusion_matrix.columns = ['setosa_predict','versicolor_predict', 'virginica_predict']
confusion_matrix
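To make the binning step easier to follow, the sketch below reproduces the same idea without pandas: the largest feature value of each species becomes the upper edge of its bin, and each sample is counted in the bin its value falls into. The sample values are hypothetical, and `bisect_left` is used here as a stand-in for the right-closed intervals that `pd.cut` produces by default.

```python
from bisect import bisect_left

# Hypothetical (species, feature value) pairs, not taken from the Iris file.
samples = [
    ("setosa", 1.41), ("setosa", 1.32), ("setosa", 1.51),
    ("versicolor", 4.2), ("versicolor", 4.9), ("versicolor", 5.3),
    ("virginica", 5.1), ("virginica", 6.2), ("virginica", 6.9),
]
species = ["setosa", "versicolor", "virginica"]

# Upper edge of each bin is the largest value seen for that species.
upper = [max(v for s, v in samples if s == name) for name in species]

# counts[true species][predicted bin]; bisect_left gives right-closed bins (a, b].
counts = {name: [0, 0, 0] for name in species}
for name, value in samples:
    predicted = bisect_left(upper[:-1], value)
    counts[name][predicted] += 1

for name in species:
    print(name, counts[name])
```

In this toy data the single virginica value that lies below the versicolor maximum is counted in the versicolor bin, which is the kind of overlap the confusion matrix is meant to expose.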

2. Manhattan Distance Metric

Manhattan distance is the sum of the absolute differences between the coordinates of a pair of objects.

$Distance=\sum_{k=1}^m \left|x_{ik}-x_{jk}\right|$
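A minimal sketch of the Manhattan distance between two points, again with only the standard library. The function name `manhattan_distance` and the sample points are illustrative assumptions.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences between two points."""
    return sum(abs(x - y) for x, y in zip(a, b))

# Two hypothetical (petal_length, petal_width) points.
p = (1.4, 0.2)
q = (4.4, 4.2)
d_manhattan = manhattan_distance(p, q)
print(d_manhattan)
```

For these two points the coordinate differences are 3 and 4, so the Manhattan distance is about 7, compared with a Euclidean distance of 5 for the same pair.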

df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
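# Note: this feature is |petal_length - petal_width|, the absolute difference between the two columns.
# The Manhattan distance of (petal_length, petal_width) from the origin would instead be their sum.
df['Manhattan']=np.abs(df.petal_length-df.petal_width)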
df.head()

# Plot the Manhattan feature against the row index to see the clusters.
df['x'] = range(1, len(df) + 1)
g=sns.lmplot( x='x', y="Manhattan", data=df, fit_reg=False, hue='species', legend=False)
plt.annotate("",
              xy=(0, 3.5), xycoords='data',
              xytext=(150, 3.5), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate("",
              xy=(0, 1.8), xycoords='data',
              xytext=(150, 1.8), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )

bbox_props = dict(boxstyle="round", fc="w", ec="0.5", alpha=0.25)
plt.text(100, 1.4, "setosa", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 2.5, "versicolor", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 4.3, "virginica", ha="center", va="center", size=15,bbox=bbox_props)
plt.show()

# Create 3 bins (one per cluster, k=3) to build the confusion matrix
bins = [0, df[df.species=='setosa']['Manhattan'].max(), df[df.species=='versicolor']['Manhattan'].max(), 
        df[df.species=='virginica']['Manhattan'].max()]
df['binned'] = pd.cut(df['Manhattan'], bins)
df = df.copy()
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])
df.head()

confusion_matrix=df.groupby(['species']).sum().iloc[:,6:9]
confusion_matrix.columns = ['setosa_predict','versicolor_predict', 'virginica_predict']
confusion_matrix

3. Minkowski Distance Metric

Minkowski distance is a generalized distance metric with an order parameter p.

$Distance=\left(\sum_{k=1}^m \left|x_{ik}-x_{jk}\right|^p\right)^{1/p}$

Note that when p=2, it becomes the Euclidean distance, and when p=1 it becomes the city block (Manhattan) distance. Chebyshev distance is the limiting case of Minkowski distance as $p \to \infty$. This distance can be used for both ordinal and quantitative variables.
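The special cases can be checked numerically. The sketch below is a plain-Python illustration; the function names `minkowski_distance` and `chebyshev_distance` and the sample points are assumptions made for the example.

```python
def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p >= 1) between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def chebyshev_distance(a, b):
    """Largest absolute coordinate difference (the p -> infinity limit)."""
    return max(abs(x - y) for x, y in zip(a, b))

a = (0.0, 0.0)
b = (3.0, 4.0)
d1 = minkowski_distance(a, b, 1)    # city block (Manhattan)
d2 = minkowski_distance(a, b, 2)    # Euclidean
d5 = minkowski_distance(a, b, 5)    # the order used for the Minkowski column
d50 = minkowski_distance(a, b, 50)  # large p approaches Chebyshev
d_inf = chebyshev_distance(a, b)
print(d1, d2, d5, d50, d_inf)
```

As p grows, the value shrinks from the Manhattan distance toward the largest single coordinate difference, which is why a large order such as p=5 is dominated by the larger of the two features.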

df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
#for p=5
df['Minkowski']=(df.petal_length**5+df.petal_width**5)**(1/5)
df.head()

# Plot the Minkowski feature against the row index to see the clusters.
df['x'] = range(1, len(df) + 1)
g=sns.lmplot( x='x', y="Minkowski", data=df, fit_reg=False, hue='species', legend=False)
plt.annotate("",
              xy=(0, 5.151961423363787), xycoords='data',
              xytext=(150, 5.151961423363787), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate("",
              xy=(0, 2.5), xycoords='data',
              xytext=(150, 2.5), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )

bbox_props = dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9)
plt.text(100, 1.5, "setosa", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 4, "versicolor", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 6.5, "virginica", ha="center", va="center", size=15,bbox=bbox_props)


# Create 3 bins (one per cluster, k=3) to build the confusion matrix
bins = [0, df[df.species=='setosa']['Minkowski'].max(), df[df.species=='versicolor']['Minkowski'].max(), 
        df[df.species=='virginica']['Minkowski'].max()]
df['binned'] = pd.cut(df['Minkowski'], bins)
df = df.copy()
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])
df.head()

confusion_matrix=df.groupby(['species']).sum().iloc[:,6:9]
confusion_matrix.columns = ['setosa_predict','versicolor_predict', 'virginica_predict']
confusion_matrix

4. CONCLUSION

k-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the sum of squared distances within each cluster. Implementing it with three different distance metrics shows that the choice of metric plays a very important role in clustering, so it should be made carefully. The distortion in k-means using the Manhattan distance metric is less than that of k-means using the Euclidean distance metric. In conclusion, k-means with the Euclidean distance metric gives the best result, and k-means with the Manhattan distance metric performs worst.
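For reference, the assign-then-update loop that k-means describes can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the cells above: the function names, the toy points, and the choice of initial centers are all hypothetical, and the metric is a Minkowski distance of order p so that p=1 and p=2 can be compared. Note that updating centers with the mean is the exact minimizer only for the squared Euclidean objective; with other metrics it is a common simplification.

```python
def minkowski(a, b, p=2):
    """Minkowski distance of order p between two points."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

def mean_point(points):
    """Coordinate-wise mean of a list of equally sized tuples."""
    n = len(points)
    return tuple(sum(pt[k] for pt in points) / n for k in range(len(points[0])))

def kmeans(points, centers, p=2, max_iter=100):
    """Assign each point to its nearest center, then recompute the means."""
    centers = list(centers)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: minkowski(pt, centers[c], p))
            for pt in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centers)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean_point(members)
    return labels, centers

# Hypothetical (petal_width, petal_length) points, three per group.
points = [(0.2, 1.4), (0.2, 1.3), (0.4, 1.5),
          (1.3, 4.0), (1.5, 4.5), (1.4, 4.7),
          (2.0, 5.5), (2.4, 6.0), (2.1, 5.9)]
init = [points[0], points[3], points[6]]  # assumed starting centers

labels_l2, centers_l2 = kmeans(points, init, p=2)
labels_l1, centers_l1 = kmeans(points, init, p=1)
print(labels_l2, labels_l1)
```

On this well-separated toy data both metrics produce the same assignment; differences between metrics only show up when clusters overlap, as versicolor and virginica do in the Iris data.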

Output of df.head() after adding the euclidean column:

	sepal_length	sepal_width	petal_length	petal_width	species	euclidean
0	5.1	3.5	1.4	0.2	setosa	1.414214
1	4.9	3.0	1.4	0.2	setosa	1.414214
2	4.7	3.2	1.3	0.2	setosa	1.315295
3	4.6	3.1	1.5	0.2	setosa	1.513275
4	5.0	3.6	1.4	0.2	setosa	1.414214

Output of df.head() after adding the Minkowski column:

	sepal_length	sepal_width	petal_length	petal_width	species	Minkowski
0	5.1	3.5	1.4	0.2	setosa	1.400017
1	4.9	3.0	1.4	0.2	setosa	1.400017
2	4.7	3.2	1.3	0.2	setosa	1.300022
3	4.6	3.1	1.5	0.2	setosa	1.500013
4	5.0	3.6	1.4	0.2	setosa	1.400017