
k-means with Three Different Distance Metrics by Using Feature Extraction

In this project we apply feature extraction to the Iris data and compare three different distance metrics. The power of the k-means algorithm comes from its computational efficiency and its ease of use. Distance metrics are used to find similar data objects, which leads to robust algorithms for data mining tasks such as classification and clustering.

DISTANCE METRICS

1. Basic Euclidean Distance Metric

Euclidean distance computes the square root of the sum of squared differences between the coordinates of a pair of objects.

$Distance=\sqrt{\sum_{k=1}^m \left(x_{ik}-x_{jk}\right)^2}$
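As a minimal sketch of the formula above, the distance between two objects can be computed with NumPy. The helper name `euclidean_distance` and the second sample point are assumptions made for illustration; the first point is the petal length and width of the first Iris row.

```python
import numpy as np

def euclidean_distance(x_i, x_j):
    """Square root of the summed squared coordinate differences."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return np.sqrt(np.sum((x_i - x_j) ** 2))

a = [1.4, 0.2]   # petal_length, petal_width of the first Iris row
b = [4.4, 4.2]   # arbitrary second point (illustrative only)

d = euclidean_distance(a, b)              # coordinate differences are 3 and 4
origin_d = euclidean_distance(a, [0, 0])  # distance of the first row from the origin
print(d, origin_d)
```

The distance from the origin is the quantity stored in the `euclidean` column of the notebook (1.414214 for the first row).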

In [19]:
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
In [20]:
df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
# Extracted feature: Euclidean distance of (petal_length, petal_width) from the origin
df['euclidean']=np.sqrt(df.petal_length**2+df.petal_width**2)
df.head()
Out[20]:
sepal_length sepal_width petal_length petal_width species euclidean
0 5.1 3.5 1.4 0.2 setosa 1.414214
1 4.9 3.0 1.4 0.2 setosa 1.414214
2 4.7 3.2 1.3 0.2 setosa 1.315295
3 4.6 3.1 1.5 0.2 setosa 1.513275
4 5.0 3.6 1.4 0.2 setosa 1.414214

1.1. Find the cluster center:

Each data point is assigned to the cluster whose center is nearest to it, that is, the center at minimum distance among all cluster centers. We assume there are 3 clusters. The center of a cluster with $N$ points is the mean of its points: $Center(x,y)=\left(\frac{1}{N}\sum_{i=1}^N x_i,\ \frac{1}{N}\sum_{i=1}^N y_i\right)$
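The assignment rule and the center formula above can be sketched as two small NumPy functions. This is an illustration under stated assumptions, not the notebook's own method: the function names, the toy points and the initial centers are all made up, and the sketch assumes no cluster ever becomes empty. The mean update is the exact minimizer only for squared Euclidean distance; using it with other values of `p` is a simplification.

```python
import numpy as np

def assign_to_nearest(points, centers, p=2):
    """Return, for each point, the index of its nearest center (Minkowski-p distance)."""
    diff = np.abs(points[:, None, :] - centers[None, :, :])   # shape (n, k, m)
    dist = (diff ** p).sum(axis=2) ** (1.0 / p)               # shape (n, k)
    return dist.argmin(axis=1)

def update_centers(points, labels, k):
    """Center of each cluster = coordinate-wise mean of its member points."""
    return np.array([points[labels == c].mean(axis=0) for c in range(k)])

# Toy (petal_width, petal_length) pairs, illustrative only
points = np.array([[0.2, 1.4], [0.2, 1.5],
                   [1.3, 4.2], [1.5, 4.5],
                   [2.1, 5.8], [2.3, 6.0]])
centers = np.array([[0.3, 1.5], [1.4, 4.0], [2.0, 6.0]])  # assumed initial centers

for _ in range(10):                       # alternate assignment and update steps
    labels = assign_to_nearest(points, centers)
    centers = update_centers(points, labels, k=3)

print(labels)
print(centers)
```

Passing `p=1` or a larger `p` to `assign_to_nearest` changes only the assignment step, which is where the choice of distance metric enters.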

In [21]:
# Scatter plot of petal width (x) against petal length (y); each species' center (mean) is marked with a circle.
plt.plot(df['petal_width'],df['petal_length'],"o",color='black')
plt.annotate('Setosa', xy=(0.6,1.5), xytext=(1,1.55),
            arrowprops=dict(facecolor='blue', shrink=0.15),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="b", ec="m", alpha=1)
plt.text(df[df.species=='setosa'].petal_width.mean(), df[df.species=='setosa'].petal_length.mean(), "setosa", ha="center", va="center", size=2,bbox=bbox_Circle)
plt.annotate('versicolor', xy=(1.7,4.2), xytext=(2,4.2),
            arrowprops=dict(facecolor='orange', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="orange", ec="m", alpha=1)
plt.text(df[df.species=='versicolor'].petal_width.mean(), df[df.species=='versicolor'].petal_length.mean(), "versicolor", ha="center", va="center", size=2,bbox=bbox_Circle)
plt.annotate('virginica', xy=(1.5,5.7), xytext=(0.8,5.7),
            arrowprops=dict(facecolor='g', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="g", ec="m", alpha=1)
plt.text(df[df.species=='virginica'].petal_width.mean(), df[df.species=='virginica'].petal_length.mean(), "virginica", ha="center", va="center", size=2,bbox=bbox_Circle)
plt.show()

The three clusters are now plotted in three different colors.

In [22]:
# Use the 'hue' argument to color the points by species
g=sns.lmplot( x="petal_width", y="petal_length", data=df, fit_reg=False, hue='species', legend=False)
plt.annotate('Setosa', xy=(0.6,1.5), xytext=(1,1.55),
            arrowprops=dict(facecolor='blue', shrink=0.15),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="blue", ec="b", alpha=1)
plt.text(df[df.species=='setosa'].petal_width.mean(), df[df.species=='setosa'].petal_length.mean(), "setosa", ha="center", va="center", size=5,bbox=bbox_Circle)
plt.annotate('versicolor', xy=(1.7,4.2), xytext=(2,4.2),
            arrowprops=dict(facecolor='orange', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="orange", ec="b", alpha=1)
plt.text(df[df.species=='versicolor'].petal_width.mean(), df[df.species=='versicolor'].petal_length.mean(), "versicolor", ha="center", va="center", size=5,bbox=bbox_Circle)
plt.annotate('virginica', xy=(1.5,5.7), xytext=(0.8,5.7),
            arrowprops=dict(facecolor='g', shrink=0.1),
            )
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="green", ec="b", alpha=1)
plt.text(df[df.species=='virginica'].petal_width.mean(), df[df.species=='virginica'].petal_length.mean(), "virginica", ha="center", va="center", size=5,bbox=bbox_Circle)

# Move the legend to an empty part of the plot
plt.legend(loc='lower right')
Out[22]:
<matplotlib.legend.Legend at 0x20423a20748>
In [23]:
# Plot the Euclidean feature against the row number to see the clusters.
df['x']=np.arange(1, len(df)+1)
g=sns.lmplot( x='x', y="euclidean", data=df, fit_reg=False, hue='species', legend=False)
# Mark the mean of each species with a colored circle.
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="b", ec="b", alpha=1)
plt.text(df[df.species=='setosa']['x'].mean(), df[df.species=='setosa']['euclidean'].mean(), "setos", ha="center", va="center", size=5,bbox=bbox_Circle)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="orange", ec="b", alpha=1)
plt.text(df[df.species=='versicolor']['x'].mean(), df[df.species=='versicolor']['euclidean'].mean(), "versic", ha="center", va="center", size=5,bbox=bbox_Circle)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="g", ec="b", alpha=1)
plt.text(df[df.species=='virginica']['x'].mean(), df[df.species=='virginica']['euclidean'].mean(), "virgin", ha="center", va="center", size=5,bbox=bbox_Circle)

plt.annotate("",
              xy=(0, 5.345), xycoords='data',
              xytext=(150, 5.345), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate("",
              xy=(0, 2.5), xycoords='data',
              xytext=(150, 2.5), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate('8 points should be in virginica region instead of versicolor region', xy=(128, 4.7), xytext=(130, 3.5),
            arrowprops=dict(facecolor='blue', shrink=0.15),
            )
bbox_props = dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9)
plt.text(100, 1.5, "setosa", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 4, "versicolor", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 6.5, "virginica", ha="center", va="center", size=15,bbox=bbox_props)
bbox_Circle = dict(boxstyle="Circle,pad=0.3", fc="m", ec="m", alpha=0.25)
plt.text(126, 4.6, "virginica", ha="center", va="center", size=15,bbox=bbox_Circle)
plt.show()
In [27]:
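# Create 3 bins (one per cluster, k=3); the upper edge of each bin is the maximum of the feature for that species.
bins = [0, df[df.species=='setosa']['euclidean'].max(), df[df.species=='versicolor']['euclidean'].max(), 
        df[df.species=='virginica']['euclidean'].max()]
df['binned'] = pd.cut(df['euclidean'], bins)
# One indicator column per bin, so that summing by species gives the confusion matrix.
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])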
df.head()
Out[27]:
sepal_length sepal_width petal_length petal_width species euclidean x binned_(0.0, 1.942] binned_(1.942, 5.345] binned_(5.345, 7.273]
0 5.1 3.5 1.4 0.2 setosa 1.414214 1 1 0 0
1 4.9 3.0 1.4 0.2 setosa 1.414214 2 1 0 0
2 4.7 3.2 1.3 0.2 setosa 1.315295 3 1 0 0
3 4.6 3.1 1.5 0.2 setosa 1.513275 4 1 0 0
4 5.0 3.6 1.4 0.2 setosa 1.414214 5 1 0 0
In [29]:
confusion_matrix=df.groupby(['species']).sum().iloc[:,6:9]
confusion_matrix.columns = ['setosa_predict','versicolor_predict', 'virginica_predict']
confusion_matrix
Out[29]:
setosa_predict versicolor_predict virginica_predict
species
setosa 50 0 0
versicolor 0 50 0
virginica 0 8 42
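The binning and counting steps above can also be sketched with labeled bins and `pd.crosstab`, which avoids selecting the indicator columns by position. The six-row frame below is made-up toy data and the column name `predicted` is an assumption; only the technique mirrors the notebook.

```python
import pandas as pd

species_order = ["setosa", "versicolor", "virginica"]

# Toy data (illustrative values, not the Iris measurements)
toy = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor", "virginica", "virginica"],
    "euclidean": [1.4, 1.9, 3.5, 5.3, 5.0, 6.8],
})

# Bin edges: 0, then the per-species maximum of the feature
edges = [0] + [toy.loc[toy.species == s, "euclidean"].max() for s in species_order]

# Label each bin with the species whose maximum closes it
toy["predicted"] = pd.cut(toy["euclidean"], bins=edges, labels=species_order).astype(str)

# Rows: true species; columns: species implied by the bin
cm = pd.crosstab(toy["species"], toy["predicted"])
print(cm)
```

In this toy sample one virginica value (5.0) falls below the versicolor maximum (5.3), so it is counted in the versicolor column, which is the same kind of error the 8 virginica points show in the matrix above.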

2. Manhattan Distance Metric

Manhattan distance computes the sum of the absolute differences between the coordinates of a pair of objects.

$Distance=\sum_{k=1}^m \left|x_{ik}-x_{jk}\right|$
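A minimal sketch of the Manhattan distance between two objects, summing the absolute coordinate differences. The helper name `manhattan_distance` and the sample points are assumptions for illustration.

```python
import numpy as np

def manhattan_distance(x_i, x_j):
    """Sum of absolute coordinate differences between two objects."""
    x_i = np.asarray(x_i, dtype=float)
    x_j = np.asarray(x_j, dtype=float)
    return np.sum(np.abs(x_i - x_j))

a = [1.4, 0.2]   # illustrative point
b = [4.4, 4.2]   # illustrative point

d = manhattan_distance(a, b)   # coordinate differences are 3 and 4
print(d)
```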

In [35]:
df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
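# Extracted feature: absolute difference between petal_length and petal_width.
# Note: the p=1 counterpart of the 'euclidean' feature (distance from the origin) would be petal_length + petal_width.
df['Manhattan']=np.abs(df.petal_length-df.petal_width)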
df.head()
Out[35]:
sepal_length sepal_width petal_length petal_width species Manhattan
0 5.1 3.5 1.4 0.2 setosa 1.2
1 4.9 3.0 1.4 0.2 setosa 1.2
2 4.7 3.2 1.3 0.2 setosa 1.1
3 4.6 3.1 1.5 0.2 setosa 1.3
4 5.0 3.6 1.4 0.2 setosa 1.2
In [36]:
# Plot the Manhattan feature against the row number to see the clusters.
df['x']=np.arange(1, len(df)+1)
g=sns.lmplot( x='x', y="Manhattan", data=df, fit_reg=False, hue='species', legend=False)
plt.annotate("",
              xy=(0, 3.5), xycoords='data',
              xytext=(150, 3.5), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate("",
              xy=(0, 1.8), xycoords='data',
              xytext=(150, 1.8), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )

bbox_props = dict(boxstyle="round", fc="w", ec="0.5", alpha=0.25)
plt.text(100, 1.4, "setosa", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 2.5, "versicolor", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 4.3, "virginica", ha="center", va="center", size=15,bbox=bbox_props)
plt.show()
In [153]:
# Create 3 bins (one per cluster, k=3) to build the confusion matrix
bins = [0, df[df.species=='setosa']['Manhattan'].max(), df[df.species=='versicolor']['Manhattan'].max(), 
        df[df.species=='virginica']['Manhattan'].max()]
df['binned'] = pd.cut(df['Manhattan'], bins)
df = df.copy()
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])
df.head()
Out[153]:
sepal_length sepal_width petal_length petal_width species Manhattan x binned_(0.0, 1.7] binned_(1.7, 3.5] binned_(3.5, 4.7]
0 5.1 3.5 1.4 0.2 setosa 1.2 1 1 0 0
1 4.9 3.0 1.4 0.2 setosa 1.2 2 1 0 0
2 4.7 3.2 1.3 0.2 setosa 1.1 3 1 0 0
3 4.6 3.1 1.5 0.2 setosa 1.3 4 1 0 0
4 5.0 3.6 1.4 0.2 setosa 1.2 5 1 0 0
In [154]:
confusion_matrix=df.groupby(['species']).sum().iloc[:,6:9]
confusion_matrix.columns = ['setosa_predict','versicolor_predict', 'virginica_predict']
confusion_matrix
Out[154]:
setosa_predict versicolor_predict virginica_predict
species
setosa 50 0 0
versicolor 0 50 0
virginica 0 29 21

3. Minkowski Distance Metric

Minkowski distance is a generalized distance metric.

$Distance=\left(\sum_{k=1}^m \left|x_{ik}-x_{jk}\right|^p\right)^{1/p}$

Note that when p=2, the distance becomes the Euclidean distance. When p=1 it becomes the city block (Manhattan) distance. Chebyshev distance is the variant of Minkowski distance obtained in the limit $p\to\infty$. This distance can be used for both ordinal and quantitative variables.
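The special cases mentioned above can be checked with a short sketch. The helper name `minkowski_distance` and the sample points are assumptions for illustration; the limit case is handled by taking the largest absolute coordinate difference.

```python
import numpy as np

def minkowski_distance(x_i, x_j, p):
    """Minkowski distance of order p; p = np.inf gives the Chebyshev distance."""
    diff = np.abs(np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float))
    if np.isinf(p):
        return diff.max()              # limit p -> infinity
    return (diff ** p).sum() ** (1.0 / p)

a, b = [1.4, 0.2], [4.4, 4.2]          # illustrative points

d1 = minkowski_distance(a, b, 1)         # city block (Manhattan)
d2 = minkowski_distance(a, b, 2)         # Euclidean
d5 = minkowski_distance(a, b, 5)         # the order used in this section
dinf = minkowski_distance(a, b, np.inf)  # Chebyshev

# First Iris row measured from the origin with p=5, as in the 'Minkowski' column
first_row = minkowski_distance([1.4, 0.2], [0, 0], 5)
print(d1, d2, d5, dinf, first_row)
```

For a fixed pair of points the distance does not increase as p grows, so the values run from the city block distance down to the Chebyshev distance.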

In [41]:
df=pd.read_csv('https://raw.githubusercontent.com/uiuc-cse/data-fa14/gh-pages/data/iris.csv')
#for p=5
df['Minkowski']=(df.petal_length**5+df.petal_width**5)**(1/5)
df.head()
Out[41]:
sepal_length sepal_width petal_length petal_width species Minkowski
0 5.1 3.5 1.4 0.2 setosa 1.400017
1 4.9 3.0 1.4 0.2 setosa 1.400017
2 4.7 3.2 1.3 0.2 setosa 1.300022
3 4.6 3.1 1.5 0.2 setosa 1.500013
4 5.0 3.6 1.4 0.2 setosa 1.400017
In [42]:
# Plot the Minkowski feature against the row number to see the clusters.
df['x']=np.arange(1, len(df)+1)
g=sns.lmplot( x='x', y="Minkowski", data=df, fit_reg=False, hue='species', legend=False)
plt.annotate("",
              xy=(0, 5.151961423363787), xycoords='data',
              xytext=(150, 5.151961423363787), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )
plt.annotate("",
              xy=(0, 2.5), xycoords='data',
              xytext=(150, 2.5), textcoords='data',
              arrowprops=dict(arrowstyle="-",
                              connectionstyle="arc3,rad=0."), 
              )

bbox_props = dict(boxstyle="round", fc="w", ec="0.5", alpha=0.9)
plt.text(100, 1.5, "setosa", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 4, "versicolor", ha="center", va="center", size=15,bbox=bbox_props)
plt.text(22, 6.5, "virginica", ha="center", va="center", size=15,bbox=bbox_props)
Out[42]:
Text(22,6.5,'virginica')
In [157]:
# Create 3 bins (one per cluster, k=3) to build the confusion matrix
bins = [0, df[df.species=='setosa']['Minkowski'].max(), df[df.species=='versicolor']['Minkowski'].max(), 
        df[df.species=='virginica']['Minkowski'].max()]
df['binned'] = pd.cut(df['Minkowski'], bins)
df = df.copy()
df = pd.get_dummies(df, columns=['binned'], prefix = ['binned'])
df.head()
Out[157]:
sepal_length sepal_width petal_length petal_width species Minkowski x binned_(0.0, 1.9] binned_(1.9, 5.103] binned_(5.103, 6.906]
0 5.1 3.5 1.4 0.2 setosa 1.400017 1 1 0 0
1 4.9 3.0 1.4 0.2 setosa 1.400017 2 1 0 0
2 4.7 3.2 1.3 0.2 setosa 1.300022 3 1 0 0
3 4.6 3.1 1.5 0.2 setosa 1.500013 4 1 0 0
4 5.0 3.6 1.4 0.2 setosa 1.400017 5 1 0 0
In [158]:
confusion_matrix=df.groupby(['species']).sum().iloc[:,6:9]
confusion_matrix.columns = ['setosa_predict','versicolor_predict', 'virginica_predict']
confusion_matrix
Out[158]:
setosa_predict versicolor_predict virginica_predict
species
setosa 50 0 0
versicolor 0 50 0
virginica 0 10 40

4. CONCLUSION

K-means is a heuristic algorithm that partitions a data set into K clusters by minimizing the sum of squared distances within each cluster. During the implementation of k-means with three different distance metrics, we observed that the choice of distance metric plays a very important role in clustering, so it should be made carefully. In the confusion matrices above, the Manhattan-based feature places 29 virginica points in the versicolor bin, compared with 10 for Minkowski (p=5) and 8 for Euclidean. In conclusion, k-means based on the Euclidean distance metric gives the best result here, and k-means based on the Manhattan distance metric performs worst.
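To make the comparison concrete, the overall accuracy can be computed from the three confusion matrices reported above. The counts below are copied from those outputs; the dictionary layout and variable names are assumptions for illustration.

```python
import numpy as np

# Rows: true species (setosa, versicolor, virginica); columns: predicted bin
confusion = {
    "Euclidean": np.array([[50, 0, 0], [0, 50, 0], [0, 8, 42]]),
    "Manhattan": np.array([[50, 0, 0], [0, 50, 0], [0, 29, 21]]),
    "Minkowski (p=5)": np.array([[50, 0, 0], [0, 50, 0], [0, 10, 40]]),
}

# Accuracy = correctly binned points (diagonal) / all points
accuracy = {name: np.trace(m) / m.sum() for name, m in confusion.items()}

for name, acc in sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")

best = max(accuracy, key=accuracy.get)
worst = min(accuracy, key=accuracy.get)
```

This works out to 142/150 for Euclidean, 140/150 for Minkowski (p=5) and 121/150 for Manhattan.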

Published: November 3, 2018