K-Means clustering is an unsupervised machine learning algorithm that partitions n instances into k clusters by similarity. Because it is unsupervised, the training instances do not have labels. To illustrate how the algorithm works, I will use a very simple unlabeled dataset with three features.
Instance | x1 | x2 | x3 |
---|---|---|---|
1 | 12 | 40 | 30 |
2 | 16 | 35 | 29 |
3 | 14 | 37 | 25 |
4 | 18 | 36 | 21 |
5 | 12 | 40 | 31 |
For the sake of simplicity, I will divide the instances into only two clusters. To partition the dataset into two clusters, you need two instances to serve as the initial centroids. Let us take instance 1 as the centroid of cluster-0 and instance 2 as the centroid of cluster-1.
cluster-0={1}
It means that instance 1 is in cluster-0.
cluster-1={2}
It means that instance 2 is in cluster-1.
Now let us see which cluster instance 3 will belong to. To decide, you need to calculate the distance from instance 1 (the cluster-0 centroid) to instance 3, and from instance 2 (the cluster-1 centroid) to instance 3.
The Euclidean distances are:
d(1, 3) = sqrt((12-14)^2 + (40-37)^2 + (30-25)^2) = sqrt(38)
d(2, 3) = sqrt((16-14)^2 + (35-37)^2 + (29-25)^2) = sqrt(24)
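These hand calculations are easy to verify in code. Below is a minimal sketch of a Euclidean distance function (the `euclidean` helper is my own, not a library function), applied to instances 1, 2, and 3 from the table:

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared feature differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

print(euclidean([12, 40, 30], [14, 37, 25]))  # d(1, 3) = sqrt(38) ≈ 6.164
print(euclidean([16, 35, 29], [14, 37, 25]))  # d(2, 3) = sqrt(24) ≈ 4.899
```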
Instance 3 is closer to instance 2 than to instance 1, so it joins cluster-1, and the clusters become:
cluster-0={1}
cluster-1={2,3}
Furthermore, a new centroid is calculated for cluster-1; call it c1:
c1 = ((16+14)/2, (35+37)/2, (29+25)/2) = (15, 36, 27)
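The centroid update is simply the feature-wise mean of the cluster's members, which can be sketched in plain Python (the `centroid` helper is my own):

```python
def centroid(points):
    # Average each feature across all points in the cluster.
    return tuple(sum(c) / len(points) for c in zip(*points))

print(centroid([[16, 35, 29], [14, 37, 25]]))  # (15.0, 36.0, 27.0)
```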
To decide which cluster instance 4 will belong to, you need to calculate the distance from instance 1 (the cluster-0 centroid) to instance 4, and from c1 (the cluster-1 centroid) to instance 4.
d(1, 4) = sqrt((12-18)^2 + (40-36)^2 + (30-21)^2) = sqrt(133)
d(c1, 4) = sqrt((15-18)^2 + (36-36)^2 + (27-21)^2) = sqrt(45)
Instance 4 is closer to c1 than to instance 1, so it joins cluster-1, and the clusters become:
cluster-0={1}
cluster-1={2,3,4}
Again, a new centroid is calculated for cluster-1:
c1 = ((16+14+18)/3, (35+37+36)/3, (29+25+21)/3) = (16, 36, 25)
To decide which cluster instance 5 will belong to, you need to calculate the distance from instance 1 (the cluster-0 centroid) to instance 5, and from c1 (the cluster-1 centroid) to instance 5.
d(1, 5) = sqrt((12-12)^2 + (40-40)^2 + (30-31)^2) = sqrt(1)
d(c1, 5) = sqrt((16-12)^2 + (36-40)^2 + (25-31)^2) = sqrt(68)
Instance 5 is closer to instance 1 than to c1, so it joins cluster-0, and the clusters become:
cluster-0={1, 5}
cluster-1={2,3,4}
Now the centroid for cluster-0 is calculated:
((12+12)/2, (40+40)/2, (30+31)/2) = (12, 40, 30.5)
Finally, we have two clusters:
cluster-0={1, 5}
cluster-1={2,3,4}
And their centers are (12, 40, 30.5) and (16, 36, 25), respectively.
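The walkthrough above can be replicated with a short script. This is a minimal sketch of the sequential, one-pass assignment used in this example, where each new instance is assigned immediately and the receiving cluster's centroid is recomputed (standard K-Means instead reassigns all instances on every iteration until convergence); the helper names here are my own.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared feature differences.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def centroid(points):
    # Feature-wise mean of the cluster members.
    return tuple(sum(c) / len(points) for c in zip(*points))

X = [[12, 40, 30], [16, 35, 29], [14, 37, 25], [18, 36, 21], [12, 40, 31]]

# Seed cluster-0 with instance 1 and cluster-1 with instance 2 (0-based indices).
clusters = {0: [0], 1: [1]}
centroids = {0: tuple(X[0]), 1: tuple(X[1])}

# Assign each remaining instance to the nearest centroid,
# then recompute the centroid of the cluster that received it.
for i in range(2, len(X)):
    nearest = min(centroids, key=lambda c: euclidean(centroids[c], X[i]))
    clusters[nearest].append(i)
    centroids[nearest] = centroid([X[j] for j in clusters[nearest]])

for c in clusters:
    # Print 1-based instance numbers to match the walkthrough.
    print(f"cluster-{c}:", [j + 1 for j in clusters[c]], "center:", centroids[c])
```

Running this prints cluster-0 = {1, 5} with center (12, 40, 30.5) and cluster-1 = {2, 3, 4} with center (16, 36, 25), matching the hand calculations.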
Python code using scikit-learn
```python
from sklearn.cluster import KMeans

# The five instances from the table above.
X = [[12, 40, 30], [16, 35, 29], [14, 37, 25], [18, 36, 21], [12, 40, 31]]

# Two clusters, matching the walkthrough; a fixed random_state makes
# the randomized centroid initialization reproducible.
kmc = KMeans(n_clusters=2, random_state=0)
kmc.fit(X)

print(kmc.labels_)           # cluster label of each instance
print(kmc.cluster_centers_)  # the two cluster centers
```
After running the above code, you will get the cluster label of each instance and the two cluster centers as output.
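Note that scikit-learn numbers its clusters arbitrarily, so the printed labels may be 0/1-swapped relative to the walkthrough above. Once fitted, the model can also assign a new, unseen instance to the nearest learned center via `predict` (the sample point below is made up for illustration):

```python
# Assign a new instance to the nearest of the two learned centers.
print(kmc.predict([[13, 39, 29]]))
```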