close

OpenCV-Python | category: color quantization

home

OpenCV-Python

OpenCV Python Tutorials

opencvpython.blogspot.com

K-Means Clustering - 3 : Working with OpenCV


Hi,

In the previous articles, K-Means Clustering - 1 : Basic Understanding and K-Means Clustering - 2 : Working with Scipy, we have seen what is K-Means and how to use it to cluster the data. In this article, We will see how we can use K-Means function in OpenCV for K-Means clustering.

OpenCV documentation for K-Means clustering : cv2.KMeans()

Function parameters :

Input parameters :

1 - samples : It should be of np.float32 data type, and as said in previous article, each feature should be put in a single column.

2 - nclusters(K
) : Number of clusters

3 - criteria : It is the algorithm termination criteria. Actually, it should be a tuple of 3 parameters. They are ( type, max_iter, epsilon ):
    3.a - type of termination criteria : It has 3 flags as below:
      - cv2.TERM_CRITERIA_EPS - stop the algorithm iteration if specified accuracy, epsilon, is reached.
    - cv2.TERM_CRITERIA_MAX_ITER - stop the algorithm after the specified number of iterations, max_iter.
      - cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER - stop the iteration when any of the above condition is met.

    3.b - max_iter - An integer specifying maximum number of iterations. 
    3.c - epsilon - Required accuracy

4 - attempts : Flag to specify the number of times the algorithm is executed using different initial labellings. The algorithm returns the labels that yield the best compactness. This compactness is returned as output.

5 - flags : This flag is used to specify how initial centers are taken. Normally two flags are used for this : cv2.KMEANS_PP_CENTERS and cv2.KMEANS_RANDOM_CENTERS. (I didn't find any difference in their results, so I don't know where they are suitable. For time-being, I use second one in my examples).

Output parameters:

1 - compactness : It is the sum of squared distance from each point to their corresponding centers.

2 - labels : This is the label array (same as 'code' in previous article) where each element marked '0', '1'.....

3 - centers : This is array of centers of clusters.

Now let's do the same examples we did in last article. Remember, we used random number generator to generate data, so data may be different this time.

1 - Data with Only One Feature:

Below is the code, I have commented on important parts.

import numpy as np
import cv2
from matplotlib import pyplot as plt

x = np.random.randint(25,100,25)
y = np.random.randint(175,255,25)
z = np.hstack((x,y))
z = z.reshape((50,1))

# data should be np.float32 type
z = np.float32(z)

# Define criteria = ( type, max_iter = 10 , epsilon = 1.0 )
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)

# Apply KMeans
ret,labels,centers = cv2.kmeans(z,2,criteria,10,cv2.KMEANS_RANDOM_CENTERS)

# Now split the data depending on their labels
A = z[labels==0]
B = z[labels==1]

# Now plot 'A' in red, 'B' in blue, 'centers' in yellow
plt.hist(A,256,[0,256],color = 'r')
plt.hist(B,256,[0,256],color = 'b')
plt.hist(centers,32,[0,256],color = 'y')
plt.show()

Below is the output we get :

K-Means Clustering - 3 : Working with OpenCV
KMeans() with one feature set

2 - Data with more than one feature :

Directly moving to the code:

import numpy as np
import cv2
from matplotlib import pyplot as plt

X = np.random.randint(25,50,(25,2))
Y = np.random.randint(60,85,(25,2))
Z = np.vstack((X,Y))

# convert to np.float32
Z = np.float32(Z)

# define criteria and apply kmeans()
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
ret,label,center = cv2.kmeans(Z,2,criteria,10,cv2.KMEANS_RANDOM_CENTERS)

# Now separate the data, Note the flatten()
A = Z[label.flatten()==0]
B = Z[label.flatten()==1]

# Plot the data
plt.scatter(A[:,0],A[:,1])
plt.scatter(B[:,0],B[:,1],c = 'r')
plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')
plt.xlabel('Height'),plt.ylabel('Weight')
plt.show()

Note that, while separating data to A and B, we used label.flatten(). It is because 'label' returned by the OpenCV is a column vector. Actually, we needed a plain array. In Scipy, we get 'label' as plain array, so we don't need the flatten() there in Scipy. To understand more, check the 'label' in both the cases.

Below is the output we get :

K-Means Clustering - 3 : Working with OpenCV
KMeans() with two feature sets

3 - Color Quantization :

import numpy as np
import cv2
from matplotlib import pyplot as plt

img = cv2.imread('home.jpg')
Z = img.reshape((-1,3))

# convert to np.float32
Z = np.float32(Z)

# define criteria, number of clusters(K) and apply kmeans()
criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 10, 1.0)
K = 8
ret,label,center = cv2.kmeans(Z,K,criteria,10,cv2.KMEANS_RANDOM_CENTERS)

# Now convert back into uint8, and make original image
center = np.uint8(center)
res = center[label.flatten()]
res2 = res.reshape((img.shape))

cv2.imshow('res2',res2)
cv2.waitKey(0)
cv2.destroyAllWindows()

Below is the output we get :

K-Means Clustering - 3 : Working with OpenCV
Color Quantization with KMeans Clustering

Summary :

So finally, We have seen how to use KMeans clustering with OpenCV. I know, I haven't explained much in this article, because it is same as the previous article. Just a function is changed.

So this series on KMeans Clustering algorithm ends here.

Feel free to post your comments, feedback...

Feel free to share it with your friends....

Regards,
Abid Rahman K.

K-Means Clustering - 2 : Working with Scipy


Hi,

In the previous article, 'K-Means Clustering - 1 : Basic Understanding', we understood what is K-Means clustering, how it works etc. In this article, we will use k-means functionality in Scipy for data clustering. OpenCV will be covered in another article.

Scipy's cluster module provides routines for clustering. The vq module in it provides k-means functionality. You will need Scipy version 0.11 to get this feature.

We also use Matplotlib to visualize the data.

Note : All the data arrays used in this article are stored in github repo for you to check. It would be nice to check it for a better understanding. It is optional. Or you can create your own data and check it.

So we start by importing all the necessary libraries.

>>> import numpy as np
>>> from scipy.cluster import vq
>>> from matplotlib import pyplot as plt

Here I would like to show three examples.

1 - Data with only one feature :

Consider, you have a set of data with only one feature, ie one-dimensional. For eg, we can take our t-shirt problem where you use only height of people to decide the size of t-shirt.

Or, from an image processing point of view, you have a grayscale image with pixel values ranges from 0 to 255. You need to group it into just two colors, may be black and white only. ( That is another version of thresholding. I don't think someone will use k-means for thresholding. So just take this as a demo of k-means.)

So we start by creating data.

>>> x = np.random.randint(25,100,25)
>>> y = np.random.randint(175,255,25)
>>> z = np.hstack((x,y))
>>> z = z.reshape((50,1))

So we have 'z' which is an array of size 50, and values ranging from 0 to 255. I have reshaped 'z' to a column vector. It is not necessary here, but it is a good practice. Reason, I will explain in coming sections. Now we can plot this using Matplotlib's histogram plot.

>>> plt.hist(z,256,[0,256]),plt.show()

We get following image :

K-Means Clustering - 2 : Working with Scipy
Test Data

Now we use our k-means functions.

First function, vq.kmeans(), is used to cluster the data as per our requirements and it returns the centroids of the clusters. (Docs)

It takes our test data and number of clusters we need as inputs. Other two inputs are optional and is not of big concern now.

>>> centers,dist = vq.kmeans(z,2)
>>> centers
array([[207],
[ 60]])

First output is 'centers', which are the centroids of clustered data. For our data, it is 60 and 207. Second output is the distortion between centroids and test data. We mark the centroids along with the inputs.

>>> plt.hist(z,256,[0,256]),plt.hist(centers,32,[0,256]),plt.show()

Below is the output we got. Those green bars are the centroids.

K-Means Clustering - 2 : Working with Scipy
Green bars shows centroids after clustering

Now we have found the centroids. From first article, you might have seen our next job is to label the data '0' and '1' according to distance to the centroids. We use vq.vq() function for this purpose.

vq.vq() takes our test data and centroids as inputs and provides us the labelled data,called 'code' and distance between each data and corresponding centroids.

>>> code, distance = vq.vq(z,centers)

If you compare the arrays 'code' and 'z' in git repo, you can see all values near to first centroid will be labelled '0' and next as '1'.

Also check the distance array. 'z[0]' is 47, which is near to 60, so labelled as '1' in 'code'. And distance between them is 13, which is 'distance[0]'. Similarly you can check other data also.

Now we have the labels of all data, we can separate the data according to labels.

>>> a = z[code==0]
>>> b = z[code==1]

'a' corresponds to data with centroid = 207 and 'b' corresponds to remaining data. (Check git repo to see a&b).

Now we plot 'a' in red color, 'b' in blue color and 'centers' in yellow color as below:

>>> plt.hist(a,256,[0,256],color = 'r') # draw 'a' in red color
>>> plt.hist(b,256,[0,256],color = 'b') # draw 'b' in blue color
>>> plt.hist(centers,32,[0,256],color = 'y') # draw 'centers' in yellow color
>>> plt.show()

We get the output as follows, which is our clustered data :

K-Means Clustering - 2 : Working with Scipy
Output of K-Means clustering

So, we have done a very simple and basic example on k-means clustering. Next one, we will try with more than one features.

2 - Data with more than one feature :

In previous example, we took only height for t-shirt problem. Here, we will take both height and weight, ie two features.

Remember, in previous case, we made our data to a single column vector. This is because, it is a good convention, and normally followed by people from all fields. ie each feature is arranged in a column, while each row corresponds to an input sample.

For example, in this case, we set a test data of size 50x2, which are heights and weights of 50 people. First column corresponds to height of all the 50 people and second column corresponds to their weights. First row contains two elements where first one is the height of first person and second one his weight. Similarly remaining rows corresponds to heights and weights of other people. Check image below:

K-Means Clustering - 2 : Working with Scipy

So now we can prepare the data.

>>> x = np.random.randint(25,50,(25,2))
>>> y = np.random.randint(60,85,(25,2))
>>> z = np.vstack((x,y))

Now we got a 50x2 array. We plot it with 'Height' in X-axis and 'Weight' in Y-axis.

>>> plt.scatter(z[:,0],z[:,1]),plt.xlabel('Height'),plt.ylabel('Weight')
>>> plt.show()

(Some data may seem ridiculous. Never mind it, it is just a demo)

K-Means Clustering - 2 : Working with Scipy
Test Data

Now we apply k-means algorithm and label the data.

>>> center,dist = vq.kmeans(z,2)
>>> code,distance = vq.vq(z,center)

This time, 'center' is a 2x2 array, first column corresponds to centroids of height, and second column corresponds to centroids of weight.(Check git repo data)

As usual, we extract data with label '0', mark it with blue, then data with label '1', mark it with red, mark centroids in yellow and check how it looks like.

>>> a = z[code==0]
>>> b = z[code==1]
>>> plt.scatter(a[:,0],a[:,1]),plt.xlabel('Height'),plt.ylabel('Weight')
>>> plt.scatter(b[:,0],b[:,1],c = 'r')
>>> plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')
>>> plt.show()

This is the output we got :

K-Means Clustering - 2 : Working with Scipy
Result of K-Means clustering

So this is how we apply k-means clustering with more than one feature.

Now we go for a simple application of k-means clustering, ie color quantization.

3 - Color Quantization :

Color Quantization is the process of reducing number of colors in an image. One reason to do so is to reduce the memory. Sometimes, some devices may have limitation such that it can produce only limited number of colors. In those cases also, color quantization is performed.

There are lot of algorithms for color quantization. Wikipedia page for color quantization gives a lot of details and references to it. Here we use k-means clustering for color quantization.

There is nothing new to be explained here. There are 3 features, say, R,G,B. So we need to reshape the image to an array of Mx3 size (M is just a number). And after the clustering, we apply centroid values (it is also R,G,B) to all pixels, such that resulting image will have specified number of colors. And again we need to reshape it back to the shape of original image. Below is the code:

import cv2
import numpy as np
from scipy.cluster import vq

img = cv2.imread('home.jpg')
z = img.reshape((-1,3))

k = 2 # Number of clusters
center,dist = vq.kmeans(z,k)
code,distance = vq.vq(z,center)
res = center[code]
res2 = res.reshape((img.shape))
cv2.imshow('res2',res2)
cv2.waitKey(0)
cv2.destroyAllWindows()

Change the value of 'k' to get different number of colors. Below is the original image and results I got for values k=2,4,8 :

K-Means Clustering - 2 : Working with Scipy
Color Quantization with K-Means clustering

So, that's it !!!

In this article, we have seen how to use k-means algorithm with the help of Scipy functions. We also did 3 examples with sufficient number of images and plots. There are two more functions related to it, but I will deal it later.

In next article, we will deal with OpenCV k-means implementation.

I hope you enjoyed it...

And if you found this article useful, don't forget to share it on Google+ or facebook etc.

Regards,

Abid Rahman K.


Ref :


1 - Scipy cluster module documentation

2 - Color Quantization

K-Means Clustering - 3 : Working with OpenCVK-Means Clustering - 2 : Working with Scipy

Report "OpenCV-Python"

Are you sure you want to report this post for ?

Cancel
×