OpenCV-Python | category: opencv python tutorials



OpenCV Python Tutorials

K-Means Clustering - 2 : Working with Scipy


In the previous article, 'K-Means Clustering - 1 : Basic Understanding', we understood what is K-Means clustering, how it works etc. In this article, we will use k-means functionality in Scipy for data clustering. OpenCV will be covered in another article.

Scipy's cluster module provides routines for clustering. The vq module in it provides k-means functionality. You will need Scipy version 0.11 to get this feature.

We also use Matplotlib to visualize the data.

Note : All the data arrays used in this article are stored in github repo for you to check. It would be nice to check it for a better understanding. It is optional. Or you can create your own data and check it.

So we start by importing all the necessary libraries.

>>> import numpy as np
	>>> from scipy.cluster import vq
	>>> from matplotlib import pyplot as plt

Here I would like to show three examples.

1 - Data with only one feature :

Consider, you have a set of data with only one feature, ie one-dimensional. For eg, we can take our t-shirt problem where you use only height of people to decide the size of t-shirt.

Or, from an image processing point of view, you have a grayscale image with pixel values ranges from 0 to 255. You need to group it into just two colors, may be black and white only. ( That is another version of thresholding. I don't think someone will use k-means for thresholding. So just take this as a demo of k-means.)

So we start by creating data.

>>> x = np.random.randint(25,100,25)
	>>> y = np.random.randint(175,255,25)
	>>> z = np.hstack((x,y))
	>>> z = z.reshape((50,1))

So we have 'z' which is an array of size 50, and values ranging from 0 to 255. I have reshaped 'z' to a column vector. It is not necessary here, but it is a good practice. Reason, I will explain in coming sections. Now we can plot this using Matplotlib's histogram plot.

>>> plt.hist(z,256,[0,256]),

We get following image :

K-Means Clustering - 2 : Working with Scipy
Test Data

Now we use our k-means functions.

First function, vq.kmeans(), is used to cluster the data as per our requirements and it returns the centroids of the clusters. (Docs)

It takes our test data and number of clusters we need as inputs. Other two inputs are optional and is not of big concern now.

>>> centers,dist = vq.kmeans(z,2)
	>>> centers
	       [ 60]])

First output is 'centers', which are the centroids of clustered data. For our data, it is 60 and 207. Second output is the distortion between centroids and test data. We mark the centroids along with the inputs.

>>> plt.hist(z,256,[0,256]),plt.hist(centers,32,[0,256]),

Below is the output we got. Those green bars are the centroids.

K-Means Clustering - 2 : Working with Scipy
Green bars shows centroids after clustering

Now we have found the centroids. From first article, you might have seen our next job is to label the data '0' and '1' according to distance to the centroids. We use vq.vq() function for this purpose.

vq.vq() takes our test data and centroids as inputs and provides us the labelled data,called 'code' and distance between each data and corresponding centroids.

>>> code, distance = vq.vq(z,centers)

If you compare the arrays 'code' and 'z' in git repo, you can see all values near to first centroid will be labelled '0' and next as '1'.

Also check the distance array. 'z[0]' is 47, which is near to 60, so labelled as '1' in 'code'. And distance between them is 13, which is 'distance[0]'. Similarly you can check other data also.

Now we have the labels of all data, we can separate the data according to labels.

>>> a = z[code==0]
	>>> b = z[code==1]

'a' corresponds to data with centroid = 207 and 'b' corresponds to remaining data. (Check git repo to see a&b).

Now we plot 'a' in red color, 'b' in blue color and 'centers' in yellow color as below:

>>> plt.hist(a,256,[0,256],color = 'r') # draw 'a' in red color
	>>> plt.hist(b,256,[0,256],color = 'b') # draw 'b' in blue color
	>>> plt.hist(centers,32,[0,256],color = 'y') # draw 'centers' in yellow color

We get the output as follows, which is our clustered data :

K-Means Clustering - 2 : Working with Scipy
Output of K-Means clustering

So, we have done a very simple and basic example on k-means clustering. Next one, we will try with more than one features.

2 - Data with more than one feature :

In previous example, we took only height for t-shirt problem. Here, we will take both height and weight, ie two features.

Remember, in previous case, we made our data to a single column vector. This is because, it is a good convention, and normally followed by people from all fields. ie each feature is arranged in a column, while each row corresponds to an input sample.

For example, in this case, we set a test data of size 50x2, which are heights and weights of 50 people. First column corresponds to height of all the 50 people and second column corresponds to their weights. First row contains two elements where first one is the height of first person and second one his weight. Similarly remaining rows corresponds to heights and weights of other people. Check image below:

K-Means Clustering - 2 : Working with Scipy

So now we can prepare the data.

>>> x = np.random.randint(25,50,(25,2))
	>>> y = np.random.randint(60,85,(25,2))
	>>> z = np.vstack((x,y))

Now we got a 50x2 array. We plot it with 'Height' in X-axis and 'Weight' in Y-axis.

>>> plt.scatter(z[:,0],z[:,1]),plt.xlabel('Height'),plt.ylabel('Weight')

(Some data may seem ridiculous. Never mind it, it is just a demo)

K-Means Clustering - 2 : Working with Scipy
Test Data

Now we apply k-means algorithm and label the data.

>>> center,dist = vq.kmeans(z,2)
	>>> code,distance = vq.vq(z,center)

This time, 'center' is a 2x2 array, first column corresponds to centroids of height, and second column corresponds to centroids of weight.(Check git repo data)

As usual, we extract data with label '0', mark it with blue, then data with label '1', mark it with red, mark centroids in yellow and check how it looks like.

>>> a = z[code==0]
	>>> b = z[code==1]
	>>> plt.scatter(a[:,0],a[:,1]),plt.xlabel('Height'),plt.ylabel('Weight')
	>>> plt.scatter(b[:,0],b[:,1],c = 'r')
	>>> plt.scatter(center[:,0],center[:,1],s = 80,c = 'y', marker = 's')

This is the output we got :

K-Means Clustering - 2 : Working with Scipy
Result of K-Means clustering

So this is how we apply k-means clustering with more than one feature.

Now we go for a simple application of k-means clustering, ie color quantization.

3 - Color Quantization :

Color Quantization is the process of reducing number of colors in an image. One reason to do so is to reduce the memory. Sometimes, some devices may have limitation such that it can produce only limited number of colors. In those cases also, color quantization is performed.

There are lot of algorithms for color quantization. Wikipedia page for color quantization gives a lot of details and references to it. Here we use k-means clustering for color quantization.

There is nothing new to be explained here. There are 3 features, say, R,G,B. So we need to reshape the image to an array of Mx3 size (M is just a number). And after the clustering, we apply centroid values (it is also R,G,B) to all pixels, such that resulting image will have specified number of colors. And again we need to reshape it back to the shape of original image. Below is the code:

import cv2
	import numpy as np
	from scipy.cluster import vq

	img = cv2.imread('home.jpg')
	z = img.reshape((-1,3))

	k = 2           # Number of clusters
	center,dist = vq.kmeans(z,k)
	code,distance = vq.vq(z,center)
	res = center[code]
	res2 = res.reshape((img.shape))

Change the value of 'k' to get different number of colors. Below is the original image and results I got for values k=2,4,8 :

K-Means Clustering - 2 : Working with Scipy
Color Quantization with K-Means clustering

So, that's it !!!

In this article, we have seen how to use k-means algorithm with the help of Scipy functions. We also did 3 examples with sufficient number of images and plots. There are two more functions related to it, but I will deal it later.

In next article, we will deal with OpenCV k-means implementation.

I hope you enjoyed it...

And if you found this article useful, don't forget to share it on Google+ or facebook etc.


Abid Rahman K.

Ref :

1 - Scipy cluster module documentation

2 - Color Quantization

K-Means Clustering - 1 : Basic Understanding


In this section, we will be dealing with K-Means clustering, a machine learning algorithm, which is used for grouping a large set of data to desired number of groups depending on their similarity. It will be a series of articles, may be two or three articles.

In this article, we mainly focus on the understanding what is this algorithm, how it works, what are its applications etc. I think an intuitive understanding is more better than explaining the mathematical background behind the algorithm.

In the coming articles, I will explain how to use in-built functions of Scipy and OpenCV for K-Means clustering.

(Note : All images are sufficiently larger, but resized to fit in page. Just click on the image to see it in its original size)

By the way, What is this algorithm for ?

According to Wikipedia :

k-means clustering is a method of cluster analysis which aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean.

( Reading above passage, I hope you might have understood at-least this : It groups large data to some groups based on some features.)

If you didn't understand above statements, I will tell a couple of examples :

1 - Google News :

I am sure you will have visited Google News at-least once. For every news, Google provides a main article and just below it, there are links to corresponding news from all other newspapers. Check below image :

K-Means Clustering - 1 : Basic Understanding

This is one of my favorite features in Google News, because I try to read news on same topic from different papers. (It helps me to understand how media play with our minds. Different newspapers try to present same news in different ways to audience. And it is seen frequently with respect to political news and people reading them are, most often, got biased by these news, while real truth is hidden from them. Presently the power of media is such that, if you control the media, you control the world. Anyway, it is out of our topic. )

I have always wondered how does Google achieve this. I am not claiming Google uses k-means clustering, but this is what K-Means clustering also does, You have a large set of news and group them according to its content.

2 - T-shirt size problem

This example is commonly used to explain clustering problems.

Consider a company, which is going to release a new model of T-shirt to market. Obviously they will have to manufacture models in different sizes to satisfy people of all sizes. So the company make a data of people's height and weight, and plot them on to a graph, as below:

K-Means Clustering - 1 : Basic Understanding
People Height v/s Weight plot

Company can't create t-shirts with all the sizes. Instead, they divide people to Small, Medium and Large, and manufacture only these 3 models which will fit into all the people. This grouping of people into three groups can be done by k-means clustering, and algorithm provides us best 3 sizes, which will satisfy all the people. And if it doesn't, company can divide people to more groups, may be five, and so on. Check image below :

K-Means Clustering - 1 : Basic Understanding
People grouped to different clusters

So, now I hope, you might have understood use of this algorithm. Now we can go to next step.

How does the algorithm work ?

This algorithm is an iterative process. We will explain it step-by-step with the help of images.

Consider a set of data as below ( Don't be bothered about what this data is, or you can consider it as t-shirt problem). We need to cluster this data into two groups.

K-Means Clustering - 1 : Basic Understanding
Test Data

Step :1 - Algorithm randomly chooses two centroids, c1 and c2 (sometimes, any two data are taken as the centroids).

Step :2 - Then it calculates the distance from each point to both centroids. If a test data is more closer to c1, then that data is labelled with '0'. If it is closer to c2, then labelled as '1' (If more centroids are there, labelled as '2','3' etc).

In our case, we will color all '0' labelled with red, and '1' labelled with blue. So we get following image after above operations.

K-Means Clustering - 1 : Basic Understanding
Initial centroid selection and data labelling

Step :3 - Next we calculate the average of all blue points and red points separately and that will be our new centroids. That is c1 and c2 shift to newly calculated centroids. (Remember, the images shown are not true values and not to true scale, it is just for demonstration only. So calculated averages may not be what you expect).

And again, perform step 2 with new centroids and label data to '0' and '1'.

So we get result as below :

K-Means Clustering - 1 : Basic Understanding
New centroid calculated and data re-labelled

Now step - 2 and step - 3 are iterated until both centroids are converged to fixed points. These points are such that sum of distances between test data and their corresponding centroids are minimum. Or simply, sum of distances between red data & c1 and blue data & c2 is minimum.

Final result almost looks like below :

K-Means Clustering - 1 : Basic Understanding
Final Result

So, I hope by this time, you would have obtained a basic understanding how K-Means clustering works. With this intuitive knowledge, read some good text books to understand the mathematics behind this algorithm. I am sure you won't find any difficulty to understand it.

It is just a top layer of K-Means clustering. There are a lot of modifications to this algorithm like, how to choose the initial centroids, how to speed up the iteration process etc. Me too, haven't gone much deeper and I think this would be sufficient to start working with K-Means clustering. Those who need more knowledge can read any good text books.

So that was a complete theoretical, easy-to-read article.

In next article, we will see how to use Scipy's kmeans implementation for data clustering, and we will deal OpenCV implementation in later article.

So that's all for now.

If you find this article useful, don't forget to share this article on Google+, facebook etc. It is just a click away !!!


Abid Rahman K.

Ref :

1 - Stanford Machine Learning classes by Prof. Andrew Ng

BackGround Extraction using Running Average

Hi, this is going to be a very simple article, but you will find it very helpful. It is about Background Extraction from a Video.

Suppose you are given video of footage of traffic, may be some thing like this : Traffic in India, and you are asked to find an approximate background. Or anything like that.

Background extraction comes important in object tracking. If you already have an image of the bare background, then it is simple. But in many cases, you won't have such an image and so, you will have to create one. That is where Running Average comes in handy.

( I thought about this when one guy asked a question in SOF : Link)

The function we use here to find Running Average is cv2.accumulateWeighted(). For example, if we are watching a video, we keep feeding each frame to this function, and the function keep finding the averages of all frames fed to it as per the relation below :

BackGround Extraction using Running Average

where :

src is nothing but our source image. It can be grayscale or color image and either 8-bit or 32-bit floating point.

dst is the output or accumulator image with same channels as that of source image, and it is either 32-bit or 64-bit floating point. Also, we should declare it first to a value which will be taken as initial value.

alpha is the weight of the input image. According to Docs, alpha regulates the update speed (how fast the accumulator “forgets” about earlier images). In simple words, if alpha is a higher value, average image tries to catch even very fast and short changes in the data. If it is lower value, average becomes sluggish and it won't consider fast changes in the input images. I will explain it a little bit with help of images at the end of article.

Code :

import cv2
	import numpy as np

	c = cv2.VideoCapture(0)
	_,f =

	avg1 = np.float32(f)
	avg2 = np.float32(f)

	    _,f =
	    res1 = cv2.convertScaleAbs(avg1)
	    res2 = cv2.convertScaleAbs(avg2)

	    k = cv2.waitKey(20)

	    if k == 27:


In above code, I have set two averages, one with higher alpha value and another with lower alpha value so you can understand effect of alpha. At first both are set to initial frame of the capture. And in loop they get updated.

You can see some results in the SOF link I already provided. (I provide those results here, you can check the code and alpha value there):

Result 1 :

I used my webcam and saved original frame and running average at a particular instant.

BackGround Extraction using Running Average
Original Frame

As you can see, in this frame, my hand blocks the views behind.
BackGround Extraction using Running Average
Average Frame

But Running Average image shows the background clearly. You can see my hands ( actually lots of hands, which you can interpret as , I was waving my hand)

Result 2 :

Again from SOF :

BackGround Extraction using Running Average

This is a frame from a typical traffic video taken by a stationary camera. As you can see, a car is going on the road, and the person is trying to cross the road at a particular instant of time.
BackGround Extraction using Running Average

But see the running average at that time. There is no person and car in this image ( Actually it is there, have a close look, then you will see it, and the person is more clear than car, since car is moving very fast and across the image, it has not much effect on average, but person is there for a long time, since he is slow and moving across the road.)

Now we need to see the effect of alpha on these images.

Result 3 :

BackGround Extraction using Running Average
Original Frame
BackGround Extraction using Running Average
Alpha = 0.1

BackGround Extraction using Running Average
Alpha = 0.01

I was just waving my hand across my face in front of camera. Original frame caught it without any mercy. Running Average with alpha 0.1 has caught it as a transparent hand, with main emphasis on background. As alpha again reduced, you can see no hand there in front of face. ie the effect, as alpha decreases, sudden changes shows no effect on running averages.

Result 4 :

From the traffic video I have given in beginning of this article:

BackGround Extraction using Running Average
Original Frame
BackGround Extraction using Running Average
Alpha = 0.1

BackGround Extraction using Running Average
Alpha = 0.01

As alpha decreases, more vehicles are removed to create an approximate background. Unlike other examples, there is not a single instant in this video without vehicles and camera is shaky and moving, we still got a good background (although blurry)

Result 5:

BackGround Extraction using Running Average
Original Frame
BackGround Extraction using Running Average
Alpha = 0.1

BackGround Extraction using Running Average
Alpha = 0.01

Explanations are same as above. So reducing alpha creates a good background image. So choose the alpha value wisely.

So this is all about today.

As I mentioned, it is a simple article, but there is a good point to learn which we did with help of lot of images and simple code. Try it yourself and have some hacks to create something cool.

Let me know your feedbacks, thoughts etc !!!


K-Means Clustering - 2 : Working with ScipyK-Means Clustering - 1 : Basic UnderstandingBackGround Extraction using Running Average

Report "OpenCV-Python"

Are you sure you want to report this post for ?