Feature Selection Package - Clusters - K-means
Description
KMEANS uses a two-phase iterative algorithm to minimize the sum of
point-to-centroid distances, summed over all K clusters. The first
phase uses what the literature often describes as "batch" updates,
where each iteration consists of reassigning points to their nearest
cluster centroid, all at once, followed by recalculation of cluster
centroids. This phase may be thought of as providing a fast but
potentially only approximate solution as a starting point for the
second phase. The second phase uses what the literature often
describes as "on-line" updates, where points are individually
reassigned if doing so will reduce the sum of distances, and cluster
centroids are recomputed after each reassignment. Each iteration
during this second phase consists of one pass though all the points.
KMEANS can converge to a local optimum, which in this case is a
partition of points in which moving any single point to a different
cluster increases the total sum of distances. This problem can only be
solved by a clever (or lucky, or exhaustive) choice of starting points.
Usage
Method Signature:
[idx, C, sumD, D] =
KM(X, k, varargin)
Output:
idx: The clustering data for each data point.
C: K-by-P matrix containing the K cluster centroid locations.
sumD: 1-by-K vector containing within-cluster sums of point-to-centroid distances.
D: N-by-K matrix containing distances from each point to every centroid.
Input:
X, k: Partitions points in the N-by-P data matrix
X into k clusters.
varargin: KM(..., 'PARAM1',val1, 'PARAM2',val2, ...); This allows
you to specify optional parameters and their values. Accepted parameters are as follows:
Name |
Description |
Values |
Distance |
Distance measure, in P-dimensional space, that KMEANS should minimize with respect to. |
Value |
Definition |
sqEuclidean |
Squared Euclidean distance |
cityblock |
Sum of absolute differences |
cosine |
One minus the cosine of the included angle between points (treated as vectors) |
correlation |
One minus the sample correlation between points (treated as sequences of values) |
Hamming |
Percentage of bits that differ (only suitable for binary data) |
|
Start |
Method used to choose initial cluster centroid positions, sometimes known as "seeds". |
Value |
Definition |
sample |
Select K observations from X at random. |
uniform |
Select K points uniformly at random from the range of X. Not valid for Hamming distance. |
cluster |
Perform preliminary clustering phase on random 10% subsample of X. This preliminary phase is itself initialized using 'sample'. |
matrix |
A K-by-P matrix of starting locations. In
this case, you can pass in [] for K, and
KMEANS infers K from the first dimension of
the matrix. You can also supply a 3D array,
implying a value for 'Replicates'
from the array's third dimension. |
|
Replicates |
Number of times to repeat the clustering, each with a new set of initial centroids |
[ positive integer | {1}] |
Maxiter |
The maximum number of iterations |
[ positive integer | {100}] |
EmptyAction |
Action to take if a cluster loses all of its member observations. |
Value |
Definition |
error |
Treat an empty cluster as an error |
drop |
Remove any clusters that become empty, and set corresponding values in C and D to NaN. |
singleton |
Create a new cluster consisting of the one observation furthest from its centroid. |
|
Display |
Display level |
[ 'off' | {'notify'} | 'final' | 'iter' ] |
Code Example
% Using the wine.dat data set, which can be found at
% [fspackage_location]/classifiers/knn/wine.mat
KM(X, 10);
Paper
BibTex entry for:
Seber, G.A.F., Multivariate Observations, Wiley, New York, 1984.
Spath, H. (1985) Cluster Dissection and Analysis: Theory, FORTRAN
Programs, Examples, translated by J. Goldschmidt, Halsted Press, New York, 226 pp.
@book{Seber84,
author = {Seber, G.A.F.},
title = {Multivariate Observations},
year = {1984},
address = {New York},
publisher = {Wiley}
}
@book{537092,
author = {Spath, Helmuth},
title = {The Cluster Dissection and Analysis Theory FORTRAN Programs Examples},
year = {1985},
isbn = {0131379852},
publisher = {Prentice-Hall, Inc.},
address = {Upper Saddle River, NJ, USA}
}