Feature Selection Package - Clusters - K-means
Description
KMEANS uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all K clusters. The first phase uses what the literature often describes as "batch" updates, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. This phase may be thought of as providing a fast but potentially only approximate solution as a starting point for the second phase. The second phase uses what the literature often describes as "on-line" updates, where points are individually reassigned if doing so will reduce the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this second phase consists of one pass though all the points. KMEANS can converge to a local optimum, which in this case is a partition of points in which moving any single point to a different cluster increases the total sum of distances. This problem can only be solved by a clever (or lucky, or exhaustive) choice of starting points.
Usage
Method Signature:
[idx, C, sumD, D] = KM(X, k, varargin)

Output:
    idx: The clustering data for each data point.
    C: K-by-P matrix containing the K cluster centroid locations.
    sumD: 1-by-K vector containing within-cluster sums of point-to-centroid distances.
    D: N-by-K matrix containing distances from each point to every centroid.
Input:
    X, k: Partitions points in the N-by-P data matrix X into k clusters.
    varargin: KM(..., 'PARAM1',val1, 'PARAM2',val2, ...); This allows you to specify optional parameters and their values. Accepted parameters are as follows:
Name Description Values
Distance Distance measure, in P-dimensional space, that KMEANS should minimize with respect to.
Value Definition
sqEuclidean Squared Euclidean distance
cityblock Sum of absolute differences
cosine One minus the cosine of the included angle between points (treated as vectors)
correlation One minus the sample correlation between points (treated as sequences of values)
Hamming Percentage of bits that differ (only suitable for binary data)
Start Method used to choose initial cluster centroid positions, sometimes known as "seeds".
Value Definition
sample Select K observations from X at random.
uniform Select K points uniformly at random from the range of X. Not valid for Hamming distance.
cluster Perform preliminary clustering phase on random 10% subsample of X. This preliminary phase is itself initialized using 'sample'.
matrix A K-by-P matrix of starting locations. In this case, you can pass in [] for K, and KMEANS infers K from the first dimension of the matrix. You can also supply a 3D array, implying a value for 'Replicates' from the array's third dimension.
Replicates Number of times to repeat the clustering, each with a new set of initial centroids [ positive integer | {1}]
Maxiter The maximum number of iterations [ positive integer | {100}]
EmptyAction Action to take if a cluster loses all of its member observations.
Value Definition
error Treat an empty cluster as an error
drop Remove any clusters that become empty, and set corresponding values in C and D to NaN.
singleton Create a new cluster consisting of the one observation furthest from its centroid.
Display Display level [ 'off' | {'notify'} | 'final' | 'iter' ]
Code Example
% Using the wine.dat data set, which can be found at
% [fspackage_location]/classifiers/knn/wine.mat
KM(X, 10);
Paper
BibTex entry for:

Seber, G.A.F., Multivariate Observations, Wiley, New York, 1984.
Spath, H. (1985) Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples, translated by J. Goldschmidt, Halsted Press, New York, 226 pp.
@book{Seber84,
   author = {Seber, G.A.F.},
   title = {Multivariate Observations},
   year = {1984},
   address = {New York},
   publisher = {Wiley}
}

@book{537092,
   author = {Spath, Helmuth},
   title = {The Cluster Dissection and Analysis Theory FORTRAN Programs Examples},
   year = {1985},
   isbn = {0131379852},
   publisher = {Prentice-Hall, Inc.},
   address = {Upper Saddle River, NJ, USA}
}