Feature Selection Package Documentation

KMEANS uses a two-phase iterative algorithm to minimize the sum of point-to-centroid distances, summed over all K clusters. The first phase uses what the literature often describes as "batch" updates, where each iteration consists of reassigning points to their nearest cluster centroid, all at once, followed by recalculation of cluster centroids. This phase may be thought of as providing a fast but potentially only approximate solution as a starting point for the second phase. The second phase uses what the literature often describes as "on-line" updates, where points are individually reassigned if doing so will reduce the sum of distances, and cluster centroids are recomputed after each reassignment. Each iteration during this second phase consists of one pass though all the points. KMEANS can converge to a local optimum, which in this case is a partition of points in which moving any single point to a different cluster increases the total sum of distances. This problem can only be solved by a clever (or lucky, or exhaustive) choice of starting points.

Method Signature:

[idx, C, sumD, D] = KM(X, k, varargin)

Output:
idx: The clustering data for each data point.
C: K-by-P matrix containing the K cluster centroid locations.
sumD: 1-by-K vector containing within-cluster sums of point-to-centroid distances.
D: N-by-K matrix containing distances from each point to every centroid.
Input:
X, k: Partitions points in the N-by-P data matrix X into k clusters.
varargin: KM(..., 'PARAM1',val1, 'PARAM2',val2, ...); This allows you to specify optional parameters and their values. Accepted parameters are as follows:

Name

Description

Values

Distance

Distance measure, in P-dimensional space, that KMEANS should minimize with respect to.

Value	Definition
sqEuclidean	Squared Euclidean distance
cityblock	Sum of absolute differences
cosine	One minus the cosine of the included angle between points (treated as vectors)
correlation	One minus the sample correlation between points (treated as sequences of values)
Hamming	Percentage of bits that differ (only suitable for binary data)

Start

Method used to choose initial cluster centroid positions, sometimes known as "seeds".

Value	Definition
sample	Select K observations from X at random.
uniform	Select K points uniformly at random from the range of X. Not valid for Hamming distance.
cluster	Perform preliminary clustering phase on random 10% subsample of X. This preliminary phase is itself initialized using 'sample'.
matrix	A K-by-P matrix of starting locations. In this case, you can pass in [] for K, and KMEANS infers K from the first dimension of the matrix. You can also supply a 3D array, implying a value for 'Replicates' from the array's third dimension.

Replicates

Number of times to repeat the clustering, each with a new set of initial centroids

[ positive integer | {1}]

Maxiter

The maximum number of iterations

[ positive integer | {100}]

EmptyAction

Action to take if a cluster loses all of its member observations.

Value	Definition
error	Treat an empty cluster as an error
drop	Remove any clusters that become empty, and set corresponding values in C and D to NaN.
singleton	Create a new cluster consisting of the one observation furthest from its centroid.

Display

Display level

[ 'off' | {'notify'} | 'final' | 'iter' ]

BibTex entry for:

Seber, G.A.F., Multivariate Observations, Wiley, New York, 1984.
Spath, H. (1985) Cluster Dissection and Analysis: Theory, FORTRAN Programs, Examples, translated by J. Goldschmidt, Halsted Press, New York, 226 pp.

@book{Seber84,
   author = {Seber, G.A.F.},
   title = {Multivariate Observations},
   year = {1984},
   address = {New York},
   publisher = {Wiley}
}

@book{537092,
   author = {Spath, Helmuth},
   title = {The Cluster Dissection and Analysis Theory FORTRAN Programs Examples},
   year = {1985},
   isbn = {0131379852},
   publisher = {Prentice-Hall, Inc.},
   address = {Upper Saddle River, NJ, USA}
}