Name

sxk_means_groups - determine 'best' number of clusters in the data using K-means classification of a set of images

Usage

Usage in command lines:

sxk_means_groups.py stackfile output_file <maskfile> --K1=Min_number_of_Clusters --K2=Max_number_of_Clusters --trials=Number_of_trials_of_K-means --CTF --rand_seed=1000 --maxit=Maximum_number_of_iterations --MPI --debug

Usage in python programming:

k_means_groups(stack, out_file, maskname,"SSE", K1, K2, rand_seed, maxit, trials, CTF, MPI=False, DEBUG=False, flagnorm=False)

MPI Note: the MPI version is parallelized over the number of trials.

1. Set the --MPI flag on the command line.
2. Run mpirun -np 32 sxk_means_groups.py followed by the remaining parameters.
The above example is for mympi.

Example:

sxk_means_groups.py hri_stack.hdf RES mask2d_23.hdf --K1=2 --K2=10 --maxit=500 --trials=5

mpirun -np 5 sxk_means_groups.py bdb:hri_stack RES mask2d_23.hdf --K1=2 --K2=10 --maxit=1000 --rand_seed=100 --MPI

Note: if the 2D input images have been aligned (see sxali2d), the program applies the 2D alignment parameters (xform.align2d) stored in the image headers before clustering.

Input

stackfile: The input stack of images
output_file: text file in which the values of the clustering criteria are stored
mask: filename of the input image mask. Only image pixels for which the mask value is > 0.5 are considered. Note: the mask must have the same dimensions as the input images (default = None, entire images are used)
K1: minimum requested number of clusters
K2: maximum requested number of clusters
trials: number of K-means trials (see description below) (default: one trial). In the MPI version, the program ignores the --trials option and internally sets the number of trials to the number of CPUs used.
CTF: if set, CTF information stored in file headers will be used (default no CTF).
rand_seed: seed used for generating random numbers (if set to -1, a different pseudo-random seed is used each time)
MPI: use the MPI version of k-means groups
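
The mask rule described above can be sketched in plain Python. This is an illustration only, not the program's own code: the function name masked_pixels and the nested-list image representation are assumptions made for the example, while the 0.5 threshold and the same-dimensions requirement come from the mask description.

```python
def masked_pixels(image, mask, threshold=0.5):
    """Return the pixels of `image` whose mask value exceeds `threshold`.

    `image` and `mask` are lists of rows (hypothetical representation).
    """
    same_shape = len(image) == len(mask) and all(
        len(row) == len(mrow) for row, mrow in zip(image, mask)
    )
    if not same_shape:
        raise ValueError("mask must have the same dimensions as the image")
    # Keep a pixel only where the mask value is strictly greater than the threshold.
    return [
        pix
        for row, mrow in zip(image, mask)
        for pix, m in zip(row, mrow)
        if m > threshold
    ]


image = [[1.0, 2.0], [3.0, 4.0]]
mask = [[1.0, 0.0], [0.4, 0.6]]
selected = masked_pixels(image, mask)
```

Only the selected pixels would then enter the distance calculations; with no mask, every pixel is used.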

Output

output_file: text file containing columns according to the criteria chosen. For example, if crit='CHD', the columns are: (1) number of clusters, (2) values of the Coleman criterion, (3) values of the Harabasz criterion, and (4) values of the Davies-Bouldin criterion.
output_file.p: file containing a gnuplot script that plots the values of all criteria directly over the same range. Use this command in gnuplot: load 'output_file.p'
WATCH_GRP_KMEANS or WATCH_MPI_GRP_KMEANS: file containing the progress of k-means groups. It can be read in real time to watch the evolution of the criteria.

Description

The command implements the Sum of Squared Errors (SSE) minimization method, with two different algorithms selected by the CTF flag. In both cases random initialization is used, i.e., images are initially assigned to K classes at random.
Minimization method: SSE - class averages are updated after the reassignment of each object.
The results of K-means classification are (in most cases) irreproducible, i.e., if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To find reproducible results, one is advised to repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. This process is facilitated by the number of trials the user can provide: the program repeats the classification the specified number of times and returns the best solution found.
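
The repeat-and-keep-the-best strategy can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in statistics.py: all function names are hypothetical, objects are plain lists of floats, and for brevity the sketch recomputes class averages once per pass (batch update) rather than after the reassignment of each object.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def class_averages(data, assign, k, rng):
    """Average of the members of each class."""
    centers = []
    for c in range(k):
        members = [p for p, a in zip(data, assign) if a == c]
        if members:
            centers.append([sum(col) / len(members) for col in zip(*members)])
        else:
            # An emptied class is re-seeded with a randomly chosen object.
            centers.append(list(rng.choice(data)))
    return centers


def kmeans_once(data, k, rng, maxit=100):
    """One trial: random initial assignment, then iterate until stable or maxit."""
    assign = [i % k for i in range(len(data))]  # every class starts non-empty
    rng.shuffle(assign)
    for _ in range(maxit):
        centers = class_averages(data, assign, k, rng)
        new_assign = [
            min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in data
        ]
        if new_assign == assign:
            break
        assign = new_assign
    centers = class_averages(data, assign, k, rng)
    sse = sum(sq_dist(p, centers[c]) for p, c in zip(data, assign))
    return sse, assign


def best_of_trials(data, k, trials=5, seed=1000, maxit=100):
    """Repeat K-means `trials` times and keep the solution with the lowest SSE."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        result = kmeans_once(data, k, rng, maxit)
        if best is None or result[0] < best[0]:
            best = result
    return best


data = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
sse, assign = best_of_trials(data, k=2, trials=5, seed=1000)
```

Each trial starts from a different random assignment drawn from the same seeded generator, so a fixed seed makes the whole sequence of trials repeatable.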
The program calculates and returns values of clustering quality criteria: Coleman, Harabasz, or Davies-Bouldin. When plotted against the number of clusters, at the number of clusters that best reflects the data structure, Coleman should have a local maximum, while Harabasz and Davies-Bouldin should each have a local minimum.
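
Reading the criteria for local extrema can be sketched as below. The helper local_extrema is hypothetical and the numbers are made up purely for illustration; they are not program output. Only the rule itself (local maximum for Coleman, local minimum for Harabasz and Davies-Bouldin) comes from the description above.

```python
def local_extrema(ks, values, kind):
    """Return the K values at which `values` has a strict interior local extremum.

    kind is "max" or "min"; the first and last K cannot be local extrema here
    because they have only one neighbour.
    """
    if kind not in ("max", "min"):
        raise ValueError("kind must be 'max' or 'min'")
    sign = 1 if kind == "max" else -1
    found = []
    for i in range(1, len(values) - 1):
        # Flip the sign so that a local minimum is tested as a local maximum.
        here, left, right = (sign * values[j] for j in (i, i - 1, i + 1))
        if here > left and here > right:
            found.append(ks[i])
    return found


# Made-up criterion values for K = 2..6, for illustration only.
ks = [2, 3, 4, 5, 6]
coleman = [1.0, 2.0, 3.5, 2.5, 2.0]
harabasz = [9.0, 7.0, 4.0, 6.0, 8.0]

candidates_coleman = local_extrema(ks, coleman, "max")
candidates_harabasz = local_extrema(ks, harabasz, "min")
```

A K value on which the criteria agree is a natural candidate; when they disagree, inspecting the plot produced by the gnuplot script is the safer guide.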

Reference

Pattern Classification - Richard O. Duda, Peter E. Hart, David G. Stork (2001), Wiley, New York.

Author / Maintainer

Julien Bert, Guozhi Tao

Keywords

category 1: APPLICATIONS

Files

statistics.py, sxk_means_groups.py

Maturity

alpha: works, even if slowly.

Bugs

None. It is perfect.