Name
sxk_means_groups - determine 'best' number of clusters in the data using K-means classification of a set of images
Usage
Usage in command lines:
sxk_means_groups.py stackfile output_file <maskfile> --K1=Min_number_of_Clusters --K2=Max_number_of_Clusters --trials=Number_of_trials_of_K-means --CTF --rand_seed=1000 --maxit=Maximum_number_of_iterations --MPI --debug
Usage in python programming:
k_means_groups(stack, out_file, maskname,"SSE", K1, K2, rand_seed, maxit, trials, CTF, MPI=False, DEBUG=False, flagnorm=False)
MPI Note: the MPI version is parallelized over the trials.
- 1. Set the --MPI flag on the command line.
- 2. Run mpirun -np 32 sxk_means_groups.py followed by the remaining parameters.
- The above example is for mympi.
Example:
sxk_means_groups.py hri_stack.hdf RES mask2d_23.hdf --K1=2 --K2=10 --maxit=500 --trials=5
mpirun -np 5 sxk_means_groups.py bdd:hri_stack RES mask2d_23.hdf --K1=2 --K2=10 --maxit=1000 --rand_seed=100 --MPI
Note: if the 2D input images have been aligned (see sxali2d), the program applies the 2D alignment parameters (xform.align2d) stored in the image headers prior to clustering.
Input
- stackfile
- The input stack of images
- output_file
- text file in which the values of the clustering criteria are stored
- mask
- filename of the input image mask. Only pixels for which the mask value is > 0.5 are considered. Note: the mask must have the same dimensions as the input images (default = None, entire images are used)
- K1
- minimum requested number of clusters
- K2
- maximum requested number of clusters
- trials
- number of trials of K-means (see description below) (default: one trial). In the MPI version, the program ignores the --trials option and internally sets the number of trials to the number of CPUs used.
- CTF
- if set, CTF information stored in file headers will be used (default no CTF).
- rand_seed
- the seed used for generating random numbers (if set to -1, a different pseudo-random seed is used each time)
- MPI
- if set, the MPI version of k-means groups is used
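The mask rule above (only pixels whose mask value is > 0.5 are used) can be illustrated with a minimal sketch. It is not the program's code: masked_pixels is a hypothetical helper, and plain nested lists stand in for the image and mask objects the program actually reads.

```python
def masked_pixels(image, mask, threshold=0.5):
    """Return the pixels of `image` whose mask value exceeds `threshold`.

    `image` and `mask` are 2D lists of equal dimensions (hypothetical
    stand-ins for real image objects).
    """
    same_shape = len(image) == len(mask) and all(
        len(image_row) == len(mask_row) for image_row, mask_row in zip(image, mask)
    )
    if not same_shape:
        raise ValueError("mask must have the same dimensions as the image")
    return [
        pixel
        for image_row, mask_row in zip(image, mask)
        for pixel, weight in zip(image_row, mask_row)
        if weight > threshold  # strict comparison: a value of exactly 0.5 is excluded
    ]


image = [[1.0, 2.0, 3.0],
         [4.0, 5.0, 6.0]]
mask = [[0.0, 1.0, 1.0],
        [0.5, 0.6, 0.0]]
selected = masked_pixels(image, mask)
print(selected)
```

Each image is thereby reduced to a vector of the selected pixels, and clustering operates on those vectors.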
Output
- output_file
- text file containing columns according to the criteria chosen; for example, if crit='CHD', the columns are: (1) number of clusters, (2) values of the Coleman criterion, (3) values of the Harabasz criterion and (4) values of the Davies-Bouldin criterion
- output_file.p
- file containing a gnuplot script that plots the values of all criteria directly, using the same range. Use this command in gnuplot: load 'output_file.p'
- WATCH_GRP_KMEANS or WATCH_MPI_GRP_KMEANS
- file containing the progress of k-means groups. This file can be read in real time to watch the evolution of the criteria.
Description
- The command implements the Sum of Squared Errors (SSE) minimization method and two different algorithms depending on the CTF flag. In each case, random initialization is used, i.e., initially, images are randomly assigned to K classes.
Minimization methods: SSE - class averages are updated after reassignment of each object.
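The SSE scheme described above (random initial assignment, class averages updated after each reassignment) can be sketched as follows. This is a minimal illustration under stated assumptions, not the program's implementation: kmeans_sse and squared_distance are hypothetical names, points are plain tuples rather than images, a simple nearest-average rule decides each move, CTF is not handled, and the shuffled round-robin start is used only so that no class begins empty.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_sse(points, k, maxit=100, rand_seed=1000):
    """Sequential K-means: averages are updated after every single move."""
    rng = random.Random(rand_seed)
    dim = len(points[0])
    # Random initial assignment; round-robin labels are shuffled so that
    # every class starts non-empty (an assumption of this sketch).
    assign = [i % k for i in range(len(points))]
    rng.shuffle(assign)

    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for point, c in zip(points, assign):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += point[d]

    def mean(c):
        return [s / counts[c] for s in sums[c]]

    for _ in range(maxit):
        changed = False
        for i, point in enumerate(points):
            old = assign[i]
            if counts[old] == 1:
                continue  # never empty a class
            new = min(range(k), key=lambda c: squared_distance(point, mean(c)))
            if squared_distance(point, mean(new)) < squared_distance(point, mean(old)):
                # Move the point and update both class averages immediately.
                for d in range(dim):
                    sums[old][d] -= point[d]
                    sums[new][d] += point[d]
                counts[old] -= 1
                counts[new] += 1
                assign[i] = new
                changed = True
        if not changed:
            break

    means = [mean(c) for c in range(k)]
    sse = sum(squared_distance(p, means[c]) for p, c in zip(points, assign))
    return assign, means, sse


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
assign, means, sse = kmeans_sse(points, k=2)
print(assign, sse)
```

Because the averages change as soon as a point moves, later points in the same pass are already compared against the updated averages.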
- The results of K-means classification are (in most cases) irreproducible, i.e., if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To obtain reproducible results, one is advised to repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. This process is facilitated by the number of trials the user can provide: the program repeats the classification the specified number of times and returns the best solution found.
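The repeat-and-keep-the-best strategy can be sketched as below. This is an illustration only: a tiny batch K-means on 1D values (kmeans_1d, a hypothetical stand-in) replaces the real classification, the SSE serves as the criterion, and deriving each trial's seed as rand_seed + trial index is an assumption of this sketch, not a documented behaviour of the program.

```python
import random


def sse_of(values, assign, k):
    """Sum of squared errors of a 1D partition; empty classes contribute 0."""
    total = 0.0
    for c in range(k):
        members = [v for v, a in zip(values, assign) if a == c]
        if members:
            m = sum(members) / len(members)
            total += sum((v - m) ** 2 for v in members)
    return total


def kmeans_1d(values, k, seed, maxit=100):
    """One trial of a simple batch K-means; returns (sse, assignment)."""
    rng = random.Random(seed)
    assign = [i % k for i in range(len(values))]
    rng.shuffle(assign)  # random initial assignment
    for _ in range(maxit):
        means = []
        for c in range(k):
            members = [v for v, a in zip(values, assign) if a == c]
            # An empty class is re-seeded with a random data value.
            means.append(sum(members) / len(members) if members else rng.choice(values))
        new_assign = [min(range(k), key=lambda c: (v - means[c]) ** 2) for v in values]
        if new_assign == assign:
            break
        assign = new_assign
    return sse_of(values, assign, k), assign


def best_of_trials(values, k, trials, rand_seed=1000):
    """Run `trials` independent trials and keep the one with the lowest SSE."""
    results = [kmeans_1d(values, k, seed=rand_seed + t) for t in range(trials)]
    return min(results, key=lambda result: result[0])


values = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 9.0, 9.1, 9.2]
best_sse, best_assign = best_of_trials(values, k=3, trials=5)
print(best_sse, best_assign)
```

With a fixed rand_seed the whole set of trials is repeatable, while each trial still starts from a different random assignment.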
The program calculates and returns values of the clustering quality criteria: Coleman, Harabasz, or Davies-Bouldin. When plotted against the number of clusters, for the number of clusters best reflecting the data structure, Coleman should have a local maximum, while Harabasz and Davies-Bouldin should each have a local minimum.
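Locating such local extrema in a criterion column can be sketched as follows. The helper local_extrema is hypothetical, and the numbers are made up purely for illustration; they are not output of the program. Only interior values of K are tested, since the end points K1 and K2 have a neighbour on one side only.

```python
def local_extrema(ks, values, kind="max"):
    """Return the K values at which `values` has a strict interior local extremum."""
    if kind == "max":
        better = lambda a, b: a > b
    elif kind == "min":
        better = lambda a, b: a < b
    else:
        raise ValueError("kind must be 'max' or 'min'")
    return [
        ks[i]
        for i in range(1, len(values) - 1)
        if better(values[i], values[i - 1]) and better(values[i], values[i + 1])
    ]


# Illustrative, made-up criterion values for K = 2..6.
ks = [2, 3, 4, 5, 6]
coleman = [1.0, 1.8, 2.6, 2.1, 1.9]
davies_bouldin = [0.9, 0.7, 0.4, 0.6, 0.8]

coleman_peaks = local_extrema(ks, coleman, kind="max")
davies_bouldin_dips = local_extrema(ks, davies_bouldin, kind="min")
print(coleman_peaks, davies_bouldin_dips)
```

A number of clusters at which the criteria agree is a natural candidate for the 'best' K; in practice the curves are noisier than in this toy example and should also be inspected visually with the gnuplot script.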
Reference
Pattern Classification - Richard O. Duda, Peter E. Hart, David G. Stork (2001), Wiley, New York.
Author / Maintainer
Julien Bert, Guozhi Tao
Keywords
- category 1
- APPLICATIONS
Files
statistics.py, sxk_means_groups.py
See also
Maturity
- alpha
- works, even if slowly.
Bugs
None. It is perfect.