Name

sxk_means - K-means classification of a set of images

Usage

Usage in command lines:

sxk_means.py stack outdir <maskfile> --K=10 --trials=2 --debug --maxit=100 --rand_seed=10 --crit='all' --init_method='rnd' --normalize --CTF --MPI

Usage in python programming:

k_means_main(stack, out_file, maskname,"SSE", K1, K2, rand_seed, maxit, trials, CTF=False, MPI=False, DEBUG=False, flagnorm=False)

Usage of MPI:

1. Set the --MPI flag on the command line.

2. Launch the program with mpirun, e.g. mpirun -np 4 sxk_means.py followed by the remaining parameters.

The above example is for mympi.

Example:

sxk_means.py hri_stack.hdf RES mask2d_23.hdf --K=128 --maxit=500 --crit="D"

mpirun -np 4 sxk_means.py bdd:hri_stack RES mask2d_23.hdf --K=128 --maxit=1000 --rand_seed=100 --MPI

Note 1: if the 2D input images have been aligned (see sxali2d), the program applies the 2D alignment parameters (xform.align2d) stored in the headers before clustering.

Note 2: CTF is not implemented.

Input

stack: The input stack of images
maskfile: optional mask file to be used
outdir: name of the directory where the results are written
K: The requested number of clusters (default 2).
trials: number of K-means trials (see Description below) (default one trial). The MPI version ignores --trials; the number of trials in the MPI version equals the number of CPUs used.
maxit: maximum number of iterations the program will perform (default 100)
CTF: if set, CTF information stored in file headers will be used (default no CTF).
rand_seed: the seed used to generate random numbers (if set to -1, a different pseudo-random seed is used each time)
crit: names of the criteria to use: 'all' for all criteria, 'C' for Coleman, 'H' for Harabasz, or 'D' for Davies-Bouldin. These criteria return values of classification quality; see also sxk_means_groups. Any combination is accepted, e.g., 'CD', 'HC', 'CHD'.
MPI: use the MPI version of K-means (default False). In the MPI version, the program is parallelized over the trials.
normalize: Normalize images under the mask
init_method: Method used to initialize partition: "rnd" randomize or "d2w" for d2 weighting initialization (default is rnd)
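The "d2w" option is not described further here. The sketch below assumes it refers to the D² weighting seeding known from k-means++: the first seed is drawn uniformly, and each further seed is drawn with probability proportional to its squared distance to the nearest seed already chosen. The function names and the plain-list data layout are hypothetical and are not taken from sxk_means.py.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def d2_weighting_seeds(data, k, rng):
    """Return k seed indices chosen by D^2 weighting (assumed meaning of 'd2w')."""
    seeds = [rng.randrange(len(data))]          # first seed: uniform draw
    d2 = [sq_dist(p, data[seeds[0]]) for p in data]
    while len(seeds) < k:
        if sum(d2) == 0.0:
            # every object coincides with a seed: fall back to a uniform draw
            nxt = rng.choice([i for i in range(len(data)) if i not in seeds])
        else:
            # draw one index with probability proportional to d2
            nxt = rng.choices(range(len(data)), weights=d2, k=1)[0]
        seeds.append(nxt)
        # keep, for every object, the squared distance to its nearest seed
        d2 = [min(old, sq_dist(p, data[nxt])) for old, p in zip(d2, data)]
    return seeds

if __name__ == "__main__":
    points = [(0.0, 0.0), (0.0, 0.1), (10.0, 10.0), (10.0, 10.1), (-10.0, 5.0)]
    print(d2_weighting_seeds(points, 3, random.Random(10)))
```

Objects far from all current seeds are more likely to be picked, which tends to spread the initial seeds over the data compared with the purely random "rnd" initialization.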

Output

outdir: The directory to which the averages and the variances of the K clusters are written. The classification charts are written to the logfile. Warning: if the output directory already exists, the program will abort with an error message. Please choose a different directory name and restart the program.

The program will write two kinds of image stack files:

the averages of each cluster (averages.hdf) and the
variance of each cluster (variances.hdf).

The averages have the following attributes set:

'Class_average': 1 (indicates that the image is a class average, not raw data),
'nobjects': number of objects in a given class,
'members': list of images assigned to this class.

The variances have the following attributes set:

'Class_average': 1 and
'nobjects'.

Description

The command implements SSE minimization methods and two different algorithms depending on the CTF flag. In each case, random initialization is used, i.e., initially, images are randomly assigned to K classes.
Minimization method: SSE (Sum of Squared Errors). K-means class averages are updated after the reassignment of each object.
The results of K-means classification are (in most cases) not reproducible, i.e., if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To obtain reproducible results, one is advised to repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. This process is facilitated by the trials parameter: the program repeats the classification the specified number of times and returns the best solution found.
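The procedure described above (random initial assignment, class averages updated after each reassignment, several trials with the best one kept) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the code of sxk_means.py: the function names are hypothetical, objects are plain lists of numbers instead of images, each object simply moves to the nearest class average, and the "best" trial is taken to be the one with the lowest SSE.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_sse_trial(data, k, maxit, rng):
    """One trial; returns (sse, assignment). Requires len(data) >= k."""
    n, dim = len(data), len(data[0])
    order = list(range(n))
    rng.shuffle(order)
    assign = [0] * n
    for pos, i in enumerate(order):
        # random partition; the first k shuffled objects keep every class non-empty
        assign[i] = pos if pos < k else rng.randrange(k)
    counts = [0] * k
    sums = [[0.0] * dim for _ in range(k)]
    for i, c in enumerate(assign):
        counts[c] += 1
        for d in range(dim):
            sums[c][d] += data[i][d]

    def mean(c):
        return [s / counts[c] for s in sums[c]]

    for _ in range(maxit):
        moved = 0
        for i in range(n):
            old = assign[i]
            if counts[old] == 1:
                continue  # never empty a class
            new = min(range(k), key=lambda c: sq_dist(data[i], mean(c)))
            if new != old:
                # update both affected class sums right after the move
                for d in range(dim):
                    sums[old][d] -= data[i][d]
                    sums[new][d] += data[i][d]
                counts[old] -= 1
                counts[new] += 1
                assign[i] = new
                moved += 1
        if moved == 0:
            break  # converged: a full pass without any reassignment
    sse = sum(sq_dist(data[i], mean(assign[i])) for i in range(n))
    return sse, assign

def best_of_trials(data, k, trials, maxit=100, rand_seed=10):
    """Repeat the classification and keep the trial with the lowest SSE."""
    rng = random.Random(rand_seed)
    results = [kmeans_sse_trial(data, k, maxit, rng) for _ in range(trials)]
    return min(results, key=lambda r: r[0])
```

Because every trial starts from a different random partition, individual trials may end in different local minima; keeping the lowest-SSE result over many trials is what makes the final answer more stable.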
The program calculates and returns values of classification quality; see sxk_means_groups.
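As an illustration of what such quality values measure, the sketch below computes two of the criteria named under crit using their usual textbook definitions: Harabasz (Calinski-Harabasz, larger is better) and Davies-Bouldin (smaller is better). The function names are hypothetical, and the exact formulas and normalization used by the program may differ; see sxk_means_groups for the program's own definitions.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(points):
    """Component-wise average of a non-empty list of vectors."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]

def harabasz(clusters):
    """[tr(B)/(K-1)] / [tr(W)/(N-K)] for a list of clusters (lists of vectors)."""
    k = len(clusters)
    n = sum(len(c) for c in clusters)
    grand = centroid([p for c in clusters for p in c])
    cents = [centroid(c) for c in clusters]
    # between-class and within-class scatter (traces only)
    tr_b = sum(len(c) * sq_dist(m, grand) for c, m in zip(clusters, cents))
    tr_w = sum(sq_dist(p, m) for c, m in zip(clusters, cents) for p in c)
    return (tr_b / (k - 1)) / (tr_w / (n - k))

def davies_bouldin(clusters):
    """Mean over classes of the worst (s_i + s_j) / d_ij ratio."""
    k = len(clusters)
    cents = [centroid(c) for c in clusters]
    # s_i: average distance of the members of class i to their centroid
    scatter = [sum(math.sqrt(sq_dist(p, m)) for p in c) / len(c)
               for c, m in zip(clusters, cents)]
    total = 0.0
    for i in range(k):
        total += max((scatter[i] + scatter[j]) / math.sqrt(sq_dist(cents[i], cents[j]))
                     for j in range(k) if j != i)
    return total / k

if __name__ == "__main__":
    good = [[(0.0, 0.0), (0.0, 1.0)], [(10.0, 0.0), (10.0, 1.0)]]
    print(harabasz(good), davies_bouldin(good))
```

Both functions assume at least two non-empty classes with distinct centroids and a non-zero within-class scatter.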
The program can also cluster a text file containing columns of numbers; the elements to cluster are indexed by row number. For example, if infile is a text file with N columns, then running the program with infile as the input instead of a stack of images makes sxk_means cluster on K columns, where K<N is determined by the maskfile. The elements of the i-th cluster are written to kmeans_grp_00i.txt in the output directory.
The maskfile has to be a binary file and determines which columns the program will cluster on. For example, if the input text file has 4 columns, the following produces a binary mask maskone.hdf for clustering on the first column of infile:
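from utilities import model_blank, drop_image  # SPARX helpers (module name assumed)
maskone = model_blank(4, bckg=0)     # 1D mask with all four columns switched off
maskone[0] = 1                       # switch on the first column only
drop_image(maskone, 'maskone.hdf')   # write the mask to disk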
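The effect of such a mask on a text table can be pictured with the following plain-Python sketch. It only illustrates the column selection described above; the helper names are hypothetical, whitespace-separated columns are an assumption, and the program's own file parsing may differ.

```python
def parse_table(text):
    """Parse whitespace-separated columns of numbers, one element per row."""
    return [[float(v) for v in line.split()]
            for line in text.splitlines() if line.strip()]

def select_columns(rows, mask):
    """Keep only the columns whose mask entry is non-zero."""
    keep = [j for j, m in enumerate(mask) if m]
    for row in rows:
        if len(row) != len(mask):
            raise ValueError("mask length must match the number of columns")
    return [[row[j] for j in keep] for row in rows]

if __name__ == "__main__":
    rows = parse_table("1 2 3 4\n5 6 7 8\n")
    # mask equivalent to maskone above: cluster on the first column only
    print(select_columns(rows, [1, 0, 0, 0]))
```

Each row of the reduced table is then one element to cluster, identified by its row number.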

Reference

Pattern Classification - Richard O.Duda, Peter E.Hart, David G.Stork (2001), Wiley, New York.

Author / Maintainer

Julien Bert, Guozhi Tao

Keywords

category 1: APPLICATIONS

Files

statistics.py, sxk_means.py

Maturity

beta: works for author, often works for others.

Bugs

HDF file: an HDF file limits the number of items contained in a header (~16000). If 'members' (the list of images assigned to a class) has more than 16000 elements, all assignments are automatically exported to text files: kmeans_grp_00.txt, kmeans_grp_01.txt, etc. Each file contains the list of IDs of the images assigned to that class.