Name
sxk_means - K-means classification of a set of images
Usage
Usage in command lines:
sxk_means.py stack outdir <maskfile> --K=10 --trials=2 --debug --maxit=100 --rand_seed=10 --crit='all' --init_method='rnd' --normalize --CTF --MPI
Usage in python programming:
k_means_main(stack, out_file, maskname,"SSE", K1, K2, rand_seed, maxit, trials, CTF=False, MPI=False, DEBUG=False, flagnorm=False)
Usage of MPI:
1. Set the --MPI flag on the command line.
2. Launch the program with mpirun, e.g. mpirun -np 4 sxk_means.py followed by the remaining parameters.
The above example is for mympi.
Example:
sxk_means.py hri_stack.hdf RES mask2d_23.hdf --K=128 --maxit=500 --crit="D"
mpirun -np 4 sxk_means.py bdd:hri_stack RES mask2d_23.hdf --K=128 --maxit=1000 --rand_seed=100 --MPI
Note 1: if the 2D input images have been aligned (see sxali2d), the program applies the 2D alignment parameters (xform.align2d) stored in the image headers before clustering.
Note 2: CTF is not implemented.
Input
- stack
- The input stack of images
- maskfile
- optional mask file to be used
- outdir
- name of the directory where the results are written
The parameters preceded by -- are optional; default values are given in parentheses.
--crit: names of the criteria used: 'all' for all criteria, 'C' for Coleman, 'H' for Harabasz, or 'D' for Davies-Bouldin. These criteria return values of classification quality; see also sxk_means_groups. Any combination is accepted, e.g., 'CD', 'HC', 'CHD'.
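As a rough illustration of what the three criteria measure, here is a minimal NumPy sketch. It assumes the textbook definitions (Coleman as tr(B)*tr(W), Harabasz as [tr(B)/(K-1)]/[tr(W)/(N-K)], and the usual Davies-Bouldin ratio); the exact formulas used by sxk_means may differ. The function name quality_criteria and its arguments are hypothetical and not part of the program.

```python
import numpy as np

def quality_criteria(data, assign, crit="all"):
    """Return a dict of quality values for the requested criteria.

    data   : (N, d) float array, one object per row
    assign : (N,) integer class labels
    crit   : 'all' or any combination of the letters 'C', 'H', 'D'
    Assumes at least two classes and a non-zero within-class scatter.
    """
    labels = np.unique(assign)
    K, N = len(labels), len(data)
    grand_mean = data.mean(axis=0)
    means = np.array([data[assign == k].mean(axis=0) for k in labels])
    sizes = np.array([(assign == k).sum() for k in labels])

    # Traces of the within-class (W) and between-class (B) scatter matrices
    tr_w = sum(((data[assign == k] - m) ** 2).sum() for k, m in zip(labels, means))
    tr_b = (sizes * ((means - grand_mean) ** 2).sum(axis=1)).sum()

    wanted = "CHD" if crit == "all" else crit
    out = {}
    if "C" in wanted:  # Coleman (assumed definition)
        out["C"] = tr_b * tr_w
    if "H" in wanted:  # Harabasz (assumed definition)
        out["H"] = (tr_b / (K - 1)) / (tr_w / (N - K))
    if "D" in wanted:  # Davies-Bouldin (assumed definition)
        spread = np.array([
            np.linalg.norm(data[assign == k] - m, axis=1).mean()
            for k, m in zip(labels, means)
        ])
        total = 0.0
        for i in range(K):
            total += max(
                (spread[i] + spread[j]) / np.linalg.norm(means[i] - means[j])
                for j in range(K) if j != i
            )
        out["D"] = total / K
    return out
```

Under these definitions, larger Harabasz values and smaller Davies-Bouldin values indicate more compact, better separated classes.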
Output
- outdir
- The directory to which the averages of the K clusters and their variances are written. The classification charts are written to the logfile. Warning: if the output directory already exists, the program will stop with an error message. In that case, choose a different directory name and restart the program.
The program will write two kinds of image stack files:
the averages of each cluster (averages.hdf) and the
variance of each cluster (variances.hdf).
The averages have the following attributes set:
- 'Class_average': 1 (indicates that the image is a class average, not raw data),
- 'nobjects': number of objects in a given class,
- 'members': list of images assigned to this class.
The variances have the following attributes set:
- 'Class_average': 1 and
- 'nobjects'.
Description
- The command implements SSE minimization, with two different algorithms depending on the CTF flag. In each case random initialization is used, i.e., images are initially assigned to the K classes at random.
Minimization method: SSE (Sum of Squared Errors). In this K-means variant, the class averages are updated after the reassignment of each object.
- The results of K-means classification are, in most cases, not reproducible: if classification is repeated for the same number of classes but with a different initial assignment (as in this implementation), the result will be different. To obtain reproducible results, repeat K-means many times and accept the 'best' solution, as identified by the criterion value. For a sufficiently large number of trials and reasonable data, it is possible to find the optimum solution. The trials option supports this: the program repeats the classification the specified number of times and returns the best solution found.
The program calculates and returns values of classification quality; see sxk_means_groups.
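The procedure described above (random initial assignment, class averages updated after each reassignment, several trials with the best one kept) can be sketched in plain NumPy as follows. This is an illustration only, not the sxk_means implementation: the function names sse, kmeans_once and kmeans_trials are hypothetical, a simple nearest-average rule is assumed for reassignment, and the SSE itself is used to pick the best trial.

```python
import numpy as np

def sse(data, assign, K):
    """Sum of squared distances of every object to its class average."""
    total = 0.0
    for k in range(K):
        members = data[assign == k]
        if len(members):
            total += ((members - members.mean(axis=0)) ** 2).sum()
    return total

def kmeans_once(data, K, maxit, rng):
    """One K-means run; data is an (n, d) float array with n >= K."""
    n = len(data)
    # Random initial assignment; a shuffled round-robin keeps every class non-empty
    assign = rng.permutation(n) % K
    counts = np.bincount(assign, minlength=K).astype(float)
    sums = np.zeros((K, data.shape[1]))
    np.add.at(sums, assign, data)
    for _ in range(maxit):
        changed = False
        for i in rng.permutation(n):
            old = assign[i]
            if counts[old] == 1:
                continue  # do not empty a class
            averages = sums / counts[:, None]
            new = int(np.argmin(((averages - data[i]) ** 2).sum(axis=1)))
            if new != old:
                # Move the object and update both class averages immediately
                sums[old] -= data[i]
                counts[old] -= 1
                sums[new] += data[i]
                counts[new] += 1
                assign[i] = new
                changed = True
        if not changed:
            break
    return assign, sse(data, assign, K)

def kmeans_trials(data, K, trials=5, maxit=100, rand_seed=10):
    """Repeat K-means 'trials' times and keep the run with the lowest SSE."""
    rng = np.random.default_rng(rand_seed)
    best = None
    for _ in range(trials):
        assign, value = kmeans_once(data, K, maxit, rng)
        if best is None or value < best[1]:
            best = (assign, value)
    return best
```

A call such as kmeans_trials(data, K=10, trials=2, maxit=100, rand_seed=10) mirrors the roles of the corresponding command-line options. Fixing the seed makes a single run repeatable, while more trials reduce the dependence on any one initial assignment.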
The program can also cluster a text file containing columns of numbers; the elements to cluster are indexed by row number. For example, if infile is a text file with N columns, then running the program with infile as input instead of a stack of images makes sxk_means cluster on K columns, where K<N is determined by the maskfile. The elements of the i-th cluster are written to kmeans_grp_00i.txt in the output directory.
- The maskfile has to be a binary mask; it determines which columns the program will cluster on. For example, if the input text file has 4 columns, the following will produce a binary mask maskone.hdf for clustering based on the first column of infile:
- maskone = model_blank(4,bckg=0)
- maskone[0]=1
- drop_image(maskone,'maskone.hdf')
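To make the effect of such a mask concrete, the following NumPy sketch shows how a binary mask can select the columns of a text table to cluster on. It only illustrates the idea and does not use the SPARX routines; the function name select_columns and the sample numbers are made up for the example.

```python
import io
import numpy as np

def select_columns(table, mask):
    """Keep only the columns of 'table' where the binary mask is non-zero."""
    return table[:, np.asarray(mask).astype(bool)]

# A made-up text file with 3 rows (elements to cluster) and 4 columns
text = io.StringIO("1.0 5.0 0.1 9.0\n1.1 7.0 0.2 3.0\n8.0 6.0 0.3 4.0\n")
table = np.loadtxt(text)

# Same role as maskone above: cluster on the first column only
mask = [1, 0, 0, 0]
selected = select_columns(table, mask)
```

Here selected has one column per non-zero mask entry and one row per element, so each row is what the clustering would see for that element.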
Reference
Richard O. Duda, Peter E. Hart, David G. Stork (2001). Pattern Classification. Wiley, New York.
Author / Maintainer
Julien Bert, Guozhi Tao
Keywords
- category 1
- APPLICATIONS
Files
statistics.py, sxk_means.py
See also
sxk_means_groups, sxk_means_stable
Maturity
- beta
- works for author, often works for others.
Bugs
HDF file: an HDF file has a limit on the number of items contained in a header (~16000). If 'members' (the list of images assigned to a class) has more than 16000 elements, all assignments are automatically exported to text files: kmeans_grp_00.txt, kmeans_grp_01.txt, etc. Each file contains the list of IDs of the images assigned to that class.
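When the assignments end up in text files, they can be read back with a few lines of Python. The sketch below assumes that each kmeans_grp_*.txt file holds whitespace-separated numeric image IDs; the actual layout written by the program may differ, and the function name read_members is hypothetical.

```python
import glob
import os

def read_members(outdir):
    """Map each kmeans_grp_*.txt file name in 'outdir' to its list of image IDs.

    Assumption: IDs are whitespace-separated numbers (e.g. one per line).
    """
    members = {}
    for path in sorted(glob.glob(os.path.join(outdir, "kmeans_grp_*.txt"))):
        with open(path) as handle:
            # int(float(...)) also accepts IDs that were written as '12.0'
            ids = [int(float(token)) for token in handle.read().split()]
        members[os.path.basename(path)] = ids
    return members
```

For example, read_members("RES") would return a dictionary keyed by file name, with one list of image IDs per class.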