Name

sxk_means_stable - Collect stable class averages with several independent runs of k-means

Usage

Usage in command line:

sxk_means_stable.py stack outdir <maskfile> --K=2 --nb_part=5 --th_nobj=10 --rand_seed=10 --maxit=1000 --normalize --CTF --MPI

Usage in python programming:

Normal version:

k_means_stab_stream(stack, outdir, maskfile, K, nb_part, th_nobj, rand_seed, CTF)

SSE is the optimization method we recommend.

MPI version:

k_means_stab_MPI_stream(stack, outdir, maskfile, K, nb_part, th_nobj, rand_seed, CTF)

Note: when the 2D input images have been aligned (see sxali2d), the program applies the 2D alignment parameters (xform.align2d) stored in the headers before clustering.

MPI Note: the MPI version is under development.

To use the parallel MPI version:
1. set the flag --MPI in the command line
2. run mpirun -np 32 sxk_means_stable.py followed by the remaining parameters
The above example is for mympi.

Examples:

sxk_means_stable.py data.hdf kmeans_stab mask2d_26.hdf --K=8 --nb_part=5

mpirun -np 5 sxk_means_stable.py 'bdb:data' kmeans_stab mask2d_26.hdf --K=8 --th_nobj=5 --MPI

Input

stack: the input stack images (bdb, hdf or txt)
outdir: name of directory where the results are written
maskfile: optional mask file to be used (bdb or hdf)
K: the requested number of clusters (default 2)
nb_part: number of partitions used to select stable averages (default 5). In the MPI version, nb_part is determined internally by the number of CPUs used.
th_nobj: minimum number of objects per class average required in the stable partition. Classes with fewer than th_nobj images are not transferred to the final partition (default 1, meaning keep all averages)
rand_seed: the seed used to generate random numbers (set to 0)
CTF: if set, CTF information stored in file headers will be used (default: no CTF).
MPI: if set, use the MPI version of k_means_stable

Output

outdir/main_log.txt: the main log file; every step is recorded so that the progress of the program can be followed
outdir/averages.hdf: the final stable class averages
outdir/averages_**.hdf: intermediate class averages produced by the independent runs of clustering. '**' is the index of the partition, formatted as '00'. If the number of partitions is 5, for example, the directory outdir/ will contain averages_00.hdf through averages_04.hdf
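
As a small illustration of the naming scheme above, the following Python sketch lists the intermediate files one would expect for a given number of partitions. The helper name intermediate_average_names is hypothetical and is not part of the program; it only reproduces the 'averages_00.hdf' pattern described here.

```python
def intermediate_average_names(outdir, nb_part):
    """Return the expected intermediate average files, one per partition.

    Hypothetical helper: it only mirrors the 'averages_**.hdf' pattern
    from the Output section, with a two-digit, zero-based index.
    """
    return ["%s/averages_%02d.hdf" % (outdir, i) for i in range(nb_part)]


names = intermediate_average_names("kmeans_stab", 5)
print(names[0])   # kmeans_stab/averages_00.hdf
print(names[-1])  # kmeans_stab/averages_04.hdf
```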

Txt case

This function can also read a text file as input. The file must contain one 'image' per line, with the values of each image separated by spaces, e.g.:

0.34 5.46 2.34 6.78

3.78 2.23 1.78 5.67

This file contains 2 'images' of 4 'pixels'. Because the data are converted to an image structure, an hdf file can still be used to mask them. The output directory, however, will contain only text files with the membership of each group for all partitions:

outdir/k_means_grp_***.txt: the final stable membership. For each group *** (formatted as '000'), the file stores the list of ids of the images in the group. If K is equal to 3, there will be three files: k_means_grp_000.txt, k_means_grp_001.txt and k_means_grp_002.txt
outdir/k_means_part_**_grp_***.txt: intermediate membership produced by the independent runs of clustering. For each partition ** (formatted as '00'), one file is produced per group, containing the list of ids of the images.
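
The text input layout described in this section (one 'image' per line, values separated by spaces) can be read with a few lines of Python. This is only a sketch and not the program's own reader: the function name parse_text_stack is hypothetical, and skipping blank lines and rejecting rows of unequal length are assumptions.

```python
def parse_text_stack(lines):
    """Parse an iterable of text lines into a list of 'images'.

    Each non-empty line is one image; its values are split on whitespace.
    Assumptions (not confirmed by the program): blank lines are ignored
    and every image must have the same number of 'pixels'.
    """
    images = []
    for line in lines:
        fields = line.split()
        if not fields:  # ignore blank lines
            continue
        images.append([float(value) for value in fields])
    if images and any(len(row) != len(images[0]) for row in images):
        raise ValueError("all lines must contain the same number of values")
    return images


sample = ["0.34 5.46 2.34 6.78\n", "\n", "3.78 2.23 1.78 5.67\n"]
images = parse_text_stack(sample)
print(len(images), len(images[0]))  # 2 4
```

An open file object can be passed directly in place of the list, since iterating over a file yields its lines.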

Description

K_means_stable uses the function k_means to run several independent clusterings of the data set. For complete details of the k-means parameters (K, the method option, ...) see the page of sxk_means. The random seed of each clustering is written to main_log.txt. After nb_part clusterings have been run, all partitions are matched against each other to compare their memberships. If there are two partitions, the matching is done with the optimal Hungarian algorithm; if there are more than two, an in-house branching algorithm is used. Images that fall in the same cluster every time are kept to build the stable averages. A stability coefficient, in percent, is written to main_log.txt; its value reflects the similarity between the memberships. If the number of images in a stable group is below th_nobj, those images are considered unused and the corresponding average is removed from the final stable class averages. th_nobj thus makes it possible to discard averages that contain too few images. The number of final averages is also written to main_log.txt.
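
The two-partition case can be illustrated with a minimal Python sketch. It is not the program's implementation: the function names are hypothetical, the matching is a brute-force search over all pairings (a stand-in for the Hungarian algorithm that gives the same optimal pairing but is practical only for small K), and the stability coefficient is assumed here to be the percentage of images that stay in matched clusters, computed before the th_nobj filter.

```python
from itertools import permutations


def match_two_partitions(part_a, part_b):
    """Pair the clusters of two partitions so that total overlap is maximal.

    Each partition is a list of K sets of image ids. All K! pairings are
    tried; this replaces the Hungarian algorithm for illustration only.
    Returns a tuple perm where cluster i of part_a is paired with
    cluster perm[i] of part_b.
    """
    K = len(part_a)
    best_perm, best_score = None, -1
    for perm in permutations(range(K)):
        score = sum(len(part_a[i] & part_b[perm[i]]) for i in range(K))
        if score > best_score:
            best_perm, best_score = perm, score
    return best_perm


def stable_groups(part_a, part_b, th_nobj=1):
    """Return (stable groups kept, stability in percent).

    A stable group holds the images found in both matched clusters.
    Groups with fewer than th_nobj images are dropped. The definition of
    the stability value is an assumption made for this sketch.
    """
    perm = match_two_partitions(part_a, part_b)
    stable = [part_a[i] & part_b[j] for i, j in enumerate(perm)]
    n_total = sum(len(group) for group in part_a)
    stability = 100.0 * sum(len(group) for group in stable) / n_total
    kept = [group for group in stable if len(group) >= th_nobj]
    return kept, stability


# Two runs over 8 images with K=2; the cluster labels are swapped in run_b.
run_a = [{0, 1, 2, 3}, {4, 5, 6, 7}]
run_b = [{4, 5, 6, 3}, {0, 1, 2, 7}]
kept, stability = stable_groups(run_a, run_b, th_nobj=1)
print([sorted(group) for group in kept])  # [[0, 1, 2], [4, 5, 6]]
print(stability)                          # 75.0
```

Images 3 and 7 change cluster between the two runs, so they are left out of the stable groups. Raising th_nobj to 4 in this example would discard both groups, since each holds only three images.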

Reference

This program uses the Munkres algorithm (also known as the Hungarian algorithm) when the number of partitions is two: http://www.clapper.org/bmc, BSD-like licence, copyright (c) 2008 Brian M. Clapper. Otherwise the in-house branching algorithm is used.

Author / Maintainer

Julien Bert

Keywords

Files

applications.py

Maturity

beta: works for author, often works for others.

Bugs

HDF files limit the number of attributes; see the bug listed for sxk_means.