Muster
// -*- c++ -*-

/*!
 @mainpage

 \image html kmedoids.png

 <h2>Introduction</h2>

 The Muster library provides implementations of sequential and parallel K-Medoids
 clustering algorithms. It is intended as a general framework for parallel
 cluster analysis, particularly for performance data analysis on systems with
 very large numbers of processes.

 The parallel implementations in Muster are designed to perform well even
 in environments where the data to be clustered is entirely distributed. For
 example, many performance tools need to analyze one data element from each
 process in a system. To analyze this data efficiently, clustering algorithms
 that move as little data as possible are required. In Muster, we exploit
 sampled clustering algorithms to realize this efficiency.

 The parallel algorithms in Muster are implemented using the Message Passing
 Interface (MPI), making them suitable for use on many of the world's largest
 supercomputers. They should, however, also run efficiently on your laptop.

 <h2>Getting Started</h2>

 <h3>Partitions</h3>
 The algorithms in Muster are <i>partitioning algorithms</i>; that is, they divide
 a data set into a set of groups, or <i>clusters</i>, of data objects. Together,
 these groups are called a <i>clustering</i> or a <i>partitioning</i> of the data.

 There are two classes that represent clusterings. These are as follows:
 <dl>
 <dt>\link cluster::partition partition\endlink</dt>
 <dd>This class represents a partitioning of a data set. It stores the clusters in the
 data set along with <i>representatives</i> from each of the clusters. It also stores,
 for every object in the data set, the cluster to which that object has been assigned.

 A \link cluster::partition partition\endlink is entirely local to the process that owns it. It exists
 in one memory space and contains data about <i>all</i> objects in the data set it describes.
 </dd>

 <dt>\link cluster::par_partition par_partition\endlink</dt>
 <dd>This class is similar to \link cluster::partition partition\endlink, but it is a distributed
 data structure. A \link cluster::par_partition par_partition\endlink object represents the
 results of a parallel clustering algorithm.
 Instead of storing the cluster assignments of <i>all</i> objects in the data set, each
 par_partition object stores only the cluster membership for objects that are local to the
 calling process.
 </dd>
 </dl>

 Note that par_partition does not inherit from partition, because the classes are structurally
 different. One is a local, centralized data structure whose state represents all data in the
 set, while the other is a distributed structure that represents only a part of the full data set.

 If you need to, you can convert a par_partition to a partition with the
 \link cluster::par_partition::gather() par_partition::gather()\endlink method, but the two
 classes are not interchangeable.
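
 As a minimal sketch of that conversion: the gather() signature (assumed here to
 take a destination partition and a root rank) and the header names are
 assumptions; check the par_partition class reference for the exact interface.

 \code
 #include "partition.h"
 #include "par_partition.h"

 // Collect a distributed clustering result onto one process.
 void collect_on_root(cluster::par_partition& distributed) {
   cluster::partition full;      // local, centralized clustering
   distributed.gather(full, 0);  // assumed: gather(destination, root rank)
   // After this call, the root's copy of 'full' describes every object
   // in the data set, not just the objects local to this process.
 }
 \endcode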

 <h3>Clustering Algorithms</h3>
 Classes for clustering algorithms derive from either \link cluster::partition partition\endlink
 or \link cluster::par_partition par_partition\endlink. The algorithms themselves are methods on these derived
 classes, and they store their output in the class. This allows all (or at least most) of
 the state for the algorithms and their output to be encapsulated in a single class.

 Algorithms themselves are template functions on the derived classes. You can thus call them
 on any type of object, with any distance metric. Because they are template
 functions, you don't need to explicitly indicate the types of the things you pass to the
 clustering algorithms; the types are inferred from the algorithms' parameters.
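
 For example, a dissimilarity measure over a simple two-dimensional point type
 might look like the following. The <code>point</code> struct and
 <code>euclidean_distance</code> functor are purely illustrative and are not part
 of Muster; any type, paired with any callable metric, will do.

 \code
 #include <cmath>

 // An illustrative data type to be clustered.
 struct point {
   double x, y;
 };

 // A callable dissimilarity measure: plain Euclidean distance.  Any
 // function or functor with this call signature can be passed to the
 // clustering algorithms.
 struct euclidean_distance {
   double operator()(const point& a, const point& b) const {
     return std::sqrt((a.x - b.x) * (a.x - b.x) +
                      (a.y - b.y) * (a.y - b.y));
   }
 };
 \endcode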

 There are two classes you should be concerned with:

 <dl>
 <dt>\link cluster::kmedoids kmedoids\endlink</dt>
 <dd>This class inherits from \link cluster::partition partition\endlink and contains implementations
 of the PAM and CLARA sequential clustering algorithms (Kaufman and Rousseeuw, 1990).
 PAM is an \f$O(n^2)\f$, optimal K-medoids algorithm, and CLARA is an \f$O(n)\f$ implementation
 that executes multiple trials of PAM. These algorithms are implemented in the
 \link cluster::kmedoids::pam() pam()\endlink and \link cluster::kmedoids::clara() clara()\endlink methods.

 PAM and CLARA both require that the caller supply \f$k\f$, the number of clusters, as a
 parameter to the algorithm. We have adopted a technique used by the
 X-Means (Pelleg and Moore, 2000) algorithm to choose an "ideal" \f$k\f$ from many clustering
 trials using the \link bic.h Bayesian Information Criterion (BIC)\endlink. Instead of supplying a
 specific \f$k\f$, the caller supplies a range of values for \f$k\f$, and the algorithms
 use the BIC to select the best fit from the range.

 The BIC variants are implemented in
 \link cluster::kmedoids::xpam() xpam()\endlink and \link cluster::kmedoids::xclara() xclara()\endlink.
 They will be slower than the fixed-k versions, as they may require that many runs of PAM or CLARA
 be evaluated to find the best \f$k\f$. A usage sketch for this class and for par_kmedoids
 follows this list.
 </dd>

 <dt>\link cluster::par_kmedoids par_kmedoids\endlink</dt>
 <dd>This class inherits from \link cluster::par_partition par_partition\endlink and implements the CAPEK
 parallel clustering algorithm. Functionally, CAPEK is similar to CLARA, but it is
 distributed and runs in \f$O(\frac{n}{P}\log(P))\f$ time for \f$n\f$ data objects and \f$P\f$
 processes. If \f$n = P\f$, that is, if there are only as many input data elements as processes,
 CAPEK runs in \f$O(\log(P))\f$ time.

 The fixed-k version of CAPEK is implemented in \link cluster::par_kmedoids::capek() capek()\endlink,
 and a variant that uses the BIC to select the best \f$k\f$ is in \link
 cluster::par_kmedoids::xcapek() xcapek()\endlink.
 </dd>
 </dl>
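
 The sketch below shows roughly how these two classes are used with the
 <code>point</code> type and <code>euclidean_distance</code> functor from the
 earlier example. The exact parameter lists of clara(), xclara(), and capek(),
 in particular the dimensionality argument assumed for the BIC variant and the
 communicator argument assumed for par_kmedoids, should be verified against the
 class references; this illustrates the calling style, not a definitive interface.

 \code
 #include <vector>
 #include <mpi.h>
 #include "kmedoids.h"
 #include "par_kmedoids.h"

 void sequential_example(const std::vector<point>& data) {
   cluster::kmedoids km;

   // Fixed k: ask CLARA for 8 clusters.
   // Assumed signature: clara(objects, dissimilarity, k).
   km.clara(data, euclidean_distance(), 8);

   // Or let the BIC choose k.  Assumed signature: the BIC variants also
   // take the maximum k to try and the dimensionality of the data (2 here).
   km.xclara(data, euclidean_distance(), 16, 2);

   // km is a partition: it now records a cluster assignment for every
   // object in 'data', along with the medoid of each cluster.
 }

 void parallel_example(const std::vector<point>& my_local_data) {
   // Assumes MPI_Init() has already been called, and that par_kmedoids
   // is constructed over an MPI communicator.
   cluster::par_kmedoids parkm(MPI_COMM_WORLD);

   // CAPEK with a fixed k of 8 over the union of all processes' data.
   parkm.capek(my_local_data, euclidean_distance(), 8);

   // parkm is a par_partition: it records cluster membership only for
   // the objects in 'my_local_data' owned by this process.
 }
 \endcode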

 <h3>Dissimilarity Functions and Matrices</h3>
 Most of the algorithms here require some sort of dissimilarity metric to run. As with
 the STL, you can use any callable object (function or functor) as a distance function.
 See the documentation for \link cluster::par_kmedoids::xcapek() xcapek()\endlink for an
 example of a dissimilarity functor.
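
 Since anything callable is accepted, a plain function works just as well as the
 functor shown earlier. For instance (illustrative only, not part of Muster):

 \code
 #include <cmath>

 // A plain function used as a dissimilarity measure.
 double manhattan_distance(const point& a, const point& b) {
   return std::fabs(a.x - b.x) + std::fabs(a.y - b.y);
 }
 \endcode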

 The PAM algorithm, which is the basis for all the algorithms in this package, requires a
 precomputed dissimilarity matrix in order to run efficiently. Given a set of \f$n\f$
 objects, a dissimilarity matrix is a triangular, \f$n \times n\f$ matrix \f$D\f$ where
 each element \f$D_{ij}\f$ represents the distance between the \f$i^{th}\f$ and \f$j^{th}\f$
 objects in the data set. It takes \f$O(n^2)\f$ time to compute a distance matrix like this.

 We use the <code>boost::symmetric_matrix</code> class to represent dissimilarity matrices. This class
 provides a packed representation of an upper-triangular matrix, making it an efficient way to
 store a dissimilarity matrix for clustering. For convenience, we have typedef'd
 <code>boost::symmetric_matrix</code> to \link cluster::dissimilarity_matrix\endlink.
 To construct a dissimilarity matrix, use \link cluster::build_dissimilarity_matrix()\endlink.

 PAM is the only algorithm in
 the package that requires the user to pass in the matrix explicitly. This is for
 efficiency reasons. A user (or another algorithm) may want to call PAM many times
 using the same dissimilarity matrix, and with this design, the user can build
 the matrix once and then call PAM repeatedly without paying the (potentially very
 high) cost of rebuilding the matrix.
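
 A rough sketch of that pattern is shown below. It assumes that
 build_dissimilarity_matrix() fills a \link cluster::dissimilarity_matrix\endlink
 from a container of objects and a dissimilarity measure, and that
 \link cluster::kmedoids::pam() pam()\endlink takes the matrix and \f$k\f$; the
 header names and exact signatures are assumptions to be checked against the reference.

 \code
 #include <cstddef>
 #include <vector>
 #include "dissimilarity.h"
 #include "kmedoids.h"

 void pam_example(const std::vector<point>& data) {
   // Pay the O(n^2) construction cost once.
   // Assumed signature: build_dissimilarity_matrix(objects, dissimilarity, out).
   cluster::dissimilarity_matrix distances;
   cluster::build_dissimilarity_matrix(data, euclidean_distance(), distances);

   // Then run PAM as many times as needed against the same matrix,
   // for example to compare several values of k by hand.
   for (size_t k = 2; k <= 8; ++k) {
     cluster::kmedoids km;
     km.pam(distances, k);  // assumed signature: pam(matrix, k)
     // Inspect km here: it holds the k-cluster partition for this k.
   }
 }
 \endcode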


 <h2>Author</h2>
 Muster was implemented by Todd Gamblin at
 <a href="http://www.llnl.gov">Lawrence Livermore National Laboratory</a>.

 Please send questions, bug reports, or suggestions
 to <a href="mailto:tgamblin@llnl.gov">tgamblin@llnl.gov</a>.


 <h2>References</h2>
 For more details on the algorithms implemented in Muster, you can consult the following
 sources:

 <ol>
 <li>For more on CAPEK:
 <p>
 Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed.
 <a href="http://www.cs.unc.edu/~tgamblin/pubs/scalable-cluster-ics10.pdf">
 <b>Clustering Performance Data Efficiently at Massive Scales</b></a>.
 <i>Proceedings of the International Conference on Supercomputing (ICS'10)</i>,
 Tsukuba, Japan, June 1-4, 2010.

 <li>For more on X-Means and the Bayesian Information Criterion:
 <p>
 Dan Pelleg and Andrew Moore. <a href="http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf">
 <b>X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters</b></a>.
 <i>Proceedings of the Seventeenth International Conference on Machine Learning</i>,
 San Francisco, CA, June 29-July 2, 2000. pp. 727-734.

 <li>For more on PAM and CLARA:
 <p>
 Leonard Kaufman and Peter J. Rousseeuw.
 <b><a href="http://www.amazon.com/Finding-Groups-Data-Introduction-Probability/dp/0471735787">Finding Groups in Data: An Introduction to Cluster Analysis</a></b>. John Wiley & Sons, Inc., New York, 1990.

 </ol>
*/
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.