Muster
 All Classes Namespaces Files Functions Variables Typedefs Macros
kmedoids.h
Go to the documentation of this file.
1 //////////////////////////////////////////////////////////////////////////////////////////////////
2 // Copyright (c) 2010, Lawrence Livermore National Security, LLC.
3 // Produced at the Lawrence Livermore National Laboratory
4 // LLNL-CODE-433662
5 // All rights reserved.
6 //
7 // This file is part of Muster. For details, see http://github.com/tgamblin/muster.
8 // Please also read the LICENSE file for further information.
9 //
10 // Redistribution and use in source and binary forms, with or without modification, are
11 // permitted provided that the following conditions are met:
12 //
13 // * Redistributions of source code must retain the above copyright notice, this list of
14 // conditions and the disclaimer below.
15 // * Redistributions in binary form must reproduce the above copyright notice, this list of
16 // conditions and the disclaimer (as noted below) in the documentation and/or other materials
17 // provided with the distribution.
18 // * Neither the name of the LLNS/LLNL nor the names of its contributors may be used to endorse
19 // or promote products derived from this software without specific prior written permission.
20 //
21 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
22 // OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
23 // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
24 // LAWRENCE LIVERMORE NATIONAL SECURITY, LLC, THE U.S. DEPARTMENT OF ENERGY OR CONTRIBUTORS BE
25 // LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
26 // (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
27 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
28 // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
29 // ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 //////////////////////////////////////////////////////////////////////////////////////////////////
31 
32 ///
33 /// @file kmedoids.h
34 /// @author Todd Gamblin tgamblin@llnl.gov
35 /// @brief Implementations of the classic clustering algorithms PAM and CLARA, from
36 /// <i>Finding Groups in Data</i>, by Kaufman and Rousseeuw.
37 ///
38 #ifndef K_MEDOIDS_H
39 #define K_MEDOIDS_H
40 
41 #include <vector>
42 #include <set>
43 #include <iostream>
44 #include <stdexcept>
45 #include <cfloat>
46 
47 #include <boost/random.hpp>
48 
49 #include "random.h"
50 #include "dissimilarity.h"
51 #include "partition.h"
52 #include "bic.h"
53 
54 namespace cluster {
55 
56  ///
57  /// Implementations of the classic clustering algorithms PAM and CLARA, from
58  /// <i>Finding Groups in Data</i>, by Kaufman and Rousseeuw.
59  ///
60  class kmedoids : public partition {
61  public:
62  ///
63  /// Constructor. Can optionally specify number of objects to be clustered.
64  /// and this will start out with all of them in one cluster.
65  ///
66  /// The random number generator associated with this kmedoids object is seeded
67  /// with the time in microseconds since the epoch.
68  ///
69  kmedoids(size_t num_objects = 0);
70  ~kmedoids();
71 
72  /// Get the average dissimilarity of objects w/their medoids for the last run.
73  double average_dissimilarity() const;
74 
75  /// Set random seed.
76  void set_seed(unsigned long seed);
77 
78  /// Set whether medoids will be sorted by object id after clustering is complete.
79  /// Defaults to true.
80  void set_sort_medoids(bool sort_medoids);
81 
82  /// Set tolerance for convergence. This is relative error, not absolute error. It will be
83  /// multiplied by the mean distance before it is used to test convergence.
84  /// Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.
85  void set_epsilon(double epsilon);
86 
87  ///
88  /// Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM)
89  /// algorithm as described in Kaufman and Rousseeuw.
90  ///
91  /// @param distance dissimilarity matrix for all objects to cluster
92  /// @param k number of clusters to produce
93  /// @param initial_medoids Optionally supply k initial object ids to be used as initial medoids.
94  ///
95  /// @see \link build_dissimilarity_matrix()\endlink, a function to automatically
96  /// construct a dissimilarity matrix given a vector of objects and a distance function.
97  ///
98  void pam(const dissimilarity_matrix& distance, size_t k, const object_id *initial_medoids = NULL);
99 
100  ///
101  /// Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM)
102  /// algorithm as described in Kaufman and Rousseeuw. Runs PAM from 1 to max_k and selects
103  /// the best k using the bayesian information criterion. Sets this partition to the best
104  /// partition found using PAM from 1 to k.
105  ///
106  /// Based on X-Means, see Pelleg & Moore, 2000.
107  ///
108  /// @param distance dissimilarity matrix for all objects to cluster
109  /// @param max_k Upper limit on number of clusters to find.
110  /// @param dimensionality Number of dimensions in clustered data, for BIC.
111  ///
112  /// @return the best BIC value found (the bic value of the final partitioning).
113  ///
114  /// @see \link build_dissimilarity_matrix()\endlink, a function to automatically
115  /// construct a dissimilarity matrix given a vector of objects and a distance function.
116  ///
117  double xpam(const dissimilarity_matrix& distance, size_t max_k, size_t dimensionality);
118 
119  ///
120  /// CLARA clustering algorithm, as per Kaufman and Rousseuw and
121  /// R. Ng and J. Han, "Efficient and Effective Clustering Methods
122  /// for Spatial Data Mining."
123  ///
124  /// @tparam T Type of objects to be clustered.
125  /// @tparam D Dissimilarity metric type. D should be callable
126  /// on (T, T) and should return a double.
127  ///
128  /// @param objects Objects to cluster
129  /// @param dmetric Distance metric to build dissimilarity matrices with
130  /// @param k Number of clusters to partition
131  ///
132  template <class T, class D>
133  void clara(const std::vector<T>& objects, D dmetric, size_t k) {
134  size_t sample_size = init_size + 2*k;
135 
136  // Just run plain KMedoids once if sampling won't gain us anything
137  if (objects.size() <= sample_size) {
139  build_dissimilarity_matrix(objects, dmetric, mat);
140  pam(mat, k);
141  return;
142  }
143 
144  // get everything the right size before starting.
145  medoid_ids.resize(k);
146  cluster_ids.resize(objects.size());
147 
148  // medoids and clusters for best partition so far.
149  partition best_partition;
150 
151  //run KMedoids on a sampled subset max_reps times
152  total_dissimilarity = DBL_MAX;
153  for (size_t i = 0; i < max_reps; i++) {
154  // Take a random sample of objects, store sample in a vector
155  std::vector<size_t> sample_to_full;
156  algorithm_r(objects.size(), sample_size, back_inserter(sample_to_full), rng);
157 
158  // Build a distance matrix for PAM
159  dissimilarity_matrix distance;
160  build_dissimilarity_matrix(objects, sample_to_full, dmetric, distance);
161 
162  // Actually run PAM on the subset
163  kmedoids subcall;
164  subcall.set_sort_medoids(false); // skip sort for subcall since it's not needed
165  subcall.pam(distance, k);
166 
167  // copy medoids from the subcall to local data, being sure to translate indices
168  for (size_t i=0; i < medoid_ids.size(); i++) {
169  medoid_ids[i] = sample_to_full[subcall.medoid_ids[i]];
170  }
171 
172  // sync up the cluster_ids matrix with the new medoids by assigning
173  // each object to its closest medoid. Remember the quality of the clustering.
174  double dissimilarity = assign_objects_to_clusters(lazy_distance(objects, dmetric));
175 
176  // keep the best clustering found so far around
177  if (dissimilarity < total_dissimilarity) {
178  swap(best_partition);
179  total_dissimilarity = dissimilarity;
180  }
181  }
182 
183  if (sort_medoids) sort(); // just do one final ordering of ids.
184  }
185 
186 
187  ///
188  /// Takes existing clustering and reassigns medoids by taking closest medoid to mean
189  /// of each cluster. This is O(n) and can give better representatives for CLARA clusterings.
190  /// This is needed to apply gaussian-model BIC criteria to clusterings.
191  ///
192  /// @tparam T To use this function, T needs to support algebraic operations.<br>
193  /// Specifically, T must support enough to construct a mean:
194  /// - addition <code>T + T = T</code>
195  /// - scalar division <code>T / c = T</code>
196  ///
197  template <class T, class D>
198  void center_medoids(const std::vector<T>& objects, D distance) {
199  // First sum up elements in all clusters to get the mean for each
200  std::vector<T> means(medoid_ids.size());
201  std::vector<size_t> counts(medoid_ids.size());
202  for (size_t i=0; i < cluster_ids.size(); i++) {
203  medoid_id m = cluster_ids[i];
204  means[m] = counts[m] ? (means[m] + objects[i]) : objects[i];
205  counts[m]++;
206  }
207 
208  // Now, turn cumulative sums into means and calculate distance from medoids to means
209  std::vector<double> shortest(means.size());
210  for (size_t m=0; m < means.size(); m++) {
211  means[m] = means[m] / counts[m];
212  shortest[m] = distance(means[m], objects[medoid_ids[m]]);
213  }
214 
215  // Find closest medoids to each mean element, preferring existing medoids if there are ties.
216  for (size_t i=0; i < cluster_ids.size(); i++) {
217  medoid_id m = cluster_ids[i];
218  double d = distance(objects[i], means[m]);
219  if (d < shortest[m]) {
220  medoid_ids[m] = i;
221  shortest[m] = d;
222  }
223  }
224  }
225 
226 
227  ///
228  /// K-Agnostic version of CLARA. This uses the BIC criterion as described in bic.h to
229  /// run clara() a number of times and to select a best run of clara() from the trials.
230  /// This will be slower than regular clara(). In particular, it's O(n*max_k).
231  ///
232  /// @param[in] objects Objects to cluster
233  /// @param[in] dmetric Distance metric to build dissimilarity matrices with
234  /// @param[in] max_k Max number of clusters to find.
235  /// @param[in] dimensionality Dimensionality of objects, used by BIC.
236  ///
237  template <class T, class D>
238  double xclara(const std::vector<T>& objects, D dmetric, size_t max_k, size_t dimensionality) {
239  double best_bic = -DBL_MAX; // note that DBL_MIN isn't what you think it is.
240 
241  for (size_t k = 1; k <= max_k; k++) {
242  kmedoids subcall;
243  subcall.clara(objects, dmetric, k);
244  center_medoids(objects, dmetric);
245  double cur_bic = bic(subcall, lazy_distance(objects, dmetric), dimensionality);
246 
247  if (xcallback) xcallback(subcall, cur_bic);
248  if (cur_bic > best_bic) {
249  best_bic = cur_bic;
250  swap(subcall);
251  }
252  }
253  return best_bic;
254  }
255 
256 
257  void set_init_size(size_t sz) { init_size = sz; }
258  void set_max_reps(size_t r) { max_reps = r; }
259 
260 
261  /// Set callback function for XPAM and XCLARA. default is none.
262  void set_xcallback(void (*)(const partition& part, double bic));
263 
264  protected:
265  typedef boost::mt19937 random_type; /// Type for RNG used in this algorithm
266  random_type random; /// Randomness source for this algorithm
267 
268  /// Adaptor for STL algorithms.
269  typedef boost::random_number_generator<random_type, unsigned long> rng_type;
271 
272  std::vector<medoid_id> sec_nearest; /// Index of second closest medoids. Used by PAM.
273  double total_dissimilarity; /// Total dissimilarity bt/w objects and their medoid
274  bool sort_medoids; /// Whether medoids should be canonically sorted by object id.
275  double epsilon; /// Normalized sensitivity for convergence
276  size_t init_size; /// initial sample size (before 2*k)
277  size_t max_reps; /// initial sample size (before 2*k)
278 
279 
280  /// Callback for each iteration of xpam. is called with the current clustering and its BIC score.
281  void (*xcallback)(const partition& part, double bic);
282 
283  /// KR BUILD algorithm for assigning initial medoids to a partition.
284  void init_medoids(size_t k, const dissimilarity_matrix& distance);
285 
286  /// Total cost of swapping object h with medoid i.
287  /// Sums costs of this exchagne for all objects j.
288  double cost(medoid_id i, object_id h, const dissimilarity_matrix& distance) const;
289 
290 
291  /// Assign each object to the cluster with the closest medoid.
292  ///
293  /// @return Total dissimilarity of objects w/their medoids.
294  ///
295  /// @param distance a callable object that computes distances between indices, as a distance
296  /// matrix would. Algorithms are free to use real distance matrices (as in PAM)
297  /// or to compute lazily (as in CLARA medoid assignment).
298  template <class DM>
299  double assign_objects_to_clusters(DM distance) {
300  if (sec_nearest.size() != cluster_ids.size()) {
301  sec_nearest.resize(cluster_ids.size());
302  }
303 
304  // go through and assign each object to nearest medoid, keeping track of total dissimilarity.
305  double total_dissimilarity = 0;
306  for (object_id i=0; i < cluster_ids.size(); i++) {
307  double d1, d2; // smallest, second smallest distance to medoid, respectively
308  medoid_id m1, m2; // index of medoids with distances d1, d2 from object i, respectively
309 
310  d1 = d2 = DBL_MAX;
311  m1 = m2 = medoid_ids.size();
312  for (medoid_id m=0; m < medoid_ids.size(); m++) {
313  double d = distance(i, medoid_ids[m]);
314  if (d < d1 || medoid_ids[m] == i) { // prefer the medoid in case of ties.
315  d2 = d1; m2 = m1;
316  d1 = d; m1 = m;
317  } else if (d < d2) {
318  d2 = d; m2 = m;
319  }
320  }
321 
322  cluster_ids[i] = m1;
323  sec_nearest[i] = m2;
324  total_dissimilarity += d1;
325  }
326 
327  return total_dissimilarity;
328  }
329  };
330 
331 
332 } // namespace cluster
333 
334 #endif //K_MEDOIDS_H
rng_type rng
Definition: kmedoids.h:270
double average_dissimilarity() const
Get the average dissimilarity of objects w/their medoids for the last run.
Definition: kmedoids.cpp:70
void set_epsilon(double epsilon)
Set tolerance for convergence.
Definition: kmedoids.cpp:82
void set_seed(unsigned long seed)
Set random seed.
Definition: kmedoids.cpp:74
void init_medoids(size_t k, const dissimilarity_matrix &distance)
KR BUILD algorithm for assigning initial medoids to a partition.
Definition: kmedoids.cpp:90
void build_dissimilarity_matrix(const std::vector< T > &objects, D dissimilarity, dissimilarity_matrix &mat)
Computes a dissimilarity matrix from a vector of objects.
Definition: dissimilarity.h:59
std::vector< object_id > medoid_ids
Gives the index of the object that is the ith medoid.
Definition: partition.h:79
double epsilon
Whether medoids should be canonically sorted by object id.
Definition: kmedoids.h:275
boost::random_number_generator< random_type, unsigned long > rng_type
Randomness source for this algorithm.
Definition: kmedoids.h:269
Class to represent a partitioning of a data set.
bool sort_medoids
Total dissimilarity bt/w objects and their medoid.
Definition: kmedoids.h:274
lazy_distance_functor< T, D > lazy_distance(const std::vector< T > &objs, D dist)
Type-inferred syntactic sugar for constructing lazy_distance_functor.
void swap(partition &other)
Fast swap with other patrition objects.
Definition: partition.cpp:68
void(* xcallback)(const partition &part, double bic)
initial sample size (before 2*k)
Definition: kmedoids.h:281
double xpam(const dissimilarity_matrix &distance, size_t max_k, size_t dimensionality)
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in K...
Definition: kmedoids.cpp:222
double cost(medoid_id i, object_id h, const dissimilarity_matrix &distance) const
Total cost of swapping object h with medoid i.
Definition: kmedoids.cpp:135
boost::numeric::ublas::symmetric_matrix< double > dissimilarity_matrix
Packed repersentation of symmetric dissimilarity matrix.
Definition: dissimilarity.h:48
void sort()
puts medoids in order by their object id, and adjusts cluster_ids accordingly.
Definition: partition.cpp:74
random_type random
Type for RNG used in this algorithm.
Definition: kmedoids.h:266
Template function implementations of the Bayesian Information Criterion.
void set_xcallback(void(*)(const partition &part, double bic))
Set callback function for XPAM and XCLARA. default is none.
Definition: kmedoids.cpp:86
void set_max_reps(size_t r)
Definition: kmedoids.h:258
double assign_objects_to_clusters(DM distance)
Assign each object to the cluster with the closest medoid.
Definition: kmedoids.h:299
size_t max_reps
initial sample size (before 2*k)
Definition: kmedoids.h:277
kmedoids(size_t num_objects=0)
Constructor.
Definition: kmedoids.cpp:55
void set_sort_medoids(bool sort_medoids)
Set whether medoids will be sorted by object id after clustering is complete.
Definition: kmedoids.cpp:78
Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data...
Definition: kmedoids.h:60
size_t object_id
More descriptive type for object index.
Definition: partition.h:52
void clara(const std::vector< T > &objects, D dmetric, size_t k)
CLARA clustering algorithm, as per Kaufman and Rousseuw and R.
Definition: kmedoids.h:133
size_t medoid_id
More descriptive type for medoid index.
Definition: partition.h:51
void center_medoids(const std::vector< T > &objects, D distance)
Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster...
Definition: kmedoids.h:198
std::vector< medoid_id > sec_nearest
Definition: kmedoids.h:272
size_t init_size
Normalized sensitivity for convergence.
Definition: kmedoids.h:276
void pam(const dissimilarity_matrix &distance, size_t k, const object_id *initial_medoids=NULL)
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in K...
Definition: kmedoids.cpp:161
double xclara(const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality)
K-Agnostic version of CLARA.
Definition: kmedoids.h:238
double total_dissimilarity
Index of second closest medoids. Used by PAM.
Definition: kmedoids.h:273
Data types and functions for dealing with dissimilarity matrices.
void algorithm_r(size_t numElements, size_t sample_size, OutputIterator out, Random &random)
This is Knuth&#39;s algorithm R for taking a sample of numElements numbers.
Definition: random.h:63
std::vector< medoid_id > cluster_ids
Gives cluster id (index in medoids) for the ith object.
Definition: partition.h:84
Helper functions for taking random samples and seeding RNGs from the system clock.
boost::mt19937 random_type
Definition: kmedoids.h:265
This represents a partitioning of a data set.
Definition: partition.h:76
void set_init_size(size_t sz)
Definition: kmedoids.h:257
double bic(const partition &p, D distance, size_t M)
Directly computes the BIC from a partition object based on the cluster centroids and the number of cl...
Definition: bic.h:77
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Thu Sep 1 2016 using Doxygen 1.8.5