Muster
par_kmedoids.h
1 //////////////////////////////////////////////////////////////////////////////////////////////////
2 // Copyright (c) 2010, Lawrence Livermore National Security, LLC.
3 // Produced at the Lawrence Livermore National Laboratory
4 // LLNL-CODE-433662
5 // All rights reserved.
6 //
7 // This file is part of Muster. For details, see http://github.com/tgamblin/muster.
8 // Please also read the LICENSE file for further information.
9 //
10 // Redistribution and use in source and binary forms, with or without modification, are
11 // permitted provided that the following conditions are met:
12 //
13 // * Redistributions of source code must retain the above copyright notice, this list of
14 // conditions and the disclaimer below.
15 // * Redistributions in binary form must reproduce the above copyright notice, this list of
16 // conditions and the disclaimer (as noted below) in the documentation and/or other materials
17 // provided with the distribution.
18 // * Neither the name of the LLNS/LLNL nor the names of its contributors may be used to endorse
19 // or promote products derived from this software without specific prior written permission.
20 //
21 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
22 // OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
23 // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
24 // LAWRENCE LIVERMORE NATIONAL SECURITY, LLC, THE U.S. DEPARTMENT OF ENERGY OR CONTRIBUTORS BE
25 // LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
26 // (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
27 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
28 // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
29 // ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 //////////////////////////////////////////////////////////////////////////////////////////////////
31 
32 ///
33 /// @file par_kmedoids.h
34 /// @author Todd Gamblin tgamblin@llnl.gov
35 /// @brief CAPEK and XCAPEK scalable parallel clustering algorithms.
36 ///
37 #ifndef PAR_KMEDOIDS_H
38 #define PAR_KMEDOIDS_H
39 
40 #include <mpi.h>
41 #include <ostream>
42 #include <vector>
43 #include <functional>
#include <algorithm> // std::sort, std::transform, std::unique, std::min_element (used below)
#include <cfloat> // DBL_MAX (used below)
44 
45 #include <boost/iterator/permutation_iterator.hpp>
46 
47 #include "Timer.h"
48 #include "kmedoids.h"
49 #include "multi_gather.h"
50 #include "trial.h"
51 #include "id_pair.h"
52 #include "par_partition.h"
53 #include "stl_utils.h"
54 #include "bic.h"
55 #include "mpi_bindings.h"
56 #include "gather.h"
57 #include "packable_vector.h"
58 #include "binomial.h"
59 
60 namespace cluster {
61 
62  ///
63  /// This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms.
64  ///
65  /// For more information on these algorithms, see this paper:
66  /// @par
67  /// Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed.
68  /// <a href="http://www.cs.unc.edu/~tgamblin/pubs/scalable-cluster-ics10.pdf">
69  /// <b>Clustering Performance Data Efficiently at Massive Scales</b></a>.
70  /// <i>Proceedings of the International Conference on Supercomputing (ICS'10)</i>,
71  /// Tsukuba, Japan, June 1-4, 2010.
72  ///
73  /// <b>Example usage:</b>
74  /// @code
75  /// // This is a theoretical point struct to be clustered.
76  /// struct point {
77  /// double x, y;
78  /// };
79  ///
80  /// // Euclidean distance functor to use for clustering.
81  /// struct euclidean_distance {
82  /// double operator()(const point& lhs, const point& rhs) {
83  /// double a = lhs.x - rhs.x;
84  /// double b = lhs.y - rhs.y;
85  /// return sqrt(a*a + b*b);
86  /// }
87  /// };
88  ///
89  /// vector<point> points;
90  /// // ... Each process puts some local points in the vector ...
91  ///
92  /// par_kmedoids parkm(MPI_COMM_WORLD);
93  /// vector<point> medoids; // storage for representatives
94  ///
95  /// // Run xcapek(), finding a max of 20 clusters among the 2d points.
96  /// parkm.xcapek(points, euclidean_distance(), 20, 2, &medoids);
97  /// @endcode
98  ///
99  /// After this runs, these members of <code>parkm</code> are interesting:
100  /// - <code>parkm.\link par_partition::cluster_ids cluster_ids\endlink</code>:
101  /// A vector of cluster ids for the local objects in <code>points</code>
102  /// - <code>parkm.\link par_partition::medoid_ids medoid_ids\endlink</code>:
103  /// A vector of object ids for all the cluster representatives
104  ///
105  /// Together, these define the clustering that the algorithm found. See par_partition
106  /// for an explanation of how distributed partitions like this are represented.
107  ///
108  /// The <code>medoids</code> vector contains actual copies of the representatives for each cluster.
109  /// The copies correspond to the object ids in <code>parkm.medoid_ids</code>. Supplying
110  /// the <code>medoids</code> vector like this is optional, so if you don't need copies of
111  /// the representatives, you can omit it from the call.
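 ///
 /// For illustration only, here is a hypothetical loop that consumes these results. It
 /// assumes, per par_partition, that each entry of cluster_ids indexes into medoid_ids
 /// (and thus into the optional medoids copies):
 /// @code
 /// for (size_t i = 0; i < points.size(); i++) {
 ///   size_t c = parkm.cluster_ids[i];        // cluster that local object i was assigned to
 ///   object_id rep_id = parkm.medoid_ids[c]; // global object id of that cluster's representative
 ///   const point& rep = medoids[c];          // local copy of the representative
 /// }
 /// @endcode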
112  ///
113  ///
114  ///
115  class par_kmedoids : public par_partition {
116  public:
117  ///
118  /// Constructs a parallel kmedoids object and seeds its random number generator.
119  /// This is a collective operation, and needs to be called by all processes.
120  ///
121  /// par_kmedoids assumes that there is one object to be clustered per process.
122  ///
123  par_kmedoids(MPI_Comm comm = MPI_COMM_WORLD);
124 
125  virtual ~par_kmedoids() { }
126 
127  /// Set the random seed. If used, must be called on all processes in the MPI_Comm
128  /// with the same seed value. If not, a seed is generated and broadcast to
129  /// the MPI_Comm.
130  void set_seed(uint32_t seed);
131 
132  /// Get the average dissimilarity of objects w/their medoids for the last run.
133  double average_dissimilarity();
134 
135  /// BIC score for selected clustering
136  double bic_score();
137 
138  ///
139  /// Sets max_reps, the maximum number of times to run PAM with each sampled dataset.
140  /// Default is 5, per Kaufman and Rousseeuw.
141  ///
142  void set_max_reps(size_t reps) { max_reps = reps; }
143 
144  ///
145  /// Max number of times to run PAM with each sampled dataset.
146  ///
147  size_t get_max_reps() { return max_reps; }
148 
149  ///
150  /// Sets init_size, the baseline size for samples, which is added to 2*k.
151  /// Defaults to 40, per Kaufman and Rousseeuw.
152  ///
153  void set_init_size(size_t size) { init_size = size; }
154 
155  ///
156  /// Baseline size for samples, added to 2*k.
157  ///
158  size_t get_init_size() { return init_size; }
159 
160  /// Set tolerance for convergence. This is relative error, not absolute error. It will be
161  /// multiplied by the mean distance before it is used to test convergence.
162  /// Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.
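 /// For example (illustrative numbers only): with epsilon = 1e-3 and a mean distance of 10.0,
 /// the effective convergence threshold would be 1e-3 * 10.0 = 0.01.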
163  void set_epsilon(double epsilon);
164 
165 
166  ///
167  /// Farms out trials of PAM to worker processes, then collects medoids from all trials to all processes.
168  /// Puts resulting medoids in all_medoids when done.
169  ///
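 /// @par
 /// Callers normally reach this routine through capek() or xcapek(). A direct call is shaped
 /// roughly like the following sketch, which mirrors what capek() does internally (point,
 /// objects, and dmetric are the hypothetical types from the class-level example; k is the
 /// desired cluster count and num_objects is the global number of objects):
 /// @code
 /// std::vector<id_pair<point>::vector> all_medoids(max_reps);
 /// trial_generator trials(k, k, max_reps, init_size, num_objects);
 /// run_pam_trials(trials, objects, dmetric, all_medoids, comm);
 /// @endcode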
170  template <class T, class D>
171  void run_pam_trials(trial_generator& trials, const std::vector<T>& objects, D dmetric,
172  std::vector<typename id_pair<T>::vector>& all_medoids, MPI_Comm comm)
173  {
174  int size, rank;
175  CMPI_Comm_size(comm, &size);
176  CMPI_Comm_rank(comm, &rank);
177 
178  MPI_Group comm_group;
179  CMPI_Comm_group(comm, &comm_group);
180 
181  for (size_t i=0; trials.has_next(); i++) {
182  int my_k = -1; // trial id for local run of kmedoids
183  int my_trial = -1; // trial id for local run of kmedoids
184  std::vector<size_t> my_ids; // object ids for each of my_objects
185  std::vector<T> my_objects; // vector to hold local sample of objects for clustering.
186  multi_gather<T> gather(comm); // simultaneous, asynchronous local gathers for collecting samples.
187 
188  // start gathers for each trial to aggregate samples to single worker processes.
189  for (int root=0; trials.has_next() && root < size; root++) {
190  trial cur_trial = trials.next(); // generate a trial descriptor
191 
192  // Generate a set of indices for members of this k-medoids trial
193  std::vector<size_t> sample_ids;
194  boost::random_number_generator<random_t> rng(random); // Boost adaptor for STL RNG's
195  algorithm_r(trials.num_objects, cur_trial.sample_size, std::back_inserter(sample_ids), rng);
196 
197  // figure out where the sample objects live, ASSUME objects.size() objs per process.
198  std::vector<int> sources;
199  std::transform(sample_ids.begin(), sample_ids.end(), std::back_inserter(sources),
200  std::bind2nd(std::divides<size_t>(), objects.size()));
201  sources.erase(std::unique(sources.begin(), sources.end()), sources.end());
202 
203  // make a permutation vector for the indices of the sampled *local* objects
204  std::vector<size_t> sample_indices;
205  transform(std::lower_bound(sample_ids.begin(), sample_ids.end(), objects.size() * rank),
206  std::lower_bound(sample_ids.begin(), sample_ids.end(), objects.size() * (rank + 1)),
207  std::back_inserter(sample_indices),
208  std::bind2nd(std::minus<int>(), objects.size() * rank));
209 
210  // gather trial members to the current worker (root)
211  gather.start(boost::make_permutation_iterator(objects.begin(), sample_indices.begin()),
212  boost::make_permutation_iterator(objects.begin(), sample_indices.end()),
213  sources.begin(), sources.end(), my_objects, root);
214 
215  // record which trial to use locally and save the medoids there.
216  if (rank == root) {
217  my_k = cur_trial.k;
218  my_trial = trials.count() - 1;
219  my_ids.swap(sample_ids);
220  }
221  }
222  timer.record("StartGather");
223 
224  // finish all sample gathers.
225  gather.finish();
226  timer.record("FinishGather");
227 
228  // we're a worker process if we were assigned a trial number.
229  int is_worker_process = (my_trial >= 0);
230 
231  // then run PAM on the samples that we aggregated to workers.
232  if (is_worker_process) {
233  kmedoids cluster;
234  cluster.set_epsilon(epsilon);
235 
236  dissimilarity_matrix mat;
237  build_dissimilarity_matrix(my_objects, dmetric, mat);
238  cluster.pam(mat, my_k);
239  timer.record("LocalCluster");
240 
241  // put this trial's medoids into their spot in the global medoids array.
242  // and pack them up so that we can bcast them to other processes.
243  for (size_t m=0; m < cluster.medoid_ids.size(); m++) {
244  all_medoids[my_trial].push_back(
245  make_id_pair(my_objects[cluster.medoid_ids[m]], my_ids[cluster.medoid_ids[m]]));
246  }
247  }
248 
249  // create a new communicator for the trials
250  // note that the way we construct this, members of trials have the same rank in trials_comm
251  // and in comm. So it's safe to gather to zero on trials_comm then bcast from 0 on comm.
252  std::vector<int> trial_ranks;
253  for (int trial_id = i * size; trial_id < trials.count(); trial_id++) {
254  trial_ranks.push_back(trial_id % size);
255  }
256 
257  MPI_Group trials_group;
258  CMPI_Group_incl(comm_group, trial_ranks.size(), &trial_ranks[0], &trials_group);
259 
260  MPI_Comm trials_comm;
261  CMPI_Comm_create(comm, trials_group, &trials_comm);
262  timer.record("CreateMedoidComm");
263 
264  // Gather the trials to a single process
265  std::vector<char> packed_medoids;
266  binomial_embedding binomial(trial_ranks.size(), 0);
267  if (is_worker_process) {
268  gather_packed(make_packable_vector(&all_medoids[my_trial], false), packed_medoids,
269  binomial, trials_comm);
270  CMPI_Comm_free(&trials_comm);
271  }
272  timer.record("GatherTrials");
273 
274  // bcast the trials to everyone from rank zero using regular MPI_Bcast on the
275  // full packed vector.
276  size_t packed_medoids_size = packed_medoids.size();
277  CMPI_Bcast(&packed_medoids_size, 1, MPI_SIZE_T, 0, comm);
278 
279  if (rank != 0) packed_medoids.resize(packed_medoids_size);
280  CMPI_Bcast(&packed_medoids[0], packed_medoids_size, MPI_PACKED, 0, comm);
281  timer.record("BroadcastTrials");
282 
283  // unpack the medoids and swap them into their place in the all_medoids array.
284  std::vector< packable_vector< id_pair<T> > > unpacked_medoids;
285  unpack_binomial(packed_medoids, unpacked_medoids, binomial, comm);
286  for (int trial_id = i * size; trial_id < trials.count(); trial_id++) {
287  unpacked_medoids[trial_id % size]._packables->swap(all_medoids[trial_id]);
288  }
289  timer.record("UnpackFromBroadcast");
290  }
291 
292  CMPI_Group_free(&comm_group);
293  }
294 
295  ///
296  /// This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).
297  ///
298  /// Assumes that objects to be clustered are fully distributed across parallel processes,
299  /// with the same number of objects per process.
300  ///
301  /// @tparam T Type of objects to be clustered.
302  /// T must support the following operations:
303  /// - <code>int packed_size(MPI_Comm comm) const</code>
304  /// - <code>void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const</code>
305  /// - <code>static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)</code>
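 /// A minimal illustrative sketch of such a type (not part of Muster), using plain
 /// MPI pack routines for a 2-D point:
 /// @code
 /// struct point {
 ///   double x, y;
 ///
 ///   int packed_size(MPI_Comm comm) const {
 ///     int size;
 ///     MPI_Pack_size(2, MPI_DOUBLE, comm, &size); // room for x and y
 ///     return size;
 ///   }
 ///
 ///   void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const {
 ///     MPI_Pack(const_cast<double*>(&x), 1, MPI_DOUBLE, buf, bufsize, position, comm);
 ///     MPI_Pack(const_cast<double*>(&y), 1, MPI_DOUBLE, buf, bufsize, position, comm);
 ///   }
 ///
 ///   static point unpack(void *buf, int bufsize, int *position, MPI_Comm comm) {
 ///     point p;
 ///     MPI_Unpack(buf, bufsize, position, &p.x, 1, MPI_DOUBLE, comm);
 ///     MPI_Unpack(buf, bufsize, position, &p.y, 1, MPI_DOUBLE, comm);
 ///     return p;
 ///   }
 /// };
 /// @endcode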
306  /// @tparam D Dissimilarity metric type.
307  /// D should be callable on (T, T) and should return a double representing
308  /// the distance between the two T's.
309  ///
310  /// @param[in] objects Local objects to cluster (ASSUME: currently must be same number per process!)
311  /// @param[in] dmetric Distance metric to build dissimilarity matrices with
312  /// @param[in] k Number of clusters to find.
313  /// @param[out] medoids Optional output vector where global medoids will
314  /// be stored along with their source ranks.
315  ///
316  /// CAPEK will run multiple trial instances of PAM on separate processes for the requested k, using the
317  /// run_pam_trials() routine. Each trial aggregates sample_size objects distributed over all
318  /// processes in the system.
319  ///
320  /// @see xcapek() for a K-agnostic version of this algorithm.
321  ///
322  template <class T, class D>
323  void capek(const std::vector<T>& objects, D dmetric, size_t k, std::vector<T> *medoids = NULL)
324  {
325  int size, rank;
326  CMPI_Comm_size(comm, &size);
327  CMPI_Comm_rank(comm, &rank);
328 
329  if (!seed_set)
330  seed_random_uniform(comm); // seed RN generator uniformly across ranks.
331 
332  // fix things if k is greater than the number of elements, since we can't
333  // ever find that many clusters.
334  size_t num_objects = size * objects.size();
335  k = std::min(num_objects, k);
336  timer.record("Init");
337 
338  // do parallel work: farms out trials and broadcasts medoids from each trial to
339  // all processes. On completion, medoids from all trials are in all_medoids vector.
340  std::vector<typename id_pair<T>::vector> all_medoids(max_reps);
341  trial_generator trials(k, k, max_reps, init_size, num_objects);
342  run_pam_trials(trials, objects, dmetric, all_medoids, comm);
343 
344  // Make two arrays to hold our closest medoids and their distance from our object
345  std::vector<double> all_dissimilarities(trials.count(), 0.0); // dissimilarity sums
346  std::vector< std::vector<medoid_id> > all_cluster_ids(trials.count()); // local nearest medoid ids
347 
348  // Go through all the trials again, and for each of them, find the closest
349  // medoid to this process's objects and sum the dissimilarities
350  for (size_t i=0; i < trials.count(); i++) {
351  for (size_t o=0; o < objects.size(); o++) {
352  object_id global_oid = objects.size() * rank + o; // global id, assuming objects.size() objects per process
353  std::pair<double, size_t> closest = closest_medoid(objects[o], global_oid, all_medoids[i], dmetric);
354 
355  all_dissimilarities[i] += closest.first;
356  all_cluster_ids[i].push_back(closest.second);
357  }
358  }
359  timer.record("FindMinima");
360 
361  // Sum up all the min dissimilarities. We do a Reduce/Bcast instead of an Allreduce
362  // to avoid FP error and guarantee that sums is the same across all processors.
363  std::vector<double> sums(trials.count()); // destination vectors for reduction.
364 
365  CMPI_Reduce(&all_dissimilarities[0], &sums[0], trials.count(), MPI_DOUBLE, MPI_SUM, 0, comm);
366  CMPI_Bcast(&sums[0], trials.count(), MPI_DOUBLE, 0, comm);
367  timer.record("GlobalSums");
368 
369  // find minimum global dissimilarity among all trials.
370  std::vector<double>::iterator min_sum = std::min_element(sums.begin(), sums.end());
371  total_dissimilarity = *min_sum;
372  size_t best = (min_sum - sums.begin()); // index of best trial.
373 
374 
375  // Finally set up the partition to correspond to trial with best dissimilarity found
376  medoid_ids.resize(all_medoids[best].size());
377  for (size_t i = 0; i < medoid_ids.size(); i++) {
378  medoid_ids[i] = all_medoids[best][i].id;
379  }
380 
381  // Make an indirection vector from the unsorted to sorted medoids
382  std::vector<size_t> mapping(medoid_ids.size());
383  std::generate(mapping.begin(), mapping.end(), sequence());
384  std::sort(mapping.begin(), mapping.end(), indexed_lt(medoid_ids));
385  invert(mapping);
386 
387  // set up local cluster ids, medoids, and medoid_ids with the sorted mapping.
388  for (size_t i=0; i < medoid_ids.size(); i++) {
389  medoid_ids[i] = all_medoids[best][mapping[i]].id;
390  }
391 
392  // swap in the cluster ids with the best BIC score.
393  cluster_ids.swap(all_cluster_ids[best]);
394 
395  // if the caller wanted a copy of the medoids, copy them into the destination array.
396  if (medoids) {
397  medoids->resize(medoid_ids.size());
398  for (size_t i=0; i < medoid_ids.size(); i++) {
399  (*medoids)[i] = all_medoids[best][mapping[i]].element;
400  }
401  }
402 
403  timer.record("BicScore");
404  }
405 
406 
407  ///
408  /// K-agnostic version of capek().
409  /// This version attempts to guess the best K for the data using the
410  /// Bayesian Information Criterion (BIC) described in bic.h. Evaluation of the
411  /// BIC is parallelized using global reduction operations.
412  ///
413  /// Like capek(), this uses run_pam_trials() to farm out trials of the PAM clustering algorithm,
414  /// but it requires more trials than capek(). In particular, it will run
415  /// (max_k * max_reps) total trials in parallel on MPI worker processes.
416  ///
417  /// @tparam T Type of objects to be clustered.
418  /// T must support the following operations:
419  /// - <code>int packed_size(MPI_Comm comm) const</code>
420  /// - <code>void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const</code>
421  /// - <code>static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)</code>
422  /// @tparam D Dissimilarity metric type.
423  /// D should be callable on (T, T) and should return a double representing
424  /// the distance between the two T's.
425  ///
426  /// @param[in] objects Local objects to cluster (ASSUME: currently must be same number per process!)
427  /// @param[in] dmetric Distance metric to build dissimilarity matrices with
428  /// @param[in] max_k Max number of clusters to find.
429  /// @param[in] dimensionality Dimensionality of objects, used by BIC.
430  /// @param[out] medoids Optional output vector where global medoids will be stored
431  /// along with their source ranks.
432  ///
433  /// @return
434  /// The best BIC value found, that is, the BIC value of the final clustering.
435  ///
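 /// @par
 /// A brief usage sketch (illustration only; assumes a packable point type and a
 /// euclidean_distance functor like those in the class-level example):
 /// @code
 /// par_kmedoids parkm(MPI_COMM_WORLD);
 /// std::vector<point> medoids;
 /// double bic = parkm.xcapek(points, euclidean_distance(), 20, 2, &medoids);
 /// // parkm.medoid_ids.size() is the k that xcapek() selected; bic is that clustering's score.
 /// @endcode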
436  template <class T, class D>
437  double xcapek(const std::vector<T>& objects, D dmetric, size_t max_k, size_t dimensionality,
438  std::vector<T> *medoids = NULL)
439  {
440  int size, rank;
441  CMPI_Comm_size(comm, &size);
442  CMPI_Comm_rank(comm, &rank);
443 
444  if (!seed_set)
445  seed_random_uniform(comm); // seed RN generator uniformly across ranks.
446 
447  // fix things if k is greater than the number of elements, since we can't
448  // ever find that many clusters.
449  size_t num_objects = size * objects.size();
450  max_k = std::min(num_objects, max_k);
451  timer.record("Init");
452 
453  std::vector<typename id_pair<T>::vector> all_medoids(max_k * max_reps);
454  trial_generator trials(max_k, max_reps, init_size, num_objects);
455  run_pam_trials(trials, objects, dmetric, all_medoids, comm);
456 
457  // Make two arrays to hold our closest medoids and their distance from our object
458  std::vector<double> all_dissimilarities(trials.count(), 0.0); // dissimilarity sums
459  std::vector< std::vector<medoid_id> > all_cluster_ids(trials.count()); // local nearest medoid ids
460 
461  std::vector<double> all_dissim2; // dissimilarity sums squared
462  std::vector<size_t> cluster_sizes; // sizes of clusters in each trial
463 
464  // Go through all the trials again, and for each of them, find the closest
465  // medoid to this process's objects and sum the squared dissimilarities
466  for (size_t i=0; i < trials.count(); i++) {
467  const size_t num_medoids = all_medoids[i].size();
468  for (size_t m=0; m < num_medoids; m++) {
469  all_dissim2.push_back(0.0);
470  cluster_sizes.push_back(0);
471  }
472  double *dissim2 = &all_dissim2[all_dissim2.size() - num_medoids];
473  size_t *sizes = &cluster_sizes[cluster_sizes.size() - num_medoids];
474 
475  for (size_t o=0; o < objects.size(); o++) {
476  object_id global_oid = objects.size() * rank + o; // global id, assuming objects.size() objects per process
477  std::pair<double, size_t> closest = closest_medoid(objects[o], global_oid, all_medoids[i], dmetric);
478 
479  all_dissimilarities[i] += closest.first;
480  dissim2[closest.second] += closest.first * closest.first;
481  sizes[closest.second] += 1;
482  all_cluster_ids[i].push_back(closest.second);
483  }
484  }
485  timer.record("FindMinima");
486 
487 
488  // Sum up all the min dissimilarities. We do a Reduce/Bcast instead of an Allreduce
489  // to avoid FP error and guarantee that sums is the same across all processors.
490  std::vector<double> sums(trials.count()); // destination vectors for reduction.
491  std::vector<double> sums2(all_dissim2.size());
492  std::vector<size_t> sizes(cluster_sizes.size());
493 
494  CMPI_Reduce(&all_dissimilarities[0], &sums[0], trials.count(), MPI_DOUBLE, MPI_SUM, 0, comm);
495  CMPI_Reduce(&all_dissim2[0], &sums2[0], sums2.size(), MPI_DOUBLE, MPI_SUM, 0, comm);
496  CMPI_Bcast(&sums[0], trials.count(), MPI_DOUBLE, 0, comm);
497  CMPI_Bcast(&sums2[0], sums2.size(), MPI_DOUBLE, 0, comm);
498  CMPI_Allreduce(&cluster_sizes[0], &sizes[0], sizes.size(), MPI_SIZE_T, MPI_SUM, comm);
499  timer.record("GlobalSums");
500 
501  // find minimum global dissimilarity among all trials.
502  std::vector<double>::iterator min_sum = std::min_element(sums.begin(), sums.end());
503  total_dissimilarity = *min_sum;
504  size_t best = (min_sum - sums.begin()); // index of best trial.
505 
506  // locally calculate the BIC for each trial
507  size_t best_trial = 0;
508  best_bic_score = -DBL_MAX;
509 
510  size_t trial_offset = 0; // offset into sizes array
511  for (size_t i=0; i < trials.count(); i++) {
512  size_t k = all_medoids[i].size();
513  double cur_bic = bic(k, &sizes[trial_offset], &sums2[trial_offset], dimensionality);
514  if (cur_bic > best_bic_score) {
515  best_trial = i;
516  best_bic_score = cur_bic;
517  }
518  trial_offset += k;
519  }
520 
521  best = best_trial;
522 
523  // Finally set up the partition to correspond to best trial found.
524  medoid_ids.resize(all_medoids[best].size());
525  for (size_t i = 0; i < medoid_ids.size(); i++) {
526  medoid_ids[i] = all_medoids[best][i].id;
527  }
528 
529  // Make an indirection vector from the unsorted to sorted medoids.
530  std::vector<size_t> mapping(medoid_ids.size());
531  std::generate(mapping.begin(), mapping.end(), sequence());
532  std::sort(mapping.begin(), mapping.end(), indexed_lt(medoid_ids));
533  invert(mapping);
534 
535  // set up local cluster ids, medoids, and medoid_ids with the sorted mapping.
536  for (size_t i=0; i < medoid_ids.size(); i++) {
537  medoid_ids[i] = all_medoids[best][mapping[i]].id;
538  }
539 
540  // swap in the cluster ids with the best BIC score.
541  cluster_ids.swap(all_cluster_ids[best]);
542 
543  // if the user wanted a copy of the medoids, copy them into the destination array.
544  if (medoids) {
545  medoids->resize(medoid_ids.size());
546  for (size_t i=0; i < medoid_ids.size(); i++) {
547  (*medoids)[i] = all_medoids[best][mapping[i]].element;
548  }
549  }
550 
551  timer.record("BicScore");
552  return best_bic_score;
553  }
554 
555  /// Get the Timer with info on the last run of either capek() or xcapek().
556  const Timer& get_timer() { return timer; }
557 
558  protected:
559  typedef boost::mt19937 random_t; ///< Type for random number generator used here.
560  random_t random; ///< Random number generator used for drawing samples
561  bool seed_set; ///< Whether the random seed has been set
562 
563  double total_dissimilarity; ///< Total dissimilarity bt/w objects and medoids for last clustering.
564  double best_bic_score; ///< BIC score for the clustering found.
565  size_t init_size; ///< baseline size for samples
566  size_t max_reps; ///< Max repetitions of trials for a particular k.
567  double epsilon; ///< Tolerance for convergence tests in kmedoids PAM runs.
568 
569  Timer timer; ///< Performance timer.
570 
571  ///
572  /// Seeds random number generators across all processes with the same number,
573  /// taken from the time in microseconds since the epoch on process 0.
574  ///
575  void seed_random_uniform(MPI_Comm comm);
576 
577  ///
578  /// Find the closest object in the medoids vector to the object passed in.
579  /// Returns a pair of the closest medoid's id and its distance from the object.
580  ///
581  /// @param[in] object Object to compare to medoids.
582  /// @param[in] oid ID of the object (need this so medoids prefer themselves as their own medoids).
583  /// @param[in] medoids Vector of medoids to find the closest from.
584  /// @param[in] dmetric Distance metric to assess closeness with.
585  ///
586  template <typename T, typename D>
587  std::pair<double, size_t> closest_medoid(
588  const T& object, object_id oid, const std::vector< id_pair<T> >& medoids, D dmetric
589  ) {
590  double min_distance = DBL_MAX;
591  size_t min_id = medoids.size();
592  for (size_t m=0; m < medoids.size(); m++) {
593  double d = dmetric(medoids[m].element, object);
594  if (d < min_distance || medoids[m].id == oid) { // prefer actual medoid as closest
595  min_distance = d;
596  min_id = m;
597  }
598  }
599  return std::make_pair(min_distance, min_id);
600  }
601 
602  };
603 
604 } // Namespace cluster
605 
606 #endif // PAR_KMEDOIDS_H