Muster
 All Classes Namespaces Files Functions Variables Typedefs Macros
par_kmedoids Class Reference

This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...

#include <par_kmedoids.h>

Inheritance diagram for par_kmedoids:
Collaboration diagram for par_kmedoids:

Public Member Functions

 par_kmedoids (MPI_Comm comm=MPI_COMM_WORLD)
 Constructs a parallel kmedoids object and seeds its random number generator. More...
 
virtual ~par_kmedoids ()
 
void set_seed (uint32_t seed)
 Set the random seed. More...
 
double average_dissimilarity ()
 Get the average dissimilarity of objects w/their medoids for the last run. More...
 
double bic_score ()
 BIC score for selected clustering. More...
 
void set_max_reps (size_t reps)
 Sets max_reps, Max number of times to run PAM with each sampled dataset. More...
 
size_t get_max_reps ()
 Max number of times to run PAM with each sampled dataset. More...
 
void set_init_size (size_t size)
 Sets init_size, baseline size for samples, added to 2*k. More...
 
size_t get_init_size ()
 Baseline size for samples, added to 2*k. More...
 
void set_epsilon (double epsilon)
 Set tolerance for convergence. More...
 
template<class T , class D >
void run_pam_trials (trial_generator &trials, const std::vector< T > &objects, D dmetric, std::vector< typename id_pair< T >::vector > &all_medoids, MPI_Comm comm)
 Farms out trials of PAM to worker processes then collects medoids from all trials to all processors. More...
 
template<class T , class D >
void capek (const std::vector< T > &objects, D dmetric, size_t k, std::vector< T > *medoids=NULL)
 This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK). More...
 
template<class T , class D >
double xcapek (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality, std::vector< T > *medoids=NULL)
 K-agnostic version of capek(). More...
 
const Timerget_timer ()
 Get the Timer with info on the last run of either capek() or xcapek(). More...
 
- Public Member Functions inherited from par_partition
 par_partition (MPI_Comm comm=MPI_COMM_WORLD)
 Construct a parallel partition for the communicator supplied Partition starts off with everyone in one cluster with medoid 0. More...
 
virtual ~par_partition ()
 Virtual destructor for inheritance. More...
 
void get_sizes (std::vector< size_t > &sizes)
 Scalably get the sizes of all the clusters in this partition. More...
 
void gather (partition &local, int root=0)
 Collective operation. More...
 

Protected Types

typedef boost::mt19937 random_t
 Type for random number generator used here. More...
 

Protected Member Functions

void seed_random_uniform (MPI_Comm comm)
 Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0. More...
 
template<typename T , typename D >
std::pair< double, size_t > closest_medoid (const T &object, object_id oid, const std::vector< id_pair< T > > &medoids, D dmetric)
 Find the closest object in the medoids vector to the object passed in. More...
 

Protected Attributes

random_t random
 Random number distribution to be used for samples. More...
 
bool seed_set
 
double total_dissimilarity
 Track whether the random seed has been set. More...
 
double best_bic_score
 BIC score for the clustering found. More...
 
size_t init_size
 baseline size for samples More...
 
size_t max_reps
 Max repetitions of trials for a particular k. More...
 
double epsilon
 Tolerance for convergence tests in kmedoids PAM runs. More...
 
Timer timer
 Performance timer. More...
 

Additional Inherited Members

- Public Attributes inherited from par_partition
std::vector< object_idmedoid_ids
 Gives the object id for the ith medoid. This object may not be local. More...
 
std::vector< object_idcluster_ids
 Global cluster ids for local objects. More...
 
MPI_Comm comm
 Communicator, the processes of which this partition divides. More...
 

Detailed Description

This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms.

For more information on these algorithms, see this paper:

Todd Gamblin, Bronis R. de Supinski, Martin Schulz, Rob Fowler, and Daniel A. Reed. Clustering Performance Data Efficiently at Massive Scales. Proceedings of the International Conference on Supercomputing (ICS'10), Tsukuba, Japan, June 1-4, 2010.

Example usage:

* // This is a theoretical point struct to be clustered.
* struct point {
* double x, y;
* };
*
* // Euclidean distance functor to use for clustering.
* double operator()(const point& lhs, const point& rhs) {
* double a = lhs.x - rhs.x;
* double b = lhs.y - rhs.y;
* return sqrt(a*a + b*b);
* }
* };
*
* vector<point> points;
* // ... Each process puts some local points in the vector ...
*
* par_kmedoids parkm(MPI_COMM_WORLD);
* vector<point> medoids; // storage for reprsentatives
*
* // Run xcapek(), finding a max of 20 clusters among the 2d points.
* parkm.xcapek(points, euclidean_distance(), 20, 2, &medoids);
*

After this runs, these members of parkm are interesting:

  • parkm.cluster_ids: A vector of cluster ids for the local objects in points
  • parkm.medoid_ids: A vector of object ids for all the cluster representatives

Together, these define the clustering that the algorithm found. See par_partition for an explanation of how distributed partitions like this are represented.

The medoids vector contains actual copies of the representatives for each cluster. The copies correspond to the object ids in parkm.medoid_ids. Supplying the medoids vector like this is optional, so if you don't need copies of the representatives, you can omit it from the call.

Definition at line 115 of file par_kmedoids.h.

Member Typedef Documentation

typedef boost::mt19937 random_t
protected

Type for random number generator used here.

Definition at line 559 of file par_kmedoids.h.

Constructor & Destructor Documentation

par_kmedoids ( MPI_Comm  comm = MPI_COMM_WORLD)

Constructs a parallel kmedoids object and seeds its random number generator.

This is a collective operation, and needs to be called by all processes.

par_kmedoids assumes that there is one object to be clustered per process.

Definition at line 47 of file par_kmedoids.cpp.

virtual ~par_kmedoids ( )
inlinevirtual

Definition at line 125 of file par_kmedoids.h.

Member Function Documentation

double average_dissimilarity ( )

Get the average dissimilarity of objects w/their medoids for the last run.

Definition at line 67 of file par_kmedoids.cpp.

double bic_score ( )

BIC score for selected clustering.

Definition at line 73 of file par_kmedoids.cpp.

void capek ( const std::vector< T > &  objects,
dmetric,
size_t  k,
std::vector< T > *  medoids = NULL 
)
inline

This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).

Assumes that objects to be clustered are fully distributed across parallel process, with the same number of objects per process.

Template Parameters
TType of objects to be clustered. T must support the following operations:
  • int packed_size(MPI_Comm comm) const
  • void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
  • static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)
DDissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's.
Parameters
[in]objectsLocal objects to cluster (ASSUME: currently must be same number per process!)
[in]dmetricDistance metric to build dissimilarity matrices with
[in]kNumber of clusters to find.
[out]medoidsOptional output vector where global medoids will be stored along with their source ranks.

CAPEK will run trials insatances of PAM on separate processors for each k in 1..max_k using the run_pam_trials() routine. Each trial aggregates sample_size objects distributed over all processes in the system.

See Also
xcapek() for a K-agnostic version of this algorithm.

Definition at line 323 of file par_kmedoids.h.

std::pair<double, size_t> closest_medoid ( const T &  object,
object_id  oid,
const std::vector< id_pair< T > > &  medoids,
dmetric 
)
inlineprotected

Find the closest object in the medoids vector to the object passed in.

Returns a pair of the closest medoid's id and its distance from the object.

Parameters
[in]objectObject to compare to medoids.
[in]oidID of the object (need this so medoids prefer themselves as their own medoids).
[in]medoidsVector of medoids to find the closest from.
[in]dmetricDistance metric to assess closeness with.

Definition at line 587 of file par_kmedoids.h.

size_t get_init_size ( )
inline

Baseline size for samples, added to 2*k.

Definition at line 158 of file par_kmedoids.h.

size_t get_max_reps ( )
inline

Max number of times to run PAM with each sampled dataset.

Definition at line 147 of file par_kmedoids.h.

const Timer& get_timer ( )
inline

Get the Timer with info on the last run of either capek() or xcapek().

Definition at line 556 of file par_kmedoids.h.

void run_pam_trials ( trial_generator trials,
const std::vector< T > &  objects,
dmetric,
std::vector< typename id_pair< T >::vector > &  all_medoids,
MPI_Comm  comm 
)
inline

Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.

Puts resulting medoids in all_medoids when done.

Definition at line 171 of file par_kmedoids.h.

void seed_random_uniform ( MPI_Comm  comm)
protected

Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.

Definition at line 77 of file par_kmedoids.cpp.

void set_epsilon ( double  epsilon)

Set tolerance for convergence.

This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.

Definition at line 62 of file par_kmedoids.cpp.

void set_init_size ( size_t  size)
inline

Sets init_size, baseline size for samples, added to 2*k.

Defaults to 40, per Kaufman and Rousseeuw.

Definition at line 153 of file par_kmedoids.h.

void set_max_reps ( size_t  reps)
inline

Sets max_reps, Max number of times to run PAM with each sampled dataset.

Default is 5, per Kaufman and Rousseeuw.

Definition at line 142 of file par_kmedoids.h.

void set_seed ( uint32_t  seed)

Set the random seed.

If used, must be called on all processes in the MPI_Comm with the same seed value. If not, a seed is generated and broadcast to the MPI_Comm.

Definition at line 57 of file par_kmedoids.cpp.

double xcapek ( const std::vector< T > &  objects,
dmetric,
size_t  max_k,
size_t  dimensionality,
std::vector< T > *  medoids = NULL 
)
inline

K-agnostic version of capek().

This version attempts to guess the best K for the data using the Bayesian Information Criterion (BIC) described in bic.h. Evaluation of the BIC is parallelized using global reduction operations.

Like capek(), this uses run_pam_trials() to farm out trials of the PAM clustering algorithm, but it requires more trials than capek(). In particular, it will run (sum(1..max_k) * trials) total trials in parallel on MPI worker processes.

Template Parameters
TType of objects to be clustered. T must support the following operations:
  • int packed_size(MPI_Comm comm) const
  • void pack(void *buf, int bufsize, int *position, MPI_Comm comm) const
  • static T unpack(void *buf, int bufsize, int *position, MPI_Comm comm)
DDissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's.
Parameters
[in]objectsLocal objects to cluster (ASSUME: currently must be same number per process!)
[in]dmetricDistance metric to build dissimilarity matrices with
[in]max_kMax number of clusters to find.
[in]dimensionalityDimensionality of objects, used by BIC.
[out]medoidsOptional output vector where global medoids will be stored along with their source ranks.
Returns
The best BIC value found, that is, the BIC value of the final clustering.

Definition at line 437 of file par_kmedoids.h.

Member Data Documentation

double best_bic_score
protected

BIC score for the clustering found.

Definition at line 564 of file par_kmedoids.h.

double epsilon
protected

Tolerance for convergence tests in kmedoids PAM runs.

Definition at line 567 of file par_kmedoids.h.

size_t init_size
protected

baseline size for samples

Definition at line 565 of file par_kmedoids.h.

size_t max_reps
protected

Max repetitions of trials for a particular k.

Definition at line 566 of file par_kmedoids.h.

random_t random
protected

Random number distribution to be used for samples.

Definition at line 560 of file par_kmedoids.h.

bool seed_set
protected

Definition at line 561 of file par_kmedoids.h.

Timer timer
protected

Performance timer.

Definition at line 569 of file par_kmedoids.h.

double total_dissimilarity
protected

Track whether the random seed has been set.

Total dissimilarity bt/w objects and medoids for last clustering.

Definition at line 563 of file par_kmedoids.h.


The documentation for this class was generated from the following files:
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Thu Sep 1 2016 using Doxygen 1.8.5