Muster
|
This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More...
#include <par_kmedoids.h>
Public Member Functions | |
par_kmedoids (MPI_Comm comm=MPI_COMM_WORLD) | |
Constructs a parallel kmedoids object and seeds its random number generator. More... | |
virtual | ~par_kmedoids () |
void | set_seed (uint32_t seed) |
Set the random seed. More... | |
double | average_dissimilarity () |
Get the average dissimilarity of objects w/their medoids for the last run. More... | |
double | bic_score () |
BIC score for selected clustering. More... | |
void | set_max_reps (size_t reps) |
Sets max_reps, Max number of times to run PAM with each sampled dataset. More... | |
size_t | get_max_reps () |
Max number of times to run PAM with each sampled dataset. More... | |
void | set_init_size (size_t size) |
Sets init_size, baseline size for samples, added to 2*k. More... | |
size_t | get_init_size () |
Baseline size for samples, added to 2*k. More... | |
void | set_epsilon (double epsilon) |
Set tolerance for convergence. More... | |
template<class T , class D > | |
void | run_pam_trials (trial_generator &trials, const std::vector< T > &objects, D dmetric, std::vector< typename id_pair< T >::vector > &all_medoids, MPI_Comm comm) |
Farms out trials of PAM to worker processes then collects medoids from all trials to all processors. More... | |
template<class T , class D > | |
void | capek (const std::vector< T > &objects, D dmetric, size_t k, std::vector< T > *medoids=NULL) |
This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK). More... | |
template<class T , class D > | |
double | xcapek (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality, std::vector< T > *medoids=NULL) |
K-agnostic version of capek(). More... | |
const Timer & | get_timer () |
Get the Timer with info on the last run of either capek() or xcapek(). More... | |
![]() | |
par_partition (MPI_Comm comm=MPI_COMM_WORLD) | |
Construct a parallel partition for the communicator supplied Partition starts off with everyone in one cluster with medoid 0. More... | |
virtual | ~par_partition () |
Virtual destructor for inheritance. More... | |
void | get_sizes (std::vector< size_t > &sizes) |
Scalably get the sizes of all the clusters in this partition. More... | |
void | gather (partition &local, int root=0) |
Collective operation. More... | |
Protected Types | |
typedef boost::mt19937 | random_t |
Type for random number generator used here. More... | |
Protected Member Functions | |
void | seed_random_uniform (MPI_Comm comm) |
Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0. More... | |
template<typename T , typename D > | |
std::pair< double, size_t > | closest_medoid (const T &object, object_id oid, const std::vector< id_pair< T > > &medoids, D dmetric) |
Find the closest object in the medoids vector to the object passed in. More... | |
Protected Attributes | |
random_t | random |
Random number distribution to be used for samples. More... | |
bool | seed_set |
double | total_dissimilarity |
Track whether the random seed has been set. More... | |
double | best_bic_score |
BIC score for the clustering found. More... | |
size_t | init_size |
baseline size for samples More... | |
size_t | max_reps |
Max repetitions of trials for a particular k. More... | |
double | epsilon |
Tolerance for convergence tests in kmedoids PAM runs. More... | |
Timer | timer |
Performance timer. More... | |
Additional Inherited Members | |
![]() | |
std::vector< object_id > | medoid_ids |
Gives the object id for the ith medoid. This object may not be local. More... | |
std::vector< object_id > | cluster_ids |
Global cluster ids for local objects. More... | |
MPI_Comm | comm |
Communicator, the processes of which this partition divides. More... | |
This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms.
For more information on these algorithms, see this paper:
Example usage:
After this runs, these members of parkm
are interesting:
parkm.cluster_ids
: A vector of cluster ids for the local objects in points
parkm.medoid_ids
: A vector of object ids for all the cluster representativesTogether, these define the clustering that the algorithm found. See par_partition for an explanation of how distributed partitions like this are represented.
The medoids
vector contains actual copies of the representatives for each cluster. The copies correspond to the object ids in parkm.medoid_ids
. Supplying the medoids
vector like this is optional, so if you don't need copies of the representatives, you can omit it from the call.
Definition at line 115 of file par_kmedoids.h.
|
protected |
Type for random number generator used here.
Definition at line 559 of file par_kmedoids.h.
par_kmedoids | ( | MPI_Comm | comm = MPI_COMM_WORLD | ) |
Constructs a parallel kmedoids object and seeds its random number generator.
This is a collective operation, and needs to be called by all processes.
par_kmedoids assumes that there is one object to be clustered per process.
Definition at line 47 of file par_kmedoids.cpp.
|
inlinevirtual |
Definition at line 125 of file par_kmedoids.h.
double average_dissimilarity | ( | ) |
Get the average dissimilarity of objects w/their medoids for the last run.
Definition at line 67 of file par_kmedoids.cpp.
double bic_score | ( | ) |
BIC score for selected clustering.
Definition at line 73 of file par_kmedoids.cpp.
|
inline |
This is the Clustering Algorithm with Parallel Extensions to K-Medoids (CAPEK).
Assumes that objects to be clustered are fully distributed across parallel process, with the same number of objects per process.
T | Type of objects to be clustered. T must support the following operations:
|
D | Dissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's. |
[in] | objects | Local objects to cluster (ASSUME: currently must be same number per process!) |
[in] | dmetric | Distance metric to build dissimilarity matrices with |
[in] | k | Number of clusters to find. |
[out] | medoids | Optional output vector where global medoids will be stored along with their source ranks. |
CAPEK will run trials insatances of PAM on separate processors for each k in 1..max_k using the run_pam_trials() routine. Each trial aggregates sample_size objects distributed over all processes in the system.
Definition at line 323 of file par_kmedoids.h.
|
inlineprotected |
Find the closest object in the medoids vector to the object passed in.
Returns a pair of the closest medoid's id and its distance from the object.
[in] | object | Object to compare to medoids. |
[in] | oid | ID of the object (need this so medoids prefer themselves as their own medoids). |
[in] | medoids | Vector of medoids to find the closest from. |
[in] | dmetric | Distance metric to assess closeness with. |
Definition at line 587 of file par_kmedoids.h.
|
inline |
Baseline size for samples, added to 2*k.
Definition at line 158 of file par_kmedoids.h.
|
inline |
Max number of times to run PAM with each sampled dataset.
Definition at line 147 of file par_kmedoids.h.
|
inline |
Get the Timer with info on the last run of either capek() or xcapek().
Definition at line 556 of file par_kmedoids.h.
|
inline |
Farms out trials of PAM to worker processes then collects medoids from all trials to all processors.
Puts resulting medoids in all_medoids when done.
Definition at line 171 of file par_kmedoids.h.
|
protected |
Seeds random number generators across all processes with the same number, taken from the time in microseconds since the epoch on the process 0.
Definition at line 77 of file par_kmedoids.cpp.
void set_epsilon | ( | double | epsilon | ) |
Set tolerance for convergence.
This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.
Definition at line 62 of file par_kmedoids.cpp.
|
inline |
Sets init_size, baseline size for samples, added to 2*k.
Defaults to 40, per Kaufman and Rousseeuw.
Definition at line 153 of file par_kmedoids.h.
|
inline |
Sets max_reps, Max number of times to run PAM with each sampled dataset.
Default is 5, per Kaufman and Rousseeuw.
Definition at line 142 of file par_kmedoids.h.
void set_seed | ( | uint32_t | seed | ) |
Set the random seed.
If used, must be called on all processes in the MPI_Comm with the same seed value. If not, a seed is generated and broadcast to the MPI_Comm.
Definition at line 57 of file par_kmedoids.cpp.
|
inline |
K-agnostic version of capek().
This version attempts to guess the best K for the data using the Bayesian Information Criterion (BIC) described in bic.h. Evaluation of the BIC is parallelized using global reduction operations.
Like capek(), this uses run_pam_trials() to farm out trials of the PAM clustering algorithm, but it requires more trials than capek(). In particular, it will run (sum(1..max_k) * trials) total trials in parallel on MPI worker processes.
T | Type of objects to be clustered. T must support the following operations:
|
D | Dissimilarity metric type. D should be callable on (T, T) and should return a double representing the distance between the two T's. |
[in] | objects | Local objects to cluster (ASSUME: currently must be same number per process!) |
[in] | dmetric | Distance metric to build dissimilarity matrices with |
[in] | max_k | Max number of clusters to find. |
[in] | dimensionality | Dimensionality of objects, used by BIC. |
[out] | medoids | Optional output vector where global medoids will be stored along with their source ranks. |
Definition at line 437 of file par_kmedoids.h.
|
protected |
BIC score for the clustering found.
Definition at line 564 of file par_kmedoids.h.
|
protected |
Tolerance for convergence tests in kmedoids PAM runs.
Definition at line 567 of file par_kmedoids.h.
|
protected |
baseline size for samples
Definition at line 565 of file par_kmedoids.h.
|
protected |
Max repetitions of trials for a particular k.
Definition at line 566 of file par_kmedoids.h.
|
protected |
Random number distribution to be used for samples.
Definition at line 560 of file par_kmedoids.h.
|
protected |
Definition at line 561 of file par_kmedoids.h.
|
protected |
Performance timer.
Definition at line 569 of file par_kmedoids.h.
|
protected |
Track whether the random seed has been set.
Total dissimilarity bt/w objects and medoids for last clustering.
Definition at line 563 of file par_kmedoids.h.