Muster
|
Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More...
#include <kmedoids.h>
Public Member Functions | |
kmedoids (size_t num_objects=0) | |
Constructor. More... | |
~kmedoids () | |
double | average_dissimilarity () const |
Get the average dissimilarity of objects w/their medoids for the last run. More... | |
void | set_seed (unsigned long seed) |
Set random seed. More... | |
void | set_sort_medoids (bool sort_medoids) |
Set whether medoids will be sorted by object id after clustering is complete. More... | |
void | set_epsilon (double epsilon) |
Set tolerance for convergence. More... | |
void | pam (const dissimilarity_matrix &distance, size_t k, const object_id *initial_medoids=NULL) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw. More... | |
double | xpam (const dissimilarity_matrix &distance, size_t max_k, size_t dimensionality) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw. More... | |
template<class T , class D > | |
void | clara (const std::vector< T > &objects, D dmetric, size_t k) |
CLARA clustering algorithm, as per Kaufman and Rousseuw and R. More... | |
template<class T , class D > | |
void | center_medoids (const std::vector< T > &objects, D distance) |
Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster. More... | |
template<class T , class D > | |
double | xclara (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality) |
K-Agnostic version of CLARA. More... | |
void | set_init_size (size_t sz) |
void | set_max_reps (size_t r) |
void | set_xcallback (void(*)(const partition &part, double bic)) |
Set callback function for XPAM and XCLARA. default is none. More... | |
![]() | |
partition (size_t num_objects=0) | |
Constructor. More... | |
virtual | ~partition () |
Virtual destructor; currently does nothing. More... | |
bool | is_medoid (object_id oi) const |
True if and only if object i is a medoid. More... | |
void | to_cluster_list (cluster_list &list) const |
Creates a list of std::sets from the partition info in medoids and cluster_ids. More... | |
void | swap (partition &other) |
Fast swap with other patrition objects. More... | |
void | sort () |
puts medoids in order by their object id, and adjusts cluster_ids accordingly. More... | |
size_t | size () const |
Total number of objects in the partition. More... | |
size_t | num_clusters () const |
Total number of clusters in the partition. More... | |
size_t | size (size_t i) const |
Number of objects in cluster i. More... | |
template<class OutputIterator > | |
void | write_members (medoid_id m, OutputIterator out) |
Write the members of cluster m out to the output stream as object_ids. More... | |
void | write_members_with_runs (medoid_id m, std::ostream &out) |
Write the members of cluster m out to the output stream formatted nicely with hyphenated runs of consecutive numbers. More... | |
member_writer | members (medoid_id m) |
Protected Types | |
typedef boost::mt19937 | random_type |
typedef boost::random_number_generator < random_type, unsigned long > | rng_type |
Randomness source for this algorithm. More... | |
Protected Member Functions | |
void | init_medoids (size_t k, const dissimilarity_matrix &distance) |
KR BUILD algorithm for assigning initial medoids to a partition. More... | |
double | cost (medoid_id i, object_id h, const dissimilarity_matrix &distance) const |
Total cost of swapping object h with medoid i. More... | |
template<class DM > | |
double | assign_objects_to_clusters (DM distance) |
Assign each object to the cluster with the closest medoid. More... | |
Protected Attributes | |
random_type | random |
Type for RNG used in this algorithm. More... | |
rng_type | rng |
std::vector< medoid_id > | sec_nearest |
double | total_dissimilarity |
Index of second closest medoids. Used by PAM. More... | |
bool | sort_medoids |
Total dissimilarity bt/w objects and their medoid. More... | |
double | epsilon |
Whether medoids should be canonically sorted by object id. More... | |
size_t | init_size |
Normalized sensitivity for convergence. More... | |
size_t | max_reps |
initial sample size (before 2*k) More... | |
void(* | xcallback )(const partition &part, double bic) |
initial sample size (before 2*k) More... | |
Additional Inherited Members | |
![]() | |
std::vector< object_id > | medoid_ids |
Gives the index of the object that is the ith medoid. More... | |
std::vector< medoid_id > | cluster_ids |
Gives cluster id (index in medoids) for the ith object. More... | |
Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw.
Definition at line 60 of file kmedoids.h.
|
protected |
Definition at line 265 of file kmedoids.h.
|
protected |
Randomness source for this algorithm.
Adaptor for STL algorithms.
Definition at line 269 of file kmedoids.h.
kmedoids | ( | size_t | num_objects = 0 | ) |
Constructor.
Can optionally specify number of objects to be clustered. and this will start out with all of them in one cluster.
The random number generator associated with this kmedoids object is seeded with the time in microseconds since the epoch.
Definition at line 55 of file kmedoids.cpp.
~kmedoids | ( | ) |
Definition at line 68 of file kmedoids.cpp.
|
inlineprotected |
Assign each object to the cluster with the closest medoid.
distance | a callable object that computes distances between indices, as a distance matrix would. Algorithms are free to use real distance matrices (as in PAM) or to compute lazily (as in CLARA medoid assignment). |
Definition at line 299 of file kmedoids.h.
double average_dissimilarity | ( | ) | const |
Get the average dissimilarity of objects w/their medoids for the last run.
Definition at line 70 of file kmedoids.cpp.
|
inline |
Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster.
This is O(n) and can give better representatives for CLARA clusterings. This is needed to apply gaussian-model BIC criteria to clusterings.
T | To use this function, T needs to support algebraic operations. Specifically, T must support enough to construct a mean:
|
Definition at line 198 of file kmedoids.h.
|
inline |
CLARA clustering algorithm, as per Kaufman and Rousseuw and R.
Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining."
T | Type of objects to be clustered. |
D | Dissimilarity metric type. D should be callable on (T, T) and should return a double. |
objects | Objects to cluster |
dmetric | Distance metric to build dissimilarity matrices with |
k | Number of clusters to partition |
Definition at line 133 of file kmedoids.h.
|
protected |
Total cost of swapping object h with medoid i.
Sums costs of this exchagne for all objects j.
Definition at line 135 of file kmedoids.cpp.
|
protected |
KR BUILD algorithm for assigning initial medoids to a partition.
Definition at line 90 of file kmedoids.cpp.
void pam | ( | const dissimilarity_matrix & | distance, |
size_t | k, | ||
const object_id * | initial_medoids = NULL |
||
) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.
distance | dissimilarity matrix for all objects to cluster |
k | number of clusters to produce |
initial_medoids | Optionally supply k initial object ids to be used as initial medoids. |
Definition at line 161 of file kmedoids.cpp.
void set_epsilon | ( | double | epsilon | ) |
Set tolerance for convergence.
This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.
Definition at line 82 of file kmedoids.cpp.
|
inline |
Definition at line 257 of file kmedoids.h.
|
inline |
Definition at line 258 of file kmedoids.h.
void set_seed | ( | unsigned long | seed | ) |
Set random seed.
Definition at line 74 of file kmedoids.cpp.
void set_sort_medoids | ( | bool | sort_medoids | ) |
Set whether medoids will be sorted by object id after clustering is complete.
Defaults to true.
Definition at line 78 of file kmedoids.cpp.
Set callback function for XPAM and XCLARA. default is none.
Definition at line 86 of file kmedoids.cpp.
|
inline |
K-Agnostic version of CLARA.
This uses the BIC criterion as described in bic.h to run clara() a number of times and to select a best run of clara() from the trials. This will be slower than regular clara(). In particular, it's O(n*max_k).
[in] | objects | Objects to cluster |
[in] | dmetric | Distance metric to build dissimilarity matrices with |
[in] | max_k | Max number of clusters to find. |
[in] | dimensionality | Dimensionality of objects, used by BIC. |
Definition at line 238 of file kmedoids.h.
double xpam | ( | const dissimilarity_matrix & | distance, |
size_t | max_k, | ||
size_t | dimensionality | ||
) |
Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.
Runs PAM from 1 to max_k and selects the best k using the bayesian information criterion. Sets this partition to the best partition found using PAM from 1 to k.
Based on X-Means, see Pelleg & Moore, 2000.
distance | dissimilarity matrix for all objects to cluster |
max_k | Upper limit on number of clusters to find. |
dimensionality | Number of dimensions in clustered data, for BIC. |
Definition at line 222 of file kmedoids.cpp.
|
protected |
Whether medoids should be canonically sorted by object id.
Definition at line 275 of file kmedoids.h.
|
protected |
Normalized sensitivity for convergence.
Definition at line 276 of file kmedoids.h.
|
protected |
initial sample size (before 2*k)
Definition at line 277 of file kmedoids.h.
|
protected |
Type for RNG used in this algorithm.
Definition at line 266 of file kmedoids.h.
|
protected |
Definition at line 270 of file kmedoids.h.
|
protected |
Definition at line 272 of file kmedoids.h.
|
protected |
Total dissimilarity bt/w objects and their medoid.
Definition at line 274 of file kmedoids.h.
|
protected |
Index of second closest medoids. Used by PAM.
Definition at line 273 of file kmedoids.h.
initial sample size (before 2*k)
Callback for each iteration of xpam. is called with the current clustering and its BIC score.
Definition at line 281 of file kmedoids.h.