Muster
 All Classes Namespaces Files Functions Variables Typedefs Macros
kmedoids Class Reference

Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More...

#include <kmedoids.h>

Inheritance diagram for kmedoids:
Collaboration diagram for kmedoids:

Public Member Functions

 kmedoids (size_t num_objects=0)
 Constructor. More...
 
 ~kmedoids ()
 
double average_dissimilarity () const
 Get the average dissimilarity of objects w/their medoids for the last run. More...
 
void set_seed (unsigned long seed)
 Set random seed. More...
 
void set_sort_medoids (bool sort_medoids)
 Set whether medoids will be sorted by object id after clustering is complete. More...
 
void set_epsilon (double epsilon)
 Set tolerance for convergence. More...
 
void pam (const dissimilarity_matrix &distance, size_t k, const object_id *initial_medoids=NULL)
 Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw. More...
 
double xpam (const dissimilarity_matrix &distance, size_t max_k, size_t dimensionality)
 Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw. More...
 
template<class T , class D >
void clara (const std::vector< T > &objects, D dmetric, size_t k)
 CLARA clustering algorithm, as per Kaufman and Rousseuw and R. More...
 
template<class T , class D >
void center_medoids (const std::vector< T > &objects, D distance)
 Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster. More...
 
template<class T , class D >
double xclara (const std::vector< T > &objects, D dmetric, size_t max_k, size_t dimensionality)
 K-Agnostic version of CLARA. More...
 
void set_init_size (size_t sz)
 
void set_max_reps (size_t r)
 
void set_xcallback (void(*)(const partition &part, double bic))
 Set callback function for XPAM and XCLARA. default is none. More...
 
- Public Member Functions inherited from partition
 partition (size_t num_objects=0)
 Constructor. More...
 
virtual ~partition ()
 Virtual destructor; currently does nothing. More...
 
bool is_medoid (object_id oi) const
 True if and only if object i is a medoid. More...
 
void to_cluster_list (cluster_list &list) const
 Creates a list of std::sets from the partition info in medoids and cluster_ids. More...
 
void swap (partition &other)
 Fast swap with other patrition objects. More...
 
void sort ()
 puts medoids in order by their object id, and adjusts cluster_ids accordingly. More...
 
size_t size () const
 Total number of objects in the partition. More...
 
size_t num_clusters () const
 Total number of clusters in the partition. More...
 
size_t size (size_t i) const
 Number of objects in cluster i. More...
 
template<class OutputIterator >
void write_members (medoid_id m, OutputIterator out)
 Write the members of cluster m out to the output stream as object_ids. More...
 
void write_members_with_runs (medoid_id m, std::ostream &out)
 Write the members of cluster m out to the output stream formatted nicely with hyphenated runs of consecutive numbers. More...
 
member_writer members (medoid_id m)
 

Protected Types

typedef boost::mt19937 random_type
 
typedef
boost::random_number_generator
< random_type, unsigned long > 
rng_type
 Randomness source for this algorithm. More...
 

Protected Member Functions

void init_medoids (size_t k, const dissimilarity_matrix &distance)
 KR BUILD algorithm for assigning initial medoids to a partition. More...
 
double cost (medoid_id i, object_id h, const dissimilarity_matrix &distance) const
 Total cost of swapping object h with medoid i. More...
 
template<class DM >
double assign_objects_to_clusters (DM distance)
 Assign each object to the cluster with the closest medoid. More...
 

Protected Attributes

random_type random
 Type for RNG used in this algorithm. More...
 
rng_type rng
 
std::vector< medoid_idsec_nearest
 
double total_dissimilarity
 Index of second closest medoids. Used by PAM. More...
 
bool sort_medoids
 Total dissimilarity bt/w objects and their medoid. More...
 
double epsilon
 Whether medoids should be canonically sorted by object id. More...
 
size_t init_size
 Normalized sensitivity for convergence. More...
 
size_t max_reps
 initial sample size (before 2*k) More...
 
void(* xcallback )(const partition &part, double bic)
 initial sample size (before 2*k) More...
 

Additional Inherited Members

- Public Attributes inherited from partition
std::vector< object_idmedoid_ids
 Gives the index of the object that is the ith medoid. More...
 
std::vector< medoid_idcluster_ids
 Gives cluster id (index in medoids) for the ith object. More...
 

Detailed Description

Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw.

Definition at line 60 of file kmedoids.h.

Member Typedef Documentation

typedef boost::mt19937 random_type
protected

Definition at line 265 of file kmedoids.h.

typedef boost::random_number_generator<random_type, unsigned long> rng_type
protected

Randomness source for this algorithm.

Adaptor for STL algorithms.

Definition at line 269 of file kmedoids.h.

Constructor & Destructor Documentation

kmedoids ( size_t  num_objects = 0)

Constructor.

Can optionally specify number of objects to be clustered. and this will start out with all of them in one cluster.

The random number generator associated with this kmedoids object is seeded with the time in microseconds since the epoch.

Definition at line 55 of file kmedoids.cpp.

~kmedoids ( )

Definition at line 68 of file kmedoids.cpp.

Member Function Documentation

double assign_objects_to_clusters ( DM  distance)
inlineprotected

Assign each object to the cluster with the closest medoid.

Returns
Total dissimilarity of objects w/their medoids.
Parameters
distancea callable object that computes distances between indices, as a distance matrix would. Algorithms are free to use real distance matrices (as in PAM) or to compute lazily (as in CLARA medoid assignment).

Definition at line 299 of file kmedoids.h.

double average_dissimilarity ( ) const

Get the average dissimilarity of objects w/their medoids for the last run.

Definition at line 70 of file kmedoids.cpp.

void center_medoids ( const std::vector< T > &  objects,
distance 
)
inline

Takes existing clustering and reassigns medoids by taking closest medoid to mean of each cluster.

This is O(n) and can give better representatives for CLARA clusterings. This is needed to apply gaussian-model BIC criteria to clusterings.

Template Parameters
TTo use this function, T needs to support algebraic operations.
Specifically, T must support enough to construct a mean:
  • addition T + T = T
  • scalar division T / c = T

Definition at line 198 of file kmedoids.h.

void clara ( const std::vector< T > &  objects,
dmetric,
size_t  k 
)
inline

CLARA clustering algorithm, as per Kaufman and Rousseuw and R.

Ng and J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining."

Template Parameters
TType of objects to be clustered.
DDissimilarity metric type. D should be callable on (T, T) and should return a double.
Parameters
objectsObjects to cluster
dmetricDistance metric to build dissimilarity matrices with
kNumber of clusters to partition

Definition at line 133 of file kmedoids.h.

double cost ( medoid_id  i,
object_id  h,
const dissimilarity_matrix distance 
) const
protected

Total cost of swapping object h with medoid i.

Sums costs of this exchagne for all objects j.

Definition at line 135 of file kmedoids.cpp.

void init_medoids ( size_t  k,
const dissimilarity_matrix distance 
)
protected

KR BUILD algorithm for assigning initial medoids to a partition.

Definition at line 90 of file kmedoids.cpp.

void pam ( const dissimilarity_matrix distance,
size_t  k,
const object_id initial_medoids = NULL 
)

Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.

Parameters
distancedissimilarity matrix for all objects to cluster
knumber of clusters to produce
initial_medoidsOptionally supply k initial object ids to be used as initial medoids.
See Also
build_dissimilarity_matrix(), a function to automatically construct a dissimilarity matrix given a vector of objects and a distance function.

Definition at line 161 of file kmedoids.cpp.

void set_epsilon ( double  epsilon)

Set tolerance for convergence.

This is relative error, not absolute error. It will be multiplied by the mean distance before it is used to test convergence. Defaults to 1e-15; may need to be higher if there exist clusterings with very similar quality.

Definition at line 82 of file kmedoids.cpp.

void set_init_size ( size_t  sz)
inline

Definition at line 257 of file kmedoids.h.

void set_max_reps ( size_t  r)
inline

Definition at line 258 of file kmedoids.h.

void set_seed ( unsigned long  seed)

Set random seed.

Definition at line 74 of file kmedoids.cpp.

void set_sort_medoids ( bool  sort_medoids)

Set whether medoids will be sorted by object id after clustering is complete.

Defaults to true.

Definition at line 78 of file kmedoids.cpp.

void set_xcallback ( void(*)(const partition &part, double bic xpc)

Set callback function for XPAM and XCLARA. default is none.

Definition at line 86 of file kmedoids.cpp.

double xclara ( const std::vector< T > &  objects,
dmetric,
size_t  max_k,
size_t  dimensionality 
)
inline

K-Agnostic version of CLARA.

This uses the BIC criterion as described in bic.h to run clara() a number of times and to select a best run of clara() from the trials. This will be slower than regular clara(). In particular, it's O(n*max_k).

Parameters
[in]objectsObjects to cluster
[in]dmetricDistance metric to build dissimilarity matrices with
[in]max_kMax number of clusters to find.
[in]dimensionalityDimensionality of objects, used by BIC.

Definition at line 238 of file kmedoids.h.

double xpam ( const dissimilarity_matrix distance,
size_t  max_k,
size_t  dimensionality 
)

Classic K-Medoids clustering, using the Partitioning-Around-Medoids (PAM) algorithm as described in Kaufman and Rousseeuw.

Runs PAM from 1 to max_k and selects the best k using the bayesian information criterion. Sets this partition to the best partition found using PAM from 1 to k.

Based on X-Means, see Pelleg & Moore, 2000.

Parameters
distancedissimilarity matrix for all objects to cluster
max_kUpper limit on number of clusters to find.
dimensionalityNumber of dimensions in clustered data, for BIC.
Returns
the best BIC value found (the bic value of the final partitioning).
See Also
build_dissimilarity_matrix(), a function to automatically construct a dissimilarity matrix given a vector of objects and a distance function.

Definition at line 222 of file kmedoids.cpp.

Member Data Documentation

double epsilon
protected

Whether medoids should be canonically sorted by object id.

Definition at line 275 of file kmedoids.h.

size_t init_size
protected

Normalized sensitivity for convergence.

Definition at line 276 of file kmedoids.h.

size_t max_reps
protected

initial sample size (before 2*k)

Definition at line 277 of file kmedoids.h.

random_type random
protected

Type for RNG used in this algorithm.

Definition at line 266 of file kmedoids.h.

rng_type rng
protected

Definition at line 270 of file kmedoids.h.

std::vector<medoid_id> sec_nearest
protected

Definition at line 272 of file kmedoids.h.

bool sort_medoids
protected

Total dissimilarity bt/w objects and their medoid.

Definition at line 274 of file kmedoids.h.

double total_dissimilarity
protected

Index of second closest medoids. Used by PAM.

Definition at line 273 of file kmedoids.h.

void(* xcallback)(const partition &part, double bic)
protected

initial sample size (before 2*k)

Callback for each iteration of xpam. is called with the current clustering and its BIC score.

Definition at line 281 of file kmedoids.h.


The documentation for this class was generated from the following files:
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Thu Sep 1 2016 using Doxygen 1.8.5