Muster
|
Namespace for everything in the cluster library. More...
Classes | |
class | binomial_embedding |
struct | counter_iterator |
Counting output iterator that records how many times an output iterator was assigned to, but ignores the value stored. More... | |
struct | matrix_distance |
Adaptor for passing a matrix by reference to template functions that take a callable distance function. More... | |
struct | lazy_distance_functor |
Functor for computing distance lazily from an object array and a distance metric. More... | |
struct | id_pair |
MPI-packable struct for an MPI-packable type plus its object id. More... | |
class | kmedoids |
Implementations of the classic clustering algorithms PAM and CLARA, from Finding Groups in Data, by Kaufman and Rousseeuw. More... | |
class | multi_gather |
Asynchronous, some-to-some gather operation used by parallel clustering algorithms to simultaneously send members of sample sets to a set of distributed worker processes. More... | |
struct | null_deleter |
Pass this to a shared pointer if you do NOT want it to own its object. More... | |
struct | packable_vector |
This class allows a vector of packable objects to be packed as though it were a packable object itself. More... | |
class | par_kmedoids |
This class implements the CAPEK and XCAPEK scalable parallel clustering algorithms. More... | |
struct | par_partition |
par_partition represents a partitioning of a distributed data set. More... | |
struct | partition |
This represents a partitioning of a data set. More... | |
struct | trial |
This struct represents parameters for a single trial run of kmedoids. More... | |
class | trial_generator |
Class to generate a set of trials for clustering. More... | |
Typedefs | |
typedef boost::numeric::ublas::symmetric_matrix < double > | dissimilarity_matrix |
Packed repersentation of symmetric dissimilarity matrix. More... | |
typedef size_t | medoid_id |
More descriptive type for medoid index. More... | |
typedef size_t | object_id |
More descriptive type for object index. More... | |
typedef std::vector< std::set < object_id > > | cluster_list |
Explicit representation of a clustering. More... | |
Functions | |
template<typename D > | |
double | bic (const partition &p, D distance, size_t M) |
Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters. More... | |
template<typename SizeIterator , typename DissimIterator > | |
double | bic (size_t k, SizeIterator cluster_sizes, DissimIterator sum2_dissim, size_t dimensionality) |
This version of the BIC assumes some precomputed information. More... | |
template<class T > | |
counter_iterator< T > | counter (T &ref) |
Adaptor for creating type-inferred counters. More... | |
template<class T , class D > | |
void | build_dissimilarity_matrix (const std::vector< T > &objects, D dissimilarity, dissimilarity_matrix &mat) |
Computes a dissimilarity matrix from a vector of objects. More... | |
template<class T , class D > | |
void | build_dissimilarity_matrix (const std::vector< T > &objects, const std::vector< size_t > &subset, D dissimilarity, dissimilarity_matrix &mat) |
Computes a dissimilarity matrix from a subset of a vector of objects. More... | |
template<class T , class D > | |
lazy_distance_functor< T, D > | lazy_distance (const std::vector< T > &objs, D dist) |
Type-inferred syntactic sugar for constructing lazy_distance_functor. More... | |
template<class T > | |
void | gather_packed (const T &src, std::vector< char > &dest, const binomial_embedding binomial, MPI_Comm comm) |
Packs and gathers a buffer full of packed representation of src's. More... | |
template<class T > | |
void | unpack_binomial (const std::vector< char > &src, std::vector< T > &dest, const binomial_embedding binomial, MPI_Comm comm) |
Unpacks a packed vector in binomial order into objects in rank order in the destination vector. More... | |
template<class T > | |
void | gather (const T &src, std::vector< T > &dest, MPI_Comm comm, int root=0) |
Binomial gather of char buffers into a single agglomerated clump of buffers. More... | |
template<class T > | |
void | allgather (const T &src, std::vector< T > &dest, MPI_Comm comm, int root=0) |
Allgather for variable-length data. More... | |
template<class T > | |
id_pair< T > | make_id_pair (const T &elt, int id) |
Helper function for making arbitrary id_pairs with type inference. More... | |
template<class T > | |
std::ostream & | operator<< (std::ostream &out, const id_pair< T > &p) |
Print out an id_pair as a tuple of its element and its source rank. More... | |
template<class T > | |
packable_vector< T > | make_packable_vector (std::vector< T > *vec, bool owned=true) |
Tempate function adaptor so we can have type inference when making packable vectors. More... | |
std::ostream & | operator<< (std::ostream &out, const par_partition &par) |
Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it. More... | |
ostream & | operator<< (ostream &out, const cluster_list &clusters) |
double | mirkin_distance (const cluster_list &c1, const cluster_list &c2) |
Mirkin distance bt/w two clusterings. More... | |
std::ostream & | operator<< (std::ostream &out, const partition &km) |
For convenience. More... | |
double | mirkin_distance (const partition &c1, const partition &c2) |
Convenience overload for comparing partition objects directly. More... | |
void | expand (cluster_list &list, size_t level=1) |
Expand a cluster_list by l levels. More... | |
std::ostream & | operator<< (std::ostream &out, const partition::member_writer &mw) |
std::ostream & | operator<< (std::ostream &out, const cluster_list &list) |
Prints out nicely formatted clustering. More... | |
template<typename D > | |
double | total_dissimilarity (const partition &p, D dist) |
Compute the total dissimilarity between all objects and their medoids. More... | |
template<typename D > | |
double | total_dissimilarity (const partition &p, D dist, medoid_id m) |
Compute the total dissimilarity between all objects in a particular cluster and its medoid. More... | |
template<typename D > | |
double | total_squared_dissimilarity (const partition &p, D dist) |
Compute the total squared dissimilarity between all objects and their medoids. More... | |
template<typename D > | |
double | total_squared_dissimilarity (const partition &p, D dist, medoid_id m) |
Compute the total squared dissimilarity between all objects in a particular cluster and its medoid. More... | |
template<class OutputIterator , class Random > | |
void | algorithm_r (size_t numElements, size_t sample_size, OutputIterator out, Random &random) |
This is Knuth's algorithm R for taking a sample of numElements numbers. More... | |
template<class OutputIterator , class Random > | |
void | fast_sample (size_t numElements, size_t sample_size, OutputIterator out, Random &random) |
This is a fast algorithm for random sampling that scales with the number of elements sampled (sample_size). More... | |
long | get_time_seed () |
Returns a seed for random number generators based on the product of sec and usec from gettimeofday(). More... | |
Namespace for everything in the cluster library.
typedef std::vector< std::set<object_id> > cluster_list |
Explicit representation of a clustering.
Instead of a vecto of representative ids, this has k sets of object_ids indicating which objects are in a particular cluster. You can convert a partition to a cluster_list with to_cluster_list().
Definition at line 60 of file partition.h.
typedef boost::numeric::ublas::symmetric_matrix<double> dissimilarity_matrix |
Packed repersentation of symmetric dissimilarity matrix.
Definition at line 48 of file dissimilarity.h.
typedef size_t medoid_id |
More descriptive type for medoid index.
Definition at line 51 of file partition.h.
typedef size_t object_id |
More descriptive type for object index.
Definition at line 52 of file partition.h.
void cluster::algorithm_r | ( | size_t | numElements, |
size_t | sample_size, | ||
OutputIterator | out, | ||
Random & | random | ||
) |
This is Knuth's algorithm R for taking a sample of numElements numbers.
We sample sample_size
indices from [0..numElements) and write them to out
.
Note that this algorithm scales linaerly with the number of objects sampled from (that is, numElements).
numElements | total number of elements to select from |
sample_size | number of elements to select |
out | destination for selected elements, must model output iterator |
random | model of STL Random Number Generator. must be callable as random(N), returning a random number in [0,N). |
void cluster::allgather | ( | const T & | src, |
std::vector< T > & | dest, | ||
MPI_Comm | comm, | ||
int | root = 0 |
||
) |
double cluster::bic | ( | const partition & | p, |
D | distance, | ||
size_t | M | ||
) |
Directly computes the BIC from a partition object based on the cluster centroids and the number of clusters.
[in] | p | A partition object describing the clustering to be evaluated. |
[in] | distance | A distance function callable on two indices from the partition p. |
[in] | M | Dimensionality parameter – degrees of freedom in the input dataset. |
double cluster::bic | ( | size_t | k, |
SizeIterator | cluster_sizes, | ||
DissimIterator | sum2_dissim, | ||
size_t | dimensionality | ||
) |
This version of the BIC assumes some precomputed information.
This is useful for parallel implementations, where it is more efficient to compute some global sums as a reduction rather than aggregating a full partition to one process. Here, we assume that the sizes of the distributed clusters as well as the total squared intra-cluster dissimilarity (between each object and its medoid) is known.
[in] | k | Number of clusters in the clustering. Same as k from k-means or k-medoids. |
[in] | cluster_sizes | Start of range of k sizes. *cluster_sizes .. *(cluster_sizes + k) should be the sizes of clusters 1 to k |
[in] | sum2_dissim | Sum of squared dissimilarities of each object w.r.t. its nearest medoid. |
[in] | dimensionality | Dimensionality of clustered data. e.g., 2 for 2-dimensional points. |
void cluster::build_dissimilarity_matrix | ( | const std::vector< T > & | objects, |
D | dissimilarity, | ||
dissimilarity_matrix & | mat | ||
) |
Computes a dissimilarity matrix from a vector of objects.
[in] | objects | Vector of any type T. |
[in] | dissimilarity | A dissimilarity measure that gives the distance between two T's. Needs to be callable on (T, T). |
[out] | mat | Output parameter. Dissimiliarity matrix is stored here. |
Definition at line 59 of file dissimilarity.h.
void cluster::build_dissimilarity_matrix | ( | const std::vector< T > & | objects, |
const std::vector< size_t > & | subset, | ||
D | dissimilarity, | ||
dissimilarity_matrix & | mat | ||
) |
Computes a dissimilarity matrix from a subset of a vector of objects.
objects | Vector of any type T. |
subset | Indirection vector. Contains indices into objects for elements to be compared. |
dissimilarity | A dissimilarity measure that gives the distance between two T's. Needs to be callable(T, T). |
mat | Output parameter. Dissimiliarity matrix is stored here. |
Definition at line 83 of file dissimilarity.h.
counter_iterator<T> cluster::counter | ( | T & | ref | ) |
Adaptor for creating type-inferred counters.
Example Usage:
void expand | ( | cluster_list & | list, |
size_t | level = 1 |
||
) |
Expand a cluster_list by l levels.
That is, replace each index i in the cluster_list with indices in [2^l * i ... 2^l * (i+1) - 1]
Definition at line 170 of file partition.cpp.
void cluster::fast_sample | ( | size_t | numElements, |
size_t | sample_size, | ||
OutputIterator | out, | ||
Random & | random | ||
) |
This is a fast algorithm for random sampling that scales with the number of elements sampled (sample_size).
Its performance does not depend on the number of elements sampled from (numElements), so it will perform better than algorithm R in most cases.
Specifically, this algorithm will perform better when sample_size is small compared to numElements. Algorithm R will perform better when sample_size is a large percentage of the elements to be sampled. Use this algoritm when you need O(sample_size) performance and sample_size is constant in the limit with respect to num_elements.
numElements | total number of elements to select from |
sample_size | number of elements to select |
out | destination for selected elements, must model output iterator. |
random | model of STL Random Number Generator. must be callable as random(N), returning a random number in [0,N). |
void cluster::gather | ( | const T & | src, |
std::vector< T > & | dest, | ||
MPI_Comm | comm, | ||
int | root = 0 |
||
) |
void cluster::gather_packed | ( | const T & | src, |
std::vector< char > & | dest, | ||
const binomial_embedding | binomial, | ||
MPI_Comm | comm | ||
) |
Packs and gathers a buffer full of packed representation of src's.
All packed src's are stored in the destination buffer in binomial traversal order on completion. That is, element i in dest is the value for rank binomial.reverse_relative_rank(i).
This allows you to use native MPI operations like bcast on the packed buffer once it's gathered.
|
inline |
lazy_distance_functor<T,D> cluster::lazy_distance | ( | const std::vector< T > & | objs, |
D | dist | ||
) |
Type-inferred syntactic sugar for constructing lazy_distance_functor.
Definition at line 124 of file dissimilarity.h.
id_pair<T> cluster::make_id_pair | ( | const T & | elt, |
int | id | ||
) |
packable_vector<T> cluster::make_packable_vector | ( | std::vector< T > * | vec, |
bool | owned = true |
||
) |
Tempate function adaptor so we can have type inference when making packable vectors.
Definition at line 95 of file packable_vector.h.
double mirkin_distance | ( | const cluster_list & | c1, |
const cluster_list & | c2 | ||
) |
Mirkin distance bt/w two clusterings.
Definition at line 125 of file partition.cpp.
double mirkin_distance | ( | const partition & | c1, |
const partition & | c2 | ||
) |
Convenience overload for comparing partition objects directly.
Definition at line 162 of file partition.cpp.
std::ostream& cluster::operator<< | ( | std::ostream & | out, |
const id_pair< T > & | p | ||
) |
std::ostream & operator<< | ( | std::ostream & | out, |
const par_partition & | par | ||
) |
Right now this just uses parition::operator<<() by making a partition with this par_partition's cluster_ids and medoid_ids vectors and outputting it.
Definition at line 110 of file par_partition.cpp.
ostream& cluster::operator<< | ( | ostream & | out, |
const cluster_list & | clusters | ||
) |
Definition at line 113 of file partition.cpp.
|
inline |
Definition at line 141 of file partition.h.
std::ostream& cluster::operator<< | ( | std::ostream & | out, |
const cluster_list & | list | ||
) |
Prints out nicely formatted clustering.
std::ostream & operator<< | ( | std::ostream & | out, |
const partition & | p | ||
) |
For convenience.
Definition at line 154 of file partition.cpp.
double cluster::total_dissimilarity | ( | const partition & | p, |
D | dist | ||
) |
Compute the total dissimilarity between all objects and their medoids.
Definition at line 170 of file partition.h.
double cluster::total_dissimilarity | ( | const partition & | p, |
D | dist, | ||
medoid_id | m | ||
) |
Compute the total dissimilarity between all objects in a particular cluster and its medoid.
Definition at line 182 of file partition.h.
double cluster::total_squared_dissimilarity | ( | const partition & | p, |
D | dist | ||
) |
Compute the total squared dissimilarity between all objects and their medoids.
Definition at line 196 of file partition.h.
double cluster::total_squared_dissimilarity | ( | const partition & | p, |
D | dist, | ||
medoid_id | m | ||
) |
Compute the total squared dissimilarity between all objects in a particular cluster and its medoid.
Definition at line 210 of file partition.h.