Muster
 All Classes Namespaces Files Functions Variables Typedefs Macros
bic.h
Go to the documentation of this file.
1 //////////////////////////////////////////////////////////////////////////////////////////////////
2 // Copyright (c) 2010, Lawrence Livermore National Security, LLC.
3 // Produced at the Lawrence Livermore National Laboratory
4 // LLNL-CODE-433662
5 // All rights reserved.
6 //
7 // This file is part of Muster. For details, see http://github.com/tgamblin/muster.
8 // Please also read the LICENSE file for further information.
9 //
10 // Redistribution and use in source and binary forms, with or without modification, are
11 // permitted provided that the following conditions are met:
12 //
13 // * Redistributions of source code must retain the above copyright notice, this list of
14 // conditions and the disclaimer below.
15 // * Redistributions in binary form must reproduce the above copyright notice, this list of
16 // conditions and the disclaimer (as noted below) in the documentation and/or other materials
17 // provided with the distribution.
18 // * Neither the name of the LLNS/LLNL nor the names of its contributors may be used to endorse
19 // or promote products derived from this software without specific prior written permission.
20 //
21 // THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS
22 // OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
23 // MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL
24 // LAWRENCE LIVERMORE NATIONAL SECURITY, LLC, THE U.S. DEPARTMENT OF ENERGY OR CONTRIBUTORS BE
25 // LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
26 // (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
27 // DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
28 // WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN
29 // ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
30 //////////////////////////////////////////////////////////////////////////////////////////////////
31 
32 ///
33 /// @file bic.h
34 /// @author Todd Gamblin tgamblin@llnl.gov
35 /// @brief Template function implementations of the Bayesian Information Criterion.
36 ///
37 /// The Bayesian Information Criterion (BIC) is a criterion for model selection
38 /// that balances a maximum likelihood estimator and a parameter count. This
39 /// implementation is designed for clustering algorithms, in particular K-means and
40 /// K-medoids clustering algorithms where we expect clusters with spherical gaussian
41 /// distributions.
42 ///
43 /// Here, we want to test whether a clustering's centroids or medoids are good predictors
44 /// of the points in a data set, so these are our parameters, and we try to find the best
45 /// clustering without too many clusters. For more on this technique and the approach
46 /// we've based this implementation on, see this paper:
47 /// @par
48 /// Dan Pelleg and Andrew Moore. <a href="http://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf">
49 /// <b>X-Means: Extending K-Means with Efficient Estimation of the Number of Clusters</b></a>.
50 /// <i>Proceedings of the Seventeenth International Conference on Machine Learning</i>,
51 /// San Francisco, CA. June 29-July 2, 2000. pp 727-734.
52 ///
53 #ifndef BAYESIAN_INFORMATION_CRITERION_H
54 #define BAYESIAN_INFORMATION_CRITERION_H
55 
56 #include <stdint.h>
57 #include <numeric>
58 #include <iostream>
59 #include <cmath>
60 #include <vector>
61 #include "dissimilarity.h"
62 #include "partition.h"
63 
64 namespace cluster {
65 
66  ///
67  /// Directly computes the BIC from a partition object based on the cluster centroids
68  /// and the number of clusters.
69  ///
70  /// @param[in] p A partition object describing the clustering to be evaluated.
71  /// @param[in] distance A distance function callable on two \em indices from the partition p.
72  /// @param[in] M Dimensionality parameter -- degrees of freedom in the input dataset.
73  ///
74  /// @return A valid BIC score based on the input parameters. Higher numbers indicate better fit.
75  ///
76  template <typename D>
77  double bic(const partition& p, D distance, size_t M) {
78  double R = p.size();
79  size_t k = p.num_clusters();
80 
81  // calculate variance.
82  double s2 = total_squared_dissimilarity(p, distance) / (R - k);
83  double s = sqrt(s2);
84  double sM = pow(s, (double)M);
85 
86  // compute sizes of the clusters in the partition.
87  size_t sizes[k];
88  for (size_t i=0; i < k; i++) {
89  sizes[i] = p.size(i);
90  }
91 
92  // compute BIC from point probabilities
93  double root2pi = sqrt(2 * M_PI);
94  double lD = 0;
95  for (size_t i=0; i < p.size(); i++) {
96  double d = distance(i, p.medoid_ids[p.cluster_ids[i]]);
97  double Ri = sizes[p.cluster_ids[i]];
98  lD +=
99  + log(1.0 / (root2pi * sM))
100  - (1 / (2 * s2)) * d * d
101  + log(Ri / R);
102  }
103 
104  const size_t pj = (k-1) + M*k + 1; // free parameter count
105  return lD - pj/2 * log((double)R);
106  }
107 
108 
109  ///
110  /// This version of the BIC assumes some precomputed information. This is useful for
111  /// parallel implementations, where it is more efficient to compute some global sums as a
112  /// reduction rather than aggregating a full partition to one process. Here,
113  /// we assume that the sizes of the distributed clusters as well as the total squared intra-cluster
114  /// dissimilarity (between each object and its medoid) is known.
115  ///
116  /// @param[in] k Number of clusters in the clustering. Same as k from k-means or k-medoids.
117  /// @param[in] cluster_sizes Start of range of k sizes.
118  /// <code>*cluster_sizes .. *(cluster_sizes + k)</code> should be the
119  /// sizes of clusters 1 to k
120  /// @param[in] sum2_dissim Sum of squared dissimilarities of each object w.r.t. its nearest medoid.
121  /// @param[in] dimensionality Dimensionality of clustered data. e.g., 2 for 2-dimensional points.
122  ///
123  template <typename SizeIterator, typename DissimIterator>
124  double bic(size_t k, SizeIterator cluster_sizes, DissimIterator sum2_dissim, size_t dimensionality) {
125  // figure out total size of data set and put it in R.
126  const double R = std::accumulate(cluster_sizes, cluster_sizes + k, 0);
127  const double M = dimensionality; // Shorthand for model dimensionality
128  const double logR = log(R);
129  const double log2pi = log(2 * M_PI);
130  const double pj = (k-1) + M*k + 1; // free parameter count
131  const double s2 = std::accumulate(sum2_dissim, sum2_dissim + k, 0.0) / (R - k);
132 
133  // apply criterion formula from xmeans paper.
134  double criterion = 0;
135  for (size_t i=0; i < k; i++) {
136  const double Rn = *(cluster_sizes + i);
137  criterion +=
138  - (Rn * log2pi) / 2.0
139  - (Rn * M * log(s2)) / 2.0
140  - (Rn - 1) / 2.0
141  + Rn * log(Rn)
142  - Rn * logR;
143  }
144  criterion -= (pj/2.0 * logR);
145 
146  return criterion;
147  }
148 
149 };
150 
151 #endif // BAYESIAN_INFORMATION_CRITERION_H
double total_squared_dissimilarity(const partition &p, D dist)
Compute the total squared dissimilarity between all objects and their medoids.
Definition: partition.h:196
std::vector< object_id > medoid_ids
Gives the index of the object that is the ith medoid.
Definition: partition.h:79
Class to represent a partitioning of a data set.
size_t size() const
Total number of objects in the partition.
Definition: partition.h:109
Data types and functions for dealing with dissimilarity matrices.
std::vector< medoid_id > cluster_ids
Gives cluster id (index in medoids) for the ith object.
Definition: partition.h:84
This represents a partitioning of a data set.
Definition: partition.h:76
double bic(const partition &p, D distance, size_t M)
Directly computes the BIC from a partition object based on the cluster centroids and the number of cl...
Definition: bic.h:77
size_t num_clusters() const
Total number of clusters in the partition.
Definition: partition.h:112
Muster. Copyright © 2010, Lawrence Livermore National Laboratory, LLNL-CODE-433662.
Distribution of Muster and its documentation is subject to terms of the Muster LICENSE.
Generated on Thu Sep 1 2016 using Doxygen 1.8.5