java.lang.Object
org.apache.lucene.sandbox.codecs.quantization.KMeans

public class KMeans extends Object
KMeans clustering algorithm for vectors
  • Field Details

    • MAX_NUM_CENTROIDS

      public static final int MAX_NUM_CENTROIDS
    • DEFAULT_RESTARTS

      public static final int DEFAULT_RESTARTS
    • DEFAULT_ITRS

      public static final int DEFAULT_ITRS
    • DEFAULT_SAMPLE_SIZE

      public static final int DEFAULT_SAMPLE_SIZE
    • vectors

      private final FloatVectorValues vectors
    • numVectors

      private final int numVectors
    • numCentroids

      private final int numCentroids
    • random

      private final Random random
    • initializationMethod

      private final KMeans.KmeansInitializationMethod initializationMethod
    • restarts

      private final int restarts
    • iters

      private final int iters
  • Constructor Details

  • Method Details

    • cluster

      public static KMeans.Results cluster(FloatVectorValues vectors, VectorSimilarityFunction similarityFunction, int numClusters) throws IOException
      Cluster vectors into a given number of clusters
      Parameters:
      vectors - float vectors
      similarityFunction - vector similarity function. For COSINE similarity, vectors must be normalized.
      numClusters - number of clusters to partition the vectors into
      Returns:
      results of clustering: the computed centroids and, for each vector, the centroid it is assigned to
      Throws:
      IOException - if there is an error accessing vectors
    • cluster

      public static KMeans.Results cluster(FloatVectorValues vectors, int numClusters, boolean assignCentroidsToVectors, long seed, KMeans.KmeansInitializationMethod initializationMethod, boolean normalizeCenters, int restarts, int iters, int sampleSize) throws IOException
      Expert: Cluster vectors into a given number of clusters
      Parameters:
      vectors - float vectors
      numClusters - number of clusters to partition the vectors into
      assignCentroidsToVectors - centroids are computed on a sample of the vectors. If true, the results also report, for every vector (not only the sampled ones), which centroid it belongs to.
      seed - random seed
      initializationMethod - Kmeans initialization method
      normalizeCenters - set to true for cosine distance to use spherical k-means, in which centers are normalized
      restarts - number of times to run the k-means algorithm
      iters - number of iterations to perform within a single run
      sampleSize - number of vectors to sample from the full set; the k-means algorithm runs on this sample
      Returns:
      results of clustering: the computed centroids and, if assignCentroidsToVectors == true, the centroid assigned to each vector
      Throws:
      IOException - if there is an error accessing vectors
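      The normalizeCenters option refers to spherical k-means, where each center is rescaled to unit length so that it is comparable with normalized input vectors. The following is a minimal, self-contained sketch of that rescaling step, assuming plain L2 normalization. The class SphericalCenterSketch and its l2Normalize helper are hypothetical names used for illustration; they are not part of KMeans.

```java
final class SphericalCenterSketch {

  /** Scales v in place to unit L2 length; an all-zero vector is returned unchanged. */
  static float[] l2Normalize(float[] v) {
    double sumSquares = 0;
    for (float x : v) {
      sumSquares += (double) x * x;
    }
    if (sumSquares == 0) {
      return v; // nothing to scale, avoid dividing by zero
    }
    float inverseNorm = (float) (1.0 / Math.sqrt(sumSquares));
    for (int i = 0; i < v.length; i++) {
      v[i] *= inverseNorm;
    }
    return v;
  }

  public static void main(String[] args) {
    // A center computed as the mean of its members generally has length != 1.
    float[] center = l2Normalize(new float[] {3f, 4f});
    System.out.println(center[0] + ", " + center[1]);
  }
}
```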
    • computeCentroids

      private float[][] computeCentroids(boolean normalizeCenters) throws IOException
      Throws:
      IOException
    • initializeForgy

      private float[][] initializeForgy() throws IOException
      Initialize centroids using the Forgy method: randomly select numCentroids vectors as the initial centroids
      Throws:
      IOException
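      The Forgy method can be illustrated on plain float arrays. This is a sketch only: it assumes the selected vectors are distinct (sampling without replacement), which the description above does not state, and it uses an in-memory float[][] in place of FloatVectorValues. ForgyInitSketch is a hypothetical class, not the private method documented here.

```java
import java.util.Random;

final class ForgyInitSketch {

  /** Copies k distinct, uniformly chosen rows of data to serve as initial centroids. */
  static float[][] initializeForgy(float[][] data, int k, Random random) {
    if (k > data.length) {
      throw new IllegalArgumentException("k must not exceed the number of vectors");
    }
    int[] indices = new int[data.length];
    for (int i = 0; i < indices.length; i++) {
      indices[i] = i;
    }
    float[][] centroids = new float[k][];
    // Partial Fisher-Yates shuffle: position i receives a uniformly chosen
    // index among those not selected so far.
    for (int i = 0; i < k; i++) {
      int j = i + random.nextInt(indices.length - i);
      int tmp = indices[i];
      indices[i] = indices[j];
      indices[j] = tmp;
      centroids[i] = data[indices[i]].clone();
    }
    return centroids;
  }

  public static void main(String[] args) {
    float[][] data = {{0f}, {1f}, {2f}, {3f}, {4f}};
    float[][] centroids = initializeForgy(data, 2, new Random(42L));
    System.out.println(centroids[0][0] + ", " + centroids[1][0]);
  }
}
```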
    • initializeReservoirSampling

      private float[][] initializeReservoirSampling() throws IOException
      Initialize centroids using a reservoir sampling method
      Throws:
      IOException
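      Reservoir sampling selects a fixed-size uniform sample in a single pass, which suits vectors that are read sequentially. The sketch below uses the classic "Algorithm R" variant on an in-memory float[][]; whether the private method uses this exact variant is an assumption, and ReservoirInitSketch is a hypothetical class.

```java
import java.util.Random;

final class ReservoirInitSketch {

  /** Returns k rows of data chosen uniformly at random in one sequential pass. */
  static float[][] initializeReservoirSampling(float[][] data, int k, Random random) {
    if (k > data.length) {
      throw new IllegalArgumentException("k must not exceed the number of vectors");
    }
    float[][] reservoir = new float[k][];
    for (int i = 0; i < data.length; i++) {
      if (i < k) {
        // Fill the reservoir with the first k vectors.
        reservoir[i] = data[i].clone();
      } else {
        // Vector i replaces a random slot with probability k / (i + 1).
        int j = random.nextInt(i + 1);
        if (j < k) {
          reservoir[j] = data[i].clone();
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    float[][] data = {{0f}, {1f}, {2f}, {3f}, {4f}, {5f}};
    float[][] centroids = initializeReservoirSampling(data, 2, new Random(42L));
    System.out.println(centroids[0][0] + ", " + centroids[1][0]);
  }
}
```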
    • initializePlusPlus

      private float[][] initializePlusPlus() throws IOException
      Initialize centroids using the k-means++ method
      Throws:
      IOException
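      In k-means++ seeding, the first centroid is chosen uniformly at random and each later centroid is chosen with probability proportional to its squared distance from the nearest centroid picked so far. The sketch below shows that textbook procedure on an in-memory float[][] with squared Euclidean distance; both choices are assumptions, and PlusPlusInitSketch is a hypothetical class rather than the private method documented here.

```java
import java.util.Arrays;
import java.util.Random;

final class PlusPlusInitSketch {

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  static float[][] initializePlusPlus(float[][] data, int k, Random random) {
    float[][] centroids = new float[k][];
    centroids[0] = data[random.nextInt(data.length)].clone();
    // minDist[i] holds the squared distance from vector i to its nearest chosen centroid.
    double[] minDist = new double[data.length];
    Arrays.fill(minDist, Double.MAX_VALUE);
    for (int c = 1; c < k; c++) {
      double total = 0;
      for (int i = 0; i < data.length; i++) {
        minDist[i] = Math.min(minDist[i], squaredDistance(data[i], centroids[c - 1]));
        total += minDist[i];
      }
      // Draw a vector with probability proportional to minDist[i].
      double threshold = random.nextDouble() * total;
      int chosen = data.length - 1; // fallback if rounding prevents a match
      double cumulative = 0;
      for (int i = 0; i < data.length; i++) {
        cumulative += minDist[i];
        if (cumulative > threshold) {
          chosen = i;
          break;
        }
      }
      centroids[c] = data[chosen].clone();
    }
    return centroids;
  }

  public static void main(String[] args) {
    float[][] data = {{0f, 0f}, {10f, 0f}, {0f, 10f}, {0.1f, 0.1f}};
    float[][] centroids = initializePlusPlus(data, 2, new Random(42L));
    System.out.println(Arrays.toString(centroids[0]) + " " + Arrays.toString(centroids[1]));
  }
}
```

      Vectors that coincide with an already chosen centroid have zero weight, so the seeds tend to be spread out, unlike purely uniform selection.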
    • runKMeansStep

      private static double runKMeansStep(FloatVectorValues vectors, float[][] centroids, short[] docCentroids, boolean useKahanSummation, boolean normalizeCentroids) throws IOException
      Run a k-means step
      Parameters:
      vectors - float vectors
      centroids - current centroids; the newly computed centroids are written here
      docCentroids - for each document, the centroid it belongs to; results are written here
      useKahanSummation - for large datasets, use Kahan summation to compute centroids, since plain accumulation can easily reach the limits of float precision
      normalizeCentroids - whether centroids should be normalized; used for cosine similarity only
      Throws:
      IOException - if there is an error accessing vector values
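      The useKahanSummation parameter refers to compensated summation, which tracks the rounding error lost at each addition and feeds it back into the next one. The sketch below compares it with naive float accumulation on a single component; how runKMeansStep applies it across vector dimensions is not shown here, and KahanSumSketch is a hypothetical class.

```java
final class KahanSumSketch {

  /** Plain float accumulation; small addends are lost once the sum is large. */
  static float naiveSum(float[] values) {
    float sum = 0f;
    for (float v : values) {
      sum += v;
    }
    return sum;
  }

  /** Kahan (compensated) summation in float precision. */
  static float kahanSum(float[] values) {
    float sum = 0f;
    float compensation = 0f; // running estimate of the low-order bits lost so far
    for (float v : values) {
      float corrected = v - compensation;
      float next = sum + corrected;
      // (next - sum) is what was actually added; subtracting corrected gives the error.
      compensation = (next - sum) - corrected;
      sum = next;
    }
    return sum;
  }

  public static void main(String[] args) {
    // One large value followed by many small ones, as can happen when
    // accumulating a component over a very large cluster.
    float[] values = new float[1001];
    values[0] = 1.0e8f;
    for (int i = 1; i < values.length; i++) {
      values[i] = 1f;
    }
    System.out.println("naive: " + naiveSum(values));
    System.out.println("kahan: " + kahanSum(values));
  }
}
```

      With these inputs the naive sum never moves past the first value, because each addend of 1 is smaller than half the spacing between adjacent floats near 1.0e8, while the compensated sum stays close to the exact total.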
    • assignCentroids

      static void assignCentroids(FloatVectorValues vectors, float[][] centroids, List<Integer> unassignedCentroidsIdxs) throws IOException
      For centroids that did not get any points, assign outlying points to them. Points are chosen in descending order of distance to the current centroid set.
      Throws:
      IOException
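      One way to read the description above: compute each point's distance to its nearest non-empty centroid, then hand the farthest remaining point to each empty centroid in turn. The sketch below implements that reading on an in-memory float[][] with squared Euclidean distance. The details (distance measure, tie handling, replacing the empty centroid with a copy of the point) are assumptions, and ReassignEmptyCentroidsSketch is a hypothetical class.

```java
import java.util.Arrays;
import java.util.List;

final class ReassignEmptyCentroidsSketch {

  static double squaredDistance(float[] a, float[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }

  /** Overwrites each centroid listed in unassigned with the farthest point not yet used. */
  static void assignCentroids(float[][] data, float[][] centroids, List<Integer> unassigned) {
    boolean[] empty = new boolean[centroids.length];
    for (int c : unassigned) {
      empty[c] = true;
    }
    // Distance from each point to its nearest centroid that already has points.
    double[] dist = new double[data.length];
    Arrays.fill(dist, Double.MAX_VALUE);
    for (int i = 0; i < data.length; i++) {
      for (int c = 0; c < centroids.length; c++) {
        if (!empty[c]) {
          dist[i] = Math.min(dist[i], squaredDistance(data[i], centroids[c]));
        }
      }
    }
    boolean[] taken = new boolean[data.length];
    for (int c : unassigned) {
      int farthest = -1;
      for (int i = 0; i < data.length; i++) {
        if (!taken[i] && (farthest < 0 || dist[i] > dist[farthest])) {
          farthest = i;
        }
      }
      if (farthest < 0) {
        break; // more empty centroids than points
      }
      taken[farthest] = true;
      centroids[c] = data[farthest].clone();
    }
  }

  public static void main(String[] args) {
    float[][] data = {{0f, 0f}, {1f, 0f}, {100f, 0f}, {50f, 0f}};
    float[][] centroids = {{0.5f, 0f}, {9f, 9f}, {9f, 9f}};
    assignCentroids(data, centroids, List.of(1, 2));
    System.out.println(Arrays.deepToString(centroids));
  }
}
```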