Class BPIndexReorderer

java.lang.Object
org.apache.lucene.misc.index.BPIndexReorderer

public final class BPIndexReorderer extends Object
Implementation of "recursive graph bisection", also called "bipartite graph partitioning" and often abbreviated BP, an approach to doc ID assignment that aims at reducing the sum of the log gap between consecutive postings. While originally targeted at reducing the size of postings, this algorithm has been observed to also speed up queries significantly by clustering documents that have similar sets of terms together.

This algorithm was initially described by Dhulipala et al. in "Compressing graphs and inverted indexes with recursive graph bisection". This implementation takes advantage of some optimizations suggested by Mackenzie et al. in "Tradeoff Options for Bipartite Graph Partitioning".

Typical usage would look like this:

 LeafReader reader; // reader to reorder
 Directory targetDir; // Directory where to write the reordered index

 Directory targetDir = FSDirectory.open(targetPath);
 BPIndexReorderer reorderer = new BPIndexReorderer();
 ForkJoinPool pool = new ForkJoinPool(Runtime.getRuntime().availableProcessors(),
     p -> new ForkJoinWorkerThread(p) {}, null, false);
 reorderer.setForkJoinPool(pool);
 reorderer.setFields(Collections.singleton("body"));
 CodecReader reorderedReaderView = reorderer.reorder(SlowCodecReaderWrapper.wrap(reader), targetDir);
 try (IndexWriter w = new IndexWriter(targetDir, new IndexWriterConfig().setOpenMode(OpenMode.CREATE))) {
   w.addIndexes(reorderedReaderView);
 }
 DirectoryReader reorderedReader = DirectoryReader.open(targetDir);
 

Note: This is a slow operation that consumes O(maxDoc + numTerms * numThreads) memory.

  • Field Details

    • TERM_IDS_BLOCK_SIZE

      private static final int TERM_IDS_BLOCK_SIZE
      Block size for terms in the forward index
      See Also:
    • FORK_THRESHOLD

      private static final int FORK_THRESHOLD
      Minimum problem size that will result in tasks being splitted.
      See Also:
    • DEFAULT_MIN_DOC_FREQ

      public static final int DEFAULT_MIN_DOC_FREQ
      Minimum required document frequency for terms to be considered: 4,096.
      See Also:
    • DEFAULT_MIN_PARTITION_SIZE

      public static final int DEFAULT_MIN_PARTITION_SIZE
      Minimum size of partitions. The algorithm will stop recursing when reaching partitions below this number of documents: 32.
      See Also:
    • DEFAULT_MAX_ITERS

      public static final int DEFAULT_MAX_ITERS
      Default maximum number of iterations per recursion level: 20. Higher numbers of iterations typically don't help significantly.
      See Also:
    • minDocFreq

      private int minDocFreq
    • maxDocFreq

      private float maxDocFreq
    • minPartitionSize

      private int minPartitionSize
    • maxIters

      private int maxIters
    • ramBudgetMB

      private double ramBudgetMB
    • fields

      private Set<String> fields
    • LOG2_TABLE

      private static final float[] LOG2_TABLE
  • Constructor Details

    • BPIndexReorderer

      public BPIndexReorderer()
      Constructor.
  • Method Details

    • setMinDocFreq

      public void setMinDocFreq(int minDocFreq)
      Set the minimum document frequency for terms to be considered, 4096 by default.
    • setMaxDocFreq

      public void setMaxDocFreq(float maxDocFreq)
      Set the maximum document frequency for terms to be considered, as a ratio of maxDoc. This is useful because very frequent terms (stop words) add significant overhead to the reordering logic while not being very relevant for ordering. This value must be in (0, 1]. Default value is 1.
    • setMinPartitionSize

      public void setMinPartitionSize(int minPartitionSize)
      Set the minimum partition size, when the algorithm stops recursing, 32 by default.
    • setMaxIters

      public void setMaxIters(int maxIters)
      Set the maximum number of iterations on each recursion level, 20 by default. Experiments suggests that values above 20 do not help much. However, values below 20 can be used to trade effectiveness for faster reordering.
    • setRAMBudgetMB

      public void setRAMBudgetMB(double ramBudgetMB)
      Set the amount of RAM that graph partitioning is allowed to use. More RAM allows running faster. If not enough RAM is provided, a BPIndexReorderer.NotEnoughRAMException will be thrown. This is 10% of the total heap size by default.
    • setFields

      public void setFields(Set<String> fields)
      Sets the fields to use to perform partitioning. A null value indicates that all indexed fields should be used.
    • writePostings

      private int writePostings(CodecReader reader, Set<String> fields, Directory tempDir, DataOutput postingsOut, int parallelism) throws IOException
      Throws:
      IOException
    • buildForwardIndex

      private BPIndexReorderer.ForwardIndex buildForwardIndex(Directory tempDir, String postingsFileName, int maxDoc, int maxTerm) throws IOException
      Throws:
      IOException
    • computeDocMap

      public Sorter.DocMap computeDocMap(CodecReader reader, Directory tempDir, Executor executor) throws IOException
      Expert: Compute the Sorter.DocMap that holds the new doc ID numbering. This is exposed to enable integration into BPReorderingMergePolicy, reorder(CodecReader, Directory, Executor) should be preferred in general.
      Throws:
      IOException
    • reorder

      public CodecReader reorder(CodecReader reader, Directory tempDir, Executor executor) throws IOException
      Reorder the given CodecReader into a reader that tries to minimize the log gap between consecutive documents in postings, which usually helps improve space efficiency and query evaluation efficiency. Note that the returned CodecReader is slow and should typically be used in a call to IndexWriter.addIndexes(CodecReader...).

      The provided Executor can be used to perform reordering concurrently. A value of null indicates that reordering should be performed in the current thread.

      NOTE: The provided Executor must not reject tasks.

      Throws:
      BPIndexReorderer.NotEnoughRAMException - if not enough RAM is provided
      IOException
    • computePermutation

      private int[] computePermutation(CodecReader reader, Set<String> fields, Directory dir, TaskExecutor executor) throws IOException
      Compute a permutation of the doc ID space that reduces log gaps between consecutive postings.
      Throws:
      IOException
    • sorted

      private static boolean sorted(IntsRef intsRef)
      Returns true if, and only if, the given IntsRef is sorted.
    • docRAMRequirements

      private static long docRAMRequirements(int maxDoc)
    • termRAMRequirementsPerThreadPerTerm

      private static long termRAMRequirementsPerThreadPerTerm()
    • fastLog2

      static float fastLog2(int i)
      An approximate log() function in base 2 which trades accuracy for much better performance.
    • writeMonotonicInts

      static void writeMonotonicInts(int[] ints, int len, DataOutput out) throws IOException
      Simple bit packing that focuses on the common / efficient case when term IDs can be encoded on 16 bits.
      Throws:
      IOException
    • readMonotonicInts

      static int readMonotonicInts(DataInput in, int[] ints) throws IOException
      Decoding logic for writeMonotonicInts(int[], int, DataOutput). It should get auto-vectorized.
      Throws:
      IOException