Class Lucene80DocValuesFormat
- All Implemented Interfaces: NamedSPILoader.NamedSPI
Documents that have a value for the field are encoded in a way that it is always possible to
know the ordinal of the current document in the set of documents that have a value. For instance,
say the set of documents that have a value for the field is {1, 5, 6, 11}. When the iterator is
on 6, it knows that this is the 3rd item of the set. This way, values can be stored densely and
accessed based on their index at search time. If all documents in a segment have a value for the
field, the index is the same as the doc ID, so this case is encoded implicitly and is very fast
at query time. On the other hand, if some documents are missing a value for the field, then the
set of documents that have a value is encoded into blocks. All doc IDs that share the same upper
16 bits are encoded into the same block with the following strategies:
- SPARSE: This strategy is used when a block contains at most 4095 documents. The lower 16
bits of doc IDs are stored as shorts while the upper 16 bits are given by the block ID.
- DENSE: This strategy is used when a block contains between 4096 and 65535 documents. The
lower 16 bits of doc IDs are stored in a bit set. Advancing < 512 documents is performed
using ntz operations while the index is computed by accumulating the bit counts of the
visited longs. Advancing >= 512 documents is performed by skipping to the start of the needed
512 document sub-block and iterating to the specific document within that block. The index for
the sub-block that is skipped to is retrieved from a rank-table positioned before the bit set.
The rank-table holds the origin index numbers for all 512 document sub-blocks, represented as
one unsigned short for each of the 128 sub-blocks.
- ALL: This strategy is used when a block contains exactly 65536 documents, meaning that the
block is full. In that case doc IDs do not need to be stored explicitly. This is typically
faster than both SPARSE and DENSE, which is a reason why it is preferable to have all
documents that have a value for a field using contiguous doc IDs, for instance by using
index sorting.
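The block strategies above can be illustrated with a minimal sketch. It is not the Lucene implementation: the class and method names (DocIdBlockSketch, buildRank, indexOf, nextDoc) are hypothetical, the rank table uses plain ints instead of unsigned shorts, and the bit set is assumed to be a long[1024] covering the 65536 possible lower-16-bit values of one block.

```java
/** Sketch of the per-block doc ID set encoding; all names here are illustrative only. */
final class DocIdBlockSketch {
    enum Strategy { SPARSE, DENSE, ALL }

    /** All doc IDs sharing the same upper 16 bits belong to the same block. */
    static int blockId(int docId) {
        return docId >>> 16;
    }

    /** Picks a strategy from the number of documents in one block (1..65536). */
    static Strategy strategyFor(int cardinality) {
        if (cardinality == 65536) return Strategy.ALL;
        return cardinality <= 4095 ? Strategy.SPARSE : Strategy.DENSE;
    }

    /** rank[i] = number of set bits before sub-block i; a sub-block is 512 docs = 8 longs. */
    static int[] buildRank(long[] bits) {
        int[] rank = new int[bits.length / 8];
        int count = 0;
        for (int w = 0; w < bits.length; w++) {
            if ((w & 7) == 0) rank[w >>> 3] = count;
            count += Long.bitCount(bits[w]);
        }
        return rank;
    }

    /** Index within the block of the doc whose lower 16 bits are target (bit assumed set). */
    static int indexOf(long[] bits, int[] rank, int target) {
        int word = target >>> 6;
        int subBlock = word >>> 3;
        int index = rank[subBlock];                       // skip straight to the sub-block
        for (int w = subBlock << 3; w < word; w++) {
            index += Long.bitCount(bits[w]);              // accumulate bit counts of visited longs
        }
        long below = (1L << (target & 63)) - 1;           // mask of bits before target in its long
        return index + Long.bitCount(bits[word] & below);
    }

    /** Next set bit at or after from, found with ntz; returns -1 if the block has none. */
    static int nextDoc(long[] bits, int from) {
        int word = from >>> 6;
        if (word >= bits.length) return -1;
        long cur = bits[word] & (-1L << (from & 63));
        while (cur == 0) {
            if (++word == bits.length) return -1;
            cur = bits[word];
        }
        return (word << 6) + Long.numberOfTrailingZeros(cur);
    }
}
```

For a block whose lower 16 bits are {3, 64, 511, 512, 40000}, the rank entry for the second sub-block is 3, and indexOf for 40000 returns 4 after touching only the longs of its own sub-block.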
Skipping blocks to arrive at a wanted document is done either iteratively or by using the jump-table stored at the end of the chain of blocks. The jump-table holds the offset as well as the index for all blocks, packed in a single long per block.
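As a rough illustration of a jump-table entry, the sketch below packs a block's byte offset and its starting index into one long. The bit layout (offset in the upper 32 bits, index in the lower 32 bits), the class name JumpTableSketch, and the per-block byte lengths are assumptions made for this example; the text above only states that both values share a single long.

```java
/** Sketch of a jump table with one long per block; the bit layout is an assumption. */
final class JumpTableSketch {
    /** Packs offset into the upper 32 bits and index into the lower 32 bits. */
    static long pack(int offset, int index) {
        return ((long) offset << 32) | (index & 0xFFFFFFFFL);
    }

    static int offset(long entry) {
        return (int) (entry >>> 32);
    }

    static int index(long entry) {
        return (int) entry;
    }

    /**
     * Builds one entry per block. cardinalities[i] is the number of documents in block i and
     * lengthsInBytes[i] is the (hypothetical) encoded size of block i.
     */
    static long[] build(int[] cardinalities, int[] lengthsInBytes) {
        long[] table = new long[cardinalities.length];
        int offset = 0;
        int index = 0;
        for (int i = 0; i < table.length; i++) {
            table[i] = pack(offset, index);   // where block i starts, and its first ordinal
            offset += lengthsInBytes[i];
            index += cardinalities[i];
        }
        return table;
    }
}
```

With such a table, reaching block b is a single array read instead of a walk over the b preceding blocks.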
Then the five per-document value types (Numeric, Binary, Sorted, SortedSet, SortedNumeric) are encoded using the following strategies:
Numeric:
- Delta-compressed: per-document integers written as deltas from the minimum value, compressed with bitpacking. For more information, see LegacyDirectWriter.
- Table-compressed: when the number of unique values is very small (< 256), and when there are unused "gaps" in the range of values used (such as SmallFloat), a lookup table is written instead. Each per-document entry is then the ordinal into this table, and those ordinals are compressed with bitpacking (LegacyDirectWriter).
- GCD-compressed: when all numbers share a common divisor, such as dates, the greatest common divisor (GCD) is computed, and quotients are stored using Delta-compressed Numerics.
- Monotonic-compressed: when all numbers are monotonically increasing offsets, they are written as blocks of bitpacked integers, encoding the deviation from the expected delta.
- Const-compressed: when there is only one possible value, no per-document data is needed and this value is encoded alone.
Depending on calculated gains, the numbers might be split into blocks of 16384 values. In that case, a jump-table with block offsets is appended to the blocks for O(1) access to the needed block.
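The arithmetic behind the delta, GCD and const strategies can be sketched as follows. This is a simplified illustration under stated assumptions, not the writer's actual logic: the class NumericEncodingSketch and its methods are hypothetical, the minimum is subtracted before the GCD is taken, overflow of extreme long ranges is ignored, and table and monotonic compression are left out.

```java
/** Sketch of min/GCD planning for numeric doc values; names are illustrative only. */
final class NumericEncodingSketch {
    /** Number of bits needed to bitpack values in [0, maxValue]. */
    static int bitsRequired(long maxValue) {
        return maxValue == 0 ? 0 : 64 - Long.numberOfLeadingZeros(maxValue);
    }

    static long gcd(long a, long b) {
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return Math.abs(a);
    }

    /**
     * Returns {min, gcd, bitsPerValue} such that value == min + gcd * quotient.
     * gcd == 0 means every value equals min (the const case, zero bits per document).
     */
    static long[] plan(long[] values) {
        long min = Long.MAX_VALUE;
        long max = Long.MIN_VALUE;
        for (long v : values) {
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        long g = 0;
        for (long v : values) {
            g = gcd(g, v - min);   // assumes v - min does not overflow
        }
        if (g == 0) return new long[] {min, 0, 0};
        return new long[] {min, g, bitsRequired((max - min) / g)};
    }

    /** Quotients that would be bitpacked per document. */
    static long[] encode(long[] values, long min, long gcd) {
        long[] out = new long[values.length];
        for (int i = 0; i < values.length; i++) {
            out[i] = gcd == 0 ? 0 : (values[i] - min) / gcd;
        }
        return out;
    }

    static long decode(long quotient, long min, long gcd) {
        return min + gcd * quotient;
    }
}
```

For three timestamps at day granularity (3, 5 and 10 days expressed in milliseconds), plan yields a GCD of 86400000 and 3 bits per value, since the quotients are 0, 2 and 7.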
Binary:
- Fixed-width Binary: one large concatenated byte[] is written, along with the fixed length. Each document's value can be addressed directly with multiplication (docID * length).
- Variable-width Binary: one large concatenated byte[] is written, along with end addresses for each document. The addresses are written as Monotonic-compressed numerics.
- Prefix-compressed Binary: values are written in chunks of 16, with the first value written completely and other values sharing prefixes. Chunk addresses are written as Monotonic-compressed numerics. A reverse lookup index is written from a portion of every 1024th term.
Sorted:
- Sorted: a mapping of ordinals to deduplicated terms is written as Prefix-compressed Binary, along with the per-document ordinals written using one of the numeric strategies above.
SortedSet:
- Single: if all documents have 0 or 1 value, then data are written like SORTED.
- SortedSet: a mapping of ordinals to deduplicated terms is written as Binary, an ordinal list and per-document index into this list are written using the numeric strategies above.
SortedNumeric:
- Single: if all documents have 0 or 1 value, then data are written like NUMERIC.
- SortedNumeric: a value list and per-document index into this list are written using the numeric strategies above.
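The binary addressing schemes can be sketched in a few lines. The class BinaryAddressingSketch and its methods are hypothetical; the end addresses are held in a plain long[] rather than a monotonic-compressed structure, and the prefix helper assumes that prefixes are shared with the preceding value in the chunk, which the description above does not spell out.

```java
import java.util.Arrays;

/** Sketch of addressing into one concatenated byte[]; names are illustrative only. */
final class BinaryAddressingSketch {
    /** Fixed-width: the value at position index starts at index * length. */
    static byte[] fixed(byte[] data, int length, int index) {
        int start = index * length;
        return Arrays.copyOfRange(data, start, start + length);
    }

    /** Variable-width: a value spans from the previous end address to its own end address. */
    static byte[] variable(byte[] data, long[] endAddresses, int index) {
        int start = index == 0 ? 0 : (int) endAddresses[index - 1];
        int end = (int) endAddresses[index];
        return Arrays.copyOfRange(data, start, end);
    }

    /** Number of leading bytes that cur shares with prev (assumed basis for prefix sharing). */
    static int sharedPrefix(byte[] prev, byte[] cur) {
        int limit = Math.min(prev.length, cur.length);
        int i = 0;
        while (i < limit && prev[i] == cur[i]) {
            i++;
        }
        return i;
    }
}
```

When every document has a value, the position passed as index is the doc ID itself; otherwise it is the document's ordinal within the set of documents that have a value.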
Files:
- .dvd: DocValues data
- .dvm: DocValues metadata
-
Nested Class Summary
Nested Classes
Modifier and Type    Class                           Description
static enum          Lucene80DocValuesFormat.Mode    Configuration option for doc values.
-
Field Summary
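Fields
Modifier and Type                             Field                                     Description
(package private) static final byte           BINARY
(package private) static final int            BINARY_BLOCK_SHIFT
(package private) static final int            BINARY_DOCS_PER_COMPRESSED_BLOCK
(package private) static final String         DATA_CODEC
(package private) static final String         DATA_EXTENSION
(package private) static final int            DIRECT_MONOTONIC_BLOCK_SHIFT
(package private) static final String         META_CODEC
(package private) static final String         META_EXTENSION
private final Lucene80DocValuesFormat.Mode    mode
static final String                           MODE_KEY                                  Attribute key for compression mode.
(package private) static final byte           NUMERIC
(package private) static final int            NUMERIC_BLOCK_SHIFT
(package private) static final int            NUMERIC_BLOCK_SIZE
(package private) static final byte           SORTED
(package private) static final byte           SORTED_NUMERIC
(package private) static final byte           SORTED_SET
(package private) static final int            TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD
(package private) static final int            TERMS_DICT_BLOCK_LZ4_CODE
(package private) static final int            TERMS_DICT_BLOCK_LZ4_MASK
(package private) static final int            TERMS_DICT_BLOCK_LZ4_SHIFT
(package private) static final int            TERMS_DICT_BLOCK_LZ4_SIZE
(package private) static final int            TERMS_DICT_BLOCK_MASK
(package private) static final int            TERMS_DICT_BLOCK_SHIFT
(package private) static final int            TERMS_DICT_BLOCK_SIZE
(package private) static final int            TERMS_DICT_COMPRESSOR_LZ4_CODE
(package private) static final int            TERMS_DICT_REVERSE_INDEX_MASK
(package private) static final int            TERMS_DICT_REVERSE_INDEX_SHIFT
(package private) static final int            TERMS_DICT_REVERSE_INDEX_SIZE
(package private) static final int            VERSION_BIN_COMPRESSED
(package private) static final int            VERSION_CONFIGURABLE_COMPRESSION
(package private) static final int            VERSION_CURRENT
(package private) static final int            VERSION_START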
-
Constructor Summary
Constructors
Constructor                  Description
Lucene80DocValuesFormat()    Default constructor.
Lucene80DocValuesFormat      Constructor
-
Method Summary
Method    Description
fieldsConsumer(SegmentWriteState state)
Note: although this format is only used on older versions, we need to keep the write logic in addition to the read logic.
fieldsProducer(SegmentReadState state)
Returns a DocValuesProducer to read docvalues from the index.
Methods inherited from class org.apache.lucene.codecs.DocValuesFormat
availableDocValuesFormats, forName, getName, reloadDocValuesFormats, toString
-
Field Details
-
MODE_KEY
Attribute key for compression mode.
-
mode
-
DATA_CODEC
-
DATA_EXTENSION
-
META_CODEC
-
META_EXTENSION
-
VERSION_START
static final int VERSION_START
-
VERSION_BIN_COMPRESSED
static final int VERSION_BIN_COMPRESSED
-
VERSION_CONFIGURABLE_COMPRESSION
static final int VERSION_CONFIGURABLE_COMPRESSION
-
VERSION_CURRENT
static final int VERSION_CURRENT
-
NUMERIC
static final byte NUMERIC
-
BINARY
static final byte BINARY
-
SORTED
static final byte SORTED
-
SORTED_SET
static final byte SORTED_SET
-
SORTED_NUMERIC
static final byte SORTED_NUMERIC
-
DIRECT_MONOTONIC_BLOCK_SHIFT
static final int DIRECT_MONOTONIC_BLOCK_SHIFT
-
NUMERIC_BLOCK_SHIFT
static final int NUMERIC_BLOCK_SHIFT
-
NUMERIC_BLOCK_SIZE
static final int NUMERIC_BLOCK_SIZE
-
BINARY_BLOCK_SHIFT
static final int BINARY_BLOCK_SHIFT
-
BINARY_DOCS_PER_COMPRESSED_BLOCK
static final int BINARY_DOCS_PER_COMPRESSED_BLOCK
-
TERMS_DICT_BLOCK_SHIFT
static final int TERMS_DICT_BLOCK_SHIFT
-
TERMS_DICT_BLOCK_SIZE
static final int TERMS_DICT_BLOCK_SIZE
-
TERMS_DICT_BLOCK_MASK
static final int TERMS_DICT_BLOCK_MASK
-
TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD
static final int TERMS_DICT_BLOCK_COMPRESSION_THRESHOLD
-
TERMS_DICT_BLOCK_LZ4_SHIFT
static final int TERMS_DICT_BLOCK_LZ4_SHIFT
-
TERMS_DICT_BLOCK_LZ4_SIZE
static final int TERMS_DICT_BLOCK_LZ4_SIZE
-
TERMS_DICT_BLOCK_LZ4_MASK
static final int TERMS_DICT_BLOCK_LZ4_MASK
-
TERMS_DICT_COMPRESSOR_LZ4_CODE
static final int TERMS_DICT_COMPRESSOR_LZ4_CODE
-
TERMS_DICT_BLOCK_LZ4_CODE
static final int TERMS_DICT_BLOCK_LZ4_CODE
-
TERMS_DICT_REVERSE_INDEX_SHIFT
static final int TERMS_DICT_REVERSE_INDEX_SHIFT
-
TERMS_DICT_REVERSE_INDEX_SIZE
static final int TERMS_DICT_REVERSE_INDEX_SIZE
-
TERMS_DICT_REVERSE_INDEX_MASK
static final int TERMS_DICT_REVERSE_INDEX_MASK
-
Constructor Details
-
Lucene80DocValuesFormat
public Lucene80DocValuesFormat()
Default constructor.
-
Lucene80DocValuesFormat
Constructor
-
-
Method Details
-
fieldsConsumer
Note: although this format is only used on older versions, we need to keep the write logic in addition to the read logic. It's possible for doc values on older segments to be written to through doc values updates.
- Specified by: fieldsConsumer in class DocValuesFormat
- Throws: IOException
-
fieldsProducer
Description copied from class: DocValuesFormat
Returns a DocValuesProducer to read docvalues from the index.
NOTE: by the time this call returns, it must hold open any files it will need to use; else, those files may be deleted. Additionally, required files may be deleted during the execution of this call before there is a chance to open them. Under these circumstances an IOException should be thrown by the implementation. IOExceptions are expected and will automatically cause a retry of the segment opening logic with the newly revised segments.
- Specified by: fieldsProducer in class DocValuesFormat
- Throws: IOException
-