summarization.bm25 – BM25 ranking function
This module contains functions for computing rank scores for documents in a corpus, and the helper class BM25 used in the calculations. The original algorithm is described in [1]; see also the Wikipedia page [2].
[1] Robertson, Stephen; Zaragoza, Hugo (2009). The Probabilistic Relevance Framework: BM25 and Beyond, http://www.staff.city.ac.uk/~sb317/papers/foundations_bm25_review.pdf
[2] Okapi BM25 on Wikipedia, https://en.wikipedia.org/wiki/Okapi_BM25
Examples
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
PARAM_K1 - Free smoothing parameter for BM25.
PARAM_B - Free smoothing parameter for BM25.
EPSILON - Constant used as the floor for negative idf values of terms in the corpus.
class gensim.summarization.bm25.BM25(corpus, k1=1.5, b=0.75, epsilon=0.25)
Bases: object
Implementation of the Best Matching 25 ranking function.
corpus_size (int) – Size of the corpus (number of documents).
avgdl (float) – Average length of a document in the corpus.
doc_freqs (list of dicts of int) – Term frequencies for each document in the corpus, one dictionary per document, with words as keys and frequencies as values.
idf (dict) – Inverse document frequencies for the whole corpus, with words as keys and idf values as values.
doc_len (list of int) – List of document lengths.
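As a rough illustration of what these attributes hold, the sketch below recomputes the simple corpus statistics in plain Python for the example corpus. It is not the class's own code; it only assumes that document length means the number of tokens and that term frequencies are raw counts.

```python
from collections import Counter

corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]

corpus_size = len(corpus)                            # number of documents
doc_len = [len(doc) for doc in corpus]               # tokens per document
avgdl = sum(doc_len) / corpus_size                   # average document length
doc_freqs = [dict(Counter(doc)) for doc in corpus]   # per-document term counts

print(corpus_size, doc_len, avgdl)  # 3 [4, 3, 2] 3.0
print(doc_freqs[0])                 # {'black': 1, 'cat': 2, 'white': 1}
```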
corpus (list of list of str) – Given corpus.
k1 (float) – Constant controlling term-frequency saturation. Once saturation is reached, each additional occurrence of a term adds significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b (float) – Constant controlling how strongly document length, relative to the average document length, affects the score. The larger b is, the more a document's length relative to the average influences its score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon (float) – Constant used as the floor value for the idf of a term in the corpus. When epsilon is positive, it restricts negative idf values. A negative idf implies that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, as it means that an identical document would score lower than an almost identical one (one with the term in question removed). Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
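The roles of k1, b and the sign of the idf can be seen in a small sketch. It assumes the classic Okapi BM25 formula as given in [1] and [2]; the helper names tf_factor and raw_idf are hypothetical and not part of this module, and the library's internals may differ in detail.

```python
import math

def tf_factor(tf, doc_len, avgdl, k1=1.5, b=0.75):
    """Term-frequency factor of the classic Okapi BM25 formula (assumed form)."""
    norm = 1 - b + b * doc_len / avgdl   # length normalization, controlled by b
    return tf * (k1 + 1) / (tf + k1 * norm)

def raw_idf(n_docs_with_term, corpus_size):
    """Classic idf before any flooring; negative for very common terms."""
    return (math.log(corpus_size - n_docs_with_term + 0.5)
            - math.log(n_docs_with_term + 0.5))

# Saturation: each extra occurrence adds less than the previous one,
# and the factor never exceeds k1 + 1.
gains = [tf_factor(tf + 1, 10, 10) - tf_factor(tf, 10, 10) for tf in range(1, 5)]

# Length normalization: for the same term count, a shorter-than-average
# document gets a larger factor than a longer-than-average one.
short = tf_factor(2, 5, 10)
long_ = tf_factor(2, 20, 10)

# idf sign: a term in 3 of 4 documents (more than half) has negative raw idf,
# which is the case the epsilon floor is meant to handle.
common = raw_idf(3, 4)
rare = raw_idf(1, 4)

print(gains)
print(short, long_)
print(common, rare)
```

Setting b=0 in tf_factor makes short and long_ equal, which is one way to check that b alone drives the length effect in this sketch.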
get_score(document, index)
Computes the BM25 score of the given document in relation to the corpus item selected by index.
document (list of str) – Document to be scored.
index (int) – Index of the corpus document to score against document.
Returns: float – BM25 score.
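To make the pairwise scoring concrete, here is a minimal pure-Python sketch of scoring one tokenized document against one corpus item. The function bm25_score is hypothetical and is not the method above: it assumes the classic Okapi formula and deliberately omits the epsilon floor, so very common terms can still contribute negatively.

```python
import math
from collections import Counter

def bm25_score(document, index, corpus, k1=1.5, b=0.75):
    """Score `document` against corpus[index] (sketch, no epsilon floor)."""
    corpus_size = len(corpus)
    avgdl = sum(len(d) for d in corpus) / corpus_size
    target = corpus[index]
    freqs = Counter(target)
    norm = 1 - b + b * len(target) / avgdl
    score = 0.0
    for word in document:
        tf = freqs.get(word, 0)
        if tf == 0:
            continue  # words absent from the target contribute nothing
        n = sum(1 for d in corpus if word in d)  # documents containing word
        idf = math.log(corpus_size - n + 0.5) - math.log(n + 0.5)
        score += idf * tf * (k1 + 1) / (tf + k1 * norm)
    return score

corpus = [
    ["black", "cat", "white", "cat"],
    ["cat", "outer", "space"],
    ["wag", "dog"],
]

print(bm25_score(["wag", "dog"], 2, corpus))  # positive: shared rare terms
print(bm25_score(["wag", "dog"], 0, corpus))  # 0.0: no shared terms
print(bm25_score(["cat"], 0, corpus))         # negative: "cat" is in 2 of 3 docs
```

The last call shows the situation the epsilon parameter addresses: without a floor, a term present in more than half of the documents lowers the score.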
get_scores(document)
Computes and returns the BM25 scores of the given document in relation to every item in the corpus.
document (list of str) – Document to be scored.
Returns: list of float – BM25 scores.
get_scores_bow(document)
Computes and returns the BM25 scores of the given document in relation to every item in the corpus.
document (list of str) – Document to be scored.
Returns: list of float – BM25 scores.
gensim.summarization.bm25.get_bm25_weights(corpus, n_jobs=1, k1=1.5, b=0.75, epsilon=0.25)
Returns BM25 scores (weights) of documents in corpus. Each document is weighted against every document in the given corpus.
corpus (list of list of str) – Corpus of documents.
n_jobs (int) – The number of processes to use for computing bm25.
k1 (float) – Constant controlling term-frequency saturation. Once saturation is reached, each additional occurrence of a term adds significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b (float) – Constant controlling how strongly document length, relative to the average document length, affects the score. The larger b is, the more a document's length relative to the average influences its score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon (float) – Constant used as the floor value for the idf of a term in the corpus. When epsilon is positive, it restricts negative idf values. A negative idf implies that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, as it means that an identical document would score lower than an almost identical one (one with the term in question removed). Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
Returns: list of list of float – BM25 scores.
Examples
>>> from gensim.summarization.bm25 import get_bm25_weights
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = get_bm25_weights(corpus, n_jobs=-1)
gensim.summarization.bm25.iter_bm25_bow(corpus, n_jobs=1, k1=1.5, b=0.75, epsilon=0.25)
Yields BM25 scores (weights) of documents in corpus. Each document is weighted against every document in the given corpus.
corpus (list of list of str) – Corpus of documents.
n_jobs (int) – The number of processes to use for computing bm25.
k1 (float) – Constant controlling term-frequency saturation. Once saturation is reached, each additional occurrence of a term adds significantly less to the score. According to [1], experiments suggest that 1.2 < k1 < 2 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
b (float) – Constant controlling how strongly document length, relative to the average document length, affects the score. The larger b is, the more a document's length relative to the average influences its score. According to [1], experiments suggest that 0.5 < b < 0.8 yields reasonably good results, although the optimal value depends on factors such as the type of documents or queries.
epsilon (float) – Constant used as the floor value for the idf of a term in the corpus. When epsilon is positive, it restricts negative idf values. A negative idf implies that adding a very common term to a document penalizes the overall score (with ‘very common’ meaning that the term is present in more than half of the documents). That can be undesirable, as it means that an identical document would score lower than an almost identical one (one with the term in question removed). Increasing epsilon above 0 raises the bar for how rare a word has to be (across documents) to receive an extra score.
Yields: list of (index, float) – BM25 scores in bag of weights format.
Examples
>>> from gensim.summarization.bm25 import iter_bm25_bow
>>> corpus = [
... ["black", "cat", "white", "cat"],
... ["cat", "outer", "space"],
... ["wag", "dog"]
... ]
>>> result = iter_bm25_bow(corpus, n_jobs=-1)
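The "bag of weights" format can be pictured as a sparse list of (index, score) pairs, analogous to a bag-of-words vector. The sketch below converts dense score rows into that shape. The helper to_bow is hypothetical, the numbers are made up for illustration, and dropping non-positive scores is an assumption about the format rather than something stated above.

```python
def to_bow(scores):
    """Keep only positive scores as (index, score) pairs (assumed convention)."""
    return [(i, s) for i, s in enumerate(scores) if s > 0]

# Illustrative dense rows: one row of scores per scored document.
dense = [
    [1.2, 0.0, 0.0],
    [0.0, 0.9, 0.0],
]

sparse = [to_bow(row) for row in dense]
print(sparse)  # [[(0, 1.2)], [(1, 0.9)]]
```

A sparse layout like this saves memory when most document pairs share no terms and therefore have a zero score.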