module Boilerpipe::Filters

Fuses adjacent blocks if their distance (in blocks) does not exceed a certain limit. This probably makes sense only when an upstream filter has already removed some blocks.
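A minimal sketch of the proximity rule, for illustration only. `ProxBlock`, `fuse_by_proximity` and the `first_index`/`last_index` fields are hypothetical stand-ins, not the library's API, and the distance is assumed to be the number of original blocks missing between two neighbours.

```ruby
# Hypothetical stand-in for a text block; not the library's TextBlock class.
# first_index/last_index are the positions the block covered in the original
# document, before any upstream filter removed blocks.
ProxBlock = Struct.new(:text, :first_index, :last_index)

# Fuse a block into its predecessor when at most max_distance original
# blocks are missing between them (assumed definition of "distance").
def fuse_by_proximity(blocks, max_distance: 1)
  blocks.each_with_object([]) do |block, fused|
    prev = fused.last
    gap = prev ? block.first_index - prev.last_index - 1 : nil
    if prev && gap <= max_distance
      prev.text = "#{prev.text}\n#{block.text}"
      prev.last_index = block.last_index
    else
      fused << block.dup
    end
  end
end

blocks = [
  ProxBlock.new("Intro", 0, 0),
  ProxBlock.new("Body", 2, 2),   # block 1 was removed upstream
  ProxBlock.new("Footer", 9, 9)  # too far away, stays separate
]
fused = fuse_by_proximity(blocks, max_distance: 1)
```

With no upstream removal every gap is zero, so all blocks would collapse into one, which is why the filter is mainly useful after other filters have run.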

Removes TextBlocks which have explicitly been marked as “not content”.

A full-text extractor trained on the krdwrd Canola corpus (krdwrd.org/, krdwrd.org/trac/attachment/wiki/Corpora/Canola/CANOLA.pdf). Works well with SimpleEstimator, too.

Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper “Boilerplate Detection using Shallow Text Features”, particularly using text densities and link densities.

Creates a list of potential titles from the page title, then checks every text block: if a block contains a potential title, that block is labeled :TITLE.
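The matching step might look roughly like the sketch below. `TitledBlock`, `potential_titles` and `label_title_blocks` are hypothetical names, and deriving candidates by splitting the page title on " | ", " : " and " - " is an assumption made for the example, not a documented rule.

```ruby
# Hypothetical stand-in for a text block with a label list.
TitledBlock = Struct.new(:text, :labels)

# Assumed heuristic: the full page title plus each part obtained by
# splitting on a separator surrounded by whitespace.
def potential_titles(page_title)
  parts = page_title.split(/\s+[|:-]\s+/).map(&:strip).reject(&:empty?)
  ([page_title.strip] + parts).uniq
end

# Label every block whose text contains one of the candidates
# (case-insensitive comparison is an assumption).
def label_title_blocks(blocks, page_title)
  titles = potential_titles(page_title).map(&:downcase)
  blocks.each do |block|
    text = block.text.downcase
    block.labels << :TITLE if titles.any? { |title| text.include?(title) }
  end
end

title_blocks = [
  TitledBlock.new("Big News Story", []),
  TitledBlock.new("Something happened today.", [])
]
label_title_blocks(title_blocks, "Big News Story | Example Times")
```

Note that a plain substring match would also label a navigation block that merely repeats the site name, so a stricter comparison may be preferable in practice.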

Marks all TextBlocks “content” which are between the headline and the part that has already been marked content, if they are marked MIGHT_BE_CONTENT. This filter is quite specific to the news domain. Used downstream of KeepLargestBlockFilter, since that is what sets MIGHT_BE_CONTENT.

Marks all blocks as “non-content” that occur after blocks that have been marked INDICATES_END_OF_TEXT. These marks are ignored unless a minimum number of words in content blocks occur before this mark (default: 60). This can be used in conjunction with an upstream TerminatingBlocksFinder.
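As a rough illustration of the word threshold, the sketch below counts words in content blocks and only honours an INDICATES_END_OF_TEXT mark once enough words have been seen. `EndBlock` and `ignore_blocks_after_content` are hypothetical names; counting words with `String#split` and dropping the marked block itself are assumptions.

```ruby
# Hypothetical stand-in for a text block.
EndBlock = Struct.new(:text, :content, :labels)

def ignore_blocks_after_content(blocks, min_words: 60)
  words_seen = 0     # words in content blocks before the mark
  found_end = false
  blocks.each do |block|
    if !found_end && block.labels.include?(:INDICATES_END_OF_TEXT) &&
       words_seen >= min_words
      found_end = true
    end
    if found_end
      block.content = false  # assumption: the marked block is dropped too
    elsif block.content
      words_seen += block.text.split.size
    end
  end
  blocks
end

make_blocks = lambda do
  [
    EndBlock.new("one two three", true, []),
    EndBlock.new("Comments", true, [:INDICATES_END_OF_TEXT]),
    EndBlock.new("Unrelated trailer", true, [])
  ]
end

short_limit = ignore_blocks_after_content(make_blocks.call, min_words: 3)
default_limit = ignore_blocks_after_content(make_blocks.call)
```

With the default of 60 the mark in this tiny example is ignored, because only three content words precede it.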

Keeps the largest TextBlock only (by the number of words). In case of more than one block with the same number of words, the first block is chosen. All discarded blocks are marked “not content” and flagged as :MIGHT_BE_CONTENT.

Note that, by default, only TextBlocks marked as “content” are taken into consideration.
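A minimal sketch of the selection rule, assuming a hypothetical `SizedBlock` struct and counting words with `String#split`; this is not the library's implementation.

```ruby
# Hypothetical stand-in for a text block.
SizedBlock = Struct.new(:text, :content, :labels) do
  def num_words
    text.split.size
  end
end

def keep_largest_block(blocks)
  largest = nil
  blocks.each do |block|
    next unless block.content  # only content blocks are candidates
    # Strict comparison keeps the first block when word counts tie.
    largest = block if largest.nil? || block.num_words > largest.num_words
  end
  return blocks unless largest

  blocks.each do |block|
    next if block.equal?(largest)
    block.content = false
    block.labels << :MIGHT_BE_CONTENT
  end
  blocks
end

sized = [
  SizedBlock.new("a b c", true, []),
  SizedBlock.new("d e f", true, []),     # same size, but comes later
  SizedBlock.new("g", true, []),
  SizedBlock.new("w x y z", false, [])   # larger, but not content
]
keep_largest_block(sized)
```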

Marks all blocks as content that are on the same tag level as very likely main content (usually the level of the largest block) and have a significant number of words (currently at least 100). Used downstream of KeepLargestBlockFilter.
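The two conditions can be sketched as follows. `LevelBlock` and `promote_large_blocks_on_same_level` are hypothetical names, and taking the largest content block as the "very likely main content" is an assumption based on the description above.

```ruby
# Hypothetical stand-in for a text block.
LevelBlock = Struct.new(:num_words, :tag_level, :content)

def promote_large_blocks_on_same_level(blocks, min_words: 100)
  # Assumption: the main content is the largest block still marked content.
  main = blocks.select(&:content).max_by(&:num_words)
  return blocks unless main

  blocks.each do |block|
    next if block.content
    same_level = block.tag_level == main.tag_level
    block.content = true if same_level && block.num_words >= min_words
  end
  blocks
end

levels = [
  LevelBlock.new(300, 5, true),   # main content
  LevelBlock.new(120, 5, false),  # same level, large enough: promoted
  LevelBlock.new(150, 3, false),  # large, but on another level
  LevelBlock.new(40, 5, false)    # same level, too short
]
promote_large_blocks_on_same_level(levels)
```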

Marks nested list-item blocks after the end of the main content as content.

Used downstream of KeepLargestBlockFilter.

Marks all blocks as content.

Keeps only blocks that have at least one segment fragment (“clause”) with at least k words (default: 5).

NOTE: You might consider using the SplitParagraphBlocksFilter upstream.

See SplitParagraphBlocksFilter.
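A simplified sketch of the clause test. `clause_with_min_words?` is a hypothetical helper, and treating `, . : ; ! ?` as the clause delimiters is an assumption; the actual filter may define clauses differently.

```ruby
# Returns true when at least one clause of the text has min_words words.
# Clause boundaries are assumed to be runs of common punctuation marks.
def clause_with_min_words?(text, min_words: 5)
  text.split(/[,.:;!?]+/).any? { |clause| clause.split.size >= min_words }
end

nav_like = clause_with_min_words?("Home, About, Contact")
sentence = clause_with_min_words?("The quick brown fox jumps over the dog.")
```

Navigation-style text tends to consist of very short clauses, so it fails the test, while ordinary sentences usually pass.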

Keeps only those content blocks which contain at least k words.

Classifies TextBlocks as content/not-content through rules that have been determined using the C4.8 machine learning algorithm, as described in the paper “Boilerplate Detection using Shallow Text Features” (WSDM 2010), particularly using number of words per block and link density per block.

Merges two subsequent blocks if their text densities are equal.
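A minimal sketch of the merge rule, using a hypothetical `DenseBlock` struct. It assumes the merged block keeps the shared density value; a real implementation might recompute it after merging.

```ruby
# Hypothetical stand-in for a text block with a precomputed text density.
DenseBlock = Struct.new(:text, :text_density)

def fuse_equal_density(blocks)
  blocks.each_with_object([]) do |block, fused|
    prev = fused.last
    if prev && prev.text_density == block.text_density
      # Assumption: the merged block keeps the shared density value.
      prev.text = "#{prev.text}\n#{block.text}"
    else
      fused << block.dup
    end
  end
end

dense = [
  DenseBlock.new("First paragraph", 8.0),
  DenseBlock.new("Second paragraph", 8.0),
  DenseBlock.new("Menu", 3.5)
]
merged = fuse_equal_density(dense)
```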

Splits TextBlocks at paragraph boundaries.

NOTE: This is not fully supported (i.e., it will break highlighting support via getContainedTextElements()), but it is probably necessary for some other filters.

See MinClauseWordsFilter.

Finds blocks which are potentially indicating the end of an article text and marks them with INDICATES_END_OF_TEXT. This can be used in conjunction with a downstream IgnoreBlocksAfterContentFilter.

Marks trailing headlines (TextBlocks that have the label :HEADING) as boilerplate. Trailing means they are marked content and are below any other content block.
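One way to picture "trailing" is to walk the blocks from the end and stop at the first content block that is not a heading. The sketch below uses a hypothetical `HeadBlock` struct and helper name; it is not the library's code.

```ruby
# Hypothetical stand-in for a text block.
HeadBlock = Struct.new(:text, :content, :labels)

def trailing_headlines_to_boilerplate(blocks)
  blocks.reverse_each do |block|
    next unless block.content                     # skip existing boilerplate
    break unless block.labels.include?(:HEADING)  # real content reached
    block.content = false
  end
  blocks
end

heads = [
  HeadBlock.new("Article title", true, [:HEADING]),
  HeadBlock.new("Article body text", true, []),
  HeadBlock.new("Related stories", true, [:HEADING]),
  HeadBlock.new("Footer links", false, [])
]
trailing_headlines_to_boilerplate(heads)
```

The leading headline survives because the scan stops as soon as it meets the article body.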