class Docsplit::TextCleaner
Cleans up OCR’d text by using a series of heuristics to remove garbage words. Algorithms taken from:
- Automatic Removal of "Garbage Strings" in OCR Text: An Implementation -- Taghva, Nartker, Condit, and Borsack
- Improving Search and Retrieval Performance through Shortening Documents, Detecting Garbage, and Throwing out Jargon -- Kulp
Constants
- ACRONYM
- ALL_ALPHA
- ALNUM
- CONSONANT
- CONSONANT_5
- LOWER
- NEWLINE
- PUNCT
- REPEAT
- REPEATED
- SINGLETONS
- SPACE
- UPPER
- VOWEL
- VOWEL_5
- WORD
Cached regexes we plan on using.
Public Instance Methods
clean(text)
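For the time being, `clean` uses the regular StringScanner rather than a multibyte-aware version, so the text is coerced to ASCII first. When String#encode is available, characters that are invalid or have no ASCII equivalent are replaced with '?'; otherwise Iconv transliterates what it can and ignores the rest.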
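The ASCII coercion can be tried in isolation. The following is a minimal sketch, assuming a Ruby with String#encode; the sample string and the variable names are made up for illustration, and the non-mutating `encode` is used here instead of `encode!` so the input is left untouched.

```ruby
# Illustrative input only: two characters that have no ASCII equivalent.
sample = "caf\u00e9 na\u00efve"

# Same options as the encode branch of `clean`, but non-mutating:
# anything invalid or undefined in ASCII becomes '?'.
ascii = sample.encode('ascii', invalid: :replace, undef: :replace, replace: '?')

puts ascii            # prints "caf? na?ve"
puts ascii.encoding   # prints "US-ASCII"
```

Because accented letters collapse to '?', words containing them pick up extra punctuation and may become more likely to trip the punctuation-based checks in `garbage`.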
# File lib/docsplit/text_cleaner.rb, line 37
def clean(text)
  if String.method_defined?(:encode)
    text.encode!('ascii', :invalid => :replace, :undef => :replace, :replace => '?')
  else
    require 'iconv' unless defined?(Iconv)
    text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first
  end

  scanner = StringScanner.new(text)
  cleaned = []
  spaced  = false
  loop do
    if space = scanner.scan(SPACE)
      cleaned.push(space) unless spaced && (space !~ NEWLINE)
      spaced = true
    elsif word = scanner.scan(WORD)
      unless garbage(word)
        cleaned.push(word)
        spaced = false
      end
    elsif scanner.eos?
      return cleaned.join('').gsub(REPEATED, '')
    end
  end
end
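The scanning loop alternates between whitespace runs and words, dropping a whitespace run when it follows a removed word, unless that run contains a newline. The sketch below illustrates just that bookkeeping. It is not the library's code: `sketch_clean`, the three local patterns, and the `reject` lambda (standing in for `garbage`) are hypothetical, the patterns are assumptions because the constants' definitions are not shown on this page, and the final `gsub(REPEATED, '')` pass is omitted.

```ruby
require 'strscan'

# Hypothetical helper mirroring the loop in `clean`.
# `reject` is a callable that plays the role of `garbage`.
def sketch_clean(text, reject)
  # Assumed patterns; the real SPACE, WORD and NEWLINE may differ.
  space_re   = /\s+/
  word_re    = /\S+/
  newline_re = /[\r\n]/

  scanner = StringScanner.new(text)
  cleaned = []
  spaced  = false

  until scanner.eos?
    if (space = scanner.scan(space_re))
      # Skip whitespace that follows already-kept whitespace,
      # but always keep runs that contain a line break.
      cleaned.push(space) unless spaced && (space !~ newline_re)
      spaced = true
    elsif (word = scanner.scan(word_re))
      unless reject.call(word)
        cleaned.push(word)
        spaced = false
      end
    end
  end

  cleaned.join
end

# Treat any word starting with '~' as garbage, purely for demonstration.
result = sketch_clean("the ~~~ quick\nfox", ->(w) { w.start_with?('~') })
p result   # prints "the quick\nfox"
```

Note that `spaced` is not reset when a word is rejected, which is what prevents a double space from appearing where the garbage word used to be.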
garbage(w)
Is a given word OCR garbage?
# File lib/docsplit/text_cleaner.rb, line 64
def garbage(w)
  acronym = w =~ ACRONYM

  # More than 30 bytes in length.
  (w.length > 30) ||

  # If there are three or more identical characters in a row in the string.
  (w =~ REPEAT) ||

  # More punctuation than alpha numerics.
  (!acronym && (w.scan(ALNUM).length < w.scan(PUNCT).length)) ||

  # Ignoring the first and last characters in the string, if there are three or
  # more different punctuation characters in the string.
  (w[1...-1].scan(PUNCT).uniq.length >= 3) ||

  # Four or more consecutive vowels, or five or more consecutive consonants.
  ((w =~ VOWEL_5) || (w =~ CONSONANT_5)) ||

  # Number of uppercase letters greater than lowercase letters, but the word is
  # not all uppercase + punctuation.
  (!acronym && (w.scan(UPPER).length > w.scan(LOWER).length)) ||

  # Single letters that are not A or I.
  (w.length == 1 && (w =~ ALL_ALPHA) && (w !~ SINGLETONS)) ||

  # All characters are alphabetic and there are 8 times more vowels than
  # consonants, or 8 times more consonants than vowels.
  (!acronym && (w.length > 2 && (w =~ ALL_ALPHA)) &&
    (((vows = w.scan(VOWEL).length) > (cons = w.scan(CONSONANT).length) * 8) ||
      (cons > vows * 8)))
end
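To get a feel for how the checks combine, here is a reduced sketch covering four of the rules above: length, repeated characters, punctuation versus alphanumerics, and single letters. `sketch_garbage?` is a hypothetical name, and every pattern in it is an assumption written to match the comments in `garbage`, since the actual constant definitions are not shown on this page. The acronym exemption and the remaining rules are left out.

```ruby
# Hypothetical, reduced version of `garbage`; not the library's method.
def sketch_garbage?(w)
  # Assumed patterns, chosen to match the documented comments.
  alnum      = /[a-z0-9]/i
  punct      = /[[:punct:]]/
  repeat     = /(.)\1{2,}/      # three or more identical characters in a row
  all_alpha  = /\A[a-z]+\z/i
  singletons = /\A[AaIi]\z/

  return true if w.length > 30
  return true if w.match?(repeat)
  return true if w.scan(alnum).length < w.scan(punct).length
  return true if w.length == 1 && w.match?(all_alpha) && !w.match?(singletons)

  false
end

%w[hello a x t.;,e;r woooord].each do |w|
  puts "#{w.inspect} => #{sketch_garbage?(w)}"
end
```

With these assumed patterns, "hello" and "a" are kept, while "x" (a single letter other than A or I), "t.;,e;r" (four punctuation marks against three letters) and "woooord" (a run of four identical characters) are flagged.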