class Docsplit::TextCleaner

Cleans up OCR’d text by using a series of heuristics to remove garbage words. Algorithms taken from:

Automatic Removal of "Garbage Strings" in OCR Text: An Implementation
  -- Taghva, Nartker, Condit, and Borsack

Improving Search and Retrieval Performance through Shortening Documents,
Detecting Garbage, and Throwing out Jargon
  -- Kulp

Constants

ACRONYM
ALL_ALPHA
ALNUM
CONSONANT
CONSONANT_5
LOWER
NEWLINE
PUNCT
REPEAT
REPEATED
SINGLETONS
SPACE
UPPER
VOWEL
VOWEL_5
WORD

Cached regexes we plan on using.

Public Instance Methods

clean(text)

For the time being, `clean` uses the regular StringScanner rather than a multibyte-aware version, so it coerces the text to ASCII first.
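
The ASCII coercion can be tried in isolation. The following is a minimal sketch, assuming Ruby 1.9 or later where `String#encode` is available. The helper name `to_ascii` is hypothetical and not part of Docsplit, and unlike `clean` it uses the non-mutating `encode`, so the caller's string is left untouched.

```ruby
# Hypothetical helper (not part of Docsplit): coerce a string to ASCII,
# replacing every character that has no ASCII equivalent with '?'.
def to_ascii(text)
  text.encode('ascii', invalid: :replace, undef: :replace, replace: '?')
end

sample = "na\u00EFve caf\u00E9" # contains two non-ASCII characters
ascii  = to_ascii(sample)

puts ascii
puts ascii.encoding
```

Note that `clean` itself calls `encode!`, which modifies the string passed in; callers who need the original text may want to pass a copy.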

# File lib/docsplit/text_cleaner.rb, line 37
def clean(text)
  if String.method_defined?(:encode)
    text.encode!('ascii', :invalid => :replace, :undef => :replace, :replace => '?')
  else
    require 'iconv' unless defined?(Iconv)
    text = Iconv.iconv('ascii//translit//ignore', 'utf-8', text).first
  end

  scanner = StringScanner.new(text)
  cleaned = []
  spaced  = false
  loop do
    if space = scanner.scan(SPACE)
      cleaned.push(space) unless spaced && (space !~ NEWLINE)
      spaced = true
    elsif word = scanner.scan(WORD)
      unless garbage(word)
        cleaned.push(word)
        spaced = false
      end
    elsif scanner.eos?
      return cleaned.join('').gsub(REPEATED, '')
    end
  end
end
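
The scan loop above alternates between whitespace and word tokens, and skips the whitespace that follows a dropped word unless it contains a newline. Here is a minimal standalone sketch of that pattern. The `SKETCH_*` patterns and the `filter_tokens` helper are assumptions made for illustration; the library defines its own SPACE, WORD, and NEWLINE constants, whose values are not shown on this page.

```ruby
require 'strscan'

# Assumed patterns for illustration only.
SKETCH_SPACE   = /\s+/
SKETCH_WORD    = /\S+/
SKETCH_NEWLINE = /[\r\n]/

# Hypothetical stand-in for the loop in `clean`: drops every word for which
# the block returns true, and avoids doubling up whitespace around it.
def filter_tokens(text)
  scanner = StringScanner.new(text)
  kept    = []
  spaced  = false
  until scanner.eos?
    if (space = scanner.scan(SKETCH_SPACE))
      # Keep whitespace unless we already emitted some, but always keep newlines.
      kept << space unless spaced && space !~ SKETCH_NEWLINE
      spaced = true
    elsif (word = scanner.scan(SKETCH_WORD))
      unless yield(word)
        kept << word
        spaced = false
      end
    end
  end
  kept.join
end

result = filter_tokens("keep zzzz this\nline") { |w| w == "zzzz" }
puts result.inspect
```

Because the two assumed patterns are complements of each other, every position in the input matches one of them, so the loop always advances.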

garbage(w)

Is a given word OCR garbage?

# File lib/docsplit/text_cleaner.rb, line 64
def garbage(w)
  acronym = w =~ ACRONYM

  # More than 30 characters in length.
  (w.length > 30) ||

  # If there are three or more identical characters in a row in the string.
  (w =~ REPEAT) ||

  # More punctuation than alpha numerics.
  (!acronym && (w.scan(ALNUM).length < w.scan(PUNCT).length)) ||

  # Ignoring the first and last characters in the string, if there are three or
  # more different punctuation characters in the string.
  (w[1...-1].scan(PUNCT).uniq.length >= 3) ||

  # Five or more consecutive vowels, or five or more consecutive consonants.
  ((w =~ VOWEL_5) || (w =~ CONSONANT_5)) ||

  # Number of uppercase letters greater than lowercase letters, but the word is
  # not all uppercase + punctuation.
  (!acronym && (w.scan(UPPER).length > w.scan(LOWER).length)) ||

  # Single letters that are not A or I.
  (w.length == 1 && (w =~ ALL_ALPHA) && (w !~ SINGLETONS)) ||

  # All characters are alphabetic and there are 8 times more vowels than
  # consonants, or 8 times more consonants than vowels.
  (!acronym && (w.length > 2 && (w =~ ALL_ALPHA)) &&
    (((vows = w.scan(VOWEL).length) > (cons = w.scan(CONSONANT).length) * 8) ||
      (cons > vows * 8)))
end
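
To get a feel for how a few of these heuristics behave, here is a small standalone sketch. The `GarbageSketch` module, its `reason` method, and all of its patterns are assumptions made for illustration, since the values of the library's constants are not shown on this page. It covers only the length, repetition, vowel/consonant run, and single-letter checks.

```ruby
# Hypothetical module (not part of Docsplit) with assumed patterns.
module GarbageSketch
  REPEAT      = /([^0-9])\1{2,}/              # same non-digit three or more times in a row
  VOWEL_5     = /[aeiou]{5}/i                 # five consecutive vowels
  CONSONANT_5 = /[bcdfghjklmnpqrstvwxyz]{5}/i # five consecutive consonants
  ALL_ALPHA   = /\A[a-z]+\z/i
  SINGLETONS  = /\A[AaIi]\z/

  # Returns a symbol naming the first heuristic that fires, or nil if none do.
  def self.reason(word)
    return :too_long      if word.length > 30
    return :repeat        if word =~ REPEAT
    return :vowel_run     if word =~ VOWEL_5
    return :consonant_run if word =~ CONSONANT_5
    return :stray_letter  if word.length == 1 && word =~ ALL_ALPHA && word !~ SINGLETONS
    nil
  end
end

%w[document aaaargh xkcdqz q I queueing].each do |w|
  puts "#{w}: #{GarbageSketch.reason(w).inspect}"
end
```

Under these assumed patterns, legitimate words such as "queueing" (five vowels in a row) are flagged too, which is a reminder that the heuristics trade some false positives for cleaner OCR output.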