module OcrChallenge::NameParser

Identifying names in a blob of text turns out to be hard. I decided to combine a dictionary of names with a filter that discards any line containing digits.
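The heuristic can be sketched on in-memory data, without any name files. This is only an illustration: `candidate_name_lines` and the sample data are hypothetical and not part of OcrChallenge, and the sketch escapes each name before building the pattern, which is an assumption about the intended matching.

```ruby
# Hypothetical helper, not part of OcrChallenge: the same heuristic applied to
# in-memory data instead of name files.
def candidate_name_lines(lines, names)
  # One case-insensitive, whole-word pattern per dictionary entry.
  # Regexp.escape keeps characters such as "." from being read as regex syntax.
  patterns = names.map(&:strip).reject(&:empty?).map do |name|
    /\b#{Regexp.escape(name)}\b/i
  end

  lines.map(&:strip)
       .reject { |line| line.match?(/\d/) }                     # drop lines with digits
       .select { |line| patterns.any? { |p| line.match?(p) } }  # keep dictionary hits
end

# Made-up sample input for illustration only.
ocr_lines = ["  John Smith ", "555-1234", "Acme Corp", "mary jones", "Johnson Street"]
names     = ["John", "Mary"]

p candidate_name_lines(ocr_lines, names)
# prints ["John Smith", "mary jones"]
```

"Johnson Street" is rejected because the word boundary requires "John" to stand alone as a word, and "555-1234" is dropped by the digit filter before the dictionary is consulted.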

Public Instance Methods

parse_names(dir_path)

Note: the name files are expected to contain newline-separated names.

# File lib/ocr_challenge/name_parser.rb, line 8
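def parse_names(dir_path)

  #TODO: catch IO exception
  names_dir = Pathname.new(dir_path)
  name_files = names_dir.children

  # read the dictionary once instead of re-reading every file for each line;
  # blank entries are skipped because an empty name would match almost any line
  names = name_files.flat_map(&:readlines).map(&:strip).reject(&:empty?)

  # `lines` is not defined in this method; presumably the including class supplies it
  preprocessed_lines = lines.map(&:strip).reject do |line|
    line =~ /\d/    # names shouldn't have digits in them
  end

  # keep lines that contain any dictionary name as a whole word, ignoring case;
  # Regexp.escape stops characters in a name from being read as regex syntax
  preprocessed_lines.select do |line|
    names.any? do |name|
      line =~ /\b#{Regexp.escape(name)}\b/i
    end
  end
end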