module OcrChallenge::NameParser
It turns out that identifying names in a blob of text is hard. I decided to combine a dictionary of names with a filter that eliminates any line containing digits.
Public Instance Methods
parse_names(dir_path)
Note: each name file is expected to contain newline-separated names, one name per line.
    # File lib/ocr_challenge/name_parser.rb, line 8
    def parse_names(dir_path)
      #TODO: catch IO exception
      names_dir = Pathname.new(dir_path)
      name_files = names_dir.children

      preprocessed_lines = lines.map(&:strip).reject do |line|
        line =~ /\d/ # names shouldn't have digits in them
      end

      # compare the current line with all the names available in the name files
      preprocessed_lines.select do |line|
        name_files.any? do |file|
          name_lines = file.readlines
          name_lines.any? do |name_line|
            line.downcase =~ /\b#{name_line.downcase.chomp}\b/
          end
        end
      end
    end
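The listing above reads `lines` from outside the method (presumably the including class supplies it), re-reads every name file for each candidate line, and interpolates each name into a regex unescaped, so a blank line in a name file would match almost any text. The following is a minimal standalone sketch of the same dictionary-plus-digit-filter idea, not the module's actual implementation. The method name `parse_names_sketch`, the explicit `lines` argument, and the sample file name are assumptions made for illustration.

```ruby
require "pathname"
require "tmpdir"

# Hypothetical standalone variant: `lines` is passed in explicitly
# instead of being provided by the including class.
def parse_names_sketch(lines, dir_path)
  # Read each name file once; skip blank entries and escape regex
  # metacharacters so every name is matched literally.
  patterns = Pathname.new(dir_path).children.select(&:file?).flat_map do |file|
    file.readlines.map(&:strip).reject(&:empty?).map do |name|
      /\b#{Regexp.escape(name)}\b/i
    end
  end

  lines.map(&:strip)
       .reject { |line| line =~ /\d/ } # names shouldn't have digits in them
       .select { |line| patterns.any? { |re| line =~ re } }
end

# Usage with a throwaway dictionary directory (sample data is made up).
Dir.mktmpdir do |dir|
  File.write(File.join(dir, "first_names.txt"), "Alice\nBob\n\n")
  sample = ["Alice Smith", "555-1234 Bob", "Acme Corp", "  bob jones  "]
  p parse_names_sketch(sample, dir) # prints ["Alice Smith", "bob jones"]
end
```

Building the patterns up front trades a little memory for one pass over the name files, which matters once the dictionary or the input text grows.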