module DateParser::NaturalDateParsing

Handles the mechanics of natural language processing.

Methods

interpret_date(txt, creation_date, parse_single_years, parse_ambiguous_dates): Returns an array of the dates found in txt.

We parse in order of decreasing strictness: a very specific phrase like “January 1st, 2013” is parsed before “January 1st”, which is parsed before just “2013”. Once a phrase has been parsed as a date, its words are removed so they cannot be matched again. In the example “January 1st, 2013” we therefore return only one date.

If no dates are found, returns an empty array.

parse_one_word(word, creation_date, parse_single_years, parse_ambiguous_dates): Given a single word (a String), tries to return a Date object.

parse_two_words(words, creation_date = nil): Attempts to return a Date object given a string containing two words.

parse_three_words(words, creation_date = nil): Given three words, attempts to return a Date object.
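The "decreasing strictness" order described for interpret_date can be sketched as a sliding window that consumes the words it matches. This is only an illustration: consume_windows, TOY_MONTHS, and the regular expressions are hypothetical stand-ins, not part of the module.

```ruby
# Hypothetical helper: slide a window of `size` words over `words`, call the
# block on each window, and remove the window when the block returns non-nil.
def consume_windows(words, size)
  found = []
  i = 0
  while i <= words.length - size
    result = yield(words[i, size])
    if result.nil?
      i += 1
    else
      found << result
      words.slice!(i, size) # consume the matched words and stay at index i
    end
  end
  found
end

TOY_MONTHS = %w[january february march].freeze # toy vocabulary, not MONTH

words = %w[released january 1st 2013 and patched in 2014]

# Strictest pattern first: MONTH DAY YEAR.
full = consume_windows(words, 3) do |m, d, y|
  "#{m} #{d} #{y}" if TOY_MONTHS.include?(m) &&
                      d.match?(/\A\d+(st|nd|rd|th)?\z/) &&
                      y.match?(/\A\d{4}\z/)
end

# Loosest pattern last: a bare four-digit year.
years = consume_windows(words, 1) do |window|
  window[0] if window[0].match?(/\A\d{4}\z/)
end

p full  # => ["january 1st 2013"]
p years # => ["2014"]
p words # => ["released", "and", "patched", "in"]
```

Because "2013" was consumed by the three-word pass, the one-word pass cannot report it a second time.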

Constants

MONTH

Names of months as well as common shortened versions.

NUMERIC_DAY

The numbers 1 through 31.

RELATIVE_DAYS

Phrases that denote a date relative to today (here often called the creation_date).

SINGLE_DAYS

Names of days as well as common shortened versions.

SUFFIXED_NUMERIC_DAY

The numbers 1 through 31 with their common ordinal suffixes (such as 1st, 2nd, etc.).
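The actual definitions of these constants are not shown here. As a rough sketch, the two day lists could be generated as follows, assuming they hold Strings (the parsing code compares them against words). The ordinal_suffix helper and the lowercase variable names are hypothetical.

```ruby
# Hypothetical helper: English ordinal suffix for a positive integer.
def ordinal_suffix(n)
  return "th" if (11..13).cover?(n % 100) # 11th, 12th, 13th
  { 1 => "st", 2 => "nd", 3 => "rd" }.fetch(n % 10, "th")
end

numeric_day          = (1..31).map(&:to_s)
suffixed_numeric_day = (1..31).map { |n| "#{n}#{ordinal_suffix(n)}" }

p numeric_day.first(3)          # => ["1", "2", "3"]
p suffixed_numeric_day.first(4) # => ["1st", "2nd", "3rd", "4th"]
```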

Public Class Methods

full_numeric_date(word)

Parses a single word of the form XXXX-XX-XX, DD-MM-YYYY, or MM-DD-YYYY. Also accepts words of the form XXXX/XX/XX.

# File lib/date_parser/natural_date_parsing.rb, line 392
def NaturalDateParsing.full_numeric_date(word)
  demarcating_token = get_demarcating_token(word)
  
  subparts = word.split(demarcating_token)
  
  # This is a weak check to see where the year is: a first field above 31
  # cannot be a day or a month.
  year_index = (subparts[0].to_i).abs > 31 ? 0 : 2
  
  if year_index == 0
    # We assume it's of the form YYYY-MM-DD
    return Date.parse(word)
  else
    # We check the subparts to try to see which part is DD.
    # If we can't determine it, we assume it's International Standard Format,
    # or DD-MM-YYYY
    
    if subparts[1].to_i > 12
      # American Standard (MM-DD-YYYY)
      subparts[0] = numeric_month_to_string(subparts[0].to_i)
      return Date.parse(subparts.join(" "))
      
    else
      # International Standard (DD-MM-YYYY)
      return Date.parse(word)
    end
  end
end
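The field-order heuristic above can also be written so that the order is decided explicitly rather than left to Date.parse. The following is a sketch only: numeric_date_sketch is a hypothetical name, and it returns nil for impossible dates where the original would raise.

```ruby
require 'date'

# Hypothetical helper: split on "-" or "/" and pick the field order by hand.
def numeric_date_sketch(word)
  parts = word.split(/[-\/]/).map(&:to_i)
  return nil unless parts.length == 3

  if parts[0] > 31      # first field too large for a day or month: YYYY-MM-DD
    year, month, day = parts
  elsif parts[1] > 12   # middle field cannot be a month: MM-DD-YYYY
    month, day, year = parts
  else                  # otherwise assume DD-MM-YYYY
    day, month, year = parts
  end

  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end

puts numeric_date_sketch("2013-01-25") # 2013-01-25
puts numeric_date_sketch("01/25/2013") # 2013-01-25 (American order)
puts numeric_date_sketch("25-01-2013") # 2013-01-25 (day first)
puts numeric_date_sketch("05-04-2013") # 2013-04-05 (ambiguous, read as day first)
```

The last call shows the limit of the heuristic: when both leading fields are 12 or less, the input is read as day first, which may not be what an American author meant.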
interpret_date(txt, creation_date = nil, parse_single_years = false, parse_ambiguous_dates = true)

Processes a given text and returns an array of probable dates contained within.

Description

Tries to interpret dates from the given text, in order from strictest interpretation to looser interpretations. No word can be part of two different dates.

Works by calling parse_three_words, parse_two_words, and parse_one_word on the text.

Attributes

  • txt - The text to parse.

  • creation_date - A Date object of when the text was created or released. Defaults to nil, but if provided can make returned dates more accurate.

  • parse_single_years - A boolean. If true, we interpret single numbers as years. This is a very broad assumption, and so defaults to false.

  • parse_ambiguous_dates - Some phrases are not necessarily dates depending on context. For example “1st” may not refer to the 1st of a month. This option toggles whether or not those phrases are considered dates. Defaults to true.

Examples

text = "Henry and Hanke created a calendar that causes each day to fall " +
       "on the same day of the week every year. They recommend its " +
       "implementation on January 1, 2018, a Monday."
creation_date = Date.parse("July 6, 2016")

NaturalDateParsing.interpret_date(text, creation_date)
    #=> [#<Date: 2018-01-01 ((2458120j,0s,0n),+0s,2299161j)>,
         #<Date: 2016-07-11 ((2457581j,0s,0n),+0s,2299161j)>]

NaturalDateParsing.interpret_date("No dates here!")
    #=> []

NaturalDateParsing.interpret_date("2012", nil, true)
    #=> [#<Date: 2012-01-01 ((2455928j,0s,0n),+0s,2299161j)>]
# File lib/date_parser/natural_date_parsing.rb, line 116
def NaturalDateParsing.interpret_date(
                                      txt, 
                                      creation_date = nil, 
                                      parse_single_years = false,
                                      parse_ambiguous_dates = true
                                      )
  possible_dates = []
  txt = Utils::clean_str(txt)
  words = txt.split(" ").map{|x| x.strip}
  
  # We use the while loop, as apparently there are cases where we try to subset
  # words despite the value of i being >= words.length - 3
  # TODO: Figure out why the above happens. Preferably return to for loop.
  # TODO: Cleaner way of structuring the below? I could break up the loops
  # into single functions. Consider.
  i = 0
  
  while (i <= words.length - 3) do
    subset_words = words[i..(i+2)]
    
    proposed_date = parse_three_words(subset_words, creation_date)
    
    if(! proposed_date.nil?)
      possible_dates << proposed_date
      words = Utils::delete_at_indices(words, i..(i+2))
      i -= 1
    end
    
    i += 1
  end
  
  i = 0
  
  while (i <= words.length - 2) do
    subset_words = words[i..(i+1)]
    proposed_date = parse_two_words(subset_words, creation_date)
    
    if(! proposed_date.nil?)
      possible_dates << proposed_date
      words = Utils::delete_at_indices(words, i..(i+1))
      i -= 1
    end
    
    i += 1
  end
  
  i = 0
  
  while (i <= words.length - 1) do
    subset_words = words[i]
    
    proposed_date = parse_one_word(subset_words, 
                                   creation_date, 
                                   parse_single_years,
                                   parse_ambiguous_dates)
    
    if(! proposed_date.nil?)
      possible_dates << proposed_date
      words.delete_at(i)
      i -= 1
    end
    
    i += 1
  end
  
  return possible_dates
end
month_day(words, creation_date = nil)

Parses an array containing two elements (single words) on the assumption that the array is of the form [“MONTH”, “DAY”].

# File lib/date_parser/natural_date_parsing.rb, line 365
def NaturalDateParsing.month_day(words, creation_date = nil)
  begin
    proposed_date = Date.parse(words.join(" "))
    
    diff_in_years = creation_date.nil? ? 0 : (creation_date.year - Date.today.year)
    
    return proposed_date >> diff_in_years * 12
  rescue ArgumentError
    return nil
  end
end
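month_day moves the parsed date into the creation year with Date#>>, which shifts by a number of months and accepts negative values. A small fixed-date illustration follows; the reference variable stands in for Date.today so that the result does not depend on when the code runs.

```ruby
require 'date'

parsed        = Date.new(2024, 3, 15) # stands in for Date.parse("march 15")
creation_date = Date.new(2022, 6, 1)
reference     = Date.new(2024, 1, 1)  # stands in for Date.today

diff_in_years = creation_date.year - reference.year # -2
shifted = parsed >> (diff_in_years * 12)
puts shifted # 2022-03-15

# A leap day has no counterpart in a common year, so >> clamps it
# to the last day of February.
leap_day = Date.new(2024, 2, 29)
clamped  = leap_day >> -12
puts clamped # 2023-02-28
```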
numeric_single_day(word, creation_date = nil)

Parses a single numeric date (1st, 2nd, 3rd, etc.).

# File lib/date_parser/natural_date_parsing.rb, line 378
def NaturalDateParsing.numeric_single_day(word, creation_date = nil)
  diff_in_months = creation_date.nil? ? 0 : (creation_date.year * 12 + creation_date.month) - 
                                            (Date.today.year * 12 + Date.today.month)
  
  begin
    return Date.parse(word) >> diff_in_months
  rescue ArgumentError
    ## If an ArgumentError arises, Date is not convinced it's a date.
    return nil
  end
end
parse_one_word(word, creation_date = nil, parse_single_years = false, parse_ambiguous_dates = true)

Takes a single word and tries to return a date.

If no date can be interpreted from the word, returns nil. We consider these cases:

  • DAY (mon, tuesday, etc.)

  • A relative day (today, tomorrow, tonight, yesterday)

  • Dates of the form MM/DD

  • Suffixed day numbers from 1st to 31st (skipped when parse_ambiguous_dates is false)

  • MONTH (jan, february, etc.)

  • YYYY (2012, 102; must be enabled with parse_single_years)

  • YYYY-MM-DD, DD-MM-YYYY, MM-DD-YYYY

Attributes

  • word - A String, preferably consisting of a single word.

  • creation_date - A Date object of when the text was created or released. Defaults to nil, but if provided can make returned dates more accurate.

  • parse_single_years - A boolean. If true, we interpret single numbers as years. This is a very broad assumption, and so defaults to false.

  • parse_ambiguous_dates - Some phrases are not necessarily dates depending on context. For example “1st” may not refer to the 1st of a month. This option toggles whether or not those phrases are considered dates. Defaults to true.

# File lib/date_parser/natural_date_parsing.rb, line 217
def NaturalDateParsing.parse_one_word(
                                      word, 
                                      creation_date = nil, 
                                      parse_single_years = false,
                                      parse_ambiguous_dates = true
                                      )
  
  if SINGLE_DAYS.include? word
    proposed_date = Date.parse(word)
    
    # If we have the creation_date date, we can try to be a little smarter
    if(! creation_date.nil?)
      weeks_to_shift = difference_in_weeks(Date.today, creation_date)
                                                       
      proposed_date = proposed_date - (weeks_to_shift * 7)
      
      # Right now though, it should be within 1 week of accuracy, and either one
      # week ahead or one week behind.
      # The solution is pretty simple. If the proposed date
      # is more than a week ahead of the creation date, then go back one week.
      if proposed_date - creation_date > 7
        proposed_date = proposed_date - 7
      elsif proposed_date - creation_date < 0
        proposed_date = proposed_date + 7
      end
    end
    
    return proposed_date
  end
  
  # Parsing phrases like "yesterday", "today", "tonight"
  if RELATIVE_DAYS.include? word
    if word == 'today' || word == 'tonight'
      if creation_date.nil?
        return Date.today
      else
        return creation_date
      end
    elsif word == 'yesterday'
      if creation_date.nil?
        return Date.today - 1
      else
        return creation_date - 1
      end
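    elsif word == "tomorrow"
      # Fall back to today when no creation_date is given, as the
      # other relative days do.
      if creation_date.nil?
        return Date.today + 1
      else
        return creation_date + 1
      end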
    end
  end
  
  # Parsing strings like "23rd"
  if (SUFFIXED_NUMERIC_DAY.include? word) && parse_ambiguous_dates
    return numeric_single_day(word, creation_date)
  end
  
  # Parsing month names
  if MONTH.include? word
    return default_month(word, creation_date)
  end
  
  # In this case, we assume it's a year!
  if parse_single_years && (Utils::is_int? word)
    return default_year(word)
  end
  
  # Parsing XX-XX-XXXX, XXXX-XX-XX, XX/XX/XXXX, or XXXX/XX/XX
  if full_numeric_date?(word)
    return full_numeric_date(word)
  end
  
  # Parsing strings of the form XX/XX
  if slash_date?(word)
    return slash_date(word, creation_date)
  end
end
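The weekday branch above anchors a day name to the week of creation_date by shifting from Date.today. A simpler alternative, sketched here under the assumption that a bare day name means the first such day on or after the creation date, uses Date#wday arithmetic instead. next_weekday is a hypothetical helper and is not how parse_one_word is implemented.

```ruby
require 'date'

# Hypothetical helper: first date on or after `creation_date` that falls on
# weekday `wday` (0 = Sunday ... 6 = Saturday, as in Date#wday).
def next_weekday(creation_date, wday)
  creation_date + ((wday - creation_date.wday) % 7)
end

creation_date = Date.new(2016, 7, 6)    # a Wednesday
monday = Date::DAYNAMES.index("Monday") # 1

puts next_weekday(creation_date, monday)             # 2016-07-11
puts next_weekday(creation_date, creation_date.wday) # 2016-07-06
```

The first result matches the interpret_date example, where “a Monday” in a text created on July 6, 2016 resolves to 2016-07-11.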
parse_three_words(words, creation_date = nil)

Takes three words and tries to return a date.

If no date can be interpreted from the words, returns nil. The only form currently recognized is MONTH DAY YEAR, where DAY may carry a suffix (for example “january 1st 2013”).

Attributes

  • words - An array of three words, downcased and stripped.

  • creation_date - A Date object of when the text was created or released. Defaults to nil, but if provided can make returned dates more accurate.

# File lib/date_parser/natural_date_parsing.rb, line 328
def NaturalDateParsing.parse_three_words(words, creation_date = nil)
  
  if MONTH.include?(words[0]) && _weak_day?(words[1]) && Utils::is_int?(words[2])
    return Date.parse(words.join(" "))
  end
  
end
parse_two_words(words, creation_date = nil)

Takes two words and tries to return a date.

If no date can be interpreted from the words, returns nil. The only form currently recognized is MONTH DAY, where DAY may carry a suffix (for example “january 1st”).

Attributes

  • words - An array of two words, downcased and stripped.

  • creation_date - A Date object of when the text was created or released. Defaults to nil, but if provided can make returned dates more accurate.

# File lib/date_parser/natural_date_parsing.rb, line 306
def NaturalDateParsing.parse_two_words(words, creation_date = nil)
  
  if MONTH.include?(words[0]) && _weak_day?(words[1])
    return month_day(words, creation_date)
  end
  
end
slash_date(word, creation_date = nil)

Given a single word, assumes the word is of the form XX/XX and returns the appropriate Date object. If not possible, returns nil.

# File lib/date_parser/natural_date_parsing.rb, line 343
def NaturalDateParsing.slash_date(word, creation_date = nil)
  samp = word.split('/')
  month = samp[0].to_i
  day = samp[1].to_i
  
  if month > 0 && month <= 12 && day > 0 && day <= 31
    # TODO: IMPROVE EXCEPTION HANDLING.
    begin
      proposed_date = Date.parse(word)
      if(! creation_date.nil?) ## We're sensitive to only years here.
        years_diff = Date.today.year - creation_date.year
        proposed_date = proposed_date << (12 * years_diff)
      end
      return proposed_date
    rescue ArgumentError
      return nil
    end
  end
end
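As a sketch of an alternative, a slash date can be built directly in the creation year and validated with Date.valid_date?, which avoids both Date.parse and the rescue block. slash_date_sketch is a hypothetical name, and unlike slash_date it requires a creation date.

```ruby
require 'date'

# Hypothetical helper: read "MM/DD" in the year of `creation_date`.
def slash_date_sketch(word, creation_date)
  return nil unless word.match?(/\A\d{1,2}\/\d{1,2}\z/)

  month, day = word.split('/').map(&:to_i)
  year = creation_date.year
  Date.valid_date?(year, month, day) ? Date.new(year, month, day) : nil
end

puts slash_date_sketch("7/4", Date.new(2016, 7, 6))        # 2016-07-04
puts slash_date_sketch("2/29", Date.new(2016, 7, 6))       # 2016-02-29 (leap year)
p slash_date_sketch("2/29", Date.new(2015, 7, 6))          # nil (no leap day in 2015)
p slash_date_sketch("2/30", Date.new(2016, 7, 6))          # nil
```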

Private Class Methods

_weak_day?(word)

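Returns true if the word is a day number, with or without an ordinal suffix (that is, if it appears in NUMERIC_DAY or SUFFIXED_NUMERIC_DAY).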

# File lib/date_parser/natural_date_parsing.rb, line 430
def NaturalDateParsing._weak_day?(word)
  return (NUMERIC_DAY.include? word) || (SUFFIXED_NUMERIC_DAY.include? word)
end
default_month(month, released = nil)

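Returns a Date in the given month, using the year of released, or the current year if released is nil. TODO: not sensitive to year.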

# File lib/date_parser/natural_date_parsing.rb, line 439
def NaturalDateParsing.default_month(month, released = nil)
  this_year = released.nil? ? Date.today.year : released.year
  return Date.parse(month + " " + this_year.to_s)
end
default_year(year)
# File lib/date_parser/natural_date_parsing.rb, line 434
def NaturalDateParsing.default_year(year)
  return Date.parse("Jan 1 " + year)
end
difference_in_weeks(date1, date2)

Returns the number of whole weeks between two dates. Be careful with the argument order: date1 is expected to be the later date.

# File lib/date_parser/natural_date_parsing.rb, line 457
def NaturalDateParsing.difference_in_weeks(date1, date2)
  return ((date1 - date2) / 7).to_i
end
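Subtracting two Date objects yields a Rational number of days, and to_i truncates toward zero. The fixed dates below are chosen only for illustration; they show why the argument order matters.

```ruby
require 'date'

later   = Date.new(2016, 7, 20)
earlier = Date.new(2016, 7, 6)

gap   = later - earlier # Rational, 14 days
weeks = (gap / 7).to_i
puts weeks              # 2

# 13 days is one whole week; the remainder is dropped.
partial = ((Date.new(2016, 7, 19) - earlier) / 7).to_i
puts partial            # 1

# With the arguments swapped the result is negative and still truncates
# toward zero, giving -1 rather than -2.
swapped = ((earlier - Date.new(2016, 7, 19)) / 7).to_i
puts swapped            # -1
```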
full_numeric_date?(word)

Is it generally of the form XXXX-XX-XX or XXXX/XX/XX?

# File lib/date_parser/natural_date_parsing.rb, line 480
def NaturalDateParsing.full_numeric_date?(word)
  demarcating_token = get_demarcating_token(word)
  substrings = word.split(demarcating_token)
  
  if substrings.length != 3
    return false
  end
  
  for substring in substrings do
    if !Utils.is_int?(substring)
      return false
    end
  end
  
  return true
end
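For comparison, the same shape check can be sketched with one regular expression whose backreference forces both separators to be the same character. This swaps in a different technique from the split-and-check approach above, and it assumes that Utils.is_int? accepts only plain digit strings, which is not shown here. full_numeric_shape? is a hypothetical name.

```ruby
# Hypothetical alternative: three digit groups joined by the same separator.
def full_numeric_shape?(word)
  word.match?(/\A\d+([-\/])\d+\1\d+\z/)
end

p full_numeric_shape?("2013-01-25") # => true
p full_numeric_shape?("01/25/2013") # => true
p full_numeric_shape?("1/25")       # => false (only two groups)
p full_numeric_shape?("2013-01/25") # => false (mixed separators)
p full_numeric_shape?("jan-01-2013") # => false (non-numeric group)
```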
get_demarcating_token(word)

Given a string, tries to determine if the word contains a demarcating token such as '-' or '/'. If so, returns that demarcating token. Assumes that only one such token is present.

If no such token is found, returns an empty string.

# File lib/date_parser/natural_date_parsing.rb, line 512
def NaturalDateParsing.get_demarcating_token(word)
  demarcating_token = ""
  
  if word.include? "-"
    demarcating_token = "-"
  elsif word.include? "/"
    demarcating_token = "/"
  end
  
  return demarcating_token
end
numeric_month_to_string(numeric)

Converts a numeric month to a string.

# File lib/date_parser/natural_date_parsing.rb, line 498
def NaturalDateParsing.numeric_month_to_string(numeric)
  months = ["january", "february", "march", "april", "may", "june",
            "july", "august", "september", "october", "november",
            "december"]
  
  return months[numeric - 1]
end
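Note that months[numeric - 1] wraps around for an input of 0, because index -1 is the last element of a Ruby array, so 0 maps to "december". A sketch of a stricter variant built on the standard Date::MONTHNAMES table (whose index 0 is nil) follows; month_name_sketch is a hypothetical name.

```ruby
require 'date'

# Hypothetical variant: returns nil for anything outside 1..12.
def month_name_sketch(numeric)
  return nil unless (1..12).cover?(numeric)
  Date::MONTHNAMES[numeric].downcase
end

p month_name_sketch(1)  # => "january"
p month_name_sketch(12) # => "december"
p month_name_sketch(0)  # => nil
p month_name_sketch(13) # => nil
```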
slash_date?(word)

Determines if a given word could be a slash date, i.e., of the form XX/XX.

# File lib/date_parser/natural_date_parsing.rb, line 463
def NaturalDateParsing.slash_date?(word)
  substrings = word.split("/")
  
  if substrings.size != 2
    return false
  end
  
  for substring in substrings do
    if !Utils.is_int?(substring)
      return false
    end
  end
  
  return true
end
suffix(number)
# File lib/date_parser/natural_date_parsing.rb, line 444
def NaturalDateParsing.suffix(number)
  int = number.to_i
  
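  ## English ordinals: 11, 12 and 13 take "th"; otherwise the last
  ## decimal digit decides (the old `int & 1` test only checked for odd numbers).
  return int.to_s + "th" if (11..13).include?(int % 100)
  
  case int % 10
  when 1 then return int.to_s + "st"
  when 2 then return int.to_s + "nd"
  when 3 then return int.to_s + "rd"
  else        return int.to_s + "th"
  end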
end