module RetrievalLite::BooleanRetrieval

Gathers documents that satisfy a boolean expression.

Public Class Methods

evaluate(corpus, query)

Gathers all documents of a corpus that satisfy a boolean expression built from the standard operators AND, OR, and NOT. Does not order the documents in any particular way. Assumes that every boolean operator has white space on both sides.

@param corpus [Corpus] the collection of documents
@param query [String] the boolean query to be evaluated
@return [Array<Document>] unordered array of documents that satisfy the query

# File lib/retrieval_lite/boolean_retrieval.rb, line 11
def self.evaluate(corpus, query)
  if !is_valid_expression?(query)
    raise "Each boolean operator (AND, OR, NOT) must operate on two terms."
  end

  # must strip all non alphanumeric characters
  query = strip_query(query)

  # must have spaces in front and back for next line
  query = " " + query + " " 

  # replace the boolean operators with their Ruby equivalents (&&, ||, !)
  query = query.gsub("AND", "\&\&").gsub("OR", "\|\|").gsub("NOT", "!")

  # replace all terms with corresponding functions
  query.gsub!(/[a-zA-Z0-9]+(-[a-zA-Z0-9]+)?/) do |q|
     " document.contains?(\"" + q.downcase + "\") "
  end

  output_documents = []
  corpus.documents.each do |document|
    begin
      if eval(query)
        output_documents << document
      end
    rescue
      raise "The boolean expression is not valid. Please check all parentheses and operators."
    end
  end

  return output_documents
end
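
The rewriting steps in evaluate can be traced on a small example. The following is a minimal sketch, not the library's own code: StubDocument, StubCorpus, rewrite_query, and evaluate_sketch are hypothetical names, and the token-based contains? is only an assumption about how the real Document#contains? behaves.

```ruby
# Hypothetical stand-ins for the library's Document and Corpus classes.
# Only the two methods that evaluate relies on are modelled here.
StubDocument = Struct.new(:text) do
  # Assumed behaviour: true when the lowercase term is one of the document's tokens.
  def contains?(term)
    text.downcase.split(/[^a-z0-9\-]+/).include?(term)
  end
end

StubCorpus = Struct.new(:documents)

# Mirrors the string rewriting performed by evaluate and strip_query.
def rewrite_query(query)
  query = query.gsub(/[^a-zA-Z0-9\s\(\)\-]/, " ")
  query = query.gsub(/\-\-+/, " ").gsub(/\s+\-\s+/, " ")
  query = " " + query + " "
  query = query.gsub("AND", "&&").gsub("OR", "||").gsub("NOT", "!")
  query.gsub(/[a-zA-Z0-9]+(-[a-zA-Z0-9]+)?/) do |q|
    " document.contains?(\"" + q.downcase + "\") "
  end
end

# Evaluates the rewritten expression once per document.
# The block parameter must be named `document` because the expression refers to it.
def evaluate_sketch(corpus, query)
  expression = rewrite_query(query)
  corpus.documents.select { |document| eval(expression) }
end

corpus = StubCorpus.new([
  StubDocument.new("Ruby on Rails guide"),
  StubDocument.new("Pure Ruby scripting"),
  StubDocument.new("Python basics")
])

puts rewrite_query("ruby AND NOT rails").squeeze(" ").strip
# prints: document.contains?("ruby") && ! document.contains?("rails")

puts evaluate_sketch(corpus, "ruby AND NOT rails").map(&:text)
# prints: Pure Ruby scripting
```

Because the operators are replaced with a plain substring gsub, a search term that contains AND, OR, or NOT in upper case would also be rewritten, so lowercase terms are the safer choice in queries.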

has_boolean_operators?(query)

@param query [String] the boolean query to be evaluated
@return [Boolean] whether query contains any boolean operators

# File lib/retrieval_lite/boolean_retrieval.rb, line 46
def self.has_boolean_operators?(query)
  /AND|OR|NOT/ === query
end
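
Since the check is a case-sensitive pattern match with no word boundaries, a few inputs may give surprising results. The sketch below copies the one-line body into a standalone top-level method for illustration; the sample queries are made up.

```ruby
# Standalone copy of the check shown above.
def has_boolean_operators?(query)
  /AND|OR|NOT/ === query
end

puts has_boolean_operators?("cats AND dogs")  # true
puts has_boolean_operators?("cats and dogs")  # false: the match is case-sensitive
puts has_boolean_operators?("ORANGE juice")   # true: "OR" matches inside "ORANGE"
```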

is_valid_expression?(query)

@note all other invalid expressions should be caught later on
@param query [String] the boolean query to be evaluated
@return [Boolean] false if an operator (AND, OR, NOT) directly precedes a closing parenthesis, true otherwise

# File lib/retrieval_lite/boolean_retrieval.rb, line 53
def self.is_valid_expression?(query)
  !(/(AND|OR|NOT)\s*\)/ === query)
end
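
To illustrate how narrow this check is, the sketch below copies the method body into a standalone top-level method and runs it on a few made-up queries. Only an operator placed directly before a closing parenthesis is rejected; other malformed queries pass here and are left to the later rescue in evaluate.

```ruby
# Standalone copy of the check shown above.
def is_valid_expression?(query)
  !(/(AND|OR|NOT)\s*\)/ === query)
end

puts is_valid_expression?("(cats AND dogs)")  # true
puts is_valid_expression?("(cats AND )")      # false: operator directly before ")"
puts is_valid_expression?("cats AND")         # true: a dangling operator is not caught here
```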

strip_query(query)

@param query [String] the boolean query to be evaluated
@return [String] the query with every character other than alphanumerics, parentheses, hyphens, and whitespace replaced by a space, and with stray hyphens removed

# File lib/retrieval_lite/boolean_retrieval.rb, line 59
def self.strip_query(query)
  # remove non-alphanumeric
  query = query.gsub(/[^a-zA-Z0-9\s\(\)\-]/, " ")

  # getting rid of stray hyphens
  query = query.gsub(/\-\-+/, " ").gsub(/\s+\-\s+/, " ")
end
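
A few made-up inputs show what the two substitutions do. The sketch below is a standalone copy of the method body, written as a top-level method for illustration.

```ruby
# Standalone copy of the stripping logic shown above.
def strip_query(query)
  # replace everything except alphanumerics, whitespace, parentheses, and hyphens
  query = query.gsub(/[^a-zA-Z0-9\s\(\)\-]/, " ")
  # drop runs of hyphens and hyphens surrounded by whitespace
  query.gsub(/\-\-+/, " ").gsub(/\s+\-\s+/, " ")
end

p strip_query("e-mail")       # "e-mail": a hyphen inside a term is kept
p strip_query("cats - dogs")  # "cats dogs": a free-standing hyphen is removed
p strip_query("dogs!")        # "dogs ": punctuation becomes a space
```

Punctuation is replaced by a space rather than deleted, so the result can contain extra whitespace; evaluate tolerates this because only the alphanumeric terms are rewritten afterwards.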