module BioDSL::BackTrack

Module containing code to locate nucleotide patterns in sequences allowing for ambiguity codes and a given maximum mismatches, insertions, and deletions. The pattern match engine is based on a backtrack algorithm. Insertions are nucleotides found in the pattern but not in the sequence. Deletions are nucleotides found in the sequence but not in the pattern. Algorithm based on code kindly provided by j_random_hacker @ Stackoverflow: stackoverflow.com/questions/7557017/approximate-string-matching-using-backtracking/

Constants

MAX_DEL
MAX_INS
MAX_MIS
OK_PATTERN

Public Instance Methods

patmatch(pattern, options = {}) { |m| ... } click to toggle source

str.patmatch(pattern[, options])
-> Match
str.patmatch(pattern[, options]) { |match|
  block
}
-> Match

options:
  :start
  :stop
  :max_mismatches
  :max_insertions
  :max_deletions

Method to iterate through a sequence from a given start position to the end of the sequence or to a given stop position to locate a pattern allowing for a maximum number of mismatches, insertions, and deletions. Insertions are nucleotides found in the pattern but not in the sequence. Deletions are nucleotides found in the sequence but not in the pattern.

# File lib/BioDSL/seq/backtrack.rb, line 68
def patmatch(pattern, options = {})
  options[:start] ||= 0
  options[:stop] ||= length - 1
  options[:max_mismatches] ||= 0
  options[:max_insertions] ||= 0
  options[:max_deletions] ||= 0

  patscan(pattern, options) do |m|
    if block_given?
      yield m
    else
      return m
    end
  end
end
patscan(pattern, options = {}) { |match| ... } click to toggle source

str.patscan(pattern[, options])
-> Array
str.patscan(pattern[, options]) { |match|
  block
}
-> Match

options:
  :start
  :stop
  :max_mismatches
  :max_insertions
  :max_deletions

Method to iterate through a sequence from a given start position to the end of the sequence or to a given stop position to locate a pattern allowing for a maximum number of mismatches, insertions, and deletions. Insertions are nucleotides found in the pattern but not in the sequence. Deletions are nucleotides found in the sequence but not in the pattern. Matches found in block context return the Match object. Otherwise matches are returned in an Array of Match objects.

# File lib/BioDSL/seq/backtrack.rb, line 107
def patscan(pattern, options = {})
  options[:start] ||= 0
  options[:stop] ||= length - 1
  options[:max_mismatches] ||= 0
  options[:max_insertions] ||= 0
  options[:max_deletions] ||= 0

  unless pattern.downcase =~ OK_PATTERN
    fail BackTrackError, "Bad pattern: #{pattern}"
  end

  unless (0...length).include? options[:start]
    fail BackTrackError, "start: #{options[:start]} out of range " \
                         "(0..#{length - 1})"
  end

  unless (0...length).include? options[:stop]
    fail BackTrackError, "stop: #{options[:stop]} out of range " \
                         "(0..#{length - 1})"
  end

  unless (0..MAX_MIS).include? options[:max_mismatches]
    fail BackTrackError, "max_mismatches: #{options[:max_mismatches]} " \
                         "out of range (0..#{MAX_MIS})"
  end

  unless (0..MAX_INS).include? options[:max_insertions]
    fail BackTrackError, "max_insertions: #{options[:max_insertions]} " \
                         "out of range (0..#{MAX_INS})"
  end

  unless (0..MAX_DEL).include? options[:max_deletions]
    fail BackTrackError, "max_deletions: #{options[:max_deletions]} " \
                         "out of range (0..#{MAX_DEL})"
  end

  matches = []

  while (result = scan_C(@seq, pattern, options[:start], options[:stop],
                         options[:max_mismatches], options[:max_insertions],
                         options[:max_deletions]))
    match = Match.new(result.first, result.last,
                      @seq[result.first...result.first + result.last])

    if block_given?
      yield match
    else
      matches << match
    end

    options[:start] = result.first + 1
  end

  return matches unless block_given?
end