module BioDSL::BackTrack
Module containing code to locate nucleotide patterns in sequences allowing for ambiguity codes and a given maximum mismatches, insertions, and deletions. The pattern match engine is based on a backtrack algorithm. Insertions are nucleotides found in the pattern but not in the sequence. Deletions are nucleotides found in the sequence but not in the pattern. Algorithm based on code kindly provided by j_random_hacker @ Stackoverflow: stackoverflow.com/questions/7557017/approximate-string-matching-using-backtracking/
Constants
- MAX_DEL
- MAX_INS
- MAX_MIS
- OK_PATTERN
Public Instance Methods
str.patmatch(pattern[, options]) -> Match str.patmatch(pattern[, options]) { |match| block } -> Match options: :start :stop :max_mismatches :max_insertions :max_deletions
Method to iterate through a sequence from a given start position to the end of the sequence or to a given stop position to locate a pattern allowing for a maximum number of mismatches, insertions, and deletions. Insertions are nucleotides found in the pattern but not in the sequence. Deletions are nucleotides found in the sequence but not in the pattern.
# File lib/BioDSL/seq/backtrack.rb, line 68 def patmatch(pattern, options = {}) options[:start] ||= 0 options[:stop] ||= length - 1 options[:max_mismatches] ||= 0 options[:max_insertions] ||= 0 options[:max_deletions] ||= 0 patscan(pattern, options) do |m| if block_given? yield m else return m end end end
str.patscan(pattern[, options]) -> Array str.patscan(pattern[, options]) { |match| block } -> Match options: :start :stop :max_mismatches :max_insertions :max_deletions
Method to iterate through a sequence from a given start position to the end of the sequence or to a given stop position to locate a pattern allowing for a maximum number of mismatches, insertions, and deletions. Insertions are nucleotides found in the pattern but not in the sequence. Deletions are nucleotides found in the sequence but not in the pattern. Matches found in block context return the Match
object. Otherwise matches are returned in an Array
of Match
objects.
# File lib/BioDSL/seq/backtrack.rb, line 107 def patscan(pattern, options = {}) options[:start] ||= 0 options[:stop] ||= length - 1 options[:max_mismatches] ||= 0 options[:max_insertions] ||= 0 options[:max_deletions] ||= 0 unless pattern.downcase =~ OK_PATTERN fail BackTrackError, "Bad pattern: #{pattern}" end unless (0...length).include? options[:start] fail BackTrackError, "start: #{options[:start]} out of range " \ "(0..#{length - 1})" end unless (0...length).include? options[:stop] fail BackTrackError, "stop: #{options[:stop]} out of range " \ "(0..#{length - 1})" end unless (0..MAX_MIS).include? options[:max_mismatches] fail BackTrackError, "max_mismatches: #{options[:max_mismatches]} " \ "out of range (0..#{MAX_MIS})" end unless (0..MAX_INS).include? options[:max_insertions] fail BackTrackError, "max_insertions: #{options[:max_insertions]} " \ "out of range (0..#{MAX_INS})" end unless (0..MAX_DEL).include? options[:max_deletions] fail BackTrackError, "max_deletions: #{options[:max_deletions]} " \ "out of range (0..#{MAX_DEL})" end matches = [] while (result = scan_C(@seq, pattern, options[:start], options[:stop], options[:max_mismatches], options[:max_insertions], options[:max_deletions])) match = Match.new(result.first, result.last, @seq[result.first...result.first + result.last]) if block_given? yield match else matches << match end options[:start] = result.first + 1 end return matches unless block_given? end