class BioDSL::SliceAlign

Slice aligned sequences in the stream to obtain subsequences.

slice_align slices an alignment to extract subsequence from all sequences in the stream. This is done by either specifying a range or a set of primers that is then used to locate the range to be sliced from the sequences.

If a range is given with the slice option the potitions (0-based) must be corresponding the aligned sequence, i.e with gaps.

If a set of primers are given with the forward and reverse options (or the forward_rc and reverse_rc options) these primers are used to locate the matching positions in the first entry and this range is used to slice this and any following sequences. It is possible to specify fuzzy primer matching by using the max_mismatches, max_insertions and max_deletions options. Moreover, IUPAC ambigity codes are allowed.

It is also possible to specify a template file using the template_file option. The template file should be a file with one FASTA formatted sequence from the alignment (with gaps). If a template file and a range is specified the nucleotide positions from the ungapped template will be used. If both template file and primers are specified the template sequence is used for the primer search and the positions will be used for slicing.

The sequences in the stream are replaced with the sliced subsequences.

Usage

slice_align(<slice: <index>|<range>> |
            <forward: <string> | forward_rc: <string>>,
            <revese: <string> | reverse_rc: <string>
            [, max_mismatches: <uint>[, max_insertions: <uint>
            [, max_deletions: <uint>[, template_file: <file>]]]])

Options

Examples

Consider the following alignment in the file ‘test.fna`

>ID00000000
CCGCATACG-------CCCTGAGGGG----
>ID00000001
CCGCATGAT-------ACCTGAGGGT----
>ID00000002
CCGCATATACTCTTGACGCTAAAGCGTAGT
>ID00000003
CCGTATGTG-------CCCTTCGGGG----
>ID00000004
CCGGATAAG-------CCCTTACGGG----
>ID00000005
CCGGATAAG-------CCCTTACGGG----

We can slice the alignment with slice_align using a range:

BD.new.
read_fasta(input: "test.fna").
slice_align(slice: 14 .. 27).
dump.
run

{:SEQ_NAME=>"ID00000000", :SEQ=>"--CCCTGAGGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000001", :SEQ=>"--ACCTGAGGGT--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000002", :SEQ=>"GACGCTAAAGCGTA", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000003", :SEQ=>"--CCCTTCGGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000004", :SEQ=>"--CCCTTACGGG--", :SEQ_LEN=>14}
{:SEQ_NAME=>"ID00000005", :SEQ=>"--CCCTTACGGG--", :SEQ_LEN=>14}

Or we could slice the alignment using a set of primers:

BD.new.
read_fasta(input: "test.fna").
slice_align(forward: "CGCATACG", reverse: "GAGGGG", max_mismatches: 0,
            max_insertions: 0, max_deletions: 0).
dump.run

{:SEQ_NAME=>"ID00000000", :SEQ=>"CGCATACG-------CCCTGAGGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000001", :SEQ=>"CGCATGAT-------ACCTGAGGGT", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000002", :SEQ=>"CGCATATACTCTTGACGCTAAAGCG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000003", :SEQ=>"CGTATGTG-------CCCTTCGGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000004", :SEQ=>"CGGATAAG-------CCCTTACGGG", :SEQ_LEN=>25}
{:SEQ_NAME=>"ID00000005", :SEQ=>"CGGATAAG-------CCCTTACGGG", :SEQ_LEN=>25}

Now, if we have a template file with the following FASTA entry:

>template
CTGAATACG-------CCATTCGATGG---

and spefifying primers these will be matched to the template and the hit positions used for slicing:

BD.new.
read_fasta(input: "test.fna").
slice_align(template_file: "template.fna", forward: "GAATACG",
            reverse: "ATTCGAT", max_mismatches: 0, max_insertions: 0,
            max_deletions: 0).
dump.run

{:SEQ_NAME=>"ID00000000", :SEQ=>"GCATACG-------CCCTGAGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000001", :SEQ=>"GCATGAT-------ACCTGAGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000002", :SEQ=>"GCATATACTCTTGACGCTAAAGC", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000003", :SEQ=>"GTATGTG-------CCCTTCGGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000004", :SEQ=>"GGATAAG-------CCCTTACGG", :SEQ_LEN=>23}
{:SEQ_NAME=>"ID00000005", :SEQ=>"GGATAAG-------CCCTTACGG", :SEQ_LEN=>23}

Finally, specifying a template file and an interval the positions used for slicing will be the ungapped positions from the template sequence. This is useful if you are slicing 16S rRNA alignments and want the E.coli corresponding positions - simply use the E.coli sequence as template.

BD.new.
read_fasta(input: "test.fna").
slice_align(template_file: "template.fna", slice: 4 .. 14).
dump.run

{:SEQ_NAME=>"ID00000000", :SEQ=>"ATACG-------CCCTGA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000001", :SEQ=>"ATGAT-------ACCTGA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000002", :SEQ=>"ATATACTCTTGACGCTAA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000003", :SEQ=>"ATGTG-------CCCTTC", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000004", :SEQ=>"ATAAG-------CCCTTA", :SEQ_LEN=>18}
{:SEQ_NAME=>"ID00000005", :SEQ=>"ATAAG-------CCCTTA", :SEQ_LEN=>18}

rubocop: enable LineLength rubocop: disable ClassLength

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor for SliceAlign.

@param options [Hash] Options hash. @option options [Range,Integer] :slice @option options [String] :forward @option options [String] :forward_rc @option options [String] :reverse @option options [String] :reverse_rc @option options [Integer] :max_mismatches @option options [Integer] :max_insertions @option options [Integer] :max_deletions @option options [String] :template_file

@return [SliceAlign] Class instance.

# File lib/BioDSL/commands/slice_align.rb, line 182
def initialize(options)
  @options  = options
  @forward  = forward
  @reverse  = reverse
  @indels   = BioDSL::Seq::INDELS.sort.join
  @template = nil
  @slice    = options[:slice]

  check_options
  defaults
end

Public Instance Methods

lmb() click to toggle source

Return the comman lamba for slice_align.

@return [Proc] Command lambda.

# File lib/BioDSL/commands/slice_align.rb, line 197
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    parse_template_file
    setup_template_slice

    input.each do |record|
      @status[:records_in] += 1
      slice_align(record) if record.key? :SEQ
      output << record
      @status[:records_out] += 1
    end
  end
end

Private Instance Methods

check_options() click to toggle source

Check options.

# File lib/BioDSL/commands/slice_align.rb, line 216
def check_options
  options_allowed(@options, :slice, :forward, :forward_rc, :reverse,
                  :reverse_rc, :max_mismatches, :max_insertions,
                  :max_deletions, :template_file)
  options_conflict(@options, slice: :forward)
  options_files_exist(@options, :template_file)
  options_assert(@options, ':max_mismatches >= 0')
  options_assert(@options, ':max_insertions >= 0')
  options_assert(@options, ':max_deletions >= 0')
  options_assert(@options, ':max_mismatches <= 5')
  options_assert(@options, ':max_insertions <= 5')
  options_assert(@options, ':max_deletions <= 5')
end
defaults() click to toggle source

Setup default primer matching attributes.

# File lib/BioDSL/commands/slice_align.rb, line 231
def defaults
  @max_mis = @options[:max_mismatches] || 2
  @max_ins = @options[:max_insertions] || 1
  @max_del = @options[:max_deletions] || 1
end
find_match(pattern, entry) click to toggle source

Find pattern in entry and return match.

@param pattern [String] Search pattern. @param entry [BioDSL::Seq] Sequence to search.

@return [BioDSL::Seq::Match] Pattern match.

@raise [BioDSL::SeqError] If no match.

# File lib/BioDSL/commands/slice_align.rb, line 348
def find_match(pattern, entry)
  match = entry.patmatch(pattern,
                         max_mismatches: @max_mis,
                         max_insertions: @max_ins,
                         max_deletions:  @max_del)

  return match unless match.nil?

  fail BioDSL::SeqError, "pattern not found: #{pattern}"
end
forward() click to toggle source

Return the forward primer sequence and reverse-complement it if need be.

@return [String] Forward primer sequence.

# File lib/BioDSL/commands/slice_align.rb, line 287
def forward
  if @options[:forward_rc]
    @options[:forward] = Seq.new(seq: @options[:forward_rc], type: :dna).
                         reverse.complement.seq
  else
    @options[:forward]
  end
end
parse_template_file() click to toggle source

Parse FASTA file with one gapped template sequence if specified.

# File lib/BioDSL/commands/slice_align.rb, line 238
def parse_template_file
  return unless @options[:template_file]

  @template = BioDSL::Fasta.read(@options[:template_file]).first
end
reverse() click to toggle source

Return the reverse primer sequence and reverse-complement it if need be.

@return [String] Reverse primer sequence.

# File lib/BioDSL/commands/slice_align.rb, line 299
def reverse
  if @options[:reverse_rc]
    @options[:reverse] = Seq.new(seq: @options[:reverse_rc], type: :dna).
                         reverse.complement.seq
  else
    @options[:reverse]
  end
end
setup_slice(entry) click to toggle source

Usings primers to locate slice positions in entry.

@param entry [BioDSL::Seq] Sequence entry.

# File lib/BioDSL/commands/slice_align.rb, line 330
def setup_slice(entry)
  pos_index = PosIndex.new(entry, @indels)
  compact   = Seq.new(seq: entry.seq.dup.delete(@indels))

  fmatch = find_match(@forward, compact)
  rmatch = find_match(@reverse, compact)

  @slice = Range.new(pos_index[fmatch.start], pos_index[rmatch.stop])
end
setup_template_slice() click to toggle source

Set the slice positions using the template sequence.

# File lib/BioDSL/commands/slice_align.rb, line 245
def setup_template_slice
  return unless @template

  pos_index = PosIndex.new(@template, @indels)

  if @slice
    start, stop = setup_template_slice_range(pos_index)
  else
    start, stop = setup_template_slice_primers(pos_index)
  end

  @slice = Range.new(start, stop)
end
setup_template_slice_primers(pos_index) click to toggle source

Given a position index use primers to locate the slice positions in the template sequence.

@param pos_index [PosIndex] Position index.

# File lib/BioDSL/commands/slice_align.rb, line 274
def setup_template_slice_primers(pos_index)
  compact   = Seq.new(seq: @template.seq.dup.delete(@indels))
  fmatch    = find_match(@forward, compact)
  rmatch    = find_match(@reverse, compact)
  start     = pos_index[fmatch.start]
  stop      = pos_index[rmatch.stop]

  [start, stop]
end
setup_template_slice_range(pos_index) click to toggle source

Given a position index use slice positions to locate equivalent postitions in the template sequence.

@param pos_index [PosIndex] Position index.

# File lib/BioDSL/commands/slice_align.rb, line 263
def setup_template_slice_range(pos_index)
  start = pos_index[@slice.first]
  stop  = pos_index[@slice.last]

  [start, stop]
end
slice_align(record) click to toggle source

Slice sequence in given record accoding to slice positions.

@param record [Hash] BioDSL record.

# File lib/BioDSL/commands/slice_align.rb, line 311
def slice_align(record)
  entry = BioDSL::Seq.new_bp(record)

  @status[:sequences_in] += 1
  @status[:residues_in] += entry.length

  setup_slice(entry) unless @slice

  entry = entry[@slice]

  record.merge! entry.to_bp

  @status[:sequences_out] += 1
  @status[:residues_out] += entry.length
end