class BioDSL::AssemblePairs

Assemble ordered overlapping pair-end sequences in the stream.

assemble_pairs assembles overlapping pair-end sequences into single sequences that are output to the stream - the orginal sequences are no output. Assembly works by progressively considering all overlaps between the maximum considered overlap using the overlap_max option (default is the length of the shortest sequence) until the minimum required overlap supplied with the overlap_min option (default 1). For each overlap a percentage of mismatches can be allowed using the mismatch_percent option (default 20%).

Mismatches in the overlapping regions are resolved so that the residues with the highest quality score is used in the assembled sequence. The quality scores are averaged in the overlapping region. The sequence of the overlapping region is output in upper case and the remaining in lower case.

Futhermore, sequences must be in interleaved order in the stream - use read_fastq with input and input2 options for that.

The additional keys are added to records with assembled sequences:

Using the merge_unassembled option will merge any unassembled sequences taking into account reverse complementation of read2 if the reverse_complement option is true. Note that you probably want to set overlap_min to 1 before using merge_unassembled to improve chances of making an assembly before falling back to a simple merge.

Usage

assemble_pairs([mismatch_percent: <uint>[, overlap_min: <uint>
               [, overlap_max: <uint>[, reverse_complement: <bool>
               [, merge_unassembled: <bool>]]]]])

Options

Examples

If you have two pair-end sequence files with the Illumina data then you can assemble these using assemble_pairs like this:

BD.new.
read_fastq(input: "file1.fq", input2: "file2.fq).
assemble_pairs(reverse_complement: true).
run

rubocop:disable ClassLength

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor for the AssemblePairs class.

@param [Hash] options Options hash.

@option options [Integer] :mismatch_percent

Maximum allowed overlap mismatches in percent.

@option options [Integer] :overlap_min

Minimum length of overlap.

@option options [Integer] :overlap_max

Maximum length of overlap.

@option options [Boolean] :reverse_complement

Reverse-complment read2.

@option options [Boolean] :merge_unassembled

Merge read pairs that couldn't be assembled.

@option options [Boolean] :allow_unassembled

Output reads that couldn't be assembled.

@return [ReadFasta] Returns an instance of the class.

# File lib/BioDSL/commands/assemble_pairs.rb, line 112
def initialize(options)
  @options = options

  @overlap_sum = 0
  @hamming_sum = 0

  check_options
  defaults
end

Public Instance Methods

lmb() click to toggle source

Return a lambda for the read_fasta command.

@return [Proc] Returns the read_fasta command lambda.

# File lib/BioDSL/commands/assemble_pairs.rb, line 125
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each_slice(2) do |record1, record2|
      @status[:records_in] += 2

      if record2 && record1[:SEQ] && record2[:SEQ]
        assemble_pairs(record1, record2, output)
      else
        output_record(record1, output)
        output_record(record2, output) if record2
      end
    end

    calc_status
  end
end

Private Instance Methods

assemble_entries(entry1, entry2) click to toggle source

Assemble a pair of given entries if possible and return an assembled entry, or nil the entries could not be assembled.

@param entry1 [BioDSL::Seq] Sequence entry1. @param entry2 [BioDSL::Seq] Sequence entry2.

@return [BioDSL::Seq, nil] Returns Seq entry or nil.

# File lib/BioDSL/commands/assemble_pairs.rb, line 238
def assemble_entries(entry1, entry2)
  BioDSL::Assemble.pair(
    entry1,
    entry2,
    mismatches_max: @options[:mismatch_percent],
    overlap_min:    @options[:overlap_min],
    overlap_max:    @options[:overlap_max]
  )
end
assemble_pairs(record1, record2, output) click to toggle source

Assemble records with sequences and output to the stream

@param record1 [Hash] BioDSL record1. @param record2 [Hash] BioDSL record2. @param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/assemble_pairs.rb, line 181
def assemble_pairs(record1, record2, output)
  entry1, entry2 = records2entries(record1, record2)

  if overlap_possible?(entry1, entry2, @options[:overlap_min]) &&
     (assembled = assemble_entries(entry1, entry2))
    output_assembled(assembled, output)
  elsif @options[:merge_unassembled]
    output_merged(entry1, entry2, output)
  elsif @options[:allow_unassembled]
    output_entries(entry1, entry2, output)
  else
    @status[:unassembled] += 1
  end
end
assembled2record(assembled) click to toggle source

Convert a sequence entry to a BioPiece record with hamming distance and overlap length from the entry’s seq_name.

@param assembled [BioDSL::Seq] Merged sequence entry.

@return [Hash] BioDSL record.

# File lib/BioDSL/commands/assemble_pairs.rb, line 267
def assembled2record(assembled)
  new_record = assembled.to_bp

  if assembled.seq_name =~ /overlap=(\d+):hamming=(\d+)$/
    overlap = Regexp.last_match(1).to_i
    hamming = Regexp.last_match(2).to_i
    @overlap_sum += overlap
    @hamming_sum += hamming
    new_record[:OVERLAP_LEN]  = overlap
    new_record[:HAMMING_DIST] = hamming
  end

  new_record
end
calc_status() click to toggle source

Calculate additional status values.

# File lib/BioDSL/commands/assemble_pairs.rb, line 326
def calc_status
  assembled_percent =
    (100 * 2 * @status[:assembled].to_f / @status[:sequences_in]).round(2)
  @status[:assembled_percent] = assembled_percent
  @status[:overlap_mean]      =
    (@overlap_sum.to_f / @status[:records_out]).round(2)
  @status[:hamming_dist_mean] =
    (@hamming_sum.to_f / @status[:records_out]).round(2)
end
check_options() click to toggle source

Check the options.

# File lib/BioDSL/commands/assemble_pairs.rb, line 147
def check_options
  options_allowed(@options, :mismatch_percent, :overlap_min, :overlap_max,
                  :reverse_complement, :merge_unassembled,
                  :allow_unassembled)
  options_allowed_values(@options, reverse_complement: [true, false, nil])
  options_allowed_values(@options, merge_unassembled: [true, false, nil])
  options_allowed_values(@options, allow_unassembled: [true, false, nil])
  options_conflict(@options, allow_unassembled: :merge_unassembled)
  options_assert(@options, ':mismatch_percent >= 0')
  options_assert(@options, ':mismatch_percent <= 100')
  options_assert(@options, ':overlap_min > 0')
end
defaults() click to toggle source

Set default options.

# File lib/BioDSL/commands/assemble_pairs.rb, line 161
def defaults
  @options[:mismatch_percent] ||= 20
  @options[:overlap_min] ||= 1
end
entry2record(entry) click to toggle source

Converts a sequence entry to a BioPeice record.

@param entry [BioDSL::Seq] Sequence entry.

@return [Hash] BioDSL record.

# File lib/BioDSL/commands/assemble_pairs.rb, line 318
def entry2record(entry)
  record = entry.to_bp
  record[:OVERLAP_LEN]  = 0
  record[:HAMMING_DIST] = entry.length
  record
end
output_assembled(assembled, output) click to toggle source

Output assembled pairs to the output stream.

@param assembled [BioDSL::Seq] Assembled sequence entry. @param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/assemble_pairs.rb, line 252
def output_assembled(assembled, output)
  output << assembled2record(assembled)

  @status[:assembled] += 1
  @status[:records_out] += 1
  @status[:sequences_out] += 1
  @status[:residues_out] += assembled.length
end
output_entries(entry1, entry2, output) click to toggle source

Output unassembled entries to the stream.

@param entry1 [BioDSL::Seq] Entry1. @param entry2 [BioDSL::Seq] Entry2. @param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/assemble_pairs.rb, line 303
def output_entries(entry1, entry2, output)
  output << entry2record(entry1)
  output << entry2record(entry2)

  @status[:unassembled] += 2
  @status[:sequences_out] += 2
  @status[:residues_out] += entry1.length + entry2.length
  @status[:records_out] += 2
end
output_merged(entry1, entry2, output) click to toggle source

Merge and output entries to the stream.

@param entry1 [BioDSL::Seq] Entry1. @param entry2 [BioDSL::Seq] Entry2. @param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/assemble_pairs.rb, line 287
def output_merged(entry1, entry2, output)
  entry1 << entry2

  output << entry2record(entry1)

  @status[:unassembled] += 1
  @status[:sequences_out] += 1
  @status[:residues_out] += entry1.length
  @status[:records_out] += 1
end
output_record(record, output) click to toggle source

Output a record to the stream if a stram is provided.

@param record [Hash] BioDSL record to output. @param output [Enumerator::Yielder, nil] Output stream or nil.

# File lib/BioDSL/commands/assemble_pairs.rb, line 170
def output_record(record, output)
  return unless output
  output << record
  @status[:records_out] += 1
end
overlap_possible?(entry1, entry2, overlap_min) click to toggle source

Determines if an overlap between two given entries is possible considering the minimum overlap length.

@param entry1 [BioDSL::Seq] Sequence entry1. @param entry2 [BioDSL::Seq] Sequence entry2. @param overlap_min [Integer] Minimum overlap.

@return [Boolean] True if overlap possible otherwise false.

# File lib/BioDSL/commands/assemble_pairs.rb, line 227
def overlap_possible?(entry1, entry2, overlap_min)
  entry1.length >= overlap_min && entry2.length >= overlap_min
end
records2entries(record1, record2) click to toggle source

Given a pair of records convert these into sequence entries and reverse-complment if need be.

@param record1 [Hash] Record1. @param record2 [Hash] Record2.

@return [Array] Returns a tuple of sequence entries.

# File lib/BioDSL/commands/assemble_pairs.rb, line 203
def records2entries(record1, record2)
  entry1 = BioDSL::Seq.new_bp(record1)
  entry2 = BioDSL::Seq.new_bp(record2)
  entry1.type = :dna
  entry2.type = :dna

  if @options[:reverse_complement] && entry2.length > 0
    entry2.reverse!.complement!
  end

  @status[:sequences_in] += 2
  @status[:residues_in] += entry1.length + entry2.length

  [entry1, entry2]
end