class BioDSL::AssembleSeqSpades

Assemble sequences the stream using SPAdes.

assemble_seq_spades is a wrapper around the single prokaryotic genome assembler SPAdes:

bioinf.spbau.ru/spades

Any records containing sequence information will be included in the assembly, but only the assembled contig sequences will be output to the stream.

The sequences records may contain qualty scores, and if the sequence names indicates that the sequence order is inter-leaved paired-end assembly will be performed.

Usage

assemble_seq_spades([careful: <bool>[, cpus: <uint>[, kmers: <list>]]])

Options

Examples

If you have two pair-end sequence files with the Illumina data then you can assemble these using assemble_seq_spades like this:

BD.new.
read_fastq(input: "file1.fq", input2: "file2.fq).
assemble_seq_spades(kmers: [55,77,99,127]).
write_fasta(output: "contigs.fna").
run

rubocop:disable ClassLength

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor for the AssembleSeqSpades class.

@param [Hash] options Options hash.

@option options [Boolean] :careful

Flag indicating use of careful assembly.

@option options [Array] :kmers

List of kmers to use.

@option options [Integer] :cpus

CPUs to use.

@return [AssembleSeqSpades] Returns an instance of the class.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 88
def initialize(options)
  @options = options
  @lengths = []
  @type    = nil

  aux_exist('spades.py')
  check_options
  defaults
end

Public Instance Methods

lmb() click to toggle source

Return a lambda for the AssembleSeqSpades command.

@return [Proc] Returns the command lambda.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 101
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    TmpDir.create('reads.fq', 'reads.fa') do |in_fq, in_fa, tmp_dir|
      process_input(in_fq, in_fa, input, output)
      input_file = (@type == :fastq) ? in_fq : in_fa
      execute_spades(input_file, tmp_dir)
      process_output(output, File.join(tmp_dir, 'scaffolds.fasta'))
    end

    calc_n50(status)
  end
end

Private Instance Methods

calc_n50(status) click to toggle source

Calculate the n50 and add to the status.

{en.wikipedia.org/wiki/N50_statistic}

@param status [Hash] Status hash.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 233
def calc_n50(status)
  @lengths.sort!
  @lengths.reverse!

  status[:contig_max] = @lengths.first
  status[:contig_min] = @lengths.last

  count = 0

  @lengths.each do |length|
    count += length

    if count >= status[:residues_out] * 0.50
      status[:contig_n50] = length
      break
    end
  end
end
check_options() click to toggle source

Check the options.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 119
def check_options
  options_allowed(@options, :careful, :cpus, :kmers)
  options_allowed_values(@options, careful: [true, false, nil])
  options_assert(@options, ':cpus >= 1')
  options_assert(@options, ":cpus <= #{BioDSL::Config::CORES_MAX}")
end
compile_command(input_file, tmp_dir) click to toggle source

Compile the spades command.

@param input_file [String] Path to input file. @param tmp_dir [String] Path to temp dir.

@return [String] A command string for executing Spades.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 198
def compile_command(input_file, tmp_dir)
  cmd = []
  cmd << 'spades.py'
  cmd << "--12 #{input_file}"
  cmd << '--only-assembler'
  cmd << '--careful'                        if @options[:careful]
  cmd << "-k #{@options[:kmers].join(',')}" if @options[:kmers]
  cmd << "-t #{@options[:cpus]}"
  cmd << "-o #{tmp_dir}"

  cmd.join(' ')
end
defaults() click to toggle source

Set default options.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 127
def defaults
  @options[:cpus] ||= 1
end
execute_spades(input_file, tmp_dir) click to toggle source

Execute spades using a system call.

@param input_file [String] Path to input file. @param tmp_dir [String] Path to temp dir.

@raise if command fails.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 179
def execute_spades(input_file, tmp_dir)
  cmd_line = compile_command(input_file, tmp_dir)

  if BioDSL.verbose
    $stderr.puts cmd_line
    system(cmd_line)
  else
    system(cmd_line + ' > /dev/null 2>&1')
  end

  fail "Command failed: #{cmd_line}" unless $CHILD_STATUS.success?
end
process_input(in_fq, in_fa, input, output) click to toggle source

Process input stream and write all sequence records to a temporary file.

@param in_fq [String] Path to FASTQ temp file. @param in_fa [String] Path to FASTA temp file. @param input [Enumerator] Input stream. @param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 137
def process_input(in_fq, in_fa, input, output)
  BioDSL::Fastq.open(in_fq, 'w') do |io_fq|
    BioDSL::Fasta.open(in_fa, 'w') do |io_fa|
      input.each do |record|
        @status[:records_in] += 1

        if record.key? :SEQ
          write_sequence(io_fq, io_fa, record)
        else
          @status[:records_out] += 1
          output.puts record
        end
      end
    end
  end
end
process_output(output, output_file) click to toggle source

Process the spades output and emit the contigs to the output stream.

@param output [Enumerator::Yielder] Output stream @param output_file [String] Path to output FASTA file with contigs.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 215
def process_output(output, output_file)
  BioDSL::Fasta.open(output_file) do |ios|
    ios.each do |entry|
      output << entry.to_bp
      @status[:records_out] += 1
      @status[:sequences_out] += 1
      @status[:residues_out] += entry.length

      @lengths << entry.length
    end
  end
end
write_sequence(io_fq, io_fa, record) click to toggle source

Write a sequence record to the temporary file.

@param io_fq [BioDSL::Fastq::IO] FASTQ IO stream. @param io_fa [BioDSL::Fasta::IO] FASTA IO stream. @param record [Hash] BioPiece record with sequence.

# File lib/BioDSL/commands/assemble_seq_spades.rb, line 159
def write_sequence(io_fq, io_fa, record)
  entry = BioDSL::Seq.new_bp(record)

  @status[:sequences_in] += 1
  @status[:residues_in] += entry.length

  if entry.qual
    @type = :fastq
    io_fq.puts entry.to_fastq
  else
    io_fa.puts entry.to_fasta
  end
end