class BioDSL::FilterRrna

Filter rRNA sequences from the stream.

Description

filter_rrna utilizes sortmerna to identify and filter ribosomal RNA sequences from the stream. The sortmerna and indexdb_rna executables must be installed for filter_rrna to work.

Indexed reference files are produced using indexdb_rna.

For more about the sortmerna look here:

bioinfo.lifl.fr/RNA/sortmerna/

Usage

filter_rrna(ref_fasta: <file(s)>, ref_index: <file(s)>)

Options

Examples

To filter all reads matching the SILVA archaea 23S rRNA do:

BD.new.
read_fastq(input: "reads.fq").
filter_rrna(ref_fasta: ["silva-arc-23s-id98.fasta"],
            ref_index: ["silva-arc-23s-id98.fasta.idx*"]).
write_fastq(output: "clean.fq").
run

rubocop:disable ClassLength

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor the FilterRrna class.

@param options [Hash] Options hash. @option options [String,Array] Path(s) to reference FASTA files. @option options [String,Array] Path(s) to reference index files.

@return [FilterRrnas] Class instance of FilterRrnas.

# File lib/BioDSL/commands/filter_rrna.rb, line 79
def initialize(options)
  @options = options
  @filter  = Set.new

  aux_exist('sortmerna')
  check_options
end

Public Instance Methods

lmb() click to toggle source

Return the command lambda for filter_rrnas.

@return [Proc] Command lambda.

# File lib/BioDSL/commands/filter_rrna.rb, line 90
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    TmpDir.create('tmp', 'seq', 'out') do |tmp_file, seq_file, out_file|
      ref_files = process_ref_files
      process_input(input, tmp_file, seq_file)
      execute_sortmerna(ref_files, seq_file, out_file)
      parse_sortme_output(out_file)
      process_output(output, tmp_file)
    end
  end
end

Private Instance Methods

check_options() click to toggle source

Check options.

# File lib/BioDSL/commands/filter_rrna.rb, line 107
def check_options
  options_allowed(@options, :ref_fasta, :ref_index)
  options_files_exist(@options, :ref_fasta, :ref_index)
end
execute_sortmerna(ref_files, seq_file, out_file) click to toggle source

Execute ‘sortmerna’.

@param ref_files [String] Reference file string for sortmerna. @param seq_file [String] Path to intput file with reads. @param out_file [String] Path to output file.

@raise if execution of ‘sortmerna’ fails.

# File lib/BioDSL/commands/filter_rrna.rb, line 139
def execute_sortmerna(ref_files, seq_file, out_file)
  cmd = ['sortmerna']
  cmd << "--ref #{ref_files}"
  cmd << "--reads #{seq_file}"
  cmd << "--aligned #{out_file}"
  cmd << '--fastx'
  cmd << '-v' if BioDSL.verbose

  cmd_line = cmd.join(' ')

  $stderr.puts "Running command: #{cmd_line}" if BioDSL.verbose

  system(cmd_line)

  fail "command failed: #{cmd_line}" unless $CHILD_STATUS.success?
end
output_record(output, record, i) click to toggle source

Output a record to the output stream unless it contains sequence information that should be filtered.

@param output [Enumerator::Yielder] Output stream. @param record [Hash] BioDSL record. @param i [Integer] Index.

# File lib/BioDSL/commands/filter_rrna.rb, line 225
def output_record(output, record, i)
  if record.key? :SEQ
    unless @filter.include? i
      output << record
      @status[:records_out] += 1
      @status[:sequences_out] += 1
      @status[:residues_out] += record[:SEQ].length
    end
  else
    output << record
    @status[:records_out] += 1
  end
end
parse_sortme_output(out_file) click to toggle source

Parse the ‘sortmerna’ output file and add all sequence name indices to the filter set.

@param out_file [String] Path to output file.

# File lib/BioDSL/commands/filter_rrna.rb, line 160
def parse_sortme_output(out_file)
  BioDSL::Fasta.open("#{out_file}.fasta", 'r') do |ios|
    ios.each do |entry|
      @filter << entry.seq_name.to_i
    end
  end
end
process_input(input, tmp_file, seq_file) click to toggle source

Process input stream and serialize all records and write a temporary FASTA file.

@param input [Enumerator] Input stream. @param tmp_file [String] Path to tmp file for serialized records. @param seq_file [String] Path to tmp FASTA sequence file.

# File lib/BioDSL/commands/filter_rrna.rb, line 174
def process_input(input, tmp_file, seq_file)
  BioDSL::Fasta.open(seq_file, 'w') do |seq_io|
    File.open(tmp_file, 'wb') do |tmp_ios|
      BioDSL::Serializer.new(tmp_ios) do |s|
        input.each_with_index do |record, i|
          @status[:records_in] += 1

          s << record
          # FIXME: need << method
          seq_io.puts record2entry(record, i).to_fasta if record.key? :SEQ
        end
      end
    end
  end
end
process_output(output, tmp_file) click to toggle source

Process the serialized data and output all records, that does not match the filter, to the output stream.

@param output [Enumerator::Yielder] Output stream. @param tmp_file [String] Path to tmp file with serialized records.

# File lib/BioDSL/commands/filter_rrna.rb, line 209
def process_output(output, tmp_file)
  File.open(tmp_file, 'rb') do |ios|
    BioDSL::Serializer.new(ios) do |s|
      s.each_with_index do |record, i|
        output_record(output, record, i)
      end
    end
  end
end
process_ref_files() click to toggle source

Given reference index and fasta files in the options hash, process these into a string of the format read by ‘sortmerna’: fasta1,id1:fasta2,id2:…

@return [String] Reference file string for sortmerna.

# File lib/BioDSL/commands/filter_rrna.rb, line 116
def process_ref_files
  ref_index = @options[:ref_index]
  ref_fasta = @options[:ref_fasta]

  if ref_index.is_a? Array
    ref_index.map { |f| f.sub!(/\*$/, '') }
  else
    ref_index.sub!(/\*$/, '')
  end

  ref_fasta = [ref_fasta.split(',')] if ref_fasta.is_a? String
  ref_index = [ref_index.split(',')] if ref_index.is_a? String

  ref_fasta.zip(ref_index).map { |m| m.join(',') }.join(':')
end
record2entry(record, i) click to toggle source

Given a BioDSL record and an index create a new sequence entry object that is returned using the index as sequence name.

@param record [Hash] BioDSL record @param i [Integer] Index.

@return [BioDSL::Seq] Sequence entry.

# File lib/BioDSL/commands/filter_rrna.rb, line 197
def record2entry(record, i)
  entry = BioDSL::Seq.new(seq_name: i, seq: record[:SEQ])
  @status[:sequences_in] += 1
  @status[:residues_in] += entry.length
  entry
end