class BioDSL::FilterRrna
Filter rRNA sequences from the stream.¶ ↑
Description
filter_rrna
utilizes sortmerna
to identify and filter ribosomal RNA sequences from the stream. The sortmerna
and indexdb_rna
executables must be installed for filter_rrna
to work.
Indexed reference files are produced using indexdb_rna
.
For more about the sortmerna look here:
bioinfo.lifl.fr/RNA/sortmerna/
Usage¶ ↑
filter_rrna(ref_fasta: <file(s)>, ref_index: <file(s)>)
Options¶ ↑
-
ref_fasta <file(s)> - One or more reference FASTA files.
-
ref_index <file(s)> - One or more index reference files.
Examples¶ ↑
To filter all reads matching the SILVA archaea 23S rRNA do:
BD.new. read_fastq(input: "reads.fq"). filter_rrna(ref_fasta: ["silva-arc-23s-id98.fasta"], ref_index: ["silva-arc-23s-id98.fasta.idx*"]). write_fastq(output: "clean.fq"). run
rubocop:disable ClassLength
Constants
- STATS
Public Class Methods
Constructor the FilterRrna
class.
@param options [Hash] Options hash. @option options [String,Array] Path(s) to reference FASTA files. @option options [String,Array] Path(s) to reference index files.
@return [FilterRrnas] Class instance of FilterRrnas.
# File lib/BioDSL/commands/filter_rrna.rb, line 79 def initialize(options) @options = options @filter = Set.new aux_exist('sortmerna') check_options end
Public Instance Methods
Return the command lambda for filter_rrnas.
@return [Proc] Command
lambda.
# File lib/BioDSL/commands/filter_rrna.rb, line 90 def lmb lambda do |input, output, status| status_init(status, STATS) TmpDir.create('tmp', 'seq', 'out') do |tmp_file, seq_file, out_file| ref_files = process_ref_files process_input(input, tmp_file, seq_file) execute_sortmerna(ref_files, seq_file, out_file) parse_sortme_output(out_file) process_output(output, tmp_file) end end end
Private Instance Methods
Check options.
# File lib/BioDSL/commands/filter_rrna.rb, line 107 def check_options options_allowed(@options, :ref_fasta, :ref_index) options_files_exist(@options, :ref_fasta, :ref_index) end
Execute ‘sortmerna’.
@param ref_files [String] Reference file string for sortmerna. @param seq_file [String] Path to intput file with reads. @param out_file [String] Path to output file.
@raise if execution of ‘sortmerna’ fails.
# File lib/BioDSL/commands/filter_rrna.rb, line 139 def execute_sortmerna(ref_files, seq_file, out_file) cmd = ['sortmerna'] cmd << "--ref #{ref_files}" cmd << "--reads #{seq_file}" cmd << "--aligned #{out_file}" cmd << '--fastx' cmd << '-v' if BioDSL.verbose cmd_line = cmd.join(' ') $stderr.puts "Running command: #{cmd_line}" if BioDSL.verbose system(cmd_line) fail "command failed: #{cmd_line}" unless $CHILD_STATUS.success? end
Output a record to the output stream unless it contains sequence information that should be filtered.
@param output [Enumerator::Yielder] Output stream. @param record [Hash] BioDSL
record. @param i [Integer] Index.
# File lib/BioDSL/commands/filter_rrna.rb, line 225 def output_record(output, record, i) if record.key? :SEQ unless @filter.include? i output << record @status[:records_out] += 1 @status[:sequences_out] += 1 @status[:residues_out] += record[:SEQ].length end else output << record @status[:records_out] += 1 end end
Parse the ‘sortmerna’ output file and add all sequence name indices to the filter set.
@param out_file [String] Path to output file.
# File lib/BioDSL/commands/filter_rrna.rb, line 160 def parse_sortme_output(out_file) BioDSL::Fasta.open("#{out_file}.fasta", 'r') do |ios| ios.each do |entry| @filter << entry.seq_name.to_i end end end
Process input stream and serialize all records and write a temporary FASTA file.
@param input [Enumerator] Input stream. @param tmp_file [String] Path to tmp file for serialized records. @param seq_file [String] Path to tmp FASTA sequence file.
# File lib/BioDSL/commands/filter_rrna.rb, line 174 def process_input(input, tmp_file, seq_file) BioDSL::Fasta.open(seq_file, 'w') do |seq_io| File.open(tmp_file, 'wb') do |tmp_ios| BioDSL::Serializer.new(tmp_ios) do |s| input.each_with_index do |record, i| @status[:records_in] += 1 s << record # FIXME: need << method seq_io.puts record2entry(record, i).to_fasta if record.key? :SEQ end end end end end
Process the serialized data and output all records, that does not match the filter, to the output stream.
@param output [Enumerator::Yielder] Output stream. @param tmp_file [String] Path to tmp file with serialized records.
# File lib/BioDSL/commands/filter_rrna.rb, line 209 def process_output(output, tmp_file) File.open(tmp_file, 'rb') do |ios| BioDSL::Serializer.new(ios) do |s| s.each_with_index do |record, i| output_record(output, record, i) end end end end
Given reference index and fasta files in the options hash, process these into a string of the format read by ‘sortmerna’: fasta1,id1:fasta2,id2:…
@return [String] Reference file string for sortmerna.
# File lib/BioDSL/commands/filter_rrna.rb, line 116 def process_ref_files ref_index = @options[:ref_index] ref_fasta = @options[:ref_fasta] if ref_index.is_a? Array ref_index.map { |f| f.sub!(/\*$/, '') } else ref_index.sub!(/\*$/, '') end ref_fasta = [ref_fasta.split(',')] if ref_fasta.is_a? String ref_index = [ref_index.split(',')] if ref_index.is_a? String ref_fasta.zip(ref_index).map { |m| m.join(',') }.join(':') end
Given a BioDSL
record and an index create a new sequence entry object that is returned using the index as sequence name.
@param record [Hash] BioDSL
record @param i [Integer] Index.
@return [BioDSL::Seq] Sequence entry.
# File lib/BioDSL/commands/filter_rrna.rb, line 197 def record2entry(record, i) entry = BioDSL::Seq.new(seq_name: i, seq: record[:SEQ]) @status[:sequences_in] += 1 @status[:residues_in] += entry.length entry end