class BioDSL::DereplicateSeq
Dereplicate sequences in the stream.¶ ↑
dereplicate_seq
removes all duplicate sequence records. Dereplicated sequences are output along with the count of replicates. Using the ignore_case
option disables the default case sensitive sequence matching.
Usage¶ ↑
dereplicate_seq([ignore_case: <bool>])
Options¶ ↑
-
ignore_case: <bool> - Ignore sequence case.
Examples¶ ↑
Consider the following FASTA file test.fna:
>test1 ATGC >test2 ATGC >test3 GCAT
To dereplicate all sequences we use read_fasta
and dereplicate_seq
:
BD.new.read_fasta(input: "test.fna").dereplicate_seq.dump.run {:SEQ_NAME=>"test1", :SEQ=>"ATGC", :SEQ_LEN=>4, :SEQ_COUNT=>2} {:SEQ_NAME=>"test3", :SEQ=>"GCAT", :SEQ_LEN=>4, :SEQ_COUNT=>1}
Constants
- STATS
Public Class Methods
Constructor for the DereplicateSeq
class.
@param options [Hash] Options hash. @option options [Boolean] :ignore_case Ignore sequence case.
@return [DereplicateSeq] Class intance.
# File lib/BioDSL/commands/dereplicate_seq.rb, line 70 def initialize(options) @options = options @lookup = {} check_options end
Public Instance Methods
Return the command lambda for DereplicateSeq
.
@return [Proc] Command
lambda.
# File lib/BioDSL/commands/dereplicate_seq.rb, line 80 def lmb lambda do |input, output, status| status_init(status, STATS) TmpDir.create('dereplicate_seq') do |tmp_file, _| process_input(input, output, tmp_file) process_output(output, tmp_file) end end end
Private Instance Methods
Check options.
# File lib/BioDSL/commands/dereplicate_seq.rb, line 94 def check_options options_allowed(@options, :ignore_case) options_allowed_values(@options, ignore_case: [nil, true, false]) end
Process input stream and serialize all records with sequence information. All other records are emitted to the output stream.
@param input [Enumerator] Input stream. @param output [Enumerator::Yielder] Output stream. @param tmp_file [String] Path to temporary file.
# File lib/BioDSL/commands/dereplicate_seq.rb, line 105 def process_input(input, output, tmp_file) File.open(tmp_file, 'wb') do |ios| BioDSL::Serializer.new(ios) do |s| input.each do |record| @status[:records_in] += 1 if record.key? :SEQ serialize(record, s) else output << record @status[:records_out] += 1 end end end end end
Read all serialized records from tmp file and emit to the output stream along with the sequence count.
@param output [Enumerator::Yielder] Output stream. @param tmp_file [String] Path to tmp file.
# File lib/BioDSL/commands/dereplicate_seq.rb, line 150 def process_output(output, tmp_file) File.open(tmp_file, 'rb') do |ios| BioDSL::Serializer.new(ios) do |s| s.each do |record| seq = record[:SEQ].dup @status[:residues_out] += seq.length seq.downcase! if @options[:ignore_case] record[:SEQ_COUNT] = @lookup[seq.to_sym] output << record @status[:records_out] += 1 @status[:sequences_out] += 1 end end end end
Serialize records with unique sequences and keep a count of how many time each sequence was encountered.
@param record [Hash] BioDSL
record. @param s [BioDSL::Serializer] Serializer
.
# File lib/BioDSL/commands/dereplicate_seq.rb, line 128 def serialize(record, s) @status[:sequences_in] += 1 seq = record[:SEQ].dup @status[:residues_in] += seq.length seq.downcase! if @options[:ignore_case] key = seq.to_sym unless @lookup[key] s << record @lookup[key] = 0 end @lookup[key] += 1 end