class BioDSL::DereplicateSeq

Dereplicate sequences in the stream.

dereplicate_seq removes all duplicate sequence records. Dereplicated sequences are output along with the count of replicates. Using the ignore_case option disables the default case sensitive sequence matching.

Usage

dereplicate_seq([ignore_case: <bool>])

Options

Examples

Consider the following FASTA file test.fna:

>test1
ATGC
>test2
ATGC
>test3
GCAT

To dereplicate all sequences we use read_fasta and dereplicate_seq:

BD.new.read_fasta(input: "test.fna").dereplicate_seq.dump.run

{:SEQ_NAME=>"test1", :SEQ=>"ATGC", :SEQ_LEN=>4, :SEQ_COUNT=>2}
{:SEQ_NAME=>"test3", :SEQ=>"GCAT", :SEQ_LEN=>4, :SEQ_COUNT=>1}

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor for the DereplicateSeq class.

@param options [Hash] Options hash. @option options [Boolean] :ignore_case Ignore sequence case.

@return [DereplicateSeq] Class intance.

# File lib/BioDSL/commands/dereplicate_seq.rb, line 70
def initialize(options)
  @options = options
  @lookup  = {}

  check_options
end

Public Instance Methods

lmb() click to toggle source

Return the command lambda for DereplicateSeq.

@return [Proc] Command lambda.

# File lib/BioDSL/commands/dereplicate_seq.rb, line 80
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    TmpDir.create('dereplicate_seq') do |tmp_file, _|
      process_input(input, output, tmp_file)
      process_output(output, tmp_file)
    end
  end
end

Private Instance Methods

check_options() click to toggle source

Check options.

# File lib/BioDSL/commands/dereplicate_seq.rb, line 94
def check_options
  options_allowed(@options, :ignore_case)
  options_allowed_values(@options, ignore_case: [nil, true, false])
end
process_input(input, output, tmp_file) click to toggle source

Process input stream and serialize all records with sequence information. All other records are emitted to the output stream.

@param input [Enumerator] Input stream. @param output [Enumerator::Yielder] Output stream. @param tmp_file [String] Path to temporary file.

# File lib/BioDSL/commands/dereplicate_seq.rb, line 105
def process_input(input, output, tmp_file)
  File.open(tmp_file, 'wb') do |ios|
    BioDSL::Serializer.new(ios) do |s|
      input.each do |record|
        @status[:records_in] += 1

        if record.key? :SEQ
          serialize(record, s)
        else
          output << record

          @status[:records_out] += 1
        end
      end
    end
  end
end
process_output(output, tmp_file) click to toggle source

Read all serialized records from tmp file and emit to the output stream along with the sequence count.

@param output [Enumerator::Yielder] Output stream. @param tmp_file [String] Path to tmp file.

# File lib/BioDSL/commands/dereplicate_seq.rb, line 150
def process_output(output, tmp_file)
  File.open(tmp_file, 'rb') do |ios|
    BioDSL::Serializer.new(ios) do |s|
      s.each do |record|
        seq = record[:SEQ].dup
        @status[:residues_out] += seq.length
        seq.downcase! if @options[:ignore_case]
        record[:SEQ_COUNT] = @lookup[seq.to_sym]

        output << record

        @status[:records_out] += 1
        @status[:sequences_out] += 1
      end
    end
  end
end
serialize(record, s) click to toggle source

Serialize records with unique sequences and keep a count of how many time each sequence was encountered.

@param record [Hash] BioDSL record. @param s [BioDSL::Serializer] Serializer.

# File lib/BioDSL/commands/dereplicate_seq.rb, line 128
def serialize(record, s)
  @status[:sequences_in] += 1

  seq = record[:SEQ].dup
  @status[:residues_in] += seq.length
  seq.downcase! if @options[:ignore_case]
  key = seq.to_sym

  unless @lookup[key]
    s << record

    @lookup[key] = 0
  end

  @lookup[key] += 1
end