class BioDSL::MeanScores

Calculate the mean or local mean of quality SCORES in the stream.

mean_scores calculates either the global or the local mean value of the quality SCORES in the stream. The quality SCORES are Phred-encoded as a character string.

The global (default) behaviour calculates SCORES_MEAN as the sum of all the scores divided by the length of the SCORES string.

The local mean SCORES_MEAN_LOCAL is calculated using a sliding window: a mean is computed for each window position, and the smallest of these means is returned.

Thus, low-quality records, with either a low overall mean quality or a local dip in quality, can be filtered using grab.
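The filtering idea can be sketched in plain Ruby on hashes shaped like the records mean_scores emits. This is not BioDSL's grab command: `filter_by_mean`, the record names, the 12.4 value and the threshold of 20 are all hypothetical and only illustrate the selection logic.

```ruby
# Illustrative records only; the names and the 12.4 value are made up.
records = [
  { SEQ_NAME: 'read_ok',  SCORES_MEAN: 34.58 },
  { SEQ_NAME: 'read_low', SCORES_MEAN: 12.4 }
]

# Keep records whose value under `key` is at least `threshold`.
# Records lacking the key are dropped.
def filter_by_mean(records, key, threshold)
  records.select { |record| record[key] && record[key] >= threshold }
end

kept = filter_by_mean(records, :SCORES_MEAN, 20)
puts kept.map { |record| record[:SEQ_NAME] }.inspect # prints ["read_ok"]
```

The same helper works with :SCORES_MEAN_LOCAL as the key when the local mean was calculated instead.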

Usage

mean_scores([local: <bool>[, window_size: <uint>]])

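Options

local: <bool> - Calculate the local mean score instead of the global one (default=false).
window_size: <uint> - Size of the sliding window; requires local (default=5).

Examples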

Consider the following FASTQ entry in the file test.fq:

@HWI-EAS157_20FFGAAXX:2:1:888:434
TTGGTCGCTCGCTCGACCTCAGATCAGACGTGG
+
BCDEFGHIIIIIII,,,,,IFFIIIIIIIIIII

The values of the scores in decimal are:

SCORES: 33;34;35;36;37;38;39;40;40;40;40;40;40;40;11;11;11;11;11;40;37;
        37;40;40;40;40;40;40;40;40;40;40;40;
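The decimal values above can be reproduced with a few lines of plain Ruby. This is a minimal sketch that assumes a Phred+33 offset, which is consistent with the listed values (for example, B is ASCII 66 and maps to 33); `decode_scores` is a hypothetical helper, not part of BioDSL.

```ruby
# Assumption: Phred+33 encoding, consistent with the values listed above.
PHRED_OFFSET = 33

# Decode a quality string into an array of integer scores.
def decode_scores(scores, offset = PHRED_OFFSET)
  scores.each_char.map { |char| char.ord - offset }
end

puts decode_scores('BCDEFGHIIIIIII,,,,,IFFIIIIIIIIIII').join(';')
```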

To calculate the mean score do:

BD.new.read_fastq(input: "test.fq").mean_scores.dump.run

{:SEQ_NAME=>"HWI-EAS157_20FFGAAXX:2:1:888:434",
 :SEQ=>"TTGGTCGCTCGCTCGACCTCAGATCAGACGTGG",
 :SEQ_LEN=>33,
 :SCORES=>"BCDEFGHIIIIIII,,,,,IFFIIIIIIIIIII",
 :SCORES_MEAN=>34.58}
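The value 34.58 can be checked by hand: the 33 scores sum to 1141, and 1141 / 33 rounds to 34.58. The sketch below does the same arithmetic in plain Ruby, assuming a Phred+33 offset; `scores_mean_sketch` is a hypothetical name and is not the library's own implementation.

```ruby
# Global mean of a Phred-encoded quality string, rounded to two decimals.
# Assumption: Phred+33 offset.
def scores_mean_sketch(scores, offset = 33)
  values = scores.each_char.map { |char| char.ord - offset }
  (values.sum.to_f / values.size).round(2)
end

puts scores_mean_sketch('BCDEFGHIIIIIII,,,,,IFFIIIIIIIIIII') # prints 34.58
```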

To calculate local means for a sliding window, do:

BD.new.read_fastq(input: "test.fq").mean_scores(local: true).dump.run

{:SEQ_NAME=>"HWI-EAS157_20FFGAAXX:2:1:888:434",
 :SEQ=>"TTGGTCGCTCGCTCGACCTCAGATCAGACGTGG",
 :SEQ_LEN=>33,
 :SCORES=>"BCDEFGHIIIIIII,,,,,IFFIIIIIIIIIII",
 :SCORES_MEAN_LOCAL=>11.0}

This indicates that the local minimum lies at the stretch of ,,,,, where (11+11+11+11+11) / 5 = 11.0
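The sliding-window calculation can be sketched in plain Ruby as follows. This is an illustration under assumptions, not BioDSL's implementation: it assumes a Phred+33 offset and a window that advances one position at a time, and `scores_mean_local_sketch` is a hypothetical name. Raising ArgumentError for strings shorter than the window is a choice made for this sketch only.

```ruby
# Smallest mean over all sliding windows of `window_size` consecutive scores.
# Assumptions: Phred+33 offset, window step of 1.
def scores_mean_local_sketch(scores, window_size = 5, offset = 33)
  values = scores.each_char.map { |char| char.ord - offset }
  raise ArgumentError, 'window larger than SCORES string' if window_size > values.size

  values.each_cons(window_size)
        .map { |window| window.sum.to_f / window_size }
        .min
        .round(2)
end

puts scores_mean_local_sketch('BCDEFGHIIIIIII,,,,,IFFIIIIIIIIIII') # prints 11.0
```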

Constants

STATS

Public Class Methods

new(options)

Constructor for MeanScores.

@param options [Hash] Options hash.
@option options [Boolean] :local
@option options [Fixnum] :window_size

@return [MeanScores] Class instance.

# File lib/BioDSL/commands/mean_scores.rb, line 100
def initialize(options)
  @options = options
  @min     = Float::INFINITY
  @max     = 0
  @sum     = 0
  @count   = 0

  check_options
  defaults
end

Public Instance Methods

lmb()

Return command lambda for mean_scores.

@return [Proc] Command lambda.

# File lib/BioDSL/commands/mean_scores.rb, line 114
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each do |record|
      @status[:records_in] += 1

      calc_mean(record) if record[:SCORES] && record[:SCORES].length > 0

      output << record

      @status[:records_out] += 1
    end

    @status[:mean_mean] = (@sum.to_f / @count).round(2)
  end
end

Private Instance Methods

calc_mean(record)

Calculate the mean score for a given record and update the running count, sum, min and max.

@param record [Hash] BioDSL record.

# File lib/BioDSL/commands/mean_scores.rb, line 151
def calc_mean(record)
  entry = BioDSL::Seq.new_bp(record)

  if @options[:local]
    mean = entry.scores_mean_local(@options[:window_size]).round(2)
    record[:SCORES_MEAN_LOCAL] = mean
  else
    mean = entry.scores_mean.round(2)
    record[:SCORES_MEAN] = mean
  end

  @sum += mean
  @status[:min_mean] = mean if mean < @status[:min_mean]
  @status[:max_mean] = mean if mean > @status[:max_mean]
  @count += 1
end

check_options()

Check options.

# File lib/BioDSL/commands/mean_scores.rb, line 135
def check_options
  options_allowed(@options, :local, :window_size)
  options_tie(@options, window_size: :local)
  options_allowed_values(@options, local: [true, false])
  options_assert(@options, ':window_size > 1')
end

defaults()

Set default options.

# File lib/BioDSL/commands/mean_scores.rb, line 143
def defaults
  @options[:window_size] ||= 5
end
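Taken together, check_options and defaults appear to enforce the rules sketched below in plain Ruby. This is one reading of the helper calls shown above, not the BioDSL helpers themselves: `prepare_mean_scores_options`, the use of ArgumentError and the error messages are all hypothetical. As in the constructor, the checks run before the default is applied, so the default window size does not itself require :local.

```ruby
DEFAULT_WINDOW_SIZE = 5 # same value that `defaults` assigns

# Hypothetical stand-in for check_options followed by defaults.
def prepare_mean_scores_options(options)
  unknown = options.keys - %i[local window_size]
  raise ArgumentError, "Disallowed option(s): #{unknown.join(', ')}" unless unknown.empty?

  if options.key?(:local) && ![true, false].include?(options[:local])
    raise ArgumentError, ':local must be true or false'
  end

  if options[:window_size]
    # Assumed meaning of the tie: :window_size is only valid together with :local.
    raise ArgumentError, ':window_size requires :local' unless options[:local]
    raise ArgumentError, ':window_size must be > 1' unless options[:window_size] > 1
  end

  prepared = options.dup
  prepared[:window_size] ||= DEFAULT_WINDOW_SIZE
  prepared
end

puts prepare_mean_scores_options(local: true).inspect
```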