class BioDSL::ClassifySeqMothur

Run classify_seq_mothur on sequences in the stream.

This is a wrapper for the mothur command +classify.seqs()+. Basically, it classifies sequences in the stream given a database file and a taxonomy file which can be downloaded here:

www.mothur.org/w/images/5/59/Trainset9_032012.pds.zip

Please refer to the manual:

www.mothur.org/wiki/Classify.seqs

Mothur must be installed for classify_seq_mothurs to work. Read more here:

www.mothur.org/

Usage

classify_seq_mothur(<database: <file>>, <taxonomy: <file>>
                    [, confidence: <uint>[, cpus: <uint>]])

Options

Examples

To classify a bunch of OTU sequences in the file otus.fna we do:

database = "trainset9_032012.pds.fasta"
taxonomy = "trainset9_032012.pds.tax"

BD.new.
read_fasta(input: "otus.fna").
classify_seq_mothur(database: database, taxonomy: taxonomy).
grab(exact: true, keys: :RECORD_TYPE, select: "taxonomy").
write_table(output: "classified.tab", header: true, force: true,
            skip: [:RECORD_TYPE]).
run

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor for ClassifySeqMothur.

@param options [Hash] Options hash. @option options [String] :database Path to database file. @option options [String] :taxonomy Path to taxonomy file. @option options [Integer] :confidence Confidence cutoff. @option options [Integer] :cpus Number of CPUs to use.

@return [ClassifySeqMothur] Instance of class.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 89
def initialize(options)
  @options = options

  aux_exist('mothur')
  check_options
  defaults
end

Public Instance Methods

lmb() click to toggle source

Command lambda for ClassifySeqMothur.

@return [Proc] Lambda for the command.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 100
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    TmpDir.create('input.fasta') do |tmp_in, tmp_dir|
      process_input(input, output, tmp_in)
      run_mothur(tmp_dir, tmp_in)
      tmp_out = Dir.glob("#{tmp_dir}/input.*.taxonomy").first
      process_output(output, tmp_out)
    end
  end
end

Private Instance Methods

check_options() click to toggle source

Check options.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 116
def check_options
  options_allowed(@options, :database, :taxonomy, :confidence, :cpus)
  options_required(@options, :database, :taxonomy)
  options_files_exist(@options, :database, :taxonomy)
  options_assert(@options, ':confidence > 0')
  options_assert(@options, ':confidence <= 100')
  options_assert(@options, ':cpus >= 1')
  options_assert(@options, ":cpus <= #{BioDSL::Config::CORES_MAX}")

  defaults
end
confidence_filter(record) click to toggle source

Filter taxonomic leveles based on the confidence.

@param record [Hash] BioDSL record with taxonomy.

@return [String] Return taxonomic string.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 210
def confidence_filter(record)
  new_levels = []

  record[:TAXONOMY].split(';').each do |level|
    next unless level =~ /^([^(]+)\((\d+)\)$/
    name       = Regexp.last_match(1)
    confidence = Regexp.last_match(2).to_i

    if confidence >= @options[:confidence]
      new_levels << "#{name}(#{confidence})"
    end
  end

  new_levels.empty? ? 'Unclassified' : new_levels.join(';')
end
defaults() click to toggle source

Set default options.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 129
def defaults
  @options[:confidence] ||= 80
  @options[:cpus] ||= 1
end
process_input(input, output, tmp_in) click to toggle source

Process input data and save sequences to a temporary file for classifcation.

@param input [Enumerator] Input stream. @param output [Enumerator::Yielder] Output stream. @param tmp_in [String] Path to temporary FASTA file.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 140
def process_input(input, output, tmp_in)
  BioDSL::Fasta.open(tmp_in, 'w') do |ios|
    input.each_with_index do |record, i|
      @status[:records_in] += 1

      if record[:SEQ]
        @status[:sequences_in] += 1
        @status[:sequences_out] += 1
        @status[:residues_in] += record[:SEQ].length
        @status[:records_out] += record[:SEQ].length
        seq_name = record[:SEQ_NAME] || i.to_s

        entry = BioDSL::Seq.new(seq_name: seq_name, seq: record[:SEQ])

        ios.puts entry.to_fasta
      end

      output << record
      @status[:records_out] += 1
    end
  end
end
process_output(output, tmp_out) click to toggle source

Parse mothur classfication output and emit to stream.

@param output [Enumerator::Yielder] Output stream. @param tmp_out [String] Path to file with classfication result.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 189
def process_output(output, tmp_out)
  BioDSL::CSV.open(tmp_out) do |ios|
    ios.each_hash do |new_record|
      new_record[:SEQ_NAME] = new_record[:V0]
      new_record[:TAXONOMY] = new_record[:V1]
      new_record[:TAXONOMY].tr!('"', '')
      new_record.delete(:V0)
      new_record.delete(:V1)
      new_record[:TAXONOMY]    = confidence_filter(new_record)
      new_record[:RECORD_TYPE] = 'taxonomy'
      output << new_record
      @status[:records_out] += 1
    end
  end
end
run_mothur(tmp_dir, tmp_in) click to toggle source

Run Mothur using a system call.

@param tmp_dir [String] Path to temporary dir. @param tmp_in [String] Path to input file.

@raise [RunTimeError] If system call fails.

# File lib/BioDSL/commands/classify_seq_mothur.rb, line 169
    def run_mothur(tmp_dir, tmp_in)
      cmd = <<-CMD.gsub(/^\s+\|/, '').delete("\n")
        |mothur "#set.dir(input=#{tmp_dir});
        |set.dir(output=#{tmp_dir});
        |classify.seqs(fasta=#{tmp_in},
        |reference=#{@options[:database]},
        |taxonomy=#{@options[:taxonomy]},
        |method=wang,
        |processors=#{@options[:cpus]})"
      CMD

      BioDSL.verbose ? system(cmd) : system("#{cmd} > /dev/null 2>&1")

      fail 'Mothur failed' unless $CHILD_STATUS.success?
    end