class BioDSL::Taxonomy::Search

Class for searching sequences in a taxonomic database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

Constants

BYTES_IN_HIT
BYTES_IN_INT
MAX_COUNT
MAX_HITS
Node

Structure for taxonomic tree nodes.

Result

Structure for holding the search result.

Public Class Methods

new(options) click to toggle source

Constructor for initializing a Search object.

# File lib/BioDSL/taxonomy.rb, line 307
def initialize(options)
  @options    = options

  symbols = %i(kmer_size step_size dir prefix consensus coverage hits_max)

  symbols.each do |opt|
    fail TaxonomyError, "missing #{opt} option" unless @options[opt]
  end

  @count_ary  = BioDSL::CAry.new(MAX_COUNT, BYTES_IN_INT)
  @hit_ary    = BioDSL::CAry.new(MAX_HITS, BYTES_IN_HIT)
  @tax_index  = load_tax_index
  @kmer_index = load_kmer_index
end

Public Instance Methods

execute(entry) click to toggle source

Method to execute a search for a given sequence entry. First the sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size. See Taxonomy::Index.add. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:

* hits_max  - Include maximally this number of hits in the consensus.
* best_only - Include only the best scoring hits in the consensus.
              That is if a hit consists of 344 kmers out of 345
              possible, only hits with 344 kmers are included.
* coverage  - Filter hits based on kmer coverage. If a hit contains
              fewer kmers than the total amount of kmers x coverage
              it will be filtered.
* consensus - For a number of hits accept consensus at a given level
              if within this percentage.
# File lib/BioDSL/taxonomy.rb, line 341
def execute(entry)
  kmers = entry.to_kmers(kmer_size: @options[:kmer_size],
                         step_size: @options[:step_size])

  puts "DEBUG Q: #{entry.seq_name}" if BioDSL.debug

  TAX_LEVELS.reverse.each do |level|
    kmers_lookup(kmers, level)

    hit_count = hits_select_C(@count_ary.ary, @count_ary.count,
                              @hit_ary.ary, kmers.size,
                              (@options[:best_only] ? 1 : 0),
                              @options[:coverage])
    hit_count = @options[:hits_max] if @options[:hits_max] < hit_count

    if hit_count == 0
      puts "DEBUG no hits @ #{level}" if BioDSL.debug
    else
      puts "DEBUG hit(s) @ #{level}"  if BioDSL.debug
      taxpaths = []

      (0...hit_count).each do |i|
        start = BYTES_IN_HIT * i
        stop  = BYTES_IN_HIT * i + BYTES_IN_HIT

        node_id, count = @hit_ary.ary[start...stop].unpack('II')

        taxpath = TaxPath.new(node_id, count, kmers.size, @tax_index)

        if BioDSL.debug
          seq_id = @tax_index[node_id].seq_id
          puts "DEBUG S_ID: #{seq_id} KMERS: [#{count}/#{kmers.size}] \
                #{taxpath}"
        end

        taxpaths << taxpath
      end

      return Result.new(hit_count, compile_consensus(taxpaths, hit_count).
             tr('_', ' '))
    end
  end

  Result.new(0, 'Unclassified')
end

Private Instance Methods

compile_consensus(taxpaths, hit_size) click to toggle source

Method that given a list of taxonomic paths determines a consensus for each taxonomic level. E.g. for the kingdom level if 60% of the taxpaths indicate ‘Bacteria’ and the consensus is 50% then the consensus for the kingdom level will be reported as ‘Bacteria(60)’. If the name at any level consists of multiple words they are treated independently. E.g if we have three taxpath at the species level with the names:

*  Escherichia coli K-12
*  Escherichia coli sp. AC3432
*  Escherichia coli sp. AC1232

The corresponding consensus for that level will be reported as ‘Escherichia coli sp.(100/100/66)’. The forth word in the last two taxonomy strings (AC3432 and AC1232) have a consensus below 50% and are ignored.

# File lib/BioDSL/taxonomy.rb, line 462
def compile_consensus(taxpaths, hit_size)
  consensus = []
  tax_hash  = decompose_consensus(taxpaths)

  tax_hash.each do |level, subhash|
    cons   = []
    scores = []

    subhash.each_value do |subsubhash|
      subsubhash.sort_by { |_, count| count }.reverse.
        each do |subname, count|
        if count >= hit_size * @options[:consensus]
          cons   << subname
          scores << ((count / hit_size.to_f) * 100).to_i
        end
      end
    end

    break if cons.empty?

    consensus << "#{level.upcase}##{cons.join('_')}(#{scores.join('/')})"
  end

  if consensus.empty?
    'Unclassified'
  else
    consensus.join(';')
  end
end
decompose_consensus(taxpaths) click to toggle source

Method that given a list of taxonomic paths splits these into a data structure appropriate for subsequence determination of the taxonomic consensus.

# File lib/BioDSL/taxonomy.rb, line 495
def decompose_consensus(taxpaths)
  tax_hash = Hash.new do |h1, k1|
    h1[k1] = Hash.new { |h2, k2| h2[k2] =  Hash.new(0) }
  end

  taxpaths.each do |taxpath|
    taxpath.nodes[1..-1].each do |node|  # Ignoring root level, start at 1
      node.name.split('_').each_with_index do |subname, i|
        tax_hash[node.level][i][subname] += 1
      end
    end
  end

  tax_hash
end
kmers_lookup(kmers, level) click to toggle source

Method that given a list of kmers and a taxonomic level lookups all the nodes for each kmer and increment the count array posisions for all nodes. The lookup for each kmer is initially done from a database, but subsequent lookups for that particular kmer are cached.

# File lib/BioDSL/taxonomy.rb, line 435
def kmers_lookup(kmers, level)
  @count_ary.zero!

  kmers.each do |kmer|
    next unless @kmer_index[level]

    if (nodes = @kmer_index[level][kmer])
      increment_C(@count_ary.ary, nodes, nodes.size / BYTES_IN_INT)
    end
  end
end
load_kmer_index() click to toggle source

Method to load and return the kmer_index from file.

# File lib/BioDSL/taxonomy.rb, line 411
def load_kmer_index
  kmer_index = Hash.new { |h, k| h[k] = {} }
  file = File.join(@options[:dir], "#{@options[:prefix]}_kmer_index.dat")
  File.open(file) do |ios|
    ios.each do |line|
      line.chomp!

      next if line[0] == '#'

      level, kmer, nodes = line.split("\t")

      kmer_index[level.to_sym][kmer.to_i] = nodes.split(';').map(&:to_i).
                                            pack('I*')
    end
  end

  kmer_index
end
load_tax_index() click to toggle source

Method to load and return the tax_index from file.

# File lib/BioDSL/taxonomy.rb, line 390
def load_tax_index
  tax_index = {}
  file = File.join(@options[:dir], "#{@options[:prefix]}_tax_index.dat")
  File.open(file) do |ios|
    ios.each do |line|
      line.chomp!

      next if line[0] == '#'

      seq_id, node_id, level, name, parent_id = line.split("\t")

      tax_index[node_id.to_i] = Node.new(seq_id.to_i, node_id.to_i,
                                         level.to_sym, name,
                                         parent_id.to_i)
    end
  end

  tax_index
end