class BioDSL::Taxonomy::Search
Class for searching sequences in a taxonomic database. The database consists a taxonomic tree index and indices for each taxonomic level saved in the following files:
* taxonomy_tax_index.dat - return node for a given node id. * taxonomy_kmer_index.dat - return list of node ids for a given level and kmer.
Constants
- BYTES_IN_HIT
- BYTES_IN_INT
- MAX_COUNT
- MAX_HITS
- Node
Structure for taxonomic tree nodes.
- Result
Structure for holding the search result.
Public Class Methods
Constructor for initializing a Search
object.
# File lib/BioDSL/taxonomy.rb, line 307 def initialize(options) @options = options symbols = %i(kmer_size step_size dir prefix consensus coverage hits_max) symbols.each do |opt| fail TaxonomyError, "missing #{opt} option" unless @options[opt] end @count_ary = BioDSL::CAry.new(MAX_COUNT, BYTES_IN_INT) @hit_ary = BioDSL::CAry.new(MAX_HITS, BYTES_IN_HIT) @tax_index = load_tax_index @kmer_index = load_kmer_index end
Public Instance Methods
Method to execute a search for a given sequence entry. First the sequence is broken down into unique kmers of a given kmer_size overlapping with a given step_size. See Taxonomy::Index.add
. Now, for each taxonomic level, starting from species all nodes for each kmer is looked up in the database. The nodes containing most kmers are considered hits. If there are no hits at a taxonomic level, we move to the next level. Hits are sorted according to how many kmers matched this particular node and a consensus taxonomy string is determined. Hits are also filtered with the following options:
* hits_max - Include maximally this number of hits in the consensus. * best_only - Include only the best scoring hits in the consensus. That is if a hit consists of 344 kmers out of 345 possible, only hits with 344 kmers are included. * coverage - Filter hits based on kmer coverage. If a hit contains fewer kmers than the total amount of kmers x coverage it will be filtered. * consensus - For a number of hits accept consensus at a given level if within this percentage.
# File lib/BioDSL/taxonomy.rb, line 341 def execute(entry) kmers = entry.to_kmers(kmer_size: @options[:kmer_size], step_size: @options[:step_size]) puts "DEBUG Q: #{entry.seq_name}" if BioDSL.debug TAX_LEVELS.reverse.each do |level| kmers_lookup(kmers, level) hit_count = hits_select_C(@count_ary.ary, @count_ary.count, @hit_ary.ary, kmers.size, (@options[:best_only] ? 1 : 0), @options[:coverage]) hit_count = @options[:hits_max] if @options[:hits_max] < hit_count if hit_count == 0 puts "DEBUG no hits @ #{level}" if BioDSL.debug else puts "DEBUG hit(s) @ #{level}" if BioDSL.debug taxpaths = [] (0...hit_count).each do |i| start = BYTES_IN_HIT * i stop = BYTES_IN_HIT * i + BYTES_IN_HIT node_id, count = @hit_ary.ary[start...stop].unpack('II') taxpath = TaxPath.new(node_id, count, kmers.size, @tax_index) if BioDSL.debug seq_id = @tax_index[node_id].seq_id puts "DEBUG S_ID: #{seq_id} KMERS: [#{count}/#{kmers.size}] \ #{taxpath}" end taxpaths << taxpath end return Result.new(hit_count, compile_consensus(taxpaths, hit_count). tr('_', ' ')) end end Result.new(0, 'Unclassified') end
Private Instance Methods
Method that given a list of taxonomic paths determines a consensus for each taxonomic level. E.g. for the kingdom level if 60% of the taxpaths indicate ‘Bacteria’ and the consensus is 50% then the consensus for the kingdom level will be reported as ‘Bacteria(60)’. If the name at any level consists of multiple words they are treated independently. E.g if we have three taxpath at the species level with the names:
* Escherichia coli K-12 * Escherichia coli sp. AC3432 * Escherichia coli sp. AC1232
The corresponding consensus for that level will be reported as ‘Escherichia coli sp.(100/100/66)’. The forth word in the last two taxonomy strings (AC3432 and AC1232) have a consensus below 50% and are ignored.
# File lib/BioDSL/taxonomy.rb, line 462 def compile_consensus(taxpaths, hit_size) consensus = [] tax_hash = decompose_consensus(taxpaths) tax_hash.each do |level, subhash| cons = [] scores = [] subhash.each_value do |subsubhash| subsubhash.sort_by { |_, count| count }.reverse. each do |subname, count| if count >= hit_size * @options[:consensus] cons << subname scores << ((count / hit_size.to_f) * 100).to_i end end end break if cons.empty? consensus << "#{level.upcase}##{cons.join('_')}(#{scores.join('/')})" end if consensus.empty? 'Unclassified' else consensus.join(';') end end
Method that given a list of taxonomic paths splits these into a data structure appropriate for subsequence determination of the taxonomic consensus.
# File lib/BioDSL/taxonomy.rb, line 495 def decompose_consensus(taxpaths) tax_hash = Hash.new do |h1, k1| h1[k1] = Hash.new { |h2, k2| h2[k2] = Hash.new(0) } end taxpaths.each do |taxpath| taxpath.nodes[1..-1].each do |node| # Ignoring root level, start at 1 node.name.split('_').each_with_index do |subname, i| tax_hash[node.level][i][subname] += 1 end end end tax_hash end
Method that given a list of kmers and a taxonomic level lookups all the nodes for each kmer and increment the count array posisions for all nodes. The lookup for each kmer is initially done from a database, but subsequent lookups for that particular kmer are cached.
# File lib/BioDSL/taxonomy.rb, line 435 def kmers_lookup(kmers, level) @count_ary.zero! kmers.each do |kmer| next unless @kmer_index[level] if (nodes = @kmer_index[level][kmer]) increment_C(@count_ary.ary, nodes, nodes.size / BYTES_IN_INT) end end end
Method to load and return the kmer_index from file.
# File lib/BioDSL/taxonomy.rb, line 411 def load_kmer_index kmer_index = Hash.new { |h, k| h[k] = {} } file = File.join(@options[:dir], "#{@options[:prefix]}_kmer_index.dat") File.open(file) do |ios| ios.each do |line| line.chomp! next if line[0] == '#' level, kmer, nodes = line.split("\t") kmer_index[level.to_sym][kmer.to_i] = nodes.split(';').map(&:to_i). pack('I*') end end kmer_index end
Method to load and return the tax_index from file.
# File lib/BioDSL/taxonomy.rb, line 390 def load_tax_index tax_index = {} file = File.join(@options[:dir], "#{@options[:prefix]}_tax_index.dat") File.open(file) do |ios| ios.each do |line| line.chomp! next if line[0] == '#' seq_id, node_id, level, name, parent_id = line.split("\t") tax_index[node_id.to_i] = Node.new(seq_id.to_i, node_id.to_i, level.to_sym, name, parent_id.to_i) end end tax_index end