class BioDSL::Taxonomy::Index
Class for creating and databasing an index of a taxonomic tree. This is done in two steps. 1) A temporary tree is creating using the taxonomic strings from the sequence names in a FASTA file. 2) A simplistic tree is constructed from the temporary tree allowing this to be saved to files. The resulting index consists of the following files:
* taxonomy_tax_index.dat - return node for a given node id. * taxonomy_kmer_index.dat - return list of node ids for a given level and kmer.
Attributes
Public Class Methods
Constructor Index
object.
# File lib/BioDSL/taxonomy.rb, line 55 def initialize(options) @options = options # Option hash @seq_id = 0 # Sequence id @node_id = 0 # Node id @tree = TaxNode.new(nil, :r, 'root', nil, @node_id) # Root node @node_id += 1 %i(kmer_size step_size output_dir prefix).each do |option| fail TaxonomyError, "missing #{option} option" unless @options[option] end end
Public Instance Methods
Method to add a Sequence entry to the taxonomic tree. The sequence name contain a taxonomic string.
Example entry: seq_name: K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria; \
O#Vibrionales;F#Vibrionaceae;G#Vibrio;S#Vibrio
seq: UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGA \
UGCAGCCAUGCCGCGUGUAUGAAGGCCUUCGGGUUGUAACUC ...
The sequence is reduced to a list of oligos of a given size and a given step size, e.g. 8 and 1, respectively:
UCCUACGG
CCUACGGG CUACGGGA UACGGGAG ACGGGAGG ...
Each oligo is encoded as an kmer (integer) by encoding two bits per nucleotide:
A = 00 U = 01 C = 10 G = 11
E.g. UCCUACGG = 0110100100101111 = 26927
For each node in the tree a set is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node.
# File lib/BioDSL/taxonomy.rb, line 99 def add(entry) node = @tree old_name = false tax_levels = entry.seq_name.split(';') if tax_levels.size != TAX_LEVELS.size - 1 fail TaxonomyError, "Wrong number of tax levels in #{entry.seq_name}" end tax_levels.each_with_index do |tax_level, i| level, name = tax_level.split('#') if level.downcase.to_sym != TAX_LEVELS[i + 1] fail TaxonomyError, "Unexpected tax id in #{entry.seq_name}" end if name if i > 0 && !old_name fail TaxonomyError, "Gapped tax level info in #{entry.seq_name}" end if (child = node[name]) else child = TaxNode.new(node, level.downcase.to_sym, name, @seq_id, @node_id) @node_id += 1 end if leaf?(tax_levels, i) kmers = entry.to_kmers(kmer_size: @options[:kmer_size], step_size: @options[:step_size]) child.kmers |= Set.new(kmers) end node[name] = child node = node[name] end old_name = name end @seq_id += 1 self end
Testing method to get a node given an id. Returns nil if node wasn’t found.
# File lib/BioDSL/taxonomy.rb, line 155 def get_node(id) queue = [@tree] until queue.empty? node = queue.shift return node if node.node_id == id node.children.each_value do |child| queue.unshift(child) unless child.nil? end end nil end
Remap and save taxonomic tree to index files.
# File lib/BioDSL/taxonomy.rb, line 146 def save tree_union(@tree) save_kmer_index save_tax_index end
Method that traverses the tax tree and populate all parent nodes with the union of all kmers from the patents children.
# File lib/BioDSL/taxonomy.rb, line 173 def tree_union(node = @tree) node.children.each_value { |child| tree_union(child) } node.children.each_value do |child| if node.kmers.nil? && child.kmers.nil? elsif node.kmers.nil? node.kmers = child.kmers else node.kmers |= child.kmers if child.kmers end end end
Private Instance Methods
Method that determines if a node is a leaf or not.
# File lib/BioDSL/taxonomy.rb, line 189 def leaf?(tax_levels, i) if tax_levels[i + 1] && tax_levels[i + 1].split('#')[1] false else true end end
Construct and save kmer index to file. This is done BFS style one taxonomic level at a time to save memory.
# File lib/BioDSL/taxonomy.rb, line 220 def save_kmer_index file = File.join(@options[:output_dir], "#{@options[:prefix]}_kmer_index.dat") File.open(file, 'wb') do |ios| ios.puts %w(#LEVEL KMER NODES).join("\t") level = 0 queue = [@tree] until queue.empty? kmer_index = Hash.new { |h, k| h[k] = [] } new_queue = [] queue.each do |node| node.kmers.to_a.map { |kmer| kmer_index[kmer] << node.node_id } node.children.each_value { |child| child && new_queue << child } end kmer_index.keys.sort.each do |kmer| nodes = kmer_index[kmer].sort.join(';') ios.puts [TAX_LEVELS[level], kmer, nodes].join("\t") end queue = new_queue level += 1 end kmer_index end end
Save tax index to file.
# File lib/BioDSL/taxonomy.rb, line 198 def save_tax_index file = File.join(@options[:output_dir], "#{@options[:prefix]}_tax_index.dat") File.open(file, 'wb') do |ios| ios.puts %w(#SEQ_ID NODE_ID LEVEL NAME PARENT_ID).join("\t") queue = [@tree] until queue.empty? node = queue.shift ios.puts [node.seq_id, node.node_id, node.level, node.name, node.parent_id].join("\t") node.children.each_value do |child| queue.unshift(child) unless child.nil? end end end end