class BioDSL::Taxonomy::Index

Class for creating and databasing an index of a taxonomic tree. This is done in two steps. 1) A temporary tree is creating using the taxonomic strings from the sequence names in a FASTA file. 2) A simplistic tree is constructed from the temporary tree allowing this to be saved to files. The resulting index consists of the following files:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

Attributes

node_id[R]
size[R]

Public Class Methods

new(options) click to toggle source

Constructor Index object.

# File lib/BioDSL/taxonomy.rb, line 55
def initialize(options)
  @options   = options                                       # Option hash
  @seq_id    = 0                                             # Sequence id
  @node_id   = 0                                             # Node id
  @tree      = TaxNode.new(nil, :r, 'root', nil, @node_id)   # Root node
  @node_id  += 1

  %i(kmer_size step_size output_dir prefix).each do |option|
    fail TaxonomyError, "missing #{option} option" unless @options[option]
  end
end

Public Instance Methods

add(entry) click to toggle source

Method to add a Sequence entry to the taxonomic tree. The sequence name contain a taxonomic string.

Example entry: seq_name: K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria; \

O#Vibrionales;F#Vibrionaceae;G#Vibrio;S#Vibrio

seq: UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGA \

UGCAGCCAUGCCGCGUGUAUGAAGGCCUUCGGGUUGUAACUC ...

The sequence is reduced to a list of oligos of a given size and a given step size, e.g. 8 and 1, respectively:

UCCUACGG

CCUACGGG
 CUACGGGA
  UACGGGAG
   ACGGGAGG
    ...

Each oligo is encoded as an kmer (integer) by encoding two bits per nucleotide:

A = 00 U = 01 C = 10 G = 11

E.g. UCCUACGG = 0110100100101111 = 26927

For each node in the tree a set is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node.

# File lib/BioDSL/taxonomy.rb, line 99
def add(entry)
  node       = @tree
  old_name   = false
  tax_levels = entry.seq_name.split(';')

  if tax_levels.size != TAX_LEVELS.size - 1
    fail TaxonomyError, "Wrong number of tax levels in #{entry.seq_name}"
  end

  tax_levels.each_with_index do |tax_level, i|
    level, name = tax_level.split('#')

    if level.downcase.to_sym != TAX_LEVELS[i + 1]
      fail TaxonomyError, "Unexpected tax id in #{entry.seq_name}"
    end

    if name
      if i > 0 && !old_name
        fail TaxonomyError, "Gapped tax level info in #{entry.seq_name}"
      end

      if (child = node[name])
      else
        child = TaxNode.new(node, level.downcase.to_sym, name, @seq_id,
                            @node_id)
        @node_id += 1
      end

      if leaf?(tax_levels, i)
        kmers = entry.to_kmers(kmer_size: @options[:kmer_size],
                               step_size: @options[:step_size])
        child.kmers |= Set.new(kmers)
      end

      node[name] = child
      node = node[name]
    end

    old_name = name
  end

  @seq_id += 1

  self
end
get_node(id) click to toggle source

Testing method to get a node given an id. Returns nil if node wasn’t found.

# File lib/BioDSL/taxonomy.rb, line 155
def get_node(id)
  queue = [@tree]

  until queue.empty?
    node = queue.shift

    return node if node.node_id == id

    node.children.each_value do |child|
      queue.unshift(child) unless child.nil?
    end
  end

  nil
end
save() click to toggle source

Remap and save taxonomic tree to index files.

# File lib/BioDSL/taxonomy.rb, line 146
def save
  tree_union(@tree)

  save_kmer_index
  save_tax_index
end
tree_union(node = @tree) click to toggle source

Method that traverses the tax tree and populate all parent nodes with the union of all kmers from the patents children.

# File lib/BioDSL/taxonomy.rb, line 173
def tree_union(node = @tree)
  node.children.each_value { |child| tree_union(child) }

  node.children.each_value do |child|
    if node.kmers.nil? && child.kmers.nil?
    elsif node.kmers.nil?
      node.kmers = child.kmers
    else
      node.kmers |= child.kmers if child.kmers
    end
  end
end

Private Instance Methods

leaf?(tax_levels, i) click to toggle source

Method that determines if a node is a leaf or not.

# File lib/BioDSL/taxonomy.rb, line 189
def leaf?(tax_levels, i)
  if tax_levels[i + 1] && tax_levels[i + 1].split('#')[1]
    false
  else
    true
  end
end
save_kmer_index() click to toggle source

Construct and save kmer index to file. This is done BFS style one taxonomic level at a time to save memory.

# File lib/BioDSL/taxonomy.rb, line 220
def save_kmer_index
  file = File.join(@options[:output_dir],
                   "#{@options[:prefix]}_kmer_index.dat")
  File.open(file, 'wb') do |ios|
    ios.puts %w(#LEVEL KMER NODES).join("\t")

    level = 0
    queue = [@tree]

    until queue.empty?
      kmer_index = Hash.new { |h, k| h[k] = [] }
      new_queue = []

      queue.each do |node|
        node.kmers.to_a.map { |kmer| kmer_index[kmer] << node.node_id }
        node.children.each_value { |child| child && new_queue << child }
      end

      kmer_index.keys.sort.each do |kmer|
        nodes = kmer_index[kmer].sort.join(';')

        ios.puts [TAX_LEVELS[level], kmer, nodes].join("\t")
      end

      queue = new_queue
      level += 1
    end

    kmer_index
  end
end
save_tax_index() click to toggle source

Save tax index to file.

# File lib/BioDSL/taxonomy.rb, line 198
def save_tax_index
  file = File.join(@options[:output_dir],
                   "#{@options[:prefix]}_tax_index.dat")
  File.open(file, 'wb') do |ios|
    ios.puts %w(#SEQ_ID NODE_ID LEVEL NAME PARENT_ID).join("\t")
    queue = [@tree]

    until queue.empty?
      node = queue.shift

      ios.puts [node.seq_id, node.node_id, node.level, node.name,
                node.parent_id].join("\t")

      node.children.each_value do |child|
        queue.unshift(child) unless child.nil?
      end
    end
  end
end