class BioDSL::IndexTaxonomy

Create taxonomy index from sequences in the stream.

index_taxonomy is used to create a taxonomy index to allow subsequent taxonomic classification with classify_seq. The records with taxnomic information must contain :SEQ_NAME and :SEQ keys where the :SEQ_NAME value must be formatted with an initial ID number followed by a space and then the taxonomy string progressing from kingdom to species level. Only the following leves are accepted:

* K - kingdom
* P - phylum
* C - class
* O - order
* F - family
* G - genus
* S - species

Truncated taxonomic strings are allowed, e.g. a string that stops at family level. Below is an example of a full taxonomic string:

32 K#Bacteria;P#Actinobacteria;C#Actinobacteria;O#Acidimicrobiales; \
   F#Acidimicrobiaceae;G#Ferrimicrobium;S#Ferrimicrobium acidiphilum

The resulting index consists of the following files (here using the default “taxonomy” as prefix) which are saved to a specified output_dir:

* taxonomy_tax_index.dat  - return node for a given node id.
* taxonomy_kmer_index.dat - return list of node ids for a given level and
                            kmer.

The index is constructed by breaking the sequences into kmers of a given kmer_size and using a given step_size:

Example FASTA entry:

>2 K#Bacteria;P#Proteobacteria;C#Gammaproteobacteria;O#Vibrionales; \
   F#Vibrionaceae;G#Vibrio;S#Vibrio
UCCUACGGGAGGCAGCAGUGGGGAAUAUUGCACAAUGGGCGCAAGCCUGAUGCAGCCAUGCCGCGUGUAUGA

This sequence is broken down to a list of oligos using the default kmer_size and step_size of 8 and 1, respectively:

UCCUACGG
 CCUACGGG
  CUACGGGA
   UACGGGAG
    ACGGGAGG
     ...

Oligos containing ambiguity codes are skipped. Each oligo is encoded as an kmer (integer) by encoding two bits per nucletoide:

* A = 00
* U = 01
* C = 10
* G = 11

E.g. UCCUACGG = 0110100100101111 = 26927

For each node in the tree a vector is kept containing information of all observed oligos for that particular node. Thus all child nodes contain a subset of oligos compared to the parent node. Finally, the tree is saved to files.

It should be noted that the speed and accuarcy of the classification is strongly dependent on the size and quality of the taxonomic database used (RDP, GreenGenes or Silva) and for a particular amplicon it is strongly recommended to create a slice from the database aligment matching the amplicon.

Usage

index_taxonomy(<output_dir: <dir>>[, kmer_size: <uint>
               [, step_size: <uint>[, prefix: <string>
               [, force: <bool>]]]])

Options

* output_dir: <dir> - Output directory to contain index files.
* kmer_size: <uint> - Size of kmer to use (default=8).
* step_size: <uint> - Size of steps (default=1).
* prefix: <string>  - Prefix to use with index file names
                      (default="taxonomy").
* force: <bool>     - Force overwrite existing index files.

Examples

BD.new.
read_fasta(input: "RDP_11_Bacteria.fna").
index_taxonomy(output_dir: "RDP_11").
run

Constants

STATS

Public Class Methods

new(options) click to toggle source

Constructor for IndexTaxonomy.

@param options [Hash] Options hash. @option options [String] :output_dir Path to output directory. @option options [String] :prefix Database file name prefix. @option options [Integer] :kmer_size Kmer size to use for indexing. @option options [Integer] :step_size Step size to use for indexing. @option options [Boolean] :force Flag for force-overwriting output files.

@return [IndexTaxonomy] Instance of class.

# File lib/BioDSL/commands/index_taxonomy.rb, line 134
def initialize(options)
  @options = options

  defaults
  check_options
  create_output_dir
  check_output_files

  @index = BioDSL::Taxonomy::Index.new(options)
end

Public Instance Methods

lmb() click to toggle source

Return command lambda for index_taxonomy.

@return [Proc] Command lambda.

# File lib/BioDSL/commands/index_taxonomy.rb, line 148
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    input.each do |record|
      @status[:records_in] += 1

      add_to_index(record) if record[:SEQ_NAME] && record[:SEQ]

      output << record
      @status[:records_out] += 1
    end

    @index.save
  end
end

Private Instance Methods

add_to_index(record) click to toggle source

Add to the taxonomy index the sequence information from a given record.

@param record [Hash] BioDSL record with sequence info.

# File lib/BioDSL/commands/index_taxonomy.rb, line 218
def add_to_index(record)
  @status[:sequences_in] += 1

  _, seq_name = record[:SEQ_NAME].split(' ', 2)

  @index.add(BioDSL::Seq.new(seq_name: seq_name, seq: record[:SEQ]))
end
check_options() click to toggle source

Check options.

# File lib/BioDSL/commands/index_taxonomy.rb, line 168
def check_options
  options_allowed(@options, :output_dir, :kmer_size, :step_size, :prefix,
                  :force)
  options_required(@options, :output_dir)
  options_allowed_values(@options, force: [nil, true, false])
  options_files_exist_force(@options, :report)
  options_assert(@options, ':kmer_size > 0')
  options_assert(@options, ':kmer_size <= 12')
  options_assert(@options, ':step_size > 0')
  options_assert(@options, ':step_size <= 12')
end
check_output_files() click to toggle source

Check if the output files already exist and throw an exception if so and the no force options is used.

@raise [BioDSL::OptionsError] If file exists and force option not used.

# File lib/BioDSL/commands/index_taxonomy.rb, line 199
def check_output_files
  files = [
    File.join(@options[:output_dir], "#{@options[:prefix]}_tax_index.dat"),
    File.join(@options[:output_dir], "#{@options[:prefix]}_kmer_index.dat")
  ]

  files.each do |file|
    next unless File.exist? file

    unless @options[:force]
      msg = "File exists: #{file} - use 'force: true' to overwrite"
      fail BioDSL::OptionError, msg
    end
  end
end
create_output_dir() click to toggle source

Create the output directory specified in the options hash if this does not already exist.

# File lib/BioDSL/commands/index_taxonomy.rb, line 189
def create_output_dir
  return if File.exist?(@options[:output_dir])

  FileUtils.mkdir_p(@options[:output_dir])
end
defaults() click to toggle source

Set the default options hash values.

# File lib/BioDSL/commands/index_taxonomy.rb, line 181
def defaults
  @options[:prefix] ||= 'taxonomy'
  @options[:kmer_size] ||= 8
  @options[:step_size] ||= 1
end