class BioDSL::ReadFasta

Read FASTA entries from one or more files.

read_fasta reads sequence entries from FASTA files. Each entry consists of a sequence name on a line of its own, prefixed by ‘>’, followed by one or more lines of sequence that continue until the next entry or the end of the file. Each entry is emitted as a Biopiece record of the following form:

{:SEQ_NAME=>"test",
 :SEQ=>"AGCATCGACTAGCAGCATTT",
 :SEQ_LEN=>20}
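The mapping from a FASTA entry to the record above can be illustrated with a minimal sketch. It is not the library's parser: parse_fasta is a hypothetical helper that works on an in-memory string, and it assumes the sequence lines of an entry are simply concatenated.

```ruby
# Hypothetical helper, not part of BioDSL: turn FASTA text into record hashes.
def parse_fasta(text)
  records = []

  text.each_line do |line|
    line = line.strip
    next if line.empty?

    if line.start_with?('>')
      # A header line starts a new entry; the name is everything after '>'.
      records << { SEQ_NAME: line[1..-1], SEQ: String.new }
    elsif records.any?
      # Sequence lines are appended to the most recent entry.
      records.last[:SEQ] << line
    end
  end

  # Add the sequence length once each entry is complete.
  records.each { |record| record[:SEQ_LEN] = record[:SEQ].length }
end

records = parse_fasta(">test\nAGCATCGACT\nAGCAGCATTT\n")
p records
```

Here the two sequence lines are joined into a single 20-residue sequence, matching the record shown above.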

Input files may be compressed with gzip or bzip2.

For more about the FASTA format:

en.wikipedia.org/wiki/Fasta_format

Usage

read_fasta(input: <glob>[, first: <uint>|last: <uint>])

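Options

input: <glob> - Input file, comma-separated list of files, or glob expression (required).
first: <uint> - Read only the first <uint> entries.
last: <uint> - Read only the last <uint> entries.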

Examples

To read all FASTA entries from a file:

read_fasta(input: "test.fna")

To read all FASTA entries from a gzipped file:

read_fasta(input: "test.fna.gz")

To read in only 10 records from a FASTA file:

read_fasta(input: "test.fna", first: 10)

To read in the last 10 records from a FASTA file:

read_fasta(input: "test.fna", last: 10)

To read all FASTA entries from multiple files:

read_fasta(input: "test1.fna,test2.fna")

To read FASTA entries from multiple files using a glob expression:

read_fasta(input: "*.fna")
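The last two examples suggest that the input option is split on commas and that each piece is treated as a glob expression. A minimal sketch of such an expansion using Ruby's Dir.glob follows. It rests on that assumption: expand_inputs is a hypothetical helper, and the library's own options_glob may behave differently.

```ruby
# Hypothetical helper, not part of BioDSL: expand an :input value to file paths.
# Assumes a String may hold comma-separated patterns and an Array holds one
# pattern (or comma-separated list) per element.
def expand_inputs(input)
  patterns = Array(input).flat_map { |item| item.split(',') }

  # Expand each pattern in the order given; sort matches for a stable order.
  patterns.flat_map { |pattern| Dir.glob(pattern.strip).sort }
end

puts expand_inputs('*.fna')
```

A pattern that matches nothing contributes no paths in this sketch, whereas check_options in the library verifies that the input files exist.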

Constants

STATS

Public Class Methods

new(options)

Constructor for the ReadFasta class.

@param [Hash] options Options hash.
@option options [String, Array] :input String or Array with glob expressions.
@option options [Integer] :first Dump first number of records.
@option options [Integer] :last Dump last number of records.

@return [ReadFasta] Returns an instance of the class.

# File lib/BioDSL/commands/read_fasta.rb, line 93
def initialize(options)
  @options = options
  @count   = 0
  @buffer  = []

  check_options
end

Public Instance Methods

lmb()

Return a lambda for the read_fasta command.

@return [Proc] Returns the read_fasta command lambda.

# File lib/BioDSL/commands/read_fasta.rb, line 104
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)

    read_input(input, output)

    options_glob(@options[:input]).each do |file|
      BioDSL::Fasta.open(file) do |ios|
        if @options[:first] && read_first(ios, output)
        elsif @options[:last] && read_last(ios)
        else
          read_all(ios, output)
        end
      end
    end

    write_buffer(output) if @options[:last]
  end
end
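The lambda above first forwards any upstream records and then emits its own. A minimal sketch of how a lambda with this (input, output, status) signature might be driven is given below. How BioDSL actually wires commands together is not shown in this section, so demo_command and run_command are hypothetical, and the use of Enumerator to supply the output stream is an assumption.

```ruby
# Hypothetical command with the same (input, output, status) signature.
def demo_command
  lambda do |input, output, status|
    # Forward upstream records unchanged, counting them as input.
    input&.each do |record|
      output << record
      status[:records_in] += 1
    end

    # Then emit this command's own record, counting it as output.
    output << { SEQ_NAME: 'a', SEQ: 'ACGT', SEQ_LEN: 4 }
    status[:records_out] += 1
  end
end

# Hypothetical driver: collect everything the command writes to its output.
def run_command(command, upstream)
  status  = Hash.new(0)
  records = Enumerator.new { |out| command.call(upstream, out, status) }.to_a
  [records, status]
end

records, status = run_command(demo_command, [{ FOO: 'bar' }])
p records
p status
```

The upstream record comes out first, followed by the record the command produced, which mirrors the order of read_input and the file loop in lmb.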

Private Instance Methods

check_options()

Check the options.

# File lib/BioDSL/commands/read_fasta.rb, line 127
def check_options
  options_allowed(@options, :input, :first, :last)
  options_required(@options, :input)
  options_files_exist(@options, :input)
  options_unique(@options, :first, :last)
  options_assert(@options, ':first >= 0')
  options_assert(@options, ':last >= 0')
end

read_all(input, output)

Read in all entries from input and emit to output.

@param input [BioDSL::Fasta] FASTA file input stream.
@param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/read_fasta.rb, line 199
def read_all(input, output)
  input.each do |entry|
    output << entry.to_bp

    @status[:records_out] += 1
    @status[:sequences_out] += 1
    @status[:residues_out] += entry.length
  end
end

read_first(input, output)

Read in a specified number of entries from the input and emit to the output.

@param input [BioDSL::Fasta] FASTA file input stream.
@param output [Enumerator::Yielder] Output stream.

@return [Fixnum] Number of read entries.
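The source listing for this method keeps its counter in an instance variable, so the limit appears to apply across all input files together, not to each file separately. A minimal sketch of that behavior with plain arrays standing in for files follows; take_first is a hypothetical helper, not part of BioDSL.

```ruby
# Hypothetical helper: emit at most `limit` items in total across all sources.
def take_first(sources, limit)
  count  = 0
  output = []

  sources.each do |source|
    source.each do |item|
      # The counter is shared, so later sources stop immediately once
      # the limit has been reached.
      break if count == limit

      output << item
      count += 1
    end
  end

  output
end

take_first([[1, 2, 3], [4, 5, 6]], 4) # => [1, 2, 3, 4]
```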

# File lib/BioDSL/commands/read_fasta.rb, line 161
def read_first(input, output)
  first = @options[:first]

  input.each do |entry|
    break if @count == first
    output << entry.to_bp

    @status[:records_out] += 1
    @status[:sequences_out] += 1
    @status[:residues_out] += entry.length

    @count += 1
  end

  @count
end

read_input(input, output)

Read and emit records from the input to the output stream.

@param input [Enumerator::Yielder] Input stream.
@param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/read_fasta.rb, line 140
def read_input(input, output)
  return unless input

  input.each do |record|
    output << record
    @status[:records_in] += 1

    if record[:SEQ]
      @status[:sequences_in] += 1
      @status[:residues_in] += record[:SEQ].length
    end
  end
end

read_last(input)

Read in entries from input and cache the specified last number in a buffer.

@param input [BioDSL::Fasta] FASTA file input stream.

@return [Fixnum] Number of read entries.
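The caching described here amounts to a bounded buffer: every entry is appended, and the oldest one is dropped whenever the buffer grows past the limit. A minimal sketch with plain values follows; keep_last is a hypothetical helper, not part of BioDSL.

```ruby
# Hypothetical helper: keep only the last `limit` items seen.
def keep_last(items, limit)
  buffer = []

  items.each do |item|
    buffer << item
    # Drop the oldest item once the buffer exceeds the limit.
    buffer.shift if buffer.size > limit
  end

  buffer
end

keep_last(1..10, 3) # => [8, 9, 10]
```

Only `limit` items are held in memory at any time. Because the class keeps its buffer in an instance variable and writes it out once after all files are read, the last option appears to select the final entries across all input files combined.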

# File lib/BioDSL/commands/read_fasta.rb, line 184
def read_last(input)
  last = @options[:last]

  input.each do |entry|
    @buffer << entry
    @buffer.shift if @buffer.size > last
  end

  @buffer.size
end

write_buffer(output)

Emit all entries in buffer to output.

@param output [Enumerator::Yielder] Output stream.

# File lib/BioDSL/commands/read_fasta.rb, line 212
def write_buffer(output)
  @buffer.each do |entry|
    output << entry.to_bp

    @status[:records_out] += 1
    @status[:sequences_out] += 1
    @status[:residues_out] += entry.length
  end
end