class BioDSL::Grab

Grab records in stream.

grab selects records from the stream by matching patterns against keys or values. grab is BioDSL’s equivalent of Unix grep, but it is considerably more versatile.

NB! When chaining multiple grab commands, put the most restrictive grab first to get the best performance.

NB! Avoid using exact with long values: every pattern is kept in an in-memory lookup hash, so long values increase memory use.

Usage

grab(<select: <pattern>|select_file: <file>|reject: <pattern>|
     reject_file: <file>|evaluate: <expression>>
     [, exact: <bool>|keys: <list>|keys_only: <bool>|values_only: <bool>|
     ignore_case: <bool>])

Options

Examples

To grab all records in the stream that mention the pattern ‘human’, pipe the data stream through grab like this:

grab(select: "human")

This searches for the pattern ‘human’ in all keys and all values. The select option also accepts an array of patterns, so to match any one of several patterns do:

grab(select: ["human", "mouse"])

Patterns are compiled as regular expressions (regex), which allows more flexible matching than plain substrings. To grab records containing the sequence ATCG or GCTA you can do this:

grab(select: "ATCG|GCTA")

Or if you want to grab sequences beginning with ATCG:

grab(select: "^ATCG")
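
As a rough illustration of how these two patterns behave, the sketch below applies plain Ruby regexes to a few made-up records. The record contents and the restriction to the SEQ value are assumptions for the example; grab itself searches all keys and values unless told otherwise.

```ruby
# Made-up records for illustration; not taken from a real BioDSL stream.
records = [
  { SEQ_NAME: 'a', SEQ: 'ATCGGGTA' },
  { SEQ_NAME: 'b', SEQ: 'TTGCTAAA' },
  { SEQ_NAME: 'c', SEQ: 'TTTTTTTT' }
]

either = Regexp.new('ATCG|GCTA') # alternation: ATCG or GCTA anywhere
prefix = Regexp.new('^ATCG')     # anchored: must begin with ATCG

# Names of the records whose SEQ value matches each pattern.
matches_either = records.select { |r| r[:SEQ] =~ either }.map { |r| r[:SEQ_NAME] }
matches_prefix = records.select { |r| r[:SEQ] =~ prefix }.map { |r| r[:SEQ_NAME] }
```

Here the alternation keeps records a and b, while the anchored pattern keeps only record a.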

It is also possible to use the select_file option to load patterns from a file with one pattern per line.

grab(select_file: "patterns.txt")

If you want the opposite result, that is, all records that do not match a pattern, use the reject option:

grab(reject: "human")

Similar to select_file there is a reject_file option to load patterns from a file, and use any of these patterns to reject records:

grab(reject_file: "patterns.txt")

If you want to search the record keys only, e.g. to grab all records containing the key SEQ, use the keys_only option. This prevents SEQ from being matched in any record value; since SEQ is a not uncommon peptide sequence, you could otherwise get unwanted records. It is also faster, since only the keys are searched:

grab(select: "SEQ", keys_only: true)

However, if you are interested in grabbing the peptide sequence SEQ and not the SEQ key, use the values_only option:

grab(select: "SEQ", values_only: true)
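
The difference between searching keys and searching values can be sketched in plain Ruby as follows. The two records and the lambda names are hypothetical and only mimic the idea; they are not BioDSL code.

```ruby
# Hypothetical records: one has a SEQ key, the other a value containing SEQ.
nucleotide = { SEQ_NAME: 'test', SEQ: 'ATCG' }
peptide    = { NAME: 'pep', PEPTIDE: 'MSEQK' }

pattern = /SEQ/

# Keys-only search: look at the key names, never at the values.
key_hit = ->(record) { record.keys.any? { |key| key.to_s =~ pattern } }

# Values-only search: look at the values, never at the key names.
value_hit = ->(record) { record.values.any? { |value| value.to_s =~ pattern } }

nucleotide_by_key   = key_hit.call(nucleotide)
nucleotide_by_value = value_hit.call(nucleotide)
peptide_by_key      = key_hit.call(peptide)
peptide_by_value    = value_hit.call(peptide)
```

The nucleotide record is found only by the keys-only search, and the peptide record only by the values-only search.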

Also, if you want to grab for certain key/value pairs, use the keys option to supply a comma-separated list or an array of keys whose values will then be searched. This is handy if your records contain large genomic sequences and you don’t want to search the entire sequence for e.g. the organism name - it is much faster to tell grab which keys to search the values of:

grab(select: "human", keys: :SEQ_NAME)

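You can also use the evaluate option to grab records that fulfill an expression. Record keys are written with a leading colon and are replaced by the record’s value before the expression is evaluated. So to grab all records with a sequence length greater than 30:

grab(evaluate: ':SEQ_LEN > 30')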

If you want to grab all records containing the pattern ‘human’ and where the sequence length is greater than 30, run the stream through grab twice:

grab(select: 'human').grab(evaluate: ':SEQ_LEN > 30')
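
Chaining two grabs acts as a logical AND of the two conditions. A minimal sketch with plain Ruby arrays, assuming made-up records and using Array#select as a stand-in for each grab step:

```ruby
# Made-up records for illustration only.
records = [
  { SEQ_NAME: 'human_1', SEQ_LEN: 45 },
  { SEQ_NAME: 'human_2', SEQ_LEN: 12 },
  { SEQ_NAME: 'mouse_1', SEQ_LEN: 60 }
]

# First step: keep records mentioning 'human' in any key or value.
# Second step: of those, keep records with SEQ_LEN above 30.
kept = records
       .select { |r| r.any? { |key, value| key.to_s =~ /human/ || value.to_s =~ /human/ } }
       .select { |r| r[:SEQ_LEN] > 30 }

names = kept.map { |r| r[:SEQ_NAME] }
```

Only human_1 passes both steps. Placing the step that discards the most records first means the later step has less work to do.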

Finally, it is possible to grab for exact patterns using the exact option. This is much faster than the default regex matching because with exact the patterns are used to build a lookup hash for instant matching of keys or values. This is useful if you e.g. have a file with ID numbers and want to grab matching records from the stream:

grab(select_file: "ids.txt", keys: :ID, exact: true)
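
The contrast between exact lookup and regex matching can be sketched like this. The IDs and records are invented for the example, and the lookup hash only imitates the idea described above; it is not the library's code path.

```ruby
# Invented IDs, standing in for the lines of a pattern file.
ids = %w[id_7 id_42 id_99]

# Exact matching: build the lookup hash once, then test membership per record.
lookup = {}
ids.each { |id| lookup[id.to_sym] = true }

records = [{ ID: 'id_42' }, { ID: 'id_420' }, { ID: 'id_1' }]

exact_hits = records.select { |r| lookup.include?(r[:ID].to_sym) }

# Regex matching: every pattern is tried against every value, and a pattern
# also matches when it is only a substring of the value.
regexes    = ids.map { |id| Regexp.new(id) }
regex_hits = records.select { |r| regexes.any? { |regex| r[:ID] =~ regex } }
```

The exact lookup keeps only id_42, whereas the regex search also keeps id_420 because id_42 occurs inside it.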

Constants

STATS

Public Class Methods

new(options)

Constructor for the Grab class.

@param [Hash] options Options hash.

@option options [String, Array] :select

Patterns or list of patterns to select records.

@option options [String] :select_file

File path with patterns, one per line, to select records.

@option options [String, Array] :reject

Patterns or list of patterns to reject records.

@option options [String] :reject_file

File path with patterns, one per line, to reject records.

@option options [String] :evaluate

Expression that is evaluated to select records.

@option options [Boolean] :exact

Flag indicating that a given pattern must match over its entire length.

@option options [Symbol, Array] :keys

Key or list of keys whose key/value pairs to grab for.

@option options [Boolean] :keys_only

Flag indicating that only keys are searched, not values.

@option options [Boolean] :values_only

Flag indicating that only values are searched, not keys.

@option options [Boolean] :ignore_case

Flag indicating that pattern matching should be case insensitive.

@return [Grab] Returns an instance of the class.

# File lib/BioDSL/commands/grab.rb, line 183
def initialize(options)
  @options = options

  check_options

  @keys_only = @options[:keys_only]
  @vals_only = @options[:values_only]
  @invert    = @options[:reject] || @options[:reject_file]
  @eval      = @options[:evaluate]
  @exact     = nil
  @regex     = nil
  @keys      = nil
end

Public Instance Methods

lmb()

Return a lambda for the grab command.

@return [Proc] Returns the grab command lambda.

# File lib/BioDSL/commands/grab.rb, line 200
def lmb
  lambda do |input, output, status|
    status_init(status, STATS)
    compile_keys
    compile_exact
    compile_regexes

    input.each do |record|
      @status[:records_in] += 1

      match = case
              when @exact then exact_match? record
              when @regex then regex_match? record
              when @eval  then eval_match? record
              end

      emit_match(output, record, match)
    end
  end
end
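
To show the calling convention of such a lambda, here is a simplified stand-in that can be run without BioDSL. It assumes the input is any Enumerable of hashes, the output is anything that responds to <<, and the status is a plain hash passed in by the caller; the name grab_like and the hard-coded pattern are hypothetical.

```ruby
# Simplified stand-in for a command lambda; not the BioDSL implementation.
grab_like = lambda do |input, output, status|
  input.each do |record|
    status[:records_in] += 1

    # Hard-coded pattern for the sketch: keep records with 'human' in a value.
    next unless record.values.any? { |value| value.to_s =~ /human/ }

    output << record
    status[:records_out] += 1
  end
end

status = Hash.new(0) # counters default to zero
output = []
input  = [{ SEQ_NAME: 'human_1' }, { SEQ_NAME: 'mouse_1' }]

grab_like.call(input, output, status)
```

After the call, output holds the single matching record and status counts two records in and one record out.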

Private Instance Methods

check_options()

Check the options.

# File lib/BioDSL/commands/grab.rb, line 224
def check_options
  options_allowed(@options, :select, :select_file, :reject, :reject_file,
                  :evaluate, :exact, :keys, :keys_only, :values_only,
                  :ignore_case)
  options_required_unique(@options, :select, :select_file, :reject,
                          :reject_file, :evaluate)
  options_conflict(@options, keys: :evaluate, keys_only: :evaluate,
                             values_only: :evaluate, ignore_case: :evaluate,
                             exact: :evaluate)
  options_unique(@options, :keys_only, :values_only)
  options_files_exist(@options, :select_file, :reject_file)
end
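
The rules these checks appear to enforce can be approximated in plain Ruby as below. This is a sketch under assumptions: validate is a hypothetical method, the exact semantics of the BioDSL option helpers are inferred from their names and arguments, and the file-existence check is left out.

```ruby
# Hypothetical approximation of the option rules; not BioDSL's helper methods.
def validate(options)
  primary   = %i[select select_file reject reject_file evaluate]
  modifiers = %i[exact keys keys_only values_only ignore_case]

  # Only known options are allowed.
  unknown = options.keys - primary - modifiers
  raise ArgumentError, "Disallowed option(s): #{unknown.join(', ')}" unless unknown.empty?

  # Exactly one of the primary options must be given.
  given = primary & options.keys
  raise ArgumentError, 'Exactly one primary option is required' unless given.size == 1

  # evaluate cannot be combined with any of the modifiers.
  if options.key?(:evaluate) && !(modifiers & options.keys).empty?
    raise ArgumentError, 'evaluate conflicts with the other options'
  end

  # keys_only and values_only exclude each other.
  if options[:keys_only] && options[:values_only]
    raise ArgumentError, 'keys_only and values_only are mutually exclusive'
  end

  true
end
```

For example, validate(select: 'human', keys: :SEQ_NAME) passes, while validate(select: 'a', reject: 'b') raises an ArgumentError.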
compile_exact()

Compile a lookup hash for fast exact matching.

@return [Hash] Lookup hash of exact patterns.

# File lib/BioDSL/commands/grab.rb, line 320
def compile_exact
  return unless @options[:exact]

  @exact = {}

  compile_exact_patterns(@options[:select])
  compile_exact_patterns(@options[:reject])
  compile_exact_file(@options[:select_file])
  compile_exact_file(@options[:reject_file])
end
compile_exact_file(file)

Compile a lookup hash from a given file with one pattern per line.

@param file [String] Path to file with patterns.

# File lib/BioDSL/commands/grab.rb, line 349
def compile_exact_file(file)
  return unless file

  File.open(file) do |ios|
    ios.each_line do |line|
      pattern = line.chomp!

      type = pattern.to_num.class.to_s.to_sym unless type

      if type == :String
        @exact[pattern.to_sym] = true
      else
        @exact[pattern] = true
      end
    end
  end
end
compile_exact_patterns(patterns)

Compile a lookup hash for a given list of patterns.

@param patterns [Array] List of patterns.

# File lib/BioDSL/commands/grab.rb, line 334
def compile_exact_patterns(patterns)
  return unless patterns

  [patterns].flatten.each do |pattern|
    if pattern.class == String
      @exact[pattern.to_sym] = true
    else
      @exact[pattern] = true
    end
  end
end
compile_keys()

Compile a list of keys from the options hash, which may contain either a list of keys, a symbol or a comma-separated string of keys.

# File lib/BioDSL/commands/grab.rb, line 255
def compile_keys
  return unless @options[:keys]

  @keys = case @options[:keys].class.to_s
          when 'Array'
            @options[:keys].map(&:to_sym)
          when 'Symbol'
            [@options[:keys]]
          when 'String'
            @options[:keys].split(/, */).map do |key|
              key.sub(/^:/, '').to_sym
            end
          end
end
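
The three accepted forms of the keys option all end up as an array of symbols. A standalone sketch of that normalization follows; normalize_keys is a hypothetical name used only here.

```ruby
# Hypothetical standalone version of the key normalization shown above.
def normalize_keys(keys)
  case keys
  when Array  then keys.map(&:to_sym)
  when Symbol then [keys]
  when String
    # Split on commas with optional spaces and drop a leading colon, if any.
    keys.split(/, */).map { |key| key.sub(/^:/, '').to_sym }
  end
end

from_array  = normalize_keys(%w[SEQ ID])
from_symbol = normalize_keys(:ID)
from_string = normalize_keys(':SEQ_NAME, SEQ')
```

The string ':SEQ_NAME, SEQ' yields the symbols SEQ_NAME and SEQ, so keys may be written with or without the leading colon.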
compile_regex_file(file)

Compile a list of regex from a given file with one pattern per line.

@param file [String] Path to file with patterns.

# File lib/BioDSL/commands/grab.rb, line 301
def compile_regex_file(file)
  return unless file

  File.open(file) do |ios|
    ios.each_line do |line|
      line.chomp!

      if @options[:ignore_case]
        @regex << Regexp.new(/#{line}/i)
      else
        @regex << Regexp.new(/#{line}/)
      end
    end
  end
end
compile_regex_patterns(patterns)

Compile a list of regex from a list of given patterns.

@param patterns [Array] List of patterns.

# File lib/BioDSL/commands/grab.rb, line 286
def compile_regex_patterns(patterns)
  return unless patterns

  [patterns].flatten.each do |pattern|
    if @options[:ignore_case]
      @regex << Regexp.new(/#{pattern}/i)
    else
      @regex << Regexp.new(/#{pattern}/)
    end
  end
end
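
The effect of ignore_case on the compiled patterns can be sketched with the standard Regexp API. The compile method below is a hypothetical simplification that returns the list instead of appending to an instance variable.

```ruby
# Hypothetical simplification: compile one pattern or a list of patterns.
def compile(patterns, ignore_case: false)
  flags = ignore_case ? Regexp::IGNORECASE : 0

  [patterns].flatten.map { |pattern| Regexp.new(pattern.to_s, flags) }
end

sensitive   = compile('human')
insensitive = compile(%w[human mouse], ignore_case: true)

value = 'Homo sapiens (Human)'

sensitive_hit   = sensitive.any? { |regex| value =~ regex }
insensitive_hit = insensitive.any? { |regex| value =~ regex }
```

The case-sensitive pattern misses 'Human', while the case-insensitive list matches it. Because patterns are compiled as regexes, characters such as . or | keep their regex meaning; Ruby's Regexp.escape can be applied to a pattern beforehand if a literal match is wanted.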
compile_regexes()

Compile a list of regexes for matching.

# File lib/BioDSL/commands/grab.rb, line 271
def compile_regexes
  return if @options[:exact]
  return if @options[:evaluate]

  @regex = []

  compile_regex_patterns(@options[:select])
  compile_regex_patterns(@options[:reject])
  compile_regex_file(@options[:select_file])
  compile_regex_file(@options[:reject_file])
end
emit_match(output, record, match)

Emit a record to the output stream if a match was found and matching is not inverted, or if no match was found and matching is inverted.

@param output [Enumerator::Yielder] Output stream.

@param record [Hash] Record to emit.

@param match [Boolean] Flag indicating a positive match.

# File lib/BioDSL/commands/grab.rb, line 243
def emit_match(output, record, match)
  if match && !@invert
    output << record
    @status[:records_out] += 1
  elsif !match && @invert
    output << record
    @status[:records_out] += 1
  end
end
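
The emit decision is an exclusive or of the match result and the invert flag. A small sketch, with emit? as a hypothetical helper name:

```ruby
# Hypothetical helper: a record is emitted when exactly one of
# "matched" and "inverted" is true.
def emit?(match, invert)
  # Double negation turns nil and other values into strict booleans.
  !!match != !!invert
end

select_hit  = emit?(true, false)  # select, pattern found
select_miss = emit?(false, false) # select, pattern not found
reject_hit  = emit?(true, true)   # reject, pattern found
reject_miss = emit?(false, true)  # reject, pattern not found
```

With select, only matching records pass; with reject, only non-matching records pass.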
eval_match?(record)

Match using eval expression on record values.

@param record [Hash] Record to match.

@return [Boolean] True if eval match found.

# File lib/BioDSL/commands/grab.rb, line 517
def eval_match?(record)
  expr = []

  @eval.split("\s").each do |item|
    if item[0] == ':'
      key = item[1..-1].to_sym

      return false unless record[key]

      expr << record[key]
    else
      expr << item
    end
  end

  eval expr.join(' ')
end
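
The substitution step can be tried in isolation with the sketch below. substitute is a hypothetical name, and the method returns the expression string (or nil when a key is missing) rather than evaluating it, so the intermediate result is visible.

```ruby
# Hypothetical standalone version of the token substitution shown above.
def substitute(expression, record)
  tokens = expression.split(' ').map do |item|
    if item.start_with?(':')
      value = record[item[1..-1].to_sym]

      # A missing key means the record cannot fulfill the expression.
      return nil unless value

      value
    else
      item
    end
  end

  tokens.join(' ')
end

expanded = substitute(':SEQ_LEN > 30', { SEQ_LEN: 45 })
missing  = substitute(':SEQ_LEN > 30', { SEQ: 'ATCG' })
result   = eval(expanded)
```

Here expanded is the string "45 > 30", which evaluates to true, and missing is nil. Values are inserted as they are, without quoting, and tokens are split on whitespace, so each :KEY has to stand alone as its own word in the expression.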
exact_match?(record)

Match exactly record keys or values.

@param record [Hash] Record to match.

@return [Boolean] True if exact match found.

# File lib/BioDSL/commands/grab.rb, line 372
def exact_match?(record)
  keys = @keys || record.keys

  if @keys_only
    exact_match_keys?(keys)
  elsif @vals_only
    exact_match_values?(record, keys)
  else
    exact_match_key_values?(record, keys)
  end
end
exact_match_key_values?(record, keys)

Match exactly any record keys or values.

@param record [Hash] Record to match.

@param keys [Array] List of keys or values to match.

@return [Boolean] True if exact match found.

# File lib/BioDSL/commands/grab.rb, line 425
def exact_match_key_values?(record, keys)
  keys.each do |key|
    return true if @exact.include?(key)

    value = record[key]

    next unless value

    if value.class == String
      return true if @exact.include?(value.to_sym)
    else
      return true if @exact.include?(value)
    end
  end

  false
end
exact_match_keys?(keys)

Match exactly any record keys.

@param keys [Array] List of keys to match.

@return [Boolean] True if exact match found.

# File lib/BioDSL/commands/grab.rb, line 389
def exact_match_keys?(keys)
  keys.each do |key|
    return true if @exact[key]
  end

  false
end
exact_match_values?(record, keys)

Match exactly any record values.

@param record [Hash] Record to match.

@param keys [Array] List of keys whose values to match.

@return [Boolean] True if exact match found.

# File lib/BioDSL/commands/grab.rb, line 403
def exact_match_values?(record, keys)
  keys.each do |key|
    value = record[key]

    next unless value

    if value.class == String
      return true if @exact.include?(value.to_sym)
    else
      return true if @exact.include?(value)
    end
  end

  false
end
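regex_match?(record)

Match using regex record keys or values.

@param record [Hash] Record to match.

@return [Boolean] True if regex match found.

# File lib/BioDSL/commands/grab.rb, line 443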
def regex_match?(record)
  keys = @keys || record.keys

  if @keys_only
    regex_match_keys?(keys)
  elsif @vals_only
    regex_match_values?(record, keys)
  else
    regex_match_key_values?(record, keys)
  end
end
regex_match_key_values?(record, keys)

Match using regex any record keys or values.

@param record [Hash] Record to match.

@param keys [Array] List of keys or values to match.

@return [Boolean] True if regex match found.

# File lib/BioDSL/commands/grab.rb, line 495
def regex_match_key_values?(record, keys)
  keys.each do |key|
    @regex.each do |regex|
      return true if key.to_s =~ regex
    end

    next unless record[key]
    value = record[key]

    @regex.each do |regex|
      return true if value.to_s =~ regex
    end
  end

  false
end
regex_match_keys?(keys)

Match using regex any record keys.

@param keys [Array] List of keys to match.

@return [Boolean] True if regex match found.

# File lib/BioDSL/commands/grab.rb, line 460
def regex_match_keys?(keys)
  keys.each do |key|
    @regex.each do |regex|
      return true if key.to_s =~ regex
    end
  end

  false
end
regex_match_values?(record, keys)

Match using regex any record values.

@param record [Hash] Record to match.

@param keys [Array] List of keys whose values to match.

@return [Boolean] True if regex match found.

# File lib/BioDSL/commands/grab.rb, line 476
def regex_match_values?(record, keys)
  keys.each do |key|
    next unless record[key]
    value = record[key]

    @regex.each do |regex|
      return true if value.to_s =~ regex
    end
  end

  false
end