class GeneValidator::BlastUtils

Contains methods that run BLAST and methods that analyse sequences

Constants

EVALUE

Public Class Methods

guess_sequence_type(sequence_string) click to toggle source

Strips all non-letter characters. guestimates sequence based on that. If less than 10 useable characters… returns nil If more than 90% ACGTU returns :nucleotide. else returns :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein

# File lib/genevalidator/blast.rb, line 105
def guess_sequence_type(sequence_string)
  # removing non-letter and ambiguous characters
  cleaned_sequence = sequence_string.gsub(/[^A-Z]|[NX]/i, '')
  return nil if cleaned_sequence.length < 10 # conservative

  type = Bio::Sequence.new(cleaned_sequence).guess(0.9)
  type == Bio::Sequence::NA ? :nucleotide : :protein
end
guess_sequence_type_from_input_file(file = opt[:input_fasta_file]) click to toggle source
# File lib/genevalidator/blast.rb, line 116
def guess_sequence_type_from_input_file(file = opt[:input_fasta_file])
  lines = File.foreach(file).first(10)
  seqs = ''
  lines.each { |l| seqs += l.chomp unless l[0] == '>' }
  guess_sequence_type(seqs)
end
parse_next(iterator) click to toggle source

Parses the next query from the blast xml output query Params: iterator: blast xml iterator for hits type: the type of the sequence: :nucleotide or :protein Outputs: Array of Sequence objects corresponding to the list of hits

# File lib/genevalidator/blast.rb, line 53
def parse_next(iterator)
  iter = iterator.next

  # parse blast the xml output and get the hits
  # hits obtained are proteins! (we use only blastp and blastx)
  hits = []
  iter.each do |hit|
    seq                = Query.new
    seq.length_protein = hit.len.to_i
    seq.type           = :protein
    seq.identifier     = hit.hit_id
    seq.definition     = hit.hit_def
    seq.accession_no   = hit.accession
    seq.hsp_list       = hit.hsps.map { |hsp| Hsp.new(xml_input: hsp) }

    hits << seq
  end
  hits
rescue StopIteration
  nil
end
run_blast_on_input_file() click to toggle source

Runs BLAST on an input file Params: blast_type: blast command in String format (e.g 'blastx' or 'blastp') opt: Hash made of :input_fasta_file :blast_xml_file, :db, :num_threads gapopen: gapopen blast parameter gapextend: gapextend blast parameter nr_hits: max number of hits Output: XML file

# File lib/genevalidator/blast.rb, line 28
def run_blast_on_input_file
  remote = opt[:db].match?(/remote/) ? true : false
  print_blast_info_text(remote)

  log_file = File.join(dirs[:tmp_dir], 'blast_cmd_output.txt')
  `#{blast_cmd(opt, config, remote)} > #{log_file} 2>&1`

  return unless File.zero?(opt[:blast_xml_file])
  warn 'Blast failed to run on the input file.'
  if remote
    warn 'You are using BLAST with a remote database. Please'
    warn 'ensure that you have internet access and try again.'
  else
    warn 'Please ensure that the BLAST database exists and try again.'
  end
  exit 1
end
type_of_sequences(fasta_format_string) click to toggle source

Method copied from sequenceserver/sequencehelpers.rb Splits input at putative fasta definition lines (like “>adsfadsf”); then guesses sequence type for each sequence. If not enough sequence to determine, returns nil. If 2 kinds of sequence mixed together, raises ArgumentError Otherwise, returns :nucleotide or :protein Params: sequence_string: String to validate Output: nil, :nucleotide or :protein

# File lib/genevalidator/blast.rb, line 86
def type_of_sequences(fasta_format_string)
  # the first sequence does not need to have a fasta definition line
  sequences = fasta_format_string.split(/^>.*$/).delete_if(&:empty?)
  # get all sequence types
  sequence_types = sequences.collect { |seq| guess_sequence_type(seq) }
                            .uniq.compact

  return nil if sequence_types.empty?
  sequence_types.first if sequence_types.length == 1
end

Private Class Methods

blast_cmd(opt, config, remote) click to toggle source
# File lib/genevalidator/blast.rb, line 125
def blast_cmd(opt, config, remote)
  blast_type = config[:type] == :protein ? 'blastp' : 'blastx'
  # -num_threads is not supported on remote databases
  threads = remote ? '' : "-num_threads #{opt[:num_threads]}"

  "#{blast_type} -query '#{opt[:input_fasta_file]}'" \
  " -db #{opt[:db]} -outfmt 5 -evalue #{EVALUE} #{threads}" \
  " -out '#{opt[:blast_xml_file]}' #{opt[:blast_options]}"
end
print_blast_info_text(remote) click to toggle source