class Bio::Velvet::Graph

Parser for a velvet assembler’s graph file (Graph or LastGraph) output from velvetg

The definition of this file is given in the velvet manual, at www.ebi.ac.uk/~zerbino/velvet/Manual.pdf

Attributes

arcs[RW]

ArcArray of Arc objects

hash_length[RW]

Taken directly from the graph, statistics and information about the Graph i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”

nodes[RW]

NodeArray object of all the graph’s node objects

number_of_nodes[RW]

Taken directly from the graph, statistics and information about the Graph i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”

number_of_sequences[RW]

Taken directly from the graph, statistics and information about the Graph i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”

Public Class Methods

log() click to toggle source
# File lib/bio-velvet/graph.rb, line 25
def self.log
  self.new.log
end
parse_from_file(path_to_graph_file, options={}) click to toggle source

Parse a graph file from a Graph, Graph2 or LastGraph output file from velvet into a Bio::Velvet::Graph object

Options:

  • :dont_parse_noded_reads: if true, then parsing of the NR section is skipped

  • :interesting_read_ids: If not nil, is a Set of nodes that we are interested in. Reads

not of interest will not be parsed in (the NR part of the velvet LastGraph file). Regardless all nodes and edges are parsed in. Using this options saves both memory and CPU.

  • :interesting_node_ids: like :interesting_read_ids except it allows targeting of particular nodes

rather than particular reads.

  • :grep_hack: to make the parsing of read associations go even faster, a grep-based, rather

hacky method is applied to the graph file, so only NR data of interesting_read_ids is presented to the parser. This can save days of parsing time, but is a bit of a hack and its usage may not be particularly future-proof. The value of this option is the amount of context coming out of grep (the -A and -B flags). Using 500 should probably work for most circumstances, if not an Exception will be raised.

# File lib/bio-velvet/graph.rb, line 44
def self.parse_from_file(path_to_graph_file, options={})
  graph = self.new
  state = :header

  current_node = nil
  graph.nodes = NodeArray.new
  graph.arcs = ArcArray.new
  current_node_direction = nil

  line_number = 0
  Hopcsv.foreach(path_to_graph_file,"\t") do |row|
    line_number += 1

    if state == :header
      raise "parse exception on header line, this line #{line_number}: #{row.inspect}" unless row.length >= 3
      graph.number_of_nodes = row[0].to_i
      graph.number_of_sequences = row[1].to_i
      graph.hash_length = row[2].to_i
      #Not quite sure what the function of the 4th column is
      state = :nodes_0
      log.debug "Now parsing velvet graph nodes" if log.debug?
      next
    end

    if state == :nodes_0
      # NODE $NODE_ID $COV_SHORT1 $O_COV_SHORT1 $COV_SHORT2 $O_COV_SHORT2
      # $ENDS_OF_KMERS_OF_NODE
      # $ENDS_OF_KMERS_OF_TWIN_NODE
      if row[0] == 'NODE'
        raise unless row.length > 2
        current_node = Node.new
        current_node.node_id = row[1].to_i
        current_node.length = row[2].to_i
        current_node.coverages = row[3...row.length].collect{|c| c.to_i}
        current_node.parent_graph = graph
        state = :nodes_1
        raise "Duplicate node name" unless graph.nodes[current_node.node_id].nil?
        graph.nodes[current_node.node_id] = current_node
        next
      else
        state = :arc
        log.debug "Now parsing velvet graph arcs" if log.debug?
        # No next in the loop so that this line gets parsed as an ARC further down the loop
      end
    elsif state == :nodes_1
      # Sometimes nodes can be empty
      row[0] ||= ''
      current_node.ends_of_kmers_of_node = row[0]
      raise "Unexpected nodes_1 type line on line #{line_number}: #{row.inspect}" if row.length != 1
      state = :nodes_2
      next
    elsif state == :nodes_2
      # Sometimes nodes can be empty
      row[0] ||= ''
      raise if row.length != 1
      current_node.ends_of_kmers_of_twin_node = row[0]
      state = :nodes_0
      next
    end

    if state == :arc
      if row[0] == 'ARC'
        # ARC $START_NODE $END_NODE $MULTIPLICITY
        arc = Arc.new
        raise unless row.length == 4
        arc.begin_node_id = row[1].to_i.abs
        arc.end_node_id = row[2].to_i.abs
        arc.multiplicity = row[3].to_i
        arc.begin_node_direction = (row[1].to_i > 0)
        arc.end_node_direction = (row[2].to_i > 0)
        graph.arcs.push arc
        next
      else
        state = :nr
        log.debug "Finished parsing velvet graph arcs. Now parsing the rest of the file" if log.debug?
      end
    end

    if state == :nr
      if row[0] == 'SEQ'
        log.warn "velvet graph parse warning: SEQ lines in the Graph file parsing not implemented yet, tracking of reads now not parsed either"
        break
      end

      # If short reads are tracked, for every node a block of read identifiers:
      # NR $NODE_ID $NUMBER_OF_SHORT_READS
      # $READ_ID $OFFSET_FROM_START_OF_NODE $START_COORD
      # $READ_ID2 etc.
      #p row
      if row[0] == 'NR'
        break if options[:dont_parse_noded_reads] # We are done if NR things aren't parsed
        if options[:grep_hack]
          unless options[:interesting_read_ids] or options[:interesting_node_ids]
            raise "Programming error using bio-velvet: if :grep_hack is specified, then :interesting_read_ids or :interesting_node_ids must also be"
          end
          apply_grep_hack graph, path_to_graph_file, options[:interesting_read_ids], options[:interesting_node_ids], options[:grep_hack]
          break #no more parsing is required
        else
          raise unless row.length == 3
          node_pm = row[1].to_i
          current_node_direction = node_pm > 0
          current_node = graph.nodes[node_pm.abs]
          current_node.number_of_short_reads ||= 0
          current_node.number_of_short_reads += row[2].to_i
          next
        end
      else
        raise unless row.length == 3
        read_id = row[0].to_i
        if (options[:interesting_node_ids] and !options[:interesting_node_ids].include?(current_node.node_id)) or
          (options[:interesting_read_ids] and !options[:interesting_read_ids].include?(read_id))
          # We have come across an uninteresting read. Ignore it.
          next
        end
        nr = NodedRead.new
        nr.read_id = read_id
        nr.offset_from_start_of_node = row[1].to_i
        nr.start_coord = row[2].to_i
        nr.direction = current_node_direction
        current_node.short_reads ||= NodedReadArray.new
        current_node.short_reads.push nr
        next
      end
    end
  end
  log.debug "Finished parsing velvet graph file" if log.debug?

  return graph
end

Private Class Methods

apply_grep_hack(graph, path_to_graph_file, interesting_read_ids, interesting_node_ids, grep_context) click to toggle source
# File lib/bio-velvet/graph.rb, line 583
def self.apply_grep_hack(graph, path_to_graph_file, interesting_read_ids, interesting_node_ids, grep_context)
  interesting_read_ids ||= []
  interesting_node_ids ||= []
  if interesting_read_ids.empty? and interesting_node_ids.empty?
    log.debug "Nothing to grep for in grep hack" if log.debug?
    return
  end

  Tempfile.open('grep_v_hack') do |tempfile|
    # Create a file to pass to grep -f
    unless interesting_read_ids.nil?
      interesting_read_ids.each do |read_id|
        tempfile.puts "^#{read_id}\t"
      end
    end
    unless interesting_node_ids.nil?
      interesting_node_ids.each do |node_id|
        tempfile.puts "^NR\t#{node_id}\t"
        tempfile.puts "^NR\t-#{node_id}\t"
      end
    end
    tempfile.close

    cmd = "grep -B #{grep_context.inspect} -A #{grep_context.inspect} -f #{tempfile.path} #{path_to_graph_file.inspect}"
    # TODO: make this call more robust
    # grep_result = Bio::Commandeer.run cmd
    s, grep_result, stderr = systemu cmd

    # Parse the grepped out results
    current_node = nil
    current_node_direction = nil
    in_nr_section = false
    grep_result.each_line do |line|
      row = line.split("\t")
      if in_nr_section == false
        # If there is a lot of context then the context includes ARC definitions etc. Skip past this.
        if row[0] == 'NR'
          in_nr_section = true
        elsif row[0] == '--'
          raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving"
        else
          next #skip to next line, waiting to ge into NR section
        end
      end

      if line == "--\n" #the break introduced by grep
        # If we encounter a grep break, but haven't assigned any nodes, then that's not good enough
        if current_node.nil?
          raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving"
        end
        # reset the parsing situation
        current_node = nil
      elsif row[0] == 'NR'
        raise unless row.length == 3
        node_pm = row[1].to_i
        current_node_direction = node_pm > 0
        current_node = graph.nodes[node_pm.abs]
        current_node.number_of_short_reads ||= 0
        current_node.number_of_short_reads += row[2].to_i
        next
      else
        raise unless row.length == 3
        read_id = row[0].to_i
        if (current_node.nil? or !interesting_node_ids.include?(current_node.node_id)) and
          !interesting_read_ids.include?(read_id)
          # We have come across an uninteresting read. Ignore it.
          next
        end
        if current_node.nil?
          # Came across a high coverage node, and grep isn't giving enough context. Hopefully this won't happen much
          # particularly if the reads you are interested in are given to velvet first
          raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving"
        end
        nr = NodedRead.new
        nr.read_id = read_id
        nr.offset_from_start_of_node = row[1].to_i
        nr.start_coord = row[2].to_i
        nr.direction = current_node_direction
        current_node.short_reads ||= NodedReadArray.new
        current_node.short_reads.push nr
        next
      end
    end

    if current_node.nil?
      raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving"
    end
  end
end

Public Instance Methods

delete_nodes_if() { |node| ... } click to toggle source

Deletes nodes and associated arcs from the graph if the block passed evaluates to true (as in Array#delete_if). Actually the associated arcs are deleted first, and then the node, so that the graph remains sane at all times - there is never dangling arcs, as such.

Returns a [deleted_nodes, deleted_arcs] tuple, which are both enumerables, each in no particular order.

# File lib/bio-velvet/graph.rb, line 225
def delete_nodes_if(&block)
  deleted_nodes = []
  deleted_arcs = []
  nodes.each do |node|
    if yield(node)
      deleted_nodes.push node

      # delete associated arcs
      arcs_to_del = @arcs.get_arcs_by_node_id(node.node_id)
      deleted_arcs.push arcs_to_del
      arcs_to_del.each do |arc|
        @arcs.delete arc
      end

      # delete the arc itself
      nodes.delete node
    end
  end
  return deleted_nodes, deleted_arcs.flatten
end
get_arcs_by_node(node1, node2) click to toggle source

Return an array of Arc objects between two nodes (specified by node objects), or an empty array if none exists. There is four possible arcs between two nodes, connecting their beginnings and ends

# File lib/bio-velvet/graph.rb, line 184
def get_arcs_by_node(node1, node2)
  @arcs.get_arcs_by_node_id(node1.node_id, node2.node_id)
end
get_arcs_by_node_id(node_id1, node_id2) click to toggle source

Return an array of Arc objects between two nodes (specified by integer IDs), or an empty array if none exists. There is four possible arcs between two nodes, connecting their beginnings and ends

# File lib/bio-velvet/graph.rb, line 177
def get_arcs_by_node_id(node_id1, node_id2)
  @arcs.get_arcs_by_node_id(node_id1, node_id2)
end
neighbours_into_start(node) click to toggle source

Return the adjacent nodes in the graph that connect to the end of a node

# File lib/bio-velvet/graph.rb, line 204
def neighbours_into_start(node)
  # Find all arcs that include this node in the right place
  passable_nodes = []
  @arcs.get_arcs_by_node_id(node.node_id).each do |arc|
    if arc.end_node_id == node.node_id and arc.end_node_direction
      passable_nodes.push nodes[arc.begin_node_id]
    elsif arc.begin_node_id == node.node_id and !arc.begin_node_direction
      passable_nodes.push nodes[arc.end_node_id]
    end
  end
  return passable_nodes.uniq
end
neighbours_off_end(node) click to toggle source

Return the adjacent nodes in the graph that connect to the end of a node

# File lib/bio-velvet/graph.rb, line 189
def neighbours_off_end(node)
  # Find all arcs that include this node in the right place
  passable_nodes = []
  @arcs.get_arcs_by_node_id(node.node_id).each do |arc|
    if arc.begin_node_id == node.node_id and arc.begin_node_direction
      # The most intuitive case
      passable_nodes.push nodes[arc.end_node_id]
    elsif arc.end_node_id == node.node_id and !arc.end_node_direction
      passable_nodes.push nodes[arc.begin_node_id]
    end
  end
  return passable_nodes.uniq
end
parse_additional_noded_reads(path_to_graph_file, options) click to toggle source

Add more noded reads to this already parsed graph. There is no gaurantee that old NodedRead information is preserved, or removed.

Options:

  • :interesting_read_ids: If not nil, is a Set of nodes that we are interested in. Reads

not of interest will not be parsed in (the NR part of the velvet LastGraph file). Regardless all nodes and edges are parsed in. Using this options saves both memory and CPU.

  • :interesting_node_ids: like :interesting_read_ids except it allows targeting of particular nodes

rather than particular reads.

  • :grep_hack: to make the parsing of read associations go even faster, a grep-based, rather

hacky method is applied to the graph file, so only NR data of interesting_read_ids is presented to the parser. This can save days of parsing time, but is a bit of a hack and its usage may not be particularly future-proof. The value of this option is the amount of context coming out of grep (the -A and -B flags). Using 500 should probably work for most circumstances, if not an Exception will be raised.

# File lib/bio-velvet/graph.rb, line 260
def parse_additional_noded_reads(path_to_graph_file, options)
  grep_context = options[:grep_hack]
  if grep_context.nil?
    raise "Calling Graph#parse_additional_noded_reads without specifying :grep_hack is currently not implemented"
  end
  self.class.apply_grep_hack(self,
    path_to_graph_file,
    options[:interesting_read_ids],
    options[:interesting_node_ids],
    grep_context
    )
end