class Bio::Velvet::Graph
Parser for a velvet assembler’s graph file (Graph
or LastGraph) output from velvetg
The definition of this file is given in the velvet manual, at www.ebi.ac.uk/~zerbino/velvet/Manual.pdf
Attributes
Taken directly from the graph, statistics and information about the Graph
i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”
NodeArray
object of all the graph’s node objects
Taken directly from the graph, statistics and information about the Graph
i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”
Taken directly from the graph, statistics and information about the Graph
i.e. from the velvet manual “$NUMBER_OF_NODES $NUMBER_OF_SEQUENCES $HASH_LENGTH”
Public Class Methods
# File lib/bio-velvet/graph.rb, line 25 def self.log self.new.log end
Parse a graph file from a Graph
, Graph2 or LastGraph output file from velvet into a Bio::Velvet::Graph
object
Options:
-
:dont_parse_noded_reads: if true, then parsing of the NR section is skipped
-
:interesting_read_ids: If not nil, is a Set of nodes that we are interested in. Reads
not of interest will not be parsed in (the NR part of the velvet LastGraph file). Regardless all nodes and edges are parsed in. Using this options saves both memory and CPU.
-
:interesting_node_ids: like :interesting_read_ids except it allows targeting of particular nodes
rather than particular reads.
-
:grep_hack: to make the parsing of read associations go even faster, a grep-based, rather
hacky method is applied to the graph file, so only NR data of interesting_read_ids is presented to the parser. This can save days of parsing time, but is a bit of a hack and its usage may not be particularly future-proof. The value of this option is the amount of context coming out of grep (the -A and -B flags). Using 500 should probably work for most circumstances, if not an Exception will be raised.
# File lib/bio-velvet/graph.rb, line 44 def self.parse_from_file(path_to_graph_file, options={}) graph = self.new state = :header current_node = nil graph.nodes = NodeArray.new graph.arcs = ArcArray.new current_node_direction = nil line_number = 0 Hopcsv.foreach(path_to_graph_file,"\t") do |row| line_number += 1 if state == :header raise "parse exception on header line, this line #{line_number}: #{row.inspect}" unless row.length >= 3 graph.number_of_nodes = row[0].to_i graph.number_of_sequences = row[1].to_i graph.hash_length = row[2].to_i #Not quite sure what the function of the 4th column is state = :nodes_0 log.debug "Now parsing velvet graph nodes" if log.debug? next end if state == :nodes_0 # NODE $NODE_ID $COV_SHORT1 $O_COV_SHORT1 $COV_SHORT2 $O_COV_SHORT2 # $ENDS_OF_KMERS_OF_NODE # $ENDS_OF_KMERS_OF_TWIN_NODE if row[0] == 'NODE' raise unless row.length > 2 current_node = Node.new current_node.node_id = row[1].to_i current_node.length = row[2].to_i current_node.coverages = row[3...row.length].collect{|c| c.to_i} current_node.parent_graph = graph state = :nodes_1 raise "Duplicate node name" unless graph.nodes[current_node.node_id].nil? graph.nodes[current_node.node_id] = current_node next else state = :arc log.debug "Now parsing velvet graph arcs" if log.debug? # No next in the loop so that this line gets parsed as an ARC further down the loop end elsif state == :nodes_1 # Sometimes nodes can be empty row[0] ||= '' current_node.ends_of_kmers_of_node = row[0] raise "Unexpected nodes_1 type line on line #{line_number}: #{row.inspect}" if row.length != 1 state = :nodes_2 next elsif state == :nodes_2 # Sometimes nodes can be empty row[0] ||= '' raise if row.length != 1 current_node.ends_of_kmers_of_twin_node = row[0] state = :nodes_0 next end if state == :arc if row[0] == 'ARC' # ARC $START_NODE $END_NODE $MULTIPLICITY arc = Arc.new raise unless row.length == 4 arc.begin_node_id = row[1].to_i.abs arc.end_node_id = row[2].to_i.abs arc.multiplicity = row[3].to_i arc.begin_node_direction = (row[1].to_i > 0) arc.end_node_direction = (row[2].to_i > 0) graph.arcs.push arc next else state = :nr log.debug "Finished parsing velvet graph arcs. Now parsing the rest of the file" if log.debug? end end if state == :nr if row[0] == 'SEQ' log.warn "velvet graph parse warning: SEQ lines in the Graph file parsing not implemented yet, tracking of reads now not parsed either" break end # If short reads are tracked, for every node a block of read identifiers: # NR $NODE_ID $NUMBER_OF_SHORT_READS # $READ_ID $OFFSET_FROM_START_OF_NODE $START_COORD # $READ_ID2 etc. #p row if row[0] == 'NR' break if options[:dont_parse_noded_reads] # We are done if NR things aren't parsed if options[:grep_hack] unless options[:interesting_read_ids] or options[:interesting_node_ids] raise "Programming error using bio-velvet: if :grep_hack is specified, then :interesting_read_ids or :interesting_node_ids must also be" end apply_grep_hack graph, path_to_graph_file, options[:interesting_read_ids], options[:interesting_node_ids], options[:grep_hack] break #no more parsing is required else raise unless row.length == 3 node_pm = row[1].to_i current_node_direction = node_pm > 0 current_node = graph.nodes[node_pm.abs] current_node.number_of_short_reads ||= 0 current_node.number_of_short_reads += row[2].to_i next end else raise unless row.length == 3 read_id = row[0].to_i if (options[:interesting_node_ids] and !options[:interesting_node_ids].include?(current_node.node_id)) or (options[:interesting_read_ids] and !options[:interesting_read_ids].include?(read_id)) # We have come across an uninteresting read. Ignore it. next end nr = NodedRead.new nr.read_id = read_id nr.offset_from_start_of_node = row[1].to_i nr.start_coord = row[2].to_i nr.direction = current_node_direction current_node.short_reads ||= NodedReadArray.new current_node.short_reads.push nr next end end end log.debug "Finished parsing velvet graph file" if log.debug? return graph end
Private Class Methods
# File lib/bio-velvet/graph.rb, line 583 def self.apply_grep_hack(graph, path_to_graph_file, interesting_read_ids, interesting_node_ids, grep_context) interesting_read_ids ||= [] interesting_node_ids ||= [] if interesting_read_ids.empty? and interesting_node_ids.empty? log.debug "Nothing to grep for in grep hack" if log.debug? return end Tempfile.open('grep_v_hack') do |tempfile| # Create a file to pass to grep -f unless interesting_read_ids.nil? interesting_read_ids.each do |read_id| tempfile.puts "^#{read_id}\t" end end unless interesting_node_ids.nil? interesting_node_ids.each do |node_id| tempfile.puts "^NR\t#{node_id}\t" tempfile.puts "^NR\t-#{node_id}\t" end end tempfile.close cmd = "grep -B #{grep_context.inspect} -A #{grep_context.inspect} -f #{tempfile.path} #{path_to_graph_file.inspect}" # TODO: make this call more robust # grep_result = Bio::Commandeer.run cmd s, grep_result, stderr = systemu cmd # Parse the grepped out results current_node = nil current_node_direction = nil in_nr_section = false grep_result.each_line do |line| row = line.split("\t") if in_nr_section == false # If there is a lot of context then the context includes ARC definitions etc. Skip past this. if row[0] == 'NR' in_nr_section = true elsif row[0] == '--' raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving" else next #skip to next line, waiting to ge into NR section end end if line == "--\n" #the break introduced by grep # If we encounter a grep break, but haven't assigned any nodes, then that's not good enough if current_node.nil? raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving" end # reset the parsing situation current_node = nil elsif row[0] == 'NR' raise unless row.length == 3 node_pm = row[1].to_i current_node_direction = node_pm > 0 current_node = graph.nodes[node_pm.abs] current_node.number_of_short_reads ||= 0 current_node.number_of_short_reads += row[2].to_i next else raise unless row.length == 3 read_id = row[0].to_i if (current_node.nil? or !interesting_node_ids.include?(current_node.node_id)) and !interesting_read_ids.include?(read_id) # We have come across an uninteresting read. Ignore it. next end if current_node.nil? # Came across a high coverage node, and grep isn't giving enough context. Hopefully this won't happen much # particularly if the reads you are interested in are given to velvet first raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving" end nr = NodedRead.new nr.read_id = read_id nr.offset_from_start_of_node = row[1].to_i nr.start_coord = row[2].to_i nr.direction = current_node_direction current_node.short_reads ||= NodedReadArray.new current_node.short_reads.push nr next end end if current_node.nil? raise "Parsing exception - grep hack too hacky. Sorry. Try modifying the code to increase the default amount of context grep is giving" end end end
Public Instance Methods
Deletes nodes and associated arcs from the graph if the block passed evaluates to true (as in Array#delete_if). Actually the associated arcs are deleted first, and then the node, so that the graph remains sane at all times - there is never dangling arcs, as such.
Returns a [deleted_nodes, deleted_arcs] tuple, which are both enumerables, each in no particular order.
# File lib/bio-velvet/graph.rb, line 225 def delete_nodes_if(&block) deleted_nodes = [] deleted_arcs = [] nodes.each do |node| if yield(node) deleted_nodes.push node # delete associated arcs arcs_to_del = @arcs.get_arcs_by_node_id(node.node_id) deleted_arcs.push arcs_to_del arcs_to_del.each do |arc| @arcs.delete arc end # delete the arc itself nodes.delete node end end return deleted_nodes, deleted_arcs.flatten end
Return an array of Arc
objects between two nodes (specified by node objects), or an empty array if none exists. There is four possible arcs between two nodes, connecting their beginnings and ends
# File lib/bio-velvet/graph.rb, line 184 def get_arcs_by_node(node1, node2) @arcs.get_arcs_by_node_id(node1.node_id, node2.node_id) end
Return an array of Arc
objects between two nodes (specified by integer IDs), or an empty array if none exists. There is four possible arcs between two nodes, connecting their beginnings and ends
# File lib/bio-velvet/graph.rb, line 177 def get_arcs_by_node_id(node_id1, node_id2) @arcs.get_arcs_by_node_id(node_id1, node_id2) end
Return the adjacent nodes in the graph that connect to the end of a node
# File lib/bio-velvet/graph.rb, line 204 def neighbours_into_start(node) # Find all arcs that include this node in the right place passable_nodes = [] @arcs.get_arcs_by_node_id(node.node_id).each do |arc| if arc.end_node_id == node.node_id and arc.end_node_direction passable_nodes.push nodes[arc.begin_node_id] elsif arc.begin_node_id == node.node_id and !arc.begin_node_direction passable_nodes.push nodes[arc.end_node_id] end end return passable_nodes.uniq end
Return the adjacent nodes in the graph that connect to the end of a node
# File lib/bio-velvet/graph.rb, line 189 def neighbours_off_end(node) # Find all arcs that include this node in the right place passable_nodes = [] @arcs.get_arcs_by_node_id(node.node_id).each do |arc| if arc.begin_node_id == node.node_id and arc.begin_node_direction # The most intuitive case passable_nodes.push nodes[arc.end_node_id] elsif arc.end_node_id == node.node_id and !arc.end_node_direction passable_nodes.push nodes[arc.begin_node_id] end end return passable_nodes.uniq end
Add more noded reads to this already parsed graph. There is no gaurantee that old NodedRead
information is preserved, or removed.
Options:
-
:interesting_read_ids: If not nil, is a Set of nodes that we are interested in. Reads
not of interest will not be parsed in (the NR part of the velvet LastGraph file). Regardless all nodes and edges are parsed in. Using this options saves both memory and CPU.
-
:interesting_node_ids: like :interesting_read_ids except it allows targeting of particular nodes
rather than particular reads.
-
:grep_hack: to make the parsing of read associations go even faster, a grep-based, rather
hacky method is applied to the graph file, so only NR data of interesting_read_ids is presented to the parser. This can save days of parsing time, but is a bit of a hack and its usage may not be particularly future-proof. The value of this option is the amount of context coming out of grep (the -A and -B flags). Using 500 should probably work for most circumstances, if not an Exception will be raised.
# File lib/bio-velvet/graph.rb, line 260 def parse_additional_noded_reads(path_to_graph_file, options) grep_context = options[:grep_hack] if grep_context.nil? raise "Calling Graph#parse_additional_noded_reads without specifying :grep_hack is currently not implemented" end self.class.apply_grep_hack(self, path_to_graph_file, options[:interesting_read_ids], options[:interesting_node_ids], grep_context ) end