class Bio::EuPathDB::FastaParser
Looks like EuPathDB
databases have settled on something like >gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528 where the species name differs but the rest is mostly constant
Attributes
species_name[RW]
Public Class Methods
new(species_name, filename)
click to toggle source
The species name is what should show up in the 2nd bracket, so something like ‘Toxoplasma_gondii_ME49’ for >gb|TGME49_000380 | organism=Toxoplasma_gondii_ME49 | product=myb-like DNA binding domain-containing protein | location=TGME49_chrVIII:6835359-6840923(-) | length=1528 for instance
# File lib/eupathdb_fasta.rb, line 15 def initialize(species_name, filename) @species_name = species_name @filename = filename end
Public Instance Methods
each() { |n| ... }
click to toggle source
Enumerate through fasta file entries
# File lib/eupathdb_fasta.rb, line 21 def each @flat = Bio::FlatFile.open(Bio::FastaFormat, @filename) n = next_entry while !n.nil? yield n n = next_entry end end
next_entry()
click to toggle source
Return the entry in the fasta file, or nil if there is no more or the fasta file could not be opened correctly.
# File lib/eupathdb_fasta.rb, line 32 def next_entry return nil if !@flat n = @flat.next_entry return nil if !n s = parse_name(n.definition) s.sequence = n.seq return s end
parse_name(definition)
click to toggle source
# File lib/eupathdb_fasta.rb, line 42 def parse_name(definition) s = FastaAnnotation.new regex = /^(\S+)\|(.*?) \| organism=#{@species_name} \| product=(.*?) \| location=(.*) \| length=\d+$/ matches = definition.match(regex) if !matches raise Exception, "Definition line has unexpected format: `#{definition}'. Trying to match this line to the regular expression `#{regex.inspect}'" end matches2 = matches[4].match(/^(.+?)\:/) if !matches2 raise ParseException, "Definition line has unexpected scaffold format: #{matches[4]}" end s.sequencing_centre = matches[1] s.scaffold = matches2[1] s.gene_id = matches[2] s.annotation = matches[3] return s end