class Bio::GFF::GFF3::Record::Gap

Bio:GFF::GFF3::Record::Gap is a class to store data of “Gap” attribute.

Constants

Code

Code is a class to store length of single-letter code.

Attributes

data[R]

Internal data. Users must not use it.

Public Class Methods

new(str = nil) click to toggle source

Creates a new Gap object.


Arguments:

  • str: a formatted string, or nil.

     # File lib/bio/db/gff.rb
1277 def initialize(str = nil)
1278   if str then
1279     @data = str.split(/ +/).collect do |x|
1280       if /\A([A-Z])([0-9]+)\z/ =~ x.strip then
1281         Code.new($1.intern, $2.to_i)
1282       else
1283         warn "ignored unknown token: #{x}.inspect" if $VERBOSE
1284         nil
1285       end
1286     end
1287     @data.compact!
1288   else
1289     @data = []
1290   end
1291 end
new_from_sequences_na(reference, target, gap_regexp = /[^a-zA-Z]/) click to toggle source

Creates a new Gap object from given sequence alignment.

Note that sites of which both reference and target are gaps are silently removed.


Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (nucleotide sequence)

  • gap_regexp: regexp to identify gap

     # File lib/bio/db/gff.rb
1393 def self.new_from_sequences_na(reference, target,
1394                                gap_regexp = /[^a-zA-Z]/)
1395   gap = self.new
1396   gap.instance_eval { 
1397     __initialize_from_sequences_na(reference, target,
1398                                    gap_regexp)
1399   }
1400   gap
1401 end
new_from_sequences_na_aa(reference, target, gap_regexp = /[^a-zA-Z]/, space_regexp = /\s/, forward_frameshift_regexp = /\>/, reverse_frameshift_regexp = /\</) click to toggle source

Creates a new Gap object from given sequence alignment.

Note that sites of which both reference and target are gaps are silently removed.

For incorrect alignments that break 3:1 rule, gap positions will be moved inside codons, unwanted gaps will be removed, and some forward or reverse frameshift will be inserted.

For example,

atgg-taagac-att
M  V  K  -  I

is treated as:

atggt<aagacatt
M  V  K  >>I

Incorrect combination of frameshift with frameshift or gap may cause undefined behavior.

Forward frameshifts are recomended to be indicated in the target sequence. Reverse frameshifts can be indicated in the reference sequence or the target sequence.

Priority of regular expressions:

space > forward/reverse frameshift > gap

Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (amino acid sequence)

  • gap_regexp: regexp to identify gap

  • space_regexp: regexp to identify space character which is completely ignored

  • forward_frameshift_regexp: regexp to identify forward frameshift

  • reverse_frameshift_regexp: regexp to identify reverse frameshift

     # File lib/bio/db/gff.rb
1589 def self.new_from_sequences_na_aa(reference, target,
1590                                   gap_regexp = /[^a-zA-Z]/,
1591                                   space_regexp = /\s/,
1592                                   forward_frameshift_regexp = /\>/,
1593                                   reverse_frameshift_regexp = /\</)
1594   gap = self.new
1595   gap.instance_eval { 
1596     __initialize_from_sequences_na_aa(reference, target,
1597                                       gap_regexp,
1598                                       space_regexp,
1599                                       forward_frameshift_regexp,
1600                                       reverse_frameshift_regexp)
1601   }
1602   gap
1603 end
parse(str) click to toggle source

Same as new(str).

     # File lib/bio/db/gff.rb
1294 def self.parse(str)
1295   self.new(str)
1296 end

Public Instance Methods

==(other) click to toggle source

If self == other, returns true. otherwise, returns false.

     # File lib/bio/db/gff.rb
1617 def ==(other)
1618   if other.class == self.class and
1619       @data == other.data then
1620     true
1621   else
1622     false
1623   end
1624 end
process_sequences_na(reference, target, gap_char = '-') click to toggle source

Processes nucleotide sequences and returns gapped sequences as an array of sequences.

Note for forward/reverse frameshift: Forward/Reverse_frameshift is simply treated as gap insertion to the target/reference sequence.


Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (nucleotide sequence)

  • gap_char: gap character

     # File lib/bio/db/gff.rb
1717 def process_sequences_na(reference, target, gap_char = '-')
1718   s_ref, s_tgt = dup_seqs(reference, target)
1719 
1720   s_ref, s_tgt = __process_sequences(s_ref, s_tgt,
1721                                      gap_char, gap_char,
1722                                      1, 1,
1723                                      gap_char, gap_char)
1724 
1725   if $VERBOSE and s_ref.length != s_tgt.length then
1726     warn "returned sequences not equal length"
1727   end
1728   return s_ref, s_tgt
1729 end
process_sequences_na_aa(reference, target, gap_char = '-', space_char = ' ', forward_frameshift = '>', reverse_frameshift = '<') click to toggle source

Processes sequences and returns gapped sequences as an array of sequences. reference must be a nucleotide sequence, and target must be an amino acid sequence.

Note for reverse frameshift: Reverse_frameshift characers are inserted in the reference sequence. For example, alignment of “Gap=M3 R1 M2” is:

atgaagat<aatgtc
M  K  I  N  V

Alignment of “Gap=M3 R3 M3” is:

atgaag<<<attaatgtc
M  K  I  I  N  V

Arguments:

  • reference: reference sequence (nucleotide sequence)

  • target: target sequence (amino acid sequence)

  • gap_char: gap character

  • space_char: space character inserted to amino sequence for matching na-aa alignment

  • forward_frameshift: forward frameshift character

  • reverse_frameshift: reverse frameshift character

     # File lib/bio/db/gff.rb
1754 def process_sequences_na_aa(reference, target,
1755                             gap_char = '-',
1756                             space_char = ' ',
1757                             forward_frameshift = '>',
1758                             reverse_frameshift = '<')
1759   s_ref, s_tgt = dup_seqs(reference, target)
1760   s_tgt = s_tgt.gsub(/./, "\\0#{space_char}#{space_char}")
1761   ref_increment = 3
1762   tgt_increment = 1 + space_char.length * 2
1763   ref_gap = gap_char * 3
1764   tgt_gap = "#{gap_char}#{space_char}#{space_char}"
1765   return __process_sequences(s_ref, s_tgt,
1766                              ref_gap, tgt_gap,
1767                              ref_increment, tgt_increment,
1768                              forward_frameshift,
1769                              reverse_frameshift)
1770 end
to_s() click to toggle source

string representation

     # File lib/bio/db/gff.rb
1606 def to_s
1607   @data.collect { |x| x.to_s }.join(" ")
1608 end

Private Instance Methods

__initialize_from_sequences_na(reference, target, gap_regexp = /[^a-zA-Z]/) click to toggle source

(private method) Parses given reference-target sequence alignment and initializes self. Existing data will be erased.

     # File lib/bio/db/gff.rb
1324 def __initialize_from_sequences_na(reference, target,
1325                                    gap_regexp = /[^a-zA-Z]/)
1326   
1327   data_ref = __scan_gap(reference, gap_regexp, :I, :M)
1328   data_tgt = __scan_gap(target,    gap_regexp, :D, :M)
1329   data = []
1330 
1331   while !data_ref.empty? and !data_tgt.empty?
1332     ref = data_ref.shift
1333     tgt = data_tgt.shift
1334     if ref.length > tgt.length then
1335       x = Code.new(ref.code, ref.length - tgt.length)
1336       data_ref.unshift x
1337       ref.length = tgt.length
1338     elsif ref.length < tgt.length then
1339       x = Code.new(tgt.code, tgt.length - ref.length)
1340       data_tgt.unshift x
1341       tgt.length = ref.length
1342     end
1343     case ref.code
1344     when :M
1345       if tgt.code == :M then
1346         data.push ref
1347       elsif tgt.code == :D then
1348         data.push tgt
1349       else
1350         raise 'Bug: should not reach here.'
1351       end
1352     when :I
1353       if tgt.code == :M then
1354         data.push ref
1355       elsif tgt.code == :D then
1356         # This site is ignored,
1357         # because both reference and target are gap
1358       else
1359         raise 'Bug: should not reach here.'
1360       end
1361     end
1362   end #while
1363 
1364   # rest of data_ref
1365   len = 0
1366   data_ref.each do |r|
1367     len += r.length if r.code == :M
1368   end
1369   data.push Code.new(:D, len) if len > 0
1370 
1371   # rest of data_tgt
1372   len = 0
1373   data_tgt.each do |t|
1374     len += t.length if t.code == :M
1375   end
1376   data.push Code.new(:I, len) if len > 0
1377 
1378   @data = data
1379   true
1380 end
__initialize_from_sequences_na_aa(reference, target, gap_regexp = /[^a-zA-Z]/, space_regexp = /\s/, forward_frameshift_regexp = /\>/, reverse_frameshift_regexp = /\</) click to toggle source

(private method) Parses given reference(nuc)-target(amino) sequence alignment and initializes self. Existing data will be erased.

     # File lib/bio/db/gff.rb
1453 def __initialize_from_sequences_na_aa(reference, target,
1454                                       gap_regexp = /[^a-zA-Z]/,
1455                                       space_regexp = /\s/,
1456                                       forward_frameshift_regexp =
1457                                       /\>/,
1458                                       reverse_frameshift_regexp =
1459                                       /\</)
1460 
1461   data = []
1462   sc_ref = StringScanner.new(reference)
1463   sc_tgt = StringScanner.new(target)
1464 
1465   re_one = /./mn
1466 
1467   while !sc_tgt.eos?
1468     if len = sc_tgt.skip(space_regexp) then
1469       # ignored
1470     elsif len = sc_tgt.skip(forward_frameshift_regexp) then
1471       cur = __push_code_to_data(cur, data, :F, len)
1472       len.times { sc_ref.scan(re_one) }
1473 
1474     elsif len = sc_tgt.skip(reverse_frameshift_regexp) then
1475       cur = __push_code_to_data(cur, data, :R, len)
1476       pos = sc_ref.pos
1477       pos -= len
1478       if pos < 0 then
1479         warn "Incorrect reverse frameshift" if $VERBOSE
1480         pos = 0
1481       end
1482       sc_ref.pos = pos
1483 
1484     elsif len = sc_tgt.skip(gap_regexp) then
1485       len.times do
1486         ref_gaps, ref_fs = __scan_codon(sc_ref,
1487                                         gap_regexp,
1488                                         space_regexp,
1489                                         forward_frameshift_regexp,
1490                                         reverse_frameshift_regexp)
1491         case ref_gaps
1492         when 3
1493           # both ref and tgt are gap. ignored the site
1494         when 2, 1
1495           # forward frameshift inserted
1496           ref_fs += (3 - ref_gaps)
1497         when 0
1498           cur = __push_code_to_data(cur, data, :D, 1)
1499         else
1500           raise 'Bug: should not reach here'
1501         end
1502         if ref_fs < 0 then
1503           cur = __push_code_to_data(cur, data, :R, -ref_fs)
1504         elsif ref_fs > 0 then
1505           cur = __push_code_to_data(cur, data, :F, ref_fs)
1506         end
1507       end #len.times
1508     elsif len = sc_tgt.skip(re_one) then
1509       # always 1-letter
1510       ref_gaps, ref_fs = __scan_codon(sc_ref,
1511                                       gap_regexp,
1512                                       space_regexp,
1513                                       forward_frameshift_regexp,
1514                                       reverse_frameshift_regexp)
1515       case ref_gaps
1516       when 3
1517         cur = __push_code_to_data(cur, data, :I, 1)
1518       when 2, 1, 0
1519         # reverse frameshift inserted when gaps exist
1520         ref_fs -= ref_gaps
1521         # normal site
1522         cur = __push_code_to_data(cur, data, :M, 1)
1523       else
1524         raise 'Bug: should not reach here'
1525       end
1526       if ref_fs < 0 then
1527         cur = __push_code_to_data(cur, data, :R, -ref_fs)
1528       elsif ref_fs > 0 then
1529         cur = __push_code_to_data(cur, data, :F, ref_fs)
1530       end
1531     else
1532       raise 'Bug: should not reach here'
1533     end
1534   end #while
1535 
1536   if sc_ref.rest_size > 0 then
1537     rest = sc_ref.scan(/.*/mn)
1538     rest.gsub!(space_regexp, '')
1539     rest.gsub!(forward_frameshift_regexp, '')
1540     rest.gsub!(reverse_frameshift_regexp, '')
1541     rest.gsub!(gap_regexp, '')
1542     len = rest.length.div(3)
1543     cur = __push_code_to_data(cur, data, :D, len) if len > 0
1544     len = rest.length % 3
1545     cur = __push_code_to_data(cur, data, :F, len) if len > 0
1546   end
1547 
1548   @data = data
1549   self
1550 end
__process_sequences(s_ref, s_tgt, ref_gap, tgt_gap, ref_increment, tgt_increment, forward_frameshift, reverse_frameshift) click to toggle source

(private method) insert gaps refers to the gap rule inside the object

     # File lib/bio/db/gff.rb
1640 def __process_sequences(s_ref, s_tgt,
1641                         ref_gap, tgt_gap,
1642                         ref_increment, tgt_increment,
1643                         forward_frameshift,
1644                         reverse_frameshift)
1645   p_ref = 0
1646   p_tgt = 0
1647   @data.each do |c|
1648     #$stderr.puts c.inspect
1649     #$stderr.puts "p_ref=#{p_ref} s_ref=#{s_ref.inspect}"
1650     #$stderr.puts "p_tgt=#{p_tgt} s_tgt=#{s_tgt.inspect}"
1651     case c.code
1652     when :M # match
1653       p_ref += c.length * ref_increment
1654       p_tgt += c.length * tgt_increment
1655     when :I # insert a gap into the reference sequence
1656       begin
1657         s_ref[p_ref, 0] = ref_gap * c.length
1658       rescue IndexError
1659         raise 'reference sequence too short'
1660       end
1661       p_ref += c.length * ref_increment
1662       p_tgt += c.length * tgt_increment
1663     when :D # insert a gap into the target (delete from reference)
1664       begin
1665         s_tgt[p_tgt, 0] =  tgt_gap * c.length
1666       rescue IndexError
1667         raise 'target sequence too short'
1668       end
1669       p_ref += c.length * ref_increment
1670       p_tgt += c.length * tgt_increment
1671     when :F # frameshift forward in the reference sequence
1672       begin
1673         s_tgt[p_tgt, 0] = forward_frameshift * c.length
1674       rescue IndexError
1675         raise 'target sequence too short'
1676       end
1677       p_ref += c.length
1678       p_tgt += c.length
1679     when :R # frameshift reverse in the reference sequence
1680       p_rev_frm = p_ref - c.length
1681       if p_rev_frm < 0 then
1682         raise 'too short reference sequence, or too many reverse frameshifts'
1683       end
1684       begin
1685         s_ref[p_rev_frm, 0] = reverse_frameshift * c.length
1686       rescue IndexError
1687         raise 'reference sequence too short'
1688       end
1689       
1690     else
1691       warn "ignored #{c.to_s.inspect}" if $VERBOSE
1692     end
1693   end
1694 
1695   if s_ref.length < p_ref then
1696     raise 'reference sequence too short'
1697   end
1698   if s_tgt.length < p_tgt then
1699     raise 'target sequence too short'
1700   end
1701   return s_ref, s_tgt
1702 end
__push_code_to_data(cur, data, code, len) click to toggle source

(private method) internal use only

     # File lib/bio/db/gff.rb
1439 def __push_code_to_data(cur, data, code, len)
1440   if cur and cur.code == code then
1441     cur.length += len
1442   else
1443     cur = Code.new(code, len)
1444     data.push cur
1445   end
1446   return cur
1447 end
__scan_codon(sc_ref, gap_regexp, space_regexp, forward_frameshift_regexp, reverse_frameshift_regexp) click to toggle source

(private method) scans a codon or gap in reference sequence

     # File lib/bio/db/gff.rb
1405 def __scan_codon(sc_ref, 
1406                  gap_regexp, space_regexp,
1407                  forward_frameshift_regexp,
1408                  reverse_frameshift_regexp)
1409   chars = []
1410   gap_count = 0
1411   fs_count = 0
1412 
1413   while chars.size < 3 + fs_count and char = sc_ref.scan(/./mn)
1414     case char
1415     when space_regexp
1416       # ignored
1417     when forward_frameshift_regexp
1418       # next char is forward frameshift
1419       fs_count += 1
1420     when reverse_frameshift_regexp
1421       # next char is reverse frameshift
1422       fs_count -= 1
1423     when gap_regexp
1424       chars.push char
1425       gap_count += 1
1426     else
1427       chars.push char
1428     end
1429   end #while
1430   if chars.size < (3 + fs_count) then
1431     gap_count += (3 + fs_count) - chars.size
1432   end
1433   return gap_count, fs_count
1434 end
__scan_gap(str, gap_regexp = /[^a-zA-Z]/, code_i = :I, code_m = :M) click to toggle source

(private method) Scans gaps and returns an array of Code objects

     # File lib/bio/db/gff.rb
1300 def __scan_gap(str, gap_regexp = /[^a-zA-Z]/,
1301                code_i = :I, code_m = :M)
1302   sc = StringScanner.new(str)
1303   data = []
1304   while len = sc.skip_until(gap_regexp)
1305     mlen = len - sc.matched_size
1306     data.push Code.new(code_m, mlen) if mlen > 0
1307     g = Code.new(code_i, sc.matched_size)
1308     while glen = sc.skip(gap_regexp)
1309       g.length += glen
1310     end
1311     data.push g
1312   end
1313   if sc.rest_size > 0 then
1314     m = Code.new(code_m, sc.rest_size)
1315     data.push m
1316   end
1317   data
1318 end
dup_seqs(*arg) click to toggle source

duplicates sequences

     # File lib/bio/db/gff.rb
1627 def dup_seqs(*arg)
1628   arg.collect do |s|
1629     begin
1630       s = s.seq
1631     rescue NoMethodError
1632     end
1633     s.dup
1634   end
1635 end