class Bio::GFF::GFF3::Record::Gap
Bio:GFF::GFF3::Record::Gap is a class to store data of “Gap” attribute.
Constants
- Code
Code
is a class to store length of single-letter code.
Attributes
Internal data. Users must not use it.
Public Class Methods
Creates a new Gap
object.
Arguments:
-
str: a formatted string, or nil.
# File lib/bio/db/gff.rb 1277 def initialize(str = nil) 1278 if str then 1279 @data = str.split(/ +/).collect do |x| 1280 if /\A([A-Z])([0-9]+)\z/ =~ x.strip then 1281 Code.new($1.intern, $2.to_i) 1282 else 1283 warn "ignored unknown token: #{x}.inspect" if $VERBOSE 1284 nil 1285 end 1286 end 1287 @data.compact! 1288 else 1289 @data = [] 1290 end 1291 end
Creates a new Gap
object from given sequence alignment.
Note that sites of which both reference and target are gaps are silently removed.
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (nucleotide sequence)
-
gap_regexp: regexp to identify gap
# File lib/bio/db/gff.rb 1393 def self.new_from_sequences_na(reference, target, 1394 gap_regexp = /[^a-zA-Z]/) 1395 gap = self.new 1396 gap.instance_eval { 1397 __initialize_from_sequences_na(reference, target, 1398 gap_regexp) 1399 } 1400 gap 1401 end
Creates a new Gap
object from given sequence alignment.
Note that sites of which both reference and target are gaps are silently removed.
For incorrect alignments that break 3:1 rule, gap positions will be moved inside codons, unwanted gaps will be removed, and some forward or reverse frameshift will be inserted.
For example,
atgg-taagac-att M V K - I
is treated as:
atggt<aagacatt M V K >>I
Incorrect combination of frameshift with frameshift or gap may cause undefined behavior.
Forward frameshifts are recomended to be indicated in the target sequence. Reverse frameshifts can be indicated in the reference sequence or the target sequence.
Priority of regular expressions:
space > forward/reverse frameshift > gap
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (amino acid sequence)
-
gap_regexp: regexp to identify gap
-
space_regexp: regexp to identify space character which is completely ignored
-
forward_frameshift_regexp: regexp to identify forward frameshift
-
reverse_frameshift_regexp: regexp to identify reverse frameshift
# File lib/bio/db/gff.rb 1589 def self.new_from_sequences_na_aa(reference, target, 1590 gap_regexp = /[^a-zA-Z]/, 1591 space_regexp = /\s/, 1592 forward_frameshift_regexp = /\>/, 1593 reverse_frameshift_regexp = /\</) 1594 gap = self.new 1595 gap.instance_eval { 1596 __initialize_from_sequences_na_aa(reference, target, 1597 gap_regexp, 1598 space_regexp, 1599 forward_frameshift_regexp, 1600 reverse_frameshift_regexp) 1601 } 1602 gap 1603 end
Same as new(str).
# File lib/bio/db/gff.rb 1294 def self.parse(str) 1295 self.new(str) 1296 end
Public Instance Methods
If self == other, returns true. otherwise, returns false.
# File lib/bio/db/gff.rb 1617 def ==(other) 1618 if other.class == self.class and 1619 @data == other.data then 1620 true 1621 else 1622 false 1623 end 1624 end
Processes nucleotide sequences and returns gapped sequences as an array of sequences.
Note for forward/reverse frameshift: Forward/Reverse_frameshift is simply treated as gap insertion to the target/reference sequence.
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (nucleotide sequence)
-
gap_char: gap character
# File lib/bio/db/gff.rb 1717 def process_sequences_na(reference, target, gap_char = '-') 1718 s_ref, s_tgt = dup_seqs(reference, target) 1719 1720 s_ref, s_tgt = __process_sequences(s_ref, s_tgt, 1721 gap_char, gap_char, 1722 1, 1, 1723 gap_char, gap_char) 1724 1725 if $VERBOSE and s_ref.length != s_tgt.length then 1726 warn "returned sequences not equal length" 1727 end 1728 return s_ref, s_tgt 1729 end
Processes sequences and returns gapped sequences as an array of sequences. reference must be a nucleotide sequence, and target must be an amino acid sequence.
Note for reverse frameshift: Reverse_frameshift characers are inserted in the reference sequence. For example, alignment of “Gap=M3 R1 M2” is:
atgaagat<aatgtc M K I N V
Alignment
of “Gap=M3 R3 M3” is:
atgaag<<<attaatgtc M K I I N V
Arguments:
-
reference: reference sequence (nucleotide sequence)
-
target: target sequence (amino acid sequence)
-
gap_char: gap character
-
space_char: space character inserted to amino sequence for matching na-aa alignment
-
forward_frameshift: forward frameshift character
-
reverse_frameshift: reverse frameshift character
# File lib/bio/db/gff.rb 1754 def process_sequences_na_aa(reference, target, 1755 gap_char = '-', 1756 space_char = ' ', 1757 forward_frameshift = '>', 1758 reverse_frameshift = '<') 1759 s_ref, s_tgt = dup_seqs(reference, target) 1760 s_tgt = s_tgt.gsub(/./, "\\0#{space_char}#{space_char}") 1761 ref_increment = 3 1762 tgt_increment = 1 + space_char.length * 2 1763 ref_gap = gap_char * 3 1764 tgt_gap = "#{gap_char}#{space_char}#{space_char}" 1765 return __process_sequences(s_ref, s_tgt, 1766 ref_gap, tgt_gap, 1767 ref_increment, tgt_increment, 1768 forward_frameshift, 1769 reverse_frameshift) 1770 end
string representation
# File lib/bio/db/gff.rb 1606 def to_s 1607 @data.collect { |x| x.to_s }.join(" ") 1608 end
Private Instance Methods
(private method) Parses given reference-target sequence alignment and initializes self. Existing data will be erased.
# File lib/bio/db/gff.rb 1324 def __initialize_from_sequences_na(reference, target, 1325 gap_regexp = /[^a-zA-Z]/) 1326 1327 data_ref = __scan_gap(reference, gap_regexp, :I, :M) 1328 data_tgt = __scan_gap(target, gap_regexp, :D, :M) 1329 data = [] 1330 1331 while !data_ref.empty? and !data_tgt.empty? 1332 ref = data_ref.shift 1333 tgt = data_tgt.shift 1334 if ref.length > tgt.length then 1335 x = Code.new(ref.code, ref.length - tgt.length) 1336 data_ref.unshift x 1337 ref.length = tgt.length 1338 elsif ref.length < tgt.length then 1339 x = Code.new(tgt.code, tgt.length - ref.length) 1340 data_tgt.unshift x 1341 tgt.length = ref.length 1342 end 1343 case ref.code 1344 when :M 1345 if tgt.code == :M then 1346 data.push ref 1347 elsif tgt.code == :D then 1348 data.push tgt 1349 else 1350 raise 'Bug: should not reach here.' 1351 end 1352 when :I 1353 if tgt.code == :M then 1354 data.push ref 1355 elsif tgt.code == :D then 1356 # This site is ignored, 1357 # because both reference and target are gap 1358 else 1359 raise 'Bug: should not reach here.' 1360 end 1361 end 1362 end #while 1363 1364 # rest of data_ref 1365 len = 0 1366 data_ref.each do |r| 1367 len += r.length if r.code == :M 1368 end 1369 data.push Code.new(:D, len) if len > 0 1370 1371 # rest of data_tgt 1372 len = 0 1373 data_tgt.each do |t| 1374 len += t.length if t.code == :M 1375 end 1376 data.push Code.new(:I, len) if len > 0 1377 1378 @data = data 1379 true 1380 end
(private method) Parses given reference(nuc)-target(amino) sequence alignment and initializes self. Existing data will be erased.
# File lib/bio/db/gff.rb 1453 def __initialize_from_sequences_na_aa(reference, target, 1454 gap_regexp = /[^a-zA-Z]/, 1455 space_regexp = /\s/, 1456 forward_frameshift_regexp = 1457 /\>/, 1458 reverse_frameshift_regexp = 1459 /\</) 1460 1461 data = [] 1462 sc_ref = StringScanner.new(reference) 1463 sc_tgt = StringScanner.new(target) 1464 1465 re_one = /./mn 1466 1467 while !sc_tgt.eos? 1468 if len = sc_tgt.skip(space_regexp) then 1469 # ignored 1470 elsif len = sc_tgt.skip(forward_frameshift_regexp) then 1471 cur = __push_code_to_data(cur, data, :F, len) 1472 len.times { sc_ref.scan(re_one) } 1473 1474 elsif len = sc_tgt.skip(reverse_frameshift_regexp) then 1475 cur = __push_code_to_data(cur, data, :R, len) 1476 pos = sc_ref.pos 1477 pos -= len 1478 if pos < 0 then 1479 warn "Incorrect reverse frameshift" if $VERBOSE 1480 pos = 0 1481 end 1482 sc_ref.pos = pos 1483 1484 elsif len = sc_tgt.skip(gap_regexp) then 1485 len.times do 1486 ref_gaps, ref_fs = __scan_codon(sc_ref, 1487 gap_regexp, 1488 space_regexp, 1489 forward_frameshift_regexp, 1490 reverse_frameshift_regexp) 1491 case ref_gaps 1492 when 3 1493 # both ref and tgt are gap. ignored the site 1494 when 2, 1 1495 # forward frameshift inserted 1496 ref_fs += (3 - ref_gaps) 1497 when 0 1498 cur = __push_code_to_data(cur, data, :D, 1) 1499 else 1500 raise 'Bug: should not reach here' 1501 end 1502 if ref_fs < 0 then 1503 cur = __push_code_to_data(cur, data, :R, -ref_fs) 1504 elsif ref_fs > 0 then 1505 cur = __push_code_to_data(cur, data, :F, ref_fs) 1506 end 1507 end #len.times 1508 elsif len = sc_tgt.skip(re_one) then 1509 # always 1-letter 1510 ref_gaps, ref_fs = __scan_codon(sc_ref, 1511 gap_regexp, 1512 space_regexp, 1513 forward_frameshift_regexp, 1514 reverse_frameshift_regexp) 1515 case ref_gaps 1516 when 3 1517 cur = __push_code_to_data(cur, data, :I, 1) 1518 when 2, 1, 0 1519 # reverse frameshift inserted when gaps exist 1520 ref_fs -= ref_gaps 1521 # normal site 1522 cur = __push_code_to_data(cur, data, :M, 1) 1523 else 1524 raise 'Bug: should not reach here' 1525 end 1526 if ref_fs < 0 then 1527 cur = __push_code_to_data(cur, data, :R, -ref_fs) 1528 elsif ref_fs > 0 then 1529 cur = __push_code_to_data(cur, data, :F, ref_fs) 1530 end 1531 else 1532 raise 'Bug: should not reach here' 1533 end 1534 end #while 1535 1536 if sc_ref.rest_size > 0 then 1537 rest = sc_ref.scan(/.*/mn) 1538 rest.gsub!(space_regexp, '') 1539 rest.gsub!(forward_frameshift_regexp, '') 1540 rest.gsub!(reverse_frameshift_regexp, '') 1541 rest.gsub!(gap_regexp, '') 1542 len = rest.length.div(3) 1543 cur = __push_code_to_data(cur, data, :D, len) if len > 0 1544 len = rest.length % 3 1545 cur = __push_code_to_data(cur, data, :F, len) if len > 0 1546 end 1547 1548 @data = data 1549 self 1550 end
(private method) insert gaps refers to the gap rule inside the object
# File lib/bio/db/gff.rb 1640 def __process_sequences(s_ref, s_tgt, 1641 ref_gap, tgt_gap, 1642 ref_increment, tgt_increment, 1643 forward_frameshift, 1644 reverse_frameshift) 1645 p_ref = 0 1646 p_tgt = 0 1647 @data.each do |c| 1648 #$stderr.puts c.inspect 1649 #$stderr.puts "p_ref=#{p_ref} s_ref=#{s_ref.inspect}" 1650 #$stderr.puts "p_tgt=#{p_tgt} s_tgt=#{s_tgt.inspect}" 1651 case c.code 1652 when :M # match 1653 p_ref += c.length * ref_increment 1654 p_tgt += c.length * tgt_increment 1655 when :I # insert a gap into the reference sequence 1656 begin 1657 s_ref[p_ref, 0] = ref_gap * c.length 1658 rescue IndexError 1659 raise 'reference sequence too short' 1660 end 1661 p_ref += c.length * ref_increment 1662 p_tgt += c.length * tgt_increment 1663 when :D # insert a gap into the target (delete from reference) 1664 begin 1665 s_tgt[p_tgt, 0] = tgt_gap * c.length 1666 rescue IndexError 1667 raise 'target sequence too short' 1668 end 1669 p_ref += c.length * ref_increment 1670 p_tgt += c.length * tgt_increment 1671 when :F # frameshift forward in the reference sequence 1672 begin 1673 s_tgt[p_tgt, 0] = forward_frameshift * c.length 1674 rescue IndexError 1675 raise 'target sequence too short' 1676 end 1677 p_ref += c.length 1678 p_tgt += c.length 1679 when :R # frameshift reverse in the reference sequence 1680 p_rev_frm = p_ref - c.length 1681 if p_rev_frm < 0 then 1682 raise 'too short reference sequence, or too many reverse frameshifts' 1683 end 1684 begin 1685 s_ref[p_rev_frm, 0] = reverse_frameshift * c.length 1686 rescue IndexError 1687 raise 'reference sequence too short' 1688 end 1689 1690 else 1691 warn "ignored #{c.to_s.inspect}" if $VERBOSE 1692 end 1693 end 1694 1695 if s_ref.length < p_ref then 1696 raise 'reference sequence too short' 1697 end 1698 if s_tgt.length < p_tgt then 1699 raise 'target sequence too short' 1700 end 1701 return s_ref, s_tgt 1702 end
(private method) internal use only
# File lib/bio/db/gff.rb 1439 def __push_code_to_data(cur, data, code, len) 1440 if cur and cur.code == code then 1441 cur.length += len 1442 else 1443 cur = Code.new(code, len) 1444 data.push cur 1445 end 1446 return cur 1447 end
(private method) scans a codon or gap in reference sequence
# File lib/bio/db/gff.rb 1405 def __scan_codon(sc_ref, 1406 gap_regexp, space_regexp, 1407 forward_frameshift_regexp, 1408 reverse_frameshift_regexp) 1409 chars = [] 1410 gap_count = 0 1411 fs_count = 0 1412 1413 while chars.size < 3 + fs_count and char = sc_ref.scan(/./mn) 1414 case char 1415 when space_regexp 1416 # ignored 1417 when forward_frameshift_regexp 1418 # next char is forward frameshift 1419 fs_count += 1 1420 when reverse_frameshift_regexp 1421 # next char is reverse frameshift 1422 fs_count -= 1 1423 when gap_regexp 1424 chars.push char 1425 gap_count += 1 1426 else 1427 chars.push char 1428 end 1429 end #while 1430 if chars.size < (3 + fs_count) then 1431 gap_count += (3 + fs_count) - chars.size 1432 end 1433 return gap_count, fs_count 1434 end
(private method) Scans gaps and returns an array of Code
objects
# File lib/bio/db/gff.rb 1300 def __scan_gap(str, gap_regexp = /[^a-zA-Z]/, 1301 code_i = :I, code_m = :M) 1302 sc = StringScanner.new(str) 1303 data = [] 1304 while len = sc.skip_until(gap_regexp) 1305 mlen = len - sc.matched_size 1306 data.push Code.new(code_m, mlen) if mlen > 0 1307 g = Code.new(code_i, sc.matched_size) 1308 while glen = sc.skip(gap_regexp) 1309 g.length += glen 1310 end 1311 data.push g 1312 end 1313 if sc.rest_size > 0 then 1314 m = Code.new(code_m, sc.rest_size) 1315 data.push m 1316 end 1317 data 1318 end
duplicates sequences
# File lib/bio/db/gff.rb 1627 def dup_seqs(*arg) 1628 arg.collect do |s| 1629 begin 1630 s = s.seq 1631 rescue NoMethodError 1632 end 1633 s.dup 1634 end 1635 end