bio-maf¶ ↑
This is a plugin for BioRuby adding support for the Multiple Alignment Format (MAF), used in bioinformatics to store whole-genome sets of multiple sequence alignments.
This library provides indexed and sequential access to MAF data, as well as performing various manipulations on it and writing modified MAF files.
For more information, see the project wiki.
Developer documentation generated with YARD is available at rubydoc.info.
This is being developed by Clayton Wheeler as part of the Google Summer of Code 2012, under the auspices of the Open Bioinformatics Foundation. The development blog may be of interest.
Dependencies¶ ↑
Kyoto Cabinet is a database library, required for building MAF indexes. Install the core library in the appropriate way for your platform, as documented here.
If you’re using MRI, the kyotocabinet-ruby gem will be used to interact with Kyoto Cabinet. For best performance, however, you should really consider using JRuby. On JRuby, the kyotocabinet-java gem will be used instead; this builds a Java library using JNI to call into Kyoto Cabinet. Please file a bug report if you encounter problems building or using this gem, which is still fairly new.
Installation¶ ↑
bio-maf
is now published as a Ruby gem.
$ gem install bio-maf
Performance¶ ↑
This parser performs best under JRuby, particularly with Java 7. See the Performance wiki page for more information. For best results, use JRuby in 1.9 mode with the ObjectProxyCache disabled:
$ export JRUBY_OPTS='--1.9 -Xji.objectProxyCache=false'
Many parsing modes are multithreaded. Under JRuby, it will default to using one parser thread per available core, but if desired this can be configured with the :threads
parser option.
Ruby 1.9.3 is fully supported, but does not perform as well, especially since its concurrency features are not useful for this workload.
Usage¶ ↑
Create an index on a MAF file¶ ↑
Much of the functionality of this library relies on an index. You can create one with maf_index(1), like so:
$ maf_index test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
To index all sequences for searching, not just the reference sequence:
$ maf_index --all test/data/mm8_chr7_tiny.maf /tmp/mm8_chr7_tiny.kct
To build an index programmatically:
require 'bio-maf' parser = Bio::MAF::Parser.new("test/data/mm8_chr7_tiny.maf") idx = Bio::MAF::KyotoIndex.build(parser, "/tmp/mm8_chr7_tiny.kct", false)
Compress and index a MAF file¶ ↑
This library fully supports {BGZF}[http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html]-compressed MAF files, which combine gzip compression with blocking for efficient random access. These can be generated with blocking optimized for MAF access using the included {maf_bgzip(1)
} tool. This writes BGZF-compressed MAF files and optionally indexes them as well:
$ maf_bgzip --dir /tmp --index --all test/data/mm8.chrM.maf
This is the easiest way to prepare MAF files for use with this library.
Extract blocks from an indexed MAF file, by genomic interval¶ ↑
Refer to {mm8_chr7_tiny.maf
}.
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)] access.find(q) do |block| ref_seq = block.sequences[0] puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases" end # => Matched block at 80082592, 121 bases # => Matched block at 80082713, 54 bases
Or, equivalently, one can work with a specific MAF file and index directly:
require 'bio-maf' parser = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf') idx = Bio::MAF::KyotoIndex.open('test/data/mm8_chr7_tiny.kct') q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)] idx.find(q, parser).each do |block| ref_seq = block.sequences[0] puts "Matched block at #{ref_seq.start}, #{ref_seq.size} bases" end # => Matched block at 80082592, 121 bases # => Matched block at 80082713, 54 bases
This can be done with {maf_extract(1)
} as well:
$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766
Extract alignment blocks truncated to a given interval¶ ↑
Given a genomic interval of interest, one can also extract only the subsets of blocks that intersect with that interval, using the #slice
method like so:
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') int = Bio::GenomicInterval.zero_based('mm8.chr7', 80082350, 80082380) blocks = access.slice(int).to_a puts "Got #{blocks.size} blocks, first #{blocks.first.ref_seq.size} base pairs." # => Got 2 blocks, first 18 base pairs.
Or, with {maf_extract(1)
}:
$ maf_extract -d test/data --mode slice --interval mm8.chr7:80082592-80082766
Filter species returned in alignment blocks¶ ↑
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') access.sequence_filter = { :only_species => %w(hg18 mm8 rheMac2) } q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)] blocks = access.find(q) block = blocks.first puts "Block has #{block.sequences.size} sequences." # => Block has 3 sequences.
With {maf_extract(1)
}:
$ maf_extract -d test/data --interval mm8.chr7:80082592-80082766 --only-species hg18,mm8,rheMac2
Extract blocks matching certain conditions¶ ↑
See also the Cucumber feature and step definitions for this.
Match only blocks with all specified species¶ ↑
access = Bio::MAF::Access.maf_dir('test/data') q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082471, 80082730)] access.block_filter = { :with_all_species => %w(panTro2 loxAfr1) } n_blocks = access.find(q).count # => 1
With {maf_extract(1)
}:
$ maf_extract -d test/data --interval mm8.chr7:80082471-80082730 --with-all-species panTro2,loxAfr1
Match only blocks with a certain number of sequences¶ ↑
access = Bio::MAF::Access.maf_dir('test/data') q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082767, 80083008)] access.block_filter = { :at_least_n_sequences => 6 } n_blocks = access.find(q).count # => 1
With {maf_extract(1)
}:
$ maf_extract -d test/data --interval mm8.chr7:80082767-80083008 --min-sequences 6
Match only blocks within a text size range¶ ↑
access = Bio::MAF::Access.maf_dir('test/data') q = [Bio::GenomicInterval.zero_based('mm8.chr7', 0, 80100000)] access.block_filter = { :min_size => 72, :max_size => 160 } n_blocks = access.find(q).count # => 3
With {maf_extract(1)
}:
$ maf_extract -d test/data --interval mm8.chr7:0-80100000 --min-text-size 72 --max-text-size 160
Process each block in a MAF file¶ ↑
require 'bio-maf' p = Bio::MAF::Parser.new('test/data/mm8_chr7_tiny.maf') puts "MAF version: #{p.header.version}" # => MAF version: 1 p.each_block do |block| block.sequences.each do |seq| do_something(seq) end end
Parse empty (‘e’) lines¶ ↑
Refer to {chr22_ieq.maf
}.
require 'bio-maf' p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf', :parse_empty => false) block = p.parse_block block.sequences.size # => 3 p = Bio::MAF::Parser.new('test/data/chr22_ieq.maf', :parse_empty => true) block = p.parse_block block.sequences.size # => 4 block.sequences.find { |s| s.empty? } # => #<Bio::MAF::EmptySequence:0x007fe1f39882d0 # @source="turTru1.scaffold_109008", @start=25049, # @size=1601, @strand=:+, @src_size=50103, @text=nil, # @status="I">
Such options can also be set on a Bio::MAF::Access
object:
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') access.parse_options[:parse_empty] = true
Remove gaps from parsed blocks¶ ↑
After filtering out species with {Parser#sequence_filter
}, gaps may be left where there was an insertion present only in sequences that were filtered out. Such gaps can be removed by setting the :remove_gaps
parser option:
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') access.parse_options[:remove_gaps] = true
Join blocks after filtering together¶ ↑
Similarly, filtering out species may remove a species which had caused two adjacent alignment blocks to be split. By enabling the :join_blocks
parser option, such blocks can be joined together:
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') access.parse_options[:join_blocks] = true
See the Cucumber feature for more details.
Extract bio-alignment representations of blocks¶ ↑
When the :as_bio_alignment
parser option is given, blocks will be returned as Bio::BioAlignment::Alignment objects as used in the bio-alignment Biogem. This offers a great deal of built-in functionality for column-wise operations, alignment manipulation, and more.
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') access.parse_options[:as_bio_alignment] = true q = [Bio::GenomicInterval.zero_based('mm8.chr7', 80082592, 80082766)] access.find(q) do |aln| col = aln.columns[3] puts "bases in column 3: #{col}" end
Tile blocks together over an interval¶ ↑
Extracts alignment blocks overlapping the given genomic interval and constructs a single alignment block covering the entire interval for the specified species. Optionally, any gaps in coverage of the MAF file’s reference sequence can be filled in from a FASTA sequence file. See the Cucumber feature for examples of output, and also the {maf_tile(1)
} man page.
require 'bio-maf' access = Bio::MAF::Access.maf_dir('test/data') interval = Bio::GenomicInterval.zero_based('mm8.chr7', 80082334, 80082468) access.tile(interval) do |tiler| # reference is optional tiler.reference = 'reference.fa.gz' tiler.species = %w(mm8 rn4 hg18) # species_map is optional tiler.species_map = { 'mm8' => 'mouse', 'rn4' => 'rat', 'hg18' => 'human' } tiler.write_fasta($stdout) end
Compression¶ ↑
MAF files can optionally be compressed in the BGZF format defined in the SAM specification. This is best done with {maf_bgzip(1)
}, but files compressed with the bgzip(1)
tool from samtools will also work, though less efficiently.
MAF files compressed with plain gzip will be decompressed on the fly, but random access to these files will not be possible. However, gzipped MAF files are suitable as input to {maf_bgzip(1)
}.
Command line tools¶ ↑
Man pages for command line tools:
With gem-man installed, these can be read with:
$ gem man bio-maf
Other documentation¶ ↑
Also see the API documentation. For more code examples see the RSpec and Cucumber test files in the source tree.
Also, the scripts in the bin directory provide good worked examples of how to use the existing parsing API.
Project home page¶ ↑
For information on the source tree, documentation, examples, issues and how to contribute, see
The BioRuby community is on IRC server: irc.freenode.org, channel: bioruby.
Cite¶ ↑
If you use this software, please cite one of
Biogems.info¶ ↑
This Biogem is published at biogems.info.
Copyright¶ ↑
Copyright © 2012 Clayton Wheeler. See LICENSE.txt for further details.