snp-search

An easy-to-use tool for managing SNPs generated from haploid next-generation sequencing data. Given a VCF file, snp-search stores the SNPs produced by the variant-calling algorithm in an SQLite database. snp-search can then be used to extract useful information from the database.

Obtaining and installing the code

snp-search is written in Ruby and runs in a Unix environment. It is distributed as a gem. See the GitHub site for more information (github.com/hpa-bioinformatics/snp-search).

To install snp-search, run

gem install snp-search

Requirements

Not much, you just need:

That's it!

Running snp-search

1- First, create the database (snp-search -create).

Two files are needed to create the SQLite3 database:

1A- Variant Call Format (.vcf) file (which contains the SNP information)

1B- The reference genome that you used to generate your .vcf file (in GenBank or EMBL format; the script detects the format automatically).

You need the following parameters:

-d    Name of your database (note that this is a required field in all commands).
-v    .vcf file
-r    Reference genome (the same file that was used to generate the .vcf file). This should be in GenBank or EMBL format.

Optional: -A  AD ratio cutoff (default 0.9)

Usage:
  snp-search -create -d my_snp_db.sqlite3 -r my_ref.gbk -v my_vcf_file.vcf 
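
The -A option filters on the AD (allelic depth) ratio. As a rough illustration of what such a ratio looks like, here is a minimal sketch. It assumes the ratio is the fraction of reads supporting the variant allele (alternate depth divided by reference plus alternate depth), computed from a VCF AD value of the form ref,alt. That definition is an assumption, not taken from the snp-search code, and the function name ad_ratio is hypothetical.

```shell
# Hypothetical helper (not part of snp-search).
# Assumption: AD ratio = alt depth / (ref depth + alt depth), where the
# argument is a VCF AD value such as "2,48" (ref reads, alt reads).
ad_ratio() {
  printf '%s\n' "$1" | awk -F',' '{
    total = $1 + $2
    if (total > 0) printf "%.2f\n", $2 / total
    else print "NA"   # no reads at this position
  }'
}
```

For example, ad_ratio 2,48 prints the fraction of reads that support the variant allele, which can then be compared with the cutoff given to -A.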

Note: Strain names in the database are taken from your .vcf file, so make sure they are named appropriately there.
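
Because strain names come from the .vcf file, it can help to check them before creating the database. The following is a minimal sketch, assuming an uncompressed, standard VCF in which the #CHROM header line is tab-separated and lists sample names from the tenth column onwards. The function name list_vcf_samples is hypothetical and is not part of snp-search.

```shell
# Hypothetical helper (not part of snp-search): print one sample name per line.
# Assumes the "#CHROM" header line is tab-separated, with the nine fixed
# columns (CHROM ... FORMAT) followed by the sample names.
list_vcf_samples() {
  awk -F'\t' '/^#CHROM/ {
    for (i = 10; i <= NF; i++) print $i
    exit   # the header line appears only once
  }' "$1"
}
```

Running list_vcf_samples my_vcf_file.vcf prints the names that would be used as strain names.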

2- Now that you have created the database (my_snp_db.sqlite3), you can use snp-search to query it and output the results.

First, you need to tell snp-search what output you want. You have several options:
- Querying the database for the unique SNPs within the list of strains/samples provided (list_of_my_strains.txt). The output is a text file listing the unique SNPs with information about each SNP (e.g. whether it is a synonymous or non-synonymous SNP).

  -output -unique_snps -d db.sqlite3 [options]
    -u, --unique_snps                      Query for unique snps in the database
    -c, --cuttoff_snp_qual                 SNP quality cutoff, (default = 90)
    -g, --cuttoff_genotype                 Genotype quality cutoff (default = 30)
    -s, --strain                           The strains/samples you would like to query (only used with -unique_snps flag)
    -o, --out                              Name of output file, Required

  Usage: 
  snp-search -O -u -d my_snp_db.sqlite3 -s list_of_my_strains.txt -o unique_snps.out

- Querying the database to output all SNPs, excluding those in specified features in the database (e.g. phages). This is a way of ignoring SNPs in genes (likely to be mobile element genes) that are not needed for SNP analysis. The user has the option of generating a core SNP tree as a Newick file for SNP phylogeny (if the -F option was used to output a fasta file).

-output -all_or_filtered_snps -d db.sqlite3 [options]
  -f, --all_or_filtered_snps             Output all SNPs, or filter out SNPs in specified features in the database (if you do not want to ignore any SNPs, just use this option with -d -F/T -o)
  -F, --fasta                            output fasta file format (default)
  -T, --tabular                          output tabular file format
  -c, --cuttoff_snp_qual                 SNP quality cutoff, (default = 90)
  -g, --cuttoff_genotype                 Genotype quality cutoff (default = 30)
  -R, --remove_non_informative_snps      Only output informative SNPs. Only used with -f option
  -e, --ignore_snps_in_range             A list of position ranges to ignore e.g 10..500,2000..2500. Only used with -f option
  -a, --ignore_strains                   A list of strains to ignore (separated by commas, e.g. S1,S4,S8). Only used with -f option
  -I, --ignore_snps_on_annotation        The name of the feature(s) to ignore.  Features should be separated by commas (e.g. phages,insertion,transposons)
  -o, --out                              Name of output file, Required
  -t, --tree                             Generate SNP phylogeny (only used with -fasta option)
  -p, --fasttree_path                    Full path to the FastTree tool (e.g. /usr/local/bin/FastTree; only used with -tree option)

Usage:
snp-search -O -F -f -d my_snp_db.sqlite3 -I phage,insertion,transposon -R -o snps_without_phages.fasta
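
The position ranges given to -e take the form start..end, separated by commas (e.g. 10..500,2000..2500). To illustrate that format, here is a minimal sketch that tests whether a position falls in any listed range. It assumes both ends of each range are inclusive, which is not confirmed by this document, and the function name position_is_ignored is hypothetical.

```shell
# Hypothetical helper (not part of snp-search): succeed (exit status 0) if
# position $1 lies inside any range in the comma-separated list $2.
# Assumption: both ends of each start..end range are inclusive.
position_is_ignored() {
  awk -v pos="$1" -v ranges="$2" 'BEGIN {
    n = split(ranges, parts, ",")
    for (i = 1; i <= n; i++) {
      split(parts[i], ends, /\.\./)          # "10..500" -> ends[1]=10, ends[2]=500
      if (pos + 0 >= ends[1] + 0 && pos + 0 <= ends[2] + 0) exit 0
    }
    exit 1
  }'
}
```

For example, position_is_ignored 250 10..500,2000..2500 succeeds, while position 501 with the same list does not.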

- Optionally, you can add the following options to generate a phylogenetic tree from the resulting fasta file:

-t  Generate SNP phylogeny
-p  Full path to the FastTree tool (e.g. /usr/local/bin/FastTree; only used with -tree option)
Usage:
snp-search -O -F -f -d my_snp_db.sqlite3 -I phage,insertion,transposon -R -t -p /usr/local/bin/FastTree -o snps_without_phages.fasta

FastTree is used to generate the Newick (.nwk) file. FastTree can be downloaded from http://www.microbesonline.org/fasttree/#Install
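
FastTree can also be run directly on a fasta file if you prefer to build the tree yourself. The following is a minimal sketch, assuming FastTree is installed and the fasta file is a nucleotide alignment. The wrapper name build_snp_tree is hypothetical, and the options that snp-search itself passes to FastTree are not documented here.

```shell
# Hypothetical wrapper (not part of snp-search).
# $1 = path to FastTree, $2 = nucleotide fasta alignment, $3 = output file.
# -nt tells FastTree the input is nucleotide; FastTree writes the Newick
# tree to standard output, which is redirected to the output file.
build_snp_tree() {
  "$1" -nt "$2" > "$3"
}
```

For example: build_snp_tree /usr/local/bin/FastTree snps_without_phages.fasta snps_without_phages.nwk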

- Output all SNPs with information.  Information for each SNP includes whether the SNP is synonymous or non-synonymous, the gene function, whether it is in a pseudogene and other useful information.  This information is tab-separated.

-output -info -d db.sqlite3 [options]
  -i, --info                             Output various information about SNPs
  -c, --cuttoff_snp_qual                 SNP quality cutoff, (default = 90)
  -g, --cuttoff_genotype                 Genotype quality cutoff (default = 30)
  -o, --out                              Name of output file, Required

Usage:
snp-search -O -info -d my_snp_db.sqlite3 -o snps_all_with_info.txt
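
Since the output is tab-separated, standard Unix tools can be used to inspect it. The following is a minimal sketch that numbers the fields of the first line of a file. It assumes the first line is a header row, which is not confirmed by this document, and the function name list_columns is hypothetical.

```shell
# Hypothetical helper (not part of snp-search): print "N: value" for each
# tab-separated field on the first line of the file given as $1.
# Assumption: the first line of the file holds the column names.
list_columns() {
  awk -F'\t' 'NR == 1 {
    for (i = 1; i <= NF; i++) print i ": " $i
    exit
  }' "$1"
}
```

For example: list_columns snps_all_with_info.txt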

View database in Unix or in a GUI

Your database will be in SQLite3 format. If you would like to view your table(s) and run queries directly, you can type

sqlite3 my_snp_db.sqlite3

Alternatively, you may download an SQL tool to view your database (e.g. SQLite Sorcerer).
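
The table names are not listed in this document, so a good first step is to ask SQLite what the database contains. The following is a minimal sketch using the sqlite3 shell's built-in .tables and .schema commands; the function name describe_db is hypothetical.

```shell
# Hypothetical helper (not part of snp-search): list the tables in a SQLite
# database, then print the CREATE statements that define them.
describe_db() {
  sqlite3 "$1" ".tables"
  sqlite3 "$1" ".schema"
}
```

For example: describe_db my_snp_db.sqlite3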

Contact

If you have any comments, questions or suggestions, please email

ali.al-shahib@phe.gov.uk

or

anthony.underwood@phe.gov.uk

Have fun snp-searching!

Copyright © 2012 Ali Al-Shahib. See LICENSE.txt for further details.