VhgSubsetHittable {Virusparies} | R Documentation |
VhgSubsetHittable: Filter VirusHunter and VirusGatherer hittables
Description
VhgSubsetHittable filters a VirusHunter or VirusGatherer hittable by virus group, minimum number of hits, maximum E-value, sequence identity percentage, and (Gatherer only) minimum contig length.
Usage
VhgSubsetHittable(
file,
group_column = "best_query",
virus_groups = NULL,
num_hits_min = NULL,
ViralRefSeq_E_criteria = NULL,
ViralRefSeq_ident_criteria = NULL,
contig_len_criteria = NULL
)
Arguments
file | A data frame containing VirusHunter or VirusGatherer hittable results. |
group_column | A string naming the column that contains the virus groups specified in the virus_groups argument. Default is "best_query". Note: Gatherer hittables do not have a "best_query" column, so provide an appropriate grouping column for them. |
virus_groups | A character vector specifying the virus groups to keep. To exclude groups instead, pass a list with an exclude element, e.g. list(exclude = "Hepadna-Nackedna_TP") (see Examples). Default is NULL, which includes all virus groups. |
num_hits_min | Minimum number of hits required. Default is NULL, which means no filter based on num_hits. |
ViralRefSeq_E_criteria | Maximum E-value threshold; rows with ViralRefSeq_E below it are kept. Default is NULL, which means no filter based on ViralRefSeq_E. |
ViralRefSeq_ident_criteria | Sequence identity percentage threshold. If positive, keeps rows where ViralRefSeq_ident is above the threshold. If negative, keeps rows where ViralRefSeq_ident is below the absolute value of the threshold. Default is NULL, which means no filter based on ViralRefSeq_ident. |
contig_len_criteria | (Gatherer only) Minimum contig length required. Default is NULL, which means no filter based on contig_len. |
Details
The function filters the input VirusHunter or VirusGatherer data (file) based on the specified criteria:
- group_column: Specifies the column to filter by, which must be either "ViralRefSeq_taxonomy" or "best_query".
- virus_groups: Allows filtering by specific virus groups. If NULL, all virus groups are included.
- num_hits_min: Keeps rows where the number of hits ("num_hits") is greater than or equal to the specified minimum.
- ViralRefSeq_E_criteria: Keeps rows where the E-value ("ViralRefSeq_E") is below the specified maximum threshold.
- ViralRefSeq_ident_criteria: Keeps rows where the sequence identity percentage ("ViralRefSeq_ident") is above or below the specified threshold. Use a positive value to keep rows where ViralRefSeq_ident is above the threshold, and a negative value to keep rows where ViralRefSeq_ident is below the absolute value of the threshold.
- contig_len_criteria: (Gatherer only) Keeps rows where the contig length ("contig_len") is greater than or equal to the specified threshold.
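The row logic described above can be sketched in base R. This is an illustration only, not the package's implementation: the toy data frame hits, its values, and the helper ident_keep are hypothetical, and it assumes that the criteria are combined with a logical AND and that the E-value and identity comparisons are strict at the threshold.

```r
# Toy stand-in for a hittable. Column names follow this page;
# the values are invented for illustration.
hits <- data.frame(
  best_query        = c("Anello_ORF1core", "Anello_ORF1core",
                        "Gemini_Rep", "Hepadna-Nackedna_TP"),
  num_hits          = c(10, 2, 7, 5),
  ViralRefSeq_E     = c(1e-20, 1e-8, 1e-3, 1e-12),
  ViralRefSeq_ident = c(45.2, 97.1, 60.0, 88.5),
  stringsAsFactors  = FALSE
)

# Hypothetical helper: applies the sign convention of
# ViralRefSeq_ident_criteria (positive = keep rows above the threshold,
# negative = keep rows below its absolute value, NULL = keep everything).
ident_keep <- function(ident, criteria) {
  if (is.null(criteria)) {
    return(rep(TRUE, length(ident)))
  }
  if (criteria >= 0) ident > criteria else ident < abs(criteria)
}

# Combine the individual conditions (assumed here to be joined with AND).
keep <- hits$best_query %in% "Anello_ORF1core" &
  hits$num_hits >= 4 &
  hits$ViralRefSeq_E < 1e-5 &
  ident_keep(hits$ViralRefSeq_ident, -90)

filtered <- hits[keep, ]
```

With these toy values, only the first row passes every condition: the second row has too few hits, and the other two belong to different groups.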
Value
A data frame: the input hittable, filtered by the specified criteria.
Author(s)
Sergej Ruff
See Also
VirusHunterGatherer is available here: https://github.com/lauberlab/VirusHunterGatherer.
Examples
path <- system.file("extdata", "virushunter.tsv", package = "Virusparies")
file <- ImportVirusTable(path)
cat("The dimensions of the VirusHunter hittable before filtering are: \n")
dim(file)
file_filtered <- VhgSubsetHittable(file, group_column = "best_query",
                                   virus_groups = "Anello_ORF1core",
                                   num_hits_min = 4,
                                   ViralRefSeq_ident_criteria = -90,
                                   ViralRefSeq_E_criteria = 0.00001)
cat("The dimensions of the VirusHunter hittable after filtering are: \n")
dim(file_filtered)
# Other examples for the virus_groups argument
# Include a single group:
result1 <- VhgSubsetHittable(file, virus_groups = "Hepadna-Nackedna_TP")
# Include multiple groups:
result2 <- VhgSubsetHittable(file, virus_groups = c("Hepadna-Nackedna_TP", "Gemini_Rep"))
# Exclude a single group:
result3 <- VhgSubsetHittable(file, virus_groups = list(exclude = "Hepadna-Nackedna_TP"))
# Exclude multiple groups:
result4 <- VhgSubsetHittable(file, virus_groups = list(
  exclude = c("Hepadna-Nackedna_TP", "Anello_ORF1core")))