Interpretable detection of novel human viruses from genome sequencing data
Bartoszewicz, Jakub M.
Seidel, Anja
Renard, Bernhard Y.
Viruses evolve extremely quickly, so reliable methods for viral host
prediction are necessary to safeguard biosecurity and biosafety alike.
Novel human-infecting viruses are difficult to detect with standard
bioinformatics workflows. Here, we predict whether a virus can infect
humans directly from next-generation sequencing reads. We show that deep
neural architectures significantly outperform both shallow machine learning
and standard, homology-based algorithms, cutting the error rates in half
and generalizing to taxonomic units distant from those presented during
training. Further, we develop a suite of interpretability tools and show
that it can also be applied to other models beyond the host prediction
task. We propose a new approach for convolutional filter visualization to
disentangle the information content of each nucleotide from its
contribution to the final classification decision. Nucleotide-resolution
maps of the learned associations between pathogen genomes and the
infectious phenotype can be used to detect regions of interest in novel
agents, for example, the SARS-CoV-2 coronavirus, unknown before it caused
the COVID-19 pandemic in 2020. All methods presented here are implemented
as easy-to-install packages, not only enabling analysis of NGS datasets
without requiring any deep learning skills, but also allowing advanced
users to easily train and explain new models for genomics.
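
To illustrate the "information content of each nucleotide" mentioned above, the following is a minimal sketch of the standard sequence-logo calculation. It is not taken from the authors' packages: the function name `information_content`, the uniform background of 0.25, and the example matrix are all assumptions made for illustration. A contribution score would come from a separate attribution method and is not computed here.

```python
import numpy as np


def information_content(freqs, background=0.25):
    """Per-position information content in bits (hypothetical helper).

    freqs: array of shape (length, 4) holding A, C, G, T frequencies,
    with each row summing to 1. A uniform background is assumed.
    """
    freqs = np.asarray(freqs, dtype=float)
    # Treat 0 * log2(0) as 0, following the usual convention.
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(freqs > 0, freqs * np.log2(freqs / background), 0.0)
    return terms.sum(axis=1)


# Made-up example: a uniform position (0 bits), a fixed base (2 bits),
# and a position with two equally likely bases (1 bit).
example = np.array([
    [0.25, 0.25, 0.25, 0.25],
    [1.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0],
])
print(information_content(example))
```

In a logo-style plot, such values could set the height of each nucleotide, leaving a second channel such as color free to show the contribution to the classification decision.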