Annika Jochheim1,2, Florian A. Jochheim2,3, Alexandra Kolodyazhnaya1, Étienne Morice1,2, Martin Steinegger4,5,6* and Johannes Söding1,2,7*
1Quantitative and Computational Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
2International Max-Planck Research School for Genome Sciences, University of Göttingen, Göttingen, Germany
3Dep. of Molecular Biology, Max-Planck Institute for Multidisciplinary Sciences, Göttingen, Germany
4School of Biological Sciences, Seoul National University, Seoul, South Korea
5Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
6Institute of Molecular Biology and Genetics, Seoul National University, Seoul, South Korea
7Campus Institute Data Science (CIDAS), University of Göttingen, Göttingen, Germany
Corresponding authors
Correspondence to Martin Steinegger or Johannes Söding.
Background: Metagenomics is a powerful approach to study environmental and human-associated microbial communities and, in particular, the role of viruses in shaping them. Viral genomes are challenging to assemble from metagenomic samples because high mutation rates make them genomically diverse. In standard de Bruijn graph assemblers, this diversity leads to complex k-mer assembly graphs with a plethora of loops and bulges that are difficult to resolve into strains or haplotypes, because variants more than the k-mer size apart cannot be phased. In contrast, overlap assemblers can phase variants as long as they are covered by a single read.
Results: Here, we present PenguiN, a software tool for strain-resolved assembly of viral DNA and RNA genomes and bacterial 16S rRNA from shotgun metagenomics. Its exhaustive detection of all read overlaps in linear time, combined with a Bayesian model to select strain-resolved extensions, allows it to assemble longer viral contigs, less fragmented genomes, and more strains than existing assembly tools, on both real and simulated datasets. We show a 3- to 40-fold increase in complete viral genomes and a 6-fold increase in bacterial 16S rRNA genes.
Conclusion: PenguiN is the first overlap-based assembler for viral genome and 16S rRNA assembly from large and complex metagenomic datasets, which we hope will facilitate studying the key roles of viruses in microbial communities.