Juwan Kim1†, Chul Lee1†, Byung June Ko2, Dong Ahn Yoo1, Sohyoung Won1, Adam M. Phillippy3, Olivier Fedrigo4, Guojie Zhang5,6,7,8, Kerstin Howe9, Jonathan Wood9, Richard Durbin9,10, Giulio Formenti4,11, Samara Brown11, Lindsey Cantin11, Claudio V. Mello12, Seoae Cho13, Arang Rhie3, Heebal Kim1,2,13* and Erich D. Jarvis4,11,14*
1Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea. 2Department of Agricultural Biotechnology and Research Institute of Agriculture and Life Sciences, Seoul National University, Seoul, Republic of Korea. 3Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA. 4Vertebrate Genome Lab, The Rockefeller University, New York City, USA. 5BGI-Shenzhen, Shenzhen 518083, China. 6Villum Centre for Biodiversity Genomics, Section for Ecology and Evolution, Department of Biology, University of Copenhagen, Universitetsparken 15, 2100 Copenhagen, Denmark. 7State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming 650223, China. 8Center for Excellence in Animal Evolution and Genetics, Chinese Academy of Sciences, Kunming 650223, China. 9Wellcome Sanger Institute, Cambridge, UK. 10Department of Genetics, University of Cambridge, Cambridge, UK. 11Laboratory of Neurogenetics of Language, The Rockefeller University, New York City, USA. 12Department of Behavioral Neuroscience, Oregon Health and Science University, Portland, OR 97239, USA. 13eGnome, Inc, Seoul, Republic of Korea. 14Howard Hughes Medical Institute, Chevy Chase, MD, USA.
†Juwan Kim and Chul Lee contributed equally to this work.
*Correspondence
Abstract
Background
Many short-read genome assemblies have been found to be incomplete and contain mis-assemblies. The Vertebrate Genomes Project has been producing new reference genome assemblies with an emphasis on being as complete and error-free as possible, which requires utilizing long reads, long-range scaffolding data, new assembly algorithms, and manual curation. A more thorough evaluation of the recent references relative to prior assemblies can provide a detailed overview of the types and magnitude of improvements.
Results
Here we evaluate new vertebrate genome references relative to the previous assemblies for the same species and, in two cases, the same individuals, including a mammal (platypus), two birds (zebra finch, Anna’s hummingbird), and a fish (climbing perch). We find that up to 11% of genomic sequence is entirely missing in the previous assemblies. In the Vertebrate Genomes Project zebra finch assembly, we identify eight new GC- and repeat-rich micro-chromosomes with high gene density. The impact of missing sequences is biased towards GC-rich 5′-proximal promoters and 5′ exon regions of protein-coding genes and long non-coding RNAs. Between 26 and 60% of genes include structural or sequence errors that could lead to misunderstanding of their function when using the previous genome assemblies.
Conclusions
Our findings reveal novel regulatory landscapes and protein coding sequences that have been greatly underestimated in previous assemblies and are now present in the Vertebrate Genomes Project reference genomes.