Stability of SARS-CoV-2 phylogenies
Authors:
Yatish Turakhia aff001; Nicola De Maio aff003; Bryan Thornlow aff001; Landen Gozashti aff001; Robert Lanfear aff005; Conor R. Walker aff003; Angie S. Hinrichs aff002; Jason D. Fernandes aff001; Rui Borges aff008; Greg Slodkowicz aff009; Lukas Weilguny aff003; David Haussler aff001; Nick Goldman aff003; Russell Corbett-Detig aff001
Author affiliations:
aff001: Department of Biomolecular Engineering, University of California Santa Cruz, Santa Cruz, CA, United States of America
aff002: Genomics Institute, University of California Santa Cruz, Santa Cruz, CA, United States of America
aff003: European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Cambridge, United Kingdom
aff004: Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, United States of America
aff005: Department of Ecology and Evolution, Research School of Biology, Australian National University, Canberra, ACT, Australia
aff006: Department of Genetics, University of Cambridge, Cambridge, United Kingdom
aff007: Howard Hughes Medical Institute, University of California, Santa Cruz, CA, United States of America
aff008: Institut für Populationsgenetik, Vetmeduni Vienna, Wien, Austria
aff009: MRC Laboratory of Molecular Biology, Cambridge, United Kingdom
Published in:
Stability of SARS-CoV-2 phylogenies. PLoS Genet 16(11): e1009175. doi:10.1371/journal.pgen.1009175
Category:
Research Article
doi:
https://doi.org/10.1371/journal.pgen.1009175
Abstract
The SARS-CoV-2 pandemic has led to unprecedented, nearly real-time genetic tracing due to the rapid community sequencing response. Researchers immediately leveraged these data to infer the evolutionary relationships among viral samples and to study key biological questions, including whether host viral genome editing and recombination are features of SARS-CoV-2 evolution. This global sequencing effort is inherently decentralized and must rely on data collected by many labs using a wide variety of molecular and bioinformatic techniques. There is thus a strong possibility that systematic errors associated with lab- or protocol-specific practices affect some sequences in the repositories. We find that some recurrent mutations in reported SARS-CoV-2 genome sequences have been observed predominantly or exclusively by single labs, co-localize with commonly used primer binding sites and are more likely to affect the protein-coding sequences than other similarly recurrent mutations. We show that their inclusion can affect phylogenetic inference on scales relevant to local lineage tracing, and make it appear as though there has been an excess of recurrent mutation or recombination among viral lineages. We suggest how samples can be screened and problematic variants removed, and we plan to regularly inform the scientific community with our updated results as more SARS-CoV-2 genome sequences are shared (https://virological.org/t/issues-with-sars-cov-2-sequencing-data/473 and https://virological.org/t/masking-strategies-for-sars-cov-2-alignments/480). We also develop tools for comparing and visualizing differences among very large phylogenies and we show that consistent clade- and tree-based comparisons can be made between phylogenies produced by different groups. These will facilitate evolutionary inferences and comparisons among phylogenies produced for a wide array of purposes.
Building on the SARS-CoV-2 Genome Browser at UCSC, we present a toolkit to compare, analyze and combine SARS-CoV-2 phylogenies, find and remove potential sequencing errors, and establish a widely shared, stable clade structure for more accurate scientific inference and discourse.
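The abstract's observation that some recurrent mutations are reported predominantly or exclusively by single labs can be illustrated with a small screening sketch. This is not the authors' pipeline: the function name `lab_dominated_sites`, the input format (one `(site, lab)` pair per sample carrying a mutation) and both thresholds are assumptions made for illustration. It also counts samples rather than independent mutation events on a tree, which is a simplification.

```python
from collections import Counter, defaultdict


def lab_dominated_sites(observations, min_count=3, min_fraction=0.8):
    """Return {site: (top_lab, fraction)} for sites dominated by one lab.

    observations: iterable of (site, lab) pairs, one per sample that
    carries a mutation at that site (hypothetical input format).
    min_count and min_fraction are illustrative thresholds, not values
    taken from the paper.
    """
    labs_by_site = defaultdict(Counter)
    for site, lab in observations:
        labs_by_site[site][lab] += 1

    flagged = {}
    for site, counts in labs_by_site.items():
        total = sum(counts.values())
        top_lab, top_n = counts.most_common(1)[0]
        # Flag a site only if it is seen often enough and one lab
        # accounts for most of the observations.
        if total >= min_count and top_n / total >= min_fraction:
            flagged[site] = (top_lab, top_n / total)
    return flagged


# Hypothetical toy data: site 100 is reported almost only by lab_A,
# while site 200 is spread evenly across three labs.
toy = (
    [(100, "lab_A")] * 5
    + [(100, "lab_B")]
    + [(200, "lab_A"), (200, "lab_B"), (200, "lab_C")]
)
flagged = lab_dominated_sites(toy)
print(flagged)
```

Sites flagged this way would only be candidates for closer inspection or masking; a real analysis would also need to account for geographic sampling, since a lab that sequences a genuine local lineage will legitimately dominate its mutations.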
Keywords:
Alleles – Genomics – Microbial mutation – Phylogenetic analysis – Phylogenetics – SARS CoV 2 – Trees – Viral evolution
The article was published in
PLOS Genetics
2020, Issue 11