A molecular barcode to inform the geographical origin and transmission dynamics of Plasmodium vivax malaria

Authors: Ernest Diez Benavente ^aff001; Monica Campos ^aff001; Jody Phelan ^aff001; Debbie Nolder ^aff001; Jamille G. Dombrowski ^aff002; Claudio R. F. Marinho ^aff002; Kanlaya Sriprawat ^aff003; Aimee R. Taylor ^aff004; James Watson ^aff006; Cally Roper ^aff001; Francois Nosten ^aff003; Colin J. Sutherland ^aff001; Susana Campino ^aff001; Taane G. Clark ^aff001
Authors place of work: Faculty of Infectious & Tropical Diseases, London School of Hygiene & Tropical Medicine, London, United Kingdom ^aff001; Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil ^aff002; Shoklo Malaria Research Unit, Mahidol-Oxford Tropical Medicine Research Unit, Faculty of Tropical Medicine, Mahidol University, Mae Sot, Tak, Thailand ^aff003; Harvard T.H. Chan School of Public Health, Department of Epidemiology, Boston, Massachusetts, United States of America ^aff004; Broad Institute of MIT and Harvard, Cambridge, Massachusetts, United States of America ^aff005; Centre for Tropical Medicine and Global Health, Nuffield Department of Clinical Medicine Research Building, University of Oxford Old Road Campus, Oxford, United Kingdom ^aff006; Mahidol Oxford Research Unit, Faculty of Tropical Medicine, Mahidol University, Bangkok, Thailand ^aff007; Faculty of Epidemiology and Population Health, London School of Hygiene and Tropical Medicine, London, United Kingdom ^aff008
Published in the journal: A molecular barcode to inform the geographical origin and transmission dynamics of Plasmodium vivax malaria. PLoS Genet 16(2): e32767. doi:10.1371/journal.pgen.1008576
Category: Research Article
doi: https://doi.org/10.1371/journal.pgen.1008576

Summary

Although Plasmodium vivax parasites are the predominant cause of malaria outside of sub-Saharan Africa, they not always prioritised by elimination programmes. P. vivax is resilient and poses challenges through its ability to re-emerge from dormancy in the human liver. With observed growing drug-resistance and the increasing reports of life-threatening infections, new tools to inform elimination efforts are needed. In order to halt transmission, we need to better understand the dynamics of transmission, the movement of parasites, and the reservoirs of infection in order to design targeted interventions. The use of molecular genetics and epidemiology for tracking and studying malaria parasite populations has been applied successfully in P. falciparum species and here we sought to develop a molecular genetic tool for P. vivax. By assembling the largest set of P. vivax whole genome sequences (n = 433) spanning 17 countries, and applying a machine learning approach, we created a 71 SNP barcode with high predictive ability to identify geographic origin (91.4%). Further, due to the inclusion of markers for within population variability, the barcode may also distinguish local transmission networks. By using P. vivax data from a low-transmission setting in Malaysia, we demonstrate the potential ability to infer outbreak events. By characterising the barcoding SNP genotypes in P. vivax DNA sourced from UK travellers (n = 132) to ten malaria endemic countries predominantly not used in the barcode construction, we correctly predicted the geographic region of infection origin. Overall, the 71 SNP barcode outperforms previously published genotyping methods and when rolled-out within new portable platforms, is likely to be an invaluable tool for informing targeted interventions towards elimination of this resilient human malaria.

Keywords:

Genomics – Phylogenetic analysis – Molecular genetics – Trees – Malaria – Asia – Plasmodium – Principal component analysis

Introduction

Plasmodium vivax is the predominant cause of malaria outside of sub-Saharan Africa [1,2] and there are increasing reports of drug-resistance and severe complications that pose a threat to children and pregnant women [3–7]. Elimination efforts have led to reductions in the prevalence of the deadlier P. falciparum malaria, but areas of co-endemicity have seen a corresponding rise in the proportion of P. vivax infections, which appear more resilient to control strategies [8]. In order to halt transmission of vivax, malaria programmes should focus on identifying the main reservoirs of infection and target control measures towards these. P. vivax has been observed in regions where malaria transmission had once been interrupted [9], and in such a context, continuing surveillance and the use of genetic tools to identify transmission networks within malaria outbreaks is essential. The utility of genetic tools has been demonstrated in previous studies in low-transmission settings, such as in Malaysian Borneo [10] or places where there are imported cases, such as in Greece [11]. Further, understanding the transmission dynamics of the parasite populations through assessment of genetic diversity has the potential to play a key role in guiding the elimination efforts, including by revealing transmission events and identifying potential foci of infection. In recent years, genomic studies have dissected the molecular dynamics of P. vivax populations in regions with stable transmission [12–16]. Other studies have used microsatellites to study trends in the population dynamics [17–20]. The availability of whole genome sequencing data can inform the design of “genetic barcodes” which require low numbers of SNP polymorphisms [21] but can be used to infer transmission networks and the geographic source of infections. Molecular barcodes, when combined with affordable delivery systems, can facilitate P. vivax epidemiological and surveillance investigations.

Microsatellite markers have been used to reveal a spectrum of population structures in P. falciparum [22], as well as infer transmission dynamics and complexity of infections [23]. Compared to microsatellites, SNP markers are more suitable for comparisons of both strongly and weakly diverged populations, and in revealing ancestral patterns of genetic structuring [24]. By leveraging off whole genome sequencing for population characterisation [14,15,25,26], a number of SNP-based barcodes have been derived for P. falciparum. A barcode based on 24 SNPs in the nuclear genome (“24-SNP barcode”) has been used to identify and track isolates from an endemic population in Senegal [27]. This barcode harboured capacity to identify clones from non-clones [21] using data from a limited set of long-term adapted laboratory lines [27], and has been employed on field isolates to infer local temporal changes in genetic diversity [28]. This approach has demonstrated the potential effectiveness of such tools in combination with epidemiological methods to elucidate transmission intensity in malaria endemic regions [28]. Nevertheless, the use of isolates with a very limited geographical spread to generate the barcode can underestimate the genomic variability present in other Plasmodium populations and lead to low precision when estimating relatedness [21], thereby impeding the transportability of the 24-SNP barcode across global malaria regions. It has been shown that the predictive power of the 24-SNP barcode for geographical determination is poor, especially when compared to one formed of 23 SNPs from the mitochondria and apicoplast organellar genomes, which predicted the continental origin of samples with 92% accuracy [29]. Another barcode formed of 105 highly frequent nuclear genome SNPs was developed to infer transmission intensity using a geographically broader panel of isolates, thereby potentially providing greater utility across malaria-endemic countries [30]. Simulations on genome-wide data from P. falciparum has recommended the use of at least 200 barcoding SNPs for identity by decent (IBD) analysis in haploid eukaryotes [21]. Overall, the studies in P. falciparum have demonstrated that SNP barcodes can potentially provide insights into the intensity of transmission, identify the geographical origin of the field isolates, and inform the dynamics of the diversity in a parasite population. These include outbreak identification, an event which has been shown to be more likely in low-transmission settings [31].

SNP barcodes for P. vivax have been proposed, including one based on 42-SNP nuclear polymorphisms and another on mitochondrial genome markers, but both developed to ascertain the source of infection [32,33]. One limitation of these barcodes is that they were based on relatively small datasets. As more data becomes available and geographical coverage increases, genotyping tools with greater predictive power and wider global reach can be developed [14–16,34]. Technological advancements in genomics can be leveraged, including the high throughout sequencing of candidate genomic regions and portable genotyping [35]. Ultimately, the identification and integration of informative loci for P. vivax and other plasmodium parasites for inferring transmission and infection source has the potential to revolutionise global malaria surveillance. Here, using whole genome sequencing data for 433 P. vivax isolates across 17 countries, we applied machine learning and SNP tagging approaches from human genome-wide association studies (GWAS) to create a 71 SNP barcode with high predictive ability for geographic origin (91.4% accuracy) and the capability to infer transmission. We demonstrate that the barcode outperforms alternative approaches, including microsatellite genotyping, for global geographical profiling and inferring outbreak events within a low-transmission setting in Malaysia. Further, we validate the barcode by analysing the 71 SNPs in P. vivax DNA sourced from 132 recent UK travellers to East Africa, Asia and South America. Our work demonstrates that the 71 SNP barcode has the potential to be an invaluable tool to help elimination efforts of this resilient neglected Plasmodium species.

Results

SNPs, samples and population structure

We aligned raw sequence data from 867 isolates [12,14–16,34,36] to the PvP01_v1 (http://genedb.org) reference genome, and identified 1,522,046 SNPs. Isolates with high levels of missing genotype calls (> 30%) and high multiplicity of Infection (> 20% heterozygous genotypes) were excluded from analysis. The final dataset was formed of 433 isolates (S1 Table), spanning 17 countries and 6 regions (East Africa 25; South Asia 4; Southeast Asia 242; Papua New Guinea 26; South America 116; North America 20), and 720,340 high quality SNPs of which 89.7% were non-common (minor allele frequency (MAF) < 5%) (S1 Fig, left), consistent with previous findings [34].

A principal component analysis (PCA) revealed that the isolates clustered broadly by regional groups: Southeast Asia (Thailand, Myanmar, Cambodia, Vietnam and Laos), South Asian/East Asian/Southeast Asia (China), Americas (Peru, Colombia, Mexico and Brazil) and Oceanian/Southeast Asia (Papua New Guinea, Indonesia and Malaysia) and others located around the Arabian Sea (Fig 1). The genomic distance between intra-country isolates was on average smaller (mean: 26,505 SNPs, range: 0–37,801 SNPs) compared to between inter-border isolates (mean: 33,160 SNPs, range: 6,034–39,676 SNPs), in line with previous studies [14,15]. The notable exception to the pattern were isolates sourced from neighbouring Thailand and Myanmar, supporting evidence they belong to a similar parasite population [14], which matches with the Thai samples being collected in the western region [15]. Despite this exceptional situation, our analysis suggests the potential of identifying markers that could determine geographical origin to a country level in most settings. Intuitively, the highly skewed distribution of the MAF towards rare variants suggests that more common SNPs are those driving the population structure observed. To assess this hypothesis, we split the dataset into three equally sized divisions based on SNP MAF (tertiles: < 0.002, 0.002–0.007, > 0.007; S1 Fig, right), and for each we constructed neighbour-joining trees and correlated the pair-wise genome distance with the estimate based on all 720k SNPs (Fig 2). These comparisons revealed the strong explanatory power harboured by those SNPs with the highest MAF (Pearson’s r² correlation of 0.94 with the full 720k SNP set) compared to the subsets with lower MAF (Pearson’s r² correlations of 0.25 and 0.27). A MAF cut-off (> 0.3) was determined (S1 Fig, right), leading to a subset of 16,110 SNPs used as the starting point for barcode building for country classification.

Principal component (PC) analysis plot generated using 720,340 high quality SNPs across 433 <i>P</i>. <i>vivax</i> isolates reveals geographic clustering. — **Fig. 1. Principal component (PC) analysis plot generated using 720,340 high quality SNPs across 433 P. *vivax* isolates reveals geographic clustering.**

The sub-setting of SNPs by minimum allele frequency (MAF) reveals the strong explanatory power of high frequency SNPs in <i>Plasmodium vivax</i>. — **Fig. 2. The sub-setting of SNPs by minimum allele frequency (MAF) reveals the strong explanatory power of high frequency SNPs in *Plasmodium vivax*.**

Selection of highly informative barcoding SNPs using a tagging and machine learning approach

In order to further reduce number of markers used for barcoding construction, we applied the software TAGster [37] to identify tagging SNPs that summarise blocks of high linkage disequilibrium (LD), as estimated by the r² metric across windows of size 500 kbp. The genetic variability can be captured by tagging SNPs due to the strong LD structure found in P. vivax populations [34], and we identified 1,173 SNPs, which summarised variation in 4,896 neighbouring SNPs, almost 40% of the total dataset. The 1,173 highly informative SNPs were then used as the pool for a further selection using a random forest modelling approach. Prior to implementation, missing alleles (4.2%) were imputed with high accuracy (< 1% error, see Materials and methods), and a neighbour joining tree was constructed using the resulting data, thereby confirming that no bias was introduced to the clustering patterns (S2 Fig, top). The correlation between genome distances based on the 1,173 SNPs and whole genome sequencing (720k SNPs) was high (Pearson’s r² = 0.96) (S2 Fig, middle). Similarly, a Bland-Altman analysis revealed the genetic distances between isolates obtained using the subset of 1,173 tagging SNPs were similar to those based on whole genome SNPs (S2 Fig, bottom), especially those with the highest MAF (see Fig 2, bottom right). We sought to show that the 1,173 SNPs could not only predict differences between geographical regions, but also estimate genetic distances within intra-border isolates and identify subclades in low transmission settings [31]. In particular, when analysing those isolates within a genomic distance equivalent of <20,000 SNPs (0.74), we found a high correlation (Pearson’s r² = 0.98) between distances based on the 1,173 tagging and 720k genome-wide SNPs. This analysis revealed that the 1,173 high frequency tagging SNPs can not only detect strong inter-border differentiation but potentially identify highly related isolates within the same country based on genomic distance.

Application of the random forest classification approach to the classification of country involved constructing 500 trees, partitioning the dataset randomly into 80% training (n = 346) and 20% for validation (n = 87), and using 34 variables at each nodal split. These default settings have been used previously in P. falciparum genome-wide analysis with success [38]. Classification error rates became stable when 100 trees were averaged (S3 Fig, B). The final model inside the training set performed with an overall out-of-bag error rate of 17.1%, where the main classification errors were found across Southeast Asian populations (Thailand, Vietnam, Cambodia, and Myanmar). These populations were identified as being highly related using the SNP-based genomic distance (Fig 2, right), supporting previous observations [14]. The random forest model was then used to identify the 60 SNPs with the highest predictive importance across the trees. The 60 SNP cut-off was established using a point of inflection analysis of cumulative predictive importance, where the addition of further SNPs does not significantly improve predictive power (S3 Fig, B). A further 11 SNPs were chosen to summarise high between-country genetic differentiation based on the fixation index (F_ST > 0.7), leading to a final barcoding set of 71 SNPs (Table 1). Within this SNP set, there are differences in allele frequency across the two main regions (Southeast Asia and South America), but no markers were fixed across the populations. The 71 markers are in low linkage disequilibrium (S4 Fig) (LD r²: mean 0.15, inter-quartile range 0.02–0.24), but some blocks of correlation are observed, which can be explained by an uneven geographic distribution of the isolates.

Selection of 71 barcoding SNPs for <i>Plasmodium vivax</i>. — **Tab. 1. Selection of 71 barcoding SNPs for *Plasmodium vivax*.**

In order to assess the potential of the barcode for geographic classification, a PCA was generated using only the 71 SNPs (Fig 3, top), and similar clustering patterns were observed to those using the genome-wide (720k) SNP set (Fig 1). When comparing the genomic distances obtained using the 71 barcoding to 720k genome-wide SNPs in intra-border pair-wise comparisons, we observed a high Pearson’s r² correlation (0.898), providing evidence for the potential of this barcode to not only identify geographical origin but also provide insights into the relatedness of intra-border isolates (Fig 3, middle). Similarly, a Bland-Altman analysis revealed little overall bias in the estimation of genetic distance between isolates using whole genome and the 71 barcoding SNPs (Fig 3, bottom; average difference of 0.09). Using the 71 barcoding SNPs, there is a trend towards an underestimation of the genetic distance for closely related isolates, and an overestimation for distantly related ones. The potential use of the barcode to infer intra-border relatedness and provide insights into transmission dynamics, is supported by the observation that 98.2% (425/433) of the haplotypes obtained were unique in the dataset. Furthermore, the 71 SNPs were used to predict the geographical source of the 20% of the isolates (n = 87) not used to develop the random forest model. The model using only the 71 SNPs yielded an overall accuracy of 91.4% (8.6% out-of-bag error) in predicting the country source of these 87 isolates and a further 16 Brazilian newly sequenced isolates. The 71 SNPs outperformed a published 42-SNP barcode [33], which under the same random forest model conditions (80%/20% training/prediction and 500 trees) obtained a 77.5% accuracy. The low accuracy of the 42-SNP barcode can also be observed in the ambiguous clustering found in the PCA (S5 Fig, top) and neighbour-joining tree (S5 Fig, bottom). Furthermore, the correlation between genetic distances based on the SNP barcode and genome-wide (720k) SNPs was lower (Pearson’s r² = 0.59).

Geographic clustering of <i>Plasmodium vivax</i> isolates using the 71 SNP barcode. — **Fig. 3. Geographic clustering of *Plasmodium vivax* isolates using the 71 SNP barcode.**

In-silico testing of the barcode in a near-elimination setting in Malaysia

A field-ready SNP barcode with the potential for being deployed in low-transmission settings has to be proven efficacious in settings where the identification of foci of infection and imported cases are of key relevance. We used a dataset comprising 60 P. vivax isolates, sourced from Sabah Malaysia, where the population dynamics have been extensively studied using microsatellite markers and whole genome sequencing [31]. In silico characterization using the 71 SNP barcode in a PCA analysis identified two main populations, denoted previously as K1 and K2 (Fig 4A). K2 comprised of 26 almost genetically identical isolates in a known transmission outbreak, previously supported using a set of nine microsatellite markers [31]. The estimated genetic distances between isolates based on the 71 SNPs were strongly correlated with those based on the genome-wide SNPs (Pearson’s r² = 0.88). The outbreak isolates shared the same haplotype across the barcoding polymorphisms in most cases, with only 3 isolates presenting a one SNP difference. There was one notable exception (ERR1475456) which presented a larger genetic distance (Fig 4A). This isolate shared the same microsatellite haplotype as the outbreak K2, but the genomic distance from the samples in the rest of the outbreak samples was greater, and therefore likely to be of independent origin (Fig 4B). A neighbour-joining tree constructed using the 71 SNPs revealed clustering by administrative division in Sabah Malaysia, indicating its potential for tracking of cases from different health authorities (Fig 4C).

Use of the 71 SNP barcode in <i>Plasmodium vivax</i> isolates from Sabah, Malaysia reveals patterns of transmission. — **Fig. 4. Use of the 71 SNP barcode in *Plasmodium vivax* isolates from Sabah, Malaysia reveals patterns of transmission.**

Prospective testing of the barcode using traveller genotype data

The genotypes for the 71 barcoding SNPs were characterised in 132 P. vivax DNA sourced from returning travellers to the UK from ten endemic countries (n = 132; Afghanistan (n = 26), Bangladesh (n = 1), Eritrea (n = 11), Ethiopia (n = 6), Guyana (n = 3), India (n = 38), Pakistan (n = 35), the Philippines (n = 1), Sudan (n = 7) and Uganda (n = 4)). The genotypes were used to construct a combined PCA plot for the 132 prospective and 433 barcode-development isolates, which revealed clear clustering by geographic region (Fig 5). The isolates co-localised in the PCA with the previously determined wider-regional populations, including East Africa, South Asia, Southern Southeast Asia and South America. A PCA based only on prospectively collected data (n = 132) (S6 Fig) revealed a clear regional pattern, but no strong clustering at a country level. The country source was predicted using the 71 SNP barcoding genotypes for each prospective isolate. An 80.1% country-level prediction was obtained for those isolates with countries represented in the training data (Ethiopia and India). For validation isolates that were sourced from countries in Central and South Asia (Afghanistan and Pakistan) and East Africa (Sudan, Uganda and Eritrea) that were not included in the SNP barcode, the predictions reflected the nearest countries included in the development process.

The principal component (PC) analysis plot of the 565 <i>P</i>. <i>vivax</i> isolates, constructed using the 71 SNP barcode. The isolates include the 433 used in the design of the barcode (circles) and the 132 prospective UK traveller samples (stars). — Fig. 5. The principal component (PC) analysis plot of the 565 P. *vivax* isolates, constructed using the 71 SNP barcode. The isolates include the 433 used in the design of the barcode (circles) and the 132 prospective UK traveller samples (stars).

Discussion

Plasmodium vivax accounts for a significant proportion of the global malaria burden, with the greatest incidence outside of sub-Saharan Africa [2]. The resilience of the parasite is evidenced by its re-appearance in regions where malaria transmission had previously been halted [9]. Microsatellite genotyping has been used to study P. vivax genomic and population dynamics, although it underestimates the total variability in natural populations [39]. Whole genome sequencing is the gold standard approach but is currently logistic -⁠ and cost-inefficient for large genomic epidemiological studies, especially in under-resourced endemic areas. We have a developed in-silico a 71 SNP barcode, informed by whole genome sequencing data from 433 isolates across 17 different countries worldwide. To date, barcodes in Plasmodium species have each been designed for specific roles. Some barcodes have focused on the determination of the geographical origin of the samples using nuclear or organellar genomic markers [29,32], others on measuring transmission intensity using the frequency of unique haplotypes [28,33] or the complexity of the infections [30]. Further, P. vivax barcodes have been based on small datasets featuring fewer geographic populations [34]. As more datasets become available genotyping tools such as our barcode will have global applicability. In fact, application of a 42-SNP barcode [33] to our dataset led to a lower accuracy (77.5%) when predicting origin of the isolates at a country level. The 42-SNP barcode also performed sub-optimally when estimating SNP-based relatedness of isolates, and therefore may not be suited to the inference of local transmission networks.

Our barcode was constructed by recognising that common SNPs are more robust markers and harbour greater explanatory power for both geographical and transmission inference, as demonstrated previously in human genetic studies [40]. By triaging SNPs using established LD tagging methods, it was possible to apply random forest methodology to select 60 markers, augmented by 11 inter-continental SNPs, to accurately predict country of source (91.4%). Whilst there are alternatives to the random forest algorithm, this approach has become an established data analysis tool in bioinformatics, with reported high performance in settings where the number of variables is much larger than the number of observations [38]. The methodology has an ability to explore complex interactions between correlated SNPs, and return useful measures of their predictive importance [41,42]. The set of “important” SNP predictors of country determined by the random forest approach was robust to initial model parameterisation, and the final model and set of SNPs were validated on a 20% subset of the original samples augmented by prospectively collected ones from Brazil. The 71 SNP barcode and intermediate SNP sets informing its construction were used to reconstruct principal component analysis plots. These plots were consistent with the overall population structure based on 720k SNP markers, and therefore confirm there was no major loss of geographical specificity.

Our 71 SNP barcode is the first that shows such strong levels of accuracy in geographic prediction at a country level in P. vivax for a diverse dataset, making it a valuable tool for the detection of imported cases of malaria. As whole genome data becomes available, especially from sites with currently poor coverage such as central America, Africa and South Asia, the machine learning approach can be used to update the SNPs in the barcode. The barcode was also designed to be able to provide information about potential transmission trends and therefore be useful in field settings, where large genomic differences are less likely, with the exception of imported cases. The proportion of unique haplotypes identified across the dataset was high (98.2%; 425/433 haplotypes observed), which allows greater scope for informing on intra-border haplotype diversity, including low diversity such as in an outbreak setting. A simulation study using P. falciparum genome sequencing and P. vivax microsatellite data has estimated that 200 biallelic markers should be used for IBD analysis [21]. In our study, using whole genome sequencing data on P. vivax we demonstrate the potential of a 71 SNP barcode to estimate genetic distance (SNP differences) as a measure of relatedness and to predict geographic origin to the country level. Further work could be aimed at assessing and expanding this set of markers and explore its potential use for IBD analysis specifically focusing on intra-border transmission. This potential utility for transmission characterisation was demonstrated by the use of intra-border highly related isolates, where we confirmed that the SNP distances based on the 71-SNP barcode and genome-wide 720k markers were highly correlated, and represent an improvement on comparative reported values for microsatellites (Pearson’s r² = 0.70) [39]. Further, the use of the 71 SNP barcode was validated in silico using data from 60 P. vivax sourced from a low endemic and near-elimination setting in Malaysia, where it was possible to partition the population into highly structured subclades and confirm the presence of an outbreak cluster, previously identified using both microsatellite genotyping and sequencing. Also, by using the 71-SNP barcode we identified a misclassified “outbreak” isolate with an identical microsatellite haplotype, and evidence of regional clustering by the different administrative divisions, further supporting the utility of the tool and its superior performance to microsatellite genotyping. One of the challenges when studying P. vivax infections is the difficulty to differentiate recrudescent infections from reinfections or relapses [43–46]. Although, we have not explored the ability of our barcode to differentiate between such cases, based on the results from the Malaysian setting we anticipate it would at least be able to identify cases of relapse caused by meiotic siblings [44]. This is because of the high degree of similarity expected in such cases already picked up in clonal expansions, although specific work would be needed to address this challenge.

The barcode was also tested on 132 newly assessed isolates sourced from travelers to 10 different countries across the main P. vivax endemic regions, revealing that the 71 SNPs have a strong regional discriminatory power, and achieved an 80.1% overall accuracy in classifying the isolates at a country level for the countries represented in the barcode development. The PCA performed using the 71 SNPs showed that isolates originating from geographically distant regions (e.g. South Asia) clustered, and demonstrated the ability of the barcode to identify regional signatures even for previously unstudied locations. Nevertheless, the next step in order to improve our barcode would be the incorporation of the whole genome sequencing data from these isolates and future collections. Further, it is possible to genotype the barcoding SNPs using standard TaqMan genotyping assays or PCR followed by high resolution melting analysis, as previously described for other malaria SNP barcodes [27, 33]. It is also possible to apply amplicon sequencing using a portable sequencer, such as the MinION nanopore platform linked to a laptop computer. This approach was applied recently to genotype P. falciparum parasites [47].

In summary, we have presented a new in-silico molecular barcode for P. vivax that can provide information on both geographical origin and identify highly related isolates within country borders to help infer transmission events and identify foci of infection. The 71 SNP barcode out-competes previous genotyping methods and is a powerful and potentially affordable solution that could enable the execution of large genomic epidemiological studies, with high throughput assessment of large numbers of parasites. By leveraging off growing and large datasets of whole genome sequencing data, and the power of machine learning algorithms, it will be possible to update the barcode, augment it with drug resistance markers, and implement it rapidly in field-based settings using portable technologies. Ultimately, insights into genetic diversity will assist the much-needed understanding of the dynamics of P. vivax populations and inform disease control decision making.

Materials and methods

Genomic data generation

Illumina sequenced data from previously published studies [14–16] was downloaded from ENA repository to form a total dataset of 834 isolates. Some of these data are from the MalariaGEN P. vivax Genome Variation project. Each isolate data was mapped against the PvP01_v1 reference (obtained from http://genedb.org) using bwa-mem [47]. SNPs (n = 1,522,046) were called from the resulting alignments using the samtools software suite [48], as previously described [34]. Isolates and SNPs were excluded if they presented with > 20% missing or heterozygous genotype calls, and additional SNPs were removed if located within hypervariable gene regions (e.g. vir genes). The final dataset consisted of 720,340 high-quality SNPs and 433 isolates.

Population structure and tag SNP selection

The 720,340 high-quality biallelic SNPs and its subsets were used to infer distance matrices by calculating Manhattan distances (”identity-by-state”) between samples. These genetic distances where divided by the sum of the allele frequencies of the SNPs included in their calculation, in order to make the units comparable. These matrices were used to generate principal component analysis (PCA) plots and neighbour-joining trees (R ape package [49]). The Pearson’s r² metric, calculated using the R base function cor, was used to estimate the correlation between distance matrices. A Bland-Altman analysis comparing genetic distances based on different sets of SNPs was performed using the R BlandAltmanLeh library. The software TAGster [37] was used to identify SNPs which summarise blocks of high linkage disequilibrium, as estimated using the genetic r² metric. We specified an r² threshold of at least 0.7 for inclusion in a block and a window size of 500 kbp, leading to 16,110 SNPs with minor allele frequency > 0.3 being included for analysis. The resulting 1,173 tag SNPs identified were then further characterized for downstream analysis. The Fixation index (F_ST) was calculated using in house R scripts and a threshold of F_ST > 0.7 was used to determine population-level informative barcoding SNPs [25,34]. A flow diagram summarising the selection process of the SNPs is presented in S7 Fig.

Barcode SNP selection using a random forest approach

The selected 1,173 highly informative SNPs obtained using a combination of minor allele frequency and SNP tagging approaches were extracted from the original dataset. Allele imputation was performed on the missing data points (4.2%) in the dataset using the R missForest package [50]. This yielded an estimated out-of-bag error in the imputation of 16.2% which left a total of 0.7% potentially erroneous calls. After imputation, a random selection of 80% of the dataset was assigned as a training set and the remaining 20% was assigned as a test dataset. Source country was tested as the predicted variable. Subsequently, five hundred trees were calculated using the RandomForest [51] package in R in order to determine the SNPs in the dataset with highest importance for classification of samples into countries. We then selected the 60 SNPs with highest importance and used the R LDcorSV package [52] to calculate the correlation between the markers. PCA plots and neighbour-joining trees were constructed for this subset of SNPs. The final subset of 71 SNPs was then used to retrain the random forest model on the training set, and this model was applied for the prediction of the country of origin in the validation dataset.

Validation of the barcode using UK traveller isolate genotype data

DNA was extracted from blood samples sourced from 132 returning travellers to the UK from endemic areas between 2017 and 2018. The 132 samples were sourced from travellers to Afghanistan (n = 26), Bangladesh (n = 1), Eritrea (n = 11), Ethiopia (n = 6), Guyana (n = 3), India (n = 38), Pakistan (n = 35), the Philippines (n = 1), Sudan (n = 7) and Uganda (n = 4). The blood samples are stored in the Public Health England (PHE) Malaria Reference Laboratory (MRL) at the LSHTM. The DNA underwent Illumina MiSeq 150bp paired-end sequencing at the LSHTM. The resulting data was aligned to the P. vivax PvP01_v1 reference genome using bwa-mem (see [34,53] for the bioinformatics pipeline), thereby allowing the calling of the genotypes at the 71 positions (S2 Table). The validation isolates were not used in the construction of the random forest model. The PHE and LSHTM ethics boards provided approval for the sequencing of the P. vivax DNA.

Supporting information

S1 Table [xlsx]
The 433 high-quality . isolates used for the development of the barcode.

S2 Table [xlsx]
The barcoding genotypes and country metadata for the 132 . prospective UK traveller isolates.

S1 Fig [tiff]
(Left) Distribution of SNPs according to minor allele frequency (MAF); (Right) SNPs partitioned into three equally sized divisions based on MAF (blue dashed lines), and a cut-off of MAF > 0.3 (red dashed line) was used to pre-select SNPs for downstream analysis.

S2 Fig [tiff]
(Top) Neighbour-joining tree based on 1,173 tagging SNPs in . selected using the [] software shows strong similarity with the tree from (right), observing a strong geographical signal; (Middle) the correlation of genome distance based on whole genome sequencing (WGS; 720k SNPs) with the subset of tagging (1,173) SNPs is high (Pearson’s = 0.98); (Bottom) A Bland-Altman analysis that compares the differences in genetic distance between those based on the whole genome and the subset of 1,173 tagging SNPs; it shows a slight underestimation of the distance measured by the tagging SNPs (mean of differences: 0.177; with standard deviation (SD)).

S3 Fig [tiff]
(A) Classification error for the different geographic categories across the 500 trees in the random forest model reaches stability when 100 trees are averaged; (B) Variable importance estimated from the random forest model for the number of 1,173 tagging SNPs.

S4 Fig [ld]
The low linkage disequilibrium (LD) between the 71 SNPs in the .

S5 Fig [tiff]
The principal component analysis (PCA) plot and neighbour-joining tree constructed using a previously published 42-SNP barcode [].

S6 Fig [tiff]
The principal component analysis (PCA) plot for the 132 . prospective UK traveller isolates, constructed using the 71 barcoding SNP genotypes.

S7 Fig [tiff]
Flow diagram of the SNP selection process for the 71 SNP barcode.

Zdroje

1. Howes RE, Battle KE, Mendis KN, Smith DL, Cibulskis RE, Baird JK, et al. Global Epidemiology of Plasmodium vivax. Am J Trop Med Hyg. 2016;95 : 15–34. doi: 10.4269/ajtmh.16-0141 27402513

2. WHO. World Malaria Report 2017. Geneva; 2017.

3. Tjitra E, Anstey NM, Sugiarto P, Warikar N, Kenangalem E, Karyana M, et al. Multidrug-Resistant Plasmodium vivax Associated with Severe and Fatal Malaria: A Prospective Study in Papua, Indonesia. PLOS Med. 2008;5: e128. Available: https://doi.org/10.1371/journal.pmed.0050128 18563962

4. Poespoprodjo JR, Fobia W, Kenangalem E, Lampah DA, Warikar N, Seal A, et al. Adverse Pregnancy Outcomes in an Area Where Multidrug-Resistant Plasmodium vivax and Plasmodium falciparum Infections Are Endemic. Clin Infect Dis. 2008;46 : 1374–1381. doi: 10.1086/586743 18419439

5. Poespoprodjo JR, Fobia W, Kenangalem E, Lampah DA, Hasanuddin A, Warikar N, et al. Vivax Malaria: A Major Cause of Morbidity in Early Infancy. Clin Infect Dis. 2009;48 : 1704–1712. Available: http://dx.doi.org/10.1086/599041 19438395

6. Price RN, von Seidlein L, Valecha N, Nosten F, Baird JK, White NJ. Global extent of chloroquine-resistant Plasmodium vivax: a systematic review and meta-analysis. Lancet Infect Dis. 2014;14 : 982–991. doi: 10.1016/S1473-3099(14)70855-2 25213732

7. Menard D, Dondorp A. Antimalarial Drug Resistance: A Threat to Malaria Elimination. Cold Spring Harb Perspect Med. 2017;7. Available: http://perspectivesinmedicine.cshlp.org/content/7/7/a025619.abstract

8. Cotter C, Sturrock HJW, Hsiang MS, Liu J, Phillips AA, Hwang J, et al. The changing epidemiology of malaria elimination: new strategies for new challenges. Lancet (London, England). 2013;382 : 900–911. doi: 10.1016/S0140-6736(13)60310-4

9. Sattabongkot J, Tsuboi T, Zollner GE, Sirichaisinthop J, Cui L. Plasmodium vivax transmission: chances for control? Trends Parasitol. 2004;20 : 192–198. doi: 10.1016/j.pt.2004.02.001 15099559

10. Auburn S, Benavente ED, Miotto O, Pearson RD, Amato R, Grigg MJ, et al. Genomic analysis of a pre-elimination Malaysian Plasmodium vivax population reveals selective pressures and changing transmission dynamics. Nat Commun. 2018;9. doi: 10.1038/s41467-018-04965-4 29968722

11. Spanakos G, Snounou G, Pervanidou D, Alifrangis M, Rosanas-Urgell A, Baka A, et al. Genetic Spatiotemporal Anatomy of Plasmodium vivax Malaria Episodes in Greece, 2009–2013. Emerg Infect Dis. 2018;24 : 541–548. doi: 10.3201/eid2403.170605 29460743

12. Winter DJ, Pacheco MA, Vallejo AF, Schwartz RS, Arevalo-Herrera M, Herrera S, et al. Whole Genome Sequencing of Field Isolates Reveals Extensive Genetic Diversity in Plasmodium vivax from Colombia. PLoS Negl Trop Dis. 2016;9: e0004252. Available: https://doi.org/10.1371/journal.pntd.0004252

13. Parobek CM, Lin JT, Saunders DL, Barnett EJ, Lon C, Lanteri CA, et al. Selective sweep suggests transcriptional regulation may underlie Plasmodium vivax resilience to malaria control measures in Cambodia. Proc Natl Acad Sci U S A. 2016;113: E8096–E8105. doi: 10.1073/pnas.1608828113 27911780

14. Hupalo DN, Luo Z, Melnikov A, Sutton PL, Rogov P, Escalante A, et al. Population genomics studies identify signatures of global dispersal and drug resistance in Plasmodium vivax. Nat Genet. 2016;48 : 953–958. Available: http://dx.doi.org/10.1038/ng.3588 27348298

15. Pearson RD, Amato R, Auburn S, Miotto O, Almagro-Garcia J, Amaratunga C, et al. Genomic analysis of local variation and recent evolution in Plasmodium vivax. Nat Genet. 2016;48 : 959–964. Available: http://dx.doi.org/10.1038/ng.3599 27348299

16. de Oliveira TC, Rodrigues PT, Menezes MJ, Gonçalves-Lopes RM, Bastos MS, Lima NF, et al. Genome-wide diversity and differentiation in New World populations of the human malaria parasite Plasmodium vivax. PLoS Negl Trop Dis. 2017;11: e0005824. Available: https://doi.org/10.1371/journal.pntd.0005824 28759591

17. Abdullah NR, Barber BE, William T, Norahmad NA, Satsu UR, Muniandy PK, et al. Plasmodium vivax population structure and transmission dynamics in Sabah Malaysia. PLoS One. 2013;8: e82553. doi: 10.1371/journal.pone.0082553 24358203

18. Getachew S, To S, Trimarsanto H, Thriemer K, Clark TG, Petros B, et al. Variation in Complexity of Infection and Transmission Stability between Neighbouring Populations of Plasmodium vivax in Southern Ethiopia. PLoS One. 2015;10: e0140780. doi: 10.1371/journal.pone.0140780 26468643

19. Trimarsanto H, Benavente ED, Noviyanti R, Utami RAS, Trianty L, Pava Z, et al. VivaxGEN: An open access platform for comparative analysis of short tandem repeat genotyping data in Plasmodium vivax Populations. PLoS Negl Trop Dis. 2017;11. doi: 10.1371/journal.pntd.0005465 28362818

20. Pava Z, Noviyanti R, Handayuni I, Trimarsanto H, Trianty L, Burdam FH, et al. Genetic micro-epidemiology of malaria in Papua Indonesia: Extensive P. vivax diversity and a distinct subpopulation of asymptomatic P. falciparum infections. PLoS One. 2017;12: e0177445. doi: 10.1371/journal.pone.0177445 28498860

21. Taylor AR, Jacob PE, Neafsey DE, Buckee CO. Estimating Relatedness Between Malaria Parasites. Genetics. 2019;212 : 1337–1351. doi: 10.1534/genetics.119.302120 31209105

22. Anderson TJ, Haubold B, Williams JT, Estrada-Franco JG, Richardson L, Mollinedo R, et al. Microsatellite markers reveal a spectrum of population structures in the malaria parasite Plasmodium falciparum. Mol Biol Evol. 2000;17 : 1467–1482. doi: 10.1093/oxfordjournals.molbev.a026247 11018154

23. Greenhouse B, Myrick A, Dokomajilar C, Woo JM, Carlson EJ, Rosenthal PJ, et al. Validation of microsatellite markers for use in genotyping polyclonal Plasmodium falciparum infections. Am J Trop Med Hyg. 2006;75 : 836–842. 17123974

24. Andrews KR, Good JM, Miller MR, Luikart G, Hohenlohe PA. Harnessing the power of RADseq for ecological and evolutionary genomics. Nat Rev Genet. 2016;17 : 81–92. doi: 10.1038/nrg.2015.28 26729255

25. Manske M, Miotto O, Campino S, Auburn S, Almagro-Garcia J, Maslen G, et al. Analysis of Plasmodium falciparum diversity in natural infections by deep sequencing. Nature. 2012;487 : 375–379. Available: http://dx.doi.org/10.1038/nature11174 22722859

26. Diez Benavente E, Florez de Sessions P, Moon RW, Holder AA, Blackman MJ, Roper C, et al. Analysis of nuclear and organellar genomes of Plasmodium knowlesi in humans reveals ancient population structure and recent recombination among host-specific subpopulations. PLOS Genet. 2017;13: e1007008. doi: 10.1371/journal.pgen.1007008 28922357

27. Daniels R, Volkman SK, Milner DA, Mahesh N, Neafsey DE, Park DJ, et al. A general SNP-based molecular barcode for Plasmodium falciparum identification and tracking. Malar J. 2008;7 : 223. doi: 10.1186/1475-2875-7-223 18959790

28. Daniels RF, Schaffner SF, Wenger EA, Proctor JL, Chang H-H, Wong W, et al. Modeling malaria genomics reveals transmission decline and rebound in Senegal. Proc Natl Acad Sci. 2015;112 : 7067–7072. Available: http://www.pnas.org/content/112/22/7067.abstract 25941365

29. Preston MD, Campino S, Assefa SA, Echeverry DF, Ocholla H, Amambua-Ngwa A, et al. A barcode of organellar genome polymorphisms identifies the geographic origin of Plasmodium falciparum strains. Nat Commun. 2014;5 : 4052. Available: http://dx.doi.org/10.1038/ncomms5052 24923250

30. Chang H-H, Worby CJ, Yeka A, Nankabirwa J, Kamya MR, Staedke SG, et al. THE REAL McCOIL: A method for the concurrent estimation of the complexity of infection and SNP allele frequency for malaria parasites. PLOS Comput Biol. 2017;13: e1005348. Available: https://doi.org/10.1371/journal.pcbi.1005348 28125584

31. Auburn S, Benavente ED, Miotto O, Pearson RD, Amato R, Grigg MJ, et al. Genomic analysis of a pre-elimination Malaysian Plasmodium vivax population reveals selective pressures and changing transmission dynamics. Nat Commun. 2018;9 : 2585. doi: 10.1038/s41467-018-04965-4 29968722

32. Rodrigues PT, Alves JMP, Santamaria AM, Calzada JE, Xayavong M, Parise M, et al. Using mitochondrial genome sequences to track the origin of imported plasmodium vivax infections diagnosed in the United States. Am J Trop Med Hyg. 2014;90 : 1102–1108. doi: 10.4269/ajtmh.13-0588 24639297

33. Baniecki ML, Faust AL, Schaffner SF, Park DJ, Galinsky K, Daniels RF, et al. Development of a Single Nucleotide Polymorphism Barcode to Genotype Plasmodium vivax Infections. PLoS Negl Trop Dis. 2015;9: e0003539. Available: https://doi.org/10.1371/journal.pntd.0003539 25781890

34. Benavente ED, Ward Z, Chan W, Mohareb FR, Sutherland CJ, Roper C, et al. Genomic variation in Plasmodium vivax malaria reveals regions under selective pressure. PLoS One. 2017;12. doi: 10.1371/journal.pone.0177134 28493919

35. Nag S, Dalgaard MD, Kofoed P-E, Ursing J, Crespo M, Andersen LO, et al. High throughput resistance profiling of Plasmodium falciparum infections based on custom dual indexing and Illumina next generation sequencing-technology. Sci Rep. 2017;7. doi: 10.1038/s41598-017-02724-x 28546554

36. Rumaseb A, Marfurt J, Anstey NM, Price RN, Auburn S, Barber B, et al. Genomic Analysis of Plasmodium vivax in Southern Ethiopia Reveals Selective Pressures in Multiple Parasite Mechanisms. 2019. doi: 10.1093/infdis/jiz016 30668735

37. Xu Z, Kaplan NL, Taylor JA. TAGster: efficient selection of LD tag SNPs in single or multiple populations. Bioinformatics. 2007;23 : 3254–3255. doi: 10.1093/bioinformatics/btm426 17827206

38. Takala-Harrison S, Jacob CG, Arze C, Cummings MP, Silva JC, Dondorp AM, et al. Independent Emergence of Artemisinin Resistance Mutations Among Plasmodium falciparum in Southeast Asia. J Infect Dis. 2014;211 : 670–679. doi: 10.1093/infdis/jiu491 25180241

39. Vali U, Einarsson A, Waits L, Ellegren H. To what extent do microsatellite markers reflect genome-wide genetic diversity in natural populations? Mol Ecol. 2008;17 : 3808–3817. doi: 10.1111/j.1365-294X.2008.03876.x 18647238

40. Thompson EA. The estimation of pairwise relationships. Ann Hum Genet. 1975;39 : 173–188. doi: 10.1111/j.1469-1809.1975.tb00120.x 1052764

41. Díaz-Uriarte R, de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006;7 : 3. doi: 10.1186/1471-2105-7-3 16398926

42. Chen X, Ishwaran H. Random Forests for Genomic Data Analysis. Genomics. 2012;99 : 323–329. doi: 10.1016/j.ygeno.2012.04.003 22546560

43. Bright AT, Alenazi T, Shokoples S, Tarning J, Paganotti GM, White NJ, et al. Genetic analysis of primaquine tolerance in a patient with relapsing vivax malaria. Emerg Infect Dis. 2013;19 : 802–805. doi: 10.3201/eid1905.121852 23648098

44. Bright AT, Manary MJ, Tewhey R, Arango EM, Wang T, Schork NJ, et al. A high resolution case study of a patient with recurrent Plasmodium vivax infections shows that relapses were caused by meiotic siblings. PLoS Negl Trop Dis. 2014;8: e2882–e2882. doi: 10.1371/journal.pntd.0002882 24901334

45. Daher A, Silva JCAL, Stevens A, Marchesini P, Fontes CJ, Ter Kuile FO, et al. Evaluation of Plasmodium vivax malaria recurrence in Brazil. Malar J. 2019;18 : 18. doi: 10.1186/s12936-019-2644-y 30670020

46. Commons RJ, Simpson JA, Thriemer K, Humphreys GS, Abreha T, Alemu SG, et al. The effect of chloroquine dose and primaquine on Plasmodium vivax recurrence: a WorldWide Antimalarial Resistance Network systematic review and individual patient pooled meta-analysis. Lancet Infect Dis. 2018;18 : 1025–1034. doi: 10.1016/S1473-3099(18)30348-7 30033231

47. Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26 : 589–595. doi: 10.1093/bioinformatics/btp698 20080505

48. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The Sequence Alignment/Map format and SAMtools. Bioinformatics. 2009;25 : 2078–2079. doi: 10.1093/bioinformatics/btp352 19505943

49. Paradis E, Claude J, Strimmer K. APE: Analyses of Phylogenetics and Evolution in R language. Bioinformatics. 2004;20 : 289–290. doi: 10.1093/bioinformatics/btg412 14734327

50. Stekhoven DJ, Bühlmann P. MissForest—non-parametric missing value imputation for mixed-type data. Bioinformatics. 2012;28 : 112–118. Available: http://dx.doi.org/10.1093/bioinformatics/btr597 22039212

51. Breiman L. Random Forests. Mach Learn. 2001;45 : 5–32. doi: 10.1023/A:1010933404324

52. Mangin B, Siberchicot A, Nicolas S, Doligez A, This P, Cierco-Ayrolles C. Novel measures of linkage disequilibrium that correct the bias due to population structure and relatedness. Heredity (Edinb). 2012;108 : 285–291. doi: 10.1038/hdy.2011.73 21878986

53. Benavente ED, Oresegun DR, de Sessions PF, Walker EM, Roper C, Dombrowski JG, et al. Global genetic diversity of var2csa in Plasmodium falciparum with implications for malaria in pregnancy and vaccine development. Sci Rep. 2018;8 : 15429. doi: 10.1038/s41598-018-33767-3 30337594

Článek PRH1 mediates ARF7-LBD dependent auxin signaling to regulate lateral root development in Arabidopsis thaliana

Článek Evolutionary dynamics of microRNA target sites across vertebrate evolution

Článek Dramatically diverse Schizosaccharomyces pombe wtf meiotic drivers all display high gamete-killing efficiency

Článek Magnetic genes: Studying the genetics of biomineralization in magnetotactic bacteria

Článek Tensor decomposition of stimulated monocyte and macrophage gene expression profiles identifies neurodegenerative disease-specific trans-eQTLs

Článek Mutational processes of distinct POLE exonuclease domain mutants drive an enrichment of a specific TP53 mutation in colorectal cancer

Článek Parallel and nonparallel genomic responses contribute to herbicide resistance in Ipomoea purpurea, a common agricultural weed

Článek Skp, Cullin, F-box (SCF)-Met30 and SCF-Cdc4-Mediated Proteolysis of CENP-A Prevents Mislocalization of CENP-A for Chromosomal Stability in Budding Yeast

Článek Meiotic gene silencing complex MTREC/NURS recruits the nuclear exosome to YTH-RNA-binding protein Mmi1

Článek Interaction of yeast Rad51 and Rad52 relieves Rad52-mediated inhibition of de novo telomere addition

Článek Aberrant cell segregation in the craniofacial primordium and the emergence of facial dysmorphology in craniofrontonasal syndrome

Článek Identification of a novel base J binding protein complex involved in RNA polymerase II transcription termination in trypanosomes