Introduction

Epigenome-wide association studies (EWAS) have shown that a substantial amount of variation in DNA methylation (DNAme) exists between human populations [1,2,3,4,5,6,7]. Therefore, if left unaccounted for, population-associated variation can interfere with the discovery of DNAme alterations associated with disease or environment. This type of confounding, often referred to as population stratification, can be addressed by inferring population-associated variation directly from DNAme data itself [8,9,10], as is done in genome-wide association studies (GWAS) [11]. However, unlike genetic markers, epigenetic markers are tissue-specific and, therefore, a DNAme-based method developed in a specific tissue or population may not generalize well to other tissues with unique DNAme profiles.

In EWAS, confounding from population stratification is most often addressed using self-reported ethnicity/race to stratify study samples across the phenotype of interest. But, defining ethnicity/race is a complex task requiring the interpretation of a combination of biological and social factors leading to several complications: (i) inconsistent definition of ethnicity/race categories between individuals/organizations [12, 13]; (ii) self-reporting more than one ethnicity/race [14]; and (iii) missing ethnicity information altogether. To overcome the limitations of ethnicity/race categories, genetically defined ancestry can be used [15] as an alternative measure of population-specific variation. In contrast to the discrete nature of ethnicity/race categories, genetic ancestry can be expressed as several continuous variables that reflect ancestry composition [16].

Though the use of genetic ancestry could help to better design EWAS, genotyping markers might not be collected in DNAme studies. In cases where self-reported ethnicity and genetic ancestry information are unavailable, methods have been developed to infer this information directly from DNAme (Table 1) measured on the popular Infinium Human Methylation 450k Beadchip array (HM450K) [8,9,10, 17]. Barfield et al. [8] and EPISTRUCTURE [9] methods both utilize principal components analysis (PCA) on select DNAme sites to infer genetic ancestry. Since only DNAme sites that are associated with nearby genetic variation are used, these methods produce principal components (PCs) that are often highly correlated with genome-wide genetic variation [8, 9] and, therefore, can be used as a measurement of genetic ancestry. Zhou et al. [10] explored using the set of 65 SNPs measured on HM450K to produce ethnicity/race classifications. However, it has not been investigated whether these methods perform well in populations and tissues other than the ones they were developed and tested in (Table 1).

Table 1 Description of methods to infer self-reported ethnicity or genetic ancestry using HM450K data

DNAme studies using placental tissue are of particular interest because the functioning of the placenta is essential to a healthy pregnancy [18, 19]. Although many DNAme alterations associated with placental-mediated diseases have been identified [20,21,22,23], the incidence of many of these conditions varies by population [24,25,26]. In this study we developed PlaNET (Placental DNAme Elastic Net Ethnicity Tool), an ethnicity classifier, using DNAme and genotyping data measured on the HM450K array in multiple cohorts of placentas from North America. PlaNET was developed on overlapping sites from HM450K and the newer Illumina MethylationEPIC BeadChip array (EPIC) to ensure compatibility with future studies. We show that PlaNET outperforms existing methods in predicting ethnicity in placental tissue and can produce accurate measures of genetic ancestry. Importantly, our method can be used to classify individuals into discrete ancestral populations (i.e., African, Asian, and Caucasian) or to describe individuals on an ancestral continuum that may more accurately reflect the nature of modern human populations. In studies where ethnicity information is unavailable, PlaNET can be applied to predict ethnicity after obtaining DNAme data, and used to investigate population-specific differences or to minimize confounding by population stratification in statistical analyses.

Results

Datasets

Our goal was to develop a placental DNAme-based ethnicity classifier, which could learn ethnicity-specific DNAme patterns from one set of samples to assign ethnicity labels to a new set of samples. We searched for placental HM450K data on the Gene Expression Omnibus [27] that contained more than one ethnicity group and made sample-specific ethnicity information available (Table 2). Five distinct cohorts met these criteria (labeled C1–C5), with three major North American ethnicities represented by sufficiently large numbers across more than one dataset: African (n = 58), Asian (n = 53), and Caucasian (n = 389). We opted to include samples from both healthy and abnormal pregnancies (preeclampsia, gestational diabetes mellitus, fetal growth restriction or overgrowth) (Table 2) [21, 28,29,30,31,32,33]. Though there were significant cohort-specific effects on DNAme that may reflect batch/technical variation (Additional file 2: Figure S1), we included these multiple datasets and phenotypes to enable the development of a robust classifier that would generalize well in future studies [34].

Table 2 Description of HM450K DNAme datasets used to develop and test PlaNET

Development of a placental DNA methylation ethnicity classifier

To determine the best machine learning classification algorithm that could learn ethnicity-specific patterns from DNAme microarray data, we compared four algorithms previously shown to be well-suited for prediction using high-dimensional genomics data [34,35,36]: generalized logistic regression with an elastic net penalty (GLMNET) [37, 38], nearest shrunken centroids (NSC) [35], k-nearest neighbours (KNN) [39], and support vector machines (SVM) [40]. For each algorithm, hyperparameter(s) were selected (e.g. k number of neighbors for KNN) that resulted in the highest performance estimated by repeated fivefold cross validation (three repeats). All algorithms performed favorably (logLoss = 0.170–0.276; Additional file 2: Figure S2a), except KNN (logLoss = 1.82). However, all algorithms showed a bias for high predictability of Caucasians (average accuracy = 0.980), and low predictability of Asians (average accuracy = 0.448) (Additional file 2: Figure S2b). Considering overall- and ethnicity-specific performance, the GLMNET algorithm was used for the remainder of the study (accuracy = 0.866, 0.625, 0.998 for Africans, Asians, and Caucasians, respectively), and we refer to this classifier as PlaNET (Placental DNAme Elastic Net Ethnicity Tool).
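
The algorithm comparison above is ranked by logLoss, which penalizes confident misclassifications more heavily than accuracy does. As a rough illustration only (the implementation used in the study is not described in this section, and the function name and toy probabilities below are hypothetical), multiclass log loss can be computed as the mean negative log of the probability assigned to each sample's true class:

```python
import math


def multiclass_log_loss(true_labels, predicted_probs, eps=1e-15):
    """Mean negative log-probability assigned to the true class.

    true_labels: list of class names, one per sample.
    predicted_probs: list of dicts mapping class name -> probability.
    """
    if len(true_labels) != len(predicted_probs):
        raise ValueError("labels and predictions differ in length")
    total = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # Clip so that a probability of exactly 0 or 1 cannot produce log(0).
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(true_labels)


# Toy predictions for three samples; these are not values from the study.
labels = ["African", "Asian", "Caucasian"]
preds = [
    {"African": 0.90, "Asian": 0.05, "Caucasian": 0.05},
    {"African": 0.10, "Asian": 0.60, "Caucasian": 0.30},
    {"African": 0.02, "Asian": 0.03, "Caucasian": 0.95},
]
loss = multiclass_log_loss(labels, preds)
```

For reference, a classifier that always returns equal probabilities for three classes has a log loss of ln 3 (about 1.10), so values well below that indicate informative probability estimates.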

For each sample, PlaNET returns a probability that the sample is African, Asian or Caucasian, and the final classification is defined by the ethnicity class with the highest of these probabilities. We reason that these probabilities have the potential to identify samples with mixed ancestry or ethnicity. Therefore, we implemented a threshold function on PlaNET’s probability outputs that classifies samples as ‘Ambiguous’ if the highest of the three class-specific probabilities is below 0.75 (“Methods”, Additional file 2: Figure S3). This resulted in 7 self-reported African, 12 Asian, and 13 Caucasian samples as being classified as ambiguous, which led to a slight decrease in performance (Fig. 1a). However, we note that because genetic ancestry is on a continuum and due to the limitations of self-reported ethnicity, there are likely to be individuals of mixed ancestry/ethnicity in our sample set and, therefore, hypothesize that a model that includes an ambiguous class is more realistic and accurate than one without. Cross validation, where training/validation subsets were created based on cohort-identity, yielded an overall accuracy of 0.900, a Kappa of 0.738, and a positive predictive value of 0.944 (Fig. 1a), which was consistent when examining performance by dataset (Additional file 2: Figure S4).
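
The thresholding rule described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the PlaNET implementation: the function name and the dictionary input format are hypothetical, and only the 0.75 cut-off and the 'Ambiguous' label come from the text.

```python
def classify_with_threshold(probs, threshold=0.75):
    """Return the most probable class, or 'Ambiguous' if its probability
    is below `threshold`.

    probs: dict mapping class name -> membership probability.
    """
    best_class = max(probs, key=probs.get)
    if probs[best_class] < threshold:
        return "Ambiguous"
    return best_class


# Toy probabilities; these are not samples from the study.
confident = {"African": 0.02, "Asian": 0.03, "Caucasian": 0.95}
uncertain = {"African": 0.10, "Asian": 0.17, "Caucasian": 0.73}
calls = [classify_with_threshold(p) for p in (confident, uncertain)]
```

Because the threshold is an argument, a user could relax or tighten it, or skip the thresholding entirely and work with the probabilities themselves.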

Fig. 1

Evaluating PlaNET’s performance and characterizing ethnicity-predictive HM450K sites. We developed PlaNET (Placental elastic net ethnicity classifier), using placental HM450K data and evaluated its classification performance using leave-one-dataset-out cross validation. a Each sample’s ethnicity classification from PlaNET is shown with respect to their self-reported ethnicity. Samples were called ‘ambiguous’ if their predicted probability fell below a ‘confidence’ threshold of 75%. b PlaNET utilizes a subset of ethnicity-predictive sites from the HM450K. To investigate whether genetic signal is present in the measurement for these sites, we cross-referenced ethnicity-predictive sites to an existing placental mQTL database [42] and determined whether any sites had SNPs present in either the probe body, CpG site of interrogation, or single base extension sites, based on dbSNP137

Ethnicity-predictive sites on the HM450K array are largely linked to genetic variation

To better understand the basis of PlaNET’s ethnicity prediction, we examined the 1860 sites (Additional file 1: Table S1) automatically selected by the GLMNET model. These sites were enriched for SNP probes, containing 15 of the 59 SNPs explicitly measured on both HM450K and EPIC DNAme arrays (p < 1e−16). Of the remaining 1845 DNAme sites, we found significant enrichment for sites linked to genetic variation: 802 sites (43.1%) have a documented SNP in either the probe body, CpG site of interrogation, or the single base extension site (p < 1e−16) [41], and 220 sites (11.8%) corresponded to previously identified placental-specific methylation quantitative trait loci (mQTLs) [42] (p < 1e−16, Fig. 1b). With respect to chromosomal location, we found significant enrichment for ethnicity-predictive sites on chromosomes 2 (p < 0.01), 15 (p < 0.05), and 17 (p < 0.05) (Additional file 2: Figure S5a). With respect to CpG density, we found significant enrichment for ethnicity-predictive sites in OpenSea (p < 0.001) and South Shore (p < 0.05) regions (Additional file 2: Figure S5b), where relatively neutral (unselected) genetic variation is more likely to be located [43]. Pathway analysis for GO and KEGG terms for genes associated with the 1860 sites, found only one significant (p < 0.05) GO term (homophilic cell adhesion via plasma membrane adhesion molecules).
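
The statistical test behind the enrichment p-values is not named in this section. As a sketch under the assumption that a one-sided hypergeometric test is appropriate, and that the 319,233 filtered sites (taken here to include the 59 SNP probes) form the background, the SNP-probe enrichment can be approximated as follows. The function names are hypothetical.

```python
import math


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)


def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n sites are drawn without replacement from N sites,
    K of which belong to the category of interest."""
    log_denominator = log_comb(N, n)
    lowest = max(k, n - (N - K), 0)
    highest = min(K, n)
    total = 0.0
    for x in range(lowest, highest + 1):
        # Work in log space to avoid overflow in the binomial coefficients.
        log_term = log_comb(K, x) + log_comb(N - K, n - x) - log_denominator
        total += math.exp(log_term)
    return min(total, 1.0)


# Assumed background: 319,233 filtered sites, 59 of them SNP probes.
# Selected: 1860 sites, 15 of them SNP probes (counts taken from the text).
p_snp = hypergeom_upper_tail(k=15, K=59, n=1860, N=319233)
```

Under these assumptions fewer than one SNP probe would be expected among 1860 selected sites by chance, so observing 15 yields a very small tail probability, in line with the reported p < 1e−16.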

DNAme-inferred ethnicity and genetic ancestry

To test the ability of PlaNET to identify individuals of mixed ancestry, we examined whether samples classified as ‘ambiguous’ were also intermediate with respect to genetically defined ancestry. Genetic ancestry was inferred from 50 ancestry informative genotyping markers (AIMs) in samples from cohorts C4 and C5 (n = 109), using 1000 Genomes Project samples as reference populations [44, 45]. These 50 markers were previously selected based on their ability to differentiate between African, European, East Asian, and South Asian populations [45]. Plotting the first two multi-dimensional scaling coordinates calculated on the 50 AIMs (Fig. 2) shows a handful of samples intermediate to three more distinct ancestry clusters. The samples with less extreme genetic ancestry coordinates based on AIMs tended to have lower PlaNET-calculated probabilities associated with the ethnicity classification matching the individual’s self-reported ethnicity (Fig. 2), confirming that PlaNET provides some information on the genetic ancestry composition.

Fig. 2

Probabilities associated with PlaNET ethnicity predictions and genetic ancestry inferred from AIMs. Ethnicity classifications from PlaNET and associated confidence/probability scores were compared to genetic ancestry inferred from 50 AIMs (n = 109, cohorts C4, C5), represented by the first three coordinates from multidimensional scaling using 1000 genomes project samples as reference populations

Although genetic ancestry can be adequately inferred from a small set of AIMs, it is best obtained from a large number of unlinked markers [46]. Therefore, we also inferred genetic ancestry in a smaller number of samples from C5 (n = 37) with high-density genotyping array data (Omni 2.5, > 2 million SNPs), again using 1000 Genomes Project samples as reference populations [44, 47, 48], and compared this to PlaNET’s predicted membership probabilities for each ethnicity (Fig. 3a–c). Ten of these 37 samples were not used in the previous analyses because self-reported ethnicity information was unavailable (Fig. 3a). We found that genetic ancestry coefficients reflected the probabilities associated with ethnicity classification to a high degree (Fig. 3b, c, R2 = 0.95–0.96, p < 0.001).

Fig. 3

Probabilities associated with PlaNET ethnicity predictions and genetic ancestry inferred from high-density genotyping data. PlaNET was tested in a subset of cohort C5 (n = 37). a PlaNET’s ethnicity classifications were compared with self-reported ethnicity. b Ethnicity probabilities generated by PlaNET were compared to c genetic ancestry coefficients determined from high-density genotyping data (Omni 2.5, > 2 million SNPs), using the function snmf() from the R package LEA, and found to be highly correlated (R2 = 0.95–0.96, p < 0.001) determined by linear regression

Characterizing existing methods to infer population structure in placental DNA methylation data

To evaluate our hypothesis that a placental-specific approach to population inference would outperform existing methods developed in other tissues, we compared the performance of PlaNET to three previously published HM450K methods: Barfield’s SNP-based filtering approach [8], EPISTRUCTURE [9], and Zhou’s SNP-based classifier [10]. To address the differences in the type of outcomes produced by each method (e.g. PCs or ethnicity classifications), we used PCA to generate metrics that could be compared between methods. PCA was performed on the set of HM450K sites corresponding to each method (Table 1), which were then included in a series of simple linear models, where each PC was either a function of self-reported ethnicity (Fig. 4a; n = 499, cohorts C1–C5), genetic ancestry (Fig. 4b; n = 109, cohorts C4 and C5 only), or cohort-specific patient variables (e.g. microarray batch, sex, gestational age; Additional file 2: Figure S6). Linear models were constructed for each of the top ten PCs and R2 for each linear model was compared between each method. For computation of PCs on PlaNET’s sites, we used a cohort-specific cross validation framework to account for bias that could be introduced using the same samples for development and testing. Specifically, PlaNET’s PCs were computed separately for each cohort using ethnicity-predictive sites selected in all other cohorts (“Methods”).
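
The cohort-specific cross validation framework described above amounts to holding out one cohort at a time and selecting sites using only the remaining cohorts. A minimal sketch of the splitting step follows; the function name and the toy cohort labels are hypothetical, and site selection and PCA are deliberately left out.

```python
def leave_one_cohort_out(cohort_labels):
    """Yield (held_out_cohort, train_indices, test_indices), one per cohort.

    cohort_labels: list with the cohort identifier of each sample.
    """
    for held_out in sorted(set(cohort_labels)):
        train = [i for i, c in enumerate(cohort_labels) if c != held_out]
        test = [i for i, c in enumerate(cohort_labels) if c == held_out]
        yield held_out, train, test


# Toy cohort assignment for seven samples; not the study's sample layout.
cohorts = ["C1", "C1", "C2", "C3", "C3", "C4", "C5"]
splits = list(leave_one_cohort_out(cohorts))
```

Splitting by cohort rather than at random means that the held-out samples never contribute to the site selection they are later evaluated on, and that batch effects shared within a cohort cannot leak between training and testing.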

Fig. 4

Comparing PlaNET to existing methods to account for population stratification using HM450K data. For each cohort, principal components analysis was conducted on PlaNET using a model trained on all other cohorts. PlaNET’s principal components (PCs) were then compared to the PCs computed on sites from EPISTRUCTURE [9], Barfield’s method [8], and the 59 SNPs. a Amount of variance explained from a series of linear models where principal component “i” is a function of self reported ethnicity encoded as a dummy variable. b This was then repeated using AIMs coordinates 1 and 2 instead of ethnicity as the independent variable (n = 109)

We found that for all cohorts, the first two PCs computed on PlaNET’s sites and the 59 SNPs were highly correlated with self-reported ethnicity (Fig. 4a, R2 = 0.649 ± 0.087, 0.697 ± 0.110, respectively) and genetic ancestry (Fig. 4b, R2 = 0.555 ± 0.246, 0.487 ± 0.335). In contrast, the first PC computed on Barfield’s and EPISTRUCTURE’s sites showed almost no correlation with self-reported ethnicity (Fig. 4a, R2 = 0.0452 ± 0.060, 0.066 ± 0.082, respectively), or genetic ancestry (Fig. 4b, R2 = 0.0435 ± 0.0548, 0.104 ± 0.0653). Instead, for Barfield and EPISTRUCTURE, the PCs that correlated with ethnicity/ancestry were confined to PCs 3–6 (Fig. 4a, b), while often the top PCs (e.g., 1–4) for these two methods were associated with variables other than ethnicity/ancestry (Additional file 2: Figure S6). For example, in cohort C4, EPISTRUCTURE PC1 was most correlated with row position on the HM450K array (R2 = 0.482), PC2 with gestational age (R2 = 0.315), PC3 with genetic ancestry coordinate 1 (R2 = 0.450), and PC5 with ethnicity (R2 = 0.579; Additional file 2: Figure S6).
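
For a linear model in which a PC is the response and a dummy-coded category such as self-reported ethnicity is the only predictor, R2 equals the between-group sum of squares divided by the total sum of squares. The sketch below illustrates that calculation; the function name and toy values are hypothetical, and it does not reproduce the figures reported above.

```python
def r_squared_by_group(values, groups):
    """R^2 of an ordinary least squares fit of `values` on dummy-coded
    `groups` with an intercept (between-group SS / total SS)."""
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    if total_ss == 0:
        raise ValueError("values have zero variance")
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    # Fitted values of the dummy-variable model are the group means.
    between_ss = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    return between_ss / total_ss


# Toy PC scores for four samples in two groups.
pc_scores = [0.0, 2.0, 4.0, 6.0]
ethnicity = ["A", "A", "B", "B"]
r2 = r_squared_by_group(pc_scores, ethnicity)
```

A PC whose scores separate cleanly by group gives an R2 near 1, whereas a PC driven by an unrelated variable such as array position gives an R2 near 0.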

Limiting to methods that predict ethnicity classes, we compared the performance of PlaNET to Zhou et al.’s [10] SNP-based classifier (Additional file 2: Figure S7). Both classifiers demonstrated similar accuracy in classifying self-reported Africans (p = 0.68, 87.1% for PlaNET; 90.3% for Zhou) and Caucasians (p = 0.062, 96.7% vs. 97.9%), but PlaNET was more accurate in classifying self-reported Asians (p = 0.00052, 74.4% vs. 41.0%).

Application of PlaNET in an EWAS setting

Lastly, to demonstrate the utility of applying PlaNET to placental DNAme data, we applied PlaNET to obtain ethnicity classifications across two previously published EWAS studies using three datasets (Table 3, Additional file 2: Figure S9). We note that this includes samples from cohorts C4 and C5 that were used to develop PlaNET.

Table 3 Distribution of PlaNET ethnicity predictions across previously published placental EWAS datasets

One study used two distinct cohorts from Vancouver, Canada (GSE100197, n = 102) and Toronto, Canada (GSE98224, n = 48) to investigate placental DNAme alterations associated with preeclampsia status [21]. We reasoned that correction for ethnicity should decrease false positives in the EWAS and, therefore, increase concordance between hits identified in the two data sets. In the original EWAS, with no adjustment for ethnicity, our group reported that 599 out of the 1703 (35.1%) significant associations found in the Vancouver cohort were also significant in the Toronto cohort, and the correlation of the difference in mean DNAme between controls and preeclampsia-affected samples (i.e. delta betas) at FDR significant sites between discovery and validation was 0.62 [21]. When we repeated the analysis while adjusting for ethnicity determined by PlaNET, the number of preeclampsia-associated sites that overlapped between cohorts increased to 651/1614 (40.3%) (Additional file 1: Table S5), and the correlation between delta betas increased to 0.66. We also found that repeating gene set enrichment analysis, which originally found nothing significant [21], yielded several significantly enriched (FDR < 0.05) GO terms such as developmental process, inflammatory response, and cell adhesion (Additional file 1: Table S6). Lastly, we also adjusted for ethnicity determined by Zhou et al.’s SNP classifier, which resulted in a smaller increase in overlapping associations and correlation between delta betas (607/1662 = 36.52%, correlation = 0.65). However, no GO terms were found significant at an FDR < 0.05. In summary, any adjustment for ethnicity improved the replicability of our preeclampsia EWAS results, with PlaNET performing best when used in placental samples.

Next, because population stratification can be addressed not only by adjustment in linear modeling but also by stratifying an analysis by population identity, we performed a secondary EWAS confined to samples predicted as Caucasian (n = 71/102 for discovery, n = 28/48 for validation). This resulted in a decrease in overlap in preeclampsia-associated sites between cohorts: 359/1488 (24.1%) (Additional file 1: Table S7), although the correlation between delta betas remained high (r = 0.67), indicating that the observed decrease in overlap between significantly differentially methylated sites was likely due to a decrease in power from the smaller sample size (particularly in the validation group) rather than a decrease in concordance between cohorts.

PlaNET can be useful for checking for discrepancies in self-reported ethnicity information. We tested whether PlaNET could identify the ethnicity of samples from an all-Caucasian population. GSE71678 (n = 343), a cohort not used in the development of PlaNET, consisted of DNAme data from placental samples collected from a New Hampshire, USA birth cohort that investigated the effects of arsenic exposure on placental DNAme [49]. PlaNET classified 342 samples as Caucasian; the remaining sample had a high probability of belonging to the Caucasian group (probability = 0.73) that fell below our confidence threshold and was, therefore, classified as ‘ambiguous’. This confirms that ethnic homogeneity was high in this cohort and that adjustment for population stratification was not needed in this study.

Discussion

In this study, we developed PlaNET, a method to predict Asian, African, and Caucasian ethnicity using placental HM450K array data. To enable compatibility with future studies, PlaNET was developed on sites (452,453 CpGs and 59 SNPs) overlapping between the older HM450K and the newer EPIC Illumina DNAme arrays. Although all samples in this study were reported as a single ethnicity/race, we expected that there would be significant population substructure that might limit our ability to develop predictive models of ethnicity and to assess their performance. Despite this limitation, ethnicity could be predicted with high accuracy as assessed by cross validation. PlaNET’s DNAme-based ethnicity classification relies on HM450K sites with large amounts of genetic signal, which supported our initial efforts to filter our data to enrich for genetic-informative sites prior to classifier development (“Methods”) [41, 50, 51]. When examining PlaNET’s 1860 sites used to predict ethnicity, more than half could be linked to a nearby genetic polymorphism. Of these, 802 CpG sites have documented SNPs in their probe body, single base extension or CpG site of interrogation, which previously have been identified to differ between European and East Asian populations [41]. Several studies have suggested the genetic influence on DNAme at these sites is primarily technical in nature [41, 50, 51], suggesting the patterns in DNAme at these sites are likely tissue-agnostic, warranting further investigation of their utility in predicting ethnicity and/or genetic ancestry in tissues other than the placenta. A significant proportion of other ethnicity-predictive CpG sites (n = 220) were previously found to be associated with placental mQTLs in a population with similar demographics to the ones studied here [42]. This finding, together with EPISTRUCTURE (a method that also relies on mQTLs [9]), suggests that leveraging the tissue- and population-specificity of mQTLs can produce highly effective DNAme-based population structure inference methods.

Of the existing methods to assess population stratification from DNAme data, we note that Barfield’s method and EPISTRUCTURE infer continuous measures of genetic ancestry, while Zhou’s SNP-based classifier returns discrete ethnicity classifications; PlaNET, however, produces both [8,9,10] (Table 1). EPISTRUCTURE and Barfield’s method are unsupervised PCA-based approaches, which rely on the empirical observation that specific DNAme sites can be highly correlated with PCs computed on genome-wide genotype data in adult blood samples [8, 9]. However, we found that DNAme at these sites did not produce PCs that are highly associated with genotype data in placental samples. Instead, top PCs were more often associated with non-ancestry related variables in the placental samples included in this study, such as gestational age, preeclampsia, and technical variables. Ethnicity- and genetic ancestry-associated PCs were confined to the third to sixth components of variation, suggesting that application of these methods may require identifying which PCs are ethnicity/ancestry-specific, which is impossible when self-reported ethnicity and genetic ancestry information is unavailable (i.e. when these methods are needed most). Future improvements to these types of methods can aim at increasing the amount of ethnicity- and genetic ancestry-associated signal in the sites used, to ensure that the top two to three PCs are always associated with ethnicity and ancestry. This aim could also be supported by identifying ethnicity- and ancestry-associated sites that are also robust to changes in non-genetic drivers of DNAme such as cell type, gestational age, and severe pathology.

Supervised population inference approaches such as ethnicity classifiers can return an explicit assignment of samples into distinct ancestral groups. In comparison to self-reported ethnicity, an assessment based on DNAme/genetic data is more objectively defined, which allows for more robust investigation of ethnicity-specific effects. An important goal of any population structure inference method would be to identify samples of mixed ancestry, a capability not well supported by Zhou’s ethnicity classifier [10]. In contrast, PlaNET produced membership probabilities corresponding to each ethnicity group that were highly correlated with genetic ancestry estimated from genotyping data. This was consistent whether we used principal components analysis on AIMs data, or model-based estimation of ancestry on high-density genotyping array data [47, 52,53,54]. In this study, we defined samples of potential mixed ancestry as those with a maximum membership probability of less than 0.75, but we note that this threshold can be manually adjusted by the user and that the probabilities themselves can be used to adjust for population structure in study populations including significant numbers of samples with mixed ancestry.

Results of DNAme studies on genetic ancestry and ethnicity, such as this one, depend on the number and proportion of different populations sampled from, as well as the tissue studied. Due to limitations in sample availability, only African, Asian, and Caucasian ethnicities were included in our study. Although these ethnicities are among the most common in North American populations, future developments should consider inclusion of additional ethnicities. Furthermore, due to the limited number of samples with high-density genetic data, we were unable to address the extent of finer population structure that likely exists within the major ancestral groups studied. Differences in ethnic composition between samples from our study and samples used to develop Barfield’s method and EPISTRUCTURE may also explain why Barfield’s method and EPISTRUCTURE performed poorly in our study [8, 9]. A lack of generalizability of these methods to our placental samples was likely further compounded by the use of different tissues to develop each method: Barfield’s method and EPISTRUCTURE were both developed and tested in blood only. This is especially important to consider when applying these techniques to tissues with unique DNAme profiles, such as placenta [18]. It is possible that application of these approaches to other tissues that are more similar to blood (e.g. other somatically-derived tissues) may result in better performance than was seen in placenta in this study. However, any DNAme-based test needs to be validated before application to new tissues, which has not yet been done for these methods.

A major goal of EWAS is to uncover signal truly associated with the phenotype/environment of interest that might generalize to other relevant populations. This is challenging given the wide host of technical variables that can affect DNAme measurements and the common finding that many phenotypes are associated with relatively small effect sizes [33, 55]. To this end, adjustment for major confounders such as genetic ancestry or ethnicity can significantly improve EWAS. We demonstrated, in a reanalysis of our previously published preeclampsia placental data, that adjustment for ethnicity, determined by PlaNET, improved the replicability of significant associations between independent cohorts. Conversely, overadjustment can occur when populations are relatively homogeneous, resulting in bias and/or loss of precision. We showed that PlaNET can indicate minimal population stratification when applied to a homogeneous Caucasian population. Thus, PlaNET will be useful in assessing population stratification in future placental EWAS, as well as conducting ethnicity-stratified analyses, which may lead to important insights into the disparities in pregnancy-related outcomes between populations [24,25,26].

Conclusions

We demonstrated that ethnicity and genetic ancestry can be accurately predicted using placental HM450K DNAme microarray data with respect to three major ethnicity/ancestral populations. Although samples that were used to develop PlaNET were reported to come from single ethnic populations, our classifier was able to capture mixed ancestry, and outperformed existing prediction methods. PlaNET will be valuable in assessing and accounting for population stratification, which can confound associations of DNAme with disease or environment, in future studies using HM450K or EPIC arrays. The machine-learning approach used to develop PlaNET can easily be applied to other tissues and populations for use in future DNAme studies.

Methods

Collection of previously published placental HM450K DNA methylation data

Placental DNAme data from liveborn deliveries, comprising healthy pregnancies and pregnancies with various complications (n = 585), were combined from seven GEO HM450K datasets corresponding to five North American cohorts (summarized in Table 2; sample-specific information in Additional file 1: Table S4) [21, 27, 29,30,31,32]. Five unpublished samples from the C5 cohort were included and are available at GSE128827. Gestational ages of these pregnancies at delivery ranged from 26 to 42 weeks and 50.30% of samples were male. Samples were excluded (n = 67) if their self-reported ethnicity was missing or did not fall into one of three major race/ethnicity groups: Asian/East Asian (n = 53), Caucasian/White (non-Hispanic) (n = 389), or African/African American/Black (n = 57). Based on census data [56], we note that self-reported Caucasian/White (non-Hispanic) samples are typically of European ancestry, self-reported Asians are typically of East Asian ancestry and self-reported Africans represent diverse ancestries from Africa with a significant potential of admixture from other ancestries [57]. When possible, data was downloaded as raw IDAT files (GSE75248, GSE100197, GSE100197, GSE108567, GSE74738); otherwise, methylated and unmethylated intensities were used (GSE70453, GSE73375).

DNA methylation data processing

All samples were analyzed using the Illumina Infinium HumanMethylation450 BeadChip array (HM450K), the most popular measure of DNAme for EWAS. Array data analysis was performed using R version 3.5.0. To allow compatibility of PlaNET with the newest Infinium MethylationEPIC BeadChip array (EPIC), the raw HM450K data (485,512 CpGs, 65 SNPs) was filtered to the 452,453 CpGs and 59 SNPs common between both platforms prior to classifier development [10]. Because genetic variability can capture ancestry information, we omitted the common filtering step that would remove sites with probes that overlap SNPs (n = 52,116 at a minor allele frequency > 0.05). CpGs were removed if greater than 1% of samples had poor quality signal (bead count < 3, or a detection p-value > 0.01; n = 14,858). The remaining poor quality measurements were replaced with imputed values using K-nearest neighbours from the R package impute [58]. Cross-hybridizing (n = 41,937) [50, 51] and placental-specific non-variable sites (n = 86,502) [59] were also removed, leaving 319,233 sites for classifier development.
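
The probe-level quality filter described above (remove a CpG if more than 1% of samples have a bead count < 3 or a detection p-value > 0.01) can be sketched as follows. This is an illustration only: the study performed this step in R, and the function name, input layout, and probe identifiers below are hypothetical.

```python
def failing_probes(detection_p, bead_count, max_fail_fraction=0.01,
                   p_cutoff=0.01, min_beads=3):
    """Return the IDs of probes that should be removed.

    detection_p, bead_count: dicts mapping probe ID -> list of per-sample
    values, with samples in the same order in both.
    """
    removed = []
    for probe, p_values in detection_p.items():
        beads = bead_count[probe]
        # A measurement is poor quality if either criterion fails.
        failed = sum(
            1 for p, b in zip(p_values, beads)
            if p > p_cutoff or b < min_beads
        )
        if failed / len(p_values) > max_fail_fraction:
            removed.append(probe)
    return removed


# Toy data for 1000 samples and three made-up probes.
det = {
    "probe_ok": [0.001] * 1000,
    "probe_high_p": [0.5] * 20 + [0.001] * 980,
    "probe_low_beads": [0.001] * 1000,
}
beads = {
    "probe_ok": [10] * 1000,
    "probe_high_p": [10] * 1000,
    "probe_low_beads": [1] * 15 + [10] * 985,
}
to_remove = failing_probes(det, beads)
```

Probes that pass this filter can still contain individual poor-quality measurements, which is why the remaining failures are imputed rather than left in place.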

Biological sex was determined by hierarchical clustering on DNAme measured from sites on the sex chromosomes and then compared to reported sex. Samples with discordant reported and inferred sex were removed (n = 3). Samples were also removed if they had a low mean inter-array correlation (< 0.95, n = 5). Intra-array normalization methods, normal-exponential out-of-band (NOOB) [60] and beta mixture quantile normalization (BMIQ) [61] were used from R packages minfi (version 1.26.2) [62] and wateRmelon (version 1.24.0) [63] to normalize data.

Genotyping data collection and genetic ancestry assessment

In a subset of C5 (n = 27) and 10 additional samples, high-density SNP array genotypes were collected. DNA samples from one site from the fetal side of each placenta were collected as previously described [45] and quality was checked using a NanoDrop ND-1000 (Thermo Scientific) as well as by electrophoresis on a 1% agarose gel. Genotyping at ~ 2.3 million SNPs was done on the Illumina Infinium Omni2.5-8 (Omni2.5) array at the Centre for Applied Genomics, Hospital for Sick Kids, Toronto, Canada. For inferring genetic ancestry, the data for these 37 samples were combined with a previously processed 1000 Genomes Project Omni2.5 dataset (n = 1756) to use as reference populations [44, 48]. Genotypes in this combined dataset were filtered by removing SNPs with a high missing call rate (> 0.05, n removed = 31,604) or a low minor allele frequency (MAF < 0.05, n removed = 114,628), and linkage disequilibrium pruning was performed to select representative SNPs (R2 < 0.25, n removed = 919,824), for a final dataset of 218,732 SNPs and n = 1793 samples. Genetic ancestry coefficients were estimated using the R package LEA, which utilizes sparse non-negative matrix factorization to produce results similar to those of the model-based algorithms ADMIXTURE and STRUCTURE [47, 54]. The cross-entropy criterion was used to determine the number of ancestral populations (Additional file 2: Figure S8) [64].
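
As an illustration of the per-SNP quality filters, the following sketch checks the missing call rate and MAF of one SNP. It is not the pipeline used here; the function name `snp_passes`, the 0/1/2 genotype coding with `None` for a missing call, and the toy genotypes are assumptions. It also assumes that the MAF filter removes SNPs below 0.05. Linkage disequilibrium pruning is not shown.

```python
def snp_passes(genotypes, max_missing=0.05, min_maf=0.05):
    """Check one SNP against call-rate and minor-allele-frequency filters.

    genotypes: per-sample alternate-allele counts (0, 1, or 2), with None
    marking a missing call (hypothetical coding).
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if missing_rate > max_missing:
        return False
    # Each called sample contributes two alleles.
    alt_freq = sum(called) / (2 * len(called))
    maf = min(alt_freq, 1 - alt_freq)
    return maf >= min_maf


common_snp = [0] * 16 + [1] * 4             # MAF = 4 / 40 = 0.10
rare_snp = [0] * 19 + [1]                   # MAF = 1 / 40 = 0.025
poorly_called = [0] * 14 + [1] * 4 + [None] * 2   # 10% missing calls

print(snp_passes(common_snp), snp_passes(rare_snp), snp_passes(poorly_called))
```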

A smaller panel of 50 ancestry-informative genotyping markers (AIMs) was collected in a subset of samples from cohorts C4 (n = 41) and C5 (n = 68). AIMs were selected based on their ability to differentiate between African, European, East Asian, and South Asian populations [65,66,67]. Results from cohort C5 have been published elsewhere [45], and genotyping data were collected for cohort C4 in the same manner. Briefly, these markers were measured in placental villus DNA using the Sequenom iPlex Gold platform (Génome Québec Innovation Centre, Montréal, Canada). Genetic ancestry was inferred from the 50 AIMs using multi-dimensional scaling after combining the data with the same 50 AIMs from the 1000 Genomes Project samples, as previously described [45].

Developing the ethnicity classifier and assessing its performance

To develop and assess the performance of PlaNET we used a ‘leave-one-dataset-out cross-validation’ (LODOCV) approach. This approach uses four out of five datasets to develop a predictive model (training), which is then used to generate ethnicity classifications on the samples in the remaining dataset (testing). This differs from the traditional cross-validation approach of randomly splitting the full dataset into training and testing sets. Because each test set comes from a dataset that contributed no samples to training, LODOCV is expected to give more realistic estimates of classifier performance in future studies, and it has previously been used for evaluating age-predictive models [34]. Each iteration of LODOCV generates dataset-specific estimates of performance (accuracy, Kappa). After all iterations, overall performance was assessed by aggregating classifications across all datasets.
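
The splitting scheme can be sketched as follows. This is an illustrative outline rather than the study's R code; the function name `lodocv_splits` and the toy dataset labels are assumptions. Each iteration holds out every sample from one dataset for testing and uses all remaining samples for training.

```python
def lodocv_splits(dataset_labels):
    """Yield (held_out, train_idx, test_idx) for each distinct dataset.

    dataset_labels: one dataset/cohort label per sample (hypothetical input).
    """
    for held_out in sorted(set(dataset_labels)):
        train_idx = [i for i, d in enumerate(dataset_labels) if d != held_out]
        test_idx = [i for i, d in enumerate(dataset_labels) if d == held_out]
        yield held_out, train_idx, test_idx


# Toy example: six samples drawn from three datasets.
labels = ["C1", "C1", "C2", "C3", "C3", "C3"]
splits = list(lodocv_splits(labels))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

In contrast to a random split, no dataset ever contributes samples to both the training and the testing set of the same iteration.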

For fitting predictive models within LODOCV-generated training sets, we used the R package caret [68]. Several algorithms were compared: logistic regression with an elastic net penalty (GLMNET) [37, 38], nearest shrunken centroids (NSC) [35, 69], K-nearest neighbours (KNN) [39], and support vector machines (SVM) [40]. To determine optimum tuning parameters for each algorithm (e.g., ‘k’ number of neighbours for KNN, alpha and lambda for GLMNET), we built several models while varying the tuning parameter(s) and compared the performance of these models within each training set using repeated (n = 3) fivefold cross-validation. Candidate tuning parameter values were taken from the default settings in caret [68], except for GLMNET, for which a grid of values was used (alpha = 0.025–0.500, lambda = 0.0025–0.2500). We compared the performance of these models using accuracy, positive predictive value, Cohen’s kappa [70], and logLoss (a measure of classification accuracy that heavily penalizes over-confident misclassifications). The results from this analysis can be found in Additional file 1: Tables S2, S3. After assessing the classifier performance using LODOCV, a final GLMNET model was fit to the entire dataset (cohorts C1–C5) using the same model fitting procedure described above and is available for use in future datasets (https://github.com/wvictor14/planet).
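
Two of these performance metrics can be written out explicitly. The sketch below uses the standard definitions of Cohen's kappa (agreement corrected for chance) and multiclass logLoss (mean negative log probability assigned to the true class). It is not the caret implementation; the function names and toy labels are assumptions.

```python
from collections import Counter
from math import log


def cohens_kappa(truth, predicted):
    """Agreement between true and predicted labels, corrected for chance."""
    n = len(truth)
    observed = sum(t == p for t, p in zip(truth, predicted)) / n
    t_counts, p_counts = Counter(truth), Counter(predicted)
    # Chance agreement expected from the marginal label frequencies.
    expected = sum(t_counts[c] * p_counts[c] for c in t_counts) / n ** 2
    return (observed - expected) / (1 - expected)


def log_loss(truth, probabilities, eps=1e-15):
    """Mean negative log probability assigned to the true class.

    probabilities: one dict per sample mapping class label -> probability.
    """
    total = 0.0
    for t, probs in zip(truth, probabilities):
        p = min(max(probs[t], eps), 1 - eps)   # avoid log(0)
        total -= log(p)
    return total / len(truth)


truth = ["Asian", "Asian", "African", "African"]
predicted = ["Asian", "Asian", "African", "Asian"]
print(cohens_kappa(truth, predicted))

# A confident misclassification is penalized far more than a hesitant one.
confident_wrong = log_loss(["African"], [{"African": 0.01, "Asian": 0.99}])
hesitant_wrong = log_loss(["African"], [{"African": 0.40, "Asian": 0.60}])
print(confident_wrong, hesitant_wrong)
```

In the toy example, three of four labels agree (accuracy 0.75) while chance agreement is 0.5, which gives a kappa of 0.5.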

Enrichment analysis

The DNAme sites (see Additional file 1: Table S1) and SNPs selected to predict ethnicity in this final model (n = 1860) were used for enrichment analysis. For DNAme sites, we looked for enrichment for SNPs in the probe body, CpG site, and single base extension sites based on Illumina’s HM450K annotation version 1.2 [71]. We looked for enrichment for placental mQTLs [42], chromosomes and CpG islands (HG19; Additional file 2: Figure S5). Fisher’s exact test was used for all enrichment tests using a p-value threshold of < 0.05, and was carried out in R using the function fisher.test(). GO and KEGG pathway analysis was done using the R package missMethyl version 3.8 [72].
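
For reference, the two-sided Fisher's exact test on a 2 × 2 table can be sketched from the hypergeometric distribution. The p-value is the summed probability of all tables, with the same margins, that are no more probable than the observed one, which is also how the default two-sided p-value of fisher.test() is defined. The function name and the toy tables are assumptions, and the analysis itself was carried out in R.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    For enrichment, a hypothetical layout is: rows = selected / not selected
    sites, columns = annotated / not annotated.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)


print(fisher_exact_two_sided(3, 1, 1, 3))
print(fisher_exact_two_sided(4, 0, 0, 4))
```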

Threshold analysis

We explored the use of a ‘threshold function’ to identify samples that are difficult to classify into discrete ethnicity groupings because of mixed ancestry. Because PlaNET’s ethnicity classifications are associated with varying degrees of confidence (i.e., probabilities), we reasoned that the probability of a sample’s most probable ethnicity classification (i.e., max(P(Asian), P(African), P(Caucasian))) would be lower with a higher degree of mixed ancestry. Therefore, we implemented a threshold function on PlaNET’s probability outputs that classifies samples as ‘Ambiguous’ if the highest of the three class-specific probabilities is below a certain threshold. We explored several thresholds and decided on 0.75, which minimized the resulting decrease in predictive performance (Additional file 2: Figure S3).
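
The threshold function amounts to a single comparison, sketched here for illustration. The function name `classify_with_threshold`, the dictionary input, and the toy probabilities are assumptions and do not reflect the interface of the released package.

```python
def classify_with_threshold(probabilities, threshold=0.75):
    """Return the most probable class, or 'Ambiguous' if it falls below threshold.

    probabilities: dict mapping class label -> predicted probability.
    """
    label, p_max = max(probabilities.items(), key=lambda kv: kv[1])
    return label if p_max >= threshold else "Ambiguous"


confident = {"Asian": 0.05, "African": 0.03, "Caucasian": 0.92}
mixed = {"Asian": 0.45, "African": 0.05, "Caucasian": 0.50}

print(classify_with_threshold(confident))
print(classify_with_threshold(mixed))
```

The second sample would be labelled ‘Caucasian’ without the threshold, but its highest probability (0.50) is below 0.75, so it is returned as ‘Ambiguous’.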

Comparison of methods for inferring genetic ancestry/ethnicity from HM450K data

Because existing population inference methods and PlaNET use different statistical approaches to infer genetic ancestry/ethnicity (PCA-based vs. predictive modeling), we compared each method based on the amount of population-associated signal in DNAme from each method-specific subset of sites. This was done by applying principal component analysis (PCA) to standardized beta values for HM450K sites associated with each method (Table 1) [8,9,10] within each cohort. To avoid bias, the PCs associated with PlaNET were calculated for each cohort using a classifier trained on all other cohorts (generated from LODOCV). Several simple linear regression models were applied to estimate the amount of variance explained in PCi (i = 1, 2, 3,…, 10) by self-reported ethnicity and genetic ancestry when available. Self-reported ethnicity was encoded with indicator variables when testing group-specific associations (Additional file 2: Figure S6) and also overall association with ethnicity (Fig. 4a). Genetic ancestry was tested using coordinates one, two, and three, in a total of four different models: three models testing each coordinate separately (Additional file 2: Figure S6), and one model including both coordinates one and two to gain an overall estimate of the association with genetic ancestry (Fig. 4). To determine other factors that might affect signal in these sites, we also tested for the association between PCi and each covariate available for each cohort. All simple regression tests were done in R using the function lm(). For Barfield’s approach, we compared the various sets of sites that differ by the distance of a given CpG site to the nearest genetic variant (0, 1, 2, 5, 10, 50 bp) (Additional file 2: Figure S10). We used the set “0 bp from a genetic variant”, following two observations: (1) the sets (0, 1, 2, 5 bp) were not significantly different in their association with ethnicity or genetic ancestry (p value > 0.05), and (2) the sets (10, 50 bp) were significantly less associated with ethnicity and genetic ancestry for the first PC (p value < 0.05). In summary, the closer the genetic variant was to the CpG, the stronger the signal associated with genetic ancestry and ethnicity.
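
When a PC is regressed on a single categorical variable encoded with indicator variables, the proportion of variance explained (R²) equals the between-group sum of squares divided by the total sum of squares. The sketch below computes this quantity directly. It illustrates the statistic only and is not the lm() call used in the analysis; the function name and toy values are assumptions.

```python
def r_squared_by_group(values, groups):
    """R-squared of a linear model of `values` on indicator-coded `groups`.

    values: scores of one principal component, one per sample.
    groups: categorical label (e.g., self-reported ethnicity) per sample.
    """
    n = len(values)
    grand_mean = sum(values) / n
    total_ss = sum((v - grand_mean) ** 2 for v in values)
    by_group = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)
    # Fitted values of the indicator model are the group means.
    between_ss = sum(len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
                     for vs in by_group.values())
    return between_ss / total_ss


pc1 = [1.0, 2.0, 3.0, 4.0]
ethnicity = ["A", "A", "B", "B"]
print(r_squared_by_group(pc1, ethnicity))
```

Here the group means (1.5 and 3.5) account for 4 of the 5 units of total variation, so the toy PC has an R² of 0.8 with respect to the grouping.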

To compare PlaNET to Zhou et al.’s SNP-based classifier [10], we used the R package sesame (version 1.1.0) [73] to obtain SNP-based ethnicity classifications for samples with IDATs available (cohorts C3, C4, and C5). To compare class-specific performance, McNemar’s Chi-squared Test for Count Data was calculated using the stats R package.
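
McNemar's test compares two classifiers on the same samples using only the discordant pairs. The sketch below follows the standard formula with a continuity correction, which is the default behaviour of mcnemar.test() in R for 2 × 2 tables. The function name and the counts are hypothetical; no results from this study are reproduced.

```python
from math import erfc, sqrt


def mcnemar(b, c, correct=True):
    """McNemar's chi-squared statistic and p-value (1 degree of freedom).

    b: samples classified correctly by method 1 only.
    c: samples classified correctly by method 2 only.
    """
    if b + c == 0:
        raise ValueError("no discordant pairs; the test is undefined")
    statistic = (abs(b - c) - (1 if correct else 0)) ** 2 / (b + c)
    # Upper tail of the chi-squared distribution with 1 df.
    p_value = erfc(sqrt(statistic / 2))
    return statistic, p_value


stat, p = mcnemar(10, 2)   # hypothetical discordant counts
print(stat, p)
```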

Application of PlaNET to previous EWAS

To demonstrate the application of PlaNET, we downloaded placental HM450K DNAme datasets GSE98224, GSE100197, and GSE71678. We note that GSE100197 and GSE98224 overlap cohorts C4 and C5, respectively. To apply PlaNET to obtain ethnicity information, raw data were downloaded from GEO in the form of IDATs and loaded into R using minfi (version 1.26.2). Both NOOB and BMIQ normalization were applied before applying PlaNET. The R package limma (version 3.36.2) was used to test for differentially methylated sites. For GSE98224 and GSE100197, the processed DNAme data were used, and statistical thresholds were chosen to match those of the published analysis [21]. For enrichment analysis, differentially methylated CpGs were input into the gometh function from the R package missMethyl (version 1.16.0) using all filtered sites as background, and default settings.