Self-reported race and ethnicity information was collected through face-to-face interviews from 645 participants in 7 counties. Questionnaire responses allowed for two ethnicities (Hispanic or Non-Hispanic) and six race categories (Black or African American, American Indian or Alaska Native, Asian, Native Hawaiian or Other Pacific Islander, White, and Some Other Race). Participants could select more than one race category, in which case they were classified as Multiracial.
DNA was isolated from whole blood and genotyped using the Illumina HumanExome Array, with 641 samples passing quality control criteria. The HumanExome Array was designed with approximately 3,000 ancestry informative markers that distinguish between European and African American ancestry, and 1,000 markers that distinguish between European and Native American ancestry. Additional content included sites that could vary by population but were not chosen for ancestry informativeness, such as GWAS single nucleotide polymorphisms (SNPs), coding variation, randomly selected synonymous sites, and human leukocyte antigen (HLA) tags. To identify all sites that were informative for ancestry, we calculated the informativeness of each site for distinguishing among the 5 superpopulation groups of the 1000 Genomes (1KG) Project, and identified around 30,000 sites with positive informativeness.
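The informativeness calculation described above can be illustrated with a short sketch. The text does not state which statistic was used; the example below assumes an information-theoretic "informativeness for assignment" measure, computed per biallelic SNP from its alternate-allele frequency in each reference population. The function names and the example frequencies are hypothetical.

```python
import math


def xlogx(x):
    """Return x * ln(x), using the convention 0 * ln(0) = 0."""
    return x * math.log(x) if x > 0 else 0.0


def informativeness(alt_freqs):
    """Informativeness for assignment of one biallelic SNP (assumed statistic).

    alt_freqs: alternate-allele frequency in each of the K populations.
    The value is 0 when all populations share the same frequency and
    increases as the frequencies diverge.
    """
    k = len(alt_freqs)
    total = 0.0
    # Sum the contribution of the alternate allele and of the reference allele.
    for allele_freqs in (alt_freqs, [1.0 - p for p in alt_freqs]):
        mean_freq = sum(allele_freqs) / k
        total += -xlogx(mean_freq) + sum(xlogx(p) for p in allele_freqs) / k
    return total


# Hypothetical frequencies for three SNPs across five populations.
snp_freqs = {
    "snp_uniform": [0.3, 0.3, 0.3, 0.3, 0.3],
    "snp_moderate": [0.1, 0.2, 0.5, 0.2, 0.1],
    "snp_strong": [0.95, 0.05, 0.10, 0.05, 0.10],
}

# Keep sites whose informativeness is positive (small tolerance for rounding).
informative = [name for name, f in snp_freqs.items() if informativeness(f) > 1e-12]
```

With estimated allele frequencies, almost any site whose frequencies differ at all between populations scores above zero, so a positive-informativeness filter of this kind is permissive by design.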
For each NCS participant, we identified the most similar 1KG superpopulation. Using the ancestry informative SNPs, we clustered the genotypes of the NCS participants together with those of the 1KG participants using multidimensional scaling (MDS). To identify the most similar superpopulation, we created a linear discriminant model based on the top 20 dimensions of the MDS, and trained it using the 1KG data. Then, based on the model, we predicted the most likely superpopulation for each NCS participant (Table 1). We additionally performed this analysis using the 21 1KG populations for which we had data (see Additional file 1: Table S1).
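The procedure above (joint MDS of study and reference genotypes, a linear discriminant model trained on the reference samples, then prediction for the study samples) can be sketched as follows. This is a minimal illustration on simulated genotypes, not the analysis code used in the study. It assumes classical MDS on Euclidean genotype distances and a shared-covariance linear discriminant written directly in NumPy; the function names, the toy labels `POP_A` and `POP_B`, and the use of 2 dimensions instead of 20 are all assumptions made for the example.

```python
import numpy as np


def classical_mds(genotypes, n_dims):
    """Classical MDS coordinates from Euclidean distances between genotype rows."""
    g = np.asarray(genotypes, dtype=float)
    sq = np.sum(g ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (g @ g.T)  # squared distances
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    b = -0.5 * centering @ d2 @ centering  # double-centred Gram matrix
    vals, vecs = np.linalg.eigh(b)
    top = np.argsort(vals)[::-1][:n_dims]  # largest eigenvalues first
    return vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))


def fit_lda(x, labels):
    """Fit a linear discriminant model with a pooled within-class covariance."""
    classes, idx = np.unique(labels, return_inverse=True)
    means = np.array([x[idx == k].mean(axis=0) for k in range(len(classes))])
    centered = x - means[idx]
    pooled_cov = centered.T @ centered / (len(x) - len(classes))
    priors = np.bincount(idx) / len(idx)
    return classes, means, np.linalg.pinv(pooled_cov), priors


def lda_scores(model, x):
    """Linear discriminant score of each row of x for each class (higher is better)."""
    _, means, precision, priors = model
    linear = x @ precision @ means.T
    offset = -0.5 * np.sum((means @ precision) * means, axis=1) + np.log(priors)
    return linear + offset


# Simulated genotypes (0/1/2 allele counts); these are NOT real 1KG or NCS data.
rng = np.random.default_rng(0)
n_snps = 200
reference = np.vstack([rng.binomial(2, 0.1, size=(30, n_snps)),
                       rng.binomial(2, 0.9, size=(30, n_snps))])
reference_labels = ["POP_A"] * 30 + ["POP_B"] * 30
study = np.vstack([rng.binomial(2, 0.1, size=(3, n_snps)),
                   rng.binomial(2, 0.9, size=(3, n_snps))])

# Embed reference and study samples together, then train on the reference rows only.
coords = classical_mds(np.vstack([reference, study]), n_dims=2)
ref_coords, study_coords = coords[:len(reference)], coords[len(reference):]
model = fit_lda(ref_coords, reference_labels)
scores = lda_scores(model, study_coords)
predicted = model[0][np.argmax(scores, axis=1)]  # best-scoring label per study sample
```

Embedding both sample sets in a single MDS run places study and reference individuals in the same coordinate system, so the discriminant model trained on the reference rows can be applied to the study rows without any separate projection step.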
For each self-reported race and ethnicity stratum, we identified the 1KG superpopulation(s) that we expected the group to match (Table 1). When multiple superpopulations were plausible, all were included as expected matches. For example, we expected self-reported Hispanic African Americans to be most similar to either the African (AFR) or the Admixed American (AMR) 1KG superpopulation. We did not include individuals who identified themselves as Multiracial or Non-Hispanic Other (a total of 33 individuals) in the concordance estimates. For the remaining NCS participants, we observed high levels of agreement between estimated genetic ancestry and self-reported ethnicity and race (Figure 1), with 601/608 (98.8%) concordant calls.
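The concordance rule described above can be written out as a small sketch: a call is concordant when the predicted superpopulation is in the set of expected matches for the self-reported stratum, and excluded strata do not count toward the denominator. Only the Hispanic African American entry comes from the text; the other mapping entry, the stratum label strings, and the example records are hypothetical.

```python
# Expected superpopulation matches per self-reported stratum.
EXPECTED = {
    "Hispanic Black or African American": {"AFR", "AMR"},  # example given in the text
    "Non-Hispanic White": {"EUR"},  # assumed for illustration only
}

# Strata with no expected match, left out of the concordance estimate.
EXCLUDED = {"Multiracial", "Non-Hispanic Other"}


def concordance(records):
    """Return (concordant calls, scored calls) for (stratum, prediction) pairs."""
    scored = [(s, p) for s, p in records if s not in EXCLUDED]
    hits = sum(p in EXPECTED[s] for s, p in scored)
    return hits, len(scored)


# Hypothetical records, not study data.
records = [
    ("Non-Hispanic White", "EUR"),
    ("Hispanic Black or African American", "AMR"),
    ("Hispanic Black or African American", "EUR"),
    ("Multiracial", "ASN"),
    ("Non-Hispanic White", "EUR"),
]
hits, total = concordance(records)
rate = 100.0 * hits / total
```

Applied to the counts reported above, the same arithmetic gives 100 × 601/608, which rounds to 98.8%.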
Clustering can be visualized by plotting the first three MDS dimensions against each other for the 1KG (Figure 1A, B) and the NCS individuals (Figure 1C, D). Data points were plotted in the first and second dimensions (Figure 1A, C) and in the second and third dimensions (Figure 1B, D). The AFR, EUR, and ASN superpopulations are clearly differentiated in the first two dimensions, while the SAN and AMR groups overlap, reflecting their historical European and East Asian ancestry (Figure 1A). The AMR group is broadly distributed, indicating that some individuals are genetically more similar to the EUR group and others to either the ASN or AFR groups, whereas individuals in the SAN group cluster together. In the second and third dimensions, SAN and AMR are distinctly identifiable (Figure 1B). NCS individuals identified as Asian by self-report overlap with both the SAN and ASN groups. This is expected, as no distinct self-reported race category was available for persons of South Asian descent (largely Indian).
We further investigated individuals who were discordant with our predictions. Linear discriminant analysis provides a relative score for how well each individual matches each group, and we observed that the expected group for a discordant individual was often the second-best superpopulation prediction. Of the seven discordantly assigned individuals, six matched their second most likely superpopulation, and the remaining one matched their third most likely superpopulation. We also examined our analysis of the 21 1KG populations (see Additional file 1: Table S1), and observed that 4 of the 7 discordant individuals matched a population belonging to the superpopulation expected from their self-report, even though they were not placed in that superpopulation when the 5 1KG superpopulations were used for the analysis. This suggests that in some cases, analyses at population level may be more accurate for assigning genetic ancestry to an individual than analyses at superpopulation level. Overall, however, we observed the same level of concordance using populations as we did using superpopulations (601/608, 98.8%). The discordant calls were unlikely to be the result of misidentified samples, because these samples were collected at five of the seven NCS sites and the discordances were not consistent with swaps within each site (data not shown).
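The second-best and third-best comparisons above amount to ranking the discriminant scores for each individual and locating the expected group in that ranking. A minimal sketch follows; the function name and the example scores are hypothetical, and any score where higher means a better match would work in the same way.

```python
def rank_of_expected(scores, expected):
    """Return the 1-based rank of the best-ranked expected superpopulation.

    scores: dict mapping superpopulation label to a score (higher is better).
    expected: set of superpopulations expected from the self-report.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return min(ordered.index(group) + 1 for group in expected)


# Hypothetical scores for one discordant individual.
example_scores = {"AFR": 0.05, "AMR": 0.30, "ASN": 0.02, "EUR": 0.60, "SAN": 0.03}

rank_single = rank_of_expected(example_scores, {"AMR"})         # one expected group
rank_multi = rank_of_expected(example_scores, {"AFR", "AMR"})   # two expected groups
```

A rank of 1 corresponds to a concordant call; a rank of 2 means the expected group was the runner-up, as for six of the seven discordant individuals described above.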
The Hispanic White and Hispanic Other self-reported groups were found to be of closely related ancestry, with 78% of Hispanic White and 94% of Hispanic Other individuals predicted to match the AMR superpopulation. However, individuals with a self-report of Hispanic White were more likely to match the EUR group (22%) than those with a self-report of Hispanic Other (6%), which is consistent with individuals who identify as Hispanic having a heritage that includes European and often, but not always, Native American ancestry.
While there was no expected population group for the 33 individuals who reported being Multiracial or Non-Hispanic Other, we were able to assign them to their most similar superpopulations. As a group, they showed great diversity, with individuals matching each of the five superpopulations. Of note, 11 (33%) of these individuals matched the ASN or SAN groups, which were less represented in the other categories (27/608, 4.4%). These data suggest that individuals of South Asian or East Asian descent may not be adequately captured by the NCS ethnicity and race categories.
Comparison of reported ethnicity and race with genetic ancestry highlighted the difficulties in properly capturing this information for individuals from populations with historical admixture. For the Non-Hispanic Asian population, we observed two clearly distinct groups: those closely related to the ASN superpopulation, which is composed of Han Chinese individuals (from Beijing and Southern China), Chinese individuals from Denver (CO), Japanese individuals, and Kinh individuals from Ho Chi Minh City (Vietnam); and those closely related to the SAN superpopulation, which is composed of Gujarati Indian individuals from Texas (Figure 1; see Additional file 1). Although of related ancestry, these two groups can be clearly discriminated genetically, and the currently used race category of ‘Asian’ does not adequately distinguish between individuals of South Asian and East Asian descent, highlighting the relevance of using genetically determined ancestry rather than self-reported ancestry alone.
A comparison of genetic ancestry with self-reported ethnicity and race for Hispanic individuals showed that the genetic ancestry of those choosing the categories Hispanic, White (50 persons) and Hispanic, Other (62 persons) is largely the same. Individuals choosing Hispanic, White or Hispanic, Other were most similar to the AMR superpopulation (102/112), which is composed of Colombian individuals in Medellin, Colombia; Mexican individuals from Los Angeles, CA; Peruvian individuals in Lima, Peru; and Puerto Rican individuals in Puerto Rico. The remaining individuals matched the European or African superpopulations.