Introduction

Prior research revealed significant health and behavioral risk factor disparities between American Indians and Alaskan Natives (AI/AN) and other racial groups. For instance, in 2017–2018, the US age-adjusted prevalence of diabetes for AI/AN adults was 14.7%, compared with 7.5% for non-Hispanic White adults [1]. According to the CDC [2], in 2018 the age-adjusted obesity prevalence for AI adults was 48.1% compared with 30.0% for Non-Hispanic Whites. Additionally, AI/AN youth and adults have the highest prevalence of cigarette smoking among all racial/ethnic groups in the US [3, 4].

However, studying the AI/AN population is challenging due to the small population size, high heterogeneity, geographic spread, and poor representation of the existing survey data. In 2019, AI/AN represented 1.7% of the US population, with 5.7 million AI/AN nationwide [5]. AI/AN populations in many states are very small, ranging from 9,006 in Vermont to 772,394 in California [5]. In 2021, US tribal populations ranged from about 400 for Coachella California to 389,751 for Cherokee Nation in Oklahoma.

Recent COVID-19 studies have included AI/AN with a sample size of only three [6], while others lumped AI/AN with other races [7,8,9]. Still others ignored the population altogether [10,11,12,13,14,15,16,17], while some considered AI/AN without discussing misclassification [18,19,20], and others focused solely on AI/AN populations [21,22,23,24,25]. In addition, American Indians and Alaskan Natives are often lumped into the same group, but have different histories and lifestyles [26]. In fact, most tribes in the US have very different histories [27], economic structure (27), political structure (27), health care access [28], social vulnerability [29], levels of overcrowding, historical trauma [30,31,32] and population mixing [26].

Representative survey data for AI/AN are often limited [33]. A 2013 review of US population surveys from 1960 to 2010 for cancer research found only 17 surveys with AI/AN work, and many of those had less than 500 respondents [34]. Due to the sparseness of AI/AN populations, some surveys, such as National Health Nutrition and Examination Survey (NHANES), do not release identification information for AI/AN. The Oklahoma Tribal Epidemiology Center (OKTEC) Tribal BRFSS (TBRFSS) provides a unique data source for AI/AN with a sample size over 700, providing rich information related to social demographics, health, and behavior. Such information has been used by OKTEC for designing intervention strategy for improving AI/AN health. TBRFSS collected high quality data for AI/AN by using a well-developed sampling design of a combination of event sampling, email sampling, and social media sampling from Kansas, Oklahoma, and Texas. However, it may suffer from selection bias since the sampling design for TBRFSS is non-probability (e.g. not every unit in AI/AN population had non-zero probability of being selected), see Baker et al. (2013) for more discussion about issues with non-probability samples [35].

Data integration procedures, by combining information from probability samples and non-probability samples, can be used effectively to reduce selection bias of non-probability samples (see [36,37,38], among others). Calibration methods [39, 40] have been used frequently in practice by constructing the calibrated weights in non-probability sample such that the weighted estimates by using a non-probability sample will benchmark with the estimates from the probability sample. The underlying assumption for using calibration methods is that the non-probability sample and the probability sample should have some overlap variables, which will often be satisfied in practice. Mass imputation [38, 41, 42] is another set of data integration methods that borrow the strength of statistical modeling to predict the outcome variables in the probability sample after fitting prediction models by using the non-probability sample. The power of the mass imputation procedure depends on the effectiveness of the modeling process.

In this paper, we propose using calibration methods and mass imputation by fully conditional specification (FCS) [43, 44] to improve the representativeness of the TBRFSS. The imputation by fully conditional specification (FCS) method has been shown to be very effective for handling multivariate missingness in practice [45, 46]. To the best of our knowledge, we are the first to consider such a method in data integration problems. Previous literature did not consider mass imputation with multiple outcome variables. In addition, we are the first to consider using data integration methods to reduce the selection bias for AI/AN populations. Our proposed data integration methods are important for AI/AN research because most samples for AI/AN research are non-probability samples, which may suffer from serious selection bias.

Our paper is organized as follows. In section two, we describe the BRFSS and TBRFSS surveys. We discuss our proposed data integration methods in section three. Results are presented in section four. Final remarks, including future research and limitations, are presented in section five.

Background for BRFSS and TBRFSS

2018 and 2019 Behavioral Risk Factor Surveillance System (BRFSS)

The Behavioral Risk Factor Surveillance System (BRFSS) is the nation’s premier system of health-related telephone surveys that collect state data about U.S. noninstitutionalized adults regarding their health-related risk behaviors, chronic health conditions, and use of preventive services. It provides large scale (over 840,000 completed cases in 2018 and 2019 together) national- and state-level representative samples (e.g., probability samples) that use a stratified sampling design with both landline and cell frames. The sampling is done independently for each state. Together, the 2018 and 2019 BRFSS collected information for over 11,500 adults (with over 970 AI/AN adults) in Oklahoma. For more detailed information about the sampling and weighting of the BRFSS, see https://www.cdc.gov/brfss/annual_data/2019/pdf/overview-2019-508.pdf.

2019 tribal Behavioral Risk Factor Surveillance System (TBRFSS)

The goal of the 2019 TBRFSS was to survey American Indian populations in Kansas, Oklahoma, and Texas to determine the extent of behaviors that contribute to or protect from adverse health outcomes. Convenience sampling was used to obtain responses to the survey. AI/AN representation in surveys and sampling is often small and limited. The OKTEC serves as the public health authority for 43 federally recognized tribes in the Southern Plains Area (Kansas, Oklahoma, and Texas), which had an AI/AN population of approximately 443,734 in 2017. The OKTEC and the University of Oklahoma Health Sciences Center Hudson College of Public Health have a longstanding relationship and have previously conducted projects together. The TBRFSS is a deliberate oversampling to have a better understanding of the health factors in American Indian communities. The compiled analyzed data will then go back to the tribes to provide information for making decisions with programs, applying for grants, and to help with other public health outcomes.

The survey collection was conducted in three phases to try to capture a more complete sample of the AI/AN population. Each of the three phases had a drawing for incentives, with three larger awards available to be drawn from the participants within each phase. The first phase was conducted from September to December 2019 with a mix of convenience sampling by attending tribal events in person, over email, and through website availability. In phase 1, 793 surveys were collected (46 from online responses and 747 from physical surveys). The second phase was interrupted by the 2020 COVID-19 pandemic and the original survey collection plans were adjusted. Surveys were collected from April 2020 to October 2020 through several methods, including using convenience sampling with email and website availability through social media. An address-based sample from census tracts with high percentages of AI/AN people was purchased through the Marketing System Group. Potential respondents were mailed a postcard with a link to an online survey. The final method of collection was through a telephone survey using a call center with the University of Oklahoma Health Sciences Center, the Sooner Survey Center. Phone numbers were also purchased through the Marketing System Group and the Sooner Call Center. Sixty-one surveys were collected through email and web availability, 23 were collected through the postcard mailing, and 295 were collected through the telephone collection survey. For evaluation purposes, to eliminate the effect of the Covid-19 pandemic, we used only the Phase I Oklahoma state sample data in this paper.

The initial target population of the BRFSS for Oklahoma state is all adults in Oklahoma. This target population is different from that of the TBRFSS, for which the target population of interest is Oklahoma American Indian adults. However, since the BRFSS used a probability sampling design (dual-frame random digit dialing method) for data collection, if we restrict the BRFSS data file to only Oklahoma American Indian adults, it is still a probability sample for that population, see Sect. 2.12 of the well-cited “Sampling techniques” book by Cochran (2007) [47]. Therefore, after restriction to Oklahoma American Indian adults, the two surveys have the same target population of interest.

Methods

To combine the information from the 2018 and 2019 BRFSS, we stacked the two data files (e.g., 2018 and 2019 BRFSS) together. Then, we created the composite weight variable, which equals the original weight variable in the 2018 BRFSS (or 2019 BRFSS) divided by 2 if the corresponding participants belong to the 2018 BRFSS (or 2019 BRFSS). We also treated year and original stratification variables together as the new stratification variable after combining them. Other researchers [48,49,50] provided detailed discussions on creating composite weights for combining information from multiple surveys.

To evaluate the selection bias of the TBRFSS, we first compared the BRFSS weighted descriptive statistics (e.g., frequency and percentage for categorical variables, mean and standard deviation for continuous variables) and TBRFSS unweighted descriptive statistics for demographic and general health variables: Age Group, Gender, Marital Status, Education Level, Employment Status, Income Level, Body Mass Index (BMI), and General Health Status. The Rao-Scott Chi-square test [51, 52] was used to compare the distribution difference of categorical variables between the BRFSS and TBRFSS. Survey weighted regression [53] was used to compare differences in continuous variables between BRFSS and TBRFSS. Stratification and survey weights were incorporated.

The following data integration approaches were performed. First, the calibration approach was used, creating survey weights for TBRFSS by benchmarking the weighted descriptive statistics for variables (Age Group, Gender, Marital Status, Education Level, Employment Status, Income Level, BMI Status, and General Health Status) in the TBRFSS with those in the BRFSS. In other words, the calibrated weights \({\tilde{w}}_{i}\) for unit \(i\in {s}_{B}\) where \({s}_{B}\) denotes the non-probability sample can be obtained by minimizing the distance function \(\sum _{i\in {s}_{B}}\frac{1}{2}{\left(\frac{{\tilde{w}}_{i}}{{w}_{i}}-1\right)}^{2}\) such that \(\sum _{i\in {s}_{B}}{\tilde{w}}_{i}{x}_{i}=\sum _{i\in {s}_{A}}{d}_{i}{x}_{i}\) where \({x}_{i}\) is a vector of covariate variables described above, \({w}_{i}\) is the initial weight in the non-probability sample (in our application, we used \({w}_{i}=1\)), \({s}_{A}\) denotes the probability sample, and \({d}_{i}\) is the design weight in the probability sample for unit \(i\). Specifically, an iterative proportional fitting algorithm [54, 55] was used to obtain the above calibration weights. Second, the sequential mass imputation approach by FCS method was used [43, 44]. Variables described in the calibration approach as well as the two-way interaction terms were used as the predictors in the sequential mass imputation model. Logistic regression models were used for categorical study variables. Linear regression models were used for continuous study variables. For both data integration approaches, we considered the following nine outcome variables of interest: Smoking status, Arthritis status, Cardiovascular Disease status (CVD), Chronic Obstructive Pulmonary Disease status (COPD), Asthma status, Cancer status, Stroke status, Diabetes status, and Health Coverage status. Even though the nine outcome variables were observed in both the BRFSS and TBRFSS data files, for evaluation purposes, we assumed that they were only observed in the TBRFSS for conducting the two data integration approaches. For the sequential mass imputation approach, we first used a non-probability sample (e.g., TBRFSS) to fit the imputation model, then generated imputed values of the above nine outcome variables sequentially once for the entire combined BRFSS data file. For simplicity, we only generated one imputed data file. Besides the assumption that there are overlapping covariate variables in both probability sample and non-probability sample, the validity of calibration approach depends on the assumption that there is a reasonable linear correlation between the outcome variables and covariate variables at the population level. The validity of mass imputation approach depends on the assumption that the imputation model fitted by using non-probability sample is reasonable and holds at the population level. If the associations between the covariate variables and outcome variables were small, then the data integration methods may not be very effective for reducing the selection bias. After data integration, we compared the adjusted estimates after data integration with the true estimates observed in the BRFSS. The Rao-Scott Chi-square test [51, 52] has been used to compare the distribution differences in categorical variables between different methods. Missing values for the above eight covariate variables and night outcome variables in both BRFSS and TBRFSS were imputed by using the SAS procedure ‘PROC MI’ before data integration. Since the missing rates for all variables are less than 5%, we only used single imputation. All analyses were conducted using SAS 9.4.

Results

After subsetting to only the Oklahoma AI/AN adults, there were 635 observations in the 2019 TBRFSS and 973 observations in the 2018 and 2019 BRFSS combined data file. Table 1 presents the comparison between the BRFSS weighted descriptive statistics and the TBRFSS unweighted descriptive statistics for the variables listed in the Methods section. All variables were highly significant with p values less than 0.001. The TBRFSS sample tended to be older (45.04% vs. 27.42% for 55 plus), female (77.95% vs. 51.20%), have fewer married cases (38.11% vs. 44.31%), have higher education level (60.00% vs. 47.98% for above high school graduation), have higher employment status (62.99% vs. 57.80%), have lower income (9.92% vs. 20.73% for above $75,000), have higher BMI (55.75% vs. 40.30% for obesity), and have less general health (31.50% vs. 42.72% for Very Good and Excellent).

Table 1 Comparison of BRFSS weighted descriptive statistics, TBRFSS unweighted descriptive statistics, and TBRFSS weighted descriptive statistics after calibration process (Significant results with p values less than 0.001 are marked with *)

Table 1 also presents the comparison between the BRFSS weighted descriptive statistics and the TBRFSS weighted descriptive statistics for the variables listed in the Methods section after the calibration process. As expected, the difference between the two distributions was small after calibration, which shows the improvement of the representativeness for the TBRFSS after calibration. For example, the weighted percentage of Male by using TBRFSS becomes 48.80% after calibration adjustment and it equals to the BRFSS weighted percentage. The weighted percentage of Married people by using TBRFSS becomes 44.29% after calibration adjustment and it is close to the BRFSS weighted percentage of 44.31%.

Table 2; Fig. 1 present the comparison of two data integration methods (calibration and mass imputation) with the naïve method without any adjustment. In Table 2, the frequency (%) for mass imputation was computed by using the composite weight variable and the imputed outcome variables in the combined BRFSS data file. The frequency (%) for the calibration method was calculated based on the calibration weight for TBRFSS data. According to Table 2, both calibration and mass imputation significantly improved the estimates of frequency, since their estimates were much closer to the true estimates from the BRFSS (p values less than 0.001). The naïve estimates were significantly smaller than the true estimates in terms of frequency (p values less than 0.001). For percentages, data integration methods outperformed the naïve method for most variables, except Arthritis status and Asthma status. For example, the estimates of percentage of smoking status were 29.54% and 22.31% for the mass imputation and calibration methods, respectively, and they were closer to the true estimate from the BRFSS (26.22%) than the naïve estimate (18.90%) from the TBRFSS. The estimates of percentage of Diabetes status were 23.58% and 19.83% for the mass imputation and calibration methods, respectively, and they were closer to the true estimate from the BRFSS (16.54%) than the naïve estimate (27.24%) from the TBRFSS. The mass imputation method was significantly better at estimating the prevalence of smoking than the naïve estimate (p value less than 0.05). The calibration method was significantly better at estimating the prevalence of diabetes than the naïve estimate (p value less than 0.05). In general, the two data integration methods outperform naïve method by only using TBRFSS since they took advantage of the associations between covariate variables and outcome variables and also the representativeness of the probability sample.

Table 2 Comparison of data integration methods with nine outcome variables (Significant results with p values less than 0.001 are marked with *)
Fig. 1
figure 1

Comparison of data integration methods with nine outcome variables

Discussion

Non-probability samples (e.g., convenience samples) have been used frequently in public health research due to the cost-effectiveness, time efficiency, and convenience of implementation. For instance, one component in the initial stage of the Strong Heart Study was a survey with a non-probability sample of American Indian tribal members 35–74 years of age residing in the three study areas (the community mortality study) to determine cardiovascular disease mortality rates from 1984 to 1994. Another well-known example is the Framingham Heart Study, which collected a non-probability sample of 5,209 men and women between the ages of 30 and 62 from the town of Framingham, Massachusetts, who had not yet developed overt symptoms of cardiovascular disease or suffered a heart attack or stroke. Furthermore, most surveys for American Indian health research, such as the TBRFSS and the Cherokee Nation Health Survey, used non-probability samples. Statistical analysis based on a non-probability sample without proper adjustment may lead to biased results due to selection bias. Data integration has been regarded as one of the most effective ways to reduce the selection bias from non-probability samples.

In this paper, we proposed a novel sequential mass imputation approach and applied it together with the calibration approach to improve the representativeness of the TBRFSS. To the best of our knowledge, we are the first to implement data integration approaches to American Indian Health research. Data integration procedures, including calibration and sequential mass imputation methods, show promising results for improving the representativeness of TBRFSS. Our proposed methods can be naturally applied to other health studies that use non-probability samples, such as the Strong Heart Study, the Framingham Heart Study, and many others. If selection biases or measurement errors were consistent over a period of time, then the trend analysis might still be possible. In addition, we developed user-friendly computational resource for other public health researchers. Computational codes can be obtained by sending email to the first author of this paper. The approach had limitations. First, we evaluated only the proposed data integration methods by using the BRFSS and TBRFSS. Further evidence must be provided by using other data files. Second, we considered only the nine outcome variables discussed in previous sections as examples. It might be interesting to evaluate other outcome variables. Thirdly, we only considered parametric sequential mass imputation approaches in the study. Nonparametric and machine learning methods can be used to improve the robustness of the methods. Lastly, there might be some small degree of misclassification for race variables in both BRFSS and TBRFSS. We ignored such effect in our study.

Conclusion

In this paper, we compared two of the most popular data integration approaches (calibration and mass imputation) for AI/AN health research by combining the information from the 2018 and 2019 BRFSS and the 2019 TBRFSS. Both data integration approaches improved the representativeness of the original TBRFSS sample for all outcome variables in terms of frequency estimates and for most outcome variables in terms of percentage estimates. In practice, it can happen that the unweighted estimate from non-probability sample is closer to the weighted estimate from probability sample compared to the adjusted estimate since we only observe one random sample and the performance of the data integration methods depend on the underlying model assumptions of outcome variables and covariate variables. The approach also had strengths. First, we are the first to apply data integration methods to improve the representativeness of AI/AN survey data. Second, we are the first to propose using sequential mass imputation for data integration. In future research, we will propose and apply machine learning-based data integration approaches, including regression tree, random forest, XGboosting, and Deep Learning, to further improve the representativeness of the TBRFSS, since those machine learning approaches can better model the complex non-linearity in the data file and may have better prediction power. We also plan to apply our proposed data integration approaches to combine information from future TBRFSS surveys with the BRFSS and American Community Surveys. One can first build an imputation model by using the TBRFSS and then impute outcome variables of interest in both the BRFSS and American Community Surveys.