A Methodology for Integrating Population Health Surveys Using Spatial Statistics and Visualizations for Cross-Sectional Analysis

Large-scale population surveys are beneficial in gathering information on the performance indicators of public well-being, including health and socio-economic standing. However, conducting national population surveys in low and middle-income countries (LMIC) with high population density comes at a high economic cost. To conduct surveys efficiently and at low cost, multiple surveys with different, but focused, goals are implemented through various organizations in a decentralized manner. Some of these surveys tend to overlap in outcomes, in spatial scope, temporal scope, or both. Mining data jointly from surveys with significant overlap gives new insights while preserving their autonomy. We propose a three-step workflow for integrating surveys using spatial analysis supported by visualizations. We implement the workflow on a case study using two recent population health surveys in India to study malnutrition in children under-five. Our case study focuses on finding hotspots and coldspots for malnutrition, specifically undernutrition, by integrating the outcomes of both surveys. Malnutrition in children under-five is a pertinent global public health problem that is widely prevalent in India. Our work shows that such an integrated analysis is beneficial alongside independent analyses of such existing national surveys to find new insights into national health indicators.


Introduction
Large-scale surveys are implemented to gather information about specific issues of the population, e.g., socio-political, economic, or ethnic issues pertaining to a country or geography. Survey analyses provide time-tested mechanisms for monitoring multi-dimensional indicators of such populations. Survey instruments are widely used in medical/health geographic studies to make observations of patterns in the population, if not necessarily for explaining the causality of the observations [1]. Particularly, health surveys are used for surveillance of public health outcomes [2]. Such surveillance involves quantitative analysis of total population health and outcome indicators [3,4].
However, despite the central role surveys play in monitoring population trends, implementing surveys is a complex problem owing to the demographic and socio-economic variations in the population, survey design for a multifaceted focus, diversity in handling data, decentralization of survey administration in the field, decisions on publishing data and outcomes, and finally, the economic and time cost of implementing surveys. Hence, we increasingly see that surveys are owned by various competent organizations who undertake them for specific requirements. This leads us to the case of overlapping surveys, as multiple surveys are implemented, with a focus on different metrics but considerable similarities [5]. Integrating such overlapping surveys is beneficial for gaining new knowledge, e.g., multiple health surveys can be used to jointly estimate household wealth and expenditures while still maintaining the length of the questionnaires by integrating them [6]. At times, outcomes of the same indicator from multiple surveys can also be used for validation of hypotheses built using these indicators.
Harshitha Ravindra, Jaya Sreevalsan-Nair have contributed equally to this work.
This article is part of the topical collection "Geographical Information Systems Theory, Applications and Management" guest edited by Lemonia Ragia, Cédric Grueau and Robert Laurini.
Even though big data are gathered and analyzed in surveys, the scope of reaping integrated benefits from overlapping surveys is limited. This is because the integration may require centralized planning efforts before conducting them. Such centralized activities reduce the degree of the desired autonomy in survey implementation for economic reasons in practice. There are primarily two issues with integrating surveys during their design and administration [5]. Firstly, a concerted effort is required to determine the scope and extent of overlap between multiple surveys to check the feasibility and benefit of such an integration. Secondly, there is a requirement of efficient government, which fosters such an integration, from planning a survey to publishing its outcomes. Hence, integrating multiple surveys during the data processing is more feasible than during the design or the administration phases.
In several domains, integrating disparate data sources has been widely practiced for such requirements, and survey integration can be seen in the same light. For example, various sources of data, such as geographic information, can be integrated with surveys [7]. One can also link spatial data from surveys and databases for the integration, e.g., health surveys and health facility databases [8]. Since spatial and temporal information is essential metadata to population survey data, they are used for testing the feasibility of direct integration of surveys. They further provide the mappings between the surveys for the integration itself.
Survey data integration becomes increasingly realizable with the open access to the raw or the processed data. The data collection and the reporting systems have been encouraged to share data to improve the adaptation of integrated surveys [3]. As an example, in India, the availability of raw data and reports of the National Family Health Survey (NFHS) in the public domain has improved the uptake of several researchers working with the data, compared to similar national surveys [9]. The NFHS is favorably implemented at the national scale at a higher frequency, i.e., roughly once in five years, aligned with the worldwide data collection efforts. The NFHS data can be strategically used with other national and local surveys to infer health and related socio-economic factors, even though its focus is on maternal-child health indicators. Hence, we choose to integrate the NFHS-4 during 2015-2016, the fourth edition of NFHS [10], and the Comprehensive National Nutrition Survey (CNNS) during 2016-2018 [11]. These surveys are conducted by the Ministry of Health and Family Welfare (MoHFW), Government of India (GoI), and implemented by the Ministry of Statistics and Programme Implementation (MoSPI), GoI. MoSPI provides access to the demographic survey outcomes. Overall, open data encourages several researchers to innovate with it.
In general, studies using the open data have examined these surveys in siloes based on their specific individual goals. There is prior work on comparing these surveys, specifically [12]. It is well-known that an integrated analysis of pertinent surveys can effectively reduce the burden of conducting numerous surveys in a developing country [6]. However, there is a gap in integrated analyses of existing national surveys. Hence, our goal is to demonstrate a proof-of-concept of a retrospective analysis of integration of existing cross-sectional surveys. Cross-sectional analysis helps in the determination of prevalence of conditions [13], which are observed across multiple surveys in our work.
This brings forth the challenge of integrating the data with differences in the scales of the open data available in the two chosen surveys. Generally, the household or individual is considered microscale, the district or county as the mesoscale, and the state or province as the macroscale. These can be seen as scales with respect to geopolitical regions or populations. As the scale increases, the analysis shifts from data of an individual to population-representative data, thus tending more towards highly approximated analysis. While the analysis at the finer scales is desirable for more accurate insights, mesoscale data analysis is the straightforward solution for cases with difference in scales of data from multiple sources. Thus, we improve macroscale data analysis in our work through novel approaches using spatial statistics and visualizations.
Our contribution is in the novel use of geospatial visualizations and spatial statistics for integrating surveys. In our previous work [14], we have proposed a three-step workflow for survey integration and demonstrated its efficacy through a specific case study of integrating two national health surveys, namely NFHS-4 and CNNS in India, for under-five child malnutrition study. The workflow has the following steps:
• Step-1: for a macroscale (state-wise) comparative analysis for determining the feasibility of survey integration using mesoscale data from one survey and macroscale data from the other, we use descriptive statistics, map-based visualizations, and distribution distance computation for determining the feasibility of the chosen survey integration.
• Step-2: for a macroscale (region-based) study to identify variables for survey integration, we use spatial heterogeneity analysis, circular radar plots for visualization, and a literature survey for the identification of these variables.
• Step-3: for determination of spatial clusters as survey integration outcomes, we use spatial autocorrelation analysis of these selected variables, thus jointly analyzing the two surveys.
We have used the available mesoscale (district-level) data from NFHS-4 and macroscale (state-level) data from CNNS, for Step-1. In the current work, we formalize our three-step workflow, with a more detailed exposition on the different scales of the data and a more structured process for the selection of variables for integration in Step-2. Here, we broadly categorize variables as outcome indicators [4] and contextual factors [15]. We use only macroscale data from both surveys. Overall, this has led to the inclusion of more variables for integration, for which we demonstrate the results in our case study. In this work, we also discuss the preliminary results from adding a new survey, namely NFHS-5, to our case study. This shows the extensibility of our method. Overall, extending our previous work [14], our current contributions are in:
• A structured workflow for integrating two different national health surveys with open access data at different scales,
• Integration of visualization and statistical methods for the integration,
• An in-depth analysis of the results of our proposed workflow in an appropriately selected case study.

Important Definitions:
Before going further into the article, we first consider formal definitions of terms used in our work.

Definition 1 (Cross-sectional Survey)
We define a large-scale survey that collects data from a sample population, generally pertaining to a geopolitical region, at a specific instance of time for determination of prevalence of a certain set of outcomes or other variables, as a cross-sectional survey.

Definition 2 (Outcome Indicator in Healthcare)
We define a variable that is computed from observable variables for the purpose of an outcome of a population survey for health surveillance as an outcome indicator in healthcare, based on the goals of the survey. For instance, stunting, wasting, and underweight observations are outcome indicators of malnutrition, and specifically, undernutrition.

Definition 3 (Contextual Factor)
We define a variable that pertains to the local environment and influences the health outcomes as a contextual factor. Contextual factors are variables with respect to specific contextual dimensions [15]. For instance, immunization is considered as a contextual factor in the public health dimension, whereas diet/food quality is a contextual factor in the behavioral dimension.

Definition 4 (Spatial Heterogeneity)
Spatial heterogeneity is a property of the variation in the spatial distribution of a point pattern (e.g., a child with underweight condition), or the variation in the quantitative or qualitative value in a surface pattern (e.g., temperature of a region). Point patterns are used in population studies in econometrics, and surface patterns are used in geographical or landscape studies. Spatial local heterogeneity pertains to specific points and their localities, whereas spatial stratified heterogeneity pertains to the variance of strata with respect to their neighboring ones. We use the latter in spatial autocorrelation analysis here.

Methodology
Spatio-temporal metadata of the cross-sectional surveys is critical in their selection for integration. This leads to the requirement of appropriate feasibility checks and subsequent solutions to alleviate the differences in the spatial scales of open data and minor overlap of its time-periods in the surveys. Hence, our integration method is driven predominantly by spatial analysis after imposing a strict requirement of sufficient overlap of time periods for the selected surveys. Given the complexity of the data in such large-scale surveys, visualizations enable a qualitative understanding of the spatial trends, and spatial statistics provide the quantitative counterpart. Since medical geography involves exploiting locality [1], we choose the analysis of under-five (U5) malnutrition in a lower-middle-income country, such as India, using our integrated survey analysis as our case study. Overall, we propose an approach combining both qualitative and quantitative methods for survey integration and demonstrate the effectiveness of our workflow using a case study for identifying high-risk regions in India for U5 child malnutrition.

Our Proposed Method
Our three-step workflow, as defined in Section "Methodology", is designed for a joint analysis of a population across two cross-sectional surveys. We explain the rationale behind each step here.
Step-1: Given the difference in the scope and goals, implementation, and time-frame of the surveys, we first check the feasibility of integrating them. The difference of the time of implementation across the surveys is ensured to be small or negligible for a cross-sectional study. If there is considerable overlap or close proximity (say, of 1 year) of project periods of population surveys (i.e., including design and implementation), then the temporal variation across the surveys is negligible. This is because the two surveys can now be considered to be approximately simultaneous, where observations in the surveys will not yield considerable differences in the health outcomes. However, we also consider the differences in the survey implementation, including population sampling and differences in publishing data.
Overall, we undertake the feasibility test through a comparative analysis using descriptive statistics, visualizations, and distribution distances. While we can use data at different scales (i.e., macro, meso, micro) for this comparative analysis, the remaining part of the workflow is implemented at the coarsest scale of the data in this work.
Step-2: The variables chosen for the cross-sectional analysis are classified as outcome indicators in healthcare or contextual factors. We first analyze the outcome indicators common across the surveys. This is followed by an analysis of the outcome indicators from one survey jointly with the contextual factors from the other. We choose spatial heterogeneity as a criterion for the selection of a variable for the analysis, as variables with high spatial heterogeneity help in discovering patterns in the data. Here, this data is used in the form of spatial point patterns. A point pattern refers to a data point of prevalence, i.e., indicated by a child with a positive outcome indicator or a household where the contextual factor is prevalent.
Step-3: Once the variables are identified, i.e., one from each survey, they are to be integrated at the data level using an appropriate aggregation function. If we use a spatial locality-based approach, then spatial statistical measures may be appropriately applied in both the selection of the variables as well as their aggregation. For aggregation using spatial statistics, spatial autocorrelation is an effective approach in retrospective cross-sectional studies for studying trends based on spatial locality. This is because retrospective cross-sectional studies are focused on identifying prevalence with specific research questions [13], such as spatial understanding. While we study contextual factors in this cross-sectional study, we do not differentiate between cause and effect. This is in line with the design of cross-sectional studies, which are for determining prevalence and associations, but not causality [13].
Our overall approach is as follows:
• Step-1: For feasibility test, we compare the absolute population counts and compute the distribution distances if a frequency distribution is available. A survey integration is feasible only if the differences between absolute counts for common variables or the distribution distances are low or negligible.
• Step-2: We identify the variables, one from each survey, to be considered for integration. We use spatial local heterogeneity as a factor for the choice of a variable from a survey. We can also choose variables identified as contextual factors in a survey.
• Step-3: We then use spatial statistical methods to identify significant geographical regions with characteristic behaviors of the variables selected in Step-2. The methods may be selected based on the analytical task pertaining to the significant regions. For example, in our study, we find hotspots and coldspots using spatial autocorrelation.
We use visualizations in our work for qualitative analysis. Visualizing state- or region-wise discrepancies provides a look-up to explain the differences we see in the outcome indicators based on the source/input survey. Juxtaposed views are effective for comparative visualizations, where the visualizations are maximally decoupled and are independently generated [16].

Proposed Case Study and Motivation
We focus on mining information on various aspects of malnutrition for U5 children, in India, through this integrated study. Studies on U5 children have reported spatial heterogeneity in various health indicators of malnutrition [17][18][19], which can be exploited. The interest in U5 studies is due to the persistence of childhood morbidity and mortality in India, as per NFHS-4 [20]. Wasting has not reduced as much as stunting between the NFHS-3 and NFHS-4 findings. In the weighted sample taken in CNNS, the prevalence of anemia is 40.5% amongst U5 children, with iron-deficiency anemia being the most prevalent type [21]. Nutritional deficiency affects all age groups, but U5 children, particularly those with severe acute malnutrition (SAM), have a higher mortality risk from common childhood illnesses such as diarrhea, pneumonia, and malaria [22]. While the infant mortality rate (IMR) is at 41 per thousand live births, the under-5-mortality rate (U5MR) is at 50. Childhood undernutrition alone accounts for 45% of U5MR and is a crucial public health issue in India. Dietary diversification is an additional solution apart from the focus on infrastructure for food distribution and delivery by the government [20]. There is an emphatic call for more frequent health surveys to be conducted to continuously monitor the progress due to such nutrition programs and infrastructural improvement, motivating our integrated study. A fine-grained analysis has been done on the occurrence of anemia, stunting, and incomplete immunization in children aged 12-59 months, at district and individual levels, using NFHS-4 data [18], as done in our work.

Surveys Used
Our case study includes two national health surveys, namely, the NFHS-4 and the CNNS.
NFHS-4, 2015-2016 provides information on population, health, and nutrition for women, men, and U5 children in all districts of all states and union territories in India. The International Institute for Population Sciences (IIPS), Mumbai, is the nodal agency for conducting different rounds of the survey.
CNNS, 2016-2018 is the largest exhaustive nutrition survey, including micronutrients, conducted for the first time in India, and is led by UNICEF and Population Council, New Delhi. This survey is focused on all children, i.e., the population under 18 years of age.

SN Computer Science
Both surveys overlap in the coverage of nutrition indicators of U5 children. The summary analysis reports have been published for both surveys, and the raw anonymized household-level data is available for the NFHS-4. The data and indicators that we use for our case study using NFHS-4 and CNNS surveys are listed in Table 1. While the relevant indicators for undernutrition are present in both surveys owing to their respective scope and goals, certain variables are covered in only one of the two. Our goal is to correlate variables across the two surveys spatially. NFHS-5 [23] is the fifth edition of NFHS implemented during 2018-2020, for which the publication of the fact-sheets by the survey administrators is currently ongoing. Since none of the data of NFHS-5 was available at the time of our previous work [14], we have included a preliminary understanding of this data in our survey integration workflow in our current work in Section "Discussion".
Table 1 Metadata and overall descriptive statistics (mean and standard deviation within the corresponding respondents) of selected outcome indicators and contextual factors of malnutrition of children under-five (U5), and available distribution data on severity with given labels, from NFHS-4 and CNNS [14]

Scope of our Case Study
To jointly study the observations on U5 child malnutrition as recorded in the NFHS-4 and CNNS surveys, we find potential outcome indicators and contextual factors for identifying high-risk regions of two variables observed across the surveys. The integration is macro-scale, i.e., at the state-level, given the coarsest granularity of data available in both surveys. Since the indicators for undernutrition conditions, except anemia, are available for sub-groups (Table 1) based on gender and urbanization, we use this additional information to study distributions of specific populations. Our proposed spatial analysis, inclusive of visualizations, validates the choice of variables used in the integrated study.

Variables and Data Types
In this study, outcome indicators refer to the malnutrition status of U5 children, namely stunting (height-for-age), underweight (weight-for-age), wasting (weight-for-height), and anemia [17]. Further, we look at micronutrient deficiencies of folate, low serum ferritin, vitamin A, vitamin B12, vitamin D, and zinc from the CNNS data, and the completion of immunization based on BCG, DPT, hepatitis B, measles, and polio vaccinations from the NFHS-4 data. These measures are the standardized ones used for determining the malnutrition status of children, as commonly used in literature [17,18]. The contextual factors are proximate or distal determinants of malnutrition that were used based on prior literature and data availability [18]. Maternal illiteracy, undernourished women, partially immunized children, unimproved sanitation, non-institutional births, and partially breastfed children with inadequate diet have been considered as contextual factors here.
The data type of the outcome variables, i.e., stunting, underweight, wasting, and anemia, is numerical, namely the count of children having specific malnutrition conditions. This data is at microscale in NFHS-4 data, and macroscale in the CNNS fact sheets. Further, we use the percentage data for computing the distribution distances of the severity classes of each malnutrition condition for each district or state.
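As a minimal sketch of how microscale records can be brought to the macroscale used in the integration, the following aggregates child-level observations into state-level prevalence percentages. The record layout, the function name `aggregate_prevalence`, and the sample values are illustrative assumptions and not survey data. Survey sampling weights, which an actual analysis of such surveys would need to apply, are deliberately ignored here.

```python
from collections import defaultdict


def aggregate_prevalence(records):
    """Aggregate child-level records to state-level prevalence (in percent).

    records: iterable of (state, has_condition) pairs. This layout is a
    hypothetical simplification; real survey files carry many more fields
    and per-record sampling weights, which are not handled here.
    """
    totals = defaultdict(int)
    positives = defaultdict(int)
    for state, has_condition in records:
        totals[state] += 1
        if has_condition:
            positives[state] += 1
    return {state: 100.0 * positives[state] / totals[state] for state in totals}


# Made-up records for illustration only (not taken from NFHS-4 or CNNS).
records = [
    ("A", True), ("A", False), ("A", False), ("A", True),
    ("B", True), ("B", False), ("B", False), ("B", False),
]
prevalence = aggregate_prevalence(records)
```

The same pattern applies to any intermediate scale, e.g., grouping by district instead of state, as long as both surveys are finally compared at a common scale.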

Workflow Implementation for the Case Study
Since our proposed workflow is predominantly data-driven, each step gets refined by the semantics of the selected data.
Step-1: For the feasibility test, we first visually check if the state-wise sample distributions for both surveys are equivalent and check them against the state-wise population distribution of the latest official census taken in 2011 in the states. The region-wise grouping of the thirty states and distribution of sample population covered in the surveys are shown in Fig. 1A, B respectively. Here, we use juxtaposed views of maps for stunting, underweight, and wasting across both surveys (Fig. 1C).
To complete the feasibility test, we first identify the common output indicators from both surveys. We consider the malnutrition conditions, namely, stunting, underweight, wasting, and anemia. In addition to their absolute count of U5 children with these malnutrition conditions, both surveys have data of the frequency distributions of the severity of each of the conditions, as reported in Table 1. Except for anemia, we also have data available in the gender- and urbanization-based groups, in addition to the total population. Hence, we use the frequency distribution to find the discrete probability distribution. The probabilities are then used for computing distribution distances for each group using Hellinger distance (HD). HD quantifies the similarity between the indicators across the two surveys. The HD between two discrete distributions $P = (p_1, p_2, \ldots, p_k)$ and $Q = (q_1, q_2, \ldots, q_k)$, $D_{HD}(P, Q)$, is:
$$D_{HD}(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{k} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$$
We choose HD as the distribution distance measure owing to its properties of symmetry and being a bounded metric with $0 \le D_{HD}(P, Q) \le 1$ [24]. The support for HD, i.e., [0,1], is such that 0 and 1 mean highly similar distributions, and highly dissimilar, respectively. These properties enable comparisons of the HD distances across states, where the HD is computed per state between distributions across surveys.
Another important desirable property of HD is that the distance measure follows the triangle inequality property, i.e., the HD between the two empirical discrete probability density distributions is not greater than the HD between the discrete distribution and the actual parameterized distribution [24]. Thus, our use of HD ensures the comparison of the lower bounds of the distances.
We compute the HD distances between distributions of [non-severe, severe, absence] for each malnutrition condition in the two surveys. We compute distances for each selected group (female, male, urban, rural), wherever applicable, as well as for the total population, e.g., we find the HD of distribution of [non-severe, severe, absence] of stunting for female U5 children, given in percentages, between NFHS and CNNS data (Fig. 2, first row, a).
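The per-state HD computation can be sketched as follows, assuming the standard normalized form of the Hellinger distance, i.e., the Euclidean distance between the element-wise square roots of the two distributions scaled by $1/\sqrt{2}$ so that the result lies in [0,1]. The function name `hellinger_distance` and the percentage values are hypothetical and only illustrate the [non-severe, severe, absence] layout; they are not values from either survey.

```python
import math


def hellinger_distance(p, q):
    """Hellinger distance between two discrete distributions.

    p and q may be given as probabilities, percentages, or counts; both
    are normalized to sum to 1 before the distance is computed.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of classes")
    total_p, total_q = sum(p), sum(q)
    p = [value / total_p for value in p]
    q = [value / total_q for value in q]
    squared = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(squared / 2.0)


# Hypothetical [non-severe, severe, absence] percentages for one state
# and one group, as reported by two surveys (illustrative numbers only).
survey_one = [25.0, 13.0, 62.0]
survey_two = [24.0, 11.0, 65.0]
distance = hellinger_distance(survey_one, survey_two)
```

Values of `distance` close to 0 would suggest that the two surveys report similar severity distributions for that state and group, which is the condition the feasibility test looks for.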
Unlike other undernutrition conditions, the data for anemia in the CNNS report is both sparse and only available at the macroscale, i.e., at the state level. Also, the HD values between the anemia distribution data in NFHS-4 and CNNS are negligible values across all states. Hence, to visualize perceptible differences in such distributions, we use pie-chart glyphs in maps (Fig. 3). Pie-charts depict the part-to-whole data effectively, and we use pie-chart glyph to depict data at the state level. Here again, we use juxtaposed views of the survey-wise visualizations. Thus, we qualitatively compare the surveys' distributions in terms of the severity of anemia in U5 children.
Step-2: The second step in our workflow is the variable selection for the integrated analysis of surveys. There have been several studies of the spatial local heterogeneity in undernutrition [17,19] in India. However, in India, based on the climatic conditions and natural resources, a state follows culture, tradition, and diet similar to that of its neighbor. Thus, when we consider contiguously situated states together, they form regions, as followed widely (Fig. 1A). This analysis is that of spatial stratified heterogeneity [25], which is defined as the unevenness in the distribution of traits, events, or their relationship between strata or regions, each of which includes a number of local units [25]. Here, as a novel methodology, we study the region-based trends in variables exclusive to each of the surveys. This analysis is at macroscale, given that states and regions are larger than districts, which are used for mesoscale analysis [17] to determine spatial local heterogeneity. Given the difference in the spatial scales of the data available for NFHS-4 and CNNS, we choose to use spatial stratified heterogeneity over spatial local heterogeneity.
The CNNS report [11] provides five different grouped studies relevant to U5 children pertaining to, namely, anemia, undernutrition (i.e., stunting, underweight, wasting), micronutrient deficiencies, diet quality, and markers of non-communicable diseases (NCD). Of these, we have considered anemia and undernutrition conditions in Step-1. Also, the markers of NCD is a study for children above five years of age and is not relevant here. Hence, we use the region-wise observations of micronutrient deficiencies and diet quality as potential outcome indicators in Step-2. The micronutrient deficiency includes the percentage of U5 children with deficiencies in [folate, low serum ferritin, vitamin A, vitamin B12, vitamin D, Zinc]. We also study the immunization status from NFHS-4, as its outcome indicator, for comparison. The immunization status includes the percentage of U5 children completing [BCG, DPT, Hepatitis B, Measles, Polio] vaccinations and achieving fully immunized status. In general, lower percentages for micronutrient deficiencies, and similarly, higher percentages for immunization status, imply better health indicators for U5 children in the entire region. Here, we select indicators that exhibit high spatial stratified heterogeneity, so that interesting patterns can be observed using spatial statistical analysis.
Since we are analyzing the output indicators for integrating surveys, we use visualizations using a circular radar plot for qualitative comparisons. We choose a circular plot to visually represent percentage data (Figs. 4, 5). The choice of radar plot is owing to its compactness, where a region-wise radar plot has each spoke or axis representing a state in the region. We again use juxtaposed views of region-wise visualization for qualitative comparison [16]. We use the set of circular radar plots corresponding to all regions to determine the spatial stratified heterogeneity, which we further quantitatively verify using the q-statistic [25]. We observe that the individual circular radar plot demonstrates the spatial local heterogeneity within the region, which has been considered in our previous study [14]. Suppose we have a total of $N$ geographical units in $R$ regions with $N_i$ units in each region for $i = 1, 2, \ldots, R$. Let $X_k$ and $X_{ik}$ be the values of an outcome indicator in the unit $k$ in the population and region $i$, respectively, and $\bar{X}$ and $\bar{X}_i$ be the values of population mean and region mean, respectively. Then, the q-statistic is given as [25]:
$$q = 1 - \frac{\sum_{i=1}^{R} \sum_{k=1}^{N_i} \left(X_{ik} - \bar{X}_i\right)^2}{\sum_{k=1}^{N} \left(X_k - \bar{X}\right)^2}$$
The q value is bounded in the interval [0,1], and is proportional to the strength of spatial stratified heterogeneity. Table 2 gives the measures computed for our selected outcome indicators from CNNS. It must be noted that we have excluded some of the variables given in the grouped analysis of CNNS data here. For instance, we have not considered the variable on the consumption of iron-rich food for children of 0-23 months of age (Fig. 5A), as this is a relatively low percentage and has already been considered in the analysis of anemia in Step-1. Similarly, we have skipped variables related to dietary composition from eggs and meat (flesh foods) for children of 24-59 months of age (Fig. 5B), as they form a small percentage of the diet, owing to the predominantly lacto-vegetarian diet followed in India. We decide to use a grouped study with a larger number of variables with significant spatial stratified heterogeneity, where the values in the fourth quartile are taken for further consideration (Table 2). Thus, we choose the micronutrient deficiency group from CNNS as the outcome indicators for survey integration.
Fig. 1 Data from the selected surveys, as given in our previous work [14]. A Region-wise grouping of states in the political map of India. B Comparison of sampled population distribution for NFHS-4 and CNNS using ratios of state-wise count with respect to that of the country, against baseline ratios using the population size from Census 2011, using percentage format. C Percentage of U5 children who are stunted, wasted and underweight across all states in India, as reported by the surveys
Fig. 2 Hellinger distance between discrete probability distribution of different levels of severity [non-severe, severe, absence] of different undernutrition conditions, namely, stunting, underweight and wasting in U5 children in the states of India, as shown in our previous work [14]. The distances are computed for different populations of the children, namely, female, male, urban, rural and total
Fig. 4 Our previous results [14] of circular radar plots showing region-wise A the occurrence in micronutrient deficiency, as given in the CNNS report, and B the coverage of immunization, as given in the NFHS-4 report. Each state-wise value is given in percentages of populations in respective states and union territories in India. (Note: A has been modified from the earlier version [14] with the addition of Goa in the West region.)
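The q-statistic for spatial stratified heterogeneity can be sketched as one minus the ratio of the within-region sum of squares to the total sum of squares, with states as units and regions as strata. The function name `q_statistic`, the region names, and the values below are hypothetical and serve only to show the computation; they are not taken from Table 2.

```python
def q_statistic(values_by_region):
    """q-statistic: 1 - (within-region sum of squares / total sum of squares).

    values_by_region maps a region name to the list of unit-level values
    (e.g., state-wise percentages) that fall in that region.
    """
    all_values = [v for values in values_by_region.values() for v in values]
    population_mean = sum(all_values) / len(all_values)
    total_ss = sum((v - population_mean) ** 2 for v in all_values)
    if total_ss == 0:
        raise ValueError("q is undefined when all units have the same value")
    within_ss = 0.0
    for values in values_by_region.values():
        region_mean = sum(values) / len(values)
        within_ss += sum((v - region_mean) ** 2 for v in values)
    return 1.0 - within_ss / total_ss


# Made-up state-wise percentages grouped by region (illustration only).
example = {
    "North": [12.0, 14.0, 13.0],
    "South": [30.0, 28.0, 33.0],
    "East": [21.0, 19.0],
}
q_value = q_statistic(example)
```

When the regions are internally homogeneous but differ from each other, the within-region term is small and q approaches 1; when the regions all look alike, q approaches 0.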
A study similar to our work has shown the influence of maternal education on the occurrence of anemia, stunting, and incomplete immunization in U5 children at the district level [18]. There is also evidence that there is spatial influence on poor sanitation, which is one of the causes of stunting in India, where the extreme temperature is a contextual correlate [26]. The variables listed as mesoscale correlates of malnutrition [17] are considered as contextual factors here. Such mesoscale correlates include maternal body mass index (BMI), breastfeeding practices, institutional births, and household electrification, in addition to immunization and improved sanitation [17]. Thus, we use the aforementioned variables as available from our chosen surveys as contextual factors of malnutrition, based on prior studies.
Step-3: The third step in our workflow is the integrated analysis using spatial autocorrelation, measured with global Moran's I, and localized cluster maps using bivariate LISA (Local Indicators of Spatial Association), computed using local Moran's I [27]. We determine the spatial autocorrelation between a variable from the first survey and another from the second survey. These variables include the indicators common to both surveys, i.e., indicators of undernutrition conditions, studied in Step-1. We also find the spatial autocorrelation between the outcome indicators and the contextual factors identified in Step-2. In this case, we ensure that the variable and the contextual factor are not from the same survey, so that the analysis is a joint analysis of the surveys. Since these measures are directed, i.e., asymmetric, we decide to use the variables with finer-scale data to give the neighborhood information.
Thus, in our case study, we compute all spatial autocorrelations of a variable from CNNS vs that from NFHS-4.
Moran's I is a weighted correlation coefficient [27] measuring spatial autocorrelation, where the weights are based on the spatial locations of the entities, given by:
$$I = \frac{N}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{\sum_{i}\sum_{j} w_{ij}\,(X_i - \bar{X})(X_j - \bar{X})}{\sum_{i} (X_i - \bar{X})^2} = \frac{N}{\sum_{i}\sum_{j} w_{ij}} \cdot \frac{z_X^{\top} W z_X}{z_X^{\top} z_X}$$
where $N$ is the number of observations, $\bar{X}$ is the mean of the variable $X$, $X_i$ and $X_j$ are the values of $X$ at locations $i$ and $j$, respectively, and $w_{ij}$ is a weight indexing location $i$ with respect to location $j$. In matrix-vector format, we have $W$ to be the spatial weight matrix and $z_X$ to be the z-scores of $X$. We compute the Moran's I value for each common indicator between its values from both the surveys, using states as observations. The expected value of I in the absence of spatial autocorrelation, $-1/(N-1)$, is an important threshold for making inferences from the Moran's I values. I values significantly lower than this threshold imply negative spatial autocorrelation, and I values significantly higher than it imply positive spatial autocorrelation. The I values are transformed to z-scores, and the z-scores and their p-values provide information about spatial clustering and statistical significance, respectively. A p-value less than 5%, i.e., p < 0.05, implies that the result is statistically significant, rejecting the null hypothesis that the spatial distribution of features is an outcome of random spatial processes. A positive z-score indicates more spatially clustered patterns, and a negative z-score indicates more spatially dispersed patterns.
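The Moran's I computation described above can be sketched in plain Python as follows. This is only an illustrative sketch and not the GeoDa implementation used for our results: the function names, the four-unit chain of neighbors, and the one-sided permutation test for the pseudo p-value are all assumptions made for the example. Passing a second variable `y` gives the bivariate form, in which the value of `x` at each location is related to the neighboring values of `y`.

```python
import math
import random


def zscores(values):
    """Standardize values using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0.0:
        raise ValueError("z-scores are undefined for a constant variable")
    return [(v - mean) / sd for v in values]


def global_moran(x, w, y=None):
    """Global Moran's I of x, or of x vs the neighbors' y if y is given."""
    zx = zscores(x)
    zy = zx if y is None else zscores(y)
    n = len(zx)
    s0 = sum(sum(row) for row in w)  # sum of all spatial weights
    cross = sum(w[i][j] * zx[i] * zy[j] for i in range(n) for j in range(n))
    return (n / s0) * cross / sum(v * v for v in zx)


def pseudo_p_value(x, w, y=None, n_perm=999, seed=0):
    """One-sided permutation p-value for positive autocorrelation (assumed test)."""
    rng = random.Random(seed)
    observed = global_moran(x, w, y)
    shuffled = list(x)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break the link between values and locations
        if global_moran(shuffled, w, y) >= observed:
            at_least += 1
    return (at_least + 1) / (n_perm + 1)


# Hypothetical chain of four units, each adjacent to its immediate neighbors.
w_chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
x_trend = [1.0, 2.0, 3.0, 4.0]
i_trend = global_moran(x_trend, w_chain)
i_alternating = global_moran([1.0, 4.0, 1.0, 4.0], w_chain)
i_bivariate_same = global_moran(x_trend, w_chain, y=x_trend)
p_trend = pseudo_p_value(x_trend, w_chain, n_perm=199)
```

A smooth trend along the chain gives a positive I, whereas alternating high and low values give a negative I, matching the interpretation of clustered versus dispersed patterns. With only four units the pseudo p-value is not informative; it is included only to show the mechanics.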
We use bivariate LISA to identify the high-risk (hotspot) and the low-risk (coldspot) regions. The bivariate LISA uses the local Moran's I value, but with two different variables [28]:
$$I_i = z_{X,i} \sum_{j} w_{ij}\, z_{Y,j}$$
where $z_{X,i}$ is the z-score of $X$ at location $i$, $z_Y$ is the standardized values, or z-scores, of another variable $Y$, and in this generalized version, the spatial weight matrix "averages" the neighboring values. Since this is an asymmetric measure, we use the intuitive understanding of X being the outcome indicator from CNNS data and Y being the contextual factor from NFHS-4 data. Thus, we find bivariate LISA cluster maps for X vs Y (Fig. 7). This is different from the analysis done in our previous work [14], where we analyzed the bivariate LISA cluster maps of Y vs X. In our current work, we first analyze the Moran's I (bivariate LISA) value, Z-value (z-score), and p-value, and generate visualizations only for statistically significant values, i.e., p < 0.05.
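The bivariate local Moran's I and the resulting cluster labels can be sketched as below. This is a minimal illustration under stated assumptions, not the GeoDa procedure: the function names and the four-unit example are hypothetical, the weights are row-standardized so that they average the neighboring values, and the permutation-based significance filtering that precedes the cluster maps is omitted.

```python
import math


def zscores(values):
    """Standardize values using the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if sd == 0.0:
        raise ValueError("z-scores are undefined for a constant variable")
    return [(v - mean) / sd for v in values]


def bivariate_local_moran(x, y, w):
    """Return (local I, cluster label) per unit for x vs the neighbors' y."""
    zx, zy = zscores(x), zscores(y)
    results = []
    for i, row in enumerate(w):
        total = sum(row)
        if total == 0:
            raise ValueError("unit %d has no neighbors" % i)
        # Row-standardized spatial lag: the average z-score of y over neighbors.
        lag_y = sum(wij * zy[j] for j, wij in enumerate(row)) / total
        local_i = zx[i] * lag_y
        if zx[i] > 0 and lag_y > 0:
            label = "high-high"  # hotspot
        elif zx[i] < 0 and lag_y < 0:
            label = "low-low"  # coldspot
        elif zx[i] > 0:
            label = "high-low"
        else:
            label = "low-high"  # exact zeros also fall here in this sketch
        results.append((local_i, label))
    return results


# Hypothetical values: x as an outcome indicator, y as a contextual factor.
w_chain = [[0, 1, 0, 0], [1, 0, 1, 0], [0, 1, 0, 1], [0, 0, 1, 0]]
clusters = bivariate_local_moran([1.0, 2.0, 8.0, 9.0], [1.0, 2.0, 8.0, 9.0], w_chain)
labels = [label for _, label in clusters]
```

In this toy chain, the first two units have low x and low neighboring y, and the last two have high x and high neighboring y, so they are labeled as coldspots and hotspots, respectively. In practice only the units with significant local statistics would be mapped.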

Results
We have used a Python 3.0 implementation with the SciPy package for computing HD. The map-based visualizations have been generated using QGIS version 3.8.3, the circular radar charts using R, and the spatial autocorrelation and cluster maps using GeoDa 1.14. Here, we present our new results in Step-2 and Step-3, especially in comparison to our previous findings [14].
Step-1 We check the feasibility of integrating NFHS-4 and CNNS surveys, specifically for the data on malnutrition in U5 children in India. We observe that the statistical descriptors of stunting, underweight, and wasting are comparable (Table 1), but there are macroscale (i.e., state-level) variations across surveys in the percentage of occurrence of these malnutrition conditions (Fig. 1C). We observe that NFHS-4 captures more regions with high occurrence of each of these conditions than CNNS, especially in the West and Central regions [14]. These variations in medium- and high-occurrence states can be attributed to the differences in survey administration, data processing, reporting, and sampling (Fig. 1B) across the states. For a fine-grained analysis to improve the feasibility of our study, we have used the distribution of different levels of severity of stunting, underweight, and wasting occurring in sub-populations of U5 children.
The additional information on the distribution of different severity levels of stunting, underweight, and wasting of U5 children has been used for a finer analysis. Studying the distributions in sub-populations in the urban and rural regions improves the spatial scale of analysis. These data are used for computing macroscale, i.e., state-wise, Hellinger distances (HD) between the indicators from the surveys, which are visualized using choropleth maps in Fig. 2. Here, we observe that the state-wise variations are low, as the upper bound of the state-wise HDs has been found to be 0.184 overall, which is much lower than the upper bound for HD measure, 1.0 [14]. We observe that two isolated states show relatively higher HDs, namely Jammu & Kashmir for stunting and Uttarakhand for wasting, across all the five sub-populations. This could also be attributed to the relative sparsity of samples from these regions.
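For reference, the Hellinger distance between two discrete distributions over the severity levels [non-severe, severe, absence] can be sketched as follows. The function name `hellinger` and the example percentages are hypothetical and are not values from NFHS-4 or CNNS; our reported results were computed with SciPy, whereas this sketch uses only the standard library.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete distributions, in [0, 1]."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same number of levels")
    sum_p, sum_q = float(sum(p)), float(sum(q))
    if sum_p <= 0 or sum_q <= 0:
        raise ValueError("each distribution must have positive total mass")
    # Normalize so that counts or percentages can be passed directly.
    squared = sum(
        (math.sqrt(a / sum_p) - math.sqrt(b / sum_q)) ** 2 for a, b in zip(p, q)
    )
    return math.sqrt(squared / 2.0)


# Hypothetical state-level percentages for [non-severe, severe, absence].
survey_a = [20.0, 10.0, 70.0]
survey_b = [22.0, 9.0, 69.0]
hd_similar = hellinger(survey_a, survey_b)
hd_identical = hellinger(survey_a, survey_a)
hd_disjoint = hellinger([1, 0, 0], [0, 1, 0])
```

Identical distributions give a distance of 0 and distributions with no overlap give the upper bound of 1, which is the scale against which the state-wise distances are read.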
For the data for anemia (Fig. 3), the pie-chart glyph sizes that correspond to the occurrence of anemia in each state are similar across the surveys [14]. However, we also observe differences in the distribution of severity of anemia occurring in different states, as demonstrated by the pie-chart glyphs themselves. We do not see salient differences in counts for the occurrence of severe-variant anemia owing to its lower prevalence. The differences in the prevalence of mild and moderate variants of anemia across surveys could be attributed to the lack of information on the population size on which these percentages have been computed in the CNNS.
Overall, we have proceeded further with the survey integration based on our conclusion [14] that the distribution of the outcome indicators has stronger similarities across NFHS-4 and CNNS, compared to the absolute counts or ratios of the population classified as undernourished. Thus, the survey outputs can be considered to be extensions of each other statistically, and hence, the integration of NFHS-4 and CNNS passes the feasibility test.
Step-2 Since the contextual factors influence variables in spatial stratified and local heterogeneous scenarios similarly, we have used the variables identified for the latter [17,18,26] for our purpose of finding the contextual factors. Immunization record and micronutrient deficiency for indicators of U5 children, and maternal illiteracy, women with BMI below 18.5 kg/m² (i.e., underweight), non-institutional births, partially immunized children, partially breastfed children with inadequate diet, households with poor sanitation facilities, and households with no electricity are known to be mesoscale correlates of malnutrition [17,18,26]. They have now been chosen as contextual factors, for which the available data used here are given in Table 1.
Using the favorable measures of spatial stratified heterogeneity, we have selected the outcome indicators. The circular radar plots and q-statistics have provided the qualitative and quantitative analysis of the spatial heterogeneity, respectively. These plots of region-wise values of the indicators (Fig. 4A) demonstrate that micronutrient deficiency has high spatial local heterogeneity even within regions [14]. We see that there is a significant deficiency in folate in Assam and Nagaland in the North East, in Andhra Pradesh in the South, and in Madhya Pradesh in the Central region. A significant prevalence of low serum ferritin, which is a primary cause of iron-deficiency anemia, is observed in Haryana and Punjab in the North and Karnataka in the South. In comparison, we observe that there is predominantly uniform coverage of immunization in states in each region (Fig. 4B). We now also observe high spatial stratified heterogeneity owing to the non-uniform patterns in all regions, e.g., the deficiencies are relatively high in the North and the North East regions. For the sake of completeness of grouped indicators in CNNS that are relevant for U5, in our current study, we have considered the dietary quality for children in the age-groups (0-23) months (Fig. 5A) and (24-59) months (Fig. 5B), which are considered as two sub-groups due to the known differences in the diet consumption patterns between the sub-groups. The circular plots and q-statistics (Table 2) both indicate that the spatial local and stratified heterogeneity are weaker compared to those found in the micronutrient deficiencies.

SN Computer Science
We have considered the fourth quartile of the q-statistics values to select the outcome indicators which demonstrate high spatial heterogeneity. We also observe that a few variables in other groups, e.g., dietary consumption of dairy products, and (grains, roots, tubers) among children of the age group of (24-59) months also show high spatial stratified heterogeneity, which may be studied further in future. In our results in Step-3, we find that the variables with high q-statistics value within the micronutrient deficiency group, namely, the serum ferritin, Vitamin B12, and Vitamin D demonstrate significant joint results too. Thus, the additional analysis of considering the dietary quality and q-statistics evaluation in our current work has strengthened our choice of micronutrient deficiencies as outcome indicators for integration from our previous work [14].
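The fourth-quartile selection can be sketched as below. This is an assumption-laden illustration: we take "fourth quartile" to mean values at or above the 75th percentile with linear interpolation, and the indicator names and q values are hypothetical rather than those in Table 2.

```python
import statistics


def fourth_quartile_indicators(q_by_indicator):
    """Return the indicators whose q value is at or above the 75th percentile."""
    values = list(q_by_indicator.values())
    if len(values) < 2:
        raise ValueError("at least two indicators are needed to form quartiles")
    # quantiles(..., n=4) returns the three quartile cut points; take the third.
    cutoff = statistics.quantiles(values, n=4, method="inclusive")[2]
    return [name for name, q in q_by_indicator.items() if q >= cutoff]


# Hypothetical q-statistics for eight indicators (illustration only).
toy_q = {
    "ind_a": 0.10, "ind_b": 0.20, "ind_c": 0.30, "ind_d": 0.40,
    "ind_e": 0.50, "ind_f": 0.60, "ind_g": 0.70, "ind_h": 0.80,
}
selected = fourth_quartile_indicators(toy_q)
```

With eight evenly spaced values, the cutoff falls between the sixth and seventh values, so only the two indicators with the largest q values are retained.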
Step-3 Our integrated analysis of surveys is based on multivariate spatial statistics, as explained in Section "Methodology". We first studied the outcome indicators which are common to both surveys, considered in Step-1, and then we jointly studied the indicators and contextual factors identified in Step-2.
The global Moran's I statistics for spatial autocorrelation between common indicators in both surveys for the undernutrition conditions, namely, stunting, underweight, wasting, and anemia, are shown along with the bivariate LISA cluster maps and the p-values in Fig. 6. The global Moran's I values for these indicators in CNNS vs NFHS-4 are 0.2966, 0.432, 0.284, and 0.450, for stunting, underweight, wasting, and anemia, respectively. For N = 30 (states) in India, we get the threshold of spatial autocorrelation, −1/(N−1) = −0.034. Thus, these four measures are significantly higher than the threshold, and are also statistically significant, given p < 0.05 (Fig. 6). Hence, we see here that there is high positive spatial autocorrelation between the two surveys, which is statistically significant and which is applicable for all four undernutrition conditions. This indicates that the outcome indicators seen in both surveys have similar statistical distributions.
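The threshold arithmetic above can be checked with a few lines of Python. The function name `moran_threshold` is hypothetical; the four I values are those reported in the text. Note that the comparison below only checks the sign of the difference from the threshold, and statistical significance is still judged from the p-values in Fig. 6.

```python
def moran_threshold(n):
    """Expected Moran's I under no spatial autocorrelation, -1 / (N - 1)."""
    if n < 2:
        raise ValueError("at least two observations are required")
    return -1.0 / (n - 1)


# Global Moran's I values for CNNS vs NFHS-4, as reported in the text.
reported_i = {
    "stunting": 0.2966,
    "underweight": 0.432,
    "wasting": 0.284,
    "anemia": 0.450,
}
threshold = moran_threshold(30)  # N = 30 states
above_threshold = {name: value > threshold for name, value in reported_i.items()}
```

For N = 30 the threshold is −1/29, which is the −0.034 quoted above to three decimal places, and all four reported values lie above it.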
The hotspot and coldspot regions (Fig. 6) are those with a high and low prevalence of undernutrition conditions, respectively. Thus, both surveys indicate that the states of Madhya Pradesh and Chhattisgarh have a high prevalence of all conditions except anemia, while Maharashtra and Gujarat have a high prevalence of underweight and wasting, and Bihar and Jharkhand of stunting and underweight. We also observe relatively fewer high-low and low-high regions and coldspots, indicating weaker agreement of indicators from CNNS with the values in neighboring regions from the NFHS-4 survey when both values are not high. Punjab is the only hotspot in the case of anemia, which is also confirmed by the pie-chart glyph comparison (Fig. 3). Assam is the only coldspot for wasting, and Arunachal Pradesh for anemia, confirmed by Figs. 2 and 3, respectively.
Fig. 6 Global Moran's I statistics and bivariate LISA cluster maps of India, from our previous work [14], showing the local clustering (hotspots & coldspots) at macroscale (i.e., the state-level), of outcome indicators in CNNS vs those in NFHS-4 surveys for undernutrition indicators in U5 children for a stunting, b underweight, c wasting, and d anemia
Table 3 presents the values of bivariate Moran's I statistics selected based on the micronutrient deficiency indicators from CNNS vs the contextual factors from NFHS-4. Here, we observe that statistically significant spatial autocorrelations occur for Serum Ferritin, Vitamin B12, and Vitamin D, for which we have already observed high spatial stratified heterogeneity (Table 2). We have grouped the contextual factors (Table 3) as household, woman (maternal), and child characteristics to understand the influence of specific agents in the U5 child malnutrition problem.
Unimproved sanitation is an important factor of stunting [29]. This indicates that the hotspots of micronutrient deficiency vs unimproved sanitation are potential regions for the co-occurrence of both stunting and micronutrient deficiency. Parental education is yet another important factor associated with the prevalence of anemia and stunting [18]. Thus, we conclude that the hotspots are regions with a high risk of co-occurrence of micronutrient deficiencies and stunting or anemia. Hence, we have focused on lack of proper household sanitation and maternal illiteracy as contextual factors in our previous work [14]. Here, we consider more contextual factors, as explained in Section "Methodology" and reported in Table 3.
In Fig. 7, the high-low and low-high clusters occur infrequently, and do not provide any conclusive inferences. Disregarding them, we focus on coldspots next. Coldspots are also few, and we observe that Assam is a coldspot in three different cases, namely, Vitamin B12 deficiency vs low maternal BMI, low serum ferritin vs inadequate child diet, and Vitamin B12 deficiency vs inadequate child diet (Fig. 7d, f, g). Thus, Assam can be perceived as a low-risk region based on the joint analysis of the two national surveys.
Finally, we identify the following hotspots in the joint spatial autocorrelation analysis from both the surveys:
We also observe that a large contiguous region along the Central-East-West mid-region of the country is affected by diet-related contextual factors, be it for women or children. A similar region is a hotspot for underweight and wasting conditions. When visualizing stunting along with unimproved sanitation and women's illiteracy as contextual factors, we observe hotspots with a high amount of overlap. This confirms the results from previous studies [18,29]. Also, unimproved sanitation and women's illiteracy could be considered together as socio-economic factors. Stunting and Vitamin B12 deficiency also show a high overlap.
In the case study, one of the strongest observations is that the South and North East regions do not show any significant patterns, which can be attributed to either a sampling bias or better public health outcome indicators.

Discussion
Our observations reported in Section "Results" confirm the credibility and reliability of each of the selected surveys as cross-sectional studies. These surveys, apart from being comprehensive for their goals in their individual capacities, are also viable for a joint analysis. We also find that our novel uses of different visualizations, i.e., choropleth maps for ratios and Hellinger distances, circular radar plots, and bivariate LISA cluster maps, confirm the measures computed from the data, i.e., q-statistics and global Moran's I. Thus, our qualitative and quantitative analyses agree. Since such an advanced analysis of survey-based studies is rare and is predominantly done on isolated case studies, we do not have any state-of-the-art method comparisons to validate our entire workflow. Moreover, this work falls under the category of exploratory investigation, as the skeletal workflow is fleshed out by the data itself. Since our conclusions confirm the findings in existing literature of NFHS-4 data analysis [17,18], we conclude that our proposed method is effective in identifying spatial patterns in a joint analysis of multiple surveys, thus integrating them.
Preliminary Analysis of NFHS-5
The fifth edition of the NFHS, i.e., NFHS-5, was conducted during 2019-2020, and Phase-I of the compendium was published in 2021. The open-access data are now available in the form of fact sheets for 22 states and union territories.
Since this edition of NFHS has been implemented soon after the CNNS, we expect overlaps just as we have observed between NFHS-4 and CNNS. Hence, we conduct a preliminary analysis of the data that is available from NFHS-5. Unlike the microscale scope of NFHS-4 data, NFHS-5 data are currently available only at the macroscale level. The microscale (i.e., raw/household) data of NFHS-5 and its mesoscale (i.e., district-level) analysis are expected to be published as open access. The availability of the microscale or mesoscale data will facilitate comparative analysis of NFHS-4 and NFHS-5 at the finer scales of the data. Such an analysis would provide insights into the efficiency of the healthcare programs initiated and promoted by the government. Here, in its absence, the available macroscale data has been used in our workflow, even though the analysis is incomplete owing to the insufficient data. Using the macroscale data in NFHS-5, we conduct the preliminary comparisons between NFHS-5 and CNNS in Step-1 of our workflow. We visualize the percentages of U5 children suffering from undernutrition conditions, namely, stunting, underweight, and wasting, using choropleth maps (Fig. 8). We then compute the Hellinger distance for just wasting, for the severity levels [non-severe, severe, absence], between CNNS and NFHS-5 (Fig. 9). We cannot proceed further in our workflow, as computing heterogeneity requires data of all neighbors of the states. The choropleth maps for the Hellinger distance (Fig. 9) show that there is a 100% increase in the range of the distance, indicating that there exist higher dissimilarities between the data distributions of the surveys. There are higher distances for the rural regions, and consequently, in the total region. Given that the survey was conducted in 2019-2020, during the COVID-19 pandemic, we expect that larger socio-economic changes would have caused unpredictable behavior in the data.
We observe that there are several similarities between NFHS-4 (Fig. 1C) and NFHS-5 (Fig. 8). At the same time, there are noticeable differences between the outcomes of the two surveys. As an improvement to U5 child malnutrition, we observe that stunting has considerably reduced in the states of Gujarat and Karnataka. Also, as a setback, we observe that wasting has increased in the North region, namely, Jammu and Kashmir, and Haryana, from NFHS-4 to NFHS-5. Just as we have observed in our case study, there are salient differences between CNNS and NFHS-5 (Fig. 8).
Overall, we demonstrate that, in addition to the survey integration, several of the methods we have proposed can be repurposed for novel applications, e.g., change detection across different editions of a survey which can be viewed as time-series data, e.g., NFHS-4 and NFHS-5.
Survey Integration and Multi-Source Data Fusion Survey integration as discussed in our work is a data science algorithm and can be generalized for similar applications of multi-source data fusion. The cross-sectional surveys spanning the same geographical region and similar time-periods are multi-source data [30,31]. Different from multi-modal data (i.e., collection of data acquired using multiple modalities), the effectiveness of the multi-source dataset is based on its degree of heterogeneity [30]. High degree of heterogeneity implies latent relations between the variables in the data that can lead to new insights [30]. In our work, we exploit the inherent homogeneity of the data from different sources to establish the feasibility of the integration (Step-1), and further use the parts of the dataset with high heterogeneity for the final aggregation (Step-3).
Integrating the survey data is an example of data fusion, which can be classified as early, intermediate, and late, based on the time of occurrence in the data science workflow [32]. They correspond to raw data-level, feature-level, and decision-level fusion, respectively [31,32]. The existing literature considers these concepts for multi-modal data and for deep learning workflows, but at the same time, these are directly applicable to multi-source data and to other data science workflows. In our specific work, our macroscale analysis is an example of late fusion. With the availability of raw data, we can shift the analysis upstream to early or intermediate fusion. Late fusion predominantly uses rule-based approaches and has the advantage of fusing decisions without assuming any data priors, but is still the weakest of all fusion strategies [32].
Data combination, integration, and aggregation of multisource data are different from each other conceptually [31]. Data combination, considered as level-1, involves simple combination of data. Data integration, i.e., level-2, is achieved by co-existence of data sources. Data aggregation, considered as level-3, involves creation of new data forms from the input datasets. In our work, Hellinger distance and spatial autocorrelation computations and juxtaposed visualizations are data combination strategies.
Building multi-source data fusion tools for population survey data requires abstraction of the datasets and applicability of fusion strategies. The suitable strategy is largely decided by the availability of the data with respect to different scales (microscale, mesoscale, macroscale) and its semantics. Another pressing requirement in this line of work is the identification of sets of survey data which can be integrated semantically, since, as shown in our work through Step-1, not all datasets can be integrated. Automating the identification of such datasets involves semantic analysis of the questionnaires and the metadata of such surveys. Thus, developing generalized abstractions of the "survey integration" problem statement is in the future scope of this work.
In our current work, our proof-of-concept is limited to one-to-one mapping of variables across two surveys. However, this can be expanded to many-to-one, one-to-many, and many-to-many mapping in future, leading to creating new literature on such joint analysis and addressing the gap in existing literature.
Computational Tools in Social Sciences Our work demonstrates the design of a computational tool for a social science problem, that falls in the category of digital humanities tools [33]. Thus, our work can be viewed through the lens of a "research question in search of a tool." This problem statement is to be solved using algorithmic thinking by the experts in social sciences and humanities [34,35]. An alternative view of the interdisciplinary work in computer science and liberal arts is of the tailored use of algorithms, data representations, and tools by computer scientists in applications from liberal arts [35]. Our work pertains to the latter view of problem solving in an interdisciplinary setting. Thus, an organic step forward is in automating our workflow for any two given population surveys. This provides the future scope of this work in using deep learning methods on both the questionnaire and the survey response data to recommend appropriate data fusion.

Conclusions
We propose a method for the integration of national health surveys at the data level, and at macroscale, as available in open access. The goal of the workflow is to determine the high- and low-risk regions of co-occurrence of certain public health outcome indicators and their contextual factors, as available across two different surveys. We have demonstrated our workflow in a case study of malnutrition conditions in U5 children in India. The analysis is done for undernutrition conditions at the macroscale (i.e., the state level), resolving the difference in the spatial scales of the data openly available for both surveys. Our results of hotspots and coldspots using the indicators for micronutrient deficiencies from CNNS and contextual factors (with household, woman, and child characteristics) from NFHS-4 show the effectiveness of our work. We have also shown that the indicators which are commonly available for both the surveys also reveal hotspots and coldspots, where CNNS in 2016-2018 reinforces the findings of NFHS-4 in 2015-2016. Our systematic integration of the surveys uses a three-step workflow involving a feasibility check, variable identification, and the integration using spatial statistics. Further, our spatial clustering results also show the high-risk and low-risk regions identified across the surveys for indicators common in both. Our work has future scope of generalization across any two large-scale population surveys, using a formal abstraction. Extending our previous work [14], we have improved on Step-2 and Step-3 of our workflow by strengthening the validation of our results using quantitative methods.
In summary, we show a proof-of-concept of integrating existing large-scale population surveys, benefiting the stakeholders. Such integrated findings may have been otherwise siloed within the surveys, but are now found to be significant when observed together. The goal of our work is to demonstrate evidence of such significant integrated results to improve the adoption of survey integration. The responsibility of data collection is split strategically between national and local population health surveys for economic reasons. Planning joint outcomes across different surveys and mining data jointly from multiple surveys can continue to give deeper insights together while preserving the entire autonomy of each survey.