1 Background

The Communiqué of the Seventh National Population Census was recently released, with certain results deviating from those of previous annual sample surveys. Such deviations have aroused widespread public concern. Errors generated in surveys can generally be divided into two types: sampling errors and non-sampling errors, with the former mainly resulting from the sampling design. A substantial body of previous research has provided a solid theoretical basis for the control and measurement of sampling errors (Cochran, 1977; Groves et al., 2009; Kish, 1995). Non-sampling errors occur at all stages of a survey, and the extent and characteristics of these errors are influenced by an array of factors, such as the content, modes, and methods of the survey. In particular, when the quantitative data used for measurement are difficult to obtain, the research undertaken, especially systematic empirical research, may be compromised. Scholars have reviewed existing sample surveys undertaken by government bodies and found that increasingly refined designs and enriched methods for statistical inference have brought sampling errors in sample surveys under control. Nonetheless, non-sampling errors in these surveys remain a critical source of errors that we cannot afford to ignore (Jin & Dai, 2012). This fact highlights the complexity of the surveys and the diverse causes of errors.

Scholars have made extensive theoretical explorations focused on how to control non-sampling errors (Jin, 1996; Kish, 1995; Lessler & Kalsbeek, 1992), dividing the sources of non-sampling errors incurred during data collection and processing into sampling frame errors, non-response errors and other non-sampling errors. In order to simplify the classification, non-sampling errors other than sampling frame errors and non-response errors are collectively referred to as measurement errors.

Sampling frame errors result mainly from the loss of target units, the inclusion of non-target units, composite connections, incorrect auxiliary information, and outdated sampling frames. Experience has shown that in sample surveys carried out by government bodies, the degree of loss of target units differs from stage to stage. Units at or above the county level are not easily lost, while some special units (such as military bases and prisons) are often not included in surveys because they are inaccessible or cannot be accessed due to security concerns. Units below the county level are more likely to be lost due to a lack of relevant information or the inaccessibility of data. For example, in surveys carried out in places of residence, it is often impossible to obtain information about students living on campus. With respect to units lost in the sampling frame, some experts have attempted to remedy this defect by using multiple sampling frames (He, 2019). However, actual sampling in such cases may use a multiplicity of calculation methods, producing extremely complicated calculations and making it difficult to proceed. The inclusion of non-target units is often linked to an outdated sampling frame. Although certain administrative districts have been redrawn or relocated, their original names remain in the sampling frame, thereby resulting in the inclusion of non-target units. Composite connections occur mostly during the stage of household sampling. When a family owns multiple residences or multiple families share the same residence, the problem of composite connections may arise. Auxiliary information can help improve sampling efficiency, but incorrect auxiliary information can be counterproductive. For example, the urban–rural variable is generally applied at the village level, yet some survey designs apply it at the county level, which leads to counties being treated as rural areas and cities or districts as urban areas, thereby compromising sampling efficiency.
Outdated sampling frames are the most common problem in data collection. For example, if a survey is carried out in 2020 but uses data from the 2010 census as the sampling frame, it will undoubtedly include noticeable deviations.

Broadly defined, non-response errors include group unit non-responses that occur in the early stage of sampling, individual unit non-responses that occur in the final stage of sampling, and item non-responses that occur in the midst of completing the questionnaire. Group unit non-responses occur when units selected in the early stage of sampling cannot be surveyed, for example because the sampling frame has not taken into account facts such as demolitions, zoning adjustments, or natural disasters, or because the survey has not been correctly implemented at the grassroots level. In the final stage of sampling, individual unit non-responses might occur when respondents cannot be found at their addresses, respondents cannot be reached during the survey, respondents are not part of the target population, or respondents refuse to answer questions. Even if a respondent is successfully interviewed, item non-responses can occur if the respondent does not answer some of the questions in the questionnaire. Narrowly defined, non-response errors refer to errors caused by the non-responses of individual units. Research on individual unit non-response focuses mainly on how to reduce the probability of non-response (Couper & Ofstedal, 2009; Qi, 2016), or examines the adoption of a model to deal with non-response (Bethlehem, 2009). However, little research has considered the replacement methods commonly used today by government agencies that implement current surveys.

The sources of measurement errors are relatively complex and the relevant research is not systematic enough (Feng, 2007). During the survey cycle, problems with questionnaire design that occur early in the cycle may cause errors. Designers generally devise indicators based on a certain logical framework and then use these indicators as the basis for designing questions. If the design of questions is unreasonable, design errors might occur. During the sampling design stage, design errors can occur if the sampled population is inconsistent with the target population. During the survey training stage, if a trainer fails to provide comprehensive, systematic training to surveyors in compliance with the requirements of the training design, or if surveyors fail to understand or even misunderstand the meaning of the questions while receiving their training, survey training errors may occur. During the field survey stage, if a surveyor deliberately misleads a respondent or fills in the wrong answers, or if a respondent deliberately conceals the truth or has a memory bias, interview errors might occur. During the data entry and cleaning stage, if the data deviate from facts due to mistakes made by data entry or cleaning personnel, entry or cleaning errors might occur. Measurement errors rooted in complex causes test the ability of survey designers and surveyors to control the errors. Some scholars believe that the use of existing administrative records as auxiliary information for surveys can help improve the accuracy of estimates (Andreea et al., 2021). The use of CAPI systems in surveys can effectively reduce measurement errors and improve the quality of data (Wang, 2011). Together, the application of administrative records and CAPI systems has minimized the impact of measurement errors on survey design indicators.

Strengthening the study of non-sampling errors can help shed light on inconsistencies in the results of sampling surveys. In some cases, surveys that considered the same or similar subject matter, had basically identical target populations, survey scopes and time ranges, used comparable sampling methods, and had sample sizes that were sufficiently large have, nonetheless, produced different results. This phenomenon may be linked to non-sampling errors. Recently, research on fertility has not only become the epicenter of much government decision-making, but also a key focus of academia. While the current fertility rate in China is a hotly debated topic among experts (Chen & Duan, 2019; Guo, 2017; Wang & Wang, 2019), the deductions and calculations of these experts are based on the premise that the existing data are basically reliable. They have not, for the most part, explored the possibility of deviations in the data and the potential impact such deviations might have. This study uses a national fertility sampling survey as an example and, with a focus on relevant practices in survey design and implementation, draws on big data resources such as administrative records to examine in depth the sources of non-sampling errors, the ways to avoid such errors, and the effects of error control. Moreover, before drawing its conclusions, the study also discusses the possible impact such deviations can have on the structural indicators, core indicators and sub-population indicators employed by surveys.

2 Sampling design and non-sampling error control in fertility sampling surveys

2.1 Sampling design of the 2017 national fertility sampling survey

The 2017 National Fertility Survey (the Survey) was a continuous cross-sectional survey conducted by the former National Health and Family Planning Commission, marking the seventh nationwide sampling survey undertaken since 1982 (He et al., 2019). The target population of the Survey encompassed female citizens of the People’s Republic of China aged 15–60 who lived in the 31 provinces (autonomous regions or municipalities, excluding Hong Kong, Macau, and Taiwan) as of 0:00 AM on July 1, 2017. The Survey focused on the fertility status of women across China and in the various provinces (autonomous regions or municipalities), the current fertility intentions among women of childbearing age, and the status of fertility decision-making and contraceptive use among couples during the 10 years prior to the Survey. It endeavored to ensure that the overall population and sub-populations were representative in terms of key indicators. During sampling design, the samples from the 31 provinces (autonomous regions and municipalities) were treated as independent sub-populations, each of which was stratified into registered population and inflow population according to household registration (hukou) status. Three-stage probability proportional to size (PPS) sampling was applied to each layer to infer the overall fertility level of China.

In the first stage, counties (cities, districts) were sorted by administrative division codes and townships (towns, sub-districts) were sorted by type at each layer, and townships (towns, sub-districts) were sampled using systematic PPS sampling proportional to the size of the registered (or inflow) population of the township (town, sub-district). A total of 6250 primary sampling units were selected across China, with Fig. 1 presenting the distribution of the numbers of primary sampling units in each province (autonomous region or municipality). In the second stage, the primary sampling units were sorted by the type of residence and the type of village (neighborhood) committee, and systematic PPS sampling proportional to the size of the registered (or inflow) population living within the village (neighborhood) was then used to select two village (neighborhood) committees. It should be noted that high schools, colleges, and universities within the jurisdiction of the townships (towns, sub-districts) selected in the first stage were included as independent village-level units in the list of the sampling frames for the second stage. In the third stage, females aged 15–60 were sorted by marital status and age and eventually 20 females were selected through systematic PPS sampling. The Survey employed different survey methods for different types of sample points. For schools, the Survey used online questionnaires to collect information remotely; for other sample points, a face-to-face CAPI system was used to collect information.
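The systematic PPS selection used at each stage can be sketched as follows. This is a minimal illustration with made-up unit names and sizes, not the Survey's production code: the units are laid out on a cumulative size scale, and selection points spaced one sampling interval apart are drawn from a random start.

```python
import random

def systematic_pps(units, sizes, n, seed=2017):
    """Select n units with probability proportional to size via systematic PPS.

    A unit whose size exceeds the sampling interval can be selected more
    than once, as in standard systematic PPS.
    """
    total = sum(sizes)
    step = total / n                      # sampling interval
    rng = random.Random(seed)
    start = rng.uniform(0, step)          # random start in the first interval
    points = [start + i * step for i in range(n)]
    # walk the cumulative size scale and pick the unit covering each point
    chosen, cum, j = [], 0.0, 0
    for name, size in zip(units, sizes):
        cum += size
        while j < n and points[j] <= cum:
            chosen.append(name)
            j += 1
    return chosen
```

In the Survey's design the units would first be sorted (by administrative division code, type, etc.), which turns systematic PPS into an implicitly stratified selection.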

Fig. 1 Distribution of the numbers of primary sampling units in each province in the 2017 National Fertility Survey. Source: Technical Document for the National Fertility Survey (National Health and Family Planning Commission of China, 2017)

2.2 Control of non-sampling errors in the 2017 National Fertility Survey

There are differences between surveys undertaken by government agencies and those conducted by research institutes or commercial entities. Backed by robust financial and institutional support, government agencies can procure advanced technologies and services to ensure the quality of their surveys. By using grassroots public servants as surveyors, government agencies can skip the recruitment process and save money. Grassroots surveyors can also help to optimize survey design by, among other things, reporting problems completing specific parts of the work or requesting replacements for certain sample points. The government can avail itself of rich administrative records that serve as auxiliary information to significantly enhance the efficiency of sampling. Respondents are also more likely to participate in surveys conducted by the government.

2.2.1 Control of sampling frame errors

Fertility surveys have shown that marriage has a significant impact on fertility, while age has a greater impact on indicators such as fertility desire and contraceptive use (Zhuang et al., 2019). The sampling design of surveys should take account of these facts. Presently, massive population migration is a fundamental reality in China, with a steady stream of unmarried young people flowing from rural areas into urban areas, and migrating from the central and western regions of China to the eastern region. Against this backdrop, sampling frames based on the household registration (hukou) status of the sample population cannot meet the needs of survey design in the new era. Although sampling frames based on the permanent resident population are representative of the actual living status of the population, it should be noted that the distribution of inflow populations is uneven. If this fact is not given consideration during the sampling design process, the actual sample of inflow population obtained will not shed light on their status in the population of the locale where they currently reside.

In order to include as many unmarried young women as possible, the Survey used dual sampling frames. Relying on the total population database and the total migrant population database reported directly by each province, Survey designers used townships and sub-districts as the primary units in each province to report the current numbers of permanent and non-permanent female residents aged 15–60. Due to the lack of appropriate sampling verification data for these two types of data, the numbers of registered population and inflow population for each province were used as verification data.

Upon completion of data collection for the first-stage sampling frame, the data for registered population and inflow population were checked separately. The data for registered population were checked using the sampling frames reported by the respective provinces in 2011, summary data of the Total Population Database for 2014, and the China Population and Employment Statistics Yearbook 2015. Given the accuracy and authority of these three data sources, experts recommended that the sources be assigned weights of 40%, 20% and 40%, respectively. In view of the time factor, the rate of natural increase was assumed to be 5.2‰ to generate verification data for the data reported by each province in 2017. If deviations exceeded 5%, the provincial reporting unit was asked to provide an explanation or revision. The data for national and provincial inflow populations were checked using the Tabulation on Total Migrant Population for H2 2016 published by the former National Health and Family Planning Commission. Because migration is ongoing, inflow populations change frequently and the collection of data for this group is time consuming. As a result, data comparison for this population was relatively loose compared to that for the registered population, with the alert threshold for deviation set to 15%.
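The verification arithmetic described above can be illustrated with a short sketch. The exact order of projection and averaging used by the Survey designers is not documented here, so this sketch assumes each source is first projected to 2017 at the 5.2‰ natural-increase rate and then combined with the 40/20/40 weights; the function and parameter names are illustrative.

```python
def projected_benchmark(frame_2011, tpd_2014, yearbook_2015, rate=0.0052):
    """Weighted 40/20/40 benchmark for 2017 reported figures.

    Each source is projected forward to 2017 assuming geometric growth
    at the natural increase rate (5.2 per mille).
    """
    p2011 = frame_2011 * (1 + rate) ** 6      # 2011 -> 2017: 6 years
    p2014 = tpd_2014 * (1 + rate) ** 3        # 2014 -> 2017: 3 years
    p2015 = yearbook_2015 * (1 + rate) ** 2   # 2015 -> 2017: 2 years
    return 0.4 * p2011 + 0.2 * p2014 + 0.4 * p2015

def flag_deviation(reported, benchmark, threshold=0.05):
    """True if the reported figure deviates from the benchmark by more
    than the alert threshold (5% for registered, 15% for inflow)."""
    return abs(reported - benchmark) / benchmark > threshold
```

A flagged province would then be asked for an explanation or a revised figure, as the text describes.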

During data collection for the second-stage sampling frame, units selected in the first stage were checked for continued existence, for their capacity to implement the survey, for the impact that zoning changes, demolitions, or other externalities may have had on their situations, and for significant changes in the size of the target population. In the event a township (town, sub-district) no longer existed, had been assigned to another jurisdiction, had undergone demolitions and construction on a large scale that significantly changed the size of the target population, or was incapable of implementing the survey for some other reason, the same rules used to select primary sampling units were used to select a replacement unit as near as possible to the original area. In distributing the second-stage sampling results, the Survey collected feedback from localities on the sampling results, and replaced sample points where the survey could not be implemented due to large-scale demolitions, insufficient human resources at the grassroots level, or population outflow. The difference between the actual size of the population and the size reported in the second stage was recorded in order to inform the subsequent weighting.

2.2.2 Control of non-response errors

The Survey used two ways to control non-response errors. The first way was to minimize the occurrence of non-response errors at the source. In preparing the name list for the sampling frame in the final stage, efforts were made to verify the accuracy of information about the number of residents and their names, ages, marital status, and addresses. This effort minimized the occurrence of non-response errors due to the target respondent having moved away or not meeting requirements for Survey participation. The second way was to exert “targeted” control. Non-responses can be divided into random non-responses and non-random non-responses, of which non-random non-responses cause the most serious non-sampling errors. In order to minimize non-sampling errors resulting from non-random non-responses, the Survey employed homogeneous substitution to control non-response errors. Strict substitution criteria were set in the sampling module for the CAPI system and associated with the final-stage sampling database. When a non-response occurred, the age and marital status of the non-responding respondent were taken as the “target”, and another respondent closest in age and marital status to the target was selected from the final-stage sampling frame database to replace the non-responding respondent. For non-responding respondents who could not be replaced using the preset algorithm, the sampling design team would manually determine candidates for substitution. Such targeted control minimizes the serious non-sampling errors caused by surveyors randomly replacing non-responding respondents with people of different ages or marital status. Due to the different method used to survey school-type sample points, homogeneous substitution was not applied to these sample points. Instead, the non-responses were treated as naturally missing responses.
Given the strong homogeneity among respondents selected from school-type sample points—they were relatively young and about the same age, mostly unmarried, and childless—such non-responses were handled through linear weighting.
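The homogeneous substitution rule described above can be sketched as a nearest-match lookup. The field names are hypothetical, and the CAPI module's actual criteria and tie-breaking were more elaborate; the essential logic is to hold marital status fixed and minimize the age gap.

```python
def pick_replacement(target, pool):
    """Choose the candidate closest to the non-responding respondent.

    target and pool entries are dicts with "age" and "marital" keys
    (hypothetical field names). Candidates with the same marital status
    are preferred; among them, the one with the smallest age gap wins.
    """
    same_status = [c for c in pool if c["marital"] == target["marital"]]
    candidates = same_status or pool  # fall back to the whole pool if no status match
    return min(candidates, key=lambda c: abs(c["age"] - target["age"]))
```

In the Survey, cases where this automatic rule found no acceptable candidate were escalated to the sampling design team for manual substitution.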

2.2.3 Control of measurement errors

Measurement errors come from a wide array of sources and can occur in multiple stages before, during, and after a survey. Error control is therefore carried out in each of these stages. In the pre-Survey stage, numerous expert seminars were held and multiple pre-Survey trials were performed across China to test whether the questionnaire design was scientifically sound. In selecting surveyors, priority was given to middle-aged women with grassroots work experience. Trainees were tested and only those with scores above 90 points were qualified to conduct the Survey. Because the role of supervisors was key to the Survey, the same testing requirement was also applicable to supervisors. The Survey was carried out using a CAPI system that was designed with a logic analyzer, alerts, and prompts to prevent surveyors from recording answers incorrectly. Respondents' answers to key questions were recorded, and dedicated reviewers were hired to check that the completed questionnaires matched the voice recordings; questionnaires with inconsistencies were immediately invalidated. A hotline was established and a QQ chat group was created in order to promptly respond to questions raised by surveyors from all parts of the country. The supervisory team went to 16 provinces (autonomous regions and municipalities) to oversee Survey activities and to resolve problems that arose. After the Survey, 64 primary units (townships and sub-districts) were selected randomly to carry out a follow-up door-to-door survey. Administrative records databases were utilized to verify the identity and fertility status of Survey respondents, and localities were urged to verify and confirm the results when inconsistencies were found. Survey data weighted with the design weight, the design-weight adjustment factor from the sampling process, and the non-response weight were compared with data from the 2015 1% National Population Sample Survey conducted by the National Bureau of Statistics. A post-stratification weight was then constructed and applied to the Survey data based on the results of this comparison. The post-stratification weight takes into account multiple factors such as the province where the respondent lived, age and marital status, and the proportion of non-agricultural population in the survey locale.
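A post-stratification adjustment of this kind can be sketched as a per-cell factor equal to the population share divided by the sample share. In the Survey the cells combined province, age, marital status, and the proportion of non-agricultural population; this illustration collapses them into a single hypothetical cell label.

```python
from collections import Counter

def post_strat_weights(sample_cells, pop_shares):
    """Per-cell post-stratification factor = population share / sample share.

    sample_cells: list of cell labels, one per respondent.
    pop_shares:   known population share of each cell (sums to 1).
    """
    n = len(sample_cells)
    sample_share = {cell: cnt / n for cell, cnt in Counter(sample_cells).items()}
    return {cell: pop_shares[cell] / sample_share[cell] for cell in sample_share}
```

Multiplying each respondent's existing weight (design weight times non-response adjustment) by the factor of her cell pulls the weighted sample structure toward the benchmark structure, here the 2015 1% National Population Sample Survey.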

3 Characteristics of non-sampling errors in the National Fertility Survey

3.1 Characteristics of sampling frame errors

The sampling design of the Survey made provincial health commissions responsible for reporting the sampling frames at all levels. According to the results of the consolidation of data from primary sampling frames, the registered population was 1384.32 million, a deviation of − 0.41% from the 1390.08 million (as of year’s end) reported in the 2017 Statistical Bulletin of the National Bureau of Statistics; the migrant population totaled 141.55 million, a deviation of 1.56% from the 139.38 million shown in the tabulation of total migrant population released by the former National Health and Family Planning Commission at the end of 2016. According to the China Statistical Yearbook 2018, as of the end of 2017, China had a total of 39,888 townships (towns, sub-districts), while the number reported in the primary sampling frames for the Survey was 43,738, a deviation of 8.8%. The additional units reported in the primary sampling frames resulted from the inclusion of regimental-level army units, agricultural and forest reclamation units established in Northeast China, and development zones or industrial parks set up in a number of provinces. The units selected in the first stage were checked after the second-stage sampling frames had been reported. Among the 6250 units selected in the first stage, 44 units underwent changes, accounting for 0.7% of all units. These changes could be divided into zoning adjustments (13 units), demolitions (12 units), substantial reduction of target population in the jurisdiction (9 units), inadequate capability to implement the survey (7 units), and non-existence of the unit (3 units). Compared with the sampling frames reported in the first stage, the variance in the number of respondents reported in the second stage after data consolidation was − 7.0%, which broke down to a − 7.5% variance in the registered population and a − 3.3% variance in the migrant population.

The sampling frames reported in the first stage mainly collected data from established townships, towns, and sub-districts, while the sampling frames reported in the second stage collected auxiliary information about school units and residence types in the jurisdiction. The additional information enhanced sampling efficiency, but also made the collection of information more difficult, thereby leading to sampling frame errors such as the loss of auxiliary information and the loss of target units. Among the 12,500 units selected in the second stage, 324 units underwent changes, accounting for 2.6% of all units. These changes can be divided into demolitions (124 units), substantial reduction of target population in the jurisdiction (107 units), incorrect auxiliary information (43 units), inadequate capability to implement the Survey (34 units), zoning adjustments (8 units), and non-existence of the unit (8 units). After the second-stage sampling results had been distributed, all provinces successively verified the number of respondents within the units and gave feedback. After data consolidation, the variance stood at − 2.7%, which broke down to a − 2.3% variance in the registered population and a − 4.4% variance in the migrant population. As for target units lost in the second stage, the lack of reliable verification data made accurate estimates impossible, so only information available at the time could be used to make rough estimates. School-type units are not primary administrative units. Because the population information for schools is maintained mainly by local education authorities, it was not easy for grassroots surveyors outside of the education system to obtain detailed sampling frame data. As a result, it is possible some of this information was lost during the Survey.
The province-specific number of students per capita was obtained by comparing the province-specific numbers of junior college and above students as shown in the China Education Statistical Yearbook 2017 with the province-specific numbers of permanent population as shown in the China Statistical Yearbook 2018. Next, the original data of the second-stage sampling frames of the China Fertility Survey were utilized to obtain the province-specific proportions of respondents from school-type sampling units to total respondents in both the registered and inflow populations. In theory, there should be a high correlation between these two proportions. Correlation analysis revealed a Spearman's rank correlation coefficient of 0.426 (significance level 0.017) and a Kendall's rank correlation coefficient of 0.332 (significance level 0.009). Although there was a certain degree of correlation between the two, the degree of correlation was not high, suggesting that there may have been a loss of school-type units among the second-stage sampling units.
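The rank-correlation check above can be reproduced with any statistics package; for completeness, a self-contained Spearman implementation (tie-aware average ranks, then Pearson correlation on the ranks) looks like this. The inputs would be the two province-level proportion series described in the text.

```python
def ranks(xs):
    """Return average ranks (1-based), assigning tied values their mean rank."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1                        # extend the tie block
        avg = (i + j) / 2 + 1             # mean of ranks i+1 .. j+1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx) * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den
```

A coefficient near 1 would indicate the expected tight agreement between the yearbook-based and frame-based student proportions; the observed 0.426 falls well short of that.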

3.2 Characteristics of non-response errors

With a design sample size of 250,000, the Survey mainly employed homogeneous replacements to address non-responses. Non-responses among the 246,840 respondents surveyed through face-to-face interviews were handled through homogeneous replacement, while non-responses among the 3160 student respondents surveyed online were treated as missing responses. A total of 270,859 respondents were interviewed during the Survey, with 16,123 respondents in the design sample having non-responses; there was a total of 20,859 replacements. The net non-response rate was 5.95%, and the total non-response rate was 7.72%. Among respondents handled through homogeneous replacements, the net non-response rate was 6.00% and the total non-response rate was 7.79%. Among student respondents treated as missing responses, the net non-response rate was 1.71%. Respondents were treated as non-respondents for a number of reasons: they could not be located, could not be contacted, refused to be interviewed, related auxiliary information was incorrect, or responses were invalidated upon review. The proportions for these reasons were 1.66%, 2.04%, 1.96%, 1.94%, and 0.12%, respectively.

Among the 20,859 replacement respondents, 4315 were unmarried (20.7%), 16,022 had a spouse (76.8%), and 522 were divorced or widowed (2.5%). The difference between these findings and data from the 2015 1% National Population Sample Survey is insignificant, with the corresponding marriage ratios for the earlier survey standing at 20.1%, 76.3%, and 3.6%, respectively. With respect to age, replacements occurred mostly within the 25–34 age group, which was the largest age group in the total Survey population. The breakdown of replacements by age group was basically the same as in the 2015 1% National Population Sample Survey. Comparing the characteristics of replacement respondents with the age composition at the national level basically confirms that the non-responses were random with respect to age.

In order to determine whether missing values caused systematic biases in core target variables, the 20,859 replacement respondents were first sorted by province, migration status, type of residence, urban/rural location, marital status, and age, and then 1% of the group was randomly selected through equidistant sampling to obtain 208 records. Next, these 208 records were compared with their records in the Population Administration Decision Information System (PADIS) of the National Health Commission. A total of 137 records were matched to records in PADIS, a match rate of 65.9%. Most of the 71 respondents for whom no matches were found were migrants and married people. The 137 replacement respondents who were matched had a total of 152 children, an average of 1.1 children per respondent. A search for the 137 interviewed respondents in the PADIS database showed that they had a total of 147 children, an average of 1.1 children per interviewed respondent. This average was consistent with the average number of children born to the replacement respondents. A non-parametric test (Wilcoxon signed rank test) also showed that there was no significant difference between the two sets of samples, suggesting that there was no significant difference between the replacement respondents and the interviewed respondents in the number of children born to them (Fig. 2).
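The non-parametric comparison above can be sketched with a normal-approximation Wilcoxon signed-rank test, treating the two sets of 137 child counts as paired, as the text implies. The child-count arrays in the usage test are hypothetical illustrations, not the PADIS data.

```python
import math

def wilcoxon_signed_rank(x, y):
    """Two-sided Wilcoxon signed-rank test (normal approximation, zeros dropped)."""
    d = [a - b for a, b in zip(x, y) if a != b]   # discard zero differences
    n = len(d)
    if n == 0:
        return 0.0, 1.0
    # rank absolute differences, averaging ties
    order = sorted(range(n), key=lambda i: abs(d[i]))
    r = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    w_plus = sum(r[i] for i in range(n) if d[i] > 0)  # sum of positive ranks
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mu) / sigma
    phi = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return z, 2 * (1 - phi)
```

A p-value well above 0.05, as reported in the text, fails to reject the hypothesis that replacement and interviewed respondents had the same number of children.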

Fig. 2 Comparison of age composition between replacement respondents and their counterparts in the 2015 Census (%). Source: Technical Document for the National Fertility Survey (National Health and Family Planning Commission of China, 2017), Tabulation of the 2015 1% National Population Sample Survey (National Bureau of Statistics of China, 2016)

3.3 Characteristics of measurement errors

In order to assess the measurement errors of major indicators in the Survey, provincial populations were considered to be sub-populations, the 125,000 sample points were sorted by migrant status, type of residence and urban/rural location; and a total of 64 sample points were selected and 1280 respondents were shortlisted for systematic sampling to produce a post-enumeration survey. The post-enumeration survey examined multiple indicators relating to the basic characteristics and the fertility status of the respondents. A comparison of the results for basic characteristic indicators in the post-enumeration survey and the original Survey pointed to inconsistency rates of 3.3%, 1.8%, 11.7% and 6.3%, respectively, for “birth date”, “marital status”, “education level” and “hukou status” (household registration status). These inconsistency rates suggest that measurement errors for key structural indicators such as age and marital status were insignificant, but errors were significant for indicators such as education level and hukou status. Comparisons of fertility status indicators showed inconsistency rates of 0.4%, 2.7%, 16.4%, 19.9% and 22.1%, respectively, for “experience of childbirth”, “number of children”, “birth date of first child”, “birth date of second child” and “birth date of third child and additional children”. The higher inconsistency rates in the birth date of children resulted mainly from respondents' difficulties remembering the exact month of their children's birth. Comparing only the year of birth would reduce the inconsistency rates to 4.9%, 6.8% and 11.5%, respectively. The inconsistency rates were 1.6%, 3.2% and 7.2%, respectively, for "sex of first child", "sex of second child" and "sex of third child". 
These inconsistency rates suggest insignificant measurement errors in core indicators such as the experience of childbirth, the number of children, and the sex of children, but significant measurement errors in indicators such as the birth dates of children.
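The item-level comparison above can be sketched in a few lines of code. This is a minimal illustration with hypothetical toy data, not the Survey's actual processing pipeline: for each indicator, the inconsistency rate is simply the share of respondents whose post-enumeration answer differs from their original Survey answer.

```python
# Minimal sketch (hypothetical data): item-level inconsistency rate between
# the original Survey answers and the post-enumeration re-interview answers.

def inconsistency_rate(original, recheck):
    """Share of paired answers that disagree (lists must be the same length)."""
    assert len(original) == len(recheck)
    mismatches = sum(1 for a, b in zip(original, recheck) if a != b)
    return mismatches / len(original)

# Toy example: 10 respondents' "marital status" answers.
orig = ["married"] * 9 + ["unmarried"]
post = ["married"] * 8 + ["unmarried", "unmarried"]
print(f"{inconsistency_rate(orig, post):.1%}")  # 10.0%
```

The same function applied to full dates versus years of birth alone would reproduce the drop in inconsistency rates described above, since month-level disagreements no longer count as mismatches.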

There tends to be some underreporting of births in fertility surveys. To address this problem, the name and ID information of respondents was collected in the Survey questionnaire and compared with information contained in the National Health Commission’s PADIS. Since population migration causes a lag in information registration, the number of children recorded in PADIS is usually equal to or less than the number of children reported in the Survey. When the number of children in PADIS was found to be greater than the number shown in the Survey results, localities were urged to revisit the households in question and verify the information provided by the respondents before feeding the final results back to the Survey implementation agency. Information for 249,946 women was included in the Survey; 249,592 records contained valid ID information, a validity rate of 99.9%. A total of 220,828 women were matched in PADIS, a match rate of 88.5%. Further investigation identified 7486 women whose PADIS records showed more children than they had reported in the Survey, suggesting that 10,969 children were not accounted for by the Survey. The households in question were revisited, and the actual number of under-reported children was found to be 3345.
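The matching step above can be sketched as follows. This is a hedged illustration with hypothetical IDs and field names, not PADIS's actual interface: Survey records are matched to administrative records by ID, the match rate is computed, and respondents whose registered number of children exceeds the number they reported are flagged for a verification revisit.

```python
# Hedged sketch (hypothetical IDs and field names): matching Survey records
# against an administrative birth-registration database and flagging cases
# where the registered child count exceeds the reported count -- the trigger
# for a household revisit in the Survey.

def flag_for_revisit(survey, registry):
    """survey: {id: children reported}; registry: {id: children registered}.
    Returns (match rate, list of IDs needing a verification revisit)."""
    matched = [i for i in survey if i in registry]
    flagged = [i for i in matched if registry[i] > survey[i]]
    return len(matched) / len(survey), flagged

survey = {"A01": 1, "A02": 2, "A03": 1, "A04": 0}
registry = {"A01": 1, "A02": 2, "A03": 2}  # A04 unmatched (e.g. migration lag)
rate, revisit = flag_for_revisit(survey, registry)
print(rate, revisit)  # 0.75 ['A03']
```

Note that, as in the Survey, an excess in the registry is only a flag for verification, not an automatic correction: of the 10,969 flagged children, the revisits confirmed only 3345 as genuinely under-reported.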

4 Impact of non-sampling errors on fertility surveys

Non-sampling errors can have a significant impact on survey results. Before completion of a survey, a series of measures can be taken to control non-sampling errors. Upon completion of the survey, if design weighting and non-response weighting prove inadequate to contain structural deviation, a weight adjustment method is usually employed to bring the deviated structure into basic consistency with the survey’s overall structure (Jin & Zhang, 2014). In such cases, the difference between the deviated structure and the overall structure can reasonably be taken as indicative of the impact of non-sampling errors. Zhuang et al. (2019) have explained the details of weight adjustment.
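The post-stratification weight adjustment referred to above can be sketched as follows. This is an illustrative toy implementation under simplifying assumptions (one stratification variable, known population shares), not the Survey's actual weighting procedure: within each stratum, design weights are rescaled so that the weighted sample share matches the known population share.

```python
# Illustrative sketch of post-stratification weight adjustment (stratum names
# and shares are hypothetical): rescale design weights so each stratum's
# weighted share matches its known population share.

def poststratify(weights, strata, pop_shares):
    """weights: {unit: design weight}; strata: {unit: stratum};
    pop_shares: {stratum: known population proportion}."""
    total = sum(weights.values())
    stratum_totals = {}
    for u, w in weights.items():
        stratum_totals[strata[u]] = stratum_totals.get(strata[u], 0.0) + w
    # Adjustment factor = target stratum total / realized stratum total.
    return {u: w * pop_shares[strata[u]] * total / stratum_totals[strata[u]]
            for u, w in weights.items()}

# Toy sample where "unmarried" is under-represented (1 of 4 units vs. a
# known population share of 0.4): its weight is scaled up, the rest down.
w = {"r1": 1.0, "r2": 1.0, "r3": 1.0, "r4": 1.0}
s = {"r1": "unmarried", "r2": "married", "r3": "married", "r4": "married"}
adj = poststratify(w, s, {"unmarried": 0.4, "married": 0.6})
print(adj["r1"], round(adj["r2"], 2))  # 1.6 0.8
```

The total weight is preserved, so the adjustment changes the sample's structure without changing its overall size, which is why the structural difference before and after adjustment can serve as a gauge of non-sampling error.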

4.1 The Survey’s principal structural indicators

The age, marital status and hukou status of the respondents have a major impact on the main results of a fertility survey. For example, if a survey's sample population skews young, the "fertility intentions" indicator will show a smaller intended number of children; if the sample includes a much lower proportion of unmarried people, and the "hukou status" indicator includes a much higher proportion of respondents with rural/agricultural hukou, the "fertility intentions" indicator will show a larger intended number of children. A combination of such deviations can lead to an overestimation of the total fertility rate. Therefore, it is necessary to look more closely at the potential impact of non-sampling errors on such structural indicators. Table 1 shows the weighting results for the main structural indicators of the Survey examined in this paper.

Table 1 Comparison of design weighting and post-stratification weighting results for the main structural indicators in the Survey

Compared with the proportions after post-stratification weighting, the proportions after design weighting differ significantly for certain age groups: the 15–34 and 55–60 age groups show lower proportions, the 40–54 age group shows a higher proportion, and the 35–39 age group sees little change. The two weightings also differ significantly for "marital status", with the difference in the proportion of "unmarried" respondents approaching 50%; in other words, the proportion of "unmarried" respondents in the sample is excessively low. For the "hukou status" indicator, the proportion of respondents with rural/agricultural hukou was excessively high; this proportion dropped from more than three-quarters to about two-thirds after weight adjustment. Generally speaking, after design weighting there were fewer young respondents, more middle-aged respondents, fewer unmarried respondents, and more respondents with rural/agricultural hukou. These deviations were caused mainly by non-sampling errors, since sampling errors were small. Post-stratification weighting limited the impact of non-sampling errors by increasing the proportion of young respondents, reducing the proportion of middle-aged respondents, increasing the proportion of unmarried respondents, and reducing the proportion of respondents with rural/agricultural hukou, and it did not significantly increase sampling errors. Some experts have undertaken in-depth studies of the impact of complex weights on standard deviation (Lv, 2017).

4.2 Core indicators of the survey

Total fertility rate (TFR) is one of the most commonly used indicators for measuring fertility level, and it is also a core indicator in fertility surveys. Analyzing this indicator by comparing the design weighting and post-stratification weighting results gives an intuitive picture of the impact of non-sampling errors on this core indicator. The relevant statistical results for this comparison are shown in Table 2.

Table 2 Comparison between the results of design weighting and post-stratification weighting of TFR in the survey

TFRs for the years 2006–2016 are shown, with standard deviations calculated for each year. TFRs after design weighting alone are higher than the results after post-stratification weighting, with the maximum TFR difference reaching 0.50 in 2016. Looking back from 2016 to 2006, the TFR difference edges down gradually, with a slight pick-up in 2014; with the exception of 2006, the difference between the two weightings was significant in every year. This result is closely linked to the changes in marital status shown in Table 1. Because the proportion of married respondents is higher after design weighting, more children are attributed to the respondents, which in turn leads to an overestimation of TFR. For earlier years, the impact of marital status wanes gradually, as does its impact on TFR, until the difference is no longer significant.
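The mechanism described above can be made concrete with a small sketch. This uses invented toy numbers, not Survey data: TFR is the sum of age-specific fertility rates (ASFRs) over the reproductive ages, each rate estimated from weighted birth counts over weighted counts of women, so any weighting that raises the unmarried (largely childless) share of women lowers the ASFRs and hence the TFR.

```python
# Hedged sketch (toy numbers): period TFR as the sum of age-specific
# fertility rates, ASFR(a) = births to women of age group a / women of
# age group a. With 5-year groups, each ASFR is multiplied by 5.

def asfr_total(births, women):
    """births, women: {age group: weighted count}. Returns sum of ASFRs."""
    return sum(births[a] / women[a] for a in women if women[a] > 0)

women =  {"15-19": 1000, "20-24": 1000, "25-29": 1000}
births = {"15-19": 5,    "20-24": 60,   "25-29": 80}
asfr_sum = asfr_total(births, women)
print(round(5 * asfr_sum, 3))  # 0.725 (5-year groups)
```

If post-stratification weighting enlarged the weighted count of (mostly childless) unmarried women in each age group, the denominators would grow while the numerators changed little, pulling each ASFR, and therefore the TFR, downward, which is the direction of the adjustment reported in Table 2.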

4.3 Analysis of selected indicators by province

Since the sampling design called for the presentation of data for the indicators for each province, it is necessary to consider the impact of non-sampling errors during the analysis stage on the relevant indicators for provincial sub-populations. Here, we compare the proportion of unmarried respondents with the long-acting contraceptive prevalence rate, one of the core indicators that was of great concern to the former National Health and Family Planning Commission. The results are shown in Fig. 3.

Fig. 3

Source: National Fertility Survey (National Health and Family Planning Commission of China, 2017)

Variance in the proportion of unmarried respondents and the long-acting contraceptive prevalence rate by provincial sub-populations before and after post-stratification weighting (%).

In all of the provinces except for Hainan and Ningxia, there were significant differences between the results of design weighting and post-stratification weighting of the proportion of unmarried respondents. This result accords with the variance in the proportion of unmarried respondents at the national level.

In terms of the long-acting contraceptive prevalence rate, there was no significant difference between the results of design weighting and post-stratification weighting in any of the provinces. This is mainly because only married respondents were queried about the use of long-acting contraceptives. As a result, the lower proportion of unmarried respondents caused by non-sampling errors did not compromise the findings for this indicator. The same also applies to similar national-level indicators, which are not presented here for the sake of brevity.

5 Conclusions

This paper presents an empirical study focused on the control of non-sampling errors. The study unpacks and then reviews the sampling and implementation processes for a national fertility sampling survey undertaken by the central government. The characteristics of sampling frame errors, non-response errors and measurement errors found in the survey and the impact these errors have on the structural indicators, core indicators, and indicators for provincial sub-populations used to assess the sample population are analyzed empirically. The study examines a range of factors, including the sources of non-sampling errors, the strategies used to control errors, and the impact non-sampling errors have on findings. The main conclusions are as follows:

  1. (1)

    The use of administrative records can effectively reduce non-sampling errors and enhance the reliability of fertility survey data. In May 2021, the National Bureau of Statistics of China released the communiqué of the main results of the Seventh National Population Census and adjusted the numbers of births over the past decade based on the latest census results; the adjusted fertility rates showed strong consistency with the fertility results of this study. This suggests that fertility surveys that employ administrative records to reduce non-sampling errors are a reliable source of period fertility levels for non-census years and can provide reliable data for research on fertility-related topics in China.

  2. (2)

    In a sample survey with a large sample size, non-sampling errors are the main source of total error. Due to the combined effects of multiple factors, such as survey design, survey funding, those employed to conduct the survey, the level of respondents’ cooperation, the sampling frame data available for use, and the accessibility of administrative records, non-sampling errors cannot be completely eliminated. However, measures can be taken during the initial planning stages and during and after survey implementation to control non-sampling errors in ways that minimize their adverse impact on survey results.

  3. (3)

    The sampling frame errors found in the 2017 Survey examined in this study came mainly from the underreporting of school-type sample points in the second stage. These errors led to significant deviations in the Survey's structural and core indicators, including showing the age structure of the sample as older than it was, and showing a lower proportion of unmarried respondents and a higher proportion of respondents with rural/agricultural hukou than was actually the case. Sampling frame errors also led to an overestimation of TFR, a core indicator closely connected to the marital status of respondents. This is because TFR is a composite indicator influenced by both the number of children born and the number of reproductive-age females aged 15–49. Because the sampling frame under-represented unmarried respondents, the impact of the number of children born was magnified. These errors were corrected through post-stratification weighting that adjusted the age, marital status and hukou status indicators to normal levels, bringing down TFR while keeping sampling errors small. It should be noted that for those indicators not greatly affected by age and marital status, weight adjustments make little difference to the results. During survey implementation, the quality of the sampling frame can be greatly improved by making full use of the auxiliary information that is available, and emphasis must be put on accurately determining the residential characteristics of young people in order to avoid missing young respondents living at schools or at their places of employment. Support from local education authorities can help to minimize such omission-related sampling frame errors.

  4. (4)

    Non-response errors in the Survey did not have a significant impact on the results. The Survey employed homogeneous replacement to address non-responses, using a CAPI system to strictly control the replacement process to ensure that replacement respondents were random and to prevent surveyors from arbitrarily selecting respondents. By analyzing the demographic characteristics of the replacement respondents, and by comparing Survey data with fertility data contained in a government-administered records database, it was concluded that non-response errors had been effectively controlled in the Survey and had only a minor impact on core indicator results. Many earlier surveys recognized during the design stage the potential impact of non-response rates in provincial sample populations, and increased the sample size to guarantee an effective sample size. However, the estimated non-response rate might not accurately predict the actual rate, and in such cases the design error results in wasted funds. This study has demonstrated that using a CAPI system to implement a homogeneous replacement method avoids the need to predict the non-response rate in each province, thereby averting design errors and saving money on both survey design and survey implementation. Replacing non-responding respondents is a common practice adopted in many national surveys, one that can increase the effective sample size and facilitate grassroots management, provided that man-made biases are eliminated from the replacement process. Non-responses that are not replaced should be treated as naturally missing responses.

  5. (5)

    Measurement errors had little impact on the Survey results. During the Survey, questionnaires were reviewed immediately upon completion, thereby effectively reducing measurement errors. The post-enumeration survey showed that there were only minor measurement errors caused by respondents or surveyors filling in wrong answers for principal indicators. The Survey also matched and verified birth results using government administrative records, supplementing the number of children under-reported in the Survey and minimizing errors caused by respondents' omission of children. Presently, government agencies control a wide variety of data and information resources. If these resources can be utilized in survey design, not only can the huge potential of these dormant resources be unleashed, but the data quality of surveys can also be greatly enhanced. Moreover, the strong demand arising from the use of such resources can also help to drive China's push for informatization and remove barriers that hinder information sharing between government agencies.