Fertility preferences in China seem to be leveling off at 1.7 children on average (Hou, 2015; Yao et al., 2010) in the first decade of twenty-first century after a sharp decline from the 1970s to 1990s. The variation in the average fertility preferences across studies is, however, striking (Basten & Gu, 2013; Feng, 2017; Hou, 2015; Yao et al., 2010). We found the Chinese literature revealed estimates on family size preferences as different as 0.91 and 2.88 from different papers (data from the meta-review of this paper). This variation is not only due to regional or demographic differences in samples from which fertility estimates are used, but potentially also due to the various types of measurement of fertility preferences. Feng (2017) has made the point that the current large variation in measurements obscure proper interpretations of fertility intentions in China. Understanding the sources of this large variation in fertility preferences is therefore important to assess whether, and which, fertility preferences have really leveled off, and whether specific groups are precursors of much lower preferences. This is particularly relevant in the context of debates on current fertility rates (Chen & Duan, 2019; Chen, 2016; Cai, 2012; Hesketh et al., 2005) and on the impact of family policies.

In this paper, we ask the question: how much heterogeneity in aggregate-level fertility preferences can be explained by considering the demographic characteristics of the sample, its geographical location, and especially the measurement of fertility preferences used? We do so on the basis of 124 estimates derived from 94 Chinese studies published since 2001. While other studies have carried out meta-reviews of fertility preferences (e.g., Feng & Zhang, 2002; Hou, 2015; Yao et al., 2010), especially to analyze the trends over time, we take a different approach by focusing on the sources of heterogeneity. This has not been done before and consequently provides a better understanding of the large variations across studies, and especially of the respective role of sample characteristics, geography, and measurement. Moreover, we provide our detailed dataset as open data to allow for transparency and further analyses.

Sources of heterogeneity

In this study we focus on three broad sources of heterogeneity in fertility preferences. The first source is the demographic characteristics of samples (e.g., age, gender, marital status, urban or rural residence). Likely this may account for much variation in fertility preferences (Ajzen, 1991; Chen & Deng, 2007; Chen & Jin, 2011; Feng, 2017; Li & Xiang, 2010; Zhuang et al., 2014). For example, estimates of fertility preferences from younger samples are usually lower (Zhuang, 2008). People living in urbans area prefer fewer children compared to those living in rural areas (Feng & Zhang, 2002; Yao et al., 2010).

The second source are regional differences which are particularly relevant for a large and diverse country like China where regions differ significantly in terms of economic development and cultural practices, and these differences will be reflected in fertility preferences. For instance, the cost of raising a child is much higher in economic-advanced regions like the East of China compared to the less developed Western regions (Zhuang et al., 2014). Economic differences are certainly not the only factor in explaining geographic differences in fertility preferences, values and attitudes towards of children also vary among different regions irrespective of economic differences (Wang & Chi, 2017). For instance, Jiangsu and Guangdong are both economically advanced provinces, but people in Jiangsu prefer fewer children (Zheng et al., 2009) compared to those in Guangdong because of the strong son preferences in the latter region (Feng & Zhang, 2002). This cultural preference is thought to come from patriarchal values arising from inheritance traditions in which only sons inherit family fortune (Ma et al., 2016).

A third source of heterogeneity in fertility preferences, is its measurement. Fertility ideals, desires and intentions and plans all refer to different concepts that lead to different average preferences (Billari et al., 2009; Gu, 2011; Philipov & Bernardi, 2011; Zheng, 2014). Unlike fertility “ideals” which reflect unconstrained ideal family size, fertility “desires” take realistic barriers into consideration. “Fertility plans” or “intentions” include even more realistic conditions and revolve around the possibility to realize preferences in the near future. Many previous studies do not distinguish these measurements clearly, using the general phrase “fertility desires” in Chinese to present all kind of preferences. An additional dimension is whether preferences are assessed with or without considering policy restrictions. This is relevant in the Chinese context in view of restrictive family planning policy that have been in place for decades (Feng, 2017). At the same time, and because people are usually hesitant to express their preferences if they conflicted with policy, researchers used an indirect way to measure actual fertility desires or intentions, by asking respondents to hypothesize about a situation in which there were no restrictions. Feng (2017), however, has questioned the reliability of these conditional measurements arguing that because they measured the hypothesized preferences, they may be far from peoples’ real preferences. These various measurements are summarized in Table 1, along with their Chinese original language wording translated.

Table 1 The measurements of fertility preferences


We used the China National Knowledge Infrastructure database (CNKI) to investigate heterogeneity in fertility preferences. We used the search term “survey” and all measurements and concepts of fertility preferences used in Chinese, (e.g., “fertility desire”, “fertility decisions”, “ideal family size”, “planned number of children” etc.). Our search was carried out on 11th June 2018. The detailed search and selection process and the original data can be provided upon request. We used surveys from 2001 onwards because fertility preferences remained stable at a low level and showed disparity of variance since the twenty-first century based on our previous research that focused on the trends of fertility preferences in China since 1986 (self-citation).

For all studies we recorded the information about the survey, the detailed question(s) used to measure fertility preferences and estimates of preferred fertility and their standard deviations (if any). When the average number of children was not reported, but proportions of the number of children people preferred was provided, we used these proportions to calculate the means (Walter & Yao, 2007), more details can be found in the supplement 1. This resulted in 94 studies producing 124 estimates for the mean of preferred number of children. We characterize each survey and estimate on the following dimensions (see Table 2):

Table 2 The measurement of three sources of heterogeneity
  1. 1.

    Demographic characteristics: measured by different age groups and gender, marital status, having children or not, rural, or urban residence. We assigned categories based on the sampling framework of the study. If a study sample is not from a specific group, we noted this as “all”. For example, we use “all” as the category for samples including both men and women and use “women” to label the sample composed only by women, and we followed a similar strategy for other variables.

  2. 2.

    Regions: measured by both economic division of regions (Bureau of Statistics, 2011) and an alternative classification derived from human geography and designed by Fang et al. (2017). The economic regionalization is based on division from the National Bureau of Statistics of China which divided China into four areas: east, northeast, mid and west. The regional division based on human geography takes into account ten factors (e.g., natural, economic, cultural, ethnic, urbanization, etc.) to divide China into eight human geographic regions “north, east, central, south, northeast, northwest, southwest, Qinghai and Tibet (see the detailed regionalization in the supplement 2).

    Studies that cover multiple economic or geographic regions and studies that included regions for which few estimates were available were included in the category “other” (see details in supplement 2). National surveys that cover all regions are included in the category “national”. This leaves 5 categories for economic regions: east, mid, west, national, others and 6 categories for human geography: north, east, central, south, national, others.

  3. 3)

    Measurements differences: we regrouped the different measurements into five concepts based on the question asked in surveys, which are: (1) ideal family size, (2) ideal family size without consideration of policy restrictions, (3) desired family size, (4) desired family size without consideration of policy restrictions, and (5) planned family size. We exclude the planned family size without consideration of policy restrictions in our analysis because there is only one estimate for this measure.

Analytical strategy

Because the data collected are averages of preferred number of children from existing publications rather than individual-level data, we used a meta-analysis approach to quantify the sources of heterogeneity of fertility preferences. More specifically, we ask how much of the between-study heterogeneity can be explained by our predictors. A mixed-effect meta-regression is suitable here because it allows us to test the differences between the fixed groups and it also allows a random-effect model within such groups (Harrer et al., 2019). A random setting here is necessary because the studies included in each group are not from the same population. We assume studies within groups come from a universal population which is larger than the current studies we included. This setting also enables us to compare the means of each subgroup.

A meta-regression model needs the inverse variance of each study as weights (Schwarzer et al., 2015). The formula below is to calculate the inverse variance for average preferred family size of each estimate:

$${\text{w}}= \frac{1}{{\mathrm{SE}}^{2}}=\frac{1}{{\mathrm{SD}}^{2}/\mathrm{n}}$$

w: inverse variance for each estimate

SE: standard error for each estimate

SD: standard deviation for each estimate

n: sample size for each estimate

However, a standard deviation (SD) was available for only 18 estimates (out of 124). We use the proportions of different numbers of children preferred reported in these studies to calculate the SDs (85 estimates) and imputed the average of all SDs (Hozo et al., 2005) to replace the remaining unknown (21 estimates). We used the R package ‘metafor’ (Viechtbauer, 2010) for the meta-analysis.

We first run mixed-effect models with each source of heterogeneity separately to investigate how much variation can be explained only by that source. It means there is only one variable for each cause in one model or only several variables that attributed to the same source of heterogeneity in one model. In this step, we compare the tau2 (the amount of residual variability) and R2 (the amount of variability explained by predictors) to investigate how much variation can be explained by the different sources of heterogeneity in fertility preferences. In the subsequent step, we include all source simultaneously in the meta-regression and controlled for the variable year. We checked the correlations of year and regions to avoid the collinearity in our meta-regression models. The results show the year are not distributed differently from different human geographic regions or economic regions (see supplement 3) And we also run the meta-regression model without year, the results of effect size are similar which means the collinearity is not an issue in our analysis (see supplement 4).

Because some estimates are from the same survey (e.g., estimates of different years or measurements of preferences in one survey), we also included a model in which “survey” was a higher level in a three-level meta-regression model to control for this nested effect as a robustness check (results were very similar and are presented in the supplement 4).


The distribution of the average number of children preferred to have among 94 studies shows a remarkable range from 0.91 to 2.88, with an average of 1.70 (with SD 0.26) and a median of 1.71 (Fig. 1). But few estimates are above 2 (only 5 cases out 124), and there is only one estimates below 1. Fifty percent of all estimates lie between 1.54 and 1.87.

Fig. 1
figure 1

The distribution (histogram) of estimates of average number of children preferred in China based on 124 estimates

We found that the three sources of heterogeneity (demographic characteristics, region, and measurement) in fertility preferences in total account for 46% of the variation (when using regional differences based on human geography and 39% when based on economic indicators, see Table 3). Survey year explains less than 4% of the variance,Footnote 1 so we only added it in the analysis as a control variable but do not consider it as a source of heterogeneity in subsequent models. Figure 2 displays the variation in fertility preferences across survey years (since 2001). Across time, average fertility preferences increased, but the effect was very weak with a 0.02 increase in average fertility preference with an increase in survey year (see Table 4).

Table 3 The amount of heterogeneity attributed to each cause
Fig. 2
figure 2

The average number of children preferred in China between 2001 and 2018 (n = 124 estimates). There is a positive trend of survey year for the average family size

Table 4 Mix-effect meta-regression model of averaged family size

When comparing the amount of variation explained by predictors separately, demographic differences explained 23%, of which most of the variation is explained by age group (13%) and urban/rural residence (10%). Human geographic regionalization explains most variation (21%) compared to regions divided by economy (13%). The measurement of fertility preferences could account for 13% of the between-study variation.

Because human geography explains more variation compared to economic regions, we used human geography in our final analysis that compared the effects of each predictor. Younger samples (aged 20–35) preferred 0.25 fewer children on average compared to samples on the fertile-age group (aged 15–49) (Table 4). Samples with only women preferred 0.16 fewer children compared to the samples including both women and men. Samples with people living in rural areas preferred 0.13 more children compared to those including both rural and urban areas. Compared to Central China, East China (-0.15) and North China (-0.25) preferred fewer children on average. The desired family size on average in China since 2001 is at a low level of 1.61–1.94 (95% confidence interval), without considering the family planning policy this is higher at 1.78–2.01. Findings on ideal family size are similar at 1.75–1.99 (without considering policy: 1.82–2.12). The averaged planned number of children is lower than desired and ideal number of children at 1.46–1.80.


Based on the meta-review of 124 estimates from 94 studies, we observed a considerable range in the average number of children preferred in China in the twenty-first century, from fewer than 1 to about 3. Yet, half of the estimates were between 1.54 and 1.87. This result is consistent with previous studies (Feng & Zhang, 2002; Hou, 2015; Yao et al, 2010).

Most importantly, this paper quantified the heterogeneity of preferred fertility in Chinese studies from 2001 onwards and explored possible sources of the variation. The debate on total fertility rates (Chen, 2016; Chen & Duan, 2019) and whether or not China has fallen into a so-called ‘low fertility trap’ (Jin, 2014; Wu, 2016, 2019; Yang, 2012) is partly due to the large variation across studies in their estimates of fertility preferences. The study of the heterogeneity in fertility preferences can contribute to these debates by shedding light on whether particular groups have higher or lower preferences and by addressing differences in measurement of preferences.

Our results revealed that sample characteristics (demographic and regional differences) and the type of measurements of fertility preferences can explain around 46% of the variation in fertility preferences. This means that only three types of characteristics of the samples are able to explain almost half of the variation in average preferred fertility. Specially, we found demographic characteristics of sample explained 23%, human geographic regionalization explains 21% and the measurement of fertility preferences could account for 13% of the variation.

Of particular importance here is the measurement of fertility preferences in explaining variation between studies. This is something that has been overlooked in previous meta-reviews. Only Feng (2017) distinguishes these measurements, but he did not statistically compare the variation caused by them. We further quantify the differences in fertility preferences between ideal, desired and planned family size, with the consideration of the differences between estimates with reference to policy restrictions. Specifically, we find little difference between desired family and ideal family size, however, we also observe that both these measures increase substantially when respondents are asked not to consider family policy restrictions. This is an important finding that implies that respondents typically answer on questions of both desired and ideal family size with restrictions in mind.

A division of regions based on human geography was better able to explain variation between studies than the more frequently used division based on economic development. This reinforces the idea that cultural, historical patterns in regions can be equally or even more important than economic development. It reflects the lag effect of traditional culture especially on the attitude and value towards children (Feng & Zhang, 2002) but also indicates that the modern lifestyle and culture of regions plays an important role in contemporary China (Li et al., 2011).

About half of variation between samples is not accounted for. This is partly due to these sample characteristics being rather coarsely defined which clearly reduces the variation that these characteristics can explain. Other methodological aspects related to survey design and/or substantive group differences could further explain the remaining variation across survey estimates (e.g., mode of data collection (e.g., face-to-face, telephone, or online survey), and identity of the interviewers). Nonetheless the methodology used here, and the sharing of replicable data generation process, constitute one further step in our understanding of variations in fertility preferences in China.