1 Introduction

Crop insurance plays an important role in providing farmers with protection against catastrophic yield shortfalls (Charpentier 2008). The level of risk involved usually is quantitatively reflected in the size of the crop insurance premium. Individual-yield crop insurance is the most widely used and has usually been priced on the basis of long-term loss data recorded by insurance companies or long-term individual yield records. With long-term loss data, the loss cost ratio (LCR), or pure premium rate, is calculated through simple division calculation. The necessary data include the long-term operational history of agricultural insurance and constant or similar production across time (Josephson et al. 2000). The premium rate also can be determined through integration of the probability distribution simulated with yield records (Ozaki et al. 2008). Regional yield data, for example, are used to price crop insurance schemes, where indemnity and premium are based on the regional yield (Miranda 1991; Skees et al. 1997). But it is extremely difficult to obtain long series of yield records at the farm level to estimate the distribution of yields of individual smallholder farms with a sufficient degree of confidence. The situation is even worse in developing countries with smallholder agriculture. There are very few cases where high-quality farm yield series are available to support yield simulation based rating. Thus more appropriate data and ratemaking methods are urgently needed.

The existing literature suggests several methods using available regional yields to price farm-level insurance. Botts and Boles (1958) assumed normal distribution of farm yields around the county average yield with a certain standard deviation. The method was challenged because of its normal distribution assumption (Nelson 1990) and constant deviation (Skees and Reed 1986). Also, the relationship between regional yield and individual farm yields is very complicated (Wang and Zhang 2003). There is some research on ratemaking with yield data of only a few years (Nelson 1990). However, short-term series data do not fully characterize crop yield risks (Coble et al. 2010). For example, farmers are required to report 4–10 years of individual yield records when buying actual production history (APH) products. However, the risk management agency (RMA) of the US Department of Agriculture (USDA) uses the reported yield records to estimate the expected yield, instead of the yield distribution of individual farms (Schnapp et al. 2000). Regional yields and short-term series of farm-level yields do not provide farm-level and long-term series information at the same time. So they cannot provide adequate input for ratemaking. Characteristic values carry more information than the same number of reported records by individual farms. For example, the extremes, the upper and lower hinges (quartiles), and the median of a set of data are conventionally used in the configuration of a box plot to convey the characteristics of a distribution of data values (Mcgill et al. 1978; Williamson et al. 1989).

This article presents a survey data based approach to price crop insurance in counties of China. Different from the APH that requires farmers to provide 4–10 years of actual yield data, the method proposed here focuses on characteristic yield information—average yield, high yield, and low yield—of smallholder farms. As the memory of yields of specific past years becomes vague in the present, characteristic yield information is more reliable than reported yields for specific years, if both are recalled by farmers. Instead of using the reported data directly, special treatment is employed to compensate for random error and ensure data validity. Section 2 presents the basic principle for premium ratemaking, Sect. 3 outlines the survey that provided the data, Sect. 4 presents the steps of data processing, and Sect. 5 outlines conclusions from the study.

2 Basic Principle for Premium Ratemaking

Let’s assume that the random variable of interest is y jt , the production of a specific acre j at year t. Let N be the total number of acres. We arrange the random variables y jt in the following array (Koundouri and Kourogenis 2011):

$$ \left[ {\begin{array}{*{20}l} {} \hfill & {t = 1} \hfill & {t = 2} \hfill & \ldots \hfill \\ {j = 1} \hfill & {y_{11} } \hfill & {y_{12} } \hfill & \ldots \hfill \\ {j = 2} \hfill & {y_{21} } \hfill & {y_{22} } \hfill & \ldots \hfill \\ . \hfill & . \hfill & . \hfill & {} \hfill \\ . \hfill & . \hfill & . \hfill & {} \hfill \\ . \hfill & . \hfill & . \hfill & \ldots \hfill \\ {j = N} \hfill & {y_{N1} } \hfill & {y_{N2} } \hfill & {} \hfill \\ \end{array} } \right] $$
(1)

According to the indemnity calculation formula in the insurance contract (PICC 2011), the county annual LCR at year t is derived as follows:

$$ LCR_{t} = \frac{{\sum\nolimits_{j = 1}^{N} {\max \left[ {{\mu_{j}} - {y_{jt}} ,0} \right]} }}{{\sum\nolimits_{j = 1}^{N} {\mu_{j} } }}\quad \forall y_{jt} \le C \times {\mu_{j}} $$
(2)

C is the crop insurance coverage level. \( \mu_{j} \) is the average yield of specific acre j. The average LCR in T year is calculated as a simple average of the annual loss rates.

$$ \begin{aligned} E(LCR) = \frac{{\sum\nolimits_{t = 1}^{T} {\sum\nolimits_{j = 1}^{N} {\max [{\mu_{j}} - {y_{jt}} ,0]} } }}{{T \times \sum\nolimits_{j = 1}^{N} {\mu_{j} } }} \hfill \\ = \frac{{\sum\nolimits_{j = 1}^{N} {\sum\nolimits_{t = 1}^{T} {\max [{\mu_{j}} - {y_{jt}} ,0]} } }}{{T \times \sum\nolimits_{j = 1}^{N} {\mu_{j} } }}\;\; \quad\forall \, y_{jt} \le C \times \mu_{j} \hfill \\ \end{aligned} $$
(3)

If the yield probability distributions of acres are known, across time (temporal properties), the expected LCR (pure premium rate) can be expressed as:

$$ E(LCR) = \frac{E(I)}{E(\mu )} = \frac{{E[\int_{0}^{{C \times \overline{{y_{j} }} }} {(\mu_{j} - y_{jt} )f_{j} \left( {y_{it} } \right)d(y_{it} )} ]}}{E[\mu ]} $$
(4)

\( E(LCR) \) is the expected LCR. \( E(I) \) is the expected indemnity. \( f_{j} \) is the yield probability distribution of acre j. \( C \times \overline{{y_{j} }} \) is the yield coverage. The farmer whose yield is higher than \( C \times \overline{{y_{j} }} \) will not be paid by the insurance company.

In reality, long-term series of all farm-level yield records in regions (array 1) are difficult to obtain. So it would be more applicable to calculate the premium rate with Eq. 4, based on the estimation of the farm-level yield probability distribution. Here we try to do this based on the data obtained through a survey with a sample size of 486 households. The survey data provide the information to calculate the net rate.

3 Farm-Level Yield Data from the Survey

Based on Eqs. 14 of Sect. 2, the key question for ratemaking is how to estimate the probability distributions of farm-level yields. Estimating this distribution with long-term regional yield data (Botts and Boles 1958) has the disadvantage of ignoring the variation of farm-level distribution, although it incorporates considerations for temporal variations. Using short-term series farm-level yield records prohibits reliable characterization of the temporal aspects of the distribution. Therefore this study proposes the method of estimating the distribution with a farming household survey of the long-term yield characteristics. This section presents the design and execution of the 2011 survey in the case study area of Dingxing County, Hebei Province, on the North China Plain.

Winter wheat is one of the major crops in China and insurance premiums are heavily subsidized by the government. Almost all farmlands in the study area are planted with winter wheat. Therefore, this study chose winter wheat as the crop for examining premium ratemaking.

3.1 Design of the Survey

The survey was designed to characterize the long-term farm-level yield distributions to be used for pricing farm-level insurance. The characteristic values collected through the survey describe the distribution across a long time period and represent the volatility of farm-level yields. A well-designed sampling survey would provide yield information across a long time period and cover various farms in the survey region.

The questionnaire included personal information, farm information, wheat yields, investments in wheat cultivation, and the impacts of disasters on wheat production. Three closed questions directly related to yield: (1) what is the average yield of winter wheat on your farm? (2) what has been the highest yield of winter wheat on your farm? and (3) what has been the lowest yield of winter wheat on your farm? Answers to these questions can reflect the yield variation at the farm level and can be provided by all farmers. Farmers usually have a clear memory of reduction of output, bumper harvests, and average yield. Farmers have held individual contracts to use their farmlands since the mid-1980s, following the economic reform and opening-up of China that started in 1978. Most of the interviewees in the survey were 50–60 years old and had been farmers through the whole time period since the farmlands were allocated for individual use. So although the time period the characteristic yield values should cover was not explicitly specified in the survey questions, the interviewees commonly referred to the mid-1980s as the departure point for reporting the characteristic yields.

3.2 Sampling and Implementation of the Survey

The case study area of Dingxing County, Hebei Province, on the North China Plain (Fig. 1) is a representative winter wheat production area in China. The total land area of the county is 714 km2. The county has 16 towns with a total area of 684 km2; an urban district; a provincial industrial park; and 274 villages. Because agriculture is found only in the villages, the study area does not cover the urban district and the provincial industrial park (Table 1). The total wheat planting area is about 33,300 ha, with an annual production of more than 200,000 tons. Agricultural income is the main source of income for local farmers. The farming income for the surveyed households ranged between RMB 3,000 and 7,000 yuan, or about USD 480–1,100 (calculated at the 2011 wheat price of 2.2 yuan per kg). To generate off-farm income for supporting their families, some farmers became migrant workers and took on various jobs in cities. Winter wheat is usually plowed and sowed with tractors in late September. Field management, such as fertilization and irrigation, is completed with a large amount of labor input, which is organized within family units. In early June of the next year, winter wheat is harvested with large machines. Crop yield is mainly affected by droughts, heavy wind, cold spells, and hail.

Fig. 1
figure 1

Distribution of the 486 questionnaires from Dingxing County, Hebei Province (the dots represent the samples)

Table 1 Overview of the study area in Dingxing County, Hebei Province

In order to ensure the representativeness of the survey data for the entire county, all the villages in the county were visited by our research team. Village is the smallest settlement unit here, with a size of several hundred households. In each village, several farmers were interviewed by random selection. The interviewees were mainly farmers working on the farms, not wage laborers or others who were not engaged in agricultural production. We preferred older male farmers as interviewees because of their generally rich farming experiences compared to women and younger farmers. Guided by the predesigned questionnaire, the semistructured interviews with the farmers lasted half an hour or longer.

The survey was carried out in May 2011–June 2011, during the winter wheat harvest. At this time of the year, most of the farmers have just completed the process of wheat production and they remember information on the production process more accurately due to the so-called “context effects,” that is, the same environment is a retrieval cue to recall information (Meyers-Levy et al. 2010). In addition, farmers can be found more easily on the farms at this time of the year.

Within each village the timing and general practice of seeding, fertilization, irrigation, and harvesting are homogeneous, and yields are fairly comparable. A sample size larger than the village count with two samples from each village was deemed representative enough for yield information for the area. Of the final total of 538 completed questionnaires, 52 lack some key information, so only 486 of them were valid and were used for the analysis. Table 2 provides a brief description of the survey results.

Table 2 Descriptive statistics of the answers to the three wheat yield related questions directed at farmers in Dingxing County, Hebei Province

Based on data from statistical yearbooks, the minimum, maximum, median, and mean values of the annual average wheat yield in Dingxing County from 1993 to 2010 are 350, 420, 389, and 386 kg, respectively. The farmers’ reported characteristic yields range widely between normal, high, and low yield years and among farming households (Table 2). The reported yields in normal years have a slightly higher mean value for the surveyed households (403 kg) than the county average (386 kg); the reported yields in high yield years have a higher mean value for the surveyed households (471 kg) than the county average (420 kg); and the reported yields in low yield years have a lower mean value for the surveyed households (298 kg) than the county average (350 kg).

4 Yield Simulation and Insurance Pricing

This section describes the simulation of farm-level yields through compensating for the influence of error in the survey and recovery of the yield probability distributions with the linear additive model (LAM). The premium rate is determined according to the yield simulation result (Fig. 2).

Fig. 2
figure 2

Flowchart of the research

4.1 Compensating for Random Error

The survey data can demonstrate the probability distribution of farm-level yields in the region. Statistically, field observations and the resulting measurements are never exact. Any observation can contain various types of errors (Fan 1997; New Jersey Institute of Technology 2007). Random errors are caused by various subjective and objective factors in the process of memory production and the presentation of information, for example rounding and farmers’ memory errors. Because of these errors, yield data obtained from the survey do not represent precise crop yield figures.

The estimated distribution of farm-level yields is also significantly affected by random errors. To compensate for the influence of random errors, normal distribution n with zero mean was usually used (Topping 1957; Exell 2001). So the true value of characteristic values can be defined as:

$$ y_{ta} = y_{a} + fu_{offset} $$
(5)

where y a is the farmer’s response in the questionnaire, \( fu_{offset} \) represents the compensating factor for random errors and was set as \( N(0,\sigma^{2} ) \).

In the survey, the interviewees tended to round the yields to 50 kg. So y a will more likely represent a yield range of \( \left\{ {y_{ta} \left| {y_{a} \; - 25 < y_{ta} < y_{a} \; + 25} \right.} \right\} \). This is a simplified method of information diffusion that is used when information is incomplete (Huang 1997). Here two-sigma (σ) limits are used to control the variable’s range, which means that we set the \( \sigma \) as 12.5, so that 95.44 % of the \( fu_{offset} \) values lie within \( [- 25, 25] \).

4.2 Probability Distribution Based on Characteristic Yield

In the survey, we asked for average yield, highest yield, and lowest yield as the characteristic yields to express the probability distribution. The occurrence probability that corresponds to each characteristic yield is also needed to estimate distributions of yields of individual smallholder farms. Since it is impractical to get these statistical values from farmers, these values were derived from the county yield data.

Based on the LAM, variations in individual yield can be decomposed into variations in area yield that represent systemic risk and variations in the error term that represent individual-specific or nonsystemic risk (Ramaswami and Roe 2004).

$$ y_{jt} = \mu_{j} + \beta_{j} (Y_{t} - \overline{Y}) + \varepsilon_{j} $$
(6)

where y jt is the yield of individual j at year t, \( \mu_{j} \) is the unconditional mean of \( y_{jt} \), that is, \( E\left( {y_{jt} } \right) \). \( Y_{t} \) is the area yield at year t, \( \beta_{j} \) is the slope parameter satisfying \( \beta_{j} = Cov(y_{jt} ,Y_{t} )/Var(y_{jt} ) \). \( \overline{Y} \) is the unconditional mean of \( Y_{t} \), and \( \varepsilon_{j} \) is a mean zero random variable uncorrelated with area yield.

So the probability of individual yield in a certain range \( \{ a < y_{ta} < b\} \) can be expressed as:

$$ P\{ a < y_{jt} < b\} = P\left\{ {\frac{{a - \mu_{j} }}{{\beta_{j} }} + \overline{Y} < Y_{t} + \frac{{\varepsilon_{j} }}{{\beta_{j} }} < \frac{{b - \mu_{j} }}{{\beta_{j} }} + \overline{Y} } \right\}\quad {\text{a}} < {\text{b}} $$
(7)

As a primary source of systemic risk in agricultural systems, geographically extensive unfavorable weather events, such as droughts or extreme temperatures (Miranda and Glauber 1997), will affect large homogeneous topographic and land-use areas. In the whole of Dingxing County systemic risk in agricultural production is the main factor of individual risk. The evidence for this is reflected in the responses of the farmers: most of the interviewees reported highest yield in 2010, and lowest yield in 2003 when continuous rains during the winter wheat ripe period affected the growth of the crop, resulting in great decline in yields. This result is consistent with the county yield variation (Fig. 3). The longtime-sequence actual yield of crops is generally decomposed into trend yield (through fitting the actual crop yield by a trend line, depending on the mathematic model; this part of the yield is considered as a result of agricultural technology development and agricultural investment), climatic fluctuation-affected yields (which are considered as a contribution of climate fluctuations), and random error. In Fig. 3, the detrended yields are the remainder after the actual yield data was subtracted by the trend yield. The mathematic model used to obtained the trend yield is the least-squares fit.

Fig. 3
figure 3

Average wheat yield in Dingxing County, Hebei Province, 1993–2010

Therefore, the systemic risk will be sufficiently larger than the error term \( \left( {\varepsilon_{j} } \right) \) in the county. Significant yield loss or increase can only be caused by systemic risk, which means that the low yields indicated in the survey were the result of regional yield loss and the high yields were the result of regional yield increase. Therefore \( y_{j}^{low} \) is treated as the farm-level yield when \( (Y_{t} - \overline{Y} ) \) is smallest; and \( y_{j}^{high} \) is the farm-level yield when \( (Y_{t} - \overline{Y} ) \) is largest. So:

$$ \beta_{j} = \frac{{y_{j}^{c} - \mu_{j} - \varepsilon_{j} }}{{(Y_{c} - \overline{Y} )}} $$
(8)

where \( y_{j}^{c} \) is a characteristic yield from a single farming household and Y c is the corresponding county yield.

Because \( \varepsilon_{j} \) is a small mean zero random variable, and the difference between characteristic yields is much bigger, the average value of \( \beta_{j} \) when \( y_{j}^{c} = y_{j}^{high} \) and \( y_{j}^{c} = y_{j}^{low} \) is used for the calculation of the probability of individual yield. The probability of individual yields falls into three yield intervals \( \left( {y_{j}^{high} \pm 25,\;y_{j}^{low} \pm 25,\;{\text{and}}\;y_{a}^{normal} \pm 25} \right) \) and is approximately estimated with the systemic risks.

Calculated with Eq. 7, the three probabilities’ average values of all farmers are 0.1672 (low yield), 0.7059 (average yield), and 0.1269 (high yield), which indicates that the probability of getting below average yields is greater than that of getting above average yields. It is consistent with the research results of Gallagher (1987) and Harri et al. (2009) that the yield distribution is negatively skewed. This is also consistent with the survey result. Statements such as “there were few years with yields significantly higher than others” and “the yields were similar across years” were often made by the farmers.

4.3 Yield Simulation and Premium Rate

Based on each farm-level yield distribution, the yields of the 486 farming households were simulated. The inverse transform method was used to generate a sample of random variables with each farm-level yield distribution. For the simulation of one farm, there were two processing steps: (1) a pseudo-random number generator was used to generate a random variate uniformly distributed in [0, 1]; and (2) the random numbers were converted to a random variate of the farm-level yield distribution based on the cumulative distribution function (CDF).

When using the Monte Carlo method, a large number of simulation runs will return a stable result. Therefore 10,000 years (or times) of 486 farm-level yields were simulated with the help of matlab 2013. The simulated yield data were generated based on probability, and without a temporal dimension.

The probability distribution of the simulated yield is shown in Fig. 4. The mean value of the simulated yield is 393.3 kg, and the variance is 74.0. The maximum and minimum of the simulated mean yields of the 486 farmers are 576.2 and 234.9 kg respectively (Fig. 5).

Fig. 4
figure 4

Probability distribution of the simulated wheat yield in Dingxing County, Hebei Province

Fig. 5
figure 5

Simulated mean wheat yields of the 486 farmers in Dingxing County, Hebei Province

According to the simulated yield and Eqs. 24, the pure premium rate (the expected LCR) for the Dingxing County area is calculated as 2.47 %, which is greater than the average LCR (1.12 %) from 2008 to 2013. This is mainly because coverage in local agricultural insurance was limited to droughts and hail storms, but other hazards—including heat waves, heavy precipitation events, pest infestations, and crop diseases—have all made their impacts on actual yields. Data from insurance companies are available for only 4 years, which is too short a time to reflect the real expected LCR.

5 Conclusion

The rapid development of crop insurance in emerging markets (Kalra and Li 2013) points to the urgent need for valid pricing techniques tailored for regions with limited farm-level yield records and past insurance loss data. This study explores a new ratemaking approach based on survey data. Different from existing studies, this study focuses on the characteristic yields of long-term yield rather than yield figures of specific years as reported by farmers in some existing pricing systems (Skees and Reed 1986; Woodard et al. 2011). Although both types of information are from farmers’ recollection and can hardly be “exact,” characteristic information is more reliable than yields of specific years because it captures the feature of yield distribution (Gong et al. 2013). Random errors were taken into account when farm-level yield distribution was retrieved from the reported characteristic information.

Although 2.47 % is not a particularly high premium rate, given the very small planting area of an average household and the low household income in the study area, this is still difficult to afford for many farming households. The farmers were more willing to spend their tight resources on basic necessities than paying for agricultural insurance premiums (Smith and Glauber 2012; Zhao 2012). To encourage the purchase of crop insurance products, the Chinese government began to provide premium subsidies (up to 90 %) for agricultural insurance in 2007 to help farmers cope with the threat of natural disasters (Hou et al. 2010; Guo et al. 2011). As a result, China’s agricultural insurance market has experienced rapid expansion in recent years (Xiao et al. 2013).

The data used in this study were obtained through a household survey and the temporal coverage of the characteristic yield values was limited by the experience of the surveyed farmers. As a result, the estimation of probability of extreme yield events involves more uncertainties. So the insurance company will need high loaded premiums for a stable operation.

Farm-level yields were treated independently in the simulation, which means that interdependency of farm-level yields was not considered in this Monte Carlo simulation. However, a disaster, for example drought, usually affects many farms in the area in the same year and a large number of farms will face yield depression at the same time. If the yield of one farm is low, the yields of other farms are more probably low as well because of this spatial dependency (Woodard et al. 2012). This will cause a large amount of insurance indemnity in the region in a particular year. The yield simulated with this method cannot be used to estimate the annual exceedance probability (AEP) of the LCR. To address this question, measuring spatial correlation will be an important issue in future research.