1 Introduction

Recent studies have highlighted the large food losses that occur between the harvest of crops and their consumption (FAO 2011; Lundqvist et al. 2008). Reducing food losses could contribute substantially to satisfying the anticipated higher global food demand and to improving food security and resource use efficiency (Godfray et al. 2010; West et al. 2014; Hertel 2015; Reynolds et al. 2015). There seems to be consensus in the literature that post-harvest losses in developed countries are relatively high at the consumer end, while in developing countries they are relatively high in the early stages of the post-harvest system, i.e. at farm level (Parfitt et al. 2010; Hodges et al. 2011). However, recent and systematic evidence on the magnitude of post-harvest losses at farm level in developing countries is lacking. In sub-Saharan Africa (SSA), post-harvest loss studies at farm level focus almost exclusively on storage losses (www.aphlis.net; Rembold et al. 2011). Most studies use non-standardized and biased methodologies, and the estimated storage losses are generally inaccurate (Affognon et al. 2015). Consequently, there is a lack of reliable information on the post-harvest losses faced by farmers. More generally, there is a lack of information on the post-harvest management by farmers and the conditions under which they operate in relation to post-harvest food losses.

Recognizing that thorough statistical analyses were hampered by a lack of reliable data, the Living Standards Measurement Study - Integrated Surveys on Agriculture (LSMS-ISA) of the World Bank started to design and implement representative panel household surveys with a focus on agriculture. The systematically collected datasets contain information on post-harvest management and on farmers' self-reported post-harvest losses of different crops, in addition to more general socio-economic information on farm households and geo-referenced information, for example on climate and the distance of households to markets. In addition to these data, population weights are included and data are weighted to represent the national-level population of rural and small town households. Although eight African countries are involved in this project, in this study we focus on panel household data collected in Ethiopia. This multi-topic dataset allows the analysis and modelling of factors that are associated with post-harvest losses in the literature, such as the method of storage, ambient humidity and temperature, market access and household education (Tefera 2012; Stathers et al. 2013; Edoh Ognakossan et al. 2016).

The overall objective of this study is to disentangle factors that induce or relate to post-harvest losses of cereals in Ethiopia. A better understanding of the causes of post-harvest losses across geographical areas with different agro-ecological and socio-economic characteristics could enable more efficient targeting of interventions aimed at post-harvest loss reduction. This overall objective can be sub-divided into the following sub-goals:

  1) To gain insight into the post-harvest storage management of cereals.
  2) To gain insight into the scale, causes and reported percentages of post-harvest losses in cereal crops.
  3) To identify and quantify major agro-ecological (e.g. altitude, rainfall, storage methods) and socio-economic (e.g. wealth of household, distance to market) variables that are related to post-harvest losses of cereals.

Kaminski and Christiaensen (2014) estimated post-harvest losses of maize in East Africa using LSMS-ISA data for Malawi (2010/2011), Tanzania (2008/2009 and 2010/2011) and Uganda (2009/2010). They concluded that on-farm self-reported post-harvest weight losses varied between 1.4 and 5.9% of the national maize harvest in these countries, while losses were concentrated among less than one fifth of the surveyed households. Kaminski and Christiaensen (2014) also analysed potential drivers of post-harvest losses, but the vast majority of the variation in post-harvest losses remained unexplained using classical parametric methods. In our study we started with a larger LSMS-ISA data set based on four East-African countries totalling seven years of data. Table 1 shows the number of cereal records (maize, barley, millet, rice, sorghum, teff, wheat) and the percentage of cereal records with post-harvest loss estimates (0%, >0%, and % missing data) in the available data set for Ethiopia, Malawi, Uganda and Tanzania. Post-harvest losses are either recorded as such (in percentages) or calculated as the reported amount of crop losses (in local and/or SI units) divided by the total amount of harvested crop (in local and/or SI units), expressed as a percentage. In the remainder, we use post-harvest losses as a general term for both types of estimates. Except for the datasets of Ethiopia (2011/2012; 2013/2014) and Uganda (2011/2012), the farmers’ self-reported estimates of post-harvest losses were missing in roughly 90% of the cereal records. The reason for the large number of missing data is unknown, but it made these survey data unsuitable for further analysis and modelling. The Ethiopia (2013/2014) and Uganda data were the most complete but indicated the prevalence of post-harvest losses in only 2% of the cereal records, which is also not helpful for modelling post-harvest losses.
Therefore, in this paper we focused exclusively on the dataset of Ethiopia (2011/2012) as it contained the most complete data with quantitative post-harvest loss estimates and variables related to post-harvest management and losses. Also, different from Kaminski and Christiaensen (2014), we used Random Forests (RF) to model post-harvest losses of cereal crops. RF is a non-parametric statistical ensemble learning method suited for the analysis of large data sets. Recently, RF has become popular in biological sciences because of its high-prediction accuracy and provision of information on the importance of variables for classification and regression (Breiman 2001; Touw et al. 2013).

Table 1 Number of available cereal records in various investigated country databases of the Living Standards Measurement Study-Integrated Surveys on Agriculture (LSMS-ISA) and the percentage of cereal records with post-harvest loss values equal to 0%, > 0% and missing values

In the remainder of this article we first describe the Ethiopia (2011/2012) data used, the processing of the data and the methods used to model post-harvest losses. Section 3 describes 1) the post-harvest management of cereals, post-harvest losses and their causes in Ethiopia; 2) the results of the post-harvest loss modelling for Ethiopia. In Section 4 we reflect on the results of the various analyses, the RF method used to model post-harvest losses and the LSMS-ISA data used.

2 Data and methodology

2.1 Data: living standards measurement study-integrated surveys on agriculture (LSMS-ISA)

We used survey data of the LSMS-ISA project, which supports and collaborates with the national statistics offices of eight SSA countries (Burkina Faso, Ethiopia, Malawi, Mali, Niger, Nigeria, Tanzania, and Uganda) to design and implement systems of multi-topic, nationally representative panel household surveys with a focus on agriculture. The general setup of the surveys is the same across countries and typically consists of questionnaires related to the household, agriculture, livestock and community. The generic survey methodology carried out in different countries potentially allows cross-country and time series analyses of the data.

Household and post-harvest information in Ethiopia for 2011/2012 was collected over a period of 5 to 9 months, during which households were visited three times. The first round was in September–October 2011, the second round in November–December 2011 and the third round in January–March 2012. The collected data were from the production year 2011. The LSMS-ISA Ethiopia data (2011/2012) covered all regional states except the capital, Addis Ababa, and formed a subset of the national agricultural sample survey. The LSMS-ISA survey was implemented in 290 rural and 43 small-town enumeration areas and includes all rural areas and small towns of Ethiopia except three zones of the Afar and six zones of the Somalia regions. The sample design provides representative estimates at the national level (excluding the nine zones in Afar and Somalia regions) for all rural households and for the combination of rural-area and small-town households. The regions of Ethiopia served as the strata of the two-stage sample design. Quotas were set to ensure a minimum number of enumeration areas from each region. In the rural enumeration areas, a total of 12 households were sampled per enumeration area; 10 agricultural households were randomly selected from the agricultural sample survey, while the other two households were randomly selected. Population weights were included and applied to raise the sample households to national values for rural areas and small towns.

In addition to the standard survey questions relating to socio-economic variables of households, the survey also comprised information on the post-harvest characteristics of crop production, including self-reported quantitative estimates of post-harvest losses, self-reported causes of these losses and information on the post-harvest storage method and methods used to protect the cereals during the storage period. This information was used to gain insight into the post-harvest storage management and the scale, causes and self-reported percentages of post-harvest loss in major cereal crops; see Table 2 for an overview of the LSMS-ISA questions used for this purpose. The farmers’ self-reported post-harvest loss estimates in the LSMS-ISA are considered to represent upper bounds when compared to storage and handling losses reported in the literature as they may include losses due to handling, drying, storage and marketing (Kaminski and Christiaensen 2014).

Table 2 Living Standards Measurement Study-Integrated Surveys on Agriculture (LSMS-ISA) data variables of Ethiopia (2011/2012) used 1) to characterize post-harvest management and post-harvest losses, 2) as potential predictor variables to model post-harvest losses, 3) in both types of analyses

The households were further geo-referenced, using publicly accessible spatial databases. Information was provided on, for example, the distance to markets, annual rainfall and elevation. The survey data and detailed information on the sampling procedures, questionnaires, implementation of survey procedures and the spatial databases used can be found at the LSMS-ISA websites, accessible through www.worldbank.org.

2.2 Data processing

The LSMS-ISA project uses a number of topical questionnaires related to household, agriculture, livestock and community. We only used information collected through the household and agricultural questionnaires. The data were stored per topic in several files in which the households were identified by unique identifiers. A survey-specific C# script was written that used the unique household identifier to extract and collect all relevant information for data analysis on a per-household basis. The newly prepared data file combined the information on post-harvest losses with the values of a number of variables. Post-harvest losses were recorded as such (percentages) or estimated using the amount of crop losses in SI units and the amount of harvested crop in SI units. For the Ethiopia data, 22 predictor variables were selected (Table 2), which can be categorized into demographic characteristics of the household head, such as age, gender and level of education; post-harvest management characteristics, such as cereal crop type, storage method and protection method; geo-referenced statistics, such as distance of the household dwelling to the nearest main road and nearest market, climate (average annual rainfall, total rainfall in 2011, annual mean temperature) and elevation; and wealth status of the household, approximated by the type of energy source for cooking (e.g. electricity vs. fire wood) and by the main source of light for the household (e.g. electricity meter vs. kerosene lamp). Prior to the analysis, population weights for households were normalized to uniform weights. Although the original questionnaires contained many more variables, only those variables for which an association with post-harvest losses could be expected were selected for analysis with RF.
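The two data-preparation steps described above, linking files through the household identifier and deriving a loss percentage from reported quantities, can be illustrated with a minimal sketch. It is written in Python rather than in the C# script that was actually used, and all field names (household_id, loss_pct, lost_kg, harvested_kg, dist_market_km) are hypothetical placeholders, not the LSMS-ISA variable codes.

```python
# Illustrative sketch only; field names are assumptions, not LSMS-ISA codes.

def loss_percentage(record):
    """Return the post-harvest loss in percent, or None if it cannot be derived."""
    # Case 1: the loss was recorded directly as a percentage.
    if record.get("loss_pct") is not None:
        return float(record["loss_pct"])
    # Case 2: derive it from the lost and harvested amounts (same unit).
    lost = record.get("lost_kg")
    harvested = record.get("harvested_kg")
    if lost is None or harvested is None or harvested <= 0:
        return None  # missing estimate
    return 100.0 * lost / harvested


def merge_by_household(crop_records, household_info):
    """Attach household-level variables to each crop record via the household id."""
    merged = []
    for rec in crop_records:
        row = dict(household_info.get(rec["household_id"], {}))
        row.update(rec)
        row["loss_pct"] = loss_percentage(rec)
        merged.append(row)
    return merged


# Toy input with made-up values.
crops = [
    {"household_id": "H1", "crop": "maize", "lost_kg": 5.0, "harvested_kg": 100.0},
    {"household_id": "H1", "crop": "teff", "loss_pct": 0.0},
    {"household_id": "H2", "crop": "wheat"},
]
households = {"H1": {"dist_market_km": 40.0}, "H2": {"dist_market_km": 75.0}}
rows = merge_by_household(crops, households)
```

In this sketch a record without a usable estimate keeps a missing loss value instead of being set to zero, so that records with a reported loss of 0% and records with missing data remain distinguishable.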

To assess participation bias, a nonresponse analysis was performed for the demographic characteristics, geo-referenced statistics and wealth status. For these variables, mean values were compared between the base cohort (5631 records) and the post-harvest loss cohort containing non-missing post-harvest losses only (3179 records).
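As an illustration of this comparison, the sketch below computes the mean of each variable for all records and for the records with a non-missing loss estimate. The variable names and values are made up for the example; the sketch does not reproduce the analysis reported in Table 3.

```python
from statistics import mean

def compare_cohorts(records, variables, outcome="loss_pct"):
    """Return {variable: (mean in base cohort, mean in cohort with reported loss)}."""
    responders = [r for r in records if r.get(outcome) is not None]
    summary = {}
    for var in variables:
        base_vals = [r[var] for r in records if r.get(var) is not None]
        resp_vals = [r[var] for r in responders if r.get(var) is not None]
        summary[var] = (mean(base_vals), mean(resp_vals))
    return summary


# Toy records (hypothetical values).
records = [
    {"age": 40, "dist_market_km": 50.0, "loss_pct": 0.0},
    {"age": 50, "dist_market_km": 70.0, "loss_pct": None},
    {"age": 45, "dist_market_km": 60.0, "loss_pct": 10.0},
    {"age": 45, "dist_market_km": 60.0, "loss_pct": None},
]
summary = compare_cohorts(records, ["age", "dist_market_km"])
```

Large differences between the two means of a variable would point to selective nonresponse with respect to that variable.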

2.3 Methods

To model post-harvest losses of cereals in Ethiopia we used Random Forests (RF), which is based on regression trees. A regression tree is a predictive modelling approach where many variables are mapped on a tree-like structure to predict a target value (Breiman et al. 1984). The outcome of the model can be graphically displayed as a binary tree showing how the dependent variable is affected by the predictor variables. The tree consists of nodes and branches and, depending on the value of a predictor variable at each node, one of two sub-branches is followed, finally ending in a leaf, which represents the target variable. The tree is grown using a training set and, at the same time, an independent set of data is mapped on this tree to evaluate the model.
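The decision taken at a single node can be sketched as follows. The example assumes the usual least-squares criterion for regression trees (the split that minimizes the summed squared error of the two sub-branches); the variable names and values are illustrative only.

```python
def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(x, y):
    """Return (threshold, error) of the best binary split of the form x <= threshold."""
    best = (None, sse(y))  # no split: error of predicting the overall mean
    levels = sorted(set(x))
    for lo, hi in zip(levels, levels[1:]):
        thr = (lo + hi) / 2  # candidate threshold halfway between observed values
        left = [yi for xi, yi in zip(x, y) if xi <= thr]
        right = [yi for xi, yi in zip(x, y) if xi > thr]
        total = sse(left) + sse(right)
        if total < best[1]:
            best = (thr, total)
    return best


# Toy data: losses are 0% below and 10% above a distance of about 40 km.
distance = [5, 10, 20, 60, 80, 90]
loss = [0.0, 0.0, 0.0, 10.0, 10.0, 10.0]
threshold, error = best_split(distance, loss)
```

A full tree repeats this search recursively within each sub-branch until a stopping rule is met; the mean of the target values in a leaf is the predicted value.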

Random Forests differs from regression trees in that not a single tree is grown but a large number of uncorrelated trees. This is the so-called forest. Each tree in the forest is grown using a bootstrap sample of the data. A vector of non-negative weights containing uniform probabilities is used to select cases as candidates for the bootstrap. After a large number of trees have been grown, the predicted value for each case becomes available as the combined result across all trees for which that case was not in the bootstrapped set. Another difference is that in standard regression trees a node is split using the best split among all predictor variables, whereas in RF a node is split using the best split among a random subset of input variables. Generating a forest of trees using bootstrapping in combination with random selection of predictor variables has several advantages compared to standard regression trees. In each bootstrap iteration a tree is grown using the training set and, at the same time, predicted values become available using the independent data. There is no need to prune: trees are grown very deep, but variance is reduced by averaging many trees. One of the drawbacks of a random forest is that some interpretability is lost but, in general, the performance of the final model is boosted (Breiman 2001).
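The interplay of bootstrapping, random variable selection and out-of-bag prediction can be sketched with a deliberately simplified forest. As an assumption made for brevity, each tree below is a one-split stump rather than a deep tree, so the random subset of variables is drawn once per tree (which, for a stump, coincides with drawing it per node). The function and parameter names (oob_predictions, n_trees, mtry) are illustrative and are not the interface of the R packages.

```python
import random

def sse(values):
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def fit_stump(rows, y, features):
    """Fit a one-split tree using the best split among the candidate features."""
    best = None  # (error, feature, threshold, left mean, right mean)
    for f in features:
        levels = sorted({r[f] for r in rows})
        for lo, hi in zip(levels, levels[1:]):
            thr = (lo + hi) / 2
            left = [yi for r, yi in zip(rows, y) if r[f] <= thr]
            right = [yi for r, yi in zip(rows, y) if r[f] > thr]
            err = sse(left) + sse(right)
            if best is None or err < best[0]:
                best = (err, f, thr, sum(left) / len(left), sum(right) / len(right))
    if best is None:  # no split possible: predict the sample mean
        m = sum(y) / len(y)
        return lambda r: m
    _, f, thr, left_mean, right_mean = best
    return lambda r: left_mean if r[f] <= thr else right_mean


def oob_predictions(rows, y, n_trees=200, mtry=1, seed=0):
    """Average, per case, the predictions of the trees that did not see that case."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    sums, counts = [0.0] * n, [0] * n
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap, uniform probabilities
        in_bag = set(idx)
        feats = rng.sample(range(n_features), mtry)  # random subset of variables
        tree = fit_stump([rows[i] for i in idx], [y[i] for i in idx], feats)
        for i in range(n):
            if i not in in_bag:  # out-of-bag case for this tree
                sums[i] += tree(rows[i])
                counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


# Toy data: column 0 drives the outcome, column 1 is noise.
rows = [[5.0, 1.0], [10.0, 0.0], [20.0, 1.0], [30.0, 0.0],
        [60.0, 1.0], [70.0, 0.0], [80.0, 1.0], [90.0, 0.0]]
y = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
preds = oob_predictions(rows, y)
```

Because every prediction for a case comes only from trees that did not use that case for fitting, the out-of-bag predictions behave like predictions on independent data.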

Random Forests can effectively handle large datasets that contain many variables with complex relationships. Though initially developed to maximize the predictive performance of the model, RF has a number of methods available for exploratory data analysis and interpretation of complex nonlinear relationships between explanatory and outcome variables. Graphical methods, like partial dependence plots, extract this information and visualize the relation between predictor variable and outcome. Variable importance scores can be calculated that indicate the relevance of a predictor variable for the outcome of the model. Values close to zero indicate that the variable is not important, whereas high values indicate that the predictive power of the forest is improved by including the variable.
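One common way to obtain such a score is permutation importance: the values of one variable are shuffled and the resulting increase in prediction error is recorded. The sketch below assumes this definition and, as a simplification, applies it to a fixed data set and an arbitrary prediction function instead of the out-of-bag cases of each tree; the toy model is hypothetical.

```python
import random

def mse(predict, rows, y):
    return sum((predict(r) - yi) ** 2 for r, yi in zip(rows, y)) / len(y)


def permutation_importance(predict, rows, y, feature, n_repeats=20, seed=0):
    """Mean increase in mean squared error after shuffling one variable."""
    rng = random.Random(seed)
    baseline = mse(predict, rows, y)
    increases = []
    for _ in range(n_repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)  # break the link between this variable and the outcome
        permuted = [r[:feature] + [v] + r[feature + 1:] for r, v in zip(rows, column)]
        increases.append(mse(predict, permuted, y) - baseline)
    return sum(increases) / n_repeats


def toy_model(row):
    # Hypothetical model: the outcome depends on column 0 only.
    return 0.1 * row[0]


rows = [[10.0, 1.0], [20.0, 0.0], [40.0, 1.0], [80.0, 0.0]]
y = [toy_model(r) for r in rows]
imp_used = permutation_importance(toy_model, rows, y, feature=0)
imp_unused = permutation_importance(toy_model, rows, y, feature=1)
```

Shuffling a variable that the model does not use leaves the error unchanged, which gives an importance of zero, while shuffling an informative variable increases the error.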

When RF is used for prediction only, there is no need to remove non-informative variables. In this study, where the aim is to improve understanding of the determinants of post-harvest losses, the number of variables in the model was reduced based on importance scores to improve the interpretation and understanding of factors contributing to post-harvest losses. Partial dependence plots were used to visualize these relations. The partial effect of a variable was constructed for a range of evenly spaced values of the variable of interest, while keeping the values of the other variables unchanged. By taking the average prediction of the RF over all other covariates at each such value, the conditional effect of the variable of interest was calculated. Partial dependence co-plots are useful for investigating the combined effect of two variables on the response or for visualizing pairwise interaction effects among variables. For a categorical conditioning variable, the partial effect of a continuous variable was calculated conditional on group membership. For a continuous conditioning variable, group membership was obtained by stratifying the conditioning variable into subgroups. For the relevant variables of the LSMS-ISA data, two groups of equal size were created, one with values below and one with values above the median. Then, the partial effect of a variable was calculated conditional on membership of the low or high group of the grouping variable.
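The construction of a partial dependence curve, and of its conditional version based on a median split, can be sketched as follows. The prediction function stands in for a fitted forest and is a hypothetical additive model; assigning values equal to the median to the low group is an assumption of this sketch.

```python
from statistics import median

def evenly_spaced(lo, hi, n):
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def partial_dependence(predict, rows, feature, grid):
    """For each grid value, fix the variable at that value and average the predictions."""
    curve = []
    for value in grid:
        preds = [predict(r[:feature] + [value] + r[feature + 1:]) for r in rows]
        curve.append(sum(preds) / len(preds))
    return curve


def conditional_partial_dependence(predict, rows, feature, by, grid):
    """Partial dependence computed separately below and above the median of `by`."""
    cut = median(r[by] for r in rows)
    low = [r for r in rows if r[by] <= cut]
    high = [r for r in rows if r[by] > cut]
    return {
        "low": partial_dependence(predict, low, feature, grid),
        "high": partial_dependence(predict, high, feature, grid),
    }


def toy_model(row):
    # Hypothetical additive model: column 0 = distance to market, column 1 = distance to road.
    return 0.1 * row[0] + 0.05 * row[1]


rows = [[10.0, 5.0], [30.0, 15.0], [50.0, 25.0], [70.0, 35.0]]
grid = evenly_spaced(0.0, 100.0, 3)
curve = partial_dependence(toy_model, rows, feature=0, grid=grid)
by_road = conditional_partial_dependence(toy_model, rows, feature=0, by=1, grid=grid)
```

For an additive model such as this one, the low and high curves are parallel; curves with different slopes in the two groups would indicate an interaction between the two variables.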

Random Forests is available in a number of R packages. We used the randomForest package (Liaw 2015; Liaw and Wiener 2002, available at http://cran.r-project.org/package=randomForest) and the randomForestSRC package, version 2.4.1 (Ishwaran and Kogalur 2014; available at http://cran.r-project.org/package=randomForestSRC). The latter is complemented by the ggRandomForests package (Ehrlinger 2015), which implements tools for extracting intermediate data objects from the randomForestSRC package and uses the ggplot2 graphics package (Wickham 2009) to visualize RF models.

3 Results

3.1 General household information

Table 3 describes the major characteristics of the surveyed households in Ethiopia, which totalled 2472 unique households with cereals. Since most households managed several plots, the analysis comprises information on 5631 cereal plot records. For 3179 of these records, self-reported post-harvest losses of 0% or higher were reported. In Table 3, the means for the major variables are shown, both for the base cohort (n = 5631) and for the post-harvest loss cohort with reported values only (n = 3179; 56%). As shown, mean values for age, female-headed households, annual rainfall, distance to the main road and distance to the nearest market were generally similar for both cohorts, indicating that the post-harvest loss cohort was an unbiased sample from the base cohort for cereals.

Table 3 Characteristics of households in Ethiopia (2011/2012) from the Living Standards Measurement Study-Integrated Surveys on Agriculture (LSMS-ISA) used to characterize post-harvest management and post-harvest losses

Maize is the major cereal crop, reported in 29% of all cereal records, while teff (21%) and sorghum (18%) are also major cereals. Wheat and barley together make up 25% of the cereal records, while the numbers of records for millet and, especially, for oats, rice and other cereals are negligible.

The average age of the household heads was 44 years, and 14% of the households were headed by females. The majority of the surveyed households (59%) were illiterate. Households lived far from the nearest market, on average 58 km, while the distance to the main road was on average 15 km. Average annual rainfall was about 905 mm and the annual mean temperature was 18.7 °C.

3.2 Post-harvest storage management characteristics

Table 4 shows the different storage methods used in Ethiopia for the major cereals: maize, sorghum, teff, wheat and barley. Differences in storage method among the cereal types were relatively small. The most important method of storing cereals was in bags in the house, which was reported for about 46% of the cereal records. Modern storage methods, such as metal silos, were hardly used. ‘Other’ storage methods, 39% of all responses, probably refer largely to the widely used traditional Gotera. This is an elevated storage platform, often made from locally available material, on which grains are stored close to dwellings. In the LSMS-ISA survey of Ethiopia 2013/2014, the option ‘traditional storage’ was added as a possible response and accounted for about 26% (unweighted) of all cereal storage methods (data not shown).

Table 4 Post-harvest storage methods in Ethiopia (2011/2012) expressed as percentage of the total number of storage methods used for each cereal type without taking into account the records with missing information on storage methods

Table 5 shows the different protection methods applied by farmers to reduce storage losses. The most frequently used protection measure for cereals in Ethiopia was elevation (for 43% of all cereals), which relates to the traditional Gotera storage platform (see previous paragraph). Elevation as a protection method was especially used for teff, wheat and barley. Traditional and modern pesticides tended to be used more frequently in maize and sorghum, accounting for 32 and 30%, respectively, of all protection methods used in these crops. At least 30% of the respondents did not use any protection method, while another 26% did not provide a response on the method used for protection.

Table 5 Protection methods used by farmers in Ethiopia (2011/2012) during the storage period

3.3 Post-harvest losses and causes

Table 6 shows the average percentage of self-reported post-harvest loss estimates per cereal type and cause. These estimates are restricted to records with losses higher than 0% (n = 529). As shown in Table 1, only 10% of the cereal records contained losses higher than 0%; therefore, some of the average losses shown in Table 6 are based on a few estimates only. Given this limitation, the average self-reported post-harvest loss over all cereals was 24%, with a somewhat higher loss for wheat (27%) and a lower loss for teff (21%). The average post-harvest loss estimate due to ‘other’ factors was highest, at 35%. The average post-harvest loss estimates due to insects and due to rotting were both 27%, while theft showed the lowest average loss, 5%.

Table 6 Average percentage of self-reported post-harvest loss (%) in Ethiopia (2011/2012) per cereal type and cause (n = 529)

Table 7 shows the frequency of self-reported causes of post-harvest loss expressed as the percentage of the total number of reported causes of loss per cereal (n = 529). Rodents and other pests were most frequently reported as causes of post-harvest losses, on average 46%. The highest percentage was found for maize, 52%, and the lowest for teff, 32%. Only for teff were ‘other’ causes of post-harvest loss (36%) reported more frequently than rodents and other pests.

Table 7 The frequency of self-reported causes of post-harvest loss (> 0%) in Ethiopia (2011/2012) expressed as percentage of the total number of reported causes per crop type (n = 529)

3.4 Modelling post-harvest losses

The Ethiopian records with self-reported post-harvest losses were analysed (n = 3179); records with missing post-harvest loss values and records with post-harvest losses >100% were excluded. Missing values in the data matrix were imputed using the proximity matrix from the forest. The number of trees was 500 and the initial number of variables tried at each split was 7. After imputation, a full random forest was grown (ntree = 1000). The percentage of explained variance was 31%.
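For reference, the percentage of explained variance of a regression forest is commonly computed from the out-of-bag predictions as 100 × (1 − MSE / Var(y)). The sketch below assumes this definition, with the variance computed using divisor n; it is not taken from the source code of the packages used, and the values are toy data.

```python
def explained_variance(y, oob_pred):
    """Percentage of variance explained: 100 * (1 - MSE_oob / Var(y))."""
    n = len(y)
    mean_y = sum(y) / n
    mse = sum((yi - pi) ** 2 for yi, pi in zip(y, oob_pred)) / n
    var_y = sum((yi - mean_y) ** 2 for yi in y) / n
    return 100.0 * (1.0 - mse / var_y)


y = [0.0, 0.0, 10.0, 10.0]
perfect = explained_variance(y, [0.0, 0.0, 10.0, 10.0])
mean_only = explained_variance(y, [5.0, 5.0, 5.0, 5.0])
```

Under this definition, a model that always predicts the overall mean explains 0% of the variance, and the value becomes negative when the out-of-bag predictions are worse than the mean.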

Figure 1 shows the importance scores, normalized by the standard deviation, of the predictor variables in the model for Ethiopia. Gender, age and variables related to education have low scores and are not useful for explaining post-harvest loss. Scores for crop type and methods of protection during storage were slightly higher but their contribution to the predictive power of the model was also negligible. Geo-referenced variables such as latitude, distance of the household dwelling to the main road and nearest market and different variables used to characterise rainfall are important and informative in describing post-harvest losses.

Fig. 1 The importance scores of variables in explaining the self-reported post-harvest loss of cereals in Ethiopia (2011/2012)

In the full model, post-harvest losses were modelled using 22 predictor variables. To end up with a set of variables that facilitates understanding of the causes of post-harvest loss, we removed variables from the model, one at a time, starting with the variable with the lowest importance score. Then, the random forest was grown again and the next variable with the lowest importance score was removed. This process was repeated until the percentage of explained variance dropped substantially. The reduced model contained four variables, and the percentage of explained variance dropped from 31 to 27%. Distance from the household to the main road and to the nearest market, average annual rainfall and latitude are important determinants of post-harvest losses for the Ethiopian data. Because latitude did not add to the interpretation of the model and because some level of confounding with distance to the nearest market, distance to the main road and average annual rainfall could be present, we dropped this variable. The percentage of explained variance dropped to 26%, meaning that the predictive value of the model was hardly influenced by leaving out latitude. Figure 2 shows the importance scores of the reduced model. In the RF analysis all cereals were pooled, despite the fact that crops may behave differently during storage and that pests attacking the harvest may differ among crops. Recognizing this, the variable crop type was used as a predictor variable in the model. The importance score for the variable crop type indicates that incorporating this variable in the model does not contribute to the explanation of the variability in post-harvest losses and that this variable could equally well be dropped (Fig. 1).
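The stepwise removal of variables can be sketched as a simple loop. The stopping rule used here (stop when the explained variance falls more than a chosen number of percentage points below that of the full model) is an assumption, since the text only states that the process stopped when the explained variance dropped substantially. The fit function is a hypothetical stand-in for growing a forest, and all numbers are made up.

```python
def backward_elimination(variables, fit, max_drop):
    """Drop the least important variable repeatedly.

    fit(vars) must return (explained variance in %, {variable: importance}).
    Stops before a removal that would lower the explained variance by more
    than max_drop percentage points relative to the full model.
    """
    current = list(variables)
    full_score, importance = fit(current)
    while len(current) > 1:
        weakest = min(current, key=lambda v: importance[v])
        candidate = [v for v in current if v != weakest]
        score, candidate_importance = fit(candidate)  # grow the forest again
        if full_score - score > max_drop:
            break
        current, importance = candidate, candidate_importance
    return current


# Hypothetical stand-in for a forest: each variable adds a fixed contribution.
CONTRIBUTION = {"dist_market": 12.0, "dist_road": 9.0, "rainfall": 6.0,
                "crop_type": 0.5, "gender": 0.2}

def toy_fit(variables):
    score = sum(CONTRIBUTION[v] for v in variables)
    return score, {v: CONTRIBUTION[v] for v in variables}


selected = backward_elimination(list(CONTRIBUTION), toy_fit, max_drop=2.0)
```

Recomputing the importance scores after every removal matters because the importance of the remaining variables can change once a correlated variable has been dropped.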

Fig. 2 The importance of the main variables in explaining the self-reported post-harvest loss of cereals in Ethiopia (2011/2012)

To interpret the random forest, partial dependence plots have been made to visualize the individual effect of the three continuous predictor variables on post-harvest losses, i.e. distance of the household dwelling to main roads and nearest markets and average annual rainfall (Fig. 3). For the distance of the household to both the nearest market and the main road, the trend is that post-harvest losses are higher when the distance increases. The linear trend for average annual rainfall shows a negative effect on post-harvest losses. However, the individual data points on both extremes of the plot suggest that losses at low (< 600 mm) and high (>1200 mm) annual rainfall may be higher.

Fig. 3 Partial dependence plots for the major predictor variables for post-harvest loss (%) in cereals in Ethiopia (2011/2012): distance to nearest markets, distance to main road and average annual rainfall. The shaded areas around the lines indicate the 95% confidence interval

Because interactions are likely to occur, the combined effect of the predictor variables is of real interest. To get a first idea about the size and direction of the effect, the values of distance to the main road and of distance to the nearest market were each divided into two groups of equal size, using the median as splitting value, yielding a group with low values and a group with high values. For each group, the partial effect on post-harvest loss of the other variable was estimated and plotted, and the trend was indicated by a linear function through the estimated values for post-harvest loss. Figure 4 shows the partial effect of distance of households to the nearest market conditional on the low (< 23 km) and high (> 23 km) group for distance of households to the main road. As also shown in Fig. 3, self-reported estimates of post-harvest losses increase for households located further away from the nearest market. Whether households live close to the main road (continuous line in Fig. 4) or further away (dotted line in Fig. 4) has hardly any effect on this relation.

Fig. 4 The partial effect on post-harvest losses (%) in cereals of the distance of households to the nearest market (km) conditional on the low (< 23 km) and high (> 23 km) group for distance of households to the main road (km) in Ethiopia (2011/2012). The shaded areas around the lines indicate the 95% confidence interval

In Fig. 5 the partial effect of distance of households to the main road conditional on the low (< 63 km) and high (> 63 km) group for distance of households to the nearest market is shown. Estimated post-harvest losses increase with distance to the main road to the same extent for households living further away from the nearest market (> 63 km, dotted line in Fig. 5) and for households with a relatively nearby market (< 63 km, continuous line in Fig. 5).

Fig. 5 The partial effect on post-harvest loss (%) in cereals of the distance of households to the main road (km) conditional on the low (< 63 km) and high (> 63 km) group for distance of households to the nearest market (km) in Ethiopia (2011/2012). The shaded areas around the lines indicate the 95% confidence interval

In Fig. 6 the partial effect of average annual rainfall conditional on the low (< 23 km) and high (> 23 km) group for distance of households to the main road is shown. Households living near a main road and households living further away from one both reported higher post-harvest losses under low rainfall conditions. There is no interaction effect, i.e. the effect of rainfall on self-reported post-harvest losses is not influenced by the distance of households to the main road.

Fig. 6 The partial effect on post-harvest losses (%) in cereals of average annual rainfall (mm) conditional on the low (< 23 km) and high (> 23 km) group for distance of households to the main road (km) in Ethiopia (2011/2012). The shaded areas around the lines indicate the 95% confidence interval

Figure 7 shows the partial effect of average annual rainfall conditional on the low (< 63 km) and high (> 63 km) group for distance of households to the nearest market. Both types of households reported higher post-harvest losses under low rainfall conditions independent of the distance to the nearest market.

Fig. 7 The partial effect on post-harvest loss (%) of average annual rainfall (mm) conditional on the low (< 63 km) and high (> 63 km) group for distance of households to the nearest market (km) in Ethiopia (2011/2012). The shaded areas around the lines indicate the 95% confidence interval

4 Discussion and conclusions

We started by analysing LSMS-ISA national survey data of more than 15,000 households and more than 25,000 cereal records from four countries (Ethiopia, Malawi, Tanzania and Uganda), covering seven years, to gain better insight into post-harvest storage management and post-harvest losses of cereals in SSA. However, in the four datasets of Tanzania and Malawi about 90% of the responses on the farmers’ self-reported post-harvest losses in cereals were missing (Table 1). Such large amounts of missing data do not allow meaningful modelling and analysis of post-harvest losses, and highlight serious problems in the data collection and quality management of these large survey data sets. The datasets of Ethiopia (2013/2014) and Uganda were the most complete but indicated the prevalence of post-harvest losses in only 2% of the cereal records, which is also not helpful for modelling and better understanding post-harvest losses. Therefore, in this paper we focused exclusively on the dataset of Ethiopia (2011/2012). Although 44% of the self-reported post-harvest loss estimates were missing in this dataset, the difference between the sample with post-harvest data and the base cohort (including missing values) was negligible (Table 3). Therefore, our findings and conclusions can be considered representative of the entire data set of Ethiopia (2011/2012).

The LSMS-ISA data could potentially have a number of advantages over the use of other post-harvest loss data sources such as case study data and expert estimates of losses (Kaminski and Christiaensen 2014): (i) sample bias is avoided because the survey data provide nationally-representative samples of agricultural households and the post-harvest losses these households report; (ii) harmonization in the survey methodology facilitates comparison of the outcomes across years and countries. However, the main reason for using the LSMS-ISA data in our study is that the multi-topic and geo-referenced survey approach helps to improve our understanding of those agro-ecological factors (e.g. altitude, rainfall, storage methods) and socio-economic conditions of households (e.g. wealth of household, distance to market) that favour post-harvest losses. This helps to better target interventions aimed at reducing post-harvest losses. One disadvantage of the LSMS-ISA data is that the post-harvest loss estimates are based on subjectively reported information from farmers, which may be less accurate than measured loss data. However, the practical, methodological and conceptual challenges of accurately measuring post-harvest losses at farm level are great (Parfitt et al. 2010; Hodges 2013; Affognon et al. 2015). Another disadvantage of the LSMS-ISA data is that the current post-harvest loss estimate is an aggregate of all possible losses that may occur during the entire post-harvest chain. More detailed information on where losses in the post-harvest chain occur would be useful to better target interventions aimed at reducing such losses. Understanding of post-harvest management and losses can be increased considerably by adding survey questions on the post-harvest losses incurred during different stages of the post-harvest chain, such as harvesting, drying, winnowing and storage. The currently available post-harvest loss estimates from LSMS-ISA data disclose losses that farmers regard as important and therefore provide an appropriate yardstick for identifying the losses that most need to be reduced through targeted interventions.

As indicated at the beginning of this section, the major reason for not using more LSMS-ISA data was the large amount of missing data with respect to self-reported post-harvest loss estimates, especially in the datasets of Tanzania and Malawi. The reasons for the large amount of missing data are unknown, and the gaps are not limited to post-harvest management and loss variables. Other agricultural and household data that were not used in our study also showed large data gaps, which raises doubts about the survey implementation and data quality control. The Ethiopian data set (2011/2012) appeared to be the most complete and allowed an insight into current post-harvest management of cereals to be gained through Random Forests analysis. This method disentangled those factors that induce or relate to post-harvest losses of cereals in Ethiopia. More than 50% of the cereals in Ethiopia are stored inside the house in bags or piles. Traditional elevated storage platforms (Goteras) close to the dwellings are also still frequently used. Elevation was the most used protection method, while traditional and modern pesticides tended to be more frequently used in maize and sorghum than in teff, barley or wheat. The farmers reporting post-harvest loss estimates faced an average grain weight loss of 24%, within a relatively small range from a lowest loss of 21% for teff to a highest loss of 27% for wheat (Table 6). Storage losses of teff are known to be lower than those of other cereals because of the small grain size, which makes teff more resistant to insect attacks than other cereals (WB/NRI/FAO 2011). The most frequently reported causes of post-harvest losses were rodents and other pests, except for teff, for which ‘other causes’ were more important (Table 7). The high post-harvest loss estimates in cereals due to ‘other causes’ (35%; Table 6) and the high frequency of reported ‘other causes’ of post-harvest losses (23%; Table 7) call for more in-depth research into these causes.

In the modelling of post-harvest losses of cereal crops using Random Forests we pooled the data for cereal crops. Although the pooling of all cereal crops in one analysis may be criticized, the outcome of the Random Forest shows that post-harvest losses did not depend significantly on the type of crop (Fig. 1). The crop variable was in the first set of 22 predictor variables, but the importance score was low. After dropping crop type as a predictor variable from the model, the percentage of explained variance did not drop significantly, indicating that the type of crop was not important in explaining post-harvest losses.
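How a low importance score for crop type can arise may be illustrated with a permutation-style importance measure. The sketch below is not the analysis reported here: the way the importance score was computed is not restated in this section, so permutation importance is an assumption, and it is a related check rather than the drop-and-refit comparison described above. The function `predict_loss`, its coefficients and the simulated records are hypothetical stand-ins for a fitted forest and the survey data.

```python
import random
from statistics import mean


def predict_loss(row):
    """Hypothetical stand-in for a fitted model (not the paper's forest)."""
    crop_shift = {"teff": -0.2, "maize": 0.1, "wheat": 0.1}[row["crop"]]
    return 15.0 + 0.08 * row["market_km"] + crop_shift


def mse(rows, targets, predict):
    """Mean squared error of the predictions against the targets."""
    return mean((predict(r) - y) ** 2 for r, y in zip(rows, targets))


def permutation_importance(rows, targets, feature, predict, rng):
    """Increase in MSE after shuffling the values of one feature."""
    baseline = mse(rows, targets, predict)
    shuffled = [r[feature] for r in rows]
    rng.shuffle(shuffled)
    permuted = [{**r, feature: v} for r, v in zip(rows, shuffled)]
    return mse(permuted, targets, predict) - baseline


rng = random.Random(42)
# Simulated records (made up): crop type and distance to market in km.
rows = [
    {"crop": rng.choice(["teff", "maize", "wheat"]),
     "market_km": rng.uniform(1, 200)}
    for _ in range(400)
]
# Simulated loss (%) = stand-in model plus noise.
targets = [predict_loss(r) + rng.gauss(0, 3) for r in rows]

importance = {
    f: permutation_importance(rows, targets, f, predict_loss, rng)
    for f in ("crop", "market_km")
}
for name, score in importance.items():
    print(f"{name}: increase in MSE = {score:.2f}")
```

In this simulated setting, shuffling the distance to market degrades the predictions far more than shuffling crop type, which is the pattern a low importance score for crop type reflects.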

RF is well suited for the analysis of large, noisy data sets that exhibit highly irregular patterns, nonlinear effects and interacting variables. Moreover, RF can easily handle many predictor variables, whether or not they are correlated or interact, and does not require any distributional assumptions. Classical approaches to modelling such data, such as ANOVA, regression or (generalized linear) mixed models, are often hampered by the high number of potentially available predictor variables that interact in unexpected ways, making it hard to identify important determinants or combinations of determinants of post-harvest losses. Without explicit pre-specification, complex nonlinear relationships are often missed, which complicates the disclosure of new features of the data and may easily lead to spurious conclusions (Kaminski and Christiaensen 2014; Krupnik et al. 2015). This is also confirmed by Flack and Chang (1987), who demonstrated that variable selection within a set of noisy predictor variables frequently resulted in selected subsets of noise variables. Therefore, RF was a flexible method to explore the Ethiopian data. The initial model for the Ethiopian dataset (2011/2012) contained 22 variables and explained 31% of the variance. Many variables were correlated, which compromised the interpretation of the model. For example, for describing precipitation, the LSMS-ISA data used four variables, all characterising different aspects of the amount of rainfall during a year. The RF algorithm turned out to be flexible enough to allow confounded variables to be dropped without losing too much of its performance. By reducing the model, its interpretability was improved at the expense of some predictive power. Therefore, the results derived with RF may be considered the best attempt to obtain the information embedded in the data. Because of the advantages mentioned above, exploratory data analysis with RF will probably outperform classical approaches such as regression.
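The explained variance quoted above (31% for the initial 22-variable model) is reported without restating how it is computed. One common definition for a regression forest, assumed here rather than taken from the analysis itself, compares the out-of-bag squared prediction error with the total variation in the observed losses:

```latex
\[
R^2_{\mathrm{OOB}}
  \;=\; 1 \;-\;
  \frac{\sum_{i=1}^{n} \bigl( y_i - \hat{y}_i^{\mathrm{OOB}} \bigr)^2}
       {\sum_{i=1}^{n} \bigl( y_i - \bar{y} \bigr)^2}
\]
```

Here $y_i$ is the self-reported loss (%) of record $i$, $\hat{y}_i^{\mathrm{OOB}}$ is the prediction averaged over the trees that were grown without record $i$, and $\bar{y}$ is the mean reported loss; these symbols are introduced for illustration only. Under this definition, a value of 0.31 means that the out-of-bag squared error is 69% of what would be obtained by predicting the mean loss for every record.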

The percentage of variance explained by the final model was low for the Ethiopian 2011/2012 data. Nevertheless, this dataset can be considered an illustration of how RF can be used to analyse such large data sets and how methods such as partial dependence can be used to extract substantive insights from the forest. In the Ethiopia 2011/2012 data the distance of the household dwelling to the nearest market and main road, and the average annual rainfall were identified as major factors that affected post-harvest losses in cereals. In Ethiopia, households living further away from markets and main roads reported the highest post-harvest losses, while lower rainfall, which was associated with higher losses, had a minor effect compared to the remoteness of households. Therefore, infrastructure and access to markets are of major importance not only for stimulating agricultural productivity, growth and development but also for reducing post-harvest losses (Dorosh et al. 2012; Tefera 2012). Thus, the reduction of post-harvest losses in Ethiopia requires large public investments, but these are complementary to the investments required for achieving productivity growth and food security (Rosegrant et al. 2015).
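The idea behind a partial dependence curve that is conditional on a grouping variable, as used for rainfall and distance to market, can be sketched as follows. This is a minimal, model-agnostic illustration under stated assumptions: `predict_loss` is a hypothetical stand-in for a fitted forest, its 900 mm rainfall cut-off and effect sizes are made up so that the curves have a visible shape, and the households are simulated. Only the 63 km split between the low and high distance groups is taken from the text.

```python
import random
from statistics import mean


def predict_loss(row):
    """Hypothetical stand-in for a fitted forest: predicted loss (%)."""
    rain_effect = 6.0 if row["rainfall_mm"] < 900 else 0.0   # made-up effect
    remote_effect = 8.0 if row["market_km"] > 63 else 0.0    # made-up effect
    return 18.0 + rain_effect + remote_effect


def partial_dependence(rows, feature, grid, predict):
    """Average prediction over all rows with `feature` set to each grid value."""
    curve = []
    for value in grid:
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, mean(preds)))
    return curve


random.seed(0)
# Simulated households (made up): annual rainfall in mm, market distance in km.
households = [
    {"rainfall_mm": random.uniform(400, 2000),
     "market_km": random.uniform(1, 200)}
    for _ in range(500)
]

grid = [500, 800, 1100, 1400, 1700]  # rainfall values (mm) to evaluate
low = [r for r in households if r["market_km"] < 63]
high = [r for r in households if r["market_km"] > 63]

pd_low = partial_dependence(low, "rainfall_mm", grid, predict_loss)
pd_high = partial_dependence(high, "rainfall_mm", grid, predict_loss)

for (rain, loss_low), (_, loss_high) in zip(pd_low, pd_high):
    print(f"{rain} mm: low distance {loss_low:.1f}%, high distance {loss_high:.1f}%")
```

Because the rainfall value is overwritten for every household in a group before averaging, each curve isolates the modelled effect of rainfall within that group; plotting the two curves side by side gives a figure of the same kind as Fig. 7.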

The LSMS-ISA infrastructure is well placed within national statistical agencies to collect survey information on post-harvest management and self-reported post-harvest losses across the range of crops grown in various SSA countries. This information is potentially suitable for modelling post-harvest losses and identifying generic factors and conditions favouring losses. However, the available LSMS-ISA information on both storage management and self-reported post-harvest losses shows that implementation of the surveys differs greatly across countries. Overall, information on crop storage, protection methods used during the storage period and, above all, on the self-reported post-harvest loss estimates is incomplete in various LSMS-ISA data sets. This hampers the identification and quantification of important variables and conditions associated with post-harvest losses in SSA, which could help to identify appropriate interventions to reduce post-harvest losses. More emphasis should be placed on improving the quality, relevance and use of data and on checking data at the early stages of data collection. This paper is therefore a call for raising awareness of the importance of post-harvest management and losses at every level, but also a call for better data collection, for which the infrastructure is already in place.