1 Introduction

In the past decades, consumers in rich countries have increasingly taken an interest in the social and environmental aspects of the products they buy. This manifests in the rising demand for agricultural products that carry labels such as Fairtrade, Organic, or Rainforest Alliance, so-called voluntary sustainability standards (VSS). Hainmueller et al. (2015), for example, report an average annual growth rate in the demand for Fair Trade certified products in the U.S. of 40% between 1999 and 2008. Such labels are designed to improve the lives of producers in the Global South or to reduce the impact of the production process on the environment. To achieve this, the standards organizations (SO) behind the labels establish a catalog of criteria according to which goods have to be produced and/or traded in order to obtain their label. Certified producers are usually monitored by independent auditors. If the audit is passed, their product may carry a label signaling to consumers that they are purchasing a good that was produced in line with the standard.

Through the improvement of agricultural practices, VSS have a large potential to contribute to sustainable development in many areas. However, the scientific literature remains largely inconclusive on the effects of certification (see, e.g., Oya et al., 2018). Most studies try to assess the overall effect of certification by comparing certified to non-certified producers in a particular area while attempting to control for selection bias (see, e.g., Ruben and Fort, 2012; Waarts et al., 2014; Ingram et al., 2017). As highlighted in a detailed literature review by Oya et al. (2018), the findings of most studies are mixed and their quality is limited. One reason why it is difficult to identify the effect of certification could be compliance or, more specifically, the variance in the degree of compliance with a standard’s criteria among producers. Full compliance is usually not necessary to obtain a label. Thus, the effect of certification might depend on a producer’s degree of compliance. For example, if Ethiopian coffee producers frequently do not comply with environmental criteria, a study of whether certification improves environmental outcomes for Ethiopian coffee growers is likely to find no effect.

While not attempting to find direct causal evidence for such heterogeneous treatment effects, we contribute to the literature by investigating compliance with VSS in detail. In particular, we study audit data provided by Rainforest Alliance (RA), one of the biggest global players in the market of VSS. The data are at the certificate level, where a certificate refers to coffee and cocoa producers that are certified either as individual farms or as groups, such as cooperatives. Among the criteria to obtain RA certification, many are likely to be interrelated. Hence, our first hypothesis is that it is possible to identify patterns of non-compliance within the complex structure of the audit data. If there are clusters of producers with similar non-compliance patterns, they are likely to also be comparable along other dimensions. Thus, our second hypothesis is that it is possible to identify potential drivers of cluster affiliation. Identifying clusters of producers could help to develop tailored training courses. Additionally, finding clusters that struggle to comply in similar areas of the standard is the starting point for the analysis of what is driving non-compliance. A descriptive analysis sheds light on the type of producers that do not comply with different areas of the RA standard. To the best of our knowledge, we are the first to present detailed descriptive evidence on compliance with a VSS on a global level. Besides laying the foundation for causal inference studies or for classification, such information is interesting in itself since it might tell us who is having a hard time keeping up with the standard and who is not. This information can help to improve the standard through systematic assignment of training units or ex-ante risk assessment. Finally, as an additional exercise, we check whether it is possible to predict cluster affiliation based on producer characteristics. This sheds light on which variables have the most power in sorting producers into the pre-defined clusters.

To test these hypotheses, we take advantage of recent advances in machine learning. While machine learning techniques have been used extensively in recent years in other fields, they are a rather novel approach in research on VSS. One of the main advantages of these methods is that they uncover patterns and structure in complex data without supervision. We regard this as highly advantageous in our setting, where relying on previous theory or empirical results is hardly possible.

Finding structure in the extensive audit data is the first step to understanding compliance. Hence, we first propose an adapted k-modes clustering algorithm to uncover frequent compliance patterns and to group producers according to the criteria they do not comply with. We then use a large array of additional indicators, matched to producers using location data, to find common characteristics of certificates within each of these clusters. As a last step, we show the results of a classification exercise in which we try to predict cluster affiliation based on these characteristics.

We can identify four clusters: compliers, non-compliers with environmental criteria, non-compliers with management criteria, and non-compliers with social criteria. We find, among other insights, that producers in the first cluster (the compliers) are more likely to be large individual farms that have been certified for longer and are found predominantly in more developed regions of Central and South America. Producers in the second cluster (non-compliance with environmental criteria) tend to be located in the least developed regions. This cluster contains predominantly small coffee producers that are certified in groups, especially in Ethiopia. Weak management is often found in cocoa production in West Africa: most Ghanaian producers are assigned to this cluster. Finally, group certificates for coffee production in Central America, mainly in El Salvador, are over-represented in the last cluster, where issues with social criteria are prevalent.

The paper proceeds as follows. The next section summarizes the existing literature, followed by Sect. 3 outlining the data. The compliance patterns found through clustering are presented in Sect. 4. Section 5 characterizes the clusters, and Sect. 6 shows the results of the classification exercise. Section 7 gives a short summary of our results and discusses them in light of their limitations and the existing literature. This is followed by the implications of the results, and Sect. 9 concludes.

2 Literature

As indicated in the previous section, the literature on compliance with VSS is extremely sparse. We know of merely three papers addressing the issue. Kirumba & Pinard (2010) look at UTZ certification of coffee farmers in Kenya. They compare compliant farms to those that tried but never achieved certification. The results indicate that economic drivers determine compliance. However, the study does not provide any details on what the issues with compliance were; it merely suggests a positive selection bias arising from the certification process. The other two studies look at a previous version of the Rainforest Alliance standard for Brazilian coffee producers only. Pinto et al. (2014) compare the compliance of group certificates to that of individual farm certificates. They find that the two types of producers obtain similar compliance levels. In a more recent paper, Maguire-Rajpaul et al. (2020) largely confirm these findings using more data. They find that Brazilian group certificates have slightly higher non-compliance with a selection of management criteria and somewhat lower social performance than individual farm certificates, although the differences are statistically insignificant. Our study extends the analysis to the entire world and more crops. We use more recent data and also provide more structure to the analysis by first clustering the certificates according to their non-compliance patterns, while the previous studies simply compute compliance scores. Our paper further relates to the empirical literature on the impact of certification in general (e.g., Van Rijsbergen et al., 2016; Cramer et al., 2017; Dragusanu and Nunn, 2018; Glasbergen, 2018; Krumbiegel et al., 2018; Borsky and Spata, 2018; Sellare et al., 2020; Dietz et al., 2020). We also contribute to the growing literature on machine learning techniques used in social sciences (e.g., Mullainathan and Spiess, 2017; Chalfin et al., 2016; Kleinberg et al., 2018) and, in particular, we add to the methodological literature on clustering algorithms (e.g., Huang, 1997; Khan and Ahmad, 2013; Cao et al., 2013).

3 Data

At the heart of the analysis are audit data at producer level that are collected by third-party auditors and processed by RA. The dataset reports the compliance status for all criteria of the 2017 Sustainable Agriculture Standard, which is one of the most widely adopted VSS. It was developed by RA and the Sustainable Agriculture Network (SAN) and is made up of 119 criteria grouped into four principles: effective planning and management system, biodiversity conservation, natural resource conservation, and improved livelihoods and human well-being. The standard consists of critical and continuous improvement criteria. Compliance with the critical criteria is mandatory, both for the initial certification and for retaining certification. Critical criteria are the foundation of the standard and include labor, social, and environmental issues of highest priority and risk. After the initial certification, the standard defines a sequential progression through three levels, C, B, and A, which consist of the continuous improvement criteria. Every year an increasing share of the continuous improvement criteria must be complied with to retain certification. As the new standard was only implemented in 2017 and every producer was then set to “year zero”, we do not look at the criteria of levels A and B, as no producer had to comply with these at the time of the audits in our data.

The dataset contains the results of 919 independent audits between 2017 and 2019 for 561 certified producers (Footnote 1). To reduce the data to one audit per producer, we choose the audit from 2018, the year with the most observations. If none is available, we use the audit from 2017 or 2019. Of the producers included in the data, 58.6% are group certificates, 39.2% are individual farms, and only 2.1% are multi-site certificates. We count the latter as individual farms to avoid having a very small group. Further, 86.8% produce coffee (Arabica), the most widely certified crop worldwide, and 13.2% produce cocoa, another important crop in the certification business.
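The selection rule above can be illustrated with a minimal Python sketch. The record layout (`producer_id`, `year`) is hypothetical, and the text does not say whether 2017 or 2019 is preferred when 2018 is missing, so the fallback order used here is an assumption.

```python
# Preference order: 2018 first (as stated in the text); ranking 2017 ahead of
# 2019 is an assumption made only for this illustration.
YEAR_PREFERENCE = (2018, 2017, 2019)


def one_audit_per_producer(audits):
    """Keep one audit record per producer according to YEAR_PREFERENCE."""
    rank = {year: i for i, year in enumerate(YEAR_PREFERENCE)}
    chosen = {}
    for audit in audits:
        pid = audit["producer_id"]
        # Replace the stored audit only if this one comes from a preferred year.
        if pid not in chosen or rank[audit["year"]] < rank[chosen[pid]["year"]]:
            chosen[pid] = audit
    return chosen


# Hypothetical records; field names are not taken from the RA dataset.
audits = [
    {"producer_id": "A", "year": 2017},
    {"producer_id": "A", "year": 2018},
    {"producer_id": "B", "year": 2019},
    {"producer_id": "B", "year": 2017},
    {"producer_id": "C", "year": 2019},
]
selected = one_audit_per_producer(audits)
print({pid: a["year"] for pid, a in selected.items()})
```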

The audit result for each criterion can either be compliance (1), non-compliance (0), or non-applicable (NA). Non-applicable criteria refer to practices irrelevant to a producer, e.g., “safe application of pesticide by aircraft” if a producer does not use aircraft. To give an impression of the distribution of audit responses: over all criteria and observations, 72.2% are compliant, 7.4% are non-compliant, and 20.4% are non-applicable. Further, the share of NA ranges from 0 to 92.2% depending on the criterion. Of the 561 observations, a total of 91.4% pass the audit and are (re)certified. If a producer fails the audit, it is predominantly due to non-compliance with one or more critical criteria. Compliance with level C criteria ranges from 50% of the criteria (which is sufficient for certification in the first year) to 100%.

Besides audit data, RA provides data on the individual certificates, which we were able to match to the producer-level audit data. We use variables on the number of workers, the number of producers in group certificates, the operation size in hectares, output per hectare, and the year a producer was first certified (Footnote 2). In addition to these variables, the data contain coordinates of each producer, which we use to match a large array of GIS-coded variables to our data. We use data on the harvested area and quantity of coffee and cocoa in the surroundings of producers, certification density of the two crops by other VSS, population density, distance to the nearest city, child mortality, nighttime light density, vegetation, protected areas, and terrain ruggedness. We include a buffer around the producers with a radius of 15 km for several reasons: to reduce noise, to control for the fact that the resolution of the GIS data differs depending on the variable, and because some of the group certificates report the location of their city offices. Within this buffer, we aggregate the underlying grid cells using the mean (Footnote 3). Additionally, we look at a number of economic and political variables at the country level. We will present summary statistics in Sect. 5 below. Appendix A provides details of all variables used in the analysis.
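To make the buffer aggregation concrete, the following Python sketch averages grid-cell values around a producer. It is a simplification under stated assumptions: a cell is treated as inside the buffer if its center lies within 15 km (great-circle distance), whereas a GIS workflow would typically intersect cell polygons with the buffer. All names and values are hypothetical.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in degrees."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def buffer_mean(producer, cells, radius_km=15.0):
    """Mean value of all cells whose center lies within radius_km.

    producer: (lat, lon); cells: iterable of (lat, lon, value).
    Returns None if no cell center falls inside the buffer.
    """
    values = [
        value
        for lat, lon, value in cells
        if haversine_km(producer[0], producer[1], lat, lon) <= radius_km
    ]
    return sum(values) / len(values) if values else None


# Hypothetical grid cells: two near the producer, one roughly 111 km away.
cells = [(0.0, 0.0, 10.0), (0.0, 0.1, 20.0), (0.0, 1.0, 100.0)]
print(buffer_mean((0.0, 0.0), cells))  # mean of the two nearby cells
```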

4 Compliance patterns

4.1 Method

One of the most popular unsupervised machine learning techniques is the k-means algorithm, which groups data into clusters of similar observations. For our cluster analysis, we use an adapted version of this algorithm that takes into account the peculiarities of our dataset. To this end, we program our own clustering algorithm in C++, incorporating findings from the literature. Because the k-means algorithm is designed for numerical data, we follow Huang (1997), who introduces the k-modes algorithm for categorical data. As described above, our audit data are binary, in which case both algorithms work in theory. However, we find that k-modes performs better in clustering our particular data.

Additionally, our dataset has many “missing” values since not every criterion applies to every observation. Most observations have some non-applicable criteria; thus, we cannot drop them, and we need an algorithm that runs well even with “missing” data. With k-modes, it would in principle be possible to treat non-applicable as a category of its own. However, we do not want the clustering to be driven by the non-applicable criteria and thus rule out this approach. In the following, we describe a version of k-modes that handles this data structure.

To determine the similarity of observations, we need to define a distance measure. For our binary data, we use the Hamming distance, which counts the number of dissimilarities between two objects. More specifically, each criterion takes on one of three values in \(\{0,1,\text {NA}\}\), where \(0\) and \(1\) stand for non-compliance and compliance, respectively, and NA stands for non-applicable. Thus, the distance between two vectors \(X\) and \(Y\) (all criteria of two observations) is given by

$$\begin{aligned} d(X, Y) = \frac{1}{m^{0,1}}\sum _{j = 1}^{m}\delta (x_j, y_j) \omega _j, \end{aligned}$$
(1)

where \(\omega _j\) is the weight of criterion \(j\) described below, and \(m^{0,1}\) is the number of criteria that are applicable (not NA) in both \(X\) and \(Y\). The dissimilarity indicator \(\delta\) is defined as

$$\begin{aligned} \delta (x_j, y_j) = {\left\{ \begin{array}{ll} 1, &\quad x_j \ne y_j \ \wedge \ x_j, y_j \ne \text {NA},\\ 0, &\quad x_j = y_j \ \vee \ x_j = \text {NA} \ \vee \ y_j = \text {NA}. \end{array}\right. } \end{aligned}$$
(2)
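Eqs. (1) and (2) can be illustrated with a short Python sketch. It is not the original C++ implementation: the use of `None` for NA, the function names, and the handling of pairs with no jointly applicable criterion are assumptions, and unit weights stand in for \(\omega\) when no weights are passed.

```python
NA = None  # stand-in for a non-applicable criterion


def delta(x, y):
    """Eq. (2): 1 if both values are applicable and differ, otherwise 0."""
    return 1 if x is not NA and y is not NA and x != y else 0


def distance(X, Y, weights=None):
    """Eq. (1): weighted mismatches divided by the number of criteria
    that are applicable in both X and Y."""
    if weights is None:
        weights = [1.0] * len(X)  # unit weights, an illustrative default
    m01 = sum(1 for x, y in zip(X, Y) if x is not NA and y is not NA)
    if m01 == 0:
        return 0.0  # assumption: nothing comparable means no measured distance
    return sum(delta(x, y) * w for x, y, w in zip(X, Y, weights)) / m01


X = [1, 0, NA, 1, 1]
Y = [1, 1, 1, NA, 0]
# Three criteria are applicable in both vectors; two of them differ.
print(distance(X, Y))
```

With unit weights, the example gives 2 mismatches over 3 comparable criteria. The two positions where either vector is NA contribute neither to the numerator nor to the denominator.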

Hence, with unit weights, \(d(X, Y)\) is the number of criteria in which \(x_j = 0\) and \(y_j = 1\) or vice versa, divided by the number of criteria in which \(x_j, y_j \ne \text {NA}\). In other words, we do not count NA as different from either \(1\) or \(0\). The vector \(\omega\) contains the weights proposed by Huang (1997). These weights give homogeneous criteria more weight in the clustering so that outliers in certain categories are detected well. The algorithm proceeds in the following steps.

K-Modes Algorithm

[1.] Input: \(k\) = number of clusters; \(D\) = \(n \times m\) array (data); \(M\) = \(k \times m\) array of initial modes

[2.] Compute the distance of each observation \(i\) in \(D\) to each of the \(k\) cluster modes in \(M\)

[3.] Assign each observation to the closest cluster

[4.] Compute the new modes for each cluster (get new array \(M\))

[5.] Repeat steps 2 to 4 until no observation changes its cluster

[6.] Output: \(M\) = \(k \times m\) array of cluster modes; \(C\) = \(n \times 1\) vector of cluster assignments
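The steps listed above can be sketched in Python as follows. This is a simplified illustration, not the C++ implementation used for the analysis: it omits the criterion weights, takes the initial modes as given, and relies on several assumptions that are marked in the comments (tie-breaking, empty clusters, iteration cap).

```python
NA = None  # stand-in for a non-applicable criterion


def distance(x, mode):
    """Share of mismatches among criteria applicable in both vectors."""
    pairs = [(a, b) for a, b in zip(x, mode) if a is not NA and b is not NA]
    if not pairs:
        return 0.0  # assumption: nothing comparable means distance zero
    return sum(a != b for a, b in pairs) / len(pairs)


def column_mode(values, fallback):
    """Most frequent applicable value of one criterion within a cluster."""
    applicable = [v for v in values if v is not NA]
    if not applicable:
        return fallback  # assumption: keep the old mode if all values are NA
    # Assumption: ties are resolved in favor of compliance (1).
    return 1 if 2 * sum(applicable) >= len(applicable) else 0


def k_modes(data, initial_modes, max_iter=100):
    """Steps 2 to 5: assign, update modes, repeat until assignments are stable."""
    modes = [list(m) for m in initial_modes]
    assignment = None
    for _ in range(max_iter):  # iteration cap is a safeguard, not in the text
        new_assignment = [
            min(range(len(modes)), key=lambda c: distance(row, modes[c]))
            for row in data
        ]
        if new_assignment == assignment:
            break  # step 5: no observation changed its cluster
        assignment = new_assignment
        for c in range(len(modes)):
            members = [row for row, a in zip(data, assignment) if a == c]
            if members:  # assumption: an empty cluster keeps its previous mode
                modes[c] = [
                    column_mode(col, modes[c][j])
                    for j, col in enumerate(zip(*members))
                ]
    return modes, assignment


# Hypothetical toy data: 6 observations, 4 criteria, 3 clusters.
data = [
    [1, 1, 1, 1],
    [1, 1, NA, 1],
    [0, 0, 1, 1],
    [0, NA, 1, 1],
    [1, 1, 0, 0],
    [1, 1, 0, NA],
]
initial_modes = [[1, 1, 1, 1], [0, 0, 1, 1], [1, 1, 0, 0]]
modes, assignment = k_modes(data, initial_modes)
print(assignment)
```

In this toy example the assignments are stable after the first pass, so the loop stops at the second iteration with the initial modes unchanged.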

For the algorithm to run, two input choices must be made by the researcher ex-ante: the values of the initial modes and the optimal number of clusters. The values chosen as initial starting points are likely to influence the results. Thus, it is essential to have a reasonable array of input modes. We follow Khan and Ahmad (2013) and determine the initial modes based on the distribution in the data. Using a data-driven approach, we find the optimal number of clusters to be four. We describe both input decisions in detail in Appendix B1.

Before we run the algorithm, we prune the data to drop criteria with a high occurrence of NA or with almost complete compliance, since these contain little information for clustering (Footnote 4). In particular, we drop all criteria with more than 50% NA and all criteria with compliance rates over 95%, where NAs are ignored in computing the compliance rate (Footnote 5). This leaves us with 30 criteria. Note that this procedure drops all critical criteria since the compliance rates for these are higher than 95%.
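The pruning rule can be written down in a few lines of Python. The sketch below is illustrative only: the data layout (rows are observations, columns are criteria, `None` marks NA) and the function name are assumptions, while the two thresholds are those stated in the text.

```python
NA = None  # stand-in for a non-applicable criterion


def prune_criteria(data, max_na_share=0.50, max_compliance=0.95):
    """Return the indices of the criteria (columns) kept for clustering."""
    keep = []
    for j, col in enumerate(zip(*data)):
        applicable = [v for v in col if v is not NA]
        na_share = 1 - len(applicable) / len(col)
        if not applicable or na_share > max_na_share:
            continue  # too many non-applicable entries
        compliance = sum(applicable) / len(applicable)  # NAs are ignored
        if compliance > max_compliance:
            continue  # almost everyone complies, little information
        keep.append(j)
    return keep


# Hypothetical toy data: 4 observations, 3 criteria.
data = [
    [1, NA, 1],
    [1, NA, 0],
    [1, NA, 1],
    [1, 1, NA],
]
kept = prune_criteria(data)
print(kept)  # only the third criterion survives both filters
```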

4.2 Results

Figure 1 and Table 1 show the result of the clustering exercise. The first panel of Fig. 1 presents the modes of each cluster, with the criteria listed on the horizontal axis and the clusters on the vertical axis. The remaining panels show the same criteria on the horizontal axis and each observation in a cluster on the vertical axis. The figure gives a good impression of how neatly the observations are sorted into the clusters. Table 1 reports the compliance share for each of the 30 criteria by cluster. If all observations in a cluster complied with a certain criterion, the value in the table would be 1. A value of 0 represents perfect non-compliance. Of course, such extreme values are not to be expected since most certificates are unique in their compliance pattern. We interpret a value below 0.5 as non-compliance, since the cluster mode for that criterion is then non-compliance. The criteria are grouped into the four principles of the SAN-RA Standard. In addition, the first column shows the overall compliance shares by criterion. The values range from 0.540 to 0.943 (Footnote 6). In the following, we interpret these results by describing each cluster in turn. A detailed description of non-compliant criteria for each cluster can be found in Appendix C.
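As a hedged illustration of how such compliance shares and the 0.5 rule relate, the Python sketch below computes shares per cluster and criterion on toy data. Ignoring NA entries when computing the shares is an assumption made here, and all names and values are hypothetical.

```python
NA = None  # stand-in for a non-applicable criterion


def compliance_shares(data, assignment, k):
    """Share of compliant observations per cluster and criterion (NAs ignored)."""
    shares = []
    for c in range(k):
        members = [row for row, a in zip(data, assignment) if a == c]
        cluster_shares = []
        for col in zip(*members):
            applicable = [v for v in col if v is not NA]
            cluster_shares.append(
                sum(applicable) / len(applicable) if applicable else None
            )
        shares.append(cluster_shares)
    return shares


def non_compliance_flags(shares, threshold=0.5):
    """Flag a criterion as non-compliant when its share is below the threshold."""
    return [[s is not None and s < threshold for s in row] for row in shares]


# Hypothetical toy data: 5 observations, 3 criteria, 2 clusters.
data = [
    [1, 1, 0],
    [1, NA, 1],
    [1, 0, 1],
    [0, 0, 1],
    [0, 1, NA],
]
assignment = [0, 0, 0, 1, 1]
shares = compliance_shares(data, assignment, k=2)
flags = non_compliance_flags(shares)
print(shares)
print(flags)
```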

Fig. 1 Clusters. Note: The first panel of this figure shows for each cluster (vertical axis) the modes of each criterion (horizontal axis). Panels 2 to 5 show the compliance status for each observation (vertical axis) within a given cluster. Each of the panels 2 to 5 corresponds to one line in the first panel. For example, in cluster NC-M (panel 4), a majority of observations do not comply (red) with criterion 1.7, and hence, the corresponding cell (line 3, column 1) in panel 1 shows a mode “non-compliance” for this criterion. Recall that cluster C stands for compliance, NC-E stands for non-compliance with environmental criteria, NC-M stands for non-compliance with management criteria, and NC-S stands for non-compliance with social criteria

Table 1 Compliance shares by criteria and cluster

4.2.1 Cluster C: compliers

The first cluster is the largest and makes up over one half of total producers (N = 332). These producers achieve very high compliance levels in almost all criteria. As can be seen in Fig. 1, the only exception is criterion 3.28 (vegetative barriers between pesticide-applied crops and areas of human activity), where a majority does not comply. This seems to be a very specific criterion, and from column 1 of Table 1, we learn that it has one of the lowest overall compliance levels. The fact that a large number of producers that are not compliant with this criterion end up in the cluster with the otherwise highest compliance suggests that it is barely correlated with other criteria.

4.2.2 Cluster NC-E: non-compliers with natural resource conservation (environment)

The second cluster describes a small category of producers (N = 42) that do not sufficiently implement environmental criteria. They disregard rules on the safe application of pesticides and face problems with water use and erosion as well as with their waste management. Additionally, workers do not benefit from sufficient health protection. Moreover, there is a high occurrence of management issues in these same areas, where careful planning and evaluation are lacking.

4.2.3 Cluster NC-M: non-compliers with management criteria

The third cluster comprises just under one quarter of producers (N = 121). A look at Fig. 1 reveals that these certificates have few criteria with non-compliances. However, a pattern appears: all criteria that are not complied with are related to the farm management or the group administrator. They do not develop a farm management plan, nor a plan to increase and restore the native vegetation, nor one for pest and waste management. Interestingly, these producers comply to a large extent with other criteria concerned with the actual task of conserving natural resources or implementing social criteria. The issues seem to lie solely with the (group) management.

4.2.4 Cluster NC-S: non-compliers with social criteria

Finally, the last cluster groups producers that have low compliance with social criteria. This group contains over one ninth of producers (N = 66). Workers suffer from insufficient housing and incomes. Also, the workers’ representation is insufficient, i.e., there is no Occupational Health and Safety committee. Additionally, there are some problems with leadership such as the lack of upkeep and evaluation of the farm management plan as well as the recording of pest infections.

5 Characteristics of the clusters

So far, we have described the compliance patterns peculiar to each of the four clusters. Each cluster has its own problem area, which suggests that the patterns are meaningful. In this section, we add more variables to the analysis to characterize the certificates in each cluster in more detail. This sheds light on who the producers in each cluster are and what could be potential drivers of non-compliance.

Fig. 2 Maps of clusters. Note: To comply with the confidentiality agreement and to improve visibility, we dropped country-crop-type combinations with fewer than five observations from the figure

5.1 Type, crop, and location

Most empirical studies concerned with the effect of certification look at a geographically confined area and certain types of producers only. Thus, little is known about spatial heterogeneity of impact or about which crops are especially suited for certification. Figure 2 illustrates the location of the certificates per cluster and crop-type combination. Table 2 shows some additional indicators regarding location, type of producer, crop, and experience with certification. In cluster C, coffee production, especially on individual farms, is slightly over-represented. Figure 2 shows that most of these farms are located in Brazil. Unsurprisingly, producers in this cluster have on average been certified for longer and have also had more previous audits. This suggests either a learning effect, at least in terms of how to pass the audit, or a mechanical correlation if non-compliers drop out of the program over time. Producers in cluster NC-E are often located in East Africa, especially in Ethiopia, and consist nearly exclusively of coffee groups. Additionally, they have the least experience with certification. NC-M certificates are often cocoa group certificates. Nearly all Ghanaian cocoa producers are assigned to this cluster, whereas producers in Central America rarely fall into it. Finally, coffee group certificates located in Central America, especially in El Salvador, are over-represented in cluster NC-S.

Table 2 Type and continent
Fig. 3 Operation size. Note: The black lines represent standard errors. We drop the single cocoa farm in the data and the single cocoa group in cluster NC-E from this analysis

5.2 Operation size

While the location and the type of producers provide some insights, we now turn to more specific indicators at the certificate level. Figure 3 provides statistics on the scale of production grouped by crop and certificate type. Cluster C consists of coffee farms that are large in area but employ relatively few workers. Nevertheless, they achieve the highest output per hectare, which suggests a high degree of mechanization. This high productivity corresponds to large-scale coffee farms, mainly in Brazil. Among coffee group certificates, the compliers tend to have fewer member farms, but the individual farms are larger in terms of both area and number of workers. It is plausible that both findings arise because large, productive producers face a smaller adaptive burden to comply with the standard.

For both coffee farms and groups, producers in cluster NC-E struggle with productivity, especially in per-worker terms. The group certificates consist of many small farms that employ little labor from outside the household. This corresponds well with our findings in Sect. 5.3 below that the poorest farmers tend to be found in this cluster. Output per hectare is rather high for certificates in cluster NC-M, but compared to the complier cluster, similar amounts of output are generated by more workers for coffee farms, and by more member farms for coffee groups. This means that actual productivity, given inputs, is lower than in cluster C. The high number of workers for farm certificates and the high number of farms for group certificates might also make them more difficult to manage, which would explain the predominance of issues with management criteria. In contrast to the coffee groups in NC-E and NC-M, producers in cluster NC-S tend to employ some labor beyond the household members, which might explain why they struggle more with social criteria. The few cocoa groups in cluster NC-S achieve very high productivity with little labor input.

5.3 Economic, political, and geographic factors

The economic and political environment plays a crucial role for the viability of VSS. On the one hand, VSS should target the vulnerable producers in poor regions to have the biggest impact (e.g., Dragusanu et al., 2014). On the other hand, certification is more likely to fail in difficult circumstances. Fundamental conditions for investments in certification are market access (e.g., Dammert and Mohan, 2015), a sound political setting (e.g., Naylor, 2014; AbarcaOrozco, 2015), the availability of skilled workers and credit (e.g., Kirumba and Pinard, 2010), and to some degree competitiveness with non-certified producers (e.g., De Janvry et al., 2015). VSS organizations aim to improve these conditions locally; however, better initial conditions reduce the adaptive burden once certification is put in place and, hence, facilitate compliance (Footnote 7).

Table 3 Economic, political, and geographic indicators

Table 3 offers an array of variables along these dimensions. The first section reports indicators on economic and political development. Clearly, the development level is highest around cluster C producers. They have the highest GDP per capita, which is also reflected in night-time lights when put in relation to population density. Additionally, child mortality, an indicator closely related to poverty and health, is comparatively low. Finally, these producers’ countries also fare best in the Polity II index, a measure of the level of democracy. Further, these producers are situated in countries where agriculture is of minor importance in the overall economy and agricultural wages are quite high. These findings suggest that RA certification works best in already well-off places.

Producers in cluster NC-E are situated in much poorer countries with high population density, which is again reflected in night-time lights. Child mortality is high and the level of democracy a magnitude lower than in any other cluster. Agriculture on average still makes up over 20% of total GDP, and exports of agricultural commodities remain important. It is noteworthy that it is the environmental and not the social criteria that are neglected in the poorest places. An explanation could be that poorer producers often do not have the means to implement criteria regarding environment. In contrast, they are often small producers with few employees beyond the household members where complying with social criteria is less of an issue.

Clusters NC-M and NC-S are situated in contexts of intermediate levels of development. With similar income levels, cluster NC-M displays higher child mortality. This is in line with the longer distance to cities and high export costs implying a remote, rural setting. The high ethnic fractionalization is probably driven by the large share of West African producers in this cluster. Cluster NC-S producers are the most “urban” in that they are closer to cities, more likely to have city offices and face low export costs despite hilly terrain.

5.4 Agriculture

To get an impression of the agricultural landscape surrounding certified producers, we turn to several grid cell level variables reported in Table 4. We group the certificates into coffee and cocoa producers since the optimal growing conditions differ for the two crops. For both crops, it immediately becomes apparent that the compliers (cluster C) are located in places where the relevant crop is cultivated and certified intensively with high yields per hectare. The most probable interpretation is that these are locations that are well-suited for the growth of the crop.

For cluster NC-E, we find the lowest intensity of production and yield per hectare among coffee producers. Additionally, certification of coffee does not seem to be common in the locations of these certificates. Thus, the story might indeed be that farmers in less suitable places fail to comply with criteria regarding the environment. Similarly, producers in cluster NC-M are also located in areas that are not optimal for coffee production, as the yield per hectare is significantly lower and certification is not widespread.

Producers in cluster NC-S show different results depending on the crop. While the locations of the certificates seem rather suitable for coffee, the opposite is the case for the few certificates in NC-S producing cocoa. Surprisingly, these are the same cocoa producers that showed high output per hectare compared to the other clusters in Sect. 5.2.

Table 4 Agricultural surroundings

The last panel of Table 4 reports densities of the four most common domestic animals (per square kilometer). Livestock density, with the exception of pigs, is significantly higher for cluster NC-E. However, the reasoning behind this is not clear. It could be that high livestock density, especially through manure production, has negative impacts on the environment. Alternatively, high livestock density could be an indicator of poorer regions, where farmers do not specialize in one specific crop but diversify by farming both livestock and crops.

6 Classification

As an additional exercise, we use the information gathered in the previous section to predict cluster affiliation of new producers. We use one of the most popular machine learning procedures: classification. Specifically, we use a type of random forest algorithm by Strobl et al. (2009). To avoid overfitting, we draw on our results from the previous section and include as predictors only variables that have been shown to be highly significant for cluster differentiation (at the 1% significance level). Due to the imbalance of our clusters, we follow Delgado and Tibau (2019) and use the Matthews correlation coefficient (MCC) as our performance measure. The classification method is described in detail in Appendix B2.
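To illustrate why the MCC is informative for imbalanced clusters, the Python sketch below computes it from a confusion matrix using the standard multiclass generalization of the coefficient. That this is the exact variant applied in the paper is an assumption, and the example matrices are invented for illustration.

```python
from math import sqrt


def mcc(confusion):
    """Multiclass Matthews correlation coefficient.

    confusion[i][j] = number of observations with true class i and
    predicted class j.
    """
    k = len(confusion)
    s = sum(sum(row) for row in confusion)            # total observations
    c = sum(confusion[i][i] for i in range(k))        # correct predictions
    t = [sum(confusion[i]) for i in range(k)]         # true count per class
    p = [sum(confusion[i][j] for i in range(k)) for j in range(k)]  # predicted
    numerator = c * s - sum(pk * tk for pk, tk in zip(p, t))
    denominator = sqrt(
        (s * s - sum(pk * pk for pk in p)) * (s * s - sum(tk * tk for tk in t))
    )
    # Convention: return 0 when the coefficient is undefined, e.g. when
    # every observation is predicted to be in the same class.
    return numerator / denominator if denominator else 0.0


# Hypothetical imbalanced example: always predicting the majority class
# yields 90% accuracy but an MCC of zero.
majority_only = [[90, 0], [10, 0]]
accuracy = (90 + 0) / 100
print(accuracy, mcc(majority_only))

# A classifier that actually separates the two classes scores higher.
informative = [[8, 2], [1, 9]]
print(round(mcc(informative), 4))
```

The first example shows the point of the measure: a classifier that ignores the minority clusters can still reach a high share of correct predictions, whereas its MCC stays at zero.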

6.1 Results

Table 5 shows the confusion matrix with cell counts in percent. The overall share of correct predictions is 0.54, and the MCC is 0.28. While the algorithm predicts cluster affiliation rather well for cluster C, it struggles with identifying the other clusters. There is also a high level of heterogeneity in accuracy depending on country-crop-type combinations, as shown in Fig. 4. Unsurprisingly, country-crop-type combinations where most observations are in the same cluster are well predicted. This is the case for Brazilian coffee farms, Guatemalan coffee farms and groups, and Indian coffee farms. We find that prediction is particularly weak in countries where the cluster algorithm does not yield clean results. For example, Kenyan coffee producers tend not to fit well in any of the four clusters and are consequently classified with very low accuracy.

Table 5 Confusion matrix
Fig. 4 Share of correct predictions. Note: The white numbers at the bottom of each bar indicate the number of observations in that country-crop-type combination. We drop all country-crop-type combinations with \(N\le 5\)
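The share of correct predictions per country-crop-type combination can be computed along the following lines. This is a minimal sketch under assumed inputs: the record layout, the function name, and the group labels are hypothetical, and only the rule of dropping combinations with \(N\le 5\) is taken from the note above.

```python
from collections import defaultdict


def accuracy_by_group(records, min_n=6):
    """Share of correct predictions per group.

    `records` is an iterable of (group, true_cluster, predicted_cluster).
    Groups with fewer than `min_n` observations are dropped, so the default
    of 6 corresponds to dropping all combinations with N <= 5.
    Returns {group: (share_correct, n_observations)}.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [n_correct, n_total]
    for group, true_cluster, predicted_cluster in records:
        counts[group][0] += int(true_cluster == predicted_cluster)
        counts[group][1] += 1
    return {
        group: (n_correct / n_total, n_total)
        for group, (n_correct, n_total) in counts.items()
        if n_total >= min_n
    }
```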

Figure 5 shows the ten most important variables for the classification. Variable importance is the mean decrease in accuracy of the forest when randomly re-shuffling the values of a variable (also referred to as permutation importance). As we can see, geographic variables are crucial to determine cluster affiliation. Additionally, the type of certificate and the number of years certified seem to play an important role.

Fig. 5 Variable importance. Note: The figure shows mean decrease in accuracy if a variable is removed from the model for the ten variables with the largest decrease
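The idea behind permutation importance can be sketched as follows. This is a simplified, model-agnostic illustration and not the procedure of Strobl et al. (2009), which operates within the forest; the function names, the representation of observations as tuples, and the number of repetitions are assumptions.

```python
import random


def accuracy(predict, rows, labels):
    """Share of rows for which the model's prediction equals the true label."""
    return sum(predict(row) == label for row, label in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, column, n_repeats=20, seed=0):
    """Mean decrease in accuracy when the values of one column are re-shuffled.

    `predict` is any callable mapping a row (a tuple of feature values) to a
    predicted label; `column` is the index of the variable to permute.
    """
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    drops = []
    for _ in range(n_repeats):
        shuffled = [row[column] for row in rows]
        rng.shuffle(shuffled)
        # Rebuild each row with the permuted value in place of the original one.
        permuted = [
            row[:column] + (value,) + row[column + 1:]
            for row, value in zip(rows, shuffled)
        ]
        drops.append(baseline - accuracy(predict, permuted, labels))
    return sum(drops) / n_repeats
```

A variable that the model does not use yields an importance of exactly zero, because shuffling it leaves every prediction unchanged.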

7 Summary and discussion

In this section, we present a summary of the results found in our analysis of compliance and then put them in perspective regarding limitations and existing literature.

We were able to find sensible clusters of non-compliance that allow us to group producers according to the areas in which they struggle to comply. Hence, our first hypothesis is supported. Besides producers that are very successful in implementing the standard, we identify three clusters that have compliance issues in the areas of natural resource conservation, management, and social issues. Additionally, we can characterize the clusters to find out what type of producers belongs to which cluster. Unsurprisingly, the compliers are predominantly found in the more developed regions of Central and South America and tend to be large individual farms that achieve high yields per hectare, a hint toward the high mechanization prevalent on these farms. For groups, it is those with relatively few but large member farms that show high productivity. Interestingly, for coffee producers, it is an asset to employ a lot of workers, while for cocoa cultivators, the opposite holds. Finally, producers in the compliers cluster have been certified for longer, suggesting a learning effect.

Producers in the second cluster (non-compliance with environmental criteria) have the least experience with certification and tend to be located in poor regions that are also less suitable for cultivation. Most of these certificates belong to coffee producers that are certified in groups and employ few workers outside of their household. Producers in the third cluster (management issues) are often found in cocoa production in West Africa. Most Ghanaian producers are assigned to this cluster. They have the highest share of ethnic fractionalization. Finally, issues with social criteria are prevalent in group certificates for coffee production in Central America, mainly in El Salvador. In line with these results, we find in our classification exercise that the variables most important for predicting cluster affiliation are mostly geographic in nature, next to the number of years of certification and the type of certificate.

To be fair, the prediction accuracy of our classification exercise is modest at best. Thus, our second hypothesis regarding the identification of potential drivers of cluster affiliation is only partially supported. Nevertheless, we are confident in recommending the approach for further research, because the modest accuracy can be traced to several identifiable limitations. First, the number of observations is relatively small for classification, which limits the power of the analysis. Unfortunately, this is nearly impossible to overcome given the present scale of RA certification. Second, we suspect that some critical information needed to accurately classify all observations is missing. To address this issue, more certificate-level data would need to be collected. Third, clustering all certificates worldwide bears the risk that the analysis is unsuitable for certain crop-country combinations that do not fit in any of the clusters so obtained.

Comparing our results to the previous literature is difficult as the literature on compliance is extremely sparse. The only studies that we found focus on compliance of Brazilian coffee producers and compare group certificates to certified individual farms. Pinto et al. (2014) find similar overall compliance scores for individual farm and group certificates in 2011. Groups perform somewhat better in working conditions and health and safety measures, while individual farms outperform groups in wildlife and water protection as well as integrated crop management. Maguire-Rajpaul et al. (2020), who use compliance data from 2006 to 2014, find that Brazilian group certificates have somewhat higher non-compliance with a selection of management criteria and slightly lower social performance, although the differences are not statistically significant. Additionally, they find that the larger the certificates are in terms of area, the more compliant they are with social and management criteria. We can confirm the latter result for coffee even on a global level, as producers with a large certified area are over-represented in the compliers cluster. Further, our analysis suggests that individual farm certificates outperform group certificates in Brazil. However, it is quite possible that this difference only came about more recently, e.g., through improvements on large coffee farms. Talking to experts at RA, we find that certified Brazilian coffee farms are in many ways exceptional. They are much larger and more mechanized than the typical farms working with RA and contribute heavily to the global output of RA-certified coffee (Footnote 8).

8 Implications

In the introduction, we have outlined the importance of our research for two main avenues: highlighting the issue of heterogeneous compliance for causal inference, and the improvement of VSS through systematic assignment of tailored training units and ex-ante risk assessment of new producers. Hence, in this section, we discuss the implications of our results for both standards organizations and researchers. Previous research on the impact of certification with VSS was mostly concerned with selection bias. The rationale behind this is that producers with different characteristics differ in their take-up of certification. In Sect. 4, we have shown that certified producers additionally differ considerably in compliance, which suggests that even among treated producers, the outcome may vary. Researchers interested in specific outcomes of certification are therefore encouraged to take compliance into account in addition to selection bias. For treated, i.e., certified, producers, the level of compliance with the VSS can potentially be observed by consulting audit data. However, this is never possible for control groups. One possible way to address this issue is to predict compliance (e.g., using a random forest algorithm) and to match each treated observation to a control unit with similar expected compliance were it to be certified. To do this, one needs to know the drivers of compliance with the VSS. In Sects. 5 and 6, we have identified several variables that possibly drive compliance. These insights are a starting point to guide future research in the adequate choice of the sample and matching technique.
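The matching idea can be sketched as follows. This is a minimal illustration, not a procedure we have implemented: it assumes that a predicted compliance score (for example, the predicted probability of belonging to the compliers cluster) is already available for every treated and control unit, and it uses simple nearest-neighbor matching with replacement. The function name, the identifiers, and the optional caliper are hypothetical.

```python
def match_on_predicted_compliance(treated, controls, caliper=None):
    """Match each treated unit to the control unit with the closest predicted score.

    `treated` and `controls` map unit identifiers to predicted compliance scores.
    Matching is done with replacement, so a control unit may be used repeatedly.
    If `caliper` is given, treated units without a control within that distance
    remain unmatched.
    """
    matches = {}
    for treated_id, treated_score in treated.items():
        control_id, control_score = min(
            controls.items(), key=lambda item: abs(item[1] - treated_score)
        )
        if caliper is None or abs(control_score - treated_score) <= caliper:
            matches[treated_id] = control_id
    return matches


# Invented scores for illustration only.
treated_scores = {"t1": 0.82, "t2": 0.30}
control_scores = {"c1": 0.78, "c2": 0.35, "c3": 0.10}
pairs = match_on_predicted_compliance(treated_scores, control_scores)
```

A caliper trades sample size for match quality: treated producers without a sufficiently similar control unit are excluded rather than paired with a poor match.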

Further, past impact studies were mostly concerned with very specific settings, e.g., small-holder coffee growers in Ethiopia. Compliance patterns for some potential settings such as coffee farms in Brazil or Guatemala are well predicted by our global study. In other settings, such as cocoa groups in Ecuador or Ivory Coast, the prediction performs considerably worse. However, it should be possible to repeat the analysis for a specific context to generate adequate clusters and significant predictors to be used in a particular causal study. In this sense, our analysis provides an example of a procedure to address this important issue.

Regarding implications for standards organizations, the potential for risk assessment remains limited given the data currently available. While it is possible to predict compliance patterns of some larger homogeneous groups quite accurately, it is nearly impossible for others. We argue that this is due to missing information. Admittedly, prediction of clusters is further complicated by the poor fit of some of the observations to any of the clusters. However, we are able to identify those observations, and we know where the prediction is likely to be accurate and where less so. Additionally, for the cases where classification is expected to be difficult, there is the option to turn to the full probability table of the prediction outcome. The classification algorithm returns a probability of belonging to each cluster for every observation. If we allow for some degree of human judgment, this information can be very useful in forecasting the compliance issues of a newly certified producer.
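One way to use the probability table is to flag predictions where the two most likely clusters are close, so that these cases can be passed on for human judgment. The following is a minimal sketch; the function name, the margin threshold of 0.2, and the example probabilities are assumptions and not values from our analysis.

```python
def summarize_prediction(probabilities, min_margin=0.2):
    """Summarize the predicted cluster probabilities of one producer.

    `probabilities` maps each cluster label to its predicted probability.
    A prediction is flagged for review when the gap between the most likely
    and the second most likely cluster is smaller than `min_margin`.
    """
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    (top, p_top), (runner_up, p_runner_up) = ranked[0], ranked[1]
    margin = p_top - p_runner_up
    return {
        "predicted": top,
        "runner_up": runner_up,
        "margin": margin,
        "needs_review": margin < min_margin,
    }


# Invented probabilities for a hypothetical newly certified producer.
new_producer = {"C": 0.35, "NC-E": 0.33, "NC-M": 0.20, "NC-S": 0.12}
summary = summarize_prediction(new_producer)
```

For the hypothetical producer above, the summary would point to both the compliers cluster and the environmental non-compliance cluster as plausible, which is more informative than the single predicted label.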

Finally, the clustering exercise in Sect. 4 lays a useful foundation for the systematic assignment of training units to certified producers in order to enhance their compliance with the standard. Our analysis suggests four specifically tailored training courses for certified farmers, in accordance with the four identified clusters. On the one hand, this would allow resources to be bundled and each education program to be developed to its highest effect. On the other hand, together with an improved prediction or insights from a first audit, it would become straightforward to assign each certified producer to the best training. While this approach seems efficient, it has one major shortcoming: It does not provide a solution for outliers within a cluster. However, there is a trade-off. A small number of training units can be handled efficiently at the cost of having misfits within the groups. In contrast, a large number of training units (equal to the number of certificates in the extreme) would allow for made-to-measure guidance, with the expected increase in costs.

9 Conclusion

Voluntary sustainability standards are gaining popularity among conscious consumers. However, the effects of certification are still not well understood. We contribute to this scientific debate by investigating compliance-a complex and to date understudied aspect of the workings of VSS. For one particular voluntary sustainability standard-Rainforest Alliance-we show how compliance patterns can be clustered in order to understand and simplify the problem. We further identify possible drivers of these patterns, which can potentially be used for sophisticated causal inference studies, but also to improve VSS through targeted training or better risk assessment. Finally, we run a random forest algorithm using these drivers as predictors. Even though we find only a mediocre accuracy of the classification, the approach is promising since more data will be collected in the future. Thus, our study, while providing first descriptive insights into the complex matter of compliance, also outlines an avenue for further research toward understanding the mechanisms of certification.