1 Introduction

1.1 Child labor as a supply chain risk

Child labor deprives children of their childhood, their potential and their dignity, and is harmful to their physical and mental development. In Conventions Nos. 138 and 182, the International Labour Organization (ILO) calls for the elimination of child labor (International Labour Organization 1973; International Labour Organization 1999), and the 2014 Nobel Peace Prize for Kailash Satyarthi and Malala Yousafzai recently focused global attention on the topic (Nobelprize.org 2014). Nevertheless, child labor is still common in many fields of work, especially in developing countries. In 2012, the number of children working in any form totaled 168 million, with 85 million of these performing hazardous work involving physical abuse or the handling of dangerous machinery (International Programme on the Elimination of Child Labour 2013, 13).

Practicing child labor or doing business with parties that engage in such practices is not sustainable from a social point of view. Issues with elements of social sustainability such as human rights or forced labor can have a significant effect on a company’s image (Taylor et al. 2009), even if they occur elsewhere in the supply chain (Lemke and Petersen 2013). Although the responsibility for sustainability is shared along the whole supply chain (Vermeulen and Seuring 2009), the focal, dominant company may be particularly affected. This can be seen in the cases of Nike and Gap in 2000 (Kenyon et al. 2000) or of Foxconn and Apple in 2012 (Tsukayama 2012), which involved accusations of unethical working conditions and showed that current risk management practices at least partly fail. The resulting boycotts and losses in brand value on the product and labor markets also pose a significant economic risk (Anderson and Anderson 2009). Consequently, sustainability risk management needs to cover the whole supply chain (Seuring and Müller 2008), and containing the likelihood of child labor at supplier locations is an important aspect of supply chain risk management (Lemke and Petersen 2013) and corporate social responsibility (Hutchins and Sutherland 2008).

“Codes of conduct” are acknowledged as the primary instrument for managing social sustainability risks in supply chains. They define which standards need to be followed, thereby regulating sustainability aspects in supply chains and guiding suppliers (Ciliberti et al. 2008; Egels-Zandén 2007; Pedersen and Andersen 2006). However, codes of conduct do not solve the problem of information asymmetry in supply chains (Sarkis et al. 2011), i.e. they do not change the fact that the supplier has knowledge of the local situation which is not available to the focal firm. Consequently, they must be monitored and enforced to guarantee compliance (Pedersen and Andersen 2006). Supplier assessments are seen as a particularly important tool to safeguard compliance with pre-defined standards (Keating et al. 2008; Miemczyk et al. 2012), and auditing is a central approach to ensure that these standards are met (Klassen and Vereecke 2012). On-site auditing is often required to gain insights into social sustainability (Kogg and Mont 2012; Klassen and Vereecke 2012; Benoît and Vickery-Niederman 2011). Another important instrument, often associated with audits, is certification, defined at an inter-company level (Ashby et al. 2012), which helps reduce control costs through sector-specific or cross-sector initiatives (Vermeulen and Seuring 2009) and typically standardizes audit details to a certain extent (Kogg and Mont 2012). These approaches to measuring social sustainability often require significant resources together with third-party input (Vermeulen and Seuring 2009). Consequently, one approach has been to form coalitions to share insights from monitoring and reduce costs (Bremer and Udovich 2001).

However, monitoring and assessment approaches based on audits and certifications are constrained for a number of reasons. First, supplier monitoring and assessment in modern supply chains is complex. Today’s supply chains have a large number of suppliers, often globally dispersed; this complexity and the associated costs (Kogg and Mont 2012) make effective ongoing verification of compliance with standards difficult. It is practically impossible for companies to have all factors evaluated in depth, let alone first-hand by an internal company employee, even though the need exists in practice (Kogg and Mont 2012). Even more resources are required if deeper tiers of the supply chain are considered, provided these suppliers are known at all (Svensson 2009; Grimm et al. 2016).

Second, timeliness is a problem. Given that internal supplier information is mostly available only through audits, certifications, or supplier communication (e.g. Klassen and Vereecke 2012), there is an associated lag in compliance verification. Infrequent certifications provide only limited defense against issues, as, for example, the child labor revelations at the OTTO group showed in 2007 (Aktiv gegen Kinderarbeit 2014; McDougall and Schmitz 2007). Moreover, suppliers nowadays tend to be overloaded with requests for certifications and audits (Ceres and Sustainalytics 2014), which is referred to as “audit fatigue” (Kogg and Mont 2012, 162). Others even describe monitoring as conveying an adversarial stance, with focal firms acting more like supply chain ‘bullies’ than CSR ‘champions’ (Boyd et al. 2007).

Third, there is a lack of objectivity. Data gathered through monitoring systems may not reflect the truth due to the potential effects of bribery, corruption, and differences in culture or standards, particularly if third parties are used to perform the groundwork (Leire and Mont 2010). Locke et al. (2009) declare the information gathered through audits to be often inaccurate, biased, and incomplete. Some companies (e.g. British Telecommunications 2012) use supplier questionnaires to identify which companies to follow up on. However, such selections depend on the quality of the data that suppliers themselves provide (Leire and Mont 2010).

Finally, the management of the collected information and its aggregation into valid KPIs can take considerable time. Usually, managers prefer indicators that are easy and fast to calculate (McIntyre et al. 1998).

1.2 Automated child labor risk monitoring

In order to reduce the effort of monitoring the risk of child labor in the supply chain, one could envision an expert system that automatically computes a risk score based on the geographic location and industry of a supplier and the results of the respective audits and certifications. This score would then be used to direct on-site audits to the supplier locations with the highest risk. Given that unstructured text has already been integrated promisingly into risk management approaches in non-sustainability domains such as financial risks (e.g. Groth and Muntermann 2011), general business risks (Leidner and Schilder 2010), tracking society-related sustainability indicators (Rivera et al. 2014), or employee fraud (Holton 2009), it appears fruitful to also integrate external public information into the computation of the child labor risk score, both to enhance the objectivity of monitoring results and to overcome the time lag between supplier reviews. In fact, external risk-related information collection has already been reported in the sustainability domain, but without specifying automated, integrated IT-supported risk modeling (Koplin et al. 2007). Ongoing input from news sources or social networks may help to identify risk-relevant events in the supply chain. These events can be gathered based on geography-, sector-, and production-specific relations (see e.g. UNEP 2009, 60), or on other cause-and-effect relationships derived from the literature on child labor. First suggestions for the system have been presented in TBD [removed for peer-review/conference paper].

To check whether a child labor risk management system as described above would be useful in practice, expert interviews on information needs were conducted with five managers in charge of sustainability in the oil and gas, paper, and retail industries. The requirements stated to be top priority were “Inclusion of external data (external blacklists, external platforms, social media, news media)”, “Inclusion of risk metadata (country/region data, supplier data, legal data, components/products, certifications)”, “Definition of KPI structure and aggregation logic” and “Allow for supplier selection, supplier ranking and audit triggering/prioritization”. Thus, it is worthwhile to study a risk management system for monitoring child labor risk in supplier locations which mines text sources for reports of child labor incidents and automatically combines this evidence with risk assessments based on supplier location and industry as well as the results of audits and certifications.

1.3 Existing literature

Several quantitative models for supply chain risk management that reflect social sustainability have been described in the literature. A model specific to textile supply chains is suggested by Rabenasolo and Zeng (2012); it uses linguistic variables and relies on a combination of performance indicators. Weber et al. (2010) include sustainability risks in a credit risk indicator template filled in by credit officers. Badurdeen et al. (2014) outline a Bayesian-network-based approach to combine various risk categories into a final probabilistic score; however, social and environmental concerns are just one input among several, and the analysis is based on expert input only. Finally, Hadiguna (2012) suggests, without offering mathematical details, a decision-support framework for risk management that addresses social elements such as labor strikes, demonstrations, and local customs to a limited extent. Other application contexts for Bayesian networks have been, for example, new product development (Chin et al. 2009) and early crisis warning (Dabrovski et al. 2016).

Sarkis and Dhavale (2014) introduce an initial approach using Bayesian inference (without a network). They combine a series of criteria reflecting sustainability, normalize them, and apply Gibbs sampling to derive a posterior distribution from prior distributions and observations. Erol et al. (2011) present a yearly sustainability alert system based on fuzzy entropy and fuzzy multi-attribute utility that creates alerts if values for sustainability indicators are above or below certain thresholds. Fuge et al. (2013) include probabilistic measures of sustainability, integrating different private and public indicators via a weighting; they replace unknown specific indicators with more general ones while accounting for the additional variance this introduces. Similarly, Ahi and Searcy (2014) suggest a stochastic model to measure the sustainability performance of a supply chain. Shokravi and Kurnia (2014) include an importance parameter based on text analysis in order to derive a weight for each measure. Wu et al. (2017) use a multitude of quantitative techniques to combine qualitative, social media, and quantitative data; for social media, they rely strongly on term frequencies and company-related sources. Mani et al. (2017) show the usefulness of big data analysis for sustainability analysis in supply chains using a case study. In summary, to the best of the authors’ knowledge, none of the related quantitative approaches deals with ongoing automatic monitoring that combines internal data such as audits with external data sources, including broad news coverage. This is the case even though text mining is recognized as one of the key techniques in big data analysis (Nguyen et al. 2017) and more tools for sustainability analysis with big data are being called for (Wang et al. 2016).

1.4 Outline

Given this state of the art, a prototype of an expert system for monitoring child labor risk in supply chains was built and then tested by comparing its decisions with those of human experts. Section 2 of the paper describes the prototype, and Sect. 3 reports the results of the tests. After the discussion in Sect. 4, a summary of the findings, limitations of the work, and suggestions for further research are the subject of Sect. 5.

2 Description of the system

2.1 Overview

We propose using a Bayesian network (BN) to compute the likelihood of child labor at a supplier location based on evidence from geography and sector, audits, and news reports. BNs are a well-known probabilistic modeling technique introduced by Pearl (1988) and based on Bayes’ theorem. A BN is a directed acyclic graph (DAG) in which nodes correspond to random variables of interest and directed arcs represent direct causal or influential relations between nodes. The uncertainty in the dependence between variables is represented locally by conditional probability tables (Watthayu and Peng 2004).

BNs have previously been applied successfully to risk management due to their understandability and ease of information integration (Duespohl et al. 2012; Koks and Challa 2005; Wooldridge 2003). The key advantages of BNs are their explicit treatment of uncertain information supporting decision making (Reckhow 1999) and the possibility to include different types of sources in a single consistent model (Uusitalo 2007; Duespohl et al. 2012; Wooldridge 2003). Bayesian updating allows news to be included as it arrives, continuously revising the likelihoods (Neapolitan 2003, 12–29). BNs also tend to be easily communicable, fostering a common understanding (Duespohl et al. 2012; Correa et al. 2009). These features are of special interest for a quantitative risk model, as a decision-relevant information system must be understandable for company executives who have to make and defend their decisions based on its input (Hubbard 2009).

We suggest implementing a BN for each supplier location, starting from the initial hypothesis that a supplier conforms to given, pre-defined social sustainability standards (e.g., a “code of conduct”). We then calculate the likelihood of this hypothesis being false, i.e., the likelihood of a compliance breach, based on the available evidence. Computing these likelihoods for individual supplier locations and relating them provides a relative risk ranking (see Fig. 1). Figure 2 depicts the structure of a BN for a supplier location in the notation of Netica, which was used as the implementation environment (Norsys 2013). Input data for the BN can come from multiple static or dynamic sources that provide either structured or unstructured data. To gather these inputs, survey data on child labor and audit scores are used, and text mining is leveraged to extract information on child labor incidents from unstructured news articles. For any new location, the prior distribution is used initially. Once the location is known, country, area type, and sector are given and the priors can be updated. The priors for audits are updated whenever the result of an audit of the respective supplier location is entered into the BN, and the priors for observations are updated whenever information on a child labor incident relevant to the location is supplied as input to the BN.

Fig. 1
figure 1

Overview of risk model and system

Fig. 2
figure 2

Full view of Bayesian risk network
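To make the per-location model concrete, the following sketch (in Python, ours rather than the Netica implementation) lists the node dependencies described in the remainder of this section; the node names are simplified placeholders, and the authoritative structure is the one shown in Fig. 2.

```python
# Simplified sketch of the supplier-location BN structure described in
# Sections 2.2-2.6 (node names are ours; see Fig. 2 for the Netica model).
SUPPLIER_LOCATION_BN_EDGES = [
    # context variables -> contextual prior (Sect. 2.2)
    ("country", "contextual_prior"),
    ("area_type", "contextual_prior"),
    ("sector", "contextual_prior"),
    # audit data -> audit likelihood (Sect. 2.3)
    ("audit_score", "audit_likelihood"),
    ("time_since_last_audit", "audit_likelihood"),
    # incident observations -> observational likelihood (Sect. 2.4)
    ("observation_frequency", "observational_likelihood"),
    ("credibility", "observational_likelihood"),
    ("relevance", "observational_likelihood"),
    # combination into the breach likelihood (Sect. 2.6)
    ("contextual_prior", "breach_likelihood"),
    ("audit_likelihood", "breach_likelihood"),
    ("observational_likelihood", "breach_likelihood"),
    ("code_of_conduct_signed", "breach_likelihood"),
]
```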

The parameterization and testing of the system was conducted together with 28 experts, 13 of whom had a background in supply chain management, 6 in sustainability management, and 3 each in general management, risk management, and other fields. 15 of the experts had more than 5 years of experience in their position, another 12 had between 1 and 5 years, and only 1 had less than 1 year. 10 experts were based in Austria, 7 in Germany, 2 each in China, Malaysia, and the United Kingdom, and 1 each in the Czech Republic, Denmark, Romania, France, and Colombia. The companies the experts worked for came from a variety of industries, but with a clear concentration in manufacturing and wholesale/retail trade (18 experts), which have a potentially higher exposure to child labor. According to India's National Sample Survey Round 66 (NSS-R66) 2009–2010 (Understanding Children’s Work 2010), 23.6% of working children aged 5 to 14 in urban areas work in commerce, the highest share of all sectors. 23 experts worked in companies with more than 1000 employees.

Although experts are subject to boundedly rational behavior (for an early discussion, see e.g. Edwards 1954), they decide using a set of decision strategies that rely heavily on heuristics and on learning from the past (March 1994; Shanteau 1988). Therefore, a comparison with expert output provides an indication of the overlap between the system’s approach and these strategies. Moreover, it helps in discussing whether the experts’ responses can be covered by the system’s design.

2.2 Impact of region and sector

Let us now describe the various components of a supplier location BN in detail. Sources like UCW (Understanding Children’s Work 2014) show that the frequency of child labor varies regionally, depending on the country (C) and on whether the area is rural or urban (R), as well as between sectors (S). Therefore, suppliers located in different areas and working in different sectors have different prior probabilities of employing children. These contextual priors can be determined on the basis of publicly available statistics on the number of children and companies per context C, R, S and the fraction of children who are non-self-employed workers. As the number of children working at a particular company cannot be derived from the available statistical data, it is reasonable to assume that working children are distributed randomly across the companies in the context. If all child workers are randomly assigned to the \(NCOMP_{C,R,S}\) companies in the specific context, the probability that a given child is not linked to a specific company is

$$1 - \frac{1}{{NCOMP_{C,R,S} }}$$

Multiplying the rate of non-self-employed working children in context C, R, S by the total number of children in that context yields the number of non-self-employed working children \(ANCL_{C,R,S}\). The probability that none of these children works at the specific company is then the above probability raised to the power of \(ANCL_{C,R,S}\):

$$\left( {1 - \frac{1}{{NCOMP_{C,R,S} }}} \right)^{{ANCL_{C,R,S} }}$$

and the probability of at least one child working at the respective company is one minus this value:

$$P_{CL} = 1 - \left( {1 - \frac{1}{{NCOMP_{C,R,S} }}} \right)^{{ANCL_{C,R,S} }}$$
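The following minimal sketch (ours, not the authors' code) implements this contextual prior; the input figures are illustrative placeholders, not values from the statistical sources cited below.

```python
# Contextual prior P_CL = 1 - (1 - 1/NCOMP_{C,R,S})^ANCL_{C,R,S}
def contextual_prior(n_companies: int, n_working_children: int) -> float:
    """n_companies        -- NCOMP_{C,R,S}: companies in context (country, area type, sector)
    n_working_children -- ANCL_{C,R,S}: non-self-employed working children in that context
    """
    p_child_misses_company = 1.0 - 1.0 / n_companies           # one child not at this company
    p_all_miss = p_child_misses_company ** n_working_children  # no working child at this company
    return 1.0 - p_all_miss                                    # at least one child at this company

# Illustrative numbers only (not taken from Table 1):
print(round(contextual_prior(n_companies=50_000, n_working_children=1_200), 5))
```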

Table 1 shows the resulting \(P_{CL}\) values for supplier locations in India and Indonesia for the year 2012, calculated on the basis of BPS Statistics Indonesia (2008), Diallo et al. (2013), International Programme on the Elimination of Child Labour (2013), Ministry of Statistics and Programme Implementation of India (2006), The World Bank Group (2013) and Understanding Children’s Work (2010).

Table 1 Comparison of example child labor incident priors for different approaches, countries, area type, and sectors

To derive the standard deviation of this contextual prior, it is assumed that the errors made when calculating the probability of a child labor incident are comparable to those made when estimating the frequency of child labor in a particular country. Table 2 depicts the differences between the results of two surveys referring to the same reference period in nine countries (see Guarcello et al. 2010, 10). We interpret these differences as 2\(\sigma\) intervals and set the standard deviation of the contextual prior \(\sigma_{prior}\) to 13.32, i.e., half the mean of the differences reported in Guarcello et al. (2010).

Table 2 Estimation of standard error of child labor rate calculations based on country comparisons

As we only have estimates for the mean and variance, the assumption of a normal distribution is the most parsimonious one based on entropy arguments (Cover and Thomas 1991, 409f.). We therefore assume that \(P_{CL}\) is the expected value of a normally distributed random variable. As can be seen from Fig. 2, negative values are dealt with in the BN by discretizing the potential outcomes into intervals, where the lowest interval aggregates negative values. Initially, all realizations of the context are assumed to be equally likely in the BN of a supplier location. Once the respective country, area, and sector are known, the corresponding value is selected with 100% probability.

In order to validate the method for determining the contextual prior, the experts were asked to rank the four hypothetical supplier locations shown in Table 3 according to child labor risk. Based on the data provided, the experts suggested an initial ranking that is in many ways comparable to the one created using the model. To better compare the two approaches, Fig. 3 introduces a scaled measure, derived by normalizing the best-rated supplier to zero and the worst-rated supplier to one. For the model, relative distances were calculated from the mean prior values, with the model set up using priors corresponding to the values given in the questionnaire. The expert responses were aggregated using a weighted score (response frequency \(f\) times weight \(w\), with \(w = 1\) for the best rank and \(w = 4\) for the worst rank) and then normalized to the interval [0, 1]. Note that the experts’ responses are not interval scaled, so only the ordering is meaningful. Under these assumptions, experts tend to view the location of supplier B as worse than that of supplier D, in contrast to the model. Nevertheless, the model prior and the expert responses show a generally comparable pattern.
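As an illustration of this scaled comparison, the sketch below (ours; all figures are hypothetical, not the study's data) normalizes both the model priors and the weighted expert scores so that the best-rated supplier maps to 0 and the worst-rated supplier to 1.

```python
# Scaled comparison measure used in Fig. 3 (sketch with hypothetical inputs).
def scale_to_unit(values: dict) -> dict:
    """Map the best-rated supplier to 0 and the worst-rated supplier to 1."""
    lo, hi = min(values.values()), max(values.values())
    return {k: (v - lo) / (hi - lo) for k, v in values.items()}

def expert_score(rank_frequencies: dict) -> float:
    """Weighted score sum(f * w), with weight 1 for the best rank and 4 for the worst."""
    return sum(w * f for w, f in rank_frequencies.items())

model_priors = {"A": 0.010, "B": 0.018, "C": 0.004, "D": 0.022}  # hypothetical mean priors
expert_ranks = {"A": {1: 10, 2: 9, 3: 5, 4: 4},                  # hypothetical rank counts
                "B": {1: 3, 2: 6, 3: 8, 4: 11},                  # (28 experts per supplier)
                "C": {1: 12, 2: 8, 3: 6, 4: 2},
                "D": {1: 3, 2: 5, 3: 9, 4: 11}}
expert_scores = {s: expert_score(fr) for s, fr in expert_ranks.items()}
print(scale_to_unit(model_priors))
print(scale_to_unit(expert_scores))
```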

Table 3 Supplier locations used for validation
Fig. 3
figure 3

Consolidated ranking comparison between all experts and model

A more granular analysis of the result, however, shows that experts often strongly disagree in their judgment of the riskiness of the different suppliers (see Fig. 4). The rank selected by the majority of experts is only equivalent to the one calculated by the model for ranks three and four. Suppliers B and D show a particularly large spread of answers.

Fig. 4
figure 4

Rankings per expert and model

2.3 Impact of audits

Audits are limited in what they can measure and can only be conducted within a defined timeframe, leaving suppliers unobserved before and after it (Locke et al. 2009). Also, a higher number of compliance audits does not imply that a supplier performs better than others; often, the compliance level of the supplier stays the same and sometimes it even worsens (Locke et al. 2007). Consequently, we only include the result of the last audit in the BN and assume that the variance of the breach likelihood increases with the time since the last audit.

In order to infer the relationship between audit score and breach likelihood we asked the experts “Which probability of having a child labor incident (if only audit data is taken into consideration) would you associate with a random supplier reaching either a minimum (worst), a medium, or a maximum (best) audit score?”. As can be seen from the high standard deviations provided in Fig. 5 and Table 4, the relation between audit scores and average probability of an incident is judged to be very ambiguous. While some experts put a lot of trust in audit scores, others see only limited value. Even if an audit attributes the best score to a supplier, experts tend to still see a certain probability of an incident. Similarly, the worst audit score does not necessarily indicate that child labor is present.

Fig. 5
figure 5

Number of responses for incident probability given a supplier audit score grouped by percentage categories

Table 4 Average estimated probability values (incl. standard deviation) of incident for different audit scores

We assume that an audit yields results in the range [\(a_{min}\), …, \(a_{max}\)], where the minimum audit score \(a_{min}\) is assumed to be greater than or equal to zero. As we only have estimates for the mean and variance, the assumption of a normal distribution is again the most parsimonious one based on entropy arguments (Cover and Thomas 1991, 409f.). Based on the experts’ judgments, it is assumed that the audit likelihood \(P_{audit}\) follows a normal distribution whose expected value \(E\left( {P_{audit} } \right)\) is related to the audit score \(a\) via

$$E\left( {P_{audit} } \right) = m - a\times\frac{m - n}{{a_{max} }}$$

where \(a\) is the audit score, \(a_{min} \ge 0\) the minimum audit score, and \(a_{max}\) the maximum audit score. For the prototype, the minimum audit score \(a_{min}\) is set to 0 and the maximum \(a_{max}\) to 5, and the parameters are set to m = 57 and n = 48.2. This formulation fits the values given in Table 4 and was considered valid by the experts. Table 5 contains the time-dependent values used for the standard deviation of the audit probability.

Table 5 Audit probability standard deviation values depending on time since last audit

For the prior of the audit score, a normal distribution with a mean of 4 and a standard deviation of 1 is assumed, while for the time since the last audit, which determines the variance of \(P_{audit}\), a prior with a mean of 9 months and a standard deviation of 3 months is used. Once an actual audit of a supplier location is entered (i.e., one discretization is marked with 100% probability), the prior for the node no longer has any influence. The respective distributions were derived from discussions with the experts and given to them for validation.
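A minimal sketch (ours, not the authors' code) of the audit node parameterization follows; it maps an audit score to E(P_audit) using the values reported above, on the scale of Table 4, while the standard deviation would be looked up in Table 5 based on the time since the last audit (those values are not reproduced here).

```python
# Expected audit likelihood E(P_audit) = m - a * (m - n) / a_max  (Sect. 2.3).
def expected_audit_likelihood(score: float, a_max: float = 5.0,
                              m: float = 57.0, n: float = 48.2) -> float:
    """score -- audit score a in [a_min, a_max], with a_min = 0 for the prototype."""
    if not 0.0 <= score <= a_max:
        raise ValueError("audit score outside [a_min, a_max]")
    return m - score * (m - n) / a_max

# Worst score (0) maps to m = 57, best score (5) maps to n = 48.2 (cf. Table 4):
print(expected_audit_likelihood(0.0), expected_audit_likelihood(5.0))
# The standard deviation of P_audit is not computed here; it depends on the
# time since the last audit and would be taken from Table 5.
```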

2.4 Impact of observations

Both observations of child labor incidents in related contexts and news on drivers affecting the demand for and supply of child labor are candidate inputs for determining the breach likelihood. Empirical research has uncovered a multitude of factors influencing the extent of child labor. Only a few of these factors, such as socio-economic dislocation (economic crisis, political and social transition) or production peaks/labor shortages, are observable from external information and have a short-term impact. These are difficult to detect automatically, though, as they cover a wide array of happenings, including earthquakes, volcanic eruptions, strikes, or demand surges. Moreover, while the literature identifies connections between these events and child labor, the strength of the effect varies by context. Also, descriptions of actual child labor incident observations are more homogeneous than descriptions of factors that influence child labor. Thus, descriptions of child labor incidents are easier to detect and to codify automatically than descriptions of influencing factors. In addition, one can argue that incident reports indirectly cover the relevant influence factors, as socio-economic dislocation or production peaks/labor shortages should affect all companies operating in a similar context. We therefore choose to include only news reporting incidents of child labor into the expert system.

In order to determine the impact of various types of news on child labor, the experts were confronted with the four hypothetical news articles in Appendix 1. They were then asked “How much does the news report influence the perceived probability of a child labor incident at Supplier B?”, and a five-step Likert influence scale (extremely influential to not at all influential; Wigas 2006) was used to code the answers. As can be seen from the score depicted in Fig. 6, all articles provided have at least a slight influence on the experts’ decisions. Comparing articles one, two and three, the additional geographic detail (region) is nearly as influential as the explicit mention of the company. Hence, geographic proximity drives relevance. This is not the case for the article obtained through social media (although the lower rating could also be due to the reference to a different sector).

Fig. 6
figure 6

Number of responses per article and influence selection (*calculation of score: extremely influential—5 points; very influential—4 points; somewhat influential—3 points; slightly influential—2 points; not at all influential—1 point)

Consequently, we suggest considering two variables, credibility and relevance, to represent the content quality of an observation. Credibility c is defined as comprising the content of evidence captured by a sensor which includes veracity, objectivity, observational sensitivity, and self-confidence (Blasch et al. 2013), while relevance r assesses how a given uncertainty representation is able to capture whether a given input is related to the problem that was the source of the data request (Blasch et al. 2013). In other words, the model understands relevance as capturing how closely the messages used as input for a certain supplier location are in fact related to the supplier location. In order to derive a relevance measure, the availability of dimensional attributes in the news articles is used as an approximate indicator. The more an observation can be linked to a certain location in a granular and specific way, the more relevant it is. If observations with partly conflicting dimensional information are included, the relevance can only be derived based on the non-conflicting dimensional information. Credibility is suggested as being defined either at an input channel or source level in order to cover different media types as completely as possible.

Tables 6 and 7 show the particular values used for credibility c and relevance r.

Table 6 Credibility values based on publishing channels
Table 7 Relevance values based on dimensional attribute availability

A BN is initialized at a particular point in time, which serves as a basic reference point. Until this time, zero or more observations of child labor incidents may have been stored, and a set of observations can be retrieved as discussed above. In general, when revising the probability based on evidence from textual media sources, two options may be considered: either only the latest observation is entered as a single finding, or the network is continuously updated with the evidence from new observations. If the observations are assumed to be independent, each one is likely to include valuable information. Consequently, the BN is modeled using the latter option, allowing the inclusion of evidence from multiple reports. It is then the task of the input procedure to ensure independence between the incident observations. Relevance and credibility can be evaluated for each observation as described above using the pre-configured values in Tables 6 and 7, given an observation’s data. Even observations with low credibility or relevance are understood to increase the overall observational probability. Given these assumptions, the expected value of the observational likelihood \(P_{obs}\) needs to be a monotonically increasing function of the number of independent incident observations included: \(E\left( {P_{obs} \left( {f_{1} ,c_{1} ,r_{1} } \right)} \right) \le E\left( {P_{obs} \left( {f_{2} ,c_{2} ,r_{2} } \right)} \right)\) for all \(f_{2} \ge f_{1}\), with \(0 \le c_{1,2} ,r_{1,2} \le 1\) and \(f_{1,2} \in {\mathbb{N}}\).

Both the credibility and the relevance are continuously updated with new evidence from observations. The evidence is then entered into the BNs whose context overlaps with the observations’ context. This leads to a model containing frequency f, credibility c, and relevance r as variables. These are combined via the equation \(x = f\cdot\left( {c + r} \right)\), which fulfils the monotonicity requirement formulated above and reflects, in a simple way, the notion that the weight of the evidence grows with frequency as well as with credibility and relevance. However, this monotonically increasing function x has no defined upper bound. Therefore, a scaling function is needed to return a value between 0 and 1 for the mean of the normal distribution of \(P_{obs}\). For this purpose, a monotonically increasing function with limit 1 is suggested, which can be achieved with an inverted, shifted hyperbola. The frequency score function \(s\left( x \right) = 1 - \left( {\frac{1}{{1 + \frac{x}{\tau }}}} \right)\) is used for this (it can be parameterized through the parameter τ, which is initially set to 5). This formula has the desired property of limit 1 and is simple.

The node of the observational likelihood \(P_{obs}\) is represented with a normal distribution given relevance \(r\), credibility \(c\), and frequency \(f\). Hence, knowing about \(f\) observations for a specific supplier location, the calculated likelihood of a child labor incident should be within a predefined confidence interval. This interval should be smaller the higher the number of observations with high relevance and credibility received. For a normal distribution, 95% of the probability mass lies within the mean \(\mu\) plus/minus 1.959964 times the standard deviation. If, as defined by the user, the area covered by the 95% interval is \(p^{\prime}\) percent points when no observation has been received and \(p^{\prime\prime}\) percent points when ten fully credible and relevant observations have been received, then the respective standard deviations in percent points can be calculated with \(\sigma \left( p \right) = \frac{p/2}{1.959964}\). For example, setting \(p^{\prime}\) to 40 and \(p^{\prime\prime}\) to 10 percent points yields \(\sigma^{\prime} = 10.204\) and \(\sigma^{\prime\prime} = 2.551\). \(\sigma\) is assumed to depend on \(f, r, c\) in a linear way:

$$\sigma \left( {f,r,c} \right) = \alpha - \beta \cdot f\cdot r\cdot c$$

Using \(\sigma^{\prime}\) and \(\sigma^{\prime\prime}\), the values for \(\alpha\) and \(\beta\) can be determined leading to the following function

$$\sigma \left( {f,r,c} \right) = \alpha - \beta \cdot f\cdot r\cdot c = 10.204 - 0.7653\cdot f\cdot r\cdot c$$

As shown in Thoeni (2015), this specification is consistent with the monotonicity condition stated above; it was also considered plausible by the experts.
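Putting the pieces of this section together, the following sketch (ours, not the authors' code) returns the mean of \(P_{obs}\) via the frequency score s(x) and its standard deviation via the linear rule derived above.

```python
# Observational likelihood node (Sect. 2.4): mean via s(x) with x = f * (c + r);
# standard deviation via sigma(f, r, c) = 10.204 - 0.7653 * f * r * c
# (sigma in percent points, as parameterized in the text).
def observation_node(f: int, c: float, r: float, tau: float = 5.0):
    """f -- number of independent incident observations
    c -- credibility in [0, 1] (Table 6)
    r -- relevance in [0, 1] (Table 7)
    """
    x = f * (c + r)                        # combined evidence measure
    mean = 1.0 - 1.0 / (1.0 + x / tau)     # s(x): monotonically increasing, limit 1
    sigma = 10.204 - 0.7653 * f * r * c    # narrower interval for more good evidence
    return mean, sigma

# No observations vs. ten fully credible, fully relevant observations:
print(observation_node(0, 0.0, 0.0))    # mean 0.0, sigma ~ 10.204
print(observation_node(10, 1.0, 1.0))   # mean 0.8, sigma ~ 2.551
```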

2.5 Text mining child labor incidence observations

2.5.1 Methodology

Manually reading, extracting, and coding child labor incidents from a continuously arriving stream of text is clearly infeasible. However, automatic extraction is difficult: the examples shown in Appendix 2 give an impression of the diverse ways in which child labor is depicted in texts. First, the datasets include general reports, which often reference an array of different child labor incidents (#1, #2) together with contextual references (e.g. #3). Beyond these general reports, other texts refer to a combined set of multiple child labor cases (#4, #5). Incidents may be depicted in narrative fashion, building on a single individual case (#6) or at least referencing one directly (#7). However, incidents are also reported directly, as can be seen in the later text excerpts (#8 to #11). Moreover, reports on child labor can include categories such as prostitution, begging, or domestic work that are less relevant from a company perspective (#4).

The automatic text mining and BN updating procedure depicted in Fig. 7 has been developed to cope with this challenge. Given the low frequency of child labor incidents and the large variety of forms in which these are expressed, we employ data-driven document classification with candidate set reduction and tagged event extraction. In particular, the following four text mining steps are performed:

Fig. 7
figure 7

Text mining and BN updating workflow for one document

  1. Preprocessing and Candidate Set Reduction: A tokenizer splits words and other characters, a sentence splitter detects sentence boundaries, and a POS tagger is used to differentiate word lemmas. As sentences close together in a text tend to be on the same topic (Zha 2002), and a direct mention of “child labor” may be seen as the most obvious trigger of a child labor incident event, a distance-based approach using a cut-off distance is suggested to prune negative cases. The distance is measured as the number of characters between “child” and a word indicating “labor”; indicative words may be synonyms, hypernyms, or other related word sets, and stop words between “child” and “labor” are eliminated. The cut-off distance is determined together with model selection and parameter estimation so as to optimize the F1 measure combining precision and recall (a simplified sketch of this pruning step follows after this list).

  2. Classification: The text passage thus identified between “child” and “labor” could describe a child labor event, which could in turn contain several child labor observations, as the following example demonstrates: “As many as 36 cases from Koderma and 22 from Khunti were brought to Dube's notice when he visited these districts. Such incidents include a girl from Khunti missing since 2009 when she went to Delhi for work […]”. Classification based on the relative word frequencies within the extracted feature is used to verify whether the feature actually deals with a child labor event. For training, the Reuters TRC2 corpus containing 1,800,370 news articles (Reuters, National Institute of Standards and Technology 2009) was used. 16,948 articles in TRC2 contained the word “child”, and manual tagging yielded 117 articles that contained a child labor event. A number of variants for feature construction (maximum or minimum distance between “child” and “labor”, including or excluding leading or trailing words to complete sentences), model selection (SVM, PAUM, KNN, NB, C4.5), and cut-off values were tried out using this gold standard (Li et al. 2005; Quinlan 1993). This resulted in the choice of an SVM with a cut-off distance of 80, applied to the sentences within the maximum distance, as the best variant, with a precision of 97.1%, a recall of 73.7%, and an F1 value of 83.4%.

  3. Event extraction: The result of the classification step is a list of news reports, each of which should (and in the case of 100% precision actually does) contain at least one child labor incident observation referring to a child labor incident event. The goal of this step is to extract these incident observations together with the corresponding attribute values from the text. DBpedia Spotlight and Reuters Open Calais were used for geography tagging, and Open Calais for company tagging, yielding the respective URIs. Sector tagging must yield a sector conforming to the United Nations ISIC industry classification so as to match the statistical data; this was done via a rule-based gazetteer and an ML-based approach, where the labels and descriptions of the ISIC classes were used to train a classifier based on the lemmas of the respective tokens. Given that different taggers produce syntactically and partly also semantically different tags, these have to be aligned to a common tag set via a domain ontology (Gangemi 2013; Rizzo et al. 2012).

  4. Independent Observation Extraction: The output of event extraction is a frame with zero to several values for each of the dimensions (components of context): location (hierarchy country, region, city), company, and sector hierarchy. In a next step, child labor incident observations that contain at most one value per dimension are generated using heuristics based on proximity and on the distance to the child labor mention. This procedure yielded an overall F1 value of 44.7%, with F1 values of 57.6% for geography, 74.4% for organization, and 27.2% for sector in the best alternative. Comparing these results to those of the classification step shows that observation extraction performs considerably weaker, particularly for the sector dimension. Consequently, a manual cleaning step appears necessary before values are entered into a risk management system in a productive scenario. Finally, these child labor observations are checked for duplicates by testing whether they conflict with already stored observations within a given time frame. If there is no conflict, the old observation is deleted and the frequency of the new one is increased (see Fig. 8).

    Fig. 8
    figure 8

    Example independent observation (IIO) extraction

The output of these steps is a list of independent incident observations with fully or partly filled attributes, linked to a domain ontology representing this frame. These can then be incorporated into the risk model by activating the supplier location BNs whose context variables overlap with the dimensions of the observations, taking hierarchical relationships into account.
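To illustrate the pruning step referenced in step 1 above, the following sketch (ours, not the prototype's implementation) keeps a text only if "child" occurs within a character cut-off distance of a labor-indicating word; the word list is an illustrative simplification, and stop-word removal between the two terms is omitted.

```python
import re

# Distance-based candidate set reduction (simplified): keep a document only if
# "child" appears within CUTOFF characters of a word indicating "labor".
LABOR_WORDS = {"labor", "labour", "work", "worker", "employment"}  # assumed word set
CUTOFF = 80  # characters, the cut-off distance selected for the prototype

def is_candidate(text: str, cutoff: int = CUTOFF) -> bool:
    lower = text.lower()
    child_pos = [m.start() for m in re.finditer(r"\bchild(?:ren)?\b", lower)]
    labor_pos = [m.start() for w in LABOR_WORDS
                 for m in re.finditer(r"\b" + re.escape(w) + r"\w*", lower)]
    return any(abs(c - l) <= cutoff for c in child_pos for l in labor_pos)

print(is_candidate("NGOs report that children were forced to work in local mines."))  # True
print(is_candidate("The child enjoyed the school excursion."))                        # False
```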

2.5.2 Validation of input data availability

A key criterion for the usefulness of the system is the availability of child labor incident observations in sufficient number and granularity. To probe into this issue, the suitability of two publicly available data sets was investigated. News was gathered through searches on Google News, the European Media Monitor, and two selected RSS feeds, resulting in a very broad coverage of news. Altogether, this dataset contains 48,339 news articles published between 15th March 2011 and 16th September 2014. Most articles were retrieved from the British Broadcasting Corporation (BBC), Times of India, The Daily Mail, The Guardian, and The Hindu.

Also, a list of NGOs that potentially post on Twitter was built in order to retrieve the related content. To cope with the amount of data, we restricted ourselves to India, given the importance of English (Crystal 2004) and the prevalence of child labor (Understanding Children’s Work 2010), and to Indonesia, due to its high Twitter use (Bennett 2012) and the presence of child labor (Understanding Children’s Work 2009). 5138 unique NGO websites (predominantly in India) were automatically parsed to determine whether a link to Twitter was provided on the first page opened by following the website link (see Table 8). Altogether, this resulted in a set of 778 unique Twitter accounts. Using the Twitter search API iteratively produced a set of roughly one million tweets published between 11th July 2007 and 10th March 2014. When downloading each tweet, external links included in the tweet were followed, and the linked website data was stored together with the tweet. The tweets were then reduced to those whose linked text contains “child” anywhere, in line with the assumption used in the text mining methodology. This left 85,020 texts, which were stored as a new set for further processing.

Table 8 Overview of number of NGOs collected with and without websites, including sources (author’s representation)

The two sources were input into the text mining procedure described above. This resulted in 708 texts from the news data set and 280 texts from the NGO data set (see Fig. 9). The results of the analysis of a random selection of 100 articles from each of the news and NGO datasets are presented in Fig. 10. Manual inspection shows that the large majority of articles in the sets (96% and 89%, respectively) do in fact describe business-related child labor incidents. Only five cases had no dimension, i.e., the text contains a reference which can be classified as a child labor incident under the definition used here but which is too broad to yield a dimension for text mining. Furthermore, only 24 cases mention nothing but the country, although in many cases the additional detail does not go significantly beyond this; in fact, most articles also provide the sector (not shown above) without giving detail as granular as a geographic reference at the city level. However, the sample from the NGO dataset contains more geographically detailed cases. Analyzing the types of links in the NGO dataset random sample reveals that a large share of NGO posts redirect to classic news pages such as The Guardian when the shortened URLs in the posts are expanded. Nevertheless, many unique references (41 in total) still link to non-classic news pages such as blogs, NGO websites or videos (with descriptions), and special news pages or special websites. Thus, one can state that publicly available sources provide child labor incident observations in sufficient number and granularity.

Fig. 9
figure 9

Number of articles in different steps of input data analysis for news and NGO datasets (*unknown due to IR approach through news aggregators)

Fig. 10
figure 10

Analysis of news and NGO dataset based on a random sample of 100 items

2.6 Determination of breach likelihood

The final node of the BN models the likelihood of a breach of child labor compliance standards, \(P_{breach}\). It combines the contextual prior with the audit and observational likelihoods, thus revising the prior. Its distribution is modeled via a sampling process: Netica creates it using Monte Carlo sampling based on the model equations, i.e. by calculating the result for each node of the BN from the equations outlined above. In order to determine the weights of the three contributing nodes, the experts were asked “Which weight would you give the following three probabilities if they are combined in order to calculate an overall probability of a child labor incident at a supplier location? The sum should equal to 100%.” Table 9 summarizes the answers. It turns out that audits are seen as the most important source of information, being most frequently weighted highest (Fig. 11). In contrast, statistics are seen as the least important for a ranking. All three weights are significantly different from zero (at the 0.000 level).

Table 9 Average estimated relative importance of independently found probability values for overall supplier risk judgment
Fig. 11
figure 11

Frequency of ranks calculated from weights attributed to different sources of evidence

In addition to these values, the final node incorporates the distinction between supplier locations that have signed the “code of conduct” and (potential) supplier locations that have not. It is assumed that suppliers need to comply with the code of conduct irrespective of a signature; however, not signing it increases the breach risk significantly. Thus, for locations that have signed the code of conduct, the prior probability’s mean is shifted by a user-defined factor, for which 0.25 was used. This factor is modeled via a discrete node with two states representing whether or not the supplier has signed a code of conduct. The mean of the resulting final node can be used to establish the prioritization of the supplier locations.
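A minimal sketch of the combination step (ours, not the Netica model) is given below; the node weights and the direction of the code-of-conduct shift are illustrative assumptions, not the values of Table 9, and all three inputs are assumed to be on a common 0 to 1 scale.

```python
import random

# Monte Carlo combination of the three likelihood nodes into P_breach (Sect. 2.6).
def breach_likelihood(prior, audit, obs, signed_coc,
                      weights=(0.2, 0.5, 0.3),   # hypothetical weights (cf. Table 9)
                      coc_shift=0.25, n_samples=10_000, seed=1):
    """prior, audit, obs -- (mean, sigma) tuples of the three likelihood nodes."""
    rng = random.Random(seed)
    w_prior, w_audit, w_obs = weights
    total = 0.0
    for _ in range(n_samples):
        p = (w_prior * rng.gauss(*prior)
             + w_audit * rng.gauss(*audit)
             + w_obs * rng.gauss(*obs))
        if not signed_coc:
            p += coc_shift                  # assumed direction: no signature raises the risk
        total += min(max(p, 0.0), 1.0)      # clamp samples to a valid probability
    return total / n_samples

# Rank two hypothetical supplier locations by mean breach likelihood:
locations = {"A": ((0.02, 0.01), (0.50, 0.05), (0.20, 0.08), True),
             "B": ((0.05, 0.01), (0.55, 0.05), (0.60, 0.03), False)}
ranking = sorted(locations, key=lambda k: breach_likelihood(*locations[k]), reverse=True)
print(ranking)  # riskiest location first
```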

3 System test

Pitchforth and Mengersen (2013) outlined seven validity tests that can be performed in the context of a BN. Of these, nomological, face, and content validity have been discussed in the description of the system’s components above. Given that the network presented here does not reuse parts of other networks, concurrent validity is not tested. As there is no other BN available for child labor risk management, convergent validity cannot be tested, and consequently discriminant validity has not been tested either. This section addresses predictive validity, focusing in particular on model sensitivity.

In order to test the whole system, we start by assuming that the four hypothetical supplier locations in Table 10 have equal breach likelihoods. Subsequently, the locations are audited, with the results depicted in Table 11. After that, the four news items on child labor incidents described in Table 12 are entered into the BN. This yields the development of the breach likelihoods and rankings depicted in Fig. 12. Given that the third news item details the second, the update leads to an increase in its maximum relevance. Although it has a good audit result, entering the additional news sources causes the rank of supplier B to worsen by two places (as can be seen in Fig. 12). At the same time, the ranking of supplier D improves to first place after the third news item. These results were presented to the expert panel for discussion. The overall evaluation was positive: the experts agreed that the system’s conclusions were plausible and found the resulting derivation of the ranking of the supplier locations valid.

Table 10 Supplier locations used for validation
Table 11 Audit scores used as ranking test input
Table 12 Dimensions of news sources used as ranking test input
Fig. 12
figure 12

Development of breach likelihood and supplier rank after audit and news inputs

4 Discussion

With the concept of a BN that estimates the likelihood of a breach of child labor standards at a given supplier location and allows the integration of evidence using relevance and credibility scores, this paper introduces a quantitative risk model focused on social sustainability. Continuous model updates with new evidence increase the estimated likelihood that a supplier breaches the code of conduct whenever the evidence conflicts with the basic assumption that suppliers comply with a company’s standards. The update process additionally requires that the items entered as evidence are independent of each other, so that each update carries additional information leading to an increase in the breach likelihood. This behavior has also been shown mathematically ([reference deleted for blind review]). Relevance and credibility are pertinent factors when differentiating the quality of news inputs, and parameters were proposed to attribute a relevance value to a news text based on its level of detail. The idea that the influence on the overall breach likelihood increases as articles become more detailed was supported by the expert questionnaire. This also underlines that cases which do not occur directly at a supplier location should still affect the breach likelihood if they can be related to that location through an article’s content.

Besides ongoing input from news sources, the network is also updated with new audit results. The most recent audit score is integrated together with the time since the last audit. Concentrating on the last known audit result means that the best in-depth data on a supplier is used. This is supported by the expert questionnaire: the experts view audits as the most important source of information when estimating the child labor risk level of supplier locations. Nonetheless, the experts still assign a residual child labor risk even if a supplier achieves the best possible audit score. Instead of audit data, this input can also come from certification processes, partner companies, or platform-based exchange. Focusing on the most recent audit result ignores possible input from earlier data, which could support additional conclusions such as evidence of diminishing performance.

The prior integrates data from statistical sources to mathematically determine the probability of child labor per country, sector, and area type. The cases tested together with the experts show comparable rankings for supplier locations. However, the experts provided a wide variety of different answers and rankings, which decreases the interpretability of the results. This variation may be due to the difficulty in estimating the probability value based on the numbers provided without an additional calculation framework, or it may be due to the different heuristics experts use to determine risk probabilities given limited information. Nevertheless, the prior value provides a mathematically derived quantified number, and its components have been agreed on by the experts. In a different context than child labor, another structure for the prior might be necessary given the data availability and underlying driving forces of a different social sustainability factor.

BNs have the advantage that they are more easily understandable than other probabilistic frameworks (Duespohl et al. 2012; Koks and Challa 2005; Wooldridge 2003). The nodes of the BN proposed in this paper can be directly explained to sustainability managers. Moreover, as the expert questionnaire highlights, the requirements incorporated into the BN (together with the surrounding system) are strongly supported by experts in supply chain, sustainability, and risk management. Only one requirement did not see significant agreement: experts want to be able to manually change the final input into the risk model. But as the current system design (apart from the initial configuration) allows unbiased input into further processing steps, the amount of user input, if permitted, has to be discussed in detail. Experts do not necessarily make correct judgments, and biases may affect the manual input, leading to a questionable ranking of supplier locations; there are also numerous factors that can unconsciously influence decisions (Bazerman 2006). Consequently, the configuration also needs to be cross-checked. Nevertheless, experts attribute job relevance to a model and system incorporating the chosen requirements at a level just short of significance (6% level, though significant for large companies), supporting the overall assumption that the system has a high likelihood of being adopted by an organization.

Altogether, the findings lead to several recommendations for practical application. Using a quantified risk model that is continuously updated based on observations can help to focus a company’s auditing resources and other activities such as supplier development where they are most effective. Designing a model and system that fulfills the key requirements and is understandable is feasible. Consequently, its implementation may also allow a more unbiased and objective discussion of social sustainability activities across the supply chain. To be effective, companies would need to have the resources (internal or external) and processes to be able to work with the outcome of the risk model. They need to analyze potential issues more deeply, potentially also performing on-site visits. While the focus here was on child labor, this process can also be adapted for other social sustainability risk sources such as forced labor. A key question when performing a risk-based analysis of suppliers or when a social sustainability issue is detected at a supplier location is how to deal with the particular supplier. The appropriate reaction strongly depends on the type of issue. Besides financial aid for the child or children [as suggested for example by Social Accountability International (2008)], supplier development can offer further opportunities (Harms et al. 2013). Increased commitment, collaboration, and supportive treatment of suppliers can be steps that may already be taken in order to mitigate issues, particularly for high-risk suppliers (Locke et al. 2009).

5 Conclusion

Issues with child labor and other social sustainability themes can cause severe reputational damage to companies, even if found in very remote areas of the supply chain. Moreover, the societal impact of global supply chains has come under particular scrutiny in recent years. This paper suggests using a Bayesian network (BN) continuously fed by reports on child labor observations to estimate the risk of a breach of corporate sustainability standards at a particular supplier location. The BN risk model builds on a statistically derived prior and is updated using the most recent audit results for the location and news items containing child labor issues that can be related to the location. The latter makes use of Bayesian updating to incorporate credibility and relevance of news items as well as their number, yielding an observational probability. Through its probabilistic nature, the BN provides a quantified ranking of supplier locations based on their level of risk, which may be used for further mitigating actions in the supply chain. Domain experts have been asked to provide their input on different requirements and on the calibration of the model.

Clearly, the prototype developed has a number of limitations. Due to the limited availability of data, the nodes of the BN are assumed to be conditionally independent. One could argue that an audit score or news report might be partly related to the statistical data depending on the sector or region; given data collected in practice, one could build conditional probability tables connecting context, audit results, and news. However, the effect of the independence assumption should be very limited for audits, as they mainly leverage company-internal information. For news reports, statistical information is typically significantly older than the evidence that can be gathered from news articles, which is often driven by recent events. In addition, as discussed above, news should only influence the breach likelihood strongly if it embodies sufficiently detailed information. Verification has been limited, too, both in terms of the number of experts involved and the number of cases, which were restricted to India and Indonesia.

Thus, one can conclude that interesting further research opportunities exist in all areas covered. While the improvement of event/observation extraction and the extension of the text sources to other languages can be tackled independently from the other issues, the refinement of the BN can only be done on the basis of an enlarged set of real-world data and concurrently with an extended verification process. To extend the relevant data basis, it would be advisable for several firms to cooperate, possibly also with NGOs and public organizations, which could offer child labor risk management as software as a service.

Overall, this paper provides a first quantitative risk model for social sustainability monitoring in supply chains based on a Bayesian network and text mining. In organizations, a system building on the techniques suggested would still need some level of manual interaction and adaptation beyond IT, and organizational processes would need to be established to trigger appropriate responses to changes in risk measures. Still, it may be seen as a step towards greater supply chain responsibility.