1 Introduction

An industrial mining complex is an integrated network that involves several mines, a large and diverse equipment fleet, and processing facilities that transform material coming from the mines into mineral products delivered to customers [1,2,3,4,5,6]. Short-term mine planning optimizes the production schedule on a weekly or monthly basis. This involves defining decisions such as the extraction sequence in which materials, represented by mining block units, are removed from the mines; the destinations where the materials will be processed; and the allocation of the shovel and truck fleets responsible for transporting ore and waste materials, all while addressing operational requirements. To define these decisions simultaneously, it is necessary to consider the dynamic interactions present in an operating mining complex. For example, the extraction rate of a mining front depends on the shovel type, the number of trucks assigned and the destination of the extracted material. Thus, there is a need to investigate approaches, such as those based on reinforcement learning (RL) [7], that provide fast decisions given different configurations of mine operations.

Uncertainties arise from multiple sources in a mining complex, and disregarding them can result in large deviations from production targets [8,9,10,11,12,13]. Supply uncertainty is due to partial knowledge of pertinent geological attributes, such as metal grades, given limited drillhole samples. This is typically quantified by geostatistical methods that simulate spatially distributed geological attributes [14,15,16] and generate a set of orebody model realizations. Equipment performance uncertainties, such as transportation cycles or daily production rates, are usually addressed through discrete event simulation [17]. State-of-the-art short-term planning models consider one or multiple sources of uncertainty through stochastic integer programming (SIP) [3, 18], robust optimization [19] and mathematical programming combined with discrete event simulation [20,21,22]. Upadhyay and Askari-Nasab [20] present an optimization-simulation model that minimizes production target deviations and shovel movement and maximizes shovel utilization; the mathematical model is integrated with a discrete event simulation approach to assess uncertainty in production rates and haulage capacities. Both and Dimitrakopoulos [3] incorporate equipment and geological uncertainties into a nonlinear SIP to optimize a short-term production plan for mining complexes, maximizing the profit obtained by processing blended materials. A general drawback of these established approaches is the need to reoptimize the production plan whenever the state of the mining complex changes. For example, if extraction in a scheduled mining area is postponed or some equipment breaks down, production is delayed, requiring new optimized decisions that consider the current configuration of the mining complex. Thus, adaptive decisions must be made in a short timeframe while taking the dynamic interactions between different pieces of equipment into consideration.

In addition, it is common practice to collect new data during different mining activities, whether by traditional in-fill drilling and grade-control practices, by devices installed on different types of equipment or through sensors that discriminate ore from waste material [23,24,25]. This additional information can influence production planning decisions, especially if it is used to update the set of simulated geological orebody models [26,27,28], for example, through the ensemble Kalman filter (EnKF) approach [29]. However, only a few studies have focused on connecting short-term production planning with the updating of simulated orebody models [30,31,32]. An equipment allocation optimization model in a coal mine [30] demonstrates a considerable reduction in target deviations by inputting updated orebody models into the optimizer. A drawback of this approach is the need to reoptimize the job scheduling model whenever new information is collected, which can be computationally demanding in large applications. Similar concepts appear in the context of smart oil fields [33,34,35,36,37].

Recently, RL-based approaches have shown improvements and demonstrated excellent performance in complex environments [38,39,40,41,42], which has motivated applications in the context of short-term planning of industrial mining complexes [31, 32, 43,44,45]. Paduraru and Dimitrakopoulos [44] proposed a policy gradient approach that defines the destination of each extracted material. This is expanded by incorporating multiple processors while the orebody models are updated using grade-control data [32]. In both cases, a simulator combining discrete event simulation and system dynamics is presented to forecast the material flow resulting from different mining activities. Both approaches assume a fixed extraction sequence of the material from the mines, which introduces biases, as the agent knows exactly which material becomes available at each time step. Given the dynamic interactions of the mining operations, modifications to the extraction sequence can be expected, which would require retraining the model. To address this limitation, Kumar and Dimitrakopoulos [31] proposed a single RL agent that jointly defines the extraction sequence and material destinations based on the policy gradient and the Monte Carlo tree-search framework, motivated by the AlphaGo architecture [46]. However, combining these decisions into a single action output by the RL agent can exponentially increase the total number of actions. For example, including additional mining areas or more processing streams would increase the output space and require substantially more computational effort for training. This prevents the model from directly allocating shovels and trucks; instead, some simplifications are assumed to generate feasible extraction sequences. Additionally, the authors do not explicitly address minimum spatial requirements for shovel allocation and thus do not provide operational equipment allocation and destination policies.

The contribution of this work is to study the potential value of connecting production planning decisions taken at different time scales and, more specifically, the link between operational aspects and improved extraction sequence and material destination decisions. Thus, this work proposes an actor-critic RL method [7] to provide updated short-term production schedules and truck-and-shovel allocations in a mining complex, given incoming data and uncertainty in geological attributes and equipment performance. Two RL agents are introduced: the first assigns shovels to mining locations, and the second defines the material destination and the required number of trucks for transportation. The material flow resulting from these decisions is forecasted by a simulator of the mining complex operation based on historical data related to equipment performance. New data obtained from sampling stations near the processors update the set of simulated orebody models using the EnKF method, which is embedded in the material flow simulator. The following section describes the proposed actor-critic framework. Subsequently, a case study at a copper mining complex demonstrates the proposed method’s ability to improve the financial performance of the operation by adapting fleet allocation and production scheduling decisions. Conclusions and future research directions follow.

2 Methods

Actor-critic RL approaches are methods that learn a policy function \(\pi \left(a|s,\theta \right)\) that guides the agent to behave in a certain environment by taking action \(a\) given the state \(s\), parameterized by \(\theta\) [7]; these are further discussed in the following sections. Herein, the actor-critic RL-based framework is trained to provide truck-and-shovel allocations and material destinations in the context of short-term planning for industrial mining complexes given supply and equipment performance uncertainty. The approach requires a simulator of the mining complex operations, based on discrete event simulation and system dynamics, to forecast the material flow from each mine to the related destination processors. This simulator samples operational performance values related to excavation, transportation and processing activities from historical data distributions to forecast the associated material flow. Additionally, it is assumed that new data can be constantly collected at sampling stations installed near the processors. These data are used to update the set of simulated orebody models using the EnKF approach: the mining blocks near those to which the sampled information can be backtracked have their grades updated, and this method is embedded into the simulator of the mining complex operations. Next, Section 2.1 presents a summary of the EnKF approach, and Section 2.2 describes the proposed model of the mining complex and the actor-critic approach applied to make short-term planning decisions in the context of mining complexes.

2.1 Data assimilation for simulated orebody models

Uncertainty and variability of the spatially distributed attributes of mineral deposits are typically modeled by a set of geostatistically simulated orebody models [15, 16, 47]. Each stochastic simulation of the random field is defined as \({\varvec{Z}}_{k}\left(\varvec{u}\right)\), where \(k=1,2,\dots ,{n}_{gs}\), \({n}_{gs}\) is the number of simulated scenarios of the orebody model and \(\varvec{u}\) represents the mining block location in domain \(D\subseteq {R}^{3}\). As new data characterizing the mineral deposit are collected, they can be assimilated into the orebody simulation models using the ensemble Kalman filter (EnKF) approach [28, 29]. The method is based on a linear update of the ensemble models given by the dissimilarity between measured (\(\varvec{d}\)) and forecasted observations \(f\left({\varvec{Z}}_{k}\left(\varvec{u}\right)\right)\). The EnKF method assumes that both \({\varvec{Z}}_{k}\left(\varvec{u}\right)\) and \(\varvec{d}\) follow multi-Gaussian distributions. Thus, a univariate normal score transformation \(\phi (\cdot )\) is applied [16], such that \({\varvec{Y}}_{k}\left(\varvec{u}\right)={\phi }_{data}\left({\varvec{Z}}_{k}\left(\varvec{u}\right)\right)\) and \({\varvec{d}}_{k}^{\phi }={\phi }_{obs}\left({\varvec{d}}_{k}\right)\). The update at step \(t\) is given as follows:

$$\begin{array}{cc}{\varvec{Y}}_{k}{\left(\varvec{u}\right)}_{t}={\varvec{Y}}_{k}{\left(\varvec{u}\right)}_{t-1}+{K}_{t}\left[{\left({\varvec{d}}_{k}^{\phi }\right)}_{t}-{\left({\varvec{d}}_{{for}_{k}}\right)}_{t-1}\right], & \forall k=1,2,\dots ,{n}_{gs}\end{array}$$
(1)

where

$${\varvec{d}}_{{for}_{k}}=f\left({\varvec{Y}}_{k}\left(\varvec{u}\right)\right)$$
(2)

and \({\varvec{d}}_{k}^{\phi }\) is the measured observation with added noise \({\varvec{e}}_{k}\sim N\left(0, \varvec{R} \right)\), where \(\varvec{R}\) is the covariance of the measurement error provided by the sensor:

$$\begin{array}{cc}{\varvec{d}}_{k}^{\phi }={\varvec{d}}^{\phi }+{\varvec{e}}_{k}, & k=1,2,\dots ,{n}_{gs}\end{array}$$
(3)

The Kalman gain \({K}_{t}\) is a linear operator that maps the influence of the forecast mismatch onto the model \({\varvec{Y}}_{k}\left(\varvec{u}\right)\) and is obtained as follows:

$${K}_{t}={\left({C}_{Y,Obs}\right)}_{t}{\left({C}_{Obs,Obs}\right)}_{t}^{-1}$$
(4)

where \({C}_{Y,Obs}\) is the cross-covariance between the model \(\varvec{Y}\left(\varvec{u}\right)\) and the observed data \({\varvec{d}}^{\phi }\), and \({C}_{Obs,Obs}\) is the auto-covariance of the observed data. Both covariances are computed experimentally at each time step \(t\):

$${\left({C}_{Y,Obs}\right)}_{t}\cong \frac{1}{{n}_{gs}}\sum\limits_{k=1}^{{n}_{gs}}\left({\varvec{Y}}_{k}{\left(\varvec{u}\right)}_{t}-\overline{{\varvec{Y}}{\left(\varvec{u}\right)}_{t}}\right)\cdot{\left({\left({\varvec{d}}_{{for}_{k}}\right)}_{t}-\overline{{\left({\varvec{d}}_{for}\right)}_{t}}\right)}^{T}$$
(5)
$${\left({C}_{Obs,Obs}\right)}_{t}\cong \frac{1}{{n}_{gs}}\sum\limits_{k=1}^{{n}_{gs}}\left({\left({\varvec{d}}_{{for}_{k}}\right)}_{t}-\overline{{\left({\varvec{d}}_{for}\right)}_{t}}\right)\cdot {\left({\left({\varvec{d}}_{{for}_{k}}\right)}_{t}-\overline{{\left({\varvec{d}}_{for}\right)}_{t}}\right)}^{T}+\varvec{R}$$
(6)

where \(\overline{{\varvec{Y}}{\left(\varvec{u}\right)}_{t}}\) and \(\overline{{\left({\varvec{d}}_{for}\right)}_{t}}\) represent the ensemble means of the model and of the forecasted observations, respectively. The reader can refer to [48, 49] for a general review of the EnKF and to [28] for applications to mineral deposits.
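
For concreteness, the following is a minimal NumPy sketch of one EnKF update following Eqs. (1)-(6). The array layout, the function names and the omission of the back-transformation to grade units are simplifying assumptions made for illustration, not the authors' implementation.

```python
# Minimal EnKF update sketch in normal-score space, per Eqs. (1)-(6).
import numpy as np

def enkf_update(Y, d_for, d_obs, R, rng=None):
    """One EnKF update step.

    Y     : (n_blocks, n_gs) ensemble of normal-score block values Y_k(u)
    d_for : (n_obs, n_gs)    forecasted observations f(Y_k(u)), Eq. (2)
    d_obs : (n_obs,)         measured observations d^phi in normal scores
    R     : (n_obs, n_obs)   covariance of the sensor measurement error
    """
    if rng is None:
        rng = np.random.default_rng()
    n_gs = Y.shape[1]

    # Perturb the measurement with noise e_k ~ N(0, R), Eq. (3)
    E = rng.multivariate_normal(np.zeros(d_obs.size), R, size=n_gs).T
    D = d_obs[:, None] + E                          # (n_obs, n_gs)

    # Ensemble anomalies (deviations from the ensemble means)
    Ya = Y - Y.mean(axis=1, keepdims=True)
    Da = d_for - d_for.mean(axis=1, keepdims=True)

    # Experimental covariances, Eqs. (5) and (6)
    C_y_obs = Ya @ Da.T / n_gs
    C_obs_obs = Da @ Da.T / n_gs + R

    # Kalman gain, Eq. (4), and linear update of every realization, Eq. (1)
    K = C_y_obs @ np.linalg.inv(C_obs_obs)
    return Y + K @ (D - d_for)
```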

2.2 An actor-critic method applied to operational decisions in mining complexes

Actor-critic is an algorithm that belongs to the class of policy-gradient methods [7]. A policy is a parametric function that maps the state space \(S\), representing a numerical description of the mining complex environment, to the output space, characterized by a set of probabilistic values \(\pi \left(A|S,\theta \right)\) associated with the desirability of choosing a specific action. The two RL agents’ architectures are described in Section 2.2.3. At each time step \(t\), Agent 1 perceives the mining complex environment through the state \({s}_{1,t}\in {S}_{1}\subset S\) and outputs actions \({a}_{1}\in {\mathcal{A}}_{1}\subset A\) determining the mining block locations where the shovels will operate. This ultimately defines the sequence in which the mining blocks are extracted from the mines. Next, given the observed state \({s}_{2,t}\in {S}_{2}\subset S\), which also considers the mining blocks defined by \({a}_{1}\), Agent 2 chooses the processor destination and the number of trucks required to transport the material from the mining faces to the corresponding destination, \({a}_{2}\in {\mathcal{A}}_{2}\subset A\). The agents’ actions are referred to interchangeably as decisions throughout this paper. Section 2.2.3 also presents a more detailed description of the mining complex state.

A simulator of mining complex operations is developed herein to forecast the outcomes of the decisions provided by the RL agents, combining ideas from discrete event simulation [50] and system dynamics [51]. A general workflow of the different components of the simulation of mining complex operations is presented in Fig. 1. The RL agents’ actions \({a}_{1}\) and \({a}_{2}\) trigger a stochastic response of the mining complex model, since different forms of uncertainty are incorporated, such as geological attributes and equipment performance. Concerning geological uncertainty, all grade simulations are considered, meaning that forecasts of metal-related attributes display multiple values. Regarding uncertain equipment capacities, a single daily capacity value is sampled from each piece of equipment’s historical distribution. Combining the capacities of all related equipment gives the amount of material that can flow from the mine fronts to the processors. For example, as shown in Fig. 1, the mining capacity of each shovel-truck operation is defined by the variable \(current_{j}^{cap}\), which depends on the uncertain performance of shovel \(j\) and the trucks assigned to it.

Shovel relocation can also impact daily productivity, as a lost tonnage factor is applied to account for the time required to move these pieces of equipment between mining areas or mining levels. The changes in the mining complex state generated by \({a}_{1}\) and \({a}_{2}\) are represented by changes in the \({s}_{1,t}\) and \({s}_{2,t}\) vectors. The simulation of the mining operations also generates a reward value \({r}_{t}\), which is presented to the agents to indicate how well they are performing; this is defined in Section 2.2.1. The interaction between the RL agents and the mining complex model continues until the time horizon \(T\), representing the total number of operating days \({N}_{days}\), is reached. In addition, it is assumed that the material feeding the processors is monitored and sampled daily. The acquired data are used to update the set of simulated geological models, as described in Section 2.1. This feature is embedded into the simulator of the mining complex operations.
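
As a simple illustration of this sampling step, the sketch below draws daily capacities from each piece of equipment's historical records and combines shovel and truck capacities into the variable \(current_{j}^{cap}\) of Fig. 1. The empirical-resampling choice and the function names are assumptions made for this sketch.

```python
# Sampling daily equipment capacities from historical data (sketch).
import random

def sample_daily_capacity(historical_daily_tonnages):
    """Draw one daily capacity value from a piece of equipment's history."""
    return random.choice(historical_daily_tonnages)

def shovel_truck_capacity(shovel_history, truck_histories):
    """current_cap_j: the tonnage shovel j can move today, limited by the
    combined capacity of the trucks assigned to it (see Fig. 1)."""
    shovel_cap = sample_daily_capacity(shovel_history)
    trucks_cap = sum(sample_daily_capacity(h) for h in truck_histories)
    return min(shovel_cap, trucks_cap)
```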

Fig. 1 General workflow of the simulator of excavation and transportation operations

The trajectory \({\tau }_{t}\) is then defined by the set of actions taken and states observed up to time \(t\):

$${\tau }_{t}=\left[\left({s}_{1,0},{a}_{1,0},{s}_{2,0},{a}_{2,0},{r}_{1}\right),\left({s}_{1,1},{a}_{1,1},{s}_{2,1},{a}_{2,1},{r}_{2}\right),\dots ,\left({s}_{1,t-1},{a}_{1,t-1},{s}_{2,t-1},{a}_{2,t-1},{r}_{t}\right)\right]$$
(7)

The cumulative reward is then defined as follows:

$${R}_{t}=\sum\limits_{k=0}^{T-t-1}{\gamma }^{k}{r}_{t+k+1},$$
(8)

where \(\gamma\) is the discount factor that decreases the weight of rewards received in the distant future; \(\gamma\) is set below 1 to help convergence [7]. The goal is to maximize the expected total reward:

$$\begin{array}{cc}\underset{{a}_{j}}{\text{max}} \ {E}_{\pi }\left[{R}_{t}|{s}_{j,t}\right], & j=1,2\end{array}$$
(9)

In the presented method, Agents 1 and 2 are each represented by an actor and a critic. The actors are responsible for making decisions about the mining complex environment according to the policy functions \({\pi }_{1}\left({A}_{1,t}|{s}_{1,t},{\theta }_{1}\right)\) and \({\pi }_{2}\left({A}_{2,t}|{s}_{2,t},{\theta }_{2}\right)\), given the mining complex states \({s}_{1,t}\) and \({s}_{2,t}\) and the neural network parameters \({\theta }_{1}\) and \({\theta }_{2}\). The critics are value functions \({v}_{1}\left({s}_{1,t},{\omega }_{1}\right)\) and \({v}_{2}\left({s}_{2,t},{\omega }_{2}\right)\) that approximate the total reward that each agent can collect given the current mining complex state and the neural network weights \({\omega }_{1},{\omega }_{2}\):

$$\begin{array}{cc}v\left({s}_{j,t},{\omega }_{j}\right)\approx {E}_{\pi }\left[{R}_{t}|{s}_{j,t}\right], & j=1,2\end{array}$$
(10)

The critic guides the actor in an improving direction through the advantage function \(\delta \left({s}_{j,t},{A}_{j,t}\right)\), computed as the one-step temporal-difference error:

$$\begin{array}{cc}\delta \left({s}_{j,t},{A}_{j,t}\right)={r}_{t+1}+\gamma \cdot v\left({s}_{j,t+1},{\omega }_{j}\right) -v\left({s}_{j,t},{\omega }_{j}\right), & j=1,2\end{array}$$
(11)

One can then update both the actor and the critic by stochastic gradient steps, according to the following:

$$\begin{array}{cc}{\theta }_{j}\leftarrow {\theta }_{j}+{\alpha }_{{\theta }_{j}}\delta \left({s}_{j,t},{A}_{j,t}\right)\nabla \ \text{ln}\left(\pi \left({A}_{j,t}|{s}_{j,t},{\theta }_{j}\right)\right), & j=1,2\end{array}$$
(12)
$$\begin{array}{cc}{\omega }_{j}\leftarrow {\omega }_{j}+{\alpha }_{{\omega }_{j}}\delta \left({s}_{j,t},{A}_{j,t}\right)\nabla v\left({s}_{j,t},{\omega }_{j}\right), & j=1,2\end{array}$$
(13)

where \({\alpha }_{{\theta }_{j}}\) and \({\alpha }_{{\omega }_{j}}\) are the learning rates related to the actor’s and critic’s neural networks. Figure 2 summarizes the interactions between both RL agents and the mining complex environment.
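
A compact PyTorch sketch of the updates in Eqs. (11)-(13), applied per agent \(j\), is given below. The semi-gradient form, in which \(\delta\) is treated as a constant in both updates, is a standard implementation choice assumed here rather than a detail given in the text.

```python
# One-step actor-critic update per Eqs. (11)-(13); terminal-state handling
# is omitted for brevity.
import torch

def actor_critic_step(actor_opt, critic_opt, logp, v_t, v_next, r_next, gamma):
    """logp: log pi(A_t | s_t, theta); v_t, v_next: critic outputs for s_t
    and s_{t+1}; r_next: reward r_{t+1}; *_opt: torch optimizers."""
    # Advantage (TD error), Eq. (11), treated as a constant in both updates
    with torch.no_grad():
        delta = r_next + gamma * v_next - v_t

    # Actor, Eq. (12): ascend delta * grad log pi (descend its negative)
    actor_opt.zero_grad()
    (-(delta * logp)).backward()
    actor_opt.step()

    # Critic, Eq. (13): ascend delta * grad v (semi-gradient)
    critic_opt.zero_grad()
    (-(delta * v_t)).backward()
    critic_opt.step()
```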

Fig. 2 Diagram of the actor-critic approach applied to mining complexes. Adapted from [7]

2.2.1 Reward function

Agents 1 and 2 provide new decisions whenever a shovel finishes extracting its assigned set of blocks. The mining complex model simulates the consequences of these decisions, modifying its state configuration, in which uncertainty plays an important part. To evaluate performance, a utility function \(u\left({\tau }_{t}\right)\) measures the overall financial performance as the total revenue obtained minus the penalized deviations from operational objectives:

$$u\left({\tau }_{t}\right)=\sum\limits_{{t}^{{\prime }}=0}^{t}\left(\begin{array}{c}\sum\limits_{d\in \mathcal{D}}\frac{1}{{n}_{gs}}\sum\limits_{k=1}^{{n}_{gs}}\left(re{v}_{k,d,{t}^{{\prime }}}-proc\_cos{t}_{d,{t}^{{\prime }}}\right)-\\ \left(m\_cos{t}_{{t}^{{\prime }}}+pen\_shov\_mo{v}_{{t}^{{\prime }}}+pen\_practical\_sche{d}_{{t}^{{\prime }}}\right)\end{array}\right)$$
(14)

where \(re{v}_{k,d,{t}^{{\prime }}}\) is the forecasted revenue obtained at time \({t}^{{\prime }}\) by processing the ore at destination \(d\) under geological scenario \(k\); this accounts for the uncertainty in the metal content of the material sent to destination \(d\). \(proc\_cos{t}_{d,{t}^{{\prime }}}\) is the processing cost, which considers the stochastic performance of the crushers and the processing plant. \(m\_cos{t}_{{t}^{{\prime }}}\) is the mining cost of extracting the material from the mining fronts by shovels and transporting it by trucks to the related destination. Multiple mining areas can be operated simultaneously, and the shovels can be relocated between these regions and to different levels. It might be desirable to extract material with different properties, but moving shovels can result in production loss, penalized by \(pen\_shov\_mo{v}_{{t}^{{\prime }}}\), and in production delay. Finally, \(pen\_practical\_sche{d}_{{t}^{{\prime }}}\) is the penalty associated with deviating from a practical schedule, discussed in Section 2.2.2. The reward function is modeled as the temporal difference of the utility function \(u\left({\tau }_{t}\right)\) calculated between two time steps as follows:

$${r}_{t+1}=u\left({\tau }_{t+1}\right)-u\left({\tau }_{t}\right)$$
(15)

Formulating Eq. (15) as a temporal difference makes the cumulative reward function, Eq. (8), approximate the objective function applied in the stochastic optimization of short-term planning of mining complexes [3].
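
The sketch below illustrates Eqs. (14)-(15) under an assumed data layout, with revenues given per geological scenario and destination and the daily cost and penalty terms pre-computed as scalars; it is purely illustrative of the accounting, not the authors' implementation.

```python
# Daily utility term of Eq. (14) and the temporal-difference reward, Eq. (15).
import numpy as np

def daily_utility(rev_k_d, proc_cost_d, m_cost, pen_shov_mov, pen_practical_sched):
    """rev_k_d: (n_gs, n_dest) revenues per scenario k and destination d;
    proc_cost_d: (n_dest,) processing costs; remaining terms are scalars."""
    expected_margin = np.mean(rev_k_d - proc_cost_d[None, :], axis=0).sum()
    return expected_margin - (m_cost + pen_shov_mov + pen_practical_sched)

# u(tau_t) accumulates daily_utility over days 0..t; the reward is then
# r_{t+1} = u(tau_{t+1}) - u(tau_t), i.e., the utility added by the last step.
```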

2.2.2 Action space

Each proposed RL agent is concerned with a different type of decision applied to the mining complex environment. Agent 1 defines shovel allocation by selecting mining block locations to be excavated, subject to accessibility constraints in both the vertical and horizontal directions. Standard slope constraints require that all blocks lying above a given mining block be extracted first, respecting the slope angle requirements. To address the operational requirements representing horizontal access, there must be a minimum mining width allowing the shovel to reach the material at the mining faces. The assumption is that at least three blocks must be mined to expose a given mining block, as illustrated in Fig. 3. Figure 3a shows an example of a mining block, highlighted in gray, requiring that at least one of the sets of predecessor blocks in Fig. 3b be removed so the gray block can be accessed by a shovel.
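
For illustration, the following minimal sketch implements such a horizontal-access check, assuming a block is exposed once any one of a set of predefined triples of neighboring blocks on the same bench has been mined. The particular offset sets below are hypothetical stand-ins for the predecessor sets of Fig. 3b.

```python
# Horizontal accessibility check (sketch); offsets are illustrative only.
PREDECESSOR_SETS = [
    [(-1, -1), (-1, 0), (-1, 1)],  # three adjacent blocks on the west side
    [(1, -1), (1, 0), (1, 1)],     # east side
    [(-1, -1), (0, -1), (1, -1)],  # south side
    [(-1, 1), (0, 1), (1, 1)],     # north side
]

def horizontally_accessible(block_xy, mined):
    """mined: set of (x, y) coordinates of already-extracted blocks
    on the current bench."""
    x, y = block_xy
    return any(all((x + dx, y + dy) in mined for dx, dy in triple)
               for triple in PREDECESSOR_SETS)
```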

Fig. 3 Illustration of the requirements for horizontal predecessors, given the (a) initial configuration and (b) set of possible predecessors

To obtain policies that facilitate material destination decisions in an operationally feasible context, Agent 1 sequentially decides the next three shovel allocations, as shown in Fig. 1. The idea is to provide a practical schedule where some adjacent materials are processed at the same destination, minimally satisfying the operational requirements. In total, four types of shovel allocation decisions are considered:

  • Surrounding window actions: considers a predefined number of mining blocks representing the region in which the shovel can be assigned, namely the blocks falling inside a certain window (for example, 7 × 7 blocks in the case study presented), as shown in Fig. 4. Since three decisions are taken consecutively, the window is kept fixed, and as each block is chosen, the set of feasible blocks is updated because other blocks become exposed. Using the search template in Fig. 4b, the blocks that can be mined given horizontal accessibility are flagged. Figure 4c-f show how this is considered when Agent 1 is triggered sequentially to define a new allocation decision.

  • Upper/lower level actions: considers the option to move the shovel one level up or down if slope constraints and ramp access are satisfied. The number of such decisions is limited to reduce the total number of possibilities: if a shovel moves between levels, it is allocated to the exposed mining block with the highest grade. This is a limitation, and different movement strategies will be studied in future research.

  • Different area actions: allows the shovel to be relocated to another mining area. Since there are \({N}_{areas}\) mining areas, \({N}_{areas}-1\) such decisions are available. By analogy with the “upper/lower level” actions, the shovel is relocated to the available mining block with the highest grade.

  • Unmined leftover actions: addresses the situation in which some mining blocks become detached from the mining face. When this happens, those blocks are tracked, and this option becomes available. If unmined leftover blocks exist and the RL agent chooses another option, the penalty \(pen\_practical\_sche{d}_{t}\) is applied. Figure 5 illustrates how choosing one block may leave another detached from the mining face.

Fig. 4 Illustration of the “surrounding window” possible actions. (a) shows the shovel’s position and the available and unmined blocks given the search template (b). (c)-(e) represent a possible sequence of allocation actions that can be selected by Agent 1

Fig. 5 Illustration of a possible configuration with a block detached from the mining front

Once Agent 1 has defined the next three mining blocks for the shovel, Agent 2 must determine their destination and the number of trucks required for transportation. There are multiple processing streams through which material can pass in a mining complex. Figure 6a shows the strategy of listing all \({N}_{streams}\) possible paths from the mining face to a destination. This differentiates sending material to Mill 2 through Crusher 1 from the alternative route of also processing the material at Mill 2 but through Crusher 2: although in both cases the material is processed at the same location, the crusher usage differs. Additionally, there are \({N}_{t\to s}\) different truck-shovel strategies. For example, the case study shown in the following section assumes three possibilities, \({N}_{t\to s}=3\), for associating trucks with a shovel: assigning three, four or five trucks to a single shovel. Combining the truck strategies with the processing options results in \({N}_{2}={N}_{t\to s}\cdot {N}_{streams}\) possible actions. Figure 6b illustrates the combined \({N}_{2}\) decisions presented to Agent 2.

Fig. 6 Agent 2’s action space: (a) possible processing paths and (b) enumeration of the processing options (OP) combined with the number of truck possibilities
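
A minimal sketch of this enumeration follows; the stream names are hypothetical placeholders for the \({N}_{streams}\) paths of Fig. 6a.

```python
# Enumerating Agent 2's combined action space: N_2 = N_{t->s} * N_{streams}.
from itertools import product

streams = ["Crusher1->Mill2", "Crusher2->Mill2", "Crusher1->OxideLeach",
           "SulfideLeach", "WasteDump"]        # placeholder processing paths
truck_options = [3, 4, 5]                      # N_{t->s} = 3 truck strategies

ACTIONS_2 = list(product(streams, truck_options))  # len(ACTIONS_2) = N_2
# e.g., ACTIONS_2[4] == ("Crusher2->Mill2", 4): route the material through
# Crusher 2 to Mill 2 with four trucks assigned to the shovel.
```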

2.2.3 State space

The state space is a numerical representation of the mining complex’s current state. Figure 7 shows Agent 1’s neural network architecture, which combines two categories of inputs: a three-dimensional grid processed by a convolutional neural network and a set of additional features (a sketch of this architecture is given after Fig. 7). The convolutional part considers the three-dimensional spatial configuration of the orebody model and four attributes treated as separate channels. These are binary-coded values representing:

  • Extracted and unmined mining blocks.

  • Exposed and unavailable mining blocks.

  • Shovel trajectory, where the blocks extracted by this equipment are flagged.

  • Current mining block assignments of all other shovels, including blocks still to be extracted.

The set of additional features includes the following:

  • Grade features: include the set of simulated grades of all blocks in the search window presented in Section 2.2.2. If a block is not available for extraction, a zero flag is set instead.

  • General features: consider the production rate per mine, throughputs, targets and the set of simulated head grades of crushers and processors over the past days. These features give a notion of past actions taken.

  • Equipment features: present the current shovel and truck capacities, the attributes of the blocks currently being extracted by other shovels and their assigned destinations.

  • Downstream features: consider the percentage of the target throughput delivered to crushers and processors as well as each processor’s set of simulated head grades.

Fig. 7 Agent 1’s neural network architecture
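
As a rough illustration, the PyTorch sketch below reproduces the two-branch structure of Fig. 7: a 3D convolutional branch over the four binary channels is concatenated with the additional features and feeds a policy head and a value head. All layer sizes are placeholders, not the values used in the case study; the masking step reflects the zero probability assigned to infeasible allocations (Section 3.1).

```python
# Two-branch actor-critic network for Agent 1 (sketch with placeholder sizes).
import torch
import torch.nn as nn

class Agent1Net(nn.Module):
    def __init__(self, grid_shape, n_extra_features, n_actions=55):
        super().__init__()
        self.conv = nn.Sequential(                 # 4 binary block channels
            nn.Conv3d(4, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(8, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        with torch.no_grad():  # infer the flattened size with a dummy pass
            n_conv = self.conv(torch.zeros(1, 4, *grid_shape)).shape[1]
        self.trunk = nn.Sequential(
            nn.Linear(n_conv + n_extra_features, 256), nn.ReLU())
        self.policy_head = nn.Linear(256, n_actions)  # actor: action logits
        self.value_head = nn.Linear(256, 1)           # critic: v(s, omega)

    def forward(self, grid, extra, feasible_mask):
        """feasible_mask: boolean tensor flagging feasible allocations."""
        h = self.trunk(torch.cat([self.conv(grid), extra], dim=-1))
        logits = self.policy_head(h).masked_fill(~feasible_mask, float("-inf"))
        return torch.softmax(logits, dim=-1), self.value_head(h)
```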

Figure 8 shows the architecture of Agent 2, part of whose inputs are similar to the “additional features” presented for Agent 1. Agent 2’s inputs include the “general features,” “equipment features,” and “downstream features,” as well as the set of stochastically simulated grade attributes of the three blocks selected by Agent 1. The output options presented in the figure are those discussed in Section 2.2.2.

Fig. 8 Agent 2’s neural network architecture

2.2.4 Learning procedure

Figure 9 shows the diagram summarizing the agents’ actions and the associated mining complex response. Agent 1 decides the next three shovel allocations, whereas Agent 2 defines the material destination and the number of trucks required. Next, the simulator of the mining complex operations forecasts the material flow associated with these actions and outputs a reward value. The mining complex state is modified by the material flowing from the mining faces.

Fig. 9 Diagram summarizing the operational decisions taken in the proposed framework

The algorithm is described as follows (a condensed code sketch is given after the listing):

1. Randomly initialize Agent 1 and Agent 2 parameters \({\theta }_{1},{\omega }_{1}\) and \({\theta }_{2},{\omega }_{2}\), respectively.

2. Initialize the mining complex model with all the mines, crushers, processors and other components. Provide the initial shovel and truck allocations and the initial material destination.

3. Split the set of geological simulations into two sets for training and testing the model. Split each again into a group of simulations used as initial orebody models (\({\mathcal{S}}_{initial}^{train}, {\mathcal{S}}_{initial}^{test}\)) and another used as a simulation of reality (\({\mathcal{S}}_{reality}^{train}, {\mathcal{S}}_{reality}^{test}\)).

4. All simulations \(s\in {\mathcal{S}}_{initial}^{train}\) are used to model the uncertainty of the geological grade attributes.

5. Define the number of training episodes, \({N}_{training}\), where for each episode:

5.1. Sample one stochastic simulation \({s}_{reality}\in {\mathcal{S}}_{reality}^{train}\) to represent reality in terms of grade attributes. This is used to emulate the sampling mechanisms placed on conveyor belts.

5.2. Each episode has a length of \({N}_{days}\), where for each day:

5.2.1. Sample shovel, truck and crusher equipment performances from the historical data distributions. Define how much material can be transported from the mining faces to the processors, as shown in Fig. 1.

5.2.2. Forecast the material that each shovel \(i\in \left\{1,\dots ,{N}_{shovels}\right\}\) can extract and send to its assigned processor.

5.2.3. If shovel \(i\) requires a new allocation decision, select three mining block locations (\({a}_{1,t}\)) based on \(\pi \left({A}_{1}|{s}_{1,t},{\theta }_{1}\right)\) and the destination and number of trucks (\({a}_{2,t}\)) according to \(\pi \left({A}_{2}|{s}_{2,t},{\theta }_{2}\right)\). Store the value functions \({v}_{1,t}=v\left({s}_{1,t},{\omega }_{1}\right)\) and \({v}_{2,t}=v\left({s}_{2,t},{\omega }_{2}\right)\), and the log-probabilities \(\text{log}\,{p}_{1,t}=\text{log}\,\pi \left({a}_{1,t}|{s}_{1,t},{\theta }_{1}\right)\) and \(\text{log}\,{p}_{2,t}=\text{log}\,\pi \left({a}_{2,t}|{s}_{2,t},{\theta }_{2}\right)\).

5.2.4. After taking actions \({a}_{1,t}\) and \({a}_{2,t}\), calculate the reward \({r}_{t+1}\) according to Eqs. (14) and (15).

5.2.5. Simulate the grade of the material passing over the conveyor belts assuming the reality orebody model \({s}_{reality}\). This value is used to update the uncertainty models as described in Section 2.1.

5.3. Once the episode is finished, calculate the advantage functions \(\delta \left({s}_{1,t},{A}_{1,t}\right)\) and \(\delta \left({s}_{2,t},{A}_{2,t}\right)\) as in Eq. (11).

5.4. Perform the gradient updates given by Eqs. (12) and (13).
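
The listing below condenses steps 1-5.4 into a Python sketch. The simulator and agents are abstracted behind assumed interfaces (env.reset, agent.act and so on), so this outlines the procedure rather than reproducing the authors' implementation.

```python
# Condensed training loop for the two-agent actor-critic (interface sketch).
import random

def split(sims):
    """Assumed helper: split simulations into initial-model and reality sets."""
    half = len(sims) // 2
    return sims[:half], sims[half:]

def train(env, agent1, agent2, sims_train, n_episodes, n_days, gamma):
    s_initial, s_reality = split(sims_train)           # step 3
    for episode in range(n_episodes):                  # step 5
        reality = random.choice(s_reality)             # step 5.1
        env.reset(orebody_ensemble=s_initial, reality=reality)
        history = []
        for day in range(n_days):                      # step 5.2
            env.sample_equipment_performance()         # step 5.2.1
            for shovel in env.shovels_needing_decision():
                s1 = env.state_agent1(shovel)
                a1, logp1, v1 = agent1.act(s1)         # three block locations
                s2 = env.state_agent2(shovel, a1)
                a2, logp2, v2 = agent2.act(s2)         # destination + trucks
                r = env.apply(shovel, a1, a2)          # steps 5.2.2-5.2.4
                history.append((logp1, v1, logp2, v2, r))
            env.assimilate_conveyor_samples()          # step 5.2.5 (EnKF)
        agent1.update(history, gamma)                  # steps 5.3-5.4
        agent2.update(history, gamma)
```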

3 Case study

3.1 Overview of the mining complex

The proposed method is applied to an existing large copper mining complex, whose major components are highlighted in Fig. 10. Material supply is sourced from two open pits and processed at one of the leach pads or one of the available mills. The oxide leach pad and the mills require a prior crushing stage, while the sulfide leach pad and the waste dump receive material directly from the mines. Open Pit 1 is composed of three mining areas, while Open Pit 2 has only one mining area, as shown in Fig. 11. Seven shovels operate in Open Pit 1, where they can be reallocated between the three areas, and three shovels are available in Open Pit 2. The operation does not allow moving shovels between the two mines, given the short-term scale in consideration and the cost involved. Regarding the truck fleet, there are 50 units, and each shovel can operate with three, four, or five trucks. The daily performances of shovels, trucks and crushers are characterized by fifty Monte Carlo simulations sampled from historical data [52]. Please note that due to a non-disclosure agreement, the dataset cannot be made publicly available. Every episode considered by the simulator of the mining complex operations randomly selects one simulated realization for each piece of equipment to forecast the material flow resulting from the mining activities. The variability and uncertainty associated with the copper grades of each mineral deposit are characterized by a set of ten simulated orebody model realizations generated using the sequential direct block Gaussian simulation method [53]. Figure 12 shows examples of the geological and equipment performance uncertainties considered.

Fig. 10 Diagram of the copper mining complex and associated processing streams

A systematic sampling procedure installed at the conveyor belts quantifies the quality of the copper material flowing between the crushers and the processors. These data are a noisy observation of the blended material passing over the conveyor belt. Additionally, a material tracking strategy can identify where this material was sourced in the orebody; for example, the sensor may record an observation in which 30% of the material comes from location A and 70% from location B. The resource model is then updated with this incoming information using the EnKF approach.

Fig. 11 Mining areas within Open Pit 1 (a) and Open Pit 2 (b) and the initial shovel allocation

Fig. 12 Examples of the sources of uncertainty considered in the copper mining complex

The simulator of the mining complex operations interacts with the RL agents for 30 days of operations, defining one episode. In total, 3,500 episodes are considered to train the model, requiring approximately 24 h on an Intel® Xeon® CPU E5-2650 v4 at 2.20 GHz, running Oracle Linux Server 7.9 with an NVIDIA Tesla P100 PCIe 12 GB GPU. As presented in Section 2.2.3, Agent 1’s architecture is divided into two parts. The convolutional network has approximately 70,000 inputs and outputs 5,500 values. These are combined with an additional 3,500 inputs characterizing the state of the mining complex’s performance. Finally, 55 values are output, each representing the desirability of a potential shovel assignment. To consider only feasible allocations, infeasible actions are assigned zero probability. Agent 2 has approximately 3,000 inputs in a fully connected neural network architecture and outputs 33 values that combine processing destination paths and the number of trucks to be assigned to the related shovel.

3.2 Policy improvement

Over the 3,500 training episodes, the total reward obtained per episode is recorded and averaged over every ten episodes. This result is presented in Fig. 13 and illustrates the learnability of the method. The overall upward trend demonstrates the consistent improvement of the proposed RL agents. The reward values received by the RL agents exhibit high variance, which is due to the multiple sources of uncertainty considered simultaneously. In general, high variance in rewards can make the learning procedure more difficult and slower.

Fig. 13 Total reward averaged over 10 episodes

Another way to investigate this challenging learning behavior is through the state-value functions \(v\left({s}_{1,t},{\omega }_{1}\right)\) and \(v\left({s}_{2,t},{\omega }_{2}\right)\) related to each RL agent, which allows a more focused investigation of each agent’s learning. The initial states \({s}_{\text{1,0}}\) and \({s}_{\text{2,0}}\) are stored at the beginning of training. Then, the values \(v\left({s}_{\text{1,0}},{\omega }_{1}\right)\) and \(v\left({s}_{\text{2,0}},{\omega }_{2}\right)\) are recorded at each training iteration and averaged every ten episodes. Figure 14 shows the evolution of the state-value functions, where both agents present policy improvement over time. The state-value function associated with Agent 1 (Fig. 14a) reaches a plateau earlier than that of Agent 2 (Fig. 14b). In addition, high variance is observed in the values forecasted by Agent 1’s state-value function. This indicates challenging learning, likely associated with the large number of possible shovel allocations. The early stabilization of the state-value function \(v\left({s}_{1,t},{\omega }_{1}\right)\) suggests that the approach reached a local maximum and has room for improvement. Agent 2’s value function, in contrast, presents values with lower variance during the learning process, resulting in a more stable policy improvement.

Fig. 14 State-value function evolution over time for (a) Agent 1 and (b) Agent 2

3.3 Production scheduling and forecast

This section shows the proposed method’s performance over one episode corresponding to 30 days of mining complex operations, obtained in a few seconds. The following results are presented on a weekly basis by aggregating daily decisions. Figure 15 shows the shovel allocation strategy and extraction sequence for Open Pit 1, where each mining area is shown individually for better visualization. Initially, the shovels are more evenly allocated over the three mining areas, for example, with three shovels placed in Area 1 and two in Area 2, as shown in Fig. 11. However, RL Agent 1 identifies Area 3 as more profitable and sends more shovels to operate there. For example, Shovel 5 starts operating in Area 2, and as soon as its first assignment ends, the equipment is moved to Area 3, where it remains for a considerable time. Area 3 is preferred because of its larger concentration of ore material, which returns larger rewards to the agents when that material is sent preferentially to Mill 1 and Mill 2, as depicted in Fig. 16. Thus, this region quickly becomes the one with the highest extraction rate, leading to more material being extracted from it.

Fig. 15 Extraction sequence and shovel allocation obtained for Open Pit 1

Fig. 16 Material destination and the number of trucks assigned to each block in Open Pit 1

The truck allocation strategy is also shown in Fig. 16. The RL agents often assign five trucks to the shovels in mining Areas 1 and 2. This strategy increases the extraction rate in these regions so that reallocations can occur sooner. As mentioned earlier, in Area 3, most of the supply material is processed by Mill 1 or Mill 2. To avoid sending more material than the processors can receive and to prevent queue formation, the shovels’ production rates are decreased by allocating fewer trucks in this zone by weeks 4 and 5. Note that the objective function does not directly minimize mining costs, but parking trucks helps reduce transportation costs.

The preference for mining Area 3 is further analyzed in Fig. 17, which shows an example of how the set of simulated orebody models is updated over different weeks. One realization from this set is selected, and the related weekly updates are presented. The red circles highlight the approximate areas being operated in the related week. The legend presents values ranging between 0.7 and 3.0 to facilitate visual inspection. Initially, the shovels are positioned in lower-grade blocks (note that 0.7% Cu is not exactly low grade), but as time passes, they move toward the high-grade zones. As higher rewards can be obtained from these high-grade regions, the RL agents provide a fleet allocation that aims to mine these zones as fast as possible.

Fig. 17 Example of one simulated orebody model realization and related updates applied by the end of every week concerning Open Pit 1 Area 3. The red circles represent approximate regions being extracted

Figure 18 presents the shovel allocation strategy and the extraction sequence in Open Pit 2. Overall, the shovel allocation strategy shows a relatively practical schedule. Only one mining area is available in this pit, and the more constrained allocation possibilities help generate a schedule that can be more easily implemented. The regions in which the shovels operate are well delineated, and the shovels do not move excessively to distant locations; Shovel 8 only moves to the level below by the fifth week. Figure 19 presents the processing stream decisions and truck allocation strategies. Consistent destination policies are also observed, a result of sending groups of blocks to the same processor given the combined decisions of Agents 1 and 2. This is an important concern in short-term planning, where operational implementability is required. Concerning truck allocation, the overall strategy consists of initially assigning five trucks to the shovels for faster waste extraction. As time progresses and more ore is exposed, slightly fewer trucks are assigned to transportation, similar to the strategy in Open Pit 1. Even though the material can be extracted from different locations, it is all still processed at the same location. The processors’ capacities then limit the material flow as ore from several mining fronts starts being extracted simultaneously, and assigning fewer trucks decreases the extraction rate without relocating the shovels.

Fig. 18 Extraction sequence (a) and shovel allocation (b) obtained for Open Pit 2

Figure 20 shows an example of the updates applied to one of the simulated orebody model realizations associated with Open Pit 2. The related weekly updates show how the knowledge about the grades changes over time. When comparing Figs. 19 and 20, note that most of the updating occurs in the locations assigned to Shovel 9, since the sampling stations collect data on the ore material passing over the conveyor belts and not on waste material. Initially, Shovel 9 extracts oxide leach pad material, and after the information about this material’s grade is acquired, the orebody models are updated. As higher grades are observed, Agent 2 realizes that it is more profitable to send the nearby material to one of the mills. Shovels 8 and 9, however, initially focus on stripping to expose more ore, and since there is no sampling at the waste dump, no further information about these locations is obtained.

Fig. 19 Material destination decisions (a) and the number of trucks assigned (b) to each block in Open Pit 2

Fig. 20 Example of one simulated orebody model realization and related updates applied by the end of every week concerning Open Pit 2. The red circles represent approximate regions being extracted

The following results present the production forecasts associated with the above decisions. The graphs showing the uncertainty related to copper content are displayed in terms of P10, P50 and P90, representing the 10th, 50th and 90th percentiles, respectively, and are associated with the geological uncertainty. The graphs displaying ore or waste tonnages do not present uncertainty because density values were not simulated. Figure 21 shows the ore throughput and copper head grade at Mill 1 and Mill 2. Mill 2 presents a relatively stable throughput feed with small deviations, while the RL agents do not provide a steady ore feed for Mill 1; note, however, that the objective function does not penalize ore shortages. This may not be a desirable operational performance, and future research can investigate how to minimize these fluctuations. Nevertheless, when analyzing the head grade feeding the processors, it can be seen that different processing stream strategies are applied to each mill. This suggests that Agent 2 can distinguish Mills 1 and 2, as they have different capacities and recoveries. The results show that higher-grade material preferentially feeds Mill 1, profiting from its higher recovery, while lower-grade ore is sent to Mill 2. The forecasted shortage in Mill 1 around the fourth day and the end of the second week can then be explained by the lack of shovels operating on high-grade ore. The extraction sequence and destination policies shown in Figs. 15, 16, 18 and 19 indicate that the shovels are more focused on extracting waste material to expose higher-grade material. A change in this strategy becomes clearer by the third week: richer material becomes available for extraction, and Mill 1’s production starts approaching its maximum capacity. Sufficient medium-grade ore is available throughout the 30 days of operation; consequently, Mill 2 is adequately fed. By the third week, Mill 2 also starts receiving material with slightly higher grades, a result of the greater availability of more valuable ore.

Fig. 21 Ore throughput and head-grade forecast obtained for Mill 1 and Mill 2

Figure 22 presents the tonnages processed at the different crushers and at the waste dump. The ore throughput at Crusher 2 and Crusher 3 can feed Mill 1 and Mill 2, as discussed above, even though they are not operating at full capacity. Agent 2 must provide a policy that balances crusher utilization while also determining different routes depending on the quality of the ore material; this means that the final processor has an impact on how the crushers operate. Regarding waste extraction, a larger amount of material is extracted in the early weeks, with a decreasing rate observed in the subsequent periods. Even though there is no direct incentive in the reward function to remove waste material, the two RL agents still manage to allocate shovels to waste blocks to expose higher-grade ore faster.

Fig. 22 Production forecast obtained for Crusher 2, Crusher 3 and the waste dump

Figure 23 presents the extraction rate of each pit and the total amount of copper recovered. Open Pit 1 presents an extraction rate with large fluctuations, where the periods with lower production are associated with the relocation of shovels between mining areas and levels. Open Pit 1 moves considerably more material than Open Pit 2, which is explained by its larger fleet size. The decreasing production in Open Pit 2 is explained by the truck allocation strategy, which from week 3 onward tends to decrease the production rate as the ore processors become adequately fed. Truck cycle times are another aspect impacting production: the overall extraction rate drops when more ore material becomes available, since the processors are located farther from the pit than the waste dump. Concerning the total copper production, Fig. 23c presents the risk profile, showing an upward trend that reflects the focus on delivering high-grade material to the mills, especially from the third week onward. Again, this is tied to the shovel allocation strategy, which improves overall profit by exposing and extracting the high-grade material mostly present in mining Area 3 of Open Pit 1.

Fig. 23 Total tonnage extracted from Open Pit 1 (a) and Open Pit 2 (b) and daily copper production (c)

Figure 24 shows the performance of the mining complex in terms of cumulative profit. Since the profit calculation depends on the metal quantity produced, the results are also displayed in terms of P10, P50 and P90. The graph compares the financial forecast obtained using the proposed model with the orebody model updating framework based on information from the sampling mechanisms at the conveyor belts (blue lines) and the forecast without geological uncertainty updating (red lines). The same short-term decisions are taken in both situations, and the forecast shows an improvement of 5.2%. This demonstrates an improvement in the ability to forecast the quality of the material feeding the processors and, therefore, in the profit assessment. The improvement in prediction can be observed as early as the second week, highlighting the benefits of incorporating more information into decision-making. Additionally, two distinct behaviors can be observed in the profit profile: slow growth until the 15th day, followed by a sharper rise until the end of the planning horizon due to the pursuit of high-grade materials, as mentioned above.

Fig. 24 Cumulative profit obtained by operating the copper mining complex with (blue lines) and without (red lines) orebody model updating

3.4 Baseline comparison

This section highlights the benefits of the proposed approach compared to previous developments in the field [31], referred to as the Baseline case. The Baseline case is a conventional two-step approach that considers a previously optimized short-term production schedule, followed by shovel allocation to the closest available mining block, minimizing shovel movement, and truck allocation that maximizes material movement. Additionally, shovels are not allowed to move between mining areas. The initial equipment configuration is the same as in the RL agent approach, as shown in Fig. 11: Open Pit 1 has three shovels placed in Area 1 and two shovels in each of Areas 2 and 3, and Open Pit 2 has three shovels in operation. Regarding truck assignment, both mines consider five trucks assigned to each shovel to maximize production. Finally, to evaluate the performance of this approach, the forecast for 30 operating days is assessed using the simulator of the mining complex operations presented in Section 2.2 and compared with the results of the proposed approach. Similarly to the proposed method, this evaluation takes only a few seconds, but unlike the former, it requires a pre-existing schedule.

Figure 25 shows the sequence of extraction and shovel allocation strategy obtained with the Baseline approach for Open Pit 1. Since the shovels are not allowed to move between different mining areas, the allocation strategy is more operationally implementable than the one obtained by the RL agents, shown in Fig. 15. Figure 26 presents the same decisions applied to Open Pit 2. In this case, the results are similar to those of the proposed approach (Fig. 18), as this region has only one mining area in which to allocate the shovels. Another notable aspect is that different areas are mined in the two cases, owing to the RL agents’ capacity to observe different extraction sequence configurations and adapt shovel assignments accordingly.

Fig. 25 Baseline results concerning extraction sequence and shovel allocation for Open Pit 1

Fig. 26 Baseline results concerning extraction sequence and shovel allocation for Open Pit 2

Figure 27 shows the tonnage throughput at Mill 1 and Mill 2 forecasted for the Baseline case, which can be compared with the results obtained by the proposed approach, shown in Fig. 21. The amount of material feeding Mill 1 in the Baseline case presents larger shortages than the outcomes from the RL agents. Regarding Mill 2, both approaches present a similar rate, with the Baseline sending slightly more tonnage to this processing facility. Overall, this shows that the RL agents can also improve the mills’ usage by providing more material with a higher feed grade. Making use of the incoming data and updating the related uncertainty allows the RL agents to make decisions that improve metal production. Finally, Fig. 28 presents the financial performance of the Baseline approach compared to the RL agents’ method. All things considered, the RL agents improve the financial forecast by 47% over the period of 30 operating days. This increase in performance is the combined effect of all the points discussed above, but the major contribution comes from the incorporation of additional data at the short-term scale, improving the current knowledge of the mining complex operations and providing additional flexibility to adapt fleet allocation decisions.

Fig. 27 Ore throughput and head grade achieved by the Baseline case for Mill 1 and Mill 2

Fig. 28 Comparison between the cumulative profit obtained by the Baseline and the proposed actor-critic approach

4 Conclusions

This paper proposes a new actor-critic reinforcement learning approach applied to the short-term planning of mining complexes. The approach provides fast and efficient decision-making regarding the sequence of extraction, material destinations and shovel and truck allocation, accounting for different sources of uncertainty and incoming data collected during operation. The method presents two RL agents: Agent 1 defines shovel allocations that influence the sequence of extraction, whereas Agent 2 assigns the trucks required for transportation and provides the material destination. Uncertainty in the material supply from the mines is represented by a set of simulated orebody models, and equipment performance values are sampled from historical data in a Monte Carlo fashion. A simulator of mining complex operations is proposed to forecast the material flow resulting from daily operations, such as excavation, transportation and processing of the material supply. A sampling mechanism at the conveyor belts is assumed, allowing characterization of the ore material. These incoming data are used to update the set of simulated orebody models based on the ensemble Kalman filter method. This new configuration is observed by the RL agents, which adapt to these changes by providing updated equipment allocation and material destination decisions that maximize the mining complex’s overall financial performance. The interaction between the RL agents and the mining complex environment is repeated for multiple iterations, generating experiences that improve the RL agents’ decision-making capability. This improves the generalization capacity of the method to make quick, informed decisions when needed.

The proposed method is applied to a copper mining complex composed of two open-pit mines, multiple crushers and different processing facilities. The results show that the method adapts to incoming information to maximize the profit of the operation. Even if some shovels are initially allocated to a particular portion of the mine, the method reallocates them to ensure that metal production is improved. The approach can adapt decisions in real time, increasing cash flow. A comparison with a Baseline case that does not adapt the shovel allocation decisions shows that the proposed method increases the profitability of the operation by 47%. The study also shows the benefits of quickly updating the set of simulated orebody models given incoming information. This improves the ability to forecast future revenues by better characterizing the uncertainty associated with the mineral deposit, and the case study presents a 5.2% increase in profit.

Note that the proposed approach learns two policies simultaneously, which adds complexity to the learning process, as illustrated by the high variance in the learning curves shown in Figs. 13 and 14. Future research will focus on further investigating the learning process and attempting to stabilize the improvement of both policies. Another possibility is to modify the action space to help the RL agents efficiently explore profitable decisions while respecting operational requirements. Currently, the number of decisions available to the RL agents is not too large, but a limitation lies in assigning shovels to the highest-grade block when relocating them to different benches or areas; ideally, the RL agents should have the flexibility to assign shovels based on a variety of criteria in addition to the highest grade. Finally, short-term activities and processes must meet demanding operational requirements so that they can be easily implemented in practice, which must be further addressed. The work presented attempts to provide feasible shovel allocations at the mining faces, and it can be expanded to incorporate more operational requirements, including limiting the total amount of equipment per location and the number of mining fronts operated simultaneously, and improving stockpile modeling and grade-control activities.