EURO Journal on Decision Processes, Volume 1, Issue 3–4, pp 299–324

Multi-time-scale Markov decision processes for organizational decision-making

Original Article


Decision-makers in organizations and other hierarchical systems interact within and across multiple organizational levels and take interdependent actions over time. The challenge is to identify incentive mechanisms that align agents’ interests and to provide these agents with guidance for their decision processes. To this end, we developed a multiscale decision-making model that combines game theory with multi-time-scale Markov decision processes to model agents’ multi-level, multi-period interactions. For a two-level, multi-period, multi-agent problem, we derived closed-form solutions that determine optimal incentive levels, provide agents with decision rules, and identify information needs. We show the difference between a single-time-scale model, where agents make decisions with the same frequency, and a multi-time-scale model, where superior agents make decisions less frequently than their subordinates. Our results show that in the single-time-scale case, agents can make optimal decisions with information merely from the current and subsequent period. In the multi-time-scale case, data from the entire decision horizon are necessary. This paper contributes to multiscale decision theory by generalizing solutions for multi-period interactions that were previously limited to a few time periods. We apply our model to a management problem with an account manager and multiple customer representatives.


Keywords

Multiscale decision theory · Markov decision processes · Hierarchical games · Stochastic games · Management application

Mathematics Subject Classification

90C40 90B50 91A65 91A15 91A35 

1 Introduction

Decision-making over multiple periods is a challenging task for agents. The decision process is further complicated when agents across multiple hierarchical levels affect each other’s decisions and outcomes, and when uncertainties need to be considered. We model and analyze the interactions between decision-making agents across two hierarchical levels, where the higher-level agent relies on cooperative decisions by multiple lower-level agents. A conflict of interest exists between the two levels because the preferred actions of the lower-level agents reduce the superior agent’s likelihood of achieving its goal. To align interests, the higher-level agent can offer a share of its reward as an incentive to the lower-level agents. A sufficiently large incentive results in cooperative behavior by the lower-level agents. In this paper, we determine agents’ optimal decisions and incentives given their organizational interdependencies and multi-period interactions.

Decisions by higher- and lower-level agents occur on different time scales. The higher-level agent makes decisions on a slower time scale, i.e., less frequently, than the lower-level agents. In other words, the decisions of the higher-level agent are more strategic, whereas the decisions of the lower-level agents are more operational.

To analyze this type of interaction and to determine optimal decisions for agents, we combine a game-theoretic model with multi-time-scale Markov decision processes (MMDPs) to account for the multi-period, multi-time-scale interdependencies.

This paper contributes to multiscale decision theory (MSDT) by generalizing solutions for multi-period interactions that were previously limited to a few periods. In general, analytical models have been limited to either multi-agent, one-period problems or single-agent, multi-period problems. Multi-agent, multi-period problems have not yet been solved analytically. This paper develops a unified model for simultaneously solving hierarchical and temporal decision challenges. Lastly, this paper contributes to the literature on MMDPs (e.g., Chang et al. 2003) by proposing a novel modeling approach and formulation technique.

The paper is organized as follows: Sect. 2 discusses the related literature and Sect. 3 introduces the model. In Sect. 4, the agent interaction is analyzed, first for a single-time-scale situation and then for the multi-time-scale case. Section 5 provides a numerical example of the model. Section 6 presents the conclusions.

2 Literature review

We review the literature on hierarchical and organizational decision-making that has influenced the development of MSDT. We discuss the state-of-the-art methods in MSDT and show their relationship with the literature on multi-time-scale modeling and multi-organizational scale modeling.

2.1 Hierarchical and organizational decision-making

The phenomenon of organizations and their hierarchical structures was first studied by sociologists and economists. In his 1922 posthumously published work on “Economy and Society”, Max Weber (Weber 1978) views bureaucracies as hierarchical structures composed of rational decision-makers. The seminal work by Williamson (1979) on transaction-cost economics linked the existence of organizations and their hierarchical structures to transaction costs associated with asset specificity, uncertainty, and transaction frequency. Transaction costs are the reasons why organizations are formed as opposed to single consumer–producer interactions in the market place. Williamson (1967, 1970) had earlier developed a quantitative model to determine the optimal number of hierarchical levels in organizations.

Hierarchy within organizations can be explained with information-processing limitations and social preferences of humans. Radner (1992) views decentralization as a necessity to overcome the limitations of individuals in processing information, but attributes the fact that decentralization takes the form of hierarchy to sociological and psychological reasons rather than pure economic reasons. The concept of bounded rationality reconciled economists’ rationality assumption with behavioral observations of the limitations of human decision-making. Bounded rationality is most prominently associated with the work of Simon (1957, 1972), who received the Nobel Memorial Prize in Economics for his work in this area. Earlier discussions of the bounded rationality concept can be traced back to “Foundations of Statistics” by Savage (1954).

Rational decision-making of interdependent individuals in organizations can be formulated mathematically using cooperative and non-cooperative game theory. The theory of teams by Marschak (1955) and Marschak and Radner (1972) formulates the problems of collaborating individuals who have the same interests and beliefs, but do not share the same information. In contrast to this cooperative game theory approach, non-cooperative game theory is used to describe conflicts of interest between individuals in organizations. Personal and/or team objectives can be in conflict with those of other individuals, teams, or the higher-level objectives, particularly in large organizations. Incentives can be used to align conflicting interests and motivate work that supports higher-level, and thus more important, organizational objectives (Wernz 2008).

This paper determines effective incentives for objective alignment, and our model has similarities to principal-agent models. Principal-agent theory describes a hierarchical interaction between two decision-makers with incomplete and asymmetric information. The superior decision-maker (principal) offers a contract to the subordinate decision-maker (agent). The contract is designed to motivate the agent to perform work in the interest of the principal (Milgrom and Roberts 1992). Although a number of papers, e.g., Besanko (1985) and Plambeck and Zenios (2000), discuss principal-agent models for multiple periods, also called dynamic principal-agent models, no publication has yet discussed the multi-time-scale aspect in which the principal makes decisions on a strategic level, i.e., less frequently than the subordinate agent.

Organizational agent interactions have also been studied in fields outside of economics, particularly in operations research, management science, computer science, and systems engineering. In the following paragraphs, we present contributions that have most significantly influenced the development of MSDT.

A taxonomy to classify and formally describe hierarchical agent interactions was developed by Schneeweiss (1995, 2003a, b). A unified notation for various hierarchical agent interaction situations was proposed. The author acknowledges the different time scales present in organizational systems (strategic, tactical, operational), but the models only describe one-period interactions.

Mesarovic et al. (1970) developed a mathematical framework to describe and analyze multi-level hierarchical systems, but again only for single period interactions. A mathematical theory of coordination was developed based on the conceptual and formal aspects of hierarchical systems. From their work, we adopted the interaction relationship between lower and higher levels, in which the success of the supremal agent depends on the performance of the infimal agent.

To account for uncertainty and temporal dynamics in strategic agent interactions, stochastic game theory was developed (Shapley 1953). Stochastic games are a combination of Markov decision processes (MDPs) (Puterman 1994) and classical game theory. Shapley (1953) was the first to propose an algorithm that solves stochastic games. Zachrisson (1964) coined the term “Markov games” to emphasize the connection to MDPs. For survey papers on stochastic games and solution algorithms see Howard (1960), Pollatschek and Avi-Itzhak (1969), Raghavan and Filar (1991), Mertens (1992), Filar and Vrieze (1996), and Bowling and Veloso (2000). We are using an MDP formulation in our model to account for the stochastic, multi-period agent interactions. In the following section, we further discuss this topic with an emphasis on multi-time-scale aspects.

2.2 Multi-time-scale decision-making models

Researchers have used various approaches to model multi-time-scale phenomena via MDPs. One can distinguish between two classes of contributions: (1) multi-time-scale systems that are modeled with MDPs and (2) one-time-scale systems with imposed hierarchical, multi-time-scale structures. The latter is mostly used to increase the computational efficiency of solution algorithms; e.g., Sutton (1995), Hauskrecht et al. (1998), Parr (1998), and Sutton et al. (1999). Our model falls into class (1) of multi-time-scale systems, and the related literature is discussed in the following paragraphs.

Chang et al. (2003) proposed MMDPs. The authors’ model is an extension of Muppala et al. (1996) and Goseva-Popstojanova and Trivedi (2000), in which the lower level is modeled as a Markov chain and the upper level is described via a Markov reward process. MMDP has been applied to production planning in semiconductor fabrication (Panigrahi and Bhatnagar 2004; Bhatnagar and Panigrahi 2006), management of hydro-power plants (Zhu et al. 2006), and has influenced work on target tracking in sensor networks (Yeow et al. 2005, 2007) and reverse supply chains (Wongthatsanekorn et al. 2010). Chang (2004) extended his earlier work to include game-theoretic interactions between agents with conflicting interests. In Chang’s model, agents make their decisions sequentially. In our model, we assume that agents do not know the actions taken by the other agents, which we recognize by applying a non-cooperative game-theoretic model with simultaneous decision-making.

Jacobson et al. (2003) proposed periodically time-inhomogeneous Markov decision processes (PTMDPs). The evolution of the system is described by (N + 1)-periodic sequences of reward functions and transition probabilities. The first N epochs are fast-scale epochs, while the interval N + 1 is a slow-scale cycle. Their approach presents an alternative to our approach of modeling multi-time-scale evaluation, yet without including strategic agent interactions.

In their book, Sethi and Zhang (1994) focused on hierarchical decision-making in stochastic manufacturing systems. Different hierarchical levels have different time scales associated with their decisions. Hierarchy is discussed from a temporal perspective, but organizational interactions across hierarchies are not explicitly modeled.

Delebecque and Quadrat (1981) analyzed a multi-time-scale phenomenon in hydroelectric power generation with singularly perturbed MDPs. In singularly perturbed MDPs, small perturbations on fast time scales affect the aggregate behavior observable on slower time scales. Singularly perturbed MDPs have been extensively studied in the literature; see Yin and Zhang (1998, 2005) for an overview. In our model, fast-scale decisions also affect higher-level phenomena, but the link is through organizational hierarchies, not perturbations.

2.3 Multiscale decision theory

The model that describes the type of hierarchical agent interaction considered in this paper was introduced by Wernz and Deshmukh (2007a, b). Rewards are passed down from a supremal agent to motivate cooperative decisions by infimal agents. A cooperative decision by an infimal agent improves the supremal agent’s chances of success, i.e., increases the probability of a transition to a state associated with higher rewards.

Wernz and Deshmukh (2010) built upon this initial model in their development of the multiscale decision-making framework. Their model was the first comprehensive approach that could derive closed-form, analytic solutions for one period, multi-organizational scale interactions in large, hierarchical organizations. The multiscale decision-making framework has been applied to a two-level management challenge (Wernz and Deshmukh 2007b), a two-level production planning problem (Wernz and Deshmukh 2007a), a three-level service operations application (Wernz and Henry 2009), and a three-stage supply chain problem (Henry and Wernz 2013). Further details on these applications are presented at the end of the example problem in Sect. 3.1.

Wernz and Deshmukh (2012) extended a three-level model from one to three periods. The next step in the theory’s development is to extend the multiscale decision-making model to account for interactions over many periods. The first efforts in this direction were documented in conference proceedings by Wernz and Deshmukh (2009, 2010). This paper extends their work by generalizing the model from two agents to n + 1 agents, by formulating theorems and developing proofs, by applying the model to a challenge problem, by providing a numerical example, and by discussing the results and their implications in a decision processes context.

3 Model

We motivate our model with an example problem from the customer service industry. The purpose of the example is to illustrate the applicability of the multiscale decision-making model and to put the mathematical result into a decision-relevant context.

3.1 Example problem

In a customer service division of an organization, an account manager is supported by a number of customer representatives. The manager is responsible for renewing customer contracts at regular time intervals. The day-to-day customer care is handled by the representatives, who answer customer phone calls to solve customer problems.

When contacting customers for service contract renewals, the manager decides between two approaches: (1) she calls the customer to discuss new contract terms, or (2) sends a small gift along with a letter with details of the new terms. Both actions have the same costs for the organization. As a result of the manager’s action, customers either renew or discontinue their contracts. Both outcomes are possible for a given action, but a call is more likely to result in a renewal than when a letter with a gift is sent.

The representatives, when receiving a customer’s phone call with a service request, can also take one of two actions. To address a customer request, representatives can (1) be strict and insist on the contractual agreement when identifying resources to solve the customer’s problem, or (2) be lenient and use additional resources to try to satisfy the customer. As a result of the interaction, the customer is either satisfied or dissatisfied. Satisfying the customer is associated with additional cost to the organization. A lenient response to a customer request most likely results in a satisfied, high-cost customer, while a strict response most likely results in an unsatisfied, low-cost customer. Still, both outcomes are possible for a given action.

The representative’s pay depends on the cost he produces; lower costs are preferred and result in a higher salary from the organization. Consequently, the representative prefers to be strict. However, since customer satisfaction has an influence on contract renewal rates, the manager might be willing to compensate the payoff difference for the representatives. If it is financially advantageous for her, the manager will offer an incentive to the representatives. We assume that the incentive is a proportional share of the manager’s pay. The higher the customer renewal rate, the higher the manager’s pay, and the higher the incentive for the representatives.

A number of questions arise for the manager and the representatives: how high does the incentive for the representatives have to be to switch from the initially preferred action of being strict to the cooperative action of being lenient? What is the maximum share of her pay the manager is willing to offer before the cost of incentives exceeds the benefits? What role do the transition probabilities that describe the link between actions and outcomes play? How can the manager and the representatives compute their optimal decisions? What information do they need to determine an optimal decision? How do the temporal dynamics of representatives interacting multiple times with a customer before contract renewal affect all of these aspects?

The temporal dynamics will be modeled via MDPs. To account for the fact that the manager only makes decisions from time to time, while the representatives make their decisions more frequently, we model the manager’s decision as time-invariant until the next contract renewal phase. This means that the manager’s decision (i.e., to call or send a letter) and the incentive level she offers to the representatives do not change over time. This difference in decision frequency accounts for the multi-time-scale nature of the problem, and the fact that the time-scale is coupled to the organizational scale.

Figure 1 illustrates the model and the relationship between organizational scale and time scale. On the left side of Fig. 1, a dependency graph (Dolgov and Durfee 2004) depicts the mutual influence of the decision-makers. Corresponding to the organizational scale, the right side of Fig. 1 shows the multi-period and multi-time-scale aspects across both hierarchical levels.
Fig. 1 Multiscale model for two hierarchical levels and many periods

The following section will describe the example problem mathematically and in general terms. The mathematical model description is applicable to a wide range of hierarchical decision-making problems. The customer service example serves the purpose of illustrating results and providing the context for the decision process studied.

As mentioned in the literature review, similar examples with applications in production planning, maintenance service operations, and supply chain management have been analyzed. In production planning, the operations manager, production planner, and material handler in a hierarchical chain have conflicts of interest related to profitability, meeting production deadlines, and inventory costs (Wernz and Deshmukh 2007a, 2010b). In a three-level maintenance service scenario, an account manager uses incentives to align the interests of her subordinate maintenance supervisor, who in turn seeks to align the actions of his maintenance workers (Wernz and Henry 2009; Wernz and Deshmukh 2012). Multiscale decision-making has also been used to coordinate the decisions between organizations. In a supply chain, a retailer, a wholesaler, and a producer can use incentives to coordinate their chain-wide decisions (Henry and Wernz 2013). Further applications in health care and hospital management are currently being explored (Sudhaakar and Wernz 2013).

3.2 Model formulation

In the general model, the account manager will be referred to as the supremal agent, or agent SUP, and the representatives as infimal agents or agents INF1, INF2 etc., generally INFx, with \( x = 1, \ldots ,X \).

We formulate the multi-period model as a discrete time problem with N epochs that demarcate \( N - 1 \) periods following the MDP notation of Puterman (1994). Time begins with decision epoch 1, the start of period 1, and ends with period \( N - 1 \) followed by the final epoch N.

At every decision epoch \( t = 1, \ldots ,N - 1 \), agents INFx carry out actions \( a_{n(x),t}^{{{\text{INF}}x}} \), where \( n\left( x \right) \) is the index that denotes the chosen action from agent INFx’s action space. No action is carried out at the final epoch N. Agent SUP decides in the first decision epoch to take an action indexed by m for this and all future periods, i.e., \( a_{m,t}^{\text{SUP}} \) with \( t = 1, \ldots ,N - 1 \). By limiting agent SUP to only one decision that governs the entire time horizon, we account for the different decision frequencies between agent SUP and agents INFx. In our example, the manager makes one decision (call customer or send letter), while the representatives have multiple interactions and make a new decision (strict or lenient) each time.

Associated with each agent are states \( s_{i,t}^{\rm SUP} \in {\mathcal{S}}_{t}^{\rm SUP} \) and \( s_{j\left( x \right),t}^{{{\text{INF}}x}} \in {\mathcal{S}}_{t}^{{{\text{INF}}x}} \) for every period. Index i and index function \( j(x) \) denote the agents’ possible states in period t. Depending on their current states \( s_{i,t}^{\text{SUP}} \) and \( s_{j\left( x \right),t}^{{{\text{INF}}x}} \), and their actions \( a_{m,t}^{\text{SUP}} \) and \( a_{n\left( x \right),t}^{{{\text{INF}}x}} \), agents move to states \( s_{k,t + 1}^{\rm SUP} \) and \( s_{l\left( x \right),t + 1}^{{\rm INF}x} \) with probabilities \( p_{t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{m,t}^{\rm SUP} } \right) \) and \( p_{t}^{{\rm INF}x} \left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} |s_{j\left( x \right),t}^{{\rm INF}x} ,a_{n\left( x \right),t}^{{\rm INF}x} } \right) \) in the following period. The state to which an agent transitions determines the agent’s reward (referred to as “pay” in the example), denoted by \( r_{t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} } \right) \) and \( r_{t}^{{\rm INF}x} \left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right) \), respectively. This process repeats for every period \( t = 1, \ldots ,N - 1 \).

We analyze a situation in which each agent has a distinct set of two actions and two states in each period. The action spaces for agents SUP and INFx at a given decision epoch t are denoted by \( {\mathcal{A}}_{t}^{\rm SUP} : = \left\{ {a_{1,t}^{\rm SUP} ,a_{2,t}^{\rm SUP} } \right\} \), \( {\mathcal{A}}_{t}^{{\rm INF}x} : = \left\{ {a_{1,t}^{{\rm INF}x} ,a_{2,t}^{{\rm INF}x} } \right\} \) with \( t = 1, \ldots ,N - 1 \) and their state spaces by \( {\mathcal{S}}_{t}^{\rm SUP} : = \left\{ {s_{1,t}^{\rm SUP} ,s_{2,t}^{\rm SUP} } \right\} \), \( {\mathcal{S}}_{t}^{{\rm INF}x} : = \left\{ {s_{1,t}^{{\rm INF}x} ,s_{2,t}^{{\rm INF}x} } \right\} \) with \( t = 1, \ldots ,N \).

In the context of our example, action \( a_{1,t}^{\rm SUP} \) corresponds to the manager’s phone call, and \( a_{2,t}^{\rm SUP} \) to sending a letter to the customer. The corresponding outcomes are \( s_{1,t}^{\rm SUP} \) for contract renewal and \( s_{2,t}^{\rm SUP} \) for no contract renewal. For the customer representatives, \( a_{1,t}^{{\rm INF}x} \) corresponds to being lenient and \( a_{2,t}^{{\rm INF}x} \) to being strict. The outcomes are \( s_{1,t}^{{\rm INF}x} \), a satisfied, high-cost customer, and \( s_{2,t}^{{\rm INF}x} \) a dissatisfied, low-cost customer.

The initial rewards for agents SUP and INFx are
$$ r_{t}^{\rm SUP} \left( {s_{1,t + 1}^{\rm SUP} } \right): = \rho_{1,t}^{\rm SUP} ,\quad r_{t}^{\rm SUP} \left( {s_{2,t + 1}^{\rm SUP} } \right): = \rho_{2,t}^{\rm SUP}, \tag{1} $$
$$ r_{t}^{{\rm INF}x} \left( {s_{1,t + 1}^{{\rm INF}x} } \right): = \rho_{1,t}^{{\rm INF}x},\quad r_{t}^{{\rm INF}x} \left( {s_{2,t + 1}^{{\rm INF}x} } \right): = \rho_{2,t}^{{\rm INF}x}. \tag{2} $$
The data can alternatively be represented in vector notation:
$$ R_{t}^{\rm SUP} : = \begin{pmatrix} \rho_{1,t}^{\rm SUP} \\ \rho_{2,t}^{\rm SUP} \end{pmatrix},\quad R_{t}^{{\rm INF}x} : = \begin{pmatrix} \rho_{1,t}^{{\rm INF}x} \\ \rho_{2,t}^{{\rm INF}x} \end{pmatrix}. \tag{3} $$
The initial state-dependent transition probabilities for agent SUP are
$$ p_{t}^{\rm SUP} \left( {s_{1,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{1,t}^{\rm SUP} } \right): = \alpha_{i.1,t}^{\rm SUP},\quad p_{t}^{\rm SUP} \left( {s_{2,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{1,t}^{\rm SUP} } \right): = 1 - \alpha_{i.1,t}^{\rm SUP}, \tag{4} $$
$$ p_{t}^{\rm SUP} \left( {s_{1,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{2,t}^{\rm SUP} } \right): = 1 - \alpha_{i.2,t}^{\rm SUP} ,\quad p_{t}^{\rm SUP} \left( {s_{2,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{2,t}^{\rm SUP} } \right): = \alpha_{i.2,t}^{\rm SUP} \tag{5} $$
with \( i = 1,2 \), \( m = 1,2 \) and \( 0 \le \alpha_{i.m,t}^{\rm SUP} \le 1 \). In matrix notation, we can compactly represent the data of agent SUP as
$$ P_{t}^{\rm SUP} \left( {s_{i,t}^{\rm SUP} } \right): = \begin{pmatrix} \alpha_{i.1,t}^{\rm SUP} & 1 - \alpha_{i.1,t}^{\rm SUP} \\ 1 - \alpha_{i.2,t}^{\rm SUP} & \alpha_{i.2,t}^{\rm SUP} \end{pmatrix}. \tag{6} $$
The rows of the matrices correspond to the actions and the columns correspond to the states to which the agent transitions. For agents INFx, the notation can be represented accordingly, and in matrix notation the transition probabilities are
$$ P_{t}^{{\rm INF}x} \left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right): = \begin{pmatrix} \alpha_{j\left( x \right).1,t}^{{\rm INF}x} & 1 - \alpha_{j\left( x \right).1,t}^{{\rm INF}x} \\ 1 - \alpha_{j\left( x \right).2,t}^{{\rm INF}x} & \alpha_{j\left( x \right).2,t}^{{\rm INF}x} \end{pmatrix} \tag{7} $$
with index function \( j\left( x \right) = 1,2 \).
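The matrix layout described above (rows for actions, columns for next states) can be illustrated with a short Python sketch. The function name `transition_matrix` and the numerical values are assumptions chosen for illustration; they do not come from the paper.

```python
def transition_matrix(alpha_1, alpha_2):
    """Build a 2x2 transition matrix for one current state.

    Rows correspond to actions a_1 and a_2, columns to next states s_1 and s_2.
    alpha_1 is the probability that action a_1 leads to state s_1;
    alpha_2 is the probability that action a_2 leads to state s_2.
    """
    for alpha in (alpha_1, alpha_2):
        if not 0.0 <= alpha <= 1.0:
            raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha_1, 1.0 - alpha_1],
        [1.0 - alpha_2, alpha_2],
    ]


# Hypothetical values for agent SUP in current state s_1 (illustrative only)
P_sup_state1 = transition_matrix(0.75, 0.625)
for row in P_sup_state1:
    print(row)
```

Each row sums to one because, for a given action, the agent must reach one of the two states.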
The hierarchical interactions between agent SUP and agents INFx consist of a bottom-up and a top-down influence. The bottom-up influence is the effect agents INFx have on agent SUP’s expected outcome. The states to which agents INFx transition influence agent SUP’s chances of success, i.e., its transition probability \( p_{t}^{\rm SUP} \). MSDT models this influence using an additive influence function \( f_{x,t} \). The final transition probability of agent SUP is based on the initial transition probability and the influence functions such that
$$ p_{{\rm final},t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{m,t}^{\rm SUP} ,\left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right) = p_{t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{m,t}^{\rm SUP} } \right) + \sum\limits_{x = 1}^{X} {f_{x,t} \left( {s_{k,t + 1}^{\rm SUP} |s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)} \tag{8} $$
for indices \( i,k,m,l\left( x \right) = 1,2 \), and with \( \left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} := s_{l\left( 1 \right),t + 1}^{{\rm INF}1}, s_{l\left( 2 \right),t + 1}^{{\rm INF}2} , \ldots,s_{l\left( X \right),t + 1}^{{\rm INF}X} \). We choose the influence function to be a constant and define it as
$$ f_{x,t} \left( {s_{k,t + 1}^{\rm SUP} |s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right): = \begin{cases} c_{x,t} & \text{if } k = l\left( x \right), \\ - c_{x,t} & \text{if } k \ne l\left( x \right), \end{cases} \quad \text{with } c_{x,t} > 0. \tag{9} $$
Constant \( c_{x,t} \) is referred to as the change coefficient. We denote the aggregate influence of all agents INFx in period t as \( C_{t} \) with \( C_{t} : = \sum\nolimits_{x = 1}^{X} {f_{x,t} \left( {s_{k,t + 1}^{\rm SUP} |s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)} \). Since probabilities can neither be negative nor exceed unity, \( 0 \le p_{{\rm final},t}^{\rm SUP} \left( \cdot \right) \le 1 \) must hold, which bounds the aggregate change coefficient to
$$ 0 \le C_{t} \le \min \left\{ {\alpha_{i.1,t}^{\rm SUP} ,\alpha_{i.2,t}^{\rm SUP} ,1 - \alpha_{i.1,t}^{\rm SUP} ,1 - \alpha_{i.2,t}^{\rm SUP} } \right\}. \tag{10} $$
The meaning of the chosen change coefficient structure in (9) is as follows: in each period, states \( s_{1,t}^{{\rm INF}x} \) increase the probability of state \( s_{1,t}^{\rm SUP} \) and reduce the probability of state \( s_{2,t}^{\rm SUP} \) accordingly. Agent SUP’s transition probabilities change in the opposite direction for agents INFx’s states \( s_{2,t}^{{\rm INF}x} \), i.e., state \( s_{2,t}^{\rm SUP} \) becomes more likely and state \( s_{1,t}^{\rm SUP} \) becomes less likely. This effect on transition probabilities applies to situations where the states of infimal agents lead to agent SUP reaching a specific state with higher or lower probability. The aggregate change coefficient \( C_{t} \) describes the influence of all infimal agents combined. In the context of our example, representatives who can achieve high customer satisfaction contribute to a higher contract renewal rate for the manager.
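A minimal Python sketch of the bottom-up influence may help to make the mechanics concrete. It adds the constant influence of each infimal agent to agent SUP’s initial transition probability and rejects parameter choices that leave the interval [0, 1]. The function names and all numerical values are assumptions made for illustration only.

```python
def influence(k, l, c):
    """Constant influence of one infimal agent: +c if SUP state index k
    equals the infimal agent's state index l, and -c otherwise."""
    return c if k == l else -c


def final_transition_prob(p_initial, k, inf_next_states, change_coeffs):
    """Initial probability of SUP reaching state k plus the summed influences."""
    p_final = p_initial + sum(
        influence(k, l, c) for l, c in zip(inf_next_states, change_coeffs)
    )
    if not 0.0 <= p_final <= 1.0:
        raise ValueError("change coefficients push the probability outside [0, 1]")
    return p_final


# Hypothetical data: SUP takes action 1 in state i, two infimal agents
alpha = 0.75            # initial probability of reaching SUP state 1
c = [0.0625, 0.0625]    # change coefficients of the two infimal agents

# Both infimal agents reach state 1 (e.g., two satisfied customers)
p_s1 = final_transition_prob(alpha, 1, (1, 1), c)
p_s2 = final_transition_prob(1.0 - alpha, 2, (1, 1), c)
print(p_s1, p_s2)
```

In this sketch the two final probabilities still sum to one, because every influence added to one SUP state is subtracted from the other.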
The top-down influence of the hierarchical interaction between agent SUP and agents INFx is an incentive payment by agent SUP to all agents INFx. Each agent INFx receives a share \( b_{x,t} \) of agent SUP’s reward. We refer to \( b_{x,t} \) as the share coefficient. The final reward in period t for an agent INFx is its initial reward plus the incentive, i.e.,
$$ r_{{\rm final},t}^{{\rm INF}x} \left( {s_{k,t + 1}^{\rm SUP} ,s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right): = r_{t}^{{\rm INF}x} \left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right) + b_{x,t} \times r_{t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} } \right) = \rho_{l\left( x \right),t}^{{\rm INF}x} + b_{x,t} \times \rho_{k,t}^{\rm SUP}. \tag{11} $$
Agent SUP’s initial reward is reduced by the reward share given to agents INFx. Agent SUP’s final reward is
$$ r_{{\rm final},t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} } \right): = \left( {1 - \sum\limits_{x = 1}^{X} {b_{x,t} } } \right) \times \rho_{k,t}^{\rm SUP}. \tag{12} $$
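The top-down incentive can be sketched in a few lines of Python. The sketch computes the final rewards of agent SUP and of each agent INFx for one realized pair of outcomes; the function name `final_rewards` and the numbers are hypothetical and serve only as an illustration.

```python
def final_rewards(rho_sup_k, rho_inf, shares):
    """Final one-period rewards after the incentive transfer.

    rho_sup_k : initial reward of SUP for the state it reached
    rho_inf   : initial rewards of the infimal agents for the states they reached
    shares    : share coefficients b_x, one per infimal agent
    """
    sup_reward = (1.0 - sum(shares)) * rho_sup_k
    inf_rewards = [rho + b * rho_sup_k for rho, b in zip(rho_inf, shares)]
    return sup_reward, inf_rewards


# Hypothetical example: SUP earns 100, two representatives earn 4 and 6,
# and each representative receives a 12.5 % share of SUP's reward
sup_reward, inf_rewards = final_rewards(100.0, [4.0, 6.0], [0.125, 0.125])
print(sup_reward, inf_rewards)
```

The incentive is a pure transfer: what agent SUP gives up equals the sum of what the infimal agents receive on top of their initial rewards.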
Figure 2 provides a graphical summary of the model. It shows the interaction between agent SUP and an agent INFx over two periods. The following assumptions about the model’s parameters are made:
$$ \rho_{1,t}^{\rm SUP} > \rho_{2,t}^{\rm SUP} ,\quad \rho_{1,t}^{{\rm INF}x} < \rho_{2,t}^{{\rm INF}x} \tag{13} $$
$$ \alpha_{i.m,t}^{\rm SUP} ,\alpha_{j\left( x \right).n,t}^{{\rm INF}x} > \frac{1}{2} \tag{14} $$
$$ \alpha_{1.1,t}^{\rm SUP} > \alpha_{2.1,t}^{\rm SUP} ,\quad \alpha_{1.1,t}^{{\rm INF}x} > \alpha_{2.1,t}^{{\rm INF}x}. \tag{15} $$
Inequalities in (13) express that agent SUP prefers state \( s_{1,t + 1}^{\rm SUP} \) over \( s_{2,t + 1}^{\rm SUP} \) and agents INFx initially prefer \( s_{2,t + 1}^{{\rm INF}x} \) over \( s_{1,t + 1}^{{\rm INF}x} \). This corresponds to the manager’s preference for contract renewal and the representatives’ goal of low costs. One can see that a conflict of interest exists between agent SUP and agents INFx. Agents INFx initially prefer to reach state \( s_{2,t + 1}^{{\rm INF}x} \), which would reduce agent SUP’s chance of attaining its preferred state \( s_{1,t + 1}^{\rm SUP} \).
Fig. 2 Schematic model representation

Expression (14) states that an action is linked to the state with the same index. In other words, there is an index-corresponding action for every state, which is the most likely consequence of the respective action. This restriction circumvents redundant cases in the analysis.

Inequalities in (15) express that, given action \( a_{1,t}^{{}} \), the transition probability \( \alpha_{1.1,t}^{{}} \) from state \( s_{1,t}^{{}} \) to \( s_{1,t + 1}^{{}} \) is greater than \( \alpha_{2.1,t}^{{}} \), which denotes the probability of transitioning from state \( s_{2,t}^{{}} \) to \( s_{1,t + 1}^{{}} \). That means it is more likely to remain in state 1 than to switch from state 2 to state 1, given the corresponding action \( a_{1,t}^{{}} \). Applied to our example, this implies that the probability of contract renewal is greater than non-renewal when the customer has planned to renew and the manager makes the phone call. At the representative level, this means that a satisfied customer will more likely remain satisfied as opposed to dissatisfied when the representative is lenient.

Agents seek to maximize their rewards. For each agent INFx in period t the expected reward is:
$$ \begin{gathered} E\left( {r_{{\rm final},t}^{{\rm INF}x} |s_{i,t}^{\rm SUP} ,\left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} ,a_{m,t}^{\rm SUP} ,\left( {a_{n\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right): = \hfill \\ \sum\limits_{k = 1}^{2} {\sum\limits_{l\left( 1 \right) = 1}^{2} { \ldots \sum\limits_{l\left( x \right) = 1}^{2} { \ldots \sum\limits_{l\left( X \right) = 1}^{2} {\left[ {r_{{\rm final},t}^{{\rm INF}x} \left( {s_{k,t + 1}^{\rm SUP} ,s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right) \times p_{{\rm final},t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{m,t}^{\rm SUP} ,\left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right)} \right.} } } } \hfill \\ \times \prod\limits_{x = 1, \ldots ,X} {\left. {p_{t}^{{\rm INF}x} \left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} |s_{j\left( x \right),t}^{{\rm INF}x} ,a_{n\left( x \right),t}^{{\rm INF}x} } \right)} \right]}. \hfill \\ \end{gathered} $$
The expected reward for agent SUP can be calculated in the same way by replacing \( r_{{\rm final},t}^{{\rm INF}x} \left( {s_{k,t + 1}^{\rm SUP} ,s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)\,\,{\text{with}}\,\,r_{{\rm final},t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} ,\left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right) \hbox{ in} \) (16).
Agents seek to maximize the sum of all \( r_{{\rm final},t}^{{}} \) over the time horizon, i.e., their cumulative rewards. The cumulative rewards from period t to period \( N - 1 \) for agents INFx and SUP are
$$ r_{{\rm final}(t)}^{{\rm INF}x} \left( {s_{i,t}^{\rm SUP} ,s_{j\left( x \right),t}^{{\rm INF}x} } \right): = \sum\limits_{\tau = t}^{N - 1} {r_{{\rm final},\tau }^{{\rm INF}x} \left( {s_{k,\tau + 1}^{\rm SUP} ,s_{l\left( x \right),\tau + 1}^{{\rm INF}x} } \right)} $$
$$ r_{{\rm final}(t)}^{\rm SUP} \left( {s_{i,t}^{\rm SUP} ,\left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right): = \sum\limits_{\tau = t}^{N - 1} {r_{{\rm final},\tau }^{\rm SUP} \left( {s_{k,\tau + 1}^{\rm SUP} ,\left( {s_{l\left( x \right),\tau + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right)} $$
To calculate the expected cumulative reward, which agents need to calculate to determine their optimal course of action, the backward induction principle (Bellman 1957) is applied, starting in the last period of the time horizon and working backwards to period 1. The expected cumulative reward from period t to period \( N - 1 \) is
$$ \begin{gathered} E\left( {r_{{\rm final}\left( t \right)}^{{\rm INF}x} |s_{i,t}^{\rm SUP} ,\left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} ,a_{m,t}^{\rm SUP} ,\left( {a_{n\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right): = \hfill \\ E\left( {r_{{\rm final},t}^{{\rm INF}x} |s_{i,t}^{\rm SUP} ,\left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} ,a_{m,t}^{\rm SUP} ,\left( {a_{n\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right) + \hfill \\ \sum\limits_{k = 1}^{2} {\sum\limits_{l\left( 1 \right) = 1}^{2} { \ldots \sum\limits_{l\left( x \right) = 1}^{2} \ldots \sum\limits_{l\left( X \right) = 1}^{2} {p_{{\rm final},t}^{\rm SUP} \left( {s_{k,t + 1}^{\rm SUP} |s_{i,t}^{\rm SUP} ,a_{m,t}^{\rm SUP} ,\left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right) \times \prod\limits_{x = 1, \ldots ,X} {p_{t}^{{\rm INF}x} \left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} |s_{j\left( x \right),t}^{{\rm INF}x} ,a_{n\left( x \right),t}^{{\rm INF}x} } \right)} } } } \times \hfill \\ E^{ * } \left( {r_{{{\rm final}\left( {t + 1} \right)}}^{{\rm INF}x} |s_{k,t + 1}^{\rm SUP} ,\left( {s_{l\left( x \right),t + 1}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right) \hfill \\ \end{gathered} $$
for \( i,j,m,n = 1,2 \), \( t = 1, \ldots ,N - 1 \) and with \( E^{ * } \left( {r_{{\rm final}\left( N \right)}^{{\rm INF}x} |s_{k,N}^{\rm SUP} ,\left( {s_{l\left( x \right),N}^{{\rm INF}x} } \right)_{1, \ldots ,X} } \right) = 0 \).

In the following analysis, we determine optimal incentive levels and optimal actions for the agents.

4 Analysis

We assume that agents are risk-neutral and rational, i.e., agents maximize their expected utilities, or equivalently, their expected rewards. Rational agents are able to calculate both their own expected rewards and the other agents’ expected rewards, and can therefore determine which decisions yield the highest expected rewards for themselves. Hence, agents will engage in a game-theoretic reasoning process that recognizes the interdependence of their decisions. The Nash equilibrium concept is used to determine which decisions the rational agents take for a given incentive.

The game occurs in two stages. In the first stage, agent SUP determines the incentive level, i.e., the share coefficient. In the second stage, agents take the respective actions in response to the incentive level chosen. To determine an optimal action, agents evaluate their and the other agents’ cumulative expected rewards of the entire time horizon, i.e., \( E\left( {r_{{\rm final}\left( 1 \right)}^{\rm SUP} | \cdot } \right) \) and \( E\left( {r_{{\rm final}\left( 1 \right)}^{{\rm INF}x} | \cdot } \right) \).

The Nash equilibrium is the set of agents’ actions that are the best responses to other agents’ preferred actions. Expressed mathematically, this means that the Nash equilibrium is a strategy profile \( a^{*} = \left( {a_{m,t}^{SUP*} ,a_{n(1),t}^{INF1*} , \ldots ,a_{n(X),t}^{INFX*} } \right)_{t = 1, \ldots ,N - 1} \) for a given \( b^{*} \) such that
$$ E\left( {r_{{\rm final}\left( 1 \right)}^{\rm SUP} | \cdot ,a^{*} } \right) \ge E\left( {r_{{\rm final}\left( 1 \right)}^{\rm SUP} | \cdot ,a^{ - } } \right)\,\,{\text{and}} $$
$$ E\left( {r_{{\rm final}\left( 1 \right)}^{{\rm INF}x} | \cdot ,a^{*} } \right) \ge E\left( {r_{{\rm final}\left( 1 \right)}^{{\rm INF}x} | \cdot ,a^{ - } } \right)\,\,\forall x $$
for all \( a^{ - } \), where \( a^{ - } \) refers to any strategy profile different from \( a^{*} \).

It is easy to show that agent SUP has a dominant action \( a_{1,t}^{\rm SUP} \), i.e., a best response regardless of other agents’ actions. Since agent SUP’s preferred state is \( s_{1,t}^{\rm SUP} \), it will choose action \( a_{1,t}^{\rm SUP} \) as it has the highest probability of reaching state \( s_{1,t}^{\rm SUP} \). This result holds for all levels of incentives. For a formal proof, see Wernz and Henry (2009).

Agent SUP’s chances of success, i.e., reaching states \( s_{1,t}^{\rm SUP} \), are improved when agents INFx switch from their initially preferred actions \( a_{2,t}^{{\rm INF}x} \) to \( a_{1,t}^{{\rm INF}x} \). For a sufficiently high level of incentives, agents INFx will choose the cooperative actions \( a_{1,t}^{{\rm INF}x} \). Agent SUP will offer this incentive if its expected gain in reward is larger than the cost of the incentive.

In the following sections, we will determine the conditions for which a collaborative Nash equilibrium exists. First, we determine the Nash equilibrium for a single-time-scale model, where agents make decisions at the same frequency, before analyzing the multi-time-scale case. The single-time-scale model provides important results that are the basis for multi-time-scale analysis. In the multi-time-scale model, hierarchical interactions occur on multiple time scales, i.e., agent SUP and agents INFx make decisions at different frequencies.

4.1 Single-time-scale model

In the single-time-scale model, agent SUP can vary its decision from period to period and can select different incentive levels (share coefficients) in each period. In contrast, agent SUP can choose only one decision and one share coefficient for all periods in the multi-time-scale case.

We begin the analysis by determining the share coefficients \( b_{x,t} \) for which agents INFx would switch from non-cooperative to cooperative actions. Using backward induction, we start in the final decision epoch, and then move to earlier periods.

Theorem 1

The share coefficients \( b_{x,N - 1} \) of the final period \( N - 1 \) that are cost-minimal for agent SUP and that motivate agents INFx to choose the cooperative actions \( a_{1,N - 1}^{{\rm INF}x} \) are
$$ b_{x,N - 1}^{ * } = \frac{{\rho_{2,N - 1}^{{\rm INF}x} - \rho_{1,N - 1}^{{\rm INF}x} }}{{2c_{x,N - 1} \left( {\rho_{1,N - 1}^{\rm SUP} - \rho_{2,N - 1}^{\rm SUP} } \right)}}. $$

Proof see Appendix.

The result of Theorem 1 is surprising: although all agents INFx affect the expected reward of agent SUP, only data from the respective infimal agent enter the optimal share coefficient. Furthermore, neither the agents’ prior states nor the transition probabilities affect the result, because at the optimal incentive level \( b_{x,N - 1}^{ * } \) agents INFx are indifferent between their actions (while still weakly preferring the cooperative action) and the transition probability terms cancel out.
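As a quick illustration, Theorem 1 can be evaluated with the parameter values of the numerical example in Section 5. The following is a minimal sketch, not the author's implementation; the function name and argument names are hypothetical.

```python
def final_period_share(rho_inf_1, rho_inf_2, rho_sup_1, rho_sup_2, c):
    """Cost-minimal share coefficient of the final period (Theorem 1).

    rho_inf_1, rho_inf_2: rewards of agent INFx in states 1 and 2
    rho_sup_1, rho_sup_2: rewards of agent SUP in states 1 and 2
    c: change coefficient c_{x,N-1} of agent INFx
    """
    delta_rho_inf = rho_inf_2 - rho_inf_1  # INFx initially prefers state 2
    delta_rho_sup = rho_sup_1 - rho_sup_2  # SUP prefers state 1
    return delta_rho_inf / (2.0 * c * delta_rho_sup)


# Data of the numerical example: R^SUP = (60, 5), R^INF1 = (1, 3), c = 0.15
b_final = final_period_share(1.0, 3.0, 60.0, 5.0, 0.15)
print(f"b*_(1,N-1) = {b_final:.2%}")
```

Note that no transition probability and no prior state appears among the arguments, which mirrors the observation made above.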

Next, we will investigate whether agent SUP is willing to give the share \( b_{x,N - 1}^{ * } \) of its reward to agents INFx in return for their cooperation. The so-called participation condition checks if agent SUP receives a higher expected reward if an incentive according to \( b_{x,N - 1}^{ * } \) is offered to agents INFx. Mathematically, the participation condition is
$$ \begin{gathered} E\left( {r_{{\rm final}\left( t \right)}^{\rm SUP} \left| {_{{\left( {b_{x,t} } \right)_{1, \ldots ,X} }} s_{i,t}^{\rm SUP} ,\left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} ,\ a_{1,t}^{\rm SUP} , \ a_{1,t}^{{\rm INF}x} ,\left( {a_{n\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,x - 1,x + 1, \ldots X} } \right.} \right) \ge \hfill \\ E\left( {r_{{\rm final}(t)}^{\rm SUP} \left| {_{{b_{x,t} = 0,\left( {b_{x,t} } \right)_{1, \ldots ,x - 1,x + 1, \ldots ,X} }} s_{i,t}^{\rm SUP} ,\left( {s_{j\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,X} ,\ a_{1,t}^{\rm SUP} , \ a_{2,t}^{{\rm INF}x} ,\left( {a_{n\left( x \right),t}^{{\rm INF}x} } \right)_{1, \ldots ,x - 1,x + 1, \ldots X} } \right.} \right) \hfill \\ \end{gathered} $$
Solving this inequality results in
$$ b_{x,N - 1} \le \frac{{2c_{x,N - 1} \times \left( {\rho_{1,N - 1}^{\rm SUP} - \rho_{2,N - 1}^{\rm SUP} } \right) \times \left( {\alpha_{1.1,N - 1}^{{\rm INF}x} + \alpha_{1.2,N - 1}^{{\rm INF}x} - 1} \right)}}{{\left( {\rho_{1,N - 1}^{\rm SUP} - \rho_{2,N - 1}^{\rm SUP} } \right) \times \left( {c_{x,N - 1} \times \left( {2\alpha_{1.1,N - 1}^{INF} - 1} \right) + \alpha_{1.1,SUP}^{\rm SUP} } \right) + \rho_{2,N - 1}^{\rm SUP} }} $$
An incentive level that induces cooperation only exists if \( b_{x,N - 1}^{*} \) satisfies (24). In the final section of this paper, we will illustrate this participation condition with a numerical example. Next, we determine the share coefficients for the earlier periods.

Theorem 2

The cost-minimal share coefficients \( b_{x,t} \) for \( t = 1, \ldots ,N - 2 \) that motivate agents INFx to choose the cooperative action \( a_{1,t}^{{\rm INF}x} \) depend only on data from the current period t and the next period \( t + 1 \). The share coefficient is
$$ b_{x,t}^{*} = \frac{{\rho_{2,t}^{{\rm INF}x} - \rho_{1,t}^{{\rm INF}x} }}{{2c_{x,t} \left( {\rho_{1,t}^{\rm SUP} - \rho_{2,t}^{\rm SUP} } \right)}} - \frac{{\left( {\alpha_{1.1,t + 1}^{\rm SUP} - \alpha_{2.1,t + 1}^{\rm SUP} } \right) \times \left( {\rho_{2,t + 1}^{{\rm INF}x} - \rho_{1,t + 1}^{{\rm INF}x} } \right)}}{{2c_{x,t + 1} \left( {\rho_{1,t}^{\rm SUP} - \rho_{2,t}^{\rm SUP} } \right)}}\,\,{\text{for}}\,\,t = 1, \ldots ,N - 2 $$

Proof see Appendix.

Theorem 2 shows that agents have to look merely one period ahead to determine their optimal decisions. Surprisingly, periods further into the future do not affect the current decisions. This fact can be explained by the anticipated choices of optimal share coefficients \( b_{x,t}^{ * } \) in future periods. The optimal share coefficients make agents INFx indifferent between both of their actions in period t, i.e., the expected rewards are the same. Consequently, rewards of periods beyond the next period do not play a role in the decision process at the current epoch. For share coefficients \( b_{x,t} \ne b_{x,t}^{ * } \) this is not the case, and data from all future periods are decision-relevant. We will show this effect in the next section, where agent SUP selects one share coefficient value for the entire time horizon.
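The one-period look-ahead of Theorem 2 can be sketched in a few lines. The code below is an illustrative sketch under the notation of the theorem; the function and argument names are hypothetical, and the value 0.2 for the difference \( \alpha_{1.1}^{\rm SUP} - \alpha_{2.1}^{\rm SUP} \) is read off the transition matrices of the numerical example in Section 5 (0.8 minus 0.6).

```python
def earlier_period_share(d_rho_inf_t, d_rho_inf_next, d_rho_sup_t,
                         d_alpha_sup_next, c_t, c_next):
    """Cost-minimal share coefficient b*_{x,t} for t = 1, ..., N-2 (Theorem 2).

    d_rho_inf_t, d_rho_inf_next: rho_2 - rho_1 of agent INFx in periods t, t+1
    d_rho_sup_t: rho_1 - rho_2 of agent SUP in period t
    d_alpha_sup_next: alpha_{1.1} - alpha_{2.1} of agent SUP in period t+1
    c_t, c_next: change coefficients of agent INFx in periods t and t+1
    """
    current_term = d_rho_inf_t / (2.0 * c_t * d_rho_sup_t)
    look_ahead_term = (d_alpha_sup_next * d_rho_inf_next) / (2.0 * c_next * d_rho_sup_t)
    return current_term - look_ahead_term


# Time-invariant example data: delta rho^INF = 2, delta rho^SUP = 55, c = 0.15
b_early = earlier_period_share(2.0, 2.0, 55.0, 0.2, 0.15, 0.15)
print(f"b*_(1,t) = {b_early:.2%}")
```

Only quantities indexed by t and t + 1 enter the function, which is the information requirement stated in the theorem.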

4.2 Multi-time-scale decision-making

In the multi-time-scale case, agent SUP commits to one type of action and one share coefficient value that applies to all periods. This case resembles a strategic decision. Agent SUP does not have the flexibility to change its initial decision. Another way to interpret this situation is that agent SUP makes one decision that controls the Markov chain over subsequent periods. In our example, the manager either calls or sends a letter, which affects the final state specifying contract renewal or no renewal.

The time-invariant share coefficient is denoted by \( b_{x,1..N - 1} \) and the time-invariant action by \( a_{m,1..N - 1}^{\rm SUP} \), with the subscript indicating that they apply to periods 1 through \( N - 1 \). The multi-time-scale aspect does not affect agent SUP’s preferred decision, which is still the action with index 1, i.e., \( a_{1,1..N - 1}^{\rm SUP} \).

To determine the optimal share coefficients, we first calculate intermediate, auxiliary share coefficient values in each period. We introduce an auxiliary share coefficient \( b_{x,[t]} \), which is the basis for selecting the optimal value \( b_{x,1..N - 1}^{ * } \) for each agent INFx. Agent SUP then chooses the largest share \( b_{x,[t]} \) among the auxiliary share coefficients over the time horizon to ensure that each agent INFx chooses the cooperative action in every period, i.e.,
$$ b_{x,1..N - 1}^{ * } = \mathop {\hbox{max} }\limits_{t = 1, \ldots ,N - 1} \left[ {b_{x,[t]} } \right] $$
For the following theorem, we introduce a shorthand for the reward and transition probability differences, which simplifies the representation of results:
$$ \Updelta \rho_{t}^{\rm SUP} = \rho_{1,t}^{\rm SUP} - \rho_{2,t}^{\rm SUP} ,\quad \Updelta \rho_{t}^{{\rm INF}x} = \rho_{2,t}^{{\rm INF}x} - \rho_{1,t}^{{\rm INF}x} $$
$$ \Updelta \alpha_{t}^{\rm SUP} = \alpha_{1.1,t}^{\rm SUP} - \alpha_{2.1,t}^{\rm SUP} ,\quad \Updelta \alpha_{t}^{{\rm INF}x} = \alpha_{1.1,t}^{{\rm INF}x} - \alpha_{2.1,t}^{{\rm INF}x} $$

Theorem 3

The auxiliary share coefficient\( b_{x,[t]} \), which is required in (26) to determine the optimal share coefficient\( b_{x,1..N - 1}^{ * } \), can be calculated as follows:
$$ b_{x,[t]} = \frac{1}{2} \times \frac{{{\text{num}}_{x,t} }}{{{\text{den}}_{x,t} }}\quad \hbox{for}\quad t = 1, \ldots ,N - 1 $$
where numerator (\( {\text{num}}_{x,t} \)) and denominator (\( {\text{den}}_{x,t} \)) are computed recursively as follows:
$$ {\text{num}}_{x,N - 1} = \Updelta \rho_{N - 1}^{{\rm INF}x} $$
$$ {\text{num}}_{x,t} = \Updelta \alpha_{t + 1}^{{\rm INF}x} \times {\text{num}}_{x,t + 1} + \Updelta \rho_{t}^{{\rm INF}x} \quad \hbox{for}\quad t = 1, \ldots ,N - 2 $$
$$ {\text{den}}_{x,N - 1} = \Updelta \rho_{N - 1}^{\rm SUP} \times c_{x,N - 1} $$
$$ \begin{gathered} {\text{den}}_{x,t} = \Updelta \alpha_{t + 1}^{{\rm INF}x} \times {\text{den}}_{x,t + 1} + c_{x,t} \times \left( {\Updelta \rho_{t}^{\rm SUP} + \Updelta \rho_{t + 1}^{\rm SUP} \times \Updelta \alpha_{t + 1}^{{\rm INF}x} } \right. + \Updelta \rho_{t + 2}^{\rm SUP} \times \Updelta \alpha_{t + 2}^{{\rm INF}x} \times \Updelta \alpha_{t + 1}^{{\rm INF}x} + \cdots + \hfill \\ \left. {\Updelta \rho_{N - 1}^{\rm SUP} \times \Updelta \alpha_{N - 1}^{{\rm INF}x} \times \Updelta \alpha_{N - 2}^{{\rm INF}x} \times \cdots \times \Updelta \alpha_{t + 1}^{{\rm INF}x} } \right) \hfill \\ = \Updelta \alpha_{t + 1}^{{\rm INF}x} \times {\text{den}}_{x,t + 1} + c_{x,t} \times \left[ {\Updelta \rho_{t}^{\rm SUP} + \sum\limits_{\tau = t + 1}^{N - 1} {\left( {\Updelta \rho_{\tau }^{\rm SUP} \times \prod\limits_{{{\rm T} = t + 1}}^{\tau } {\Updelta \alpha_{{\rm T}}^{{\rm INF}x} } } \right)} } \right]\,\,{\text{for}}\,t = 1, \ldots ,N - 2 \hfill \\ \end{gathered} $$

Proof see Appendix.

The results of Theorem 3 show that in order to make decisions in the multi-time-scale case, agents SUP and INFx need information from all future periods. In contrast, for the single-time-scale case (Theorem 2), agents INFx only needed information from the current and next period. Besides the effect of higher information needs, there is a second effect associated with the multi-time-scale model: agent SUP has to pay a larger share of its reward to incentivize agents INFx. In the single-time-scale case, agent SUP had more flexibility to adjust incentives, which allowed agent SUP to retain more reward for itself.
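The backward recursion of Theorem 3 can be sketched as follows. This is an illustrative sketch rather than the author's code: the function name and list-based data layout are hypothetical, and the sum in the denominator is read as running over \( \tau = t + 1, \ldots ,N - 1 \), with each term weighted by the product \( \Updelta \alpha_{t + 1}^{{\rm INF}x} \times \cdots \times \Updelta \alpha_{\tau }^{{\rm INF}x} \). This reading is an assumption, chosen because it reproduces the auxiliary values reported in Table 1.

```python
def auxiliary_shares(d_rho_inf, d_rho_sup, d_alpha_inf, c):
    """Auxiliary share coefficients b_{x,[t]} for t = 1, ..., N-1 (Theorem 3).

    Each argument is a list over periods 1..N-1 (list index 0 is period 1):
    d_rho_inf   -- rho_2 - rho_1 of agent INFx
    d_rho_sup   -- rho_1 - rho_2 of agent SUP
    d_alpha_inf -- alpha_{1.1} - alpha_{2.1} of agent INFx
    c           -- change coefficients c_{x,t}
    """
    n = len(d_rho_inf)
    num = [0.0] * n
    den = [0.0] * n
    # Terminal conditions for the last period N-1
    num[-1] = d_rho_inf[-1]
    den[-1] = d_rho_sup[-1] * c[-1]
    # Work backwards from period N-2 to period 1
    for t in range(n - 2, -1, -1):
        num[t] = d_alpha_inf[t + 1] * num[t + 1] + d_rho_inf[t]
        weighted_sum = d_rho_sup[t]
        weight = 1.0
        for tau in range(t + 1, n):
            weight *= d_alpha_inf[tau]  # running product of probability differences
            weighted_sum += d_rho_sup[tau] * weight
        den[t] = d_alpha_inf[t + 1] * den[t + 1] + c[t] * weighted_sum
    return [0.5 * nu / de for nu, de in zip(num, den)]


# Time-invariant data of the numerical example (four periods, N = 5)
shares = auxiliary_shares([2.0] * 4, [55.0] * 4, [0.2] * 4, [0.15] * 4)
b_multi = max(shares)  # time-invariant share coefficient chosen by agent SUP
print([f"{b:.2%}" for b in shares], f"{b_multi:.2%}")
```

In contrast to the single-time-scale sketch, every period's data enters the computation of each auxiliary value, and agent SUP then commits to the maximum over all periods.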

In the final step, we determine if agent SUP wants to offer the optimal share coefficients \( b_{x,1..N - 1}^{ * } > 0 \) that result in cooperative behavior by agents INFx, or rather not pay an incentive. Not paying an incentive allows agent SUP to retain all of its rewards, but results in non-cooperative behavior by agents INFx. Agent SUP commits to share coefficients \( b_{x,1..N - 1}^{ * } > 0 \) according to (26) and Theorem 3, if its participation condition is met. Agent SUP decides on the level of incentive in the first period and informs agents INFx of its selection of the share coefficients. The participation condition can be determined as a closed-form analytic result similar to (24), but the size of the equation is too large to present effectively. Still, the participation condition can be easily calculated and evaluated for specific data using mathematical software. The following section provides a numerical example.

5 Numerical example

To provide further insights and apply the results of the model, we analyze a numerical example. The example illustrates the difference between the single-time-scale and the multi-time-scale model. In addition, we compare the expected rewards of cooperative infimal agents to the no-cooperation case, where agent SUP does not offer incentives and agents INFx choose non-cooperative actions.

The example is based on the following time-invariant data of a four-period interaction (\( N = 5 \)) between agent SUP and one infimal agent.
$$ R_{t}^{\rm SUP} = \left( {\begin{array}{*{20}c} {60} \\ 5 \\ \end{array} } \right),\quad R_{t}^{INF1} = \left( {\begin{array}{*{20}c} 1 \\ 3 \\ \end{array} } \right), $$
$$ P_{t}^{y} \left( {s_{1,t}^{y} } \right) = \left( {\begin{array}{*{20}c} {0.8} & {0.2} \\ {0.4} & {0.6} \\ \end{array} } \right),\quad P_{t}^{y} \left( {s_{2,t}^{y} } \right) = \left( {\begin{array}{*{20}c} {0.6} & {0.4} \\ {0.2} & {0.8} \\ \end{array} } \right) $$
$$ C_{t} = c_{1,t} = 0.15 $$
for \( t = 1,2,3,4 \), \( y \in \left\{ {SUP,INF1} \right\} \). We assume that agents are in states with index 1 in the first period, i.e., \( s_{1,1}^{\rm SUP} \) and \( s_{1,1}^{INF1} \). Table 1 shows the results of the analysis.
Table 1 Numerical results of analysis

$$ \begin{array}{llll}
 & \text{Single-time-scale} & \text{Multi-time-scale} & \text{No cooperation} \\
 & b_{1,1} = 9.70\,\% & b_{1,[1]} = 9.75\,\% & b_{1,1} = 0 \\
 & b_{1,2} = 9.70\,\% & b_{1,[2]} = 9.89\,\% & b_{1,2} = 0 \\
 & b_{1,3} = 9.70\,\% & b_{1,[3]} = 10.39\,\% & b_{1,3} = 0 \\
 & b_{1,4} = 12.12\,\% & b_{1,1..4}^{ * } = b_{1,[4]} = 12.12\,\% & b_{1,4} = 0 \\
E\left( {r_{{\rm final}(1)}^{\rm SUP} | \cdot } \right) & & & \\
E\left( {r_{{\rm final}(1)}^{{\rm INF}x} | \cdot } \right) & 27.39 & 31.21 & 9.63 \\
\end{array} $$

A number of observations can be made based on these results. First, agent SUP is better off choosing the optimal, non-zero share coefficients in both the multi-time-scale and single-time-scale case. These incentives induce cooperative behavior by agent INF1 since its expected reward is higher than if it did not cooperate. In other words, the participation condition is met for the chosen data. Furthermore, agent INF1 benefits significantly from the incentive and its cooperative actions: its cumulative expected reward roughly triples from 9.63 without cooperation to 27.39 in the single-time-scale case and 31.21 in the multi-time-scale case.
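The no-cooperation figure for agent INF1 can be checked directly, because with a share coefficient of zero the agent's reward depends only on its own Markov chain. The sketch below makes two assumptions that are not spelled out in this section: rows of each transition matrix index the action taken and columns index the next state, and the reward of a period is the one attached to the state reached at the end of that period. All names are hypothetical.

```python
# Transition matrices of the numerical example, keyed by the current state.
# Assumed layout: row = action index (1, 2), column = next state (1, 2).
P_FROM_STATE = {
    1: [[0.8, 0.2], [0.4, 0.6]],
    2: [[0.6, 0.4], [0.2, 0.8]],
}
REWARD_INF1 = {1: 1.0, 2: 3.0}  # rewards of agent INF1 in states 1 and 2


def cumulative_reward_fixed_action(start_state, action, periods):
    """Expected cumulative reward when the same action is taken every period."""
    dist = {1: 0.0, 2: 0.0}
    dist[start_state] = 1.0
    total = 0.0
    for _ in range(periods):
        nxt = {1: 0.0, 2: 0.0}
        for state, prob in dist.items():
            row = P_FROM_STATE[state][action - 1]
            nxt[1] += prob * row[0]
            nxt[2] += prob * row[1]
        # The reward of the period is tied to the state that is reached
        total += sum(nxt[s] * REWARD_INF1[s] for s in (1, 2))
        dist = nxt
    return total


# Agent INF1 starts in state 1 and takes the non-cooperative action 2 in all four periods
no_coop = cumulative_reward_fixed_action(start_state=1, action=2, periods=4)
print(round(no_coop, 2))
```

Under these assumptions the result rounds to the 9.63 quoted above. The cooperative cases additionally require agent SUP's final transition probabilities and are therefore not covered by this sketch.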

The second observation is that in the single-time-scale case, the share coefficient values are identical for periods 1, 2, and 3. The values are identical since the data of all three periods is the same. The share coefficient in the fourth and last period is larger than the prior ones. This deviation can be explained as follows: in earlier periods, agent INF1 benefited from transitioning with a higher probability to its preferred state. Thus, agent INF1 received a benefit beyond the current period’s reward, which allowed for a lower share coefficient as compensation. In the last period, this effect no longer exists, hence the larger share coefficient of 12.12 % compared to 9.70 %.

The third observation is that in the multi-time-scale model, the auxiliary share coefficient values differ in each period. The last period requires the largest share coefficient value of 12.12 %, which is applied to all periods. With data that is different for different periods, any period, not just the last, could have been the period with the highest auxiliary share coefficient.

Extending the example from one to multiple infimal agents yields similar results, since agents INFx, their actions, and incentives are independent from one another. However, the cumulative influence \( C_{t} \) must not violate condition (10), which implies that individual change coefficients \( c_{x,t} \) will have to be smaller for a larger number of infimal agents. The consequence of smaller change coefficients is that incentives for agents INFx need to increase to motivate cooperation, while the upper limit agent SUP is willing to pay for incentives (participation condition) goes down. Thus, the likelihood of cooperation decreases if rewards remain the same.

We conducted a sensitivity analysis with respect to the agents’ rewards. We found that reward difference \( \Updelta \rho_{t}^{\rm SUP} \) has to be multiple times larger than \( \Updelta \rho_{t}^{{\rm INF}x} \) to result in situations in which agent SUP benefits from paying the necessary incentives that motivate cooperative behavior from agents INFx. This effect intensifies as the number of infimal agents increases. This means that incentives will only be offered when the difference in payoffs between preferred and not preferred outcomes for the superior is sufficiently larger than that of their subordinates. For example, for \( \Updelta \rho_{t}^{\rm SUP} \le 40 \) instead of 55 (without any other changes) or for \( \Updelta \rho_{t}^{{\rm INF}x} \ge 5 \) instead of 2, agent SUP prefers to not offer an incentive.

Furthermore, infimal agents’ actions need to have a significant effect on agent SUP via change coefficient \( C_{t} \). In the context of our example, this means that the financial benefits of customer contract renewal must be larger than the cost of providing customer care for the manager to offer incentives to her customer representatives. Furthermore, customer care must have a sufficiently strong influence on the chance of contract renewal. Otherwise, the manager will prefer not to pay an incentive and not to motivate cooperative behavior. For example, a reduction of the change coefficient to \( c_{1,t} \le 0.12 \) from 0.15 would result in no incentive payments.

The advantage of analytical solutions for sensitivity analysis is that one can readily see from equations which data influence the results and how. By differentiating the share coefficient equations and participation conditions with respect to model parameters, one can analytically describe the effect of changes in data. For an example of a comprehensive sensitivity analysis of multiscale decision making models see Wernz and Deshmukh (2007a) and Henry and Wernz (2013).
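As a brief illustration of this point, consider only the final-period result of Theorem 1, written with the shorthand \( \Updelta \rho \) introduced in Section 4.2. Assuming \( c_{x,N - 1} > 0 \) and positive reward differences, as implied by (13), differentiation gives the direction of each effect:

$$ \frac{{\partial b_{x,N - 1}^{ * } }}{{\partial c_{x,N - 1} }} = - \frac{{\Updelta \rho_{N - 1}^{{\rm INF}x} }}{{2c_{x,N - 1}^{2} \,\Updelta \rho_{N - 1}^{\rm SUP} }} < 0,\quad \frac{{\partial b_{x,N - 1}^{ * } }}{{\partial \Updelta \rho_{N - 1}^{\rm SUP} }} = - \frac{{\Updelta \rho_{N - 1}^{{\rm INF}x} }}{{2c_{x,N - 1} \left( {\Updelta \rho_{N - 1}^{\rm SUP} } \right)^{2} }} < 0,\quad \frac{{\partial b_{x,N - 1}^{ * } }}{{\partial \Updelta \rho_{N - 1}^{{\rm INF}x} }} = \frac{1}{{2c_{x,N - 1} \,\Updelta \rho_{N - 1}^{\rm SUP} }} > 0. $$

With the data of the numerical example, lowering \( c_{1,t} \) from 0.15 to 0.12 raises the required final-period share from \( 2/16.5 \approx 12.12\,\% \) to \( 2/13.2 \approx 15.15\,\% \).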

6 Discussion and conclusions

We developed a multi-time-scale decision-making model that allows hierarchically interacting agents to make optimal decisions over many periods. The supremal agent, agent SUP, chooses one type of decision and one level of incentive that is fixed over the entire decision horizon. If the participation condition is met, i.e., if profitable, agent SUP will offer a share of its pay (reward) to motivate cooperative actions by agents INFx. We determined the optimal levels of incentives, which take into account multi-period agent interdependencies. In the context of the motivational example, this means that the account manager can determine a cost-minimal incentive level that motivates cooperative behavior by customer representatives, which can resolve the representatives’ conflicts of interest.

Furthermore, we showed the difference between a multi-time-scale model and a single-time-scale model. In the single-time-scale case, agent SUP (the manager) can change its decisions and level of incentives from period to period. The advantage of this flexibility is that the optimal incentive levels and decisions can be determined with a subset of data from the current and next period. In contrast, in the multi-time-scale case, where agent SUP makes a strategic decision that applies to all periods, it needs information from all periods to find an optimal decision. In addition, agent SUP has to pay more, i.e., offer higher incentive levels, to obtain cooperative behavior from its infimal agents. The comparison of the single- and multi-time-scale models shows the cost of strategic management and the benefits of flexibility in terms of rewards and data requirements.

The paper contributes to the unification of the temporal and organizational scale in multi-agent, multi-period models. Furthermore, the paper provides an alternative formulation for multi-time-scale MDPs. Most results in this paper can be described analytically as opposed to being described only numerically. The advantage of analytic solutions is that agents know which information they need in order to make optimal decisions, the effect of parameters and the sensitivity of results can be readily determined, and optimal decisions can be calculated with little computational effort.

In the description of the example problem, we listed a number of questions that can now be answered.
  • 1. How high does the incentive for the representatives have to be to switch from the initially preferred action of being strict to the cooperative action of being lenient? Theorems 1–3 answer this question by providing equations that calculate the minimum level of incentive necessary to motivate cooperative action. In the numerical example, the incentive that induces cooperative action from customer representatives, expressed in terms of the share of the manager’s reward, was 9.70 % in periods 1–3 and 12.12 % in the final period of the single-time-scale case, and 12.12 % in all periods of the multi-time-scale case.

  • 2. What is the maximum share of her pay the manager is willing to offer before the cost of the incentives exceeds the benefits? The participation condition formulated in (23) answers this question. For the single-time-scale case, equation (24) is the analytic solution for the maximum incentive level in the last decision epoch. For earlier periods, and for the multi-time-scale case, results can be obtained for specific data, as shown in the numerical example. The sensitivity analysis showed that for changes in parameters, such as \( \Updelta \rho_{t}^{\rm SUP} \le 40 \), \( \Updelta \rho_{t}^{{\rm INF}x} \ge 5 \), or \( c_{1,t} \le 0.12 \), the participation condition is no longer met.

    With an increasing number of customer representatives (agents INFx) that each individually exert a smaller influence on customers’ decisions to renew their contracts, the manager’s propensity to offer incentives goes down. The managerial implication of this insight is that incentives should be paid to small teams that effectively support their superiors’ work. Once the team grows larger and the individual’s contribution becomes smaller, incentives are no longer cost-effective for superiors.

  • 3. What role do the transition probabilities that describe the link between actions and outcomes play? How can the manager and the representatives compute their optimal decisions? What information do they need to determine an optimal decision? The transition probabilities in the single-time-scale case play only a limited role in determining optimal incentives. For the incentives in the final period, transition probabilities do not affect the result (Theorem 1). For earlier periods, only the probabilities at the manager level (agent SUP) are needed to determine the optimal incentive levels (Theorem 2). This implies that customer representatives (agents INFx) can determine their optimal actions using partial or incomplete information. In particular, the incentive payments are independent of the action-outcome uncertainties (transition probabilities) at the customer representatives’ level. In contrast, the manager needs data on all model parameters to determine whether she wants to pay the optimal levels of incentives. Furthermore, all data are needed by all agents in the multi-time-scale case. However, aggregated data in the form of reward and probability differences are sufficient; see (27)–(28).

  • 4. How do the temporal dynamics of representatives interacting multiple times with a customer before contract renewal affect all of these aspects? The difference in decision frequency is captured by the multi-time-scale model. When compared to the single-time-scale model, incentives for customer representatives have to be larger in the multi-time-scale case. This results primarily from the inflexibility of the manager to adjust the incentives from period to period. Consequently, customer representatives receive higher incentives and the manager has to share a higher percentage of her reward with her subordinates.

The multi-time-scale model considered only one decision horizon at the level of agent SUP (manager). An expansion to multiple decision horizons, i.e., a long-term perspective on interactions of manager, representatives and customers, is possible. Each decision horizon is coupled to the next only by the factor of whether a customer has renewed their contract or not. The analysis can thus be broken down into individual decision horizons and merely the initial states would differ. To explore a long or infinite decision horizon, the limit behavior of model parameters on optimal share coefficients would provide insights. Without much analysis, one can see that the effect of far-out periods on current periods is small, getting smaller with more temporal distance.

Future work should explore the effect of more hierarchical levels, more outcome states, and more actions. Wernz and Deshmukh (2012) modeled a 3-level, 3-period model, which illustrates the complexity of adding just one level in a multi-time-scale case. Still, a multi-level expansion that uses an algorithmic rule to capture all possible permutations and automatically analyzes all interaction scenarios seems possible. Sudhaakar and Wernz (2013) explored the effect of multiple actions and outcomes for a multi-level, one-period model. They showed that with more actions and outcomes, the transition probabilities, which previously were not necessary, are now needed to calculate the incentive levels. In a multi-period extension of this work, we expect to lose the ability to present the results as analytic solutions, but solutions based on data using analytical methods would still be possible.



Acknowledgments

The author thanks the editor-in-chief Ahti Salo, the anonymous editor and the two anonymous referees for handling this paper and for providing constructive comments and suggestions for improvement. This research has been funded in part by NSF grant CMMI-1335407 and by the Virginia Tech Institute for Critical Technology and Applied Science (ICTAS) grant JFC-11-130.


References

  1. Bellman RE (1957) Dynamic programming. Princeton University Press, Princeton
  2. Besanko D (1985) Multi-period contracts between principal and agent with adverse selection. Econ Lett 17(1–2):33–37
  3. Bhatnagar S, Panigrahi JR (2006) Actor-critic algorithms for hierarchical Markov decision processes. Automatica 42(4):637–644
  4. Bowling M, Veloso M (2000) An analysis of stochastic game theory for multiagent reinforcement learning. School of Computer Science, Carnegie Mellon University, Pennsylvania
  5. Chang HS (2004) A model for multi-timescaled sequential decision-making processes with adversary. Math Comput Model Dyn Syst 10(3–4):287–302. doi:10.1080/13873950412331335261
  6. Chang HS, Fard PJ, Marcus SI, Shayman M (2003) Multitime scale Markov decision processes. IEEE Trans Autom Control 48(6):976–987
  7. Delebecque F, Quadrat JP (1981) Optimal control of Markov chains admitting strong and weak interactions. Automatica 17(2):281–296
  8. Dolgov D, Durfee E (2004) Graphical models in local, asymmetric multi-agent Markov decision processes. In: Proceedings of the third international joint conference on autonomous agents and multiagent systems, vol 2, pp 956–963
  9. Filar JA, Vrieze K (1996) Competitive Markov decision processes. Springer, New York
  10. Goseva-Popstojanova K, Trivedi KS (2000) Stochastic modeling formalisms for dependability, performance and performability. In: Haring G, Lindemann C, Reiser M (eds) Performance evaluation: origins and directions. Lecture Notes in Computer Science. Springer, New York, pp 403–422
  11. Hauskrecht M, Meuleau N, Kaelbling LP, Dean T, Boutilier C (1998) Hierarchical solution of Markov decision processes using macro-actions. In: Proceedings of the fourteenth conference on uncertainty in artificial intelligence, University of Wisconsin Business School, Madison, WI, July 24–26, pp 220–229
  12. Henry A, Wernz C (2013) Revenue-sharing in a three-stage supply chain with uncertainty: a multiscale decision theory approach (under review)
  13. Howard RA (1960) Dynamic programming and Markov processes. MIT Press, Cambridge, MA
  14. Jacobson M, Shimkin N, Shwartz A (2003) Markov decision processes with slow scale periodic decisions. Math Oper Res 28(4):777–800
  15. Marschak J (1955) Elements for a theory of teams. Manag Sci 1(2):127–137
  16. Marschak J, Radner R (1972) Economic theory of teams. Yale University Press, New Haven
  17. Mertens JF (1992) Stochastic games. In: Aumann RJ, Hart S (eds) Handbook of game theory with economic applications. North-Holland, Amsterdam
  18. Mesarovic MD, Macko D, Takahara Y (1970) Theory of hierarchical, multilevel systems. Academic Press, New York
  19. Milgrom PR, Roberts J (1992) Economics, organization and management. Prentice Hall, Englewood Cliffs, NJ
  20. Muppala JK, Malhotra M, Trivedi KS (1996) Markov dependability models of complex systems: analysis techniques. In: Özekici S (ed) Reliability and maintenance of complex systems. Springer, Berlin, pp 442–486
  21. Panigrahi JR, Bhatnagar S (2004) Hierarchical decision making in semiconductor fabs using multi-time scale Markov decision processes. In: Proceedings of IEEE conference on decision and control, Paradise Island, Nassau, Bahamas, 2004. IEEE, pp 4387–4392
  22. Parr RE (1998) Hierarchical control and learning for Markov decision processes. PhD thesis, University of California, Berkeley
  23. Plambeck EL, Zenios SA (2000) Performance-based incentives in a dynamic principal-agent model. Manuf Serv Oper Manag 2(3):240–263
  24. Pollatschek MA, Avi-Itzhak B (1969) Algorithms for stochastic games with geometrical interpretation. Manag Sci 15(7):399–415
  25. Puterman ML (1994) Markov decision processes: discrete stochastic dynamic programming. Wiley, New York
  26. Radner R (1992) Hierarchy: the economics of managing. J Econ Lit 30(3):1382–1415
  27. Raghavan TES, Filar JA (1991) Algorithms for stochastic games: a survey. Math Methods Oper Res (ZOR) 35(6):437–472
  28. Savage LJ (1954) The foundations of statistics. Wiley, New York
  29. Schneeweiss C (1995) Hierarchical structures in organizations: a conceptual framework. Eur J Oper Res 86(1):4–31
  30. Schneeweiss C (2003a) Distributed decision making. Springer, Berlin
  31. Schneeweiss C (2003b) Distributed decision making: a unified approach. Eur J Oper Res 150(2):237–252
  32. Sethi SP, Zhang Q (1994) Hierarchical decision making in stochastic manufacturing systems. Birkhäuser, Basel
  33. Shapley LS (1953) Stochastic games. Proc Natl Acad Sci 39(10):1095–1100
  34. Simon HA (1957) Models of man. Part IV. Wiley, New York
  35. Simon HA (1972) Theories of bounded rationality. In: McGuire C, Radner R (eds) Decision and organization. North-Holland, Amsterdam, pp 161–176
  36. Sudhaakar S, Wernz C (2013) Advancing multiscale decision theory: from 2 to N decision alternatives. Working Paper
  37. Sutton RS (1995) TD models: modeling the world at a mixture of time scales. In: Proceedings of the twelfth international conference on machine learning, Tahoe City, CA, July 9–12, 1995, vol 95, pp 531–539
  38. Sutton RS, Precup D, Singh S (1999) Between MDPs and semi-MDPs: a framework for temporal abstraction in reinforcement learning. Artif Intell 112(1–2):181–211
  39. Weber M (1978) Economy and society. University of California Press, Berkeley
  40. Wernz C (2008) Multiscale decision making: bridging temporal and organizational scales in hierarchical systems. Dissertation, University of Massachusetts Amherst
  41. Wernz C, Deshmukh A (2009) An incentive-based, multi-period decision model for hierarchical systems. In: Proceedings of the 3rd annual conference of the Indian subcontinent decision sciences institute region (ISDSI), Hyderabad, India, 2009
  42. Wernz C, Deshmukh A (2007a) Decision strategies and design of agent interactions in hierarchical manufacturing systems. J Manuf Syst 26(2):135–143
  43. Wernz C, Deshmukh A (2007b) Managing hierarchies in a flat world. In: Proceedings of the 2007 industrial engineering research conference, Nashville, TN, pp 1266–1271
  44. Wernz C, Deshmukh A (2010a) Multi-time-scale decision making for strategic agent interactions. In: Proceedings of the 2010 industrial engineering research conference, Cancun, Mexico
  45. Wernz C, Deshmukh A (2010b) Multiscale decision-making: bridging organizational scales in systems with distributed decision-makers. Eur J Oper Res 202(3):828–840
  46. Wernz C, Deshmukh A (2012) Unifying temporal and organizational scales in multiscale decision-making. Eur J Oper Res 223(3):739–751
  47. Wernz C, Henry A (2009) Multilevel coordination and decision-making in service operations. Service Sci 1(4):270–283
  48. Williamson OE (1967) Hierarchical control and optimum firm size. J Political Econ 75(2):123–138
  49. Williamson OE (1970) Corporate control and business behavior. Prentice Hall, Englewood Cliffs, NJ
  50. Williamson OE (1979) Transaction-cost economics: the governance of contractual relations. J Law Econ 22(2):233–261. doi:10.2307/725118
  51. Wongthatsanekorn W, Realff MJ, Ammons JC (2010) Multi-time scale Markov decision process approach to strategic network growth of reverse supply chains. Omega 38(1–2):20–32
  52. Yeow W-L, Tham C-K, Wong W-C (2005) A novel target movement model and energy efficient target tracking in sensor networks. In: IEEE 61st Vehicular Technology Conference, 2005, pp 2825–2829
  53. Yeow W-L, Tham C-K, Wong W-C (2007) Energy efficient multiple target tracking in wireless sensor networks. IEEE Trans Veh Technol 56(2):918–928
  54. Yin G, Zhang Q (1998) Continuous-time Markov chains and applications: a singular perturbation approach. Springer, Berlin
  55. Yin G, Zhang Q (2005) Discrete-time Markov chains: two-time-scale methods and applications. Springer, Berlin
  56. Zachrisson LE (1964) Markov games. In: Advances in game theory. Ann Math Stud 52:211–253
  57. Zhu C, Zhou J, Wu W, Mo L (2006) Hydropower portfolios management via Markov decision process. In: IECON 2006, 32nd annual conference on IEEE industrial electronics, 2006, pp 2883–2888

Copyright information

© Springer-Verlag Berlin Heidelberg and EURO - The Association of European Operational Research Societies 2013

Authors and Affiliations

  1. Grado Department of Industrial and Systems Engineering, Virginia Tech, Blacksburg, USA
