We solve the joint-to-person association using a densely connected graphical model as in [2]. The model proposed in [2], however, aims to resolve joint-to-person associations together with proposal labeling globally for all persons, which makes it very expensive to solve. In contrast, we propose to solve this problem locally for each person. We first briefly summarize the DeepCut method [2] in Sect. 5.1, and then describe the proposed local joint-to-person association model in Sect. 5.2.
5.1 DeepCut
DeepCut aims to solve the problem of multi-person human pose estimation by jointly modeling the poses of all persons appearing in an image. Given an image, it starts by generating a set D of joint proposals, where \(\mathbf {x}_d \in \mathbb {Z}^2\) denotes the 2D location of the \(d^{th}\) proposal. The proposals are then used to formulate a graph optimization problem that aims to select a subset of proposals while suppressing the incompatible proposals, label each selected proposal with a joint type \(j \in \{1 \dots J\}\), and associate them to unique individuals.
The problem can be solved by integer linear programming (ILP), optimizing over the binary variables \(x \in \{0,1\}^{D \times J}\), \(y \in \{0,1\}^{\left( {\begin{array}{c}D\\ 2\end{array}}\right) }\), and \(z \in \{0,1\}^{\left( {\begin{array}{c}D\\ 2\end{array}}\right) \times J^2}\). For every proposal d, a set of variables \(\{x_{dj}\}_{j = 1 \dots J}\) is defined where \(x_{dj} = 1\) indicates that the proposal d is of body joint type j. For every pair of proposals \(dd'\), the variable \(y_{dd'}\) indicates that the proposals d and \(d'\) belong to the same person. The variable \(z_{dd'jj'} = 1\) indicates that the proposal d is of joint type j, the proposal \(d'\) is of joint type \(j'\), and both proposals belong to the same person \((y_{dd'} = 1)\). The variable \(z_{dd'jj'}\) is constrained such that \(z_{dd'jj'} = x_{dj}x_{d'j'}y_{dd'}\). The solution of the ILP problem is obtained by optimizing the following objective function:
$$\begin{aligned} \min _{(x,y,z) \in X_{D}} \left\langle \alpha , x \right\rangle + \left\langle \beta , z \right\rangle \end{aligned}$$
(3)
subject to
$$\begin{aligned} \forall d \in D~\forall jj' \in \left( {\begin{array}{c}J\\ 2\end{array}}\right) :&\quad x_{dj} + x_{dj'} \le 1 \end{aligned}$$
(4)
$$\begin{aligned} \forall dd' \in \left( {\begin{array}{c}D\\ 2\end{array}}\right) :&\quad y_{dd'} \le \sum _{j \in J} x_{dj}, \quad y_{dd'} \le \sum _{j \in J} x_{d'j} \end{aligned}$$
(5)
$$\begin{aligned} \forall dd'd'' \in \left( {\begin{array}{c}D\\ 3\end{array}}\right) :&\quad y_{dd'} + y_{d'd''} - 1 \le y_{dd''} \end{aligned}$$
(6)
$$\begin{aligned} \forall dd' \in \left( {\begin{array}{c}D\\ 2\end{array}}\right) ~\forall jj' \in J^2 :&\quad x_{dj} + x_{d'j'} + y_{dd'} - 2 \le z_{dd'jj'} \nonumber \\&\quad z_{dd'jj'} \le \min (x_{dj}, x_{d'j'}, y_{dd'}) \end{aligned}$$
(7)
and, optionally,
$$\begin{aligned} \forall dd' \in \left( {\begin{array}{c}D\\ 2\end{array}}\right) ~\forall jj' \in J^2 :&\quad x_{dj} + x_{d'j'} -1 \le y_{dd'} \end{aligned}$$
(8)
where
$$\begin{aligned} \alpha _{dj}&= \log \dfrac{1-p_{dj}}{p_{dj}} \end{aligned}$$
(9)
$$\begin{aligned} \beta _{dd'jj'}&= \log \dfrac{1-p_{dd'jj'}}{p_{dd'jj'}} \end{aligned}$$
(10)
$$\begin{aligned} \left\langle \alpha , x \right\rangle&= \sum _{d \in D} \sum _{j \in J} \alpha _{dj} x_{dj} \end{aligned}$$
(11)
$$\begin{aligned} \left\langle \beta , z \right\rangle&= \sum _{dd' \in \left( {\begin{array}{c}D\\ 2\end{array}}\right) } \sum _{j,j' \in J} \beta _{dd'jj'} z_{dd'jj'}. \end{aligned}$$
(12)
The constraints (4)–(7) enforce that optimizing (3) results in valid body pose configurations for one or more persons. The constraints (4) ensure that a proposal d can be labeled with only one joint type, while the constraints (5) guarantee that any pair of proposals \(dd'\) can belong to the same person only if both are not suppressed, i.e., \(x_{dj} = 1\) and \(x_{d'j'} = 1\) for some joint types j and \(j'\). The constraints (6) are transitivity constraints and enforce for any three proposals \(dd'd'' \in \left( {\begin{array}{c}D\\ 3\end{array}}\right) \) that if d and \(d'\) belong to the same person, and \(d'\) and \(d''\) also belong to the same person, then the proposals d and \(d''\) must also belong to the same person. The constraints (7) enforce that for any \(dd' \in \left( {\begin{array}{c}D\\ 2\end{array}}\right) \) and \(jj' \in J^2\), \(z_{dd'jj'} = x_{dj}x_{d'j'}y_{dd'}\). The constraints (8) are only applicable for single-person human pose estimation, as they enforce that two proposals \(dd'\) that are not suppressed must be grouped together. In (9), \(p_{dj} \in (0,1)\) are the body joint unaries and correspond to the probability of a proposal d being of joint type j. In (10), \(p_{dd'jj'}\) corresponds to the conditional probability that a pair of proposals \(dd'\) belongs to the same person, given that d and \(d'\) are of joint type j and \(j'\), respectively. In [2], this ILP formulation is referred to as the Subset Partitioning and Labelling Problem, as it partitions the initial pool of proposal candidates into unique individuals, labels each proposal with a joint type j, and inherently suppresses the incompatible candidates.
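To make the role of the constraints concrete, the following minimal sketch checks whether a candidate assignment satisfies (4)–(6). It is only an illustration and not the solver used in [2]: the function name `deepcut_feasible`, the nested-list layout of x, and the dictionary layout of y are assumptions made for this example. The transitivity constraint (6) is checked for every ordering of each triple, which is how we read it here. The variables z need no separate check, since (7) fixes them to the product \(x_{dj}x_{d'j'}y_{dd'}\).

```python
import itertools


def deepcut_feasible(x, y):
    """Check constraints (4)-(6) for a candidate assignment.

    x[d][j] is 1 if proposal d is labelled with joint type j, else 0.
    y[(d, e)] with d < e is 1 if proposals d and e belong to the same person.
    (Hypothetical data layout, chosen for this illustration only.)
    """
    n = len(x)

    def same(a, b):
        # y is stored once per unordered pair
        return y[(min(a, b), max(a, b))]

    # (4): a proposal carries at most one joint type
    if any(sum(row) > 1 for row in x):
        return False

    # (5): two proposals can be grouped only if neither is suppressed
    for d, e in itertools.combinations(range(n), 2):
        if y[(d, e)] > sum(x[d]) or y[(d, e)] > sum(x[e]):
            return False

    # (6): transitivity, checked for every ordering of each triple
    for a, b, c in itertools.permutations(range(n), 3):
        if same(a, b) + same(b, c) - 1 > same(a, c):
            return False

    return True


# Three proposals, two joint types; proposal 2 is suppressed.
x = [[1, 0], [0, 1], [0, 0]]
y = {(0, 1): 1, (0, 2): 0, (1, 2): 0}
print(deepcut_feasible(x, y))
```

Grouping the suppressed proposal 2 with proposal 0 would violate (5), and labelling proposal 0 with both joint types would violate (4).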
5.2 Local Joint-to-Person Association
In contrast to [2], we solve the joint-to-person association problem locally for each person. We also do not label generic proposals as part of the ILP formulation since we use a neural network to obtain detections for each joint as described in Sect. 4. We therefore start with a set of joint detections \(D_J\), where every detection \(d_j\) at location \(\mathbf {x}_{d_j} \in \mathbb {Z}^2\) has a known joint type \(j \in \{1 \dots J\}\). Our model requires only two types of binary random variables, \(x \in \{0,1\}^{D_J}\) and \(y \in \{0,1\}^{\left( {\begin{array}{c}D_J\\ 2\end{array}}\right) }\). Here, \(x_{d_j} = 1\) indicates that the detection \(d_j\) of part type j is not suppressed, and \(y_{{d_j}{{d'}_{j'}}} = 1\) indicates that the detection \(d_j\) of type j, and the detection \({d'}_{j'}\) of type \(j'\) belong to the same person. The objective function for local joint-to-person association takes the form:
$$\begin{aligned} \min _{(x,y) \in X_{D_J}} \left\langle \alpha , x \right\rangle + \left\langle \beta , y \right\rangle \end{aligned}$$
(13)
subject to
$$\begin{aligned} \forall {d_j}{{d'}_{j'}} \in \left( {\begin{array}{c}D_J\\ 2\end{array}}\right) :&\quad y_{{d_j}{{d'}_{j'}}} \le x_{d_j}, \quad y_{{d_j}{{d'}_{j'}}} \le x_{d'_{j'}} \end{aligned}$$
(14)
$$\begin{aligned} \forall {d_j}{d'_{j'}}{d''_{j''}} \in \left( {\begin{array}{c}D_J\\ 3\end{array}}\right) :&\quad y_{{d_j}{d'_{j'}}} + y_{{d'_{j'}}{d''_{j''}}} - 1 \le y_{{d_j}{{d''}_{j''}}} \end{aligned}$$
(15)
$$\begin{aligned} \forall {d_j}{d'_{j'}} \in \left( {\begin{array}{c}D_J\\ 2\end{array}}\right) :&\quad x_{d_j} + x_{d'_{j'}} -1 \le y_{{d_j}{{d'}_{j'}}} \end{aligned}$$
(16)
where
$$\begin{aligned} \alpha _{d_j}&= \log \dfrac{1-p_{d_j}}{p_{d_j}} \end{aligned}$$
(17)
$$\begin{aligned} \beta _{{d_j}{d'_{j'}}}&= \log \dfrac{1-p_{{d_j}{d'_{j'}}}}{p_{{d_j}{d'_{j'}}}} \end{aligned}$$
(18)
$$\begin{aligned} \left\langle \alpha , x \right\rangle&= \sum _{d_j \in D_J} \alpha _{d_j} x_{d_j} \end{aligned}$$
(19)
$$\begin{aligned} \left\langle \beta , y \right\rangle&= \sum _{{d_j}{d'_{j'}} \in \left( {\begin{array}{c}D_J\\ 2\end{array}}\right) } \beta _{{d_j}{d'_{j'}}} y_{{d_j}{d'_{j'}}}. \end{aligned}$$
(20)
The constraints (14) enforce that detection \(d_j\) and \({d'_{j'}}\) are connected \((y_{{d_j}{{d'}_{j'}}} = 1)\) only if both are not suppressed, i.e., \(x_{d_j} = 1\) and \(x_{d'_{j'}} = 1\). The constraints (15) are transitivity constraints as before and the constraints (16) guarantee that all detections that are not suppressed belong to the primary person. We can see from (3)–(8) and (13)–(16) that the number of variables is reduced from \(({D \times J}+{\left( {\begin{array}{c}D\\ 2\end{array}}\right) }+{\left( {\begin{array}{c}D\\ 2\end{array}}\right) \times J^2})\) to \(({D_J}+{\left( {\begin{array}{c}D_J\\ 2\end{array}}\right) })\). Similarly, the number of constraints is also drastically reduced.
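As an illustration of the local model, the sketch below evaluates the two variable counts and solves (13) by exhaustive enumeration for a toy instance. It is not the ILP solver used in our experiments; the function names, the data layout, and all numbers are hypothetical. The enumeration uses the observation that (14) and (16) together force \(y_{{d_j}{d'_{j'}}} = x_{d_j}x_{d'_{j'}}\), so that (15) holds automatically and only x needs to be enumerated.

```python
import itertools
import math


def deepcut_variable_count(num_proposals, num_joint_types):
    """Number of binary variables x, y and z in the DeepCut ILP (3)."""
    pairs = math.comb(num_proposals, 2)
    return num_proposals * num_joint_types + pairs + pairs * num_joint_types ** 2


def local_variable_count(num_detections):
    """Number of binary variables x and y in the local model (13)."""
    return num_detections + math.comb(num_detections, 2)


def logit_cost(p):
    """Cost log((1 - p) / p) as in (17) and (18); negative when p > 0.5."""
    return math.log((1.0 - p) / p)


def solve_local_association(p_unary, p_pair):
    """Minimise (13) subject to (14)-(16) by enumerating all x.

    p_unary: detection probabilities, each strictly inside (0, 1).
    p_pair:  dict mapping index pairs (a, b), a < b, to pairwise probabilities.
    Only practical for a handful of detections (2**n candidates).
    """
    n = len(p_unary)
    pairs = list(itertools.combinations(range(n), 2))
    alpha = [logit_cost(p) for p in p_unary]
    beta = {pair: logit_cost(p_pair[pair]) for pair in pairs}

    best_cost, best_x, best_y = math.inf, None, None
    for x in itertools.product((0, 1), repeat=n):
        # (14) and (16) together force y_ab = x_a * x_b
        y = {(a, b): x[a] * x[b] for a, b in pairs}
        cost = sum(alpha[d] * x[d] for d in range(n))
        cost += sum(beta[pair] * y[pair] for pair in pairs)
        if cost < best_cost:
            best_cost, best_x, best_y = cost, x, y
    return best_cost, best_x, best_y


# Hypothetical sizes: 100 proposals and 14 joint types versus 20 detections.
print(deepcut_variable_count(100, 14), local_variable_count(20))

# Toy instance: detections 0 and 1 are compatible, detection 2 fits neither.
p_unary = [0.9, 0.8, 0.7]
p_pair = {(0, 1): 0.9, (0, 2): 0.1, (1, 2): 0.2}
cost, x, y = solve_local_association(p_unary, p_pair)
print(x, round(cost, 3))
```

In the toy instance, detection 2 is suppressed although its unary cost is negative, because the positive pairwise costs towards detections 0 and 1 outweigh it.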
In (17), \(p_{d_j} \in (0,1)\) is the confidence of the joint detection \(d_j\), expressed as a probability. We obtain this directly from the score maps inferred by the CPM as \(p_{d_j} = \mathrm {f}_{\tau }(s^j_T(\mathbf {x}_{d_j}))\), where
$$\begin{aligned} \mathrm {f}_{\tau }(s) = {\left\{ \begin{array}{ll} s &{} \text {if} \quad s \ge \tau \\ 0 &{} \text {otherwise,} \end{array}\right. } \end{aligned}$$
(21)
and \(\tau \) is a threshold that suppresses detections with a low confidence score.
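The thresholding in (21) and its effect on the unary cost (17) can be sketched as follows. The function names, the score values, and the threshold are hypothetical. Note that a thresholded score of 0 makes the logarithm in (17) unbounded; treating this case as an infinite cost (and a score of 1 as an infinitely negative cost) is an assumption of this sketch, not something specified above.

```python
import math


def threshold_confidence(score, tau):
    """f_tau from (21): keep the score if it reaches tau, otherwise return 0."""
    return score if score >= tau else 0.0


def unary_cost(p):
    """alpha from (17).

    Assumption of this sketch: p = 0 maps to +inf and p = 1 to -inf,
    so that the logarithm is never evaluated outside its domain.
    """
    if p <= 0.0:
        return math.inf
    if p >= 1.0:
        return -math.inf
    return math.log((1.0 - p) / p)


scores = [0.92, 0.15, 0.40]  # hypothetical score-map values at three detections
tau = 0.2                    # hypothetical threshold
probs = [threshold_confidence(s, tau) for s in scores]
costs = [unary_cost(p) for p in probs]
print(probs)
print(costs)
```

A detection with a confident score receives a negative cost and is therefore cheap to keep, whereas a detection below the threshold can never be selected at finite cost.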
In (18), \(p_{{d_j}{d'_{j'}}} \in (0,1)\) corresponds to the conditional probability that the detection \(d_j\) of joint type j and the detection \(d'_{j'}\) of joint type \(j'\) belong to the same person. For \( j = j' \), it is the probability that both detections \(d_j\) and \(d'_{j'}\) belong to the same body joint. For \(j \ne j'\), it measures the compatibility between two detection candidates of different joint types. Similar to [2], we obtain these probabilities by learning discriminative models based on appearance and spatial features of the detection candidates. For \(j = j'\), we define a feature vector
$$\begin{aligned} f_{{d_j}{d'_{j'}}} = \{\bigtriangleup \mathbf {x}, \exp (\bigtriangleup \mathbf {x}), (\bigtriangleup \mathbf {x})^2\}, \end{aligned}$$
(22)
where \(\bigtriangleup \mathbf {x} = (\bigtriangleup u, \bigtriangleup v)\) is the 2D offset between the locations \(\mathbf {x}_{d_j}\) and \(\mathbf {x}_{d'_{j'}}\). For \(j \ne j'\), we define a separate feature vector based on the spatial locations as well as the appearance features obtained from the joint detectors as
$$\begin{aligned} f_{{d_j}{d'_{j'}}} = \{\bigtriangleup \mathbf {x}, ||\bigtriangleup \mathbf {x}||, \arctan \left( \dfrac{\bigtriangleup v }{\bigtriangleup u}\right) , \mathbf {s}_{T}(\mathbf {x}_{d_j}), \mathbf {s}_{T}(\mathbf {x}_{{d'}_{j'}}) \}, \end{aligned}$$
(23)
where \(\mathbf {s}_{T}(\mathbf {x})\) is a vector containing the confidences of all joints and the background at location \(\mathbf {x}\). For both cases, we gather positive and negative samples from the annotated poses in the training data and train an SVM with an RBF kernel using LibSVM [30] for each pair \(jj' \in \left( {\begin{array}{c}J\\ 2\end{array}}\right) \). To obtain the probabilities \(p_{{d_j}{d'_{j'}}} \in (0,1)\), we apply Platt scaling [31] to the outputs of the SVMs. After optimizing (13), the pose of the primary person is given by the detections with \(x_{d_j}=1\).
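The feature vectors (22) and (23) and the Platt sigmoid can be sketched as below. Several details are assumptions of this sketch rather than statements about our implementation: the offset is taken as \(\mathbf {x}_{d'_{j'}} - \mathbf {x}_{d_j}\), the exponential and the square in (22) are applied element-wise, offsets are assumed to be scaled so that the exponential does not overflow, and the angle for \(\bigtriangleup u = 0\) is set to its limit value. The sigmoid parameters a and b stand for the values fitted by Platt scaling; all function names are hypothetical.

```python
import math


def offset(loc_a, loc_b):
    """2D offset (du, dv) from loc_a to loc_b (sign convention assumed)."""
    return loc_b[0] - loc_a[0], loc_b[1] - loc_a[1]


def same_type_features(loc_a, loc_b):
    """Feature vector (22) for two detections of the same joint type."""
    du, dv = offset(loc_a, loc_b)
    return [du, dv, math.exp(du), math.exp(dv), du ** 2, dv ** 2]


def offset_angle(du, dv):
    """arctan(dv / du) as in (23); the du = 0 case is an assumed limit value."""
    if du == 0:
        return math.copysign(math.pi / 2, dv) if dv != 0 else 0.0
    return math.atan(dv / du)


def different_type_features(loc_a, loc_b, scores_a, scores_b):
    """Feature vector (23): offset, its norm and angle, and both score vectors."""
    du, dv = offset(loc_a, loc_b)
    return [du, dv, math.hypot(du, dv), offset_angle(du, dv), *scores_a, *scores_b]


def platt_probability(decision_value, a, b):
    """Platt sigmoid 1 / (1 + exp(a * f + b)) applied to an SVM output f."""
    return 1.0 / (1.0 + math.exp(a * decision_value + b))


# Hypothetical detections with two-element score vectors (one joint, background).
print(same_type_features((0.0, 0.0), (1.0, 2.0)))
print(different_type_features((0.0, 0.0), (3.0, 4.0), [0.1, 0.9], [0.2, 0.8]))
print(platt_probability(2.0, -1.0, 0.0))
```

With a negative value of a, larger SVM decision values map to probabilities closer to 1, which then enter the pairwise cost (18).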