Because problems (4) and (5) are convex, any local solution is also globally optimal. However, naïvely solving these problems is computationally intractable because of the high dimensionality of \(\varvec{m}\). In this section, we introduce several useful rules for restricting the candidate subgraphs while maintaining the optimality of the final solution. Note that the proofs of all lemmas and theorems are provided in the appendix.
To make the optimization problem tractable, we work with only a small subset of features during the optimization process. Let \({{\mathcal {F}}}\subseteq [p]\) be a subset of features. By fixing \(m_i = 0\) for \(i \notin {{\mathcal {F}}}\), we define sub-problems of the original primal \(P_{\lambda }\) and dual \(D_{\lambda }\) problems as follows:
$$\begin{aligned}&\min _{{{\varvec{m}}}_{{\mathcal {F}}}\ge {\varvec{0}}} P_{\lambda }^{{\mathcal {F}}}({{\varvec{m}}}_{{\mathcal {F}}}) :=\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\ell _L({{\varvec{m}}}_{{\mathcal {F}}}^\top {{\varvec{c}}_{il}}_{{\mathcal {F}}}) +\sum _{j\in {\mathcal {S}}_i}\ell _{-U}(-{{\varvec{m}}}_{{\mathcal {F}}}^\top {{\varvec{c}}_{ij}}_{{\mathcal {F}}}) \right] +\lambda R({{\varvec{m}}}_{{\mathcal {F}}}), \end{aligned}$$
(9)
$$\begin{aligned}&\max _{{\varvec{\alpha }}\ge \varvec{0}} D_\lambda ^{{\mathcal {F}}}({\varvec{\alpha }}) :=-\frac{1}{4}\Vert {\varvec{\alpha }}\Vert _2^2+{\varvec{t}}^\top {\varvec{\alpha }} -\frac{\lambda \eta }{2}\Vert {\varvec{m}}_{\lambda } ({\varvec{\alpha }})_{{\mathcal {F}}}\Vert _2^2, \end{aligned}$$
(10)
where \(\varvec{m}_{{{\mathcal {F}}}}\), \(\varvec{c}_{ij{{\mathcal {F}}}}\), and \(\varvec{m}_{\lambda } ({\varvec{\alpha }})_{{\mathcal {F}}}\) are the sub-vectors specified by \({{\mathcal {F}}}\). If the size of \({{\mathcal {F}}}\) is moderate, these sub-problems are computationally much easier to solve than the original problems.
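For concreteness, the following sketch evaluates the sub-problem objective \(P_{\lambda }^{{\mathcal {F}}}\). The squared-hinge losses and the elastic-net-style regularizer used here are our assumptions, chosen to be consistent with the dual form (10), and not necessarily the exact definitions of (4):

```python
import numpy as np

def ell(z, t):
    """Assumed squared-hinge loss [t - z]_+^2, consistent with the
    -||alpha||^2/4 term in the dual (10)."""
    return max(t - z, 0.0) ** 2

def sub_primal(m_F, c_F, D, S, L, U, lam, eta):
    """P_lambda^F of (9): m_i is fixed to 0 for i outside F.

    c_F[(i, j)]: sub-vector of c_{ij} restricted to F (numpy array);
    D[i] / S[i]: dissimilar / similar pair lists of example i."""
    obj = 0.0
    for i in D:
        for l in D[i]:
            obj += ell(m_F @ c_F[(i, l)], L)      # ell_L(m^T c_il)
        for j in S[i]:
            obj += ell(-(m_F @ c_F[(i, j)]), -U)  # ell_{-U}(-m^T c_ij)
    # assumed elastic-net-style R(m) = ||m||_1 + (eta/2)||m||_2^2
    return obj + lam * (np.abs(m_F).sum() + 0.5 * eta * (m_F @ m_F))
```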
We introduce several criteria that determine whether feature k should be included in \({{\mathcal {F}}}\), using the techniques of safe screening (Ghaoui et al. 2010) and working set selection (Fan et al. 2008). A general form of our criteria can be written as
$$\begin{aligned} {\varvec{C}}_{k,:}{\varvec{q}}+r\Vert {\varvec{C}}_{k,:}\Vert _2 \le T, \end{aligned}$$
(11)
where \({\varvec{q}}\in {\mathbb {R}}_+^{2nK}\), \(r\ge 0\), and \(T \in {\mathbb {R}}\) are constants that assume different values for each criterion. If this inequality holds for k, we exclude the k-th feature from \({{\mathcal {F}}}\). An important property is that although our algorithm only solves these small sub-problems, we can guarantee the optimality of the final solution, as shown later.
However, selecting \({{\mathcal {F}}}\) itself is computationally expensive because evaluating (11) requires O(n) computations for each k. Thus, we exploit a tree structure over subgraphs to determine \({{\mathcal {F}}}\). Figure 1 shows an example of such a tree, which can be constructed by a graph mining algorithm such as gSpan (Yan and Han 2002). Suppose that the k-th node corresponds to the k-th dimension of \(\varvec{x}\) (note that the node index here is not the order of the visit). If the graph corresponding to the k-th node is a subgraph of the graph corresponding to the \(k'\)-th node, the node \(k'\) is a descendant of k, denoted as \(k' \supseteq k\). Then, the following monotonic relation is immediately derived from the monotonicity of \(\phi _H\):
$$\begin{aligned} x_{i,k'} \le x_{i,k} \text { if } k' \supseteq k. \end{aligned}$$
(12)
Because any parent node is a subgraph of its children in the gSpan tree (Fig. 1), the non-overlapping frequency \(\#(H \sqsubseteq G)\) of a subgraph H in G is monotonically non-increasing as we descend the tree. The condition (12) then follows immediately: for a sequence \(H \sqsubseteq H' \sqsubseteq H'' \sqsubseteq \cdots \) along a descending path of the tree, \(x_{i,k(H)} = \phi _H(G_i) = g(\#(H \sqsubseteq G_i))\) is monotonically non-increasing, where \(x_{i,k(H)}\) is the feature corresponding to H in \(G_i\). Based on this property, the following lemma enables us to prune a node during the tree traversal.
Lemma 1
Let
$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r)&:=\sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}q_{il}\max \{x_{i,k},x_{l,k}\}^2 \nonumber \\&\quad +r\sqrt{\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\}^4 +\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}^4\right] } \end{aligned}$$
(13)
be a pruning criterion. Then, if the inequality
$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r) \le T \end{aligned}$$
(14)
holds, for any descendant node \(k' \supseteq k\), the following inequality holds:
$$\begin{aligned} {\varvec{C}}_{k',:}{\varvec{q}}+r\Vert {\varvec{C}}_{k',:}\Vert _2 \le T, \end{aligned}$$
where \({\varvec{q}}\in {\mathbb {R}}_+^{2nK}\) and \(r\ge 0\) are an arbitrary constant vector and scalar variable, respectively.
This lemma indicates that if the condition (14) is satisfied, none of the descendant nodes need to be included in \({{\mathcal {F}}}\). Moreover, when the indicator function \(g(x) = 1_{x>0}\) is used in (2), a tighter bound can be obtained (Lemma 2 below).
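To make the use of Lemma 1 concrete, here is a minimal sketch (the `Node` class and data layout are our assumptions): the traversal evaluates \(\mathrm{Prune}(k|\varvec{q},r)\) at each node, skips the whole subtree when (14) holds, and otherwise applies the node-wise criterion (11) to k itself before descending.

```python
import numpy as np

class Node:
    """Hypothetical mining-tree node; children correspond to supergraphs."""
    def __init__(self, k, children=()):
        self.k, self.children = k, list(children)

def prune_crit(k, X, D, S, q, r):
    """Prune(k | q, r) of Eq. (13); X[i, k] = phi_H(G_i), q indexed by pairs."""
    first, inner = 0.0, 0.0
    for i in D:
        for l in D[i]:
            m = max(X[i, k], X[l, k])
            first += q[(i, l)] * m ** 2
            inner += m ** 4
        for j in S[i]:
            inner += max(X[i, k], X[j, k]) ** 4
    return first + r * np.sqrt(inner)

def traverse(node, X, D, S, q, r, T, rule_holds, F):
    """DFS over the mining tree with pruning (Lemma 1); rule_holds(k)
    evaluates the exact node-wise criterion (11)."""
    if prune_crit(node.k, X, D, S, q, r) <= T:
        return                      # (14) holds: skip the entire subtree
    if not rule_holds(node.k):      # criterion (11) for k itself
        F.append(node.k)
    for c in node.children:
        traverse(c, X, D, S, q, r, T, rule_holds, F)
```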
Lemma 2
If \(g(x) = 1_{x>0}\) is set in (2), the pruning criterion (13) can be replaced with
$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r)&:=\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, x_{i,k} \left[ \sum _{l\in {\mathcal {D}}_i}q_{il}-\sum _{j\in {\mathcal {S}}_i}q_{ij}(1-x_{j,k}) \right] \right\} \\&\quad +r\sqrt{\sum _{i\in [n]}\left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\} +\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}\right] }. \end{aligned}$$
By comparing the first terms of Lemmas 1 and 2, we see that Lemma 2 is tighter when \(g(x) = 1_{x>0}\) as follows:
$$\begin{aligned}&\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, x_{i,k} \left[ \sum _{l\in {\mathcal {D}}_i}q_{il}-\sum _{j\in {\mathcal {S}}_i}q_{ij}(1-x_{j,k})\right] \right\} \\&\quad \le \sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, x_{i,k} \sum _{l\in {\mathcal {D}}_i}q_{il}\right\} \\&\quad =\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}q_{il}x_{l,k}, \sum _{l\in {\mathcal {D}}_i}q_{il}x_{i,k}\right\} \\&\quad \le \sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}\max \{q_{il}x_{l,k}, q_{il}x_{i,k}\}\\&\quad =\sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}q_{il}\max \{x_{l,k}, x_{i,k}\}. \end{aligned}$$
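Under the same assumptions as above, the tighter criterion of Lemma 2 is a drop-in replacement for `prune_crit` when the features are binary:

```python
import numpy as np

def prune_crit_binary(k, X, D, S, q, r):
    """Prune(k | q, r) of Lemma 2 for g(x) = 1_{x>0}; X is 0/1-valued."""
    first, inner = 0.0, 0.0
    for i in D:
        t1 = sum(q[(i, l)] * X[l, k] for l in D[i])
        t2 = X[i, k] * (sum(q[(i, l)] for l in D[i])
                        - sum(q[(i, j)] * (1.0 - X[j, k]) for j in S[i]))
        first += max(t1, t2)
        inner += sum(max(X[i, k], X[l, k]) for l in D[i])  # max^4 = max on {0,1}
        inner += sum(max(X[i, k], X[j, k]) for j in S[i])
    return first + r * np.sqrt(inner)
```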
A schematic illustration of the optimization algorithm for IGML is shown in Fig. 2 (for further details, see Sect. 5). To generate a subset of features \({\mathcal {F}}\), we first traverse the graph mining tree, during which the safe screening/working set selection procedures and their pruning extensions are performed (Step 1). Next, we solve the sub-problem (9) with the generated \({\mathcal {F}}\) using a standard gradient-based algorithm (Step 2). Safe screening is also performed during the optimization iterations in Step 2, which is referred to as dynamic screening; this further reduces the size of \({\mathcal {F}}\).
Before moving on to the detailed formulations, we summarize our rules for determining \({{\mathcal {F}}}\) in Table 1. The columns represent the different approaches to evaluating the necessity of features, i.e., the safe and working set approaches. For the safe approaches, there are further ‘single \(\lambda \)’ (described in Sect. 4.1.2) and ‘range of \(\lambda \)’ (described in Sect. 4.1.3) variants. The single \(\lambda \) approach considers safe rules for a specific \(\lambda \), while the range of \(\lambda \) approach considers safe rules that can eliminate features for a range of \(\lambda \) (not just a specific value). Both are based on bounds of the region in which the optimal solution exists, detailed in Sect. 4.1.1. The rows of Table 1 distinguish rules that remove one specific feature from rules that prune all features in a subtree.
Table 1 Strategies to determine \({{\mathcal {F}}}\)

Safe screening
Safe screening (Ghaoui et al. 2010) was first proposed to identify unnecessary features in LASSO-type problems. Typically, this approach considers a bounded region of the dual variables in which the optimal solution must exist; we can then eliminate dual inequality constraints that are never violated as long as the solution lies in that region. By the well-known Karush-Kuhn-Tucker (KKT) conditions, this is equivalent to eliminating primal variables that take the value 0 at the optimal solution. In Sect. 4.1.1, we first derive a spherical bound containing our optimal solution, and a rule for safe screening is then shown in Sect. 4.1.2. Section 4.1.3 extends these rules to a form that is specifically useful for the regularization path calculation.
Sphere bound for optimal solution
The following theorem provides a hyper-sphere containing the optimal dual variable \(\varvec{\alpha }^\star \).
Theorem 1
(DGB) For any pair of \({\varvec{m}}\ge {\varvec{0}}\) and \({\varvec{\alpha }}\ge {\varvec{0}}\), the optimal dual variable \({\varvec{\alpha }}^\star \) must satisfy
$$\begin{aligned} \Vert {\varvec{\alpha }}-{\varvec{\alpha }}^\star \Vert _2^2\le 4(P_\lambda ({\varvec{m}})-D_\lambda ({\varvec{\alpha }})). \end{aligned}$$
This bound is called the duality gap bound (DGB), and the parameters \(\varvec{m}\) and \(\varvec{\alpha }\) used to construct it are referred to as the reference solution. The inequality states that the optimal \(\varvec{\alpha }^{\star }\) must lie inside the sphere whose center is the reference solution \(\varvec{\alpha }\) and whose radius is \(2 \sqrt{P_\lambda ({\varvec{m}})-D_\lambda ({\varvec{\alpha }})}\), i.e., twice the square root of the duality gap. Therefore, the better the reference solution \(\varvec{m}\) and \(\varvec{\alpha }\), the tighter the bound. When the duality gap is zero, meaning that \(\varvec{m}\) and \(\varvec{\alpha }\) are optimal, the radius shrinks to zero.
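In code, constructing the DGB sphere requires only the current primal and dual objective values; a minimal sketch:

```python
import numpy as np

def dgb_sphere(alpha, primal_val, dual_val):
    """Theorem 1 (DGB): alpha* lies in the ball centered at the reference
    alpha with radius 2 * sqrt(duality gap)."""
    gap = max(primal_val - dual_val, 0.0)  # guard against round-off
    return alpha, 2.0 * np.sqrt(gap)
```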
If the optimal solution for \(\lambda _0\) is available as a reference solution to construct the bound for \(\lambda _1\), the following bound, called regularization path bound (RPB), can be obtained.
Theorem 2
(RPB) Let \({\varvec{\alpha }}_0^\star \) be the optimal solution for \(\lambda _0\) and \({\varvec{\alpha }}_1^\star \) be the optimal solution for \(\lambda _1\). Then,
$$\begin{aligned} \left\| {\varvec{\alpha }}_1^\star -\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\| _2^2 \le \left\| \frac{\lambda _0-\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\| _2^2. \end{aligned}$$
This inequality indicates that the optimal dual solution \(\varvec{\alpha }_1^{\star }\) for \(\lambda _1\) must lie in the sphere whose center is \(\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \) and whose radius is \(\left\| \frac{\lambda _0-\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0^\star \right\| _2\). However, RPB requires the exact solution \(\varvec{\alpha }_0^\star \), which is difficult to obtain in practice because of numerical errors. The relaxed RPB (RRPB) extends RPB so that an approximate solution can be used as the reference solution.
Theorem 3
(RRPB) Assuming that \({\varvec{\alpha }}_0\) satisfies \(\Vert {\varvec{\alpha }}_0-{\varvec{\alpha }}_0^\star \Vert _2\le \epsilon \), the optimal solution \({\varvec{\alpha }}_1^\star \) for \(\lambda _1\) must satisfy
$$\begin{aligned} \left\| {\varvec{\alpha }}_1^\star -\frac{\lambda _0+\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0\right\| _2 \le \left\| \frac{\lambda _0-\lambda _1}{2\lambda _0}{\varvec{\alpha }}_0\right\| _2+\Bigl (\frac{\lambda _0+\lambda _1}{2\lambda _0}+\frac{|\lambda _0-\lambda _1|}{2\lambda _0}\Bigr )\epsilon . \end{aligned}$$
In Theorem 3, the reference \(\varvec{\alpha }_0\) is only assumed to be within distance \(\epsilon \) of \(\varvec{\alpha }_0^\star \), rather than assuming that \(\varvec{\alpha }_0^\star \) itself is available. For example, \(\epsilon \) can be obtained using the DGB (Theorem 1).
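The RRPB sphere is likewise a direct transcription of Theorem 3; here \(\epsilon \) may be taken as the DGB radius of the reference solution:

```python
import numpy as np

def rrpb_sphere(alpha0, lam0, lam1, eps):
    """Theorem 3 (RRPB): sphere containing alpha*_1, built from a reference
    alpha0 for lam0 that satisfies ||alpha0 - alpha0*|| <= eps."""
    s = (lam0 + lam1) / (2.0 * lam0)
    d = abs(lam0 - lam1) / (2.0 * lam0)
    center = s * np.asarray(alpha0)
    radius = d * np.linalg.norm(alpha0) + (s + d) * eps
    return center, radius
```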
Similar bounds to those derived here were previously considered for the triplet screening of metric learning on usual numerical data (Yoshida et al. 2018, 2019b). Here, we extend a similar idea to derive subgraph screening.
Safe screening and safe pruning rules
Theorems 1 and 3 identify regions in which the optimal solution must lie, based on a current feasible solution \({\varvec{\alpha }}\). Further, from (6), when \({\varvec{C}}_{k,:}{\varvec{\alpha }}^\star \le \lambda \), we have \(m_k^\star =0\). This indicates that
$$\begin{aligned} \max _{{\varvec{\alpha }}\in {\mathcal {B}}}{\varvec{C}}_{k,:}{\varvec{\alpha }} \le \lambda \Rightarrow m_k^\star =0, \end{aligned}$$
(15)
where \({{\mathcal {B}}}\) is a region containing the optimal solution \({\varvec{\alpha }}^\star \), i.e., \({\varvec{\alpha }}^\star \in {\mathcal {B}}\). As we derived in Sect. 4.1.1, the sphere-shaped \({{\mathcal {B}}}\) can be constructed using feasible primal and dual solutions. By solving this maximization problem, we obtain the following safe screening (SS) rule.
Theorem 4
(SS Rule) If the optimal solution \({\varvec{\alpha }}^\star \) exists in the bound \({\mathcal {B}}=\{{\varvec{\alpha }} \mid \Vert {\varvec{\alpha }} -{\varvec{q}} \Vert _2^2\le r^2\}\), the following rule holds
$$\begin{aligned} {\varvec{C}}_{k,:}{\varvec{q}}+r\Vert {\varvec{C}}_{k,:}\Vert _2 \le \lambda \Rightarrow m_k^\star =0. \end{aligned}$$
(16)
Theorem 4 indicates that we can eliminate unnecessary features by evaluating the condition shown in (16). Here, the theorem is written in a general form; in practice, \(\varvec{q}\) and r are the center and radius, respectively, of one of the sphere bounds. An important property of this rule is that it guarantees optimality, meaning that the sub-problems (9) and (10) have exactly the same optimal solution as the original problems if \({{\mathcal {F}}}\) is defined through this rule. However, the rule still needs to be evaluated for all p features, which is intractable. To avoid this problem, we derive a pruning strategy on the graph mining tree, which we call the safe pruning (SP) rule.
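For reference, a naive dense evaluation of the SS rule is sketched below (a numpy sketch in which the matrix `C` stacks the rows \({\varvec{C}}_{k,:}\)); the SP rule that follows is what allows us to avoid this full scan over all p features.

```python
import numpy as np

def ss_survivors(C, q, r, lam):
    """SS rule (16) for every feature: rows with score <= lam are provably
    zero at the optimum and are dropped from F."""
    scores = C @ q + r * np.linalg.norm(C, axis=1)
    return np.where(scores > lam)[0]   # indices of surviving features
```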
Theorem 5
(SP Rule) If the optimal solution \({\varvec{\alpha }}^\star \) is in the bound \({\mathcal {B}}=\{{\varvec{\alpha }} \mid \Vert {\varvec{\alpha }} -{\varvec{q}} \Vert _2^2\le r^2, {\varvec{q}}\ge \varvec{0}\}\), the following rule holds
$$\begin{aligned} \mathrm {Prune}(k | {\varvec{q}}, r)\le \lambda \Rightarrow m_{k'}^\star =0 \text { for all } k' \supseteq k. \end{aligned}$$
(17)
This theorem is a direct consequence of Lemma 1. If this condition holds for a node k during the tree traversal, the subtree below that node can be pruned. This means that we can safely eliminate unnecessary subgraphs without even enumerating them. Note that in this theorem \({{\mathcal {B}}}\) has the additional non-negativity constraint \(\varvec{q} \ge \varvec{0}\); this is satisfied by all the bounds in Sect. 4.1.1 because of the non-negativity constraint in the dual problem.
Range-based safe screening and safe pruning
The SS and SP rules apply to a fixed \(\lambda \). The range-based extension identifies an interval of \(\lambda \) over which the satisfaction of SS/SP is guaranteed. This is particularly useful for path-wise optimization or the regularization path calculation, where the problem must be solved for a sequence of \(\lambda \) values. We assume that the sequence is sorted in descending order, as optimization algorithms typically start from the trivial solution \(\varvec{m} = \varvec{0}\). Let \(\lambda =\lambda _1\le \lambda _0\). By combining the RRPB with the rule (16), we obtain the following theorem.
Theorem 6
(Range-based Safe Screening (RSS)) For any k, the following rule holds
$$\begin{aligned} \lambda _a\le \lambda \le \lambda _0 \Rightarrow m_k^\star =0, \end{aligned}$$
(18)
where
$$\begin{aligned} \lambda _a :=\frac{\lambda _0(2\epsilon \Vert {\varvec{C}}_{k,:}\Vert _2 +\Vert {\varvec{\alpha }}_0\Vert _2\Vert {\varvec{C}}_{k,:}\Vert _2 +{\varvec{C}}_{k,:}{\varvec{\alpha }}_0)}{2\lambda _0+\Vert {\varvec{\alpha }}_0\Vert _2\Vert {\varvec{C}}_{k,:}\Vert _2 -{\varvec{C}}_{k,:}{\varvec{\alpha }}_0} . \end{aligned}$$
This rule indicates that we can safely ignore \(m_k\) for \(\lambda \in [\lambda _a, \lambda _0]\); if \(\lambda _a > \lambda _0\), the weight \(m_k\) cannot be removed by this rule. A range-based rule can also be derived for SP from (17), as shown in Theorem 7 below.
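The threshold \(\lambda _a\) is cheap to evaluate once \({\varvec{C}}_{k,:}{\varvec{\alpha }}_0\) and the norms are available; a minimal sketch of Theorem 6:

```python
import numpy as np

def rss_lambda_a(C_k, alpha0, lam0, eps):
    """Theorem 6 (RSS): m_k can be ignored for lambda in [lambda_a, lam0];
    the rule is vacuous when lambda_a > lam0."""
    nC, na = np.linalg.norm(C_k), np.linalg.norm(alpha0)
    Ca = float(C_k @ alpha0)
    return lam0 * (2 * eps * nC + na * nC + Ca) / (2 * lam0 + na * nC - Ca)
```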
Theorem 7
(Range-based Safe Pruning (RSP)) For any \(k' \supseteq k\), the following pruning rule holds:
$$\begin{aligned} \lambda '_a :=\frac{\lambda _0(2\epsilon b+\Vert {\varvec{\alpha }}_0\Vert _2b+a)}{2\lambda _0+\Vert {\varvec{\alpha }}_0\Vert _2b-a}\le \lambda \le \lambda _0 \Rightarrow m_{k'}^\star =0, \end{aligned}$$
(19)
where
$$\begin{aligned} a&:=\sum _{i\in [n]}\sum _{l\in {\mathcal {D}}_i}{\alpha _0}_{il}\max \{x_{l,k}, x_{i,k}\}^2,\\ b&:=\sqrt{\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\}^4+\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}^4 \right] }. \end{aligned}$$
This theorem indicates that for \(\lambda \in [\lambda _a',\lambda _0]\), we can safely remove the entire subtree rooted at k. Analogously, if the feature vector is generated by \(g(x) = 1_{x>0}\) (i.e., binary features), the following theorem holds.
Theorem 8
(Range-Based Safe Pruning (RSP) for binary features) Assuming \(g(x) = 1_{x>0}\) in (2), a and b in Theorem 7 can be replaced with
$$\begin{aligned} a&:=\sum _{i\in [n]}\max \left\{ \sum _{l\in {\mathcal {D}}_i}{\alpha _0}_{il}x_{l,k}, x_{i,k} \left[ \sum _{l\in {\mathcal {D}}_i}{\alpha _0}_{il}-\sum _{j\in {\mathcal {S}}_i}{\alpha _0}_{ij}(1-x_{j,k})\right] \right\} ,\\ b&:=\sqrt{\sum _{i\in [n]} \left[ \sum _{l\in {\mathcal {D}}_i}\max \{x_{i,k},x_{l,k}\}+\sum _{j\in {\mathcal {S}}_i}\max \{x_{i,k},x_{j,k}\}\right] }. \end{aligned}$$
Because these constants a and b are derived from the tighter bound in Lemma 2, the obtained range becomes wider than the range in Theorem 7.
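The constants a and b reuse the same quantities as the pruning criteria above; a sketch computing \(\lambda '_a\) for both the general (Theorem 7) and binary (Theorem 8) cases, with \(\varvec{\alpha }_0\) stored as a dictionary over pairs:

```python
import numpy as np

def rsp_lambda_a(k, X, D, S, alpha0, lam0, eps, binary=False):
    """Theorems 7/8 (RSP): the subtree of k is prunable for
    lambda in [lambda'_a, lam0]; alpha0 is a dict over pairs."""
    a, b_sq = 0.0, 0.0
    for i in D:
        if binary:  # tighter constants of Theorem 8 (Lemma 2)
            t1 = sum(alpha0[(i, l)] * X[l, k] for l in D[i])
            t2 = X[i, k] * (sum(alpha0[(i, l)] for l in D[i])
                            - sum(alpha0[(i, j)] * (1.0 - X[j, k]) for j in S[i]))
            a += max(t1, t2)
            b_sq += sum(max(X[i, k], X[l, k]) for l in D[i])
            b_sq += sum(max(X[i, k], X[j, k]) for j in S[i])
        else:       # constants of Theorem 7 (Lemma 1)
            a += sum(alpha0[(i, l)] * max(X[i, k], X[l, k]) ** 2 for l in D[i])
            b_sq += sum(max(X[i, k], X[l, k]) ** 4 for l in D[i])
            b_sq += sum(max(X[i, k], X[j, k]) ** 4 for j in S[i])
    b = np.sqrt(b_sq)
    na = np.linalg.norm(list(alpha0.values()))
    return lam0 * (2 * eps * b + na * b + a) / (2 * lam0 + na * b - a)
```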
Once we calculate \(\lambda _a\) and \(\lambda '_a\) of (18) and (19) for some \(\lambda \), they are stored at each node of the tree. These stored \(\lambda _a\) and \(\lambda '_a\) can then be reused in the next tree traversal with a different value \(\lambda '\). If the conditions of (18) or (19) are satisfied, the node can be skipped (RSS) or pruned (RSP); otherwise, we update \(\lambda _a\) and \(\lambda '_a\) using the current reference solution.
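One way to organize this caching is to store the most recent ranges on each tree node and refresh them only when neither cached range covers the current \(\lambda \); a sketch, where `update_ranges` is assumed to re-evaluate Theorems 6 and 7 with the current reference solution:

```python
import numpy as np

class MiningNode:
    """Hypothetical tree node storing the most recent RSS/RSP ranges."""
    def __init__(self, k, children=()):
        self.k, self.children = k, list(children)
        self.rss = (np.inf, -np.inf)   # cached [lambda_a, lambda_0] of (18)
        self.rsp = (np.inf, -np.inf)   # cached [lambda'_a, lambda_0] of (19)

def in_range(rng, lam):
    return rng[0] <= lam <= rng[1]

def traverse_cached(node, lam, update_ranges, F):
    """Reuse cached ranges across traversals with different lambda values."""
    if not in_range(node.rss, lam) and not in_range(node.rsp, lam):
        node.rss, node.rsp = update_ranges(node)
    if in_range(node.rsp, lam):
        return                         # RSP holds: prune the whole subtree
    if not in_range(node.rss, lam):
        F.append(node.k)               # RSS fails: feature k stays in F
    for c in node.children:
        traverse_cached(c, lam, update_ranges, F)
```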
Working set method
Safe rules are strong in the sense that they remove features with a strict guarantee; as a result, they are sometimes too conservative to fully accelerate the optimization. In contrast, working set selection is a widely accepted heuristic for selecting a subset of features.
Working set selection and working set pruning
The working set (WS) method optimizes the problem with respect to the selected working set features only. If the optimality condition for the original problem is then not satisfied, the working set is reselected, and optimization restarts on the new working set. This process iterates until optimality for the original problem is achieved.
Besides the safe rules, we use the following WS selection criterion, which is obtained directly from the KKT conditions:
$$\begin{aligned} {\varvec{C}}_{k,:}{\varvec{\alpha }}\le \lambda . \end{aligned}$$
(20)
If this inequality is satisfied, the k-th dimension is predicted as \(m_k^\star =0\). Hence, the working set is defined by
$$\begin{aligned} {\mathcal {W}}:=\{k\mid {\varvec{C}}_{k,:}{\varvec{\alpha }}>\lambda \}. \end{aligned}$$
Although \(m^\star _i = 0\) is not guaranteed for \(i \notin {{\mathcal {W}}}\), the convergence of the overall procedure is guaranteed by the following theorem.
Theorem 9
(Convergence of WS) Assume that there is a solver for the sub-problem (9) (or equivalently (10)) that returns the optimal solution for a given \({{\mathcal {F}}}\). The working set method, which alternates between optimizing the sub-problem with \({{\mathcal {F}}}= {{\mathcal {W}}}\) and updating \({{\mathcal {W}}}\), returns the optimal solution of the original problem in a finite number of steps.
However, here again, the inequality (20) needs to be evaluated for all features, which is computationally intractable.
The same pruning strategy as for SS/SP can be incorporated into working set selection. The criterion (20) is also a special case of (11), and Lemma 1 indicates that if the following inequality
$$\begin{aligned} \mathrm {Prune}_{\mathrm{WP}}(k) :=\mathrm {Prune}(k|{\varvec{\alpha }}, 0)\le \lambda , \end{aligned}$$
holds, then no \(k' \supseteq k\) is included in the working set. We refer to this criterion as working set pruning (WP).
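A compact sketch of the resulting WS loop (Theorem 9); the selection is written here as a dense scan over all features for clarity, whereas in practice \({\mathcal {W}}\) is built by the tree traversal with WP:

```python
import numpy as np

def working_set_method(C, lam, solve_sub, alpha, tol=1e-9):
    """Theorem 9: alternate sub-problem solves and re-selection of W.

    solve_sub(W, alpha) is assumed to return an optimal (m_W, alpha) pair
    for the sub-problem (9)/(10) restricted to the working set W."""
    p = C.shape[0]
    while True:
        W = np.where(C @ alpha > lam)[0]      # selection rule (20)
        m_W, alpha = solve_sub(W, alpha)
        outside = np.setdiff1d(np.arange(p), W)
        if outside.size == 0 or np.all(C[outside] @ alpha <= lam + tol):
            return m_W, W, alpha              # KKT holds for all features
```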
Relation with safe rules
Note that the working set method may need to update \({{\mathcal {W}}}\) multiple times, unlike the safe screening approaches, as shown by Theorem 9. In exchange, the working set method can usually exclude a larger number of features than safe screening. In fact, when the condition of the SS rule (16) is satisfied, the WS criterion (20) must likewise be satisfied. Because all the spheres (DGB, RPB and RRPB) contain the reference solution \(\varvec{\alpha }\), which is usually the current solution, the inequality
$$\begin{aligned} \varvec{C}_{k,:} \varvec{\alpha }\le \max _{\varvec{\alpha }' \in {{\mathcal {B}}}} \varvec{C}_{k,:} \varvec{\alpha }' \end{aligned}$$
(21)
holds, where \({{\mathcal {B}}}\) is a sphere created by DGB, RPB or RRPB. This indicates that when the SS rule excludes the k-th feature, the WS also excludes the k-th feature. However, to guarantee convergence, WS needs to be fixed until the sub-problem (9)–(10) is solved (Theorem 9). In contrast, the SS rule is applicable anytime during the optimization procedure without affecting the final optimality. This enables us to apply the SS rule even to the sub-problem (9)–(10), where \({{\mathcal {F}}}\) is defined by WS as shown in Step 2 of Fig. 2 (dynamic screening).
For the pruning rules, we first confirm the following two properties:
$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r)&\ge \mathrm{Prune}(k|\varvec{q},0), \\ \mathrm{Prune}(k| C \varvec{q},0)&= C ~ \mathrm{Prune}(k| \varvec{q},0), \end{aligned}$$
where \(\varvec{q} \in {\mathbb {R}}_+^{2 n K}\) is the center of the sphere, \(r \ge 0\) is the radius, and \(C \in {\mathbb {R}}\) is a constant. In the case of DGB, the center of the sphere is the reference solution \(\varvec{\alpha }\) itself, i.e., \(\varvec{q} = \varvec{\alpha }\). Then, the following relation holds between the SP criterion \(\mathrm{Prune}(k|\varvec{q},r)\) and WP criterion \(\mathrm{Prune}_{\mathrm{WP}}(k)\):
$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r) = \mathrm{Prune}(k|\varvec{\alpha },r) \ge \mathrm{Prune}(k|\varvec{\alpha },0) = \mathrm{Prune}_{\mathrm{WP}}(k). \end{aligned}$$
This once more indicates that when the SP rule is satisfied, the WP rule must be satisfied as well. When the RPB or RRPB sphere is used, the center of the sphere is \(\varvec{q} = \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha }_0\). Assuming that the solution for \(\lambda _0\) is used as the reference solution, i.e., \(\varvec{\alpha }= \varvec{\alpha }_0\), we obtain
$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r)&= \mathrm{Prune} \left( k|\frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha },r\right) \\&\ge \mathrm{Prune} \left( k|\frac{\lambda _0 + \lambda _1}{2 \lambda _0} \varvec{\alpha },0\right) \\&= \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \mathrm{Prune}(k| \varvec{\alpha },0) \\&= \frac{\lambda _0 + \lambda _1}{2 \lambda _0} \mathrm{Prune}_{\mathrm{WP}}(k). \end{aligned}$$
Using this inequality, we obtain
$$\begin{aligned} \mathrm{Prune}(k|\varvec{q},r) - \mathrm{Prune}_{\mathrm{WP}}(k) \ge \frac{\lambda _1 - \lambda _0}{2 \lambda _0} \mathrm{Prune}_{\mathrm{WP}}(k). \end{aligned}$$
From this inequality, if \(\lambda _1 > \lambda _0\), then \(\mathrm{Prune}(k|\varvec{q},r) > \mathrm{Prune}_{\mathrm{WP}}(k)\) (note that \(\mathrm{Prune}_{\mathrm{WP}}(k) \ge 0\) because \(\varvec{\alpha }\ge \varvec{0}\)), indicating that the pruning of WS is always tighter than that of the safe rule. However, in our algorithm presented in Sect. 5, \(\lambda _1 < \lambda _0\) holds because we start from a larger value of \(\lambda \) and gradually decrease it. In that case, the inequality no longer applies, and \(\mathrm{Prune}(k|\varvec{q},r) < \mathrm{Prune}_{\mathrm{WP}}(k)\) can occur.
When the WS and WP rules are strictly tighter than the SS and SP rules, respectively, using both the WS/WP and SS/SP rules is equivalent to using WS/WP alone (except for dynamic screening). Even in this case, the range-based safe approaches (the RSS and RSP rules) can still be effective: when the range-based rules are evaluated, we obtain a range of \(\lambda \) over which the SS or SP rule is guaranteed to hold, and as long as \(\lambda \) stays in that range, we do not need to evaluate any safe or working set rules.