1 Introduction

Linear programming (LP) has been used widely and successfully in many practical areas since the introduction of the simplex method in the 1950s. Although an alternative solution technique, the interior point method (IPM), has become competitive and popular since the 1980s, the dual revised simplex method is frequently preferred, particularly when families of related problems are to be solved.

The standard simplex method implements the simplex algorithm via a rectangular tableau but is very inefficient when applied to sparse LP problems. For such problems the revised simplex method is preferred since it permits the (hyper-)sparsity of the problem to be exploited. This is achieved using techniques for factoring sparse matrices and solving hyper-sparse linear systems. Also important for the dual revised simplex method are advanced algorithmic variants introduced in the 1990s, particularly dual steepest-edge (DSE) pricing and the bound flipping ratio test (BFRT). These led to dramatic performance improvements and are key reasons for the dual simplex algorithm being preferred.

A review of past work on parallelising the simplex method is given by Hall [10]. The standard simplex method has been parallelised many times and generally achieves good speedup, with factors ranging from tens up to a thousand. However, without using expensive parallel computing resources, its performance on sparse LP problems is inferior to a good sequential implementation of the revised simplex method. The standard simplex method is also numerically unstable. Parallelisation of the revised simplex method has been considered relatively little and there has been less success in terms of speedup. Indeed, since scalable speedup for general large sparse LP problems appears unachievable, the revised simplex method has been considered unsuitable for parallelisation. However, since it corresponds to the computationally efficient serial technique, any improvement in performance due to exploiting parallelism in the revised simplex method is a worthwhile goal.

Two main factors motivated the work in this paper to develop a parallelisation of the dual revised simplex method for standard desktop architectures. Firstly, although dual simplex implementations are now generally preferred, almost all the work by others on parallel simplex has been restricted to the primal algorithm, the only published work on dual simplex parallelisation known to the authors being due to Bixby and Martin [1]. Although it appeared in the early 2000s, their implementation included neither the BFRT nor hyper-sparse linear system solution techniques so there is immediate scope to extend their work. Secondly, in the past, parallel implementations generally used dedicated high performance computers to achieve the best performance. Now, when every desktop computer is a multi-core machine, any speedup is desirable in terms of solution time reduction for daily use. Thus we have used a relatively standard architecture to perform computational experiments.

A worthwhile simplex parallelisation should be based on a good sequential simplex solver. Although there are many public domain simplex implementations, they are either too complicated to be used as a foundation for a parallel solver or too inefficient for any parallelisation to be worthwhile. Thus the authors have implemented a sequential dual simplex solver (hsol) from scratch. It incorporates sparse LU factorization, hyper-sparse linear system solution techniques, efficient approaches to updating LU factors and sophisticated dual revised simplex pivoting rules. Based on components of this sequential solver, two dual simplex parallel solvers (pami and sip) have been designed and developed.

Section 2 introduces the necessary background, Sects. 3 and 4 detail the design of pami and sip respectively and Sect. 5 presents numerical results and performance analysis. Conclusions are given in Sect. 6.

2 Background

The simplex method has been under development for more than 60 years, during which time many important algorithmic variants have enhanced the performance of simplex implementations. As a result, for novel computational developments to be of value they must be tested within an efficient implementation or good reasons given why they are applicable in such an environment. Any development which is only effective in the context of an inefficient implementation is not worthy of attention.

This section introduces all the necessary background knowledge for developing the parallel dual simplex solvers. Section 2.1 introduces the computational form of LP problems and the concept of primal and dual feasibility. Section 2.2 describes the regular dual simplex algorithm and then details its key enhancements and major computational components. Section 2.3 introduces suboptimization, a relatively little-known dual simplex variant which is the starting point for the pami parallelisation in Sect. 3. Section 2.4 briefly reviews several existing simplex update approaches which are key to the efficiency of the parallel schemes.

2.1 Linear programming problems

A linear programming (LP) problem in general computational form is

$$\begin{aligned} \text {minimize } f = {{\varvec{c}}}^T {{\varvec{x}}}\quad \text { subject to } A {{\varvec{x}}}= {{\varvec{0}}}\text { and } {{\varvec{l}}}\le {{\varvec{x}}}\le {{\varvec{u}}}, \end{aligned}$$
(1)

where \(A \in \mathbb {R}^{m\times n}\) is the coefficient matrix and \({{\varvec{x}}}\), \({{\varvec{c}}}\), \({{\varvec{l}}}\) and \({{\varvec{u}}}\in \mathbb {R}^n\) are, respectively, the variable vector, cost vector and (lower and upper) bound vectors. Bounds on the constraints are incorporated into \({{\varvec{l}}}\) and \({{\varvec{u}}}\) via an identity submatrix of A. Thus it may be assumed that \(m < n\) and that A is of full rank.

As A is of full rank, it is always possible to identify a non-singular basis partition \(B\in \mathbb {R}^{m\times m}\) consisting of m linearly independent columns of A, with the remaining columns of A forming the matrix N. The variables are partitioned accordingly into basic variables \({{\varvec{x}}}_{\scriptscriptstyle B}\) and nonbasic variables \({{\varvec{x}}}_{\scriptscriptstyle N}\), so \(A {{\varvec{x}}}= B{{\varvec{x}}}_{\scriptscriptstyle B}+ N{{\varvec{x}}}_{\scriptscriptstyle N}= {{\varvec{0}}}\), and the cost vector is partitioned into basic costs \({{\varvec{c}}}_{\scriptscriptstyle B}\) and nonbasic costs \({{\varvec{c}}}_{\scriptscriptstyle N}\), so \(f = {{\varvec{c}}}_{\scriptscriptstyle B}^T{{\varvec{x}}}_{\scriptscriptstyle B}+ {{\varvec{c}}}_{\scriptscriptstyle N}^T{{\varvec{x}}}_{\scriptscriptstyle N}\). The indices of the basic and nonbasic variables form sets \({\mathcal {B}}\) and \({\mathcal {N}}\) respectively.

In the simplex algorithm, the values of the (primal) variables are defined by setting each nonbasic variable to one of its finite bounds and computing the values of the basic variables as \({{\varvec{x}}}_{\scriptscriptstyle B}= - B^{-1} N {{\varvec{x}}}_{\scriptscriptstyle N}\). The values of the dual variables (reduced costs) are defined as \(\widehat{{{\varvec{c}}}}_{\scriptscriptstyle N}^T = {{\varvec{c}}}_{\scriptscriptstyle N}^T - {{\varvec{c}}}_{\scriptscriptstyle B}^T B^{-1}N\). When \({{\varvec{l}}}_{\scriptscriptstyle B}\le {{\varvec{x}}}_{\scriptscriptstyle B}\le {{\varvec{u}}}_{\scriptscriptstyle B}\) holds, the basis is said to be primal feasible. Otherwise, the primal infeasibility for each basic variable \(i \in {\mathcal {B}}\) is defined as

$$\begin{aligned} \varDelta x_i = \left\{ \begin{array} {ll} l_i - x_i &{} \text { if } x_i < l_i \\ x_i - u_i &{} \text { if } x_i > u_i \\ 0 &{} \text { otherwise} \end{array} \right. \end{aligned}$$
(2)

If the following condition holds for all \(j\in {\mathcal {N}}\) such that \(l_j \ne u_j\)

$$\begin{aligned} \widehat{c}_j \ge 0~(x_j = l_j),\quad \widehat{c}_j \le 0~(x_j = u_j) \end{aligned}$$
(3)

then the basis is said to be dual feasible. It can be proved that if a basis is both primal and dual feasible then it yields an optimal solution to the LP problem.
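To make these definitions concrete, the following sketch (a minimal dense illustration in numpy, with randomly generated data standing in for an LP; the variable names are ours, not those of any solver) computes the basic solution, the reduced costs, the primal infeasibilities of (2) and the dual feasibility test (3).

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 4, 9
A = np.hstack([rng.normal(size=(m, n - m)), np.eye(m)])   # structural plus logical columns
l, u = -np.ones(n), np.ones(n)
c = rng.normal(size=n)

basic = np.arange(n - m, n)                  # index set B (here: the logical columns)
nonbasic = np.arange(0, n - m)               # index set N
B, N = A[:, basic], A[:, nonbasic]

x = np.zeros(n)
x[nonbasic] = l[nonbasic]                    # each nonbasic variable at a finite bound
x[basic] = np.linalg.solve(B, -N @ x[nonbasic])            # x_B = -B^{-1} N x_N

# Reduced costs: c_hat_N^T = c_N^T - c_B^T B^{-1} N
c_hat = c[nonbasic] - np.linalg.solve(B.T, c[basic]) @ N

# Primal infeasibilities, equation (2)
xB, lB, uB = x[basic], l[basic], u[basic]
delta = np.where(xB < lB, lB - xB, np.where(xB > uB, xB - uB, 0.0))
primal_feasible = bool(np.all(delta == 0))

# Dual feasibility, condition (3), for nonbasic variables with l_j != u_j
at_lower = x[nonbasic] == l[nonbasic]
dual_feasible = bool(np.all(np.where(at_lower, c_hat >= 0, c_hat <= 0)
                            | (l[nonbasic] == u[nonbasic])))
print(primal_feasible, dual_feasible)
```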

2.2 Dual revised simplex method

The dual simplex algorithm solves an LP problem iteratively by seeking primal feasibility while maintaining dual feasibility. Starting from a dual feasible basis, each iteration of the dual simplex algorithm can be summarised as three major operations.

  1. Optimality test. In a component known as chuzr, choose the index \(p \in {\mathcal {B}}\) of a good primal infeasible variable to leave the basis. If no such variable can be chosen, the LP problem is solved to optimality.

  2. Ratio test. In a component known as chuzc, choose the index \(q \in {\mathcal {N}}\) of a good nonbasic variable to enter the basis so that, within the new partition, \(\widehat{c}_q\) is zeroed whilst \(\widehat{c}_p\) and other nonbasic variables remain dual feasible. This is achieved via a ratio test with \(\widehat{{{\varvec{c}}}}^T_{\scriptscriptstyle N}\) and \(\widehat{{{\varvec{a}}}}_p^T\), where \(\widehat{{{\varvec{a}}}}_p^T\) is row p of the reduced coefficient matrix \(\widehat{A}=B^{-1}A\).

  3. Updating. The basis is updated by interchanging indices p and q between sets \({\mathcal {B}}\) and \({\mathcal {N}}\), with corresponding updates of the values of the primal variables \({{\varvec{x}}}_{\scriptscriptstyle B}\) using \(\widehat{{{\varvec{a}}}}_q\) (being column q of \(\widehat{A}\)) and dual variables \(\widehat{{{\varvec{c}}}}^T_{\scriptscriptstyle N}\) using \(\widehat{{{\varvec{a}}}}_p^T\), as well as other components as discussed below.

What defines the revised simplex method is a representation of the basis inverse \(B^{-1}\) to permit rows and columns of the reduced coefficient matrix \(\widehat{A}=B^{-1}A\) to be computed by solving linear systems. The operation to compute the representation of \(B^{-1}\) directly is referred to as invert and is generally achieved via sparsity-exploiting LU factorization. At the end of each simplex iteration the representation of \(B^{-1}\) is updated until it is computationally advantageous or numerically necessary to compute a fresh representation directly. The computational component which performs the update of \(B^{-1}\) is referred to as update-factor. Efficient approaches for updating \(B^{-1}\) are summarised in Sect. 2.4.

For many sparse LP problems the matrix \(B^{-1}\) is dense, so solutions of linear systems involving B or \(B^T\) can be expected to be dense even when, as is typically the case in the revised simplex method, the RHS is sparse. However, for some classes of LP problem the solutions of such systems are typically sparse. This phenomenon, and techniques for exploiting it in the simplex method, was identified by Hall and McKinnon [13] and is referred to as hyper-sparsity. This advanced technique has been incorporated throughout the design and development of the new parallel dual simplex solvers.

The remainder of this section introduces advanced algorithmic components of the dual simplex method.

2.2.1 Optimality test

In the optimality test, a modern dual simplex implementation adopts two important enhancements. The first is the dual steepest-edge (DSE) algorithm [6] which chooses the basic variable with greatest weighted infeasibility as the leaving variable. This variable has index

$$\begin{aligned} p = \arg \max _i \frac{\varDelta x_i}{||\widehat{{{\varvec{e}}}}_i^T||_2}. \end{aligned}$$

For each basic variable \(i \in {\mathcal {B}}\), the associated DSE weight \(w_i\) is defined as the squared 2-norm of row i of \(B^{-1}\), so \(w_i = ||\widehat{{{\varvec{e}}}}_i^T||_2^2 =||{{{\varvec{e}}}}_i^TB^{-1}||_2^2\); it is this squared form which admits the update formula given in Sect. 2.2.3. The weighted infeasibility \(\alpha _i = \varDelta x_i / \sqrt{w_i}\) is referred to as the attractiveness of a basic variable. The DSE weight is updated at the end of each simplex iteration.

The second enhancement of the optimality test is the hyper-sparse candidate selection technique originally proposed for column selection in the primal simplex method [13]. This maintains a short list of the most attractive variables and is more efficient for large and sparse LP problems since it avoids repeatedly searching the less attractive choices. This technique has been adapted for the dual simplex row selection component of hsol.
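The following sketch conveys the flavour of such a scheme (a simplified illustration only; hsol's actual data structures and refresh rules are more involved). A full pass over the infeasibilities builds a short list of the most attractive rows, and subsequent selections scan only that list, rebuilding it when it is exhausted.

```python
import heapq
import numpy as np

def build_shortlist(delta, w, size=8):
    """Full pass: keep the `size` most attractive rows, alpha_i = delta_i / sqrt(w_i)."""
    alpha = delta / np.sqrt(w)
    return heapq.nlargest(size, np.flatnonzero(alpha > 0).tolist(),
                          key=lambda i: alpha[i])

def chuzr(delta, w, shortlist):
    """Pick the most attractive row, rescanning only the short list when possible."""
    while True:
        live = [i for i in shortlist if delta[i] > 0]
        if live:
            return max(live, key=lambda i: delta[i] / np.sqrt(w[i]))
        shortlist[:] = build_shortlist(delta, w)   # refresh with a full pass
        if not shortlist:
            return None                            # no infeasibility: optimal
```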

2.2.2 Ratio test

In the ratio test, the updated pivotal row \(\widehat{{{\varvec{a}}}}_p^T\) is obtained by computing \(\widehat{{{\varvec{e}}}}_p^T = {{\varvec{e}}}_p^TB^{-1}\) and then forming the matrix vector product \(\widehat{{{\varvec{a}}}}_p^T = \widehat{{{\varvec{e}}}}_p^T A\). These two computational components are referred to as btran and spmv respectively.

The dual ratio test (chuzc) is enhanced by the Harris two-pass ratio test [14] and bound-flipping ratio test (BFRT) [8]. Details of how to apply these two techniques are set out by Koberstein [18].

For the purpose of this report, advanced chuzc can be viewed as having two stages, an initial stage chuzc1 which simply accumulates all candidate nonbasic variables and then a recursive selection stage chuzc2 to choose the entering variable q from within this set of candidates using BFRT and the Harris two-pass ratio test. chuzc also determines the primal step \(\theta _p\) and dual step \(\theta _q\), being the changes to the primal basic variable p and dual variable q respectively. Following a successful BFRT, chuzc also yields an index set \({\mathcal {F}}\) of any primal variables which have flipped from one bound to the other.

2.2.3 Updating

In the updating operation, besides update-factor, several vectors are updated. Update of the basic primal variables \({{\varvec{x}}}_{\scriptscriptstyle B}\) (update-primal) is achieved using \(\theta _p\) and \(\widehat{{{\varvec{a}}}}_q\), where \(\widehat{{{\varvec{a}}}}_q\) is computed by an operation \(\widehat{{{\varvec{a}}}}_q = B^{-1} {{\varvec{a}}}_q\) known as ftran. Update of the dual variables \(\widehat{{{\varvec{c}}}}_{\scriptscriptstyle N}^T\) (update-dual) is achieved using \(\theta _q\) and \(\widehat{{{\varvec{a}}}}_p^T\). The update of the DSE weights is given by

$$\begin{aligned} w_p \,&:=\, w_p / \widehat{a}_{pq}^2, \\ w_i \,&:= \,w_i - 2 (\widehat{a}_{iq} / \widehat{a}_{pq}) \tau _i + (\widehat{a}_{iq} / \widehat{a}_{pq})^2 w_p, \quad i \ne p. \end{aligned}$$

This requires both the ftran result \(\widehat{{{\varvec{a}}}}_q\) and \({{\varvec{\tau }}}= B^{-1} \widehat{{{\varvec{e}}}}_p\). The latter is obtained by another ftran type operation, known as ftran-dse.
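As a concrete illustration, the sketch below applies the weight update formulae above, given the ftran result \(\widehat{{{\varvec{a}}}}_q\), the ftran-dse result \({{\varvec{\tau }}}\) and the pivotal row p (dense numpy code; the safeguard on the final line is a common implementation detail assumed here, not something specified above). Note that the old value of \(w_p\) must be used when updating the other weights, before \(w_p\) itself is overwritten.

```python
import numpy as np

def update_dse_weights(w, a_hat_q, tau, p):
    """Update squared DSE weights w_i = ||e_i^T B^{-1}||_2^2 after a basis change.

    a_hat_q : ftran result B^{-1} a_q (updated entering column)
    tau     : ftran-dse result B^{-1} e_hat_p
    p       : index of the pivotal (leaving) row
    """
    a_pq = a_hat_q[p]                       # pivot element
    w_p_old = w[p]                          # old weight of the pivotal row
    r = a_hat_q / a_pq
    w_new = w - 2.0 * r * tau + r * r * w_p_old
    w_new[p] = w_p_old / a_pq ** 2          # pivotal row is treated separately
    return np.maximum(w_new, 1e-4)          # assumed safeguard against tiny weights
```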

Following a BFRT ratio test, if \({\mathcal {F}}\) is not empty, then all the variables with indices in \({\mathcal {F}}\) are flipped, and the primal basic solution \({{\varvec{x}}}_{\scriptscriptstyle B}\) is further updated (another update-primal) by the result of the ftran-bfrt operation \(\widehat{{{\varvec{a}}}}_{\scriptscriptstyle F}= B^{-1} {{\varvec{a}}}_{\scriptscriptstyle F}\), where \({{\varvec{a}}}_{\scriptscriptstyle F}\) is a linear combination of the constraint columns for the variables in \({\mathcal {F}}\).

2.2.4 Scope for parallelisation

The computational components identified above are summarised in Table 1. This also gives the average contribution to solution time for the LP test set used in Sect. 5.

Table 1 Major components of the dual revised simplex method and their percentage of overall solution time

There is immediate scope for data parallelisation within chuzr, spmv, chuzc and most of the update operations since they require independent operations for each (nonzero) component of a vector. Exploiting such parallelisation in spmv and chuzc has been reported by Bixby and Martin [1] who achieve speedup on a small group of LP problems with relatively expensive spmv operations. The scope for task parallelism by overlapping ftran and ftran-dse was considered by Bixby and Martin but rejected as being disadvantageous computationally. Another notable development extending the ideas of Bixby and Martin [1] is Aboca, which was introduced by Forrest at a conference in 2012 [4]. Other than making the source code available [5], the authors believe that there is no published reference to this work.

2.3 Dual suboptimization

Suboptimization is one of the oldest variants of the revised simplex method and consists of a major-minor iteration scheme. Within the primal revised simplex method, suboptimization performs minor iterations of the standard primal simplex method using small subsets of columns from the reduced coefficient matrix \(\widehat{A}=B^{-1}A\). Suboptimization for the dual simplex method was first set out by Rosander [21] but no practical implementation has been reported. It performs minor operations of the standard dual simplex method, applied to small subsets of rows from \(\widehat{A}\).

  1. Major optimality test. Choose index set \({\mathcal {P}}\subseteq {\mathcal {B}}\) of primal infeasible basic variables as potential leaving variables. If no such indices can be chosen, the LP problem has been solved to optimality.

  2. Minor initialisation. For each \(p \in {\mathcal {P}}\), compute \(\widehat{{{\varvec{e}}}}_p^T = {{\varvec{e}}}_p^T B^{-1}\).

  3. Minor iterations.

     (a) Minor optimality test. Choose and remove a primal infeasible variable p from \({\mathcal {P}}\). If no such variable can be chosen, the minor iterations are terminated.

     (b) Minor ratio test. As in the regular ratio test, compute \(\widehat{{{\varvec{a}}}}_p^T = \widehat{{{\varvec{e}}}}_p^T A\) (spmv) then identify an entering variable q.

     (c) Minor update. Update primal variables for the remaining candidates in set \({\mathcal {P}}\) only (\({{\varvec{x}}}_{\mathcal {P}}\)) and update all dual variables \(\widehat{{{\varvec{c}}}}_{\scriptscriptstyle N}\).

  4. Major update. For the pivotal sequence identified during the minor iterations, update the primal basic variables, DSE weights and representation of \(B^{-1}\).

Originally, suboptimization was proposed as a pivoting scheme with the aim of achieving better pivot choices and advantageous data affinity. In modern revised simplex implementations, the DSE and BFRT are together regarded as the best pivotal rules and the idea of suboptimization has been largely forgotten.

However, in terms of parallelisation, suboptimization is attractive because it provides more scope for parallelism. For the primal simplex algorithm, suboptimization underpinned the work of Hall and McKinnon [11, 12]. As discussed by Hall [10], Wunderling [22] also experimented with suboptimization for the primal simplex method. For dual suboptimization the minor initialisation requires s btran operations, where \(s=|{\mathcal {P}}|\). Following \(t\le s\) minor iterations, the major update requires t ftran operations, t ftran-dse operations and up to t ftran-bfrt operations. The detailed design of the parallelisation scheme based on suboptimization is discussed in Sect. 3.

2.4 Simplex update techniques

Updating the basis inverse \(B_k^{-1} \) to \( B_{k+1}^{-1}\) after the basis change \(B_{k+1} = B_k + ({{\varvec{a}}}_q - B_k{{\varvec{e}}}_p){{\varvec{e}}}_p^T\) is a crucial component of revised simplex method implementations. The standard choices are the relatively simple product form (PF) update [20] or the efficient Forrest–Tomlin (FT) update [7]. A comprehensive report on simplex update techniques is given by Elble and Sahinidis [3] and novel techniques, some motivated by the design and development of pami, are described by Huangfu and Hall [16]. For the purpose of this report, the features of all relevant update methods are summarised as follows.

  • The product form (PF) update uses the ftran result \(\widehat{{{\varvec{a}}}}_q\), yielding \(B_{k+1}^{-1} = E^{-1}B_k^{-1}\), where the inverse of \(E = I + (\widehat{{{\varvec{a}}}}_q - {{\varvec{e}}}_p) {{\varvec{e}}}^T_p\), is readily available.

  • The Forrest–Tomlin (FT) update assumes \(B_k=L_kU_k\) and uses both the partial ftran result \(\tilde{{{\varvec{a}}}}_q = L_k^{-1}{{\varvec{a}}}_q\) and partial btran result \(\tilde{{{\varvec{e}}}}_p^T = {{{\varvec{e}}}}_p^T U_k^{-1}\) to modify \(U_k\) and augment \(L_k\).

  • The alternate product form (APF) update [16] uses the btran result \(\widehat{{{\varvec{e}}}}_p^T\) so that \(B_{k+1}^{-1} = B_k^{-1}T^{-1}\), where \(T = I+({{\varvec{a}}}_q - {{\varvec{a}}}_{p'})\widehat{{{\varvec{e}}}}_p^T\) and \({{\varvec{a}}}_{p'}\) is column p of B. Again, T is readily inverted.

  • Following suboptimization, the collective Forrest–Tomlin (CFT) update [16] updates \(B_k^{-1}\) to \(B_{k+t}^{-1}\) directly, using partial results obtained with \(B_k^{-1}\) which are required for simplex iterations.

Although the direct update of the basis inverse from \(B_k^{-1}\) to \(B_{k+t}^{-1}\) can be achieved easily via the PF or APF update, in terms of efficiency for future simplex iterations, the collective FT update is preferred to the PF and APF updates. The value of the APF update within pami is indicated in Sect. 3.
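The PF and APF formulae above are easily checked numerically. The following sketch (dense numpy with random data standing in for a basis \(B_k\) and an entering column \({{\varvec{a}}}_q\); not the solver's sparse machinery) replaces column p of \(B_k\) by \({{\varvec{a}}}_q\) and verifies both factorized expressions for \(B_{k+1}^{-1}\).

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 5, 2
Bk = rng.normal(size=(m, m)) + m * np.eye(m)     # a well-conditioned stand-in basis
a_q = rng.normal(size=m)                         # entering column

e_p = np.zeros(m); e_p[p] = 1.0
Bk1 = Bk + np.outer(a_q - Bk @ e_p, e_p)         # column p of B_k replaced by a_q

# Product form: B_{k+1}^{-1} = E^{-1} B_k^{-1} with E = I + (a_hat_q - e_p) e_p^T
a_hat_q = np.linalg.solve(Bk, a_q)               # ftran result
E = np.eye(m) + np.outer(a_hat_q - e_p, e_p)

# Alternate product form: B_{k+1}^{-1} = B_k^{-1} T^{-1} with T = I + (a_q - a_p') e_hat_p^T
e_hat_p = np.linalg.solve(Bk.T, e_p)             # btran result
T = np.eye(m) + np.outer(a_q - Bk @ e_p, e_hat_p)

Bk1_inv = np.linalg.inv(Bk1)
assert np.allclose(Bk1_inv, np.linalg.inv(E) @ np.linalg.inv(Bk))
assert np.allclose(Bk1_inv, np.linalg.inv(Bk) @ np.linalg.inv(T))
```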

3 Parallelism across multiple iterations

This section introduces the design and implementation of the parallel dual simplex scheme, pami. It extends the suboptimization scheme of Rosander [21], incorporating (serial) algorithmic techniques and exploiting parallelism across multiple iterations.

The concept of pami was introduced by Hall and Huangfu [9], where it was referred to as ParISS. This prototype implementation was based on the PF update and was relatively unsophisticated, both algorithmically and computationally. Subsequent revisions and refinements, incorporating the advanced algorithmic techniques outlined in Sect. 2 as well as FT updates and some novel features introduced in this section, have yielded a far more sophisticated and efficient implementation. Specifically, our implementation of pami out-performs ParISS by almost an order of magnitude in serial. Achieving the speed-up demonstrated in Sect. 5 has also required the new task parallelism and parallel algorithmic control techniques described in Sects. 3.2 and 3.3, in addition to the linear algebra techniques introduced by Huangfu and Hall in [16].

Section 3.1 provides an overview of the parallelisation scheme of pami and Sect. 3.2 details the task parallel ftran operations in the major update stage and how they are simplified. A novel candidate quality control scheme for the minor optimality test is discussed in Sect. 3.3.

3.1 Overview of the pami framework

This section details the general pami parallelisation scheme with reference to the suboptimization framework introduced in Sect. 2.3.

3.1.1 Major optimality test

The major optimality test involves only major chuzr operations in which s candidates are chosen (if possible) using the DSE framework. In pami the value of s is the number of processors being used. It is a vector-based operation which can be easily parallelised, although its overall computational cost is not significant since it is only performed once per major iteration. However, the algorithmic design of chuzr is important and Sect. 3.3 discusses it in detail.

3.1.2 Minor initialisation

The minor initialisation step computes the btran results for (up to s) potential candidates to leave the basis. This is the first of the task parallelisation opportunities provided by the suboptimization framework.

3.1.3 Minor iterations

There are three main operations in the minor iterations.

  (a) Minor chuzr simply chooses the best candidates from the set \({\mathcal {P}}\). Since this is computationally trivial, exploitation of parallelism is not considered. However, consideration must be given to the likelihood that the attractiveness of the best remaining candidate in \({\mathcal {P}}\) has dropped significantly. In such circumstances, it may not be desirable to allow this variable to leave the basis. This consideration leads to a candidate quality control scheme introduced in Sect. 3.3.

  (b) The minor ratio test is a major source of parallelisation and performance improvement. Since the btran result is known (see below), the minor ratio test consists of spmv, chuzc1 and chuzc2. The spmv operation is a sparse matrix-vector product and chuzc1 is a one-pass selection based on the result of spmv. In the actual implementation, they can share one parallel initialisation. On the other hand, chuzc2 often involves multiple iterations of recursive selection which, if parallelised, would require many synchronisation operations. According to the component profiling in Table 1, chuzc2 is a relatively cheap operation so, in pami, it is not parallelised. Data parallelism is exploited in spmv and chuzc1 by partitioning the variables across the processors before any simplex iterations are performed. This is done randomly with the aim of achieving load balance in spmv; see the sketch after this list.

  (c) The minor update consists of the update of dual variables and the update of btran results. The former is performed in the minor update because the dual variables are required in the ratio test of the next minor iteration. It is simply a vector addition and represents immediate data parallelism. The updated btran result \({{\varvec{e}}}_i^TB_{k+1}^{-1}\) is obtained by observing that it is given by the APF update as \({{\varvec{e}}}_i^TB_k^{-1}T^{-1}=\widehat{{{\varvec{e}}}}_i^TT^{-1}\). Exploiting the structure of \(T^{-1}\) yields a vector operation which may be parallelised. After the btran results have been updated, the DSE weights of the remaining candidates are recomputed directly at little cost.
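The sketch below illustrates the partitioned spmv/chuzc1 computation referred to in (b), using a dense matrix and Python threads purely for illustration; the partition, worker function and tolerance are our own simplifications, and only one sign case of the ratio test is shown. The columns are partitioned randomly once, each worker forms its slice of \(\widehat{{{\varvec{a}}}}_p^T = \widehat{{{\varvec{e}}}}_p^T A\) and accumulates its own list of candidates in the same pass, and the recursive chuzc2 then runs serially on the merged list.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(2)
m, n, workers = 6, 40, 4
A = rng.normal(size=(m, n))
c_hat = np.abs(rng.normal(size=n))          # dual values (dual feasible in this toy case)
e_hat_p = rng.normal(size=m)                # btran result for the chosen row p

# Random static partition of the columns, fixed before the simplex iterations start
parts = np.array_split(rng.permutation(n), workers)

def spmv_chuzc1(cols):
    """One worker's share: a slice of the pivotal row plus its candidate list."""
    a_p = e_hat_p @ A[:, cols]              # spmv restricted to this partition
    # Keep j if a_p entry has the admissible sign (one case shown) and exceeds a tolerance
    return [(j, c_hat[j] / a) for j, a in zip(cols, a_p) if a > 1e-9]

with ThreadPoolExecutor(max_workers=workers) as pool:
    candidates = [c for part in pool.map(spmv_chuzc1, parts) for c in part]
# `candidates` is then passed to the serial chuzc2 stage
```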

3.1.4 Major update

Following t minor iterations, the major update step concludes the major iteration. It consists of three types of operation: up to 3t ftran operations (including ftran-dse and ftran-bfrt), the vector-based update of primal variables and DSE weights, and update of the basis inverse representation.

The number of ftran operations cannot be fixed a priori since it depends on the number of minor iterations and the number involving a non-trivial BFRT. A simplification of the ftran-bfrt operations is introduced in Sect. 3.2.

The updates of all primal variables and DSE weights (given the particular vector \({{\varvec{\tau }}}= B^{-1} \widehat{{{\varvec{e}}}}_p\)) are vector-based data parallel operations.

The update of the invertible representation of B is performed using the collective FT update unless it is desirable or necessary to perform invert to reinvert B. Note that both of these operations are performed serially. Although the (collective) FT update is relatively cheap (see Table 1), and so has little impact on performance, there is significant processor idleness during the serial invert.

3.2 Parallelising three groups of ftran operations

Within pami, the pivot sequence \(\{p_i, q_i\}_{i=0}^{t-1}\) identified in the minor iterations yields up to 3t forward linear systems (where \(t\le s\)). Computationally, there are three groups of ftran operations: t regular ftran operations for obtaining the updated tableau columns \(\widehat{{{\varvec{a}}}}_q = B^{-1} {{\varvec{a}}}_q\) associated with the entering variables identified during minor iterations; t ftran-dse operations to obtain the DSE update vectors \({{\varvec{\tau }}}= B^{-1} \widehat{{{\varvec{e}}}}_p\); and up to t ftran-bfrt calculations to update the primal solution following bound flips identified by the BFRT. Each system in a group is associated with a different basis matrix, \(B_k, B_{k+1}, \ldots , B_{k+t-1}\). For example the t regular forward systems for obtaining updated tableau columns are \(\widehat{{{\varvec{a}}}}_{q_0} = B_k^{-1}{{{\varvec{a}}}}_{q_0}, \widehat{{{\varvec{a}}}}_{q_1} = B_{k+1}^{-1}{{{\varvec{a}}}}_{q_1}, \ldots , \widehat{{{\varvec{a}}}}_{q_{t-1}} = B_{k+t-1}^{-1}{{{\varvec{a}}}}_{q_{t-1}}\).

For the regular ftran and ftran-dse operations, the ith linear system (which requires \(B_{k+i}^{-1}\)) in each group is solved by applying \(B_k^{-1}\) followed by the i PF transformations given by \(\widehat{{{\varvec{a}}}}_{q_j}, j < i\), to bring the result up to date. The operations with \(B_k^{-1}\) and the PF transformations are referred to as the inverse and update parts respectively. The multiple inverse parts are easily arranged as a task parallel computation. The update part of the regular ftran operations requires results of other forward systems in the same group and thus cannot be performed as task parallel calculations. However, it is possible and valuable to exploit data parallelism when applying individual PF updates when \(\widehat{{{\varvec{a}}}}_{q_i}\) is large and dense. For the ftran-dse group it is possible to exploit task parallelism fully if this group of computations is performed after the regular ftran. However, when implementing pami, both ftran-dse and regular ftran are performed together to increase the number of independent inverse parts in the interests of load balancing.
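A sketch of this inverse-part/update-part structure is given below (illustrative numpy code; `solve_Bk` stands in for the factorized solve with \(B_k\) and is an assumption of this sketch). The inverse parts are independent and can be issued as parallel tasks, whereas the update part of the ith regular ftran must wait for the fully updated columns \(\widehat{{{\varvec{a}}}}_{q_j}, j < i\), produced by the earlier systems.

```python
import numpy as np

def apply_pf_update(v, a_hat_q, p):
    """Apply one PF transformation E^{-1} v, where E = I + (a_hat_q - e_p) e_p^T."""
    v = v.copy()
    mu = v[p] / a_hat_q[p]
    v -= mu * a_hat_q
    v[p] += mu                # the -e_p term of the eta vector
    return v

def regular_ftrans(solve_Bk, a_cols, pivot_rows):
    """Solve a_hat_{q_i} = B_{k+i}^{-1} a_{q_i}, i = 0..t-1, via B_k plus PF updates."""
    # Inverse parts: independent solves with B_k (issued as parallel tasks in pami)
    partial = [solve_Bk(a) for a in a_cols]
    # Update parts: sequential, since the ith result needs a_hat_{q_j} for all j < i
    updated = []
    for i, v in enumerate(partial):
        for j in range(i):
            v = apply_pf_update(v, updated[j], pivot_rows[j])
        updated.append(v)
    return updated
```

Here `solve_Bk` would wrap the factorized solve with \(B_k\); in a dense setting one could pass, for example, `lambda a: np.linalg.solve(Bk, a)`.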

The group of up to t linear systems associated with BFRT is slightly different from the other two groups of systems. Firstly, there may be anything between none and t linear systems, depending on how many minor iterations are associated with actual bound flips. More importantly, the results are only used to update the values of the primal variables \({{\varvec{x}}}_{\scriptscriptstyle B}\) by simple vector addition. This can be expressed as a single operation

$$\begin{aligned} {{\varvec{x}}}_{\scriptscriptstyle B}:= {{\varvec{x}}}_{\scriptscriptstyle B}+ \sum _{i=0}^{t-1} B_{k+i}^{-1} {{{\varvec{a}}}}_{\scriptscriptstyle {Fi}} ={{\varvec{x}}}_{\scriptscriptstyle B}+ \sum _{i=0}^{t-1} \left( \prod _{j=i-1}^{0} E_j^{-1} B_{k}^{-1} {{{\varvec{a}}}}_{\scriptscriptstyle {Fi}} \right) \end{aligned}$$
(4)

where one or more of \({{{\varvec{a}}}}_{\scriptscriptstyle {Fi}}\) may be a zero vector. If implemented using the regular PF update, each ftran-bfrt operation starts from the same basis inverse \(B_k^{-1}\) but finishes with different numbers of PF update operations. Although these operations are closely related, they cannot be combined. However, if the APF update is used, so \(B_{k+i}^{-1}\) can be expressed as

$$\begin{aligned} B_{k+i}^{-1} = B_k^{-1} T_0^{-1} \cdots T_{i-1}^{-1}, \end{aligned}$$

the primal update Eq. (4) can be rewritten as

$$\begin{aligned} {{\varvec{x}}}_{\scriptscriptstyle B}:= {{\varvec{x}}}_{\scriptscriptstyle B}+ \sum _{i=0}^{t-1} \left( B_{k}^{-1} \prod _{j=0}^{i-1} T_j^{-1} {{{\varvec{a}}}}_{\scriptscriptstyle {Fi}} \right) = {{\varvec{x}}}_{\scriptscriptstyle B}+ B_{k}^{-1} \left( \sum _{i=0}^{t-1} \prod _{j=0}^{i-1} T_j^{-1} {{{\varvec{a}}}}_{\scriptscriptstyle {Fi}} \right) \end{aligned}$$
(5)

where the t linear systems start with a cheap APF update part and finish with a single \(B_k^{-1}\) operation applied to the combined result. This approach greatly reduces the total serial cost of solving the forward linear systems associated with BFRT. An additional benefit of this combination is that the update-primal operation is also reduced to a single operation after the combined ftran-bfrt.
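The computational point of (5), a cheap APF update part per system followed by a single solve with \(B_k\), can be sketched as follows (illustrative numpy code; `solve_Bk`, the argument layout and the contents of `T_data` are assumptions of this sketch, and `apply_T_inverse` uses the Sherman–Morrison form of \(T^{-1}\)).

```python
import numpy as np

def apply_T_inverse(v, a_q, a_p_col, e_hat_p):
    """Apply T^{-1} v for T = I + (a_q - a_p') e_hat_p^T (Sherman-Morrison form)."""
    d = a_q - a_p_col
    return v - (e_hat_p @ v) / (1.0 + e_hat_p @ d) * d

def combined_ftran_bfrt(solve_Bk, T_data, a_F):
    """Evaluate the bracketed sum in (5) and finish with a single B_k solve.

    T_data[j] holds (a_q, a_p_col, e_hat_p) for the jth basis change.
    """
    rhs = np.zeros_like(a_F[0])
    for i, a in enumerate(a_F):              # a may be a zero vector for some i
        v = a.copy()
        for j in range(i - 1, -1, -1):       # apply T_{i-1}^{-1}, ..., T_0^{-1} in turn
            v = apply_T_inverse(v, *T_data[j])
        rhs += v
    return solve_Bk(rhs)                     # the single inverse part B_k^{-1}(...)
```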

By combining several potential ftran-bfrt operations into one, the number of forward linear systems to be solved is reduced to \(2t + 1\), or 2t when no bound flips are performed. An additional benefit of this reduction is that, when \(t\le s-1\), the total number of forward linear systems to be solved is less than 2s, so that each of the s processors will solve at most two linear systems. However, when \(t = s\) and ftran-bfrt is nontrivial, one of the s processors is required to solve three linear systems, while the other processors are assigned only two, resulting in an “orphan task”. To avoid this situation, the number of minor iterations is limited to \(t = s-1\) if bound flips have been performed in the previous \(s-2\) iterations.

The arrangement of the task parallel ftran operations discussed above is illustrated in Fig. 1. In the actual implementation, the \(2t + 1\) ftran operations are all started at the same time as parallel tasks and the processors are left to decide which ones to perform.

Fig. 1 Task parallel scheme of all ftran operations in pami

3.3 Candidate persistence and quality control in chuzr

Major chuzr forms the set \({\mathcal {P}}\) and minor chuzr chooses candidates from it. The design of chuzr contributes significantly to the serial efficiency of suboptimization schemes so merits careful discussion.

When suboptimization is performed, the candidate chosen to leave the basis in the first minor iteration is the same as would have been chosen without suboptimization. Thereafter, the candidates remaining in \({\mathcal {P}}\) may be less attractive than the most attractive of the candidates not in \({\mathcal {P}}\) due to the former becoming less attractive and/or the latter becoming more attractive. Indeed, some candidates in \({\mathcal {P}}\) may become unattractive. If candidates in the original \({\mathcal {P}}\) do not enter the basis then the work of their btran operations (and any subsequent updates) is wasted. However, if minor iterations choose less attractive candidates to leave the basis the number of simplex iterations required to solve a given LP problem can be expected to increase. Addressing this issue of candidate persistence is the key algorithmic challenge when implementing suboptimization. The number of candidates in the initial set \({\mathcal {P}}\) must be decided, and a strategy determined for assessing whether a particular candidate should remain in \({\mathcal {P}}\).

For load balancing during the minor initialisation, the initial number of candidates \(s=|{\mathcal {P}}|\) should be an integer multiple of the number of processors used. Multiples larger than one yield better load balance due to the greater amount of work to be parallelised, particularly before and after the minor iterations, but practical experience with pami prototypes demonstrated clearly that this is more than offset by the amount of wasted computation and an increase in the number of iterations required to solve the problem. Thus, for pami, s was chosen to be eight, whatever the number of processors.

During minor iterations, after updating the primal activities of the variables given by the current set \({\mathcal {P}}\), the attractiveness \(\alpha _p\) of each \(p\in {\mathcal {P}}\) is assessed relative to its initial value \(\alpha _p^i\) by means of a cutoff factor \(\psi >0\). Specifically, if

$$\begin{aligned} \alpha _p < \psi \alpha _p^i, \end{aligned}$$

then index p is removed from \({\mathcal {P}}\). Clearly if the variable becomes feasible or unattractive (\(\alpha _p\le 0\)) then it is dropped whatever the value of \(\psi \).
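In code, this candidate quality control amounts to a simple filter applied after each minor update (a schematic illustration; the names are ours, not hsol's).

```python
def prune_candidates(P, alpha, alpha_init, psi=0.95):
    """Keep p only while it is still infeasible and attractive enough: alpha_p >= psi * alpha_p^i."""
    return [p for p in P if alpha[p] > 0 and alpha[p] >= psi * alpha_init[p]]
```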

To determine the value of \(\psi \) to use in pami, a series of experiments was carried out using the reference set of 30 LP problems given in Table 3 of Sect. 5.1, with cutoff ratios ranging from 1.001 to 0.01. Computational results are presented in Table 2, which gives the (geometric) mean speedup factor and the number of problems for which the speedup factor is at least 1.6, 1.8 and 2.0 respectively.

Table 2 Experiments with different cutoff factors for controlling candidate quality in pami

The cutoff ratio \(\psi = 1.001\) corresponds to a special situation, in which only candidates whose attractiveness has improved are retained. As might be expected, the speedup with this value of \(\psi \) is poor. The cutoff ratio \(\psi = 0.999\) corresponds to a boundary situation where candidates whose attractiveness decreases are dropped. A mean speedup of 1.52 is achieved.

For the various cutoff ratios in the range \(0.9 \le \psi \le 0.999\), there is no real difference in the performance of pami: the mean speedup and larger speedup counts are relatively stable. Starting from \(\psi = 0.9\), decreasing the cutoff factor further results in a clear decrease in the mean speedup, although the larger speedup counts remain stable until \(\psi = 0.5\).

In summary, the experiments suggest that any value in the interval [0.9, 0.999] can be chosen as the cutoff ratio, with pami using the median value \(\psi = 0.95\).

3.4 Hyper-sparse LP problems

In the discussions above, when exploiting data parallelism in vector operations it is assumed that one independent scalar calculation must be performed for most of the components of the vector. For example, in update-dual and update-primal a multiple of the component is added to the corresponding component of another vector. In chuzr and chuzc1 the component (if nonzero) is used to compute and then compare a ratio. Since these scalar calculations need not be performed for zero components of the vector, when the LP problem exhibits hyper-sparsity this is exploited by efficient serial implementations [13]. When the cost of the serial vector operation is reduced in this way it is no longer efficient to exploit data parallelism so, when the density of the vector is below a certain threshold, pami reverts to serial computation. The performance of pami is not sensitive to the thresholds of 5–10% which are used.
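The switch between the two code paths can be sketched as follows (illustrative code; the 10% threshold is taken from the text, while the function name, argument layout and the sign convention \({{\varvec{x}}}_{\scriptscriptstyle B}:= {{\varvec{x}}}_{\scriptscriptstyle B}- \theta _p \widehat{{{\varvec{a}}}}_q\) are assumptions of this sketch).

```python
import numpy as np

DENSITY_THRESHOLD = 0.10        # pami uses thresholds in the 5-10% range

def update_primal(x_B, theta_p, a_hat_q, nonzeros=None):
    """update-primal, exploiting hyper-sparsity of the ftran result when possible."""
    density = 1.0 if nonzeros is None else len(nonzeros) / len(x_B)
    if density < DENSITY_THRESHOLD:
        for i in nonzeros:                  # hyper-sparse result: short serial loop
            x_B[i] -= theta_p * a_hat_q[i]
    else:
        x_B -= theta_p * a_hat_q            # dense result: full-length vector operation,
                                            # worth splitting across processors
```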

4 Single iteration parallelism

This section introduces a relatively simple approach to exploiting parallelism within a single iteration of the dual revised simplex method, yielding the parallel scheme sip. Our approach is a significant development of the work of Bixby and Martin [1], who parallelised only the spmv, chuzc and update-dual operations, having rejected the task parallelism of ftran and ftran-dse as being computationally disadvantageous. It also extends the work of Forrest's Aboca code [4, 5]. On the little evidence available, Aboca needs an additional (partial) btran operation which is not necessary in sip; sip also incorporates the ftran-bfrt and a combined spmv and chuzc1, neither of which is exploited by Aboca.

Our serial simplex solver hsol has an additional ftran-bfrt component for the bound-flipping ratio test. However, naively exploiting task parallelism by simply overlapping this with ftran and ftran-dse is inefficient since the latter is seen in Table 1 to be relatively expensive. This is due to the RHS of ftran-dse being \(\widehat{{{\varvec{e}}}}_p\), which is dense relative to the RHS vectors \({{\varvec{a}}}_q\) of ftran and \({{\varvec{a}}}_{\scriptscriptstyle F}\) of ftran-bfrt. There is also no guarantee in a particular iteration that ftran-bfrt will be required.

The mixed parallelisation scheme of sip is illustrated in Fig. 2, which also indicates the data dependency for each computational component. Note that during chuzc1 there is a distinction between the operations for the original (structural) variables and those for the logical (slack) variables, since the latter correspond to an identity matrix in A. Thereafter, one processor performs ftran in parallel with (any) ftran-bfrt on another processor and update-dual on a third. The scheme assumes at least four processors, but with more than four only the parallelism in spmv and chuzc is enhanced.
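A minimal sketch of this overlap is given below, using Python threads and dense solves purely as stand-ins for hsol's components; all data are randomly generated here, whereas in sip they would come from the preceding btran, spmv and chuzc steps.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(3)
m, nN = 5, 12
B = rng.normal(size=(m, m)) + m * np.eye(m)     # stand-in basis matrix
a_q = rng.normal(size=m)                        # entering column
a_F = np.zeros(m)                               # combined bound-flip column (zero if no flips)
c_hat_N, a_p_N, theta_q = np.abs(rng.normal(size=nN)), rng.normal(size=nN), 0.3

def ftran(M, rhs):
    return np.linalg.solve(M, rhs)              # stand-in for the factorized FTRAN

def update_dual(c_hat, step, a_p):
    return c_hat - step * a_p                   # update-dual is a simple vector operation

with ThreadPoolExecutor(max_workers=3) as pool:
    fut_ftran = pool.submit(ftran, B, a_q)      # column needed for update-primal and the FT update
    fut_bfrt = pool.submit(ftran, B, a_F)       # ftran-bfrt (trivial when a_F is zero)
    fut_dual = pool.submit(update_dual, c_hat_N, theta_q, a_p_N)
    a_hat_q, a_hat_F, c_hat_N = fut_ftran.result(), fut_bfrt.result(), fut_dual.result()
```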

Fig. 2 sip data dependency and parallelisation scheme

5 Computational results

5.1 Test problems

Throughout this report, the performance of the simplex solvers is assessed using a reference set of 30 LP problems listed in Table 3. Most of these are taken from a comprehensive list of representative LP problems [19] maintained by Mittelmann.

Table 3 The reference set of 30 LP problems with hyper-sparsity measures

The problems in this reference set reflect the wide spread of LP properties and revised simplex characteristics, including the dimension of the linear systems (number of rows), the density of the coefficient matrix (average number of non-zeros per column), and the extent to which they exhibit hyper-sparsity (indicated by the last two columns). These columns, headed ftran and btran, give the proportion of the results of ftran and btran with a density below 10%, the criterion used to measure hyper-sparsity by Hall and McKinnon [13] who consider an LP problem to be hyper-sparse if the occurrence of such hyper-sparse results is greater than 60%. According to this measurement, half of the reference set are hyper-sparse. Since all problems are sparse, it is convenient to use the term “dense” to refer to those which are not hyper-sparse (Table 3).

The performance of pami and sip is assessed using experiments performed on a workstation with two Intel Xeon E5-2670s, 2.6 GHz (16 cores, 16 threads in total), using eight threads for the parallel calculations. Numerical results are given in Tables 4 and 5, where mean values of speedup or other relative performance measures are computed geometrically. The relative performance of solvers is also well illustrated using the performance profiles in Figs. 3, 4 and 5.

5.2 Performance of pami

The efficiency of pami is appropriately assessed in terms of parallel speedup and performance relative to the sequential dual simplex solver (hsol) from which it was developed. The former indicates the efficiency of the parallel implementation and the latter measures the impact of suboptimization on serial performance. A high degree of parallel efficiency would be of little value if it came at the cost of severe serial inefficiency. The solution times for hsol and pami running in serial, together with pami running in parallel with 8 cores, are listed in columns headed hsol, pami1 and pami8 respectively in Table 4. These results are also illustrated via a performance profile in Fig. 3 which, to put the results in a broader context, also includes Clp  1.15 [2], the world’s leading open-source solver. Note that since hsol and pami have no preprocessing or crash facility, these are not used in the runs with Clp.

Table 4 Solution time and iteration counts for hsol, pami, sip, Clp and Cplex
Table 5 Speedup of pami and sip with hyper-sparsity measures
Table 6 Iteration time (ms) and computational component profiling (the percentage of overall solution time) when solving LP problems with hsol

The number of iterations required to solve a given LP problem can vary significantly depending on the solver and/or the algorithmic variant used. Thus, using solution times as the sole measure of computational efficiency is misleading if there is a significant difference in iteration counts for algorithmic reasons. However, this is not the case for hsol and pami. Since pami identifies the same sequence of basis changes whether it is run in serial or parallel, iteration counts can be compared directly: relative to hsol, the number of iterations required by pami is similar, with the mean relative iteration count of 0.96 being marginally in favour of pami. Individual relative iteration counts lie in [0.85, 1.15] with the exception of those for qap12, stp3d and dano3mip_lp which, being 0.67, 0.75 and 0.79 respectively, are significantly in favour of pami. Thus, with the candidate quality control scheme discussed in Sect. 3.3, suboptimization is seen not to compromise the number of iterations required to solve LP problems. Relative to Clp, hsol typically takes fewer iterations, with the mean relative iteration count being 0.70 and extreme values of 0.07 for ns1688926 and 0.11 for dbic1.

It is immediately clear from the performance profile in Fig. 3 that, when using 8 cores, pami is superior to hsol which, in turn, is generally superior to Clp. Observe that the superior performance of pami on 8 cores relative to hsol comes despite pami in serial being inferior to hsol. Specifically, using the mean relative solution times in Table 5, pami on 8 cores is 1.51 times faster than hsol, which is 2.29 times faster than Clp. Even when taking into account that hsol requires 0.70 times the iterations of Clp, the iteration speed of hsol is seen to be 1.60 times faster than Clp: hsol is a high quality dual revised simplex solver.

Since hsol and pami require very similar numbers of iterations, the mean relative solution time of 0.64 for serial pami with respect to hsol reflects the lower iteration speed of pami due to wasted computation. For more than 65% of the reference set pami is twice as fast in parallel, with a mean speedup of 2.34. However, relative to hsol, some of this efficiency is lost in overcoming the wasted computation, lowering the mean speedup relative to hsol to 1.51.

Fig. 3 Performance profile of Clp, hsol, pami and pami8 without preprocessing or crash

For individual problems, there is considerable variance in the speedup of pami over hsol, reflecting the variety of factors which affect performance and the wide range of test problems. For the two problems where pami performs best in parallel, it is flattered by requiring significantly fewer iterations than hsol. However, even if the speedups of 2.58 for qap12 and 2.33 for stp3d are scaled by the relative iteration counts, the resulting relative iteration speedups are still 1.74 and 1.75 respectively. However, for other problems where pami performs well, this is achieved with an iteration count which is similar to that of hsol. Thus the greater solution efficiency due to exploiting parallelism is genuine. Parallel pami is not advantageous for all problems. Indeed, for maros-r7 and Linf_520c, pami is slower in parallel than hsol. For these two problems, serial pami is slower than hsol by factors of 3.48 and 2.75 respectively. In addition, as can be seen in Table 6, a significant proportion of the computation time for hsol is accounted for by invert, which runs in serial on one processor with no work overlapped.

Interestingly, there is no real relation between the performance of pami and problem hyper-sparsity: pami shows almost the same range of good, fair and modest performance across both classes of problems, although the more extreme performances are for dense problems. Amongst hyper-sparse problems, the three where pami performs best are cre-b, sgpf5y6 and stp3d. This is due to the large percentage of the solution time for hsol accounted for by spmv (42.9% for cre-b and 19.2% for stp3d) and ftran-dse (80.7% for sgpf5y6 and 27% for stp3d). In pami, the spmv and ftran-dse components can be performed efficiently as data parallel and task parallel computations respectively, so the larger percentage of solution time accounted for by these components yields a natural source of speedup.

5.3 Performance of sip

For sip, the iteration counts are generally very similar to those of hsol, with the relative values lying in [0.98, 1.06] except for the two highly degenerate problems nug12 and qap12, where sip requires 1.09 and 1.60 times as many iterations respectively. [Note that these two problems are essentially identical, differing only by row and column permutations.] It is clear from Table 5 that the overall performance and mean speedup (1.15) of sip is inferior to that of pami. This is because sip exploits only limited parallelism.

Fig. 4 Performance profile of Cplex, pami8 and sip8 without preprocessing or crash

The worst cases for sip are the hyper-sparse LP problems, where sip typically results in a slowdown. One such example is sgpf5y6, where the proportion of ftran-dse is more than 80% and the total proportion of spmv, chuzc, ftran and update-dual is less than 5%. Therefore, when performing ftran-dse and the rest as task parallel operations, the overall performance is limited by ftran-dse; moreover, the competition for memory access from the other components and the cost of setting up the parallel environment slow down ftran-dse itself.

However, when applied to dense LP problems, the performance of sip is moderate and relatively stable. This is especially so for those instances where pami exhibits a slowdown: for Linf_520c and maros-r7, applying sip achieves speedups of 1.31 and 1.12 respectively.

In summary, sip is a straightforward approach to parallelisation which exploits purely single iteration parallelism and achieves relatively poor speedup for general LP problems compared to pami. However, sip is frequently complementary to pami, achieving speedup when pami results in slowdown.

5.4 Performance relative to Cplex and influence on Xpress

Since commercial LP solvers are now highly developed it is, perhaps, unreasonable to compare their performance with a research code. However, this is done in Fig. 4, which illustrates the performance of Cplex  12.4 [17] relative to pami8 and sip8. Again, Cplex is run without preprocessing or crash. Figure 4 also traces the performance of the better of pami8 and sip8, clearly illustrating that sip and pami are frequently complementary in terms of achieving speedup. Indeed, the performance of the better of sip and pami is comparable with that of Cplex for the majority of the test problems. For a research code this is a significant achievement.

Fig. 5 Performance profile of Cplex, Xpress and Xpress8 with preprocessing and crash

Since developing and implementing the techniques described in this paper, Huangfu has implemented them within the FICO Xpress simplex solver [15]. The performance profile in Fig. 5 demonstrates that when it is advantageous to run Xpress in parallel it enables FICO's solver to match the serial performance of Cplex (which has no parallel simplex facility). Note that for the results in Fig. 5, Xpress and Cplex were run with both preprocessing and crash. The newly-competitive performance of parallel Xpress relative to Cplex is also reflected in Mittelmann's independent benchmarking [19].

6 Conclusions

This report has introduced the design and development of two novel parallel implementations of the dual revised simplex method.

One relatively complicated parallel scheme (pami) is based on a less well known pivoting rule called suboptimization. Although it provides scope for parallelism across multiple iterations, as a pivoting rule suboptimization is generally inferior to the regular dual steepest-edge algorithm. Thus, to control the quality of the pivots, which often declines during pami, a cutoff factor is necessary. A suitable cutoff factor of 0.95 has been found via a series of experiments. For the reference set, pami provides a mean speedup of 1.51 which enables it to out-perform Clp, the best open-source simplex solver.

The other scheme (sip) exploits purely single iteration parallelism. Although its mean speedup of 1.15 is worse than that of pami, it is frequently complementary to pami in achieving speedup when pami results in slowdown.

Although the results in this paper are far from the linear speedup which is the hallmark of many quality parallel implementations of algorithms, to expect such results for an efficient implementation of the revised simplex method applied to general large sparse LP problems is unreasonable. The commercial value of efficient simplex implementations is such that if such linear speedup were possible then it would have been achieved years ago. A measure of the quality of the pami and sip schemes discussed in this paper is that they have formed the basis of refinements made by Huangfu to the Xpress solver which have been considered noteworthy enough to be reported by FICO. With the techniques described in this paper, Huangfu has raised the performance of the Xpress parallel revised simplex solver to that of the world's best commercial simplex solvers. In developing the first parallel revised simplex solver of general utility, this work represents a significant achievement in computational optimization.