Parallel distributed-memory simplex for large-scale stochastic LP problems
Lubin, M., Hall, J.A.J., Petra, C.G. et al. Comput Optim Appl (2013) 55: 571. doi:10.1007/s10589-013-9542-y
Abstract
We present a parallelization of the revised simplex method for large extensive forms of two-stage stochastic linear programming (LP) problems. These problems have been considered too large to solve with the simplex method; instead, decomposition approaches based on Benders decomposition or, more recently, interior-point methods are generally used. However, these approaches do not provide optimal basic solutions, which allow for efficient hot-starts (e.g., in a branch-and-bound context) and can provide important sensitivity information. Our approach exploits the dual block-angular structure of these problems inside the linear algebra of the revised simplex method in a manner suitable for high-performance distributed-memory clusters or supercomputers. While this paper focuses on stochastic LPs, the work is applicable to all problems with a dual block-angular structure. Our implementation is competitive in serial with highly efficient sparsity-exploiting simplex codes and achieves significant relative speed-ups when run in parallel. Additionally, very large problems with hundreds of millions of variables have been successfully solved to optimality. This is the largest-scale parallel sparsity-exploiting revised simplex implementation that has been developed to date and the first truly distributed solver. It is built on novel analysis of the linear algebra for dual block-angular LP problems when solved by using the revised simplex method and a novel parallel scheme for applying product-form updates.
Keywords
Simplex method · Parallel computing · Stochastic optimization · Block-angular

1 Introduction
The structure of problems of the form (1) is known as dual block-angular, or block-angular with linking columns. This structure commonly arises in stochastic optimization as the extensive form (or deterministic equivalent) of two-stage stochastic linear programs with recourse, when the underlying distribution is discrete or when a finite number of samples has been chosen as an approximation [4]. The dual of (1) has a primal, or row-linked, block-angular structure. Linear programs with block-angular structure, both primal and dual, occur in a wide array of applications, and this structure can also be identified within general LPs [2].
Borrowing the terminology from stochastic optimization, we say that the vector x_{0} contains the first-stage variables and the vectors x_{1},…,x_{N} the second-stage variables. The matrices W_{1},W_{2},…,W_{N} contain the coefficients of the second-stage constraints, and the matrices T_{1},T_{2},…,T_{N} those of the linking constraints. Each diagonal block corresponds to a scenario, a realization of a random variable. Although we adopt this specialist terminology, our work applies to any LP of the form (1).
Block-angular LPs are natural candidates for decomposition procedures that take advantage of their special structure. Such procedures are of interest both because they permit the solution of much larger problems than could be solved with general algorithms for unstructured problems and because they typically offer a natural scope for parallel computation, presenting an opportunity to significantly decrease the required time to solution. Our primary focus is on the latter motivation.
Existing parallel decomposition procedures for dual block-angular LPs are reviewed by Vladimirou and Zenios [32]. Subsequent to their review, Linderoth and Wright [24] developed an asynchronous approach combining ℓ_{∞} trust regions with Benders decomposition on a large computational grid. Decomposition inside interior-point methods applied to the extensive form has been implemented in the state-of-the-art software package OOPS [14] as well as by some of the authors in PIPS [25].
These parallel decomposition approaches, based on either Benders decomposition or specialized linear algebra inside interior-point methods, have successfully demonstrated both parallel scalability on appropriately sized problems and capability to solve very large instances. However, each has algorithmic drawbacks. Neither of the approaches produces an optimal basis for the extensive form (1), which would allow for efficient hot-starts when solving a sequence of related LPs, whether in the context of branch-and-bound or real-time control, and may also provide important sensitivity information.
While techniques exist to warm-start Benders-based approaches, such as in [24], as well as interior-point methods to a limited extent, in practice the simplex method is considered to be the most effective for solving sequences of related LPs. This intuition drove us to consider yet another decomposition approach, which we present in this paper, one in which the simplex method itself is applied to the extensive form (1) and its operations are parallelized according to the special structure of the problem. Conceptually, this is similar to the successful approach of linear algebra decomposition inside interior-point methods.
Exploiting primal block-angular structure in the context of the primal simplex method was considered in the 1960s by, for example, Bennett [3] and summarized by Lasdon [23, p. 340]. Kall [20] presented a similar approach in the context of stochastic LPs, and Strazicky [29] reported results from an implementation, both solving the dual of the stochastic LPs as primal block-angular programs. These works focused solely on the decrease in computational and storage requirements obtained by exploiting the structure. As serial algorithms, these specialized approaches have not been shown to perform better than efficient modern simplex codes for general LPs and so are considered unattractive and unnecessarily complicated as solution methods; see, for example, the discussion in [4, p. 226]. Only recently (with the notable exception of [28]) have the opportunities for parallelism been considered, and so far only in the context of the primal simplex method; Hall and Smith [18] have developed a high-performance shared-memory primal revised simplex solver for primal block-angular LP problems. To our knowledge, a successful parallelization of the revised simplex method for dual block-angular LP problems has not yet been published. We present here our design and implementation of a distributed-memory parallelization of both the primal and dual simplex methods for dual block-angular LPs.
2 Revised simplex method for general LPs
We review the primal and dual revised simplex methods for general LPs, primarily in order to establish our notation. We assume that the reader is familiar with the mathematical algorithms. Following this section, the computational components of the primal and dual algorithms will be treated in a unified manner to the extent possible.
In the simplex method, the indices of variables are partitioned into sets \(\mathcal{B}\), corresponding to the m basic variables \(\boldsymbol{x}_{\scriptscriptstyle B}\), and \(\mathcal{N}\), corresponding to the n−m nonbasic variables \(\boldsymbol{x}_{\scriptscriptstyle N}\), such that the basis matrix B formed from the columns of A corresponding to \(\mathcal{B}\) is nonsingular. The set \(\mathcal{B}\) itself is conventionally referred to as the basis. The columns of A corresponding to \(\mathcal{N}\) form the matrix N. The components of c corresponding to \(\mathcal{B}\) and \(\mathcal{N}\) are referred to as, respectively, the basic costs \(\boldsymbol{c}_{\scriptscriptstyle B}\) and nonbasic costs \(\boldsymbol{c}_{\scriptscriptstyle N}\).
When the simplex method is used, for a given partition \(\{\mathcal{B}, \mathcal{N}\}\) the values of the primal variables are defined to be \(\boldsymbol{x}_{\scriptscriptstyle N}=\boldsymbol{0}\) and \(\boldsymbol{x}_{\scriptscriptstyle B}=B^{-1}\boldsymbol{b}=:\hat{\boldsymbol{b}}\), and the values of the nonbasic dual variables are defined to be \(\hat{\boldsymbol{c}}_{\scriptscriptstyle N}=\boldsymbol{c}_{\scriptscriptstyle N}-{N}^{T}{B} ^{-T}\boldsymbol{c}_{\scriptscriptstyle B}\). The aim of the simplex method is to identify a partition characterized by primal feasibility (\(\hat{\boldsymbol{b}}\geq \boldsymbol{0}\)) and dual feasibility (\(\hat{\boldsymbol{c}}_{\scriptscriptstyle N}\geq \boldsymbol{0}\)). Such a partition corresponds to an optimal solution of (2).
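To make these definitions concrete, the following minimal sketch (with hypothetical 2×2 data, not taken from the paper) computes the basic solution x_B = B^{-1}b and the reduced costs ĉ_N = c_N − N^T B^{-T} c_B:

```python
# Minimal illustration of the basic solution and reduced costs for a
# partition {B, N}. The data are hypothetical, chosen only so that the
# arithmetic is easy to follow by hand.

def solve2(M, rhs):
    """Solve a 2x2 system M x = rhs by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(rhs[0] * M[1][1] - M[0][1] * rhs[1]) / det,
            (M[0][0] * rhs[1] - rhs[0] * M[1][0]) / det]

# Constraint matrix: first two columns are basic (B), last one nonbasic (N).
B = [[2.0, 0.0],
     [0.0, 1.0]]
N_col = [1.0, 1.0]          # the single nonbasic column
b = [4.0, 3.0]
c_B = [1.0, 2.0]
c_N = [0.5]

x_B = solve2(B, b)                       # x_B = B^{-1} b
pi = solve2([[B[0][0], B[1][0]],         # pi = B^{-T} c_B (transpose solve)
             [B[0][1], B[1][1]]], c_B)
c_hat_N = [c_N[0] - (N_col[0] * pi[0] + N_col[1] * pi[1])]

print(x_B)      # [2.0, 3.0]  -> primal feasible (nonnegative)
print(c_hat_N)  # [-2.0]      -> dual infeasible: an entering candidate
```

Here the partition is primal feasible but not dual feasible, so the nonbasic column with negative reduced cost would be selected to enter the basis.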
2.1 Primal revised simplex
The CHUZR (choose row) operation determines the variable to leave the basis, with p being used to denote the index of the row in which the leaving variable occurred, referred to as the pivotal row. The index of the leaving variable itself is denoted by p′. Once the indices q and p′ have been interchanged between the sets \(\mathcal{B}\) and \(\mathcal{N}\), a basis change is said to have occurred. The vector \(\hat{\boldsymbol{b}}\) is then updated to correspond to the increase \(\alpha=\hat{b}_{p}/\hat{a}_{pq}\) in x_{q}.
Before the next iteration can be performed, one must update the reduced costs and obtain a representation of the new matrix B^{−1}. The reduced costs are updated by computing the pivotal row \(\hat{\boldsymbol{a}}_{p}^{T}=\boldsymbol{e}_{p}^{T}B^{-1}N\) of the standard simplex tableau. This is obtained in two steps. First the vector \(\boldsymbol{\pi}_{p}^{T}=\boldsymbol{e}_{p}^{T}B^{-1}\) is formed by using the representation of B^{−1} in an operation known as BTRAN (backward transformation); then the vector \(\hat{\boldsymbol{a}}_{p}^{T}=\boldsymbol{\pi}_{p}^{T}N\) of values in the pivotal row is formed. This sparse matrix-vector product with N is referred to as PRICE. Once the reduced costs have been updated, the UPDATE operation modifies the representation of B^{−1} according to the basis change. Note that, periodically, it will generally be either more efficient or necessary for numerical stability to find a new representation of B^{−1} using the INVERT operation.
2.2 Dual revised simplex
While the primal simplex method has historically been more prominent, it is now widely accepted that the dual simplex method generally has superior performance. Dual simplex is often the default algorithm in commercial solvers, and it is also used inside branch-and-bound algorithms.
Given an initial partition \(\{\mathcal{B}, \mathcal{N}\}\) and corresponding values for the basic and nonbasic primal and dual variables, the dual simplex method aims to find an optimal solution of (2) by maintaining dual feasibility and seeking primal feasibility. Thus optimality is achieved when the basic variables \(\hat{\boldsymbol{b}}\) are non-negative.
2.3 Computational techniques
Today’s highly efficient implementations of the revised simplex method are a product of over 60 years of refinements, both in the computational linear algebra and in the mathematical algorithm itself, which have increased its performance by many orders of magnitude. A reasonable treatment of the techniques required to achieve an efficient implementation is beyond the scope of this paper; the reader is referred to the works of Koberstein [21] and Maros [27] for the necessary background. Our implementation includes these algorithmic refinements to the extent necessary in order to achieve serial execution times comparable with existing efficient solvers. This is necessary, of course, for any parallel speedup we obtain to have practical value. Our implementation largely follows that of Koberstein, and advanced computational techniques are discussed only when their implementation is nontrivial in the context of the parallel decomposition.
3 Parallel computing
This section provides the necessary background in the parallel-computing concepts relevant to the present work. A fuller and more general introduction to parallel computing is given by Grama et al. [15].
3.1 Parallel architectures
When classifying parallel architectures, an important distinction is between distributed memory, where each processor has its own local memory, and shared memory, where all processors have access to a common memory. We target distributed-memory architectures because of their availability and because they offer the potential to solve much larger problems. However, our parallel scheme could be implemented on either.
3.2 Speedup and scalability
In general, success in parallelization is measured in terms of speedup: the ratio of the time required to solve a problem with a single process to the time required with multiple parallel processes. The traditional goal is a speedup factor equal to the number of cores and/or nodes used. Such a factor is referred to as linear speedup and corresponds to a parallel efficiency of 100 %, where parallel efficiency is defined as the percentage of the ideal linear speedup obtained empirically. The increase in available processor cache per unit of computation as the number of parallel processes grows occasionally leads to the phenomenon of superlinear speedup.
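These measures follow directly from wall-clock timings; a small sketch with hypothetical numbers:

```python
# Speedup and parallel efficiency from wall-clock timings.
# The timings below are hypothetical, for illustration only.

def speedup(t_serial, t_parallel):
    """Ratio of single-process time to parallel time."""
    return t_serial / t_parallel

def efficiency(t_serial, t_parallel, num_processes):
    """Percentage of ideal linear speedup achieved."""
    return 100.0 * speedup(t_serial, t_parallel) / num_processes

t1, t32 = 6400.0, 250.0          # seconds on 1 and 32 processes
print(speedup(t1, t32))          # 25.6x speedup
print(efficiency(t1, t32, 32))   # 80 % of linear speedup
```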
3.3 MPI (Message Passing Interface)
Broadcast (MPI_Bcast) is a simple operation in which data that are locally stored on only one process are broadcast to all.
Reduce (MPI_Allreduce) combines data from each MPI process using a specified operation, and the result is returned to all MPI processes. One may use this, for example, to compute a sum to which each process contributes or to scan through values on each process and determine the maximum/minimum value and its location.
Gather (MPI_Allgather) collects and concatenates contributions from all processes and distributes the result. For example, given P MPI processes, suppose a (row) vector x_{p} is stored in local memory in each process p. The gather operation can be used to form the combined vector \([ \begin{array}{cccc} x_{1} & x_{2} & \ldots& x_{P} \end{array} ]\) in local memory in each process.
The efficiency of these operations is determined by both their implementation and the physical communication network between nodes, known as an interconnect. The cost of communication depends on both the latency and the bandwidth of the interconnect. The former is the startup overhead for performing communication operations, and the latter is the rate of communication. A more detailed discussion of these issues is beyond the scope of this paper.
4 Linear algebra overview
Here we present from a mathematical point of view the specialized linear algebra required to solve, in parallel, systems of equations with basis matrices from dual block-angular LPs of the form (1). Precise statements of algorithms and implementation details are reserved for Sect. 5.
In the linear algebra community, the scope for parallelism in solving linear systems with matrices of the form (3) is well known and has been used to facilitate parallel solution methods for general unsymmetric linear systems [8]. However, we are not aware of this form having been analyzed or exploited in the context of solving dual block-angular LP problems. For completeness and to establish our notation, we present the essentials of this approach.
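As intuition for this scope for parallelism, consider the simpler bordered block-diagonal case with square, nonsingular diagonal blocks: each block solve proceeds independently, and the coupling is resolved through a small Schur complement on the border. The sketch below uses scalar "blocks" and hypothetical data; it conveys only the intuition, not the rectangular-block procedure analyzed in this section.

```python
# Solving a bordered block-diagonal system by independent block solves
# plus a Schur complement on the border. Scalar "blocks" keep the
# arithmetic transparent; hypothetical data, for intuition only.
#
# System:
#   d1*x1           + c1*x0 = r1
#           d2*x2   + c2*x0 = r2
#   a1*x1 + a2*x2   + d0*x0 = r0
d1, d2, d0 = 2.0, 4.0, 3.0
c1, c2 = 1.0, 2.0
a1, a2 = 1.0, 1.0
r1, r2, r0 = 4.0, 8.0, 6.0

# Step 1 (independent per block): block solves with two right-hand sides.
y1, g1 = r1 / d1, c1 / d1      # block 1
y2, g2 = r2 / d2, c2 / d2      # block 2

# Step 2 (coupling): form and solve the (here 1x1) Schur complement.
schur = d0 - (a1 * g1 + a2 * g2)
x0 = (r0 - (a1 * y1 + a2 * y2)) / schur

# Step 3 (independent per block): back-substitute.
x1 = y1 - g1 * x0
x2 = y2 - g2 * x0

print(x0, x1, x2)  # 1.0 1.5 1.5
```

Steps 1 and 3 involve no coupling between blocks, which is precisely what a distributed implementation exploits.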
Let the dimensions of the \(W_{i}^{B}\) block be \(m_{i} \times n_{i}^{B}\), where m_{i} is the number of rows in W_{i} and is fixed while \(n_{i}^{B}\) is the number of basic variables from the ith block and varies. Similarly let A^{B} be \(m_{0} \times n_{0}^{B}\), where m_{0} is the number of rows in A and is fixed and \(n_{0}^{B}\) is the number of linking variables in the basis. Then \(T_{i}^{B}\) is \(m_{i} \times n_{0}^{B}\).
In the remainder of this section, we will prove the following result, which underlies the specialized solution procedure for singly bordered block-diagonal linear systems.
Result
Any square, nonsingular matrix with singly bordered block-diagonal structure can be reduced to trivial form, up to a row permutation, by a sequence of invertible transformations that may be computed independently for each diagonal block.
Proof
This procedure is illustrated in Fig. 3. If we then perform an LU factorization of the first-stage block M, the entire procedure may be viewed as forming an LU factorization of B through a restricted sequence of pivot choices. Note that the sparsity and numerical stability of this LU factorization are expected to be inferior to those of an unstructured factorization with unrestricted pivot choices. It is therefore important to exploit fully whatever freedom remains for maintaining sparsity and numerical stability within the structured factorization. This topic is discussed in Sect. 5.2.
5 Design and implementation
Here we present the algorithmic design and implementation details of our code, PIPS-S, a new simplex code base for dual block-angular LPs written in C++. PIPS-S implements both the primal and dual revised simplex methods.
5.1 Distribution of data
In a distributed-memory algorithm, as described in Sect. 3, we must specify the distribution of data across parallel processes. We naturally expect to distribute the second-stage blocks, but the extent to which they are distributed and how the first stage is treated are important design decisions. We arrived at the following design after reviewing the data requirements of the parallel operations described in the subsequent sections.
Given a set of P MPI processes and N≥P scenarios or second-stage blocks, on initialization we assign each second-stage block to a single MPI process. All data, iterates, and computations relating to the first stage are duplicated in each process. The second-stage data (i.e., W_{i},T_{i},c_{i}, and b_{i}), iterates, and computations are only stored in and performed by their assigned process. If a scenario is not assigned to a process, this process stores no data pertaining to the scenario, not even the basic/nonbasic states of its variables. Thus, in terms of memory usage, the approach scales to an arbitrary number of scenarios.
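A static assignment of scenarios to MPI ranks could be sketched as follows (contiguous chunks; round-robin would work equally well — the actual policy used by PIPS-S is not specified here):

```python
# A simple static assignment of N scenarios to P MPI ranks, balancing
# the load to within one scenario per rank. Hypothetical sketch.

def assign_scenarios(num_scenarios, num_procs):
    """Return owner[i] = rank of the process that stores scenario i."""
    base, extra = divmod(num_scenarios, num_procs)
    owner = []
    for rank in range(num_procs):
        count = base + (1 if rank < extra else 0)
        owner.extend([rank] * count)
    return owner

print(assign_scenarios(10, 4))  # [0, 0, 0, 1, 1, 1, 2, 2, 3, 3]
```

Because a rank stores data only for the scenarios it owns, memory per process shrinks as P grows, which is what allows the approach to scale to an arbitrary number of scenarios.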
5.2 Factorizing a dual block-angular basis matrix
In the INVERT step, one forms an invertible representation of the basis matrix B. This is performed in efficient sparsity-exploiting codes by forming a sparse LU factorization of the basis matrix. Our approach forms these factors implicitly and in parallel.
Sparse LU factorization procedures perform both row and column permutations in order to reduce the number of nonzero elements in the factors [7] while achieving acceptable numerical stability. Permutation matrices P and Q and triangular factors L and U are identified so that PBQ=LU.
We implemented this factorization (7) by modifying the CoinFactorization C++ class, written by John Forrest, from the open-source CoinUtils^{1} package. The methods used in the code are undocumented; however, we determined that it uses a Markowitz-type approach [26] such as that used by Suhl and Suhl [31]. After the factorization is complete, we extract the \(X_{i}T_{i}^{B}\) and \(Z_{i}T_{i}^{B}\) terms and store them for later use.
The Markowitz approach used by CoinFactorization is considered numerically stable because the pivot order is determined dynamically, taking into account numerical cancellation. Simpler approaches that fix a sequence of pivots are known to fail in some cases; see [31], for example. Because our INVERT procedure couples Markowitz factorizations, we expect it to share many of the same numerical properties.
5.3 Solving linear systems with B
Efficient implementations must take advantage of sparsity, not only in the matrix factors, but also in the right-hand side vectors, exploiting hyper-sparsity [17], when applicable. We reused the routines from CoinFactorization for solves with the factors. These include hyper-sparse solution routines based on the original work in [11], where a symbolic phase computes the sparsity pattern of the solution vector and then a numerical phase computes only the nonzero values in the solution.
The solution procedure exhibits parallelism in the calculations that are performed per block, that is, in Steps 1, 4, and 5. An MPI_Allgather communication operation is required at Step 2, and the first-stage calculation in Step 3 is duplicated, in serial in each process.
This computational pattern changes somewhat when the right-hand side vector is structured, as is the case in the most common FTRAN step, which computes the pivotal column \(\hat{a}_{q} = B^{-1}a_{q}\), where a_{q} is a column of the constraint matrix. If the entering column in FTRAN is from second-stage block j, then r_{i} has nonzero elements only for i=j. Since Step 1 is then trivial for all other second-stage blocks, it becomes a serial bottleneck. Fortunately we can expect Step 1 to be relatively inexpensive because the right-hand side will be a column from W_{j}, which should have few non-zero elements. Additionally, Step 2 becomes a broadcast instead of a gather operation.
5.4 Solving linear systems with B^{T}
Hyper-sparsity in solves with the factors is handled as previously described. Note that at Step 3 we must sum an arbitrary number of sparse vectors across MPI processes. This step is performed by using a dense buffer and MPI_Allreduce, then making a pass through the result to build an index of the nonzero elements in q_{0}. The overhead of this pass is small compared with that of the communication operation itself.
As for linear systems with B, there is an important special case for the most common BTRAN step in which the right-hand side to BTRAN is entirely zero except for a unit element in the position corresponding to the row/variable that has been selected to leave the basis. If a second-stage variable has been chosen to leave the basis, Steps 1 and 2 become trivial for all but the corresponding second-stage block. Fortunately, as before, we can expect these steps to be relatively inexpensive because of the sparsity of the right-hand sides. Similarly, Step 3 may be implemented as a broadcast operation instead of a reduce operation.
5.5 Matrix-vector product with the block-angular constraint matrix
The procedure to compute (10) in parallel is immediately evident from the right-hand side. The terms involving \(W_{i}^{N}\) and \(T_{i}^{N}\) may be computed independently by column-wise or row-wise procedures, depending on the sparsity of each π_{i} vector. Then a reduce operation is required to form \(\sum_{i=1}^{N} (T_{i}^{N})^{T}\pi_{i}\).
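The structured PRICE operation can be sketched as follows: the per-block products are independent (the parallel step), and only the first-stage contribution requires a reduce. The tiny dense data here are hypothetical.

```python
# Sketch of the structured PRICE operation: per-block products computed
# independently, followed by a sum (the reduce) for the first-stage
# part. Tiny hypothetical dense data.

def matT_vec(M, v):
    """Compute M^T v for a dense matrix stored row-wise."""
    n = len(M[0])
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(n)]

# Two second-stage blocks: one nonbasic column each in W_i^N and a
# shared first-stage column appearing in each T_i^N.
W1_N = [[1.0], [2.0]]
W2_N = [[3.0], [0.0]]
T1_N = [[1.0], [0.0]]
T2_N = [[0.0], [1.0]]
pi1 = [1.0, 1.0]
pi2 = [2.0, 1.0]

# Independent per-block products (parallel step).
row_W = [matT_vec(W1_N, pi1), matT_vec(W2_N, pi2)]
# First-stage contribution: sum over blocks (reduce step).
row_T = [a + b for a, b in zip(matT_vec(T1_N, pi1), matT_vec(T2_N, pi2))]

print(row_W)  # [[3.0], [6.0]]
print(row_T)  # [2.0]
```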
5.6 Updating the inverse of the dual block-angular basis matrix
Multiple approaches exist for updating the invertible representation of the basis matrix following each iteration, in which a single column of the basis matrix is replaced. Updating the LU factors is generally considered the most efficient, in terms of both speed and numerical stability. Such an approach was considered but not implemented. We provide some brief thoughts on the issue, both for the interested reader and to indicate the difficulty of the task given the specialized structure of the representation of the basis inverse. The factors from the second-stage block could be updated by extending the ideas of Forrest and Tomlin [10] and Suhl and Suhl [30]. The first-stage matrix M, however, is significantly more difficult. With a single basis update, large blocks of elements in M could be changed, and the dimension of M could potentially increase or decrease by one. Updating this block with external Schur-complement-type updates as developed in [5] is one possibility.
Given the complexity of updating the basis factors, we first implemented product-form updates, which do not require any modification of the factors. While this approach generally has larger storage requirements and can exhibit numerical instability with long sequences of updates, these downsides are mitigated by invoking INVERT more frequently. Early reinversion is triggered if numerical instability is detected. Empirically, a reinversion frequency on the order of 100 iterations performed well on the problems tested, although the optimal choice for this number depends on both the size and numerical difficulty of a given problem. Product-form updates performed sufficiently well in our tests that we decided to not implement more complex procedures at this time. We now review briefly the product-form update and then describe our parallel implementation.
5.6.1 Product-form update
We introduce a small variation of this procedure for dual block-angular bases. An implicit permutation is applied with each eta matrix in order to preserve the structure of the basis (an entering variable is placed at the end of its respective block). With each η vector, we store both an entering index and a leaving index. We refer to the leaving index as the pivotal index.
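In serial, applying a sequence of product-form updates during FTRAN amounts to the following loop. This is a generic sketch of the classical product-form update with hypothetical data; the implicit permutation and entering/leaving bookkeeping described above are omitted.

```python
# Applying a sequence of product-form eta updates during FTRAN.
# Each eta is stored as (pivot_position, eta_vector), where eta_vector
# is the pivotal column from the iteration that created the update.
# Generic sketch with hypothetical data; PIPS-S additionally carries
# entering and leaving indices to preserve the block structure.

def apply_etas_ftran(v, etas):
    """Compute E_k^{-1} ... E_1^{-1} v in place."""
    for p, eta in etas:
        mult = v[p] / eta[p]          # multiplier from the pivotal entry
        for i in range(len(v)):
            if i != p:
                v[i] -= eta[i] * mult
        v[p] = mult
    return v

# One update with pivot in position 1.
etas = [(1, [0.5, 2.0, 1.0])]
print(apply_etas_ftran([1.0, 4.0, 3.0], etas))  # [0.0, 2.0, 1.0]
```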
5.6.2 Product-form update for parallel computation
Consider the requirements for applying an eta matrix during FTRAN to a structured right-hand side vector with components distributed by block. If the pivotal index for the eta matrix is in a second-stage block, then the element in the pivotal index is stored only in one MPI process, but it is needed by all and so must be broadcast. Following this approach, one would need to perform a communication operation for each eta matrix applied. The overhead of these broadcasts, which are synchronous bottlenecks, would likely be significantly larger than the cost of the floating-point operations themselves. Instead, we implemented a procedure that, at the cost of a small amount of extra computation, requires only one communication operation to apply an arbitrary number of eta matrices. The procedure may be considered a parallel product-form update. The essence of the procedure has been used by Hall in previous work, although the details below have not been published before.
Every MPI process stores all of the pivotal components of each η vector, regardless of which block they belong to, in a rectangular array. After a solve with B is performed (as in Fig. 5), MPI_Allgather is called to collect the elements of the solution vector in pivotal positions. Given all the pivotal elements in both the η vectors and the solution vector, each process can proceed to apply the eta matrices restricted to only these elements in order to compute the sequence of multipliers. Given these multipliers, the eta matrices can then be applied to the local blocks of the entire right-hand side. The communication cost of this approach is the single communication to collect the pivotal elements in the right-hand side, as well as the cost of maintaining the pivotal components of the η vectors.
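The key step can be emulated in serial: from the pivotal components alone (replicated on every process after one all-gather), each process computes the full sequence of multipliers, after which the etas are applied locally with no further communication. This is a sketch of that scheme with hypothetical data.

```python
# Sketch of the parallel product-form update: the multipliers for a
# sequence of eta matrices are computed from the pivotal components
# alone, which every process holds after a single MPI_Allgather; each
# process then applies the etas to its local blocks with no further
# communication. Serial emulation; hypothetical data.

def multipliers_from_pivotal(s, R):
    """s[j] is the right-hand-side component at the pivotal position of
    eta j (collected once by the all-gather). R[k][j] is the component
    of eta k at the pivotal position of eta j (the replicated
    rectangular array). Returns the multiplier for each eta."""
    mults = []
    for k in range(len(R)):
        m = s[k] / R[k][k]
        # Update later pivotal components, as applying eta k would.
        for j in range(k + 1, len(R)):
            s[j] -= R[k][j] * m
        s[k] = m
        mults.append(m)
    return mults

# Two etas with pivotal positions 0 and 2 of a length-3 vector:
# eta 0 = [2.0, 1.0, 0.0], eta 1 = [0.0, 0.5, 4.0].
R = [[2.0, 0.0],   # components of eta 0 at positions 0 and 2
     [0.0, 4.0]]   # components of eta 1 at positions 0 and 2
s = [4.0, 2.0]     # right-hand side at positions 0 and 2
print(multipliers_from_pivotal(s, R))  # [2.0, 0.5]
```

These multipliers match what a full serial application of the etas would produce, so each process can finish the update on its local blocks independently.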
The rectangular array of pivotal components of each η vector also serves to apply eta matrices in the BTRAN operation efficiently when the right-hand side is the unit vector with a single nontrivial element in the leaving index; see Hall and McKinnon [17]. This operation requires no communication once the leaving index is known to all processes and after the rectangular array has been updated correspondingly.
5.7 Algorithmic refinements
In order to improve the iteration count and numerical stability, it is valuable to use a weighted edge-selection strategy in CHUZC (primal) and CHUZR (dual). In PIPS-S, the former operation uses the DEVEX scheme [19], and the latter is based on the exact steepest-edge variant of Forrest and Goldfarb [9] described in [22]. Algorithmically, weighted edge selection is a simple operation with a straightforward parallelization. Each process scans through its local variables (both the local second-stage blocks and the first-stage block) and finds the largest (weighted) local infeasibility. An MPI_Allreduce communication operation then determines the largest value among all processes and returns to all processes its corresponding index.
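The local-then-global selection pattern can be emulated in serial as follows; the weighting formula here is a generic steepest-edge/Devex-style placeholder, and the data are hypothetical.

```python
# Distributed weighted edge selection: each process finds its best
# weighted infeasibility locally; a single all-reduce (MPI's MAXLOC
# pattern) then picks the global winner. Serial emulation over
# "per-process" slices; hypothetical data and weighting formula.

def local_best(infeas, weights, offset):
    """Return (best weighted value, global index) over a local slice."""
    best_val, best_idx = 0.0, -1
    for i, (f, w) in enumerate(zip(infeas, weights)):
        val = f * f / w          # weighted (squared) infeasibility
        if val > best_val:
            best_val, best_idx = val, offset + i
    return best_val, best_idx

# Two "processes", each owning a slice of the variables.
proc0 = local_best([0.0, 3.0], [1.0, 4.0], offset=0)
proc1 = local_best([2.0, 1.0], [1.0, 1.0], offset=2)

# The all-reduce step: take the maximum over processes.
winner = max([proc0, proc1])
print(winner)  # (4.0, 2): variable 2 has the largest weighted infeasibility
```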
Further contributions to improved iteration count and numerical stability come from the use of a two-pass EXPAND [12] ratio test, together with the shifting techniques described by [27] and [22]. We implement the two-pass ratio test in its canonical form, inserting the necessary MPI_Allreduce operations after each pass.
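A simplified two-pass ratio test in the spirit of Harris/EXPAND can be sketched as follows; the data and tolerance are hypothetical, and the bound-shifting machinery of the full EXPAND procedure is omitted.

```python
# Sketch of a two-pass (Harris/EXPAND-style) primal ratio test.
# Pass 1 relaxes the bounds by a small tolerance to determine the
# largest admissible step; pass 2 selects, among rows whose true ratio
# does not exceed that step, the one with the largest pivot magnitude,
# which improves numerical stability. Hypothetical data; in PIPS-S an
# MPI_Allreduce follows each pass.

def two_pass_ratio_test(b_hat, a_col, tol=1e-7):
    cand = [i for i, a in enumerate(a_col) if a > tol]
    if not cand:
        return None, float("inf")   # step is unbounded
    # Pass 1: maximum step permitted by the relaxed bounds.
    alpha_max = min((b_hat[i] + tol) / a_col[i] for i in cand)
    # Pass 2: largest pivot among rows admitted by alpha_max.
    p = max((i for i in cand if b_hat[i] / a_col[i] <= alpha_max),
            key=lambda i: abs(a_col[i]))
    return p, b_hat[p] / a_col[p]

# Two degenerate rows (zero ratio); the larger pivot (row 1) is chosen.
print(two_pass_ratio_test([0.0, 0.0, 5.0], [0.5, 2.0, 1.0]))  # (1, 0.0)
```

A single-pass textbook ratio test would be free to pick row 0 here; preferring the larger pivot is what buys the numerical stability.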
5.8 Updating iterates
After the pivotal column and row are chosen and computed, iterate values and edge weights are updated by using the standard formulas to reflect the basis change. We note that each MPI process already has the sections of the pivotal row and column corresponding to its local variables (both the local second-stage blocks and the first-stage block). The only communication required is a broadcast of the primal and dual step lengths by the processes that own the leaving and entering variables, respectively. If edge weights are used, updating them usually also requires the pivotal element of the simplex tableau and a dot product (a reduce operation), if not an additional linear solve.
6 Numerical experiments
Numerical experiments with PIPS-S were conducted on two distributed-memory architectures available at Argonne National Laboratory. Fusion is a 320-node cluster with an InfiniBand QDR interconnect; each node has two 2.6 GHz Xeon processors (total 8 cores). Most nodes have 36 GB of RAM, while a small number offer 96 GB of RAM. A single node of Fusion is comparable to a high-performance workstation. Intrepid is a Blue Gene/P (BG/P) supercomputer with 40,960 nodes with a custom high-performance interconnect. Each BG/P node, much less powerful in comparison, has a quad-core 850 MHz PowerPC processor with 2 GB of RAM.
Using the stochastic LP test problems described in Sect. 6.1, we present results from problems of three different scales by varying the number of scenarios in the test problems. The smallest instances considered (Sect. 6.2) are those that could be solved on a modern desktop computer. The next set of instances (Sect. 6.3) are large-scale instances that demand the use of the Fusion nodes with 96 GB of RAM to solve in serial. The largest instances considered (Sect. 6.4) would require up to 1 TB of RAM to solve in serial; we use the Blue Gene/P system for these. At the first two scales, we compare our solver, PIPS-S, with the highly efficient, open-source, serial simplex code Clp.^{2} At the largest scale, no comparison is possible. These experiments aim to demonstrate both the scalability and capability of PIPS-S.
In all experiments, primal and dual feasibility tolerances of 10^{−6} were used. It was verified that the optimal objective values reported in each run were equal for each instance. A reinversion frequency of 150 iterations was used in PIPS-S, with earlier reinversion triggered by numerical stability tests. Presolve and internal rescaling, important features of LP solvers that have not yet been implemented in PIPS-S, were disabled in Clp to produce a fair comparison. Otherwise, default options were used. Commercial solvers were unavailable for testing on the Fusion cluster because of licensing constraints. The number of cores reported corresponds to the total number of MPI processes used.
6.1 Test problems
Dimensions of stochastic LP test problems
| Test problem | 1st-stage vars. | 1st-stage cons. | 2nd-stage vars. (per scenario) | 2nd-stage cons. (per scenario) | Nonzeros in A | Nonzeros in W_i | Nonzeros in T_i |
|---|---|---|---|---|---|---|---|
| Storm | 121 | 185 | 1,259 | 528 | 696 | 3,220 | 121 |
| SSN | 89 | 1 | 706 | 175 | 89 | 2,284 | 89 |
| UC12 | 3,132 | 0 | 56,532 | 59,436 | 0 | 163,839 | 3,132 |
| UC24 | 6,264 | 0 | 113,064 | 118,872 | 0 | 327,939 | 6,264 |
The “UC12” and “UC24” problems are stochastic unit commitment problems developed at Argonne National Laboratory by Victor Zavala. See [25] for details of a stochastic economic dispatch model with similar structure. The problems aim to choose optimal on/off schedules for generators on the power grid of the state of Illinois over a 12-hour and 24-hour horizon, respectively. The stochasticity considered is that of the availability of wind-generated electricity, which can be highly variable. In practice each scenario would be the result of a weather simulation. For testing purposes only, we generate these scenarios by normal perturbations. Each second-stage scenario incorporates (direct-current) transmission constraints corresponding to the physical power grid, and so these scenarios become very large. We consider the LP relaxations, in the context of what would be required in order to solve these problems using a branch-and-bound approach.
In these test problems, only the right-hand side vectors b_{1},b_{2},…,b_{N} vary per scenario; the matrices T_{i} and W_{i} are identical for each scenario. This special structure, common in practice, is not currently exploited by PIPS-S.
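To make the structure concrete, the sketch below assembles the extensive-form constraint matrix with shared T and W blocks as a sparse dual block-angular matrix. This is an illustration only, not the distributed representation used by PIPS-S; the function name and SciPy-based layout are our own.

```python
import numpy as np
import scipy.sparse as sp

def extensive_form(A, b0, T, W, b_list):
    """Assemble the dual block-angular constraint matrix of the
    extensive form: first-stage constraints A on top, and for each
    scenario i a linking block T (in the first-stage columns) beside
    a diagonal block W.  As in the test problems, T and W are shared
    across scenarios; only the right-hand sides b_i differ."""
    N = len(b_list)
    rows = [[A] + [None] * N]          # first-stage constraint row
    for i in range(N):
        row = [T] + [None] * N         # linking columns, then blanks
        row[i + 1] = W                 # scenario i's diagonal block
        rows.append(row)
    M = sp.bmat(rows, format="csr")
    rhs = np.concatenate([b0] + list(b_list))
    return M, rhs
```

With N scenarios, m1 × n1 first-stage and m2 × n2 second-stage blocks, the result is an (m1 + N·m2) × (n1 + N·n2) matrix, matching the instance dimensions quoted above (e.g., UC12 with 32 scenarios: 3,132 + 32 × 56,532 = 1,812,156 variables).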
6.2 Solution from scratch
Solves from scratch (all-slack basis) using dual simplex. Storm instance has 8,192 scenarios, 10,313,849 variables, and 4,325,561 constraints. SSN instance has 8,192 scenarios, 5,783,651 variables, and 1,433,601 constraints. UC12 instance has 32 scenarios, 1,812,156 variables, and 1,901,952 constraints. UC24 instance has 16 scenarios, 1,815,288 variables, and 1,901,952 constraints. Runs performed on nodes of the Fusion cluster
Test problem | Solver | Nodes | Cores | Iterations | Solution time (sec.) | Iter./sec.
---|---|---|---|---|---|---
Storm | Clp | 1 | 1 | 6,706,401 | 133,047 | 50.4
Storm | PIPS-S | 1 | 1 | 6,353,593 | 385,825 | 16.5
Storm | PIPS-S | 1 | 4 | 6,357,445 | 108,517 | 58.6
Storm | PIPS-S | 1 | 8 | 6,343,352 | 52,948 | 119.8
Storm | PIPS-S | 2 | 16 | 6,351,493 | 28,288 | 224.5
Storm | PIPS-S | 4 | 32 | 6,347,643 | 15,667 | 405.2
SSN | Clp | 1 | 1 | 1,175,282 | 12,619 | 93.1
SSN | PIPS-S | 1 | 1 | 1,025,279 | 58,425 | 17.5
SSN | PIPS-S | 1 | 4 | 1,062,776 | 16,511 | 64.4
SSN | PIPS-S | 1 | 8 | 1,055,422 | 7,788 | 135.5
SSN | PIPS-S | 2 | 16 | 1,051,860 | 3,865 | 272.1
SSN | PIPS-S | 4 | 32 | 1,046,840 | 1,931 | 542.1
UC12 | Clp | 1 | 1 | 2,474,175 | 39,722 | 62.3
UC12 | PIPS-S | 1 | 1 | 1,968,400 | 236,219 | 8.3
UC12 | PIPS-S | 1 | 4 | 2,044,673 | 86,834 | 23.5
UC12 | PIPS-S | 1 | 8 | 1,987,608 | 39,033 | 50.9
UC12 | PIPS-S | 2 | 16 | 2,063,507 | 27,902 | 74.0
UC12 | PIPS-S | 4 | 32 | 2,036,306 | 16,255 | 125.3
UC24 | Clp | 1 | 1 | 2,441,374 | 41,708 | 58.5
UC24 | PIPS-S | 1 | 1 | 2,142,962 | 543,272 | 3.9
UC24 | PIPS-S | 1 | 4 | 2,204,729 | 182,370 | 12.1
UC24 | PIPS-S | 1 | 8 | 2,253,199 | 101,893 | 22.1
UC24 | PIPS-S | 2 | 16 | 2,270,728 | 60,887 | 37.3
We observe that Clp is faster than PIPS-S in serial on all instances; however, the total number of iterations performed by PIPS-S is consistent with that of Clp, empirically validating our implementation of the pricing strategies. Significant parallel speedups are observed in all cases, and PIPS-S is 5 and 8 times faster than Clp for SSN and Storm, respectively, when using four nodes (32 cores). The parallel speedups on the UC12 and UC24 instances are smaller, possibly because of the smaller number of scenarios and the larger dimensions of the first stage.
6.3 Larger instances with advanced starts
We next consider larger instances with 20–40 million total variables. The high-memory nodes of the Fusion cluster with 96 GB of RAM were required for these tests. Given the long times to solution for the smaller instances solved in the previous section, it is impractical to solve these larger instances from scratch. Instead, we consider using advanced or near-optimal starting bases in two different contexts. In Sect. 6.3.1 we generate starting bases by taking advantage of the structure of the problem. In Sect. 6.3.2 we attempt to simulate the problems solved by dual simplex inside a branch-and-bound node.
6.3.1 Advanced starts from exploiting structure
Solves from advanced starting bases using primal simplex. Both instances have 32,768 scenarios. Starting bases were generated by using a subset of 16,386 scenarios. Storm instance has 41,255,033 variables and 17,301,689 constraints. SSN instance has 23,134,297 variables and 5,734,401 constraints. Runs performed on nodes of Fusion cluster. Asterisk indicates that high-memory nodes were required
Test problem | Solver | Nodes | Cores | Iterations | Solution time (sec.) | Iter./sec.
---|---|---|---|---|---|---
Storm | Clp | 1* | 1 | 16,247 | 7,537 | 2.2
Storm | PIPS-S | 1* | 1 | 9,026 | 7,184 | 1.3
Storm | PIPS-S | 1* | 4 | 6,598 | 662 | 10.0
Storm | PIPS-S | 1* | 8 | 10,899 | 486 | 22.4
Storm | PIPS-S | 2* | 16 | 6,519 | 137 | 47.6
Storm | PIPS-S | 4 | 32 | 5,776 | 61.5 | 93.9
Storm | PIPS-S | 8 | 64 | 7,509 | 47.3 | 158.8
Storm | PIPS-S | 16 | 128 | 7,691 | 35.5 | 216.6
Storm | PIPS-S | 32 | 256 | 6,572 | 25.2 | 260.4
SSN | Clp | 1* | 1 | 99,303 | 50,737 | 2.0
SSN | PIPS-S | 1* | 1 | 353,354 | 427,648 | 0.8
SSN | PIPS-S | 1* | 4 | 239,882 | 58,621 | 4.1
SSN | PIPS-S | 1* | 8 | 235,039 | 22,485 | 10.5
SSN | PIPS-S | 2 | 16 | 219,050 | 9,550 | 22.9
SSN | PIPS-S | 4 | 32 | 193,565 | 4,134 | 46.8
SSN | PIPS-S | 8 | 64 | 219,560 | 2,365 | 92.8
SSN | PIPS-S | 16 | 128 | 212,269 | 1,481 | 143.3
SSN | PIPS-S | 32 | 256 | 200,979 | 1,117 | 180.0
Clp remains faster than PIPS-S in serial on these instances, although by a smaller factor than before. The parallel scalability of PIPS-S is almost ideal (>90 % parallel efficiency) up to 4 nodes (32 cores) and continues to scale well up to 16 nodes (128 cores); scaling from 16 to 32 nodes is poor. On 16 nodes, the iteration speed of PIPS-S is nearly 100× that of Clp for Storm and 70× that of Clp for SSN. A curious instance of superlinear scaling is observed within a single node; this could be caused by properties of the memory hierarchy (e.g., the per-process working set shrinking to fit in processor cache).
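The efficiency figures quoted here can be recomputed directly from the iteration-speed column of the table; the helper below is our own, shown with the Storm advanced-start rows as input.

```python
def parallel_efficiency(base_rate, base_cores, rate, cores):
    """Parallel efficiency relative to a baseline run: speedup in
    iteration rate divided by the increase in core count.
    A value of 1.0 is ideal scaling; above 1.0 is superlinear."""
    return (rate / base_rate) / (cores / base_cores)

# Storm, advanced start: 1.3 iter/sec serially vs. 93.9 on 32 cores,
# i.e. superlinear relative to the serial run.
eff_32 = parallel_efficiency(1.3, 1, 93.9, 32)

# Scaling from 16 nodes (128 cores) to 32 nodes (256 cores) is weaker:
eff_256 = parallel_efficiency(216.6, 128, 260.4, 256)
```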
We observe that the advanced starting bases are indeed near-optimal, particularly for Storm, where approximately ten thousand iterations are required to solve a problem with 41 million variables.
6.3.2 Dual simplex inside branch and bound
The dual simplex method is generally used inside branch-and-bound algorithms because the optimal basis obtained at the parent node remains dual feasible after variable bounds are changed.
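The reason is that reduced costs, and hence dual feasibility, do not depend on the variable bounds. A small NumPy check with made-up data illustrates this; the matrices here are purely illustrative and unrelated to the test instances.

```python
import numpy as np

# For a basis matrix B of [B | N_mat], the nonbasic reduced costs are
# d = c_N - N_mat^T y, where B^T y = c_B (the simplex BTRAN step).
# Variable bounds never enter this formula, so tightening a bound at a
# branch-and-bound node leaves d unchanged and the basis remains dual
# feasible; only primal feasibility can be lost.
B = np.array([[2.0, 1.0],
              [1.0, 3.0]])            # basis matrix (made-up data)
N_mat = np.eye(2)                     # nonbasic columns
c_B = np.array([1.0, 2.0])
c_N = np.array([3.0, 4.0])
y = np.linalg.solve(B.T, c_B)         # y = [0.2, 0.6]
d = c_N - N_mat.T @ y                 # reduced costs: [2.8, 3.4]
```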
Iteration counts and solution times for three reoptimization problems intended to simulate the work performed at a branch-and-bound node. UC12 instance has 512 scenarios, 28,947,516 variables, and 30,431,232 constraints. UC24 instance has 256 scenarios, 28,950,648 variables, and 30,431,232 constraints. Runs performed on Fusion cluster. Asterisk indicates that high-memory nodes were required
Test problem | Solver | Nodes | Cores | Iterations | Solution time (sec.) | Avg. iter./sec.
---|---|---|---|---|---|---
UC12 | Clp | 1* | 1 | 10,370/13,495/5,888 | 15,205/19,782/7,022 | 0.73
UC12 | PIPS-S | 1* | 1 | 5,030/6,734/4,454 | 14,033/20,762/12,749 | 0.34
UC12 | PIPS-S | 1* | 8 | 5,031/6,793/4,454 | 1,963/2,955/1,788 | 2.5
UC12 | PIPS-S | 2* | 16 | 5,031/6,794/4,451 | 1,015/1,537/929 | 4.7
UC12 | PIPS-S | 4 | 32 | 5,031/6,738/4,454 | 548/810/503 | 8.8
UC12 | PIPS-S | 8 | 64 | 5,031/6,738/4,454 | 321/476/300 | 14.9
UC12 | PIPS-S | 16 | 128 | 5,031/6,793/4,454 | 226/346/214 | 20.9
UC12 | PIPS-S | 32 | 256 | 5,031/6,794/4,454 | 180/296/169 | 25.8
UC24 | Clp | 1* | 1 | 5,813/9,386/2,909 | 6,818/10,815/3,280 | 0.87
UC24 | PIPS-S | 1* | 1 | 3,035/2,240/2,272 | 8,031/6,049/6,841 | 0.36
UC24 | PIPS-S | 1* | 8 | 3,035/2,230/2,271 | 1,247/855/1,043 | 2.4
UC24 | PIPS-S | 2* | 16 | 3,035/2,230/2,272 | 675/487/565 | 4.4
UC24 | PIPS-S | 4 | 32 | 3,035/2,230/2,271 | 358/257/300 | 8.2
UC24 | PIPS-S | 8 | 64 | 3,035/2,230/2,270 | 198/143/170 | 14.8
UC24 | PIPS-S | 16 | 128 | 3,035/2,230/2,272 | 125/90/111 | 23.2
UC24 | PIPS-S | 32 | 256 | 3,035/2,230/2,272 | 101/71/92 | 28.7
For UC12 and UC24, we obtain 71 % and 81 % scaling efficiency, respectively, up to 4 nodes (32 cores) and approximately 50 % scaling efficiency on both instances up to 16 nodes (128 cores). On 16 nodes, the iteration speed of PIPS-S is slightly over 25× that of Clp. PIPS-S requires fewer iterations than Clp on these problems, although we do not claim any general advantage. Comparing solution times, we observe relative speedups of over 100× in some cases.
These results, while inconclusive because only three subproblems were considered, suggest that with sufficient resources, branch-and-bound subproblems for these instances may be solved in minutes instead of hours. With these same resources, a parallel branch-and-bound approach is also possible; however, for instances of the sizes considered it will likely be necessary to distribute the subproblems to some extent because of memory constraints.
6.4 Very large instance
Iteration counts and solution times for UC12 with 8,192 scenarios. Starting basis was generated by using a subset of 4,096 scenarios. Runs performed on Intrepid Blue Gene/P system using PIPS-S
Nodes | Cores | Iterations | Solution time (hr.) | Iter./sec. |
---|---|---|---|---|
1,024 | 2,048 | 82,638 | 6.14 | 3.74 |
2,048 | 4,096 | 75,732 | 5.03 | 4.18 |
4,096 | 8,192 | 86,439 | 4.67 | 5.14 |
6.5 Performance analysis
The magnitude of the load imbalance, defined as \(\max_{p} \{ t_{p}\}-\frac{1}{P}\sum_{p=1}^{P} t_{p}\), and the cost of communication explain the deviation between the observed execution time and the ideal execution time, whereas the magnitude of the serial bottleneck t_{0} determines the algorithmic limit of scalability according to Amdahl’s law [1]. Evaluating the relative impacts of these three factors (load imbalance, communication cost, and serial bottlenecks) on a given instance provides valuable insight into the empirical performance of PIPS-S.
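The imbalance metric in this definition is simply the gap between the slowest process and the mean; a direct transcription, with an illustrative example (the timings are made up, not taken from the table below):

```python
def load_imbalance(times):
    """max_p t_p - (1/P) * sum_p t_p over per-process times t_p;
    zero when all P processes take equally long, and equal to the
    extra wall-clock time caused by waiting on the slowest process."""
    return max(times) - sum(times) / len(times)

# e.g., four MPI processes spending 100, 100, 100, 140 us in PRICE:
# imbalance = 140 - (440 / 4) = 30 us
```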
Inefficiencies in PRICE operation on instances from Sects. 6.3.1 and 6.3.2. Times, given in microseconds (μs), are averages over all iterations. Load imbalance is defined as the difference between the maximum and average execution time in the second-stage calculations per MPI process. Total time in PRICE per iteration is given on the right
Test problem | Nodes | Cores | Load imbal. (μs) | Comm. cost (μs) | Serial bottleneck (μs) | Total time/iter. (μs)
---|---|---|---|---|---|---
Storm | 1 | 1 | 0 | 0 | 1.0 | 13,243
Storm | 1 | 8 | 88 | 33 | 0.8 | 1,635
Storm | 2 | 16 | 40 | 68 | 0.9 | 856
Storm | 4 | 32 | 25 | 105 | 0.9 | 512
Storm | 8 | 64 | 26 | 112 | 1.0 | 326
Storm | 16 | 128 | 11 | 102 | 0.9 | 205
Storm | 32 | 256 | 34 | 253 | 0.8 | 333
SSN | 1 | 1 | 0 | 0 | 0.8 | 2,229
SSN | 1 | 8 | 18 | 23 | 0.8 | 305
SSN | 2 | 16 | 25 | 54 | 0.8 | 203
SSN | 4 | 32 | 14 | 68 | 0.7 | 133
SSN | 8 | 64 | 12 | 65 | 0.7 | 100
SSN | 16 | 128 | 10 | 87 | 0.6 | 106
SSN | 32 | 256 | 8 | 122 | 0.6 | 135
UC12 | 1 | 1 | 0 | 0 | 6.8 | 24,291
UC12 | 1 | 8 | 510 | 183 | 6.0 | 4,785
UC12 | 2 | 16 | 554 | 274 | 6.0 | 2,879
UC12 | 4 | 32 | 563 | 327 | 6.0 | 1,921
UC12 | 8 | 64 | 542 | 355 | 6.0 | 1,418
UC12 | 16 | 128 | 523 | 547 | 6.0 | 1,335
UC12 | 32 | 256 | 519 | 668 | 5.8 | 1,323
UC24 | 1 | 1 | 0 | 0 | 11.0 | 28,890
UC24 | 1 | 8 | 553 | 259 | 9.8 | 5,983
UC24 | 2 | 16 | 543 | 315 | 9.7 | 3,436
UC24 | 4 | 32 | 551 | 386 | 9.6 | 2,248
UC24 | 8 | 64 | 509 | 367 | 9.5 | 1,536
UC24 | 16 | 128 | 538 | 718 | 9.5 | 1,593
UC24 | 32 | 256 | 584 | 1,413 | 9.5 | 2,170
We observe that for all instances, the cost of the first-stage calculations is small compared with the other factors. For a sufficiently large number of scenarios, this property will always hold for stochastic LPs, but it may not hold for block-angular LPs obtained by using the hypergraph partitioning of [2].
Communication cost is significant in all cases, particularly when more than 4 or 8 nodes are used. Communication cost increases with the number of first-stage variables (from SSN to Storm to UC12 to UC24), indicating that message volume, and thus bandwidth, is a dominant factor. Unsurprisingly, the communication cost also increases with the number of nodes.
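Both trends are consistent with a standard alpha-beta (latency-bandwidth) sketch of the per-iteration reduction of a dense first-stage vector across nodes. The model and its constants below are illustrative assumptions of ours, not measurements of the Fusion interconnect.

```python
import math

def reduce_cost_model(n_first_stage, n_nodes, alpha=2e-6, beta=1e-9):
    """Alpha-beta sketch of reducing a dense vector of 8-byte doubles
    (one entry per first-stage variable) across n_nodes nodes in a
    binary tree: a latency term alpha (sec/round) per communication
    round plus a bandwidth term beta (sec/byte) times message size.
    Constants are illustrative, not measured."""
    rounds = math.ceil(math.log2(n_nodes)) if n_nodes > 1 else 0
    return rounds * (alpha + beta * 8 * n_first_stage)

# Cost grows with the first-stage dimension (SSN: 89 variables vs.
# UC24: 6,264) and with the number of nodes, as seen in the table.
```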
Load imbalance in PRICE is due primarily to exploitation of hyper-sparsity, and UC12 and UC24, potentially because they contain network-like structure, exhibit this property to a greater extent than do Storm and SSN. Interestingly, the load imbalance does not increase on an absolute scale with the number of nodes, although it becomes a larger proportion of the total execution time.
Despite communication overhead and load imbalance, which we had suspected to be insurmountable, significant speedups are possible, as evidenced by the results presented here. Advances in hardware have brought communication times across nodes to as little as tens of microseconds using high-performance interconnects such as InfiniBand. We note that a shared-memory implementation of our approach would have the potential to address load-balancing issues to a greater extent than the present distributed-memory implementation.
7 Conclusions
We have developed the linear algebra techniques necessary to exploit the dual block-angular structure of an LP problem inside the revised simplex method and a technique for applying product-form updates efficiently in a parallel context. Using these advances, we have demonstrated the potential for significant parallel speedups. The approach is most effective on large instances that might otherwise be considered very difficult to solve using the simplex algorithm. The number of simplex iterations required to solve such problems is greatly reduced by using advanced starting bases generated by taking advantage of the structure of the problem. The optimal bases generated may be used to efficiently hot-start the solution of related problems, which often occur in real-time control or branch-and-bound approaches. This work paves the way for efficiently solving stochastic programming problems in these two contexts.
Acknowledgements
We acknowledge John Forrest and all other contributors for the open-source CoinUtils library which is used throughout the implementation. This work was supported by the U.S. Department of Energy under Contract DE-AC02-06CH11357. This research used resources of the Laboratory Computing Resource Center and the Argonne Leadership Computing Facility at Argonne National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under contract DE-AC02-06CH11357. Computing time on Intrepid was granted by a 2012 DOE INCITE Award “Optimization of Complex Energy Systems Under Uncertainty,” PI Mihai Anitescu.