A robust approach for finding all well-separated solutions of sparse systems of nonlinear equations

Tearing is a long-established decomposition technique, widely adapted across many engineering fields. It reduces the task of solving a large and sparse nonlinear system of equations to that of solving a sequence of low-dimensional ones. The most serious weakness of this approach is well-known: It may suffer from severe numerical instability. The present paper resolves this flaw for the first time. The new approach requires reasonable bound constraints on the variables. The worst-case time complexity of the algorithm is exponential in the size of the largest subproblem of the decomposed system. Although there is no theoretical guarantee that all solutions will be found in the general case, increasing the so-called sample size parameter of the method improves robustness. This is demonstrated on two particularly challenging problems. Our first example is the steady-state simulation a challenging distillation column, belonging to an infamous class of problems where tearing often fails due to numerical instability. This column has three solutions, one of which is missed using tearing, but even with problem-specific methods that are not based on tearing. The other example is the Stewart–Gough platform with 40 real solutions, an extensively studied benchmark in the field of numerical algebraic geometry. For both examples, all solutions are found with a fairly small amount of sampling.


Introduction
We consider square nonlinear systems where F : R n → R n is a continuously differentiable vector-valued function and whose Jacobian is structurally nonsingular; x and x denote the componentwise lower and upper bounds on the variables x, respectively. From an applied point of view, it is usually not meaningful to distinguish two solutions that are too close, due to the intrinsic uncertainty of every real-life model. Therefore, the task we pose is to find a reasonably small set of points such that every solution of (1) is close to one of the points in this set. An algorithm solving this task finds in particular all well-separated solutions. Even for problems with an infinite number of solutions, only a finite number of points need to be generated. Such problems are expected to come only from defective models, and our implementation will return in such cases a diagnostic message.
Our algorithm assumes that the variables are adequately scaled. This allows us to use one of the standard norms to measure distances; unless otherwise indicated, we use the maximum norm ( ∞ norm). We also assume that the bound constraints x ≤ x ≤ x are finite and reasonable; this is needed to allow an adequate sampling of the search space. (Therefore, our method may not work well when a variable is unbounded or its upper bound is not known, and the user circumvents this by specifying a huge number such as 10 20 as upper bound.) Finite bound constraints are also important from an engineering perspective: These bounds often exclude those solutions of F (x) = 0 that either have no physical meaning or lie outside the validity of the model. In a typical technical system, all variables are bounded from below and from above. Indeed, a model is typically valid only in a finite range of the variables. Physical limitations of the devices and the design typically impose minimal and maximal geometry, load, or throughput of the devices; this implies bounds on the corresponding variables. There are also natural physical bounds, for example, the mass fractions must be between 0 and 1. In practice, not all bounds are made explicit in the model because implied bounds would be tedious for the modeler to derive by hand and to keep up-to-date when the model changes. Therefore, the bounds are conventionally not specified explicitly if they can be deduced from the model formulation; e.g., upper bounds on nonnegative variables are typically not specified if there is a constraint fixing their sum. Fully automatic and computationally cheap preprocessing can compute sufficient (but not necessarily sharp) bounds for properly specified models, see for example [13,47,67], or [90]. When dealing with technical systems, a variable that remains unbounded after such preprocessing is almost always a sign of a modeling mistake. We therefore assume that such a preprocessing has already been done successfully when presenting (1) to our algorithm.
Besides the (possibly infinite) number of solutions that exist, the sparsity pattern of F decides how efficiently (1) can be solved. We therefore discuss favorable forms of sparse matrices in Section 1.1, and in Section 1.2 we present algorithms for automatically ordering sparse matrices to the form required by the proposed algorithm. Since the paper is concerned with resolving the most serious flaw of tearing, we first briefly review it in Section 1.3. The robustness of the method will be demonstrated on problems with multiple solutions; the alternative approaches are discussed in Section 1.4. Some important specific applications are summarized in Section 1.5. The proposed method is given in Section 2, and the numerical results in Section 3. The Appendix contains practical matters, such as parameter tuning and implementation-level remarks.

Staircase triangular matrices
We call any partition of rows and columns of a square matrix A into the same number m of contiguous row blocks R 1 , . . . , R m and contiguous column blocks C 1 , . . . , C m a block structure. A lower triangular block structure (or LTBS) of A is a block structure that partitions A into conforming submatrices A jk consisting of the entries in the contiguous rows of R j and columns of C k such that A jk = 0 for j < k. Thus, A may be viewed as a generalization of a block lower triangular matrix to the case of possibly rectangular diagonal blocks A jj . In general, this may be possible in many different ways.
We say that a square matrix A ∈ R n×n is a staircase triangular matrix (or has staircase triangular form) if it has no zero row or column and if the columns c r of the last nonzero entry in row r = 1, . . . , n form a monotone increasing sequence, and the first nonzero entry in column 1, . . . , n forms a monotone decreasing sequence. Staircase triangular matrices generalize staircase matrices (as surveyed, e.g., by [32]) by allowing the lower triangular part to contain additional entries, but restrict the possibilities slightly by imposing a minimal structure on the "walking profile" of the stairs. In practice, the lower triangular part is usually very sparse but often not of the form required by the traditional staircase matrices. By introducing a block boundary at every step of the stair, the staircase triangular form induces a canonical LTBS in a geometrically intuitive way; cf. Fig. 1. The corresponding algebraic description is as follows. By definition, 1 ≤ c 1 ≤ . . . ≤ c n = n.
Let r 1 = 1 and let r 2 < . . . < r m be the list of r = 2, . . . , n with c r−1 < c r in increasing order. Then, the rows are partitioned into m consecutive, nonempty row blocks and the columns into m consecutive, nonempty column blocks The shorthand p:q is used for the index set p, p + 1, . . . , q, where p ≤ q. The staircase triangular form implies that the resulting block structure is lower triangular. The simplest examples of staircase triangular matrices are nonsingular lower triangular matrices where c r = r, corresponding to n blocks of size 1. In addition to the canonical LTBS, we may obtain LTBSs with fewer blocks by arbitrarily merging one or more consecutive row blocks and the corresponding column blocks, cf. Fig. 1.

Ordering to staircase triangular form
The ordering algorithms typically used assume that the input matrix is structurally nonsingular. This is revealed by the Dulmage-Mendelsohn decomposition [24][25][26][27]45,66,Ch. 6], and [18,Ch. 7]. This decomposition is a standard procedure, and efficient computer implementations are available, for example HSL MC79 from the [44]. For a structurally nonsingular square matrix, the Dulmage-Mendelsohn decomposition always produces a block lower triangular matrix with structurally nonsingular square blocks on the diagonal. These diagonal blocks are irreducible. The ordering algorithms that we discuss orders each of these diagonal blocks to staircase triangular form; the correspondingly ordered full matrix will be automatically staircase triangular as well.
Practical algorithms for ordering sparse matrices to staircase triangular form include the Hellerman-Rarick family of ordering algorithms [24,29,42,43], and the algorithms of [78,79]. An efficient computer implementation of the Hellerman-Rarick algorithms is MC33 from the [44]. Although there are subtle differences among the various ordering algorithms, they all fit the same pattern when viewed from a high level of abstraction [31]: Algorithm 1 is the fundamental algorithm. The various ordering algorithms typically assume that the matrix is irreducible and only seem to differ in the lookahead step to break ties on line 3. Figure 2 shows an intermediate stage of the algorithm. Any version of Algorithm 1 will produce a staircase triangular matrix whose canonical diagonal blocks are fully dense.

Active submatrix
Removed min-row-count rows Some of the above cited papers on the Hellerman-Rarick algorithms and the Stadtherr-Wood algorithms include numerical results, indicating that this decomposition method is effective. Further numerical evidence shows that this approach usually gives favorable decompositions for problems from diverse fields: Performance results on 692 test problems are given in [9] for our own ordering algorithm (inspired by the fundamental Algorithm 1). These problems were taken from the COCONUT Benchmark [69], which covers a variety of applications, e.g., chemical engineering, computational chemistry, civil engineering, robotics, economics, multicommodity network flow, process design, stability analysis, VLSI chip design, and portfolio optimization.

Traditional tearing
Tearing dates back to the 1930s [50,81] and has been widely adapted across many engineering fields since: State-of-the-art steady-state and dynamic simulation environments all implement some variant of tearing, see for example [1], Dymola [17], JModelica [55], or OpenModelica [61]. The applicability of tearing is not limited to a particular engineering discipline: It is generic, and it is used in all state-of-the-art Modelica simulators to model "complex physical systems containing, e.g., mechanical, electrical, electronic, hydraulic, thermal, control, electric power or process-oriented subcomponents" [54]. Tearing is also referred to as diakoptics or sequential modular approach depending on the discipline. When dealing with distillation columns, tearing is called stage-to-stage or stage-by-stage calculations.
We say that a square matrix A has bordered block triangular form if it can be written as with block triangular A 11 and square A 22 , the latter typically of fairly small size. Numerous ordering algorithms are available to permute a spares matrix fully automatically into this form. One way of doing it is to first find a staircase ordering as in Section 1.2 and then to move those columns to the far right that spoil the upper left block triangular form. This produces a bordered block triangular form such that the diagonal blocks in A 11 are dense. Conversely, suppose a method is available to order a matrix to bordered block triangular form with the property that the diagonal blocks in A 11 are structurally nonsingular. Such methods are surveyed in [8] and [9]. Then, we can reorder the diagonal blocks to staircase form: We reorder each block of the whole matrix accordingly and then move the resulting border to the far left to get a matrix in staircase form.
In the traditional setup, the bound constraints in (1) are ignored, and the variables and equations are permuted with a suitable ordering algorithm such that the sparsity pattern of the Jacobian is in bordered block lower triangular form with structurally nonsingular square blocks on the diagonal, see Fig. 3.
The variables in (1) are partitioned as into subvectors x i ∈ R d i (i = 0 . . . N), so that n = d 0 + · · · + d N . Similar to the variables, F is partitioned as (1) is square, the trailing dimension must be d N+1 := d 0 . For any bordered block lower triangular matrix, only variables from subvectors x 0 , . . . , Equations (2)-(4) describe the block sparsity pattern shown in Fig. 3. In practice, the lower triangle is sparse. Given the bordered block lower triangular form, the diagonal blocks are eliminated one-by-one from i = 1 to N, and F N+1 (x) is considered as a function of x 0 only: The motivation behind tearing is to save time. The diagonal blocks are typically small and dense, and specialized methods can be used to efficiently eliminate them. This can lead to substantial saving in execution time compared to solving (1) without any decomposition.
The user has to provide an initial guess for each variable when a gradient-based local solver is used to solve (1) directly. With tearing, the user has to provide an initial guess for x 0 only, which saves significant amount of time for the user. Tearing is relatively easy to implement in a component-oriented simulator such as Dymola, JModelica, or OpenModelica, and it is usually robust provided that G(x 0 ) is wellconditioned.
The biggest flaw of tearing was recognized early [16,80]: It can show extreme sensitivity to the initial guess for x 0 , since numerical sensitivity can build up while the blocks are eliminated along the diagonal. In such cases, the Jacobian G (x 0 ) of the reduced system is extremely ill-conditioned, even if the Jacobian F (x) of the original system is well-conditioned. This in turn has a negative impact on the convergence properties of the methods used for solving G(x 0 ) = 0. Although several attempts were made to mitigate this issue, see for example [93,94] and [37], it has never been resolved satisfactorily.
The sensitivity issue can become so severe that, with all the intermediate variables x 1 , . . . x N eliminated, there may not be any machine representable vector for x 0 such that G(x 0 ) = 0 is satisfied with acceptable numerical accuracy. For example, the distillation column computed in Section 3.1 is intractable with traditional tearing although the Jacobian of each solution has a condition number estimate of < 10 9 , so that one expects from a stable method several accurate digits.

General-purpose methods for finding multiple solutions
Multistart methods try to find all solutions by starting a gradient-based local solver from multiple starting points. Guarantees regarding finding all solutions depend on the placement of the starting points. Perhaps the simplest method for bound constrained problems is the grid search: The constraint violation is evaluated at each point of a fine grid, and the best points are used as starting points for local optimization. Finding all solutions with probability one can be achieved by making the grid sufficiently dense. This naive approach is only effective in very low dimensions as the number of grid points grows exponentially with the dimension of the problem.
Multistart methods are applicable to large-scale systems of equations by giving up on the strong guarantees that, for example, grid search would provide. In practice, sophisticated approaches are used to place the starting points, for example, constraint consensus (to reduce infeasibility in a computationally cheap way) and clustering (to separate basins of attraction) by Smith et al. [73][74][75] or stochastic methods (especially population-based metaheuristics) [83]. These methods strike a balance between speed and robustness.
Another set of methods that are often successfully used to solve large-scale problems-especially if the objective function can be cheaply computed-are evolutionary algorithms, e.g., CMA-ES [2], if they are combined with local optimization starting from the most promising points found. Other methods that are based on similarities to natural processes (ant colony [22], particle swarm [28], etc.) can be used in a similar way.
For solving systems of equations with multiple solutions, homotopy methods are extensively used, especially for systems of polynomial equations. The reader is referred to [10,11,77,95] for the latest developments. Mature and robust software implementations for solving polynomial systems are, for example, Bertini [11,12], and PHCpack [86,87]. Homotopy methods with problem-specific homotopy maps were successfully applied to large-scale industrial problems with transcendental equations [21,51,85]. This approach with problem-specific homotopy maps is also suitable for component-based (also referred to as component-oriented) modeling of large, complex, and heterogeneous technical systems [70; 72, Ch. 8-10]: Once the models of the devices (components) are implemented and put into a library, practically no understanding of probability-one homotopy method is required from the end-user.
Spatial branch-and-bound methods recursively split the search space into smaller parts and eliminate those parts that cannot lead to a solution better than the currently best known one. Unfortunately, their worst-case performance tends to grow exponentially with the dimension of the problem since they perform exhaustive search. (For non-convex functions, global optimization is NP-hard.) The applicability of branchand-bound methods currently seems to be limited to fairly low-dimensional problems in the general case. Successful enclosure methods in chemical engineering include interval arithmetic [41], McCormick relaxations [53,68], affine arithmetic [6,76], and αBB [40].

Important specific applications
The problem of solving nonlinear systems of equations arises in the daily engineering practice, e.g., when consistent initial values for differential algebraic equation (DAE) systems are sought [63,84], or when solving steady-state models of technical systems. A steady-state solution can be used as a consistent initial set of the DAE system [48].
Even though mature equation-based component-oriented modeling environments are available, e.g., Modelica [34,52,82] for multi-domain modeling of heterogeneous complex technical systems and [36] ASCEND [65] and EMSO [62] for chemical process modeling, simulation, and optimization, the steady-state initialization is still not satisfactorily resolved in the general case. Often, steady-state initialization failures can only be resolved in very cumbersome ways [3,60,71,72,89], involving user-provided good initial values for the variables. The proposed algorithm aims to eliminate this tedious process by generating good initial values fully automatically.

Proposed algorithm
Here, we give a formal presentation with pseudo-code through Sections 2.1-2.4; the Java source code is available online in the supplementary material [7], together with an illustrative numerical example where the steps of the algorithm are illustrated with several figures.

Input of the proposed method
The input of the proposed method is (1), together with an LTBS of its Jacobian, obtained with appropriate preprocessing. For efficiency reasons one should enforce that the diagonal blocks are structurally nonsingular and contain no zero row or column; Algorithm 1 always delivers a staircase triangular matrix whose canonical LTBS has this property. As outlined in Section 1.2, numerical evidence indicates that Algorithm 1 tends to give favorable decompositions for problems from diverse fields, that is, the size of the largest block tends to be small. The worst-case time complexity of the proposed method grows exponentially with the size of the largest block.
Let N denote the number of blocks of the input LTBS. Unlike in tearing, the variables are now partitioned along the block boundaries as into N subvectors x i ∈ R n i (i = 1 . . . N), so that n = n 1 + · · · + n N . Similarly, F is partitioned along the block boundaries as 1 . . . N), and n = m 1 +· · ·+m N . The motivation behind requiring an LTBS as input is that then only variables from the subvectors x 1 , . . . , x i (i ≤ N) can appear in F i (x): We view this as a cycle-free sequence of subproblems that will be handled sequentially.

The idea in a nutshell
The algorithm builds up a point cloud, i.e., a set of vectors satisfying by processing the blocks one-by-one along the diagonal of the LTBS. The algorithm sketched so far would have exactly the same numerical issues that tearing has. Compared to tearing, the only significant difference up to this point is that a set of points is propagated through the blocks and not just a single point. But working with a point cloud allows us to counteract conditioning problems. Inspired by our earlier results for the univariate case [5], this is achieved by redistributing the sample points after each block. This redistribution step strives to ensure in each iteration that the sample of the solution set of (8) remains representative within the bound constraints, in the sense that (a) no point of the solution set is too far from the sample, and (b) the points in the sample are well-separated. These goals are achieved on a best effort basis. In each iteration, we first insert additional points into the sample with Algorithm 4 which involves robust sampling and sensitivity analysis; the aim of this is to achieve goal (a). After inserting additional points to the sample, its size is assumed to be greater than the user-defined sample size M i ; therefore, we have to drop some of the points from the sample until M i points remain. (If, due to some pathological situation, the sample size is still less than or equal to M i , we simply skip the the rest of the redistribution step and do not drop any of the points.) To achieve goal (b), we choose M i points from the point cloud with a greedy algorithm: We choose the least infeasible point (the point with smallest F 1:i (x 1:i ) ∞ ) as the first point. We iteratively continue choosing a point from the point cloud that are furthest away from all already chosen points, breaking ties arbitrarily. When the desired sample size M i is reached, we drop all points from the sample that have not been chosen. The sample size M i has the biggest effect on the robustness and execution time of the method: Increasing the sample size is expected to improve robustness but at the cost of increased computational costs.
The difference of our procedure to the naive way of directly sampling the solution set of F 1:i (x 1:i ) = 0 within the bound constraints is that in the latter approach, the volume to be sampled grows exponentially with the dimension p := dim x 1:i , hence good sampling is prohibitively expensive once p gets large (ultimately p = n). The proposed method avoids this growth by sampling only at the blocks along the diagonal: The volume to be sampled grows exponentially only with the largest block size, which is usually significantly smaller than p. The biggest computational savings of the proposed method are therefore achieved here.

Pseudo-code of the proposed algorithm
The algorithm is presented in high-level pseudo-code in Algorithm 2; the reader is referred to Appendix C for practical matters, such as the choice and effect of the used-provided parameters.
When forming the subvector v p:q of a vector v, p:q is cropped appropriately if necessary; that is, invalid indices are ignored. The index set p:q is considered empty if p > q, and the expression v p:q is a valid subvector of v that has no elements. We write for the concatenation of two vectors x ∈ R m and y ∈ R n .
The robust sampling and sensitivity analysis in Algorithm 3 was necessary to overcome implementation artifacts of the solvers: In the underdetermined case, the solver may place many of the solution vectors near to each other in a particular subspace due to implementation artifacts. Making the underdetermined subsystem square by fixing the least influential variables (x i ) J proved to be sufficient to resolve this issue. The goal of the local sensitivity analysis performed on line 9 is to order the variables x i according to their influence on F i (x 1:i ) 2 with x 1:i−1 fixed. The index set J on line 9 denotes the indices of the least influential variables. QR factorization with column pivoting is performed on the Jacobian of F i (x 1:i ); the resulting permutation vector gives the desired ordering [35, p. 591].

Adding additional sample points
The goal of Algorithm 4 is to produce additional sample points. The newly introduced The cardinality of J is chosen to be max(1, dim x i−h:i − dim F i−h:i ), that is, if the subsystem is underdetermined, we treat it as we did in Algorithm 3, see Section 2.3. However, if the subsystem is square (which is a common case) or even overdetermined, we still have to perturb at least one of the variables; otherwise, the solution vector x 1:i is most likely in the input S (i) already, and we will not insert any new point into our sample. (If the subsystem has multiple solutions in x i−h:i ; then, it can happen that we find a new point without perturbation.) The entire x 1:i−1 part of the partial solution should not be recomputed on line 13, not even with inter-or extrapolations: that would potentially make the complexity of the entire algorithm O n 2 where n is the number of variables in (1). This is the reason why x 1:i−h−1 is left unchanged, and only the last h subproblems are considered on line 13.

Numerical results and discussion
The benchmark problems have been coded in the AMPL modeling language [33] and are available in the online supplementary material [7]. A short summary of the problems is given in Table 1.

Multiple steady-states in homogeneous azeotropic distillation
The steady-state simulation of distillation columns belongs to an infamous family of problems where tearing is often inapplicable due to numerical instability [21]. Our first example is therefore a challenging distillation column where tearing fails. This column has three solutions, one of which is missed even with problem-specific methods (that are not based on tearing).

Background of the problem
The model equations are the MESH equations: The component material balance (M), vapor-liquid equilibrium (E), summation (S), and heat balance (H) equations are solved. The liquid phase activity coefficient is computed from the Wilson equations. The model and its parameters correspond to the Auto model [39], except for the number of stages N and the feed stage location N F = N/2. The specifications are the feed composition (methanol-methyl butyrate-toluene), the reflux ratio, and the vapor flow rate.
There are three steady-state branches: two stable steady-state branches and an unstable branch; this was experimentally verified in an industrial pilot column operated at finite reflux [23,39]. Multiple steady-states can be predicted by analyzing columns with infinite reflux and infinite length [14,38,64]. These predictions for infinite columns have relevant implications for columns of finite length operated at finite reflux.

Published numerical results with continuation methods
Both the conventional inside-out procedure [15] and the simultaneous correction procedure [57] were reported to miss the unstable steady-state solution, see [85] and [46] (all input variables specified; output multiplicity). However, all steady-state branches were computed either with the AUTO software package [20] or with an appropriate continuation method [39,46,85]. The initial estimates were carefully chosen with the ∞/∞ analysis [14,38], and special attention was paid to the turning points and branch switching.

Obtaining the lower triangular block structure
A distillation column consists of stages; partitioning the Jacobian along the stage boundaries gives the blocks. The natural order of the stages directly yields the LTBS by virtue of the internal physical layout of distillation columns. As a consequence, no preprocessing was necessary before applying the proposed method. The sparsity pattern of the Jacobian is shown in Fig. 4.

A note on plotting the solution vectors
To understand the displayed results of the next section, we recall how chemical engineers traditionally plot the solutions. Once a solution to the model equations is available, it is plotted as follows. A slice of the solution vector is created first: A subset of variables is selected that has the same physical meaning in each block. For example, if the variables corresponding to the temperature are selected at each block, a slice is obtained, the so-called temperature column profile, etc. Then, the block index is plotted against the selected value of the variable in this block, see on the left of Fig. 5.
In the chemical engineering literature, not a sequence of discrete points are plotted but a piecewise linear curve passing through these points, and not the block but the stage index is used, see on the right of Fig. 5. For the columns considered in this paper, block i corresponds to stage N − i + 1, that is, the stages are numbered in reverse order compared to the blocks. It must be emphasized that the solution is a vector of real numbers and not a continuous curve.  The stages are numbered in reverse order compared to the blocks. It must be emphasized that the slice is a vector of real numbers and not a continuous curve

Our numerical results
With the appropriate parameter settings, all three steady-state solutions are found when IPOPT [91] is run from the starting points generated with the proposed method. As for parameter tuning of the proposed algorithm, the reader is referred to Appendix C; the effects of varying h is shown in Table 2.
For the purposes of demonstration, several plots are given for the column with N = 20 stages; the column with 40 stages is too long to be appropriate for illustration. Figure 6 shows the three steady-state solutions and those starting points that are the closest to them. Figure 7 illustrates an intermediate step of the algorithm, the extension of the partial column profiles when moving from stage 11 to stage 10, that is, the result of executing the nested loop in Algorithm 2 for each point in the sample. Block i Table 2 The effect of varying h while the initial sample size M 0 is kept fixed at 25 Time in Alg. 2  Total time  Solutions  Time in Alg. 2  Total time   1  2  43  51  1  104  115   2  3  67  82  3  237  323   3  3  114  130  3  384  473 The problem being solved is the azeotropic distillation problem of Section 3.1; it has three solutions for both N = 20 and N = 40. The time is measured in seconds. Time in Alg. 2 is the time needed to generate the starting points; the total time also includes running IPOPT from each point  Figure 8 shows the effect of the distinguishing feature of the method, the redistribution. If the redistribution is disabled, the proposed method eventually boils down to the tearing (stage-by-stage method). Figure 8 is computed stage-by-stage, exactly from the same bulk composition used for Fig. 6. Two out of the three steady-state solutions are lost without the redistribution step. Figure 9 shows one execution of the redistribution step at stage 10. New points are forced into the sample, compare with Fig. 7; then, only the most distant points are kept.

Stewart-Gough platform with 40 real postures
The Stewart-Gough platform consists of two rigid bodies that are connected by six rods attached via spherical joints; it is a type of parallel-link robotic device. This problem is an extensively studied benchmark with homotopy methods, see, e.g, [77,Sec. 7.7] or [11,Sec. 6.3]. [19] published parameters with which a given Stewart-Gough platform has 40 real postures. The model equations with these parameters are available from the database of PHCpack, maintained by [88]; it is the stewgou40 benchmark. The latest version of PHCpack [86,87] is v2.4.25, released on 31 Aug 2016. With the hardware specified at beginning of Section 3, it requires 56.5 seconds for solving this particular benchmark in its so-called blackbox mode for non-experts. It is very likely that in the hands of an expert, this software can solve the problem faster. Here, we show how all the 40 real solutions can be found with the proposed method. Even though the problem characteristics that favor our method (large, sparse problem) are not present, our method performs reasonably on this benchmark.

Our results
Some preprocessing is necessary before applying the proposed method. As all variables are coordinates of unit vectors and therefore must be in the interval [−1, 1], the latter defines the bound constraints.
The Jacobian can be permuted fully automatically into staircase triangular form. The optimal pattern, shown in Fig. 10, is found on the root node of the search tree of the method of [9], in less than 5 ms. However, in this particular case, using such an ordering algorithm is an overkill: One can easily bring the Jacobian into staircase triangular form with pen and paper by making the first equation the fourth. As it can be seen in Fig. 10, the sparsity pattern has very little structure that the proposed method can exploit: It only has three blocks on the diagonal of the canonical LTBS.
As for setting the initial sample size M 0 , the following strategy gives a reasonable approach. The user should set M 0 to be ten times the expected number of solutions, and at least 25. If more than the expected number of solutions were found, M 0 should be doubled iteratively, and the proposed method run again with this larger initial sample size. We keep doubling M 0 and rerunning the algorithm until all previously found solutions were found and no new solutions, or we reach a pre-defined limit for M 0 .
Let us pretend that we expect only one solution, and we start with an initial sample size of 25 accordingly. Since 21 solutions are found, see Table 3, we iteratively double the initial sample size and rerun the proposed algorithm, until all previously found solutions are found with M 0 = 400 and no new solutions. The entire procedure takes 54.3 s. If we know a priori that there are at most 40 real solutions, see for example

B.2 Solving the original system from the generated starting points
Our solver of choice for solving the input problem from the generated starting points is IPOPT; however, due to implementation artifacts, this solver is not appropriate for the Stewart-Gough benchmark of Section 3.2. Even when supplied with a solution, IPOPT first wanders off then back while it is reducing the dual infeasibility. As a consequence, it already starts losing some of the solutions when provided with starting points that were obtained from the true solutions by applying small random perturbations of at most 0.004 in the maximum norm. The minimum ∞ distance between any two solutions is approximately 0.155. Other solvers, e.g., LMBOPT and VA27 from [44], do not have any difficulty with such small random perturbations. We therefore used LMBOPT for this particular benchmark. However, this solver squares the condition number, and the estimated condition number of the Jacobian at some of the solutions is of order 10 6 . This necessitated some postprocessing: The solutions returned by LMBOPT had to be polished with textbook Newton iteration.

B.3 Notes on the execution time
Implementing the proposed method is a major undertaking; the most difficult part is the implementation of the function and Jacobian evaluation of the subproblems. In order to reduce the implementation effort, we use at present our existing Java code that was originally created for researching inclusion algebras [58,Ch. 2.2] with abstract datatypes and our existing Java wrappers for the IPOPT and LMBOPT solvers for large-scale and sparse problems, that are comparably inefficient for small, dense subproblems to which they are applied here. Although this code reuse dramatically reduced the time needed to implement a prototype, the resulting software is unacceptably slow. Profiling shows that the execution times given in the present paper mostly reflect the performance flaws of our current research prototype. We therefore started a complete rewrite from scratch [4]: We are currently reimplementing the function and Jacobian evaluations in the C programming language, and for solving the subproblems, we are switching to VA27. Unlike IPOPT and LMBOPT, the latter solver is tailored for small and dense problems. This reimplementation is still an ongoing process.

C Parameter tuning
Algorithms 2 and 4 have parameters that were explicitly indicated in the pseudocode but were left unspecified; here, we discuss (i) the influence of these parameters on the robustness and execution time of the method, (ii) how they were set in our numerical experiments, (which we also consider reasonable default settings), (iii) and outline appealing future research directions to set them adaptively and automatically. In our numerical experience, the sample size M is the most important, and h in the redistribution step is the second most important parameter of the method with respect to their influence on robustness and execution time. The sample size M determines the resolution of the solution set. With the parameter h, one controls the number of subproblems among which the constraint violation of a newly forced point is spread. Increasing either one of these parameters is expected to increase robustness at the expense of increased computational effort; this is illustrated by Tables 2 and 3.

C.1 Setting the sample size
We used M i = c · i + M 0 in our numerical experiments, where c and M 0 are constants; it is left to the user to specify M 0 and c. The choice M i = i + 25 proved to be appropriate for the most difficult distillation column considered in this paper and, for all other, easier problems in our test set (that are not discussed here) with one exception. The stewgou40 benchmark, to be discussed in Section 3.2, has 40 solutions, and it was necessary to iteratively double M 0 to achieve a sufficient resolution of the solution set, see Table 3.
One can probably find a parametric formula that gives a reasonable default value for M i as a function of the key characteristics such as the size of the largest block, the size of block i, and the number of blocks N. The parameters of such a formula with the right qualitative form should be fitted and cross-validated by running the algorithm on a large benchmark set, consisting of diverse test problems. This is the subject of future research.

C.2 Setting h
The second most important parameter of the proposed method is h in the redistribution step: After forcing a new point into the sample, the last h blocks are simultaneously resolved to minimize the constraint violation resulting from the forceful insertion. If h is chosen too small (for example h = 1), we may fail to reduce the constraint violation sufficiently, and the new forced points will be discarded in Algorithm 2 on line 8; this can lead to poor resolution of the solution set, and some of the solutions will be lost eventually. Setting h to a large value (h 5) spoils the performance, since the last h blocks are resolved simultaneously, and the increased computational effort does not yield any visible improvement in robustness. In the current implementation, it is left to the user to specify h. The h = 4 choice works for all the test problems in our test set.
Similarly to M, it is subject of future research to work out a rule for choosing a default value for h. In particular, the following adaptive rule seems appealing. In our numerical experience, the remaining constraint violation in Algorithm 4, line 13 decreases rapidly as h is increased. Therefore, the algorithm could start with a small value for h at each block (for example h = 2), and increment it until either the tolerated constraint violation of Algorithm 2 or a user-defined cutoff for h (for example h = 5) is reached. The optimization problem in Algorithm 4 on line 13 can be quickly resolved after incrementing h if the gradient-based solver is started from the previous solution. Other, more sophisticated adaptive rules are also possible.

C.3 Setting the remaining parameters
With M and h fixed, the other parameters have negligible effect on the robustness and performance of the method in our experience; therefore, practically no effort was made to tune these other parameters. For the purposes of documentation only, the following settings were used: = 2/M 0 in Algorithm 2, and p = S (i) , and δ = 2/M 0 in Algorithm 4.

D Parallelization
Profiling shows that most of the time is spent solving the subproblems at the blocks as expected: In Algorithm 2 in the loop on lines 3-5, and in Algorithm 4 in the loop on lines 12-14. Since the optimization problems solved in these loops are completely independent, this means that they can be solved in parallel; the bottleneck of the computations is parallelizable. We have not implemented this improvement; our current implementation is still sequential (single-threaded) because the underlying implementation is unfortunately not thread-safe.