A threaded approach of the quadratic bi-blending algorithm
 Cite this article as:
 Herrera, J.F.R., Casado, L.G., Hendrix, E.M.T. et al. J Supercomput (2013) 64: 38. doi:10.1007/s11227-012-0783-9
Abstract
Blending algorithms aim to determine the mixture of raw materials that yields a cheap and feasible recipe with the smallest number of raw materials. An algorithm that solves this problem for two products, where the available raw material is limited, has two phases. The first phase is a simplicial branch-and-bound algorithm that determines, for a given precision, a Pareto set of solutions of the bi-blending problem, as well as a subspace of the initial space where better (more precise) feasible solutions can be found. The second phase consists of an exhaustive reduction of that subspace by deleting simplicial subsets that cannot contain solutions. This second phase is useful for future refinement of the solutions. Previous work focused only on the first phase, neglecting the second due to its computational burden. With this in mind, we study the parallelization of the different phases of the sequential bi-blending algorithm, focus on the most time-consuming phase, and analyze the performance of several strategies.
Keywords
Shared memory · Parallel processors · Multithreaded · Branch-and-bound · Global optimization

1 Introduction
The problem of finding the best robust recipe that satisfies quadratic design requirements is a global optimization problem for which a guaranteed optimal solution is hard to obtain, because it can have several local optima. The feasible area may be non-convex and may even consist of several compartments. In practice, companies deal with so-called multi-blending problems, where the same raw materials are used to produce several products [1, 3]. This complicates the search process if we intend to guarantee the optimality and robustness of the final solutions. An exhaustive search algorithm for blending and its components are described in [4, 6, 8], while a bi-blending approach appears in [9].
Section 1.1 describes the blending problem and Sect. 1.2 defines the blending problem of obtaining two mixture designs (bi-blending). Section 2 describes the sequential version of the bi-blending algorithm, and Sect. 3 its parallel model. Section 4 shows the computational results, and Sect. 5 summarizes the conclusions and future work.
1.1 Blending problem
The blending problem is the basis of our study of bi-blending. The considered blending problem is described in [8] as a Semi-continuous Quadratic Mixture Design Problem (SQMDP). Here, we summarize its main characteristics.
The semi-continuity of the variables is due to a minimum acceptable dose (md) observed in practical problems, i.e., either \(x_i = 0\) or \(x_i \geq md\). The number of resulting sub-simplices (faces) is \(2^n - 1\). All points x in an initial simplex \(P_u\), \(u = 1, \ldots, 2^n - 1\), are mixtures of the same group of raw materials. The index u representing the group of raw materials of simplex \(P_u\) is given by \(u = \sum_{i=1}^{n} 2^{i-1}\delta_i(x)\), \(\forall x \in P_u\).
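As an illustration, the group index u can be computed directly from the usage pattern of a mixture. The following C sketch assumes 0-based arrays and treats \(\delta_i(x)\) as 1 when raw material i is used (\(x_i \geq md\)) and 0 otherwise; names and the md value are illustrative, not taken from the paper's code.

```c
/* Sketch: compute the index u of the initial simplex P_u containing
 * mixture x, following u = sum_i 2^(i-1) * delta_i(x), where
 * delta_i(x) = 1 if raw material i is used (x_i >= md), else 0.
 * Arrays are 0-based here, while the paper indexes i = 1..n. */
static unsigned group_index(const double *x, int n, double md)
{
    unsigned u = 0;
    for (int i = 0; i < n; i++)   /* 0-based i corresponds to i+1 in the paper */
        if (x[i] >= md)           /* delta_i(x) = 1 */
            u |= 1u << i;         /* adds 2^(i-1) for the 1-based index */
    return u;
}
```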
Recipes have to satisfy certain requirements. For relatively simple blending instances, bounds and linear inequality constraints define the design space X⊂S; see [1, 3, 16]. In practice, however, quadratic requirements appear [4, 8]. The feasible space according to the quadratic constraints is denoted by Q.
Moreover, the design must have ε-robustness with respect to the quadratic requirements, in order to maintain feasibility of the result when small variations in the mixture appear. One can define the robustness R(x) of a design x∈Q with respect to Q as \(R(x) = \max\{R \in \mathbb{R}^{+} : (x+r) \in Q, \forall r \in \mathbb{R}^{n}, \|r\| \leq R\}\).
In [4], tests are described based on so-called infeasibility spheres, which identify areas where a feasible solution cannot be located. In [8], we described a B&B algorithm to solve SQMDP using rejection tests based on linear, quadratic, and robustness constraints. A threaded version of the B&B blending (SQMDP) algorithm was presented in [13], following a strategy similar to the one used in a parallel interval global optimization algorithm [5].
1.2 Bi-blending problem
As described in [9], when designing several products simultaneously, each product has its own demand and quality requirements, which are posed as design constraints. Here, we summarize the main characteristics of the problem. Let index j represent a product with demand \(D_j\). The amount of available raw material i is given by \(B_i\). Now, the main decision variable is a matrix x, where variable \(x_{i,j}\) represents the fraction of raw material i in the recipe of product j.
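The limited availability that distinguishes bi-blending can be sketched as a simple feasibility check: the demand-weighted use of each raw material, summed over the products, may not exceed its availability \(B_i\). This is a minimal illustration under an assumed data layout, not the paper's implementation.

```c
/* Sketch of the raw-material availability constraint: the total amount
 * of material i over all m products, sum_j D[j] * x[i][j], may not
 * exceed the availability B[i]. Names and layout are illustrative. */
static int availability_ok(int n, int m, double x[][2],
                           const double *D, const double *B)
{
    for (int i = 0; i < n; i++) {
        double used = 0.0;
        for (int j = 0; j < m; j++)
            used += D[j] * x[i][j];   /* demand-weighted use of material i */
        if (used > B[i])
            return 0;                 /* material i over-committed */
    }
    return 1;
}
```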
2 Algorithm to solve the QBB problem
We are interested in methods that find solutions \(\mathbf{x}_p\) of the QBB problem up to a guaranteed accuracy, e.g., \(F(\mathbf{x}_{p}) - F^{\star}_{p} \leq \delta\). Solving (2) in an exhaustive way (the method obtains all global solutions with a predefined precision) requires the design of a specific branch-and-bound algorithm. B&B methods can be characterized by four rules: Branching, Selection, Bounding, and Elimination [10, 14]. A Termination rule can be incorporated, for instance, based on the smallest sampling precision. In a branch-and-bound method, the search region is successively partitioned into more and more refined subsets (branching), over which bounds of the objective function value and bounds on the constraint functions are computed (bounding); these are used to determine whether a subset can contain an optimal solution.
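As a toy illustration of the four rules plus termination, the following self-contained C sketch runs a best-first B&B over intervals to minimize a simple convex function. It is not the paper's simplicial algorithm, only a minimal instance of the same loop structure, with every name an assumption of this sketch.

```c
/* Toy branch-and-bound on an interval, illustrating the rules named in
 * the text: Branching, Selection, Bounding, Elimination, Termination.
 * It minimises f(x) = (x - 0.3)^2 over [0, 1]. */
#define MAXBOX 1024

static double f(double x) { return (x - 0.3) * (x - 0.3); }

/* Bounding: exact lower bound of the convex f on [a, b]. */
static double lb(double a, double b)
{
    if (a <= 0.3 && 0.3 <= b) return 0.0;
    double fa = f(a), fb = f(b);
    return fa < fb ? fa : fb;
}

static double bb_minimise(double alpha)
{
    double lo[MAXBOX] = {0.0}, hi[MAXBOX] = {1.0};
    int n = 1;
    double fU = f(0.0) < f(1.0) ? f(0.0) : f(1.0);  /* incumbent */

    while (n > 0) {
        /* Selection: best-first, pick the box with the smallest bound. */
        int s = 0;
        for (int k = 1; k < n; k++)
            if (lb(lo[k], hi[k]) < lb(lo[s], hi[s])) s = k;
        double a = lo[s], b = hi[s];
        lo[s] = lo[n-1]; hi[s] = hi[n-1]; n--;       /* pop */

        /* Elimination: discard boxes that cannot improve the incumbent. */
        if (lb(a, b) > fU) continue;
        /* Termination: boxes below the precision alpha are final. */
        if (b - a < alpha) continue;

        /* Branching: bisect and update the incumbent at the midpoint. */
        double m = 0.5 * (a + b);
        if (f(m) < fU) fU = f(m);
        if (n + 2 <= MAXBOX) {
            lo[n] = a; hi[n] = m; n++;
            lo[n] = m; hi[n] = b; n++;
        }
    }
    return fU;
}
```

The best-first selection always processes the box with the smallest lower bound first, so the incumbent drops quickly and elimination prunes most of the remaining boxes.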
 Branching:

Simplex C is divided through its longest edge or, when all edges have the same length, through the edge connecting the cheapest and the most expensive vertices.
 Bounding:

Two bound values have to be calculated for each simplex:
 Cost:

\(f^{L}(C)\) is a lower bound of the cost on a simplex C. It is equal to the minimum cost over the vertices of the simplex, because the simplex is convex and the cost function is linear.
 Amount of each raw material:

\(b_{i}^{L}(C)\) is a lower bound of the fraction of raw material i in simplex C. It is obtained analogously to the lower bound of the cost.
 Selection:

A hybrid Best-Depth search is performed. The cheapest simplex, based on the sum of the costs of its vertices, is selected, and a Depth-first search is done until no further subdivision is possible (see Algorithm 1, lines 7 and 16). Depth-first search is used to reduce the memory requirements of the algorithm.
 Rejection:

Several individual tests based on linear, quadratic, and robustness constraints are applied to simplices of one product; see [8]. In addition, tests are applied that take both products into account:
 Capacity test:

Let \(\beta^{L}_{i,j}=D_{j} \times \min \left\{x_{i} : x \in C \in \varLambda_{j} \cup \varOmega_{j}\right\}\) be a lower bound of the demand of material i in the current search space of product j. Then, a simplex C of product j does not satisfy the capacity test if

$$ D_j \times b^L_i(C)+\beta^L_{i,j'} > B_i, \tag{3} $$

where j′ denotes the other product.
 Pareto test:

Let \(\varphi^{L}_{u,j} = D_{j} \times \min \left\{f(v) : v \in C \subset P_{u,j},\ C \in \varLambda_{j} \cup \varOmega_{j}\right\}\) be a vector containing the cost value of the cheapest non-rejected mixture for initial simplex \(P_{u,j}\), \(u=1,\ldots,2^{n}-1\). Then a simplex C of product j does not satisfy the Pareto test if

$$ D_j f^L(C)+\varphi^L_{u,j'} > F^U_{\omega(x,y)},\quad x\in C,\ y \in P_{u,j'}. \tag{4} $$

Global upper bound values \(F^{U}_{p}\), \(p=1,\ldots,n\), are updated as follows. Every time a new vertex satisfying the individual tests is generated by the branching rule, it is combined with all vertices of the other product that also meet the individual tests, to check for a combination that satisfies (1) and improves \(F^{U}_{p}\).
 Termination:

Non-rejected simplices that reach the required size α are stored in \(\varOmega_j\).
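The bounding rule above exploits the fact that a linear function attains its minimum over a simplex at a vertex, so both \(f^{L}(C)\) and \(b_{i}^{L}(C)\) reduce to a minimum over vertex values. A minimal C sketch, with illustrative names and a fixed maximum dimension:

```c
/* Sketch: minimum of a linear function coeff.x over the vertices of a
 * simplex. With coeff = cost vector this yields f^L(C); with coeff the
 * i-th unit vector it yields b_i^L(C). Names are illustrative. */
static double vertex_min(int nv, int dim, double v[][8],
                         const double *coeff)
{
    double best = 1e300;
    for (int k = 0; k < nv; k++) {
        double val = 0.0;                 /* linear value at vertex k */
        for (int i = 0; i < dim; i++)
            val += coeff[i] * v[k][i];
        if (val < best) best = val;
    }
    return best;
}
```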
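The capacity test (3) can be sketched as a per-material comparison, assuming the products \(D_j \times b_i^L(C)\) and the bounds \(\beta^{L}_{i,j'}\) have been precomputed; names are illustrative, not taken from the paper's code.

```c
/* Sketch of the capacity test (3): simplex C of product j is rejected
 * when even its cheapest possible use of material i, plus a lower
 * bound on the demand of the other product j', exceeds the
 * availability B_i of that material. */
static int capacity_reject(int n,
                           const double *Dj_bL,      /* D_j * b_i^L(C)  */
                           const double *beta_other, /* beta^L_{i,j'}   */
                           const double *B)          /* availability    */
{
    for (int i = 0; i < n; i++)
        if (Dj_bL[i] + beta_other[i] > B[i])
            return 1;   /* test (3) holds for material i: reject C */
    return 0;
}
```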
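The update of the upper bounds \(F^{U}_{p}\) described under the Pareto test can be sketched as follows, assuming the feasibility of each candidate pair (check (1)) and its number of raw materials \(p = \omega(x,y)\) have already been evaluated and stored; everything here is an illustrative stand-in, not the paper's code.

```c
#include <stddef.h>

/* Sketch of the incumbent update: a newly generated vertex of one
 * product is combined with the ny stored vertices of the other
 * product; whenever a pair is feasible, the incumbent FU[p-1] for its
 * number p of raw materials is improved if the combined cost is lower.
 * feasible[] stands in for check (1), p_mats[] for omega(x, y). */
static void update_incumbents(double cost_xnew, const double *cost_ys,
                              const int *feasible, const int *p_mats,
                              size_t ny, double *FU, int nmax)
{
    for (size_t k = 0; k < ny; k++) {
        if (!feasible[k])
            continue;                        /* pair violates (1)  */
        int p = p_mats[k];                   /* p = omega(x, y)    */
        double c = cost_xnew + cost_ys[k];   /* combined pair cost */
        if (p >= 1 && p <= nmax && c < FU[p-1])
            FU[p-1] = c;                     /* improve F^U_p      */
    }
}
```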
The result of Algorithm 1 is a set of δ-guaranteed Pareto bi-blending recipe pairs \(\mathbf{x}_p\) with their corresponding costs \(F^{U}_{p}\), \(p=1,\ldots,n\), and lists \(\varOmega_j\), \(j=1,2\), that contain the mixtures that have not been discarded. During the execution of Algorithm 1, lower bounds \(\beta^{L}_{i,j}\) and \(\varphi^{L}_{u,j}\) are updated based on non-rejected vertices, to discard simplices that do not satisfy (3) or (4). These lower bounds are used to avoid the expensive computation involved in combining simplices of both products.
3 Parallel strategy
The bi-blending problem is solved in two independent phases: the B&B phase (Algorithm 1) provides lists \(\varOmega_1\) and \(\varOmega_2\) with simplices that reached the termination criterion; the combination phase (Algorithm 2) filters out simplices without solutions. The computational characteristics of Algorithms 1 and 2 are completely different: while Algorithm 1 works with irregular data structures, Algorithm 2 is confronted with more regular ones. Algorithm 2 runs after Algorithm 1 finishes. Hence, the parallel models of both algorithms are analyzed separately.
The number of final simplices of Algorithm 1 depends on several factors: the dimension, the accuracy α of the termination rule, the feasible region of the instances to solve, etc. Preliminary experimentation shows that this number of final simplices can be relatively large. Algorithm 2 is computationally much more expensive than Algorithm 1. Therefore, we first study the parallelization of Algorithm 2.
Algorithm 2 uses a nested loop and the two lists \(\varOmega_1\) and \(\varOmega_2\). For each simplex C∈Ω_j, a simplex C′∈Ω_{j′} that satisfies (5) and (6) must be found to keep C on the list. In the worst case (when the simplex can be removed), list Ω_{j′} is explored completely (all simplices C′∈Ω_{j′} are examined).
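The nested loop just described can be sketched as follows, with a compatibility matrix standing in for conditions (5) and (6), which are not reproduced in this excerpt; names and layout are illustrative.

```c
#include <stddef.h>

/* Sketch of the combination phase: a simplex c in Omega_j survives
 * only if at least one partner k in Omega_j' can be combined with it.
 * compatible[] (n1 x n2, row-major) stands in for conditions (5)-(6);
 * keep[] records the filtering result. */
static void combine_filter(size_t n1, size_t n2,
                           const int *compatible,
                           int *keep)
{
    for (size_t c = 0; c < n1; c++) {
        keep[c] = 0;
        for (size_t k = 0; k < n2; k++)  /* worst case: whole list scanned */
            if (compatible[c * n2 + k]) {
                keep[c] = 1;             /* partner found: stop early */
                break;
            }
    }
}
```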
 Strategy 1:

Applying NTh/2 threads to each list \(\varOmega_j\), j=1,2, so that iterations 1 and 2 of the outer loop are performed concurrently. This strategy requires NTh≥2. Each thread Th checks the simplices in \(\{C\in\varOmega_j : \mathrm{Pos}(C,\varOmega_j) \bmod (NTh/2) = Th \bmod (NTh/2)\}\). After both lists are explored, the deletion of simplices is performed by one thread per list \(\varOmega_j\).
 Strategy 2:

Applying NTh threads to the inner loop to perform one iteration of the outer loop at a time. Each thread Th checks the simplices \(C\in\varOmega_j\) that meet \(\mathrm{Pos}(C,\varOmega_j) \bmod NTh = Th\). The idea is to check just one list in parallel, removing non-feasible simplices before exploring the other list. Deletion of the simplices (tagged for this purpose) is performed by only one of the threads at the end of each iteration j.
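Strategy 2's cyclic assignment of list positions to threads can be sketched with POSIX threads (the API the implementation uses, per Sect. 4). The compatibility matrix again stands in for (5) and (6), and tagging followed by single-threaded deletion mirrors the description above; all names are illustrative.

```c
#include <pthread.h>
#include <stddef.h>

#define NTH 4   /* illustrative thread count */

struct task {
    int id;              /* thread index Th                              */
    size_t n1, n2;
    const int *compat;   /* n1 x n2 matrix, stands in for (5)-(6)        */
    int *tagged;         /* tagged[c] = 1 -> remove simplex c afterwards */
};

/* Each thread scans positions c with c mod NTH == Th (cyclic partition),
 * tagging simplices that have no partner in the other list. */
static void *scan(void *arg)
{
    struct task *t = arg;
    for (size_t c = (size_t)t->id; c < t->n1; c += NTH) {
        int found = 0;
        for (size_t k = 0; k < t->n2 && !found; k++)
            found = t->compat[c * t->n2 + k];
        t->tagged[c] = !found;   /* no partner: tag for removal */
    }
    return NULL;
}

static void strategy2(size_t n1, size_t n2, const int *compat, int *tagged)
{
    pthread_t th[NTH];
    struct task tk[NTH];
    for (int i = 0; i < NTH; i++) {
        tk[i] = (struct task){ i, n1, n2, compat, tagged };
        pthread_create(&th[i], NULL, scan, &tk[i]);
    }
    for (int i = 0; i < NTH; i++)
        pthread_join(th[i], NULL);
    /* single-threaded deletion of the tagged simplices would follow here,
     * as the text specifies for this strategy */
}
```

Because threads only write disjoint entries of `tagged`, the scan needs no locking; the serial deletion step afterwards avoids concurrent list mutation.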
A difficulty in parallelizing Algorithm 1 is that the pending computational work of the B&B search for one product is not known beforehand, i.e., it is an irregular algorithm. The search in one product is affected by the information shared with the other product. Moreover, the computational cost of the search in each product can differ considerably due to the different design requirements. A study on the prediction of the pending work in B&B interval global optimization algorithms can be found in [2]. Although the authors of [5, 7, 12, 15] describe their experience with parallel B&B algorithms, these papers tackle only one B&B algorithm. QBB actually runs two B&B algorithms, one for each product, sharing \(\beta^{L}_{i,j}\), \(\varphi^{L}_{u,j}\), and \(F^{U}_{p}\) (see Eqs. (3) and (4)). The problem is to determine how many threads to assign to each product if both parallel B&B executions are to spend approximately the same computing time. This will be addressed in a future study. Preliminary results show that the B&B phase is computationally negligible compared to the combination phase. Therefore, we use just one static thread per product. This also allows us to illustrate the challenge of load balancing.
4 Experimental results
To evaluate the performance of the parallel algorithm, we used a pair of five-dimensional products, called UniSpec1-5 and UniSpec5b-5. Both are modifications of two seven-dimensional instances (UniSpec1 and UniSpec5b, respectively) taken from [8], obtained by removing raw materials 6 and 7. The instance was solved with robustness \(\varepsilon = \sqrt{2}/100\), accuracy α=ε, and minimal dose md=0.03. The demand of each product is \(D^{T}=(1,1)\). The availability of raw materials RM1 and RM3 is restricted to 0.62 and 0.6, respectively, while the others are not limited. Two solutions were found for UniSpec1-5 & UniSpec5b-5, with a different number of raw materials involved [9].
The algorithms were coded in C. The POSIX Threads API was used to create and manipulate threads. Previous studies, such as those presented in [15], show a less-than-linear speedup using OpenMP for B&B algorithms. A study on the parallelization of the combination phase with OpenMP pragmas will be addressed in the future. The code was run on a Dell PowerEdge R810 with one octo-core Intel Xeon L7555 1.87 GHz processor, 24 MB L3 cache, 16 GB of RAM, and a Linux operating system with a 2.6 kernel.
Computational effort

             B&B phase                   Comb. phase
           BiBlend-Seq  BiBlend-Par   BiBlend-Seq  BiBlend-Par
NEvalS       2,536,862    2,537,430
NEvalV         168,186      168,299
QLR            887,609      888,004
Pareto          54,050       54,050        27,284       27,284
Capacity        18,277       18,211       105,499      105,521
Ω_S            308,443      308,465       175,660      175,660
Ω_V             49,317       49,324        24,861       24,861
LLC misses (in thousands) and speedup obtained in Comb. phase

        Strategy 1              Strategy 2 (Ω_1–Ω_2)    Strategy 2 (Ω_2–Ω_1)
NTh   LLC     Time     Sp     LLC     Time     Sp     LLC     Time     Sp
–    2,691   479.00     –    2,691   479.00     –    1,595   333.00     –
2    2,723   463.38   1.03   1,304   231.52   2.07     774   162.72   2.05
4    1,467   223.45   2.14     624   112.97   4.24     372    79.95   4.16
8      815   109.00   4.39     305    57.17   8.38     178    40.32   8.26
Strategy 1 shows poor speedup compared to Strategy 2. Strategy 1 has threads working on elements of both Ω_1 and Ω_2, where for each element of one list the comparison is done with elements of the other list until a valid combination is found or the complete list has been checked (the worst case). This requires many elements of both lists to be cached, causing cache misses.
On the other hand, Strategy 2 uses all threads to check elements of one list. In this way, only NTh elements of the current list and the elements of the other list have to be in cache for comparison. This reduces cache misses and, therefore, the running time, especially when the other list is small. This is illustrated when Strategy 2 starts with list Ω_2, where |Ω_2|=286,475: the other list has a smaller size, |Ω_1|=21,990, which decreases the number of cache misses, and the running time is reduced by more than 25%. For this strategy, one can observe a slight super-linear speedup due to cache effects: increasing the number of threads promotes the reuse of the same data in cache and thus leads to fewer cache misses.
Regarding the B&B phase, which is the same in both strategies, BiBlend-Par uses NTh=2. In this phase, a slight speedup of 1.03 is obtained: BiBlend-Seq spends 7.23 seconds and BiBlend-Par spends 7 seconds. A linear speedup is not reached due to the difference in complexity between the two products: UniSpec1-5 has simpler quadratic requirements than UniSpec5b-5; thread Th=1 spends only 0.86 seconds exploring the entire search space of UniSpec1-5, while thread Th=2 spends 7 seconds to finish exploring the search space of UniSpec5b-5.
For a better analysis of the computational results, we extended the experiments to a larger number of processors. Strategy 2 was run on a Sun Fire X4600 with eight quad-core AMD Opteron 8356 2.3 GHz processors, 2 MB L3 cache, 56 GB of RAM, and a Linux operating system with a 2.6 kernel. The algorithm shows an almost linear speedup when the number of threads is less than or equal to the number of cores in a processor. For a larger number of threads, the non-uniform memory access and the small size of the L3 cache produce a strong loss of performance.
5 Conclusions and future work
The parallelization of an algorithm to solve the bi-blending problem has been studied for a small-to-medium-size instance of the problem. This single case illustrates the difficulties of this type of algorithm. Bi-blending increases the challenges of parallelizing a B&B algorithm compared to single blending, because it actually runs two B&B algorithms that share information. Additionally, in bi-blending algorithms, a combination of the final simplices has to be performed after the B&B phase to discard regions without a solution. This combination phase can be computationally several orders of magnitude more expensive than the B&B phase. Here, we use just one thread for each product in the B&B phase and several threads for the combination phase. Linear speedup is obtained on a shared-memory machine with an octo-core processor and a large L3 cache using one of the developed strategies. Executions on another shared-memory machine, with eight quad-core processors and a small L3 cache, lead to poor performance when the number of threads is greater than the number of cores per processor.
Our intention is to develop a new version that reduces cache misses and to experiment with larger-dimensional problems for the parallel bi-blending algorithm, trying to decrease the computational cost. Another future research question is to develop the n-blending algorithm and its parallel version, which is the problem of interest to industry.
Acknowledgements
This work has been funded by grants from the Spanish Ministry of Science and Innovation (TIN2008-01117 and TIN2010-12011-E) and Junta de Andalucía (P08-TIC-3518 and P11-TIC-7176), in part financed by the European Regional Development Fund (ERDF). Juan F. R. Herrera is a fellow of the Spanish FPU programme. Eligius M. T. Hendrix is a fellow of the Spanish “Ramón y Cajal” contract programme, co-financed by the European Social Fund.
Open Access
This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.