An Efficient CGM-Based Parallel Algorithm for Solving the Optimal Binary Search Tree Problem Through One-to-All Shortest Paths in a Dynamic Graph

The coarse-grained multicomputer parallel model (CGM for short) has been used for solving several classes of dynamic programming problems. In this paper, we propose a parallel algorithm on the CGM model, with p processors, for solving the optimal binary search tree problem (OBST problem), which is a polyadic non-serial dynamic programming problem. Firstly, we propose a dynamic graph model for solving the OBST problem and show that each instance of this problem corresponds to a one-to-all shortest path problem in this graph. Secondly, we propose a CGM parallel algorithm based on our dynamic graph to solve the OBST problem through one-to-all shortest paths in this graph. It uses our new technique of irregular partitioning of the dynamic graph to address the well-known contradictory objectives of minimizing the communication time and balancing the load of the processors in this type of graph. Our solution is based on Knuth's sequential solution and requires O(n²/p) time steps per processor and ⌈√(2p)⌉ + k × (⌈⌈√(2p)⌉/2⌉ + 1) communication rounds, where the integer k is a parameter of the partitioning technique of our algorithm. This new CGM algorithm performs better than the previously most efficient solution, which uses regular partitioning of the tasks graph.


Introduction
Dynamic programming is a technique that can be applied to solve several combinatorial optimization problems. The idea in dynamic programming is to order the computations of the solutions of subproblems in such a way that each of them is computed only once. To do this, the problem is subdivided into steps (levels). Each step is composed of several subproblems and has a resolution strategy and a number of associated states. The solution of each of the subproblems of a given step depends only on those of the subproblems belonging to the preceding steps. Thus, the complexity of a dynamic programming problem depends on the intensity of the dependence between the subproblems of the different steps. Relying on the nature of the dependence of subproblems, we distinguish two types of multilevel dynamic programming [23]: 1. serial dynamic programming, where the solution of a subproblem of a given level depends exclusively on the solutions of the subproblems of the immediately preceding level, and 2. non-serial dynamic programming, where the solution of a subproblem of a given level depends on the solutions of the subproblems of several preceding levels.
Moreover, a dynamic programming problem is said to be monadic if the corresponding cost function has only one term of recurrence (see Eq. 1), and it is said to be polyadic otherwise. Problems that use the dynamic programming technique induce a lot of calculations. In this paper, we study the optimal binary search tree problem (OBST for short), which is a polyadic non-serial dynamic programming problem in which there is a strong dependency and an irregularity of calculation load between the subproblems. Given n keys (K_1, K_2, …, K_n), the access probabilities of each key, and those of values occurring in the gap between two successive keys, an optimal binary search tree for this set of keys is one which has the smallest search cost. The OBST problem is to construct an optimal binary search tree, given the keys and their access probabilities. It is widely used in several fields, for example in word prediction, which is the problem of guessing the next word in a sentence as the sentence is being entered, and updating this prediction as the word is typed [9]. The classical sequential algorithm for this problem requires O(n³) calculation time and O(n²) memory space [13].

Literature Review
Using the monotonicity property of optimal binary search trees (which we define below), Knuth [22] derived an O(n²) algorithm in the same space. Yao [34] proposed an algorithm with the same complexity thanks to the quadrangle inequalities. Using the restrictive assumption of convexity, Eppstein et al. [10] obtained an algorithm which requires O(n log n) time. During the resolution of the OBST problem, the sequential nature of the evaluation (diagonal-by-diagonal processing) of the dynamic programming table prevents its efficient parallelization in the case of the PRAM model [2]. When using real target parallel machines (the systolic model, for example), where communication operations are taken into account, the parallel evaluation of this table becomes more complicated because of the strong dependence relations between the subproblems (the evaluation of an element of level d requires already calculated values of two elements of levels 0, 1, …, d − 1); see [14,18]. Although the parallelization of the classical algorithm has been widely studied by the parallel processing community for different parallel computing models [4,28,29], little work has been produced on the parallelization of Knuth's approach. It is difficult to effectively parallelize this version on an abstract model. In the PRAM model, it is unknown how to use the monotonicity property to reduce the number of processors in polylogarithmic calculation time [16,17]. The solution becomes even more difficult to design if we want to search for OBST solutions on a bridging model such as the bulk synchronous parallel model (BSP for short) [32,33]. Precisely, the goal is to benefit from the characteristics of a large majority of current parallel machines (where local memory size ≫ 1) in order to minimize the communication overhead by regrouping communications in global communication rounds between pure calculation super-steps, which is the objective of the CGM model.
In this paper, our aim is to parallelize Knuth's sequential algorithm for the OBST problem on the bridging coarse-grained BSP/CGM model (bulk synchronous parallel/coarse-grained multicomputer) [5-8,32]. CGM seems to be the best model for the design of algorithms that are not too dependent on a particular architecture. A BSP/CGM machine is a set of p processors, each having its own local memory of size s (with O(s) ≫ O(1)) and connected to the others through a router able to deliver messages in a point-to-point manner. Each BSP/CGM parallel algorithm is an alternation of local computations and global communication rounds. Each communication round consists in routing a single h-relation with h = O(s). Each CGM computation or communication round corresponds to a BSP super-step with a communication cost of g × s [4], where g is the cost of communicating a word in the BSP model. In order to produce an efficient BSP/CGM parallel algorithm, the designers' effort must be to maximize the speed-up and minimize the number of communication rounds (ideally, the latter must be independent of the problem size and constant in the optimum). There are many CGM-based parallel algorithms for dynamic programming problems, among others: the string editing problem [1] and the longest common subsequence problem [2,3,12,28]. The special class of problems including the matrix chain ordering problem (MCOP), OBST and the optimal string parenthesizing problem (OSPP), which are solved by the same dynamic programming algorithm, each with its own specification, was tackled in [15,19,26,30].
In [20], the authors proposed a CGM-based parallel algorithm for MCOP that requires O(n³/p) time steps and O(p) communication rounds on p processors. The one for OSPP in [30] requires ⌈√(2p)⌉ communication rounds and O(n³/p) execution time on p processors. Similarly, their algorithm for OBST in [27] requires ⌈√(2p)⌉ communication rounds and O(n²/p) time steps. A major drawback of these algorithms is that the loads of the processors are unbalanced, which promotes the idleness of processors.

Our Contribution
Dilson and Marco [15] proposed a CGM-based parallel algorithm for MCOP which requires O(1) communication rounds and O(n³/p) time steps. Their solution is based on the graph model proposed in [4], called a dynamic graph. Indeed, Bradford showed that each instance of this problem corresponds to a one-to-all shortest path problem in this graph. Thus, their algorithm is based on the computation of shortest paths in a dynamic graph derived from the input. It would be interesting to use the properties of this type of graph for solving the OBST problem.
Tchendji and Myoupo [27,30] have shown that for the OBST problem, load balancing of the processors and minimization of the communication time are two contradictory objectives when the corresponding tasks graph is partitioned into subgraphs (or blocks) of the same size. Indeed, since each block of the graph is fully evaluated by a single processor, we have the following two scenarios: 1. to minimize communication rounds, the subgraphs must be of large size; thus, the number of blocks of the task graph is minimized, and therefore the number of communication rounds of the corresponding algorithm is reduced; 2. to promote load balancing of the processors (and therefore increase the efficiency of the parallel algorithm), the subgraphs must be of smaller size; thus, if a processor has one more block to evaluate than another, because blocks are of small size, their load difference will also be low.
Their solution gives the final user the choice to optimize one criterion or the other according to the parameters g (granularity of their model) and p (number of processors) [31]. Their idea is based on a task graph partitioning technique and data distribution introduced in [27,30]. In these approaches, the task graph is partitioned in the same way (in rows and columns of blocks), and the data dependencies are the same. The steps of the algorithm based on user parameters are then deduced, and the performance (complexity) of the algorithm is expressed from these parameters.
The main drawback of this solution is the contradictory optimization criteria caused by the regular size of the blocks of the task graph when partitioning. Indeed, the final user cannot optimize more than one criterion. The second drawback is the idleness of processors. Indeed, the number of blocks per diagonal quickly becomes lower than p, so several processors cannot be active at the same time. From there, over time, there are more and more idle processors (one more after each step). Yet the blocks of the upper levels are those which induce the biggest loads. Therefore, this idleness is at the origin of the load unbalance between the processors and even of their latency.
Our main contribution can be summarized as follows: • firstly, we propose a dynamic graph model for solving the OBST problem and show that each instance of this problem corresponds to a one-to-all shortest path problem in this graph. So, as in [15], any solution of a one-to-all shortest path problem can be used on our dynamic graph to accelerate the sequential or parallel resolution of the OBST problem; • secondly, we propose a BSP/CGM parallel algorithm based on our dynamic graph to solve the OBST problem from the corresponding one-to-all shortest path in this graph. It uses our new technique of irregular partitioning of this dynamic graph to address the contradictory objectives of minimizing the communication time and balancing the load of the processors in this type of graph. This new technique ensures that the blocks of the first diagonals (or the first steps) are of large size to minimize the number of communications, but this size decreases along the diagonals to increase the number of blocks in these diagonals and allow processors to stay active as long as possible. This promotes load balancing and thus minimizes the overall computation time of the processors. It also reduces their latency, which is the largest part of the overall communication time.
The remainder of this paper is organized as follows: Sect. 2 formalizes the OBST problem and recalls Knuth's sequential algorithm, Sect. 3 presents our dynamic graph model, and Sect. 4 describes our CGM algorithm.

Formalization of the OBST Problem

A binary search tree is a binary tree that serves to efficiently look for an element among a sorted set of elements (often called keys). All the keys of the left sub-tree are less than the key at the root, and all the keys of the right sub-tree are greater than the key at the root. This property applies recursively to the left and right sub-trees. A binary search tree thus completely orders its keys.
The search cost of a particular key in a binary search tree depends on its depth in this tree. The example of Fig. 1 (a binary search tree for the set of letters {c, e, f, g, h, k, l, n, o, r, s} sorted in alphabetical order) shows that the search cost of the key "n" is 1 and that of the key "l" is 5. Thus, to order a set of n keys, different binary trees are possible [21]. For a sequence {a, b, c}, Fig. 2 shows the different possible trees. Therefore, given a set of keys and their access probabilities, the optimal binary search tree problem is to find the binary search tree for that set of keys which has the smallest cost. More formally, the problem can be defined as follows: consider n keys K_1 < K_2 < ⋯ < K_n, which are to be placed in a binary search tree, and 2n + 1 probabilities p_1, p_2, …, p_n, q_0, q_1, q_2, …, q_n with ∑ p_i + ∑ q_j = 1, where p_i is the access probability of the key K_i and q_j is that of: • a key which lies between K_j and K_{j+1} if 0 < j < n; • a key greater than K_n if j = n; • a key less than K_1 if j = 0.
Let us denote by T a binary search tree for this set of keys. Let b_i be the number of edges on the path from the root to the internal node K_i, and a_j the number of edges on the path from the root to the leaf (K_j, K_{j+1}). Then the expected cost induced by a search in T is given by: cost(T) = ∑ p_i × (b_i + 1) + ∑ q_j × a_j, since the access cost of key K_i is b_i + 1, while for the gap (K_j, K_{j+1}) it is just a_j. An optimal binary search tree for this set of keys is a binary search tree whose expected search cost is smallest. Hence, the optimal binary search tree problem boils down to constructing an optimal binary search tree given a set of keys and their access probabilities.
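The cost formula above can be checked on a tiny hand-built tree. The sketch below is our own illustrative example (the instance and the tree layout are hypothetical, not from the paper):

```python
def expected_cost(p, q, b, a):
    """cost(T) = sum_i p_i * (b_i + 1) + sum_j q_j * a_j."""
    return sum(pi * (bi + 1) for pi, bi in zip(p, b)) + \
           sum(qj * aj for qj, aj in zip(q, a))

# Hypothetical instance with n = 3 keys: K_2 at the root, K_1 and K_3 as
# its children, so the key depths are b = [1, 0, 1] and the four gap
# leaves all sit at depth 2.
p = [0.2, 0.3, 0.1]
q = [0.1, 0.1, 0.1, 0.1]
cost = expected_cost(p, q, b=[1, 0, 1], a=[2, 2, 2, 2])
# cost = 0.2*2 + 0.3*1 + 0.1*2 + 4*(0.1*2) = 1.7
```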
Denote by OBST(i, j) an optimal binary search tree corresponding to the set of keys in the interval int(i, j) = [K_{i+1}, …, K_j], and denote by Tree(i, j) the cost induced by this tree. Let w(i, j) = p_{i+1} + ⋯ + p_j + q_i + q_{i+1} + ⋯ + q_j; thus w(0, n) = 1 and w(i, i) = q_i. The search costs Tree(i, j) of OBST(i, j) obey the following dynamic programming recurrence for 0 ≤ i ≤ j ≤ n:

Tree(i, i) = q_i,  and  Tree(i, j) = w(i, j) + min_{i<k≤j} { Tree(i, k−1) + Tree(k, j) } for i < j.   (1)

The corresponding sequential generic algorithm for this class of problems requires O(n³) time and O(n²) space [13]. The cost values Tree(i, j) are calculated according to their tabulation in an array, often called the dynamic programming table or shortest path matrix, denoted here by Tree(0, n).
In this table, each entry (i, j) corresponds to a subproblem and contains the value Tree(i, j) at the end of the algorithm. For example, for seven keys (n = 7), the corresponding dynamic programming table Tree(0, 7) (where (i, j) stands for Tree(i, j)) is represented in Fig. 3. The bottom-up approach of this algorithm evaluates the entries of this table diagonal by diagonal, starting from the central diagonal (Fig. 4). The general structure of this algorithm is presented by Algorithm 1. When all entries of the table Tree(0, n) are evaluated (corresponding to the first step, the step of calculation or search of the optimal cost), a recursive algorithm with time complexity O(n²) is executed for the construction of the tree (corresponding to the second step, the step of construction of the optimal solution). This recursive algorithm uses a table denoted Cut(0, n), which is obtained from the table Tree(0, n). Each entry (i, j) of this table gives the value of k (in Eq. 1) that minimizes Tree(i, j); it is this first step that gives an optimal decomposition of the tree OBST(i, j) into two sub-trees. In this paper, we are only interested in the parallelization of the first step of the resolution of this problem, i.e., the search for the optimal cost.
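The diagonal-by-diagonal structure of Algorithm 1 can be sketched as follows. This is a hedged Python rendering of the generic O(n³) evaluation, not the paper's own pseudocode; the base case Tree(i, i) = q_i follows the convention used by the arcs of the dynamic graph later in the paper:

```python
def obst_classical(p, q):
    """Classical O(n^3) evaluation of the tables Tree and Cut.

    p[1..n] are the key access probabilities (p[0] is unused) and
    q[0..n] the gap probabilities.  Tree[i][j] is the optimal cost for
    the keys K_{i+1}..K_j, and Cut[i][j] the root index minimizing it.
    """
    n = len(p) - 1
    Tree = [[0.0] * (n + 1) for _ in range(n + 1)]
    Cut = [[0] * (n + 1) for _ in range(n + 1)]
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        Tree[i][i] = q[i]
        w[i][i] = q[i]
    for d in range(1, n + 1):          # diagonal d: entries with j - i = d
        for i in range(n - d + 1):
            j = i + d
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            best, bestk = float('inf'), i + 1
            for k in range(i + 1, j + 1):      # try every root K_k
                c = Tree[i][k - 1] + Tree[k][j]
                if c < best:
                    best, bestk = c, k
            Tree[i][j] = w[i][j] + best
            Cut[i][j] = bestk
    return Tree, Cut

# Textbook instance with n = 5 keys (CLRS); its known optimal cost is 2.75.
Tree, Cut = obst_classical([0.0, 0.15, 0.10, 0.05, 0.10, 0.20],
                           [0.05, 0.10, 0.05, 0.05, 0.05, 0.10])
```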
Proposed by Knuth, the fastest sequential algorithm requires O(n²) time and O(n²) space [22]. The acceleration obtained with this algorithm comes from the monotonicity property verified by some entries of the matrix Cut(0, n) [22].
In Sect. 2.2, we present Knuth's sequential algorithm, on which our CGM algorithm is based, and discuss how this acceleration can be used to derive an efficient BSP/CGM parallel algorithm for OBST.

Knuth-Acceleration-Based Sequential Algorithm in O(n²)

Knuth's acceleration results from his observation of a property of the matrix Cut(0, n) [22]. This monotonicity property states that Cut(i, j − 1) ≤ Cut(i, j) ≤ Cut(i + 1, j) for all 0 ≤ i < j ≤ n. It transforms Algorithm 1 into the following (Algorithm 2).
We can see from Algorithm 2 that the acceleration comes from the internal for loop, which requires only O(n) comparison operations per diagonal, unlike O(n²) for the classical version. For the calculation of an entry (i, j) of diagonal (j − i + 1), the number of comparison operations is reduced from (j − i) to (Cut(i + 1, j) − Cut(i, j − 1)). The Knuth acceleration is not regular over the entries of the same diagonal. Indeed, the number of comparison operations differs from one entry to another. Thus, contrary to the classical version, if the entries of a diagonal are fairly distributed over different processors, nothing guarantees that these processors will have the same load. Hence, a new constraint must be taken into account during the design of parallel algorithms based on Knuth's algorithm: the inequality of the calculation loads required by entries of the same diagonal of the dynamic programming table.
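The accelerated variant can be sketched as follows (a hedged rendering of Algorithm 2, our own code rather than the paper's): the only change with respect to the classical version is that the root candidate k for entry (i, j) is searched in [Cut(i, j − 1), Cut(i + 1, j)] instead of the whole interval (i, j]:

```python
def obst_knuth(p, q):
    """Knuth-accelerated O(n^2) evaluation: same tables as the classical
    algorithm, but the search for the root of entry (i, j) is restricted
    to [Cut(i, j-1), Cut(i+1, j)], as allowed by monotonicity."""
    n = len(p) - 1
    Tree = [[0.0] * (n + 1) for _ in range(n + 1)]
    Cut = [[0] * (n + 1) for _ in range(n + 1)]
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        Tree[i][i] = q[i]
        w[i][i] = q[i]
    for i in range(n):             # diagonal 1: the root can only be K_{i+1}
        w[i][i + 1] = q[i] + p[i + 1] + q[i + 1]
        Tree[i][i + 1] = w[i][i + 1] + q[i] + q[i + 1]
        Cut[i][i + 1] = i + 1
    for d in range(2, n + 1):
        for i in range(n - d + 1):
            j = i + d
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            best, bestk = float('inf'), Cut[i][j - 1]
            # The restricted scans telescope to O(n) work per diagonal,
            # but their lengths differ from one entry to another.
            for k in range(Cut[i][j - 1], Cut[i + 1][j] + 1):
                c = Tree[i][k - 1] + Tree[k][j]
                if c < best:
                    best, bestk = c, k
            Tree[i][j] = w[i][j] + best
            Cut[i][j] = bestk
    return Tree, Cut

# On the CLRS textbook instance (n = 5), the optimal cost is 2.75.
Tree, Cut = obst_knuth([0.0, 0.15, 0.10, 0.05, 0.10, 0.20],
                       [0.05, 0.10, 0.05, 0.05, 0.05, 0.10])
```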
This property of Knuth's algorithm induces a load unbalance that is difficult to manage and can cancel out the acceleration obtained. One possible solution to inherit this acceleration and eliminate the global load unbalance is to have each diagonal of the task graph evaluated by a single processor. But this forgoes the main source of parallelism of this algorithm, as for the classical O(n³) version: the independence of the computations between the entries of the same diagonal of the dynamic programming table.
In the next section (Sect. 3), we propose a dynamic graph model and show that the OBST problem can be solved through a one-to-all shortest path problem in this graph.

Fundamental Concepts
In this section, we present the basic concepts of the dynamic graph model proposed in this paper.

Definition 1 A finite semigroupoid (S, R, •) is a non-empty finite set S, a binary relation R ⊆ S × S, and an associative binary operator • such that a_i • a_j is defined if and only if (a_i, a_j) ∈ R [24].

Definition 3 A weighted semigroupoid (S, R, •, pc) is a semigroupoid (S, R, •) with a nonnegative product cost function pc such that if (a_i, a_j) ∈ R then pc(a_i, a_j) is the cost of evaluating a_i • a_j. The minimal cost of evaluating an associative product a_{i+1} • a_{i+2} • ⋯ • a_j is denoted by sp(i, j).

For the OBST instance, this minimal cost obeys a recurrence (Eq. 2) that is equivalent to Eq. 1 in Sect. 2.1. So, we can derive the next corollary.

Description of the Model
We designate by K = {K_1 < K_2 < ⋯ < K_n} the set of n keys from which we want to build an optimal binary search tree. Some searches may concern values that do not belong to K, so we also have n + 1 "dummy keys" d_0, d_1, …, d_n representing values outside K. For each key K_i (respectively dummy key d_i), we have the probability p_i (respectively q_i) that a search concerns K_i (respectively d_i). Each key K_i is an internal node, and each dummy key d_i is a leaf. The formalization of the OBST problem is carried out in Sect. 2.1.
We reference each vertex of our dynamic graph D_n by a pair of integers (i, j) with 0 ≤ i ≤ j ≤ n. We say that two vertices (i, j) and (k, m) are on the same row (respectively on the same column) if i = k (respectively j = m). They are on the same diagonal (j − i + 1) if (j − i) = (m − k). D_n contains n + 1 diagonals of vertices, n + 1 rows of vertices and n + 1 columns of vertices. The cost of the shortest path from an added virtual vertex (−1, −1) to any vertex (i, j) is sp(i, j). This is shown in Fig. 5a.
We designate the unit arcs of the dynamic graph D_n by →, ↑ or ↗. The unit arc (i, j) → (i, j + 1) represents the associative product (K_{i+1} • ⋯ • K_j) • K_{j+1}, i.e., a binary sub-tree of root K_{j+1} whose left part contains the keys K_{i+1} … K_j and dummy keys d_i … d_j and whose right part contains the dummy key d_{j+1}; its weight is q_{j+1} + w(i, j + 1). Similarly, the unit arc (i, j) ↑ (i − 1, j) represents the product K_i • (K_{i+1} • ⋯ • K_j), i.e., a binary sub-tree of root K_i whose left part contains the dummy key d_{i−1} and whose right part contains the keys K_{i+1} … K_j and dummy keys d_i … d_j; its weight is q_{i−1} + w(i − 1, j). Also, for all i with 0 ≤ i ≤ n, the arrows ↗ represent unit arcs from (−1, −1) to (i, i), of weight q_i.
To represent split trees, we add edges to D_n called jumpers. Denote a horizontal jumper by ⇒ and a vertical jumper by ⇑. The vertical jumper (i, j) ⇑ (s, j) is (i − s) units long and the horizontal jumper (i, j) ⇒ (i, t) is (t − j) units long, where all non-jumper edges are of length 1. The shortest path to the node (i, j) may pass through the jumpers (i, k) ⇒ (i, j) and (k + 1, j) ⇑ (i, j), which represent the product (K_{i+1} • ⋯ • K_k) • (K_{k+1} • ⋯ • K_j). The dynamic graph D_3, corresponding to an OBST problem for a set of n = 3 keys, is represented in Fig. 5a.
Let us note the similarity between a dynamic graph D_n and the conventional dynamic programming table Tree for the resolution of the OBST problem: the value of sp(i, j) in the dynamic graph D_n is identical to Tree(i, j) in the dynamic programming table.
Calculating a shortest path from (−1, −1) to (i, j) gives the minimum cost of the OBST represented by the product K_{i+1} • ⋯ • K_j. So finding a shortest path from (−1, −1) to (0, n) gives the minimum cost of the OBST represented by the product K_1 • K_2 • ⋯ • K_n.

Lemma 1 For all vertices (i, k) in a D_n graph, sp(i, k) can be computed by a path whose edges are all of length no larger than ⌈(k − i)/2⌉.

Proof A shortest path to (j + 1, k) cannot contain a jumper longer than k − (j + 1). Since k − (j + 1) < k − j, the Lemma follows inductively. ◻

The proof of Lemma 1 leads directly to Theorem 1.
Theorem 1 is fundamental because it makes it possible to avoid calculation redundancy when looking for the value of the shortest path for the vertices of the graph D_n. Indeed, for any vertex (i, j), among all its shortest paths containing jumpers, only those that contain horizontal jumpers only are evaluated. The input graph of our BSP/CGM algorithm is therefore a subgraph of D_n, denoted by D′_n, in which the set A′ of arcs consists of the set of unit arcs of D_n ∪ {(i, j) ⇒ (i, t) : 0 ≤ i < j < t ≤ n}. Thus, in the graph D′_n, a vertex (i, j) has jumpers only to the following vertices on the same row. Figure 5b shows the dynamic graph D′_3.
Theorem 2 Each instance of the OBST problem on a set of n elements corresponds to a dynamic graph D_n, where sp(i, j) is equal to Tree(i, j). Thus, the shortest path from the virtual vertex (−1, −1) to the vertex (0, n) in D_n solves Tree(0, n).

Proof To prove this theorem, it is sufficient to show that the cost of every path from (−1, −1) to (0, n) in a D_n graph corresponds to the cost of construction of a tree of an n-element associative product and that, conversely, for every tree of an n-element associative product, there is a corresponding path in D_n where both the path and the associative product have the same cost.
We only show that for each path from (−1, −1) to (0, n) in a D_n graph, there is a corresponding associative product of n elements (or n − 1 operators "•"). This proof is by induction on the lengths of the associative products. Consider D_4 and the product (K_1 • K_2) • (K_3 • K_4). In D_4, the jumper (0, 2) ⇒ (0, 4) has weight sp(3, 4) + w(0, 4). Therefore, the cost of a path using this jumper corresponds to the cost of the associative product (K_1 • K_2) • (K_3 • K_4). A symmetric argument holds for the jumper (3, 4) ⇑ (0, 4). Hence, for each associative product there is a path, and for each path there is an associative product. Now suppose that the theorem holds for all n ≤ k, where k ≥ 4 and n is the number of associative operators in a given associative product. Thus, the inductive hypothesis is that for any path in D_n from (−1, −1) to (0, n), where n ≤ k, there is a corresponding associative product of n elements with the same cost.
Without loss of generality, we only consider horizontal jumpers. For m = 2k, take a path in D_m from (−1, −1) to (0, m). We will show that for all m ≤ 2k there is a corresponding m-element associative product of the same cost. (This proof holds for both even and odd length products because D_{k−1} is a proper subgraph of D_k.) The structure of D_m along with the inductive hypothesis gives: 1. the cost of a path from (−1, −1) to any node (i, j), such that j − i ≤ k, corresponds to the cost of a binary search tree of K_{i+1} • ⋯ • K_j, by the inductive hypothesis; 2. the cost of a path from any node (i, t) to (0, m) corresponds to the cost of a binary search tree of the corresponding sub-product, by the inductive hypothesis, since P consists of at least k elements. Suppose that the path from (−1, −1) to (0, 2k) includes the jumper (i, j) ⇒ (i, t). Then assume that j − i < k ≤ t − i (otherwise, by the inductive hypothesis and the two facts above, the proof is complete). Therefore, consider the jumper (i, j) ⇒ (i, t), where j − i < k ≤ t − i. By the inductive hypothesis and fact 2, the cost of a path from (i, t) to (0, m) corresponds to the cost of a binary search tree of the corresponding sub-product. Again by the inductive hypothesis and fact 1 above, the cost of a path to (i, j) corresponds to the cost of a binary search tree of K_{i+1} • ⋯ • K_j, where K_{i+1} • ⋯ • K_j is a sub-product of P. Furthermore, the jumper (i, j) ⇒ (i, t) is a part of a path from (−1, −1) to (0, m) whose cost corresponds to the cost of the product: its cost is sp(j + 1, t) + w(i, t), and since t − (j + 1) < t − i, we can apply the inductive hypothesis. Vertical jumpers follow similarly.
◻

It is easy to derive Corollary 2.

Corollary 2 Any solution of a one-to-all shortest path problem can be used on our dynamic graph D_n to solve the OBST problem.
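To illustrate Corollary 2, here is a hedged Python sketch (our own rendering, not an algorithm from the paper) that computes sp(i, j) by relaxing the horizontal arcs of D′_n entering each vertex, diagonal by diagonal. Since jumper weights involve sp values of shorter intervals, they are resolved on the fly, which is what makes the graph "dynamic":

```python
def one_to_all_sp(p, q):
    """Shortest-path values sp(i, j) from the virtual vertex (-1, -1) in
    D'_n, processed diagonal by diagonal so that every sp value appearing
    in an arc weight is already final when it is needed."""
    n = len(p) - 1
    sp = [[0.0] * (n + 1) for _ in range(n + 1)]
    w = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        sp[i][i] = q[i]        # unit arc (-1, -1) -> (i, i), of weight q_i
        w[i][i] = q[i]
    for d in range(1, n + 1):
        for i in range(n - d + 1):
            j = i + d
            w[i][j] = w[i][j - 1] + p[j] + q[j]
            # incoming arc (i, k) => (i, j) has weight sp(k+1, j) + w(i, j);
            # k = j - 1 is the unit arc ->, k < j - 1 a horizontal jumper
            sp[i][j] = min(sp[i][k] + sp[k + 1][j] for k in range(i, j)) \
                       + w[i][j]
    return sp

# On the CLRS textbook instance (n = 5), the shortest path to (0, n)
# equals the known optimal tree cost, 2.75.
sp_table = one_to_all_sp([0.0, 0.15, 0.10, 0.05, 0.10, 0.20],
                         [0.05, 0.10, 0.05, 0.05, 0.05, 0.10])
```

Note that the relaxation order (by diagonals) mirrors the dynamic programming table evaluation, which is exactly the sp(i, j) = Tree(i, j) correspondence of Theorem 2.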
In the next section, we use our dynamic graph D_n as the task graph for the construction of our CGM solution.

CGM Algorithm for OBST Problem
This section presents our CGM algorithm for the OBST problem, which constructs the one-to-all shortest paths in the dynamic graph D_n on a BSP/CGM model of p processors, each with a local memory of size O(n²/p). To reach this goal, we first partition the shortest path matrix (or the D_n graph) into sub-matrices (or subgraphs) of varying sizes (irregular sizes) in Sect. 4.1. Next, we study the dependencies between subproblems in Sect. 4.2 and distribute the blocks onto processors in Sect. 4.3. Finally, we present the CGM algorithm in Sect. 4.4. This algorithm tries to reconcile the two contradictory objectives mentioned in Sect. 1.2.

Task Graph Partitioning
The idea is to start the subdivision on the largest diagonal of the shortest path matrix with blocks of large size, in order to minimize the number of communications. We reiterate this same partitioning on the following diagonals. Since the number of blocks per diagonal quickly becomes smaller than the number of processors, when a diagonal reaches half of the first diagonal of blocks, a fragmentation is carried out (that is to say, the size of the blocks is reduced) to catch up with (or exceed by one notch) the number of blocks of the first diagonal of blocks, in order to increase the number of blocks of these diagonals and allow a maximum of processors to remain active. This minimizes the idleness of the processors and thus promotes their load balancing. After k fragmentations, we no longer modify the size of the blocks, and the rest of the partitioning becomes traditional, because excessive fragmentation of the blocks would lead to a drastic increase in the number of communication rounds.
We partition the shortest path matrix into sub-matrices, or blocks (Bk(i, j) for short), where Bk(i, j) is a g(n, p, k) × g(n, p, k) matrix at the kth fragmentation; i.e., at each fragmentation we divide the current blocks into 4. Figure 6 shows two scenarios of this partitioning for n = 31, k = 2 and p ∈ {2, 3, 4}. The number in each block represents the diagonal to which it belongs. In Fig. 6a, we have two diagonals of blocks when no fragmentation is performed. When the number of blocks of the second diagonal becomes equal to half of that of the first, we perform the first fragmentation on the block of the second diagonal (this block of size 16 × 16 is divided into 4 blocks of size 8 × 8) and the number of diagonals of blocks increases by 2. We perform the second fragmentation on the block of the fourth diagonal when the number of blocks of this diagonal becomes equal to half of that of the first (this block of size 8 × 8 is divided into 4 blocks of size 4 × 4) and the number of diagonals of blocks again increases by 2. Finally, we obtain a task graph with six diagonals of blocks.
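The bookkeeping of this partitioning can be sketched as follows. This is our own reading of the scheme, not code from the paper: we assume, from the abstract's round count, that f(p) = ⌈√(2p)⌉, and that each fragmentation halves the block side (every block splits into 4):

```python
import math

def f(p):
    """Assumed number of blocks on the first diagonal: ceil(sqrt(2p))."""
    return math.ceil(math.sqrt(2 * p))

def block_side(n, p, k):
    """Hypothetical block side after k fragmentations: the initial side
    ceil(n / f(p)) is halved at each fragmentation."""
    return math.ceil(n / f(p)) // (2 ** k)

def num_block_diagonals(p, k):
    """Diagonals of blocks to evaluate, as stated in Sect. 4.4:
    f(p) + k * (ceil(f(p) / 2) + 1)."""
    fp = f(p)
    return fp + k * (math.ceil(fp / 2) + 1)

# The n = 31, p = 2, k = 2 example of Fig. 6: block sides go 16 -> 8 -> 4,
# and the task graph ends up with six diagonals of blocks.
sides = [block_side(31, 2, k) for k in range(3)]
diags = num_block_diagonals(2, 2)
```

Under these assumptions the helper reproduces the figures of the example (16 × 16, 8 × 8 and 4 × 4 blocks, six diagonals), which supports this reading of the partitioning.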
It is important to note the following fundamental Remarks:

Remark 1
1. The blocks of the first diagonal are upper triangular matrices of g(n, p) rows and g(n, p) columns; 2. a block is full if it is a non-triangular matrix of size g(n, p, k) × g(n, p, k); 3. in general, the blocks of the last column of blocks are not full (this is shown in Fig. 6b); with S = f(p).

Block Dependency
Figure 7a, b shows the dependency of Bk(i, j), i < j , on the other blocks after k fragmentations.
The extremities of a block Bk(i, j) are defined as follows (see Fig. 7): • LUE_ij stands for the leftmost upper entry of Bk(i, j); • RUE_ij stands for the rightmost upper entry of Bk(i, j); • LLE_ij stands for the leftmost lower entry of Bk(i, j); • RLE_ij stands for the rightmost lower entry of Bk(i, j).
The expressions of these entries are the following. The following Lemma 4 is inspired by [19]:

Lemma 4 (block dependency) The calculation of the entries of the block Bk(i, j) requires, in the worst case, the values of two blocks delimited, respectively, by the extremities (ABCD) and (EFGH).
Due to the unbalanced load caused by the Knuth acceleration, the sizes of the rectangles ABCD and EFGH, and the input contents from the task graph, differ from one block to another. They differ even for blocks belonging to the same diagonal. Thus, the calculation load required for the evaluation of the blocks cannot be deduced prior to the treatment. The dependency relation between blocks shows that blocks that are not on the same row or on the same column of blocks are independent. Consequently, they can be evaluated in parallel.

Remark 2
In order to minimize the amount of data exchanged between the processors in the communication phases, each processor communicates only the subset of its block, corresponding to the size of the target block.

Mapping Blocks Onto Processors
In this mapping, the blocks of the main diagonal are assigned from the leftmost upper corner to the rightmost lower corner. This process is renewed until all blocks have been assigned: starting with processor 1, we travel through the blocks along a "snake-like" path, as shown in Fig. 8.
This mapping has several advantages: 1. some processors evaluate at most one block more than the others; 2. this distribution helps with load balancing between processors. In fact, the blocks are evenly distributed among the processors; thus, each processor owns blocks belonging to both the first and the last diagonals.
However, it has a drawback: due to the snake-like data distribution onto processors, communications are not minimized.
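A minimal sketch of the snake-like assignment, assuming blocks are numbered consecutively along the traversal path and processors are numbered 0 … p − 1 (both conventions are ours, for illustration):

```c
/* Boustrophedon ("snake-like") mapping: processors are visited
 * 0, 1, ..., p-1, then p-1, ..., 1, 0, and so on, so consecutive
 * blocks along the path are spread evenly over the processors. */
int snake_proc(int block, int p)
{
    int r = block % (2 * p);
    return (r < p) ? r : 2 * p - 1 - r;
}
```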
Lemma 5 With the snake-like distribution of the blocks onto processors, each processor has to evaluate at most (3k + 2) blocks, where k is the number of fragmentations of blocks performed.
Proof Depending on the parity of f(p), we have the following two scenarios:
1. when f(p) is odd, each processor has to evaluate at most one super-block, 3(k − 1) blocks in the diagonals from the first to the (k − 1)th fragmentation, and 4 blocks after the kth fragmentation; thus, 1 + 3(k − 1) + 4 = 3k + 2 blocks in total;
2. when f(p) is even, each processor has to evaluate at most 2 super-blocks, 2(k − 1) blocks in the diagonals from the first to the (k − 1)th fragmentation, and 3 blocks after the kth fragmentation; thus, 2 + 2(k − 1) + 3 = 2k + 3 blocks in total.
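The two cases of the proof can be captured in a small helper (a sketch; f(p) denotes, as above, the number of super-blocks on the first diagonal, and k ≥ 1 the number of fragmentations):

```c
/* Upper bound of Lemma 5 on the number of blocks a processor has to
 * evaluate under the snake-like distribution, by parity of f(p). */
int max_blocks_per_proc(int fp, int k)
{
    return (fp % 2 != 0) ? 3 * k + 2   /* f(p) odd  */
                         : 2 * k + 3;  /* f(p) even */
}
```

Note that for k ≥ 1 the odd case 3k + 2 dominates the even case 2k + 3, which yields the (3k + 2) bound stated in the lemma.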

CGM Algorithm
Our CGM algorithm evaluates the values of the shortest paths to each node of a sub-matrix (or a subgraph), starting with the first diagonal of blocks up to the diagonal f(p) + k × (⌈f(p)∕2⌉ + 1). However, because of the nature of the dependencies between the blocks, one cannot predict, before the beginning of the computation of the entries of the block Bk(i, j), which values will be necessary for its evaluation. In other words, the locations of the ends of the rectangles ABCD and EFGH are not known before the start of the treatments. Thus, a gradual evaluation of the blocks of a diagonal d is not possible. The overall CGM algorithm that is run by every processor to solve the OBST problem is as follows (Algorithm 3).
It is clear that this CGM algorithm needs f(p) + k × (⌈f(p)∕2⌉ + 1) super-steps of computation. In each of them, the blocks of a diagonal are calculated: one block per processor. Knuth's sequential algorithm is used for these local sequential calculations. Before the computation of every diagonal, the entries (of the Cut and Tree tables) on which each of its parallelograms depends are gathered by the processors dedicated to this computation.
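The local kernel is Knuth's classical O(n²) dynamic program for the OBST. A single-processor sketch under simplifying assumptions (key access frequencies only, no dummy keys; the arrays are ours, not the paper's Cut and Tree tables):

```c
#include <limits.h>

#define N 16  /* max number of keys in this sketch */

static long cost[N + 2][N + 1]; /* cost[i][j]: min cost of OBST over keys i..j */
static int  root[N + 2][N + 1]; /* root[i][j]: optimal root of that subtree    */

/* Minimal expected search cost of an optimal BST over keys 1..n with
 * access frequencies freq[1..n], using Knuth's O(n^2) dynamic program:
 * the monotonicity root[i][j-1] <= root[i][j] <= root[i+1][j] bounds
 * the split-point search. */
long obst_knuth(const long *freq, int n)
{
    long w[N + 2][N + 1]; /* w[i][j] = freq[i] + ... + freq[j] */

    for (int i = 1; i <= n + 1; i++) {
        cost[i][i - 1] = 0;
        root[i][i - 1] = i - 1;   /* sentinel for the Knuth bounds */
        w[i][i - 1] = 0;
    }
    for (int i = 1; i <= n; i++)
        for (int j = i; j <= n; j++)
            w[i][j] = w[i][j - 1] + freq[j];

    for (int len = 1; len <= n; len++) {
        for (int i = 1; i + len - 1 <= n; i++) {
            int j = i + len - 1;
            cost[i][j] = LONG_MAX;
            int lo = root[i][j - 1] > i ? root[i][j - 1] : i;
            int hi = root[i + 1][j] < j ? root[i + 1][j] : j;
            for (int r = lo; r <= hi; r++) {
                long c = cost[i][r - 1] + cost[r + 1][j] + w[i][j];
                if (c < cost[i][j]) { cost[i][j] = c; root[i][j] = r; }
            }
        }
    }
    return cost[1][n];
}
```

For frequencies {1, 2, 3}, the optimal cost is 10 (e.g., key 2 at the root with keys 1 and 3 as children).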
We are then ready to state Theorem 3, which gives the performance of our CGM algorithm for OBST. It is crucial to know the best number of fragmentations before the start of the computations in order to maximize the efficiency of the algorithm. To do this, we propose to use the cost model defined in [11] to observe the evolution of the load balancing, the efficiency, and the resulting number of steps of the algorithm, and to determine, according to the triplet (n, p, k), the best number of fragmentations to perform.
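For a given (p, k), the number of super-steps can be computed directly from the complexity analysis, with f(p) = ⌈√(2p)⌉ (a sketch; the helper names are ours):

```c
/* ceil(sqrt(x)) for small non-negative integers, without floating point. */
static int ceil_sqrt(int x)
{
    int r = 0;
    while (r * r < x) r++;
    return r;
}

/* Number of computation super-steps (and communication rounds) of the
 * CGM algorithm: f(p) + k * (ceil(f(p)/2) + 1), with f(p) = ceil(sqrt(2p)). */
int cgm_rounds(int p, int k)
{
    int fp = ceil_sqrt(2 * p);
    return fp + k * ((fp + 1) / 2 + 1);
}
```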

Experimental Results
Here we present the results of the implementation of our BSP/CGM algorithm for solving the OBST problem. We implemented this algorithm on the dolphin cluster of the MATRICS platform of the University of Picardie Jules Verne [25], using 60 computation nodes (48 so-called thin nodes with 128 GB of RAM, and 12 thick nodes with 512 GB of RAM), whose characteristics are as follows:
• each node is an Intel Xeon Processor E5-2680 V4 (35M Cache, 2.40 GHz);
• 2 login nodes, which are not intended to perform the calculations but to provide the job submission environment;
• 1 NFS server (85 TB) in 10 Gbits;
• a distributed file system BeeGFS (300 TB) in 100 Gbits.
The C programming language is used, on the operating system CentOS Linux release 7.4.1708. The inter-processor communication is implemented with the MPI library (OpenMPI version). To explore the performance of our algorithm, the results presented here are derived from its execution for different values of the triplet (n, p, k). Recall that the communication time considered here is the sum of the times of effective data transfer and the waiting times of the processors. We consider that the parallel execution time T par of the algorithm is the time interval between the moment the first processor starts its computations and the moment the last processor finishes its own. Its speed-up S par is evaluated by comparing T par with the sequential execution time T seq for the same problem; therefore, S par = T seq / T par. Its efficiency E par measures the average cost of processor activity and is evaluated by E par = S par / p.
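These two metrics follow directly from the definitions S par = T seq / T par and E par = S par / p:

```c
/* Speed-up and efficiency exactly as defined above. */
double speedup(double t_seq, double t_par)
{
    return t_seq / t_par;
}

double efficiency(double t_seq, double t_par, int p)
{
    return speedup(t_seq, t_par) / p;
}
```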
Figure 9a shows that the communication time decreases as the number of fragmentations increases (on 32 processors, it decreases on average by 18.83% for k = 1 and by 25.08% for k = 2) because, when a fragmentation is performed on the diagonal blocks, it increases the number of blocks in these diagonals and allows processors to stay active as long as possible. This minimizes the latency of the processors and thus the communication time. Figure 9b shows that the computation rate is much lower than the communication rate. This is due to the computation time gained from the Knuth acceleration. The proportion of computation is much lower for k = 2 than for k = 0 (on 32 processors, it decreases on average by 50.19% for k = 1 and by 66.27% for k = 2) because each fragmentation divides the current blocks into 4. This reduction in block size greatly reduces the number of local computations performed by a processor, thanks to the load balancing, and thus reduces the overall computation time.
Table 1 and Fig. 10a, b show that the reduction in computation and communication times results in a reduction in the total execution time as the number of fragmentations increases. In Fig. 10a, on 32 processors, it decreases on average by 25.67% for k = 1 and by 33.99% for k = 2. Also, when n = 16,383 in Fig. 10b, it decreases on average by 30.87% for k = 1 and by 40.43% for k = 2. Since the total execution time decreases when the number of fragmentations increases, the speed-up and the efficiency of our algorithm increase. On 32 processors, when n = 16,383 in Fig. 10b and Table 2 (respectively Table 3), the speed-up (respectively efficiency) averages 2.5 (respectively 7.83%) for k = 0, and increases up to 3.68 (respectively 11.52%) for k = 1 and 4.31 (respectively 13.49%) for k = 2. From all this, we can deduce that our algorithm scales with the increase in data size and in the number of processors.
Figure 11a presents, for k ∈ {0, 1, 2}, the curves of the different loads compared to their average load. (Each value of k has its own load average.) This figure shows that irregular partitioning of the dynamic graph balances the load between the processors better than regular partitioning of this graph. Note that it is not possible to predict which processor will have the biggest or the smallest load, because of the irregularity, induced by the Knuth acceleration, in the number of calculations needed to evaluate the elements of the same diagonal. For example, in Fig. 11a, p = 5 (respectively p = 6) bears the greatest burden with the irregular partitioning of the dynamic graph when k = 1 (respectively when k = 2), whereas p = 2 (respectively p = 3) would have been predicted to bear this greatest burden.
In Fig. 11b, we can see that the load of more than half of the processors is lower than the average load (indeed, the loads of the two processors p = 2 and p = 3 are more or less equal; the same holds for processors p = 5 and p = 6). The processor loads below the average are very close to it (p = 7). Those above this average belong to the processors that evaluate the last blocks of the dynamic graph (task graph), and they are more distant from the average load. This result was expected, because the load induced by a block of the task graph increases with the number of the diagonal to which it belongs. From this, we can conclude that irregular partitioning keeps most processor loads close to the average.

Conclusion and Future Works
In this paper, we presented an efficient parallel algorithm on the BSP/CGM model for solving the OBST problem on p processors. Firstly, we proposed a dynamic graph model for solving the OBST problem and showed that each instance of this problem corresponds to a one-to-all shortest path problem in this graph. Thus, as in [15], any solution of a one-to-all shortest path problem can be used on our dynamic graph to accelerate the sequential or parallel resolution of the OBST problem. Secondly, we proposed a BSP/CGM parallel algorithm based on our dynamic graph to solve the OBST problem from the corresponding one-to-all shortest path in this graph. It uses our new technique of irregular partitioning of this dynamic graph to try to bring a solution to the contradictory objectives of the minimization of the communication time and the load balancing of the processors in this type of graph. In our BSP/CGM algorithm, each processor uses O(n²/p) local memory, and the algorithm requires O(n²/p) time steps per processor and ⌈√(2p)⌉ + k × (⌈⌈√(2p)⌉/2⌉ + 1) communication rounds. The experimental results show a good agreement with the theoretical predictions. The progressive reduction in the size of the blocks allows processors to stay active as long as possible. This promotes load balancing and thus minimizes the overall computation times of the processors. It also reduces their latency and hence the communication time. All this reduces the overall execution time of the algorithm.

As a perspective of this work, it will be interesting to use our dynamic graph model to develop a CGM-based parallel algorithm for the OBST problem that requires O(1) communication rounds and O(n²/p²) time steps, using, for example, the works of [15]. The irregular partitioning technique of the tasks graph may also be applicable to other dynamic programming problems in the same class as the OBST, such as the matrix chain ordering problem. This is also left for future work.

Fig. 2 For a sequence {a, b, c}, there are five possible binary search trees; the search cost of "a" is 1 in the first tree and 3 in the fourth

… defines the same binary sub-tree of root K_{k+1}, whose left part contains the keys K_{i+1} … K_k and dummy keys d_i … d_k, and whose right part contains the keys K_{k+2} … K_j and dummy keys d_{k+1} … d_j.

Definition 4 Given a set of n keys, a dynamic graph D_n = (V, E ∪ E′) is defined by a set of vertices V, a set of unit arcs E, a set of jumpers E′, and a weight function W such that

Fig. 5 Dynamic graphs D_3 and D′_3 for a set of n = 3 keys

4. one fragmentation increases the number of diagonals by ⌈f(p)/2⌉ + 1 (see Proof of Lemma 3);
5. when f(p) is odd, the number of blocks in a diagonal after each fragmentation exceeds by one the number of super-blocks (the larger blocks) of the first diagonal. (This is illustrated in Fig. 6b, where there are 3 blocks in diagonal 1 and 4 blocks in diagonal 3.)

This allows us to state the following Lemmas 2 and 3.

The number of blocks of the dynamic graph (the shortest path matrix) after partitioning is a function of k and is:

Fig. 9 Global communication time and computation rate versus communication rate

Acknowledgements
The authors wish to express their gratitude to the University of Picardie Jules Verne, which made it possible to carry out the experimentations of this work.

Author's Contribution VK suggested this work. The authors carried out the analysis. JL performed the experiments and wrote the first draft of this work. The authors worked on the revised version and approved the work.

Fig. 11 Load difference relative to the average load as a function of p