# Efficient Parallel Algorithm for Optimal DAG Structure Search on Parallel Computer with Torus Network

## Abstract

The optimal directed acyclic graph (DAG) search problem consists of searching for a DAG with the minimum score, where the score of a DAG is defined on its structure. This problem is known to be NP-hard, and the state-of-the-art algorithm requires exponential time and space. It is thus not feasible to solve large instances using a single processor. Some parallel algorithms have therefore been developed to solve larger instances. A recently proposed parallel algorithm can solve an instance of 33 vertices, and this is the largest solved size reported thus far. In the study presented in this paper, we developed a novel parallel algorithm designed specifically to operate on a parallel computer with a torus network. Our algorithm crucially exploits the torus network structure, thereby obtaining good scalability. Through computational experiments, we confirmed that a run of our proposed method using up to 20,736 cores showed a parallelization efficiency of 0.94 as compared to a 1296-core run. Finally, we successfully computed an optimal DAG structure for an instance of 36 vertices, which is the largest solved size reported in the literature.

### Keywords

Optimal DAG structure · Optimal Bayesian network structure · Parallel algorithm · Distributed algorithm · Torus network

## 1 Introduction

In the field of computer science, directed acyclic graph (DAG) structures frequently appear in, for example, bioinformatics, graph theory, and parallel computing [3, 4, 6, 10, 11]. In this study, we adopted a score-based structure search to construct a DAG structure from observed data [7, 8, 9]. The optimal DAG is defined as one having the minimum score, where a score function is defined on a graph structure. The problem of finding a DAG structure with the optimal score is NP-hard [5]. Learning a Bayesian network, which is used in bioinformatics and machine learning [4, 6], is an application of this problem. In bioinformatics, a Bayesian network is used as a model for a gene network and applied to new drug development [12]. Inferring larger networks enhances the potential of such gene network research. Ott et al. [14] developed an algorithm to find the optimal DAG by dynamic programming (DP). Their algorithm requires \(O(n\cdot 2^{n})\) space and time, where *n* denotes the number of vertices in the DAG. Although the memory size of a single machine has been increasing in recent years, with this algorithm, only instances of approximately 25 vertices can be solved on a single machine. Even if a machine with sufficient memory existed, solving an instance of more than 30 vertices would take a day. To meet these time and memory requirements, some parallel algorithms have been developed.

Tamada et al. [15] proposed a parallel algorithm whose feature is a reduction in the communications between parallel processors, achieved by splitting the search space. The time and space complexities of their algorithm are \(O(n^{\sigma + 1}2^{n})\), where \(\sigma \in \{1, 2, \ldots \}\) controls the trade-off between the number of communications and the memory requirement. They succeeded in solving the optimal DAG search problem for 32 vertices. Nikolova et al. developed a more efficient parallel algorithm [13]. They regarded each of \(2^{k}\) processors as a node of a *k*-dimensional hypercube and distributed the computations to the processors such that the following property is satisfied: when a processor performs a computation, the necessary data for the computation are stored in the processor itself or its adjacent processors in the *k*-dimensional hypercube. The time complexity of their algorithm is \(O(n^{2}\cdot (2^{n-k}+k))\) on \(2^{k}\) processors. They proved that the maximum storage needed is approximately \(\sqrt{2}\cdot 2^{n-k}/\sqrt{n-k}\) elements on \(2^{k}\) processors, and by this formula, 33 vertices was the limit in their experimental environment. Their algorithm scaled linearly up to 2048 cores, and they stated that it computed an instance of 33 vertices in 1 h 14 min with 2048 cores.

In this paper, we propose a novel parallel algorithm called TRPOS (Torus Relay for Parallel Optimal Search). As compared to previously developed algorithms, TRPOS is very simple. Although our algorithm is based on Tamada et al.'s parallel algorithm [15], it does not reduce communications. Our objective was to develop an algorithm that runs efficiently when many cores are used. When executing a program on a distributed system, the important issues are the distribution of the computations and the data, and the manner in which the processors of the system communicate with each other. In the previous parallel algorithms, a processor communicates directly with the others that store the necessary data. Nikolova et al.'s algorithm, for example, would communicate efficiently in a distributed system in which the structure of the *k*-dimensional hypercube can effectively be mapped onto the processors. A torus interconnect is a popular network topology for connecting the processors of a distributed system; in fact, many supercomputers adopt torus networks, e.g., the Fujitsu FX10, IBM Blue Gene/Q, and Cray XK7. We focus on the structure of the network topology, i.e., the torus network, and propose a communication method designed specifically for torus networks. In a torus network system, although processors can communicate with distant processors, doing so takes more time, and data congestion may occur. To avoid congestion, in our algorithm, a processor communicates with only its adjacent processors; the data necessary for a computation are acquired by relaying (Torus Relay). We first developed a communication method for a one-dimensional (1D) torus network and then extended the method to a two-dimensional (2D) torus network.

We applied our algorithm to the optimal Bayesian network search problem. Through computational experiments, we confirmed that a run of our algorithm using up to 20,736 cores shows a parallelization efficiency of 0.94 as compared to a 1296-core run; note that the previous research demonstrated scalability only up to 2048 cores. We also succeeded in computing an instance of 36 vertices in approximately 12 h with 76,800 cores, which is the largest solved size reported in the literature.

## 2 Preliminaries

### 2.1 Finding Optimal DAG Structure

Let *G* be a DAG, and let \(s(X_{j},Pa^{G}(X_{j}),{\varvec{X}})\) be a local score function \(V\times 2^{V}\times {\mathbb {R}}^{m\times n} \rightarrow {\mathbb {R}}\) for vertex \(X_{j}\) given the observed input data, an \((m\times |V|)\)-matrix \({\varvec{X}}\), where *V* is the set of vertices, \(Pa^{G}(X_{j})\) represents the set of vertices that are direct parents of the *j*-th vertex \(X_{j}\), and *m* is the number of observed samples.

### 2.2 Dynamic Programming Algorithm

We present Ott et al.'s algorithm. A DAG structure can be represented by an ordering of the vertices, that is, a permutation, together with the parents of each vertex. The DP algorithm computes these in DP steps. In detail, the algorithm consists of two layers of DP: one for obtaining the optimal permutation and one for obtaining the optimal parents of each vertex.

### Definition 1

**(Optimal local score).** We define the function \(F:V\times 2^{V}\rightarrow {\mathbb {R}}\) as \(F(v, A) = \min _{B \subseteq A} s(v, B, {\varvec{X}})\), where *A* denotes the candidate vertices for *v*'s parents. That is, *F*(*v*, *A*) calculates the optimal choice of *v*'s parents from *A* and returns its local score.

### Definition 2

**(Optimal network score on a permutation).** Let \(\pi : \{1,2,\ldots ,|A|\} \rightarrow A\) be a permutation on \(A \subset V\) and \(\varPi ^{A}\) be the set of all permutations on *A*. Given a permutation \(\pi \in \varPi ^{A}\), the optimal network score on \(\pi \) is described as \(Q^{A}(\pi ) = \sum _{j=1}^{|A|} F(\pi (j), \{\pi (1), \ldots , \pi (j-1)\})\). A permutation represents the possible parent-child relationships between vertices: vertex *u* can be *v*'s parent if \(\pi ^{-1}(u) < \pi ^{-1}(v)\). By the definition of *F*(*v*, *A*), \(Q^{A}(\pi )\) calculates the optimal parent-child relationships consistent with \(\pi \) and returns its network score.

### Definition 3

**(Optimal network score).** Let *A* be a set of vertices. We define the function *M*, which maps a vertex set \(A \subseteq V\) to a permutation in \(\varPi ^{A}\), as \(M(A) = \mathop {\mathrm {argmin}}_{\pi \in \varPi ^{A}} Q^{A}(\pi )\). *M*(*A*) returns the optimal permutation on *A*, that is, the one that derives the minimal score of the network consisting of the vertices in *A*. Thus, solving the optimal DAG structure search problem is equal to computing *M*(*V*), where *V* represents all the vertices in a DAG.

Finally, the following theorem provides an algorithm to calculate *F*(*v*, *A*), *M*(*A*), and \(Q^{A}(M(A))\) by DP. See [14] for the proof of this theorem.

### Theorem 1

**(Optimal network search by DP).** The functions *F*(*v*, *A*), *M*(*A*), and \(Q^{A}(M(A))\) can be respectively calculated by the following formulae:

\(F(v, A) = \min \bigl \{ s(v, A, {\varvec{X}}),\ \min _{a \in A} F(v, A \backslash \{a\}) \bigr \}\),  (1)

\(Q^{A}(M(A)) = \min _{v \in A} \bigl [ F(v, A \backslash \{v\}) + Q^{A \backslash \{v\}}(M(A \backslash \{v\})) \bigr ]\),  (2)

\(M(A) = (M(A \backslash \{v^{*}\}), v^{*})\),  (3)

where \(v^{*} = \mathop {\mathrm {argmin}}_{v \in A} \bigl [ F(v, A \backslash \{v\}) + Q^{A \backslash \{v\}}(M(A \backslash \{v\})) \bigr ]\); that is, *M*(*A*) appends \(v^{*}\) after the optimal permutation on \(A \backslash \{v^{*}\}\). By applying these formulae to all \(A \subseteq V\) in increasing order of |*A*|, we can obtain the optimal permutation *M*(*V*) of *V* and its score \(Q^{V}(M(V))\) in \(O(n\cdot 2^{n})\) steps.

Note that in order to reconstruct the network structure, we need to store the optimal choice of the parents derived in Eq. (1) and the optimal permutation \(\pi = M(A)\) in Eq. (3) for all the combinations of \(A \subset V\) for the next size of *A*.
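The two-layer DP of Theorem 1 can be sketched on a single machine. The following is a minimal C sketch, assuming a toy stand-in local score (`toy_score`, our invention, not the BDe score used in the experiments) and encoding subsets *A* as bitmasks; it computes *F*(*v*, *A*) by Eq. (1) and then the optimal network score \(Q^{V}(M(V))\).

```c
#include <assert.h>
#include <math.h>

/* Single-machine sketch of the O(n * 2^n) DP of Theorem 1.
 * toy_score is a hypothetical local score (smaller is better),
 * NOT the BDe score; it simply rewards the chain 0->1->2->3. */

#define N_VERT 4
#define N_SUBSET (1 << N_VERT)

static double toy_score(int v, unsigned parents) {
    double s = 1.0;
    if (v > 0 && (parents & (1u << (v - 1)))) s -= 0.5; /* "true" parent */
    s += 0.1 * __builtin_popcount(parents);             /* size penalty  */
    return s;
}

static double F[N_VERT][N_SUBSET]; /* F(v, A): optimal local score, Eq. (1) */
static double Q[N_SUBSET];         /* Q^A(M(A)): optimal network score      */

double solve(void) {
    for (int v = 0; v < N_VERT; v++)
        for (unsigned A = 0; A < N_SUBSET; A++) {
            if (A & (1u << v)) continue;     /* v itself must not be in A  */
            double best = toy_score(v, A);   /* take all of A as parents   */
            for (int a = 0; a < N_VERT; a++) /* ... or drop one candidate  */
                if (A & (1u << a)) {
                    double f = F[v][A & ~(1u << a)]; /* smaller mask: done */
                    if (f < best) best = f;
                }
            F[v][A] = best;
        }
    Q[0] = 0.0;
    for (unsigned A = 1; A < N_SUBSET; A++) { /* A\{v} < A numerically,    */
        double best = 1e100;                  /* so it is already computed */
        for (int v = 0; v < N_VERT; v++)
            if (A & (1u << v)) {
                unsigned rest = A & ~(1u << v);
                double q = F[v][rest] + Q[rest]; /* v placed last, Eq. (2) */
                if (q < best) best = q;
            }
        Q[A] = best;
    }
    return Q[N_SUBSET - 1]; /* Q^V(M(V)) */
}
```

With these toy scores the optimal structure is the chain \(0\rightarrow 1\rightarrow 2\rightarrow 3\), and `solve()` returns 2.8.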

## 3 Methods

We call a processor in distributed systems a node. As mentioned in Sect. 2, we need to store the intermediate results in each DP step. In order to calculate a new result in a DP step, for example, *F*(*v*, *A*), it is necessary to calculate the initial score \(s(v,A,{\varvec{X}})\), compare it with the \(F(v,A \backslash \{a\})\)'s, and take their minimum. The algorithm needs to collect the \(F(v,A \backslash \{a\})\)'s that were calculated in the previous DP step and stored at (possibly) different nodes. The previous parallel algorithms, including Nikolova et al.'s algorithm, collect all the necessary data for a computation at the same time and then update each *F*(*v*, *A*). However, because it is not necessary to update all the new results at the same time, a node can receive data and update at any time. In our proposed algorithm, a node relays all the data and updates whenever the received data include data necessary for a computation. We explain the manner in which each node acquires all the necessary data for a computation by relaying in the following subsections. The feature of our algorithm is that a node communicates with only its adjacent nodes on a torus network. First, we show that the traditional collective communication methods supported by the Message Passing Interface (MPI) are not suitable for this problem. Next, we introduce our parallel algorithm on the simplest interconnect network structure, that is, a 1D torus network. Then, we extend our idea to a 2D torus network. Finally, we improve the algorithm with multithread programming. In this section, we refer to the values of the *F*(*v*, *A*), *M*(*A*), and \(Q^{A}(M(A))\) functions as *new results* when we focus on a DP step \(|A| = a\). In addition, we refer to the results of the previous step, that is, the \(F(v, A')\)'s, \(M(A')\)'s, and \(Q^{A'}(M(A'))\)'s for \(|A'| = |A| - 1\), as *sub-results*. Thus, the sub-results are the intermediate results required to calculate the new results.

### 3.1 Proposed Algorithm

MPI supports several collective communication methods, and some distributed systems ensure that no data congestion occurs when these methods are used. A node needs to receive all the sub-results required for its score calculations, and this can be realized by *Allgather*: each node gathers all the data stored at all the nodes. *Allgather*, however, consumes enough memory to store all the data simultaneously. Since the memory complexity of the optimal DAG structure problem is exponential, if this operation were used, memory would be exhausted even for a small problem. Therefore, we cannot use the MPI collective communication methods to solve the problem efficiently.

Communications on a 1D torus network are *bi-directional*: a node simultaneously sends to and receives from its adjacent nodes. Let *n* be the number of vertices in a DAG and *N* be the number of nodes. We propose a simple algorithm that does not entail any congestion on the 1D torus network. The idea is that a node communicates with only its adjacent nodes. So that a node can receive required data located at a distant node, each node transfers the entire data in it to its adjacent nodes, and this is repeated until all the nodes have received the entire data. At each communication step, a node receives the sub-results sent from its adjacent nodes and simultaneously sends the sub-results that it received at the previous communication step. During the communications, a node also calculates the new results from the received sub-results: if the received sub-results include sub-results required for the new results on the node, the node calculates them while receiving and sending the data. A node obtains sub-results from the two nodes at distance 1 at the first communication step, from the two nodes at distance 2 at the second step, and so on. When the *i*-th communication step is complete, a node has acquired the sub-results from the two nodes at distance *i* in both directions. As an exception, if *N* is even, a node receives the sub-results originally stored at the farthest node by single-directional communication at the final \(\lceil \frac{N-1}{2}\rceil \)-th step. Thus, the number of communication steps in a DP step such that all nodes obtain all sub-results is \(\lceil \frac{N-1}{2}\rceil \). Figure 1 shows an example of the communications where \(N = 6\). In the figure, the sub-results stored initially at node *x* are denoted as *x*, where node *x* is the node the rank of which is *x*, and the ranks are numbered from left to right. Figure 1(i) shows the communication of the nodes with their adjacent nodes at the first step. For example, node 2 sends its sub-results to its adjacent nodes, i.e., nodes 1 and 3, and receives 1 and 3 from them. At the second communication step (ii), having 1 and 3, it sends 3 to node 1 and 1 to node 3, and receives 0 and 4. Finally (iii), it sends only 4 to node 1 and receives 5 from node 3.
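The relay schedule above can be simulated to check the \(\lceil \frac{N-1}{2}\rceil \)-step claim. The sketch below is our own simulation, not the MPI implementation: it only tracks which node's sub-results each node holds, for \(N = 6\) as in Fig. 1.

```c
#include <assert.h>
#include <string.h>

/* Simulation of the 1D torus relay (assumed semantics): at every step a
 * node forwards the block it received from the opposite neighbor at the
 * previous step, so blocks travel around the ring in both directions. */

#define N_NODES 6

int relay_steps(int have[N_NODES][N_NODES]) {
    int from_left[N_NODES], from_right[N_NODES]; /* block got last step */
    memset(have, 0, sizeof(int) * N_NODES * N_NODES);
    for (int i = 0; i < N_NODES; i++) {
        have[i][i] = 1;                   /* a node's own sub-results    */
        from_left[i] = from_right[i] = i; /* first step sends own block  */
    }
    int steps = N_NODES / 2;              /* = ceil((N-1)/2)             */
    for (int k = 0; k < steps; k++) {
        int nl[N_NODES], nr[N_NODES];
        for (int i = 0; i < N_NODES; i++) {
            int left  = (i - 1 + N_NODES) % N_NODES;
            int right = (i + 1) % N_NODES;
            nl[i] = from_left[left];      /* block arriving rightward    */
            nr[i] = from_right[right];    /* block arriving leftward     */
        }
        for (int i = 0; i < N_NODES; i++) {
            have[i][nl[i]] = 1;
            have[i][nr[i]] = 1;
            from_left[i]  = nl[i];        /* forwarded at the next step  */
            from_right[i] = nr[i];
        }
    }
    return steps;
}
```

After `relay_steps` runs, every node holds all six blocks in \(\lceil \frac{6-1}{2}\rceil = 3\) steps (for even *N* the farthest block simply arrives from both sides in this simulation, whereas the actual algorithm sends it single-directionally).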

In a DP step *a*, the number of *F*(*v*, *A*)'s is \(n\cdot \left( {\begin{array}{c}n-1\\ a\end{array}}\right) \) and the number of \(Q^{A}(M(A))\)'s is \(\left( {\begin{array}{c}n\\ a\end{array}}\right) \), denoted by \(F_{a}\) and \(Q_{a}\), respectively. In our algorithm, *F*(*v*, *A*) and \(Q^{A}(M(A))\) are numbered as in Tamada et al.'s algorithm (see [15] for details) and are distributed uniformly to the nodes. Let \({Ceil}_{N}(x, i) = \lceil x\cdot i/N \rceil \). The total number of *F*(*v*, *A*)'s and \(Q^{A}(M(A))\)'s that the *i*-th rank node stores is then calculated by the function \({Data}_{N}(a,i) = ({Ceil}_{N}(F_{a}, i+1) - {Ceil}_{N}(F_{a}, i)) + ({Ceil}_{N}(Q_{a}, i+1) - {Ceil}_{N}(Q_{a}, i))\). That is, the sub-results stored at the *i*-th rank node, in a DP step \(a+1\), are the [\({Ceil}_{N}(F_{a},i), {Ceil}_{N}(F_{a}, i + 1)\))-th *F*(*v*, *A*)'s and the [\({Ceil}_{N}(Q_{a},i), {Ceil}_{N}(Q_{a}, i + 1)\))-th \(Q^{A}(M(A))\)'s. Thus, the rank of the node that originally stores the *k*-th value out of \(F_{a}\), denoted \({rank\_has}(k)\), can be calculated as \({rank\_has}(k) = \lfloor k\cdot N/F_{a} \rfloor \). The data that the *i*-th rank node receives from the right (left) node at the *k*-th communication step are originally located at the node the rank of which is \(i+k\) \((i-k)\) \((\bmod \, N)\). Therefore, if a node computes in advance a list of the nodes that store the required sub-results, it can determine immediately whether the data received from its adjacent nodes include sub-results required to compute the new results. Hence, by this algorithm, a node has to communicate only with its adjacent nodes, and as a result, no data congestion occurs.
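The distribution can be written down as a few one-line functions. The closed forms below are our reading of the half-open \({Ceil}_{N}\) ranges described above (the paper's exact formulae follow Tamada et al. [15]); the point of the sketch is that the ranges partition the elements and that ownership can be looked up in O(1).

```c
#include <assert.h>

/* Sketch of the Ceil_N-based block distribution (assumed closed forms):
 * node i stores the half-open index range [Ceil_N(x,i), Ceil_N(x,i+1)). */

typedef long long ll;

ll ceil_n(ll x, ll i, ll n_nodes) {            /* ceil(x*i / N) */
    return (x * i + n_nodes - 1) / n_nodes;
}

ll data_n_count(ll x, ll i, ll n_nodes) {      /* elements stored at node i */
    return ceil_n(x, i + 1, n_nodes) - ceil_n(x, i, n_nodes);
}

ll rank_has(ll k, ll x, ll n_nodes) {          /* node owning element k */
    return k * n_nodes / x;
}
```

For example, 13 elements over 5 nodes are split into contiguous ranges whose sizes differ by at most one, and `rank_has(k, 13, 5)` always lands inside the owning node's range.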

### 3.2 Memory Complexity for Communication

To reduce the memory required for the communication buffers, the sub-results sent in a DP step can be split into several parts and sent one by one. We call this *divided communication* and a divided part of communication in a DP step a *communication piece*. For example, if we split the sub-results into three pieces, the size of a communication buffer is reduced to one third, while the number of communication pieces in a DP step increases to three. The more communication pieces there are, the smaller the communication buffers become. Unlike in the first strategy, however, a fixed buffer for a node's own sub-results is necessary, because the node sends them again in the second or later communication pieces. Thus, there are six buffers: two buffers for the sub-results and the new results, and four communication buffers, the size of which depends on the number of communication pieces. We call this the second strategy. Figure 3 shows how these six buffers are used. At the beginning of each communication piece (i), a node copies a part of its sub-results to a send buffer. The transition of buffer usages is the same as in the first strategy (ii, iii). The memory usages of the first and second strategies, denoted by \( Mem _{{fir}}\) and \( Mem _{{sec}}\), respectively, are

\( Mem _{{fir}} = 5D \quad \text {and} \quad Mem _{{sec}} = 2D + \frac{4D}{C}\),  (4)

where *C* is the number of communication pieces and \(\displaystyle D = \max _{a,i} {Data}_{N}(a,i)\). The size of the communication buffers should be as large as possible, and a node should perform the minimum number of communication pieces. The number of communication pieces varies dynamically depending on *a*. For instance, it can be smaller when *a* is small or large. This is because the number of communication pieces increases linearly with the amount of sub-results, and the amount of sub-results is the sum of \(F_{a}\) and \(Q_{a}\), each of which is unimodal in *a* and largest in the middle DP steps. The maximum size of \( Mem _{{sec}}\) is the size of the available memory. Therefore, we can calculate the number of communication pieces as

\(C = \left\lceil \frac{4D}{M - 2D} \right\rceil \),  (5)

where *M* is the size of the available memory. Based on this equation, the algorithm uses a different number of communication pieces for each DP step, where \(D = \max _{i} {Data}_{N}(a,i)\) with *a* fixed to the current DP step.
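The choice of the number of communication pieces can be sketched as follows. The memory model is an assumption on our part: two fixed buffers of size *D* (sub-results and new results) plus four communication buffers of size *D*/*C* must fit in the available memory *M*, which gives \(C = \lceil 4D/(M - 2D) \rceil \); the argument values are purely illustrative.

```c
#include <assert.h>

/* Sketch of choosing the number of communication pieces C, assuming the
 * six-buffer model: 2 fixed buffers of size D and 4 communication buffers
 * of size D/C must fit in memory M (sizes in elements, illustrative). */

long pieces(long d, long m) {
    long free_mem = m - 2 * d;     /* memory left for the 4 comm buffers  */
    if (free_mem <= 0) return -1;  /* cannot run: fixed buffers don't fit */
    long c = (4 * d + free_mem - 1) / free_mem; /* ceil(4D / (M - 2D))    */
    return c < 1 ? 1 : c;
}
```

For instance, with \(D = 100\): ample memory (\(M = 700\)) needs a single piece, tight memory (\(M = 300\)) needs four pieces, and \(M = 150\) cannot hold even the fixed buffers.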

### 3.3 Disadvantage of Algorithm on 1D Torus Network

The algorithm on a 1D torus network has a disadvantage: the cost of synchronization is high. In the 1D torus network algorithm, a node refers to the sub-results in its send buffers to calculate new results during a communication step. At the next communication step, a send buffer becomes a receive buffer, and the old sub-results in it are overwritten by the newly received sub-results. Thus, a node must block the next communication until the calculations using the sub-results in the send buffers are completed. Figure 4 shows a situation where such blocking occurs. While node 2 is calculating new results from the sub-results in its send buffers (i, ii), its adjacent nodes, here nodes 1 and 3, cannot send to node 2 at the next communication step. Furthermore, since node 1 cannot complete the next communication step because of node 2's blocking (iii, iv), node 0 cannot perform its communication with node 1 at the next communication step, and the same blocking also occurs between nodes 3 and 4. That is, blockings are propagated. Data showing the effect of this blocking on the scalability with the number of processors are presented in Sect. 4. To overcome this disadvantage, we extend the algorithm to a 2D torus network.

### 3.4 Algorithm on 2D Torus Network

Consider the nodes arranged in a 2D torus network, where *H* is the number of rows and *W* is the number of columns; thus, \(N = H \cdot W\). In this paper, the shape of a 2D torus network is also denoted by "height" and "width," corresponding to the number of rows and columns, respectively. The algorithm extended to a 2D torus network consists of column-wise and row-wise communication phases. Basically, the method is almost the same as the 1D torus network algorithm. It differs in that, in the column-wise communication phase, a node performs an Allgather-style communication among the nodes in the same column without any calculation. Note that, in our implementation, because a node communicates as in the 1D torus network method, it is guaranteed that no data congestion occurs. After the column-wise communication phase, the nodes in the same column share the same set of data. Next, a node performs the row-wise communication. During the row-wise communication phase, a node communicates only within the same row and calculates intermediate results as in the 1D torus network algorithm. Figure 5 shows the communication scheme in a \(4 \times 4\) 2D torus network. A node, in the initial state (i), stores only its own sub-results. First, the nodes start the column-wise communication. When the column-wise communication phase is completed, a node stores all the data in the same column (ii). For example, node 4 stores data 4, 5, 6, and 7. Next, the nodes proceed to the row-wise communication phase. At the first row-wise communication step (iii), node 4 sends [4, 7] to nodes 0 and 8, and receives [0, 3] from node 0 and [8, 11] from node 8, where [*x*, *y*] denotes the data originally stored at nodes *x* to *y*. In a row-wise communication, communications are not blocked by nodes in different rows. This independence among rows weakens the effect of the synchronous behavior. We show the performance of the algorithm on a 2D torus network through the computational experiments in Sect. 4.
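The two-phase scheme can be simulated in the same spirit as the 1D case. The sketch below is our own simulation (not the implementation): it only tracks which original blocks each node holds, using the column-major node numbering suggested by Fig. 5, and performs a column allgather followed by the row-wise relay on a \(4\times 4\) grid.

```c
#include <assert.h>
#include <string.h>

/* Simulation of the 2D scheme (assumed semantics): column allgather, then
 * a 1D relay along each row, exchanging whole column bundles of size H. */

#define H 4
#define W 4
#define NN (H * W)

void run_2d(int have[NN][NN]) {
    memset(have, 0, sizeof(int) * NN * NN);
    for (int i = 0; i < NN; i++) have[i][i] = 1;
    /* Column phase: node i (column i/H) gathers its column's H blocks. */
    for (int i = 0; i < NN; i++) {
        int col = i / H;
        for (int r = 0; r < H; r++) have[i][col * H + r] = 1;
    }
    /* Row phase: after step k a node also holds the bundles of the two
     * columns at distance k; ceil((W-1)/2) = W/2 steps suffice. */
    int steps = W / 2;
    for (int k = 1; k <= steps; k++)
        for (int i = 0; i < NN; i++) {
            int col = i / H;
            int lc = (col - k + W) % W, rc = (col + k) % W;
            for (int r = 0; r < H; r++) {
                have[i][lc * H + r] = 1; /* bundle from the left column  */
                have[i][rc * H + r] = 1; /* bundle from the right column */
            }
        }
}
```

After the column phase, node 4 indeed holds blocks 4–7, matching Fig. 5(ii); after the row phase, every node holds all 16 blocks.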

Let *D* and *M* be as in Equation (5). In the algorithm on a 2D torus network, a communication buffer needs \(H \cdot D\) of memory, where *H* is the number of rows of the 2D torus network, because a node relays the bundled data of a whole column. The total memory usage thus becomes approximately *H* times that of the first strategy. We analyze the relationship between the number of communication pieces and the execution time in Sect. 4.

### 3.5 Improvement with Multithread Programming

Although we already distribute the score calculations to nodes, we further distribute them to the CPU cores on a single node. Recent processors have multiple cores and support multithread programming. If the *i*-th rank node is able to execute *k* threads, its *j*-th thread deals with \({Ceil}_{k}({Data}_{N}(a, i), j)\) score calculations. A further improvement by multithread programming can be realized in the communications. The communication between nodes can be performed by a single, fixed thread. We call this thread the *communication thread* and all the other threads *non-communication threads*. In a column-wise communication phase, since a node only has to store the received data and does not calculate or update, the non-communication threads would be idle until the column-wise communication phase is completed. Because the initial scores, \(s(v,A,{\varvec{X}})\), can be computed without sub-results, the non-communication threads can calculate them during a column-wise communication phase. Since the time required to complete all the initial score calculations may be longer than a column-wise communication phase, the non-communication threads need to determine whether the communication has completed during their calculations. We employ the following simple solution for this problem. We introduce a flag whose value only the communication thread may change. The flag is set to 0 before the column-wise communication phase and set to 1 when the column-wise communication phase is completed. A non-communication thread reads the flag periodically: if the flag is 0, it continues to calculate initial scores; otherwise, it stops the initial score calculations. In a row-wise communication phase, a calculation of the score function takes considerably more time than comparing and updating. In other words, the time a node spends on a row-wise communication step depends mainly on the time for the calculations of the initial scores. Hence, by calculating initial scores during the column-wise communication, the execution time of a row-wise communication step becomes shorter. As a result, the effect of the synchronous behavior becomes weaker.
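The flag pattern can be sketched with POSIX threads. This is a minimal stand-in for the paper's OpenMP/MPI implementation: `usleep` stands in for a column-wise communication phase, and an atomic counter stands in for the initial-score calculations.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

/* Sketch of the completion flag: only the communication thread writes it;
 * worker threads poll it between (stand-in) initial-score calculations. */

static atomic_int  comm_done       = 0;
static atomic_long scores_computed = 0;

static void *communication_thread(void *arg) {
    (void)arg;
    usleep(20000);               /* stand-in for a column-wise phase    */
    atomic_store(&comm_done, 1); /* only this thread sets the flag to 1 */
    return NULL;
}

static void *worker_thread(void *arg) {
    (void)arg;
    while (!atomic_load(&comm_done))      /* poll the flag periodically  */
        atomic_fetch_add(&scores_computed, 1); /* one "initial score"    */
    return NULL;
}

long overlap_demo(int n_workers) {       /* n_workers must be <= 16     */
    pthread_t comm, w[16];
    pthread_create(&comm, NULL, communication_thread, NULL);
    for (int i = 0; i < n_workers; i++)
        pthread_create(&w[i], NULL, worker_thread, NULL);
    pthread_join(comm, NULL);
    for (int i = 0; i < n_workers; i++) pthread_join(w[i], NULL);
    return atomic_load(&scores_computed);
}
```

Running `overlap_demo(3)` shows that the workers make progress on score calculations while the "communication" is in flight and stop promptly once the flag is raised.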

## 4 Computational Experiments

In the computational experiments, we applied our algorithm to the optimal Bayesian network search problem. We adopted the Bayesian Dirichlet equivalence (BDe) score [8] as the score function. The BDe score calculation time increases linearly with the number of samples in the observed data, denoted by *m*. The input data for the experiments were generated randomly from random DAGs. The conditional probabilities of the variables were also assigned at random. The amount of required memory depends not on the number of edges but on the number of vertices. Moreover, because our algorithm is based on dynamic programming, the number of edges does not affect the execution time.

TRPOS has various parameters, such as the number of nodes (*N*), the shape of the 2D torus network (*H*, *W*), the number of communication pieces, and the problem size (*n*, *m*). Moreover, some parameters depend on each other (e.g., the number of columns and the number of communication pieces). Therefore, we need to analyze the relationships between these parameters. First, we confirm that the synchronous behavior in fact occurs, which is the motivation for the novel 2D communication method. Next, we analyze the relationships between the parameters when no divided communication occurs. Finally, we analyze them when divided communication occurs. In this section, we call our proposed method for a 1D torus network *TRPOS-1D* and that for a 2D torus network *TRPOS-2D*.

We implemented our algorithm in the C programming language (ISO C99). The parallelization was implemented using OpenMP and MPI. We used the Fujitsu FX10 supercomputer system installed at the Information Technology Center, the University of Tokyo [2]. Fujitsu C/C++ Compiler 1.2.1 is installed and supports OpenMP 3.0 and MPI-2.2. Each node consists of a single SPARC64\(^{\text {TM}}\) IXfx CPU (16 cores per CPU, 1.848 GHz) and 32 GiB of memory [1]. The nodes are connected by the Tofu 6D torus network with a link throughput of 5.0 GBytes/s; the interconnect provides a 3D torus network in the user view. The number of nodes is 4800, and thus the number of cores is 76,800. Although users are in general allowed to use fewer than 1440 nodes, there are several opportunities in a year when all the nodes can be used through a competitive application process. The time limitation for a single job is 24 h. Because the nodes must be used effectively, the system bans the allocation of fewer than 72 nodes when the requested shape is a 2D torus. Thus, we used from 81 nodes to 1440 nodes to conduct the experiments on the algorithms, and won an opportunity to use all the nodes to attempt to solve an instance of 36 vertices.

### 4.1 Synchronous Behavior

To observe the synchronous behavior, we measured the time at which each node completed each communication step (the *end time*). In our implementation for this experiment, the time was measured by the *gettimeofday* function in the standard C library. For simplicity, we tested the case where single-directional communication does not occur, that is, *N* is odd (here, 19 and 37). Table 1 shows the relative end times in the specified DP step \(a=11\), where the minimum end time is subtracted from all the end times and \(N=19\). Because no first communication is blocked, and thus receiving is performed efficiently, the first end time does not depend on the adjacent nodes; it depends rather on the execution time of the score calculations from the sub-results the node originally has. The first end time (\(i = 0\)) of node 0 was the latest among all the first end times (4.04 s). The second communications of the adjacent nodes of node 0, here nodes 1 and 18, were blocked, and therefore their second end times became greater (both 4.26 s). The third communications of the nodes at distance 2 from node 0, nodes 2 and 17, were blocked, and therefore their third end times became greater (both 4.29 s). In the later communication steps, similar communication blockings occurred. We clearly observed that synchronous behavior occurred and that the delay caused by it was propagated. Note that the first end time of node 13 was also delayed, and we can observe that propagation of the delay from node 13 occurred as well.

Table 1. Relative end times (s) of each communication step (i = 0–8) for DP step a = 11 in the case where N = 19, n = 25, and m = 200. To change the measured absolute times to relative times, the minimum time among them (i = 0, node 5) was subtracted from these times. The gray boxes represent the propagation of blocking by nodes 0 and 13.

Table 2. Maximum and minimum time differences (s) for DP step *a*

Next, we analyze the synchronous behavior when the number of processors increases. We call the difference between the minimum and maximum of the end times in a communication step the *time difference*. For instance, in Table 1, the time difference of the last communication step was 0.21 s (\(=4.49-4.28\)). Owing to the synchronous behavior, the time difference of the last communication step decreased as compared to that of the first communication step. The maximum (minimum) time difference for a DP step *a* is the greatest (smallest) time difference among those of the communication steps in DP step *a*. For instance, in Table 1, the maximum time difference is that of the first step (4.04 s) and the minimum time difference is that of the last step (0.21 s). Table 2 shows the maximum and minimum time differences for each DP step *a* in the cases where \(N = 19\) and 37. Similar synchronous behavior also occurred in the case of 37 nodes. The maximum time differences for DP step *a* decreased, because the number of score calculations that a single node performed decreased. However, these ratios (e.g., \(0.74 = 2.65/3.55\) in \(a=10\)) were greater than 0.51 (\(=19/37\)). That is, the effect of the synchronous behavior did not scale. Hence, the execution of a DP step does not scale because of the communication blockings. This is the main factor that caused the poor scalability of TRPOS-1D, which we show in the next subsection.

### 4.2 No Divided Communication

Table 3 shows the execution times in the case where \(n=28\) and *N* is from 81 to 1296 (i.e., from 1296 to 20,736 cores). Using 81 nodes, Tamada et al.'s method, TRPOS-1D, and TRPOS-2D took 2992.8, 1450.6, and 1169.0 s, respectively. Table 4 and Fig. 6 show a comparison of the scalabilities relative to the 1296-core run. The relative speedups of the 1296-node run as compared to the 81-node run were 2.11, 7.21, and 12.89, respectively, while the ideal relative speedup is 16.0. We can conclude from these results that the proposed methods (TRPOS-1D and TRPOS-2D) are much faster and more scalable than Tamada et al.'s algorithm. Furthermore, as compared to TRPOS-1D, TRPOS-2D is faster and more scalable. TRPOS-1D is equivalent to executing TRPOS-2D in the case where \(H=1\) and \(W=N\). Thus, the shape of the node array affected the execution times and scalabilities. However, if the difference between the height and width was not very large, the execution times were not affected dramatically by the shape (see the execution times of TRPOS-2D for \(18\times 18\), \(9\times 36\), and \(36\times 9\) in Table 3(a)). Table 3(b) shows the execution times of TRPOS-2D in the case where \(n=28\) and \(m=1000\). In our proposed algorithm, we mainly parallelized the score calculations. If the ratio of the execution time of the score calculations to the whole execution time is large, the parallelization efficiency increases, from the perspective of Amdahl's law. Because of this, the relative speedup in the case where \(m=1000\) was better than that where \(m=200\) (15.06 vs. 12.89, Table 4). Hence, although the relative speedup decreased when the number of cores increased in the case where \(n=28\) and \(m=200\), we can expect a good relative speedup if the problem size is much larger. TRPOS-2D achieved a parallelization efficiency of 0.94 \((=15.06/16)\) as compared to the 1296-core run in the case where \(n=28\) and \(m=1000\), and its scalability was almost linear up to 20,736 cores. These results show that our algorithm maintains good efficiency with many cores. This is a significant improvement over the previous research results, where scalability was ensured only up to 2048 cores. We expect that our algorithm is scalable to a still larger number of cores; however, for the reason described above, we could not conduct experiments with more cores.

Table 3. Execution times (s) in the case where \(n=28\) and \(m=200, 1000\)

Table 4. Relative speedups as compared to the 81-node (1296-core) run for problem size \(n=28\)

| | \(N=81\) | 162 | 324 | 648 | 1296 |
|---|---|---|---|---|---|
| Tamada et al.’s algo. | 1.00 | 1.17 | 1.52 | 2.05 | 2.11 |
| TRPOS-1D | 1.00 | 1.91 | 3.38 | 4.86 | 7.21 |
| TRPOS-2D (\(m=200\)) | 1.00 | 1.98 | 3.85 | 7.24 | 12.89 |
| TRPOS-2D (\(m=1000\)) | 1.00 | 1.99 | 3.96 | 7.77 | 15.06 |
| Ideal speedup | 1.00 | 2.00 | 4.00 | 8.00 | 16.00 |

Table 5. Execution times (s) in the case where \(n=25\)–\(30\), \(m=200, 1000\), and \(N=324\) (\(=18\times 18\)). Relative execution-time ratios as compared to \(n=25\) are shown in parentheses.

| \(n\) | \(m=200\) | \(m=1000\) | Ideal |
|---|---|---|---|
| 25 | 31.9 (1.00) | 116.8 (1.00) | 1.00 |
| 26 | 66.3 (2.08) | 241.8 (2.07) | 2.08 |
| 27 | 146.2 (4.58) | 581.2 (4.98) | 4.32 |
| 28 | 303.1 (9.50) | 1217.4 (10.42) | 8.96 |
| 29 | 632.2 (19.82) | 2592.4 (22.20) | 18.56 |
| 30 | 1365.2 (42.80) | 5944.7 (50.90) | 38.40 |

Next, we show the relationship between the execution times and the problem sizes. As our parallel algorithm is based on an \(O(n\cdot 2^{n})\) algorithm, the execution times increase exponentially in *n* when the number of nodes is fixed. Table 5 shows the execution times and the relative execution-time ratios of TRPOS-2D for \(n=25\)–30 when *N* is 324 (\(=18\times 18\)). As the problem size increased, the gap between the observed ratios and the ideal ratios based on the computational complexity became larger. For example, in the case where \(m=200\), the relative execution-time ratio for \(n=28\) as compared to \(n=25\) was 9.50, although the ideal is 8.96 \((=28\cdot 2^{28} /(25\cdot 2^{25}))\). Furthermore, for most problem sizes, the relative execution-time ratios for \(m=200\) were smaller than those for \(m=1000\). For example, in the case where \(n=30\), the relative execution-time ratios were 42.8 for \(m=200\) and 50.9 for \(m=1000\), whereas the ideal ratio is 38.4 \((=30\cdot 2^{30} /(25\cdot 2^{25}))\). This is because the time to calculate a single score increased, which in turn increased the differences in execution times between nodes.
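The ideal ratios in Table 5 come directly from the \(O(n\cdot 2^{n})\) complexity. A short sketch (the function name is ours, for illustration):

```python
def ideal_time_ratio(n: int, n_base: int = 25) -> float:
    """Ideal execution-time ratio implied by the O(n * 2^n) complexity,
    relative to the baseline problem size n_base."""
    return (n * 2**n) / (n_base * 2**n_base)

print(round(ideal_time_ratio(28), 2))  # 8.96
print(round(ideal_time_ratio(30), 2))  # 38.4
```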

### 4.3 Divided Communication

In Ex. I, the size of the communication buffer (*M*) was limited to between 100 and 1600 MBytes. Because the divided communication and the synchronous behavior in TRPOS-1D are similar to those in TRPOS-2D, we executed only TRPOS-2D in Ex. I. In Ex. II, *M* was fixed to 100 MBytes, and we executed TRPOS-1D and TRPOS-2D while increasing the number of nodes. By increasing the number of nodes, we can also observe the effects on the scalability. In the case where no divided communication occurs, the execution time and scalability of TRPOS-2D are better than those of TRPOS-1D. We can thus determine the trade-off between the inherent scalability and the divided communications through Ex. II. In both experiments, we set \(n = 26\) and \(m=200\).

Table 6. Results of Ex. I. *M* is limited to several sizes and *N* is 81 (\(=9\times 9\)). (a) Execution times (s) and the number of communication pieces for each DP step *a*. (b) Time (s) taken for each communication piece, denoted by *i*, in DP step \(a=10\). (c) Number of score calculations that the nodes in the first row performed in each communication piece of DP step \(a=10\) when *M* is 100 MBytes. Gray boxes represent the maximum number of score calculations in each communication piece.

Table 6 shows the results of Ex. I. Table 6(a) shows the execution times and the number of communication pieces for each DP step *a*. The number of communication pieces is a convex function of *a*. The execution times increased when divided communications occurred. However, in our distributed system, the pure communication time did not increase as a result of divided communication (data not shown in this paper). In other words, if the size of the data that a node sends is unchanged, no communication overhead is incurred by divided communications. To confirm the reason for the increase in execution times, we focused on a specific DP step. Table 6(b) shows the time taken for each communication piece in DP step \(a=10\). When no divided communication occurred (800 and 1600 MBytes), the execution time for \(a=10\) was approximately 21 s. When divided communication occurred, the execution times for \(a=10\) were much longer than 21 s, e.g., 47 s. Table 6(c) shows the number of score calculations that the nodes in the first row (i.e., nodes 0, 9, \(\cdots \), 72) performed in each communication piece of DP step \(a=10\), where the size of the communication buffers was 100 MBytes. According to the table, the number of score calculations processed in a communication piece differed among the nodes. This is because the required sub-results are received separately in each divided communication piece. The number of calculations that a single node needs to perform for DP step \(a=10\) is 1,049,230 or 1,049,231 (\(=F_{10}=26\cdot \left( {\begin{array}{c}25\\ 10\end{array}}\right) /81\) (+1)). Note that, by executing the score calculations in the column-wise phase, the number of score calculations shown in Table 6(c) decreased. Because of the synchronous behavior, the end time of each communication piece was nearly that of the latest node in the same row. Moreover, the next communication piece is also blocked by the latest node in the same column, which may still be processing the previous communication piece. Therefore, when divided communication occurs, the execution time depends on the sum of the maximum numbers of calculations in each communication piece. In the case where *M* is 100 MBytes (shown in Table 6(c)), this sum is 1,855,080 (\(={768,087}+\cdots +{25,008}\)). On the other hand, if no divided communication occurs (800 and 1600 MBytes), it is at most 1,049,231 (\(=F_{a}\)). Hence, the factor that caused the execution time to increase was the deviation in the number of score calculations that a node performs in a communication piece.
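The bottleneck analysis above can be stated compactly: with synchronous divided communication, each piece finishes only when the slowest node in the row does, so the effective work is the sum of per-piece maxima rather than the maximum of per-node totals. A small sketch with hypothetical counts (not the measured Table 6(c) data) shows why this sum can exceed \(F_a\) even when every node performs the same total amount of work:

```python
# Hypothetical per-node score-calculation counts for each communication piece
# (rows: communication pieces; columns: nodes in one row of the torus).
pieces = [
    [700, 100, 300],  # piece 0
    [100, 600, 200],  # piece 1
    [200, 300, 500],  # piece 2
]

# Synchronous model: every communication piece waits for its slowest node,
# so the effective work is the sum of the per-piece maxima.
effective_work = sum(max(piece) for piece in pieces)  # 700 + 600 + 500 = 1800

# Without divided communication, a node's cost is just its own total.
per_node_totals = [sum(col) for col in zip(*pieces)]  # [1000, 1000, 1000]
undivided_work = max(per_node_totals)                 # 1000

print(effective_work, undivided_work)  # 1800 1000
```

Every node does 1000 calculations in total, yet the deviation across pieces inflates the synchronized cost to 1800, mirroring the 1,855,080 vs. 1,049,231 gap observed in the experiment.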

*n* when no divided communication occurs. TRPOS-1D did not perform divided communication with more than 162 nodes for \(n=26\). In comparison, in TRPOS-2D, divided communication always occurred. Figure 7 shows the execution times in Ex. II. TRPOS-2D maintained good scalability even when divided communication occurred. The execution time of TRPOS-2D decreased by more than the linear scale when the width increased. For example, when *N* increased from 81 to 162 by doubling the width (\(9\rightarrow 18\)), the execution time dropped from 803.5 s to 301.3 s. This more-than-linear speedup arises because the amount of data stored in a column depends on the width: if the width increases, the data stored in a column decreases, and hence the number of communication pieces decreases. Because of the overhead of the divided communications, TRPOS-2D was slower than TRPOS-1D when the number of cores was small. However, because the inherent scalability of TRPOS-2D is better and its number of communication pieces decreased, the gap between them became smaller as the number of nodes increased; finally, TRPOS-2D took a shorter time than TRPOS-1D with 1296 nodes.
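The more-than-linear effect admits a back-of-the-envelope model: the data a column must exchange is inversely proportional to the width, and the number of communication pieces is that amount divided by the buffer size, rounded up. The sketch below uses a hypothetical total data volume (the paper does not report one); only the 100-MByte buffer size is taken from Ex. II:

```python
import math

def num_pieces(column_data_bytes: float, buffer_bytes: float) -> int:
    """Number of divided-communication pieces needed to exchange one
    column's data through a fixed-size communication buffer."""
    return math.ceil(column_data_bytes / buffer_bytes)

M = 100e6        # 100-MByte communication buffer, as in Ex. II
total = 1.0e9    # hypothetical total data volume per column at width 1

for width in (9, 18, 36):
    per_column = total / width  # a wider grid leaves less data per column
    print(width, num_pieces(per_column, M))
```

Doubling the width thus halves the per-column data and can drop the piece count outright, which is why the speedup from 81 to 162 nodes exceeded the linear factor of 2.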

Table 7. Execution times (s) for problem size \(n=26\) when *M* is 100 MBytes and *N* is from 81 to 1296 (i.e., from 1296 to 20,736 cores). The sum of the numbers of communication pieces is shown in square brackets.

| | \(9\times 9\) (81) | \(9\times 18\) (162) | \(18\times 18\) (324) | \(18\times 36\) (648) | \(36\times 36\) (1296) |
|---|---|---|---|---|---|
| TRPOS-1D | 407.9 [31] | 166.0 [26] | 99.6 [26] | 72.1 [26] | 45.9 [26] |
| TRPOS-2D | 803.5 [105] | 301.3 [63] | 149.1 [63] | 65.8 [42] | 35.9 [42] |

Finally, we attempted to solve a large problem. Searching for a much larger DAG requires a huge amount of memory, and it is therefore necessary to execute the algorithm with more nodes. Consequently, TRPOS-2D should be faster than TRPOS-1D in solving larger problem sizes. We estimated that TRPOS-2D would be able to solve the instance with \(n=36\) and \(m=200\) within the limit of 24 h using 76,800 cores. Indeed, we succeeded in solving it with TRPOS-2D in 11 h 38 min using 76,800 (\(=60\times 80\times 16\)) cores. This is a significant improvement in the size of the problem solved, because the largest size solved previously was 33 [13].

## 5 Conclusion

In this paper, we presented a novel parallel algorithm to search for the optimal DAG structure. A torus network is adopted in many distributed systems, and therefore an algorithm that runs efficiently on such systems is very important. Our algorithm was designed to run efficiently on torus network distributed systems, and we confirmed that its scalability is in fact good: it scales up to 20,736 cores. Our results showed that our algorithm is scalable to more than ten times as many cores as the existing algorithms. This is important because the number of cores in distributed systems is increasing, and situations in which more than twenty thousand cores are used are becoming common. We also succeeded in solving an instance of 36 vertices without any constraints, which is the largest solved size reported in the literature. Our core idea is that a node communicates only with its adjacent nodes in the distributed system. This enables the computational nodes to communicate without any congestion, and as a result, our algorithm runs very efficiently. Since communication is a major bottleneck of distributed algorithms, algorithms developed for distributed systems with a specific network topology are becoming increasingly important.

Although we used a 3D torus network system, our algorithm requires only a 2D torus network. Clearly, we can extend our idea to a 3D torus network: the 3D torus network algorithm would further divide the column-wise communication into two dimensions. This makes the column-wise communication more independent, and therefore the algorithm can be expected to be more efficient and more scalable. Implementation of the 3D torus network algorithm remains future work, for when a considerably larger distributed system becomes available.

### References

- 1. Fujitsu. http://www.fujitsu.com/global/. Accessed 01 11 2015
- 2. Information Technology Center, the University of Tokyo. http://www.cc.u-tokyo.ac.jp/. Accessed 01 11 2015
- 3. Chaiken, R., Jenkins, B., Larson, P.A., Ramsey, B., Shakib, D., Weaver, S., Zhou, J.: SCOPE: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow. **1**(2), 1265–1276 (2008)
- 4. Cheng, J., Bell, D.A., Liu, W.: Learning belief networks from data: an information theory based approach. In: Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM 1997), pp. 325–331. ACM, New York (1997)
- 5. Chickering, D.M., Geiger, D., Heckerman, D.: Learning Bayesian networks is NP-hard. Technical report, Citeseer (1994)
- 6. Friedman, N., Goldszmidt, M.: Learning Bayesian networks with local structure. In: Jordan, M.I. (ed.) Learning in Graphical Models, vol. 89, pp. 421–459. Springer, Netherlands (1998)
- 7. Friedman, N., Linial, M., Nachman, I., Pe’er, D.: Using Bayesian networks to analyze expression data. J. Comput. Biol. **7**(3–4), 601–620 (2000)
- 8. Heckerman, D., Geiger, D., Chickering, D.M.: Learning Bayesian networks: the combination of knowledge and statistical data. Mach. Learn. **20**(3), 197–243 (1995)
- 9. Imoto, S., Goto, T., Miyano, S.: Estimation of genetic networks and functional structures between genes by using Bayesian networks and nonparametric regression. In: Pacific Symposium on Biocomputing, vol. 7, pp. 175–186. World Scientific (2002)
- 10. Italiano, G.F.: Finding paths and deleting edges in directed acyclic graphs. Inf. Process. Lett. **28**(1), 5–11 (1988)
- 11. Kramer, R., Gupta, R., Soffa, M.L.: The combining DAG: a technique for parallel data flow analysis. IEEE Trans. Parallel Distrib. Syst. **5**(8), 805–813 (1994)
- 12. Lecca, P.: Methods of biological network inference for reverse engineering cancer chemoresistance mechanisms. Drug Discov. Today **19**(2), 151–163 (2014). http://www.sciencedirect.com/science/article/pii/S1359644613003930
- 13. Nikolova, O., Zola, J., Aluru, S.: Parallel globally optimal structure learning of Bayesian networks. J. Parallel Distrib. Comput. **73**(8), 1039–1048 (2013)
- 14. Ott, S., Imoto, S., Miyano, S.: Finding optimal models for small gene networks. In: Pacific Symposium on Biocomputing, vol. 9, pp. 557–567. World Scientific (2004)
- 15. Tamada, Y., Imoto, S., Miyano, S.: Parallel algorithm for learning optimal Bayesian network structure. J. Mach. Learn. Res. **12**, 2437–2459 (2011)