1 Introduction

Boltzmann Machines (BM), first proposed by Hinton in 1984 [13], are recurrent, fully connected neural networks that store information within their symmetric edge weights. When combined with Stochastic Local Search methods such as Simulated Annealing (SA) [1] or Parallel Tempering (PT) [10], BMs can be used to perform combinatorial optimization on complex problems such as TSP [3], MaxSAT [8], and MaxCut [16]. In this paper, we present an algorithm for a Permutational Boltzmann Machine (PBM), structured to solve complex, integer-based permutation optimization problems. We combine this PBM with Parallel Tempering and propose both single-threaded and multi-threaded software implementations of this PBM + PT system using a 64-core CPU along with SIMD instructions. As a proof of concept, we show how to solve Quadratic Assignment Problems (QAP) [17] using a PBM and present experimental results on some of the hardest QAP instances from QAPLIB [6], Palubeckis [21], and Drezner [9]. We then show that, over the tested instances, our single-threaded and multi-threaded PBM systems can find the best-known solutions of QAP problems in excess of 10\(\times \) and 100\(\times \) faster, respectively, than the next best solver.

The rest of this paper is organized as follows: Sect. 2 provides background on BMs and the formulation of QAP problems. Section 3 presents the structure of our Permutational Boltzmann Machine, and Sect. 4 presents our single- and multi-threaded implementations of a PBM + PT system on a multi-core CPU. Section 5 outlines the experiments conducted to benchmark the performance of our PBM + PT system and presents our results. Section 6 concludes this paper.

2 Background

2.1 Boltzmann Machines

BMs, as shown in Fig. 1, are made up of N neurons, \(\{x_1, x_2, \ldots , x_N\}\) with binary states represented by vector \(\mathbf {S}=[s_1\, s_2\, \ldots \, s_N]^{\intercal } \in \{0,1\}^{N}\). Each neuron, \(x_{i}\), is connected to other neurons, \(x_{j}\), via symmetric, real-valued weights, \(w_{i,j} \in \mathbb {R}\) where \(w_{i,j} = w_{j,i}\) and \(w_{i,i} = 0\), forming a 2D matrix, \(\mathbf {W} \in \mathbb {R}^{N \times N} \). Each neuron also has a bias value, \(b_{i}\), which forms \(\mathbf {B} \in \mathbb {R}^{N \times 1}\). The cumulative inputs to the neurons, also referred to as their local fields, \(h_{i}\), form \(\mathbf {H} \in \mathbb {R}^{N \times 1}\) and are calculated using (1).

Fig. 1. Structure of a Boltzmann Machine and its neurons. (a) Top level structure of a Boltzmann Machine. (b) Detailed structure of a BM neuron, where T is the system temperature and rand() is a uniform random number within [0,1].

$$\begin{aligned} h_{i}(\mathbf {S}) = \sum _{j=1}^{N} w_{i,j}s_{j} + b_{i} \; \;, \; \; \mathbf {H}(\mathbf {S}) = \mathbf {W}\mathbf {S} + \mathbf {B} \end{aligned}$$
(1)
$$\begin{aligned} E(\mathbf {S}) = -\dfrac{1}{2}\sum _{i=1}^N \sum _{j=1}^N w_{i,j}s_{i}s_{j} - \sum _{i=1}^N b_is_i = -\dfrac{\mathbf {S}^{\intercal }\mathbf {W}\mathbf {S}}{2} - \mathbf {S}^{\intercal }\mathbf {B} \end{aligned}$$
(2)
$$\begin{aligned} P(\mathbf {S}) = \dfrac{\mathrm {exp}(-E(\mathbf {S})/T)}{\sum _{\forall \mathbf {S}_{j} \in \{0,1\}^{N}} \mathrm {exp}(-E(\mathbf {S}_{j})/T)} \end{aligned}$$
(3)

Each possible state of a BM has an associated energy term calculated via (2). The probability of the system being in any state depends on the energy of that state, as shown in (3): the lower the energy, the higher the probability that the network will be in that state. BMs create an energy landscape for a problem through the weights that connect their neurons, where the state(s) with the lowest energy correspond to a valid solution. The procedures to convert various optimization problems to the BM format are discussed in [12]. The term T in (3), known as the system temperature, flattens or sharpens the energy landscape when increased or decreased respectively, providing a method to maneuver through the landscape when searching for the global minimum.

Generally, BMs are combined with Simulated Annealing (SA) to solve optimization problems. Using SA, at a time-step t, where the system is in state \(\mathbf {S^{t}}\) with temperature T, the local fields \(\mathbf {H}(\mathbf {S^{t}})\) are calculated using (1). In order to make an update to the system, we must conduct a trial. First, a neuron \(x_{i}\) is randomly chosen and the change in energy as a result of flipping its state is calculated via (4). Next, the probability of flipping the neuron’s state, \(P_{move}\), is calculated via (5) and is compared against a uniformly distributed random number in [0,1] to determine the change in the neuron’s state using (6).

$$\begin{aligned} \varDelta E(\mathbf {S^{t}}, i) = \varDelta E_{s_{i} \xrightarrow {} !s_{i}}(\mathbf {S^{t}}) = -[1 -2s_{i}^{\mathrm {t}}]h_{i}(\mathbf {S^t}) \end{aligned}$$
(4)
$$\begin{aligned} P_{move} = min\{1,\mathrm {exp}(-\varDelta E/T)\} \end{aligned}$$
(5)
$$\begin{aligned} \varDelta s_{i} = {\left\{ \begin{array}{ll} [1 -2s_{i}^{\mathrm {t}}] &{} \mathrm {if} \,\,\, P_{move} \ge rand() \\ 0 &{} otherwise \end{array}\right. } \end{aligned}$$
(6)

After the trial, the system state variables \(s_{i},\) \(\mathbf {H},\) and E need to be updated as shown in (7), (8), and (9) respectively, where \(\mathbf {W}_{i,*}\) and \(\mathbf {W}_{*,i}\) represent row and column i of \(\mathbf {W}\) respectively.

$$\begin{aligned} s^{\mathrm {t}+1}_{i} = s^{\mathrm {t}}_{i} + \varDelta s_{i} \end{aligned}$$
(7)
$$\begin{aligned} \mathbf {H}(\mathbf {S^{t+1}}) = \mathbf {H}(\mathbf {S^{t}}) + \varDelta s_{i} \mathbf {W}_{*,i} \end{aligned}$$
(8)
$$\begin{aligned} E(\mathbf {S^{t+1}}) = E(\mathbf {S^{t}}) + \varDelta E(\mathbf {S^{t}}, i) \end{aligned}$$
(9)

This procedure is repeated a preset number of times, occasionally cooling the system by decreasing T until it goes below a certain threshold, \(T_{thresh}\), at which point the process is terminated and the lowest energy state observed throughout the search is returned. This state may or may not correspond to a valid or optimal solution due to the stochastic nature of the algorithm but, theoretically, if given a long enough cooling schedule, the BM + SA system will eventually converge to an optimal answer [2].
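For concreteness, the following is a minimal C++ sketch of one SA trial implementing (4)–(9). The struct layout and names (`Bm`, `trial`) are illustrative assumptions, not the paper's implementation; it assumes \(\mathbf {W}\) is stored row-major and the state, local fields, and energy were initialized consistently.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

struct Bm {
    int N;                    // number of neurons
    std::vector<double> W;    // N*N symmetric weights, W[i*N+j], zero diagonal
    std::vector<int>    S;    // binary states s_i in {0,1}
    std::vector<double> H;    // local fields h_i, maintained incrementally per (1)
    double E;                 // current energy per (2)
    std::mt19937 rng{42};
    std::uniform_real_distribution<double> uni{0.0, 1.0};

    // One SA trial at temperature T: pick a random neuron and
    // stochastically flip it, updating S, H and E incrementally.
    void trial(double T) {
        int i = std::uniform_int_distribution<int>(0, N - 1)(rng);
        int ds = 1 - 2 * S[i];                 // candidate flip: +1 or -1
        double dE = -ds * H[i];                // (4)
        if (std::min(1.0, std::exp(-dE / T)) >= uni(rng)) {  // (5)-(6)
            S[i] += ds;                        // (7)
            for (int j = 0; j < N; ++j)        // (8): H += ds * W_{*,i}
                H[j] += ds * W[j * N + i];
            E += dE;                           // (9)
        }
    }
};
```

Because \(w_{i,i} = 0\), the loop in (8) leaves \(h_i\) itself unchanged by its own flip, so no special case is needed.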

2.2 Quadratic Assignment Problems (QAP)

QAP problems, first formulated in [17], are a class of NP-Hard permutation optimization problems to which many other problems, such as the Travelling Salesman Problem, can be reduced. While the formulation is relatively simple, QAP remains, to this day, one of the most challenging combinatorial optimization problems. A QAP instance entails assigning a set of n facilities to a set of n locations while minimizing the cost of the assignment. It comprises two \(n \times n\) matrices, \(\mathbf {F} = (f_{i,j})\) and \(\mathbf {D}=(d_{k,l})\), which describe the flows between facilities and the distances between locations respectively, with the diagonal elements of both matrices being 0. A third \(n \times n\) matrix, \(\mathbf {B_P}=(b_{i,k})\), describes the costs of assigning a facility to a location. All three matrices are comprised of real-valued elements. Given these matrices, each facility must be assigned to a unique location, generating a permutation, \(\pmb {\phi } \in \mathbf {S_{n}}\), where \(\mathbf {S_{n}}\) is the set of all permutations, such that the cost function (10) is minimized.

$$\begin{aligned} \min _{\pmb {\phi } \in \mathbf {S_{n}} } cost(\pmb {\phi }) = \min _{\pmb {\phi } \in \mathbf {S_{n}} } \sum ^{n}_{i=1} \sum ^{n}_{j=1} f_{i,j}d_{\phi _{i}, \phi _{j}} + \sum ^{n}_{i=1} b_{i,\phi _{i}} \end{aligned}$$
(10)

Generally, there are two variants of the QAP problem: symmetric (sym) and asymmetric (asm). In the symmetric case, either one or both of \(\mathbf {F}\) and \(\mathbf {D}\) are symmetric. If only one of the matrices is asymmetric, it can be made symmetric by replacing each element and its transposed counterpart with their average. However, if both matrices are asymmetric, we can no longer symmetrize them in this manner. It is important to distinguish between these two cases as they are handled differently by a PBM, as will be shown in Sect. 3.
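To see why this averaging is lossless, suppose for illustration that \(\mathbf {D}\) is symmetric while \(\mathbf {F}\) is not. Since \(d_{\phi _{j},\phi _{i}} = d_{\phi _{i},\phi _{j}}\), every pair of off-diagonal terms in (10) satisfies

$$\begin{aligned} f_{i,j}d_{\phi _{i},\phi _{j}} + f_{j,i}d_{\phi _{j},\phi _{i}} = \dfrac{f_{i,j}+f_{j,i}}{2}d_{\phi _{i},\phi _{j}} + \dfrac{f_{j,i}+f_{i,j}}{2}d_{\phi _{j},\phi _{i}} \end{aligned}$$

so replacing \(\mathbf {F}\) with \((\mathbf {F}+\mathbf {F}^{\intercal })/2\) leaves the cost of every permutation unchanged.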

3 Permutational Boltzmann Machines (PBM)

3.1 Structure and Update Scheme

The PBM’s structure is an extension of Clustered Boltzmann Machines (CBM), first proposed by De Gloria [11]. A CBM places neurons that do not have any connections between them into groups called clusters. Within a cluster, the states of the neurons have no effect on each other’s local fields; simultaneously flipping the states of multiple neurons in the same cluster has the same effect as flipping them in sequence. In a PBM, the neurons are arranged into an \(n \times n\) matrix \(\mathbf {S_{P}} = (s_{r,c})\), where each row, \(r_i\), and each column, \(c_j\), forms a cluster, as shown in Fig. 2a. On each cluster, we impose an exactly-1 constraint to ensure that within each row and each column, there is exactly one neuron in the ON state.

In the context of a permutation problem, the row-clusters represent a 1-hot encoded integer in [1, n], allowing the neuron states to be represented via the integer permutation vector, \(\varvec{\phi }\). The column-clusters, in turn, enforce that every integer is unique. The \(n^2 \times n^2\) weight matrix is also reshaped into a 4D (\(n\times n)\times (n \times n\)) matrix as shown in (11), allowing the generation of the \(\mathbf {w}_{r,c}\) sub-matrices via Kronecker Products (denoted by \(\otimes \)) of rows and columns of F and D via (12). The \(n \times n\) local field matrix \(\mathbf {H_{P}}\) is calculated via (13).
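As an illustrative example (not taken from the paper), for \(n=3\) the permutation \(\varvec{\phi } = [2\; 3\; 1]^{\intercal }\) corresponds to the neuron states

$$\begin{aligned} \mathbf {S_{P}} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix} \end{aligned}$$

where row i holds its single ON neuron in column \(\phi _{i}\), and every row and column contains exactly one ON neuron.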

Fig. 2. Structure of a Permutational Boltzmann Machine. (a) The binary neuron state matrix \(\mathbf {S_P}\) with row/column cluster structures and the permutation vector \(\varvec{\phi }\). (b) Structure of a permutation Swap Move.

$$\begin{aligned} \mathbf {W_{P}} = \begin{bmatrix} \mathbf{w} _{1,1} &{} \mathbf{w} _{1,2} &{} \ldots &{} \mathbf{w} _{1,n} \\ \mathbf{w} _{2,1} &{} \mathbf{w} _{2,2} &{} \ldots &{} \mathbf{w} _{2,n} \\ \mathbf{w} _{3,1} &{} \mathbf{w} _{3,2} &{} \ldots &{} \mathbf{w} _{3,n} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ \mathbf{w} _{n,1} &{} \mathbf{w} _{n,2} &{} \ldots &{} \mathbf{w} _{n,n} \\ \end{bmatrix} \,\, , \,\, \mathbf {W_{P}} \in \mathbb {R}^{(n \times n) \times (n \times n)} \end{aligned}$$
(11)
$$\begin{aligned} \mathbf{w} _{r,c} = {\left\{ \begin{array}{ll} -(\mathbf {F}_{r,*})^{\intercal } \otimes \mathbf {D}_{c,*} &{} sym\\ -(\mathbf {F}_{r,*})^{\intercal } \otimes \mathbf {D}_{c,*} - \mathbf {F}_{*,r} \otimes (\mathbf {D}_{*,c})^{\intercal } &{} asm \end{array}\right. } \,\, , \,\, \mathbf{w} _{r,c} \in \mathbb {R}^{n \times n} \end{aligned}$$
(12)
$$\begin{aligned} h_{r,c} = \sum _{r'=1}^{n} \sum _{c'=1}^{n} w_{r,c;r',c'}s_{r',c'} + b_{r,c} \; \;, \; \;\mathbf {H_{P}} = (h_{r,c}) \in \mathbb {R}^{n \times n} \end{aligned}$$
(13)
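As a concrete sketch of (12) in the symmetric case, a sub-matrix \(\mathbf {w}_{r,c}\) is simply the negated outer product of row r of \(\mathbf {F}\) with row c of \(\mathbf {D}\). The function below is illustrative only; as discussed in Sect. 3.2, the PBM never materializes \(\mathbf {W_P}\) in full, but any sub-matrix can be generated on demand like this:

```cpp
#include <vector>

// F and D are n x n, row-major. Returns w_{r,c} per (12), sym case:
// w_{r,c}[rp*n + cp] = -f_{r,rp} * d_{c,cp}.
std::vector<double> make_w(const std::vector<double>& F,
                           const std::vector<double>& D,
                           int n, int r, int c) {
    std::vector<double> w(n * n);
    for (int rp = 0; rp < n; ++rp)
        for (int cp = 0; cp < n; ++cp)
            w[rp * n + cp] = -F[r * n + rp] * D[c * n + cp];
    return w;
}
```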

We enforce the 2n exactly-1 constraints by disallowing moves that violate them. Assuming that the system is initialized to a valid state that meets all the constraints, we propose trials via moves called swaps, as shown in Fig. 2b. A swap proposal involves picking two unique rows, r and \(r'\), from the neuron matrix and swapping the states of their ON neurons along columns c and \(c'\). If accepted, this move results in 4 simultaneous bit-flips within the binary neuron matrix. The change in energy as a result of such a move is shown in (14), where the first set of local field terms corresponds to the neurons being turned OFF while the second set is due to the neurons being turned ON. The two subtracted weights are required because two pairs of neurons may be coupled across the clusters: the first compensates for a coupling that is double-counted by the local fields of the two neurons turning OFF, while the second accounts for a coupling, previously inactive, between the two neurons turning ON. As shown in (15), we can generate the sum of these weights directly from F and D. A trial can then be performed by substituting the \(\varDelta E\) value from (14) into (5) and comparing the generated move probability against a value generated by rand().

$$\begin{aligned} \varDelta E(\varvec{\phi }^{\mathbf {t}},r,r') = ( h^t_{r,c} + h^t_{r',c'}) - (h^t_{r,c'} + h^t_{r',c}) - (w_{r,c;r',c'} + w_{r,c';r',c}) \end{aligned}$$
(14)
$$\begin{aligned} w_{r,c;r',c'} + w_{r,c';r',c} = {\left\{ \begin{array}{ll} -2f_{r,r'} d_{c,c'} &{} sym\\ -(f_{r,r'} + f_{r',r})(d_{c,c'} + d_{c',c}) &{} asm\\ \end{array}\right. } \end{aligned}$$
(15)
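The following is a minimal C++ sketch of a swap trial following (14)–(15) for the symmetric case; the function name and signature are illustrative, not the paper's listing. Here `phi` maps each row to its ON column, and `H` is the \(n \times n\) local field matrix stored row-major.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// Propose swapping rows r and rp; returns true if the move is accepted.
bool try_swap(std::vector<int>& phi, const std::vector<double>& H,
              const std::vector<double>& F, const std::vector<double>& D,
              int n, int r, int rp, double T, double& E,
              std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    int c = phi[r], cp = phi[rp];
    // (15), sym: w_{r,c;r',c'} + w_{r,c';r',c} = -2 f_{r,r'} d_{c,c'}
    double wsum = -2.0 * F[r * n + rp] * D[c * n + cp];
    // (14): fields of neurons turning OFF minus those turning ON, minus wsum
    double dE = (H[r * n + c] + H[rp * n + cp])
              - (H[r * n + cp] + H[rp * n + c]) - wsum;
    if (std::min(1.0, std::exp(-dE / T)) < uni(rng)) return false;  // (5)
    std::swap(phi[r], phi[rp]);   // the 4 implicit bit-flips in S_P
    E += dE;
    return true;                  // caller then applies the H update of (17)
}
```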

3.2 Updating the Local Field Matrix

When a swap proposal is accepted, the system state must be updated. Swapping the two values in \(\varvec{\phi }\) and adjusting the system energy is simple. However, updating the local field matrix involves a large number of calculations. Updating \(\mathbf {H_{P}}\) via (16) involves fetching four weight sub-matrices from global memory with long access delays. Interestingly, the structure of the weight matrix and the PBM itself allows these calculations to be performed efficiently while storing the majority of the required data within the L2 or L3 caches. For a symmetric problem, we can generate the required weights with a Kronecker Product operation on the differences between 2 rows of the \(\mathbf {F}\) matrix (\(\mathbf {\varDelta f}\)) and 2 rows of the \(\mathbf {D}\) matrix (\(\mathbf {\varDelta d}\)) using (17). For an asymmetric problem, an additional update using \(\mathbf {F^{\intercal }}\) and \(\mathbf {D^{\intercal }}\) is required. In this manner, the memory required to store the weight data is reduced from \(n^{4}\) elements for a monolithic weight matrix to the \(2n^{2}\) elements needed to store \(\mathbf {F}\) and \(\mathbf {D}\) when the problem is symmetric. For an asymmetric problem, an additional \(2n^{2}\) elements are needed to store \(\mathbf {F^{\intercal }}\) and \(\mathbf {D^{\intercal }}\). Storing a transposed copy of the matrices, while doubling the required memory, provides significant speedups due to a larger number of cache hits when fetching a small number of rows.

$$\begin{aligned} \mathbf {H}^{t+1}_{\mathbf {P}} = \mathbf {H}^{t}_{\mathbf {P}} - (\mathbf {w}_{r,c} + \mathbf {w}_{r',c'}) + (\mathbf {w}_{r,c'} + \mathbf {w}_{r',c}) = \mathbf {H}^{t}_{\mathbf {P}} + \varDelta \mathbf {H_{P}} \end{aligned}$$
(16)
$$\begin{aligned} \varDelta \mathbf {H_{P}} = {\left\{ \begin{array}{ll} \mathbf {\varDelta f} \otimes \mathbf {\varDelta d} = (\mathbf {F}_{r,*} - \mathbf {F}_{r',*})^{\intercal } \otimes (\mathbf {D}_{c,*} - \mathbf {D}_{c',*}) &{} sym\\ \mathbf {\varDelta f} \otimes \mathbf {\varDelta d} + (\mathbf {F}_{*,r} - \mathbf {F}_{*,r'}) \otimes (\mathbf {D}_{*,c} - \mathbf {D}_{*,c'})^{\intercal } &{} asm\\ \end{array}\right. } \end{aligned}$$
(17)
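A minimal sketch of the rank-1 local field update of (17), symmetric case, is shown below (names are illustrative). Only rows r, \(r'\) of \(\mathbf {F}\) and rows c, \(c'\) of \(\mathbf {D}\) are touched, so the working set stays cache-resident, and the inner loop is a simple fused multiply-add that compilers readily vectorize with SIMD instructions.

```cpp
#include <vector>

// H += (F_{r,*} - F_{r',*})^T (kron) (D_{c,*} - D_{c',*}), per (17), sym case.
// All matrices are n x n, row-major.
void update_H(std::vector<double>& H,
              const std::vector<double>& F, const std::vector<double>& D,
              int n, int r, int rp, int c, int cp) {
    for (int i = 0; i < n; ++i) {
        double df = F[r * n + i] - F[rp * n + i];              // Δf_i
        for (int j = 0; j < n; ++j)                            // Δf_i * Δd_j
            H[i * n + j] += df * (D[c * n + j] - D[cp * n + j]);
    }
}
```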

4 System Overview

4.1 Parallel Tempering

A major weakness of Simulated Annealing in traditional BM optimizers is that it can easily get stuck in a local minimum due to the unidirectional nature of the cooling schedule. Parallel Tempering (PT), first proposed in [23] and developed in [14], provides a means of running M cooperative copies (replicas) of the system, each at a different temperature, in order to search a larger portion of the landscape while allowing a mechanism for escaping from local minima. Replicas are generally arranged in order of increasing T from \(T_{min}\) to \(T_{max}\) in a temperature ladder. A replica, \(R_{k}\), operating at temperature \(T_{k}\), can stochastically exchange temperature with the replica immediately above it on the ladder, \(R_{k+1}\), with an Exchange Acceptance Probability (EAP) calculated via (18). Figure 3 outlines the structure of an optimization engine using BM replicas with PT.

$$\begin{aligned} EAP = min\{1, \mathrm {exp}((1/T_{k} - 1/T_{k+1})(E_{k}-E_{k+1}))\} \end{aligned}$$
(18)
Fig. 3. Overview of a Boltzmann Machine combined with Parallel Tempering. (a) Structure of a BM + PT Engine. (b) Example of a replica escaping from a local minimum and reaching the global optimum via climbing the PT ladder.

As implied by (3) and (5), higher T replicas can move around a larger portion of the landscape whereas the moves in lower T replicas are contained to a smaller subspace of the landscape. The ability of replicas to move up or down the ladder, as shown in Fig. 3b, allows for a systematic method of escaping from local minima, making PT a better choice for utilizing parallelism than simply running M disjoint replicas in parallel using SA, as proven in [10, 14]. In this paper, we implement a PT algorithm based on a modified version of Dabiri's work [7]. One drawback of PT algorithms such as the BM + PT system used in [7] is that their \(T_{max}\) and \(T_{min}\) must be manually tuned for each problem instance. This tuning requires considerable time and effort, yet has a dramatic effect on the efficacy of the optimization process. We partially address this issue by selecting, from each family of QAP problems, small instances whose solutions can be verified via exact algorithms, and using them to tune a function that automatically selects these parameters for that family within our system.
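A minimal sketch of one temperature-exchange sweep following (18) is given below; the data layout and names are our assumptions. Exchanging temperatures rather than states keeps each replica's \(\mathbf {H_P}\) matrix on its own core, consistent with the static replica-to-thread assignment described in Sect. 4.3.

```cpp
#include <algorithm>
#include <cmath>
#include <random>
#include <utility>
#include <vector>

// T[m], E[m]: temperature and energy of replica m.
// ladder[k]: index of the replica currently at rung k (increasing T).
void exchange_sweep(std::vector<double>& T, const std::vector<double>& E,
                    std::vector<int>& ladder, std::mt19937& rng) {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    for (std::size_t k = 0; k + 1 < ladder.size(); ++k) {
        int a = ladder[k], b = ladder[k + 1];
        // (18): EAP = min{1, exp((1/T_k - 1/T_{k+1})(E_k - E_{k+1}))}
        double eap = std::exp((1.0 / T[a] - 1.0 / T[b]) * (E[a] - E[b]));
        if (std::min(1.0, eap) >= uni(rng)) {
            std::swap(T[a], T[b]);              // replicas trade temperatures
            std::swap(ladder[k], ladder[k + 1]);
        }
    }
}
```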

4.2 Single-Threaded Program

Algorithm 1 presents our proposed PBM + PT system, which can be configured for varying levels of multi-threaded operation. A single PT engine (\(U = 1\)) is used with \(M=32\) replicas. The algorithm starts by initializing the temperature ladder, assigning random permutations to each replica, and populating their \(\mathbf {H_{P}}\) matrices and energy values. The system then enters an optimization loop where it runs Y trials for each replica in sequence, updating a replica's state every time one of its trials is accepted. After all replicas have finished their Y trials, temperature exchanges are performed. Similar to QAP solvers such as [15, 18,19,20, 24], this process is repeated until the state corresponding to the best-known solution (BKS) of a problem, with energy \(E_{BKS}\), is reached by one of the replicas, terminating the loop. The system then returns, for each replica, the minimum energy found and the corresponding state.

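The following is a hedged C++ sketch of the optimization loop just described; `Replica`, `trial()`, and `exchange_sweep()` are assumed names standing in for the paper's functions rather than its actual listing. With one thread the loop degenerates to the single-threaded variant; the OpenMP pragma reflects the multi-threaded configurations of Sect. 4.3.

```cpp
#include <omp.h>
#include <vector>

struct Replica {
    double T;            // current temperature on the PT ladder
    double best_energy;  // lowest energy observed so far
    void trial();        // one swap proposal/acceptance, as in Sect. 3.1
};
void exchange_sweep(std::vector<Replica>& reps);  // PT exchanges per (18)

void pbm_pt_engine(std::vector<Replica>& reps, int Y, double E_BKS) {
    bool done = false;
    while (!done) {
        // Run Y trials per replica; threads own (pairs of) replicas.
        #pragma omp parallel for schedule(static)
        for (int m = 0; m < (int)reps.size(); ++m)
            for (int y = 0; y < Y; ++y)
                reps[m].trial();
        exchange_sweep(reps);               // temperature exchanges
        for (const auto& r : reps)          // terminate once a replica hits BKS
            if (r.best_energy <= E_BKS) done = true;
    }
}
```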
Fig. 4. Load balancing threads via replica folding.

4.3 Multi-thread Load Balancing

For our implementation, we targeted a 64-core AMD 3990X CPU. Given the structure of a PBM combined with PT, one of the most intuitive ways to extract parallel speedups is to create a thread for each replica such that each replica runs on its own core with dedicated L1 and L2 caches.

One issue that arises from this form of parallel execution is that replicas at higher T have higher swap acceptance rates than replicas at lower T, resulting in more local field updates per trial on average and increasing their run-time. In our experiments, we observed that the number of accepted trials typically increases linearly with T, as demonstrated in Fig. 4a. To load-balance the threads, upon initialization of the system, replicas are folded and assigned in pairs to threads, as shown in Fig. 4b.

The replica-to-thread assignments are static throughout a run to ensure that there is minimal movement of \(\mathbf {H_P}\) data between cores. Although the temperature exchanges between replicas can cause load imbalance due to the static folding, their stochastic nature ensures that they are temporary with minimal effects.
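Below is a minimal sketch of one plausible folding rule consistent with Fig. 4b: since the accepted-trial count grows roughly linearly with T, pairing ladder rung k with rung \(M-1-k\) gives every thread a pair with a near-constant combined load. The exact pairing is our reading of the figure, not taken from the paper.

```cpp
#include <utility>
#include <vector>

// Returns, for each of M/2 threads, the pair of ladder rungs it owns
// (coldest with hottest, second-coldest with second-hottest, ...).
std::vector<std::pair<int, int>> fold_replicas(int M) {
    std::vector<std::pair<int, int>> owner(M / 2);
    for (int k = 0; k < M / 2; ++k)
        owner[k] = {k, M - 1 - k};
    return owner;
}
```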

4.4 Multi-threaded Configuration Selection

To find the optimal number of engines (U) and threads-per-engine (C), we ran all instances within the sko and taiXXb sets from QAPLIB [6] (excluding tai150b) 100 times each and recorded the average time-to-optimum (TtO) across all 100 runs for each instance. The TtO, reported in seconds, is measured as the average time for a solver to reach the BKS of a problem. We measured TtO values over the selected instances for a system with \(U=1\) across different C values and compared the TtO of each instance against those of a single-threaded system (\(U\times C=1\times 1\)). Figure 5a depicts the average speed-up of a single-engine system as C is varied relative to a \(1\times 1\) system, showing that the execution time decreases, with diminishing returns, as the number of threads is increased. We repeated this experiment, testing different combinations of U and C to find the optimal system configuration. Figure 5b compares the speed-up of different configurations relative to a \(1\times 1\) system, with the \(2 \times 32\) configuration having the highest average speed-up despite having no load-balancing. This implies that extra engines, even with load-balancing, cannot make up for their additional data movement costs.

Fig. 5. Speed-up across multi-core configurations.

5 Experiments and Results

We benchmark our PBM optimizers using a 64-core AMD Threadripper 3990X system with 128 GB of DDR4 running on CentOS 8.1. Our system was coded in C++, using the OpenMP API for multi-threading, and compiled with GCC-9. Two separate variants of our solver were benchmarked: PBM (\(U=1\), \(C=1\)) and PBM64 (\(U=2\), \(C=32\)). We compare the performance of our systems against eight state-of-the-art solvers, described in Table 1. Solvers [4, 5, 22] use a preset iteration/time limit as their termination criterion while [15, 18,19,20, 24] terminate as soon as the BKS is reached. All metrics are taken directly from the respective papers. We benchmarked instances from the QAP Library [6] along with ones created by Palubeckis [21] and Drezner [9]. The sets of instances from Palubeckis and Drezner are generated to be difficult with known optima, with the Drezner set being specifically ill-conditioned for meta-heuristics.

Table 1. State-of-the-art solver descriptions

5.1 Benchmarks: Previously Solved QAP Instances

Table 2 contains TtO values for our two PBM variants and the solvers in Table 1, across some of the most difficult instances from the literature that at least one other solver was able to solve with a 100% success rate within a five-minute time-out window. The bur set, while not difficult, was included as it is the only asm set used in the literature. The TtO reported for PBM is the average value across 10 consecutive runs with a five-minute time-out window for each run. For the solvers in Table 1, we report only the TtO from the best solver for each instance. The TtOs where PBM or PBM64 outperform the best solver are highlighted in Table 2. In 44 out of 60 instances, the fastest TtO is reported by one or both of our PBM variants, with speed-ups in excess of 10\(\times \) for PBM and 100\(\times \) for PBM64 on certain instances. Of the remaining 16 instances, PBM64 has either identical or marginally slower performance compared to the best reported solver.

Table 2. Time-to-optimum (s) comparisons across difficult QAP instances
Table 3. Performance across unsolvable instances

5.2 Benchmarks: Unsolvable QAP Instances

Table 3 contains performance comparisons between PBM64 and the solver from [19], ParEOTS, across QAP instances that no solver to date has been able to consistently solve within a five-minute time-out window. As neither ParEOTS nor PBM64 has a 100% success rate on these instances, we also compare their Average Percentage Deviation, calculated as \(APD = 100 \times (Avg - BKS)/BKS\), where Avg is the average of the best cost found in each run. We benchmarked PBM64 using the same procedure reported in [19], running each instance 10 times with a time-out window of five minutes and reporting the average time of the 10 runs along with the number of runs that reached the BKS, #bks.
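For example, with hypothetical values \(BKS = 1{,}000{,}000\) and \(Avg = 1{,}002{,}500\) (illustrative numbers, not from our benchmarks):

$$\begin{aligned} APD = 100 \times \dfrac{1{,}002{,}500 - 1{,}000{,}000}{1{,}000{,}000} = 0.25 \end{aligned}$$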

PBM64 displays better performance on the dre instances and has a near-100% success rate on tai150b. Across the other instances, ParEOTS reports equal or better performance, despite PBM64 performing better on smaller instances from the same families of problems. Further testing is required to compare the TtO of ParEOTS and PBM64 if run without a time-out limit.

6 Conclusion

We demonstrated a Permutational Boltzmann Machine with Parallel Tempering that is capable of solving NP-Hard problems such as QAP in excess of 100\(\times \) faster than other state-of-the-art solvers. The speed of the PBM is attributed to its simple structure, which lets us exploit parallelism through the parallel execution of its replicas on dedicated computational units along with the use of SIMD instructions when performing local field updates. Though our PBM + PT system, which uses a 64-core CPU, was the fastest in solving the majority of the QAP test cases by a wide margin, its flexibility allows it to be scaled to match the user's available hardware while maintaining competitive performance with other state-of-the-art solvers.