A Permutational Boltzmann Machine with Parallel Tempering for Solving Combinatorial Optimization Problems

. Boltzmann Machines are recurrent neural networks that have been used extensively in combinatorial optimization due to their simplicity and ease of parallelization. This paper introduces the Permutational Boltzmann Machine, a neural network capable of solving permutation optimization problems. We implement this network in combination with a Parallel Tempering algorithm with varying degrees of parallelism ranging from a single-thread variant to a multi-threaded system using a 64-core CPU with SIMD instructions. We benchmark the performance of this new system on Quadratic Assignment Problems, using some of the most diﬃcult known instances, and show that our parallel system performs in excess of 100 × faster than any known dedicated solver, including those implemented on CPU clusters, GPUs, and FPGAs.


Introduction
Boltzmann Machines (BM), first proposed by Hinton in 1984 [13], are recurrent, fully connected, neural networks that store information within their symmetric edge weights. When combined with Stochastic Local Search methods such as Simulated Annealing (SA) [1] or Parallel Tempering (PT) [10], BMs can be used to perform combinatorial optimization on complex problems such as TSP [3], MaxSAT [8], and MaxCut [16]. In this paper, we present an algorithm for a Permutational Boltzmann Machine (PBM), structured to solve complex, integer based, permutation optimization problems. We combine this PBM with Parallel Tempering and propose both single-threaded and multi-threaded, software implementations of this PBM + PT system using a 64-core CPU along with SIMD instructions. As a proof-of-concept, we show how to solve Quadratic Assignment Problems (QAP) [17] using a PBM and present experimental results on some of the hardest QAP instances from QAPLIB [6], Palubeckis [21], and Drezner [9]. We then show that, over the tested instances, our single-threaded and multi-threaded PBM systems can find the best-known-solutions of QAP problems in excess of 10× and 100× faster than the next best solver respectively.
The rest of this paper is organized as follows: Sect. 2 provides background on BMs and the formulation of QAP problems. Section 3 presents the structure of our Permutational Boltzmann Machine and Sect. 4 presents our single and multi-threaded implementations of a PBM + PT system on a multi-core CPU. Section 5 outlines the experiments conducted to benchmark the performance of our PBM + PT system and presents our results. Section 6 concludes this paper.

Boltzmann Machines
BMs, as shown in Fig. 1, are made up of N neurons, {x 1 , x 2 , . . . , x N } with binary states represented by vector S = [s 1 s 2 . . . s N ] ∈ {0, 1} N . Each neuron, x i , is connected to other neurons, x j , via symmetric, real-valued weights, w i,j ∈ R where w i,j = w j,i and w i,i = 0, forming a 2D matrix, W ∈ R N ×N . Each neuron also has a bias value, b i , which forms B ∈ R N ×1 . The cumulative inputs to the neurons, also referred to as their local fields, h i , form H ∈ R N ×1 and are calculated using (1).

(a) (b)
Each possible state of a BM has an associated energy term calculated via (2). The probability of the system being in any state depends on the energy of that state as shown in (3). The lower the energy, the higher the probability that the network will be in that state. BMs create an energy landscape for a problem through the weights that connect their neurons where the state(s) with the lowest energy corresponds to a valid solution. The procedures to convert various optimization problems to the BM format are discussed in [12]. The term T in (3), known as the system temperature, flattens or sharpens the energy landscape when increased or decreased respectively, providing a method to maneuver the landscape when searching for the global minimum.
Generally, BMs are combined with Simulated Annealing (SA) to solve optimization problems. Using SA, at a time-step t, where the system is in state S t with temperature T , the local fields H(S t ) are calculated using (1). In order to make an update to the system, we must conduct a trial. First, a neuron x i is randomly chosen and the change in energy as a result of flipping its state is calculated via (4). Next, the probability of flipping the neuron's state, P move , is calculated via (5) and is compared against a uniformly distributed random number in [0,1] to determine the change in the neuron's state using (6).
After the trial, the system state variables s i , H, and E need to be updated as shown in (7), (8), and (9) respectively, where W i, * and W * ,i represent row and column i of W respectively.
This procedure is repeated a preset number of times, occasionally cooling the system by decreasing T until it goes below a certain threshold, T thresh , at which point the process is terminated and the lowest energy state observed throughout the search is returned. This state may or may not correspond to a valid or optimal solution due to the stochastic nature of the algorithm but, theoretically, if given a long enough cooling schedule, the BM + SA system will eventually converge to an optimal answer [2].

Quadratic Assignment Problems (QAP)
QAP problems, first formulated in [17], are a class of NP-Hard permutation optimization problems to which many other problems such as the Travelling Salesman Problem can be reduced. While the formulation is relatively simple, QAP remains, to this day, one of the more challenging combinatorial optimization problems. QAP problems entail the task of assigning a set of n facilities to a set of n locations while minimizing the cost of the assignment. QAP problems are comprised of n × n matrices F = (f i,j ) and D = (d k,l ) which describe the flows between facilities and distances between locations respectively with the diagonal elements of both matrices being 0. A third n × n matrix B P = (b i,k ), describes the costs of assigning a facility to a location. All three matrices are comprised of real-valued elements. Given these matrices, each facility must be assigned to a unique location, generating a permutation, φ φ φ ∈ S n , where S n is the set of all permutations, such that the cost function (10) is minimized.
Generally, there are two variants of the QAP problem: symmetric (sym) and asymmetric (asm). In the symmetric case, either one or both of F and D are symmetric. If one of the matrices is asymmetric, it can be made symmetric by taking the average of an element and its complement. However, if both matrices are asymmetric, we can no longer symmetrize them in this manner. It is important to distinguish between these two cases as they are handled differently by a PBM, as will be shown in Sect. 3.

Structure and Update Scheme
The PBM's structure is an extension of Clustered Boltzmann Machines (CBM), first proposed by De Gloria [11]. A CBM places neurons that do not have any connections between them into groups called clusters. Within a cluster, the states of the neurons have no effect on each other's local fields; simultaneously flipping the states of multiple neurons in the same cluster has the same effect as flipping them in sequence. In a PBM, the neurons are arranged into an n × n matrix S P = (s r,c ), where each row, r i , and each column, c j , forms a cluster, as shown in Fig. 2a. On each cluster, we impose an exactly-1 constraint to ensure that within each row and each column, there is exactly one neuron in the ON state.
In the context of a permutation problem, the row-clusters represent a 1hot encoded integer in [1, n], allowing the neuron states to be represented via the integer permutation vector, φ. The column-clusters, in turn, enforce that every integer is unique. The n 2 × n 2 weight matrix is also reshaped into a 4D (n × n) × (n × n) matrix as shown in (11), allowing the generation of the w r,c sub-matrices via Kronecker Products (denoted by ⊗) of rows and columns of F and D via (12). The n × n local field matrix H P is calculated via (13).
We enforce the 2n exactly-1 constraints by not allowing moves that violate the constraints. Assuming that the system is initialized to a valid state that meets all the constraints, we propose trials via moves called swaps as shown in Fig. 2b.
A swap proposal involves picking two unique rows, r and r , from the neuron matrix and swapping the states of their ON neurons along columns c and c . If accepted, this move results in 4 simultaneous bit-flips within the binary neuron matrix. The change in energy as a result of such a move is shown in (14), where the first set of local field terms correspond to the neurons being turned OF F while the second set is due to the neurons being turned ON . The two weights being subtracted are required as we have two pairs of neurons that may be connected across the clusters. The first weight is to compensate for the weight being double added by the local fields of the two neurons turning OF F . The second weight is to account for a coupling that was previously inactive between the two neurons turning ON . As shown in (15), we can directly generate the sum of these weights using F and D. A trial can then be performed by substituting the ΔE value from (14) into (5) and comparing the generated move probability against a value generated by rand().

Updating the Local Field Matrix
When a swap proposal is accepted, the system state must be updated. Swapping the two values in φ and adjusting the system energy is simple. However, updating the local field matrix involves a large number of calculations. Attempting to update H p via (16) involves fetching four weight sub-matrices from global memory with long access delays. Interestingly, the structure of the weight matrix and the PBM itself allow these calculations to be performed efficiently while storing the majority of required data within L2 or L3 caches. For a symmetric problem, we can generate the required weights with a Kronecker Product operation on the differences between 2 rows of the F matrix (Δf ) and 2 rows of the D matrix (Δd) using (17). For an asymmetric problem, an additional update using F and D is required. In this manner, the amount of memory required to store the weight data is reduced from n 4 elements for a monolithic weight matrix to 2n 2 elements to store F and D when the problem is symmetric. For an asymmetric problem, an additional 2n 2 elements are needed to store F and D . Storing a transposed copy of the matrices, while doubling the required memory, provides significant speedups due to a larger number of cache hits when fetching a small number of rows.
4 System Overview

Parallel Tempering
A major weakness of Simulated Annealing in traditional BM optimizers is that it can easily get stuck in a local minimum due to the unidirectional nature of the cooling schedule. Parallel Tempering (PT), first proposed in [23] and developed in [14], provides a means of running M cooperative copies (replicas) of the system, each at a different temperature, in order to search a larger portion of the landscape while allowing a mechanism for escaping from local minima. Replicas are generally arranged in order of increasing T from T min to T max in a temperature ladder. A replica, R k , operating at temperature T k , can stochastically exchange temperature with the replica immediately above it on the ladder, R k+1 , with an Exchange Acceptance Probability (EAP ) calculated via (18). Figure 3 outlines the structure of an optimization engine using BM replicas with PT. As implied by (3) and (7), higher T replicas can move around a larger portion of the landscape whereas the moves in lower T replicas are contained to a smaller subspace of the landscape. The ability of replicas to move up or down the ladder, as shown in Fig. 3b, allows for a systematic method of escaping from local minima, making PT a better choice for utilizing parallelism than simply running M disjoint replicas in parallel using SA as proven in [10,14]. In this paper, we implement a PT algorithm based on a modified version of Dabiri's work [7]. One drawback to PT algorithms such as the BM + PT system used in [7] is that their T max and T min must be manually tuned for each problem instance. This requires considerable time and effort while having dramatic effects on the efficacy of the optimization process. We partially address this issue by selecting, from each family of QAP problems, small instances whose solutions can be verified via exact algorithms, to tune a function that automatically selects these parameters for that family within our system.

Single-Threaded Program
Algorithm 1 presents our proposed PBM + PT system which can be configured for varying levels of multi-threaded operation. A single PT engine (U = 1) is used with M = 32 replicas. The algorithm starts by initializing the temperature ladder and assigning random permutations to each replica and populating their H P matrices and energy values. The system then enters an optimization loop where it runs Y trials for each replica in sequence using the RUN R() function, updating their states every time a trial is accepted by calling Swap(). After all replicas have finished their Y trials, temperature exchanges are performed. Similar to QAP solvers such as [15,[18][19][20]24], this process is repeated until the state corresponding to the best-known-solution (BKS ) of a problem, E BKS , is reached by one of the replicas, terminating the loop. The system then returns, for each replica, the minimum energy found and the corresponding state.

Multi-thread Load Balancing
For our implementation, we targeted a 64-core AMD 3990X CPU. Given the structure of a PBM combined with PT, one of the most intuitive ways to extract parallel speedups is to create a thread for each replica such that they all run on a unique core with their own dedicated L1 and L2 caches.
One issue that arises from this form of parallel execution is that replicas at higher T have higher swap acceptance rates than replicas at lower T resulting in  Energies are stored using fp64 while F, D, B P , and H P elements are stored using fp32. Floats allow for the use of fused-multiply-add operations when implementing (17) within Swap(). To fully utilize our available hardware, SIMD instructions were used wherever possible for significant speed-ups. more local field updates per trial on average, increasing their run-time. In our experiments, we observed that the number of trials accepted typically increases linearly as T is increased as demonstrated in Fig. 4a. To load-balance the threads, upon initialization of the system, replicas are folded and assigned in pairs to threads as shown in Fig. 4b.
The replica-to-thread assignments are static throughout a run to ensure that there is minimal movement of H P data between cores. Although the temperature exchanges between replicas can cause load imbalance due to the static folding, their stochastic nature ensures that they are temporary with minimal effects.

Multi-threaded Configuration Selection
To find the optimal number of engines (U ) and threads-per-engine (C), we ran all instances within the sko and taiXXb sets from QAPLIB [6] (excluding tai150b) 100 times each and recorded the average time-to-optimum (TtO) across all 100 runs for each instance. The TtO, reported in seconds, is measured as the average time for a solver to reach the BKS of a problem. We measured TtO values over the selected instances for a system with U = 1 across different C values and compared the TtO of each instance against those of a single-threaded system (U × C = 1 × 1). Figure 5a depicts the average speed-up of a single-engine system as C is varied relative to a 1 × 1 system, showing that the execution time decreases as the number of threads is increased with diminishing returns. We repeated this experiment, testing different combinations of U and C to find the optimal system configuration. Figure 5b compares the speed-up of different configurations relative to a 1 × 1 system with the 2 × 32 configuration having the highest average speed-up despite having no load-balancing. This implies that extra engines, even with load-balancing, cannot make-up for their addtional data movement costs.

Speed-up v Cores-per-Engine
Speed-up v Multi-Engine Configs.

Experiments and Results
We benchmark our PBM optimizers using a 64-core AMD Threadripper 3990X system with 128 GB of DDR4 running on CentOS 8.1. Our system was coded in C++, using the OpenMP API for multi-threading, and compiled with GCC-9. Two separate variants of our solver were benchmarked: PBM (U = 1, C = 1) and PBM64 (U = 2, C = 32). We compare the performance of our systems against eight state-of-the-art solvers, described in Table 1. Solvers [4,5,22] use a preset iteration/time limit as their termination criterion while [15,[18][19][20]24] terminate as soon as the BKS is reached. All metrics are taken directly from the respective papers. We benchmarked instances from the QAP Library [6] along with ones created by Palubeckis [21] and Drezner [9]. The sets of instances from Palubeckis and Drezner are generated to be difficult with known optima, with the Drezner set being specifically ill-conditioned for meta-heuristics.  Table 2 contains TtO values for our two PBM variants and the solvers in Table  1, across some of the most difficult instances from literature that at least one other solver was able to solve with a 100% success rate within a five minute time-out window. The bur set, while not difficult, was included as it is the only asm set used in literature. The TtO reported for PBM is the average value across 10 consecutive runs with a 5 min time-out window for each run. For the solvers in Table 1, we report only the TtO from the best solver for each instance. The TtOs where PBM or PBM64 outperform the best solver are highlighted in Table  2. In 44 out of 60 instances, the fastest TtO is reported by one or both of our PBM variants with speed-ups in excess of 10× for PBM and 100× for PBM64 on certain instances. Of the remaining 16 instances, PBM64 has either identical or marginally slower performance compared to the best reported solver. Table 3 contains performance comparisons between PBM64 and the solver from [19], ParEOTS, across QAP instances that no solver to date could consistently solve within a 5 min time-out window. As neither ParEOTS or PBM64 have a 100% success rate on these instances, we also compare their Average Percentage   Deviation, calculated as AP D = 100 × (Avg − BKS)/BKS. Avg is calculated as the average of the best cost found in each run. We benchmarked PBM64 using the same procedure reported in [19], running each instance 10 times with a time-out window of 5 min and reporting the average time of the 10 runs along with the number of runs that reached the BKS, #bks PBM64 displays better performance on the dre instances and has a near 100% success rate on tai150b. Across other instances, ParEOTS reports equal or better performance despite PBM64 performing better on smaller instances from the same family of problems. Further testing is required to compare the TtO of ParEOTS and PBM64 if ran without a time-out limit.

Conclusion
We demonstrated a Permutational Boltzmann Machine with Parallel Tempering, that is capable of solving NP-Hard problems such as QAP in excess of 100× faster than other state-of-the-art solvers. The speed of the PBM is attributed to its simple structure where we can utilize parallelism through the parallel execution of its replicas on dedicated computational units along with using SIMD instructions when performing local field updates. Though our PBM + PT system, which uses a 64-core CPU, was the fastest in solving the majority of the QAP test cases by a wide margin, its flexibility allows it to be scaled to match the user's available hardware while maintaining competitive performance with other state-of-the-art solvers.