Parallelization and implementation of multi-spin Monte Carlo simulation of 2D square Ising model using MPI and C++

In this paper, we present a parallel algorithm for Monte Carlo simulation of the 2D Ising Model to perform efficiently on a cluster computer using MPI. We use C++ programming language to implement the algorithm. In our algorithm, every process creates a sub-lattice and the energy is calculated after each Monte Carlo iteration. Each process communicates with its two neighbor processes during the job and they exchange the boundary spin variables. Finally, the total energy of lattice is calculated by map-reduce method versus the temperature. We use multi-spin coding technique to reduce the interprocess communications. This algorithm has been designed in a way that an appropriate load-balancing exists and it benefits a good scalability. It has been executed on the cluster computer of Plasma Physics Research Center which includes 9 nodes and each node consists of two quad-core CPUs. Our results show that this algorithm is more efficient for large lattices and more iterations.


Introduction
The Ising model [1] gives a microscopic description of the ferromagnetism which is caused by the interaction between spins of the electrons in a crystal.The particles are assumed to be fixed on the sites of the lattice.Spin is considered as a scalar quantity which can achieve two values +1 and −1.The model is a simple statistical one which shows the phase transition between high temperature paramagnetism phase and low temperature ferromagnetic one at a specific temperature.In fact, the symmetry between up and down is spontaneously broken when the temperature goes below the critical temperature.However, the one-dimensional Ising model, which has been exactly solved-shows no phase transition.The two-dimensional Ising model has been solved analytically with zero [2] and non-zero [3] external field.In spite of a lot of attempts to solve 3D Ising model, one might say that this model has never been solved exactly.All the results for the three-dimensional Ising model have been used approximation approaches and Monte Carlo methods.
Monte Carlo methods or statistical simulation methods are widely used in different fields of science such as physics, chemistry, biology, computational finance and even new fields like econophysics [4,5,6,7,8,9].The simulation can proceed by sampling from the Probability Density Function (PDF) and generating random numbers uniformly.The simulation of the Ising model on big lattices increases the cost of simulation.One way to reduce the simulation cost is to design the algorithms which work faster.Swendsen-Wang and Wolff algorithms [10,11] and multi-spin coding methods [12,13,14] are the examples of such methods.Another way is to parallelize and execute the model on GPUs, GPU clusters and cluster computers [15,16,17,18,19,20,21,22,23,24,25,26,27].
In this paper, we present a parallel algorithm to simulate the 2D Ising model using Monte Carlo Method.Then, we run the algorithm on a cluster computer using C++ programming language and MPI.Message Passing Interface (MPI) is a useful programming model in HPC systems [28,29,30,31,32,33,34] in which the processes communicate through message-passing and was designed for distributed memory architectures.MPI provides functionalities which allow two specified processes to exchange data by sending and receiving messages.To get high efficiency, it is necessary to have good load balancing and also to have minimum communications between processes.
In our algorithm, each individual process creates its own sub-lattice, initializes it, gets all Monte Carlo iterations done and calculates the energy of the sub-lattice for a specific temperature.Each process communicates with its two neighbor processes during the job and they exchange the boundary spin variables.Finally, the total energy of lattice is calculated by map-reduce method.Since in multi-spin coding technique each spin is stored by 3 bits, interprocesses communications are reduced considerably.Because computational load of each sub-lattice is assigned to each process and size of all sub-lattices is equal, an appropriate load-balancing exists.Since each process -independent of number of processes -only communicates with its two neighbor processes and the lattice is decomposed into sub-lattices, the algorithm benefits a good scalability.
This paper has been organized as follows.In section 2, Metropolis algorithm and the Ising model are studied briefly.In section 3, we explain how to use Multi-spin coding method to calculate the interaction energy between a specific spin and its nearest neighbors.We also study the boundary conditions in the memory-word lattice.Details of parallelization of the algorithm is discussed in section 4 and the method of implementation is given in section 5. Finally, the results are given in section 6.

Metropolise algorithm and Ising model
The Ising model consists of spins variables which take values +1 or −1 and are arranged in a one, two or three dimensional lattice.Each spin interacts with its neighbors and the interaction is given by the Hamiltonian: where J is the coupling coefficient.The summation in Eq. ( 1) is taken over the nearest neighbor pairs < m, n >.Periodic boundary conditions are used which state that spins on one edge of the lattice are neighbors with the spins on the opposite side.In this paper, we focus on simulation of the 2D square Ising model using Metropolis Monte Carlo algorithm [35].The lattice is initialized randomly and is updated as the following: 1. Select a spin (s i,j ) randomly and calculate the interaction energy between this spin and its nearest neighbors (E).
2. Flip the spin s i,j to s ′ i,j and again calculate the interaction energy (E ′ ).
3. △E = E ′ − E, if △E ≤ 0, s ′ i,j is accepted.Otherwise, s ′ i,j is accepted with the probability e −△E/KT where K is Boltzmann constant and T is the temperature.4. Repeat steps 1 − 3 till we are sure that every spin has been flipped.5. Calculate theh total energy of the lattice for i th iteration (E i total ).The steps above form a Monte Carlo iteration.We perform enough iterations (N times) and finally average on (E i total ) to obtain E total : 3 Multi-spin coding method Multi-spin coding refers to all techniques that store and process multiple spins in one memory word.In this paper, we apply the multi-spin coding technique to the 2D Ising model.In general, multi-spin coding technique results in a faster algorithm as a consequence of updating multiple spins simultaneously.However, we mainly employ this technique to reduce the interprocess communications.We apply the multi-spin coding introduced in ref. [12].However, in our implementation, the size of a memory word is 64 bits, in contrast to Jacobs's 60-bit memory word.In addition, each spin is retained in three consecutive bits and the value of the 64 th bit is always set to zero.000 represents the spin down and the spin up is shown by 001.Since a memory word contains 21 spins, the size of the lattice is taken to be 21N × 21N , where N is an integer greater than one.Now, we need to convert the spin lattice (Fig. 1.a) to the lattice of memory words (Fig. 1.b).Therefore, the size of the memory word lattice is considered as N × 21N .Each column of the spin lattice is coded into the same column of the memory word in the memory word lattice.So, 21N spins  in one column of the spin lattice are arranged in N memory words of a column in the memory word lattice as follows: where S(I, J) represents the memory word at the row I and the column J, 0 s i,j shows the spin located at the row i and the column j where j = J.The advantage of this arrangement is that each spin is placed in the appropriate position related to its neighbors.Consider k th spin in a given memory word S(I, J).The right/left/top/down neighbor of the k th in the spin lattice is exactly k th spin in the right/left/top/down neighbor of the memory word S(I, J) in the memory word lattice.
In order to apply periodic boundary conditions to the memory words in the first and last row (Fig. 2.a), we need to make some changes to up and down neighbors in advance.In fact, the down (up) neighbor of S(N −1, J) (S(0, J)) is not exactly S(0, J) (S(N − 1, J)).For a memory word in the first (last) row, its up (down) neighbor -which is the memory word in the last (first) row and in same column-have to be shifted 3 bits to the right (left).These two cases have been shown in the diagrams (b) and (c) of Fig. 2. We should recall that the 64 th bit is always set to zero.1: Different configurations which might happen between a selected spin and its four nearest neighbors.
The initial energy E and the energy E ′ after flipping the selected spin, have been shown in forth and fifth columns, respectively.The energy difference ∆E = E ′ − E has been given in the sixth column.the last column represents the value of a 3-bit group.

Calculation of energy
Now, using multi-spin coding method, we show how to calculate the energy difference (∆E) between two configurations in the Ising model.At first, to better understand the process, we consider two 3bit-spins s 1 and s 2 .s 1 XOR s 2 produces 000 when the two spins are placed in the same direction and 001 is given when spins s 1 and s 2 are in the opposite directions.Hence, for a given memory word S(I, J), the expression (S(I, J) XOR S(I − 1, J)) + (S(I, J) XOR S(I + 1, J))+ (S(I, J) XOR S(I, J − 1)) + (S(I, J) XOR S(I, J + 1)), generates a value in the range of [0, 4] for every 3bit-group given in the last column of Table 1.In the second and third columns, we have considered different cases that might occur between a selected spin and its four neighbors.The initial interaction energy E and the energy E ′ calculated after flipping the selected spin have been presented in the forth and fifth rows, respectively.

Parallelization
In a Monte Carlo Metropolis iteration, each memory word is updated at least once.The iterations must be performed enough times to yield accurate outcome energy.The given lattice could be vertically divided into N p sub-lattices with equal sizes, where N p is the number of processes.Computational load of each sub-lattice is assigned to the processes 0 to N p − 1 from left to right.Each process creates a sub-lattice of the specific size, initializes the sub-lattice, performs all Monte Carlo iterations and calculates the energy of the sub-lattice using Eq. ( 2).When all individual processes calculate the energy of their own sub-lattice, the energies of the sub-lattices are added up, through a Map-Reduce operation, to calculate the total energy of the lattice .However, this approach results in two problems.As illustrated in Fig. 3, half of the neighbors of the memory words on the border, are placed in the sub-lattice of neighbor process.Therefore, to calculate the energy of the memory words on the border, some memory words of the side sub-lattice are needed.Therefore, these memory words have to be observed when needed.Moreover, we should note that neighbor memory words should not be updated simultaneously by different processes.
To deal with the second problem, we propose a method in which the memory words are updated in two phases (see Fig. 3).In each phase, half of the sub-lattice is updated while the other half stays unchanged i.e. in phase 1 (2) the left (right) half of the sub-lattice is updated.After the phase 1 (2) is done, each process will pass to phase 2 (1) only if its right (left) process accomplishes the phase 1 (2).To better understand this process, we consider three consecutive processes in Fig. 3 which are updating the left half of their sub-lattices in phase 1.The process p2 has updated the left half of its sub-lattice and is going to start the phase 2 to update the right half.However, it is not able to reach the phase 2 until the process p3 accomplishes the phase 1 and finishes the update of the left half.So, the memory words on the borders of the processes p2 and p3, marked with × in Fig. 3, do not update simultaneously.In the same manner, the memory words on the borders of processes p1 and p2, marked with + in Fig. 3 do not update at the same time.Each half of the memory word lattice is updated in the phase 1 or 2. Therefore, the border memory words are not updated simultaneously.Now, we turn to the first problem.As mentioned before, in each phase half of a sub-lattice is updated.Before a process starts updating the half of the sub-lattice, it should receive the corresponding border memory words of the neighbor process.Suppose that the process p2 is going to update the left half of its sub-lattice in phase 1.It waits to receive the right-side border memory words of the process p1.The process p1 sends its right border memory words to the process p2 asynchronously just after it accomplishes the phase 2 of the last iteration.After p2 receives the border memory words from p1 synchronously, it starts updating the memory words in the phase 1.Just after finishing the phase 1, p2 sends its updated left-side border memory words to p1 asynchronously and goes to the phase 2. The similar procedure occurs for other processes as well.It should be mentioned that we use periodic boundary conditions thereby the left neighbor of the first process is the last process, and likewise the right neighbor of the last process is the first process.

Implementation
In this section, we describe the implementation details of the algorithm presented in the previous section.Consider a memory word lattice of size N × 21N where N is an arbitrary integer bigger than one.We execute the algorithm on N p processes and each process is identified by an integer number, 0 to N p −1, called rank.Each process is responsible for N c columns of the memory word lattice where N c = 21N Np .Each individual process creates its own sub-lattice, initializes it, gets all Monte Carlo iterations done and calculates the energy of the sub-lattice for a specific temperature.Within each Monte Carlo iteration a sub-lattice is updated many times and the energy of iteration is calculated.Finally, the total energy of the memory word lattice for a specific temperature is obtained via a reduce operation.This operation is illustrated in Fig. 4. Now, every step of the algorithm is studied in details.

Initialization
Now, we explain how a process creates its own sub-lattice and initializes it randomly.We use a 64-bit long integer as a memory word to store 21 spins.A 2D array of N × N c long integers forms the sub-lattice of the process where N c = 21N Np .However, we use an array with two additional columns (N × (N c + 2)) reserved for border memory words of the neighbor processes (see Fig. 5.a).In this paper, we denote this array by S-lattice.Each spin in the memory words of the s-lattice is initialized with 0 or 1 randomly -except for the first and last columns-which represent spin down and up, respectively.

Updating
As mentioned before, updating process is done in two phases.In each phase, one half of the sub-lattice is updated.In cases where N c is odd, f loor(N c /2) columns are updated in phase 1 and the rest of the columns is  updated in phase 2. Before starting the update process in each phase, some interprocess communication should be carried out.At first, each process sends its border memory words to its neighbor process asynchronously.Then, it waits to receive the border memory words from its neighbor processes.When the process receives the required border memory words, it can accomplish the phase by frequently updating the memory words that belong to the corresponding phase.In phase 1, in which the left half of the sub-lattice is updated, each process sends the rightmost column of its sub-lattice to its right neighbor (Fig. 5.b).So, the destination of the sending memory words is determined by the following code: destination = (Rank == NP-1) ?0: Rank+1; It means that, sue to the periodic boundary conditions, the right neighbor of the process with the rank N p − 1 is the process 0. Likewise, the source process from which the process receives the border memory words is determined by the following code: source= (Rank==0) ?NP-1:Rank-1; which means that the left neighbor of the process with the rank 0 is the process N p − 1.
The received column of memory words is stored in the first column of the s-lattice which has been reserved for the border memory words of the neighbor process.When the border memory words are received, the left half of sub-lattice is updated (Fig. 5.c).Likewise, the destination and the source, in phase 2, are determined by the following code (Fig. 5.d): destination = (Rank == 0) ?NP-1: Rank-1; source= (Rank==NP-1) ?0:Rank+1; The received column of the memory words is stored in the last column of the S-lattice which has been reserved for the border memory words of the neighbor process (Fig. 5.e).

Calculating the Energy Of a Monte Carlo Iteration
In order to obtain the total energy of the lattice, the energy of all nearest neighbor pairs must be considered.However, if we consider the interaction energy of the right and down neighbors of each memory word, the total Listing 1: Computing the energy of a memory word energy is calculated.Each process calculates the energy of each memory word in the S-lattice except for the first and last columns.Notice that the last column of the S-lattice contains the copy of the border memory words of the right process (Fig. 5.e).Since these border memory words are not used until they are sent, the copy of them is still valid.This copied column is used as the right neighbor of the last column of the sub-lattice.The code in Listing. 1 shows how the interaction energy between a specific memory word and its right and down memory words is calculated: In the line 3, the outcome of the expression on the right side of the assignment operator, is a memory word which includes 21 3bit-groups.Every group contains a number between 0 and 2 i.e. 0 represents −2J, 1 represents 0 and 2 represents +2J.Each 3bit-group retains the sum of the interaction energy between a specific spin with its right and down neighbors.The for loop iterates on the 3bit-groups of E. In each iteration, the energy of one 3bit-group is extracted and is added to rv where rv retains the sum of energies of the 3bit-groups.Therefore, rv contains the total energy of the 21 3bit-groups.

Results
We have executed our program on a part of the computer cluster of Plasma Physics Research Center which includes 16 nodes networked by a switch.Each node is equipped with two Intel Xeon X5365 CPUs.We have used up to 9 nodes to test our program.Three different cases with different number of iterations have been considered in Tab. 2.
The measured speed up and efficiency versus the number of cores are illustrated in Figs. 6 and 7, respectively.As shown, for all three test cases, as number of cores and nodes is increased, efficiency goes down, especially when one more node is exploited, the efficiency drops considerably.This fall is due to the fact that overhead of the communication between processes on different nodes is higher than overhead of the communication between processes on the same node.Now, we are able to inspect the impact of the lattice dimension and the number of Monte Carlo iterations on the performance of our algorithm.Comparing the test cases 2 and 3, it is inferred that bigger lattice sizes get better speedup and efficiency.In addition, the comparison between the test cases 1 and 2, we can claim when the number of Monte Carlo iterations increases, better speed up and efficiency is deduced.Therefore, our  algorithm has better performance for bigger lattice sizes and higher number of Monte Carlo iterations.

Figure 2 :
Figure 2: (a) Memory words in the first and last rows of a memory word lattice (b) up neighbor of S(0, J) formed by shifting three bits of S(N − 1, J) to the right (c) down neighbor of S(N − 1, J) formed by shifting three bits of S(0, J) to the left.

Figure 3 :
Figure 3: Sub-lattices of the memory words that belong to the three consecutive processes.The memory words on the border of the two different sub-lattices which have interaction with each other, are marked the same.Each half of the memory word lattice is updated in the phase 1 or 2. Therefore, the border memory words are not updated simultaneously.

Figure 4 :
Figure 4: Function of our implemented program.Left and right arrows denote interprocess communications.

Figure 5 :
Figure 5: (a) S-lattice of each process.(b) Communication between processes before updating the memory words in phase 1. (c) Updating memory words in phase 1.(d) Communication between processes before updating memory words in phase 2. (e) Updating the memory words in phase 2.

Figure 6 :Figure 7 :
Figure 6: Parallel speed-up versus the number of cores presented in Tab. 2.

/
/ i n i t i a l i z e t h e M P I L i b r a r y 89MPI_Init ( NULL , NULL ) ; 90 / / R a n k o f P r o c e s s 91 int Rank ; 92 M P I _ C o m m _ r a n k ( MPI_COMM_WORLD , & Rank ) ; 93 / / N u m b e r O f P r o c e s s e s int NP ; 95 M P I _ C o m m _ s i z e ( MPI_COMM_WORLD , & NP ) ; 96 / / s i z e o f l a t t i c e i s N x 2 1 N 97 const int N = 4; 98 / / T e m p e r a t u r e r a n g e a n d r i s e -u n i t 99 double = 0.001 , TempMax =10.001 , dT =0.1; 100 / / N o o f C o l u m n s d e d i c a t e d t o e a c h i n d i v i d u a l p r o c e s s 101 const int NC = 21 * N / NP ; 102 / / N o o f M o n t e C a r l o I t e r a t i o n s 103 int n o O f I t e r a t i o n s = 400; 104 / / N o o f M e m o r y W o r d s i n p h a s e 1 105 int n o O f M e m o r y W o r d s P 1 = N * floor ( NC /2) ; 106 / / N o o f M e m o r y W o r d s i n p h a s e 2 107 int n o O f M e m o r y W o r d s P 2 = ( N * NC ) -N * floor ( NC /2) ; 108 / / N o o f M e m o r y W o r d s t o b e u p d a t e d i n p h a s e 1 109 int n o O f U p d a t e s P 1 = n o O f M e m o r y W o r d s P 1 *10; 110 / / N o o f M e m o r y W o r d s t o b e u p d a t e d i n p h a s e 2 111 int n o O f U p d a t e s P 2 = n o O f M e m o r y W o r d s P 2 *10; 112 / / 2 D a r r a y t o r e t a i n s -L a t t i c e 113 long int s_lattice [ N ][ NC +2]; 114 / / a r a n d o m n u m b e r g e n e r a t o r t o s e l e c t a r o w f r o m s -l a t t i c e 115 u n i f o r m _ i n t _ d i s t r i b u t i o n < int > r o w G e n e r a t o r (0 , N -1) ; 116 / / a r a n d o m n u m b e r g e n e r a t o r t o s e l e c t a c o l u m n f r o m s -l a t t i c e 117 u n i f o r m _ i n t _ d i s t r i b u t i o n < int > c o l G e n e r a t o r (1 , NC ) ; 118

Table 2 :
Three different cases which have been examined in this paper.The number of iterations, N and the average number of updates per spin in one iteration have been presented in the second, third and forth columns, respectively.