Multithreaded computing in evolutionary design and in artificial life simulations
This article investigates low-level and high-level multithreaded performance of evolutionary processes that are typically employed in evolutionary design and artificial life. Computations performed in these areas are specific because evaluation of each genotype usually involves time-consuming simulation of virtual environments and physics. Computational experiments have been conducted using the Framsticks simulator running a multithreaded version of a standard evolutionary experiment. Tests carried out on five diverse machines and two operating systems demonstrated how low-level performance depends on the number of physical and logical CPU cores and on the number of threads. Two string implementations have been compared, and their raw performance turned out to fundamentally differ in a multithreading setup. To improve high-level performance of parallel evolutionary algorithms, i.e. the quality of optimized solutions, a new distribution scheme that is especially useful and efficient for complex representations of solutions—the convection distribution—has been introduced. This new distribution scheme has been compared against a random distribution of genotypes among threads that carry out evolutionary processes.
Keywords: Concurrency, Multithreading, Simulation, Evolution, Optimization, Performance
1 Introduction
In the areas of evolutionary design and artificial life, evolutionary processes [5, 7] are used to optimize designs (structures, constructs) or to mimic the biological world. In both cases, computer simulation plays a key role. However, simulating physics requires intensive computation, and the more detail is expected, the more computation is necessary. Fortunately, it is often possible to divide the simulated system into independent parts so that computations can be performed in parallel. This opens up a way to speed up the simulation, but it still requires a lot of computing power and some method of distributing the work among multiple processors.
For tests and computational experiments, the Framsticks simulator is employed here. Since its initial releases in 1996, this simulator has been used as a computing engine in a number of diverse applications, including comparison of genetic encodings in artificial life and evolutionary design, estimating symmetry of evolved and designed agents, employing a similarity measure to organize evolved constructs [18, 20], bio-inspired visual-motor coordination and real-time coordination, modeling robots and optimizing fuzzy controllers, user-driven (interactive, aesthetic) evolution, synthetic neuroethology [27, 28], analyses of brain activity evoked by perception of biological motion [34, 35], modeling perception of time in humans, modeling foraminiferal genetics, morphology, simulation, and evolution, and modeling communication, predator–prey coevolution, speciation, and other biological phenomena. These applications, and the fact that the Framsticks core is implemented in a low-level language (C++) for high efficiency but also features a higher-level scripting language, make it a representative example of software that is used for modeling and simulation of life and, in particular, evolutionary processes.
Many of the applications enumerated above require considerable amounts of computing power, and in most cases the more computing resources are available, the more meaningful the experiments and their results are. With modern computers equipped with many processors and cores, and with hardware development clearly continuing in this direction, multithreading makes it possible to exploit more computing power on a single machine in a single experiment.
The Framsticks environment allows for a flexible configuration of the way computing is parallelized, distributed, and organized. Multi-level and hybrid architectures are possible both in centralized and distributed scenarios [1, 17, 37], because every Framsticks server can perform multithreaded computation, genotype transfer, or both. This paper focuses on a basic one-machine architecture shown in Fig. 1: the master thread can create, delete, and control slave threads. The master thread does not perform any continuous work and only redistributes genotypes among slaves during migrations. Unlike the traditional master–slave evolutionary algorithm, where the master process runs the optimization algorithm and slaves evaluate individual solutions, here slaves perform independent evolutionary processes. For distributed configurations, this architecture is typically employed in machines that are end nodes. This is where the biggest gain can be achieved from parallel optimization, and this is currently the most popular architecture among researchers who do not use specialized high-end configurations.
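The one-machine architecture described above can be illustrated with a minimal Python sketch. It is not the Framsticks implementation: the names `evolve_slave` and `run_master`, the toy genotypes (plain floats whose fitness is their own value), the tournament of size 2, and the replace-the-worst rule are all assumptions made for illustration. For simplicity, slave threads are re-created before each migration instead of being kept alive and controlled by the master.

```python
import random
import threading


def evolve_slave(pool, evaluations, rng):
    """Toy independent evolutionary process run by one slave thread.

    Assumption: a genotype is a float and its fitness is the value itself.
    """
    for _ in range(evaluations):
        parent = max(rng.sample(pool, 2))            # tournament of size 2
        child = parent + rng.gauss(0.0, 0.1)         # mutation
        worst = min(range(len(pool)), key=pool.__getitem__)
        if child > pool[worst]:
            pool[worst] = child                      # steady-state replacement


def run_master(n_slaves=4, capacity=10, migrations=3,
               evals_per_migration=100, seed=0):
    """Master: starts slaves, waits, then redistributes genotypes randomly."""
    master_rng = random.Random(seed)
    pools = [[master_rng.random() for _ in range(capacity)]
             for _ in range(n_slaves)]
    for m in range(migrations):
        threads = [
            threading.Thread(
                target=evolve_slave,
                args=(pools[i], evals_per_migration,
                      random.Random(seed * 1000 + m * n_slaves + i)),
            )
            for i in range(n_slaves)
        ]
        for t in threads:
            t.start()
        for t in threads:
            t.join()                                 # master does no other work
        # Migration: gather all genotypes and hand them out at random.
        everyone = [g for pool in pools for g in pool]
        master_rng.shuffle(everyone)
        pools = [everyone[i * capacity:(i + 1) * capacity]
                 for i in range(n_slaves)]
    return pools


if __name__ == "__main__":
    final_pools = run_master()
    print(max(g for pool in final_pools for g in pool))
```

Each slave works only on its own gene pool and its own random generator, so the threads need no locking between migrations; this mirrors the independence of slaves that the architecture relies on.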
The goal of this work is to investigate and improve both low-level and high-level performance of parallel evolutionary algorithms in optimization of solutions that need much computational power to evaluate. Optimizing three-dimensional structures in evolutionary design and in artificial life is an example of such applications. Section 2 focuses on technical aspects of raw, low-level performance of multithreaded evolutionary optimization and investigates how the number of threads, gene pool capacity, CPU architecture, and string implementation influence the number of evaluated genotypes. Section 3 focuses on high-level performance of parallel evolutionary algorithms—performance is no longer measured as the number of genotype evaluations; it is rather the actual fitness that is achieved by evolutionary optimization. To improve the quality of optimized solutions, a new distribution scheme of genotypes among slave processes is introduced and tested. Section 4 summarizes this work.
2 Multithreading performance
All tests reported in this work were performed using the standard-mt experiment definition, a multithreaded version of the most common and versatile Framsticks evolutionary optimization experiment. This experiment script performs physical simulation of creatures built from genotypes that are mutated and crossed over in the course of a steady-state (i.e., non-generational) evolution [16, 26, 38]. The most computationally expensive part of the optimization process is the evaluation of the fitness of each genotype. This evaluation is based on the creature’s performance in the simulated physical world. The number of evaluations performed in a fixed amount of time (500 seconds) is our measure of performance.
Each test has been run for varying thread count and varying values of capacity. Capacity is the size of the slave gene pool (the number of genotypes) and as such, it influences the dynamics of the evolutionary optimization process.
In the multithreaded implementation, capacity and mix_period parameters determine the migration frequency. A migration occurs after reaching the desired number of genotype evaluations, expressed as the percentage of the gene pool capacity: there are capacity \(\times \) mix_period/100 evaluations between migrations. For example, for the default value of mix_period = 1000 that is used in all the experiments in this section, the number of evaluations performed by each slave between migrations is \(10\times \) capacity of the gene pool.
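The relation between capacity, mix_period, and the number of evaluations between migrations can be checked with a few lines of Python. The function name and the sample capacities below are illustrative assumptions; only the formula and the default mix_period = 1000 come from the text.

```python
def evaluations_between_migrations(capacity, mix_period):
    """Evaluations each slave performs between two migrations.

    mix_period is expressed as a percentage of the gene pool capacity.
    """
    return capacity * mix_period / 100


if __name__ == "__main__":
    for capacity in (50, 200, 1000):   # illustrative capacities
        print(capacity, evaluations_between_migrations(capacity, 1000))
```

With mix_period = 1000, every result is ten times the capacity, which is why smaller gene pools migrate more often within the same amount of time.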
In order to precisely compare raw performance across multiple runs, the actual genetic optimization has been disabled by removing the sources of genotype variability: only evaluation (simulation) and selection are performed, continuously operating on identical genotypes of medium complexity. The amount of memory required by all threads combined was at least one order of magnitude smaller than the amount of available RAM, and slaves did not perform any file or network operations.
The tests were carried out on five machines:
4/4-L. Desktop 4-core Intel Core2 Quad processor Q6600 running Linux Debian x86_64 3.14-2-amd64, with its CPU clock forced to a constant frequency of 1.6 GHz for better reproducibility.
4/4-W8. Desktop 4-core Intel Core i5-2500 processor running 64-bit Windows 8.1 Pro in safe boot mode, which forced the CPU clock to a constant frequency of 3.3 GHz.
4/8-W7. Desktop 4-core 8-thread Intel Core i7-4790 processor running 64-bit Windows 7 Pro in safe boot mode, which forced the CPU clock to a constant frequency of 3.6 GHz.
4/8-W8. Laptop 4-core 8-thread Intel Core i7-4700MQ processor running 64-bit Windows 8.1 Pro in safe boot mode, which forced the CPU clock to a constant frequency of 2.4 GHz.
16/32-L. Server with two 8-core 16-thread Intel Xeon E5-2660 processors (max turbo frequency 3 GHz) running Linux Ubuntu x86_64 3.2.0-52-generic.
In all 3D charts presented in this section, red, green, and blue surfaces demonstrate 40%, 70%, and 100% progress of the experiment, respectively. The blue surface presents the final results, while red and green surfaces are approximate and only shown to illustrate progress at intermediate stages of the experiment.
2.1 String implementation: reference counting vs. copying
Table: Two basic approaches to managing string contents that were compared in computational experiments

| | COW (copy-on-write) | PU (private unprotected) |
| --- | --- | --- |
| Approach | Uses reference counting | Allocates private character buffers for individual string objects |
| Synchronization in a multithreaded environment | Reference counter is protected by a mutex | No synchronization necessary |
| String copy operation (constructor, assignment) | Only a reference is copied; faster for long strings | String contents must be copied; faster for short strings |
| Memory usage | Efficient: usually stores only one copy of each unique string contents | Inefficient: stores each string content separately |
The experiments showed that the COW string implementation seriously limited multithreaded performance because synchronization was based on a single pthreads mutex shared between all strings. When the number of CPU cores was increased, the PU string implementation enabled a nearly linear parallelization speedup of the evolutionary experiment and did not significantly increase the memory footprint. The PU approach was up to 1.35x faster than COW on the 4/4-L machine and up to 10.8x faster on the 16/32-L machine. The difference in performance of the two string implementations on the 16/32-L machine is illustrated in the two bottom rows in Fig. 2.
This speedup was possible because the synchronization of string access was no longer necessary, and physical simulations and evolutionary algorithms can perform independent computations in each thread, with only a small minority of operations involving inter-thread communication and synchronization. Therefore, the PU string implementation was used in all the experiments in the following sections. The need for locking (which causes delays) is the price paid for the ability to access shared memory space, which is not the only possible implementation. Another possibility would be to use operating system processes instead of threads—this would provide a complete separation between processes, but at the same time would make it more difficult to access shared data during master-slave interactions.
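The structural difference between the two string approaches can be sketched in Python. This is only a conceptual model under stated assumptions: the class names `CowString` and `PuString` are hypothetical, reference release on destruction is omitted, and Python threads will not reproduce the slowdown measured for the C++ implementation. What the sketch does show is that every COW copy and every COW modification passes through one lock shared by all strings, while PU operations touch private data only.

```python
import threading

# Assumption from the text: one mutex is shared between all COW strings.
_SHARED_LOCK = threading.Lock()


class _Buffer:
    """Character buffer with a reference counter (used by CowString only)."""

    def __init__(self, chars):
        self.chars = chars
        self.refs = 1


class CowString:
    """Copy-on-write string: copies share a buffer until one is modified."""

    def __init__(self, text=""):
        self._buf = _Buffer(list(text))

    def copy(self):
        other = CowString.__new__(CowString)
        with _SHARED_LOCK:                    # counter update needs the lock
            self._buf.refs += 1
        other._buf = self._buf                # only a reference is copied
        return other

    def append(self, ch):
        with _SHARED_LOCK:
            if self._buf.refs > 1:            # shared buffer: detach first
                self._buf.refs -= 1
                self._buf = _Buffer(list(self._buf.chars))
        self._buf.chars.append(ch)

    def __str__(self):
        return "".join(self._buf.chars)


class PuString:
    """Private unprotected string: each object owns its characters."""

    def __init__(self, text=""):
        self._chars = list(text)

    def copy(self):
        return PuString(self._chars)          # contents are copied, no lock

    def append(self, ch):
        self._chars.append(ch)

    def __str__(self):
        return "".join(self._chars)


if __name__ == "__main__":
    a = CowString("abc")
    b = a.copy()
    b.append("d")
    print(a, b)
    p = PuString("abc")
    q = p.copy()
    q.append("d")
    print(p, q)
```

In the COW variant, threads that never share any string still contend for `_SHARED_LOCK`, which is the kind of bottleneck the measurements point to; the PU variant trades extra copying and memory for the absence of any shared state.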
2.2 Simulation and evolution
Since evolutionary processes and simulation performed by individual threads are highly independent and there is nearly no additional locking and synchronization apart from migrations, an almost linear speedup can be achieved when the number of threads is increased. This can indeed be observed in Fig. 2 except for the COW string implementation discussed in the previous section.
When tested across different values of the gene pool capacity, smaller gene pools experience more frequent migrations (Fig. 4), as the number of genotypes created and evaluated between migrations is proportional to the gene pool capacity. For a given capacity, the number of migrations decreases with a decreasing processing power of a single slave thread (i.e., with an increasing number of threads), but it is also influenced by the master thread being delayed because slave threads consume more processing power. Any such delay increases the time interval between migrations and thus decreases the number of migrations, which is especially visible for small capacities where the migration period is short. While the number of performed migrations varies widely, as Fig. 4 shows, the influence on the number of evaluated genotypes is so small that it cannot be noticed in the right column in Fig. 2.
To avoid unnecessary use of the CPU by the master thread, this thread was asked to sleep for 10 milliseconds in each simulation step. Given the test duration of 500 seconds and the 10-ms delay, the expected number of steps in ideal conditions is \(500/0.01=\) 50,000. The actual number of steps varies; it is close to 50,000 on Linux machines 4/4-L and 16/32-L, and it does not exceed 35,000 on Windows machines 4/8-W7 and 4/8-W8, which on average sleep for 5.6 milliseconds more than requested, as illustrated in Fig. 5.
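The step counts quoted above can be reproduced with simple arithmetic. The sketch below assumes that a step costs nothing except the sleep itself, which is an idealization; the function and constant names are illustrative.

```python
TEST_DURATION_S = 500.0      # test duration from the text
REQUESTED_SLEEP_S = 0.010    # 10 ms requested per simulation step
EXTRA_SLEEP_S = 0.0056       # average oversleep reported for Windows machines


def expected_steps(duration_s, sleep_s):
    """Upper bound on master steps if each step only sleeps for sleep_s."""
    return duration_s / sleep_s


if __name__ == "__main__":
    ideal = expected_steps(TEST_DURATION_S, REQUESTED_SLEEP_S)
    oversleeping = expected_steps(TEST_DURATION_S,
                                  REQUESTED_SLEEP_S + EXTRA_SLEEP_S)
    print(round(ideal), round(oversleeping))
```

An effective sleep of 15.6 ms gives roughly 32,000 steps in 500 seconds, which is consistent with the observation that the Windows machines did not exceed 35,000 steps.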
In configurations 4/4-L, 4/8-W7, and 16/32-L (but not in 4/8-W8), the master thread performance barely decreases under increased slave thread load. This suggests that the operating system measures the actual CPU time used by each individual thread and schedules accordingly. The master thread, waiting most of the time, uses less CPU than its fair share and is therefore not limited by the CPU shortage when the share decreases with an increasing thread count. A positive side effect of such behavior is that the master thread latency during slave event handling is minimized. This does not seriously influence the experiment (except for, perhaps, slightly disrupting the migration count by delaying migrations), because the amount of useful work depends almost entirely on the performance of slave threads. Configuration 4/8-W8 was the only laptop machine; its mobile processor did not have an integrated heat spreader as the desktop processors did, and this might have influenced the way the operating system scheduler allocated a busy CPU to a mostly idle thread. These differences between platforms in the way the master thread is managed do not affect the number of evaluated genotypes or the performance of the evolutionary process; if a particular scheduling behavior is required by an application, it can be enforced by setting the priorities of threads and processes accordingly.
3 Convection distribution scheme
Since the 1980s, a number of parallel evolutionary architectures have been proposed and implemented, differing in the way the population is decentralized, the topology of connections between nodes, their roles, and the way migration of genotypes is performed [1, 4, 17, 32, 37]. The most trivial approach to distributing genotypes to subpopulations (slaves) in centralized (master–slave) and coarse-grained architectures is to send to each slave the entire gene pool, or a random sample of the entire gene pool. This approach leverages the raw power of parallel evolutionary processes, but does not take advantage of any specific logic such as migrating the best genotypes [1, 17, 37].
In the convection distribution scheme, genotypes in the master’s gene pool are sorted according to fitness. Then each slave receives a subset of genotypes that fall within a range of fitness values. In the computational experiment, two methods of determining fitness ranges have been compared. In the first method, the entire fitness range is divided into equal intervals (as many as there are slaves); if there are no genotypes in some fitness interval, the corresponding slave receives genotypes from the nearest lower non-empty fitness interval. In the second method, the genotypes in the master are sorted according to fitness and then divided into as many sets as there are slaves, so that each slave receives the same number of genotypes. This idea is illustrated in Fig. 6.
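The two ways of determining fitness ranges can be sketched as follows. This is a minimal illustration, not the Framsticks script: the function names, the representation of a genotype as a `(name, fitness)` pair, and the handling of the degenerate case where all fitness values are equal are assumptions.

```python
def distribute_equal_width(pool, n_slaves):
    """First method: split the fitness range into n_slaves equal intervals.

    pool is a list of (genotype, fitness) pairs. A slave whose interval is
    empty receives the genotypes of the nearest lower non-empty interval.
    """
    lo = min(fitness for _, fitness in pool)
    hi = max(fitness for _, fitness in pool)
    # Assumption: fall back to a width of 1.0 when all fitness values are equal.
    width = (hi - lo) / n_slaves or 1.0
    subsets = [[] for _ in range(n_slaves)]
    for item in sorted(pool, key=lambda g: g[1]):
        index = min(int((item[1] - lo) / width), n_slaves - 1)
        subsets[index].append(item)
    # The lowest interval always holds the minimum, so it is never empty.
    for i in range(1, n_slaves):
        if not subsets[i]:
            subsets[i] = list(subsets[i - 1])
    return subsets


def distribute_equal_count(pool, n_slaves):
    """Second method: sort by fitness and cut into n_slaves equal-sized sets."""
    ranked = sorted(pool, key=lambda g: g[1])
    n = len(ranked)
    bounds = [i * n // n_slaves for i in range(n_slaves + 1)]
    return [ranked[bounds[i]:bounds[i + 1]] for i in range(n_slaves)]


if __name__ == "__main__":
    pool = [("a", 0.0), ("b", 0.1), ("c", 0.2), ("d", 1.0)]
    print([len(s) for s in distribute_equal_width(pool, 4)])
    print([len(s) for s in distribute_equal_count(pool, 4)])
```

For the sample pool, the equal-width method gives subset sizes 3, 3, 3, 1 because the two middle intervals are empty and inherit the genotypes of the lowest interval, while the equal-count method gives each of the four slaves exactly one genotype.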
The experiment concerned evolution of simulated 3D structures that maximized vertical position of the center of mass using the f1 genetic encoding. This encoding is a direct mapping between letters and parts of a 3D structure: ‘X’ represents a rod (a stick), parentheses encode branches in the structure, and additional symbols influence properties like length or rotation. The encoding is able to represent arbitrary tree-like 3D structures. Mutations modify individual aspects of the structure by adding or removing parentheses in random places in the genotype, or by adding and removing random symbols. Two-point crossover is used, and additional repair mechanisms validate the genotype by fixing parentheses if needed. Details of this genetic encoding are provided in .
Figure 8 summarizes the results of this experiment for tournament size of 2 (left column, low selection pressure) and 5 (right column, high selection pressure), and different migration frequencies (top and bottom rows). Despite the difficulty of this optimization task and numerous local optima causing high variance of the best achieved fitness, both convection distribution schemes performed similarly well. In all experiments both schemes proved to be significantly better than the random distribution; p values are shown for a two-tailed t test. The improvement provided by the convection distribution schemes is more pronounced when the selective pressure is higher (the tournament selection of size 5).
4 Summary and further work
This article discussed multithreaded performance of evolutionary processes that are typically employed in evolutionary design and artificial life. On a technical note, the experiments revealed that the string implementation that used reference counting and a mutex was much less efficient in a multithreading setup than a private unprotected string implementation. The negative impact of the mutex on overall performance increased as the number of threads increased. For more than 4 threads and the copy-on-write string, the overall performance started decreasing even if there were more than 4 CPU cores available. This illustrates the rationale behind using string implementations that do not require synchronization in multithreaded applications.
Further performance analyses confirmed that an evolutionary algorithm that requires the simulation of physics and control systems to evaluate genotypes can be efficiently parallelized. This is because in evolutionary design and artificial life experiments, evaluation of genotypes can usually be implemented as highly independent (an exception would be an environment where most individuals interact frequently). The CPU architecture (the number of physical and logical CPU cores) determines the speedup that can be achieved given a specific number of independent subpopulations (threads). For maximal performance in the evolutionary architecture considered here, the number of subpopulations should be equal to the number of “logical” CPU cores, as the master thread only performs migrations.
The convection distribution scheme that was introduced in this paper proved to be significantly better than the random (uniform) distribution of genotypes among slave subpopulations. One of the reasons for this efficiency may be the fact that genotypes with similar fitness values are usually similar, and crossing over similar parent genotypes is less likely to degrade the quality of their children. The performance of the convection distribution scheme can likely be further improved by employing more sophisticated ways of determining fitness intervals. The influence of the frequency of migrations on the performance of the evolutionary algorithm should be investigated as well. The promising performance of this distribution scheme should be tested on optimization benchmark functions, and these tests should include a single-threaded, single-population architecture, where convection distribution turns into convection selection.
The research presented in the paper received support from the Polish National Science Center (DEC-2013/09/B/ST10/01734).
- 3. Butenhof DR (1997) Programming with POSIX threads. Addison-Wesley Professional, Boston
- 4. Cantú-Paz E (1997) A survey of parallel genetic algorithms. Tech. Rep. 97003, University of Illinois at Urbana-Champaign
- 6. de Back W, Wiering M, de Jong E (2006) Red Queen dynamics in a predator–prey ecosystem. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation, pp 381–382. http://igitur-archive.library.uu.nl/vet/2007-0302-210407/wiering_06_red.pdf
- 8. González A, Latorre F, Magklis G (2011) Processor microarchitecture: an implementation perspective. Synthesis Lectures on Computer Architecture. Morgan & Claypool, San Rafael
- 9. Intel Corporation (2007) Intel Core2 Quad Processor Q6600 (8M Cache, 2.40 GHz, 1066 MHz FSB). http://ark.intel.com/products/29765
- 10. Intel Corporation (2011) Intel Core i5-2500 Processor (6M Cache, up to 3.70 GHz). http://ark.intel.com/products/52209
- 11. Intel Corporation (2012) Intel Xeon Processor E5-2660 (20M Cache, 2.20 GHz, 8.00 GT/s Intel QPI). http://ark.intel.com/products/64584
- 12. Intel Corporation (2013) Intel Core i7-4700MQ Processor (6M Cache, up to 3.40 GHz). http://ark.intel.com/products/75117
- 13. Intel Corporation (2014) Intel Core i7-4790 Processor (8M Cache, up to 4.00 GHz). http://ark.intel.com/products/80806
- 15. Jelonek J, Komosinski M (2006) Biologically-inspired visual-motor coordination model in a navigation problem. In: Gabrys B, Howlett R, Jain L (eds) Knowledge-based intelligent information and engineering systems, Lecture notes in computer science, vol 4253. Springer, Berlin, pp 341–348. doi: 10.1007/11893011_44. http://www.framsticks.com/files/common/BiologicallyInspiredVisualMotorCoordinationModel.pdf
- 16. Jones J, Soule T (2006) Comparing genetic robustness in generational vs. steady state evolutionary algorithms. In: Proceedings of the 8th Annual Conference on Genetic and Evolutionary Computation. ACM, La Jolla, pp 143–150
- 18. Komosinski M (2016) Applications of a similarity measure in the analysis of populations of 3D agents. J Comput Sci. doi: 10.1016/j.jocs.2016.10.004
- 22. Komosinski M, Mensfelt A, Tyszka J, Goleń J (2016) Multi-agent simulation of benthic foraminifera response to annual variability of feeding fluxes. J Comput Sci. doi: 10.1016/j.jocs.2016.09.009
- 24. Komosinski M, Ulatowski S (2013) Parallel computing in Framsticks. Research report RA–18/2013, Poznan University of Technology, Institute of Computing Science. http://www.framsticks.com/files/common/ParallelComputingFramsticks.pdf
- 25. Komosinski M, Ulatowski S (2016) Framsticks web site. http://www.framsticks.com
- 27. Mandik P (2002) Synthetic neuroethology. Metaphilosophy 33(1&2):11–29. http://www.petemandik.com/philosophy/papers/synthneur.pdf
- 28. Mandik P (2003) Varieties of representation in evolved and embodied neural networks. Biol Philos 18(1):95–130. http://www.framsticks.com/files/common/Mandik_RepresentationsInNeuralNetworks.pdf
- 29. Marr DT, Binns F, Hill DL, Hinton G, Koufaty DA, Miller JA, Upton M (2002) Hyper-threading technology architecture and microarchitecture. Intel Technol J 6(1):4–15
- 30. Microsoft (2016) Process and thread functions: Sleep. https://msdn.microsoft.com/en-us/library/windows/desktop/ms686298(v=vs.85).aspx
- 31. Mohamed R, Raviraj P (2011) Biologically inspired design framework for robot in dynamic environments using Framsticks. Int J Bioinform Biosci 1(1):27–35
- 32. Nesmachnow S, Cancela H, Alba E (2012) A parallel micro evolutionary algorithm for heterogeneous computing and grid scheduling. Appl Soft Comput 12(2):626–639. doi: 10.1016/j.asoc.2011.09.022. http://www.sciencedirect.com/science/article/pii/S1568494611004248
- 33. Nikseresht MR, Somayaji A, Maheshwari A (2010) Customer appeasement scheduling. Tech. Rep. TR-10-18, School of Computer Science, Carleton University. arXiv:1012.3452
- 36. Sutter H (2005) Exceptional C++ style: 40 new engineering puzzles, programming problems, and solutions. The C++ in-depth series. Addison-Wesley, Boston
- 37. Tomassini M (1999) Parallel and distributed evolutionary algorithms: a review. In: Neittaanmäki P, Miettinen K, Mäkelä M, Periaux J (eds) Evolutionary algorithms in engineering and computer science. Wiley, New York
- 38. Vavak F, Fogarty TC (1996) A comparative study of steady state and generational genetic algorithms for use in nonstationary environments. In: Evolutionary computing. Springer, Berlin, pp 297–304
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.