Multi-objective redundancy hardening with optimal task mapping for independent tasks on multi-cores

The rate of transient faults has increased significantly as the technology scales up. The tolerance of transient faults has become an important issue in the system design. Dual modular redundancy (DMR) and triple modular redundancy (TMR) are two commonly used techniques that can achieve fault detection and masking through executing redundant tasks. As DMR and TMR have different time and cost overheads, we must carefully determine which one should be used for each task (i.e., task hardening) to achieve the optimal system design. Furthermore, for multi-core systems, the system-level design includes the allocation of cores for the tasks (i.e., task mapping) as well. This paper aims at task hardening and mapping simultaneously for independent tasks on multi-cores with heterogeneous performances, in order to minimize the maximum completion time of all tasks (i.e., makespan). We demonstrate that once task hardening is given, task mapping of independent tasks can be achieved by employing min–max-weight perfect matching with a polynomial time complexity. Besides, as there is a trade-off between cost and time performance, we propose a multi-objective memetic algorithm (MOMA)-based task hardening method to obtain a set of solutions with different numbers of cores (i.e., costs), so the designer can choose different solutions according to different requirements. The key idea of the MOMA is to incorporate problem-specific knowledge into the global search of evolutionary algorithms. Our experimental studies have demonstrated the effectiveness of the proposed method and have shown that by combining the results of MOMA and MOEA we can provide a designer with a highly accurate set of solutions within a reasonable amount of time.


Introduction
The rapid technology advances, such as transistor size scaling, high operation frequency, and low voltage supply, have led to multiple reliability threats to circuits and systems (Constantinescu 2003). For example, the rate of transient faults in circuits has dramatically increased in the past years. A transient fault is usually caused by highenergy particle that strikes to flip the state of bits in an unpredictable way, which can corrupt the correct application execution state (Baumann 2005). Since the rate of transient faults is much higher than that of permanent faults, any reliable system should employ an effective scheme to tolerate transient faults in the future (Henkel et al. 2013a).
Multi-cores have emerged to be a popular and powerful computing platform for many recent systems (Ebi et al. 2009;Henkel et al. 2013b;Jahn et al. 2011). As the technology keeps scaling down, more and more cores can be integrated into a single chip to meet the growing computing demand of modern applications. Such applications depend on high-performance computing (HPC), and a high reliability in the presence of transient faults is required as well. On the other hand, the individual cores may exhibit significant frequency variations (e.g., up to 30%) because of process variations (result from fabrication imprecision) (Bowman et al. 2002;Dighe et al. 2011). This situation is worsening with technology scaling because of the difficulty of precise fabrication with reduced dimensions at a nanoscale.
To achieve reliability against transient faults, three types of redundancies have conventionally been applied, i.e., time redundancy, space redundancy, and hybrid redundancy. Time redundancy (i.e., re-execution) is based on checkpointing and to execute the task again in case of a fault (Kandasamy et al. 2003;Nikolov and Larsson 2016;Salehi et al. 2016a); it is applicable only after a fault can be detected. The overhead of local fault detection is not free since it is hard to achieve perfect transient fault detection. Space redundancy (i.e., replication), based on executing redundant replicas of task independently on different cores, does not require any specific fault detection mechanism and uses result comparison (majority voting) for fault detection and masking (Koren and Krishna 2007;Pradhan 1996;Salehi et al. 2016b). To make a majority voting, the number of replicas should be an odd number so that triple modular redundancy (TMR) is commonly used Lyons and Vanderkulk 1962). Although TMR is simple and predictable for tasks with deadlines, it requires more resources and energy than dual modular redundancy (DMR) (Dave and Jha 1999;Vadlamani et al. 2010). On the other hand, DMR is only capable of detecting fault since it is unable to decide which one is correct. Therefore, another way to implement fault masking is to combine DMR-based detection and checkpoint-based re-execution, which can be regarded as the hybrid redundancy (Kang et al. 2015;Pradhan and Vaidya 1994;Ziv and Bruck 1997). Both TMR and DMR are well suited for multi-core platforms as multi-cores can provide multiple processing units and low-overhead communication for comparing and voting.
Since different fault-tolerant techniques are usually characterized by different time and space overheads, there is an optimization trade-off in task hardening, i.e., determining one of the fault-tolerant techniques (e.g., DMR or TMR) for each task. This trade-off, together with the traditional optimization in task mapping, i.e., mapping each task to one of the cores, makes the design of fault-tolerant multi-cores very challenging (Das et al. 2014;Gan et al. 2012;Khosravi et al. 2014;Pop et al. 2009;Stralen and Pimentel 2012). It is notable that replication-based task hardening will introduce new tasks, i.e., replicas, into the system, and these replicas should be mapped as well. In this case, the design of fault-tolerant multi-cores can be regarded as a bi-level optimization problem due to its nested structure. Specifically, the upper-level optimization subproblem is task hardening and the lower-level optimization subproblem is task (including all original tasks and new tasks) mapping, thereby making the overall optimization computationally very intensive.
As both task hardening and task mapping are NP-hard in general cases (Bolchini et al. 2011), most of the previous methods are based on meta-heuristics, especially evolutionary algorithms (EAs). Through mapping a task to a set of cores, replication-based task hardening can be implicitly implemented, e.g., the number of mappings and the list of cores for each task are encoded as an integer vector in (Glaß et al. 2007;Huang et al. 2011;Huang et al. 2012;Reimann et al. 2008), or the mappings to all cores for each task are encoded as a 0-1 vector in Kang et al. (2014a, b). As task hardening is implicitly implemented in task mapping, the chromosome length of task mapping must be overestimated according to the maximum number of replicas that are probably introduced by replication-based hardening (Glaß et al. 2007;Huang et al. 2011;Huang et al. 2012;Reimann et al. 2008) or introduced by the total number of cores on the multi-core platforms (Kang et al. 2014a, b). The problem of overestimation is relaxed in the nested two-layer EA (Bolchini and Miele 2013;Bolchini et al. 2011), where the outer layer EA is used to explore task hardening, while the inner layer EA is used to explore task mapping on each extended task graph (Jhumka et al. 2005) independently. Nested two-layer structure is a commonly used framework for bi-level optimization problems, but the EA-based inner layer makes the whole optimization process very time-consuming. Tabu search (TS) seems to be an alternative way of exploring both task hardening and task mapping at the same time Lifa et al. 2010). As an individual/trajectorybased metaheuristic, the encoding strategy of TS is simple and flexible. But the performance of TS highly depends on the initial solution which is generated greedily in the TSbased methods, and generally TS is easy to get trapped in local optima, especially for problems with complex landscape.
As stated above, it is very difficult to develop efficient method for joint task hardening and task mapping for general cases, because of the nested structure of the two NP-hard problems. This paper builds upon the analysis that, for independent tasks (Bleuse et al. 2017;Hong and Prasanna 2007) (i.e., there is no data dependence between tasks), if the solution of task hardening is given, the solution of task mapping can be optimally obtained by employing min-max-weight perfect matching (MMW-PM). Specifically, both DMR-and TMR-based hardening techniques are considered, and we show how to link the problem of task mapping with the goal of minimizing the worst-case makespan (i.e., the maximum completion time of all tasks) to the MMW-PM model, and then, based on binary search and Hungarian algorithm (Kuhn 1955), we use an efficient heuristic algorithm [proposed in our previous work (Zhong et al. 2016) to obtain an MMW-PM from a bipartite graph with polynomial time complexity. Besides, as there is a trade-off between cost and time performance (Erbas et al. 2006), we propose a multi-objective memetic algorithm (MOMA) -based task hardening method to obtain a set of solutions with different numbers of cores (i.e., costs), so the designer can choose a solution from the Pareto front according to user preferences, such as cost budget or performance requirement. The key idea of MA (Chen et al. 2011;Wang et al. 2010) is to incorporate problem-specific knowledge into the global search of EAs. Our experimental studies have demonstrated the effectiveness of the proposed method and have shown that by combining the results of MOMA and MOEA we can provide a designer with a highly accurate set of solutions in a reasonable amount of time. Figure 1 shows an overview of our novel contributions: According to the nested structure of the design of faulttolerant multi-cores, the proposed MOMA-based task hardening can be regarded as the upper-level optimizer for the alternative between DMR and TMR for each task, while the proposed MMW-PM-based task mapping can be regarded as the lower-level solver for task-to-core assignments (for all original tasks and new tasks) in an optimal way.
The rest of this paper is organized as follows. In Sect. 2, we introduce the adopted problem model. The proposed design method of fault-tolerant multi-cores, including the optimal MMW-PM-based task mapping and MOMA-based task hardening, is detailed in Sects. 3 and 4. The simulation results and discussions are given in Sect. 5. Finally, Sect. 6 concludes this paper.

System model
We consider a resource library C = {c 1 , c 2 ,…, c M } consisting of M ISA-compatible RISC cores, which only has single thread per core. Each core c i has its own instruction and data cache to execute tasks. Due to the performance heterogeneity, e.g., process variations (Herbert et al. 2012;Raghunathan et al. 2013), each core c i has its own frequency, denoted as f i , which represents the number of instructions invoked by the core per second. For notational brevity, we index the M cores by a non-decreasing order of the current frequencies, i.e., f max = f 1 C f 2 … C f M = f min .
Independent tasks have been used in modeling some practical applications (Bleuse et al. 2017;Hong and Prasanna 2007), e.g., Monte Carlo simulations and computational phylogeny. In these scenarios, each task is to process a fixed amount of source data. The source data of all the tasks initially reside on a single node in the system, which we call the root node.
An instance of the problem is described as a set T = {t 1 , t 2 ,…, t N } of N independent tasks (Bleuse et al. 2017;Hong and Prasanna 2007). Task is the atomic unit executed by the multi-core platform. Due to the functional difference, each task t i has its own program to be executed by the core, and the number of instructions of the compiled program is denoted as d i . For notational brevity, we index the N tasks by a non-increasing order of the number of instructions, i.e., d max = d 1 C d 2 … C d N = d min . And, the execution time of task t i on core c j can be evaluated as et(t i , c j )-= CPIÁd i /f j , where CPI represents the clock cycle per instruction. Multi-objective redundancy hardening with optimal task mapping for independent tasks on… 983

Hardening technique
In this paper, both DMR with re-execution (Kang et al. 2015;Pradhan and Vaidya 1994;Ziv and Bruck 1997) and TMR Lyons and Vanderkulk 1962) are considered to be against transient faults. For a task hardened by DMR with re-execution, fault detection is achieved by DMR. If a fault is detected, the tasks are executed again by rolling back. For a task hardened by TMR, fault masking is directly achieved by majority voting. Both DMR with re-execution and TMR with major voting can achieve very high reliability even given a high fault rate. For example, if the fault rate of a task on a core is k = 10 -6 or 10 -7 (in the unit of #fault/cycles) to realize high fault scenarios as adopted by the related works (Hu et al. 2006;Li et al. 2004), the fault rate of the task hardened by TMR can be evaluated as k TMR = k 3 ? 3(1 -k)k 2 & 3 9 10 -12 or 3 9 10 -14 , and the fault rate of the task hardened by DMR with re-execution can be evaluated as 10 -12 or 4 9 10 -14 ; here, it is assumed that the maximum number of faults during the execution of each task is no more than 1. Both DMR with re-execution and TMR can achieve very high reliability, and they are characterized by different overheads in cost and time. Obviously, three cores are required in TMR, while only two cores are required in DMR with re-execution. On the other hand, the worst-case execution time (WCET) of a task t i in TMR mode can be evaluated as wcet(t i ) = max{et(t i1 ), et(t i2 ), et(t i3 )} ? et v , where tasks t i1 , t i2 , and t i3 are replicas of task t i , and e v is the execution time of major voting, while the WCET of a task t i in DMR mode can be evaluated as wcet(t i ) = 2 max{et(t i1 ), et(t i2 )} ? et r ? etc., where tasks t i1 and t i2 are replicas of task t i , etc. is the execution time of comparison, and et r is the execution time of rollback. Since multi-core platform can provide low-overhead communication for comparing and voting, compared with et(t i ), both etc. and et v are negligible. Therefore, TMR mode is more efficient than DMR mode in terms of WCET.

Problem statement
Assume we are given a resource library C = {c 1 , c 2 ,…, c M } with different frequencies f max = f 1 C f 2 … C f M = f min , and an instance consisting of independent tasks T = {t 1 , t 2 ,…, t N } with different numbers of instructions d max = d 1-C d 2 … C d N = d min , the goal of this paper is to (1) determine TMR mode or DMR mode for each task, and (2) allocate a core for each task (including all original tasks and new tasks), such that (1) the number of cores used, i.e., K = 2K DMR ? 3K TMR , is minimized, where K DMR and K TMR are the number of tasks hardened by DMR and TMR, respectively, and K DMR ? K TMR = N, and (2) the worstcase makespan, i.e., wcet(T) = max{wcet(t 1 ), wcet(t 2 ),…, wcet(t N )}}, is minimized. It is assumed that we have enough cores in the library so that we do not map two tasks or replicas to the same core in order to fully explore the parallelism of the multi-core platform.
To our knowledge, this is the first work specialized for independent tasks in the design of fault-tolerant multicores, as all the previous works (Das et al. 2014;Gan et al. 2012;Khosravi et al. 2014;Pop et al. 2009;Stralen and Pimentel 2012) consider general task sets (e.g., there are data dependences between tasks). As mentioned above, task mapping for general task sets is a well-known NP-hard problem. Independent task set, as a special kind of task sets, has been widely applied in modeling many real-world applications (Alazzoni and Down 2008;Bleuse et al. 2017;Cortadella et al. 2006;Hong and Prasanna 2007), but without considering faults. As shown in Sect. 3, under the assumption that the execution modes for all tasks are already known beforehand (Sect. 4), the task mapping of independent tasks can be optimized by employing minmax-weight perfect matching (MMW-PM) to minimize the worst-case makespan with polynomial time complexity.

Greedy mapping algorithm and a motivational example
The greedy mapping algorithm (Algorithm 1) is to map tasks to the cores that can be completed as fast as possible.
In the following, we provide a motivational example to explain why greedy mapping is not good enough for task mapping. Suppose that we are given three tasks, i.e., t 1 , t 2 , and t 3 . Task t 1 is executed in TMR mode, while task t 2 and t 3 are executed in DMR mode. Now, we consider the task mapping problem to allocate the cores to the tasks for minimizing the worst-case makespan, and the execution times of tasks on the cores are shown in Table 1. For simplicity, et v , etc., and et r are supposed to be zeroes here.
In this example, we can check all the possible mappings to obtain the optimal result that will be 74.16 ms, where TMR mode task t 1 uses core group {c 1 , c 2 , c 6 }, while DMR mode tasks t 2 and t 3 use core groups {c 3 , c 7 } and {c 4 , c 5 }. By using greedy mapping to assign the tasks and cores, the result is 79.08 ms, where TMR mode task t 1 uses core group {c 3 , c 4 , c 7 }, while DMR mode tasks t 2 and t 3 use core groups {c 2 , c 5 } and {c 1 , c 6 }. In the above example, we can observe that the greedy mapping strategy is not good enough. As a consequence, it is clear that such a task mapping problem requires a better strategy, whereas the straightforward exhaustive search is obviously not feasible in practice with the expected high time complexity.
A perfect matching is a matching which matches all vertices of the graph, that is, every vertex of the graph is incident to exactly one edge of the matching. The problem of finding a perfect matching can be solved by Hungarian algorithm (Kuhn 1955). Min-Max-weight perfect matching (MMW-PM) is defined on a weighted bipartite graph as a perfect matching, where the maximal weight of the edges in the matching has a minimal value among all perfect matchings of the given graph. One should note that the MMW-PM problem is different from the maximum (or minimum)weight perfect matching, where the sum of the weights of the edges in the matching has a maximal (or minimal) value among all perfect matchings of the given graph.
It is easy to link the task mapping problem to the proposed MMW-PM model. Given a task hardening solution, we can construct a new task set as T 0 ¼ t 0 1 ; t 0 2 ; . . .; t 0 K È É by replicating each task in T = {t 1 , t 2 ,…, t N } by 2 (DMR mode) or 3 (TMR mode) one by one, where K = 2K DMR-? 3K TMR , and K DMR and K TMR are the number of tasks hardened by DMR and TMR, respectively. On the other hand, we can obtain the resource allocation, , by selecting the first K cores with the highest frequencies from C = {c 1 , c 2 ,…, c M }. Then, a bipartite graph can be built as G 0 ¼ T 0 ; C 0 ; E 0 ð Þ , and E 0 go between each pair of t 0 i and c 0 j . The associated weight of each edge t 0 i ; c 0 j can be calculated according to: 1) Then, for a MMW-PM M 0 from G 0 , the matchings in M 0 represent the mappings from task replicas to cores, and the associated maximum weight of the edges is the worst-case makespan.
It is notable that the associated weight w t 0 i ; c 0 j of each edge is not the WCET of t 0 i on c 0 j , because comparing and voting are invoked until all replicas of the same task are completed. Actually, there is no way to know the execution time of other replicas of t 0 i beforehand, because we do not know the mappings at all. But, the worst-case makespan can be evaluated as: ; . . . n o : Multi-objective redundancy hardening with optimal task mapping for independent tasks on… 985 Therefore, we can obtain the real worst-case makespan of T by employing the above weighting strategy.

MMW-PM heuristic
The above section shows how to link the task mapping problem to the MMW-PM model. A naive way to find an MMW-PM (i.e., the optimal mapping) is to find all perfect matchings in the given bipartite graph first and then select the one whose maximal edge weight is minimal. Instead of using such an enumeration method, we use an efferent heuristic algorithm [proposed in our previous work (Zhong et al. 2016) for MMW-PM problem with low time complexity.
Given an undirected weighted bipartite graph G = (U, V, E, W), where U and V are disjoint and all edges in E go between U and V. At first, we sort the edges in G in ascending order of their weights. Our heuristic is to select an edge e in G and remove all the edges whose weights are larger than that of e in G, while a perfect matching method (i.e., Hungarian algorithm (Kuhn 1955)) is used to check whether a perfect matching exists. The framework of the heuristic method is iterative based on binary search, as shown in Algorithm 2 (Zhong et al. 2016). The algorithm starts with an initial perfect matching M obtained by the Hungarian algorithm (line 1), and then we have E 0 by sorting E in ascending order according to their weights (line 2). At each iteration, an edge e in E 0 is selected by binary search (lines 5 and 6) and a new graph G 0 is obtained through removing all the edges with larger weights in G (line 7), and then we can obtain a new matching M 0 by running the Hungarian algorithm on G 0 (line 8). If the cardinality of M 0 equates to that of M, i.e., M 0 is a perfect matching as well, we update M to M 0 (line 10) and set mid as high (line 11), otherwise, we increase low by 1 (line 13). The final solution M is returned until the loop terminates (line 16).
Obviously, Algorithm 2 (Zhong et al. 2016) would return a perfect matching of the input graph, as the resulting matching M has the same cardinality as the initialized M obtained from the input graph (line 1). Besides, the edges in G are removed according to the ascending order of their weights, so the resulting matching M satisfies that the maximal weight of the edges in M has a minimal value among all perfect matchings of the input graph. Therefore, Algorithm 2 can indeed find an MMW-PM from the given graph with the advantage of a high efficiency over the enumeration method. Given an undirected weighted bipartite graph G = (U, V, E, W), with |U| = |V| = n (n = K in our case), the time complexities of Hungarian algorithm and binary search are O(n 3 ) and O(logn), respectively. Therefore, the time complexity of Algorithm 2 is O(n 3 logn).

Trade-offs in task hardening
As analyzed in Sect. 2, DMR mode and TMR mode are characterized by different overheads in cost and time, and there is a trade-off in determining DMR or TMR for each task. Figure 2 shows a simple example of the trade-offs in task hardening for a problem instance, where the green circle represents the solution that all tasks are hardened by DMR, and the magenta star represents the solution that all tasks are hardened by TMR. ''DMR for all tasks'' uses the least number of cores, while ''TMR for all tasks'' achieves the best performance in terms of worst-case makespan. In order to provide the designer with a set of solutions with different trade-offs in cost and performance, we model the problem as a multi-objective optimization problem. Specifically, in terms of cost, the objective is to minimize the number of cores used, while in terms of performance, the objective is to minimize the worst-case makespan, so we have: Object 3.2 MOEA-based task hardening In recent years, MOEAs have been successfully applied to many kinds of multi-objective optimization problems (Shen and Yao 2015;Wang et al. 2015). While there is a wide variety of MOEAs, in this paper, we just consider the most popular one, i.e., nondominated sorting genetic algorithm II (NSGA-II) (Deb et al. 2002). The key ideas of NSGA-II include a fast nondominated sorting approach with low computational complexity and a novel selection process that creates a mating pool by combining the parent and offspring populations and selecting the best (with respect to fitness and spread) solutions. Genetic algorithm (GA) is adopted in the NSGA-II framework to work as the evolutionary engine for the population. GA generates solutions to the optimization problem using operations inspired by Darwinian principles of natural evolution, such as recombination (crossover), mutation, and selection. The detailed design of the elementary steps for evolving the population of solutions of task hardening is given as follows.
(1) Encoding In GA, the solution to the optimization problem is encoded in a chromosome as a set of parameters. In our implementation, each chromosome in the task hardening population is represented by a N-dimensional 0-1 vector th = {th 1 , th 2 , … th N } [ {0, 1} N , where N is the number of tasks. Each position of the vector describes the fault-tolerant technique that assigned to the task, i.e., th i = 0 indicates that task t i is hardened by DMR with re-execution and th i = 1 indicates that task t i is hardened by TMR.
(2) Crossover In GA, crossover is a genetic operator that recombines more than one (generally two) parent chromosomes to produce one or two child chromosomes from them. In our implementation, we use one-point crossover, the simplest one, i.e., a single crossover point on both parent vectors is selected first, and then all data beyond that point in either vector are swapped between the two parent vectors, the resulting vectors are the child chromosomes. In this way, the applied fault-tolerant techniques are mixed for the task hardening solutions.

(3) Mutation
In GA, mutation is a genetic operator that alters one or more gene values in a chromosome from its initial state to maintain the genetic diversity of the population. In general, mutation operators involve a probability that an arbitrary gene in a chromosome would be changed from its original state, and this probability should be set low. In our implementation, we use flip mutation for the 0-1 vector, i.e., the value of the chosen gene is inverted. In this way, the applied fault-tolerant technique is changed for the task in the task hardening solution. Selection occurs two times during each generation in GA. In our implementation, selection for reproduction is performed before the crossover operator is applied, which is based on a purely random basis without bias to filter any individual, and selection for survival is performed according to NSGA-II, i.e., a fast nondominated sorting approach and a novel selection process with the consideration of both fitness and spread.
The outline of the MOEA-based hardening is given in Algorithm 3, where PS, P c , and P m indicate the population size, probability of crossover, and probability of mutation, respectively. The algorithm starts with a population TH consisting of PS random individuals for task hardening (line 1). The fitness in terms of cost, i.e., Object 1 , is evaluated according to Eq. (1), and the fitness in terms of performance, i.e., Object 2 , is evaluated based on the proposed MMW-PM-based mapping (Algorithm 2) and Eq. (2) (line 2). During each generation, TH is evolved and the population of PS individuals generates PS children through the crossover operation (lines 4-7) and the mutation operation (lines 8-10). Then, the offspring NTH are evaluated (line 11) and used to update the current population (line 12) based on the fast nondominated sort. When the given maximum number of fitness evaluations is reached, the algorithm stops (line 13).

MOMA-based task hardening
As shown in the following experiments (Sect. 5), it is very difficult to obtain the whole Pareto front using the proposed MOEA-based hardening method so that we cannot provide well-distributed solutions for the designer to choose. In particular, it is hard to minimize Object 2 by NSGA-II, so the solutions are all partiality located at the range where Object 1 is minimized. This is because Object 2 involves a complex nonlinear mapping from decision variables to fitness value (i.e., using MMW-PM-based mapping), while Object 1 is just a linear combination of the variables (i.e., Eq. 1), and the select pressure on Object 2 is too weak in the adopted NSGA-II framework.
In order to incorporate bias in the search for Object 2 , we propose a problem-specific local search operator, and it can be regarded as a kind of memes in the case of MAs (Rubio-Largo et al. 2016;Tersi et al. 2015;Yuan and Xu 2015). In combinatorial optimization, local search operators generally work in the form of heuristics that are customized to a specific problem. In our implementation, the key idea of the heuristic is inherited from the previous greedy reassignment local search (GRLS) operators for logic mapping Yuan et al. 2014), which is to reassign the gene values of part of the parent chromosome by taking advantage of the greedy information extracted from the problem instance. In this paper, this idea is extended for task hardening.
As analyzed in Sect. 2, TMR mode is more efficient than DMR mode; it is expected that a task can be completed earlier if a task is executed in TMR mode. The GRLS operator designed for task hardening tries to flip the hardening technique from DMR to TMR for the given task t i with larger d i . It is expected to reduce the worst-case makespan, i.e., Object 2 . In order to release the time overhead added to the iterative process of GA, the time complexity of the operator should be as low as possible. In fact, for each task, the priority list of tasks in terms of d i can be sorted in advance; thus, the greedy information of the problem instance only needs to be evaluated once before the optimization process.
Since the operator is designed to be complementary to the stochastic search of GA, the incorporation of the operator should maintain the randomness as well. Besides, an operator with strong greediness will weaken the stochastic nature of GA and improve the risk of convergence to the local optima. Therefore, care should be taken when setting the strength of the GRLS operator in achieving the best search performance. In our implementation, for a given solution, a control parameter is introduced to limit the number of tasks that will be applied to by the GRLS operator. For example, NÁl tasks are randomly selected, where 0 B l B 1 is defined as the greedy strength factor that provides a flexible control on randomness or greediness of the GRLS operator.
The outline of the proposed GRLS operator for task hardening is given in Algorithm 4. NÁl tasks are randomly selected and marked as unhardened (line 1). In the loop (lines 2-6), an unhardened task t i with maximum d i is selected (line 3); if t i is hardened by DMR, then change DMR to TMR, and the loop terminates (line 4). By applying Algorithm 4 to each generated solution during the MOEA-based hardening algorithm, we obtain the MOMAbased hardening algorithm.

Simulations and discussion
In this section, the performance of the proposed method is experimentally investigated. First, we show that the proposed MMW-PM-based mapping consistently outperforms greedy mapping, and we test the impact of control parameter l on the performance of the proposed MOMAbased hardening. Then, compared with MOEA-based hardening, the effectiveness of MOMA-based hardening is verified by extensive experiments, especially on large-scale benchmarks. At the final, we show that by combining the results of MOMA-based hardening and MOEA-based hardening we can provide the designer with a highly accurate set of solutions in a reasonable amount of time.
In the experiments, a large set of benchmarks with different scales are synthesized, e.g., N = 10, 30, 50, and 100 that indicates the number of tasks in the problem instances, and M [ 3 N accordingly that indicates the number of cores in the resource library. We generate four test instances for each case. The number of instructions (d) of the compiled tasks ranges from 10 7 to 5 9 10 7 and randomly determined, and the value of CPI is assumed to be 1.5. The frequencies (f) of cores are normally distributed with mean 1 GHz and standard deviation 0.1 GHz. The voting time (et v ) involved in TMR, the comprising time (etc.) involved in DMR, and the recovering time (et r ) involved in re-execution are assumed to be 2 ms, 2 ms, and 6 ms, respectively.
GA is used as the evolutionary search engine, and the parameters of population size (PS), probability of crossover (P c ), and probability of mutation (P m ) are set as PS = 2 N, P c = 0.8, and P m = 0.8/N experimentally. The maximum number of generations is set to be 50 to limit the runtime (thus the number of function evaluations is 100 N), and in most cases, both MOMA and MOEA have converged to good solutions. Both generational distance (GD) and inverted generational distance (IGD) are used as performance metrics for the proposed MOMA-based hardening and MOEA-based hardening. GD is defined as the mean distance in the objective space from each obtained solution to its nearest Pareto optimal solution, while IGD is defined as the mean distance in the objective space from each Pareto optimal solution to its nearest obtained solution. GD provides a measure of proximity of the nondominated solutions in the objective space with respect to the Pareto optimal front, while IGD provides a measure of both proximity and diversity of the nondominated solutions in the objective space with respect to the Pareto optimal front; the smaller the value is, the better the Pareto optimal front is approximated. As customary, the nondominated solutions obtained by multiple runs of both MOMA and MOEA with larger populations and more generations are assumed as the Pareto optimal set, which is the reference set of GD and IGD. We implement the algorithms in MATLAB. All the experiments are performed on a 3.2 GHz Intel Core i5-6500 quad-core platform with 8 GB memory. However, all the tested algorithms are implemented as monolithic processes and no CPU core parallelism is exploited.

MMW-PM-based mapping versus greedy mapping
As analyzed in Sect. 3, if the solution of task hardening is given, the proposed MMW-PM-based mapping can obtain the optimal task mapping solution. Figure 3 shows the average worst-case makespan (WCET) of greedy mapping and MMW-PM-based mapping on test instances of different scales. In these simulations, since it is impossible to evaluate the algorithms with all possible task hardening solutions, each bar in the presented figures is obtained by averaging the results through multiple runs (e.g., 2 9 , 2 10 , 2 11 , and 2 12 for N = 10, 30, 50, and 100, respectively). As shown in Fig. 3, we can observe that the proposed MMW-PM-based mapping consistently outperforms greedy mapping on all cases, and the average improvements are 3.40%, 3.74%, 2.74%, and 2.54% for different problem scales (N = 10, 30, 50, and 100, respectively).

Sensitivity of parameter l
Parameter l controls the randomness or greediness of the GRLS operator; therefore, it provides a balance between the two conflicting objectives. Figure 4 shows the evolutionary curves in terms of IGD for MOMAs with different values of l on middle-size test instances of N = 50. To make the simulations more convincing, each group result is averaged over 30 runs. As shown in the figure, the larger the values of l, the faster the MOMA approximates the Pareto optimal front in terms of IGD. But, a very large l, i.e., l = 0.9, results in bad performances at the final stage; this is because such strong greediness disturbs the stochastic search of GA too much. On the contrary, a very small l, i.e., l = 0.1, cannot accelerate the approximation to the Pareto front significantly, but it indeed leads to good results if granted a long runtime. Therefore, a moderate value of l is preferred, e.g., l = 0.3 or 0.5 or 0.7, and the performance of MOMA-based hardening is not very sensitive to the values of l in this range, i.e., from 0.3 to 0.7. We use l = 0.5 in the following simulations. the Pareto optimal front. The maximum reductions on IGD values are about 82%, 75%, and 86%, and the average reductions are about 64%, 70%, and 80% for different problem scales (N = 30, 50, and 100, respectively). The better approximation can be attributed to the introduction of the problem-specific knowledge, so the effectiveness of the proposed GRLS for task hardening is demonstrated. We have performed statistical tests for the GD and IGD values of MOEA and MOMA on each benchmark instance. A two-tailed t test is conducted with a null hypothesis, stating that there is no difference between two algorithms in comparison. The null hypothesis is rejected if the p value is smaller than the significance level a = 0.05. We find that (1) there is no significant difference between the GD values of MOEA and MOMA on all test instances, (2) the IGD values of MOEA are statistically better than those of MOMA on small-scale test instances (N = 10), and (3) the IGD values of MOMA are statistically better than those of MOEA on large-scale test instances (N = 30, 50, and 100).

A comprehensive comparison
Figures 9, 10, 11, and 12 show the Pareto fronts obtained by multiple runs (e.g., 5 runs each) of both MOMA-based hardening and MOEA-based hardening on test instances of different scales (N = 10, 30, 50, and 100, respectively). In each figure, the green circle represents the all ''0'' solution, i.e., th = {0} N , that means all the tasks are hardened by DMR, while the magenta star represents the all ''1'' solutions, i.e., th = {1} N , that means all the tasks are hardened by TMR. Obviously, all ''0'' solutions minimize the number of cores according to Eq. (1), while all ''1'' solutions can achieve the best performance in terms of worstcase makespan. As shown in the figures, the near ''0'' solutions can be obtained by using MOEA-based hardening method, and we can approach the best performance with fewer cores by using MOMA-based hardening method. Another observation is that the results obtained by MOEAbased hardening are all partiality located at the area where Object 1 (number of cores) is minimized, especially for large-scale problems (e.g., N = 50 and 100). As analyzed Multi-objective redundancy hardening with optimal task mapping for independent tasks on… 991 above, this is because Object 2 (WCET) involves a complex nonlinear mapping from decision variables to fitness value, while Object 1 is just a linear combination of the values; the select pressure on Object 2 is too weak in the adopted NSGA-II framework. By incorporating the proposed GRLS operator, we can see MOMA-based hardening can cover this absent area; this is because the proposed operator has bias for minimizing WECT by using more space redundancy (from DMR to TMR). We can obtain a welldistributed Pareto front by combining the solutions of both methods; thus, we can provide the designer with a highly accurate set of solutions.

Hybrid fitness evaluation
As the time complexity (O(n 3 log n )) of the Hungarian-based optimal MMW-PM method (Algorithm 2) is high, the whole EA process is very time-consuming for large-scale problems, e.g., more than 2 days for n = 100 test instances. As a remedy, we propose to use hybrid fitness evaluation method to balance the accuracy and time efficiency of fitness evaluations. The key idea is to combine the optimal and greedy MMW-PM methods in the fitness evolution process (Jin et al. 2002;Jin 2005). Specifically, we use single optimal MMW-PM evaluation for the whole population in every D iterations, where D is called exact evaluation gap here. A few different values of exact evaluation gap D = 1, 5, 10, and 25, are tested as shown in Figs. 13 and 14, where D is marked as ''GAP.'' The results show that the runtime can be reduced significantly if the preference degradation is acceptable by the designers, e.g., D = 5. After obtaining the Pareto front, it is an open problem of how to make the final decision. From the point of view of multi-core system design, the designer can choose a solution from the Pareto front based on the available hardware resources. For example, if the required resources can be satisfied, the designer can implement a high-performance multi-core system with a low hardware cost. Alternatively, the designer could select the knee point on the Pareto front. It is important to note that the approximate Pareto front found by our algorithm can inform the designer about various trade-offs among conflicting objectives, which is essential in making an informed final decision.

Conclusion
This paper aims at jointly task hardening and task mapping for independent tasks on heterogeneous multi-core systems in order to achieve low cost and high performance. In order to minimize the worst-case makespan with optimal task mapping, we show that the mapping problem can be modeled as a min-max-weight perfect matching problem. Then, we propose a polynomial time complexity heuristic algorithm that works in a binary search framework and employs Hungarian algorithm as a subroutine. As there is a trade-off between the cost and performance in task hardening, we model the task hardening problem as a multi- Multi-objective redundancy hardening with optimal task mapping for independent tasks on… 993 objective optimization problem. Since a MOEA can find a partial Pareto front only, we propose a problem-specific local search operator and incorporate it in the MOEA framework. As shown by our simulation results, the proposed MOMA method can cover the missing part of the Pareto front, so we can obtain the entire Pareto front by combining the solutions of MOEA and MOMA. This work considers independent tasks (or parallel tasks); we will extend this work to the tasks with data dependents in our future work, e.g., sequential tasks. Besides, we will consider dynamic task mapping, where the frequencies of cores change with heat, etc.