MTHSA-DHEI: multitasking harmony search algorithm for detecting high-order SNP epistatic interactions

Genome-wide association studies have succeeded in identifying genetic variants associated with complex diseases, but the findings have not been well interpreted biologically. Although it is widely accepted that epistatic interactions of high-order single nucleotide polymorphisms (SNPs) [(1) Single nucleotide polymorphisms (SNP) are mainly deoxyribonucleic acid (DNA) sequence polymorphisms caused by variants at a single nucleotide at the genome level. They are the most common type of heritable variation in humans.] are important causes of complex diseases, the combinatorial explosion of millions of SNPs and multiple tests impose a large computational burden. Moreover, it is extremely challenging to correctly distinguish high-order SNP epistatic interactions from other high-order SNP combinations due to small sample sizes. In this study, a multitasking harmony search algorithm (MTHSA-DHEI) is proposed for detecting high-order epistatic interactions [(2) In classical genetics, if genes X1 and X2 are mutated and each mutation by itself produces a unique disease status (phenotype) but the mutations together cause the same disease status as the gene X1 mutation, gene X1 is epistatic and gene X2 is hypostatic, and gene X1 has an epistatic effect (main effect) on disease status. In this work, a high-order epistatic interaction occurs when two or more SNP loci have a joint influence on disease status.], with the goal of simultaneously detecting multiple types of high-order (k1-order, k2-order, …, kn-order) SNP epistatic interactions. Unified coding is adopted for multiple tasks, and four complementary association evaluation functions are employed to improve the capability of discriminating the high-order SNP epistatic interactions. We compare the proposed MTHSA-DHEI method with four excellent methods for detecting high-order SNP interactions for 8 high-orderepistatic interaction models with no marginal effect (EINMEs) and 12 epistatic interaction models with marginal effects (EIMEs) (*) and implement the MTHSA-DHEI algorithm with a real dataset: age-related macular degeneration (AMD). The experimental results indicate that MTHSA-DHEI has power and an F1-score exceeding 90% for all EIMEs and five EINMEs and reduces the computational time by more than 90%. It can efficiently perform multiple high-order detection tasks for high-order epistatic interactions and improve the discrimination ability for diverse epistasis models.

, such as the 1 causes of complex diseases, due to the rapid development of high-throughput sequencing technology and dramatic declines in sequencing costs. GWASs are dedicated to detecting genetic variants associated with complex traits/diseases from single nucleotide polymorphisms (SNPs), which are the most common genetic variations in human deoxyribonucleic acid (DNA) sequences [2][3][4][5].
Many important and interesting findings have been made by GWASs using single-SNP-based and SNP-pair-based methods. Single-SNP analysis approaches for GWASs, such as the single-SNP test [5], compare the relative frequencies of genotypes between case and control samples independently of other SNP loci, and some results have been successfully translated to candidate drugs [6,7]. Nevertheless, most studies fail to effectively explain the causal SNPs of complex diseases. One important reason is that most studies focus on discovering the contribution of single SNPs to complex disease status/traits in isolation, and SNPs with a small effect on the phenotype were neglected in further analysis [8]. In recent years, various multi-SNP methods have been employed for GWASs, such as penalized regression [9][10][11], which can eliminate SNPs associated only with the phenotype due to their linkage disequilibrium (LD) with causal SNPs [11]. An increasing number of studies indicate that epistatic interactions across the whole genome ubiquitously exist in relation to complex diseases [12]. Epistatic interaction generally refers to joint interaction effects among multiple genetic variants in the genome, where the effect of a set of genes or SNPs on a phenotype is unequal to the sum of their independent contributions [11]. Epistatic interactions are now widely regarded to determine individual susceptibility to complex diseases [8,13].
Detecting high-order epistatic interactions in the human genome has become a very important goal in GWASs, but it is also extremely challenging because there are hundreds of thousands of SNPs in the human genome, creating a very complex "combination explosion problem". Current computers are not capable of determining whether each kth-order (k > 2) SNP combination has an epistatic interaction effect in a limited time. To address this problem, high-performance computing and heuristic searches have been presented to accelerate the detection of high-order epistatic interactions. High-performance computing usually adopts graphics processing units (GPUs) and parallel processing techniques to improve the speed of computers. Guo et al. employed cloud computing to detect high-order epistatic interactions [14], and forty virtual machines were adopted to accelerate the detection of such interactions. Yang et al. [15] developed a GPU-based permutation tool to accelerate the detection of SNP-SNP interactions based on the likelihood ratio (LR) test with the assumption that the statistic follows a χ 2 distribution. Cecilia et al. [16] presented a tool called MPI3SNP that implements a multicentral processing unit (CPU) and multi-GPU clusters to detect 3rd-order epistatic interactions. Alex Upton et al. [7] reviewed high-performance computing and cloud computing used to detect epistasis in detail and dissected different computational approaches to analyse epistatic interactions in disease-related genetic datasets. GPU and parallel processing techniques can speed up detection but are insufficient for high-order (> 3) epistatic interactions because the time complexity of detecting high-order SNP epistatic interactions is not reduced if the search algorithm still has high time complexity (i.e., exhaustive search algorithm).
To reduce the computational burden, heuristic search techniques, such as the Monte Carlo method [13,17,18], the spanning tree method [19], and swarm intelligence search algorithms (SISAs) [20,21], use current information about the target problem as heuristic information that can improve search efficiency and reduce the number of searches. The Monte Carlo method employs random sampling procedures to explore potential SNP epistatic interactions and can speed up epistasis detection, but its power is often unsatisfactory. Zhang, Y et al. presented a Bayesian partition model (called Beam) for detecting SNP epistatic interactions, and they employed Markov chain Monte Carlo (MCMC) sampling to compute the posterior probability of SNP markers [13]. Beam models have a very rapid search speed but easily miss epistatic interactions with weak marginal effects on disease status. Wang W introduced a minimum spanning tree structure for exhaustively detecting two-locus epistasis [19]. The minimum spanning tree-based method is powerful for detecting 2nd-order epistatic interactions with marginal effects but largely inefficient for the detection of high-order SNP epistatic interactions with weak marginal effects. Shanwen Sun et al. analysed statistical modelling and machine learning approaches for identifying SNP epistatic interactions in detail [11].
Due to the powerful exploration capability of a highdimensional search space, the SISA has received much attention in recent years for high-order epistatic interaction detection. Moore JH et al. employed a genetic algorithm (GA) to discover complex genetic models [20,21] and presented a grid-based stochastic search algorithm (named Crush-MDR) [22], which adopts genetic model-free multifactor dimensionality reduction (MDR) to calculate the associations between SNP combinations and disease status in order to accelerate the detection of high-order epistatic interactions. Crush-MDR reduces the time complexity of the search process, but the objective function MDR is computationally expensive. Wang et al. proposed a twostage ant colony optimization (ACO) algorithm (named AntEpiSeeker) to detect epistatic interactions [23], which employs the chi-square (χ 2 ) test to evaluate the scores of SNP combinations. In the 1st stage of AntEpiSeeker, ACO is used to select suspected SNP sets with high χ 2 scores, and the 2nd stage of AntEpiSeeker conducts an exhaustive search with the suspected SNPs. Shang and Sun et al. conducted an in-depth study on gene-gene interactions via ACO [24][25][26], and their research concentrated on the identification of epistatic models and the improvement of ACO. Differential evolution-based methods were adopted by Yang et al. [27,28] to detect epistatic interactions, and improved MDR was used to measure associations. Aflakparast, M. et al. introduced a cuckoo search epistasis (CSE) detection algorithm to identify high-order SNP epistatic interactions in which each single SNP has a small effect on disease status. The CSE detection algorithm first divides all SNP loci into M groups based on their relevance, and kth-order SNP combinations are then chosen from the M groups [29].
Tuo et al. proposed three harmony search (HS)-based epistatic detection algorithms (FHSA-SED [30], NHSA-DHSC [31], and MP-HS-DHSI [32]) because of the performance advantages of HS, such as its powerful exploration ability and fast speed. HS is a very simple optimization algorithm that has shown outstanding performance in solving both combinational optimization problems and real number optimization problems. FHSA-SED aims to discover SNP-pair (2nd-order) interactions using HS and two scoring functions (the Gini index and Bayesian networkbased K2 score) to evaluate the associations between SNP pairs and disease status. NHSA-DHSC presents a niche HS for detecting high-order epistatic interactions, in which a niche strategy is used to record local optimal solutions and avoid repeated searches in local regions. The MP-HS-DHSI algorithm employs multipopulational and multiple scoring functions to improve the exploration power of HS and overcome the preference for disease models. In terms of search speed and detection power, it outperforms FHSA-SED and NHSA-DHSC in detecting high-order SNP epistatic interactions, but for an unknown detection task, it also requires the detection of 2nd-order, 3rd-order, …, Kth-order epistatic interactions one by one.
Although the SISA has made some progress in accelerating the detection of high-order epistatic interactions, it still faces two challenges: Search. Finding kth-order (a combination of k SNP loci) epistatic interactions among over hundreds of thousands of SNPs in the whole genome in a limited time is very difficult due to the large number of SNP combinations, which is a complex combination explosion problem. For example, the number of 3rd-order SNP combinations for 1,000,000 SNPs is larger than 1.6667×10 17 . In particular, if the SNPs in highorder epistatic interactions have very weak or no marginal effect on disease status/complex traits, the SISA is inefficient or nearly powerless because there are no valid clues to guide the population to locate the causal SNP epistatic interaction among the extreme number of SNP combinations.
Discrimination. The discriminating function (objective function) adopted to calculate the associations between SNP combinations and phenotypes is crucial for the SISA. Faced with such a large number of SNP combinations, discrimination functions with light computational requirements should be considered first for SISAs. Bayesian network-based methods [33][34][35], Shannon entropy-based methods (i.e., mutual information and conditional entropy) [36] and statistical test methods (i.e., chi-square tests [37]) are lightweight methods that have been widely used to evaluate associations, but none of them are considered effective for all types of epistatic interaction models. Machine learning approaches, such as MDR [38][39][40], random forest and neural networks, are statisticalfree methods with strong applicability for evaluating various disease models, but the high computational burden limits the usefulness of these methods as objective functions of SISAs.
To address the above challenges, this study aims to improve performance in detecting high-order SNP epistatic interactions in the following two aspects: (1) A multitasking HS algorithm with three stages is developed to improve detection speed and power. (2) Four complementary association evaluation functions are employed to improve the discrimination ability of various disease models.
To the best of our knowledge, the existing SISAs for detecting SNP epistatic interactions, such as CSE [29], MACOED [37], epiACO [25], and NHSA-DHSC [31], can perform only one task (detecting a single kth-order epistatic interaction) in each run and, therefore, must be run n times to perform n tasks (detecting k 1 -order, k 2 -order, …, k n -order epistatic interactions). To collaboratively perform multiple detection tasks simultaneously, a multitasking HS algorithm (named MTHSA-DHEI) is developed for detecting highorder epistatic interactions in this study. The contributions of our work can be summarized as follows: (1) A new multitask-based HS algorithm is proposed for detecting k 1 -order, k 2 -order, …, k n -order SNP epistatic interactions simultaneously. The proposed algorithm is divided into three stages: searching, screening and verifying. The search stage aims to reduce the computational burden. The purpose of screening and verifying is to improve detection result accuracy.
(2) Unified coding is adopted to represent k 1 -order, k 2order, …, k n -order combinations. For all tasks, the solutions (SNP combinations) are encoded with the same length, which is equal to the number of SNPs in the highest-order SNP epistatic interaction, and this encoding scheme facilitates knowledge transfer between tasks. Knowledge transfer between tasks can significantly accelerate the detection of high-order SNP epistatic interactions from high-dimensional genome datasets. (3) To improve the capability of identifying various models and discriminating kth-order SNP epistatic interactions from non-functional kth-order SNP combinations, four complementary association evaluation functions (Bayesian network, mutual entropy (ME), LR, and normalized distance with joint entropy (ND-JE)) are integrated as objective functions of the multitasking HS. (4) T harmony memories (HM 1 , HM 2 , …, HM T ) are employed to memorize the potential SNP epistatic combinations for T tasks, and four elite harmony sets (EHS 1 , EHS 2 , EHS 3 , and EHS 4 ) are used for each task to record the elite solutions of four evaluation functions, with the aims of reducing the preference of a single evaluation function for a particular disease model and enhancing the global search ability.
The rest of this paper is arranged as follows. "Preliminary and related work" presents the related work and preliminaries. The proposed method is introduced in detail in "Proposed algorithm". The experiments performed on simulation datasets and real datasets are given in "Simulation experiments". The subsequent sections are the conclusion and discussion.

Preliminary and related work
Let X {x 1 , x 2 , · · · , x N } indicate N SNP markers for n individuals and Y {y 1 , y 2 , · · · , y J } denote disease status (J is the number of disease statuses). The homozygous major allele, heterozygous allele and homozygous minor allele in the sample dataset are defined as 0, 1 and 2, respectively. For a kth-order SNP combination, there are I 3 k genotype combinations. n i is the number of samples in the dataset with SNP loci having the value of the ith genotype combination, and n i j represents the number of samples with the ith genotype combination that are associated with disease state y j .
Definition (high-order SNP epistatic interaction). Let X k {x s 1 , x s 2 , · · · , x s k } (1 < k < N , x s i ∈ X ) be a kth-order SNP combination. f (X k , Y ) is a function for scoring the association between X k and disease state Y . A kth-order SNP combination X k is jointly associated with Y if and only if where is defined for comparing the strength of association with the disease. X k is said to be strongly associated with Y if f (X k , Y ) > θ (θ is the threshold value for determining the association with disease status). A kth-order SNP epistatic interaction occurs if and only if a kth-order SNP combination X k is truly a disease-causing SNP combination associated with Y .

Multitasking optimization model
Multitasking optimization aims to solve K optimization problems simultaneously [41], and its goal is to concurrently optimize all K tasks. Let the K tasks be maximization problems. The optimization model can be expressed as follows: where the objective function is defined as f i : S i → R and X * i ∈ S i is the optimal solution of objective function f i in the feasible space of S i .
Evolutionary multitasking optimization (EMO) has received much attention in recent years in relation to implicit parallel population-based optimization algorithms to search multiple decision spaces of multiple optimization problems [42]. The evolutionary multitasking algorithm can significantly accelerate convergence for multiple complex optimizations by transferring learning between tasks. It has been applied in the fields of engineering and science computing. Li et al. employed a multifidelity evolutionary multitasking method to extract hyperspectral endmembers [43]. Feng et al. proposed evolutionary multitasking to solve the capacitated vehicle routing problem [44] consisting of a weighted learning process for capturing transfer mapping. Eneko Osaba et al. presented a novel adaptive metaheuristic algorithm to address evolutionary multitasking environments called the adaptive transfer-guided multifactorial cellular genetic algorithm (AT-MFCGA) [45]. Nguyen Thi Tam introduced evolutionary multitasking optimization to address the issues of relay node assignment for wireless single-hop sensor and multihop sensor networks in three-dimensional terrains [46]. To solve scheduling problems with batch distribution, Xu et al. presented multitasking optimization [47]. Gao et al. designed a transfer strategy based on the multidirectional prediction method to improve the performance of the multiobjective multitasking optimization approach [48]. Zhao et al. proposed a polynomial regression surface modelling approach based on multitasking optimization for rational basis function selection [49]. EMO can efficiently address multiple different optimization problems simultaneously, enhance the global search ability and improve the performance of each task via knowledge transfer between tasks [48].
In the detection of high-order SNP epistatic interactions, there may be an implicit relationship between kth-order SNP epistatic interactions and (kth + i)-order SNP epistatic interactions for the same disease.
For example, in a 5th-order SNP epistatic interaction (x 1 , x 2 , x 3 , x 4 , x 5 ), the five single-SNP loci x i (i 1, 2, ..., 5) and all 2nd-order SNP combinations x i , x j (i j) may have no explicit associations with disease status, while some 3rd-order SNP combinations, such as (x 1 , x 2 , x 4 ) and (x 2 , x 3 , x 5 ), show associations with disease status, which can guide the search algorithm to identify the 5th-order SNP epistatic interaction by transferring learning between the task of detecting 3rd-order epistatic interactions and that of detecting 5th-order epistatic interactions. In the task of detecting 3rd-order SNP epistatic interactions, some 3rd-order SNP combinations with very weak associations with the disease but no disease-causing SNP interactions may be part of the 5th-order epistatic interaction. Conversely, some 5th-order SNP combinations may contain functional loci for 3rd-order SNP interactions. Therefore, a multitasking optimization model is well suited for accelerating the detection of high-order SNP epistatic interactions through knowledge transfer between multiple tasks.

Multitasking optimization model for detecting high-order SNP epistatic interactions
The multitasking optimization model for detecting k 1 -order, k 2 -order, · · · , k m -order SNP epistatic interactions can be expressed as Eq. (1).
where X * k i (i 1, 2,…,m) indicates a k i -order SNP epistatic interaction and f (X , Y , k) denotes the objective function for evaluating the association between kth-order SNP combination X k and disease status Y .

Discrimination functions for evaluating the associations between SNP combinations and disease status
Due to the small sample size and diversity of disease models, it is very difficult to discriminate kth-order SNP epistatic interactions from all kth-order SNP combinations on a genome-wide scale. Conventional evaluation methods (such as mutual information and Bayesian networks) cannot identify all disease models well. Almost all evaluation methods can correctly discriminate only a portion of disease models.
In this study, four discrimination functions with low computational costs are employed to enhance the discrimination ability.
Bayesian-network-based K2 score. The Bayesiannetwork-based K2 score is a statistical method for describing relationships using a directed acyclic graph (DAG) G (V , E) [50]. It is a lightweight computing method and has high discrimination precision for evaluating the association between a kth-order SNP combination and disease status; it can be expressed as Eq. (2): The larger the K2 -Score log value is, the greater the association between a SNP combination and disease status. ME score. The ME score aims to calculate the contribution of a kth-order SNP combination X to disease status Y , defined as in Eq. (3) [51], where H (x)(see Eq. (4)) denotes the Shannon entropy of x and H (x 1 , x 2 , ..., x k ) represents the joint entropy of multiple variables (x 1 , LR score. The LR score is employed as a related measure to identify the likelihood difference between a kth-order SNP epistatic interaction and a kth-order SNP combination that is not involved in the disease process [52,53] as shown in Eq. (6): where o i j and e i j represent the observed number and expected number of phenotypes, respectively, when a phenotype takes the ith disease state and a SNP combination takes the jth genotype. The expected number e i j can be obtained based on the Hardy-Weinberg principle [45,58].
ND-JE score. The ND-JE score is defined as the normalized distance with joint entropy [32], which aims to uncover clues for detecting high-order epistatic models with very weak or no marginal effects, defined as in Eqs. (7)- (9): where X (x 1 , x 2 , · · · , x k ) is a kth-order SNP combination for all samples (including case and control samples); X control indicates a kth-order SNP combination for only control samples; n j, control i and n j, case i denote the numbers of control samples and case samples in the dataset, respectively, with the jth SNP locus taking the value of i (homozygous major allele 0, heterozygous allele 1 and homozygous minor allele 2); n control i and n case i represent the numbers of control samples and case samples in the dataset, respectively, with SNP combination X taking the value of the ith genotype combination; and n control is the number of control samples.
The smaller the value of N D(X ) is, the larger the distribution difference (distance) between the case and control samples. The JE of the control samples is employed to normalize the distance. The main goal of ND-JE is to uncover a clue to guide the HS algorithm to detect potential diseasecausing SNP combinations.

Harmony search algorithm
The HS algorithm mimics the process of new music improvisation by jazz musicians, who address unknown complex problems by exchanging information and learning between individuals in a group [54][55][56]. Musicians improvise their instruments' pitches to search for a perfect state of harmony. The HS algorithm is characterized by its simplicity, easy implementation, and powerful global search capabilities and has been widely applied in combination optimization problems on a large scale. (The standard HS algorithm is introduced in detail in Supplementary file 1.) In HS, a candidate solution X (x 1 , x 2 , . . . , x K ) is referred to as a harmony. A set of candidate solutions is referred to as a harmony memory (HM), which is similar to the memory of a tabu search (TS) algorithm and the population of a GA. The number of harmonies in an HM is called the harmony memory size (HMS). An HM is a matrix of order HMS × N or an augmented matrix of order HMS × (N + 1) [50,57] as in Eq. (10): where X i (i 1, 2, …, HMS) is the i-th harmony in HM and f (x i ) denotes the value of the objective function. The worst harmony X id_ worst in HM is iteratively updated by new harmony X new , which is improvised through the following three operators: (1) HM consideration performs a combination operation of HM with the probability harmony memory considering rate (HMCR). (2) Pitching adjusts with probability PAR, which performs a local adjustment operation on X new . (3) Random consideration is performed with probability 1-HMCR, which introduces stochastic disturbance in a feasible search space to explore unknown space.
HS has been widely used to solve complex engineering and science optimization problems.

Proposed algorithm
Framework Figure 1 shows the framework of our proposed MTHSA-DHEI algorithm.
MTHSA-DHEI is divided into three stages: the search stage, screening stage and verification stage. In the search stage, a multitasking HS is adopted to find potential SNP combinations that have a strong association with disease status. The G-test [30,32,59] statistical method is employed to test the significance level of the difference between control samples and case samples in the screening stage, and the SNP combinations with significance level p values larger than the threshold value θ 1 are discarded. In the verification stage, MDR [38] is further used to verify the classification ability.
The goal of this study is to develop a fast and effective search algorithm; therefore, the focus is on the design of the search stage.
In the search stage of MTHSA-DHEI, four scoring functions (Bayesian network-based K2 score, LR-based score, ME-based score and ND-JE-based score) are employed as Fig. 1 The general flowchart of the proposed algorithm MTHSA-DHEI. In the search stage, the multitasking harmony search aims to discover some potential SNP combinations that have a strong association with disease status. In the screening stage, the G-test is employed to eliminate some SNP combinations without a significance level. MDR is used to verify the classification ability of candidate solutions in the 3rd stage objective functions to improve the ability to discriminate SNP interactions with nonfunctional SNP combinations. T tasks are employed to detect 2nd-order, 3rd-order, …, Tth-order, (T th + 1)-order SNP epistatic interactions simultaneously. As shown in Fig. 2, four tasks are concurrently employed to detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP interactions, in which the t-th task has an HM and four elite harmony sets (EHS 1 t , EHS 2 t , EHS 3 t and EHS 4 t ). Each harmony in the HM has four association scores (K2 score, LR score, ME score and ND-JE score). Each harmony in the elite harmony set has only one association score. The K2 score, LR score, ME score and ND-JE score are separately adopted by EHS 1 t , EHS 2 t , EHS 3 t and EHS 4 t . Unified coding is applied to the harmonies and elite harmony sets of all tasks, which is intended to allow the harmonies among the K tasks to transfer knowledge from each other and further improve detection speed.
In MTHSA-DHEI, all tasks employ the same code length, but only the previous t + 1 values are considered to be the solution for the t-th task. For example, task 1 aims to detect 2nd-order SNP interactions. Only the association of the 2ndorder SNP combination x i 1 , x i 2 in X i (i 1, 2, ..., H M S) is calculated between x i 1 , x i 2 and disease status Y , and other SNPs are ignored, as follows: . .
In Fig. 1, the general flow of the search stage is presented, in which Algorithm 1 and Algorithm 2 introduce the improvisation of new harmonies based on knowledge transfer from other tasks and the current task, respectively. Algorithm 3 presents the process of updating the harmony memory and elite harmony sets (EHS 1 t , EHS 2 t , EHS 3 t and EHS 4 t ) of the t-task.

Improvising new harmonies with knowledge transfer
In the proposed MTHSA-DHEI approach, improvising a new harmony for the current t-th task has two steps: (1) knowledge transfer from other tasks and (2) generation of a new solution in the current task by using HM t , EHS 1 t , EHS 2 t , EHS 3 t and EHS 4 t . Figure S1 shows an example in which task 2 transfers knowledge to task 1 (see Supplementary file 1).
In the 1st method, the new harmony is improvised using three classical operators of HS, but the knowledge is from another task t r ∈ {1, 2, · · · , T }, t r t, which aims to obtain the information from another task t r . If task t r has a higher order than the current task, it may carry one or more functional SNPs that were missing in the current task. Conversely, if task t r has a lower order than the current task, some clues (SNPs with marginal effects) that can help the current task accelerate the detection of epistatic SNP interactions may be found.
Algorithm 1 describes the steps of improvising a new har- for the t-th task by transferring learning from the t r -th task (t t r ), in which the HM consideration is from the four EHSs of the t r -th task; there are three strategies with equal probability of pitch adjustment or the value of x new j will be randomly regenerated from the search space. X new will be randomly regenerated if it exists in the HM or in EHS t r (r 1, 2, 3, 4).
In the 2nd method, a new harmony is improvised through the components (HM t , EHS 1 t , EHS 2 t , EHS 3 t , and EHS 4 t ) of the current t-th task. HM t is for harmony memory consideration with probability HMCR. The best harmonies of four elite harmony sets (EHS 1 t , EHS 2 t , EHS 3 t , and EHS 4 t ) are the focus of consideration when employing the pitch adjustment operator.
In Algorithm 1 and Algorithm 2, the scale factor F is important for the performance of the proposed MTHSA-DHEI method. It is analysed in the simulation experiment described in "Simulation experiments".

Update the harmony memory and elite harmony sets
For each new harmony generated for the t-th task, HM t , EHS 1 t , EHS 2 t , EHS 3 t and EHS 4 t are considered to be updated. The update rate of HM t is associated with FEs. Algorithm 3 describes the update operator in detail.
In Algorithm 3, the i-th fitness value sc new, i of X new is divided by max_ score i to normalize the fitness value to the interval [1]. The value of max_ score i is the maximum value of the i-th scoring function in the initial population, and its value is not changed during iterations. The condition C > 2 || sc new, 4 > sc i, 4 HM t &&rand(0, 1) < (1 − FEs/maxFEs) is critical. C > 2 means that the i-th harmony HM t (i, :) of HM t is replaced by X new only when at least two scores of X new are higher than the corresponding scores of HM t (i, :).sc new, 4 > sc i, 4 HM t && rand(0, 1) < (1 − FEs/maxFEs) indicates that HM t (i, :) can be replaced by X new with probability (1 − FEs/maxFEs) if sc new, 4 > sc i, 4 HM t , which means that the ND-JE score is different from the other three scores. In this work, the goal of the ND-JE score is to discover some clues to guide the algorithm to locate SNP interactions with no marginal effect on disease status.
In the proposed MTHSA-DHEI algorithm, four complementary discriminating functions (evaluation functions) are adopted to calculate the associations between high-order SNP combinations and disease status, with the aim of improving the ability to identify various diseases, and have the following benefits: other. The K2 score has high power for detecting SNP interactions and is superior in discriminating certain disease models with weak marginal effects. However, it has low accuracy for interaction models with low minor allele frequencies (MAFs) and low genetic heritability (H 2 ). The ME score aims to calculate the contribution of a kth-order SNP combination to disease status. The LR score aims to discover the likelihood difference between a functional SNP combination and a nonfunctional SNP combination via statistical theory, and it has good adaptability to unknown disease models. The ND_JE score aims to guide the HS to uncover clues for detecting highorder epistatic interactions.

Simulation experiments
To investigate the performance of the proposed MTHSA-DHEI method, four 4th-order and eight 5th-order epistatic interaction models with marginal effects (EIMEs), eight high-order epistatic interaction models with no marginal effects (EINMEs) (including two 3rd-order models, three 4th-order models and three 5th-order models) and one real disease dataset (age-related macular degeneration, AMD) were tested. The experimental results were compared with the results of four high-order epistatic interaction detection algorithms: CSE, epiACO, NHSA-DHSC and MP-HS-DHSI. All experiments were performed on a Windows 10 64-bit system with an Intel(R) Core (TM) i7-8700 CPU @3.2 GHz, 16 GB memory, and all the program codes were written and run in MATLAB R2018a.

Evaluation criteria for performance
(1) Power is a measure of the capability of identifying a kth-order SNP epistatic interaction from genomic data and is expressed as where #S is the number of true kth-order epistatic interactions found from the simulation datasets, in which a total of #T true kth-order epistatic interactions are available. Note that in this work, power is used mainly to evaluate the search ability of the proposed method. If the diseasecausing SNP combination (epistatic interaction) has been found within the specified iterations (maximum number of objective functions evaluated, maxFEs), the search is considered successful.
(2) FEs represent the number of evaluations of the associations between kth-order SNP combinations and disease status until the kth-order SNP epistatic interaction is found or the terminal condition of the algorithm is met. In simulation experiments, the search is stopped immediately when the kth-order SNP epistatic interaction is found, and FEs is the number of SNP combinations that have been evaluated for their association with disease status. In this work, the aim of the FEs is to measure the capability of the algorithm to reduce the computational burden.
(3) Runtime denotes the mean runtime that an algorithm takes to detect the kth-order SNP epistatic interaction before the algorithm is terminated, and it is intended to measure the time cost of detecting high-order SNP epistatic interactions.
To further investigate the reliability of the results obtained in the search stage, the G-test method is adopted to screen out the SNP combinations that differ significantly between case and control samples Pvalue < max 10 −8 , 0.05/C k N , and MDR is then used to verify the classification accuracy of the SNP combinations selected by the G-test [59]. The false discovery rate (FDR) and F1-score are employed to further evaluate the reliability of the results. (4) The FDR is defined as: where F P and T P represent the false-positive rate and truepositive rate, respectively.
(5) The F1-score can be expressed as follows: recall T P T P + F N , precision T P T P + F P , F1 − score 2 × recall × precision recall + precision .

Datasets
(1) EINME datasets. Eight EINMEs were employed to test the capability of detecting high-order epistatic interactions with no marginal effect. For each EINME, 1500 control samples and 1500 case samples for the functional SNPs were generated using a multiobjective optimization algorithm that aims to maximize the joint effects of k-SNPs, minimize the marginal effects of individual SNPs and limit Hardy-Weinberg equilibrium (HWE) constraints [60]. The samples of nonfunctional SNPs were randomly generated according to HWE. To investigate the performance of the proposed algorithm, simulation datasets with 100 SNPs, 1 k SNPs and 10 k SNPs were generated for each EINME. The eight EINMEs are described in Table S1 of Supplementary file 1.
(2) EIME datasets. Four 5th-order additive EIMEs, four 5th-order threshold EIMEs and four 4th-order multiplicative EIMEs [61] were employed to test the performance of detecting epistatic interactions with marginal effects. For each model, 100 datasets with 2000 control samples and 2000 case samples that had 100 SNPs, 1000 SNPs and 10 k SNPs were separately generated using GAMETES software [62]. The parameters of the 12 EIMEs are listed in Table S2 of Supplementary file 1.
(3) AMD dataset. The AMD dataset contains 103,611 SNPs genotyped for 50 controls and 96 cases [63]. This experiment aims to detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP epistatic interactions from the 103,611 SNPs for AMD. We conducted two experiments for AMD: A. All 103,611 SNPs were detected to identify epistatic interactions. B. Three widely reported SNPs (rs380390, rs1329428, and rs1363688) were first removed, and the rest of the SNPs were detected to identify epistatic interactions in which each SNP had a small effect on disease status.

Parameter analysis and settings
(1) Effect of parameters on the performance of MTHSA-DHEI In this section, the effect of two important parameters (TP and F) on the performance of MTHSA-DHEI are investigated.
As shown in Fig. 3, for the EINMEs, when TP > 0.5, the power begins to drop gradually, and the FEs and runtime increase with an increasing TP value, except for EINME-4. However, as shown in Fig. 4, for the EIMEs, the FEs and Fig. 3 Effect of TP on the performance of MTHSA-DHEI for detecting EINME interactions runtime decrease with an increasing TP value, but the power remains constant. This result demonstrates that a larger TP value will decrease the performance of MTHSA-DHEI for detecting EINME interactions but enhance the performance for the detection of EIME interactions. With a compromise, we believe that TP 0.5 is a better choice when we have unknown disease models.
Next, we investigate the effect of parameter F on the performance of the proposed method. Figures 5, 6 and 7 show the power, FEs and runtime of MTHSA-DHEI for values of parameter F from 2 to 20 (step 2). MTHSA-DHEI has the same power values for the three EIMEs for all F values, and it has high power for EINMEs when F equals 10. As shown in Fig. 6 and Fig. 7, MTHSA-DHEI with a small F value (F < 12) requires more FEs and runtime to find epistatic interactions than MTHSA-DHEI with a large F value for EINMEs, but for the three EIMEs, the opposite result occurs. Therefore, we recommend that F be set to 10. In addition, a θ 2 value set to 0.6 has the highest accuracy for all EINMEs and EIMEs. When θ 2 < 0.55, the falsepositive rate starts to increase, and when θ 2 > 0.65, the false negative rate starts to increase. For PAR, the algorithm has a greater search speed and improved detection power when its value is in the interval [0.4, 0.7].
(2) Parameter settings The parameters of the algorithms are described in Table 1.

Experimental results and analysis
(1) EINME Figure 8 shows the power, FEs and runtime used to detect eight high-order EINMEs using five intelligent search algorithms, and the results show that the power of MTHSA-DHEI exceeds that of the other four algorithms in all EINMEs except for MP-HS-DHSI, which has the same power value as MTHSA-DHEI on EINME-3, EINME-4 and EINME-6. Both MTHSA-DHEI and MP-HS-DHSI have much higher power than the other algorithms (Fig. 8a). As shown in Fig. 8b, MTHSA-DHEI took the fewest FEs among all five algorithms to find the kth-order SNP epistatic interactions except for EINME-3, EINME-4 and EINME-6. Except for EINME-5, EINME-7 and EINME-8, MTHSA-DHEI has a more than 99% success rate in detecting kth-order (k 3,4,5) epistatic interactions from 100 SNPs with no more than 10,000 FEs, which is much lower than the number of FEs ( 3 100 =161,700, 4 100 =3,921,225, 5 100 =75,287,520) obtained by an exhaustive search.
As shown in Fig. 8c, MP-HS-DHSI took the least time among the five algorithms on EINME-1, EINME-3, EINME-4 and EINME-6 to find the high-order SNP epistatic interactions. The runtime taken by the proposed MTHSA-DHEI method is slightly more than that of MP-HS-DHSI, but it is less than the runtime required by the other three algorithms. Importantly, in the simulation experiments, MP-HS-DHSI, NHSA-DHSC, epiACO and CSE were performed only on a kth-order epistatic interaction task (where k is the number of 50  functional SNPs in an epistatic interaction); however, the proposed MTHSA-DHEI method aims to simultaneously detect 2nd-order, …, kth-order epistatic interactions, which consumes much of the computational cost of MTHSA-DHEI to detect potential 2nd-order, …, (k − 1)-order epistatic interactions. Overall, MTHSA-DHEI has evident advantages over the other four approaches in eight EINMEs with 100 SNPs. To further investigate performance as the number of SNPs increases, we conducted the proposed method on EINME datasets with 1 k and 10 k SNPs. The results are summarized in Table S3 (see Supplementary file 1). Figure 9 displays the change curves of the power, FEs, runtime and F1-score of the MTHSA-DHEI with an increasing number of SNPs, from which we can see that the power and F1-scores decrease rapidly and the FEs and runtime increase significantly when conducting EINME-5, EINME-7 and EINME-8; however, for the other five models, the changes in these metrics are not very significant. We found that in EINME-5, EINME-7 and EINME-8, the marginal effect of each functional SNP was very small, especially for EINME-8, and the joint effects could be seen only when three or more of the five functional SNPs were combined, making it very difficult to search for epistatic interactions among over thousands of SNPs. Compared with the exponential growth in the number of SNP combinations, the increases in FEs and runtime and the decrease in power are very small and acceptable.
(2) EIME MTHSA-DHEI HMCR 0.98, PAR 0.5, TP 0.5, T K − 1, F 10 (1) For EINME and EIME simulation datasets, the HMS of each task is set to max(50, K × min(N/10,200)) (K is the highest-order SNP interactions that will be detected). MaxFEs min(50 × K × N, 2 × K × 50,000),  Table 2 summarizes the results (power, FEs, and runtime) of the five approaches in twelve high-order EIMEs, from which it can be clearly seen that the proposed MTHSA-DHEI method outperforms or is equivalent to the other four methods in terms of power. MTHSA-DHEI took fewer FEs than CSE, NHSA-DHSC and epiACO for almost all 12 models, and it had 100% power to detect epistatic interactions with very few FEs. Compared with MP-HS-DHSI, MTHSA-DHEI took fewer FEs on four multiplicative datasets (EIME-9-EIME-12) with 1000 SNP loci. MTHSA-DHEI took less runtime than CSE, NHSA-DHSC and epiACO, but it took more time than MP-HS-DHSI. However, the runtimes and FEs of MTHSA-DHEI are composed of the runtimes required to perform multiple tasks (detection of 2nd-order, 3rd-order, …, and kth-order SNP epistatic interactions), while the runtimes of the other four methods correspond to the time of performing only a single task (detection of kth-order SNP epistatic interactions). In summary, compared with the four excellent SISAs, MTHSA-DHEI has significant advantages in power, especially for more complex disease models (i.e., multiplicative models). Experiments on the 12 EIME datasets with 100, 1 k and 10 k SNPs are conducted, and the results are summarized in Table S4 (see Supplementary file 1), which demonstrates that the proposed algorithm can maintain high power (1st and 2nd power) for all datasets with different SNPs. Moreover, its runtime and FEs increases are not very significant, where the 1st power denotes the ability to find functional epistatic interactions by the multitasking HS algorithm and the 2nd power is the number of epistatic interactions that pass the threshold value θ 1 . However, for EIME-5, EIME-8, EIME-9 and EIME-10, the 3rd power is equal to zero because the ability to classify functional SNP epistatic interactions cannot pass the threshold value θ 2 ( 60%). When MDR was used to evaluate the four disease models, their average classification accuracy was equal to 56.8%, 58.6%, 56.3% and 58.4%. Comparing the EIMEs with EINMEs, each functional SNP in the EIMEs has an obvious marginal effect on disease status, which allows the HS algorithm to quickly locate the functional SNPs.
To detect epistatic interactions in which each SNP has a small effect on disease status, we removed three important SNPs (rs380390, rs1329428 and rs1363688) that have been widely reported to be associated with AMD, and MTHSA-DHEI was applied to the remaining SNPs. Twentyfour 2nd-order SNP combinations (CA > 75%, p value < 1 ×10 −7 ), 33 3rd-order SNP combinations (CA > 80%, p value < 1 ×10 −10 ), 56 4th-order SNP combinations (CA > 85%, p value < 1 ×10 −10 ) and 89 5th-order SNP combinations (CA > 88%, p value < 1 ×10 −12 ) were found to be strongly associated with AMD. Figure 10b Figure 10a shows the 2nd-order SNP combinations (p value < 1 × 10 −7 ) with a classification accuracy greater than 75%, with SNPs rs380390, rs1329428 and rs10272438 shown to interact with many other SNPs. Both rs380390 and rs1329428 are in the CFH gene, which has been widely reported to be associated with AMD [3,24,26,34,63,64]. rs10272438 is an intron variant of the BBS9 gene that is associated with Bardet Biedl syndrome [65] and was reported in our previous study [31]. In Fig. 10b, rs10272438 is the only central node that interacts with 21 SNPs. Figure S2(a) displays the interaction network of 3rd-order SNP combinations with a classification accuracy greater than 80%, of which the degrees of five central nodes, namely, rs380390, rs1363688, rs1329428, rs618499 and rs555174, are equal to 193, 124, 3, 47 and 13, respectively. In Figure  S2(a), the SNPs rs380390 and rs1329428 are indicated to have important roles in 3rd-order SNP combinations, and rs1363688 (at position 174,609,731 of chromosome 15, not in a gene-coding region) [14,31,32] and rs618499 (in gene ATM) were reported to be associated with osteosarcoma [66] and AMD [14]. rs555174 (not in a gene-coding region) has Fig. 9 Change curves of the power, FEs, runtime and F1-score of MTHSA-DHEI with an increasing number of SNPs for EINMEs also been reported to be associated with AMD [31,34]. In Figure S2(b), rs10272438 and rs2022251 are two important nodes, where rs2022251 (on chromosome 17, the difference p value 0.1 between the case and control samples) has never been reported to be associated with disease. Figure S3(a) shows the potential 4th-order SNP combinations with a classification accuracy greater than 85% and significance level less than 1 × 10 −10 , from which it can be seen that the SNPs rs1329428, rs3922799, rs2207553, rs4585932, rs10494614, and rs967358 are central nodes, among which rs1329428 is the only SNP that has been reported. In Figure S3(b), rs10272438, rs10482918 and rs6104678 are three central nodes, where rs10482918 is in the NCAM2 gene on chromosome 21, and rs6104678, which is on chromosome 20, has been reported previously [30,31,68]. Figure S4(a) shows the 5th-order SNP interaction network, in which SNPs rs1363688, rs1329428 and rs207389 are central nodes. Figure S4(b) shows that rs10482918, rs10272438, rs1982756 and rs6104678 are central nodes. There are two 5th-order SNP combinations, namely, (rs380390, rs7322610, rs2556560, rs4689888, and rs10496217) and (rs2050733, rs207389, rs1178123, rs1329428, and rs1363688), that have a very high classification accuracy (97.9% and 95.8%, respectively) measured with MDR in the first combination, except for SNP rs380390, which shows a significant difference (p value 6.19921E-07) between the case and control samples. The other four To complete detection with 5 × 10 7 FEs, the proposed MTHSA-DHEI method took no more than 20 h to simultaneously detect 2nd-order, 3rd-order, 4th-order and 5thorder SNP epistatic interactions from the AMD dataset. The time required for MP-HS-DHSI to perform the detection of 2nd-order, …, 5th-order SNP epistatic interactions one by one was more than 48 h; NHSA-DHSC, CSE and epiACO required more than one days. According to the AMD detection results, MTHSA-DHEI found almost all the SNPs that have been reported to be associated with AMD, such as rs380390, rs1329428, rs1363688, rs10272438, and rs555174, and found some SNPs that have been reported to be associated with other complex diseases. Some previously unreported SNPs were also found by MTHSA-DHEI and are worthy of further study by biologists.

Conclusion
According to the experimental results, the proposed MTHSA-DHEI method is significantly superior to the other four algorithms in terms of power, FEs and runtime for the EINMEs. The EIMEs also outperform others with respect to power and FEs, but they take more time than MP-HS-DHEI for most EIMEs. In the AMD experiment, MTHSA-DHEI also showed a powerful ability to detect high-order epistatic interactions from hundreds of thousands of data points, and it found almost all the SNPs that have been reported to be associated with AMD. Although the results of simulation experiments indicate that our method outperforms the four compared SISAs and shows very effective performance for detecting high-order SNP epistatic models, such as EINME-1, EINME-4, and EINME-6, additive models and threshold models, it still cannot ensure the identification of casual epistatic interactions from a dataset of over 10,000 SNPs in a limited amount of time (30 min), and detection power starts to degrade rapidly, such as for EINME-8. For the four multiplicative models, the heritability and population prevalence values have a very important influence on the detection power of the algorithm. The larger the heritability and population prevalence values of the disease model are, the higher the detection power of the algorithm. However, SNP loci typically have a small effect and only modest heritability [2,69]. To enhance the detection power of the proposed MTHSA-DHEI method on datasets with more than 10,000 SNPs, we can set a large size for HM and EHS and set a large value for MaxFEs to make the algorithm run for a longer time. Large sizes of harmony memory and large values of MaxFEs for our algorithm can improve detection power, but the computational burden will also increase rapidly.

Discussion
Traditional methods for detecting high-order SNP epistatic interactions can perform only a single task and ignore the sharing of information between tasks, which makes the computational burden of detecting SNP epistatic interactions from unknown disease data very high. To address this problem, this study aimed to improve detection power, reduce the computational burden and enhance the ability to discriminate high-order SNP epistatic interactions from a significant number of high-order SNP combinations. We proposed a novel multitasking HS algorithm for detecting high-order SNP epistatic interactions, where multitasking is applied to accelerate detection using concurrent collaborative computation, transfer learning is adopted to enhance the information exchange between tasks, and four complementary evaluation functions are employed to promote the ability to identify various disease models and overcome the preference of a single evaluation function for a specific disease model. In addition, for the epistatic interaction model with no marginal effects, it is very difficult to uncover clues that can guide the search algorithm to locate the functional SNP loci. The proposed MTHSA-DHEI algorithm integrates ND-JE into the evaluation functions to seek clues of the functional SNP locus that has no or a very weak marginal effect on disease status.
MTHSA-DHEI is a metaheuristic search algorithm, and its time complexity is determined by four objective functions (K2-score, ME score, LR score and ND-JE-score) of the 1st stage and the MaxFEs (maximum number of evaluations of associations between SNP combinations and disease status). In the 2nd stage and 3rd stage, only the G-test and MDR are employed to test and verify the number of candidate solutions that were found in the 1st stage, and the associated time complexity is negligible. The time complexity of evaluating objective functions is O(k × S) (where k is the order of SNP combinations and S is the number of samples). Therefore, the time complexity of MTHSA-DHEI is roughly equivalent to O(k × S × MaxFEs), which is much less than the time complexity O(k × S × N k ) of the exhaustive method, where N is the number of SNPs in the dataset. Since N and MaxFEs are much larger than k and S, the time complexity of the traditional exhaustive method is O(N k ), which is much higher than the time complexity O(k × S × MaxFEs) of MTHSA-DHEI.
To the best of our knowledge, this study is the first to detect high-order SNP epistatic interactions by using a multitasking search algorithm. There is still much room for improving the performance of this type of algorithm. In the future, we should try to develop an explicit-encoding-based multitasking search algorithm to improve the search speed and design more effective evaluation functions for identifying various disease models.
In addition, the fuzzy set-based optimization algorithm [70] has received much attention in recent years and has been successfully applied to solid assignment [71,72] and transportation [73] problems, and it can also be considered a focal method for future studies on high-order SNP epistatic interaction detection. In addition, it is also very important to develop an effective scoring function for seeking clues to guide the SISA to locate the positions of potential SNP interactions. The proposed MTHSA-DHEI method can only be applied to detect associations between common SNPs (MAF > 0.05 && MAF < 0.5) and disease status. This needs to be further studied for its application to rare variants.

Availability and implementation
The supplementary files, MATLAB codes and Python code are available at https://github.com/shouhengtuo/ MTHSADHEI. permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecomm ons.org/licenses/by/4.0/.