Abstract
Genome-wide association studies have succeeded in identifying genetic variants associated with complex diseases, but the findings have not been well interpreted biologically. Although it is widely accepted that epistatic interactions of high-order single nucleotide polymorphisms (SNPs) [(1) Single nucleotide polymorphisms (SNPs) are mainly deoxyribonucleic acid (DNA) sequence polymorphisms caused by variants at a single nucleotide at the genome level. They are the most common type of heritable variation in humans.] are important causes of complex diseases, the combinatorial explosion of millions of SNPs and multiple tests impose a large computational burden. Moreover, it is extremely challenging to correctly distinguish high-order SNP epistatic interactions from other high-order SNP combinations due to small sample sizes. In this study, a multitasking harmony search algorithm (MTHSADHEI) is proposed for detecting high-order epistatic interactions [(2) In classical genetics, if genes X1 and X2 are mutated and each mutation by itself produces a unique disease status (phenotype) but the mutations together cause the same disease status as the gene X1 mutation, gene X1 is epistatic and gene X2 is hypostatic, and gene X1 has an epistatic effect (main effect) on disease status. In this work, a high-order epistatic interaction occurs when two or more SNP loci have a joint influence on disease status.], with the goal of simultaneously detecting multiple types of high-order (k_{1}-order, k_{2}-order, …, k_{n}-order) SNP epistatic interactions. Unified coding is adopted for multiple tasks, and four complementary association evaluation functions are employed to improve the capability of discriminating the high-order SNP epistatic interactions.
We compare the proposed MTHSADHEI method with four competitive methods for detecting high-order SNP interactions on 8 high-order epistatic interaction models with no marginal effect (EINMEs) and 12 epistatic interaction models with marginal effects (EIMEs), and we apply the MTHSADHEI algorithm to a real dataset on age-related macular degeneration (AMD). The experimental results indicate that MTHSADHEI has power and an F1-score exceeding 90% for all EIMEs and five EINMEs and reduces the computational time by more than 90%. It can efficiently perform multiple high-order detection tasks for high-order epistatic interactions and improve the discrimination ability for diverse epistasis models.
Introduction
Genome-wide association studies (GWASs) are widely considered one of the most promising technologies for elucidating complex relationships between genotype and phenotype [1], such as the causes of complex diseases, due to the rapid development of high-throughput sequencing technology and dramatic declines in sequencing costs. GWASs are dedicated to detecting genetic variants associated with complex traits/diseases from single nucleotide polymorphisms (SNPs), which are the most common genetic variations in human deoxyribonucleic acid (DNA) sequences [2,3,4,5].
Many important and interesting findings have been made by GWASs using single-SNP-based and SNP-pair-based methods. Single-SNP analysis approaches for GWASs, such as the single-SNP test [5], compare the relative frequencies of genotypes between case and control samples independently of other SNP loci, and some results have been successfully translated to candidate drugs [6, 7]. Nevertheless, most studies fail to effectively explain the causal SNPs of complex diseases. One important reason is that most studies focus on discovering the contribution of single SNPs to complex disease status/traits in isolation, and SNPs with a small effect on the phenotype were neglected in further analysis [8]. In recent years, various multi-SNP methods have been employed for GWASs, such as penalized regression [9,10,11], which can eliminate SNPs associated only with the phenotype due to their linkage disequilibrium (LD) with causal SNPs [11]. An increasing number of studies indicate that epistatic interactions across the whole genome ubiquitously exist in relation to complex diseases [12]. Epistatic interaction generally refers to joint interaction effects among multiple genetic variants in the genome, where the effect of a set of genes or SNPs on a phenotype is unequal to the sum of their independent contributions [11]. Epistatic interactions are now widely regarded to determine individual susceptibility to complex diseases [8, 13].
Detecting high-order epistatic interactions in the human genome has become a very important goal in GWASs, but it is also extremely challenging because there are hundreds of thousands of SNPs in the human genome, creating a very complex “combination explosion problem”. Current computers are not capable of determining whether each kth-order (k > 2) SNP combination has an epistatic interaction effect in a limited time. To address this problem, high-performance computing and heuristic searches have been presented to accelerate the detection of high-order epistatic interactions. High-performance computing usually adopts graphics processing units (GPUs) and parallel processing techniques to speed up computation. Guo et al. employed cloud computing to detect high-order epistatic interactions [14], using forty virtual machines to accelerate the detection of such interactions. Yang et al. [15] developed a GPU-based permutation tool to accelerate the detection of SNP-SNP interactions based on the likelihood ratio (LR) test with the assumption that the statistic follows a χ^{2} distribution. Cecilia et al. [16] presented a tool called MPI3SNP that uses multi-CPU (central processing unit) and multi-GPU clusters to detect 3rd-order epistatic interactions. Alex Upton et al. [7] reviewed high-performance computing and cloud computing used to detect epistasis in detail and dissected different computational approaches to analyse epistatic interactions in disease-related genetic datasets. GPU and parallel processing techniques can speed up detection but are insufficient for high-order (> 3) epistatic interactions because hardware acceleration does not reduce the time complexity of the underlying search algorithm (e.g., an exhaustive search).
To reduce the computational burden, heuristic search techniques, such as the Monte Carlo method [13, 17, 18], the spanning tree method [19], and swarm intelligence search algorithms (SISAs) [20, 21], use current information about the target problem as heuristic information that can improve search efficiency and reduce the number of searches. The Monte Carlo method employs random sampling procedures to explore potential SNP epistatic interactions and can speed up epistasis detection, but its power is often unsatisfactory. Zhang Y et al. presented a Bayesian partition model (called Beam) for detecting SNP epistatic interactions, and they employed Markov chain Monte Carlo (MCMC) sampling to compute the posterior probability of SNP markers [13]. Beam models have a very rapid search speed but easily miss epistatic interactions with weak marginal effects on disease status. Wang W introduced a minimum spanning tree structure for exhaustively detecting two-locus epistasis [19]. The minimum spanning tree-based method is powerful for detecting 2nd-order epistatic interactions with marginal effects but largely inefficient for the detection of high-order SNP epistatic interactions with weak marginal effects. Shanwen Sun et al. analysed statistical modelling and machine learning approaches for identifying SNP epistatic interactions in detail [11].
Due to its powerful exploration capability in a high-dimensional search space, the SISA has received much attention in recent years for high-order epistatic interaction detection. Moore JH et al. employed a genetic algorithm (GA) to discover complex genetic models [20, 21] and presented a grid-based stochastic search algorithm (named CrushMDR) [22], which adopts genetic model-free multifactor dimensionality reduction (MDR) to calculate the associations between SNP combinations and disease status in order to accelerate the detection of high-order epistatic interactions. CrushMDR reduces the time complexity of the search process, but the objective function MDR is computationally expensive. Wang et al. proposed a two-stage ant colony optimization (ACO) algorithm (named AntEpiSeeker) to detect epistatic interactions [23], which employs the chi-square (χ^{2}) test to evaluate the scores of SNP combinations. In the 1st stage of AntEpiSeeker, ACO is used to select suspected SNP sets with high χ^{2} scores, and the 2nd stage of AntEpiSeeker conducts an exhaustive search with the suspected SNPs. Shang and Sun et al. conducted an in-depth study on gene–gene interactions via ACO [24,25,26], and their research concentrated on the identification of epistatic models and the improvement of ACO. Differential evolution-based methods were adopted by Yang et al. [27, 28] to detect epistatic interactions, and improved MDR was used to measure associations. Aflakparast M et al. introduced a cuckoo search epistasis (CSE) detection algorithm to identify high-order SNP epistatic interactions in which each single SNP has a small effect on disease status. The CSE detection algorithm first divides all SNP loci into M groups based on their relevance, and kth-order SNP combinations are then chosen from the M groups [29].
Tuo et al. proposed three harmony search (HS)-based epistatic detection algorithms (FHSASED [30], NHSADHSC [31], and MPHSDHSI [32]) because of the performance advantages of HS, such as its powerful exploration ability and fast speed. HS is a very simple optimization algorithm that has shown outstanding performance in solving both combinatorial optimization problems and real number optimization problems. FHSASED aims to discover SNP-pair (2nd-order) interactions using HS and two scoring functions (the Gini index and Bayesian network-based K2 score) to evaluate the associations between SNP pairs and disease status. NHSADHSC presents a niche HS for detecting high-order epistatic interactions, in which a niche strategy is used to record local optimal solutions and avoid repeated searches in local regions. The MPHSDHSI algorithm employs multiple populations and multiple scoring functions to improve the exploration power of HS and overcome the preference for particular disease models. In terms of search speed and detection power, it outperforms FHSASED and NHSADHSC in detecting high-order SNP epistatic interactions, but for an unknown detection task, it still has to detect 2nd-order, 3rd-order, …, Kth-order epistatic interactions one by one.
Although the SISA has made some progress in accelerating the detection of highorder epistatic interactions, it still faces two challenges:
Search. Finding kth-order (a combination of k SNP loci) epistatic interactions among hundreds of thousands of SNPs in the whole genome in a limited time is very difficult due to the large number of SNP combinations, which is a complex combination explosion problem. For example, the number of 3rd-order SNP combinations for 1,000,000 SNPs is approximately \(1.6667\times {10}^{17}\). In particular, if the SNPs in high-order epistatic interactions have very weak or no marginal effect on disease status/complex traits, the SISA is inefficient or nearly powerless because there are no valid clues to guide the population to locate the causal SNP epistatic interaction among the extreme number of SNP combinations.
Discrimination. The discriminating function (objective function) adopted to calculate the associations between SNP combinations and phenotypes is crucial for the SISA. Faced with such a large number of SNP combinations, discrimination functions with light computational requirements should be considered first for SISAs. Bayesian network-based methods [33,34,35], Shannon entropy-based methods (e.g., mutual information and conditional entropy) [36] and statistical test methods (e.g., chi-square tests [37]) are lightweight methods that have been widely used to evaluate associations, but none of them are considered effective for all types of epistatic interaction models. Machine learning approaches, such as MDR [38,39,40], random forest and neural networks, are statistics-free methods with strong applicability for evaluating various disease models, but the high computational burden limits the usefulness of these methods as objective functions of SISAs.
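The size of the search space described in the Search challenge can be checked directly. The following is a minimal sketch; the function name n_combinations is introduced here for illustration only.

```python
import math

def n_combinations(n_snps: int, k: int) -> int:
    """Number of distinct kth-order SNP combinations among n_snps loci."""
    return math.comb(n_snps, k)

# Growth of the search space with the interaction order k for 1,000,000 SNPs.
for k in (2, 3, 4, 5):
    print(f"k = {k}: {n_combinations(1_000_000, k):.4e} combinations")
```

For k = 3 the exact count is 166,666,166,667,000,000, i.e., about \(1.6667\times {10}^{17}\), and each additional order multiplies the count by roughly five further orders of magnitude.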
To address the above challenges, this study aims to improve performance in detecting highorder SNP epistatic interactions in the following two aspects:

(1) A multitasking HS algorithm with three stages is developed to improve detection speed and power.

(2) Four complementary association evaluation functions are employed to improve the discrimination ability for various disease models.
To the best of our knowledge, the existing SISAs for detecting SNP epistatic interactions, such as CSE [29], MACOED [37], epiACO [25], and NHSADHSC [31], can perform only one task (detecting a single kth-order epistatic interaction) in each run and, therefore, must be run n times to perform n tasks (detecting k_{1}-order, k_{2}-order, …, k_{n}-order epistatic interactions). To collaboratively perform multiple detection tasks simultaneously, a multitasking HS algorithm (named MTHSADHEI) is developed for detecting high-order epistatic interactions in this study. The contributions of our work can be summarized as follows:

(1) A new multitask-based HS algorithm is proposed for detecting k_{1}-order, k_{2}-order, …, k_{n}-order SNP epistatic interactions simultaneously. The proposed algorithm is divided into three stages: searching, screening and verifying. The search stage aims to reduce the computational burden. The purpose of screening and verifying is to improve the accuracy of the detection results.

(2) Unified coding is adopted to represent k_{1}-order, k_{2}-order, …, k_{n}-order combinations. For all tasks, the solutions (SNP combinations) are encoded with the same length, which is equal to the number of SNPs in the highest-order SNP epistatic interaction, and this encoding scheme facilitates knowledge transfer between tasks. Knowledge transfer between tasks can significantly accelerate the detection of high-order SNP epistatic interactions from high-dimensional genome datasets.

(3) To improve the capability of identifying various models and discriminating kth-order SNP epistatic interactions from non-functional kth-order SNP combinations, four complementary association evaluation functions (Bayesian network, mutual entropy (ME), LR, and normalized distance with joint entropy (NDJE)) are integrated as objective functions of the multitasking HS.

(4) T harmony memories (HM_{1}, HM_{2}, …, HM_{T}) are employed to memorize the potential SNP epistatic combinations for T tasks, and four elite harmony sets (EHS_{1}, EHS_{2}, EHS_{3}, and EHS_{4}) are used for each task to record the elite solutions of four evaluation functions, with the aims of reducing the preference of a single evaluation function for a particular disease model and enhancing the global search ability.
The rest of this paper is arranged as follows. “Preliminary and related work” presents the related work and preliminaries. The proposed method is introduced in detail in “Proposed algorithm”. The experiments performed on simulation datasets and real datasets are given in “Simulation experiments”. The subsequent sections are the conclusion and discussion.
Preliminary and related work
Let \(X = \left\{ {x_{1} ,x_{2} , \cdots ,x_{N} } \right\}\) indicate N SNP markers for \(n\) individuals and \(Y = \left\{ {y_{1} ,y_{2} , \cdots ,y_{J} } \right\}\) denote disease status (J is the number of disease statuses). The homozygous major allele, heterozygous allele and homozygous minor allele in the sample dataset are defined as 0, 1 and 2, respectively. For a kth-order SNP combination, there are \(I = 3^{k}\) genotype combinations. \(n_{i}\) is the number of samples in the dataset with SNP loci having the value of the ith genotype combination, and \(n_{ij}\) represents the number of samples with the ith genotype combination that are associated with disease state \(y_{j}\).
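The counts \(n_{i}\) and \(n_{ij}\) defined above can be tabulated in a single pass over the samples. The sketch below is only an illustration: the function name genotype_counts, the row-per-individual data layout, and the toy data are assumptions made here, not part of the original method description.

```python
from collections import Counter

def genotype_counts(genotypes, phenotype, snp_idx):
    """Tabulate n_i and n_ij for the SNP combination given by snp_idx.

    genotypes: one row per individual, entries coded 0, 1 or 2.
    phenotype: disease state of each individual.
    Genotype combinations are keyed by tuples, e.g. (0, 1); combinations
    that never occur simply have count 0.
    """
    n_i = Counter()
    n_ij = Counter()
    for row, y in zip(genotypes, phenotype):
        combo = tuple(row[s] for s in snp_idx)
        n_i[combo] += 1
        n_ij[(combo, y)] += 1
    return n_i, n_ij

# Toy data: 4 individuals, 3 SNPs, phenotype 1 = case and 0 = control.
genotypes = [[0, 1, 2], [0, 1, 0], [2, 2, 1], [0, 1, 2]]
phenotype = [1, 0, 1, 1]
n_i, n_ij = genotype_counts(genotypes, phenotype, snp_idx=[0, 1])
print(n_i[(0, 1)], n_ij[((0, 1), 1)])
```

For a 2nd-order combination at most \(3^{2} = 9\) keys can appear in n_i; here only (0, 1) and (2, 2) are observed.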
Definition
(high-order SNP epistatic interaction). Let \(X_{k} = \{ x_{{s_{1} }} ,x_{{s_{2} }} , \cdots ,x_{{s_{k} }} \}\) \((1 < k < N,x_{{s_{i} }} \in X)\) be a kth-order SNP combination. \(f(X_{k} ,Y)\) is a function for scoring the association between \(X_{k}\) and disease state \(Y\). A kth-order SNP combination \(X_{k}\) is jointly associated with \(Y\) if and only if \(f(X_{k} ,Y) \succ f(X^{\prime} ,Y)\) for all \(X^{\prime} \subset X_{k}\), where \(\succ\) is defined for comparing the strength of association with the disease. \(X_{k}\) is said to be strongly associated with \(Y\) if \(f(X_{k} ,Y) > \theta\) (\(\theta\) is the threshold value for determining the association with disease status). A kth-order SNP epistatic interaction occurs if and only if a kth-order SNP combination \(X_{k}\) is truly a disease-causing SNP combination associated with Y.
Multitasking optimization model
Multitasking optimization aims to solve K optimization problems simultaneously [41], and its goal is to concurrently optimize all K tasks. Let the K tasks be maximization problems. The optimization model can be expressed as follows:
\[\left\{ X_{1}^{*} ,X_{2}^{*} , \cdots ,X_{K}^{*} \right\} = \left\{ \arg\max_{X \in S_{1}} f_{1}(X),\; \arg\max_{X \in S_{2}} f_{2}(X),\; \cdots ,\; \arg\max_{X \in S_{K}} f_{K}(X) \right\}\]
where the objective function is defined as \(f_{i} :S_{i} \to {\mathbb{R}}\) and \(X_{i}^{*} \in S_{i}\) is the optimal solution of objective function \(f_{i}\) in the feasible space \(S_{i}\).
Evolutionary multitasking optimization (EMO) has received much attention in recent years; it exploits the implicit parallelism of population-based optimization algorithms to search the decision spaces of multiple optimization problems [42]. The evolutionary multitasking algorithm can significantly accelerate convergence for multiple complex optimizations through transfer learning between tasks. It has been applied in the fields of engineering and science computing. Li et al. employed a multi-fidelity evolutionary multitasking method to extract hyperspectral endmembers [43]. Feng et al. proposed evolutionary multitasking to solve the capacitated vehicle routing problem [44] consisting of a weighted learning process for capturing transfer mapping. Eneko Osaba et al. presented a novel adaptive metaheuristic algorithm to address evolutionary multitasking environments called the adaptive transfer-guided multifactorial cellular genetic algorithm (ATMFCGA) [45]. Nguyen Thi Tam introduced evolutionary multitasking optimization to address the issues of relay node assignment for wireless single-hop sensor and multi-hop sensor networks in three-dimensional terrains [46]. To solve scheduling problems with batch distribution, Xu et al. presented a multitasking optimization approach [47]. Gao et al. designed a transfer strategy based on the multidirectional prediction method to improve the performance of the multi-objective multitasking optimization approach [48]. Zhao et al. proposed a polynomial regression surface modelling approach based on multitasking optimization for rational basis function selection [49]. EMO can efficiently address multiple different optimization problems simultaneously, enhance the global search ability and improve the performance of each task via knowledge transfer between tasks [48].
In the detection of high-order SNP epistatic interactions, there may be an implicit relationship between kth-order SNP epistatic interactions and (k + i)th-order SNP epistatic interactions for the same disease.
For example, in a 5th-order SNP epistatic interaction \(\left( {x_{1} ,x_{2} ,x_{3} ,x_{4} ,x_{5} } \right)\), the five single-SNP loci \(x_{i} \left( {i = 1,2,...,5} \right)\) and all 2nd-order SNP combinations \(\left( {x_{i} ,x_{j} } \right)(i \ne j)\) may have no explicit associations with disease status, while some 3rd-order SNP combinations, such as \(\left( {x_{1} ,x_{2} ,x_{4} } \right)\) and \(\left( {x_{2} ,x_{3} ,x_{5} } \right)\), show associations with disease status, which can guide the search algorithm to identify the 5th-order SNP epistatic interaction by transferring learning between the task of detecting 3rd-order epistatic interactions and that of detecting 5th-order epistatic interactions. In the task of detecting 3rd-order SNP epistatic interactions, some 3rd-order SNP combinations with very weak associations with the disease but no disease-causing SNP interactions may be part of the 5th-order epistatic interaction. Conversely, some 5th-order SNP combinations may contain functional loci for 3rd-order SNP interactions. Therefore, a multitasking optimization model is well suited for accelerating the detection of high-order SNP epistatic interactions through knowledge transfer between multiple tasks.
Multitasking optimization model for detecting high-order SNP epistatic interactions
The multitasking optimization model for detecting \(k_{1}\text{-order},k_{2}\text{-order}, \cdots ,k_{m}\text{-order}\) SNP epistatic interactions can be expressed as Eq. (1).
\[\left\{ X_{k_{1}}^{*} ,X_{k_{2}}^{*} , \cdots ,X_{k_{m}}^{*} \right\} = \left\{ \arg\max f(X,Y,k_{1}),\; \arg\max f(X,Y,k_{2}),\; \cdots ,\; \arg\max f(X,Y,k_{m}) \right\} \quad (1)\]
where \(X_{k_{i}}^{*}\) (i = 1, 2, …, m) indicates a \(k_{i}\text{-order}\) SNP epistatic interaction and \(f(X,Y,k)\) denotes the objective function for evaluating the association between kth-order SNP combination \(X_{k}\) and disease status Y.
Discrimination functions for evaluating the associations between SNP combinations and disease status
Due to the small sample size and diversity of disease models, it is very difficult to discriminate kth-order SNP epistatic interactions from all kth-order SNP combinations on a genome-wide scale. Conventional evaluation methods (such as mutual information and Bayesian networks) cannot identify all disease models well. Almost all evaluation methods can correctly discriminate only a portion of disease models. In this study, four discrimination functions with low computational costs are employed to enhance the discrimination ability.
Bayesian-network-based K2 score. The Bayesian-network-based K2 score is a statistical method for describing relationships using a directed acyclic graph (DAG) G = (V, E) [50]. It is a lightweight computing method and has high discrimination precision for evaluating the association between a kth-order SNP combination and disease status; it can be expressed as Eq. (2):
The larger the \({\text{K2-Score}}_{\log}\) value is, the greater the association between a SNP combination and disease status.
ME score. The ME score aims to calculate the contribution of a kth-order SNP combination X to disease status Y, defined as in Eq. (3) [51],
where \(H(x)\) (see Eq. (4)) denotes the Shannon entropy of \(x\) and \(H(x_{1} ,x_{2} ,...,x_{k} )\) represents the joint entropy of multiple variables \((x_{1} ,x_{2} ,...,x_{k} )\).
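To make the entropy-based measure concrete, the following sketch computes Shannon entropy, joint entropy and a mutual-information-style score. It assumes the standard form \(H(X) + H(Y) - H(X,Y)\) and base-2 logarithms; both are assumptions made here, and the exact ME definition of Eq. (3) may differ. The function names and toy data are illustrative only.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of hashable observations."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def mutual_information(genotypes, phenotype, snp_idx):
    """H(X) + H(Y) - H(X, Y) for the SNP combination X given by snp_idx."""
    combos = [tuple(row[s] for s in snp_idx) for row in genotypes]
    joint = list(zip(combos, phenotype))  # joint observations (X, Y)
    return entropy(combos) + entropy(phenotype) - entropy(joint)

# Toy XOR-like model: neither SNP is informative alone, the pair is.
genotypes = [[0, 0], [0, 1], [1, 0], [1, 1]]
phenotype = [0, 1, 1, 0]
mi_single = mutual_information(genotypes, phenotype, [0])
mi_pair = mutual_information(genotypes, phenotype, [0, 1])
print(mi_single, mi_pair)
```

In this toy model the single-SNP score is 0 while the pair score is 1 bit, which mirrors an epistatic interaction with no marginal effect.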
LR score. The LR score is employed as a related measure to identify the likelihood difference between a kth-order SNP epistatic interaction and a kth-order SNP combination that is not involved in the disease process [52, 53], as shown in Eq. (6):
where \(o_{ij}\) and \(e_{ij}\) represent the observed number and expected number of phenotypes, respectively, when a phenotype takes the ith disease state and a SNP combination takes the jth genotype. The expected number \(e_{ij}\) can be obtained based on the Hardy–Weinberg principle [45, 58].
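A likelihood-ratio statistic of this kind can be sketched as follows. The sketch assumes the common form \(2\sum o \ln (o/e)\) and, as a simplification, derives the expected counts from the marginal totals under independence rather than from the Hardy–Weinberg principle mentioned above; it therefore illustrates the idea and is not a reproduction of Eq. (6). The function name lr_score and the toy data are hypothetical.

```python
import math
from collections import Counter

def lr_score(genotypes, phenotype, snp_idx):
    """2 * sum(o * ln(o / e)) over observed (genotype combination, state) cells."""
    n = len(phenotype)
    combos = [tuple(row[s] for s in snp_idx) for row in genotypes]
    combo_total = Counter(combos)      # samples per genotype combination
    state_total = Counter(phenotype)   # samples per disease state
    observed = Counter(zip(combos, phenotype))
    score = 0.0
    for (combo, y), o in observed.items():
        # Assumption: expected count under independence of X and Y.
        e = combo_total[combo] * state_total[y] / n
        score += o * math.log(o / e)   # empty cells contribute nothing
    return 2.0 * score

genotypes = [[0, 0], [0, 1], [1, 0], [1, 1]]
phenotype = [0, 1, 1, 0]
lr_single = lr_score(genotypes, phenotype, [0])
lr_pair = lr_score(genotypes, phenotype, [0, 1])
print(lr_single, lr_pair)
```

On the toy XOR-like data the single SNP scores 0, whereas the pair scores \(8\ln 2\), so the statistic separates the interacting pair from its non-informative parts.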
NDJE score. The NDJE score is defined as the normalized distance with joint entropy [32], which aims to uncover clues for detecting high-order epistatic models with very weak or no marginal effects, defined as in Eqs. (7)–(9):
where \(X = (x_{1} ,x_{2} , \cdots ,x_{k} )\) is a kth-order SNP combination for all samples (including case and control samples); \(X_{{{\text{control}}}}\) indicates a kth-order SNP combination for only control samples; \(n_{i}^{{j,{\text{control}}}}\) and \(n_{i}^{{j,{\text{case}}}}\) denote the numbers of control samples and case samples in the dataset, respectively, with the jth SNP locus taking the value of i (homozygous major allele 0, heterozygous allele 1 and homozygous minor allele 2); \(n_{i}^{{{\text{control}}}}\) and \(n_{i}^{{{\text{case}}}}\) represent the numbers of control samples and case samples in the dataset, respectively, with SNP combination \(X\) taking the value of the ith genotype combination; and \(n^{{{\text{control}}}}\) is the number of control samples.
The smaller the value of \(ND(X)\) is, the larger the distribution difference (distance) between the case and control samples. The JE of the control samples is employed to normalize the distance. The main goal of NDJE is to uncover a clue to guide the HS algorithm to detect potential diseasecausing SNP combinations.
Harmony search algorithm
The HS algorithm mimics the process of new music improvisation by jazz musicians, who address unknown complex problems by exchanging information and learning between individuals in a group [54,55,56]. Musicians improvise their instruments’ pitches to search for a perfect state of harmony. The HS algorithm is characterized by its simplicity, easy implementation, and powerful global search capabilities and has been widely applied to large-scale combinatorial optimization problems. (The standard HS algorithm is introduced in detail in Supplementary file 1.)
In HS, a candidate solution \(X = (x_{1} ,x_{2} , \ldots ,x_{{\mathbf{K}}} )\) is referred to as a harmony. A set of candidate solutions is referred to as a harmony memory (HM), which is similar to the memory of a tabu search (TS) algorithm and the population of a GA. The number of harmonies in an HM is called the harmony memory size (HMS). An HM is a matrix of order HMS × N or an augmented matrix of order HMS × (N + 1) [50, 57] as in Eq. (10):
where \(X^{i}\) (i = 1, 2, …, HMS) is the ith harmony in HM and \(f(X^{i})\) denotes the value of the objective function.
The worst harmony \(X^{{{\text{id\_worst}}}}\) in HM is iteratively updated by new harmony \(X^{{{\text{new}}}}\), which is improvised through the following three operators:

(1) HM consideration, which is performed with probability HMCR (the harmony memory considering rate) and builds \(X^{{{\text{new}}}}\) by combining values stored in HM.

(2) Pitch adjustment, which is performed with probability PAR (the pitch adjustment rate) and applies a local adjustment to \(X^{{{\text{new}}}}\).

(3) Random consideration, which is performed with probability 1 − HMCR and introduces stochastic disturbance in the feasible search space to explore unknown regions.
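The three operators can be sketched for harmonies that encode SNP combinations as index lists. This is a minimal illustration of the standard HS improvisation step, not the proposed algorithm: the function name improvise, the parameter defaults, the ±1 neighbour move used for pitch adjustment and the duplicate handling are all assumptions made here.

```python
import random

def improvise(hm, n_snps, hmcr=0.9, par=0.3, rng=random):
    """Improvise one new harmony (a list of distinct SNP indices) from memory hm."""
    k = len(hm[0])
    new = []
    for j in range(k):
        if rng.random() < hmcr:
            # (1) HM consideration: copy the jth note of a random stored harmony.
            note = rng.choice(hm)[j]
            if rng.random() < par:
                # (2) Pitch adjustment: assumed here to be a +/-1 move, clipped to range.
                note = min(n_snps - 1, max(0, note + rng.choice((-1, 1))))
        else:
            # (3) Random consideration: draw any SNP index from the search space.
            note = rng.randrange(n_snps)
        # Re-draw duplicates so the harmony stays a valid SNP combination.
        while note in new:
            note = rng.randrange(n_snps)
        new.append(note)
    return new

rng = random.Random(0)
hm = [[1, 5, 9], [2, 4, 8], [0, 3, 7]]  # toy harmony memory, HMS = 3, k = 3
new_harmony = improvise(hm, n_snps=10, rng=rng)
print(new_harmony)
```

In a full HS loop the new harmony would be scored and would replace the worst harmony in HM if it scores better; n_snps must be at least k for the duplicate re-draw to terminate.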
HS has been widely used to solve complex engineering and science optimization problems.
Proposed algorithm
Framework
Figure 1 shows the framework of our proposed MTHSADHEI algorithm.
MTHSADHEI is divided into three stages: the search stage, screening stage and verification stage. In the search stage, a multitasking HS is adopted to find potential SNP combinations that have a strong association with disease status. In the screening stage, the G-test [30, 32, 59] statistical method is employed to test the significance level of the difference between control samples and case samples, and the SNP combinations with significance level p values larger than the threshold value \({\theta }_{1}\) are discarded. In the verification stage, MDR [38] is further used to verify the classification ability.
The goal of this study is to develop a fast and effective search algorithm; therefore, the focus is on the design of the search stage.
In the search stage of MTHSADHEI, four scoring functions (Bayesian network-based K2 score, LR-based score, ME-based score and NDJE-based score) are employed as objective functions to improve the ability to discriminate SNP interactions from non-functional SNP combinations. T tasks are employed to detect 2nd-order, 3rd-order, …, Tth-order and (T + 1)th-order SNP epistatic interactions simultaneously. As shown in Fig. 2, four tasks are concurrently employed to detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP interactions, in which the tth task has an HM and four elite harmony sets (EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t} and EHS_{4}^{t}). Each harmony in the HM has four association scores (K2 score, LR score, ME score and NDJE score). Each harmony in the elite harmony set has only one association score. The K2 score, LR score, ME score and NDJE score are separately adopted by EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t} and EHS_{4}^{t}. Unified coding is applied to the harmonies and elite harmony sets of all tasks, which is intended to allow the harmonies of the T tasks to transfer knowledge to each other and further improve detection speed.
In MTHSADHEI, all tasks employ the same code length, but only the first t + 1 values are considered to be the solution for the tth task. For example, task 1 aims to detect 2nd-order SNP interactions. For each \(X^{i} \left( {i = 1,2,...,HMS} \right)\), only the association between the 2nd-order SNP combination \(\left( {x_{1}^{i} ,x_{2}^{i} } \right)\) and disease status \(Y\) is calculated, and the other SNPs are ignored.
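The unified coding rule can be illustrated with a short sketch in which every harmony has the length of the highest-order task and task t reads only its first t + 1 entries. The function name decode and the example SNP indices are hypothetical.

```python
def decode(harmony, task):
    """Return the SNP combination that task `task` (1-based) reads from a harmony."""
    return tuple(harmony[: task + 1])

# One unified code of length 5, shared by four tasks (2nd- to 5th-order).
harmony = (12, 7, 301, 45, 88)
for t in range(1, 5):
    print(f"task {t} ({t + 1}-order): {decode(harmony, t)}")
```

Because all tasks share one layout, a harmony taken from any task can be evaluated by any other task without re-encoding, which is what makes knowledge transfer between tasks straightforward.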
Figure 1 presents the general flow of the search stage, in which Algorithm 1 and Algorithm 2 describe the improvisation of new harmonies based on knowledge transfer from other tasks and on the current task, respectively. Algorithm 3 presents the process of updating the harmony memory and elite harmony sets (EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t} and EHS_{4}^{t}) of the tth task.
Improvising new harmonies with knowledge transfer
In the proposed MTHSADHEI approach, improvising a new harmony for the current tth task has two steps: (1) knowledge transfer from other tasks and (2) generation of a new solution in the current task by using HM_{t}, EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t} and EHS_{4}^{t}. Figure S1 shows an example in which task 2 transfers knowledge to task 1 (see Supplementary file 1).
In the 1st method, the new harmony is improvised using three classical operators of HS, but the knowledge is from another task \(t_{r} \in \{ 1,2, \cdots,{T}\} ,t_{r} \ne t\), which aims to obtain the information from another task \(t_{r}\). If task \(t_{r}\) has a higher order than the current task, it may carry one or more functional SNPs that were missing in the current task. Conversely, if task \(t_{r}\) has a lower order than the current task, some clues (SNPs with marginal effects) that can help the current task accelerate the detection of epistatic SNP interactions may be found.
Algorithm 1 describes the steps of improvising a new harmony \({X}^{\mathrm{new}}=\left({x}_{1}^{\mathrm{new}},{x}_{2}^{\mathrm{new}},\dots ,{x}_{K}^{\mathrm{new}}\right)\) for the tth task by transferring learning from the t_{r}th task (\(t\ne {t}_{r}\)), in which the HM consideration draws on the four EHSs of the t_{r}th task and one of three pitch adjustment strategies is applied to \({X}^{\mathrm{new}}\) with equal probability. For \({X}^{\mathrm{new}}\), if \({x}_{i}^{\mathrm{new}}={x}_{j}^{\mathrm{new}}\left(i\ne j\right)\), the value of \({x}_{i}^{\mathrm{new}}\) or the value of \({x}_{j}^{\mathrm{new}}\) will be randomly regenerated from the search space. \({X}^{\mathrm{new}}\) will be randomly regenerated if it already exists in the HM or in \({\text{EHS}}_{r}^{t} \left(r=1,2,3,4\right)\).
In the 2nd method, a new harmony is improvised through the components (HM_{t}, EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t}, and EHS_{4}^{t}) of the current tth task. HM_{t} is for harmony memory consideration with probability HMCR. The best harmonies of four elite harmony sets (EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t}, and EHS_{4}^{t}) are the focus of consideration when employing the pitch adjustment operator.
Algorithm 2 describes the steps of improvising a new harmony \({X}^{\mathrm{new}}=\left({x}_{1}^{\mathrm{new}},{x}_{2}^{\mathrm{new}},\dots ,{x}_{K}^{\mathrm{new}}\right)\) for the tth task by self-learning from \({\mathrm{HM}}_{t}\) and \({\mathrm{EHS}}_{r}^{t}\) \((r=1,2,3,4)\).
In Algorithm 1 and Algorithm 2,\(a \in \left\{ {1, \cdots ,K} \right\}\),\(a_{1} \in \left\{ {1, \cdots ,K} \right\}\),\(b \in \left\{ {1, \cdots ,{\text{HMS}}} \right\}\), \(b_{1} \in \left\{ {1,2, \cdots ,{\text{HMS}}} \right\}\), \(b_{2} \in \left\{ {1, \ldots ,{\text{HMS}}} \right\}\),\(R \in \left\{ {1,2,3,4} \right\}\),\(R_{0} \in \left\{ {1,2,3,4} \right\}\) and \(r \in \left\{ {1,2,3,4} \right\}\) are all randomly generated integers, and \({\text{HM}}_{{t_{r} }} (i,j)\) denotes the jth note (variable) of the ith harmony in \({\text{HM}}_{{t_{r} }}\). \({\text{EHS}}_{r}^{t} \, \) is the rth elite harmony set of the tth task. \(x_{a}^{{{\text{EHS}}_{r}^{t} ,b}}\) denotes the ath SNP value of the bth harmony in the rth elite harmony set of the tth task. F is the scale factor for adjusting the step of the local search.
In Algorithm 1 and Algorithm 2, the scale factor F is important for the performance of the proposed MTHSADHEI method. It is analysed in the simulation experiment described in “Simulation experiments”.
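For readers unfamiliar with harmony search, the three classical operators (harmony memory consideration, pitch adjustment and random selection) can be sketched as below. This is the textbook discrete HS improvisation, not Algorithm 1 or Algorithm 2 of this paper: the elite-set guidance and the three pitch-adjustment strategies are not reproduced, and treating F as the maximum integer step of the pitch adjustment is an assumption made only for illustration. The function name `improvise_classical` is hypothetical.

```python
import random

def improvise_classical(hm, n_snps, hmcr, par, f, rng=random):
    """Improvise one harmony with the three classical HS operators.

    hm     : harmony memory, a list of harmonies (each a list of K SNP indices)
    hmcr   : harmony memory considering rate
    par    : pitch adjusting rate
    f      : assumed maximum integer step of the pitch adjustment
    """
    k = len(hm[0])
    x_new = []
    for a in range(k):
        if rng.random() < hmcr:
            # Harmony memory consideration: copy the a-th note of a random harmony.
            note = hm[rng.randrange(len(hm))][a]
            if rng.random() < par:
                # Pitch adjustment: shift the note by a bounded random step.
                note += rng.randint(-f, f)
                note = min(max(note, 0), n_snps - 1)
        else:
            # Random selection from the whole search space.
            note = rng.randrange(n_snps)
        x_new.append(note)
    return x_new
```

The result may still contain repeated SNP indices, which is why a repair step is needed before evaluation.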
Update the harmony memory and elite harmony sets
For each new harmony generated for the tth task, HM_{t}, EHS_{1}^{t}, EHS_{2}^{t}, EHS_{3}^{t} and EHS_{4}^{t} are considered to be updated. The update rate of HM_{t} is associated with \({\text{FEs}}\). Algorithm 3 describes the update operator in detail.
In Algorithm 3, the ith fitness value \({\text{sc}}^{{{\text{new}},i}}\) of \(X^{{{\text{new}}}}\) is divided by \({\text{max\_score}}^{i}\) to normalize the fitness value to the interval [0, 1]. The value of \({\text{max\_score}}^{i}\) is the maximum value of the ith scoring function in the initial population, and its value is not changed during iterations. The condition \(C > 2 \; \| \; \left( {{\text{sc}}^{{{\text{new}},4}} > {\text{sc}}_{{{\text{HM}}_{t} }}^{i,4} \; \& \& \; {\text{rand}}(0,1) < \left( {1 - {\text{FEs/maxFEs}}} \right)} \right)\) is critical. C > 2 means that the ith harmony \({\text{HM}}_{t} (i,:)\) of \({\text{HM}}_{t}\) is replaced by \(X^{{{\text{new}}}}\) only when at least two scores of \(X^{{{\text{new}}}}\) are higher than the corresponding scores of \({\text{HM}}_{t} (i,:)\). The term \({\text{sc}}^{{{\text{new}},4}} > {\text{sc}}_{{{\text{HM}}_{t} }}^{i,4} \; \& \& \; {\text{rand}}(0,1) < \left( {1 - {\text{FEs/maxFEs}}} \right)\) indicates that \({\text{HM}}_{t} (i,:)\) can be replaced by \(X^{{{\text{new}}}}\) with probability \(\left( {1 - {\text{FEs/maxFEs}}} \right)\) if \({\text{sc}}^{{{\text{new}},4}} > {\text{sc}}_{{{\text{HM}}_{t} }}^{i,4}\), which means that the NDJE score is treated differently from the other three scores. In this work, the goal of the NDJE score is to discover some clues to guide the algorithm to locate SNP interactions with no marginal effect on disease status.
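The replacement condition can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' Algorithm 3: the function names are hypothetical, the four scores are assumed to be normalized and oriented so that a larger value is better, the fourth score is taken to be the NDJE score, and the condition is read as C > 2 or (NDJE improved and rand(0,1) < 1 - FEs/maxFEs).

```python
import random

def normalize(scores, max_scores):
    """Divide each raw score by the maximum seen in the initial population."""
    return [s / m for s, m in zip(scores, max_scores)]

def should_replace(sc_new, sc_old, fes, max_fes, rng=random):
    """Decide whether X_new replaces a stored harmony.

    sc_new, sc_old : four normalized scores (index 3 is assumed to be NDJE)
    fes, max_fes   : evaluations used so far and the evaluation budget
    """
    # C counts the scores on which the new harmony beats the stored one.
    c = sum(1 for new, old in zip(sc_new, sc_old) if new > old)
    if c > 2:
        return True
    # NDJE-only improvement is accepted with a probability that decays
    # linearly from 1 to 0 as the evaluation budget is consumed.
    ndje_better = sc_new[3] > sc_old[3]
    return ndje_better and rng.random() < (1.0 - fes / max_fes)
```

Early in the search an NDJE improvement alone is almost always accepted, which favours exploration; near the end of the budget the rule falls back to the stricter C > 2 test.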
In the proposed MTHSADHEI algorithm, four complementary discriminating functions (evaluation functions) are adopted to calculate the associations between high-order SNP combinations and disease status, with the aim of improving the ability to identify various disease models. These functions have the following benefits:

(1) All four evaluation methods are lightweight. For a kth-order SNP combination, the four scoring values can be calculated simultaneously by counting only the values of \(n_{i}\) and \(n_{ij}\) (i = 1, 2, …, I; j = 1, 2), and the calculations are not additive.
(2) The four evaluation methods are complementary to each other. The K2 score has high power for detecting SNP interactions and is superior in discriminating certain disease models with weak marginal effects. However, it has low accuracy for interaction models with low minor allele frequencies (MAFs) and low genetic heritability (H^{2}). The ME score aims to calculate the contribution of a kth-order SNP combination to disease status. The LR score aims to discover the likelihood difference between a functional SNP combination and a nonfunctional SNP combination via statistical theory, and it has good adaptability to unknown disease models. The NDJE score aims to guide the HS to uncover clues for detecting high-order epistatic interactions.
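The shared counting step mentioned in point (1) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `genotype_counts`, the 0/1/2 genotype coding and the 0 = control, 1 = case labels are assumptions. Here \(n_{ij}\) is the number of samples with the ith genotype combination and disease status j, and \(n_{i}\) is the total over both statuses.

```python
from collections import Counter

def genotype_counts(genotypes, labels, snp_idx):
    """Count genotype combinations for one k-SNP combination.

    genotypes : list of per-sample genotype lists, coded 0/1/2 (assumed coding)
    labels    : list of 0 (control) / 1 (case) per sample (assumed coding)
    snp_idx   : tuple with the k SNP column indices of the combination
    Returns (n_i, n_ij) as Counters keyed by combo and by (combo, status).
    """
    n_ij = Counter()
    for row, status in zip(genotypes, labels):
        combo = tuple(row[s] for s in snp_idx)
        n_ij[(combo, status)] += 1
    # n_i is the marginal count of each observed genotype combination.
    n_i = Counter()
    for (combo, _status), count in n_ij.items():
        n_i[combo] += count
    return n_i, n_ij
```

One pass over the samples yields both tables, so every score that depends only on \(n_{i}\) and \(n_{ij}\) can reuse them without touching the genotype data again.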
Simulation experiments
To investigate the performance of the proposed MTHSADHEI method, four 4th-order and eight 5th-order epistatic interaction models with marginal effects (EIMEs), eight high-order epistatic interaction models with no marginal effects (EINMEs) (including two 3rd-order models, three 4th-order models and three 5th-order models) and one real disease dataset (age-related macular degeneration, AMD) were tested. The experimental results were compared with the results of four high-order epistatic interaction detection algorithms: CSE, epiACO, NHSADHSC and MPHSDHSI. All experiments were performed on a Windows 10 64-bit system with an Intel(R) Core(TM) i7-8700 CPU @ 3.2 GHz and 16 GB of memory, and all the program codes were written and run in MATLAB R2018a.
Evaluation criteria for performance
(1) Power is a measure of the capability of identifying a kth-order SNP epistatic interaction from genomic data and is expressed as \({\text{Power}} = \# S/\# T\),
where \(\# S\) is the number of true kthorder epistatic interactions found from the simulation datasets, in which a total of \(\# T\) true kthorder epistatic interactions are available.
Note that in this work, power is used mainly to evaluate the search ability of the proposed method. If the diseasecausing SNP combination (epistatic interaction) has been found within the specified iterations (maximum number of objective functions evaluated, maxFEs), the search is considered successful.
(2) FEs represent the number of evaluations of the associations between kthorder SNP combinations and disease status until the kthorder SNP epistatic interaction is found or the terminal condition of the algorithm is met. In simulation experiments, the search is stopped immediately when the kthorder SNP epistatic interaction is found, and FEs is the number of SNP combinations that have been evaluated for their association with disease status. In this work, the aim of the FEs is to measure the capability of the algorithm to reduce the computational burden.
(3) Runtime denotes the mean runtime that an algorithm takes to detect the kthorder SNP epistatic interaction before the algorithm is terminated, and it is intended to measure the time cost of detecting highorder SNP epistatic interactions.
To further investigate the reliability of the results obtained in the search stage, the G-test method is adopted to select the SNP combinations that differ significantly between case and control samples \(\left(P{\text{-}}{\mathrm{value}}<{\mathrm{max}}\left({10}^{-8},0.05/{C}_{N}^{k}\right)\right)\), and MDR is then used to verify the classification accuracy of the SNP combinations selected by the G-test [59]. The false discovery rate (FDR) and F1-score are employed to further evaluate the reliability of the results.
(4) The FDR is defined as \({\text{FDR}} = FP/(FP + TP)\),
where \(FP\) and \(TP\) represent the falsepositive rate and truepositive rate, respectively.
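(5) The F1-score is the harmonic mean of precision and recall and can be expressed as \(F1 = 2 \times {\text{Precision}} \times {\text{Recall}}/({\text{Precision}} + {\text{Recall}})\), where \({\text{Precision}} = TP/(TP + FP)\) and \({\text{Recall}}\) is the proportion of true epistatic interactions that are detected.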
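The evaluation criteria listed above can be written as a few small functions. This is a sketch under assumptions: the function names are hypothetical, FP, TP and FN are treated as counts of false positives, true positives and false negatives (FN is not named in the text), the standard definitions of FDR and F1 are used, and the G-test threshold is read as \({\mathrm{max}}({10}^{-8},0.05/{C}_{N}^{k})\).

```python
from math import comb

def power(n_found, n_total):
    """Fraction of true k-th order interactions recovered (#S / #T)."""
    return n_found / n_total

def fdr(fp, tp):
    """False discovery rate FP / (FP + TP); 0.0 when nothing is reported."""
    return fp / (fp + tp) if (fp + tp) else 0.0

def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written with counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def gtest_threshold(n_snps, k, alpha=0.05, floor=1e-8):
    """Bonferroni-style P-value threshold with a lower floor (assumed reading)."""
    return max(floor, alpha / comb(n_snps, k))
```

For 100 SNPs the Bonferroni term dominates at k = 3, whereas at k = 5 the number of combinations is so large that the floor of \({10}^{-8}\) applies instead.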
Datasets
(1) EINME datasets. Eight EINMEs were employed to test the capability of detecting high-order epistatic interactions with no marginal effect. For each EINME, 1500 control samples and 1500 case samples for the functional SNPs were generated using a multi-objective optimization algorithm that aims to maximize the joint effects of k SNPs and minimize the marginal effects of individual SNPs, subject to Hardy–Weinberg equilibrium (HWE) constraints [60]. The samples of nonfunctional SNPs were randomly generated according to HWE. To investigate the performance of the proposed algorithm, simulation datasets with 100 SNPs, 1 k SNPs and 10 k SNPs were generated for each EINME. The eight EINMEs are described in Table S1 of Supplementary file 1.
(2) EIME datasets. Four 5thorder additive EIMEs, four 5thorder threshold EIMEs and four 4thorder multiplicative EIMEs [61] were employed to test the performance of detecting epistatic interactions with marginal effects. For each model, 100 datasets with 2000 control samples and 2000 case samples that had 100 SNPs, 1000 SNPs and 10 k SNPs were separately generated using GAMETES software [62]. The parameters of the 12 EIMEs are listed in Table S2 of Supplementary file 1.
(3) AMD dataset. The AMD dataset contains 103,611 SNPs genotyped for 50 controls and 96 cases [63]. This experiment aims to detect 2ndorder, 3rdorder, 4thorder and 5thorder SNP epistatic interactions from the 103,611 SNPs for AMD. We conducted two experiments for AMD:

A. All 103,611 SNPs were analysed to identify epistatic interactions.
B. Three widely reported SNPs (rs380390, rs1329428, and rs1363688) were first removed, and the remaining SNPs were analysed to identify epistatic interactions in which each SNP had a small effect on disease status.
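For the simulated datasets described above, the nonfunctional SNPs are said to be generated according to HWE. Under HWE, a biallelic SNP with minor allele frequency q has genotype frequencies \((1-q)^{2}\), \(2q(1-q)\) and \(q^{2}\). The sketch below illustrates this sampling step only; the function names, the 0/1/2 genotype coding and the example MAF are assumptions, and the actual generators used in the paper (the multi-objective method of [60] and GAMETES [62]) are not reproduced.

```python
import random

def hwe_genotype_probs(maf):
    """Genotype frequencies (0, 1 or 2 minor alleles) under HWE."""
    q = maf
    p = 1.0 - q
    return [p * p, 2 * p * q, q * q]

def sample_snp(maf, n_samples, rng=random):
    """Draw genotypes for one nonfunctional SNP across n_samples individuals."""
    return rng.choices([0, 1, 2], weights=hwe_genotype_probs(maf), k=n_samples)

# Example: one random SNP for 3000 individuals with an assumed MAF of 0.3.
genotypes = sample_snp(0.3, 3000, random.Random(42))
```

Because such SNPs are sampled independently of disease status, they carry no signal and act as noise that the search must ignore.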
Parameter analysis and settings
(1) Effect of parameters on the performance of MTHSADHEI
In this section, the effects of two important parameters (TP and F) on the performance of MTHSADHEI are investigated.
As shown in Fig. 3, for the EINMEs, when TP > 0.5, the power begins to drop gradually, and the FEs and runtime increase with an increasing TP value, except for EINME4. However, as shown in Fig. 4, for the EIMEs, the FEs and runtime decrease with an increasing TP value, but the power remains constant. This result demonstrates that a larger TP value decreases the performance of MTHSADHEI for detecting EINME interactions but enhances the performance for the detection of EIME interactions. As a compromise, we consider TP = 0.5 the better choice when the disease model is unknown.
Next, we investigate the effect of parameter F on the performance of the proposed method. Figures 5, 6 and 7 show the power, FEs and runtime of MTHSADHEI for values of parameter F from 2 to 20 (step = 2). MTHSADHEI has the same power values for the three EIMEs for all F values, and it has high power for EINMEs when F equals 10. As shown in Fig. 6 and Fig. 7, MTHSADHEI with a small F value (F < 12) requires more FEs and runtime to find epistatic interactions than MTHSADHEI with a large F value for EINMEs, but for the three EIMEs, the opposite result occurs. Therefore, we recommend that F be set to 10.
In addition, a \({\theta }_{2}\) value set to 0.6 has the highest accuracy for all EINMEs and EIMEs. When \({\theta }_{2}<0.55\), the falsepositive rate starts to increase, and when \({\theta }_{2}>0.65\), the false negative rate starts to increase. For PAR, the algorithm has a greater search speed and improved detection power when its value is in the interval [0.4, 0.7].
(2) Parameter settings
The parameters of the algorithms are described in Table 1.
Experimental results and analysis
(1) EINME
Figure 8 shows the power, FEs and runtime of the five intelligent search algorithms on the eight high-order EINMEs. The power of MTHSADHEI exceeds that of the other four algorithms on all EINMEs, except that MPHSDHSI achieves the same power as MTHSADHEI on EINME3, EINME4 and EINME6. Both MTHSADHEI and MPHSDHSI have much higher power than the other algorithms (Fig. 8a). As shown in Fig. 8b, MTHSADHEI took the fewest FEs among all five algorithms to find the kth-order SNP epistatic interactions, except on EINME3, EINME4 and EINME6. Except for EINME5, EINME7 and EINME8, MTHSADHEI has a success rate of more than 99% in detecting kth-order (k = 3, 4, 5) epistatic interactions from 100 SNPs with no more than 10,000 FEs, which is far fewer than the number of FEs (\(C_{100}^{3}\) = 161,700, \(C_{100}^{4}\) = 3,921,225, \(C_{100}^{5}\) = 75,287,520) required by an exhaustive search.
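The exhaustive-search counts quoted above are binomial coefficients and can be checked directly. The short sketch below also reports what fraction of the search space a budget of 10,000 FEs covers; the function name `exhaustive_evaluations` is introduced here only for illustration.

```python
from math import comb

def exhaustive_evaluations(n_snps, k):
    """Number of k-SNP combinations an exhaustive search must evaluate."""
    return comb(n_snps, k)

budget = 10_000  # FE budget quoted in the text for the 100-SNP datasets
for k in (3, 4, 5):
    total = exhaustive_evaluations(100, k)
    print(f"k={k}: {total:,} combinations, budget covers {budget / total:.4%}")
```

The covered fraction shrinks by more than an order of magnitude with each increase in k, which is why a guided search is needed for high-order detection.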
As shown in Fig. 8c, MPHSDHSI took the least time among the five algorithms on EINME1, EINME3, EINME4 and EINME6 to find the highorder SNP epistatic interactions. The runtime taken by the proposed MTHSADHEI method is slightly more than that of MPHSDHSI, but it is less than the runtime required by the other three algorithms. Importantly, in the simulation experiments, MPHSDHSI, NHSADHSC, epiACO and CSE were performed only on a kthorder epistatic interaction task (where k is the number of functional SNPs in an epistatic interaction); however, the proposed MTHSADHEI method aims to simultaneously detect 2ndorder, …, kthorder epistatic interactions, which consumes much of the computational cost of MTHSADHEI to detect potential 2ndorder, …, (k − 1)order epistatic interactions. Overall, MTHSADHEI has evident advantages over the other four approaches in eight EINMEs with 100 SNPs.
To further investigate performance as the number of SNPs increases, we ran the proposed method on EINME datasets with 1 k and 10 k SNPs. The results are summarized in Table S3 (see Supplementary file 1). Figure 9 displays the curves of the power, FEs, runtime and F1-score of MTHSADHEI with an increasing number of SNPs. For EINME5, EINME7 and EINME8, the power and F1-scores decrease rapidly and the FEs and runtime increase significantly; for the other five models, the changes in these metrics are small. We found that in EINME5, EINME7 and EINME8, the marginal effect of each functional SNP was very small, especially for EINME8, and the joint effects could be seen only when three or more of the five functional SNPs were combined, making it very difficult to search for epistatic interactions among thousands of SNPs. Compared with the exponential growth in the number of SNP combinations, the increases in FEs and runtime and the decrease in power are very small and acceptable.
(2) EIME
Table 2 summarizes the results (power, FEs, and runtime) of the five approaches on the twelve high-order EIMEs, from which it can be clearly seen that the proposed MTHSADHEI method outperforms or matches the other four methods in terms of power. MTHSADHEI took fewer FEs than CSE, NHSADHSC and epiACO for almost all 12 models, and it had 100% power to detect epistatic interactions with very few FEs. Compared with MPHSDHSI, MTHSADHEI took fewer FEs on the four multiplicative datasets (EIME9–EIME12) with 1000 SNP loci. MTHSADHEI took less runtime than CSE, NHSADHSC and epiACO, but more than MPHSDHSI. However, the runtime and FEs of MTHSADHEI cover multiple tasks (detection of 2nd-order, 3rd-order, …, and kth-order SNP epistatic interactions), whereas those of the other four methods correspond to a single task (detection of kth-order SNP epistatic interactions). In summary, compared with the four excellent SISAs, MTHSADHEI has significant advantages in power, especially for more complex disease models (i.e., multiplicative models).
Experiments on the 12 EIME datasets with 100, 1 k and 10 k SNPs were conducted, and the results are summarized in Table S4 (see Supplementary file 1). They demonstrate that the proposed algorithm maintains high power (1st and 2nd power) on all datasets regardless of the number of SNPs, and that the increases in runtime and FEs are modest. Here, the 1st power denotes the ability of the multitasking HS algorithm to find functional epistatic interactions, and the 2nd power is the number of epistatic interactions that pass the threshold value \({\theta }_{1}\). However, for EIME5, EIME8, EIME9 and EIME10, the 3rd power is equal to zero because the classification accuracy of the functional SNP epistatic interactions does not pass the threshold value \({\theta }_{2}\) (= 60%). When MDR was used to evaluate these four disease models, the average classification accuracies were 56.8%, 58.6%, 56.3% and 58.4%, respectively. In contrast to the EINMEs, each functional SNP in the EIMEs has an obvious marginal effect on disease status, which allows the HS algorithm to quickly locate the functional SNPs.
(3) AMD. The proposed MTHSADHEI method is adopted to simultaneously detect 2nd-order, 3rd-order, 4th-order and 5th-order epistatic interactions from AMD data, with 146 samples and 103,611 SNPs. A total of 526 2nd-order SNP combinations, 1059 3rd-order SNP combinations, 638 4th-order SNP combinations and 322 5th-order SNP combinations were found to be associated with AMD, of which 168 2nd-order SNP combinations (CA > 75%, p value < \(1\times {10}^{-7}\)), 631 3rd-order SNP combinations (CA > 80%, p value < \(1\times {10}^{-10}\)), 546 4th-order SNP combinations (CA > 85%, p value < \(1\times {10}^{-10}\)), and 285 5th-order SNP combinations met the significance level for the G-test [30, 32, 59], and the classification accuracy with MDR for each SNP combination was greater than 75%, 80%, 85% and 90% for the 2nd-order, 3rd-order, 4th-order and 5th-order combinations, respectively (see Supplementary file 2 for details). To better analyse the interactions among the identified SNPs, we employed Cytoscape software (https://cytoscape.org/) [67] to generate the interaction networks (see Fig. 10a, Figs. S2(a), S3(a) and S4(a)).
To detect epistatic interactions in which each SNP has a small effect on disease status, we removed three important SNPs (rs380390, rs1329428 and rs1363688) that have been widely reported to be associated with AMD, and MTHSADHEI was applied to the remaining SNPs. Twenty-four 2nd-order SNP combinations (CA > 75%, p value < \(1\times {10}^{-7}\)), 33 3rd-order SNP combinations (CA > 80%, p value < \(1\times {10}^{-10}\)), 56 4th-order SNP combinations (CA > 85%, p value < \(1\times {10}^{-10}\)) and 89 5th-order SNP combinations (CA > 88%, p value < \(1\times {10}^{-12}\)) were found to be strongly associated with AMD. Figure 10b, Figs. S2(b) and S3(b) show the interaction networks.
Figure 10a shows the 2nd-order SNP combinations (p value < \(1\times {10}^{-7}\)) with a classification accuracy greater than 75%, with SNPs rs380390, rs1329428 and rs10272438 shown to interact with many other SNPs. Both rs380390 and rs1329428 are in the CFH gene, which has been widely reported to be associated with AMD [3, 24, 26, 34, 63, 64]. rs10272438 is an intron variant of the BBS9 gene that is associated with Bardet–Biedl syndrome [65] and was reported in our previous study [31]. In Fig. 10b, rs10272438 is the only central node that interacts with 21 SNPs.
Figure S2(a) displays the interaction network of 3rdorder SNP combinations with a classification accuracy greater than 80%, of which the degrees of five central nodes, namely, rs380390, rs1363688, rs1329428, rs618499 and rs555174, are equal to 193, 124, 3, 47 and 13, respectively. In Figure S2(a), the SNPs rs380390 and rs1329428 are indicated to have important roles in 3rdorder SNP combinations, and rs1363688 (at position 174,609,731 of chromosome 15, not in a genecoding region) [14, 31, 32] and rs618499 (in gene ATM) were reported to be associated with osteosarcoma [66] and AMD [14]. rs555174 (not in a genecoding region) has also been reported to be associated with AMD [31, 34]. In Figure S2(b), rs10272438 and rs2022251 are two important nodes, where rs2022251 (on chromosome 17, the difference p value = 0.1 between the case and control samples) has never been reported to be associated with disease.
Figure S3(a) shows the potential 4th-order SNP combinations with a classification accuracy greater than 85% and significance level less than \(1\times {10}^{-10}\), from which it can be seen that the SNPs rs1329428, rs3922799, rs2207553, rs4585932, rs10494614, and rs967358 are central nodes, among which rs1329428 is the only SNP that has been reported. In Figure S3(b), rs10272438, rs10482918 and rs6104678 are three central nodes, where rs10482918 is in the NCAM2 gene on chromosome 21, and rs6104678, which is on chromosome 20, has been reported previously [30, 31, 68].
Figure S4(a) shows the 5th-order SNP interaction network, in which SNPs rs1363688, rs1329428 and rs207389 are central nodes. Figure S4(b) shows that rs10482918, rs10272438, rs1982756 and rs6104678 are central nodes. Two 5th-order SNP combinations, namely, (rs380390, rs7322610, rs2556560, rs4689888, and rs10496217) and (rs2050733, rs207389, rs1178123, rs1329428, and rs1363688), have very high classification accuracy measured with MDR (97.9% and 95.8%, respectively). In the first combination, only SNP rs380390 shows a significant difference (p value = 6.19921E−07) between the case and control samples; the other four SNPs (rs7322610, rs2556560, rs4689888, and rs10496217) have very small effect sizes, with p values of 0.282, 0.283, 0.864 and 0.220, respectively. For the second combination, the p values of the five SNPs are 0.324, 0.088, 0.011, 5.99E−06 and 3.84E−05, respectively.
To complete detection with \(5\times {10}^{7}\) FEs, the proposed MTHSADHEI method took no more than 20 h to simultaneously detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP epistatic interactions from the AMD dataset. The time required for MPHSDHSI to perform the detection of 2nd-order, …, 5th-order SNP epistatic interactions one by one was more than 48 h; NHSADHSC, CSE and epiACO required more than one day. According to the AMD detection results, MTHSADHEI found almost all the SNPs that have been reported to be associated with AMD, such as rs380390, rs1329428, rs1363688, rs10272438, and rs555174, and found some SNPs that have been reported to be associated with other complex diseases. Some previously unreported SNPs were also found by MTHSADHEI and are worthy of further study by biologists.
Conclusion
According to the experimental results, the proposed MTHSADHEI method is significantly superior to the other four algorithms in terms of power, FEs and runtime for the EINMEs. For the EIMEs, it also outperforms the other methods with respect to power and FEs, but it takes more time than MPHSDHSI for most EIMEs. In the AMD experiment, MTHSADHEI also showed a powerful ability to detect high-order epistatic interactions from hundreds of thousands of data points, and it found almost all the SNPs that have been reported to be associated with AMD. Although the results of simulation experiments indicate that our method outperforms the four compared SISAs and performs very effectively in detecting high-order SNP epistatic models, such as EINME1, EINME4, and EINME6, additive models and threshold models, it still cannot ensure the identification of causal epistatic interactions from a dataset of over 10,000 SNPs in a limited amount of time (30 min), and its detection power degrades rapidly for some models, such as EINME8. For the four multiplicative models, the heritability and population prevalence values have a very important influence on the detection power of the algorithm. The larger the heritability and population prevalence values of the disease model are, the higher the detection power of the algorithm. However, SNP loci typically have a small effect and only modest heritability [2, 69]. To enhance the detection power of the proposed MTHSADHEI method on datasets with more than 10,000 SNPs, we can set a large size for HM and EHS and set a large value for MaxFEs to make the algorithm run for a longer time. Large sizes of harmony memory and large values of MaxFEs can improve detection power, but the computational burden will also increase rapidly.
Discussion
Traditional methods for detecting highorder SNP epistatic interactions can perform only a single task and ignore the sharing of information between tasks, which makes the computational burden of detecting SNP epistatic interactions from unknown disease data very high. To address this problem, this study aimed to improve detection power, reduce the computational burden and enhance the ability to discriminate highorder SNP epistatic interactions from a significant number of highorder SNP combinations. We proposed a novel multitasking HS algorithm for detecting highorder SNP epistatic interactions, where multitasking is applied to accelerate detection using concurrent collaborative computation, transfer learning is adopted to enhance the information exchange between tasks, and four complementary evaluation functions are employed to promote the ability to identify various disease models and overcome the preference of a single evaluation function for a specific disease model. In addition, for the epistatic interaction model with no marginal effects, it is very difficult to uncover clues that can guide the search algorithm to locate the functional SNP loci. The proposed MTHSADHEI algorithm integrates NDJE into the evaluation functions to seek clues of the functional SNP locus that has no or a very weak marginal effect on disease status.
MTHSADHEI is a metaheuristic search algorithm, and its time complexity is determined by four objective functions (K2score, ME score, LR score and NDJEscore) of the 1st stage and the MaxFEs (maximum number of evaluations of associations between SNP combinations and disease status). In the 2nd stage and 3rd stage, only the Gtest and MDR are employed to test and verify the number of candidate solutions that were found in the 1st stage, and the associated time complexity is negligible. The time complexity of evaluating objective functions is O(k × S) (where k is the order of SNP combinations and S is the number of samples). Therefore, the time complexity of MTHSADHEI is roughly equivalent to O(k × S × MaxFEs), which is much less than the time complexity O(k × S × N^{k}) of the exhaustive method, where N is the number of SNPs in the dataset. Since N and MaxFEs are much larger than k and S, the time complexity of the traditional exhaustive method is O(N^{k}), which is much higher than the time complexity O(k × S × MaxFEs) of MTHSADHEI.
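To give a sense of scale for the complexity comparison above, the two cost expressions can be evaluated with the AMD figures quoted in this paper (N = 103,611 SNPs, S = 146 samples, k = 5, MaxFEs = 5 × 10^7). This is only a rough operation-count sketch with hypothetical function names; it ignores constant factors and is not a runtime prediction.

```python
def heuristic_cost(k, s, max_fes):
    """Operation count behind O(k * S * MaxFEs) for the metaheuristic search."""
    return k * s * max_fes

def exhaustive_cost(k, s, n):
    """Operation count behind O(k * S * N^k) for the exhaustive search."""
    return k * s * n ** k

# Figures taken from the AMD experiment described in the text.
k, s, n, max_fes = 5, 146, 103_611, 5 * 10**7
ratio = exhaustive_cost(k, s, n) / heuristic_cost(k, s, max_fes)
print(f"exhaustive / heuristic = {ratio:.3e}")
```

The factors k and S cancel in the ratio, so the gap between the two approaches is governed entirely by \(N^{k}\) versus MaxFEs.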
To the best of our knowledge, this study is the first to detect highorder SNP epistatic interactions by using a multitasking search algorithm. There is still much room for improving the performance of this type of algorithm. In the future, we should try to develop an explicitencodingbased multitasking search algorithm to improve the search speed and design more effective evaluation functions for identifying various disease models.
In addition, the fuzzy set-based optimization algorithm [70] has received much attention in recent years and has been successfully applied to solid assignment [71, 72] and transportation [73] problems, and it can also be considered a focal method for future studies on high-order SNP epistatic interaction detection. It is also very important to develop an effective scoring function for seeking clues to guide the SISA to locate the positions of potential SNP interactions. Finally, the proposed MTHSADHEI method can currently only be applied to detect associations between common SNPs (MAF > 0.05 && MAF < 0.5) and disease status; its application to rare variants requires further study.
Availability and implementation
The supplementary files, MATLAB codes and Python code are available at https://github.com/shouhengtuo/MTHSADHEI.
Notes
(*) In EIMEs, one or more SNP loci also have a marginal effect on disease status, resulting in an epistatic interaction model with additive effects. In EINMEs, each SNP locus has no or a very weak marginal effect on disease status.
References
Guo X (2015) Searching genomewide disease association through SNP Data. Dissertation, Georgia State University. https://scholarworks.gsu.edu/cs_diss/101.
Manolio TA et al (2009) Finding the missing heritability of complex diseases. Nature 461:747–753
Easton DF et al (2007) Genomewide association study identifies novel breast cancer susceptibility loci. Nature 447:1087–1093
Fellay J et al (2007) A wholegenome association study of major determinants for host control of HIV1. Science 317:944–947
Wang MH, Cordell HJ, Van Steen K (2019) Statistical methods for genomewide association studies. Semin Cancer Biol 55:53–60
Visscher PM, Wray NR, Zhang Q et al (2017) 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 101:5–22
Upton A, Trelles O, CornejoGarcia JA, Perkins JR (2016) Review: highperformance computing to detect epistasis in genome scale datasets. Brief Bioinform 17(3):368–379. https://doi.org/10.1093/bib/bbv058
Loucoubar C, Grant AV, Bureau JF et al (2017) Detecting multiway epistasis in familybased association studies. Brief Bioinform 18(3):394–402. https://doi.org/10.1093/bib/bbw039
Li P, Guo MZ, Wang CY et al (2015) An overview of SNP interactions in genomewide association studies. Brief Funct Genomics 14:143–155
Banerjee S, Zeng LY, Schunkert H et al (2018) Bayesian multiple logistic regression for case–control GWAS. PLoS Genet 14:27
Sun S, Dong B, Zou Q (2021) Revisiting genomewide association studies from statistical modelling to machine learning. Brief Bioinform 22(4):263. https://doi.org/10.1093/bib/bbaa263
Gros PA, Le Nagard H, Tenaillon O (2009) The evolution of epistasis and its links with genetic robustness, complexity and drift in a phenotypic model of adaptation. Genetics 182(1):277–293. https://doi.org/10.1534/genetics.108.099127
Zhang Y, Liu J (2007) Bayesian inference of epistatic interactions in case–control studies. Nat Genet 39:1167–1173. https://doi.org/10.1038/ng2110
Guo X, Meng Y, Yu N, Pan Y (2014) Cloud computing for detecting high order genomewide epistatic interaction via dynamic clustering. BMC Bioinformatics 5(1):102
Yang GYJW, Yang Q et al (2014) PBOOST: a GPUbased tool for parallel permutation tests in genomewide association studies. Bioinformatics 2014(9):1460–1462
Cecilia JM, PonteFernández C, GonzálezDomínguez J, Martín MJ (2020) Fast search of thirdorder epistatic interactions on CPU and GPU clusters. Int J High Perform Comput Appl 34(1):20–29. https://doi.org/10.1177/1094342019852128
Wang J, Joshi T, Valliyodan B, Shi H, Liang Y et al (2015) A Bayesian model for detection of highorder interactions among genetic variants in genomewide association studies. BMC Genomics 16:1011. https://doi.org/10.1186/s1286401522176
Han B, Chen XW, Talebizadeh Z, Xu H (2012) Genetic studies of complex human diseases: characterizing SNPdisease associations using Bayesian networks. BMC Syst Biol 6(Suppl 3):S14. https://doi.org/10.1186/175205096S3S14
Wang W (2010) TEAM: efficient twolocus epistasis tests in human genomewide association study. Bioinformatics 26(12):i217
Moore JH, Hahn LW, Ritchie MD, Thornton TA, White BC (2002) Application of genetic algorithms to the discovery of complex genetic models for simulation studies in human genetics. In: Langdon WB, et al., editors. Proceedings of the Genetic and Evolutionary Computation Conference. Morgan Kaufmann Publishers; San Francisco
Moore JH, Hahn LW, Ritchie MD et al (2004) Routine discovery of complex genetic models using genetic algorithms. Appl Soft Comput 4(1):79–86
Moore JH, Andrews PC, Olson RS, Carlson SE, Larock CR, Bulhoes MJ, Armentrout SL (2017) Gridbased stochastic search for hierarchical gene–gene interactions in populationbased genetic studies of common human diseases. BioData Mining 10:19. https://doi.org/10.1186/s1304001701393
Wang Y, Liu X, Robbins K et al (2010) AntEpiSeeker: detecting epistatic interactions for case–control studies using a twostage ant colony optimization algorithm. BMC Res Notes 3(1):117
Shang J, Zhang J, Lei X, Zhang Y, Chen B (2012) Incorporating heuristic information into ant colony optimization for epistasis detection. Genes Genom 34(3):321–327
Sun Y, Shang J, Liu JX, Li S, Zheng CH (2017) epiACO—a method for identifying epistasis based on ant Colony optimization algorithm. BioData Mining 10:23. https://doi.org/10.1186/s1304001701437
Sun Y, Wang X, Shang J, Liu J, Zheng C, Lei X (2019) Introducing heuristic information into ant colony optimization algorithm for identifying epistasis. IEEE/ACM Trans Comput Biol Bioinform. https://doi.org/10.1109/TCBB.2018.2879673
Yang CH, Chuang LY, Lin YD (2017) Multiobjective differential evolutionbased multifactor dimensionality reduction for detecting gene–gene interactions. Sci Rep 7(1):12869. https://doi.org/10.1038/s4159801712773x
Yang CH, Kao YK, Chuang LY, Lin YD (2018) Catfish taguchibased binary differential evolution algorithm for analysing single nucleotide polymorphism interactions in chronic dialysis. IEEE Trans Nanobiosci 17(3):291–299
Aflakparast M et al (2014) Cuckoo search epistasis: a new method for exploring significant genetic interactions. Heredity 112:666–674
Tuo S, Zhang J, Yuan X et al (2016) FHSASED: twolocus model detection for genomewide association study with harmony search algorithm. PLoS One 11(3):e0150669
Tuo S, Zhang J, Yuan X, He Z, Liu Y, Liu Z (2017) Niche harmony search algorithm for detecting complex disease associated highorder SNP combinations. Sci Rep 7:11529
Shouheng T, Haiyan L, Hao C (2020) Multipopulation harmony search algorithm for the detection of highorder SNP interactions. Bioinformatics 36:4389–4398. https://doi.org/10.1093/bioinformatics/btaa215
Wang J, Joshi T, Valliyodan B, Shi H, Liang Y, Nguyen HT et al (2015) A Bayesian model for detection of highorder interactions among genetic variants in genomewide association studies. BMC Genomics 16:1011. https://doi.org/10.1186/s1286401522176
Guo Y, Zhong Z, Yang C, Hu J, Jiang Y, Liang Z et al (2019) EpiGTBN: an approach of epistasis mining based on genetic Tabu algorithm and Bayesian network. BMC Bioinform 20(1):444. https://doi.org/10.1186/s128590193022z
Visweswaran S, Wong AKI, Barmada MM (2009) A Bayesian method for identifying genetic interactions. AMIA Ann Sympos Proc Am Med Inform Assoc: 673
Cao X, Yu G, Liu J, Jia L, Wang J (2018) ClusterMI: detecting high-order SNP interactions based on clustering and mutual information. Int J Mol Sci 19(8):2267
Jing PJ, Shen HB (2015) MACOED: a multiobjective ant colony optimization algorithm for SNP epistasis detection in genomewide association studies. Bioinformatics 31:634–641. https://doi.org/10.1093/bioinformatics/btu702
Crawford L, Zeng P, Mukherjee S, Zhou X (2017) Detecting epistasis with the marginal epistasis test in genetic mapping studies of quantitative traits. PLoS Genet 13(7):e1006869. https://doi.org/10.1371/journal.pgen.1006869
Gola D, Mahachie John JM, van Steen K, König IR (2016) A roadmap to multifactor dimensionality reduction methods. Brief Bioinform 17(2):293–308. https://doi.org/10.1093/bib/bbv038
Kim H, Jeong HB, Jung HY, Park T, Park M (2019) Multivariate clusterbased multifactor dimensionality reduction to identify genetic interactions for multiple quantitative phenotypes. Biomed Res Int 2019:4578983. https://doi.org/10.1155/2019/4578983
Gupta A, Ong YS, Feng L (2016) Multifactorial evolution: toward evolutionary multitasking. IEEE Trans Evol Comput 20(3):343–357
Tang ZD, Gong MG et al (2021) A multifactorial optimization framework based on adaptive intertask coordinate system. IEEE Trans Cybernet. https://doi.org/10.1109/TCYB.2020.3043509
Li JZ, Li H et al (2021) Multifidelity evolutionary multitasking optimization for hyperspectral endmember extraction. Appl Soft Comput 111:107713
Feng L et al (2019) Explicit evolutionary multitasking for combinatorial optimization: a case study on capacitated vehicle routing problem. IEEE Trans Cybernet 51(6):3143–3156. https://doi.org/10.1109/TCYB.2019.2962865
Osaba E, Del Ser J, Martinez AD, Lobo JL, Herrera F (2021) ATMFCGA: an adaptive transferguided multifactorial cellular genetic algorithm for evolutionary multitasking. Inf Sci 570:577–598
Tam NT, Dat VT, Lan PN, Binh HTT, Vinh LT, Swami A (2021) Multifactorial evolutionary optimization to maximize lifetime of wireless sensor network. Inf Sci 576:355–373
Xu X, Yin G, Wang C (2021) Multitasking scheduling with batch distribution and due date assignment. Complex Intell Syst 7:191–202. https://doi.org/10.1007/s4074702000184x
Dang Q, Gao W, Gong M (2022) Multiobjective multitasking optimization assisted by multidirectional prediction method. Complex Intell Syst. https://doi.org/10.1007/s40747021006242
Zhao Y, Ye S, Chen X et al (2021) Polynomial Response Surface based on basis function selection by multitask optimization and ensemble modeling. Complex Intell Syst. https://doi.org/10.1007/s40747021005687
Neapolitan RE (2004) Learning Bayesian networks. Prentice Hall, Upper Saddle River
Li X (2017) A fast and exhaustive method for heterogeneity and epistasis analysis based on multiobjective optimization. Bioinformatics 33(18):2829–2836. https://doi.org/10.1093/bioinformatics/btx339
Bush WS, Edwards TL, Dudek SM, McKinney BA, Ritchie MD (2008) Alternative contingency table measures improve the power and detection of multifactor dimensionality reduction. BMC Bioinform 9:238. https://doi.org/10.1186/147121059238
Neyman J, Pearson ES (1928) On the use and interpretation of certain test criteria for purposes of statistical inference: part 1. Biometrika 20A:175–240
Geem ZW, Kim JH, Loganathan GV (2001) A new heuristic optimization algorithm: harmony search. SIMULATION 76(2):60–68
Das S, Mukhopadhyay A, Roy A, Abraham A, Panigrahi BK (2011) Exploratory power of the harmony search algorithm: analysis and improvements for global numerical optimization. IEEE Trans Syst Man Cybernet Part B 41(1):89–106
Tuo S, Geem ZW, Yoon JH (2020) A new method for analyzing the performance of the harmony search algorithm. Mathematics 8(9):1421. https://doi.org/10.3390/math8091421
Zhang TH, Geem ZW (2019) Review of harmony search with respect to algorithm structure. Swarm Evol Comput 48:31–43
Crow JF (1999) Hardy, Weinberg and language impediments. Genetics 152:821–825
Hoey J (2012) The twoway likelihood ratio (G) test and comparison to twoway chi squared test. arXiv preprint arXiv:1206.4881
Himmelstein DS et al (2011) Evolving hard problems: generating human genetics datasets with a complex etiology. BioData Min. https://doi.org/10.1186/17560381421
PonteFernández C, GonzálezDomínguez J, CarvajalRodríguez A et al (2020) Toxo: a library for calculating penetrance tables of highorder epistasis models. BMC Bioinform. https://doi.org/10.1186/s1285902034563
Urbanowicz RJ, Kiralis J, SinnottArmstrong NA, Heberling T, Fisher JM, Moore JH (2012) GAMETES: a fast, direct algorithm for generating pure, strict, epistatic models with random architectures. BioData Mining 5:1–14
Klein RJ et al (2005) Complement factor H polymorphism in agerelated macular degeneration. Science 308:385–389
Xie M, Li J, Jiang T (2012) Detecting genomewide epistasis based on the clustering of relatively frequent items. Bioinformatics 28(1):5–12. https://doi.org/10.1093/bioinformatics/btr603
Barba M, Pietro LD, Massimi L et al (2018) BBS9 gene in nonsyndromic craniosynostosis: Role of the primary cilium in the aberrant ossification of the suture osteogenic niche. Bone 112:58–70
Mirabello L, Richards EG, Duong LM et al (2011) Telomere length and variation in telomere biology genes in individuals with osteosarcoma. Int J Mol Epidemiol Genet 2(1):19–29
Shannon P, Markiel A, Ozier O et al (2003) Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 13(11):2498–2504. https://cytoscape.org/
Jiang R, Tang W, Wu X, Fu W (2009) A random forest approach to the detection of epistatic interactions in case–control studies. BMC Bioinform 10(Suppl 1):S65. https://doi.org/10.1186/1471210510S1S65
Tam V, Patel N, Turcotte M et al (2019) Benefits and limitations of genomewide association studies. Nat Rev Genet 20:467–484. https://doi.org/10.1038/s4157601901271
Kumar PS (2020) Algorithms for solving the optimization problems using fuzzy and intuitionistic fuzzy set. Int J Syst Assur Eng Manag 11(1):189–222. https://doi.org/10.1007/s13198019009413
Kumar PS (2019) Intuitionistic fuzzy solid assignment problems: a softwarebased approach. Int J Syst Assur Eng Manag 10(4):661–675. https://doi.org/10.1007/s1319801900794w
Kumar PS (2020) The PSK method for solving fully intuitionistic fuzzy assignment problems with some software tools. Adv Bus Strategy Compet Adv. https://doi.org/10.4018/9781522584582.ch009
Kumar PS (2021) Finding the solution of balanced and unbalanced intuitionistic fuzzy transportation problems by using different methods with some software packages. Handbook Res Appl AI Int Bus Market Appl. https://doi.org/10.4018/9781799850779.ch015
Acknowledgements
The authors would like to thank the editors and reviewers for their constructive comments.
Funding
This research was supported by the Natural Science Foundation of China (Grant 62002289).
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Supplementary Information
Below is the link to the electronic supplementary material.
Supplementary file 1: Introduction of the standard HS algorithm, the parameters of the 12 EIME models and supplementary experimental results.
Supplementary file 2: The experimental results for the AMD dataset 1 (PDF 838 KB)
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Tuo, S., Li, C., Liu, F. et al. MTHSADHEI: multitasking harmony search algorithm for detecting highorder SNP epistatic interactions. Complex Intell. Syst. 9, 637–658 (2023). https://doi.org/10.1007/s40747022008137
DOI: https://doi.org/10.1007/s40747022008137