Introduction

Genome-wide association studies (GWASs) are widely considered one of the most promising technologies for elucidating complex relationships between genotype and phenotype [1], such as the Footnote 1causes of complex diseases, due to the rapid development of high-throughput sequencing technology and dramatic declines in sequencing costs. GWASs are dedicated to detecting genetic variants associated with complex traits/diseases from single nucleotide polymorphisms (SNPs), which are the most common genetic variations in human deoxyribonucleic acid (DNA) sequences [2,3,4,5].

Many important and interesting findings have been made by GWASs using single-SNP-based and SNP-pair-based methods. Single-SNP analysis approaches for GWASs, such as the single-SNP test [5], compare the relative frequencies of genotypes between case and control samples independently of other SNP loci, and some results have been successfully translated to candidate drugs [6, 7]. Nevertheless, most studies fail to effectively explain the causal SNPs of complex diseases. One important reason is that most studies focus on discovering the contribution of single SNPs to complex disease status/traits in isolation, and SNPs with a small effect on the phenotype were neglected in further analysis [8]. In recent years, various multi-SNP methods have been employed for GWASs, such as penalized regression [9,10,11], which can eliminate SNPs associated only with the phenotype due to their linkage disequilibrium (LD) with causal SNPs [11]. An increasing number of studies indicate that epistatic interactions across the whole genome ubiquitously exist in relation to complex diseases [12]. Epistatic interaction generally refers to joint interaction effects among multiple genetic variants in the genome, where the effect of a set of genes or SNPs on a phenotype is unequal to the sum of their independent contributions [11]. Epistatic interactions are now widely regarded to determine individual susceptibility to complex diseases [8, 13].

Detecting high-order epistatic interactions in the human genome has become a very important goal in GWASs, but it is also extremely challenging because there are hundreds of thousands of SNPs in the human genome, creating a very complex “combination explosion problem”. Current computers are not capable of determining whether each kth-order (k > 2) SNP combination has an epistatic interaction effect in a limited time. To address this problem, high-performance computing and heuristic searches have been presented to accelerate the detection of high-order epistatic interactions. High-performance computing usually adopts graphics processing units (GPUs) and parallel processing techniques to improve the speed of computers. Guo et al. employed cloud computing to detect high-order epistatic interactions [14], and forty virtual machines were adopted to accelerate the detection of such interactions. Yang et al. [15] developed a GPU-based permutation tool to accelerate the detection of SNP-SNP interactions based on the likelihood ratio (LR) test with the assumption that the statistic follows a χ2 distribution. Cecilia et al. [16] presented a tool called MPI3SNP that implements a multicentral processing unit (CPU) and multi-GPU clusters to detect 3rd-order epistatic interactions. Alex Upton et al. [7] reviewed high-performance computing and cloud computing used to detect epistasis in detail and dissected different computational approaches to analyse epistatic interactions in disease-related genetic datasets. GPU and parallel processing techniques can speed up detection but are insufficient for high-order (> 3) epistatic interactions because the time complexity of detecting high-order SNP epistatic interactions is not reduced if the search algorithm still has high time complexity (i.e., exhaustive search algorithm).

To reduce the computational burden, heuristic search techniques, such as the Monte Carlo method [13, 17, 18], the spanning tree method [19], and swarm intelligence search algorithms (SISAs) [20, 21], use current information about the target problem as heuristic information that can improve search efficiency and reduce the number of searches. The Monte Carlo method employs random sampling procedures to explore potential SNP epistatic interactions and can speed up epistasis detection, but its power is often unsatisfactory. Zhang, Y et al. presented a Bayesian partition model (called Beam) for detecting SNP epistatic interactions, and they employed Markov chain Monte Carlo (MCMC) sampling to compute the posterior probability of SNP markers [13]. Beam models have a very rapid search speed but easily miss epistatic interactions with weak marginal effects on disease status. Wang W introduced a minimum spanning tree structure for exhaustively detecting two-locus epistasis [19]. The minimum spanning tree-based method is powerful for detecting 2nd-order epistatic interactions with marginal effects but largely inefficient for the detection of high-order SNP epistatic interactions with weak marginal effects. Shanwen Sun et al. analysed statistical modelling and machine learning approaches for identifying SNP epistatic interactions in detail [11].

Due to the powerful exploration capability of a high-dimensional search space, the SISA has received much attention in recent years for high-order epistatic interaction detection. Moore JH et al. employed a genetic algorithm (GA) to discover complex genetic models [20, 21] and presented a grid-based stochastic search algorithm (named Crush-MDR) [22], which adopts genetic model-free multifactor dimensionality reduction (MDR) to calculate the associations between SNP combinations and disease status in order to accelerate the detection of high-order epistatic interactions. Crush-MDR reduces the time complexity of the search process, but the objective function MDR is computationally expensive. Wang et al. proposed a two-stage ant colony optimization (ACO) algorithm (named AntEpiSeeker) to detect epistatic interactions [23], which employs the chi-square (χ2) test to evaluate the scores of SNP combinations. In the 1st stage of AntEpiSeeker, ACO is used to select suspected SNP sets with high χ2 scores, and the 2nd stage of AntEpiSeeker conducts an exhaustive search with the suspected SNPs. Shang and Sun et al. conducted an in-depth study on gene–gene interactions via ACO [24,25,26], and their research concentrated on the identification of epistatic models and the improvement of ACO. Differential evolution-based methods were adopted by Yang et al. [27, 28] to detect epistatic interactions, and improved MDR was used to measure associations. Aflakparast, M. et al. introduced a cuckoo search epistasis (CSE) detection algorithm to identify high-order SNP epistatic interactions in which each single SNP has a small effect on disease status. The CSE detection algorithm first divides all SNP loci into M groups based on their relevance, and kth-order SNP combinations are then chosen from the M groups [29].

Tuo et al. proposed three harmony search (HS)-based epistatic detection algorithms (FHSA-SED [30], NHSA-DHSC [31], and MP-HS-DHSI [32]) because of the performance advantages of HS, such as its powerful exploration ability and fast speed. HS is a very simple optimization algorithm that has shown outstanding performance in solving both combinational optimization problems and real number optimization problems. FHSA-SED aims to discover SNP-pair (2nd-order) interactions using HS and two scoring functions (the Gini index and Bayesian network-based K2 score) to evaluate the associations between SNP pairs and disease status. NHSA-DHSC presents a niche HS for detecting high-order epistatic interactions, in which a niche strategy is used to record local optimal solutions and avoid repeated searches in local regions. The MP-HS-DHSI algorithm employs multipopulational and multiple scoring functions to improve the exploration power of HS and overcome the preference for disease models. In terms of search speed and detection power, it outperforms FHSA-SED and NHSA-DHSC in detecting high-order SNP epistatic interactions, but for an unknown detection task, it also requires the detection of 2nd-order, 3rd-order, …, Kth-order epistatic interactions one by one.

Although the SISA has made some progress in accelerating the detection of high-order epistatic interactions, it still faces two challenges:

Search. Finding kth-order (a combination of k SNP loci) epistatic interactions among over hundreds of thousands of SNPs in the whole genome in a limited time is very difficult due to the large number of SNP combinations, which is a complex combination explosion problem. For example, the number of 3rd-order SNP combinations for 1,000,000 SNPs is larger than \(1.6667\times {10}^{17}\). In particular, if the SNPs in high-order epistatic interactions have very weak or no marginal effect on disease status/complex traits, the SISA is inefficient or nearly powerless because there are no valid clues to guide the population to locate the causal SNP epistatic interaction among the extreme number of SNP combinations.

Discrimination. The discriminating function (objective function) adopted to calculate the associations between SNP combinations and phenotypes is crucial for the SISA. Faced with such a large number of SNP combinations, discrimination functions with light computational requirements should be considered first for SISAs. Bayesian network-based methods [33,34,35], Shannon entropy-based methods (i.e., mutual information and conditional entropy) [36] and statistical test methods (i.e., chi-square tests [37]) are lightweight methods that have been widely used to evaluate associations, but none of them are considered effective for all types of epistatic interaction models. Machine learning approaches, such as MDR [38,39,40], random forest and neural networks, are statistical-free methods with strong applicability for evaluating various disease models, but the high computational burden limits the usefulness of these methods as objective functions of SISAs.

To address the above challenges, this study aims to improve performance in detecting high-order SNP epistatic interactions in the following two aspects:

  1. (1)

    A multitasking HS algorithm with three stages is developed to improve detection speed and power.

  2. (2)

    Four complementary association evaluation functions are employed to improve the discrimination ability of various disease models.

To the best of our knowledge, the existing SISAs for detecting SNP epistatic interactions, such as CSE [29], MACOED [37], epiACO [25], and NHSA-DHSC [31], can perform only one task (detecting a single kth-order epistatic interaction) in each run and, therefore, must be run n times to perform n tasks (detecting k1-order, k2-order, …, kn-order epistatic interactions). To collaboratively perform multiple detection tasks simultaneously, a multitasking HS algorithm (named MTHSA-DHEI) is developed for detecting high-order epistatic interactions in this study. The contributions of our work can be summarized as follows:

  1. (1)

    A new multitask-based HS algorithm is proposed for detecting k1-order, k2-order, …, kn-order SNP epistatic interactions simultaneously. The proposed algorithm is divided into three stages: searching, screening and verifying. The search stage aims to reduce the computational burden. The purpose of screening and verifying is to improve detection result accuracy.

  2. (2)

    Unified coding is adopted to represent k1-order, k2-order, …, kn-order combinations. For all tasks, the solutions (SNP combinations) are encoded with the same length, which is equal to the number of SNPs in the highest-order SNP epistatic interaction, and this encoding scheme facilitates knowledge transfer between tasks. Knowledge transfer between tasks can significantly accelerate the detection of high-order SNP epistatic interactions from high-dimensional genome datasets.

  3. (3)

    To improve the capability of identifying various models and discriminating kth-order SNP epistatic interactions from non-functional kth-order SNP combinations, four complementary association evaluation functions (Bayesian network, mutual entropy (ME), LR, and normalized distance with joint entropy (ND-JE)) are integrated as objective functions of the multitasking HS.

  4. (4)

    T harmony memories (HM1, HM2, …, HMT) are employed to memorize the potential SNP epistatic combinations for T tasks, and four elite harmony sets (EHS1, EHS2, EHS3, and EHS4) are used for each task to record the elite solutions of four evaluation functions, with the aims of reducing the preference of a single evaluation function for a particular disease model and enhancing the global search ability.

The rest of this paper is arranged as follows. “Preliminary and related work” presents the related work and preliminaries. The proposed method is introduced in detail in “Proposed algorithm”. The experiments performed on simulation datasets and real datasets are given in “Simulation experiments”. The subsequent sections are the conclusion and discussion.

Preliminary and related work

Let \(X = \left\{ {x_{1} ,x_{2} , \cdots ,x_{N} } \right\}\) indicate N SNP markers for \(n\) individuals and \(Y = \left\{ {y_{1} ,y_{2} , \cdots ,y_{J} } \right\}\) denote disease status (J is the number of disease statuses). The homozygous major allele, heterozygous allele and homozygous minor allele in the sample dataset are defined as 0, 1 and 2, respectively. For a kth-order SNP combination, there are \(I = {3}^{k} \, \) genotype combinations. \(n_{i}\) is the number of samples in the dataset with SNP loci having the value of the ith genotype combination, and \(n_{ij}\) represents the number of samples with the ith genotype combination that are associated with disease state \(y_{j}\).

Definition

(high-order SNP epistatic interaction). Let \(X_{k} = \{ x_{{s_{1} }} ,x_{{s_{2} }} , \cdots ,x_{{s_{k} }} \}\) \((1 < k < N,x_{{s_{i} }} \in X)\) be a kth-order SNP combination. \(f(X_{k} ,Y)\) is a function for scoring the association between \(X_{k}\) and disease state \(Y\). A kth-order SNP combination \(X_{k}\) is jointly associated with \(Y\) if and only if \(\forall X^{^{\prime}} \subset X_{k} \wedge f(X_{k} ,Y) \succ f(X^{^{\prime}} ,Y)\), where \(\succ\) is defined for comparing the strength of association with the disease. \(X_{k}\) is said to be strongly associated with \(Y\) if \(f(X_{k} ,Y) > \theta\) (\(\theta\) is the threshold value for determining the association with disease status). A kth-order SNP epistatic interaction occurs if and only if a kth-order SNP combination \(X_{k}\) is truly a disease-causing SNP combination associated with Y.

Multitasking optimization model

Multitasking optimization aims to solve K optimization problems simultaneously [41], and its goal is to concurrently optimize all K tasks. Let the K tasks be maximization problems. The optimization model can be expressed as follows:

$$ \left\{ {X_{1}^{*} ,X_{2}^{*} ,...,X_{K}^{*} } \right\} = \left\{ {\mathop {\arg \max }\limits_{{}} f_{1} \left( X \right),..,\mathop {\arg \max }\limits_{{}} f_{m} \left( X \right)} \right\} $$

where the objective function is defined as \(f_{i} :S_{i} \to {\mathbb{R}}\) and \(X_{{_{i} }}^{*} \in S{}_{i}\) is the optimal solution of objective function \(f_{i}\) in the feasible space of \(S_{i}\).

Evolutionary multitasking optimization (EMO) has received much attention in recent years in relation to implicit parallel population-based optimization algorithms to search multiple decision spaces of multiple optimization problems [42]. The evolutionary multitasking algorithm can significantly accelerate convergence for multiple complex optimizations by transferring learning between tasks. It has been applied in the fields of engineering and science computing. Li et al. employed a multifidelity evolutionary multitasking method to extract hyperspectral endmembers [43]. Feng et al. proposed evolutionary multitasking to solve the capacitated vehicle routing problem [44] consisting of a weighted learning process for capturing transfer mapping. Eneko Osaba et al. presented a novel adaptive metaheuristic algorithm to address evolutionary multitasking environments called the adaptive transfer-guided multifactorial cellular genetic algorithm (AT-MFCGA) [45]. Nguyen Thi Tam introduced evolutionary multitasking optimization to address the issues of relay node assignment for wireless single-hop sensor and multihop sensor networks in three-dimensional terrains [46]. To solve scheduling problems with batch distribution, Xu et al. presented multitasking optimization [47]. Gao et al. designed a transfer strategy based on the multidirectional prediction method to improve the performance of the multiobjective multitasking optimization approach [48]. Zhao et al. proposed a polynomial regression surface modelling approach based on multitasking optimization for rational basis function selection [49]. EMO can efficiently address multiple different optimization problems simultaneously, enhance the global search ability and improve the performance of each task via knowledge transfer between tasks [48].

In the detection of high-order SNP epistatic interactions, there may be an implicit relationship between kth-order SNP epistatic interactions and (kth + i)-order SNP epistatic interactions for the same disease.

For example, in a 5th-order SNP epistatic interaction \(\left( {x_{1} ,x_{2} ,x_{3} ,x_{4} ,x_{5} } \right)\), the five single-SNP loci \(x_{i} \left( {i = 1,2,...,5} \right)\) and all 2nd-order SNP combinations \(\left( {x_{i} ,x_{j} } \right)(i \ne j)\) may have no explicit associations with disease status, while some 3rd-order SNP combinations, such as \(\left( {x_{1} ,x_{2} ,x_{4} } \right)\) and \(\left( {x_{2} ,x_{3} ,x_{5} } \right)\), show associations with disease status, which can guide the search algorithm to identify the 5th-order SNP epistatic interaction by transferring learning between the task of detecting 3rd-order epistatic interactions and that of detecting 5th-order epistatic interactions. In the task of detecting 3rd-order SNP epistatic interactions, some 3rd-order SNP combinations with very weak associations with the disease but no disease-causing SNP interactions may be part of the 5th-order epistatic interaction. Conversely, some 5th-order SNP combinations may contain functional loci for 3rd-order SNP interactions. Therefore, a multitasking optimization model is well suited for accelerating the detection of high-order SNP epistatic interactions through knowledge transfer between multiple tasks.

Multitasking optimization model for detecting high-order SNP epistatic interactions

The multitasking optimization model for detecting \(k_{1} {\text{ - order}},k_{2} {\text{ - order}}, \cdots ,k_{m} {\text{ - order}}\) SNP epistatic interactions can be expressed as Eq. (1).

$$ \left\{ {X_{{k_{1} }}^{*} , \cdots ,X_{{k_{m} }}^{*} } \right\} = \left\{ {\mathop {\arg \max }\limits_{{}} f_{1} \left( {X,Y,k_{1} } \right), \cdots ,\mathop {\arg \max }\limits_{{}} f_{m} \left( {X,Y,k_{m} } \right)} \right\}, $$
(1)

where \(X_{{_{{k_{i} }} }}^{*}\)(i = 1, 2,…,m) indicates a \(k_{i} {\text{ - order}}\) SNP epistatic interaction and \(f{(}X,Y,k)\) denotes the objective function for evaluating the association between kth-order SNP combination \(X_{k}\) and disease status Y.

Discrimination functions for evaluating the associations between SNP combinations and disease status

Due to the small sample size and diversity of disease models, it is very difficult to discriminate kth-order SNP epistatic interactions from all kth-order SNP combinations on a genome-wide scale. Conventional evaluation methods (such as mutual information and Bayesian networks) cannot identify all disease models well. Almost all evaluation methods can correctly discriminate only a portion of disease models. In this study, four discrimination functions with low computational costs are employed to enhance the discrimination ability.

Bayesian-network-based K2 score. The Bayesian-network-based K2 score is a statistical method for describing relationships using a directed acyclic graph (DAG) G = (V, E) [50]. It is a lightweight computing method and has high discrimination precision for evaluating the association between a kth-order SNP combination and disease status; it can be expressed as Eq. (2):

$$ {\text{K2 - Score}}_{{{\text{log}}}} = \sum\limits_{i = 1}^{I} {\left( {\sum\limits_{k}^{{n_{i} + 1}} {\log (k) - \sum\limits_{j = 1}^{J} {\sum\limits_{s = 1}^{{n_{ij} }} {\log (s)} } } } \right)} . $$
(2)

The larger the \({\text{K2 - Score}}_{{{\text{log}}}}\) value is, the greater the association between a SNP combination and disease status.

ME score. The ME score aims to calculate the contribution of a kth-order SNP combination X to disease status Y, defined as in Eq. (3) [51],

$$ \begin{gathered} {\text{ME - score}} = I\left( {Y|x_{1} ,...,x_{k} } \right) \hfill \\ { = }H(Y) + H(x_{1} ,...,x_{k} ) - H(Y,x_{1} ,...,x_{k} ), \hfill \\ \end{gathered} $$
(3)

where \(H(x)\)(see Eq. (4)) denotes the Shannon entropy of \(x\) and \(H(x_{1} ,x_{2} ,...,x_{k} )\) represents the joint entropy of multiple variables \((x_{1} ,x_{2} ,...,x_{k} )\).

$$ H(Y) = - \sum\limits_{y = 0}^{J} {p(y) \times \log_{2} \left( {p(y)} \right)} , $$
(4)
$$ H(x_{1} ,...,x_{k} ) = - \sum\limits_{{x_{1} = 0}}^{I} { \cdots \sum\limits_{{x_{k} = 0}}^{I} {p\left( {x_{1} ,...,x_{k} } \right)} } \times \log_{2} p\left( {x_{1} ,...,x_{k} } \right). $$
(5)

LR score. The LR score is employed as a related measure to identify the likelihood difference between a kth-order SNP epistatic interaction and a kth-order SNP combination that is not involved in the disease process [52, 53] as shown in Eq. (6):

$$ {\text{LR - score}} = 2\sum\limits_{i = 1}^{I} {\sum\limits_{j = 1}^{J} {o_{ij} \ln \left( {\frac{{o_{ij} }}{{e_{ij} }}} \right)} } = 2\sum\limits_{i = 1}^{I} {\sum\limits_{j = 1}^{J} {n_{ij} \ln \left( {\frac{{n_{ij} }}{{e_{ij} }}} \right)} } , $$
(6)

where \(o_{ij}\) and \(e_{ij}\) represent the observed number and expected number of phenotypes, respectively, when a phenotype takes the ith disease state and a SNP combination takes the jth genotype. The expected number \(e_{ij}\) can be obtained based on the Hardy–Weinberg principle [45, 58].

ND-JE score. The ND-JE score is defined as the normalized distance with joint entropy [32], which aims to uncover clues for detecting high-order epistatic models with very weak or no marginal effects, defined as in Eqs. (7)–(9):

$$ {\text{ND\_JE - score}} = ND(X)/JE(X_{{{\text{control}}}} ), $$
(7)
$$ ND(X) = \frac{{\sum\limits_{j = 1}^{J} {\sum\limits_{i = 0}^{I} {\sqrt {\left( {n_{i}^{{j,{\text{control}}}} - n_{i}^{{j,{\text{case}}}} } \right)^{2} } } } }}{{\sum\limits_{i = 0}^{I} {\left| {n_{i}^{{{\text{control}}}} - n_{i}^{{{\text{case}}}} } \right|} }}, $$
(8)
$$ JE\left( {X_{{{\text{control}}}} } \right) = - \sum\limits_{i = 1}^{I} {\frac{{n_{i}^{{{\text{control}}}} }}{{n^{{{\text{control}}}} }}} \log \frac{{n_{i}^{{{\text{control}}}} }}{{n^{{{\text{control}}}} }}, $$
(9)

where \(X = (x_{1} ,x_{2} , \cdots ,x_{k} )\) is a kth-order SNP combination for all samples (including case and control samples); \(X_{{{\text{control}}}}\) indicates a kth-order SNP combination for only control samples; \(n_{i}^{{j,{\text{control}}}}\) and \(n_{i}^{{j,{\text{case}}}}\) denote the numbers of control samples and case samples in the dataset, respectively, with the jth SNP locus taking the value of i (homozygous major allele 0, heterozygous allele 1 and homozygous minor allele 2); \(n_{i}^{{{\text{control}}}}\) and \(n_{i}^{{{\text{case}}}}\) represent the numbers of control samples and case samples in the dataset, respectively, with SNP combination \(X\) taking the value of the ith genotype combination; and \(n_{{}}^{{{\text{control}}}}\) is the number of control samples.

The smaller the value of \(ND(X)\) is, the larger the distribution difference (distance) between the case and control samples. The JE of the control samples is employed to normalize the distance. The main goal of ND-JE is to uncover a clue to guide the HS algorithm to detect potential disease-causing SNP combinations.

Harmony search algorithm

The HS algorithm mimics the process of new music improvisation by jazz musicians, who address unknown complex problems by exchanging information and learning between individuals in a group [54,55,56]. Musicians improvise their instruments’ pitches to search for a perfect state of harmony. The HS algorithm is characterized by its simplicity, easy implementation, and powerful global search capabilities and has been widely applied in combination optimization problems on a large scale. (The standard HS algorithm is introduced in detail in Supplementary file 1.)

In HS, a candidate solution \(X = (x_{1} ,x_{2} , \ldots ,x_{{\mathbf{K}}} )\) is referred to as a harmony. A set of candidate solutions is referred to as a harmony memory (HM), which is similar to the memory of a tabu search (TS) algorithm and the population of a GA. The number of harmonies in an HM is called the harmony memory size (HMS). An HM is a matrix of order HMS × N or an augmented matrix of order HMS × (N + 1) [50, 57] as in Eq. (10):

$$ \begin{gathered} {\text{HM}} = \left[ {\begin{array}{*{20}c} {X^{1} } \\ {X^{2} } \\ \vdots \\ {X^{{{\text{HMS}}}} } \\ \end{array} \left| {\begin{array}{*{20}c} {f(X^{1} ,Y,k)} \\ {f(X^{2} ,Y,k)} \\ \vdots \\ {f(X^{{{\text{HMS}}}} ,Y,k)} \\ \end{array} } \right.} \right] \hfill \\ \, = \left[ {\begin{array}{*{20}c} {x_{1}^{1} } & {x_{2}^{1} } & \cdots & {x_{{\mathbf{K}}}^{1} } \\ {x_{1}^{2} } & {x_{2}^{2} } & \cdots & {x_{{\mathbf{K}}}^{2} } \\ \vdots & \vdots & \cdots & \vdots \\ {x_{1}^{{{\text{HMS}}}} } & {x_{2}^{{{\text{HMS}}}} } & \cdots & {x_{{\mathbf{K}}}^{{{\text{HMS}}}} } \\ \end{array} \left| {\begin{array}{*{20}c} {f(X^{1} ,Y,k)} \\ {f(X^{2} ,Y,k)} \\ \vdots \\ {f(X^{{{\text{HMS}}}} ,Y,k)} \\ \end{array} } \right.} \right], \hfill \\ \end{gathered} $$
(10)

where \(X^{i}\) (i = 1, 2, …, HMS) is the i-th harmony in HM and f(xi) denotes the value of the objective function.

The worst harmony \(X^{{{\text{id\_worst}}}}\) in HM is iteratively updated by new harmony \(X^{{{\text{new}}}}\), which is improvised through the following three operators:

  1. (1)

    HM consideration performs a combination operation of HM with the probability harmony memory considering rate (HMCR).

  2. (2)

    Pitching adjusts with probability PAR, which performs a local adjustment operation on \(X^{{{\text{new}}}}\).

  3. (3)

    Random consideration is performed with probability 1-HMCR, which introduces stochastic disturbance in a feasible search space to explore unknown space.

HS has been widely used to solve complex engineering and science optimization problems.

Proposed algorithm

Framework

Figure 1 shows the framework of our proposed MTHSA-DHEI algorithm.

Fig. 1
figure 1

The general flowchart of the proposed algorithm MTHSA-DHEI. In the search stage, the multitasking harmony search aims to discover some potential SNP combinations that have a strong association with disease status. In the screening stage, the G-test is employed to eliminate some SNP combinations without a significance level. MDR is used to verify the classification ability of candidate solutions in the 3rd stage

MTHSA-DHEI is divided into three stages: the search stage, screening stage and verification stage. In the search stage, a multitasking HS is adopted to find potential SNP combinations that have a strong association with disease status. The G-test [30, 32, 59] statistical method is employed to test the significance level of the difference between control samples and case samples in the screening stage, and the SNP combinations with significance level p values larger than the threshold value \({\theta }_{1}\) are discarded. In the verification stage, MDR [38] is further used to verify the classification ability.

The goal of this study is to develop a fast and effective search algorithm; therefore, the focus is on the design of the search stage.

In the search stage of MTHSA-DHEI, four scoring functions (Bayesian network-based K2 score, LR-based score, ME-based score and ND-JE-based score) are employed as objective functions to improve the ability to discriminate SNP interactions with nonfunctional SNP combinations. T tasks are employed to detect 2nd-order, 3rd-order, …, Tth-order, (Tth + 1)-order SNP epistatic interactions simultaneously. As shown in Fig. 2, four tasks are concurrently employed to detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP interactions, in which the t-th task has an HM and four elite harmony sets (EHS1t, EHS2t, EHS3t and EHS4t). Each harmony in the HM has four association scores (K2 score, LR score, ME score and ND-JE score). Each harmony in the elite harmony set has only one association score. The K2 score, LR score, ME score and ND-JE score are separately adopted by EHS1t, EHS2t, EHS3t and EHS4t. Unified coding is applied to the harmonies and elite harmony sets of all tasks, which is intended to allow the harmonies among the K tasks to transfer knowledge from each other and further improve detection speed.

Fig. 2
figure 2

Four tasks are separately employed to detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP interactions simultaneously. The t-th task possesses an HM (HMt) for recoding the potential SNP combinations and has four elite harmony sets (EHS1t, EHS2t, EHS3t and EHS4t) for storing the elite solutions. In EHS1t, the association of SNP combinations is calculated with the K2 score based on a Bayesian network. The association of SNP combinations is calculated with the LR in EHS2t. In EHS3t and EHS4t, the association of SNP combinations is separately calculated based on the ME and ND-JE scores. The tasks exchange information with each other to improve the search capability and speed

In MTHSA-DHEI, all tasks employ the same code length, but only the previous t + 1 values are considered to be the solution for the t-th task. For example, task 1 aims to detect 2nd-order SNP interactions. Only the association of the 2nd-order SNP combination \(\left( {x_{1}^{i} ,x_{2}^{i} } \right)\) in \(X^{i} \left( {i = 1,2,...,HMS} \right)\) is calculated between \(\left( {x_{1}^{i} ,x_{2}^{i} } \right)\) and disease status \(Y\), and other SNPs are ignored, as follows:

$$ {\text{HM}}_{1} = \left[ {\begin{array}{*{20}c} {x_{1}^{1} } & {x_{2}^{1} } & {\begin{array}{*{20}c} {x_{3}^{1} } & \cdots \\ \end{array} } & {x_{{\mathbf{K}}}^{1} } \\ {x_{1}^{2} } & {x_{2}^{2} } & {\begin{array}{*{20}c} {x_{3}^{2} } & \cdots \\ \end{array} } & {x_{{\mathbf{K}}}^{2} } \\ \vdots & \vdots & {\begin{array}{*{20}c} \vdots & \cdots \\ \end{array} } & \vdots \\ {x_{1}^{{{\text{HMS}}}} } & {x_{2}^{{{\text{HMS}}}} } & {\begin{array}{*{20}c} {x_{3}^{{{\text{HMS}}}} } & \cdots \\ \end{array} } & {x_{{\mathbf{K}}}^{{{\text{HMS}}}} } \\ \end{array} \left| {\begin{array}{*{20}c} {f\left( {\left( {x_{1}^{1} ,x_{2}^{1} } \right),Y,2} \right)} \\ {f\left( {\left( {x_{1}^{2} ,x_{2}^{2} } \right),Y,2} \right)} \\ \vdots \\ {f\left( {\left( {x_{1}^{{{\text{HMS}}}} ,x_{2}^{{{\text{HMS}}}} } \right),Y,2} \right)} \\ \end{array} } \right.} \right]. $$

In Fig. 1, the general flow of the search stage is presented, in which Algorithm 1 and Algorithm 2 introduce the improvisation of new harmonies based on knowledge transfer from other tasks and the current task, respectively. Algorithm 3 presents the process of updating the harmony memory and elite harmony sets (EHS1t, EHS2t, EHS3t and EHS4t) of the t-task.

Improvising new harmonies with knowledge transfer

In the proposed MTHSA-DHEI approach, improvising a new harmony for the current t-th task has two steps: (1) knowledge transfer from other tasks and (2) generation of a new solution in the current task by using HMt, EHS1t, EHS2t, EHS3t and EHS4t. Figure S1 shows an example in which task 2 transfers knowledge to task 1 (see Supplementary file 1).

In the 1st method, the new harmony is improvised using three classical operators of HS, but the knowledge is from another task \(t_{r} \in \{ 1,2, \cdots,{T}\} ,t_{r} \ne t\), which aims to obtain the information from another task \(t_{r}\). If task \(t_{r}\) has a higher order than the current task, it may carry one or more functional SNPs that were missing in the current task. Conversely, if task \(t_{r}\) has a lower order than the current task, some clues (SNPs with marginal effects) that can help the current task accelerate the detection of epistatic SNP interactions may be found.

Algorithm 1 describes the steps of improvising a new harmony \({X}^{\mathrm{new}}=\left({x}_{1}^{\mathrm{new}},{x}_{2}^{\mathrm{new}},\dots ,{x}_{K}^{\mathrm{new}}\right)\) for the t-th task by transferring learning from the tr-th task (\(t\ne {t}_{r}\)), in which the HM consideration is from the four EHSs of the tr-th task; there are three strategies with equal probability of pitch adjustment \({X}^{\mathrm{new}}\). For \({X}^{\mathrm{new}}\), if \({x}_{i}^{\mathrm{new}}={x}_{j}^{\mathrm{new}}\left(i\ne j\right)\), the value of \({x}_{i}^{\mathrm{new}}\) or the value of \({x}_{j}^{\mathrm{new}}\) will be randomly regenerated from the search space. \({X}^{\mathrm{new}}\) will be randomly regenerated if it exists in the HM or in \({\text{EHS}}_{r}^{t} \left(r=\mathrm{1,2},\mathrm{3,4}\right)\).

figure a

In the 2nd method, a new harmony is improvised through the components (HMt, EHS1t, EHS2t, EHS3t, and EHS4t) of the current t-th task. HMt is for harmony memory consideration with probability HMCR. The best harmonies of four elite harmony sets (EHS1t, EHS2t, EHS3t, and EHS4t) are the focus of consideration when employing the pitch adjustment operator.

figure b

Algorithm 2 describes the steps of improvising a new harmony \({X}^{\mathrm{new}}=\left({x}_{1}^{\mathrm{new}},{x}_{2}^{\mathrm{new}},\dots ,{x}_{K}^{\mathrm{new}}\right)\) for the t-th task by self-learning from \({\mathrm{HM}}_{t}\mathrm{ and }{\mathrm{EHS}}_{r}^{t} (r=\mathrm{1,2},\mathrm{3,4}\)).

In Algorithm 1 and Algorithm 2,\(a \in \left\{ {1, \cdots ,K} \right\}\),\(a_{1} \in \left\{ {1, \cdots ,K} \right\}\),\(b \in \left\{ {1, \cdots ,{\text{HMS}}} \right\}\), \(b_{1} \in \left\{ {1,2, \cdots ,{\text{HMS}}} \right\}\), \(b_{2} \in \left\{ {1, \ldots ,{\text{HMS}}} \right\}\),\(R \in \left\{ {1,2,3,4} \right\}\),\(R_{0} \in \left\{ {1,2,3,4} \right\}\) and \(r \in \left\{ {1,2,3,4} \right\}\) are all randomly generated integers, and \({\text{HM}}_{{t_{r} }} (i,j)\) denotes the j-th note (variable) of the i-th harmony in \({\text{HM}}_{{t_{r} }}\). \({\text{EHS}}_{r}^{t} \, \) is the r-th elite harmony set of the t-th task. \(x_{a}^{{{\text{EHS}}_{r}^{t} ,b}}\) denotes the a-th SNP value of the b-th harmony in the r-th elite harmony set of the t-th task. F is the scale factor for adjusting the step of the local search.

In Algorithm 1 and Algorithm 2, the scale factor F is important for the performance of the proposed MTHSA-DHEI method. It is analysed in the simulation experiment described in “Simulation experiments”.

Update the harmony memory and elite harmony sets

For each new harmony generated for the t-th task, HMt, EHS1t, EHS2t, EHS3t and EHS4t are considered to be updated. The update rate of HMt is associated with \({\text{FEs}}\). Algorithm 3 describes the update operator in detail.

figure c

In Algorithm 3, the i-th fitness value \({\text{sc}}^{{{\text{new}},i}}\) of \(X^{{{\text{new}}}}\) is divided by \({\text{max\_score}}^{i}\) to normalize the fitness value to the interval [1]. The value of \({\text{max\_score}}^{i}\) is the maximum value of the i-th scoring function in the initial population, and its value is not changed during iterations. The condition C > 2 || \({\text{sc}}^{{{\text{new}},4}} > {\text{sc}}_{{{\text{HM}}_{t} }}^{i,4}\) \(\& \&\)\({\text{rand}}(0,1) < \left( {1 - {\text{FEs/maxFEs}}} \right)\) is critical. C > 2 means that the i-th harmony \({\text{HM}}_{t} (i,:)\) of \({\text{HM}}_{t}\) is replaced by \(X^{{{\text{new}}}}\) only when at least two scores of \(X^{{{\text{new}}}}\) are higher than the corresponding scores of \({\text{HM}}_{t} (i,:)\).\({\text{sc}}^{{{\text{new}},4}} > {\text{sc}}_{{{\text{HM}}_{t} }}^{i,4} \& \& {\text{ rand}}(0,1) < \left( {1 - {\text{FEs/maxFEs}}} \right)\) indicates that \({\text{HM}}_{t} (i,:)\) can be replaced by \(X^{{{\text{new}}}}\) with probability \(\left( {1 - {\text{FEs/maxFEs}}} \right)\) if \({\text{sc}}^{{{\text{new}},4}} > {\text{sc}}_{{{\text{HM}}_{t} }}^{i,4}\), which means that the ND-JE score is different from the other three scores. In this work, the goal of the ND-JE score is to discover some clues to guide the algorithm to locate SNP interactions with no marginal effect on disease status.

In the proposed MTHSA-DHEI algorithm, four complementary discriminating functions (evaluation functions) are adopted to calculate the associations between high-order SNP combinations and disease status, with the aim of improving the ability to identify various diseases, and have the following benefits:

  1. (1)

    All four evaluation methods are lightweight. For a kth-order SNP combination, the four scoring values can be calculated simultaneously by counting only the values of \(n_{i}\) and \(n_{ij}\) (i = 1, 2,…,I; j = 1,2), and the calculations are not additive.

  2. (2)

    The four evaluation methods are complementary to each other. The K2 score has high power for detecting SNP interactions and is superior in discriminating certain disease models with weak marginal effects. However, it has low accuracy for interaction models with low minor allele frequencies (MAFs) and low genetic heritability (H2). The ME score aims to calculate the contribution of a kth-order SNP combination to disease status. The LR score aims to discover the likelihood difference between a functional SNP combination and a nonfunctional SNP combination via statistical theory, and it has good adaptability to unknown disease models. The ND_JE score aims to guide the HS to uncover clues for detecting high-order epistatic interactions.

Simulation experiments

To investigate the performance of the proposed MTHSA-DHEI method, four 4th-order and eight 5th-order epistatic interaction models with marginal effects (EIMEs), eight high-order epistatic interaction models with no marginal effects (EINMEs) (including two 3rd-order models, three 4th-order models and three 5th-order models) and one real disease dataset (age-related macular degeneration, AMD) were tested. The experimental results were compared with the results of four high-order epistatic interaction detection algorithms: CSE, epiACO, NHSA-DHSC and MP-HS-DHSI. All experiments were performed on a Windows 10 64-bit system with an Intel(R) Core (TM) i7-8700 CPU @3.2 GHz, 16 GB memory, and all the program codes were written and run in MATLAB R2018a.

Evaluation criteria for performance

(1) Power is a measure of the capability of identifying a kth-order SNP epistatic interaction from genomic data and is expressed as

$$ {\text{Power }} = \, \frac{\# S}{{\# T}}, $$

where \(\# S\) is the number of true kth-order epistatic interactions found from the simulation datasets, in which a total of \(\# T\) true kth-order epistatic interactions are available.

Note that in this work, power is used mainly to evaluate the search ability of the proposed method. If the disease-causing SNP combination (epistatic interaction) has been found within the specified iterations (maximum number of objective functions evaluated, maxFEs), the search is considered successful.

(2) FEs represent the number of evaluations of the associations between kth-order SNP combinations and disease status until the kth-order SNP epistatic interaction is found or the terminal condition of the algorithm is met. In simulation experiments, the search is stopped immediately when the kth-order SNP epistatic interaction is found, and FEs is the number of SNP combinations that have been evaluated for their association with disease status. In this work, the aim of the FEs is to measure the capability of the algorithm to reduce the computational burden.

(3) Runtime denotes the mean runtime that an algorithm takes to detect the kth-order SNP epistatic interaction before the algorithm is terminated, and it is intended to measure the time cost of detecting high-order SNP epistatic interactions.

To further investigate the reliability of the results obtained in the search stage, the G-test method is adopted to screen out the SNP combinations that differ significantly between case and control samples \(\left(P{\mathrm{value}}<{\mathrm{max}}\left({10}^{-8},0.05/{C}_{N}^{k}\right)\right)\), and MDR is then used to verify the classification accuracy of the SNP combinations selected by the G-test [59]. The false discovery rate (FDR) and F1-score are employed to further evaluate the reliability of the results.

(4) The FDR is defined as:

$$\frac{FP}{FP+TP},$$

where \(FP\) and \(TP\) represent the false-positive rate and true-positive rate, respectively.

(5) The F1-score can be expressed as follows:

$$\mathrm{recall}=\frac{TP}{TP+FN},$$
$$\mathrm{precision}=\frac{TP}{TP+FP},$$
$$F1-\mathrm{score}=\frac{2\times \mathrm{recall}\times \mathrm{precision}}{\mathrm{recall}+\mathrm{precision}}.$$

Datasets

(1) EINME datasets. Eight EINMEs were employed to test the capability of detecting high-order epistatic interactions with no marginal effect. For each EINME, 1500 control samples and 1500 case samples for the functional SNPs were generated using a multiobjective optimization algorithm that aims to maximize the joint effects of k-SNPs, minimize the marginal effects of individual SNPs and limit Hardy–Weinberg equilibrium (HWE) constraints [60]. The samples of nonfunctional SNPs were randomly generated according to HWE. To investigate the performance of the proposed algorithm, simulation datasets with 100 SNPs, 1 k SNPs and 10 k SNPs were generated for each EINME. The eight EINMEs are described in Table S1 of Supplementary file 1.

(2) EIME datasets. Four 5th-order additive EIMEs, four 5th-order threshold EIMEs and four 4th-order multiplicative EIMEs [61] were employed to test the performance of detecting epistatic interactions with marginal effects. For each model, 100 datasets with 2000 control samples and 2000 case samples that had 100 SNPs, 1000 SNPs and 10 k SNPs were separately generated using GAMETES software [62]. The parameters of the 12 EIMEs are listed in Table S2 of Supplementary file 1.

(3) AMD dataset. The AMD dataset contains 103,611 SNPs genotyped for 50 controls and 96 cases [63]. This experiment aims to detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP epistatic interactions from the 103,611 SNPs for AMD. We conducted two experiments for AMD:

  1. A.

    All 103,611 SNPs were detected to identify epistatic interactions.

  2. B.

    Three widely reported SNPs (rs380390, rs1329428, and rs1363688) were first removed, and the rest of the SNPs were detected to identify epistatic interactions in which each SNP had a small effect on disease status.

Parameter analysis and settings

(1) Effect of parameters on the performance of MTHSA-DHEI

In this section, the effect of two important parameters (TP and F) on the performance of MTHSA-DHEI are investigated.

As shown in Fig. 3, for the EINMEs, when TP > 0.5, the power begins to drop gradually, and the FEs and runtime increase with an increasing TP value, except for EINME-4. However, as shown in Fig. 4, for the EIMEs, the FEs and runtime decrease with an increasing TP value, but the power remains constant. This result demonstrates that a larger TP value will decrease the performance of MTHSA-DHEI for detecting EINME interactions but enhance the performance for the detection of EIME interactions. With a compromise, we believe that TP = 0.5 is a better choice when we have unknown disease models.

Fig. 3
figure 3

Effect of TP on the performance of MTHSA-DHEI for detecting EINME interactions

Fig. 4
figure 4

Effect of TP on the performance of MTHSA-DHEI for detecting EIME interactions

Next, we investigate the effect of parameter F on the performance of the proposed method. Figures 5, 6 and 7 show the power, FEs and runtime of MTHSA-DHEI for values of parameter F from 2 to 20 (step = 2). MTHSA-DHEI has the same power values for the three EIMEs for all F values, and it has high power for EINMEs when F equals 10. As shown in Fig. 6 and Fig. 7, MTHSA-DHEI with a small F value (F < 12) requires more FEs and runtime to find epistatic interactions than MTHSA-DHEI with a large F value for EINMEs, but for the three EIMEs, the opposite result occurs. Therefore, we recommend that F be set to 10.

Fig. 5
figure 5

Effect of parameter F on the power of MTHSA-DHEI

Fig. 6
figure 6

Effect of parameter F on the FEs of MTHSA-DHEI

Fig. 7
figure 7

Effect of parameter F on the runtime of MTHSA-DHEI

In addition, a \({\theta }_{2}\) value set to 0.6 has the highest accuracy for all EINMEs and EIMEs. When \({\theta }_{2}<0.55\), the false-positive rate starts to increase, and when \({\theta }_{2}>0.65\), the false negative rate starts to increase. For PAR, the algorithm has a greater search speed and improved detection power when its value is in the interval [0.4, 0.7].

(2) Parameter settings

The parameters of the algorithms are described in Table 1.

Table 1 Parameters of the five algorithms for detecting high-order epistatic interactions

Experimental results and analysis

(1) EINME

Figure 8 shows the power, FEs and runtime used to detect eight high-order EINMEs using five intelligent search algorithms, and the results show that the power of MTHSA-DHEI exceeds that of the other four algorithms in all EINMEs except for MP-HS-DHSI, which has the same power value as MTHSA-DHEI on EINME-3, EINME-4 and EINME-6. Both MTHSA-DHEI and MP-HS-DHSI have much higher power than the other algorithms (Fig. 8a). As shown in Fig. 8b, MTHSA-DHEI took the fewest FEs among all five algorithms to find the kth-order SNP epistatic interactions except for EINME-3, EINME-4 and EINME-6. Except for EINME-5, EINME-7 and EINME-8, MTHSA-DHEI has a more than 99% success rate in detecting kth-order (k = 3,4,5) epistatic interactions from 100 SNPs with no more than 10,000 FEs, which is much lower than the number of FEs (\({\complement }_{100}^{3}\)=161,700, \({\complement }_{100}^{4}\)=3,921,225, \({\complement }_{100}^{5}\)=75,287,520) obtained by an exhaustive search.

Fig. 8
figure 8

Power, FEs and runtime of the five algorithms for detecting 8 EINMEs with 100 SNPs

As shown in Fig. 8c, MP-HS-DHSI took the least time among the five algorithms on EINME-1, EINME-3, EINME-4 and EINME-6 to find the high-order SNP epistatic interactions. The runtime taken by the proposed MTHSA-DHEI method is slightly more than that of MP-HS-DHSI, but it is less than the runtime required by the other three algorithms. Importantly, in the simulation experiments, MP-HS-DHSI, NHSA-DHSC, epiACO and CSE were performed only on a kth-order epistatic interaction task (where k is the number of functional SNPs in an epistatic interaction); however, the proposed MTHSA-DHEI method aims to simultaneously detect 2nd-order, …, kth-order epistatic interactions, which consumes much of the computational cost of MTHSA-DHEI to detect potential 2nd-order, …, (k − 1)-order epistatic interactions. Overall, MTHSA-DHEI has evident advantages over the other four approaches in eight EINMEs with 100 SNPs.

To further investigate performance as the number of SNPs increases, we conducted the proposed method on EINME datasets with 1 k and 10 k SNPs. The results are summarized in Table S3 (see Supplementary file 1). Figure 9 displays the change curves of the power, FEs, runtime and F1-score of the MTHSA-DHEI with an increasing number of SNPs, from which we can see that the power and F1-scores decrease rapidly and the FEs and runtime increase significantly when conducting EINME-5, EINME-7 and EINME-8; however, for the other five models, the changes in these metrics are not very significant. We found that in EINME-5, EINME-7 and EINME-8, the marginal effect of each functional SNP was very small, especially for EINME-8, and the joint effects could be seen only when three or more of the five functional SNPs were combined, making it very difficult to search for epistatic interactions among over thousands of SNPs. Compared with the exponential growth in the number of SNP combinations, the increases in FEs and runtime and the decrease in power are very small and acceptable.

Fig. 9
figure 9

Change curves of the power, FEs, runtime and F1-score of MTHSA-DHEI with an increasing number of SNPs for EINMEs

(2) EIME

Table 2 summarizes the results (power, FEs, and runtime) of the five approaches in twelve high-order EIMEs, from which it can be clearly seen that the proposed MTHSA-DHEI method outperforms or is equivalent to the other four methods in terms of power. MTHSA-DHEI took fewer FEs than CSE, NHSA-DHSC and epiACO for almost all 12 models, and it had 100% power to detect epistatic interactions with very few FEs. Compared with MP-HS-DHSI, MTHSA-DHEI took fewer FEs on four multiplicative datasets (EIME-9-EIME-12) with 1000 SNP loci. MTHSA-DHEI took less runtime than CSE, NHSA-DHSC and epiACO, but it took more time than MP-HS-DHSI. However, the runtimes and FEs of MTHSA-DHEI are composed of the runtimes required to perform multiple tasks (detection of 2nd-order, 3rd-order, …, and kth-order SNP epistatic interactions), while the runtimes of the other four methods correspond to the time of performing only a single task (detection of kth-order SNP epistatic interactions). In summary, compared with the four excellent SISAs, MTHSA-DHEI has significant advantages in power, especially for more complex disease models (i.e., multiplicative models).

Table 2 Power, FEs and runtime of the five algorithms for detecting 12 high-order EIMEs with 500 SNPs and 1000 SNPs. (The bold numbers indicate the optimal results)

Experiments on the 12 EIME datasets with 100, 1 k and 10 k SNPs are conducted, and the results are summarized in Table S4 (see Supplementary file 1), which demonstrates that the proposed algorithm can maintain high power (1st and 2nd power) for all datasets with different SNPs. Moreover, its runtime and FEs increases are not very significant, where the 1st power denotes the ability to find functional epistatic interactions by the multitasking HS algorithm and the 2nd power is the number of epistatic interactions that pass the threshold value \({\theta }_{1}\). However, for EIME-5, EIME-8, EIME-9 and EIME-10, the 3rd power is equal to zero because the ability to classify functional SNP epistatic interactions cannot pass the threshold value \({\theta }_{2}\) (= 60%). When MDR was used to evaluate the four disease models, their average classification accuracy was equal to 56.8%, 58.6%, 56.3% and 58.4%. Comparing the EIMEs with EINMEs, each functional SNP in the EIMEs has an obvious marginal effect on disease status, which allows the HS algorithm to quickly locate the functional SNPs.

(3) AMD. The proposed MTHSA-DHEI method is adopted to simultaneously detect 2nd-order, 3rd-order, 4th-order and 5th-order epistatic interactions from AMD data, with 146 samples and 103,611 SNPs. A total of 526 2nd-order SNP combinations, 1059 3rd-order SNP combinations, 638 4th-order SNP combinations and 322 5th-order SNP combinations were found to be associated with AMD, of which 168 2nd-order SNP combinations (CA > 75%, p value < 1 \(\times {10}^{-7}\)), 631 3rd-order SNP combinations (CA > 80%, p value < \(1\times {10}^{-10}\)), 546 4th-order SNP combinations (CA > 85%, p value < \(1\times {10}^{-10}\)), and 285 5th-order SNP combinations met the significance level for the G-test [30, 32, 59], and the classification accuracy with MDR for each SNP combination was greater than 75%, 80%, 85% and 90% for the 2nd-order, 3rd-order, 4th-order and 5th-order combinations, respectively (see Supplementary file 2 for details). To better analyse the interactions among the identified SNPs, we employed Cytoscape software (https://cytoscape.org/) [67] to generate the interaction networks (see Fig. 10a, Figs. S2(a), S3(a) and S4(a)).

Fig. 10
figure 10

A 2nd-order SNP interaction network. A node denotes a SNP locus, and an edge indicates that the 2nd-order SNP combination connected by that edge is strongly associated with AMD (the significance level for the G-test is < \(1\times {10}^{-8}\), and the classification accuracy of the 2nd-order SNP combination is > 75%). The larger the node is, the greater the number of nodes connected to it

To detect epistatic interactions in which each SNP has a small effect on disease status, we removed three important SNPs (rs380390, rs1329428 and rs1363688) that have been widely reported to be associated with AMD, and MTHSA-DHEI was applied to the remaining SNPs. Twenty-four 2nd-order SNP combinations (CA > 75%, p value < 1 \(\times {10}^{-7}\)), 33 3rd-order SNP combinations (CA > 80%, p value < 1 \(\times {10}^{-10}\)), 56 4th-order SNP combinations (CA > 85%, p value < 1 \(\times {10}^{-10}\)) and 89 5th-order SNP combinations (CA > 88%, p value < 1 \(\times {10}^{-12}\)) were found to be strongly associated with AMD. Figure 10b, Figs. S2(b) and S3(b) show the interaction networks.

Figure 10a shows the 2nd-order SNP combinations (p value < \(1\times {10}^{-7}\)) with a classification accuracy greater than 75%, with SNPs rs380390, rs1329428 and rs10272438 shown to interact with many other SNPs. Both rs380390 and rs1329428 are in the CFH gene, which has been widely reported to be associated with AMD [3, 24, 26, 34, 63, 64]. rs10272438 is an intron variant of the BBS9 gene that is associated with Bardet Biedl syndrome [65] and was reported in our previous study [31]. In Fig. 10b, rs10272438 is the only central node that interacts with 21 SNPs.

Figure S2(a) displays the interaction network of 3rd-order SNP combinations with a classification accuracy greater than 80%, of which the degrees of five central nodes, namely, rs380390, rs1363688, rs1329428, rs618499 and rs555174, are equal to 193, 124, 3, 47 and 13, respectively. In Figure S2(a), the SNPs rs380390 and rs1329428 are indicated to have important roles in 3rd-order SNP combinations, and rs1363688 (at position 174,609,731 of chromosome 15, not in a gene-coding region) [14, 31, 32] and rs618499 (in gene ATM) were reported to be associated with osteosarcoma [66] and AMD [14]. rs555174 (not in a gene-coding region) has also been reported to be associated with AMD [31, 34]. In Figure S2(b), rs10272438 and rs2022251 are two important nodes, where rs2022251 (on chromosome 17, the difference p value = 0.1 between the case and control samples) has never been reported to be associated with disease.

Figure S3(a) shows the potential 4th-order SNP combinations with a classification accuracy greater than 85% and significance level less than \(1\times {10}^{-10}\), from which it can be seen that the SNPs rs1329428, rs3922799, rs2207553, rs4585932, rs10494614, and rs967358 are central nodes, among which rs1329428 is the only SNP that has been reported. In Figure S3(b), rs10272438, rs10482918 and rs6104678 are three central nodes, where rs10482918 is in the NCAM2 gene on chromosome 21, and rs6104678, which is on chromosome 20, has been reported previously [30, 31, 68].

Figure S4(a) shows the 5th-order SNP interaction network, in which SNPs rs1363688, rs1329428 and rs207389 are central nodes. Figure S4(b) shows that rs10482918, rs10272438, rs1982756 and rs6104678 are central nodes. There are two 5th-order SNP combinations, namely, (rs380390, rs7322610, rs2556560, rs4689888, and rs10496217) and (rs2050733, rs207389, rs1178123, rs1329428, and rs1363688), that have a very high classification accuracy (97.9% and 95.8%, respectively) measured with MDR in the first combination, except for SNP rs380390, which shows a significant difference (p value = 6.19921E-07) between the case and control samples. The other four SNPs (rs7322610, rs2556560, rs4689888, and rs10496217) have very small effect sizes, and their p values are equal to 0.282, 0.283, 0.864 and 0.220, respectively. For the 2nd SNP combination, the p values of the five SNPs were equal to 0.324, 0.088, 0.011, 5.99 E − 06 and 3.84E − 05.

To complete detection with \(5\times {10}^{7}\) FEs, the proposed MTHSA-DHEI method took no more than 20 h to simultaneously detect 2nd-order, 3rd-order, 4th-order and 5th-order SNP epistatic interactions from the AMD dataset. The time required for MP-HS-DHSI to perform the detection of 2nd-order, …, 5th-order SNP epistatic interactions one by one was more than 48 h; NHSA-DHSC, CSE and epiACO required more than one days. According to the AMD detection results, MTHSA-DHEI found almost all the SNPs that have been reported to be associated with AMD, such as rs380390, rs1329428, rs1363688, rs10272438, and rs555174, and found some SNPs that have been reported to be associated with other complex diseases. Some previously unreported SNPs were also found by MTHSA-DHEI and are worthy of further study by biologists.

Conclusion

According to the experimental results, the proposed MTHSA-DHEI method is significantly superior to the other four algorithms in terms of power, FEs and runtime for the EINMEs. The EIMEs also outperform others with respect to power and FEs, but they take more time than MP-HS-DHEI for most EIMEs. In the AMD experiment, MTHSA-DHEI also showed a powerful ability to detect high-order epistatic interactions from hundreds of thousands of data points, and it found almost all the SNPs that have been reported to be associated with AMD. Although the results of simulation experiments indicate that our method outperforms the four compared SISAs and shows very effective performance for detecting high-order SNP epistatic models, such as EINME-1, EINME-4, and EINME-6, additive models and threshold models, it still cannot ensure the identification of casual epistatic interactions from a dataset of over 10,000 SNPs in a limited amount of time (30 min), and detection power starts to degrade rapidly, such as for EINME-8. For the four multiplicative models, the heritability and population prevalence values have a very important influence on the detection power of the algorithm. The larger the heritability and population prevalence values of the disease model are, the higher the detection power of the algorithm. However, SNP loci typically have a small effect and only modest heritability [2, 69]. To enhance the detection power of the proposed MTHSA-DHEI method on datasets with more than 10,000 SNPs, we can set a large size for HM and EHS and set a large value for MaxFEs to make the algorithm run for a longer time. Large sizes of harmony memory and large values of MaxFEs for our algorithm can improve detection power, but the computational burden will also increase rapidly.

Discussion

Traditional methods for detecting high-order SNP epistatic interactions can perform only a single task and ignore the sharing of information between tasks, which makes the computational burden of detecting SNP epistatic interactions from unknown disease data very high. To address this problem, this study aimed to improve detection power, reduce the computational burden and enhance the ability to discriminate high-order SNP epistatic interactions from a significant number of high-order SNP combinations. We proposed a novel multitasking HS algorithm for detecting high-order SNP epistatic interactions, where multitasking is applied to accelerate detection using concurrent collaborative computation, transfer learning is adopted to enhance the information exchange between tasks, and four complementary evaluation functions are employed to promote the ability to identify various disease models and overcome the preference of a single evaluation function for a specific disease model. In addition, for the epistatic interaction model with no marginal effects, it is very difficult to uncover clues that can guide the search algorithm to locate the functional SNP loci. The proposed MTHSA-DHEI algorithm integrates ND-JE into the evaluation functions to seek clues of the functional SNP locus that has no or a very weak marginal effect on disease status.

MTHSA-DHEI is a metaheuristic search algorithm, and its time complexity is determined by four objective functions (K2-score, ME score, LR score and ND-JE-score) of the 1st stage and the MaxFEs (maximum number of evaluations of associations between SNP combinations and disease status). In the 2nd stage and 3rd stage, only the G-test and MDR are employed to test and verify the number of candidate solutions that were found in the 1st stage, and the associated time complexity is negligible. The time complexity of evaluating objective functions is O(k × S) (where k is the order of SNP combinations and S is the number of samples). Therefore, the time complexity of MTHSA-DHEI is roughly equivalent to O(k × S × MaxFEs), which is much less than the time complexity O(k × S × Nk) of the exhaustive method, where N is the number of SNPs in the dataset. Since N and MaxFEs are much larger than k and S, the time complexity of the traditional exhaustive method is O(Nk), which is much higher than the time complexity O(k × S × MaxFEs) of MTHSA-DHEI.

To the best of our knowledge, this study is the first to detect high-order SNP epistatic interactions by using a multitasking search algorithm. There is still much room for improving the performance of this type of algorithm. In the future, we should try to develop an explicit-encoding-based multitasking search algorithm to improve the search speed and design more effective evaluation functions for identifying various disease models.

In addition, the fuzzy set-based optimization algorithm [70] has received much attention in recent years and has been successfully applied to solid assignment [71, 72] and transportation [73] problems, and it can also be considered a focal method for future studies on high-order SNP epistatic interaction detection. In addition, it is also very important to develop an effective scoring function for seeking clues to guide the SISA to locate the positions of potential SNP interactions. The proposed MTHSA-DHEI method can only be applied to detect associations between common SNPs (MAF > 0.05 && MAF < 0.5) and disease status. This needs to be further studied for its application to rare variants.

Availability and implementation

The supplementary files, MATLAB codes and Python code are available at https://github.com/shouhengtuo/MTHSADHEI.