
Community-Based Semantic Subgroup Discovery

  • Blaž Škrlj
  • Jan Kralj
  • Anže Vavpetič
  • Nada Lavrač
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10785)

Abstract

Modern data mining algorithms frequently need to address learning from heterogeneous data and knowledge sources, including ontologies. A data mining task in which ontologies are used as background knowledge is referred to as semantic data mining. A special form of semantic data mining is semantic subgroup discovery, where ontology terms are used in subgroup describing rules. We propose to enhance ontology-based subgroup identification by Community-Based Semantic Subgroup Discovery (CBSSD), taking into account also the structural properties of complex networks related to the studied phenomenon. The application of the developed CBSSD approach is demonstrated on two use cases from the field of molecular biology.

Keywords

Semantic data mining, Bioinformatics, Community detection, Network analysis, Term enrichment analysis

1 Introduction

Modern machine learning approaches are capable of using continuously increasing amounts of information to explain complex phenomena in numerous fields, including biology, sociology, mechanics and electrical engineering. As there can be many distinct types of data associated with a single phenomenon, novel approaches strive towards the integration of different, heterogeneous data and knowledge sources into unified predictive or descriptive models.

In such settings, prior knowledge can play an important role in the development and deployment of learning algorithms in real world scenarios. Background knowledge can come in many forms, which introduces additional complexity to the modeling process, yet can have a great impact on the model’s performance. For example, Bayesian methods can be leveraged to incorporate knowledge about prior states of a system, i.e. prior distributions of random variables being modeled. In the Bayesian setting, the prior knowledge is incorporated via conditional probabilities used in the Bayes rule for posterior probability calculation. For example, Bayesian methodology is in widespread use in the field of phylogenetics, where Bayesian inference is used for reconstruction of evolutionary trees [1]. A different modeling technique was used in a biological application by Madahian et al. [2], where a general linear model was developed to aid gene expression profiling, achieving better predictive accuracy by using prior knowledge based on the index rank of the term “cancer” in the underlying background knowledge. Background knowledge can be encoded also more explicitly, as an additional knowledge source to be used in learning the models. A machine learning discipline that relies heavily on the use of explicitly encoded background knowledge is inductive logic programming (ILP) [3]. In ILP, background knowledge is used along with the examples to derive hypotheses in the form of logic programs, which explain the positive examples.

Semantic Data Mining. Semantic data mining (SDM) [4] is a field of machine learning that employs curated domain knowledge in the form of ontologies as background knowledge used in the learning process. An ontology can be represented as a data structure consisting of semantic triplets T(SPO), which represent the subject, its predicate and the object. The Resource Description Framework (RDF) hypergraph is a data model commonly used to operate at the intersection of data and ontologies. Many existing approaches use background knowledge in the form of an ontology to obtain either more accurate or more general results. Knowledge in the form of ontologies can, for instance, represent constraints specific to a domain. It has been empirically and theoretically demonstrated that using background knowledge as a constraint can improve classification performance [5]. The RDF framework also provides the formalism necessary to leverage graph-theoretic methods for ontology exploration. Random walks and large-scale motif sampling are some of the techniques used to discover indirectly associated biological terms [6]. Semantic clustering is an emerging field, where semantic similarity measures are used to determine the clusters using the background knowledge, in a manner similar to, for example, the k-means family of clustering algorithms. Semantic clustering is frequently used in the area of document clustering [7]. Large databases in the form of RDF triplets exist for many domains. For example, the Bio2RDF project [8] aims at integrating all major biological databases and joining them under a unified framework, which can be queried using SPARQL, a specialized query language. The BioMine methodology is another example of large-scale knowledge graph creation, where biological terms from many different databases are connected into a single knowledge graph with millions of nodes [9].
Despite such large amounts of data being freely accessible, there remain many new opportunities to fully exploit its potential for knowledge discovery.

Semantic Subgroup Discovery. Semantic subgroup discovery (SSD) [10, 11] is a branch of subgroup discovery, a data mining task that follows the rule learning paradigm: it expects a labeled training set, where class labels denote the groups of instances of interest for which descriptive rules are to be learned. Apart from experimental data, a semantic subgroup discovery algorithm leverages background knowledge in the form of ontologies in order to guide the rule learning process. For example, the Hedwig algorithm [10, 12] accepts as input the ontologies and the instances, grouped into different classes. The individual instances are mapped to the ontology terms, while rule learning is guided by the hierarchical relations between the considered ontology terms. Hedwig is capable of using an arbitrary ontology to identify latent relations explaining the discovered subgroups of instances.

Complex Networks. Complex networks are graphs with distinct, real world topological properties [13]. Many natural phenomena can be described using graphs. They can be used to model physical, biological, chemical and mechanical systems [14, 15]. Real world networks can be characterized with distinct statistical properties regarding their node degree distribution, component distribution or connectivity [16]. Complex biological and social networks are also known to include many communities, i.e. smaller, distinct units of a network [17]. Complex networks are commonly used in modeling systems, where extensive background knowledge is not necessarily accessible. Motif finding, community detection and similar methods can provide valuable insights into the latent organization of the observed network.

In this work we propose a methodology, where iteratively constructed complex networks are used to identify relevant subgroups, which are used as input for the process of semantic subgroup discovery. We demonstrate that new knowledge can be obtained using existing, freely accessible heterogeneous data in the form of complex networks and ontologies. In the next sections we present the proposed methodology and demonstrate the use of the new approach on two datasets from the life science domain, where the complementarity with existing enrichment analysis tools is demonstrated.

2 Methodology

This section presents the proposed approach to semantic subgroup discovery from complex networks, named CBSSD (Community-Based Semantic Subgroup Discovery). The proposed approach operates on a list of terms connected with the studied phenomenon. The main steps include: network construction, community detection and semantic subgroup discovery. The proposed methodology is depicted in Fig. 1.
Fig. 1. Schematic representation of the proposed CBSSD procedure. Complex graph’s communities are used to identify possible subgroups in the input term list. The subgroups are further explained using semantic subgroup discovery with background knowledge.

Constructing the Network of Associations. A list of relevant biological terms is used to construct a term network. The network is constructed using the BioMine methodology [9]; individual terms are used as seeds for crawling the BioMine knowledge graph, which already includes millions of term associations across main biological databases, such as UniProt [18], KEGG [19], and GenBank [20]. The final knowledge graph \(G_{f}\) is constructed incrementally, by querying one term at a time. Querying yields a set of graphs \(\{G_1,\dots , G_n\}\), where n is the total number of query terms and, for each i, \(G_i=(V_i, E_i)\). In order to obtain the final graph \(G_f\), node and edge information from \(\{G_{1},\dots ,G_{n}\}\) is joined into a single graph. Throughout the network construction process, nodes and edges cannot be duplicated: once a node is present in the final graph, only new edges can be added. The final set of nodes \(V_{f}\) thus equals \(\bigcup _{i=1}^{n}V_{i}\) and the final set of edges \(E_{f}\) similarly equals \(\bigcup _{i=1}^{n} E_{i}\).
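The union-based construction of the final graph can be illustrated with a minimal sketch. It assumes that each per-query graph is available as a pair of Python sets (nodes and undirected edges); the function name `merge_graphs`, the helper `edge` and the toy node identifiers are illustrative choices and are not part of BioMine or of the reference implementation.

```python
def edge(u, v):
    """Undirected edge stored as a frozenset, so (u, v) and (v, u) coincide."""
    return frozenset((u, v))


def merge_graphs(graphs):
    """Join per-query graphs (V_i, E_i) into a single graph (V_f, E_f).

    Set union guarantees that a node or edge seen in several query
    results is stored only once in the final graph.
    """
    nodes, edges = set(), set()
    for v_i, e_i in graphs:
        nodes |= v_i
        edges |= e_i
    return nodes, edges


# Toy per-query results (hypothetical associations, for illustration only).
g1 = ({"P41180", "GO:0006874"}, {edge("P41180", "GO:0006874")})
g2 = ({"P41180", "GO:0034618"}, {edge("GO:0034618", "P41180")})

v_f, e_f = merge_graphs([g1, g2])
print(len(v_f), len(e_f))
```

Storing edges as frozensets is one possible way to make the union insensitive to edge direction, which matches the undirected interpretation used in the community detection step.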

Community Detection in Homogeneous Networks. Once the network is constructed, a network community detection algorithm is used to identify interesting subsets of the network, which are directly mapped to groups within the input query list. We use the Louvain algorithm [21], which is based on the network modularity measure [22]. For a split of the network into two modules, with \(m_{i}\) and \(m_{j}\) denoting the module membership of the i-th and j-th node, modularity is defined as follows:
$$\begin{aligned} \xi = \frac{1}{2m}\sum _{i=1}^n\sum _{j=1}^n \bigg [A_{i,j} - \frac{d_{i}d_{j}}{2m} \bigg ]\frac{m_{i}m_{j}+1}{2} \end{aligned}$$
(1)
where \(\xi \) represents the modularity, m the number of all edges, A the adjacency matrix (i.e. \(A_{ij}\) is equal to 1 if the i-th and j-th node are connected and 0 if they are not), and \(d_{i}\) the degree of node \(u_i\). The term \(m_i\) is a membership indicator, equal to 1 if node \(u_i\) is present in the observed module and \(-1\) otherwise. The final community partition includes all the nodes. The Louvain algorithm is one of the most scalable community detection methods due to its \(\mathcal {O}(n \log (n))\) time complexity. For this step, the constructed knowledge graph was interpreted as an undirected graph, which is a reasonable assumption as long as we are interested only in biological associations. The community detection procedure returns sets of nodes \(\{C_{1\dots n}\}\) that represent individual communities. Each node in the network belongs to exactly one community (i.e. the communities are non-overlapping). We are interested in finding subgroup descriptions of these communities. In order to do this, each community \(C_i\) becomes a class label \(T_i\). The terms from the input list are partitioned into individual classes according to the community they belong to. This way, input terms are grouped into distinct classes, yet no additional terms are added, as they could introduce unnecessary noise in the semantic subgroup discovery step.
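To make Eq. (1) concrete, the following sketch evaluates the modularity of a two-module split directly from the definition. It assumes an undirected, unweighted graph without self-loops or duplicate edges, given as a list of node pairs; the function name and the toy graph are illustrative. The Louvain algorithm itself optimizes modularity greedily and is not reproduced here.

```python
def two_module_modularity(edges, membership):
    """Evaluate Eq. (1) for a split into two modules.

    edges: list of (u, v) pairs of an undirected, unweighted graph.
    membership: dict mapping each node to +1 or -1 (the indicator m_i).
    """
    nodes = sorted(membership)
    adjacency = {(u, v) for u, v in edges} | {(v, u) for u, v in edges}
    degree = {u: 0 for u in nodes}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    m = len(edges)
    total = 0.0
    for i in nodes:
        for j in nodes:
            a_ij = 1.0 if (i, j) in adjacency else 0.0
            expected = degree[i] * degree[j] / (2 * m)
            # (m_i * m_j + 1) / 2 equals 1 for nodes in the same module, else 0.
            total += (a_ij - expected) * (membership[i] * membership[j] + 1) / 2
    return total / (2 * m)


# Two triangles joined by a single bridge edge (toy example).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
split = {0: 1, 1: 1, 2: 1, 3: -1, 4: -1, 5: -1}
print(two_module_modularity(toy_edges, split))
```

On this toy graph the split into the two triangles scores higher than placing all nodes into one module, which is the behaviour the measure is designed to reward.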
Community Detection in Heterogeneous Networks. As the networks under consideration consist of many distinct layers (node types), our methodology can also account for such organization without additional simplification of the network. For such tasks, we leverage the state-of-the-art InfoMap algorithm for multilayer community detection [23]. The objective of this algorithm is to minimize the description length of a random walk on the network, formulated as the map equation:
$$\begin{aligned} L(M) = q_{\curvearrowright } H(Q) + \sum _{i=1}^{m}p_{\circ }^{i}H(p^{i}) \end{aligned}$$
(2)
where L(M) represents the per-step description length for module partition M. For module partition M of n nodes into m modules, L(M) is the lower bound of the average length of a code word describing the trace of a random walker. The partition resulting in the shortest description length is believed to best represent the network dynamics. The term \(q_{\curvearrowright }\) represents the total probability that the random walker enters any of the m modules under consideration. The entropy of the relative rates H(Q) is used to measure the smallest average code word length that is theoretically possible. The term \(p_{\circ }^{i}\) represents the total probability that any node in module i is visited, plus the probability that the random walker exits the module and the exit code word is used. The entropy \(H(p^{i})\) of the relative rates at which the random walker exits module i and visits each node in module i measures the smallest average code word length that is theoretically possible. For completeness, our approach also includes a variant of the InfoMap algorithm, which detects communities in homogeneous networks, i.e. networks consisting of a single node type.
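The map equation can be evaluated for a given partition with a short sketch. The code below covers only the simplest setting, assumed here for illustration: a two-level partition of an undirected, unweighted, single-layer graph, where the visit rate of a node is its degree divided by 2m and the exit rate of a module is the number of its boundary edges divided by 2m. It does not search for the best partition and does not handle the multilayer case; all names are illustrative.

```python
import math


def entropy(rates):
    """Shannon entropy (in bits) of the rates after normalizing them to sum to 1."""
    total = sum(rates)
    return -sum((r / total) * math.log2(r / total) for r in rates if r > 0)


def map_equation(edges, module_of):
    """Per-step description length L(M) of Eq. (2) for a fixed partition."""
    two_m = 2 * len(edges)
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    visit = {u: d / two_m for u, d in degree.items()}  # stationary visit rates
    modules = sorted(set(module_of.values()))
    exit_rate = {i: 0.0 for i in modules}
    for u, v in edges:
        if module_of[u] != module_of[v]:
            # A boundary edge lets the walker leave either of the two modules.
            exit_rate[module_of[u]] += 1 / two_m
            exit_rate[module_of[v]] += 1 / two_m
    q_total = sum(exit_rate.values())
    length = q_total * entropy(list(exit_rate.values())) if q_total > 0 else 0.0
    for i in modules:
        rates = [exit_rate[i]] + [visit[u] for u in visit if module_of[u] == i]
        length += sum(rates) * entropy(rates)
    return length


# Two triangles joined by a bridge: compare one module against two modules.
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
single = {node: 0 for node in range(6)}
split = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
print(map_equation(toy_edges, single), map_equation(toy_edges, split))
```

For this toy graph the two-module partition yields a shorter description length than the single-module one, in line with the interpretation that the shortest description best reflects the network structure.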

Preparation of the Background Knowledge. Semantic rule learning requires the data to be encoded in the form of RDF triplets T(SPO), where S is the subject, P the predicate and O the object. The experimental data from the previous step was converted into RDF triplets in accordance with Hedwig, the algorithm used in the rule discovery process [10]. Hedwig is capable of leveraging the background knowledge in the form of ontologies to guide the rule construction process. It does so by using the hierarchical relations between the ontology terms. Rules are initially constructed using more general terms and further refined using more specific terms. As the CBSSD methodology is primarily developed for the field of bioinformatics, our main source of background knowledge in this study is the Gene Ontology (GO) [24] database, one of the largest semantic resources for biology. It includes tens of thousands of terms, which together form a directed acyclic graph, directly usable by SSD tools.
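The role of the hierarchical relations can be illustrated with a small sketch. It assumes that the background knowledge is available as (subject, predicate, object) tuples with a subClassOf predicate and that each instance is annotated with a set of terms; the term names and instance identifiers below are invented for illustration and do not reflect the actual GO structure or Hedwig's internal data model.

```python
def ancestors(term, parents):
    """All terms reachable from `term` by following subClassOf links upwards."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def covered_instances(term, annotations, parents):
    """Instances annotated with `term` itself or with any more specific term."""
    return {
        instance
        for instance, terms in annotations.items()
        if any(t == term or term in ancestors(t, parents) for t in terms)
    }


# Hypothetical toy ontology encoded as (S, P, O) triplets.
triplets = [
    ("calcium_homeostasis", "subClassOf", "ion_homeostasis"),
    ("ion_homeostasis", "subClassOf", "biological_process"),
    ("dna_binding", "subClassOf", "molecular_function"),
]
parents = {}
for s, p, o in triplets:
    if p == "subClassOf":
        parents.setdefault(s, set()).add(o)

annotations = {
    "P1": {"calcium_homeostasis"},
    "P2": {"ion_homeostasis"},
    "P3": {"dna_binding"},
}
print(sorted(covered_instances("ion_homeostasis", annotations, parents)))
```

A general term covers at least as many instances as any of its specializations, which is why a general-to-specific search can start from broad terms and refine them only while the rule quality improves.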

For Hedwig to perform rule construction, two conditions must be met. First, individual term names from the community detection step need to have the corresponding GO term mappings, and second, the whole gene ontology must be provided as a source of background knowledge. This requires that the discovered communities are encoded in the form of semantic triplets. Such encoding is achieved by treating each observed community as an individual target class, where all of its terms are considered as instances of this class. The key aspect of the rule generation procedure is the definition of the predicate, which will be used for finding suitable rule conjunctions. The objective function can thus be formulated as learning a rule set \(\varDelta \) for individual classes \(\zeta _{1,..,n}\) using background knowledge (\(\varXi \)) in the form of ontologies, and class instance embeddings (in the semantic space) \(\gamma \), such that the likelihood of individual class representations \(\zeta _{x}\) for \(x \in \{1,..,n\}\) is maximized, which is formulated as follows:
$$\begin{aligned} \varDelta _{\zeta _{1},..,\zeta _{n}} = \mathop {\text {arg max}}\limits _{i \in \{1,..,n\}}\Big [Pr(\varDelta _{\zeta _{i}}|\varXi ,\gamma ) \Big ]. \end{aligned}$$
(3)
By convention, we use the subClassOf predicate when constructing the knowledge base for the Hedwig algorithm. The p-values of individual rules are determined by Fisher’s exact test (FET), a non-parametric, contingency table-based procedure, where a difference in coverage between two rules is leveraged to select the better one. The FET is based on the hypergeometric distribution, in which a random variable X is distributed as
$$\begin{aligned} Pr(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}} \end{aligned}$$
(4)
where N is the number of all examples, K is the total number of positive examples, n is the rule coverage (number of covered terms) and k is the number of covered positives in the context of a single rule or a beam. Further, multiple hypothesis correction (e.g., Bonferroni or Benjamini-Hochberg) is applied in order to limit the number of false discoveries.
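Eq. (4) and the subsequent correction can be worked through with a short sketch. The one-sided tail used as the p-value and the plain Bonferroni adjustment shown here are assumptions made for illustration; the exact test variant and correction used by a particular rule learner may differ. The code relies on `math.comb`, available in Python 3.8 and later.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """Pr(X = k) from Eq. (4): k covered positives among n covered examples."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def fisher_one_sided(k, N, K, n):
    """Probability of covering at least k positives by chance (upper tail)."""
    upper = min(K, n)
    return sum(hypergeom_pmf(x, N, K, n) for x in range(k, upper + 1))


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    count = len(p_values)
    return [min(1.0, p * count) for p in p_values]


# Toy numbers: 10 examples, 4 positives, a rule covering 3 examples, 2 of them positive.
print(hypergeom_pmf(2, N=10, K=4, n=3))
print(fisher_one_sided(2, N=10, K=4, n=3))
print(bonferroni([0.01, 0.5]))
```

For the toy numbers, the point probability is 6 · 6 / 120 = 0.3 and the upper tail adds the single remaining outcome (k = 3), giving 1/3.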
Final Formulation of the CBSSD Approach. First, individual input terms are used to construct the heterogeneous network related to the studied phenomenon. Communities are identified (CommunityDetection step) and the input term list is partitioned according to the presence of individual terms within specific communities (PartitionByCommunity). Finally, background knowledge in the form of ontologies is used to construct meaningful representations of individual partitions. The CBSSD approach can thus be further formalized as described in Algorithm 1.

In Algorithm 1, T represents the input term list, O the ontology used in the semantic learning process, M the mapping from T to O, G a graph generator, and \(S_{f}\) the knowledge graph constructed from the input term list. The stopping criterion for evaluating individual sets of rules can be any statistical measure of rule significance, such as the chi-squared metric or entropy-related measures. The second while loop corresponds to a rule beam update, the key part of semantic subgroup discovery.
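The PartitionByCommunity step named above can be sketched as follows. The sketch assumes that community detection returns a list of node sets and that class labels are named T1, T2, and so on; the function name and the toy identifiers are illustrative and are not taken from the reference implementation.

```python
def partition_by_community(input_terms, communities):
    """Assign each input term the class label of the community containing it.

    Nodes that were added during network construction but are not in the
    input term list are ignored, so no additional terms enter the classes.
    """
    node_to_class = {}
    for index, community in enumerate(communities, start=1):
        for node in community:
            node_to_class[node] = f"T{index}"
    partition = {}
    for term in input_terms:
        label = node_to_class.get(term)
        if label is not None:
            partition.setdefault(label, []).append(term)
    return partition


# Toy example: "x" and "y" stand for nodes found only by crawling the graph.
communities = [{"A", "B", "x"}, {"C", "y"}]
print(partition_by_community(["A", "B", "C"], communities))
```

Because the communities are non-overlapping, every input term found in the network receives exactly one class label.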

There are two computationally expensive steps in the CBSSD approach: community detection and semantic subgroup discovery. The community detection algorithms used [21, 23] were previously shown to scale well up to millions of nodes and edges. The subgroup discovery part uses efficient beam search, where only a limited set of rules is propagated through the search space and continuously upgraded. Furthermore, Hedwig [10, 12] uses efficient parallelism with bitsets.
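The idea behind bitset-based coverage computation can be illustrated with Python integers used as bit vectors. This is only a sketch of the general technique, under the assumption that each instance is assigned a fixed bit position; it does not reproduce Hedwig's actual data structures, and all names are illustrative.

```python
def to_bitset(covered, index_of):
    """Encode a set of instances as an integer with one bit per instance."""
    bits = 0
    for instance in covered:
        bits |= 1 << index_of[instance]
    return bits


def popcount(bits):
    """Number of set bits, i.e. the number of encoded instances."""
    return bin(bits).count("1")


instances = ["P1", "P2", "P3", "P4"]
index_of = {name: position for position, name in enumerate(instances)}

term_a = to_bitset({"P1", "P2", "P3"}, index_of)   # instances covered by term A
term_b = to_bitset({"P2", "P3", "P4"}, index_of)   # instances covered by term B
positives = to_bitset({"P2"}, index_of)            # instances of the target class

conjunction = term_a & term_b          # coverage of the rule "A and B"
print(popcount(conjunction))           # rule coverage n
print(popcount(conjunction & positives))  # covered positives k
```

With this encoding, the coverage of a conjunction of terms is a single bitwise AND, and the counts n and k needed for the significance test follow from counting set bits.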

CBSSD consists of many distinct steps, and a parameter-free formulation would not cover all possible uses. First, the community detection step is parameterized in terms of the number of iterations, as well as the detection type, which may or may not include information on multilayer edges. Initial network construction is parameterized by the number of concurrent terms sent as a query to the BioMine graph crawler. A larger number of terms results in more coarse-grained networks, and thus smaller numbers are preferred (e.g., in the range from 1 to 10). Note that a smaller number of concurrent terms results in a longer network construction step. Some of the key parameters for the rule learning part include beam size, search heuristic and significance thresholds. Larger beam sizes naturally result in larger rule sets, and hence in longer execution times.

3 Using CBSSD for Knowledge Discovery

This section demonstrates the use of the proposed methodology on two real world datasets from the life science domain. First, we consider properties of amino-acid variants within protein binding sites, followed by cancer related transcription factors identified in the context of epigenetics.

3.1 Discovery of Properties of Proteins with Single Amino-Acid Variants Present in the Binding Sites

Sequence variants are nucleotide or amino acid substitutions that can lead to unstable protein interaction complexes and thus influence the organism’s phenotype (e.g., induce a disease state). There are two main types of variants: polymorphisms or germ-line variants that are heritable, and somatic mutations that appear in somatic tissues without previous genetic encoding. Although it was demonstrated that variants within biological interactions can be associated with disease occurrence [25, 26, 27, 28], currently there are no studies of this phenomenon aimed at discovering new subgroups of proteins associated with variants within interaction sites at a more general level.

We use the results from a previous enrichment analysis study [25] for comparison with the proposed CBSSD methodology. Enrichment analysis in the context of this study is concerned with the identification of single significant terms, associated with the studied phenomenon. The results are compared based on terms appearing in both approaches, i.e. terms found as a result of enrichment analysis as well as a result of semantic subgroup discovery. As the two compared approaches are fundamentally different, the intersection of both results is expected to be relatively small (highly significant terms).

Preparing the Input for Semantic Subgroup Discovery. More than 300 UniProt terms, for which variants were found within protein binding sites, were used as the input query list (found in supplementary material of [25]). A BioMine knowledge graph with more than 1,650 nodes and 2,300 edges was constructed. The resulting network is depicted in Fig. 2 (see footnote 1).

Triplet construction consists of first mapping the nodes from the knowledge graph to the associated ontology terms, followed by the construction of the background knowledge. In this application, the Gene Ontology [24] was used in both steps. Semantic subgroup discovery was conducted for more than 20 communities, and as the main result, more than 100 rules of various lengths were obtained. The most significant and the longest rules were manually inspected to identify possible overlap with previous pathway enrichment studies done on the same input dataset. Different beam sizes (from 10 to 50) were tested in the procedure.
Fig. 2. Final size of the BioMine network, associated with polymorphisms located within protein interaction sites.

Results. The obtained rule sets for the identified communities were further inspected. We directly compared the ontology terms present in the rules with the terms identified as significant in our previous study [25]. For this naïve comparison, conjuncts were considered as individual entries, as we were only interested in term presence (not coverage). There were 13 Gene Ontology terms present in both approaches (Table 1). Although only 13 terms were found with both procedures, the identified terms were among the most significant ones detected in the enrichment analysis setting. This indicates that both procedures identified a strong signal related to DNA and cell cycle related processes. As semantic subgroup discovery was conducted for separate communities, the results were expected to be more detailed and comprehensive. This was indeed the case: given that many CBSSD rules consist of two conjuncts, these rules are potentially more informative than the ones identified by ontology enrichment analysis. As iron binding proteins were present in the protein list (this was known from the previous study [25]), rule \(R=\) GO:0034618 \(\wedge \) GO:0006874 appeared as one of the most significant rules (\(p<0.1\)). Ontology terms in this rule represent arginine binding and cellular calcium homeostasis, both processes involving terms from a part of the input term list that was not directly detected with enrichment analysis. The key UniProt term found for this rule was P41180 (CASR), which represents the extracellular calcium-sensing receptor [29]. As CASR is indeed critical for calcium homeostasis (GO:0006874), it serves as an indicator of the validity of our CBSSD approach. The second term (GO:0034618), representing arginine binding, is not so directly associated with the CASR protein. To further investigate the context within which GO:0034618 occurs, we queried the Gene Ontology database directly for similar proteins already associated with this term. The majority of proteins annotated with this term correspond to acetylglutamate kinase, an enzyme that participates in the metabolism of amino acids (e.g., urea cycle). A possible interpretation of this association is that the CASR protein induces a hormonal response, which could effectively lead to increased amino-acid metabolism, providing the molecular components necessary for the establishment of homeostasis. This association serves as a possible candidate for further experimental testing and demonstrates the hypothesis generation capabilities of the proposed approach.
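The term-presence comparison described above, in which conjuncts are treated as individual entries, amounts to a set intersection. The sketch below assumes that rules are available as tuples of GO identifiers and that the enrichment result is a set of identifiers; the inputs are toy values chosen for illustration and do not reproduce the rule sets or enrichment results of the study.

```python
def rule_terms(rules):
    """Flatten rules (tuples of conjuncts) into the set of terms they mention."""
    return {term for rule in rules for term in rule}


def shared_terms(rules, enriched):
    """Terms that occur both in at least one rule and in the enrichment result."""
    return sorted(rule_terms(rules) & set(enriched))


# Toy inputs (illustrative only).
toy_rules = [("GO:0034618", "GO:0006874"), ("GO:0003677",)]
toy_enriched = {"GO:0003677", "GO:0000077"}
print(shared_terms(toy_rules, toy_enriched))
```

Since only term presence is compared, a term counts as shared regardless of how many rules contain it or how many instances those rules cover.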
Table 1. Gene ontology terms, found both in enrichment and semantic rule learning process. Terms marked with * emerged as the most statistically significant (\(p < 0.1\)) and therefore relevant for semantic subgroup discovery.

Gene ontology term    Meaning
GO:0000077            DNA damage checkpoint*
GO:0000086            Mitotic cell cycle*
GO:0003677            DNA binding*
GO:0004871            Signal transducer activity*
GO:0005730            Nucleolus*
GO:0005814            Centriole
GO:0016020            Membrane
GO:0016605            PML body
GO:0030018            Z-disc
GO:0035264            Multicellular organism growth
GO:0045892            Negative regulation of transcription (DNA)
GO:0000122            Negative regulation of transcription (RNA)
GO:0000785            Chromatin

Another interesting rule emerged from the first community identified. Rule GO:0030903 \(\wedge \) GO:0000006 was found for UniProt entries Q96SN8 (CDK5 regulatory subunit-associated protein 2), O94986 (Centrosomal protein), Q9HC77 (Centromere protein J) and O43303 (Centriolar coiled-coil protein). It can be observed that all the identified proteins are connected with nucleus-related processes. Term GO:0030903 corresponds to notochord development, which is a stage in cell division, a term directly associated with the identified proteins. The second term, GO:0000006, corresponds to high-affinity zinc uptake transmembrane transporter activity, a process related to the enzyme system responsible for cell division and proliferation. Although this rule does not imply any new hypothesis, it demonstrates the generalization capability of the proposed approach.

Many terms remain specific to either semantic rule discovery based on community detection or enrichment analysis. This discrepancy arises because community detection splits the input term list into smaller lists, which can be described by completely different terms than the list as a whole. As the proposed methodology splits the input list, it is not sensible to compare it directly with conventional approaches, which operate on whole lists. Both approaches cover approximately the same percentage of input terms. CBSSD’s coverage is \(12.02\%\) with 218 GO terms, whereas the term coverage for conventional enrichment is \(12.3\%\) with 881 GO terms. The term discrepancy serves only as evidence of the fundamental difference between the two approaches. Nevertheless, we demonstrate that our approach is a useful complementary methodology to the well established enrichment analysis.

3.2 Grouping of Epigenetic Factors

Epigenetics is a field in which processes such as methylation are studied in the context of the influence of the environment on the phenotype. Epigenetic factors are actively researched and constantly updated in databases such as emDB [30], where information such as gene expression, tissue information and variant information is publicly accessible. We tested the developed approach on a list of currently known epigenetic factors related to cancer. The epigenetics dataset was chosen for two main reasons: first, to demonstrate CBSSD’s performance on a dataset that, to our knowledge, has not yet been used in semantic subgroup discovery, and second, to further test the developed methodology in the context of a different biological process. In total, 153 distinct UniProt terms were used as input for the BioMine knowledge graph construction. The final graph consisted of approximately 4,500 nodes and 5,500 edges. The obtained knowledge graph is considerably larger than the one used in the previous case study (properties of SNVs in binding sites) and thus demonstrates the capabilities of the developed approach on larger graphs.

More than 50 communities were identified and further inspected. For the community including UniProt term Q8WTS6 (Histone-lysine N-methyltransferase), many interesting rules emerged. For example, the rule GO:1990785 \(\wedge \) GO:0000975 \(\wedge \) GO:0000082 (\(p=0.09\)) indicates that the protein is indeed highly associated with epigenetic processes. Term GO:1990785 describes water-immersion restraint stress, term GO:0000975 regulatory region DNA binding and term GO:0000082 transition of mitotic cell cycle. All three terms describe the Q8WTS6 entry, as it affects the DNA’s topological properties (coil formation) and is responsible for transcriptional activation of genes, which code for collagenases, enzymes crucial to mitotic cell cycle (wall formation). To further analyze CBSSD’s generalization capabilities, we plotted all rule sets (communities) against all GO terms identified as enriched by the DAVID Bioinformatics Suite [31]. As this experiment is conducted using only terms previously identified as significant, CBSSD’s significance threshold was relaxed to \(p=0.5\). This additional relaxation was introduced to cover more possibly interesting patterns, which would otherwise be considered noise or false positive results.
Fig. 3. Visualizing higher order abstraction emerging from previously enriched terms associated with epigenetic regulators. It can be observed (inset image) that only a couple of terms correspond to multi-term rules (red rectangles). Terms such as GO:0000118 represent very high level terms, associated with the majority of epigenetics-related processes. Such terms are most commonly included in more complex rules. (Color figure online)

The semantic landscape obtained in this experiment is depicted in Fig. 3. It can be observed that only a handful of GO terms serve as a basis for more complex rules. For this example, these terms include GO:0000118, which represents the histone deacetylase complex, one of the key mechanisms for histone structure regulation; GO:0000112, representing negative regulation of transcription from RNA polymerase II promoter, a mechanism by which many epigenetic regulators influence the transcription patterns; GO:0000183, representing chromatin silencing at rDNA; GO:0000785 and GO:0000790, representing chromatin in general; GO:0000976, representing transcription regulatory region sequence-specific DNA binding; and GO:0001046, which represents core promoter sequence-specific DNA binding. The described terms are all fundamentally associated with epigenetic regulation, which indicates that CBSSD was able to use the more general terms to construct meaningful rules. Overall, \(27\%\) of all significant terms identified via conventional enrichment analysis were also found via the CBSSD algorithm. Such a low percentage is expected, as CBSSD builds upon individual subsets of the larger term set used in conventional enrichment. This result implies that the higher level terms are similar in both approaches, yet CBSSD identified latent patterns, which cannot be detected via conventional enrichment. The higher level terms appear to form the base for more complex rules. Similar behavior was reported as a result of the SegMine methodology [32], which, similarly to CBSSD, exploits the explanatory power of rules in order to find enriched parts of input term lists. Coverage-wise, both approaches perform the same, as CBSSD’s coverage is \(96.7\%\) with 230 GO terms, whereas the term coverage for conventional enrichment is \(96.7\%\) with 360 GO terms. Similarly to the first case study, CBSSD needed fewer GO terms to cover approximately the same percentage of the input term list.

4 Conclusions and Further Work

Semantic data mining is an emerging field, where background knowledge in the form of ontologies can be used to generalize the rules emerging from the learning process. In this study, we demonstrate how such an approach can be used to induce rules describing the communities detected on an automatically constructed knowledge graph. Our implementation was tested on two data sets from the life science domain, where the validity of the most significant rules was manually inspected in terms of biological context. This approach works for up to 6,000 terms in reasonable time (e.g., a day), but for more than, e.g., 10,000 terms, whole graphs should be used from the beginning, if possible. As the number of rules produced can be large, adequate visualization techniques for elegant result inspection are still to be developed. Our approach differs significantly from conventional enrichment analysis, as interesting groups of terms are identified based on the underlying network structure, rather than manual, expert-guided selection. We currently see CBSSD as a complementary methodology to enrichment analysis, as it is capable of describing latent patterns beyond the ones expected by a domain expert. Further work includes extensive testing of CBSSD on larger data sets, possibly from many different domains.

Availability. The Community-based subgroup discovery reference implementation is freely available at https://github.com/SkBlaz/CBSSD.

Footnotes

  1. Plotted with the Py3Plex library (https://github.com/SkBlaz/Py3Plex).

Acknowledgments

This research was funded by the Slovenian Research Agency project HinLife: Analysis of Heterogeneous Information Networks for Knowledge Discovery in Life Sciences (J7-7303), as well as the Human Brain Project (FET Flagship grant FP7-ICT-604102). The authors also gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan-XP GPU used for this research.

References

  1. Drummond, A.J., Rambaut, A.: BEAST: Bayesian evolutionary analysis by sampling trees. BMC Evol. Biol. 7(1), 214 (2007)
  2. Madahian, B., Deng, L., Homayouni, R.: Development of a literature informed Bayesian machine learning method for feature extraction and classification. BMC Bioinform. 16(Suppl. 15), P9 (2015)
  3. Lavrač, N., Džeroski, S.: Inductive Logic Programming (1994)
  4. Vavpetič, A., Lavrač, N.: Semantic subgroup discovery systems and workflows in the SDM-toolkit. Comput. J. 56(3), 304–320 (2012)
  5. Balcan, N., Blum, A., Mansour, Y.: Exploiting structures and unlabeled data for learning. In: Proceedings of the 30th International Conference on Machine Learning, ICML 2013, vol. 28, pp. 1112–1120 (2013)
  6. Liu, H., Dou, D., Jin, R., LePendu, P., Shah, N.: Mining biomedical ontologies and data using RDF hypergraphs. In: 2013 Proceedings of the 12th International Conference on Machine Learning and Applications (ICMLA), vol. 1, pp. 141–146. IEEE (2013)
  7. Hotho, A., Staab, S., Stumme, G.: Ontologies improve text document clustering. In: Third IEEE International Conference on Data Mining, pp. 2–5 (2003)
  8. Belleau, F., Nolin, M.A., Tourigny, N., Rigault, P., Morissette, J.: Bio2RDF: towards a mashup to build bioinformatics knowledge systems. J. Biomed. Inf. 41(5), 706–716 (2008)
  9. Eronen, L., Toivonen, H.: Biomine: predicting links between biological entities using network models of heterogeneous databases. BMC Bioinform. 13(1), 119 (2012)
  10. Vavpetič, A., Novak, P.K., Grčar, M., Mozetič, I., Lavrač, N.: Semantic data mining of financial news articles. In: Fürnkranz, J., Hüllermeier, E., Higuchi, T. (eds.) DS 2013. LNCS (LNAI), vol. 8140, pp. 294–307. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40897-7_20
  11. Langohr, L., Podpečan, V., Petek, M., Mozetič, I., Gruden, K., Lavrač, N., Toivonen, H.: Contrasting subgroup discovery. Comput. J. 56(3), 289–303 (2012)
  12. Adhikari, P.R., Vavpetič, A., Kralj, J., Lavrač, N., Hollmén, J.: Explaining mixture models through semantic pattern mining and banded matrix visualization. Mach. Learn. 105(1), 3–39 (2016)
  13. Cohen, R., Havlin, S.: Complex Networks: Structure, Robustness and Function. Cambridge University Press, Cambridge (2010)
  14. Palla, G., Derényi, I., Farkas, I., Vicsek, T.: Uncovering the overlapping community structure of complex networks in nature and society. arXiv preprint physics/0506133 (2005)
  15. Vrabič Rok, H.D., Butala, P.: Discovering autonomous structures within complex networks of work systems. CIRP Ann. Manuf. Technol. 61(1), 423–426 (2012)
  16. Strogatz, S.H.: Exploring complex networks. Nature 410(6825), 268 (2001)
  17. Duch, J., Arenas, A.: Community detection in complex networks using extremal optimization. Phys. Rev. E 72(2), 027104 (2005)
  18. The UniProt Consortium, et al.: UniProt: the universal protein knowledgebase. Nucleic Acids Res. 45(D1), D158–D169 (2017)
  19. Kanehisa, M., Goto, S.: KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1), 27–30 (2000)
  20. Benson, D.A., Cavanaugh, M., Clark, K., Karsch-Mizrachi, I., Lipman, D.J., Ostell, J., Sayers, E.W.: GenBank. Nucleic Acids Res. 41(D1), D36–D42 (2012)
  21. Blondel, V.D., Guillaume, J.L., Lambiotte, R., Lefebvre, E.: Fast unfolding of communities in large networks. J. Stat. Mech. Theory Exp. 2008(10), P10008 (2008)
  22. Newman, M.E.: Modularity and community structure in networks. Proc. Nat. Acad. Sci. 103(23), 8577–8582 (2006)
  23. Rosvall, M., Axelsson, D., Bergstrom, C.T.: The map equation. Eur. Phys. J. Spec. Topics 178(1), 13–23 (2009)
  24. Ashburner, M., Ball, C.A., Blake, J.A., Botstein, D., Butler, H., Cherry, J.M., Davis, A.P., Dolinski, K., Dwight, S.S., Eppig, J.T., et al.: Gene ontology: tool for the unification of biology. Nat. Genet. 25(1), 25–29 (2000)
  25. Škrlj, B., Konc, J., Kunej, T.: Identification of sequence variants within experimentally validated protein interaction sites provides new insights into molecular mechanisms of disease development. Mol. Inform. 36, 1–8 (2017)
  26. Škrlj, B., Kunej, T.: Computational identification of non-synonymous polymorphisms within regions corresponding to protein interaction sites. Comput. Biol. Med. 79, 30–35 (2016)
  27. Schröder, N.W., Schumann, R.R.: Single nucleotide polymorphisms of toll-like receptors and susceptibility to infectious disease. Lancet Infect. Dis. 5(3), 156–164 (2005)
  28. Kamburov, A., Lawrence, M.S., Polak, P., Leshchiner, I., Lage, K., Golub, T.R., Lander, E.S., Getz, G.: Comprehensive assessment of cancer missense mutation clustering in protein structures. Proc. Nat. Acad. Sci. 112(40), E5486–E5495 (2015)
  29. Garrett, J.E., Capuano, I.V., Hammerland, L.G., Hung, B.C., Brown, E.M., Hebert, S.C., Nemeth, E.F., Fuller, F.: Molecular cloning and functional expression of human parathyroid calcium receptor cDNAs. J. Biol. Chem. 270(21), 12919–12925 (1995)
  30. Nanda, J.S., Kumar, R., Raghava, G.P.: dbEM: a database of epigenetic modifiers curated from cancerous and normal genomes. Sci. Rep. 6, 19340 (2016)
  31. Huang, D.W., Sherman, B.T., Tan, Q., Kir, J., Liu, D., Bryant, D., Guo, Y., Stephens, R., Baseler, M.W., Lane, H.C., et al.: DAVID bioinformatics resources: expanded annotation database and novel algorithms to better extract biology from large gene lists. Nucleic Acids Res. 35(2), W169–W175 (2007)
  32. Podpečan, V., Lavrač, N., Mozetič, I., Novak, P.K., Trajkovski, I., Langohr, L., Kulovesi, K., Toivonen, H., Petek, M., Motaln, H., et al.: SegMine workflows for semantic microarray data analysis in Orange4WS. BMC Bioinform. 12(1), 416 (2011)

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Blaž Škrlj (1)
  • Jan Kralj (2)
  • Anže Vavpetič (2)
  • Nada Lavrač (2, 3)

  1. Jožef Stefan International Postgraduate School, Ljubljana, Slovenia
  2. Jožef Stefan Institute, Ljubljana, Slovenia
  3. University of Nova Gorica, Nova Gorica, Slovenia
