ISCIS 2016: Computer and Information Sciences, pp. 90–96
Cosine Similarity-Based Pruning for Concept Discovery
Abstract
In this work we focus on improving the time efficiency of Inductive Logic Programming (ILP)-based concept discovery systems. Such systems have scalability issues mainly due to the evaluation of large search spaces. Evaluating the search space consists of translating candidate concept descriptors into SQL queries, which involve a number of equijoins on several tables, and running them against the dataset. We aim to improve the time efficiency of such systems by reducing the number of queries executed on a DBMS. To this aim, we utilize cosine similarity to measure the similarity of arguments that go through equijoins and prune those with 0 similarity. The proposed method is implemented as an extension to an existing ILP-based concept discovery system called Tabular CRIS w-EF, and experimental results show that the proposed method reduces the number of queries executed by around 15 %.
Keywords
Cosine Similarity · Inductive Logic Programming · Query Optimization · Concept Descriptor · Large Search Space
1 Introduction
Concept discovery [3] is a multi-relational data mining task and is concerned with inducing logical definitions of a relation, called target relation, in terms of other provided relations, called background knowledge. It has extensively been studied under Inductive Logic Programming (ILP) [12] research and successful applications are reported [2, 4, 7, 10].
ILP-based concept discovery systems consist of two main steps, namely search space formation and search space evaluation. In the first step candidate concept descriptors are generated, and in the second step the candidate concept descriptors are converted into queries, i.e. SQL queries, and run against the dataset. As the search space is generally large and the queries involve multiple joins over several tables, the second step is computationally expensive and dominates the total running time of a concept discovery system. Several methods, such as parallelization and memoization, have been investigated to improve the running time of the search space evaluation step.
In this paper we propose a method that improves the running time of concept discovery systems by reducing the number of SQL queries run on a database. The proposed method calculates the cosine similarity of the table arguments that appear in a query, and prunes the queries whose arguments have 0 similarity. To realize this, (i) a term-document count matrix is built, where the domain values of table arguments correspond to terms and the arguments themselves correspond to documents, and (ii) the cosine similarity of the table arguments that participate in a query is calculated from the term-document count matrix and those with 0 similarity are pruned.
The proposed method is implemented as an extension to an existing concept discovery system called Tabular CRIS w-EF [14, 15]. To evaluate the performance of the proposed method, several experiments are conducted on data sets that belong to different learning problems. The experimental results show that the proposed method reduces the number of queries executed by 15 % on average without any loss in the accuracy of the system.
The rest of the paper is organized as follows. In Sect. 2 we provide the background related to the study, in Sect. 3 we introduce the proposed method, and in Sect. 4 we present and discuss the experimental results. The last section concludes the paper.
2 Background
Concept discovery is a predictive multi-relational data mining problem. Given a set of facts, called target instances, and related observations, called background knowledge, concept discovery is concerned with inducing logical definitions of the target instances in terms of the background knowledge. The problem has primarily been studied by the ILP community and successful applications have been reported.
In ILP-based concept discovery systems data is represented within a first order logic framework and concept descriptors are generated by specialization or generalization of an initial hypothesis. ILP-based concept discovery systems follow a generate-and-test approach to find a solution and usually build large search spaces. Evaluation of the search space consists of translating concept descriptors into queries and running them against the data set. Evaluation of the queries is computationally expensive as queries involve multiple joins over tables. To improve the running time of such systems several methods including parallelization [9], caching [13], and query optimization [20] have been proposed. In parallelization-based approaches the search space is either built or evaluated in parallel by multiple processors; in caching-based methods queries and their results are stored in hash tables in case the same query is regenerated; and in query optimization-based approaches several query optimization techniques are applied to improve the running time of the search space evaluation step.
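The caching idea can be sketched as a hash table keyed by the query text. This is a simplified illustration, not the policy-based memoization of [13]; the class and names are made up for this sketch:

```python
class QueryCache:
    """Toy query cache: store each evaluated query with its result so a
    regenerated candidate is answered from the table instead of the DBMS."""

    def __init__(self):
        self._table = {}
        self.hits = 0

    def evaluate(self, query, run_on_dbms):
        if query in self._table:        # cache hit: skip the DBMS call
            self.hits += 1
            return self._table[query]
        result = run_on_dbms(query)     # cache miss: run once and memoize
        self._table[query] = result
        return result


cache = QueryCache()
fake_dbms = lambda q: len(q)            # stand-in for a real query executor
cache.evaluate("SELECT ...", fake_dbms)
cache.evaluate("SELECT ...", fake_dbms)  # second call served from the cache
print(cache.hits)                        # -> 1
```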
Cosine similarity is a popular metric to measure the similarity of data that can be represented as vectors. The cosine similarity of two vectors is the inner product of these vectors divided by the product of their lengths. A cosine similarity of −1 indicates exact opposition, 1 indicates exact correlation, and 0 indicates decorrelation between the vectors. It has been applied in several domains including text document clustering [5] and face verification [16].
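The definition above can be written directly as a short function; a minimal sketch:

```python
from math import sqrt


def cosine_similarity(u, v):
    """Inner product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0                      # convention for all-zero vectors
    return dot / (norm_u * norm_v)


# 1: exact correlation, 0: decorrelation, -1: exact opposition
print(cosine_similarity([1, 0, 2], [2, 0, 4]))  # same direction -> 1.0
print(cosine_similarity([1, 0], [0, 1]))        # orthogonal -> 0.0
```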
In this work we propose to measure the cosine similarity of table arguments that partake in equijoins and to prune those with cosine similarity of 0 without running them against the data set. To achieve this, we first group the attributes that belong to the same domain and build a term-document matrix for each domain, where the domain values of the attributes constitute the terms and the individual arguments constitute the documents. When two arguments go through an equijoin, we calculate their cosine similarity from the term-document matrix and prune those queries that have cosine similarity of 0. The proposed method is implemented as an extension to an existing ILP-based concept discovery system called Tabular CRIS w-EF, which employs association rule mining techniques to find frequent and strong concept descriptors and utilizes memoization techniques to improve the search space evaluation step of its predecessor CRIS [6].
3 Proposed Method
Sample concept descriptor evaluation query
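As a hedged illustration of such an evaluation query (the original listing's schema is not shown here; the descriptor, table and column names below are made up), a candidate concept descriptor such as daughter(A, B) :- parent(B, A), female(A) could be translated into an equijoin query that counts the target instances it covers:

```python
import sqlite3

# Hypothetical schema and data, for illustration only.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE daughter(arg1 TEXT, arg2 TEXT);
    CREATE TABLE parent(arg1 TEXT, arg2 TEXT);
    CREATE TABLE female(arg1 TEXT);
    INSERT INTO daughter VALUES ('mary', 'ann'), ('eve', 'tom');
    INSERT INTO parent VALUES ('ann', 'mary'), ('tom', 'eve');
    INSERT INTO female VALUES ('mary'), ('ann');
""")

# Shared variables in the descriptor become equijoins between the tables.
covered = con.execute("""
    SELECT COUNT(*)
    FROM daughter d
    JOIN parent p ON p.arg1 = d.arg2 AND p.arg2 = d.arg1
    JOIN female f ON f.arg1 = d.arg1
""").fetchone()[0]
print(covered)  # 'mary' is covered; 'eve' is not in female -> 1
```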
In such a transformation, arguments with the same value go through equijoins. The proposed method targets such equijoins and prevents execution of queries that involve equijoins whose participating arguments have cosine similarity 0.
The proposed method proceeds as follows:
- (1) arguments are grouped based on their domains,
- (2) for each such group a term-document matrix is formed, where the values of the domain are the terms, the arguments are the documents, and the values of an argument are the bag of words of that argument,
- (3) for each term-document matrix a cosine similarity matrix is calculated.
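The three steps above can be sketched as follows. This is a simplified illustration: the domain, the argument names, and their values are made up, and only one pairwise similarity is shown instead of the full similarity matrix:

```python
from collections import Counter
from math import sqrt

# Step 1: group arguments by domain (all three belong to a 'person' domain).
domains = {"person": {
    "parent.arg1": ["ann", "ann", "tom"],   # column values = bag of words
    "female.arg1": ["mary", "ann"],
    "male.arg1":   ["bob", "jim"],          # no value shared with parent.arg1
}}


# Step 2: count vector of one argument over the domain's terms.
def count_vector(values, terms):
    c = Counter(values)
    return [c[t] for t in terms]


# Step 3: cosine similarity of two count vectors.
def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


for domain, args in domains.items():
    terms = sorted({t for vals in args.values() for t in vals})
    vectors = {a: count_vector(vals, terms) for a, vals in args.items()}
    # Shared value 'ann' -> non-zero similarity: the equijoin is kept.
    print(round(cosine(vectors["parent.arg1"], vectors["female.arg1"]), 3))
    # Disjoint values -> similarity 0: the query is pruned without running it.
    print(cosine(vectors["parent.arg1"], vectors["male.arg1"]))  # -> 0.0
```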
Query for creating a count vector for rel.arg1
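One plausible form of such a query is a GROUP BY count over the argument's column; the relation and column names follow the caption, and the data is made up for this sketch:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE rel(arg1 TEXT, arg2 TEXT)")
con.executemany("INSERT INTO rel VALUES (?, ?)",
                [("a", "x"), ("a", "y"), ("b", "x")])

# One count per distinct value of rel.arg1; together these counts form the
# column of the term-document matrix that corresponds to rel.arg1.
rows = con.execute(
    "SELECT arg1, COUNT(*) FROM rel GROUP BY arg1 ORDER BY arg1"
).fetchall()
print(rows)  # -> [('a', 2), ('b', 1)]
```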
In the literature, there exist several ILP-based concept discovery systems that work on Prolog engines [11, 17]. Such systems benefit from depth-bounded interpreters for theorem proving to test possible concept descriptors. The proposed method is also applicable to such systems, as in Prolog notation each predicate can be considered a table and the arguments of a literal can be considered the fields of that table. With such a transformation, the proposed method can be utilized to prune hypotheses in ILP-based concept discovery systems that work in Prolog-like environments.
In terms of algorithmic complexity, the proposed method consists of two main steps: (i) matrix construction and (ii) cosine similarity calculation. To construct the matrices, one SQL query needs to be run for each literal argument. The cosine similarity calculation is quadratic in the number of arguments, hence the method is applicable to real-world data sets.
4 Experimental Results
Table 1. Experimental parameters for each data set used
| Data set | Num of relations | Num of instances | Argument types |
|---|---|---|---|
| Dunur | 9 | 234 | Categorical |
| Elti | 9 | 234 | Categorical |
| Eastbound | 12 | 196 | Categorical, real |
| Mesh | 26 | 1749 | Categorical, real |
| Mutagenesis | 8 | 16,544 | Categorical, real |
| PTE | 32 | 29,267 | Categorical, real |
In Table 2 we report the experimental results. The Queries column under Improvement (%) shows the decrease in the number of queries when the proposed method is employed. The experimental results show that the proposed method performs well on the data sets that are highly relational, i.e. the Dunur and Elti data sets. The proposed method performs slightly worse on the data sets that contain numerical attributes as well as categorical attributes than on those that contain only categorical attributes. This is due to the fact that arguments from categorical domains go through equijoins, while arguments that belong to numerical domains go through less-than (<) and greater-than (>) comparisons in SQL statements.
Table 2. Improvements of the proposed method
| Data set | Num. rules (CRIS w-EF) | Num. queries (CRIS w-EF) | Time, mm:ss.sss (CRIS w-EF) | Num. rules (pruned) | Num. queries (pruned) | Time, mm:ss.sss (pruned) | Rules impr. (%) | Queries impr. (%) | Time impr. (%) |
|---|---|---|---|---|---|---|---|---|---|
| Dunur | 1887 | 5807 | 00:02.086 | 1279 | 4607 | 00:01.783 | 32.22 | 20.66 | 14.54 |
| Elti | 1741 | 5333 | 00:02.655 | 1422 | 4922 | 00:02.470 | 18.32 | 7.71 | 6.99 |
| Eastbound | 7294 | 34654 | 00:04.091 | 6805 | 32665 | 00:03.895 | 6.70 | 5.74 | 4.77 |
| Mesh | 56512 | 249084 | 00:27.302 | 54314 | 238982 | 00:27.314 | 3.89 | 4.06 | −0.05 |
| Mutagenesis | 62486 | 223644 | 34:04.099 | 55477 | 216635 | 33:42.752 | 11.22 | 3.13 | 1.04 |
| PTE | 64322 | 237082 | 35:50.340 | 58503 | 231191 | 35:15.975 | 9.05 | 2.48 | 1.60 |
| PTE No Aggr. | 11166 | 43862 | 03:46.457 | 10328 | 43024 | 03:40.578 | 7.50 | 1.91 | 2.60 |
5 Conclusion
Concept discovery systems face scalability issues due to the evaluation of the large search spaces they build. In this paper we propose a pruning mechanism based on cosine similarity to improve the running time of concept discovery systems. The proposed method calculates the cosine similarity of arguments that participate in equijoins and prunes those concept descriptors that have arguments with cosine similarity 0. The proposed method is applicable to concept discovery systems that work on relational databases or Prolog-like engines. The experimental results show that the proposed method decreased the number of concept descriptor evaluations by around 15 % on average, and improved the running time of the system by around 7.5 % on average.
References
- 1. Dolšak, B.: Finite element mesh design expert system. Knowl. Based Syst. 15(5), 315–322 (2002)
- 2. Dolsak, B., Muggleton, S.: The application of inductive logic programming to finite element mesh design. In: Inductive Logic Programming, pp. 453–472. Academic Press (1992)
- 3. Dzeroski, S.: Multi-relational data mining: an introduction. SIGKDD Explor. 5(1), 1–16 (2003). doi:10.1145/959242.959245
- 4. Feng, C.: Inducing temporal fault diagnostic rules from a qualitative model. In: Proceedings of the Eighth International Workshop (ML91), Northwestern University, Evanston, Illinois, USA, pp. 403–406 (1991)
- 5. Huang, A.: Similarity measures for text document clustering. In: Proceedings of the Sixth New Zealand Computer Science Research Student Conference (NZCSRSC2008), Christchurch, New Zealand, pp. 49–56 (2008)
- 6. Kavurucu, Y., Senkul, P., Toroslu, I.H.: ILP-based concept discovery in multi-relational data mining. Expert Syst. Appl. 36(9), 11418–11428 (2009). doi:10.1016/j.eswa.2009.02.100
- 7. King, R.D., Muggleton, S., Lewis, R.A., Sternberg, M.: Drug design by machine learning: the use of inductive logic programming to model the structure-activity relationships of trimethoprim analogues binding to dihydrofolate reductase. Proc. Nat. Acad. Sci. 89(23), 11322–11326 (1992)
- 8. Larson, J., Michalski, R.S.: Inductive inference of VL decision rules. ACM SIGART Bull. 63, 38–44 (1977)
- 9. Matsui, T., Inuzuka, N., Seki, H., Itoh, H.: Comparison of three parallel implementations of an induction algorithm. In: 8th International Parallel Computing Workshop, pp. 181–188. Citeseer (1998)
- 10. Muggleton, S., King, R., Sternberg, M.: Predicting protein secondary structure using inductive logic programming. Protein Eng. 5(7), 647–657 (1992)
- 11. Muggleton, S.: Inverse entailment and progol. New Gener. Comput. 13(3–4), 245–286 (1995)
- 12. Muggleton, S., Raedt, L.D.: Inductive logic programming: theory and methods. J. Log. Program. 19(20), 629–679 (1994). doi:10.1016/0743-1066(94)90035-3
- 13. Mutlu, A., Karagoz, P.: Policy-based memoization for ILP-based concept discovery systems. J. Intell. Inf. Syst. 46(1), 99–120 (2016). doi:10.1007/s10844-015-0354-7
- 14. Mutlu, A., Senkul, P.: Improving hash table hit ratio of an ILP-based concept discovery system with memoization capabilities. In: Gelenbe, E., Lent, R. (eds.) Computer and Information Sciences III, pp. 261–269. Springer, London (2012). doi:10.1007/978-1-4471-4594-3_27
- 15. Mutlu, A., Senkul, P.: Improving hit ratio of ILP-based concept discovery system with memoization. Comput. J. 57(1), 138–153 (2014). doi:10.1093/comjnl/bxs163
- 16. Nguyen, H.V., Bai, L.: Cosine similarity metric learning for face verification. In: Kimmel, R., Klette, R., Sugimoto, A. (eds.) ACCV 2010. LNCS, vol. 6493, pp. 709–720. Springer, Heidelberg (2011). doi:10.1007/978-3-642-19309-5_55
- 17. Quinlan, J.R.: Learning logical definitions from relations. Mach. Learn. 5(3), 239–266 (1990)
- 18. Srinivasan, A., King, R.D., Muggleton, S.H., Sternberg, M.J.: The predictive toxicology evaluation challenge. In: IJCAI, vol. 1, pp. 4–9. Citeseer (1997)
- 19. Srinivasan, A., Muggleton, S.H., Sternberg, M.J., King, R.D.: Theories for mutagenicity: a study in first-order and feature-based induction. Artif. Intell. 85(1), 277–299 (1996)
- 20. Struyf, J., Blockeel, H.: Query optimization in inductive logic programming by reordering literals. In: Horváth, T., Yamamoto, A. (eds.) ILP 2003. LNCS (LNAI), vol. 2835, pp. 329–346. Springer, Heidelberg (2003). doi:10.1007/978-3-540-39917-9_22
Copyright information
Open Access This chapter is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, duplication, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, a link is provided to the Creative Commons license and any changes made are indicated.
The images or other third party material in this chapter are included in the work’s Creative Commons license, unless indicated otherwise in the credit line; if such material is not included in the work’s Creative Commons license and the respective action is not permitted by statutory regulation, users will need to obtain permission from the license holder to duplicate, adapt or reproduce the material.