Abstract
Although the TF-IDF weighted frequency matrix (vector space model) has been widely studied and used in document clustering and document categorisation, there has been no attempt to extend it to relational data that contain one-to-many associations between records. This paper explains the rationale for using TF-IDF (term frequency-inverse document frequency), a technique for weighting data attributes borrowed from Information Retrieval theory, to summarise datasets stored in a multi-relational setting with one-to-many relationships. A novel TF-IDF-based data summarisation algorithm, referred to as Dynamic Aggregation of Relational Attributes (DARA), is introduced. DARA applies clustering techniques to summarise these datasets. The experimental results show that the DARA algorithm finds solutions with much greater accuracy.
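To make the idea of applying TF-IDF to one-to-many data concrete, the following is a minimal sketch, not the DARA algorithm itself. It assumes that each record on the "one" side is treated like a document and the attribute values of its associated records on the "many" side are treated like terms. The function names (`tfidf_bags`, `cosine`), the toy rows, and the unsmoothed weight tf × log(n/df) are illustrative assumptions; the abstract does not specify them.

```python
import math
from collections import Counter, defaultdict


def tfidf_bags(rows):
    """Build one TF-IDF weighted vector per parent record.

    rows: iterable of (parent_id, attribute_value) pairs taken from the
    'many' side of a one-to-many relationship (hypothetical input format).
    """
    # Term frequency: how often each value occurs for each parent record.
    bags = defaultdict(Counter)
    for parent_id, value in rows:
        bags[parent_id][value] += 1

    # Document frequency: number of parent records whose bag contains the value.
    n = len(bags)
    df = Counter()
    for bag in bags.values():
        df.update(bag.keys())

    # Unsmoothed TF-IDF weight: tf * log(n / df).
    return {
        parent_id: {v: tf * math.log(n / df[v]) for v, tf in bag.items()}
        for parent_id, bag in bags.items()
    }


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data: three customers (parents), each with several purchases (children).
rows = [
    ("c1", "book"), ("c1", "book"), ("c1", "pen"),
    ("c2", "book"), ("c2", "pen"),
    ("c3", "laptop"), ("c3", "pen"),
]
vectors = tfidf_bags(rows)
sim_c1_c2 = cosine(vectors["c1"], vectors["c2"])
sim_c1_c3 = cosine(vectors["c1"], vectors["c3"])
```

In this toy data, "pen" occurs for every customer and therefore receives zero weight, so the similarity between customers is driven by the rarer values. Vectors of this kind could then be passed to any standard clustering method, with the resulting cluster label serving as a summary of each parent record's associated rows.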
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Alfred, R. (2009). Discovering Knowledge from Multi-relational Data Based on Information Retrieval Theory. In: Huang, R., Yang, Q., Pei, J., Gama, J., Meng, X., Li, X. (eds) Advanced Data Mining and Applications. ADMA 2009. Lecture Notes in Computer Science, vol 5678. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-03348-3_39
DOI: https://doi.org/10.1007/978-3-642-03348-3_39
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-03347-6
Online ISBN: 978-3-642-03348-3
eBook Packages: Computer Science, Computer Science (R0)