Ensemble-Based Relationship Discovery in Relational Databases
Abstract
We investigate how several data relationship discovery algorithms can be combined to improve performance. We study eight relationship discovery algorithms, including Cosine similarity, Soundex similarity, Name similarity, and Value range similarity, each of which identifies potential links between database tables in a different way, using a different category of database information. We propose two ensemble methods, a voting system and hierarchical clustering, to reduce the generalization error of the individual algorithms. The voting scheme uses a given weighting metric to combine the predictions of each algorithm. Hierarchical clustering groups predictions into clusters based on their similarities and then combines one member from each cluster. We run experiments to validate the performance of each algorithm and compare it with our ensemble methods and with state-of-the-art algorithms (FastFK, Randomness and HoPF), using the Precision, Recall and F-Measure evaluation metrics over the TPCH and AdvWork datasets. The results show that the performance of each individual algorithm is limited, which indicates the importance of combining them to consolidate their strengths.
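To make the voting idea concrete, the following is a minimal sketch of a weighted vote over candidate column pairs, followed by the Precision, Recall and F-Measure computation. It is not the paper's implementation: the function names (`weighted_vote`, `precision_recall_f1`), the weights, the decision threshold and the similarity scores are all hypothetical values chosen for illustration. Only the column names are taken from the public TPC-H schema.

```python
from typing import Dict, Set, Tuple

# A candidate link: (referencing column, referenced column).
Pair = Tuple[str, str]


def weighted_vote(predictions: Dict[str, Dict[Pair, float]],
                  weights: Dict[str, float],
                  threshold: float = 0.5) -> Set[Pair]:
    """Combine per-algorithm scores in [0, 1] into one set of accepted pairs.

    Hypothetical helper: each algorithm's score is multiplied by its weight,
    the weighted scores are summed per pair, normalized by the total weight,
    and pairs at or above the threshold are accepted.
    """
    total_weight = sum(weights[name] for name in predictions)
    combined: Dict[Pair, float] = {}
    for name, scores in predictions.items():
        for pair, score in scores.items():
            combined[pair] = combined.get(pair, 0.0) + weights[name] * score
    return {pair for pair, score in combined.items()
            if score / total_weight >= threshold}


def precision_recall_f1(predicted: Set[Pair],
                        truth: Set[Pair]) -> Tuple[float, float, float]:
    """Standard Precision, Recall and F-Measure over sets of pairs."""
    true_positives = len(predicted & truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative scores only; these are not results from the paper.
predictions = {
    "name_similarity": {
        ("orders.o_custkey", "customer.c_custkey"): 0.9,
        ("lineitem.l_orderkey", "orders.o_orderkey"): 0.8,
        ("orders.o_orderkey", "customer.c_custkey"): 0.6,
    },
    "value_range_similarity": {
        ("orders.o_custkey", "customer.c_custkey"): 0.7,
        ("lineitem.l_orderkey", "orders.o_orderkey"): 0.4,
        ("orders.o_orderkey", "customer.c_custkey"): 0.1,
    },
}
weights = {"name_similarity": 0.6, "value_range_similarity": 0.4}  # assumed

truth = {
    ("orders.o_custkey", "customer.c_custkey"),
    ("lineitem.l_orderkey", "orders.o_orderkey"),
    ("lineitem.l_partkey", "part.p_partkey"),
}

accepted = weighted_vote(predictions, weights)
precision, recall, f1 = precision_recall_f1(accepted, truth)
print(sorted(accepted))
print(precision, recall, f1)
```

In this toy run the spurious pair is rejected because its weighted score falls below the threshold, while the third true link is missed because neither algorithm proposed it. This mirrors the abstract's point that each algorithm alone has limited coverage.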
Keywords
Semantic relationship; Primary/Foreign key relationship; Data discovery; Database management; Ensemble-based discovery