Ensemble-Based Relationship Discovery in Relational Databases

  • Akinola Ogunsemi
  • John McCall
  • Mathias Kern
  • Benjamin Lacroix
  • David Corsar
  • Gilbert Owusu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 12498)


We investigate how several data relationship discovery algorithms can be combined to improve performance. We consider eight relationship discovery algorithms, including Cosine similarity, Soundex similarity, Name similarity and Value range similarity, each of which identifies potential links between database tables in a different way using a different category of database information. We propose two ensemble methods, a voting system and hierarchical clustering, to reduce the generalization error of the individual algorithms. The voting scheme uses a given weighting metric to combine the predictions of each algorithm. Hierarchical clustering groups predictions into clusters based on their similarities and then combines one member from each cluster. We run experiments to validate the performance of each algorithm and compare it with our ensemble methods and the state-of-the-art algorithms (FastFK, Randomness and HoPF) using the Precision, Recall and F-Measure evaluation metrics over the TPCH and AdvWork datasets. The results show that the performance of each individual algorithm is limited, indicating the importance of combining them to consolidate their strengths.
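The voting scheme described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `weighted_vote`, the acceptance threshold, the per-algorithm weights and the TPCH-style column names are all hypothetical, chosen only to show how a weighting metric might combine the link predictions of several discovery algorithms.

```python
from collections import defaultdict


def weighted_vote(predictions, weights, threshold=0.5):
    """Combine per-algorithm link predictions by weighted voting.

    predictions: dict mapping algorithm name -> set of candidate links,
                 each link a (source_column, target_column) pair.
    weights:     dict mapping algorithm name -> non-negative weight
                 (assumption: e.g. a per-algorithm quality score).
    threshold:   fraction of the total weight a link needs to be accepted
                 (assumption: the abstract does not specify a decision rule).
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")

    # Accumulate the weight of every algorithm that predicted each link.
    score = defaultdict(float)
    for name, links in predictions.items():
        for link in links:
            score[link] += weights[name]

    # Keep links whose share of the total weight reaches the threshold.
    return {link for link, s in score.items() if s / total >= threshold}


# Illustrative data only; not results from the paper.
predictions = {
    "name_similarity": {
        ("orders.custkey", "customer.custkey"),
        ("orders.orderkey", "lineitem.orderkey"),
    },
    "value_range": {("orders.custkey", "customer.custkey")},
    "cosine": {
        ("orders.custkey", "customer.custkey"),
        ("part.size", "lineitem.quantity"),
    },
}
weights = {"name_similarity": 0.5, "value_range": 0.25, "cosine": 0.25}

accepted = weighted_vote(predictions, weights, threshold=0.6)
print(accepted)
```

With these illustrative weights, only the link predicted by all three algorithms clears the 0.6 threshold; links backed by a single algorithm are rejected, which is the sense in which voting can offset the errors of individual algorithms.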


Semantic relationship · Primary/Foreign key relationship · Data discovery · Database management · Ensemble-based discovery



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Akinola Ogunsemi (1)
  • John McCall (1)
  • Mathias Kern (2)
  • Benjamin Lacroix (1)
  • David Corsar (1)
  • Gilbert Owusu (2)

  1. Robert Gordon University, Aberdeen, UK
  2. BT Applied Research, Ipswich, UK
