
The VLDB Journal, Volume 27, Issue 1, pp 105–126

Non-binary evaluation measures for big data integration

Regular Paper

Abstract

The evolution of data accumulation, management, analytics, and visualization has led to the coining of the term big data, which challenges the task of data integration. This task, common to any matching problem in computer science, involves generating alignments between structured data in an automated fashion. Historically, set-based measures, based upon binary similarity matrices (match/non-match), have dominated the evaluation of matching tasks. However, in the presence of big data, such measures no longer suffice. In this work, we propose evaluation methods for non-binary matrices as well. Non-binary evaluation is formally defined, together with several new non-binary measures, using a vector space representation of the matching outcome. We provide empirical analyses of the usefulness of non-binary evaluation and show its superiority over its binary counterparts in several problem domains.
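To make the contrast between binary and non-binary evaluation concrete, the following is a minimal sketch and not the measures defined in the paper. It assumes a similarity matrix with entries in [0, 1] and a 0/1 reference (exact) match matrix. The function names `binary_precision_recall` and `soft_precision_recall`, and the choice of a dot product between flattened matrices as the overlap term, are illustrative assumptions.

```python
import numpy as np


def binary_precision_recall(match, exact):
    """Set-based precision and recall over 0/1 match matrices."""
    match = np.asarray(match, dtype=bool)
    exact = np.asarray(exact, dtype=bool)
    true_pos = np.logical_and(match, exact).sum()
    precision = true_pos / match.sum() if match.sum() else 0.0
    recall = true_pos / exact.sum() if exact.sum() else 0.0
    return float(precision), float(recall)


def soft_precision_recall(sim, exact):
    """Hypothetical non-binary variant (an assumption, not the paper's definition).

    Both matrices are flattened into vectors and their dot product
    replaces the size of the set intersection.
    """
    v = np.asarray(sim, dtype=float).ravel()
    e = np.asarray(exact, dtype=float).ravel()
    overlap = float(v @ e)
    precision = overlap / float(v.sum()) if v.sum() else 0.0
    recall = overlap / float(e.sum()) if e.sum() else 0.0
    return precision, recall


# Made-up 2x2 example: rows and columns are attributes of two schemas.
exact = np.array([[1, 0], [0, 1]])
sim = np.array([[0.9, 0.2], [0.4, 0.5]])

# Binary evaluation needs a threshold (0.5 here) and discards the scores.
thresholded = sim >= 0.5
print(binary_precision_recall(thresholded, exact))

# The non-binary variant evaluates the raw similarity values directly.
print(soft_precision_recall(sim, exact))
```

In this sketch the thresholded matrix looks perfect under the binary measures, while the non-binary variant still penalizes the similarity mass assigned to non-matching pairs. When the input matrix is itself 0/1, the two functions return the same values.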

Keywords

Matching, Data integration, Evaluation

Notes

Acknowledgements

The research leading to these results has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2013) under the NisB (http://nisb-project.eu/) project, Grant Agreement No. 256955.

Copyright information

© Springer-Verlag GmbH Germany 2017

Authors and Affiliations

  1. University of Haifa, Haifa, Israel
  2. Technion - Israel Institute of Technology, Haifa, Israel
