Pay-as-you-go Data Integration: Experiences and Recurring Themes

  • Norman W. Paton
  • Khalid Belhajjame
  • Suzanne M. Embury
  • Alvaro A. A. Fernandes
  • Ruhaila Maskat
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9587)


Data integration typically seeks to provide the illusion that data from multiple distributed sources comes from a single, well-managed source. Providing this illusion in practice tends to involve the design of a global schema that captures the users' data requirements, followed by manual (with tool support) construction of mappings between sources and the global schema. This overall approach can provide high-quality integrations but at high cost, and tends to be unsuitable for areas with large numbers of rapidly changing sources, where users may be willing to cope with a less than perfect integration. Pay-as-you-go data integration has been proposed to overcome the need for costly manual data integration. It tends to involve two steps: initialisation, the automatic creation of mappings (generally of poor quality) between sources; and improvement, the obtaining of feedback on some aspect of the integration and the application of this feedback to revise the integration. There has been considerable research in this area over a ten-year period. This paper reviews some experiences with pay-as-you-go data integration, providing a framework that can be used to compare or develop pay-as-you-go data integration techniques.
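
The two steps can be illustrated with a minimal Python sketch. It is not the method of the paper: the function names `initialise` and `improve`, the use of string similarity over attribute names, the threshold, and the form of the feedback (a true/false judgement per candidate mapping) are all assumptions made only for illustration.

```python
from difflib import SequenceMatcher


def initialise(source_attrs, target_attrs, threshold=0.2):
    """Initialisation (illustrative): propose candidate mappings automatically.

    Every source/target attribute pair whose names are similar enough is kept,
    so the result is cheap to obtain but generally of poor quality.
    """
    mappings = {}
    for s in source_attrs:
        for t in target_attrs:
            score = SequenceMatcher(None, s.lower(), t.lower()).ratio()
            if score >= threshold:
                mappings[(s, t)] = score
    return mappings


def improve(mappings, feedback):
    """Improvement (illustrative): revise the mappings using feedback.

    `feedback` maps a candidate pair to True (confirmed) or False (rejected).
    Confirmed pairs get full confidence; rejected pairs are dropped.
    The input mappings are left unchanged.
    """
    revised = dict(mappings)
    for pair, is_correct in feedback.items():
        if is_correct:
            revised[pair] = 1.0
        else:
            revised.pop(pair, None)
    return revised


if __name__ == "__main__":
    candidates = initialise(["name", "price"], ["name", "cost"])
    print("after initialisation:", candidates)
    revised = improve(
        candidates,
        {("price", "cost"): True, ("price", "name"): False},
    )
    print("after improvement:", revised)
```

In this sketch the cost of integration is paid incrementally: the initial mappings are usable straight away, and each piece of feedback removes or confirms one candidate rather than requiring the whole integration to be designed up front.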


Keywords: Data Integration · Entity Resolution · Improvement Phase · Identify Problem · Matched Record



Research on data integration at Manchester is supported by the VADA Programme Grant of the UK Engineering and Physical Sciences Research Council, whose support we are pleased to acknowledge.



Copyright information

© Springer-Verlag Berlin Heidelberg 2016

Authors and Affiliations

  • Norman W. Paton (1)
  • Khalid Belhajjame (2)
  • Suzanne M. Embury (1)
  • Alvaro A. A. Fernandes (1)
  • Ruhaila Maskat (1)

  1. School of Computer Science, University of Manchester, Manchester, UK
  2. Université Paris Dauphine, Paris Cedex 16, France
