The VLDB Journal, Volume 24, Issue 5, pp 611–631

Active learning in keyword search-based data integration

  • Zhepeng Yan
  • Nan Zheng
  • Zachary G. Ives
  • Partha Pratim Talukdar
  • Cong Yu
Special Issue Paper


The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration—where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers’ quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few “top-\(k\)” results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result’s score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.


Data integration, Keyword search, Active learning



We thank Burr Settles for his advice on active learning, and the anonymous reviewers for their feedback. This work was funded in part by the National Science Foundation Grants IIS-1050448, IIS-1217798, IIS-0477972, IIS-0513778, CNS-0721541, and by a gift from Google. Portions of this work were done when P. Talukdar was at Carnegie Mellon University.


Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Zhepeng Yan (1)
  • Nan Zheng (1)
  • Zachary G. Ives (1)
  • Partha Pratim Talukdar (2)
  • Cong Yu (3)

  1. Computer and Information Science Department, University of Pennsylvania, Philadelphia, USA
  2. Room 401, SERC, Indian Institute of Science, Bengaluru, India
  3. Google Research, New York, USA
