Active learning in keyword search-based data integration
Abstract
The problem of scaling up data integration, such that new sources can be quickly utilized as they are discovered, remains elusive: Global schemas for integrated data are difficult to develop and expand, and schema and record matching techniques are limited by the fact that data and metadata are often under-specified and must be disambiguated by data experts. One promising approach is to avoid using a global schema, and instead to develop keyword search-based data integration—where the system lazily discovers associations enabling it to join together matches to keywords, and return ranked results. The user is expected to understand the data domain and provide feedback about answers’ quality. The system generalizes such feedback to learn how to correctly integrate data. A major open challenge is that under this model, the user only sees and offers feedback on a few “top-\(k\)” results: This result set must be carefully selected to include answers of high relevance and answers that are highly informative when feedback is given on them. Existing systems merely focus on predicting relevance, by composing the scores of various schema and record matching algorithms. In this paper, we show how to predict the uncertainty associated with a query result’s score, as well as how informative feedback is on a given result. We build upon these foundations to develop an active learning approach to keyword search-based data integration, and we validate the effectiveness of our solution over real data from several very different domains.
Keywords
Data integration, Keyword search, Active learning
Notes
Acknowledgments
We thank Burr Settles for his advice on active learning, and the anonymous reviewers for their feedback. This work was funded in part by the National Science Foundation Grants IIS-1050448, IIS-1217798, IIS-0477972, IIS-0513778, CNS-0721541, and by a gift from Google. Portions of this work were done when P. Talukdar was at Carnegie Mellon University.