TupleRank: Ranking Discovered Content in Virtual Databases
Recently, the problem of data integration has been newly addressed by methods based on machine learning and discovery. Such methods are intended to automate, at least in part, the laborious process of information integration, by which existing data sources are incorporated in a virtual database. Essentially, these methods scan new data sources, attempting to discover possible mappings to the virtual database. Like all discovery processes, this process is intrinsically probabilistic; that is, each discovery is associated with a specific value that denotes assurance of its appropriateness. Consequently, the rows in a discovered virtual table have mixed assurance levels, with some rows being more credible than others. We argue that rows in discovered virtual databases should be ranked, and we describe a ranking method, called TupleRank, for calculating such a ranking order. Roughly speaking, TupleRank calibrates the probabilities calculated during a discovery process with historical information about the performance of the system. The work is done in the framework of the Autoplex system for discovering content for virtual databases, and initial experimentation is reported and discussed.
Keywords: Discovery Process, Assurance Score, Assurance Measurement, Internet Search Engine, Constraint Checker