Querying and Merging Heterogeneous Data by Approximate Joins on Higher-Order Terms

  • Simon Price
  • Peter Flach
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5194)


Integrating heterogeneous data from sources as diverse as web pages, digital libraries, knowledge bases, the Semantic Web and databases is an open problem. The ultimate aim of our work is to be able to query such heterogeneous data sources as if their data were conveniently held in a single relational database. Pursuant to this aim, we propose a generalisation of joins from the relational database model to enable joins on arbitrarily complex structured data in a higher-order representation. By incorporating kernels and distances for structured data, we further extend this model to support approximate joins of heterogeneous data. We demonstrate the flexibility of our approach in the publications domain by evaluating example approximate queries on the CORA data sets, joining on types ranging from sets of co-authors through to entire publications.
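The idea of an approximate join on a structured type can be illustrated with a minimal sketch. It assumes one possible instantiation that the abstract does not itself specify: a set kernel with a matching base kernel (the number of shared elements), the standard kernel-induced distance sqrt(k(a,a) - 2k(a,b) + k(b,b)), and a plain nested-loop join that keeps pairs within a distance threshold. The function names, the record layout and the author names are hypothetical and chosen only for illustration.

```python
from math import sqrt


def set_kernel(a, b):
    """Set kernel with a matching base kernel: counts the shared elements."""
    return len(a & b)


def kernel_distance(a, b, kernel=set_kernel):
    """Distance induced by a kernel: sqrt(k(a,a) - 2*k(a,b) + k(b,b))."""
    return sqrt(kernel(a, a) - 2 * kernel(a, b) + kernel(b, b))


def approximate_join(left, right, key, max_distance):
    """Nested-loop join pairing rows whose join keys lie within max_distance.

    With max_distance = 0 this reduces to an exact join on equal keys.
    """
    return [
        (left_row, right_row)
        for left_row in left
        for right_row in right
        if kernel_distance(key(left_row), key(right_row)) <= max_distance
    ]


# Hypothetical publication records from two sources, joined on co-author sets.
source_a = [
    {"id": "a1", "authors": frozenset({"a. smith", "b. jones"})},
    {"id": "a2", "authors": frozenset({"c. brown"})},
]
source_b = [
    {"id": "b1", "authors": frozenset({"a. smith", "b. jones", "d. green"})},
    {"id": "b2", "authors": frozenset({"e. white"})},
]


def by_authors(row):
    return row["authors"]


matches = approximate_join(source_a, source_b, key=by_authors, max_distance=1.0)
matched_ids = [(left_row["id"], right_row["id"]) for left_row, right_row in matches]
exact_matches = approximate_join(source_a, source_b, key=by_authors, max_distance=0.0)
```

Here the co-author sets of `a1` and `b1` differ by a single author, so they join under a threshold of 1.0 but not under an exact join. Swapping `set_kernel` for a kernel on a richer type (for example, a whole publication record) would leave `approximate_join` unchanged, which is the kind of flexibility the abstract describes.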





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Simon Price (1)
  • Peter Flach (1)
  1. Department of Computer Science, University of Bristol, Bristol, United Kingdom
