Learning Relational Bayesian Classifiers from RDF Data

  • Harris T. Lin
  • Neeraj Koul
  • Vasant Honavar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7031)

Abstract

The increasing availability of large RDF datasets offers an exciting opportunity to build predictive models from such data using machine learning algorithms. However, the massive size and distributed nature of RDF data call for approaches to learning in a setting where the data can be accessed only through a query interface, e.g., the SPARQL endpoint of the RDF store. In applications where the data are subject to frequent updates, there is a need for algorithms that allow the predictive model to be incrementally updated in response to changes in the data. Furthermore, in some applications, the attributes that are relevant for specific prediction tasks are not known a priori and hence need to be discovered by the algorithm. We present an approach to learning Relational Bayesian Classifiers (RBCs) from RDF data that addresses such scenarios. Specifically, we show how to build RBCs from RDF data using statistical queries through the SPARQL endpoint of the RDF store. We compare the communication complexity of our algorithm with that of an algorithm that requires direct centralized access to the data and hence has to retrieve the entire RDF dataset from the remote location for processing. We establish the conditions under which the RBC models can be incrementally updated in response to addition or deletion of RDF data. We show how our approach can be extended to the setting where the attributes that are relevant for prediction are not known a priori, by selectively crawling the RDF data for attributes of interest. We provide an open-source implementation and evaluate the proposed approach on several large RDF datasets.
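The idea of learning from counts rather than from raw triples can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes an endpoint that supports SPARQL 1.1 aggregates, uses hypothetical predicate IRIs, and substitutes a simple add-one-smoothed estimator over bags of attribute values. The names count_query and CountModel are invented for this example, and the counts are supplied locally instead of being fetched from a live endpoint.

```python
import math
from collections import Counter


def count_query(class_pred, class_value, attr_pred, attr_value):
    """Build a SPARQL 1.1 COUNT query for one class/attribute-value pair.

    The predicate IRIs and literal values are hypothetical placeholders.
    """
    return (
        "SELECT (COUNT(*) AS ?n) WHERE { "
        f'?x <{class_pred}> "{class_value}" . '
        f'?x <{attr_pred}> "{attr_value}" . '
        "}"
    )


class CountModel:
    """Classifier defined entirely by counts (its sufficient statistics)."""

    def __init__(self):
        self.class_counts = Counter()  # label -> number of instances
        self.value_counts = Counter()  # (label, value) -> occurrences
        self.vocabulary = set()        # attribute values seen so far

    def _apply(self, label, bag, sign):
        # Adding or deleting an instance only shifts the counts.
        self.class_counts[label] += sign
        for value in bag:
            self.value_counts[(label, value)] += sign
            self.vocabulary.add(value)

    def add(self, label, bag):
        self._apply(label, bag, +1)

    def remove(self, label, bag):
        self._apply(label, bag, -1)

    def predict(self, bag):
        """Return the label with the highest smoothed log-score, or None."""
        total = sum(n for n in self.class_counts.values() if n > 0)
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            if n <= 0:
                continue  # class has no remaining instances
            mass = sum(self.value_counts[(label, v)] for v in self.vocabulary)
            score = math.log(n / total)
            for value in bag:
                # Add-one smoothing is an assumption of this sketch.
                count = self.value_counts[(label, value)]
                score += math.log((count + 1) / (mass + len(self.vocabulary)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

In this sketch, each entry of value_counts corresponds to the answer of one count_query call, so only integers would need to cross the network. Because add and remove adjust the same counters, a change to the data is reflected in the model without recomputing it from scratch.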

Keywords

Resource Description Framework; Communication Complexity; Target Class; Attribute Graph; SPARQL Query

Copyright information

© Springer-Verlag Berlin Heidelberg 2011

Authors and Affiliations

  • Harris T. Lin (1)
  • Neeraj Koul (1)
  • Vasant Honavar (1)
  1. Department of Computer Science, Iowa State University, Ames, USA
