A Machine Learning Model for Information Retrieval with Structured Documents

  • Benjamin Piwowarski
  • Patrick Gallinari
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2734)


Most recent document standards rely on structured representations. Current information retrieval systems, however, were developed for flat document representations and cannot easily be extended to cope with more complex document types. Only a few models have been proposed for handling structured documents, and the design of such systems is still an open problem. We present a new model for structured document retrieval that computes and combines the scores of document parts. It is based on Bayesian networks and allows the model parameters to be learned in the presence of incomplete data. We present an application of this model to ad-hoc retrieval and evaluate its performance on a small structured collection. The model can also be extended to other tasks, such as interactive navigation in structured documents or corpora.


Information Retrieval; Bayesian Network; Estimation Maximization; Structure Document; Machine Learn Model
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2003

Authors and Affiliations

  • Benjamin Piwowarski (1)
  • Patrick Gallinari (1)
  1. LIP6, Université Paris 6, Paris, France
