DTD Based Costs for Tree-Edit Distance in Structured Information Retrieval

  • Conference paper
Advances in Information Retrieval (ECIR 2013)

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 7814))

Abstract

In this paper we present a Structured Information Retrieval (SIR) model based on graph matching. Our approach combines content propagation, which handles sibling relationships, with a document-query structure matching process. The latter is based on Tree-Edit Distance (TED), the minimum-cost sequence of insert, delete, and replace operations needed to turn one tree into another. To our knowledge, this algorithm has never been used in ad-hoc SIR. As the effectiveness of TED relies on both the input trees and the edit costs, we first present a focused subtree extraction technique that selects the most representative elements of the document with respect to the query. We then describe our setting of TED costs based on the Document Type Definition (DTD). Finally, we discuss our results according to the type of collection (data-oriented or text-oriented). Experiments are conducted on two INEX test sets: the 2010 Datacentric collection and the 2005 Ad-hoc one.
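
To make the TED definition concrete, here is a minimal sketch of the textbook recursive formulation for ordered labeled trees, with pluggable edit costs. It is not the implementation or the cost scheme used in the paper: the tree encoding, the function names (`tree`, `tree_edit_distance`), the element labels, and the 0.5 insertion cost are all assumptions chosen for illustration.

```python
from functools import lru_cache

def tree(label, *children):
    """Hypothetical encoding: a tree is (label, tuple_of_child_trees)."""
    return (label, tuple(children))

def tree_edit_distance(t1, t2,
                       ins_cost=lambda label: 1,
                       del_cost=lambda label: 1,
                       rep_cost=lambda a, b: 0 if a == b else 1):
    """Cost of the cheapest insert/delete/replace script turning t1 into t2.

    Uses the classic forest recursion on the rightmost root, memoized.
    Simple but not optimized; fine for small trees only.
    """
    @lru_cache(maxsize=None)
    def dist(f, g):
        # f and g are forests: tuples of trees in left-to-right order.
        if not f and not g:
            return 0
        if not g:
            # Delete the rightmost root of f; its children take its place.
            label, kids = f[-1]
            return dist(f[:-1] + kids, g) + del_cost(label)
        if not f:
            # Insert the rightmost root of g.
            label, kids = g[-1]
            return dist(f, g[:-1] + kids) + ins_cost(label)
        (lv, kv), (lw, kw) = f[-1], g[-1]
        return min(
            dist(f[:-1] + kv, g) + del_cost(lv),          # delete v
            dist(f, g[:-1] + kw) + ins_cost(lw),          # insert w
            dist(kv, kw) + dist(f[:-1], g[:-1]) + rep_cost(lv, lw),  # map v to w
        )

    return dist((t1,), (t2,))

# Illustrative query and document subtrees (labels are made up).
query = tree("movie", tree("title"), tree("actor"))
doc = tree("movie", tree("title"), tree("cast", tree("actor")))

# Unit costs: one insertion of "cast" is needed.
unit = tree_edit_distance(query, doc)

# Assumed, structure-aware cost: inserting "cast" is cheaper than other labels.
cheap_cast = tree_edit_distance(
    query, doc, ins_cost=lambda label: 0.5 if label == "cast" else 1
)
print(unit, cheap_cast)
```

With unit costs the two trees are at distance 1, while the assumed cheaper insertion for `cast` lowers the distance to 0.5. This shows how the choice of edit costs, and not only the tree shapes, determines how close a document subtree is to the query.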

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Laitang, C., Pinel-Sauvagnat, K., Boughanem, M. (2013). DTD Based Costs for Tree-Edit Distance in Structured Information Retrieval. In: Serdyukov, P., et al. Advances in Information Retrieval. ECIR 2013. Lecture Notes in Computer Science, vol 7814. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-36973-5_14

  • DOI: https://doi.org/10.1007/978-3-642-36973-5_14

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-36972-8

  • Online ISBN: 978-3-642-36973-5

  • eBook Packages: Computer Science, Computer Science (R0)