EXTIRP: Baseline Retrieval from Wikipedia

  • Miro Lehtonen
  • Antoine Doucet
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4518)


The Wikipedia XML documents pose an interesting challenge to any XML retrieval system that is capable of indexing and retrieving XML without prior knowledge of its structure. Although the structure of the Wikipedia XML documents is highly irregular and thus unpredictable, EXTIRP handles all the well-formed XML documents without problems. Whether the high flexibility of EXTIRP also implies high retrieval quality has so far been an open question. The initial results do not support a positive answer; instead, they tempt us to define some requirements for the XML documents that EXTIRP is expected to index. The most interesting question arising from our results concerns the line between high-quality XML markup, which aids accurate IR, and noisy “XML spam”, which misleads flexible XML search engines.





Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Miro Lehtonen (1)
  • Antoine Doucet (1, 2)
  1. Department of Computer Science, P.O. Box 68 (Gustaf Hällströmin katu 2b), FI–00014 University of Helsinki, Finland
  2. IRISA-INRIA, Campus de Beaulieu, F-35042 Rennes Cedex, France
