Phrase Detection in the Wikipedia

  • Miro Lehtonen
  • Antoine Doucet
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4862)


The Wikipedia XML collection turned out to be rich in marked-up phrases as we carried out our INEX 2007 experiments. Assuming that a phrase occurs at the inline level of the markup, we were able to identify over 18 million phrase occurrences, most of which were either the anchor text of a hyperlink or a passage of text with added emphasis. As our IR system, EXTIRP, indexed the documents, the detected inline-level elements were duplicated in the markup, with two direct consequences: 1) the frequency of the phrase terms increased, and 2) the word sequences changed. Because the markup was manipulated before word sequences were computed for a phrase index, the actual multi-word phrases became easier to detect. The effect of duplicating the inline-level elements was tested by producing two run submissions in ways that were similar except for the duplication. According to the official INEX 2007 metric, the positive effect of duplicated phrases was clear.
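The two consequences of duplication named in the abstract can be illustrated with a minimal sketch. It is not EXTIRP's implementation, which the abstract does not describe in detail; it only assumes that duplication means inserting a copy of each inline-level element directly after the original. The tag names `link` and `em`, the helper names, and the sample sentence are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET

# Hypothetical inline-level tag names; the real collection's tag set may differ.
INLINE_TAGS = {"link", "em"}


def duplicate_inline_elements(root, inline_tags=INLINE_TAGS):
    """Insert a copy of each inline-level element right after the original.

    Assumption: this is only one plausible reading of "duplicated in the
    markup". Nested inline elements get no special treatment in this sketch.
    """
    for parent in list(root.iter()):
        # Walk the children backwards so insertions do not shift
        # the indices that are still to be visited.
        for i in range(len(parent) - 1, -1, -1):
            child = parent[i]
            if child.tag in inline_tags:
                twin = copy.deepcopy(child)
                # The text following the original now follows the copy;
                # a single space separates the original from its copy.
                twin.tail = child.tail
                child.tail = " "
                parent.insert(i + 1, twin)
    return root


def words(root):
    """Return the whitespace-separated word sequence of the element's text."""
    return "".join(root.itertext()).split()


sample = "<p>The <link>information retrieval</link> task is <em>hard</em>.</p>"
doc = ET.fromstring(sample)
before = words(doc)
after = words(duplicate_inline_elements(doc))
print(before)
print(after)
```

In this sketch, the words inside the marked-up phrase occur twice after duplication (higher term frequency), and the phrase appears as a repeated run in the word sequence, which is the kind of change a sequence-based phrase index would pick up.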


Keywords: Query Term, Frequency Threshold, Word Sequence, Recall Level, Anchor Text
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Miro Lehtonen (1)
  • Antoine Doucet (1, 2)
  1. Department of Computer Science, FI–00014 University of Helsinki, Finland
  2. GREYC CNRS UMR 6072, University of Caen Lower Normandy, Caen Cedex, France
