Phrase Detection in the Wikipedia

  • Conference paper
Focused Access to XML Documents (INEX 2007)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 4862)

Abstract

The Wikipedia XML collection turned out to be rich in marked-up phrases as we carried out our INEX 2007 experiments. Assuming that a phrase occurs at the inline level of the markup, we were able to identify over 18 million phrase occurrences, most of which were either the anchor text of a hyperlink or a passage of text with added emphasis. As our IR system, EXTIRP, indexed the documents, the detected inline-level elements were duplicated in the markup, with two direct consequences: 1) the frequency of the phrase terms increased, and 2) the word sequences changed. Because the markup was manipulated before computing word sequences for a phrase index, the actual multi-word phrases became easier to detect. The effect of duplicating the inline-level elements was tested by producing two run submissions that were identical except for the duplication. According to the official INEX 2007 metric, the positive effect of duplicated phrases was clear.
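The duplication step described above can be sketched as follows. This is a minimal illustration, not the authors' actual code: the inline tag names (`emph`, `collectionlink`) and the exact duplication strategy are assumptions chosen to mirror the Wikipedia XML collection's emphasis and link markup. The text of each detected inline-level element is emitted twice, so its terms gain frequency and the phrase appears as a contiguous word sequence for the phrase index.

```python
# Hedged sketch of duplicating inline-level elements before indexing.
# Tag names below are illustrative assumptions, not the paper's code.
import xml.etree.ElementTree as ET

INLINE_TAGS = {"emph", "collectionlink"}  # assumed inline-level phrase tags

def duplicate_inline_phrases(xml_text: str) -> str:
    """Return the document's token stream with each inline phrase repeated."""
    root = ET.fromstring(xml_text)
    tokens = []
    for elem in root.iter():
        if elem.text:
            words = elem.text.split()
            tokens.extend(words)
            if elem.tag in INLINE_TAGS:
                tokens.extend(words)  # duplicate the marked-up phrase
        if elem.tail:
            tokens.extend(elem.tail.split())
    return " ".join(tokens)

doc = "<p>The <emph>Wikipedia XML</emph> collection</p>"
print(duplicate_inline_phrases(doc))
# → The Wikipedia XML Wikipedia XML collection
```

Duplicating at tokenization time, rather than weighting terms at query time, has the side effect the abstract notes: the word sequences themselves change, which is what makes the multi-word phrases easier to detect when maximal frequent sequences are mined from the token stream.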



Editor information

Norbert Fuhr, Jaap Kamps, Mounia Lalmas, Andrew Trotman


Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lehtonen, M., Doucet, A. (2008). Phrase Detection in the Wikipedia. In: Fuhr, N., Kamps, J., Lalmas, M., Trotman, A. (eds) Focused Access to XML Documents. INEX 2007. Lecture Notes in Computer Science, vol 4862. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85902-4_10

  • DOI: https://doi.org/10.1007/978-3-540-85902-4_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85901-7

  • Online ISBN: 978-3-540-85902-4

  • eBook Packages: Computer Science (R0)
