Language Resources and Evaluation, Volume 47, Issue 2, pp 425–448

Morphological query expansion and language-filtering words for improving Basque web retrieval

  • Igor Leturia
  • Antton Gurrutxaga
  • Nerea Areta
  • Iñaki Alegria
  • Aitzol Ezeiza
Original Paper

Abstract

Users of the major search engines and other web information retrieval services who look for information in Basque are poorly served. These services return only pages that contain exact matches of the query terms and miss inflected forms, which matters for an agglutinative language like Basque, and they return many results in other languages, since no search engine offers the option of restricting results to Basque. This paper proposes combining morphological query expansion and language-filtering words with the APIs of search engines as a very cost-effective way to build appropriate web search services for Basque. The implementation details of the methodology (which language-filtering words are most appropriate, how many of them to use, which inflections are most frequent for the morphological query expansion, etc.) were determined through corpus-based studies. The improvements were measured in terms of precision and recall, both over corpora and over real web searches. Morphological query expansion can improve recall up to 47%, and language-filtering words can raise precision from 15% to around 90%, although with a loss in recall of about 30–35%. The proposed methodology has already been used successfully in the Basque search service Elebila (http://www.elebila.eu) and the web-as-corpus tool CorpEus (http://www.corpeus.org), and the approach could also be applied to other morphologically rich or under-resourced languages.
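
As a rough illustration of the idea in the abstract, the following Python sketch shows how a query could be assembled before it is sent to a search engine API: the inflected forms of the search term are joined with OR (morphological query expansion), and a few very frequent Basque words are added as required terms (language-filtering words). The function name `build_query`, the query syntax (quoted terms, `OR`, implicit AND), the example inflections of "etxe" and the filter words "eta" and "da" are all assumptions chosen for illustration. They are not the word lists or the implementation used by the authors, which the paper derives from corpus studies.

```python
from typing import Sequence


def build_query(inflections: Sequence[str], filter_words: Sequence[str]) -> str:
    """Build a query string from inflected forms and language-filtering words.

    Hypothetical helper: assumes a search engine that accepts quoted terms,
    the OR operator, and treats space-separated terms as an implicit AND.
    """
    if not inflections:
        raise ValueError("at least one word form is required")
    # Morphological query expansion: any one of the inflected forms may match.
    expanded = " OR ".join(f'"{form}"' for form in inflections)
    query = f"({expanded})"
    # Language filtering: every filter word must also appear in the page.
    filters = " ".join(f'"{word}"' for word in filter_words)
    return f"{query} {filters}" if filters else query


# Illustrative inputs only (not taken from the paper).
forms = ["etxe", "etxea", "etxeak", "etxean", "etxeko"]
basque_filter_words = ["eta", "da"]
print(build_query(forms, basque_filter_words))
```

Requiring more filter words makes it less likely that a page in another language matches, but it also excludes Basque pages that happen not to contain all of them, which corresponds to the trade-off between precision and recall reported in the abstract.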

Keywords

Search engines, Web-as-corpus, Basque, NLP, Morphological query expansion, Language-filtering words

Copyright information

© Springer Science+Business Media Dordrecht 2012

Authors and Affiliations

  • Igor Leturia (1, email author)
  • Antton Gurrutxaga (1)
  • Nerea Areta (1)
  • Iñaki Alegria (2)
  • Aitzol Ezeiza (2)

  1. Elhuyar Foundation, Usurbil, Spain
  2. University of the Basque Country, Donostia/San Sebastian, Spain
