Information Retrieval, Volume 12, Issue 3, pp 324–351

Mixed monolingual homepage finding in 34 languages: the role of language script and search domain

Abstract

The information that is available or sought on the World Wide Web (Web) is increasingly multilingual. Information Retrieval systems, such as the freely available search engines on the Web, need to provide fair and equal access to this information, regardless of the language in which a query is written or where the query is posted from. In this work, we ask two questions: How do existing state-of-the-art search engines deal with languages written in different alphabets (scripts)? Do local language-based search domains actually facilitate access to information? We conduct a thorough study on the effect of multilingual queries for homepage finding, where the aim of the retrieval system is to return only one document, namely the homepage described in the query. We evaluate the effect of multilingual queries on retrieval performance with regard to (i) the alphabet in which the queries are written (e.g., Latin, Russian, Arabic), and (ii) the language domain where the queries are posted (e.g., google.com, google.fr). We query four major freely available search engines with 764 queries in 34 different languages, and look for the correct homepage in the top retrieved results. In order to have fair multilingual experimental settings, we use an ontology that is comparable across languages and also representative of realistic Web searches: football premier leagues in different countries. The official team name represents our query, and the official team homepage represents the document to be retrieved.
A series of thorough experiments involving over 10,000 runs, with queries both in their correct script and in Latin characters, and using both global-domain and local-domain searches, reveals two findings. First, when a query is issued in the correct script of its language, the homepage is more likely to be found and ranked in the top 3, whereas queries in non-Latin script languages that are issued in Latin script are less likely to retrieve it. Second, queries issued to the correct local domain of a search engine, e.g., French queries to yahoo.fr, are likely to have better retrieval performance than queries issued to the global domain of that search engine. To our knowledge, this is the first Web retrieval study that uses such a wide range of languages.
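The evaluation described above, checking whether the correct homepage appears among the top retrieved results, can be illustrated with a small sketch. This is not the authors' code: the abstract does not say how URLs were matched or which exact measure was computed, so the normalization rule, the function names, the condition labels, and the `.test` URLs below are all hypothetical, chosen only to show a success-at-k calculation grouped by experimental condition (for example, native script versus Latin script).

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Reduce a URL to host + path for comparison (an assumed matching rule)."""
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    return host + parts.path.rstrip("/")


def rank_of_homepage(results: List[str], homepage: str) -> Optional[int]:
    """Return the 1-based rank of the target homepage, or None if absent."""
    target = normalize(homepage)
    for rank, url in enumerate(results, start=1):
        if normalize(url) == target:
            return rank
    return None


def success_at_k(
    runs: Iterable[Tuple[str, List[str], str]], k: int = 3
) -> Dict[str, float]:
    """Fraction of runs per condition whose homepage is ranked within the top k.

    Each run is (condition, ranked result URLs, correct homepage URL).
    """
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for condition, results, homepage in runs:
        totals[condition] += 1
        rank = rank_of_homepage(results, homepage)
        if rank is not None and rank <= k:
            hits[condition] += 1
    return {c: hits[c] / totals[c] for c in totals}


# Made-up placeholder runs, not data from the study.
example_runs = [
    ("native", ["http://www.club.test/", "http://other.test/"], "https://club.test"),
    ("native", ["http://a.test", "http://b.test", "http://c.test",
                "http://club.test"], "http://club.test"),
    ("latinized", ["http://other.test"], "http://club.test"),
]
scores = success_at_k(example_runs, k=3)
```

Grouping by a free-form condition label keeps the same function usable for either of the two comparisons in the abstract: script (correct versus Latin) or search domain (global versus local).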

Keywords

Multilingual information retrieval; Web information retrieval; Search engines and evaluation

Notes

Acknowledgements

Roi Blanco is co-funded by FEDER, Ministerio de Ciencia e Innovación and Xunta de Galicia under projects TIN2008-06566-C04-04 and 07SIN005206PR. Christina Lioma is funded by K.U.L. Postdoctoral Fellowship F+/08/002.

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  1. Computer Science Department, University of A Coruna, La Coruña, Spain
  2. Computer Science Department, Katholieke Universiteit Leuven, Heverlee, Belgium
