The Challenges of Language Variation in Information Access

  • Jussi Karlgren
  • Turid Hedlund
  • Kalervo Järvelin
  • Heikki Keskustalo
  • Kimmo Kettunen
Chapter
Part of The Information Retrieval Series book series (INRE, volume 41)

Abstract

This chapter gives an overview of how human languages differ from each other and how those differences bear on the development of language understanding technology for information access. It formulates the requirements that information access technology places (and might place) on language technology. It also discusses a number of relevant approaches and the current challenges in meeting those requirements.

Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Jussi Karlgren (1)
  • Turid Hedlund (2)
  • Kalervo Järvelin (3)
  • Heikki Keskustalo (3)
  • Kimmo Kettunen (4)

  1. Gavagai and KTH Royal Institute of Technology, Stockholm, Sweden
  2. Helsinki, Finland
  3. University of Tampere, Tampere, Finland
  4. The National Library of Finland, Helsinki, Finland
