International Journal on Digital Libraries, Volume 13, Issue 3–4, pp 135–153

On the applicability of word sense discrimination on 201 years of modern English

  • Nina Tahmasebi
  • Kai Niklas
  • Gideon Zenz
  • Thomas Risse
Article

Abstract

As language evolves over time, documents stored in long-term archives become inaccessible to users. Automatically detecting and handling language evolution will become a necessity to meet users' information needs. In this paper, we investigate the performance of modern tools and algorithms applied to modern English to find word senses that will later serve as a basis for finding evolution. We apply the curvature clustering algorithm to all nouns and noun phrases extracted from The Times Archive (1785–1985). We use natural language processors for part-of-speech tagging and lemmatization and report on the performance of these processors over the entire period. We evaluate our clusters using WordNet to verify whether they correspond to valid word senses. Because The Times Archive contains OCR errors, we investigate the effects of such errors on word sense discrimination results. Finally, we present a novel approach to correcting the OCR errors present in the archive and show that the coverage of the curvature clustering algorithm improves: the number of clusters increases by 24%. To verify our results, we use the New York Times corpus (1987–2007), a recent collection that is considered error free, as a ground truth for our experiments. We find that after correcting OCR errors in The Times Archive, the performance of word sense discrimination applied to The Times Archive is comparable to the ground truth.

Keywords

  • Word sense discrimination
  • Historical document collections
  • OCR error correction

Notes

Acknowledgments

We would like to thank Times Newspapers Limited for providing the archive of The Times, London for our research. A special thanks to Gertrud Erbach for her valuable contributions. This work is partly funded by the European Commission under LiWA (IST 216267) and ARCOMEM (IST 270239).

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Nina Tahmasebi (1)
  • Kai Niklas (1)
  • Gideon Zenz (1)
  • Thomas Risse (1)

  1. L3S Research Center, Hanover, Germany
