Statistical Corpus and Language Comparison on Comparable Corpora

  • Thomas Eckart
  • Uwe Quasthoff


With the wide availability of textual data in various languages, domains, and registers, it is easy to create text corpora for a variety of applications, among them many in the field of Natural Language Processing. The Leipzig Corpora Collection has created and used such corpora for more than fifteen years. However, preprocessing distributed resources to ensure homogeneity, and thus comparability, remains an ongoing process. The resulting corpora share identical formats, which allows different statistical methods to be applied to generate data for manual or automatic analysis. These data are the basis for applications in intra- and inter-language comparison and in quality assurance of text collections.


Corpus comparison, Language comparison, Corpus evaluation



Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  1. University of Leipzig, Leipzig, Germany
