Statistical Corpus and Language Comparison on Comparable Corpora

Chapter in: Building and Using Comparable Corpora

Abstract

With the wide availability of textual data in various languages, domains and registers, it is easy to create text corpora for a variety of applications. These include, among many others, applications in Natural Language Processing. The Leipzig Corpora Collection has been creating and using such corpora for more than fifteen years. However, preprocessing distributed resources to ensure homogeneity, and thus comparability, is an ongoing process. Corpora created in identical formats allow different statistical methods to be applied, generating data for manual or automatic analysis. These data form the basis for applications in intra- and inter-language comparison and in quality assurance of text collections.

Correspondence to Thomas Eckart.

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Eckart, T., Quasthoff, U. (2013). Statistical Corpus and Language Comparison on Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_8

  • DOI: https://doi.org/10.1007/978-3-642-20128-8_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-20127-1

  • Online ISBN: 978-3-642-20128-8

  • eBook Packages: Computer Science, Computer Science (R0)
