Statistical Corpus and Language Comparison on Comparable Corpora

Eckart, Thomas; Quasthoff, Uwe

doi:10.1007/978-3-642-20128-8_8

Abstract

With the wide availability of textual data in various languages, domains and registers, it is easy to create text corpora for a variety of applications, including, among many others, Natural Language Processing. The Leipzig Corpora Collection has been creating and using such corpora for more than fifteen years. However, preprocessing distributed resources to ensure homogeneity, and thus comparability, is an ongoing process. The resulting corpora, stored in identical formats, allow different statistical methods to be applied to generate various data for manual or automatic analysis. These data form the basis for applications in intra- and inter-language comparison or quality assurance of text stocks.

Author information

Authors and Affiliations

University of Leipzig, Leipzig, Germany
Thomas Eckart & Uwe Quasthoff

Corresponding author

Correspondence to Thomas Eckart.

Editor information

Editors and Affiliations

Centre for Translation Studies, University of Leeds, Leeds, United Kingdom
Serge Sharoff
University of Mainz, Mainz, Germany
Reinhard Rapp
Université de Paris-Sud LIMSI-CNRS, Orsay, France
Pierre Zweigenbaum
Electronic & Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, People's Republic of China
Pascale Fung

About this chapter

Cite this chapter

Eckart, T., Quasthoff, U. (2013). Statistical Corpus and Language Comparison on Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_8

DOI: https://doi.org/10.1007/978-3-642-20128-8_8
Published: 14 December 2013
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science, Computer Science (R0)
