Abstract
With the wide availability of textual data in various languages, domains, and registers, it is easy to create text corpora for a variety of applications, including, among many others, Natural Language Processing. The Leipzig Corpora Collection has been creating and using such corpora for more than fifteen years. However, preprocessing distributed resources to ensure homogeneity, and thus comparability, is an ongoing process. As a result, corpora created in identical formats allow the use of different statistical methods to generate various data for manual or automatic analysis. These data form the basis for applications in intra- and inter-language comparison and in quality assurance of text collections.
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Eckart, T., Quasthoff, U. (2013). Statistical Corpus and Language Comparison on Comparable Corpora. In: Sharoff, S., Rapp, R., Zweigenbaum, P., Fung, P. (eds) Building and Using Comparable Corpora. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-20128-8_8
DOI: https://doi.org/10.1007/978-3-642-20128-8_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-20127-1
Online ISBN: 978-3-642-20128-8
eBook Packages: Computer Science (R0)