Building Large Resources for Text Mining: The Leipzig Corpora Collection

Abstract

Many text mining algorithms and applications require large text corpora and certain statistics-based annotations. To ensure comparability of results, a standardized corpus building process is required. The pre-processing procedures are particularly noteworthy, as they are crucial for the quality of the resulting data stock. This quality can be estimated both by evaluating the corpus building process and by statistical quality measurements on the corpus. Some of these approaches are described using the example of the Leipzig Corpora Collection.

Acknowledgements

We thank both anonymous reviewers and the editors for their valuable hints and comments, which helped in finalizing the chapter.

Author information

Corresponding author

Correspondence to Uwe Quasthoff.

Copyright information

© 2014 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Quasthoff, U., Goldhahn, D., Eckart, T. (2014). Building Large Resources for Text Mining: The Leipzig Corpora Collection. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_1

  • DOI: https://doi.org/10.1007/978-3-319-12655-5_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12654-8

  • Online ISBN: 978-3-319-12655-5

  • eBook Packages: Computer Science (R0)
