SdeWaC – A Corpus of Parsable Sentences from the Web

  • Conference paper
Language Processing and Knowledge in the Web

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8105)

Abstract

For a number of languages, web crawling allows researchers to collect very large text samples from which to build corpora. However, only part of the material found on the internet is useful for Natural Language Processing: parsers, for example, typically cannot handle lists and tables, or very short or very long sentences. Methods exist (cf., e.g., [3]) for cleaning downloaded data before it is added to a corpus collection, but even when these are applied, not all of the remaining text may be suitable for a given research purpose. This paper describes the methods used to prepare deWaC, a freely available German web corpus of the WaCky project, for automatic processing up to the parsing level. It then discusses ways in which the resulting corpus, called SdeWaC, has been used since its release.
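
The kind of filtering the abstract alludes to (dropping list- or table-like lines and sentences that are very short or very long) might be sketched as follows. This is only an illustration, not the procedure used to build SdeWaC: the function names, the token thresholds, and the 50% alphabetic-token heuristic are all assumptions introduced here.

```python
import re

# Hypothetical thresholds: the abstract does not state the limits used for SdeWaC.
MIN_TOKENS = 3
MAX_TOKENS = 60

# Matches any Unicode letter (word character that is neither a digit nor "_").
LETTER = re.compile(r"[^\W\d_]")


def looks_parsable(sentence, min_tokens=MIN_TOKENS, max_tokens=MAX_TOKENS):
    """Return True if a whitespace-tokenised sentence passes the rough filters."""
    tokens = sentence.split()
    # Reject very short and very long sentences.
    if not (min_tokens <= len(tokens) <= max_tokens):
        return False
    # Reject list- or table-like lines: require that at least half of the
    # tokens contain a letter (an assumed heuristic, not from the paper).
    with_letters = sum(1 for tok in tokens if LETTER.search(tok))
    return with_letters / len(tokens) >= 0.5


def filter_sentences(sentences):
    """Keep only the sentences that pass looks_parsable."""
    return [s for s in sentences if looks_parsable(s)]


candidates = [
    "Ja.",                                        # too short
    "Das ist ein ganz normaler deutscher Satz.",  # kept
    "1 2 3 4 5 6",                                # table-like row of numbers
    "wort " * 100,                                # too long
]
kept = filter_sentences(candidates)
```

A real pipeline would presumably rely on a proper tokeniser and on feedback from the parser itself rather than on surface heuristics alone.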

References

  1. Baroni, M., Kilgarriff, A.: Large linguistically-processed web corpora for multiple languages. In: Conference Companion of EACL 2006, 11th Conference of the European Chapter of the Association for Computational Linguistics, pp. 87–90 (2006)

  2. Baroni, M., Bernardini, S., Ferraresi, A., Zanchetta, E.: The WaCky Wide Web: A Collection of Very Large Linguistically Processed Web-Crawled Corpora. Language Resources and Evaluation 43(3), 209–226 (2009)

  3. Bauer, D., Degen, J., Deng, X., Herger, P., Gasthaus, J., Giesbrecht, E., Jansen, L., Kalina, C., Krüger, T., Märtin, R., Schmidt, M., Scholler, S., Steger, J., Stemle, E., Evert, S.: Fiasco: Filtering the internet by automatic subtree classification. In: Fairon, C., Naets, H., Kilgarriff, A., de Schryver, G.-M. (eds.) Building and Exploring Web Corpora: Proceedings of the 3rd Web as Corpus Workshop (WAC3), Incorporating CLEANEVAL, Louvain-la-Neuve, Belgium, pp. 111–121 (2007)

  4. Bohnet, B.: Top accuracy and fast dependency parsing is not a contradiction. In: Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Coling 2010 Organizing Committee, Beijing, China, pp. 89–97 (2010)

  5. Briscoe, T., Carroll, J.: Automatic extraction of subcategorization from corpora. In: Proceedings of the Fifth Conference on Applied Natural Language Processing, Washington DC, USA, pp. 356–363 (1997)

  6. Buchholz, S., Marsi, E.: CoNLL-X Shared Task on Multilingual Dependency Parsing. In: Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X), pp. 149–164. Association for Computational Linguistics, New York City (2006)

  7. Eberle, K., Faaß, G., Heid, U.: Proposition oder Temporalangabe? Disambiguierung von -ung-Nominalisierungen von verba dicendi in nach-PPs. In: Chiarcos, C., Eckart de Castilho, R., Stede, M. (eds.) Proceedings of the Biennial GSCL Conference 2009, Von der Form zur Bedeutung: Texte automatisch verarbeiten / From Form to Meaning: Processing Texts Automatically, Potsdam, pp. 81–91. Narr, Tübingen (2009)

  8. Faaß, G., Heid, U., Schmid, H.: Design and application of a Gold Standard for morphological analysis: SMOR in validation. In: Proceedings of the Seventh LREC Conference, European Language Resources Association (ELRA), Valletta, Malta, pp. 803–810 (2010)

  9. Haselbach, B., Eckart, K., Seeker, W., Eberle, K., Heid, U.: Approximating Theoretical Linguistics Classification in Real Data: the Case of German “nach” Particle Verbs. In: Proceedings of COLING 2012, Mumbai, India, pp. 1113–1128 (2012)

  10. Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977)

  11. Quasthoff, U., Richter, M., Biemann, C.: Corpus portal for search in monolingual corpora. In: Proceedings of the LREC 2006, Genoa, Italy, pp. 1799–1802 (2006)

  12. Schiehlen, M.: A Cascaded Finite-State Parser for German. In: Proceedings of the Research Note Sessions of the 10th Conference of the European Chapter of the Association for Computational Linguistics (EACL 2003), Budapest, pp. 163–166 (2003)

  13. Schiller, A., Teufel, S., Thielen, C.: Guidelines für das Tagging deutscher Textcorpora mit STTS. Universität Stuttgart and Universität Tübingen (1995)

  14. Schmid, H.: Probabilistic Part-of-Speech Tagging Using Decision Trees. In: International Conference on New Methods in Language Processing, Manchester, UK, pp. 44–49 (1994)

  15. Schmid, H.: Unsupervised Learning of Period Disambiguation for Tokenisation. Internal Report, IMS. University of Stuttgart (2000)

  16. Schmid, H., Fitschen, A., Heid, U.: SMOR: A German computational morphology covering derivation, composition, and inflection. In: Proceedings of LREC 2004, Lisboa, Portugal (2004)

  17. Schulte im Walde, S.: Webkorpora für die automatische Akquisition lexikalisch-semantischen Wissens. In: Workshop Webkorpora in Computerlinguistik und Sprachforschung. Institut für Deutsche Sprache, Mannheim (2012)

  18. Springorum, S., Schulte im Walde, S., Roßdeutscher, A.: Automatic Classification of German an Particle Verbs. In: Proceedings of the 8th International Conference on Language Resources and Evaluation. Istanbul, Turkey (2012)

  19. Stus, O.: Web-Korpus, Korpusaufbereitung der deutschen Web-Korpora. Internal Report, IMS, Universität Stuttgart (2008)

  20. Weller, M., Heid, U.: Extraction of German multiword expressions from parsed corpora using context features. In: Calzolari, N., Choukri, K., Maegaard, B., Mariani, J., Odijk, J., Piperidis, S., Rosner, M., Tapias, D. (eds.) Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC 2010), pp. 3195–3201. European Language Resources Association (ELRA), Valletta (2010)

  21. Zarrieß, S., Schäfer, F., Schulte im Walde, S.: Passives of reflexives: a corpus study. Linguistic Evidence - Berlin Special. Berlin, Germany (2013)

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Faaß, G., Eckart, K. (2013). SdeWaC – A Corpus of Parsable Sentences from the Web. In: Gurevych, I., Biemann, C., Zesch, T. (eds) Language Processing and Knowledge in the Web. Lecture Notes in Computer Science, vol 8105. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40722-2_6

  • DOI: https://doi.org/10.1007/978-3-642-40722-2_6

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40721-5

  • Online ISBN: 978-3-642-40722-2

  • eBook Packages: Computer Science, Computer Science (R0)
