Name: Web Corpus Construction
ISBN: 978-3-031-02152-7

Overview

Authors:

Roland Schäfer,
Felix Bildhauer

Roland Schäfer
1. Freie Universität Berlin, Germany
Felix Bildhauer
1. Freie Universität Berlin, Germany

Part of the book series: Synthesis Lectures on Human Language Technologies (SLHLT)



Table of contents (5 chapters)

Front Matter

Pages i-xv

Web Corpora
- Roland Schäfer, Felix Bildhauer
Pages 1-5
Data Collection
- Roland Schäfer, Felix Bildhauer
Pages 7-36
Post-Processing
- Roland Schäfer, Felix Bildhauer
Pages 37-63
Linguistic Processing
- Roland Schäfer, Felix Bildhauer
Pages 65-84
Corpus Evaluation and Comparison
- Roland Schäfer, Felix Bildhauer
Pages 85-109
Back Matter

Pages 111-129


About this book

The World Wide Web constitutes the largest existing source of texts written in a great variety of languages. A feasible and sound way of exploiting this data for linguistic research is to compile a static corpus for a given language. This approach has several advantages: (i) Working with such corpora obviates the problems encountered when using Internet search engines in quantitative linguistic research (such as non-transparent ranking algorithms). (ii) Creating a corpus from web data is virtually free. (iii) The size of corpora compiled from the WWW may exceed by several orders of magnitude the size of language resources offered elsewhere. (iv) The data is locally available to the user, and it can be linguistically post-processed and queried with the tools the user prefers.

This book addresses the main practical tasks in the creation of web corpora up to giga-token size. Among these tasks are the sampling process (i.e., web crawling) and the usual cleanups, including boilerplate removal and removal of duplicated content. Linguistic processing, and the problems that the different kinds of noise in web corpora cause for it, are also covered. Finally, the authors show how web corpora can be evaluated and compared to other corpora (such as traditionally compiled corpora).

For additional material please visit the companion website: sites.morganclaypool.com/wcc

Table of Contents: Preface / Acknowledgments / Web Corpora / Data Collection / Post-Processing / Linguistic Processing / Corpus Evaluation and Comparison / Bibliography / Authors' Biographies

Authors and Affiliations

Freie Universität Berlin, Germany

Roland Schäfer, Felix Bildhauer

About the authors

Roland Schäfer studied Theoretical and Indo-European Linguistics as well as Japanese Linguistics at Marburg and Bochum Universities. He completed his doctorate, "Arguments and Adjuncts at the Syntax-Semantics Interface", in 2008 at Göttingen University, supervised by Gert Webelhuth and Regine Eckardt. Since then, he has been working as a research assistant at Freie Universität Berlin, mainly doing corpus-based research on semantic and morpho-syntactic phenomena. In 2011, he started working on the COW ("Corpora from the Web") project with Felix Bildhauer. His teaching experience covers a wide range of topics including Theoretical and Corpus Linguistics, English and German Linguistics, as well as Computational Linguistics.

Bibliographic Information

Book Title: Web Corpus Construction
Authors: Roland Schäfer, Felix Bildhauer
Series Title: Synthesis Lectures on Human Language Technologies
DOI: https://doi.org/10.1007/978-3-031-02152-7
Publisher: Springer Cham
eBook Packages: Synthesis Collection of Technology (R0), eBColl Synthesis Collection 5
Copyright Information: Springer Nature Switzerland AG 2013
Softcover ISBN: 978-3-031-01024-8
Softcover Published: 22 July 2013
eBook ISBN: 978-3-031-02152-7
eBook Published: 31 May 2022
Series ISSN: 1947-4040
Series E-ISSN: 1947-4059
Edition Number: 1
Number of Pages: XV, 129
Topics: Artificial Intelligence, Natural Language Processing (NLP), Computational Linguistics
