Accurate and efficient general-purpose boilerplate detection for crawled web corpora

Project Notes

DOI: 10.1007/s10579-016-9359-2

Cite this article as:
Schäfer, R. Lang Resources & Evaluation (2016). doi:10.1007/s10579-016-9359-2

Abstract

Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. In this paper, I present and evaluate a supervised machine-learning approach to general-purpose boilerplate detection for languages based on Latin alphabets using Multi-Layer Perceptrons (MLPs). It is both very efficient and very accurate (between 95% and 99% correct classifications, depending on the input language). I show that language-specific classifiers greatly improve the accuracy of boilerplate detectors. The single features used for the classification are evaluated with regard to the merit they contribute to the classification. Furthermore, I show that the accuracy of the MLP is on a par with that of a wide range of other classifiers. My approach has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl datasets.
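
As a rough illustration of the kind of classifier the abstract describes, the following is a minimal sketch of a Multi-Layer Perceptron that labels text blocks as boilerplate or content. It is not the texrex implementation. The two input features (`link_density` and `text_share`), the toy training values, the network size, and the class name `TinyMLP` are all assumptions made for this example; the actual feature set and training data are described in the article itself.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One tanh hidden layer and one sigmoid output unit (illustrative only)."""

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)  # fixed seed so the sketch is reproducible
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
                   for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden)]
        self.b2 = 0.0

    def forward(self, x):
        # Hidden activations, then the output probability of "boilerplate".
        h = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
             for row, b in zip(self.w1, self.b1)]
        y = sigmoid(sum(w * hj for w, hj in zip(self.w2, h)) + self.b2)
        return h, y

    def predict(self, x):
        return self.forward(x)[1]

    def train(self, data, epochs=500, lr=0.5):
        # Plain per-example gradient descent on the cross-entropy loss.
        for _ in range(epochs):
            for x, target in data:
                h, y = self.forward(x)
                d_out = y - target  # gradient w.r.t. the output pre-activation
                for j, hj in enumerate(h):
                    # Backpropagate through tanh before updating w2[j].
                    d_hid = d_out * self.w2[j] * (1.0 - hj * hj)
                    self.w2[j] -= lr * d_out * hj
                    for i, xi in enumerate(x):
                        self.w1[j][i] -= lr * d_hid * xi
                    self.b1[j] -= lr * d_hid
                self.b2 -= lr * d_out


# Hypothetical toy data: (link_density, text_share) -> 1 = boilerplate, 0 = content.
training_blocks = [
    ([0.9, 0.10], 1), ([0.8, 0.20], 1), ([1.0, 0.05], 1),
    ([0.1, 0.90], 0), ([0.0, 0.80], 0), ([0.2, 1.00], 0),
]

model = TinyMLP(n_in=2, n_hidden=3)
model.train(training_blocks)

menu_like = model.predict([0.85, 0.15])      # many links, little running text
article_like = model.predict([0.05, 0.95])   # few links, mostly running text
print(f"menu-like block:    {menu_like:.3f}")
print(f"article-like block: {article_like:.3f}")
```

In this toy setting, a block is treated as boilerplate when the predicted value exceeds 0.5. A language-specific classifier, in the sense used in the abstract, would correspond to training a separate model of this kind on annotated blocks from each language.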

Keywords

Corpus construction; Web corpora; Boilerplate; Non-destructive corpus normalization

Funding information

Funder Name: Deutsche Forschungsgemeinschaft
Grant Number: SCHA1916/1-1

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  1. Deutsche und niederländische Philologie, Freie Universität Berlin, Berlin, Germany
