Language Resources and Evaluation, Volume 51, Issue 3, pp 873–889

Accurate and efficient general-purpose boilerplate detection for crawled web corpora

Project Notes

Abstract

Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results for search terms if these terms appear in boilerplate regions of the web page. In this paper, I present and evaluate a supervised machine-learning approach to general-purpose boilerplate detection for languages based on Latin alphabets using Multi-Layer Perceptrons (MLPs). It is both very efficient and very accurate (between 95% and 99% correct classifications, depending on the input language). I show that language-specific classifiers greatly improve the accuracy of boilerplate detectors. The single features used for the classification are evaluated with regard to the merit they contribute to the classification. Furthermore, I show that the accuracy of the MLP is on a par with that of a wide range of other classifiers. My approach has been implemented in the open-source texrex web page cleaning software, and large corpora constructed using it are available from the COW initiative, including the CommonCOW corpora created from CommonCrawl datasets.
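
As a rough illustration of the general idea of supervised boilerplate detection with an MLP, the following minimal sketch trains a tiny one-hidden-layer perceptron on a handful of text blocks. It is not the texrex implementation and does not reproduce the paper's feature set: the three features (link density, relative block length, sentence-punctuation ratio), the toy training values, and the names `TinyMLP` and `toy_blocks` are all assumptions made up for this example.

```python
import math
import random


def sigmoid(x):
    # Clamp the input so math.exp cannot overflow on extreme values.
    x = max(-60.0, min(60.0, x))
    return 1.0 / (1.0 + math.exp(-x))


class TinyMLP:
    """One hidden layer, sigmoid units, trained by plain backpropagation.

    Hypothetical illustration only; not the classifier used in texrex.
    """

    def __init__(self, n_in, n_hidden, seed=0):
        rng = random.Random(seed)
        # Each weight row carries one extra entry for the bias term.
        self.w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                   for _ in range(n_hidden)]
        self.w2 = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def forward(self, x):
        xb = list(x) + [1.0]
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in self.w1]
        y = sigmoid(sum(w * v for w, v in zip(self.w2, h + [1.0])))
        return h, y

    def predict(self, x):
        """Return the estimated probability that a block is boilerplate."""
        return self.forward(x)[1]

    def train(self, data, epochs=2000, lr=0.5):
        for _ in range(epochs):
            for x, target in data:
                xb = list(x) + [1.0]
                h, y = self.forward(x)
                # Sigmoid output with cross-entropy loss: output delta is y - t.
                d_out = y - target
                # Hidden deltas are computed before the output weights change.
                d_hid = [d_out * self.w2[j] * h[j] * (1.0 - h[j])
                         for j in range(len(h))]
                for j, v in enumerate(h + [1.0]):
                    self.w2[j] -= lr * d_out * v
                for j, row in enumerate(self.w1):
                    for i, v in enumerate(xb):
                        row[i] -= lr * d_hid[j] * v


# Invented toy data: [link density, relative length, sentence punctuation ratio]
# Label 1 = boilerplate, 0 = content. These values are not from the paper.
toy_blocks = [
    ([0.90, 0.05, 0.0], 1),  # navigation menu
    ([0.80, 0.10, 0.0], 1),  # list of links
    ([0.60, 0.15, 0.1], 1),  # copyright footer
    ([0.05, 0.90, 0.8], 0),  # article paragraph
    ([0.00, 0.70, 0.9], 0),  # article paragraph
    ([0.10, 0.80, 0.7], 0),  # article paragraph
]

mlp = TinyMLP(n_in=3, n_hidden=3, seed=0)
mlp.train(toy_blocks)

menu_score = mlp.predict([0.90, 0.05, 0.0])
paragraph_score = mlp.predict([0.05, 0.90, 0.8])
print(f"menu-like block: {menu_score:.3f}")
print(f"paragraph-like block: {paragraph_score:.3f}")
```

A block would then be treated as boilerplate when its score exceeds some threshold such as 0.5. In a realistic setting, the feature vectors would be computed from annotated web pages, and a separate classifier could be trained per language, in line with the abstract's observation that language-specific classifiers improve accuracy.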

Keywords

Corpus construction; Web corpora; Boilerplate; Non-destructive corpus normalization

Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  1. Deutsche und niederländische Philologie, Freie Universität Berlin, Berlin, Germany
