Baroni, M., Bernardini, S., Ferraresi, A., & Zanchetta, E. (2009). The WaCky Wide Web: A collection of very large linguistically processed web-crawled corpora. Language Resources and Evaluation, 43(3), 209–226.
Baroni, M., Chantree, F., Kilgarriff, A., & Sharoff, S. (2008). CleanEval: A competition for cleaning webpages. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, et al. (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC ’12) (pp. 638–643). Istanbul: European Language Resources Association (ELRA).
Bauer, D., Degen, J., Deng, X., Herger, P., Gasthaus, J., Giesbrecht, E., et al. (2007). Filtering the internet by automatic subtree classification. In C. Fairon, H. Naets, A. Kilgarriff, & G. M. de Schryver (Eds.), Building and exploring web corpora: Proceedings of the 3rd web as corpus workshop (incorporating CLEANEVAL) (pp. 111–122). Louvain: Presses Universitaires de Louvain.
Biemann, C., Heyer, G., Quasthoff, U., & Richter, M. (2007). The Leipzig Corpora Collection: Monolingual corpora of standard size. In Proceedings of corpus linguistics 2007. Birmingham: University of Birmingham.
Broder, A. Z. (2000). Identifying and filtering near-duplicate documents. In D. Sankoff & R. Giancarlo (Eds.), Proceedings of combinatorial pattern matching (pp. 1–10), Berlin.
Chang, C. C., & Lin, C. J. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology, 2, 1–27.
Cortez, P. (2011). Data mining with multilayer perceptrons and support vector machines. In D. E. Holmes & L. C. Jain (Eds.), Data mining: Foundations and intelligent paradigms. Volume 2: Statistical, Bayesian, time series and other theoretical aspects (Vol. 2, pp. 9–23). Berlin: Springer.
Evert, S., & Hardie, A. (2011). Twenty-first century corpus workbench: Updating a query architecture for the new millennium. In Proceedings of corpus linguistics 2011. Birmingham: University of Birmingham.
Finn, A., Kushmerick, N., & Smyth, B. (2001). Fact or fiction: Content classification for digital libraries. In DELOS workshop: Personalisation and recommender systems in digital libraries.
Gallé, M., & Renders, J. M. (2014). Boilerplate detection and recoding. In M. de Rijke, T. Kenter, A. de Vries, C. X. Zhai, F. de Jong, K. Radinsky, et al. (Eds.), Advances in information retrieval—36th European conference on IR research, ECIR (pp. 462–467). Berlin: Springer.
Grossberg, S. (1973). Contour enhancement, short-term memory, and constancies in reverberating neural networks. Studies in Applied Mathematics, 52, 213–257.
Hall, M., & Witten, I. H. (2011). Data mining: Practical machine learning tools and techniques (3rd ed.). Burlington: Kaufmann.
Kohlschütter, C., Fankhauser, P., & Nejdl, W. (2010). Boilerplate detection using shallow text features. In B. D. Davison, T. Suel, N. Craswell, & B. Liu (Eds.), WSDM ’10: Proceedings of the third ACM international conference on web search and data mining (pp. 441–450). New York: ACM.
Marek, M., Pecina, P., & Spousta, M. (2007). Web page cleaning with conditional random fields. In C. Fairon, H. Naets, A. Kilgarriff, & G. M. de Schryver (Eds.), Building and exploring web corpora: Proceedings of the 3rd web as corpus workshop (incorporating CLEANEVAL) (pp. 155–162). Louvain: Presses Universitaires de Louvain.
Minsky, M. L., & Papert, S. A. (1988). Perceptrons. Cambridge: MIT Press.
Neunerdt, M., Reimer, E., Reyer, M., & Mathar, R. (2015). Enhanced web page cleaning for constructing social media text corpora. In K. J. Kim (Ed.), Information science and applications (pp. 665–672). Berlin: Springer.
Nissen, S. (2003). Implementation of a Fast Artificial Neural Network Library (FANN). Technical report. Datalogisk Institut Københavns Universitet, Copenhagen.
Pasternack, J., & Roth, D. (2009). Extracting article text from the web with maximum subsequence segmentation. In J. Quemada, G. León, Y. Maarek, & W. Nejdl (Eds.), WWW ’09: Proceedings of the 18th international conference on World Wide Web (pp. 971–980). Madrid: ACM.
Pomikálek, J., Rychlý, P., & Kilgarriff, A. (2009). Scaling to billion-plus word corpora. Research in Computing Science, 41, special issue: Advances in Computational Linguistics.
Pomikálek, J. (2011). Removing boilerplate and duplicate content from web corpora. Ph.D. thesis, Masaryk University Faculty of Informatics, Brno. http://is.muni.cz/th/45523fi_d/phdthesis.pdf.
Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323, 533–536.
Schäfer, R. (2015). Processing and querying large web corpora with the COW14 architecture. In P. Bański, H. Biber, E. Breiteneder, M. Kupietz, H. Lüngen, & A. Witt (Eds.), Proceedings of challenges in the management of large corpora 3 (CMLC-3). UCREL, Lancaster.
Schäfer, R. (2016). CommonCOW: Massively huge web corpora from CommonCrawl data and a method to distribute them freely under restrictive EU copyright laws. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, J. Odijk, et al. (Eds.), Proceedings of the tenth international conference on language resources and evaluation (LREC ’16) (pp. 4500–4504). Portorož: European Language Resources Association (ELRA).
Schäfer, R., & Bildhauer, F. (2012). Building large corpora from the web using a new efficient tool chain. In N. Calzolari, K. Choukri, T. Declerck, M. U. Doğan, B. Maegaard, J. Mariani, et al. (Eds.), Proceedings of the eighth international conference on language resources and evaluation (LREC ’12) (pp. 486–493). Istanbul: European Language Resources Association (ELRA).
Schäfer, R., & Bildhauer, F. (2013). Web corpus construction. Synthesis lectures on human language technologies. San Francisco: Morgan and Claypool.
Spousta, M., Marek, M., & Pecina, P. (2008). Victor: The web-page cleaning tool. In S. Evert, A. Kilgarriff, & S. Sharoff (Eds.), Proceedings of the 4th web as corpus workshop (pp. 12–17). Marrakech: European Language Resources Association (ELRA).
Üstün, B., Melssen, W. J., & Buydens, L. M. (2006). Facilitating the application of support vector regression by using a universal Pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems, 81, 29–40.