brWaC: A WaCky Corpus for Brazilian Portuguese

  • Rodrigo Boos
  • Kassius Prestes
  • Aline Villavicencio
  • Muntsa Padró
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8775)

Abstract

Initiatives for constructing very large corpora have increased in recent years, especially those using the Web as a corpus, since large corpora are crucial for many Natural Language Processing tasks. The WaCky (Web-As-Corpus Kool Yinitiative) methodology has been used to build very large corpora (over a billion words each) for languages such as English, Italian and German, among others. In this paper we present the ongoing work on building brWaC, a massive Brazilian Portuguese corpus crawled from .br domains. At the moment, the crawling process and the PoS tagging are finished, resulting in a tokenized and lemmatized corpus of 3 billion words. The next step is to parse the whole corpus.

Keywords

Web as Corpus, brWaC, WaCky, Brazilian Portuguese

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Rodrigo Boos (1)
  • Kassius Prestes (1)
  • Aline Villavicencio (1)
  • Muntsa Padró (1)

  1. Institute of Informatics, Federal University of Rio Grande do Sul, Porto Alegre, Brazil
