Abstract
This report presents a statistical study of WPT-03, a text corpus built from the pages of the “Portuguese Web” collected in the repository of the tumba! search engine. We give a statistical analysis of the textual contents available in the Portuguese Web, including size distributions, the language of the pages, and the terms they contain.
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Martins, B., Silva, M.J. (2004). A Statistical Study of the WPT-03 Corpus. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_34
DOI: https://doi.org/10.1007/978-3-540-30228-5_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23498-2
Online ISBN: 978-3-540-30228-5
eBook Packages: Springer Book Archive