A Statistical Study of the WPT-03 Corpus

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 3230)

Abstract

This report presents a statistical study of WPT-03, a text corpus built from the pages of the “Portuguese Web” collected in the repository of the tumba! search engine. We give a statistical analysis of the textual contents available in the Portuguese Web, including size distributions, the language of the pages, and the terms they contain.

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Martins, B., Silva, M.J. (2004). A Statistical Study of the WPT-03 Corpus. In: Vicedo, J.L., Martínez-Barco, P., Muñoz, R., Saiz Noeda, M. (eds) Advances in Natural Language Processing. EsTAL 2004. Lecture Notes in Computer Science, vol 3230. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30228-5_34

  • DOI: https://doi.org/10.1007/978-3-540-30228-5_34

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23498-2

  • Online ISBN: 978-3-540-30228-5

  • eBook Packages: Springer Book Archive
