Aranea: Yet Another Family of (Comparable) Web Corpora

Benko, Vladimír

doi:10.1007/978-3-319-10816-2_31

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

Our paper reports on an ongoing project that uses free and open-source tools to build a family of web corpora which can, to a large extent, be described as “comparable”. We summarize the results of the project's first stage and comment on our experience with the tools.

Author information

Authors and Affiliations

Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, Panská 26, SK-81101, Bratislava, Slovakia
Vladimír Benko
Comenius University in Bratislava, UNESCO Chair in Translation Studies, Šoltésovej 4, SK-81334, Bratislava, Slovakia
Vladimír Benko

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 68a, 60200, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Benko, V. (2014). Aranea: Yet Another Family of (Comparable) Web Corpora. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_31

DOI: https://doi.org/10.1007/978-3-319-10816-2_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)
