Textual Paraphrase Dataset for Deep Language Modelling

Kanerva, Jenna; Ginter, Filip; Chang, Li-Hsin; Skantsi, Valtteri; Kilpeläinen, Jemina; Kupari, Hanna-Mari; Piirto, Aurora; Saarni, Jenna; Sevón, Maija; Tarkka, Otto

doi:10.1007/978-3-031-17258-8_27


Part of the book series: Cognitive Technologies (COGTECH)


Abstract

The Turku Paraphrase Corpus is a dataset of over 100,000 Finnish paraphrase pairs. During corpus creation, we strove to gather challenging paraphrase pairs that are better suited to testing the capabilities of natural language understanding models. The paraphrases are both selected and classified manually, so as to minimise lexical overlap and to provide examples that are as structurally and lexically different as possible. An important distinguishing feature of the corpus is that most of the paraphrase pairs are extracted and distributed in their native document context, rather than in isolation. The primary application of the dataset is the development and evaluation of deep language models, and representation learning in general.
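
To make the notion of lexical overlap concrete, the following minimal sketch computes a token-level Jaccard overlap for a pair of sentences. It is an illustration only and not the procedure used to build the corpus, where pairs were selected and classified manually. The function names and the English example pairs are invented for this sketch and are not taken from the dataset.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    """Share of distinct tokens that two sentences have in common (0.0 to 1.0)."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 0.0  # convention for two empty inputs
    return len(ta & tb) / len(ta | tb)


# Invented example pairs (assumptions for illustration, not corpus data).
easy_pair = ("The meeting starts at nine.", "The meeting begins at nine.")
hard_pair = ("He did not sleep at all.", "He lay awake the whole night.")

easy_score = jaccard_overlap(*easy_pair)
hard_score = jaccard_overlap(*hard_pair)

print(f"high-overlap pair: {easy_score:.2f}")
print(f"low-overlap pair:  {hard_score:.2f}")
```

A pair such as the second one, where the two sentences share almost no words yet express the same meaning, is the kind of example that a model cannot resolve by surface matching alone.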


Author information

Authors and Affiliations

University of Turku, Turku, Finland
Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón & Otto Tarkka


Corresponding author

Correspondence to Jenna Kanerva.

Editor information

Editors and Affiliations

Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Berlin, Germany
Georg Rehm

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.


About this chapter

Cite this chapter

Kanerva, J. et al. (2023). Textual Paraphrase Dataset for Deep Language Modelling. In: Rehm, G. (eds) European Language Grid. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-17258-8_27


DOI: https://doi.org/10.1007/978-3-031-17258-8_27
Published: 02 November 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17257-1
Online ISBN: 978-3-031-17258-8
eBook Packages: Computer Science, Computer Science (R0)
