European Language Grid, pp 343–348

Textual Paraphrase Dataset for Deep Language Modelling

  • Jenna Kanerva,
  • Filip Ginter,
  • Li-Hsin Chang,
  • Valtteri Skantsi,
  • Jemina Kilpeläinen,
  • Hanna-Mari Kupari,
  • Aurora Piirto,
  • Jenna Saarni,
  • Maija Sevón &
  • Otto Tarkka
  • Chapter
  • Open Access
  • First Online: 02 November 2022

Part of the Cognitive Technologies book series (COGTECH)

Abstract

The Turku Paraphrase Corpus is a dataset of over 100,000 Finnish paraphrase pairs. In creating the corpus, we strove to gather challenging paraphrase pairs that are better suited to testing the capabilities of natural language understanding models. The paraphrases are both selected and classified manually so as to minimise lexical overlap and to provide examples that are as structurally and lexically different as possible. An important distinguishing feature of the corpus is that most of the paraphrase pairs are extracted and distributed in their native document context rather than in isolation. The primary application of the dataset is the development and evaluation of deep language models, and representation learning in general.
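
To make the notion of "lexical overlap" concrete, the following is a minimal sketch of one common way to quantify it: the Jaccard similarity between the token sets of two sentences. This is an illustration only. The abstract states that the paraphrases were selected and classified manually, so this is not the authors' procedure; the function names and the English example pair are assumptions made up for this sketch and are not drawn from the corpus.

```python
import re


def tokens(text: str) -> set[str]:
    # Lowercased word tokens; Python's word class is Unicode-aware,
    # so Finnish letters such as ä and ö are kept intact.
    return set(re.findall(r"\w+", text.lower()))


def jaccard_overlap(a: str, b: str) -> float:
    # Fraction of distinct tokens shared by both sentences:
    # 0.0 means no shared tokens, 1.0 means identical token sets.
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


# Hypothetical sentence pair, invented for illustration (not from the corpus).
pair = (
    "The meeting was postponed until next week.",
    "They pushed the appointment back by seven days.",
)
print(f"lexical overlap: {jaccard_overlap(*pair):.3f}")
```

A pair that conveys the same meaning with a low score of this kind is the sort of example the abstract calls challenging, since a model cannot rely on shared surface words to recognise the paraphrase.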

Author information

Authors and Affiliations

  1. University of Turku, Turku, Finland

    Jenna Kanerva, Filip Ginter, Li-Hsin Chang, Valtteri Skantsi, Jemina Kilpeläinen, Hanna-Mari Kupari, Aurora Piirto, Jenna Saarni, Maija Sevón & Otto Tarkka

Corresponding author

Correspondence to Jenna Kanerva.

Editor information

Editors and Affiliations

  1. Deutsches Forschungszentrum für Künstliche Intelligenz GmbH (DFKI), Berlin, Germany

    Georg Rehm

Rights and permissions

Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.

The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

Copyright information

© 2023 The Author(s)

About this chapter

Cite this chapter

Kanerva, J. et al. (2023). Textual Paraphrase Dataset for Deep Language Modelling. In: Rehm, G. (eds) European Language Grid. Cognitive Technologies. Springer, Cham. https://doi.org/10.1007/978-3-031-17258-8_27

  • DOI: https://doi.org/10.1007/978-3-031-17258-8_27

  • Published: 02 November 2022

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-17257-1

  • Online ISBN: 978-3-031-17258-8

  • eBook Packages: Computer Science, Computer Science (R0)
