A Finnish news corpus for named entity recognition

Abstract

We present a corpus of Finnish news articles with a manually prepared named entity annotation. The corpus consists of 953 articles (193,742 word tokens) with six named entity classes (organization, location, person, product, event, and date). The articles are extracted from the archives of Digitoday, a Finnish online technology news source. The corpus is available for research purposes. We present baseline experiments on the corpus using one rule-based system and two deep learning systems, evaluated on two test sets: one in-domain and one out-of-domain.

Notes

  1. While there is a body of work on learning named entity recognition systems from unannotated data in an unsupervised manner, supervised learning from annotated data has been the prevalent approach in the literature.

  2. www.digitoday.fi.

  3. The data sets are available through FIN-CLARIN at http://urn.fi/urn:nbn:fi:lb-2019050201.

  4. The FiNER system and its technical documentation are available at http://urn.fi/urn:nbn:fi:lb-2018091301.

  5. It should be noted that entities can, in general, overlap in two ways: by being nested or by crossing. In the latter case, two entities overlap but neither is contained in the other. However, such cases were not encountered in the data.

  6. Note that the GENIA annotation allowed the top-level and nested entities to be of the same class, which increases the nested/all ratio. We are not able to say whether this is also the case with NoSta-D based on (Benikova et al. 2014).

  7. These are instances of named entities, not types. For example, New York may occur several times in the corpus. Each occurrence is considered a separate named entity.

  8. https://dumps.wikimedia.org/fiwiki/latest/fiwiki-latest-pages-articles.xml.bz2 downloaded 1.2.2018.

  9. https://www.kielipankki.fi/tools/omorfi/.

  10. The pretrained word embeddings can be obtained directly from http://bionlp-www.utu.fi/fin-vector-space-models/fin-word2vec.bin.

  11. We employ the pretrained FinnTreeBank tagger available at https://github.com/mpsilfve/FinnPos.
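
The distinction drawn in Note 5 between nested and crossing entities can be illustrated with a minimal sketch. It assumes entities are represented as token-offset spans with an exclusive end index; the `Span` type, the `overlap_kind` function, and the labels used below are hypothetical and are not taken from the corpus format or tooling.

```python
from typing import NamedTuple


class Span(NamedTuple):
    """A hypothetical entity span over token offsets (end is exclusive)."""
    start: int
    end: int
    label: str


def overlap_kind(a: Span, b: Span) -> str:
    """Classify two spans as 'disjoint', 'nested', or 'crossing'."""
    # No shared tokens at all.
    if a.end <= b.start or b.end <= a.start:
        return "disjoint"
    # One span lies entirely inside the other (identical spans count as nested).
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    if a_in_b or b_in_a:
        return "nested"
    # The spans share tokens, but neither contains the other.
    return "crossing"


outer = Span(0, 3, "ORG")
inner = Span(0, 1, "LOC")
other = Span(2, 5, "PRO")

print(overlap_kind(outer, inner))  # nested
print(overlap_kind(outer, other))  # crossing
```

Under this representation, a check of this kind over all span pairs in a sentence would be one way to confirm that no crossing entities occur in an annotated data set.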

References

  1. Artstein, R., & Poesio, M. (2008). Inter-coder agreement for computational linguistics. Computational Linguistics, 34(4), 555–596.

  2. Benikova, D., Biemann, C., & Reznicek, M. (2014). NoSta-D named entity annotation for German: Guidelines and dataset. In Proceedings of the ninth international conference on language resources and evaluation (LREC 2014), pp. 2524–2531.

  3. Byrne, K. (2007). Nested named entity recognition in historical archive text. In Proceedings of the first IEEE international conference on semantic computing (ICSC 2007), pp. 589–596.

  4. Chiu, J., & Nichols, E. (2016). Named entity recognition with bidirectional LSTM-CNNs. Transactions of the Association for Computational Linguistics, 4(1), 357–370.

  5. Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

  6. Ehrmann, M., Jacquet, G., & Steinberger, R. (2017). JRC-names: Multilingual entity name variants and titles as linked data. Semantic Web, 8(2), 283–295.

  7. Finkel, J. R., Grenager, T., & Manning, C. (2005). Incorporating non-local information into information extraction systems by Gibbs sampling. In Proceedings of the forty-third annual meeting on association for computational linguistics (ACL 2005), pp. 363–370.

  8. Grishman, R., & Sundheim, B. (1996). Message understanding conference-6: A brief history. In Proceedings of the sixteenth international conference on computational linguistics (COLING 1996), pp. 466–471.

  9. Güngör, O., Güngör, T., & Üsküdarli, S. (2019). The effect of morphology in named entity recognition with sequence tagging. Natural Language Engineering, 25(1), 147–169.

  10. Güngör, O., Üsküdarli, S., & Güngör, T. (2018). Improving named entity recognition by jointly learning to disambiguate morphological tags. In Proceedings of the twenty-seventh international conference on computational linguistics (COLING 2018), pp. 2082–2092.

  11. Gustafson-Capková, S., & Hartmann, B. (2006). Manual of the Stockholm Umeå corpus version 2.0. Stockholm University. https://spraakbanken.gu.se/parole/Docs/SUC2.0-manual.pdf.

  12. Hardwick, S., Silfverberg, M., & Lindén, K. (2015). Extracting semantic frames using hfst-pmatch. In Proceedings of the twentieth nordic conference of computational linguistics (NODALIDA 2015), pp. 305–308.

  13. Ju, M., Miwa, M., & Ananiadou, S. (2018). A neural layered model for nested named entity recognition. In Proceedings of the sixteenth annual conference of the North American chapter of the association for computational linguistics: Human language technologies (NAACL HLT 2018), pp. 1446–1459.

  14. Kanerva, J., Luotolahti, J., Laippala, V., & Ginter, F. (2014). Syntactic n-gram collection from a large-scale corpus of internet Finnish. In Human language technologies-the baltic perspective: Proceedings of the sixth international conference Baltic (HLT 2014), vol. 268, pp. 184–191.

  15. Katiyar, A., & Cardie, C. (2018). Nested named entity recognition revisited. In Proceedings of the sixteenth annual conference of the North American chapter of the association for computational linguistics: Human language technologies (NAACL HLT 2018), pp. 861–871.

  16. Kettunen, K., & Löfberg, L. (2017). Tagging named entities in nineteenth century and modern Finnish newspaper material with a Finnish semantic tagger. In Proceedings of the twenty-first nordic conference on computational linguistics (NODALIDA 2017), pp. 29–36.

  17. Kettunen, K., Mäkelä, E., Ruokolainen, T., Kuokkala, J., & Löfberg, L. (2017). Old content and modern tools-searching named entities in a Finnish OCRed historical newspaper collection 1771–1910. DHQ: Digital Humanities Quarterly, 11(3).

  18. Kettunen, K., & Ruokolainen, T. (2017). Names, right or wrong: Named entities in an OCRed historical Finnish newspaper collection. In Proceedings of the second international conference on digital access to textual cultural heritage, pp. 181–186.

  19. Kleinberg, B., Mozes, M., & van der Toolen, Y. (2017). Netanos-named entity-based text anonymization for open science. https://osf.io/w9nhb.

  20. Kokkinakis, D. (2003). Swedish NER in the Nomen Nescio project. Nordisk Sprogteknologi–Nordic Language Technology 2002, pp. 379–398.

  21. Lample, G., Ballesteros, M., Subramanian, S., Kawakami, K., & Dyer, C. (2016). Neural architectures for named entity recognition. In Proceedings of the fifteenth annual conference of the North American chapter of the association for computational linguistics: Human language technologies (NAACL HLT 2016), pp. 260–270.

  22. Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

  23. Ling, W., Dyer, C., Black, A. W., Trancoso, I., Fermandez, R., Amir, S., Marujo, L., & Luis, T. (2015). Finding function in form: Compositional character models for open vocabulary word representation. In Proceedings of the 2015 conference on empirical methods in natural language processing (EMNLP 2015), pp. 1520–1530.

  24. Ma, X., & Hovy, E. (2016). End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF. In Proceedings of the fifty-fourth annual meeting of the association for computational linguistics (ACL 2016), pp. 1064–1074.

  25. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. In Proceedings of advances in neural information processing systems (NIPS 2013), pp. 3111–3119.

  26. Nadeau, D., & Sekine, S. (2007). A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1), 3–26.

  27. Ohta, T., Tateisi, Y., & Kim, J. D. (2002). The GENIA corpus: An annotated research abstract corpus in molecular biology domain. In Proceedings of the second international conference on human language technology research (HLT 2002), pp. 82–86.

  28. Pennington, J., Socher, R., & Manning, C. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP 2014), pp. 1532–1543.

  29. Rosset, S., Grouin, C., Fort, K., Galibert, O., Kahn, J., & Zweigenbaum, P. (2012). Structured named entities in two distinct press corpora: Contemporary broadcast news and old newspapers. In Proceedings of the sixth linguistic annotation workshop, pp. 40–48.

  30. Silfverberg, M., Ruokolainen, T., Lindén, K., & Kurimo, M. (2016). FinnPos: An open-source morphological tagging and lemmatization toolkit for Finnish. Language Resources and Evaluation, 50(4), 863–878.

  31. Sohrab, M. G., & Miwa, M. (2018). Deep exhaustive model for nested named entity recognition. In Proceedings of the 2018 conference on empirical methods in natural language processing (EMNLP 2018), pp. 2843–2849.

  32. Szarvas, G., Farkas, R., Felföldi, L., Kocsor, A., & Csirik, J. (2006). A highly accurate named entity corpus for Hungarian. In Proceedings of the fifth international conference on language resources and evaluation (LREC 2006), pp. 1957–1960.

  33. Taulé, M., Martí, M. A., & Recasens, M. (2008). AnCora: Multilevel annotated corpora for Catalan and Spanish. In Proceedings of the sixth international conference on language resources and evaluation (LREC 2008), pp. 96–101.

  34. Tjong Kim Sang, E. F., & De Meulder, F. (2003). Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. In Proceedings of the seventh conference on natural language learning (HLT-NAACL 2003), pp. 142–147.

  35. UzZaman, N., Llorens, H., Derczynski, L., Verhagen, M., Allen, J., & Pustejovsky, J. (2013). Semeval-2013 task 1: Tempeval-3: Evaluating time expressions, events, and temporal relations. In Proceedings of second joint conference on lexical and computational semantics (SEM 2013), pp. 1–9.

  36. Verhagen, M., Gaizauskas, R., Schilder, F., Hepple, M., Katz, G., & Pustejovsky, J. (2007). Semeval-2007 task 15: Tempeval temporal relation identification. In Proceedings of the fourth international workshop on semantic evaluation (SEMEVAL 2007), pp. 75–80.

  37. Verhagen, M., Sauri, R., Caselli, T., & Pustejovsky, J. (2010). Semeval-2010 task 13: Tempeval-2. In Proceedings of the fifth international workshop on semantic evaluation (SEMEVAL 2010), pp. 57–62.

  38. Wang, B., Lu, W., Wang, Y., & Jin, H. (2018). A neural transition-based model for nested mention recognition. In Proceedings of the 2018 conference on empirical methods in natural language processing (EMNLP 2018), pp. 1011–1017.

Acknowledgements

We would like to express our sincere gratitude to Onur Güngör and Mohammad Golam Sohrab for their invaluable help with running the experiments. This work was funded by the Academy of Finland (Award Numbers 292260 and 293239) and FIN-CLARIN. The third author has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant Agreement No. 771113).

Author information

Corresponding author

Correspondence to Teemu Ruokolainen.

About this article

Cite this article

Ruokolainen, T., Kauppinen, P., Silfverberg, M. et al. A Finnish news corpus for named entity recognition. Lang Resources & Evaluation 54, 247–272 (2020). https://doi.org/10.1007/s10579-019-09471-7

Keywords

  • Named entity recognition
  • Finnish
  • Newswire
  • Wikipedia