NumER: A Fine-Grained Numeral Entity Recognition Dataset

Part of the Lecture Notes in Computer Science book series (LNISA, volume 12801)

Abstract

Named entity recognition (NER) is essential and widely used in natural language processing tasks such as question answering, entity linking, and text summarization. However, most current NER models and datasets focus more on words than on numerals. Numerals in documents can carry useful, in-depth information beyond being cardinal or ordinal; for example, a numeral can indicate age, length, or capacity. To better understand documents, it is necessary to analyze not only textual words but also numeral information. This paper describes NumER, a fine-grained Numeral Entity Recognition dataset comprising 5,447 numerals of 8 entity types over 2,481 sentences. The documents consist of news, Wikipedia articles, questions, and instructions. To demonstrate the use of this dataset, we train a numeral BERT model to detect and categorize numerals in documents. Our baseline model achieves an F1-score of 95%, demonstrating that the model can capture the semantic meaning of the numeral tokens.
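As a rough illustration of how an F1-score over numeral labels can be computed, the following minimal sketch evaluates token-level predictions with micro-averaging. The abstract does not state which averaging scheme or label names the authors use, so the `micro_f1` helper, the `AGE`, `LENGTH`, and `CAPACITY` tags, the `O` tag for non-numeral tokens, and the example sentence are all assumptions made for illustration.

```python
from collections import Counter


def micro_f1(gold, pred, outside="O"):
    """Micro-averaged F1 over token labels, ignoring the outside label.

    Hypothetical helper: the averaging scheme is an assumption, not
    necessarily the one used for the reported score.
    """
    counts = Counter()
    for g, p in zip(gold, pred):
        if p != outside and p == g:
            counts["tp"] += 1  # numeral tagged with the correct type
        else:
            if p != outside:
                counts["fp"] += 1  # predicted a type that is wrong or spurious
            if g != outside:
                counts["fn"] += 1  # gold numeral missed or mistyped
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels for: "She is 25 and ran 5 km with 2 litres"
gold = ["O", "O", "AGE", "O", "O", "LENGTH", "O", "O", "CAPACITY", "O"]
pred = ["O", "O", "AGE", "O", "O", "LENGTH", "O", "O", "LENGTH", "O"]
score = micro_f1(gold, pred)
print(f"micro F1 = {score:.3f}")
```

In this toy case two of the three numerals are typed correctly and one is mistyped, so precision and recall are both 2/3. A mistyped numeral counts as both a false positive and a false negative, which is why fine-grained typing errors lower the score even when every numeral is detected.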

Keywords

  • Named entity recognition
  • Numeral classification
  • Numeral understanding
  • Natural language understanding

This work was supported by JST, AIP Trilateral AI Research, Grant Number JPMJCR20G9 and by NEDO, SIP-2 Program “Big-data and AI-enabled Cyberspace Technologies”, Japan.

Notes

  1. https://github.com/Microsoft/Recognizers-Text

  2. https://www.kaggle.com/hugodarwood/epirecipes

  3. https://www.kaggle.com/rmisra/news-category-dataset

  4. https://v2.spacy.io/

Author information

Corresponding author

Correspondence to Thanakrit Julavanich.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Julavanich, T., Aizawa, A. (2021). NumER: A Fine-Grained Numeral Entity Recognition Dataset. In: Métais, E., Meziane, F., Horacek, H., Kapetanios, E. (eds) Natural Language Processing and Information Systems. NLDB 2021. Lecture Notes in Computer Science, vol 12801. Springer, Cham. https://doi.org/10.1007/978-3-030-80599-9_7

  • DOI: https://doi.org/10.1007/978-3-030-80599-9_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-80598-2

  • Online ISBN: 978-3-030-80599-9

  • eBook Packages: Computer Science, Computer Science (R0)