
FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain

  • Conference paper
Natural Language Processing and Chinese Computing (NLPCC 2023)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)


Abstract

As scientific communities grow and evolve, more and more papers are published, especially in the field of computer science (CS). It is important to organize the scientific information in a large corpus of CS papers into structured knowledge bases, which usually requires Information Extraction (IE) of scientific entities and their relationships. To support the construction of high-quality structured scientific knowledge bases through supervised learning, several manually annotated entity-relation datasets for the computer science field, such as SciERC and SciREX, have been created, as far as we know, and are used to train supervised extraction algorithms. However, almost all of these datasets ignore the annotation of the following fine-grained named entities: nested entities, discontinuous entities, and minimal independent semantics entities. To solve this problem, this paper presents a novel Fine-Grained entity-relation extraction dataset in the Computer Science field (FGCS), which contains rich fine-grained entities and their relationships. The proposed dataset includes 1,948 sentences annotated with 6 entity types, with up to 7 layers of nesting, and 5 relation types. Extensive experiments show that the proposed dataset is a good benchmark for measuring an information extraction model’s ability to recognize fine-grained entities and their relations. Our dataset is publicly available at https://github.com/broken-dream/FGCS.

H. Wang and J.-J. Zhu contributed equally.


Notes

  1. https://dblp.uni-trier.de/db/.

  2. https://prodi.gy/.

  3. Actually, PFN has trouble handling relations on nested entities, but such cases are rare, so we still count it among the nested NER & RE baselines.

References

  1. Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: SemEval-2017, pp. 546–555 (2017)

  2. Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Adversarial training for multi-context joint entity and relation extraction. In: EMNLP, pp. 2830–2836 (2018)

  3. Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755 (2019)

  4. Gábor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H., Charnois, T.: SemEval-2018 task 7: semantic relation extraction and classification in scientific papers. In: SemEval-2018, pp. 679–688 (2018)

  5. Jain, S., van Zuylen, M., Hajishirzi, H., Beltagy, I.: SciREX: a challenge dataset for document-level information extraction. In: ACL, pp. 7506–7516 (2020)

  6. Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.S.: S2ORC: the semantic scholar open research corpus. In: ACL, pp. 4969–4983 (2020)

  7. Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP, pp. 3219–3232 (2018)

  8. Magnusson, I., Friedman, S.: Extracting fine-grained knowledge graphs of scientific claims: dataset and transformer-based results. In: EMNLP, pp. 4651–4658 (2021)

  9. QasemiZadeh, B., Schumann, A.K.: The ACL RD-TEC 2.0: a language resource for evaluating term extraction and entity recognition methods. In: LREC, pp. 1862–1868 (2016)

  10. Shen, Y., Ma, X., Tan, Z., Zhang, S., Wang, W., Lu, W.: Locate and label: a two-stage identifier for nested named entity recognition. In: ACL-IJCNLP, pp. 2782–2794 (2021)

  11. Taillé, B., Guigue, V., Scoutheeten, G., Gallinari, P.: Let’s stop incorrect comparisons in end-to-end relation extraction! arXiv preprint arXiv:2009.10684 (2020)

  12. Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: EMNLP-IJCNLP, pp. 5784–5789 (2019)

  13. Wang, J., Lu, W.: Two are better than one: joint entity and relation extraction with table-sequence encoders. In: EMNLP, pp. 1706–1721 (2020)

  14. Wang, J., Shou, L., Chen, K., Chen, G.: Pyramid: a layered model for nested named entity recognition. In: ACL, pp. 5918–5928 (2020)

  15. Wang, Y., Sun, C., Wu, Y., Zhou, H., Li, L., Yan, J.: UniRE: a unified label space for entity relation extraction. In: ACL-IJCNLP, pp. 220–231 (2021)

  16. Yan, Z., Zhang, C., Fu, J., Zhang, Q., Wei, Z.: A partition filter network for joint entity and relation extraction. In: EMNLP, pp. 185–197 (2021)

  17. Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: NAACL-HLT, pp. 50–61 (2021)


Author information


Corresponding author

Correspondence to Xian-Ling Mao.


A Annotation Guideline


A.1 Entity Category

  • Task: Problems to solve or systems to construct in order to achieve the goal of a specific domain.

  • Method: A method used or proposed to solve a problem or complete a task, including methods, systems, tools and models.

  • Metric: A concrete or abstract method for evaluating a method or task.

  • Material: Concrete dataset or knowledge base.

  • Problem: Inherent problems or defects of a method.

  • Other Scientific Term: Scientific terms that are related to CS but do not fall into any of the above classes.
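The six entity categories, together with the nested and discontinuous entities mentioned in the abstract, can be illustrated with a small data structure. This is a minimal sketch only: the `Entity` class, its `fragments` field (token ranges with exclusive end), and the example sentence are hypothetical and do not reflect the released FGCS file format.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    # The six entity categories listed in A.1.
    TASK = "Task"
    METHOD = "Method"
    METRIC = "Metric"
    MATERIAL = "Material"
    PROBLEM = "Problem"
    OTHER = "Other Scientific Term"


@dataclass(frozen=True)
class Entity:
    # Hypothetical representation: one or more (start, end) token ranges,
    # end exclusive. More than one range models a discontinuous entity.
    label: EntityType
    fragments: tuple

    def token_ids(self):
        return {i for start, end in self.fragments for i in range(start, end)}

    def is_discontinuous(self):
        return len(self.fragments) > 1

    def is_nested_in(self, other):
        # Nested means every token of this entity also belongs to `other`,
        # and `other` covers at least one additional token.
        return self.token_ids() < other.token_ids()

    def text(self, tokens):
        return " ".join(tokens[i] for i in sorted(self.token_ids()))


# Illustrative phrase (not taken from the dataset).
tokens = ["entity", "and", "relation", "extraction"]
joint_task = Entity(EntityType.TASK, ((0, 4),))
relation_extraction = Entity(EntityType.TASK, ((2, 4),))
entity_extraction = Entity(EntityType.TASK, ((0, 1), (3, 4)))
```

Here `entity_extraction` is both discontinuous (it skips "and relation") and nested inside `joint_task`, which is the kind of case a flat span annotation cannot express.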

A.2 Relation Category

Relations can be annotated between any two entities (even two nested entities). All relations (Used-for, Part-of, Hyponym-of, Evaluate-for, Synonymy-of) are asymmetric except Synonymy-of. To reduce redundancy, for Synonymy-of the head entity is always the one that appears before the tail entity in the sentence. For example, in the phrase "Recurrent Neural Network (RNN)", only ("Recurrent Neural Network", Synonymy-of, "RNN") is annotated, while the reversed triplet is ignored.

  • Used-for: "is used for", "is based on", etc.

  • Part-of: "is a part of", etc.

  • Hyponym-of: "is a hyponym of". Note the difference between Hyponym-of and Part-of. Usually, Hyponym-of refers to entities at different levels of abstraction, while Part-of refers to entities at the same level.

  • Evaluate-for: "is evaluated by", etc.

  • Synonymy-of: "has the same meaning as", "refer to the same entity", etc.
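The ordering rule for Synonymy-of can be sketched as a small normalization step. This is an illustration under stated assumptions: the `Mention` and `Triplet` classes, the `start` token offset, and the `canonicalize` and `deduplicate` helpers are hypothetical names introduced here, not part of the FGCS release.

```python
from dataclasses import dataclass
from enum import Enum


class RelationType(Enum):
    # The five relation categories listed in A.2.
    USED_FOR = "Used-for"
    PART_OF = "Part-of"
    HYPONYM_OF = "Hyponym-of"
    EVALUATE_FOR = "Evaluate-for"
    SYNONYMY_OF = "Synonymy-of"


@dataclass(frozen=True)
class Mention:
    text: str
    start: int  # hypothetical: index of the mention's first token


@dataclass(frozen=True)
class Triplet:
    head: Mention
    relation: RelationType
    tail: Mention


def canonicalize(triplet):
    """For the symmetric Synonymy-of relation, put the earlier mention first.

    All other relations are asymmetric, so their direction is left untouched.
    """
    if (triplet.relation is RelationType.SYNONYMY_OF
            and triplet.tail.start < triplet.head.start):
        return Triplet(triplet.tail, triplet.relation, triplet.head)
    return triplet


def deduplicate(triplets):
    # dict.fromkeys keeps first-seen order while dropping repeated triplets.
    return list(dict.fromkeys(canonicalize(t) for t in triplets))


# The example from the text: "Recurrent Neural Network (RNN)".
long_form = Mention("Recurrent Neural Network", 0)
short_form = Mention("RNN", 4)
annotated = deduplicate([
    Triplet(long_form, RelationType.SYNONYMY_OF, short_form),
    Triplet(short_form, RelationType.SYNONYMY_OF, long_form),
])
```

After normalization, `annotated` holds a single triplet with "Recurrent Neural Network" as the head, matching the guideline that the reversed triplet is not annotated.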


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Wang, H., Zhu, JJ., Wei, W., Huang, H., Mao, XL. (2023). FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_51


  • DOI: https://doi.org/10.1007/978-3-031-44696-2_51


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-44695-5

  • Online ISBN: 978-3-031-44696-2

  • eBook Packages: Computer Science, Computer Science (R0)
