Abstract
As scientific communities grow and evolve, more and more papers are published, especially in computer science (CS). Organizing the scientific information in a large corpus of CS papers into structured knowledge bases is important, and it usually requires Information Extraction (IE) of scientific entities and their relationships. To support building high-quality structured scientific knowledge bases with supervised learning, several manually annotated entity-relation datasets exist for computer science, to the best of our knowledge, such as SciERC and SciREX, which are used to train supervised extraction algorithms. However, almost all of these datasets ignore the annotation of the following fine-grained named entities: nested entities, discontinuous entities, and minimal independent semantics entities. To solve this problem, this paper presents a novel Fine-Grained entity-relation extraction dataset in the Computer Science field (FGCS), which contains rich fine-grained entities and their relationships. The proposed dataset includes 1,948 sentences annotated with 6 entity types, with up to 7 layers of nesting, and 5 relation types. Extensive experiments show that the proposed dataset is a good benchmark for measuring an information extraction model’s ability to recognize fine-grained entities and their relations. Our dataset is publicly available at https://github.com/broken-dream/FGCS.
H. Wang and J.-J. Zhu: Equal contribution.
Notes
PFN has trouble handling relations between nested entities, but such cases are rare, so we still classify it as a nested NER & RE baseline.
References
Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: SemEval-2017, pp. 546–555 (2017)
Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Adversarial training for multi-context joint entity and relation extraction. In: EMNLP, pp. 2830–2836 (2018)
Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755 (2019)
Gábor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H., Charnois, T.: SemEval-2018 task 7: semantic relation extraction and classification in scientific papers. In: SemEval-2018, pp. 679–688 (2018)
Jain, S., van Zuylen, M., Hajishirzi, H., Beltagy, I.: SciREX: a challenge dataset for document-level information extraction. In: ACL, pp. 7506–7516 (2020)
Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.S.: S2ORC: the semantic scholar open research corpus. In: ACL, pp. 4969–4983 (2020)
Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP, pp. 3219–3232 (2018)
Magnusson, I., Friedman, S.: Extracting fine-grained knowledge graphs of scientific claims: dataset and transformer-based results. In: EMNLP, pp. 4651–4658 (2021)
QasemiZadeh, B., Schumann, A.K.: The ACL RD-TEC 2.0: a language resource for evaluating term extraction and entity recognition methods. In: LREC, pp. 1862–1868 (2016)
Shen, Y., Ma, X., Tan, Z., Zhang, S., Wang, W., Lu, W.: Locate and label: a two-stage identifier for nested named entity recognition. In: ACL-IJCNLP, pp. 2782–2794 (2021)
Taillé, B., Guigue, V., Scoutheeten, G., Gallinari, P.: Let’s stop incorrect comparisons in end-to-end relation extraction! arXiv preprint arXiv:2009.10684 (2020)
Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: EMNLP-IJCNLP, pp. 5784–5789 (2019)
Wang, J., Lu, W.: Two are better than one: joint entity and relation extraction with table-sequence encoders. In: EMNLP, pp. 1706–1721 (2020)
Wang, J., Shou, L., Chen, K., Chen, G.: Pyramid: a layered model for nested named entity recognition. In: ACL, pp. 5918–5928 (2020)
Wang, Y., Sun, C., Wu, Y., Zhou, H., Li, L., Yan, J.: UniRE: a unified label space for entity relation extraction. In: ACL-IJCNLP, pp. 220–231 (2021)
Yan, Z., Zhang, C., Fu, J., Zhang, Q., Wei, Z.: A partition filter network for joint entity and relation extraction. In: EMNLP, pp. 185–197 (2021)
Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: NAACL-HLT, pp. 50–61 (2021)
A Annotation Guideline
A.1 Entity Category
- Task: Problems to solve or systems to construct in order to achieve the goal of a specific domain.
- Method: A method used or proposed to solve a problem or complete a task, including methods, systems, tools, and models.
- Metric: A concrete or abstract method for evaluating a method or task.
- Material: A concrete dataset or knowledge base.
- Problem: Inherent problems or defects of a method.
- Other Scientific Term: Scientific terms that are related to CS but do not fall into any of the above classes.
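The six entity categories above, together with the nested annotation the dataset allows, can be sketched in code as follows. This is only an illustrative sketch: the `Entity` class, its token-offset span encoding (end exclusive), the `nesting_depth` helper, and the example spans and labels are all assumptions made here and are not the dataset's actual file format or annotations.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    """The six entity categories listed in the guideline."""
    TASK = "Task"
    METHOD = "Method"
    METRIC = "Metric"
    MATERIAL = "Material"
    PROBLEM = "Problem"
    OTHER = "Other Scientific Term"


@dataclass(frozen=True)
class Entity:
    # Hypothetical span encoding: token offsets, end exclusive.
    start: int
    end: int
    type: EntityType

    def contains(self, other: "Entity") -> bool:
        """True if `other` lies inside this span and covers a different span."""
        return (self.start <= other.start and other.end <= self.end
                and (self.start, self.end) != (other.start, other.end))


def nesting_depth(entity, entities):
    """Layers at and below `entity`; 1 if nothing is nested inside it."""
    inner = [e for e in entities if entity.contains(e)]
    return 1 + max((nesting_depth(e, entities) for e in inner), default=0)


# Made-up example over the tokens: neural machine translation model
entities = [
    Entity(0, 4, EntityType.METHOD),  # "neural machine translation model"
    Entity(0, 3, EntityType.TASK),    # "neural machine translation"
    Entity(1, 3, EntityType.TASK),    # "machine translation"
]
print(nesting_depth(entities[0], entities))
```

A contiguous span cannot express a discontinuous entity; a list of segments per entity would be one way to extend the sketch, again as an assumption rather than the dataset's encoding.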
A.2 Relation Category
A relation can be annotated between any two entities (even two nested entities). All relations (Used-for, Part-of, Hyponym-of, Evaluate-for, Synonymy-of) are asymmetric except Synonymy-of. To reduce redundancy, for Synonymy-of the head entity is always the one that appears first in the sentence. For example, in the phrase "Recurrent Neural Network (RNN)", only ("Recurrent Neural Network", Synonymy-of, "RNN") is annotated, while the reversed triplet is omitted.
- Used-for: one entity is used for another, is based on another, etc.
- Part-of: one entity is a part of another, etc.
- Hyponym-of: one entity is a hyponym of another. Note the difference between Hyponym-of and Part-of. Usually, Hyponym-of refers to entities at different levels of abstraction, while Part-of refers to entities at the same level.
- Evaluate-for: one entity is evaluated by another, etc.
- Synonymy-of: two entities have the same meaning, refer to the same entity, etc.
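The rule that a Synonymy-of triplet is recorded only with the earlier mention as head can be sketched as a small normalization step. This is a minimal sketch under stated assumptions: the `RelationType` enum, the `canonical_triple` function, and the `(start_offset, text)` representation of an entity mention are hypothetical names introduced here for illustration, not part of the released dataset or its tooling.

```python
from enum import Enum


class RelationType(Enum):
    """The five relation categories named in the guideline."""
    USED_FOR = "Used-for"
    PART_OF = "Part-of"
    HYPONYM_OF = "Hyponym-of"
    EVALUATE_FOR = "Evaluate-for"
    SYNONYMY_OF = "Synonymy-of"


def canonical_triple(head, relation, tail):
    """Return the triplet in the orientation the guideline annotates.

    `head` and `tail` are hypothetical (start_offset, text) pairs.
    Only the symmetric Synonymy-of relation is reordered, so that the
    mention appearing first in the sentence becomes the head.
    Asymmetric relations are returned unchanged.
    """
    if relation is RelationType.SYNONYMY_OF and tail[0] < head[0]:
        head, tail = tail, head
    return (head, relation, tail)


# Tokens: Recurrent Neural Network ( RNN )
long_form = (0, "Recurrent Neural Network")
short_form = (4, "RNN")

# The reversed triplet is mapped back to the annotated orientation.
print(canonical_triple(short_form, RelationType.SYNONYMY_OF, long_form))
```

Applying the same function to both orientations yields a single triplet, which is one way to deduplicate symmetric relations before scoring a model's predictions.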
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, H., Zhu, JJ., Wei, W., Huang, H., Mao, XL. (2023). FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_51
DOI: https://doi.org/10.1007/978-3-031-44696-2_51
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2
eBook Packages: Computer Science, Computer Science (R0)