FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain

Wang, Hao; Zhu, Jing-Jing; Wei, Wei; Huang, Heyan; Mao, Xian-Ling

doi:10.1007/978-3-031-44696-2_51

Hao Wang¹, Jing-Jing Zhu¹, Wei Wei², Heyan Huang¹ & Xian-Ling Mao¹

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14303)

Included in the following conference series:

CCF International Conference on Natural Language Processing and Chinese Computing

Abstract

As scientific communities grow and evolve, more and more papers are published, especially in the field of computer science (CS). It is important to organize the scientific information in a large corpus of CS papers into structured knowledge bases, which usually requires Information Extraction (IE) of scientific entities and their relationships. To support the construction of high-quality structured scientific knowledge bases through supervised learning, several hand-annotated entity-relation datasets for computer science, such as SciERC and SciREX, have been built, as far as we know, and are used to train supervised extraction algorithms. However, almost all of these datasets ignore the annotation of the following fine-grained named entities: nested entities, discontinuous entities, and minimal independent semantics entities. To solve this problem, this paper presents a novel Fine-Grained entity-relation extraction dataset in the Computer Science field (FGCS), which contains rich fine-grained entities and their relationships. The proposed dataset includes 1,948 sentences with 6 entity types, up to 7 layers of nesting, and 5 relation types. Extensive experiments show that the proposed dataset is a good benchmark for measuring an information extraction model's ability to recognize fine-grained entities and their relations. Our dataset is publicly available at https://github.com/broken-dream/FGCS.

H. Wang and J.-J. Zhu: Equal contribution.

Notes

1. https://dblp.uni-trier.de/db/.
2. https://prodi.gy/.
3. Actually, PFN has trouble handling relations on nested entities, but such cases are rare, so we still classify it as a nested NER & RE baseline.

References

Augenstein, I., Das, M., Riedel, S., Vikraman, L., McCallum, A.: SemEval 2017 task 10: ScienceIE - extracting keyphrases and relations from scientific publications. In: SemEval-2017, pp. 546–555 (2017)
Bekoulis, G., Deleu, J., Demeester, T., Develder, C.: Adversarial training for multi-context joint entity and relation extraction. In: EMNLP, pp. 2830–2836 (2018)
Eberts, M., Ulges, A.: Span-based joint entity and relation extraction with transformer pre-training. arXiv preprint arXiv:1909.07755 (2019)
Gábor, K., Buscaldi, D., Schumann, A.K., QasemiZadeh, B., Zargayouna, H., Charnois, T.: SemEval-2018 task 7: semantic relation extraction and classification in scientific papers. In: SemEval-2018, pp. 679–688 (2018)
Jain, S., van Zuylen, M., Hajishirzi, H., Beltagy, I.: SciREX: a challenge dataset for document-level information extraction. In: ACL, pp. 7506–7516 (2020)
Lo, K., Wang, L.L., Neumann, M., Kinney, R., Weld, D.S.: S2ORC: the semantic scholar open research corpus. In: ACL, pp. 4969–4983 (2020)
Luan, Y., He, L., Ostendorf, M., Hajishirzi, H.: Multi-task identification of entities, relations, and coreference for scientific knowledge graph construction. In: EMNLP, pp. 3219–3232 (2018)
Magnusson, I., Friedman, S.: Extracting fine-grained knowledge graphs of scientific claims: dataset and transformer-based results. In: EMNLP, pp. 4651–4658 (2021)
QasemiZadeh, B., Schumann, A.K.: The ACL RD-TEC 2.0: a language resource for evaluating term extraction and entity recognition methods. In: LREC, pp. 1862–1868 (2016)
Shen, Y., Ma, X., Tan, Z., Zhang, S., Wang, W., Lu, W.: Locate and label: a two-stage identifier for nested named entity recognition. In: ACL-IJCNLP, pp. 2782–2794 (2021)
Taillé, B., Guigue, V., Scoutheeten, G., Gallinari, P.: Let’s stop incorrect comparisons in end-to-end relation extraction! arXiv preprint arXiv:2009.10684 (2020)
Wadden, D., Wennberg, U., Luan, Y., Hajishirzi, H.: Entity, relation, and event extraction with contextualized span representations. In: EMNLP-IJCNLP, pp. 5784–5789 (2019)
Wang, J., Lu, W.: Two are better than one: joint entity and relation extraction with table-sequence encoders. In: EMNLP, pp. 1706–1721 (2020)
Wang, J., Shou, L., Chen, K., Chen, G.: Pyramid: a layered model for nested named entity recognition. In: ACL, pp. 5918–5928 (2020)
Wang, Y., Sun, C., Wu, Y., Zhou, H., Li, L., Yan, J.: UniRE: a unified label space for entity relation extraction. In: ACL-IJCNLP, pp. 220–231 (2021)
Yan, Z., Zhang, C., Fu, J., Zhang, Q., Wei, Z.: A partition filter network for joint entity and relation extraction. In: EMNLP, pp. 185–197 (2021)
Zhong, Z., Chen, D.: A frustratingly easy approach for entity and relation extraction. In: NAACL-HLT, pp. 50–61 (2021)

Author information

Authors and Affiliations

Beijing Institute of Technology, Beijing, China
Hao Wang, Jing-Jing Zhu, Heyan Huang & Xian-Ling Mao
Huazhong University of Science and Technology, Wuhan, China
Wei Wei

Corresponding author

Correspondence to Xian-Ling Mao.

Editor information

Editors and Affiliations

Emory University, Atlanta, GA, USA
Fei Liu
Microsoft Research Asia, Beijing, China
Nan Duan
Soochow University, Suzhou, China
Qingting Xu
Soochow University, Suzhou, China
Yu Hong

A Annotation Guideline

A.1 Entity Category

Task: Problems to solve or systems to construct in order to achieve the goal of a specific domain.
Method: A method used or proposed to solve a problem or complete a task, including methods, systems, tools, and models.
Metric: A concrete or abstract measure for evaluating a method or task.
Material: A concrete dataset or knowledge base.
Problem: An inherent problem or defect of a method.
Other Scientific Term: A scientific term related to CS that does not fall into any of the above classes.
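
To make the six categories and the fine-grained cases named in the abstract (nested and discontinuous entities) concrete, the following is a minimal sketch of one way to represent them in Python. It is not the format of the released dataset: the `Entity` class, the token-offset fragments, the `nesting_depth` helper, and both example phrases are assumptions introduced here for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class EntityType(Enum):
    """The six entity categories listed in A.1."""
    TASK = "Task"
    METHOD = "Method"
    METRIC = "Metric"
    MATERIAL = "Material"
    PROBLEM = "Problem"
    OTHER = "Other Scientific Term"


@dataclass(frozen=True)
class Entity:
    """Hypothetical entity record (not the dataset's actual schema)."""
    text: str
    type: EntityType
    # Token-offset fragments (start, end), end exclusive.
    # One fragment = continuous entity; several = discontinuous entity.
    fragments: tuple

    def token_set(self):
        """All token indices covered by this entity."""
        return frozenset(
            i for start, end in self.fragments for i in range(start, end)
        )

    def is_discontinuous(self):
        return len(self.fragments) > 1


def nesting_depth(entities):
    """Length of the longest chain of strictly nested entities."""
    # Sorting by size guarantees every strictly contained entity comes first.
    ordered = sorted(entities, key=lambda e: len(e.token_set()))
    depth = {}
    for i, outer in enumerate(ordered):
        inner = [
            depth[j]
            for j in range(i)
            if ordered[j].token_set() < outer.token_set()  # proper subset
        ]
        depth[i] = 1 + max(inner, default=0)
    return max(depth.values(), default=0)


# Invented example. Tokens: 0 recurrent, 1 neural, 2 network, 3 language, 4 model
nested = [
    Entity("neural network", EntityType.METHOD, ((1, 3),)),
    Entity("recurrent neural network", EntityType.METHOD, ((0, 3),)),
    Entity("recurrent neural network language model", EntityType.METHOD, ((0, 5),)),
]

# Invented example. Tokens: 0 entity, 1 and, 2 relation, 3 extraction
split = Entity("entity extraction", EntityType.TASK, ((0, 1), (3, 4)))

print(nesting_depth(nested))     # 3
print(split.is_discontinuous())  # True
```

Storing fragments rather than a single span is what lets one record type cover continuous, nested, and discontinuous entities; a flat BIO tag sequence cannot express the last two.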

A.2 Relation Category

A relation can be annotated between any two entities, including two nested entities. All relations (Used-for, Part-of, Hyponym-of, Evaluate-for, Synonymy-of) are asymmetric except Synonymy-of. To reduce redundancy, for Synonymy-of the head entity is always the one that appears first in the sentence. For example, in the phrase "Recurrent Neural Network (RNN)", only ("Recurrent Neural Network", Synonymy-of, "RNN") is annotated, while the reversed triplet is ignored.

Used-for: "is used for", "is based on", etc.
Part-of: "is a part of", etc.
Hyponym-of: "is a hyponym of". Note the difference between Hyponym-of and Part-of: usually, Hyponym-of relates entities at different levels of abstraction, while Part-of relates entities at the same level.
Evaluate-for: "is evaluated by", etc.
Synonymy-of: "has the same meaning as", "refer to the same entity", etc.
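
The ordering rule for Synonymy-of can be sketched as a small normalization step. This is an illustrative assumption about how a consumer of the annotations might enforce the rule, not code from the dataset repository: the `canonical_triplet` function and the `(start_token, text)` mention tuples are hypothetical.

```python
from enum import Enum


class RelationType(Enum):
    """The five relation categories listed in A.2."""
    USED_FOR = "Used-for"
    PART_OF = "Part-of"
    HYPONYM_OF = "Hyponym-of"
    EVALUATE_FOR = "Evaluate-for"
    SYNONYMY_OF = "Synonymy-of"


def canonical_triplet(head, relation, tail):
    """Return (head, relation, tail) in the guideline's canonical order.

    head and tail are hypothetical (start_token, text) pairs. Only the
    symmetric Synonymy-of relation is reordered, so that the mention
    appearing first in the sentence becomes the head. Asymmetric
    relations are returned unchanged because their direction is meaningful.
    """
    if relation is RelationType.SYNONYMY_OF and tail[0] < head[0]:
        head, tail = tail, head
    return (head, relation, tail)


# Tokens of the guideline's example phrase:
# 0 Recurrent, 1 Neural, 2 Network, 3 (, 4 RNN, 5 )
long_form = (0, "Recurrent Neural Network")
short_form = (4, "RNN")

forward = canonical_triplet(long_form, RelationType.SYNONYMY_OF, short_form)
backward = canonical_triplet(short_form, RelationType.SYNONYMY_OF, long_form)

# Both directions collapse to the single annotated triplet.
print(forward == backward)  # True
print(forward[0][1])        # Recurrent Neural Network
```

Normalizing before comparison keeps a predicted ("RNN", Synonymy-of, "Recurrent Neural Network") from being counted as a different triplet than the annotated one.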

Cite this paper

Wang, H., Zhu, JJ., Wei, W., Huang, H., Mao, XL. (2023). FGCS: A Fine-Grained Scientific Information Extraction Dataset in Computer Science Domain. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_51

DOI: https://doi.org/10.1007/978-3-031-44696-2_51
Published: 08 October 2023
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2
eBook Packages: Computer Science, Computer Science (R0)
