1 Introduction

Relation extraction (RE) captures relational facts between entities in plain text and serves as the basis of large-scale knowledge base construction and population [1,2,3]. Previous RE studies have focused on sentence-level RE, which extracts relational facts from a single sentence [4,5,6]; consequently, sentence-level RE models attend only to the local context within a single sentence [7, 8].

To equip RE models with comprehensive context understanding, one study [9] built a document-level RE dataset in English, which spurred widespread follow-up research [10,11,12,13]. In Korean, however, document-level RE research has hardly been conducted due to the absence of a document-level RE dataset. Although a Korean cross-sentence-level RE dataset has been collected [14], it is built at the paragraph level, so relations spread across multiple paragraphs of a document are not covered. In other words, it considers inter-sentence and intra-sentence relations only within paragraphs. A document-level RE dataset is therefore required to accelerate document-level RE research in Korean.

Simply translating an English document-level RE dataset is an inappropriate substitute for such a resource, because Korean is an agglutinative language, whereas English is an inflectional language [15]. As an agglutinative language, one characteristic of Korean is the close relatedness between lexical morphemes and functional morphemes [16, 17]. Since the complete meaning of a Korean word depends on the type of functional morpheme attached to it, identifying the functional morpheme's role in the sentence is essential. For example, in the sentence meaning "An apple is at home.", the word "사과는" is a combination of the entity "사과 (apple)" and the functional morpheme "는 [-neun]"; "사과는" serves as the subject because "는 [-neun]" is a subjective functional morpheme. Meanwhile, in the sentence meaning "I like the apple at home.", "사과를" indicates the object because the objective functional morpheme "를 [-reul]" is attached. Hence, simply translating an English dataset is not a desirable way to build a Korean dataset, and constructing Korean RE resources from scratch with this property in mind is crucial.

Fig. 1

An example from the TREK dataset. For clarity, the named entities involved in relations are colored blue or green, and other named entities are underlined. The relational triple with the relation org:founded is an inter-sentence relation extracted from multiple sentences, and the triple with org:members is an intra-sentence relation extracted from a single sentence

In this paper, we present TREK (Toward document-level Relation Extraction in Korean), a dataset built from Korean encyclopedia documents written by domain experts. We construct the dataset in a distantly supervised manner and conduct an additional human inspection to handle missing or noisy annotations. As shown in Fig. 1, the TREK dataset consists of documents and relational triples, i.e., (head, relation, tail). Along with the dataset construction, we analyze the TREK dataset with respect to named entity-types and relation classes. Furthermore, human evaluation results indicate that the quality of our document-level RE dataset is assured. We also propose a Korean document-level RE model that employs named entity-type information and reflects the characteristics of the Korean language. Since the role of a word changes according to the type of functional morpheme, we surround each entity with special tokens so that the model can distinguish the entity from the functional morpheme. We show the improved performance of our model on TREK and conduct a qualitative analysis as well.

Our contributions are as follows:

  • We introduce TREK (Toward document-level Relation Extraction in Korean), a large-scale Korean document-level RE dataset based on an encyclopedia, built with distant supervision and human inspection. The TREK dataset contains a total of 285,696 examples over 15,703 documents, 329,099 sentences, and 100,776 entities. Our human evaluation also confirms the quality of the dataset.

  • We propose a document-level RE model that exploits named entity-type information within documents while taking the characteristics of the Korean language into account.

  • We show our model's effectiveness on the TREK dataset in the experiments and demonstrate its enhanced understanding capability through a qualitative analysis.

2 Related work

2.1 Document-level Relation Extraction

Document-level RE is challenging because entities are spread across multiple sentences [9]. To address this, diverse studies have been conducted regarding mentions and entities. One method uses a selective attention network that treats occurrences of the same entity at different positions as distinct mention representations to capture relations between scattered entities [18]. Another study [19] considers various kinds of mention dependencies exhibited in documents, while yet another [20] introduces a mention-level RE model to extract the key mention pairs in a document. Relation information has also been exploited: one study [21] uses relation embeddings to capture the co-occurrence correlation of relations.

On the other hand, research focusing on entities has also appeared. One study mitigates the sparsity of the entity-pair representation to achieve a holistic understanding of the document [22]. In addition, a method that enriches the entity-pair representation with an axial attention mechanism and an adaptive focal loss has been proposed [23]. By exploiting entity-types, which provide additional information on the relatedness of each entity pair, various studies show improved performance on the document-level RE task [24]. In short, diverse branches of document-level RE research are actively pursued in English.

Fig. 2

The process of TREK dataset construction. For comprehension, we provide an example in English; the original example in Korean is described in Appendix B. In phases 1 and 2, the entity meaning "the Holy See", marked in blue, indicates the subject inserted by pseudo-subject insertion. The highlighted texts and solid green lines are the named entities and relations, respectively. In phase 3, the dashed gray line indicates a deleted relation and the red lines indicate re-assigned relations

2.2 Relation extraction in Korean

In previous Korean RE research, sentence-level RE has been studied actively [25]. Some studies consider Korean characteristics by exploiting dependency structures to find the proper relation in a sentence [25,26,27]. Moreover, entity position tokens have been utilized to capture contextual information in a sentence [28]. These studies are promoted by publicly available datasets for building sentence-level RE models. A sentence-level RE dataset, KLUE [29], was published to overcome the lack of up-to-date Korean resources, and a Korean cultural heritage corpus [30] for entity-related tasks has also been released. Although a Korean cross-sentence-level RE dataset, Crowdsourcing [14], has been presented, it covers only the paragraph level, not the document level; relations spread across multiple paragraphs of a document do not exist in it, which limits its use for document-level RE. Another work [31] presents HistRED, constructed from Yeonhaengnok, a collection of records written in Korean and Hanja (classical Chinese writing). Because HistRED is restricted to a specific domain, history, its applicability to general domains is constrained. Furthermore, all existing datasets rely on human annotation, making large-scale construction challenging and costly. Document-level RE resources are therefore limited, which leads to a lack of document-level RE studies in Korean.

3 TREK dataset

We first describe how we build our TREK (Toward document-level Relation Extraction in Korean) dataset in Section 3.1. In Section 3.2, we analyze the TREK dataset and report statistics on the types of named entities and relation classes. We present a human evaluation of the dataset in Section 3.3.

3.1 Data construction

We first annotate both entities and relations in a distant-supervision manner from NAVER encyclopedia documents, which consist of refined documents on diverse topics. Then, we additionally conduct a human inspection to handle missing or noisy annotations. To build the document-level RE dataset, we follow four steps: 1) named entity annotation, 2) sentence-level relation annotation, 3) document-level relation construction, and 4) human inspection.

Named entity annotation (NEA) We annotate named entities in the documents as illustrated in the first phase of Fig. 2. Owing to the characteristics of the NAVER encyclopedia, most sentences in the raw documents lack a subject, which is the document's title.

Thus, we manually insert the subject and annotate the named entities in the document. In detail, we first find the sentences where the subject is missing by utilizing a pre-trained dependency parser library [32] and then insert the document's title as the subject. We also concatenate a subjective functional morpheme to the end of the title-entity to mark it as the subject of the sentence. This pseudo-subject insertion clarifies the structure of the sentence, and the named entity recognition (NER) model predicts the position and type of each entity from the clarified sentence.
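For concreteness, the pseudo-subject insertion can be sketched as below. This is a minimal illustration, not the exact pipeline: has_subject stands in for the pre-trained dependency parser [32], and the heuristic for choosing between the subjective morphemes 은/는 by the title's final consonant is our own assumption.

def insert_pseudo_subject(sentences, title, has_subject):
    """Prepend the document title as a pseudo-subject to subject-less sentences."""
    # Choose the subjective functional morpheme: "은" after a final consonant,
    # "는" otherwise (assumed heuristic; non-Hangul titles default to "는").
    code = ord(title[-1]) - 0xAC00
    has_final_consonant = 0 <= code <= 11171 and code % 28 != 0
    subject = title + ("은" if has_final_consonant else "는")

    completed = []
    for sent in sentences:
        if has_subject(sent):      # the dependency parser found a subject
            completed.append(sent)
        else:                      # insert the title-entity as the subject
            completed.append(subject + " " + sent)
    return completed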

Since Korean is an agglutinative language, distinguishing entities from functional morphemes is critical in NER. Therefore, we make the NER model annotate entities with awareness of the functional morphemes.

In addition to the predefined entity types, we define the named entity-type 'TITLE' and assign it to the entities that are equal to the title, owing to their importance in the document. We use a pre-trained model based on the ELECTRA [33] architecture that is trained on Korean documents, and fine-tune it on the KLUE-NER [29] dataset.

Sentence-level Relation Annotation (SRA) After annotating the named entities in the documents, we assign relations between entity pairs within a sentence.

We implement the sentence-level RE models by considering the characteristics of the Korean language. Since Korean is an agglutinative language that forms a word from a lexical morpheme and a functional morpheme, capturing the scope of an entity in the sentence is significant [15].

Table 1 The performance of sentence-level RE models according to the type of entity position token(s)

Therefore, we employ entity position tokens to denote the positions of the subject and object entities in a sentence, following previous Korean research [28]. Start position tokens are inserted at the beginning of the subject and object entities, while end position tokens are appended at their ends. We conduct experiments on sentence-level RE performance utilizing different types of representations, as presented in Table 1.
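A minimal sketch of this marking step follows, assuming character-level spans; the token strings [SUBJ]/[/SUBJ] and [OBJ]/[/OBJ] are placeholders for the actual entity position tokens, which the source does not name.

def mark_entities(text, subj_span, obj_span):
    """Wrap the subject and object character spans with position tokens."""
    # Insert from the rightmost span first so earlier offsets stay valid.
    spans = sorted(
        [(subj_span, "[SUBJ]", "[/SUBJ]"), (obj_span, "[OBJ]", "[/OBJ]")],
        key=lambda x: x[0][0],
        reverse=True,
    )
    for (start, end), s_tok, e_tok in spans:
        text = text[:start] + s_tok + text[start:end] + e_tok + text[end:]
    return text

The relation classifier can then read its features from the embeddings of the end tokens, the [END] variant that performs best in Table 1.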

Table 2 The comparison between Korean RE datasets

The results show that KLUE-RoBERTa utilizing the [END] token embeddings performs best among the variants, implying that distinguishing the borderline between the entity and the functional morpheme is critical for predicting the relation. Consequently, we annotate intra-sentence relations using the best-performing KLUE-RoBERTa with [END] tokens.

We train the sentence-level RE model with KLUE-RoBERTa on the KLUE-RE [29] dataset.

Document-level Relation Construction (DRC) For inter-sentence relations, we remove the pseudo-inserted subject and re-assign the relation of the removed entity to the existing subject in the document. For example, as depicted in phase 3 of Fig. 2, the red line indicates the re-assigned relation: when the pseudo-inserted subject "the Holy See" in sentence [2] is removed, the relation "r2: org:member_of" is re-assigned to hold between "the Holy See" in sentence [1] and "Italy" in sentence [2].
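The re-assignment can be sketched as follows, under the assumption that triples are stored as (head_id, relation, tail_id) tuples and that alias_of maps each pseudo-inserted subject back to the original subject entity; both the storage format and the helper are illustrative, not the authors' released code.

def reassign_relations(triples, alias_of):
    """triples: iterable of (head_id, relation, tail_id);
    alias_of maps pseudo-inserted subject ids to the original entity id."""
    reassigned = set()
    for head, rel, tail in triples:
        head = alias_of.get(head, head)   # pseudo-subject -> existing subject
        tail = alias_of.get(tail, tail)
        if head != tail:                  # drop degenerate self-relations
            reassigned.add((head, rel, tail))
    return sorted(reassigned)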

Human Inspection Our human inspection process is divided into annotation and filtering to handle missing and noisy annotations. First, since inter-sentence relations are assigned based on subjects in phase 3, relations between non-subject entities may still exist in the document. To cover these relations, human annotators additionally annotate all possible relations in the document and collect the missing relations between non-subject entities. As a result, 63,482 examples are additionally annotated.

Afterward, we exclude incorrect examples through a human filtering process to remove noise from the automatically annotated examples, following previous research [34]. In detail, human workers are instructed to decide whether a given example is valid or invalid, and the examples judged invalid are deleted; 14,663 examples are removed in this way. The details of annotation and filtering are described in Appendix C.

Table 3 Statistics of TREK dataset split. Dev. indicates the development set

3.2 Data statistics

We compare the TREK dataset with existing Korean RE datasets [14, 29, 30] in Table 2.

The TREK dataset has at least roughly three times as many examples as the other datasets. Our dataset covers both intra-sentence and inter-sentence relations at the document level; the ratios of intra- and inter-sentence relations are 25.31% and 74.69%, respectively. This ratio is reasonable for a document-level RE dataset, considering that DocRED [9] has 18.49% intra- and 81.51% inter-sentence relations. Our dataset covers a general domain and is richer than the existing datasets in diverse aspects, including the numbers of documents, examples, sentences, and entities. Table 3 shows the statistics of the train, development, and test splits of the TREK dataset, which contains a total of 285,696 examples over 15,703 documents, 329,099 sentences, and 100,776 entities. Samples from the TREK dataset are provided in Appendix D.

Types of Named Entity Figure 3 depicts the distribution of named entity-types in the TREK dataset. The distributions for each dataset split are depicted in Appendix E.

Our dataset has 11 named entity-types, including LOCATION (30.65%), CIVILIZATION (17.18%), PERSON (13.44%), TITLE (12.90%), etc. The descriptions of each entity-type are given in Appendix F.

Fig. 3

Types of named entity distribution of TREK dataset

Fig. 4

Relation class distribution of TREK dataset

Fig. 5

The overall architecture of TREKER. TREKER reconstructs a given input document and performs three subtasks, i.e., coreference resolution, named entity prediction, and relation extraction, in a multi-task manner. The reconstructed document D' is fed into the pre-trained language model (PLM), and we obtain the document embedding H. m denotes the mention embeddings extracted from H. Also, v is the entity mention embedding for NEP, while V is the entity embedding for the bilinear operation in RE prediction. \(\mathcal {L}\) indicates the loss

Relation Classes Figure 4 illustrates the distribution of relation classes in the TREK dataset. The relation class distributions of the dataset splits are depicted in Appendix E. Our dataset consists of 26 relation classes, including org:member_of (31.40%), org:founded_by (16.32%), per:colleagues (9.12%), org:place_of_headquarters (7.69%), etc. The descriptions of each relation class are given in Appendix G. We also include inverse relation types such as 'org:members' and 'org:member_of'. For example, when the head is a higher-level organization and the tail is affiliated with the head, the relation is tagged as 'org:members'; conversely, 'org:member_of' means that the head belongs to the higher-level organization given as the tail. The relation classes therefore cover diverse relations between entities, including inverse relations.

3.3 Human evaluation

We conduct a human evaluation on our TREK dataset. We recruit 29 workers, all Korean undergraduate students, and randomly choose 485 documents from the dataset. The workers evaluate whether the given head and tail are appropriately mapped to the given relation based on the document. Scores range from 1 to 3, with higher scores indicating more appropriately annotated relations. As a result, 40.73% of the examples are evaluated as appropriate and 35.26% as plausible; only 24.00% are regarded as less appropriate. Inter-annotator agreement, measured with Fleiss' Kappa [35], is 0.4763, implying moderate agreement. These results demonstrate the assured quality of the TREK dataset.

4 Model

We implement our Korean document-level RE model, entitled TREKER, on the TREK dataset, considering named entity-type information and the characteristics of the Korean language. As illustrated in Fig. 5, our model aims to predict the relation when a document and two entities, head and tail, are given. We first reconstruct the input document to reflect the properties of the Korean language. Then, the model extracts the embeddings of the entity mentions that represent the same entity scattered across the document. With the extracted mention embeddings, the model is trained on coreference resolution, named entity prediction, and relation extraction in a multi-task manner.

4.1 Input document reconstruction

In the TREK dataset, a document D has a set of entities \(E = \left\{ e_1, e_2, \cdots , e_{|E|}\right\} \). Given the document and its entities, we reconstruct the input document. We mark a special token, an asterisk (*), at the start and the end of every entity mention, allowing the model to recognize the entity and the functional morphemes attached to it. Subsequently, the [CLS] token is inserted at the start of the document and a [SEP] token is placed at the end of every sentence. The reconstructed document D' is then fed into the pre-trained language model, and we obtain the document embedding H \(\in \mathbb {R}^{T \times dim}\), where T and dim denote the maximum token length of D' and the dimension of the embeddings, respectively. From the document embedding H, we define the mention embeddings \(M_k = \left\{ m_k^1, m_k^2, \cdots , m_k^{|M_k|}\right\} \) of the k-th entity \(e_k\) by extracting the embedding of the start special token of every mention from H.
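A minimal PyTorch sketch of this reconstruction and mention extraction follows. The helper mark_mentions_with_asterisks and the assumption that "*" maps to a single vocabulary token are ours; the tokenizer and PLM arguments are placeholders for the actual checkpoints.

import torch

def reconstruct_and_encode(sentences, entity_spans, tokenizer, plm):
    # Wrap every entity mention with "*" so the PLM can separate the entity
    # from the functional morpheme attached to it (assumed helper).
    marked = mark_mentions_with_asterisks(sentences, entity_spans)
    text = "[CLS] " + " [SEP] ".join(marked) + " [SEP]"
    inputs = tokenizer(text, return_tensors="pt", add_special_tokens=False)
    H = plm(**inputs).last_hidden_state.squeeze(0)        # (T, dim)

    # Each mention embedding m_k^i is the embedding of its opening "*" marker.
    star_id = tokenizer.convert_tokens_to_ids("*")
    positions = (inputs["input_ids"].squeeze(0) == star_id).nonzero().squeeze(-1)
    mention_embs = H[positions[0::2]]                     # opening markers only
    return H, mention_embs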

4.2 Coreference resolution

Our model performs coreference resolution (CR) to capture the interactions between long-distance mentions across multiple sentences. We define \(\mathbf {C^{\mathcal {Z}_E}}\), the set of all possible pair combinations over all entity mention embeddings \(\mathcal {Z}_E = M_1 \cup M_2 \cup \cdots \cup M_{|E|}\).

$$\begin{aligned} \mathbf {C^{\mathcal {Z}_E}} = \{ (m_1^1, m_1^2), (m_1^1, m_1^3), \ldots , (m_i^a, m_j^b), \ldots , (m_{|E|}^{|M_{|E|}|-1}, m_{|E|}^{|M_{|E|}|})\} \end{aligned}$$
(1)

We obtain the probability of whether the two mention embeddings represent the same entity:

$$\begin{aligned} P^{CR} = \textrm{sigmoid}(\overline{\textbf{C}} \textbf{W}^{CR} + \textbf{b}^{CR}), \end{aligned}$$
(2)

where \(\textbf{C}\) is an entity mention embedding pair in \(\mathbf {C^{\mathcal {Z}_E}}\) and \((\overline{\,\cdot \,})\) is the concatenation operation. \(\textbf{W}^{CR} \in \mathbb {R}^{2 \cdot dim \times 2}\) is a weight matrix for binary classification (the concatenated pair has dimension \(2 \cdot dim\)), and \(\textbf{b}^{CR}\) is the bias of CR.

Since each entity has a small number of mentions, a mention pair drawn from the combinations is rarely co-referent. Due to this class imbalance, we apply the focal loss [36], which is known to be robust to class imbalance, for \(\mathcal {L}^{CR}\):

$$\begin{aligned} \mathcal {L}^{CR} = -\Bigl ( y^{CR}(1-P^{CR})^{\gamma ^{CR}}\log P^{CR} + (1-y^{CR})(P^{CR})^{\gamma ^{CR}}\log (1-P^{CR}) \Bigr ) \, Q^{CR}, \end{aligned}$$
(3)

where \(y^{CR}\) is 1 if the mention pair refers to the same entity and 0 otherwise. \(Q^{CR}\) is the class weight obtained by reversing the class ratio of \(y^{CR}\), as introduced in previous research [36]. \(\gamma ^{CR}\) is a hyperparameter.
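A minimal PyTorch sketch of (2)-(3), simplified to a single-logit head; the parameter names, shapes, and the batch-level estimate of \(Q^{CR}\) are our own assumptions.

import torch

def cr_loss(pair_embs, labels, W_cr, b_cr, gamma=2.0):
    """pair_embs: (N, 2*dim) concatenated mention pairs; labels: (N,) in {0, 1}."""
    p = torch.sigmoid(pair_embs @ W_cr + b_cr).clamp(1e-6, 1 - 1e-6)  # P^{CR}
    y = labels.float()
    # Q^{CR}: class weights obtained by reversing the positive/negative ratio.
    pos_ratio = y.mean().clamp(1e-6, 1 - 1e-6)
    q = torch.where(labels == 1, 1 - pos_ratio, pos_ratio)
    loss = -(y * (1 - p) ** gamma * torch.log(p)
             + (1 - y) * p ** gamma * torch.log(1 - p)) * q
    return loss.mean()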

4.3 Named Entity Prediction

We perform the named entity prediction (NEP) task to let the model comprehend the head and tail with named entity-type information. We integrate all mention embeddings of an entity with logsumexp pooling [37] to obtain the head and tail embeddings as in (4):

$$\begin{aligned} v(e_{k}) = \log \sum _{i=1}^{|M_{k}|} \exp (m_{k}^i) \end{aligned}$$
(4)

Since head and tail are given, the model calculates the probability \(P^{NEP}\) of each named entity-type with the corresponding entity mention embeddings \(v (\texttt{head})\) and \(v (\texttt{tail})\):

$$\begin{aligned} P^{NEP}_{\texttt {head}}&= \textrm{softmax}\bigl (\textrm{tanh}(v(\texttt{head})) \, \textbf{W}^{NEP} + \textbf{b}^{NEP}\bigr ), \\ P^{NEP}_{\texttt {tail}}&= \textrm{softmax}\bigl (\textrm{tanh}(v(\texttt{tail})) \, \textbf{W}^{NEP} + \textbf{b}^{NEP}\bigr ), \end{aligned}$$
(5)

where \(\textbf{W}^{NEP} \in \mathbb {R}^{dim \times |nep|}\), |nep| is the number of named entity-types, and \({\textbf{b}}^{NEP}\) is the bias of the NEP task.

We compute the NEP loss \(\mathcal {L}^{NEP}\) through cross-entropy loss:

$$\begin{aligned} \mathcal {L}^{NEP} = -\biggl (y_{\texttt{head}}^{NEP} \textrm{log}(P_{\texttt{head}}^{NEP}) + y_{\texttt{tail}}^{NEP} \textrm{log}(P_{\texttt{tail}}^{NEP}) \biggr ), \end{aligned}$$
(6)

where \(y_{\texttt{head}}^{NEP}\) and \(y_{\texttt{tail}}^{NEP}\) are the ground-truth named entity-types of the head and tail, respectively.
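The NEP branch can be sketched as below; \(\textbf{W}^{NEP}\) has shape (dim, |nep|), and all names are illustrative rather than taken from the released implementation.

import torch
import torch.nn.functional as F

def entity_embedding(mention_embs):
    """Logsumexp pooling (4): mention_embs (num_mentions, dim) -> v, (dim,)."""
    return torch.logsumexp(mention_embs, dim=0)

def nep_loss(head_mentions, tail_mentions, y_head, y_tail, W_nep, b_nep):
    """W_nep: (dim, num_types); y_head, y_tail: scalar type-index tensors."""
    logits_h = torch.tanh(entity_embedding(head_mentions)) @ W_nep + b_nep
    logits_t = torch.tanh(entity_embedding(tail_mentions)) @ W_nep + b_nep
    return (F.cross_entropy(logits_h.unsqueeze(0), y_head.view(1))
            + F.cross_entropy(logits_t.unsqueeze(0), y_tail.view(1)))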

Table 4 RE results (\(\%\)) of different RE models on TREK dataset

4.4 Relation extraction

For the relation extraction task, we model the interactions between the mention embeddings as in (7).

$$\begin{aligned} V(\texttt{head})&= \textrm{tanh} (v(\texttt{head}) \textbf{W}_{\texttt{head}}), \nonumber \\ V(\texttt{tail})&= \textrm{tanh} (v(\texttt{tail}) \textbf{W}_{\texttt{tail}}) \end{aligned}$$
(7)

Inspired by previous research [10], we employ a group bilinear classifier [38, 39] to reduce the number of parameters required by a vanilla bilinear classifier. We therefore divide the entity embeddings into \(\alpha \) blocks of size \(dim/\alpha \) as in (8):

$$\begin{aligned} {V(\texttt{head})} = [{V(\texttt{head})}_1;...;{V(\texttt{head})}_\alpha ] \nonumber \\ {V(\texttt{tail})} = [{V(\texttt{tail})}_1;...;{V(\texttt{tail})}_\alpha ] \end{aligned}$$
(8)

Afterward, we apply group bilinear operation as indicated in (9):

$$\begin{aligned} P^{RE}_{r} = \textrm{sigmoid} \left( \sum _{i=1}^\alpha {V(\texttt{head})}_i^{\intercal } \, {\textbf{W}}_i^{RE} \, {V(\texttt{tail})}_i + \textbf{b}^{RE}\right) , \end{aligned}$$
(9)

where \({\textbf{W}}_i^{RE}\in \mathbb {R}^{dim/\alpha \times dim/\alpha }\) is the weight matrix of the operation, \(\alpha \) is a hyperparameter, and \({\textbf{b}}^{RE}\) is the bias of the RE task.
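A minimal sketch of (8)-(9) for a single relation class; stacking the block-wise weights as an (\(\alpha \), dim/\(\alpha \), dim/\(\alpha \)) tensor is our own layout choice.

import torch

def group_bilinear(v_head, v_tail, W_re, b_re, alpha=64):
    """v_head, v_tail: (dim,); W_re: (alpha, dim//alpha, dim//alpha); b_re: scalar."""
    d = v_head.shape[0] // alpha
    h = v_head.view(alpha, d)                  # alpha blocks of size dim/alpha
    t = v_tail.view(alpha, d)
    # Sum of block-wise bilinear forms h_i^T W_i t_i, as in (9).
    score = torch.einsum("ai,aij,aj->", h, W_re, t) + b_re
    return torch.sigmoid(score)

Compared with a full bilinear map, which needs dim x dim weights per relation, the grouped form needs only \(\alpha \cdot (dim/\alpha )^2\) parameters.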

Additionally, we utilize the adaptive threshold loss [10] to account for the multiple relations that can exist between two entities. The loss learns an entity-pair-dependent threshold through a learnable threshold class TH: in the inference stage, the model returns the relation labels whose scores exceed TH, and otherwise decides that there is no relation.

For the loss function \(\mathcal {L}^{RE}\), we split the relation labels into positive classes \(\mathcal {O}^+\) and negative classes \(\mathcal {O}^-\) and train with the following loss:

$$\begin{aligned} \mathcal {L}^{RE} = - \Biggl ( \sum _{r\in \mathcal {O}^+} \log \frac{\exp (P^{RE}_r)}{\sum _{r'\in \mathcal {O}^+ \cup \{TH\}} \exp (P^{RE}_{r'})} + \log \frac{\exp (P^{RE}_{TH})}{\sum _{r'\in \mathcal {O}^- \cup \{TH\}} \exp (P^{RE}_{r'})} \Biggr ), \end{aligned}$$
(10)

where \(\mathcal {O}^+\) denotes the positive relation classes that hold between the head and tail, and \(\mathcal {O}^-\) denotes the negative relation classes that do not hold between the two entities.
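A sketch of (10), assuming the scores for all relation classes and the TH class are collected in one vector, with positive_mask marking \(\mathcal {O}^+\) (and its TH entry set to False); this is our reading of the loss, not the authors' code.

import torch

def adaptive_threshold_loss(logits, positive_mask, th=0):
    """logits: (num_classes,) incl. the TH class at index `th`;
    positive_mask: bool mask of O+ with the TH entry set to False."""
    th_logit = logits[th].view(1)
    idx = torch.arange(len(logits), device=logits.device)
    # Each positive class competes against O+ and TH (first term of (10)).
    pos = logits[positive_mask]
    pos_part = -(pos - torch.logsumexp(torch.cat([pos, th_logit]), dim=0)).sum()
    # TH competes against the negative classes O- (second term of (10)).
    neg = logits[~positive_mask & (idx != th)]
    neg_part = -(th_logit - torch.logsumexp(torch.cat([neg, th_logit]), dim=0))
    return pos_part + neg_part.squeeze()

At inference, the labels r with logits[r] > logits[th] are returned; if none exceed TH, the pair is predicted to have no relation.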

4.5 Final training objective

Consequently, we obtain the final loss \(\mathcal {L}^{total}\) by integrating losses with task-specific weights, \(\eta ^{CR}\), \(\eta ^{NEP}\), and \(\eta ^{RE}\).

$$\begin{aligned} \mathcal {L}^{total} = \eta ^{CR}\mathcal {L}^{CR} + \eta ^{NEP}\mathcal {L}^{NEP} + \eta ^{RE}\mathcal {L}^{RE} \end{aligned}$$
(11)

5 Experiments

5.1 Experimental Setup

The evaluation metrics are the F1 and Ign F1 scores. The Ign F1 score is computed by excluding the triples that also appear in the training set, complementing the F1 score on unseen facts.
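For illustration, the two metrics can be computed as below, assuming predictions and gold labels are sets of (doc_id, head, relation, tail) tuples; the Ign F1 shown is a simplified variant that drops triples observed in the training set from both sets before scoring.

def f1(pred, gold):
    """pred, gold: sets of (doc_id, head, relation, tail) tuples."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

def ign_f1(pred, gold, train_facts):
    """Simplified Ign F1: ignore triples already seen during training."""
    return f1(pred - train_facts, gold - train_facts)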

For hyperparameters, we set the learning rate to 5e-5 with the AdamW [40] optimizer and decay it linearly. The batch size and sequence length are set to 2 and 512, respectively. The whole model is trained for 30 epochs on 4 RTX A6000 GPUs, and the average training time of TREKER for each PLM is 21 hours. Following previous document-level RE research in English [24], we set \(\alpha \) to 64 and \(\gamma ^{CR}\) to 2. The task weights \(\eta ^{CR}\), \(\eta ^{NEP}\), and \(\eta ^{RE}\) are set to 0.1, 0.1, and 1, respectively; these values were found by manual search.

As baselines, we adopt the base versions of BERT-multilingual [41], KoBERT (SKT Brain, 2019), KLUE-BERT [29], and KLUE-RoBERTa [29] from Hugging Face [42]. These baseline models were presented with the previous Korean language understanding evaluation benchmark.

Table 5 Ablation studies of our model on the TREK dataset. The results are predicted from TREKER\(_{\text {KLUE-RoBERTa}}\). Avg. Dec indicates the average performance decrease between each ablated variant and the full model over both Ign F1 and F1

5.2 Main results

We compare the TREKER models with the baselines, and all of them show improved performance, as demonstrated in Table 4. TREKER\(_{\text {KLUE-RoBERTa}}\) achieves the best performance among the TREKER models. In particular, TREKER\(_{\text {KoBERT}}\) achieves the largest improvement of 6.02%p, implying that our method can be applied regardless of the type of language model. These substantial improvements also show that learning objectives that consider named entity-types lead to effective relation prediction in the given documents.

We also observe performance differences between Korean and multilingual language models. TREKER\(_{\text {KLUE-RoBERTa}}\) and TREKER\(_{\text {KLUE-BERT}}\) both outperform TREKER\(_{\text {BERT-multilingual}}\). These gaps demonstrate that the models pre-trained on Korean generally understand the Korean context better than TREKER\(_{\text {BERT-multilingual}}\), which is trained on various languages.

Meanwhile, TREKER\(_{\text {KoBERT}}\) shows lower performance than TREKER\(_{\text {BERT-multilingual}}\). We assume that the sizes of the model's vocabulary and pre-training corpus affect performance: KLUE-BERT and KLUE-RoBERTa [29] are trained on 63GB of text with a 32K vocabulary, whereas KoBERT (SKT Brain, 2019) is trained on 5 million sentences with a vocabulary of 8,002 tokens. Our TREK dataset is therefore challenging when the language model's understanding capability is limited.

5.3 Ablation studies

Table 5 presents the ablation results of the proposed model. TREKER\(_{\text {KLUE-RoBERTa}}\) is evaluated with the CR and NEP tasks removed. When both CR and NEP are removed, performance decreases by 3.49% on average over the Ign F1 and F1 scores. When only the CR task is removed, there is an average decrease of 0.79%. Similarly, the score drops when TREKER is not trained on the NEP task. These findings indicate that considering named entity-type information contributes significantly to accurate relation prediction.

Fig. 6

Comparison of relation prediction results between the baselines and our models

6 Qualitative analysis

We conduct a qualitative analysis to demonstrate how each model understands the relations between entities in the Korean document-level RE task. Figure 6 shows the relation prediction results obtained from KLUE-RoBERTa and TREKER\(_{\text {KLUE-RoBERTa}}\). In sentences [1] and [2], the entities "National IT Industry Promotion Agency" and "Seoul" are given. The gold relation of the entities is 'org:member_of'. KLUE-RoBERTa predicts the incorrect relation 'org:founded_by', while TREKER\(_{\text {KLUE-RoBERTa}}\) correctly predicts 'org:member_of'.

We attribute these results to TREKER\(_{\text {KLUE-RoBERTa}}\) understanding the characteristics of the Korean language more precisely than KLUE-RoBERTa does. TREKER\(_{\text {KLUE-RoBERTa}}\) more effectively distinguishes the functional morpheme "은 [-eun]" from the entity "National IT Industry Promotion Agency" and is thus able to predict the ground-truth relation. This result is consistent with the experiments in Section 5.2.

7 Limitation

In this work, we introduce a document-level RE dataset in Korean. We believe that our TREK dataset will contribute to Korean RE research, but some limitations remain to be addressed in future work. Since we did not use all of the documents in the NAVER encyclopedia, the scalability of the dataset is lower than that of DocRED [9]. Nevertheless, our dataset is the largest document-level RE dataset in Korean, and we plan to build a larger one in future work. Another limitation of our data construction process is the necessity of human inspection. We aim to address this by either using documents in which no subject-missing sentences exist or refining our NER and RE modules to reduce error cases, ultimately constructing a quality-ensured dataset without the need for human inspection.

8 Conclusion

In this paper, we introduced the TREK dataset, the first large-scale Korean document-level RE dataset, built from high-quality Korean encyclopedia documents. We constructed the TREK dataset in a distantly supervised manner for labor- and cost-effectiveness and conducted a human inspection to ensure its quality. Moreover, we presented detailed analyses of the TREK dataset and conducted a human evaluation. Furthermore, we proposed a model that utilizes named entity-type information and adopts the properties of the Korean language. In the experiments, our proposed models outperformed the baselines, and ablation studies confirmed the effectiveness of our method. The qualitative results showed the model's enhanced understanding capability. We expect the TREK dataset to contribute to the field of Korean document-level RE, fostering research on both data construction and the development of Korean RE models in the future.