1 Introduction

Drug repurposing is the process of discovering new uses for existing drugs that go beyond their original medical indication [3]. This approach is more efficient, cost-effective, and less risky than developing a completely new drug [70]. There are three steps in traditional drug repurposing methods: hypothesis generation, mechanistic assessment, and evaluation. Nonetheless, these methods are often time-consuming and cannot fully elucidate the underlying mechanisms between different biological entities, such as genes and proteins [40, 76].

The use of KGs in drug repurposing has recently been surging, owing to their ability to illuminate interconnections between entities and capture the semantics within different types of relationships [76]. KGs are a type of graph-structured knowledge base that encodes real-world entities and the relationships between them as triples (s, r, o), where s represents the subject, r represents the relation, and o represents the object. For example, (Nelfinavir, inhibitor, HIV) is a triple that encodes the relationship between the drug Nelfinavir and its inhibition of the HIV virus. KGs are capable of discovering new knowledge among entities with the help of knowledge graph embedding models (KGEs) [61]. KGEs learn to represent entities and relations in a low-dimensional space and then, perform a link prediction task, which involves inferring missing links between subject-object pairs [14]. This powerful approach enables efficient and effective drug repurposing by identifying potential new uses for existing drugs [78].

Meanwhile, coronavirus disease 2019 (COVID-19) [73] has been listed as an international public health emergency by the World Health Organization due to its high transmission rate [11]. Considering the daily increase in the number of COVID-19 patients, constructing an appropriate KG for COVID-19 drug repurposing has become urgent [17]. Previous studies [11, 17, 71, 76] have suggested that an appropriate KG should yield proper link prediction results on different KGEs. According to [27], embedding a KG with multi-relations can prove to be effective for extracting features of infrequent biological entities. Therefore, it is vital for a KG to contain multi-relations. Multi-relations refer to a set of relations (at least two) between an entity pair. For example, based on the entry for Ibuprofen in the Comparative Toxicogenomics Database (CTD) [30], there are multi-relations of “marker/mechanism” and “therapeutic” between the drug Ibuprofen and the disease headache.

Many COVID-19 KGs are constructed to conduct drug repurposing for COVID-19. Some of them [12, 15, 62] extracted entities and relations from scientific papers. However, they have to filter unnecessary papers to improve the quality of the literature selection [62]. To avoid the literature selection problem, some works [1, 11, 17, 71] alternatively integrated multiple structured databases to form a new KG. We carried out a literature review on those papers and summarized their three limitations as follows. The details of the literature review are reported in Sect. 2.

Firstly, most publicly available COVID-19 KGs do not support multi-relations. Secondly, the entity and relation types in most existing COVID-19 KGs are inconsistent with those of general medical KGs, where entity types, such as drugs, diseases, genes, proteins, and side effects, are of interest [5, 70, 76]. Nevertheless, some existing COVID-19 KGs [1, 11, 17, 71] only include a subset of these entity types, which limits the KGE from discovering meaningful links. Additionally, we found that none of the existing COVID-19 KGs originated from COVID-19-related drugs, i.e., they integrated all COVID-19 information from existing databases, which might contain many irrelevant elements for drug repurposing. In general, drugs shown to be useful in previous trials are more important, i.e., building a KG originating from COVID-19 drugs can provide more valuable insights into COVID-19 drug repurposing [71].

In this article, we aimed to bridge the gap between the general medical domain and the COVID-19 domain by creating a drug-origin COVID-19 KG called Drug-CoV, for the COVID-19 drug repurposing task. Our KG can be found at https://github.com/SRL94/kg, and the construction method can also be applied to other drug-repurposing KGs. Our contributions are as follows:

  1. Establishing a dedicated COVID-19 KG, Drug-CoV, with multi-relations to evaluate biological relation prediction methods in COVID-19. By integrating data from four professional databases and extracting applicable triples, our Drug-CoV contains thousands of nodes containing drugs, diseases, genes, proteins, and side effects connected by a set of semantic relations.

  2. Measuring the quality of Drug-CoV by comparing the link prediction results between different KGEs. One external COVID-19 KG was also used to compare the quality of our KG and to evaluate KGEs systematically.

  3. Investigating the potential of Drug-CoV for drug repurposing by conducting drug repurposing with dropped triples. We ranked drug candidates for COVID-19 and analyzed these candidates from the biological perspective.

2 Literature review

In the literature review, we first introduce various artificial intelligence (AI)-based methods for drug repurposing. Then, we summarize the existing COVID-19 KGs and their limitations.

2.1 Artificial intelligence in drug repurposing

The era of big data has paved the way for using AI to define diseases, medicines and therapeutics, as well as to identify targets with high accuracy and precision [78]. This section summarizes the current AI-based drug repurposing methods, categorized into sequence-based and graph-based representations.

2.1.1 Sequence-based representation

Sequence-based representation methods [2, 21, 25, 69] use genetic or protein sequences as input data for machine learning models. These methods typically convert the sequences into numerical or vector representations, which can be processed by the machine learning model [39]. Machine learning utilizes hierarchical layers of linear and nonlinear transformations to investigate and analyze data [24, 28].

One example of a straightforward machine learning model is the artificial neural network (ANN), which draws inspiration from initial models of sensory processing in the brain [23]. ANN uses artificial neurons to nonlinearly transform the weighted sum of input feature variables to output targets. The weights are optimized by minimizing the prediction loss of the output targets through backpropagation on training samples [43]. For example, [2] employed ANN to classify drugs into therapeutic classes in pharmaceuticals by utilizing the drugs’ transcriptomic profile vectors. Lenselink et al. [25] compared the performance of various machine learning models in predicting targets and found that neural networks outperformed conventional machine learning methods like logistic regression. However, the generated vector representations of ANN are often high-dimensional and sparse [39].

To overcome the challenges posed by high-dimensional and sparse vector representations, more advanced models such as recurrent neural networks (RNNs) and convolutional neural networks (CNNs) have been developed to extract underlying features from biological sequences, resulting in improved performance [39]. For instance, [69] proposed a novel approach to drug repurposing that combines a CNN with bidirectional long short-term memory (LSTM) networks. This approach uses the CNN module to learn the initial representation of drug-disease pairs based on their similarities and associations. Meanwhile, the LSTM module is employed to learn the path representations of the drug-disease pairs by using an attention mechanism [56] to balance the contributions of different paths. [21] developed DeepAffinity, a semi-supervised machine learning model that jointly encodes molecular representations using RNN and CNN.

2.1.2 Graph-based representation

Graph-based representation methods [44, 48] encode proteins and their chemical associations into a graph structure, where nodes represent molecules or atoms, and edges represent chemical bonds or interactions. These methods have gained popularity for link prediction in graphs, where nodes and edges are represented as low-dimensional vector representations. Using the vector representations of drugs and diseases, link prediction can measure their similarities and identify effective drugs for specific diseases [78]. Graph neural networks (GNNs) [77] and KGEs [61] are the two most popular techniques for learning vector representations of entities and relations in a KG.

GNNs extract latent features automatically by considering the neighboring nodes’ structure and aggregating information across multiple layers [77]. ProteinGCN [44] leverages the protein-ligand interaction graph to learn the features of the ligand and protein molecules and predict their binding affinity. To learn how drugs and diseases are embedded in different perspectives and predict new links between them, GCMM [75] constructs a graph by incorporating various known drug-disease relations, including drug-drug and disease-disease similarities, and uses a graph convolutional network (GCN) encoder. Nonetheless, GNNs may not perform well on graphs with sparse or irregular connectivity [64] since they heavily rely on the graph’s connectivity.

KGEs are utilized to capture global features, whereas GNNs mainly focus on preserving first-order or second-order proximity [39]. Mohamed et al. [35] proposed TriModel, a new KGE-based approach for predicting drug-target interactions in a multi-phase procedure, by extending DistMult [72] and ComplEx [53]. To predict links between diseases and drugs, [74] used the RotatE method proposed by [50] to acquire vector representations of entities (such as drugs and targets) and relationships (such as inhibition between drugs and targets).

In summary, graph-based representation methods have the advantage of capturing the complex relationships between drugs, targets, diseases, and other entities compared to sequence-based representation methods.

2.2 COVID-19 knowledge graphs

To enable drug repurposing for COVID-19 using graph-based representation, the construction of a suitable COVID-19 KG is crucial. We conducted a systematic review of existing COVID-19 KGs, using the search terms “COVID-19”, “knowledge graph”, and “drug repurposing”. We considered up to five papers returned by Google Scholar.

2.2.1 Knowledge graphs for information retrieval

There are two clusters of COVID-19 KGs. The first cluster includes KGs designed to facilitate information retrieval, such as those created by [9, 49, 62, 67]. These KGs capture the information in COVID-19-related scientific papers [58] and typically consist of paper nodes, author nodes, affiliation nodes, paper concept nodes, and more. They are used primarily for information retrieval and article recommendation [10]. For example, a typical query could be “Which papers discussing COVID-19 risk factors are most often cited by researchers within the CORD-19 dataset?” [67]. To ensure the quality of scientific papers, most KGs [9, 31, 49, 67] are constructed based on the CORD-19 dataset [60], which is a growing resource of scientific papers related to COVID-19 and historical coronavirus research.

2.2.2 Knowledge graphs for drug repurposing

The second cluster [1, 11, 17, 71] is designed to facilitate drug repurposing, which is one of the most critical tasks in the medical field as it can significantly speed up the traditional process of drug discovery and thereby shorten development timelines and reduce costs [22]. This cluster can be further split into two categories based on their sources.

The first category [12, 15, 62] extracts entities and relations from scientific papers. For instance, [15] created a cause-and-effect KG of COVID-19 pathophysiology to assist with drug repurposing. To create this KG, the literature was retrieved from open-access and freely available sources, such as PubMed and Europe PMC, with specific keywords. However, the literature selection quality remains a problem as the KGs have to filter literature based on available information about potential drug targets for COVID-19 [62].

The second category [1, 11, 17, 71] constructs KGs by integrating multiple existing KGs. For example, [71] integrated fourteen biological KGs. [11] added COVID-19-related information to an existing drug KG, rather than constructing a new one from scratch. [1] identified small molecules as potential drugs that target host proteins and disease processes involved in COVID-19. However, this requires experts to identify vocabulary, and the KG may contain irrelevant COVID-19 elements. Ge et al. [17] also constructed a public virus-related KG, called CoV-DTI, by integrating four public sources, but this KG only contains one relation, “interact with.”

In conclusion, our analysis reveals that there are some issues with the availability of KGs in the second category. Three out of the four reported KGs are not publicly available. Moreover, the publicly available KG lacks multi-relational information. Additionally, the entity types used in most COVID-19 KGs are inconsistent with those in general medical KGs. Furthermore, the KGs have integrated all COVID-19 information without filtering out irrelevant elements for drug repurposing. A KG specifically designed to identify useful COVID-19 drugs could provide more valuable insights for COVID-19 drug repurposing [41].

3 Concepts and notations

We now define the salient concepts that underlie Drug-CoV.

Knowledge graph. A KG is a type of structured data that represents knowledge as a graph. In this graph, nodes represent entities, and edges represent relationships or connections between them. A KG can be considered a type of semantic network that is used to organize and represent knowledge in a machine-readable format [66]. It represents information in the format of triples (subject, relation, object). Notable examples include Wikidata [59] and DrugBank [68]. In this paper, a KG is denoted as \(G = (E, R)\), where E is the set of entities (e.g., drugs, diseases and genes) and R is the set of relations (e.g., cause, encode and target) that connect the entities.

Entity and relation. In a KG, entities refer to the real-world objects, concepts, or events that are being represented. Relations, on the other hand, describe the connections or interactions between entities in the real world. In a triple (subject, relation, object), the term subject (or object) also can be used interchangeably with the subject entity (or object entity). In this paper, a subject is denoted as \(s \in E\), an object is denoted as \(o \in E\) and a relation is denoted as \(r \in R\).

Multi-relation. In our paper, we adopt the definition of multi-relations introduced by [27]. Multi-relations refer to a situation where multiple types of relations (or edges) exist between a pair of entities.
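
The definition above can be illustrated with a minimal Python sketch. It assumes the KG is held as a plain list of (subject, relation, object) tuples; the helper name `find_multi_relations` is hypothetical and is not part of the Drug-CoV code. The toy triples reuse the Ibuprofen and Nelfinavir examples from the introduction.

```python
from collections import defaultdict

def find_multi_relations(triples):
    """Group relations by (subject, object) pair and keep pairs with >= 2 distinct relations."""
    by_pair = defaultdict(set)
    for s, r, o in triples:
        by_pair[(s, o)].add(r)
    return {pair: sorted(rels) for pair, rels in by_pair.items() if len(rels) >= 2}

# Toy triples taken from the examples given in the introduction.
triples = [
    ("Ibuprofen", "marker/mechanism", "Headache"),
    ("Ibuprofen", "therapeutic", "Headache"),
    ("Nelfinavir", "inhibitor", "HIV"),
]
multi = find_multi_relations(triples)
```

Here only the (Ibuprofen, Headache) pair qualifies as a multi-relation, since the (Nelfinavir, HIV) pair carries a single relation.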

Knowledge graph embedding. It is a technique to encode the entities and relations in a KG as dense and low-dimensional vector representations [4]. These vector representations, also called embeddings, capture the semantic meaning of the entities and relations in the graph, allowing for various downstream tasks such as link prediction. In this paper, we use the notation \(\mathbf {v_x}\) to represent a vector representation of item x.

Link prediction. Link prediction is the task of predicting the likelihood or probability of the existence of a relationship between two entities in a KG. KGE-generated embeddings can be used for link prediction, which has various applications, including predicting drug-target interactions in drug discovery [34].

More specifically, given a relation and an entity, the goal is to predict a missing entity, i.e., inferring s given (r, o) or inferring o given (s, r). To achieve this, KGEs calculate the score for each triple \((s, r, o_i)\) where \(o_i\) is a candidate entity from the entity set and then select the entity with the highest score as the prediction result.
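
As a rough illustration of this ranking step, the sketch below scores every candidate object for a query (s, r, ?) and sorts the candidates best-first. The scoring function is a TransE-style distance chosen only for brevity; it is an assumption, not the scoring function of any particular model evaluated in this paper. The two-dimensional embeddings are hand-picked, not learned, and the function names are hypothetical.

```python
import math

def transe_score(v_s, v_r, v_o):
    """TransE-style plausibility: -||v_s + v_r - v_o||, so higher means more plausible."""
    return -math.sqrt(sum((s + r - o) ** 2 for s, r, o in zip(v_s, v_r, v_o)))

def predict_object(s, r, entity_vecs, relation_vecs):
    """Score (s, r, o_i) for every candidate o_i and return the candidates best-first."""
    v_s, v_r = entity_vecs[s], relation_vecs[r]
    scored = [
        (transe_score(v_s, v_r, v_o), o)
        for o, v_o in entity_vecs.items()
        if o != s
    ]
    return [o for _, o in sorted(scored, reverse=True)]

# Hypothetical 2-d embeddings chosen by hand for illustration.
entity_vecs = {
    "Nelfinavir": [0.0, 0.0],
    "HIV": [1.0, 0.0],
    "Headache": [0.0, 3.0],
}
relation_vecs = {"inhibitor": [1.0, 0.1]}
ranking = predict_object("Nelfinavir", "inhibitor", entity_vecs, relation_vecs)
```

The first element of `ranking` is the predicted object; inferring s given (r, o) works symmetrically by iterating over candidate subjects.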

Attention mechanism. An attention mechanism is a concept used in neural networks that helps to focus on certain parts of the input while processing it [56]. It allows the network to assign different weights to different parts of the input, thus emphasizing or “paying attention” to more relevant information.

4 Methodology

In this section, we first present an overview of our Drug-CoV construction and then briefly introduce our multi-relational KGE, ConvMR. Figure 1 illustrates the five critical stages of Drug-CoV construction. First, the drug seeding and database integration stages build a skeleton KG by integrating structured information from multiple databases. The sentence extraction stage and triple extraction stage are then used to extract additional triples from documents. Finally, the combination stage merges the skeleton KG with the extracted triples. The resulting KG is stored in a graph database, Neo4j, which stores nodes and edges rather than tables or documents.

4.1 Drug seeding

Yan et al. [71] demonstrated that useful drugs in prior trials are crucial for drug repurposing. To identify potential COVID-19 treatments, we retrieved information from DrugBank, a reliable and comprehensive web resource that provides detailed information about Food and Drug Administration (FDA)-approved drugs and experimental drugs in the FDA approval process. We selected 68 drugs with COVID-19-related properties as seed entities for further analysis to extract other entities of interest. These drugs were chosen based on their prior use in treating diseases with similar symptoms to COVID-19 and their known mechanisms of action that could potentially be effective against the virus.

Fig. 1 The construction overview of Drug-CoV

4.2 Database integration

In this stage, we aimed to identify additional entities of interest, such as genes, proteins, and diseases, as well as relevant relations among them that could be valuable for COVID-19 treatment.

Database integration involves two main steps: (1) entity extraction from multiple public databases and (2) entity resolution to identify and unify mentions of the same entity across different sources.

We extracted entities of interest from four public databases, namely DrugBank, PubChem [63], CTD, and MedlinePlus [32], which were chosen for their provision of high-quality structured information and their popularity in prior studies. These databases provided raw data files in various formats, including CSV, TSV and XML. To extract structured interactions between entity pairs (such as drug-disease and drug-drug interactions), we designed Python parsers that parse the raw data files based on their specific data structures.

To perform entity resolution, which involves identifying all mentions of the same entity across multiple knowledge bases [18], we unified drug names with chemical formulas (such as \({\textrm{C}}_{37}{\textrm{H}}_{48}{\textrm{N}}_{6}{\textrm{O}}_{5}{\textrm{S}}_{2}\) for the drug “Ritonavir”), gene names with symbols and NCBI Gene IDs, and protein names with Uniprot IDs. Disease and side effect names were extracted from DrugBank and MedlinePlus, respectively, due to their lack of standardization. Entity resolution prevented the duplication of entities in our KG resulting from synonyms in different resources.
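
The effect of entity resolution on duplicate triples can be sketched as follows. This is a minimal illustration that assumes a pre-built lookup table from lower-cased mentions to canonical keys; the function name, the lookup table and the two raw records are hypothetical and only stand in for the outputs of the database parsers. The chemical formula for Ritonavir is the one given above.

```python
def resolve_entities(triples, canonical_id):
    """Rewrite subject/object mentions to canonical keys and drop duplicate triples."""
    resolved = set()
    for s, r, o in triples:
        s_key = canonical_id.get(s.strip().lower(), s)
        o_key = canonical_id.get(o.strip().lower(), o)
        resolved.add((s_key, r, o_key))
    return resolved

# Illustrative lookup table: lower-cased mention -> canonical key (chemical formula).
canonical_id = {
    "ritonavir": "C37H48N6O5S2",
    "norvir": "C37H48N6O5S2",  # brand name of the same drug
}
raw = [
    ("Ritonavir", "cause", "nausea"),  # hypothetical record from one source
    ("Norvir", "cause", "nausea"),     # hypothetical record from another source
]
resolved = resolve_entities(raw, canonical_id)
```

Both raw records collapse into a single triple keyed by the chemical formula, which is the deduplication behaviour described above.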

Fig. 2 The schema of our KG, Drug-CoV. Nodes are entity types. Edges are relation types between them

It is important to note that relations among entities were introduced from previous medical KGs [11, 70] to ensure the consistency of our COVID-19 KG with the general domain. Figure 2 shows the schema of our KG.

As a result, a KG is constructed by integrating structured data from multiple public databases. We will expand the KG in the following stages by extracting more triples from sentences.

4.3 Sentence extraction

In addition to structured data, the four databases also include detailed text descriptions. To extract relevant information, we focused on identifying sentences that mention at least two distinct entities. To achieve this, we developed a Python-based sentence extractor using Beautiful Soup and CoreNLP [29]. Beautiful Soup helped us to parse the HTML/XML documents and extract relevant information while removing images, headers, and other elements from web pages. CoreNLP was utilized for both coreference resolution and sentence splitting. Coreference resolution is a technique adopted to identify linguistic expressions that refer to the same entity, which is essential for precise identification of sentences of interest.
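
The filtering criterion (at least two distinct entities per sentence) can be sketched in a few lines of Python. This is a simplified stand-in, not the extractor used in this work: it replaces CoreNLP sentence splitting with a naive regular expression, performs no coreference resolution, and matches entities by case-insensitive substring search. The function name and the example text are illustrative.

```python
import re

def candidate_sentences(text, entity_names):
    """Return sentences that mention at least two distinct known entities."""
    # Naive splitter: break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    kept = []
    for sentence in sentences:
        lowered = sentence.lower()
        mentioned = {name for name in entity_names if name.lower() in lowered}
        if len(mentioned) >= 2:
            kept.append(sentence)
    return kept

# Illustrative input text and entity list.
text = ("Ritonavir is an HIV protease inhibitor. "
        "Ritonavir is often combined with Lopinavir.")
kept = candidate_sentences(text, {"Ritonavir", "Lopinavir", "COVID-19"})
```

Only the second sentence is kept, because the first mentions a single entity from the given list.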

Fig. 3 An example of sentence splitting and coreference resolution in CoreNLP. CoreNLP automatically segments the input text into separate sentences and identifies all references to the same entity using consistent color labeling. In the second sentence, the pronoun “it” is replaced with “Ritonavir” for clarity. However, since the second sentence only contains one entity of interest, it would not be passed to the triple extraction stage

4.4 Triple extraction and combination

To extract triples from the sentences, we used two tools: BERN2 [51] and CoreNLP. BERN2 is a named entity recognition tool used in biomedical natural language processing that can recognize and normalize nine types of biomedical entities. Figure 4 shows an example of BERN2 use.

Fig. 4 An example of biomedical named entity recognition in BERN2

CoreNLP, in turn, was used to identify the relation between a pair of biomedical entities. Nonetheless, it is worth noting that the relations identified by CoreNLP might be out of the scope of our KG since it extracts relation phrases directly from the text. For instance, our model extracted (Dexamethasone, may cause, upset stomach) and (GC-373, therapeutic potential for, COVID-19), but “may cause” and “therapeutic potential for” are not included in the schema. Therefore, we designed a similarity-based method to normalize similar relations and filter out irrelevant ones.

Given a triple \((s_j, r_j, o_j)\) extracted by BERN2 and CoreNLP (e.g., (Dexamethasone, may cause, upset stomach)), we used Sentence-BERT (SBERT) [42] to generate a vector representation for \(r_j\). Similarly, we generated vector representations for the ten relations in the schema (as shown in Fig. 2). With these vector representations, we computed the cosine similarity between the relation \(r_j\) and every relation \(r_i \in R\). Then, we replaced \(r_j\) with the relation \(r_k \in R\) which has the highest similarity. This process can be represented by the following equations:

$$\begin{aligned} k&= \underset{i \in \{1, \ldots , |R|\}}{\textrm{argmax}}\, \cos (\mathbf {v_{r_j}}, \mathbf {v_{r_i}}), \\ \cos (\mathbf{v_t},\mathbf{v_e})&= \frac{\mathbf{v_t} \cdot \mathbf{v_e}}{\Vert \mathbf{v_t}\Vert \, \Vert \mathbf{v_e}\Vert } \end{aligned}$$
(1)

where the cosine similarity is represented by \(\cos \). Based on this similarity-based method, the triple (Dexamethasone, may cause, upset stomach) was normalized to (Dexamethasone, cause, upset stomach). Likewise, (GC-373, therapeutic potential for, COVID-19) was normalized to (GC-373, therapeutic, COVID-19).
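
The argmax over cosine similarities in Eq. (1) can be sketched as below. The three-dimensional vectors are hand-made stand-ins for SBERT embeddings, and the `min_similarity` cut-off used to discard irrelevant relations is an assumption made for illustration: the text states that irrelevant relations are filtered out but does not give a threshold. The function names are hypothetical, and vectors are assumed to be non-zero.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def normalize_relation(v_rj, schema_vecs, min_similarity=0.5):
    """Return the schema relation closest to v_rj, or None if none is similar enough."""
    best = max(schema_vecs, key=lambda name: cosine(v_rj, schema_vecs[name]))
    if cosine(v_rj, schema_vecs[best]) < min_similarity:
        return None  # assumed filtering rule, not specified in the text
    return best

# Hand-made 3-d vectors standing in for SBERT embeddings.
schema_vecs = {
    "cause": [1.0, 0.0, 0.0],
    "therapeutic": [0.0, 1.0, 0.0],
}
may_cause = [0.9, 0.1, 0.0]   # stands in for the phrase "may cause"
unrelated = [0.0, 0.0, 1.0]   # stands in for an out-of-schema phrase
```

With these toy vectors, `normalize_relation(may_cause, schema_vecs)` selects "cause", while the out-of-schema vector falls below the assumed cut-off and is discarded.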

In the combination stage, we added the triples extracted from sentences to the skeleton KG. Finally, we constructed Drug-CoV, which contains a total of 36,438 relations belonging to 10 different types, between 9,694 entities of 5 different types.

Fig. 5 The distributions of entity types in Drug-CoV

Figure 5 shows the percentage of each entity type. The “Gene” category comprised 52%, the largest share of all entity types, since genes can indicate many potential paths between two different diseases. The “Drug” category made up 35%, the second largest share, since Drug-CoV is a drug-origin KG and can provide more insights into the connections between different drugs.

4.5 Convolutional and multi-relational knowledge graph embedding

We proposed a Convolutional and Multi-relational model (ConvMR) [27] that enhances the semantic connection between multi-relations. We briefly introduce the encoding and convolution part as follows.

In learning, ConvMR considers not only direct connections between an entity pair (i.e., original triples of a KG: \((s, r_1, o), (s, r_2, o),\ldots , (s, r_N, o)\)) but also multi-relations between the pair (i.e., \((s, r_1, r_2,\ldots , r_N, o)\)). For clarity, we refer to an original triple (s, r, o) as a triple and to a triple with multi-relations \((s, r_1, r_2,\ldots , r_N, o)\) as a multi-relation triple, which means there are a total of N relations between the pair (s, o) and \(N \ge 2\). Given a triple or a multi-relation triple, ConvMR encodes relations by using a proposed attention-based average operation (attn-average). Generally, the attn-average uses an attention mechanism to assign weights to relations among an entity pair and then encodes them into a relation vector \(\mathbf {v_{r'}} \in \mathbb {R}^{1 \times M}\), where M is the dimension of the vector.
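
To make the idea of an attention-weighted average concrete, the sketch below combines N relation vectors into one vector using softmax weights. It is only a generic illustration under stated assumptions: the attention scores are taken as dot products with a hypothetical query vector, which may differ from the attn-average defined in [27], and all values are hand-picked rather than learned.

```python
import math

def attn_average(relation_vecs, query):
    """Softmax-weighted average of relation vectors; the weights sum to 1."""
    scores = [sum(q * x for q, x in zip(query, v)) for v in relation_vecs]
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(relation_vecs[0])
    combined = [
        sum(w * v[i] for w, v in zip(weights, relation_vecs)) for i in range(dim)
    ]
    return weights, combined

# Two hypothetical relation vectors (N = 2, M = 2) and a hypothetical query vector.
relation_vecs = [[1.0, 0.0], [0.0, 1.0]]
query = [2.0, 0.0]
weights, v_r_prime = attn_average(relation_vecs, query)
```

The relation that aligns better with the query receives the larger weight, and the combined vector plays the role of \(\mathbf {v_{r'}}\) for the entity pair.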

In the convolution part, ConvMR employs a convolution layer to calculate the score for the multi-relation triple \((s, r', o)\):

$$\begin{aligned} f = {\textrm{concat}}(g([\mathbf {v_s}, \mathbf {v_o}, \mathbf {v_{r'}}] *\Omega )) \end{aligned}$$
(2)

where concat is the concatenation operation. \(\Omega \) is a set of filters. Filters in convolutional layers are small matrices or tensors that are applied to an input matrix in a sliding window fashion. Each filter learns to detect specific features or patterns in the input. \(*\) is the convolution operation; g is the activation function ReLU. In training, valid triples or multi-relation triples are assigned higher scores than invalid triples or multi-relation triples.
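
A minimal sketch of the feature computation in Eq. (2) is given below. It assumes that the three vectors are stacked into a 3 × M matrix and that each filter is a 3 × 1 kernel sliding across the M columns; the filter shape is not specified in this section, so this layout is an assumption. Any final projection from f to a scalar score is omitted, and the numeric values are arbitrary.

```python
def relu(x):
    """Rectified linear unit."""
    return x if x > 0.0 else 0.0

def conv_features(v_s, v_o, v_r, filters):
    """Apply each 3x1 filter to every column of [v_s, v_o, v_r'], then ReLU and concatenate."""
    features = []
    for w_s, w_o, w_r in filters:
        for s, o, r in zip(v_s, v_o, v_r):
            features.append(relu(w_s * s + w_o * o + w_r * r))
    return features

# Arbitrary 2-d vectors (M = 2) and two hypothetical filters.
v_s, v_o, v_r = [1.0, 2.0], [0.5, -1.0], [0.0, 1.0]
filters = [(1.0, 1.0, 1.0), (-1.0, 1.0, 0.0)]
f = conv_features(v_s, v_o, v_r, filters)
```

Each filter contributes M activations, so the concatenated vector f has length M times the number of filters; negative responses are zeroed by ReLU.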

In summary, the proposed ConvMR addresses the connection of multi-relations between an entity pair by learning their weighted joint. Compared to traditional KGEs, ConvMR can maintain the semantic connection between multi-relations and effectively mine features of less frequent entities.

5 Experiments

In this section, we measure the quality of Drug-CoV by comparing link prediction results between different KGEs and investigate the capacity of the Drug-CoV for drug repurposing.

5.1 KGE baselines

We used LibKGE [6], an open-source Python package for KGE, to run TransE [4], RESCAL [38], TransH [65], DistMult [72], ComplEx [53], ConvE [14] and RotatE [50] in the comparison study. We chose these KGEs because they are the most popular KGEs in previous drug repurposing works [16, 37, 74, 76] and they can represent different kinds of KGEs, such as distance-based models (TransE, TransH), neural network models (ConvE) and semantic matching models (RESCAL, DistMult, ComplEx and RotatE). In order to handle the multi-relations, we used ConvMR, which can extract the semantic hierarchical information in multi-relations to enhance KGE. We also implemented a GNN method, CompGCN [55], as well as several recent neural network-based and semantic matching-based methods, including GIE [8], MuRT [13] and HousE [26].

5.2 Dataset

In order to provide a fair comparison of these baselines, we included CoV-DTI [17] to perform further evaluations. CoV-DTI is a virus-based KG that is constructed by integrating different databases. As illustrated in Sect. 2, the availability of the existing COVID-19 KGs is a problem, and CoV-DTI is the only public COVID-19 KG targeting drug repurposing with a similar size. CoV-DTI does not contain multi-relations. The statistical information of this dataset is shown in Table 1.

Table 1 The statistical information of the two knowledge graphs

5.3 Training protocols

We divided the triples into a training set, a validation set and a test set in an 8:1:1 ratio. All the baselines from LibKGE were trained with learning rates in [0.001, 0.01, 0.1] and initial embedding dimensions in [100, 200]. Other hyper-parameters for each approach were kept at their default settings, as recommended by the LibKGE package. ConvMR, CompGCN, GIE, MuRT, and HousE were trained with learning rates in [0.001, 0.01, 0.1], initial embedding dimensions in [100, 200], and epochs in [20, 30]. We used grid search for parameter optimization.
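
The protocol above can be sketched with the standard library alone. The snippet below is an illustration only: the shuffling seed, the function name and the placeholder triples are assumptions, and the actual training loop for each configuration is left out.

```python
import itertools
import random

def split_triples(triples, seed=0):
    """Shuffle and split triples into train/validation/test sets with an 8:1:1 ratio."""
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_valid = len(shuffled) // 10
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# Grid over learning rate, embedding dimension and number of epochs, as listed above.
grid = list(itertools.product([0.001, 0.01, 0.1], [100, 200], [20, 30]))

# Placeholder triples, used only to exercise the split.
toy_triples = [("s%d" % i, "r", "o%d" % i) for i in range(100)]
train, valid, test = split_triples(toy_triples)
```

Each tuple in `grid` is one configuration to train and compare on the validation set; the grid for the LibKGE baselines would omit the epoch dimension.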

5.4 Evaluation metrics

For the quality evaluation of Drug-CoV, we performed the link prediction task, as suggested by the KG community [4, 50]. For each test triple, the s or o is replaced by every other entity \(e \in E\) to form corrupted triples. We first computed a score for the test triple and each of its corrupted triples and then sorted these scores in descending order to get the rank of the correct test triple. We used mean reciprocal rank (MRR) and the proportion of correct entities in the top N ranks (Hits@N) for \(N = 1, 10\) and 100. Higher MRR and Hits@N values indicate better performance. Note that we used the filtered setting protocol [4] in our experiments, i.e., corrupted triples that already appear in the KG are not taken into account when ranking.
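
The metrics can be written down compactly. The sketch below assumes that model scores for one query are available as a dictionary from candidate entity to score and that `known_entities` holds the other candidates that already form true triples in the KG; the function names and the numeric scores are hypothetical.

```python
def filtered_rank(scores, true_entity, known_entities):
    """Rank of true_entity among candidates, skipping other entities known to be correct."""
    true_score = scores[true_entity]
    better = sum(
        1
        for entity, score in scores.items()
        if entity != true_entity
        and entity not in known_entities
        and score > true_score
    )
    return better + 1

def mrr(ranks):
    """Mean reciprocal rank over a list of ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at(ranks, n):
    """Fraction of ranks that fall within the top n."""
    return sum(1 for r in ranks if r <= n) / len(ranks)

# Hypothetical candidate scores for one query (s, r, ?), where "B" is the test answer.
scores = {"A": 0.9, "B": 0.8, "C": 0.4}
rank_raw = filtered_rank(scores, "B", known_entities=set())
rank_filtered = filtered_rank(scores, "B", known_entities={"A"})
ranks = [1, 2, 4]
```

If "A" is itself a correct answer already present in the KG, the filtered setting removes it from the ranking, so the test answer "B" moves from rank 2 to rank 1.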

To evaluate Drug-CoV’s capacity for drug repurposing, we performed link prediction on a specific relation, “therapeutic”.

5.5 Evaluation of KGs with different embedding methods

We tested different embedding methods on Drug-CoV and CoV-DTI. As shown in Table 2, ConvMR outperformed all other models regarding Hits@N and MRR on Drug-CoV. Generally, compared with other baselines, ConvMR can improve the link prediction performance by maintaining the semantic hierarchy of multi-relations [27]. Among the baselines, ComplEx achieved a 1.6% (49.4\(-\)47.8) improvement in Hits@100 compared with TransE, and TransH obtained an 8.2% (46.4\(-\)38.2) increase in Hits@100 compared with RESCAL. By exploiting the multi-relation features, ConvMR exceeded ComplEx by 3.3% (25.3\(-\)22.0) in MRR, 2.8% (17.5\(-\)14.7) in Hits@1, 4.3% (40.7\(-\)36.4) in Hits@10 and 13.9% (63.3\(-\)49.4) in Hits@100. It also surpassed TransH by 18.2% (25.3\(-\)7.1) in MRR, 15.4% (17.5\(-\)2.1) in Hits@1, 23.8% (40.7\(-\)16.9) in Hits@10 and 16.9% (63.3\(-\)46.4) in Hits@100. These results demonstrate that ConvMR is more effective and could be used on biological relation prediction tasks to improve prediction performance.

Table 2 Overall link prediction performance on two compiled COVID-19 KGs (in %)

A similar trend can be seen in evaluating CoV-DTI; however, ConvMR did not achieve the best performance on all metrics. This is mainly because CoV-DTI does not have multi-relation features for ConvMR to encode. Compared with other neural network-based models, such as ConvE, ConvMR still performed better on all metrics, especially for Hits@100 (57.2\(-\)12.1 = 45.1% improvement).

We also found that the best performance of CoV-DTI and Drug-CoV was not significantly different. Compared with CoV-DTI, Drug-CoV contains more complex relation types and entity types that can form more relational paths between entities. This result demonstrates that Drug-CoV facilitates the modeling of COVID-19 terms.

5.6 Ablation study of multi-relations

In this section, we study the importance of encoding multi-relation features in Drug-CoV. Among all the models compared, ConvMR is the only one capable of encoding multi-relations. ConvMR uses an attention-based average operation to assign different weights that reveal the semantic/hierarchical connection between multi-relations. We conducted an ablation study by omitting the attention mechanism. The ablated model (ConvMR-w/o) performed worse on all metrics, with decreases of 1.8% (25.3\(-\)23.5) in MRR, 2.1% (17.5\(-\)15.4) in Hits@1, 0.2% (40.7\(-\)40.5) in Hits@10 and 2.2% (63.3\(-\)61.1) in Hits@100. These decreases suggest that encoding the connection between multi-relations plays a pivotal role in relation prediction in the medical domain.

Table 3 The ablation study results of Drug-CoV (in %)

We further randomly selected one multi-relation triple (Ibuprofen, marker/mechanism, therapeutic, Headache) from Drug-CoV and fed it into the trained attention mechanism. The attention mechanism assigned a weight of 0.85 to the relation “marker/mechanism” and 0.15 to the relation “therapeutic”. According to the references in the CTD database (Footnote 6), 2 out of 3 references labelled “Ibuprofen” as a “marker/mechanism” for “Headache”, and 1 out of 3 labelled “Ibuprofen” as a “treatment” for “Headache”. The ordering of the learned weights thus matches the ordering of the reference counts, which suggests that the weights reflect the importance of a relation between an entity pair in the medical domain.

5.7 Evaluation of Drug-CoV’s capacity for drug repurposing

Table 4 Drug repurposing results on Drug-CoV

Drug repurposing is one of the most widely used applications in the biological field. Previous studies [17, 76] did not specify how they dropped drug-disease links or what the drop rate was. In our work, we randomly dropped 10% of the drug-disease links with the relation “therapeutic” from the original KG. The dropped 10% of links served as testing data, and the original KG without these links was used to train the models.
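The hold-out procedure described above can be sketched as follows. This is a minimal illustration, assuming that triples are unique `(subject, relation, object)` tuples; the function name, the fixed seed and the toy KG are hypothetical and do not reproduce the actual split used in the experiments.

```python
import random
from typing import List, Sequence, Tuple

Triple = Tuple[str, str, str]


def hold_out_links(
    triples: Sequence[Triple],
    relation: str = "therapeutic",
    drop_rate: float = 0.10,
    seed: int = 0,
) -> Tuple[List[Triple], List[Triple]]:
    """Randomly move `drop_rate` of the triples with `relation` into a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    candidates = [t for t in triples if t[1] == relation]
    n_test = round(drop_rate * len(candidates))
    test = set(rng.sample(candidates, n_test))
    # Training graph: the original KG without the held-out links.
    train = [t for t in triples if t not in test]
    return train, sorted(test)


# Toy KG: 20 "therapeutic" drug-disease links plus 5 links of another type.
kg = [(f"drug_{i}", "therapeutic", f"disease_{i}") for i in range(20)]
kg += [(f"drug_{i}", "marker/mechanism", f"disease_{i}") for i in range(5)]
train, held_out = hold_out_links(kg)
```

Only links with the chosen relation are eligible for removal, so all other relation types remain fully available during training.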

As shown in Table 4, ConvMR achieved a Hits@100 of 96.6%, far above RESCAL (Hits@100 = 31.0%) and DistMult (Hits@100 = 34.5%) and slightly above ComplEx (Hits@100 = 94.8%). On MRR, Hits@1 and Hits@10, ConvMR also ranked among the top five models. For example, the top five models by MRR are TransE (47.2%), ConvE (47.0%), ComplEx (39.2%), ConvMR (31.0%) and CompGCN (30.9%). As for Hits@10, the top five models are ConvE (86.9%), ComplEx (84.5%), ConvMR (82.8%), TransE (82.8%) and GIE (50.0%). The strong Hits@100 performance of ConvMR likely results from the full use of the semantic information available in multi-relations. TransE outperformed the other models on MRR and Hits@1, which suggests that many one-to-one relations [65] remain in Drug-CoV after the drug-disease links are dropped. This large number of one-to-one relations also restricted the ability of ConvMR to mine the information hidden in multi-relations. The strong results of ConvE and ConvMR suggest that convolution-based models are well suited to extracting local information in the medical domain.

In summary, most models performed well in distinguishing drug-disease links in the drug repurposing task. Therefore, Drug-CoV is useful for modelling COVID-19 terms and for providing new insights into drug repurposing predictions.

5.8 Extraction and ranking of drug candidates for COVID-19

To test the COVID-19 drug repurposing ability of our KG Drug-CoV, we used ConvMR to compute an embedding score for each candidate drug and then selected the ten highest-scoring drugs as candidates for COVID-19 repurposing. The embedding score of drug x is calculated by the equation below:

$$\begin{aligned} S(x) \,=\, \log \left( 1+e^{\textrm{ConvMR}(\mathbf {v}_{x},\, \mathbf {v}_{\textrm{therapeutic}},\, \mathbf {v}_{\text {COVID-19}})}\right) \end{aligned}$$
(3)

where ConvMR(·) is the prediction score of the triple (x, therapeutic, COVID-19). The prediction score lies in the range between 0 and 1. A higher score represents a stronger potential association between the drug and COVID-19.
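A minimal sketch of the scoring and ranking step in Eq. (3) is given below. It assumes that the ConvMR prediction scores for the triples (x, therapeutic, COVID-19) have already been computed; the drug names and score values are hypothetical placeholders, not model outputs.

```python
import math
from typing import Dict, List, Tuple


def embedding_score(prediction: float) -> float:
    """Eq. (3): S(x) = log(1 + exp(prediction)), i.e. the softplus of the score."""
    return math.log1p(math.exp(prediction))


def rank_candidates(
    predictions: Dict[str, float], top_k: int = 10
) -> List[Tuple[str, float]]:
    """Return the top_k drugs sorted by embedding score, highest first."""
    scored = {drug: embedding_score(p) for drug, p in predictions.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


# Hypothetical prediction scores in [0, 1] for three placeholder drugs.
toy_predictions = {"drug_a": 0.9, "drug_b": 0.2, "drug_c": 0.5}
ranking = rank_candidates(toy_predictions, top_k=2)
```

Because the function in Eq. (3) is strictly increasing, ranking by S(x) yields the same order as ranking by the raw prediction score. For prediction scores between 0 and 1, S(x) lies between log 2 (about 0.693) and log(1 + e) (about 1.313).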

Table 5 List of drug candidates for COVID-19 repurposing proposed by Drug-CoV and ConvMR

The resulting top ten predicted drug candidates are listed in Table 5. Anti-inflammatory drugs are the most commonly suggested candidates among the top ten. For example, Baricitinib is a drug that inhibits Janus kinases (JAKs), which are enzymes involved in inflammation and immune responses [33]. JAK inhibition was identified as a treatment strategy for COVID-19 [45,46,47]. A study by [52] analyzed eight cohort studies and five randomized controlled trials of Baricitinib compared to control groups, looking at clinical and laboratory parameters. The meta-analysis results showed a significant reduction in mortality, improved \(PaO_2/FiO_2\) ratio, and lower C-reactive protein (CRP) levels in the Baricitinib group compared to the control group. Based on these findings, the study suggests using Baricitinib for moderate to severe COVID-19 cases.

Angiotensin-converting enzyme 2 (ACE2) is an enzyme that plays a role in regulating blood pressure [19]. ACE2 has been identified as a receptor for SARS-CoV-2, the virus that causes COVID-19, allowing the virus to enter host cells [57]. Ibuprofen, a non-steroidal anti-inflammatory agent, can increase ACE2 expression in rat and culture models [54]. However, the implications of this in relation to COVID-19 are still uncertain and further studies are needed to determine if this impacts the course of the disease. Additionally, the use of Ibuprofen in COVID-19 patients has been controversial, with some studies suggesting that it may increase the risk of severe illness or death in certain populations [36].

Our results highlight the potential of several anti-infective drugs, including Lopinavir, Ritonavir, and Darunavir. These drugs can target the 3C-like protease (3CLpro), a major protease in CoV that cleaves the large replicase polyproteins during viral replication. Lopinavir and Ritonavir are protease inhibitors that can effectively target 3CLpro [20]. However, clinical trials of these drugs have yielded mixed results. Some studies have shown no benefit in terms of reducing mortality or time to clinical improvement, while others have shown a modest benefit in hospitalized patients [7].

Our results on drug repurposing generate hypotheses regarding which existing drugs have greater potential for repurposing in treating COVID-19. Although the KG does not provide clinical or biological evidence, it provides valuable insights.

6 Conclusion

In this work, we constructed Drug-CoV, a dedicated COVID-19 KG for drug repurposing. By integrating and extracting information from multiple public databases, we built a KG with over 30,000 interconnections between drugs, diseases, side effects, genes, and proteins. Our work is innovative in that it is the first to provide insights into multi-relations in drug repurposing. We have also demonstrated the effectiveness of encoding multi-relations in improving link prediction performance. Our KG takes the importance of useful drugs into account and is constructed as a drug-origin KG. The experimental results indicate that Drug-CoV can facilitate the modelling of COVID-19 terms in biological relation prediction and the repurposing of potential drugs.

Moreover, our KG can be expanded with more triples for future COVID-19 studies, thereby facilitating other research in this area. In addition, our KG construction methodology can be applied to discover effective treatments for other viruses, and it can assist in identifying the most promising drug candidates for further investigation.