In this section, we follow the five steps (see Fig. 1) to find new indications for existing drugs (drug repositioning):
1. Representing four drug features using deep neural network.
2. Transforming two disease features represented by one-hot-encoder using PCA.
3. Using drug features to construct the drug-drug similarity matrices.
4. Using disease features to construct the disease-disease similarity matrices.
5. Using drug-drug similarity and disease-disease similarity to construct drug-disease association matrices.
Representing four drug features using deep neural network
In this subsection, we extract four drug features: chemical structures, protein sequences of drug targets, drug-related enzyme sequences and gene expression profiles. We also introduce an appropriate representation of each feature, derived by deep neural networks.
Chemical structures
Numerous studies have attempted to explain the importance of chemical structures [8]. For instance, SMILES simplifies the chemical structure by encoding molecular graphs compactly as human-readable strings, describing molecules with an alphabet of characters and a formal grammar [41]. We download the SMILES strings from the DrugBank [42] and PubChem [43] databases during the 2017–2018 academic year.
We use the variational auto-encoder (VAE) [38] to convert the discrete representation of a molecule (its SMILES string) into a continuous 192-dimensional vector. The SMILES string of drug i is pre-processed by the following steps to make an appropriate input for the VAE model:
- A subset of 35 different characters is used for SMILES-based text encoding.
- The strings are encoded up to a maximum length of 120 characters. Shorter strings are padded with spaces so that all strings have the same length.
Finally, the pre-processed SMILES string of drug i is given as input to the VAE model, and the vector \( {\overrightarrow{s}}_i \) is generated as its representation, named the SMILES vector. The “Keras” [44] and “Theano” [45] packages are used to implement this neural network.
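The pre-processing steps above can be sketched as follows. This is a minimal illustration, not the authors' code: the toy character set is hypothetical (the 35-character set is not listed here), and right-padding as well as rejecting strings longer than 120 characters are assumptions.

```python
MAX_LEN = 120  # maximum encoded length stated in the text
PAD = " "      # padding character (spaces, as described in the text)


def pad_smiles(smiles, max_len=MAX_LEN):
    """Right-pad a SMILES string with spaces to a fixed length.

    Assumption: strings longer than max_len are rejected rather than truncated.
    """
    if len(smiles) > max_len:
        raise ValueError("SMILES string longer than max_len")
    return smiles.ljust(max_len, PAD)


def one_hot_smiles(smiles, charset, max_len=MAX_LEN):
    """Return a max_len x len(charset) one-hot matrix as nested lists."""
    index = {ch: k for k, ch in enumerate(charset)}
    matrix = []
    for ch in pad_smiles(smiles, max_len):
        row = [0] * len(charset)
        row[index[ch]] = 1  # raises KeyError for characters outside the charset
        matrix.append(row)
    return matrix


# Hypothetical toy character set, used only for illustration.
toy_charset = [PAD, "C", "O", "N", "(", ")", "=", "1", "c"]
encoded = one_hot_smiles("CC(=O)O", toy_charset)
```

Each row of `encoded` marks one character position, so a matrix of this shape is the kind of fixed-size input a sequence auto-encoder expects.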
Protein sequences of drug target
Each drug addresses one or multiple drug targets, molecules associated with a particular disease process, to produce a desired therapeutic effect [46]. Drug targets are mostly proteins with active sites to which drugs can dock. Each drug has one or multiple target proteins, and each protein can be the potential target of multiple drugs.
We retrieve drug target protein sequences from DrugBank during the 2017–2018 academic year [42]. We download the drug target section that includes proteins and genes. In this database, there is a list of drugs for each protein. Thus, we list the sequences of the target proteins for each drug.
We apply a deep neural network model named ProtVec [39] to convert each protein sequence into three continuous 100-dimensional vectors. In other words, each protein sequence is represented as three sequences of 3-grams. In n-gram modelling in protein informatics, an overlapping window of 3 to 6 residues is usually used. Instead of taking overlapping windows, ProtVec [39] generates three lists of shifted non-overlapping words. Each 3-gram is represented as a vector of size 100.
For each drug i, we perform the following steps to generate a set of 300-dimensional vectors called ℙi that represents the sequences of its target proteins:
- The sequences of target proteins are listed as a set named Φi, where |Φi| is the number of proteins targeted by drug i.
- Each protein sequence σ ∈ Φi is given as input to ProtVec. Three 100-dimensional vectors named \( \overrightarrow{{v_1}^{\sigma }} \), \( \overrightarrow{{v_2}^{\sigma }} \) and \( \overrightarrow{{v_3}^{\sigma }} \) are generated as outputs.
- For protein sequence σ, the concatenation of these three vectors is computed as \( \overrightarrow{v^{\sigma }}=\overrightarrow{{v_1}^{\sigma }}.\overrightarrow{{v_2}^{\sigma }}.\overrightarrow{{v_3}^{\sigma }} \).
- Drug i is represented by the proteins of set Φi as \( {\mathbb{P}}_i=\left\{\overrightarrow{v^{\sigma }}|\sigma \in {\Phi}_i\right\} \).
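The shifted 3-gram splitting and the concatenation step can be sketched as below. The embedding table is a hypothetical toy dictionary standing in for trained ProtVec vectors, and summing the 3-gram vectors within each shifted list is an assumption made for illustration; the text only states that three 100-dimensional vectors are produced per sequence.

```python
def shifted_3grams(sequence):
    """Split a sequence into three lists of non-overlapping 3-grams,
    starting at offsets 0, 1 and 2."""
    return [
        [sequence[k:k + 3] for k in range(offset, len(sequence) - 2, 3)]
        for offset in range(3)
    ]


def embed_sequence(sequence, embedding, dim):
    """Build one vector per shifted list and concatenate the three vectors.

    Assumption: each shifted list is summarized by summing its 3-gram
    vectors; unknown 3-grams contribute a zero vector.
    """
    parts = []
    for grams in shifted_3grams(sequence):
        total = [0.0] * dim
        for gram in grams:
            vec = embedding.get(gram, [0.0] * dim)
            total = [t + v for t, v in zip(total, vec)]
        parts.extend(total)
    return parts  # length 3 * dim (300 when dim is 100)


# Hypothetical 2-dimensional toy embedding, for illustration only.
toy_embedding = {"MKT": [1.0, 0.0], "AYI": [0.0, 2.0]}
grams = shifted_3grams("MKTAYIAK")
vector = embed_sequence("MKTAYIAK", toy_embedding, dim=2)
```

Applying `embed_sequence` to every sequence in Φi would give one concatenated vector per target protein, which together form the set ℙi.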
Drug-related enzyme sequences
Drug-related enzyme sequences include all the enzymes involved in the activation and metabolism of a drug. We extract these sequences from DrugBank during the 2017–2018 academic year [42]. For each drug i, we apply the process explained in section "Protein sequences of drug target" to the enzyme sequences to generate a set of continuous 300-dimensional vectors, based on drug-related enzymes, called \( {\mathbbm{E}}_i \).
Gene expression profiles
We obtain the raw gene expression profiles (GEPs) of the CMAP dataset [12] and normalize them using the R/Bioconductor “affy” package. These samples contain GEPs of five cell lines, either untreated or treated with any of 1309 different drugs. The differential gene expression profile (dGEP) of each cell line in the presence vs. absence of a drug is computed by subtracting log2-scaled GEPs after merging biological replicates via the mean function. A subset of 729 drugs is annotated and approved in the DrugBank [42] and PubChem [43] databases.
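As a rough illustration of the dGEP computation, the sketch below averages replicate profiles gene by gene and subtracts the control mean from the treated mean. It assumes the inputs are already normalized and log2-scaled, and the function names are hypothetical.

```python
def mean_profile(replicates):
    """Average log2-scaled expression profiles gene by gene."""
    n = len(replicates)
    return [sum(values) / n for values in zip(*replicates)]


def differential_profile(treated_reps, control_reps):
    """dGEP = mean(treated replicates) - mean(untreated replicates)."""
    treated = mean_profile(treated_reps)
    control = mean_profile(control_reps)
    return [t - c for t, c in zip(treated, control)]


# Toy values for two genes: two treated replicates, one untreated sample.
dgep = differential_profile([[2.0, 4.0], [4.0, 6.0]], [[1.0, 5.0]])
```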
We use a stacked auto-encoder architecture that was applied in previous studies [47, 48]. It was shown that this architecture can retrieve important biological features of the data, such as gene co-expression patterns, pathways and biological processes [47], and exploit them to reduce the dimensionality of GEPs into a footprint-sized vector called the cell identity code (CIC) that contains important features of the data [48]. Importantly, CICs are resistant to noise and missing data [48] and, when used as the input instead of the original GEPs, can prevent overfitting by reducing the number of parameters of a deep neural network.
For these reasons, we design a stacked auto-encoder of five layers, after observing that increasing the number of layers did not further decrease the loss. For each layer, candidate values for the number of neurons and the activation function are listed as hyper-parameter options. We then use a Bayesian approach for hyper-parameter optimization with the “hyperopt” package [49]. The candidate activation functions are rectified linear unit (ReLU), Linear, SoftPlus, and ELU. The optimal batch size is also selected through hyper-parameter optimization. The options for each hyper-parameter are specified in Fig. 2. The learning rate is 0.001. We use mean squared error (MSE) as the regression loss function. The “nadam” algorithm is used as the optimizer for both hyper-parameter optimization and final training.
We partition the data into training (60%), validation (15%) and test (25%) datasets. The stacked auto-encoder is trained and the appropriate weights and bias values are found. The validation dataset is used for hyper-parameter optimization. The test dataset is utilized for final evaluation of the model.
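A simple way to obtain such a 60/15/25 partition is to shuffle the sample indices once and slice them. The sketch below is an assumption about the mechanics only; the text does not say how samples were assigned or whether a fixed random seed was used.

```python
import random


def split_indices(n_samples, fractions=(0.60, 0.15, 0.25), seed=0):
    """Shuffle sample indices and cut them into train/validation/test parts."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility (assumption)
    n_train = int(round(fractions[0] * n_samples))
    n_val = int(round(fractions[1] * n_samples))
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # the remainder forms the test set
    return train, val, test


train_idx, val_idx, test_idx = split_indices(100)
```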
We perform 100 iterations of hyper-parameter optimization. The final hyper-parameters selected by the optimization process are highlighted in Fig. 2. After training for 300 epochs, the optimal candidate network has a mean squared error of 0.076.
Subsequently, the output of the bottleneck layer is extracted for the available differential expression profiles, with a mean squared error of about 0.0047 as loss and a mean absolute error of around 0.0495. The output of this auto-encoder is a 20-dimensional vector representing the dGEP (\( \overrightarrow{g_i} \)).
Transforming two disease features represented by one-hot-encoder using PCA
In order to find disease-disease similarity, we employ two sets of measures, namely the phenotypes (characteristics of a disease) and genotypes (genes involved in a disease). We download 10,881 human diseases with 8662 phenotypes and 7217 human diseases with 10,764 genotypes from Monarch [50]. In their intersection, there are 5955 diseases with both phenotypes and genotypes. For disease i, two one-hot-encoders, namely 8662-dimensional and 10,764-dimensional vectors, are constructed for phenotype and genotype, respectively.
For disease i, the phenotype one-hot-encoder is initialized as a zero vector of length 8662, the number of phenotypes. If a phenotype belongs to the disease, the corresponding component of the vector is set to 1. The genotype one-hot-encoder is constructed in the same way.
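The construction of these binary vectors can be sketched as follows. The term identifiers are hypothetical placeholders; in practice the vocabulary would be the full list of phenotypes (or genes) retrieved from Monarch.

```python
def encode_annotations(disease_terms, vocabulary):
    """Binary vector with a 1 for every vocabulary term annotated to the disease."""
    index = {term: k for k, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for term in disease_terms:
        vector[index[term]] = 1
    return vector


# Hypothetical four-term vocabulary and a disease annotated with two terms.
toy_vocabulary = ["P1", "P2", "P3", "P4"]
phenotype_vector = encode_annotations({"P2", "P4"}, toy_vocabulary)
```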
These two one-hot-encoders are too sparse, especially the genotype one. To overcome this issue, we use PCA to generate two vectors called \( \overrightarrow{{\mathrm{a}}_{\mathrm{i}}} \) and \( \overrightarrow{{\mathrm{d}}_{\mathrm{i}}} \) for phenotype and genotype, respectively. By trial and error, we find appropriate numbers of PCA components, which determine the lengths of vectors \( \overrightarrow{{\mathrm{a}}_{\mathrm{i}}} \) and \( \overrightarrow{{\mathrm{d}}_{\mathrm{i}}} \): 30 and 20, respectively.
Using drug features to construct the drug-drug similarity matrices
In this subsection, we generate a similarity matrix for each drug feature. We assume that there are n drugs. For each drug i, there are two vectors called \( \overrightarrow{{\mathrm{s}}_{\mathrm{i}}\ } \), \( \overrightarrow{{\mathrm{g}}_{\mathrm{i}}} \) and two sets named ℙi, \( {\mathbbm{E}}_{\mathrm{i}} \) to show the representation of chemical structures (s), gene expression profiles (g), protein sequences of drug target (p) and drug-related enzyme sequences (e), respectively.
We make a similarity matrix named \( {M}_{n\times n}^x \) for each feature x ∈ {s, g}, where n is the number of drugs, as follows:
$$ {M}^x\left[i,j\right]= sim\left(\overrightarrow{x_i},\overrightarrow{x_j}\right), $$
where the feature x is available for drug i in the database. The similarity between drugs i and j based on feature x is computed by the sim function using the cosine measure, which is more compatible with our data [51]. To compute the sim function, we use the “proxy” package in R [52].
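For illustration, the cosine-based similarity matrix can be written in a few lines of plain Python. This is not the R “proxy” call used in the paper, only an equivalent sketch; returning 0 for zero vectors is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: zero vectors get similarity 0
    return dot / (norm_u * norm_v)


def similarity_matrix(vectors):
    """M[i][j] = sim(x_i, x_j) for all pairs of feature vectors."""
    n = len(vectors)
    return [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]


# Toy feature vectors for three drugs.
M = similarity_matrix([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```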
In addition, we make a similarity matrix \( {\mathrm{M}}_{\mathrm{n}\times \mathrm{n}}^{\mathrm{p}} \) for protein sequences of drug targets as follows:
1. ℙi and ℙj are constructed as described in section "Protein sequences of drug target".
If |ℙi| ≤ ∣ ℙj ∣ ,
\( \forall \overrightarrow{\uprho_{\mathrm{i}}}\in {\mathbb{P}}_{\mathrm{i}},\kern1em {\mathrm{R}}_{\overrightarrow{\uprho_{\mathrm{i}}}}=\underset{\ \overrightarrow{\uprho_{\mathrm{j}}}\in {\mathbb{P}}_{\mathrm{j}}}{\max}\mathrm{sim}\left(\ \overrightarrow{\uprho_{\mathrm{i}}},\overrightarrow{\uprho_{\mathrm{j}}}\right),{\mathrm{M}}^{\mathrm{p}}\left[\mathrm{i},\mathrm{j}\right]={\sum}_{\overrightarrow{\uprho_{\mathrm{i}}}\in {\mathbb{P}}_{\mathrm{i}}}{\mathrm{R}}_{\overrightarrow{\uprho_{\mathrm{i}}}} \) .
If |ℙi| > ∣ ℙj ∣ ,
\( \forall \overrightarrow{\uprho_{\mathrm{j}}}\in {\mathbb{P}}_{\mathrm{j}},\kern1em {\mathrm{R}}_{\overrightarrow{\uprho_{\mathrm{j}}}}=\underset{\ \overrightarrow{\uprho_{\mathrm{i}}}\in {\mathbb{P}}_{\mathrm{i}}}{\max}\mathrm{sim}\left(\ \overrightarrow{\uprho_{\mathrm{i}}},\overrightarrow{\uprho_{\mathrm{j}}}\right),{\mathrm{M}}^{\mathrm{p}}\left[\mathrm{i},\mathrm{j}\right]={\sum}_{\overrightarrow{\uprho_{\mathrm{j}}}\in {\mathbb{P}}_{\mathrm{j}}}{\mathrm{R}}_{\overrightarrow{\uprho_{\mathrm{j}}}} \) .
For the sets of drug-related enzyme sequences, the similarity between drugs i and j, \( {M}^e\left[i,j\right] \), is computed in the same way as for the protein sequences of drug targets.
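The set-based similarity used for targets and enzymes can be sketched as follows: every vector of the smaller set is matched to its most similar vector in the larger set, and the best-match similarities are summed. The function names are hypothetical, and the handling of zero vectors is an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity; zero vectors are given similarity 0 (assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def set_similarity(set_i, set_j):
    """Sum, over the smaller set, of the best cosine match in the larger set."""
    if len(set_i) <= len(set_j):
        smaller, larger = set_i, set_j
    else:
        smaller, larger = set_j, set_i
    return sum(max(cosine(u, v) for v in larger) for u in smaller)


# Toy 2-dimensional stand-ins for the 300-dimensional protein vectors.
P_i = [[1.0, 0.0], [0.0, 1.0]]
P_j = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
score = set_similarity(P_i, P_j)
```

Because the sum runs over the smaller set, the raw value grows with the number of proteins in that set; it is rescaled only later, when the per-feature matrices are combined.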
Next, the drug-drug similarity intersection (DDSI) matrix called \( {I}_{n\times n}^E \) is constructed on a subset E ⊆ {s, p, e, g}. Here, n is the number of drugs for which all features in the set E are available in the database:
$$ {I}^E\left[i,j\right]=\begin{cases}\left(\sum \limits_{x\in E}{M}^x\left[i,j\right]-\min \right)/\left(\max -\min \right), & i\ne j\\ 1, & \text{otherwise}\end{cases} $$
where
$$ \mathit{\min}=\underset{1\le i\ne j\le n}{\mathit{\min}}\sum \limits_{x\in E}{M}^x\left[i,j\right]-0.01, $$
and
$$ \mathit{\max}=\underset{1\le i\ne j\le n}{\mathit{\max}}{\sum}_{x\in E}{M}^x\left[i,j\right]+0.01. $$
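The normalization defined above can be sketched as below: the per-feature matrices are summed element-wise, the off-diagonal extremes are widened by 0.01, and the result is rescaled. The toy matrices are illustrative values, not data from the study.

```python
def intersection_matrix(matrices):
    """Combine per-feature similarity matrices into one rescaled matrix.

    Off-diagonal entries are min-max scaled using the off-diagonal minimum
    minus 0.01 and maximum plus 0.01; diagonal entries are set to 1.
    """
    n = len(matrices[0])
    summed = [[sum(m[i][j] for m in matrices) for j in range(n)] for i in range(n)]
    off_diag = [summed[i][j] for i in range(n) for j in range(n) if i != j]
    lo = min(off_diag) - 0.01
    hi = max(off_diag) + 0.01
    return [
        [1.0 if i == j else (summed[i][j] - lo) / (hi - lo) for j in range(n)]
        for i in range(n)
    ]


# Toy similarity matrices for two features (E = {s, g}) and three drugs.
M_s = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.6], [0.4, 0.6, 1.0]]
M_g = [[1.0, 0.1, 0.3], [0.1, 1.0, 0.5], [0.3, 0.5, 1.0]]
I_E = intersection_matrix([M_s, M_g])
```

The 0.01 margins keep every off-diagonal entry strictly between 0 and 1, so no pair of distinct drugs receives a similarity of exactly 0 or 1.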
Using disease features to construct the disease-disease similarity matrices
We assume that there are m diseases. For each disease i, there are two vectors called \( \overrightarrow{a_i} \) and \( \overrightarrow{d_i} \) to show the representation of phenotype (a) and genotype (d) respectively. We display the length of these vectors below:
$$ \left|\overrightarrow{a_i}\right|=30,\left|\overrightarrow{d_i}\ \right|=20. $$
We make a similarity matrix for each feature x ∈ {a, d } named \( {M}_{m\times m}^x \) as follows:
$$ {M}^x\left[i,j\right]= sim\left(\overrightarrow{x_i},\overrightarrow{x_j}\right), $$
where the sim function gives the similarity between diseases i and j based on feature x using the cosine measure [51]. To compute the sim function, we use the “proxy” package in R [52]. Finally, the disease-disease similarity (DiDiS) matrix called \( {D}_{m\times m} \) is constructed as follows:
$$ D\left[i,j\right]=\begin{cases}\left(\sum \limits_{x\in \left\{a,d\right\}}{M}^x\left[i,j\right]-\min \right)/\left(\max -\min \right), & i\ne j\\ 1, & \text{otherwise}\end{cases} $$
where
$$ \mathit{\min}=\underset{1\le i\ne j\le m}{\mathit{\min}}\sum \limits_{x\in \left\{a,d\right\}}{M}^x\left[i,j\right]-0.01, $$
and
$$ \mathit{\max}=\underset{1\le i\ne j\le m}{\mathit{\max}}{\sum}_{x\in \left\{a,d\right\}}{M}^x\left[i,j\right]+0.01. $$
Using drug-drug similarity and disease-disease similarity to construct drug-disease association matrices
In this subsection, we define the drug-disease association (DDA) matrix \( {A}_{n\times m}^E \), where E is a subset of drug features. To do this, we apply the DDSI matrix \( {I}_{n\times n}^E \) and the DiDiS matrix \( {D}_{m\times m} \) to generate \( {A}_{n\times m}^E \) as follows [29]:
$$ {A}^E\left[i,j\right]=\max_{\substack{\left({i}^{\prime },{j}^{\prime}\right)\in \mathcal{A}\\ i\ne {i}^{\prime },\ j\ne {j}^{\prime}}}\sqrt{I^E\left[i,i^{\prime}\right]\times D\left[j,j^{\prime}\right]} \tag{1} $$
where each pair (i′, j′) is selected from the previously known drug-disease associations set \( \mathcal{A} \).
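Equation (1) can be sketched directly: for a candidate pair (i, j), every known association (i′, j′) with i′ ≠ i and j′ ≠ j contributes the geometric mean of the two similarities, and the largest value is kept. Returning 0 when no known pair qualifies is an assumption, and the toy matrices are illustrative only.

```python
import math


def association_score(i, j, drug_sim, disease_sim, known_pairs):
    """Score drug i against disease j from known associations, as in Eq. (1)."""
    scores = [
        math.sqrt(drug_sim[i][ip] * disease_sim[j][jp])
        for ip, jp in known_pairs
        if ip != i and jp != j
    ]
    return max(scores) if scores else 0.0  # assumption: 0 if no pair qualifies


# Toy DDSI matrix (3 drugs), DiDiS matrix (2 diseases) and known associations.
I_E = [[1.0, 0.81, 0.25], [0.81, 1.0, 0.5], [0.25, 0.5, 1.0]]
D = [[1.0, 0.64], [0.64, 1.0]]
known = {(1, 1), (2, 1)}
score_00 = association_score(0, 0, I_E, D, known)
```

In this toy case the score for drug 0 and disease 0 comes from the known pair (1, 1), because drug 1 is the known-associated drug most similar to drug 0.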
To build the drug-disease association matrices (\( {A}^E \)), we assemble the known drug-disease associations (set \( \mathcal{A} \)) from the repoDB [53] and Zhang et al. [30] datasets.