for Drug-Target Interaction Prediction

The discovery of potential Drug-Target Interactions (DTIs) is a determining step in the drug discovery and repositioning process, as the effectiveness of the currently available antibiotic treatment is declining. Successful approaches have been presented to solve this problem, but protein sequences and structured data are seldom used together. We present a deep learning architecture that exploits the particular ability of Convolutional Neural Networks (CNNs) to obtain 1D representations from protein amino acid sequences and SMILES (Simplified Molecular Input Line Entry System) strings. The results demonstrate that using CNNs to obtain representations of the data, instead of the traditional descriptors, leads to improved performance.


Introduction
The discovery of new and potential drugs is declining, as the misuse of available medicines increases, causing resistance to these kinds of agents [1]. Therefore, establishing effective computational methods is decisive for finding new leads. Computational methods for DTI prediction are divided into three main approaches [4], namely ligand-based, docking simulation and chemogenomic. Ligand-based approaches are built upon the concept that similar molecules have similar properties and therefore should bind to the same group of proteins [6]. Docking simulation is used for structure-based drug design, where the interaction is simulated and scored using 3D structures [5]. Chemogenomic approaches are based on the chemical, genomic and/or pharmacological space [8]. Due to the amount of available data and computational power, machine learning [3] and deep learning [9] are pursued over the traditional methods. We propose a deep learning approach to predict DTIs using 1D raw data, namely amino acid sequences and SMILES strings. We exploit the particular ability of CNNs to obtain 1D representations, which are features that express local dependencies or patterns and can then be used in a Fully Connected Neural Network (FCNN) acting as a binary classifier. The dataset of Coelho et al. (2016) [7] was used to evaluate and validate the model. Additionally, we compared our model with different approaches, specifically random forest (RF), an FCNN architecture and support vector machine (SVM).

Data
The protein sequences were extracted from UniProt and the SMILES strings were collected exclusively from PubChem, in their canonical format. Since we use protein sequences and SMILES strings directly, each amino acid and each character, respectively, is considered a feature. Therefore, it was necessary to define a threshold based on their length. An information threshold of 95% was used, resulting in a maximum length of 1205 for the protein sequences and 90 for the SMILES strings. All entries that were duplicated or contained missing characters in one of the datasets were also removed, resulting in 16011 (5839 positive and 10712 negative) samples for training and 7926 (3012 positive and 4914 negative) for testing. Table 1 summarizes the number of unique drugs, targets and drug-target interactions extracted from the databases used to create the datasets, and Table 2 the number of unique proteins, drugs and targets for the training and testing datasets. In addition, only Yamanishi et al. (2008) [8] and DrugBank positive entries were used for training and testing, respectively.
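The length threshold can be illustrated with a minimal sketch. It assumes that the 95% threshold means the smallest length that covers at least 95% of the entries, and that longer entries are discarded; neither detail is stated explicitly in the text. The function names and the toy sequences are hypothetical.

```python
import math


def length_threshold(lengths, coverage=0.95):
    """Smallest length L such that at least `coverage` of the entries have length <= L.

    Assumption: this is one plausible reading of the 95% threshold.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths)
    # Position of the entry at which the requested fraction is covered.
    k = math.ceil(coverage * len(ordered)) - 1
    return ordered[max(k, 0)]


def filter_by_length(sequences, max_len):
    """Keep only the sequences that fit within the threshold (assumed policy)."""
    return [s for s in sequences if len(s) <= max_len]


# Toy data, purely illustrative; a lower coverage is used because the list is tiny.
toy = ["MKT", "MKTAYIAK", "MK", "MKTAYIAKQRQISFVKSHFSRQ", "MKTA"]
max_len = length_threshold([len(s) for s in toy], coverage=0.8)
kept = filter_by_length(toy, max_len)
```

With the real data, the same computation would be run separately on the protein sequences and on the SMILES strings.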

Data Representation
We used the protein substitution table of Yu et al. (2010) [10], which organizes amino acids into 7 groups according to their physicochemical properties. Each amino acid was encoded as an integer based on its corresponding group. In the case of SMILES strings, a simple integer encoding was used to transform each character into an integer.
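A minimal sketch of both encodings follows. The seven-group split below is a commonly used physicochemical grouping and is an assumption; the table in [10] may assign amino acids differently. Reserving 0 for padding and building the SMILES vocabulary from the characters observed in the data are also assumptions, and all names are hypothetical.

```python
# Assumed seven-group split; the substitution table in [10] may differ.
AMINO_ACID_GROUPS = ["AGV", "ILFP", "YMTS", "HNQW", "RK", "DE", "C"]
AA_TO_INT = {
    aa: idx
    for idx, group in enumerate(AMINO_ACID_GROUPS, start=1)
    for aa in group
}


def pad(codes, max_len):
    """Right-pad with 0, which is reserved here for padding (assumption)."""
    if len(codes) > max_len:
        raise ValueError("sequence longer than max_len")
    return codes + [0] * (max_len - len(codes))


def encode_protein(seq, max_len):
    """Map each amino acid to the integer of its group (1..7)."""
    return pad([AA_TO_INT[aa] for aa in seq], max_len)


def build_smiles_vocab(smiles_list):
    """Assign an integer (starting at 1) to each distinct character."""
    chars = sorted({ch for s in smiles_list for ch in s})
    return {ch: i for i, ch in enumerate(chars, start=1)}


def encode_smiles(smiles, vocab, max_len):
    """Map each SMILES character to its integer code."""
    return pad([vocab[ch] for ch in smiles], max_len)


protein_codes = encode_protein("ACDK", max_len=6)
vocab = build_smiles_vocab(["CCO", "C=O"])
smiles_codes = encode_smiles("C=O", vocab, max_len=5)
```

A character-level vocabulary treats two-letter atoms such as Cl as two separate characters, which matches the per-character encoding described above.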

Model
The proposed approach is based on a deep learning architecture (Fig. 1) that predicts DTIs directly from protein sequences and SMILES strings (1D raw data). A one-hot layer was used to assign a binary variable to each unique integer value, converting every integer into a binary vector. Two series of 1D convolutional layers were used, one for the protein sequences and another for the SMILES strings. A global max pooling layer was applied after each series of convolutional layers to reduce the spatial size of each feature map to its maximum representative feature. The obtained deep representations were concatenated into a single feature vector characterizing a DTI pair. The resulting feature vectors were then used as the input of an FCNN architecture. Dropout was applied between fully connected layers to reduce overfitting by deactivating a percentage of neurons. This architecture was followed by an output layer.
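The data flow of this architecture can be sketched as a NumPy forward pass. This is an illustration with random weights, not the trained model: the number of layers, filter counts, kernel sizes, ReLU and sigmoid activations, and input sizes are all assumptions chosen for brevity, and dropout is omitted because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)


def one_hot(codes, depth):
    """Integer codes of length L -> binary matrix of shape (L, depth)."""
    return np.eye(depth, dtype=np.float32)[np.asarray(codes)]


def conv1d_relu(x, kernels, bias):
    """Valid 1D convolution followed by ReLU.

    x: (L, C_in), kernels: (K, C_in, C_out), result: (L - K + 1, C_out).
    """
    k = kernels.shape[0]
    windows = np.stack([x[i:i + k] for i in range(x.shape[0] - k + 1)])
    out = np.tensordot(windows, kernels, axes=([1, 2], [0, 1])) + bias
    return np.maximum(out, 0.0)


def branch(codes, depth, layers):
    """One-hot -> series of convolutions -> global max pooling."""
    x = one_hot(codes, depth)
    for kernels, bias in layers:
        x = conv1d_relu(x, kernels, bias)
    return x.max(axis=0)  # one value per feature map


def init_conv(k, c_in, c_out):
    return rng.normal(0.0, 0.1, (k, c_in, c_out)), np.zeros(c_out)


# Illustrative sizes: 8 protein symbols (7 groups + padding), 10 SMILES symbols.
prot_layers = [init_conv(3, 8, 4), init_conv(3, 4, 6)]
smi_layers = [init_conv(3, 10, 4), init_conv(3, 4, 6)]
w1, b1 = rng.normal(0.0, 0.1, (12, 8)), np.zeros(8)
w2, b2 = rng.normal(0.0, 0.1, (8, 1)), np.zeros(1)


def predict(prot_codes, smi_codes):
    """Probability of interaction for one drug-target pair."""
    features = np.concatenate([
        branch(prot_codes, 8, prot_layers),
        branch(smi_codes, 10, smi_layers),
    ])
    hidden = np.maximum(features @ w1 + b1, 0.0)  # fully connected layer
    logit = hidden @ w2 + b2                      # output layer
    return float(1.0 / (1.0 + np.exp(-logit[0])))


prob = predict(rng.integers(0, 8, 30), rng.integers(0, 10, 20))
```

Global max pooling makes the size of each branch output depend only on the number of filters in its last convolutional layer, so the concatenated vector has a fixed length regardless of sequence length.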

Hyperparameter Optimization Approach
Two methods, early stopping and model checkpoint, were used simultaneously and combined with grid search to determine the best model. Dividing the training set into training and validation subsets and applying cross-validation led to high scores for every model architecture and set of parameters, in both training and validation. It was therefore not possible to select the best model using this approach: every model appeared to perform well on the validation set, but the results were inconsistent when applied to the testing set. Consequently, we decided to use the whole training set for training and the testing set to evaluate the model performance in each epoch. Since the testing set is highly imbalanced, the F1-score was used for this evaluation (Fig. 2). Table 3 summarizes the hyper-parameters obtained from grid search.
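The combination of early stopping and model checkpoint on a per-epoch F1-score can be sketched as follows. The patience value, the function names and the F1 history are hypothetical; in practice the score of each epoch would come from evaluating the model on the testing set, and the checkpoint would save the model weights rather than only the epoch index.

```python
def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def select_best_epoch(f1_per_epoch, patience=3):
    """Return (best_epoch, best_f1, stopped_epoch).

    Model checkpoint: remember the epoch with the highest F1 so far.
    Early stopping: stop after `patience` epochs without improvement.
    """
    best_epoch, best_f1, wait = -1, float("-inf"), 0
    for epoch, f1 in enumerate(f1_per_epoch):
        if f1 > best_f1:
            best_epoch, best_f1, wait = epoch, f1, 0  # checkpoint here
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, best_f1, epoch  # early stop
    return best_epoch, best_f1, len(f1_per_epoch) - 1


# Hypothetical F1 history, one value per epoch.
history = [0.50, 0.62, 0.70, 0.69, 0.68, 0.66, 0.75]
best_epoch, best_f1, stopped_epoch = select_best_epoch(history, patience=3)
```

In this toy history the run stops at epoch 5 and keeps the weights from epoch 2, so the later improvement at epoch 6 is never seen; the patience value trades training time against this risk.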

Results
We applied grid search to all the models in order to compare and evaluate their performance fairly. The descriptors used were the same as in the original [7], which contains a total of 432 protein descriptors and 323 drug descriptors, and were collected using the PyDPI package [2]. The protein descriptors are divided into amino acid composition, Moran autocorrelation and CTD (Composition, Transition, Distribution) descriptors. The drug descriptors are divided into molecular constitutional, molecular connectivity, molecular property, kappa shape and charge descriptors, molecular access system (MACCS) keys and E-state fingerprints.
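To make the contrast with learned representations concrete, the simplest of these global descriptors, amino acid composition, can be computed by hand. The sketch below is an illustration and not the PyDPI implementation; returning fractions rather than percentages is an assumption, and the function name is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(seq):
    """Fraction of each of the 20 standard amino acids in the sequence."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    n = len(seq)
    return {aa: seq.count(aa) / n for aa in AMINO_ACIDS}


comp = amino_acid_composition("AACD")
```

Such a descriptor summarizes the whole sequence in 20 numbers and discards residue order, whereas the convolutional layers operate on local windows of the ordered sequence.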
Because the traditional split of the training set into training and validation subsets led to inconclusive results, as mentioned in Sect. 2.4, the results obtained correspond to an "internal validation". The testing set was used to discover the best set of parameters, so there is no external validation set. Nonetheless, given the disparity between the training and testing sets and the low similarity of the drug pairs that constitute them, we consider the results valid and relevant.
The differences in performance between the models can be interpreted as a result of the difference between using deep representations, obtained from protein sequences and SMILES strings, and global descriptors. The results also highlight the difference between applying traditional machine learning and deep learning approaches (Table 4). They validate the effectiveness of convolutional neural networks as a feature engineering tool and their capacity to automatically infer and identify important sequential and structural regions for drug-target interactions. Another observation is that the end-to-end deep learning method achieved both a high sensitivity (0.861) and a high specificity (0.961), whereas the other models obtained a high specificity but a low sensitivity.