In the dataset, each drug can have numerous Mechanisms of Action (MoAs). This machine learning problem is therefore a multi-label classification problem. The machine learning models tested in this paper are BRkNN (Binary Relevance k-Nearest Neighbors), ML-KNN (Multi-label k-Nearest Neighbors) and a custom neural network.
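To make the setting concrete, in multi-label classification each sample is encoded as a binary indicator vector over all labels rather than as a single class. A minimal sketch of this encoding with scikit-learn's MultiLabelBinarizer, using hypothetical MoA names purely for illustration:

```python
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical examples: a drug sample may have zero, one, or several MoAs.
moa_sets = [
    {"dopamine_receptor_antagonist"},
    {"cyclooxygenase_inhibitor", "dopamine_receptor_antagonist"},
    set(),  # a sample with no annotated MoA
]

mlb = MultiLabelBinarizer()
Y = mlb.fit_transform(moa_sets)
print(mlb.classes_)  # the label columns
print(Y)             # one binary indicator row per sample
```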
BRkNN (binary relevance K-nearest neighbors)
BRkNN is a variant of the k-Nearest Neighbors (kNN) method that is essentially equivalent to combining Binary Relevance (BR) with the kNN algorithm. BRkNN extends the kNN method to make an independent prediction for each label.
Based on how each label's confidence score is assessed, BRkNN is divided into two variants: BRkNN-a and BRkNN-b.
BRkNN-a (type A)
BRkNN-a outputs every label that appears in at least half of the k nearest neighbors. If no label meets this criterion, in which case BRkNN would return the empty set, the label with the highest confidence is output instead. For this model, Fig. 3 and Table 3 show the graph and prediction scores, respectively.
In the graph shown in Fig. 3, the X-axis represents the number of neighbors, and the Y-axis represents the public and private dataset scores. The private dataset score improves from 3 to 5 neighbors, and the public dataset score improves from 3 to 10 neighbors. Afterwards, both scores decline as the number of neighbors increases.
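A minimal sketch of how BRkNN-a can be fitted with the scikit-multilearn library; the toy data generated here (sample count, feature count, label count) is an assumption for illustration, not the MoA dataset itself:

```python
from sklearn.datasets import make_multilabel_classification
from skmultilearn.adapt import BRkNNaClassifier

# Illustrative multi-label toy data (the real dataset has 875 features, 206 labels).
X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# k=5 is where Fig. 3 shows the best private dataset score.
clf = BRkNNaClassifier(k=5)
clf.fit(X, Y)

# predict() returns a sparse binary matrix: one indicator row per test sample.
predictions = clf.predict(X[:10])
print(predictions.toarray())
```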
BRkNN-b (type B)
First, BRkNN-b computes s, the average size of the label sets of the k nearest neighbors, and then outputs the s labels (rounded to the nearest integer) with the highest confidence. For this model, Fig. 4 and Table 4 show the graph and prediction scores, respectively. In the graph shown in Fig. 4, the X-axis represents the number of neighbors and the Y-axis represents the public and private dataset scores. Both the private dataset score and the public dataset score improve from 3 to 2000 neighbors, while 5000 neighbors yields a worse score. Afterwards, both scores remain constant. As seen from Table 4, the maximum difference between the public dataset score and the private dataset score is 0.38482, which occurs when the number of neighbors is 30.
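The fitting pattern is the same as for BRkNN-a; only the classifier class changes. A minimal sketch on the same kind of illustrative toy data (the small k here is an assumption so the example runs on toy data; the paper explores k up to 5000):

```python
from sklearn.datasets import make_multilabel_classification
from skmultilearn.adapt import BRkNNbClassifier

X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# BRkNN-b outputs round(s) labels, where s is the average label-set size
# among the k nearest neighbors; k=30 corresponds to a row of Table 4.
clf = BRkNNbClassifier(k=30)
clf.fit(X, Y)
predictions = clf.predict(X[:10])
```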
ML-KNN (multi-label K-nearest neighbors)
The ML-KNN technique is based on the well-known k-Nearest Neighbor (kNN) algorithm. First, the k nearest neighbors in the training set are selected for each test instance. The maximum a posteriori (MAP) principle is then used to determine the label set for the test instance, based on statistical information obtained from the label sets of the neighboring examples. For this model, Fig. 5 and Table 5 show the graph and prediction scores, respectively. In the graph shown in Fig. 5, the X-axis represents the number of neighbors and the Y-axis represents the public and private dataset scores. Both the private dataset score and the public dataset score improve from 3 to 20 neighbors. Afterwards, both scores decline as the number of neighbors increases.
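A minimal sketch of ML-KNN with scikit-multilearn, again on illustrative toy data rather than the MoA dataset; unlike the BRkNN variants shown above, MLkNN exposes per-label posterior probabilities directly:

```python
from sklearn.datasets import make_multilabel_classification
from skmultilearn.adapt import MLkNN

X, Y = make_multilabel_classification(
    n_samples=200, n_features=20, n_classes=5, random_state=0
)

# k=20 is where Fig. 5 shows the best scores for both datasets.
clf = MLkNN(k=20)
clf.fit(X, Y)

# predict_proba() returns the MAP-based posterior estimate for each label.
probabilities = clf.predict_proba(X[:10])
print(probabilities.toarray())
```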
Custom neural network
A neural network is created using Keras [18,19,20]. Keras is a Python-based deep learning API that runs on top of TensorFlow. Since there are 875 input features in the dataset, the input layer has 875 units. Similarly, since there are 206 output targets, the output layer has 206 units. Both dropout layer 1 and dropout layer 2 have a dropout rate of 0.5. The model is compiled using the binary cross-entropy loss function, with Adam as the optimizer. The neural network implementation code can be found on GitHub. Figure 6 shows the layers of the neural network. Table 6 describes each of the layers used. Table 7 shows the activation functions used for the dense layers and the output layer. Figures 7 and 8 show the graphs for the sigmoid and ReLU activation functions, respectively. Figure 9 shows the accuracy graph. Table 8 shows the prediction scores for this model.
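A minimal Keras sketch consistent with this description; the hidden layer width (512) is an assumption, since only the input size, output size, dropout rates, loss, and optimizer are specified here. Following Figs. 7 and 8, ReLU is used for the hidden dense layers and sigmoid for the multi-label output layer:

```python
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(875,)),               # 875 input features
    layers.Dense(512, activation="relu"),     # hidden width is an assumption
    layers.Dropout(0.5),                      # dropout layer 1
    layers.Dense(512, activation="relu"),
    layers.Dropout(0.5),                      # dropout layer 2
    layers.Dense(206, activation="sigmoid"),  # one independent probability per MoA label
])

# Binary cross-entropy treats each of the 206 labels as its own binary task.
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Training setup matching the best configuration reported in Table 9:
# model.fit(X_train, y_train, epochs=75, batch_size=100, validation_split=0.2)
```

Sigmoid outputs are used instead of softmax because the labels are not mutually exclusive: each of the 206 MoA targets is scored independently.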
In the graph shown in Fig. 9, the X-axis represents the epochs and the Y-axis represents the public and private dataset scores. Both the private dataset score and the public dataset score improve from 15 to 75 epochs. Afterwards, both scores decline as the number of epochs increases.
Since the private dataset score is used for scoring the final leaderboard, the best private dataset score obtained with each model is considered. Table 9 summarizes the best score for each model. As seen from Table 9, the custom neural network with 75 epochs and a batch size of 100 performs the best.