Cell imaging dataset containing 1105 drugs
The cell imaging dataset, which contained 19,864 unique compounds or drugs, was sorted and screened (see Methods section). Image data were obtained for 1105 drugs (812-dimensional image information per drug), encompassing cell responses to 372 MoAs (Table S1 in Supplementary Data). These 1105 drugs have a broad range of clinical uses. The image data represent the most intuitive phenotypic effects of these drugs on cells. They comprise 812 feature dimensions, including Cells_AreaShape_Area, Cells_AreaShape_Compactness, and Cells_AreaShape_Eccentricity. The raw values across dimensions ranged from −384 to +677, and more than 98.8% of them lay between −20 and 20. We applied mean-variance normalization, after which the data in each dimension followed a normal distribution with a mean of 0 and a variance of 1 and ranged from −7.930 to 13.934. The original data distribution is shown in Fig. 1a and the normalized data distribution in Fig. 1b.
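The mean-variance normalization described above can be illustrated with a minimal sketch. It assumes that the population standard deviation is used (the text does not say whether the population or sample form was applied). The helper name zscore_columns and the toy matrix are hypothetical and are not taken from the dataset.

```python
import statistics

def zscore_columns(rows):
    """Standardize each column (feature) to mean 0 and variance 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Assumption: population standard deviation (divide by n, not n - 1).
    stds = [statistics.pstdev(col) for col in columns]
    return [
        # A constant column has zero spread; map it to 0.0 instead of dividing by zero.
        [(value - m) / s if s > 0 else 0.0 for value, m, s in zip(row, means, stds)]
        for row in rows
    ]

# Toy matrix: 4 hypothetical drugs (rows) x 2 hypothetical features (columns).
raw = [
    [677.0, 1.0],
    [-384.0, 2.0],
    [12.0, 3.0],
    [-5.0, 4.0],
]
normalized = zscore_columns(raw)
normalized_columns = list(zip(*normalized))
```

Each column of normalized then has a mean of 0 and a variance of 1, so features measured on very different scales contribute comparably to later distance calculations.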
We collected the MoA information of the drugs from the LINCS database, which contained 372 types of MoA for the investigated drugs. Among these, 49 MoA types were shared by five or more drugs, and the most common MoA (adrenergic receptor antagonism) was shared by 43 drugs. Other common MoAs were dopamine receptor antagonism, cyclooxygenase inhibition, and serotonin receptor antagonism. The distribution of the top ten MoA types is shown as a pie chart in Fig. 1c. The overall flowchart of the present study is illustrated in Fig. 2.
Conversion of the 812-dimensional image data by ITML
Supervised ITML is a global metric-learning method that learns a distance function tailored to a specific learning task. We used it to obtain a distance measure for the drug MoA classification task, with the following parameters: num_constraints (number of constraints to generate) = 20, max_iter (maximum number of iterations) = 1000, and convergence_threshold = 0.001. t-Distributed Stochastic Neighbor Embedding (t-SNE) plots of the top ten drugs before and after learning are shown in Fig. 3a and 3b, respectively. Training on the MoAs of all drugs yielded the T matrix. Each 812-dimensional vector was then transformed into a new vector by passing it through the T matrix.
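Passing a feature vector through the T matrix is a linear map. The sketch below shows why this is equivalent to using a learned metric, under the assumption that T is a factor of the learned Mahalanobis matrix M, that is, M = TᵀT (the usual relationship in metric learning, which the text does not state explicitly). The 2 x 2 matrix and the vectors are hypothetical toy values, not learned parameters.

```python
import math

def transform(T, x):
    """Apply the linear map T (a list of rows) to the feature vector x."""
    return [sum(t_ij * x_j for t_ij, x_j in zip(row, x)) for row in T]

def gram(T):
    """Return M = T^T T, the Mahalanobis matrix induced by T."""
    cols = list(zip(*T))
    return [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]

def mahalanobis(M, x, y):
    """Distance sqrt((x - y)^T M (x - y)) in the original feature space."""
    d = [a - b for a, b in zip(x, y)]
    Md = transform(M, d)
    return math.sqrt(sum(di * mdi for di, mdi in zip(d, Md)))

# Hypothetical toy values (the real T would act on 812-dimensional vectors).
T = [[2.0, 1.0],
     [0.0, 0.5]]
x = [1.0, 2.0]
y = [3.0, 0.0]

# Euclidean distance after the transformation ...
d_transformed = math.dist(transform(T, x), transform(T, y))
# ... equals the Mahalanobis distance before it.
d_metric = mahalanobis(gram(T), x, y)
```

Because the two distances agree, any Euclidean-distance-based analysis of the transformed vectors implicitly uses the learned metric.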
Classification of drugs into 39 categories based on ITML-transformed features
We used the T matrix-transformed features to establish drug image phenotype (DIP) connections. The DIP connections were represented as “association scores” computed from the Euclidean distance. For each calculated distance, we obtained the corresponding association score (detailed information is provided in the Methods section and in the Supplementary Distance data file).
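The distance step can be sketched as follows. This is a minimal illustration of pairwise Euclidean distances between transformed feature vectors. The drug names and vectors are hypothetical, and the conversion from distance to association score is deliberately omitted because it is defined in the Methods section rather than here.

```python
import math
from itertools import combinations

def pairwise_distances(features):
    """Map each unordered pair of drug names to their Euclidean distance."""
    return {
        (a, b): math.dist(features[a], features[b])
        for a, b in combinations(sorted(features), 2)
    }

# Hypothetical transformed feature vectors for three drugs.
features = {
    "drug_A": [0.0, 0.0],
    "drug_B": [3.0, 4.0],
    "drug_C": [6.0, 8.0],
}
distances = pairwise_distances(features)

# n drugs give n * (n - 1) / 2 unordered pairs; for 1105 drugs this is 609,960.
n_pairs_1105 = 1105 * 1104 // 2
```

The pair count for 1105 drugs matches the 609,960 DIP connections reported for the dataset.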
The 609,960 pairwise DIP connections (Table S2 in Supplementary Data) among the 1105 drugs are shown as a heatmap of the distance matrix (Figure S1). An automated, parameter-free clustering algorithm yielded 39 drug groups with prominent and consistent internal DIP similarities. We designated each of these 39 groups a DIP community (Fig. 4b). We then used the MoA types containing more than five drugs as a test set to determine whether the DIP communities could be used for drug MoA discovery.
Our enrichment analysis identified significantly enriched (P < 0.01, Table S3), community-specific drug MoAs across the DIP communities (Fig. 4b and Supplementary Table S4). For example, communities 1, 2, and 3 were enriched with local anesthetics, acetylcholine receptor agonists, and protein kinase A inhibitors, respectively.
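An enrichment test of this kind can be sketched with a one-sided hypergeometric test. This is an assumption made for illustration: the statistical test actually used is described in the Methods section and is not named here. The community size and the number of hits below are hypothetical; only the totals (1105 drugs, 43 adrenergic receptor antagonists) and the P < 0.01 cutoff come from the text.

```python
from math import comb

def hypergeom_sf(k, N, K, n):
    """P(X >= k) when n drugs are drawn from N, of which K carry the MoA."""
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

N_DRUGS = 1105          # drugs in the dataset
K_MOA = 43              # drugs sharing the most common MoA
community_size = 30     # hypothetical community size
hits = 8                # hypothetical number of community members with that MoA

p_value = hypergeom_sf(hits, N_DRUGS, K_MOA, community_size)
is_enriched = p_value < 0.01  # significance cutoff used in the text
```

With these hypothetical counts, about one member with this MoA would be expected by chance, so eight members fall well below the P < 0.01 cutoff.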
To examine whether ITML helps in MoA recognition, we repeated the analysis using the raw data and data processed by the PCA algorithm, which yielded 57 and 48 clusters, respectively. As shown in Table 1, the numbers of enriched MoAs were 26 and 24, giving enrichment ratios of 0.4561 and 0.5000, respectively. These values were lower than those obtained with ITML, indicating that clustering ITML-processed data made it easier to identify drugs with consistent MoAs.
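The reported ratios are consistent with dividing the number of enriched MoAs by the number of clusters. The short calculation below reproduces them under that assumption. The ITML entry is computed the same way from the 35 enriched MoAs and 39 communities reported in this section, and should be read as an inferred figure rather than a value quoted from Table 1.

```python
# Counts taken from the text; the ratio definition is an assumption
# inferred from the reported values (26 / 57 and 24 / 48).
results = {
    "raw": {"clusters": 57, "enriched_moa": 26},
    "PCA": {"clusters": 48, "enriched_moa": 24},
    "ITML": {"clusters": 39, "enriched_moa": 35},
}

ratios = {
    name: round(counts["enriched_moa"] / counts["clusters"], 4)
    for name, counts in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.4f}")
```

Under this definition, the ITML ratio is roughly 0.90, compared with 0.4561 for the raw data and 0.5000 for PCA.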
DIP facilitates identification of drug MoA
In total, 35 drug MoAs were enriched in the 39 communities. Several similar drug MoAs were enriched in the same communities. Protein synthesis inhibitors and histone deacetylase inhibitors were both enriched in Community 18. Cytochrome P450 inhibitors and epidermal growth factor receptor inhibitors were enriched in Community 36. Acetylcholine receptor agonists, bacterial cell wall synthesis inhibitors, and angiogenesis inhibitors were enriched in Community 2. Acetylcholine receptor antagonists, retinoid receptor agonists, and tyrosine kinase inhibitors were enriched in Community 4. In contrast, some MoAs, such as adrenergic receptor agonists, norepinephrine reuptake inhibitors, and aromatic hydrocarbon derivatives, were relatively dispersed and not significantly enriched in any community. This dispersed distribution may arise because these drugs have little effect on the phenotype of tumor cell lines, so that the image data were not significantly changed. Cell images derived from other cell types may be more helpful for identifying these MoAs.
To identify the image features that may be more conducive to the identification of drug use, we calculated the intra-class distance ratio for the features of each dimension in each cluster (Table S5, see the Methods section for details) and determined CSIFs according to the intra-class ratio (< 0.01). The CSIFs rarely overlapped between clusters. Only 26 different features played a role in two clusters, and no feature was a CSIF in three or more communities. For example, Nuclei_Intensity_MeanIntensity_Ph_golgi was a CSIF of cluster 16, which was enriched with dopamine uptake inhibitors, and of cluster 22, which was not enriched with any kind of drug. The CSIFs suggest that drugs within the same cluster may produce specific responses in these features. Although there were only four tubulin inhibitors in the dataset, they were all grouped in Community 20, which was also enriched with CDK inhibitors; only two microtubule inhibitors were observed in this cluster. The CSIFs corresponding to Community 20 were Cells_Texture_InfoMeas1_Hoechst_5, Cells_Texture_InfoMeas2_Ph_golgi_5, Cells_Texture_Variance_Hoechst_3, and Cytoplasm_AreaShape_Zernike_8_8. This may be related to the effects of the above drugs on the cell cycle, including inhibition of cell division and induction of changes in cell texture. These results suggest that it is feasible to discover the functions of known or new compounds based on DIP (Table S4).
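The idea of an intra-class ratio can be illustrated for a single feature. The exact definition is given in the Methods section and is not reproduced here, so the formula below is an assumption made purely for illustration: the mean pairwise gap of the feature within a cluster divided by the mean pairwise gap over all drugs. The function names and toy values are hypothetical; only the 0.01 cutoff comes from the text.

```python
from itertools import combinations
from statistics import fmean

def mean_pairwise_gap(values):
    """Mean absolute difference over all unordered pairs of values."""
    if len(values) < 2:
        raise ValueError("need at least two values to form a pair")
    return fmean([abs(a - b) for a, b in combinations(values, 2)])

def intra_class_ratio(all_values, cluster_values):
    """Assumed definition: within-cluster spread relative to overall spread."""
    return mean_pairwise_gap(cluster_values) / mean_pairwise_gap(all_values)

# Hypothetical values of one feature for five drugs; the first two drugs
# form the cluster of interest and respond almost identically.
all_values = [0.0, 0.01, 5.0, -5.0, 10.0]
cluster_values = all_values[:2]

ratio = intra_class_ratio(all_values, cluster_values)
is_csif = ratio < 0.01  # cutoff used in the text
```

A small ratio means that the drugs in a cluster agree closely on that feature relative to the dataset as a whole, which is the property a cluster-specific feature is expected to have.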
Community 21 drugs that could block virus entry
Two drugs with different MoA annotations, chloroquine and clomiphene, were grouped in cluster 21. Both drug candidates have been reported to be effective against COVID-19. Clomiphene is annotated as an estrogen receptor antagonist and has been found to be active against Ebola virus, suggesting that it may share an MoA with chloroquine. Although their MoA annotations differ, chloroquine and clomiphene share common drug characteristics. For example, they inhibit T cell proliferation, reduce the release of proinflammatory cytokines, and increase the pH of the endosome to block endocytosis (Savarino et al. 2003; Vincent et al. 2005; Hoffmann et al. 2020). These drugs may be used for COVID-19 prevention and treatment by blocking the pH-dependent pathway. However, SARS-CoV-2 may enter lung cells via both pH-dependent and pH-independent (TMPRSS2-dependent) pathways, and the TMPRSS2-primed pathway bypasses endosome-mediated entry. This may partly explain the low success rates in COVID-19 therapy, as chloroquine/hydroxychloroquine alone could not inhibit SARS-CoV-2 infection (Hoffmann et al. 2020). As a result, combining drugs that block endocytosis with TMPRSS2 inhibitors may be promising (Ortega et al. 2020).
The aforementioned results show that the MoAs of drugs can be recognized through multidimensional image features, and that the use of ITML for feature conversion may help in identifying drugs with similar MoAs. The DIP communities can help in finding more drugs with similar MoAs through the analysis of image data. At present, no effective drug has been approved for COVID-19 treatment. Drug candidates with certain therapeutic effects, such as chloroquine and remdesivir, may be promising. Drugs such as remdesivir target viral proteins, so DIP derived from uninfected cell lines may not reflect their MoAs. Chloroquine and clomiphene exert anti-infective effects by regulating host cell functions. However, serious side effects associated with chloroquine, such as gastrointestinal effects and cardiotoxicity, may limit its clinical use (Doyno et al. 2021). The discovery of new alternative drugs through DIP is therefore of great significance. The two drugs in cluster 21, chloroquine and clomiphene, are effective against virus infections; they have different MoA annotations but similar drug characteristics. We observed that adrenergic receptor agonists, norepinephrine reuptake inhibitors, aromatic hydrocarbon derivatives, and drugs with some other MoAs were not significantly enriched in any cluster. The effects of these drugs on the phenotype of tumor cells were not significant and may therefore yield image data with non-specific features. Based on the above findings, we suggest that ITML features are more conducive to drug classification than the original features or PCA features.
Image data obtained after a drug acts on cells are among the most easily obtained screening data, and evaluating the potential effects of drugs from images is of great significance. Here, we used cell feature data processed by professional cell image analysis software (CellProfiler) to predict MoA (Carpenter et al. 2006). Because drug MoAs are complex, we used third-party MoA annotation data and optimized the metric based on ITML. The optimized DIP communities were more closely related to the known MoAs. Tubulin, CDK, and microtubule inhibitors all affect spindle formation, and they were all classified into cluster 20. The COVID-19 drug candidate chloroquine and the anti-Ebola drug clomiphene were grouped in Community 21. These results support the feasibility and accuracy of drug discovery based on image data. In addition, selective estrogen receptor modulators, such as raloxifene, could probably also exhibit antiviral activity against SARS-CoV-2 and/or Ebola infections. It should be noted that the cell image data used here were derived from cell lines without SARS-CoV-2 infection. Therefore, drugs targeting viral proteins may not induce consistent effects on the cells, and these data may not be applicable to virus-targeted drug discovery. Image data from SARS-CoV-2-infected cells treated with different drugs would be more useful for screening virus-targeting drugs.