Reference compounds
For the HPTC-A dataset (DNA/RelA/actin/WCS), we used 44 xenobiotic compounds. The “PTC-toxic” group had 24 nephrotoxicants known to damage human proximal tubular cells (PTCs) in vivo, and the “non-PTC-toxic” group had 12 nephrotoxicants not known to damage PTCs and 8 non-nephrotoxicants [detailed information on the PTC toxicity of most of the compounds can be found in our reports (Li et al. 2014; Kandasamy et al. 2015)]. For the HPTC-B and HK-2 datasets (DNA/γH2AX/actin/WCS), 42 of the compounds were used (excluding lead acetate and hydrocortisone). The compounds were dissolved in either DMSO at a stock concentration of 50 mg/mL, or water at a stock concentration of 10 mg/mL. The full list of reference compounds, with their sources, solvents, and known human kidney and liver toxicity, is provided in Supplementary Material 1—Table S1.
Cell culture and compound treatment
For both the HPTC-A and HPTC-B datasets, we used three different batches of primary human PTCs from three different donors. Two of them (HPTC1 and HPTC10; Lot #58488852 and #61247356, respectively) were purchased from the American Type Culture Collection (ATCC, Manassas, VA, USA). The third batch of cells (HPTC6) was isolated from a human nephrectomy sample (National University Health System, Singapore). Only normal tissues without aberrant pathological changes, as determined by a pathologist, were used. Ethics approvals for the work with primary human kidney samples (DSRB-E/11/143) and cells (NUS-IRB Ref. Code: 09-148E) were obtained. All three batches of primary PTCs were cultured in renal epithelial cell basal medium (ATCC) supplemented with renal epithelial cell growth kit (ATCC) and 1 % penicillin/streptomycin (Gibco, Carlsbad, CA, USA). Only passages (P) 4 and P5 of primary PTCs were used in this study. For the HK-2 dataset, the HK-2 cell line (ATCC) was maintained in Dulbecco’s modified Eagle medium (DMEM) supplemented with 10 % fetal bovine serum (FBS; Gibco) and 1 % penicillin/streptomycin.
Cells were seeded into 384-well black plates with transparent bottoms (Greiner, Kremsmünster, Austria). All cells were cultured for 3 days to achieve the formation of a differentiated renal epithelium before overnight drug treatment (16 h; Li et al. 2013). The dosages of the tested compounds were 1.6, 16, 63, 125, 250, 500, and 1000 μg/mL. Positive, negative, and vehicle controls (DMSO or water, depending on the solvent of the tested compounds) and untreated cells were included on each plate. Four technical replicates were performed for each compound and dosage.
Immunostaining
After compound treatment for 16 h, cells were fixed using 3.7 % formaldehyde in phosphate-buffered saline (PBS). The cells were blocked for 1 h with PBS containing 5 % bovine serum albumin (BSA) and 0.2 % Triton X-100. The samples were incubated with a mouse monoclonal antibody to γH2AX (phospho S139) (Abcam, Cambridge, MA, USA) at 2 µg/mL, or a rabbit polyclonal antibody to RelA (Abcam) at 1 µg/mL for 1 h at room temperature. Subsequently, the cells were incubated with a goat anti-mouse secondary antibody conjugated to Alexa 488 (Abcam) or a goat anti-rabbit secondary antibody conjugated to Alexa 488 (Life Technologies, Carlsbad, CA, USA) at 5 µg/mL. Finally, the cells were stained with DAPI (Merck Millipore, Darmstadt, Germany) at 4 µg/mL, rhodamine phalloidin (Life Technologies) and whole-cell stain red (Life Technologies).
Apoptosis and necrosis assays
Cells were seeded into 96-well black plates with transparent bottom (Falcon, Corning, NY, USA) and cultured for 3 days before overnight drug treatment (16 h). They were treated with cisplatin, cyclosporin A, ochratoxin A, lincomycin, lithium chloride and ribavirin at 1000 μg/mL. Untreated cells and vehicle controls (DMSO and water) were included on each plate as well as positive (25 μg/mL arsenic(III) oxide) and negative (100 μg/mL dexamethasone) controls. Three technical replicates were performed for each treatment condition.
Cleaved caspase-3 (Abcam) and apoptotic/necrotic/healthy cells detection kits (PromoKine, Heidelberg, Germany) were used to identify apoptotic and necrotic cells. For cleaved caspase-3, the same immunostaining protocol as outlined above was used. The rabbit polyclonal anti-cleaved-caspase-3 antibody was diluted in blocking buffer and incubated with fixed cells for 1 h at room temperature. The cells were then incubated with a goat anti-rabbit secondary antibody conjugated to Alexa 488 at 5 µg/mL. Finally, the cells were counterstained with DAPI at 4 µg/mL and whole-cell stain red. For the apoptotic/necrotic/healthy cells detection kit, the protocols provided by the manufacturer were used.
Image acquisition
Imaging was performed with a 20× objective using the ImageXpress Micro XLS system (Molecular Devices, Sunnyvale, CA, USA). Four different channels were used to image DAPI, Alexa 488, Texas Red, and Cy5 fluorescence. Nine sites per well were imaged. The images were saved in 16-bit TIFF format.
Image segmentation and feature extraction
To reduce non-uniform background illumination, we corrected the images using the “rolling ball” algorithm implemented in ImageJ (NIH, v1.48; Sternberg 1983). Cell segmentations and feature measurements were performed using the cellXpress software platform (Bioinformatics Institute, v1.2; Laksameethanasan et al. 2013). We extracted 129 features from the images: 78 Haralick texture features, 29 intensity features, 9 intensity ratio features, 6 correlation features, 6 morphology features, and cell count. The detailed list of features and their markers is shown in Supplementary Material 2.
Haralick’s texture features
The mathematical definitions of all Haralick’s texture features are described in Haralick’s original paper (Haralick et al. 1973). Here, we only provide mathematical definitions for the Haralick’s features included in our final feature sets. A gray-level co-occurrence matrix (GLCM) is a matrix that describes the distribution of co-occurring gray-level values at a given offset \((\varDelta x,\varDelta y)\) in an \(N_{x} \times N_{y}\) image, \({\mathbf{I}}(x,y)\), with \(N_{g}\) gray levels. In our notation, \(x\) and \(y\) are the row and column indices, respectively. The GLCM matrix is defined by
$${\mathbf{GLCM}}_{\varDelta x,\varDelta y} (i,j) = \sum\limits_{x = 1}^{{N_{x} }} {\sum\limits_{y = 1}^{{N_{y} }} {\left\{ {\begin{array}{ll} {1\;,} & {{\text{if }}{\mathbf{I}}(x,y) = i\;\;{\text{and}}\;\;{\mathbf{I}}(x + \varDelta x,\,y + \varDelta y) = j} \\ {0\;,} & {\text{otherwise}} \\ \end{array} } \right.} } ,$$
where \(i\) and \(j\) are the gray-level or intensity values of the image. The normalized GLCM matrix is
$$p(i,j,\varDelta x,\varDelta y) = \frac{{{\mathbf{GLCM}}_{\varDelta x,\varDelta y} (i,j)}}{{\sum\nolimits_{i = 1}^{{N_{g} }} {\sum\nolimits_{j = 1}^{{N_{g} }} {{\mathbf{GLCM}}_{\varDelta x,\varDelta y} (i,j)} } }}$$
Then, we have the marginal and sum probability matrices to be \(p_{x} (j,\varDelta x,\varDelta y) = \sum\nolimits_{i = 1}^{{N_{g} }} {p(i,\,j,\,\varDelta x,\,\varDelta y)}\), \(p_{y} (i,\varDelta x,\varDelta y) = \sum\nolimits_{j = 1}^{{N_{g} }} {p(i,\,j,\,\varDelta x,\,\varDelta y)}\), and \(p_{x + y} (k,\varDelta x,\varDelta y) = \mathop {\sum\nolimits_{i = 1}^{{N_{g} }} {\sum\nolimits_{j = 1}^{{N_{g} }} {} } }\nolimits_{i + j = k} p(i,\,j,\varDelta x,\varDelta y),\) where \(\,k = 2,\,3,\, \ldots ,\,2N_{g} .\)
The Haralick’s features are
(a) Angular second moment: \(f_{\text{ASM}} (\varDelta x,\varDelta y) = \sum\nolimits_{i} {\sum\nolimits_{j} {\left\{ {p(i,j,\varDelta x,\varDelta y)} \right\}^{2} } }\)
(b) Correlation: \(f_{\text{COR}} (\varDelta x,\varDelta y) = \frac{{\sum\nolimits_{i} {\sum\nolimits_{j} {(i\,j)\,p(i,j,\varDelta x,\varDelta y)} } - \mu_{x} \mu_{y} }}{{\sigma_{x} \sigma_{y} }}\), where \(\mu_{x}\), \(\mu_{y}\), \(\sigma_{x}\) and \(\sigma_{y}\) are the means and standard deviations of \(p_{x} (j,\varDelta x,\varDelta y)\) and \(p_{y} (i,\,\varDelta x,\varDelta y)\), respectively.
(c) Sum average: \(f_{\text{SA}} (\varDelta x,\varDelta y) = \sum\nolimits_{k = 2}^{{2N_{g} }} {k\,p_{x + y} (k,\varDelta x,\varDelta y)}\)
(d) Sum variance: \(f_{\text{SV}} (\varDelta x,\varDelta y) = \sum\nolimits_{k = 2}^{{2N_{g} }} {(k - f_{\text{SA}} (\varDelta x,\varDelta y))^{2} \,p_{x + y} (k,\varDelta x,\varDelta y)}\)
(e) Sum entropy: \(f_{\text{SE}} (\varDelta x,\varDelta y) = - \sum\nolimits_{k = 2}^{{2N_{g} }} {p_{x + y} (k,\varDelta x,\varDelta y)\;\log \left[ {p_{x + y} (k,\varDelta x,\varDelta y)} \right]}\)
(f) Entropy: \(f_{E} (\varDelta x,\varDelta y) = - \sum\nolimits_{i} {\sum\nolimits_{j} {p(i,j,\varDelta x,\varDelta y)\;} } \log [p(i,j,\varDelta x,\varDelta y)]\)
(g) Information measure of correlation 2: \(f_{{{\text{IMC}}2}} (\varDelta x,\varDelta y) = \sqrt {\left| {1 - \exp \left[ { - 2\left( {{\text{HXY}}2 - f_{E} (\varDelta x,\varDelta y)} \right)} \right]} \right|}\), where \({\text{HXY}}2 = - \sum\nolimits_{i} {\sum\nolimits_{j} {p_{x} (j,\varDelta x,\varDelta y)p_{y} (i,\varDelta x,\varDelta y)\log \left[ {p_{x} (j,\varDelta x,\varDelta y)p_{y} (i,\varDelta x,\varDelta y)} \right]} }\)
In our study, the images were the bounding boxes around the segmented cells with all the background pixels set to zero. We quantized the images into \(N_{g} = 256\) gray levels, and computed all the Haralick’s features for 0° \(\left( {\varDelta x = 0,\varDelta y = 1} \right)\), 45° \(\left( {\varDelta x = 1,\varDelta y = 1} \right)\), 90° \(\left( {\varDelta x = 1,\varDelta y = 0} \right)\), and 135° \(\left( {\varDelta x = 1,\varDelta y = - 1} \right)\) offsets. For each feature, the mean and standard deviation of the feature across all the offset values were used. We have implemented the extraction procedures for all the features using C++ in the cellXpress software platform (Bioinformatics Institute, v1.2; Laksameethanasan et al. 2013).
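As an illustration of the definitions above, the following minimal Python sketch computes a normalized GLCM and three of the listed features for a small toy image. It is not the C++ implementation in cellXpress; the function names, the toy image, and the use of the sample standard deviation across offsets are assumptions made for this example. Gray levels are stored as 0-based list indices, so the sum-average term adds 2 to recover the 1-based levels used in the formulas.

```python
import math
import statistics

# Offsets quoted in the text for 0, 45, 90 and 135 degrees (x = row, y = column).
OFFSETS = [(0, 1), (1, 1), (1, 0), (1, -1)]

def normalized_glcm(image, dx, dy, n_levels):
    """Return p(i, j) for one offset; assumes at least one valid pixel pair."""
    counts = [[0] * n_levels for _ in range(n_levels)]
    n_rows, n_cols = len(image), len(image[0])
    for x in range(n_rows):
        for y in range(n_cols):
            x2, y2 = x + dx, y + dy
            if 0 <= x2 < n_rows and 0 <= y2 < n_cols:
                counts[image[x][y]][image[x2][y2]] += 1
    total = sum(map(sum, counts))
    return [[c / total for c in row] for row in counts]

def angular_second_moment(p):
    return sum(v * v for row in p for v in row)

def entropy(p):
    # Zero entries are skipped, i.e. 0 * log(0) is treated as 0.
    return -sum(v * math.log(v) for row in p for v in row if v > 0)

def sum_average(p):
    # (i + 1) + (j + 1) restores the 1-based gray levels of the formulas.
    return sum((i + j + 2) * v for i, row in enumerate(p) for j, v in enumerate(row))

def mean_and_sd_over_offsets(image, feature, n_levels):
    """Summarize one feature across the four offsets (sample SD is an assumption)."""
    values = [feature(normalized_glcm(image, dx, dy, n_levels)) for dx, dy in OFFSETS]
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical 3 x 3 image with three gray levels.
toy_image = [[0, 0, 1],
             [0, 1, 1],
             [2, 2, 1]]
p0 = normalized_glcm(toy_image, 0, 1, n_levels=3)
asm_mean, asm_sd = mean_and_sd_over_offsets(toy_image, angular_second_moment, 3)
```

For real cell images the same functions would be called with \(N_{g} = 256\) after quantization; the sketch does not treat the zero-valued background pixels differently from other pixels.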
Concentration response curve and Δmax estimations
After feature extraction, we divided the values of a feature at all the tested compound concentrations by the values of the feature under the corresponding vehicle control conditions. Then, the ratios were log2-transformed (Δ). All further data analyses, including building concentration response curves and toxicity classifiers, were performed using customized scripts under the R statistical environment (the R Foundation, v3.0.2) and the Windows 7 operating system (Microsoft, USA).
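The ratio and log transformation can be sketched as follows. This is a Python illustration rather than the R scripts used in the study, and it assumes that the vehicle control is summarized by a single positive value per feature; the function name and the numbers are hypothetical.

```python
import math

def log2_ratio(treated_values, vehicle_value):
    """Return the log2 ratio of each treated value to the vehicle control value."""
    return [math.log2(v / vehicle_value) for v in treated_values]

# Hypothetical feature values at three concentrations, vehicle control = 1.0.
deltas = log2_ratio([2.0, 0.5, 1.0], 1.0)
```

A value of 0 then means no change relative to the vehicle control, and positive or negative values indicate a doubling-scale increase or decrease.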
For each feature, we estimated its concentration response curve using a standard log-logistic model:
$$\varDelta (x,(b,c,d,e)) = c + \frac{d - c}{{1 + \exp \{ b(\log (x) - \log (e))\} }},$$
where x is the xenobiotic compound concentration, c and d are the lower and upper limits of the response, e is the concentration at which the response is half-way between c and d, and b is the relative slope around e. We used the “drc” library (v 2.3-96) under the R environment to fit the values of b, c, d, and e. After that, the maximum response values (\(\varDelta_{\hbox{max} }\)) were determined using the estimated response curves. In theory, \(\varDelta_{\hbox{max} }\) should be equal to the upper limit d. However, in practice, the responses of some compounds may not plateau even at the highest tested dosages, and therefore the estimated d value may not be accurate. Instead, we fixed \(\varDelta_{\hbox{max} }\) to be the response value at 5 mM, which was around the highest tested concentration for most of our compounds. Finally, the median values of \(\varDelta_{\hbox{max} }\) across the three biological replicates were computed. The final result was a 129 × 44 (or 42) matrix of \(\varDelta_{\hbox{max} }\) values, which were used for training and testing the classifiers. Each row of the matrix was a feature vector, \({\mathbf{f}}_{i}\), where \(i = 1,\,2,\, \ldots ,\,129\).
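The evaluation of a fitted curve at a fixed reference concentration can be sketched as below. The sketch uses the four-parameter log-logistic form of the drc package (its LL.4 model), in which the lower limit c enters as an additive term; it does not perform the fitting itself. The function names and the parameter values are hypothetical, and the reference concentration must be expressed in the same units as e.

```python
import math
import statistics

def log_logistic(x, b, c, d, e):
    """Four-parameter log-logistic response at concentration x (x > 0, e > 0)."""
    return c + (d - c) / (1.0 + math.exp(b * (math.log(x) - math.log(e))))

def delta_max(replicate_fits, x_ref):
    """Median response at x_ref across replicates; each fit is a (b, c, d, e) tuple."""
    responses = [log_logistic(x_ref, *params) for params in replicate_fits]
    return statistics.median(responses)

# Hypothetical fitted parameters for three biological replicates.
fits = [(-2.0, 0.0, 1.0, 10.0), (-2.0, 0.0, 2.0, 10.0), (-2.0, 0.0, 3.0, 10.0)]
dmax = delta_max(fits, x_ref=10.0)
```

At x = e the response is exactly half-way between c and d, which gives a quick sanity check on any fitted parameter set.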
Feature normalization
Before data classification, each feature vector \({\mathbf{f}}_{i}\) was normalized to the same range [−1, 1]:
$${\mathbf{f}}_{i} \leftarrow 2\frac{{({\mathbf{f}}_{i} - f_{\hbox{min} } )}}{{f_{\hbox{max} } - f_{\hbox{min} } }} - 1,$$
where \(f_{\hbox{min} }\) and \(f_{\hbox{max} }\) are the minimum and maximum values of the feature. To ensure the training and test datasets were independent of each other, these two normalization coefficients were estimated using only the training data, but applied to both training and test datasets.
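A minimal Python sketch of this normalization is given below; the function names and values are hypothetical. Because the coefficients come from the training data only, normalized test values can fall outside [−1, 1].

```python
def fit_min_max(train_values):
    """Estimate the normalization coefficients from the training data only."""
    return min(train_values), max(train_values)

def scale_feature(values, f_min, f_max):
    """Map values so that f_min -> -1 and f_max -> 1 (assumes f_max > f_min)."""
    return [2.0 * (v - f_min) / (f_max - f_min) - 1.0 for v in values]

# Hypothetical feature values.
f_min, f_max = fit_min_max([0.0, 5.0, 10.0])
train_scaled = scale_feature([0.0, 5.0, 10.0], f_min, f_max)
test_scaled = scale_feature([15.0], f_min, f_max)  # lies outside [-1, 1]
```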
Random forest classification
We used the random-forest algorithm (Breiman 2001) to predict xenobiotic-induced nephrotoxicity, because we have previously shown that the algorithm outperforms other commonly used classifiers, including support vector machine, k-nearest neighbors and naïve Bayes (Su et al. 2014). A random forest has two main parameters: \(N_{\text{tree}}\) and \(N_{\text{trial}}\). The first parameter specifies the number of decision trees built, and the second parameter specifies the number of random features tried at each split of the decision trees. During cross-validation, we automatically determined the optimum classifier parameters using a grid search over \(N_{\text{tree}} = \left\{ {10,50,150,250,400,500} \right\}\) and \(N_{\text{trial}} = \left\{ {1,2,3,4,5} \right\}\). A series of temporary random forests was trained using all the possible combinations of parameters based on a training dataset \(\bar{\mathbf{X}}^{\prime}_{\text{training}}\), and the test accuracies of these combinations were estimated based on an independent test dataset \(\bar{\mathbf{X}}^{\prime}_{\text{FStest}}\). The combination of \(N_{\text{tree}}\) and \(N_{\text{trial}}\) with the highest test accuracy value was selected to train a final classifier, whose performance would then be estimated using a third independent test dataset \(\bar{\mathbf{X}}^{\prime}_{\text{RFtest}}\). We used the “randomForest” library (v4.6-10) under the R environment.
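The grid search over the 30 parameter combinations can be sketched in Python as follows. The `score` callable is a hypothetical stand-in for training a temporary random forest on the training dataset and measuring its accuracy on the independent test dataset; the toy score and the tie-breaking rule (first combination encountered) are assumptions of this sketch.

```python
from itertools import product

N_TREE = [10, 50, 150, 250, 400, 500]
N_TRIAL = [1, 2, 3, 4, 5]

def grid_search(score):
    """Return the (n_tree, n_trial) pair with the highest score(n_tree, n_trial)."""
    return max(product(N_TREE, N_TRIAL), key=lambda params: score(*params))

# Toy score that peaks at (150, 3); a real score would train and test a classifier.
def toy_score(n_tree, n_trial):
    return -(abs(n_tree - 150) + abs(n_trial - 3))

best_params = grid_search(toy_score)
```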
Automated feature selection
We used a greedy search algorithm, namely recursive feature elimination (RFE; Loo et al. 2007), to select a subset of features from all the extracted features \({\mathbf{F}}_{\text{all}} = \{ {\mathbf{f}}_{1} ,{\mathbf{f}}_{2} , \ldots ,{\mathbf{f}}_{{m_{\text{all}} }} \}\). The pseudocode for the algorithm is listed in Algorithm S2 (Supplementary Material 1—Text S1). The main idea is to start with all the features, iteratively rank the current feature set, remove the least important feature subset, evaluate the accuracy \({\text{acc}}_{j}\) of the retained feature subset \({\mathbf{F}}_{j}\) and finally select the feature subset with the highest accuracy. To reduce data overfitting, the ranking and evaluation of feature subsets were performed in two independent datasets, \(\left\{ \bar{\mathbf{X}}^{\prime}_{\text{training}} ,\,\bar{\mathbf{X}}^{\prime}_{\text{FStest} } \right\}\) and \(\bar{\mathbf{X}}^{\prime}_{\text{RFtest}}\), respectively (Algorithm S2). We ranked features based on their importance values estimated by the random forest algorithm by permuting the out-of-bag data and features (Breiman 2001).
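The elimination loop described above can be sketched in Python as follows. This is not Algorithm S2: the `rank` and `evaluate` callables are hypothetical placeholders for the random-forest importance ranking and the accuracy estimation on an independent dataset, and removing exactly one feature per iteration is an assumption of the sketch.

```python
def recursive_feature_elimination(features, rank, evaluate):
    """Return a list of (feature_subset, accuracy) pairs, largest subset first.

    rank(features) must return the features ordered from most to least important;
    evaluate(features) must return the accuracy of a classifier using them.
    """
    current = list(features)
    history = []
    while current:
        history.append((list(current), evaluate(current)))
        current = rank(current)[:-1]  # drop the least important feature
    return history

# Toy example with made-up importance values and accuracies.
importance = {"a": 4, "b": 3, "c": 2, "d": 1}
toy_accuracy = {4: 0.7, 3: 0.8, 2: 0.9, 1: 0.6}

history = recursive_feature_elimination(
    ["d", "c", "b", "a"],
    rank=lambda feats: sorted(feats, key=importance.get, reverse=True),
    evaluate=lambda feats: toy_accuracy[len(feats)],
)
best_subset, best_acc = max(history, key=lambda item: item[1])
```

The returned history keeps every evaluated subset, so a different selection rule than the simple maximum can be applied afterwards.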
In datasets with small sample sizes, the \({\text{acc}}_{j}\) curve (as a function of \({\mathbf{F}}_{j}\)) may not be smooth. Thus, the global maximum of \({\text{acc}}_{j}\) may not be a robust criterion for selecting the final feature subset. Instead, we designed an automated procedure to select a feature subset using Gaussian mixture modeling (GMM; Hastie et al. 2009). We clustered all the \({\text{acc}}_{j}\) values into 2–4 groups. Each of them was modeled as a Gaussian distribution. Then, we selected the smallest feature subset in the group with the highest average prediction accuracy (Algorithm S2). The optimum number of groups was also automatically determined based on the Bayesian information criterion (BIC), \({\text{BIC}} = - 2L_{m} + N_{d} \log \,(N_{s} )\), where \(N_{s}\) is the sample size, \(L_{m}\) is the maximum log-likelihood computed by the GMM algorithm, and \(N_{d}\) is the number of the parameters.
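The BIC formula and the final selection step can be sketched in Python as below. The sketch does not fit the Gaussian mixture; it assumes that cluster labels for the accuracy values are already available from a GMM fit. The function names and the toy values are hypothetical. With this sign convention, the model with the lowest BIC is preferred.

```python
import math

def bic(max_log_likelihood, n_params, n_samples):
    """Bayesian information criterion: -2 * L_m + N_d * log(N_s)."""
    return -2.0 * max_log_likelihood + n_params * math.log(n_samples)

def select_subset(subsets, accuracies, labels):
    """Pick the smallest subset within the cluster whose mean accuracy is highest."""
    groups = {}
    for subset, acc, label in zip(subsets, accuracies, labels):
        groups.setdefault(label, []).append((subset, acc))
    best_group = max(groups.values(), key=lambda g: sum(a for _, a in g) / len(g))
    return min(best_group, key=lambda item: len(item[0]))[0]

# Toy example: five nested subsets, accuracies, and cluster labels (all made up).
subsets = [list(range(n)) for n in (5, 4, 3, 2, 1)]
accuracies = [0.80, 0.82, 0.81, 0.70, 0.60]
labels = [0, 0, 0, 1, 1]
chosen = select_subset(subsets, accuracies, labels)
```

In this toy case the single best accuracy belongs to the four-feature subset, but the rule returns the three-feature subset because it is the smallest member of the high-accuracy group.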
Classification performance estimation
We used a stratified tenfold cross-validation procedure (Hastie et al. 2009) to estimate the PTC toxicity prediction performance of our phenotypic features. The pseudocode for the procedure is listed in Algorithm S1 (Supplementary Material 1—Text S1). The procedure has two main cross-validation loops. The first cross-validation loop aims to identify an optimum feature subset \({\mathbf{F}}_{\text{final}}\), while the second cross-validation loop aims to estimate the generalized prediction performance of \({\mathbf{F}}_{\text{final}}\). To keep the training and test data independent from each other, we divided all the treatment conditions into four non-overlapping sets, \({\mathbf{X}}_{\text{training}} ({\mathbf{F}}_{\text{all}} )\), \({\mathbf{X}}_{\text{FStest}} ({\mathbf{F}}_{\text{all}} )\), \({\mathbf{X}}_{\text{RFtest}} ({\mathbf{F}}_{\text{all}} )\), and \({\mathbf{X}}_{\text{test}} ({\mathbf{F}}_{\text{all}} )\). Furthermore, the normalization coefficients and classifier parameters were always estimated based on the training datasets only, but applied to both training and test datasets.
We used the following performance measurements:
$$\begin{aligned} {\text{Sensitivity}} & = \frac{\text{TP}}{{{\text{TP}} + {\text{FN}}}} \times 100\;\% , \\ {\text{Specificity}} & = \frac{\text{TN}}{{{\text{TN}} + {\text{FP}}}} \times 100\;\% ,\quad {\text{and}} \\ {\text{Balanced}}\;{\text{accuracy}}\;({\text{acc}}) & = \frac{{{\text{Sensitivity}} + {\text{Specificity}}}}{2}, \\ \end{aligned}$$
where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. The same performance estimation procedure was used for HPTC-A, HPTC-B and HK-2 datasets.
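These three measurements can be computed from the confusion counts as in the following Python sketch; the function name and the counts are hypothetical and are not results from this study.

```python
def performance(tp, tn, fp, fn):
    """Return sensitivity, specificity and balanced accuracy, all in percent."""
    sensitivity = tp / (tp + fn) * 100.0
    specificity = tn / (tn + fp) * 100.0
    balanced_accuracy = (sensitivity + specificity) / 2.0
    return sensitivity, specificity, balanced_accuracy

# Made-up counts for 24 positive and 20 negative compounds.
sens, spec, acc = performance(tp=18, tn=16, fp=4, fn=6)
```

Balanced accuracy weights the two classes equally, which matters when the numbers of toxic and non-toxic compounds differ.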
Multi-dimensional scaling plots
To compare the compounds in the chemical structure space, we used the ChemmineR library to compute the pairwise Tanimoto coefficients between the structures of all the reference compounds. To compare the compounds in the phenotypic feature space, we first scaled all the phenotypic features to the same range [0, 1] and then computed the pairwise Euclidean distances between the feature values of all the reference compounds. Finally, we used the cmdscale function (Torgerson 1952) in the R environment to generate the multi-dimensional scaling plots.
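The two pairwise measures can be sketched in Python as below. The sketch represents each structure as a set of hypothetical descriptor identifiers, which is an assumption and not the descriptor type computed by the R library; it also omits the multi-dimensional scaling step itself. Returning 1.0 for two empty descriptor sets is a convention chosen for this example.

```python
import math

def tanimoto(a, b):
    """Tanimoto coefficient: shared descriptors divided by all distinct descriptors."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def scale_01(column):
    """Rescale one feature across compounds to [0, 1] (assumes max > min)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) for v in column]

def euclidean(u, v):
    """Euclidean distance between two equally long feature vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(u, v)))

# Made-up descriptor sets and feature values.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})
scaled = scale_01([2.0, 4.0, 6.0])
distance = euclidean([0.0, 0.0], [3.0, 4.0])
```

A Tanimoto coefficient is a similarity, so it would typically be converted to a dissimilarity (for example, one minus the coefficient) before being passed to a scaling routine; whether this conversion was used here is not stated in the text.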