1 Introduction

Since the breakthrough of chimeric antigen receptor-expressing T cells (CAR-T cells), cancer immunotherapy based on ex vivo engineered immune cells has been an active area of research and development [1,2,3]. Analysis of cancer cell killing mediated by immune cells, including cytotoxic T cells, natural killer (NK) cells, and γδ T cells, is essential for the development of immune cell therapies for cancer. Various cytotoxicity assays, such as the chromium-51 release assay [4, 5], the propidium iodide (PI) incorporation assay [6, 7], and the lactate dehydrogenase (LDH) assay [8, 9], have been used to assess the ability of immune cells to recognize and kill tumor cells. However, these assays primarily provide population-level cytotoxicity measurements at specific time points and lack information about the detailed interactions between immune cells and cancer cells at the single-cell level.

Direct observation of immune cell-cancer cell interactions using live cell imaging is a powerful tool for the mechanistic investigation of immune cell-mediated cancer cell killing [10, 11], but it has not been adopted in standard assays because data acquisition and analysis are time-consuming. Microwells of various dimensions have been fabricated to confine immune cells and cancer cells within a microscopic field of view, so that the same cells can be traced for long periods of time [12]. Combining microwell-based experiments with motorized stages has substantially increased data acquisition throughput. However, data analysis has relied predominantly on manual tracking of individual cells. For a comprehensive understanding of the heterogeneous information embedded in live cell imaging data, the adoption of automated image analysis is essential [13, 14].

Recent advances in deep learning have greatly enhanced the automated processing of cell imaging data. Various deep learning models have been developed for cell segmentation, classification, and tracking [15,16,17]. Furthermore, the integration of deep learning models into open-source programs, such as ImageJ [18] and CellProfiler [19], has improved the accessibility of deep learning-based cell imaging analysis [20,21,22]. However, applying existing deep learning technology specifically to the analysis of immune cell-cancer cell interactions is technically challenging because cell locations, contact areas, and interaction statuses are highly dynamic. A semi-automated, machine learning-based image analysis has been used to examine NK cell-mediated cancer cell killing within droplets, but in that setting NK cell migration was restricted by confinement within the droplets, so NK cell detachment after target cell killing was limited [23]. Immune synapse formation between CAR-T cells and cancer cells has also been imaged using optical diffraction tomography and automatically analysed by deep learning; however, the images suited for deep learning-based analysis were generated with a cutting-edge microscopy technique that is not accessible in typical laboratories [24].

In our previous study, we developed a microwell array capturing single cancer cells to facilitate a live cell imaging-based NK cell cytotoxicity assay (Fig. 1A) [25]. Using this platform, we quantitatively analysed the NK cell-cancer cell interactions crucial for NK cell-mediated cytotoxicity, such as engaging a cancer cell, killing it, and detaching from it to find another target. While image analysis was performed manually in the previous study, the data generated with these microwells are well suited for deep learning-based automated image analysis: cancer cells sit on the lower floor of the microwells, whereas NK cells are located on the upper surface of the microwells (Fig. 1B). Since each cell type is imaged in a different focal plane, it is not necessary to differentiate between cell types within a single image. Additionally, because the data can be partitioned by microwell and by frame, high-quality, large-scale datasets well suited for deep learning analysis can readily be generated.

Fig. 1

Schematic illustration of experimental setup and data structure. A. Top and side views of the microwell system used for experiments. Cancer cells are placed in microwells designed to hold one cell each, and NK cells migrate freely on top of the cancer cells in the microwells. Therefore, NK cell images and cancer cell images are located in different optical planes. B. Representative bright field images acquired from two different focal planes. NK focal plane images (NK-FPIs) contain NK cell images along with out-of-focus microwell images, while cancer focal plane images (C-FPIs) consist of focused microwells containing cancer cells. Scale bar: 20 μm

In this study, a deep learning-based image processing method was devised to automatically analyze data produced by the microwell platform in order to extract quantitative information regarding NK cell-cancer cell interactions during NK cell-mediated cancer cell recognition and killing. First, the acquired images from the microwell experiments were pre-processed and classified to generate data sets for deep learning training. Second, various deep learning models were constructed and tested using the pre-processed data to select the optimal model for the analysis. Finally, automatic data analysis was conducted utilizing the chosen deep learning model, and the results generated by the deep learning model were compared to those produced by manual analysis.

2 Results and Discussion

2.1 Construction of Deep Learning-Based Analysis Frame for Microwell Imaging Data

2.1.1 Microwell Image Pre-processing and Labelling

Microwells trapping single cancer cells were mounted on a microscope stage maintained under cell culture conditions (37 °C and 5% CO2), and NK cells were gently added onto the cancer cell-laden microwells (Fig. 1A). Live cell imaging data were collected by acquiring consecutive images of two focal planes, one for NK cells and the other for cancer cells, of identical fields of view at 1 min time intervals (Fig. 1B). In each experiment, a motorized stage was used to scan 10 randomly selected positions, each containing ~ 50 microwells (Fig. 1B).

Data pre-processing and labelling were conducted according to the scheme shown in Fig. 2. First, the position of each microwell was extracted from the time-lapse cancer focal plane images (C-FPIs) based on the repeated appearance of the microwells’ outer concentric circles (Fig. 2i). Using this location data, the image of each individual microwell was extracted by cropping both C-FPIs (Fig. 2ii) and NK-FPIs (Fig. 2iii) into 80 px × 80 px images. The cropped images were then manually classified into different categories to generate datasets for training/validation/testing. For C-FPIs, microwells were either empty or contained cancer cells in various states, such as live, dying (blebbing), and dead; they were therefore classified into four categories: Empty, Live, Dying, and Dead. The classification was performed by analyzing morphology and temporal changes in shape as previously described [25]: live cells were identified by their smooth appearance and continuous shape changes, dying cells exhibited a shrunken morphology with dynamic membrane blebbing, and dead cells displayed a shrunken morphology with stationary blebs. Cell death was further confirmed with propidium iodide fluorescence images. For NK-FPIs, NK cells were either present on top of the microwell or absent, so the images were classified as either Exist or Absent. Live/dead criteria were not considered for NK cells because NK cell viability was high (> 90%) and the probability of a dead NK cell being located on a microwell was low. Cropped images of 62,297 C-FPIs (Empty: 20,622, Live: 15,614, Dying: 2410, and Dead: 23,654) and 50,962 NK-FPIs (Exist: 14,456, Absent: 36,506) were prepared for training. Owing to the image analysis-friendly structure of the microwell chips, it was straightforward to produce the large amount of classifiable data crucial for the training and validation of deep learning models.

Fig. 2

Preprocessing of the acquired images for deep learning training. From the C-FPI, the positions of the microwells are determined (i). Using the position data, C-FPIs (ii) and corresponding NK-FPIs (iii) are cropped. Then, the cropped C-FPIs are labelled as ‘Live’, ‘Dying’, ‘Dead’, or ‘Empty’ (iv), and NK-FPIs are labelled as ‘Exist’ or ‘Absent’ (v). Scale bar: 10 μm

2.1.2 Deep Learning Model Construction Using Microwell Imaging Data

The training and evaluation pipeline was constructed as schematically depicted in Fig. 3. To identify the optimal input for classification, we utilized either three consecutive image frames (3-frame), which capture changes over time, or single-frame images (1-frame). To determine an appropriate deep learning model for analysing the datasets, we tested off-the-shelf deep CNN models pre-trained on ImageNet data, such as VGG19 [26], InceptionResnetV2 [27], and DenseNet [28]. We also tested simpler customized CNN models comprising 2 (2L-CNN) or 3 (3L-CNN) convolutional layers, schematically described in Fig. 4, because complex models are prone to overfitting [29, 30]. For the VGG19 model, the parameters from ImageNet were either kept fixed (VGG19 (fixed)) or updated during training (VGG19 (unfixed)). For all models, the original softmax layer was replaced with three custom fully connected layers followed by a softmax layer for classification.
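How the three consecutive frames of a 3-frame input are packed into a single network input is not detailed here; a minimal sketch, assuming the frames are simply stacked as image channels (our assumption, not the authors' documented format), is:

```python
import numpy as np

def stack_3frames(frames):
    """
    Build 3-frame inputs from a time-ordered list of grayscale 80 x 80 crops.
    Each sample stacks three consecutive frames along the channel axis,
    giving shape (80, 80, 3); channel stacking is an illustrative assumption.
    """
    return np.stack([np.stack(frames[i:i + 3], axis=-1)
                     for i in range(len(frames) - 2)])
```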

Fig. 3

Workflow for the comparative study of various deep learning models for the classification of C-FPIs and NK-FPIs

Fig. 4

Structures of the 2L-CNN and 3L-CNN models
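As illustration only, a minimal tf.keras sketch of a 3-convolutional-layer classifier with three fully connected layers and a softmax output, as described above, might look as follows; the filter counts, kernel sizes, dense widths, and optimizer are placeholder assumptions rather than the exact configuration of Fig. 4.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

def build_3l_cnn(input_shape=(80, 80, 3), n_classes=4):
    """Sketch of a 3-convolutional-layer classifier; hyperparameters are illustrative."""
    model = models.Sequential([
        layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), activation='relu'),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        # Three fully connected layers with a softmax output, as used to replace
        # the original classification head; L2 regularization on the dense weights.
        layers.Dense(256, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(0.01)),
        layers.Dense(n_classes, activation='softmax'),
    ])
    # Optimizer choice is an assumption; the loss matches the equation in Sect. 4.5.
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```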

Ten percent of the labelled data was set aside as a test set, and the remaining 90% was randomly divided into training and validation sets at an 8:2 ratio. Then, to assess the suitability of each model, we plotted learning curves showing the accuracy and loss across epochs on the training set (blue lines in Fig. 5) and validation set (orange lines in Fig. 5). To minimize overfitting, training was limited to 32 epochs, which typically yielded accuracy > 0.9 on the training set for most models, and L2 regularization was implemented. Overall, the learning curves generated by the 2L-CNN, 3L-CNN, and VGG19 (fixed) models on the validation set were smooth and converged with those on the training set. In contrast, the validation accuracies of the VGG19 (unfixed), InceptionResnetV2, and DenseNet models were highly variable and generally lower than their training accuracies, indicating that these models suffer from stability and overfitting issues. Therefore, only the 2L-CNN, 3L-CNN, and VGG19 (fixed) models were considered for the detailed performance evaluation.
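A minimal sketch of this split-and-train procedure, assuming scikit-learn for the random splits and reusing the build_3l_cnn sketch above (the placeholder data, batch size, and random seeds are assumptions):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays standing in for the cropped microwell images and one-hot labels.
X = np.random.rand(1000, 80, 80, 3).astype('float32')
y = np.eye(4)[np.random.randint(0, 4, 1000)]

# Hold out 10% as a test set, then split the remainder 8:2 into training/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.10, shuffle=True, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.20, shuffle=True, random_state=0)

model = build_3l_cnn(input_shape=(80, 80, 3), n_classes=4)
history = model.fit(X_train, y_train, validation_data=(X_val, y_val),
                    epochs=32, batch_size=64)

# Per-epoch accuracy/loss for the learning curves; the accuracy key is
# 'acc' in TensorFlow 1.x and 'accuracy' in later versions.
learning_curves = history.history
```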

Fig. 5

Accuracy and loss curves of each deep learning model for C-FPI (A) and NK-FPI (B) over epochs. Blue lines: training data; orange lines: validation data

To compare the overall performance of the three models, the results of 20 repetitions for each model’s accuracy, recall, precision, and kappa were averaged and compiled (Tables 1 and 2). When 3-frame images were used as input, overall performance was either superior (C-FPIs) or at least equivalent (NK-FPIs) to that with 1-frame images. Therefore, we conducted a systematic statistical comparison among the three models using 3-frame images (Fig. 6). Among the three models, the 3L-CNN demonstrated the best performance for all metrics. In addition to the performance evaluation, confusion matrices were produced using the test set to assess the predictability of each case (Fig. 7). All three models exhibited diagonal elements > 0.9, except for the ‘Dying’ case of C-FPI, indicating that the overall performance of all models was excellent. Of note, the 3L-CNN exhibited the highest prediction accuracy for the ‘Dying’ case of C-FPI. The low accuracy in predicting the ‘Dying’ case is potentially due to the imbalanced dataset, which contains only 3.8% (2,410 out of 62,297) ‘Dying’ images. To test this possibility, the weight of the ‘Dying’ class was adjusted from 1 to 4 for the 3L-CNN, but the confusion matrix did not improve. Based on these results, the 3L-CNN with 3-frame images was used for the rest of the analysis.
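One common way to apply such class re-weighting, assuming Keras-style training as in the earlier sketches, is the class_weight argument of model.fit; the mapping below up-weights the ‘Dying’ class (index 1 in the C-FPI encoding of Sect. 4.2) and is an illustration rather than the authors’ exact code.

```python
# Up-weight the under-represented 'Dying' class (index 1 in [Live, Dying, Dead, Empty]).
class_weight = {0: 1.0, 1: 4.0, 2: 1.0, 3: 1.0}
history = model.fit(X_train, y_train,
                    validation_data=(X_val, y_val),
                    epochs=32, batch_size=64,
                    class_weight=class_weight)
```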

Table 1 Performance of each model for C-FPI
Table 2 Performance of each model for NK-FPI
Fig. 6

Performance of each model using 3-frame data for C-FPI (A) and NK-FPI (B). The statistical significance of the differences between models was evaluated using ANOVA for normally distributed data. ***p < 0.001

Fig. 7

Confusion matrix of each model for C-FPI (A) and NK-FPI (B)

2.2 Deep Learning-Based Automated Analysis of NK cell-Cancer Cell Interactions

2.2.1 Analysis Framework for NK Cell-Cancer Cell Interactions

To analyse the interactions between cancer cells in microwells and NK cells, the time-lapse C-FPIs and NK-FPIs for each microwell were classified, and the classified states were paired and reconstructed over time to generate time-series information (Fig. 8). By tracing the paired [NK-FPI/C-FPI] states over time, information on NK cell-cancer cell interactions was extracted. For example, [Absent/Live] → [Exist/Live] (Fig. 8Ai) indicates the initiation of an NK cell-cancer cell interaction, [Exist/Live] → [Exist/Dying] (Fig. 8Aii) indicates NK cell-mediated cancer cell killing, and [Exist/Dead] → [Absent/Dead] implies detachment of the NK cell after killing (Fig. 8Aiii).

Fig. 8

Deep learning-based analysis of NK cell-cancer cell interactions for NK cells successfully killing cancer cells. A. Scheme of data analysis. B, C. Measurement of T, T1, and T2 by deep learning (B) and by manual analysis in the previous study (C) [25]. The statistical significance of the differences was evaluated using Student’s t-test. ns not significant, *p < 0.05, and ***p < 0.001

2.2.2 Comparison Between Deep Learning-Based Analysis and Manual Analysis

Based on this framework, we re-analyzed our earlier microwell experiments and compared the outcomes with the published results [25], which were analysed manually. In the previous work, microwells were coated with either fibronectin (FN), which promotes cell adhesion, or bovine serum albumin (BSA), which prevents cell adhesion, and their influence on NK cell-cancer cell interactions was examined. Specifically, time-lapse imaging was manually examined to select NK cells that successfully killed cancer cells, and three durations were measured (Fig. 8A): T1, the duration from initial engagement to the onset of cancer cell death; T2, the duration from the onset of cancer cell death to the detachment of the NK cell; and T, the total NK cell-cancer cell interaction duration for cancer cell killing (T1 + T2). The previous manual analysis revealed that T1 was much greater for the FN case whereas T2 was significantly higher for the BSA case, leading to an equivalent T for both cases (Fig. 8C).

We performed a similar analysis using the deep learning model to compare it with the manual analysis. First, the paired [NK-FPI/C-FPI] states in the time-series sequences were examined to find the cases showing the sequential transition [Absent/Live] → [Exist/Live] → [Exist/Dying] → [Absent/Dead], i.e., successful NK cell-mediated cancer cell killing (Fig. 8). The time-series sequences were then examined to identify Tα, Tβ, and Tγ, the time points at which the [Absent/Live] → [Exist/Live], [Exist/Live] → [Exist/Dying], and [Exist/Dying] → [Absent/Dead] transitions occurred, respectively. Then, T1, T2, and T were determined as follows: T1 = Tβ − Tα, T2 = Tγ − Tβ, and T = Tγ − Tα. The overall trends of T1, T2, and T automatically calculated by the deep learning model (Fig. 8B) were identical to those obtained by manual examination (Fig. 8C).
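A minimal sketch of this transition search, assuming each microwell is represented as a time-ordered list of (NK-FPI, C-FPI) state pairs classified by the model and a 1 min frame interval; the matching rules are a simplification, not a reproduction of the authors’ pipeline.

```python
def killing_times(states, dt_min=1.0):
    """
    states: list of (nk, c) labels per frame, e.g. ('Exist', 'Dying'),
    ordered in time with a fixed interval dt_min (minutes).
    Returns (T1, T2, T) in minutes for a microwell showing the
    [Absent/Live] -> [Exist/Live] -> [Exist/Dying] -> [Absent/Dead] sequence,
    or None if the killing signature is not found.
    """
    t_alpha = t_beta = t_gamma = None
    for i in range(1, len(states)):
        prev, curr = states[i - 1], states[i]
        if t_alpha is None and prev == ('Absent', 'Live') and curr == ('Exist', 'Live'):
            t_alpha = i                                  # engagement (T_alpha)
        elif t_alpha is not None and t_beta is None \
                and prev == ('Exist', 'Live') and curr == ('Exist', 'Dying'):
            t_beta = i                                   # onset of cancer cell death (T_beta)
        elif t_beta is not None and t_gamma is None \
                and prev[0] == 'Exist' and curr == ('Absent', 'Dead'):
            t_gamma = i                                  # NK detachment after killing (T_gamma)
    if None in (t_alpha, t_beta, t_gamma):
        return None
    T1 = (t_beta - t_alpha) * dt_min
    T2 = (t_gamma - t_beta) * dt_min
    return T1, T2, T1 + T2
```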

2.2.3 New analysis Enabled by Deep Learning

Manual analysis is frequently skewed toward particular events of interest because of the time-consuming nature of examining the complete data [13, 31, 32]. As a result, the prior analysis primarily focused on the NK cell-cancer cell interactions that resulted in cancer cell killing, but not every NK cell-cancer cell interaction leads to cancer cell death (Fig. 9A). A comprehensive quantitative evaluation of all NK cell-cancer cell interactions would therefore reveal additional information on NK cell cytotoxicity. First, we measured the probability of NK cell detachment without killing. We traced all NK cell-cancer cell engagements, [Absent/Live] → [Exist/Live], and measured the probability of detachment without killing (P), i.e., [Exist/Live] → [Absent/Live]. Regardless of the microwell coating condition, the probability of detachment without killing was ~ 0.8 (Fig. 9B), indicating that only ~ 20% of NK cells engaged with cancer cells successfully killed them. Then, the NK cell-cancer cell interaction time (T3) was measured for NK cells that detached without killing (Fig. 9C). NK cells in the FN case exhibited a significantly lower T3 than NK cells in the BSA case, similar to the T2 result (Fig. 9Biii). These results indicate that FN-coated surfaces in general support fast NK cell detachment, which potentially enhances the surveillance capability of NK cells.
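Under the same state-sequence representation as the sketch above, the detachment probability P and interaction time T3 could be tallied roughly as follows (again a simplification; how repeated engagements within one microwell are handled may differ from the actual analysis).

```python
def detachment_without_kill_stats(wells, dt_min=1.0):
    """
    wells: list of per-microwell state sequences, as in killing_times().
    Returns (P, t3_list): the fraction of engagements ending in detachment
    without killing, and the corresponding interaction durations T3 (minutes).
    """
    engagements = 0
    t3_list = []
    for states in wells:
        t_on = None
        for i in range(1, len(states)):
            prev, curr = states[i - 1], states[i]
            if prev == ('Absent', 'Live') and curr == ('Exist', 'Live'):
                engagements += 1
                t_on = i                                   # engagement start
            elif t_on is not None and prev == ('Exist', 'Live') \
                    and curr == ('Absent', 'Live'):
                t3_list.append((i - t_on) * dt_min)        # detached without killing
                t_on = None
    P = len(t3_list) / engagements if engagements else float('nan')
    return P, t3_list
```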

Fig. 9

Additional analysis enabled by deep learning-based automation. A. Scheme of two different fates of cancer cells after NK cell-cancer cell interaction. B. Probability of NK cell detachment without killing (P). C. Duration of NK cell-cancer cell interaction without killing (T3). The statistical significance of the differences was evaluated using Student’s t-test. ns not significant, and ***p < 0.001

3 Conclusion

A deep learning-based automated analysis was devised to quantitatively assess NK cell-cancer cell interactions in single cancer cell arrays. The microwell-based single cancer cell arrays offered two unique benefits for the automated analysis of cell–cell interactions, which is otherwise technically challenging. First, physical confinement of cancer cells at well-defined locations allowed the generation of a large amount of data (over 50,000 images) for deep learning. Second, physical separation of cancer cells and NK cells into distinct optical focal planes simplified segmentation and classification.

To determine which deep learning model was most suitable for the data analysis, several models were evaluated. The simple 3L-CNN demonstrated superior performance compared to deeper models, such as VGG19, InceptionResnetV2, and DenseNet, which mostly produced fluctuating learning curves and were susceptible to overfitting. The results of the NK cell-cancer cell interaction analysis performed by the 3L-CNN model were consistent with those obtained manually. In addition, further analysis that would be difficult to accomplish manually could be conducted.

These results suggest that automated image analysis can be facilitated by designing experimental platforms that produce data suitable for deep learning. The method devised in this research would be beneficial for the advancement of immune cell therapy.

4 Materials and Methods

4.1 Data Formulation

As a basic data set, we used two paired images acquired nearly simultaneously at the same position: one is a cancer focal plane image (C-FPI) obtained by focusing on the cancer cells in the microwells, and the other is an NK focal plane image (NK-FPI) obtained by focusing on the NK cells on top of the microwells (Fig. 1A). The labels representing each microwell in the C-FPI and NK-FPI are denoted LC and LNK, respectively. As a microwell in a C-FPI is categorized into one of 4 distinct states (Live, Dying, Dead, and Empty) and a microwell in an NK-FPI is categorized into one of 2 states (Exist, Absent) (Fig. 2), we formulated the following one-hot encoding:

$$L_{C} = [c_{0}, c_{1}, c_{2}, c_{3}], \text{ and}$$

$$L_{NK} = [nk_{0}, nk_{1}],$$

where \(c_i\) and \(nk_i\) are binary variables (either 0 or 1) representing the state of the microwell in the C-FPI and NK-FPI, respectively.

4.2 Data Pre-processing and Labelling

The microwell array has a periodic circular structure, and the edges of the microwells were relatively clearly identifiable in the DIC images (Fig. 2). Therefore, the positions of the microwells were readily identified using the Prewitt method [33] and validated using another frame of the time-lapse images acquired at the same position. Based on the location of each microwell, 80 px × 80 px cropped images were extracted (Fig. 2ii and iii).
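A minimal sketch of this localization-and-cropping step, assuming scikit-image: a Prewitt edge map is computed as described, and a circular Hough transform (our assumption; the specific circle-finding step is not detailed here) picks the well centres before 80 px × 80 px patches are cropped from both focal planes. The well radius, threshold, and well count are illustrative.

```python
import numpy as np
from skimage import filters
from skimage.transform import hough_circle, hough_circle_peaks

def locate_and_crop_wells(c_fpi, nk_fpi, well_radius_px=35, crop=80, n_wells=50):
    """Find microwell centres in a C-FPI from their circular edges and crop
    matching patches from both focal-plane images (sketch with assumed parameters)."""
    edges = filters.prewitt(c_fpi)                          # Prewitt edge map
    radii = np.arange(well_radius_px - 3, well_radius_px + 4)
    hspaces = hough_circle(edges > edges.mean(), radii)     # circular Hough transform
    _, cx, cy, _ = hough_circle_peaks(hspaces, radii, total_num_peaks=n_wells,
                                      min_xdistance=crop, min_ydistance=crop)
    half = crop // 2
    h, w = c_fpi.shape
    crops = []
    for x, y in zip(cx, cy):
        if x < half or y < half or x > w - half or y > h - half:
            continue                                        # skip wells touching the border
        sl = (slice(int(y) - half, int(y) + half), slice(int(x) - half, int(x) + half))
        crops.append((c_fpi[sl], nk_fpi[sl]))
    return crops
```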

Each extracted 80 px × 80 px image was manually classified and labelled. For C-FPIs, 62,297 cropped images were labelled into four categories: Live (LC = [1, 0, 0, 0]), Dying (LC = [0, 1, 0, 0]), Dead (LC = [0, 0, 1, 0]), and Empty (LC = [0, 0, 0, 1]). For NK-FPIs, 50,962 cropped images were labelled into two categories: Exist (LNK = [1, 0]) and Absent (LNK = [0, 1]).
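A minimal sketch of this encoding (hypothetical helper; the class order follows the labels above):

```python
# One-hot label encoding matching the scheme above.
C_CLASSES = ['Live', 'Dying', 'Dead', 'Empty']   # L_C = [c0, c1, c2, c3]
NK_CLASSES = ['Exist', 'Absent']                 # L_NK = [nk0, nk1]

def one_hot(label, classes):
    """Return a one-hot list, e.g. one_hot('Dying', C_CLASSES) -> [0, 1, 0, 0]."""
    return [1 if label == c else 0 for c in classes]
```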

4.3 Computer Hardware/Software Configuration

Deep learning was conducted on a local computer with 32 GB of memory, an AMD64 Family 23 CPU, and an NVIDIA GeForce RTX 2080 Ti GPU with 11 GB of memory. The software configuration was as follows: Windows 10 operating system, Python 3.7.3, and TensorFlow 1.13.1. The code was run in the Jupyter Notebook integrated development environment.

4.4 Construction of Deep Learning Models

For deep learning-based image analysis, various models were compared to select the optimal one. We tested well-known off-the-shelf models pre-trained on ImageNet [34], such as VGG19 [26], InceptionResnetV2 [27], and DenseNet [28]. We also constructed models with reduced complexity, comprising 2 or 3 convolutional neural network (CNN) layers (2L-CNN and 3L-CNN in Fig. 4). All models were modified by adding a classification head consisting of three fully connected layers and a softmax function [35].

4.5 Training and Evaluation of Various Deep Learning Models

Training and evaluation of each deep learning model were conducted according to the scheme shown in Fig. 3. Two input types were used: 3-frame (three consecutive frames of the time-lapse images) and 1-frame (a single image frame). The labelled data was randomly split into a test set at a 10% ratio, the remaining data was randomly divided into training and validation sets at an 8:2 ratio, and a total of 32 epochs was run for each training. The loss and accuracy of each model were assessed at each epoch during training using the training and validation sets. The loss of each model was calculated using the categorical cross-entropy loss function [36] with L2 regularization [37]:

$$\text{Loss} = -\sum_{i} y_{i}\log p_{i} + \lambda \sum_{i} w_{i}^{2},$$

where \(y_i\) is the actual class label, \(p_i\) is the model's predicted probability for class i, \(\lambda\) is the regularization strength, and \(w_i\) denotes the weights of the model's dense layers. The value of \(\lambda\) was determined by evaluating model performance over a log-scale range of values (0.1, 0.01, and 0.001); 0.01 was selected as it typically yielded the best performance.
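A minimal sketch of this loss, written out explicitly for clarity (in practice the same penalty can be attached via kernel_regularizer on the dense layers, as in the model sketch above; the small epsilon for numerical stability is an added assumption):

```python
import numpy as np

def loss_with_l2(y_true, y_pred, dense_weights, lam=0.01, eps=1e-12):
    """Categorical cross-entropy plus an L2 penalty on the dense-layer weights,
    mirroring the equation above (lambda = 0.01 as selected in the text)."""
    ce = -np.sum(y_true * np.log(y_pred + eps), axis=-1)    # cross-entropy per sample
    l2 = lam * sum(np.sum(w ** 2) for w in dense_weights)   # L2 on dense-layer weights
    return np.mean(ce) + l2
```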

The accuracy of each model was calculated using the following equation:

$$\text{Accuracy} = \frac{TP+TN}{TP+TN+FP+FN},$$

where TP, TN, FP, and FN stand for True Positives, True Negatives, False Positives, and False Negatives, respectively.

In addition to the accuracy, the overall performance of the models was evaluated with recall, precision, and Cohen’s Kappa (\(\kappa\)) defined below:

$$\text{Recall}= \frac{TP}{TP+FN}$$
$$\text{Precision}= \frac{TP}{TP+FP}$$
$$\kappa = \frac{P_{o}-P_{e}}{1-P_{e}},$$

where Po is the agreement observed in the actual data and Pe is the agreement expected by chance. In other words, Po represents the model's actual accuracy, and Pe represents the accuracy expected by chance given the class frequencies of the true and predicted labels. A \(\kappa\) value close to 1 indicates excellent model performance, while a value close to 0 indicates performance equivalent to random chance. For the C-FPI, which has four categories, accuracy, recall, precision, and \(\kappa\) were calculated by averaging the values obtained for each category.

Lastly, the performance of each model was visualized using a confusion matrix computed on the test set. A confusion matrix is a table showing how the model classified each class, where rows represent the actual classes and columns represent the predicted classes; the diagonal values represent the probability of each class being correctly classified.
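A minimal sketch of these evaluation metrics and the row-normalized confusion matrix, assuming scikit-learn and the test split from the earlier sketches (macro averaging over categories, as described above):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, recall_score, precision_score,
                             cohen_kappa_score, confusion_matrix)

y_true = np.argmax(y_test, axis=1)                 # integer labels from one-hot vectors
y_pred = np.argmax(model.predict(X_test), axis=1)  # predicted classes

accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, average='macro')        # averaged over categories
precision = precision_score(y_true, y_pred, average='macro')
kappa = cohen_kappa_score(y_true, y_pred)

# Row-normalized confusion matrix: rows = actual classes, columns = predicted classes,
# so the diagonal gives the probability of each class being correctly classified.
cm = confusion_matrix(y_true, y_pred).astype(float)
cm /= cm.sum(axis=1, keepdims=True)
```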