1 Introduction

Fluorescence lifetime characterizes the decay of a fluorophore from the excited state to the ground state; it is independent of fluorophore concentration but sensitive to the biological environment [1]. Fluorescence lifetime imaging endo/microscopy (FLIM) exploits the lifetime contrast between healthy and unhealthy biological tissue to distinguish them effectively. Owing to this independence, lifetime images are typically more homogeneous than intensity images, which show higher contrast, and this introduces challenges for visual recognition. For example, when imaging the same physical point on tissue with different hardware configurations, lifetime images are usually visually indistinguishable, unlike intensity images. Jo et al. [2] reported that oral cancer has a shorter lifetime, whereas McGinty et al. [3] revealed that tumours have a longer lifetime. In addition, other hardware factors, such as excitation bandwidth (wavelength) and exposure time, also affect lifetime derivation [4]. We similarly observed that, as wavelength increases, the contrast in lifetime values between a pair of normal/cancerous tissues becomes so small that, although a classification boundary exists, it cannot be deduced a priori which tissue state has the lower lifetime.

Machine Learning (ML), particularly Deep Learning (DL), has revolutionized biomedical image processing in many respects, such as classification and segmentation [5]. However, little attention has been paid to applying ML/DL to FLIM images for, for example, the automatic discrimination of cancer. Existing approaches in this area usually employ conventional ML algorithms with handcrafted features, which inevitably requires expert knowledge of feature engineering. For instance, Chen et al. [6] applied support vector machines (SVM) for skin lesion detection with artificial features derived from lifetime decay fitting parameters. For DL, the situation is even worse: there is very limited research concerning DL in FLIM-based cancer differentiation, apart from a few preliminary studies [7]. Unlike conventional ML, which can perform well on small-scale data, DL usually requires large-scale datasets for effective learning without over-fitting. Unfortunately, no such dataset is publicly available in this domain, which severely hinders the development and application of DL for FLIM.

Contemporary architectures, such as ResNet [8] and DenseNet [9], have significantly advanced the state of the art in classification performance. A common practice in those models is the modularization of convolution blocks, particularly the use of bottleneck blocks. These relatively fixed patterns allow the models to be easily extended with more sophisticated blocks, and multi-scale architectures are prevalent among such extensions. The primary disadvantage of ResNet is that it produces many redundant features while struggling to create new ones [9, 10]. Different strategies can be applied to avoid this effect. For example, DenseNet [9] employs very narrow networks to reduce redundancy and dense aggregation to create new features, while Res2Net [11] splits redundant features and applies a hierarchical multi-scale module to create new features from the separated groups. Due to the split, however, Res2Net cannot retain the correlations among input features as global information, since the grouped features are handled separately.

Here, we introduce our unique dataset of over 100,000 FLIM images from 18 pairs of normal/cancerous tissues of 18 patients. The images were collected with a custom FLIM system [12, 33] designed for online in-vivo in-situ lung disease diagnostics, under various user-specified configurations. The dataset consists of multi-dimensional images rich in spatial and spectral information, which reflect the diversity of fluorescence lifetime to a large extent. We then describe the image post-processing procedure, which applies the intensity image as a soft weight to the corresponding lifetime image. This mitigates the independence of lifetime from intensity and increases the classification performance of classic CNNs. To further improve discrimination, and to address the broad spread of correlated pixels with similar lifetime values in lifetime images, we propose a hierarchically aggregated multi-scale architecture at the layer level, namely ResNetZ. We integrated the model into ResNet and evaluated its performance in three respects: the impact of a shortcut connection and different aggregations; the overall performance against state-of-the-art CNNs and ResNet variants; and a comparison between Res2Net and ResNetZ. Accuracy, precision, recall, the area under the receiver operating characteristic (ROC) curve (AUC), and Cohen’s Kappa [13] were used as metrics.

The rest of the paper is organized as follows. Section 2 reviews the related work in FLIM for cancer classification and multi-scale architectures. Section 3 introduces the technical details of our method. Experimental results are presented in Sect. 4 and discussed in Sect. 5, followed by the conclusion and future work in Sect. 6.

2 Related work

2.1 FLIM in cancer classification

As shown in Fig. 1, a common practice is to derive the averaged lifetime by histogramming and to discriminate cancer based on the lifetime difference, with the assistance of histological images. Here, the cancerous tissue has an average lifetime of 1.48 ns, while the non-cancerous sample has an average lifetime of 1.9 ns. Little effort has been devoted to the automatic classification of cancer from FLIM images using ML algorithms. Gu et al. [14] utilized feed-forward neural network-based extreme learning for the diagnosis of early cervical cancer using FLIM on H&E-stained samples, with expert-engineered features. Cuenca et al. [15] and Jo et al. [2] applied a quadratic discriminant analysis binary classifier to distinguish oral cancer and dysplasia, with six handcrafted features extracted from FLIM images. In [6], Chen et al. deployed an SVM model to distinguish non-melanoma skin lesions, where features were engineered from lifetime reconstruction. Marsden et al. [16] applied ML technologies to intraoperative cancer margin assessment with FLIM, where a dual-path architecture retrieved information at different scales to predict the point-wise probability of cancer. Nonetheless, all those works investigated conventional ML methods with engineered features, and none of them concerns lung cancer classification using DL.

Fig. 1 Lifetime contrast of non-cancerous (row 1) and cancerous (row 2) lung tissue using histograms (column 3) of lifetime images (column 2) obtained from intensity images (column 1), along with histological images (column 4) as the ground truth [7]

Considerable effort has been made by the authors to investigate the automatic classification of ex-vivo lung cancer from FLIM images. In [17], we applied four popular ML methods to FLIM images for ex-vivo lung cancer classification, namely K-nearest neighbour, SVM, neural network, and random forest. A significant difference between our approach and existing ones is that we used the pixel values of lifetime images as features, instead of artificial ones. Later, we investigated the feasibility of traditional CNNs for the same classification problem [7]. With five classic CNNs, i.e. ResNet, ResNeXt, DenseNet, Inception, and Xception, the results were dramatically better than the ML-based solutions. We also showed that integrating lifetime with intensity information achieves better results than using lifetime images alone. We further expanded the study by embedding dilated convolutions into multi-scale architectures [18, 19]. Compared with those previous studies, the present one does not use dilated convolutions since, empirically, they contribute little to performance on our dataset. In addition, this study thoroughly compares different configurations to understand their impact on the results. We also introduce the optimal image post-processing to maximise the performance improvement.

2.2 Multi-scale architectures

Multi-scale architectures have become very popular in contemporary CNNs and are usually realized by employing a number of single or composite operations in parallel at different levels. Typical examples include a multi-path CNN for brain image segmentation [20], Inception, which uses several parallel convolution branches at the layer level [21], and Res2Net, with a hierarchical feature-level multi-scale module. One reason for their success is their ability to extract features at different scales simultaneously and later integrate the multi-scale features, so that more information passes through the backbone networks.

Architecture-level multi-scale strategies are usually developed to deal with multiple inputs or for special purposes. Setio et al. [22] proposed a multi-view model to reduce false positives in pulmonary nodule detection. Moeskops et al. applied a multi-path architecture to magnetic resonance brain image segmentation [20]. Despite their success, the major problem of architecture-level multi-scale models is that the underlying ideas are usually problem-specific and, hence, difficult to migrate to other architectures.

Layer-level models treat the features extracted after each layer as a whole and utilize more sophisticated operations to process the information. The operations can be simple operators, such as the multiple parallel convolutions in Inception [21, 23], or a set of complicated ones, e.g. multiple dilated convolutions in densely connected blocks [24]. In addition, they can replace some or all of the original operations. For example, DeepLab [25] used atrous spatial pyramid pooling with several parallel dilated convolutions for better semantic segmentation. Thanks to the modularization, layer-level designs are usually easy to integrate into other backbone networks with similar architectures, such as the Inception-like convolution blocks in DRINET [26]. With parallel branches, more features differing in space, scale, and context can be retrieved. An apparent disadvantage, however, is the increased complexity caused by the parallel operations.

Feature-level multi-scale designs are normally characterized by splitting the input features into groups, processing the grouped features individually, and fusing the processed features. The operations can be performed by group convolutions [27], depthwise separable convolution and its variations [28, 29], or pointwise group convolutions [30]. More sophisticated operations can also be integrated into feature-level architectures. For example, Res2Net introduced hierarchical aggregation into the processing, and ResNeSt [31] introduced split-attention into group convolutions. Besides the general advantages of multi-scale models, feature-level architectures significantly decrease complexity compared with conventional convolutions. However, due to the separation of the input features, the correlations among the features are partially ignored. Our inspiration comes from the hierarchical style reported in [11] and [32], except that we incorporate our model at the layer level instead of the feature level, to retain the correlations among features, which can be further reused and aggregated. In addition, our ResNetZ architecture introduces other possible configurations, such as Concatenation aggregation rather than the Addition applied in [11] and [32]. This is explored further in Sect. 4.1.

3 Methodology

Fig. 2 Schematic diagram of the proposed method, adapted from [7]. Raw FLIM images were collected on ex-vivo lung tissue fixed on a corkboard (Step 1). The raw images were then post-processed to obtain FLIM images suitable for classification (Step 2). Finally, all processed images were fed into CNN models for binary classification (Step 3)

The FLIM dataset was gathered by continuous imaging of ex-vivo human lung tissue using our custom-built FLIM system. Raw FLIM images are too noisy for reliable visual recognition, and thus post-processing is required before they are fed into the CNNs for classification. It is worth noting that our intention is to present the best practice we have learned so far on the FLIM dataset, for reproducible research. The overall procedure is depicted in Fig. 2, and the technical details are addressed in depth in the following sections.

3.1 Data collection

A custom fibre-based FLIM system was deployed to acquire data with various user-specified configurations, including different exposure times and two spectral bands [33]. For online imaging and diagnostic purposes, our custom FLIM imaging system recorded sequences of lifetime images with a resolution of \(128\times 128\) pixels at a frame rate of 9 frames per second, aggregated across a line sensor of single-photon detectors [12]. Each frame contains four images, yielding an intensity and the corresponding lifetime image for each of the two customizable spectral bands. Lifetime values can be reconstructed by different algorithms, such as the Rapid Lifetime Determination (RLD) method [34]. Data Collection in Fig. 2 depicts an example of the experimental workflow, where a lung tissue sample was fixed on a corkboard and the \(128\times 128\) images of autofluorescence intensity and lifetime were reconstructed with an exposure time of 6 µs, a spectral band of 498–570 nm, and the RLD decay fitting approach. These settings were chosen to represent the intended characteristic conditions of future clinical trials.

Table 1 The effective number of images collected from 18 patients, including measured points on cancerous / non-cancerous tissue, and collected frames from each patient

For each ex-vivo experiment, a pair of cancerous/non-cancerous tissue samples from each patient was scanned using direct contact between the fibre and the tissue, and multiple measurements were taken at different physical points on each sample to enrich the variety of images of the same tissue. Over 100,000 raw FLIM images were collected from 18 pairs of normal/cancerous lung tissues. For this study, we removed images that may introduce extra variance. For example, we excluded images whose lifetime was not reconstructed by RLD, since their lifetime values differ significantly. After cleaning, 61,816 FLIM images remained, comprising 25,372 from cancerous tissue and 36,444 from normal tissue. Detailed information on the remaining images is listed in Table 1. Note that each frame contains an intensity image and its corresponding lifetime image.

3.2 Image post-processing

The raw images collected were very noisy (see the grayscale images at the right of Data Collection in Fig. 2), so post-processing is needed to make them suitable for human and machine perception. The overall procedure is depicted in Fig. 2 (Image Post-processing). One criterion for achieving reasonable post-processing results is to ensure that the histograms derived from the averaged lifetime remain unchanged while the images become visually plausible, as shown in Fig. 1.

Given a relatively short exposure time, e.g. 20 µs, the total number of recorded photons per pixel is usually on the order of \(100\)–\(1000\), or even lower, where each pixel reading is easily affected by photon quantum noise [35]. For the signals to be recorded and processed efficiently, an adequate signal-to-noise ratio (SNR) of the fluorescence intensity measurement is required. In this study, we use a threshold value of \(\sqrt{\hat{N}}\) to approximate the SNR [35], where \(\hat{N}\) is the mean of the measured fluorescence intensity. We therefore assume that pixels whose intensity is larger than \(\sqrt{\hat{N}}\) are essential for lifetime derivation with acceptable accuracy.

Let \(I^I = \{i^I_{x, y} \mid i^I_{x, y} \ge 0 \;\text {and}\; x, y \in [0, M]\}\) denote an intensity image of size \(M \times M\), and \(I^L = \{i^L_{x, y} \mid i^L_{x, y} \ge 0 \;\text {and}\; x, y \in [0, M]\}\) the corresponding lifetime image. The denoising approach can be defined as [7]:

$$\begin{aligned} \hat{i}^{I}_{x, y} = {\left\{ \begin{array}{ll} 0 & i^{I}_{x, y} \le \sqrt{\hat{N}} \\ i^{I}_{x, y} & \text {otherwise} \end{array}\right. } \end{aligned}$$
(1a)
$$\begin{aligned} \hat{i}^{L}_{x, y} = {\left\{ \begin{array}{ll} 0 & i^{I}_{x, y} \le \sqrt{\hat{N}} \\ i^{L}_{x, y} & \text {otherwise} \end{array}\right. } \end{aligned}$$
(1b)
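
As a concrete illustration, the thresholding of Eqs. (1a) and (1b) amounts to a few lines of NumPy. The sketch below is ours rather than the released pipeline, and reading \(\hat{N}\) as the mean of the measured intensity image is our assumption:

```python
import numpy as np

def denoise(intensity: np.ndarray, lifetime: np.ndarray):
    """Eqs. (1a)/(1b): zero out pixels whose intensity falls below the
    sqrt(N-hat) SNR threshold, and apply the same mask to the lifetime
    image, whose values cannot be derived reliably at those pixels."""
    threshold = np.sqrt(intensity.mean())  # N-hat read as the mean intensity (assumption)
    mask = intensity > threshold
    return intensity * mask, lifetime * mask
```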

Next, the intensity images are normalized with dark background D and lightfield images L, adapted from [36]:

$$\begin{aligned} \bar{I}^{I} = \frac{G * (\hat{I}^I - D)}{G * (L - D)} \end{aligned}$$
(2)

where \(*\) denotes the 2D convolution operator, \(\hat{I}^I\) is the denoised intensity image, and G is a Gaussian smoothing filter with a \(3\times 3\) kernel, defined in [37] as:

$$\begin{aligned} [G]_{i,j} = \frac{1}{2\pi \mu ^{2}}\, e^{-\frac{i^{2}+j^{2}}{2\mu ^{2}}} \end{aligned}$$
(3)

where i and j are the distances from the origin along the horizontal and vertical axes, respectively, and \(\mu\) is the standard deviation of the distribution. Note that, since the corresponding dark background D and lightfield L images are not always available, we simply apply the 2D Gaussian smoothing filter G to the intensity images in such cases. The post-processed intensity image \(\bar{I}^I\) is thereby derived.
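
A minimal sketch of Eqs. (2) and (3), including the fallback just described, could look as follows. SciPy's `gaussian_filter` stands in for the \(3\times 3\) kernel G (its `sigma` argument plays the role of \(\mu\)), and the epsilon guarding the division is our addition:

```python
from scipy.ndimage import gaussian_filter

def normalize_intensity(intensity, dark=None, light=None, sigma=1.0):
    """Flat-field normalization of Eq. (2); falls back to plain Gaussian
    smoothing when the dark (D) and lightfield (L) frames are unavailable."""
    if dark is not None and light is not None:
        numerator = gaussian_filter(intensity - dark, sigma)
        denominator = gaussian_filter(light - dark, sigma)
        return numerator / (denominator + 1e-8)  # guard against division by zero
    return gaussian_filter(intensity, sigma)
```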

Afterwards, the normalized intensity image is binarized to yield a binary mask applied to the denoised lifetime image. Finally, a histogram-based contrast-enhancing algorithm from [37] is utilized to further improve the visual effect of the lifetime image, and the post-processed lifetime image is obtained.

In [7], we showed that combining intensity and lifetime information achieves better performance than using false-colour lifetime images alone for CNN-based cancer classification. In this study, we use intensity-weighted lifetime images as the output of the post-processing and as the input to the proposed model. As our pipeline evolved, we observed that feeding intensity-weighted lifetime images allowed the CNNs to obtain even better scores than stacked intensity/lifetime images. A comparison of the two formats on the classic CNNs can be found in the “Appendix”.
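
A sketch of the final weighting step, under our reading of “intensity-weighted” as a per-pixel soft weight derived from the normalized intensity (the released pipeline may differ in detail):

```python
def intensity_weighted_lifetime(intensity, lifetime):
    """Rescale the normalized intensity to [0, 1] and use it as a soft
    per-pixel weight on the denoised, contrast-enhanced lifetime image."""
    weight = intensity / (intensity.max() + 1e-8)
    return lifetime * weight
```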

3.3 Layer-level multi-scale architecture

As discussed in Sect. 2, multi-scale architectures can be implemented at the layer and feature levels. The primary concern with the feature-level approach is that the correlations among the input features are partially ignored. Since both layer- and feature-level multi-scale models can be easily and, in most cases, seamlessly integrated into networks with residual or similar blocks, we perform the replacement at the layer level so that these correlations are retained. To benefit from the advantages of a layer-level multi-scale architecture and hierarchical aggregation, we propose a new layer-level multi-scale convolution architecture, called ResNetZ.

ResNetZ (Fig. 3b) and Res2Net (Fig. 3c) are visually similar since our ResNetZ is inspired by the Res2Net hierarchical aggregation. However, they are conceptually distinct: one major difference is that our ResNetZ performs multi-scale feature extraction on the input features as a whole to retain the correlations among the input features (Fig. 3b), whereas Res2Net splits input features into groups and performs multi-scale feature extraction per grouped features (Fig. 3c).

Fig. 3 (a) The original residual block in ResNet, (b) the proposed ResNetZ module, where A is an aggregation operator, and (c) the Res2Net module as a feature-level multi-scale example. Both the ResNetZ and Res2Net blocks contain a shortcut connection (the leftmost blue dashed line)

3.3.1 Block-wise shortcut connections

Comparing [11] and [32], an apparent difference, besides the use of dilated convolutions, is the shortcut connection used in [11]. Unlike the identity mapping in ResNet, which serves a better flow of information through residual blocks, the shortcut connections in Res2Net (Fig. 3c) are located within the computational block. The advantage is that this aids information and gradient flow; however, it also introduces extra complexity. For ResNetZ, the increased complexity stems from more feature maps being concatenated as input to the output 1\(\times\)1 convolution. Moreover, Res2Net also needs more feature maps extracted from the input 1\(\times\)1 convolution because of the splitting performed before the hierarchical aggregation.

3.3.2 Hierarchical aggregation

Another configurable hyperparameter is the aggregation of the global and local features before the 3\(\times\)3 convolution. Both [11] and [32] employed the ResNet-like Addition operation, which spatially integrates features without increasing complexity. A major difference of DenseNet from ResNet is that it replaces Addition with Concatenation, which increases the variation in the input of successive layers and improves efficiency. While Addition requires input features of identical dimensions, Concatenation is flexible in dimensional terms. The main disadvantage of Concatenation, however, is the extra parameters it introduces. Inspired by DenseNet, we consider Concatenation a viable alternative to Addition. As a result, combining the shortcut option with the aggregation style yields four possible configurations.
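
The practical difference between the two operators is easy to see in PyTorch: Addition preserves the channel count, whereas Concatenation doubles it, and hence doubles the input width of the subsequent \(3\times 3\) convolution:

```python
import torch

x_g = torch.randn(1, 16, 32, 32)         # global features from the input 1x1 convolution
y_prev = torch.randn(1, 16, 32, 32)      # output of the previous branch

a_add = x_g + y_prev                     # Addition: still (1, 16, 32, 32)
a_cat = torch.cat([x_g, y_prev], dim=1)  # Concatenation: (1, 32, 32, 32)
```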

3.3.3 ResNetZ block definition

Let \(x_g\) denote the global features from the first 1\(\times\)1 convolution as the input of the ResNetZ block, \(y_i\) denote the output features extracted from the branch 3\(\times\)3 convolution for \(i \in \{1, 2, ..., n\}\), \(\varvec{A}\) be the aggregation, and \(\varvec{\varGamma }\) be a composite operation consisting of a 3\(\times\)3 convolution, batch normalization [23], and a rectified linear unit [38]. Let \(y_g\) be the output of the ResNetZ block. Accordingly, \(y_i\) and \(y_g\) without the shortcut are governed by:

$$\begin{aligned} y_{i} = {\left\{ \begin{array}{ll} \varvec{\varGamma }(x_g) & i = 1 \\ \varvec{\varGamma }(\varvec{A}(x_g, y_{i-1})) & 1 < i \leqslant n \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} y_g = \varvec{C}([y_1, y_2, \ldots , y_n]) \end{aligned}$$
(5)

where \(\varvec{C}\) is a concatenation operator. Let \(y_i'\) and \(y_g'\) be the model with the shortcut, which can be defined as:

$$\begin{aligned} y_i' = {\left\{ \begin{array}{ll} x_g & i = 1 \\ \varvec{\varGamma }(x_g) & i = 2 \\ \varvec{\varGamma }(\varvec{A}(x_g, y_{i-1}')) & 2 < i \leqslant n \end{array}\right. } \end{aligned}$$
(6)
$$\begin{aligned} y_{g}' = \varvec{C}([y_1', y_2', \ldots , y_n']) \end{aligned}$$
(7)

Since Res2Net splits the features \(x_g\) (Fig. 3c), its shortcut only passes partial information to the output. In contrast, with the shortcut, the ResNetZ block passes the whole set of features \(x_g\) to the output and thus enhances the information flow in forward and backward propagation within the block. The aggregation \(\varvec{A}\) can be further specialized into \(\varvec{A^a}\) and \(\varvec{A^c}\) for the Addition and Concatenation operators, respectively. Since Addition is pixel-wise, \(\varvec{A^a}\) implicitly conveys local (\(y_i\) or \(y_i'\)) information to the subsequent branches, whereas with \(\varvec{A^c}\), local information is explicitly carried to the remaining branches. In addition, by concatenating \(x_g\) and \(y_{i-1}\)/\(y_{i-1}'\) from different receptive fields, more information is expected to be integrated and retrieved at the output of the ResNetZ block. Consequently, the sequentially integrated outputs \(y_g\) and \(y_g'\) contain features rich in spatial and contextual information.
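
To make the definitions concrete, the following PyTorch sketch implements Eqs. (4)–(7) under our reading of the block, where the shortcut concatenates \(x_g\) itself into the output (consistent with the complexity discussion in Sect. 3.3.4). The enclosing ResNet identity connection, striding, and downsampling are omitted, and all names are ours rather than those of the released implementation:

```python
import torch
import torch.nn as nn

class ResNetZBlock(nn.Module):
    """Sketch of a ResNetZ block. `width` (W) is the channel count per branch,
    `scale` (S) the number of parallel 3x3 branches; `aggregation` selects
    A^a ('add') or A^c ('cat'), and `shortcut` concatenates x_g into the output."""

    def __init__(self, in_channels, out_channels, width=8, scale=4,
                 aggregation='add', shortcut=False):
        super().__init__()
        self.aggregation, self.shortcut = aggregation, shortcut
        self.conv_in = nn.Sequential(                 # input 1x1 conv producing x_g
            nn.Conv2d(in_channels, width, 1, bias=False),
            nn.BatchNorm2d(width), nn.ReLU(inplace=True))
        # Gamma = 3x3 conv + BN + ReLU; 'cat' doubles the input width of
        # every branch after the first
        branch_in = width if aggregation == 'add' else 2 * width
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(width if i == 0 else branch_in, width, 3,
                          padding=1, bias=False),
                nn.BatchNorm2d(width), nn.ReLU(inplace=True))
            for i in range(scale)])
        concat_width = width * (scale + 1 if shortcut else scale)
        self.conv_out = nn.Sequential(                # output 1x1 conv
            nn.Conv2d(concat_width, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels))

    def forward(self, x):
        x_g = self.conv_in(x)                         # global features
        ys, prev = ([x_g] if self.shortcut else []), None
        for branch in self.branches:
            if prev is None:
                z = x_g                               # first branch: Gamma(x_g)
            elif self.aggregation == 'add':
                z = x_g + prev                        # A^a: pixel-wise addition
            else:
                z = torch.cat([x_g, prev], dim=1)     # A^c: channel concatenation
            prev = branch(z)
            ys.append(prev)
        return self.conv_out(torch.cat(ys, dim=1))    # y_g = C([...])
```

For instance, `ResNetZBlock(64, 256, width=8, scale=4, aggregation='cat')` corresponds to a W8-S4 block with Concatenation and no shortcut.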

3.3.4 ResNetZ complexity

When the number of parallel 3\(\times\)3 convolutions is fixed, given the same backbone network, the aggregation style (the A symbol in Fig. 3b) and the shortcut connection (blue dashed line in Fig. 3b) also affect the complexity and performance of the model. The shortcut connection introduces more features to be concatenated at the output of the ResNetZ block. When several 3\(\times\)3 convolutions are paralleled with aggregation, the receptive field grows with each successive branch, owing to the features inherited from the previous branch. In addition, compared with Addition, Concatenation doubles the features fed into each 3\(\times\)3 convolution. Taking the shortcut and aggregation styles into account, when the width and scale are small, the model complexity is ordered as \(C_\mathrm{{add}}< C_\mathrm{{concat}}< C_\mathrm{{add+shortcut}} < C_\mathrm{{concat+shortcut}}\). When they become larger, the ordering changes to \(C_\mathrm{{add}}< C_\mathrm{{add+shortcut}}< C_\mathrm{{concat}} < C_\mathrm{{concat+shortcut}}\).
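
These orderings can be sanity-checked by counting the parameters of the block sketch from Sect. 3.3.3, for example:

```python
def n_params(module):
    return sum(p.numel() for p in module.parameters())

for aggregation in ('add', 'cat'):
    for shortcut in (False, True):
        block = ResNetZBlock(64, 256, width=8, scale=2,
                             aggregation=aggregation, shortcut=shortcut)
        print(aggregation, shortcut, n_params(block))
# With W=8 and S=2 this reproduces the first ordering; rerunning with a larger
# width and scale (e.g. W=26, S=6) flips it to the second.
```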

3.4 Implementation details

All models were implemented in PyTorch. For the existing CNNs, we used the official implementations included in PyTorch or published by their authors. For ResNetZ, we integrated the layer-level multi-scale architecture into the PyTorch implementation of ResNet by substituting the original residual block (Fig. 3a) with the ResNetZ block (Fig. 3b). Unlike ResNeXt and Res2Net, which keep the width of the ResNet backbone, we used a narrower version of ResNet, so that the width of the input 1\(\times\)1 convolution matches that of the branch 3\(\times\)3 convolutions, thereby reducing the overall complexity of our ResNetZ model.

To ensure a fair comparison, we adapted the authors’ official implementation of Res2Net to a version similar to our proposal, adjusting the width of the backbone ResNet so that the scale and width of ResNetZ and Res2Net are equal. In addition, ResNetZ followed the same configuration as Res2Net, using Addition as the aggregation with a shortcut connection.

All models were evaluated on the 61,816 FLIM images from 18 patients described in Sect. 3.1. Images from 17 patients (patients 1–17 in Table 1) were used as the training set, of which 10% were split off as the validation set. The images from the remaining patient (patient 18) served as the independent testing set, which contains 840 cancerous and 888 normal images. For all evaluated CNNs, we used stochastic gradient descent with momentum 0.9. The learning rate was initially set to 0.1 and divided by 10 at epochs 50, 100, 150, and 175, for 200 epochs in total, with batch size 128. We used binary cross-entropy as the loss function. Weights were initialized using He’s method [39], and we employed a weight decay of \(10^{-4}\). For data augmentation, we utilized a simple strategy of vertical and horizontal flipping, as well as random cropping with zero-value padding of 16 pixels. All training and testing were performed on NVIDIA V100 GPUs provided by JADE.
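
For reference, the training recipe above can be sketched in PyTorch as follows; the exact scripts may differ in detail, and the ResNet50 stand-in is ours:

```python
import torch
import torchvision
from torchvision import transforms

model = torchvision.models.resnet50(num_classes=1)  # stand-in for any evaluated CNN

# He initialization for the convolutions [39]
for m in model.modules():
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')

optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[50, 100, 150, 175], gamma=0.1)  # divide lr by 10
criterion = torch.nn.BCEWithLogitsLoss()  # binary cross-entropy on the raw logit

train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomVerticalFlip(),
    transforms.RandomCrop(128, padding=16),  # zero-value padding of 16 pixels
    transforms.ToTensor(),
])
```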

4 Results

To fully quantify the performance of the proposed layer-level multi-scale architecture, we first evaluate the influence of the aggregation style and the shortcut connection. We then fix these two parameters and evaluate the impact of integrating the model into different ResNet backbones. Finally, we compare our layer-level multi-scale model with a feature-level one (Res2Net) to understand how the level of the multi-scale operation affects the results.
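
All five metrics reported below can be computed with scikit-learn; a minimal sketch, assuming `y_true` holds the binary labels and `y_prob` the predicted probabilities of the cancerous class as NumPy arrays:

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             roc_auc_score, cohen_kappa_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    return {
        'accuracy':  accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred),
        'recall':    recall_score(y_true, y_pred),
        'auc':       roc_auc_score(y_true, y_prob),  # AUC uses the probabilities
        'kappa':     cohen_kappa_score(y_true, y_pred),
    }
```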

Fig. 4 Impact of different ResNetZ configurations on the results

4.1 ResNetZ configurations

According to Fig. 3, multi-scale architectures allow several different configurations. We tested the proposed model on ResNet50 as the backbone with width 8, in parallel with scales 2, 4, and 6. Following the naming convention in [11], we use S for scale and W for width. For ResNetZ50-W8-S2, the complexity follows the first ordering given in Sect. 3.3.4, whereas the other two (ResNetZ50-W8-S4 and ResNetZ50-W8-S6) follow the second. To simplify the presentation, we append A, AS, C, and CS to the model names to denote Addition, Addition with Shortcut, Concatenation, and Concatenation with Shortcut, respectively.

The scores, depicted in Fig. 4, are grouped by accuracy, precision, recall, AUC, and Kappa. All metrics except recall show a very similar tendency across configurations and scales. For ResNetZ50-W8-S2, the shortcut is very helpful for Addition aggregation, with a 4% gain, but has little effect on Concatenation. Concatenation achieves higher scores than Addition, except on recall, regardless of the shortcut. When the scale increases from 2 to 4, i.e. ResNetZ50-W8-S4, the shortcut still leads to performance improvement, especially for Addition. In contrast, Concatenation is no longer always better than Addition: Addition with the shortcut (ResNetZ50-W8-S4-AS) achieves results very similar to Concatenation with or without the shortcut on accuracy, AUC, and Kappa. Further increasing the scale to 6, the shortcut still improves Addition but considerably degrades Concatenation, which falls even below Addition alone. ResNetZ50-W8-S6-AS and ResNetZ50-W8-S6-C produce comparable results on accuracy, AUC, and Kappa, where the discrepancy is less than 1%.

In general, the shortcut connection introduces performance gains in accuracy, precision, AUC, and Kappa. The gain is usually more pronounced for Addition than for Concatenation, mainly when the scale is small. An exception occurs when the model is relatively complex, where the extra shortcut connection does not improve Concatenation: ResNetZ50-W8-S6-CS is inferior to ResNetZ50-W8-S6-C on all metrics except recall. Regarding the aggregation styles, Concatenation is usually superior to Addition, with or without the shortcut, except for ResNetZ50-W8-S6 with the shortcut. This is not unexpected, since Concatenation feeds twice as many features as Addition into each convolution branch. In summary, Concatenation without the shortcut is overall superior to the other three configurations on almost all metrics across the three scales.

4.2 Overall performance

Based on the results of Sect. 4.1, we evaluate the proposed model with Concatenation as the aggregation and without the shortcut connection. We first evaluate six state-of-the-art CNNs, namely ResNet50, DenseNet121, Inception, Xception, SENet, and Res2Net. It is worth noting that, as prior experience suggests, the classification of FLIM images may not benefit from very complex CNNs. We therefore also appraise three shallow ResNet variants: ResNet38, and ResNet50 with two different widths. Further, we use these three variants as backbone networks and integrate the ResNetZ block into each backbone, but with a smaller width. The results are listed in Table 2, and the ROC curves of ResNetZ, along with those of two backbone ResNets, are shown in Fig. 5.

Table 2 Performance comparison of the backbone ResNet, ResNetZ, and three variations of ResNet family
Fig. 5 ROC curves of ResNetZ with the backbone ResNets, where dotted lines are the backbone ResNets and solid lines are ResNetZ

Amongst the contemporary CNNs, Res2Net50 achieves the best overall scores, while DenseNet121 is inferior to ResNet and its variants; its performance might improve with deeper configurations. For the backbone ResNets, as the depth grows from 38 to 50 and the width from 32 to 64, the outcomes increase consistently, except for ResNet50-W64. Considering that these are relatively shallow networks and that the FLIM dataset is relatively small, these outcomes are anticipated. Note, however, that since ResNet50-W32 yields better scores than ResNet50-W64, other state-of-the-art CNNs with smaller widths might also perform better.

In general, ResNetZ surpasses the backbone ResNet at all three depths, but with significantly fewer parameters, since ResNetZ employs parallel 3\(\times\)3 convolutions with a narrow width. Specifically, ResNetZ38-W16-S2 achieves the best overall scores with fewer than 3.5M parameters. At depth 38, all ResNetZ variants are superior to ResNet38 while requiring considerably fewer parameters. When ResNet is further deepened to 50 layers, ResNet50-W32 yields the best scores in accuracy, AUC, and Kappa; however, ResNetZ50-W12-S4 produces very comparable results.

Considering Fig. 4 and Sect. 4.1, with a relatively simple ResNet, the performance of ResNetZ improves with increasing depth, width, and scale. Thanks to the parallel 3\(\times\)3 convolutions, which concatenate more features, the model produces better scores than the backbone with considerably fewer parameters. Although there are exceptions for ResNet50, the decline is understandable: the model already achieved its best outcomes with ResNet38 and, hence, the scores of more complex ResNetZ variants may drop.

4.3 Multi-scale at layer and feature levels

ResNetZ extracts and fuses features at the layer level, whereas Res2Net performs similar operations at the feature level. We conducted further experiments to compare the two models. To make the comparison fair, we follow the architecture of Res2Net, i.e. using Addition for the aggregation with the shortcut, and apply the same width and scale to both models. The results are shown in Fig. 6. In general, ResNetZ yields promising scores, slightly better than Res2Net but with significantly fewer parameters. As shown in Fig. 3, given the same width and scale, Res2Net requires a much wider input 1\(\times\)1 convolution than ResNetZ to maintain the branch width. When the width and scale are relatively large, the difference in complexity becomes significant; given the same configurations and backbone networks, the largest Res2Net model contains more parameters than the corresponding ResNetZ. Regarding accuracy, both architectures improve almost consistently as the number of parameters grows, followed by a decline after reaching the peak (Fig. 6, plot 1). For four of the seven variants, ResNetZ produces higher scores than Res2Net, including the highest accuracy, 88.83% compared with 88.19%. The same tendency is found for AUC and Kappa. For precision (Fig. 6, plot 2), Res2Net is marginally better than ResNetZ, but their best scores are very close, 90.83% and 91% for Res2Net and ResNetZ, respectively. As for recall (Fig. 6, plot 3), both models obtain comparable results; all scores of our model are over 80%, whereas Res2Net produces both the best (87.05%) and the worst (77.25%) scores.

Fig. 6 ResNetZ vs Res2Net on accuracy (first plot), precision (second plot), recall (third plot), AUC (fourth plot), and Kappa (fifth plot)

5 Discussion

5.1 FLIM images

Unlike other biomedical images, FLIM provides an extra dimension and introduces several visual recognition challenges. With the capability of user-specified configurations, the custom fibre-based FLIM system is able to deliver multi-dimensional images rich in spatial and spectral information. Although the dataset is relatively small in terms of the number of patients, it will gradually increase over time as the sample collection is still ongoing. Since the FLIM system was designed for in-vivo in-situ diagnostics with endoscopically delivered fibres, reliable classification is, therefore, of great clinical importance in real-time human lung cancer diagnostic pathways.

To attract more engineers, researchers, and enthusiasts to overcome the challenges and together push forward this particular area, the FLIM dataset is available on https://github.com/qiangwang57/flim_cancer_ml.

5.2 Model configurations

Multi-scale strategies are flexible in their configurations. An identity shortcut has proved helpful for information and gradient flow, achieving better outcomes [8, 9, 11]. In this study, when the scale of the model is small, the performance gain from the shortcut is significant, especially for Addition aggregation. For Concatenation, however, the improvement from the shortcut is not significant and, in an extreme case, harms performance. Consequently, the shortcut should be used with Addition. As far as the aggregation is concerned, Concatenation is overall better than Addition in most cases, which is expected since it doubles the input features. As a result, Concatenation can generally be used if model performance is the priority. When complexity is a major concern, particularly when ResNetZ has relatively many branch convolutions, Addition with the shortcut can substitute for Concatenation, offering comparable performance with less complexity.

5.3 Layer-level multi-scale architecture

A remarkable advantage of ResNetZ is its reduced complexity compared with the backbone ResNet. The primary reason is the parallel 3\(\times\)3 convolutions, along with the aggregation, which enable the fusion and extraction of features at different scales that are further concatenated as the output. A direct consequence is that each 3\(\times\)3 convolution is much narrower than in the original ResNet. With Concatenation as the aggregation, our ResNetZ bears similarities to DenseNet: within the block, every branch convolution is directly exposed to the input, encouraging each branch to learn different features, except that this connectivity operates at block scope. Although the results in Sect. 4 show that the performance gain is not always consistent with the number of parameters, we believe this is not due to the model itself but to the relatively small patient diversity of the dataset.

Fig. 7 Visualization of class activation maps (CAM) using Grad-CAM on normal and cancerous images with ResNet, Res2Net, and ResNetZ

Figure 7 illustrates the CAM areas generated by ResNet, Res2Net, and ResNetZ on images of normal and cancerous tissue, based on Grad-CAM [40]. For both normal and cancerous images, the CAM areas produced by ResNet generally cover almost the entire image, with partial concentration on a particular portion. Res2Net tends to have larger focus areas than ResNet, concentrating more on regions of moderate brightness. The CAM results of ResNetZ are further enhanced compared with Res2Net; in particular, ResNetZ tends to have multiple areas of concentration in regions of mild brightness. This indicates the superiority of ResNetZ over ResNet and Res2Net.
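
For reproducibility, Grad-CAM [40] can be implemented in a few lines with forward/backward hooks. The sketch below is a generic, minimal version rather than our exact visualization code, and it assumes a torchvision-style ResNet in which, e.g., `model.layer4[-1]` is a sensible target layer:

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer):
    """Weight the target layer's activations by the spatially averaged
    gradients of the class score, then apply ReLU and normalize."""
    acts, grads = {}, {}
    h1 = target_layer.register_forward_hook(lambda m, i, o: acts.update(a=o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: grads.update(g=go[0]))
    model.eval()
    score = model(image.unsqueeze(0))   # single logit for binary classification
    model.zero_grad()
    score.sum().backward()
    h1.remove(); h2.remove()
    weights = grads['g'].mean(dim=(2, 3), keepdim=True)        # channel weights
    cam = F.relu((weights * acts['a']).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode='bilinear',
                        align_corners=False)
    return (cam / (cam.max() + 1e-8)).detach().squeeze()       # CAM in [0, 1]
```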

Both layer- and feature-level multi-scale architectures perform comparably on the FLIM dataset given the same width, depth, scale, and configuration. Since the feature-level approach splits the input features, it requires wider input features and hence contains more parameters than the layer-level approach. The difference in the number of parameters becomes significant when the backbone network is complex. Overall, the two approaches are comparable in terms of the metrics.

6 Conclusion

This paper formally introduced a unique FLIM image dataset, described best practices for improving the raw input images, and proposed a novel multi-scale CNN, called ResNetZ, to further improve lung cancer classification. On 61,816 FLIM images of 18 pairs of normal/cancerous tissue collected from 18 patients, we showed the superiority of the proposed method over the backbone ResNet with significantly fewer parameters. In particular, ResNetZ38-W16-S2 delivered the best overall performance with fewer than 3.5M parameters. We also compared our layer-level multi-scale model with a feature-level one (Res2Net) to demonstrate the advantages and disadvantages of the ResNetZ model. Given the same model configurations, ResNetZ is superior to Res2Net in most cases; notably, with the same configurations, Res2Net is up to 20% more complex than ResNetZ. Since the FLIM system is designed for online in-vivo in-situ imaging, a model with fewer parameters is better suited to real-time clinical diagnostics at the bedside. As ResNetZ is designed to be independent of the backbone network, it can easily be migrated to other networks with similar convolutional blocks, such as segmentation-oriented networks. Future research will address the migration of our approach to other backbone networks and to different research scenarios.