1 Introduction

According to the statistical data published by the WHO International Agency for Research on Cancer, the most common cancers across all age groups worldwide are breast, prostate, and lung cancer [1]. While 5-year survival rates in breast and prostate cancer are 90% and 100% [2,3,4], the 5-year survival rate in lung cancer is 18–20%, making it the cancer with the highest mortality [3, 5, 6]. While articles from 2007 stated that 5-year survival rates for lung cancers diagnosed at an early stage reached 60–70%, today these rates vary between 20.2% [3] and 22% [2]. In this respect, early diagnosis is more critical for lung cancer than for other cancers.

Computed tomography (CT) images are used most frequently in lung cancer screening, and lesions are classified according to their visual characteristics. It is not always possible to detect a lesion at an early stage or to differentiate malignant from benign lesions with high accuracy [7]. The development of computer-aided diagnosis systems is thought to be beneficial to physicians for early and accurate diagnosis, so finding an effective method to classify and segment nodules is very important. Traditional methods may be insufficient for analyzing such images; therefore, studies on the classification and segmentation of nodules using deep learning (DL) methods have been carried out in recent years. DL methods consist of neural networks with the ability to learn from datasets, and owing to this ability they have performed well in classifying and segmenting nodules. In particular, convolutional neural network (CNN) methods have been effective in extracting and classifying image features, greatly improving the quality and reliability of treatment, especially in early diagnosis and screening [8].

In this study, DL methods were used for the classification and segmentation of nodules on lung CT images, and their effectiveness was investigated. Studies on this subject may aid the early diagnosis and treatment of nodules by performing well in terms of accuracy and sensitivity. Zheng et al. used the XGBoost classifier with CT images and clinical data to classify benign and malignant tumors [9]. Zhu et al. applied neural networks to lung nodules, taking into account features such as fuzzy boundaries, sparse distribution, and subtle differences in CT images [10]. Prosper et al. reviewed advances in the characterization of pulmonary nodules and early cancers using radiological features and deep learning architectures in addition to traditional image analysis approaches in chest CT [11]. Gugulothu et al. demonstrated the performance of the Logarithmic Layer Xception Neural Network (LLXcepNN) classifier, using image processing methods to extract lung regions from CT images [12]. Lima et al. used pre-trained VGG16, VGG19, Inception, ResNet50, and Xception networks to extract features from each 2D slice of 3D nodules, then applied principal component analysis to reduce the dimensionality of the feature vectors [13]. Saied et al. combined machine learning algorithms and deep learning models with PCA [14]. Qiu et al. introduced a new deep learning model for the segmentation of small lung nodules [15]. Kido et al. targeted nodule segmentation by developing a nested 3D fully connected convolutional network and a new loss function [16]. Bhattacharjee et al. proposed a fine-tuned dual skip-connection-based segmentation system that integrates the pre-trained Residual Neural Network (ResNet) 152 with the U-Net architecture to achieve a fast and accurate segmentation algorithm with fewer stages [17]. Savic et al. proposed a segmentation algorithm based on the fast-marching method [18].

Our review of the literature revealed no comprehensive study that, like ours, performs both classification and segmentation for the diagnosis and identification of pulmonary nodules; this gap motivates our study.

The innovative aspect of our study is that this model is used for the first time in the classification of pulmonary nodules. Since the segmentation and classification of lung diseases is a technical challenge, our study aims to facilitate computer-aided diagnosis of a critical condition such as pulmonary nodules. Another contribution is showing that the classification model is structured so that it can be applied to different diseases within the lung. Our study addresses several shortcomings identified in our literature searches, including:

  1. Overcoming the limitations of CNN networks in feature extraction by using attention blocks in the model where CNNs alone fall short.

  2. Reducing computational cost by extracting the most effective features with the PCA method instead of using a large number of features.

  3. Demonstrating that more efficient results can be achieved when deep learning and machine learning models are used together.

  4. Showing that segmentation can produce results that experts can easily evaluate when different segmentation models with different backbones are used.

For this purpose, we first applied the C+EffxNet model, which has previously shown success [19], to classify pulmonary nodules as benign or malignant. The first version of this model used versions B0 to B3 of the EfficientNet model; in this study, the EfficientNet B4 version was added, and deep features were extracted from it.

The study consists of two stages: classification and segmentation. First, our original dataset, obtained from Van Yüzüncü Yıl University Dursun Odabaşı Training and Research Hospital, was labeled as benign or malignant on CT by two expert radiologists from this hospital. In the segmentation phase, three segmentation algorithms were used and the results are reported comparatively. To increase the performance of these algorithms, three different backbones, InceptionV3, DenseNet121, and SeResNet101, were used for better feature extraction. Our study thus presents an ablation study that researchers can use for comparison, covering three segmentation algorithms and the three backbone models used with them. A further innovative aspect of our study is that it provides an in-depth analysis of pulmonary nodules.

Our study is organized as follows: related work on classification and segmentation is reviewed in Sect. 2, the methodologies used are described in Sect. 3, the proposed method in Sect. 3.5, the experimental studies in Sect. 4, and the discussion and conclusions in Sect. 5.

2 Related works

In this part of our study, the literature on pulmonary nodule classification is reviewed first, followed by published segmentation studies, together with how these works relate to our study.

2.1 Classification works

Al-shabi et al. [20] developed a model called ProCAN on the LIDC-IDRI dataset, obtaining a 98.05% AUC and 95.28% accuracy (Acc.). Fu et al. [21] presented an attention-based CNN approach for CT images, achieving a score of 0.94 regardless of nodule size. Heuvelmans et al. [22] trained the lung cancer prediction convolutional neural network (LCP-CNN) to generate a malignancy score for each nodule using CT data. They reported that their network performed well in identifying benign nodules, excluding malignancy with high accuracy in one-fifth of patients with small to medium nodules. Apostolopoulos et al. [23] used deep convolutional generative adversarial networks to train CNNs and thus compensate for the lack of large-scale data. They also developed a CNN network called feature fusion VGG19 (FF-VGG19) to improve feature extraction, obtaining an accuracy value of 0.92. Their study shows that generative networks can be used when the data available for diagnosing the disease are insufficient, and that CNN models alone may fall short in feature extraction. In their study, the benign and malignant datasets were presented with the nodules completely cropped out; in our study, attention blocks are used to locate these nodules on whole CT slices. Gu et al. [24], in a review examining nodule detection, segmentation, and classification performance, showed that automatic lung nodule detection can exceed human performance, especially for smaller nodules. They also stated that, compared to traditional methods, deep learning methods show lower false-positive rates while providing high sensitivity. He et al. [25] proposed an interpretation-model-guided classification method based on ISHAP (improved SHapley Additive exPlanations) for the classification of benign and malignant pulmonary nodules, obtaining a sensitivity of 0.862, a specificity of 0.885, and an accuracy of 0.873 on the Lung Image Database Consortium (LIDC) dataset. They reported that predictions made with the extracted features are sometimes not understood or interpreted by clinicians; our study presents a hybrid model to overcome this problem. Astaraki et al. [26] conducted a study using supervised and unsupervised methods built on convolutional features, diagnosing the disease with a machine learning method after feature extraction, and noted that more training may be needed to generalize machine learning models. The approach we use previously performed well in diagnosing COVID-19 on CT images; its benign–malignant discrimination performance is demonstrated in the present study. Halder et al. [27] proposed the 2-Pathway Morphology-Based Convolutional Neural Network (2PMorphCNN), which classifies lung nodules by capturing both textural and morphological features, reaching 96.85% accuracy on the LIDC-IDRI dataset. They also stated that the convolution operation alone is not effective for feature extraction; to overcome this, our hybrid model includes the CBAM module consisting of attention blocks. Huang et al. [28] developed a manifold-based deep learning model for benign–malignant discrimination, preprocessing the CT images and extracting the relevant nodules, which were then classified through deep features. Their study emphasized that a high number of features can cause problems in classification; in our study, the PCA method was used to reduce the number of features, which increased the classification accuracy rates. Jin et al. [29] published a detailed review on the use of machine learning algorithms for the diagnosis of lung nodules, emphasizing that deep learning applications can give better results than machine learning. Our study uses an approach that combines deep learning and machine learning for pulmonary nodules with successful results. Yang et al. [30] proposed an improved 3D U-Net framework for benign and malignant identification, using U-Net for low-level features and CapsNet for high-level features.

2.2 Segmentation works

Dutande et al. [31] first performed pre-processing on CT images in their study proposing a Deep Residual Separable Convolutional Neural Network for lung tumor segmentation. They emphasized that the standard U-Net model may be insufficient for feature extraction; in our study, backbones were used to overcome this inadequacy, and the performance of the backbones in different segmentation models is also demonstrated. Tyagi et al. [32] obtained a Dice coefficient of 80.74% on the LUNA16 dataset with a 3D conditional generative adversarial network with squeeze-and-excitation blocks for lung nodule segmentation. The patch-based processing that drives the performance of the GAN in their study also increased the computational cost in proportion to the accuracy. Liu et al. [33] proposed a method called the cascaded dual-pathway residual network (CDP-ResNet) to improve the segmentation of lung nodules in CT images, obtaining an 81.58% Dice coefficient on the LIDC dataset. They stated that, even if it is not necessary to specify the location of the nodule in every slice, the ROI of a particular slice containing the nodule must be given. In our study, the segmentation algorithms can detect the location of the nodule even when it is not specified.

3 Materials and methods

3.1 Dataset

Our study is retrospective and covers the period from 2015 to 2021. The images were obtained with multislice tomography devices with 128 detectors (Siemens SOMATOM Definition AS+128, Forchheim, Germany) and 16 detectors (Somatom Emotion 16-slice; CT2012E, Siemens AG, Berlin and München, Germany) at the Faculty of Medicine of Van Yüzüncü Yıl University. Patients with nodular mass lesions of the lungs were identified in films evaluated by specialist radiologists. Patients with at least two years of follow-up or a definitive pathological diagnosis were included in the study, while patients with artifacts in their images or nodules smaller than 5 mm were excluded.

Our study was conducted in two stages; the first aimed to classify the lesions. For this purpose, patients were evaluated histopathologically and clinicoradiologically by specialist radiologists, and two labeled classes, malignant and benign, were formed. A total of 199 images were then obtained from all axial sections containing the lesion in 29 (19 M/10 F) patients in the malignant group, and a total of 202 images from all axial sections containing the lesion in 68 (38 M/30 F) patients in the benign group.

Since normal lung images without any lesions were needed for artificial intelligence training, 343 normal sections from 67 (38 M/29 F) patients were used as a third group, covering the upper, middle, and lower zones of the lung. The mean age was 65 ± 10.4 years (range 43–88) in the malignant group, 59 ± 12.2 years (27–81) in the benign group, and 56.9 ± 14.1 years (26–81) in the normal group. For the second stage, segmentation, the dataset contained a mix of malignant and benign patients: a total of 379 images from 80 (43 M/37 F) patients, comprising 229 images from 57 (28 M/29 F) patients with a benign diagnosis and 150 images from 23 (15 M/8 F) patients with a malignant diagnosis. The mean age of this group was 61 ± 12 years (27–88). The location of the lesion in each image was then determined by specialist doctors, and the lesion borders were drawn manually using image processing programs. Thus, two copies of each image were obtained, marked and unmarked. The marked images are shown in Fig. 1, and masks were extracted from the marked benign and malignant patient data, as shown in Fig. 2.

Fig. 1 Example of marked dataset

Fig. 2 Malignant patient image: a original image, b mask of image

3.2 Deep learning models

The C+EffxNet approach, utilized for classification, represents a novel hybrid deep learning method developed by Canayaz [19], specifically tailored for COVID-19 CT images. In the initial phase, we construct a CBAM-based model with an input layer of (256, 256, 3). This model incorporates channel attention, spatial attention, and residual blocks to extract crucial features from the images. Subsequently, the hypercolumn technique is employed to amalgamate these extracted features. Moving on to the second stage, EfficientNet models are integrated into the CBAM model, resulting in the creation of a third hybrid model. The final layer of the CBAM model is fused with the initial input layer of the EfficientNet models, adapting the shape of the first input layer to match the data output of the CBAM model. It is noteworthy that the models in this hybrid approach are not pre-trained; instead, hyperparameters are fine-tuned during the model training process. The training data for the hybrid model comprises images processed in these stages, yielding 1024 features for each model.

In summary, this approach involves a classification task employing two hybrid models. The first model consists of layers comprising CBAM blocks and feature maps of images utilizing the hypercolumn technique. In the second model of the hybrid approach, four versions of EfficientNet, a prominent deep learning architecture, are employed. In this study, the B4 version of EfficientNet is incorporated into this model, and a performance evaluation is conducted. This innovative approach is employed for feature extraction due to its demonstrated high performance in CT images.
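For illustration, the following is a minimal Keras sketch of the CBAM-style channel and spatial attention described above. The stem convolution, layer widths, and reduction ratio are illustrative assumptions, not the exact C+EffxNet configuration.

```python
# Minimal Keras sketch of CBAM-style attention (channel + spatial), as used in
# the first stage of the hybrid model. Widths and the reduction ratio are
# illustrative assumptions rather than the authors' exact implementation.
import tensorflow as tf
from tensorflow.keras import layers, Model

def channel_attention(x, ratio=8):
    ch = int(x.shape[-1])
    mlp = tf.keras.Sequential([layers.Dense(ch // ratio, activation="relu"),
                               layers.Dense(ch)])           # shared bottleneck MLP
    avg = mlp(layers.GlobalAveragePooling2D()(x))
    mx = mlp(layers.GlobalMaxPooling2D()(x))
    w = layers.Activation("sigmoid")(layers.Add()([avg, mx]))
    return layers.Multiply()([x, layers.Reshape((1, 1, ch))(w)])

def spatial_attention(x):
    avg = layers.Lambda(lambda t: tf.reduce_mean(t, axis=-1, keepdims=True))(x)
    mx = layers.Lambda(lambda t: tf.reduce_max(t, axis=-1, keepdims=True))(x)
    w = layers.Conv2D(1, 7, padding="same", activation="sigmoid")(
        layers.Concatenate()([avg, mx]))
    return layers.Multiply()([x, w])

inputs = layers.Input((256, 256, 3))            # input shape used in the paper
x = layers.Conv2D(32, 3, padding="same", activation="relu")(inputs)
x = spatial_attention(channel_attention(x))     # CBAM: channel then spatial attention
model = Model(inputs, x)
```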

The InceptionV3, DenseNet121, and SEResNet101 backbones were used for feature extraction within the segmentation algorithms. These models are briefly described below.

InceptionV3 consists of symmetric and asymmetric building blocks that include convolutions, average pooling, max pooling, concatenations, dropout, and fully connected layers [34]. The architecture uses factorized convolutions to reduce the number of parameters, and small convolution windows provide faster computation. The fundamental features of InceptionV3 are as follows: (a) Inception module: the model employs Inception modules containing parallel convolution filters of different sizes; these modules help the network learn varied features and contribute to a more effective representation of information. (b) Auxiliary classifiers: InceptionV3 facilitates the training of deeper networks by incorporating auxiliary classifiers during training, which can enhance learning by utilizing information from both earlier and deeper layers. (c) Batch normalization: batch normalization is used throughout InceptionV3 to train the network faster and improve generalization, contributing to more stable learning. (d) Factorization into small convolutions: computation cost is reduced by replacing large convolution kernels with factorized small ones, yielding a lighter and more efficient model.

In the DenseNet121 architecture, each layer is directly connected to all subsequent layers. In these networks, the feature maps of previous layers are not summed at each layer; instead, they are concatenated and used as input [35]. The architecture contains one 7 × 7 convolution, 58 3 × 3 convolutions, 61 1 × 1 convolutions, 4 average pooling layers, and 1 fully connected layer. DenseNet has denser connections than earlier models, allowing the network to learn more effectively and facilitating information transfer. Its key feature is the inclusion of direct connections between all layers: in contrast to traditional CNNs, each layer is connected not only to the previous layer but to all preceding layers, enhancing the flow of information and enabling the network to learn deeper, more effective features. DenseNet121 is the specific DenseNet model consisting of 121 layers. It is commonly used in computer vision tasks such as object recognition, classification, and segmentation, and is popular in transfer learning, since a DenseNet model pre-trained on extensive datasets tends to perform well on similar tasks.

SEResNet is a variation of ResNet that incorporates squeeze-and-excitation blocks. ResNet is a prominent architecture introduced to facilitate the training of deep neural networks and enhance performance, and SEResNet101 is an enhanced version of ResNet101. Squeeze-and-Excitation (SE) blocks are a mechanism that helps deep neural networks learn more effective features by emphasizing the important channels of the learned feature maps. SE blocks typically consist of two main steps. In the squeeze step, the channel information in each feature map is compressed into a per-channel summary, measuring the importance of each channel. In the excitation step, weights derived from this summary are applied to the feature maps; these weights express the importance of each channel, creating a mechanism that determines which channels deserve more emphasis. By placing greater weight on the important channels, the network learns more effective features, improving overall performance. SE blocks are designed to deliver better performance, especially in large and deep networks: they map channel dependencies more accurately and thereby better calibrate the filter outputs, leading to performance gains [36].
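As a minimal illustration of these two steps, the following Keras sketch implements an SE block; the reduction ratio of 16 is a common default and is an assumption here, not a setting taken from [36].

```python
# Minimal Keras sketch of a squeeze-and-excitation (SE) block.
from tensorflow.keras import layers

def se_block(x, ratio=16):
    ch = int(x.shape[-1])
    s = layers.GlobalAveragePooling2D()(x)               # squeeze: per-channel summary
    s = layers.Dense(ch // ratio, activation="relu")(s)  # excitation: bottleneck MLP
    s = layers.Dense(ch, activation="sigmoid")(s)        # per-channel importance weights
    # re-weight the feature maps channel-wise
    return layers.Multiply()([x, layers.Reshape((1, 1, ch))(s)])
```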

3.3 Classification

In this section, some of the classifiers used in the study will be briefly explained. Before moving on to classifiers, information about PCA, which is the feature reduction method we used in the study, will be given.

3.3.1 Reduced features with PCA

Principal component analysis (PCA) is an unsupervised linear transformation technique used for dimensionality reduction [37]. It identifies new subspaces with the highest variance in high-dimensional data, effectively reducing its size [37, 38]. While this reduction may result in the loss of certain properties, it primarily discards the less informative features of the data.

PCA brings together highly correlated variables, forming a reduced set of artificial variables known as 'principal components' that capture the most significant variation in the data [38]. As an orthogonal statistical technique [39], PCA projects high-dimensional data samples into a lower-dimensional space through a linear transformation [40] that minimizes redundant covariance information and maximizes variance information [41], preserving the original data features as far as possible.

Creating principal components in high-dimensional data begins with normalization of the data.

Let \({Z}^{\prime }=({Z}_{1}, {Z}_{2},\dots ,{Z}_{p})\) denote the \(p\) normalized random variables. The eigenvalues and eigenvectors of these standardized data are obtained from the variance–covariance matrix \(\Sigma \). The eigenvalues of this matrix

$$\left|\Sigma -\lambda {I}_{p}\right|=0$$
(1)

are obtained from the roots of Eq. 1. These eigenvalues are ordered so that \({\lambda }_{1}>{\lambda }_{2}>\dots >{\lambda }_{p}>0\), reflecting the decreasing variance in the data [42]. The linear transformation in Eq. 2 is then established as:

$$Y = l^T Z,$$
(2)

with \(l\) as the loading vector. From this, the equation

$${\Lambda =P}^{T}\Sigma P$$
(3)

is obtained, where Λ is the eigenvalue matrix and P is the eigenvector matrix. The basic assumption of PCA is that the scores and loading vectors corresponding to the largest eigenvalues contain the most useful information, while the rest mainly contain noise. For this reason, these vectors are generally ordered by decreasing eigenvalue [43].
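A compact NumPy sketch of these steps, using the shapes reported later in the paper (1024 deep features reduced to 100 components for 744 images), is given below; the random matrix is a placeholder for the real feature set.

```python
# NumPy sketch of the PCA steps in Eqs. 1-3: standardize, eigendecompose the
# covariance matrix, and keep the components with the largest eigenvalues.
import numpy as np

def pca_reduce(X, n_components=100):
    Z = (X - X.mean(axis=0)) / X.std(axis=0)   # normalized variables Z
    cov = np.cov(Z, rowvar=False)              # variance-covariance matrix Sigma
    eigvals, eigvecs = np.linalg.eigh(cov)     # roots of |Sigma - lambda*I| = 0
    order = np.argsort(eigvals)[::-1]          # lambda_1 > lambda_2 > ... > lambda_p
    P = eigvecs[:, order[:n_components]]       # loading vectors with largest eigenvalues
    return Z @ P                               # scores Y = l^T Z

X = np.random.rand(744, 1024)                  # placeholder for the deep feature matrix
X_reduced = pca_reduce(X)                      # shape (744, 100), as in the paper
```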

3.3.2 Classifiers

The purpose of SVM is to obtain the optimal separating hyperplane that divides data belonging to different classes [44,45,46]. SVMs are supervised learning models that analyze data for classification [45]. To find the optimal separating hyperplane, the underlying optimization problem is solved using Lagrange multipliers, which reduces the number of computations [47].

In SVM, the classification process for each sample xi in the data set can be expressed as in Eq. 4 [46]:

$$f\left({x}_{i}\right)={\text{sign}}\left(\sum\limits_{j=1}^{n}{\alpha }_{j}{y}_{j}K\left({x}_{i},{x}_{j}\right)+b\right)$$
(4)

where f(xi) is the classification score of sample xi, αj the support vector weights, yj the label (class) of sample xj, K(xi, xj) the kernel function, and b the bias.

KNN is a non-parametric classification method whose input consists of the k closest training examples in a dataset [48]. The class of a sample is determined by a majority vote among its k nearest neighbors [37]. The formula for KNN is given in Eq. 5.

$$\widehat{y}(x)={\text{mode}}\left\{{y}_{i} \mid {x}_{i}\in {N}_{k}(x)\right\}$$
(5)

here, \(\widehat{y}(x)\) is the estimated label of point x, \({N}_{k}(x)\) the set of the k nearest neighbors of point x, and \({y}_{i}\) the label of point \({x}_{i}\).

The RidgeClassifier is a classifier that utilizes Ridge regression. Ridge regression [49] is suggested for data with multicollinearity issues to obtain predictors with smaller variances. For classification, it transforms the target variable into [− 1, 1] according to class and constructs the model using Ridge regression [50]. The loss function and formula for this classifier are given in Eq. 6 [51].

$${L}_{0} = \text{Mean Squared Error} + {L}_{2}\ \text{penalty}$$
(6)
$$J(w)=\sum_{i=1}^{n}l\left(f({x}_{i}),{y}_{i}\right)+\alpha {\Vert w\Vert }_{2}^{2}$$

where J(w) is the loss function representing the total error, \(l\left(f({x}_{i}),{y}_{i}\right)\) the loss measuring the error between the model's prediction and the actual label, α the regularization parameter, and \({\Vert w\Vert }_{2}^{2}\) the squared L2 norm of the weight vector.

The Ridge Classifier restricts the weights using this regularization term, enhancing generalization. As a result, the risk of overfitting decreases, and a more generalizable model is obtained.

XGBoost, a machine learning method based on decision trees and gradient boosting proposed by Chen and Guestrin in 2016 [52], has been described as a blend of hardware and software optimization approaches that yields excellent results in a short period [53]. XGBoostClassifier is a gradient boosting classifier based on XGBoost, an implementation of the widely recognized gradient boosting algorithm known for its efficiency and prediction accuracy [52]. The algorithm is equipped with methods such as regularization, missing-value imputation, cross-validation, and hyperparameter tuning to enhance the computation time and performance of the model [53]. The primary goal of XGBoost is to optimize the performance of the model through an objective function that encompasses both loss and regularization terms. This classifier is expressed as in Eq. 7.

$${\text{Obj}}=\sum_{i=1}^{n}l({y}_{i},\widehat{{y}_{i}})+\sum_{k=1}^{K}\Omega ({f}_{k})$$
(7)

Here, n is the number of samples in the dataset, \(l({y}_{i},\widehat{{y}_{i}})\) the loss function measuring the error between the actual label \({y}_{i}\) and the prediction \(\widehat{{y}_{i}}\), K the total number of trees, and \(\Omega ({f}_{k})\) the regularization term controlling the complexity of each tree.
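The following sketch illustrates this classification step with scikit-learn and the xgboost package, using a 20% test split as in Sect. 3.5. Default hyperparameters and the placeholder arrays are assumptions, since the exact settings are not reported.

```python
# Hedged sketch: the four classifiers named above, trained on PCA-reduced features.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import RidgeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

X = np.random.rand(744, 100)        # placeholder for the PCA-reduced feature matrix
y = np.random.randint(0, 3, 744)    # placeholder labels (3 classes)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

for name, clf in {"Ridge": RidgeClassifier(), "SVM": SVC(),
                  "KNN": KNeighborsClassifier(), "XGBoost": XGBClassifier()}.items():
    clf.fit(X_tr, y_tr)
    print(name, accuracy_score(y_te, clf.predict(X_te)))
```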

3.4 Segmentation

3.4.1 U-Net

The Fully Convolutional Network (FCN) is a highly successful and frequently used basic architecture proposed for semantic segmentation [54]. The U-Net architecture builds on the FCN and was proposed for the semantic segmentation of medical images [55]. The network consists of two parts: a contraction path and an expansion path [55, 56]. Operations on the left side of the architecture, the contraction path, capture context information about the image (i.e., extract features) and follow exactly the logic of a classical CNN.

The operations on the right side of the architecture, the expansion path, precisely localize the parts of the image that need to be segmented [55, 57]. For the skip connections between the contraction and expansion paths, a concatenation operator is applied instead of a sum. This passes spatial information directly to deeper layers and yields a more accurate segmentation result [58]. Classical deep learning needs abundant examples and expensive computing resources, but U-Net can adapt to minimal training sets [57, 59]. This makes the network particularly suitable for medical image segmentation tasks [57, 60].

The main strategy that distinguishes U-Net from other segmentation architectures is to combine feature maps of the contraction phase and their symmetrical counterparts in the expansion phase. In this way, it is possible to disseminate context information into high-resolution feature maps [61]. The U-Net structure is shown in Fig. 3.

Fig. 3 U-Net structure [55]

3.4.2 LinkNet

LinkNet is a lightweight deep neural network architecture that learns semantic segmentation tasks without a significant increase in parameters. Like U-Net and other segmentation networks, it has an encoder on the left and a decoder on the right [61]. The encoder encodes the information in the input space, and the decoder maps this information to the spatial categorization needed to perform the segmentation [61, 62]. The encoders used in current segmentation architectures perform multiple downsampling operations, and some spatial information is lost as the input passes through cascaded convolutions [63]; this lost information is difficult to recover [62]. In LinkNet, the input of each encoder layer is also linked to the output of the corresponding decoder layer. The purpose is to recover the lost spatial information so that it can be used by the decoder and its upsampling operations [62]. Fewer parameters are needed because the information learned by the encoder is shared with the decoder [62, 64]. The LinkNet structure is shown in Fig. 4.

Fig. 4 LinkNet structure [62]

3.4.3 FPN

FPN uses the pyramidal hierarchy of deep convolutional networks to construct feature pyramids at marginal extra cost [65]. Image pyramids are a data structure designed to support convolution through reduced imagery, consisting of a series of copies of the original image in which both sample density and resolution are reduced in regular steps [66]. Feature pyramids form the basis of a standard solution built on image pyramids [67]. The pyramid can provide some conceptual unification to the problem of representing and manipulating low-level visual information: it offers a flexible, convenient multi-resolution format that matches the multiple scales found in visual scenes and reflects the multiple processing scales of the human visual system [66]. FPN is a feature extractor based on the pyramid concept that offers both accuracy and speed [67]. ConvNets represent a high level of semantics and are robust to variance in scale, but pyramids are still needed to obtain the most accurate results. The main advantage of featurizing each level of an image pyramid is that it produces a multi-scale feature representation in which all levels, including the high-resolution ones, are semantically strong. The disadvantage is that it increases feature extraction time [67]. FPN combines features using bottom-up and top-down paths with lateral connections, leveraging the feature hierarchy to create a strong feature pyramid [68].

The bottom-up path is the feedforward computation of the backbone, where each stage is defined as a pyramid level [66], and the output of the final layer of each stage is selected as the feature map used to construct the pyramid [67]. The top-down path produces spatially coarser but semantically stronger feature maps that are upsampled to higher resolutions; these features are enhanced by features from the bottom-up path through lateral connections. Each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down paths [67]. The FPN structure is shown in Fig. 5.

Fig. 5 FPN structure [65]

3.5 Proposed method

Our benign–malignant analysis consists of two stages. The first is the classification stage, in which the classification performance on the dataset was analyzed. For this, the C+EffxNet method was trained on the dataset, and the resulting models were then used to extract deep features. The C+EffxNet method uses versions B0 through B4 of the EfficientNet deep learning model, with an input size of (256, 256, 3). From this new hybrid method, 1024 features were extracted for each model. We then applied feature selection with the PCA method, keeping the 100 best features for each model; for our image dataset, the resulting feature matrix was (744, 100), i.e., 100 features per image. Twenty percent of this feature dataset was reserved as test data. Classification was then carried out on this dataset using RidgeClassifier, SVM, KNN, and XGBoostClassifier. The operations performed in the classification stage are shown in Fig. 6.

Fig. 6 Classification process

The parameters used in training the models were the Adam optimizer, a learning rate of 0.0005, the categorical cross-entropy loss function, a batch size of 8, and 100 epochs, as sketched below.
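```python
# Sketch of the reported training configuration (Adam, learning rate 0.0005,
# categorical cross-entropy, batch size 8, 100 epochs). The Sequential model
# and random data are stand-ins for the hybrid model and the CT images.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=(256, 256, 3)),
                             tf.keras.layers.Dense(3, activation="softmax")])
x = np.random.rand(16, 256, 256, 3)
y = tf.keras.utils.to_categorical(np.random.randint(0, 3, 16), 3)

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.0005),
              loss="categorical_crossentropy", metrics=["accuracy"])
model.fit(x, y, batch_size=8, epochs=100)
```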

The second stage is the segmentation of the disease. At this stage, results were obtained using three powerful segmentation algorithms: U-Net, LinkNet, and FPN. InceptionV3, DenseNet121, and SeResNet101 were used as backbones for feature extraction in these algorithms. The benign and malignant datasets were first run separately, and then the combined dataset was run. In the segmentation phase, the Adam optimizer was used with a learning rate of 0.0001, a threshold value of 0.5, and a batch size of 8. The segmentation process is shown in Fig. 7, and a configuration sketch follows.

Fig. 7 Segmentation process
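One way to realize this setup is with the qubvel segmentation_models package, which provides U-Net, LinkNet, and FPN with the InceptionV3, DenseNet121, and SeResNet101 backbones named above; the package choice, loss function, and random data below are assumptions, while the optimizer, learning rate, batch size, and 0.5 threshold follow the text.

```python
# Hedged sketch of the segmentation configuration.
import numpy as np
import segmentation_models as sm
from tensorflow.keras.optimizers import Adam

model = sm.Unet("inceptionv3", input_shape=(256, 256, 3), encoder_weights=None)
# alternatives: sm.Linknet(...), sm.FPN(...); backbones "densenet121", "seresnet101"
model.compile(optimizer=Adam(learning_rate=0.0001),
              loss=sm.losses.bce_jaccard_loss,                        # assumed loss
              metrics=[sm.metrics.iou_score, sm.metrics.f1_score])

images = np.random.rand(8, 256, 256, 3).astype("float32")             # placeholder CT slices
masks = np.random.randint(0, 2, (8, 256, 256, 1)).astype("float32")   # placeholder masks
model.fit(images, masks, batch_size=8, epochs=1)
pred = (model.predict(images) > 0.5).astype("uint8")                  # 0.5 threshold
```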

3.6 Metrics

Accuracy is one of the most common criteria used in practice to evaluate the generalization ability of classifiers [69]; it is the ratio of the number of correctly diagnosed nodules to the total number of nodules [37]. Precision measures how well a model predicts only positive outcomes. Recall is the ratio of correctly classified positives to all actual positives [37] and measures the proportion of positive patterns that are correctly classified [69].

The F1 score combines precision and recall in a single measure: it is the harmonic mean of the two, and a higher F1 score indicates better performance [69]. The reported metrics are Sensitivity (Se), Specificity (Sp), F-score (F-Scr), Precision (Pre), and Accuracy (Acc). The True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN) counts are used to calculate them. The equations for these metrics are given in Eqs. 8–12.

$${\text{Se}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FN}}}$$
(8)
$${\text{Sp}}=\frac{{\text{TN}}}{{\text{TN}}+{\text{FP}}}$$
(9)
$${\text{Pre}}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}}$$
(10)
$${\text{F-score}}=\frac{2{\text{TP}}}{2{\text{TP}}+{\text{FP}}+{\text{FN}}}$$
(11)
$${\text{Accuracy}}=\frac{{\text{TP}}+{\text{TN}}}{{\text{TP}}+{\text{TN}}+{\text{FP}}+{\text{FN}}}$$
(12)

The kappa value is a statistic used to measure the agreement between two raters or annotators who assign ratings or labels to a set of items. It compares the observed agreement between the raters with the agreement that would be expected by chance, which is useful because chance agreement can be high even when the raters are not actually in agreement [70]. A kappa value of 1 indicates perfect agreement, while values below 1 indicate less than perfect agreement. The kappa coefficient is given in Eq. 13.

$$k=\frac{{\text{Pr}}\left(a\right)-{\text{Pr}}(e)}{1-{\text{Pr}}(e)}$$
(13)

here, Pr(a) is the observed proportion of agreement between the two raters, and Pr(e) is the probability of this agreement occurring by chance.

The Jaccard index, also known as the Jaccard similarity coefficient, measures the similarity between two sets. It is calculated by dividing the size of the intersection of the two sets by the size of their union; a higher Jaccard index indicates greater similarity [71]. The Jaccard similarity coefficient is given in Eq. 14.

$$\text{Jaccard index}=\frac{{\text{TP}}}{{\text{TP}}+{\text{FP}}+{\text{FN}}}$$
(14)

The Dice coefficient, also known as the Sørensen–Dice coefficient, is another measure of similarity between two sets. It is calculated by dividing twice the size of the intersection by the sum of the sizes of the two sets; a higher coefficient indicates greater similarity [72]. The equation for the Dice coefficient is given in Eq. 15.

$$\text{Dice coefficient}=\frac{2{\text{TP}}}{2{\text{TP}}+{\text{FP}}+{\text{FN}}}$$
(15)
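For concreteness, the following minimal NumPy sketch computes Eqs. 8–15 from binary (0/1) prediction and ground-truth arrays; the helper name is illustrative. Note that, under these definitions, the F-score (Eq. 11) and the Dice coefficient (Eq. 15) coincide.

```python
# NumPy sketch of the evaluation metrics in Eqs. 8-15.
import numpy as np

def evaluate(pred, true):
    tp = np.sum((pred == 1) & (true == 1))
    tn = np.sum((pred == 0) & (true == 0))
    fp = np.sum((pred == 1) & (true == 0))
    fn = np.sum((pred == 0) & (true == 1))
    n = tp + tn + fp + fn
    pr_a = (tp + tn) / n                                           # observed agreement Pr(a)
    pr_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2  # chance agreement Pr(e)
    return {"Se": tp / (tp + fn),
            "Sp": tn / (tn + fp),
            "Pre": tp / (tp + fp),
            "F-score": 2 * tp / (2 * tp + fp + fn),
            "Acc": pr_a,
            "Kappa": (pr_a - pr_e) / (1 - pr_e),
            "Jaccard": tp / (tp + fp + fn),
            "Dice": 2 * tp / (2 * tp + fp + fn)}
```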

4 Experimental studies

4.1 Pre-processing

No preprocessing was performed on the dataset used for classification. For segmentation, however, experts examined the images in the dataset one by one and labeled the diseased regions in the benign and malignant images. These labels were then converted into masks using image processing programs, and the masks served as the segmentation ground truth.

4.2 Benign–malignant classification

At this stage, the C+EffxNet models were trained on the dataset. During training, the model weights yielding the lowest validation loss were retained. Initial performance evaluations on the validation dataset were conducted with these saved models, and the results are presented in Table 1. Note that each sample in the dataset corresponds to an image of size (256, 256, 3).

Table 1 Results obtained from the validation dataset

When we examine this table, the best result on the validation dataset was achieved by EffNetB3 with a score of 0.88. Our plan was to use the C+EffxNet model for feature extraction: by training on the previously unseen pulmonary nodule dataset, the model adjusted its weights and became prepared for feature extraction from this dataset. Table 1 is presented to highlight the importance of deep feature extraction by showing the model's performance before it. After extracting 1024 features for each image, PCA was employed to select the 100 most influential features. Table 2 gives the classification performance with the selected features.

Table 2 Classification results of selected deep features

When we examine Table 2, the best results are obtained with the features extracted using EffB0: a performance of 0.97 was achieved with all classifiers. While the accuracy obtained after training the hybrid model was 0.84, this value increased by 0.13 points with the use of deep features. The MSE, RMSE, and MAE values obtained with these features were 0.0402, 0.2006, and 0.0268, respectively. The lowest error values were achieved with the features obtained from EffNetB2, which yielded an MSE of 0.0335 and an RMSE of 0.1831, the lowest among all MSE and RMSE values in the table. The confusion matrix of the EffNetB0 model, which produced the best result, is shown in Fig. 8.

Fig. 8 EffB0 confusion matrix

To confirm the reliability of our results, classification results were obtained on the dataset using Cross-Validation (CV) with K = 10 folds and Leave-One-Out Cross-Validation (LOOCV). In these verification methods, classification was performed with many classifiers in addition to those above. The correlation between the results obtained with these methods is also shown in Table 3.

Table 3 CV and LOOCV results of selected features

Looking at Table 3, the best results are again obtained with the features extracted from EffB0, as in Table 2. In the classification with these features, RidgeClassifier obtained the best result with 0.98 accuracy, followed by SVM with 0.977. The correlation between these features and the results obtained is 0.99.
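A minimal scikit-learn sketch of this verification step is given below; RidgeClassifier stands in for the many classifiers tested, and the random arrays are placeholders for the selected deep features and labels.

```python
# Hedged sketch of the 10-fold CV and LOOCV checks described above.
import numpy as np
from sklearn.model_selection import cross_val_score, KFold, LeaveOneOut
from sklearn.linear_model import RidgeClassifier

X = np.random.rand(744, 100)      # placeholder for the selected deep features
y = np.random.randint(0, 3, 744)  # placeholder labels

clf = RidgeClassifier()
cv_acc = cross_val_score(clf, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=42))
loo_acc = cross_val_score(clf, X, y, cv=LeaveOneOut())
print(cv_acc.mean(), loo_acc.mean())
```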

4.3 Benign–malignant segmentation

In the segmentation phase, ablation studies were performed. First, the models were trained only on the dataset containing malignant nodules. The results obtained for the malignant dataset are given in Table 4.

Table 4 Malignant results

When we examine Table 4, the best results on the training data are obtained with the U-Net model and the InceptionV3 backbone: a Jaccard index of 0.9283 and a Dice coefficient of 0.9489. The test values obtained with this model and backbone are 0.7129 and 0.8214, respectively. The best test values were obtained with the FPN model and the DenseNet121 backbone, with a Jaccard index of 0.8026 and a Dice coefficient of 0.8877. Considering the test results, the best performance on the malignant dataset is clearly obtained with the FPN model and the DenseNet121 backbone. Model training was then carried out on the dataset containing benign nodules; the results are given in Table 5.

Table 5 Benign results

When Table 5 is examined, the best results on the training data are obtained with the LinkNet model and the DenseNet121 backbone, with a Jaccard index of 0.9140 and a Dice coefficient of 0.9284, slightly lower than the training values on the malignant dataset. On the test data for the benign dataset, the best performance is obtained with the U-Net model and the InceptionV3 backbone: a Jaccard index of 0.5127 and a Dice coefficient of 0.5728, considerably lower than for the malignant dataset. Finally, training was carried out on the combined dataset; the results are given in Table 6.

Table 6 Benign and malignant results

When we examine Table 6, the best performance on the training data is provided by the U-Net model with the SeResNet101 backbone, with a Jaccard index of 0.9094 and a Dice coefficient of 0.9301. On the test data, the best results were obtained with the FPN model and the DenseNet121 backbone, with a Jaccard index of 0.3263 and a Dice coefficient of 0.3890, again considerably lower than the malignant results.

The application implemented for our study can be accessed at https://github.com/mcanayaz/PulnomaryNodules

5 Discussion and conclusions

Examination and classification of pulmonary nodules play an important role in the rapid diagnosis of lung cancer. Our work consists of two stages. In the first stage, benign and malignant classification is performed. Here, we first measured the benign–malignant classification performance of the C+EffxNet approach, which was previously proposed for COVID-19 classification with successful results. The published study of this approach used versions B0 through B3 of the EfficientNet model; in this study, new results were obtained with the B4 version. As a result of model training alone, the maximum success rate was 88%; however, it increased to 97.98% after deep features were extracted and classification was performed on the selected features. This clearly shows the power of feature extraction and feature selection. The test size ratio in the classification studies was 0.2, and the best results were obtained with the Ridge and XGBoost classifiers. To confirm the reliability of the results, CV and LOOCV cross-validation were applied to the obtained features; the correlation between the results of these methods is 0.997.

The second phase of the study is the segmentation of benign and malignant nodules. Here, masks were obtained from the images marked by our radiologists, and the dataset created from the images and masks was run through the three segmentation algorithms. Segmentation results with separate backbones were obtained first on the benign image dataset, then on the malignant image dataset, and finally on the dataset formed by combining the two. The highest Dice coefficient, 0.88, was obtained on the malignant image dataset, together with a Jaccard index of 0.8026; the results on this dataset were higher than those on the benign dataset and on the combined dataset. Among the limitations of our study, the number of images in the dataset should be increased to improve the segmentation performance metrics. At the end of the study, segmentation results were obtained on patients the models had never seen, and these results, together with the radiologists' comments on the models' performance, are given in the "Appendix". We will continue to work on new models for segmentation, and we plan to build an interface in which both classification and segmentation can be used together.