Computed Tomography-derived intratumoral and peritumoral radiomics in predicting EGFR mutation in lung adenocarcinoma

Objective To investigate the value of computed tomography (CT) radiomics derived from different peritumoral volumes of interest (VOIs) in predicting epidermal growth factor receptor (EGFR) mutation status in patients with lung adenocarcinoma. Materials and methods A retrospective cohort of 779 patients with pathologically confirmed lung adenocarcinoma was enrolled. Of these, 640 patients were randomly divided into a training set, a validation set, and an internal testing set (3:1:1), and the remaining 139 patients were defined as an external testing set. The intratumoral VOI (VOI_I) was manually delineated on thin-slice CT images, and seven peritumoral VOIs (VOI_P) were automatically generated by expanding 1, 2, 3, 4, 5, 10, and 15 mm along the VOI_I. A total of 1454 radiomic features were extracted from each VOI. The t-test, the least absolute shrinkage and selection operator (LASSO), and the minimum redundancy maximum relevance (mRMR) algorithm were used for feature selection, followed by the construction of radiomics models (VOI_I model, VOI_P models, and combined models). The performance of the models was evaluated by the area under the curve (AUC). Results In total, 399 patients were classified as EGFR-mutant (EGFR+), while 380 were wild-type (EGFR−). In the training and validation sets, internal and external testing sets, the VOI4 (intratumoral plus peritumoral 4 mm) model achieved the best predictive performance, with AUCs of 0.877, 0.727, and 0.701, respectively, outperforming the VOI_I model (AUCs of 0.728, 0.698, and 0.653, respectively). Conclusions Radiomic features extracted from the peritumoral region can add value in predicting EGFR mutation status in patients with lung adenocarcinoma, with an optimal peritumoral range of 4 mm. Supplementary Information The online version contains supplementary material available at 10.1007/s11547-023-01722-6.

class (in classification tasks) or the average of the values (in regression tasks) of these neighbors.
One of the key advantages of the KNN algorithm is its simplicity and ease of implementation. It is also a non-parametric method, meaning it can handle data that does not conform to a specific distribution. KNN is also a very interpretable algorithm, as the predictions are based on the actual data points rather than a black-box model.
However, the KNN algorithm can be sensitive to the choice of K, which can greatly affect the quality of the predictions. It can also be computationally expensive for large datasets, as it requires calculating the distance between the query instance and all of the training instances.
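As a concrete illustration, the KNN classification rule can be sketched in a few lines of NumPy (the function name and toy data below are invented for this example, not taken from any particular library):

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Classify x_query by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training instance
    dists = np.linalg.norm(X_train - x_query, axis=1)
    # Indices of the k closest neighbors
    nearest = np.argsort(dists)[:k]
    # Majority class among the neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy 2-D dataset: class 0 clusters near the origin, class 1 near (5, 5)
X = np.array([[0.0, 0.0], [1.0, 0.5], [0.5, 1.0],
              [5.0, 5.0], [4.5, 5.5], [5.5, 4.5]])
y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X, y, np.array([0.8, 0.8]), k=3))  # near the class-0 cluster
print(knn_predict(X, y, np.array([5.2, 4.8]), k=3))  # near the class-1 cluster
```

Note that every prediction scans the full training set, which is exactly the computational cost mentioned above.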

c. Logistic Regression (LR)
LR is a statistical method used for binary classification, where the goal is to predict the probability of an event occurring. It is closely related to linear regression but uses a logistic function to transform the output of the linear model into a probability value between 0 and 1. In LR, the input features are used to calculate a linear combination, which is then transformed by the logistic function to produce the predicted probability of the event. The logistic function is a non-linear function that produces an S-shaped curve, which is useful for mapping the linear combination onto the probability scale.
The LR algorithm is trained using a labeled dataset, where each instance is associated with a binary label indicating the presence or absence of the event. The algorithm estimates the parameters of the logistic function using maximum likelihood estimation, a statistical method for finding the parameters that are most likely to have produced the observed data. One of the key advantages of LR is its interpretability: the output of the algorithm is a probability value, which can be easily understood and interpreted by humans. LR is also computationally efficient and can be trained on large datasets.
LR is commonly used in applications such as credit scoring, fraud detection, and medical diagnosis. It is a simple yet powerful machine learning method that is well suited to binary classification tasks.
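The mechanics above can be sketched in NumPy: a linear combination of the inputs is passed through the logistic (sigmoid) function, and the weights are fitted by gradient descent on the negative log-likelihood (a simple stand-in for the maximum likelihood estimation described above; all names and data here are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Fit weights by gradient descent on the mean negative log-likelihood."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)            # predicted probabilities
        grad_w = X.T @ (p - y) / len(y)   # gradient w.r.t. the weights
        grad_b = np.mean(p - y)           # gradient w.r.t. the intercept
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy 1-D data: the event becomes likely as the feature grows
X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w, b = fit_logistic(X, y)
probs = sigmoid(X @ w + b)
print(np.round(probs, 2))  # low probabilities for small x, high for large x
```

The fitted output is a probability per instance, which is what makes LR easy to interpret and threshold.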

d. Extremely Randomized Trees (ExtraTrees)
ExtraTrees is a machine learning algorithm used for both classification and regression problems. It is a type of ensemble learning method, meaning it combines multiple models together to produce a more accurate final prediction. ExtraTrees is similar to other decision-tree-based algorithms, such as Random Forest, but it has some important differences. In ExtraTrees, the trees are built using a random subset of the features, and at each split in the tree, the algorithm randomly selects a subset of candidate thresholds to determine the best split. These two randomization techniques help to reduce overfitting and improve the robustness of the model to noise and outliers.
ExtraTrees is also computationally efficient, since the trees are built independently of each other, and the randomization reduces the number of candidate thresholds that need to be evaluated. As a result, it can handle large datasets with high-dimensional feature spaces. It is particularly useful in situations where other methods may overfit or struggle with noisy data.
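Assuming scikit-learn is available, its `ExtraTreesClassifier` implements this algorithm; the sketch below runs it on synthetic data (the dataset and parameters are arbitrary choices for illustration, not from this study):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary classification problem with some uninformative features
X, y = make_classification(n_samples=400, n_features=20, n_informative=5,
                           random_state=0)

# ExtraTrees: random feature subsets AND random split thresholds at each node
extra = ExtraTreesClassifier(n_estimators=100, random_state=0)
score = cross_val_score(extra, X, y, cv=5).mean()
print(f"ExtraTrees 5-fold accuracy: {score:.3f}")
```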

e. CatBoost
CatBoost is an open-source machine learning algorithm that is designed to work well with both categorical and numerical data. It was developed by Yandex, a Russian search engine company, and was first released in 2017.
One of the key features of CatBoost is its ability to handle categorical features with high cardinality, which is a common challenge in many real-world datasets. It does this by combining "ordered boosting" with ordered target statistics: each categorical value is encoded using target statistics computed only from examples that precede it in a random permutation of the training data, which avoids target leakage and the resulting prediction shift.
In addition to its strong performance on datasets with categorical features, CatBoost also has several other useful features, such as built-in cross-validation, early stopping, and the ability to handle missing values in the data.
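The encoding idea can be made concrete with a toy NumPy implementation of ordered target statistics (an illustrative re-implementation of the principle, not CatBoost's actual code; the function name, smoothing prior, and data are invented for this example):

```python
import numpy as np

def ordered_target_stats(categories, targets, prior=0.5, seed=None):
    """Encode a categorical column with ordered target statistics.

    Each example is encoded using only the targets of examples that precede
    it in a random permutation, so an example's encoding never uses its own
    label (which prevents target leakage).
    """
    rng = np.random.default_rng(seed)
    n = len(categories)
    perm = rng.permutation(n)
    encoded = np.empty(n)
    sums, counts = {}, {}
    for idx in perm:  # walk the permutation, updating running statistics
        c = categories[idx]
        s, k = sums.get(c, 0.0), counts.get(c, 0)
        encoded[idx] = (s + prior) / (k + 1)  # smoothed mean of *previous* targets
        sums[c] = s + targets[idx]
        counts[c] = k + 1
    return encoded

cats = np.array(["a", "b", "a", "a", "b", "a"])
y = np.array([1, 0, 1, 1, 0, 1])
enc = ordered_target_stats(cats, y, seed=0)
print(enc)  # one numeric encoding per row, all values strictly inside (0, 1)
```

With 0/1 targets and this prior, every encoded value lies strictly between 0 and 1, and the first example visited for a category receives only the prior.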

f. eXtreme Gradient Boosting (XGBoost)
XGBoost is used for supervised learning tasks, such as classification, regression, and ranking. XGBoost is a type of gradient boosting algorithm, which works by iteratively training a series of weak models (usually decision trees) and combining their predictions to produce a final output. XGBoost uses a regularized version of gradient boosting that includes both L1 and L2 regularization, which helps to prevent overfitting and improve generalization.
One of the key features of XGBoost is its scalability, which allows it to handle very large datasets and train models quickly. It also has built-in cross-validation, early stopping, and support for missing values in the data.
Overall, XGBoost is a powerful and flexible algorithm that has become very popular in the machine learning community, winning many machine learning competitions on platforms such as Kaggle. It is widely used in industry and academia for a wide range of tasks, and is known for its high predictive accuracy and speed.
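The L2-regularized objective can be made concrete. Writing G and H for the sums of first- and second-order gradients over a leaf, lam for the L2 penalty, and gamma for the per-leaf penalty, the XGBoost paper gives the optimal leaf weight as -G/(H + lam) and a closed-form gain for each candidate split; a small sketch (function names are ours):

```python
def leaf_weight(G, H, lam=1.0):
    """Optimal leaf weight under the second-order objective with L2 penalty lam."""
    return -G / (H + lam)

def leaf_score(G, H, lam=1.0):
    """Objective reduction contributed by a leaf with gradient sums (G, H)."""
    return 0.5 * G * G / (H + lam)

def split_gain(GL, HL, GR, HR, lam=1.0, gamma=0.0):
    """Gain of splitting a node into left/right children; gamma penalizes the extra leaf."""
    return (leaf_score(GL, HL, lam) + leaf_score(GR, HR, lam)
            - leaf_score(GL + GR, HL + HR, lam) - gamma)

# A candidate split that separates strongly negative gradients (left)
# from positive ones (right) yields a positive gain, so it would be taken.
print(split_gain(GL=-10.0, HL=5.0, GR=5.0, HR=5.0, lam=1.0, gamma=0.0))
```

Increasing lam shrinks the leaf weights toward zero, and increasing gamma prunes splits whose gain does not cover the cost of a new leaf, which is how the regularization curbs overfitting.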

g. NeuralNetFastAI
NeuralNetFastAI is a machine learning library that provides an easy-to-use interface for building and training neural networks. It is built on top of PyTorch, one of the most popular deep learning frameworks, and is designed to make it easy for practitioners to build state-of-the-art deep learning models without requiring extensive knowledge of the underlying math and programming.
One of the key features of NeuralNetFastAI is its flexibility, which allows users to customize their neural network architectures and training processes in a variety of ways.
NeuralNetFastAI is a powerful and flexible library that is well-suited to a wide range of deep learning tasks, including image and text classification, object detection, and natural language processing.

h. NeuralNetTorch
NeuralNetTorch is a machine learning library that provides an interface for building and training neural networks using PyTorch, one of the most popular deep learning frameworks. It is designed to make it easier for practitioners to build, train, and experiment with neural network models, while also providing flexibility for customizing the architecture and training process.
One of the key features of NeuralNetTorch is its modular design, which allows users to build complex neural network architectures by stacking together a variety of different layer types. Users can choose from a range of activation functions, convolutional layers, recurrent layers, and pooling layers, among other options, to build customized networks that are tailored to their specific needs.
NeuralNetTorch also provides a number of useful tools for working with large datasets, including support for data augmentation and distributed training. It also has built-in functionality for visualizing and interpreting the results of model training, making it easier for practitioners to debug and fine-tune their models.
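The layer-stacking idea can be sketched framework-agnostically; the toy NumPy modules below (illustrative names, not NeuralNetTorch's or PyTorch's API) show how composing simple layers yields a network:

```python
import numpy as np

rng = np.random.default_rng(0)

class Dense:
    """A fully connected layer: y = x @ W + b."""
    def __init__(self, n_in, n_out):
        self.W = rng.normal(scale=0.1, size=(n_in, n_out))
        self.b = np.zeros(n_out)
    def __call__(self, x):
        return x @ self.W + self.b

class ReLU:
    """Rectified linear activation, applied element-wise."""
    def __call__(self, x):
        return np.maximum(0.0, x)

class Sequential:
    """Stack layers: the output of each layer feeds the next."""
    def __init__(self, *layers):
        self.layers = layers
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

# A small two-hidden-layer network, assembled by stacking modules
net = Sequential(Dense(4, 16), ReLU(), Dense(16, 8), ReLU(), Dense(8, 2))
out = net(rng.normal(size=(5, 4)))  # forward pass on a batch of 5 inputs
print(out.shape)
```

Swapping in different layer types (convolutional, recurrent, pooling) without touching the rest of the stack is exactly the flexibility the modular design provides.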

i. Light Gradient Boosting Machine (LightGBM)
LightGBM is designed to train gradient boosting decision tree models. It was developed by Microsoft and is known for its speed, scalability, and high accuracy.
One of the key features of LightGBM is its ability to handle large datasets with high dimensionality. It does this by using a technique called "gradient-based one-side sampling" (GOSS), which keeps the training instances with large gradients and randomly samples from those with small gradients, resulting in faster and more efficient training.
LightGBM also uses a technique called "leaf-wise tree growth", which repeatedly splits the leaf with the largest loss reduction rather than growing the tree level by level, resulting in deeper trees and more accurate predictions.
In addition to its speed and accuracy, LightGBM has a number of other useful features, such as built-in support for categorical features, early stopping, and parallel training on multi-core CPUs.
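The GOSS sampling step can be sketched in NumPy, following the description in the LightGBM paper (the function name and parameter choices are illustrative, not LightGBM's API):

```python
import numpy as np

def goss_sample(gradients, a=0.2, b=0.1, seed=None):
    """Gradient-based One-Side Sampling (GOSS).

    Keep the top-a fraction of instances by |gradient|, randomly sample a
    b fraction of the remainder, and upweight the sampled small-gradient
    instances by (1 - a) / b so gradient sums stay approximately unbiased.
    """
    rng = np.random.default_rng(seed)
    n = len(gradients)
    order = np.argsort(-np.abs(gradients))  # largest |gradient| first
    n_top = int(a * n)
    top = order[:n_top]                     # always kept
    rest = order[n_top:]
    sampled = rng.choice(rest, size=int(b * n), replace=False)
    indices = np.concatenate([top, sampled])
    weights = np.ones(len(indices))
    weights[n_top:] = (1.0 - a) / b         # compensate for the subsampling
    return indices, weights

grads = np.random.default_rng(0).normal(size=100)
idx, w = goss_sample(grads, a=0.2, b=0.1, seed=1)
print(len(idx))  # 30 of the 100 instances are used for this tree
```

Only 30% of the data is touched per tree here, which is where the speedup comes from, while the upweighting keeps the split-gain estimates roughly unbiased.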

Supplementary Tables
Supplementary Table S1.

Supplementary Table S4. The optimal classifier corresponding to each VOI; difference of AUC between the VOI_I model and the VOI_P models in the internal testing set. *VOI, volume of interest
Supplementary Table S5. Difference of AUC between the VOI_I model and the VOI_P models in the external testing set. *VOI, volume of interest