Introduction

High-definition videoendoscopy and laryngoscopy are nowadays the cornerstones of laryngeal examination. The use of the Narrow Band Imaging (NBI) optical modality, as opposed to standard White Light (WL) endoscopy, has been shown to enhance the visualization of vessel abnormalities and the characterization of diseases [1, 2]. Traditionally, the quality of the endoscopic evaluation has relied on the subjective expertise of medical practitioners, resulting in inherent inter- and intra-observer variability. Because of time constraints and limited patient compliance, a clear view of the larynx is often not obtained, with the possible missed detection of abnormal laryngeal conditions. Conversely, when a lesion is found but the quality of the view is suboptimal, the risk of misdiagnosis increases, with a consequent potential underestimation of malignant disorders, ultimately compromising patient care and management.

The advent of Artificial Intelligence (AI) has transformed numerous domains within the medical field, ushering in an era of unprecedented precision and efficiency [3]. Specifically, its application in medical imaging can be exploited to analyze the vast amount of information provided by videos and images, with the potential to enhance diagnostic accuracy and streamline clinical workflows. The integration of AI technologies with the imaging field is known as computer vision, and in upper aerodigestive tract endoscopy in particular, this technology has demonstrated unprecedented power in automating and standardizing the analysis of laryngoscopic images [4]. In recent years, several methods have been proposed for automated key frame extraction in endoscopic videos [5,6,7,8]. The most popular approach in this field is deep learning (DL), which allows computer systems to automatically learn how to recognize images by leveraging large datasets of annotated laryngoscopy images.

These solutions can have an important impact on otolaryngological practice. First, the automatic extraction of frames from endoscopic videos can assist the development of computer-aided detection [9], segmentation [10, 11], and diagnosis systems [12], as their outcomes rely on substantial amounts of well-annotated data. From the clinical point of view, standardizing the analysis of laryngoscopic data can enhance the accuracy of endoscopic diagnosis and facilitate the clinical decision-making process by automatically selecting and saving the keyframes with the optimal level of information, ultimately saving time when reviewing the examination. Finally, during the outpatient examination of the larynx, a method for recognizing informative frames could provide otorhinolaryngologists with real-time feedback on how to adjust the navigation of the endoscope toward the target, or direct their attention toward frames that clearly depict the laryngeal structures.

In this paper, we explore the development and validation of a DL model designed to classify informative frames in laryngoscopy. The contributions of this research are two-fold: first, we aim to demonstrate the feasibility and efficacy of AI in providing automatic informative laryngoscopy frame selection in both WL and NBI; second, by pursuing a real-time algorithm, we strive to lay the groundwork for its seamless integration into clinical practice, with the aim of providing visual feedback to guide the otolaryngologist during the examination.

Materials and methods

Study population and image datasets

The “NBI-InfFrames” dataset from Moccia et al. [8], which to our knowledge is the only publicly available dataset to date, was initially used for this study. It comprises 720 frames from NBI videos of n = 18 patients affected by laryngeal squamous cell carcinoma: 180 informative frames with proven good-quality information, 180 dark frames identified as underexposed, 180 blurred frames, and 180 frames exhibiting either saliva or specular reflections. The images were captured at a resolution of 1920 × 1072 pixels using an Olympus Visera Elite S190 video processor and an ENF-VH video rhino-laryngoscope (Olympus Medical System Corporation, Tokyo, Japan). However, the dataset presents some weaknesses: it consists only of NBI frames and is based on only a few patients, which might reduce the generalization capability of a model trained solely on it. Finally, with future screening campaigns in mind, informative frame selection should be applicable not only to patients with laryngeal carcinoma but also to those with other diseases and to healthy subjects.

To address these limitations, a new collection of both WL and NBI frames, referred to as the “Frame quality dataset”, was added to the existing dataset. The recordings of videolaryngoscopies (VLs) of 51 patients with a healthy larynx or affected by various laryngeal conditions (carcinoma, papillomatosis, polyps, Reinke’s edema, granulomas, and nodules) were retrieved from the Units of Otolaryngology and Head and Neck Surgery of the IRCCS San Martino Hospital-University of Genova (Italy) and the Hospital Clínic of Barcelona (Spain). Institutional Review Board approval was obtained at both institutions (CER Liguria: 169/2022; Reg. HCB/2023/0897).

At both centers, the VLs were captured in the office using a flexible naso-pharyngo-laryngoscope (HD Video Rhino-laryngoscope Olympus ENF-VH, Olympus Medical System Corporation, Tokyo, Japan) through a transnasal route. As a general policy of the two centers, the VLs were performed first using WL and then switching to NBI.

VideoLAN VLC media player v.3.0.18 was used for the automatic sampling of frames from the VLs. A supervised strategy was adopted for a binary classification task, so the extracted frames were divided into two categories according to the following criteria: frames with a close and clear view of the larynx as a whole, or of its subsites, without artifacts such as blur, saliva, specular reflections, or underexposure were classified as “informative”; all other extracted frames were classified as “non-informative”. At the University of Genova, three junior otolaryngologists (AI, AT, AP) independently classified the images; a senior laryngologist (GP) then revised the classification and gave the final approval. Similarly, at the University of Barcelona, three junior otolaryngologists (CS, LR, BA) labeled the frames, while a senior laryngologist (IV) reviewed the dataset.

The resulting database, i.e., the internal dataset, obtained from the union of the publicly available image collection [8] and the University of Genova dataset, included a total of 5147 frames. It was organized with a patient-wise allocation method into 3126 frames for training, 1317 frames for validation, and the remaining 704 frames (346 “informative” and 358 “uninformative”) for testing. Data were then further augmented by applying 10% vertical and horizontal shifts, 10% zoom, rotations ranging from 0° to 10°, and horizontal flips.
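As an illustration, the augmentation described above could be configured as follows with the Keras ImageDataGenerator API; the exact pipeline, input size, and directory layout used for training are not reported in the text and are therefore assumptions.

```python
# Sketch of the described augmentation (10% shifts, 10% zoom, 0-10 degree
# rotations, horizontal flips); input size and folder layout are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_augmenter = ImageDataGenerator(
    width_shift_range=0.10,    # 10% horizontal shift
    height_shift_range=0.10,   # 10% vertical shift
    zoom_range=0.10,           # 10% zoom in/out
    rotation_range=10,         # rotations between 0 and 10 degrees
    horizontal_flip=True,      # random horizontal flip
    rescale=1.0 / 255,         # assumed intensity normalization
)

# Hypothetical directory layout: one sub-folder per class
# ("informative", "uninformative") inside data/train.
train_iterator = train_augmenter.flow_from_directory(
    "data/train",
    target_size=(224, 224),    # assumed network input size
    batch_size=8,
    class_mode="categorical",
)
```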

The dataset retrieved from the Hospital Clínic of Barcelona was used as an external set to validate the performance of the selected DL model on data acquired and annotated independently of the training set, allowing the robustness of the proposed method to be tested. Data variability is associated with the geographical diversity of the involved population, the endoscopic technique, and the annotation process conducted by different physicians. The external testing set consisted of 646 NBI and WL laryngeal frames (informative, n = 306; uninformative, n = 340).

The breakdown and content of the datasets are reported in Table 1.

Table 1 Internal and external datasets collection

Deep learning models selection

The deployment of computer-aided systems in real-world settings demands that DL convolutional neural networks (CNNs) be light in terms of architecture, so that the number of mathematical operations required to process an input image, measured in floating point operations (FLOPs), remains low, allowing faster inference times and simpler hardware requirements. A shallow CNN comprising three convolutional blocks and an equivalent number of fully connected layers was built to this end. In each convolutional block, a 3 × 3 convolution was followed by batch normalization and a Rectified Linear Unit (ReLU) activation. The final number of model parameters was less than 0.3 million.
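A minimal sketch of such a shallow CNN, assuming a 224 × 224 input and arbitrary filter counts (the exact configuration is not reported in the text), could look as follows in Keras:

```python
# Shallow CNN with three conv blocks (3x3 conv -> batch norm -> ReLU) and
# three fully connected layers; filter counts and pooling are assumptions.
from tensorflow.keras import layers, models

def conv_block(x, filters):
    """3x3 convolution -> batch normalization -> ReLU, followed by pooling."""
    x = layers.Conv2D(filters, (3, 3), padding="same")(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    return layers.MaxPooling2D((2, 2))(x)

inputs = layers.Input(shape=(224, 224, 3))
x = conv_block(inputs, 16)
x = conv_block(x, 32)
x = conv_block(x, 64)
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
outputs = layers.Dense(2, activation="softmax")(x)  # informative vs. uninformative

shallow_cnn = models.Model(inputs, outputs)
shallow_cnn.summary()  # well under the 0.3 M parameter budget with these assumptions
```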

However, such a basic architecture may not be able to extract sufficient features to correctly predict the informativeness of a frame. Therefore, to evaluate which neural networks could meet the conditions mentioned above, a careful analysis of the literature on the processing of laryngeal endoscopic frames was conducted. This included the work by Yao et al. [7], who presented a very detailed comparison of the performance of various state-of-the-art networks. Based on their results, two deeper state-of-the-art networks were explored in this work: ResNet-50 and MobileNetv2. The advantage of the former lies in its skip connections, or “residual blocks”, which allow better gradient flow and thus improved accuracy, while being a relatively compact ResNet variant keeps inference fast. On the other hand, MobileNetv2 is well suited for mobile and embedded devices with limited computational resources, thanks to its use of depthwise separable convolutions. After modifying the top classification layer to output, through a softmax function, the probability of a frame belonging to each of the informative and uninformative classes, ResNet-50 had 23.8 million parameters, while MobileNetv2 had 2.4 million.
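For illustration, the two backbones with the modified two-class softmax head could be instantiated as follows with Keras; the input size and the random (non-ImageNet) initialization are assumptions, the latter consistent with the pretext pre-training described below.

```python
# Sketch of the two deeper backbones with a two-class softmax head;
# input size, pooling, and initialization are assumptions.
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50, MobileNetV2

def build_classifier(backbone_cls):
    backbone = backbone_cls(include_top=False, weights=None,
                            input_shape=(224, 224, 3), pooling="avg")
    outputs = layers.Dense(2, activation="softmax")(backbone.output)
    return models.Model(backbone.input, outputs)

resnet50_model = build_classifier(ResNet50)        # ~23.8 M parameters reported in the text
mobilenetv2_model = build_classifier(MobileNetV2)  # ~2.4 M parameters reported in the text
```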

Self-supervised pretext task

Instead of standard ImageNet-based pre-training, the pretext task strategy presented by Gidaris et al. [13] was adopted to speed up the training of the classification model and enhance its performance. The pretext strategy is a self-supervised learning process that divides the computer vision pipeline into two tasks: the real task, e.g., any classification or detection task for which not enough annotated data are available, and the pretext task, which enables self-supervised visual representation learning. For the pretext task, the dataset was derived by randomly sampling 232 frames from the internal set. From these frames, patches measuring 512 × 512 pixels were cropped and subjected to rotations of 0°, 90°, 180°, and 270°. The three networks were pre-trained for 200 epochs on the rotated patches to predict the correct rotation angle, using the categorical cross-entropy loss, the Adam optimizer, a batch size of 8, an initial learning rate of 0.0001, and the following learning rate decay rule:

$$lr = \frac{1}{1 + 0.9 \cdot \mathrm{epoch}} \cdot lr_{0}.$$

The best-performing weights obtained during this phase were used to initialize the corresponding models for the real task, which involved training each CNN for 200 epochs with the same hyperparameters as the pretext task in order to classify informative frames.
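A minimal sketch of this rotation pretext task and of the decay rule above is shown below; the patch handling, the backbone shown (ResNet-50), and all variable names are assumptions.

```python
# Rotation-prediction pretext task (Gidaris et al.) with the learning-rate
# decay rule given above; patch loading is omitted and assumed external.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

LR0 = 1e-4  # initial learning rate from the text

def make_rotation_dataset(patches):
    """Rotate each 512x512 patch by 0, 90, 180 and 270 degrees; the rotation
    index (0-3) becomes the self-supervised label."""
    images, labels = [], []
    for patch in patches:
        for k in range(4):
            images.append(np.rot90(patch, k))
            labels.append(k)
    return np.stack(images).astype("float32"), tf.keras.utils.to_categorical(labels, 4)

def lr_decay(epoch, current_lr=None):
    """Decay rule from the formula above: lr = lr0 / (1 + 0.9 * epoch)."""
    return LR0 / (1.0 + 0.9 * epoch)

# Temporary 4-way rotation head on top of the backbone (ResNet-50 shown here).
backbone = ResNet50(include_top=False, weights=None,
                    input_shape=(512, 512, 3), pooling="avg")
pretext_model = models.Model(
    backbone.input, layers.Dense(4, activation="softmax")(backbone.output))
pretext_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=LR0),
                      loss="categorical_crossentropy", metrics=["accuracy"])

# patches: 512x512x3 crops sampled from the internal set (not included here).
# x, y = make_rotation_dataset(patches)
# pretext_model.fit(x, y, epochs=200, batch_size=8,
#                   callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_decay)])
```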

Outcome evaluation

The outcomes of the DL models were evaluated by comparing the predicted classes with the ground-truth classes. True positives (TPs) represent the number of frames correctly predicted as belonging to the informative class, while true negatives (TNs) account for the number of uninformative frames correctly predicted. False negatives (FNs) correspond to informative frames not identified by the model, and, conversely, false positives (FPs) refer to uninformative frames wrongly categorized as informative. Based on these definitions, the models’ performance was assessed by calculating standard evaluation metrics such as precision, recall, F1-score, and the Receiver Operating Characteristic (ROC) curve, as previously reported and explained in the literature [4].
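For illustration, these metrics can be computed from the model outputs with scikit-learn, as in the following sketch with placeholder arrays:

```python
# Standard evaluation metrics computed from placeholder predictions.
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score, roc_curve, auc

y_true = np.array([1, 1, 0, 0, 1, 0])              # 1 = informative, 0 = uninformative
y_prob = np.array([0.9, 0.4, 0.2, 0.7, 0.8, 0.1])  # predicted P(informative)
y_pred = (y_prob >= 0.5).astype(int)               # thresholded predictions

precision = precision_score(y_true, y_pred)        # TP / (TP + FP)
recall = recall_score(y_true, y_pred)               # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                        # harmonic mean of precision and recall
fpr, tpr, _ = roc_curve(y_true, y_prob)              # points of the ROC curve
roc_auc = auc(fpr, tpr)                              # area under the ROC curve
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f} AUC={roc_auc:.2f}")
```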

To provide better insights into the workings of the CNNs, the gradient-weighted Class Activation Mapping (gradCAM) algorithm was implemented [14]. This algorithm outputs a heatmap of the regions of an image that the network relied on to make its prediction, obtained by computing the gradients of the predicted class with respect to the final convolutional layer’s feature maps. It allows users to understand the DL model’s decisions, thereby facilitating debugging and improvement.
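A minimal gradCAM sketch following this idea is shown below, assuming a Keras functional model and a hypothetical name for the final convolutional layer:

```python
# Grad-CAM sketch: gradients of the predicted class w.r.t. the last conv
# feature maps, reduced to a normalized heatmap. Layer name and preprocessing
# are assumptions and depend on the chosen backbone.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, last_conv_layer_name, class_index=None):
    """Return a [0, 1] heatmap of the regions driving the prediction for one image."""
    grad_model = tf.keras.models.Model(
        model.input,
        [model.get_layer(last_conv_layer_name).output, model.output])

    with tf.GradientTape() as tape:
        conv_maps, predictions = grad_model(image[np.newaxis, ...])
        if class_index is None:
            class_index = int(tf.argmax(predictions[0]))
        class_score = predictions[:, class_index]

    grads = tape.gradient(class_score, conv_maps)          # gradients of the class score
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))         # per-channel importance
    cam = tf.reduce_sum(conv_maps[0] * weights, axis=-1)    # weighted sum of feature maps
    cam = tf.nn.relu(cam)                                   # keep positive evidence only
    cam = cam / (tf.reduce_max(cam) + 1e-8)                 # normalize to [0, 1]
    return cam.numpy()

# Example call (hypothetical layer name for a Keras ResNet-50 backbone):
# heatmap = grad_cam(resnet50_model, frame, "conv5_block3_out")
```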

Video testing

Four unedited preoperative videolaryngoscopies, not used for the model’s training, were selected to test the computational speed of the model and to simulate real-time frame quality classification during an examination. In clinical practice, the assessment indeed occurs through the analysis of videos, where temporal information also holds significant value. Thus, the informativeness of the selected videos was analyzed in a dynamic context by measuring the Video Informativeness Level (VIL), i.e., the fraction of frames predicted as informative out of the total number of examined frames:

$$VIL = \frac{\text{number of video frames predicted as informative}}{\text{total number of frames}} \cdot 100\%.$$

This measure quantifies the portion of the video that is truly relevant to the examiner, with potential implications for selective data storage and a faster review process.
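A direct translation of the VIL definition into code could be, for example:

```python
# VIL as defined above; `frame_predictions` is a placeholder list of
# per-frame labels produced by the classifier.
def video_informativeness_level(frame_predictions):
    """Fraction of frames predicted as informative, expressed as a percentage."""
    informative = sum(1 for label in frame_predictions if label == "informative")
    return 100.0 * informative / len(frame_predictions)

# Toy example: a 5-frame video with 3 informative frames -> VIL = 60%.
print(video_informativeness_level(
    ["informative", "uninformative", "informative", "informative", "uninformative"]))
```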

The classification of each frame is presented to the user as a colored border around the frame (green = informative, red = uninformative). To enable real-time processing, the pretext ResNet-50 model was developed with the Keras library, compressed, and then saved in the ONNX format. This allowed the informativeness of any frame to be inferred in less than 0.020 s on average on an NVIDIA RTX™ A6000 GPU (48 GB of memory) and through different deep-learning frameworks (e.g., TensorFlow or PyTorch).
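As a sketch of this real-time pipeline, assuming an exported ONNX file, a 224 × 224 RGB input, and OpenCV for display (none of which are specified in the text beyond the ONNX format), the per-frame loop could look as follows:

```python
# Per-frame ONNX inference with a colored border (green = informative,
# red = uninformative); file names, input size, and preprocessing are assumptions.
import cv2
import numpy as np
import onnxruntime as ort

# One-off export of the trained Keras model (hypothetical variable `resnet50_model`):
#   import tf2onnx
#   tf2onnx.convert.from_keras(resnet50_model, output_path="informative_resnet50.onnx")

session = ort.InferenceSession("informative_resnet50.onnx",
                               providers=["CUDAExecutionProvider", "CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

capture = cv2.VideoCapture("laryngoscopy_video.mp4")  # hypothetical test video
while True:
    ok, frame = capture.read()
    if not ok:
        break
    # Assumed preprocessing: BGR -> RGB, resize, scale to [0, 1], add batch dim.
    rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    x = cv2.resize(rgb, (224, 224)).astype(np.float32) / 255.0
    probs = session.run(None, {input_name: x[np.newaxis]})[0]
    prob_informative = probs[0, 1]  # assuming class index 1 = informative

    color = (0, 255, 0) if prob_informative >= 0.5 else (0, 0, 255)
    framed = cv2.copyMakeBorder(frame, 20, 20, 20, 20, cv2.BORDER_CONSTANT, value=color)
    cv2.imshow("Frame informativeness", framed)
    if cv2.waitKey(1) & 0xFF == ord("q"):
        break
capture.release()
cv2.destroyAllWindows()
```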

Statistical analysis

Categorical variables are reported as frequencies and percentages, while continuous variables are reported as medians and first–third quartiles. The McNemar test was used to statistically compare the performance of the three CNNs on the internal test set. The diagnostic performance of the selected model was compared between the internal and external test sets using Receiver Operating Characteristic (ROC) curves and Area Under the ROC Curve (AUC) calculations. Pairwise comparisons of AUCs were conducted using a Z-test [15, 16]. A two-sided p < 0.05 was considered significant for both tests. Statistical analysis was carried out using Python version 3.9 (packages scipy.stats and statsmodels version 0.13.2).
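For illustration, the McNemar comparison of two classifiers evaluated on the same test frames could be computed with statsmodels as in the following sketch with placeholder data:

```python
# McNemar test on paired per-frame correctness of two models; arrays are placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

correct_a = np.array([1, 1, 0, 1, 1, 0, 1, 1])  # 1 = model A correct on that frame
correct_b = np.array([1, 0, 0, 1, 1, 1, 1, 1])  # 1 = model B correct on that frame

# 2x2 table of agreements/disagreements between the two models.
table = [[np.sum((correct_a == 1) & (correct_b == 1)),
          np.sum((correct_a == 1) & (correct_b == 0))],
         [np.sum((correct_a == 0) & (correct_b == 1)),
          np.sum((correct_a == 0) & (correct_b == 0))]]

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"McNemar p-value: {result.pvalue:.3f}")
```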

Results

Internal dataset

Figure 1 summarizes the methodological process. The classification metrics of the three CNNs on the internal test set (704 frames), after pretext task-based weight initialization and model training, are displayed in Table 2. The simple architecture of the shallow CNN often failed to recognize informative frames, yielding the lowest performance: despite recognizing more than three-quarters of the informative frames, its average F1-score was only 81%. With MobileNetv2, the frames were classified with a precision and F1-score of 95% and a recall of 96%. This comparison between the shallow CNN and MobileNetv2 confirms that a deeper architecture is necessary to extract the proper features for larynx inspection. However, the ResNet-50 model achieved the best performance during the internal testing phase: out of the 704 test frames, 673 were correctly classified, resulting in a precision of 95%, a recall of 97%, and an F1-score of 96%. Figure 2 shows some examples of the ResNet-50 model outputs.

Fig. 1
figure 1

Flow chart of data utilization for the current dataset. WL white light, NBI narrow band imaging

Table 2 Performances of the deep learning models on the internal and external test sets
Fig. 2
figure 2

Examples of frames from the internal test set. In the first column, the original image is shown. In the second column, the prediction is output as a green frame for informative frames and as a red frame for uninformative frames. In the third column, the areas used by the model for the prediction are output according to a gradCAM heatmap

The performances of the shallow CNN vs. MobileNetv2 and of the shallow CNN vs. ResNet-50 were significantly different (p < 0.05), while MobileNetv2 and ResNet-50 performed similarly on the internal set. Given its superior F1-score, the ResNet-50 model was chosen for the external test evaluation and the video testing. Finally, an ablation study was conducted to validate the efficiency of the pretext task, demonstrating that the ResNet-50 F1-score improved by one percentage point compared to the results achieved with standard ImageNet-based weights.

External dataset

Table 2 also reports the results of the pretext ResNet-50 model on the external set: precision, recall, and F1-score were 97%, 89%, and 93%, respectively. Figure 3 shows some graphical examples of the pretext ResNet-50 model outputs on the external test set. Figure 4 shows the pretext ResNet-50 ROC curves for the internal and external sets. No statistically significant difference was found between the two test sets (p = 0.062).

Fig. 3
figure 3

Examples of frames from the external test set. In the first column, the original image is shown. In the second column, the prediction is output as a green frame for informative frames and as a red frame for uninformative frames. In the third column, the areas used by the model for the prediction are output according to a gradCAM heatmap

Fig. 4
figure 4

The model (ResNet-50, pre-trained with the pretext strategy) performance according to the receiver operating characteristics (ROC) curves for the internal (blue) and external (orange) test sets

Video dataset

Table 3 summarizes the video characteristics. With the current experimental hardware, the median computing time for the four videos was 0.015 [0.012–0.018] s per frame. Figure 5 shows the evolution of the Video Informativeness Level (VIL) during the different test videos. The four testing videos are provided in the supplemental materials.

Table 3 Characteristics of the tested laryngoscopy videos
Fig. 5
figure 5

Plots of the model’s (ResNet-50, pre-trained with the pretext strategy) prediction trend during each test video. The black line represents the probability of the frame being classified as “informative” (green background) or “uninformative” (red background). The blue line represents the average probability of the frames of the whole video being assigned to the category “informative”. At the end of each plot, a histogram with the Video Informativeness Level (VIL) value for the same video is reported. Videos 2 and 3 show the in-office diagnostic laryngoscopy of the same patient before and after local anesthesia

Discussion

The introduction of an automatic informative frame classifier offers several advantages, including potential time savings when reviewing endoscopic examinations, support for the clinical decision-making process, and the pre-selection of frames to be labeled and processed by computer-aided disease detection and diagnosis systems.

In this study, we tested the ability of several DL models (from a simple, lightweight shallow CNN to the deeper state-of-the-art MobileNetv2 and ResNet-50 networks) to classify informative frames. The latter proved capable of identifying good-quality frames with diagnostic relevance in laryngoscopic videos with real-time performance. So far, this topic has been addressed by only a few publications. The publicly available small dataset of 720 frames by Moccia et al. [8], incorporated in our study, was used by Patrini et al. [6] to test whether an automated DL strategy could improve the accuracy of this task. Indeed, the CNN they employed outperformed the original results achieved with a traditional machine learning approach, suggesting that DL models should be the first choice in this field. On the same dataset, Galdran et al. [5] used a similar DL strategy with a lightweight CNN, achieving similar diagnostic performance but with lower computational time, paving the way for real-time implementation. Yao et al. [7] in 2022 investigated the same task on 22,132 laryngoscopic frames (from healthy patients and patients with vocal fold polyps), performing a binary informative/uninformative classification and achieving comparable outcomes (recall = 0.76–0.90; precision = 0.63–0.94; F1-score = 0.69–0.92). The same group recently published a further step in this research, investigating the role of the previously developed algorithm in facilitating DL model development by automatically extracting informative frames to train a DL classifier for diagnosing vocal fold polyps [17]. Their results showed that the classifier trained on machine-labeled frames performed better than the classifier trained on human-labeled frames. This experience suggests that the algorithm presented in this paper can also play a role in the development of future DL models.

Nevertheless, in none of these studies was an attempt made to implement the algorithms on videoendoscopies. In the present work, we tested our DL model on videos simulating clinical use by developing a user interface capable of classifying the informativeness of the video frames in real time. This enables immediate feedback to the clinician during the endoscopy. At the end of each video, the Video Informativeness Level (VIL) never exceeded 60%, suggesting that our system could potentially reduce the size of the data to be stored and subsequently reviewed.

The video tests showed that, even though the model did not achieve perfect recall and precision, its implementation is clinically reliable. Moreover, misclassification cases are largely attributable to the criteria used during the frame labeling process, which is a highly subjective evaluation since there is no gold standard for assessing the quality of a frame. As shown in Fig. 6, frames with a slight blur that is not sufficient to set the ground truth as uninformative (e.g., third row), as well as frames exhibiting reflections along with a clear view of the vocal cords (e.g., first and second rows), can be borderline cases. Furthermore, real-time frame classification could also help improve the quality of endoscopic examinations by informing clinicians about the quality of the collected video and allowing them, for instance, to identify the moments during the endoscopy that should be submitted to a classification algorithm to predict lesion histology [12, 18, 19].

Fig. 6
figure 6

Examples of mispredicted frames. The first column shows the original frames. The second column shows how the frame was originally labeled by the physician as “I” (informative) or “U” (uninformative). In the third column, the areas considered by the model for the prediction are output according to a gradCAM heatmap. In the fourth column, the prediction is output as a green frame for informative frames and as a red frame for uninformative frames

Conclusions

In this study, we introduce an automated approach for assessing frame quality during laryngoscopy examinations. Notably, our work establishes the first extensively generalized multi-center and multi-modality dataset designed for this specific purpose. The proposed DL method demonstrated superior performance in identifying diagnostically relevant frames within laryngoscopic videos. With its solid accuracy and real-time capabilities, the system stands poised for deployment in clinical settings, either autonomously for objective quality control or in conjunction with other algorithms within a comprehensive AI toolset aimed at enhancing tumor detection and diagnosis. The next phase of our research will involve subjecting the system to clinical trials, aiming to validate its effectiveness and reliability in real-world medical scenarios.