1 Introduction

Commonly known as the voice box, the larynx is a tube-shaped organ roughly 5 cm in length. It is responsible for speech, plays a vital role in respiration, and prevents food particles from entering the airway and the rest of the respiratory system. The larynx is one of the most common sites of Head and Neck cancers [1]. Smoking tobacco and consumption of alcohol are the major etiologic factors contributing to laryngeal cancer [2]. More than 180,000 people worldwide are diagnosed with laryngeal cancer every year [3]. As the prognosis worsens with advancing stage, early detection, diagnosis, and effective treatment are essential for better outcomes.

The larynx extends from the base of the tongue to the cricoid cartilage, just before the start of the trachea. It is composed of bone, muscle, cartilage, and a mucosal lining. The three main subsites of the larynx are the supraglottis, the glottis, and the subglottis, as represented in Fig. 1. Cancer can develop in any or all of the subsites of the larynx, and its incidence across subsites is not uniform: supraglottic cancers account for 30–50% of all cases, glottic cancers for 50–70%, and subglottic cancers are rare at 0–1% of all laryngeal cancers [4].

Fig. 1 Laryngeal anatomy and its subsites [5]

When cancer is detected in the larynx, the Tumor-Node-Metastasis (TNM) staging system, a standard followed by the American Joint Committee on Cancer, is used to describe the malignant tumour. The TNM stage describes the specific location of the cancer within the larynx, the extent of its spread, and whether other parts of the body have been affected. The T-staging is subsite-specific, and the T-number assigned depends on the spread of the tumour within the affected subsite. The N-stage reports the extent of regional lymph node involvement. The M-stage indicates whether the cancer has metastasized [6]. Based on the diagnostic and staging information collected, a treatment plan is designed for the patient. Depending on the severity of the case, the course of treatment may be a combination of surgery, radiation therapy, and chemotherapy. Cancers of the larynx diagnosed in the early stages have a high chance of organ preservation and are highly curable. In contrast, those diagnosed in advanced stages often require total laryngectomy, which has a severe impact on the patient's quality of life [7], as it affects day-to-day activities at both a physical and psychological level.

Imaging helps determine the extent of the disease. Various radiographic imaging modalities are commonly used to investigate the spread of disease, as illustrated in Table I. It is important to note that both over-staging and under-staging of disease are detrimental to treatment: the former causes removal of healthy tissue and decreases laryngeal function, while the latter leaves cancerous tissue behind at the site. In the early stages of cancer, where the nodule is very small, neck imaging may be waived if surgery is not a mode of treatment. However, for advanced stages, Computed Tomography (CT) or Magnetic Resonance Imaging (MRI) scans are performed to determine the spread of the tumour and the invasion of cancerous tissue. Laryngeal cancer tends to metastasize to the lungs and the liver, and chest X-rays are used to investigate such metastases. Additionally, though rarely, abdominal CT or liver ultrasound scans are performed to analyse the extent and spread of disease. Positron emission tomography (PET)/CT is another imaging modality, but its relevance in laryngeal cancer diagnosis is still debated.

Table I Imaging modalities commonly used to investigate the spread of laryngeal cancer

The presentation of a laryngeal tumour on different imaging modalities is very dissimilar, even for the same patient, as illustrated in Fig. 2. Segmentation techniques and their parameters vary widely depending on the inputs provided. Conceivably, a segmentation algorithm that works accurately on one imaging modality is likely to perform poorly on another. Therefore, while designing a segmentation technique, it is important to know the characteristics of a particular modality for a given organ and how a tumour presents itself on it.

Fig. 2 Laryngeal tumor delineated on (a) contrast-enhanced CT, (b) MRI and (c) PET CT [8]

This paper focuses on the work that has been done in automating the detection and segmentation of the larynx and laryngeal tumours on CT images. The reason for choosing this modality is that CT remains the most widely used imaging modality for this disease. Currently, no imaging standard exists for laryngeal cancer, and the preferred imaging modality is determined case by case [9]. Contrast-enhanced CT scans are relatively quick to acquire, efficient in terms of cost and computational requirements, and tolerant to slight movements during acquisition, which makes CT a readily available and cost-efficient option. MRI has a greater acquisition cost in terms of time, computation, and resources, and optimization of MR techniques is difficult in this complex part of the body [9]. A CT image can be captured much faster (about 60 s) while giving results comparable to MRI, which takes much longer to acquire (about 30 min); CT therefore also requires fewer computing resources. The short capture time yields clearer images that are less affected by patient movements such as coughing, sneezing, and breathing [10].

2 Importance of Segmentation

Segmentation is the process of localizing and delineating anatomical structures and tumours in a medical image. Early cancer detection works toward preventing advancement to later stages, thereby increasing the chances of a complete recovery. Under-staging a tumour can cause parts of the cancerous tissue to be left behind after treatment. However, if a tumour is over-staged, this can lead to unnecessary loss of healthy tissue, limiting laryngeal function instead of preserving the organ [11]. It is therefore of utmost importance to accurately identify the cancerous tissue and plan the course of treatment. Segmentation of CT images provides valuable spatial and contextual information, as it simplifies analysis and follow-up tasks such as treatment planning and TNM staging.

The T stage depends on the affected anatomical structures, which may appear in only a single slice depending on the slice thickness of the CT image. The size of the tumour and the anatomies affected are crucial pieces of information that determine the stage, which in turn determines the treatment course.

There are a few challenges in using CT imaging for laryngeal cancer diagnosis. Conventional contrast CT images do not capture the movement of the vocal cords, which is an essential parameter in the T-staging of glottic tumours; vocal cord mobility has to be observed via clinical examination, or a phonation CT image has to be captured. Iodine-enhanced tumours and non-ossified cartilage are challenging to distinguish. Cartilage invasion can also cause over-staging, as it can appear indistinguishable to the human eye [12].

2.1 Dice Similarity Coefficient

The Dice Similarity Coefficient (DSC) is a widely used metric to quantitatively evaluate the performance of medical image segmentation methods. The segmentation predicted by the segmentation algorithm (S_Algorithm) is evaluated against the ground truth segmentation (S_GroundTruth), as illustrated in Fig. 3: DSC = 2 |S_Algorithm ∩ S_GroundTruth| / (|S_Algorithm| + |S_GroundTruth|). The DSC ranges from 0 (no overlap) to 1 (perfect overlap), so high values are desirable. It mitigates class imbalance by penalizing false-positive regions while rewarding accurately segmented regions.

Fig. 3 Illustration of the Dice Coefficient
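
A concrete illustration of this definition is given in the minimal Python sketch below, which computes the DSC between two binary masks; the toy arrays and the function name are illustrative and not taken from any of the reviewed studies.

```python
# Minimal sketch: Dice Similarity Coefficient for binary segmentation masks,
# following DSC = 2|A ∩ B| / (|A| + |B|) as defined above.
import numpy as np

def dice_coefficient(pred: np.ndarray, truth: np.ndarray, eps: float = 1e-8) -> float:
    """Dice overlap between a predicted and a ground-truth boolean mask."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)

# Toy example: two overlapping 2D masks
a = np.zeros((4, 4), dtype=bool); a[1:3, 1:3] = True   # 4 foreground pixels
b = np.zeros((4, 4), dtype=bool); b[1:3, 1:4] = True   # 6 foreground pixels
print(dice_coefficient(a, b))  # 2 * 4 / (4 + 6) = 0.8
```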

3 Automated Segmentation Methods

Automated segmentation of sub-anatomical structures and lesions in imaging data saves detection and diagnostic time for the radiologist and the clinician. Manual outlining of the larynx Region of Interest (ROI) by a radiologist can take up to 2.5 h for a single CT image series [13].

There are various approaches to automate the segmentation process.

3.1 Atlas Based Segmentation

Atlas-based methods are widely used in medical image processing for segmentation of anatomy and regions of interest [14]. An atlas is a medical image that is pre-labelled with its segmentation and serves as ground truth for the algorithm. When a novel image is to be segmented, it is aligned to the atlas. Alignment is performed by matching key anatomical structures in the atlas with the same structures in the novel image, by matching pixel intensities, or by a variety of other methods, with the help of image transformations. After alignment, the atlas labels are propagated through the computed transformation to generate a segmentation for the novel image.
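
A minimal sketch of this single-atlas pipeline is shown below, assuming the SimpleITK library and placeholder file names (novel_ct.nii.gz, atlas_ct.nii.gz, atlas_labels.nii.gz); the registration settings are illustrative and not those of any study reviewed here.

```python
# Minimal sketch: register a labelled atlas to a new CT image, then propagate
# the atlas labels with the estimated transform (placeholder file names).
import SimpleITK as sitk

novel = sitk.ReadImage("novel_ct.nii.gz", sitk.sitkFloat32)    # image to segment
atlas = sitk.ReadImage("atlas_ct.nii.gz", sitk.sitkFloat32)    # pre-labelled atlas image
atlas_labels = sitk.ReadImage("atlas_labels.nii.gz")            # atlas segmentation

# Rigid registration aligning the atlas (moving) to the novel image (fixed)
initial = sitk.CenteredTransformInitializer(
    novel, atlas, sitk.Euler3DTransform(),
    sitk.CenteredTransformInitializerFilter.GEOMETRY)
reg = sitk.ImageRegistrationMethod()
reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
reg.SetOptimizerAsGradientDescent(learningRate=1.0, numberOfIterations=200)
reg.SetInterpolator(sitk.sitkLinear)
reg.SetInitialTransform(initial, inPlace=False)
transform = reg.Execute(novel, atlas)

# Propagate the atlas labels to the novel image; nearest-neighbour keeps labels discrete
segmentation = sitk.Resample(atlas_labels, novel, transform,
                             sitk.sitkNearestNeighbor, 0, atlas_labels.GetPixelID())
sitk.WriteImage(segmentation, "novel_segmentation.nii.gz")
```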

A database of multiple atlases can be used instead of a single representative image to increase segmentation accuracy; this is known as Multi-Atlas Based Segmentation [15]. Fig. 4 illustrates the segmentation of laryngeal anatomical substructures achieved through an atlas-based registration technique. Atlas-based registration works well for anatomy in a given modality because anatomical structures occur predictably, unlike malignancies.

Fig. 4 Atlas-based registration of laryngeal anatomic substructures [16]

3.2 Supervised and Unsupervised Image Segmentation

Supervised segmentation algorithms use a set of labelled, pre-categorized images, together with a priori knowledge and human input, to design the segmentation model. They are quite powerful for automatic segmentation of medical images, albeit not versatile when there is large variance between the training and test sets. They apply classification and regression techniques to the segmentation task.
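
As one illustrative instance of the classification route, the sketch below trains a pixel-wise random forest classifier on simple intensity features from labelled slices; the feature choices, array shapes, and placeholder data are assumptions for the example, not the method of any reviewed study.

```python
# Minimal sketch: supervised, pixel-wise segmentation with a random forest.
# Each voxel is described by simple features and classified as larynx / background.
import numpy as np
from scipy.ndimage import gaussian_filter
from sklearn.ensemble import RandomForestClassifier

def voxel_features(slice_hu: np.ndarray) -> np.ndarray:
    """Per-voxel features: raw intensity plus two Gaussian-smoothed versions."""
    feats = [slice_hu, gaussian_filter(slice_hu, 1.0), gaussian_filter(slice_hu, 3.0)]
    return np.stack([f.ravel() for f in feats], axis=1)

# Training data: labelled slices (placeholder arrays standing in for CT slices and masks)
train_slices = [np.random.randn(128, 128) for _ in range(4)]
train_masks = [np.random.rand(128, 128) > 0.9 for _ in range(4)]
X = np.concatenate([voxel_features(s) for s in train_slices])
y = np.concatenate([m.ravel() for m in train_masks])
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Inference: predict a binary mask for an unseen slice
test_slice = np.random.randn(128, 128)
pred_mask = clf.predict(voxel_features(test_slice)).reshape(test_slice.shape)
```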

Unsupervised image segmentation algorithms work with unlabelled data for classification and segmentation. Without knowledge of the final outcome, they split non-homogeneous regions into sub-regions using statistical parameters of the image. Clustering, association, and dimensionality reduction are commonly used techniques in this approach. Unsupervised image segmentation needs a larger representative dataset to produce results comparable to supervised methods, as no label information is available. However, the time spent labelling the images is saved, and the size of the dataset compensates for the lack of detailed input to the model.
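
A minimal sketch of one such unsupervised approach is given below: k-means clustering of voxel intensities with scikit-learn, which partitions a slice into regions without any labels; the number of clusters and the synthetic input are assumptions for illustration.

```python
# Minimal sketch: unsupervised segmentation of a CT slice by intensity clustering.
import numpy as np
from sklearn.cluster import KMeans

def cluster_slice(slice_hu: np.ndarray, k: int = 4) -> np.ndarray:
    """Partition a 2D slice into k intensity-homogeneous regions (no labels used)."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    labels = km.fit_predict(slice_hu.reshape(-1, 1))   # cluster on intensity alone
    return labels.reshape(slice_hu.shape)

# Synthetic slice standing in for a CT slice in Hounsfield units
slice_hu = np.random.randn(128, 128) * 100.0
regions = cluster_slice(slice_hu)
print(np.unique(regions))  # region labels 0..k-1
```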

3.3 Convolutional Neural Networks

Convolutional Neural Networks (CNNs) are a popular deep learning approach for image classification and segmentation. The architecture of a CNN (Fig. 5) consists of layers of interconnected nodes with weights that are updated as the model is trained. Pairs of images and their expected segmentations are used as input to train the model, which learns to segment the ROI by assigning weights to various aspects or objects in the input images. The trained model then uses these weights to compute the segmentation of new input images. 3D CNNs can capture spatial dependencies in CT volumes through their architecture and the application of relevant filters. Popular CNN architectures for medical image segmentation include U-Net, FC-Net, and R-CNN.

Fig. 5 CNN architecture
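
To make the idea concrete, the sketch below defines a small U-Net-style encoder-decoder for 2D slices in PyTorch; the layer sizes are arbitrary assumptions and this is not the architecture of any paper reviewed later.

```python
# Minimal sketch: a small U-Net-style CNN for 2D slice segmentation (illustrative sizes).
import torch
import torch.nn as nn

class MiniUNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        self.enc1 = self._block(in_channels, 16)   # encoder, full resolution
        self.enc2 = self._block(16, 32)            # encoder, 1/2 resolution
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = self._block(32, 64)      # 1/4 resolution
        self.up2 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2)
        self.dec2 = self._block(64, 32)            # decoder with skip connection
        self.up1 = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
        self.dec1 = self._block(32, 16)
        self.head = nn.Conv2d(16, num_classes, kernel_size=1)  # per-pixel class scores

    @staticmethod
    def _block(in_ch, out_ch):
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        b = self.bottleneck(self.pool(e2))
        d2 = self.dec2(torch.cat([self.up2(b), e2], dim=1))   # skip connection from e2
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))  # skip connection from e1
        return self.head(d1)                                  # per-pixel logits

# One 256x256 single-channel slice in, per-pixel class logits out
model = MiniUNet()
print(model(torch.randn(1, 1, 256, 256)).shape)  # torch.Size([1, 2, 256, 256])
```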

4 Automated Segmentation of the Larynx

The studies reviewed are grouped based on the type of automated segmentation carried out.

4.1 Entire Larynx Segmentation

The papers in this subsection work towards the common goal of segmenting the entire larynx as an organ at risk in Head and Neck CT images.

Mencarelli et al. [17] designed a hierarchical model representing substructures, with unsupervised learning to detect the substructures based on image intensities. The automatic segmentation identified the ROI of the larynx without manual intervention 88% of the time. They used the Bland-Altman method instead of the more straightforward DSC approach to check the overlap between the expected segmentation and the results obtained. Wu et al. [18] leveraged the relationship hierarchy of detected objects in Head and Neck CT images, contouring the substructures with a fuzzy model approach; the DSC varied from 49 to 74%, depending on CT image quality. Tam et al. [31] employed an architecture that exploited shape features and a multiple-output support vector regression model.

Thomson et al. [19] reported results with an atlas-based segmentation approach that were statistically significantly worse than those of the hierarchical model [17]; only 8% of the larynx contours could be used as is, without alterations. With the introduction of a bias into their algorithm, they raised the DSC significantly, to 84%. Another atlas-based approach [20], which used a consensus voting scheme to contour, obtained a DSC of 71% for the larynx. Tao et al. [21] used a combination of manual delineation and atlas-based auto-segmentation to obtain a DSC of 73% for the supraglottis and 64% for the glottis. Their aim was to reduce inter-observer differences during manual contouring, which result in different radiation dosages during treatment. Unbiased contouring could lead to maximum preservation of healthy tissue while eliminating tumour tissue (Table II).

Table II Summary of Approaches to the Automatic Segmentation of the Larynx using CT images

Ibragimov and Xing [23] were the first to use deep learning methods for larynx segmentation in Head and Neck CT images. They employed a CNN to identify consistent intensity patterns and segment the larynx, using a 45-image dataset to arrive at a DSC of 85.6 ± 4.2%. Rooij et al. [29] used the 3D U-Net, a deep learning CNN, to contour the larynx ROI; this study used the largest dataset of images we encountered in our review, but their additional initial step of cropping input images around the ROI is not available in a real-world scenario. Liang et al. [25] developed a multi-stage CNN named the ODS-Net: the first detector CNN localized the larynx with a bounding box, and the second CNN used the bounding box to produce a segmentation mask for the larynx. The ODS-Net is trained on non-contrast CT images and cannot be directly compared to methods working on contrast-enhanced CT. Zhong et al. [26] used a variation of the U-Net for auto-contouring the larynx in real-world clinical cases; although they reached a dice score of 70%, in the oncologists' assessment the manual segmentation was preferred over the one generated by the CNN. OARNet [28] employs a strategy similar to the ODS-Net, with multiple CNNs performing the detection and segmentation tasks; with a dense CNN and skip connections, the authors claimed to increase the accuracy. Lei et al. [22] followed a similar approach with the intent of speed: a Recurrent-CNN quickly detected the larynx with a bounding box while a U-Net segmented it.
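
To clarify how such detect-then-segment pipelines fit together, here is a schematic sketch (not the published ODS-Net, OARNet, or Lei et al. code; the tensor shapes and stand-in callables are assumptions): a detector proposes a larynx bounding box, the volume is cropped to it, and a segmentation network fills in the mask, which is pasted back at full resolution.

```python
# Schematic sketch of a two-stage detect-then-segment pipeline (illustrative only).
import torch

def two_stage_segment(volume, detector, segmenter):
    """volume: (1, 1, D, H, W) CT tensor; detector returns a (z0, z1, y0, y1, x0, x1) box."""
    z0, z1, y0, y1, x0, x1 = detector(volume)            # stage 1: locate the larynx
    crop = volume[:, :, z0:z1, y0:y1, x0:x1]              # restrict to the bounding box
    mask_crop = segmenter(crop)                            # stage 2: segment within the box
    mask = torch.zeros_like(volume)
    mask[:, :, z0:z1, y0:y1, x0:x1] = mask_crop            # paste back into the full volume
    return mask

# Toy usage with stand-in callables instead of trained networks
vol = torch.randn(1, 1, 64, 128, 128)
toy_detector = lambda v: (10, 40, 30, 90, 30, 90)
toy_segmenter = lambda c: (c > 0).float()
seg = two_stage_segment(vol, toy_detector, toy_segmenter)
print(seg.shape)  # torch.Size([1, 1, 64, 128, 128])
```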

4.2 Partial Larynx Segmentation

Research that uses laryngeal CT images for the detection and segmentation of specific sub-anatomies is covered in this section.

Hewavitharanage et al. [32] developed a supervised support vector machine classifier to segment a single anatomical substructure of the larynx using texture features, as illustrated in Fig. 6. The size of the segmented structure is an indicator of vocal cord impairment. With a sample size of 20 subjects, they obtained an 80% DSC against their ground truth of manual annotations. In another paper by the same authors [40], an unsupervised image segmentation algorithm was employed to detect the CT slice containing the vocal fold. They achieved this by localizing the vertebral column and the anterior commissure on the CT image and, using a combination of techniques, reached an 85% estimation accuracy in 20 patients [16].

Fig. 6 Rima glottidis segmentation output [32]

4.3 Automated Detection in the Larynx

This section gives an overview of detection tasks carried out on laryngeal CT images.

Santin et al. [34] used deep learning to detect the presence of cartilage abnormalities in CT scans. With a dataset of 326 images and a pre-trained CNN for their binary classification problem, they obtained an area under the ROC curve of 0.72 with 83% sensitivity. Lassau et al. [35] developed a model that detected tumour invasion of the thyroid cartilage. This is a crucial detection task, as thyroid cartilage invasion worsens the prognosis of laryngeal cancer. They used a dataset of 511 images to train their model and reported an area under the ROC curve of 70%; however, the results are not reproducible, as details of the model were not disclosed in the paper. Ayyalu et al. [46] examined how strongly auto-segmentation of Head and Neck CT images depends on anatomic similarity. CT images of ten patients, captured in scenarios ranging from poor to ideal, were auto-contoured using multiple atlases. They concluded that auto-segmentation of the larynx depends heavily on anatomic similarity compared to the other organs.

5 Laryngeal CT Datasets

Data is crucial for designing, testing and validating segmentation algorithms and models. We have curated a list of datasets that contain CT images of the larynx in Table III, highlighting the number of images in each dataset, the type of segmentation it contains, and its availability.

Table III Datasets available

6 Commercial Contouring Software

Commercial contouring software is available that outlines the larynx ROI for applications such as radiotherapy planning. For the sake of completeness of this review, we refer to a thorough comparison of the efficiency of these algorithms in auto-contouring the larynx [13]. Table IV lists the software and their effectiveness. The user can manually correct the predictions, which allows the contours to reach higher accuracy after human intervention and validation.

Table IV Commercial Auto Segmentation software results on CT images of the Larynx

7 Discussion and Conclusion

There have been commendable advances using computer-aided techniques for detection, contouring and segmentation on Head and Neck CT images, with promising results. The brain stem, mandible and spinal cord remain the most studied organs at risk. The larynx tends to be side-lined even when the work includes large-scale datasets. Recent work with appreciable results that constructed models for automated segmentation of multiple Head and Neck organs at risk [41], [42], [43], [44], [36], [45] did not include the larynx.

There is no standard for segmentation of the laryngeal anatomy in CT images, which makes annotation difficult. Ground truth creation is crucial, and maintaining uniformity of annotations across a large dataset is a herculean task. A well-curated, representative dataset increases the likelihood that a detection or segmentation model trained on it performs with clinically acceptable sensitivity and specificity. Because such datasets are difficult to build, it is not easy to develop and validate automated segmentation models.

This review gives an overview of the computer-aided detection and segmentation tasks that have been carried out on laryngeal CT images. Much of the effort has gone towards segmentation of the larynx as an entire structure, used as a region of interest, while a few studies have focused on the segmentation of anatomical substructures and the vocal cords.

Our work finds that studies that include laryngeal anatomy in detection or segmentation tasks tend to be limited to small, localized datasets. The availability of segmented laryngeal CT datasets in the public domain is sparse. We hope that our compiled list of available laryngeal CT image datasets will encourage work to address this research gap.

We conclude our paper by highlighting the work done in this field, the necessity for research into laryngeal auto-segmentation, and the requirement for the creation of datasets. We argue that there is a great need for more research, as standardization and auto-segmentation can lead to more effective treatment. If implemented well, auto-segmentation of the larynx can be a helpful tool for better, more accessible and quicker diagnosis and radiotherapy planning.