Early Experiences with Crowdsourcing Airway Annotations in Chest CT
- Cite this paper as:
- Cheplygina V., Perez-Rovira A., Kuo W., Tiddens H.A.W.M., de Bruijne M. (2016) Early Experiences with Crowdsourcing Airway Annotations in Chest CT. In: Carneiro G. et al. (eds) Deep Learning and Data Labeling for Medical Applications. LABELS 2016, DLMIA 2016. Lecture Notes in Computer Science, vol 10008. Springer, Cham
Measuring airways in chest computed tomography (CT) images is important for characterizing diseases such as cystic fibrosis, yet very time-consuming to perform manually. Machine learning algorithms offer an alternative, but need large sets of annotated data to perform well. We investigate whether crowdsourcing can be used to gather airway annotations which can serve directly for measuring the airways, or as training data for the algorithms. We generate image slices at known locations of airways and request untrained crowd workers to outline the airway lumen and airway wall. Our results show that the workers are able to interpret the images, but that the instructions are too complex, leading to many unusable annotations. After excluding unusable annotations, quantitative results show medium to high correlations with expert measurements of the airways. Based on this positive experience, we describe a number of further research directions and provide insight into the challenges of crowdsourcing in medical images from the perspective of first-time users.
Respiratory diseases are a major cause of death and disability, responsible for three of the top five causes of death worldwide. Chest computed tomography (CT) is an important tool for characterizing and monitoring lung diseases. Quantification of structural abnormalities in the lungs, such as bronchiectasis, air trapping and emphysema, is needed to track disease progression or to predict patient outcomes. We have recently shown that the airway-to-vessel ratio (AVR) is an objective measurement of bronchiectasis that is sensitive enough to detect early lung disease [7, 11]. Unfortunately, manual measurements of the airways and adjoining arteries suffer from intra- and inter-observer variation and are very time-consuming (8–16 h per chest CT).
Computer algorithms can be used to improve the accuracy and efficiency of the measurements. The first step is to extract the airways and vessels from the scan. Machine learning techniques learn from example images that have been manually annotated, and have been shown to be very effective for such extraction tasks. However, these techniques require a large number of annotated images, which is itself expensive and time-consuming to gather.
We therefore propose to use the wisdom of the crowd to gather annotations. In crowdsourcing, untrained internet users (knowledge workers or KWs) carry out human intelligence tasks (HITs), such as annotating images1. The KWs are unpaid volunteers, or receive a small financial reward for each task. Early research into crowdsourcing for medical images [1, 5, 6, 8] showed that non-expert workers were able to carry out a range of HITs relatively well; our goal is to investigate whether this is true for airway measurement in chest CT.
In this paper we describe our early experiences with crowdsourcing airway measurements in chest CT images. In Sect. 2 we describe how we generate 2D slices, how we collect annotations from the KWs and how the annotations are processed. Section 3 describes the data and the number of annotations collected, followed by a presentation of the results in Sect. 4. We discuss our findings, lessons learnt as first-time users of crowdsourcing and steps for future research in Sect. 5, followed by a conclusion in Sect. 6.
Our main question for this study was whether non-expert workers would be able to annotate airways in chest CT images. By “an airway annotation” we understand two outlines: one of the airway lumen (inner airway) and one of the airway wall (outer airway). Annotating an airway consists of two steps: localizing an airway, and creating the outlines. In this study we focused on the second step only. We therefore acquired annotations using already existing 3D voxel coordinates and orientations as a starting point.
2.1 Image Generation
2.2 Annotation Software
Details of the HIT on Mechanical Turk:
- Title: Save lives by annotating airways!
- Description: Draw two contours to annotate an airway (dark circle or ellipse) in image from a lung scan
- Keywords: image, annotation, contour, draw, drawing, segmentation, medical
2.3 Airway Measurement
An annotation was considered unusable if it contained:
- an odd number of ellipses, or
- an even number of ellipses, but with the distance between the centers of paired ellipses (pairs were assigned based on center distance) larger than 10 voxels.
For the remaining usable annotations, we measured the areas of the inner and outer ellipse, in order to compare them to the expert annotations. We performed the comparisons for each KW annotation individually, as well as for a combined measurement of the KWs. To obtain the combined measurements, we used only images with at least three usable annotations, and took the median of the areas.
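The filtering and aggregation rules above can be sketched in Python. The representation is an assumption not specified in the paper: each ellipse is a (cx, cy, a, b) tuple of center coordinates and semi-axes in voxels, pairing is done greedily by center distance, and within a pair the smaller-area ellipse is taken as the lumen (inner) outline.

```python
import math
from statistics import median

def pair_ellipses(ellipses, max_dist=10.0):
    """Pair ellipses by center distance; return a list of (inner, outer)
    pairs, or None if the annotation is unusable (no ellipses, an odd
    number of ellipses, or a pair with centers > max_dist voxels apart).
    Each ellipse is a hypothetical (cx, cy, a, b) tuple."""
    if not ellipses or len(ellipses) % 2 != 0:
        return None
    remaining = list(ellipses)
    pairs = []
    while remaining:
        e = remaining.pop(0)
        # greedily match with the closest remaining ellipse by center distance
        j = min(range(len(remaining)),
                key=lambda k: math.dist(e[:2], remaining[k][:2]))
        partner = remaining.pop(j)
        if math.dist(e[:2], partner[:2]) > max_dist:
            return None  # paired centers too far apart -> unusable
        # smaller ellipse = lumen (inner), larger = airway wall (outer)
        inner, outer = sorted((e, partner), key=lambda el: el[2] * el[3])
        pairs.append((inner, outer))
    return pairs

def ellipse_area(e):
    cx, cy, a, b = e
    return math.pi * a * b  # area of an ellipse with semi-axes a, b

def combined_measurement(per_worker_pairs, min_workers=3):
    """Median inner/outer areas over workers with a single usable pair;
    None if fewer than min_workers usable annotations for this image."""
    singles = [p[0] for p in per_worker_pairs
               if p is not None and len(p) == 1]
    if len(singles) < min_workers:
        return None
    return (median(ellipse_area(inner) for inner, _ in singles),
            median(ellipse_area(outer) for _, outer in singles))
```

Greedy distance-based pairing is one plausible reading of "pairs were assigned based on center distance"; an optimal assignment would differ only for annotations with several airway pairs.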
For this preliminary experiment we used 1 inspiratory pediatric CT scan from a cohort of 24 subjects from a study [3, 9], collected at the Erasmus MC - Sophia Children’s Hospital. In this scan, 76 airways were annotated by an expert using Myrian software. The expert localized an airway, outlined the inner and outer airway, and recorded the measurements of the areas.
3.2 Crowd Annotations
We generated a total of \(76\times 4=304\) images using the method described in Sect. 2.1. We randomly created HITs with 10 images per HIT. A KW could request a HIT, annotate 10 images, and then submit the HIT. The KWs were paid $0.10 per completed HIT. Only KWs who had previously done at least 100 HITs with an acceptance rate of 90 % could request the HITs. We accepted all HITs, i.e. no additional quality control was performed after the HITs were carried out.
We first collected 1 annotation per image with the freehand tool. As we will describe in Sect. 4, it became clear that an ellipse tool was needed. With the ellipse tool, we collected 10 annotations per image. However, based on our experience with the freehand tool, we did not gather annotations for all the images, in order to reduce costs. In the end, 90 of the 304 images were annotated with the ellipse tool, resulting in 900 annotations.
A selection of the results with the ellipse tool is shown in Fig. 4 (bottom). Using the tool eliminated the problem of non-ellipsoidal airways. However, the problems of workers placing only a single contour, or annotating vessels instead of airways, were still present. While the annotations were still not perfect, we decided to proceed with an initial analysis of the annotations.
4.2 Airway Measurement
We filtered unusable annotations as described in Sect. 2.3. Out of 900 annotations, 610 were found to be unusable. Of these 610, 133 annotations contained no ellipse, and 445 annotations contained only a single ellipse. For annotations with a single ellipse, there are three possible causes: spam, the worker indicating “no airway visible”, or the worker misunderstanding the instructions. To better differentiate between these causes, we looked at whether the ellipse was adjusted, indicating that the worker tried to annotate something. This was the case for 244 of the 445 annotations with a single ellipse. Although we do not analyse these annotations in this preliminary study, we note that they could still be used to measure airways.
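The “was the ellipse adjusted” check can be made concrete with a small triage heuristic. The default ellipse parameters below are hypothetical (the paper does not specify what the tool initially places); the point is only to separate untouched defaults from attempted annotations.

```python
# Hypothetical default ellipse placed by the annotation tool:
# (cx, cy, a, b) = center coordinates and semi-axes in voxels.
DEFAULT_ELLIPSE = (64.0, 64.0, 10.0, 10.0)

def is_adjusted(ellipse, default=DEFAULT_ELLIPSE, tol=1e-6):
    """True if the worker moved or resized the default ellipse at all."""
    return any(abs(v - d) > tol for v, d in zip(ellipse, default))

def categorize_single_ellipse(ellipse):
    """Coarse triage of single-ellipse annotations: an adjusted ellipse
    suggests an attempted annotation; an untouched default suggests spam
    or a 'no airway visible' response."""
    return "attempted" if is_adjusted(ellipse) else "spam_or_no_airway"
```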
Next we focus on the 290 usable annotations, i.e. where the worker placed ellipses in pairs. Of these, 256 annotations contained a single pair, 25 annotations contained two pairs, and a further 6 annotations contained three pairs. For this preliminary study, we only consider the annotations with a single pair for further analysis.
Note that the analysis above is performed on a per-annotation, not per-image basis. By aggregating the annotations obtained per image, we can get better estimates of the measurements from the crowd. In Fig. 6 we show the median areas for the images for which at least three workers produced usable annotations. The correlations are now medium to high for both types of orientations, although the sample size is lower, because for many images there were too few usable annotations. This motivates collecting more annotations per image in the future.
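The per-image aggregation and its comparison to the expert can be sketched as follows. The data layout (a dict of per-image crowd area lists and a dict of expert areas) is assumed for illustration; the correlation is a plain Pearson coefficient.

```python
import math
from statistics import median

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def aggregate_and_correlate(crowd_areas_per_image, expert_areas, min_workers=3):
    """Median-aggregate crowd areas per image, keeping only images with at
    least min_workers usable annotations, then correlate with the expert."""
    medians, experts = [], []
    for img_id, areas in crowd_areas_per_image.items():
        if len(areas) >= min_workers:
            medians.append(median(areas))
            experts.append(expert_areas[img_id])
    return pearson(medians, experts)
```

Note how the `min_workers` threshold trades sample size for robustness, which is exactly the effect observed above: fewer images survive the filter, but the aggregated measurements correlate better with the expert.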
Our results show that untrained KWs are able to interpret the CT images and attempt to annotate airways in the images. However, many KWs did not follow the instructions, resulting in unusable annotations. For example, in 244 out of 900 annotations the workers did attempt to create an annotation, but only placed a single ellipse in the image. The usable annotations show medium to high correlations with expert measurements of the airways, especially if the worker annotations are aggregated. The results are not convincing enough to say that the workers can annotate the airways as well as experts (as more analysis is needed to test such claims), but the collected annotations could already be useful for training machine learning algorithms. Overall we feel that the results encourage further investigation. The next step is to collect annotations for all 24 subjects in the cohort, after a number of changes we describe below.
Based on our results, the next logical step is to increase the number of usable annotations per image. There are several ways in which this can be achieved. One possibility is to improve the interface, for example by only accepting annotations that contain two ellipses. Alternatively, we could include a tutorial, showing workers step by step how to create the annotations. However, both of these options require custom-made adjustments to the interface, which is costly and time-consuming for novice users of MTurk such as ourselves.
In the short term, more feasible solutions for us are to simplify the instructions, increase the number of collected annotations per image to 20 (20 is also the choice in other crowdsourcing literature [6, 8]), and to improve the postprocessing of the annotations. Here we used very simple rules to filter and aggregate the annotations with reasonable results. An alternative would be to use unsupervised outlier detection, or train a supervised classifier to detect outliers. Such a classifier could be based only on the characteristics of the annotations (such as size of the ellipse), or could also include characteristics of the image.
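As one minimal instance of the unsupervised outlier detection mentioned above, a z-score rule on the annotated areas per image could flag annotations far from the consensus. This is a sketch, not the method used in the paper; the threshold and the use of area as the only feature are illustrative choices.

```python
from statistics import mean, stdev

def zscore_outliers(areas, threshold=2.0):
    """Flag annotation areas more than `threshold` sample standard
    deviations from the mean of the annotations for the same image.
    With fewer than 3 annotations there is too little evidence to judge."""
    if len(areas) < 3:
        return [False] * len(areas)
    mu, sigma = mean(areas), stdev(areas)
    if sigma == 0:
        return [False] * len(areas)  # all identical: nothing to flag
    return [abs(a - mu) / sigma > threshold for a in areas]
```

A supervised classifier, as suggested above, could replace this rule once enough annotations are labeled as usable or unusable, and could also draw on image features rather than ellipse size alone.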
If our future research demonstrates that the crowd can reliably annotate airways, we will need to address the question of localizing the airways, and of using the annotations in machine learning algorithms. For localizing airways, we could show larger slices, and ask the KWs to click all locations where airways are visible. Such clicks can then be used to learn to recognize good voxel positions, at which airway measurements can be collected. Alternatively, we could use the already collected annotations (both usable and unusable) to learn the appearance of “annotatable” slices, bypassing the localization step.
Overall our first experiences with crowdsourcing are positive, but also teach us a number of important lessons: (i) there is more to setting up a crowdsourcing task than we thought, and (ii) the task itself needs to be simpler than we thought. With regard to setting up the task, a challenge was to make a choice between different annotation tools, and how such tools might influence the results. With regard to the task itself, the number and the wording of instructions are likely to affect how well the instructions will be carried out. While it is widely known that the task should be “as simple as possible”, it is difficult to estimate the complexity of a novel task in advance, without performing preliminary experiments such as the ones described here.
For both the annotation interface and the instructions, it would be interesting to investigate how exactly different choices influence the final results. However, this “parameter space” is too large to explore exhaustively. This calls for more rules of thumb for designing large-scale data annotation tasks, as well as more interaction between researchers in medical image analysis and researchers in fields where crowdsourcing is a more established technique.
We presented our early experiences with setting up a crowdsourcing task for measuring airways in chest CT images. Our results show that the KWs were able to interpret the images, but that the instructions were too complex, leading to many unusable annotations. For the usable annotations, quantitative results show medium to high correlations with expert measurements of the airways, especially if measurements of the KWs are aggregated. Our results are encouraging; we therefore intend to continue this research direction, by simplifying the instructions and collecting more annotations for an in-depth analysis. As first-time users of crowdsourcing, we describe several challenges we encountered during this research, and we hope our experiences will help other researchers in medical image analysis who are considering crowdsourcing for annotating their data.
We adopt the terminology used by the Amazon MTurk platform.
This research was partially funded by the research project “Transfer learning in biomedical image analysis” which is financed by the Netherlands Organization for Scientific Research (NWO) grant no. 639.022.010. We gratefully acknowledge Dr. Daniel Kondermann of Heidelberg University for his help with crowdsourcing, and the anonymous reviewers for their constructive comments.