Capsule endoscopy (CE) has become the preferred method of investigating the small intestine [1, 2]. Although capsule administration is relatively straightforward in the absence of medical complications, reading CE videos is laborious, with outcome and reporting dependent not only on reviewer attentiveness and expertise but also on several other perceptual and interpretational factors [3, 4]. Viewing a soporific stream of often repetitive, indistinct images in a quiet, dark room carries a significant risk of lost concentration, which can lead to inaccurate reporting of findings [4, 5]. Nevertheless, practicing gastroenterologists may be considered adequately trained in CE reporting after a short, 1-day training program [6]. Moreover, formal training in CE during gastrointestinal (GI) fellowship is defined only loosely and includes completion of a hands-on course with a minimum of 8 h of continuing medical education (CME) credit, followed by review of CE studies by a credentialed capsule endoscopist [6]. There is currently no standardization of national or international training programs, although guidelines are being developed [7]. Furthermore, only limited evidence-based information on the optimal reading mode for CE review is currently available [3, 4, 8].

The past has taught us a great deal about medical image perception, not only in “classical” image-based specialties such as radiology and pathology, but also in other clinical specialties that use imaging technology, such as gastroenterology, laparoscopic surgery, or dermatology [9–11]. Medical images and videos represent a significant source of information that aids clinicians in diagnostic and therapeutic decisions [11]. Yet the correct interpretation of medical images, which consists of two basic processes, visual perception (image inspection) and cognition (rendering an interpretation), relies on a host of factors, and inaccurate interpretation carries significant health and medicolegal consequences [10, 11]. The development and use of computer-based models to predict human performance has also been a topic of interest; perception-oriented research in this area remains scarce, yet opportunities abound.

The American Society for Gastrointestinal Endoscopy (ASGE) recommends a minimum of 20 supervised procedures to provide adequate experience for those intending to practice CE independently [6]. Commercially available software provides a diverse range of viewing modes (VM) and frame rates (FR), in addition to other image enhancement tools such as digital chromoendoscopy [3, 12, 13]. No consensus has been reached on the latter technique, and its optimal mode of application has yet to be determined [2, 3]. The use of differing VM has been, to date, the subject of only two studies [14, 15]. In the most recent large cohort study, Zheng et al. [15] reported that the low lesion detection rates observed were not influenced by increasing CE experience. Detection rates were significantly higher when reading in single VM/FR15 (single screen with FR 15/s) and quad VM/FR20 (four screens with FR 20/s) than when reading in single VM/FR25 (single screen with FR 25/s). Increasing viewing speed in quad VM from FR20 to FR30 appeared to have no significant effect on detection ability. The investigators therefore suggested that quality control measures to compare and improve lesion detection rates need further study.

In this issue of Digestive Diseases and Sciences, Nakamura et al. [14] used a standardized, single-type lesion model to explore the effect of VM and FR on lesion detection and to determine the effect of these settings on CE reading time. They randomly selected 10 complete (to cecum) CE videos, obtained with PillCam® SB2, and recorded the “real time,” i.e., the actual time the video played without interruption from the point of duodenal entry to cecal exit, for 11 different combinations of VM and FR. Thereafter, a single CE video clip of excellent image clarity, comprising 60 positive images of small bowel angioectasias, was selected. To examine the effect of experience, the video was then read by six CE reviewers (three novices and three experienced) using nine combinations of VM and FR. Videos were presented to each reader in randomized order to minimize the risk of lesion recall. Readers were asked to use a manual counter to count each positive image on which an angioectasia was seen, generating a maximum number of positive images (MPI). At the same time, the reading time for each combination of VM and FR was recorded. The outcome measure used was the maximum number of frames on which one or more angioectasias were seen. The authors reported that the optimal combination (for a high MPI) was FR10 using dual VM or quad VM. Increasing FR from 10 to 15 shortened reading times by 33 % at the cost of a 25–28 % reduction in mean MPI. Altering VM had no effect on the reading time for any given FR.
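
The 33 % figure is what simple arithmetic would predict if reading time scales inversely with FR, which is consistent with the observation that VM did not alter reading time at a given FR. The following is a minimal sketch under that assumption (a fixed number of frames, uninterrupted playback, and FR taken as the total number of frames shown per second); the function names and the frame count in the usage example are hypothetical and are not taken from the study.

```python
def playback_minutes(n_frames: int, frame_rate: float) -> float:
    """Uninterrupted playback time in minutes for n_frames shown at frame_rate frames/s."""
    if n_frames < 0 or frame_rate <= 0:
        raise ValueError("n_frames must be >= 0 and frame_rate must be > 0")
    return n_frames / frame_rate / 60.0


def relative_time_change(fr_old: float, fr_new: float) -> float:
    """Fractional change in playback time when the frame rate goes from fr_old to fr_new.

    Assumes the same frames are shown in both cases, so time is proportional to 1/FR.
    A negative value means a shorter reading time.
    """
    if fr_old <= 0 or fr_new <= 0:
        raise ValueError("frame rates must be > 0")
    return fr_old / fr_new - 1.0


if __name__ == "__main__":
    # Hypothetical video length, chosen only for illustration.
    frames = 60_000
    for fr in (10, 15):
        print(f"FR{fr}: {playback_minutes(frames, fr):.1f} min")
    print(f"FR10 -> FR15: {relative_time_change(10, 15):+.1%}")
```

Under these assumptions, moving from FR10 to FR15 shortens playback by one third whatever the video length, which matches the reported 33 %. The accompanying fall in MPI is an empirical finding and cannot be derived from this arithmetic.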

Naturally, there are certain limitations to the study design. For instance, investigators knew that angioectasias were the only lesions observable on the video clips, which does not accurately reflect clinical practice, since this foreknowledge likely sharpened targeted and focused awareness, which in turn increased the detection rate. Furthermore, the uninterrupted viewing of the CE video with the concurrent use of a manual counter seems unwieldy. Angioectasias are one of the most recognizable types of GI tract lesion [14]; it is likely that more subtle, non-angioectatic pathology would have been less identifiable, particularly by novices, yielding lower detection rates for any given VM and FR than those reported. Inherent in this study design is the lack of information on detection ability in real-life circumstances, i.e., when the video clarity is suboptimal or during rapid transit of the capsule within the GI tract. To an extent, the authors attempted to offset this by eliminating access to the stop-and-roll and to-and-fro functions of the reviewing software.

For those of us who use CE regularly in our practice, it comes as no surprise that Nakamura et al. [14] confirmed the findings of the study by Zheng et al. [15], i.e., that experience contributes little to lesion detection. In endoscopy, as in life, detection skills are related to attentiveness and awareness. It is in the interpretation of findings that expertise comes into play. Therefore, the single-lesion model finds us in agreement. Still, the use of only one video clip, no matter how randomized the review, can be a source of bias. Perhaps further studies should investigate the use of digital chromoendoscopy under optimal reviewing conditions, with a larger number of video clips and lesion types.

Perhaps one of the main reasons that standardization of review has not been formally adopted is the aforementioned wide variability of trainee exposure combined with the multiplicity of reading modes and instrumentation. Again, the purpose is not to standardize reviewer-based protocols but to develop advanced and controllable CE platforms and software algorithms that can reliably detect and characterize lesions and automatically provide a diagnosis [5], not unlike the automated interpretational systems that have revolutionized cardiac arrhythmia detection [16]. Although the latter may seem a tad futuristic and unattainable, we should not forget the origins of wireless endoscopy, which was developed in the same manner.