The performance of different models of camouflage measurement was assessed in human touch-screen predation experiments. Although animal vision varies substantially between taxa, human performance in touch-screen experiments has been found to agree with behavioural data from non-humans [27]. Furthermore, spatial visual processing is thought to be similar between taxa [17], suggesting the results based on achromatic human performance should be good indicators of performance in many other vertebrate species.
Backgrounds and prey generation
Photographs of natural tree bark were used as background images (oak, beech, birch, holly and ash, n = 57 images), taken using a Canon 5D MKII with a Nikkor EL 80 mm lens at F/22 to ensure a deep depth of field. Photographs were taken under diffuse lighting conditions. Luminance-based vision in humans is thought to combine both longwave and mediumwave channels [44]. As such, we used natural images that measure luminance over a similar range of wavelengths by combining the camera’s linear red and green channels. Next, the images were standardised to ensure that they had a similar overall mean luminance and contrast (variance in luminance, see [16]). Images were then cropped and scaled with a 1:1 aspect ratio to the monitor’s resolution of 1920 by 1080 pixels using bilinear interpolation. Images were log-transformed, resulting in a roughly normal distribution of luminance values. A histogram of logged pixel values with 10,000 bins was analysed for each image. The 1st, 50th (median) and 99th percentile luminance values were calculated, and their bins modelled with a quadratic function against the desired values for these percentiles to ensure the median was halfway between the luminance at the upper and lower limits. The resulting images all had approximately equal mean and median luminance, similar luminance distributions (contrast), and equal numbers of pixels at the upper and lower extremes.
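As an illustration of this standardisation step, the following sketch (in Python, not the ImageJ code used in the study) remaps the log-transformed luminance of an image so that its 1st, 50th and 99th percentiles fall on chosen target values via a fitted quadratic; the function name and the use of NumPy’s polynomial fit are assumptions.

```python
import numpy as np

def standardise_luminance(luminance, low_target, mid_target, high_target):
    """Illustrative sketch of the percentile-based standardisation described
    above (not the authors' ImageJ code). `luminance` is a linear image built
    from the camera's red and green channels; the targets are the desired 1st,
    50th and 99th percentile values after log-transformation."""
    log_img = np.log(luminance + 1e-9)                   # log-transform, avoiding log(0)
    p1, p50, p99 = np.percentile(log_img, [1, 50, 99])   # measured percentiles
    # Quadratic mapping from measured to target percentiles, so the median ends
    # up halfway between the upper and lower limits
    coeffs = np.polyfit([p1, p50, p99], [low_target, mid_target, high_target], deg=2)
    return np.polyval(coeffs, log_img)
```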
Each prey was generated from the background it was presented against using custom written code similar to that used previously [28]. This methodology creates unique two-tone prey that match the general pattern and luminance of the background (see Fig. 1b). Briefly, for each prey a triangular section of the background image was selected from a random location, 126 pixels wide by 64 pixels high. For disruptive prey the threshold level was calculated that would split the image into the desired proportions (60% background to 40% pattern). For background matching prey a Gaussian gradient was applied to the edges prior to thresholding, making it less likely that underlying patterns would come through near the edge of the prey. This avoids creating salient internal lines in the background matching prey parallel with the prey’s outline, while ensuring no patterns touch the very edges. If the thresholded proportion was not within 1% of the target limits the process was repeated. Prey were generated with either dark-on-light or light-on-dark patterns, and each participant received only one of these treatments. Dark-on-light prey had the dark value set to the 20th percentile of the background levels and the light value set to the 70th percentile. Light-on-dark prey used the 30th and 80th percentiles respectively. The different percentiles for the two treatments reflect the fact that there is slightly more background area than pattern; these values ensure that the overall perceived luminance of the two treatments is similar to the median background luminance, factoring in the 60/40 split of background to pattern area.
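The colouring step can be sketched as follows. This is a simplified Python illustration, not the authors’ generation code: whether the 40% ‘pattern’ area corresponds to the dark or the light tone is an assumption, and the Gaussian edge gradient and the 1% tolerance check are omitted.

```python
import numpy as np

def two_tone_prey(patch, bg_levels, dark_pct=20, light_pct=70, prop_dark=0.40):
    """Simplified sketch of the prey colouring step: split a triangular
    background crop into two tones at a threshold giving the desired area
    proportions, then recolour using background percentiles (dark-on-light
    treatment values shown)."""
    thresh = np.percentile(patch, prop_dark * 100)        # ~prop_dark of pixels fall below this
    dark_val = np.percentile(bg_levels, dark_pct)         # 20th percentile of background levels
    light_val = np.percentile(bg_levels, light_pct)       # 70th percentile of background levels
    return np.where(patch <= thresh, dark_val, light_val)
```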
Calculating camouflage metrics
The camouflage metrics measured in this study fall into seven distinct methodologies, though many of these in turn provide a number of additional variations: Gabor edge disruption ratios (GabRat, first proposed in this study), visual cortex-inspired models based on the HMAX model [20, 45], SIFT feature detection [18, 22], edge-intersecting patch count [32], luminance-based metrics [10, 16], Fourier transform (bandpass) pattern analysis [10, 16, 41], and edge-detection methods to quantify disruption [19, 21, 33]. Where possible, we have used the same terminology for the different metrics as is used in the literature. Many of these variables can be used to compare a prey target to a specific background region. Therefore, where the metrics allowed, we compared each prey to its entire background image (the ‘global’ region) and to its ‘local’ region, defined as the area of background within a radius of one body-length (126 pixels) of the prey’s outline. The distance of one body length is the largest ‘local’ area that would not exceed the background image limits, because prey were always presented within one body length of the screen edge. This also ensured that the shape of the local region was always uniform; one body length is also a flexible unit of scale measurement that could be used in other animal systems. Measuring two regions allowed us to test whether a prey’s local or global camouflage matching was more important across the different metrics (see Fig. 1a). If the prey are a very poor luminance match to their backgrounds then we might expect them to stand out enough for the comparatively low-acuity peripheral vision to detect them easily using efficient search [34]. However, where the prey are a good luminance and pattern match to their local background they should be most difficult to detect, as this would require the participant to adopt inefficient search strategies, slowly scanning for the prey. We can make further predictions on the importance of pattern and edge disruption at specific spatial scales given that humans are most sensitive to spatial frequencies in the region of 3–5 cycles per degree [39]; this scale is equivalent to a Gabor filter with a sigma of approximately 2–4 pixels.
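For example, the ‘local’ region can be defined programmatically as a one-body-length band around the prey. The sketch below (assuming a boolean prey mask and SciPy’s Euclidean distance transform) is illustrative rather than the toolbox implementation.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def local_region_mask(prey_mask, body_length_px=126):
    """Sketch of the 'local' background region: all background pixels within one
    body length (126 px) of the prey's outline; prey_mask is a boolean image
    with True inside the prey."""
    dist_to_prey = distance_transform_edt(~prey_mask)   # distance from each background pixel to the prey
    return (dist_to_prey > 0) & (dist_to_prey <= body_length_px)
```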
For clarity we use the term ‘edge’ to refer to perceived edges based on a given image analysis metric, and ‘outline’ to refer to the boundary between prey and background. Unless otherwise specified, these methods were implemented using custom written code in ImageJ. The GabRat implementation will be made available as part of our free Image Analysis Toolbox [10]; code for all other metrics is already available in the toolbox or on request. See Table 1 for an overview of the measurement models.
Gabor edge disruption ratio (GabRat)
Prey were first converted into binary mask images (i.e. white prey against a black background), and a Gabor filter was then applied to each of the pixels around the edge of the prey at a range of angles (four in this study; the Gabor filter settings were identical to those used in the HMAX model below, and Fig. 1e). The angle of the prey’s outline at each point (parallel to the outline) was taken as the angle with the highest absolute energy (|E|) measured from the mask image. Each point around the prey’s outline in the original image was then measured with a Gabor filter at angles parallel and orthogonal (at right angles) to the edge at that point, measuring the interaction between the prey and its background. The disruption ratio at each point on the prey’s outline was then calculated as the absolute orthogonal energy (|E_o|) divided by the sum of the absolute orthogonal and absolute parallel (|E_p|) energies. Finally, the Gabor edge disruption ratio (GabRat) was taken as the mean of these ratios across the whole prey’s outline:
$$ GabRat=\frac{\Sigma \frac{\left|{E}_o\right|}{\left(\left|{E}_o\right|+\left|{E}_p\right|\right)}}{n} $$
Consequently, higher GabRat values should imply that prey are disruptive against their backgrounds (having a higher ratio of false edges), and lower GabRats imply that the edges of prey are detectable (see Fig. 1c). This process was repeated with different sigma values for the Gabor filter to test for disruption at different spatial frequencies (sigma values of 1, 2, 3, 4, 8 and 16 were modelled in this study). It is therefore possible for prey to be disruptive at one spatial scale or viewing distance, while having more detectable edges at another.
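A minimal sketch of the GabRat calculation is shown below. It follows the description above but is not the published ImageJ implementation; the Gabor frequency term, the use of the even (real) filter’s response magnitude as ‘energy’, and the angle convention for ‘parallel’ are assumptions.

```python
import numpy as np
from scipy.ndimage import convolve, binary_erosion
from skimage.filters import gabor_kernel

def gabrat(image, prey_mask, sigma=3, n_angles=4):
    """Hedged sketch of the GabRat calculation described above (the published
    implementation lives in the authors' Image Analysis Toolbox)."""
    angles = [i * np.pi / n_angles for i in range(n_angles)]
    kernels = [np.real(gabor_kernel(frequency=1.0 / (4 * sigma), theta=t,
                                    sigma_x=sigma, sigma_y=sigma)) for t in angles]
    # Per-pixel filter energies for the binary mask (outline angle) and the image
    mask_e = np.stack([np.abs(convolve(prey_mask.astype(float), k)) for k in kernels])
    img_e = np.stack([np.abs(convolve(image.astype(float), k)) for k in kernels])
    # Outline pixels: prey pixels with at least one non-prey neighbour
    outline = prey_mask & ~binary_erosion(prey_mask)
    ratios = []
    for y, x in zip(*np.nonzero(outline)):
        par = int(np.argmax(mask_e[:, y, x]))     # angle treated as parallel to the outline
        orth = (par + n_angles // 2) % n_angles   # angle at right angles to it
        e_o, e_p = img_e[orth, y, x], img_e[par, y, x]
        ratios.append(e_o / (e_o + e_p + 1e-12))
    return float(np.mean(ratios))
```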
HMAX models
The HMAX model is biologically inspired, based on an understanding of neural architecture [20]. It breaks down images using banks of Gabor filters [37] that are then condensed through simple steps into visual information for object recognition tasks. It has also been found to outperform SIFT in an object classification comparison [42], so we might expect it to perform best in a camouflage scenario. The HMAX model was developed in an attempt to emulate complex object recognition based on a quantitative understanding of the neural architecture of the ventral stream of the visual cortex [20, 45]. Our HMAX model first applied a battery of Gabor filters to the prey image and to the local and global background regions. The Gabor filters were applied at four angles and ten different scales, with Gamma = 1, phase = 2π, frequency of sinusoidal component = 4, minimum sigma = 2, maximum sigma = 20, increasing in steps of 2. C1 layers were created following Serre et al. [20] by taking the maximum values over local position (with a radius of sigma + 4) and the neighbouring scale band. The mean of each scale band in the prey’s C1 was then calculated, as we wished to compare the prey’s overall pattern match rather than a perfect template match, which would test a masquerade rather than a background-matching hypothesis [6]. This C1 template was then compared to each point in the local surround and the entire (global) background image. The average match between the prey’s C1 layer and the C1 layers of each region was saved, along with the value of the best match and the standard deviation in match (a measure of heterogeneity in background pattern matching). This model was run both with and without an allowance for the template to rotate at each point of comparison. When rotation was allowed, the angle with the best match at each comparison site was selected. The HMAX model with rotation describes how well the prey’s average pattern (i.e. angles, scales and intensities) matches the background if the prey can rotate to the optimal angle at each point. The model without rotation forces the prey to be compared to its background at the same angles, so it should be a better predictor in this study, where the prey’s angle relative to its background is fixed.
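The C1 stage can be sketched as follows. This is an illustrative simplification: pooling over neighbouring scale bands and the sliding template comparison are omitted, and the Gabor frequency term is an assumption rather than the exact parameterisation listed above.

```python
import numpy as np
from scipy.ndimage import convolve, maximum_filter
from skimage.filters import gabor_kernel

def c1_band_means(image, sigmas=range(2, 21, 2), n_angles=4):
    """Rough sketch of the C1 stage described above: Gabor responses (S1)
    max-pooled over local position, then summarised as a mean per scale band."""
    band_means = []
    for sigma in sigmas:
        pooled = []
        for i in range(n_angles):
            k = np.real(gabor_kernel(frequency=1.0 / (4 * sigma),
                                     theta=i * np.pi / n_angles,
                                     sigma_x=sigma, sigma_y=sigma))
            s1 = np.abs(convolve(image.astype(float), k))                 # S1: Gabor energy
            pooled.append(maximum_filter(s1, size=2 * (sigma + 4) + 1))   # C1: local max pooling
        band_means.append(float(np.mean(pooled)))
    return np.array(band_means)   # one summary value per scale band
```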
SIFT feature detection
Scale Invariant Feature Transform (SIFT, [18]) models were primarily developed for object recognition and for rapidly stitching together images by finding sets of shared features between them, even if they occur at different scales or angles. Although the SIFT models share some similarities with known biological image processing and object recognition at certain stages, such as in the inferior temporal cortex [46], the method as a whole is not intended to be biologically inspired, though it has been applied to questions of animal coloration [22].
The SIFT function in Fiji (version 2.0.0 [47]) was used to extract the number of feature correspondences between each prey and its local and global background, without attempting to search for an overall template match. Settings were selected that produced a large enough number of correspondences for the count data to exhibit a normal rather than Poisson distribution, while not being too slow to process. These settings produced roughly 300 features in the prey and 300,000 in the background in a sub-sample run to test the settings. The initial Gaussian blur was 1, with 8 steps per scale octave, a feature descriptor size of 4, 8 orientation bins and a closest-to-next-closest ratio of 0.96. Prey and their local background regions were measured in their entirety against a white background. As it stands, this methodology might therefore not be suitable for comparing prey of different shapes and sizes without further modification and testing.
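For readers without Fiji, an analogous correspondence count can be sketched with OpenCV’s SIFT, although the parameters differ from those listed above and the result will not be numerically equivalent.

```python
import cv2

def sift_correspondences(prey_img, background_img, ratio=0.96):
    """Illustrative analogue of the Fiji SIFT correspondence count using OpenCV
    (the study used Fiji's SIFT plugin, so the numbers will differ).
    Inputs are 8-bit greyscale images."""
    sift = cv2.SIFT_create()
    _, prey_desc = sift.detectAndCompute(prey_img, None)
    _, bg_desc = sift.detectAndCompute(background_img, None)
    if prey_desc is None or bg_desc is None:
        return 0
    matches = cv2.BFMatcher().knnMatch(prey_desc, bg_desc, k=2)
    # Count matches passing the closest-to-next-closest distance ratio test
    return sum(1 for m, n in matches if m.distance < ratio * n.distance)
```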
Edge-intersecting cluster count
The number of cases where patterns intersected the prey’s outline (following [32]) was counted for each prey using a custom written script. Background matching prey had no instances of edge intersections, which would create zero inflation and violate model assumptions. We therefore analysed a subset of the data containing only disruptive prey when testing this metric.
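A sketch of this count, assuming boolean masks for the prey shape and its pattern elements (not the custom script used in the study):

```python
import numpy as np
from scipy.ndimage import label, binary_erosion

def edge_intersecting_patches(pattern_mask, prey_mask):
    """Sketch of the edge-intersecting patch count: the number of connected
    pattern patches that touch the prey's outline."""
    outline = prey_mask & ~binary_erosion(prey_mask)   # one-pixel outline band
    labelled, _ = label(pattern_mask)                  # connected pattern patches
    return int(np.count_nonzero(np.unique(labelled[outline])))
```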
Luminance-based metrics
Prey were compared to their local and global background regions using a number of luminance-based metrics that could affect capture times. Luminance was taken to be pixel intensity values. LuminanceDiff was calculated as described in Troscianko et al. [10], as the sum of absolute differences in pixel counts across 20 intensity bins, essentially the difference between the image luminance histograms. This measure is suitable where the luminance values do not fit a normal distribution, which is the case with our two-tone prey. Mean luminance difference was the absolute difference in mean luminance between prey and background regions. Contrast difference was calculated as the absolute difference in the standard deviation of luminance values between prey and background regions. Mean local luminance was simply the mean pixel level of the local region. This was not calculated for the entire background image because the images had been normalised to have the same mean luminance.
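A sketch of LuminanceDiff is given below; normalising the histograms to proportions (so that the differing areas of prey and background region do not dominate) is an assumption about the toolbox implementation.

```python
import numpy as np

def luminance_diff(prey_pixels, bg_pixels, n_bins=20):
    """Sketch of LuminanceDiff: summed absolute difference between 20-bin
    luminance histograms of the prey and a background region."""
    lo = min(prey_pixels.min(), bg_pixels.min())
    hi = max(prey_pixels.max(), bg_pixels.max())
    prey_hist, _ = np.histogram(prey_pixels, bins=n_bins, range=(lo, hi))
    bg_hist, _ = np.histogram(bg_pixels, bins=n_bins, range=(lo, hi))
    prey_hist = prey_hist / prey_hist.sum()   # assumed normalisation to proportions
    bg_hist = bg_hist / bg_hist.sum()
    return float(np.abs(prey_hist - bg_hist).sum())
```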
Bandpass pattern metrics
Fourier transform (bandpass) approaches [17, 40] only loosely approximate the way visual systems split an image into a number of spatial frequencies; however, they have a long and successful track record of use in biological systems, are fast to calculate, and provide output that can be used flexibly to test different hypotheses [16, 41]. Fast Fourier bandpass energy spectra were calculated for the prey and their local and global background regions using 13 scale bands, increasing from 2px in multiples of √2 to a maximum of 128px [10]. PatternDiff values were calculated as the sum of absolute differences between energy spectra at each spatial band [10]. This metric describes how similar any two patterns are in their overall level of contrast at each spatial scale. Descriptive statistics were also calculated from the pattern energy spectra: maximum energy, dominant spatial frequency (the spatial frequency with the maximum energy), proportion power (the maximum energy divided by the sum across all spatial frequencies), mean energy, and energy variance (the standard deviation in pattern energy, a measure of heterogeneity across spatial scales) [10, 41]. A metric similar to the multidimensional phenotypic space used by Spottiswoode and Stevens [48] was calculated from these descriptive statistics. However, rather than summing the means of each descriptive pattern statistic, a Euclidean distance was calculated after normalising the variables so that each had a mean of zero and standard deviation of one (ensuring equal weighting between pattern statistics). We termed this metric ‘Euclidean Pattern Distance’. In addition, a full linear mixed model was specified containing all descriptive pattern statistics and all two-way interactions between them (the model was specified in the same form as the other linear mixed models in this study; see statistical methods below). The full model was then simplified based on AIC model selection.
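The bandpass spectrum and PatternDiff can be sketched as follows; the band-edge definition around each √2 scale step and the use of summed FFT amplitude as ‘energy’ are assumptions, so the values will differ from the toolbox output.

```python
import numpy as np

def bandpass_energy_spectrum(image, min_px=2, n_bands=13):
    """Rough sketch of a Fourier bandpass energy spectrum: summed FFT amplitude
    within a frequency band around each spatial scale (2 px increasing in
    multiples of sqrt(2) up to 128 px)."""
    centres = min_px * np.sqrt(2) ** np.arange(n_bands)        # 2, 2.8, 4, ..., 128 px
    fy = np.fft.fftfreq(image.shape[0])[:, None]
    fx = np.fft.fftfreq(image.shape[1])[None, :]
    freq = np.sqrt(fx ** 2 + fy ** 2)                          # cycles per pixel
    amp = np.abs(np.fft.fft2(image - image.mean()))
    spectrum = []
    for c in centres:
        lo, hi = c / 2 ** 0.25, c * 2 ** 0.25                  # half-step band around each scale
        in_band = (freq > 1.0 / hi) & (freq <= 1.0 / lo)       # wavelengths between lo and hi pixels
        spectrum.append(amp[in_band].sum())
    return np.array(spectrum)

def pattern_diff(spec_a, spec_b):
    """PatternDiff: the summed absolute difference between two energy spectra."""
    return float(np.abs(spec_a - spec_b).sum())
```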
Canny edge detection methods
Canny edge detection methods were applied following Lovell et al. [21] and Kang et al. [33], using a Java implementation of the method [49]. The Canny edge filter was applied to each image with the settings specified by Lovell et al., using a sigma of 3 and a lower threshold of 0.2. The upper threshold required by the Canny edge detection algorithm was not specified by Lovell et al., so a value of 0.5 was selected that ensured there would be no bounding of the data where no edges were detected. Following Lovell et al., the prey’s outline region was specified as being 4 pixels inside the prey’s outline and 8 pixels outside (see Fig. 1d). As above, two background regions were measured, local and global, although the 8px band around the prey’s outline was not included in the local or global regions. We measured the mean number of Canny edge contours in each region (i.e. the number of edge contour pixels in each region divided by the total number of pixels in that region, to control for the differences in area being measured). It is unclear whether Lovell et al. applied this control; however, given that the areas being measured are fixed in this experiment (all prey and backgrounds are the same size), this would not affect the overall outcome. VisRat was calculated as the mean Canny edge contours found in the background region (either local or global) divided by the mean Canny edge contours found in the prey’s outline region (termed ContEdge by Kang et al.). DisRat was calculated following Kang et al. as the mean Canny edge contours found inside the prey (termed MothEdge by Kang et al.) divided by ContEdge. Both VisRat and DisRat required a log transformation to achieve a normal error distribution.
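A sketch of VisRat and DisRat using scikit-image’s Canny detector (rather than the Java implementation cited) is shown below; the exact treatment of the region boundaries and the threshold scaling are assumptions.

```python
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion
from skimage.feature import canny

def vis_dis_rat(image, prey_mask, sigma=3, low=0.2, high=0.5):
    """Sketch of VisRat and DisRat; image is assumed to be a float image scaled
    0-1 and prey_mask a boolean prey shape. Regions follow the text: an outline
    band 4 px inside to 8 px outside the prey, excluded from the background."""
    edges = canny(image, sigma=sigma, low_threshold=low, high_threshold=high)
    band = binary_dilation(prey_mask, iterations=8) & ~binary_erosion(prey_mask, iterations=4)
    inside = binary_erosion(prey_mask, iterations=4)          # prey interior (MothEdge region)
    background = ~(prey_mask | band)                          # background minus the outline band
    cont_edge = edges[band].mean()                            # edge density in the outline band
    vis_rat = edges[background].mean() / cont_edge
    dis_rat = edges[inside].mean() / cont_edge
    return np.log(vis_rat), np.log(dis_rat)                   # log-transformed, as in the text
```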
Experimental setup
Prey were presented at a random location against their background using custom written HTML5/Javascript code on an Acer T272HL LCD touch-screen monitor. The display area was 600 mm by 338 mm, at 1920 by 1080 pixels. The monitor’s maximum brightness was 136.2 lux and its minimum was 0.1 lux, measured using a Jeti Specbos 1211 spectroradiometer. The monitor’s output fitted a standard gamma curve where brightness (lux) = 8.362E-4*(x + 25.41)^2.127*exp(−(x + 25.41)/3.840E11), where x is an 8-bit pixel value. The monitor was positioned in rooms with standard indoor lighting levels and minimal natural light. Prey were 39.38 mm wide (126 pixels, approximately 4.59°) by 20.03 mm high (64 pixels, approx. 2.30°), viewed from a distance of approximately 500 mm (approx. 27.9 pixels per degree). If participants touched the screen within the bounds of the prey, the code recorded a capture event to the nearest millisecond, a high-pitched auditory cue sounded, and a green circle appeared around the prey for 1 s. If they touched the screen outside the prey’s bounds, a low-pitched auditory cue sounded and they did not progress to the next screen. If the participant failed to find the prey after 20 s (timeout), a red circle appeared around the prey for 1 s, a low-pitched auditory cue sounded, and the capture time was set at 20 s (this occurred in just 3.5% of slides). In addition, the location of the touch was recorded for every successful capture, failed capture or timeout event. Participants started each session by clicking a box asking them to ‘find the artificial triangular “moths” as fast as possible’, confirming that they were free to leave at any point, and that it should take less than 10 min to complete (all trials were under 10 min). A total of 120 participants were tested, each receiving 32 slides (i.e. 32 potential capture events), creating a total of 3840 unique prey presentations.
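For reference, the fitted gamma curve can be evaluated directly; the short sketch below simply encodes the equation above.

```python
import numpy as np

def pixel_to_lux(x):
    """Monitor brightness in lux for an 8-bit pixel value x, evaluating the
    fitted gamma curve reported above."""
    return 8.362e-4 * (x + 25.41) ** 2.127 * np.exp(-(x + 25.41) / 3.840e11)

# pixel_to_lux(255) is roughly 134 lux, close to the measured maximum of 136.2 lux
```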
Statistics
All statistics were performed in R version 3.2.2 [50]. For each camouflage metric a linear mixed effects model was specified using lme4 (version 1.1-10). The dependent variable in each model was log capture time. The main aim of this study was to establish which camouflage measurement models best predicted human performance, and as such we compared the variance in capture times explained between models. The multiple models created here increase the likelihood of making a type I error; however, alpha correction methods (such as Bonferroni or Šidák corrections) are not strictly suitable for these data, as many of the models are non-independent, measuring subtly different versions of the same effect, and such corrections would increase the likelihood of type II errors. As such we focused on the level of variance explained by each variable and its associated effect sizes for ranking the models. A number of variables known to affect capture times were included in the model to reduce the residual variance to be explained by the camouflage metrics [28]. These were the X and Y screen coordinates of the prey, each included with quadratic functions and with an interaction between them, to reflect the fact that prey in the centre of the screen were detected sooner than those at the edges or corners. A variable was used to distinguish the first slide from all subsequent slides, describing whether the participant was naive to the appearance of the prey. Slide number was fitted to account for learning effects. Random effects fitted to the model were participant ID and background image ID, allowing the model to account for the differences in capture time between participants or against specific backgrounds when calculating the fixed effects. Each camouflage metric was substituted into this model and the deviance explained by each camouflage metric was calculated using the pamer function of LMERConvenienceFunctions (version 2.10). All camouflage metrics were continuous variables, transformed where necessary to exhibit a normal error distribution, with the exception of treatment type, which was categorical (background matching or disruptive prey). An additional final model was assembled based on the best performing edge disruption metric, pattern matching metric and luminance matching metric, with two-way interactions between them. These variables were checked for autocorrelation using Spearman covariance matrices [51]. This full model was then simplified based on AIC maximum likelihood model selection to determine whether the best camouflage predictors act in synergy to better predict camouflage than any single metric on its own.
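An illustrative translation of the per-metric model structure into Python is given below; the published analysis used lme4 in R, so this statsmodels sketch is an analogue rather than a reproduction, and all column names are assumptions.

```python
import statsmodels.formula.api as smf

def fit_metric_model(df, metric_col):
    """Illustrative statsmodels analogue of the per-metric mixed model. df is a
    pandas DataFrame of trials with assumed columns: log_capture_time,
    first_slide, slide_number, x, y, participant_id, background_image, plus the
    camouflage metric named by metric_col."""
    df = df.assign(const=1)  # single grouping level so the two random intercepts are crossed
    formula = (f"log_capture_time ~ {metric_col} + first_slide + slide_number"
               " + x + I(x**2) + y + I(y**2) + x:y")
    model = smf.mixedlm(formula, data=df, groups="const",
                        vc_formula={"participant": "0 + C(participant_id)",
                                    "background": "0 + C(background_image)"})
    return model.fit()
```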