Testing the reliability of hands and ears as biometrics: the importance of viewpoint

Stevenage, Sarah V.; Walpole, Catherine; Neil, Greg J.; Black, Sue M.

doi:10.1007/s00426-014-0625-x

Original Article
Open access
Published: 20 November 2014

Volume 79, pages 989–999, (2015)

Psychological Research

Sarah V. Stevenage¹,
Catherine Walpole¹,
Greg J. Neil¹ &
…
Sue M. Black²

Abstract

Two experiments are presented to explore the limits when matching a sample to a suspect utilising the hand as a novel biometric. The results of Experiment 1 revealed that novice participants were able to match hands at above-chance levels as viewpoint changed. A moderate change in viewpoint had no notable effect, but a more substantial change in viewpoint affected performance significantly. Importantly, the impact of viewpoint when matching hands was smaller than that when matching ears in a control condition. This was consistent with the suggestion that the flexibility of the hand may have minimised the negative impact of a sub-optimal view. The results of Experiment 2 confirmed that training via a 10-min expert video was sufficient to reduce the impact of viewpoint in the most difficult case but not to remove it entirely. The implications of these results were discussed in terms of the theoretical importance of function when considering the canonical view and in terms of the applied value of the hand as a reliable biometric across viewing conditions.

Introduction

Whilst criminals have learned to hide their face, or disguise their voice, their hands may nevertheless provide an important biometric within a court setting (Delac & Grgic, 2004). Indeed, the visibility and identification of unique cues within the hand, such as vein patterns and skin features (Black, Mallett, Rynn & Duffield, 2009; Black, MacDonald-McMillan & Mallett, 2013; Black, MacDonald-McMillan, Rynn & Jackson, 2013; Jackson & Black, 2013), have been sufficient to support a number of recent criminal convictions. Alongside this, however, the inherent flexibility of the hand means that it may be viewed from a variety of different viewpoints and in a variety of different positions, potentially compromising its biometric value. The purpose of the present paper is to investigate the limits of the hand as a biometric cue through exploring the ability of viewers to match images as viewpoint changes.

Key in this enquiry is the concept of the ‘canonical view’. In their seminal paper, Palmer, Rosch and Chase (1981) found high agreement amongst participants in three tasks involving (1) rating the ‘goodness’ of an image of a familiar object, (2) forming a mental image of a familiar object and (3) selecting the best camera angle to take a photo of a familiar object. Importantly, high agreement resulted whether participants judged a limited set of views presented to them (Palmer et al., 1981), or generated their own views through unconstrained rotation of familiar objects in a real-time 3D virtual space (Blanz, Tarr & Bülthoff, 1999). The consistently preferred image was termed the ‘canonical view’ and Palmer et al. suggested that it provided a ‘privileged perspective’. Perhaps most importantly, Palmer et al. noted that the canonical view elicited faster responses in an object naming task (see also Bülthoff, Edelman & Tarr, 1995) and in a visual search task (Newell, Brown & Findlay, 2004). Moreover, Gomez, Shutter and Rouder (2008, Expt 2) demonstrated a benefit of presenting the canonical image during a free-recall task, extending the importance of canonicality from perceptual- to memory-based tasks. Indeed, when asked to recall the names of 171 objects encountered in a study list, participants were able to recall significantly more objects when studied from canonical images (41 %) than when studied from non-canonical images (33 %) (Footnote 1). When taken together, these studies implied a performance advantage when viewing canonical images, but a performance cost otherwise. Consequently, if a canonical view was also demonstrated for hands, then their reliability as a biometric may be thrown into question in situations in which the viewing conditions deviated from the canonical ideal.

Attributes of the canonical image

Blanz et al. (1999) considered the attributes required to define a view as canonical. Three main characteristics were highlighted:

1. Goodness of recognition, through representing distinctive object characteristics and minimising occlusion,
2. Familiarity, through frequency of exposure, and
3. Display of object functionality through reflecting a characteristic mode of interaction.

For a novel object, the preferred or canonical view could only be based on the first of Blanz et al.’s criteria. Thus, a canonical view (if one existed) reflected only geometric aspects of the image itself, and agreement amongst viewers on the canonical view tended to be relatively low (see Cutzu & Edelman, 1994; Edelman & Bülthoff, 2002; Perrett & Harries, 1988). In contrast, for a familiar object, the canonical view could additionally be informed by experience (frequency of exposure to different viewpoints) and understanding (appreciation of function), and this tended to result in a greater consensus regarding the canonical view.

Laeng and Rouw (2001) offered support to suggest that the cardinal defining characteristic of the canonical view was its ‘frequency of exposure’. They reported that, whilst the canonical view of a familiar face was best represented by a ¾ profile (see also Troje & Bülthoff, 1996), the canonical view of one’s own face was closer to the frontal image, this being the view most frequently seen. However, it may be premature to define frequency of exposure as the most important aspect of canonicality. Indeed, the perspective from which we most often see an object may be inherently linked to the function that the object fulfils (the last of Blanz et al.’s criteria), and herein lies the basis for predictions for the current paper.

The present study

Given the aim of exploring whether the hand, as a biometric, could be processed accurately across different views, the central question for the current paper was whether a canonical view existed for hands. If so, performance was expected to be optimal when presented with this canonical view, and was expected to be impaired when presented with a non-canonical view. This would be a damaging result when evaluating the hand as a biometric, as it would suggest that the processing of the hand would only be reliable under limited conditions. However, with canonicality potentially influenced by both frequency of exposure and object function, it may be anticipated that a flexible object such as a hand may frequently be observed from a variety of viewpoints and in a variety of positions as it carries out a range of functions (see Laeng, Carlesimo, Caltagirone, Capasso & Miceli, 2002). As such, it may be predicted that hands may not have as strong a preference for a single canonical view, and consequently may survive presentation across a range of views, compared to a more rigid object. To test this prediction, the processing of hand images was compared here to the processing of ear images. Both represent valuable biometric cues (see Yan & Bowyer, 2007 for a review of ear recognition, and Black et al., 2009 for a review of hand recognition). However, the hand has a greater degree of flexibility and multifunctionality compared to the ear.

Performance was explored in a lab-based task designed to be analogous to that within a criminal investigation. Specifically, a traditional simultaneous matching task was used in which participants were asked to find the image (from 10 possibilities) that matched a target image. Given the preceding discussion, it was expected that both hand and ear processing may show sensitivity to a change in viewpoint, with optimal performance being associated with more optimal images. However, it was also expected that hands would be less affected by a change in viewpoint compared to ears because the non-rigidity of the hand provides for greater functionality and in turn, exposure to a larger array of viewpoints. As such, the present study is grounded in the predictions of canonicality across rigid and non-rigid cues, but provides an important test of the limits of the hand as a forensic biometric.

Experiment 1: method

Design

A 2 × 3 mixed design was used in which stimulus type (hands or ears) was varied between participants, and viewpoint (good, medium and poor) was varied within participants. Performance was tested by means of a ‘1 in 10’ task (Bruce et al., 1999) in which the participants’ task was to select one image (from an array of 10) that matched a target. Accuracy of performance was recorded.

Participants

A total of 50 novice participants (35 females, 15 males) took part either on a volunteer basis or in return for course credit. Participants were randomly assigned to study either hands (n = 25, 18 females) or ears (n = 25, 17 females), and both the age range (t₍₄₈₎ = 1.18, ns) and gender split (χ²₍₁₎ < 1, ns) were matched across the two groups. In addition, one hand expert and one ear expert provided baseline data for comparison purposes. Each gained their expertise through academic experience within the field of anatomy, with specialisation in the area of hands or ears to assist UK investigative processes either through the preparation of court evidence, or through facial reconstruction, respectively.

All participants reported normal, or corrected-to-normal, vision and did not recognise any individuals from either their hands or ears.

Materials

Hand images

A bespoke set of stimuli was gathered from 42 individuals (20 females, 22 males) to provide two images of each of six viewpoints of the hand. The two images differed only in the direction of the light source, and hence in the pattern of shadows. Their collection ensured that the matching task involved two different images of the same hand. Consequently, reliance on simple picture-related cues in the matching task was minimised. The six viewpoints captured (1) the dorsal (back) surface of the hand laid flat, (2) the palmar surface of the hand laid flat, (3) the hand in a relaxed pose, (4) the hand viewed from above whilst holding a glass, (5) the hand viewed from above whilst holding a pen, and finally (6) the hand viewed from above whilst holding a mobile phone. These six viewpoints were selected to capture a range of hand positions reflecting forensic ideals (dorsal and palmar views) and functional utility (grasping, writing, texting).

From this set, the images associated with 30 individuals were selected on the basis of a lack of distinguishing features such as pigmentation irregularities, tattoos, cuts or abrasions, nail irregularities, or significant levels of visible hair on wrists or knuckles. All individuals were photographed without jewellery and nail varnish.

Ear images

Ear images were obtained from the facial photographs of 116 individuals represented in the SuperIdentity Stimulus Database. The ears were extracted from full head images using Corel Photoshop such that the full extent of the ear was visible whilst minimising the amount of hair within the image. In this way, two ear images were extracted (for the reasons stated above) for each of six viewpoints capturing (1) the ear from the side, (2) the ear from a ¾ profile, and (3) the ear from the front as viewed both from a horizontal (0°) perspective and from a +20° perspective looking down. Again, these viewpoints were selected to reflect those available in optimal forensic contexts (mug-shots) and in more ecologically valid contexts such as from a closed-circuit television (CCTV) image where a camera is typically mounted above head height looking down.

From the set of images available, 30 individuals were selected to minimise visible head hair, and other distinguishing features such as lobe or helix irregularities, or multiple piercings. Again, all individuals were photographed without jewellery.

Both sets of stimuli were photographed using a Nikon D200 SLR camera under controlled artificial light conditions. The hands were photographed resting on a matt black horizontal surface, from a distance of approximately 45 cm. The (heads and) ears were photographed against an 18 % grey background from a distance of 1 m.

Determination of viewpoint quality

To determine the quality of the viewpoints, a crowdsourcing technique (MTurk) was used in which 100 individuals were shown the 6 viewpoints for a single hand, and the 6 viewpoints for a single ear. In line with Palmer et al. (1981), their task was to select the image that best corresponded to the mental image that they formed in their mind’s eye when imagining a hand or an ear. For both hands and ears, the most popular viewpoint was nominated as the optimal or ‘good’ viewpoint. This was chosen by a minimum of 40 % of the individuals. Similarly, the viewpoint of intermediate popularity was nominated as the ‘medium’ viewpoint and this was chosen by approximately 20 % of the individuals. Finally, the viewpoint that was least popular was nominated as the non-optimal or ‘poor’ viewpoint, and was selected by less than 5 % of the individuals. Care was taken to balance the popularity of corresponding nominations across the hands and ears as far as possible. The resulting nominated viewpoints, and their level of popularity amongst the 100 individuals, are summarised in Fig. 1 and were used in the subsequent experimentation.

Procedure

Across the experiment, participants completed 30 ‘1 in 10’ matching trials in which their task was to decide which, of a set of 10 images, matched the single target displayed simultaneously at the top of the computer screen. As such, this was a perceptual-matching task with no memory component and no naming requirement. All trials were ‘target present’ trials; however, the target image at the top of the screen and the image within the array were always two different images (even if in the same viewpoint) to prevent simple picture matching.

The format of each trial was identical and consisted of the presentation of the target at the top of the screen, with the array of 10 images, in three rows of 4 (top), 3 (middle) and 3 (bottom), simultaneously displayed beneath it. Above each image in the array was a number to denote its position within the array, with positions 1–4 referring to locations from left to right on the top row, positions 5–7 referring to locations from left to right on the middle row, and positions 8–0 referring to locations from left to right on the bottom row (see Fig. 2).
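
The array layout described above can be sketched as a lookup from response key to screen location. This is illustrative only: the source gives the row sizes and numbering, whereas the function name `key_to_location` and the reading of ‘8–0’ as the keys 8, 9 and 0 in that order are assumptions.

```python
# Keys laid out as in the array: 4 images on the top row, then 3, then 3.
# Assumption: positions "8-0" on the bottom row mean keys 8, 9, 0 in that order.
ROWS = (("1", "2", "3", "4"), ("5", "6", "7"), ("8", "9", "0"))


def key_to_location(key):
    """Map a response key to (row, column), both 0-based."""
    for row_index, row in enumerate(ROWS):
        if key in row:
            return row_index, row.index(key)
    raise ValueError(f"not a response key: {key!r}")
```

Under these assumptions, `key_to_location("7")` refers to the right-most image on the middle row.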

The target image was always presented in the good viewpoint, analogous to the optimal image of a ‘suspect’s hand’ within an investigation. The array of 10 images all showed stimuli in either good, medium, or poor viewpoints with 10 trials for each viewpoint. These were blocked according to viewpoint. The order of these blocks, and the selection of individual target exemplars presented within each block, was counterbalanced across participants, to minimise the influences of fatigue and item effects within the study.
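
One possible way to counterbalance the order of the three viewpoint blocks is sketched below. The source states only that block order was counterbalanced; the rotation through all six orderings and the name `block_order_for` are assumptions made for illustration.

```python
from itertools import permutations

VIEWPOINTS = ("good", "medium", "poor")

# All 6 possible orders of the three viewpoint blocks.
BLOCK_ORDERS = list(permutations(VIEWPOINTS))


def block_order_for(participant_index):
    """Return the block order for a participant (0-based index).

    Assumption: orders are assigned by simple rotation through the six
    permutations; the paper does not specify the actual scheme.
    """
    return BLOCK_ORDERS[participant_index % len(BLOCK_ORDERS)]
```

With 25 participants per group, a rotation of this kind cannot be perfectly balanced (25 is not a multiple of 6), so one order would be used once more than the others.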

The participant’s task was to respond as quickly but as accurately as possible to indicate which of the 10 images in the array depicted the target at the top of the screen. Participants were aware that the image of the target in the array would be different and thus they were looking for a different image of the same hand (or ear) rather than an identical image. Participants indicated their answer by pressing the numbered key (0–9) on a standard keyboard that corresponded to the position of their selected image in the array, and all images remained visible until this response was made. Self-paced breaks separated the three blocks of trials and the entire experiment lasted approximately 30–40 min, after which participants were thanked and debriefed.

Experiment 1: results and discussion

Accuracy on the ‘1 in 10’ task is summarised in Table 1 and was explored to determine whether novice performance on the matching task (1) was better than chance, (2) approached the level of the experts and (3) differed across viewpoint.

Table 1 Absolute and standardised accuracy of performance (and standard deviation) on the ‘1 in 10’ matching task for experts, novices (experiment 1) and trained participants (experiment 2)

Comparison to chance

To address the first question, a series of one-sample t tests was conducted comparing accuracy to a chance level of 0.1. These indicated that for both hands and ears, and across every viewpoint, novice participants were significantly better than chance (all ts ₍₂₄₎ > 5.93, p < 0.001). This was important in demonstrating the absence of floor effects within the data despite the very different nature of the hand and ear stimuli.
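
The comparison to chance can be illustrated with a minimal sketch of the one-sample t statistic. The chance level of 0.1 is taken from the text; the function name `one_sample_t` and the example scores are hypothetical and are not data from the study.

```python
import math
import statistics

CHANCE = 0.1  # one correct image among the 10 in the array


def one_sample_t(scores, mu=CHANCE):
    """Return (t, df) for a one-sample t test of `scores` against `mu`."""
    n = len(scores)
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample SD (n - 1 denominator)
    t = (mean - mu) / (sd / math.sqrt(n))
    return t, n - 1


# Hypothetical per-participant accuracies, NOT data from the study.
example_scores = [0.5, 0.6, 0.4, 0.7, 0.5]
t_value, df = one_sample_t(example_scores)
```

With 25 participants per group, the degrees of freedom would be 24, as in the tests reported above.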

Comparison to experts

To address the second question, one-sample t tests were conducted to compare the absolute performance of participants to that of the relevant expert at each viewpoint. As might be anticipated, these revealed that, whilst the novice participants performed at above chance levels, they performed below the level of the expert in all conditions (all ts₍₂₄₎ > 7.63, p < 0.001).

Impact of viewpoint

To address the final question, a 2 × 3 mixed Analysis of Variance (ANOVA) was conducted to explore accuracy of performance when matching hands and ears across good, medium and poor viewpoints. For this analysis, accuracy levels were standardised by expressing them as a proportion of the performance level attained in the optimal (good) condition (see Table 1). This ensured a focus on the relative impact of a change in viewpoint, and prevented the findings being affected by variation in absolute levels of performance across the stimuli.
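
The standardisation step can be sketched as follows. It is assumed here that the proportion is computed separately for each participant; the source does not say whether it was applied at the participant or group level. The function name and the example values are hypothetical.

```python
def standardise(accuracy_by_view):
    """Express each viewpoint's accuracy as a proportion of the 'good' level."""
    baseline = accuracy_by_view["good"]
    if baseline == 0:
        raise ValueError("accuracy in the good condition must be non-zero")
    return {view: acc / baseline for view, acc in accuracy_by_view.items()}


# Hypothetical accuracies for one participant, NOT values from Table 1.
example = standardise({"good": 0.8, "medium": 0.6, "poor": 0.2})
```

By construction the good viewpoint always scores 1.0, so only the relative drop at the medium and poor viewpoints carries information.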

The ANOVA revealed a main effect of stimulus type (F₍₁,₄₈₎ = 41.59, p < 0.001, partial η² = 0.464), with better overall performance for hands than for ears. In addition, a main effect of viewpoint emerged (F₍₂,₉₆₎ = 409.52, p < 0.001, partial η² = 0.895), with better performance when presented with more optimal viewpoints. These effects were qualified by the expected interaction between stimulus type and viewpoint (F₍₂,₉₆₎ = 24.79, p < 0.001, partial η² = 0.34).
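
As a consistency check on the effect sizes, partial η² can be recovered from an F ratio and its degrees of freedom using the standard identity partial η² = (F × df_effect) / (F × df_effect + df_error). The sketch below applies it to the values reported above; the function name is an illustrative choice.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Recover partial eta squared from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios and degrees of freedom as reported in the text above.
stimulus_type = partial_eta_squared(41.59, 1, 48)
viewpoint = partial_eta_squared(409.52, 2, 96)
interaction = partial_eta_squared(24.79, 2, 96)
```

Rounded to the reported precision, these give 0.464, 0.895 and 0.34, in agreement with the values quoted in the text.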

Analysis of the simple main effects confirmed a significant effect of viewpoint for both hands (F₍₂,₄₈₎ = 47.27, p < 0.001, partial η² = 0.66) and ears (F₍₂,₄₈₎ = 233.48, p < 0.001, partial η² = 0.907), suggesting that the performance for both stimulus types suffered as the view became less optimal. However, a series of Bonferroni-corrected comparisons confirmed that performance with hands was not affected by a change from good to medium images (t₍₂₄₎ = 2.04, p > 0.05) and was affected only by a change from medium to poor images (t₍₂₄₎ = 6.72, p < 0.001). In contrast, performance with ears was affected as soon as the image moved away from optimal, with significant differences in performance levels between good and medium images (t₍₂₄₎ = 14.92, p < 0.001) as well as between the medium and poor images (t₍₂₄₎ = 4.16, p < 0.001).
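
The Bonferroni correction applied to these comparisons can be sketched as a per-comparison significance threshold. The size of the comparison family is not stated in the source; the value of four used in the example (two adjacent-viewpoint comparisons for each stimulus type) is an assumption, as are the function names.

```python
def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison threshold that keeps the familywise error rate at `alpha`."""
    if n_comparisons < 1:
        raise ValueError("n_comparisons must be at least 1")
    return alpha / n_comparisons


def is_significant(p_value, alpha=0.05, n_comparisons=1):
    """True if `p_value` survives Bonferroni correction."""
    return p_value < bonferroni_alpha(alpha, n_comparisons)


# Assumption: a family of 4 comparisons, which the paper does not confirm.
threshold = bonferroni_alpha(0.05, 4)
```

Under this assumption each comparison would need p < 0.0125 to be declared significant.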

In accounting for these results, it was possible that ear processing was more affected by a change in viewpoint than hand processing because ear processing was an inherently difficult task. Important in this regard was the demonstration of equivalent absolute levels of performance in the best image case (t ₍₄₈₎ = 2.48, ns) despite the differences between hands and ears as stimuli. Consequently, the substantial impact of viewpoint for ears could not easily be attributed to an inherent difficulty when matching ears. However, the possibility remained that the difficulty when matching ears was revealed not in baseline performance levels, but in a greater vulnerability as the image quality was changed. Such an explanation was compatible with the predictions for this study in which the flexibility of the hand was expected to minimise the impact of a sub-optimal viewpoint. Indeed, these two accounts would be difficult to separate out.

Taking all analyses together, the results of Experiment 1 provided support for the predictions. Specifically, the change in viewpoint had a significant effect when matching hands, but had a greater effect, from an equivalent starting point, when matching ears. These results supported the prediction that the inherent flexibility of the hand enabled exposure to a variety of viewpoints, with the consequence that canonicality was less strong for hands than for ears.

In terms of implications for the hand as a biometric, the data here led to the conclusion that when matching hands, performance could survive moderate changes in viewpoint whereas when matching other more rigid biometrics (such as ears), a change in viewpoint compromised performance quite substantially. As such, these data confirmed a greater reliability of the hand as a biometric cue across optimal and moderately optimal viewing conditions.

Several aspects of the current results were interesting and unanticipated, and as such warrant some consideration. In particular, it was interesting to note impairment in the performance of the two experts as viewpoint changed. Whilst it was not possible to assess the extent of the impact of viewpoint statistically for each of the experts (there being only one expert for each stimulus type), it was possible to determine whether the experts were affected to the same degree as the novice participants.

To this end, a series of one-sample t tests was conducted, comparing the decline in performance shown by the expert, to the decline in performance shown by the group of novices. This confirmed that novice performance declined more than expert performance as the viewpoint became less optimal. This was evident when matching ears as the image changed from good to medium (ears: t ₍₂₄₎ = 6.25, p < 0.001; hands: t ₍₂₄₎ = 1.08, ns), and when matching both ears and hands as the image changed from medium to poor (ears: t ₍₂₄₎ = 8.64, p < 0.001; hands: t ₍₂₄₎ = 11.23, p < 0.001). Consequently, these results suggested that whilst the experts were affected by a change in viewpoint, they were affected less than novices.

This latter analysis did not sit within the main purpose of this experiment but nevertheless raised questions. For example, could the provision of training be sufficient to improve performance levels from that of the novice towards that of the expert? Relatedly, could the provision of training ameliorate the negative impact of the sub-optimal viewpoint so that trained participants come to show greater resilience than novices when presented with sub-optimal viewpoints?

Whilst representing an important applied issue, such questions relate well to the theoretical consideration of Blanz et al. (1999) regarding the criteria underpinning a canonical view. Indeed, it may be argued that expertise brings with it a capacity to use a range of cues so that the matching task can still be completed even when a subset of the cues is unavailable through occlusion in a sub-optimal image. Similarly, it may be argued that expertise brings the capacity to show better understanding of function, and greater levels of exposure to non-standard viewpoints through expert study. All of these factors may lead to the prediction that canonicality is less strong (or the negative impact of a non-canonical image can more easily be overcome) when the viewer brings expertise to their viewing task.

Experiment 2 was conducted to present an examination of these emergent questions. Through the provision of video instruction, the performance of a group of ‘trained’ participants was compared to that of the novices and experts studied in Experiment 1. It was anticipated that training would improve overall levels of performance, and would reduce the impact of a change in viewpoint compared to the novices such that the performance of the trained group would more closely resemble that of the experts.