1 Introduction

Security services and safety enterprises have been employing video surveillance as both a preventive measure and a tool for law enforcement for many years, with an ever-increasing tendency. As digitisation progresses, the amount of available photographic and video material from smartphones or smart home applications continues to grow. Due to the lack of standardised technical procedures and scientific findings, law enforcement agencies have not yet been able to exploit the full potential of this image material. Currently, analyses of video material for the purpose of identifying or excluding suspects still place the emphasis on face recognition. However, in many cases, criminals take measures to conceal their identity, for instance by deliberately making the face unrecognisable, wearing a mask, or dressing in similar clothing, thereby making it impossible to assign a criminal act to one specific individual. In addition, it is not uncommon for insufficient quality of the image material to render identification by face recognition difficult, despite all technical progress. In essence, the following challenges occur when identifying people from video surveillance on the basis of face recognition:

  • The presence of a suspect at a crime scene can neither be proven nor refuted.

  • Accomplices cannot unequivocally be distinguished from each other, preventing proof of the individual contribution of each person to the crime.

  • It is not possible to prove that a perpetrator or a group of perpetrators committed several offences at different places and at different times (i.e., a series of offences).

Within the context of criminal investigations, in individual cases and where video recordings are available, an effort is made to prove the involvement of suspects in the crime by means of bioforensic expert reports. These are based on biometric data that go beyond face recognition, including features such as body size, shoulder width, the ratio of forearm to upper arm length, and distinctive motion or gait patterns. However, due to the limited scientific capacities in this domain and the comparatively high costs associated with it, such an individual evaluation of video material is only possible in exceptional cases.

The COMBI research project aims to analyse an anthropometric pattern, i.e. a combination of human body proportions (lengths, heights, and widths in particular), as a biometric identifier and to make this pattern feasible for standardised application in the taking of evidence by the police and the prosecution. This is founded on a scientifically sound and reproducible digital forensic comparative method that accounts for ethical, constitutional, and criminal procedural requirements. Owing to the practical application and the necessity of judicial recognition, the focus is primarily set on the automation of essential processes and on making the matching quantifiable. Moreover, the anthropometric pattern is intended to be applicable independently of face recognition algorithms, AI-based gait analyses, or the recognition of persons by clothing-related features, all of which may prove impracticable. A remaining possible starting point for the identification of perpetrators in these instances is the derivation of person-specific digital skeletons, so-called rigs, whose suitability as a biometric identifier is being analysed during the project. First, however, a valid prediction of key points that accurately represent human joints is needed, as these form the basis of any rig derivation. Thus, OpenPose, an AI framework for the prediction of various joints of the human body, was applied to ensure an automatable process and, above all, a quantifiable analysis. This framework, along with the way it was employed within the COMBI research project, is detailed in the following sections.

2 Theoretical Framework

2.1 OpenPose

The learning process of OpenPose [1, 8] is based on extensive data sets [2, 13] containing labelled images that capture people in different poses in movies, sports, or everyday scenes. The labelled information describes 2D coordinates in the image that refer to a corresponding joint. Prior to the actual learning process, image sections of different sizes, so-called windows, are defined around the coordinates of a joint. Ideally the coordinates of a joint lie in the centre of each window; however, this depends on their position in the image. The window sizes are determined by the following pixel dimensions: 9 \(\times\) 9, 26 \(\times\) 26, 60 \(\times\) 60, 96 \(\times\) 96, 160 \(\times\) 160, 240 \(\times\) 240, 320 \(\times\) 320 and 400 \(\times\) 400. This results in a defined number of windows of different sizes for a defined number of images of a training data set. Using histograms of oriented gradients (HOG), each window is then transformed into an abstract representation of the associated pixels that ultimately contributes to the informational foundation of the learning process. According to [11], the training was based on the MPII Human Pose [2] (MPII refers to the Max Planck Institute for Informatics), LSP [6] and FLIC [9] data sets, which together comprise 42,987 images (28,000 from MPII + 11,000 from LSP + 3987 from FLIC). Consequently, for the training of joint recognition of both ankles, knees, shoulders, elbows, and wrists there are 3,438,960 image sections (8 windows * 10 joints * 42,987 images) serving as a training basis. During the learning process, patterns are derived from the previously abstracted image information, described mathematically using descriptors such as Euclidean vectors, and learnt by the neural network. Finally, this learning process results in a prediction model that is specifically trained for the prediction of human joints. For the quantitative assessment of the trained prediction model, an evaluation was carried out in the same fashion as the training process using more than 2000 images from LSP (1000 images) and FLIC (1016 images). These were separated from the training data set prior to the training process and thus form an independent set of images for the evaluation. This methodological foundation ultimately allows different prediction models to be trained for a variety of object classes such as body regions or joints in the context of person recognition within images. These prediction models are applied in a cascade: first for the prediction of the whole person, followed by the prediction of individual body regions such as the upper and lower body, the upper and lower arm, etc., and finally the prediction and assessment of individual joints from the image information of already predicted body regions. The predicted joints then serve as a basis that allows the human musculoskeletal system to be abstracted and converted into a 3D digital skeletal model, a so-called rig. In the COMBI research project, OpenPose was applied to the image material gained during the photogrammetric identification procedure as well as to the video frames from the simulated crime videos. This yields a workable analysis in which the same method is applied to all data sets. An exemplary representation of predicted OpenPose key points, combined into an OpenPose skeleton (OpenPose rig), is shown in Fig. 1.
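
As a rough, simplified illustration of the window-and-HOG pre-processing described above (not a reproduction of the original training pipeline), the following sketch cuts windows of the listed sizes around one labelled joint coordinate and converts each into a HOG descriptor using scikit-image; the cell-size heuristic and the border handling are assumptions made for this example.

```python
from skimage.feature import hog

# Window sizes (in pixels) for the square image sections cut out around a
# labelled joint, following the dimensions listed in the text.
WINDOW_SIZES = [9, 26, 60, 96, 160, 240, 320, 400]

def extract_joint_descriptors(image_gray, joint_xy):
    """Cut windows of several sizes around one labelled joint coordinate and
    turn each window into a HOG descriptor (illustrative simplification).
    image_gray: 2D greyscale array; joint_xy: (x, y) pixel coordinates."""
    h, w = image_gray.shape
    x, y = joint_xy
    descriptors = []
    for size in WINDOW_SIZES:
        half = size // 2
        # Clip the window to the image borders; near the border the joint is
        # therefore not exactly in the window centre.
        x0, x1 = max(0, x - half), min(w, x + half)
        y0, y1 = max(0, y - half), min(h, y + half)
        window = image_gray[y0:y1, x0:x1]
        # Scale the HOG cell size with the window size so that small and
        # large windows both yield a valid descriptor (assumption).
        cell = max(size // 8, 2)
        if min(window.shape) < 2 * cell:
            continue  # window too heavily clipped to form HOG blocks
        descriptors.append(
            hog(window, orientations=9, pixels_per_cell=(cell, cell),
                cells_per_block=(2, 2), feature_vector=True)
        )
    return descriptors
```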

Fig. 1

Basic OpenPose process. Schematic representation of the steps leading to a computer model for the prediction of key points that correspond to joints of the musculoskeletal system of a human individual. Labelled images (A) form the basis for a training data set (B). In the context of a pre-processing step (C), so-called windows (image sections) of defined size are determined for each image from B around the labelled coordinates of a key point or joint. In the learning process (D), patterns are derived and learned from the previously abstracted image information and finally transferred into a computer prediction model. The cascaded application of different models for the prediction of e.g. a person (E), body regions (F) and finally for the prediction and assessment of various joint points (G) ultimately enables the abstraction of the human musculoskeletal system by a so-called digital skeleton (rig) (H). The training of the OpenPose model was based on the data sets MPII Human Pose, LSP and FLIC. For the quantitative assessment of the trained prediction model, the evaluation of this model was carried out on the basis of more than 2000 images of the data sets LSP (1000 images) and FLIC (1016 images). The BODY_25 model in its native form (https://cmu-perceptual-computing-lab.github.io/openpose/web/html/doc/md_doc_02_output.html#pose-output-format-body_25) was used as the output format for the OpenPose rigs. The BODY_25 model contains an interpolated hip point, which simplifies the connection between the two legs. This is particularly helpful when the upper body is not included in the image or is obscured. The addition of the foot key points even allows some body key points to be predicted more accurately; this has a particular influence on the position of the ankle key points, for example [5]
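
For orientation when reading the rig figures below, the BODY_25 output enumerates 25 key points per person. The following lookup table sketches the index-to-joint mapping as documented on the OpenPose output-format page linked in the caption, restricted to the joints used in this project plus the interpolated mid-hip point (the eye and ear key points are omitted).

```python
# Subset of the BODY_25 key point indices (per the OpenPose output documentation);
# index 8 is the interpolated mid-hip point mentioned in the caption.
BODY_25_JOINTS = {
    0: "Nose",      1: "Neck",
    2: "RShoulder", 3: "RElbow",     4: "RWrist",
    5: "LShoulder", 6: "LElbow",     7: "LWrist",
    8: "MidHip",
    9: "RHip",     10: "RKnee",     11: "RAnkle",
    12: "LHip",    13: "LKnee",     14: "LAnkle",
    19: "LBigToe", 20: "LSmallToe", 21: "LHeel",
    22: "RBigToe", 23: "RSmallToe", 24: "RHeel",
}
```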

2.2 Human Pose Estimation in Metric 3D Reference Models

Using terrestrial laser scanners, the crime scene can be digitally captured [3, 14, 16] and the resulting 3D point cloud converted into a 3D model that can be displayed on the computer. Within the 3D model it is possible, for instance, to conduct measurements and geometrical shape matching. This only requires reference dimensions, which are obtained with the terrestrial laser scans as well. With images or video frames, the challenge is to compare persons within the 2D information in a way that enables matches on a quantitative basis. Similarly to the 3D model, this can be achieved by using reference objects with known dimensions to infer measurements of an object in a video frame. For this purpose, the reference dimensions must either be in the immediate proximity of the object that is to be measured or exhibit a well-defined geometry and size that allows the size of the object in the image to be inferred using classical photogrammetry. To close this gap, modern software applications like Blender (Footnote 1) [4, 7, 15] facilitate the merging of 2D (image) and 3D (scan) information. This combination results in a so-called 3D reference model, in which each video frame of a video sequence is linked to the 3D model of a crime scene. It permits measurements to be taken on objects in a video at any time, always in reference to the depth information of the 3D model. Another benefit of this integrated information is the possibility of transferring objects between different 3D reference models in order to compare them with each other. In doing so, the measurements of the transferred objects are preserved and only scaled within the respective 3D reference model, thus enabling a cross-camera comparison of objects. When applying this kind of object comparison to persons, the challenge arises of making the biomechanical motion sequence of a person comparable. The motion sequence is represented by a series of poses. A human pose is the posture of a person that results from the biomechanical movement of the human musculoskeletal system and that is assumed within an image or video frame. This means that the measurement of a person’s height in a frame, defined as the distance from the floor to the top of the head, depends on the posture adopted by the person. Hence, it does not necessarily correspond to the person’s true body height. A pose-dependent height is the measurement from the floor to the top of the head while the person is assuming a pose.
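
As a simplified illustration of the reference-dimension idea (and not of the Blender-based workflow described below), the following sketch scales a pixel measurement in a single frame by a reference object of known size lying roughly in the same image plane; the function name, numbers and the single-plane, distortion-free assumption are illustrative only.

```python
def estimate_length_from_reference(object_px, reference_px, reference_m):
    """Infer an object's real-world length from its pixel length, using a
    reference object of known size at a comparable distance from the camera
    (single-plane simplification, no lens distortion considered)."""
    metres_per_pixel = reference_m / reference_px
    return object_px * metres_per_pixel

# Example: a door frame of 2.00 m height spans 400 px; a person spans 356 px.
person_height_m = estimate_length_from_reference(356, 400, 2.00)
print(f"Estimated pose-dependent height: {person_height_m:.2f} m")  # ~1.78 m
```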

A person’s posture that is depicted in a video frame can be reconstructed by means of a 3D model of that person. This is done by superimposing the frame onto the 3D model while at the same time ensuring that the reconstructed posture of the model corresponds to the real one from all perspectives. As the 3D model represents the dimensions of the real person, height can be estimated directly from the model. Body height or stature is the maximum pose-dependent height of a person when standing fully erect with the feet within hip width and the head oriented in the Frankfurt plane (Footnote 2), similarly to the posture taken during an identification procedure. The body height of a person depicted in the frame is calculated in the metric 3D reference model using a mesh in the form of a plane, which extends from the feet on the ground to the highest point of the head (corresponding to a vertex), is placed sagittally in the centre of the rig and in this way acts as a measuring device. The vertical dimension of the plane then corresponds to the person’s body height in the frame. However, these measurements alone are not sufficient to assign a suspect to one of the perpetrators in a video, since they are rather unspecific to a distinct person. Nevertheless, body height is associated with the musculoskeletal system which, from the perspective of anthropometry, is specific to a person. The method by the authors makes use of this specificity by matching the musculoskeletal system of suspects and its resulting body geometries, in particular the proportions of body parts, with perpetrators in the video. For biometric identification from surveillance footage to work, it must be possible to match person-specific information from the 2D material with the person-specific information of the real person. Hence, the problem is to translate suitable information from 2D to 3D and vice versa. In this light, a framework that has been developed to detect multiple persons in 2D material [12] is to be tested for its suitability for person-specific assignment. Consequently, a quantitative evaluation of similar poses provides the basis for a person classification. The number of poses that can be analysed depends primarily on the underlying image material. Furthermore, the poses have to fulfil certain requirements. One of these is that one leg must touch the ground in order to have a starting point for the analysis of the pose by a rig assignment using the person-specific rigs. Furthermore, no extremities should be covered by other parts of the body. The similarity of the poses is limited to the requirements mentioned. Depending on the underlying material and its quality, at least 15 poses for each person are taken as a basis for a rig assignment, which is explained in detail in the following sections.
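
A minimal sketch of how such pose requirements could be screened automatically on OpenPose output is shown here: a frame is kept only if at least one ankle and all limb key points are detected with sufficient confidence. The confidence threshold and the exact set of required joints are illustrative assumptions, not the project's actual selection criteria.

```python
# BODY_25 indices of the limb joints that must be visible (assumption for
# illustration): shoulders, elbows, wrists, hips, knees and ankles.
REQUIRED_JOINTS = [2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14]
ANKLES = [11, 14]      # right and left ankle
CONF_THRESHOLD = 0.3   # illustrative value

def pose_is_usable(keypoints):
    """keypoints: list of 25 (x, y, confidence) triples for one person.
    Returns True if the pose satisfies the (simplified) requirements:
    at least one ankle detected (foot presumably on the ground) and no
    required limb joint missing or predicted with low confidence."""
    ankle_ok = any(keypoints[i][2] >= CONF_THRESHOLD for i in ANKLES)
    limbs_ok = all(keypoints[i][2] >= CONF_THRESHOLD for i in REQUIRED_JOINTS)
    return ankle_ok and limbs_ok
```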

3 Use of the OpenPose Framework in the COMBI Research Project

In Fig. 2, the recording and evaluation methods of the COMBI research project are visually represented. The individual process steps are explained in detail in the following sections.

Fig. 2

Representation of the data collection, rig derivation and methods used in the COMBI research project. In the first step of the research project, data is collected at different stations: a photogrammetric station and a walking track which is to serve as a simulated crime scene. Person-specific digital skeletons, so-called rigs, are derived from the collected data using AI approaches as well as manual procedures. By including 3D spatial information and merging all data into a metric 3D reference model, it becomes possible to compare and assign rigs of known identities to rigs of unknown identities. This process is called rig alignment and matching. Furthermore, it is important to scientifically evaluate whether and to what extent the rigs created are unique and person-specific. To properly address this question, 3D body scans provided by Avalution (https://www.avalution.net/) were included in the analyses as a way to enrich the data

3.1 Study Design and Data Collection

For the COMBI research project, study participants of different ages and in approximately equal numbers of females and males were recruited from the Lower Saxony police force. During several runs, images and videos of the study participants were recorded from a total of six camera perspectives at two different stations: one equipped with a turntable by stageonair (Footnote 3) and a camera for photogrammetric data collection, the other a walking track recorded by five synchronised recording systems to simulate a crime scene (Figs. 3, 4). Both stations were situated in a large room provided by the Lower Saxony police force. While data will be collected at several stations, at this stage the focus is only on these two stations.

Fig. 3

Workstation for photogrammetric data collection. The minimum setup for the photogrammetric data collection consists of one SLR camera (incl. standard lens) with a tripod as well as a turntable in front of a neutral background. For data collection, the SLR was positioned to fully capture the person, using a focal length of 18 mm and a distance of 170 cm between camera and turntable. These specifications were determined experimentally in order to be able to completely photograph a person with a maximum body height of 200 cm. Furthermore, it was ensured that the camera is focused horizontally on the centre of the body. The corresponding camera height depends on the combined height of the photographed person and the turntable. Images of the turntable alone should be taken to use them as a so-called mask in post-processing steps

Fig. 4

Depiction of the walking track serving as a simulated crime scene. This simulated crime scene was used to collect data from participants by recording different poses and movements using five recording systems that operate in sync. The track also served as a basis for the creation of the metric 3D reference model and the assignment of rigs derived in later process steps

At these stations, study participants were photogrammetrically and videographically recorded in different states of clothing and while assuming different poses on the walking track (the simulated crime scene), from standing upright to walking, bending, turning, etc. Furthermore, for one run, physical markers indicating specific anatomical landmarks of the joints were additionally attached to the participants. These markers formed the basis for subsequent rig derivations used to evaluate the OpenPose predictions.

In addition to the photogrammetric and video material collected this way, a second data basis was taken into account to further evaluate the individuality of the derived rigs: 3D body scans provided by the company Avalution, stemming from the SizeGERMANY anthropometric survey, in which more than 13,000 people were measured between 2007 and 2009 by means of 3D body laser scanning. The data included here represents an extract of the survey sample and consists of 170 men and 170 women in two body height classes each (1.75 m and 1.85 m for the men, 1.63 m and 1.73 m for the women). To prove the individuality of the rigs, OpenPose was applied to the image and video material as well as to the 3D body scans, and both stations were scanned with a terrestrial laser scanner (type FARO FOCUS 3D X130) to create a metric 3D reference model of each station, which in turn can be used to derive, fit, and measure the rigs.

3.2 Creating a Metric 3D Reference Model

As outlined in Sect. 2.2, a 3D model of the area recorded by surveillance cameras can be created by capturing the lengths and widths of, and the distances between, objects in that area. The model is part of a metric 3D reference model which provides the basis for matching derived rigs of suspects with perpetrators in the video, since it is important that such comparisons take place in 3D space. The next step in creating the metric 3D reference model is the alignment of virtual cameras in Blender: virtual cameras are reconstructed at the exact positions of the actual surveillance cameras in the 3D model, using fixed auxiliary objects such as door frames for the correct orientation and rotation, and parameterised according to the technical configuration of the physical camera, such as focal length and sensor size. Once the alignment is complete, the footage from the actual surveillance cameras is imported and used as a background image for the virtual camera. Blender offers the possibility to look directly through the virtual camera, which in turn facilitates the superimposition of the imported footage on the underlying 3D reference model. This superimposition now allows true-to-scale measuring at any position in the video material. As the virtual camera behaves in the same way as the actual surveillance camera (even accounting for distortions such as those caused by a fisheye lens), objects at a greater distance from the camera appear smaller than objects close to it, and thus accurate measuring is possible at any position within the metric 3D reference model. The final element of the metric 3D reference model is a selection of suitable video frames from the footage, i.e. frames in which the perpetrator is close to the camera and the extremities are clearly visible; these form the basis for comparing the derived rigs of suspects with perpetrators in the video (Fig. 5).
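
A minimal Blender Python (bpy) sketch of this camera-alignment step is given below; the position, rotation, focal length, sensor width and the path of the background footage are placeholder values, and the fine alignment against auxiliary objects such as door frames would still be carried out manually in the viewport.

```python
import bpy
from math import radians

# Create a virtual camera that mimics one physical surveillance camera
# (all numeric values are placeholders for illustration).
cam_data = bpy.data.cameras.new("SurveillanceCam_01")
cam_data.lens = 2.8           # focal length in mm, as in the physical camera
cam_data.sensor_width = 5.37  # sensor width in mm

cam_obj = bpy.data.objects.new("SurveillanceCam_01", cam_data)
cam_obj.location = (4.20, -1.35, 2.60)                     # position in the 3D scan (m)
cam_obj.rotation_euler = (radians(75), 0.0, radians(210))  # orientation
bpy.context.scene.collection.objects.link(cam_obj)

# Use a frame of the actual surveillance footage as background image of the
# virtual camera, so that scan and footage can be superimposed.
cam_data.show_background_images = True
bg = cam_data.background_images.new()
bg.image = bpy.data.images.load("//footage/cam01_frame_0001.png")  # placeholder path
bg.alpha = 0.5
```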

Fig. 5

The metric 3D reference model. Alignment of virtual cameras within the 3D model, recreated from measurements of the captured area. When merged with the video footage, this metric 3D reference model can be used to compare the rigs of study participants with a simulated perpetrator from the footage

3.3 Prediction of Person-Specific Rigs through OpenPose

Images and video frames from the collected data as well as images taken of the 3D body scans were fed into OpenPose. The basis for the prediction of the joints was the BODY_25 model, the standard model that includes the foot key points. The predictions created this way (Fig. 6) on images from the photogrammetry station were then used to generate person-specific rigs, which were subsequently compared to the predictions on frames from the footage of the simulated crime scene in the metric 3D reference system.
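
For reference, a typical way of obtaining such predictions is to run the OpenPose demo on an image directory and write the key points to JSON, then parse the per-person `pose_keypoints_2d` arrays (a flat list of x, y, confidence triples). The directory names below are placeholders; the exact command-line options used in the project are not specified here.

```python
import json
from pathlib import Path

# The JSON files are assumed to have been produced by the OpenPose demo, e.g.:
#   openpose.bin --image_dir ./photogrammetry --model_pose BODY_25 \
#                --write_json ./openpose_json --render_pose 0 --display 0
# (directory names are placeholders).

def load_poses(json_dir):
    """Read OpenPose JSON output and return, per file, a list of persons,
    each as a list of 25 (x, y, confidence) triples."""
    poses = {}
    for path in sorted(Path(json_dir).glob("*_keypoints.json")):
        with open(path) as f:
            data = json.load(f)
        persons = []
        for person in data.get("people", []):
            flat = person["pose_keypoints_2d"]  # 25 * 3 values
            persons.append([tuple(flat[i:i + 3]) for i in range(0, len(flat), 3)])
        poses[path.stem] = persons
    return poses
```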

Fig. 6

Exemplary representation of OpenPose predictions. A Video frame from the simulated crime video and the participant depicted in it. Coloured key points connected by coloured ellipses form the rig predicted by OpenPose. B Single image from the photogrammetry station, with the participant depicted from a near-frontal view

3.4 Creating Person-Specific 3D Rigs

3D digital skeletons, so-called rigs, were created from the images taken during photogrammetry in order to compare the anthropometric pattern between the study participants. Two kinds of rigs were generated for each participant, based either on physical joint markers (manually created rigs, Fig. 7) or on OpenPose predictions (automatically created rigs, Fig. 8). The physical joint markers were manually attached to the subjects prior to the photogrammetric procedure. They were placed onto the skin above palpable bony landmarks within the articular area of the bone. At the same time, they approximate the positions of the key points of the OpenPose predictions, which in turn serve as the basis for the second kind of rig. The photogrammetric procedure provides images of the subject from all angles and allows the creation of the person’s 3D model, both of which can be imported into Blender. Within a metric 3D reference model of the photogrammetric workstation, this data can be used to set up a minimum of three virtual cameras showing the subject from the frontal and both lateral views. Using these views, the rig is then generated by linking the anatomical joints (with the physical markers for orientation) or the OpenPose-predicted joints. The derivation of a rig based on OpenPose includes the following predicted joints: ankles, knees, hips, shoulders, elbows, wrists and the key point for the nose. The individual physical joint markers and predicted key points, respectively, can be merged into a rig via superimposition of the images on the 3D model within the metric 3D reference model using Blender.
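
A minimal sketch of how such a rig could be assembled programmatically in Blender from already reconstructed 3D joint positions is shown below; the joint coordinates and the bone list are placeholders, and in the project the linking is driven by the marker or OpenPose positions within the metric 3D reference model rather than by hard-coded values.

```python
import bpy

# Placeholder 3D joint positions in metres (in the project these come from the
# superimposition of marker/OpenPose key points within the 3D reference model).
joints = {
    "hip":        (0.00, 0.00, 0.95),
    "knee_r":     (0.10, 0.00, 0.50),
    "ankle_r":    (0.10, 0.00, 0.08),
    "shoulder_r": (0.18, 0.00, 1.45),
    "elbow_r":    (0.22, 0.00, 1.15),
    "wrist_r":    (0.24, 0.00, 0.90),
}
# Bones as (name, head joint, tail joint) triples (right side only for brevity).
bones = [
    ("thigh_r",     "hip",        "knee_r"),
    ("shin_r",      "knee_r",     "ankle_r"),
    ("upper_arm_r", "shoulder_r", "elbow_r"),
    ("forearm_r",   "elbow_r",    "wrist_r"),
]

arm_data = bpy.data.armatures.new("PersonRig")
arm_obj = bpy.data.objects.new("PersonRig", arm_data)
bpy.context.scene.collection.objects.link(arm_obj)
bpy.context.view_layer.objects.active = arm_obj

# Edit bones can only be created in edit mode.
bpy.ops.object.mode_set(mode="EDIT")
for name, head, tail in bones:
    bone = arm_data.edit_bones.new(name)
    bone.head = joints[head]
    bone.tail = joints[tail]
bpy.ops.object.mode_set(mode="OBJECT")
```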

Fig. 7

Rig generation based on physical joint markers within the metric 3D reference model. The metric 3D reference model including the aligned virtual camera of the photogrammetry station forms the basis for the derivation of rigs in Blender. The created rig can be viewed from different angles and coloured as needed for better visualisation

Fig. 8

Rig generation based on OpenPose predictions within the metric 3D reference model. The metric 3D reference model including the aligned virtual camera of the photogrammetry station forms the basis for the derivation of rigs in Blender. The created rig can be viewed from different angles and coloured as needed for better visualisation

3.5 Rig Assignment in the Metric 3D Reference Model

The rig is then imported into the metric 3D reference model of the walking track (the simulated crime scene) and aligned via its joints along the axes according to the position of the perpetrator in the video frame (Fig. 9), starting with the feet on the ground and finishing at the head. A possible match is assumed when the extremities and the joints can be perfectly aligned with those of the perpetrator. Not only the individual height of a person in a frame and rig respectively, but also the proportions of the extremities can be quantitatively measured in Blender by means of a virtual ruler and thus be included in the comparison.
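
As a small illustration of such a proportion comparison (outside of Blender, with placeholder joint names and coordinates rather than the project's measured values), the following sketch computes limb lengths and a forearm-to-upper-arm ratio from 3D joint positions of a fitted rig.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Placeholder 3D joint positions of a fitted rig (metres).
rig = {
    "shoulder_r": (0.18, 0.02, 1.45),
    "elbow_r":    (0.22, 0.05, 1.15),
    "wrist_r":    (0.24, 0.07, 0.90),
}

upper_arm = dist(rig["shoulder_r"], rig["elbow_r"])
forearm = dist(rig["elbow_r"], rig["wrist_r"])
print(f"upper arm: {upper_arm:.3f} m, forearm: {forearm:.3f} m, "
      f"forearm/upper arm ratio: {forearm / upper_arm:.2f}")
```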

Fig. 9

Assignment of rigs based on physical joint markers as well as on OpenPose predictions to the participant depicted in a video frame within the metric 3D reference model. A Illustration of the OpenPose predictions of the participant’s joints. B Representation of a rig based on physical joint markers that was fitted into the participant in the video frame in Blender. Important in this process step is the consideration of the well-defined pose of the participant in the image and video material into which the rig is assigned. C Representation of a rig based on OpenPose predictions that was fitted into the participant in the video frame in Blender. D Illustration of both kinds of rigs fitted into the participant in the video frame in Blender

4 Results and First Evaluation

Three different aspects are the focus of evaluation in the COMBI research project. The first concerns the individuality of the rigs and the OpenPose predictions. The second aspect aims at correctly assigning the rigs (i.e. the references for which the identities of the persons are known) to the persons or OpenPose predictions (of the perpetrators with unknown identities) in the video frames by means of superimposition. This is done by inter-individual (between persons) and intra-individual comparisons (OpenPose predictions and superimpositions of different frames of the same person), based on the distances of which the rigs are composed (degree of dissimilarity) as well as on the RMSD, a quantitative measure describing how well the rig and the OpenPose prediction fit the person in a frame. Associated with the correct assignment is the third aspect of evaluation, which examines how accurately both the rigs and the OpenPose predictions represent the persons’ actual anatomy and proportions. For this purpose, the OpenPose framework provides information regarding the accuracy of its predictions, which is taken into consideration together with a visual examination of how well the OpenPose prediction follows the person in the frame. In addition, frames taken from various camera angles and at various distances from the camera are taken into account as well.
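
As an illustration of the RMSD measure mentioned here (understood as a root-mean-square deviation over corresponding points), the following sketch computes it for corresponding 2D joint positions of a projected rig and an OpenPose prediction; the pairing of points and the pixel values are assumptions made for this example.

```python
import math

def rmsd_2d(rig_points, openpose_points):
    """Root-mean-square deviation between corresponding 2D joint positions
    (e.g. a rig projected into the frame vs. the OpenPose prediction).
    Both arguments are equally long lists of (x, y) tuples in pixels."""
    assert len(rig_points) == len(openpose_points)
    squared = [
        (xa - xb) ** 2 + (ya - yb) ** 2
        for (xa, ya), (xb, yb) in zip(rig_points, openpose_points)
    ]
    return math.sqrt(sum(squared) / len(squared))

# Example with three corresponding joints (invented pixel coordinates):
print(rmsd_2d([(100, 200), (120, 260), (140, 330)],
              [(103, 198), (118, 265), (141, 328)]))
```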

A first analysis addressing the uniqueness of the rigs was based on the pose-independent comparison of the Euclidean distances between the joints of the OpenPose predictions (2D) and of the rigs based on OpenPose (3D) (Fig. 10). For the former, images from the photogrammetric procedure as well as images taken of the 3D body scans were used in a frontal view, here defined as the image in which the participant or the 3D body scan displays maximum shoulder and hip breadth; this view was selected either by means of image processing methods, i.e. contour comparison, or, in the case of the 3D body scans, with the aid of the coordinate system. Thirteen distances between the 14 key points representing joints of the human musculoskeletal system were included in the analysis. The predictions of these key points performed best in terms of stability across several images of one participant or 3D body scan respectively and were therefore deemed the most reliable. Formula (1) was used to obtain a coefficient describing the degree of dissimilarity between respective rigs and OpenPose predictions. A value of \(E^{AB}\) = 0 corresponds to a match; increasing values imply an increasing dissimilarity between two rigs or OpenPose predictions. Table 1 shows the measured values of the joints from the example rigs No. 1 and No. 8, which were used to compare the dissimilarity of those rigs. The values were measured using a virtual ruler in Blender, provided by the add-on MeasureIt (https://docs.blender.org/manual/en/latest/addons/3d_view/measureit.html). The virtual ruler can be assigned to selected vertices representing the points to be measured. Figure 11 shows a selection of 10 rigs, including rigs No. 1 and No. 8, and the dissimilarity between them. The degree of dissimilarity among the OpenPose predictions is shown in Fig. 12A for the 3D body scans, in Fig. 12B for the study participants based on frontal images from the photogrammetric procedure, and in Fig. 12C among the rigs of the study participants that are based on OpenPose. A second approach examines the similarity of position rather than length and is based on the superimposition of the rigs onto the person in the image. For this purpose, the rig is adjusted to the person’s well-defined pose using the joint markers and OpenPose predictions. Measures such as the RMSD (root-mean-square deviation) can then be used to quantify the similarity of the rig and the person in the image, as is intended for future analyses to provide another independent line of evaluation.

Fig. 10

Measured distances on the rigs. Representation of the 13 measured distances collected from the rigs for a comparison of similarity

$$\begin{aligned} E^{AB}=\sqrt{\sum \limits _{i=1}^{n} \frac{(d_{A_i}-d_{B_i})^2}{n}} \end{aligned}$$
(1)
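
A direct implementation of this coefficient is sketched below; \(d_{A_i}\) and \(d_{B_i}\) denote the i-th of the n measured distances of rigs (or predictions) A and B, and the example distances are invented rather than taken from Table 1.

```python
import math

def dissimilarity(distances_a, distances_b):
    """Coefficient E^AB from formula (1): the root of the mean squared
    difference between corresponding joint-to-joint distances of two rigs
    (or two OpenPose predictions). A value of 0 means the rigs match."""
    assert len(distances_a) == len(distances_b)
    n = len(distances_a)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(distances_a, distances_b)) / n)

# Invented example with three of the n = 13 distances (in cm):
rig_a = [42.1, 38.7, 25.3]
rig_b = [44.0, 37.9, 26.1]
print(round(dissimilarity(rig_a, rig_b), 2))  # 1.28
```
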
Table 1 Values of measured joints from rigs No. 1 and No. 8
Fig. 11

Degree of dissimilarity applied to rigs No. 1 and No. 8. The rigs are composed of joints and Euclidean distances between those joints. Using the distances, two rigs can be compared for their dissimilarity. The minimum attainable score is zero, describing two rigs as identical. Identical scores are only attained within the sample when the same rig is compared with itself. The maximum attainable score is set by the comparison that results in the greatest dissimilarity. Based on the data from Table 1, the degree of dissimilarity between rig No. 1 and rig No. 8 is 3.87. The comparison shows that the rigs can be differentiated from each other

Fig. 12

Representation of the degree of dissimilarity among the OpenPose predictions of A the 3D body scans, B the study participants based on frontal images from the photogrammetric procedure, as well as C among the rigs of the study participants that are based on OpenPose. The heat maps depicted show that a value of zero only occurs when the predictions or rigs of the same participant or 3D body scan, respectively, are compared. Apart from that, the results show a significant degree of dissimilarity across all three data sets, strongly suggesting that no two rigs or OpenPose predictions are alike

5 Conclusion and Further Procedure

In the further course of the COMBI research project, analyses regarding the project’s two main objectives outlined in this publication will be continued and expanded, for instance by the application of the RMSD and further measures of similarity. Randomly selected rigs of study participants are to be clearly matched to the corresponding identity depicted in a video frame. Moreover, alternative AI frameworks [10, 17, 18] are to be evaluated regarding their suitability for the accurate prediction of human joints. In addition, it is planned to address various questions regarding influences on the OpenPose predictions (e.g. the resolution and the perspective of the cameras) and the practical implementation of the rig derivation and matching procedure (for instance, the possibility of identifying a crime series by successfully matching the anthropometric patterns of suspects from different crime scenes). In this way, the taking of evidence could be improved in the future, criminalistic hypotheses could be falsified or verified, and investigations could thus be supported up to the point of solving the crime. Law enforcement agencies could then greatly benefit in their investigations, especially in cases of violent crime or organised property crime, if such a method were to be introduced. At this stage, the derived rigs, whether based on physical joint markers or on OpenPose, only contain a simple abstracted bone representing the spine. This unnatural immobility can lead to limitations when fitting the rig into a pose. To assume any possible human pose, the rigs need to be optimised, which is another concern of the project. At this point, the research done in the COMBI research project was able to demonstrate the individuality of the rigs and OpenPose predictions as well as to apply an AI framework in a forensic context.