
1 Introduction

The lack of day-to-day clinical information hinders our understanding of disease trajectories on multiple time scales, including those of diseases affecting gait and balance (e.g., neurological conditions). Free-living (habitual) ambulatory gait analysis has demonstrated unique insight into disease progression, with implications for diagnosis and for evaluating treatment efficacy. For example, spatial metrics (e.g., step length), temporal metrics (e.g., step time), and gait irregularities (e.g., compensatory balance reactions or near-falls) of free-living mobility behaviour have shown promise in predicting the risk of falling in older adult populations.

The recent explosion of ambient sensors (e.g., motion capture sensors, force mats), smartphones, and wearable sensor systems (e.g., inertial measurement units, IMUs) has facilitated the emergence of new techniques to monitor gait and balance control in natural environments and during everyday activities [8, 22, 29]. Embedded into living environments, ambient third-person video (TPV) and depth cameras (e.g., Microsoft Kinect) have been investigated as means to extract gait parameters [10, 14], detect episodes of freezing of gait in Parkinson’s disease [5], detect falls, and identify longitudinal changes in a patient’s mobility patterns [3, 4, 36]. While TPV systems have demonstrated potential to detect small changes over long periods (i.e., months to years), these approaches suffer from visual occlusions (e.g., furniture), difficulty handling multiple residents, and difficulty extracting spatiotemporal parameters when a full-body view is unavailable. Moreover, they are restricted to fixed areas. Considering that mobility is characterized by moving the body from one location (i.e., environment) to another, significant daily-life mobility data may go uncaptured without multiple-camera coverage when using ambient sensors.

An alternative approach is to use wearable sensors affixed to the user’s body. There have been many successful research programs using IMUs to monitor physical (and sedentary) activity, identify activity types, estimate full-body pose, and measure gait parameters [8, 17, 21, 22, 29]. Body-worn IMUs have demonstrated excellent capabilities to measure temporal gait parameters. However, a critical drawback associated with the use of IMUs is inaccurate estimation of key spatial parameters. In particular, step width is linked to gait stability and is strongly associated with fall risk [6, 27]. This measurement limitation is largely attributed to the relative lack of motion in the frontal plane during gait, resulting in small IMU excitation and a low signal-to-noise ratio.

Egocentric first-person video (FPV), acquired via body-worn cameras, may outperform IMUs for the purpose of estimating spatial parameters of gait. With a waist-worn camera pointed down and ahead of the user, FPV offers a potentially stronger signal for spatial estimation, especially in the frontal plane. For instance, a smartphone-based camera was mounted on the waist to quantify gait characteristics in [25]; however, that system required additional markers on the feet. There are also secondary reasons for investigating FPV as a sensing modality for gait assessment. Vision captures rich information on the properties of the environment that influence mobility behaviour, including slope changes (e.g., stairs, curbs, ramps) and surfaces (e.g., gravel, grass, concrete) [32, 33]. Furthermore, FPV offers the potential to reconstruct events by capturing the immediate environmental context more readily than IMU-based data alone. Without detailed information on the mobility context, such as the presence of other pedestrians, terrain characteristics, and obstacles, the ability to interpret ambulatory gait data is constrained. For example, FPV recordings have been used to validate other IMU-based algorithms [17, 46] by manually viewing video frames and identifying specific events.

To address the problem of ambulatory measurement of spatial gait parameters, this paper tackles the initial problem of localizing feet, in 2D frame coordinates, in FPV captured from a belt-mounted camera. In comparison to head- and chest-mounted camera views, we hypothesized that a waist-level view would offer the best view for three reasons. First, waist-level FPV offers a consistent view of the legs and feet even when turning. In contrast, head- or chest-mounted views tend to rotate in anticipation of turns or changes in attention, which reduces the available views of the feet. Second, a waist-level view affords greater resolution of the feet than views higher on the body. Finally, camera egomotion is hypothesized to provide a rich source of temporal information to segment body parts [28]. We propose a method to generate pixel-wise foot placement outputs towards the eventual goal of estimating spatial parameters (e.g., step width). The transformation from pixel outputs to distances, likely using 2D metrology approaches, is beyond the scope of the current study and will be examined in subsequent work. To achieve foot localization, we propose an FPV-based deep hybrid architecture called the FootChaser model (see Fig. 3). The model comprises (a) the FootRegionProposer, which uses a ConvNet to propose high-confidence foot regions (bounding boxes), and (b) the LocomoNet, which examines the temporal dynamics of the proposed regions to refine the FootRegionProposer output by filtering out false positives to locate the feet. An evaluation of the proposed method’s ability to accurately localize feet is reported and discussed.

1.1 Related Work

While there have been TPV-based research efforts utilizing smartphone or ambient camera video to assess gait (e.g., [10, 14, 36]) and estimate pose (e.g., [9, 12, 15, 20, 50]), the challenges and signals associated with FPV are distinct. Several factors challenge the proposed concept: (1) occlusion or extreme illumination conditions, (2) objects or terrain patterns similar to the feet (e.g., other people’s feet), and (3) motion blur from fast movements. In this section, we focus on reviewing previous efforts using FPV to address these challenges and to inform our chosen camera type and location.

There are relatively few previous works aiming to extract spatial gait parameters using FPV. An interesting and novel approach used a walker-mounted depth and/or colour camera to estimate the 3D pose of the lower limbs, mainly in the frontal plane [18, 31, 35]. To achieve this, Ng et al. [31] used a general appearance model (texture and colour cues) within a Bayesian probabilistic framework. In [18], a Kinect (depth) sensor and two RGB cameras were placed on a moving walker, and 3D pose estimation was formulated as a particle filtering problem with a hidden Markov model. The key limitation of these works is the dependency on a stable platform (i.e., the walker) to afford consistent views of the lower limbs and monitor pose over time, which does not generalize to individuals who do not require a walking aid for ambulation.

The possibility of using one or several body-mounted cameras has been investigated for 3D full-body [24, 43, 51] and upper-limb (arms and hands) [30, 40] pose estimation. In [24, 43], outward-looking body-mounted cameras and optimization approaches were used to estimate 3D body pose. In [43], more than ten cameras were attached to the person’s joints, and a structure-from-motion approach was used to localize the cameras, estimate the joint angles, and reconstruct human motion. The main limitations of that method are the obtrusive multi-camera setup and the intensive computational load required to infer pose in a video sequence. To alleviate these weaknesses, Jiang et al. [24] developed a model based on synchronized egocentric videos captured by a chest-mounted camera and a Kinect sensor. Their 3D body pose model employs camera egomotion and contextual cues to infer body pose without direct views of the key body parts (i.e., legs, feet) desired for gait assessment. Moreover, the videos were restricted to relatively static activities (i.e., sitting, standing). Such restrictions, and the failure to examine more complex (i.e., dynamic) scenarios, limit the applicability of their approach to the gait assessment problem.

In contrast to the previous studies, [39, 51] utilized body-related visual cues (outside-in/top-down view) provided by fisheye cameras attached to a bike helmet and a baseball cap, respectively. In [51], a ConvNet for 3D body pose estimation was developed to address limitations of the earlier version [39], including its dependency on 3D actor model initialization and its inability to run in real time. Although the authors compensated for the distortion imposed by the fisheye lens, estimation of the lower-body 2D heatmaps (ankles, knees, hips, and toes) was less accurate due to the strong perspective distortion (i.e., a large upper body and a small lower body).

The closest approach in spirit to ours is a hybrid method that combines global object appearance (spatial network) and motion patterns (temporal network) in a two-stream ConvNet structure. This approach was inspired by Simonyan and Zisserman [44], in which a ConvNet was trained on stacks of optical flow for the task of TPV-based activity recognition. A similar architecture has also been employed in FPV-based methods to recognize different activities [28, 45]. To capture long-term sequential information from FPV data, recurrent neural networks/long short-term memory (LSTM) were used by Abebe et al. [1, 2], where stacked spectrograms generated over temporal windows from mean grid optical flow vectors were used to represent motion [45].

Modeling temporal information in specific regions enclosed by bounding boxes over consecutive frames has been investigated in some TPV-based studies [7, 47]. In [23], an object-centric motion compensation scheme was implemented by training CNNs as regressors to estimate the shift of the person from the center of the bounding box. These shifts were then applied to the image stack (a rectified spatiotemporal volume) so that the subject remained centered. More closely related to our LocomoNet approach is the work by Brattoli et al. [7], in which a fully connected network was trained to analyze the grasping behavior of rats over time. Based on optical flow data of both initial positives (paw regions) and random negatives cropped from other regions, a temporal representation was learned to detect paws.

Fig. 1. Egocentric camera-based gait assessment overview. Panels a, b, c, d, e represent different phases of gait captured by a belt-mounted camera, showing the x and y locations of the right foot (red bounding boxes) and left foot (green boxes) over consecutive frames (XCoM: extrapolated center of mass). Rows f and g depict lateral sidestep and lateral crossover compensatory balance reactions, respectively; these reactions are important behaviours related to fall risk. Note that the transformation from pixel-wise box coordinates to distances is not covered in the current study. (Color figure online)

2 The FootChaser Framework

In this section, we describe the framework for proposing high-confidence foot regions by incorporating both temporal and spatial data for the task of gait assessment. As an alternative to inferring gait parameters from 3D pose estimates, we hypothesized that tracking the centers of the person’s feet in the 2D image plane over time could provide accurate spatial estimates. The scope of this paper is limited to detecting the feet; the transformation from camera coordinates to spatial locations will be examined in subsequent efforts.

Let \(I_i\) be the \(i^{th}\) frame in a video sequence of length N, captured by a belt-mounted camera with an outside-in, top-down view (\(i = \{1,2\cdots N\}\)). The manually annotated ground truth (GT) data are in the form of bounding boxes \(GT_{f,i}=[x_{f,i}^{GT},y_{f,i}^{GT},w_{f,i}^{GT},h_{f,i}^{GT}]\) indicating the camera wearer’s feet \((f=\{left, right\})\) in the 2D \(1080\times 1920\) coordinate system of each frame (see Fig. 1), where x and y denote the center (\(C_{f,i}^{GT}\)), and w and h represent the width and height of the bounding box, respectively (see Fig. 2). The goal of the FootChaser framework is to detect and localize the center of each foot (if present in the frame) in the form \(P_{f,i}=[x_{f,i}^{P},y_{f,i}^{P},w_{f,i}^{P},h_{f,i}^{P}]\) during gait. In the ideal case, the error measure (E) is minimized for the x (\(E(x_{f,i}^{GT},x_{f,i}^{P})\)) and y (\(E(y_{f,i}^{GT},y_{f,i}^{P})\)) trajectories, and the underlying area is the same for the Ps and GTs, i.e., the intersection over union (IoU) measure is maximized (\(IoU=1\)). The predicted x (\(\approx \) frontal axis) and y (\(\approx \) sagittal axis) trajectories can then be used to estimate pixel-wise step width and step length, respectively.
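As a concrete illustration of these two criteria, the following minimal Python sketch computes the IoU and per-axis center error for a pair of boxes, assuming boxes are stored in the \([x, y, w, h]\) center format defined above; it is illustrative only and not part of the FootChaser implementation itself.

```python
import numpy as np

def iou(gt_box, pred_box):
    """Intersection over union for two [x_center, y_center, w, h] boxes (pixels)."""
    def to_corners(b):
        x, y, w, h = b
        return x - w / 2, y - h / 2, x + w / 2, y + h / 2

    gx1, gy1, gx2, gy2 = to_corners(gt_box)
    px1, py1, px2, py2 = to_corners(pred_box)
    inter_w = max(0.0, min(gx2, px2) - max(gx1, px1))
    inter_h = max(0.0, min(gy2, py2) - max(gy1, py1))
    inter = inter_w * inter_h
    union = gt_box[2] * gt_box[3] + pred_box[2] * pred_box[3] - inter
    return inter / union if union > 0 else 0.0

def center_error(gt_box, pred_box):
    """Pixel-wise error E between GT and predicted box centers, per axis."""
    ex = abs(gt_box[0] - pred_box[0])  # x ~ frontal axis (step width)
    ey = abs(gt_box[1] - pred_box[1])  # y ~ sagittal axis (step length)
    return ex, ey
```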

To investigate the feasibility of pixel-wise step-by-step gait parameter extraction, the \(x_{left}^{GT}\) and \(x_{right}^{GT}\) data are plotted in Fig. 2. While \(y_{left}^{GT}\) and \(y_{right}^{GT}\) were examined for the measurement of step length, we focus on step width estimation in the current study. We observed that (a) the trajectories roughly resemble center of pressure (CoP) data captured by force plates, (b) the local maxima and minima appear to be correlated with right heel strikes (RHSs) and left heel strikes (LHSs), respectively (further investigation is required using gold-standard gait analysis methods, e.g., Vicon), and (c) the GT data can be divided into frames with one foot (\(GT-One\)) and frames with both feet (\(GT-Two\)).

In most of the \(GT-Two\) frames, only a small portion of the trailing foot is observable (see Fig. 1), and this portion is irrelevant for the extraction of gait parameters. Since shape distortions affect detection results, we hypothesized that the ConvNet is more likely to detect the fully visible foot than the less visible one, similar to the findings of Huang et al. [19] and Rozantsev et al. [41]. In other words, in frames with two GT boxes, the network tends to locate the center of the foot that is required for the extraction of gait parameters.

Considering these cues, we surmised that tracking each foot separately is unnecessary and that frames with only one predicted foot center can be used to extract step width. Specifically, the center \(C_i^{P-One}\) is obtained from the FootChaser output \(P-One=[x_i^{P-One},y_i^{P-One},w_i^{P-One},h_i^{P-One}]\), regardless of the foot type \(f\). The key signals for the calculation of spatiotemporal gait parameters (e.g., LHS and RHS points) can then be observed from the \(x^{P-One}\) and \(y^{P-One}\) trajectories.
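To make the intended use of these trajectories concrete, the sketch below (not part of the original implementation) estimates pixel-wise step width from the \(x^{P-One}\) trajectory by detecting local extrema as candidate heel-strike events; the `min_frame_gap` parameter and the gap interpolation are illustrative assumptions that would need tuning against gold-standard gait data.

```python
import numpy as np
from scipy.signal import find_peaks

def pixelwise_step_width(x_pone, min_frame_gap=10):
    """Estimate pixel-wise step width from the x trajectory of single-foot frames.

    x_pone: 1D array of x^{P-One} values (NaN where no single region remained).
    Local maxima/minima are taken as candidate RHS/LHS events; step width is the
    pixel distance between consecutive opposite extrema. min_frame_gap is an
    assumed minimum spacing between heel strikes (tune to cadence and frame rate).
    """
    x = np.asarray(x_pone, dtype=float)
    valid = ~np.isnan(x)
    # Interpolate short gaps so peak finding sees a continuous trajectory.
    xi = np.interp(np.arange(len(x)), np.flatnonzero(valid), x[valid])

    maxima, _ = find_peaks(xi, distance=min_frame_gap)    # candidate RHS events
    minima, _ = find_peaks(-xi, distance=min_frame_gap)   # candidate LHS events

    events = sorted([(i, xi[i]) for i in np.concatenate([maxima, minima])])
    # Pixel step width = |x difference| between consecutive heel strikes.
    return np.abs(np.diff([v for _, v in events]))
```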

Fig. 2. Sample bounding box x-coordinate time series data from dataset 2: ground truth (GT) data for the left (green) and right (red) feet, and FootChaser predictions with one identified region (blue). The expected x locations of left heel strikes (LHS) and right heel strikes (RHS) are marked (further investigation is required using gold-standard gait analysis methods, e.g., Vicon). Periods with two identified feet (GT-Two) are indicated by dotted boxes. (Color figure online)

To achieve foot localization, we propose a two-stage FootChaser framework comprising two ConvNets: (1) the FootRegionProposer and (2) the LocomoNet. The FootRegionProposer proposes \(n \in \mathbb {N}\) bounding boxes as 'proposed foot regions', or \(PFR_{j,i}\), \(j=\{1,\)...\(,n\}\), in the \(i^{th}\) frame. As there may be several false positives among the proposed regions, we hypothesized that the FootRegionProposer results could be boosted by applying another ConvNet, called the LocomoNet, trained to be sensitive to the periodic movement patterns embedded in the user’s foot regions during gait. In other words, the LocomoNet is expected to filter out false positives by selecting the most confident regions. After applying the LocomoNet to \(PFR_{j,i}\), only the frames with a single remaining PFR are used for step width estimation (see Fig. 2).
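The overall two-stage logic can be summarized by the following Python sketch; `foot_region_proposer`, `locomonet`, and `build_ofv_pfr` are placeholder callables standing in for the trained models and the optical-flow cropping described below, and the confidence threshold is an assumed value, not one reported in the paper.

```python
import numpy as np

def footchaser_frame(frame_idx, frames, foot_region_proposer, locomonet,
                     build_ofv_pfr, conf_thresh=0.5):
    """One FootChaser step (sketch): propose foot regions, filter with LocomoNet,
    and keep the frame only if exactly one proposed region survives.

    foot_region_proposer(image)   -> list of PFR boxes [x, y, w, h]
    locomonet(ofv_crop)           -> probability the crop contains the wearer's foot
    build_ofv_pfr(frames, i, box) -> optical-flow volume cropped around the box center
    (All three callables are placeholders for the models described in the text.)
    """
    pfrs = foot_region_proposer(frames[frame_idx])                            # stage 1
    kept = [box for box in pfrs
            if locomonet(build_ofv_pfr(frames, frame_idx, box)) > conf_thresh]  # stage 2
    if len(kept) == 1:                     # single surviving PFR -> usable for step width
        x, y, w, h = kept[0]
        return np.array([x, y])            # C_i^{P-One}
    return None                            # ambiguous or empty frame: skip (NaN)
```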

2.1 FootRegionProposer

The FootRegionProposer is a ConvNet fine-tuned to propose PFRs in a frame. The \(j^{th}\) proposed region is in the form of a bounding box \(PFR_{j,i} = [x_{j,i}, y_{j,i}, w_{j,i}, h_{j,i}]\), where \(x_{j,i}\), \(y_{j,i}\), \(w_{j,i}\), and \(h_{j,i}\) denote the center coordinates, width, and height of the box, respectively (see sample PFRs marked by red rectangles in Fig. 3). The training procedure for the FootRegionProposer is discussed in Subsect. 3.2. As noted above, several factors may challenge the performance of the FootRegionProposer: (1) occlusion or extreme illumination conditions can increase the number of false negatives, (2) objects or terrain similar to the feet can act as noise (see Fig. 4-c), and (3) motion blur can result from fast movements. In addition to incorporating a fast and precise object localization/detection ConvNet (e.g., Faster R-CNN [38] or YOLO [37]), a second ConvNet was applied to the FootRegionProposer output to filter false PFRs (Subsect. 2.2).

Fig. 3. The FootChaser framework. First, the FootRegionProposer proposes \(n \in \mathbb {N}\) bounding boxes \(PFR_{j,i}\) (red boxes), \(j=\{\)1,2,...\(,n\}\), in the \(i^{th}\) frame. The multiple proposed regions are examined by the LocomoNet to filter out false positives. After obtaining the stack of optical flow volume \(OFV_i\) (V and U are the vertical and horizontal 2D flow components) from the \([i-L/2,i+L/2-1]\) frames (L denotes the depth/length of the stack), LocomoNet inputs are obtained by cropping fixed-size regions centered at the center of each \(PFR_{j,i}\), i.e., \((x_{j,i},y_{j,i})\), which creates the optical flow volumes of the PFRs \((OFV-PFR_{j,i})\). The final FootChaser outputs reflect frames with a single proposed region (\(C_{i}^{P-One}\)). (Color figure online)

2.2 LocomoNet: Learning from Gait Patterns

To reduce the number of false positives (i.e., false PFRs) proposed by the FootRegionProposer network (towards the goal of 'one' true PFR), the dynamic temporal structure of the \(PFR_{j,i}\) is further examined by the proposed LocomoNet ConvNet. Inspired by Simonyan and Zisserman’s work [44], we examine optical flow features to deliver bounding boxes with higher confidence of representing feet.

The horizontal \(U=\{U_1,U_2,...,U_{N-1}\}\) and vertical \(V=\{V_1,V_2\),...\(,V_{N-1}\}\) optical flow can be calculated separately for each pair of consecutive frames in the video sequence (the height and width of the U and V components are equal to the frame’s 2D dimensions, i.e., \(1080\times 1920\)). Considering a fixed length of L consecutive frames, the optical flow volume \(OFV_i=\{U_{i-L/2},V_{i-L/2},\)...\(,U_{i+L/2-1},V_{i+L/2-1}\}\) is obtained for the \(i^{th}\) frame. To represent the temporal information of \(PFR_{j,i}\), a fixed \((W_c\times H_c)\) region centered at \((x_{j,i},y_{j,i})\) is cropped from \(OFV_i\), which yields a \((2L\times W_c\times H_c)\) volume of interest \((OFV-PFR_{j,i})\) corresponding to that proposal (see Fig. 3). Each of these volumes is fed into the LocomoNet for filtering. The training procedure for the LocomoNet is discussed in Subsect. 3.3. After applying the LocomoNet, if a frame has only one remaining PFR, the center of that \(PFR_{j,i}\) is saved in the center vector (\(C_i^{P-One}\)). Otherwise, the corresponding component is replaced by NaN and is not considered in the evaluation.
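A minimal sketch of this cropping step is given below, assuming the flow fields have already been computed and stored as NumPy arrays; the off-frame handling mirrors the shifting strategy described in Subsect. 3.3 and the array layout is an assumption for illustration.

```python
import numpy as np

def crop_ofv_pfr(U, V, i, center, L=10, Wc=224, Hc=224):
    """Crop a (2L, Hc, Wc) optical-flow volume around a proposed foot region center.

    U, V: lists of horizontal/vertical flow fields (each H x W), where U[k]/V[k]
    are computed between frames k and k+1. center = (x_ji, y_ji) in pixels.
    Off-frame crops are shifted back inside the frame rather than resized,
    to preserve the aspect ratio.
    """
    H, W = U[0].shape
    x, y = center
    # Shift the crop window so it stays fully inside the frame.
    x0 = int(np.clip(x - Wc // 2, 0, W - Wc))
    y0 = int(np.clip(y - Hc // 2, 0, H - Hc))

    channels = []
    for k in range(i - L // 2, i + L // 2):      # frames [i-L/2, i+L/2-1]
        channels.append(U[k][y0:y0 + Hc, x0:x0 + Wc])
        channels.append(V[k][y0:y0 + Hc, x0:x0 + Wc])
    return np.stack(channels, axis=0)            # shape: (2L, Hc, Wc)
```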

Fig. 4. Sample frames reflecting high inter- and intra-class variability in terms of: (1) intense illumination conditions and shadows (row 1-a, b), (2) different phases of gait, (3) different walking surfaces, e.g., colour and texture (each column corresponds to a specific environment and walking surface), and (4) motion blur during crossover and side-step compensatory reactions (row 3-a, b).

3 Experiments

3.1 Dataset

Sufficiently large datasets are challenging to collect and are often the primary bottleneck for deep learning. Moreover, there are no publicly available datasets specific to our needs, i.e., a large dataset captured by a belt-mounted camera including images/videos of feet from different people with considerable diversity in appearance (e.g., shoes with different colours and shapes, bare feet, socks) and movement (i.e., gait). To facilitate training, we decided to fine-tune [34] the ConvNet on real images with normal optics from large-scale datasets, which also boosts the generalizability of the network. We fine-tuned the ConvNet on the Footwear (footgear) sub-category of the ImageNet 2011 [42] dataset (\(\approx 1300\) images with bounding boxes, plus 446 top-down images of shoes with and without bounding boxes; missing boxes were added manually). Such images resemble the realistic appearance of one’s footwear from different views more closely than alternatives such as UT-Zap50K [52].

Three healthy young participants (researchers affiliated with the Neural and Rehabilitation Engineering and Computational Health Informatics Labs at the University of Waterloo) took part in our data collection procedure. The FPV data were collected using a GoPro Hero 5 Session camera centered on the participants’ belt (30 fps, 1080\(\times \)1920), with no specific calibration or setup. A wearable IMU was attached as closely as possible to the camera to collect movement signals (for future experiments). Overall, 5 datasets (including 2 separate datasets from 2 of the participants, recorded in different environments) were captured in five different indoor (tiles, carpet) and outdoor (bricks, grass/mud) environments around the University of Waterloo campus, resulting in 4505 (\(=5\times N, N=901\)) total frames (Fig. 4 shows samples from the dataset). Frames were annotated by drawing bounding boxes around the right and left shoes (in PASCAL VOC format) using the LabelImg tool [48].

In addition to the normal walking sequences, two datasets also included simulated compensatory balance reactions (CBRs: lateral sidestep, crossover stepping) during gait (see Fig. 4, row 3, columns a and b for sets 1 and 2, and the GT plot for dataset 2 in Fig. 6). CBRs (near-falls) are reactions to recover stability following a loss of balance (see Fig. 1, panels f and g), characterized by rapid stepping (or reaching) movements to widen the base of support. CBRs also introduce additional challenges to our dataset, as the corresponding FPV data are usually blurry (due to fast foot displacement; see Fig. 4) and the field of view may be occluded.
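For reference, a short sketch of how such PASCAL VOC annotations can be loaded into the GT box format of Sect. 2 is shown below; the class label names in the docstring are assumptions, as the exact labels used during annotation are not specified here.

```python
import xml.etree.ElementTree as ET

def load_voc_boxes(xml_path):
    """Parse a LabelImg/PASCAL VOC annotation file into GT boxes.

    Returns a dict mapping class name (e.g., 'left_foot', 'right_foot' -- assumed
    label names) to [x_center, y_center, w, h] in pixels, matching Sect. 2.
    """
    root = ET.parse(xml_path).getroot()
    boxes = {}
    for obj in root.findall('object'):
        name = obj.find('name').text
        bb = obj.find('bndbox')
        xmin = float(bb.find('xmin').text)
        ymin = float(bb.find('ymin').text)
        xmax = float(bb.find('xmax').text)
        ymax = float(bb.find('ymax').text)
        boxes[name] = [(xmin + xmax) / 2, (ymin + ymax) / 2,
                       xmax - xmin, ymax - ymin]
    return boxes
```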

3.2 FootRegionProposer Training

Several models can be considered for the FootRegionProposer, including SSD (Single Shot MultiBox Detector) [26], Faster R-CNN [38], and R-FCN [11]. In [19], it is shown that SSD models typically have (very) poor performance on small objects, such as the relatively small foot regions in our experiments. Among related approaches, YOLO [37] shows state-of-the-art results in terms of speed and accuracy.

To implement the FootRegionProposer, the original YOLO version 2 from the Darknet deep learning framework was used [37]. Weights pre-trained on the large-scale ImageNet dataset were used for network initialization; the network was then fine-tuned on the ImageNet shoe sub-category, and further fine-tuned on images of shoes captured in realistic scenes from a top-down view. All network inputs were resized to \(K\times 3\times 832\times 832\), where \(K = 64\) was the batch size (mini-batch size: 32). Stochastic gradient descent with momentum was used as the optimization method, with an initial learning rate of \(\gamma = {0.001}\), a momentum of 0.9, and a decay rate of 0.0005 (at steps 100 and 25000), trained on an Nvidia Titan X GPU. To further address the problem of limited data, the data were augmented (i.e., random crops and rotation) to improve the generalization of the network.

3.3 LocomoNet Training

Although YOLO is very fast, it often suffers from a high number of false positives. The goal of the LocomoNet is to improve FootChaser performance by reducing the number of false proposals. The LocomoNet output maps each OFV to one of two possible classes. Similar to [28, 45, 49], the TVL1 optical flow algorithm [53] was chosen, here using the OpenCV GPU implementation. Moreover, similar to [28, 44, 49], a stack length of \(L = 10\) (i.e., 20 input channels for the LocomoNet) was selected, and the crop size was set to \(W_c=H_c=224\).
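A sketch of the flow extraction is shown below, using the OpenCV contrib CPU factory (the exact factory name varies across OpenCV versions, and the paper used the GPU implementation); the ±20 clipping bound used for discretization to [0, 255] is an assumed value common in two-stream implementations, not a figure reported here.

```python
import cv2
import numpy as np

def tvl1_flow_stacks(gray_frames):
    """Compute TV-L1 optical flow between consecutive frames, discretized to [0, 255].

    gray_frames: list of grayscale frames (uint8). Returns U, V lists of uint8
    horizontal/vertical flow fields, one pair per consecutive frame pair.
    """
    tvl1 = cv2.optflow.DualTVL1OpticalFlow_create()
    U, V = [], []
    for prev, curr in zip(gray_frames[:-1], gray_frames[1:]):
        flow = tvl1.calc(prev, curr, None)           # H x W x 2, float32
        # Clip to a fixed range and rescale to [0, 255] (assumed bound of +/-20 px).
        bounded = np.clip(flow, -20, 20)
        quantized = ((bounded + 20) / 40.0 * 255).astype(np.uint8)
        U.append(quantized[..., 0])
        V.append(quantized[..., 1])
    return U, V
```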

Based on our experiments, a \(224\times 224\) region and a stack length of \(L=10\) provided sufficient temporal information for foot regions during gait. Moreover, we handled off-frame crops by shifting the \(224\times 224\) box in the opposite direction instead of resizing, to retain the aspect ratio. To train the LocomoNet, 300 positive (shoe/foot region) volumes were extracted for the left and right feet in each of the 5 datasets, resulting in a total of 3000 (= \(2\times 300\times 5\)) true positive volumes. An equal number of negative volumes (i.e., 3000) were also randomly cropped from non-shoe regions of the consecutive frames, with a constraint of \(IoU\approx 0\) with the shoe regions in the \(i^{th}\) frame; the past and future frames in the volume were not constrained, to allow for a more realistic evaluation.
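One possible rejection-sampling sketch for such negative crops is given below, assuming GT foot boxes in the \([x, y, w, h]\) center format of Sect. 2; the overlap test and the number of retries are illustrative choices rather than details from the paper.

```python
import random

def boxes_overlap(a, b):
    """True if two [x_center, y_center, w, h] boxes intersect at all (IoU > 0)."""
    return (abs(a[0] - b[0]) < (a[2] + b[2]) / 2 and
            abs(a[1] - b[1]) < (a[3] + b[3]) / 2)

def sample_negative_center(frame_w, frame_h, foot_boxes, Wc=224, Hc=224, max_tries=100):
    """Randomly pick a crop center whose Wc x Hc window has IoU ~ 0 with all GT
    foot boxes of the i-th frame (a sketch of the negative-volume sampling above)."""
    for _ in range(max_tries):
        cx = random.randint(Wc // 2, frame_w - Wc // 2)
        cy = random.randint(Hc // 2, frame_h - Hc // 2)
        if not any(boxes_overlap(gt, [cx, cy, Wc, Hc]) for gt in foot_boxes):
            return cx, cy
    return None  # frame too crowded with foot regions; skip it
```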

The approach proposed in [49], where the authors demonstrated the possibility of pre-training temporal nets with an ImageNet model, was applied in the current study. After extracting optical flow fields and discretizing them into [0, 255], the authors averaged the first-layer filters of the ImageNet model across the channel dimension to account for the difference in the number of input channels between temporal and spatial nets (20 vs. 3), and then copied the averaged result 20 times as the initialization of the temporal net. Following this approach, a motion-stream ConvNet (ResNet-101 [16] architecture) pre-trained on video data from the UCF101 dataset was used, with stochastic gradient descent and a cross-entropy loss. The batch size, initial learning rate, and momentum were set to \(K=64\), 0.01, and 0.9, respectively.
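The cross-modality initialization described above can be sketched in PyTorch as follows; note that, unlike the paper, this sketch starts from an ImageNet-pretrained ResNet-101 rather than a UCF101-pretrained motion-stream model, purely for illustration.

```python
import torch.nn as nn
from torchvision import models

def build_locomonet(num_flow_channels=20, num_classes=2):
    """Sketch of the cross-modality initialization in [49]: average the first-layer
    RGB filters across the channel dimension and replicate them for the 2L = 20
    optical-flow input channels (ImageNet backbone used here for illustration)."""
    net = models.resnet101(pretrained=True)

    rgb_weight = net.conv1.weight.data              # shape: (64, 3, 7, 7)
    flow_weight = rgb_weight.mean(dim=1, keepdim=True).repeat(1, num_flow_channels, 1, 1)

    net.conv1 = nn.Conv2d(num_flow_channels, 64, kernel_size=7,
                          stride=2, padding=3, bias=False)
    net.conv1.weight.data = flow_weight

    net.fc = nn.Linear(net.fc.in_features, num_classes)   # foot vs. non-foot
    return net

# Training setup matching the reported hyperparameters (sketch):
# optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9)
# criterion = nn.CrossEntropyLoss()
```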

Fig. 5. Example FootRegionProposer results (PFRs, red boxes) for three frames. Correct foot regions were identified by the FootRegionProposer; however, false positives were also proposed. After applying the LocomoNet, some false positives were filtered out (marked with \(\times \)). In (a) and (c), the false positive(s) are successfully removed; (b) shows a case of intense illumination and shadows challenging the LocomoNet, resulting in two false positives that were not filtered out.

3.4 Evaluation

(1) Model generalizability. To evaluate the extent to which subject-related movement patterns in different environments can be handled by the LocomoNet, a leave-one-dataset-out (LODO) cross-validation was performed. To achieve this, a \(LocomoNet_{N_D}\) (\(N_D=\{1,2,\)...\(,5\}\)) model was trained on all datasets except dataset \(N_D\) (i.e., 4800 volumes for training) and tested on dataset \(N_D\) (i.e., 1200 volumes for testing); this was repeated 5 times. The following LODO accuracies were obtained for our 5 datasets: 1: 92.41%, 2: 91.16%, 3: 98.33%, 4: 83.83%, and 5: 96.25%. The high accuracies indicate the generalizability of the LocomoNet in discriminating foot-related \(OFV-PFR\)s in unseen datasets. The following average IoU scores were obtained for each set: 1: 0.7626, 2: 0.7304, 3: 0.3794, 4: 0.7155, and 5: 0.5235. Considering that an IoU threshold of 0.5 is typically used in object detection evaluation to determine whether a detection is positive (IoU of a true positive \(>0.5\)) [13], we interpret the generalizability of the model as satisfactory, except for \(N_D=3\). We attributed the lower performance of the network on dataset 3 to the patterns of the walking surface (tiles of different sizes, see Fig. 4-c).

(2) The number of proposed regions with \(IoU<0.2\) (false positives) was dramatically reduced after applying the LocomoNet to the PFRs. To assess the false positive removal performance of \(LocomoNet_{N_D}\), we define an elimination rate metric as \(ER_{N_D} = \frac{\text {Number of filtered PFRs in a specific IoU interval}}{\text {Total number of PFRs in a specific IoU interval}}\times 100\), where \(IoU = Area(GT\cap P)/Area(GT\cup P)\). As shown in Table 1, the PFRs in the low IoU range (\(\in [0,0.2)\)), representing false positives, were removed at a high rate (e.g., an 83.25\(\%\) reduction in \(IoU_{[0,0.1)}\)). The relatively low true positive removal rate (e.g., an 8.09\(\%\) reduction in \(IoU_{[0.9,1)}\)) reflects the satisfactory performance of the LocomoNet in retaining true positives (refer to Fig. 5 for some failure and success cases).

Table 1. Number of proposed foot regions (\(N_{PFR,{N_D}}\)) and elimination rate (ER) in different intersection-over-union (IoU) intervals, indicating the LocomoNet's ability to remove false positives by dataset. \(N_{PFR,{N_D}}\) dramatically reduced after applying the LocomoNet. \(ER_T\) is the weighted average elimination rate for \(IoU>0.5\) and \(IoU<0.5\), representing the true and false positives, respectively [13].
Table 2. Mean absolute error (MAE) results for the \(GT-One\) regions in absolute pixels and as a fraction of image resolution (MAE/R), where MAE = \(1/N\sum |GT-One_{a,f,i}-P-One_{a,i}|\), \(a =\{x,y\}\), \(f =\{left,right\}\), \(N = length(GT-One)\), and \(R_x\) = 1920, \(R_y\) = 1080.
Table 3. Mean absolute error (MAE) for \(GT-Two\) regions in absolute pixels and as a fraction of image resolution (MAE/R), where \(R_x\) = 1920, \(R_y\) = 1080.
Fig. 6. Time series plot of the x-coordinate of the center of the most confident proposed foot regions (PFR, blue) predicted by the FootChaser framework for dataset 2. Ground truth (GT) data for the left and right feet are plotted in green and red, respectively. Spikes represent compensatory balance reactions (CBRs) performed by the participant. (Color figure online)

(3) FootChaser prediction trajectories closely match ground truth trajectories. The performance of the FootChaser in tracing the GT data can be assessed by measuring (1) the individual IoU scores and (2) the pixel-wise distance (error, E) between the predicted foot center and its corresponding point in the GT data, i.e., as discussed in Sect. 2, by comparing the predicted \(P-One\) bounding boxes with \(GT-One\) (\(E(a^{P-One},a^{GT-One}), a=\{x,y\}\)), where the mean absolute error (MAE) is taken as the error metric E (see Table 2). For \(GT-Two\) frames (e.g., the black dotted parts in Fig. 2), performance was evaluated by comparing \(a_i^{P-One}\) with the nearest GT point regardless of the foot type (Table 3 displays the results; a minimal sketch of this evaluation procedure is given after this list). At first glance, this may appear to be a weak metric. However, as discussed in Sect. 2 and depicted in Figs. 6 and 2, in the \(GT-Two\) data the FootChaser is biased toward proposing regions corresponding to the nearly-full-view foot (rather than the partially observable one). In this application, the observed bias toward larger objects is a strength, as it predicts the center of the foot required for the extraction of spatiotemporal gait parameters. This can be attributed to the fact that the FootRegionProposer is trained on the ImageNet dataset, which mainly includes full-view images of feet. Moreover, this is in line with the findings of [19, 41], where higher performance was reported for the detection of bigger objects in videos. Considering these points, the error criteria for the \(GT-Two\) regions seem to be a satisfactory representation of performance.

In addition to the relatively low error rates (\(<10\%\) for the x trajectories), as shown in Fig. 6, the framework also predicted many of the foot positions at the times of CBRs (spikes). Therefore, these trajectories are a promising avenue for the detection of CBRs. The high E values for \(D_3\) (Tables 2 and 3) are also consistent with the low IoU score achieved for that dataset (due to the patterns of the walking surface).
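For clarity, the evaluation of \(P-One\) centers against \(GT-One\) and \(GT-Two\) frames can be sketched as follows (illustrative only; array shapes and NaN conventions are assumptions consistent with Sect. 2).

```python
import numpy as np

def mae_against_gt(pred_centers, gt_left, gt_right):
    """Per-axis mean absolute error of predicted single-foot centers (sketch).

    pred_centers, gt_left, gt_right: (N, 2) arrays of [x, y] per frame, with NaN
    rows where no box exists. GT-One frames are compared against the only visible
    foot; GT-Two frames against the nearest GT center, regardless of foot type.
    """
    errors = []
    for p, gl, gr in zip(pred_centers, gt_left, gt_right):
        if np.any(np.isnan(p)):
            continue                                   # no single P-One for this frame
        candidates = [g for g in (gl, gr) if not np.any(np.isnan(g))]
        if not candidates:
            continue                                   # no GT foot in this frame
        nearest = min(candidates, key=lambda g: np.linalg.norm(p - g))
        errors.append(np.abs(p - nearest))
    mae_x, mae_y = np.mean(errors, axis=0)
    return mae_x, mae_y                                # divide by R_x, R_y for MAE/R
```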

4 Conclusion and Future Work

As its main contribution, this study demonstrated the potential of a body-mounted camera for developing automated, markerless algorithms to assess gait in natural environments. This advances our long-term objective of developing novel markerless models to extract spatiotemporal gait parameters, particularly step width, to complement existing IMU-based methods.

As next steps, we aim to: (1) collect criterion (gold-standard) human movement data using motion capture (e.g., Vicon) or instrumented gait analysis tools (e.g., a pressure-sensitive mat, GaitRite) synchronized to the FPV data, convert the pixel-wise FootChaser outputs into commonly used distance units (e.g., m or cm), and develop a model to directly extract spatiotemporal gait parameters from FPV data; and (2) develop a more robust version of the FootChaser framework by collecting a large free-living FPV+IMU dataset from older adults with different frailty levels, annotating the data, and making the dataset publicly available.

This paper contributes an advance in the field of ambulatory gait assessment by localizing feet in a waist-mounted FPV feed, towards a fully automatic system to detect abnormalities (e.g., compensatory balance reactions, or near-falls) and to identify environmental hazards (e.g., slope changes, stairs, curbs, ramps) and surfaces (e.g., gravel, grass, concrete) that influence mobility and potential fall risk. As described earlier, FPV data also provide objective evidence on the cause and circumstances of perturbed balance during activities of daily living. Our future studies will examine the potential for automatic detection of these environmental fall risk hazards [32, 33].

Given the massive amounts of unlabeled FPV data collected during longer-term studies, we aim to develop approaches that can robustly handle significant diversity in movement patterns (e.g., rhythm, speed), different populations (e.g., older adults with a history of falls, people with Alzheimer’s disease), and varying clothing and footwear appearance. To address these aspects, similar to [9], we aim to personalize both the FootRegionProposer and LocomoNet ConvNets to create an adaptive pipeline, “AdaFootChaser”, in future work.