Markerless gait estimation and tracking for postural assessment

Postural assessment is crucial in sports screening systems to reduce the risk of severe injury. Capturing an athlete’s posture with computer vision attracts considerable attention in the sports community because it is markerless and interferes less with physical training. In this paper, a novel markerless gait estimation and tracking algorithm is proposed to locate human key-points in spatial-temporal sequences for gait analysis. First, human pose estimation using the OpenPose network detects 14 core key-points of the human body. The body joint distances are normalized by the neck-to-pelvis distance to obtain camera-invariant key-points. These key-points are subsequently used to generate a spatial-temporal sequence, which is fed into a Long Short-Term Memory network for gait recognition. An indexed person is tracked for quick local pose estimation and postural analysis. The proposed algorithm can automate the capture of human joints for postural assessment and human motion analysis. The proposed system is implemented on an Intel Up Squared board and achieves up to 9 frames per second with 95% gait recognition accuracy.

is one of the root causes of physical injuries that lead to joint inflammation and post-traumatic arthritis [39]. Despite recovery from joint inflammation, a person may still show symptoms of post-traumatic arthritis after 6 months, at which point it is considered a chronic pathological disease [32]. The statistics reported in [13] show that 30 million people suffer from osteoarthritis (OA) in the United States, of whom more than half are under age 65. OA is a chronic condition that can affect any joint, such as the knees, hips, lower back and neck, the small joints of the fingers, and the bases of the thumb and big toe [13]. Post-traumatic osteoarthritis [32] is recognized as a disabling condition as soon as 20 months after injury and there is no cure for it.
The capture of human motion can be classified into two types [41]: (a) the marker-based approach and (b) the markerless-based approach. The marker-based approach requires sensors or visual markers to be attached to the human body for motion analysis. The markerless-based approach uses passive cameras to detect the human body visually and processes the images with computer vision techniques. Reyes-Ortiz et al. [35] tracked human activity using a smartphone's inertial measurement unit (IMU) for human action recognition. Beravs et al. [4] proposed an IMU system with a validation algorithm to measure the ankle joint angle and the motion of lower limb exoskeletons. However, marker-based approaches require sensors to be attached to the body and a long setup before any human motion analysis can begin.
On the other hand, the markerless-based motion analysis approach uses a camera to track the boundaries or features of human bodies without mounted markers. It retrieves the spatial-temporal dimension of salient objects such as the human skeleton and motion [9,22,24,34] in the video. Since it does not require sensors or markers to be attached, the quality of the kinematic data collected during physical training can be improved for precise postural assessment in the sports field [26]. Hence, the sports community believes that computer vision could improve physical training efficiency and help achieve high performance [10]. However, high computational cost and the calibration of multiple cameras limit this approach when fast inference is needed in real-time applications.
Information about human posture is a crucial cue in sports because it can be analyzed to achieve optimal performance and reduce the risk of physical injury [15]. To keep the computational cost low, this paper proposes a markerless gait estimation and tracking system for postural assessment that uses edge computing for sports screening. The proposed system consists of four stages: (a) human joints estimation to extract key-points, (b) gait recognition for single human instance tracking, (c) pose tracking for processing time reduction, and (d) postural assessment using joints information. The overall pipeline architecture is shown in Fig. 1, and it is migrated to a low-power embedded system, the Intel Up Squared board, with a single camera setup. It is hypothesized that the system enables fast gait estimation, with joint locations invariant to camera orientation and without markers attached to the human body, for postural assessment. This research can be extended to other areas of sports such as weight lifting, diving and cycling.

Related work
Human motion is key information for analyzing human posture and receives considerable attention in computer vision and artificial intelligence research [2]. Recent computer vision research on human pose estimation relies on deep learning to achieve significant improvements on standard benchmarks, such as the accuracy of clustering body parts and parsing human poses. Human pose estimation enables computer vision to analyze human motion and capture subtle postural information that is missed by the human eye. Therefore, a real-time postural assessment system using the markerless-based approach can be widely applied in sports to reduce the risk of injury, especially at the joints.

Deep learning approach in human pose estimation
The markerless-based approaches [19] are mainly classified by their scanning technique: (a) the Top-Down method and (b) the Bottom-Up method. The Top-Down method detects the human location first and performs pose estimation within the detected region. DensePose with a Mask Region-based Convolutional Neural Network baseline [1] was developed as a deep learning top-down method that locates the person in the image and predicts bounding boxes to crop the detected person through Region of Interest (ROI) pooling. After that, the ROI was fed into a fully convolutional network architecture to predict the coordinates of the body key-points. Based on the set of coordinates, it identified the body parts and aligned them to estimate the pose of each individual person [14,17]. Ning et al. [27] used ResNet151 as the backbone to detect human candidates and crop the detected person in a tracked window. A Cascaded Pyramid Network, a single-person human pose estimator, was applied within the tracked window. Then, a flow-based pose tracker was used to track the pose ID of each person in the image area. The Top-Down method reduces false key-point predictions, but its early commitment can derail the whole process if the person detector fails in the early stage.
The Bottom-Up method estimates the body part nodes in the image and uses a grouping algorithm to associate them, as pairwise part connections, with the person they belong to, without a person detection step. Leonid Pishchulin et al. [30] proposed a bottom-up method, "DeepCut", for multi-person human pose estimation. It initialized all possible body part candidates from the image and labelled them by body part in order to assign them to each possible person. Cao et al. [8] proposed a CNN with two branches of the same sequential prediction process, one trained to produce the part confidence maps for body parts and the other the Part Affinity Fields (PAF) that encode the degree of association between limbs. Then, a greedy parsing algorithm performed a fast connecting process to form high-quality parses for human pose estimation. In short, the Bottom-Up method predicts the body parts in the image and then applies a grouping algorithm to assign each body part to a particular person. However, the grouping algorithm can be complicated because inference over the fully connected graph is an NP-hard (non-deterministic polynomial-time hard) problem [30].

Postural assessment using markerless-based approach
Recent developments in postural assessment in sports often use the markerless-based approach with deep learning architectures such as CNNs. Tian et al. [23] proposed a hierarchical CNN model to assign social roles, namely defender and attacker, to the players. After that, the scene of the human activity was categorized based on the movements and actions of the defender and attacker. Furthermore, Joshi et al. [20] proposed using the Inception-V3 neural network to classify sport images robustly into six classes: basketball, tennis, badminton, cricket, rugby and volley. It used the steps feature descriptor to categorize the sport based on the human activity and the surrounding environment.
Kim et al. [21] proposed using human motion analysis to predict the play evolution in dynamic sport scenes on a soccer field. Based on the human motion, the next ground-level movements of the players could be predicted. This method was applied to automatic camera control and to sports visualization for analysis. Bialkowski et al. [5] investigated raw player detection and strategy analysis from noisy data in hockey games. Based on the occupancy maps of the players, which were predicted from histograms of gradient orientation, the specific team formation was retrieved and wrong strategy predictions could be removed.
Hayasahi et al. [16] proposed a CNN-based visual imagery analyzer that recognizes upper body pose features and uses the head and pelvis to obtain the spine orientation and body orientation, from which random decisions in human movement are estimated. Jian et al. [40] developed an AI coach camera system to collect the 'great' pose of an athlete in the spatial-temporal sequence. This system used the 'great' pose as a benchmark to detect the 'bad' poses of the athlete in the next few moves. Nuttachot and Sajjaporn [31] proposed a model for practicing basic badminton skills, which used pose estimation to collect the human joints in a spatial-temporal sequence and compared them with the postures of world-class players as a reference. It extracted a posture embedding with the triplet loss technique and fed the embedding to a One-shot Network to find the similarity score between the input posture and the reference posture. In a nutshell, the markerless-based approaches have the potential to benefit the sports community [10].

Proposed postural assessment system
A markerless gait estimation and tracking system for postural assessment is proposed to detect and track human gait information in order to evaluate the posture of a sportsperson. The proposed pipeline runs in real time on an input video stream. It consists of four stages: (a) human joints estimation to extract key-points, (b) gait recognition for single human instance tracking, (c) pose tracking for processing time reduction, and (d) postural assessment using joints information [12,36,37]. The benchmark measurements introduced for the postural assessment in this research are (a) step rate, (b) running gait, and (c) angle of the elbow.

Human joints estimation
Human joint information can be detected using the OpenPose network [8]. In Fig. 2, stage-0 of the architecture uses ten layers of the Very Deep Convolutional Neural Network (VGG-19 network) to analyze the color image of size h × w × 3 and generate a set of feature maps. At stage-1, the feature maps from the VGG-19 network are passed into two branches, the part confidence maps and the PAF, in a multi-stage manner. The first branch predicts a set of part confidence maps to locate the key-points of the human body, and the second branch generates the PAF to encode the degree and direction of association between the key-points across the stages. Subsequently, the greedy parsing algorithm assembles the parts and assigns them to a particular person according to the part confidence map locations and the PAF scores. The estimated human pose with its labeling is displayed in Fig. 3; it is obtained regardless of the person's physical appearance or clothing and consists of 14 key-points on the human body. The coordinates $(X_K, Y_K)$ of each point are generated from the first branch (part confidence maps), and the associations between the key-points are calculated from the second branch (PAF).
There are 14 key-points extracted to normalize the scale and viewpoint of the human object in this system [29]. Key-points #1-#13 are extracted from the OpenPose network, and the pelvis (#14) is taken as the midpoint between the #8 and #11 coordinates. The key-points array is a 2×14 array of the joint-to-neck distances of #1-#14, expressed in $(X_K, Y_K)$ coordinates with the top-left corner of the image as the origin. To differentiate the motion of each key-point relative to the person, the coordinate of the neck (#1) is subtracted from the coordinate of each key-point. The neck point is considered the reference point of the person. To obtain human key-points that are invariant to camera orientation, the joint-to-neck distance ($D$) is calculated as follows,

$$d_{K,x} = X_K - X_1, \qquad d_{K,y} = Y_K - Y_1,$$

where the subscript $K$ is the index of the key-points, $(X_1, Y_1)$ is the coordinate of the neck and $(d_{K,x}, d_{K,y})$ is the distance from the $K$-th joint to the neck (#1). Instead of using the coordinates of the key-points, the distance array is arranged as the spatial information as follows,

$$D = \begin{bmatrix} d_{1,x} & d_{2,x} & \cdots & d_{14,x} \\ d_{1,y} & d_{2,y} & \cdots & d_{14,y} \end{bmatrix}.$$

A person can be located in the near or far field of the camera, which causes the joint-to-neck distances to vary. To normalize the joint-to-neck distances, all joint distances are divided by the Euclidean distance from the neck to the pelvis,

$$\hat{D} = \frac{D}{\lVert d_{14} \rVert}, \qquad \lVert d_{14} \rVert = \sqrt{d_{14,x}^{2} + d_{14,y}^{2}},$$

where $\lVert d_{14} \rVert$ is the Euclidean distance measured from the neck (#1) to the pelvis (#14) and $\hat{D}$ is the normalized joint distance, which is invariant to the distance between the person and the camera. In the proposed pipeline, both the neck (#1) and the pelvis (#14) key-points must be present for a detection to be classified as a human object. Subsequently, the normalized joint-to-neck distances ($\hat{D}$) from each frame are stacked over time to generate the spatial-temporal sequence of human motion.
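The normalization described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `normalize_keypoints`, the list-of-tuples input layout and the zero-based indices (neck at index 0, key-points #8 and #11 at indices 7 and 10) are assumptions made for the example.

```python
import math

# Zero-based indices of key-points #1, #8 and #11 (assumed input layout).
NECK, HIP_A, HIP_B = 0, 7, 10


def normalize_keypoints(points):
    """Return the 2x14 normalized joint-to-neck array for one frame.

    `points` holds the 13 (x, y) key-points #1-#13 in image coordinates
    (origin at the top-left corner). The pelvis (#14) is appended as the
    midpoint of key-points #8 and #11.
    """
    if len(points) != 13:
        raise ValueError("expected 13 key-points (#1-#13)")
    pelvis = ((points[HIP_A][0] + points[HIP_B][0]) / 2.0,
              (points[HIP_A][1] + points[HIP_B][1]) / 2.0)
    all_points = list(points) + [pelvis]

    neck_x, neck_y = all_points[NECK]
    # Joint-to-neck distances along each axis.
    dx = [x - neck_x for x, _ in all_points]
    dy = [y - neck_y for _, y in all_points]

    # Neck-to-pelvis Euclidean distance, used as the scale factor.
    scale = math.hypot(dx[-1], dy[-1])
    if scale == 0.0:
        raise ValueError("neck and pelvis coincide; cannot normalize")
    return [[d / scale for d in dx], [d / scale for d in dy]]
```

Under these assumptions, shrinking and shifting the same pose in the image (a person standing farther away) yields the same normalized array, which is the invariance the text relies on.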

Gait recognition
Gait recognition is used to identify a specified person based on the walking gait. The normalized key-point coordinates of a detected person are arranged in the spatial-temporal sequence for gait recognition. A Long Short-Term Memory (LSTM) network [18] is implemented as shown in Fig. 4 to process the spatial-temporal sequence and output the object index. The LSTM has an internal hidden cell $h_N$ that acts as memory, together with its gates, in which $\sigma_g$ is the sigmoid activation function and $\sigma_h$ is the hyperbolic tangent activation function. $b$ is the bias of the gates. $W$ and $U$ are the weight matrices of the input vector $x$ and the recurrent connection $h$, respectively. For each input $x = (x_1, x_2, x_3, \ldots, x_N)$, the output feature of the LSTM is taken from the hidden cell $h_N$. In this gait recognition, the 14 normalized vectors along the X and Y axes are passed into the LSTM cells over $N$ time steps, where $N$ is the number of time frames. The LSTM network consists of LSTM cells with an input size of 28 and 28 fully connected cells with the ReLU activation function. Subsequently, the output of the fully connected layer is fed into a softmax layer, which classifies the gait pattern according to the trained gait sequences into one of $M$ classes, where $M$ is the number of indexing classes. Gait recognition can thus index a person from the global pose estimation so that local pose tracking can use a smaller tracked window. This helps the sports screening system monitor the performance of a targeted person.
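For reference, the gate equations that these symbols normally belong to can be written out. The block below is the standard LSTM formulation expressed with the notation defined above ($\sigma_g$, $\sigma_h$, $W$, $U$, $b$), followed by a softmax over $M$ classes. The gate subscripts $f$, $i$, $o$, $c$, the cell state $c_t$, the element-wise product $\odot$ and the fully connected outputs $z_m$ are assumed notation, and the original formulation may differ in detail.

```latex
\begin{aligned}
f_t &= \sigma_g\left(W_f x_t + U_f h_{t-1} + b_f\right) \\
i_t &= \sigma_g\left(W_i x_t + U_i h_{t-1} + b_i\right) \\
o_t &= \sigma_g\left(W_o x_t + U_o h_{t-1} + b_o\right) \\
\tilde{c}_t &= \sigma_h\left(W_c x_t + U_c h_{t-1} + b_c\right) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \sigma_h\left(c_t\right) \\
P(y = m \mid z) &= \frac{e^{z_m}}{\sum_{j=1}^{M} e^{z_j}}, \qquad m = 1, \ldots, M
\end{aligned}
```

Here $x_t$ is the 28-element normalized key-point vector of frame $t$, and the class with the largest softmax probability is taken as the person index.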

Pose tracking
For pose tracking, the person identified by the LSTM network is cropped into a smaller window to create the ROI. The ROI is then passed to a single tracking iteration using a discriminative correlation filter [25]. Pose tracking has two steps: a localization step and an update step. The localization step identifies the new target location ($p_t$) as the position of the maximum of the correlation between $h_{t-1}$ and the image patch features $f$. In this step, $\tilde{g}(h_d)$ denotes the location of the maximum of the correlation responses, the symbol '*' denotes circular correlation, $N_c$ is the number of channels, $f_d$ is the set of $N_c$ channel features and $h_d$ is the constrained correlation filter. The channel reliability is computed using the element-wise product

$$\tilde{w}_d = \tilde{w}_d^{(update)} \cdot \tilde{w}_d^{(det)},$$

where it is normalized s.t. $\sum_d \tilde{w}_d = 1$, $\tilde{w}_d^{(update)}$ is the learned channel filter reliability obtained from the update step and $\tilde{w}_d^{(det)}$ is the channel detection reliability, measured from the ratio between the second and first highest non-adjacent peaks in the channel response map [7]. The two largest peaks in the response map are obtained as the two largest values after a 3 × 3 non-maximum suppression. The detection reliability is estimated from the per-channel responses. The new scale ($s_t$) is derived from the new position based on the channel responses.
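The channel reliability computation can be illustrated with a short sketch. It is not the tracker's actual code. The helper names are hypothetical, and the specific form `1 - min(second_peak / first_peak, 0.5)` for the detection reliability is an assumption taken from the usual description of channel and spatial reliability tracking; the text above only states that the reliability depends on that peak ratio.

```python
def detection_reliability(response):
    """Detection reliability of one channel from its 2-D response map.

    Peaks are cells strictly larger than all neighbours in a 3x3 window
    (a simple non-maximum suppression). The reliability is assumed to be
    1 - min(second_peak / first_peak, 0.5).
    """
    rows, cols = len(response), len(response[0])
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = response[r][c]
            neighbours = [
                response[rr][cc]
                for rr in range(max(0, r - 1), min(rows, r + 2))
                for cc in range(max(0, c - 1), min(cols, c + 2))
                if (rr, cc) != (r, c)
            ]
            if all(value > n for n in neighbours):
                peaks.append(value)
    peaks.sort(reverse=True)
    if not peaks or peaks[0] <= 0.0:
        return 0.5  # no usable peak: fall back to the lowest reliability
    second = peaks[1] if len(peaks) > 1 else 0.0
    ratio = max(second, 0.0) / peaks[0]
    return 1.0 - min(ratio, 0.5)


def channel_weights(learned, detected):
    """Element-wise product of the two reliabilities, normalized to sum to 1."""
    products = [l * d for l, d in zip(learned, detected)]
    total = sum(products)
    if total == 0:
        raise ValueError("all channel weights are zero")
    return [p / total for p in products]
```

A channel whose response map has a single dominant peak gets a reliability close to 1, while a channel with two peaks of similar height is down-weighted.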
In the update step, the foreground histogram ($c_f$) and background histogram ($c_b$) of the ROI are extracted for region analysis. The foreground histogram is computed using an Epanechnikov kernel within the estimated object bounding box, and the background histogram is computed from a neighborhood twice the size of the ROI [25]. The foreground and background histograms are updated using an exponential moving average with the histogram learning rate ($\eta_c$), where $c_t$ is the histogram of the foreground and background and $\eta$ is the correlation filter learning rate. The foreground and background histograms are used to construct the spatial reliability map ($m$), which identifies the pixels in the training region that are likely to belong to the target. The constrained correlation filter ($h_t$) is estimated using the spatial reliability map, which identifies the pixels that are set to zero in the learned filter. The per-channel learning reliability weights are measured from the maximum response value of a learned channel filter. Subsequently, the learned weights and detection weights are calculated in (6). The constrained correlation filter ($h_t$) and the channel reliability ($w_t$) are each updated independently, in a single iteration per input frame, with the correlation filter learning rate ($\eta$). The tracking algorithm runs on every frame after the gait recognition to locate the person at a lower computational cost and in less time, which allows the system to operate in real time.
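The exponential moving average used in the update step can be sketched as below. The blend `(1 - rate) * previous + rate * current` is the standard form of such an update. That the histogram and filter updates use exactly this form is an assumption, and `ema_update` is a hypothetical helper name.

```python
def ema_update(previous, current, rate):
    """Blend the previous model with the estimate from the current frame.

    `previous` and `current` are equal-length sequences (histogram bins,
    filter coefficients or channel weights); `rate` is the learning rate.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("learning rate must be in [0, 1]")
    if len(previous) != len(current):
        raise ValueError("model and estimate must have the same length")
    return [(1.0 - rate) * p + rate * c for p, c in zip(previous, current)]
```

The same helper would be called with the histogram learning rate for the foreground and background histograms and with the correlation filter learning rate for the filter and the channel reliability. A small rate makes the model change slowly and resist short occlusions, at the cost of adapting more slowly to appearance changes.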

Postural assessment in sports
Once the human object is tracked, the cropped image of the individual is fed into the OpenPose network each time for local joint estimation, from which the normalized joint vectors are extracted. Once the joint information is obtained, postural assessment becomes possible, and further activities or postural analyses could be added in the future. In this paper, running is the main demonstration because it is an important action in most high-performance sports. Three measurements are performed in the proposed pipeline: (a) step rate, (b) gait measurement, and (c) angle of the elbow.
Step rate is defined as the frequency of foot landings and is closely associated with human muscle activity [12]. A higher step rate reduces the foot-ground inclination angle, which decreases the peak hip abduction angle as well as the moment during the stance phase. A higher step rate also increases muscle activity, which improves the joint moment and energy absorption. To obtain the steps of each foot, the number of steps ($n$) of the right (#10) and left (#13) foot are counted separately from the number of troughs of the sinusoidal waveform of the joint-to-neck distance along the Y-axis. The average step rate of a recreational runner is 150 to 170 steps per minute, and an elite runner can reach up to 180 steps per minute [33]. The step rate of each foot is the ratio of the number of steps ($n$) to the total time ($t_{total}$), which is formulated as,

$$\text{Step rate} = \frac{n}{t_{total}}.$$

The inverse of the step rate, in seconds, is the time between the first contacts of two consecutive footsteps of the same foot, and it is expressed as,

$$T_{sec} = \frac{1}{\text{Step rate}} = \frac{t_{total}}{n},$$

where $T_{sec}$ is known as the stride time, that is, the time in seconds of one stride.
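A minimal sketch of the step counting and the two formulas above is given here. It assumes a clean, already smoothed Y-axis joint-to-neck signal for one foot and counts a step at each local minimum, following the text's convention that a trough marks a foot landing. The function names are hypothetical, and a real signal with occlusion noise would need filtering first.

```python
def count_troughs(signal):
    """Count local minima (troughs) in a 1-D sequence of samples."""
    count = 0
    for i in range(1, len(signal) - 1):
        # Strictly below the left sample, not above the right one, so a
        # flat-bottomed trough is counted once.
        if signal[i] < signal[i - 1] and signal[i] <= signal[i + 1]:
            count += 1
    return count


def step_rate(n_steps, total_seconds):
    """Steps per second for one foot: n / t_total."""
    if total_seconds <= 0:
        raise ValueError("total time must be positive")
    return n_steps / total_seconds


def stride_time(n_steps, total_seconds):
    """Seconds per stride for one foot: t_total / n."""
    if n_steps <= 0:
        raise ValueError("at least one step is required")
    return total_seconds / n_steps
```

For example, a signal with two troughs recorded over two seconds gives a step rate of one step per second for that foot and a stride time of one second. Multiplying the per-second rate by 60 gives steps per minute for comparison with the ranges quoted above.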
Step rate and stride time influence the muscle activity in anticipation of foot-ground contact. The time of one stride is useful for evaluating the symmetry of the stride time [36]. The energy absorbed at the knee and hip depends on the symmetry between the left and right strides. If the two stride times differ greatly, the asymmetric gait causes muscle imbalance and an uneconomical movement pattern, in which the poor posture wastes power and over-strains the joints and muscles, increasing the risk of injury [38]. The running gait can be split into two phases, the stance phase and the swing phase. The stance phase covers approximately 40% of the gait movement before toe-off, while the swing phase covers around 60%. Between 50%-60% and 90%-100% of the cycle, within the swing phase, the runner enters a floating stage in which both feet are off the ground for a very short period. As a result, a running cycle consists of 20% floating time, 40% stance and 40% swing [11]. In this system, only the stance and swing phases are measured to give a clear indication.
In a running postural assessment [37], the elbow angle is suggested to be less than 90 degrees during running, and the shoulders should be relaxed and facing forward to keep breathing smooth. The elbow angle can be analyzed from the side view with the camera setup shown in Fig. 5. The elbow angle ($\theta$) is calculated using the cosine rule as follows:

$$\theta = \cos^{-1}\left(\frac{a^{2} + b^{2} - c^{2}}{2ab}\right),$$

where $a$, $b$ and $c$ are the distances between the wrist, elbow and shoulder, with $c$ the side opposite the elbow (wrist to shoulder).
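The cosine rule computation can be sketched as follows. The function name `elbow_angle` and the assignment of sides (upper arm and forearm adjacent to the elbow, shoulder-to-wrist opposite it) are assumptions made for the example. The clamp guards against rounding errors pushing the cosine slightly outside [-1, 1].

```python
import math


def elbow_angle(shoulder, elbow, wrist):
    """Angle at the elbow, in degrees, from three (x, y) key-points."""
    a = math.dist(shoulder, elbow)  # upper arm
    b = math.dist(elbow, wrist)     # forearm
    c = math.dist(shoulder, wrist)  # side opposite the elbow
    if a == 0.0 or b == 0.0:
        raise ValueError("elbow coincides with shoulder or wrist")
    cos_theta = (a * a + b * b - c * c) / (2.0 * a * b)
    cos_theta = max(-1.0, min(1.0, cos_theta))
    return math.degrees(math.acos(cos_theta))
```

Because the angle depends only on ratios of distances, it can be computed from either raw pixel coordinates or the normalized joint-to-neck distances, provided the arm is seen from the side.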

Result and discussion of the proposed system
The proposed markerless gait estimation and tracking system for postural assessment is implemented on a microcomputer, the Intel Up Squared board, with an Intel Movidius Neural Compute Stick 2 to achieve edge computing and a low-power implementation. The board is equipped with an Intel Celeron™ N3350 processor and 2 GB of RAM. This postural assessment system

Computation of gait analysis
For the human pose estimation model, OpenPose is optimized using Intel OpenVino for deployment on the Intel processor. The proposed system aims to retrieve the 2 × 14 key-points arranged as a series of spatial-temporal features. Subsequently, these features are fed into the LSTM network for gait recognition. In the gait recognition experiment, the walking sequences of two persons are recorded to train the gait recognition, and $M$ is set to 2. The human joints estimation is performed during the video recording at 5 frames per second (FPS) and 1280×720 resolution. There are 20 training sets and 10 validation sets of each person's walking action. The LSTM network was trained with cross-entropy loss and the Adam optimizer with the learning rate set to $10^{-4}$. Figure 6a and b show that the pipeline classifies a person as "Person A" or "Person B" based on the walking action, using the continuous joint trajectories of walking shown in Fig. 6c. In the gait recognition pipeline, the number of time steps $N$ of the 2 × 14 key-points spatial-temporal sequence is tested at 10, 15 and 30. The validation accuracy of each $N$ time step setup is calculated as

$$\text{Accuracy} = \frac{TP}{N_v} \times 100\%, \qquad (14)$$

where $TP$ is the number of true positives and $N_v$ is the total number of validation sets. Table 1 compares the different $N$ time step setups of the gait recognition pipeline, with the validation accuracy calculated through (14). The result shows that the 15 time step setup achieves the highest validation accuracy, 95.83%, and it is chosen for this system. The experiment also simulates the scenario in which multiple persons are indexed in a stadium: every trainee is identified from the walking action on entering the stadium and then tracked, and the sports posture information is transferred to the sport analysis function. Figure 6c shows an example plot of the normalized joint-to-neck distance in the spatial-temporal sequence of the walking action of Person A. Although noise is present because of occlusions or inaccurate pose estimation, the sequence can still be analyzed as a periodic gait movement. The fluctuations of each sinusoidal plot appear to be unique to each individual. Once a human subject is indexed, the discriminative correlation filter locates the indexed person using a tracked window at a lower computational cost. Table 2 compares the proposed system in frames per second (FPS) using the setup on the Intel Up Squared board. A baseline setup was tested at around 5.0 FPS [28], while reducing the number of extracted points improved this only slightly, to 5.65 FPS [29]. With the proposed method, the computation time is reduced and up to 9 FPS can be achieved.
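The evaluation can be sketched with two small helpers. How the recorded sequences were split into windows of $N$ time steps is not stated in the text, so the non-overlapping split below is an assumption, as are the function names. The accuracy helper is the ratio of correctly indexed validation sequences to the total, expressed as a percentage.

```python
def split_into_windows(frames, n_steps):
    """Split a list of per-frame feature vectors into windows of n_steps.

    Non-overlapping windows are an assumption; trailing frames that do
    not fill a complete window are dropped.
    """
    if n_steps <= 0:
        raise ValueError("n_steps must be positive")
    usable = len(frames) - len(frames) % n_steps
    return [frames[i:i + n_steps] for i in range(0, usable, n_steps)]


def validation_accuracy(predicted, expected):
    """Percentage of validation sequences whose predicted index is correct."""
    if len(predicted) != len(expected) or not expected:
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * correct / len(expected)
```

Each window would hold `n_steps` vectors of 28 values (the flattened 2 × 14 normalized distances), matching the LSTM input size described earlier.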

Validation of postural assessment
Step rate, running gait and elbow angle are measured in the proposed pipeline. For the step rate, Fig. 7 uses dotted lines to show the joint-to-neck distance plots of #10 and #13 as the joint motion along the Y-axis. The trough of each foot's distance cycle represents a foot landing and is used to count the number of steps ($n_{steps}$). Figure 7 shows two distinct troughs for each foot, that is, two steps per foot. The step counting function is mainly used for the step rate and running gait measurements. Comparing the manually counted steps with the measured steps when walking a random route for one minute gives an average error of 4.29% over the three test sets, which is below 5%, as shown in Table 3. Figure 7 takes the left foot (#13) as the main reference point over a complete running cycle. The intersection points indicate that both feet are lifted off the ground. Thus, the swing phase, stance phase and floating time are estimated respectively. Subsequently, the athlete can refer to the running gait measurement to adjust the running gait and avoid an incorrect gait. The elbow angle is assessed from the side view of the sportsperson. The elbow angle in degrees during running is plotted as a time series in Fig. 8. The peak of the sinusoidal signal indicates that the wrist is at waist level, while the trough indicates that the wrist is above waist level. The wrist is not required to be in any particular position, but the elbow angle is suggested to be lower than 90 degrees to help maintain a fast cadence and keep the shoulders relaxed, which reduces muscle stress. Figure 9 demonstrates the real-time implementation of the postural assessment. At the first stage in Fig. 9a, the system identifies the user according to the walking gait. After     Fig. 9b. The new tracked window is used to predict the position of the subject in the next frame.
Subsequently, the gait measurement results are transferred to a cloud database through MQTT, a lightweight publish-subscribe network protocol. Users can remotely access a cloud dashboard, shown in Fig. 9c and d, to review past postural assessments over time. Figure 9c shows a correct running gait. The step rate is shown at the top left of the dashboard with a green indication, and the elbow angle remains at 90 degrees during running. The pie chart at the bottom right of the dashboard indicates that the swing and stance phases each make up approximately half of the running gait. Figure 9d shows an incorrect posture, with the step rate highlighted in red and the elbow angle during running below a threshold angle. The pie chart shows that the swing phase and stance phase are not synchronized. Such an asymmetric running gait raises the risk of injury in the long term.
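A sketch of how one assessment record might be serialized and published over MQTT is shown below. The field names, the topic `sports/posture` and the broker address are hypothetical, since the message format used by the system is not described. The publish helper relies on the third-party `paho-mqtt` package and its `publish.single` function, imported lazily so that the payload builder works without it.

```python
import json


def build_payload(athlete_id, timestamp, step_rate, stride_left,
                  stride_right, elbow_angle_deg):
    """Serialize one postural assessment record as a JSON string."""
    record = {
        "athlete_id": athlete_id,
        "timestamp": timestamp,
        "step_rate": step_rate,
        "stride_time_left": stride_left,
        "stride_time_right": stride_right,
        "elbow_angle_deg": elbow_angle_deg,
    }
    return json.dumps(record, sort_keys=True)


def publish_assessment(payload, topic="sports/posture", host="localhost"):
    """Send one record to an MQTT broker (requires the paho-mqtt package)."""
    import paho.mqtt.publish as publish  # third-party, imported on demand
    publish.single(topic, payload=payload, qos=1, hostname=host)
```

A dashboard subscribed to the same topic could then decode each message with `json.loads` and append it to the athlete's time series.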

Discussion and implications
The proposed pipeline comprises human joints estimation, gait recognition, and pose tracking using a discriminative correlation filter for the application of postural assessment. The key-point coordinates from the human joints estimation are first normalized to obtain key-points that are invariant to the camera's orientation. The 14 normalized vectors are combined to describe the human motion over the frame series. As a result, the position of the human object in the frame does not degrade the performance of gait recognition and postural assessment. With the proposed computer vision processing, no sensors need to be attached to the body during athlete training, which avoids the physical interference that causes unnatural posture and thus helps reduce the risk of sports injury.
To reduce the computational cost of the proposed pipeline, the gait estimation algorithm is optimized by reducing the number of cycles used to estimate the part confidence maps and part affinity fields. Subsequently, the discriminative correlation filter locates the human object in every frame after the gait recognition to further reduce the computational cost. Only the region of interest within the tracked window is cropped and fed into the OpenPose network to extract the joint information. The capture rate in frames per second increases significantly, which enables real-time application on a small-scale standalone module. A higher capture rate in turn provides more human joint samples on the motion plot, improving the signal quality for accurate postural assessment.
The development of a real-time processing unit on the Intel Up Squared board enables an edge computing system. Intel OpenVino optimizes the deep learning human pose estimation model to accelerate its execution on the Intel CPU. Consequently, this sports screening system is optimized to reduce latency for edge computing. Edge computing brings the computation closer to the athlete during physical training. Essentially, it saves transmission time and bandwidth by minimizing the need for long-distance communication between the server and the client. Thus, the overall gait estimation and tracking pipeline enables fast computation that improves the performance of postural assessment in sports applications.

Conclusion
A markerless gait estimation and tracking system with edge computing for postural assessment is proposed in this paper. Human joints are extracted from pose estimation as 14 joint-to-neck distances, which are normalized relative to the neck-to-pelvis distance to obtain features that are invariant to the camera's orientation. These features are used to generate the spatial-temporal sequence, which is passed to an LSTM network for gait recognition. The indexed person is tracked for postural assessment. The automated postural assessment system achieves up to 9 FPS on the Intel Up Squared board with 95% recognition accuracy, meeting the goals of low power consumption and reduced latency. The pipeline introduces gait estimation as a significant measure of body posture in sports analysis, based on the biomechanics of the human body. The body orientation of the athlete can create hard occlusions, and the elbow angle can only be measured from the side view, both of which affect the result of the sports analysis. In future work, to filter out the noise in the spatial-temporal sequence caused by hard occlusions, 3D geometry pose estimation will be used to simulate the occluded joints. This has the potential to eliminate the zero drops caused by occlusions and thereby improve the gait recognition, the sports analysis and the quality of dataset collection.
Funding Open Access funding enabled and organized by CAUL and its Member Institutions. We would like to gratefully acknowledge the support of NVIDIA Corporation with the donation of the Quadro P6000 GPU used for this research.

Conflict of Interests
The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.