1 Introduction

This paper presents a joint motion formulation for calculating the rotation and translation of joints from images taken with a single camera. There is currently active interest in computing the movement of articulated subjects such as actors or athletes, whether to create augmented-reality footage for film, to monitor the progress of patients with mobility impairments, or to improve athletic performance. By analyzing the joint angles in real-life human movements and applying them to animated characters, animators can create more lifelike animations that capture the nuances of human motion. Special effects teams can create realistic-looking CGI characters or scenes that integrate seamlessly with live-action footage. Joint angle measurements provide objective data for tracking improvement or decline in a patient's range of motion and can also support targeted therapy plans. Joint angle computation can likewise be used in sports training and performance analysis to evaluate an athlete's technique and identify areas for improvement. Finally, joint angle measurements can be used to assess an athlete's biomechanics, identify potential injury risks, and optimize performance.

Recently, deep-learning methods have been widely used for image recognition [2, 6, 9, 12, 14]. Notably, images shot with a single camera and no additional sensors are much easier to obtain. Such methods make it easy to estimate the rough shape of an articulated object, but obtaining joint angles and positions accurately remains challenging. In contrast, various sensors have been used to compute human motion. In movies and sports, for example, the positions and orientations of objects must be known accurately, and when objects move, very precise joint angles are required for accurate force calculations. Data on human configurations are relatively easy to collect, but collecting data for the varied configurations of non-human animals is more difficult. Moreover, calculating exact sizes and orientations with deep learning is not easy for many articulated objects.

This paper attempts to obtain a three-dimensional model of the motion of an object containing joints. Previously, much effort has been devoted to three-dimensional analysis from a single image, but other approaches stop at computing the three-dimensional position of a joint. In comparison, we calculate joint motion by determining both the position and the orientation of each joint, so the output of the calculation presented here contains the relative positions and orientations of the joints.

When three-dimensional information is captured as an image, it is projected onto two dimensions, and information is lost in the process. This lost information must be recovered to calculate the movement of the joints, and recovering it has long been a difficulty; this paper uses the following approach. First, we fix a reference on an environmental object. Using multiple shots of a scene as input, we identify the position and orientation of the camera at the time each picture was taken. The movement of the joints is then calculated using the change-of-coordinates equations presented in this paper. At present there is no method for reproducing the three-dimensional model of an articulated body and its texture when the single camera used for photographing and the tracked object move simultaneously. Because the kinematic equations and camera formulas for joints are nonlinear, 3D reconstruction of an articulated body is not possible if the equation for joint motion calculation is incorrect; conversely, if three-dimensional reconstruction of the joints succeeds, it verifies the formula for joint angle calculation.

In this paragraph, we introduce the assumptions and constraints of this study. We assume that only one moving camera is used, with no known external or internal parameters (such as image distortion); that is, the internal parameters of all captured images are the same except for the focal length. If the focal length changes, the position and orientation of the camera also change. We assume that the tracked objects are rigid bodies while they are recorded by the moving camera. Additionally, we suppose that every input image shows the joints at rotation angles different from those in the other images. We expect the correspondence between feature points and their two-dimensional pixel positions to be provided as input, either from another algorithm [30] or from the user. This study also assumes that feature points can be found not only on the moving body but also on stationary background objects, and that foreground/background information is available. As constraints, the study requires information on the structure of the joints and the joint segments. There must be more than three feature points from which the equation of a plane in contact with the moving and rotating joints can be derived, and there must be at least three independent feature points available for calculating the motion of any one joint. Finally, to avoid local optima, an initialization step is required in the parameter optimization.

2 Related works

Many other articles focus on calculating joint angles rather than the translational distance of a joint; in this paper, we calculate the translational distance as well. Joint movement can be calculated from either images or sensors; methods based on extra sensors are not covered in this paper. To calculate joint motion from images, we can use either 3D geometric analysis or learning-based approaches [3, 28], which have recently attracted attention.

Many papers have focused on human pose estimation [16, 24, 28]. However, the joint angle computation formula presented in this paper can be used for any kind of articulated body. The references below show the trend. First, images were analyzed to extract features such as shape, color, and texture [32]. Human pose was then estimated using multiple cameras [8]. Afterwards, the more difficult problem of estimating human pose from a single camera was attempted [5]. Later, human pose was estimated using 2D and 3D models [8, 23], and equipment such as RGBD cameras was used to collect depth information [27]. Recently, the trend has shifted towards deep learning [3, 28].

A 2012 survey article [8] reviews methods of estimating model-based human body pose from multi-view video. It classifies human body pose estimation into a five-step process: camera calibration, voxel reconstruction, segmentation, estimation, and tracking. For the single-camera setting, a 2016 survey paper [5] uses feature analysis, human body models, and methodology as criteria for classifying human pose estimation algorithms. Feature analysis in [5] is divided into low-level analysis of shape, color, texture, etc., mid-level analysis of local and global features, and high-level analysis of combined human body parts. Human body models are classified into kinematic, planar, and volumetric models. The methodologies described in [5] are divided into discriminative methods such as deep learning and bottom-up methods that start from the interpretation of pixels or body parts. Deep learning yields estimated values for the joint angles. In comparison, this paper focuses on calculating exact joint angles from accurate feature-point inputs using coordinate transformation formulas.

This paper further develops Lee and Chen's method [13] and proposes a method of calculating joint motion by using changes of coordinate systems. To change coordinate systems correctly, the position and orientation of the camera must be known; they can be found using three-dimensional reconstruction [7, 15] or three-dimensional triangulation [7]. Previous work such as Hartley's book [7] and Moon's paper [15] provides useful references for three-dimensional reconstruction. Many three-dimensional reconstruction methods [7, 15] are based on feature points and require foreground/background separation [10]. In comparison, we use shape-from-silhouette for the three-dimensional reconstruction [10]. The shape-from-silhouette method uses silhouette images of an object from multiple viewpoints, and each pixel of a two-dimensional silhouette defines a viewing pyramid.

From the perspective of multiple-view 3D reconstruction, which examines pictures taken from different positions and orientations, a three-dimensional volume can be created by removing everything outside the silhouettes. We then work backwards to determine which pixels of each view project onto the surface voxels of the 3D object created by this projection. By taking the average of the corresponding scene pixels as the color of each voxel, a three-dimensional object and its texture can be reconstructed. This paper uses a space carving algorithm [10] to perform 3D reconstruction for each joint. Because the joints move, their movement must be calculated first.

The key idea for finding joints is to calculate the intersection space of the body areas associated with each joint [30]. This method connects joints with a link if there is an intersection between the spaces in which the joints operate. To find the intersection, the motion of feature points is expressed as a matrix and analyzed with its singular value decomposition; solving the expressions related to the resulting null space reveals the position and structure of the joints. Multi-body segmentation is then solved using rigid factorization of the affinity matrix. Other previous works encounter various problems in multi-body segmentation. Zelnik-Manor and Irani [31] used eigenvectors to separate motions, but their performance deteriorates for articulated motion. Tresadern and Reid [26] used RANSAC [25] for body segmentation. Yan and Pollefeys [30] provided a segmentation algorithm for articulated motion by building an affinity matrix and computing principal angles; nevertheless, their method degrades when the input is noisy.

Training-based methods have been limited to the computation and pattern recognition of various human postures and are difficult to use in applications that require accurate joint angle and motion measurements. These methods use object postures and images for training. For instance, Jiang [4, 9] found the nearest shape using a large database of correspondences between joints and postures. Akhter and Black [1] collected a motion-capture data set, from which they observed that joint movements are limited by posture; they defined body-posture parameters and estimated three-dimensional postures from the two-dimensional coordinates of each joint by gathering a large data set of postures. Martinez et al. [14] built a system predicting three-dimensional coordinates given the two-dimensional locations of joints. Chen and Ramanan [2] implemented a system for estimating two-dimensional postures using deep neural networks and available 3D motion-capture data. Gupta et al. [6] used motion-capture examples and implemented a system that matches features based both on motion and on the frequency domain. Lassner et al. [12] introduced an extended statistical body shape model by obtaining three-dimensional body-model fits derived from data sets of multiple human postures. Chen and Ramanan [2] followed a similar approach.

3 Overview of proposed method

Fig. 1 Joint angle and translation calculation flow of an articulated body

This paper approaches the problem of joint angle computation by collecting information on the joint structure directly from the user. Using this input, we reconstruct three-dimensional joints by calculating the correct joint angles and the coordinates of the joint centers. We can reconstruct joints together with their surfaces, a calculation that has not been attempted in existing methods. The weakness of the proposed method is that this additional information, the joint structure and the feature points, must be provided by the user or by other algorithms.

Figure 1 shows the procedure for computing a texture-mapped three-dimensional reconstruction by calculating the exact motion of the joints. All input images contain joints with different translational distances and rotation angles. We calculate the exact translation and rotation angle of the joints in each photographed image. One scene is selected as the reference scene; relative to it, the translation and rotation of the joints in each input image are undone by moving the camera used for photography inversely. The camera is translated and rotated in the direction opposite to the movement of the joints, so that it is as if the joints had not moved. Next, texture-mapped three-dimensional reconstruction is performed on the three-dimensional object in the absence of joint movement. In this paper, we focus on the formula for calculating the translations and rotation angles of the joints and the formula for moving the camera inversely with respect to the joint motion.

Step (A) computes the three-dimensional coordinates of scene feature points and the camera parameters. This paper uses a moving camera with varying focal length to capture images of a scene and calculates both the internal and external parameters of the camera. The locations of feature points on a fixed environmental object, such as the floor, can be found by triangulation; this part can be replaced by any other three-dimensional reconstruction method [7, 15]. Afterwards, we find the equation of the bottom plane from three feature points located on the floor, which is a fixed environmental object. Step (B) calculates the approximate translational distance of the first joint by locating the point where the ray emanating from one feature point of the first joint meets the floor plane [21]; this value is used as the initial value when optimizing the joint motion. In step (C), the joints are processed sequentially, starting from the first joint, with the two-dimensional positions of the feature points as input.
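To make the floor-plane part of steps (A) and (B) concrete, the sketch below derives the plane equation from three triangulated floor points and intersects a ray with it. The helper names and the NumPy formulation are ours, not the paper's implementation, and the three points are assumed to be non-collinear.

```python
import numpy as np

def floor_plane_from_points(p1, p2, p3):
    """Plane n.x + d = 0 through three non-collinear 3D floor points (step (A))."""
    p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p1, p2, p3))
    n = np.cross(p2 - p1, p3 - p1)        # plane normal
    n = n / np.linalg.norm(n)             # normalize for numerical stability
    d = -np.dot(n, p1)                    # plane offset
    return n, d

def ray_plane_intersection(origin, direction, n, d):
    """Point where origin + t*direction meets the plane n.x + d = 0 (used in step (B))."""
    origin, direction = np.asarray(origin, float), np.asarray(direction, float)
    t = -(np.dot(n, origin) + d) / np.dot(n, direction)
    return origin + t * direction

# Three points on the z = 0 floor give the plane z = 0; a ray from above hits it at z = 0.
n, d = floor_plane_from_points([0, 0, 0], [1, 0, 0], [0, 1, 0])
print(n, d)                                                   # ~ [0. 0. 1.], 0.0
print(ray_plane_intersection([0, 0, 5], [1, 0, -1], n, d))    # ~ [5. 0. 0.]
```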

For example, the joint of a person's foot that is in contact with these fixed objects is the first target whose motion is calculated. A previous paper [21] shows that calculating combined translational and rotational motion is difficult because of the many local optima. To address this problem, the approximate movement of a joint is first calculated using another joint or a surrounding environmental object whose position and orientation are known, and the result is used as the initial value for the parameter optimization that computes the exact movement. The movement of a joint is determined through a change of coordinate systems: the ray through a feature point of the selected joint in one scene is transformed according to the candidate joint movement, and its intersection with the ray pointing at the corresponding feature point in the other scene is computed.

This paragraph describes the method for calculating joint motion. We assume that the k-th feature point appears in both the i-th and the j-th scene, and that the j-th scene is selected as the reference scene. We cast a ray from the camera of the i-th scene toward the k-th feature point. The ray equation of the i-th scene is rotated and translated so that it passes through the k-th feature point of the j-th scene. The calculation then minimizes the error between the feature points of the real image and the computed feature points obtained when the 3D intersection point of the two rays is projected onto the camera image. A parameter optimization method finds the optimal translation and rotation. The calculation uses the moving feature points and their correspondences, based on the provided joint structure, as input.
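A minimal sketch of the intersection test used here: the closest-point computation between two 3D rays and the midpoint of their shortest connecting segment. The function name and parameterization are ours; the paper formulates the same test with its \(\bigotimes \) operator.

```python
import numpy as np

def ray_midpoint(o1, d1, o2, d2):
    """Midpoint of the shortest segment between rays o1 + t*d1 and o2 + s*d2.

    Returns (midpoint, t, s); t and s are the ray parameters discussed in the text.
    Assumes the rays are not parallel.
    """
    o1, d1, o2, d2 = (np.asarray(v, dtype=float) for v in (o1, d1, o2, d2))
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = o1 - o2
    denom = a * c - b * b
    t = (b * (d2 @ w) - c * (d1 @ w)) / denom
    s = (a * (d2 @ w) - b * (d1 @ w)) / denom
    p1, p2 = o1 + t * d1, o2 + s * d2     # closest points on each ray
    return 0.5 * (p1 + p2), t, s

# Two rays that meet exactly at (0, 0, 20) return that point with t = s = 1.
mid, t, s = ray_midpoint([400, 0, 0], [-400, 0, 20], [0, -400, 0], [0, 400, 20])
print(mid, t, s)   # ~ [ 0.  0. 20.] 1.0 1.0
```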

The output of step (C) is fed back into step (C) for the next joint and forwarded to step (D). In step (D), we calculate the three-dimensional volume of the fixed part and of each rotating part of the articulated body. Steps (A), (B) and (C) all use the parameter optimization method.

4 Notation used in joint movement computation

4.1 Extended quaternion

The reason for using the extended quaternion is to reduce the number of optimization variables. Ordinary quaternions represent rotation only; to overcome this, we use an extended quaternion that also carries translation variables. For clarity, the equations in this paper are presented in matrix form, but the actual implementation uses the quaternion form [22].
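As an illustration of what we mean by an extended quaternion, the sketch below expands seven numbers (a unit quaternion plus a translation) into a 4\(\times \)4 homogeneous transform. The scalar-first convention and the helper name are ours and may differ from the implementation in [22].

```python
import numpy as np

def extended_quaternion_to_matrix(q, t):
    """(qw, qx, qy, qz) plus (tx, ty, tz) -> 4x4 homogeneous transform.

    Seven optimization variables per joint (plus one unit-norm constraint),
    instead of the twelve entries of a 3x4 rigid transform.
    """
    qw, qx, qy, qz = np.asarray(q, dtype=float) / np.linalg.norm(q)
    R = np.array([
        [1 - 2*(qy*qy + qz*qz), 2*(qx*qy - qw*qz),     2*(qx*qz + qw*qy)],
        [2*(qx*qy + qw*qz),     1 - 2*(qx*qx + qz*qz), 2*(qy*qz - qw*qx)],
        [2*(qx*qz - qw*qy),     2*(qy*qz + qw*qx),     1 - 2*(qx*qx + qy*qy)],
    ])
    Tr = np.eye(4)
    Tr[:3, :3] = R
    Tr[:3, 3] = t
    return Tr

# Identity rotation plus a pure translation of (10, 0, 0).
print(extended_quaternion_to_matrix([1, 0, 0, 0], [10, 0, 0]))
```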

4.2 Notation used in coordinate system transformation

To represent the coordinate transformation between two 3D coordinate systems, we usually use a 4\(\times \)4 homogeneous transformation matrix. The notation \({}^{A}_{B}Tr\) denotes the transform that converts values expressed in the B-coordinate system into the A-coordinate system. All transformation matrices in this paper are implemented with extended quaternions. All 2D image coordinate systems are denoted by I; let \(I_i\) be the coordinate system of the two-dimensional image taken by the i-th camera at position and orientation \(C_i\). An articulated body has several joints. The reference 3D coordinate system is denoted W, the first joint in contact with an environmental object is denoted \(J_1\), and the three-dimensional coordinate system of the first joint photographed by camera \(C_i\) is denoted \(I_i J_1\). The \(I_i W\) coordinate system is a three-dimensional coordinate system identical to W. The three-dimensional coordinate system of the l-th joint in image i is \(I_i J_l\) [22].

Let the coordinate system of camera i be \(C_i\). The k-th feature point appears as image pixel \(m_{ik}\) in the two-dimensional image plane, which is perpendicular to the Z-axis of this camera coordinate system, and we assume that a ray is cast from the origin of \(C_i\) through this pixel. Similarly, let the coordinate system of camera j be \(C_j\); the k-th feature point appears as image pixel \(m_{jk}\) in the corresponding image plane, and a ray from the origin of \(C_j\) passes through this pixel. The focal length \(f_i\) is used for photographing image i and the focal length \(f_j\) for image j.
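Under our reading of this parameterization (used later in (1)), the point at parameter t along the ray of pixel \(m_{ik}=(u,v)\) with focal length \(f_i\) can be written in homogeneous coordinates as below; the sign convention for the image plane is an assumption on our part.

```python
import numpy as np

def ray_point(m, f, t):
    """Homogeneous point (u*t, v*t, -f*t, 1) at parameter t along the ray of pixel m.

    The ray leaves the camera origin (t = 0) and crosses the image plane z = -f at t = 1.
    """
    u, v = m
    return np.array([u * t, v * t, -f * t, 1.0])

print(ray_point((2.0, 1.0), 35.0, 1.0))   # -> [  2.   1. -35.   1.]
```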

The transformation matrix \({}^{W}_{C_i}Tr\) represents the \(C_i\) coordinate system observed in the reference coordinate system W. \(M_k\) denotes the k-th three-dimensional feature point of the object in the reference coordinate system; it is projected onto each captured image and observed at pixels \(m_{ik}\) and \(m_{jk}\) of images i and j, respectively, while its three-dimensional position is unknown.

5 Calculating translation, rotation angle and joint length of joints

In this section, the procedure for processing tree-structured joints is discussed in comparison with earlier work [20], and a new equation based on coordinate system transformation is developed. A previous paper [19] introduces the concept of joint angle calculation, another paper [17] computes motion from two images, and an incomplete method [18] computes motion from multiple images. This paper presumes that the user knows the joint structure in advance in order to calculate joint angles. The structure of each joint has a fixed but unknown length, and every joint has a total of six degrees of freedom, three from three-dimensional rotation and three from three-dimensional translation. In particular, if the movement of a joint includes translation, the calculation requires the coordinates and movement of either an environmental object or an already-known joint to avoid local optima; local optima exist both in the calculation of the joint rotation angle and in that of the center of rotation [21]. Joint angles and translational distances can be calculated for the joints of a tree structure. After calculating the joint angles and translation distances of a structure with three-dimensional rotation and translation, this paper performs 3D reconstruction for each joint to verify the accuracy of the joint angle computation.

One scene is fixed as the frame of reference for calculating the rotation angles and translational distances of the other scenes; the translations and rotation angles of the joints in each selected scene are expressed with respect to this reference scene, and 3D reconstruction of the articulated body is performed in the shape of the reference scene. We therefore consider a pair of given scenes, \(I_i\) and \(I_j\). The first reference coordinate system used in the motion calculation is the fixed world coordinate system W. The locations of feature points on fixed objects are found by ray intersection, and the floor is assumed to be the fixed environmental object in contact with the joints. The calculation of the initial value for the translation is described in [21]. Once the motion of joint \(J_l\) is known in all scenes, we use \(J_l\) to calculate the motion of the adjacent joint \(J_{l+1}\). It is assumed that joint \(J_{l+1}\) rotates and translates in scene \(I_j\) with respect to scene \(I_i\), and 3D joint reconstruction is based on the reference scene \(I_j\). If the exact motion of the joint is calculated, the result of applying the corresponding rotation and translation to the ray vectors of scene \(I_i\) has the same shape as joint \(J_{l+1}\) in scene \(I_j\). In this case, the 3D feature point \(M_k\) of joint \(J_{l+1}\) can be made identical across scenes, as if there were no motion, and the feature-point techniques of standard 3D reconstruction can then be used.

Let us assume \(I_j\) is set as the reference scene. The transform that converts the \(I_i J_l\) coordinates of scene i into the \(I_j J_l\) coordinates of scene j is \([{}^{I_j J_l}_{I_j J_{l+1}}Tr][{}^{I_i J_l}_{I_i J_{l+1}}Tr]^{-1}\); this product also represents the relative rotation and translation of joint \(J_{l+1}\) from scene i to scene j. Treating this transform as a set of variables, optimization finds the translation and rotation matrices that correctly map one viewpoint onto the other. The internal parameters of the camera capturing each scene were obtained from the fixed background in calculation step (A). This paper assumes that the same feature point \(M_k\) appears in both image i and image j, and that correspondence information between feature points across scenes is provided. One ray can then be assumed to be cast toward each feature point projected on each camera.
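The relative-motion product above is plain 4\(\times \)4 matrix algebra; a small sketch with our own helper names follows (the paper's implementation stores these transforms as extended quaternions).

```python
import numpy as np

def invert_rigid(Tr):
    """Closed-form inverse (R^T, -R^T t) of a rigid 4x4 transform."""
    inv = np.eye(4)
    inv[:3, :3] = Tr[:3, :3].T
    inv[:3, 3] = -Tr[:3, :3].T @ Tr[:3, 3]
    return inv

def relative_joint_motion(Tr_j, Tr_i):
    """[ {}^{I_j J_l}_{I_j J_{l+1}}Tr ][ {}^{I_i J_l}_{I_i J_{l+1}}Tr ]^{-1}.

    Maps joint J_{l+1} as posed in scene i onto its pose in scene j,
    both expressed in the parent-joint frame J_l.
    """
    return Tr_j @ invert_rigid(Tr_i)

# With Tr_i = I and Tr_j a pure translation of (10, 0, 0), the relative motion
# is that translation itself, as in the worked example of Section 6.5.
Tr_i = np.eye(4)
Tr_j = np.eye(4); Tr_j[:3, 3] = [10, 0, 0]
print(relative_joint_motion(Tr_j, Tr_i))
```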

Let the ray vector \(\vec {ray}(m_{ik})\) point toward \(M_k\) from the origin of camera i, \(C_i\), and the ray vector \(\vec {ray}(m_{jk})\) point toward \(M_k\) from the origin of camera j, \(C_j\). Let \(\bigotimes \) represent a ray intersection test. When two rays intersect, the 3D intersection point is calculated and denoted \(\bar{M_k}\); if they do not meet exactly, we take the midpoint of the shortest segment between the two rays as the 3D intersection point. \(\bar{M_k}\) is projected back onto \(C_i\) and \(C_j\). We then find the joint translation, the center of rotation, and the rotation angle that minimize the error between the projected pixels and the actual two-dimensional input positions \(m_{ik}\) and \(m_{jk}\). As in the case of fixed scenes, the translations and rotation angles of joints are determined through this type of ray intersection test. The joint angles are assumed to differ across all input scenes. The tracked object has multiple joints, and every joint may translate and rotate; the translational distance, rotation angle, and center of rotation of each joint must be corrected accurately for three-dimensional articulated body reconstruction. To calculate the k-th three-dimensional position on a rotating articulated body, the coordinate transformation that accounts for the joint motion must be included. The parameters t and s parameterize the two rays and also represent the distances from the camera centers to the ray intersection point. The joint angle calculation can be understood as a coordinate transformation calculation that may include both rotation and translation. Since rotation space differs from the Cartesian space we usually work in, there is a limit to the precision of joint angles computed from projected pixel differences alone; the distance from the camera center to the ray intersection effectively converts this rotational problem into a Cartesian-space problem.

It is assumed that the feature points \(M_k\) on the joints appearing in images i and j are observed in both images, and that the \((l+1)\)-th joint in image j is rotated and translated by an unknown amount relative to the \((l+1)\)-th joint in image i. We apply the coordinate system transformation that converts the reference coordinate system from \(I_i J_l\) to \(I_j J_l\); scene j is the reference scene for 3D joint reconstruction. To correct the translation and rotation of joint \(J_{l+1}\), we solve the optimization problem with \([{}^{I_j J_l}_{I_j J_{l+1}}Tr][{}^{I_i J_l}_{I_i J_{l+1}}Tr]^{-1}\) as the unknown. For joints without translation, the translation component simply places the center of rotation of the joint. Two rays can be assumed to be cast from image i and image j toward the three-dimensional feature point \(M_k\) on the \((l + 1)\)-th link. However, since the \((l + 1)\)-th joint is rotated and translated relative to image i, this must be corrected: the rays on the image-i side are transformed and then intersected with the rays on the image-j side. Specifically, \({}^{C_i}\vec {ray}(m_{ik})\) is translated and rotated by \([{}^{I_j J_l}_{I_j J_{l+1}}Tr][{}^{I_i J_l}_{I_i J_{l+1}}Tr]^{-1}\).

We perform the intersection calculation between the transformed ray \({}^{C_i}\vec {ray}(m_{ik})\) = \((m_{ik} t~~-f_i t ~~ 1)^T\) and the ray \({}^{C_j}\vec {ray}(m_{jk})\) = \((m_{jk} s~~-f_j s ~~ 1)^T\). Using the calculated \(I_i J_l\) and \(I_j J_l\) coordinate systems, the rotation angle and translation of joint \(J_{l+1}\) are calculated from its feature points \(M_k\). The ray intersection test \(\bigotimes \) of calculation step (C) is performed using (1) below.

$$\begin{aligned} \begin{array}{l} {}^{I_j J_l} \bar{M}_{kij} = ( [{}^{I_j J_l}_{I_j J_{l+1}}Tr ][{}^{I_i J_l}_{I_i J_{l+1}}Tr ]^{-1} [{}^{I_i J_l}_{C_i} Tr] {}^{C_i}\vec {ray}(m_{ik}) ) \bigotimes [{}^{I_j J_l}_{C_j} Tr] {}^{C_j}\vec {ray}(m_{jk}) \end{array} \end{aligned}$$
(1)

Equation (1) evaluates candidate rotation and translation values. Let \(\bar{M}_{kij}\) be the midpoint of the shortest segment between \(\vec {ray}(m_{ik})\) and \(\vec {ray}(m_{jk})\). \(\bar{M}_{kij}\) is projected onto image j, giving the projected point \(\hat{m}_j (A, R_j , \vec {t}_j , \bar{M}_{kij} )\). For the k-th feature point, we evaluate the error between the input point \(m_{jk}\) and \(\hat{m}_j (A, R_j , \vec {t}_j , \bar{M}_{kij} )\). The procedure finds the center of rotation and the rotation angle of the joint that minimize this error, and the optimization terminates when the algorithm determines that the optimum has been reached. The variables that store the optimal values are \([{}^{I_j J_l}_{I_j J_{l+1}}Tr]\) and \([{}^{I_i J_l}_{I_i J_{l+1}}Tr]\), where i and j range from 1 to n and \(i \ne j\). Since there are n input images in total, each feature point is compared \(n-1\) times. The reason for keeping the reference coordinate system \(I_i J_{l+1}\) as a variable is that this coordinate system is needed to calculate the rotation and translation of the next connected joint. If there is only one joint, we treat \([{}^{I_j J_l}_{I_j J_{l+1}}Tr][{}^{I_i J_l}_{I_i J_{l+1}}Tr]^{-1}\) as the set of variables.

$$\begin{aligned} \begin{array}{l} \min ~ w_1 \sum _{k=1}^{p} \sum _{i=1}^{n-1} \sum _{j=i+1}^{n} \Vert m_{jk} - \hat{m}_j (A, {}^{W}_{i}R , \vec {t}_i , {}^{W}_{j}R , \vec {t}_j , [{}^{I_j J_l}_{I_j J_{l+1}}Tr ], [{}^{I_i J_l}_{I_i J_{l+1}}Tr ], \bar{M}_{kij}) \Vert ^2 \\ \quad + w_2 \sum _{k=1}^{p} \sum _{i=1}^{n-1} \sum _{j=i+1}^{n} \left( (t~\text {distance difference})^2 + (s~\text {distance difference})^2 \right) \end{array} \end{aligned}$$
(2)

where A is the intrinsic camera parameter matrix, \({}^{W}_{i}R\) is the orientation matrix of the camera at shot i, and \({}^{W}\vec {t}_i \) is the position vector of the i-th camera expressed in the reference coordinate system W.

\(\hat{m}_j (A, {}^{W}_{i}R , {}^{W}\vec {t}_i, {}^{W}_{j}R , {}^{W}\vec {t}_j , [{}^{I_j J_l}_{I_j J_{l+1}}Tr ], [{}^{I_i J_l}_{I_i J_{l+1}}Tr ], \bar{M}_{kij} ) \) is the calculated two-dimensional pixel obtained by projecting the three-dimensional point \(\bar{M}_{kij}\), the intersection point of the two rays, onto image j. \(w_1\) is a weight for the error between the calculated and input feature points, and \(w_2\) is a weight for the distances between the cameras and the calculated three-dimensional feature point; these distances must remain consistent to obtain a stable intersection point. t and s are the parameters of the two rays, from scene i and scene j respectively, and encode the distance from each camera center to the intersection point. The rotation angle is thereby converted into distance values via t and s, and distance values are more reliable than joint angle values. We use three variables for translation and four for the quaternion rotation to represent \([{}^{I_i J_l}_{I_i J_{l+1}}Tr ]\); each quaternion adds one constraint equation. For n input images, calculation step (C) therefore uses up to 7n variables and n quaternion constraints. The result of step (C) is the local coordinate system of the joint in each input image, with the j-th image fixed as the reference image. These results are used in steps (C) and (D) of Fig. 1. For joints containing rotation only, \((4n + 3)\) variables may be used.
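To make the structure of step (C) concrete, the sketch below wires the pieces of (1) and (2) into `scipy.optimize.least_squares` for a single joint and a single image pair. All names are ours, the projection is a generic pinhole model, the residuals use only the image-j reprojection error plus a simple ray-gap penalty in place of the paper's t/s distance-difference terms, and the quaternion constraint is handled by normalization; treat it as an illustration of the idea rather than the paper's implementation.

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

K = np.array([[800.0, 0.0, 320.0],     # assumed pinhole intrinsics (the matrix A)
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])

def pose(rotvec, t):
    """World-from-camera 4x4 pose from a rotation vector and a translation."""
    Tr = np.eye(4)
    Tr[:3, :3] = Rotation.from_rotvec(rotvec).as_matrix()
    Tr[:3, 3] = t
    return Tr

def project(Tr_wc, X):
    """Project a world point X into the camera with world-from-camera pose Tr_wc."""
    Xc = np.linalg.inv(Tr_wc) @ np.append(X, 1.0)
    uvw = K @ Xc[:3]
    return uvw[:2] / uvw[2]

def pixel_ray(Tr_wc, m):
    """World-space origin and direction of the viewing ray through pixel m."""
    d_cam = np.linalg.inv(K) @ np.array([m[0], m[1], 1.0])
    return Tr_wc[:3, 3], Tr_wc[:3, :3] @ d_cam

def closest_points(o1, d1, o2, d2):
    """Closest points on two non-parallel rays o1 + t*d1 and o2 + s*d2."""
    a, b, c = d1 @ d1, d1 @ d2, d2 @ d2
    w = o1 - o2
    den = a * c - b * b
    t = (b * (d2 @ w) - c * (d1 @ w)) / den
    s = (a * (d2 @ w) - b * (d1 @ w)) / den
    return o1 + t * d1, o2 + s * d2

def motion_matrix(x):
    """7-vector (quaternion xyzw, translation) -> 4x4 relative joint motion."""
    Tr = np.eye(4)
    Tr[:3, :3] = Rotation.from_quat(x[:4] / np.linalg.norm(x[:4])).as_matrix()
    Tr[:3, 3] = x[4:]
    return Tr

def residuals(x, cam_i, cam_j, pix_i, pix_j, w2=1.0):
    """Image-j reprojection error of the ray midpoints plus a ray-gap penalty."""
    X = motion_matrix(x)
    res = []
    for mi, mj in zip(pix_i, pix_j):
        oi, di = pixel_ray(cam_i, mi)
        oi = (X @ np.append(oi, 1.0))[:3]     # move the scene-i ray with the
        di = X[:3, :3] @ di                   # candidate joint motion X
        oj, dj = pixel_ray(cam_j, mj)
        p1, p2 = closest_points(oi, di, oj, dj)
        res.extend(project(cam_j, 0.5 * (p1 + p2)) - mj)
        res.append(w2 * np.linalg.norm(p1 - p2))   # the rays should truly meet
    return np.asarray(res)

# Tiny synthetic test: the joint translates by (0.1, 0, 0) between the two scenes.
rng = np.random.default_rng(0)
cam_i = pose([0.0, 0.0, 0.0], [0.0, 0.0, 0.0])
cam_j = pose([0.0, -0.15, 0.0], [1.0, 0.0, 0.0])
pts = rng.uniform([-1, -1, 4], [1, 1, 6], size=(8, 3))
pix_i = np.array([project(cam_i, p) for p in pts])
pix_j = np.array([project(cam_j, p + [0.1, 0.0, 0.0]) for p in pts])

x0 = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])   # identity initial guess
sol = least_squares(residuals, x0, args=(cam_i, cam_j, pix_i, pix_j))
print(motion_matrix(sol.x))   # translation column should be close to (0.1, 0, 0)
```

With more images and joints, the same residual structure extends to the full sums over i, j, and k in (2).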

6 Comparison of other methods with our method using a simple example

In this section, we compare the method presented in this paper with other methods using a simple example. There are methods that use various sensors to calculate joint angles, but the method described in this paper requires only one camera. Direct comparison between methods using a moving camera and those using multiple fixed cameras, or between those using a depth camera and an ordinary camera, is not fair because the environments differ. Ordinary cameras have no depth sensor; with a depth sensor it is easy to remove the background and compute the position of a feature point, and a fixed camera can easily obtain 3D positions by triangulation.

6.1 Equipment and inputs used

In this paper, the articulated body was photographed with one freely moving IXY 810 digital camera with variable focal length, and the joint angles of the articulated body were then calculated from the captured images. The position and orientation of the camera are found by performing 3D reconstruction on the background. The 3D position of the articulated body is not immediately known, but it becomes known once the joint angles have been calculated. This paper presents a novel method for calculating joint angles using only one moving camera.

6.2 Using marker versus not using marker

Markers are used to facilitate the detection of feature points. Having markers is convenient, but markers are generally not available in real images. The joint motion calculation method presented in this paper can be used with or without markers.

6.3 Methods using depth camera

Deep learning and depth cameras have enabled estimation of the configuration of an articulated body [29, 33]. However, depth cameras based on infrared technology are limited in the distances they can measure. The position and orientation of the camera can be found by 3D reconstruction. With an ideal, error-free depth camera, the three-dimensional position of everything in the camera's view is determined directly, and this 3D information allows more sophisticated operations such as background removal. However, readily available cameras do not have depth sensors, and measured depth is imperfect because of occlusion. If the position and orientation are both precisely known, two configurations can be compared to compute the joint motion by solving for the 3D transformation matrix of the transformed joint as a variable.

6.4 A simple example

Fig. 2 Example of an articulated body motion

Figures 2(a) and 2(b) show two 3D articulation configurations, under the assumption that these are the configurations obtained when the exact joint angles are calculated. This study obtains such configurations by parameter optimization; for convenience, it is assumed here that the answer has already been found, and we verify that the equations work by substituting the actual values into them. Figure 2(a) is the configuration before the articulated body moves, and Fig. 2(b) is the configuration after the movement. All we know is one image of each configuration, but the position and orientation of the camera can be found by first performing 3D reconstruction on the background feature points. Starting from the joint in contact with a background object, we calculate the movement of the joints one by one to find the 3D positions of the feature points. W is the world coordinate system. \(J_1\) is the first joint coordinate system, which only translates; it is located at the origin (0,0,0) in Fig. 2(a). The x, y, and z axes are shown in different colors: red for the x-axis, green for the y-axis, and blue for the z-axis. \(J_2\) is the coordinate system established after the first rotation; after a rotation about the Y axis of the W coordinate system, it is located at (0,0,20) in Fig. 2(a). The second rotary joint is located at (0,0,50) in Fig. 2(a), and the coordinate system installed after its rotation is called the \(J_3\) coordinate system. In the configuration of Fig. 2(a), after the body moves by (10,0,0) along the x-axis from the W point of view, joint \(J_2\) rotates \(-90\) degrees about the y-axis from the W point of view, and joint \(J_3\) rotates \(+90\) degrees about the y-axis from the W point of view; the resulting configuration is the one shown in Fig. 2(b). \(I_1 J_0\) denotes the \(J_0\) coordinate system of image \(I_1\), and \(I_1 J_0\) is identical to the \(I_2 J_0\) and W coordinate systems.

$$\begin{aligned}{}[ {}^{W}_{I_1 J_2} Tr ] = \left( { \begin{array}{cccc} 0 &{} 0 &{}-1 &{} 0\\ 0 &{} 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) , [ {}^{W}_{I_1 J_3} Tr ] = \left( { \begin{array}{cccc} 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{}-1 &{} 0\\ -1 &{} 0 &{} 0 &{} 50\\ 0 &{} 0 &{} 0 &{} 1 \end{array} } \right) , [ {}^{I_1 J_2}_{I_1 J_3} Tr ] = \left( { \begin{array}{cccc} -1&{}0&{}0&{}30\\ 0 &{}0&{}-1&{}0\\ 0&{}-1&{}0&{}0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \end{aligned}$$
$$\begin{aligned}{}[ {}^{W}_{I_1 J_1} Tr ]= I ,~[ {}^{I_2 J_1}_{I_2 J_2} Tr ] = \left( { \begin{array}{cccc} -1 &{} 0 &{} 0 &{} 0\\ 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} -1 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) , [ {}^{I_2 J_2}_{I_2 J_3} Tr ] = \left( { \begin{array}{cccc} 0 &{} -1 &{} 0 &{} 30\\ 0 &{} 0 &{}-1 &{} 0\\ 1 &{} 0 &{} 0 &{} 0\\ 0 &{} 0 &{} 0 &{} 1 \end{array} } \right) \end{aligned}$$
(3)

6.5 First joint translation calculation

In this subsection, the coordinate system of the \(J_1\) joint is calculated, with the calculation carried out in the \(J_0\) coordinate system. Since 3D reconstruction has already been performed, the positions and orientations of the cameras that captured the \(I_1\) and \(I_2\) images are known. We assume that the camera position from the point of view of \(I_1 J_0\) is (400,0,0). The feature point corresponding to (0,0,20) in Fig. 2(a) is captured in both \(I_1\) and \(I_2\). Comparing the ray formula of this example with the ray formula of the input image \(I_1\), both rays have the same starting point and direction. Therefore, the ray connecting the starting point (400,0,0) of the \(C_1\) camera with the feature point (0,0,20) can be obtained from \(I_1\) even though the location of the point (0,0,20) is not known. This ray is called \(\vec {ray}(t)\) and is used for the intersection calculation; t and s are the ray parameters.

$$\begin{aligned} { [ {}^{W}_{I_1 J_1} Tr ]= I ,~ {}^{W} \vec {ray}(t) = {}^{W} \left( \begin{array}{c} 400\\ 0\\ 0 \end{array} \right) + {}^{W} \left( \begin{array}{c} -400\\ 0\\ 20 \end{array} \right) t } \end{aligned}$$
(4)

Compared with scene \(I_1\), scene \(I_2\) shows a movement of (10,0,0). Equation (1), which includes the intersection calculation, would normally be solved through an optimization process to find this movement of (10,0,0); for convenience, we instead check the accuracy of the formula by substituting the known answer into it. In addition, (1) is written from the camera viewpoint, but in this example, since full camera viewpoint information is not given, the W viewpoint is used for the conversion.

$$\begin{aligned} \hspace{-3 mm} [{}^{I_2 J_0}_{I_2 J_1} Tr]= \hspace{-1mm} \left( { \begin{array}{cccc} 1 &{} 0 &{} 0 &{} 10\\ 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} 1 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \hspace{-1mm} , [{}^{I_1 J_0}_{I_1 J_1} Tr ]= I, [{}^{I_2 J_0}_{I_2 J_1} Tr ][ {}^{I_1 J_0}_{I_1 J_1} Tr ]^{-1} = \hspace{-1mm} \left( { \begin{array}{cccc} 1 &{} 0 &{} 0 &{} 10\\ 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} 1 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \hspace{-1mm} =Tr_1 \end{aligned}$$
(5)

If we apply (1) to calculate the motion of the joint when changing from scene \(I_1\) to scene \(I_2\), as in (5), \([ {}^{I_1 J_0}_{I_1 J_1} Tr ]=I\), indicating no change, and the translation component of \([ {}^{I_2 J_0}_{I_2 J_1} Tr ]\) is \((10,0,0)^T\); the feature point observed in scene \(I_1\) has moved in scene \(I_2\). \([ {}^{I_1 J_0}_{I_1 J_1} Tr ]\) and \([ {}^{I_2 J_0}_{I_2 J_1} Tr ]\) are normally found by the optimization calculation, so substituting them into (1) here only confirms that (1) is valid. If \(\vec {ray}(t)\) is transformed to the \(I_2\) scene using (5), it follows that:

$$\begin{aligned} { [ Tr_1 ] {}^{I_1 J_0} \vec {ray}(t) = {}^{I_2 J_0} \left( \begin{array}{c} 410-400t\\ 0\\ 20t\\ 1 \end{array} \right) } \end{aligned}$$
(6)

Equation 1 performs the calculation of ray intersection with this and \(\vec {ray}(s)\), which is a ray on the \(I_2\) side. Similarly, from the perspective of \(I_2 J_0\), the position of the camera that captured \(I_2\) is assumed to be \((0,-400, 0)\), and the corresponding feature point of \(I_2\) is (10,0,20), but this value is not known until the joint angle is accurately calculated. However, in the input image \(I_2\), since the ray equation starting from the \(C_2\) camera origin to the feature point is the same as the \(\vec {ray}(s)\), this \(\vec {ray}(s)\) is used for the calculation of intersection.

$$\begin{aligned} { {}^{I_2 J_0} \vec {ray}(s) = {}^{W} \vec {ray}(s) = {}^{W} \left( \begin{array}{c} 0\\ -400\\ 0 \end{array} \right) + \left( {}^{W} \left( \begin{array}{c} 10\\ 0\\ 20 \end{array} \right) - {}^{W} \left( \begin{array}{c} 0\\ -400\\ 0 \end{array} \right) \right) s } \end{aligned}$$
(7)

If the ray intersection point is calculated as in (1), \(410-400t = 10s\), \(0 = 400s-400\), and \(20t = 20s\), for which the only solution is \(t = 1, s = 1\). With this calculation we can solve for the feature points, and since the internal parameters are known in addition to the camera position and orientation, we can project the feature points onto the CCD and check whether the calculated feature points match those of the image. If they do not match, the optimization package modifies the object motion variables and continues the optimization until a perfect match is produced. The number of feature points must be greater than the number of variables to solve for the correct answer.
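The arithmetic of this first-joint example can be checked numerically; the short script below uses the example's values and our own variable names (it is a check of the worked example, not of the optimization itself).

```python
import numpy as np

# Known answer for the first joint: a pure translation of (10, 0, 0), as in (5).
Tr1 = np.eye(4)
Tr1[:3, 3] = [10.0, 0.0, 0.0]

# ray(t) of scene I1 toward the feature point (0, 0, 20), eq. (4), moved by Tr1 as in (6).
o_t = (Tr1 @ np.array([400.0, 0.0, 0.0, 1.0]))[:3]
d_t = Tr1[:3, :3] @ np.array([-400.0, 0.0, 20.0])

# ray(s) of scene I2 from the camera at (0, -400, 0) toward (10, 0, 20), eq. (7).
o_s = np.array([0.0, -400.0, 0.0])
d_s = np.array([10.0, 0.0, 20.0]) - o_s

# Solve o_t + t*d_t = o_s + s*d_s; here the rays meet exactly.
A = np.column_stack([d_t, -d_s])
t, s = np.linalg.lstsq(A, o_s - o_t, rcond=None)[0]
print(t, s, o_t + t * d_t)   # ~ 1.0 1.0 [10.  0. 20.]
```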

6.6 Calculation of the second joint rotation

This subsection calculates the coordinate system of the \(J_2\) joint, with the calculation carried out in the \(J_1\) coordinate system. As in the previous calculation, the ray formula \(\vec {ray}(t)\) is used for convenience. In the W coordinate system, the origin of the \(C_1\) camera is (400,0,0) and the coordinates of the feature point are (0,0,25). As before, the 3D position of this feature point is unknown, but it lies on a ray with the same starting point and direction as the ray toward the feature point obtained when taking the \(I_1\) image.

$$\begin{aligned} { {}^{W} \vec {ray}(t) = {}^{I_1 J_0} \vec {ray}(t) = {}^{W} \left( \begin{array}{c} 400\\ 0\\ 0 \end{array} \right) + {}^{W} \left( \begin{array}{c} 0-400\\ 0\\ 25 \end{array} \right) t },~ [ {}^{I_1 J_0}_{I_1 J_1} Tr ] = I \end{aligned}$$
(8)

\({}^{I_1 J_1}\vec {ray}(t) = {}^{W} \vec {ray}(t)\). This \(\vec {ray}(t)\) is intersected with \(\vec {ray}(s)\) when the scene changes from \(I_1\) to \(I_2\). Applying the equation that calculates the rotation of the joint when changing from the \(I_1\) scene to the \(I_2\) scene, as in (9), yields the following result.

$$\begin{aligned}{}[ {}^{I_1 J_1}_{I_1 J_2} Tr ] = \left( { \begin{array}{cccc} 0 &{} 0 &{} -1 &{} 0\\ 0 &{} 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) ,~ [ {}^{I_2 J_1}_{I_2 J_2} Tr ] = \left( { \begin{array}{cccc} -1 &{} 0 &{} 0 &{} 0\\ 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} -1 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \end{aligned}$$
$$\begin{aligned}{}[ {}^{I_2 J_1}_{I_2 J_2} Tr ] [ {}^{I_1 J_1}_{I_1 J_2} Tr ]^{-1} = \left( { \begin{array}{cccc} 0 &{} 0 &{} -1 &{} 20\\ 0 &{} 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) = Tr_2 \end{aligned}$$
(9)

To calculate the scene change from the \(I_1\) scene to the \(I_2\) scene using (9), the ray formula of \(\vec {ray}(t)\) from the W perspective must first be converted to the \(J_1\) perspective.

$$\begin{aligned}{}[ Tr_2] {}^{I_1 J_1} \vec {ray}(t) = \left( { \begin{array}{cccc} 0 &{} 0 &{} -1 &{} 20\\ 0 &{} 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) {}^{I_1 J_1} \left( \begin{array}{c} 400-400t\\ 0\\ 25t\\ 1 \end{array} \right) = {}^{I_2 J_1} \left( { \begin{array}{c} -25t+20\\ 0\\ 420-400t\\ 1 \end{array} } \right) \end{aligned}$$
(10)

The result of changing scene \(I_1\) to scene \(I_2\) for the ray intersection calculation of (1) is the same as (10). Afterwards, we calculate the intersection between \(\vec {ray}(t)\) and \(\vec {ray}(s)\), the ray toward the feature point (5,0,20) of scene \(I_2\); the 3D feature point (5,0,20) becomes known as a result of the intersection calculation. In the input image \(I_2\), the ray starting from the \(C_2\) camera origin toward the feature point is the same as \(\vec {ray}(s)\), so this \(\vec {ray}(s)\) is used for the intersection calculation.

$$\begin{aligned} { {}^{W} \vec {ray}(s) = {}^{W} \left( \begin{array}{c} 0\\ -400\\ 0 \end{array} \right) + {}^{W} \left( \begin{array}{c} 5\\ 400\\ 20 \end{array} \right) s },~ [ {}^{I_2 J_0}_{I_2 J_1} Tr ] = \left( { \begin{array}{cccc} 0 &{} 0 &{} 1 &{} 10\\ 0 &{} 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \end{aligned}$$
(11)

Since the intersection calculation takes place from the \(J_1\) perspective, we transform \(\vec {ray}(s)\) to the \(J_1\) perspective.

$$\begin{aligned} { \hspace{-3 mm} [ {}^{I_2 J_0}_{I_2 J_1} Tr ] ^{-1} {}^{W} \vec {ray}(s) = \hspace{-1 mm} \left( { \begin{array}{cccc} 1 &{} 0 &{} 0 &{} -10\\ 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} 1 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \hspace{-1 mm} {}^{W} \hspace{-1 mm} \left( \begin{array}{c} 5s\\ 400s-400\\ 20s\\ 1 \end{array} \right) = {}^{I_2 J_1} \hspace{-1 mm} \left( \begin{array}{c} 5s-10\\ 400s-400\\ 20s\\ 1 \end{array} \right) } \end{aligned}$$
(12)

The ray intersection calculation from the \(J_1\) point of view gives \(-25t + 20 = 5s-10\), \(0 = 400s-400\), and \(420-400t = 20s\), whose solution is \(t = 1, s = 1\). With this calculation the feature points can be found, and since the internal parameters are known in addition to the camera position and orientation, the feature points can be projected onto the CCD to check whether the calculated feature points match those of the image. If they do not match, the optimization package modifies the object motion variables and continues the optimization until a perfect match is produced. The number of feature points must be greater than the number of variables to solve for the solution.

6.7 Calculation of the third joint rotation

In this subsection, the coordinate system of the \(J_3\) joint is calculated, with the calculation carried out in the \(J_2\) coordinate system. As in the previous calculation, the ray formula \(\vec {ray}(t)\) is used. In the W coordinate system, the origin of the \(C_1\) camera is (400,0,0) and the coordinates of the feature point are (0,0,100). As before, the 3D position of this feature point is unknown, but it lies on a ray with the same starting point and direction as the ray toward the feature point obtained when taking the \(I_1\) image.

$$\begin{aligned} { {}^{W} \vec {ray}(t) = {}^{I_1 J_0} \vec {ray}(t) = {}^{W} \left( \begin{array}{c} 400\\ 0\\ 0 \end{array} \right) + {}^{W} \left( \begin{array}{c} -400\\ 0\\ 100 \end{array} \right) t } \end{aligned}$$
(13)

According to the calculations above, the result is as follows. \([ {}^{I_1 J_0}_{I_1 J_1} Tr ] = I\)

$$\begin{aligned}{}[ {}^{I_1 J_1}_{I_1 J_2} Tr ] = \left( { \begin{array}{cccc} 0 &{} 0 &{}-1 &{} 0\\ 0 &{} 1 &{} 0 &{} 0\\ 1 &{} 0 &{} 0 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) , [ {}^{I_2 J_0}_{I_2 J_1} Tr ] = \left( { \begin{array}{cccc} 1&{}0&{}0&{}10\\ 0 &{}1&{}0&{}0\\ 0&{}0&{}1&{}0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) , [ {}^{I_2 J_1}_{I_2 J_2} Tr ] = \left( { \begin{array}{cccc} -1&{}0&{}0&{}0\\ 0 &{}1&{}0&{}0\\ 0&{}0&{}-1&{}20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \nonumber \end{aligned}$$

The result of changing the ray formula of \(\vec {ray}(t)\) from the W perspective to the \(J_2\) perspective is presented.

$$\begin{aligned} {}^{I_1 J_2} \vec {ray}(t) = [ {}^{I_1 J_0}_{I_1 J_2} Tr ]^{-1} {}^{W} \vec {ray}(t) = \left( { \begin{array}{cccc} 0 &{} 0 &{} 1 &{} -20\\ 0 &{} 1 &{} 0 &{} 0\\ -1 &{} 0 &{} 0 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \left( \begin{array}{c} 400-400t\\ 0\\ 100t\\ 1 \end{array} \right) \nonumber \\= {}^{I_1 J_2} \left( { \begin{array}{c} 100t-20\\ 0\\ -400+400t\\ 1 \end{array} } \right) \end{aligned}$$
(14)

Similarly, from the perspective of \(I_2 J_0\), the position of the camera that captured \(I_2\) is assumed to be \((0,-400, 0)\), and the corresponding feature point of \(I_2\) is (-20,0,70), but this value is not known until the joint angle is accurately calculated. However, in the input image \(I_2\), since the ray equation starting from the \(C_2\) camera origin to the feature point is the same as the \(\vec {ray}(s)\), this \(\vec {ray}(s)\) is used for the calculation of intersection. Then, we calculate the intersection of ray formulas of \(\vec {ray}(s)\) and \(\vec {ray}(t)\) in relation to the feature point (-20,0,70) of scene \(I_2\).

$$\begin{aligned} { \hspace{-3 mm} {}^{W} \vec {ray}(s) = {}^{W} \left( \begin{array}{c} 0\\ -400\\ 0 \end{array} \right) + {}^{W} \left( \begin{array}{c} -20\\ 400\\ 70 \end{array} \right) s },~ [ {}^{I_2 J_0}_{I_2 J_1} Tr ] [ {}^{I_2 J_1}_{I_2 J_2} Tr ] = \left( { \begin{array}{cccc} -1 &{} 0 &{} 0 &{} 10\\ 0 &{} 1 &{} 0 &{} 0\\ 0 &{} 0 &{} -1 &{} 20\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \end{aligned}$$
(15)

Since this ray is expressed with respect to W, the result of changing it to the \(J_2\) point of view is as follows:

$$\begin{aligned} { {}^{I_2 J_2} \vec {ray}(s) = [ {}^{I_2 J_0}_{I_2 J_2} Tr ]^{-1} {}^{W} \left( \begin{array}{c} -20s\\ -400+400s\\ 70s\\ 1 \end{array} \right) = {}^{I_2 J_2} \left( \begin{array}{c} 20s+10\\ -400+400s\\ -70s+20\\ 1 \end{array} \right) } \end{aligned}$$
(16)

Since the joint rotates when the \(\vec {ray}(t)\) formula in the \(J_2\) perspective is changed from the \(I_1\) scene to the \(I_2\) scene, this motion is set as a variable and found through the optimization process. However, for convenience, we substitute the correct answer of the optimization to verify (1).

$$\begin{aligned} \hspace{-3 mm} [ {}^{I_2 J_2}_{I_2 J_3} Tr ] [ {}^{I_1 J_2}_{I_1 J_3} Tr ]^{-1} \hspace{-1 mm} = \hspace{-1 mm} \left( { \begin{array}{cccc} 0 &{} -1 &{} 0 &{} 30\\ 0 &{} 0 &{} -1 &{} 0\\ 1 &{} 0 &{} 0 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \hspace{-2 mm} \left( { \begin{array}{cccc} -1 &{} 0 &{} 0 &{} 30\\ 0 &{} 0 &{} -1 &{} 0\\ 0 &{} -1 &{} 0 &{} 0\\ 0 &{}0 &{}0&{}1 \end{array} } \right) ^{-1} \hspace{-5 mm} = \hspace{-1 mm} \left( { \begin{array}{cccc} 0 &{} 0 &{} 1 &{} 30\\ 0 &{} 1 &{} 0 &{} 0\\ -1 &{} 0 &{} 0 &{} 30\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \hspace{-1 mm} = \hspace{-1 mm} Tr_3 \end{aligned}$$
(17)

The result of applying motion to the ray formula of \(\vec {ray}(t)\) in the perspective of \(J_2\) is shown by the following equation.

$$\begin{aligned} { Tr_3 {}^{I_1 J_2} \vec {ray}(t) = } \left( { \begin{array}{cccc} 0 &{} 0 &{} 1 &{} 30\\ 0 &{} 1 &{} 0 &{} 0\\ -1 &{} 0 &{} 0 &{} 30\\ 0 &{}0 &{}0&{}1 \end{array} } \right) \left( \begin{array}{c} 100t-20\\ 0\\ -400+400t\\ 1 \end{array} \right) = \left( \begin{array}{c} -370+400t\\ 0\\ -100t+50\\ 1 \end{array} \right) \end{aligned}$$
(18)

From the \(J_2\) point of view, we calculate the intersection of \(\vec {ray}(t)\) and \(\vec {ray}(s)\); this step is included in the objective function of the optimization. The ray intersection calculation gives \(-370+400t=20s+10\), \(0 = 400s-400\), and \(-100t+50=-70s+20\), whose unique solution is \(t = 1, s = 1\). With this calculation the feature points can be found, and since the internal parameters are known in addition to the camera position and orientation, the feature points can be projected onto the CCD to check whether the calculated feature points match those of the image. If they do not match, the optimization package modifies the object motion variables and continues the optimization until a perfect match is produced. The number of feature points must be greater than the number of variables to solve for the correct answer.
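As with the first joint, the third-joint arithmetic can be checked by composing the example's matrices directly; the script below (our own variable names) reproduces \(Tr_3\) of (17) and confirms that \(t = s = 1\) satisfies the intersection equations.

```python
import numpy as np

def rigid_inverse(Tr):
    """Closed-form inverse (R^T, -R^T t) of a rigid 4x4 transform."""
    inv = np.eye(4)
    inv[:3, :3] = Tr[:3, :3].T
    inv[:3, 3] = -Tr[:3, :3].T @ Tr[:3, 3]
    return inv

# [I2J2 -> I2J3] and [I1J2 -> I1J3] from (3); their product gives Tr3 of (17).
Tr_I2 = np.array([[0, -1, 0, 30], [0, 0, -1, 0], [1, 0, 0, 0], [0, 0, 0, 1]], float)
Tr_I1 = np.array([[-1, 0, 0, 30], [0, 0, -1, 0], [0, -1, 0, 0], [0, 0, 0, 1]], float)
Tr3 = Tr_I2 @ rigid_inverse(Tr_I1)
print(Tr3)                       # matches (17)

# Transformed ray(t) of (18) equals ray(s) of (16) at t = s = 1.
t = s = 1.0
lhs = np.array([400*t - 370, 0.0, -100*t + 50])
rhs = np.array([20*s + 10, 400*s - 400, -70*s + 20])
print(np.allclose(lhs, rhs))     # -> True
```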

6.8 Comparison with model-based methods

In the example above, the model body is a figure composed of line segments. In a model-based method, the 3D model is projected onto 2D space and compared with the input image. Such a method takes time to compute the silhouette of the 3D model and to match the computed silhouette against the blob in the image. Suppose we implemented a model-based method on this example and that the correct motion had already been found through optimization; the outline silhouette of the stick figure would then be a set of 2D line segments, obtained by projecting the 3D line segments onto the CCD. The objective function would include the degree of match between the stick-figure silhouette projected in 2D and the input image blob containing the line segments. We wish to find the best match; however, if the model silhouette is concave, local optima exist in the match calculation with the blob. For example, when trying to match the 2D silhouettes of the open fingers of the left hand and the open fingers of the right hand, the blob-matching calculation may fail to find a solution because of multiple local optima.

6.9 Why use parameter optimization for kinematics?

We primarily use kinematics to model articulated bodies, and the kinematic equations are nonlinear [21]. In general, to calculate the motion of an entire articulated body, one transformation matrix per joint must be used as a variable. An articulated body such as a person has many more degrees of freedom (DOF) than 6, whereas a rigid body in three-dimensional space has 6 DOF; this means that at most six equations can be generated from one target position and orientation. If the articulated body had exactly 6 DOF, the equations could be solved directly; since the number of joints in the whole body is much larger, however, the joint angles are usually found through an optimization process. Since multiple solutions exist, the desired answer is obtained by setting the objective function properly. Because joints are usually composed of kinematic chains, many optimization variables are required.

For the articulated body in this example, kinematic modeling is performed as in (19). The \([{}^{J_0}_{J_1} Tr ]\) transformation matrix requires 3 variables to model the translation, and the \([{}^{J_1}_{J_2} Tr ]\) and \([{}^{J_2}_{J_3} Tr ]\) transformation matrices each require 4 variables to model 3D rotation with quaternions, so at least 11 variables are required for parameter optimization. We assume that p is a feature point or orientation vector expressed in the \(J_3\) coordinate system.

$$\begin{aligned} {}^{J_0} p = [{}^{J_0}_{J_1} Tr ] [{}^{J_1}_{J_2} Tr ] [{}^{J_2}_{J_3} Tr ] {}^{J_3} p \end{aligned}$$
(19)
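A sketch of the kinematic model of (19) with the 11 optimization variables described above: 3 for the translation of \(J_1\) and 4 quaternion components for each of the two rotary joints. The names, the quaternion convention, and the choice of placing the fixed link offsets along the z-axis of the rest pose are ours.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def joint_chain(x, offsets=((0, 0, 20), (0, 0, 30))):
    """Return [J0->J1 Tr], [J1->J2 Tr], [J2->J3 Tr] from an 11-vector x.

    x[0:3]  - translation of the sliding joint J1
    x[3:7]  - quaternion (x, y, z, w) of the first rotary joint J2
    x[7:11] - quaternion (x, y, z, w) of the second rotary joint J3
    offsets - fixed link lengths of the chain (20 and 30 in the example)
    """
    T1 = np.eye(4); T1[:3, 3] = x[0:3]
    T2 = np.eye(4); T2[:3, :3] = Rotation.from_quat(x[3:7]).as_matrix(); T2[:3, 3] = offsets[0]
    T3 = np.eye(4); T3[:3, :3] = Rotation.from_quat(x[7:11]).as_matrix(); T3[:3, 3] = offsets[1]
    return T1, T2, T3

def to_J0(x, p_J3):
    """Eq. (19): map a point given in the J3 frame into the J0 frame."""
    T1, T2, T3 = joint_chain(x)
    return (T1 @ T2 @ T3 @ np.append(p_J3, 1.0))[:3]

# Rest pose: a point at the J3 origin sits 50 units up the chain from J0.
x_rest = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1], dtype=float)
print(to_J0(x_rest, [0, 0, 0]))   # -> [ 0.  0. 50.]
```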

In this paper, only one joint movement is calculated at a time. In the optimization process, the calculation time grows very quickly as the number of variables increases, and determining the movement of one joint at a time is a much easier problem than calculating the movement of the whole articulated body at once. Nonlinear kinematic equations arise even when modeling a single two-dimensional object segment [21], so an objective function containing the kinematic equations of an articulated model is nonlinear. If the joint movement between frames is not large, the previous shape can be used as the initial value for the optimization that computes the new shape; since that movement is small, the initial value is close to the correct answer, and the problem of local optima can be bypassed in this way.

Fig. 3 Input images and corresponding three-dimensional volume reconstructions of a manikin

Fig. 4 Input images and corresponding three-dimensional volume reconstructions of a woman

7 3D Reconstruction of joints

7.1 Three-dimensional reconstruction of fixed parts

The process used in this paper is called "shape-from-silhouette" [15]. A background-removed input image for each joint, together with feature-point coordinates and their correspondence information, is fed into the three-dimensional joint reconstruction task. The user specifies the fineness of the volume, that is, the size of the voxels. A total of four input images were used in every reconstruction in the experiments.
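
For readers unfamiliar with shape-from-silhouette, the following C sketch shows the basic voxel-carving idea under simplifying assumptions (a known 3x4 projection matrix per view and a binary silhouette image). The constants and function names are illustrative and not taken from the paper's implementation.

#include <math.h>

#define N_VIEWS 4                 /* four input images, as in the experiments  */
#define GRID    64                /* user-chosen voxel fineness (illustrative) */
#define IMG_W   640
#define IMG_H   480

typedef struct { double P[3][4]; } Camera;   /* 3x4 projection matrix per view */

/* Carve a voxel whenever its projected centre falls outside any silhouette. */
void carve(unsigned char voxels[GRID][GRID][GRID],
           const Camera cams[N_VIEWS],
           unsigned char sil[N_VIEWS][IMG_H][IMG_W],
           const double origin[3], double voxel_size)
{
    for (int i = 0; i < GRID; ++i)
      for (int j = 0; j < GRID; ++j)
        for (int k = 0; k < GRID; ++k) {
            double X = origin[0] + (i + 0.5) * voxel_size;
            double Y = origin[1] + (j + 0.5) * voxel_size;
            double Z = origin[2] + (k + 0.5) * voxel_size;
            voxels[i][j][k] = 1;
            for (int v = 0; v < N_VIEWS; ++v) {
                const double (*P)[4] = cams[v].P;
                double u = P[0][0]*X + P[0][1]*Y + P[0][2]*Z + P[0][3];
                double w = P[1][0]*X + P[1][1]*Y + P[1][2]*Z + P[1][3];
                double s = P[2][0]*X + P[2][1]*Y + P[2][2]*Z + P[2][3];
                if (s <= 0) { voxels[i][j][k] = 0; break; }
                int px = (int)lround(u / s), py = (int)lround(w / s);
                if (px < 0 || px >= IMG_W || py < 0 || py >= IMG_H ||
                    !sil[v][py][px]) { voxels[i][j][k] = 0; break; }
            }
        }
}

The surviving voxels form the visual hull of the photographed part; a finer grid gives a smoother hull at the cost of computation time.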

7.2 Three-dimensional reconstruction of moving part

For 3D reconstruction, the reference scene must be chosen first. The reference scene is a scene in which there is no translation or rotation of the joints. The basic concept of three-dimensional reconstruction of joints is to reconstruct each joint separately. Figure 3(a) shows input images of a manikin whose joint includes translation; the left arm of the manikin is rotated relative to the other input images. Figure 4(a) shows input images that include a rotating joint; the head of the woman is rotated relative to the other input images. After calculating the internal parameters of the camera, the center of rotation and the angle of rotation are calculated in the next step. For the three-dimensional reconstruction of a rotating part of the body, each joint must be reconstructed independently. Any three-dimensional reconstruction algorithm can be used to reconstruct the fixed part. For the three-dimensional reconstruction of the moving portion, the center of rotation, the rotational angle, and the translation of the joint must be calculated, and the position and orientation of the original camera must be corrected. A reference scene is required to calculate the translational values and rotational angles. For example, frame number i is set as the reference scene, and the translational value, rotational angle, and center of rotation are calculated relative to i; then the three-dimensional reconstruction of the articulated body is performed. When a joint is photographed with a camera, the spatial relationship between the joint and the camera becomes fixed at that instant. Since the camera of scene j can be regarded as translated and rotated together with the joint relative to the reference scene i, the camera can be translated and rotated inversely to match the reference scene i, so that the camera position corresponding to the reference scene is obtained.

All 3D reconstructions start from a reference coordinate system and a reference scene. When we reconstruct the \((l+1)\)-th joint, the calculation for correcting the position and orientation of camera j with respect to the reference scene i is as follows. Using the calculated \(I_i J_l\) and \(I_j J_l\) coordinate systems, the rotational angle and translational values of the joint \(J_{l+1}\) are calculated using the feature points \(M_k\) of the joint \(J_{l+1}\). The following equation rotates and translates the camera used for photographing the \(J_{l+1}\) joint backward along with the joint motion:

$$\begin{aligned} [{}^{W}_{C'_j}Tr] = [{}^{W}_{I_i J_l}Tr ]\, ([{}^{I_j J_l}_{I_j J_{l+1}}Tr][{}^{I_i J_l}_{I_i J_{l+1}}Tr]^{-1})^{-1}\, [{}_{W}^{I_j J_l}Tr ]\, [ {}^{W}_{C_j}Tr ] \end{aligned}$$
(20)

\([ {}^{W}_{C_j}Tr ]\) is the transformation matrix that describes the \(C_j\) coordinate system in the reference coordinate system W. \([{}^{W}_{C'_j}Tr]\) is the corrected camera transformation matrix, created by multiplying the inverse of the joint motion's transformation matrix by the original camera matrix. \(([{}^{I_j J_l}_{I_j J_{l+1}}Tr][{}^{I_i J_l}_{I_i J_{l+1}}Tr]^{-1})^{-1}\) is the transformation matrix that rotates and moves the viewpoint of the joint shown in image j backward, making it motionless with respect to image i. The relationship between the two coordinate systems became fixed while the \(C_j\) coordinate system was photographing the joint in which the \(I_j J_{l}\) coordinate system is established. To match the reference scene i, the viewpoint must first be brought back to the \(I_j J_{l}\) perspective. Reconstruction takes place from the perspective of W, so the result is finally converted to the perspective of W. For 3D joint reconstruction, the local coordinate system was calculated using the feature points of the rotating part in calculation step (C). For the 3D reconstruction of the rotating part, the camera's position and orientation were corrected as if there were no rotation, using (20). A three-dimensional articulated-body reconstruction with textures was then performed using the corrected external parameters of the camera.
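
A minimal C sketch of how (20) can be evaluated with 4x4 homogeneous transforms is given below; the rigid-inverse helper and the argument names are assumptions for illustration, not the paper's code. The Mat4 type and mat4_mul mirror the helpers in the sketch after (19).

typedef double Mat4[4][4];        /* same helpers as in the sketch after (19) */

static void mat4_mul(Mat4 A, Mat4 B, Mat4 C)
{
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j) {
            double s = 0;
            for (int k = 0; k < 4; ++k) s += A[i][k] * B[k][j];
            C[i][j] = s;
        }
}

/* Inverse of a rigid transform [R | t]: [R^T | -R^T t]. */
static void mat4_rigid_inv(Mat4 T, Mat4 Ti)
{
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) Ti[i][j] = T[j][i];
        Ti[i][3] = -(T[0][i]*T[0][3] + T[1][i]*T[1][3] + T[2][i]*T[2][3]);
    }
    Ti[3][0] = Ti[3][1] = Ti[3][2] = 0; Ti[3][3] = 1;
}

/* (20): correct camera j so that the joint appears motionless w.r.t. scene i. */
void correct_camera(Mat4 W_IiJl,          /* joint l seen in W, scene i        */
                    Mat4 W_IjJl,          /* joint l seen in W, scene j        */
                    Mat4 IjJl_IjJl1,      /* joint l+1 in joint l, scene j     */
                    Mat4 IiJl_IiJl1,      /* joint l+1 in joint l, scene i     */
                    Mat4 W_Cj,            /* camera of scene j in W            */
                    Mat4 W_Cj_corr)       /* output: corrected camera in W     */
{
    Mat4 inv_i, motion, motion_inv, IjJl_W, tmp1, tmp2;
    mat4_rigid_inv(IiJl_IiJl1, inv_i);
    mat4_mul(IjJl_IjJl1, inv_i, motion);  /* joint motion between scenes i, j  */
    mat4_rigid_inv(motion, motion_inv);   /* undo that joint motion            */
    mat4_rigid_inv(W_IjJl, IjJl_W);
    mat4_mul(W_IiJl, motion_inv, tmp1);
    mat4_mul(tmp1, IjJl_W, tmp2);
    mat4_mul(tmp2, W_Cj, W_Cj_corr);
}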

8 Experimental results

This paper considers a moving camera that does not provide external and internal parameter values such as the distortion of the image. All inputs are images of a body with 3D joints that can translate or rotate; these joints have different rotational angles and different translational distances in each input image. It is assumed that we know the joint structure and the background and foreground parts. The method presented in this paper is based on the results of reconstruction of the existing joint structure, and the formula is developed to enable three-dimensional reconstruction starting from the joints and including the body surface. In general, the smaller the number of variables, the easier the optimization. Figures 3(a) through 7(a) show the four actual input images used to scan each fixed background object. The top-left image of Figs. 3(a) through 7(a) was set as the reference scene of the 3D reconstruction. Figures 3(c) through 7(c) show the input silhouettes with the background removed for the fixed part. Figures 3(d) through 7(d) show the background-removed input silhouettes for the first moving part, and Figs. 3(e) through 7(e) show those for the second rotating part. From Fig. 3(b) to Fig. 7(b), we created various three-dimensional joint reconstructions from these silhouettes to show that the formulas presented in this paper are accurate. All reconstructions use the best obtainable data. A correct joint reconstruction is rarely obtained even when the errors in the joint angles are only small. The proposed method first computes an initial value and then performs the optimization process.

In this experiment, pictures of four configurations were taken, each with three different displacements and joint angles. To show the validity of the formula for joint motion calculation described in this paper, we also took three additional pictures of each configuration. A total of four images were used to create a three-dimensional reconstruction for each configuration. Comparing two different three-dimensional reconstructions of the same scene allows an exact calculation of the joint movement.

Fig. 5 Input images and corresponding three-dimensional volume reconstructions of a bike

Fig. 6 Input images and corresponding three-dimensional volume reconstructions of a manikin

Fig. 7 Input images and corresponding three-dimensional volume reconstructions of a toy excavator

Figure 3 shows a two-dimensional translation and a one-dimensional rotation of the bottom of the manikin. Since the arm also rotates, information on the coordinate system is required to calculate the arm motion. For this purpose, we modeled a six-degree-of-freedom rotation and translation to assign one coordinate system to this joint in every scene. As a result of this calculation, the coordinate system of each scene is also known. We use a quaternion to model the rotation. In the case of n scenes, 7n variables are needed for the translation and rotation corresponding to the first joint, together with n quaternion constraints. The left shoulder joint of the manikin also rotates. Since the rotation center of the shoulder joint must be calculated, the variables used are parameters modeling the relative rotation and the center of rotation between pairs of scenes. A coordinate system was assigned to the bottom part of the manikin in each scene, and the rotation of the shoulder joint in each scene was calculated using that coordinate system. The number of variables used for the arm motion calculation over n scenes is three for the center of rotation and \(4(n-1)\) for the relative rotation, a total of \(4n-1\). We also need \(n-1\) quaternion constraints.
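
For concreteness, with the four input images used in the experiments (n = 4), these counts work out as follows:

$$\begin{aligned} \text{base joint: } 7n = 7 \times 4 = 28 \text{ variables},\ 4 \text{ constraints};\qquad \text{shoulder joint: } 3 + 4(4-1) = 15 \text{ variables},\ 3 \text{ constraints.} \end{aligned}$$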

Figure 4 shows that the woman's body is fixed and her neck is rotated. The actual human neck has more than three degrees of freedom, but its rotation is modeled here as a three-degree-of-freedom joint. Since the center of rotation must also be calculated, the number of variables in the optimization process is three for the center of rotation and \(4(n-1)\) for the relative rotation, totaling \(4n-1\). We also need \(n-1\) quaternion constraints.

Figure 5 shows a can with a rotating lid on which a bicycle is fixed. Since its center of rotation must also be calculated, the parameters of the optimization process are 3 for the center of rotation and \(4(n-1)\) for the relative rotation, again totaling \(4n-1\). We also need \(n-1\) quaternion constraints. The lid actually belongs to the rotating part, but when reconstructing the fixed part of the can, the lid was deliberately included to confirm that its texture comes out wrong there; this shows that the rotation angle of the lid is indeed being calculated.

The bottom of the manikin in Fig. 6 is fixed, while the joints in the left shoulder and the left elbow rotate. The rotation of the left elbow joint must be calculated in the coordinate system obtained from the shoulder-joint calculation. Therefore, 3 variables for the center of rotation and 4n rotation variables were used to calculate the movement of the shoulder joint, together with n quaternion constraints. As a result of this calculation, the coordinate system of each scene is also known. The number of variables for the elbow joint is 3 for the center of rotation and \(4(n-1)\) for the relative rotation, \(4n-1\) in total. We also need \(n-1\) quaternion constraints.

The excavator shown in Fig. 7 consists of a body with one-degree-of-freedom rotation and an arm with one-degree-of-freedom rotation. Except for the arm and the base, the object neither moves nor rotates. The rotation of the arm must be calculated in the coordinate system of the body part. Three variables for the center of rotation and 4n rotation variables were used to calculate the body motion, together with n quaternion constraints. To calculate the arm motion, 3 variables are used for the center of rotation and \(4(n-1)\) for the relative rotation, together with \(n-1\) quaternion constraints.

In this paper, we used background-removed images and the correspondence relations between feature points as input. Background removal and feature extraction algorithms are not the focus of this research. Although more input pictures would produce more accurate 3D articulated reconstructions, the experiments in this study always used only four input pictures for convenience.

Table 1 Number of feature points used in camera parameter computation and joint motion calculation

Table 1 shows the number of input feature points used for the various textured 3D joint reconstructions: the number of feature points used for each fixed part, as well as for the rotational and translational parts of each input image. The correspondence between feature points is provided either by a separate algorithm or by the user. The three-dimensional positions of feature points on fixed or background objects can be calculated by applying any three-dimensional reconstruction method. The origin of the reference coordinate system of the 3D reconstruction, which appears in the input scenes, is fixed on the wall.

Using the method presented in this paper, it is not difficult to segment the moving part, since the translational value or the center of rotation can be found. In the case of rotational motion, it is only necessary to project the calculated three-dimensional center of rotation onto each image and to cut out the silhouette of the rotating part. Using the feature points, we calculated not only the center of rotation but also the rotation angle. The calculated motion values are then used for the 3D reconstruction of the joints, shown in Fig. 3(b) to Fig. 7(b). Sufficient input feature points on the rotating body are needed for the joint angle calculations; as shown in Table 1, up to 26 feature points are used in Fig. 3(d) and up to 20 feature points in Fig. 3(e).

There are local optima in (2) during the calculation of the joint angles and the center of rotation. The existence of local optima is demonstrated in [21]. To overcome them, we use feature points near the center of rotation of the joint as the initial values of the optimization parameters for calculating the center of rotation. If the initial values are close to the actual answer, the corresponding joint angle and center of rotation can be found. In Fig. 3, the manikin was photographed from various directions under static conditions in every input scene, so a 3D reconstruction is possible for each scene, paving the way for the calculation of the correct angle and center of rotation. Three-dimensional positions of the feature points on the model's head were calculated using the three-dimensional reconstruction of the still scene at a selected angle of rotation. Using these 3D positions, an accurate value of the angle of rotation can be computed. For more accurate computations, more feature points are required.

Table 2 Computed joint angles in quaternion and translational values using our approach versus corresponding values found using separate 3D reconstruction with Figs. 3(d) to 5(d) as input
Table 3 Computed joint angles in quaternion and translational values using our approach versus corresponding values found using separate 3D reconstruction with Figs. 6(d) to 7(e) as input

Tables 2 and 3 compare the results calculated with (2) against the correct values obtained from separate three-dimensional reconstructions, with Figs. 3(d) to 7(e) as inputs. If the x, y, and z values are identical across the corresponding scene changes in Tables 2 and 3, the joint is a rotating joint that shares a common center of rotation; translating joints have different x, y, and z values. The difference in the angles of rotation is expressed using quaternions: the dot product between two quaternions is related to the difference in rotation angle between them, and a dot product of 1.0 implies no difference in rotation. Tables 2 and 3 also list \(length_1\) and \(length_2\) values. \(length_1\) compares the translational distance traveled; it is an absolute length difference regardless of coordinates. \(length_2\) compares Euclidean distances assuming the same coordinate system; the coordinate system may differ if the coordinates are other than W coordinates. The results show that the calculations of the angle of rotation are fairly accurate, but those associated with the translation are consistently prone to errors. Good initial values are needed to obtain satisfactory results. As can be seen in the results for Fig. 5(d) at the end of Table 2, large errors occur in the calculation of the center of rotation. This rotation involves only one degree of freedom, and the parameter optimization moved the center of rotation far along the axis of rotation in search of a better value. However, this value does not significantly affect the rotation angle, and it can be controlled by constraining the variable in the parameter optimization.
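
The quaternion comparison used in Tables 2 and 3 can be summarized by the following C sketch (the function name is illustrative, and taking the absolute value of the dot product is a standard convention for the sign ambiguity of quaternions): the dot product of two unit quaternions maps to the residual rotation angle between them, with a value of 1.0 meaning no difference.

#include <math.h>

/* Residual rotation angle (radians) between two unit quaternions (w,x,y,z). */
double quat_angle_diff(const double q1[4], const double q2[4])
{
    double d = q1[0]*q2[0] + q1[1]*q2[1] + q1[2]*q2[2] + q1[3]*q2[3];
    if (d < 0.0) d = -d;          /* q and -q encode the same rotation */
    if (d > 1.0) d = 1.0;         /* guard against rounding error      */
    return 2.0 * acos(d);         /* 0 when the dot product is 1.0     */
}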

The joint angles were calculated by adding one more scene, Fig. 8(a), to the calculated joint values of Fig. 4. A 3D joint reconstruction from the five scenes was then performed, as shown in Fig. 8(b). The GRG2 optimization precision was set to \(10^{-7}\). Four quaternion variables and one quaternion constraint were used to calculate the neck angle, and the joint angle was obtained after 377 function calls including the gradient calculations. The result can be seen in the three-dimensional reconstruction of Fig. 8(b).

It is very difficult to compare the method presented in this paper with other methods: there are no papers on joint movement calculation for scenes with moving joints and a moving camera at varying focal lengths. The method of Lee and Chen [13] is the most similar to ours, in that it uses six feature points on the head whose 3D positions are known and tracks the human motion. However, it is limited by the need to know the coordinate-system information of those feature points, and because the feature points are concentrated on a human head and thus have a small distribution in each image scene, serious errors occur when calculating the camera information. Since the method presented in this paper uses fixed environment objects, the feature points are generally widely distributed, and the camera position and orientation calculations are accurate.

All the code was written in the C programming language without using any library other than GRG2 [11], and the input images of this paper were captured with a single IXY810 portable digital camera.

Fig. 8 Computing one input image using computed results

9 Discussion

The core of 3D joint reconstruction lies in the accurate calculation of the joint rotation angle. The distance information between the camera and the feature points is used in calculating the angle of rotation in (2). This work transforms the problem of rotational coordinates into a problem of Cartesian coordinates: since most optimization methods have been developed for Cartesian coordinates, solving rotational coordinates directly with an optimization method produces large errors.

Since the center of rotation lies on the joint, the position of the joint can be roughly defined as the center of rotation, and this information can be used as the initial value of the center of rotation. For adjacent scenes, the initial values of the variables are set to 0, so that the relative motion becomes the variable. In particular, if the camera is moving and the object is stationary, the three-dimensional surface information of the object can be found, which facilitates the calculation of joint angles. Like many other studies, this method suffers from errors caused by flawed feature-point input, which hinders the accurate calculation of joint angles. When there are several joints in a chain, the error propagates to the next joint calculation.

Because our calculation depends on optimization, it is very important to reduce the number of variables used in the optimization: if the number of variables increases, the computational complexity and computation time increase significantly. To minimize the number of variables used for parameter optimization, we introduced an extended quaternion that is capable of translation.
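
As a rough illustration only (the extended quaternion used in this paper is defined earlier in the text and may differ from this sketch), one compact rigid-motion parameterization is a unit quaternion paired with a translation vector, i.e. seven parameters with one unit-norm constraint; the struct and function names below are hypothetical.

/* Hypothetical compact rigid-motion parameterization: unit quaternion (w,x,y,z)
 * plus translation (tx,ty,tz): 7 parameters and 1 unit-norm constraint.      */
typedef struct { double w, x, y, z, tx, ty, tz; } RigidQT;

/* Apply the motion to a 3D point: p' = R(q) p + t. */
void rigidqt_apply(const RigidQT *m, const double p[3], double out[3])
{
    double w = m->w, x = m->x, y = m->y, z = m->z;
    double R[3][3] = {
        { 1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)     },
        { 2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)     },
        { 2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y) }
    };
    for (int i = 0; i < 3; ++i)
        out[i] = R[i][0]*p[0] + R[i][1]*p[1] + R[i][2]*p[2];
    out[0] += m->tx; out[1] += m->ty; out[2] += m->tz;
}

/* Unit-norm constraint handed to the optimizer alongside the 7 parameters. */
double quat_norm_constraint(const RigidQT *m)
{
    return m->w*m->w + m->x*m->x + m->y*m->y + m->z*m->z - 1.0;
}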

According to our calculations, the center of rotation of the woman's head in Fig. 4 is behind her mouth. A human head has more than three degrees of freedom, although it is modeled here with three rotational degrees of freedom. More detailed modeling would require correspondingly more degrees of freedom; however, there is a trade-off, as more degrees of freedom make parameter optimization harder. For this reason, we used only three rotational degrees of freedom in this paper, plus three additional degrees of freedom to calculate the translation or the center of rotation, giving six in total. The input image of Fig. 6(a) portrays a scene in which the lower arm and the upper arm of the body are partially occluded; such occlusion of parts of the object itself poses another problem for the three-dimensional reconstruction of an articulated body. In Fig. 4(d), the silhouette was extended to make sure that the woman's neck was not removed. The experimental results show that the formula for calculating the three-dimensional joint motion, including the texture of the joints, works properly.

10 Conclusion

The problem to be solved is the computation of 3D joint motion values for an articulated rigid object. Specifically, while the camera translates, rotates, and changes focal length, we need to find the 3D motion values of a tree-structured body with translating and rotating three-dimensional joints. In this paper, we use this result for three-dimensional joint reconstruction.

Many papers have tried to calculate correct joint angles; their continued efforts show that this is an important topic open to diverse applications. Overall, joint angle computation from 2D images has the potential to significantly enhance visual effects and animation in movies, creating more immersive and engaging experiences for viewers. It may also help patients with mobility impairments and support the evaluation of an athlete's technique, the identification of areas for improvement, and the assessment of an athlete's biomechanics. Currently, learning-based methods are trained mostly on data of human joints, and the information on various non-human objects is incomplete. Until now, there has been no way to solve articulated-body motion by using coordinate transformations; this approach recovers the lost information using coordinate-system transformations. The formulas presented in this paper are general and can be applied to all kinds of joints. Since an articulated body may be constantly moving, it is very difficult to reconstruct it in three dimensions without calculating the movement of the joints. As shown by the various reconstructions presented, the formulas in this paper are accurate, as supported by the correct reconstructions. When a correct reconstruction is not obtained, there are two possible reasons: an error in the input feature points, or a poor initial value.

A serious constraint on the formula presented in this paper is the presence of local optima, whose existence is proven in [21]. For this reason, it is difficult for the optimization to converge to the optimum value unless the initial value is properly chosen. The initial value can be obtained from the accurate calculation of adjacent scenes: for consecutive input scenes with small changes in time, the rotational angles of the joints are small. For input images with small angles of rotation, calculating the rotational angle by parameter optimization is much faster and easier, because the correct joint value and the initial joint value are similar.

This technique succeeds in reconstructing three-dimensional joints by exploiting the fixed relationship between the joints and the camera at the moment of photography. However, because our calculations use parameter optimization and three-dimensional reconstruction, they cannot be performed in real time. So far, there has been no three-dimensional reconstruction of joints that includes textures; this work therefore opens a new possibility for three-dimensional reconstruction of articulated bodies with texture. The value of this paper lies in the procedure for calculating accurate joint angles of a complex, tree-structured articulated body using only one moving camera.