1 Introduction

A Flexible Manufacturing System (FMS) is a modern computer-controlled production system with automated material handling and semi-independent workstations, responsible for efficiently manufacturing parts in low to medium volumes [8]. In the manufacturing industry, incorporating computer-controlled machines and robots into the production process offers a multitude of advantages, including improved utilization rates and increased productivity [22]. The fast-growing trend of using robots in industry is evident from the recent World Robotics report [19], whose statistics are plotted in Fig. 1.

Fig. 1 World robotics report 2022 by Int. Federation of Robotics [19]

In-plant logistics by throwing and catching robots is potentially the fastest means of transportation because it creates a direct connection between two workstations. Implementing this throw-catch transportation approach can result in fully automated processes within flexible manufacturing systems. Besides fast transportation, it offers optimized utilization of plant facilities, routing flexibility, and an increase in production volume, all of which reduce production cost in an FMS [11]. The proposed research aims to investigate this efficient approach.

The trajectory of a mechanically thrown object can be observed by cameras. However, the number and placement of these cameras affect the accuracy of the observations. To reduce observational errors, multiple cameras should be strategically positioned in the internal logistics setting to provide a comprehensive view of the trajectory. Such optimum positions need to be determined through rigorous experimentation within the typical internal logistics vicinity of 3–5 m [10, 31].

The proposed work involves the development of a 3D simulated environment that makes it possible to mechanically throw an object with specified properties and launching conditions. It also allows observation of the resulting trajectory through a standard pinhole camera model, as such a model can effectively describe the perspective projection of most modern cameras [18]. The virtual camera can be positioned anywhere within the 3D simulated environment representing the internal logistics of the manufacturing plant. The developed simulation is versatile and can be used for any real-world camera by providing its specific characteristics, such as capturing speed, sensor size, per-pixel area, and focal length. Figure 2 depicts the whole procedure of projecting 3D simulation world points onto the 2D digital image of the virtual camera.

Fig. 2 Process of 2D digital image formation for the virtual camera from the simulated 3D world scene

The 3D simulated world points are first converted into 3D camera-seen points by applying the extrinsic parameters (specified location and orientation) of the virtual camera. Then, the standard pinhole camera model is used to convert the 3D camera-seen coordinates into 2D image plane coordinates, and finally the specified intrinsic parameters of the camera (e.g. sensor size, per-pixel area) are applied to form the 2D digital image in terms of pixels. The standard pinhole camera model, used to convert 3D camera-seen coordinates into 2D image plane coordinates, is shown in Fig. 3.

Fig. 3 Used pinhole camera model for 2D image plane mapping of camera seen 3D points

Here \(P_c(X,Y,Z)\) is the camera-seen 3D point that is projected onto the camera’s image plane as the 2D point \(P'(x,y)\). The two coordinates of \(P'\) are derived by the following equations.

$$x = -f \times \frac{X}{Y} \tag{1.1}$$
$$y = -f \times \frac{Z}{Y} \tag{1.2}$$
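To make the projection pipeline of Fig. 2 concrete, the following Python sketch projects a simulated 3D world point onto the virtual camera's pixel grid using the extrinsic and intrinsic parameters described above. It is a minimal illustration, assuming the same axis convention as Eqs. 1.1 and 1.2 (Y acting as the camera's viewing depth) and a principal point at the sensor centre; the function and parameter names, as well as the numeric values, are illustrative and not taken from the authors' simulation.

```python
import numpy as np

def project_point(p_world, R, t, f_mm, pixel_size_mm, sensor_px):
    """Project a 3D world point to pixel coordinates of a pinhole camera.

    p_world      : (3,) world point [X, Y, Z] in mm
    R, t         : extrinsic rotation (3x3) and translation (3,) mapping
                   world coordinates into the camera frame
    f_mm         : focal length in mm
    pixel_size_mm: side length of one square pixel in mm
    sensor_px    : (width, height) of the sensor in pixels
    """
    # 1) world -> camera frame (extrinsic parameters)
    X, Y, Z = R @ p_world + t

    # 2) pinhole perspective projection onto the image plane (Eqs. 1.1, 1.2),
    #    with Y acting as the viewing depth of the camera
    x = -f_mm * X / Y
    y = -f_mm * Z / Y

    # 3) intrinsic parameters: metric image-plane coordinates -> pixels
    u = x / pixel_size_mm + sensor_px[0] / 2.0
    v = y / pixel_size_mm + sensor_px[1] / 2.0
    return u, v

# Example with illustrative camera values (not the exact datasheet numbers):
R = np.eye(3)
t = np.array([0.0, 500.0, 0.0])
u, v = project_point(np.array([100.0, 2000.0, 300.0]), R, t,
                     f_mm=8.0, pixel_size_mm=0.006, sensor_px=(752, 480))
```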

As stated above, any real-world camera can be simulated. However, in the simulated experiments of this work, the characteristics of the ‘IDS UI-1220RE-M-GL’ camera are used. This is because the real-world experiments were carried out with these cameras, and using the same camera characteristics allows the results of the simulated and real-world experiments to be compared effectively.

The proposed work can be divided into two parts. The first part is the derivation of the best multi-camera setup for such in-plant logistics settings through extensive simulated experiments. This part is further divided into two sub-parts. First, optimal camera positions are identified through extensive experimentation in a simulated environment: a spherical object with a known radius is thrown, its trajectory is observed by placing the virtual camera at multiple locations, and the errors in the reconstructed 3D interception positions are recorded. These interception positions are reconstructed from the projected 2D positions on the camera’s sensor area. Based on the observed errors, the optimal camera positions are identified. Afterward, different combinations of calibrated cameras (placed at the identified optimal positions) are tried in further experimentation to determine the best multi-camera setup with minimal trajectory error.

The second part of this research involves intelligent tracking of thrown objects by a cognitive catching robot. Such a robot observes the initial part of the thrown object’s flight trajectory using the best multi-camera setup and predicts the remaining trajectory with the help of a proposed encoder–decoder Bi-LSTM deep neural network model. The model was trained on a dataset of 3000 trajectories generated through simulated experiments. The evaluation of the proposed model demonstrated its superior performance compared to the current state of the art: it achieved a maximum mean absolute error (MAE) of 5 mm and root-mean-square error (RMSE) of 7 mm in 200 real-world test experiments, indicating its capability to accurately predict the final catching point.

This manuscript is structured as follows: Sect. 2 reviews work already conducted on this research problem. Section 3 explains the proposed methodology. Section 4 presents the obtained results. Finally, the paper is concluded in Sect. 5, where planned future work is also discussed.

2 Related work

A considerable amount of work on ball-catching robots can be found in the literature. For example, the authors of [6] demonstrated a system capable of simultaneously catching two balls thrown by humans from 4 to 6 m away. Their planning algorithm re-plans the robot movement every 20 ms based on the ball detected through 2 cameras mounted on the throwing side and state estimation by an Unscented Kalman Filter (UKF). The reported precision, in both articles [6, 7], was within 2 cm (i.e. 20 mm) for 80% successful catches. Similarly, automated object detection and tracking have nowadays achieved great accuracy, as evident from some recent works [2, 9, 21, 30, 33, 34] in the literature.

The robotic throw-catch approach for material transportation in FMS was first proposed in [10]. This approach is feasible only for the transportation of small-sized objects, as the high aerodynamic instability of large-sized objects makes them highly unpredictable during flight [12, 14]. Detecting and predicting the interception positions of objects of various shapes during their flight is a complex problem that requires a systematic, step-wise solution. In the existing literature [3,4,5, 10,11,12,13,14, 24,25,28, 31] on trajectory detection and prediction in in-plant logistics scenarios, researchers opted to use a tennis ball as the projectile. This is because understanding the aerodynamic properties of an object is critical for accurately predicting its behavior during flight, and the aerodynamic properties of a tennis ball are well studied, with a significant body of scientific literature exploring them; a review in this context can be found in [23]. Hence, this research work considers only spherical objects. However, the proposed work paves the way to extend future research to other object shapes as well.

The work in [29] used a high-speed 3D video range camera that works on the photonic mixer device (PMD) principle, which determines the distance of a flying object with a time-of-flight sensor. Such a sensor operates by emitting infrared laser light that hits a target object and returns to the sensor, where it is detected. A similar effort was made in [3,4,5], where a single camera with a capturing speed of 87 frames per second (fps) was placed at the catching side. In the initial part of the trajectory of a mechanically thrown tennis ball, the 3D interception positions were determined through photoelectric sensors placed at a vertical distance of 40 mm to measure the initial parameters of the ball. The interception positions in the later part of the trajectory were measured by size-based tracking, as interception position determination is more accurate when the flying ball is near the catching-side camera. The final 2D impact position accuracy was assessed using a DST touch kit in only 17 test throws. The average error in the final impact position ranged from 1.20 to 3.98 mm. The limitation of this work is that photoelectric sensors are not easily implementable in industry.

The works in [13, 14, 24,25,26,27,28] used a mechanical throwing device for throwing a tennis ball. After measuring the initial interception positions of the thrown ball through the stereovision of two cameras mounted on the catching side, the remaining trajectory was estimated by applying different prediction algorithms. For example, in [26, 27], a fast kNN-based prediction of the future ball positions achieved an accuracy of 30 mm, measured as the mean squared error between actual and predicted positions, in 92% of 150 tested trajectories. In [25], the same authors used a simple one-hidden-layer neural network for the remaining trajectory estimation. This neural network was trained on 150 simulated trajectories. During testing, the average prediction error in a simulated environment was approximately 24–26 mm, which is high for simulated outcomes.

In recent studies [15,16,17, 20], machine learning algorithms have been employed to track mechanically thrown balls for ping-pong-playing robots. The study described in [15,16,17] constructed a dataset of 614 ball trajectories captured by four ceiling-mounted cameras. A variational auto-encoder deep neural network was trained on 90% of this dataset and validated on the remaining 10%. The trained model was able to predict the final touch point by analyzing the first 40–50 frames of the ball’s flight trajectory, recorded at 180 frames per second. It showed a mean absolute error of 40–60 mm when evaluated on 35 test trajectories. This error is high, and the evaluation also suffers from limited testing. The second work, presented in [20], used 3 cameras (right, left, and auxiliary) to observe the flight trajectory of a ping-pong ball thrown by a mechanical device. It trained a dual neural network on 300 trajectories and, upon testing on 30 trajectories, showed a mean absolute error of 36.6 mm in final touch point prediction. The limited training and testing were the major drawbacks of this work. The limitations of existing works on mechanically thrown object tracking are summarized in Table 1.

Table 1 Limitations in existing works of mechanically thrown object tracking

The limitations of the current state of the art emphasize that tracking a mechanically thrown object’s trajectory is still an open research problem due to various factors. A large dataset of trajectories is needed to better learn the nonlinear patterns of thrown object trajectories. Furthermore, the accuracy of measuring the interception positions of thrown objects through visual sensors needs to be enhanced, and for this purpose, extensive experimentation is required with cameras placed at different positions. In a real-world environment, producing thousands of trajectories with a multitude of variations in the experimental setup is not possible. So a 3D simulated environment is needed in which an object can be thrown with multiple variations and its trajectory can be monitored through simulated cameras following the basic pinhole-camera principle of perspective projection, as applied in real-world cameras. The work in hand is an effort in this direction.

3 Proposed work

The intention of this work is to enhance the intelligence of cognitive robots for catching mechanically thrown objects in an internal logistics environment within flexible manufacturing systems of Industry 4.0. The work consists of two parts. The first part is to derive the best multi-camera setup for enhancing the visual tracking of mechanically thrown objects’ trajectories. The second part consists of the collection of a dataset of trajectories captured by the derived best multi-camera setup and the training of the proposed bidirectional LSTM model for intelligent tracking of thrown objects.

This paper first explains the detailed procedure, including the involved simulation and algorithms, used to derive the best multi-camera setup for such internal logistics settings. It then explains the second part of the research, which utilizes the derived multi-camera setup to prepare a trajectories dataset for training the proposed intelligent model responsible for accurately predicting the final interception position for the catching robot. The following two subsections explain each part in detail.

3.1 Proposed best multi-camera setup derivation

Deriving the best multi-camera setup requires first deriving the best camera capturing positions within such an industrial internal logistics environment. This would require placing cameras at nearly every position within a 3D vicinity similar to that of internal industrial logistics and observing the reconstructed results for the thrown object’s flight trajectory. Since this is nearly impossible in practice, a simulated experimental setup has been developed in accordance with the real-world camera parameters and experimental setup.

The simulation enables throwing spherical objects with any properties. However, a tennis ball was used as the thrown object in both simulated and real-world experiments because the same object was used in similar experiments in the state of the art, which enables an accurate comparison of the obtained results with existing works. Another reason for using a tennis ball is its well-defined properties, such as a radius of 32.2 mm and a mass of 56 g. Additionally, its aerodynamic properties are also well defined, such as the air drag coefficient (i.e. cw) of 0.35. A detailed survey of these properties can be found in [1].
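For illustration, the following Python sketch shows how a drag-aware flight trajectory of such a ball can be integrated numerically from the launch velocity, using the quadratic drag force \(F_d = \tfrac{1}{2}\rho c_w A v^2\) with the ball properties listed above. This is a minimal sketch under common projectile assumptions (no spin or Magnus effect, constant air density); it is not the authors' simulation code, and the time step and launch values are illustrative.

```python
import numpy as np

RHO = 1.2            # assumed air density in kg/m^3
CW = 0.35            # drag coefficient of a tennis ball
RADIUS = 0.0322      # ball radius in m
MASS = 0.056         # ball mass in kg
AREA = np.pi * RADIUS ** 2
G = np.array([0.0, 0.0, -9.81])   # gravity along the negative Z-axis (height)

def simulate_trajectory(v0, dt=1.0 / 60.0, n_steps=17):
    """Integrate ball positions (in m) launched from the origin with velocity v0 (m/s).

    dt matches a 60 fps camera, so each step corresponds to one captured frame.
    """
    p = np.zeros(3)
    v = np.asarray(v0, dtype=float)
    positions = []
    for _ in range(n_steps):
        speed = np.linalg.norm(v)
        drag = -0.5 * RHO * CW * AREA * speed * v / MASS  # drag acceleration
        v = v + (G + drag) * dt
        p = p + v * dt
        positions.append(p.copy())
    return np.array(positions)

# Example: a throw directed mostly along the Y-axis (illustrative launch velocity)
traj = simulate_trajectory([0.3, 11.0, 2.0])
```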

The other simulation parameters of the thrown object (such as its launching speed and the derivation of all three coordinates of its initial velocity from the measured launching speed) and the camera properties (such as focal length and capturing speed) are set as in the real experiments in order to establish consistency between the simulation and the real-world results. Moreover, a sample experiment was performed in both the real-world and simulated environments with the same parameters, and the 3D interception positions reconstructed from real and simulated camera-captured frames were compared. A close resemblance between the reconstructed results confirmed the validity of the simulation. Interested readers are referred to the previously published paper of this work [32] for all the details of the sample experiment. Figures 4 and 5 show the first and second screens of the developed simulation, which allow setting all the above-mentioned parameters.

Fig. 4 Developed multi-camera simulation (main screen)

Fig. 5 Setting camera parameters (screen-2)

The main screen of the simulation shows that it also allows placing multiple (2–4) cameras for capturing mechanically thrown object trajectories. However, at this first stage, only simulated experiments were performed to derive the best camera capturing positions: a single camera was placed at multiple positions, and the accuracy of the 3D interception positions of the mechanically thrown object was analyzed as the difference between the actual 3D positions of the object and its reconstructed 3D positions, which are determined from its apparent size in the captured trajectory frames. Later, at the next stage, combinations of multiple cameras placed at the derived best positions are experimented with in both simulated and real-world environments. The algorithm to derive the best camera capturing positions is given as Algorithm 1.

Algorithm 1 Best Camera Capturing Positions Derivation

Algorithm 1 starts by capturing trajectories with the camera placed at nearly every reasonably distant position in the simulated 3D world within the vicinity of the internal logistics environment. The “capture trajectories” process consists of reconstructing trajectories from the camera-captured videos and then comparing the reconstructed 3D object positions with the actual positions along the flight trajectory. It thereby determines the accuracy of the visually captured trajectories for cameras placed at multiple positions. Step 2 of Algorithm 1 shows that, to first guess the area of best results, the camera is placed at positions 500 mm apart (along each dimension). As limits must be put on the axis values, the 3D grid is defined so that the camera is placed from − 500 to 3500 mm along the Y-axis. At each Y-axis value, the camera height (i.e. Z-axis value) ranges from − 500 to 1500 mm, and X-axis positions are considered up to 1500 mm toward the right and left. The camera is always oriented (targeted) toward (0,0,0), which is the point from which the ball is always launched in the simulated environment. Figure 6 shows the 500 mm apart camera positions considered for the initial experimentation.
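A minimal sketch of this candidate grid is given below, assuming the axis limits stated above (all values in mm) and a camera that is always aimed at the launch point (0,0,0); the helper name is illustrative and is not taken from Algorithm 1 itself.

```python
import numpy as np

def candidate_camera_positions(step=500):
    """Enumerate candidate camera positions on a regular 3D grid (values in mm)."""
    xs = np.arange(-1500, 1500 + step, step)   # left/right of the launch point
    ys = np.arange(-500, 3500 + step, step)    # from the throwing side to the catching side
    zs = np.arange(-500, 1500 + step, step)    # camera height
    return [(x, y, z) for x in xs for y in ys for z in zs]

positions_500 = candidate_camera_positions(step=500)
# every candidate camera is oriented toward the launch point (0, 0, 0)
```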

Fig. 6 Initially considered 500 mm apart camera positions

Algorithm 2 represents the “capture trajectories” process, which consists of capturing trajectories with the camera placed at positions a particular distance apart within the specified limits of the X, Y, and Z axes. In the first round of experimentation, the distance is 500 mm, and the axis limits are chosen to reasonably represent the 3–5 m vicinity of an industrial internal logistics environment. However, these limits can be adjusted. For instance, if good results are obtained at the initially chosen Z-axis value of − 500 mm, this indicates the potential for even better outcomes beyond this value, and the experimentation range is therefore expanded to include a Z-axis value of − 1000 mm as well. This iterative process can be continued further.

Algorithm 2 Capture Trajectories and Calculate Errors

The obtained experimentation results are shared in the next section. However, step 5 of Algorithm 1 shows that the initial best camera positions, obtained because of minimum trajectory errors, are further examined by capturing trajectories at positions 250 mm apart along all three axes surrounding those initially obtained best positions. This process is continued, and further refined experimentation is done by capturing trajectories at positions 100 mm apart, along all three axes, surrounding the obtained best positions, as depicted in step 8 of Algorithm 1.
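The coarse-to-fine search of Algorithm 1 can be summarized by the following sketch, which reuses the grid helper shown above and assumes a `capture_trajectories` routine implementing Algorithm 2 and a `minimum_subtrajectory_errors` routine implementing Algorithm 4 (described later, returning indices of the best positions). Both names are placeholders for the procedures described in this section, and the neighbourhood expansion is only one possible reading of "surrounding" positions.

```python
def derive_best_positions(capture_trajectories, minimum_subtrajectory_errors):
    """Coarse-to-fine search for the best camera capturing positions (500 -> 250 -> 100 mm)."""
    # Phase I: coarse 500 mm grid over the whole logistics vicinity
    positions = candidate_camera_positions(step=500)
    records = capture_trajectories(positions)              # one record per camera position
    best = [positions[i] for i in minimum_subtrajectory_errors(records)]

    # Phases II and III: denser grids surrounding the current best positions
    for step in (250, 100):
        positions = []
        for (bx, by, bz) in best:
            positions += [(bx + dx, by + dy, bz + dz)
                          for dx in (-step, 0, step)
                          for dy in (-step, 0, step)
                          for dz in (-step, 0, step)]
        records = capture_trajectories(positions)
        best = [positions[i] for i in minimum_subtrajectory_errors(records)]
    return best
```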

The process of capturing trajectories, as shown in Algorithm 2, includes determining the 3D interception positions of the thrown object by reconstructing them from its apparent size in the captured frames. Afterward, these reconstructed positions (along the flight trajectory) are compared with the actual positions. Hence, it determines the accuracy of the visually captured trajectories for cameras placed at multiple positions. Let \((X_d, Y_d, Z_d)\) be the determined 3D position of the thrown ball and \((X_a, Y_a, Z_a)\) its actual position; then the magnitude of the difference vector between these two positions is calculated by the following Eq. 3.1.

$$\text{Magnitude of diff. vector} = \sqrt{(X_d - X_a)^2 + (Y_d - Y_a)^2 + (Z_d - Z_a)^2} \tag{3.1}$$

It follows that the visually captured trajectory error can be calculated as the sum of the differences between all corresponding interception positions on two 3D curves of the ball. The first 3D curve consists of the determined ball positions in the reconstructed trajectory scene, and the second curve consists of the actual ball positions. So the whole trajectory error is calculated by the following Eq. 3.2.

$$\text{Trajectory error} = \sum_{i=1}^{n} \sqrt{(X_{d_i} - X_{a_i})^2 + (Y_{d_i} - Y_{a_i})^2 + (Z_{d_i} - Z_{a_i})^2} \tag{3.2}$$

where \(n\) is the total number of captured trajectory frames. Depending on the capturing camera position, the captured trajectory error could be lower in the initial, middle, or final part of the trajectory. So the trajectory is divided into four parts, and the corresponding four sub-trajectory errors are calculated separately, as depicted in step 3 of Algorithm 2. For example, if 17 video frames are captured during the thrown object’s flight, then these frames can be divided into four segments as shown in Fig. 7. These four sub-trajectories (sub1, sub2, sub3 and sub4) cover the determined errors in the first five (1–5), the next four (6–9), the following four (10–13), and the last four (14–17) frames, respectively, and in this case the four errors are calculated by the formulae given in Eqs. 3.3, 3.4, 3.5, and 3.6, respectively.

$$\text{Sub1 traj. error} = \sum_{i=1}^{5} \sqrt{(X_{d_i} - X_{a_i})^2 + (Y_{d_i} - Y_{a_i})^2 + (Z_{d_i} - Z_{a_i})^2} \tag{3.3}$$
$$\text{Sub2 traj. error} = \sum_{i=6}^{9} \sqrt{(X_{d_i} - X_{a_i})^2 + (Y_{d_i} - Y_{a_i})^2 + (Z_{d_i} - Z_{a_i})^2} \tag{3.4}$$
$$\text{Sub3 traj. error} = \sum_{i=10}^{13} \sqrt{(X_{d_i} - X_{a_i})^2 + (Y_{d_i} - Y_{a_i})^2 + (Z_{d_i} - Z_{a_i})^2} \tag{3.5}$$
$$\text{Sub4 traj. error} = \sum_{i=14}^{17} \sqrt{(X_{d_i} - X_{a_i})^2 + (Y_{d_i} - Y_{a_i})^2 + (Z_{d_i} - Z_{a_i})^2} \tag{3.6}$$
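The sketch below computes the total trajectory error of Eq. 3.2 and the four sub-trajectory errors of Eqs. 3.3–3.6 from the determined and actual position arrays; the frame split assumes the 17-frame example above, and the function name is illustrative.

```python
import numpy as np

def trajectory_errors(determined, actual):
    """Total trajectory error (Eq. 3.2) and the four sub-trajectory errors
    (Eqs. 3.3-3.6) for a 17-frame trajectory.

    determined, actual: arrays of shape (17, 3) holding (X, Y, Z) positions in mm.
    """
    diffs = np.linalg.norm(determined - actual, axis=1)   # Eq. 3.1 applied per frame
    total_error = diffs.sum()                             # Eq. 3.2
    segments = [slice(0, 5), slice(5, 9), slice(9, 13), slice(13, 17)]
    sub_errors = [diffs[s].sum() for s in segments]       # Eqs. 3.3-3.6
    return total_error, sub_errors
```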
Fig. 7 Trajectory errors when the camera captures trajectory in 17 frames

The camera-captured RGB image is first converted into an HSV image. Then the background is subtracted from the HSV image, and noise is removed from the resulting binary image. The ball is detected as a circle in the binary image, and its 3D position is reconstructed from the recognized radius of the ball and its 2D center position (in pixels) within the captured frame. Figure 8 below shows some images captured with the camera placed at position (− 800, 3500, 500) and oriented toward (0,0,0), the launching position of the ball.
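A minimal OpenCV sketch of this detection step is shown below, assuming a static background frame is available and using the circular Hough transform (CHT) mentioned in Fig. 9. The thresholds and Hough parameters are illustrative, and the depth recovery follows the generic pinhole relation (distance proportional to focal length times real radius over imaged radius), not the authors' exact reconstruction code.

```python
import cv2
import numpy as np

BALL_RADIUS_MM = 32.2

def detect_ball(frame_bgr, background_bgr, f_px):
    """Detect the ball in one frame and estimate its distance from the camera.

    f_px: focal length expressed in pixels (focal length in mm / pixel size in mm).
    Returns ((u, v, r) in pixels, approximate depth in mm) or None if no ball is found.
    """
    # RGB -> HSV for both the current frame and the static background
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hsv_bg = cv2.cvtColor(background_bgr, cv2.COLOR_BGR2HSV)

    # background subtraction followed by binarization (threshold is illustrative)
    diff = cv2.absdiff(hsv, hsv_bg).max(axis=2)
    _, binary = cv2.threshold(diff, 30, 255, cv2.THRESH_BINARY)

    # noise removal with a small morphological opening
    binary = cv2.morphologyEx(binary, cv2.MORPH_OPEN, np.ones((3, 3), np.uint8))

    # ball detection as a circle via the circular Hough transform (CHT)
    circles = cv2.HoughCircles(binary, cv2.HOUGH_GRADIENT, dp=1.2, minDist=50,
                               param1=100, param2=15, minRadius=2, maxRadius=80)
    if circles is None:
        return None
    u, v, r = circles[0][0]

    # pinhole relation: the imaged radius shrinks linearly with distance
    depth_mm = f_px * BALL_RADIUS_MM / r
    return (u, v, r), depth_mm
```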

Fig. 8 a Captured frames 4, 9 & 15 when camera placed at (− 800, 3500, 500) b Actual & reconstructed 3D position in corresponding binary segmented image obtained after background subtraction and noise removal

The same methodology for object position determination in a camera-captured frame is employed in both the simulated and the real-world environment. This methodology is explained in detail in Algorithm 3. Figure 9 shows a few results of the ball detection process in a simulated camera-captured frame.

Fig. 9 a Simulated camera’s captured frame b Background image c Binary image after background subtraction and noise removal d Detected ball through CHT algorithm

Algorithm 2 shows that whenever the “capture trajectories” process is run with the camera placed at multiple positions (a particular distance apart), it returns a list of results in terms of multiple records. Each record in the list represents the results obtained for a particular camera placement and has 7 fields: the first three represent the three coordinates of the camera position, and the last 4 represent the four sub-trajectory errors obtained for the trajectory captured with that camera placement. To find the best capturing position for each sub-trajectory error, the four minimum sub-trajectory errors are determined using Algorithm 4, which returns a Best Camera Trajectory list (BCTx) containing 4 values pointing to indexes of Cx, the list of all experimented camera positions that are ‘x’ distance apart. In this way, BCTx points to the four camera positions where the best-captured trajectories were found due to the corresponding four minimum sub-trajectory errors.
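A minimal sketch of this selection step (Algorithm 4) is shown below, operating on records structured as described above (three position coordinates followed by four sub-trajectory errors); the record layout is taken from the text, while the function name is illustrative.

```python
def minimum_subtrajectory_errors(records):
    """Return BCT: indices of the camera positions with the minimum error
    for each of the four sub-trajectories.

    records: list of tuples (x, y, z, e_sub1, e_sub2, e_sub3, e_sub4),
             one tuple per experimented camera position.
    """
    bct = []
    for sub in range(4):
        errors = [record[3 + sub] for record in records]
        bct.append(errors.index(min(errors)))   # index into the list of camera positions
    return bct
```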

Algorithm 3 Object Position Determination

So, recalling Algorithm 1, the camera is first placed at all possible 500 mm apart positions, and the minimum trajectory errors are calculated to derive the best positions. Then, in phase II, the camera is placed at all possible 250 mm apart positions (adjacent to the best positions derived in the earlier phase). Again, the minimum trajectory errors are calculated to further narrow down the best positions. Finally, the same process is repeated with 100 mm apart positions surrounding the derived positions of phase II, and the final best camera capturing positions are derived in phase III. All these experimentation results are shown in Sect. 4, for both simulated and real-world environments. Multiple cameras are then strategically positioned at the derived optimal locations, enabling the determination of 3D interception positions through stereovision among these cameras. This experimentation provides the best multi-camera network, resulting in the most accurate observation of the trajectory of mechanically thrown objects within the vicinity of the internal logistics environment of Industry 4.0.

Algorithm 4 Calculate Minimum Trajectory Errors

3.2 Proposed intelligent tracking for cognitive catching robot

For the applicability of automated in-plant logistics by the robotic throw-catch approach to smart manufacturing in Industry 4.0, the mechanically thrown object should be accurately caught by intelligent cognitive robots. The derived best multi-camera network for such industrial settings helps to accurately observe the 3D spatial coordinates of the thrown object in its initial flight trajectory. But such robots also need to be intelligently trained so that they can accurately predict the remaining trajectory, in real time, by observing the initial flight trajectory of the thrown object. Mechanically thrown object trajectory tracking and prediction is still an open research problem because of several factors, the major one being the nonlinear nature of the thrown object’s trajectory.

Consequently, a large dataset of trajectories is required to enhance the learning of nonlinear patterns in mechanically thrown object trajectories. However, generating thousands of trajectories with numerous variations in the experimental setup is not feasible in a real-world environment. Therefore, the created 3D simulated environment enables the throwing of objects with multiple variations, while their trajectories can be tracked using simulated cameras based on the fundamental principles of the pinhole camera model and perspective projection (similar to real-world experimental cameras). The simulation not only allows for camera placement anywhere within the 3D space of the internal logistics environment but also facilitates trajectory capture at any desired capturing speed (frames per second). Thus, it fulfills the requirement for extensive throwing experimentation to gather trajectory data essential for training supervised learning algorithms, particularly deep recurrent neural networks such as RNN and LSTM.

Using the optimal camera setup in the simulated environment, a total of 3000 experiments were performed with minor adjustments to the launching parameters of the thrown ball (± 0.001 m/s in launching speed as well as ± 0.001 in azimuth and inclination angles). Two trajectories are recorded as two time series for each experiment: the first time series comprises the interception positions as perceived by the cameras, while the second contains the actual 3D interception positions. A training dataset is prepared from these throws, where each dataset entry consists of two parts: the initial portion of the trajectory comprises the first 2/3 of the positions as perceived by the observing cameras, while the remaining 1/3 consists of the actual positions of the thrown ball.
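A sketch of how such a perturbed set of launch parameters could be generated is given below (speed in m/s, angles in whatever unit the simulation uses for azimuth and inclination); the base values, random-sampling choice, and helper name are illustrative and not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def perturbed_launches(base_speed, base_azimuth, base_inclination, n=3000):
    """Generate n launch parameter sets with +/- 0.001 perturbations around base values."""
    launches = []
    for _ in range(n):
        launches.append({
            "speed": base_speed + rng.uniform(-0.001, 0.001),
            "azimuth": base_azimuth + rng.uniform(-0.001, 0.001),
            "inclination": base_inclination + rng.uniform(-0.001, 0.001),
        })
    return launches
```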

A deep neural network model, called encoder–decoder Bi-LSTM, is proposed that is capable of accurately predicting the last 1/3 of the thrown object’s trajectory from the initial 2/3 observed by the cameras. This model has demonstrated excellent performance in identifying patterns in such many-to-many \((m \times n)\) mapping problems involving nonlinear time series data. The proposed model architecture is depicted in Fig. 10.

Fig. 10 Proposed intelligent tracking model architecture

This architectural representation illustrates the proposed intelligent tracking model, designed to map a sequence of ‘m’ input 3D positions to ‘n’ output 3D positions. The model’s input shape is defined as (m, 3), indicating that the encoder Bi-LSTM receives the initial ‘m’ three-dimensional interception positions of the observed trajectory as input. The encoder Bi-LSTM consists of two LSTMs (forward and backward) with 50 neurons each.

The output of the encoder Bi-LSTM is passed through a ‘Relu’ layer and then fed repeatedly n times to the decoder Bi-LSTM using the RepeatVector operation. The decoder Bi-LSTM also consists of two LSTMs (forward and backward) with 50 neurons each. The output of the decoder is processed through a ‘Relu’ layer and a time-distributed dense layer to represent 3 features (X, Y, and Z position coordinates) at each time step.

The final output of the model has shape n \(\times\) 3, representing the last ‘n’ 3D predicted positions of the thrown object. The decoder Bi-LSTM incorporates the current input, the RepeatVector value, and the hidden state of the previous output. The RepeatVector operation is used exclusively to repeat the output of the encoder LSTM and has no trainable parameters. Both the encoder and decoder Bi-LSTM have 100 units (i.e. 50 neurons for each of the forward and backward LSTMs). The proposed tracking model was evaluated on multiple test throws and exhibited outstanding accuracy. The detailed results of these evaluations are presented in the next section.
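A possible Keras realization of the architecture in Fig. 10 is sketched below, under the assumptions stated in the text (m = 12 observed positions, n = 5 predicted positions, 50 units per direction, ReLU activations, a RepeatVector bridge, and a time-distributed dense output of 3 coordinates). Hyperparameters such as the optimizer and loss are illustrative, as they are not specified in the text, so this is a sketch rather than the authors' exact implementation.

```python
from tensorflow.keras import layers, models

M_IN, N_OUT = 12, 5   # observed and predicted interception positions per trajectory

def build_tracking_model(m=M_IN, n=N_OUT, units=50):
    """Encoder-decoder Bi-LSTM mapping (m, 3) observed positions to (n, 3) predictions."""
    model = models.Sequential([
        # encoder: bidirectional LSTM over the observed part of the trajectory
        layers.Bidirectional(layers.LSTM(units, activation="relu"),
                             input_shape=(m, 3)),
        # repeat the encoded context n times, once per position to be predicted
        layers.RepeatVector(n),
        # decoder: bidirectional LSTM producing one hidden state per output step
        layers.Bidirectional(layers.LSTM(units, activation="relu",
                                         return_sequences=True)),
        # time-distributed dense layer emits (X, Y, Z) at every output step
        layers.TimeDistributed(layers.Dense(3)),
    ])
    model.compile(optimizer="adam", loss="mse")   # assumed training configuration
    return model

model = build_tracking_model()
model.summary()
```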

4 Results and discussion

During the real-world experiments, a specially designed mechanical device is utilized to throw the tennis ball. This device incorporates a spring mechanism, and the ball is thrown by releasing the energy stored in the stretched or compressed spring. The process of launching the tennis ball is visually depicted in Fig. 11.

Fig. 11 Mechanical ball launching in real world experiments

The launching speed of the thrown ball is accurately determined using a radar gun. The initial velocity is then derived from this measured speed and the throwing angles. Readers interested in a comprehensive explanation of this derivation are referred to the previously published paper of this work [29], which provides detailed insights into the real and simulated experimental setups. The distance separating the throwing point from the final catching point is approximately 3 m. All simulation parameters, including the launching parameters of the thrown object and the camera properties, are replicated from the real-world experimental setup. Specifically, the IDS imaging camera model “UI-1220RE-M-GL” is employed in the real-world experiments, and the characteristics of this camera are summarized in Table 2.
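One common way to decompose a measured launch speed into velocity components from azimuth and inclination angles is sketched below. The exact angle conventions of the authors' derivation are given in their earlier paper, so this spherical-coordinate version is only an assumed illustration with a hypothetical function name.

```python
import math

def initial_velocity(speed, azimuth_rad, inclination_rad):
    """Decompose a measured launch speed (m/s) into (vx, vy, vz) components.

    Assumed convention: Y is the main throwing direction, Z is the height,
    azimuth rotates the throw left/right in the X-Y plane, and inclination
    lifts it out of that plane.
    """
    vx = speed * math.cos(inclination_rad) * math.sin(azimuth_rad)
    vy = speed * math.cos(inclination_rad) * math.cos(azimuth_rad)
    vz = speed * math.sin(inclination_rad)
    return vx, vy, vz

# e.g. a 10.5 m/s throw, almost straight along the Y-axis, slightly upward
vx, vy, vz = initial_velocity(10.5, math.radians(2.0), math.radians(12.0))
```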

Table 2 Characteristics of used camera “IDS UI-1220RE-M-GL”

The imaging process in the simulated cameras uses a standard pinhole camera model employing perspective projection, as is the case in real-world cameras. This means that viewed objects are projected smaller when distant from the camera and larger when near to it. In the simulation, the ball is launched from point (0, 0, 0), and the axes of its movement are illustrated in Fig. 12.

Fig. 12 Considered X, Y, and Z-axis in experimentation

As explained before, the proposed work consists of two parts. The first part is to derive the best multi-camera setup for enhancing the visual tracking of mechanically thrown objects’ trajectories. The second part consists of the preparation of the trajectories dataset, using the derived multi-camera setup, and the training and testing of the proposed Bi-LSTM model for intelligent tracking of thrown objects. The methodologies employed in both parts are explained in detail in the previous section, and the experimentation results obtained in each part are given in the following two subsections.

4.1 Results in deriving best multi-camera setup

Recalling Algorithm 1, the simulated experimentation starts by systematically positioning the camera at regular intervals of 500 mm within the designated ranges of the X, Y, and Z axes; the four minimum sub-trajectory errors are then calculated to derive the first set of best camera positions. In the simulated experiments, the initial velocities (along all three axes of the launched tennis ball) are kept the same as the values measured in the conducted real-world experiments. At each experimented position, the simulated camera is oriented toward (0,0,0), i.e. the launching position of the ball. The camera capturing speed is set to 60 frames per second (fps), consistent with the speed used in the real-world experiments. At this rate, the trajectory of the ball is recorded in 17 frames. The camera positions where the minimum sub-trajectory errors are observed are shown in Fig. 13.

Fig. 13 Minimum observed four sub-trajectory error’s positions in initial 500 mm apart experimentation

The mentioned four sub-trajectory errors specify the minimum observed errors in the camera-captured first five (1–5), next four (6–9), following four (10–13), and last four (14–17) frames, respectively. As explained previously, these errors are calculated by the formulae given in Eqs. 3.3 to 3.6, respectively. In the next step, the cameras are placed at 250 mm intervals surrounding these identified positions. Figure 14 shows the further tested 250 mm apart camera positions.

Fig. 14 Camera placement at all 250 mm apart positions surrounding initially best identified positions

The results obtained after experimenting at all the above-mentioned positions show that the minimum trajectory errors remain at the same positions. However, the trajectory errors are lower at a distance (i.e. Y-axis value) of − 750 mm than at − 250 mm, and likewise lower at a distance of 2750 mm than at 3250 mm. Figure 15 illustrates that the trajectory of the thrown object is most effectively captured when observed from the throwing side within the distance range of − 750 to − 500 mm along the Y-axis. Furthermore, the results obtained at − 500 mm are superior to those at − 750 mm.

Fig. 15 Better resultant 250 mm apart positions

Figure 15 also demonstrates that the trajectory of the thrown object is optimally captured when observed from the catching side within the distance range of 2750–3000 mm along the Y-axis. Notably, the results obtained at the Y-axis value of 3000 mm outperform those obtained at 2750 mm. The experimental results further show that the errors are lower for Z-axis values between 0 and 500 mm and for X-axis values between 0 and 500 mm. This helped to identify the initially guessed best resultant areas, which are shown in Fig. 16.

Fig. 16 Initially guessed best resultant areas

As the results at a distance of − 500 mm are better than at − 750 mm, and the results at 3000 mm are better than at 2750 mm, the distances selected for the final refined experiments on the throwing side are − 500 and − 600 mm (near − 500 mm) and, similarly, on the catching side 3000 and 2900 mm (near 3000 mm). The camera positions within these areas are considered for the refined experiments, and the finally selected best resultant areas for refined experimentation are shown in Fig. 17.

Fig. 17 Finally guessed best resultant areas for refined experiments

That the best results are obtained within the finally guessed areas makes sense, because the apparent size of the ball changes significantly within these areas, and the 3D trajectory of the thrown ball is reconstructed from the camera-captured video through its recognized radius in the captured frames. In Fig. 18a, b, c, d and e, the initial, middle, and last frames of the thrown ball’s trajectory are shown while the trajectory is captured from the top, right and left sides, and throwing and catching sides, respectively. Significant variations in the size of the thrown ball can be observed when capturing its trajectory from the throwing or catching side, in contrast to the top or side views.

Finally, the refined experimentation is done within the finally selected areas. This time, cameras are placed at positions 100 mm apart within those areas. Figure 19 shows the camera positions used for the refined experiments. Based on this refined experimentation, Table 3 shows the best-derived camera positions resulting in minimum sub-trajectory errors.

Fig. 18 First, middle and last frames of thrown ball trajectory

Fig. 19 Camera placement positions considered for refined experiments within the best resultant area

Table 3 Best camera positions based upon min. sub-trajectory error
Fig. 20 Setting parameters for capturing trajectory through C1 and C4

Further experimentation is then carried out with different combinations of these best-positioned cameras by applying stereovision among them. While trying the combinations, it was ensured that both throwing-side and catching-side cameras were included, as this is required for good results. This experimentation is conducted in both simulated and real-world environments to determine the optimal multi-camera setup for accurately observing the trajectory of a thrown object in such industrial settings. Figures 20 and 21 show some simulation screenshots when the trajectory was captured through C1 and C4.

Fig. 21 Actual & reconstructed 3D position in corresponding binary segmented images (a. Frame 4 b. Frame 9 c. Frame 16) obtained after background subtraction & noise removal when trajectory captured through C1 & C4

Similarly, the following Figs. 22 and 23 show some simulation screenshots when the trajectory is captured through three cameras (C1, C3, and C4). It can be seen that accuracy is further improved when this camera combination is tried.

Fig. 22 Setting parameters for capturing trajectory through C1, C3 and C4

Fig. 23 Actual and reconstructed 3D position in corresponding binary segmented images (a. Frame 6 b. Frame 10 c. Frame 17) obtained after background subtraction and noise removal when trajectory captured through C1, C3 and C4

Fifty simulated experiments were performed with minute variations in the initial launching velocities of the thrown ball, and all possible combinations of 2–4 cameras were tested. The total trajectory error (calculated by Eq. 3.2) was monitored, and the average error is presented in Table 4. Table 4 shows that good results are obtained with particular three- and four-camera calibrated setups. However, the minimum obtained error implies that three calibrated cameras are sufficient for the best observation of the flight trajectory. Moreover, the combination of three cameras (C1, C3, and C4) reconstructed the complete trajectory with minimum error; hence it was chosen as the optimal multi-camera setup.

The derived optimal multi-camera setup was also tested in 50 real-world tennis ball-throwing experiments. However, the accuracy of the reconstructed final position of the ball could only be assessed using a DST touch screen placed on the catching side. This screen measures two coordinates (X-axis and Z-axis values), but its known placement (i.e. its distance) provides the Y-axis value as well. The results of these experiments showed promising accuracy for the final 3D touch point of the ball: the average reconstruction error was under 2 mm for both the X and Z-axis values and under 4 mm for the Y-axis.

4.2 Proposed tracking model results

The proposed model needed a large training dataset of throws, and for this purpose, 3000 simulation experiments were performed using the derived optimal multi-camera configuration. Minor adjustments to the launching parameters were made to obtain a variety of experiments. Two trajectories are recorded as two time series for each experiment: the first comprises the interception positions as perceived by the cameras, while the second contains the actual 3D interception positions. This approach was adopted to ensure that each training trajectory contains the initial 3D interception positions as camera-observed positions, while the final 3D interception positions correspond to the actual positions. This design choice is influenced by the findings of the catching robot work presented in [12], which highlights that the motor of a contemporary catching robot needs a minimum of 80 ms to accurately position its catching gripper at the predicted final 3D position. So, the final 3D catching position should be predicted at least 80 ms before the flying object reaches it.

In the real-world experimentation, the capturing speed of the camera was 60 frames per second. Using the same capturing speed in the simulated experiments, the moving ball was captured in 17 frames, with a total trajectory time of around 283 ms. To effectively train the proposed intelligent tracking model, it is configured to receive as input the initial 3D interception positions observed by the cameras within the first 200 ms of the thrown ball’s flight trajectory and to predict the subsequent 83 ms of the flight. So, each of the 3000 trajectories of the training dataset is composed of 12 interception positions (i.e. 2/3 of the trajectory) as observed by the cameras and the remaining 5 interception positions (i.e. 1/3 of the trajectory) as the last 5 actual positions of the ball in the simulation.
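Under these numbers (17 frames per throw, 12 observed inputs, 5 actual targets), assembling the training tensors could look like the following sketch; the array names are illustrative.

```python
import numpy as np

def build_dataset(observed_trajs, actual_trajs, m=12, n=5):
    """Assemble training tensors from paired trajectories.

    observed_trajs: array (num_throws, 17, 3) of camera-observed positions
    actual_trajs:   array (num_throws, 17, 3) of actual simulated positions
    Returns X of shape (num_throws, m, 3) and y of shape (num_throws, n, 3).
    """
    X = np.asarray(observed_trajs)[:, :m, :]        # first 2/3: camera-observed part
    y = np.asarray(actual_trajs)[:, m:m + n, :]     # last 1/3: actual final part
    return X, y

# e.g. with 3000 simulated throws of 17 frames each:
# X, y = build_dataset(observed, actual)   # X: (3000, 12, 3), y: (3000, 5, 3)
```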

Table 4 Average trajectory error of 50 simulated test experiments (using various multi-camera setups)

The architecture of the proposed intelligent tracking model is explained in Sect. 3.2. The proposed encoder–decoder Bi-LSTM model effectively retains time-series information from both the forward and backward sequential contexts, making it well suited to learning patterns in many-to-many time series problems. The model was evaluated on varying numbers of test throws and demonstrated excellent accuracy in predicting the last 5 interception positions of a thrown object from the first 12 interception positions observed through the derived multi-camera setup. As each predicted interception position has three coordinate values, there are a total of 15 error values per throw. The formulae for calculating the root-mean-square error (RMSE) and mean absolute error (MAE) are given in Eqs. 4.1 and 4.2, respectively.

$$\text{RMSE} = \sqrt{\frac{\sum_{i=1}^{5} \sum_{j=1}^{3} \left(\text{Predicted}_{ij} - \text{Actual}_{ij}\right)^2}{15}} \tag{4.1}$$
$$\text{MAE} = \frac{\sum_{i=1}^{5} \sum_{j=1}^{3} \left|\text{Predicted}_{ij} - \text{Actual}_{ij}\right|}{15} \tag{4.2}$$
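These per-throw metrics over the 5 predicted positions and 3 coordinates can be computed as follows; the snippet is a direct transcription of Eqs. 4.1 and 4.2, with an illustrative function name.

```python
import numpy as np

def throw_errors(predicted, actual):
    """RMSE (Eq. 4.1) and MAE (Eq. 4.2) over the 5 predicted 3D positions of one throw.

    predicted, actual: arrays of shape (5, 3) in mm.
    """
    diff = np.asarray(predicted) - np.asarray(actual)
    rmse = np.sqrt((diff ** 2).sum() / diff.size)    # diff.size == 15
    mae = np.abs(diff).sum() / diff.size
    return rmse, mae
```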

The model was tested on progressively larger datasets of simulated test throws. Figure 24 depicts the prediction error results obtained by testing the proposed model on four distinct test datasets comprising 100, 200, 500, and 1000 simulated throws, respectively. The presented graphs demonstrate consistent error values within the favorable range of 0.5–2.5 mm, indicating the model’s effectiveness.

Fig. 24 Prediction error results of proposed model evaluation for different datasets of simulated test throws

Figure 25 compares the ground truth and predicted values (of the X, Y, and Z-axis, respectively) for the last 5 predicted interception positions of a thrown ball tested in the simulated experiment. In these plots, the initial 12 values represent the observed interception positions captured by the multi-camera setup, while the last 5 values correspond to the ground truth and the values predicted by the proposed intelligent tracking model. The substantial overlap between corresponding predicted and ground truth values demonstrates the high accuracy achieved by the proposed model.

Fig. 25 Comparison between ground truth and predicted values (of X, Y and Z-axis, respectively) for last 5 predicted positions of a tested throw

In real-world scenarios, similar experiments were conducted. However, due to the absence of absolute position detector sensors in the experimental setup, it was only possible to measure the final position of the thrown ball using a DST touch screen located on the catching side. The accuracy of predicting the final 3D catching position was assessed by root-mean-square error (RMSE) and mean absolute error (MAE) that are calculated using the formulae provided in Eqs. 4.3 and 4.4, respectively.

$$\text{RMSE in final position prediction} = \sqrt{\frac{\sum_{i=1}^{3} \left(\text{Predicted}_{i} - \text{Actual}_{i}\right)^2}{3}} \tag{4.3}$$
$$\text{MAE in final position prediction} = \frac{\sum_{i=1}^{3} \left|\text{Predicted}_{i} - \text{Actual}_{i}\right|}{3} \tag{4.4}$$

Figure 26a and b present the obtained error results for final touch point prediction in terms of MAE and RMSE, respectively. These results were obtained from 200 real-world ball-throwing test experiments.

Fig. 26 Final touch point prediction error results by proposed model when evaluated on 200 real-world test throws

Table 5 Comparison of final catching point prediction accuracy achieved in this work and claimed in other state-of-the-art works for mechanically thrown object tracking

These results show that the proposed intelligent tracking model can predict the final catching point with a maximum MAE of 5 mm and RMSE of 7 mm when tested on 200 real-world throwing experiments. The obtained results are compared with some recently implemented works on mechanically thrown object tracking that utilized machine learning techniques for final catching point prediction. This comparative analysis is presented in Table 5. The results demonstrate a significant improvement in the accuracy of predicting the final catching point through intelligent tracking by the proposed model.

5 Conclusion and future work

For smart manufacturing in Industry 4.0, the approach of utilizing throwing and catching cognitive robots holds the potential to become the most efficient mode of automated in-plant logistics. However, for the adoption of this approach in industry, further research is required along two key dimensions. First, an optimal multi-camera setup must be identified that accurately observes the initial flight trajectory of thrown objects within in-plant industrial settings. Second, an advanced machine learning algorithm is needed to enhance the intelligence of the cognitive catching robot, enabling it to accurately predict the final 3D catching position by analyzing the observed initial flight trajectory of mechanically thrown objects.

The proposed work aims to address both of these research dimensions. It involves the creation of a 3D simulation environment that facilitates controlled mechanical throwing of objects within the internal logistics environment of Industry 4.0. Through this simulation, users have the ability to specify the attributes of the thrown object and observe its trajectory by positioning a simulated camera at any desired position and orientation within the 3D environment. To the best of our knowledge, this simulation represents the first attempt in this research domain. The simulation played a crucial role in conducting extensive experimentation to identify optimal camera positions for accurately capturing the 3D interception positions of the flying objects based on their apparent size on the camera’s sensor plane. Various calibrated multi-camera setups were tested using the identified optimal positions, and the most effective configuration was determined based on the obtained results.

Subsequently, a training dataset was created, consisting of the trajectories of 3000 simulated throwing experiments. The initial part of each trajectory in the dataset contained the interception positions observed through the derived optimal multi-camera setup, while the final part consisted of the actual positions. An encoder–decoder bidirectional LSTM neural network model was proposed and trained on this dataset. The performance of the model exceeded the state of the art, achieving a mean absolute error of 5 mm and a root-mean-square error of 7 mm in predicting the final 3D catching point when tested on 200 real-world experiments.

It should be noted that this work is currently limited to objects with a spherical shape. However, future plans include expanding this research to encompass other regular shapes such as rectangular, square, or cylindrical objects. Additional research is required to track the orientation of such regular-shaped objects during their flight trajectory. The successful implementation of this approach has the potential to transform traditional logistics of small-sized materials in industries and lead to improved efficiency and cost savings.