Introduction

Despite the continuously advancing capabilities of automation in manufacturing, human involvement in assembly remains a key enabler (Makris, 2021), especially when aiming for effective, versatile, and high-performing manufacturing systems (Chryssolouris, 2006). Humans are the primary source of flexibility in assembly owing to their ability to perform a broad range of tasks, adapt to new and changing conditions, learn new skills, and comprehend the overall context.

In recent years, the concept of hybrid production systems has been introduced (Kousi et al., 2018), aiming to increase flexibility and reconfigurability at the factory level by intensifying the collaboration between workers and autonomous machines such as robots. Several manufacturing operations are difficult to fully automate owing to their complexity and the dexterity they require. These operations typically occur in labor-intensive assembly and disassembly and place considerable physical strain and psychological stress on workers. For this reason, future production must incorporate smart, human-centric collaborative systems in which people retain the primary role.

The high complexity of these processes and the behavioral flexibility of human workers pose several challenges to the implementation of efficient collaboration with a smart production system (Wang et al., 2019). These challenges call for robust solutions for seamless human–machine interaction, quick and effective handling of unpredictable human errors, and the development of a universally applicable system that covers the vast range of assembly and disassembly operations. These manufacturing challenges can be categorized as follows:

  • Establish a user-friendly and seamless interaction between humans and the system. This interaction must be achieved without adding extra communication steps to the workload. Current human–robot collaboration (HRC) techniques involve interaction with user interfaces (UIs), such as pressing confirmation buttons and reading notification messages. All of these negatively affect the operator’s workload and need to be addressed.

  • Minimize human error Unlike a fully automated system, human involvement increases the likelihood of unanticipated faults and mistakes, which the production system must resolve quickly and robustly to minimize production times.

  • Easy deployment in production The complexity of assembly and disassembly operations makes it difficult to develop a comprehensive and scalable solution that can be easily applied across multiple cases.

Within collaborative production systems, significant effort has been invested in the development of interfaces that interactively deliver system information to human workers. To build an effective human-centric system, a smart feedback loop from people to the system is also necessary (Tzavara et al., 2021). A human action recognition (HAR) component is essential for giving the system the ability to understand what workers are doing and to collaborate effectively. In this context, recent AI advancements in HAR can benefit modern collaborative manufacturing and address the aforementioned challenges. Through real-time monitoring of human actions, AI-driven HAR enables machines and robots to understand human behavior intuitively, reducing the reliance on cumbersome interfaces and minimizing operator workload. Furthermore, by spotting potential errors early and providing timely feedback, HAR can enhance productivity and reduce downtime. Lastly, AI-driven HAR can be customized to the needs of diverse assembly lines, ensuring scalability and wide applicability.

This paper aims to demonstrate an innovative framework for the implementation of HAR in a smart production environment with multiple sensors and autonomous resources. The suggested approach is intended to address the challenges of a) seamless human–system interaction, b) uncertain human behavior, and c) the complexity of generalizing the system’s model. To this end, the paper introduces the Praxis holistic framework for human action recognition in smart manufacturing. The Praxis framework includes the full set of operations for deploying, training, and maintaining AI models for HAR in production.

The paper is organized as follows: Section "Literature review" examines the relevant literature on HAR methodology. Section "Approach" defines the Praxis structure and the accompanying requirements, which are then addressed in depth in Section "Implementation", where the implementation details are presented. Section "Industrial case study" evaluates the application of the Praxis framework in an industrial use case, and Section "Conclusions" reports the conclusions and future developments.

Literature review

In recent years, there has been an increasing body of literature on human action recognition in various applications (Mahbub & Ahad, 2022) as well as in the manufacturing context (Liu & Wang, 2017, 2018). State-of-the-art approaches in the manufacturing sector cover the following topics:

  • Action recognition model

  • Sensors and data sources

  • Action recognition features

  • Datasets and data availability

A major research direction has been the modeling methods for monitoring human activities. Initially, stochastic approaches such as Hidden Markov Models and Dynamic Bayesian Networks were developed to handle uncertainty (Urgo et al., 2019; Zhao et al., 2019). Owing to the significant advancement of AI and machine learning (ML), several neural network architectures have been investigated, such as convolutional neural networks (CNN) (Wen et al., 2019), long short-term memory recurrent neural networks (LSTM) (Muhammad et al., 2021), and graph convolutional networks (GCN) (Li et al., 2019, 2022). These ML approaches have also been investigated in the closely related field of ergonomics assessment. Ciccarelli et al. (2022) proposed a sensor-independent learning model based on a parallel CNN architecture for identifying and classifying postures in working environments. Their approach does not require special cameras, and a novel explainable machine learning approach was used to gain awareness of the obtained results, checking whether the network had learned spurious associations.

Another aspect investigated around HAR is the choice of data sources. From this perspective, the recent widespread availability of low-cost video camera systems and depth sensors has increased the number of studies on vision-based data (Zhang et al., 2019). Wearable inertial measurement unit (IMU) sensors that provide motion data are also considered in the literature (Dehzangi & Sahu, 2018). One of the main challenges in HAR is the amount and quality of the data. Herrmann et al. (2019) propose a framework that can generate motion data, label it, and efficiently store it in the cloud.

The features used by each model are another important factor. Accurate modeling of human actions is essential for effective HAR systems, and it is important to consider the context of the manufacturing process, temporal dynamics, and spatial relationships between the various objects and the human workers. Several studies have focused only on human features derived from RGB and depth images or on extracted skeleton key points (Liu & Wang, 2018). Information on the objects present in the manufacturing environment is an additional feature included in some HAR applications (Andrianakos et al., 2020).

Finally, from the perspective of data availability, ML-based approaches require large, annotated datasets. Numerous rich datasets exist in the literature for general-purpose activities, but they cannot be applied to the context of manufacturing operations because they specialize in everyday activities. In recent years, labeled datasets of assembly operations have been built, such as Assembly101 (Sener et al., 2022), IKEA ASM (Ben-Shabat et al., 2020), and the recent HA4M dataset (Cicirelli et al., 2022). Most of these datasets contain vision-based data from RGB-D sensors and were generated by researchers in their facilities using specialized testbeds.

Overall, these studies highlight the importance of HAR in manufacturing and point out the following requirements. One of the key requirements is real-time performance, as it is often necessary to recognize actions quickly in order to enable timely response and intervention. In addition, HAR algorithms must be robust to environmental factors such as changing lighting conditions, occlusions, and cluttered backgrounds that are common in manufacturing environments. Another requirement is adaptability to different tasks, as manufacturing operations can involve a wide range of activities, from simple assembly tasks to complex maintenance procedures. HAR algorithms must be able to adapt to different tasks and activities to ensure accurate recognition of actions. Finally, the integration of HAR with other systems is an important consideration, as HAR is often just one component of a larger manufacturing system. The HAR system must be able to integrate with other systems, such as robotics, to enable seamless automation of manufacturing processes.

These requirements highlight the need for HAR algorithms that are not only accurate, but also efficient, adaptable, and integrated with other systems. Meeting these requirements will be crucial for the successful implementation of HAR in manufacturing and the realization of its potential benefits.

In view of all that has been mentioned so far, the requirements of HAR in manufacturing translate into the following technical challenges: a) the precision of different AI algorithms; b) the modeling of HAR characteristics; c) the selection of applicable sensors; and d) the availability of large, annotated training datasets (Table 1). These are the difficulties the presented work aims to overcome. Praxis combines and creates tools to address all of these challenges in a unified, effective, and robust solution. The following sections show how the Praxis framework overcomes the difficulties and limitations of the current state of the art and how it integrates and combines innovative technologies (Table 1).

Table 1 Overview of Literature based on the research focus

Approach

The Praxis framework establishes a way to utilize HAR for creating seamless collaboration between human workers and smart manufacturing systems. Multiple factors must be examined to achieve the optimal performance of a HAR application, which largely depends on the intended manufacturing use case. To be considered holistic, an AI framework must comprise a set of processes that combine machine learning, data engineering, and development operations.

Mainstream machine learning approaches are complex in terms of their required steps. Data collection poses challenges regarding quality and the expert knowledge needed to decide which types of information are usable. Once the data has been collected, it may need to be “cleaned”, transformed, and preprocessed to ensure that it is suitable for a machine learning model. This can involve handling missing or corrupted data frames, normalization, and even proper data augmentation to convert the data into a suitable format. Additionally, choosing a suitable model can be a challenging and time-consuming trial-and-error process because of the large number of parameters that must be tuned properly. Finally, deploying a machine learning model inside a real-world production environment requires careful consideration of factors such as performance, scalability, and security. This may involve integrating the model into an existing application, building a custom deployment pipeline, and monitoring the model’s performance over time.

The objective of the Praxis framework is to provide engineers with workflows, tools, and automation, that abstract away all these unnecessary complexities. To achieve this, Praxis must be hardware, data, and model agnostic. Figure 1 depicts the Praxis framework, which introduces a closed-loop approach to operations, beginning with the manufacturing system’s data collection and ending with the deployment of the HAR inside a real-world production environment.

Fig. 1
figure 1

Praxis framework approach

The Data Collection from the manufacturing system is the first major operation. Flexibility in recording and combining data from different sensors is required, as well as the maintainability needed to continuously update and expand the collection. Praxis Data Collection gathers both single-frame and sequence data. At this step, the data collected from the different sensors are stored, together with their timestamp and sensor origin position, as ROS messages in bag files, enabling the online or offline re-publishing of the sensor data for each assembly procedure.

Additional processing procedures are required to provide a consistent, indexed dataset. The Data Processing phase comprises the actions that transform the raw information into the Praxis Dataset, which is suitable for machine learning modeling. The temporal and spatial calibrations of the data are the first steps that must be performed. Temporal calibration is the process of synchronizing data from sources with varying frequencies in terms of time. Spatial calibration is the process of transforming geometrical data from several sensors into a common reference frame that is defined by the developer.

The Praxis Dataset is indexed by frames. Each frame contains the raw data values from each sensor, expressed in the common origin, based on their timestamp. The structure of the Praxis dataset is depicted in Fig. 2. In addition to the raw data of each sensor, additional characteristics must be extracted for HAR. The current Praxis approach includes two feature detectors: a) the 3D hand landmarks and b) the mean 3D positions of the assembly objects. However, the modular design facilitates the introduction of additional feature extractors from the raw sensor data, if necessary.

Fig. 2
figure 2

Praxis dataset structure

The final processing step for creating a dataset ready for training a HAR model is data labeling. This is a manual process performed by annotators who know the characteristics of the applied case. Each frame in the Praxis dataset is annotated with four types of labels. Two granularities of human activity are considered. Actions are the primitive activities conducted with the hands and an object, such as ‘insert’, ‘reach’, and ‘tighten’. Tasks are coarse activities made up of a series of actions, for example, ‘pick and place’ or ‘screw’. The final two labels are the interacting object and the hand used in the specific frame.

With the dataset ready, training the HAR Model is the next step. The selection of the AI method is a non-trivial process and depends on each use case. The modularity and flexibility of the Praxis design in terms of feature selection and output type allow any kind of input to be adapted to any NN architecture without changing the core implementation. Another consideration is a flexible way to select specific features from the dataset and load them as Tensors of the appropriate size for each NN. For this purpose, custom configurable data generators were implemented in Keras. Four different NN architectures were tested in the scope of this work and validated in an industrial use case, as discussed in Section "Implementation".
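The sketch below illustrates, in Python, what such a configurable Keras data generator could look like; the file layout, column names, window length, and label column are illustrative assumptions rather than the actual Praxis implementation.

```python
import numpy as np
import pandas as pd
from tensorflow.keras.utils import Sequence

class PraxisDataGenerator(Sequence):
    """Configurable generator: selects feature columns and yields fixed-size tensors.

    Hypothetical sketch; column names, window length, and label column are assumptions.
    """

    def __init__(self, csv_path, feature_columns, label_column="action",
                 window=11, batch_size=32):
        super().__init__()
        frames = pd.read_csv(csv_path)
        self.features = frames[feature_columns].to_numpy(dtype=np.float32)
        self.labels = frames[label_column].to_numpy()   # may need encoding for training
        self.window = window
        self.batch_size = batch_size
        # Each sample is a sliding window of consecutive frames.
        self.indices = np.arange(len(frames) - window + 1)

    def __len__(self):
        return int(np.ceil(len(self.indices) / self.batch_size))

    def __getitem__(self, idx):
        batch = self.indices[idx * self.batch_size:(idx + 1) * self.batch_size]
        x = np.stack([self.features[i:i + self.window] for i in batch])
        y = np.stack([self.labels[i + self.window - 1] for i in batch])
        # Add a channel axis so the tensor fits Conv2D-style inputs.
        return x[..., np.newaxis], y
```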

The HAR Model Inference is deployed on a computer unit at the production station that has a direct connection to the selected input sensors. The trained model is first validated offline on a test dataset that is carefully annotated to align with the established ground truth. Emphasis is placed on ensuring that its performance is robust, scalable, and secure, offering an optimal solution for real-world applications. A feedback and optimization loop is also introduced into the system. This includes continual performance monitoring of the models, gathering feedback from their real-time operation, and using this feedback to fine-tune the models for improved accuracy and efficiency. The operators who interact with Praxis can report online faults in action inference through a user interface application. The data collected during real-time operation can then be used to extend the Praxis Dataset and train new models.

Overall, this approach presents a holistic, adaptable, and efficient solution for human action recognition within assembly line production, marking a significant contribution to this field of study. The main points of Praxis’s novelty can be summarized as follows:

  • Unified Framework for HAR in Manufacturing The Praxis framework, unlike many other works in the field, does not solely focus on a single aspect of HAR (like sensor selection, feature extraction, or AI modeling). Instead, it provides an integrated solution, combining advancements in all these areas within a unified framework specifically designed for the manufacturing sector. This enables more efficient and seamless human-robot collaboration in this context.

  • Transformation Algorithm We have expanded on the transformation algorithm for the hand points and the object position in 3D. This is a novel approach in the context of HAR in manufacturing, providing more precise action recognition and better alignment between human and robot actions.

  • Annotation System While some annotation systems exist in the field of HAR, our approach is tailored specifically for manufacturing applications, focusing on the unique requirements of this sector. It is designed to create rich, annotated datasets for HAR, which is crucial for developing and training accurate recognition algorithms.

  • Flexibility and Adaptability Praxis allows for more dynamic and flexible collaboration between human operators and robots by recognizing and adapting to human actions in real-time. This is a significant advancement in the field, moving away from static task allocation towards a more dynamic, responsive system.

  • Integration of HAR with Other Systems Praxis is designed to integrate seamlessly with other manufacturing systems, such as robotics, allowing for more comprehensive automation of manufacturing processes. This integration is not commonly found in other HAR frameworks, making Praxis a significant contribution to the field.

Implementation

In this section, we present the comprehensive implementation details of the Praxis architecture, a framework for AI-driven human action recognition in assembly. The section is organized into four main phases as shown in Fig. 3, each crucial in the development and deployment of the Praxis framework. We begin by describing the Data Collection phase, highlighting the significance of high-quality, operator-generated data and the utilization of various sensors for data capture. Next, we delve into the Data Processing phase, where we discuss the calibration of spatial and temporal data, ensuring accurate positioning and synchronization. Subsequently, we explore the process of Feature Extraction, encompassing 2D hand landmarks and assembly object positions, which contribute to the creation of the Praxis dataset. Additionally, we address security and ethical considerations regarding data processing and augmentation techniques. Moving forward, we elucidate the evaluation of NN architectures using a multi-criteria evaluation algorithm, enabling users to define weight values for key performance criteria. Finally, we provide insights into model training and inference, including the implementation of specialized neural networks for individual task detection, enhancing the flexibility and efficiency of the HAR model. This organized presentation offers a comprehensive understanding of the various components involved in the successful implementation of the Praxis framework for AI-driven human action recognition in assembly tasks.

Fig. 3
figure 3

Praxis workflow

Data collection

Like every data-driven application, the Praxis approach requires high-quality data. In this study, the Data Collection phase was performed by recording 15 people carrying out the manual assembly of the selected use case. The participants ranged in age from 20 to 30 years and had varying skill levels. The use of genuine operator-generated data, as opposed to simulated computer-generated data, is considered obligatory because it incorporates various types of data. The diverse data collected in this way can also prove valuable for future research, particularly in the identification and correction of errors or inaccuracies in specific assemblies.

Four sensors were selected: 1) a static high-definition 3D camera, 2) a wearable 3D camera on the worker’s headset, and 3) two wearable IMU sensors, one on each wrist of the worker. To control and integrate the sensor data, the ROS (Quigley et al., 2009) software system is used. Using ROS, we can capture and save each specific frame of interest in a .bag file format. The recording frequency for all sensors was set at 15 Hz. Each candidate was recorded 3 times, and each recording is considered complete after a successful assembly and disassembly of the product. Before the first attempt, verbal and written instructions for the assembly of the product were given to each candidate. In total, 21,000 data frames were recorded, with a size of 452 GB. The recorded raw data is stored on a local file server.
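The following sketch illustrates this kind of bag recording with the ROS 1 Python API (rospy/rosbag); the topic names, message types, and recording duration are placeholders, not the actual Praxis configuration.

```python
import rospy
import rosbag
from sensor_msgs.msg import Image, Imu

# Placeholder topics for the two cameras and the two wrist IMUs.
TOPICS = {
    "/static_camera/color/image_raw": Image,
    "/headset_camera/color/image_raw": Image,
    "/imu_left_wrist/data": Imu,
    "/imu_right_wrist/data": Imu,
}

def record(bag_path, duration_s):
    """Record all configured topics into a .bag file for later re-publishing."""
    bag = rosbag.Bag(bag_path, "w")

    def make_callback(topic):
        # Store every message together with its arrival time so the streams
        # can be re-published or re-synchronized offline.
        return lambda msg: bag.write(topic, msg, rospy.Time.now())

    rospy.init_node("praxis_recorder")
    subs = [rospy.Subscriber(t, m, make_callback(t)) for t, m in TOPICS.items()]
    rospy.sleep(duration_s)
    for s in subs:
        s.unregister()
    bag.close()

if __name__ == "__main__":
    record("assembly_run_01.bag", duration_s=60.0)
```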

Data processing

The Data Processing phase includes the Data Calibration of the recorded data into a commonly defined geometrical space (spatial calibration) and its synchronization on a common timeline (temporal calibration). As described in the Data Collection phase, we collect four data streams that must be calibrated.

Spatial calibration determines the accurate transformation of the sensors in relation to the origin of the common coordinate system. We define the origin of the coordinate system as the position of a fiducial marker (Fig. 4). The detection and tracking of this marker is robust, fast, and simple and is used to calculate the pose of the vision sensors used in Data Collection. This technique provides the versatility to include moving and static vision sensors and is flexible enough for the quick integration of additional sensors. The IMU data are calibrated based on the hand detection described below.
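The sketch below illustrates how such a marker-based camera pose could be computed; it assumes an ArUco marker and OpenCV (4.7 or later for the ArucoDetector API), whereas the marker type and library actually used in Praxis are not specified here, so this is only an assumed setup.

```python
import cv2
import numpy as np

def camera_pose_from_marker(gray_image, camera_matrix, dist_coeffs, marker_length_m=0.10):
    """Estimate the camera pose relative to a fiducial marker (assumed ArUco).

    Returns a 4x4 transform giving the camera pose in the marker (world-origin)
    frame, or None if the marker is not visible. Marker dictionary and size are
    illustrative assumptions.
    """
    dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_4X4_50)
    detector = cv2.aruco.ArucoDetector(dictionary)
    corners, ids, _ = detector.detectMarkers(gray_image)
    if ids is None or len(ids) == 0:
        return None

    half = marker_length_m / 2.0
    # Marker corner coordinates in the marker's own frame (z = 0 plane),
    # in the top-left, top-right, bottom-right, bottom-left order used by ArUco.
    object_points = np.array([[-half,  half, 0.0],
                              [ half,  half, 0.0],
                              [ half, -half, 0.0],
                              [-half, -half, 0.0]], dtype=np.float32)
    image_points = corners[0].reshape(4, 2).astype(np.float32)
    ok, rvec, tvec = cv2.solvePnP(object_points, image_points, camera_matrix, dist_coeffs)
    if not ok:
        return None

    rotation, _ = cv2.Rodrigues(rvec)           # marker pose in the camera frame
    T_cam_marker = np.eye(4)
    T_cam_marker[:3, :3] = rotation
    T_cam_marker[:3, 3] = tvec.ravel()
    return np.linalg.inv(T_cam_marker)          # camera pose in the marker frame
```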

Fig. 4
figure 4

a Sensors position in common 3D space b The view from the head camera c The view from the static camera

Temporal calibration is responsible for fixing time deviations by synchronizing the parallel data streams of the sensors. The sampling rate of all sensors has been set at 15 Hz. Data mismatches may arise because of frame drops in the sensors. An adaptive algorithm was developed and is used to synchronize the data into a single frame by matching their timestamps (Fig. 5). After the temporal and spatial calibrations are performed, the sensor data are saved again as a synchronized bag file.
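A simplified sketch of the nearest-timestamp matching idea is shown below; the actual adaptive algorithm in Praxis may handle frame drops differently, and the skew tolerance is an illustrative assumption.

```python
import numpy as np

def synchronize(streams, rate_hz=15.0, max_skew_s=0.5 / 15.0):
    """Align multiple sensor streams onto a common 15 Hz timeline.

    `streams` maps a sensor name to a time-ordered list of (timestamp_s, sample)
    tuples. For every tick of the reference timeline, the sample with the closest
    timestamp is selected; ticks where any stream has no sample within
    `max_skew_s` are dropped (frame drop).
    """
    start = max(samples[0][0] for samples in streams.values())
    end = min(samples[-1][0] for samples in streams.values())
    ticks = np.arange(start, end, 1.0 / rate_hz)

    frames = []
    for t in ticks:
        frame = {"timestamp": float(t)}
        for name, samples in streams.items():
            times = np.array([ts for ts, _ in samples])
            i = int(np.argmin(np.abs(times - t)))
            if abs(times[i] - t) > max_skew_s:
                frame = None          # this stream dropped the frame
                break
            frame[name] = samples[i][1]
        if frame is not None:
            frames.append(frame)
    return frames
```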

Fig. 5
figure 5

Temporal calibration on two data streams

Having the data in a synchronized form indexed by frame, the next operation is Feature Extraction. In this work, the hand landmarks as well as the assembly objects’ positions are used to create the complete Praxis dataset. These computations were placed in the Data Processing stage because they constitute the heaviest part of the workload. The MediaPipe (Zhang et al., 2020) library was utilized to extract the 2D hand landmarks. For object detection, we used the Detectron2 (Wu et al., 2019) library, which enables state-of-the-art detection and segmentation of objects in 2D images. An algorithm for transforming the hand points and the object position into 3D space based on the recorded depth image was developed.
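The sketch below shows the standard pinhole back-projection on which such a 2D-to-3D transformation could rely; the intrinsic parameters and the world transform are assumed inputs, and the actual Praxis algorithm may differ in detail.

```python
import numpy as np

def pixel_to_3d(u, v, depth_image, fx, fy, cx, cy):
    """Back-project a 2D landmark (u, v) to a 3D point in the camera frame.

    Uses the pinhole model with the camera intrinsics (fx, fy, cx, cy) and a
    depth image registered to the color image (depth in meters).
    """
    z = float(depth_image[int(round(v)), int(round(u))])
    if z <= 0.0:                      # invalid / missing depth reading
        return None
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

def landmarks_to_world(landmarks_uv, depth_image, intrinsics, T_world_camera):
    """Lift a set of MediaPipe-style pixel landmarks into the common world frame."""
    fx, fy, cx, cy = intrinsics
    points = []
    for u, v in landmarks_uv:
        p_cam = pixel_to_3d(u, v, depth_image, fx, fy, cx, cy)
        if p_cam is None:
            points.append(None)
            continue
        p_world = T_world_camera @ np.append(p_cam, 1.0)   # homogeneous transform
        points.append(p_world[:3])
    return points
```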

Since task and action recognition is the main aspect of this framework, and since the human factor is present, the way data are saved and processed may raise security and ethical questions.

Fig. 6 presents the complete storage policies for the retrieved data. Sensitive information may exist in the raw form, but in all experiments the augmentation techniques and enhancement methods were performed after processing. During this step, no sensitive information was registered, and there was no correlation to the raw form of the data. It is important to note that the complete recorded raw data amounted to 452 GB, whereas the data used to train all the models was 10–12 GB, depending on the augmentation and noise-addition techniques. This is a reduction of roughly 97% in the total amount of data. This “filtering” of the raw data translates into the complete removal of digital information regarding actual human faces, bodies, and any other sensitive information. For gathering the initial dataset for annotation, the users are informed that their recordings will be stored and used for training ML models. In this case, the developers are responsible for maintaining the anonymization of the gathered data.

Fig. 6
figure 6

Sensitive data storing policies

To further enlarge the dataset and remove biases, augmentation techniques were used. The most biased values are the unnormalized distances from the world frame to each of the hand joints and the object positions. The action recognition system must understand the relative distances and the changes between them, instead of overfitting by connecting each class to a fixed distance. To overcome this problem, two augmentation techniques were combined and applied to all the data containing spatial information: Gaussian noise addition and the spatial shifting transformation (SST).

Since the data represent distances with values ranging from \(1cm\) to \(0.7m\), the added Gaussian noise must not interfere with the small, but useful for recognition, deviations in the actual distances. Considering this limitation, the Praxis framework implements a configurable way of creating copies of the dataset with selected Gaussian noise parameters for each copy. In this case, we created 2 copies of the dataset with standard deviations of \(1mm\) and \(2mm\), respectively, both with a mean value of \(0\). This technique tripled the total size of the dataset while maintaining accuracy and expanding the randomness the network has to learn. It should also be noted that, to further unify the augmented data with the real data, a shuffling step was used to normalize the distribution of all the data inside the Praxis Dataset.
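A minimal sketch of creating such Gaussian-noise copies is given below, using the 1 mm and 2 mm standard deviations mentioned above; the array layout and seed handling are illustrative choices.

```python
import numpy as np

def gaussian_copies(spatial_features, sigmas_m=(0.001, 0.002), seed=0):
    """Create augmented copies of spatial features with zero-mean Gaussian noise.

    `spatial_features` is an (n_frames, n_features) array of distances in meters;
    the default standard deviations of 1 mm and 2 mm follow the values reported
    above. Returns the original plus one copy per sigma (tripling the data),
    which should then be shuffled together with the real data.
    """
    rng = np.random.default_rng(seed)
    copies = [spatial_features]
    for sigma in sigmas_m:
        noise = rng.normal(loc=0.0, scale=sigma, size=spatial_features.shape)
        copies.append(spatial_features + noise)
    return np.concatenate(copies, axis=0)
```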

At this stage, applying only noise addition to the dataset is not the optimal approach; the dependence of the recognized actions on fixed distances is still present. To overcome this limitation, a spatial shifting transformation (SST) algorithm was developed. The complete algorithm can be split into two major parts: a) the positional offset addition and b) the rotational offset addition, as presented in Fig. 7. On the left side of the figure, the 3D position and rotation vertices have their initial values relative to the common transformation which, as stated before, comes from the detection marker placed at a fixed location. This initial pose of the complete set of points that define the human hands at a given time can be defined as a combination of two three-dimensional vectors: the positional \(p=({x}_{pos}, {y}_{pos}, {z}_{pos})\) and the rotational \(r=({x}_{rot}, {y}_{rot}, {z}_{rot})\). After the SST, this 3D information is manipulated by adding random offsets \({\delta }_{pos}\) and \({\delta }_{rot}\) to the local position and rotation vectors respectively, resulting in two new vectors:

$$ p^{\prime } = \left( {x_{pos} + \delta_{pos} ,\; y_{pos} + \delta_{pos} ,\; z_{pos} + \delta_{pos} } \right) $$

$$ r^{\prime } = \left( {x_{rot} + \delta_{rot} ,\; y_{rot} + \delta_{rot} ,\; z_{rot} + \delta_{rot} } \right) $$

Fig. 7
figure 7

SST augmentation technique

Both sets \(\left( {p, r} \right)\) and \(\left( {p^{\prime } , r^{\prime } } \right)\) are considered to have the same, common coordinate reference.

This augmentation technique emulates a change of the detection marker’s pose in 3D space; for the HAR model, only the relative positions of hands and objects are the important features. In our case, we used a range of positional offsets \({\delta }_{pos} \in (-0.5m, 0.5m)\) per coordinate axis and a range of rotational offsets \(\delta_{rot} \in \left( { - 70^{^\circ } , 70^{^\circ } } \right)\) for each Euler angle. For the complete augmentation, we also used a canonical distribution for the selection of these positional and rotational offsets. This technique was used to enhance the dataset, after the Gaussian noise addition, with one extra copy. Figure 8 shows all the augmentation steps for artificially enlarging the Praxis Dataset.
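A minimal sketch of the SST is given below; the per-axis uniform sampling of the offsets and the array layout are illustrative assumptions, not necessarily the sampling used in Praxis.

```python
import numpy as np

def spatial_shift(p, r, max_offset_m=0.5, max_angle_deg=70.0, rng=None):
    """Spatial shifting transformation (SST) sketch, following p' and r' above.

    `p` is an (n_points, 3) array of positions in meters and `r` an (n_points, 3)
    array of Euler rotations in degrees, both relative to the common marker frame.
    One random offset per coordinate axis (and per Euler angle) is drawn for the
    whole frame and added to every point, emulating a relocation of the detection
    marker while keeping the relative hand/object geometry intact.
    """
    rng = rng or np.random.default_rng()
    delta_pos = rng.uniform(-max_offset_m, max_offset_m, size=3)
    delta_rot = rng.uniform(-max_angle_deg, max_angle_deg, size=3)
    return p + delta_pos, r + delta_rot
```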

Fig. 8
figure 8

Praxis dataset augmentation

To implement the Praxis framework for HAR, it is necessary to annotate the recorded data. For this reason, we developed a graphical user interface application (Fig. 9) to assist the annotator in quickly labeling the recorded data. This application is designed for labeling video datasets. We defined two levels of action granularity, for which the annotator must specify the start and end times on the shared timeline as well as the interaction object. The first level, termed “Actions”, describes simple hand movements and interactions with a single object. The second level, referred to as “Tasks”, represents more complex activities composed of a series of “Actions”. The combination of the extracted features and the labels for each frame constitutes the Praxis Dataset.

Fig. 9
figure 9

Video annotation software for Praxis Dataset

Praxis dataset

The annotation software in this study generates output as JSON files for each assembly run. Each of these files comprises a list of annotations, detailing the type (whether action or task), the label, related objects, and handedness. Alongside these JSON files, a separate CSV file houses the feature data, with each row signifying a list of ‘flattened’ feature data and the initial row displaying the feature names.

A Simple File System Structure (SFS) is utilized to associate the respective files (such as the synchronized bag files, the feature CSV, and the annotation JSON) of each assembly run within its designated folder. The Custom Keras Data Generators (CDG) serve to retrieve data from multiple files per assembly, subsequently augmenting them to generate the necessary Tensors for both training and testing purposes.
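The sketch below shows how the data of one assembly run could be loaded from such an SFS folder; the file names and JSON field names (type, label, start_s, end_s) are hypothetical and serve only to illustrate how features are associated with annotations.

```python
import json
from pathlib import Path
import pandas as pd

def load_assembly_run(run_dir):
    """Load one assembly run from the simple file-system structure (SFS).

    Assumes each run folder contains a feature CSV (one row per frame) and an
    annotation JSON holding a list of labelled segments; exact names are
    illustrative assumptions.
    """
    run_dir = Path(run_dir)
    features = pd.read_csv(run_dir / "features.csv")
    with open(run_dir / "annotations.json") as f:
        annotations = json.load(f)
    return features, annotations

def frame_labels(features, annotations, fps=15.0):
    """Attach an action label to every frame from the annotated segments (sketch)."""
    labels = ["idle"] * len(features)
    for ann in annotations:
        if ann.get("type") != "action":
            continue
        start = int(ann["start_s"] * fps)
        end = int(ann["end_s"] * fps)
        for i in range(start, min(end, len(labels))):
            labels[i] = ann["label"]
    return labels
```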

While the SFS approach proved adequate during the testing phase, it also presents limitations, including rigidity in storage and sharing capabilities, and a prerequisite for prior knowledge of the structure within the CDG parsing modules. These limitations could be addressed by employing hierarchical data format databases like HDF5. This approach facilitates easier data access, filtering, and retrieval into Tensors for the CDG, given its self-explanatory structure.

Designed for periodic extension with new data captured from assembly runs, the Praxis dataset allows operators to review the system’s detections in real-time and provide feedback in cases of false detection. New feature data, coupled with their corresponding timestamps and the accurately performed task, are saved. The annotation software can then review this data with detections serving as annotations and user-reported errors visible in the timeline. These can then be corrected by annotators and incorporated into the Praxis Dataset, which enables the re-training of models to enhance their accuracy.

Model training and inference

Four different AI architectures for HAR were implemented and tested in the Praxis framework. This demonstrates the ability of the approach to be model agnostic, provided that appropriate dataset manipulation code is implemented in the CDG to adhere to the training input requirements of each NN architecture.

A multi-criteria evaluation algorithm was developed to compare the performance of each NN architecture. The selected criteria are a) training time \({d}_{train}\), b) training accuracy \({a}_{train}\), c) inference accuracy \({a}_{inf}\) (on unknown data), and d) inference delay \({d}_{inf}\). Based on the specifications of the use case in which Praxis will be implemented, the user defines a weight value for each criterion (\({W}_{{d}_{train}}\), \({W}_{{a}_{train}}\), \({W}_{{d}_{inf}}\), \({W}_{{a}_{inf}}\)). Table 2 presents the range of weight values according to the significance that the user wants to assign.

Table 2 Criteria weight significance

Table 3 presents the results of the four NN models that were developed and tested within Praxis. The models were implemented using the TensorFlow library (Abadi et al., 2015) and were trained and deployed on a PC with an NVIDIA RTX-3060 GPU. The inference accuracy is calculated on real, unknown test data, and the delay refers to the average pass-through time needed for the network to output its results.

Table 3 HAR AI-trained models

The only constraint for this evaluation method to give valid results is that:

$$ \mathop \sum \limits_{i} W_{i} = 1 $$
(1)

This reflects the fact that there is always a tradeoff in weight selection. In the current application, we can eliminate the training time factor (\({W}_{{d}_{train}}\)), since it has no significance in this use case and training only happens once. However, in use cases where the “Actions” and “Tasks” are updated frequently, this criterion should have a higher significance weight. For simplicity, we can also eliminate the training accuracy, since it is a value that tends to fluctuate between 0.7 and 0.9 in real-world scenarios; values close to 1 are usually indicators of overfitting, in which case retraining is necessary.

For the evaluation through the weighted-sum method, the temporal criteria should be normalized. The training time was not normalized in this work since it carries no weight. The inference delay (\({d}_{inf}\)) measures the average time a HAR model needs to make a detection on an unseen data stream. The normalized delay \(\widetilde{{d}_{inf}}\) is calculated using the following equation, which is based on an average processing time of 33.3 ms per frame, a commonly accepted threshold for real-time performance at 30 frames per second (FPS).

$$ \widetilde{{d_{inf} }} = 1 - \frac{{d_{inf} }}{33.3ms} $$
(2)

To generalize the weight selection, we decided to select the best model based on an overall approach. Since the remaining nonzero weights are \({W}_{{a}_{inf}}\) and \({W}_{{d}_{inf}}\), and relation (1) always holds, we can relate these two weight variables through a simple relation:

$$ W_{{d_{inf} }} = 1 - W_{{a_{inf} }} $$

For every model we can compute its evaluation score \(E= E({W}_{{d}_{inf}})\). Figure 10 presents this score per model, with the horizontal axis showing the inference delay significance weight. Values close to 0 represent high importance of the inference delay and low importance of the inference accuracy, whereas values close to 1 represent high importance of the inference accuracy and low importance of the inference delay.
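A minimal sketch of this weighted-sum selection is shown below, using the normalized delay of Eq. (2); the exact aggregation used in Praxis may differ, and the accuracy and delay numbers in the example are placeholders, not the values reported in Table 3.

```python
def normalized_delay(d_inf_ms, realtime_ms=33.3):
    """Eq. (2): normalized inference delay, where 1.0 corresponds to zero delay."""
    return 1.0 - d_inf_ms / realtime_ms

def evaluation_score(a_inf, d_inf_ms, w_a_inf):
    """Weighted-sum score with W_d_train = W_a_train = 0 and W_d_inf = 1 - W_a_inf."""
    w_d_inf = 1.0 - w_a_inf
    return w_a_inf * a_inf + w_d_inf * normalized_delay(d_inf_ms)

# Example sweep over the accuracy weight; model metrics below are placeholders.
models = {"CNN": (0.91, 8.0), "Binary CNN": (0.95, 10.0), "LSTM": (0.90, 20.0)}
for w in (0.25, 0.5, 0.75):
    best = max(models, key=lambda m: evaluation_score(*models[m], w_a_inf=w))
    print(f"W_a_inf={w:.2f}: best model = {best}")
```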

Fig. 10
figure 10

Overall NN evaluation scores

The graph shows that, on average, the Binary CNN model performs better than the other models as the inference accuracy weight becomes more significant. According to Table 2, if we assume that the level of significance of the inference delay is set to at least “Considerable”, the selected model will always be the Binary CNN. Furthermore, even in the worst-case scenario, both the CNN and the Binary CNN have an acceptable inference delay that is far below the real-time limit of 33.3 ms.

As for the core implementation, instead of using one convolutional NN (CNN) model to detect all \(n\) actions or tasks, we decided to specialize the neural networks to detect individual tasks. Hence, we created \(n\) neural networks, each one able to detect one specific action. This implementation also increases the flexibility of the HAR model to quickly include new actions or tasks. Figure 11 presents the architecture of a Binary CNN for recognizing one “Action”. The trained models run in parallel and calculate the probability of each action in the current data frame.
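The sketch below illustrates how the per-action binary models could be queried in parallel for one data window; the thresholding and tie-breaking by maximum probability are illustrative choices, not the exact Praxis logic.

```python
import numpy as np

def infer_actions(models, frame_window, threshold=0.5):
    """Run the per-action binary classifiers on one data window.

    `models` maps an action name to its trained binary Keras model; each model
    outputs the probability that its action is occurring in the current window.
    Returns the most probable action, or None if no probability exceeds the threshold.
    """
    batch = frame_window[np.newaxis, ...]                  # add a batch dimension
    probs = {name: float(m.predict(batch, verbose=0)[0, 0])
             for name, m in models.items()}
    best_action, best_prob = max(probs.items(), key=lambda kv: kv[1])
    return (best_action, best_prob) if best_prob >= threshold else (None, best_prob)
```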

Fig. 11
figure 11

Binary CNN architecture

In this specific binary classification model, the input data consist of both hands’ 3D landmarks and the sensor data over a window of 11 frames. This 2D mapping can be treated as a spatial and temporal map from which the features of interest must be extracted. The sequential arrangement of layers comprises three 2D convolutional (Conv2D) layers, two MaxPooling2D layers, a Flatten layer, and three Dense layers. The Conv2D layers apply filters to the input data, allowing the network to detect and learn features of varying complexity and abstraction. The MaxPooling2D layers downsample the feature maps by selecting the maximum value from a region of neighboring cells, thereby reducing the spatial dimensions and providing some invariance to small translations. The Flatten layer reshapes the 3D output of the last Conv2D layer into a 1D vector to ensure compatibility with the subsequent Dense layers, which integrate the learned features and generate predictions based on them. The final Dense layer, designed for binary classification, has a single output neuron with a sigmoid activation function. This topology offers several advantages, including the ability to learn hierarchical features from the input data, reduced computational complexity, and partial invariance to translations. Its suitability for binary classification tasks is further enhanced by the network’s capacity to learn complex, non-linear decision boundaries.
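A minimal Keras sketch of this layer sequence is given below; the number of filters, dense units, and input feature dimension are illustrative assumptions, not the tuned Praxis values.

```python
from tensorflow.keras import layers, models

def build_binary_cnn(window=11, n_features=134):
    """Binary CNN sketch matching the layer sequence described above.

    The input is treated as an 11-frame temporal/spatial map of hand landmarks
    and sensor values with a single channel; n_features and layer sizes are
    placeholders.
    """
    model = models.Sequential([
        layers.Input(shape=(window, n_features, 1)),
        layers.Conv2D(32, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(128, (3, 3), padding="same", activation="relu"),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),     # binary "action present" output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model
```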

Industrial case study

The Praxis framework was deployed and tested in a collaborative assembly station based on a real industrial case from the machinery industry. More specifically, the assembly station focused on the assembly of an air compressor. This product consists of 15 individual parts, as seen in Fig. 12. The weight of each part ranges from 0.5 kg to 1 kg, and the parts are placed in part holders fixed on top of the workbench.

Fig. 12
figure 12

Air compressor assembly parts

For this assembly process, an HRC station was developed with one collaborative robotic (cobot) arm. Figure 13 presents the assembly station where we performed the validation of the system. The operator wears an Augmented Reality (AR) headset, the HoloLens 2 (Microsoft HoloLens2, 2019). The AR headset is used both for egocentric data acquisition from its available depth and image sensors and for providing information to the operator. The human operator works opposite the cobot and shares the assembly tasks with it. The required assembly tasks for this product are described in Table 4. The cobot is equipped with a two-finger gripper and is suitable for performing only “pick and place” tasks.

Fig. 13
figure 13

Praxis testbed for air compressor industrial case

Table 4 Assembly process of air compressor

To evaluate the performance of the Praxis framework, we carried out experimental assembly trials with 11 unique participants. These individuals were new to the assembly process. Three participants’ assembly attempts were selected to create the Praxis Dataset for this specific use case. These individuals performed all the assembly steps, and Fig. 14 displays the distribution of the annotated dataset.

Fig. 14
figure 14

Actions and Tasks included in the industrial case categorized by hand use

The evaluation of Praxis was based on a performance comparison between the “fixed collaboration” approach, without the HAR module, and the “seamless collaboration” approach, where the Praxis HAR was deployed. The remaining eight participants were divided into two groups. Four of them were instructed to work under the “fixed collaboration” assembly, while the other four operated within the “seamless collaboration” assembly. This approach offered a balanced comparison of performance under both conditions.

The “fixed collaboration” approach represents the current collaborative paradigm, in which task allocation is predefined, and operators are responsible for tracking their assembly progress. There is no error detection module during the assembly phase in this approach. The errors detected before completion occurred largely due to the participants’ misconceptions, which contributed to the increased assembly time.

Table 5 presents the results from the “fixed collaboration” experiments. The average assembly time under this collaboration approach is 355 s (approximately 6 min). This is notably longer than the roughly 4-min duration it took for the initial three participants to perform the assembly manually.

Table 5 Assembly performance metrics for participants in fixed collaboration

In the “seamless collaboration” approach, task allocation is dynamic and driven by the real-time actions of the operators (Table 6). The system actively recognizes and responds to the operator’s actions using the Praxis HAR module, aiding in a more fluid and efficient assembly process. The error detection module was also active in this scenario, significantly reducing the number of undetected errors post-assembly.

In this approach, participants were free to perform any assembly task they wanted. The robot responded accordingly, based on the operator’s real-time actions. When errors were detected before the completion of assembly, the additional correction time was less impactful, increasing assembly time by an average of only 0.2 min (12 s) per error. Consequently, the average assembly time was significantly reduced to approximately 246 s (around 4 min), demonstrating the effectiveness of the Praxis HAR module in enhancing assembly efficiency. The table below summarizes the assembly time and errors detected during the ‘seamless collaboration’ approach (Table 6).

Table 6 Assembly performance metrics for participants in seamless collaboration

In conclusion, this real-world application of the Praxis framework has shown significant benefits in terms of reduced assembly time, decreased error rates, and enhanced overall productivity. These improvements highlight the potential of Praxis in advancing efficient and quality-controlled collaborative manufacturing processes.

Conclusions

In the presented research, Praxis, an AI-driven framework for human action recognition in assembly processes, was introduced. The manuscript first provides a comprehensive review of the related literature, identifying research gaps and the unique novelty of the proposed approach. The conceptualization, training, and implementation of the Human Action Recognition (HAR) model within the Praxis framework are then elucidated.

The study emphasizes the transformative potential of artificial intelligence in collaborative assembly line operations, particularly the integral role of real-time human action recognition. The integration of the framework within assembly line operations demonstrates its utility in enhancing human and robot collaboration, refining quality control mechanisms, and offering critical insights for the perpetual improvement of manufacturing processes.

The included case study offers a tangible demonstration of the Praxis framework in the air compressor production industry. It highlights how real-time feedback on worker performance, facilitated by the framework, contributes to a noticeable reduction in assembly errors and minimizes the interaction time required for system updates.

In conclusion, the Praxis framework signifies a substantial progression in the application of AI within the manufacturing sector. By offering accurate, real-time recognition and analysis of human actions, it greatly enhances both the efficiency and quality of collaborative assembly line operations. Further, it serves as a valuable tool for continual assembly process improvement through data collection and analysis.


For future work, there are several promising directions. Extending the application of the Praxis framework to various sectors within manufacturing could unlock new insights and productivity gains. Additionally, incorporating more complex and nuanced human actions and interactions could further enhance the model’s recognition abilities. Research into training the model with fewer data or in more unstructured environments could increase its robustness and versatility. Lastly, exploring feedback mechanisms for operators in real-time could amplify the framework’s immediate impact on quality and safety improvements. These steps forward can help cement the role of AI as a critical tool in modern industrial processes.