Reinforcement Learning With Stereo-View Observation for Robust Electronic Component Robotic Insertion

In modern manufacturing, assembly tasks are a major challenge for robotics. The manufacturing industry features a wide range of insertion tasks, from peg-in-hole insertion to electronic parts assembly. Robotic stations designed for this problem often use conventional hybrid force-position control to perform preprogrammed trajectories, such as a spiral path. However, electronic parts require more sophisticated techniques due to their complex geometry and susceptibility to damage. Production line assembly tasks require high robustness to initial position and rotation variations caused by component grip imperfections. Robustness to a partially obscured camera view is also mandatory due to the multi-stage assembly process. We propose a stereo-view method based on reinforcement learning (RL) for the robust assembly of electronic parts. The applicability of our method to real-world production lines is verified through test scenarios. Our approach is the most robust to applied perturbations of all tested methods and can potentially be transferred to environments unseen during learning.


Background
Despite the progress in the robotization of the industry, there are still many assembly tasks that are usually performed manually by production workers. The need for further improvements in production efficiency and cost reduction has inspired research for many years. Most of the work has focused on an idealised assembly model, known as the peg-in-hole problem [1]. However, due to the diversity of assembled components' shapes in the manufacturing industry, this is a subject of ongoing research [2].
In the electronics manufacturing industry, industrial robots face challenges when assembling non-standardised electronic components in through-hole technology (THT). This difficulty comes from the physical properties of these components, which come in various shapes and have differing numbers of leads arranged in non-standardised patterns. The pins are also easily bent due to their susceptibility to applied forces. Furthermore, the clearance of through-hole pins is typically less than 1 mm, depending on the printed circuit board (PCB) design, making inserting electronic parts challenging. Figure 1 presents a close view of the THT component and the effect of damaged pins on the element.
A highly precise force control system is necessary to mitigate potential damage. However, industrial assembly systems must also account for errors introduced by the grasping procedure and the PCB clamping mechanism. The picking error arises from the way electronic parts are fed to the robotic stations through profiled trays with large clearance slots.

Related Work
Robotic stations used for assembly on production lines are based mainly on compliance control systems. These stations use impedance or admittance controllers to control industrial robots [3,4] that perform programmed trajectories while maintaining a constant downforce. Nevertheless, these control systems require manual parameter adjustment, which is time-consuming.

Fig. 1 A close view of the THT component with one of the pins bent
The methods based on the compliance control system require high sensor precision and well-prepared robotic stations. However, machine vision and deep learning can improve the performance of insertion tasks, even when dealing with imperfect hardware or station construction. For example, Huang et al. [5] propose a pure machine vision system with a feedback rate of 1000 Hz to align a peg in a hole instead of a force control system. Meanwhile, deep learning-based methods for insertion tasks are presented by Triyonoputro et al. [6] and Yu et al. [7]. Both solutions use a convolutional neural network (CNN) to precisely compute the pose of the insertion target. Moreover, Triyonoputro et al. [6] use two images captured by the vision tool attached to the robot end effector. The trajectory algorithm then uses this computed pose as input.
Recently, reinforcement learning (RL) has gained attention as a solution to assembly problems [8-11]. These works utilise RL algorithms for much more complex assembly tasks than peg-in-hole insertion, such as connector or gear insertion. In all these methods, the RL agent commands a robot controlled by an impedance or an admittance control system. The agent's observation mainly consists of proprioception and 6-axis force-torque (F/T) data. Another group contains RL methods that rely solely on visual information. In the work of Schoettler et al. [11], the agent acquires the image from an external camera placed in the workspace. In [12,13], visual information is captured from a single wrist camera and preprocessed by a complex neural network pre-trained in a supervised manner.
The aforementioned works have addressed peg-in-hole or connector insertion tasks. However, the assembly of electronic parts is considered a multiple peg-in-hole task, which is more challenging for industrial robots due to the complex geometry of the object [14,15]. To tackle this issue, Hou et al. [16] have proposed an RL-based method that employs the DDPG algorithm [17] supported by a fuzzy logic system and variable time-scale prediction. In this method, the RL agent computes a 6-dimensional action representing translations and rotations along the XYZ-axes, based on the object's pose relative to the target and the 6-axis F/T data. Similar approaches have been introduced by Hou et al. [18] and Xu et al. [19]. In both works, DDPG is the core algorithm, and the policy network's output is used to correct the control signal computed by a manually tuned PD force controller. However, these works differ in the reward function they use. Xu et al. [19] propose a fuzzy reward system instead of a complex handcrafted reward function. The improvements proposed in these works are intended to accelerate training and achieve safe exploration. However, the manipulated objects used in those works were solid metal blocks that were more resistant to damage than electronic components. Additionally, in the presented experiments, those objects were rigidly attached to the robot's end-effector, simplifying the problem by reducing the impact of uncertainties from the grasping procedure.
In contrast, Ma et al. [20] propose a reinforcement learning-based solution for the assembly of electronic components, such as the pin header. Their solution uses high-quality cameras and a precise 6-axis F/T sensor. Nevertheless, the agent's observation space consists of only F/T data, and the cameras are used for the pre-policy control step. Unlike the abovementioned works, the action space comprises only translations along the XYZ-axes.

Contribution
This paper presents a method for the precise assembly of non-standardised THT electronic components using reinforcement learning and stereo-view observation. We employ Soft Actor-Critic (SAC) [21] as the core algorithm, as it has been shown [22] to be more efficient for real-world robotics applications than other continuous control algorithms such as PPO [23] and TD3 [24]. We refer to our method as SAC with stereo-view observation space (SAC-SV). Furthermore, we used two separate convolutional neural networks to extract features from the input images, as opposed to the single network used in previous work [6].
Moreover, this work provides test scenarios that are suitable for evaluating the potential of a method to be used in a real-world production line. We collected the requirements specified by production personnel and identified potential sources of errors that could occur in robotic stations. Our test scenarios assess the robustness of the methods to position and rotation disturbances. Furthermore, the proposed procedures verify whether the trained agent can be transferred from experimental to production scenarios. Specifically, for the electronic parts assembly task, we check if the policy trained on the empty PCB can be applied to the PCB after the automatic assembly stages.
The paper is organised as follows. Section 2 introduces a method for assembling THT electronic parts. We start by describing the environment for the electronic component assembly task. Next, we present the industrial robot control system and the technique for asynchronous learning. Section 3 presents the experiments along with a detailed description of the robotic system used for the experiments and the training procedure. In this section, we compare SAC-SV to state-of-the-art approaches for vision-driven reinforcement learning that use a single-camera vision system, either an external camera or a tool-mounted one. We also validate our solutions against conventional methods and the force-based RL method. Following these experiments, we report the performance of the proposed method in transferring the policy trained on the empty PCB to a scenario with a partially assembled PCB. Finally, we discuss the results obtained and plans for future work.

Reinforcement Learning
We model our problem as a standard RL setting [25], where the interaction between the agent and the environment can be described as a Markov decision process (MDP). In each discrete timestep t, the agent is in state s_t ∈ S, performs the action a_t ∈ A, and receives the scalar reward r_t sampled from the reward function r_t(s_t, a_t), where S defines the state space of the environment and A defines the continuous action space. After performing an action a_t, the environment moves to the next state, which is drawn at random from an unknown state transition distribution s_{t+1} ∼ p(· | s_t, a_t). The objective of the RL agent is to learn the policy a_t = π(s_t) from the collected data by maximising the expected return R = ∑_{t=0}^{T} γ^t r_t, where T is the length of the planned trajectory and γ ∈ (0, 1) is a discount factor.
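The expected-return objective above can be sketched numerically. The following minimal example computes R = ∑_{t=0}^{T} γ^t r_t for one finite episode; the reward values and the discount factor are illustrative, not taken from the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Compute R = sum_t gamma^t * r_t over one episode."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# A toy episode: small shaped penalties, then a terminal bonus.
episode = [-0.1, -0.05, 10.0]
print(discounted_return(episode, gamma=0.99))  # -0.1 - 0.0495 + 9.801 = 9.6515
```

Because γ < 1, rewards collected earlier in the episode contribute more to the return, which is what pushes the agent toward fast insertion.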

Soft Actor-Critic
The Soft Actor-Critic [21] is an actor-critic off-policy algorithm based on maximum entropy. Entropy controls the agent's exploration ability by augmenting the reward at each step. SAC uses neural networks as approximations for the soft Q-function Q_ψ(s_t, a_t) and the policy π_φ(a_t | s_t), parameterized respectively by ψ and φ. The soft Q-function parameters can be updated by minimising the soft Bellman residual

J_Q(ψ) = E_{(s_t, a_t)∼D} [ ½ (Q_ψ(s_t, a_t) − (r_t + γ V_{ψ̄_1, ψ̄_2}(s_{t+1})))² ],

where D is an experience replay buffer that stores transitions (s_t, a_t, r_t, s_{t+1}), and V_{ψ̄_1, ψ̄_2}(s_{t+1}) denotes the value function implicitly defined by the Q-functions and the policy as

V_{ψ̄_1, ψ̄_2}(s_{t+1}) = E_{a_{t+1}∼π_φ} [ min_{i=1,2} Q_{ψ̄_i}(s_{t+1}, a_{t+1}) − α log π_φ(a_{t+1} | s_{t+1}) ].

The policy parameters φ are optimised by minimising

J_π(φ) = E_{s_t∼D, a_t∼π_φ} [ α log π_φ(a_t | s_t) − min_{i=1,2} Q_{ψ_i}(s_t, a_t) ],

where actions are sampled using the reparameterization trick [26]. Finally, the entropy temperature coefficient α can be fixed during training or dynamically adjusted, as proposed by [27]. This coefficient can be optimised by solving the following objective:

J(α) = E_{a_t∼π_φ} [ −α log π_φ(a_t | s_t) − α H̄ ],

where H̄ is a target entropy that is usually set empirically to H̄ = −dim(A). We follow the implementation proposed by Haarnoja et al. [27], where two soft Q-functions with independent parameters ψ_i are used to mitigate positive bias in the policy improvement steps. The target Q-value is computed by taking the minimum of the Q-function approximations. Both networks are independently optimised by solving the J_{Q_i}(ψ_i) objectives.
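The automatic temperature adjustment can be illustrated with a single scalar gradient step on J(α). This is a minimal sketch, not the paper's implementation: the learning rate and the sampled log-probabilities are assumed values, and log α is optimised so that α stays positive.

```python
import math

def alpha_update_step(log_alpha, log_pi_samples, target_entropy, lr=3e-4):
    """One gradient-descent step on J(alpha) = E[-alpha*(log_pi + H_bar)].

    Since alpha = exp(log_alpha), the gradient w.r.t. log_alpha is
    -alpha * mean(log_pi + target_entropy).
    """
    alpha = math.exp(log_alpha)
    mean_term = sum(lp + target_entropy for lp in log_pi_samples) / len(log_pi_samples)
    grad = -alpha * mean_term
    return log_alpha - lr * grad

target_h = -4.0  # e.g. -dim(A) for a 4-dimensional action space

# Policy entropy (-log_pi = -6) below target: alpha grows, boosting exploration.
log_a_up = alpha_update_step(0.0, [6.0], target_h)
# Policy entropy (-log_pi = 10) above target: alpha shrinks.
log_a_down = alpha_update_step(0.0, [-10.0], target_h)
```

The mechanism acts as a feedback loop: whenever the policy's entropy falls below the target H̄, the entropy bonus is strengthened, and vice versa.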

Assembly Process Environment
In our environment for the electronic parts assembly task, the RL agent's observation contains two images acquired from cameras attached to the end-effector. Figure 2 illustrates the concept of obtaining these images from this vision tool. The cameras' view angle was empirically chosen to achieve a view that gives information about the assembly place and its surroundings. Each output image is of size 1024×512 pixels in the RGB colour space. However, to enable the use of the neural network in a real-time control loop, the images are resized to a resolution of 128×128 pixels.

Fig. 2 Visualisation of image acquisition from the vision tool
Due to the specificity of the task and direct control of the real robot, we have implemented constraints for the RL agent. A workspace is defined as a cylinder with a radius of 7 mm and a limitless height. The insertion pose of electronic parts determines the reference frame of the workspace. Additionally, the assumed maximum rotation about each axis is 15°. At the start of each episode, the agent is positioned 2 mm above the PCB surface.
Each trial lasts up to 50 steps. During a single step, the agent executes the action a_t = [Δx_t, Δy_t, Δz_t, Δθ^z_t], where Δx_t, Δy_t, Δz_t are displacements along the XYZ-axes, respectively, and Δθ^z_t is a rotation about the Z-axis. At non-terminal steps, the reward is a decreasing function of d scaled by a reward sensitivity coefficient α, where d is the ℓ2-distance between the tool center point (TCP) and the full insertion pose. Full insertion occurs when the robot reaches the assembly position in the XY-axes and the defined position in the Z-axis below the surface of the PCB. To ensure safety, we designed a penalisation mechanism in the environment. The episode is interrupted when the agent leaves the workspace, exceeds the time limit, or exceeds the rotation limit. If the episode is terminated due to leaving the workspace or exceeding the time limit, the agent receives the same reward as during non-terminal steps. If termination occurs due to exceeding the rotation limit, the agent receives a reward of r = −2 to prevent damage to camera cables connected to the vision tool. The task is completed when the relative position of the TCP p_z on the Z-axis is less than or equal to 0.0 mm, which means that the electronic part is inserted into the target position. In this situation, the reward received is r = 10.0.
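The reward logic above can be sketched as follows. The terminal values (+10 on insertion, −2 on exceeding the rotation limit) come from the text; the exact distance shaping is not specified in this section, so the `-tanh(alpha * d)` form and the default `alpha` used here are illustrative assumptions.

```python
import math

def step_reward(tcp_pose, target_pose, *, inserted, left_workspace,
                timed_out, rotation_exceeded, alpha=1.0):
    """Sketch of the environment's reward function.

    tcp_pose / target_pose: XYZ positions in metres.
    The -tanh(alpha*d) shaping is an assumption, not the paper's formula.
    """
    if inserted:
        return 10.0            # task completed: p_z <= 0.0 mm
    if rotation_exceeded:
        return -2.0            # protects the camera cables
    # Leaving the workspace or timing out terminates the episode but
    # yields the same shaped reward as an ordinary non-terminal step.
    d = math.dist(tcp_pose, target_pose)   # l2-distance TCP -> insertion pose
    return -math.tanh(alpha * d)
```

With this shaping the penalty is bounded in (−1, 0), so a single terminal bonus of +10 still dominates the return of a successful episode.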

Smart Assembling
We developed an RL-based method for assembling electronic parts, integrating the SAC algorithm with an admittance control system. The presented system consists of two control loops. The block diagram is presented in Fig. 3. In the outer loop, the controlling element is the SAC algorithm, and in the inner control loop, the admittance control system [28] is used.
The RL agent sends commands to the admittance controller at 10 Hz, receiving feedback information at the same frequency. The pose information required for the reward function is received from the admittance controller. At the same time, the images are acquired from the camera drivers running independently from the controller. To ensure a reliable real-time control loop that sends commands at a given frequency, we integrated the SAC algorithm with a distributed learning architecture called Ape-X [29]. In this architecture, multiple actors are spawned, each with its own instance of the environment. These actors generate the experiences and store them in a shared replay buffer. The learner samples mini-batches from this shared replay buffer and updates the network parameters. The actors' parameters are periodically synchronised with the latest learner's parameters. Our experiments were conducted with only one robot, so we used the Ape-X architecture with one spawned actor.
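The single-actor data flow described above can be sketched schematically. This is an illustrative, synchronous stand-in: in the real Ape-X setup the actor and learner run asynchronously in separate processes, and the rollout length (10) and synchronisation interval (50 steps) are taken from the training section, while the transition contents are placeholders.

```python
import random
from collections import deque

class ReplayBuffer:
    """Shared experience replay buffer (simplified, no prioritisation)."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add_rollout(self, transitions):
        self.buffer.extend(transitions)

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

def run_actor(env_steps=200, rollout_len=10, sync_every=50):
    """One actor collecting experience and periodically syncing weights."""
    buffer = ReplayBuffer()
    rollout, syncs = [], 0
    for step in range(1, env_steps + 1):
        transition = (f"s{step}", "a", 0.0, f"s{step + 1}")  # placeholder (s, a, r, s')
        rollout.append(transition)
        if len(rollout) == rollout_len:   # ship the rollout to the shared buffer
            buffer.add_rollout(rollout)
            rollout = []
        if step % sync_every == 0:        # pull the latest learner parameters
            syncs += 1
    return buffer, syncs

buf, n_syncs = run_actor()
```

Decoupling experience collection from gradient updates is what lets the 10 Hz control loop keep its deadline regardless of how long a learner update takes.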

Policy Model
We represented the control policy as a neural network, as introduced in Section 2.2. As described in Section 2.3, the observation space consists of RGB images. Each image is pre-processed by an independent 5-layer convolutional neural network (CNN) with filters of size 32. Then, the computed features are concatenated and passed directly to the actor π_φ and the critic Q_ψ. Both function approximators are neural networks with two fully connected layers of 256 neurons each. For every layer in the model, we use LeakyReLU [30] as an activation function. The concept diagram of the model is depicted in Fig. 4. The CNN backbones are shared between the actor and critic networks in variants with visual information. We followed the optimisation procedure proposed by Yarats et al. [31], where the parameters of the vision network are updated by the gradient calculated from the critic loss function.
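The tensor flow through this model can be sketched with NumPy. This is purely a shape-level illustration: the `backbone` below is a placeholder (channel pooling plus a linear projection) standing in for the actual 5-layer CNN, and the 32-dimensional feature size per camera is an assumption; only the dual-backbone/concatenate/two-dense-layer structure and the LeakyReLU activation follow the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, negative_slope=0.01):
    return np.where(x > 0.0, x, negative_slope * x)

def backbone(image, w):
    """Placeholder for the 5-layer CNN: channel-mean pooling + projection.
    Shown only to trace tensor shapes, not the real architecture."""
    pooled = image.mean(axis=(0, 1))      # (3,) per-channel means
    return leaky_relu(pooled @ w)         # (feat_dim,) feature vector

feat_dim, hidden = 32, 256                # feat_dim is an assumed size
w_left = rng.normal(size=(3, feat_dim))   # independent parameters per camera
w_right = rng.normal(size=(3, feat_dim))
left = rng.random((128, 128, 3))          # resized stereo-view observations
right = rng.random((128, 128, 3))

# One independent backbone per camera, then feature concatenation.
features = np.concatenate([backbone(left, w_left), backbone(right, w_right)])

# Two fully connected layers of 256 units (actor/critic output heads omitted).
w1 = rng.normal(size=(2 * feat_dim, hidden))
w2 = rng.normal(size=(hidden, hidden))
h = leaky_relu(leaky_relu(features @ w1) @ w2)
```

The key design point visible here is that each camera gets its own feature extractor, so occlusion in one view degrades only half of the concatenated feature vector.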

Admittance Controller
We implemented a standard admittance controller [28] operating in the task space to safely assemble electronic parts susceptible to applied forces. This controller is part of the control scheme depicted in Fig. 3. Compared to hybrid force-position control, this control system allowed us to control the robot with high precision in the task space and minimise the contact force detected during trajectory execution. The admittance controller is described by

M ẍ + D ẋ + K x = W_ext,

where K, D, and M represent stiffness, damping, and inertia matrices, respectively, and W_ext = [F, τ] represents the external wrench measured by the F/T sensor. The coefficients d_ij of the damping matrix can be calculated using the formula d_ij = 2 ζ_ij √(m_ij k_ij), where ζ_ij is a damping ratio for each degree of freedom of the control system. The control acceleration is given by

ẍ = M⁻¹ (W_ext − D ẋ − K x).

The control signal is computed by integrating the acceleration ẍ and then the obtained velocity ẋ. Finally, the robot's joint positions q(t) are computed from inverse kinematics applied to the resulting control output.
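A one-degree-of-freedom sketch of this integration scheme is shown below. The stiffness, inertia, and damping-ratio values match the translational entries reported in the experimental section; the integration timestep, horizon, and the semi-implicit Euler scheme itself are assumptions for illustration.

```python
import math

def simulate_admittance_1dof(f_ext, *, k=1000.0, m=3.0, zeta=2.8,
                             dt=0.001, steps=5000):
    """Integrate m*x'' + d*x' + k*x = f_ext for one translational DOF.

    d is derived from the damping-ratio formula d = 2*zeta*sqrt(m*k).
    dt and the horizon are illustrative, not the controller's real rate.
    """
    d = 2.0 * zeta * math.sqrt(m * k)
    x, v = 0.0, 0.0
    for _ in range(steps):
        a = (f_ext - d * v - k * x) / m   # control acceleration
        v += a * dt                       # integrate acceleration -> velocity
        x += v * dt                       # integrate velocity -> displacement
    return x

# A constant 2 N contact force drives the compliant displacement
# toward the steady state f_ext / k = 0.002 m.
x_final = simulate_admittance_1dof(2.0)
```

Because ζ = 2.8 makes the virtual system heavily overdamped, the displacement settles toward f_ext/k without oscillation, which is exactly the behaviour wanted when a pin meets the PCB surface.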

Experimental Setup
We built a real-world laboratory stand to carry out the experiments, depicted in Fig. 5. This laboratory stand consists of the following devices: a Universal Robots UR5e-series industrial robot, a servo-electric gripper, a 6-axis F/T sensor, and a custom-made tool with vision sensors. We placed PCB panels and electronic components in the robot's workspace. On production lines, PCBs are delivered in panels, where a single panel can contain different numbers of boards. The electronic parts are placed on 3D-printed trays. Such a setup allowed us to ensure conditions similar to those on the production line. In our experiments, we used two distinct PCB panels, one designed for research purposes and one sourced from the production line. We selected three types of electronic parts, namely: component type 1a/b (Fig. 6a and b), component type 2 (Fig. 6c), and component type 3 (Fig. 6d). The letters a and b denote the various types of PCB for specific electronic parts. These elements differ in their geometry, appearance, and arrangement of the leads. Figure 6 presents the electronic parts and their corresponding insertion places.
We used ROS 2 [32] middleware to control the industrial robot and operate peripheral devices. Furthermore, we used RLlib [33] to manage the learning process and implement the RL agents. The advantage of RLlib is the availability of out-of-the-box software for distributed algorithms like Ape-X. We performed all experiments on a workstation with an NVIDIA Titan X GPU.

Training and Evaluation
During the training process, the agent's task was to insert the electronic component into the target pose on the PCB. The agent was trained for 50000 steps in an asynchronous manner, as described in Section 2.4, which took an average of 3 hours. In this setup, the actor sends a rollout of 10 transitions to the replay buffer, while the learner synchronises model parameters every 50 environment steps. Each episode began with picking an electronic element from the tray. Then the robot moved to the initial pose, which is the electronic part's assembly pose 2 mm above the PCB surface. Moreover, to ensure that each episode was unique and to improve the robustness of the RL agent, the initial position on the XY-axes and the initial rotation on the Z-axis were disturbed with noise sampled, respectively, from p^xy_noise ∼ U(−2, 2) mm and θ^z_noise ∼ U(−2°, 2°). After the termination of the episode, the robot would return the grasped electronic part and pick up another one.
Next, we evaluated each trained model with respect to its robustness to environmental disturbance. We designed a test scenario that reflects the cumulative errors in the production machinery. There are three primary sources of errors in a robotic assembly system on the production line: determining the picking pose of the electronic parts placed in the trays, the picking procedure with universal fingers, and the precision of the panel clamping system.
The test scenario consisted of 7 tests. These tests differed in the range of the continuous uniform distribution applied to the initial position on the XY-axes and the initial rotation on the Z-axis. At first, the model was evaluated without any applied noise. Afterwards, the robustness of the model was tested against position and rotation disturbances by applying noise sampled from uniform distributions of increasing range. Finally, the model was subjected to tests with compound perturbations. For each test, we ran 100 insertion trials. During the evaluation, we collected data on the insertion status (success or failure) and the assembly time of successfully completed trials. On the basis of these data, we calculated the success rate and the average assembly time for each test.
For every experiment, we set the admittance controller's desired stiffness and inertia as the following diagonal matrices: K = diag{1000, 1000, 1000, 20, 20, 20} and M = diag{3, 3, 3, 0.04, 0.04, 0.04}, with a damping ratio ζ = 2.8 for every degree of freedom. To ensure smooth motion and stability of the control system, we limited linear velocities to 0.1 m/s and angular velocities to 1.0 rad/s. Moreover, we used a low-pass filter with a cut-off frequency of 25 Hz for data acquired from the F/T sensor and applied the following constraints: 5 N for forces and 1 Nm for torques. We used the default joint-torque limits provided by the vendor. The parameter values were empirically determined to maximise movement speed while maintaining compliance. Further details on the SAC algorithm hyperparameters are available in Table 1.
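From these stiffness and inertia diagonals and the damping ratio, the damping matrix follows directly via d_ii = 2 ζ √(m_ii k_ii). The sketch below computes the resulting diagonal; the helper function itself is ours, but the numeric inputs are the values stated above.

```python
import math

def damping_diagonal(k_diag, m_diag, zeta):
    """Diagonal of the damping matrix D via d_ii = 2*zeta*sqrt(m_ii*k_ii)."""
    return [2.0 * zeta * math.sqrt(m * k) for k, m in zip(k_diag, m_diag)]

K = [1000.0, 1000.0, 1000.0, 20.0, 20.0, 20.0]   # stiffness diagonal
M = [3.0, 3.0, 3.0, 0.04, 0.04, 0.04]            # inertia diagonal
D = damping_diagonal(K, M, zeta=2.8)
# Translational DOFs get d ~ 306.7, rotational DOFs d ~ 5.01.
```

Deriving D from K, M, and a single ζ means only two matrices and one scalar need manual tuning, rather than all three matrices independently.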

Performance Comparison
In these experiments, we evaluate the suitability of our dual-camera robotic vision system for assembling non-standardised electronic parts by comparing the method to other vision-driven RL methods with different visual information sources, as well as to conventional methods widely used in industry. The following methods and setups are compared:

1) Straight down - The robot moves straight down from the starting position until successful insertion or until exceeding the contact force limit, which we set to 2 N.

2) Random search - The robot moves randomly in the XY plane until the insertion or trial termination signal is detected. The displacements on those axes are sampled from the uniform distribution. The force controller controls the displacement in the Z-axis by holding a constant contact force of 2 N.

3) Spiral search - The robot follows a spiral trajectory [34] on the XY plane until successful insertion or a trial termination signal occurs. The force controller controls the displacement in the Z-axis by holding a constant contact force of 2 N.

4) SAC with the combined view (SAC-CV) - SAC-CV, like our SAC-SV, learns a policy that takes multiple images as input to the neural network. However, the input images are combined into one, as presented by Triyonoputro et al. [6]. A detailed description of this operation is given in Appendix A.1.

5) SAC with the mono view (SAC-MV) - SAC-MV uses an image acquired from a single-camera vision system attached to the robot's end-effector, as presented in [12]. However, in our experiments, we used SAC instead of DDPG, which was originally introduced in the work mentioned.

6) SAC with the external view (SAC-EV) - SAC-EV differs from the previous method in the source of visual information. Images for the action computation are acquired from an external camera placed in the robot workspace [11]. The testbed for this experiment is presented in Appendix A.2. In this setup, the camera's field of view was set so that the agent could see only one PCB from the entire panel, which consists of a group of PCBs.

7) SAC with F/T feedback (SAC-Force) - SAC-Force [20] differs from the previous methods by taking 6-axis F/T data [F_x, F_y, F_z, M_x, M_y, M_z] to compute the output action. Input F/T data were acquired by averaging 24 received samples. Here, we use the same actor and critic neural networks as the vision-based agents.
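The spiral search baseline (method 3 above) can be sketched as an Archimedean spiral bounded by the 7 mm workspace radius. The pitch and angular step below are illustrative assumptions, not the values used in the paper.

```python
import math

def spiral_waypoints(pitch_mm=0.2, step_rad=0.3, max_radius_mm=7.0):
    """XY waypoints of an Archimedean spiral r = pitch * theta / (2*pi),
    truncated at the cylindrical workspace boundary."""
    points, theta = [], 0.0
    while True:
        r = pitch_mm * theta / (2.0 * math.pi)
        if r > max_radius_mm:
            break
        points.append((r * math.cos(theta), r * math.sin(theta)))
        theta += step_rad
    return points

path = spiral_waypoints()   # starts at (0, 0), spirals outward to ~7 mm
```

A pitch smaller than the hole clearance guarantees the spiral cannot skip over the hole, which is why spiral search works at all when the initial position error is bounded.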
In this section, we present results only for component type 1a (Fig. 6a), while in Appendix B, we show results for the remaining electronic parts. We followed all the evaluation procedures described in Section 3.2 for all the experiments. The methods mentioned were evaluated based on the success rate and the average assembly time. The results obtained for component type 1a are reported in Table 2.
Due to the approach to image acquisition used by SAC-EV, we have only provided the results of the tests performed on the PCB used for training. During the experiments, we also evaluated the effectiveness of this method on other PCBs from the panel. However, we omitted them in Table 2 since the agent was unable to perform the task. Additionally, we attempted to train an RL agent whose external camera returns the image of the entire panel rather than a single PCB. However, the agent barely achieved more than 20% efficiency during learning. Therefore, we decided to stop further experiments with it.
The only methods that were able to achieve an almost 100% success rate were those driven by the visual observation space. Moreover, our method was the most reliable among them. The RL agent with F/T feedback information performed poorly in all test scenarios. We assume this was caused by the fact that Ma et al. [20] performed their experiments on precise electronic parts assembly tasks with a more complex test bed, consisting of a specialised industrial robot and high-resolution cameras. Furthermore, their experimental object was a pin header with pins aligned in a line instead of a more complex pattern.
Additionally, we examined another approach to the stereo-view observation space, called SAC-CV. We modified the image processing technique presented by Triyonoputro et al. [6] to fit the reinforcement learning domain. SAC-CV achieved results comparable to our method and even scored slightly lower average assembly times in the test scenarios with minor perturbations. However, overall, SAC-SV is more robust to the increasing applied disturbances. In terms of production applications, our solution does not require additional devices, such as external vision systems, to reduce the errors that occur.
The above results show that the insertion of electronic components is a challenge for conventional methods. The holes for non-standardised THT electronic parts have tight clearance, significantly complicating the assembly process. Traditional techniques perform programmed trajectories, which rely only on pose feedback information; therefore, their effectiveness decreases with increasing disturbances. Furthermore, the standard implementation of these algorithms cannot handle compound and orientation perturbations. With the feedback from the vision system, our method learns features that make it robust against all applied disturbances. In addition, off-policy algorithms like SAC decide on the next action at each time step, allowing the trajectory to be corrected quickly.

Table 2 Each method was validated over 100 insertion trials. Conventional methods such as spiral or random search cannot achieve high success rates due to the complexity of the task; only vision-driven RL-based solutions achieve high success rates. Nevertheless, our method is the most robust to the applied disturbances. The bottom part of the table shows the average assembly times with standard deviations. Bold entries highlight the best results in a single column to improve visualization for the reader: in the top part, the positions with the highest success rates are bolded; in the bottom part, the positions with the lowest assembly times are highlighted

Transfer to Partially Assembled PCB
In the previous experiments, the agents were trained on the empty PCB. However, in real-world scenarios, the final production stage is the assembly of the non-standardised THT parts. The ideal approach would be to train the policy offline, outside the production line, and then transfer it to the robotic station at the factory. In this particular experiment, we evaluated vision-based RL agents in terms of their possible transferability to the partially assembled PCB (Fig. 7). Table 3 presents the obtained results. All RL agents trained from scratch with the visual feedback acquired from the vision sensors attached to the end-effector achieved an almost 100% success rate. However, the performance of the variant using an external camera as the observation source was significantly worse because other components occluded the assembly place. When evaluating the performance of the policy transfer from an empty PCB, none of the agents achieved a 100% success rate. Nevertheless, our method achieved the highest efficiency among them. This experiment showed that the stereo-view observation space can achieve relatively good transfer efficiency without additional modifications, such as input enhancements.

Real-World Applicability
We analysed the vision-based RL methods presented in Section 3.3 in terms of their usage on real-world production lines. In production scenarios, multiple PCBs are packed into a single panel. Hence, a policy that operates on visual information acquired from an external camera placed in the robot workspace scales poorly to PCBs not used during training. Each PCB from the panel would require a separate camera, and it is challenging to acquire similar images across all vision sensors. RL algorithms are known to be sensitive to changes in the observation space [35,36]. A slight difference in the background causes a significant decrease in performance. We confirmed this statement through experiments in which SAC-EV failed to insert any electronic part on the other PCBs of the panel.
In contrast, methods that use visual information acquired from the vision sensor attached to the robot tool provide a stable background independent of the location of the PCB. The results presented in the previous sections showed that our method is the most robust to pose and visual disturbances. Therefore, SAC-SV meets the requirements of applicability in real-world production applications.

Conclusion
This paper presents a stereo-view RL approach for the electronic part insertion task. We chose Soft Actor-Critic as our core algorithm. Our experiments show that vision-driven RL methods, combined with a compliance control system, can assemble delicate components that are vulnerable to applied forces. We evaluated the performance of the RL agent with different visual information sources used in existing works. All of them achieved a success rate of more than 95% for test scenarios with low values of applied disturbances of the initial position and rotation on the Z-axis. However, when the disturbances' limits were increased, our method outperformed the others in terms of the percentage of tasks successfully completed.
We also showed that our stereo vision system attached to the robot's end-effector, together with the method for extracting features from the stereo-view observation space, is more suitable for real production scenarios than the configuration with an external camera. In the case of the method that uses an external camera for image acquisition, the camera's field of view is set up for only one PCB from the panel. This method achieved a high success rate only on the PCB used for training and failed to complete the insertion task on the other PCBs of the panel. The dual-camera vision system focuses only on the assembly place and its surroundings.
Following the experiments that evaluated performance against pose disturbances, we also examined the transferability of the policy trained on the empty PCB to the partially assembled PCB. The results showed that our method achieved the best performance in these test scenarios, with a success rate of 81%. The performance of mono-view methods dropped significantly when the view was partially occluded. Moreover, these experiments showed that processing the stereo-view observation space with separate neural networks yields relatively high efficiency.
The advantages of RL algorithms for the assembly of electronic parts have also been demonstrated by comparison with conventional methods such as straight-down insertion, random search, and spiral search. Unlike the method presented in this work, conventional methods were not robust to the perturbations applied to the initial position and rotation about the Z-axis. It should be noted that our technique could also be combined with other RL approaches for continuous control, such as TD3 or PPO. However, SAC is known for its sample efficiency, which is a desirable feature for real-world tasks.
In future research, we will verify our method on production lines and gather more information on its overall performance and robustness. We are also planning to work on the problem of fast adaptation to new tasks, defined as adjusting a trained policy to new products on the production lines and to new robotic stations in the factory without training from scratch. Achieving adaptability to unseen environment variants could significantly increase the usability of RL-based methods on high-mix, low-volume production lines where products change constantly. We believe that RL-based methods will eventually replace conventional methods on production lines.

A.1 Combined View Setup
In the SAC-CV experiments, we followed the procedure described in [6] to obtain the combined view as the visual observation. Images acquired from the two cameras attached to the robot's end-effector are merged into an output image of 1024×1024 pixels and then down-sampled to 128×128 pixels. Camera 1 points to the left side of the gripper and camera 2 to the right. This approach provides 360-degree-like vision in a single image. The concept scheme is presented in Fig. 8.
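The merge-and-down-sample step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each camera delivers a 512×1024 RGB image so the merged frame is 1024×1024, and uses block averaging for the 8× down-sampling; the actual resolutions, layout, and resampling filter may differ.

```python
import numpy as np

def combine_views(img_left: np.ndarray, img_right: np.ndarray) -> np.ndarray:
    """Stack two camera images into one frame and down-sample it.

    Assumes each input is a (512, 1024, 3) array, so the merged
    image is (1024, 1024, 3); down-sampling is 8x block averaging,
    yielding the (128, 128, 3) observation.
    """
    merged = np.concatenate([img_left, img_right], axis=0)  # (1024, 1024, 3)
    h, w, c = merged.shape
    blocks = merged.reshape(h // 8, 8, w // 8, 8, c)
    return blocks.mean(axis=(1, 3))  # (128, 128, 3)

left = np.zeros((512, 1024, 3), dtype=np.float32)   # dummy camera 1 frame
right = np.ones((512, 1024, 3), dtype=np.float32)   # dummy camera 2 frame
obs = combine_views(left, right)
print(obs.shape)  # (128, 128, 3)
```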

A.2 External Camera Setup
For the SAC-EC experiments, we placed an external camera in the robot's workspace (the detailed setup is presented in Fig. 9a). We set up the camera so that its field of view covers one PCB of the panel. The image acquired from this setup is illustrated in Fig. 9b. Quantitative results are given in Tables 4, 5 and 6: each method was validated with 100 insertion trials; the lower part of each table reports the average assembly times with standard deviations; and bold entries highlight the best result in each column, i.e., the highest success rate in the upper part and the lowest assembly time in the lower part.

Fig. 3
Fig. 3 The block diagram of the RL-based method for assembling electronic parts: a) the Ape-X distributed architecture that allows asynchronous training; b) the environment for assembling electronic parts with the control diagrams. In this setup, the Actor (the RL agent) sends a Cartesian motion trajectory to the admittance controller, which directly controls the robot's joints
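The admittance-control stage in this diagram can be illustrated with a minimal one-axis sketch. This is an assumption-laden toy model, not the paper's controller: it integrates the law m·a + d·v + k·(x − x_ref) = f_ext with explicit Euler, so external contact forces deflect the commanded pose away from the agent's reference; the gains m, d, k and the time step are illustrative values only.

```python
def admittance_step(x_ref, x, v, f_ext, dt, m=1.0, d=20.0, k=100.0):
    """One Euler step of a 1-DoF Cartesian admittance law.

    m*a + d*v + k*(x - x_ref) = f_ext: with f_ext = 0 the commanded
    pose x converges to the agent's reference x_ref; a contact force
    makes the pose yield, protecting delicate component leads.
    Gains are illustrative, not the values used in the paper.
    """
    a = (f_ext - d * v - k * (x - x_ref)) / m
    v = v + a * dt
    x = x + v * dt
    return x, v

# Free motion: the pose tracks the reference sent by the RL agent.
x, v = 0.0, 0.0
for _ in range(2000):
    x, v = admittance_step(x_ref=0.05, x=x, v=v, f_ext=0.0, dt=0.001)
print(round(x, 3))  # converges to the reference 0.05
```

With a nonzero `f_ext` the steady-state pose settles at an offset of roughly `f_ext / k` from the reference, which is the compliant behavior the block diagram relies on during insertion contact.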

Fig. 4
Fig. 4 Soft Actor-Critic with stereo-view observation (SAC-SV): an architecture built with two separate convolutional neural networks, one per view, and two fully connected neural networks for the actor and the critic, respectively. Features computed by the vision networks are concatenated and passed directly to the actor and critic
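The data flow in this architecture (separate encoder per view, concatenated features) can be sketched at the shape level. This is a hypothetical stand-in, not the paper's network: random linear projections with a ReLU replace the convolutional encoders, and the 128×128×3 input size and 64-dimensional feature size are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_encoder(in_dim: int, feat_dim: int = 64):
    """Stand-in for one view's convolutional encoder: a fixed random
    linear projection over the flattened image, with a ReLU.
    Illustrates shapes only; no learning happens here."""
    w = rng.standard_normal((in_dim, feat_dim)) * 0.01
    return lambda img: np.maximum(img.reshape(-1) @ w, 0.0)

IMG = 128 * 128 * 3                         # one down-sampled camera view
enc_left, enc_right = make_encoder(IMG), make_encoder(IMG)

def stereo_features(view_left: np.ndarray, view_right: np.ndarray) -> np.ndarray:
    # Each view passes through its own network; the resulting feature
    # vectors are concatenated and fed to both the actor and the critic.
    return np.concatenate([enc_left(view_left), enc_right(view_right)])

feat = stereo_features(np.zeros((128, 128, 3)), np.zeros((128, 128, 3)))
print(feat.shape)  # (128,)
```

Keeping the two encoders separate is what lets the policy tolerate one view being partially occluded: the other view's features still carry usable information.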

Fig. 6
Fig. 6 Electronic parts (above) and their corresponding insertion positions (below) used for the experiments. This work proposes the following naming convention for electronic parts: a) component type 1a, b) component type 1b, c) component type 2, and d) component type 3. All shown components have four leads

Fig. 7
Fig. 7 Partially assembled PCB used for the experiments.The surrounding elements significantly modify the view observed by the policy

Fig. 8
Fig. 8 Concept scheme of the image concatenation process for SAC-CV method

Table 1
SAC algorithm hyperparameters used for experiments

Table 2
Summary of the experiments comparing the performance of the methods for component type 1a (Fig. 6a)

Table 3
Success rate on test scenario with partially assembled PCB

Table 4
Summary of the experiments comparing the performance of the methods for component type 1b (Fig. 6b)

Table 5
Summary of the experiments comparing the performance of the methods for component type 2 (Fig. 6c)

Table 6
Summary of the experiments comparing the performance of the methods for the component type 3 (Fig.6d)The 100 trials of the insertion validated each method.The second table shows the average assembly times with standard deviations Bold entries highlight the best results in a single column to improve visualization for the reader.In the column's top part, positions with the highest success rates are bolded.The positions with the lowest assembly times are highlighted in the bottom part