1 Introduction

Augmented Reality (AR) has rapidly developed and demonstrated its effectiveness during assembly work (Xie et al. 2022; Korkut et al. 2023). AR-based work manuals can replace traditional paper manuals in guiding assembly operations (Souza et al. 2020; Wang et al. 2023) by transforming textual work instructions into visual elements and blending them with the real environment. In recent decades, the use of AR in assembly processes has gained popularity, and several scholars have begun to study augmented assembly (AA) with positive results. For instance, Gattullo et al. (2019) studied the design and expression of visual elements in assembly work instructions for AR and proposed a standardization method for visual elements in augmented assembly work instructions. Zubizarreta et al. (2019) developed ARGITU, an augmented assembly system that uses a CAD model for markerless AR registration.

Despite many recent technological advancements, markerless AR registration for different assembly environments and objects still faces challenges, including low precision, weak robustness, and poor timeliness, which restrict the adoption of augmented assembly systems in practical applications. Commonly used markerless AR registration methods can be divided into two categories: sensor-based and image-based (Fang et al. 2017; Zhang et al. 2022; Shu et al. 2022). Sensor-based methods primarily complete registration by fusing and processing signals from various sensors, such as speed, inertia, and depth sensors; however, these methods adapt poorly to different environments, and the deployment of multiple sensors is intrusive for operators (Li et al. 2019). In contrast, less intrusive image-based methods have received significant attention (Benmahdjoub et al. 2022); they primarily achieve camera registration by using feature information from individual RGB images (Rambach et al. 2017). A body of research focuses on image templates, image features, and deep learning methods for AR registration. Objects with complex structures, various shapes, and rich textures can be easily registered using existing methods. However, registering textureless assembly parts with image-based methods remains challenging owing to the lack of image feature information on the part (Astanin et al. 2017; Jiang et al. 2023).

Although textureless assembly parts have smooth surfaces, they often exhibit distinct edge features suitable for feature-line description (Akinlar et al. 2011). To resolve the problem of textureless assembly part registration, this study examines the local geometric edge features of assembly parts and introduces a novel feature-line description and matching method. The proposed method achieves effective and accurate augmented assembly registration. The main contributions of this study are as follows:

  1) To ensure high accuracy and fast processing speed of feature extraction, an improved line extraction technique is proposed to extract contour lines from textureless assembly parts.

  2) A novel image-matching algorithm, referred to as the line neighborhood edges descriptor (LNED), is proposed to describe the extracted contour lines using local geometric edge features. Binary encoding is used to reduce the computational overhead of LNED matching.

  3) Based on the LNED matching algorithm, this study presents a novel markerless AR registration method for augmented assembly systems to assist mechanical product assembly. The proposed method uses a coarse-to-fine strategy to track the assembly part.

The remainder of this paper is organized as follows: Image-based registration methods for monocular camera augmented assembly systems are reviewed in Sect. 2. The framework of local geometric edge feature-based registration is introduced in Sect. 3.1. The feature line extraction method for real assembly images is described in Sect. 3.2, followed by the introduction of a novel line descriptor for describing the extracted lines in Sect. 3.3, where the extracted lines are matched to estimate the initial pose of the real camera. The process of determining the precise camera pose is analyzed in Sect. 3.4. Finally, in Sect. 4, the performance of the proposed algorithm is analyzed, and the registration algorithm is applied in an augmented assembly system.

2 Related works

AR registration is an essential but challenging issue in augmented assembly systems, particularly when the assembly parts are textureless. This section briefly discusses the current image-based registration methods for augmented assembly systems.

2.1 Template matching-based registration method

Template matching is a commonly employed method for image-based registration in augmented assembly. Previously, most template-matching algorithms used edge pixels as matching features. Olson et al. (1997) used the image edge chamfer distance to calculate the similarity between the template and input images for template matching. Although this method has strong timeliness, it is sensitive to environmental factors such as occlusion and illumination. Hinterstoisser et al. (2013) proposed the LINE-MOD (multiMODal-LINE) algorithm, which uses the image gradient as a matching feature. This method adopts a binary mode to represent image gradients and uses ultrahigh-speed parallel image processing. It can realize real-time matching but is only suitable for targets with a fixed scale in image pairs. Liu et al. (2014) improved the LINE-MOD method by leveraging the high tolerance for geometric deformation exhibited by small templates, which solved the problem of object rotation and scaling transformation during augmented assembly registration. Wang et al. (2017, 2018) introduced depth information into the sliding window for similarity evaluation, thereby improving the adaptability of the LINE-MOD algorithm to handle the rotation and scaling variations of the target. Recently, Yu et al. (2018) introduced the Orientation Compression Map and Discriminative Regional Weight (OCM-DRW) method, which uses gradient direction compression mapping for image matching. They used binary direction compression mapping to reduce computational overhead before detecting the similarity between the template and the target image. The OCM-DRW method not only ensures matching accuracy but also improves the timeliness of the algorithm. Although the registration method based on template matching has a significant advantage in terms of timeliness, discrete template images cannot accurately match continuous camera poses during AR registration (Wang et al. 2018). Additionally, matching errors appear easily because the image gradient information for textureless objects is unclear.

2.2 Feature description based registration method

Another commonly used approach for image-based augmented assembly registration involves image feature description. This approach uses image information to establish feature descriptors and then matches the descriptors across different images to find the 2D-3D point correspondences required to solve for the camera pose. Presently, the most widely used feature description methods are based on feature points, such as the Scale-Invariant Feature Transform (SIFT) (Sujin et al. 2023), Speeded Up Robust Features (SURF) (Bay et al. 2008), Features from Accelerated Segment Test (FAST), Binary Robust Independent Elementary Features (BRIEF), and Oriented FAST and Rotated BRIEF (ORB) (Rublee et al. 2011). These methods compute the corresponding feature point descriptors efficiently but require rich corner features in the images. Metal parts with semi-finished or finished surfaces are smooth and textureless, so their images cannot generate distinct gradient differences. This results in fewer feature points and insufficient information for describing the feature points in the image. A lack of feature points and unreliable feature descriptions cause registration failures. In the literature (Dong et al. 2021; Tsai et al. 2018), a variety of real textureless objects have been collected for feature point-based matching, which further demonstrates that matching textureless objects based on feature points is challenging. The lack of texture in assembly parts causes a high false estimation rate, which manifests as severe matching jitter and mismatching. As a result, it is difficult to satisfy the requirements of AR guidance.

With improvements in computing power, researchers have considered the use of feature lines in images for AR registration. Grompone (2012) proposed the Line Segment Detector (LSD) algorithm, which achieves fast and accurate line extraction. Based on the LSD algorithm, Tombari et al. (2013) used the angle information between several adjacent lines to describe a feature line and established the Bunch-Of-Lines Descriptor (BOLD) for line matching. He et al. (2020) improved the BOLD algorithm, applied it to registration, and proposed a contrast-invariant descriptor, grayscale inversion invariance BOLD (GIIBOLD). Zhang and Koch (2013) proposed the line-band descriptor (LBD) algorithm for textureless object matching. This method uses the appearance of feature lines and the gradients of local rectangular regions for image matching; it can register textureless objects but has poor timeliness. To improve the timeliness of registration based on line descriptors, Wang et al. (2020) described feature lines by the angles between consecutive lines to establish the Chain of Lines Feature (COLF) descriptor and converted COLF into binary code for feature matching. COLF improves the efficiency of feature line-based registration to a certain extent but achieves only 7 FPS (frames per second) in an augmented assembly system. Additionally, feature line-based registration methods often use PnP algorithms (Yang et al. 2023) to solve the camera pose; however, the number of PnP points obtained from line endpoints is relatively small. It is challenging to ensure a stable PnP solution with such limited matching point pairs, resulting in a low success rate of feature line-based registration methods for augmented assembly systems (Wang et al. 2020).

2.3 Deep learning-based registration method

In recent years, with the rapid development of deep learning technology (Filipi Gonçalves Santos et al. 2023), researchers have begun exploring the application of deep learning to AR registration (Li et al. 2022). Tremblay et al. (2018) used a convolutional neural network to estimate the 6-DOF pose of a target from a single image and realized the prediction of camera translation and rotation coordinates. This method can adapt to cluttered scenes and deal with occlusion; however, it struggles to meet the requirements of augmented assembly systems. Detone et al. (2016) proposed HomographyNet, which obtains a homography matrix from transformed images for AR registration. The accuracy of HomographyNet was superior to that of the ORB algorithm; however, it is challenging to evaluate real images with a model trained on artificially transformed images. Poursaeed et al. (2019) applied an improved Siamese network structure to camera pose estimation, introducing the camera's internal parameters into the convolutional neural network calculations for accurate camera pose computation. Liu et al. (2021) proposed a deep learning method for learning line feature descriptions; using the learned descriptions, image pairs can be matched for camera pose calculation. However, owing to its many network layers, it cannot satisfy the real-time requirements of augmented assembly systems. AR assembly registration algorithms must satisfy ease-of-use and timeliness requirements, and existing deep learning algorithms still face challenges in meeting them. Most existing deep learning methods require massive amounts of manually labeled data for training, which imposes a manual burden during algorithm application (Li et al. 2021). Deep learning algorithms typically require a GPU (Graphics Processing Unit) for computation, which is unsuitable for portable devices with low computing power, and deep learning algorithms with large-scale parameters cannot produce real-time results when running on a CPU (Central Processing Unit) (Tremblay et al. 2018; Liu et al. 2021; Li et al. 2022). Lightweight deep learning models can be deployed on the real-time embedded systems of portable devices for image detection tasks, but their performance remains unsatisfactory for AR registration (Fang et al. 2020).

In general, registration methods based on template matching and feature point descriptions require rich texture and corner features, making them unsuitable for textureless targets. Deep learning-based methods must be further improved regarding operability and timeliness. Additionally, deep learning often requires numerous manually labeled data for training, which increases the challenge of augmented assembly registration.

3 Local geometric edge features based registration method for textureless objects

3.1 Framework of local geometric edge features based registration

The local geometric edge features of an assembly primarily refer to the contour edges at line endpoints. During the augmented assembly process, the lines of the assembly part in the real image are obtained using a line extraction algorithm. The extracted lines in the real image are then matched against the CAD images to estimate the initial camera pose. Finally, the bundle adjustment method is used to calculate the precise pose of the camera. The framework of the proposed textureless object registration algorithm is illustrated in Fig. 1. It includes four stages: contour line extraction, feature line description, feature line matching, and bundle adjustment.

Fig. 1
figure 1

Registration framework based on local geometric edge features

Contour line extraction refers to the extraction of lines from CAD models and real images. The contour lines of the CAD images are generated by perspective projection of the CAD vector lines. For the real image, the lines of the assembly part are extracted and organized using image processing methods.

In the feature line description stage, the extracted feature lines are described according to the design rules. In the feature line matching stage, the lines are matched according to the descriptor. The most similar pair of real and CAD images can be obtained by line matching, and the virtual camera pose of the CAD image is obtained as the initial pose of the real camera for the real image.

As augmented assembly guidance is a continuous process, the initial camera pose estimated from discrete sampling of the CAD model cannot be accurately superimposed on the real scene. The bundle adjustment stage is therefore used to obtain a precise pose from the initial pose. By reverse-projecting 3D points from the CAD model onto the real image at the initial camera pose, a distance error loss function of the reverse projection points can be established using the distance transform of the real image. Finally, iterative optimization of the error loss function is used to solve for the precise camera pose.
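The four stages can be organized as a single per-frame routine. The following Python sketch only captures the control flow described above; the stage implementations themselves are the subject of Sects. 3.2-3.4 and are passed in as callables here, so every name is a placeholder rather than the paper's code.

```python
# Minimal sketch of the four-stage framework in Fig. 1. The stage implementations
# are injected as callables, since Sects. 3.2-3.4 define them; nothing here is the
# paper's actual code, only the control flow it describes.

def register_frame(real_image, cad_templates, camera_K,
                   extract_lines, describe_line, match_templates, refine_pose,
                   prev_pose=None):
    """Estimate the camera pose for one video frame.

    cad_templates: offline-built list of (virtual_pose, template_descriptors)
    camera_K:      3x3 intrinsic matrix of the real camera
    prev_pose:     pose of the previous frame; reusing it skips the coarse matching step
    """
    # Stage 1: contour line extraction (CEDLines, Sect. 3.2)
    lines = extract_lines(real_image)

    if prev_pose is not None:
        # Tracking mode: the previous frame's pose already serves as the initial pose
        initial_pose = prev_pose
    else:
        # Stage 2: LNED description of each extracted line (Sect. 3.3.1)
        descriptors = [describe_line(real_image, line) for line in lines]
        # Stage 3: match against CAD templates; the best template's virtual camera
        # pose becomes the initial (coarse) pose of the real camera (Sect. 3.3.2)
        initial_pose = match_templates(descriptors, cad_templates)

    # Stage 4: bundle adjustment on the distance image refines the pose (Sect. 3.4)
    return refine_pose(real_image, lines, initial_pose, camera_K)
```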

3.2 Feature line extraction using CEDLines

Because the CAD model is a vector graph, its contour lines can be extracted through perspective projection calculations. For real images, this study proposes a Contour Edge Drawing Lines (CEDLines) algorithm based on Edge Drawing Lines (EDLines) (Akinlar et al. 2011) for line extraction. Because real image features are influenced by environmental factors, the extracted lines often contain numerous non-target lines, which significantly affect the accuracy and efficiency of line matching. In the augmented assembly registration process, it is therefore crucial to further obtain the real contour lines from the lines generated by the EDLines method.

(1) CAD image sampling.

The viewing sphere sampling method (Wang et al. 2020) is adopted to capture CAD images from different viewpoints on the viewing sphere, as shown in Fig. 2. In the CAD environment, a virtual camera captures perspective projection views of the assembly object model from different perspectives. The captured information contains the contour lines of the assembly object model and the corresponding virtual camera pose. The CAD model of the assembly object is placed at the center of a regular icosahedron. The optical axis of the virtual camera always passes through the center of the regular icosahedron, and the sampling viewpoints are located at the triangular vertices. The focal length of the virtual camera must be consistent with that of the real camera when collecting CAD images. To obtain a uniform distribution of sampling views and avoid the view redundancy caused by excessive sampling density at the poles of spherical surface sampling, regular icosahedron surface sampling is used in this study. Each face of the regular icosahedron is divided into 4 parts; to balance computational speed and accuracy, we iterate twice to form 16 equilateral triangles on each face. Finally, 320 different images are captured at the triangular vertices on the surface of the regular icosahedron for one part and then used for AR registration.

Fig. 2
figure 2

CAD image sampling
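A sketch of this sampling procedure is given below. Each icosahedron face is subdivided twice (each step splits a triangle into four, giving 16 sub-triangles per face and 20 × 16 = 320 in total), and one virtual camera is placed per sub-triangle, here at its centroid projected back onto the viewing sphere, looking at the model center. Sampling at the deduplicated vertices would instead yield 162 views, so the per-triangle placement, the sphere radius, and the up-vector convention are assumptions made for illustration.

```python
import numpy as np

def icosahedron():
    """Unit regular icosahedron: 12 vertices and 20 triangular faces."""
    t = (1.0 + np.sqrt(5.0)) / 2.0
    v = np.array([[-1, t, 0], [1, t, 0], [-1, -t, 0], [1, -t, 0],
                  [0, -1, t], [0, 1, t], [0, -1, -t], [0, 1, -t],
                  [t, 0, -1], [t, 0, 1], [-t, 0, -1], [-t, 0, 1]], dtype=float)
    v /= np.linalg.norm(v, axis=1, keepdims=True)
    f = [(0, 11, 5), (0, 5, 1), (0, 1, 7), (0, 7, 10), (0, 10, 11),
         (1, 5, 9), (5, 11, 4), (11, 10, 2), (10, 7, 6), (7, 1, 8),
         (3, 9, 4), (3, 4, 2), (3, 2, 6), (3, 6, 8), (3, 8, 9),
         (4, 9, 5), (2, 4, 11), (6, 2, 10), (8, 6, 7), (9, 8, 1)]
    return v, f

def subdivide(tri, levels):
    """Recursively split a triangle (3x3 array of vertices) into 4**levels sub-triangles."""
    if levels == 0:
        return [tri]
    a, b, c = tri
    ab, bc, ca = (a + b) / 2, (b + c) / 2, (c + a) / 2
    out = []
    for sub in ((a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)):
        out += subdivide(np.array(sub), levels - 1)
    return out

def sampling_viewpoints(radius=0.5, levels=2):
    """320 camera centers on the viewing sphere: one per sub-triangle, at its centroid."""
    verts, faces = icosahedron()
    centers = []
    for ia, ib, ic in faces:
        for tri in subdivide(verts[[ia, ib, ic]], levels):
            c = tri.mean(axis=0)
            centers.append(radius * c / np.linalg.norm(c))  # project back onto the sphere
    return np.array(centers)                                # shape (320, 3)

def look_at_pose(cam_center, target=np.zeros(3), up=np.array([0.0, 0.0, 1.0])):
    """World-to-camera rotation and translation with the optical axis through the model center."""
    z = target - cam_center
    z /= np.linalg.norm(z)
    x = np.cross(z, up)
    if np.linalg.norm(x) < 1e-8:                  # optical axis parallel to the 'up' direction
        x = np.cross(z, np.array([0.0, 1.0, 0.0]))
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                       # rows are the camera axes
    t = -R @ cam_center
    return R, t

print(len(sampling_viewpoints()))                 # 320 sampled viewpoints
```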

(2) Connection of broken lines.

In real scenes, owing to the influence of lighting, texture, and other factors, the contour lines extracted by EDLines are fractured and discontinuous, which affects the accuracy of line matching. The broken lines are therefore connected according to certain rules to reduce the discontinuity of the contour lines. Suppose there are two adjacent broken line segments, P1P2 and P3P4, where P2 and P3 are the two adjacent endpoints at the break. If the distance between these two endpoints is less than the threshold dt and the angle between the two lines is less than θt, the two line segments are merged into a new continuous line P1P4. The formula for determining the broken line connections is as follows:

$$\left\{ {\begin{array}{*{20}c} {\left| {\overrightarrow {{P_{2} P_{3} }} } \right| < d_{t} } \\ {\arccos \frac{{\overrightarrow {{P_{1} P_{2} }} \cdot \overrightarrow {{P_{3} P_{4} }} }}{{\left| {\overrightarrow {{P_{1} P_{2} }} } \right| \cdot \left| {\overrightarrow {{P_{3} P_{4} }} } \right|}} < \theta_{t} } \\ \end{array} { }} \right.$$
(1)
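A minimal sketch of the connection test of Eq. (1) is shown below; the segments are assumed to be ordered along the contour, and the threshold values are illustrative.

```python
import numpy as np

def should_connect(p1, p2, p3, p4, d_t=5.0, theta_t=np.deg2rad(10.0)):
    """Broken-line connection test of Eq. (1).

    p1..p4: 2D endpoints; p2 and p3 are the adjacent endpoints at the break.
    d_t, theta_t: gap-distance and angle thresholds (illustrative values).
    """
    p1, p2, p3, p4 = map(np.asarray, (p1, p2, p3, p4))
    gap = np.linalg.norm(p3 - p2)                       # |P2P3|
    v12, v34 = p2 - p1, p4 - p3                         # segment direction vectors
    cos_angle = np.dot(v12, v34) / (np.linalg.norm(v12) * np.linalg.norm(v34))
    angle = np.arccos(np.clip(cos_angle, -1.0, 1.0))    # angle between the two segments
    return gap < d_t and angle < theta_t

def connect_broken_lines(segments, d_t=5.0, theta_t=np.deg2rad(10.0)):
    """Greedily merge consecutive segments [(P1, P2), (P3, P4), ...] that pass the test."""
    merged = []
    for start, end in segments:
        if merged and should_connect(merged[-1][0], merged[-1][1], start, end, d_t, theta_t):
            merged[-1] = (merged[-1][0], end)           # P1P2 + P3P4 -> P1P4
        else:
            merged.append((start, end))
    return merged
```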

(3) Interference lines elimination.

After integrating the broken lines into continuous lines, the interference lines generated by environmental disturbances must be eliminated. Interference line elimination mainly removes crossing short lines from the real contour lines. Burr interference edges on real objects cause the contour lines to generate crossing segments. A branching point e, at which two or more line segments meet, is defined as an endpoint. The first endpoint of the contour line is set as s1, and the branching point e is detected on the contour line. When a branching point appears, the contour line is decomposed into multiple line segments, and two adjacent line segments are denoted as s1e and es2. The judgment condition for choosing the two line segments as the real contour lines is as follows:

$$LK = {\text{arccos}}\left( {\frac{{\overrightarrow {{s_{1} e}} \cdot \overrightarrow {{es_{2} }} }}{{\left| {\overrightarrow {{s_{1} e}} } \right| + \left| {\overrightarrow {{es_{2} }} } \right|}}} \right) \cdot \frac{{\left| {\overrightarrow {{s_{1} e}} } \right| \cdot \left| {\overrightarrow {{es_{2} }} } \right|}}{{\left| {\overrightarrow {{s_{1} e}} } \right| + \left| {\overrightarrow {{es_{2} }} } \right|}}{ }$$
(2)

where LK denotes the metric value. The larger the metric value, the greater the contribution of the two line segments to the contour line. The line segments with the larger LK are retained, and the other line segments at the branching point are deleted. When LK is the maximum value calculated over all line segments at the branching point and satisfies the condition in Eq. (1), s1e and es2 are combined into a new line s1s2.

After removing the cross-branching interference on the contour lines, the discontinuous short lines extracted by EDLines are removed. A threshold lt is set as the judgment condition: for any contour line sisi+1, a line segment whose length is below the threshold is deleted. The judgment formula is as follows:

$$\left| {\overrightarrow {{s_{i} s_{i + 1} }} } \right| > l_{t}$$
(3)

The feature line extraction results of EDLines and CEDLines are shown in Fig. 3. The lines on the left side of the figure were extracted using the EDLines algorithm; as shown, the extracted feature lines contain many breaks, crossings, and short lines. The lines on the right side were extracted using the CEDLines algorithm. By improving the EDLines algorithm, the extracted lines become cleaner and more distinct, and the number of extracted lines is effectively reduced, so computational resources can be used more rationally during image feature description and matching.

Fig. 3
figure 3

Feature lines extraction from textureless assembly part

3.3 Feature lines description and matching

After the contour lines of the real and CAD images are obtained, the lines in the two images are matched to estimate the initial camera pose. In this study, we designed a new feature line matching method for textureless assembly parts. The endpoints of lines on the assembly part are formed by several contour edges. These contour edges are less affected by lighting, background, and other factors and can generate stable and distinct gradients in textureless images. The rectangular region at the endpoint of an extracted line is selected as the description region, and the image edges in this rectangular region are used to describe the extracted line. A block diagram of the matching method is shown in Fig. 4; the new feature descriptor we designed is shown in the highlighted block. Compared with traditional methods (Tombari et al. 2013; Zhang and Koch 2013), an endpoint region with abundant edge features is selected to design the descriptor, which improves the accuracy and stability of line matching for textureless assembly parts.

Fig. 4
figure 4

Block diagram of LNED matching

3.3.1 Line neighborhood edges descriptor (LNED)

The description region is perpendicular and symmetric to the direction of the line; this region is called the line neighborhood edges (LNE) region. To distinguish the influence of pixels at different distances from the line, the LNE is divided into several subregions S with the contour line as the axis of symmetry. Each subregion has a width of w and a height of n, as shown in Fig. 5. The red line dL in the figure is the direction vector of the contour line to be described, Si is a subregion of the LNE, and li is a contour edge in the LNE.

Fig. 5
figure 5

LNED descriptor

In the real image, the gradient values at the edges of the assembly part are large. After image denoising, the gradient value of each pixel in the image is computed. For the pixel at row n and column i in subregion S of the LNE, if its gradient value is a local extremum and greater than the threshold τ, the pixel is selected as an edge point. Let the pixel gradient at pixel Pni be Gni(dxni, dyni); the gradient magnitude gni at pixel Pni is then calculated as:

$$g_{ni} = \sqrt {dx_{ni}^{2} + dy_{ni}^{2} }$$
(4)

If Pni satisfies these requirements, a new vector is created by joining P0 with Pni. The contour line direction unit vector \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{d_{L} }}\) is then used to perform the cross-product operation with \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{p_{0} p_{ni} }}\). The magnitude of the result is used as the feature value of Pni. The formula is as follows:

$$\left\{ {\begin{array}{*{20}l} {d_{ni} = \sqrt {\left( {x_{ni} - x_{0} } \right)^{2} + \left( {y_{ni} - y_{0} } \right)^{2} } } \hfill & {g_{ni} \ge \tau } \hfill \\ {d_{ni} = 0} \hfill & {g_{ni} < \tau } \hfill \\ \end{array} } \right.$$
(5)
$$v_{ni} = \left| {\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{p_{0} p_{ni} }} } \right| \cdot \left| {\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{d_{L} }} } \right| \cdot \sin \sigma = d_{ni} \cdot \sin \sigma$$
(6)

where dni is the norm of the vector \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{p_{0} p_{ni} }}\), vni is the feature value of pixel Pni at row n and column i, and σ is the angle between the two vectors.

According to the clockwise angle σ, the pixel points that meet the requirements are evenly divided into four intervals over the range [0, 2π]. The feature values of the pixel points belonging to the same interval are accumulated according to certain rules. The interval feature values of all pixel points in row n of subregion S are calculated as follows:

$$v_{nk}^{s} = \mathop \sum \limits_{{\left( {k + 1} \right) \cdot \pi /2 > \theta > k \cdot \pi /2}} \lambda_{nk} \cdot v_{nk}$$
(7)

where k = 0, 1, 2, 3 corresponds to the four intervals, s is the index of the subregion, \(v_{nk}^{s}\) represents the accumulated feature value at row n of subregion s on interval k, and θ is the clockwise angle between PniP0 and dL. When calculating the feature value of each row, a weighting coefficient λnk is applied according to the distance of each pixel in subregion S of the LNE from the contour line, to reduce the influence of the background and of pixels farther from the contour line on the descriptor. The formula is as follows:

$$\lambda_{nk} = \left( {1/\sqrt {2\pi } \cdot \mu_{s} } \right) \cdot e^{{ - d_{ni}^{2} /2\mu_{s}^{2} }}$$
(8)

where μs = 0.5(ns − 1), and dni is the distance from point Pni to dL.

For the pixels in row n in each subregion, the feature vector can be expressed as

$$\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{v_{n}^{s} }} = \left( {\begin{array}{*{20}c} {v_{n0}^{s} ,} & {\begin{array}{*{20}c} {v_{n1}^{s} ,} & {v_{n2}^{s} ,} & {v_{n3}^{s} } \\ \end{array} } \\ \end{array} } \right)$$
(9)

where \(\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{v_{n}^{s} }}\) is a feature vector.

The feature vectors of each subregion S in the LNE are represented by a matrix, and the description matrix LNEMs for subregion S is obtained as follows:

$${\text{LNEM}}_{s} = \left( {\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{v_{0}^{s} }}^{T} ,\;\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{v_{1}^{s} }}^{T} , \ldots ,\overset{\lower0.5em\hbox{$\smash{\scriptscriptstyle\rightharpoonup}$}}{{v_{n}^{s} }}^{T} } \right) \in {\varvec{R}}^{4 \times n}$$
(10)

The variance vector Ms and the mean vector Ns of LNEMs are used to simplify the description matrix, giving the simplified form:

$${\text{LNEM}}_{s} = \left( {\begin{array}{*{20}c} {{\varvec{M}}_{s}^{T} ,} & {{\varvec{N}}_{s}^{T} } \\ \end{array} } \right)$$
(11)

The description matrices of all the subregions S in LNE are combined to obtain the contour line descriptor LNED, which can be expressed as

$${\text{LNED}} = \left( {\begin{array}{*{20}c} {\begin{array}{*{20}c} {{\varvec{M}}_{1}^{T} ,} & {{\varvec{N}}_{1}^{T} } \\ \end{array} ,} & { \cdots ,} & {\begin{array}{*{20}c} {{\varvec{M}}_{s}^{T} ,} & {{\varvec{N}}_{s}^{T} } \\ \end{array} } \\ \end{array} } \right) \in {\varvec{R}}^{8 \times s} \user2{ }$$
(12)
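To make the construction concrete, the following sketch computes an LNED-style descriptor for one contour line. Only the per-pixel feature value of Eqs. (4)-(6), the four angular intervals of Eq. (7), the Gaussian weighting of Eq. (8), and the 8-value-per-subregion summary of Eqs. (10)-(12) are taken from the text; how the LNE rectangle is rasterized into subregion rows, the sign convention of the clockwise angle, the gradient threshold value, and the exact reading of μs are assumptions made for illustration.

```python
import numpy as np

def lned_descriptor(grad_x, grad_y, p0, d_l, subregion_pixels, tau=30.0):
    """Sketch of the LNED construction of Eqs. (4)-(12).

    grad_x, grad_y:   image gradients (e.g. from a Sobel filter after denoising)
    p0:               reference endpoint of the contour line, as (x, y)
    d_l:              unit direction vector of the contour line
    subregion_pixels: list over subregions; each entry is a list over rows of pixel
                      coordinates [(x, y), ...] inside the LNE rectangle of Fig. 5.
                      How that rectangle is rasterized is assumed, not specified here.
    tau:              gradient threshold of Eq. (5) (illustrative value)
    """
    p0 = np.asarray(p0, dtype=float)
    d_l = np.asarray(d_l, dtype=float)
    descriptor = []

    for rows in subregion_pixels:
        mu_s = 0.5 * max(len(rows) - 1, 1)                # spread of the Gaussian weight
        lnem = np.zeros((4, len(rows)))                   # Eq. (10): 4 intervals x n rows
        for n, row in enumerate(rows):
            for (x, y) in row:
                g = np.hypot(grad_x[y, x], grad_y[y, x])  # Eq. (4): gradient magnitude
                if g < tau:                               # Eq. (5): ignore non-edge pixels
                    continue
                v = np.array([x, y], dtype=float) - p0    # vector P0 -> Pni
                d = np.linalg.norm(v)
                if d == 0:
                    continue
                # Eq. (6): |P0Pni x dL| = d * sin(sigma); the 2D cross product gives it
                cross = v[0] * d_l[1] - v[1] * d_l[0]
                feature = abs(cross)
                # clockwise angle between P0Pni and dL, mapped to [0, 2*pi)
                theta = np.arctan2(-cross, np.dot(v, d_l)) % (2 * np.pi)
                k = min(int(theta // (np.pi / 2)), 3)     # one of the four intervals
                # Eq. (8): Gaussian distance weighting
                lam = np.exp(-d ** 2 / (2 * mu_s ** 2)) / (np.sqrt(2 * np.pi) * mu_s)
                lnem[k, n] += lam * feature               # Eq. (7): per-row accumulation
        # Eq. (11): summarize the 4 x n matrix by per-interval variance and mean
        descriptor.extend(lnem.var(axis=1))
        descriptor.extend(lnem.mean(axis=1))

    return np.asarray(descriptor)                         # Eq. (12): length 8 * s
```

With S = 9 subregions the returned vector has 72 elements, matching the dimensionality stated in Sect. 3.3.2.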

3.3.2 Contour lines matching

To improve the timeliness of augmented assembly registration, the contour line matching process is divided into offline and online stages. In the offline stage, the contour lines and edges of the CAD images are extracted by perspective projection of the CAD model vector information, and the contour line descriptor LNED of each CAD image is calculated directly from these edges according to the LNED description algorithm; CAD image processing time is thus saved in the online stage. In the online stage, contour line extraction and description are performed on the real image to obtain the LNED descriptors for matching against the CAD images. The pose matrix of the best-matched CAD image is taken as the initial pose of the real camera.

Before matching the LNED descriptors of the real and CAD images, the LNED is converted into a binary descriptor to reduce the computational cost of feature matching. Similar to the BRIEF descriptor (Calonder et al. 2010), the LNED feature vectors are represented by a series of 0/1 binary codes. Experiments show that the algorithm achieves a good balance between timeliness and matching performance when the LNE is divided into S = 9 subregions; under these circumstances, the LNED is a 72-dimensional floating-point vector. For this 72-dimensional vector, 0 and 1 are used to encode the ordering relationships between its elements. As shown in Fig. 6, 32 values are taken from the 72-dimensional vector in a certain order, each value is represented by an 8-bit binary string, and a 256-dimensional binary vector is obtained by concatenating the 32 binary strings.

Fig. 6
figure 6

Binary LNED descriptor

After the LNED descriptors are transformed into binary codes, the Hamming distance, computed by performing XOR and bit-count operations on the two binary strings, is used as the distance metric between two descriptors. By comparing the LNED descriptors of the real and CAD images, the matching relationship between the two images can be obtained. For each image frame during the augmented assembly process, the CAD image with the minimum Hamming distance is found as the matching result for that frame, and the virtual camera pose ξ0 of the CAD image is taken as the initial camera pose of the real image.
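A sketch of the binarization and matching step is given below. The rule for deriving the 32 values and their 8-bit codes from the 72-dimensional LNED is not fully specified in the text, so the BRIEF-style pairwise comparisons used here are an assumption; the XOR-plus-bit-count Hamming distance and the nearest-template search follow the description, while the per-image score that sums nearest-line distances is again an assumption.

```python
import numpy as np

def binarize_lned(desc72, n_anchors=32, n_bits=8):
    """Convert a 72-D float LNED into a 256-bit binary code.

    The comparison pattern is an assumption: each of 32 anchor elements is compared
    against the 8 elements that follow it cyclically, giving 8 bits per anchor.
    """
    desc72 = np.asarray(desc72, dtype=float)
    anchors = np.linspace(0, desc72.size, n_anchors, endpoint=False, dtype=int)
    bits = []
    for a in anchors:
        for j in range(1, n_bits + 1):
            bits.append(desc72[a] > desc72[(a + j) % desc72.size])
    return np.packbits(np.asarray(bits, dtype=np.uint8))        # 256 bits -> 32 bytes

def hamming(code_a, code_b):
    """Hamming distance via XOR and bit counting."""
    return int(np.unpackbits(np.bitwise_xor(code_a, code_b)).sum())

def best_matching_template(real_codes, cad_templates):
    """Return the virtual pose of the CAD template whose binary codes best match the frame.

    cad_templates: list of (virtual_pose, list_of_binary_codes) built offline.
    The per-image score (sum of nearest-line Hamming distances) is an assumption.
    """
    best_pose, best_score = None, np.inf
    for pose, cad_codes in cad_templates:
        score = sum(min(hamming(r, c) for c in cad_codes) for r in real_codes)
        if score < best_score:
            best_pose, best_score = pose, score
    return best_pose
```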

The LNED algorithm was applied to feature line matching between real and CAD images, and the matching results are shown in Fig. 7. Four groups of parts were chosen, and the left side of each group of images is a real image. The contour lines in the real image are extracted online using the CEDLines algorithm, and the LNED descriptors of each contour line are calculated simultaneously. On the right side are the CAD images rendered from the virtual model offline; their contour lines are obtained through vector perspective projection calculations, and the corresponding LNED descriptors are calculated. Finally, the LNED descriptors of the real and CAD images are matched to obtain the corresponding matching relationship. The number of extracted lines is affected by the structure of the assembly object. When calculating the perspective projection for AR registration, at least four non-collinear key points must be selected for the camera pose calculation; therefore, at least two non-collinear edges must be matched. The more lines there are, the more descriptors are obtained, and the more accurate the image matching results.

Fig. 7
figure 7

LNED-based lines matching

3.4 Pose calculation based on the distance image

The initial pose of the camera is estimated from discrete perspectives of the virtual model; therefore, accurate AR superposition cannot be realized with the initial camera pose alone, and a precise pose must be determined from the initial pose. In an augmented assembly environment, the contour shape of the assembly part in the real image changes with the viewing angle. The number of contour lines extracted from real images is affected by factors such as illumination and background interference, which makes it difficult to ensure sufficient 3D-2D corresponding point pairs. Therefore, common PnP-solving algorithms cannot be used to determine the precise camera pose (Yang et al. 2023; Xu et al. 2017). To obtain precise camera pose parameters, we adopt a bundle adjustment method based on iterative minimization. First, the spatial points on the contour lines of the CAD model are reverse-projected onto the edge distance image generated from the real image. The distance values of the projection points in the distance image are then used to establish the camera pose error loss function. Finally, the Levenberg–Marquardt (L-M) optimization algorithm is used for the iterative solution. The initial camera pose for the bundle adjustment calculation can be obtained not only from feature matching but also from the pose parameters of the previous frame, which improves the timeliness of registration.

3.4.1 The established objective function

In the augmented assembly process, the relationship between the spatial coordinate points and image pixel points can be obtained using the camera model:

$${{\varvec{p}}}_{i}=I\left(K\bullet \left({\varvec{R}}{\bullet {\varvec{P}}}_{i}+{\varvec{t}}\right)\right)$$
(13)

where pi is the image coordinate point, Pi is the corresponding spatial coordinate point, I(·) is the homogeneous coordinate transformation function, K is the camera intrinsic parameter matrix, and R and t are the camera extrinsic rotation matrix and translation vector.

After the spatial points of the CAD model are reverse-projected onto the distance field of the real image, the edge distance values at the positions of the projected points in the distance image are taken as the matching error to determine the optimal camera pose. The principle of reverse-projecting the 3D contour point set of the CAD model onto the distance image is shown in Fig. 8. The spatial points are sampled from all the contour lines of the CAD model. In the figure, the black points represent the 3D spatial points of the virtual model, and the red points are their reverse projections onto the real image at the initial camera pose ξ0. A value of zero in the distance image represents the position of a contour line in the real image, and the red numbers represent the positions of the reverse projection points in the distance image. During the assembly process, the background can be relatively noisy. The non-assembly object edge suppression method proposed by Bin et al. (2019) can be used to solve this problem: a local foreground-background color statistical model of the assembly object is used to suppress the contour edges of non-assembly objects in the image.

Fig. 8
figure 8

CAD spatial points reverse projection on the distance image

A standard distance transform is used to generate the distance image of the real image. The distance between the reverse projection of a 3D point and its nearest contour line can be expressed as:

$$d_{i} = DT\left( {{\varvec{p}}_{i} } \right) = DT\left( {I\left( {K \cdot \left( {R \cdot {\varvec{P}}_{i} + t} \right)} \right)} \right)$$
(14)

Let \({\mathbb{P}}\) be the set of 3D points on the CAD model; the pose error loss function over \({\mathbb{P}}\) is then defined as follows:

$$f\left( {R,t} \right) = \sum\limits_{{P_{i} \varepsilon {\mathbb{P}}}} {\left( {d_{i} } \right)^{2} } = \sum\limits_{{P_{i} \varepsilon {\mathbb{P}}}} {\left( {DT\left( {I\left( {K \cdot \left( {R \cdot {\mathbf{P}}_{i} + t} \right)} \right)} \right)} \right)^{2} }$$
(15)

During contour extraction from real images, interference objects can cause mismatching. To improve the accuracy of bundle adjustment, the inclination angles of the matching points are introduced into the error loss function as a matching-error parameter. The inclination angle of each point on the CAD contour is calculated and compared with that of the corresponding point in the real image; when the inclination angles are similar, the corresponding contour match is considered correct.

For 3D point Pi, suppose its corresponding 2D projection point on the CAD image is \({{\varvec{p}}}_{i}{\prime}\), and the corresponding contour point on the real image is \({{\varvec{p}}}_{i}^{{\prime}{\prime}}\), then the corresponding inclination angle values of Pi on the CAD image and real image are calculated as follows:

$$\left\{ {\begin{array}{*{20}c} {\phi \left( {{\varvec{p}}_{i}^{\prime } } \right) = a\tan 2\left( {dy^{\prime } ,dx^{\prime } } \right)} \\ {\phi \left( {{\varvec{p}}_{i}^{\prime \prime } } \right) = a\tan 2\left( {dy^{\prime \prime } ,dx^{\prime \prime } } \right)} \\ \end{array} } \right.$$
(16)

where dx', dy', dx'', dy'' are the gradient values of the image points in the x and y directions, respectively.

The degree of matching between matching points can be calculated using the following formula:

$$\phi_{i} = \left| {\phi \left( {{\varvec{p}}_{i}^{\prime } } \right) - \phi \left( {{\varvec{p}}_{i}^{\prime \prime } } \right)} \right|$$
(17)

In a continuous assembly process, the assembly part is sometimes occluded by other objects. To mitigate the influence of occlusion on the camera pose error loss function, an occlusion-influencing parameter is introduced to ensure the accuracy of the camera pose. When 3D points on the CAD model are projected onto an occluded area, the projection points are usually far from the contour lines in the real image and thus correspond to large values in the distance image. Tukey's operator (Lu et al. 2020) is used as the occlusion suppression coefficient:

$$\gamma_{i} = \left\{ {\begin{array}{*{20}l} {\left[ {1 - \left( {d_{i} /\kappa } \right)^{2} } \right]^{2} } \hfill & {d_{i} \le \kappa } \hfill \\ 0 \hfill & {d_{i} > \kappa } \hfill \\ \end{array} } \right.$$
(18)

where di is the distance value of the reverse-projected 3D point, and κ is the distance threshold.

The above influencing factors are introduced into the camera pose error loss function as calculation parameters to obtain

$$f\left( {R,t} \right) = \sum\limits_{{P_{i} \varepsilon {\mathbb{P}}}} {\gamma_{i} } \cdot \left( {DT\left( {I\left( {K \cdot \left( {R \cdot \, {\mathbf{P}}_{i} + t} \right)} \right)} \right) + \phi_{i} } \right)^{2}$$
(19)
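The weighted loss of Eq. (19) can be assembled from a distance transform of the real edge image, the Tukey weight of Eq. (18), and an optional inclination-angle term from Eqs. (16)-(17). The sketch below is one possible implementation: it uses OpenCV's Canny edge detector in place of CEDLines to build the distance image, rounds projections to the nearest pixel, and treats the threshold values as illustrative.

```python
import cv2
import numpy as np

def build_distance_image(gray):
    """Distance image of the real frame: 0 on contour edges, growing away from them.

    Canny thresholds are illustrative; the paper extracts contours with CEDLines instead.
    """
    edges = cv2.Canny(gray, 50, 150)
    return cv2.distanceTransform(255 - edges, cv2.DIST_L2, 3).astype(np.float32)

def project(points_3d, R, t, K):
    """Pinhole projection of CAD contour points, Eq. (13)."""
    pc = points_3d @ R.T + t                      # camera-frame coordinates
    uv = pc[:, :2] / pc[:, 2:3]                   # perspective division
    return uv @ K[:2, :2].T + K[:2, 2]            # pixel coordinates

def pose_loss(R, t, K, points_3d, dist_img, phi_err=None, kappa=20.0):
    """Weighted pose error of Eq. (19).

    phi_err: optional per-point inclination-angle differences (Eq. (17)); zero if omitted.
    kappa:   Tukey distance threshold of Eq. (18) (illustrative value).
    """
    uv = np.round(project(points_3d, R, t, K)).astype(int)
    h, w = dist_img.shape
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    d = np.full(len(uv), kappa + 1.0, dtype=np.float32)       # points outside count as occluded
    d[inside] = dist_img[uv[inside, 1], uv[inside, 0]]        # DT(p_i), Eq. (14)

    gamma = np.where(d <= kappa, (1.0 - (d / kappa) ** 2) ** 2, 0.0)   # Eq. (18)
    phi = np.zeros_like(d) if phi_err is None else np.asarray(phi_err)
    return float(np.sum(gamma * (d + phi) ** 2))              # Eq. (19)
```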

3.4.2 Optimal method based on L-M

After the camera pose error loss function is established, it is solved by numerical optimization. If the rotation matrix R is used directly to express the rotation, complex nonlinear constraints are introduced that are not conducive to the numerical optimization algorithm. To keep the analytical derivation of the error loss function simple during numerical optimization, this study adopts the Lie algebra to parameterize the camera pose.

$${\text{se}}\left( 3 \right) = \left\{ {\xi = \left( {\begin{array}{*{20}c} \rho \\ \omega \\ \end{array} } \right)\varepsilon {\mathbb{R}}^{6} ,\rho \varepsilon {\mathbb{R}}^{3} ,\omega \varepsilon {\text{so}}\left( 3 \right),\hat{\xi } = \left[ {\begin{array}{*{20}c} {\hat{\omega }} & \rho \\ {o^{T} } & 0 \\ \end{array} } \right]\varepsilon {\mathbb{R}}^{4 \times 4} } \right\}$$
(20)

In this equation, se(3) is the Lie algebra defined in \({\mathbb{R}}^{6}\) space, and its element ξ is a 6D vector. The first three dimensions of ξ represent the translation part ρ, and the last three dimensions represent the rotation part ω. ω is an element of the Lie algebra so(3) defined in \({\mathbb{R}}^{3}\) space and represents the rotation as a vector, while \(\hat{\xi }\) and \(\hat{\omega }\) denote the corresponding matrix forms.

When the pose matrix is represented by a Lie algebra element, the corresponding pose error loss function is obtained by substituting it into Eq. (19):

$$f\left( \xi \right) = \mathop \sum \limits_{{P_{i} \varepsilon {\mathbb{P}}}} \gamma_{i} \cdot \left( {DT\left( {I\left( {K \cdot \left( {\exp \left( \xi \right) \cdot P_{i} } \right)_{3 \times 1} } \right)} \right) + \phi_{i} } \right)^{2}$$
(21)

The nonlinear iterative L-M algorithm is used to minimize f(ξ) starting from the initial camera pose ξ0, and the optimal camera pose ξt is obtained as:

$$\xi_{t} = \mathop {\arg \min }\limits_{\xi } f\left( \xi \right)$$
(22)

When the L-M algorithm is used to solve Eq. (22), the Jacobian matrix J \(\varepsilon {\mathbb{R}}^{1 \times 6}\) of f(ξ) must be calculated. Under a small perturbation Δξ, the Jacobian J is the derivative of f(ξ) with respect to Δξ, which can be calculated using the following equation:

$$J = \frac{\partial f\left( \xi \right)}{{\partial \Delta \xi }} = \sum\limits_{{P_{i} \varepsilon {\mathbb{P}}}} 2 \cdot \gamma_{i} \cdot \frac{{\partial DT\left( { \, {\varvec{p}}_{i} } \right)}}{{\partial \, {\varvec{p}}_{i} }}\frac{{\partial \, {\varvec{p}}_{i} }}{\partial \Delta \xi }$$
(23)

where pi is the projection of the 3D point Pi at camera pose ξ. The partial derivative \(\partial {\varvec{p}}_{i} /\partial \Delta \xi \varepsilon {\mathbb{R}}^{2 \times 6}\) can be solved with the perturbation model, and \(\partial DT\left( {{\varvec{p}}_{i} } \right)/\partial {\varvec{p}}_{i} \varepsilon {\mathbb{R}}^{1 \times 2}\) represents the gradient of the distance image at point pi, which can be calculated using central finite differences:

$$\left\{ {\begin{array}{*{20}c} {\frac{{\partial DT\left( {{\varvec{p}}_{i} } \right)}}{{\partial u_{i} }} = \frac{{DT\left( {u_{i} + 1, v_{i} } \right) - DT\left( {u_{i} - 1, v_{i} } \right)}}{2}} \\ {\frac{{\partial DT\left( {{\varvec{p}}_{i} } \right)}}{{\partial v_{i} }} = \frac{{DT\left( {u_{i} , v_{i} + 1 } \right) - DT\left( {u_{i} , v_{i} - 1} \right)}}{2}} \\ \end{array} } \right.$$
(24)

where ui and vi are the pixel coordinates of point pi.
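The refinement can be sketched with SciPy's Levenberg–Marquardt solver, which differentiates the residuals by finite differences in the spirit of Eqs. (23)-(24). Each residual is the square root of the Tukey weight times the sampled distance value, so the sum of squares reproduces Eq. (21) without the inclination-angle term; parameterizing the pose as a translation plus an axis-angle rotation vector (via cv2.Rodrigues) is a simplification of the se(3) element in Eq. (20), and project() is reused from the previous sketch.

```python
import cv2
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import least_squares

def refine_pose(xi0, K, points_3d, dist_img, kappa=20.0):
    """Refine the camera pose by minimizing the weighted distance error of Eq. (21).

    xi0: initial 6-vector (tx, ty, tz, rx, ry, rz) with an axis-angle rotation part,
         e.g. from the best-matched CAD template or the previous frame. Using
         translation + axis-angle is a simplification of the se(3) element of Eq. (20).
    Reuses project() from the previous sketch; the inclination-angle term is omitted.
    """
    def residuals(xi):
        t, rvec = xi[:3], xi[3:]
        R, _ = cv2.Rodrigues(rvec)                          # exponential map of the rotation
        uv = project(points_3d, R, t, K)
        # Bilinear sampling of the distance image keeps the residuals smooth, which the
        # finite-difference Jacobian (cf. the central differences of Eq. (24)) relies on.
        d = map_coordinates(dist_img, [uv[:, 1], uv[:, 0]], order=1, mode="nearest")
        gamma = np.where(d <= kappa, (1.0 - (d / kappa) ** 2) ** 2, 0.0)   # Eq. (18)
        return np.sqrt(gamma) * d     # squared sum reproduces Eq. (21) without phi_i

    # Levenberg-Marquardt iteration starting from the coarse pose
    result = least_squares(residuals, np.asarray(xi0, dtype=float), method="lm", max_nfev=100)
    return result.x
```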

4 Experiments

The system hardware configuration for all experiments in this study was as follows: the camera was an ordinary USB camera with a CMOS color sensor and a resolution of 640 × 480. The operating system was Windows 10, running on an Intel Core i7 CPU with a clock speed of 3.6 GHz and 8 GB of memory.

4.1 Registration precision analysis

The mark-based registration algorithm was chosen as a reference for comparison because of its excellent accuracy, timeliness, and adaptability (Chen et al. 2022). To ensure consistency of the verification scenes for both registration algorithms, videos containing marks were captured for the three parts used in the experiments. The experimental setup is shown in Fig. 9: Fig. 9 (1)–(3) show the three video scenes of the experiment, Fig. 9 (4)–(5) show the registration results of the mark-based method, and Fig. 9 (6)–(9) show the registration results of our method. During the registration process, the center of the part was used as the origin of the spatial coordinates. The camera pose underwent continuous motion in the different videos so that the registration performance of the two algorithms could be analyzed.

Fig. 9
figure 9

Registration of different parts

The mark-based and proposed algorithms were applied to calculate the camera pose of each frame in the video, and the pose coordinate curves of each frame in the six orientations of Rx, Ry, Rz, Tx, Ty, and Tz were obtained, as shown in Fig. 10.

Fig. 10
figure 10

Registration coordinate curve

The six orientation coordinate curves are shown in Fig. 10. The curves represent the rotation and translation variations of the camera pose, in which the unit of the rotation coordinate is ° and the unit of the translation coordinate is mm. In Fig. 10, different color curves are used to distinguish the results of the two algorithms for the three scenes. The camera pose coordinate curves obtained from the mark-based algorithm were used as reference standards to verify the proposed algorithm. In the experiments, part 1 moved within a moderate range in the direction of rotation and movement. The rotation angle was approximately 20°, and the movement range was 40 mm. The maximum error in the rotation and movement directions appeared in frame 180, in which the rotation error around the Y-axis was 2.8° and the movement error along the Y-axis was 3.5 mm. The rotation angle of Part 2 was maintained within a range of 10°, and the movement was greater when the movement range exceeded 50 mm along the X-axis. The maximum error of Part 2 occurred in frame 210; the rotation error around the X-axis was approximately 3°, and the movement error along the Z-axis was approximately 3 mm. Part 3 performed a relatively significant rotational motion in which the rotation angle around the Y-axis was close to 50°. A significant movement error of approximately 5 mm occurred in Frame 240.

Compared with the mark-based registration method, the proposed algorithm demonstrates a maximum rotation error of approximately 3° and a maximum positional error of approximately 5 mm. The average rotation error is 1.47° and the average positional error is approximately 1.16 mm. Therefore, the proposed algorithm effectively addresses the registration problem of textureless objects and fulfills the requirements of augmented assembly guidance.

4.2 Time consumption of LNED

Feature extraction and matching time are the main factors affecting the registration timeliness of augmented assembly systems. Presently, feature point-based matching methods, which easily reach 15 frames per second (FPS) (Wang et al. 2017), are primarily adopted for richly textured objects in augmented assembly systems. For textureless objects, image matching is often computed from object edges, which results in high computational cost for feature extraction and matching. We chose Part (4) in Fig. 7, with 640 × 480 resolution images, as the test object to compare the time consumption of different existing matching methods. The average time consumption of image pair matching for the different methods is listed in Table 1.

Table 1 Matching results of different algorithms

In Table 1, we compare the proposed LNED algorithm with existing methods, namely BOLD (Tombari et al. 2013), LBD (Zhang and Koch 2013), LP-HardNet (Liu et al. 2021), and ORB (Rublee et al. 2011). BOLD describes and matches lines using the angle features between adjacent lines, whereas LBD uses the local image gradients of lines for description and matching. LP-HardNet performs image line matching using a CNN, and ORB is a feature point matching method.

Additionally, in Table 1, Time Consumption is the time consumed by each algorithm to complete feature extraction and matching for a pair of images, Frame Rate is the corresponding frame rate for image pair matching, and the line-matching rate is the ratio of the number of matched lines to the number of extracted lines. The ORB algorithm extracts keypoint features from images, so it cannot match straight lines. LP-HardNet requires a large image dataset containing the test assembly object for model training; therefore, its line-matching rate could not be tested for this scenario. Compared with existing methods for textureless objects, our method has an advantage in timeliness: with the binary LNED descriptor, the time consumption is lower than that of BOLD and LBD. The trained model generated by LP-HardNet contains numerous parameters, and its timeliness is limited by the design of the neural network structure and the hardware. Compared with the available algorithms, the proposed method achieves line matching with higher accuracy.

LNED can reach 22 FPS, which meets the timeliness requirements of the AR system. During the application of the augmented assembly, the registration time can be reduced by extracting the descriptors of the CAD templates in the offline stage. In the online stage, the augmented assembly system only needs to extract and describe the real assembly part images, which can reduce calculation consumption.

4.3 Augmented assembly case

In this section, we describe the development of a precision machine augmented assembly system based on the proposed markerless AR registration method. The assembly guidance task is shown in Fig. 11. A computer optical drive gear was used as the assembly object. First, we use the black base as the detection object and complete the AR registration by matching the line features on the base, as shown in Fig. 11 (1). The blue wireframe model of the base is then superimposed onto the real image; as shown in Fig. 11 (1), it is accurately superimposed on the real black base.

Fig. 11
figure 11

Augmented assembly case

After completing the AR registration, a virtual model is used to guide the actual assembly process. A green virtual gear model is superimposed at the correct installation position on the real black base, as shown in Fig. 11 (2)–(4). The initial position of the virtual model for the gear installation guidance, directly above the actual installation position, is shown in Fig. 11 (2). In the AR visualization animation, the virtual model moves downward from its initial position to the actual installation position; the animation process and direction are shown in Fig. 11 (3). The actual installation position of the real gear is shown in Fig. 11 (4), where the virtual green gear model is precisely superimposed on the position to be installed.

The operators completed the installation of the real gears based on the guidance of the virtual model, as shown in Fig. 11 (5), where the actual gear is white on the black base. After the installation of the real gear was completed, the virtual wireframe model of the gear was used to verify the installation accuracy, as shown in Fig. 11 (6); the green wireframe model is accurately superimposed on the real white gear.

The proposed AR registration method is based on image feature description and matching, so its accuracy and effectiveness depend on the extraction of image features. Like other image feature extraction methods (Rublee et al. 2011; Grompone 2012; Tombari et al. 2013), the proposed method is affected by the environment: under low-light or very bright conditions, image features are difficult to extract, which reduces the accuracy of the proposed algorithm. In addition, the proposed method requires searching for the initial pose of the assembly object, which makes it unsuitable for situations where the camera moves rapidly; when the camera moves too quickly, significant errors appear in the calculated object pose.

5 Conclusions

A novel line-matching method called LNED is introduced in this study, which describes the extracted contour lines using local geometric edge features. Unlike commonly used descriptors such as SIFT, ORB, and LBD, LNED does not require a significant amount of texture information in the image for image matching, which addresses the challenge of matching images of textureless assembly parts. Based on LNED matching, a novel markerless AR registration method for augmented assembly systems was presented, featuring a coarse-to-fine strategy for tracking the assembly objects and enabling the accurate registration of textureless parts without artificial markers. The LNED-based registration method applies to textureless assembly parts, although its computational speed needs further improvement. In the future, more accurate bundle adjustment algorithms can be considered for precise pose calculation.