1 Introduction

Augmented Reality (AR) applications coherently combine real and virtual spaces in the same environment, aiming to enrich the user’s sensory experience (Billinghurst et al. 2015). For example, visual depth perception involves physiological and psychological aspects (Kiyokawa 2015) that an AR system must consider.

Among the devices used to display augmented content visually, Head Mounted Displays (HMDs) are wearable computing devices that attach to the user’s head and whose appearance ranges from a helmet to a pair of glasses. HMDs show the AR environment in two main ways: through video (Video See-Through—VST) and through an optical combiner (Optical See-Through—OST) (Combe et al. 2023). Recently, a new category of HMDs has also emerged that uses a smartphone both for tracking the environment (back camera) and as a display (screen) (Langlotz et al. 2014).

One of the main challenges of AR systems is the registration problem: the system must adequately align the real and virtual elements with respect to each other, from the user’s point of view, or the illusion that the two worlds coexist will be compromised (Azuma 1997; Itoh et al. 2021). In systems that use smartphone-based OST HMDs, the registration problem becomes particularly challenging because the point of view of the camera that tracks the environment does not coincide with the user’s (Grubert et al. 2018).

According to Grubert et al. (2018), we can solve the registration problem in an AR system through parameter estimation procedures that require user interaction to collect 3D–2D correspondences by manually aligning a world reference point with 2D points displayed on the screen of an OST HMD.

Calibration is the process of determining the parameters of the tracking camera and of the user’s visual system (Aleotti 2022). It consists of an optimization process that seeks to create a mathematical model reconstructing the formation of an image on a camera sensor (or on the retina, in the case of the human eye) (Grubert et al. 2018). It can be done manually, semi-automatically, or automatically, depending on the degree of human intervention required. In this work, a semi-automatic approach was adopted (Itoh et al. 2021) to avoid the typical errors and inaccuracies of manual human intervention (Grubert et al. 2018).

The optimization process, in turn, can be performed through deterministic or non-deterministic algorithms. While the deterministic approach exclusively uses procedures parameterized by predetermined values, the non-deterministic approach uses probabilistic values. Among the non-deterministic methods, nature-inspired ones, based on physical, chemical, and biological phenomena, stand out. Some advantages of these methods are greater ease in modeling problems and sometimes superior computational performance (de Castro 2006). For these reasons, we chose these methods as a solution for calibration. Although many works have already been proposed to calibrate OST HMDs (Grubert et al. 2018), they have yet to address smartphone-based devices and have used only deterministic optimization methods.

2 Related work

According to Grubert et al. (2018), calibration methods can be divided into three categories: manual, semi-automatic, and automatic calibration. Among the works focused on manual calibration, Tuceryan and Navab (2000) developed a calibration method called SPAAM (Single Point Active Alignment Method). To calibrate the HMD, the user must align a virtual 2D symbol, such as a cross, with a 3D point in the real world (on a marker, for example) several consecutive times; the system then calculates all calibration parameters. The main advantage of this method is its flexibility, as it does not require specific hardware.

Genc et al. (2002) improved SPAAM through a two-stage calibration process called SPAAM2. The user executes the complete calibration only once; when the user uses the system again, it reuses the same intrinsic parameters, allowing the new calibration to be obtained promptly. Zhang et al. (2017) proposed a dynamic SPAAM, called RIDE (Region-Induced Data Enhancement), that considers the orientation of the user’s eyes to calculate the projection model. Kellner et al. (2012) proposed a calibration method in which the user points a portable marker at a distant 3D target, and the system finds correspondences between points.

Regarding semi-automatic calibration methods, Navab et al. (2004) developed Easy SPAAM. This method calibrates the HMD by updating the parameters of the SPAAM method, so the user indicates only a few correspondences between points. Owen et al. (2004) presented the Display-Relative Calibration (DRC) method, which consists of two steps. In the first, the user performs an offline calibration using a mechanical gauge. In the second, the system offers the user five different options to complete the calibration. Gilson et al. (2008) developed a system similar to that of Owen et al. (2004), but their approach replaces the user’s eye with a camera, facilitating the collection of correspondences between points. Makibuchi et al. (2013) proposed the ViRC (Vision-based Robust Calibration) method, which estimates parameters offline; the user must then collect some correspondences between points, by aligning a cross on the screen with a fiducial marker, to complete the calibration. With this, the method reduces the need for user interaction and the errors in the projection model.

Among the automatic calibration methods, Luo et al. (2005) proposed a technique for OST HMDs with a glasses-like structure, using an auxiliary camera. However, the camera position causes registration errors at short distances. Figl et al. (2005) presented a calibration method for a binocular HMD for medical use. The authors used a stepper motor to automatically change the distance of a calibration pattern to automate the process. Itoh and Klinker (2014) developed the INDICA (INteraction-free DIsplay CAlibration) method, which uses an eye tracker coupled to the HMD. This method performs the initial calibration using the SPAAM method, without subsequent user interaction. Plopski et al. (2015) proposed the CIC (Corneal-Imaging Calibration) method, which estimates the eye’s position from images of a fiducial pattern, captured by a camera, that appear reflected on the cornea. For this, the method adopts a simplified model of the human eye based on a combination of two spheres: the outer membrane of the eyeball (sclera) and the cornea itself. In Cutolo et al. (2020), the calibration method presented uses computer vision techniques to estimate the projection parameters of the display for a generic camera position. The approach requires data from a camera that observes a planar pattern on the OST display.

The automatic calibration method presented in the North Star (2019) project divides the process into two steps to correct the distortions generated by the geometry of the optical system. First, the user fixes a camera inside the HMD, pointing at a flat-panel monitor that displays a black and white checkerboard pattern (fiducial marker). With this, it is possible to determine the position and orientation of this monitor with respect to the HMD. Then, the user places four cameras (two pointed at each viewfinder) where the user’s eyes would be. The displays generate a virtual image identical to the pattern seen on the monitor, superimposing it on the camera view but with inverted colors. If the virtual image is perfectly aligned with the fiducial marker, all cameras see an entirely white rectangle in the region occupied by the monitor; in this case, the calibration is complete. Otherwise, the virtual images are moved slightly through an optimization process until the ideal configuration is achieved. The apparent redundancy of cameras is necessary to avoid ambiguities in interpreting the captured images.

3 Proposed method

The proposed method aims to semi-automatically calibrate smartphone-based OST HMDs through stochastic nature-inspired optimization algorithms (NIOAs). Our approach allows the insertion of virtual elements into the scene displayed by the OST HMD to compose an AR environment with adequate registration. The smartphone-based OST HMD is cheaper and more accessible than traditional HMDs and has become popular in recent years.

We use NIOAs due to the computational cost involved in the optimization process required for calibration. A deterministic approach can make modeling the problem challenging and computationally expensive: deterministic methods usually require analytically or numerically calculating first- and second-order derivatives, and when the objective function or constraints include nonlinear functions, ensuring that the boundary conditions are satisfied requires sophisticated modeling. In addition, there is an ever-growing set of applications where nature-inspired computing has shown promise.

3.1 Overview

Since the smartphone camera in our OST HMD is not in the same pose as the user’s point of view, we used a webcam to simulate the user’s gaze (virtual eye) (Fig. 1). This webcam and the smartphone camera simultaneously observe the same fiducial marker so that the system can determine the intrinsic parameters, the extrinsic parameters, and the projection matrix parameters of the virtual camera (which generates the images shown on the display). This strategy enables the generation of an AR environment with correct registration. We can divide this process into the following steps: (i) calibration of the intrinsic parameters of the virtual eye camera and of the smartphone camera; (ii) calibration of the extrinsic parameters of the virtual eye camera with respect to the smartphone camera; (iii) calibration of the virtual camera projection matrix parameters (projection parameters).

Fig. 1

Diagram of calibration process

From the intrinsic parameters, it is possible to determine the fiducial points’ positions in each camera’s coordinate system. We can also perform the procedure when the virtual eye camera is disconnected from the system. In this case, the system uses the points in the smartphone camera’s coordinate system and the extrinsic parameters to determine the position of the fiducial points in the virtual eye camera’s coordinate system. Finally, we can calculate the points in the (smartphone) screen’s coordinate system using the points in the virtual eye’s coordinate system and the projection parameters.
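To make this data flow concrete, the following minimal Java sketch (Java being the language of our implementation) follows one fiducial point through the chain of transformations. The class and method names are illustrative, and the extrinsic step is reduced to a pure translation; the full transformation, including rotations, is given in Sect. 3.2.

```java
// Illustrative sketch of the calibration data flow: smartphone camera
// coordinates -> virtual eye coordinates -> screen coordinates.
// Names and numeric values are hypothetical.
public final class CalibrationPipeline {

    // Extrinsic step (simplified to translation only): map a point from
    // the smartphone camera's coordinate system into the virtual eye's.
    static double[] smartphoneToVirtualEye(double[] p,
                                           double tx, double ty, double tz) {
        return new double[] { p[0] + tx, p[1] + ty, p[2] + tz };
    }

    // Projection step: pinhole model with magnification factors
    // (alpha, beta) and the principal point at the image center.
    static double[] virtualEyeToScreen(double[] p, double alpha, double beta) {
        return new double[] { alpha * p[0] / p[2], beta * p[1] / p[2] };
    }

    public static void main(String[] args) {
        double[] pSmartphone = { 2.0, 1.0, 40.0 };  // cm, illustrative
        double[] pEye = smartphoneToVirtualEye(pSmartphone, 0.0, -3.0, 9.3);
        double[] pScreen = virtualEyeToScreen(pEye, 800.0, 800.0);
        System.out.printf("screen point: (%.2f, %.2f)%n", pScreen[0], pScreen[1]);
    }
}
```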

To objectively verify the quality of the proposed method, we calculated, for each step, the mean squared error (MSE), considering the difference between the ideal position of a set of reference points and the position estimated with the calculated calibration parameters. In the experiments, we used a scenario where the augmented content is presented in a restricted field of view of approximately 50\(^\circ \) vertically and horizontally. This choice took into account both the field of view of the human visual system and the limitations that most OST HMDs impose on the field of view.

The experiments used the Haori Mirror AR Headset as the reference HMD. Figure 2 shows the main components of the HMD. The region indicated by the letter “A” is intended to hold the smartphone, which must be positioned so that its rear camera points forward, enabling tracking of the environment. The AR scene is displayed in stereo vision on the smartphone screen, passes through an opening in the upper part of the HMD, and is reflected by the internal mirror “B” towards display “D”, which reflects the images again in the direction of the user’s gaze. On the path between “B” and “D”, the images also pass through magnifying lenses, indicated by the letter “C”. There is also another lens, in a position symmetrical to this one, for the right eye.

Fig. 2

Structure of the Haori mirror AR headset

In the Haori Mirror optical architecture, the display image is reflected by a mirror, passes through a lens, and is then reflected by the viewfinder. Figure 3 shows the architecture idealized for carrying out the semi-automatic calibration using a camera that simulates the field of view of the human eye (virtual eye). We performed and tested the calibration for the left eye only, to simplify the experiments’ hardware setup. However, all the procedures apply equally to the right eye. Furthermore, it is possible to estimate the vision of the right eye from the vision of the left eye and the distance between the centers of the two pupils.

Fig. 3

Architecture for the semi-automatic calibration, with a camera simulating the user’s field of view (virtual eye)

3.2 Calibration

The calibration process needs to identify points whose positions are known in the world coordinate system (fiducial points). To carry out the tracking, we used a fiducial marker (Fig. 4), which is observed simultaneously by both cameras (virtual eye and smartphone). The marker’s regular pattern allows determining the positions, on the x and y axes, of the vertices of the black squares, as long as the center of the image captured by the observing camera is inside the grid of squares. A distance \(l = 5.34\) cm exists between any pair of neighboring vertices in the same row or column. The user configures in the calibration system the z coordinate, which corresponds to the distance between the camera and the marker.

Fig. 4

Fiducial marker used in our calibration

Executing the procedures above identifies a set of pixels near each vertex. A proximity clustering algorithm reduces each cluster to a single vertex, whose position is the average of the positions of its elements. Our method removes clusters with many pixels, as they represent edges of objects that appear in the camera’s field of view but are not part of the marker; a sketch of this clustering is given after Eq. (1). We evaluated the calibration results by applying the MSE metric according to the equation

$$\begin{aligned} \textrm{MSE} = \frac{\sum ^n_{i=1} (e_i)^2}{n} \end{aligned}$$
(1)

where \(e_i\) is an individual error and n is the number of calibration trials.
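For illustration, the sketch below implements the proximity clustering described above and the MSE of Eq. (1). The greedy cluster-assignment rule and the distance and cluster-size thresholds are our own illustrative choices, not necessarily those of the implemented system.

```java
import java.util.ArrayList;
import java.util.List;

public final class VertexClustering {

    // Greedy proximity clustering: each candidate pixel joins the first
    // cluster whose running centroid lies within maxDist; clusters with
    // too many pixels (object edges, not marker vertices) are discarded.
    static List<double[]> clusterVertices(List<double[]> pixels,
                                          double maxDist, int maxClusterSize) {
        List<List<double[]>> clusters = new ArrayList<>();
        for (double[] p : pixels) {
            List<double[]> target = null;
            for (List<double[]> c : clusters) {
                double[] g = centroid(c);
                if (Math.hypot(p[0] - g[0], p[1] - g[1]) <= maxDist) {
                    target = c;
                    break;
                }
            }
            if (target == null) { target = new ArrayList<>(); clusters.add(target); }
            target.add(p);
        }
        List<double[]> vertices = new ArrayList<>();
        for (List<double[]> c : clusters)
            if (c.size() <= maxClusterSize) vertices.add(centroid(c));
        return vertices;
    }

    static double[] centroid(List<double[]> c) {
        double sx = 0, sy = 0;
        for (double[] p : c) { sx += p[0]; sy += p[1]; }
        return new double[] { sx / c.size(), sy / c.size() };
    }

    // MSE of Eq. (1): mean of the squared individual errors.
    static double mse(double[] errors) {
        double sum = 0;
        for (double e : errors) sum += e * e;
        return sum / errors.length;
    }
}
```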

The intrinsic parameters of a camera are the horizontal (\(\alpha \)) and vertical (\(\beta \)) magnification factors, the angle \(\theta \) between the sensor’s horizontal and vertical axes, and the position (\(x_0\), \(y_0\)) of the center of the projection plane. In our experiments, we set \(\theta \) = 90\(^{\circ }\), assuming that there is no manufacturing defect in the sensors of either camera. This assumption proved adequate in our experiments.

The image coordinate system was transformed before the calibration process, shifting its origin (originally at the lower left corner) to the center of the image and, therefore, of the sensor. Consequently, we have \((x_0, y_0) = (0, 0)\) in the new system. These transformations use the equations

$$\begin{aligned} x' = x - \frac{L}{2} \end{aligned}$$
(2)
$$\begin{aligned} y' = y - \frac{A}{2} \end{aligned}$$
(3)

where L is the width of the image, A is the height of the image, \((x, y)\) are the original coordinates of each pixel, and \((x', y')\) are the coordinates after the transformation. The calibration of the intrinsic parameters must obtain \(\alpha \) and \(\beta \). For the camera being calibrated, the system estimates initial values of these parameters for each fiducial point observed, using the equations

$$\begin{aligned} \alpha = \frac{x_i \times z_c}{x_c} \end{aligned}$$
(4)
$$\begin{aligned} \beta = \frac{y_i \times z_c}{y_c} \end{aligned}$$
(5)

where (\(x_c, y_c, z_c\)) is the position of the fiducial point in the camera coordinate system and (\(x_i, y_i\)) is the position in the image coordinate system. Then, the system calculates the average of the initial estimates of \(\alpha \) and \(\beta \), and uses this average to determine the minimum \(p_{min}\) and maximum \(p_{max}\) values of these parameters in the process of optimization through the equations

$$\begin{aligned} p_{min} = p - 0.5 \times p \end{aligned}$$
(6)
$$\begin{aligned} p_{max} = p + 0.5 \times p \end{aligned}$$
(7)

where p is the initially estimated average value of a parameter. In the next step, the system executes the optimization algorithms, delimiting the initial values to \(p \in [p_{min}, p_{max}]\).
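A minimal sketch of this initialization follows, assuming the observations are given as parallel arrays of camera coordinates and centered image coordinates; in practice, fiducial points with \(x_c = 0\) or \(y_c = 0\) would have to be skipped.

```java
public final class IntrinsicInit {

    // Averages the per-point estimates of Eq. (4) (alpha) or Eq. (5) (beta)
    // and derives the search interval [pMin, pMax] of Eqs. (6)-(7).
    // cam holds rows of (xc, yc, zc); img holds rows of (xi, yi).
    static double[] interval(double[][] cam, double[][] img, boolean horizontal) {
        double sum = 0;
        for (int i = 0; i < cam.length; i++) {
            sum += horizontal
                ? img[i][0] * cam[i][2] / cam[i][0]   // Eq. (4): alpha estimate
                : img[i][1] * cam[i][2] / cam[i][1];  // Eq. (5): beta estimate
        }
        double p = sum / cam.length;                  // average initial estimate
        return new double[] { p - 0.5 * p, p + 0.5 * p }; // Eqs. (6) and (7)
    }
}
```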

For each newly calculated value of \(\alpha \) and \(\beta \) during the optimization process, it is possible to estimate the position (\(x^*_i, y^*_i\)) where each fiducial point should appear in the image coordinate system, given its position (\(x_c, y_c, z_c\)) in the camera coordinate system, if the parameter values are correct. Equations (8) and (9) give this calculation

$$\begin{aligned} x^*_i = \frac{\alpha \times x_c}{z_c} \end{aligned}$$
(8)
$$\begin{aligned} y^*_i = \frac{\beta \times y_c}{z_c}. \end{aligned}$$
(9)

We used the chessboard distance (\(D_c\)) between the estimated position (\(x^*_i, y^*_i\)) and the observed position \((x_i, y_i)\) of each fiducial point in the image coordinate system to calculate the individual errors \(e_i\) (and consequently the MSE) during optimization. We obtain this distance measure by the equation

$$\begin{aligned} D_c = \max (|x_1 - x_2|, |y_1 - y_2|) \end{aligned}$$
(10)

where \((x_1, y_1) = (x^*_i, y^*_i)\) and \((x_2, y_2) = (x_i, y_i)\).
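Combining Eqs. (8)–(10) with Eq. (1), the objective function evaluated for each candidate (\(\alpha \), \(\beta \)) during the optimization can be sketched as follows (array layout as in the previous sketch):

```java
public final class IntrinsicObjective {

    // MSE over the chessboard distances between the positions where the
    // fiducial points should appear for (alpha, beta) -- Eqs. (8)-(9) --
    // and the positions actually observed; Eq. (10) gives each error.
    static double mse(double alpha, double beta,
                      double[][] cam,   // rows of (xc, yc, zc)
                      double[][] img) { // rows of (xi, yi), same order
        double sum = 0;
        for (int i = 0; i < cam.length; i++) {
            double xStar = alpha * cam[i][0] / cam[i][2];      // Eq. (8)
            double yStar = beta  * cam[i][1] / cam[i][2];      // Eq. (9)
            double dc = Math.max(Math.abs(xStar - img[i][0]),
                                 Math.abs(yStar - img[i][1])); // Eq. (10)
            sum += dc * dc;
        }
        return sum / cam.length;
    }
}
```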

The smartphone camera in our OST HMD performs the environment tracking, but as it is not in the same pose as the user’s point of view, we used a webcam to simulate the user’s gaze. Since the system must show the virtual objects from the virtual eye’s perspective, we need to calculate the extrinsic parameters between the two devices. These parameters are the translations on the x, y, and z axes, here called \(t_x\), \(t_y\), and \(t_z\), respectively, and the rotations about the same axes, here called \(r_x\), \(r_y\), and \(r_z\), respectively. The initial estimates of \(t_x\), \(t_y\), and \(t_z\), for each fiducial point observed simultaneously by the smartphone camera and by the virtual eye, are calculated, respectively, by the equations

$$\begin{aligned} t_x = x_{c_o} - x_{c_s} \end{aligned}$$
(11)
$$\begin{aligned} t_y = y_{c_o} - y_{c_s} \end{aligned}$$
(12)
$$\begin{aligned} t_z = z_{c_o} - z_{c_s} \end{aligned}$$
(13)

where \((x_{c_s}, y_{c_s}, z_{c_s})\) is the position of the fiducial point in the smartphone camera coordinate system and \((x_{c_o}, y_{c_o}, z_{c_o})\) is the position in the virtual eye camera coordinate system. Then, as with the calculation of the intrinsic parameters, the system calculates the average of these initial estimates and the minimum and maximum values for each of the three parameters. The initial guess of \(r_x\), \(r_y\), and \(r_z\) is 0\(^\circ \), and their minimum and maximum values are \(-5^\circ \) and \(5^\circ \), respectively. Each time the parameter values are updated during the optimization process, it is possible to estimate the position (\(x^*_{c_o}\), \(y^*_{c_o}\), \(z^*_{c_o}\)) where each fiducial point should appear in the virtual eye camera’s coordinate system, given its position (\(x_{c_s}\), \(y_{c_s}\), \(z_{c_s}\)) in the smartphone camera’s coordinate system. The system calculates these values using the equation

$$\begin{aligned} \begin{bmatrix} x^*_{c_o} \\ y^*_{c_o} \\ z^*_{c_o} \\ 1 \end{bmatrix} = \begin{bmatrix} 1 &{} 0 &{} 0 &{} t_x \\ 0 &{} 1 &{} 0 &{} t_y \\ 0 &{} 0 &{} 1 &{} t_z \\ 0 &{} 0 &{} 0 &{} 1 \end{bmatrix} \times \begin{bmatrix} 1 &{} 0 &{} 0 &{} 0 \\ 0 &{} \cos r_x &{} -\sin r_x &{} 0 \\ 0 &{} \sin r_x &{} \cos r_x &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \end{bmatrix} \times \begin{bmatrix} \cos r_y &{} 0 &{} \sin r_y &{} 0 \\ 0 &{} 1 &{} 0 &{} 0 \\ -\sin r_y &{} 0 &{} \cos r_y &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \end{bmatrix} \times \begin{bmatrix} \cos r_z &{} -\sin r_z &{} 0 &{} 0 \\ \sin r_z &{} \cos r_z &{} 0 &{} 0 \\ 0 &{} 0 &{} 1 &{} 0 \\ 0 &{} 0 &{} 0 &{} 1 \end{bmatrix} \times \begin{bmatrix} x_{c_s} \\ y_{c_s} \\ z_{c_s} \\ 1 \end{bmatrix} \end{aligned}$$
(14)

We used the chessboard distance in the previous step because we considered it suitable for a discrete space such as the image coordinate system, whereas the Euclidean distance was chosen for a continuous space such as the camera coordinate system. Therefore, the system uses the Euclidean distance \(D_e\) between the estimated position (\(x^*_{c_o}\), \(y^*_{c_o}\), \(z^*_{c_o}\)) and the observed position (\(x_{c_o}\), \(y_{c_o}\), \(z_{c_o}\)) of each fiducial point in the virtual eye camera coordinate system to calculate the individual errors \(e_i\) (and consequently the MSE) during optimization. We obtain this distance measure by the equation

$$\begin{aligned} D_e = \sqrt{(x_1-x_2)^2 + (y_1-y_2)^2 + (z_1 - z_2)^2} \end{aligned}$$
(15)

where \((x_1, y_1, z_1) = (x^*_{c_o}, y^*_{c_o}, z^*_{c_o})\) and \((x_2, y_2, z_2) = (x_{c_o}, y_{c_o}, z_{c_o})\).
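For illustration, the sketch below applies the transformation of Eq. (14), with the rotations applied in the order implied by the matrix product (\(R_z\), then \(R_y\), then \(R_x\), then the translation), together with the error measure of Eq. (15):

```java
public final class ExtrinsicObjective {

    // Eq. (14): rotate a smartphone-camera point by rz, ry, rx (radians)
    // and translate it into the virtual eye camera's coordinate system.
    static double[] transform(double[] p, double tx, double ty, double tz,
                              double rx, double ry, double rz) {
        // Rz
        double x1 = Math.cos(rz) * p[0] - Math.sin(rz) * p[1];
        double y1 = Math.sin(rz) * p[0] + Math.cos(rz) * p[1];
        double z1 = p[2];
        // Ry
        double x2 =  Math.cos(ry) * x1 + Math.sin(ry) * z1;
        double y2 = y1;
        double z2 = -Math.sin(ry) * x1 + Math.cos(ry) * z1;
        // Rx
        double x3 = x2;
        double y3 = Math.cos(rx) * y2 - Math.sin(rx) * z2;
        double z3 = Math.sin(rx) * y2 + Math.cos(rx) * z2;
        // Translation
        return new double[] { x3 + tx, y3 + ty, z3 + tz };
    }

    // Eq. (15): Euclidean distance used as the individual error.
    static double euclidean(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1], dz = a[2] - b[2];
        return Math.sqrt(dx * dx + dy * dy + dz * dz);
    }
}
```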

3.3 Parameters for optimization

The parameters of the NIOAs evaluated in this research are values such as mutation and crossover rates of the evolutionary algorithm. The proper choice of these parameters can contribute to the optimization process’s efficiency and effectiveness. We determined these parameters considering literature recommendations (Sexton et al. 2019) and experimental tests using the Easom function, given by the equation

$$\begin{aligned} f(x_1,x_2) = -\cos {(x_1)}\cos {(x_2)}\exp {(-(x_1 - \pi )^2 - (x_2 - \pi )^2)} \end{aligned}$$
(16)

where \(x_1\) and \(x_2\) are real numbers, \( x_1 \in [-10, 10] \) and \( x_2 \in [-10, 10] \), with the global optimum at \( (\pi , \pi ) \).
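For reference, the Easom function translates directly into code; the call in main merely confirms the global optimum \(f(\pi , \pi ) = -1\).

```java
public final class Easom {

    // Easom benchmark of Eq. (16); global minimum f = -1 at (pi, pi).
    static double f(double x1, double x2) {
        return -Math.cos(x1) * Math.cos(x2)
                * Math.exp(-(x1 - Math.PI) * (x1 - Math.PI)
                           - (x2 - Math.PI) * (x2 - Math.PI));
    }

    public static void main(String[] args) {
        System.out.println(f(Math.PI, Math.PI)); // prints -1.0
    }
}
```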

In the simulated annealing algorithm, we performed the optimization with 50 initial solutions (particles) instead of a single one. In each iteration, the system independently updates all solutions. A Gaussian deviation was adopted for the perturbations, with \( \mu = 0 \) and, for each optimized variable, \( \sigma \) equal to 0.0103 times the size of the interval of its respective domain. The initial temperature was \( 9 \cdot 10^{-9} \), reduced by a factor \( \beta _d = 0.753 \) every 100 iterations. The algorithm ends its execution after 20 temperature reductions.
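A minimal sketch of this simulated annealing configuration follows. The Metropolis acceptance rule is an assumption on our part, as the paragraph above fixes the parameters but not the acceptance criterion.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

public final class SimulatedAnnealingSketch {

    static final Random RNG = new Random();

    // Multi-start SA over a box domain [lo, hi]: 50 solutions updated
    // independently each iteration; T0 = 9e-9, cooled by 0.753 every
    // 100 iterations, stopping after 20 reductions.
    static double[] minimize(ToDoubleFunction<double[]> f,
                             double[] lo, double[] hi) {
        int dim = lo.length, nSolutions = 50;
        double t = 9e-9;
        double[][] sols = new double[nSolutions][dim];
        for (double[] s : sols)
            for (int d = 0; d < dim; d++)
                s[d] = lo[d] + RNG.nextDouble() * (hi[d] - lo[d]);
        double[] best = sols[0].clone();
        for (int reduction = 0; reduction < 20; reduction++) {
            for (int it = 0; it < 100; it++) {
                for (double[] s : sols) {
                    double[] cand = s.clone();
                    for (int d = 0; d < dim; d++)  // Gaussian perturbation
                        cand[d] += RNG.nextGaussian() * 0.0103 * (hi[d] - lo[d]);
                    double delta = f.applyAsDouble(cand) - f.applyAsDouble(s);
                    // Metropolis rule (assumed): accept improvements always,
                    // worse moves with probability exp(-delta / t).
                    if (delta < 0 || RNG.nextDouble() < Math.exp(-delta / t))
                        System.arraycopy(cand, 0, s, 0, dim);
                    if (f.applyAsDouble(s) < f.applyAsDouble(best))
                        best = s.clone();
                }
            }
            t *= 0.753;  // temperature reduction every 100 iterations
        }
        return best;
    }
}
```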

In the evolutionary strategy algorithm, we adopted a population of 375 individuals and a Gaussian perturbation for the mutations, with \( \mu = 0 \) and, for each optimized variable, \( \sigma \) equal to 0.21 times the size of the interval of its respective domain. We used an initial mutation rate of 1, reduced by a factor \( \beta _d = 0.981 \) at each iteration (new population generated), with a maximum of 50 iterations. The crossover rate was 0.7, with recombinations performed by calculating a weighted average (with random weighting) between the values of two individuals.
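The sketch below mirrors this configuration. The elitist replacement scheme (keeping the best individuals from parents and offspring) is an assumption, since the selection mechanism is not detailed above.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.function.ToDoubleFunction;

public final class EvolutionaryStrategySketch {

    static final Random RNG = new Random();

    static double[] minimize(ToDoubleFunction<double[]> f,
                             double[] lo, double[] hi) {
        int popSize = 375, dim = lo.length;
        double mutationRate = 1.0, crossoverRate = 0.7;
        double[][] pop = new double[popSize][dim];
        for (double[] ind : pop)
            for (int d = 0; d < dim; d++)
                ind[d] = lo[d] + RNG.nextDouble() * (hi[d] - lo[d]);

        for (int it = 0; it < 50; it++) {
            double[][] offspring = new double[popSize][];
            for (int i = 0; i < popSize; i++) {
                double[] child = pop[i].clone();
                if (RNG.nextDouble() < crossoverRate) {        // recombination:
                    double[] mate = pop[RNG.nextInt(popSize)]; // weighted average
                    double w = RNG.nextDouble();               // random weighting
                    for (int d = 0; d < dim; d++)
                        child[d] = w * child[d] + (1 - w) * mate[d];
                }
                for (int d = 0; d < dim; d++)                  // Gaussian mutation
                    if (RNG.nextDouble() < mutationRate)
                        child[d] += RNG.nextGaussian() * 0.21 * (hi[d] - lo[d]);
                offspring[i] = child;
            }
            // Assumed elitist replacement: best popSize of parents + offspring
            // (fitness is re-evaluated inside the comparator for brevity).
            double[][] union = new double[2 * popSize][];
            System.arraycopy(pop, 0, union, 0, popSize);
            System.arraycopy(offspring, 0, union, popSize, popSize);
            Arrays.sort(union, (a, b) ->
                Double.compare(f.applyAsDouble(a), f.applyAsDouble(b)));
            pop = Arrays.copyOfRange(union, 0, popSize);
            mutationRate *= 0.981;  // mutation rate decay per generation
        }
        return pop[0];
    }
}
```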

In the particle swarm algorithm, we used 40 particles, with acceleration constants \( AC_1 = AC_2 = 2.05 \), constriction coefficient \( \chi = 0.729 \), and initial inertia weight \( w = 1 \), reduced at each iteration by a factor \( \beta _d = 0.985 \). The system performs a maximum of 50 iterations. We considered a circular neighborhood of size one; that is, each particle is influenced by the behavior of its two nearest neighbors.
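A corresponding sketch with these parameters is given below; combining the constriction coefficient with a decaying inertia weight is one common formulation, assumed here since the exact update rule is not spelled out above.

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

public final class ParticleSwarmSketch {

    static final Random RNG = new Random();

    static double[] minimize(ToDoubleFunction<double[]> f,
                             double[] lo, double[] hi) {
        int n = 40, dim = lo.length;
        double chi = 0.729, ac1 = 2.05, ac2 = 2.05, w = 1.0;
        double[][] x = new double[n][dim], v = new double[n][dim];
        double[][] pBest = new double[n][];
        double[] pBestVal = new double[n];
        for (int i = 0; i < n; i++) {
            for (int d = 0; d < dim; d++)
                x[i][d] = lo[d] + RNG.nextDouble() * (hi[d] - lo[d]);
            pBest[i] = x[i].clone();
            pBestVal[i] = f.applyAsDouble(x[i]);
        }
        for (int it = 0; it < 50; it++) {
            for (int i = 0; i < n; i++) {
                // Ring neighborhood of size one: particle i is influenced
                // by the best of {i-1, i, i+1}.
                int nb = i;
                for (int k : new int[] { (i + n - 1) % n, (i + 1) % n })
                    if (pBestVal[k] < pBestVal[nb]) nb = k;
                for (int d = 0; d < dim; d++) {
                    v[i][d] = chi * (w * v[i][d]
                            + ac1 * RNG.nextDouble() * (pBest[i][d] - x[i][d])
                            + ac2 * RNG.nextDouble() * (pBest[nb][d] - x[i][d]));
                    x[i][d] += v[i][d];
                }
                double val = f.applyAsDouble(x[i]);
                if (val < pBestVal[i]) { pBestVal[i] = val; pBest[i] = x[i].clone(); }
            }
            w *= 0.985;  // inertia weight decay
        }
        int best = 0;
        for (int i = 1; i < n; i++) if (pBestVal[i] < pBestVal[best]) best = i;
        return pBest[best];
    }
}
```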

4 Experimental environment

The calibration system developed has two modules, one running on a notebook and the other on a smartphone, communicating via Bluetooth. Figure 5 shows the calibration process, which is carried out entirely in the notebook module, based on the images captured through the virtual eye and the images received remotely from the smartphone. Once the system obtains the calibration parameters, it sends them via Bluetooth back to the smartphone, which uses them to render the virtual objects in the correct position on the screen.

Fig. 5

Data flow and processing on notebook and smartphone

To create the experimental environment, we used a Haori Mirror AR Headset OST HMD; a Motorola Moto g(6) play smartphone; a Full HD (1080p) webcam with a USB connection; an Acer Aspire 5 A515-41G-1480 notebook; a mannequin head, similar in size and shape to a real human head; a parallelepiped-shaped wooden base with dimensions of 30.5 \(\times \) 30.5 \(\times \) 1.5 cm; a tripod with adjustable height, a level meter, and a base that can rotate around three orthogonal axes in space; a three-dimensional metal piece in the shape of an “L”, made up of two rectangular plates measuring 28 \(\times \) 21 cm and 28 \(\times \) 9.5 cm; a cardboard plate cut to the dimensions of the larger metal plate; and an A4 bond sheet with a printed fiducial marker. Figure 6a shows our experimental environment.

Fig. 6

Experimental environment. a Environment configuration, b virtual eye, c perforation in the left eye position, d, e OST HMD mounted on the mannequin; and f three-dimensional orthogonal reference coordinate system

The webcam simulates the OST HMD user’s point of view (virtual eye) and, therefore, was attached to the mannequin’s head with its lens fitted into a perforation made in the position of the left eye (Fig. 6b). Due to the camera’s shape, it was also necessary to make a hole in the left temple (Fig. 6c). We also made a hole at the back of the head so that the webcam could be inserted more easily. This set was screwed onto the wooden base to facilitate its positioning and movement during the experiments and to compensate for the weight of the OST HMD and the smartphone, preventing the head from tilting forward.

We placed the OST HMD on the mannequin’s head in the same way as it would be placed on a real user and inserted the smartphone into the space reserved for it in the upper part of the helmet (Fig. 6d, e). The notebook receives the images from the webcam through the USB cable, which passes through the hole at the back of the head. The notebook communicates with the smartphone via Bluetooth.

In the tests, we fixed the OST HMD in different positions with respect to the marker, and we adopted the three-dimensional orthogonal reference coordinate system (Fig. 6f). The system has its origin at the center of the marker, the horizontal x axis pointing to the right (from the OST HMD’s perspective), the vertical y axis pointing up, and the horizontal z axis pointing forward, toward the OST HMD.

The calibration process uses two applications, one on the smartphone and the other on the notebook. We implemented the smartphone application in Java, running on the Android 9 Pie operating system. The notebook application is also written in Java and runs on the Ubuntu 21.04 operating system.

The user controls the calibration process through the notebook application. Initially, the user must inform, through a text command environment, which of the cameras currently connected to the computer corresponds to the virtual eye. The system creates a window displaying the image captured by the virtual eye (at 640 \(\times \) 480 resolution) and waits, in the background, for the user to launch the application on the smartphone.

Figure 7 shows the images presented to the user during the calibration process after connecting the devices. The first row corresponds to the variations of the virtual eye image, while the second corresponds to the variations of the smartphone image. The first column shows the original images in grayscale, the second column shows the edges of these images, and the third column shows the vertices of the black squares of the fiducial marker, outlined by blue squares to facilitate user visualization. The red lines are reference lines that divide the images into equal parts, both horizontally and vertically, and consequently intersect at the center of each image.

Fig. 7

Images from both cameras displayed in the notebook application. Variations of the virtual eye image (first row), variations of the smartphone image (second row), original images in grayscale (first column), edges of the original images (second column), and vertices of the black squares of the fiducial marker outlined by blue squares. The red lines are reference lines (Color figure online)

The thresholds for detecting the edges and vertices of the virtual eye image are among the parameters that the user must configure. The ideal values for these parameters may vary according to the camera model, the ambient light, and the pigment used to print the marker. The thresholds for detecting edges and vertices in the smartphone images also depend on the factors above. The user must also configure in the system the side length of the black squares of the fiducial marker, the initial distance between the marker and the smartphone, and the distance between the smartphone and the virtual eye (both along the z axis). These configurations allow the user to print the marker at any scale (keeping the original aspect ratio) and start the calibration process with the smartphone and the virtual camera in any position. It is also necessary to configure the specification parameters of the virtual square used as a reference for calculating the projection matrix. As the user modifies these parameters, the system displays the generated virtual square on the screen, highlighted by a blue outline (Fig. 8). The objective of the calibration process is to “fit” the virtual square over one of the black squares or one of the “white” squares (spaces) between them. This strategy allows the calibration to adapt to different smartphone screen sizes.

Fig. 8

Virtual square used as a reference for calibration

Fig. 9

System configuration. a Virtual eye—image coordinates, b virtual eye—Camera coordinates, c Smartphone—image coordinates; and d Smartphone—Camera coordinates

The system also displays, in real time, the positions of the vertices of the marker squares, from the point of view of both cameras, both in the image coordinate system, in pixels (with origin at the center of the image), and in the camera coordinate system, in centimeters. Figure 9a–d shows the 2D points printed in the format (x; y) and the 3D points in the format (x; y; z).

The system allows the user to calibrate the intrinsic parameters (of both cameras), the extrinsic parameters (of the virtual eye with respect to the smartphone), and the projection matrix parameters. The system estimates the error continuously and replaces the stored values whenever the new error is smaller than the previously calculated one. After the system obtains the intrinsic parameters, it calculates the distance of each camera relative to the marker. When the OST HMD moves (closer to or farther from the marker), the system automatically updates the values.

The lighter virtual square shown among the four black squares is rendered using the calculated calibration parameters. In an ideal calibration, the virtual square should have the same size and rotation as the black squares and be positioned precisely in the space between them (Fig. 8).

5 Experiments and results

The objectives of the experiments are (i) to calibrate the intrinsic, extrinsic, and projection parameters of the OST HMD through the calibration system; and (ii) to evaluate the calibration of the OST HMD configured in different positions with respect to the marker. We measure the positions based on the reference coordinate system. We move the OST HMD on the table to change the position on the x and z axes, and we change the tripod height setting to change the position relative to the marker on the y axis.

We defined the OST HMD positions for the tests considering the following criteria. On the z axis, we tested four equally spaced positions within the limits allowed by the table’s length. On the x axis, we tested three positions: (a) with the set in the center of the table with respect to its width (\(x = 0\)); (b) the highest possible value of x at which the smartphone does not lose the marker reference and the virtual square does not leave the limits of the internal lens of the OST HMD; and (c) the lowest possible value of x, following the same criteria as in the previous item. On the y axis, we used the same criteria as on the x axis, but also considered the maximum and minimum possible heights for the tripod. Table 1 presents the 20 tested positions, measured in centimeters, taking into account all possible combinations for the three axes.

Table 1 Positions used for calibration tests

We calibrated with the OST HMD in the initial position \(x = 0\), \(y = 0\), and \(z = 32.6\). The z position corresponds to the distance, along the same axis, between the marker and the smartphone. The z distance from the smartphone to the virtual eye is fixed and, in the configuration adopted for the experiment (with the helmet and virtual eye resting on the mannequin head), is 9.3 cm. Therefore, the initial distance, on the z axis, between the virtual eye and the marker is 41.9 cm (32.6 + 9.3). The user must configure these initial values in the system.

We executed each calibration step (extrinsic, intrinsic, and projection parameters) 1000 times for each tested nature-inspired algorithm (simulated annealing, evolutionary strategy, and particle swarm) and then calculated the mean error. Table 2 shows the results for the calibration of the intrinsic parameters of the smartphone camera and the virtual eye, with the average error in squared pixels measured using the mean squared error (MSE) metric.

Table 2 Average error in pixels squared (1000 runs)—calibration of intrinsic parameters

Table 3 presents the results for the extrinsic parameters, with the average error measured in squared centimeters, and for the projection parameters, with the error measured in squared units of normalized smartphone screen coordinates.

Table 3 Average error in centimeters squared (1000 runs)—calibration of extrinsic and projection parameters

Figures 10 and 11 show the test results after the OST HMD calibration in each position given in Table 1. We divided each image into five rows (each row corresponding to a test in a specific position) and two columns. The left column shows the picture of the OST HMD in the tested position, the representation of the x and z axes on the table (first three rows of each image), the representation of the y axis on the wall (last two rows of each image), and the values of x and y in centimeters in the legend in the upper left corner. The right column shows the screenshot of the window displayed by the calibration system when running the test. These images show the registration obtained by both cameras. The virtual square visualized by the virtual eye appears in the first row of each capture.

Fig. 10

Calibration tests for \(z = 32.6\) and \(z = 66.6\)

Fig. 11

Calibration tests for \(z = 100.7\) and \(z = 134.8\)

The virtual square generated after the calibration process has size, rotation, and positioning close to those observed in the image captured by the virtual eye. However, in the architecture of the OST HMD used, the fields of view of the user’s eye and of the smartphone camera share only a small region, so it is challenging to keep the head positioned so as to see the marker while keeping it close to the center of the smartphone camera image, which is necessary for tracking. Small head movements, even involuntary ones, cause the marker reference to be lost and the virtual square to disappear.

Another aspect observed during the tests was the accelerated battery consumption and heating of the smartphone. This problem occurs due to the high processing load required by the constant capture and processing of images and by the continuous sending of these images via Bluetooth to the notebook. Latency was also a recurring problem, as the Bluetooth communication carries a large volume of data. The system performed comfortably (with no latency issues) using frames with a \(640\times 480\) pixel resolution.

The evolutionary strategy and particle swarm algorithms presented precision very similar to that of the simulated annealing algorithm in the calibration of intrinsic parameters. In the calibration of extrinsic and projection parameters, there was no significant difference between the three algorithms. The tests with the OST HMD in different positions showed that the proposed method allows precise registration between real and virtual elements. The positioning error of the vertices of the virtual square, in world coordinates, is 2.67 cm on the x and y axes in all cases. Accuracy, however, is reduced at greater distances. This limitation is acceptable because applications that use fiducial markers typically operate at limited recognition distances.

Comparing existing calibration methods in the literature in terms of accuracy is a difficult task, since the experimental environments used by the authors vary in the number of measurement locations, the number of measurements at each location, the distance from the equipment to the physical marker used in calibration, and the angle of inclination of the device with respect to the marker. Furthermore, the authors used different metrics to measure the performance of their methods. We present in Table 4 the metrics and errors highlighted by the authors for each state-of-the-art calibration method analyzed in this research. However, we suggest not using these values directly to determine the best-performing calibration method; details of how each metric is defined and of the experimental setups are presented in the original articles.

Table 4 Metrics and errors of our method and the state-of-the-art methods

In Table 5, we compared our method with the state-of-the-art ones regarding significant characteristics.

Table 5 Comparison of our method with the state-of-the-art methods

We use non-deterministic optimization methods to calibrate the smartphone-based OST HMD and, to the best of our knowledge, no other work in the literature calibrates OST HMDs using non-deterministic optimization methods. The use of non-deterministic optimization methods allows easier problem modeling and possibly better computational performance (de Castro 2006). Furthermore, as our OST HMD calibration approach is smartphone-based, it is a low-cost solution: any smartphone can be used with the device, without the need to purchase specific, expensive equipment.

6 Conclusion

In this work, we developed a method for semi-automatically calibrating smartphone-based OST HMDs through stochastic nature-inspired optimization algorithms. As a reference device, we used the Haori Mirror AR Headset. Our calibration approach is a low-cost solution, since we can use any smartphone with a low-cost HMD. Besides, smartphone-based OST HMDs can exploit modern smartphones’ graphics and processing power. The results showed that our method is accurate within peripersonal space (approximately one meter around the user), allowing the use of this type of OST HMD in AR applications.

Calibrating a device semi-automatically allows the user to execute this procedure quickly and accurately, as it avoids errors typical of manual manipulation. Even if an additional manual calibration step is later necessary to adapt the estimated parameters to different users, it can be done simply and quickly, as it consists of a refinement and not a complete calibration process. AR applications that use such devices can benefit from our approach, since the calibrated device positively impacts the quality of the registration of virtual and real elements and, consequently, the quality of the AR experience. In future works, we intend to adapt the calibration system to other smartphone-based OST HMD architectures.