As stated above, the goal of the present paper was not to develop “just another platform” for remote assistance. Rather, it was to study how the untapped potential of AR technology could be exploited to build a resource-effective approach to the problem. In particular, the aim was to maximize the quality of the support offered to on-site operators while also optimizing the time investment required of remote experts.
Hence, a set of requirements and must-have features supporting the above approach was first identified; then, since none of the existing solutions analyzed in the review of the state of the art integrated them all, a platform was developed to fill this gap. Its architecture is reported in Fig. 1. Three main components can be identified: the operator-side mobile application and the expert-side web portal, which represent the platform front-end, and a back-end supporting key services such as information exchange, archiving, etc. Design choices and implementation details are discussed in the following.
Operator-side mobile application
A number of design choices had to be made concerning, among others, the AR-enabled devices to be supported by the mobile application, the tracking method to adopt, the AR framework to use, and the AR features to exploit.
As far as the devices to be supported are concerned, different alternatives were considered. Although wearable technologies (such as HMDs or smartglasses) may guarantee a higher degree of freedom compared to hand-held devices, previous works showed that they can introduce limitations in terms of interaction; hence, hand-held devices are often preferred by users. Moreover, smartphones and tablets are much more common and widespread than wearable devices, and they were shown to be easier to use, as users are already accustomed to them [4, 5, 21, 35]. Finally, the possibility to avoid the use of additional, expensive hardware may be a significant advantage from the viewpoint of the company offering the assistance service, which can serve a wider and more heterogeneous set of customers.
Based on these considerations, the operator-side application was targeted to mobile devices, even though the proposed approach could also be implemented on HMDs or smartglasses. Given the larger number of devices supporting Android, it was implemented for this environment, but similar functionalities could also be provided in other environments.
Another key point in the implementation of the operator-side application was the selection of the tracking method. Marker-based approaches were proven to be the most effective solution for many industrial applications [7, 17]. However, these approaches are characterized by some limitations which are particularly critical for remote assistance, as reported in [15, 34], and . Unlike other “planned” activities such as maintenance or training, remote assistance cannot rely on a previous setup phase to place, e.g., markers where necessary. Moreover, the flow of the assistance can be very unpredictable: the remote expert may be requested to devise new solutions to a given problem while providing the support, resorting to methods for quick content (authoring and) placement.
Nevertheless, marker detection or, more generally, image detection can be useful in the considered context to identify known products (and their parts) when this information has already been integrated in the product itself by the manufacturer. For instance, in the developed application, OCR (optical character recognition) was used to identify, before starting the call, the product to be serviced via its product label (similarly to what was done in  using a QR code).
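The label-based identification step can be illustrated with a minimal sketch. It assumes that the OCR engine has already returned the label text as a list of lines and that the product is identified by a model code; both the code format (`MODEL_PATTERN`) and the function name `extract_model_code` are hypothetical, since the actual label layout is not described here.

```python
import re
from typing import List, Optional

# Hypothetical label convention: a model code such as "MDL-4821-B".
# The real label format used by the platform is not specified.
MODEL_PATTERN = re.compile(r"\b([A-Z]{2,4})-(\d{3,6})(?:-([A-Z]))?\b")


def extract_model_code(ocr_lines: List[str]) -> Optional[str]:
    """Return the first token in the OCR output that looks like a model code."""
    for line in ocr_lines:
        # OCR output is often noisy: normalise case and collapse whitespace.
        cleaned = " ".join(line.upper().split())
        match = MODEL_PATTERN.search(cleaned)
        if match:
            return match.group(0)
    # No recognisable code: the application could fall back to manual selection.
    return None
```

For example, `extract_model_code(["Acme Corp", "model: mdl-4821-b"])` yields the normalised code, which could then be used to look up the product and its FAQs.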
However, for the remote assistance session, the marker-based approach was discarded in favor of 6-DOF optical-inertial positional tracking. As seen, most of the commercial solutions use this technique to manage AR contents, since it allows them to be attached directly to real-world elements.
As said, the mobile application was implemented for the Android environment. To develop the AR functionalities, the ARCore library (footnote 2) was selected because of its deep integration with the underlying system and its cost-effectiveness. Its 6-DOF positional tracking was exploited to attach AR contents to elements in the operator’s space, and its “anchoring” feature was used to keep these contents in place during the assistance session. Within a session, a call with the expert can be established, closed, and then re-established as needed; through anchoring, AR contents are also preserved across these calls. The devised assistance paradigm can thus provide operators with helpful information that could increase their ability to operate autonomously. Ideally, this reduces the load on the remote experts not only when solving the problem at hand, but also should the same situation occur again in the future.
The client application supports registration/authentication, product recognition (via OCR on the product label, if available) and initial troubleshooting. If the operator is not able to solve the problem autonomously with the provided frequently asked questions (FAQs), then the application lets him or her initiate an AR-supported remote assistance session (or schedule it).
An audio-video call is set up. During the call, the mobile device’s camera is used to provide the expert with a video feed of the serviced product and the surrounding environment, whereas the screen is used as a “magic-window” for displaying AR contents to the operator.
The expert can provide support to the operator using all the most common AR-based tools identified in the review of the state of the art, whose implementations are shown in Fig. 2. The tools can be split into two categories, i.e., temporary and persistent.
Temporary tools include hand drawings (Fig. 2a) and the laser pointer (Fig. 2b), and rely on graphic contents displayed as 2D overlays over the video feed; according to , these 0-DOF elements can be referred to as “0D” AR contents. These elements are non-permanent, i.e., they vanish a few seconds after placement. Persistent tools, on the other hand, exploit 6-DOF tracking to let the expert place 3D shapes like arrows and circles (Fig. 2c and Fig. 2d) as well as instruction cards (Fig. 2e and Fig. 2f); these contents are attached to real-world elements using spatial anchoring (hence, in , they are named “6D” AR contents). Cards, in particular, can contain text, images, or animated GIFs. If cards do not need to be spatially anchored in the real world, they can simply be inserted as 0D contents.
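The distinction between temporary and persistent tools, and between 0D and 6D contents, can be summarized in a small data model. The following is only a sketch: the class and field names are hypothetical, and the lifetime of temporary contents (`TEMPORARY_LIFETIME_S`) is an assumed value, since the text only states that they vanish after a few seconds.

```python
from dataclasses import dataclass
from enum import Enum


class Tool(Enum):
    HAND_DRAWING = "hand_drawing"
    LASER_POINTER = "laser_pointer"
    SHAPE_3D = "shape_3d"
    CARD = "card"


# Assumed lifetime; the paper only says "a few seconds".
TEMPORARY_LIFETIME_S = 3.0
TEMPORARY_TOOLS = {Tool.HAND_DRAWING, Tool.LASER_POINTER}


@dataclass
class Annotation:
    tool: Tool
    placed_at: float        # seconds since the start of the session
    anchored: bool = False  # True: "6D" content; False: "0D" overlay

    def dof(self) -> int:
        """Degrees of freedom of the content: 6 if anchored, 0 otherwise."""
        return 6 if self.anchored else 0

    def is_visible(self, now: float) -> bool:
        """Temporary contents expire; persistent ones stay until removed."""
        if self.tool in TEMPORARY_TOOLS:
            return now - self.placed_at < TEMPORARY_LIFETIME_S
        return True
```

Under these assumptions, a laser pointer placed at second 10 is still visible at second 11 but not at second 14, whereas an anchored arrow remains visible for the whole session.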
A key feature of the devised platform is the possibility to record the AR contents exchanged during the session. Once placed, the contents are displayed in a timeline within a scrollable panel located in the lower part of the interface (Fig. 2g) which can be shown/hidden by pressing a button.
Not all the contents are stored in the timeline. Very temporary or explanation-complementary tools (laser pointer, hand drawings, and 3D shapes) are not considered, although some of them are spatially persistent throughout the session. By contrast, text/image cards, which are powerful tools for the expert to visually fix in the operator’s mind the concept being explained by voice, are recorded for later use, whether they are anchored or not. These contents can be accessed both within the current session and in the future. The timeline panel can also be used to enlarge (to full screen) or reduce 0D cards, as well as to highlight the card to be displayed when multiple 6D cards occlude one another.
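The recording rule described above can be sketched as a simple filter applied whenever a content is placed. The content kinds in `RECORDED_KINDS`, the `Timeline` class, and its entry layout are hypothetical names chosen for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Only cards are kept; laser pointer, hand drawings and 3D shapes are skipped.
RECORDED_KINDS = {"text_card", "image_card", "gif_card"}


@dataclass
class Timeline:
    entries: List[Dict] = field(default_factory=list)

    def on_content_placed(self, kind: str, payload: str,
                          anchored: bool, t: float) -> bool:
        """Record the content if it is a card; return True when recorded."""
        if kind not in RECORDED_KINDS:
            return False
        self.entries.append(
            {"kind": kind, "payload": payload, "anchored": anchored, "t": t}
        )
        # Keep entries in chronological order for display in the panel.
        self.entries.sort(key=lambda entry: entry["t"])
        return True
```

Note that anchoring does not influence the decision: an anchored 3D shape is discarded, while a non-anchored card is kept.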
The augmented video-stream is also saved (with bookmarked instructions) and made accessible to the user for a full recap of the session (Fig. 2h).
Expert-side web portal
The expert-side of the platform was developed as a web portal, similar to . The portal’s interface is shown in Fig. 3.
During the call, the interface provides the expert with information concerning the operator (company, equipment, etc.) and previous requests for assistance (if any). Data collected by the operator through the initial troubleshooting phase are also reported, offering the expert information that could help him or her to frame the context of the assistance request and possibly anticipate the operator’s needs.
Similar to , the Twilio APIs (footnote 3) were used to establish a peer-to-peer communication with the operator’s side, consisting of bi-directional audio and mono-directional (operator-to-expert) video. The video feed is displayed in the portal’s interface. Since the Twilio APIs support the exchange of other data between the involved peers and can be integrated with ARCore to create collaborative AR experiences, they were also exploited to support the transfer of AR data.
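To illustrate what the transfer of AR data could look like, the sketch below encodes a placement request issued by the expert as a JSON message and decodes it on the operator’s side. The message schema and the function names are assumptions made for illustration; no actual Twilio call is shown, and the transport is left abstract. Click coordinates are normalized so that the two peers do not need to share the same resolution (the same aspect ratio is assumed).

```python
import json
from typing import Tuple


def encode_placement(tool: str, click_px: Tuple[int, int],
                     feed_size: Tuple[int, int]) -> str:
    """Expert side: turn a click on the displayed video feed into a message."""
    (x, y), (width, height) = click_px, feed_size
    # Normalised coordinates in [0, 1] are independent of the display size.
    return json.dumps({"type": "place", "tool": tool,
                       "u": x / width, "v": y / height})


def decode_placement(message: str,
                     screen_size: Tuple[int, int]) -> Tuple[str, Tuple[int, int]]:
    """Operator side: map the normalised point back to local screen pixels."""
    data = json.loads(message)
    width, height = screen_size
    return data["tool"], (round(data["u"] * width), round(data["v"] * height))
```

For instance, a click at the center of a 640x360 feed is mapped to the center of a 1920x1080 screen on the operator’s device.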
AR tools that can be used by the expert to support voice explanations are grouped in a palette displayed in the portal’s interface. The expert can control the laser pointer or make hand drawings appear on the screen of the operator’s device by using the mouse on the received video feed.
He or she can also choose the 3D shapes to be added in the operator’s space. While the expert places them on the video feed, the operator-side application tries to attach them to real-world objects by estimating planar surfaces in the camera’s field of view using ARCore. Ideally, the added shapes then remain in the same place regardless of the movements of the operator’s device.
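Conceptually, attaching a shape to an estimated planar surface amounts to casting a ray from the camera through the selected screen point and intersecting it with the plane. AR frameworks such as ARCore provide hit testing for this purpose; the sketch below only reproduces the underlying geometry in plain Python, with hypothetical function names, and is not the platform’s actual code.

```python
from typing import Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ray_plane_hit(origin: Vec3, direction: Vec3,
                  plane_point: Vec3, plane_normal: Vec3,
                  eps: float = 1e-6) -> Optional[Vec3]:
    """Return the point where the ray meets the plane, or None if it does not."""
    denom = dot(direction, plane_normal)
    if abs(denom) < eps:
        return None  # ray is parallel to the plane
    diff = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(diff, plane_normal) / denom
    if t < 0:
        return None  # the plane lies behind the camera
    return tuple(o + t * d for o, d in zip(origin, direction))
```

The returned point is where an anchor would be created, so that the shape stays fixed in the world while the device moves.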
The expert can also add text/image instructions that, as said, are displayed in chronological order as scrollable cards in the lower part of the operator-side application’s interface (as well as in the web portal’s interface); the instruction cards can be either displayed as 0D elements or anchored to the real world as 6D elements. The tilting of 6D cards is controlled so that they are always oriented for best readability.
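How the tilting of 6D cards is controlled is not detailed here. One common way to keep an anchored card readable is yaw-only billboarding, i.e., rotating the card about the vertical axis so that it faces the camera while remaining upright. The sketch below illustrates this technique under the assumption of a Y-up coordinate system in which an unrotated card faces the +Z direction; the function name is hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def billboard_yaw(card_pos: Vec3, camera_pos: Vec3) -> float:
    """Yaw (radians, about the +Y axis) that turns the card towards the camera.

    Only the horizontal offset is used, so the card never pitches or rolls.
    """
    dx = camera_pos[0] - card_pos[0]
    dz = camera_pos[2] - card_pos[2]
    return math.atan2(dx, dz)
```

With this convention, a camera located along +Z from the card gives a yaw of 0, and a camera located along +X gives a yaw of 90 degrees; the height difference between camera and card has no effect.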
Cards can be picked from a list of common instructions, selected from those used in a previous session, or created on-the-fly for the specific session. As said, cards can contain text or (animated) images. Similar to what happens on the operator’s side, the expert can select added instructions (by clicking them either in the list or in the video feed) to highlight them, e.g., to catch the operator’s attention during the explanation or to resolve occlusions.
As said, the expert and the operator can close the call at any time, but they retain the possibility to browse and visualize previously created AR instructions. Should the operator need further help, he or she can request a new call; in this case, previously placed contents are expected to retain their original positions. Finally, the operator is provided with a session history, through which the recap of previous sessions (instructions timeline and audio-video recording) can be visualized for future reference.
A key characteristic of the remote assistance paradigm supported by the devised platform is the possibility to have sessions that can be connected to each other and restored when necessary. In this way, the remote expert can, e.g., leverage instructions delivered in previously completed sessions for the same or similar problems, whereas the operator can retrieve instructions from a previous session to execute a procedure for which assistance was received in the past without asking for further support.
To this aim, the development of the platform’s back-end was centered on the concept of session, and both the mobile application and the web portal rely on this concept for operation.
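The idea of connected sessions can be sketched with a minimal data model in which each session optionally points to the one it continues. The `Session` fields and the `instruction_history` helper are hypothetical and do not reflect the actual back-end schema, which is not described in detail.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Session:
    session_id: str
    product_id: str
    previous_id: Optional[str] = None  # link to the session being continued
    instructions: List[str] = field(default_factory=list)


def instruction_history(sessions: Dict[str, Session],
                        session_id: str) -> List[str]:
    """Collect instructions from the oldest linked session to the given one."""
    chain: List[Session] = []
    seen = set()
    current: Optional[str] = session_id
    # Walk the links backwards; 'seen' guards against accidental cycles.
    while current is not None and current not in seen:
        seen.add(current)
        session = sessions[current]
        chain.append(session)
        current = session.previous_id
    history: List[str] = []
    for session in reversed(chain):
        history.extend(session.instructions)
    return history
```

With such a structure, an expert joining a follow-up session could retrieve all the instructions already delivered for the same problem, regardless of who provided them.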
The back-end was implemented using Google Firebase (footnote 4), leveraging some of its off-the-shelf features such as Authentication, Realtime Database, Cloud Storage, ML Kit, Cloud Functions, and Hosting. Building on them, a networked platform was created, supporting user registration, authentication, call scheduling, session management/archiving, recording, push notifications, and OCR (for the troubleshooting).
For each session, the platform records information collected in the troubleshooting phase as well as instructions provided by the expert using available AR tools (Fig. 4). The same information is displayed also in the web portal’s interface. Independent of the time passed between two sessions and of who actually provided the support, the expert has at his or her disposal helpful information that can ease the identification/solution of the problem.
Session recording is performed on the web portal’s side, but storage is handled in the back-end. Thus, recorded sessions can also be made available on different devices. The communications of both peers are saved, together with the video feed received from the mobile device and the graphic contents drawn on it. Bookmarks are also set, making it possible to jump quickly to the frames where instructions were provided using AR tools.
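Bookmark-based navigation of a recording can be sketched as a search over a sorted list of timestamps. The representation of bookmarks as seconds from the start of the recording and the function names are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right
from typing import List, Optional


def next_bookmark(bookmarks: List[float], position: float) -> Optional[float]:
    """First bookmark strictly after the current playback position."""
    index = bisect_right(bookmarks, position)
    return bookmarks[index] if index < len(bookmarks) else None


def previous_bookmark(bookmarks: List[float], position: float) -> Optional[float]:
    """Last bookmark strictly before the current playback position."""
    index = bisect_left(bookmarks, position)
    return bookmarks[index - 1] if index > 0 else None
```

Both functions require the bookmark list to be sorted in ascending order, which holds naturally if bookmarks are appended as instructions are placed during the session.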