1 The First Wave: Marker-Based AR

We believe that the first wave of modern AR dates back to 1999, when an open-source AR tracking library named ARToolKit [9] was demonstrated at SIGGRAPH 1999. ARToolKit was a C/C++ library that required only simple fiducial markers (i.e., black-and-white square markers) and an off-the-shelf camera. With ARToolKit’s built-in functions, researchers and programmers could easily get started in AR and obtain a real-time estimate of the camera’s 3D pose (with respect to the ARToolKit marker) without an in-depth understanding of 3D computer vision. At the beginning of the 21st century, the library’s ease of use triggered rapid development of diverse applications, not only in AR itself but also in other 3D-vision fields. Example uses of ARToolKit included a tangible interface for musicians [17], educational exhibitions [28], and passive markers for a motion capture system [20].
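To make the idea of marker-based pose estimation concrete, the following is a minimal sketch of how a camera pose could be recovered from the four detected corners of a square marker. It is not ARToolKit’s implementation or API: the function names (`solve_linear`, `homography`, `marker_pose`, `project`), the corner ordering, and the pinhole intrinsics `fx`, `fy`, `cx`, `cy` are all assumptions made for illustration, and marker detection itself is taken as given.

```python
import math

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def homography(plane_pts, image_pts):
    """3x3 H with image ~ H (X, Y, 1), from four point pairs, fixing H[2][2] = 1."""
    a, b = [], []
    for (X, Y), (u, v) in zip(plane_pts, image_pts):
        a.append([X, Y, 1, 0, 0, 0, -u * X, -u * Y]); b.append(u)
        a.append([0, 0, 0, X, Y, 1, -v * X, -v * Y]); b.append(v)
    h = solve_linear(a, b) + [1.0]
    return [h[0:3], h[3:6], h[6:9]]

def marker_pose(image_corners, side, fx, fy, cx, cy):
    """Return (R, t) mapping marker coordinates to camera coordinates.

    Assumed corner order: top-left, top-right, bottom-right, bottom-left,
    with the marker centred at the origin of its own plane (Z = 0, Y up).
    """
    s = side / 2.0
    plane = [(-s, s), (s, s), (s, -s), (-s, -s)]
    H = homography(plane, image_corners)
    # B = K^-1 H is proportional to [r1 r2 t] for a planar target.
    B = [[(H[0][j] - cx * H[2][j]) / fx for j in range(3)],
         [(H[1][j] - cy * H[2][j]) / fy for j in range(3)],
         [H[2][j] for j in range(3)]]
    b1, b2, b3 = ([B[i][j] for i in range(3)] for j in range(3))
    # H[2][2] = 1 makes t_z positive, i.e. the marker lies in front of the camera.
    scale = 1.0 / math.sqrt(sum(v * v for v in b1))
    r1 = [scale * v for v in b1]
    r2 = [scale * v for v in b2]
    t = [scale * v for v in b3]
    r3 = [r1[1] * r2[2] - r1[2] * r2[1],
          r1[2] * r2[0] - r1[0] * r2[2],
          r1[0] * r2[1] - r1[1] * r2[0]]
    R = [[r1[i], r2[i], r3[i]] for i in range(3)]
    return R, t

def project(point, R, t, fx, fy, cx, cy):
    """Pinhole projection of a 3D marker-frame point into pixel coordinates."""
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    return (fx * x / z + cx, fy * y / z + cy)

# Usage: synthesise corners from a known (hypothetical) pose, then recover it.
c, s = math.cos(0.3), math.sin(0.3)
R_true = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
t_true = [0.05, -0.02, 0.6]
half = 0.04
corners = [project((X, Y, 0.0), R_true, t_true, 800.0, 800.0, 320.0, 240.0)
           for X, Y in [(-half, half), (half, half), (half, -half), (-half, -half)]]
R_est, t_est = marker_pose(corners, 0.08, 800.0, 800.0, 320.0, 240.0)
print(t_est)
```

With noisy corner detections the two recovered rotation columns are generally not exactly orthonormal, so a practical system would add an orthonormalisation or refinement step that this sketch omits.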

Fig. 1.

(Image credit: http://campar.in.tum.de/twiki/pub/ISMAR/IarAbstractARTag/IarDetailsFialaSlides.pdf, http://www.hitl.washington.edu/artoolkit/ and https://alternativeto.net/software/artoolk/)

(A) shows examples of fiducial marker libraries in computer vision; some are specifically designed for AR tasks (e.g., ARToolKit and ARTag) whereas the others are for other tasks in computer vision. (B) and (C) show examples of using ARToolKit markers to generate real-time AR see-through effects.

Inspired by the success and popularity of ARToolKit, many vision-based fiducial marker libraries were introduced afterwards, mainly for two reasons: to make marker detection and tracking more robust to partial occlusion and difficult lighting, and to offer visual alternatives to existing fiducial markers for different tasks. Examples of fiducial marker libraries in computer vision are shown in Fig. 1A. Note that standard barcodes, QR codes and many 2D planar patterns are not suitable as vision-based fiducial markers because they either require specific camera orientations (relative to the marker) or provide inadequate information for visual computing.

It can be said that in the first wave of modern AR, the key developments relied heavily on computer vision algorithms and applications; the most popular AR features at the time were marker-based see-through effects displayed on an electronic monitor or a head-mounted device (HMD), as shown in Fig. 1B and C. Nevertheless, during this first wave, AR was rarely used outside research laboratories.

2 The Second Wave: Spatial AR with Projector

At SIGGRAPH 1998, there was a signal of another wave of modern AR. At this conference, Raskar et al. [18] proposed an idea called the Office of the Future (Fig. 2 left), in which interactive imagery projected from projectors was used to augment arbitrary flat surfaces in a seamless manner; they named this kind of projector-based AR Spatial Augmented Reality (SAR) [2]. Driven by the office-of-the-future proposal and the subsequent trend of projector miniaturization, the first decade of the 21st century saw a flood of not only SAR research using projector-camera devices (a.k.a. pro-cam devices) but also continuous development of smaller but brighter projection mechanisms. In the following paragraphs, we describe the rise and fall of SAR in more detail.

Fig. 2.

(Image credit: Office of the future [18] and http://www.pranavmistry.com/projects/sixthsense/)

Examples of Spatial AR for augmenting a physical surface with interactive projected imagery.

Although portable projectors did not need to be firmly fixed to a ceiling, their bulky form factors prevented them from being used in a truly mobile fashion. After portable projectors became reasonably affordable around the beginning of the 21st century, they were used continuously in diverse research and applications on SAR and computer vision; most of the time, a portable projector was coupled with one or more cameras in order to visually observe the projected results. Example studies on portable projectors included methods to calibrate a projector-camera pair either geometrically [10] or photometrically [5], methods to calibrate multiple projectors [29], and imperceptible [3] or infrared [1] projection techniques.

Another important trigger in the second wave was probably the SixthSense project (Fig. 2 right) by Mistry et al., published at CHI 2009 [15]. In this project, they built a prototype wearable pro-cam device in which a smartphone-sized laser projector (a.k.a. handheld, mobile or pico projector) and a camera were used to transform any ordinary flat surface into an interactive touch screen. Their proof-of-concept applications were broadcast by many news channels, making this one of the most famous projects in SAR.

When SAR usage scenarios changed from bulky portable projectors staying (mostly) still on a table to small mobile projectors being moved at all times, problems of pro-cam calibration were exacerbated, and solutions were needed to deal with unpredictable projection surfaces. Our previous works started from utilizing a motion sensor for indirect pro-cam geometric calibration [23] and then moved to pro-cam coaxialization using a plate beam splitter in order to completely eliminate the need for geometric calibration [24]; finally, to deal with visual crosstalk problems, we decided to keep projection in the visible light spectrum but move visual analysis tasks into the infrared spectrum [21]. Other popular approaches for dynamic pro-cam geometric calibration included projecting a known pattern onto an unknown surface and analyzing the distorted pattern to reconstruct the surface’s geometry.

Despite proof-of-concept demonstrations and a few commercial projector-based interactive touchscreen devices, uses of SAR beyond exhibitions and lighting performances were uncommon. Our recent study on SAR [22] revealed shared concerns among different experts regarding the robustness, price, usability and practicality of SAR in actual use. Unfortunately, the second wave of modern AR did not last long. After approximately a decade of active development, researchers and manufacturers slowly lost interest in miniature projectors. We attribute the fall of SAR to the introduction of the Microsoft Kinect sensor in 2010 and to the limitations of miniature projectors themselves.

Because the Kinect sensor provided an affordable all-in-one solution, the need to develop complicated pro-cam hardware for geometric calibration or 3D reconstruction decreased sharply. As for the limitations of miniature projectors, during their golden years many efforts went into building brighter, smaller and focus-free projectors. To build a focus-free projector (i.e., one whose projected images are always in focus regardless of surface depth) with strong brightness, laser projection technologies were once expected to be a promising solution. However, because of international safety standards for laser devices, laser projection was limited to very low brightness. Even though other non-laser projection mechanisms could produce hundreds of lumens while maintaining a small form factor, without the focus-free ability our dream of freestyle mobile projection on any desired surface will never come true.

3 The Third Wave: Wearable AR for Corporate

The third wave of modern AR came in the form of professional head-mounted wearable devices, as shown in Fig. 3, starting with Google Glass in 2013, followed by Microsoft HoloLens (developer edition) in 2016, MagicLeap One (creator edition) in 2018, and Microsoft HoloLens 2 in 2019; the years given here are official release years, except for HoloLens 2, which is still in the preorder stage at the time of writing. Unlike the two previous waves, the third wave has been driven by world-famous tech companies whose aim is clearly not just AR prototypes for academic presentations but commercial AR products with (hopefully) mass production. Another distinctive feature of this wave is that developing these wearable devices requires high-level interdisciplinary knowledge that goes far beyond computer vision to audio, optics, mechanics, etc. As a result, most AR researchers do not try to tweak the internal mechanisms of these devices but instead study the advantages and disadvantages of using them in each situation. For example, with HoloLens, [12] used real-world geometry as input data and allowed a user to define and solve a physical problem governed by Poisson’s equation; [8] conducted experiments using Google Glass for training new scientists in wet laboratory work; [14] created a Google Glass application that allowed people with Parkinson’s disease to monitor their speech volume; [26] found that, unlike users of virtual reality headsets, users of HoloLens did not suffer from obvious simulation sickness.

Fig. 3.

(Image credit: http://time.com/, https://news.microsoft.com/, https://www.bloomberg.com/ and https://www.microsoft.com/en-us/hololens/buy)

Leading head-mounted wearable devices for AR from three companies. (A) is Google Glass, (B) is Microsoft HoloLens 1, (C) is MagicLeap One, and (D) is Microsoft HoloLens 2.

In the past, the first Google Glass and HoloLens were promoted for their AR capabilities. However, MagicLeap One and HoloLens 2 are now being advertised as mixed reality (MR). According to the long-standing understanding of the reality-virtuality continuum, MR refers to everything in which the real and virtual worlds are mixed; this literally means that AR is a subset of MR. Nevertheless, this perspective has shifted slightly since MagicLeap positioned its wearable device not as AR but as MR, where the mixing of the two worlds is indistinguishable in a 3D hologram manner. For the sake of this AR review paper, we will stick with the term AR.

Until now, Google Glass, Microsoft HoloLens and MagicLeap have been the three big names representing the future of wearable AR worldwide, in which the overlaid virtual information is more controllable than in the previous wave of SAR, as discussed in our previous work [22]. The wearable devices from these three companies have much in common. They are all packed with sophisticated hardware and algorithms developed by great engineers. They are all famous not only among AR researchers but also among technologists around the world; this is particularly true of Google Glass, the first AR product that successfully popularized AR among end users. Despite these strengths, they all struggle with the same problem of very high, unaffordable price tags. Their introductory prices are 1,500 USD for Google Glass, 3,000 USD (developer price) or 5,000 USD (commercial price) for HoloLens 1, 2,295 USD for MagicLeap One, and 3,500 USD for the commercial HoloLens 2. This problem alone has limited the use of these wearable AR devices to small groups of researchers or big organizations that can afford them (e.g., United States Army, NASA, DHL, General Motors).

For this ongoing third wave of modern AR, there are important lessons learnt from the original Google Glass, whose initial aim of being a consumer-grade gadget failed due to privacy laws, driving regulations and social disapproval [7]. This means that consumers stayed away from Google Glass not only because of its unaffordable price but also because it did not suit their lifestyles. Despite this earlier failure, Google Glass has already come back, and this time Google as well as Microsoft have directed their attention to enterprise customers, whose personal development and corporate training can benefit greatly from these hi-tech and pricey AR headsets.

4 The Fourth Wave: Markerless AR in Smartphone

During the first and second waves of marker-based AR and SAR, one of the most popular techniques was to use known visual markers to pinpoint virtually generated AR content in world coordinates. Some visual markers are very conspicuous and do not blend into the working environment, like those in Fig. 1A. In many studies and AR creator platforms, to avoid visually intrusive markers, natural objects are used as visual markers with the help of vision-based feature point matching techniques; this allows something like a company’s logo to serve as an AR marker.

In our previous work [23], we addressed problems of marker-based interactions and proposed a multi-target tracking solution in order to include non-marker objects in AR calculation. Multi-target tracking is a good start for tackling this problem, but in the long run, unless we have a proper map of objects in the environment, augmenting the environment with interactive AR content remains difficult. To incorporate everything in the environment into AR systems, SLAM (Simultaneous Localization And Mapping) is a promising solution that has become popular recently. Using SLAM techniques, we can use camera images in conjunction with other information to reconstruct and update a map of an unknown environment in real time. SLAM is especially popular for interactive systems that deal with unknown environments; this includes the usage situations of wearable AR devices (Sect. 3) and smartphone AR (Sect. 4).

While the third wave of AR in professional wearable devices is still ongoing, the fourth wave of modern AR on smartphones has already arrived with the official releases of Apple’s ARKit in 2017 and Android’s ARCore in 2018. Like Google’s Project Tango, proposed back in 2014, ARKit and ARCore utilize SLAM techniques to create markerless AR effects on smartphones. But unlike the discontinued Project Tango, which relied on specially designed cameras and specific computational modules, ARKit and ARCore use smartphones’ built-in motion sensors and cameras to perform SLAM, making smartphone AR sustainable in the long run.

Similar to the third wave of wearable AR headsets, the wave of smartphone AR has been driven mainly by big tech companies. However, while wearable AR headsets are very expensive and aimed at corporate customers, smartphone AR is mostly free and involves diverse applications for arbitrary smartphone users. Examples of smartphone AR include AR DeepCalorieCam [25], which uses ARKit to measure the actual size of a meal (in order to estimate its total calories); interactive AR coding environments, for which N. Dass et al. [4] show that participant satisfaction is better with smartphone AR (using ARKit) than with a traditional tablet or Microsoft HoloLens; and a mixed-reality mobile remote collaboration system [6] using ARCore position tracking.

5 The Fifth Wave: AR Underneath Artificial Intelligence

During approximately 2014–2016, extended reality technologies (including virtual, augmented and mixed realities) made headlines that excited many technologists and researchers around the world. But in the past couple of years, much of the world has turned its interest to Artificial Intelligence (AI) driven by machine learning, particularly in the field of computer vision, which has been disrupted significantly by deep learning. The popularity of AI (in comparison to AR) during the past decade is illustrated in Figs. 4 and 5. In Fig. 4, the increase in the number of AI papers is most obvious in the IEEE Xplore Digital Library during 2017 and 2018. As for Google Trends (worldwide) in Fig. 5, the AI keyword has clearly surpassed the AR keyword in popularity since October 2016.

Fig. 4.

The number of search results from 2009 to 2018 for two keywords (i.e., ‘artificial intelligence’ and ‘augmented reality’) on three research paper platforms: arXiv.org, IEEE Xplore Digital Library and ACM Digital Library. (Data retrieved on 19 April 2019)

Fig. 5.

Google Trends (worldwide) comparing the two keywords ‘artificial intelligence’ and ‘augmented reality’ from 1 January 2009 to 31 December 2018. The vertical axis represents interest over time, where 100 is the peak popularity of the corresponding keyword during the specified time interval. (Data retrieved on 19 April 2019)
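As a small illustration of the peak-relative scale described in the caption, the sketch below rescales a series so that its largest value becomes 100. The function name `scale_to_peak`, the rounding to integers and the sample numbers are assumptions for illustration only; this is not Google’s actual procedure, and the values are not the data behind Fig. 5.

```python
def scale_to_peak(series):
    """Rescale non-negative values so that the largest one becomes 100."""
    peak = max(series)
    if peak == 0:
        return [0 for _ in series]
    return [round(100 * v / peak) for v in series]

# Hypothetical raw interest values for one keyword over five periods.
raw_interest = [12, 30, 45, 60, 24]
print(scale_to_peak(raw_interest))
```

Because every value is divided by the series’ own peak, such a curve conveys relative change over time rather than absolute search volume.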

Because computer vision has always played an important role in AR, disruption in computer vision results in disruption in AR as well. Hence, under the huge umbrella of machine learning and deep learning, AR sits underneath too. For vision-based tasks of recognizing objects and annotating them with virtual AR content, state-of-the-art pre-trained convolutional neural networks give researchers and developers a huge head start with promising image recognition and annotation results. For example, AR DeepCalorieCam [25] uses Inception-v3 (pre-trained on the ImageNet dataset and fine-tuned on the UEC-FOOD100 dataset) to recognize the food category from an image before applying ARKit to measure the actual size of each food item. Recognizing and annotating humans in images is another task that has received a lot of attention recently. A recent proposal from Wang et al. [27] uses a self-supervised deep learning technique to predict 3D human poses from 2D image inputs. This kind of 3D human pose estimation can benefit any AR system that requires an in-depth understanding of humans’ real-time actions and behaviors.

For interactive virtual makeup applications, overlaying virtual makeup on a moving face has become much easier because facial landmarks can be located precisely with many free machine learning libraries. One example is the automatic virtual makeup system of [16], which uses Dlib to extract 2D facial landmarks from a face image; Dlib is a machine learning library that originally utilizes a combination of Histogram of Oriented Gradients (HoG) and linear Support Vector Machine (SVM). In BeautyGAN [11], a more advanced deep learning technique is used to transfer a makeup style from a reference face with makeup to another face without makeup. This growth in deep learning techniques helps reform traditional AR virtual makeup systems and strengthens them into the next generation of AI-based AR systems.
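To give a feel for what a HoG feature and a linear SVM decision look like, the following is a deliberately simplified sketch. It is not Dlib’s implementation or API: the function names `hog_cell` and `linear_svm_score`, the nine-bin unsigned histogram with hard binning, and the example weights are all assumptions for illustration. A full HoG pipeline would also interpolate between bins and normalise over blocks of cells, and the SVM weights would come from training rather than being set by hand.

```python
import math

def hog_cell(patch, bins=9):
    """Unsigned gradient-orientation histogram (0 to 180 degrees) of one grayscale cell."""
    h, w = len(patch), len(patch[0])
    hist = [0.0] * bins
    width = 180.0 / bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = patch[y][x + 1] - patch[y][x - 1]   # horizontal central difference
            gy = patch[y + 1][x] - patch[y - 1][x]   # vertical central difference
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // width) % bins] += magnitude
    norm = math.sqrt(sum(v * v for v in hist)) or 1.0  # L2 normalisation
    return [v / norm for v in hist]

def linear_svm_score(features, weights, bias):
    """Linear decision value w . x + b; a positive score means 'object present'."""
    return sum(f * w for f, w in zip(features, weights)) + bias

# Usage: a cell whose intensity rises from left to right (a vertical edge).
vertical_edge = [[0, 1, 2, 3] for _ in range(4)]
features = hog_cell(vertical_edge)
# Hypothetical weights that favour gradients near 0 degrees.
weights = [1.0] + [0.0] * 8
print(linear_svm_score(features, weights, bias=-0.5))
```

In a detector of this kind, such histograms would be computed over a sliding window and the linear score thresholded to decide whether a face is present, after which landmarks can be located within the detected region.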

Superimposed AR content can blend almost indistinguishably into a live video of the actual world when a neural style transfer technique from deep learning is used to blend two different image styles together; [13] demonstrates this concept using neural style transfer and ARKit. Even for AR tasks involving 3D reconstruction for AR headsets (mentioned in Sect. 3), a recent proposal from Rematas et al. [19] uses deep learning to convert a typical YouTube video of a soccer game into dynamic 3D information; this means that all players in the game are dynamically reconstructed in 3D so that we can wear a 3D AR headset and watch the game in 3D AR style.

It can be said that with the recent disruption in artificial intelligence, we can expect AR systems (across all four previous waves) to become more intelligent, more seamless and more interactive in the near future. Once AR overcomes its long-standing technical difficulties, what remains unsolved are the true long-run problems of AR: affordability, user experience and practical usage scenarios.

6 Conclusion

For the past two decades, it can be said that modern AR has been through many good and bad times. This paper reviews these two decades and summarizes them as five waves of modern AR. The first wave of fiducial-marker-based AR is classic and can still be seen today, especially in AR explorer apps and other proof-of-concept AR systems. The second wave of projector-based AR was once popular, but due to many limitations, it has slowed down recently. The third and fourth waves have both been driven by big tech companies. The third wave of expensive AR headsets has aimed at corporate training, whereas the fourth wave of smartphone AR has directed its attention to mass consumers. Finally, all four waves have been elevated by the fifth wave of artificial intelligence disruption, in which solutions to long-standing AR technical difficulties have been proposed one after another.