3.1 The Concept
BeThereNow is built around the idea of an immersive application in which users are depicted standing in various sceneries, as if they were actually there. The objectives of our work can be summarized as follows:
The system should work like a magic mirror, where users see themselves on the display as if they were standing in various sceneries.
Users (foreground) should be depicted in front of various landscapes (background).
The means of switching between the displayed sceneries should be meaningful, elegant, and aesthetically pleasing.
Background subtraction should be effective and precise.
Users should be able to capture digital photographs and send them by email.
The system should not be static, but provide a demonstration mode that entices users to engage with it.
The system is used in a walk-up-and-use manner, while also offering users the ability to browse through different sceneries. Once users step in front of the display, they instantly appear in the current scenery. A physical cube (Figs. 1 and 4) resides next to the display and can be rotated in order to switch between the various sceneries. When the cube is rotated, a synchronized virtual cube, having the different landscapes on its sides, is rotated as well, creating a one-to-one mapping between the physical and the virtual object. The same metaphor applies in demonstration mode (sketched below), where the virtual cube rotates on its own, switching landscapes at a short interval (30 s). The demonstration mode is enabled when no users are near the system and disabled as soon as a user enters the effective area in front of it. A secondary touch display, located at the side of the main display, allows users to either change the language of the displayed image description or take a snapshot of themselves in the currently displayed scenery.

The system's overall rendering process is separated into two primary layers: background and foreground. The background refers to the scenery images into which users are immersed, while the foreground contains the users or objects standing in front of the display. The background is in essence the virtual cube, while the foreground is estimated using a background removal process.
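Regarding the demonstration mode described above, the following is a minimal, hypothetical C++ sketch of its driving logic; the structure and names (SceneryController, update) are illustrative and do not correspond to the actual system API.

#include <chrono>

using Clock = std::chrono::steady_clock;

// Hypothetical controller for demonstration mode: when nobody is near,
// the virtual cube auto-rotates to the next scenery every 30 s; the
// presence of a user in the effective area disables the auto-rotation.
struct SceneryController {
    bool demoEnabled = false;
    int sceneryIndex = 0;
    Clock::time_point lastSwitch = Clock::now();
    static constexpr std::chrono::seconds kInterval{30};

    // Called once per frame with the presence-detection result.
    void update(bool userPresent, int sceneryCount) {
        demoEnabled = !userPresent;
        if (demoEnabled && Clock::now() - lastSwitch >= kInterval) {
            sceneryIndex = (sceneryIndex + 1) % sceneryCount;  // rotate cube
            lastSwitch = Clock::now();
        }
    }
};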
An additional, optional capability of the system is the addition of a foreground mask to sceneries: some areas of the background can appear in front of the users (e.g. objects such as a desk). This option enhances the realism of the settings, as users can be immersed in a room standing, for instance, in front of a desk and appearing as if they are studying. Another usage example involves showing theatrical costumes (Fig. 5), which visitors can virtually 'wear' by standing in such a position that their bodies are completely hidden behind the costume and only some parts remain visible. The overall rendering process is illustrated in Fig. 2 below.
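To make the layer ordering concrete, here is a minimal sketch of the composition, assuming all layers are pre-aligned images at the display resolution; blendOver and composeFrame are illustrative names, not the system's actual (XNA-based) renderer, and the use of OpenCV types is an assumption.

#include <opencv2/opencv.hpp>

// Standard "over" alpha blending of a BGRA layer onto a BGR frame.
static void blendOver(cv::Mat& dstBGR, const cv::Mat& srcBGRA) {
    for (int y = 0; y < dstBGR.rows; ++y)
        for (int x = 0; x < dstBGR.cols; ++x) {
            cv::Vec4b s = srcBGRA.at<cv::Vec4b>(y, x);
            cv::Vec3b& d = dstBGR.at<cv::Vec3b>(y, x);
            float a = s[3] / 255.0f;
            for (int c = 0; c < 3; ++c)
                d[c] = cv::saturate_cast<uchar>(a * s[c] + (1 - a) * d[c]);
        }
}

// Layer order described in the text: scenery first, the segmented users
// on top of it, and the optional foreground mask (e.g. a desk) in front.
cv::Mat composeFrame(const cv::Mat& scenery, const cv::Mat& usersBGRA,
                     const cv::Mat& foregroundMaskBGRA) {
    cv::Mat frame = scenery.clone();       // layer 1: background scenery
    blendOver(frame, usersBGRA);           // layer 2: segmented users
    blendOver(frame, foregroundMaskBGRA);  // layer 3: occluding objects
    return frame;
}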
Computer Vision Algorithm
Silhouette extractor. The vision module employs a typical RGB-D sensor (e.g. a Microsoft Kinect or an Asus Xtion camera). RGB-D sensors provide two images: a conventional RGB image and a depth image registered to it. Through the depth image, the 3D coordinates of the surfaces appearing in the RGB image are measured.
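For reference, recovering a 3D point from a depth pixel follows the standard pinhole camera model; the minimal sketch below assumes the depth camera intrinsics fx, fy, cx, cy, which are obtained from the calibration described later.

#include <opencv2/opencv.hpp>

// Back-project depth pixel (u, v) with depth z (in meters) to a 3D point
// in the sensor's coordinate frame, under the pinhole model.
cv::Point3d backProject(int u, int v, double z,
                        double fx, double fy, double cx, double cy) {
    return { (u - cx) * z / fx,   // X
             (v - cy) * z / fy,   // Y
             z };                 // Z
}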
In case the RGB sensor is judged insufficient in terms of quality or resolution, an additional high-resolution RGB camera is employed, rigidly mounted on top of the RGB-D sensor. Color images are then provided by the additional camera, while the color images from the RGB-D sensor are disregarded. This was mainly the case for early RGB-D sensors, but is not required for more modern ones (e.g. Kinect 2), as they already provide a high-end color camera.
The sensor is placed so that it covers the scene, including the ground plane (see Fig. 1), and is thereafter calibrated so as to estimate its pose relative to the scene. A computer vision component is responsible for two tasks. First, it detects objects (persons) in the depth image, finds their outlining contours, and maps these contours to the color image. Second, these contours determine the portion of the color image to be displayed as foreground on the large-scale display.
Calibration. Calibration is a two-step process. The first step is conventional, grid-based intrinsic and extrinsic calibration. If an additional color camera is employed, it is that camera which is calibrated instead of the one in the RGB-D sensor. Using this calibration, the location of a 3D point imaged in the depth image is found in the color image.
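A minimal sketch of this mapping, assuming the grid-based calibration yields the rotation R and translation t between the two cameras and the color camera intrinsics Kc, and ignoring lens distortion:

#include <opencv2/opencv.hpp>

// Map a 3D point expressed in the depth camera frame to a pixel of the
// color camera: transform into the color frame, project, and normalize.
cv::Point2d depthPointToColorPixel(const cv::Point3d& P,
                                   const cv::Matx33d& R, const cv::Vec3d& t,
                                   const cv::Matx33d& Kc) {
    cv::Vec3d Pc = R * cv::Vec3d(P.x, P.y, P.z) + t;  // into color frame
    cv::Vec3d p = Kc * Pc;                            // perspective projection
    return { p[0] / p[2], p[1] / p[2] };              // divide by depth
}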
The second step determines a cuboidal volume in the scene, aligned with the ground plane, which we call the "working volume". Only objects (persons) within this volume are considered by the system. To achieve this, the ground plane is estimated first, by imaging an empty scene and extracting the 3D points of the ground plane. The ground plane is then approximated by least-squares fitting of a plane to these points. The two cuboid planes lateral to the camera are perpendicular to the camera axis. The remaining cuboid faces are defined from the aforementioned planes, so that they limit the working space according to the sensor's range and very distal surfaces are not considered.
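The least-squares plane fit can be realized via a principal component analysis of the ground points: the plane passes through their centroid, and its normal is the direction of least variance. The following is a minimal sketch; the choice of OpenCV is an assumption.

#include <opencv2/opencv.hpp>
#include <vector>

// Fit a plane to 3D points in the least-squares sense: the centroid lies
// on the plane; the eigenvector of the smallest eigenvalue is its normal.
void fitGroundPlane(const std::vector<cv::Point3d>& pts,
                    cv::Vec3d& normal, cv::Point3d& centroid) {
    cv::Mat data((int)pts.size(), 3, CV_64F);
    for (int i = 0; i < (int)pts.size(); ++i) {
        data.at<double>(i, 0) = pts[i].x;
        data.at<double>(i, 1) = pts[i].y;
        data.at<double>(i, 2) = pts[i].z;
    }
    cv::PCA pca(data, cv::Mat(), cv::PCA::DATA_AS_ROW);
    centroid = { pca.mean.at<double>(0, 0), pca.mean.at<double>(0, 1),
                 pca.mean.at<double>(0, 2) };
    // Eigenvectors are sorted by decreasing eigenvalue; the last row is
    // the normal of the best-fit plane.
    normal = { pca.eigenvectors.at<double>(2, 0),
               pca.eigenvectors.at<double>(2, 1),
               pca.eigenvectors.at<double>(2, 2) };
}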
Foreground segmentation. The method finds the foreground in the color image using the 3D information that the depth image provides. At each frame, the input to the method consists of the depth and color images. The system finds the surface points within the working volume in 3D, as well as the regions where these surfaces occur in the depth image as silhouettes. These silhouettes are then mapped to the color image to obtain the foreground.
Due to sensor limitations, the depth image often exhibits pixels with missing values. This hinders foreground detection, because the effect is most pronounced along object outlines, due to the depth discontinuities they image and, in our application, also due to human hair.
To reduce missing depth values, we apply nearest-neighbor (NN) filling: each invalid pixel is assigned the depth value of the nearest valid pixel within a neighborhood of the pixel currently evaluated, if any such pixel exists. The output is the processed depth image, henceforth called D.
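A naive CPU sketch of the NN filling, assuming a 16-bit depth image where zero denotes an invalid measurement; the search radius is an assumed parameter.

#include <opencv2/opencv.hpp>
#include <cstdint>

// Fill invalid (zero) depth pixels with the value of the nearest valid
// pixel inside a (2*radius+1)^2 neighborhood; pixels with no valid
// neighbor are left untouched.
cv::Mat fillMissingDepth(const cv::Mat& depth, int radius = 3) {
    CV_Assert(depth.type() == CV_16UC1);
    cv::Mat D = depth.clone();
    for (int y = 0; y < depth.rows; ++y) {
        for (int x = 0; x < depth.cols; ++x) {
            if (depth.at<uint16_t>(y, x) != 0) continue;  // already valid
            double bestDist = 1e9;
            uint16_t bestVal = 0;
            for (int dy = -radius; dy <= radius; ++dy)
                for (int dx = -radius; dx <= radius; ++dx) {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= depth.rows || nx < 0 || nx >= depth.cols)
                        continue;
                    uint16_t v = depth.at<uint16_t>(ny, nx);
                    double d = dx * dx + dy * dy;  // squared pixel distance
                    if (v != 0 && d < bestDist) { bestDist = d; bestVal = v; }
                }
            if (bestVal != 0) D.at<uint16_t>(y, x) = bestVal;
        }
    }
    return D;
}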
We use image D to compute the 3D points of the scene that lie within the working volume; these are the foreground 3D points. The 2D pixels corresponding to these points define the foreground mask M upon D. Next, a connected-component procedure is employed on M in order to isolate the blobs corresponding to humans. The blobs are filtered according to their size, excluding minute blobs that occur due to sensor noise. For each resultant blob we extract the external as well as the internal contours. Minute internal contours are attributed to sensor noise and are filled.
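These steps can be sketched with OpenCV's connected-component and contour facilities; the area thresholds below are assumed parameters, not the system's actual values.

#include <opencv2/opencv.hpp>
#include <vector>

struct BlobContours {
    std::vector<std::vector<cv::Point>> external;
    std::vector<std::vector<cv::Point>> internal;  // genuine holes
};

// Isolate sufficiently large blobs in mask M, then extract their external
// contours and their holes; minute holes are treated as sensor noise and
// discarded, which effectively fills them when the mask is redrawn.
BlobContours extractBlobContours(const cv::Mat& M, int minBlobArea = 2000,
                                 double minHoleArea = 100.0) {
    cv::Mat labels, stats, centroids;
    int n = cv::connectedComponentsWithStats(M, labels, stats, centroids);

    cv::Mat filtered = cv::Mat::zeros(M.size(), CV_8UC1);
    for (int i = 1; i < n; ++i)  // label 0 is the background
        if (stats.at<int>(i, cv::CC_STAT_AREA) >= minBlobArea)
            filtered.setTo(255, labels == i);

    // RETR_CCOMP retrieves external contours and the holes inside them.
    std::vector<std::vector<cv::Point>> contours;
    std::vector<cv::Vec4i> hierarchy;
    cv::findContours(filtered, contours, hierarchy,
                     cv::RETR_CCOMP, cv::CHAIN_APPROX_SIMPLE);

    BlobContours out;
    for (size_t i = 0; i < contours.size(); ++i) {
        bool isHole = hierarchy[i][3] >= 0;  // has a parent contour
        if (!isHole)
            out.external.push_back(contours[i]);
        else if (cv::contourArea(contours[i]) >= minHoleArea)
            out.internal.push_back(contours[i]);  // keep genuine holes
    }
    return out;
}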
Using the calibration between the depth and color cameras, the 3D counterparts of the contour pixels in the depth image are projected onto the color image. In this way, the contours are transferred from the depth image to the color image. Note that, due to the aforementioned NN preprocessing, all such pixels have a depth value. We apply a 2D Gaussian smoothing on the transferred contours to account for possible inaccuracies. The areas in the color image encapsulated by the smoothed contours C constitute the map of the foreground from the depth image (see Figs. 1 and 6).
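The smoothing step can be sketched as blurring a rasterized contour mask and re-thresholding it; the kernel size and threshold are assumed parameters.

#include <opencv2/opencv.hpp>

// Smooth the transferred contours: given a binary mask rasterized from
// them, apply a 2D Gaussian and re-threshold; the result yields the
// smoothed contours C.
cv::Mat smoothContourMask(const cv::Mat& contourMask) {
    cv::Mat smoothed;
    cv::GaussianBlur(contourMask, smoothed, cv::Size(9, 9), 0);
    cv::threshold(smoothed, smoothed, 127, 255, cv::THRESH_BINARY);
    return smoothed;
}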
The final step of the algorithm encodes the result as an RGBA image. The RGB channels of this image are copied from the color image, while the alpha channel contains the aforementioned foreground mask. Channel A is created by filling the external contours in C, but not the internal ones. In practice, the alpha channel is employed to restrict the RGBA image so that only the foreground appears in it.
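A minimal OpenCV sketch of this encoding, taking the external and internal contours from the previous steps; note that OpenCV stores color channels as BGR, so the sketch produces a BGRA image.

#include <opencv2/opencv.hpp>
#include <vector>

// Build the output image: color channels are copied from the color image,
// the alpha channel is obtained by filling the external contours while
// cutting out the internal ones, so genuine holes stay transparent.
cv::Mat encodeForeground(const cv::Mat& colorBGR,
                         const std::vector<std::vector<cv::Point>>& external,
                         const std::vector<std::vector<cv::Point>>& internal) {
    cv::Mat alpha = cv::Mat::zeros(colorBGR.size(), CV_8UC1);
    cv::drawContours(alpha, external, -1, cv::Scalar(255), cv::FILLED);
    cv::drawContours(alpha, internal, -1, cv::Scalar(0), cv::FILLED);

    std::vector<cv::Mat> ch;
    cv::split(colorBGR, ch);   // B, G, R
    ch.push_back(alpha);       // A: the foreground mask
    cv::Mat bgra;
    cv::merge(ch, bgra);
    return bgra;
}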
Technologies used. The system was developed using Microsoft's XNA framework in combination with Windows Presentation Foundation (WPF). XNA is used for rendering the users in the various landscapes, while the user interface is implemented using WPF. The computer vision component was developed in C++. Finally, an Arduino board equipped with a light sensor was used to identify the cube's orientation.
A necessary optimization involved smoothing the persons' outlines. Due to the sensor's inability to accurately discriminate the foreground (the people's shapes) from the background (what lies behind the users in the real world), a custom smoothing algorithm was needed in order to fade out the boundaries of the foreground. An early implementation ran on the CPU, but was later replaced by an HLSL shader running on the GPU due to the performance drop. The computer vision component provides the users' contours slightly inflated, thus containing a few pixels of the background as an outline. At the rendering stage, the outline is faded out using a neighbor-based algorithm, where each pixel's alpha value is the average of all its neighboring pixels.
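The CPU variant of this fade amounts to a normalized box average over the alpha channel; a minimal sketch follows, where the neighborhood radius is an assumed parameter and the deployed version performs the same averaging in an HLSL pixel shader on the GPU.

#include <opencv2/opencv.hpp>

// Neighbor-based alpha fade: each pixel's alpha becomes the average of
// its (2*radius+1)^2 neighborhood, softening the inflated outline
// toward transparency while leaving the color channels untouched.
void fadeOutline(cv::Mat& bgra, int radius = 2) {
    CV_Assert(bgra.type() == CV_8UC4);
    std::vector<cv::Mat> ch;
    cv::split(bgra, ch);
    // cv::blur computes the normalized box average over the window.
    cv::blur(ch[3], ch[3], cv::Size(2 * radius + 1, 2 * radius + 1));
    cv::merge(ch, bgra);
}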