1 Introduction

With the development of wearable cameras, first-person activity recognition has become a popular topic in recent years [1]. There are many conventional approaches to first-person activity recognition. Some of them employ motion features such as optical flow together with a classifier, e.g., LogitBoost or an SVM (support vector machine) [2, 3]. In recent years, the DCNN (deep convolutional neural network), a state-of-the-art model for visual recognition, has been proposed [4] and subsequently applied to several tasks in first-person activity recognition.

Although DCNN models provide remarkable results for image recognition, they require a large number of labeled training samples. Fine-tuning is a promising method for reducing both the number of required training samples and the training time [5–8]. Unfortunately, there are no large-scale datasets for first-person activity recognition, whereas datasets for image recognition, such as ImageNet [9] and Places [10], are publicly available. For this reason, even with fine-tuning, DCNN models for first-person activity recognition require the collection and annotation of large-scale FPV (first-person vision) videos. Castro et al. [11], for example, collected 40,000 images over 26 weeks by recording with a wearable camera and annotated all of them.

To cope with this time-consuming issue, we propose a synthesis method for generating training images. Our method consists of three steps, as shown in Fig. 1: image synthesis, fine-tuning a DCNN, and recognizing activities from natural FPV images. In this paper, we focus on reading activity recognition. Reading is a pervasive and intellectual activity in daily life, and its recognition can be useful for building context-aware interfaces [12], life-log systems [13, 14], and experience-sharing systems [15]. In addition, the image synthesis approach is well suited to reading activity recognition, because one must take care of copyright issues when collecting images of books and magazines for training. In contrast, for other activities, it is easier to collect images of the related objects, such as displays and keyboards. In this paper, our contributions are as follows:

Fig. 1 Overview of our approach

  • Reduction of the collection and annotation costs of deep learning datasets by using simple image generation and synthesis

  • A methodology for applying the synthesis approach to the recognition of first-person activities

  • An interpretation of the synthesis approach, which uses simple visual patterns, in terms of the learned deep representations

The rest of this paper is organized as follows. In Section 2, we introduce related work relevant to our approach: first-person activity recognition, deep convolutional neural networks, and synthetic image generation. In Section 3, we propose the image synthesis method for generating training images. In Section 4, we show the adaptability of our synthetic images to first-person reading activity recognition with real FPV videos. In addition, we also demonstrate the generalizability of the deep features learned from the synthetic images. In Section 5, we discuss our method in terms of the image synthesis processes through further experiments on several variants of the synthetic training dataset.

2 Related work

2.1 First-person activity recognition

Body-mounted devices help to record personal information and to analyze personal activities. One such device, the 3-axis accelerometer, provides estimates of the user’s posture, and many researchers employ it for life-logging [16, 17]. Head-mounted cameras have also become popular with the miniaturization of cameras and the development of high-efficiency video coding [18–21].

For first-person activity recognition, there are several methods based on image segmentation [22, 23] and object recognition [24, 25]. These approaches include multistage recognition processes, and hence recognition errors tend to accumulate. To avoid explicit object recognition, many studies use motion features such as optical flow with a classifier such as LogitBoost or SVM [2, 3, 26–29].

2.2 Deep convolutional neural network

DCNN models such as AlexNet [4], VGGNet [30], GoogLeNet [31], and ResNet [32] have been proposed and have demonstrated remarkable performance in image classification. In first-person activity recognition, there are DCNN-based methods in which optical flow [5, 6] and pooled motion features [33] are used as image features. Moreover, the LSTM (long short-term memory) model, a recurrent deep model for learning from sequential data, has been introduced together with DCNN models to additionally learn the temporal structure of activities [7, 8].

2.3 Synthetic image generation

For data augmentation, image synthesis is a useful approach for reducing the effort of manual annotation. Wong et al. [34] investigate the benefit of data augmentation on the MNIST handwritten character dataset. For object detection, Khail et al. [35] propose an image synthesis method in which real object images and real background images are combined. For text localization in natural images, Gupta et al. [36] also propose an image synthesis method in which computer-generated text and natural real images are combined. Sun and Saenko [37] and Su et al. [38] employ 3D CAD object models with real background images for image synthesis. Castro et al. [39] propose a method of generating synthetic structural magnetic resonance images for learning to classify schizophrenia.

3 Methodology of image synthesis

For the recognition of first-person reading activities, we synthesize training samples for the “Reading” class. The “Reading” class samples represent visual patterns of open books in FPV images, as shown in Fig. 2b. This section describes the procedure for generating the computer-generated book images and superimposing them on real background images. In addition, we explain how to prepare “Others” class images as the negative class.

Fig. 2 Synthetic training samples. a “Others” class. b “Reading” class

3.1 Generation of book images

The procedure for generating the computer-generated book images can be divided into four types of image processing: texture drawing, edge distortion, perspective projection, and rotation. We call these processes T, D, P, and R, respectively. Figure 3 shows example results of these processes, and Table 1 lists the parameters of the processes in detail.

Fig. 3 Computer-generated book image. a Texture drawing (left page). b Texture drawing (right page). c Distortion. d Perspective. e Rotation

Table 1 Main parameters in the generation processes

The process T reproduces textures like those of real books on a white canvas so that the result resembles an open book image, as shown in Fig. 3a, b. This process consists of two steps: determining a layout and drawing textures. First, we prepare two white canvases corresponding to the left and right pages of an open book. Next, we determine a layout by randomly placing figures, headlines, and text on each page.
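
As a concrete illustration of the process T, the following Python sketch draws a random page layout (a headline, a figure block, and text-like lines) on two white canvases and joins them into an open-book image. The canvas dimensions, element sizes, and counts are illustrative assumptions and not the parameters listed in Table 1.

```python
import random
from PIL import Image, ImageDraw

def draw_page(width=300, height=400):
    """Process T (sketch): draw a book-page-like layout on a white canvas."""
    page = Image.new("RGB", (width, height), "white")
    d = ImageDraw.Draw(page)
    # headline: a dark bar near the top of the page
    d.rectangle([20, 20, width - 20, 40], fill=(40, 40, 40))
    # figure: a gray block at a random vertical position
    top = random.randint(60, height // 2)
    d.rectangle([30, top, width - 30, top + 80],
                fill=(200, 200, 200), outline=(80, 80, 80))
    # body text: thin horizontal lines standing in for rows of text
    y = top + 100
    while y < height - 20:
        d.line([20, y, width - random.randint(20, 80), y], fill=(60, 60, 60), width=2)
        y += 10
    return page

# left and right pages of an open book, placed side by side on one canvas
left, right = draw_page(), draw_page()
book = Image.new("RGB", (left.width * 2, left.height), "white")
book.paste(left, (0, 0))
book.paste(right, (left.width, 0))
```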

The process D distorts the shape of the book images, as shown in Fig. 3c. For the left pages, we set a coordinate system whose origin is at the bottom-left corner of the canvas, whose positive X-axis points from right to left, and whose positive Y-axis points from bottom to top. For the right pages, we mirror this coordinate system horizontally. We distort the image by moving the pixel at (x, y) to (x, y′) with y′ = y + f(x). Here, the distortion function is defined by \(f(x) = \alpha (x \sqrt{1-x^{2}})\), where α is a parameter controlling the strength of the distortion. In the experiments, we set the strength parameter to α = 0.1.
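
The distortion can be implemented directly from this definition. The sketch below assumes that x is the column coordinate normalized to [0, 1] and that the vertical shift f(x) is scaled by the page height, since these details are not stated explicitly; for the opposite page, the column order is mirrored before and after the call.

```python
import numpy as np

def distort_page(page, alpha=0.1):
    """Process D (sketch): move the pixel at (x, y) to (x, y + f(x)) with
    f(x) = alpha * x * sqrt(1 - x**2)."""
    h, w = page.shape[:2]
    out = np.zeros_like(page)
    for col in range(w):
        x = col / max(w - 1, 1)  # assumed normalisation of x to [0, 1]
        shift = int(round(alpha * x * np.sqrt(1.0 - x * x) * h))
        if shift > 0:
            # Y runs bottom-to-top in the paper's coordinates, so a positive
            # f(x) moves the column's content toward smaller row indices.
            out[:h - shift, col] = page[shift:, col]
        else:
            out[:, col] = page[:, col]
    return out

# mirrored call for the opposite page (assumed convention):
# distorted = distort_page(np.asarray(book)[:, ::-1])[:, ::-1]
```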

The process P is a perspective projection, as shown in Fig. 3d. In order to generate FPV-like appearances, we determine the perspective projection as shown in Fig. 4, in which each of the original top-left and top-right corners (red points) is moved to a point randomly selected within a rectangle near that corner (blue rectangles). In the experiments, we set the width and height of these rectangles to 10% of the image width and height, respectively.
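
A minimal sketch of the process P with OpenCV is shown below: the two top corners are moved to random points inside rectangles whose width and height are 10% of the image size, and the resulting homography is applied with warpPerspective. Moving the corners only inward is an assumption, since the exact placement of the blue rectangles in Fig. 4 is not specified.

```python
import random
import cv2
import numpy as np

def perspective(img, jitter=0.1):
    """Process P (sketch): warp the book image with a randomly determined
    homography; `img` may include an alpha channel so that regions outside
    the book stay transparent."""
    h, w = img.shape[:2]
    dx, dy = int(jitter * w), int(jitter * h)
    src = np.float32([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]])
    dst = np.float32([
        [random.randint(0, dx), random.randint(0, dy)],          # top-left corner
        [w - 1 - random.randint(0, dx), random.randint(0, dy)],  # top-right corner
        [w - 1, h - 1],                                          # bottom corners
        [0, h - 1],                                              # are kept fixed
    ])
    H = cv2.getPerspectiveTransform(src, dst)
    return cv2.warpPerspective(img, H, (w, h))
```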

Fig. 4 Perspective projection by a randomly determined homography matrix

The process R is a rotation, as shown in Fig. 3e. Here, we rotate the images by only a small angle so that they resemble the slightly tilted open books seen during reading. Note that the black regions around the books are assigned the transparent color.

3.2 Image synthesis

For image synthesis, we superimpose the computer-generated book images onto real background images, as shown in Fig. 5. As real background images, we use images randomly selected from the ImageNet dataset [9]. In Section 5.3, we will demonstrate that using ImageNet images is superior to using other, domain-specific background images.

Fig. 5 Bounding region (green rectangle) for book image location

We set a bounding region on the background images (the green rectangle in Fig. 5) and randomly place the computer-generated book images inside this region. The region prevents the book images from being placed on the periphery of the background images.
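
A minimal sketch of this superimposition step is given below, assuming the book image carries an alpha channel from the generation processes and using a margin of 20% on each side as a stand-in for the green bounding region in Fig. 5 (the actual region size is not specified in the paper).

```python
import random
from PIL import Image

def synthesize(book_rgba, background, margin=0.2):
    """Paste a computer-generated book image (RGBA) onto a real background
    image at a random position inside a central bounding region."""
    bg = background.convert("RGB")
    bw, bh = bg.size
    # bounding region that keeps the book away from the image periphery
    x_min = int(margin * bw)
    y_min = int(margin * bh)
    x_max = int((1 - margin) * bw) - book_rgba.width
    y_max = int((1 - margin) * bh) - book_rgba.height
    x = random.randint(x_min, max(x_min, x_max))
    y = random.randint(y_min, max(y_min, y_max))
    bg.paste(book_rgba, (x, y), mask=book_rgba)  # alpha channel used as the mask
    return bg
```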

3.3 Negative samples against reading activities

The “Others” class is the negative class and hence represents various visual patterns other than those of the “Reading” class. For the “Others” class images, we simply use the background images that are used for generating the “Reading” class images. In other words, the “Reading” and “Others” samples differ only in whether an open book appears, as shown in Fig. 2b and a, respectively.

4 Experimental results with real first-person vision videos

In this section, we report the performance of our synthetic method with real FPV videos. In the experiments, we compare our DCNN model with other baseline models.

4.1 Fine-tuning of DCNN model

We use GoogLeNet(v3) [40] pre-trained on the ImageNet dataset as our DCNN model, as shown in Table 2. In fine-tuning, we retrain only the final layer of the model because, in our preliminary experiments, fine-tuning deeper layers degraded the performance. We optimize the parameters with the cross-entropy loss function using the SGD (stochastic gradient descent) algorithm. In the optimization, we use 25,000 training samples per class and feed mini-batches of size 10.
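
A minimal sketch of this fine-tuning setup is shown below, using torchvision's Inception-v3 (a recent torchvision is assumed) as a stand-in for GoogLeNet(v3); the learning rate is an assumed value, since the paper specifies only SGD, the cross-entropy loss, 25,000 samples per class, and mini-batches of size 10.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pre-trained Inception-v3 and freeze all of its layers.
model = models.inception_v3(weights=models.Inception_V3_Weights.IMAGENET1K_V1)
for param in model.parameters():
    param.requires_grad = False
# Replace and retrain only the final layer ("Reading" vs. "Others").
model.fc = nn.Linear(model.fc.in_features, 2)

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)  # assumed learning rate

def train_step(images, labels):
    """One SGD step on a mini-batch of 10 synthetic images
    (299x299 ImageNet-normalised tensors)."""
    model.train()
    optimizer.zero_grad()
    outputs = model(images)
    # In training mode Inception-v3 returns (logits, aux_logits);
    # only the main logits are used for the loss here.
    logits = outputs.logits if isinstance(outputs, tuple) else outputs
    loss = criterion(logits, labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```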

Table 2 GoogLeNet(v3) architecture

4.2 Test dataset

For evaluation, we prepare 20 real FPV videos, as shown in Fig. 6. In order to demonstrate the generalizability of the proposed method, we record the videos with two types of wearable cameras at four different places. The label “Reading” or “Others” for each image is provided by manual annotation, in which “Reading” is assigned if an open book is located around the center of the image. Table 3 shows a summary of our dataset. In addition, we use two publicly available datasets that include a “Reading” class: the LENA (Life-logging EgoceNtric Activities) dataset [41] and the MEAD (Multimodal Egocentric Activity Dataset) [42], as shown in Fig. 7.

Fig. 6 Example images of 20 videos for test. a “Others” class. b “Reading” class

Table 3 Summary of our dataset used for evaluation

Table 4 shows a summary of the two datasets used for our evaluation. Note that Table 4 includes only the “Reading” class data. For the “Others” class, we randomly select 3830 images, the same number as the “Reading” class images, from images of other activities, such as “Writing” and “Working at PC”, in the two datasets.

4.3 Evaluation result

Using our synthetic dataset, we compare the DCNN model with two baselines, 1-NN (1-nearest neighbor) and a linear SVM, in terms of precision, recall, and F-measure.

Fig. 7 Example images of two public datasets for test. a “Others” class. b “Reading” class

Table 4 Summary of public dataset used for evaluation

For the 1-NN and linear SVM baselines, we use Fisher vectors [43] as image features. Fisher vectors, like bag-of-visual-words representations [44], have often been used in image classification tasks.
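
For reference, a minimal sketch of the two baselines is given below, assuming the Fisher vectors have already been extracted (X_train, X_test) together with binary Reading/Others labels (y_train, y_test); the Fisher-vector encoding itself and the classifier hyperparameters are not specified in the paper and are left at library defaults.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.metrics import precision_recall_fscore_support

def evaluate_baselines(X_train, y_train, X_test, y_test):
    """Train 1-NN and a linear SVM on precomputed Fisher vectors and report
    precision, recall, and F-measure on the test frames."""
    results = {}
    for name, clf in [("1-NN", KNeighborsClassifier(n_neighbors=1)),
                      ("linear SVM", LinearSVC())]:
        clf.fit(X_train, y_train)
        p, r, f, _ = precision_recall_fscore_support(
            y_test, clf.predict(X_test), average="binary")
        results[name] = {"precision": p, "recall": r, "f_measure": f}
    return results
```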

We show the experimental results in Tables 5 and 6. We find that the DCNN model significantly outperforms the other two baselines on all datasets in terms of the averaged F-measure. Further improvements might be possible if motion features like optical flow are used in DCNN models (e.g., [6]), but we only evaluate the methods without motion features in order to clearly show the effectiveness of using synthetic images.

Table 5 Comparative result with our dataset in Table 3
Table 6 Comparative result with the public dataset in Table 4

5 Discussion on the image synthesis in deep learning

In this section, we examine our synthesis approach in more detail and discuss its behavior.

5.1 Effect of changing the number of training samples

First, we verify the effect of changing the number of training samples. Table 7 shows the results with 1000, 5000, and 25,000 training samples. For the “Reading” class, we observe that the F-measure remains roughly constant over the three cases. For the “Others” class, on the other hand, the F-measure improves as the number of training samples increases. Since the images in the “Others” class are diverse, increasing the number of samples is especially effective for recognizing other activities.

Table 7 Effect of changing the number of training samples

5.2 Effect of changing combinations of the generation processes

Our synthesis approach consists of the four image processing steps T, D, P, and R described in Section 3.1. Here, we verify which processes are effective in improving the recognition performance. To do this, we generate sets of images as shown in Fig. 8, in which the leftmost column indicates the processes used for image synthesis. For example, the images in the “None” row are generated without any of the processes, and the images in the “PDT” row are generated by the combination of the three processes P, D, and T.

Fig. 8 Example images synthesized by a combination of the four generation processes: R, P, D, and T

From Table 8, we observe that the process T always brings the largest improvement in F-measure (21.5% on average) and the process D the second largest (10.4% on average). This result indicates that the two processes T and D produce discriminative features for recognition, while the other two processes R and P provide less discriminative power. In fact, if book regions are overexposed, as in Fig. 9, the proposed method fails to recognize such images. We further verify which combinations contribute to improving the performance. In Fig. 10, we summarize the contribution (average F-measure difference) of each possible combination of the processes. For example, the bar at “DT” indicates the F-measure difference averaged between DT and the other possible combinations: None, R, P, and RP. We observe that the combination of D and T is the most effective in improving the F-measure. Based on these results, we conclude that the two processes D and T are required in the image synthesis for producing discriminative features, while R and P should be used in combination with DT.

Fig. 9 Recognition failure examples

Fig. 10 Contributions of each process combination

Table 8 Results of the comparison in the generation processes

5.3 Effect of using domain-specific backgrounds

In the experiments above, we used images from the ImageNet dataset as background images, as shown in Fig. 2. In order to verify the effect of this choice of background images, we evaluate the recognition performance with domain-specific backgrounds, as shown in Fig. 11. We recorded these background images at the same places where we recorded our dataset in Table 3.

Fig. 11 Synthetic training samples with domain-specific background. a “Others” class. b “Reading” class

We show the experimental results in Table 9. We find that the ImageNet backgrounds provide better performance than the domain-specific ones in terms of F-measure. In particular, we observe an increase in F-measure for the “Others” class. Since the DCNN model is pre-trained on ImageNet, using ImageNet backgrounds enables efficient learning.

Table 9 Results of the comparison with the dataset synthesized by domain-specific background images

6 Conclusions

We have proposed a method of synthetically generating training samples for deep learning. The proposed method synthesizes book images from simple computer-generated patterns and real background images. This synthesis approach is particularly useful for recognizing reading activity because of copyright issues, i.e., capturing books with a digital camera and using the images often causes trouble.

From the comparison with the two baselines, we find that our synthetic dataset is effective in combination with the DCNN model. In addition, we find that using ImageNet images as backgrounds improves recognition of the activities in the “Others” class. These results are promising for deep learning-based recognition because a large number of training images can be prepared easily.