1 Introduction

Since the late 1980s, the world of wearable devices has undergone a tremendous evolution, as component miniaturization has enabled us to freely interact with a wide range of mobile systems and embed them in many aspects of everyday life [1]. Such devices have begun to interact heavily with the personal information domain. Moreover, an increasing number of these devices are equipped with built-in cameras, which has introduced relevant changes in the acquisition, storage and automatic understanding of images and videos. Therefore, while the novel availability of images and videos has encouraged personal creativity, it has also raised a number of privacy concerns, mainly regarding unaware or unwilling subjects that may get caught in such multimedia content. Moreover, relevant legal implications should be taken into account [2], especially regarding the large amount of data continuously produced by so-called life-logging devices [3,4,5,6]. Nowadays, these devices allow their users to continuously record and share online many different kinds of data, such as videos, audio, pictures and personal data, as well as collective information or individual activities. Whereas traditional devices such as cameras or audio recorders were only used sporadically and deliberately, modern life-logging devices can record and share their data continuously, thus clashing with bystanders’ expectations about privacy and discretion [7]. For these reasons, privacy and discretion have gained great importance; as a matter of fact, the typical user of such life-logging devices may prefer to enforce privacy through location-based control of image collection, in order to avoid a burdensome later review of all collected media. Finally, automatic face recognition software now performs almost as well as humans [8]; on the one hand it offers a useful service, on the other hand it can put personal privacy at even greater risk, e.g. considering the threat represented by malware that could seize private multimedia content surreptitiously [9, 10].

Fig. 1. Contexts recorded in the videos captured at the Department of Mathematics and Computer Science of the University of Catania campus.

2 Proposed System

In this work, we present an overall architecture for context-related privacy preservation. The system has been designed to work in places affected by a high level of similarity among different contexts. The presented approach therefore enforces privacy constraints by combining computer vision methods with Bluetooth Low Energy technology for context recognition.

We tested the proposed method in non-trivial use cases: we used a low-end commercial wearable device to record portions of our University campus facilities that exhibit a high degree of similarity (e.g. offices). Specifically, the raw video data were collected by wearing the Recon Jet ™ smart glasses and recording while walking through several rooms, lounges and hallways (see Fig. 1).

2.1 Scenario and Communication Protocol

In order to guarantee users’ privacy and enforce all the required security measures, the proposed system has been provided with an ad hoc communication protocol (Fig. 2). The protocol consists of the following steps:

  1. Environment identification;

  2. Generation of a session encryption key for raw file transmission;

  3. Cloud Service for policy handling;

  4. Handled file retrieval.

In our protocol we assume the presence of trustworthy users and untampered devices. This restriction is based on the fact that no one can prevent the recording of images or sound by an uncooperative or malicious user with a hidden camera; in such cases, defining privacy policies or restrictions is pointless.

Instead, we focus our attention on a scenario where a user with a wearable device wants to respect the rules of the environment in which they are located, obtaining from the environment itself the privacy policies defined by others. In this sense, all the encryption operations are aimed at preventing any access to the images by unauthorized users before the privacy rules are applied.

Finally, we assume that the “owner” (or at least the bystanders) of a specific location has uploaded to the Cloud System a set of preferences or rules that determine whether or not to enforce any privacy-related restriction when the context of interest is recognized.

In the following formalism we define three agents: a generic wearable device \(W^n\) that captures the environment images, a generic beacon \(B^m\) that identifies a particular portion of the environment, and a Cloud Service C that handles the recorded images.

The first phase (environment identification) involves both the wearable device and the nearest beacon. The beacon continuously broadcasts its identity, providing its \({\texttt {ID}}^m_{{\texttt {b}}}\) and a cloud-related public key \({\texttt {K}}^C_{{\texttt {Pub}}}\), so that

$$\begin{aligned} \qquad B^m \rightarrow W^n : {\texttt {ID}}^m_{{\texttt {b}}}, {\texttt {K}}^C_{{\texttt {Pub}}}&\end{aligned}$$

The Cloud Service public/private key pair \({\texttt {K}}^C_{{\texttt {Pub}}}, {\texttt {K}}^C_{{\texttt {Priv}}}\) is generated by an independent Certification Authority; this step provides authentication and confidentiality for the Cloud Service. The public key can obviously also be retrieved in other ways.
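The following is a minimal sketch of the information carried by this first broadcast. The field names, the Python representation and the PEM-encoded key are illustrative assumptions: the paper does not specify the advertisement encoding, and a real BLE advertisement would more likely carry a key fingerprint or a URL to the key rather than the key itself.

```python
# Sketch of the phase-1 broadcast B^m -> W^n (illustrative field names).
from dataclasses import dataclass

@dataclass(frozen=True)
class BeaconAdvertisement:
    beacon_id: str            # ID_b^m, identifies a portion of the environment
    cloud_pub_key_pem: bytes  # K_Pub^C (assumed PEM-encoded here); in practice
                              # this could be a fingerprint or a URL to the key

# What the wearable W^n receives from the nearest beacon B^m
adv = BeaconAdvertisement(
    beacon_id="beacon-01",
    cloud_pub_key_pem=b"-----BEGIN PUBLIC KEY-----\n...",
)
```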

After detecting the beacon’s presence, the wearable device generates a session key \({\texttt {K}}^{n,i}_{{\texttt {S}}}\), which is used to encrypt the recorded video, hereafter called \(V^{n,i}\), and a timestamp \(T^{n,i}\), which uniquely identifies the video.

In the following, encrypted information is represented with the common bracket formalism \(\{\cdot \}_k\), where k is the encryption key.

Fig. 2. Network protocol.

The encrypted video is stored locally until a network connection becomes available or the user intervenes. When the connection is available, or when the device owner decides, the stored ciphered data are uploaded to the Cloud Service:

$$\begin{aligned} \qquad W^n \rightarrow C : {\texttt {ID}}^m_{{\texttt {b}}}, \left\{ V^{n,i}\right\} _{{\texttt {K}}^{n,i}_{{\texttt {S}}}}, \left\{ T^{n,i}, {\texttt {K}}^{n,i}_{{\texttt {S}}}, {\texttt {K}}^{n,i}_{{\texttt {R}}}, [{\texttt {ID}}^l_{{\texttt {b}}}]_{l\ne m} \right\} _{{\texttt {K}}^C_{{\texttt {Pub}}}}&\end{aligned}$$

In this transmission the wearable device sends:

  • the \({\texttt {ID}}^m_{{\texttt {b}}}\) in clear text;

  • the recorded video \(V^{n,i}\) encrypted with the session key \({\texttt {K}}^{n,i}_{{\texttt {S}}}\);

  • a tuple, encrypted with the Cloud public key \({\texttt {K}}^C_{{\texttt {Pub}}}\), containing the transmission timestamp \(T^{n,i}\), the session encryption key \({\texttt {K}}^{n,i}_{{\texttt {S}}}\), a response encryption key \({\texttt {K}}^{n,i}_{{\texttt {R}}}\), and the list of beacons heard by the device, excluding \({\texttt {ID}}^m_{{\texttt {b}}}\).

The Cloud Service, which owns the corresponding \({\texttt {K}}^C_{{\texttt {Priv}}}\) key, is the only party able to decrypt the last part of the received communication. It retrieves the session key \({\texttt {K}}^{n,i}_{{\texttt {S}}}\), so it can decrypt the video. By means of the list of beacon IDs, it can retrieve the previously defined privacy policies and apply them to the video.
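As an illustration of this exchange, the sketch below builds the upload message on the wearable side, assuming AES-GCM for the session and response keys and RSA-OAEP for the Cloud public key; the paper does not mandate specific ciphers, encodings or field names, so all of these are assumptions (the Cloud RSA key is also assumed large enough, e.g. 3072 bits or more, for the metadata tuple to fit within a single OAEP encryption).

```python
# Hedged sketch of step 2 (W^n -> C), under the cipher assumptions stated above.
import json, os, time
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def build_upload_message(video_bytes, beacon_id, other_beacon_ids, cloud_pub_pem):
    cloud_pub = serialization.load_pem_public_key(cloud_pub_pem)

    # Session key K_S^{n,i} and response key K_R^{n,i} (they must differ).
    k_s = AESGCM.generate_key(bit_length=256)
    k_r = AESGCM.generate_key(bit_length=256)
    timestamp = time.time()                    # T^{n,i}

    # {V^{n,i}}_{K_S}: the video encrypted with the session key.
    nonce = os.urandom(12)
    enc_video = nonce + AESGCM(k_s).encrypt(nonce, video_bytes, None)

    # {T, K_S, K_R, [ID_b^l]_{l != m}}_{K_Pub^C}: metadata only the Cloud can read.
    meta = json.dumps({
        "timestamp": timestamp,
        "k_s": k_s.hex(),
        "k_r": k_r.hex(),
        "other_beacons": other_beacon_ids,     # beacons heard, excluding ID_b^m
    }).encode()
    enc_meta = cloud_pub.encrypt(
        meta,
        padding.OAEP(mgf=padding.MGF1(hashes.SHA256()),
                     algorithm=hashes.SHA256(), label=None))

    # Message to the Cloud: beacon ID in clear, encrypted video, encrypted tuple.
    message = {"beacon_id": beacon_id, "video": enc_video, "meta": enc_meta}
    return message, k_r    # K_R is kept to decrypt the Cloud response later
```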

After this communication and the related message decoding, the cloud-resident application is able to recognize the context of each image through computer vision algorithms and to apply the required privacy enforcement rules. Only after this process, and the blurring of privacy-sensitive images, can the resulting video be transmitted back to the wearable device’s owner.

Before its transmission from the Cloud Service to the user’s client, the processed video is re-encrypted using the response key \({\texttt {K}}^{n,i}_{{\texttt {R}}}\), to prevent unauthorized access.

Finally, the wearable device requests the handled video from the Cloud Service repository:

$$\begin{aligned} \qquad&W^n \rightarrow C: \left\{ T^{n,i} \right\} _{{\texttt {K}}^{n,i}_{{\texttt {R}}}}&\\ \qquad&C \rightarrow W^n : \left\{ T^{n,i}, \tilde{V}^{n,i}\right\} _{{\texttt {K}}^{n,i}_{{\texttt {R}}}}&\end{aligned}$$

Note that the response key \({\texttt {K}}^{n,i}_{{\texttt {R}}}\) and the timestamp \(T^{n,i}\) can be provided by the wearable device to any authorized user, so that steps 3 and 4 can be carried out independently. For this reason \({\texttt {K}}^{n,i}_{{\texttt {R}}}\) must be different from \({\texttt {K}}^{n,i}_{{\texttt {S}}}\).
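Continuing the assumptions of the previous sketch (AES-GCM, JSON encoding, prepended nonces), steps 3 and 4 could then be realized by any party holding \(T^{n,i}\) and \({\texttt {K}}^{n,i}_{{\texttt {R}}}\), for example as follows.

```python
# Sketch of steps 3-4: request and decrypt the handled video using K_R^{n,i}.
import json, os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def request_processed_video(k_r, timestamp):
    # {T^{n,i}}_{K_R}: the request proves knowledge of the response key.
    nonce = os.urandom(12)
    payload = json.dumps({"timestamp": timestamp}).encode()
    return nonce + AESGCM(k_r).encrypt(nonce, payload, None)

def decrypt_response(k_r, response_bytes):
    # C -> W^n: {T^{n,i}, V~^{n,i}}_{K_R}, with the nonce assumed prepended.
    nonce, ciphertext = response_bytes[:12], response_bytes[12:]
    return AESGCM(k_r).decrypt(nonce, ciphertext, None)
```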

3 Visual Context Recognition

In Sect. 2.1 we described the communication protocol between the wearable device and a cloud-resident application. In this section we describe how the cloud application performs the required context recognition. This goal is achieved by using dedicated computer vision algorithms together with machine learning solutions. Once the context is identified, the cloud is responsible for applying the required privacy-preserving policies, which are enforced by blurring the images belonging to contexts for which the users have requested a privacy enforcement rule. In the following we compare two different implementations of the proposed approach. The first uses Bag-of-Words for feature extraction and the k-Nearest Neighbors algorithm for context recognition (see Sect. 3.1). The second uses AlexNet for feature extraction and a Support Vector Machine for context recognition (see Sect. 3.2). Finally, these two implementations are compared on the basis of their results and performance (see Sect. 4).

3.1 Bag-of-Words and k-Nearest Neighbors Algorithm

The Bag-of-Words (BoW) [11] method was originally developed for information retrieval in text document analysis. For image processing purposes, the same model can be applied by creating a vocabulary of visual words, constructed as a catalog of visual features. The BoW model relies on distance-based clustering of features extracted from local regions after keypoint detection. The BoW model can be applied to image classification through the following steps:

  1. extract local regions from points of interest;

  2. compute local descriptors on these local regions;

  3. compute a visual vocabulary by clustering the local descriptors;

  4. represent an image as the distribution of its visual words with respect to the computed visual vocabulary.

In this work, the BoW model has been used with Dense-SURF features, computed on an 8 by 8 pixel grid. The visual vocabulary, obtained with k-means clustering, consists of 1024 visual words.
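A minimal sketch of this pipeline is given below, assuming OpenCV’s SURF implementation (available in opencv-contrib-python) and scikit-learn’s mini-batch k-means; all parameters other than the 8-pixel grid and the 1024-word vocabulary stated above are assumptions.

```python
# Hedged sketch of the Dense-SURF Bag-of-Words representation.
import cv2
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def dense_surf(gray, step=8, size=8):
    # Dense sampling: one keypoint every `step` pixels on the grid.
    kps = [cv2.KeyPoint(float(x), float(y), size)
           for y in range(0, gray.shape[0], step)
           for x in range(0, gray.shape[1], step)]
    surf = cv2.xfeatures2d.SURF_create()
    _, desc = surf.compute(gray, kps)
    return desc                                  # (n_keypoints, 64) descriptors

def build_vocabulary(train_grays, n_words=1024):
    # Cluster all training descriptors into 1024 visual words.
    all_desc = np.vstack([dense_surf(g) for g in train_grays])
    return MiniBatchKMeans(n_clusters=n_words, n_init=3).fit(all_desc)

def bow_histogram(gray, vocabulary):
    # Represent a frame as the normalized distribution of its visual words.
    words = vocabulary.predict(dense_surf(gray))
    hist = np.bincount(words, minlength=vocabulary.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-9)
```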

In our solution we created different classifiers, one for each beacon, in order to assist context recognition. We used the following set-up: we split the dataset into three parts and used one or two parts for training and a single part for testing.

The k-Nearest Neighbors algorithm [12] (k-NN) is the classifier we used when BoW is employed as the representation. This algorithm predicts the class of an image by considering its k nearest neighbors in the training data. In our study we used the 1-NN implementation.
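Coupled with the BoW histograms sketched above, the 1-NN classifier can be realized, for instance, with scikit-learn; the snippet below is only illustrative.

```python
# 1-NN context classifier on top of the BoW histograms (k = 1, as stated above).
from sklearn.neighbors import KNeighborsClassifier

def train_1nn(train_histograms, train_labels):
    return KNeighborsClassifier(n_neighbors=1).fit(train_histograms, train_labels)

# Illustrative usage:
# clf = train_1nn(X_train, y_train)
# context = clf.predict([bow_histogram(gray_frame, vocabulary)])[0]
```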

3.2 AlexNet and SVM Algorithm

AlexNet [13] is a convolutional neural network (CNN) for object recognition. It is composed of about 650,000 neurons and 60 million parameters, and the model has been trained on a subset of the ImageNet dataset composed of 1.2 million images belonging to 1000 categories. We used AlexNet as an alternative to BoW for image representation.

We coupled the AlexNet representation with an SVM [12] classifier, used in a multiclass setting: the algorithm learns boundaries that separate the samples into disjoint classes, and in this study we have six classes, one for each context.
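The following is a hedged sketch of this setup, using a pre-trained torchvision AlexNet as a fixed feature extractor and a linear SVM from scikit-learn; the choice of the 4096-dimensional fc7 activations as features, the input preprocessing and the linear kernel are assumptions, since the paper does not specify them (a reasonably recent torchvision is also assumed).

```python
# Hedged sketch of the AlexNet/SVM pipeline under the stated assumptions.
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1).eval()
# Keep all classifier layers except the final 1000-way ImageNet layer (fc7 output).
feature_head = torch.nn.Sequential(*list(alexnet.classifier.children())[:-1])

preprocess = T.Compose([T.ToTensor(), T.Resize((224, 224)),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

@torch.no_grad()
def alexnet_features(rgb_image):
    x = preprocess(rgb_image).unsqueeze(0)            # (1, 3, 224, 224)
    x = torch.flatten(alexnet.avgpool(alexnet.features(x)), 1)
    return feature_head(x)[0].numpy()                 # 4096-d feature vector

def train_svm(feature_matrix, labels):
    # Multiclass SVM over the six context classes.
    return SVC(kernel="linear").fit(feature_matrix, labels)
```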

Fig. 3. Example images from the dataset.

4 Experimental Settings and Results

The experiments were designed to test the proposed beacon-based system by comparing its efficacy when used to improve two well-known classification methods based on Bag-of-Words [11] and AlexNet [13]. The classic classification methods are therefore taken as the reference baseline for the results presented in the next sections. The Bag-of-Words model has been used jointly with the k-Nearest Neighbors [12] algorithm for classification. Similarly, AlexNet has been used to extract features that are then fed to an SVM [12] algorithm for classification. Our dataset is composed of six classes, each related to a different context (see Fig. 1).

Table 1. Correlation between beacons, images and classes. \(N_{B}\) is the number of images used for the classifier training step in the baseline method and in our solution. \(N_{C}\) is the number of images per class. The parameter K is 1 for a single part of the dataset (e.g. T1) and 2 for combined parts (e.g. T12).
Table 2. Performance of the different proposed setups based on BoW/kNN and AlexNet/SVM representation: \(\alpha ^{{\tiny BoW}}\) is the baseline accuracy of the standard BoW model, \(\alpha ^{{\tiny AN}}\) is the baseline accuracy of the standard AN model, \(\alpha ^{{\tiny BoW}}_{*}\) is the accuracy of our improved BoW-based model with beacon driven context classification, \(\alpha ^{{\tiny AN}}_{*}\) is the accuracy of our improved AN-based model.

4.1 Dataset

The dataset is composed of video frames (Fig. 3). The frames have been collected from a set of recorded videos, captured with a Recon Jet by an operator walking through one wing of the building (Fig. 1). We performed many simulations for each of the presented methods. In order to collect sufficient data and obtain statistically sound results, we used several configurations for both the training and the testing sets. The dataset has been split into three equally sized partitions (T1, T2 and T3); moreover, three combined partitions have been created: T12 combining T1 and T2, T13 combining T1 and T3, and T23 combining T2 and T3. Each of these partitions has been used independently for training in each simulation, paired with a complementary partition as test set (see Table 1).
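A small sketch of how such partitions can be built is shown below; the random split and the helper names are assumptions (the original splitting criterion is not detailed), and the actual train/test pairings are those listed in Table 1.

```python
# Illustrative construction of the partitions T1, T2, T3 and their combinations.
import numpy as np

def make_partitions(n_frames, seed=0):
    idx = np.random.default_rng(seed).permutation(n_frames)   # assumed random split
    t1, t2, t3 = np.array_split(idx, 3)
    return {"T1": t1, "T2": t2, "T3": t3,
            "T12": np.concatenate([t1, t2]),
            "T13": np.concatenate([t1, t3]),
            "T23": np.concatenate([t2, t3])}

# Each training partition is then paired with a complementary test partition,
# following the configurations reported in Table 1.
```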

4.2 BoW Model

In order to compare our beacon-based solution with well-established baselines, we initially used a BoW model and the k-NN algorithm to obtain a reference baseline. Table 2 shows the results of such a model for each combination of training and testing sets. In this phase the classifier has been trained for context recognition among all possible classes, with no restriction: the BoW/k-NN model has been applied to classify each frame with respect to six different contexts (see Fig. 1). While most tests produced consistent results, it should be noted that when T3 is used as the test set the classifier is noticeably less accurate. Similarly, when T3 is used for training, the resulting classifier also obtains a very low accuracy. Indeed, the T3 partition is affected by relevant noise (e.g. blurred or overexposed frames, scenes that are too dark or too bright). On the other hand, we also noticed that combining T3 with either T1 or T2 for training considerably increases the classification capability of the classifier. We suspect that T3 helps to train the classifier to recognize contexts even from noisy data.

4.3 Beacon-Enhanced BoW Model

Table 2 also shows the performance obtained by an improved version of the BoW model which makes use of beacon-driven context recognition (see Sect. 2). In this phase we used two classifiers, one for each beacon involved in our experiments. The first classifier has been used to detect the classes associated with the contexts related to the first beacon: Hall1, WC, Room1 and Corridor. The second classifier has been used to recognize the remaining classes, related to the second beacon: Corridor, Room2 and Hall2. The data provided to the classifiers were similar to the data used for the standard BoW model (Sect. 4.2). Moreover, for this second experiment, the device also stored, for each frame, a tag with the ID list of the beacons in range at recording time. This setup allowed us to obtain a higher system accuracy (compare columns 3 and 4 of Table 2 with columns 6 and 7). Finally, as in the previous experiment, T3 showed the same noise-related issues in this second scenario.
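The frame routing implied by this setup can be sketched as follows; the beacon IDs, the dictionary layout and the fallback to an unconstrained classifier are illustrative assumptions, while the class groupings are those listed above.

```python
# Illustrative beacon-driven dispatch: each frame is classified by the model
# trained on the classes associated with the beacon in range when it was recorded.
BEACON_CLASSES = {
    "beacon-1": ["Hall1", "WC", "Room1", "Corridor"],
    "beacon-2": ["Corridor", "Room2", "Hall2"],
}

def classify_frame(frame_features, beacon_ids_in_range, classifiers):
    # `classifiers` maps a beacon ID to the classifier restricted to its classes;
    # the "unconstrained" entry (an assumption) covers frames with no known beacon.
    for beacon_id in beacon_ids_in_range:
        if beacon_id in classifiers:
            return classifiers[beacon_id].predict([frame_features])[0]
    return classifiers["unconstrained"].predict([frame_features])[0]
```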

Table 3. Improvements with respect to BoW model: \(\alpha ^{{\tiny BoW}}\) is the baseline accuracy of the standard BoW model, \(\alpha ^{{\tiny BoW}}_{1}\) and \(\alpha ^{{\tiny BoW}}_{2}\) are the accuracies related respectively to the classes belonging to the first or second beacon in our improved BoW model, \(\alpha ^{{\tiny BoW}}_{1,2}\) is the average accuracy of our improved BoW model
Table 4. Improvements with respect to the AlexNet (AN) model: \(\alpha ^{{\tiny AN}}\) is the baseline accuracy of the standard AN model, \(\alpha ^{{\tiny AN}}_{1}\) and \(\alpha ^{{\tiny AN}}_{2}\) are the accuracies related respectively to the classes belonging to the first or second beacon in our AN-based model, \(\alpha ^{{\tiny AN}}_{1,2}\) is the average accuracy of our AN-based model

4.4 AlexNet Model

In order to prove the efficacy of the proposed beacon-driven context recognition with respect to standard image-recognition-based models, we tested and compared a hybrid approach. In this setup we preprocessed the video frames using AlexNet [13], obtaining a feature vector for each frame. We then used such feature vectors as input for an SVM classification algorithm. As done previously (see Sects. 4.2 and 4.3), also for this hybrid method we compare the results of an unconstrained test, used as the comparison baseline, with our improved beacon-driven approach. As in the previous experiments, the noisy partition T3 affected the classification accuracy of our implemented models. Moreover, although the AlexNet architecture should be robust to this kind of noise, in our experiments we noticed that a strongly noisy video recording can still degrade it. On the other hand, when T3 is used in conjunction with a low-noise partition, it seems to improve the accuracy of the classifier (see Sect. 4.5).

4.5 Discussion

The results of the experiments are reported in Tables 2, 3 and 4. In Table 2 we report the performance of the standard Bag-of-Words (\(\alpha ^{{\tiny BoW}}\)) and AlexNet (\(\alpha ^{{\tiny AN}}\)) approaches, as well as the performance of our improved models (\(\alpha ^{{\tiny BoW}}_{*}\) and \(\alpha ^{{\tiny AN}}_{*}\)); these latter two also make use of beacon-driven context classification to improve their accuracy. The same results are reported in the third and fourth columns of Table 2. In Table 3 the performance of the implemented BoW-based models is analyzed with respect to the two sets of classes (those related to beacon 1 and those related to beacon 2): \(\alpha ^{{\tiny BoW}}\) is the baseline accuracy of the standard BoW model, \(\alpha ^{{\tiny BoW}}_{1}\) and \(\alpha ^{{\tiny BoW}}_{2}\) are the accuracies for the classes belonging to the first and second beacon, respectively, in our improved BoW model, and \(\alpha ^{{\tiny BoW}}_{1,2}\) is the average accuracy of our improved BoW-based model. Table 4 shows the improvement introduced by our modifications to the AlexNet model: \(\alpha ^{{\tiny AN}}\) is the baseline accuracy of the standard AlexNet model, \(\alpha ^{{\tiny AN}}_{1}\) and \(\alpha ^{{\tiny AN}}_{2}\) are the accuracies for the classes belonging to the first and second beacon, respectively, in our improved AN model, and \(\alpha ^{{\tiny AN}}_{1,2}\) is the average accuracy of our improved AN model (see Tables 3 and 4). Finally, Fig. 4 shows an overview of the implemented methods and the improvements introduced by the proposed beacon-driven recognition techniques.

Fig. 4. Comparison of the Bag-of-Words and AlexNet representations with and without the exploitation of beacons.

5 Conclusions

In this work, we presented a hybrid approach that helps life-logging wearable devices enforce restrictions for context-related preservation of users’ privacy. The introduction of Bluetooth beacon technology has proven useful to improve the context recognition accuracy of well-known image classification solutions based on Bag-of-Words and AlexNet representations. The results showed that the proposed solution is both robust to noise-affected datasets and effective for environments that present a high degree of similarity between different contexts. Moreover, the developed system is highly customizable to enforce the privacy choices of the context owners or bystanders. Finally, the cloud-oriented support makes it suitable for a wide range of different devices and applications.