1 Introduction

The idea of digitally recording our everyday lives for our own personal and private use is not a new phenomenon. The written diary has long been used to record the experiences of everyday life and has been handed down from generation to generation. With the near-pervasive application of computing technology, the form we can use to record our daily experiences is changing. The earliest motivation for automatic generation of personal digital archives can be traced back to 1945, when Vannevar Bush expressed his vision [7] that our lives could be recorded with the help of technology and that access to these ‘digital memories’ could be made easier via what became known as hypertext/hypermedia. Automatically generating autobiographies has become more realistic recently with advances in lightweight computing devices and sensors, though it is recognized that being able to digitally record everything about our lives does not imply we have perfect recall of everything from our past, or that such recall would even be a good idea [36]. Mobile devices now carry low-cost, small and lightweight embedded sensors including cameras, GPS, Bluetooth, accelerometers and gyroscopes, which make computing devices portable and enable life recording to be done unobtrusively. Lifelogging is the term describing the notion of digitally recording aspects of our daily lives, where the recorded content in multiple media is a reflection of our activities which we subsequently use to obtain an insight into our daily lives by browsing, searching, or querying. In some cases the motivation for this is work-related, sometimes it is to record special events, and sometimes it is to record for purposes as yet unknown.

Conventional content-based methodologies for the retrieval of images or video try to map low-level features to high-level semantics but fail to bridge the semantic gap. This approach has limitations because of the lack of coincidence between low-level features and query semantics. This makes concept-based high-level semantic reasoning an attractive option, where concepts are first detected via a mapping from low-level features, learned from training data using generic methods, and then fused together to reason about, or infer, a final set of concepts which may be used as a representation for whatever the application requires. Progress in the development of semantic concept detection for video can be seen in the annual TRECVid benchmark [37]. As reported in [22], automatically detected concepts in the TV news broadcasting domain can already be scaled to 1,000+: 101 concepts are defined in [41] and 834 in [27], while 491 concepts are detected in [39], 374 in [9] and 311 in [22].

However, the large effort in building concept detectors in the TV news broadcast domain cannot be applied directly to the domain of retrieval from everyday lifelog activities. Among the above-mentioned semantic concept sets, the Large-Scale Concept Ontology for Multimedia (LSCOM) is the most comprehensive taxonomy developed for standardizing multimedia semantics in the broadcast TV news domain [27]. As a framework, the LSCOM effort also produced a set of use cases and queries along with a large annotated data set of broadcast news video. However, in the lifelogging domain, many of the LSCOM concepts, for example, weapon, government leader, etc., are rarely if ever encountered. In this paper, we investigate the definition of everyday concepts and their automatic detection from visual lifelogs in order to satisfy the requirements for indexing everyday multimedia lifelogs. We also investigate comprehensive ontological similarities for the lifelogging domain.

The rest of the paper is organized as follows: in Sect. 2, related work on lifelogging research and automatic concept selection in multimedia retrieval is discussed. In Sect. 3, we select everyday activities in order to construct a semantic concept space for a passive-capture visual lifelogging domain. In Sect. 4, various semantic similarity measures are investigated on two mainstream semantic networks—WordNet and ConceptNet. Our algorithm for semantic density-based concept selection is discussed in Sect. 5 and concept ranking in Sect. 6, followed in Sect. 7 by an evaluation based on a user experiment. Finally, we close the paper with conclusions.

2 Related work

Many digital devices and sensors are now lightweight and computationally efficient, and can be used to capture the heterogeneous contexts of wearers as part of automatic lifelogging. In wearable lifelogging, we can capture visual data using head-mounted cameras [17] or cameras mounted in front of the chest [35]. Using visual information we can infer contextual information like ‘who’, ‘what’, ‘where’ and ‘when’, and using digital cameras or camera-enabled mobile devices is a very attractive form of lifelogging. Visual lifelogging is the term used to describe both image-based and video-based lifelogging. Within the lifelogging community, cameras are often used as wearable devices to record still images [35] or videos [17, 25]. Example visual lifelogging projects are Steve Mann’s WearCam [24], the DietSense project at UCLA [31], the WayMarkr project at New York University [6], the inSense system at MIT [5] and the SenseCam [16] developed at Microsoft Research, Cambridge. Though these projects use various mobile devices for digital logging, they have the common feature of using cameras to capture still images or videos, taken from the first-person view of the wearer. Camera-embedded mobile phones are employed in both the DietSense and WayMarkr projects for diet monitoring and experience recall, whereas the SenseCam is a sensor-augmented wearable camera designed to capture a digital record of the wearer’s day by recording a series of images and other sensor data. Viewing SenseCam images has recently been shown to be effective in supporting recall of past memories for memory-impaired individuals [35]. Because of its advantages of multimodal context-awareness and lightweight, unobtrusive logging with long battery life, we employ the SenseCam, shown in Fig. 1, as the visual recording device in our work.

Fig. 1 The Microsoft SenseCam (right, as worn by a user)

The management of visual lifelogging data, such as SenseCam image streams, should involve semantic indexing and retrieval, for which much preliminary work has already been done in other domains such as TV news broadcasting. Concept-based information retrieval has received much interest in the community due to its potential for bridging the semantic gap and its semantic reasoning capabilities. In concept-based video retrieval, for example, there are known methods to expand query terms into a range of concepts, and user judgments and feedback can be used to reveal correlations between concepts. Subjects can be asked to choose the concepts they think are appropriate for their queries. This kind of approach to searching, however, may be difficult for a user when the semantic “space” of concepts becomes large. Previous work has only tested how users use concepts in retrieval based on a small number of concepts and queries, as in work by Christel et al. [11] in which two collections included 23 queries with 10 concepts, and 24 queries with 17 topics. We also tend to get low inter-annotator agreement, as described in [11]. Previous work on using semantic concepts explicitly from lifelog data has used the distribution of the occurrence of these semantic concepts to profile an individual’s activities and behaviour from a broad, generic viewpoint [12], and this has used a static, pre-defined set of concepts as its basis.

Automatic approaches to selecting appropriate concepts for semantic querying fall into two categories: lexical and statistical [28]. Lexical approaches leverage the linguistic relationships among semantic concepts in deciding the most related and most useful concepts, whereas statistical approaches apply occurrence patterns from a corpus to reveal concept correlations. Statistical approaches also make use of collection-specific associations driven by the corpus set while lexical approaches depend on global linguistic knowledge.

Semantic similarity can be used as a measure to rank the relevance of concepts to a given query text, and WordNet [26] is a popular source of the lexical knowledge needed for this. One approach involves selecting concepts that minimize the semantic distance between concepts and query terms. WordNet-based semantic similarity between query terms and concepts can be used to weight concepts, and some of the work in this area goes back many years, e.g., [32, 34]. In more recent work, the Lesk-based similarity measure [4] has been demonstrated as one of the best measures of lexical relatedness and is employed in [13] for lexical query expansion. WordNet-based concept extraction is also investigated in [14] to evaluate the effectiveness of high-level concepts used in video retrieval, where it achieved results comparable to user-selected query concepts. The issue with concept selection when using a lexical ontology such as WordNet is that local similarities across branches are not uniform, which can lead to incomparable similarity values obtained from local ontology branches, as argued in [42]. In work by Snoek [40], information content is used to calculate similarity in order to deal with the problem of similarity inconsistency caused by non-uniform distances within the WordNet hierarchy.

A large manual annotation effort in the TRECVid benchmarking activity for video retrieval [29], and in the LSCOM concept ontology for broadcast TV news [1], has enabled the analysis of static patterns for video retrieval. The ground truth of hundreds of individual concepts and dozens of query annotations is used in comparing retrieval systems as well as selecting and analyzing the relevant concepts associated with particular queries. More recent work by Wei and Ngo [42] proposed an ontology-enriched semantic space model to cope with concept selection in a linear space. The ontological space is constructed with a minimal set of concepts and plays the role of a computable platform to define the necessary concept sets used in video search. This linear space guarantees the uniform and consistent comparison of concept scores for query-to-concept mapping [42].

3 Constructing an event semantic space (EES)

One of the challenges in building automatic classifiers or concept detectors for images or videos is to reveal higher-level semantics when multiple, highly correlated concepts are detected. Since the concepts involved in lifelogging cover many aspects of our daily lives, and the range of concepts is thus very broad, interpreting lifelogging events demands a strategy which helps to select the most appropriate combination of concepts for event representation rather than simply all possible concepts. We now elaborate on the construction of a semantic space reflecting everyday event semantics.

3.1 Everyday activities: exploring and selecting

Patterns of everyday activities are investigated in areas such as occupational therapy and diet monitoring to improve physical and mental health by understanding how we use our time in various activities. Several investigations and surveys have shown that most of our time is spent on activities such as sleeping and resting (34%), domestic activities (13%), TV/radio/music/computers (11%), and eating and drinking (9%), which collectively account for nearly 70% of the time in a typical day.

In [19], the most frequently occurring everyday activities are explored to rate the level of enjoyment people experience during these activities; 16 activities are investigated and ordered by decreasing enjoyment rating. The impact of everyday activities on our feelings of enjoyment also affects our health, which makes these activities important in an analysis of well-being and lifelogging. Similar patterns of activity are shown in [2, 3, 10], with sleep being the most dominant activity followed by housework, watching TV, employment/study, etc. [2, 3] also show that the distribution of activities varies with age group. However, some activities, such as sleeping, eating and drinking, personal care and travel, achieve high agreement in participation among all subjects investigated.

In constructing an event semantic space for modelling activities, we select our target activities according to the following criteria:

  • Time dominance A small number of activities occupy a large amount of our time, and analysis of these activities maximizes coverage of the relationship between time spent and our well-being.

  • Generality Even though the time spent on activities varies from age group to age group, there are some activities that are engaged in by all age groups. The selection of activities with high group agreement will increase the generality of our activity analysis.

  • High frequency It is important to select the activities which have enough sample data so that we can build automatic classifiers.

Table 1 Target activities for our lifelogging work

With these criteria in mind, we selected the activities listed in Table 1 as targets.

3.2 Topic-related concepts

How to decide the set of concepts related to the event topics above is the focus of our work. In state-of-the-art everyday concept detection and validation [8], concepts are suggested by several SenseCam users after they have gone through and studied several days of their own lifelogged events. Then, having become more familiar with their own lifestyles through reviewing their own lifelogs, the users discuss and filter the concepts with the added criterion that each concept can be detected with satisfactory accuracy. During this procedure, concepts are not selected with the related event topics in mind, so concepts may be chosen that are not helpful in interpreting specific event semantics, while concepts which might help in recognizing and interpreting a specific event type may be ignored. This limits the performance of event detection and semantic interpretation, especially when particular concepts relevant to the event are missed. Given that concept detection is always noisy, the situation is compounded when a non-relevant concept is selected for use in a query, which reduces performance by introducing high noise at the query step.

To find a set of candidate concepts related to each of the activities described in Sect. 3.1, we carried out user experiments on concept selection in which candidate concepts related to each of the activities above were pooled from user input. Although individuals may have different contexts and personal characteristics, there is a common understanding of concepts that is already socially agreed and allows people to communicate about them, according to [20] and [18]. This makes it reliable for users to choose suitable concepts relevant to activities. User experiments were carried out to discover candidate concepts which potentially have high correlation with activity semantics; details of the experimental methodology are described in Sect. 7.1.

The user experiments gave us a set of candidate concepts for the activities we explored in Sect. 3.1. These concepts were used to construct an event-based semantic space for every activity, with each concept forming one dimension of the space. Events are represented by groups of images, each with their own concepts. In the following sections we propose a novel semantic density-based concept selection algorithm to find the most useful concepts, because we believe that existing algorithms are not a good match for the particular problem of detecting the most appropriate semantic concepts for lifelog events.

4 Investigating EES concept relationships

An ontology is used to represent the concepts and concept relations within a domain. Usually, ontologies are considered as graphs where nodes represent concepts and edges represent relations between them; in this way the ontology structure captures the semantics of the domain. Ontology-based similarity or relatedness measures can exploit the ontology structure or additional information to quantify the likeness or correlation between two concepts.

4.1 Lexical similarity based on taxonomy

Concepts are clustered according to their distribution in the semantic space. Lacking explicit features or coordinates in this semantic space, concepts can only be clustered in terms of the ontological relationships between them. As a popular English lexical ontology, WordNet [26] is widely used as a semantic knowledge base. Synsets are the basic elements in WordNet, representing the senses of words. The current version (3.0) of WordNet contains 155,287 words grouped into 117,659 synsets. The is-a relationship is modeled as hypernymy in WordNet, where one concept is more general than another; hyponymy represents the inverse, where one concept is more specific than another. The meronymy/holonymy connection represents a part-of relationship. This comprehensive coverage and explicit representation of concept relationships make WordNet useful in analyzing the concept relationships within the semantic spaces in our work.

Semantic similarity has been explored in previous research to define a metric for concept relationship analysis. Rada [30] was the first to develop the basis for edge-based measures of concept similarity, defining the distance in a semantic network as the length of the shortest path between two concept nodes. Richardson and Smeaton [34] further refined these similarity measures. The Hirst and St-Onge [15] similarity measure takes path direction into account; the idea is that concepts are semantically close if their WordNet synsets are connected by a short path which does not change direction too often. Another similarity definition is proposed by Wu and Palmer in [43], originally for verb similarity in machine translation since most earlier work was built upon noun concepts. This was extended by Leacock and Chodorow [21], also a path-based similarity measure, which determines similarity with regard to the maximum depth of the taxonomy.
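
As a concrete illustration, the path-based measures above are available off the shelf, for example in NLTK's WordNet interface. The snippet below is a minimal sketch, assuming the nltk package and its WordNet corpus are installed; the synsets chosen are purely illustrative.

```python
from nltk.corpus import wordnet as wn

cup = wn.synset('cup.n.01')
drink = wn.synset('drink.n.01')

# Rada-style similarity from the shortest is-a path between synsets:
print(cup.path_similarity(drink))
# Wu-Palmer: based on the depth of the least common subsumer:
print(cup.wup_similarity(drink))
# Leacock-Chodorow: path length scaled by the maximum taxonomy depth:
print(cup.lch_similarity(drink))
```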

Semantic similarity based on information content is also an important component of lexical relationship analysis. This approach relies on the hypothesis that the more information two concepts share, the more similar they are. The informativeness of a concept is quantified by the notion of its Information Content (IC), which is calculated from the occurrence probability of the concept in a given corpus. IC is the negative log likelihood of encountering a concept in a given corpus [32]. The intuition behind using negative log likelihood is that the more likely a concept is to appear in a corpus, the less information it conveys.
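
For reference, and following [32], the Information Content of a concept \(c\) with corpus probability \(p(c)\) is:

$$\begin{aligned} \mathrm{IC}(c)=-\log p(c) \end{aligned}$$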

Based on the IC formula, a concept contains less information if the probability of its occurrence in a corpus is high. The advantage of using information content is that, given a properly constructed corpus, it can be adapted to different domains because it is computed statistically from occurrences of the concept, its sub-concepts and its subsumers.

In [33], Resnik applied information content to calculating semantic similarity, using the Most Specific Common Abstraction [\(msca(c_{1},c_{2})\)] to capture the amount of information that concepts \(c_1\) and \(c_2\) have in common. In this approach, only the is-a relationship is used, because only the information of the concept subsuming the two concepts being compared is used. In [38], this similarity measure is also employed by Quigley and Smeaton to compute word–word similarity in image caption retrieval.
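
Written out in the notation above, Resnik's measure is simply the information content of the most specific common abstraction:

$$\begin{aligned} \mathrm{sim}_{\mathrm{Res}}(c_{1},c_{2})=\mathrm{IC}\left(msca(c_{1},c_{2})\right)=-\log p\left(msca(c_{1},c_{2})\right) \end{aligned}$$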

4.2 Contextual ontological similarity and relatedness

WordNet is a small ontology of primarily taxonomic semantic relations. ConceptNet extends WordNet with a richer set of relations appropriate to concept-level nodes [23]. In the version of ConceptNet we use, the relational ontology consists of 20 relation types falling into categories such as K-lines, Things, Agents, Events, Spatial, Causal, Functional and Affective.

In ConceptNet, all concepts are linked via the above-mentioned relations, which can reflect the correlations between concepts. We apply a link-based relatedness measure to exploit the full set of concept relations when measuring concept correlation. This differs from WordNet, which uses mainly taxonomic relationships, whereas ConceptNet employs more contextual relationships. While WordNet similarities consider only subsumption relations to assess how two objects are alike lexically, relatedness takes into account a broader range of relations, which can be measured using ConceptNet.

The relations between concepts reflect the semantic correlation between them. We assume that semantic relations are transitive, so the more related two concepts are, the shorter the path between them. The relatedness between two concepts thus varies inversely with the length of the shortest path between them: conceptual relatedness is a monotonically decreasing function of path distance. Our approach takes into account the lengths of paths between two concepts. In ConceptNet, because the edges between concepts are directional, we combine the path from concept \(c_{1}\) to \(c_{2}\) with the path from \(c_{2}\) to \(c_{1}\). The similarity between two concepts is defined as:

$$\begin{aligned} S_{CN}(c_{1},c_{2})=\max \left(\mathrm{AS}(c_{1},c_{2}),\mathrm{AS}(c_{2},c_{1})\right) \end{aligned}$$
(1)

where \(\mathrm{AS}(c_{1},c_{2})\) represents the activation score of \(c_{2}\) starting from \(c_{1}\), and vice versa. The activation score is computed by spreading activation in ConceptNet to find the concepts most similar to a starting concept. The starting concept is initialized with an activation score of 1.0 and then the nodes connected to the starting concept by one-link paths, two-link paths, etc., are activated. The activation score of a connected node \(b\) with respect to the original node \(a\) is defined as:

$$\begin{aligned} \mathrm{AS}(a,b)=\sum _{c\in \mathrm{Neighbor}(b)}{\mathrm{AS}(a,c)\times d \times w(c,b)} \end{aligned}$$
(2)

where \(d\) is a distance discount (\(d<1\)) which gives concepts far from the original concept a lower weight, and \(w(c,b)\) is the relation weight of the link from \(c\) to \(b\). In this paper, we apply the same relation weight to all links when computing activation scores. For any given concept \(b\), the activation score related to \(a\) is the sum of the scores of all nodes connected to it.
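
A minimal sketch of this spreading activation is given below, assuming a ConceptNet-like graph is available as a plain adjacency map; the hop limit and discount value are illustrative assumptions, not the exact settings used in our experiments.

```python
from collections import defaultdict

def activation_scores(graph, start, d=0.5, max_hops=3):
    """Spread activation outward from `start`, following Eq. (2).

    `graph` maps each concept to a list of (neighbour, relation_weight)
    pairs. A node reached over a k-link path receives a contribution
    discounted by d at every hop, summed over all incoming paths.
    """
    scores = defaultdict(float)
    scores[start] = 1.0
    frontier = {start}
    for _ in range(max_hops):
        reached = set()
        for c in frontier:
            for b, w in graph.get(c, []):
                scores[b] += scores[c] * d * w  # AS(a,b) += AS(a,c)*d*w(c,b)
                reached.add(b)
        frontier = reached
    return scores

def s_cn(graph, c1, c2, **kw):
    """Directional relatedness combined as in Eq. (1)."""
    return max(activation_scores(graph, c1, **kw).get(c2, 0.0),
               activation_scores(graph, c2, **kw).get(c1, 0.0))
```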

5 Concept selection based on semantic density

Our measure of semantic density relies directly on the semantic distance between concepts: if the distance measured between concepts is small, then the concepts have high density. The semantic distance is used as a measure by which concepts are clustered to represent event semantics. In our semantic topic-related concept selection, we treat identifying the similarity between concepts as a linguistic problem. Processing begins with text pre-processing, consisting of tokenization, POS tagging and stopword removal, and is followed by the word similarity and phrase similarity calculations which we now describe.

5.1 Conjunctive concept similarity

In traditional text-based retrieval, a document or query is represented as a vector of term weights which are used in similarity comparison. Such a term vector can be regarded as a new and distinct compound concept: the concept reflected by a document is best described by ANDing the concepts represented by its index terms [30], which allows documents to be treated as conjunctive concepts.

When concepts have several disjunctive meanings in WordNet synsets, we apply the ‘disjunctive minimum’ [30] to obtain the similarity between two concepts. That is, when a concept has alternative synsets because it is polysemous, we take the minimum conceptual distance between its synsets and the other concept as the final distance between the two concepts. Assume that we have two concepts \(c_{1}\) and \(c_{2}\), and \(c_{1}\) has three disjunctive synsets \(\{syn_{1}, syn_{2}, syn_{3}\}\). In terms of the ‘disjunctive minimum’, the conceptual distance between \(c_{1}\) and \(c_{2}\) is given by:

$$\begin{aligned} d(c_{1},c_{2})=\min \left[ d(syn_{1},c_{2}),d(syn_{2},c_{2}),d(syn_{3},c_{2}) \right] \end{aligned}$$
(3)

In calculating conjunctive concept similarity, we take into account all elementary concepts in the conjunctive concept. We regard comparing the similarity of two conjunctive concepts as finding the best assignment in a bipartite graph, where the nodes on each side represent elementary concepts. As with the classic best matching problem, we apply the Hungarian algorithm to decide the maximum similarity matching between the two conjunctive concepts. An alternative to the computationally expensive Hungarian algorithm is to compute conjunctive concept similarity as an average, defined as [38]:

$$\begin{aligned} \mathrm{sim}(c_{1},c_{2})= \frac{1}{M\cdot N}\sum _{i=1}^{M}\sum _{j=1}^{N}\mathrm{sim}(e_{i},f_{j}) \end{aligned}$$
(4)

where \(c_{1}\) and \(c_{2}\) are the compound concepts being compared and \(e_{i}\) and \(f_{j}\) are elementary concepts of \(c_{1}\) and \(c_{2}\), respectively. In this formula, the sum of pairwise elementary concept similarities is normalized by the product of the lengths of the conjunctive concepts to reduce bias from the number of elementary concepts [30]. Other approaches to conjunctive concept similarity calculation can be found in [38].
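
The sketch below illustrates both variants, assuming an arbitrary elementary (word–word) similarity function is supplied; the normalization used for the Hungarian variant is our assumption for illustration, not prescribed by [30] or [38].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def conjunctive_similarity(elems1, elems2, word_sim, method="average"):
    """Phrase-level similarity between two conjunctive concepts.

    `word_sim(e, f)` is any elementary similarity measure (WordNet- or
    ConceptNet-based). "hungarian" solves the bipartite best-matching
    problem; "average" implements Eq. (4).
    """
    S = np.array([[word_sim(e, f) for f in elems2] for e in elems1])
    if method == "hungarian":
        rows, cols = linear_sum_assignment(S, maximize=True)
        # Normalizing by the larger length is an assumption here.
        return S[rows, cols].sum() / max(len(elems1), len(elems2))
    return S.mean()  # Eq. (4): sum normalized by M*N
```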

5.2 Density-based concept selection

In the concept set, each concept represents a semantic entity in the semantic space, and the pairwise relationship between two concepts can be determined by their semantic similarity, represented as an \(n \times n\) symmetric matrix \(M\). The most similar concept group can represent a subspace of the semantic space within which the concepts have high co-occurrence correlations.

Principal Component Analysis (PCA) is a useful tool in pattern recognition for reducing the number of dimensions of high-dimensional spaces without losing much of the information represented by the data. Although PCA can ensure the orthogonality of the bases, the representation of the original data in terms of feature vectors is difficult to interpret and to imbue with semantics, a point made by Wei and Ngo in [42]. In contrast, subsets of concepts which form clusters in the semantic space represent specific domain semantics, and these should be as disjoint as possible if they are to be selected as bases of the semantic space. Therefore, the number of clusters, which is also the number of bases selected by clustering, should be consistent with the number of feature vectors selected by PCA.

We apply PCA to help find the most appropriate number of clusters in density-based concept selection. The total number of clusters is decided by considering both the inconsistency coefficient and PCA. In hierarchical clustering, the inconsistency coefficient is used to decide the appropriate number of clusters in the dendrogram. The inconsistency coefficient compares the height of a link in a cluster hierarchy with the average height of the links below it, and can be used to identify groups of concepts which are densely packed in certain areas of the cluster dendrogram.

To demonstrate how our approach works, we take ConceptNet contextual similarity as an example, as described in Sect. 4.2. Figures 2 and 3 are both generated using the typical concept set (85 concepts) which we investigate in Sect. 7.3. In Fig. 2 (in green), the number of clusters formed when inconsistency values are less than a specified inconsistency coefficient is shown. From PCA, the cumulative energy content of the top \(k\) Eigenvectors is shown in Fig. 2 (in blue). As described above, the number of orthogonal vectors represents disjoint semantics in the semantic space, and we strive to group as many similar concepts as possible. The trade-off between PCA and the inconsistency coefficient is used to find a proper number of clusters for the agglomerative algorithm: as shown in Fig. 2, the intersection of the PCA (blue) and inconsistency coefficient (green) curves is selected to decide the number of clusters. The number of clusters at the trade-off point keeps the cumulative energy above 90% while the inconsistency coefficient remains at a relatively low level.
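
A minimal sketch of this procedure using SciPy's hierarchical clustering follows; treating the eigen-spectrum of the similarity matrix as the PCA energy and converting similarity to distance as \(1-sim\) are simplifying assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_count_tradeoff(sim, thresholds, energy=0.90):
    """Trade off PCA cumulative energy against the inconsistency
    coefficient to choose a cluster count, as in Fig. 2.

    `sim` is the n x n symmetric concept-similarity matrix M. Returns
    the smallest k retaining `energy` of the spectrum, plus the number
    of clusters formed at each candidate inconsistency threshold.
    """
    # PCA side: cumulative energy of the eigen-spectrum (assumes a
    # positive semi-definite similarity matrix).
    eigvals = np.sort(np.linalg.eigvalsh(sim))[::-1]
    cum = np.cumsum(eigvals) / eigvals.sum()
    k_pca = int(np.searchsorted(cum, energy)) + 1

    # Clustering side: agglomerative clustering on 1 - similarity, cut
    # at each inconsistency-coefficient threshold.
    dist = squareform(1.0 - sim, checks=False)
    Z = linkage(dist, method='average')
    counts = [len(set(fcluster(Z, t, criterion='inconsistent')))
              for t in thresholds]
    return k_pca, counts
```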

Fig. 2 Number of clusters

Fig. 3 Relationship dendrogram for concepts

A dendrogram generated by hierarchical clustering is illustrated in Fig. 3. In the dendrogram, semantically related concepts are linked together within a cluster. For example, ‘food’, ‘table’, ‘people’, ‘drink’ and ‘plate’ are grouped together, from which we see that these are most related to the activity ‘eating’ (shown as a dashed circle in Fig. 3). In further examples in Fig. 3, ‘milk’, ‘water’ and ‘cup’ are clustered for ‘drinking’, while ‘sky’, ‘path’, ‘tree’, ‘road sign’ and ‘road’ are clustered for ‘walking’. This semantic clustering facilitates the selection of topic-related concepts: since a given topic is itself a concept, it is also an instance which can be clustered in the concept space, and the concepts within the same cluster as a given topic become the candidate concepts.
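
Concretely, once topics are clustered alongside concepts, candidate selection reduces to a one-line filter (a sketch; `labels` is assumed to map each clustered item, concepts and topics alike, to its cluster id, e.g. from the `fcluster` call above).

```python
def candidate_concepts(labels, concepts, topic):
    """Concepts sharing the topic's cluster become the candidates."""
    return [c for c in concepts if labels[c] == labels[topic]]
```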

6 Leveraging similarity for concept ranking

In the previous section, we described a method for selecting candidate concepts by clustering concepts in a semantic space. Although the selected concepts have high correlation with the given activity topic, other related concepts may still be missing, because the clustering algorithm only considers local distance in the semantic space. Since the selected concepts have high semantic correlation with the given topic, they can be used as seeds for finding other related concepts. To leverage concept similarity from a global view, we employ the random walk model.

6.1 Concept similarity model

Random walk is a widely used algorithm which uses the links in a network to calculate global importance scores for the objects connected in the network. It allows us to compute the probability of a random walker being located at each vertex, modelled as a discrete Markov chain characterized by a transition probability matrix. We model concept similarity as a graph \(G=(C,E)\), where \(C\) is the concept set and \(E\) is a set of edges that link concepts. Each edge is assigned a similarity value describing the probability that a random walker jumps between the concepts. As shown in Fig. 4, concept sets and given topics can both be viewed as vertices in the graph, connected by similarity links. In the previous section, the concepts most similar to the given topic were selected as candidates, shown as the shaded concepts in Fig. 4. However, concepts which are similar to candidate concepts but have no direct similarity link with the given topic are ignored there. The random walk model ranks the concepts from a global similarity view, using the candidate concepts as seeds.

Fig. 4 Concept similarity link

6.2 Similarity rank

Here, we consider the process as a Markov chain in which the states are concepts and the transitions are similarity links between them. A random walker starts with a prior probability and surfs the graph, following similarity links. The similarity random walk is based on the mutual reinforcement of concepts, that is, the score of a concept relative to a given topic influences, and is influenced by, the scores of other concepts. We formulate the calculation of the score for \(c_{i}\) as:

$$\begin{aligned} x(c_{i})=\sum _{j=1}^{n} \mathrm{Sim}_{ij}\; x(c_{j}) \end{aligned}$$
(5)

where \(\mathrm{Sim}_{ij}\) is a normalized similarity value between \(c_{i}\) and \(c_{j}\). Following the PageRank algorithm, we update the scores of concepts by:

$$\begin{aligned} \left( \begin{array}{c} x_1\\ \vdots \\ x_n \end{array} \right)= \alpha \left( \begin{array}{ccc} \mathrm{Sim}_{11}&\ldots&\mathrm{Sim}_{1n}\\ \vdots&\ddots&\vdots \\ \mathrm{Sim}_{n1}&\ldots&\mathrm{Sim}_{nn} \end{array} \right) \left( \begin{array}{c} x_1\\ \vdots \\ x_n \end{array} \right)+(1-\alpha ) \left( \begin{array}{c} d_1\\ \vdots \\ d_n \end{array} \right) \end{aligned}$$
(6)

where \((d_1 \cdots d_n)^\mathrm{T}\) is a prior score vector and \(\alpha\) is a decay factor. The equation can be written in compact matrix form as:

$$\begin{aligned} \mathbf{x} =\alpha \mathbf{T} \mathbf{x} +(1-\alpha )\mathbf{d} \end{aligned}$$
(7)

In this formula, \(\mathbf{x}\) is the score vector and \(\mathbf{T}\) is the similarity matrix with the sum of each column normalized to 1. For each concept \(c_{i}\), the score is \(x_{i}=\sum\nolimits _{j=1}^{n} \alpha \cdot \mathrm{Sim}_{ij}x_{j}+(1-\alpha ) \cdot d_{i}\). To solve (7), we convert it to:

$$\begin{aligned} \mathbf{x} =\alpha \left(\mathbf{T} +\frac{1-\alpha }{\alpha }\,\mathbf{d} \,\mathbf{1} \right)\mathbf{x} \end{aligned}$$
(8)

If we let \(\mathbf{A} =\alpha (\mathbf{T} +(1-\alpha )/\alpha \cdot \mathbf{d} \cdot \mathbf{1} )\), where \(\mathbf{1}\) is a row vector of ones, then \(\mathbf{x}\) is an eigenvector of \(\mathbf{A}\). Although this leads to a direct solution, the iterative calculation converges quickly enough and is what is usually employed. In our work, the iteration starts with \(\mathbf{x}\) initialized as \(\mathbf{0}\).
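
A minimal NumPy sketch of this iteration is given below; the decay factor value and the convergence tolerance are illustrative assumptions.

```python
import numpy as np

def similarity_rank(sim, seeds, alpha=0.85, tol=1e-9, max_iter=200):
    """Iteratively solve Eq. (7): x = alpha * T x + (1 - alpha) * d.

    `sim` is the pairwise concept-similarity matrix; `seeds` holds the
    prior scores d (e.g. 1.0 for candidate concepts selected by
    clustering, 0.0 elsewhere).
    """
    T = sim / sim.sum(axis=0, keepdims=True)  # column sums normalized to 1
    d = seeds / seeds.sum()
    x = np.zeros_like(d, dtype=float)         # iteration starts from 0
    for _ in range(max_iter):
        x_next = alpha * (T @ x) + (1 - alpha) * d
        if np.abs(x_next - x).sum() < tol:
            break
        x = x_next
    return x
```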

7 Experiments and evaluation

7.1 User experiment

Our experiments start with a user investigation to determine a set of possible concepts for interpreting lifelog events. Respondents were chosen from among the researchers in our own lab, most of whom work in computer science and some of whom also log their everyday lives with the SenseCam, so the group are sympathetic to, and familiar with, the idea of indexing visual content by semantic concepts. In total, 13 respondents took part in our initial user experiment, 9 male and 4 female, all aged between 20 and 40 years. About half of the participants (7 of 13) are in the 26–30 age group, while 3 are in the 21–25 group and another 3 are over 30. Eight participants are familiar with the SenseCam and have worn it for various periods. Five participants carry out research using the SenseCam and are engaged in tasks such as visualization, concept detection and medical therapy. The demographic information for our participants is shown in Table 2.

Table 2 Demographic information of participants

Participants were shown SenseCam images for samples of activities and were then surveyed on their understanding of images of SenseCam activity as well as the concepts occurring in those SenseCam images. The experiment was organized into three phases, namely study, pooling and rating. In the study phase, target activities were first described to respondents to make them familiar with the activity concepts. Exemplar image streams for each activity were shown to the group, who were then asked to inspect the SenseCam images. In the pooling phase, participants were asked to go through images collected individually and to list the possible concepts they thought might be helpful in retrieving the activities; the aim of this phase is to determine a large concept set that might be helpful in analyzing SenseCam images in order to detect activities. In the final rating phase, the number of subjects who thought a concept was relevant to the given activity is calculated, for all target activities. The more “votes” a concept gets, and the greater the agreement among all subjects, the more importance we give to the concept for that activity.

In the pooling stage, subjects were asked to list as many concepts as possible associated with each event topic while inspecting SenseCam images through controlled browsing. Note that the pooled event topics are all from the everyday activities we investigated in Sect. 3.1, as shown in Table 1. To provide cues for participants in finding appropriate concepts, SenseCam images depicting the different activities were shown. In our later experiment on evaluating concept selection in Sect. 7.4, we use the concept set obtained from this user experiment, which contains 171 concepts in total.

7.2 Experimental evaluation: methodology

To evaluate our concept selection algorithm, the user experiment acts as the “oracle” result. From the user experiment, the ranked concepts are analyzed to determine the set of unanimously agreed concepts used for the evaluation. Benchmarks are introduced to evaluate the algorithms from different performance points of view: group consistency, set agreement and rank correlation [18].

In order to assess the generated clustering, we define group consistency to measure the degree to which semantically related concepts are clustered together. When two related concepts are grouped in the same cluster by our algorithm, this should make a positive contribution to the overall consistency value; otherwise, a negative contribution is made. Determining whether two concepts should be grouped together is a subjective decision, hence the results of the human experiments are used as the oracle for evaluation. We formalize the notion of human judgement on concept grouping as a binary function \(O\):

$$\begin{aligned} O(c_{i},c_{j}) = \left\{ \begin{array}{l@{\quad }l} 1&\text{if } c_{i} \text{ and } c_{j} \text{ are under the same topic}\\ 0&\text{if } c_{i} \text{ and } c_{j} \text{ are not under the same topic}\\ \end{array} \right. \end{aligned}$$
(9)

Similarly, we define another binary function \(G\) to reflect the grouping result of two concepts by clustering as:

$$\begin{aligned} G(c_{i},c_{j}) = \left\{ \begin{array}{l@{\quad }l} 1&\text{if } c_{i} \text{ and } c_{j} \text{ are in the same cluster}\\ 0&\text{if } c_{i} \text{ and } c_{j} \text{ are not in the same cluster}\\ \end{array} \right. \end{aligned}$$
(10)

Note that these two binary functions are both symmetric, which means \(O(c_{i},c_{j})=O(c_{j},c_{i})\) and \(G(c_{i},c_{j})= G(c_{j},c_{i})\). Generating a set \(\mathcal{C}\) of ordered pairs \(\mathcal{C} =\{(c_{i},c_{j}), 1 \le i,j\le |C| , i\ne j \}\) from the concept set \(C\), the overall group consistency for \(C\) is defined in terms of these two functions as:

$$\begin{aligned} \mathrm{GC}=\frac{|\mathcal{C} |-\sum _{(c_{i},c_{j})\in \mathcal{C} }{\mathrm{IC}(O,G,c_{i},c_{j})}}{|\mathcal{C} |} \end{aligned}$$
(11)

where

$$\begin{aligned} \mathrm{IC}(O,G,c_{i},c_{j})=\left\{ \begin{array}{l@{\quad }l} 1&\text{if } O(c_{i},c_{j})\ne G(c_{i},c_{j})\\ 0&\text{if } O(c_{i},c_{j})=G(c_{i},c_{j})\\ \end{array} \right. \end{aligned}$$
(12)

Group consistency reflects the performance of similarity-based clustering in terms of pairwise grouping. The ratio is computed as the fraction of pairs for which the semantic clustering algorithm gives the same output as the user experiment. If there are no cases in which semantic clustering mis-groups a concept pair, \(GC\) is equal to 1. Conversely, \(GC\) is equal to 0 when no concept pairs are correctly grouped.
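
Computed directly from the definitions in Eqs. (9)–(12), group consistency is a short function (a sketch; `oracle` and `clusters` are assumed to map each concept to its topic and cluster id, respectively).

```python
from itertools import combinations

def group_consistency(oracle, clusters, concepts):
    """Eq. (11): fraction of concept pairs on which the clustering
    agrees with the user-experiment oracle about co-grouping."""
    pairs = list(combinations(concepts, 2))
    inconsistent = sum(
        (oracle[a] == oracle[b]) != (clusters[a] == clusters[b])
        for a, b in pairs)
    return (len(pairs) - inconsistent) / len(pairs)
```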

Set agreement is used to compare two concept sets without considering ranking. It is defined as the positive proportion of specific agreement between two sets [18]. The set agreement score is equal to 1 when the two sets are identical (\(C_{1}=C_{2}\)) and 0 when they are disjoint (\(C_{1} \cap C_{2}=\emptyset\)).

Rank correlation is used to study the relationship between different rankings of the same concept set. We employ Spearman's rank correlation coefficient to measure the final output. By definition, the score is equal to 1 when the two rankings are identical, and -1 when one ranking is the reverse of the other.
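
The sketch below shows both metrics; instantiating set agreement as the Dice coefficient is our assumption for illustration (it satisfies the two boundary conditions above), and Spearman's coefficient comes from SciPy.

```python
from scipy.stats import spearmanr

def set_agreement(c1, c2):
    """Positive specific agreement between two concept sets, realized
    here (an assumption) as the Dice coefficient: 1 when c1 == c2,
    0 when the sets are disjoint."""
    c1, c2 = set(c1), set(c2)
    return 2 * len(c1 & c2) / (len(c1) + len(c2))

# Spearman rank correlation between two rankings of the same concepts:
rho, _ = spearmanr([1, 2, 3, 4, 5], [2, 1, 3, 5, 4])
```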

7.3 Evaluation setup

We recruited 13 people for our user experiment on concept recommendation. Diverse concepts were suggested by our subjects, as shown in Fig. 5, which shows that the number of concepts increases significantly as the level of agreement drops, from 13 votes down to 2 votes. We ignore concepts with only 1 vote, because a single subject's suggestion means very little in terms of a common understanding of concept selection.

Fig. 5 Concept number versus agreement

Fig. 6 Distribution of concepts

We initially concentrate on a smaller concept set in which concepts are selected with \(agreement\ge 50\%\). When too few concepts are selected for a topic, further concepts with lower agreement are also selected so that each topic has at least 5 concepts. This concept set contains a total of 85 concepts.

To test the robustness of the different similarity measures used in our density-based concept selection, we also carried out experiments on a larger concept set with less agreement among users (\(vote\ge 2\)), forming a broader set of 171 concepts. The distribution of all 171 concepts across activities is depicted in Fig. 6, with most activities having between 10 and 20 concepts and an overall average of 15. Among all activities, ‘Cooking’ has the most relevant concepts selected, since it involves more visual concepts that help to identify the activity, such as various kitchen items and specific foods. Activities like ‘Using phone’, ‘Reading’, ‘Pet care’ and ‘Going to cinema’ tend to have relatively similar images within a single event sample, and therefore have fewer concepts recommended.

To measure semantic similarity, we employed both taxonomic similarity and contextual similarity as discussed in Sect. 4, using the ontologies of WordNet and ConceptNet, respectively. For taxonomic similarity, we also compared five mainstream similarity measures, those of \(Wu\) and \(Palmer\), \(Leacock\) and \(Chodorow\), \(Resnik\), \(Jiang\) and \(Conrath\), and \(Lin\), all of which were introduced and described earlier. Contextual similarity is obtained by spreading activation through ConceptNet links. After normalization by textual processing, the word–word semantic similarity is first calculated and then combined to get phrase-level similarity for conjunctive concepts composed of multiple words.

Concept–concept similarity and topic–concept similarity are both used in our density-based concept selection algorithm to cluster the most similar concepts into the same clusters as the corresponding event topics. The output concepts from hierarchical clustering are first analyzed to show the diversity of the resulting concepts across the different semantic similarity measures. The average number of concepts selected per event topic is depicted in Fig. 7. Though there is not much difference in the average number of concepts per topic, \(Lin\) selected the most concepts (5.0) compared to \(Jiang\) and \(Conrath\) and \(ConceptNet\), which select 2.6 and 2.5 concepts per topic, respectively.

7.4 Result of evaluation

Our results are assessed to compare the performance of the two ontologies, WordNet and ConceptNet, for semantic density-based concept selection in the lifelogging domain. Our density-based concept selection and re-ranking algorithm involves several steps including similarity calculation, agglomerative clustering and similarity ranking, so we evaluate our results in multiple ways.

7.4.1 Evaluating the clustering algorithm

We first apply clustering to group semantically related concepts based on similarity measurement. Group consistency is calculated for each ontology to assess the clustering performance of our agglomerative algorithm in capturing the semantic relationships in everyday life events. A comparison of all the ontological measures referred to above is shown in Fig. 8.

Fig. 7 Concept number per topic

Fig. 8 Group consistency comparison

The assessment is first carried out on the small concept set (85 concepts), shown by the blue bars in Fig. 8, and as we can see, ConceptNet-based similarity shows more consistency than the other similarity measures. Since the same concept set and agglomerative clustering algorithm are used throughout, this indicates that the similarity values returned by our spreading activation on ConceptNet do reflect the semantics of everyday activities. We then increased the testing concept set to the larger set (171 concepts), shown by the red bars, and found that \(ConceptNet\) still outperforms the other similarity sources.

On the larger concept set, the semantic similarity calculation is also performed first for the 171 concepts and topics, followed by the hierarchical clustering algorithm. Output concepts are compared on a per-topic basis against the ground truth from the user experiment. The comparison is done using Set Agreement and Rank Correlation to evaluate the performance of the different similarity measures. Because topics are not uniform in assessing performance, we do not average results over all topics. The results of our density-based everyday concept selection are presented in Figs. 9 and 10. The performance of the similarity measures is compared in Fig. 9 using Set Agreement, where we find ConceptNet-based concept selection has the highest median value and better quartile scores than the WordNet-based measures. Among the WordNet-based similarities, \(Leacock\) performs best on Set Agreement but shows no advantage on Rank Correlation, as shown in Fig. 10.

Fig. 9 Comparison based on set agreement

ConceptNet-based concept selection results in the highest median and quartile scores on Rank Correlation, while \(Jiang\) has the best performance among the WordNet-based similarities but is still outperformed by \(ConceptNet\). We conclude that ConceptNet-based similarity performs best not only on the concepts selected (as implied by Set Agreement), but also on the ranking of these concepts (as implied by Rank Correlation). The contextual ontology is thus more suitable for everyday concept selection in the lifelogging domain.

Fig. 10 Comparison of rank correlation

Fig. 11 Comparison of pairwise orderedness (small set)

7.4.2 Similarity ranking assessment

Similar to group consistency, we define pairwise orderedness to evaluate the ranking performance of our algorithm, via the following formula:

$$\begin{aligned} \mathrm{PO}=\frac{|\mathcal{C} |-\sum _{(c_{i},c_{j})\in \mathcal{C} }{\mathrm{IC}(O,R,c_{i},c_{j})}}{|\mathcal{C} |} \end{aligned}$$
(13)

where

$$\begin{aligned} \mathrm{IC}(O,R,c_{i},c_{j})=\left\{ \begin{array}{l@{\quad }l} 1&\text{if } R(c_{i})\ge R(c_{j})\ \text{and}\ O(c_{i})<O(c_{j})\\ 1&\text{if } R(c_{i})\le R(c_{j})\ \text{and}\ O(c_{i})>O(c_{j})\\ 0&\text{otherwise}\\ \end{array} \right. \end{aligned}$$
(14)

where \(O(c) = 1\) if concept \(c\) is selected as a ground-truth concept in the user experiment and \(O(c) = 0\) otherwise, and \(R(c)\) is the final score for concept \(c\) returned by the similarity ranking.
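
As with group consistency, pairwise orderedness follows directly from the definition (a sketch; `R` and `O` are assumed to map concepts to rank scores and ground-truth flags).

```python
from itertools import combinations

def pairwise_orderedness(R, O, concepts):
    """Eqs. (13)-(14): penalize pairs whose similarity-rank order
    contradicts the ground-truth selection."""
    pairs = list(combinations(concepts, 2))
    bad = sum((R[a] >= R[b] and O[a] < O[b]) or
              (R[a] <= R[b] and O[a] > O[b])
              for a, b in pairs)
    return (len(pairs) - bad) / len(pairs)
```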

A comparison of ontology similarities using pairwise orderedness on the small concept set (85 concepts) is shown in Fig. 11. \(ConceptNet\) similarity outperforms the other measures in most cases, with the curve of \(ConceptNet\) (CN) above all the other curves (activities before ‘cook’). There are only four cases in which \(ConceptNet\) performs worse than the WordNet-based similarity measures, namely ‘cook’, ‘listen to presentation’, ‘general shopping’ and ‘presentation’. We also analyzed the poor performance of \(ConceptNet\) on these activity types. For ‘listen to presentation’ and ‘presentation’, \(ConceptNet\) did not perform well due to the lack of contextual information for the concept ‘presentation’. Looking at the ontology structure of ConceptNet, we find only two concepts contextually connected to ‘presentation’ with high correlation, ‘fail to get information across’ and ‘at conference’, connected via the relationships ‘CapableOf’ and ‘LocationOf’, respectively. It is therefore hard to assign related concepts in our concept set a high similarity weight. In our experiment, ‘general shopping’ is a very general topic for which even humans find it hard to decide the most related concepts.

An evaluation of pairwise orderedness was also carried out on the larger 171-concept set. The comparison shows a similar result to Fig. 11: ConceptNet-based semantic similarity still performs better than the other similarity measures in most cases. In only three cases, ‘cook’, ‘presentation’ and ‘general shopping’, does \(ConceptNet\) not perform as well as the WordNet-based similarities. The reasons for poor performance are the same as for the small concept set. Note that the ‘cook’ topic involves many procedures such as ‘washing’, ‘peeling potatoes’ and ‘stir frying’, to name a few. This contextual diversity also makes it difficult for \(ConceptNet\) to return the contextual similarity correctly.

Ranked concepts based on semantic similarity are also compared using the metrics of Set Agreement and Rank Correlation. To simplify the comparison, we perform an evaluation on the smaller concept set with the selection of the top-5 and top-10 concepts returned by the similarity rank algorithm. The performance of the different semantic similarity measures is shown in Figs. 12 and 13, respectively.

Fig. 12 Comparison for top-5 ranked concepts (smaller concept set)

As we can see from Fig. 12, the advantage of using \(ConceptNet\) becomes more obvious as we select more concepts after similarity ranking, compared to the very few concept seeds selected by clustering. In Fig. 12, the ConceptNet-based algorithm outperforms the others not only on Set Agreement but also on Rank Correlation. The advantages of \(ConceptNet\) when the top-10 concepts are selected, depicted in Fig. 13, show the robustness of our similarity rank algorithm, which propagates scores through the similarity network and gives higher weights to more relevant concepts based on the seeds selected by the clustering algorithm.

Fig. 13 Comparison for top-10 ranked concepts (smaller concept set)

8 Conclusion

This paper investigates the digital recording of everyday activities, known as visual lifelogging, and elaborates the selection of target activities for semantic analysis. A density-based approach to selecting semantic concepts is introduced to exploit and leverage concept similarity as reasoned from underlying ontologies. Suggested concepts are then re-ranked, with the candidate concepts selected by agglomerative clustering used as seeds. Semantic reasoning over prevalent lexical and contextual ontologies is also discussed.

The efficacy of our concept selection algorithm is shown in the way we select and rank concepts used to represent everyday lifelogging activities from a global view. We conclude that the quality of candidate concepts selected by clustering depends on grouping consistency: similarity measures which correctly reflect the semantic relationships between concepts obtain better group consistency, as demonstrated by the good performance of \(ConceptNet\) similarity in lifelogging. The strong performance of contextual similarity obtained from spreading activation on ConceptNet shows that contextual similarity is more suitable for reflecting the semantics of everyday concepts: it better reflects the relationships between everyday activities and concepts because these are contextually relevant in the lifelogging domain.