1 Introduction

Data play a central role in everyday activities and represent one of the most valuable assets of modern societies. Large amounts of data are produced every day in different formats by humans and automatic agents, and efficient and effective techniques are needed to represent and manage them. New challenges have arisen from this plethora of data, and many efforts have been devoted to defining and implementing novel approaches to share and represent data using common formats [1].

In this context, ontologies and the related Semantic Web technologies are crucial for the transformation of human-readable into machine-readable data, enabling truly advanced information processing and management techniques. Different definitions of the term ontology have been proposed over the years. In our vision, one of the most complete is presented in [2]: an ontology is a conceptual model of a specified reality; it has explicit definitions of its components; it is formally defined and therefore “machine-readable”; and it is among the definitions most widely accepted by the scientific community. Ontology construction, enrichment, and adaptation tasks form a wider process called Ontology Learning [3]. In particular, ontology enrichment is the task of extending an existing ontology with new concepts and relations, while the ontology population task adds new instances of concepts to the ontology. In this paper, we focus our attention on the ontology population task. The use of an ontology has several advantages in different fields [4], and it can be used to reduce the semantic gap [5, 6]. In particular, if we consider multimedia data, an ontology-based model can be designed to relate low-level features to high-level concepts [7]. Due to the high number of possible entities and properties involved in representing a knowledge domain, the ontology building process is very expensive and time-consuming. Moreover, keeping an ontology instance up to date requires frequent modifications. Therefore, one goal is to automate this process.

In our approach, ontology evolution consists of extracting informative content from different data sources and populating an a priori defined ontology schema. In our context, we consider a multimedia ontology schema and use low-level feature extraction techniques and semantic concept analysis to implement the ontology population process. Following the Ontology Learning Layer Cake [8], the ontology learning process involves different tasks. In particular, the tasks concerning the ontology population process are related to object identification, through which an object extracted from a corpus can be associated with a single concept in the domain of interest. If we analyse textual data, this task consists in the recognition of terms or groups of words; if we analyse visual data, it consists in identifying particular areas of the considered images that represent objects for the ontology population.

The object identification task is divided into the following sub-tasks: (i) Object Recognition, used to find recognizable entities in a corpus or in visual data; (ii) Object Classification, related to the identification of specified categories (i.e. ontology concepts) assigned to the identified objects; (iii) Object Mapping, to compute the similarity between contents in different data sources; (iv) Synonym Identification, which refers to the recognition of different representations of the same concept and vice versa; (v) Concept Identification and Data Association, which assigns a set of instances to a concept in the ontology (i.e. the ontology population step); (vi) Entity Disambiguation, the process of identifying instances that refer to different concepts. We explicitly point out that we extend the model described so far by considering multimodal representations (i.e. visual information). Ontology population deals with adding new instances of concepts to an ontology without changing its structure; therefore, after the population process, neither the hierarchy of concepts nor the non-taxonomic relations change. The population process requires an existing ontology and an extraction engine that processes data, identifying objects associated with concepts. Several surveys have been proposed in the context of ontology learning and population and, due to its essential applications in various fields, it is a very prolific research field [9].

This paper proposes a novel approach for ontology population using different visual descriptors and semantic analysis. The process is fully automated, and it has been implemented using hybrid big data technologies.

The paper is organised as follows: in Sect. 2, we introduce and discuss the existing literature; Sect. 3 is devoted to the presentation of the proposed approach and its implementation, together with the description of the multimedia ontology model and the data sources used; the experimental strategy and the obtained results are shown in Sect. 4 and, eventually, conclusions and future work are drawn in Sect. 5.

2 Related works

This section presents and discusses some of the main works in the literature related to our approach from an application point of view (i.e. ontology population) and a methodological one (i.e. data sources and analysis techniques).

One of the main projects in our field of interest is BOEMIE [10]. BOEMIE adopts a synergistic approach that combines multimedia extraction and ontology evolution in a bootstrapping process. Its aim is to improve the acquisition of multimedia content using ontologies: the extracted content is used to enrich the ontologies, which in turn drive the multimedia information extraction task. The system uses a semi-automatic approach to control the annotation task. An approach to populate an image ontology based on textual representation is presented in [11]. It combines a segmentation process and a ranking function based on colour distribution. A web search engine fetches the images considered for the population process, and a waterfall segmentation algorithm is applied for object recognition. The textual information is associated with single detected regions. The obtained regions are clustered through an unsupervised algorithm using texture and colour features, and concept identification is performed by assigning a score to each cluster. An approach for ontology population using linked open data is proposed in [12]. It is based on a domain-specific ontology and the DBpedia ontology structure. The system consists of three steps: images for the ontology population, together with textual information, are retrieved from the Web; each image is segmented, and the textual information is associated with each region through the LabelMe annotation tool [13]; concepts are then associated with images using colour and texture information. Moreover, concept descriptions are fetched from DBpedia, and low-level features are extracted from the images. Each image and its candidate concept are shown to a group of users, each of whom gives a score to the image; the concept with the highest score is assigned to the considered image. PROPheT [14] is a software tool for the population and enrichment of a local ontology model with instances fetched from several Linked Data sources served by SPARQL endpoints, such as DBpedia. In the realm of big data, the Bishop project [15] presents methods for ontology population based on big-data-driven, large-scale self-learning support. A conceptual framework is derived from the user requirements to capture the integration schema for the self-learning methods in the proposed approach. Different big data frameworks are taken into account to measure the performance of the data analysis step, and the results are compared using test scenarios.

A system to extract information from heterogeneous sources is presented in [16]. It populates an ontological knowledge base using a rule-based information extraction approach to recognise named entities, which are added to semantic structures using declarative mapping rules. OntoPRiMa [17] is a system for semi-automatic ontology population based on text analysis, combining NLP techniques, semantic approaches and web technologies. It uses an automatic weighting schema to evaluate and support the decision process in the ontology population. Specific knowledge domains have also been investigated for ontology population. In [18] a method for ontology population in the domain of cultural heritage is proposed. It is based on an automatic approach to extract instances from semi-structured corpora with the support of manually defined extraction patterns. In [19] a tool for extracting semantic information from CAD drawings and populating an ontology is presented. The drawing primitives are considered to perform pattern matching, and classification algorithms extract the semantic information. The resulting information is mapped to the corresponding ontology classes, and related individuals are created to populate the ontology. The authors in [20] present a system able to extract web document contents to populate an ontology related to the tourism domain. Their methodology is based on a two-stage approach: first, the system obtains a set of semantic annotations considered as candidate ontology instances; in the second stage, semantic ambiguities are detected and resolved to relate the annotations to the right concept in the ontology.

Our proposed approach differs from those discussed so far in terms of automation degree, and we combine different data sources using both textual and visual information. The aim of the implemented system is ontology population using multiple informative sources analysed through semantic and deep learning techniques. Different techniques have been proposed in the literature [21,22,23,24,25,26] and, over the years, the use of deep convolutional neural networks as feature extractors has surpassed the state of the art [27,28,29]. The early layers of a CNN capture lower-level features, while the deeper layers capture higher-level concepts [30]. This suggests that lower-level features are suited to fine-grained tasks, while higher-level features are suited to semantic similarity. Lower-level features, however, suffer from two fundamental problems: semantic ambiguity [31] and background clutter [32]. In the literature, some approaches suggest combining the features extracted from multiple levels: in this way their complementarity is exploited, and semantic and structural features are combined. However, it has been shown that the direct concatenation of features coming from several layers does not yield good results, not only because of the high dimensionality but also because the strong influence of high-level features on low-level features weakens the effectiveness of the latter. To address these issues, an approach is proposed in [33] that identifies the semantic category through descriptors from the last convolutional layer and then ranks images using features extracted from the lower and intermediate levels. In [34], instead, the concept of hypercolumn is explored: it considers the output, at a given location, of all the units above that location, the rationale being that information is distributed across all network layers. The combination of multiple levels, as in the approaches just presented, leads to an excessively large descriptor; in this case, it would be necessary to apply an embedding technique to avoid overfitting [35]. Another important aspect is the choice of aggregation methods for local descriptors. Several aggregation methodologies have been presented in the literature. One of the most used is based on Bag of Visual Words [36]: the idea is to quantise local invariant descriptors into a set of visual words; the frequency vector of the visual words represents the image, and an inverted index file is used for efficient comparison of such Bag-of-Visual-Words representations. Another methodology, the Fisher Vector [37], transforms an incoming variable-size set of independent samples into a fixed-size vector representation, assuming that the samples follow a parametric generative model estimated on a training set. This description vector is the gradient of the sample log-likelihood with respect to the parameters of this distribution, scaled by the inverse square root of the Fisher information matrix. The VLAD representation [38] can be seen as a simplification of the Fisher kernel: it aggregates descriptors based on a locality criterion in the feature space. Another aggregation method is triangulation embedding [39], which is based on an anchor graph and performs a triangulation operation. A further novelty of our approach is the use of more fine-grained tasks in the ontology population process, because the ontology is built hierarchically, starting from high-level semantic concepts down to their specialisations.
Our goal is to identify, for each input image, the related semantic category for concept validation and its subcategory for concept identification.

3 The proposed approach

In this section, we describe all aspects of our approach for ontology population. The proposed methodology is based on Natural Language Processing (NLP) and deep learning techniques. The section describes the system architecture, highlighting the main task of each module. The architecture is summarized in Fig. 1.

Fig. 1

System architecture

The ontology population starts with an existing ontology. The ontology that we want to populate is ImageNet [40], and it represents our knowledge base. It organizes images into different classes within a densely populated semantic hierarchy. ImageNet follows the same structure as WordNet: each node represents a synset, i.e. a group of synonymous words that express the same concept, and representative images are associated with each synset. In our approach, ImageNet has been represented as an OWL multimedia ontology following an ontology-based model [41]. ImageNet has been imported into two different NoSQL databases following the methodology and model proposed in [7, 42,43,44]: we chose the Neo4j graph DB to store the ontology structure and the MongoDB document-oriented DB to store images as feature vectors. The main problem of ImageNet is that not all synsets are populated. For this reason, ImageNet can be seen as a thinly populated multimedia ontology, and we can use it as a real-world example for our population framework. We perform an analysis of ImageNet to find unpopulated synsets: for each root node of each ImageNet sub-tree, we use the ImageNet API to fetch the number of images of each synset. In this way, we can know which nodes are empty and which are only partially populated. The first step is data source identification. We retrieve data from two different types of data sources, using both an image search engine (i.e. Google Images) and an image dataset (i.e. the COCO data set [45]). Our goal is to integrate data sources with different representations [46, 47] to accomplish a multi-modal image ontology population system using both visual and textual analysis, combining different techniques.
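To make the storage model concrete, the following sketch shows how a synset node with its hypernym relation could be written to Neo4j and how an image feature vector could be stored in MongoDB using the official Python drivers. This is our own illustration, not the paper's original implementation: the connection settings, the IS_A relationship name, the features collection, and the example identifiers are all assumptions.

from neo4j import GraphDatabase
from pymongo import MongoClient

# Hypothetical connection settings; adjust to the actual deployment.
graph = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
mongo = MongoClient("mongodb://localhost:27017")["imagenet"]

def store_synset(wnid, lemma, hypernym_wnid):
    # Create the synset node and link it to its hypernym in the graph DB.
    with graph.session() as session:
        session.run(
            "MERGE (s:Synset {wnid: $wnid}) SET s.lemma = $lemma "
            "MERGE (h:Synset {wnid: $hyper}) "
            "MERGE (s)-[:IS_A]->(h)",
            wnid=wnid, lemma=lemma, hyper=hypernym_wnid,
        )

def store_image_descriptor(wnid, image_id, descriptor):
    # Store the compact feature vector of an image in the document-oriented DB.
    mongo.features.insert_one(
        {"wnid": wnid, "image_id": image_id, "descriptor": [float(v) for v in descriptor]}
    )

# Example usage with placeholder identifiers.
store_synset("n00000001", "dog", "n00000002")
store_image_descriptor("n00000001", "dog_0001.jpg", [0.12, 0.80, 0.05])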

After the data source identification, we have to validate the detected concepts and associate images with the concepts of our ontology. The concept validation and association are obtained through the feature extraction module; a thesaurus is then used to support concept identification, so that textual and visual information are combined through semantic analysis. In the case of images retrieved from Google, the concept to be validated is the one used in the query formulation. In the case of COCO images, instead, we perform an ontology alignment operation to associate COCO categories with ontology concepts. The association of visual object representations with concepts is computed through a feature extraction process; after the concept association, we proceed with the ontology population. The feature extraction module is used to perform the concept association and validation. The feature extraction engine is based on a hierarchical approach, because our aim is to distinguish between a high-level category (i.e. root node) and a low-level one, represented by a child node, for each sub-tree to populate. This approach is implemented in two steps: in the first step, validation is performed by hypernym identification through a global descriptor; the second step associates concepts through hyponym identification. The association of low-level features with high-level concepts occurs through a semantic interpretation module based on reasoning techniques. At both levels, the extracted descriptors are aggregated into a global and compact feature vector representation, and feature matching is performed through an image similarity measure. After the concept validation task, we finally carry out the ontology population step. The following subsections explain our framework in detail.

Fig. 2

Feature extraction engine

3.1 Hierarchical deep feature extraction module

The proposed feature extraction engine works hierarchically. The process is shown in Fig. 2. We consider the hypernym at the higher level and all hyponyms of the considered synset at the same hierarchical level. Therefore, we have a two-level concept hierarchy based on the ImageNet structure. At the first level of the hierarchy, hypernym identification takes place: we perform concept validation by verifying that the input image concept belongs to the correct semantic category. An example of hypernym identification is shown in Fig. 3, where each node label is composed of a lemma, a part of speech, and a number representing a rank value obtained by sorting the concepts by the frequency of their association with the lemma. The colour of a label depends on the hyponym level, while its size depends on the edge degree.

Fig. 3

An example of dog synset hyponyms extraction

Both local and global descriptors are extracted using a convolutional neural network, due to the performance of these architectures [30, 48]. A global descriptor is extracted from the last convolutional layer of the DNN, while a local descriptor is obtained by concatenating the feature map values at the same spatial position. The fine-grained similarity is obtained by applying a feature selection mechanism that allows automatic object recognition: selecting a representative subset of local convolutional features removes a large number of redundant features, improving the object recognition process. We use the Selective Convolutional Descriptor Aggregation for Fine-Grained Image Retrieval (SCDA) method [49] to extract the local features. The method is based on a pre-trained VGG16 model to perform a fine-grained image retrieval task. VGG is a CNN (Convolutional Neural Network) architecture proposed in [50]; the most commonly used variant is VGG16, which consists of 16 weight layers (13 convolutional layers with 3 \(\times \) 3 kernels and 3 fully connected layers). The SCDA approach is summarized in Fig. 4 [49]. Following this approach, we change the mask map using more accurate techniques, as detailed in the following. For the sake of clarity, we briefly describe the entire process. The SCDA process starts with the extraction of an \(h \times w \times d\) 3-D tensor of an image from pool-5, which is transformed into an \(h \times w\) 2-D tensor called the aggregation map (A). The aggregation map is computed as \(A = \sum _{n=1}^{d} S_n\), where \(S_n\) is a feature map in pool-5. Then, the mask map M is obtained as:

$$\begin{aligned} M_{i,j} = \bigg \{ \begin{array}{cc} 1 &{}\quad \hbox {if}\, A_{i,j} > {\overline{a}} \\ 0 &{}\quad \hbox {otherwise} \end{array} \end{aligned}$$

where A and M have the same size, \((i,j)\) is a particular position among the \(h \times w\) positions, and \({\overline{a}}\) is the threshold computed as the mean value over all positions of A. In order to improve the main object selection, an algorithm that collects the largest connected component of M is applied. Therefore, the selected feature set F is:

$$\begin{aligned} F = \bigg \{ x_{(i,j)} \mid {\tilde{M}}_{(i,j)} = 1 \bigg \} \end{aligned}$$
(1)

where \({\tilde{M}}\) is the mask that selects the valid and meaningful deep convolutional descriptors. Then, an algorithm is applied to F to further reduce the noise, in order to isolate only the main object. Finally, a dimension reduction method is applied to obtain a 1-D feature; in particular, the authors choose among average pooling, max pooling, the Vector of Locally Aggregated Descriptors (VLAD) and the Fisher Vector.
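As a concrete reference, the following sketch reproduces the SCDA entry steps described above (aggregation map, mean thresholding, and largest connected component) starting from the pool-5 output of a pre-trained VGG16. It is a minimal illustration under our own assumptions: the torchvision backbone, the preprocessing pipeline, and the function name scda_mask are not part of the original SCDA code.

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from scipy import ndimage

# pool-5 output of VGG16 for a 224 x 224 input: a 512 x 7 x 7 tensor (d x h x w).
vgg16 = models.vgg16(pretrained=True).features.eval()
preprocess = T.Compose([
    T.Resize((224, 224)), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def scda_mask(image_path):
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        pool5 = vgg16(x).squeeze(0).numpy()          # shape (512, 7, 7)
    A = pool5.sum(axis=0)                            # aggregation map A = sum_n S_n
    M = (A > A.mean()).astype(np.uint8)              # threshold at the mean of A
    labels, n = ndimage.label(M)                     # keep only the largest connected component
    if n > 1:
        sizes = ndimage.sum(M, labels, range(1, n + 1))
        M = (labels == (np.argmax(sizes) + 1)).astype(np.uint8)
    F = pool5[:, M.astype(bool)]                     # selected descriptors, shape (512, n_kept)
    return pool5, M, F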

Fig. 4

Selective convolutional descriptors aggregation proposed by [49]

We modified the mask step using one of the following masks (a code sketch covering all three is given after the list):

  • SIFT-mask: specifically, let

    $$\begin{aligned} S = \big \{ (x^i,y^i) \big \}_{i=1}^{n} \end{aligned}$$

    be the SIFT key-point locations extracted from an image of size \(W_I \times H_I\); each location on the spatial grid \(W \times H\) is the location of a local deep convolutional feature. Based on the property that convolutional layers preserve the spatial information of the input image [49], we select the subset of locations on the spatial grid that correspond to locations of SIFT key-points. In this way, we discard features from background objects and keep the foreground ones.

    $$\begin{aligned} M_{\mathrm{SIFT}} = \bigg \{ \left( x_{\mathrm{SIFT}}^{(i)}, y_{\mathrm{SIFT}}^{(i)} \right) \bigg \} \end{aligned}$$

    where

    $$\begin{aligned} x_{\mathrm{SIFT}}^{(i)} = \mathrm{round}\bigg (\frac{x^{i}W}{W_{I}} \bigg ) \quad \hbox {and} \quad y_{\mathrm{SIFT}}^{(i)} = \mathrm{round}\bigg (\frac{y^{i}H}{H_{I}} \bigg ) \end{aligned}$$
  • Max-mask: we define it as:

    $$\begin{aligned} M_{\mathrm{MAX}}= & {} \bigg \{ \bigg ( x^{(k)}_{\mathrm{MAX}}, y^{(k)}_{\mathrm{MAX}} \bigg ) \bigg \} \quad k=1,\ldots ,K\\ \bigg (x^{(k)}_{\mathrm{MAX}}, y^{(k)}_{\mathrm{MAX}} \bigg )= & {} \hbox {arg\,max}_{(x,y)} F_{(x,y)}^{k} \end{aligned}$$

    We select a subset of local convolutional features which contain high activation values for all visual contents. The goal is to select the local features that capture the most prominent object structures in the input images. Specifically, we assess each feature map and select the location corresponding to the maximum activation value on that feature map.

  • SUM-mask: based on the idea that a local convolutional feature is more informative if it has high values in many feature maps; therefore, the sum of values of this kind of local feature is higher. In other words, if many channels are activated in the same image region, there is a high probability that an object of interest is in that region. We define the SUM-mask as follows:

    $$\begin{aligned} M_{\mathrm{SUM}} = \left\{ (x,y) \; \Big \vert \; \sum _{(x,y)} F \ge \alpha \right\} \end{aligned}$$
    (2)

    where

    $$\begin{aligned} \sum _{(x,y)}F = \sum _{k=1}^{K} F_{(x,y)}^{k}\end{aligned}$$
    $$\begin{aligned} \alpha = \hbox {median}\left( \sum {F}\right) \quad \hbox {or} \quad \alpha = \hbox {average}\left( \sum {F}\right) \end{aligned}$$

    In the evaluation section, we report results considering both values of \(\alpha \).
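The three masks can be expressed compactly as follows. This is a minimal numpy sketch under our assumptions, where pool5 is the (512, 7, 7) pool-5 tensor from the previous sketch, sift_points are SIFT key-point coordinates in image space, and the function names are ours. Each returned boolean mask replaces the SCDA mask map M and is applied to the descriptors as in Eq. (1).

import numpy as np

def sift_mask(sift_points, img_w, img_h, grid_w=7, grid_h=7):
    # Project SIFT key-point locations onto the spatial grid of the feature maps.
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for x, y in sift_points:                       # key-points in image coordinates
        gx = min(int(round(x * grid_w / img_w)), grid_w - 1)
        gy = min(int(round(y * grid_h / img_h)), grid_h - 1)
        mask[gy, gx] = True
    return mask

def max_mask(pool5):
    # Keep, for each of the K feature maps, the location of its maximum activation.
    d, h, w = pool5.shape
    mask = np.zeros((h, w), dtype=bool)
    for k in range(d):
        mask[np.unravel_index(np.argmax(pool5[k]), (h, w))] = True
    return mask

def sum_mask(pool5, use_median=True):
    # Keep locations whose channel-wise sum exceeds the median (or mean) of the sums.
    s = pool5.sum(axis=0)
    alpha = np.median(s) if use_median else s.mean()
    return s >= alpha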

As with traditional local descriptors, an aggregation method is needed. In this paper, we use traditional aggregation methods for convolutional neural networks: global pooling layers [51] can be used to reduce the dimensionality of the feature map output. In our work, we have considered and evaluated global max-pooling, global average-pooling and sum-pooling. After applying one of these aggregation methods, we have a compact descriptor for each image. These descriptors are then used with an image similarity measure. Image similarity computation is equivalent to solving a classification problem: we define M labels associated with the images of our knowledge base and measure the similarity using the cosine distance between the descriptor vectors representing the images. The cosine similarity is defined as:

$$\begin{aligned} \cos \theta = \frac{A\cdot B}{\parallel A \parallel \, \parallel B \parallel } = \frac{\sum _{i=1}^{n}A_i B_i}{\sqrt{\sum _{i=1}^{n} A_{i}^2}\,\sqrt{\sum _{i=1}^{n} B_{i}^2}} \end{aligned}$$

In our vision, the concept association process is composed of the concept validation and concept identification steps. We use an inter-class similarity for concept validation and an intra-class similarity for concept identification. Therefore, we use the global descriptors from the last convolutional layer for hypernym identification and high-level semantic category validation, and the local descriptors selected by the mask for hyponym identification.
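For completeness, the following sketch shows the aggregation of the selected descriptors and the nearest-neighbour decision rule based on cosine similarity, as described above; the function names and the pooling-mode argument are our own choices.

import numpy as np

def global_pool(selected, mode="max"):
    # Aggregate the (512, n) selected descriptors into one compact 512-D vector.
    if mode == "max":
        return selected.max(axis=1)
    if mode == "avg":
        return selected.mean(axis=1)
    return selected.sum(axis=1)                      # sum-pooling

def cosine_similarity(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def predict_label(query_vec, kb_vectors, kb_labels):
    # Assign the label of the knowledge-base image with the highest cosine similarity.
    sims = [cosine_similarity(query_vec, v) for v in kb_vectors]
    return kb_labels[int(np.argmax(sims))]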

3.2 Ontology alignment

We populate the ImageNet ontology with images retrieved from the Google Images search engine and the COCO data set. We realize an ontology alignment with COCO by finding correspondences between its categories and ontological concepts. In other words, an ontology alignment operation needs a measure of similarity between the data sources that we are considering; the purpose of this step is therefore to find semantic links between COCO images and ImageNet concepts. In the literature there are different techniques to perform alignments [52,53,54]. They are based on the use of weights and/or thresholds, and some of them use external resources such as a thesaurus. We combine terminological and structural techniques to reach our goal.

According to [55], the main aim of terminological techniques is to discover similarity between terms related to a concept. These methods are based on the comparison of terms (i.e. strings of text). They can be divided into two sub-categories: (i) Intrinsic techniques, based on the similarity between terms that have morphological and syntactical variations (e.g. the Porter stemming algorithm); (ii) Extrinsic techniques, where external linguistic resources, such as dictionaries and thesauri, are used to find a similarity between lexical variations of the same term. Extrinsic techniques consider that there is an equivalence relationship between synonyms and a subsumption relationship between hyponyms.

On the other hand, structural techniques assess the similarity between two entities by exploiting structural information. The focus is on the semantic or syntactic links forming a hierarchy or a graph of entities, comparing the internal structure of the entities themselves or the relationships they share. Structural techniques are based on: (i) Internal structure, which compares internal characteristics of the entities, such as cardinality, transitivity, attributes and relationships; (ii) External structure, which assesses the similarity between entities by considering their position in the respective ontologies, under the assumption that if two entities are similar, their adjacent entities are also similar. These techniques tend to treat ontologies as graphs in which each node is a concept in the ontology and each edge is a relationship between concepts.

We are now in the position to describe how we achieve alignments through the techniques described above. We point out that COCO images often contain several objects of different sizes belonging to different high-level categories. To address this issue, we consider an object as relevant if its surface is larger than 5% of the entire image [56]. We use some of the information in the COCO .json file to obtain a semantic characterization and a more efficient use of its content for our purposes. The information used is listed in Table 1.

Table 1 COCO features

We retrieve the categories related to the images and their captions. If there are no categories, we use Natural Language Processing (NLP) techniques to derive them from the image captions, starting with pre-processing operations such as (i) uppercase-to-lowercase conversion; (ii) numerical character elimination; (iii) punctuation deletion; (iv) stop-word removal. After these operations, we obtain a list of terms, and we proceed with a Part-of-Speech Tagging (POSt) task, which may be defined as the process of assigning one of the parts of speech to a given term. We take into account only nouns because they are the preferable form to represent concepts [57]. After this step, we calculate the frequency of occurrence of each word and consider the most frequent couple of words. The mapping is obtained with:

  • Intrinsic terminological techniques: for each word, we perform a stemming operation.

  • Extrinsic terminological techniques: we use WordNet as a thesaurus to keep track of lexical variations of the same term.

  • Structural techniques: we use hyponym relation.

Once the synsets associated with the words have been obtained, we recognize the 'lowest common hypernym'. If it does not exist, we consider the most frequent word to recognize the related synset. The recognized synset represents the image category. If an image has only one category, we apply terminological techniques and get the related synset through WordNet. In this case, we distinguish between two cases:

  1.

    Images with one category but more instances of the same category: we crop the image exploiting the coordinates of the bounding boxes in the COCO annotations file. In this way, we delete noisy background elements.

  2.

    Images with one category and only one instance of the same category: the image is entirely considered.

If images have two categories and only one instance for each category, we crop each object in the image. For images with more than two categories, our goal is to consider only foreground objects in order to recognize the ones relevant for the ImageNet population. We are interested in images, or parts of them, that contain the entire object and, according to the previous assumption, we define a threshold: \(\min \_\mathrm{area} = \frac{1}{4}\mathrm{TotalArea}\).

This threshold has been set experimentally, and its value has a practical interpretation: an object is recognized as relevant if it fills a large part of the analyzed image compared to the background and the other objects in it.
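In practice, the relevant objects can be selected directly from the COCO annotations; the following sketch, based on the pycocotools API, keeps only the annotations whose area exceeds the threshold above. The function name and the min_ratio parameter are ours.

from pycocotools.coco import COCO

def relevant_objects(ann_file, image_id, min_ratio=0.25):
    # Keep annotations whose area exceeds min_ratio of the image area
    # (min_ratio = 0.25 mirrors the min_area = TotalArea / 4 threshold).
    coco = COCO(ann_file)
    img = coco.loadImgs(image_id)[0]
    total_area = img["width"] * img["height"]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=image_id))
    return [a for a in anns if a["area"] >= min_ratio * total_area]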

We carry out the relative crop for the objects that satisfy this property, and we associate a synset as previously explained. For images with objects that do not satisfy this property, we apply the above-described NLP techniques, but considering the categories associated with the images. We measure the occurrence frequency of each category, and we consider the two most frequent words. If each category occurs only once, we compute the semantic similarity between the most frequent couples of words using the measure presented in [58]. When the most semantically similar couple of words has been obtained, we get the lowest common hypernym to obtain a mapping between ImageNet synsets and COCO categories, and we crop as in the previous cases. The whole ontology mapping process is summarized in Fig. 5.
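The caption-based branch of the mapping can be sketched with NLTK's WordNet interface as follows: captions are lowercased, tokenized and POS-tagged, only nouns are kept, and the lowest common hypernym of the two most frequent nouns is used as the target synset, falling back to the most frequent word when no common hypernym exists. The function names are ours, and the snippet assumes the WordNet and tagger corpora have been downloaded.

from collections import Counter
import nltk
from nltk.corpus import wordnet as wn

# Requires: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger"), nltk.download("wordnet")

def caption_nouns(caption):
    # Lowercase, tokenize, POS-tag and keep only alphabetic nouns.
    tokens = nltk.word_tokenize(caption.lower())
    return [w for w, tag in nltk.pos_tag(tokens) if tag.startswith("NN") and w.isalpha()]

def map_to_synset(captions):
    # Map a set of captions to a WordNet synset via the lowest common hypernym.
    nouns = Counter(n for c in captions for n in caption_nouns(c))
    top = [w for w, _ in nouns.most_common(2)]       # the most frequent couple of words
    synsets = [wn.synsets(w, pos=wn.NOUN)[0] for w in top if wn.synsets(w, pos=wn.NOUN)]
    if len(synsets) == 2:
        common = synsets[0].lowest_common_hypernyms(synsets[1])
        if common:
            return common[0]
    return synsets[0] if synsets else None           # fall back to the most frequent word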

Fig. 5

Ontology mapping process

4 Evaluation strategy

Our work aims at the population of an existing ontology represented as a multimedia knowledge graph using NoSQL technologies. Therefore, it is essential to test and validate the proposed approach and the implemented system. We present and discuss the strategy used to evaluate the feature extraction techniques and the accuracy of the ontology population task. The DNN used for the feature extraction is VGG16 [50], due to its performance.

We evaluated the system considering four distinct datasets: two with general informative content and two fine-grained. We consider a dataset general when it is composed of images belonging to different semantic categories, while with fine-grained we refer to datasets whose images belong to specializations of the same semantic category. Table 2 summarizes the statistics of the datasets used for the top-level evaluation, and Table 3 shows the statistics of the datasets used for the fine-grained evaluation. In detail, for the general datasets we report the number of images, the number of images per category, and how we split the dataset into training and test sets; for the fine-grained datasets we report the number of categories, subcategories and images, and the train/test split.

Table 2 General data sets statistics
Table 3 Fine-grained data sets statistics

For each dataset, the training set is the considered knowledge base, while the test set is the input query set to the system. Therefore, each test image is compared with each training image, and the predicted label for an input query is that of the training image with the highest cosine similarity value.

In this paper, we choose as evaluation metrics the Mean Average Precision (mAP) and Accuracy, due to their extensive use for our task of interest [59].

The average precision (AP) is a weighted sum of precision values: \(AP = \sum _n (R_n - R_{n-1}) P_n\), where \(P_n\) and \(R_n\) are the precision and recall at the n-th threshold; the mean average precision (mAP) is the mean of the AP over all queries. The accuracy is the fraction of correct predictions, where a prediction is considered correct when the most similar knowledge-base image has the same label as the test image: \(\mathrm{Accuracy} = \frac{\mathrm{Number\,of\,correct\,predictions}}{\mathrm{Total\,number\,of\,predictions}}\).
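A sketch of how both metrics can be computed in the nearest-neighbour setting described above, using scikit-learn's average_precision_score (which implements the \(\sum _n (R_n - R_{n-1}) P_n\) formula); the variable names are ours.

import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(query_labels, kb_labels, similarity):
    # similarity[i, j] = cosine similarity between query image i and knowledge-base image j.
    kb_labels = np.array(kb_labels)
    aps, correct = [], 0
    for i, q_label in enumerate(query_labels):
        relevance = (kb_labels == q_label).astype(int)
        aps.append(average_precision_score(relevance, similarity[i]))
        if kb_labels[int(np.argmax(similarity[i]))] == q_label:   # top-1 prediction is correct
            correct += 1
    return float(np.mean(aps)), correct / len(query_labels)       # (mAP, accuracy)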

4.1 Feature extraction evaluation

We evaluate each level of the feature extraction module considering all aggregation methods and, for each aggregation method, all mask types, as described in Sect. 3.1. We also evaluate the root node on the fine-grained datasets because we want to underline how the feature selection mechanism becomes essential when working with fine-grained images. For the root node, we extract features from the last convolutional layer of VGG16; the layer that we consider is \(pool_5\). For each input image, we get a 3-D tensor of size \(7 \times 7 \times 512\): each feature map has size \(7 \times 7\), and there are 512 channels. We evaluate it for each aggregation method. The performance measures in terms of mAP and accuracy are reported in Tables 4 and 5, respectively. The accuracy measure shows that the results on the general datasets are better than those on the fine-grained datasets; in the following, we show how we improved the performance in this worst case.

Table 4 Root node mean average precision (mAP)
Table 5 Root node accuracy

We show the mean accuracy values for each aggregation method, considering the general and fine-grained datasets separately. The results are shown in Fig. 6.

Fig. 6

Mean accuracy by pooling strategy grouped by dataset type

On the general datasets, the highest accuracy value is 0.927, obtained with max-pooling. Therefore, the final configuration for the root node is:

  • \( pool_5 \) layer features extraction;

  • max-pooling aggregation method.

On the fine-grained datasets, the highest accuracy value is 0.717. Starting from these results, obtained with standard methodologies, in the rest of this section we show the accuracy improvement achieved by our proposed approach, which uses a feature selection mechanism based on the mask schemes.

We apply a feature selection mechanism to the last convolutional layer (i.e. \(pool_5\)). Tables 6 and 7 show the results obtained in terms of mAP and accuracy for the child node using the different pooling methods and mask types.

The average accuracy value grouped by aggregation methods for each mask type is shown in Fig. 7.

Table 6 Child node mAP
Table 7 Child node accuracy
Fig. 7

Average accuracy grouped by mask type and compared by pooling methods

The highest accuracy value is 0.897, obtained with the SUM-mask using the median value, with both average and max pooling as aggregation methods.

Figure 8 shows the performance comparison with and without the feature selection mechanism.

Fig. 8

Accuracy comparison between using the last layer only and using the feature selection mechanism

We obtain an improvement of about 20% using the feature selection mechanism.

4.2 Ontology population evaluation

We evaluate the accuracy of our proposed ontology population strategy considering the accuracy of the ontology mapping through COCO and of the Google Image Fetcher through query posing. The results are reported in Tables 8 and 9, respectively.

Table 8 COCO accuracy
Table 9 Google image fetcher accuracy

Table 10 lists the number of newly populated ImageNet nodes.

Table 10 Populated nodes

We explicitly point out that we obtain images for 276 nodes, 43 of which were entirely unpopulated.

5 Conclusion and future works

In this paper, we proposed a multi-modal approach to populate a multimedia ontology. We combine textual and visual content using different techniques. The proposed framework is fully automated, and it is based on semantic similarity techniques to implement an ontology mapping task. Compared with the approaches in the literature, no user intervention is required, and a feature selection mechanism is implemented to solve the issue related to descriptor dimensionality. Furthermore, we introduced a hierarchical approach to validate an object candidate to populate an ontology concept, achieving better precision in the population task. Moreover, the exploitation of the semantic structure of ImageNet allows a performance improvement compared with traditional techniques, reaching an accuracy of 87%. Different applications, from classification to retrieval tasks, could be improved by a more accurate population of ImageNet [60,61,62,63].

To the best of our knowledge, there are no systems in the literature that implement a fully automatic ontology population approach. For this reason, a direct comparison is not possible. Moreover, there is a lack of public results on standard datasets and of freely available software for other systems. Starting from these considerations, and considering that the development and evaluation of semi-automated techniques is an expensive task, we would run the risk that our re-implementation of such systems would not be in line with the results of the original versions. For our part, we explicitly point out that the use of standard datasets to test the performance of the proposed techniques allows an effective exploitation of our results for later comparison with novel, similar automatic approaches.

Furthermore, the approach presents a high degree of modularity, and it can be extended considering other data sources. We are investigating future work related to the use of more complex DNN architectures, such as Siamese networks or auto-encoders. Moreover, for fine-grained recognition, it could be helpful to implement a multiscale approach to take into account small image variations. In future work, we will also try to improve the local feature extraction using better methods and other CNN architectures.