Object recognition is important in many applications, such as robot vision, autonomous vehicles, and assistance for the visually impaired. Although optical object recognition has a long history, significant improvements have come only in the last few years with the evolution of neural networks. It is natural to assume that multiple shots decrease ambiguity and that, if those shots are taken from different directions, the amount of information gathered about the object also increases. In the latter case, we soon arrive at relative pose estimation, where the appearance of the 3D object corresponds to its relative pose. Due to natural ambiguities, such as noise, occlusion, and geometrical distortions, this estimation can be ill-posed. Nowadays, LSTM (long short-term memory) networks are a popular way to include the temporal domain in deep neural network (DNN) frameworks. While DNNs can be very accurate in recognition, their training, computational, and memory requirements are high. Unfortunately, for lightweight solutions, which can be crucial for future sensor technology, the temporal combination of multiple views is much less investigated. In our paper, we propose a lightweight variational approach, which can be combined with any single shot detection technique, including DNNs. The proposed model implements information fusion: besides 2D images, it uses IMU (inertial measurement unit) sensors to estimate the change in the relative orientation of the camera. The advantage of this fusion for recognition rates and for active vision is shown in our paper.

The barrier to using hidden Markov models (HMMs) in object recognition has always been the lack of ordered sequential information. For sequential multiple observations, the order of shots is determined by the actual behavior of the observer, which results in weak certainty about the state transition probabilities. We resolve this issue by utilizing IMUs, which give useful hints for estimating transitions on the fly. The advantages of our proposal can be summarized as:

  • the number of shots can vary from one to any number and can be changed dynamically (in our experiments, we test 2, 4, 6, and 8 consecutive queries),

  • it can use any single shot recognition technique to evaluate the individual query images,

  • it can easily incorporate active vision by proposing the next view to decrease uncertainty of recognition,

  • it relies on low-cost IMU sensors available in many mobile and imaging devices.

Overview and contribution

The advantage of deep convolutional neural networks in object recognition is that, given a large number of training samples and proper training, they can find a good combination of feature detectors and classifier sub-networks. In contrast, variational approaches can find good results (solutions with high probabilities) even if the conditions are far from optimal and could not be approximated during training. Our first attempt to use HMMs for multiview object recognition was presented in [6]. Here, we show our improved model and results, where we can plan the next viewing position to improve recognition rates. We also extended the evaluations and included test cases where the objects to be recognized are partly occluded. Our experiments with LSTMs and HMMs on these occluded data support the above argument about the vulnerability to improper training. In the numerical evaluations, we use two standard datasets with a total of 1100 object classes. Color and edge directivity descriptors (CEDDs) [2], as compact global feature descriptors, are calculated for individual shots.

In the next section, we give an overview of related papers, and then in Sect. 3, we explain the proposed method. In Sect. 4, the datasets used are introduced, while the experiments and evaluations can be found in Sect. 5. Finally, we conclude our article in the last section.

Related papers

In computer vision, the problem of recognizing 3D objects from different views can be approached in many ways; there are numerous topics where we can find related solutions. The keywords video-based recognition, 3D object recognition, and multiple view recognition can all be considered, but specific domains such as face detection [16] or human skeleton-based recognition [3] can also apply similar techniques. Naturally, special 3D sensors (such as Kinect or Lidars) can also be utilized to help the detection, segmentation, and recognition of 3D objects [4]. Due to the lack of space, we cannot discuss such approaches. In our overview, we concentrate only on a few papers which we consider the most relevant.

The modeling of objects from different views was proposed by early techniques; an often used one is the so-called aspect graphs (e.g., [5, 19]). For example, in [19], instead of recovering a full 3D geometry, parts are connected through their mutual homographic transformation. In this approach, a canonical view is a collection of canonical parts (colored patches of objects) that share the same view. We can interpret a canonical view as a subset of the overall linkage structure of the object category, and the linkage structure is to connect each pair of canonical parts. The resulting model is a compact summarization of both the appearance and geometry information of the object class. In [20], the method is similar, but probabilistic models are generated to capture the relative position of parts within each of the discretized viewpoints. While our purpose is similar, in our case the linkage is based on the fusion of visual and orientation information utilizing global descriptors and a modified HMM framework. Naturally, local descriptors are more efficient in generic object classifications, but in case of specific objects we found global descriptors sufficient and also very cost effective considering computational time and memory requirements [7].

Liu et al. [13] attack the problem of generating a large amount of training data and the common problem of hand-crafted features on various texture-less and surface-smooth objects. A hand-crafted 3D feature descriptor with center offset and pose annotations, called view-specific local projection statistics (VSLPSs), is proposed, supported by a voting strategy that transforms the feature-point matching problem into the problem of voting for an optimal model-view in the 6-DOF space. Various experiments on three public datasets and the authors' own dataset demonstrate its good performance.

The application of HMMs for object recognition is restricted to cases where statistically ordered sequences of features can be constructed. It is natural that in some activities the order of motion patterns determines the class of activity (see, for example, [1]), but in case of static objects the sequence of observed features is not easy to generate. An early example for static objects is given in [9], where the contours of objects were used with some invariant features. Unfortunately, only four objects were investigated in the experiments. In [8], the authors presented an approach for face recognition using singular value decomposition (SVD) to extract relevant face features and HMMs as classifiers. In order to create the HMM model, the two-dimensional face images had to be transformed into one-dimensional observation sequences. A face image was divided into seven distinct horizontal regions: hair, forehead, eyebrows, eyes, nose, mouth, and chin, forming seven hidden states in the Markov model. While the algorithm was tested on two standard databases, the advantage of the HMM model over other approaches was not discussed, and the proposed spatial order modeling, meant to represent structural relations, seems unnatural. Considering lightweight approaches and HMMs, we should mention [10], where the low-dimensional spatial domain feature (SDF) descriptors were first clustered and then used as state representations of object views in HMMs. Unfortunately, no information is given on how the transition probabilities were estimated.

A recent CNN-based framework can be found in [12], where six-degree-of-freedom (6-DoF) pose for a large number of object classes was determined from single or multiple views. To learn discriminating pose features, three new capabilities are integrated into a deep CNN: an inference scheme that combines both classification and pose regression based on a uniform tessellation of the special Euclidean group in three dimensions (SE(3)), the fusion of class priors into the training process via a tiled class map, and an additional regularization using deep supervision with an object mask. While it is shown that the proposed framework improves the performance of the single-view network, the incremental-temporal use of the network is not discussed.

LSTMs are efficient techniques for the sequential linkage of observation data. In computer vision, they are mostly utilized for the processing of dynamically changing data such as motion behavior [21] and tracking of objects [11]. Not only temporal data can be processed by LSTMs: in [22], apple diseases and pests are detected. Here, the purpose of the LSTMs was to combine the features of three deep models, namely AlexNet, GoogleNet, and DenseNet201. Yuan et al. [24] apply a much more interesting approach to address action-driven weakly supervised object detection. The proposed temporal dynamic graph LSTM architecture recurrently propagates the temporal context on a constructed dynamic graph structure for each frame. That is, temporal action information patterns can help the recognition of visual objects. Similarly to our approach, [15] combines the outputs of independent detections, although not with HMMs but with LSTMs, called Association LSTM.

The proposed method

The overview of our algorithm is shown in Figs. 1 and 2. An object is captured by several shots from different directions. These different views can also be considered as different relative poses. As the camera moves, we get the change of relative pose (\(\varDelta \alpha _i\)) from the IMU sensors. Each captured image is independently evaluated, and the probability of all possible objects is estimated. Analyzing the retrieval list, we can compare the most relevant candidates and determine the next move of the camera to minimize the ambiguity. To combine the retrieval lists, we start from the assumption that we see the same object continuously. That is, we have to determine the most probable state sequence for each object with the Viterbi algorithm. The object having the highest similarity (when evaluating its highest probability state sequence) is retrieved as the recognized object.

Fig. 1

Object is recognized continuously in a sequence of queries. New queries (\(Q_2\) and \(Q_3\)) are planned by the analysis of previous shot(s) (\(Q_1\))

Fig. 2

Overview of the proposed multiview method

HMM object models

An HMM is defined by:

  • the set of N possible hidden states \(S = \{S_1,...,S_N\}\),

  • transition probabilities between states \(S_i\) and \(S_j\) (see Eq. 2),

  • emission probabilities based on observations (P(o), see Eq. 7),

  • initial state probabilities (\(\pi _i\)).

The observation of objects with multiple views is a process where in each step t the model is in one state \(q_t\in S\), where \(t = 1,..., T\). To achieve object retrieval, we need to build HMM models for all elements of the set of objects (M), where the states correspond to different poses. Then, based on the sequence of observations, we find the most probable state sequence for all object models.

Object views as states in a Markov model

In our approach, the states can be considered as the 2D views (poses) of a given object model. Observations of these hidden states can be easily imagined as the camera targeting an object from a given elevation and azimuth. In our experiments, we use a static subdivision of the full circle of 360\(^\circ \) into 8 uniform sectors of 45\(^\circ \) each, at the same elevation. We define the initial state probabilities \({\varvec{\pi }}=\{\pi _i\}_{1\le i\le N}\) based on the opening angle of the views:

$$\begin{aligned} \pi _i=P(q_1=S_i)=\frac{\alpha (S_i)}{360} \end{aligned}$$

where \(\alpha (S_i)\) is the angle interval (given in degrees) of the aperture of state \(S_i\).

State transitions

Between two steps, the model can undergo a change of states according to a set of transition probabilities associated with each state pair. In general, the transition probabilities are:

$$\begin{aligned} a_{ij}=P(q_t=S_j|q_{t-1}=S_i) \end{aligned}$$

where the indices i and j refer to hidden states of the HMM. The transition probability matrix is denoted by \(\mathbf{A }=\{a_{ij}\}_{1\le i,j\le N}\), where \(a_{ij} \ge 0\), and for a given state \(\sum _{j=1}^{N} a_{ij} = 1\) holds.

Building a hidden Markov model means defining the hidden states and learning its parameters (\({\varvec{\pi }}\), \(\mathbf{A }\), and the emission probabilities introduced later) by examining typical examples. However, our problem does not allow such a training process: the probability of going from one state to another depends heavily on the user's behavior. Instead, we can directly compute the transition probabilities based on geometric probability as follows.

First define \(\varDelta _{t-1,t}\) as the orientation difference between two successive observations (\(o_t\) and \(o_{t-1}\)):

$$\begin{aligned} \varDelta _{t-1,t}= \alpha (o_t)-\alpha (o_{t-1}). \end{aligned}$$

Now define \(R_i\) as the aperture interval angle belonging to state \(S_i\) by borderlines:

$$\begin{aligned} R_i = [ S_i^\mathrm{min}, S_i^\mathrm{max}), \end{aligned}$$

where \(S_i^\mathrm{min}\) and \(S_i^\mathrm{max}\) denote the two (left and right) terminal positions of state \(S_i\). The back-projected aperture interval angle is the range of orientation from where the previous observation could originate:

$$\begin{aligned} L_j = [ S_j^\mathrm{min}-\varDelta _{t-1,t}, S_j^\mathrm{max}-\varDelta _{t-1,t}). \end{aligned}$$

Now, to define the transition probability of coming from state \(S_i\) to state \(S_j\), we compute the ratio of the opening angle of the intersection of \(L_j\) and \(R_i\) to the opening angle of \(L_j\):

$$\begin{aligned} a_{ij}=P(q_t=S_j|q_{t-1}=S_i)=\frac{\alpha (L_j\cap R_i)}{\alpha (L_j)}. \end{aligned}$$
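The geometric construction of Eqs. 5 and 6 can be illustrated with a short sketch. It assumes the uniform subdivision used in the experiments (8 sectors of 45°, with state i covering [45i, 45(i+1)) degrees) and a single IMU-measured orientation change in degrees; the function names `arc_overlap` and `transition_matrix` are hypothetical and not taken from the original implementation.

```python
import numpy as np

def arc_overlap(a_start, a_len, b_start, b_len):
    """Overlap (in degrees) of two arcs on a 360-degree circle.

    Each arc is given by its start angle and its length (< 360).
    """
    # Express arc b relative to the start of arc a.
    s = (b_start - a_start) % 360.0
    total = 0.0
    # Arc b may wrap around the circle, so also test its copy shifted by -360.
    for shift in (0.0, -360.0):
        lo = max(0.0, s + shift)
        hi = min(a_len, s + shift + b_len)
        total += max(0.0, hi - lo)
    return total

def transition_matrix(delta_deg, n_states=8):
    """Transition matrix from one orientation change (assumed uniform sectors)."""
    width = 360.0 / n_states
    A = np.zeros((n_states, n_states))
    for j in range(n_states):
        # Back-projected interval L_j: where the previous shot could originate.
        l_start = j * width - delta_deg
        for i in range(n_states):
            # a_ij = alpha(L_j intersect R_i) / alpha(L_j)
            A[i, j] = arc_overlap(l_start, width, i * width, width) / width
    return A

# Example: the IMU reports a rotation of 30 degrees between two shots.
A = transition_matrix(30.0)
```

With a 30° rotation and 45° sectors, a view that started in state 0 stays in state 0 with probability 1/3 and moves to state 1 with probability 2/3; with uniform sectors every row of the matrix sums to one.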

Recognition of objects from multiple views

The emission probability of a particular observation \({o}_t\) for state \(S_i\) is defined as:

$$\begin{aligned} b_i(o_t)=P(o_t|q_t=S_i). \end{aligned}$$

In [7], we have shown that the area-based CEDD [2] is a robust low-dimensional descriptor for object recognition. CEDD classifies pixels into one of six texture classes (horizontal, vertical, \(45^{\circ }\) and \(135^{\circ }\) diagonal, non-edge, and non-directional edges) using the MPEG-7 Edge Histogram Descriptor. For each texture class, a normalized 24-bin color histogram is generated, where each bin represents colors obtained by the division of the HSV color space. CEDD’s advantage is that it uses only a short vector (length of 144) as a global descriptor, but naturally it can be less robust under various circumstances. More sophisticated (but also computationally expensive) single shot recognition techniques can also be used within our framework such as SSD [14] or Yolo [17]. In our experiments, we use the combination of CEDD and Tanimoto coefficient to approximate the emission probabilities of states. Emission probability of Eq. 7 can be given as:

$$\begin{aligned} b_i(o_t)=\frac{{\mathcal {T}}({\mathcal {C}}(S_i),{\mathcal {C}}(o_t))}{\sum _{j=1}^{N} {\mathcal {T}}({\mathcal {C}}(S_j),{\mathcal {C}}(o_t))} \end{aligned}$$

where \({\mathcal {C}}\) stands for the CEDD descriptor generating function and \({\mathcal {T}}\) stands for the Tanimoto coefficient. Since each state of the object models can cover a large directional range, we use the average CEDD vector of the available model samples within that range to represent the whole state with a single descriptor. The sequence of retrieval lists, generated by independent queries, is evaluated by the Viterbi algorithm, which combines the values of Eqs. 1, 6, and 8 to get the most probable state sequences. To achieve object retrieval, we have to find the most probable state sequence \({\hat{S}}_k\) with the above steps for all possible candidate objects. To select the winner object \({\hat{k}}\), we have to compare the observations with the most probable state sequence:

$$\begin{aligned} {\hat{k}} = \arg \max _{\forall k\in M}\left( \frac{\sum _{i=1}^{T} {\mathcal {T}}({\mathcal {C}}(o_i),{\mathcal {C}}({\hat{S}}_{k,i}))}{T}\right) \end{aligned}$$

where k denotes object k in M.
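The retrieval steps above (Eq. 8 emissions, Viterbi decoding, and the selection of the winner object) can be sketched as follows. This is a minimal illustration under stated assumptions, not the original implementation: the Tanimoto coefficient is taken in its common real-valued form a·b / (a·a + b·b − a·b), the initial probabilities are uniform (Eq. 1 with equal sectors), random 144-dimensional vectors stand in for averaged CEDD descriptors, and all function names are hypothetical.

```python
import numpy as np

EPS = 1e-12  # avoids log(0) for impossible transitions

def tanimoto(a, b):
    """Tanimoto coefficient of two real-valued vectors (assumed form)."""
    dot = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b)) - dot
    return dot / denom if denom > 0 else 0.0

def emission(state_descs, obs):
    """Eq. 8: similarities normalized over the states of one object model."""
    sims = np.array([tanimoto(s, obs) for s in state_descs])
    total = sims.sum()
    return sims / total if total > 0 else np.full(len(sims), 1.0 / len(sims))

def viterbi(pi, A_seq, B):
    """Most probable state path; A_seq[t-1] is the matrix for step t-1 -> t."""
    T, N = B.shape
    score = np.log(pi + EPS) + np.log(B[0] + EPS)
    back = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + np.log(A_seq[t - 1] + EPS)  # cand[i, j]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + np.log(B[t] + EPS)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

def recognize(models, observations, A_seq):
    """Eq. 9: pick the object whose best state path matches the queries best."""
    scores = {}
    for name, states in models.items():
        n = len(states)
        pi = np.full(n, 1.0 / n)  # uniform sectors assumed
        B = np.array([emission(states, o) for o in observations])
        path = viterbi(pi, A_seq, B)
        scores[name] = float(np.mean(
            [tanimoto(o, states[s]) for o, s in zip(observations, path)]))
    return max(scores, key=scores.get), scores

# Toy demo: random vectors replace CEDD descriptors (assumption).
rng = np.random.default_rng(0)
models = {"a": rng.random((8, 144)), "b": rng.random((8, 144))}
shift = np.roll(np.eye(8), 1, axis=1)  # camera moved exactly one sector
observations = [models["a"][0], models["a"][1]]
best, scores = recognize(models, observations, [shift])
```

Working in the log domain keeps the products of small probabilities numerically stable for longer query sequences; because the transition matrices come from the IMU, a separate matrix is passed for every step instead of one trained matrix.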

Active recognition

Active recognition is a relatively old idea in pattern recognition, and it is typically used to extend non-active methods. Without discussing such techniques, we refer the reader to the survey in [18]. Active vision systems can be classified, according to their next view planning strategy, into two groups:

  1. Systems that take the next view to minimize an ambiguity function;

  2. Systems incorporating explicit path planning algorithms.

We have chosen the first strategy, and here we discuss a method that is very close to human behavior: moving around an object to become acquainted with its appearance from different directions. Based on a rapid evaluation of the first observations, we hypothesize which objects have high probability, and we plan the following movements to find those views that can reduce ambiguity. Based on the preliminary models, each object k is represented with \(N_k\) descriptors computed as the average of the descriptors within a given viewing range:

$$\begin{aligned} {\tilde{c}}_{k,i} = \frac{1}{N_k^i}\sum _{l=1}^{N_k^i}c_{k,l} \end{aligned}$$

where \(c_{k,l}\) stands for the descriptors of object k within interval i. The similarity between these average views can be computed with the Tanimoto coefficient and can be stored in matrix S of size \(N N_{k} \times NN_{k}\). After making the very first observations, we are to evaluate the retrieval list(s) \({\mathcal {L}}\), and as \(\alpha ({\tilde{c}}_{k,i})\) provides the estimate of orientation for the most probable object k in state i, we can also compute the similarity of object views to the left (and to the right accordingly):

$$\begin{aligned} {S}_\mathrm{left} =\sum _{{\tilde{c}}_{j},{\tilde{c}}_{l} \in {\mathcal {L}}, j \ne l} {\mathcal {T}}({\tilde{c}}_{j, \mathrm{left}},{\tilde{c}}_{l, \mathrm{left}}) \end{aligned}$$

where \({\tilde{c}}_{j,\mathrm{left}}\) is the closest \({\tilde{c}}_{j}\) view left to \(\alpha ({\tilde{c}}_{j,k})\) being in the already existing retrieval list \({\mathcal {L}}\). Finally, we should move the camera either to the left or to the right depending on the similarity of views of the possible candidates:

$$\begin{aligned} \mathrm{Decision} = {\left\{ \begin{array}{ll} \text {Move to left} &{}\text {if } {S}_\mathrm{left} \le {S}_\mathrm{right}\\ \text {Move to right} &{}\text {if } {S}_\mathrm{left} > {S}_\mathrm{right} \end{array}\right. } \end{aligned}$$

resulting in the more discriminating direction. The performance of this active approach will be compared to the non-active recognition in Sect. 5.
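The left/right decision can be sketched in a few lines. The sketch assumes that each candidate in the retrieval list is given as its per-state average descriptors together with the estimated current state index, that "left" corresponds to the previous state index and "right" to the next one (the real mapping depends on the sector layout), and that the Tanimoto coefficient has its common real-valued form; the names `pairwise_similarity` and `next_move` are hypothetical.

```python
import numpy as np

def tanimoto(a, b):
    """Tanimoto coefficient of two real-valued vectors (assumed form)."""
    dot = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b)) - dot
    return dot / denom if denom > 0 else 0.0

def pairwise_similarity(views):
    """Sum of Tanimoto coefficients over all ordered pairs j != l."""
    total = 0.0
    for j in range(len(views)):
        for l in range(len(views)):
            if j != l:
                total += tanimoto(views[j], views[l])
    return total

def next_move(candidates):
    """Choose the direction in which the candidate views differ the most.

    candidates: list of (state_descriptors, estimated_state_index) pairs,
    one per object in the retrieval list.
    """
    # Assumption: index - 1 is the neighboring view to the left, + 1 to the right.
    left = [d[(s - 1) % len(d)] for d, s in candidates]
    right = [d[(s + 1) % len(d)] for d, s in candidates]
    s_left = pairwise_similarity(left)
    s_right = pairwise_similarity(right)
    move = "left" if s_left <= s_right else "right"
    return move, s_left, s_right

# Toy demo: the two candidates look identical from the left, different from the right.
cand_a = (np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0]]), 1)
cand_b = (np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]), 1)
move, s_left, s_right = next_move([cand_a, cand_b])
```

In this toy case the views to the left are indistinguishable while those to the right are orthogonal, so the rule selects the right-hand move, where the next shot separates the candidates better.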


The following datasets were used to generate the HMM models and to run the different tests. All of our models consisted of 8 states.

COIL-100 dataset

The COIL-100 dataset includes 100 different objects; 72 images of each object were taken at pose intervals of \(5^\circ \). We evaluated retrieval with clear queries and with queries heavily distorted by Gaussian noise and motion blur. The imnoise function of MATLAB, with standard deviation \(\mathrm{sd}=0.012\), was used to generate additive Gaussian noise (GN), while motion blur (MB) was generated by fspecial with parameters \(len=15\) and angle \(\theta =20^\circ \). Some examples of the queries are shown in Fig. 3. To simulate real-life scenarios, we created the Occluded COIL-100 dataset containing the same 100 objects, but with partial occlusion over the object areas (for illustration see Fig. 3).

Fig. 3

First two lines: clear samples from COIL-100. Third line: example queries loaded with Gaussian additive noise. Fourth line: example queries loaded with motion blur. Fifth and sixth lines: occluded examples

ALOI-1000 dataset

The ALOI-1000 dataset includes 1000 different small objects recorded against a black background. Each object was recorded by rotation in the plane at 5\(^\circ \) steps. For evaluation under different conditions, we used the same distortion settings, including occlusions, as described for the COIL-100 dataset. Please note that while the resolution of images in COIL is \(128\times 128\), it is \(384\times 288\) for ALOI. (This explains the less visible Gaussian noise and motion blur in Fig. 4.)

Fig. 4

First two lines: clear samples from ALOI-1000. Third line: example queries loaded with Gaussian additive noise. Fourth line: example queries loaded with motion blur

Experiments and evaluations

All the above-introduced variations of the datasets were generated to show how our temporal methods can improve the performance of weak classifiers under different circumstances. Since CEDD mainly relies on edge-like features, strong additive noise or (motion) blur can influence the results. Charts are generated by taking the average of 10 experiments with randomly generated queries for all 100 and 1000 objects of the COIL and ALOI datasets. (That is, the total number of queries was a multiple of 11,000.)

In all measurements, we see the advantage of using multiple queries: all curves monotonically increase as the number of queries increases. The first two charts of Fig. 5 help the understanding of our idea for active vision on some test data where all queries were occluded. The continuous curves show whether the ground truth (GT) objects are within the top 10 candidates of the retrieval lists (\({\mathcal {L}}\)). Since our next view planning makes its decision based on the 10 most probable candidates of the first two retrieval lists, we expect to get results below this curve but above the non-active approach. We could measure performance gain over non-active recognition between 6.2 and 13.8% in these experiments.

Fig. 5

Comparison of non-active and active recognition when all queries are occluded. Top graph: COIL-100, bottom graph: ALOI-1000 datasets. Continuous curves show the GT being in the top 10 items of the retrieval list

Figures 6 and 7 show other experimental results regarding the COIL and ALOI datasets, respectively. In these tests, either all queries were loaded with Gaussian noise (GN) or motion blur (MB), or the first two queries were partially occluded, while the remaining ones were loaded with MB and GN (these are denoted with \(2\hbox {O}\_{\mathrm{MB}}\) and \(2\hbox {O}\_{\mathrm{GN}}\)). In all cases, the increase of the number of queries resulted in higher hit-rate and active vision outperformed the non-active.

For an alternative presentation of some parts of the above data, we included a table (Table 1) of results for the cases with 8 queries. The table shows that the active method has a significant advantage on the smaller dataset (COIL-100 with 100 object classes), whereas with a large number of object classes this advantage shrinks to around 1% in general. While this effect is natural, it is less pronounced for good-quality images, as can be seen in Fig. 5, where the queries were only occluded and no other type of noise was applied.

Fig. 6

Comparison of active and non-active recognition on the distorted COIL dataset. First graph: motion blur, second graph: Gaussian noise

Fig. 7

Comparison of active and non-active recognition on the distorted ALOI dataset. First graph: motion blur, second graph: Gaussian noise

Table 1 Average hit-rates (%) in case of different query distortions applying 8 sequential queries

About space and time complexity

While any single shot feature extraction technique can be applied in the proposed framework, in our article we used the very compact CEDD descriptor. It occupies 144 bytes per image, while the orientation information requires no more than 4 bytes. The memory and running time requirements, measured on plain CPUs (Intel Core i7), are given in Table 2.

Table 2 Memory and running time requirements of the HMM and LSTM models

Comparison to LSTM

Since DNNs are known for high performance in object recognition, we implemented a so-called ConvLSTM network accepting several query frames, based on the technique given in [23]. The overview of the framework, after our modification, is illustrated in Fig. 8. It can process query frames in a directional sequential order (either left or right); its 10 convolution kernels have size 3 by 3. It is known that DNNs are sensitive to training: a high number of sample images is required, taken under viewing conditions similar to those at inference. To accomplish this, either sophisticated augmentation techniques are required or large synthetic datasets are used, relying on the CAD models of the objects. In our experiments, we trained the LSTM on the whole COIL dataset and tested its recognition performance on the partially occluded version. We already showed the hit-rates of our proposed framework in Fig. 5. For comparison with the LSTM, Fig. 9 shows the mAP (mean average precision) values. These results indicate that the HMM handles the untrained occluded queries much better. The running time and memory usage of the LSTM model are given in Table 2. Please also note that the training took about half an hour for 100 epochs on an NVIDIA Quadro P6000 GPU with 24 GB RAM.

Fig. 8

Overview of the tested ConvLSTM framework in case of four sequential queries

Fig. 9

Comparison of active HMM and ConvLSTM on occluded queries


The main contribution of our paper is to show how active perception and information fusion can help the recognition of 3D objects in an HMM framework if only weak classifiers are applied. Such approaches can be important in embedded systems or where sensors with limited resources are to be used, for example, in future autonomous or IoT devices. The proposed HMM technique is computationally lightweight, requires limited memory, and can incorporate other classifiers, not only the presented CEDD. The effectiveness of the method was tested with a large number of experiments under various conditions and through comparisons with an LSTM implementation.