Fast 2D/3D object representation with growing neural gas

Angelopoulou, Anastassia; Garcia Rodriguez, Jose; Orts-Escolano, Sergio; Gupta, Gaurav; Psarrou, Alexandra

doi:10.1007/s00521-016-2579-y

Fast 2D/3D object representation with growing neural gas

Original Article
Open access
Published: 22 September 2016

Volume 29, pages 903–919, (2018)
Cite this article

Download PDF

You have full access to this open access article

Neural Computing and Applications Aims and scope Submit manuscript

Fast 2D/3D object representation with growing neural gas

Download PDF

Anastassia Angelopoulou¹,
Jose Garcia Rodriguez²,
Sergio Orts-Escolano²,
Gaurav Gupta³ &
…
Alexandra Psarrou¹

3874 Accesses
13 Citations
1 Altmetric
Explore all metrics

Abstract

This work presents the design of a real-time system to model visual objects with the use of self-organising networks. The architecture of the system addresses multiple computer vision tasks such as image segmentation, optimal parameter estimation and object representation. We first develop a framework for building non-rigid shapes using the growth mechanism of the self-organising maps, and then we define an optimal number of nodes without overfitting or underfitting the network based on the knowledge obtained from information-theoretic considerations. We present experimental results for hands and faces, and we quantitatively evaluate the matching capabilities of the proposed method with the topographic product. The proposed method is easily extensible to 3D objects, as it offers similar features for efficient mesh reconstruction.

Model Probability in Self-organising Maps

Growing Surface Structures: A Topology Focused Learning Scheme

Deep Local Shapes: Learning Local SDF Priors for Detailed 3D Reconstruction

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

The images captured of hand gestures, which are effectively a 2D projection of a 3D object, can become very complex for any recognition system. Systems that follow a model-based method [1, 32] require an accurate 3D model that captures efficiently the hand’s high Degrees of Freedom (DOF) articulation and elasticity. The main drawback of this method is that it requires massive calculations which makes it unrealistic for real-time implementation. Since this method is too complicated to implement, the most widespread alternative is the feature-based method [16] where features such as the geometric properties of the hand can be analysed using either Neural Networks (NNs) [34, 36] or stochastic models such as Hidden Markov Models (HMMs) [6, 35].

However, for the accurate analysis of the hand’s properties, a suitable segmentation that separates the object of interest from the background is needed. Segmentation is a pre-processing step in many computer vision applications. These applications include visual surveillance [5, 10, 18, 20], and object tracking [15, 17, 26]. While a lot of research has been focused on efficient detectors and classifiers, little attention has been paid to efficiently labelling and acquiring suitable training data. Existing approaches to minimise the labelling effort [19, 21, 24, 30] use a classifier which is trained in a small number of examples. Then the classifier is applied on a training sequence, and the detected patches are added to the previous set of examples. Levin et al. [21] start with a small set of hand labelled data and generate additional labelled examples by applying co-training of two classifiers. Nair and Clark [24] use motion detection to obtain the initial training set. Lee et al. [21] use a variant of eigentracking to obtain the training sequence for face recognition and tracking. Sivic et al. [30] use boosting orientation-based features to obtain training samples for their face detector. A disadvantage of these approaches is that either a manual initialization [19] or a pre-trained classifier is needed to initialise the learning process. Having a sequence of images, this can be avoided by using an incremental model.

We decided to use NNs to represent the geometric properties of objects, and more specifically the self-organising maps (SOMs), due to their incremental nature. One of these SOM-based methods is the growing cell structures (GCS) algorithm [8], which is a model formed incrementally. However, it constrains the connections between the nodes, so any model produced during the training stage is always topologically equivalent to the initial topology. The Topology Representing Networks (TRN) approach, proposed by Martinez and Schulten [22], does not have a fixed structure and also does not impose any constraint on the connection between the nodes. In contrast, this network has a pre-established number of nodes and, therefore, it is not able to generate models with different resolutions. The algorithm was also coined with the term Neural Gas (NG) due to the dynamics of the feature vectors during the adaptation process, which distribute themselves like a gas within the data space. However, as the NG has a fixed number of nodes, it is necessary to have some a priori information about the input space to pre-establish the size of the network. This model was extended by Fritzke [9] proposing the Growing Neural Gas (GNG) network, which combined the flexible structure of the NG with a growing strategy. Moreover, the learning adaptation step was slightly modified. This extension enabled the neural network to use the already detected topological information while training in order to conform to the geometry. This approach has the capability to add neurons while preserving the topology of the input space.

Although the use of the SOM-based techniques of NG, GCS or GNG for various data inputs has already been studied and successful results have been reported [4, 13, 14, 27, 31, 32], there are some limitations that still persist. Most of these works assumed noise-free environments and low complexity distributions. Therefore, applying these methods on challenging real world data obtained using noisy 2D^{Footnote 1} and 3D^{Footnote 2} sensors is our main study. These particular non-invasive sensors have been used in the associated experiments and are typical, contemporary technology.

In this work, we extend the method presented in [2] for object representation using the GNG algorithm. This work extends the already proposed method by considering elimination of noisy connections during the learning process and by applying it to 3D datasets. The method is used for the representation of two-dimensional outline of hands and ventricles, which is extended to 3D. Furthermore, we are interested in the minimisation of the user intervention in the learning process; thus, we utilise an automatic criterion for maximum node growth based on topological parameters. We achieve that by taking into consideration that human skin has a relatively unique colour and the complexity or simplicity of the proposed model is decided by information-theoretic measures.

The remainder of the paper is organised as follows. Section 2 introduces the framework for object modelling using topological relations. Section 3 proposes an approach to minimise the user intervention in the termination of the network using knowledge obtained from information-theoretic considerations. In Sect. 4 a set of experimental results is presented that includes 2D and 3D representations before conclusions are drawn in Sect. 5.

2 Characterising 2D objects with modified GNG

GNG [9] is an unsupervised incremental self-organising network independent of the topology of the input distribution or space. It uses a growth mechanism inherited from the Growth Cell Structure [8] together with the Competitive Hebbian Learning (CHL) rule [22] to construct a network of the input date set. In the GNG algorithm [9], the growing process starts with two nodes, and new nodes are incrementally inserted until a predefined conditioned is satisfied, such as the maximum number of nodes or available time. During the learning process, local error measures are gathered to determine where to insert new nodes. New nodes are inserted near the node with the highest accumulated error and new connections between the winner node and its topological neighbours are created.

Identifying the points of the image that belong to objects allows the GNG network to obtain an induced Delaunay triangulation of the objects. In other words, to obtain an approximation of the geometric appearance of the object. Let an object $\mathbf {\textit{O}} = [\mathbf {\textit{O}}_{G}, \mathbf {\textit{O}}_{A}]$ be defined by its geometry and its appearance. The geometry provides a mathematical description of the object’s shape, size, and parameters such as translation, rotation, and scale. The appearance defines a set of the object’s characteristics such as colour, texture, and other attributes.

Given a domain $\mathbf {S}\subseteq \mathbb {R}^2$, an image intensity function $\mathbf {I}(x,y)\in \mathbb {R}$ such that $\mathbf {I} : \mathbf {S} \rightarrow [0, \mathbf {I}_{\max }]$, and an object $\mathbf {\textit{O}}$, its standard potential field $\varPsi _{T} (x,y) = f_{T}(I(x,y))$ is the transformation $\varPsi _{T}: \mathbf {S} \rightarrow [0, 1]$ which associates with each point $(x,y)\in \mathbf {S}$ the degree of compliance with the visual property T of the object $\mathbf {\textit{O}}$ by its associated intensity $\mathbf {I}(x,y)$.

Considering:

The input distribution as the set of points in the image:
$${\mathbf {A}}= {\mathbf {S}}$$
(1)

$$\xi _{w}= (x,y)\in {\mathbf {S}}$$
(2)
The probability density function according to the standard potential field obtained for each point of the image:
$$p(\xi _{w}) = p(x,y) = \varPsi _{T} (x,y)$$
(3)

Learning takes place with our modified GNG algorithm where wrong edges in the network are eliminated and the final graph is normalised. Algorithms 1 and 2 describe our extended GNG. During this process, the neural network is obtained which preserves the topology of the object ${\mathbf {\textit{O}}}$ from a certain feature T. Therefore, from the visual appearance ${\textit{O}}_{A}$ of the object is obtained an approximation to its geometric appearance $\mathbf {\textit{O}}_{G}$. Henceforth, the Topology Preserving Graph $TPG = \langle A,C\rangle$ is defined with a set of vertices (nodes) A and a set of connections (edges) C. To speed up the learning, we used the faster Manhattan distance [23] compared to the Euclidean distance in the original algorithm [9].

Figure 1 compares the original GNG algorithm with the modified GNG in 5 simple shapes with curvatures and corners.

We test the performance of the modified GNG by quantitative measures as shown in Table 1. The two measures of topological correctness that we used are the mean Quantisation Error (qe) and the Topology Preservation Error (te) [33], shown in Eqs. 4 and 5 respectively. There are N pixels, or reference vectors $\overrightarrow{x_{c}}$, representing the input space in the GNG network. Each node $c\in N$ has its associated reference vector $\{x_{c}\}_{c=1}^{N}\in \mathbb {R}^{q}$. The reference vectors indicate the nodes’ position or receptive field centre in the input distribution. We first analyse the quantisation error for each node with the Euclidean distance to its Best Matching Unit (BMU)$_{m_{\overrightarrow{x_{c}}}}$. The Best Matching Unit (BMU)$_{m_{\overrightarrow{x_{c}}}}$ is the node whose reference vector is closest to the input signal $(\xi _{w})$. The mean quantization error (qe) is the average distance between each reference vector and its BMU. For the calculation of topographic error, there is a function $u(\overrightarrow{x_{c}})$ that is 1 if $\overrightarrow{x_{c}}$ data vectors first and second BMUs are adjacent and 0 otherwise. The modified version of GNG produces a significant speed increase, with better connections in corners and angles and better topology preservation (less error).

$$\begin{aligned} \hbox{qe} & = \frac{1}{N} \sum _{c=1}^{N} \parallel \overrightarrow{x_{c}} - m_{\overrightarrow{x_{c}}}\parallel \end{aligned}$$

(4)

$$\begin{aligned} \hbox{te} & = \frac{1}{N} \sum _{c=1}^{N} u(\overrightarrow{x_{c}}) \end{aligned}$$

(5)

Table 1 Topology Preservation measures of the original vs. modified GNG with respect to frames per second (fps)

Full size table

As reflected in Table 1, GNG modified version provided lower quantization and topology preservation errors due to the deletion of wrong edges for most cases. However, in a few cases, wrong edges provide a shorter distance between input space and the Delaunay triangulation obtained (see Star-6 TE).

Figure 2 shows another example of the modified GNG applied to shapes extracted from the Columbia Object Image Library (coil-100) dataset. The 100 object coil-100 dataset consists of colour images of 72 different poses for each object. The poses correspond to $5^{\circ }$ rotation intervals. Figure 3 shows the modified GNG from our own dataset of hands and shapes. Any wrong connections to corners have been accurate eliminated.

To normalise the graph that represents the contour we must define a starting point, for example the node on the left-bottom corner. Taking that node as the first we must follow the neighbours until all the nodes have been added to the new list. If necessary we must apply a scale and a rotation to the list with respect to the centre of gravity of the list of nodes. We achieved the required alignment by applying a transformation T composed by a translation $(t_x,t_y)$, rotation $\theta$, and a scaling s. The normalisation is given by Algorithm 2.

$$\begin{aligned} T \left[ \begin{array}{r} x_{i}\\ y_{i} \end{array}\right] = \left[ \begin{array}{rr} s(\cos \vartheta )x_{i} & -s(\sin \theta )y_{i}\\ s(\sin \vartheta )x_{i} & s(\cos \theta )y_{i} \end{array}\right] + \left[ \begin{array}{r} t_{x}\\ t_{y} \end{array}\right] \end{aligned}$$

(6)

3 Adaptive learning

The determination of accurate topology preservation requires the determination of best similarity threshold and best network map without overfitting. Let $\Omega (x)$ denote the set of pixels in the objects of interest based on the configuration of x (e.g. colour, texture, etc.) and $\Upsilon$ the set of all image pixels. The likelihood of the required number of nodes to describe the topology of an image y is:

$$\begin{aligned} p(y|x) & = \left\{ \prod _{u\in \Omega (x)}p_{skin}(u) \prod _{v\in \Upsilon \backslash \Omega (x)}p_{bkgd}(v) \right. \nonumber \\&\quad \left. \propto \prod _{u\in \Omega (x)}\frac{p_{skin}(u)}{p_{bkgd}(u) + p_{skin}(u)} \right\} * e_{T} \end{aligned}$$

(7)

and $e_{T} \le {\prod _{u\in \Omega (x)}p_{skin}(u) + \prod _{v\in \Upsilon \backslash \Omega (x)}p_{bkgd}(v)}$.

Figure 4 shows the network map for images with different skin to background ratio.

$e_{T}$ is a similarity threshold and defines the accuracy of the map. If $e_{T}$ is low, the topology preservation is lost and more nodes need to be added. On the contrary, if $e_{T}$ is too big, then nodes have to be removed so that Voronoï cells become wider. For example, let us consider an extreme case where the total size of the image is $I = 100$ pixels and only one pixel represents the object of interest. Let us suppose that we use $e_{T} = 100$ then the object can be represented by one node. In the case where $e_{T} \ge I$ then overfit occurs since twice as many nodes are provided.

In our experiments, the numerical value of $e_{T}$ ranges from $100\le e_{T} \le 900$ and the accuracy depends on the size of the objects’ distribution. The difference between choosing manually the maximum number of nodes and selecting $e_{T}$ as the similarity threshold, is the preservation of the object independently of scaling operations. Algorithm 3 shows the steps of the automatic criterion added to the modified GNG algorithm to minimise user intervention in the learning process.

We can describe the optimum number of similarity thresholds, required for the accuracy of the map for different objects, as the unknown clusters K, and the network parameters as the mixture coefficients $W_{K}$, with d-dimensional means and covariances $\varTheta _{K}$. To do that, we use a heuristic criterion from statistics known as the Minimum Description Length (MDL) [28], which does not require an estimation of the probability p(Y) as is the case for the conditional entropy heuristic criterion [3]. The MDL criterion takes the general form of a prediction error, which consists of the difference between two terms:

$$\begin{aligned} E = model\_likelihood - complexity\_term \end{aligned}$$

(8)

a likelihood term that measures the model fit and increases with the number of clusters, and a complexity term, used as a penalty, that grows with the number of free parameters in the model. Thus, if the number of cluster is small, we get a low value for the criterion because the model fit is low, while if the number of cluster is large, we get a low value because the complexity term is large.

The information-criterion MDL of Rissanen [28], is defined as:

$$\begin{aligned} \hbox {MDL}(K) = -\ln [L(X|W_{K},\varTheta _{K})] + \frac{1}{2}M\ln (N) \end{aligned}$$

(9)

where

$$\begin{aligned} L(X|W_{K},\varTheta _{K}) = \max \prod _{i=1}^{N}p(x_{i}|W_{K},\varTheta _{K}) \end{aligned}$$

(10)

The first term $-\ln [L(X|W_{K},\varTheta _{K})]$ measures the model probability with respect to the model parameter $W_{K},\varTheta _{K}$ defined for a Gaussian mixture by the mixture coefficients $W_{K}$ and d-dimensional means and covariances $\varTheta _{K}$. The second term $\frac{1}{2}M\ln (N)$ measures the number of free parameters needed to encode the model and serves as a penalty for models that are too complex. M describes the number of free parameters and is given for a Gaussian mixture by $M = 2dK + (K - 1)$ for $(K-1)$ adjustable mixture weights and 2D parameters for d-dimensional means and diagonal covariance matrices.

The optimal number of similarity thresholds can be determined by applying the following iterative procedure:

For all K, $(K_{\min}< K <K_{\max })$

(a) Maximize the likelihood $L(X|W_{K},\varTheta _{K})$ using the EM algorithm to cluster the nodes based on the similarity thresholds applied to the dataset.

(b) Calculate the value of MDL(K) according to Eqs. 9 and 10
Select the model parameters $(W_{K},\varTheta _{K})$ that correspond to minimisation of the MDL(K) value.

Figure 5 shows the value of MDL(K) for clusters within the range of $(1< K < 18)$. We have doubled the range in the MDL(K) minimum and maximum values so we can represent the extreme cases of 1 cluster which represents the whole dataset, and 18 clusters which over classify the distribution and corresponds to the overfitting of the network with similarity threshold $e_{T} = 900$. A global minimum and therefore optimal number of clusters can be determined for $K = 9$ which indicates that the best similarity threshold that defines the accuracy of the map without overfitting or underfitting the dataset is $e_{T} = 500$. To account for susceptibility for the EM cluster centres as part of the MDL(K) initialisation of the mixture coefficients, the measure is averaged over 10 runs and the minimal value for each configuration is selected. Algorithm 4 summarises the steps.

We can now use this optimal network to track objects locally wherever common regions are found. To do that, shape information and colour information from the 1st and any subsequent frames are added to the TPG map and can be used for the learning in a sequence of k frames. The segmented frame and the stored shape and colour information in each node is given by:

$$\begin{aligned} S(x;P(g(x,y);t) = p(k|x) \propto P(g(x,y),t-1), TPG_{t-1} \end{aligned}$$

(11)

Figure 6 shows the convergence of the network with shape and posterior probability per node.

4 Experiments

In this section, different experiments are shown validating the capabilities of our extended GNG method to represent 2D and 3D hand models. The proposed method by considering elimination of noisy connections during the learning process is able to define an optimal number of nodes using the MDL criterion. The method has also been used in 3D datasets. First, a quantitative study is performed adding different levels of noise to the ground truth model (datasets). Using the ground truth models and the generated ones adding noise, we are able to measure the error produced by our method. In addition, our method is compared against the state-of-the-art algorithms Active Shape Models and Poisson surface reconstruction.

All methods have been developed and tested on a desktop machine of 2.26 GHz Pentium IV processor. These methods have been implemented in MATLAB and C++. The Poisson surface reconstruction method has been implemented using the PCL library^{Footnote 3} [29].

4.1 Benchmark data

We tested our modified GNG network on a dataset of hand images recorded from 5 participants each performing different gestures (Fig. 7) that frequently appear in sign language. To create this dataset, we have recorded images over several days and a simple webcam was used with image resolution $800 \times 600$. In total, we have recorded over 12000 frames, and for computational efficiency, we have resized the images from each set to $300 \times 225$, $200 \times 160$, $198 \times 234$, and $124 \times 123$ pixels. We obtained the dataset from the University of Alicante, Spain and the University of Westminster, UK. Also, we tested our method with 49 images from Mikkel B. Stegmann^{Footnote 4} online dataset. In total we have run the experiments on a dataset of 174 images. Since the background is unambiguous, the network adapts without occlusion reasoning. For our experiments, only complete gesture sequences are included. There are no gestures with partial or complete occluded regions, which means that we do not model multiple objects that interact with the background.

Furthermore, we have performed the experiments having in mind specific applications, thus limiting its applicability. The quality and stability of the results at close range makes it worthwhile for webcam or green screen sign language applications which share a close range viewing distance and a relatively uncluttered background.

We have also tested the system in a more generic background where shadows, changes in lighting and extremely cluttered backgrounds are common. Figure 8 shows that when colour information is incorporated into the network, the system is able to represent the gesture and only a few nodes adjust to nearby similar pixels. Gesture representation is possible as long as no homogeneity is applied around the gesture.

To classify a region as a hand or face, we take into account domain knowledge information that always respects some proportions found in hands and human faces [11]. To do that we find the centroid, height and width of the connected nodes in the networks as well as the percentage of skin in the rectangular area (Fig. 9). Since the height to width ratio for hands and human faces fall into a small range, we are able to reject or accept if the topology of a network does or does not represent a hand. Studies [7, 11] have shown that the height to width ratio of human face and hands fall within a range based on the well known Golden Ratio (Eq. 12). Thus, we consider a network as a hand or not if the height to width ratio of the region falls within a range of the Golden Ratio ± Tolerance. In the case where the hand is in a folded posture the rule still applies but with different percentage for the Tolerance. The values for the Tolerance were found by experimentation, and range from 0.5 to 0.7 based on the hand posture.

$$\begin{aligned} \varphi \equiv \frac{\hbox {Height}}{\hbox {Width}} \equiv \frac{(1 + \sqrt{5})}{2} \end{aligned}$$

(12)

Table 2 shows topology preservation, execution time, and number of nodes when different variants in the $\lambda$ and the K are applied in the gesture (d) from Fig. 7 as the input space. Faster variants get worse topology preservation but the network converges quickly. However, the representation is sufficient and can be used in situations where minimum time is required like online learning for detecting obstacles in robotics where you can obtain a rough representation of the object of interest in a given time and with minimum quality.

Table 2 Topology preservation and processing time using the quantisation error and the topology preservation error for different variants

Full size table

Figure 10 shows the distribution of two different hand shapes and the plots of the MDL(K) cluster centres within the range of $(1< K < 18)$. The optimum cluster is achieved at $K = 9$ (circled point).

Table 3 shows the topology preservation error for a number of nodes. We can see that the insertion of more nodes makes no difference to the object’s topology. Based on the maximum size of the network, an optimum result is achieved when at least half of the network is developed. Table 3 shows that for the different type of gestures, this optimum number is in the range >90 and <130. Furthermore, the more nodes added during the learning process, the more time it takes for the network to grow (Fig. 11).

Table 3 The topology preservation error for gestures (a–d)

Full size table

Finally, we added different levels of Gaussian noise to three different gestures to test the validity of the modified GNG in comparison with Kohonen map and the growing cell structures (GCS). The results of applying different levels of noise to the gestures are shown in Fig. 12, and error measurements for all methods are calculated in Table 4.

Table 4 Error measurements for modified GNG, Kohonen and GCS

Full size table

4.2 Variability and comparison with the snake model

Our modified GNG network has been compared to the methodology of the active snake model. The snake converges when all the forces achieve an equilibrium state. The drawbacks with this method are that the snake has no a priori knowledge of the domain, which means it can deform to match any contour; this attribute is not desirable if we want to keep the specificity of the model or preserve the physical attributes such as geometry, topological relations, etc., and that the active step is performed globally even if parts of the snake have already converged. Figure 13 shows the tracking of a hand gesture using the modified GNG in the outline of the hand.

Figure 14 shows the fitting results of a snake applied to the same gesture. Figure 14a is the original state of the snake after manually locating an area around the hand. The closer we allocate landmark points around the hand the faster the convergence of the snake. The snake after a number of iterations converges to the palm of the hand but fails to converge around the thumb.

The parameters for the snake are summarised in Table 5. The execution time for modified GNG is approximately 4 times less compared to the snake. The computational and convergence results are summarised in Table 6.

Table 5 Parameters and performance for snake

Full size table

Table 6 Convergence and execution time results of modified GNG and snake

Full size table

4.3 3D reconstruction

This section shows the result of applying an existing approach proposed by Orts-Escolano et al. [25] for performing 3D surface reconstruction using the GNG algorithm. In this work, we focused on the application of the above-mentioned method for performing reconstruction of human hands and faces that were acquired using the Kinect sensor. Moreover, some experiments were performed using synthetic data.

In [25], the original GNG algorithm is extended to perform 3D surface reconstruction. Furthermore, it considers surface normal information during the learning process. It modifies original Competitive Hebbian Learning process, which only considered the creation of edges between neurons, producing wire-frame 3D representations. Therefore, it is necessary to modify the learning process in order to create triangular faces during network adaptation.

The edge creation, the neurons insertion and the neuron removal stages were extended considering the creation of triangular faces during this process. Algorithm 5 describes the extended CHL to produce triangular faces during the adaption process.

Figure 15 shows the model created by applying the original GNG algorithm using as an input data a point cloud obtained using the Kinect sensor. It can be appreciated how the GNG produces a wire-frame representation of the input data, but no information about 3D surfaces is provided.

Figure 16 shows the 3D mesh created using the method mentioned above. It can be seen how this extended algorithm is able to create a coloured 3D mesh, surface information, that represents the input data. Since point clouds obtained using the Kinect are partial 3D views, the mesh obtained is not complete and therefore the model generated by the GNG is an open coloured mesh.

Moreover, it can also be appreciated that the generated representation is accurate and implicitly it performs some typical computer vision preprocessing steps such as filtering, downsampling and 3D reconstruction.

Figure 17 shows the result of applying the GNG-based method for surface reconstruction applied to complete hand 3D models. These models were synthetically generated using 3D CAD software.

Figure 18 shows the mean square error of different representations of the hand obtained with different numbers of neurons. In addition, the graph shows that with approximately 180 neurons, the adaption error obtained is satisfactory and provides an adequate representation of the input data. We chose the minimum number of neurons with an acceptable quality as it allows real-time processing.

Finally, we performed some experiments using 3D human faces instead of hands to demonstrate that the method can also deal with different shapes. Figure 19 shows the 3D reconstruction of a human face acquired using the Kinect sensor (top) and the 3D reconstruction of a synthetically generated human face (bottom). Both faces were reconstructed using the GNG for 3D surface reconstruction. Synthetic data were generated using the Blensor software [12], for simulating a virtual Kinect sensor (noise-free).

In all our experiments, the parameters of the network are as follows: $\lambda = 100$ to 1000, $\epsilon _x = 0.1$, $\epsilon _n = 0.005$, $\Delta x_{s_{1}} = 0.5$, $\Delta x_{i} = 0.0005$, $\alpha _{\max } = 125$.

While 3D downsampling and reconstruction methods like Poisson or Voxelgrid are not able to deal with noisy data, GNG method is able to avoid outliers and obtain an accurate representation in presence of noise. This ability is due to the Hebbian learning rule used and its random nature that update vertex location based on the average influence of a large number of input patterns.

5 Conclusions and future work

Based on the capabilities of GNG to readjust to new input patterns without restarting the learning process, we developed an approach to minimise the user intervention by utilising an automatic criterion for maximum node growth. This automatic criterion for GNG is based on the object’s distribution and the similarity threshold ($e_{T}$) which determines the preservation of the topology. The model is then used for the representation of motion in image sequences by initialising a suitable segmentation. During testing we found that for different shapes there exists an optimum number that maximises topology learning versus adaptation time and MSE. This optimal number uses knowledge obtained from information-theoretic considerations. Furthermore, we have shown that the low dimensional incremental neural model (GNG) adapts successfully to the high dimensional manifold of the hand by generating 3D models from raw data received from the Kinect. Future work will aim at improving system performance at all stages to achieve a natural user interface that allows us to interact with any object manipulation system. Likewise, the acceleration of the whole system should be completed on GPUs.

Notes

Webcam with image resolution $800\times 600$.
Kinect for XBox 360: http://www.xbox.com/kinectMicrosoft.
The Point Cloud Library (or PCL) is a large scale, open project for 2D/3D image and point cloud processing.
http://www2.imm.dtu.dk/~aam/.

References

Albrecht I, Haber J, Seidel H (2003) Construction and animation of anatomically based human hand models. In: Proceedings of the 2003 ACM SIGGRAPH/eurographics symposium on computer animation, pp 98–109
Angelopoulou A, García J, Psarrou A, Gupta G, Mentzelopoulos M (2013) Adaptive learning in motion analysis with self-organising maps. In: The 2013 international joint conference on neural networks (IJCNN), pp 1–7
Brown G, Pocock A, Zhao M-J, Luján M (2012) Conditional likelihood maximisation: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:26–66
MathSciNet MATH Google Scholar
Cretu A, Petriu E, Payeur P (2008) Evaluation of growing neural gas networks for selective 3D scanning. In: Proceedings of IEEE international workshop on robotics and sensors, environments, pp 108–113
Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol I, pp 886–893
Eddy S (1996) Hidden Markov models. Curr Opin Struct Biol 6(3):361–365
Article MathSciNet Google Scholar
Fernandez A, Ortega M, Cancela B, Penedo MG (2012) Contextual and skin color region information for face and arms location. In: Proceedings of the 13th international conference on computer aided systems theory—EUROCAST 2011, vol 6927/2012, pp 616–623
Fritzke B (1994) Growing cell structures—a self-organising network for unsupervised and supervised learning. J Neural Netw 7(9):1441–1460
Article Google Scholar
Fritzke B (1995) A growing neural gas network learns topologies. In: Advances in neural information processing systems 7 (NIPS’94), pp 625–632
García-Rodríguez J, Angelopoulou A, García-Chamizo JM, Psarrou A, Escolano SO, Giménez VM (2012) Autonomous growing neural gas for applications with time constraint: optimal parameter estimation. Neural Netw 32:196–208
Article Google Scholar
Govindaraju V (1996) Locating human faces in photographs. Int J Comput Vision 19(2):129–146
Article Google Scholar
Gschwandtner M, Kwitt R, Uhl A, Pree W (2011) BlenSor: blender sensor simulation toolbox advances in visual computing, volume 6939 of lecture notes in computer science, chap 20. Springer, Berlin
Google Scholar
Gupta G, Psarrou A, Angelopoulou A, García J (2012) Region analysis through close contour transformation using growing neural gas. In: Proceedings of the international joint conference on neural networks, IJCNN2012, pp 1–8
Holdstein Y, Fischer A (2008) Three-dimensional surface reconstruction using meshing growing neural gas (MGNG). Vis Comput Int J Comput Graph 24(4):295–302
Google Scholar
Kakumanu P, Makrogiannis S, Bourbakis N (2007) A survey of skin-color modeling and detection methods. Pattern Recogn 40(3):1106–1122
Article MATH Google Scholar
Koike H, Sato Y, Kobayashi Y (2001) Integrating paper and digital information on enhanced desk: a method for real time finger tracking on an augmented desk system. ACM Trans Comput Hum Interact 8(4):307–322
Article Google Scholar
Kruppa H (2004) Object detection using scale-specific boosted parts and a Bayesian combiner. PhD Thesis, ETH Zrich
Kruppa H, Santana C, Sciele B (2003) Fast and robust face finding via local context. In: Proceedings of the IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, pp 157–164
Lee M, Nevatia R (2005) Integrating component cues for human pose estimation. In: Proceedings of the IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance, pp 41–48
Leibe B, Seemann E, Sciele B (2005) Pedestrian detection in crowded scenes. In: Proceedings of the IEEE conference on computer vision and pattern recognition, vol I, pp 878–885
Levin A, Viola P, Freund Y (2003) Unsupervised improvement of visual detectors using co-training. In: Proceedings of the IEEE international conference on computer vision, vol I, pp 626–633
Martinez T, Schulten K (1994) Topology representing networks. J Neural Netw 7(3):507–522
Article Google Scholar
Mignotte M (2008) Segmentation by fusion of histogram-based k-means clusters in different color spaces. IEEE Trans Image Process 5(17):780–787
Article MathSciNet Google Scholar
Nair V, Clark J (2004) An unsupervised, online learning framework for moving object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, vol II, pp 317–324
Orts-Escolano S, Garcia-Rodriguez J, Morell V, Cazorla M, Perez J, Garcia-Garcia A (2015) 3D surface reconstruction of noisy point clouds using growing neural gas: 3D object/scene reconstruction. Neural Process Lett 43:1–23
Google Scholar
Papageorgiou C, Oren M, Poggio T (1998) A general framework for object detection. In: Proceedings of the IEEE international conference on computer vision, pp 555–562
Rêgo R, Araújo A, de Lima Neto F (2007) Growing self-organizing maps for surface reconstruction from unstructured point clouds. In: Proceedings of the international joint conference on artificial neural networks, IJCNN’07, pp 1900–1905
Rissanen J (1978) Modelling by shortest data description. Automatica 14:465–471
Article MATH Google Scholar
Rusu R, Cousins S (2011) 3D is here: point cloud library (PCL). In: Proceedings of the IEEE international conference on robotics and automation, ICRA, Shanghai, China, May 9–13, 2011
Sivic J, Everingham M, Zisserman A (2005) Person spotting: video shot retrieval for face sets. In: International conference on image and video retrieval, pp 226–236
Stergiopoulou E, Papamarkos N (2009) Hand gesture recognition using a neural network shape fitting technique. Eng Appl Artif Intell 22(8):1141–1158
Article Google Scholar
Sui C (2011) Appearance-based hand gesture identification. Master of Engineering, University of New South Wales, Sydney
Google Scholar
Uriarte EA, Martn FD (2005) Topology preservation in SOM. Int J Appl Math Comput Sci 1(1):19
Google Scholar
Vamplew P, Adams A (1998) Recognition of sign language gestures using neural networks. Aust J Intell Inf Process Syst 5(2):94–102
Google Scholar
Wong S, Ranganath S (2005) Automatic sign language analysis: a survey and the future beyond lexical meaning. In: IEEE transactions on pattern analysis and machine intelligence, pp 873–8912005
Yang J, Bang W, Choi E, Cho S, Oh J, Cho J, Kim S, Ki E, Kim D (2009) A 3D hand-drawn gesture input device using fuzzy ARTMAP-based recognizer. J Syst Cybern Inform 4(3):1–7
Google Scholar

Download references

Acknowledgments

This work was partially funded by the Spanish Government DPI2013-40534-R Grant.

Author information

Authors and Affiliations

Faculty of Science and Technology, University of Westminster, 115 New Cavendish Street, Middlesex, W1W 6UW, UK
Anastassia Angelopoulou & Alexandra Psarrou
Department of Computing Technology, University of Alicante, PO Box 99, 03080, Alicante, Spain
Jose Garcia Rodriguez & Sergio Orts-Escolano
Faculty of Media, Arts and Design, University of Westminster, Northwick Park, Middlesex, HA1 3TP, UK
Gaurav Gupta

Authors

Anastassia Angelopoulou
View author publications
You can also search for this author in PubMed Google Scholar
Jose Garcia Rodriguez
View author publications
You can also search for this author in PubMed Google Scholar
Sergio Orts-Escolano
View author publications
You can also search for this author in PubMed Google Scholar
Gaurav Gupta
View author publications
You can also search for this author in PubMed Google Scholar
Alexandra Psarrou
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Anastassia Angelopoulou.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

Reprints and permissions

About this article

Cite this article

Angelopoulou, A., Garcia Rodriguez, J., Orts-Escolano, S. et al. Fast 2D/3D object representation with growing neural gas. Neural Comput & Applic 29, 903–919 (2018). https://doi.org/10.1007/s00521-016-2579-y

Download citation

Received: 23 June 2015
Accepted: 02 September 2016
Published: 22 September 2016
Issue Date: May 2018
DOI: https://doi.org/10.1007/s00521-016-2579-y

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Fast 2D/3D object representation with growing neural gas

Abstract

Similar content being viewed by others

Model Probability in Self-organising Maps

Growing Surface Structures: A Topology Focused Learning Scheme

Deep Local Shapes: Learning Local SDF Priors for Detailed 3D Reconstruction

1 Introduction

2 Characterising 2D objects with modified GNG

3 Adaptive learning