First, we trained CNNs on the hand posture dataset. To evaluate their generalization capability, we additionally recorded images of the considered hand postures using a notebook camera. This camera delivers images of size \(640 \times 480\), i.e. the same size as the images acquired by the Kinect RGB camera. Two volunteers of different nationalities, who did not take part in the dataset recordings, performed 400 gestures.
Afterwards, we evaluated the error of wrist localization using the skin-color-based algorithm for hand extraction and the neural-network-based algorithm for wrist localization. The error was determined on 727 images (person #3) from the dataset and on the 400 images recorded by the notebook camera. For this image set we manually annotated the wrist positions and then used them to calculate the errors. Table 1 presents the errors for each class together with the averages, both for the Kinect camera and the notebook camera (Dif. cam.). As we can notice, the average error for the notebook camera is slightly larger.
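The per-image localization error can be computed from the annotated and the automatically determined wrist positions; a minimal sketch, assuming the error is the Euclidean distance in pixels (the coordinate values below are purely hypothetical):

```python
import numpy as np

def wrist_localization_errors(predicted, ground_truth):
    """Per-image Euclidean distance (in pixels) between the automatically
    determined and the manually annotated wrist positions."""
    predicted = np.asarray(predicted, dtype=float)
    ground_truth = np.asarray(ground_truth, dtype=float)
    return np.linalg.norm(predicted - ground_truth, axis=1)

# Hypothetical (x, y) wrist positions for three images
pred = [(120, 300), (118, 295), (130, 310)]
gt   = [(123, 304), (118, 295), (125, 310)]
errs = wrist_localization_errors(pred, gt)   # per-image errors
mean_err = errs.mean()                       # average over the image set
```

Averaging these distances per class yields the per-class errors reported in Table 1.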
In the next stage, for each person, we divided the images into training and testing parts. About 15% of the images from each class were selected for testing, whereas the remaining images were used for training the CNNs. The images with gestures performed by person #3 were not included in the training data and were used only in person-independent tests. In this way we selected 6000 gray images of size \(38 \times 38\) for training the CNNs. All CNNs were trained using Caffe with the following parameters: batch size = 64, momentum = 0.9, base learning rate = 0.001, gamma = 0.1, step size = 1000, max. iteration = 5000.
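These hyper-parameters correspond to a standard Caffe solver definition with the step learning-rate policy; a sketch of the solver file (the net path and the CPU/GPU mode are placeholders, and the batch size of 64 is set in the net definition rather than in the solver):

```protobuf
net: "hand_posture_train_test.prototxt"  # hypothetical net definition; batch size = 64 set there
base_lr: 0.001      # base learning rate
momentum: 0.9
lr_policy: "step"   # multiply the rate by gamma every stepsize iterations
gamma: 0.1
stepsize: 1000
max_iter: 5000
solver_mode: CPU
```

With this policy the learning rate is reduced by a factor of ten every 1000 iterations over the 5000 training iterations.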
The first convolutional neural network was trained on raw gray images. Table 2 presents the recognition performance for both automatically and manually determined wrist positions on test images acquired by the notebook camera and the Kinect camera, i.e. the test part of the hand posture dataset. As we can notice, on images acquired by the Kinect the CNN achieves good classification performance. In the discussed performer-independent test the results are slightly worse for the wrist positions determined automatically. Despite the classification being done on images from a different camera, the CNN recognizes the hand postures expressed by a different performer quite well.
Table 3 presents classification results obtained on images preprocessed by the Gabor filter. They are clearly better than the results presented in Table 2, for both the Kinect and the notebook camera. In the person-independent test the CNN achieves a classification accuracy of 97% if the images are taken by the same camera as in the training. If the classification is done on images acquired by a different camera, the classification accuracy is 87%; for this case the improvement over classification on raw images is 8.5%. The discussed results were achieved using the following parameters of the Gabor filter: \(\lambda = 2\), \(\theta = [0,\ \pi/4,\ \pi/2,\ 3\pi/4]\), \(\phi = [0,\ \pi/2]\), \(b = 1.8\), \(\sigma = 0.75\), \(\gamma = 0.5\).
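With these parameters, the preprocessing amounts to filtering each image with a bank of real-valued Gabor kernels, one per orientation/phase pair. A minimal NumPy sketch of such a bank, using the standard real Gabor formulation (the kernel size of \(7 \times 7\) is an assumption; the bandwidth \(b\) relates \(\sigma\) to \(\lambda\), but since \(\sigma\) is given explicitly it is not used here):

```python
import numpy as np

def gabor_kernel(lam, theta, phi, sigma, gamma, size=7):
    """Real-valued Gabor kernel sampled on a size x size grid:
    a Gaussian envelope modulated by a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    # Rotate coordinates by the orientation theta
    xr = x * np.cos(theta) + y * np.sin(theta)
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr**2 + (gamma * yr)**2) / (2 * sigma**2))
    carrier = np.cos(2 * np.pi * xr / lam + phi)
    return envelope * carrier

# Filter bank with the parameters reported above:
# 4 orientations x 2 phases = 8 kernels
thetas = [0, np.pi / 4, np.pi / 2, 3 * np.pi / 4]
phis = [0, np.pi / 2]
bank = [gabor_kernel(lam=2, theta=t, phi=p, sigma=0.75, gamma=0.5)
        for t in thetas for p in phis]
```

Each \(38 \times 38\) input image would then be convolved with every kernel in the bank before being passed to the CNN.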
Table 4 presents classification accuracies achieved for the considered hand postures. As we can notice, the CNN achieves significantly worse results for class 6 when it operates on images taken by a different camera.
Our system has been designed to operate on the humanoid robot Nao as well as on ARM processor-based mobile devices. Since such platforms are not equipped with GPUs, which could considerably reduce the processing time of CNNs, we extract low-level features using techniques from biologically inspired computer vision, process them hierarchically, and finally recognize them with a moderate-size CNN. The recognition of a hand posture on a single sub-image with the extracted hand and the estimated wrist position takes about 5 ms. The presented software has been developed in C++, Python and Matlab.