Machine Vision and Applications, Volume 24, Issue 6, pp 1213–1227

Learning class-specific dictionaries for digit recognition from spherical surface of a 3D ball

Original Paper

Abstract

Very few studies in the literature have addressed the recognition of digits placed on spherical surfaces, even though digit recognition has attracted extensive attention and has been attacked from various directions. As a particular example of this kind of recognition, we introduce a digit ball detection and recognition system that recognizes the digit appearing on a 3D ball. A digit ball is a ball that carries an Arabic number on its spherical surface. Our system works in a weakly controlled environment and detects and recognizes digit balls for a practical application, which requires the system to run continuously in real time without recognition errors. Two main challenges confront our system: how to detect the balls accurately, and how to deal with their arbitrary rotation. For the first, we develop a novel method that detects the balls appearing in a single image and demonstrate its effectiveness even when the balls are densely placed. For the second, we represent the balls with spin images and polar images to gain rotation invariance. Finally, we adopt a dictionary learning-based method for the recognition task. To evaluate our system, we perform a series of experiments on real-world digit ball images; the results validate its effectiveness, with the system achieving 100 % accuracy in the experiments.

Keywords

Digit ball recognition, Dictionary learning, Sparse coding, Circle detection, Rotation invariance

1 Introduction

Digit recognition has attracted extensive attention and has been attacked from various directions in the computer vision community, for example in handwritten digit recognition and in vehicle license plate detection and recognition. Almost all digit recognition studies in the literature share the assumption that the digits to be recognized lie on a planar surface. In many real-world cases, however, digits appear on non-planar surfaces, such as the spherical surface of a ball. As a particular example of detecting and recognizing digits on non-planar surfaces, we introduce in this paper a digit ball detection and recognition system for 3D digit balls. Here, a digit ball is a 3D ball that carries an Arabic number on its spherical surface, as illustrated in Fig. 1. The ultimate goal of our system is to recognize the digits appearing on the balls.
Fig. 1

The task of recognizing the digit appearing on each 3D ball. From this illustrative figure, we can see the difficulties of correctly recognizing these digits caused by several aspects. For example, how to accurately segment each digit ball is one problem; the arbitrary rotation of the 3D ball poses the digit in a deformed manner, and the effect of various illumination reflection also makes the recognition task more difficult

Our system serves a practical application: it is used as the interface of a man-machine interactive game, where the recognition results can be observed immediately by a human. This application therefore requires the system to run continuously in real time without any recognition errors.\(^{1}\) Because of the specific application, the working environment can be weakly controlled, i.e. a dark, diffusely reflecting plane is used as the background, as illustrated in Fig. 1. Although detecting and recognizing these balls looks like a simple task at first sight, the real-world application poses several intrinsic problems and challenges. Our system uses only one industrial digital camera,\(^{2}\) and changes in illumination reflection are allowed. More importantly, the application requires the system to achieve nearly 100 % accuracy in recognizing these balls while running in real time, which makes digit ball recognition an extremely challenging task. Beyond this goal, our system has other practical significance. For example, it can be used for lotto and lottery monitoring and as an automatic assistant (tracking and commentary) in snooker broadcasting.

Despite its real-world significance, our system faces several intrinsic problems in recognizing the digit on the ball, such as the arbitrary rotations of the 3D balls and the diverse viewpoints of the digit, as can be seen in Figs. 1 and 2. The arbitrary rotation of the ball not only deforms the digit shape projected onto the 2D image, but also forces us to recognize the digit from various viewpoints. Rotation is only one intrinsic challenge; the diverse viewpoints of the digit region make recognition much more difficult. Therefore, determining a desirable viewpoint of the digit is another problem to attack for better recognition performance. Moreover, because only one digital camera is used in our work, the global information of the digit balls cannot be captured, and we obtain only a partial, biased view of the digit on the spherical surface. For this reason, most digit recognition approaches do not work in this situation, even if they are well suited to printed or handwritten digit recognition, such as OCR [20], geometric moments [27], contour profiles [11], mathematical morphology [6, 21, 22, 23, 28] and others [12, 17]. As we cannot run the recognition process directly on the arbitrarily rotated digit balls, we have to seek other solutions. One possible way to tackle this problem is to use a 3D reconstruction technique to map the 2D projection of the ball onto an approximately hemispherical surface. We could then adopt spherical correlation for pose estimation and the subsequent recognition process. However, spherical correlation usually needs the global information to be captured in advance, and the related algorithm is very time-consuming [24, 25]. Our aim, then, is not to solve the general 3D object recognition problem, but to recognize the shape-deformed digit on the spherical surface.
Last but not least, other unavoidable factors also degrade the recognition performance, such as changes in illumination reflection and the motion of the ball.
Fig. 2

Several digit ball images corresponding to the digit “2”. The figure shows that the intrinsic arbitrary rotation of the ball makes the recognition task much more difficult. Rotation not only produces arbitrarily rotated digits, but also deforms them because of the varying viewpoints. The rotation issue therefore poses a specific challenge in our work

Focusing on the above issues, instead of adopting the 3D reconstruction technique, we turn to extracting some informative representations of the balls. Concretely, we first extract the digit region of the detected ball according to the geometrical property, then regularize the digit region, and finally transform the regularized digit region into some efficient and effective representations. In this paper, we study two rotation-invariant feature descriptors, i.e. spin image [9, 13] and rotation-invariant feature transformation [13]. These descriptors are generated from the regularized digit region of the ball, and are invariant to arbitrary rotations. However, as a conclusion of our experiments on the descriptors, the recognition accuracies on them are not satisfactory, even though the descriptors enable us to avoid any calibration process of the digit region to save computation time.

In addition, we introduce another effective representation of the arbitrarily rotated digit balls: the polar image. As our experiments show, the polar image is robust to illumination reflection, and it also alleviates the arbitrary rotation problem by reducing it to a translation problem along a single direction of the polar image. In our system, the polar image of each detected digit ball is calibrated according to a given criterion. This calibration needs additional computation time, but it is very fast and yields a high recognition rate. Therefore, our system uses a coarse-to-fine strategy for the final recognition: (1) the spin image is used to coarsely select several candidate digit labels, and (2) the calibrated polar image is used for the final determination of the label. For recognition, we exploit a dictionary learning-based method that learns a class-specific dictionary for each kind of digit ball, and use the reconstruction error [26] to recognize novel balls. On top of this method, we adopt a Bayesian-based multiple sampling strategy to boost the recognition performance. The experiments demonstrate that, with the help of this strategy, our system achieves 100 % recognition accuracy for all the detected balls in the experimental environment. In the real-world application, no misclassification of the balls has been reported so far, and none of the running errors observed involved the detection and recognition process of our system.

Another critical step is digit ball detection, whose quality directly affects the recognition performance. We consider the problem of detecting several 3D balls in one image (dubbed multi-ball detection) and develop a novel method to tackle it. Our experiments show the effectiveness of this detection method: even when the balls are densely placed in the image, it still detects and segments them accurately. To the best of our knowledge, this is the first work in the literature to address this kind of multi-ball detection and recognition task.

To summarize, the contributions of our work include the following:
  • We develop a novel method for the multi-ball detection task, and demonstrate its effectiveness through experiments even when the balls are very densely placed in the image.

  • We propose to use the intensity-domain spin image [13] and the polar image to represent the digit ball, which can eliminate the difficulties caused by arbitrary rotation.

  • We adopt a coarse-to-fine recognition strategy, which is a two-stage process: using spin image for coarse prediction and using the polar image for the final determination of the label.

  • We advocate learning a class-specific sub-dictionary for each digit ball (its polar image), and using these sub-dictionaries with the Bayesian-based multiple sampling strategy for the final classification. Experimental results demonstrate that our system achieves 100 % recognition accuracy on all the detected digit balls.

1.1 The outline of the system

As our system serves a particular application in a weakly controlled environment, the size of the ball, i.e. its spherical radius \(\bar{R}\), is known in advance, and the distance between the camera and the dark, diffusely reflecting plane is fixed. The industrial camera is calibrated with Zhang’s method [30]. The system consists of two inseparable parts: the training part, which learns the class-specific dictionaries, and the recognition part, which uses them. Both parts share three crucial steps: (1) digit ball detection; (2) digit region detection, regularization and calibration; and (3) feature extraction and representation. In addition to these three common steps, the training part contains the most important step, dictionary learning. The overall flowchart of our system is shown in Fig. 3. In the rest of the paper, we elaborate on these four steps and validate our system through experiments. Finally, we conclude the paper with some remarks.
Fig. 3

The overall flow diagram of our digit ball detection and recognition system. The system consists of two stages: the training stage learns a reliable dictionary for the subsequent recognition stage. The four steps in the flowchart are elaborated in this paper

2 Digit ball detection

Digit ball detection serves two different purposes in our system: detection for learning the class-specific dictionaries in the training part, and detection for recognition, i.e. multi-ball detection. Each training image contains only one digit ball, whereas the images used in recognition usually contain several balls, which makes multi-ball detection much more difficult. We elaborate on the two detection approaches in this section.

2.1 Digit ball detection for training

As described in Sect. 1, our system works in a weakly controlled environment, so the carefully chosen dark, diffusely reflecting plane makes it easy to detect the ball in each training image. Moreover, because the distance between the camera and the plane is fixed and the spherical radius \(\bar{R}\) of the balls is given, we can simply use an enclosing-circle fitting method to detect the ball. In detail, we detect the edges of the image with the Canny algorithm followed by binarization, refine the binary image by removing isolated points, draw the minimum circle that encloses the remaining points, and finally take the center of this circle as the center of the detected ball. We also tried a Hough transform-based method to detect the ball, but found the enclosing-circle fitting method to be more efficient and more accurate.

2.2 Multi-ball detection for recognition

A testing image usually contains several digit balls to detect and recognize, so the detection method used in the training part is no longer applicable. To address this problem, we develop a novel method adapted to multi-ball detection.

Given a novel gray-scale image \(\mathbf{X}\) of size \(H\times W\), such as the illustrative one in the recognition part of Fig. 3, we first run some preprocessing steps on the image, similar to those of the enclosing-circle fitting method in the training part. Concretely, we detect the edges with the Canny operator, binarize the resulting image, and refine the result by removing isolated points. The result is stored in \(\mathbf{Y}\), which has the same size as \(\mathbf{X}\), as shown in Fig. 4a. Then, we create a blank image with all pixels set to zero, or more concisely a matrix \(\mathbf{Z}\in \mathbb{R }^{H\times W}\), for subsequent intermediate storage.
Fig. 4

The intermediate results of the proposed Algorithm 1: multi-ball detection algorithm. a Refined binary image \(\mathbf{Y}\); b normalized 2D histogram \(\mathbf{Z}\); c projected 2D histogram onto \(xy\)-plane; d final output results by Algorithm 1

After these preprocessing steps, we treat each white pixel of \(\mathbf{Y}\) as a center, draw a solid disc of radius \(R\) around it, and add this disc to \(\mathbf{Z}\) at the same position as in \(\mathbf{Y}\). Here, \(R\) is the radius of the projected ball in the image, which is easy to derive from the spherical radius \(\bar{R}\) and the distance between the dark, diffusely reflecting plane and the camera. Note that any constant fill value of the disc produces the same result, owing to the subsequent normalization.

Finally, we normalize the values of \(\mathbf{Z}\) to form a 2D histogram image, proportionally scaling all values into a specific range, say 0–255, as shown in Fig. 4b. This normalization is necessary because we need a threshold \(\epsilon \) to locate the centers of the balls. Specifically, we detect the local peaks of \(\mathbf{Z}\) that are greater than \(\epsilon \) and take them as candidate centers of the detected balls. It is worth noting that the detected peaks can cluster together; in other words, we do not necessarily obtain a unique center for a ball, but several pseudo-centers around the real one. To illustrate this, we project the 2D histogram onto the \(xy\)-plane (the Cartesian coordinate system), as shown in Fig. 4c. This figure shows that there is always a cluster of peak values around the real center of each ball, so it is not advisable to treat any single peak as the detected center. Fortunately, this is not a serious problem: we simply use the average location of the clustered peaks as the center of each detected ball. Empirically, we find this simple trick to be very effective and efficient.

The overall procedure of our proposed multi-ball detection method is summarized in Algorithm 1. Running the proposed approach, we obtain the final detected balls shown in Fig. 4d. Concretely, steps 1–4 of the algorithm generate the refined binary image \(\mathbf{Y}\), as illustrated in Fig. 4a, and a blank image \(\mathbf{Z}\), both of the same size as the original image \(\mathbf{X}\); steps 5–14 derive the 2D histogram image \(\mathbf{Z}\), as shown in Fig. 4b and c; after step 15, the algorithm determines the centers of the detected balls and outputs them, as shown in Fig. 4d.
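The voting procedure described in this section can be sketched in a few lines of NumPy. This is only an illustrative sketch, not a transcription of Algorithm 1: the function name `detect_ball_centers`, the default threshold `eps=200.0`, the use of plain thresholding in place of explicit local-peak detection, and the greedy grouping of candidates that lie within \(R\) pixels of each other are all assumptions made for the example.

```python
import numpy as np

def detect_ball_centers(Y, R, eps=200.0):
    """Vote-based center detection on a binary edge image Y (H x W).

    R is the projected ball radius in pixels (integer); eps is the
    threshold on the 0-255 normalized histogram (assumed value).
    """
    H, W = Y.shape
    Z = np.zeros((H, W), dtype=np.float64)

    # Pixel offsets of a solid disc of radius R.
    yy, xx = np.mgrid[-R:R + 1, -R:R + 1]
    disc = (yy ** 2 + xx ** 2) <= R ** 2
    dy, dx = yy[disc], xx[disc]

    # Each white pixel of Y adds one solid disc to Z.
    for y, x in zip(*np.nonzero(Y)):
        py, px = y + dy, x + dx
        ok = (py >= 0) & (py < H) & (px >= 0) & (px < W)
        Z[py[ok], px[ok]] += 1.0

    # Normalize Z to the range 0-255.
    if Z.max() > 0:
        Z *= 255.0 / Z.max()

    # Candidate centers: pixels above the threshold (simplified peak test).
    cand = np.argwhere(Z > eps)

    # Group nearby candidates and average each group (assumed grouping rule).
    centers = []
    used = np.zeros(len(cand), dtype=bool)
    for i in range(len(cand)):
        if used[i]:
            continue
        near = (np.linalg.norm(cand - cand[i], axis=1) < R) & ~used
        used |= near
        centers.append(cand[near].mean(axis=0))  # (row, col)
    return Z, centers
```

Averaging each group of candidates mirrors the observation that several pseudo-centers cluster around the real center of a ball.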

 

3 Digit region detection and regularization

As noted above, the digit projected onto the image is badly deformed in shape, because it is captured from various viewpoints under arbitrary rotations of the 3D ball, as illustrated in Fig. 2. Thus, we cannot run the recognition process directly on the detected balls. Instead, we first regularize the digit to a standard position, i.e. we transform the digit to a frontal pose so that it “looks at the camera”. Before that, we have to detect the informative region where the digit appears, i.e. the digit region. This step is crucial to the ultimate goal and directly determines the recognition performance. In this section, we present the digit region detection and regularization process.

3.1 Digit region detection

A convenient property of the real-world digit ball is that a circumcircle encloses each digit on the spherical surface, and at least one such circular region is always fully visible from the top. This circumcircle is projected to an ellipse in the image. We can therefore exploit this information by first detecting the small ellipse. Our system uses the Hough transform to detect ellipse-like regions, and then judges whether the detected ellipses are acceptable according to their estimated size.

In detail, given a detected ellipse-like region and its center, we can estimate its size (the lengths of the semi-major and semi-minor axes) and its eccentricity from the geometric properties of an ellipse. We then compare these estimates with the true values of these attributes, which are known in advance, and decide whether the detected ellipse-like region is the desired one. As Fig. 5 illustrates, no matter where the small circle is located on the sphere, its radius is equal to the semi-major axis of the projected ellipse. Through some simple derivations, we obtain the following relation:
$$\begin{aligned} b = R\cos \left(\arccos \frac{d}{R}-\arcsin \frac{r}{R}\right) - d, \end{aligned}$$
(1)
where \(r\) and \(R\) are the radius of the small circle on the sphere and the spherical radius, respectively, both of which are pre-given. \(d\) is the distance of the center of the sphere and the center of the ellipse, and \(b\) is the length of the semi-minor axis of the ellipse. Equation 1 provides us with the metric to judge whether the detected region is a desirable elliptic region.
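As a quick illustration of how Eq. 1 can serve as an acceptance test, the sketch below evaluates the predicted semi-minor axis and compares a detected ellipse against it. The function names and the relative tolerance `tol` are assumptions introduced for this example; the paper does not state the tolerance it uses.

```python
import math

def expected_semi_minor(d, r, R):
    """Semi-minor axis b predicted by Eq. 1 for center distance d."""
    return R * math.cos(math.acos(d / R) - math.asin(r / R)) - d

def is_plausible_digit_ellipse(a, b, d, r, R, tol=0.15):
    """Accept an ellipse whose axes agree with the prior geometry.

    a, b : measured semi-major and semi-minor axes of the ellipse
    d    : distance between the sphere center and the ellipse center
    tol  : relative tolerance (hypothetical value, to be tuned)
    """
    b_pred = expected_semi_minor(d, r, R)
    return abs(a - r) <= tol * r and abs(b - b_pred) <= tol * r
```

For \(d = 0\) the relation reduces to \(b = r\): a small circle seen exactly from the top projects to a circle of the same radius.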
Fig. 5

The radius of the small circle on the sphere projects to the semi-minor axis of the ellipse

Figure 6a displays some ellipse detection results. As the figure shows, several reasonable ellipses may be detected, and probably all of them qualify for the subsequent steps. One detected ellipse is enough, however, so we select the “best” one: the ellipse with the smallest eccentricity, which gives the least deformation of the digit. To summarize, the overall procedure of digit region detection is as follows:
  1. use the Canny operator to detect the edges, followed by an adaptive binarization process;

  2. detect all the contours and refine the result by discarding or merging the noisy contours;

  3. fit each contour to an ellipse and derive its location and size;

  4. assess each detected ellipse to judge whether it is a useful digit region;

  5. select the ellipse with the smallest eccentricity as the final digit region to represent the detected ball.
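The final selection step of this procedure can be written down directly. In the sketch below, the dictionary keys `"center"`, `"a"` and `"b"` form a hypothetical record layout for the fitted ellipses, chosen only for illustration; the eccentricity formula \(e = \sqrt{1 - (b/a)^2}\) is the standard one.

```python
import math

def eccentricity(a, b):
    """Eccentricity of an ellipse with semi-major axis a and semi-minor axis b."""
    return math.sqrt(1.0 - (b / a) ** 2)

def select_digit_region(ellipses):
    """Return the candidate with the smallest eccentricity, or None.

    ellipses: list of dicts with keys "center", "a", "b" (assumed layout).
    """
    valid = [e for e in ellipses if e["a"] >= e["b"] > 0]
    if not valid:
        return None
    return min(valid, key=lambda e: eccentricity(e["a"], e["b"]))
```

The least eccentric ellipse is the one closest to a circle, i.e. the digit region seen most nearly from the front.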

     
Fig. 6

The flow of digit region detection and regularization, and feature extraction. a Detected digit regions (the ellipse-like region); b regularization of the selected digit region from a; c relocalization of the digit region from b; d binary image derived from c; e the polar image derived from d

3.2 Digit region regularization

To obtain good recognition performance, we need to transform the elliptical digit region to a standard scale. That is, once we have the best ellipse region, the next step is to regularize it by unwarping the ellipse to a circle, so that the digit is posed in the desired frontal position. In this way, the digit is reconstructed in a better position with the least shape distortion. Figure 6b illustrates the results for the detected elliptical digit regions depicted in Fig. 6a. Furthermore, we relocate the digit region and apply the Canny edge detector with binarization to derive a more accurate digit region, as shown in Fig. 6c, d. Note that although the distinct illumination reflections visible in Fig. 6c would be expected to harm recognition performance, they disappear in the binary image in Fig. 6d. This demonstrates the robustness of our system to changes in illumination reflection.
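One simple way to picture the unwarping step is as an affine resampling that stretches the ellipse back to a circle. The sketch below makes that assumption explicitly; the exact unwarping used by the system is not specified here, and a true spherical unwarping is not affine. The function name, the nearest-neighbor sampling and the parameterization (center, semi-axes `a` and `b`, major-axis `angle` in radians) are choices made for the example.

```python
import numpy as np

def unwarp_ellipse(img, center, a, b, angle, out_radius):
    """Resample an elliptical region into a circular patch.

    center : (row, col) of the ellipse center in img
    a, b   : semi-major and semi-minor axes in pixels
    angle  : orientation of the major axis, in radians
    Returns a (2*out_radius+1) square patch; pixels outside the disc are 0.
    """
    cy, cx = center
    size = 2 * out_radius + 1
    v, u = np.mgrid[-out_radius:out_radius + 1, -out_radius:out_radius + 1]
    u = u / out_radius                      # unit-disc coordinates
    v = v / out_radius
    inside = (u ** 2 + v ** 2) <= 1.0

    # Stretch the unit disc to the ellipse axes, then rotate by `angle`.
    ex, ey = a * u, b * v
    c, s = np.cos(angle), np.sin(angle)
    x = cx + c * ex - s * ey
    y = cy + s * ex + c * ey

    # Nearest-neighbor lookup, clipped to the image bounds.
    xi = np.clip(np.rint(x).astype(int), 0, img.shape[1] - 1)
    yi = np.clip(np.rint(y).astype(int), 0, img.shape[0] - 1)

    out = np.zeros((size, size), dtype=img.dtype)
    out[inside] = img[yi[inside], xi[inside]]
    return out
```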

4 Digit ball representation

Once the digit region is correctly detected, we obtain a circular region that circumscribes the digit. However, using this circular region directly would prevent good recognition performance, because the relocated region still suffers from the arbitrary rotation problem. We therefore need an efficient and effective way to solve this problem.

The most intuitive way to handle the rotation problem is to calibrate the region by rotating it to a standard direction. When a novel digit ball is detected for recognition, we can then simply compare its calibrated digit region with that of every training datum. Unfortunately, this 2D-rotation-like calibration is not only very time-consuming for such a large number of digit regions, but also impractical because it is difficult to decide which direction to rotate to or how many angles to try. For example, probably 100 or 200 trial rotation angles would be required for a testing digit ball. A rotation-invariant feature descriptor is a good alternative. Many rotation-invariant feature descriptors already exist, such as Hu moments [7], the scale-invariant feature transform (SIFT) [15], the intensity-domain spin image [9, 13] and the rotation-invariant feature transform (RIFT) [13]. In our empirical observation, however, Hu moments and SIFT do not yield satisfactory recognition performance and perform very poorly on digits with large rotations. By contrast, intensity-domain spin images and RIFT achieve good performance, and are compared in the experiments.

In addition, to tackle the arbitrary rotation issue, we propose to transform the digit region into a polar image for the digit ball representation. The polar image lets us circumvent the 2D rotation problem by dealing with a much easier shifting problem, and generating it can be cast as a resampling process. Two parameters must be given to generate the polar image, \(\theta \) and \(\psi \), which represent the rotation angle step and the radius, respectively. Starting from a fixed direction, we resample the radius by interpolation at every angle step, and stack the resampled radii as column vectors into a matrix of appropriate size. Figure 7 illustrates the process of generating the polar image.
Fig. 7

Transforming the arbitrarily rotated digit region into polar image. This process will reduce the arbitrary rotation problem to one-direction shifting problem. Two parameters determine the size of the polar image: \(\theta \) for the rotation angle step and \(\psi \) for the radius
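The resampling that produces the polar image can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `polar_image`, the use of bilinear interpolation, the choice of the region center as the pole, and the reading of \(\psi \) as the number of radial samples (rows) are not taken from the paper.

```python
import numpy as np

def polar_image(region, theta_deg=4.0, psi=32):
    """Resample a square digit region into a psi x (360/theta_deg) polar image.

    Rows index the radius, columns index the angle (one column per angle step).
    """
    h, w = region.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    r_max = min(cy, cx)
    n_cols = int(round(360.0 / theta_deg))
    angles = np.deg2rad(np.arange(n_cols) * theta_deg)
    radii = np.linspace(0.0, r_max, psi)

    # Sampling coordinates for every (radius, angle) pair.
    ys = cy + radii[:, None] * np.sin(angles)[None, :]
    xs = cx + radii[:, None] * np.cos(angles)[None, :]

    # Bilinear interpolation.
    y0 = np.clip(np.floor(ys).astype(int), 0, h - 2)
    x0 = np.clip(np.floor(xs).astype(int), 0, w - 2)
    fy, fx = ys - y0, xs - x0
    top = region[y0, x0] * (1 - fx) + region[y0, x0 + 1] * fx
    bot = region[y0 + 1, x0] * (1 - fx) + region[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy
```

With `theta_deg=4.0` the result has 90 columns, and rotating the input region about its center corresponds to a circular shift of these columns.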

With the generated polar image, we reduce the arbitrary rotation problem to a much easier one that contains only shift variance along one direction, as illustrated in Fig. 6e. For example, if we set the rotation step to 4\(^\circ \), we obtain a polar image consisting of 90 columns. Compared with the intuitive 2D-rotation calibration, which often requires 100 or 200 trial rotation angles, calibrating the polar image is much easier and needs at most 90 attempts in this setting to determine how many columns to shift. Our system calibrates the polar image in exactly this way, using a simple correlation method. All the training balls (their polar images) of a given digit category are calibrated to the same position, so this kind of calibration is very efficient. Figure 8 illustrates its effectiveness with several examples of the digit “46”. In this figure, the sub-image (b) is the polar image derived from the digit region (a), and (c) is the calibrated polar image, which is used as the final representation of the digit ball. If each calibration trial uses a 3-pixel step, no more than 30 attempts are needed.
Fig. 8

Several training examples of the digit “46”. Row a depicts the binary image of the regularized digit region and row b shows the corresponding polar image. The polar images clearly vary only along one dimension (the horizontal direction). Row c shows the calibrated polar images of the digit “46”, which are extraordinarily similar to each other
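A correlation-based calibration of this kind can be sketched as an exhaustive search over circular column shifts. The sketch assumes that a reference polar image is available (for example one already-calibrated training image of the class) and that the score is a plain inner product; the paper only states that a simple correlation method is used, so `calibrate_polar`, `reference` and `step` are illustrative names.

```python
import numpy as np

def calibrate_polar(polar, reference, step=1):
    """Shift the columns of `polar` circularly to best match `reference`.

    step : number of columns between successive trial shifts
    Returns the shifted polar image and the chosen shift.
    """
    n_cols = polar.shape[1]
    best_shift, best_score = 0, -np.inf
    for s in range(0, n_cols, step):
        score = float(np.sum(np.roll(polar, s, axis=1) * reference))
        if score > best_score:
            best_shift, best_score = s, score
    return np.roll(polar, best_shift, axis=1), best_shift
```

For a 90-column polar image, `step=1` tries 90 shifts and `step=3` tries 30, matching the attempt counts quoted in the text.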

Note that intensity-domain spin images and RIFT are rotation-invariant feature descriptors, which lets us skip the calibration process. However, although skipping calibration saves time, the recognition accuracy is compromised to some extent, as verified by the experiments in Sect. 6.

5 Recognition based on dictionary learning

After the processes described above, we obtain the polar image of each detected digit ball as its final representation. We can now run the recognition procedure on these polar images. Our system learns a class-specific dictionary for each type of digit ball (from the corresponding polar images) for recognition.

Dictionary learning (DL), as a particular sparse signal model, has risen to prominence in recent years. DL aims to learn an (overcomplete) dictionary in which a few atoms can be linearly combined to approximate a given signal well. DL-based methods have achieved state-of-the-art performance in many application fields, such as image denoising [5] and image classification [2, 10, 16]. In this paper, we present two kinds of dictionaries learned from the polar images of the training set. One is learned directly from the uncalibrated polar images shown in Fig. 8b, and the other from the calibrated ones shown in Fig. 8c. Both types of dictionaries are evaluated through experiments in Sect. 6, which highlight the superiority of the second. Besides the dictionary learning process, we also present a way of encoding a novel signal (the incoming polar image to be recognized) over the dictionary that is more efficient than sparse coding. In this section, we elaborate on the two dictionary learning schemes and the recognition process.

5.1 Dictionary learning scenarios

Previously, Huang et al. [8] used the 1D correlation of the vectorized polar image for the final classification and achieved good results. Cheng et al. [4] improved the recognition performance by treating all the training data (actually the corresponding polar images) as the dictionary. Concretely, they encode the polar image of a testing digit ball over the dictionary in a sparse manner and use the reconstruction error for recognition. Their method works in a manner similar to the sparse reconstruction-based classification (SRC) technique [26]. In SRC, a vectorized testing polar image\(^{3}\) \(\mathbf{x}\in \mathbb{R }^{d}\) is first encoded collaboratively, under an \(\ell _1\)-norm sparsity constraint, over the dictionary \(\mathbf{D}=[\mathbf{x}_1,\ldots ,\mathbf{x}_n,\ldots ,\mathbf{x}_N]\), which consists of all \(N\) training samples, where \(\mathbf{x}_n \in \mathbb{R }^{d}\) is the normalized polar image of the \(n\)th training datum. Then, SRC classifies the testing image \(\mathbf{x}\) individually to determine which class it belongs to. Mathematically, the sparse code \(\mathbf{a} \in \mathbb{R }^{N}\) is calculated as:
$$\begin{aligned} \mathbf{a} = {\mathop {\text{ argmin}}_\mathbf{a}} \Vert \mathbf{x}- \mathbf{D}\mathbf{a}\Vert _{2}^{2} + \lambda {\Phi }(\mathbf{a}), \end{aligned}$$
(2)
where \({\Phi }(\mathbf{a})=\Vert \mathbf{a} \Vert _1\) in SRC. Then, it calculates the reconstruction error \(r_c=\Vert \mathbf{x}- \mathbf{X}_c\mathbf{a}_c\Vert _2^2\) for all the \(C\) classes, where \(\mathbf{X}_c\) is formed by all the samples from the \(c\)th class and \(\mathbf{a}_c\) is the corresponding elements of \(\mathbf a\). Finally, it selects \({\hat{c}} = {\mathop {\text{ argmin}}_{c}} r_{c}\) as the identified digit.
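The SRC decision rule of Eq. 2 and the class-wise residuals \(r_c\) can be illustrated with a small NumPy sketch. The \(\ell _1\) problem is solved here with plain ISTA (iterative soft-thresholding), which is a solver chosen for brevity and not one named in the text; the function names, `lam=0.1` and the iteration count are likewise assumptions.

```python
import numpy as np

def ista_lasso(D, x, lam=0.1, n_iter=500):
    """Approximately solve min_a ||x - D a||_2^2 + lam * ||a||_1 with ISTA."""
    L = 2.0 * np.linalg.norm(D, 2) ** 2      # Lipschitz constant of the gradient
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        z = a - 2.0 * D.T @ (D @ a - x) / L  # gradient step on the squared error
        a = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)  # soft threshold
    return a

def src_classify(D, labels, x, lam=0.1):
    """Classify x by the smallest class-wise reconstruction error r_c.

    D      : d x N matrix of normalized training samples (the dictionary)
    labels : length-N array with the class of each column of D
    """
    a = ista_lasso(D, x, lam)
    classes = np.unique(labels)
    residuals = np.array([
        np.sum((x - D[:, labels == c] @ a[labels == c]) ** 2) for c in classes
    ])
    return classes[np.argmin(residuals)], residuals
```

Because every test sample is coded over all \(N\) training columns, the cost of this scheme grows with the size of the training set.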

Cheng et al. suggest three different sparsity constraints for the sparse coding of Eq. 2: LASSO [19], elastic net [31] and nonnegative garrote [3]. Although their method achieves promising performance, its recognition mechanism uses all the training samples to reconstruct the testing image, which introduces redundant and noisy information into \(\mathbf{D}\) and makes the sparse coding process very time-consuming when the training set is large. For this reason, we propose to learn \(C\) class-specific dictionaries, one per digit category, and concatenate them into the overall dictionary. With such a dictionary, any polar image generated from the \(c\)th digit ball can be well represented by the bases of the \(c\)th class-specific dictionary. The learned dictionary removes the heavy burden of sparse coding over all the training data and yields very good results, as shown by the experiments in Sect. 6.

To learn the \(c\)th sub-dictionary \(\mathbf{D}_c = [\mathbf{d}^{c}_1, \ldots , \mathbf{d}^{c}_K] \in \mathbb{R }^{d\times K}\), given the \(c\)th training set (the polar images) \(\mathbf{X}_c \in \mathbb{R }^{d \times N_c}\), we can simply optimize the following objective function over \(\mathbf{D}_c\) and the corresponding coefficient matrix \(\mathbf{A}_c = [\mathbf{a}^{c}_1, \ldots , \mathbf{a}^{c}_{N_c}] \in \mathbb{R }^{K \times N_c}\):
$$\begin{aligned} \{ \mathbf{D}_{c}, \mathbf{A}_c\}&= {\mathop {\text{ argmin}}_{\mathbf{D}_c, \mathbf{A}_c}} \Vert \mathbf{X}_c - \mathbf{D}_{c}\mathbf{A}_{c} \Vert _{F}^{2} + \lambda \Phi (\mathbf{A}_c)\nonumber \\&\text{ s.t.} \Vert \mathbf{d}^{c}_{k}\Vert _2 = 1\, \quad \mathrm{for}\,\, \forall k = 1, \ldots , K, \end{aligned}$$
(3)
where \(\Phi (\mathbf{A}_c) = \sum _{i=1}^{N_c}\Vert \mathbf{a}_i^c\Vert _1\). There are various algorithms to solve Eq. 3, such as K-SVD [1] and feature-sign search [14]. In this paper, we adopt the K-SVD algorithm owing to its efficiency and effectiveness.
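To make the alternating structure of such a solver concrete, the following is a minimal K-SVD-style sketch, not the implementation used in our system. It assumes a fixed sparsity budget `n_nonzero` handled by a greedy orthogonal matching pursuit in place of the \(\ell _1\) penalty of Eq. 3; the function names, the initialisation from random training samples and all default values are illustrative assumptions.

```python
import numpy as np

def omp(D, x, n_nonzero):
    """Greedy orthogonal matching pursuit (stand-in for the sparse coding step)."""
    residual, support = x.astype(float), []
    a, coef = np.zeros(D.shape[1]), np.zeros(0)
    for _ in range(n_nonzero):
        k = int(np.argmax(np.abs(D.T @ residual)))
        if k in support:
            break
        support.append(k)
        # least-squares fit of x on the atoms selected so far
        coef, *_ = np.linalg.lstsq(D[:, support], x, rcond=None)
        residual = x - D[:, support] @ coef
    a[support] = coef
    return a

def ksvd(X, n_atoms, n_nonzero=2, n_iter=10, seed=0):
    """Learn one sub-dictionary D (d x K, unit-norm atoms) and codes A (K x N)."""
    rng = np.random.default_rng(seed)
    # initialise the atoms with randomly chosen, normalised training samples
    idx = rng.choice(X.shape[1], size=n_atoms, replace=False)
    D = X[:, idx].astype(float)
    D /= np.linalg.norm(D, axis=0, keepdims=True) + 1e-12
    for _ in range(n_iter):
        # sparse coding step: code every training column over the current D
        A = np.column_stack([omp(D, x, n_nonzero) for x in X.T])
        # dictionary update step: refit one atom at a time
        for k in range(n_atoms):
            users = np.flatnonzero(A[k])
            if users.size == 0:
                continue
            # error without atom k, restricted to the samples that use it
            E = X[:, users] - D @ A[:, users] + np.outer(D[:, k], A[k, users])
            U, s, Vt = np.linalg.svd(E, full_matrices=False)
            D[:, k] = U[:, 0]             # unit-norm by construction
            A[k, users] = s[0] * Vt[0]
    return D, A
```

Calling `ksvd(X_c, n_atoms=8)` on the \(d \times N_c\) matrix of one class would return a sub-dictionary whose columns satisfy the unit-norm constraint of Eq. 3.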
In this paper, we study two scenarios for training the dictionary: one learns the dictionary directly on the raw polar images, as illustrated in Fig. 8b, and the other on the calibrated ones, as demonstrated by Fig. 8c. Some examples of the two types of dictionaries are presented in Fig. 9, in which sub-figure (a) shows the uncalibrated dictionary and sub-figure (b) displays the calibrated one. We expect that any novel polar image can be well reconstructed by the uncalibrated dictionary, and that the calibrated sub-dictionary can well approximate any calibrated testing polar image of the same digit class. It is worth noting that, in a calibrated dictionary, the color variations of the edges among different atoms simply reflect the intra-class differences of the training samples, for example the slight differences between the printed digits. This indicates that the dictionary is robust to intra-class variations.
Fig. 9

Some atoms of the dictionary learned by two different scenarios: a is generated from the raw polar images, and b from the calibrated polar images. a The uncalibrated dictionary; b the calibrated dictionary

5.2 Recognition scheme

Given all the \(C\) learned sub-dictionaries, we can perform the recognition procedure for new testing digit balls (their polar images). Unlike SRC, which incorporates a sparse coding process, we choose another scheme for recognition. For a new polar image \(\mathbf{x}\), we calculate the squared error corresponding to the \(c\)th sub-dictionary as below:
$$\begin{aligned} e_c = \min _{\mathbf{a_c}} \Vert \mathbf{x}- \mathbf{D}_c\mathbf{a_c} \Vert _2^{2}. \end{aligned}$$
(4)
Note that there is no sparse constraint on the coefficient in Eq. 4; thus we do not need to solve the costly sparse coding problem, which facilitates real-time performance. Recent in-depth studies such as [29] and [18] have demonstrated the efficiency and effectiveness of this classical coding compared with sparse coding. In fact, a large number of atoms in a sub-dictionary is not necessary for recognition; that is, an overcomplete dictionary is not required. In other words, the learned sub-dictionary has full column rank, and its atoms span the subspace of the corresponding digit.
Furthermore, from Eq. 4, we can easily derive the optimal coefficient \(\mathbf{a_c}=(\mathbf{D}_c^T\mathbf{D}_c)^{-1}\mathbf{D}_c^T\mathbf{x}\). By substituting \(\mathbf{a_c}\) with \((\mathbf{D}_c^T\mathbf{D}_c)^{-1}\mathbf{D}_c^T\mathbf{x}\) back in Eq. 4, we have:
$$\begin{aligned} \Vert \mathbf{x}- \mathbf{D}_c\mathbf{a_c} \Vert _2^{2}&= \Vert \mathbf{x}- \mathbf{D}_c\left(\mathbf{D}_c^T\mathbf{D}_c\right)^{-1}\mathbf{D}_c^T\mathbf{x}\Vert _2^{2}\nonumber \\&= \Vert \left(\mathbf{I}_d - \mathbf{D}_c\left(\mathbf{D}_c^T\mathbf{D}_c\right)^{-1}\mathbf{D}_c^T\right)\mathbf{x}\Vert _2^{2}, \end{aligned}$$
(5)
where \(\mathbf{I}_d \in \mathbb{R }^{d\times d}\) is the identity matrix. From Eq. 5, it is easy to see that, if we store the matrix \( \mathbf{W}_c = \mathbf{I}_d - \mathbf{D}_c(\mathbf{D}_c^T\mathbf{D}_c)^{-1}\mathbf{D}_c^T\) throughout the recognition procedure, the computational cost will be further reduced and real-time recognition performance will be facilitated.
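As a brief aside, the closed-form coefficient used in this derivation is the ordinary least-squares solution of Eq. 4. Assuming \(\mathbf{D}_c\) has full column rank, so that \(\mathbf{D}_c^T\mathbf{D}_c\) is invertible, setting the gradient of the squared error to zero gives:
$$\begin{aligned} \frac{\partial }{\partial \mathbf{a_c}} \Vert \mathbf{x}- \mathbf{D}_c\mathbf{a_c} \Vert _2^{2}&= -2\mathbf{D}_c^T\left(\mathbf{x}- \mathbf{D}_c\mathbf{a_c}\right) = \mathbf{0} \\ \Rightarrow \quad \mathbf{D}_c^T\mathbf{D}_c\mathbf{a_c}&= \mathbf{D}_c^T\mathbf{x} \\ \Rightarrow \quad \mathbf{a_c}&= \left(\mathbf{D}_c^T\mathbf{D}_c\right)^{-1}\mathbf{D}_c^T\mathbf{x}. \end{aligned}$$
The matrix \(\mathbf{I}_d - \mathbf{D}_c(\mathbf{D}_c^T\mathbf{D}_c)^{-1}\mathbf{D}_c^T\) is the orthogonal projector onto the complement of the column space of \(\mathbf{D}_c\); it is symmetric and idempotent, so the error in Eq. 5 is the squared length of the part of \(\mathbf{x}\) that the sub-dictionary cannot explain.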

In this way, we can calculate the reconstruction error in terms of \(\mathbf{W}_1, \dots , \mathbf{W}_C\), and finally identify the testing polar image as \({\hat{c}} = {\mathop {\text{ argmin}}_{c}} \Vert \mathbf{W}_c\mathbf{x}\Vert _2^2\).
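A minimal NumPy sketch of this recognition rule is given below. It assumes each sub-dictionary has full column rank, in which case the pseudo-inverse returned by `np.linalg.pinv` equals \((\mathbf{D}_c^T\mathbf{D}_c)^{-1}\mathbf{D}_c^T\); the function names are illustrative only.

```python
import numpy as np

def residual_projector(D):
    """W = I_d - D (D^T D)^{-1} D^T, computed once per class and then stored."""
    d = D.shape[0]
    # for a full-column-rank D, pinv(D) equals (D^T D)^{-1} D^T
    return np.eye(d) - D @ np.linalg.pinv(D)

def classify(x, projectors):
    """Return the index c minimising ||W_c x||_2^2, together with all errors."""
    errors = np.array([np.sum((W @ x) ** 2) for W in projectors])
    return int(np.argmin(errors)), errors
```

At test time each class then costs a single matrix-vector product, which is where the saving over per-sample sparse coding comes from.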

6 Experimental validation

With the digit ball detection and recognition procedures detailed in the previous sections, we now evaluate the performance of our system through a series of experiments. All the experiments, as well as the practical application, are performed on an ordinary computer with the following configuration: Intel(R) Core(TM) 2 Duo CPU E4500, 2.21 GHz, 4 GB of RAM. We begin this section by introducing the data set and the evaluation criteria used in this paper.

6.1 Data set and criteria

In our work, there are 75 categories of digit balls, numbered 1–75. Each digit ball carries only one number on its spherical surface. Figure 10 depicts samples of the 75 digit balls. From these images, we can anticipate some possible effects on detection and recognition performance caused by the varying illumination reflection.
Fig. 10

The 75 digit balls used in our work

As for the training set, we capture 20 images of each digit ball under a fluorescent lighting environment, with various poses. These images are called single-ball images in this paper, because every image contains only one digit ball. We therefore have 1,500 images in total (\(75\times 20 = 1{,}500\)). Three single-ball images are illustrated in the training part of the flowchart in Fig. 3. For the real-world digit ball recognition application, 67 images are captured for recognition under the same environment, and each one contains 6–30 balls. This kind of image is called a multi-ball image in this paper. Figure 1 displays one such multi-ball image. All the images, in both the training set and the testing set, are of \(778\times 1{,}032\)-pixel resolution (see column (a) of Fig. 12), and the grayscale images are used directly for training and testing without any conversion or preprocessing. Both the single-ball image set and the multi-ball image set can be downloaded online (see footnote 4).

To evaluate the recognition performance, we choose recognition accuracy in this paper. The recognition accuracy is defined as the following:
$$\begin{aligned} \text{ ACCURACY} = 100~\% \times \frac{t}{N}, \end{aligned}$$
(6)
where \(t\) is the number of sample cases correctly recognized, and \(N\) is the total number of samples.
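For concreteness, Eq. 6 amounts to the following small helper; the function name and the list-based interface are illustrative assumptions.

```python
def accuracy(predicted, truth):
    """Recognition accuracy in percent: 100 * t / N (Eq. 6)."""
    if len(predicted) != len(truth) or len(truth) == 0:
        raise ValueError("predicted and truth must be non-empty and equally long")
    # t counts the samples whose predicted label matches the ground truth
    t = sum(p == g for p, g in zip(predicted, truth))
    return 100.0 * t / len(truth)
```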

6.2 Validation of detection

Each training image contains only one digit ball, and we simply use an enclosing circle fitting method to detect the ball, as described in Sect. 2.1. Despite its simplicity, this method always correctly detects and segments the ball out of the image; Fig. 11 depicts a few such results.
Fig. 11

Some results brought out by the fitting enclosing circle method for detecting the digit ball from single-ball image. As we can see from the figure, all the balls are accurately detected (best seen in color and by zooming in)

As introduced in Sect. 2.2, we use the proposed method to detect all the balls appearing in one image, i.e. to address the multi-ball detection problem. Besides Fig. 4, which is both an illustrative figure and an experimental result, we showcase some results on more challenging multi-ball images in which dozens of balls are densely placed. Figure 12 depicts three such results, including the 2D histograms and the corresponding \(xy\)-plane projections of the 2D histograms for better understanding. This figure clearly shows the effectiveness of our multi-ball detection method: even though the balls are densely placed in the image, they can still be accurately detected. Note that balls lying partly outside the image cannot be detected, e.g. the balls near the image edges in the second and third images. This is consistent with what we expect. The centers of the detected circles are estimated from the white pixels of the binary image; in other words, a center is the averaged position of a dense cluster of points (white pixels), or the position of a local peak, as demonstrated in Fig. 4a. Thus, when part of a ball is outside the image, the estimated center is heavily biased and our method discards the ball according to the pre-set threshold.
Fig. 12

Three multi-ball detection results produced by our proposed approach. The sub-figures of column a are the original images; b are the derived 2D histograms; c are the 2D histograms projected onto the XY-plane; and d are the detection results. Almost all the “intact” balls are accurately detected, while the balls with unseen/occluded parts outside the image (pointed out by the arrows) fail to be detected. This behavior is expected, given the mechanism of our approach

Actually, in the industrial environment, all the balls are conveyed in a row, and they can be controlled to appear in the right place to prevent the occlusion problem.

6.3 Single-ball recognition

In this subsection, we perform single-ball recognition to evaluate our method. The 1,500 single-ball images (75 digit balls with 20 images each) are used for the experiments. In detail, 12 images per digit ball are randomly selected for training (the dictionary) and the remaining 8 images are used for testing. We compare the results of our method with two rotation-invariant feature descriptors, i.e. intensity-domain spin images [13] and RIFT [13].
  • Intensity-domain spin image (ID-spin image) [13] is a variant of the spin image [9]. Unlike the spin image, which is a data-level shape descriptor used to match surfaces for 3D shape-based object recognition, the intensity-domain spin image is a 2D histogram encoding the distribution of image brightness values in the neighborhood of a particular reference (center) point. The two dimensions of the histogram are \(d\), the distance from the center point, and \(i\), the intensity value.

  • RIFT descriptor [13] is proposed to capture a representation of the local appearance of patches. It is constructed as follows: The circular patch image is divided into concentric rings of equal width and a gradient orientation histogram is computed within each ring. To maintain rotation invariance, this orientation is measured at each point relative to the direction pointing outward from the center.

 

We can directly apply the two descriptors to the regularized images demonstrated in Fig. 6c, d. Note that Fig. 6c shows grayscale images while Fig. 6d shows binary images; both are of \(81\times 81\)-pixel resolution. When transforming the grayscale images to ID-spin images, we set the number of bins for the intensity value to 20 and that for the distance to 40 (with 1-pixel step), giving an 800D vector. When transforming the binary images to ID-spin images, there are only two values, 0 for the dark color and 255 for the white color; therefore, the transformed spin image is of size \(2\times 41\), represented by an 82D vector. For the RIFT descriptor, we use five rings and eight histogram orientations, yielding 40D descriptors. The nearest neighbor classifier is used for classification with the two rotation-invariant feature descriptors.
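As an illustration of the grayscale setting above (40 distance bins and 20 intensity bins on an \(81\times 81\) patch), an ID-spin image can be sketched as a 2D histogram over (distance, intensity). This sketch uses hard binning and a global normalisation, which are simplifying assumptions rather than the exact construction of [13]; the function name is hypothetical.

```python
import numpy as np

def id_spin_image(patch, n_dist_bins=40, n_int_bins=20):
    """2D histogram of (distance from the patch centre, intensity), flattened."""
    h, w = patch.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    yy, xx = np.mgrid[0:h, 0:w]
    dist = np.sqrt((yy - cy) ** 2 + (xx - cx) ** 2)
    radius = min(cy, cx)
    inside = dist <= radius                 # keep only the inscribed disc
    hist, _, _ = np.histogram2d(
        dist[inside],
        patch[inside].astype(float),
        bins=[n_dist_bins, n_int_bins],
        range=[[0.0, radius], [0.0, 256.0]],
    )
    total = hist.sum()
    # normalise so that the descriptor sums to one
    return (hist / total).ravel() if total > 0 else hist.ravel()
```

Because the histogram depends only on each pixel's distance from the centre and its intensity, rotating the patch about its centre by a multiple of 90° leaves the descriptor unchanged; for other angles the invariance holds up to resampling error.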

For fair comparison, we adopt both of the dictionary scenarios demonstrated in Fig. 9, learned on the raw polar images (dubbed RawDic) and on the calibrated ones (dubbed CaliDic), respectively. Here, two parameters are set: \(\theta =4\) and \(\psi =30\). Note that the dictionary size differs between the two scenarios: in RawDic each sub-dictionary has eight atoms, while in CaliDic each class-specific sub-dictionary has six atoms. In both scenarios the atoms are learned from the training set, with the 12 training samples of each digit ball used for its class-specific sub-dictionary; in the CaliDic scenario, these 12 polar images are calibrated first. For each digit, we need to fix a standard pose beforehand for the calibration of the balls belonging to this digit. In our paper, the standard pose is determined by a randomly selected polar image of the digit, which is called the standard image. Then, all the other training images of this digit are calibrated to the standard pose via the correlation matching method. Thus, apart from the learned sub-dictionaries, we have 75 standard images in total for recognition. In detail, for a new testing polar image, we run the calibration process 75 times, once for each digit, and calculate the reconstruction error by Eq. 5 each time. The recognition method is the one introduced in Sect. 5.2. The averaged recognition accuracy with the standard deviation for each method/descriptor is reported over 20 runs.
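The calibration step can be pictured as a search over circular column shifts of the polar image. The sketch below is an assumption-laden illustration, not the exact procedure of Sect. 4: it scores each shift by a plain inner product with the standard image (equivalent to normalised correlation here, since a circular shift does not change the norm) and uses a hypothetical `step` of three columns.

```python
import numpy as np

def calibrate(polar, standard, step=3):
    """Circularly shift the columns of `polar` to best match `standard`."""
    polar_f = polar.astype(float)          # avoid integer overflow in products
    standard_f = standard.astype(float)
    best_shift, best_score = 0, -np.inf
    for shift in range(0, polar.shape[1], step):
        shifted = np.roll(polar_f, shift, axis=1)
        score = float(np.sum(shifted * standard_f))   # correlation score
        if score > best_score:
            best_shift, best_score = shift, score
    return np.roll(polar, best_shift, axis=1), best_shift
```

With 90 columns and a three-column step this evaluates 30 candidate shifts per class.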

Direct comparisons are listed in Table 1. From this table, we can see that the dictionary-based classification scheme on the polar images achieves better recognition results than the two rotation-invariant descriptors. In particular, with calibrated polar images, our method achieves the best recognition accuracy. Of the two rotation-invariant descriptors, the intensity-domain spin image performs better than RIFT. We conjecture the reason is that the dimensionality of the spin image (82D vector on the binary images) is higher than that of RIFT (40D vector), meaning more discriminative information is preserved for classification. Additionally, the results of the two descriptors on the grayscale images, as presented in Fig. 6c, are worse than those on the binary images, as shown in Fig. 6d. The reason is that, in the grayscale images, large illumination reflection changes appear on the surface of the ball, severely influencing the intensity value histogram in the spin image and the gradient orientation histogram in RIFT, and thus compromising the recognition performance. This is also why our method operates on polar images with binary values, which are resilient to illumination reflection changes and preserve more discriminative information.
Table 1

Comparison on recognition accuracy (\(\%\))

| Method/descriptor | Accuracy on grayscale image | Accuracy on binary image |
|---|---|---|
| ID-spin image | \(91.43 \pm 0.98\) | \(94.63 \pm 0.83\) |
| RIFT | \(84.27 \pm 0.89\) | \(88.03 \pm 0.92\) |
| Ours RawDic (8) | – | \(98.59 \pm 0.64\) |
| Ours CaliDic (6) | – | \(99.32 \pm 0.57\) |

Intensity-domain spin image (ID-spin image) and RIFT descriptor are directly applied to the grayscale and binary images, as illustrated in Fig. 6c, d, respectively. Our method learns class-specific sub-dictionaries for all the classes from the polar images (binary value): the raw polar images for RawDic and the calibrated ones for CaliDic. The number in brackets denotes the sub-dictionary size for the two scenarios

As noted previously, the two rotation-invariant descriptors enable us to avoid the calibration process, but their recognition accuracies are worse than that of our method. Our method on the calibrated polar images achieves the best accuracy but requires more computation time for the calibration process. This can be seen as a balance between running time and recognition accuracy. Fortunately, in the practical application, the calibration process is very fast on an ordinary computer, as the number of classes is small: only 75 digit balls in total and at most 30 calibration trials with a three-pixel shift, as explained in Sect. 4. Note that, even though our method achieves real-time performance with decent recognition accuracy, in practical applications our system adopts a coarse-to-fine strategy to expedite the recognition process. In detail, the coarse recognition stage uses the intensity-domain spin images for recognition (see footnote 5), selecting several candidate classes, say seven classes. Then, in the second stage, the polar images of the candidates are used for the final recognition with the learned class-specific dictionaries. This strategy speeds up our system further. Should there be more types of digit balls, this advantage would become more pronounced.
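The two-stage strategy can be sketched as follows. This is a simplified illustration under stated assumptions: `centroids` holds one mean spin-image vector per class, `projectors` holds the stored matrices \(\mathbf{W}_c\) of Sect. 5.2, and the per-candidate calibration of the polar image is omitted for brevity. All names are hypothetical.

```python
import numpy as np

def coarse_to_fine(spin_vec, polar_vec, centroids, projectors, n_candidates=7):
    """Nearest-centroid shortlist on spin images, then dictionary residuals."""
    # coarse stage: distance of the spin image to every class centroid
    dists = np.linalg.norm(centroids - spin_vec, axis=1)
    candidates = np.argsort(dists)[:n_candidates]
    # fine stage: reconstruction error ||W_c x||^2 for the shortlisted classes only
    errors = [np.sum((projectors[c] @ polar_vec) ** 2) for c in candidates]
    return int(candidates[int(np.argmin(errors))])
```

Only `n_candidates` residuals are evaluated per ball instead of one per class, which is why the benefit grows with the number of classes.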

 

6.3.1 Tuning parameters for our method

There are two important parameters in our system that can have a significant effect on the recognition performance: the rotation step \(\theta \) and the radius \(\psi \) used in determining the size of the polar images, as described by Fig. 7 in Sect. 4. In this subsection, we study through experiments how these two parameters affect the recognition performance.

To evaluate the effect of the size of the polar image, we compare the recognition results achieved with various choices of the radius \(\psi \) and the rotation step \(\theta \). Table 2 reports the averaged recognition accuracy of each setting. From this table, we can see that, for both scenarios, when the rotation step becomes too large or too small, the recognition performance degrades accordingly. Although a smaller rotation step means a finer resampling, it induces overfitting, whereas a larger rotation step fails to capture the meaningful patterns of the polar images and therefore also degrades the recognition performance. Based on the results in Table 2, we set the rotation step to 4\(^\circ \) (i.e. 90 columns are generated) and the radius length to \(30\) pixels in our system, as this setting consistently yields good recognition accuracy. Thus, we have a \(30 \times 90\)-size polar image for each ball, represented by a 2,700D vector.
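To make the sizes concrete, the sketch below resamples a square, regularized ball image into a polar image with one row per radius and one column per rotation step. Nearest-neighbour sampling about the image centre is an assumption made for brevity and may differ from the resampling of Sect. 4; the function name is hypothetical.

```python
import numpy as np

def to_polar(image, radius=30, step_deg=4):
    """Resample `image` about its centre into a (radius, 360/step_deg) array."""
    h, w = image.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    angles = np.deg2rad(np.arange(0, 360, step_deg))   # 90 columns for 4 degrees
    radii = np.arange(1, radius + 1)                   # 30 rows for radius 30
    rr, aa = np.meshgrid(radii, angles, indexing="ij")
    # nearest-neighbour lookup, clipped to stay inside the image
    ys = np.clip(np.rint(cy + rr * np.sin(aa)).astype(int), 0, h - 1)
    xs = np.clip(np.rint(cx + rr * np.cos(aa)).astype(int), 0, w - 1)
    return image[ys, xs]
```

For an \(81\times 81\) input, `to_polar(img).ravel()` has 2,700 entries, and a rotation of the ball about the image centre becomes a circular shift of the columns.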
Table 2

Comparison on recognition accuracy under different trial choices of radius length \(\psi \) and rotation step \( \theta \)

| \(\theta \) (\(^\circ \)) | \(\psi \) (pixels) | Accuracy, RawDic (8) (%) | Accuracy, CaliDic (6) (%) |
|---|---|---|---|
| 2 | 30 | \(96.42\) | \(97.33\) |
| 2 | 40 | \(97.61\) | \(98.13\) |
| 2 | 50 | \(97.87\) | \(98.59\) |
| 4 | 30 | \(98.67\) | \(99.72\) |
| 4 | 40 | \(98.33\) | \(99.08\) |
| 4 | 50 | \(98.65\) | \(99.03\) |
| 6 | 30 | \(98.51\) | \(98.10\) |
| 6 | 40 | \(98.13\) | \(98.67\) |
| 6 | 50 | \(97.89\) | \(98.40\) |
| 8 | 30 | \(97.26\) | \(97.32\) |
| 8 | 40 | \(96.94\) | \(97.89\) |
| 8 | 50 | \(96.88\) | \(97.84\) |

The number in brackets denotes the sub-dictionary size for the two scenarios

Additionally, we note that the CaliDic scenario outperforms the RawDic scenario in almost all settings of Table 2 (the only exception being \(\theta =6\), \(\psi =30\)), even though its sub-dictionaries are smaller than those of RawDic (six atoms versus eight). This can be seen as a trade-off between runtime and recognition accuracy, because CaliDic involves the calibration process and therefore requires more computation. Fortunately, the calibration is fast. Concretely, our system can process 13–15 frames/pictures per second for single-ball detection and recognition. As this processing speed satisfies the real-world application, we adopt the CaliDic scenario in our system for the recognition (with the spin image descriptor for the coarse recognition).

6.4 Validation of recognition: multi-ball recognition

In this subsection, we run the overall procedure of our system for the real-world application, i.e. multi-ball detection and recognition. The polar images are generated with the size of \(30\times 90\), as described in the previous subsection, and all the 1,500 single-ball images are used for training the 75 class-specific sub-dictionaries in the CaliDic scenario. The recognition scheme described in Sect. 5.2 is adopted.

On the 67 captured multi-ball images, our system achieves \(92.12~\%\) recognition accuracy when the balls that fail to be detected due to partial occlusion are counted as errors. As for the running time, our system can process 5–8 frames or pictures per second, with each frame/picture containing 3–6 balls. When we omit the undetected balls, the recognition accuracy is \(99.14~\%\). In real-world applications, the locations of the digit balls can be controlled so that they appear fully within the image, so the \(99.14~\%\) figure is the relevant one. We can still enhance the recognition performance to satisfy the real-world applications via another strategy, which is the content of the next section.

7 Experiment revisit with Bayesian-based multi-sampling strategy

Even though the recognition performance in Sect. 6.4 is very promising, it still cannot satisfy the real-world application. Concentrating on refining the system, in this section, we introduce a Bayesian multi-sampling strategy to improve the recognition performance of our system.

In the real-world applications, the balls for recognition are captured by a single camera, but we can adopt a multi-sample strategy to boost recognition performance, that is to capture several images under different environmental settings, such as the various viewpoints and different light sensitivities of the camera. By independently running recognition process on these images, we can analyze the results and vote for the most credible digit label. This process can be simply cast as a Bayesian-based multiple sampling strategy.

Mathematically, suppose there are \(C\) digit labels (\(C=75\) in this paper) and each digit label is with a prior probability \(p\). Denote the prior probability of the \(i\)th digit ball by \(p(C_i)\), then \(p(y|C_i)\) means the probability of the sample \(y\) appearing as an observation of the \(i\)th digit, and the posterior probability \(p(C_i|y)\) indicates the probability of the \(i\)th digit category given the observation \(y\). According to the Bayes’ theorem, we have:
$$\begin{aligned} p(C_i|y) = \frac{p(y | C_i) p(C_i) }{ \sum _{j=1}^{C} p(y|C_j) p(C_j) }. \end{aligned}$$
(7)
Intuitively, we assume the prior probability \(p(C_i)\) follows uniform distribution, i.e. \(p(C_i) = 1/C\). Moreover, we set \(p(y|C_i)=1-\delta \) if \(i = {\mathop {\text{ argmin}}_{c}} \Vert \mathbf{x}- \mathbf{D}_{c}\mathbf{a_c} \Vert _2^{2}\) (see Eq. 4), and \(\frac{\delta }{C-1}\) otherwise. Here, \(\delta \) is a small real number used for smoothing, say \(10^{-4}\). Then, we can rewrite the posterior probability Eq. 7 as below:
$$\begin{aligned} p(C_i|y ) = \left\{ \begin{array}{cl} 1-\delta ,&\quad \text{if } i = {\mathop {\text{ argmin}}_{c}} \Vert \mathbf{x}- \mathbf{D}_{c}\mathbf{a_c} \Vert _2^{2},\\ \frac{\delta }{C-1},&\quad \text{otherwise}. \end{array} \right. \end{aligned}$$
(8)
By adopting Bayesian multi-sampling strategy, we first capture \(T\) images for a digit ball, i.e. resampling \(T\) times, denoted by \(y_1,\dots , y_t, \dots ,y_T\). Then, we have \(p(C_i|Y)\) standing for the probability of the whole samples belonging to the \(i\)th category:
$$\begin{aligned} p(C_i|Y) = \prod _{t=1}^{T}p(C_i|y_t). \end{aligned}$$
(9)
Therefore, the final identified digit label is:
$$\begin{aligned} {\hat{c}} = {\mathop {\text{ argmax}}_i} p(C_i|Y). \end{aligned}$$
(10)
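Equations 8–10 can be sketched in a few lines. The sketch works with log-probabilities to avoid underflow when multiplying many small terms, and it assumes zero-based class indices; the function name is illustrative.

```python
import numpy as np

def bayes_vote(predictions, n_classes=75, delta=1e-4):
    """Combine T per-image predictions into one label via Eqs. 8-10."""
    log_post = np.zeros(n_classes)
    for label in predictions:
        # Eq. 8: 1 - delta for the predicted class, delta / (C - 1) otherwise
        p = np.full(n_classes, delta / (n_classes - 1))
        p[label] = 1.0 - delta
        # Eq. 9: a product of posteriors is a sum of log-posteriors
        log_post += np.log(p)
    # Eq. 10: the most probable label over all T samples
    return int(np.argmax(log_post))
```

For example, `bayes_vote([5, 5, 7])` returns 5. With the two-valued posterior of Eq. 8, the rule coincides with picking the label predicted most often among the \(T\) samples.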
Figure 13 illustrates this Bayesian-based multiple sampling strategy. With different choices of shooting, e.g. two viewpoints (I and II) and three light sensitivities (a, b and c), we capture six images for the balls. From this figure, we can see there are two incorrectly recognized digit balls appearing in I-a and I-b image (one for each image), and all the other detected balls (including other images) are correctly identified. Therefore, if we adopt Bayesian-based multiple sampling strategy, the right recognition results will be derived. Note that, as our system works under a fluorescent lamp with alternating current, any two consecutively captured frames can be different even if the balls stay still. Therefore, in practical applications, our system merely captures several pictures for the balls staying still, and no other camera settings are made.
Fig. 13

An illustrative example to show the Bayesian-based multiple sampling strategy. Six images are captured by the camera with various choices of conditions—two viewpoints (I and II) and three light sensitivities (a, b and c). The predicted digit labels are printed on the digit balls, and incorrectly recognized ones are pointed out by an arrow. Both image I-a and I-b have one misclassified digit ball, while in other images, all the detected balls are correctly recognized. The Bayesian-based multiple sampling strategy helps vote for the correct digit label, thus the final results give the right digit labels of these balls. In this way, our system will accurately recognize all the detected balls

We use the 67 multi-ball images to evaluate the ability of our system in multi-ball recognition. Our system is run 10 times on each image with \(\{3,5,7\}\) rounds of Bayesian sampling under the fluorescent light. With the help of the Bayesian multiple sampling strategy, all the detected balls are correctly recognized, giving \(100~\%\) recognition accuracy in the experiments. In addition, three rounds of Bayesian sampling are sufficient for \(100~\%\) accuracy. Therefore, we set the number of Bayesian sampling rounds to three for each picture/frame in practice. It is worth noting that, in real-world applications, no recognition errors have been reported so far, and the running errors that did occur are unrelated to our system.

8 Conclusion with remarks

8.1 Remarks

In this paper, we introduce a digit ball detection and recognition system which addresses the problem of detecting and recognizing the digits printed on the spherical surface of the ball. Our system consists of several important steps: (1) digit ball detection, (2) digit region regularization, (3) feature extraction and representation, and (4) dictionary learning for recognition.

We use a simple enclosing circle fitting method to detect the single ball in the image for training, and propose a novel method to detect multiple balls in one image for testing. As verified through experiments, the proposed multi-ball detection method can successfully detect and locate the balls in one image. As for recognition, our system finds the digit region and regularizes it by unwarping the elliptical region to a circular one. Therefore, no large shape deformation of the digit is induced by the varying viewpoints. We use both the intensity-domain spin image and the calibrated polar image for the representation of the digit ball. Correspondingly, we adopt a coarse-to-fine recognition strategy: the spin image is used to select several candidate digit labels at a higher speed, and the calibrated polar image is used for the final prediction of the digit label. Finally, to ensure the recognition accuracy, we introduce a Bayesian multiple sampling strategy to capture the balls several times, and use the Bayesian rule for the final decision on the predicted digit label.

In practice, no recognition errors have been reported so far, and the running errors that did occur are unrelated to our system. Our system can handle 15–20 frames/pictures per second for single-ball detection and recognition. For multi-ball detection and recognition, our system can process 5–8 frames/pictures per second, with each picture containing 4–6 balls. This speed already satisfies the real-time requirement of the real-world application. As future work to meet more demanding cases, we plan to adopt parallel computation and FPGA + DSP to expedite our system.

8.2 Conclusion

Even if our system works for specific applications under weakly controllable environment, it sheds light upon some interesting areas, such as multi-ball detection, recognizing the digit appearing on non-plane surfaces, and dealing with the arbitrary rotation problem by spin-image and polar image.

To the best of our knowledge, the multi-ball detection problem has not been addressed in the literature. Our circle/round detection method has potential in detecting circle/round in less controlled environments. This is worth exploring in the future work. To deal with the arbitrary rotation problem, we use the spin image and the polar image to represent the digit ball. The spin image is a rotation-invariant feature descriptor. Even if it avoids any calibration process, the recognition accuracy on it is relatively lower than that of the polar image. While the polar image simplifies the 2D rotation problem to a 1D shifting issue and still needs a calibration step, the recognition accuracy on it is very promising as demonstrated in the experiments. The combination of the two descriptors brings out very decent recognition rate. To make our system satisfy the practical applications, we introduce a Bayesian multiple sampling strategy to recognize the digit balls. This strategy indeed improves the performance in the experiments, achieving \(100.00~\%\) recognition accuracy.

Beyond the application considered here, our digit ball detection and recognition system has other potential applications, for example, assisting tracking and automatic commentary in live Snooker broadcasts, lotto and lottery monitoring, and product inspection on an assembly line. Moreover, this work provides some tentative explorations on the recognition of characters printed on non-plane and non-rigid surfaces.

Footnotes

  1. In our system, the meaning of “real-time” is that no fewer than 5 frames/pictures per second can be processed with 3–6 balls appearing in each frame/picture. In this way, the operation of the system is not affected by the processing speed.

  2. Basler industrial camera, camera model: scA1000-30gc.

  3. Here the polar image is reshaped to a column vector. In the rest of the paper, with a slight abuse of notation, we use the polar image to denote its column vector.

  4.

  5. The nearest centroid classifier is used here.

Notes

Acknowledgments

The authors are grateful to the anonymous reviewers for their excellent reviews and constructive comments that helped to improve the manuscript and our system. This work is supported by 973 Program (No.2010CB327904) and Natural Science Foundations (No.61071218) of China.

References

  1. 1.
    Aharon, M., Elad, M., Bruckstein, A.: The k-svd: An algorithm for designing of overcomplete dictionaries for sparse representation. IEEE Trans. Signal Process. 54(11), 4311–4322 (2006)CrossRefGoogle Scholar
  2. 2.
    Bradley, D.M., Bagnell, J.A.: Differentiable sparse coding. Adv. Neural Inform. Process. Syst. (NIPS) (2008)Google Scholar
  3. 3.
    Breiman, L.: Better subset regression using the nonnegative garrote. Technometrics 37(4), 373–384 (1995)MathSciNetMATHCrossRefGoogle Scholar
  4. 4.
    Cheng, L., Wang, D., Deng, X., Kong, S.: Sparse representation for three-dimensional number ball recognition. In: WRI Global Congress on Intelligent Systems (GCIS) (2010)Google Scholar
  5. 5.
    Elad, M., Aharon, M.: Image denoising via learned dictionaries and sparse representation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2006)Google Scholar
  6. 6.
    Gader, P.D., Khabou, M.A.: Automatic feature generation for handwritten digit recognition. IEEE Trans. Pattern Anal. Mach. Intell. 18(12), 1256–1261 (1996)CrossRefGoogle Scholar
  7. 7.
    Hu, M.K.: Visual pattern recognition by moment invariants. IRE Trans. Inform. Theory 8(2), 179–187 (1962)MATHCrossRefGoogle Scholar
  8. 8.
    Huang, T., Wang, D., Cheng, L., Deng, X.: Number ball recognition at arbitrary pose using multiple view instances. In: IEEE Youth Conference on Information, Computing and Telecommunication, pp. 510–513 (2009)Google Scholar
  9. 9.
    Johnson, A., Hebert, M.: Using spin images for efficient object recognition in cluttered 3d scenes. IEEE Trans. Pattern Anal. Mach. Intell. 21(5), 433–449 (1999)CrossRefGoogle Scholar
  10. 10.
    Kong, S., Wang, D.: A dictionary learning approach for classification: separating the particularity and the commonality. In: European Conference on Computer Vision (ECCV) (2012)Google Scholar
  11. 11.
    Kupeev, K.Y., Wolfson, H.J.: A new method of estimating shape similarity. Pattern Recognit. Lett. 17(8), 873–887 (1996)CrossRefGoogle Scholar
  12. 12.
    Kurita, T., Hotta, K., Mishima, T.: Scale and rotation invariant recognition method using higher-order local autocorrelation features of log-polar image. In: Asian Conference on Computer Vision (ACCV), pp 89–96 (1998)Google Scholar
  13. 13.
    Lazebnik, S., Schmid, C., Ponce, J.: A sparse texture representation using local affine regions. IEEE Trans. Pattern Anal. Mach. Intell. 27(8), 1265–1278 (2005)CrossRefGoogle Scholar
  14.
    Lee, H., Battle, A., Raina, R., Ng, A.Y.: Efficient sparse coding algorithms. In: Advances in Neural Information Processing Systems (NIPS) (2007)
  15.
    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vision 60(2), 91–110 (2004)
  16.
    Mairal, J., Bach, F., Ponce, J., Sapiro, G., Zisserman, A.: Supervised dictionary learning. In: Advances in Neural Information Processing Systems (NIPS) (2008)
  17.
    Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Trans. Pattern Anal. Mach. Intell. 24(7), 971–987 (2002)
  18.
    Rigamonti, R., Brown, M.A., Lepetit, V.: Are sparse representations really relevant for image classification? In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2011)
  19.
    Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B 58(1), 267–288 (1996)
  20.
    Trier, O.D., Jain, A.K., Taxt, T.: Feature extraction methods for character recognition: a survey. Pattern Recognit. 29(4), 641–662 (1996)
  21.
    Vizireanu, D.N.: Generalizations of binary morphological shape decomposition. J. Electron. Imag. 16(1), 1–6 (2007)
  22.
    Vizireanu, D.N.: Morphological shape decomposition interframe interpolation method. J. Electron. Imag. 17(1), 1–5 (2008)
  23.
    Vizireanu, D.N., Halunga, S., Marghescu, G.: Morphological skeleton decomposition interframe interpolation method. J. Electron. Imag. 19(2), 1–3 (2010)
  24.
    Wang, D., Cui, C., Wu, Z.: Matching 3D models with global geometric feature map. In: International Conference on Multi-media Modelling (MMM) (2006)
  25.
    Wang, D., Qian, H.: 3D object recognition by fast spherical correlation between combined view EGIs and PFT. In: IAPR International Conference on Pattern Recognition (ICPR), pp. 1–4 (2008)
  26.
    Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust face recognition via sparse representation. IEEE Trans. Pattern Anal. Mach. Intell. 31(2), 210–227 (2009)
  27.
    Yang, L., Albregtsen, F.: Fast computation of invariant geometric moments: a new method giving correct results. In: IAPR International Conference on Pattern Recognition (ICPR), pp. 201–204 (1994)
  28.
    Yu, D., Yan, H.: Reconstruction of broken handwritten digits based on structural morphological features. Pattern Recognit. 34(2), 235–254 (1999)
  29.
    Zhang, L., Yang, M., Feng, X.: Sparse representation or collaborative representation: which helps face recognition? In: IEEE International Conference on Computer Vision (ICCV) (2011)
  30.
    Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)
  31.
    Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  1. Department of Computer Science and Technology, Zhejiang University, Hangzhou, China
