Grammar-based techniques for creating ground-truthed sketch corpora

  • Scott MacLean
  • George Labahn
  • Edward Lank
  • Mirette Marzouk
  • David Tausky
Original Paper


Although publicly available ground-truthed corpora have proven useful for training, evaluating, and comparing recognition systems in many domains, few such corpora currently exist for sketch recognizers, and for math recognizers in particular. This paper presents a general approach to creating large, ground-truthed corpora for structured sketch domains such as mathematics. In this approach, random sketch templates are generated automatically from a grammar model of the sketch domain. These templates are transcribed manually, then automatically annotated with ground truth. The annotation procedure uses the generated sketch templates to find a matching between transcribed and generated symbols. A large, ground-truthed corpus of handwritten mathematical expressions presented in the paper illustrates the utility of the approach.
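As a rough illustration of the template-generation step, the following Python sketch derives random expressions from a toy context-free grammar. The grammar, the function name `generate_template`, and the depth limit are assumptions made for this example only; the paper's actual grammar model for two-dimensional mathematical notation is not reproduced here.

```python
import random

# Hypothetical toy grammar (an assumption for illustration, not the paper's model).
# Each nonterminal maps to a list of productions; any symbol that is not a key
# in the dictionary is treated as a terminal.
GRAMMAR = {
    "EXPR": [["TERM"], ["TERM", "+", "EXPR"], ["TERM", "-", "EXPR"]],
    "TERM": [["ATOM"], ["ATOM", "^", "ATOM"], ["(", "EXPR", ")"]],
    "ATOM": [["x"], ["y"], ["2"], ["7"]],
}


def generate_template(symbol, rng, depth=0, max_depth=4):
    """Randomly expand `symbol` into a flat list of terminal symbols."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal: emit as-is
    productions = GRAMMAR[symbol]
    if depth >= max_depth:
        # Past the depth limit, take a shortest production so that the
        # derivation is guaranteed to terminate.
        productions = [min(productions, key=len)]
    production = rng.choice(productions)
    terminals = []
    for child in production:
        terminals.extend(generate_template(child, rng, depth + 1, max_depth))
    return terminals


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run can be repeated
    for _ in range(3):
        print(" ".join(generate_template("EXPR", rng)))
```

Because each template is produced by the generator, the identity of every symbol in it is known before anyone transcribes it. This is what allows ground truth to be attached once transcribed symbols have been matched to generated ones.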


Keywords: Expression Tree, Label Algorithm, Terminal Symbol, Symbol Recognition, Nonterminal Symbol
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  • Scott MacLean (1)
  • George Labahn (1)
  • Edward Lank (1)
  • Mirette Marzouk (1)
  • David Tausky (1)
  1. David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada
