Page Segmentation of Structured Documents Using 2D Stochastic Context-Free Grammars

  • Conference paper

Part of the Lecture Notes in Computer Science book series (LNIP, volume 7887)

Abstract

In this paper we define a two-dimensional extension of Stochastic Context-Free Grammars for page segmentation of structured documents. Two sets of text classification features are used to perform an initial classification of each zone of the page. The page segmentation is then obtained as the most likely hypothesis according to the grammar. This approach is compared to Conditional Random Fields, and the results show significant improvements in several cases. Furthermore, the grammars provide a detailed segmentation that allows a semantic evaluation, which also validates the model.
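
The two-stage idea in the abstract (per-zone classification followed by a search for the most likely grammatical hypothesis) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the toy grammar `RULES`, the emission table `TERMINAL`, the class names `Header` and `Body`, the function `parse`, and the made-up classifier scores in `zones`. The sketch runs a CKY-style search over rectangles of a zone grid, where each rule splits a rectangle either horizontally (side by side) or vertically (top over bottom).

```python
import math
from functools import lru_cache

# Hypothetical toy grammar (not the paper's):
# (parent, first child, second child, direction, rule probability).
# "H" places the children side by side; "V" stacks the first over the second.
RULES = [
    ("Page", "Header", "Body", "V", 1.0),
    ("Header", "Header", "Header", "H", 0.5),
    ("Body", "Body", "Body", "H", 0.25),
    ("Body", "Body", "Body", "V", 0.25),
]
# Probability that a nonterminal rewrites to a single zone of its own class.
TERMINAL = {"Header": 0.5, "Body": 0.5}


def parse(zone_probs, start="Page"):
    """Return (log-probability, label grid) of the most likely derivation."""
    rows, cols = len(zone_probs), len(zone_probs[0])

    @lru_cache(maxsize=None)
    def best(sym, r0, r1, c0, c1):
        # Best derivation of the rectangle [r0, r1) x [c0, c1) from `sym`.
        result = (-math.inf, None)
        if r1 - r0 == 1 and c1 - c0 == 1 and sym in TERMINAL:
            # Single zone: combine the rule probability with the classifier score.
            p = TERMINAL[sym] * zone_probs[r0][c0].get(sym, 0.0)
            if p > 0:
                result = (math.log(p), ((r0, c0, sym),))
        for parent, a, b, direction, p in RULES:
            if parent != sym:
                continue
            if direction == "V":
                splits = [((r0, k, c0, c1), (k, r1, c0, c1)) for k in range(r0 + 1, r1)]
            else:
                splits = [((r0, r1, c0, k), (r0, r1, k, c1)) for k in range(c0 + 1, c1)]
            for first, second in splits:
                score_a, leaves_a = best(a, *first)
                score_b, leaves_b = best(b, *second)
                score = math.log(p) + score_a + score_b
                if score > result[0]:
                    result = (score, leaves_a + leaves_b)
        return result

    score, leaves = best(start, 0, rows, 0, cols)
    if leaves is None:
        return score, None  # no derivation covers the page
    grid = [[None] * cols for _ in range(rows)]
    for r, c, sym in leaves:
        grid[r][c] = sym
    return score, grid


# Made-up classifier output for a 2 x 2 grid of zones.
zones = [
    [{"Header": 0.9, "Body": 0.1}, {"Header": 0.4, "Body": 0.6}],
    [{"Header": 0.2, "Body": 0.8}, {"Header": 0.1, "Body": 0.9}],
]
score, grid = parse(zones)
print(grid)
```

In this toy input the top-right zone is scored as more likely `Body` by the classifier alone, but the only derivation the toy grammar admits labels the whole top row as `Header`. That is the kind of structural correction a grammar can impose on independent per-zone decisions.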

Keywords

  • document segmentation
  • stochastic context-free grammars
  • text classification features

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Álvaro, F., Cruz, F., Sánchez, JA., Terrades, O.R., Benedí, JM. (2013). Page Segmentation of Structured Documents Using 2D Stochastic Context-Free Grammars. In: Sanches, J.M., Micó, L., Cardoso, J.S. (eds) Pattern Recognition and Image Analysis. IbPRIA 2013. Lecture Notes in Computer Science, vol 7887. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38628-2_15

  • DOI: https://doi.org/10.1007/978-3-642-38628-2_15

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-38627-5

  • Online ISBN: 978-3-642-38628-2

  • eBook Packages: Computer Science, Computer Science (R0)