A Bayesian Network Approach to Semantic Labelling of Text Formatting in XML Corpora of Documents

  • Florendia Fourli-Kartsouni
  • Kostas Slavakis
  • Georgios Kouroupetroglou
  • Sergios Theodoridis
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4556)

Abstract

The widespread application of document digitization has led to the use of structured digital representation methods such as the XML language. Methods that extract formatting metadata from such structured documents can enhance their accessibility, for example through augmented audio representation of documents. To the best of our knowledge, no effort has yet been made to build a system that automatically extracts the semantic information carried by document formatting from the document layout alone, without the use of natural language processing. In this study, a corpus of XML representations of several issues of a Greek newspaper is used to create and evaluate a semantic classifier of text formatting based on Bayesian Networks.

Keywords

document accessibility, document analysis, semantic labeling

Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Florendia Fourli-Kartsouni (1)
  • Kostas Slavakis (1)
  • Georgios Kouroupetroglou (1)
  • Sergios Theodoridis (1)
  1. Department of Informatics and Telecommunications, University of Athens, GR-15784, Athens, Greece
