Identifying Logical Structure and Content Structure in Loosely-Structured Documents

Chapter in: Linguistic Modeling of Information and Markup Languages

Part of the book series: Text, Speech and Language Technology (TLTB, volume 41)

Abstract

Text documents are structured on (at least) two separate levels: The “logical” structure is largely reflected in the layout (headlines, paragraphs, etc.), and the “content” structure specifies the functional zones that serve a part of the text’s overall communicative purpose. The latter is clearly genre-specific, whereas the former is independent of the particular text genre. In this chapter, we describe an approach to identifying both structural levels automatically. For content structure, we focus on the genre “film review”: Based on a corpus study, we propose an inventory of zone labels, and describe our method for identifying these zones, using a hybrid approach that makes use of both symbolic rules and statistical (bag-of-words) classification.
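To make the idea of a hybrid zone labeler more concrete, the following is a minimal sketch and not the authors' implementation. It assumes that symbolic rules are tried first and that a bag-of-words classifier (here a small multinomial Naive Bayes model with add-one smoothing) handles paragraphs that no rule matches. The zone labels "credits", "plot", and "evaluation", the rule pattern, the training sentences, and all function names are hypothetical placeholders; the abstract does not specify the chapter's actual label inventory, rules, or classifier.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical symbolic rule: a paragraph that opens with a credit keyword
# is labeled "credits". The chapter's real rules are not given in the abstract.
RULES = [
    (re.compile(r"^\s*(directed by|director|cast|starring)\b", re.IGNORECASE), "credits"),
]


def tokenize(text):
    """Lowercase the text and return its word tokens (the bag of words)."""
    return re.findall(r"[a-z']+", text.lower())


class BagOfWordsNB:
    """Multinomial Naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def fit(self, examples):
        for text, label in examples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def label_zone(paragraph, classifier):
    """Apply the symbolic rules first; fall back to the statistical classifier."""
    for pattern, label in RULES:
        if pattern.search(paragraph):
            return label
    return classifier.predict(paragraph)


# Toy training data, invented purely for illustration.
TRAIN = [
    ("the story follows a young detective who uncovers a conspiracy", "plot"),
    ("a widow travels to the coast and meets a stranger", "plot"),
    ("a brilliant and moving film with superb acting", "evaluation"),
    ("dull pacing and a weak script make this a disappointing film", "evaluation"),
]

clf = BagOfWordsNB().fit(TRAIN)
```

Calling `label_zone("Directed by Jane Doe", clf)` is decided by the rule alone, while a paragraph such as "a superb and moving script" falls through to the classifier. Ordering rules before statistics is one plausible design choice; a real system might instead combine both signals or use rule output as classifier features.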

The work reported in this chapter originated when one of the authors (A. Suriyawongkul) was a researcher at Potsdam University.

Acknowledgments

The work reported in this chapter was funded by the German Federal Ministry of Education and Research, grant 03WKH22. We thank Heike Bieler and Stefanie Dipper for their contributions to the SUMMaR project, and Annika Neumann and Andreas Peldszus for their help with defining tag sets and performing annotations.

Author information

Corresponding author

Correspondence to Manfred Stede.

Copyright information

© 2010 Springer Science+Business Media B.V.

About this chapter

Cite this chapter

Stede, M., Suriyawongkul, A. (2010). Identifying Logical Structure and Content Structure in Loosely-Structured Documents. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_5

  • DOI: https://doi.org/10.1007/978-90-481-3331-4_5

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-90-481-3330-7

  • Online ISBN: 978-90-481-3331-4

  • eBook Packages: Computer Science, Computer Science (R0)
