Identifying Logical Structure and Content Structure in Loosely-Structured Documents

Stede, Manfred; Suriyawongkul, Arthit

doi:10.1007/978-90-481-3331-4_5

Part of the book series: Text, Speech and Language Technology (TLTB, volume 41)

Abstract

Text documents are structured on (at least) two separate levels: The “logical” structure is largely reflected in the layout (headlines, paragraphs, etc.), and the “content” structure specifies the functional zones that serve a part of the text’s overall communicative purpose. The latter is clearly genre-specific, whereas the former is independent of the particular text genre. In this chapter, we describe an approach to identifying both structural levels automatically. For content structure, we focus on the genre “film review”: Based on a corpus study, we propose an inventory of zone labels, and describe our method for identifying these zones, using a hybrid approach that makes use of both symbolic rules and statistical (bag-of-words) classification.
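
As a rough illustration of what such a hybrid zone identifier could look like, the following Python sketch tries a few symbolic rules first and falls back to a bag-of-words classifier when no rule fires. Everything specific here is an assumption made for illustration and not taken from the chapter: the zone labels ("credits", "rating", "plot", "evaluation"), the regular expressions, the toy training data, and the choice of a multinomial naive Bayes model as the statistical component.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical symbolic rules: (pattern, zone label). Both the patterns and
# the labels are illustrative assumptions, not the chapter's inventory.
RULES = [
    (re.compile(r"^\s*(directed by|director:|cast:)", re.I), "credits"),
    (re.compile(r"\b\d(\.\d)?\s*(/|out of)\s*\d+\b", re.I), "rating"),
]


def tokenize(text):
    """Lower-case the text and split it into a bag of word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class BagOfWordsNB:
    """Multinomial naive Bayes over word counts, with add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            tokens = tokenize(text)
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            # Log prior plus smoothed log likelihood of each token.
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def label_zone(paragraph, classifier):
    """Apply the symbolic rules first; otherwise use the statistical model."""
    for pattern, label in RULES:
        if pattern.search(paragraph):
            return label
    return classifier.predict(paragraph)


# Toy training data, invented for this sketch.
TOY_EXAMPLES = [
    ("the plot follows a young detective who investigates a murder", "plot"),
    ("the story begins when the hero leaves his village", "plot"),
    ("a brilliant and moving film with superb acting", "evaluation"),
    ("dull boring and disappointing waste of time", "evaluation"),
]

clf = BagOfWordsNB()
clf.train(TOY_EXAMPLES)
```

Calling `label_zone("Directed by Jane Doe", clf)` is decided by a rule alone, whereas a paragraph that matches no rule is labelled by the bag-of-words model. Checking rules before the classifier is one possible way to combine the two components; the abstract does not say how the chapter combines them.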

The work reported in this chapter originated when one of the authors (A. Suriyawongkul) was a researcher at Potsdam University.

Acknowledgments

The work reported in this chapter was funded by the German Federal Ministry of Education and Research, grant 03WKH22. We thank Heike Bieler and Stefanie Dipper for their contributions to the SUMMaR project, and Annika Neumann and Andreas Peldszus for their help with defining tag sets and performing annotations.

Author information

Authors and Affiliations

Universität Potsdam, Potsdam, Germany
Manfred Stede
Faculty of Sociology and Anthropology, Thammasat University and Opendream Labs, Bangkok, Thailand
Arthit Suriyawongkul

Corresponding author

Correspondence to Manfred Stede.

Editor information

Editors and Affiliations

Institut für Deutsche Sprache (IDS), Mannheim, 68161, Germany
Andreas Witt
Fak. Linguistik und, Universität Bielefeld, Universitätsstraße, Bielefeld, 33615, Germany
Dieter Metzing

About this chapter

Cite this chapter

Stede, M., Suriyawongkul, A. (2010). Identifying Logical Structure and Content Structure in Loosely-Structured Documents. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_5

DOI: https://doi.org/10.1007/978-90-481-3331-4_5
Published: 09 November 2009
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3330-7
Online ISBN: 978-90-481-3331-4
eBook Packages: Computer Science, Computer Science (R0)