Abstract
Text documents are structured on (at least) two separate levels: The “logical” structure is largely reflected in the layout (headlines, paragraphs, etc.), and the “content” structure specifies the functional zones that serve a part of the text’s overall communicative purpose. The latter is clearly genre-specific, whereas the former is independent of the particular text genre. In this chapter, we describe an approach to identifying both structural levels automatically. For content structure, we focus on the genre “film review”: Based on a corpus study, we propose an inventory of zone labels, and describe our method for identifying these zones, using a hybrid approach that makes use of both symbolic rules and statistical (bag-of-words) classification.
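The hybrid idea named in the abstract (symbolic rules combined with bag-of-words classification) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the zone labels ("credits", "rating", "plot", "evaluation"), the two regular-expression rules, and the function and class names are hypothetical, and the multinomial naive Bayes model is only a simple stand-in for whatever statistical classifier the chapter actually uses.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical symbolic rules: (pattern, zone label). Not the chapter's rule set.
RULES = [
    (re.compile(r"\b(directed by|starring|running time)\b", re.I), "credits"),
    (re.compile(r"\b\d(\.\d)?\s*(/|out of)\s*\d+\b", re.I), "rating"),
]


def tokenize(text):
    """Lowercase the text and split it into a bag of word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class BagOfWordsClassifier:
    """Multinomial naive Bayes with add-one smoothing (a stand-in model)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> token counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def train(self, examples):
        for text, label in examples:
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.label_counts[label] += 1
            self.vocab.update(tokens)

    def predict(self, text):
        tokens = tokenize(text)
        total = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.label_counts.items():
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            score = math.log(n / total)  # log prior
            for tok in tokens:
                score += math.log((self.word_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def label_zone(paragraph, classifier):
    """Apply the symbolic rules first; fall back to the statistical model."""
    for pattern, label in RULES:
        if pattern.search(paragraph):
            return label
    return classifier.predict(paragraph)
```

In this sketch the rules take precedence and the classifier only handles paragraphs that no rule covers. That ordering is one possible design choice among several; the abstract does not say how the two components are combined.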
The work reported in this chapter originated when one of the authors (A. Suriyawongkul) was a researcher at Potsdam University.
Acknowledgments
The work reported in this chapter was funded by the German Federal Ministry of Education and Research, grant 03WKH22. We thank Heike Bieler and Stefanie Dipper for their contributions to the SUMMaR project, and Annika Neumann and Andreas Peldszus for their help with defining tag sets and performing annotations.
Copyright information
© 2010 Springer Science+Business Media B.V.
About this chapter
Cite this chapter
Stede, M., Suriyawongkul, A. (2010). Identifying Logical Structure and Content Structure in Loosely-Structured Documents. In: Witt, A., Metzing, D. (eds) Linguistic Modeling of Information and Markup Languages. Text, Speech and Language Technology, vol 41. Springer, Dordrecht. https://doi.org/10.1007/978-90-481-3331-4_5
DOI: https://doi.org/10.1007/978-90-481-3331-4_5
Publisher Name: Springer, Dordrecht
Print ISBN: 978-90-481-3330-7
Online ISBN: 978-90-481-3331-4
eBook Packages: Computer Science (R0)