Abstract
This paper shows how formal document structures, such as paragraphs and tables, can be used to extract information during the automatic recognition of the contents of OpenOffice text documents and HTML documents in a document management project. A natural language processing chain combined with a wrapping procedure produces formal graphs that structure the document-related information space according to a given information model. The combined text and layout analysis is carried out with open source components and aims to represent information as a semantic network in a formal and visualizable manner. Uniting document-related information spaces into thematic domains yields scalable ways of retrieving information and processing knowledge.
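As a rough illustration of the layout side of this idea, the following minimal sketch reads paragraphs and tables from an HTML document and records them as subject-predicate-object triples, one simple way to hold a semantic network. It is not the procedure used in the paper: the class `StructureExtractor`, the function `build_graph`, the predicate names `hasParagraph`, `hasText`, and `hasRecord`, and the sample invoice are all hypothetical, the first table row is assumed to be a header row, and the natural language processing chain is left out entirely.

```python
from html.parser import HTMLParser


class StructureExtractor(HTMLParser):
    """Collects paragraph text and table rows from an HTML document."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # list of paragraph strings
        self.tables = []       # list of tables; each table is a list of rows
        self._buffer = None    # text of the <p>, <td>, or <th> being read
        self._row = None       # cells of the <tr> being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr":
            self._row = []
        elif tag in ("p", "td", "th"):
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in ("p", "td", "th") and self._buffer is not None:
            # Collapse runs of whitespace inside the element.
            text = " ".join("".join(self._buffer).split())
            self._buffer = None
            if tag == "p":
                self.paragraphs.append(text)
            elif self._row is not None:
                self._row.append(text)
        elif tag == "tr" and self._row is not None and self.tables:
            self.tables[-1].append(self._row)
            self._row = None


def build_graph(html, doc_id="doc"):
    """Return a set of (subject, predicate, object) triples for one document."""
    parser = StructureExtractor()
    parser.feed(html)
    parser.close()
    triples = set()
    for i, text in enumerate(parser.paragraphs):
        node = f"{doc_id}/p{i}"
        triples.add((doc_id, "hasParagraph", node))
        triples.add((node, "hasText", text))
    for t, rows in enumerate(parser.tables):
        if not rows:
            continue
        # Assumption: the first row of each table holds the column headers.
        header, body = rows[0], rows[1:]
        for r, row in enumerate(body):
            node = f"{doc_id}/t{t}/r{r}"
            triples.add((doc_id, "hasRecord", node))
            # Pair each cell with the header of its column.
            for key, value in zip(header, row):
                triples.add((node, key, value))
    return triples


# Hypothetical sample document, not taken from the paper.
sample = """
<html><body>
<p>Invoice from Example GmbH.</p>
<table>
<tr><th>item</th><th>price</th></tr>
<tr><td>paper</td><td>4.50</td></tr>
</table>
</body></html>
"""

graph = build_graph(sample, doc_id="doc1")
for triple in sorted(graph):
    print(triple)
```

Because each document graph is a plain set of triples, the graphs of several documents could be merged with a set union, which loosely mirrors the idea of uniting document-related information spaces into a thematic domain.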
Copyright information
© 2005 Springer-Verlag Berlin · Heidelberg
Cite this paper
Rist, U. (2005). Document Management and the Development of Information Spaces. In: Weihs, C., Gaul, W. (eds) Classification — the Ubiquitous Challenge. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-28084-7_62
DOI: https://doi.org/10.1007/3-540-28084-7_62
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25677-9
Online ISBN: 978-3-540-28084-2
eBook Packages: Mathematics and Statistics (R0)