Abstract
Correct understanding of document structures is vital to enhancing various practices and applications, such as authoring and information retrieval. In this paper, focusing on procedural documents in automobile repair manuals, we report a formalisation of document structure and an experiment on automatically classifying sentences according to that structure. To formalise document structure, we first investigated what types of content are included in the target documents. Through manual annotation, we identified 26 indicative content categories (content elements). We then employed an existing standard for technical documents, the Darwin Information Typing Architecture (DITA), and formalised a detailed document structure for automobile repair tasks that specifies the arrangement of the content elements. We also examined the feasibility of automatically recognising content elements in given documents by implementing sentence classifiers using multiple machine learning methods. The evaluation results revealed that support vector machine-based classifiers generally demonstrated high performance for in-domain data, while convolutional neural network-based classifiers were advantageous for out-of-domain data. We also conducted in-depth analyses to identify classification difficulties.
Notes
1. We used the PRIUS Repair Manual (November 2017 version) provided by Toyota Motor Corporation.
2. The feasibility of this annotation scheme will be validated by measuring inter-rater agreement in future work.
3. For Japanese word segmentation, MeCab ver.0.996 (http://taku910.github.io/mecab) and mecab-ipa-NEologd (https://github.com/neologd/mecab-ipadic-neologd) were used. Although we tried three feature settings (uni-grams, bi-grams and uni-grams+bi-grams), in this paper we report only the uni-grams+bi-grams setting, which achieved the highest score.
4. We used publicly available pre-trained vectors (http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/).
5. The F-score is the harmonic mean of precision and recall.
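As a small illustration of Note 5, the sketch below computes precision, recall and their harmonic mean for one label. The function names and the labels `procedure` and `caution` are hypothetical, chosen only for this example; they are not the paper's content elements or its evaluation code.

```python
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[str], predicted: Sequence[str],
                     label: str) -> Tuple[float, float]:
    """Per-label precision and recall from aligned gold/predicted labels."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Hypothetical sentence labels, for illustration only.
gold = ["procedure", "caution", "procedure", "procedure"]
pred = ["procedure", "procedure", "procedure", "caution"]
p, r = precision_recall(gold, pred, "procedure")
f = f_score(p, r)
```

Because the harmonic mean is dominated by the smaller of the two values, a classifier cannot obtain a high F-score by maximising only precision or only recall.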
Acknowledgement
This work was partly supported by JSPS KAKENHI Grant Numbers 17H06733 and 19H05660. The automobile manuals used in this study were provided by Toyota Motor Corporation.
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Sugino, H., Miyata, R., Sato, S. (2019). Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_23
DOI: https://doi.org/10.1007/978-3-030-34058-2_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34057-5
Online ISBN: 978-3-030-34058-2
eBook Packages: Computer Science (R0)