Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals

  • Hodai Sugino
  • Rei Miyata
  • Satoshi Sato
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11853)


A correct understanding of document structure is vital to enhancing various practices and applications, such as authoring and information retrieval. In this paper, focusing on procedural documents in automobile repair manuals, we report a formalisation of document structure and an experiment on automatically classifying sentences according to that structure. To formalise document structure, we first investigated what types of content the target documents include. Through manual annotation, we identified 26 indicative content categories (content elements). We then employed an existing standard for technical documents, the Darwin Information Typing Architecture (DITA), and formalised a detailed document structure for automobile repair tasks that specifies the arrangement of the content elements. We also examined the feasibility of automatically recognising content elements in given documents by implementing sentence classifiers using multiple machine learning methods. The evaluation results revealed that support vector machine-based classifiers generally demonstrated high performance for in-domain data, while convolutional neural network-based classifiers were advantageous for out-of-domain data. We also conducted in-depth analyses to identify classification difficulties.


Keywords: Document structure, Genre analysis, Manual annotation, Automatic text classification



This work was partly supported by JSPS KAKENHI Grant Numbers 17H06733 and 19H05660. The automobile manuals used in this study were provided by Toyota Motor Corporation.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Nagoya University, Nagoya, Japan
