Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals
Abstract
Correct understanding of document structures is vital to enhancing various practices and applications, such as authoring and information retrieval. In this paper, with a focus on procedural documents of automobile repair manuals, we report a formalisation of document structure and an experiment to automatically classify sentences according to document structure. To formalise document structure, we first investigated what types of content are included in the target documents. Through manual annotation, we identified 26 indicative content categories (content elements). We then employed an existing standard for technical documents, the Darwin Information Typing Architecture, and formalised a detailed document structure for automobile repair tasks that specifies the arrangement of the content elements. We also examined the feasibility of automatic content element recognition in given documents by implementing sentence classifiers using multiple machine learning methods. The evaluation results revealed that support vector machine-based classifiers generally demonstrated high performance for in-domain data, while convolutional neural network-based classifiers were advantageous for out-of-domain data. We also conducted in-depth analyses to identify classification difficulties.
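To make the idea of classifying sentences into content elements concrete, the following is a minimal sketch and not the classifiers evaluated in the paper. It assumes bag-of-words features and trains one-vs-rest linear classifiers on the hinge loss (the loss that linear support vector machines minimise), leaving out the regularisation term and kernels of a full SVM. The three labels, the toy sentences, and all function names are hypothetical placeholders; the abstract does not list the 26 content elements.

```python
import re
from collections import Counter

# Hypothetical toy data: these three labels are placeholders,
# not the 26 content elements identified in the paper.
TRAIN = [
    ("Remove the front bumper cover.", "action"),
    ("Install the oil filter.", "action"),
    ("Disconnect the battery cable.", "action"),
    ("Caution: wear protective gloves.", "caution"),
    ("Caution: do not touch the hot engine.", "caution"),
    ("Caution: keep fuel away from open flames.", "caution"),
    ("Torque: 25 Nm.", "spec"),
    ("Torque: 40 Nm for the wheel nuts.", "spec"),
]


def featurise(sentence):
    """Bag-of-words counts over lower-cased tokens, plus a constant bias feature."""
    feats = Counter(re.findall(r"[a-z0-9]+", sentence.lower()))
    feats["<bias>"] = 1
    return feats


def score(weights, feats):
    """Dot product between a sparse weight vector and sparse features."""
    return sum(weights.get(tok, 0.0) * val for tok, val in feats.items())


def train_hinge(examples, max_epochs=1000):
    """Binary linear classifier: subgradient steps on the hinge loss max(0, 1 - y * w.x).

    Stops once every training example has margin >= 1 (or after max_epochs).
    """
    weights = {}
    for _ in range(max_epochs):
        updated = False
        for feats, y in examples:  # y is +1 or -1
            if y * score(weights, feats) < 1.0:
                for tok, val in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + y * val
                updated = True
        if not updated:
            break
    return weights


def train_one_vs_rest(data):
    """Train one binary classifier per label (one-vs-rest)."""
    featurised = [(featurise(text), label) for text, label in data]
    labels = sorted({label for _, label in data})
    return {
        target: train_hinge([(f, 1 if lab == target else -1) for f, lab in featurised])
        for target in labels
    }


def classify(models, sentence):
    """Return the label whose classifier gives the highest score."""
    feats = featurise(sentence)
    return max(models, key=lambda label: score(models[label], feats))


models = train_one_vs_rest(TRAIN)
print(classify(models, "Caution: do not remove the radiator cap."))
```

On real manuals, a library SVM implementation with proper regularisation and richer features would be the more plausible choice; the sketch only shows the shape of the task, which is one sentence in and one content element label out.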
Keywords
Document structure, Genre analysis, Manual annotation, Automatic text classification
Notes
Acknowledgement
This work was partly supported by JSPS KAKENHI Grant Numbers 17H06733 and 19H05660. The automobile manuals used in this study were provided by Toyota Motor Corporation.