Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals

Sugino, Hodai; Miyata, Rei; Sato, Satoshi

doi:10.1007/978-3-030-34058-2_23

Conference paper
First Online: 29 October 2019

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11853)

Abstract

Correct understanding of document structures is vital to enhancing various practices and applications, such as authoring and information retrieval. In this paper, focusing on procedural documents in automobile repair manuals, we report a formalisation of document structure and an experiment on automatically classifying sentences according to that structure. To formalise the document structure, we first investigated what types of content the target documents contain. Through manual annotation, we identified 26 indicative content categories (content elements). We then employed an existing standard for technical documents, the Darwin Information Typing Architecture (DITA), and formalised a detailed document structure for automobile repair tasks that specifies the arrangement of the content elements. We also examined the feasibility of automatically recognising content elements in given documents by implementing sentence classifiers using multiple machine learning methods. The evaluation results revealed that support vector machine-based classifiers generally demonstrated high performance on in-domain data, while convolutional neural network-based classifiers were advantageous for out-of-domain data. We also conducted in-depth analyses to identify classification difficulties.

Notes

1. We used the PRIUS Repair Manual (November 2017 version) provided by Toyota Motor Corporation.
2. The feasibility of this annotation scheme will be validated by measuring inter-rater agreement in future work.
3. For Japanese word segmentation, MeCab ver. 0.996 (http://taku910.github.io/mecab) and mecab-ipadic-NEologd (https://github.com/neologd/mecab-ipadic-neologd) were used. Although we tried three feature settings (uni-grams, bi-grams and uni-grams+bi-grams), we report only the uni-grams+bi-grams setting, which achieved the highest score.
4. We used publicly available pre-trained vectors: http://www.cl.ecei.tohoku.ac.jp/~m-suzuki/jawiki_vector/
5. The F-score is the harmonic mean of the precision and recall.
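
As an illustration of Notes 3 and 5, the following is a minimal sketch, not the authors' implementation. It assumes the sentence has already been segmented into tokens (the paper used MeCab for this step; the English tokens here are purely hypothetical), and the function names `ngram_features` and `f_score` are introduced only for this example.

```python
from collections import Counter


def ngram_features(tokens, n_values=(1, 2)):
    """Count word n-grams over a pre-segmented token list.

    With the default n_values=(1, 2) this yields the combined
    uni-grams+bi-grams feature setting mentioned in Note 3.
    """
    counts = Counter()
    for n in n_values:
        # Slide a window of length n over the token sequence.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def f_score(precision, recall):
    """Harmonic mean of precision and recall (Note 5).

    Returns 0.0 when both inputs are zero, a convention assumed here
    to avoid division by zero.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical usage: tokens stand in for the output of a word segmenter.
features = ngram_features(["remove", "the", "bolt"])
score = f_score(0.5, 1.0)
```

The resulting counts could be fed to any bag-of-n-grams classifier; which classifier and weighting scheme the authors paired with these features is not specified in this section.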

Acknowledgement

This work was partly supported by JSPS KAKENHI Grant Numbers 17H06733 and 19H05660. The automobile manuals used in this study were provided by Toyota Motor Corporation.

Author information

Authors and Affiliations

Nagoya University, Nagoya, Japan
Hodai Sugino, Rei Miyata & Satoshi Sato

Corresponding authors

Correspondence to Hodai Sugino or Rei Miyata.

Editor information

Editors and Affiliations

Kyoto University, Kyoto, Japan
Adam Jatowt
Ritsumeikan University, Kusatsu, Japan
Akira Maeda
The Catholic University of America, Washington, DC, USA
Sue Yeon Syn

About this paper

Cite this paper

Sugino, H., Miyata, R., Sato, S. (2019). Formalising Document Structure and Automatically Recognising Document Elements: A Case Study on Automobile Repair Manuals. In: Jatowt, A., Maeda, A., Syn, S. (eds) Digital Libraries at the Crossroads of Digital Information for the Future. ICADL 2019. Lecture Notes in Computer Science, vol 11853. Springer, Cham. https://doi.org/10.1007/978-3-030-34058-2_23

DOI: https://doi.org/10.1007/978-3-030-34058-2_23
Published: 29 October 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34057-5
Online ISBN: 978-3-030-34058-2
eBook Packages: Computer Science, Computer Science (R0)
