Semantic Text Segment Classification of Structured Technical Content

Höllig, Julian; Dufter, Philipp; Geierhos, Michaela; Ziegler, Wolfgang; Schütze, Hinrich

doi:10.1007/978-3-030-80599-9_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12801)

Included in the following conference series:

International Conference on Applications of Natural Language to Information Systems

Abstract

Semantic tagging in technical documentation is an important but error-prone process whose objective is to produce highly structured content for automated processing and standardized information delivery. Its benefits are consistent and didactically optimized documents, supported by professional and automatic styling for multiple target media. Using machine learning to automate the validation of the tagging process is a novel approach, for which we provide a new, high-quality dataset in ready-to-use training, validation and test sets. In a series of experiments, we classified ten different semantic text segment types using both traditional and deep learning models. The experiments show partial success, with high accuracy but relatively low macro-average performance. This can be attributed to a combination of strong class imbalance and high semantic and linguistic similarity among certain text types. Adding a set of context features increased model performance significantly. Although the data was collected to serve a specific use case, further valuable research can be performed in the areas of document engineering, class imbalance reduction, and semantic text classification.
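
The gap between accuracy and macro-averaged scores described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three label names and the 90/5/5 split are hypothetical and not taken from the dataset, and the "classifier" is a trivial majority-class baseline rather than any model from the experiments. It only shows why, under strong class imbalance, accuracy can stay high while the macro-averaged F1 score is low.

```python
def accuracy(y_true, y_pred):
    """Fraction of segments whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (every class counts equally)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical, heavily imbalanced toy labels (not the paper's data).
y_true = ["paragraph"] * 90 + ["warning"] * 5 + ["instruction"] * 5

# Hypothetical baseline that always predicts the majority class.
y_pred = ["paragraph"] * len(y_true)

print(f"accuracy: {accuracy(y_true, y_pred):.3f}")
print(f"macro F1: {macro_f1(y_true, y_pred):.3f}")
```

Because the macro average weights each class equally, the two minority classes that the baseline never predicts pull the score down, even though nine out of ten segments are labeled correctly.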

Acknowledgments

This work was supported by the Bavarian Research Institute for Digital Transformation and the European Research Council (#740516).

Author information

Authors and Affiliations

Research Institute CODE, Bundeswehr University Munich, Neubiberg, Germany
Julian Höllig & Michaela Geierhos
Center for Language and Information Processing, LMU Munich, Munich, Germany
Philipp Dufter & Hinrich Schütze
Information Management and Media, Karlsruhe University of Applied Sciences, Karlsruhe, Germany
Wolfgang Ziegler

Corresponding author

Correspondence to Julian Höllig.

Editor information

Editors and Affiliations

Conservatoire National des Arts et Métiers, Paris, France
Elisabeth Métais
University of Derby, Derby, UK
Farid Meziane
German Research Center for Artificial Intelligence, Saarbrücken, Germany
Helmut Horacek
University of Hertfordshire, Hatfield, UK
Epaminondas Kapetanios

About this paper

Cite this paper

Höllig, J., Dufter, P., Geierhos, M., Ziegler, W., Schütze, H. (2021). Semantic Text Segment Classification of Structured Technical Content. In: Métais, E., Meziane, F., Horacek, H., Kapetanios, E. (eds) Natural Language Processing and Information Systems. NLDB 2021. Lecture Notes in Computer Science, vol 12801. Springer, Cham. https://doi.org/10.1007/978-3-030-80599-9_15

DOI: https://doi.org/10.1007/978-3-030-80599-9_15
Published: 20 June 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80598-2
Online ISBN: 978-3-030-80599-9
eBook Packages: Computer Science, Computer Science (R0)
