Abstract
Semantic tagging in technical documentation is an important but error-prone process whose objective is to produce highly structured content for automated processing and standardized information delivery. Its benefits are consistent and didactically optimized documents, supported by professional, automatic styling for multiple target media. Using machine learning to automate the validation of the tagging process is a novel approach, for which we provide a new, high-quality dataset in ready-to-use training, validation, and test sets. In a series of experiments, we classified ten different semantic text segment types using both traditional and deep learning models. The experiments show partial success, with high accuracy but relatively low macro-average performance. This can be attributed to a combination of strong class imbalance and high semantic and linguistic similarity among certain text types. Adding a set of context features increased model performance significantly. Although the data was collected to serve a specific use case, further valuable research can be performed in the areas of document engineering, class imbalance reduction, and semantic text classification.
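The gap between high accuracy and low macro-average performance under class imbalance can be illustrated with a minimal sketch. The label names and counts below are hypothetical toy values, not taken from the paper's dataset, and the always-predict-the-majority classifier is an assumption chosen only to make the effect visible.

```python
def accuracy(y_true, y_pred):
    """Fraction of segments whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so rare classes count as much as frequent ones."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1_scores.append(2 * precision * recall / (precision + recall))
        else:
            f1_scores.append(0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical, strongly imbalanced toy data: one dominant segment type.
y_true = ["paragraph"] * 90 + ["warning"] * 5 + ["instruction"] * 5
# Hypothetical classifier that always predicts the majority class.
y_pred = ["paragraph"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

In this toy setting the accuracy stays high because the dominant class is always correct, while the macro-averaged F1 drops because the two rare classes each contribute a score of zero.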
Acknowledgments
This work was supported by the Bavarian Research Institute for Digital Transformation and the European Research Council (#740516).
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Höllig, J., Dufter, P., Geierhos, M., Ziegler, W., Schütze, H. (2021). Semantic Text Segment Classification of Structured Technical Content. In: Métais, E., Meziane, F., Horacek, H., Kapetanios, E. (eds) Natural Language Processing and Information Systems. NLDB 2021. Lecture Notes in Computer Science, vol 12801. Springer, Cham. https://doi.org/10.1007/978-3-030-80599-9_15
DOI: https://doi.org/10.1007/978-3-030-80599-9_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-80598-2
Online ISBN: 978-3-030-80599-9
eBook Packages: Computer Science (R0)