Abstract
This paper addresses the development of a text corpus for the automatic generation of an ontology of the historical shipbuilding domain. Standard methods of analysis produce unsatisfactory results because of the limited range of available texts and the lexical evolution of the language. In this work, a parser developed by the authors is used for lemmatization and word-sense disambiguation. The parser relies on an external classifier and provides an unambiguous mapping between each lexeme and a class. Documents are represented as vectors in the topic space. Experiments show that the proposed categorization method produces results very close to expert opinion while remaining sufficiently robust to historical changes in vocabulary.
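As an illustration only, the idea of representing documents as vectors in a topic space can be sketched as follows. The abstract does not describe the authors' parser or classifier in detail, so everything here is an assumption: the lexeme-to-class table `LEXEME_CLASS`, the topic list `TOPICS`, and the helper names `topic_vector` and `cosine` are hypothetical, and cosine similarity is simply one common way to compare such vectors, not necessarily the measure used in the paper.

```python
from collections import Counter
from math import sqrt

# Hypothetical lexeme-to-class table standing in for the external classifier.
# Each lemma maps to exactly one class, mirroring the "unambiguous" mapping.
LEXEME_CLASS = {
    "keel": "hull",
    "frame": "hull",
    "plank": "hull",
    "mast": "rigging",
    "sail": "rigging",
    "yard": "rigging",
    "cannon": "armament",
}

# Hypothetical topic axes; the order fixes the vector layout.
TOPICS = ["hull", "rigging", "armament"]


def topic_vector(lemmas):
    """Return the share of classified lemmas falling into each topic."""
    counts = Counter(LEXEME_CLASS[l] for l in lemmas if l in LEXEME_CLASS)
    total = sum(counts.values())
    if total == 0:
        # No known lemma: the document has no position in the topic space.
        return [0.0] * len(TOPICS)
    return [counts[t] / total for t in TOPICS]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy documents, already lemmatized (lemmatization itself is out of scope here).
doc_a = ["keel", "frame", "plank", "mast"]
doc_b = ["mast", "sail", "yard", "keel"]
doc_c = ["keel", "plank", "frame", "frame"]

vec_a, vec_b, vec_c = (topic_vector(d) for d in (doc_a, doc_b, doc_c))
print(vec_a, vec_b, vec_c)
print(cosine(vec_a, vec_c), cosine(vec_b, vec_c))
```

Because the vectors depend on classes rather than on surface word forms, two texts that use different period-specific words for the same class would still land close together, which is one plausible reading of the robustness to vocabulary change claimed above.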
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Artemova, G. et al. (2014). Text Categorization for Generation of a Historical Shipbuilding Ontology. In: Klinov, P., Mouromtsev, D. (eds) Knowledge Engineering and the Semantic Web. KESW 2014. Communications in Computer and Information Science, vol 468. Springer, Cham. https://doi.org/10.1007/978-3-319-11716-4_1
DOI: https://doi.org/10.1007/978-3-319-11716-4_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11715-7
Online ISBN: 978-3-319-11716-4
eBook Packages: Computer Science, Computer Science (R0)