Text Categorization for Generation of a Historical Shipbuilding Ontology

  • Galina Artemova
  • Kirill Boyarsky
  • Dmitri Gouzévitch
  • Natalia Gusarova
  • Natalia Dobrenko
  • Eugeny Kanevsky
  • Daria Petrova
Part of the Communications in Computer and Information Science book series (CCIS, volume 468)

Abstract

This paper addresses the task of building a text corpus for the automatic generation of a domain ontology of historical shipbuilding. Standard analysis methods give unsatisfactory results because the range of available texts is limited and the vocabulary has evolved over time. In this work, a parser developed by the authors is used for lemmatization and word-sense disambiguation. The parser relies on an external classifier and assigns each lexeme to exactly one class. Documents are represented as vectors in the topic space. Experiments show that the proposed categorization method produces results very close to expert judgment while remaining reasonably robust to historical changes in vocabulary.
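
The idea of representing a document as a vector in the topic space can be illustrated with a minimal sketch. It is not the authors' implementation: the lexeme-to-class table `LEXEME_CLASS`, the class names, the prototype vectors, and the use of cosine similarity are all assumptions made for illustration, and the input is assumed to be already lemmatized and disambiguated.

```python
from collections import Counter
from math import sqrt

# Hypothetical lexeme-to-class table standing in for the external classifier;
# each lemma maps to exactly one class, as the abstract describes.
LEXEME_CLASS = {
    "hull": "ship_structure",
    "keel": "ship_structure",
    "mast": "rigging",
    "sail": "rigging",
    "cannon": "armament",
}
# Fixed coordinate order of the topic space (alphabetical).
TOPICS = sorted(set(LEXEME_CLASS.values()))


def topic_vector(lemmas):
    """Return the share of each class among the lemmas found in the table."""
    counts = Counter(LEXEME_CLASS[l] for l in lemmas if l in LEXEME_CLASS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(TOPICS)
    return [counts[t] / total for t in TOPICS]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def categorize(lemmas, prototypes):
    """Pick the category whose (assumed) prototype vector is closest."""
    vec = topic_vector(lemmas)
    return max(prototypes, key=lambda name: cosine(vec, prototypes[name]))
```

In this sketch, period-specific synonyms that the table assigns to the same class add to the same coordinate, which is one plausible reading of why a class-based representation could tolerate vocabulary change; the paper itself should be consulted for the actual method.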

Keywords

Text categorization; historical shipbuilding; domain ontology; parsing; space of topics


Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Galina Artemova (1)
  • Kirill Boyarsky (1)
  • Dmitri Gouzévitch (2)
  • Natalia Gusarova (1)
  • Natalia Dobrenko (1)
  • Eugeny Kanevsky (3)
  • Daria Petrova (1)

  1. Saint Petersburg National Research University of Information Technologies, Mechanics and Optics (ITMO University), Saint Petersburg, Russia
  2. Centre d’Etudes du Monde russe, caucasienne centre-européen, École des hautes études en sciences sociales, Paris, France
  3. Saint Petersburg Institute for Economics and Mathematics, Russian Academy of Sciences, Saint Petersburg, Russia
