Text Categorization Improvement via User Interaction

  • Jakub Atroszko
  • Julian Szymański
  • David Gil
  • Higinio Mora
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10842)

Abstract

In this paper, we propose an approach to improving text categorization through interaction with the user. The quality of categorization is defined in terms of the distribution of objects related to the classes, projected onto self-organizing maps (SOM). For the experiments, we use articles and categories from a subset of Simple Wikipedia. We test three different approaches to text representation. As a baseline, we use Bag-of-Words with Term Frequency-Inverse Document Frequency (TF-IDF) weighting, which serves as the reference for evaluating neural representations of words and documents: Word2Vec and Paragraph Vector. In the representation, we identify the subsets of features that are most useful for differentiating classes. These are presented to the user, whose selection increases the coherence of the articles that belong to the same category, so that they lie close together on the SOM.
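The baseline representation and the idea of surfacing class-differentiating features can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state which TF-IDF variant or which feature-selection criterion was used, so the plain `tf * log(N / df)` weighting, the mean-difference score, the function names `tfidf` and `discriminative_terms`, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: relative term frequency times log(N / df).
    """
    n = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return vectors


def discriminative_terms(vectors, labels, target, top_k=3):
    """Rank terms by mean weight inside `target` minus mean weight outside.

    This score is a hypothetical stand-in for a feature-selection
    criterion; the ranked terms are what a user could be asked to confirm.
    """
    inside = [v for v, y in zip(vectors, labels) if y == target]
    outside = [v for v, y in zip(vectors, labels) if y != target]
    vocab = {term for v in vectors for term in v}

    def mean(group, term):
        if not group:
            return 0.0
        return sum(v.get(term, 0.0) for v in group) / len(group)

    scores = {t: mean(inside, t) - mean(outside, t) for t in vocab}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy corpus standing in for articles and their categories.
docs = [
    "the cat chased the mouse".split(),
    "the dog chased the cat".split(),
    "the planet orbits the star".split(),
    "the star is a distant sun".split(),
]
labels = ["animals", "animals", "astronomy", "astronomy"]

vectors = tfidf(docs)
candidates = discriminative_terms(vectors, labels, "astronomy")
print(candidates)
```

A term such as "the", which occurs in every document, receives zero weight and never reaches the candidate list, while terms concentrated in one category rise to the top. In an interactive setting, the user would accept or reject these candidates before the documents are projected onto the map.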

Keywords

Text representation, Document categorization, Wikipedia, Word2Vec, Paragraph vector, Self-organizing maps

Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Jakub Atroszko (1)
  • Julian Szymański (1), corresponding author
  • David Gil (2)
  • Higinio Mora (2)

  1. Department of Computer Systems Architecture, Gdansk University of Technology, Gdańsk, Poland
  2. Department of Computer Science Technology and Computation, University of Alicante, Alicante, Spain