A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines

  • Sławomir Zadrożny
  • Janusz Kacprzyk
  • Marek Gajewski
Chapter
Part of the Studies in Computational Intelligence book series (SCI, volume 634)

Abstract

In our earlier work we introduced the multiaspect text categorization (MTC) task, which has its roots in the practical problem of managing collections of documents at many, if not all, commercial companies and, above all, public institutions. Specifically, it is a well-defined general problem which boils down to the classification of textual documents at two levels: first, to a general category, and second, to a specific sequence of documents within that category. While the former task may be dealt with using standard text categorization techniques, the latter is more challenging, first of all because of the limited number of training documents. On the other hand, it is assumed that there is some natural logic, resulting for instance from rules and regulations, behind the succession of documents within the sequences, and that this logic can be exploited when deciding to which sequence a new document should be assigned. We have studied the MTC problem in a number of papers and proposed some solutions to it. Here we propose a new solution based on support vector machines (SVMs), which are known to be a very effective technique for solving various classification tasks. We consider the application of SVMs in a specific context, determined by the characteristics of the MTC problem and by the specific data set used for the experiments. The use of SVMs has required a new, more sophisticated representation of the documents and their sequences, which has made it possible to obtain promising results in computational experiments. Moreover, the proposed approach is flexible and may be considerably modified and extended to cover many possible versions of the problem.
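
The two-level structure of the MTC task can be illustrated with a minimal sketch. It is not the SVM-based method of the chapter: to stay self-contained it swaps in a plain Jaccard similarity over word sets at both levels, and the function names, the nested-dictionary layout of the collection, and the toy documents are all assumptions made for illustration only.

```python
def tokens(text):
    """Represent a document as the set of its lower-cased words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard index of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pooled_terms(documents):
    """Union of the word sets of several documents."""
    return set().union(*(tokens(d) for d in documents))


def classify_two_level(doc, collection):
    """Assign doc to a (category, sequence) pair.

    collection is a hypothetical layout: {category: {sequence_id: [documents]}}.
    A similarity stand-in is used here where the chapter uses SVMs.
    """
    doc_terms = tokens(doc)

    # Level 1: choose the category whose pooled vocabulary is most similar.
    def category_score(category):
        docs = [d for seq in collection[category].values() for d in seq]
        return jaccard(doc_terms, pooled_terms(docs))

    category = max(collection, key=category_score)

    # Level 2: within that category, choose the most similar sequence.
    sequences = collection[category]
    sequence = max(
        sequences, key=lambda s: jaccard(doc_terms, pooled_terms(sequences[s]))
    )
    return category, sequence


# Toy collection (invented for illustration).
collection = {
    "grants": {
        "g1": ["grant application form budget", "grant review report budget"],
        "g2": ["grant final report publications"],
    },
    "hiring": {
        "h1": ["job offer candidate interview"],
        "h2": ["candidate contract salary"],
    },
}

result = classify_two_level("review of the grant budget", collection)
```

The sketch treats a sequence as an unordered bag of words, so it ignores the order of documents within a sequence, which the abstract identifies as the information to be exploited at the second level.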

Keywords

Support Vector Machine · Text Categorization · Jaccard Index · Training Vector · Novelty Detection

Notes

Acknowledgments

This work is supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).

Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Sławomir Zadrożny (1)
  • Janusz Kacprzyk (1)
  • Marek Gajewski (1)
  1. Systems Research Institute, Polish Academy of Sciences, Warszawa, Poland