# A New Approach to the Multiaspect Text Categorization by Using the Support Vector Machines

## Abstract

In our earlier work we introduced the *multiaspect text categorization* (MTC) task, which has its roots in the practical problem of managing document collections at many, if not all, commercial companies and, above all, public institutions. It is a well-defined general problem that boils down to classifying textual documents at two levels: first, to a general category and, second, to a specific sequence of documents within that category. While the former task may be handled with standard text categorization techniques, the latter is more challenging, primarily because of the limited number of training documents. On the other hand, it is assumed that some natural logic, resulting for instance from rules and regulations, governs the succession of documents within the sequences, and that this logic can be exploited to assign a new document to the proper sequence. We have studied the MTC problem in a number of papers and proposed some solutions to it. Here we propose a new solution based on support vector machines (SVMs), which are known to be a very effective technique for various classification tasks. We consider the application of SVMs in a specific context, determined by the characteristics of the MTC problem and by the specific data set used for experimentation. The use of SVMs has required a new, more sophisticated representation of the documents and their sequences, which has made it possible to obtain promising results in computational experiments. Moreover, the proposed approach is flexible and may be considerably modified and extended to cover many possible versions of the problem.

## Keywords

Support Vector Machine, Text Categorization, Jaccard Index, Training Vector, Novelty Detection

## Notes

### Acknowledgments

This work is supported by the National Science Centre (contract no. UMO-2011/01/B/ST6/06908).
