A Comparative Study of Text Preprocessing Techniques for Natural Language Call Routing

  • Roman Sergienko
  • Muhammad Shan
  • Alexander Schmitt
Chapter, part of the Lecture Notes in Electrical Engineering book series (LNEE, volume 427)

Abstract

This article presents a comparative study of text preprocessing techniques for natural language call routing. Seven term weighting methods, both unsupervised and supervised, were considered. Four dimensionality reduction methods were applied: stop-word filtering with stemming, feature selection based on term weights, feature transformation based on term clustering, and a novel feature transformation method based on terms belonging to classes. Two classification algorithms were used: k-NN and the SVM-based algorithm Fast Large Margin. The numerical experiments showed that the most effective term weighting method is Term Relevance Ratio (TRR). Unlike the other dimensionality reduction methods, feature transformation based on term clustering can significantly decrease dimensionality without significantly changing classification effectiveness. The novel feature transformation method reduces dimensionality radically: the number of features equals the number of classes.
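
The abstract does not spell out how the feature transformation based on terms belonging to classes is computed, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that each term is assigned to the class in which its relative frequency is highest, and that a document is then represented by the number of its terms assigned to each class. The function names `assign_terms_to_classes` and `transform`, the raw-count weighting, and the toy utterances are all hypothetical. The sketch does show the property stated above: the feature vector has exactly one entry per class.

```python
from collections import Counter, defaultdict


def assign_terms_to_classes(docs, labels):
    """Assign each term to the class where its relative frequency is highest.

    Assumption: relative frequency = term count in class / total tokens in class.
    Ties go to the alphabetically first class, so the result is deterministic.
    """
    term_counts = defaultdict(Counter)  # class label -> term -> count
    for tokens, label in zip(docs, labels):
        term_counts[label].update(tokens)

    totals = {c: sum(counts.values()) for c, counts in term_counts.items()}
    classes = sorted(term_counts)
    vocabulary = {t for counts in term_counts.values() for t in counts}

    term_class = {}
    for term in vocabulary:
        term_class[term] = max(
            classes, key=lambda c: term_counts[c][term] / totals[c]
        )
    return classes, term_class


def transform(tokens, classes, term_class):
    """Represent a document by one feature per class.

    Each feature counts the document's terms assigned to that class;
    terms unseen during training are ignored.
    """
    hits = Counter(term_class[t] for t in tokens if t in term_class)
    return [hits[c] for c in classes]


# Hypothetical toy call-routing data: tokenized utterances and their routes.
train_docs = [
    ["pay", "my", "bill"],
    ["bill", "is", "wrong"],
    ["reset", "my", "password"],
    ["password", "not", "working"],
]
train_labels = ["billing", "billing", "account", "account"]

classes, term_class = assign_terms_to_classes(train_docs, train_labels)
features = transform(["wrong", "bill", "password"], classes, term_class)
print(classes)   # ['account', 'billing']
print(features)  # [1, 2]
```

In this reading, the vocabulary size no longer affects the dimensionality of the document representation; a weighting scheme such as TRR could replace the raw counts, but its formula is not given in the abstract and is therefore not sketched here.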

Keywords

Call routing, Text classification, Term weighting, Dimensionality reduction

Copyright information

© Springer Science+Business Media Singapore 2017

Authors and Affiliations

  • Roman Sergienko (1)
  • Muhammad Shan (1)
  • Alexander Schmitt (1)
  1. Institute of Communications Engineering, Ulm University, Ulm, Germany
