The VLDB Journal, Volume 12, Issue 2, pp 170–185

Fast and accurate text classification via multiple linear discriminant projections

  • Soumen Chakrabarti
  • Shourya Roy
  • Mahesh V. Soundalgekar
Original Paper

Abstract.

Support vector machines (SVMs) have shown superb performance for text classification tasks. They are accurate, robust, and quick to apply to test instances. Their only potential drawback is their training time and memory requirement. For n training instances held in memory, the best-known SVM implementations take time proportional to n^a, where a is typically between 1.8 and 2.1. SVMs have been trained on data sets with several thousand instances, but Web directories today contain millions of instances that are valuable for mapping billions of Web pages into Yahoo!-like directories. We present SIMPL, a nearly linear-time classification algorithm that mimics the strengths of SVMs while avoiding the training bottleneck. It uses Fisher's linear discriminant, a classical tool from statistical pattern recognition, to project training instances to a carefully selected low-dimensional subspace before inducing a decision tree on the projected instances. SIMPL uses efficient sequential scans and sorts and is comparable in speed and memory scalability to widely used naive Bayes (NB) classifiers, but it beats NB accuracy decisively. It not only approaches and sometimes exceeds SVM accuracy, but also beats the running time of a popular SVM implementation by orders of magnitude. While describing SIMPL, we make a detailed experimental comparison of SVM-generated discriminants with Fisher's discriminants, and we also report on an analysis of the cache performance of a popular SVM implementation. Our analysis shows that SIMPL has the potential to be the method of choice for practitioners who want the accuracy of SVMs and the simplicity and speed of naive Bayes classifiers.
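
As a rough illustration of the projection idea in the abstract, the sketch below computes a single classical Fisher linear discriminant for two classes, projects the instances onto it, and picks one threshold on the projected values. It is not the SIMPL algorithm itself: the abstract does not give SIMPL's procedure, so the function names, the small ridge term, the dense matrix solve, and the one-split stand-in for a decision tree are all assumptions made for this example.

```python
# Illustrative sketch only: one Fisher discriminant plus a one-split "stump".
# All names and the ridge term are assumptions, not taken from the paper.

def mean(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def scatter(rows, mu):
    """Within-class scatter matrix: sum of outer products of (x - mu)."""
    d = len(mu)
    s = [[0.0] * d for _ in range(d)]
    for r in rows:
        diff = [x - m for x, m in zip(r, mu)]
        for i in range(d):
            for j in range(d):
                s[i][j] += diff[i] * diff[j]
    return s

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fisher_direction(pos, neg, ridge=1e-6):
    """Direction w solving (S_pos + S_neg + ridge*I) w = mu_pos - mu_neg."""
    mu_p, mu_n = mean(pos), mean(neg)
    sp, sn = scatter(pos, mu_p), scatter(neg, mu_n)
    d = len(mu_p)
    sw = [[sp[i][j] + sn[i][j] + (ridge if i == j else 0.0)
           for j in range(d)] for i in range(d)]
    return solve(sw, [p - q for p, q in zip(mu_p, mu_n)])

def project(w, x):
    """Scalar projection of instance x onto direction w."""
    return sum(wi * xi for wi, xi in zip(w, x))

def best_split(w, pos, neg):
    """Sort projected values and return the midpoint threshold with the
    fewest training errors (predict positive when projection > threshold)."""
    scored = sorted([(project(w, x), 1) for x in pos] +
                    [(project(w, x), 0) for x in neg])
    best_err, best_t = len(scored) + 1, None
    for i in range(len(scored) - 1):
        t = (scored[i][0] + scored[i + 1][0]) / 2
        err = sum((s > t) != (y == 1) for s, y in scored)
        if err < best_err:
            best_err, best_t = err, t
    return best_t

if __name__ == "__main__":
    # Toy two-dimensional data, made up for this example.
    pos = [[2.0, 2.1], [2.5, 1.9], [3.0, 2.6]]
    neg = [[0.0, 0.2], [0.5, -0.1], [1.0, 0.4]]
    w = fisher_direction(pos, neg)
    t = best_split(w, pos, neg)
    print("direction:", w, "threshold:", t)
```

The dense solve here is only practical for a handful of features. Text data has a very large, sparse feature space, so a scalable method would need a different way to obtain the discriminant, which this sketch does not attempt.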

Keywords:

Text classification, Discriminative learning, Linear discriminants

Copyright information

© Springer-Verlag Berlin/Heidelberg 2003

Authors and Affiliations

  • Soumen Chakrabarti (1)
  • Shourya Roy (1)
  • Mahesh V. Soundalgekar (1)
  1. IIT Bombay
