Unsupervised Feature Selection for Spherical Data Modeling: Application to Image-Based Spam Filtering

  • Conference paper
Multimedia Communications, Services and Security (MCSS 2012)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 287)

Abstract

Understanding the relevance of extracted features in a domain-specific sense is at the heart of image classification. In this paper, we propose a feature selection framework that yields a more compact statistical model while retaining good generalization to unseen data. Both feature selection and clustering are based on well-established statistical models that are a natural choice when the data to be modeled are spherical. Moreover, we develop a probabilistic kernel, based on the Fisher score and a mixture of von Mises distributions (moVM), as input to Support Vector Machines (SVM). The selection process evaluates the relevance of features through a principled feature saliency approach. Unsupervised learning is carried out using Expectation Maximization (EM) for parameter estimation, together with Minimum Message Length (MML) to determine the optimal number of mixture components. We argue that the proposed framework is well justified and can be adapted to different problems. Experimental results on the challenging problem of image-based spam filtering show the merits of the proposed approach.
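
As a rough illustration of the EM step mentioned in the abstract, the following sketch fits a mixture of von Mises distributions to angles on the circle. It is a minimal sketch under stated assumptions and not the authors' implementation: it handles only one-dimensional circular data, keeps the number of components fixed (no MML model selection), and omits feature saliency and the Fisher kernel. The function names `vm_logpdf` and `fit_movm` are hypothetical, and the concentration update uses a common closed-form approximation from the von Mises-Fisher clustering literature, swapped in here for whatever estimator the paper actually uses.

```python
import numpy as np


def vm_logpdf(theta, mu, kappa):
    """Log-density of a von Mises distribution on the circle."""
    return kappa * np.cos(theta - mu) - np.log(2.0 * np.pi * np.i0(kappa))


def fit_movm(theta, n_components, n_iter=100, seed=0):
    """EM for a von Mises mixture with a fixed number of components.

    Assumption: theta is a 1-D array of angles in radians.
    """
    rng = np.random.default_rng(seed)
    n = theta.shape[0]
    # Initialise means from randomly chosen data points.
    mu = rng.choice(theta, size=n_components, replace=False)
    kappa = np.ones(n_components)
    weights = np.full(n_components, 1.0 / n_components)

    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each angle.
        log_r = np.log(weights) + vm_logpdf(
            theta[:, None], mu[None, :], kappa[None, :]
        )
        log_r -= log_r.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_r)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: weighted circular mean and mean resultant length.
        nk = np.maximum(resp.sum(axis=0), 1e-12)
        c = resp.T @ np.cos(theta)
        s = resp.T @ np.sin(theta)
        mu = np.arctan2(s, c)
        r_bar = np.clip(np.sqrt(c**2 + s**2) / nk, 1e-6, 1.0 - 1e-6)

        # Approximate concentration for dimension d = 2 (an assumption of
        # this sketch); capped so that np.i0 does not overflow.
        kappa = np.minimum(r_bar * (2.0 - r_bar**2) / (1.0 - r_bar**2), 500.0)
        weights = nk / nk.sum()

    return weights, mu, kappa
```

A call such as `fit_movm(angles, n_components=2)` returns the mixing weights, mean directions, and concentrations. Choosing the number of components with MML, as the abstract describes, would require wrapping this loop in a model selection criterion that is not shown here.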

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Amayri, O., Bouguila, N. (2012). Unsupervised Feature Selection for Spherical Data Modeling: Application to Image-Based Spam Filtering. In: Dziech, A., Czyżewski, A. (eds) Multimedia Communications, Services and Security. MCSS 2012. Communications in Computer and Information Science, vol 287. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30721-8_2

  • DOI: https://doi.org/10.1007/978-3-642-30721-8_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30720-1

  • Online ISBN: 978-3-642-30721-8

  • eBook Packages: Computer Science, Computer Science (R0)