Unsupervised Feature Selection for Spherical Data Modeling: Application to Image-Based Spam Filtering

Amayri, Ola; Bouguila, Nizar

doi:10.1007/978-3-642-30721-8_2

Part of the book series: Communications in Computer and Information Science (CCIS, volume 287)

Included in the following conference series:

International Conference on Multimedia Communications, Services and Security

Abstract

Understanding the relevance of extracted features in a domain-specific sense lies at the heart of image classification. In this paper, we propose a feature selection framework that yields a more compact statistical model while retaining good generalization to unseen data. Both feature selection and clustering are based on well-established statistical models that are a natural choice when the data to be modeled are spherical. Moreover, we develop a probabilistic kernel, based on the Fisher score and a mixture of von Mises (moVM) model, to feed Support Vector Machines (SVM). The selection process evaluates the relevance of features through a principled feature saliency approach. Unsupervised learning is carried out using Expectation Maximization (EM) for parameter estimation, together with Minimum Message Length (MML) to determine the optimal number of mixture components. We argue that the proposed framework is well justified and can be adapted to different problems. Experimental results on the challenging problem of image-based spam filtering show the merits of the proposed approach.
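
As a rough illustration of two ingredients named in the abstract, the following Python sketch fits a mixture of von Mises densities with EM and then computes a Fisher score with respect to the component means. It is not the authors' implementation. It assumes one-dimensional circular data (angles in radians) rather than general spherical data, fixes the number of components instead of selecting it with MML, and omits feature saliency entirely. The function names `vm_pdf`, `solve_kappa`, `fit_movm`, and `fisher_score` are hypothetical.

```python
import numpy as np
from scipy.special import i0e, i1e
from scipy.optimize import brentq


def vm_pdf(theta, mu, kappa):
    """von Mises density on the circle, written with the scaled Bessel
    function i0e so that large kappa values do not overflow."""
    return np.exp(kappa * (np.cos(theta - mu) - 1.0)) / (2.0 * np.pi * i0e(kappa))


def solve_kappa(r_bar):
    """Numerically invert A(kappa) = I1(kappa) / I0(kappa) = r_bar."""
    r_bar = min(max(r_bar, 1e-6), 0.999)  # keep the root inside the bracket
    return brentq(lambda k: i1e(k) / i0e(k) - r_bar, 1e-8, 1e4)


def fit_movm(theta, n_components, n_iter=100):
    """EM for a mixture of von Mises densities (illustrative assumptions:
    fixed number of components, evenly spaced initial means)."""
    mu = np.linspace(-np.pi, np.pi, n_components, endpoint=False)
    kappa = np.ones(n_components)
    weights = np.full(n_components, 1.0 / n_components)
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each angle.
        dens = weights * vm_pdf(theta[:, None], mu, kappa) + 1e-300
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: weighted circular mean and mean resultant length.
        nk = resp.sum(axis=0)
        c = resp.T @ np.cos(theta)
        s = resp.T @ np.sin(theta)
        mu = np.arctan2(s, c)
        kappa = np.array([solve_kappa(r) for r in np.hypot(c, s) / nk])
        weights = nk / theta.size
    return weights, mu, kappa


def fisher_score(theta, weights, mu, kappa):
    """Average gradient of the log-likelihood with respect to each mean.
    The Fisher information normalization is left out for brevity."""
    dens = weights * vm_pdf(theta[:, None], mu, kappa) + 1e-300
    resp = dens / dens.sum(axis=1, keepdims=True)
    return (resp * kappa * np.sin(theta[:, None] - mu)).mean(axis=0)
```

Under these assumptions, a simple kernel between two sets of angles would be the dot product of their `fisher_score` vectors, which could then be passed to an SVM as a precomputed kernel. A fuller treatment would also include gradients with respect to the concentrations and mixing weights.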

Author information

Authors and Affiliations

Faculty of Engineering and Computer Science, Concordia University, Montreal, QC, Canada, H3G 2W1
Ola Amayri & Nizar Bouguila

Editor information

Editors and Affiliations

Department of Telecommunications, AGH University of Science and Technology, Krakow, Poland
Andrzej Dziech
Multimedia Systems Department, Gdansk University of Technology, Narutowicza 11/22, 80-233, Gdansk, Poland
Andrzej Czyżewski

About this paper

Cite this paper

Amayri, O., Bouguila, N. (2012). Unsupervised Feature Selection for Spherical Data Modeling: Application to Image-Based Spam Filtering. In: Dziech, A., Czyżewski, A. (eds) Multimedia Communications, Services and Security. MCSS 2012. Communications in Computer and Information Science, vol 287. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30721-8_2

DOI: https://doi.org/10.1007/978-3-642-30721-8_2
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30720-1
Online ISBN: 978-3-642-30721-8
eBook Packages: Computer Science, Computer Science (R0)
