Abstract
The main objective of this chapter is to explain two important dimensionality reduction techniques that support scaling up machine learning: feature hashing and principal component analysis. The standard and flagged feature hashing approaches are explained in detail, together with the hash collision problem from which feature hashing suffers. Two collision controllers, feature binning and feature mitigation, are proposed to address this problem. Principal component analysis relies on the concepts of eigenvalues and eigenvectors, which are explained in detail with examples; the technique itself is then illustrated using a simple two-dimensional example and several coding examples.
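The chapter's own coding examples are not reproduced in this preview. As a rough illustration only, the following Python sketch shows the hashing trick that underlies feature hashing, and how two distinct features can land in the same bucket; the hash function, bucket count, and tokens are illustrative assumptions, not the chapter's method.

```python
import hashlib

def hashed_features(tokens, num_buckets=4):
    """Map a variable-length list of tokens to a fixed-length count vector
    using a hash function (the hashing trick)."""
    vec = [0] * num_buckets
    for tok in tokens:
        # Fold the token's hash into one of num_buckets indices.
        idx = int(hashlib.md5(tok.encode("utf-8")).hexdigest(), 16) % num_buckets
        vec[idx] += 1
    return vec

# With only 4 buckets, some of these distinct tokens are likely to share an
# index -- this is the hash collision problem discussed in the chapter.
print(hashed_features(["storm", "rain", "wind", "pressure", "humidity"]))
```

Similarly, a minimal sketch of principal component analysis on a made-up two-dimensional data set: the covariance matrix is eigendecomposed, and the data are projected onto the eigenvector with the largest eigenvalue.

```python
import numpy as np

# Toy two-dimensional data set; rows are observations (illustrative only).
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0],
              [2.3, 2.7], [2.0, 1.6], [1.0, 1.1], [1.5, 1.6], [1.1, 0.9]])

Xc = X - X.mean(axis=0)                 # center the data
C = np.cov(Xc, rowvar=False)            # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)    # eigenvalues ascending for symmetric C
order = np.argsort(eigvals)[::-1]       # reorder by decreasing variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

Z = Xc @ eigvecs[:, :1]                 # scores along the first principal component
print("eigenvalues:", eigvals)
print("first principal component scores:", Z.ravel())
```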
Keywords
- Principal Component Analysis
- Dimensionality Reduction
- Hash Function
- Feature Mitigation
- Stochastic Gradient Descent