Machine Learning, Volume 81, Issue 2, pp 149–178

Learning to classify with missing and corrupted features

  • Ofer Dekel
  • Ohad Shamir
  • Lin Xiao


A common assumption in supervised machine learning is that the training examples provided to the learning algorithm are statistically identical to the instances encountered later on, during the classification phase. This assumption is unrealistic in many real-world situations where machine learning techniques are used. We focus on the case where features of a binary classification problem, which were available during the training phase, are either deleted or become corrupted during the classification phase. We prepare for the worst by assuming that the subset of deleted and corrupted features is controlled by an adversary, and may vary from instance to instance. We design and analyze two novel learning algorithms that anticipate the actions of the adversary and account for them when training a classifier. Our first technique formulates the learning problem as a linear program. We discuss how the particular structure of this program can be exploited for computational efficiency and we prove statistical bounds on the risk of the resulting classifier. Our second technique addresses the robust learning problem by combining a modified version of the Perceptron algorithm with an online-to-batch conversion technique, and also comes with statistical generalization guarantees. We demonstrate the effectiveness of our approach with a set of experiments.
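The adversarial deletion setting and the Perceptron-based approach described above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a simplified model in which the adversary may zero out at most `k` features per instance, and it uses plain weight averaging as a stand-in for the online-to-batch conversion. The names `worst_case_margin`, `robust_perceptron`, `k`, and `epochs` are hypothetical and chosen only for illustration.

```python
import numpy as np


def worst_case_margin(w, x, y, k):
    """Margin y * <w, x> after an adversary zeroes out up to k features.

    Assumption: the adversary removes the features whose contributions
    y * w_j * x_j are largest and positive, which minimizes the margin
    of a linear classifier under a pure deletion model.
    """
    contrib = y * w * x
    top = np.sort(contrib)[::-1][:k]           # k largest contributions
    return contrib.sum() - top[top > 0].sum()  # drop only the helpful ones


def robust_perceptron(X, Y, k, epochs=10):
    """Perceptron that updates on the adversarially deleted instance.

    Returns the average of all intermediate weight vectors, a simple
    averaging heuristic used here in place of a formal online-to-batch
    conversion.
    """
    n, d = X.shape
    w = np.zeros(d)
    w_sum = np.zeros(d)
    steps = 0
    for _ in range(epochs):
        for x, y in zip(X, Y):
            contrib = y * w * x
            # Indices of up to k features the adversary would delete.
            idx = np.argsort(contrib)[::-1][:k]
            idx = idx[contrib[idx] > 0]
            x_adv = x.copy()
            x_adv[idx] = 0.0
            # Standard Perceptron update, applied to the corrupted instance.
            if y * (w @ x_adv) <= 0:
                w = w + y * x_adv
            w_sum += w
            steps += 1
    return w_sum / steps
```

Calling `robust_perceptron(X, Y, k=1)` on a data matrix `X` with labels `Y` in {-1, +1} returns a weight vector, and `worst_case_margin` can then be used to check how each training instance fares when its most helpful feature is removed. Training on the corrupted instance pushes weight onto redundant features, which is the intuition behind anticipating the adversary during training.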


Keywords: Adversarial environment; Binary classification; Deleted features



Copyright information

© Springer Science+Business Media, LLC 2009

Authors and Affiliations

  1. Microsoft Research, Redmond, USA
  2. The Hebrew University, Jerusalem, Israel
