Machine Learning, Volume 99, Issue 1, pp 47–73

Optimizing regression models for data streams with missing values



Automated data acquisition systems, such as wireless sensor networks, surveillance systems, or any system that records data in operating logs, are becoming increasingly common and provide opportunities for making decisions based on data in real or near-real time. In these systems, data is generated continuously as a stream, and predictive models must be built and updated online as new data arrives. The models must also output predictions continuously and without delay. Because automated data acquisition systems are prone to occasional failures, missing values often occur, yet predictions still need to be made. Predictive models therefore need mechanisms for handling missing data so that the loss in accuracy due to occasionally missing values is minimal. In this paper, we theoretically analyze the effects of missing values on the accuracy of linear predictive models. We derive the optimal least squares solution that minimizes the expected mean squared error given an expected rate of missing values. Based on this theoretically optimal solution, we propose a recursive algorithm for producing and updating linear regression models online, without accessing historical data. Our experimental evaluation on eight benchmark datasets and a case study in environmental monitoring with streaming data validates the theoretical results and confirms the effectiveness of the proposed strategy.
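To make the idea of a least squares solution tuned to an expected missing rate concrete, the following is a minimal sketch under stated assumptions, not the paper's algorithm. It assumes that inputs are centred, that a missing entry is imputed with 0 (its mean), that each entry is missing independently with probability p < 1 at prediction time, and that training examples are fully observed. Under those assumptions, minimising the expected squared error over the random missingness leads to the normal equations ((1 - p) S + p diag(S)) beta = r, where S is the running sum of x x^T and r is the running sum of x y. The class and function names (`StreamingMissingAwareLS`, `predict`) and the `eps` jitter are hypothetical and introduced only for illustration; the paper's own formulation and recursive update may differ.

```python
import numpy as np


class StreamingMissingAwareLS:
    """Linear regression from running sums, adjusted for an expected missing rate.

    Assumptions (illustrative, not taken from the paper): inputs are centred,
    a missing entry is imputed with 0 (its mean), and each entry is missing
    independently with probability p at prediction time.
    """

    def __init__(self, n_features, eps=1e-8):
        self.S = np.zeros((n_features, n_features))  # running sum of x x^T
        self.r = np.zeros(n_features)                # running sum of x * y
        self.eps = eps                               # small jitter for invertibility

    def update(self, x, y):
        """Absorb one fully observed example; the example itself is not stored."""
        x = np.asarray(x, dtype=float)
        self.S += np.outer(x, x)
        self.r += x * y

    def coefficients(self, p):
        """Weights minimising the expected squared error at missing rate p."""
        D = np.diag(np.diag(self.S))
        A = (1.0 - p) * self.S + p * D + self.eps * np.eye(len(self.r))
        return np.linalg.solve(A, self.r)


def predict(beta, x):
    """Predict with NaN entries (missing values) replaced by 0, the assumed mean."""
    x = np.nan_to_num(np.asarray(x, dtype=float), nan=0.0)
    return float(x @ beta)
```

With p = 0 the weights reduce to ordinary least squares. As p grows, the off-diagonal terms of S are down-weighted, so each coefficient relies less on the other inputs being present. Because only S and r are kept, the model can be refreshed after every new example without revisiting earlier data.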


Data streams; Missing data; Linear models; Online regression; Regularized recursive regression



We would like to thank the INFER project for sharing the Chemi dataset. This work has been supported by Academy of Finland Grant 118653 (ALGODAN).



Copyright information

© The Author(s) 2014

Authors and Affiliations

  1. Department of Information and Computer Science, Aalto University, Espoo, Finland
  2. Helsinki Institute for Information Technology (HIIT), Helsinki, Finland
