Optimizing regression models for data streams with missing values
Automated data acquisition systems, such as wireless sensor networks, surveillance systems, or any system that records data in operating logs, are becoming increasingly common, and provide opportunities for making decisions on data in real or nearly real time. In these systems, data is generated continuously, resulting in a stream of data, and predictive models need to be built and updated online with the incoming data. In addition, the predictive models need to output predictions continuously and without delay. Automated data acquisition systems are prone to occasional failures, so missing values occur often. Nevertheless, predictions need to be made continuously. Hence, predictive models need mechanisms for dealing with missing data such that the loss in accuracy due to occasionally missing values is minimal. In this paper, we theoretically analyze the effects of missing values on the accuracy of linear predictive models. We derive the optimal least squares solution that minimizes the expected mean squared error given an expected rate of missing values. Based on this theoretically optimal solution, we propose a recursive algorithm for producing and updating linear regression online, without accessing historical data. Our experimental evaluation on eight benchmark datasets and a case study in environmental monitoring with streaming data validate the theoretical results and confirm the effectiveness of the proposed strategy.
Keywords: Data streams, Missing data, Linear models, Online regression, Regularized recursive regression
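The idea summarized in the abstract can be illustrated with a minimal sketch. It is not taken from the paper, and it rests on assumptions stated here: inputs are centered, a missing input is replaced by zero (its mean) at prediction time, and each input is missing independently with probability p. Under those assumptions, minimizing the expected squared error over the missingness pattern gives coefficients of the form ((1 - p) XᵀX + p diag(XᵀX))⁻¹ Xᵀy, which reduces to ordinary least squares when p = 0. The class name `MissingAwareRegressor` and its parameters are hypothetical. For simplicity the sketch accumulates the sufficient statistics XᵀX and Xᵀy one example at a time instead of using a rank-one inverse update, so it needs no historical data but re-solves the linear system whenever coefficients are requested.

```python
import numpy as np


class MissingAwareRegressor:
    """Hypothetical helper: linear regression tuned for an expected missing rate p.

    Assumes centered inputs, zero imputation at prediction time, and
    independent missingness with probability p_missing per input.
    """

    def __init__(self, n_features, p_missing):
        self.p = p_missing
        self.S = np.zeros((n_features, n_features))  # running X^T X
        self.r = np.zeros(n_features)                # running X^T y

    def update(self, x, y):
        # Accumulate sufficient statistics from one complete training example;
        # the example itself does not need to be stored afterwards.
        self.S += np.outer(x, x)
        self.r += x * y

    def coefficients(self):
        # Off-diagonal entries of X^T X are shrunk by (1 - p); the diagonal is kept.
        A = (1.0 - self.p) * self.S + self.p * np.diag(np.diag(self.S))
        return np.linalg.solve(A, self.r)

    def predict(self, x):
        # Missing entries (NaN) are replaced by 0, the mean of centered inputs.
        x_filled = np.where(np.isnan(x), 0.0, x)
        return x_filled @ self.coefficients()


# Illustrative synthetic stream with correlated inputs (made-up data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
X_demo[:, 1] += X_demo[:, 0]
X_demo -= X_demo.mean(axis=0)
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200)

model = MissingAwareRegressor(n_features=3, p_missing=0.2)
for x_i, y_i in zip(X_demo, y_demo):
    model.update(x_i, y_i)

# Predict for an observation whose second sensor reading is missing.
print(model.predict(np.array([0.3, np.nan, -1.2])))
```

Setting `p_missing=0.0` recovers the ordinary least squares fit; larger values reduce how much the model relies on correlations between inputs, so a single missing reading does less damage to the prediction.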
We would like to thank the INFER project for sharing the Chemi dataset. This work has been supported by Academy of Finland Grant 118653 (ALGODAN).