A Data Cleaning Service on Massive Spatio-Temporal Data in Highway Domain

  • Yanqing Xia
  • Xuefei Wang
  • Weilong Ding
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11434)


With the development of highway toll systems and sensor networks, massive amounts of highway toll data have accumulated. Imperfections in the raw data, such as incomplete, duplicate, and abnormal records, seriously affect the efficiency of data mining modeling. Traditional cleaning methods are inefficient on massive spatio-temporal data because business rules are difficult to express across different domains. Using the highway toll data of Henan Province, we propose a data cleaning service driven by business rules. The service efficiently cleans raw toll data with spatio-temporal attributes: it calibrates erroneous and invalid data, repairs erroneous data, and filters duplicate data. Implemented with Hadoop MapReduce on toll data in the highway domain, the service demonstrates its efficiency, accuracy, and scalability in extensive experiments.
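The three cleaning steps named in the abstract (calibration, repair, duplicate filtering) can be illustrated with a small map/reduce-style sketch in plain Python. This is not the authors' implementation and does not use Hadoop itself. The record fields, the timestamp format, the "swap reversed timestamps" repair rule, and the deduplication key are all hypothetical stand-ins for the paper's business rules.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical timestamp format and record fields; the paper's schema is not given here.
FMT = "%Y-%m-%d %H:%M:%S"
REQUIRED = ("vehicle_id", "entry_station", "exit_station", "entry_time", "exit_time")

def parse_time(text):
    """Return a datetime, or None when the value is missing or malformed."""
    try:
        return datetime.strptime(text, FMT)
    except (TypeError, ValueError):
        return None

def repair(rec):
    """Repair (assumed rule): swap entry/exit timestamps recorded in reverse order."""
    t_in, t_out = parse_time(rec.get("entry_time")), parse_time(rec.get("exit_time"))
    if t_in and t_out and t_in > t_out:
        rec = dict(rec, entry_time=rec["exit_time"], exit_time=rec["entry_time"])
    return rec

def is_valid(rec):
    """Calibration (assumed rule): all fields present and entry precedes exit."""
    if any(not rec.get(k) for k in REQUIRED):
        return False
    t_in, t_out = parse_time(rec["entry_time"]), parse_time(rec["exit_time"])
    return t_in is not None and t_out is not None and t_in < t_out

def map_phase(records):
    """Map: repair each record, drop invalid ones, emit (key, record) pairs."""
    for rec in records:
        rec = repair(rec)
        if is_valid(rec):
            # Hypothetical duplicate key: same vehicle, entry station, entry time.
            yield (rec["vehicle_id"], rec["entry_station"], rec["entry_time"]), rec

def reduce_phase(pairs):
    """Reduce: group by key and keep the first record of each group."""
    groups = defaultdict(list)
    for key, rec in pairs:
        groups[key].append(rec)
    return [recs[0] for recs in groups.values()]

def clean(records):
    return reduce_phase(map_phase(records))
```

In a real MapReduce job the grouping in `reduce_phase` would be done by the framework's shuffle stage. Here it is simulated in memory to keep the example self-contained.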


Data cleaning, Spatio-temporal data, Highway, Hadoop, Business rules



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Data Engineering Institute, North China University of Technology, Beijing, China
  2. Beijing Key Laboratory on Integration and Analysis of Large-Scale Stream Data, Beijing, China
