Abstract
The basic sorted-neighborhood method (SNM) is a classic algorithm for detecting approximately duplicate records in data cleaning. Its drawbacks are that the size of the sliding window is hard to select and that attribute matching is performed too frequently, so detection efficiency suffers. An optimized algorithm based on SNM is proposed. It makes the size and moving speed of the sliding window variable, to avoid missing record comparisons and to reduce unnecessary ones; it uses cosine similarity in attribute matching to improve detection precision; and it introduces a Top-k effective-weight filtering algorithm to reduce the number of attribute matches and improve detection efficiency. The experimental results show that the improved algorithm outperforms SNM in recall, precision, and execution time.
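For orientation, the baseline that the abstract improves on can be sketched as follows. This is a minimal illustration of fixed-window SNM with cosine-similarity matching, not the authors' implementation: the abstract does not specify the variable-window rule or the Top-k effective-weight filter, so neither is reproduced here. The function names, the character-bigram vectorization, the `window` and `threshold` defaults, and the treatment of each record as a single string are all assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def cosine_similarity(a: str, b: str) -> float:
    """Cosine similarity of two strings over character-bigram counts.

    The bigram representation is an assumption for this sketch; the
    abstract does not say how attribute values are vectorized.
    """
    def bigrams(s: str) -> Counter:
        s = s.lower()
        return Counter(s[i:i + 2] for i in range(len(s) - 1))

    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = sqrt(sum(c * c for c in va.values())) * sqrt(sum(c * c for c in vb.values()))
    # Strings shorter than two characters have no bigrams; treat as dissimilar.
    return dot / norm if norm else 0.0


def snm_duplicates(records, key, window=3, threshold=0.8):
    """Basic fixed-window SNM (hypothetical helper).

    1. Sort record indices by a sorting key.
    2. Slide a window of `window` records over the sorted order.
    3. Compare each record only with the next `window - 1` records.
    Returns sorted (i, j) index pairs whose similarity reaches `threshold`.
    """
    ordered = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = []
    for pos, i in enumerate(ordered):
        for j in ordered[pos + 1: pos + window]:
            if cosine_similarity(records[i], records[j]) >= threshold:
                pairs.append((min(i, j), max(i, j)))
    return sorted(pairs)


records = ["smith john", "smith john", "zzz yyy", "brown alice"]
duplicates = snm_duplicates(records, key=lambda r: r, window=2, threshold=0.9)
```

The fixed `window` argument is where the drawback named in the abstract shows up: a small window can miss duplicates that sort far apart, while a large one multiplies the number of attribute comparisons.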
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Li, M., Xie, Q., Ding, Q. (2015). An Improved Data Cleaning Algorithm Based on SNM. In: Huang, Z., Sun, X., Luo, J., Wang, J. (eds) Cloud Computing and Security. ICCCS 2015. Lecture Notes in Computer Science, vol 9483. Springer, Cham. https://doi.org/10.1007/978-3-319-27051-7_22
DOI: https://doi.org/10.1007/978-3-319-27051-7_22
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-27050-0
Online ISBN: 978-3-319-27051-7
eBook Packages: Computer Science, Computer Science (R0)