Abstract
With the development of the Internet and cloud computing, there is a need for databases that can store and process big data. Not only SQL (NoSQL) databases are increasingly used in big data domains and offer strengths such as scalability and flexibility. This paper explains the growing interest in implementing NoSQL in data warehouses. In addition, it investigates the use of data cleaning (the process of detecting and correcting or removing inaccurate records from a database) in NoSQL databases. More precisely, we are interested in adapting data deduplication algorithms to two NoSQL models: document-oriented and column-oriented. Finally, we present a comparison of the implemented algorithms and the results of our simulations.
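To make the notion of deduplication in a document-oriented model concrete, here is a minimal Python sketch. It is not the algorithm evaluated in the paper: the record layout, the field names, the `threshold` value, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions chosen for illustration. Documents are plain dictionaries that may carry different fields, as is typical in a schemaless store.

```python
from difflib import SequenceMatcher


def normalize(value):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(str(value).lower().split())


def similarity(doc_a, doc_b, fields):
    """Average string similarity over the compared fields.

    A field missing from a document is treated as an empty string
    (an assumption of this sketch, not a rule from the paper).
    """
    scores = [
        SequenceMatcher(
            None, normalize(doc_a.get(f, "")), normalize(doc_b.get(f, ""))
        ).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def deduplicate(docs, fields, threshold=0.9):
    """Keep the first document of each group of near-duplicates.

    Naive pairwise scan: every document is compared with the ones
    already kept, which is quadratic in the worst case.
    """
    kept = []
    for doc in docs:
        if all(similarity(doc, k, fields) < threshold for k in kept):
            kept.append(doc)
    return kept


# Hypothetical sample documents with heterogeneous fields.
customers = [
    {"_id": 1, "name": "Alice Martin", "city": "Paris"},
    {"_id": 2, "name": "alice  martin", "city": "paris", "phone": "0102030405"},
    {"_id": 3, "name": "Bob Durand", "city": "Lyon"},
]
unique = deduplicate(customers, fields=["name", "city"])
```

With this sample, the second document differs from the first only in letter case and spacing, so it is treated as a duplicate and dropped. The pairwise scan is only practical for small collections; larger ones usually call for a blocking or partitioning step before comparison.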
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Alami, L., Hafidi, I., Metrane, A. (2018). Entity Resolution in NoSQL Data Warehouse. In: Noreddine, G., Kacprzyk, J. (eds) International Conference on Information Technology and Communication Systems. ITCS 2017. Advances in Intelligent Systems and Computing, vol 640. Springer, Cham. https://doi.org/10.1007/978-3-319-64719-7_5
DOI: https://doi.org/10.1007/978-3-319-64719-7_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64718-0
Online ISBN: 978-3-319-64719-7
eBook Packages: Engineering, Engineering (R0)