\(\partial u\partial u\) Multi-Tenanted Framework: Distributed Near Duplicate Detection for Big Data
Near duplicate detection algorithms have been proposed and implemented to detect and eliminate duplicate entries from massive datasets. Because data representation (such as measurement units) differs across data sources, potential duplicates may not be textually identical, even though they refer to the same real-world entity. As data warehouses typically contain data coming from several heterogeneous data sources, detecting near duplicates in a data warehouse requires considerable memory and processing power.
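To make the notion of a near duplicate concrete, the following minimal sketch flags record pairs whose token sets have a Jaccard similarity above a threshold. It is an illustration only, not the algorithm used in the paper: the function names, the whitespace tokenization, and the threshold of 0.6 are all assumptions chosen for the example.

```python
from typing import FrozenSet, List, Tuple


def tokens(record: str) -> FrozenSet[str]:
    # Assumed tokenization: lowercase, split on whitespace.
    return frozenset(record.lower().split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    # Jaccard similarity: |a & b| / |a | b|; two empty sets count as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def near_duplicates(records: List[str], threshold: float = 0.6) -> List[Tuple[int, int]]:
    # Naive all-pairs comparison; returns index pairs at or above the threshold.
    sets = [tokens(r) for r in records]
    return [
        (i, j)
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
        if jaccard(sets[i], sets[j]) >= threshold
    ]


records = [
    "Acme Corp 12 Main Street Lisbon",
    "ACME Corp 12 Main St Lisbon",      # same entity, different spelling
    "Globex Inc 99 Ocean Drive Porto",
]
print(near_duplicates(records))
```

The first two records differ textually but share five of seven distinct tokens, so they are reported as a candidate pair. The quadratic number of comparisons in this naive form is what makes the problem costly at data-warehouse scale.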
Traditionally, near duplicate detection algorithms are sequential and operate on a single computer. Parallel and distributed frameworks have recently been exploited to scale the existing algorithms to larger datasets, but these efforts often focus on distributing a few chosen algorithms using frameworks such as MapReduce. A common distribution strategy and framework to parallelize the execution of the existing similarity join algorithms is still lacking.
In-Memory Data Grids (IMDG) offer distributed storage and execution, giving the illusion of a single large computer over multiple computing nodes in a cluster. This paper presents the research, design, and implementation of \(\partial u\partial u\), a distributed near duplicate detection framework, with preliminary evaluations measuring its performance and achieved speedup. \(\partial u\partial u\) leverages the distributed shared memory and execution model provided by IMDG to execute existing near duplicate detection algorithms in a parallel and multi-tenanted environment. As a unified near duplicate detection framework for big data, \(\partial u\partial u\) efficiently distributes the algorithms over utility computers in research labs and private clouds and grids.
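The general idea of running an existing pairwise similarity check in parallel over shared data can be sketched as follows. This is a hedged, single-machine stand-in: a Python thread pool plays the role of grid nodes and an in-process list plays the role of the shared store. It does not use any IMDG API and is not the framework's implementation; the helper names (`compare_chunk`, `parallel_near_duplicates`) and the round-robin partitioning are assumptions for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def jaccard(a, b):
    # Jaccard similarity of two token sets.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compare_chunk(args):
    # Work done by one (simulated) node: check its share of candidate pairs.
    sets, pairs, threshold = args
    return [(i, j) for i, j in pairs if jaccard(sets[i], sets[j]) >= threshold]


def parallel_near_duplicates(records, threshold=0.6, workers=4):
    # "Shared" data: token sets visible to every worker.
    sets = [frozenset(r.lower().split()) for r in records]
    pairs = list(combinations(range(len(sets)), 2))
    # Round-robin partitioning of the candidate pairs across workers.
    chunks = [pairs[k::workers] for k in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(compare_chunk, [(sets, c, threshold) for c in chunks])
    # Merge per-worker results into one sorted list.
    return sorted(p for chunk in results for p in chunk)


records = [
    "acme corp 12 main street lisbon",
    "acme corp 12 main st lisbon",
    "globex inc 99 ocean drive porto",
    "globex inc 99 ocean dr porto",
]
print(parallel_near_duplicates(records, workers=3))
```

The result does not depend on the number of workers, because each candidate pair is assigned to exactly one chunk and the per-chunk results are merged at the end. In a real data grid the partitions would live on separate nodes and the comparison logic would be shipped to the data.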
Keywords: Near Duplicate Detection (NDD), In-Memory Data Grid (IMDG), MapReduce