Abstract
Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project built on the MapReduce framework in Java, has attracted significant interest in distributed data mining. To address the problems of globalization, random writes, and duration in Hadoop, we describe a data indexing approach that uses the Java Persistence API (JPA), illustrated through an implementation of the KD-tree algorithm on Hadoop. We also propose an improved intersection algorithm for distributed data indexing on Hadoop; it has O(M + log N) complexity and is suited to cases involving multiple intersections. We compare the data indexing algorithms on an open dataset and a synthetic dataset in a modest cloud environment. The results show that the algorithms are feasible for large-scale data mining.
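For readers unfamiliar with the KD-tree index the abstract refers to, the following is a minimal single-machine sketch in Java. It is not the authors' Hadoop/JPA implementation and does not reproduce their intersection algorithm; the class name `KdTreeSketch`, its methods, and the sample points are all hypothetical and chosen only for illustration. It builds a balanced tree by median splits and answers an axis-aligned range query.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Illustrative in-memory KD-tree; all names here are assumptions, not from the paper.
public class KdTreeSketch {

    /** A node holding one k-dimensional point and its two subtrees. */
    static final class Node {
        final double[] point;
        final Node left;
        final Node right;

        Node(double[] point, Node left, Node right) {
            this.point = point;
            this.left = left;
            this.right = right;
        }
    }

    /** Builds a balanced tree over points[lo, hi), splitting on the median and cycling the axis with depth. */
    static Node build(double[][] points, int lo, int hi, int depth) {
        if (lo >= hi) {
            return null;
        }
        final int axis = depth % points[lo].length;
        // Sort only the current slice along the splitting axis.
        Arrays.sort(points, lo, hi, Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = (lo + hi) >>> 1;
        return new Node(points[mid],
                build(points, lo, mid, depth + 1),
                build(points, mid + 1, hi, depth + 1));
    }

    /** Collects every point p with min[i] <= p[i] <= max[i] on all axes. */
    static void rangeSearch(Node node, double[] min, double[] max, int depth, List<double[]> out) {
        if (node == null) {
            return;
        }
        int axis = depth % min.length;
        boolean inside = true;
        for (int i = 0; i < min.length; i++) {
            if (node.point[i] < min[i] || node.point[i] > max[i]) {
                inside = false;
                break;
            }
        }
        if (inside) {
            out.add(node.point);
        }
        // The left subtree holds values <= the split value, the right holds values >= it,
        // so a subtree is skipped only when the query box lies entirely on the other side.
        if (min[axis] <= node.point[axis]) {
            rangeSearch(node.left, min, max, depth + 1, out);
        }
        if (max[axis] >= node.point[axis]) {
            rangeSearch(node.right, min, max, depth + 1, out);
        }
    }

    public static void main(String[] args) {
        double[][] points = {{2, 3}, {5, 4}, {9, 6}, {4, 7}, {8, 1}, {7, 2}};
        Node root = build(points, 0, points.length, 0);
        List<double[]> hits = new ArrayList<>();
        rangeSearch(root, new double[] {4, 1}, new double[] {8, 5}, 0, hits);
        for (double[] p : hits) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

The sketch keeps the whole tree in memory and re-sorts each slice during construction, which is simple but not tuned for large inputs; a distributed variant would need to persist and partition nodes, which is the concern the chapter itself addresses.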
An Erratum for this chapter can be found at http://dx.doi.org/10.1007/978-3-642-16327-2_42
Copyright information
© 2010 IFIP International Federation for Information Processing
About this paper
Cite this paper
Lai, Y., ZhongZhi, S. (2010). An Efficient Data Indexing Approach on Hadoop Using Java Persistence API. In: Shi, Z., Vadera, S., Aamodt, A., Leake, D. (eds) Intelligent Information Processing V. IIP 2010. IFIP Advances in Information and Communication Technology, vol 340. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16327-2_27
DOI: https://doi.org/10.1007/978-3-642-16327-2_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16326-5
Online ISBN: 978-3-642-16327-2
eBook Packages: Computer Science (R0)