Discovery of Unique Column Combinations with Hadoop

Han, Shupeng; Cai, Xiangrui; Wang, Chao; Zhang, Haiwei; Wen, Yanlong

doi:10.1007/978-3-319-11116-2_49

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8709)

Abstract

A unique column combination, a set of columns whose combined values contain no duplicate rows, is an important kind of structural information in relations. From a data management perspective, discovering unique column combinations is a crucial step in understanding and using the data. It benefits data modeling, data integration, anomaly detection, query optimization, and indexing. However, discovering all unique column combinations is an NP-hard problem, so efficiency is a major challenge.
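To make the notion concrete, here is a minimal Python sketch of a uniqueness check and a naive level-wise search for minimal unique column combinations. It is an illustration only, not the algorithm proposed in the paper; the function names `is_unique` and `minimal_uccs` and the sample rows are assumptions introduced for this example.

```python
from itertools import combinations


def is_unique(rows, columns):
    """Return True if no two rows agree on all of `columns`."""
    seen = set()
    for row in rows:
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False  # duplicate projection found
        seen.add(key)
    return True


def minimal_uccs(rows, column_names):
    """Naive level-wise search, smallest combinations first."""
    found = []
    for size in range(1, len(column_names) + 1):
        for combo in combinations(column_names, size):
            # A superset of a unique combination is unique but not minimal.
            if any(set(u) <= set(combo) for u in found):
                continue
            if is_unique(rows, combo):
                found.append(combo)
    return found


# Hypothetical sample relation, invented for illustration.
rows = [
    {"first": "Ann", "last": "Lee", "city": "Tianjin"},
    {"first": "Ann", "last": "Wu", "city": "Beijing"},
    {"first": "Bob", "last": "Lee", "city": "Beijing"},
]
result = minimal_uccs(rows, ["first", "last", "city"])
print(result)
```

In this sample no single column is unique, but every pair of columns is. The search visits up to 2^n - 1 combinations for n columns, which is why exhaustive discovery becomes expensive on wide tables.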

In this paper, we propose MRUCC, an efficient algorithm for discovering unique column combinations in large-scale data sets on Hadoop. Existing algorithms mainly target data sets of ordinary size and cannot be adapted to large data sets. In contrast, MRUCC discovers unique column combinations in parallel and is implemented on Hadoop. Furthermore, it uses column-based and row-based pruning to improve efficiency. Finally, we compare MRUCC with state-of-the-art approaches on both real and synthetic data sets. The experiments show that MRUCC performs better.
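The abstract does not describe how MRUCC works internally, so the following is only a generic sketch of how a uniqueness check can be phrased in map/reduce style. It is simulated in plain Python rather than run on Hadoop, and the names `map_phase`, `reduce_phase`, and `is_unique_mapreduce` are assumptions for illustration: the map step emits each row's projection onto the candidate columns, and the reduce step counts how often each projection occurs.

```python
from collections import defaultdict


def map_phase(rows, columns):
    """Emit (projection, 1) for every row, as a mapper would."""
    for row in rows:
        yield tuple(row[c] for c in columns), 1


def reduce_phase(pairs):
    """Sum the counts per projection key, as a reducer would."""
    counts = defaultdict(int)
    for key, one in pairs:
        counts[key] += one
    return counts


def is_unique_mapreduce(rows, columns):
    """A combination is unique if every projection occurs exactly once."""
    counts = reduce_phase(map_phase(rows, columns))
    return all(n == 1 for n in counts.values())


# Hypothetical sample relation, invented for illustration.
sample = [
    {"id": 1, "dept": "db", "room": "A"},
    {"id": 2, "dept": "db", "room": "B"},
    {"id": 3, "dept": "ml", "room": "B"},
]
print(is_unique_mapreduce(sample, ["dept"]))
print(is_unique_mapreduce(sample, ["dept", "room"]))
```

Because each projection key can be counted independently, the work partitions naturally across reducers. Pruning strategies such as the column-based and row-based ones mentioned in the abstract would reduce how many candidate combinations and rows need this check; their details are not given here.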

Author information

Authors and Affiliations

College of Computer and Control Engineering, Nankai University, 94 Weijin Road, Tianjin, P.R. China, 300071
Shupeng Han, Xiangrui Cai, Chao Wang, Haiwei Zhang & Yanlong Wen

Editor information

Editors and Affiliations

Beijing Institute of Spacecraft System Engineering, Beijing, China
Lei Chen
School of Computer Science, National University of Defense Technology, 410073, Changsha, Hunan, China
Yan Jia
RMIT University, Melbourne, Australia
Timos Sellis
School of Computer Science and Technology, Soochow University, 215006, Suzhou, China
Guanfeng Liu

About this paper

Cite this paper

Han, S., Cai, X., Wang, C., Zhang, H., Wen, Y. (2014). Discovery of Unique Column Combinations with Hadoop. In: Chen, L., Jia, Y., Sellis, T., Liu, G. (eds) Web Technologies and Applications. APWeb 2014. Lecture Notes in Computer Science, vol 8709. Springer, Cham. https://doi.org/10.1007/978-3-319-11116-2_49

DOI: https://doi.org/10.1007/978-3-319-11116-2_49
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11115-5
Online ISBN: 978-3-319-11116-2
eBook Packages: Computer Science, Computer Science (R0)