Abstract
This chapter presents the main background knowledge relevant to the book. Sections 2.1 and 2.2 describe the areas of processing complex data and knowledge discovery in traditional databases. The task of clustering complex data is discussed in Sect. 2.3, while the task of labeling this kind of data is described in Sect. 2.4. Section 2.5 introduces the MapReduce framework, a promising tool for large-scale data analysis that has proven to offer valuable support for executing data mining algorithms in a parallel processing environment. Section 2.6 concludes the chapter.
Copyright information
© 2013 The Author(s)
About this chapter
Cite this chapter
Cordeiro, R., Faloutsos, C., Traina Júnior, C. (2013). Related Work and Concepts. In: Data Mining in Large Sets of Complex Data. SpringerBriefs in Computer Science. Springer, London. https://doi.org/10.1007/978-1-4471-4890-6_2
DOI: https://doi.org/10.1007/978-1-4471-4890-6_2
Publisher Name: Springer, London
Print ISBN: 978-1-4471-4889-0
Online ISBN: 978-1-4471-4890-6
eBook Packages: Computer Science (R0)