Data Mining Process

Vazirgiannis, Michalis; Halkidi, Maria; Gunopulos, Dimitrios

doi:10.1007/978-1-4471-0031-7_2

Part of the book series: Advanced Information and Knowledge Processing (AI&KP)

Abstract

Knowledge discovery from large data repositories is recognized as a key research issue in the fields of databases, machine learning, and statistics, as well as an important opportunity for innovation in business. Applications such as data warehousing and on-line services over the Internet apply a variety of data mining techniques to understand customers’ behavior better, and thereby to improve the quality of the services they provide and gain a business advantage.

Author information

Authors and Affiliations

Department of Informatics, Athens University of Economics and Business, Greece
Michalis Vazirgiannis PhD & Maria Halkidi MSc
Department of Computer Science and Engineering, University of California, Riverside, USA
Dimitrios Gunopulos PhD

About this chapter

Cite this chapter

Vazirgiannis, M., Halkidi, M., Gunopulos, D. (2003). Data Mining Process. In: Uncertainty Handling and Quality Assessment in Data Mining. Advanced Information and Knowledge Processing. Springer, London. https://doi.org/10.1007/978-1-4471-0031-7_2

DOI: https://doi.org/10.1007/978-1-4471-0031-7_2
Publisher Name: Springer, London
Print ISBN: 978-1-4471-1119-1
Online ISBN: 978-1-4471-0031-7
eBook Packages: Springer Book Archive
