Data Farming: Concepts and Methods
A typical data mining project uses data collected for various purposes, ranging from routinely gathered data and data from process improvement projects to data required for archival purposes. In some cases, the set of considered features might be large (a wide data set) and sufficient for extraction of knowledge. In other cases, the data set might be narrow and insufficient to extract meaningful knowledge, or the data may not even exist.
Mining wide data sets has received attention in the literature, and many models and algorithms for feature selection have been developed for them.
Determining the features for which data should be collected when no data set exists, or when a data set is only partially available, has not been sufficiently addressed in the literature. Yet this issue is of paramount importance as interest in data mining grows. The methods and processes for defining the most appropriate features for data collection, data transformation, data quality assessment, and data analysis are referred to as data farming. This chapter outlines the elements of a data farming discipline.
Key Words: Data Farming, Data Mining, Feature Definition, Feature Functions, New Features