Definition of the Subject

By definition, to discover is to see, get knowledge of, learn of, find, or find out; to gain sight or knowledge of something previously unseen or unknown [18]. A discovery system can therefore be defined as a system that supports the process of finding new knowledge. A simple query for "discovery system" on the World Wide Web returns many different types of discovery systems: from knowledge discovery systems in databases, internet‐based knowledge discovery, service discovery systems, and resource discovery systems to more specific ones such as drug discovery systems [10], gene discovery systems [43], discovery systems for personality profiling [48], and developmental discovery systems [17], among others. As illustrated, a variety of discovery systems can be found in many different research areas, but we will focus on knowledge discovery and knowledge discovery systems from the computer science perspective. Inconsistent definitions of terms knowledge discovery...

Abbreviations

Accuracy (rate):

A measure used for evaluating the quality of an induced model: the proportion of tested samples that the model classifies correctly.

Average class accuracy:

One of the simplest metrics for estimating the quality of a model. Classification accuracy is calculated separately for each class of the target variable, and the per-class accuracies are then averaged.
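The calculation can be sketched as follows. This is a minimal illustration, assuming that the accuracy of a class is the fraction of samples of that class that were predicted correctly; the function name `average_class_accuracy` and the toy labels are hypothetical and not taken from the entry.

```python
from collections import defaultdict

def average_class_accuracy(actual, predicted):
    """Return (mean of per-class accuracies, per-class accuracies).

    Assumption: the accuracy of a class is the fraction of samples
    belonging to that class that were predicted correctly.
    """
    total = defaultdict(int)    # number of samples per actual class
    correct = defaultdict(int)  # correctly predicted samples per class
    for a, p in zip(actual, predicted):
        total[a] += 1
        if a == p:
            correct[a] += 1
    per_class = {c: correct[c] / total[c] for c in total}
    return sum(per_class.values()) / len(per_class), per_class

# Toy data: class "a" is more frequent than class "b".
actual    = ["a", "a", "a", "a", "b", "b"]
predicted = ["a", "a", "a", "b", "b", "a"]
avg, per_class = average_class_accuracy(actual, predicted)
print(per_class)  # {'a': 0.75, 'b': 0.5}
print(avg)        # 0.625
```

Overall accuracy on the same toy data is 4/6, so the two measures differ when classes are imbalanced.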

Aggregation:

The process of combining two or more objects into a single one. Typical statistical aggregation functions for quantitative attributes are sum and average.
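As a small illustration, the sketch below combines several transaction records per store into one aggregated object using sum and average. The record layout (`store`, `amount`) and the function name `aggregate` are assumptions made only for this example.

```python
from collections import defaultdict
from statistics import mean

def aggregate(records, key, value):
    """Combine all records sharing the same key into one summary object."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    # One aggregated object per group, with sum and average of the attribute.
    return {k: {"sum": sum(v), "average": mean(v)} for k, v in groups.items()}

# Hypothetical transaction records.
records = [
    {"store": "A", "amount": 10.0},
    {"store": "A", "amount": 30.0},
    {"store": "B", "amount": 5.0},
]
summary = aggregate(records, key="store", value="amount")
print(summary["A"])  # {'sum': 40.0, 'average': 20.0}
```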

Attribute:

A property or characteristic of a data object, which may vary over time and from object to object. Attributes are usually assigned values or symbols for the purpose of analysis. Other frequently used names for an attribute are variable and feature.

Binarization:

Transformation of continuous or discrete attributes into binary attributes. Binary attributes have only two possible values.
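Two common ways of binarizing can be sketched as follows: thresholding a continuous attribute, and expanding a discrete attribute into one binary attribute per category. Both function names and the example values are hypothetical; the entry does not prescribe a particular method.

```python
def binarize_threshold(values, threshold):
    """Map a continuous attribute to 0/1: 1 if the value exceeds the threshold."""
    return [1 if v > threshold else 0 for v in values]

def binarize_one_hot(values):
    """Map a discrete attribute to one binary attribute per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

flags = binarize_threshold([1.5, 3.0, 4.2], threshold=3.0)
categories, rows = binarize_one_hot(["red", "green", "red"])
print(flags)       # [0, 0, 1]
print(categories)  # ['green', 'red']
print(rows)        # [[0, 1], [1, 0], [0, 1]]
```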

Classification:

Classification of data objects is the process of assigning classes or class labels to data objects. It is a type of predictive modeling used for predicting a discrete target variable.

Classification accuracy:

See description under Accuracy (rate).

Classifier:

A model, induced from data, that is used for classification.

Confusion matrix:

A matrix that tabulates the results of testing a model: actual class values versus predicted class values. It is a very useful visual tool for understanding the results of testing a classification model.
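A confusion matrix can be built by counting each (actual, predicted) pair. The sketch below assumes the convention that rows correspond to actual classes and columns to predicted classes; the opposite layout is also in use. The function name and the toy labels are hypothetical.

```python
def confusion_matrix(actual, predicted, labels):
    """Count (actual, predicted) pairs; rows = actual, columns = predicted."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

actual    = ["pos", "pos", "pos", "neg", "neg", "neg", "neg"]
predicted = ["pos", "pos", "neg", "neg", "neg", "neg", "pos"]
matrix = confusion_matrix(actual, predicted, labels=["pos", "neg"])
for label, row in zip(["pos", "neg"], matrix):
    print(label, row)
# pos [2, 1]
# neg [1, 3]
```

The diagonal holds the correctly classified samples; off-diagonal cells show which classes the model confuses.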

Data cleaning:

A step of KDDM that usually involves detecting and correcting data quality problems, removing noise, identifying outliers, and dealing with missing values.

Data mining:

A technology combining traditional data analysis methods and sophisticated algorithms for automatically processing large volumes of data and finding and extracting novel, useful and usually hidden patterns.

Data object:

A record of attribute values about an object or person. Other common names for data object are data record, case, point, sample, observation or entity.

Data preprocessing/preparation:

A phase of KDDM related to the preparation and transformation of data for data mining. It comprises several techniques for selecting relevant data and attributes and for creating or changing attributes.

Data set:

A collection of similar or related data objects. Data objects are usually collected for a particular study.

Dimensionality reduction:

Reduction in the number of attributes. It is used for eliminating irrelevant features and noise by creating new attributes as combinations of the old attributes. Feature subset selection, or feature selection, is another type of dimensionality reduction, in which dimensionality is reduced by selecting and using only a subset of the old attributes.

Discretization:

A transformation of continuous numerical attributes into categorical or discrete attributes.
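One simple discretization scheme is equal-width binning, sketched below. This is only one of several possible approaches and is chosen here for brevity; the function name and the example ages are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Assign each continuous value to one of n_bins intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return [0] * len(values)
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [18, 25, 40, 62, 78]
bins = equal_width_bins(ages, n_bins=3)
print(bins)  # [0, 0, 1, 2, 2]
```

The resulting bin indices can then be treated as a categorical attribute, for example with labels such as "young", "middle", and "old".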

Ensemble:

A group of classifiers, also known as a committee or multiple classifier system. Ensemble approaches exploit the classification abilities of multiple classifiers. Integrating the classifiers usually enhances the performance of the final classification.
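A minimal sketch of one way to integrate classifiers is plain majority voting, shown below. The entry does not specify an integration scheme, so voting is an assumption made for illustration; the three threshold-based toy classifiers are hypothetical.

```python
from collections import Counter

def majority_vote(classifiers, sample):
    """Return the class label predicted by most classifiers in the ensemble."""
    votes = [classify(sample) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]

# Three toy classifiers that disagree on where "high" begins.
classifiers = [
    lambda x: "high" if x > 5 else "low",
    lambda x: "high" if x > 7 else "low",
    lambda x: "high" if x > 9 else "low",
]
print(majority_vote(classifiers, 8))  # high (two of three vote "high")
print(majority_vote(classifiers, 6))  # low  (two of three vote "low")
```

Using an odd number of classifiers on a two-class problem avoids tied votes.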

Feature:

A feature (variable) is a synonym for attribute. It is frequently used in the data‐mining domain. See Attribute.

Feature extraction:

A process of creating new features from the original raw data. It is highly domain‐specific.

Knowledge discovery and data mining (KDDM):

The “umbrella” term for the overall process of knowledge discovery.

Knowledge discovery (KD):

A nontrivial process of mapping low-level data into other, more meaningful forms that are easier to understand, such as patterns, rules, summaries, or even graphs.

Noise:

The result of erroneous measurements. It can involve distortion of values or the addition of unauthentic data objects. Unlike an outlier, noise is not legitimate data.

Outlier:

An anomalous object or atypical value of an attribute. Outliers can be legitimate data objects or values. Detecting outliers is especially important in fraud detection or network intrusion detection.

Pattern:

In KDDM, a pattern is defined as a high-level description of a subset of data. It can take many forms, e.g., statistical or predictive models of data, relationships among parts of data sets, association rules, clusters, graphs, summaries, classification rules, tree structures, or linear equations.

Precision:

Fraction of positive samples correctly classified as positive among all samples classified as positive.

Recall:

See Sensitivity.

Regression:

Regression is a type of predictive modeling used for predicting a continuous target variable.

Sampling:

The process of selecting a subset of data, or sample, for data analysis. Basic sampling techniques are simple random sampling with or without replacement, stratified sampling, and adaptive or progressive sampling.
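Simple random sampling and stratified sampling can be sketched with the Python standard library as follows. The function names, the 10% sampling fraction, and the toy data set are assumptions for illustration; adaptive or progressive sampling is not shown.

```python
import random
from collections import defaultdict

def simple_random_sample(data, k, replace=False, rng=random):
    """Draw k items uniformly at random, with or without replacement."""
    if replace:
        return rng.choices(data, k=k)  # the same item may be drawn twice
    return rng.sample(data, k)         # every item is drawn at most once

def stratified_sample(data, key, fraction, rng=random):
    """Draw the same fraction from each stratum (group of equal key value)."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Toy data set: 20 "pos" objects and 80 "neg" objects.
data = [(i, "pos" if i < 20 else "neg") for i in range(100)]

without = simple_random_sample(data, 10, rng=rng)
with_repl = simple_random_sample(data, 10, replace=True, rng=rng)
strat = stratified_sample(data, key=lambda item: item[1], fraction=0.1, rng=rng)
print(sum(1 for _, label in strat if label == "pos"))  # 2
```

Stratified sampling preserves the class proportions of the data set (here 2 "pos" and 8 "neg"), which simple random sampling does not guarantee for small samples.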

Sensitivity (recall):

Proportion of samples correctly classified as positive (true positives) of all positive samples tested. If sensitivity is 1, all positive samples have been identified as positive.

Specificity:

Proportion of samples correctly classified as negative (true negatives) among all negative samples tested.
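The definitions of precision, sensitivity (recall), and specificity can be expressed in terms of the four counts of a two-class confusion matrix: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). The sketch below uses hypothetical counts and assumes that none of the denominators is zero.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard two-class metrics from confusion-matrix counts.

    Assumption: every denominator is non-zero (each class occurs and at
    least one sample is predicted positive).
    """
    return {
        "precision":   tp / (tp + fp),  # correct among predicted positives
        "sensitivity": tp / (tp + fn),  # correct among actual positives
        "specificity": tn / (tn + fp),  # correct among actual negatives
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

# Hypothetical test results for 100 samples.
metrics = binary_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
# precision: 0.800
# sensitivity: 0.889
# specificity: 0.818
# accuracy: 0.850
```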

Bibliography

Primary Literature

  1. Anand S, Buchner A (1998) Decision support using data mining. Financial Times Management, London

  2. Baeck T (1996) Evolutionary algorithms in theory and practice. Oxford University Press, New York

  3. Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159

  4. Becerra‐Fernandez I, Gonzalez A, Sabherwal R (2004) Knowledge management: Challenges, solutions, and technologies. Prentice Hall, Upper Saddle River

  5. Beck JR, Shultz E (1986) The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 110:13–20

  6. Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140

  7. Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group, Belmont

  8. Boz O (2000) Converting a trained neural network to a decision tree dectext – decision tree extractor. PhD thesis, Computer Science and Engineering, Lehigh University. http://citeseer.ist.psu.edu/boz00converting.html. Accessed 12 Nov 2007

  9. Cabena P, Hadjinian P, Stadler R, Verhees J, Zanasi A (1998) Discovering data mining: From concepts to implementation. Prentice Hall, Upper Saddle River

  10. Caspase Drug Discovery Systems. drug discovery system. http://www.biomol.com/Online_Catalog/Online_Catalog/Products/36/?categoryId=420. Accessed 6 Nov 2007

  11. Cios K, Teresinska A, Konieczna S, Potocka J, Sharma S (2000) Diagnosing myocardial perfusion from PECT bull's‐eye maps – a knowledge discovery approach. IEEE Eng Med Biol Mag, Special Issue Med Data Mining Knowl Discov 19(4):17–25

  12. Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA (2007) Data mining. A knowledge discovery approach. Springer, New York

  13. Dalgaard P (2002) Introductory statistics with R. Springer, New York

  14. Davenport TH, Prusak L (1997) Information ecology: Mastering the information and knowledge environment. Oxford University Press, New York

  15. Dennis JE Jr, Schnabel RB (1989) A view of unconstrained optimization. In: Nemhauser GL, Rinnooy Kan AHG, Todd MJ (eds) Handbook in operations research and management science, vol 1 Optimization. Elsevier, Amsterdam

  16. Demsar J, Zupan B (2004) Orange: From experimental machine learning to interactive data mining. White Paper. Faculty of Computer and Information Science, University of Ljubljana. http://www.ailab.si/orange

  17. Developmental Discovery System (TM). Developmental discovery system. http://www.gotofocus.com/. Accessed 6 Nov 2007

  18. Dictionary.com Unabridged (v 1.1). discover. http://dictionary.reference.com/browse/discover. Accessed 5 Nov 2007

  19. Dietterich TG (2000) Ensemble methods in machine learning. In: First International Workshop on Multiple Classifier Systems. Lecture Notes in Computer Science. Springer, New York, pp 1–15

  20. Dixon J (2005) Pentaho Open Source Business Intelligence Platform Technical White Paper. Pentaho Corporation, Orlando. http://sourceforge.net/project/showfiles.php?group_id=140317

  21. Fayyad U, Piatetsky‐Shapiro G, Smyth P (1996) From data mining to knowledge discovery in databases (a survey). AI Mag 17(3):37–54

  22. Fayyad U, Piatetsky‐Shapiro G, Smyth P, Uthurusamy R (eds) (1996) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park

  23. Frawley W, Piatetsky‐Shapiro G, Matheus C (1991) Knowledge discovery in databases: An overview. In: Piatetsky‐Shapiro G, Frawley W (eds) Knowledge Discovery in Databases. AAAI/MIT Press, Menlo Park, pp 1–27

  24. Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Proceedings Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp 148–156

  25. Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison‐Wesley, Reading

  26. Hand D, Mannila H, Smyth P (eds) (2001) Principles of data mining. MIT Press, Cambridge

  27. Holland JH (1975) Adaptation in natural and artificial systems. MIT Press, Cambridge

  28. Iglesias CJ (1996) The role of hybrid systems in intelligent data management: The case of fuzzy/neural hybrids. Control Eng Pract 4(6):839–845

  29. Kass GV (1980) An exploratory technique for investigating large quantities of categorical data. Appl Stat 29:119–127

  30. Kurgan L, Musilek P (2006) A survey of Knowledge Discovery and Data Mining process models. Knowl Eng Rev 21(1):1–24

  31. Loh W, Shih Y (1997) Split selection methods for classification trees. Stat Sinica 7:815–840

  32. Mannila H (2000) Theoretical frameworks of data mining. SIGKDD Explor 1:30–32

  33. Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) YALE: Rapid Prototyping for Complex Data Mining Tasks. In: Proc of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, pp 1–6

  34. Pechenizkiy M, Tsymbal A, Puuronen S (2005) Meta‐knowledge management in multistrategy process‐oriented knowledge discovery systems. Technical Report, Dublin, Trinity College Dublin, Department of Computer Science, TCD-CS-2005–30, p 12

  35. Piatetsky‐Shapiro G (1991) Knowledge discovery in real databases: A report on the IJCAI-89 Workshop. AI Mag 11(5):68–70

  36. Piatetsky‐Shapiro G (1999) The data mining industry coming of age. IEEE Intell Syst 14(6):32–33

  37. Provost F, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing classifiers. In: Proceedings of the Fifteenth International Conference on Machine Learning, (ICML-98), San Francisco

  38. Quinlan JR (1986) Induction of decision trees. In: Machine Learning, vol 1. Kluwer, Hingham

  39. Quinlan R (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Francisco

  40. Rakotomalala R (2005) TANAGRA: Un logiciel gratuit pour l'enseignement et la recherche. In: Proc of the 5th Journees d'Extraction et Gestion des Connaissances 2:697–702

  41. Reeves CR (ed) (1993) Modern heuristic techniques for combinatorial problems. Wiley, New York

  42. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back‐propagating errors. Nature 323:533–536

  43. Sano M, Kato Y, Taira K (2005) Functional gene‐discovery systems based on libraries of hammerhead and hairpin ribozymes and short hairpin RNAs. Mol Biosyst 1:27–35

  44. Shearer C (2000) The CRISP-DM model: the new blueprint for data mining. J Data Wareh 5(4):13–19

  45. Smyth P, Goodman RM (1991) Rule induction using information theory. In: Piatetsky‐Shapiro G, Frawley WJ (eds) Knowledge Discovery in Databases. AAAI Press, Menlo Park, pp 159–176

  46. Snedecor GW, Cochran WG (1989) Statistical methods, 8th edn. Iowa State University Press, Ames

  47. Tan P, Steinbach M, Kumar V (2005) Introduction to data mining. Addison‐Wesley, Boston

  48. The Discovery System. discovery system for personality profiling. http://www.insights.com/core/English/TheDiscoverySystem/default.shtm. Accessed 6 Nov 2007

  49. Towsey M, Alpsan D, Sztriha L (1995) Training a neural network with conjugate gradient methods. IEEE Proc Neural Netw 1:373–378

  50. Weiss GM, Provost F (2001) The effect of class distribution on classifier learning. Technical Report ML-TR 43, Department of Computer Science, Rutgers University

  51. Werbos PJ (1994) The roots of backpropagation. Wiley, New York

  52. Witten IH, Frank E (2005) Data mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco

  53. Wolpert D, Macready W (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82

Books and Reviews

  1. Berthold M, Hand DJ (2003) Intelligent data analysis: An introduction, 2nd edn. Springer, New York

  2. Lin TY, Ohsuga S, Liau CJ, Hu X, Tsumoto S (eds) (2005) Foundations of data mining and knowledge discovery. Studies in Computational Intelligence, vol 6. Springer, New York

Copyright information

© 2009 Springer-Verlag

About this entry

Cite this entry

Povalej, P., Verlic, M., Stiglic, G. (2009). Discovery Systems. In: Meyers, R. (eds) Encyclopedia of Complexity and Systems Science. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30440-3_125
