Discovery Systems

Povalej, Petra; Verlic, Mateja; Stiglic, Gregor

doi:10.1007/978-0-387-30440-3_125

Definition of the Subject

By definition, to discover is to see, get knowledge of, learn of, find or find out; gain sight or knowledge of something previously unseen or unknown [18]. A discovery system can therefore be defined as a system that supports the process of finding new knowledge. A simple query for "discovery system" on the World Wide Web returns many different types of discovery systems: from knowledge discovery systems in databases, internet‐based knowledge discovery, service discovery systems and resource discovery systems to more specific ones, such as drug discovery systems [10], gene discovery systems [43], a discovery system for personality profiling [48], and developmental discovery systems [17], among others. As illustrated, a variety of discovery systems can be found in many different research areas, but we will focus on knowledge discovery and knowledge discovery systems from the computer science perspective. Inconsistent definitions of the terms knowledge discovery...

Abbreviations

Accuracy (rate):: Proportion of correctly classified samples among all samples tested. It is used for evaluating the quality of an induced model.
Average class accuracy:: One of the simplest metrics for estimating the quality of a model. Classification accuracy is calculated separately for each class of the target variable, and the per-class accuracies are then averaged.
Aggregation:: The process of combining two or more objects into a single one. Typical statistical aggregation functions for quantitative attributes are sum and average.
Attribute:: A property or characteristic of a data object, which may vary over time and from object to object. Attributes are usually assigned values or symbols for the purpose of analysis. Other frequently used names for an attribute are variable and feature.
Binarization:: Transformation of continuous or discrete attributes into binary attributes. Binary attributes have only two possible values.
Classification:: The process of assigning classes or class labels to data objects. It is a type of predictive modeling used for predicting a discrete target variable.
Classification accuracy:: See description under Accuracy (rate).
Classifier:: A model, induced from data, that is used for classification.
Confusion matrix:: A matrix that tabulates the actual class values of the test samples against the class values predicted by the model. It is a very useful visual tool for understanding the results of testing a classification model.
Data cleaning:: A step of KDDM that usually involves detecting and correcting data quality problems, removing noise, identifying outliers, and dealing with missing values.
Data mining:: A technology combining traditional data analysis methods and sophisticated algorithms for automatically processing large volumes of data and finding and extracting novel, useful and usually hidden patterns.
Data object:: A record of attribute values about an object or person. Other common names for data object are data record, case, point, sample, observation or entity.
Data preprocessing/preparation:: A phase of KDDM related to the preparation and transformation of data for data mining. It comprises several techniques for selecting relevant data and attributes and for creating or changing attributes.
Data set:: A collection of similar or related data objects. Data objects are usually collected for a particular study.
Dimensionality reduction:: Reduction in the number of attributes. It is used for eliminating irrelevant features and noise by creating new attributes as combinations of the old attributes. Feature subset selection, or feature selection, is another type of dimensionality reduction, in which dimensionality is reduced by selecting and using only a subset of the old attributes.
Discretization:: A transformation of continuous numerical attributes into categorical or discrete attributes.
Ensemble:: A group of classifiers, also known as a committee or multiple classifier system. Ensemble approaches exploit the classification abilities of multiple classifiers. Integrating the classifiers usually improves the performance of the final classification.
Feature:: A feature (variable) is a synonym for attribute. It is frequently used in the data‐mining domain. See Attribute.
Feature extraction:: A process of creating new features from the original raw data. It is highly domain‐specific.
Knowledge discovery and data mining (KDDM):: The “umbrella” term for the overall process of knowledge discovery.
Knowledge discovery (KD):: The nontrivial process of mapping low-level data into other, more meaningful forms that are easier to understand, such as patterns, rules, summaries or even graphs.
Noise:: The result of erroneous measurements. It can involve distortion of values or the addition of unauthentic data objects. Unlike an outlier, noise is not legitimate data.
Outlier:: An anomalous object or atypical value of an attribute. Outliers can be legitimate data objects or values. Detecting outliers is especially important in fraud detection or network intrusion detection.
Pattern:: In KDDM, a high-level description of a subset of data. A pattern can take many forms, e.g. statistical or predictive models of data, relationships among parts of data sets, association rules, clusters, graphs, summaries, classification rules, tree structures, linear equations, etc.
Precision:: Fraction of positive samples correctly classified as positive among all samples classified as positive.
Recall:: See Sensitivity.
Regression:: A type of predictive modeling used for predicting a continuous target variable.
Sampling:: The process of selecting a subset of data, or sample, for data analysis. Basic sampling techniques are simple random sampling with or without replacement, stratified sampling, and adaptive or progressive sampling.
Sensitivity (recall):: Proportion of samples correctly classified as positive (true positives) of all positive samples tested. If sensitivity is 1, all positive samples have been identified as positive.
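Specificity:: Proportion of samples correctly classified as negative (true negatives) of all negative samples tested. If specificity is 1, all negative samples have been identified as negative.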
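
The evaluation terms above (accuracy, average class accuracy, precision, sensitivity and specificity) can all be derived from the four cells of a two-class confusion matrix. The following is a minimal Python sketch of that relationship, not a definitive implementation. The function names `safe_ratio`, `confusion_counts` and `evaluate` are hypothetical, the labels are assumed to be binary, and the per-class accuracy used for the average class accuracy is assumed to mean the fraction of each class's samples that is classified correctly.

```python
def safe_ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def confusion_counts(actual, predicted, positive=1):
    """Count the four cells of a two-class confusion matrix.

    Returns (tp, fp, tn, fn), where `positive` is the label treated as
    the positive class and every other label counts as negative.
    """
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1  # positive sample classified as positive
        elif p == positive:
            fp += 1  # negative sample classified as positive
        elif a == positive:
            fn += 1  # positive sample classified as negative
        else:
            tn += 1  # negative sample classified as negative
    return tp, fp, tn, fn


def evaluate(actual, predicted, positive=1):
    """Compute the glossary metrics from actual and predicted labels."""
    tp, fp, tn, fn = confusion_counts(actual, predicted, positive)
    sensitivity = safe_ratio(tp, tp + fn)  # recall on the positive class
    specificity = safe_ratio(tn, tn + fp)  # recall on the negative class
    return {
        "accuracy": safe_ratio(tp + tn, tp + fp + tn + fn),
        "precision": safe_ratio(tp, tp + fp),
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Assumption: per-class accuracy is the recall of each class.
        "average_class_accuracy": (sensitivity + specificity) / 2,
    }


if __name__ == "__main__":
    # Illustrative labels only; they are not taken from the chapter.
    actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
    for name, value in evaluate(actual, predicted).items():
        print(f"{name}: {value:.3f}")
```

With these illustrative labels the counts are 3 true positives, 1 false positive, 5 true negatives and 1 false negative, so accuracy is 0.8 while sensitivity and precision are both 0.75. On an imbalanced data set, accuracy and average class accuracy can diverge considerably, which is one reason to report both.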

Bibliography

Primary Literature

[1] Anand S, Buchner A (1998) Decision support using data mining. Financial Times Management, London
[2] Baeck T (1996) Evolutionary algorithms in theory and practice. Oxford University Press, New York
[3] Bradley AP (1997) The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognit 30(7):1145–1159
[4] Becerra‐Fernandez I, Gonzalez A, Sabherwal R (2004) Knowledge management: Challenges, solutions, and technologies. Prentice Hall, Upper Saddle River
[5] Beck JR, Shultz E (1986) The use of relative operating characteristic (ROC) curves in test performance evaluation. Arch Pathol Lab Med 110:13–20
[6] Breiman L (1996) Bagging predictors. Mach Learn 24(2):123–140
[7] Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group, Belmont
[8] Boz O (2000) Converting a trained neural network to a decision tree: DecText – decision tree extractor. PhD thesis, Computer Science and Engineering, Lehigh University. http://citeseer.ist.psu.edu/boz00converting.html. Accessed 12 Nov 2007
[9] Cabena P, Hadjinian P, Stadler R, Verhees J, Zanasi A (1998) Discovering data mining: From concepts to implementation. Prentice Hall, Upper Saddle River
[10] Caspase Drug Discovery Systems. drug discovery system. http://www.biomol.com/Online_Catalog/Online_Catalog/Products/36/?categoryId=420. Accessed 6 Nov 2007
[11] Cios K, Teresinska A, Konieczna S, Potocka J, Sharma S (2000) Diagnosing myocardial perfusion from SPECT bull's‐eye maps – a knowledge discovery approach. IEEE Eng Med Biol Mag, Special Issue Med Data Mining Knowl Discov 19(4):17–25
[12] Cios KJ, Pedrycz W, Swiniarski RW, Kurgan LA (2007) Data mining. A knowledge discovery approach. Springer, New York
[13] Dalgaard P (2002) Introductory statistics with R. Springer, New York
[14] Davenport TH, Prusak L (1997) Information ecology: Mastering the information and knowledge environment. Oxford University Press, New York
[15] Dennis JE Jr, Schnabel RB (1989) A view of unconstrained optimization. In: Nemhauser GL, Rinnooy Kan AHG, Todd MJ (eds) Handbook in operations research and management science, vol 1 Optimization. Elsevier, Amsterdam
[16] Demsar J, Zupan B (2004) Orange: From experimental machine learning to interactive data mining. White Paper. Faculty of Computer and Information Science, University of Ljubljana. http://www.ailab.si/orange
[17] Developmental Discovery System (TM). Developmental discovery system. http://www.gotofocus.com/. Accessed 6 Nov 2007
[18] Dictionary.com Unabridged (v 1.1). discover. http://dictionary.reference.com/browse/discover. Accessed 5 Nov 2007
[19] Dietterich TG (2000) Ensemble methods in machine learning. In: First International Workshop on Multiple Classifier Systems. Lecture Notes in Computer Science. Springer, New York, pp 1–15
[20] Dixon J (2005) Pentaho Open Source Business Intelligence Platform Technical White Paper. Pentaho Corporation, Orlando. http://sourceforge.net/project/showfiles.php?group_id=140317
[21] Fayyad U, Piatetsky‐Shapiro G, Smyth P (1996) From data mining to knowledge discovery in databases (a survey). AI Mag 17(3):37–54
[22] Fayyad U, Piatetsky‐Shapiro G, Smyth P, Uthurusamy R (eds) (1996) Advances in knowledge discovery and data mining. AAAI Press, Menlo Park
[23] Frawley W, Piatetsky‐Shapiro G, Matheus C (1991) Knowledge discovery in databases: An overview. In: Piatetsky‐Shapiro G, Frawley W (eds) Knowledge Discovery in Databases. AAAI/MIT Press, Menlo Park, pp 1–27
[24] Freund Y, Schapire RE (1996) Experiments with a new boosting algorithm. In: Proceedings Thirteenth International Conference on Machine Learning. Morgan Kaufmann, San Francisco, pp 148–156
[25] Goldberg DE (1989) Genetic algorithms in search, optimization, and machine learning. Addison‐Wesley, Reading
[26] Hand D, Mannila H, Smyth P (eds) (2001) Principles of data mining. MIT Press, Cambridge
[27] Holland JH (1975) Adaptation in natural and artificial systems. MIT Press, Cambridge
[28] Iglesias CJ (1996) The role of hybrid systems in intelligent data management: The case of fuzzy/neural hybrids. Control Eng Pract 4(6):839–845
[29] Kass GV (1980) An exploratory technique for investigating large quantities of categorical data. Appl Stat 29:119–127
[30] Kurgan L, Musilek P (2006) A survey of Knowledge Discovery and Data Mining process models. Knowl Eng Rev 21(1):1–24
[31] Loh W, Shih Y (1997) Split selection methods for classification trees. Stat Sinica 7:815–840
[32] Mannila H (2000) Theoretical frameworks of data mining. SIGKDD Explor 1:30–32
[33] Mierswa I, Wurst M, Klinkenberg R, Scholz M, Euler T (2006) YALE: Rapid Prototyping for Complex Data Mining Tasks. In: Proc of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, pp 1–6
[34] Pechenizkiy M, Tsymbal A, Puuronen S (2005) Meta‐knowledge management in multistrategy process‐oriented knowledge discovery systems. Technical Report, Dublin, Trinity College Dublin, Department of Computer Science, TCD-CS-2005–30, p 12
[35] Piatetsky‐Shapiro G (1991) Knowledge discovery in real databases: A report on the IJCAI-89 Workshop. AI Mag 11(5):68–70
[36] Piatetsky‐Shapiro G (1999) The data mining industry coming of age. IEEE Intell Syst 14(6):32–33
[37] Provost F, Fawcett T, Kohavi R (1998) The case against accuracy estimation for comparing classifiers. In: Proceedings of the Fifteenth International Conference on Machine Learning (ICML-98), San Francisco
[38] Quinlan JR (1986) Induction of decision trees. In: Machine Learning, vol 1. Kluwer, Hingham
[39] Quinlan R (1993) C4.5: Programs for machine learning. Morgan Kaufmann, San Francisco
[40] Rakotomalala R (2005) TANAGRA: Un logiciel gratuit pour l'enseignement et la recherche. In: Proc of the 5th Journees d'Extraction et Gestion des Connaissances 2:697–702
[41] Reeves CR (ed) (1993) Modern heuristic techniques for combinatorial problems. Wiley, New York
[42] Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back‐propagating errors. Nature 323:533–536
[43] Sano M, Kato Y, Taira K (2005) Functional gene‐discovery systems based on libraries of hammerhead and hairpin ribozymes and short hairpin RNAs. Mol Biosyst 1:27–35
[44] Shearer C (2000) The CRISP-DM model: the new blueprint for data mining. J Data Wareh 5(4):13–19
[45] Smyth P, Goodman RM (1991) Rule induction using information theory. In: Piatetsky‐Shapiro G, Frawley WJ (eds) Knowledge Discovery in Databases. AAAI Press, Menlo Park, pp 159–176
[46] Snedecor GW, Cochran WG (1989) Statistical methods, 8th edn. Iowa State University Press, Ames
[47] Tan P, Steinbach M, Kumar V (2005) Introduction to data mining. Addison‐Wesley, Boston
[48] The Discovery System. discovery system for personality profiling. http://www.insights.com/core/English/TheDiscoverySystem/default.shtm. Accessed 6 Nov 2007
[49] Towsey M, Alpsan D, Sztriha L (1995) Training a neural network with conjugate gradient methods. IEEE Proc Neural Netw 1:373–378
[50] Weiss GM, Provost F (2001) The effect of class distribution on classifier learning. Technical Report ML-TR 43, Department of Computer Science, Rutgers University
[51] Werbos PJ (1994) The roots of backpropagation. Wiley, New York
[52] Witten IH, Frank E (2005) Data mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco
[53] Wolpert D, Macready W (1997) No free lunch theorems for optimization. IEEE Trans Evol Comput 1(1):67–82

Books and Reviews

Berthold M, Hand DJ (2003) Intelligent data analysis: An introduction, 2nd edn. Springer, New York
Lin TY, Ohsuga S, Liau CJ, Hu X, Tsumoto S (eds) (2005) Foundations of data mining and knowledge discovery. Studies in Computational Intelligence, vol 6. Springer, New York

Author information

Authors and Affiliations

Faculty of Electrical Engineering and Computer Science, University of Maribor, Maribor, Slovenia
Petra Povalej, Mateja Verlic & Gregor Stiglic

Editor information

Editors and Affiliations

RAMTECH LIMITED, 122 Escalle Lane, Larkspur, CA, 94939, USA
Robert A. Meyers Ph. D. (Editor-in-Chief)

About this entry

Cite this entry

Povalej, P., Verlic, M., Stiglic, G. (2009). Discovery Systems. In: Meyers, R. (eds) Encyclopedia of Complexity and Systems Science. Springer, New York, NY. https://doi.org/10.1007/978-0-387-30440-3_125

DOI: https://doi.org/10.1007/978-0-387-30440-3_125
Publisher Name: Springer, New York, NY
Print ISBN: 978-0-387-75888-6
Online ISBN: 978-0-387-30440-3
eBook Packages: Physics and Astronomy; Reference Module Physical and Materials Science; Reference Module Chemistry, Materials and Physics
