The KDD pipeline describes the complete process of knowledge discovery in databases (KDD), i.e. the process of deriving useful, valid and non-trivial patterns from a large amount of data. The pipeline consists of five consecutive steps:
The selection step identifies the goal of the current application and selects a data set that is likely to contain relevant patterns.
The preprocessing step increases the quality of the data set by supplementing missing attributes, removing duplicate instances and resolving data inconsistencies.
The transformation step deletes correlated and irrelevant attributes and derives new, more meaningful attributes from the current data description.
The data mining step selects an algorithm suited to the goal identified in the selection step and derives patterns or learns functions that are valid for the current data set.
The evaluation and interpretation step, the last in the pipeline, checks the validity of the patterns found. Furthermore, the user examines how useful the discovered knowledge is for the given application.
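The five steps can be sketched as a chain of small functions. This is only an illustrative toy under stated assumptions, not a reference implementation: the records, the attribute names, the derived BMI attribute, and the "largest gap" split used as a stand-in for a data mining algorithm are all hypothetical and do not come from the KDD literature.

```python
from statistics import mean

# Hypothetical raw records: (id, height_cm, weight_kg, source).
# None marks a missing value; record 2 appears twice.
RAW = [
    (1, 170.0, 65.0, "a"),
    (2, 180.0, None, "a"),
    (2, 180.0, None, "a"),
    (3, 160.0, 50.0, "b"),
    (4, 175.0, 80.0, "a"),
]

def select(records):
    """Selection: keep only the attributes assumed relevant to the goal."""
    return [(rid, h, w) for rid, h, w, _source in records]

def preprocess(records):
    """Preprocessing: drop duplicate instances, fill missing weights with the mean."""
    unique = list(dict.fromkeys(records))  # removes duplicates, keeps order
    fill = mean(w for _, _, w in unique if w is not None)
    return [(rid, h, fill if w is None else w) for rid, h, w in unique]

def transform(records):
    """Transformation: replace height and weight by one derived attribute (BMI)."""
    return [(rid, w / (h / 100) ** 2) for rid, h, w in records]

def mine(records):
    """Data mining (toy stand-in): split the 1-D values at their largest gap."""
    ordered = sorted(records, key=lambda r: r[1])
    gaps = [(b[1] - a[1], i) for i, (a, b) in enumerate(zip(ordered, ordered[1:]))]
    _, cut = max(gaps)
    low = [rid for rid, _ in ordered[: cut + 1]]
    high = [rid for rid, _ in ordered[cut + 1 :]]
    return low, high

def evaluate(clusters, records):
    """Evaluation: check that both groups are non-empty and cover every record once."""
    low, high = clusters
    return bool(low) and bool(high) and sorted(low + high) == sorted(r[0] for r in records)

def kdd_pipeline(raw):
    data = transform(preprocess(select(raw)))
    clusters = mine(data)
    return clusters, evaluate(clusters, data)

clusters, valid = kdd_pipeline(RAW)
```

Whether the resulting groups are useful for the application is left to the user, which matches the interpretation part of the last step.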
The KDD pipeline was originally introduced as the KDD process [1, 2, 3]. It describes a unifying framework containing all necessary and optional steps for deriving patterns with data mining algorithms. An important aspect of this view is that data mining is just one step in the complete KDD process, which emphasizes the importance of a meaningful and consistent data representation. The KDD pipeline is considered an interactive and adjustable framework rather than a strict workflow. This flexibility is needed because of the large variety of methods and parameter settings that can be applied in each step. In most cases, if the final patterns do not show satisfactory quality in the evaluation step, it is necessary to adjust parameters or even exchange the method applied in one of the steps. Furthermore, the boundaries between the steps cannot be drawn strictly, because the quality of the results depends strongly on a well-chosen combination of the methods applied in each step. In addition, some methods fulfill the tasks of two consecutive steps, e.g., transformation and data mining.
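The interactive character of the pipeline can be illustrated by a loop that reruns a step with a different parameter until the evaluation is satisfactory. The sketch below is a minimal example under assumptions: equal-width binning stands in for the data mining method, the largest spread within a group stands in for the quality measure, and all function and parameter names are hypothetical.

```python
def bin_values(values, bins):
    """Toy mining step: group values into `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width if all values are equal
    groups = [[] for _ in range(bins)]
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # the maximum goes into the last bin
        groups[idx].append(v)
    return [g for g in groups if g]  # drop empty intervals

def largest_spread(groups):
    """Toy evaluation: the widest range inside any group (lower is better)."""
    return max(max(g) - min(g) for g in groups)

def run_until_satisfactory(values, max_spread, max_bins=10):
    """Increase the parameter `bins` until the evaluation accepts the result."""
    for bins in range(1, max_bins + 1):
        groups = bin_values(values, bins)
        if largest_spread(groups) <= max_spread:
            return bins, groups
    return None  # no setting was satisfactory; another method may be needed

values = [1.0, 1.5, 2.0, 8.0, 8.5, 9.0]
result = run_until_satisfactory(values, max_spread=1.0)
```

Returning `None` when no parameter setting passes corresponds to the case described above in which adjusting parameters is not enough and the applied method itself has to be exchanged.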
1. Brachman R, Anand T. The process of knowledge discovery in databases: a human centered approach. Proceedings of 10th National Conference on AI; 1996. p. 37–8.
2. Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. Proceedings of 10th National Conference on AI; 1996. p. 1–30.
3. Fayyad U, Piatetsky-Shapiro G, Smyth P. Knowledge discovery and data mining: towards a unifying framework. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining; 1996. p. 82–8.