Summary
This paper is an example of data mining in action. The database we are mining contains 1085 profiles of individuals who have downloaded the statistical software XploRe. Each profile contains the responses to an online questionnaire consisting of questions about such things as an individual’s computing preferences (operating system, favorite statistical software) or professional affiliation. After formatting and cleaning the raw data using MS Excel, we use IBM’s Intelligent Miner to perform a cluster analysis of the download profiles. We try to identify a small number of “types” of users by employing a clustering algorithm based on the New Condorcet Criterion, which is particularly well suited to our all-categorical data. The mining run identifies three clusters, which we refer to as Academia, Unix/Linux users and Researchers, respectively. Based on the characteristics of the cluster members, we briefly outline how the results of the data analysis may be used for targeted marketing of XploRe.
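To make the idea of a Condorcet-style criterion for categorical data concrete, the following is a minimal sketch and not the Intelligent Miner implementation. It assumes a simplified pairwise form of the criterion: two profiles in the same cluster are rewarded for each variable on which they agree and penalized for each on which they disagree, and the reverse holds for profiles in different clusters. The function name `condorcet_score` and the toy profiles are hypothetical; see Michaud (1997) and Grabmeier & Rudolph (1998) for the exact definition.

```python
from itertools import combinations


def condorcet_score(records, labels):
    """Simplified pairwise Condorcet-style score of a partition (assumed form).

    For every pair of records, count the variables on which they agree and
    disagree. A pair in the same cluster contributes (agree - disagree);
    a pair in different clusters contributes (disagree - agree).
    A higher score indicates a better partition.
    """
    score = 0
    for i, j in combinations(range(len(records)), 2):
        agree = sum(a == b for a, b in zip(records[i], records[j]))
        disagree = len(records[i]) - agree
        if labels[i] == labels[j]:
            score += agree - disagree
        else:
            score += disagree - agree
    return score


# Hypothetical toy profiles: (operating system, affiliation)
profiles = [
    ("Windows", "University"),
    ("Windows", "University"),
    ("Linux", "Research institute"),
    ("Linux", "Research institute"),
]

# Grouping identical profiles together scores higher than mixing them.
good = condorcet_score(profiles, [0, 0, 1, 1])
bad = condorcet_score(profiles, [0, 1, 0, 1])
print(good, bad)
```

Because the score relies only on agreement counts, it needs no distance measure between category values, which is one reason such criteria fit all-categorical data.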
Notes
1. To make it easy to relate the questions to the variables used below, we give the variable names in bold typewriter font at this point wherever possible.
2. See, for instance, Gordon (1999) for a comprehensive treatment of cluster analysis. Alternative clustering methods in a data mining context are CLARANS (Ng & Han, 1994), DBSCAN (Ester, Kriegel, Sander, & Xu, 1996), BIRCH (Zhang, Ramakrishnan, & Livny, 1996) and CURE (Guha, Rastogi, & Shim, 1998).
3. The remaining variables in the data set proved to be less useful in the clustering algorithm, essentially because they have either too few (Server, Mailing List) or too many possible values (Statistical Software, Country).
4. In fact, Intelligent Miner provides a standardized χ² statistic for each variable in each cluster. This statistic, reported in Table 2, indicates how much the intracluster distribution differs from the distribution of the variable in the entire sample. The closer χ² is to 1 (and the farther it is from 0), the more the intracluster distribution of the variable differs from its distribution in the entire sample. See Grabmeier & Rudolph (1998) for details.
5. Indeed, in an earlier analysis with less recent profiles, the World Wide Web was less important in clusters I and III. The increased importance is probably due both to the general increase in internet usage and to the enhanced internet presence of XploRe.
References
Chen, M. S., Han, J., & Yu, P. S. (1996). Data Mining: an Overview from a Database Perspective, IEEE Trans. on Knowledge and Data Engineering, 8:866–883.
Ester, M., Kriegel, H., Sander, J., & Xu, X. (1996). A Density Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proc. of Int’l Conf. on Knowledge Discovery and Data Mining, Portland, Oregon.
Gordon, A. D. (1999). Classification, Chapman and Hall, 2nd ed., London.
Grabmeier, J. & Rudolph, A. (1998). Techniques of Cluster Algorithms in Data Mining, Technical Report IBM, http://www.ibm.com/software/data/iminer/fordata/clusttechn.pdf.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases, Proc. of ACM SIGMOD Int’l Conf. on Management of Data, New York, pp. 73–84.
Ha, S. H. & Park, S. C. (1998). Application of data mining tools to hotel data mart on the Intranet for database marketing, Expert Systems with Applications, 15:1–31.
Härdle, W., Klinke, S., & Müller, M. (1999). XploRe Learning Guide, Springer Verlag, Heidelberg.
Michaud, P. (1987). Condorcet — A man of the Avant-garde, Applied Stochastic Models and Data Analysis, 3:173–189.
Michaud, P. (1997). Clustering Techniques, Future Generation Computer Systems, 13:135–147.
Ng, R. T., & Han, J. (1994). Efficient and Effective Clustering Methods for Spatial Data Mining, Proc. of the 20th Int’l Conf. on Very Large Data Bases, Santiago, Chile, pp. 144–155.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An Efficient Data Clustering Method for Very Large Databases, Proc. of the 1996 ACM SIGMOD Int’l Conf. on Management of Data, Montreal, Canada, pp. 103–114.
Additional information
An extended version of this paper is available at http://sfb.wiwi.hu-berlin.de/. Financial support from Deutscher Akademischer Austauschdienst and Deutsche Forschungsgemeinschaft (SFB 373, “Quantifikation und Simulation Ökonomischer Prozesse”) is gratefully acknowledged. We are very grateful for the helpful comments of two anonymous referees, which led to improvements in the paper. All remaining errors are our own.
About this article
Cite this article
Sofyan, H., Werwatz, A. Analyzing XploRe Download Profiles with Intelligent Miner. Computational Statistics 16, 465–479 (2001). https://doi.org/10.1007/s001800100079
DOI: https://doi.org/10.1007/s001800100079
Keywords
- Data Mining
- Cluster Analysis