Abstract
We propose a new class of distance measures (metrics) designed for multisets. Both distance measures and multisets are recurrent themes in many data mining applications. One particular instance of this class originated from the need to cluster criminal behaviours.
These distance measures are parameterized by a function f; as long as f satisfies a few simple restrictions, the result is always a valid metric. This flexibility allows the measures to be tailored to many domain-specific applications.
In this paper, the metrics are applied in bio-informatics (genomics), criminal behaviour clustering and text mining. The proposed metric is also a generalization of some known measures, e.g., the Jaccard distance and the Canberra distance. We discuss several options and compare the behaviour of different instances.
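To make the idea of an f-parameterized multiset distance concrete, here is a minimal Python sketch. The abstract does not state the formula, so the form used here is an assumption: sum f over the per-element counts and divide by the number of distinct elements present in either multiset. The names `multiset_distance` and `canberra_term` are hypothetical, and the particular choice of f is only one illustrative option. Under these assumptions the sketch reproduces the standard Jaccard distance on ordinary sets and a normalized form of the standard Canberra distance on count vectors.

```python
from collections import Counter


def multiset_distance(x, y, f):
    """Assumed form (not taken from the abstract): sum f over the counts of
    every element present in either multiset, divided by the number of such
    distinct elements."""
    support = {e for c in (x, y) for e, n in c.items() if n > 0}
    if not support:
        return 0.0  # two empty multisets are treated as identical
    return sum(f(x[e], y[e]) for e in support) / len(support)


def canberra_term(a, b):
    """One illustrative choice of f: the per-coordinate Canberra term."""
    return abs(a - b) / (a + b) if a + b else 0.0


def jaccard_distance(s, t):
    """Standard Jaccard distance between two ordinary sets."""
    union = s | t
    return 1 - len(s & t) / len(union) if union else 0.0


def canberra_distance(u, v):
    """Standard Canberra distance between two non-negative count vectors."""
    return sum(canberra_term(a, b) for a, b in zip(u, v))


# On ordinary sets (all counts 0 or 1) the sketch coincides with Jaccard.
s, t = {"a", "b", "c"}, {"b", "c", "d"}
d_sets = multiset_distance(Counter(s), Counter(t), canberra_term)

# On true multisets it equals the Canberra distance divided by the number
# of distinct elements.
x, y = Counter("aab"), Counter("abbb")
d_multi = multiset_distance(x, y, canberra_term)
```

Swapping in a different f changes how strongly large count differences are penalized, which is the kind of domain-specific tailoring the abstract describes. Whether a given f yields a valid metric depends on the restrictions stated in the paper, which this sketch does not check.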
Copyright information
© 2008 Springer-Verlag London Limited
About this paper
Cite this paper
Kosters, W.A., Laros, J.F.J. (2008). Metrics for Mining Multisets. In: Bramer, M., Coenen, F., Petridis, M. (eds) Research and Development in Intelligent Systems XXIV. SGAI 2007. Springer, London. https://doi.org/10.1007/978-1-84800-094-0_22
DOI: https://doi.org/10.1007/978-1-84800-094-0_22
Publisher Name: Springer, London
Print ISBN: 978-1-84800-093-3
Online ISBN: 978-1-84800-094-0
eBook Packages: Computer Science, Computer Science (R0)