Abstract
The amount of information stored in modern database applications has increased tremendously in recent years. Besides their sheer number, the stored data objects are also becoming more and more complex. Classification of these complex objects is therefore an important data mining task that poses several new challenges. In many applications, the data objects provide multiple representations. For example, proteins can be described by text, amino acid sequences, or 3D structures. Additionally, many real-world applications need to distinguish thousands of classes. Last but not least, many complex objects cannot be expressed directly as feature vectors. To cope with all these requirements, we introduce a novel approach to the classification of multi-represented objects that is capable of distinguishing large numbers of classes. Our method is based on k nearest neighbor classification and employs density-based clustering as a new approach to reducing the number of training instances for instance-based classification. To predict the most likely class, our classifier employs a new method that combines several object representations into an accurate class prediction. The introduced method is evaluated by classifying proteins according to the classes of Gene Ontology, one of the most established class systems for biomolecules, which comprises several thousand classes.
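The general idea of combining several representations in a k nearest neighbor classifier can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not specify the combination rule, so the unweighted sum of per-representation vote shares used here is an assumption, and the density-based reduction of training instances is omitted entirely. The function names, the representation names, and the toy vectors are hypothetical, and real representations such as sequences or text would need their own distance functions instead of the Euclidean distance.

```python
import math
from collections import Counter


def knn_votes(query, train, k):
    """Return the share of votes per class among the k nearest training vectors.

    `train` is a list of (vector, label) pairs in ONE representation.
    Euclidean distance is an assumption made for this sketch only.
    """
    nearest = sorted(train, key=lambda item: math.dist(query, item[0]))[:k]
    counts = Counter(label for _, label in nearest)
    return {label: n / len(nearest) for label, n in counts.items()}


def classify_multi_represented(query_reps, train_reps, k=3):
    """Predict a class by summing kNN vote shares over all shared representations.

    The unweighted sum rule is an illustrative assumption, not the
    combination method proposed in the paper.
    """
    combined = Counter()
    for name, query in query_reps.items():
        if name not in train_reps:
            # The query may carry a representation with no training data.
            continue
        for label, share in knn_votes(query, train_reps[name], k).items():
            combined[label] += share
    if not combined:
        raise ValueError("query and training data share no representation")
    return max(combined, key=combined.get)


# Hypothetical toy data: two representations, two classes.
train_reps = {
    "sequence": [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
                 ((1.0, 1.0), "B"), ((0.9, 1.1), "B")],
    "text": [((0.0, 1.0), "A"), ((0.2, 0.9), "A"),
             ((1.0, 0.0), "B"), ((0.8, 0.1), "B")],
}
query_reps = {"sequence": (0.2, 0.1), "text": (0.1, 0.8)}

predicted = classify_multi_represented(query_reps, train_reps, k=3)
print(predicted)
```

Each representation votes independently, so an object that lacks one representation can still be classified from the others. A weighted combination would be the natural next step when some representations are more reliable than others.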
Keywords
- Multi-represented objects
- Classification
- Instance-based learning
- k nearest neighbor classifier
Supported by the German Ministry for Education, Science, Research and Technology (BMBF) under grant no. 031U212 within the BFAM (Bioinformatics for the Functional Analysis of Mammalian Genomes) project, which is part of the German Genome Analysis Network (NGFN).
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Kriegel, HP., Pryakhin, A., Schubert, M. (2005). Multi-represented kNN-Classification for Large Class Sets. In: Zhou, L., Ooi, B.C., Meng, X. (eds) Database Systems for Advanced Applications. DASFAA 2005. Lecture Notes in Computer Science, vol 3453. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11408079_45
DOI: https://doi.org/10.1007/11408079_45
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25334-1
Online ISBN: 978-3-540-32005-0
eBook Packages: Computer Science (R0)
