Abstract
Analyzing complex scientific data, such as graphs and images, often requires comparing features: regions of graphs, visual aspects of images, and related metadata, with some features being relatively more important than others. The notion of similarity used for comparison is typically the distance between data objects, which can be expressed in terms of distances between their features. We refer to the distance based on each feature as a component. Weights of components, representing the relative importance of features, can be learned using distance function learning algorithms. However, it is seldom known which components optimize learning, given criteria such as accuracy, efficiency and simplicity. This is the problem we address. We propose and theoretically compare four component selection approaches: Maximal Path Traversal, Minimal Path Traversal, Maximal Path Traversal with Pruning and Minimal Path Traversal with Pruning. Experimental evaluation is conducted using real data from Materials Science, Nanotechnology and Bioinformatics. A trademarked software tool is developed as a highlight of this work.
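To make the notion of components and weights concrete, the following is a minimal sketch and not the method of the paper. The abstract does not state how components are combined, so a weighted sum is assumed here. The feature names (`curve`, `peak`), the two per-feature distance functions, and the weight values are all hypothetical and chosen only for illustration.

```python
from math import sqrt


def euclidean(a, b):
    """Distance between two equal-length numeric sequences."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def absolute_difference(a, b):
    """Distance between two scalar values."""
    return abs(a - b)


def weighted_distance(obj_a, obj_b, components, weights):
    """Assumed form: sum over components of weight * per-feature distance.

    components maps a feature name to its distance function;
    weights maps the same feature name to its relative importance.
    """
    return sum(
        weights[name] * distance(obj_a[name], obj_b[name])
        for name, distance in components.items()
    )


# Hypothetical data objects, each described by two features.
obj_a = {"curve": [0.0, 3.0], "peak": 10.0}
obj_b = {"curve": [4.0, 0.0], "peak": 12.0}

# Hypothetical components and weights (in the paper, weights are learned).
all_components = {"curve": euclidean, "peak": absolute_difference}
weights = {"curve": 2.0, "peak": 0.5}

# Distance using every component.
full = weighted_distance(obj_a, obj_b, all_components, weights)

# Component selection, in this sketch, means restricting the set of
# components that take part in the distance.
selected = {"curve": euclidean}
reduced = weighted_distance(obj_a, obj_b, selected, weights)

print(full, reduced)
```

Under these assumptions, choosing a subset of components amounts to passing a smaller `components` mapping. How the four proposed approaches search for that subset is not described in the abstract and is not modeled here.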
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Varde, A. et al. (2008). Component Selection to Optimize Distance Function Learning in Complex Scientific Data Sets. In: Bhowmick, S.S., Küng, J., Wagner, R. (eds) Database and Expert Systems Applications. DEXA 2008. Lecture Notes in Computer Science, vol 5181. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85654-2_27
DOI: https://doi.org/10.1007/978-3-540-85654-2_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85653-5
Online ISBN: 978-3-540-85654-2
eBook Packages: Computer Science, Computer Science (R0)