Abstract
The development of high-throughput data acquisition technologies, together with advances in computing and communications, has resulted in explosive growth in the number, size, and diversity of potentially useful information sources. This growth has created unprecedented opportunities for data-driven knowledge acquisition and decision-making in a number of emerging, increasingly data-rich application domains such as bioinformatics, environmental informatics, enterprise informatics, and social informatics, among others. However, the massive size, semantic heterogeneity, autonomy, and distributed nature of the data repositories present significant hurdles to acquiring useful knowledge from the available data. This paper introduces some of the algorithmic and statistical problems that arise in such a setting and describes algorithms for learning classifiers from distributed data that offer rigorous performance guarantees relative to their centralized or batch counterparts. It also describes how this approach can be extended to work with autonomous, and hence inevitably semantically heterogeneous, data sources by making explicit the ontologies (attributes and relationships between attributes) associated with the data sources and reconciling the semantic differences among the data sources from a user’s point of view. This allows user- or context-dependent exploration of semantically heterogeneous data sources. The resulting algorithms have been implemented in INDUS, an open source software package for collaborative discovery from autonomous, semantically heterogeneous, distributed data sources.
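To make the phrase "performance guarantees relative to centralized counterparts" concrete, the following is a minimal sketch of one way such a guarantee can arise. It is not taken from the paper or from INDUS. It assumes horizontally partitioned data (each site holds a subset of the rows), a Naive Bayes learner whose only inputs are counts, and hypothetical helper names (`local_counts`, `merge_counts`) and toy data. Under those assumptions, summing per-site counts gives the same statistics as counting over the pooled data, so the learned classifier would be identical.

```python
from collections import Counter


def local_counts(rows):
    """Compute Naive Bayes count statistics at a single site.

    Each row is a (features_dict, label) pair. Only the counts,
    not the raw rows, would need to leave the site.
    """
    class_counts = Counter()
    joint_counts = Counter()
    for features, label in rows:
        class_counts[label] += 1
        for attr, value in features.items():
            joint_counts[(attr, value, label)] += 1
    return class_counts, joint_counts


def merge_counts(per_site):
    """Add up the count statistics reported by every site."""
    total_class, total_joint = Counter(), Counter()
    for class_counts, joint_counts in per_site:
        total_class.update(class_counts)   # Counter.update adds counts
        total_joint.update(joint_counts)
    return total_class, total_joint


# Hypothetical toy data held at two autonomous sites.
site_a = [
    ({"motif": "yes", "length": "short"}, "enzyme"),
    ({"motif": "no", "length": "long"}, "other"),
]
site_b = [
    ({"motif": "yes", "length": "long"}, "enzyme"),
    ({"motif": "no", "length": "short"}, "other"),
    ({"motif": "yes", "length": "short"}, "other"),
]

# Distributed setting: each site answers with counts only.
distributed = merge_counts([local_counts(site_a), local_counts(site_b)])

# Centralized (batch) setting: all rows pooled in one place.
centralized = local_counts(site_a + site_b)

# The two sets of statistics coincide on this toy data.
exact = distributed == centralized
```

The sketch covers only the simplest case. Handling semantically heterogeneous sources would additionally require mapping each site's attribute values into the user's ontology before counting, which is not shown here.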
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Caragea, D., Zhang, J., Bao, J., Pathak, J., Honavar, V. (2005). Algorithms and Software for Collaborative Discovery from Autonomous, Semantically Heterogeneous, Distributed Information Sources. In: Jain, S., Simon, H.U., Tomita, E. (eds) Algorithmic Learning Theory. ALT 2005. Lecture Notes in Computer Science(), vol 3734. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11564089_5
DOI: https://doi.org/10.1007/11564089_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-29242-5
Online ISBN: 978-3-540-31696-1
eBook Packages: Computer Science, Computer Science (R0)