Virtual Observatories, Data Mining, and Astroinformatics

  • Kirk Borne


The historical, current, and future trends in knowledge discovery from data in astronomy are presented here. The story begins with a brief history of data gathering and data organization. A description of the development of new information science technologies for astronomical discovery is then presented. Among these are e-Science and the virtual observatory, with its data discovery, access, display, and integration protocols; astroinformatics and data mining for exploratory data analysis, information extraction, and knowledge discovery from distributed data collections; the databases of new sky surveys, including rich multivariate observational parameter sets for large numbers of objects; and the emerging discipline of data-oriented astronomical research, called astroinformatics. Astroinformatics is described as the fourth paradigm of astronomical research, following the three traditional research methodologies: observation, theory, and computation/modeling. Astroinformatics research areas include machine learning, data mining, visualization, statistics, semantic science, and scientific data management. Each of these areas is now an active research discipline, with significant science-enabling applications in astronomy. Research challenges and sample research scenarios are presented in these areas, in addition to sample algorithms for data-oriented research. These information science technologies enable scientific knowledge discovery from the increasingly large and complex data collections in astronomy. The education and training of the modern astronomy student must consequently include skill development in these areas, whose practitioners have traditionally been limited to applied mathematicians, computer scientists, and statisticians. Modern astronomical researchers must cross these traditional discipline boundaries, thereby borrowing the best-of-breed methodologies from multiple disciplines.
In the era of large sky surveys and numerous large telescopes, the potential for astronomical discovery is equally large, so the data-oriented research methods, algorithms, and techniques presented here will enable the greatest discovery potential from the ever-growing data and information resources in astronomy.


Astroinformatics; Bayesian classification; classification; clustering; data management; data mining; data preparation; data profiling; data science; data transformation; databases; decision tree; distance metrics; e-Science; exploratory data analysis; fourth paradigm; informatics; K-means; machine learning; neural network; outlier detection; semantic science; semisupervised learning; similarity metrics; sky surveys; supervised learning; survey science; unsupervised learning; virtual observatory; visualization; VOEvent


List of Abbreviations


2MASS: 2-Micron All-Sky Survey
AAO: Anglo-Australian Observatory
ADAC: Astronomical Data Archives Center (Japan)
ADASS: Astronomical Data Analysis Software and Systems
ADC: Astronomical Data Center
ApJS: Astrophysical Journal Supplement
ANN: Artificial neural network
BD: Bonner Durchmusterung
CADC: Canadian Astronomy Data Center
CDS: Centre de Données astronomiques de Strasbourg (France)
GCVS: General Catalog of Variable Stars
DDM: Distributed data mining
DMD: Distributed mining of data
DOE: Department of Energy
DSS: Digital Sky Survey
EDA: Exploratory data analysis
HD: Henry Draper
HEASARC: High Energy Astrophysics Science Archive Research Center
IPAC: Infrared Processing and Analysis Center
IRSA: Infrared Science Archive
IVOA: International Virtual Observatory Alliance
KDD: Knowledge Discovery in Databases
KNN: K-nearest neighbors
LEDAS: Leicester Database and Archive Service (UK)
LSST: Large Synoptic Survey Telescope
MAST: Multimission Archive at Space Telescope
MDD: Mining of distributed data
ML: Machine learning
NASA: National Aeronautics and Space Administration
NED: NASA/IPAC Extragalactic Database
NGC: New General Catalog
NSF: National Science Foundation
NVO: National Virtual Observatory
Pan-STARRS: Panoramic Survey Telescope and Rapid Response System
PDMP: Project Data Management Plan
PI: Principal investigator
RA, Dec: Right ascension and declination
RDF: Resource Description Framework
SAO: Smithsonian Astrophysical Observatory
SIMBAD: Set of Identifications, Measurements, and Bibliography for Astronomical Data
SDSS: Sloan Digital Sky Survey
SVM: Support vector machine
VAO: Virtual Astronomy Observatory
TMSS: Two-Micron Sky Survey
VO: Virtual observatory
WWW: World Wide Web
XML: eXtensible Markup Language



This research has been supported in part by NASA AISR grant number NNX07AV70G. The author thanks numerous colleagues for their significant and invaluable contributions to the ideas expressed in this chapter: Jogesh Babu, Douglas Burke, Andrew Connolly, Timothy Eastman, Eric Feigelson, Matthew Graham, Alexander Gray, Norman Gray, Suzanne Jacoby, Thomas Loredo, Ashish Mahabal, Robert Mann, Bruce McCollum, Misha Pesenson, M. Jordan Raddick, Keivan Stassun, Alex Szalay, Tony Tyson, and John Wallin. The author is grateful to Dr. Hillol Kargupta and his research associates for many years of productive collaborations in the field of distributed data mining in virtual observatories.



Copyright information

© Springer Science+Business Media Dordrecht 2013

Authors and Affiliations

  • Kirk Borne, George Mason University, Fairfax, USA
