Earth Science Informatics, Volume 3, Issue 1–2, pp 5–17

Astroinformatics: data-oriented astronomy research and education

  • Kirk D. Borne
Research Article


The growth of data volumes in science is reaching epidemic proportions. Consequently, the status of data-oriented science as a research methodology needs to be elevated to that of the more established scientific approaches of experimentation, theoretical modeling, and simulation. Data-oriented scientific discovery is sometimes referred to as the new science of X-Informatics, where X refers to any science (e.g., Bio-, Geo-, Astro-) and informatics refers to the discipline of organizing, describing, accessing, integrating, mining, and analyzing diverse data resources for scientific discovery. Many scientific disciplines are developing formal sub-disciplines that are information-rich and data-based, to such an extent that these are now stand-alone research and academic programs recognized on their own merits. These disciplines include bioinformatics and geoinformatics, and will soon include astroinformatics. We introduce Astroinformatics, the new data-oriented approach to 21st-century astronomy research and education. In astronomy, petascale sky surveys will soon challenge our traditional research approaches and will radically transform how we train the next generation of astronomers, whose experiences with data are now increasingly virtual (through online databases) rather than physical (through trips to mountaintop observatories). We describe Astroinformatics as a rigorous approach to these challenges. We also describe initiatives in science education (not only in astronomy) through which students are trained to access large distributed data repositories, to conduct meaningful scientific inquiries into the data, to mine and analyze the data, and to make data-driven scientific discoveries.
These are essential skills for all 21st-century scientists, particularly in astronomy, as major new multi-wavelength sky surveys (which produce petascale databases and image archives) and grand-scale simulations (which generate enormous outputs for model universes, such as the Millennium Simulation) become core research components for a significant fraction of astronomical researchers.


Keywords: Data mining; Informatics; Data integration; Semantic metadata; Knowledge discovery; Science education



We thank the National Science Foundation (NSF) for partial support of this work through the Division of Undergraduate Education (DUE) Course, Curriculum, and Laboratory Improvement (CCLI) program, award #0737091. The author thanks numerous colleagues for their significant and invaluable contributions to the ideas expressed in this paper: Jogesh Babu, Douglas Burke, Andrew Connolly, Timothy Eastman, Eric Feigelson, Matthew Graham, Alexander Gray, Norman Gray, Suzanne Jacoby, Thomas Loredo, Ashish Mahabal, Robert Mann, Bruce McCollum, Misha Pesenson, M. Jordan Raddick, Alex Szalay, Tony Tyson, and John Wallin. Finally, the author expresses deep gratitude to Keivan Stassun for his thorough and thoughtful review of an earlier version of this paper, and for his numerous helpful comments and suggestions, which considerably improved the final product.



Copyright information

© Springer-Verlag 2010

Authors and Affiliations

  1. Department of Computational and Data Sciences, George Mason University, Fairfax, USA
