Surprise Detection in Multivariate Astronomical Data

Borne, Kirk D.; Vedachalam, Arun

doi:10.1007/978-1-4614-3520-4_26

Part of the book series: Lecture Notes in Statistics (LNSP, volume 902)

Abstract

Astronomers systematically study the sky with large sky surveys. Modern sky surveys commonly produce from hundreds of terabytes (TB) up to 100 or more petabytes (PB) of data, both in the image data archive and in the object catalogs. For example, the LSST will produce a 20–40 PB catalog database. Large sky surveys have enormous potential to enable countless astronomical discoveries. Such discoveries will span the full spectrum of statistics: from rare one-in-a-billion (or one-in-a-trillion) object types, to complete statistical and astrophysical specifications of many classes of objects (based upon millions of instances of each class). The growth in data volumes requires more effective knowledge discovery and extraction algorithms. Among these are algorithms for outlier (novelty/surprise/anomaly) detection. Outlier detection algorithms enable scientists to discover the most “interesting” scientific knowledge hidden within large and high-dimensional datasets: the “unknown unknowns”. Effective outlier detection is essential for rapid discovery of potentially interesting and/or hazardous events. Emerging unexpected conditions in hardware, software, or network resources need to be detected, characterized, and analyzed as soon as possible for system health and safety reasons. Likewise, emerging behaviors and variations in scientific targets should be detected and characterized promptly to enable rapid decision support in response to such events. We have developed a new algorithm for outlier detection (KNN-DD: K-Nearest Neighbor Data Distributions). We report results from preliminary experiments on the algorithm’s precision and recall for known outliers, and on its ability to distinguish between characteristically different data distributions among different classes of objects.
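
As an illustration of the evaluation terms used in the abstract, precision and recall for known outliers can be computed as in the minimal sketch below. It is not the KNN-DD algorithm and is not taken from the chapter; the function name precision_recall and the object IDs are hypothetical, chosen only to show the standard definitions (precision = true positives / flagged objects, recall = true positives / known outliers).

```python
def precision_recall(flagged, known_outliers):
    """Return (precision, recall) for a collection of flagged object IDs.

    flagged        -- IDs that a detector marked as outliers
    known_outliers -- IDs of objects already known to be outliers
    """
    flagged = set(flagged)
    known = set(known_outliers)
    true_positives = len(flagged & known)
    # Guard against empty inputs to avoid division by zero.
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(known) if known else 0.0
    return precision, recall


# Hypothetical IDs, for illustration only (not data from the chapter).
flagged_ids = [3, 7, 12, 15]
known_ids = [3, 12, 20]
precision, recall = precision_recall(flagged_ids, known_ids)
```

Here two of the four flagged objects are known outliers, and two of the three known outliers are recovered, so precision is 2/4 and recall is 2/3.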

Acknowledgements

This research is supported in part by NASA AISR grant number NNX07AV70G and in part by NASA through the American Astronomical Society’s Small Research Grant Program.

Author information

Authors and Affiliations

Astronomy, & Computational Science, School of Physics, George Mason University, 4400 University Drive, Fairfax, VA, 22030, USA
Kirk D. Borne & Arun Vedachalam

Corresponding author

Correspondence to Kirk D. Borne.

Editor information

Editors and Affiliations

Dept. Astronomy & Astrophysics, Pennsylvania State University, 518 Davey Lab, University Park, 16802, Pennsylvania, USA
Eric D. Feigelson
Department of Statistics, Pennsylvania State University, 417C Thomas Building, University Park, 16802, Pennsylvania, USA
G. Jogesh Babu

About this paper

Cite this paper

Borne, K.D., Vedachalam, A. (2012). Surprise Detection in Multivariate Astronomical Data. In: Feigelson, E., Babu, G. (eds) Statistical Challenges in Modern Astronomy V. Lecture Notes in Statistics, vol 902. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-3520-4_26

DOI: https://doi.org/10.1007/978-1-4614-3520-4_26
Published: 21 May 2012
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-3519-8
Online ISBN: 978-1-4614-3520-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)
