Machine Learning, Volume 74, Issue 3, pp 281–313

Finding anomalous periodic time series

An application to catalogs of periodic variable stars
  • Umaa Rebbapragada
  • Pavlos Protopapas
  • Carla E. Brodley
  • Charles Alcock

Abstract

Catalogs of periodic variable stars contain large numbers of periodic light-curves (photometric time series data from the astrophysics domain). Separating anomalous objects from well-known classes is an important step towards the discovery of new classes of astronomical objects. Most anomaly detection methods for time series data assume either a single continuous time series or a set of time series whose periods are aligned. Light-curve data precludes the use of these methods as the periods of any given pair of light-curves may be out of sync. One may use an existing anomaly detection method if, prior to similarity calculation, one performs the costly act of aligning two light-curves, an operation that scales poorly to massive data sets. This paper presents PCAD, an unsupervised anomaly detection method for large sets of unsynchronized periodic time series data that outputs a ranked list of both global and local anomalies. It calculates its anomaly score for each light-curve in relation to a set of centroids produced by a modified k-means clustering algorithm. Our method is able to scale to large data sets through the use of sampling. We validate our method on both light-curve data and other time series data sets. We demonstrate its effectiveness at finding known anomalies, and discuss the effect of sample size and number of centroids on our results. We compare our method to naive solutions and existing time series anomaly detection methods for unphased data, and show that PCAD’s reported anomalies are comparable to or better than all other methods. Finally, astrophysicists on our team have verified that PCAD finds true anomalies that might be indicative of novel astrophysical phenomena.
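
The general idea of scoring unsynchronized periodic curves against cluster centroids can be illustrated with a small sketch. This is not the PCAD implementation. The abstract does not say how k-means is modified or how the score is defined, so everything below is an assumption made for illustration: the brute-force search over circular shifts, the score (distance to the nearest centroid), the function names, and the toy sine and square-wave data. Centroids are fit on a subset of the curves, loosely mirroring the use of sampling, and every curve is then scored.

```python
import math
import random


def shift(curve, s):
    """Circularly shift a curve (a list of floats) by s positions."""
    s %= len(curve)
    return curve[s:] + curve[:s]


def phased_distance(a, b):
    """Smallest Euclidean distance between a and any circular shift of b.

    Returns (distance, best_shift). Brute force, O(n^2) for length-n curves.
    """
    best = (math.inf, 0)
    for s in range(len(b)):
        d = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, shift(b, s))))
        if d < best[0]:
            best = (d, s)
    return best


def phased_kmeans(curves, k, iters=10, seed=0):
    """Hypothetical k-means variant: each member is re-phased to its
    nearest centroid before the centroid is recomputed as a mean."""
    rng = random.Random(seed)
    centroids = [list(c) for c in rng.sample(curves, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for c in curves:
            # Nearest centroid, plus the shift that aligns c to it.
            (_, s), j = min(
                (phased_distance(cen, c), j) for j, cen in enumerate(centroids)
            )
            groups[j].append(shift(c, s))
        for j, members in enumerate(groups):
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def anomaly_scores(curves, centroids):
    """Assumed score: phase-invariant distance to the nearest centroid."""
    return [min(phased_distance(cen, c)[0] for cen in centroids) for c in curves]


def make_curve(kind, offset, n=32):
    """Toy periodic curve (sine or square wave) with an arbitrary phase offset."""
    if kind == "sine":
        base = [math.sin(2 * math.pi * i / n) for i in range(n)]
    else:
        base = [1.0 if i < n // 2 else -1.0 for i in range(n)]
    return shift(base, offset)


# Eight out-of-phase sines plus one square wave acting as the anomaly.
curves = [make_curve("sine", s) for s in range(0, 32, 4)] + [make_curve("square", 5)]
sample = curves[:8]  # centroids are fit on a subset only
centroids = phased_kmeans(sample, k=1)
scores = anomaly_scores(curves, centroids)
ranked = sorted(range(len(curves)), key=lambda i: scores[i], reverse=True)
print(ranked[0], round(scores[ranked[0]], 3))
```

In this toy setting the square wave receives the largest score because no shift aligns it with the sine-shaped centroid, while the out-of-phase sines all score near zero. The exhaustive shift search is exactly the kind of pairwise alignment cost the abstract describes as scaling poorly, which is why the sketch restricts the search to curve-versus-centroid comparisons.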

Keywords

  • Anomaly detection
  • Time series data

Copyright information

© Springer Science+Business Media, LLC 2008

Authors and Affiliations

  • Umaa Rebbapragada (1)
  • Pavlos Protopapas (2, 3)
  • Carla E. Brodley (1)
  • Charles Alcock (2)

  1. Department of Computer Science, Tufts University, Medford, USA
  2. Harvard-Smithsonian Center for Astrophysics, Cambridge, USA
  3. Initiative in Innovative Computing, Harvard University, Cambridge, USA
