Distributed Unsupervised Clustering for Outlier Analysis in the Biggest Milky Way Survey: ESA Gaia Mission

  • Daniel Garabato
  • Carlos Dafonte
  • Marco A. Álvarez
  • Minia Manteiga
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10586)


The Gaia mission (ESA) is collecting huge amounts of information about the objects that populate our Galaxy and beyond. Such data must be processed and analyzed before being released, and this work is carried out by the Data Processing and Analysis Consortium (DPAC) through several work packages. One of these packages is Outlier Analysis, devoted to the study, by means of unsupervised clustering, of all the objects that cannot be fitted into any of the existing models. An algorithm based on optimized Self-Organized Maps (SOM) is proposed and implemented to take advantage of distributed computing platforms, such as the MapReduce paradigm for Apache Hadoop and Apache Spark. Finally, the processing times of the sequential implementation of the algorithm are compared with those of the Hadoop- and Spark-based implementations.
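
As a rough illustration of how SOM training can be phrased in map-reduce terms, the sketch below implements one epoch of the standard batch SOM update in plain Python. It is not the optimized algorithm proposed in the paper, whose details are not given in this abstract. All function names, the Gaussian neighborhood, and the list-of-partitions input are assumptions made for the example. The map step computes partial sums for each data partition, and the reduce step adds them up before the neuron weights are recomputed.

```python
import math
from functools import reduce


def bmu_index(weights, x):
    """Index of the neuron whose weight vector is closest to x (squared Euclidean)."""
    return min(range(len(weights)),
               key=lambda j: sum((w - v) ** 2 for w, v in zip(weights[j], x)))


def neighborhood(grid, a, b, sigma):
    """Gaussian neighborhood between neurons a and b, measured on the map grid."""
    d2 = sum((p - q) ** 2 for p, q in zip(grid[a], grid[b]))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def map_partition(samples, weights, grid, sigma):
    """Map step: partial numerators and denominators for one data partition."""
    dim = len(weights[0])
    num = [[0.0] * dim for _ in weights]
    den = [0.0] * len(weights)
    for x in samples:
        b = bmu_index(weights, x)
        for j in range(len(weights)):
            h = neighborhood(grid, b, j, sigma)
            den[j] += h
            for k in range(dim):
                num[j][k] += h * x[k]
    return num, den


def combine(a, b):
    """Reduce step: element-wise sum of two partial results."""
    num = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    den = [p + q for p, q in zip(a[1], b[1])]
    return num, den


def som_epoch(partitions, weights, grid, sigma):
    """One batch SOM epoch over a non-empty list of data partitions."""
    partials = [map_partition(p, weights, grid, sigma) for p in partitions]
    num, den = reduce(combine, partials)
    # Neurons that received no weight keep their previous vector.
    return [[n / d for n in row] if d > 0 else list(w)
            for row, d, w in zip(num, den, weights)]
```

Because the partial sums are simply added together, the result does not depend on how the samples are split into partitions (up to floating-point rounding). On a platform such as Hadoop or Spark, `map_partition` would presumably run on the workers and `combine` would merge their outputs, while here both run sequentially in a single process.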


Computational Astrophysics, Fast Self-Organized Maps, Parallel computing, Map-reduce, Apache Hadoop, Apache Spark, Remote sensing



This work was supported by the Spanish FEDER through Grants ESP2016-80079-C2-2-R and ESP2014-55996-C2-2-R.


Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Daniel Garabato (1)
  • Carlos Dafonte (1)
  • Marco A. Álvarez (1)
  • Minia Manteiga (2)
  1. Departamentos de Tecnologías de la Información y las Comunicaciones, Universidade da Coruña (UDC), A Coruña, Spain
  2. Departamentos de Ciencias de la Navegación y de la Tierra, Universidade da Coruña (UDC), A Coruña, Spain
