Volume 12, Issue 3, pp 173–181

Handling Big Data in Astronomy and Astrophysics: Rich Structured Queries on Replicated Cloud Data with XtreemFS

  • Harry Enke
  • Adrian Partl
  • Alexander Reinefeld
  • Florian Schintke


With recent observational instruments and survey campaigns in astrophysics, efficient analysis of big structured data is becoming increasingly relevant. Although off-the-shelf RDBMS provide good query expressiveness and data analysis capabilities through SQL, they are not yet well prepared to handle high-volume scientific data distributed across several nodes, neither for fast data ingest nor for fast spatial queries. Our SQL query parser and job manager reformulate queries to spread them across the data nodes, gather the outputs on a head node, and provide them to the shards again for subsequent processing steps. We combine this data analysis architecture with the cloud data storage component XtreemFS, whose automatic data replication improves availability and access latency. With our solution we perform rich structured data analysis, expressed in SQL, in parallel on large amounts of structured astrophysical data distributed across numerous storage nodes. The cloud storage virtualization with XtreemFS provides elasticity and, through its snapshot capability, reproducibility of scientific analysis tasks.
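The scatter and gather pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' parser or job manager: the in-memory SQLite databases stand in for data nodes, and the table name `halos`, the helper names `make_shard` and `scatter_gather`, and the sample values are all hypothetical. The sketch shows one typical reformulation, where an `AVG` over the full data set is rewritten into per-shard `COUNT` and `SUM` partials that the head node then combines. Feeding gathered results back to the shards for later steps is omitted.

```python
import sqlite3

def make_shard(rows):
    """Create an in-memory SQLite database standing in for one data node."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE halos (id INTEGER, x REAL, mass REAL)")
    db.executemany("INSERT INTO halos VALUES (?, ?, ?)", rows)
    return db

def scatter_gather(shards, shard_sql, head_sql, params=()):
    """Run shard_sql on every shard, collect the partial rows in a table
    named `partial` on the head node, then run head_sql over that table."""
    head = sqlite3.connect(":memory:")
    head.execute("CREATE TABLE partial (cnt INTEGER, total REAL)")
    for shard in shards:
        rows = shard.execute(shard_sql, params).fetchall()
        head.executemany("INSERT INTO partial VALUES (?, ?)", rows)
    result = head.execute(head_sql).fetchone()
    head.close()
    return result

# User-level query (hypothetical):
#   SELECT COUNT(*), AVG(mass) FROM halos WHERE mass > ?
# Reformulated into a per-shard part and a head-node part:
SHARD_SQL = "SELECT COUNT(*), COALESCE(SUM(mass), 0.0) FROM halos WHERE mass > ?"
HEAD_SQL = "SELECT SUM(cnt), SUM(total) / SUM(cnt) FROM partial"

shards = [
    make_shard([(1, 0.1, 10.0), (2, 0.2, 30.0)]),
    make_shard([(3, 0.7, 50.0), (4, 0.9, 5.0)]),
]
count, mean_mass = scatter_gather(shards, SHARD_SQL, HEAD_SQL, (8.0,))
print(count, mean_mass)
```

Averaging the per-shard averages would give a wrong result when shards hold different numbers of matching rows, which is why the sketch ships counts and sums to the head node instead.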


Keywords: File System, Cloud Storage, Head Node, Dark Matter Halo, Spatial Query



Part of the work on the RDBMS is funded by “Virtuelles Datenzentrum (VDZ)”, BMBF grant 05A09BAB. We thank K. Riebe and J. Klar (AIP) for critical discussions and suggestions.

The XtreemFS development was partly funded by the EU projects XtreemOS (2006–2010) and Contrail (2010–2013) and by the German BMBF projects MoSGrid (2009–2012) and VDZ (2010–2012). We thank the XtreemFS team for the design and implementation of XtreemFS which, with its unique feature set, became a perfect tool for research on distributed data management.



Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Harry Enke (1)
  • Adrian Partl (1)
  • Alexander Reinefeld (2)
  • Florian Schintke (2)

  1. Astrophysical Institute in Potsdam (AIP), Berlin, Germany
  2. Zuse Institute Berlin (ZIB), Berlin, Germany
