Handling Big Data in Astronomy and Astrophysics: Rich Structured Queries on Replicated Cloud Data with XtreemFS

  • Focus article (Schwerpunktbeitrag)
  • Journal: Datenbank-Spektrum

Abstract

With recent observational instruments and survey campaigns in astrophysics, efficient analysis of big structured data is becoming increasingly relevant. While off-the-shelf RDBMS provide good query expressiveness and data analysis capabilities through SQL, they are not yet well prepared to handle high-volume scientific data distributed across several nodes, neither for fast data ingest nor for fast spatial queries. Our SQL query parser and job manager perform query reformulation to spread queries to the data nodes, gather the outputs on a head node, and provide them to the shards again for subsequent processing steps. We combine this data analysis architecture with the cloud data storage component XtreemFS, which replicates data automatically to improve availability and access latency. With our solution, we perform rich structured data analysis, expressed in SQL, on large amounts of structured astrophysical data distributed across numerous storage nodes in parallel. The cloud storage virtualization with XtreemFS provides elasticity and reproducibility of scientific analysis tasks through its snapshot capability.
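The scatter and gather pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' parser or job manager: in-memory SQLite databases stand in for the data nodes and the head node, and the table `halos`, the head-node table `partial`, and the functions `make_shard`, `scatter_gather` and `distributed_avg_mass` are hypothetical names chosen for illustration. The sketch shows one typical reformulation: an average cannot be merged from per-shard averages, so each shard returns a sum and a count, and the head node combines them.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory SQLite database standing in for one data node."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE halos (id INTEGER, mass REAL)")
    db.executemany("INSERT INTO halos VALUES (?, ?)", rows)
    return db


def scatter_gather(shards, shard_sql, params, partial_columns, merge_sql):
    """Run shard_sql on every shard, gather the rows in a head-node table
    named 'partial', then run merge_sql on the head node."""
    head = sqlite3.connect(":memory:")
    head.execute(f"CREATE TABLE partial ({partial_columns})")
    # One placeholder per column of the head-node table.
    slots = ", ".join("?" for _ in partial_columns.split(","))
    for shard in shards:
        rows = shard.execute(shard_sql, params).fetchall()
        head.executemany(f"INSERT INTO partial VALUES ({slots})", rows)
    return head.execute(merge_sql).fetchall()


def distributed_avg_mass(shards, min_mass):
    """Reformulate AVG(mass) as per-shard SUM and COUNT, merged on the head."""
    rows = scatter_gather(
        shards,
        "SELECT SUM(mass), COUNT(*) FROM halos WHERE mass >= ?",
        (min_mass,),
        "s REAL, n INTEGER",
        "SELECT SUM(s) / SUM(n) FROM partial",
    )
    return rows[0][0]


# Three hypothetical shards, one of them empty.
shards = [make_shard([(1, 1.0), (2, 3.0)]), make_shard([(3, 5.0)]), make_shard([])]
print(distributed_avg_mass(shards, 0.0))
```

Here the head node sees only one small row per shard rather than the raw data, which is the point of pushing the reformulated query down to the data nodes. A real deployment would run the shard queries in parallel over the network; the loop above is sequential only to keep the sketch short.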

Notes

  1. This process is also called ‘pipelining’ or data reduction.

  2. The Flexible Image Transport System (FITS), a common file format in astronomy, was created in 1979 and is approved as a standard by the IAU [2].

  3. http://www.sdss.org.

  4. http://cds.u-strasbg.fr/.

  5. http://gavo.mpa-garching.mpg.de/Millennium/.

  6. http://www.multidark.org.

  7. Producing the raw data (a.k.a. snapshots) of a Millennium-like simulation requires several million CPU hours (see Table 2).

  8. http://spiderformysql.com/.

  9. http://www.xtreemfs.org.

  10. Note that ϵ is given by the system environment. Typically, ϵ is between 10 and 100 ms for WANs and approx. 1 ms in LANs.

References

  1. Begeman K et al. (2011) LOFAR information system. Future generation computer systems, vol 27. Elsevier, Amsterdam, pp 319–328

  2. A brief introduction to FITS. http://fits.gsfc.nasa.gov/fits_overview.html

  3. Enke H, Wambsganss JK (2012) Astronomie und Astrophysik. In: Langzeitarchivierung von Forschungsdaten – eine Bestandsaufnahme. Verlag Werner Hülsbusch, Göttingen

  4. Guidelines for participation, IVOA note 2010 July 7. Chaps. 1 and 3. http://www.ivoa.net/Documents/latest/IVOAParticipation.html

  5. Hupfeld F, Cortes T, Kolbeck B, Stender J, Focht E, Hess M, Malo J, Martí J, Cesario E (2008) The XtreemFS architecture—a case for object-based file systems in grids. Concurr Comput 20:2049–2060

  6. IEEE Std 1003.1-2008. POSIX.1-2008, The Open Group Base Specifications Issue 7. The Open Group, 2008

  7. IVOA astronomical data query language. http://www.ivoa.net/Documents/cover/ADQL-20081030.html

  8. Kolbeck B, Högqvist M, Stender J, Hupfeld F (2011) Flease—lease coordination without a lock server. In: 25th IEEE international symposium on parallel and distributed processing (IPDPS 2011), pp 978–988

  9. Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21(7):558–565

  10. Lamport L (1998) The part-time parliament. ACM Trans Comput Syst 16(2):133–169

  11. Lemson G, Budavari T (2011) Implementing a general spatial indexing library for relational databases of large numerical simulations. In: 23rd international conference on scientific and statistical database management. Springer, Berlin

  12. Lemson G, Virgo Consortium (2006) Halo and galaxy formation histories from the millennium simulation: public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony. arXiv:astro-ph/0608019

  13. Mattern F (1988) Virtual time and global states of distributed systems. In: Cosnard M (ed) Proc workshop on parallel and distributed algorithms. Elsevier, Amsterdam, pp 215–226

  14. O’Mullane W (2011) Blue skies and clouds, archives of the future. GAIA-TN-PL-ESAC-WOM-057-0

  15. O’Neil P, Cheng E, Gawlick D, O’Neil E (1996) The log-structured merge-tree (LSM-tree). Acta Inform 33(4):351–385

  16. De Prisco R, Lampson BW, Lynch NA (2000) Revisiting the PAXOS algorithm. Theor Comput Sci 243(1–2):35–91

  17. Riebe K et al. (2011) The MultiDark database: release of the Bolshoi and MultiDark cosmological simulations. arXiv:1109.0003v2

  18. Stender J, Berlin M, Reinefeld A (2012, to appear) XtreemFS—a file system for the cloud. In: Kyriazis D, Voulodimos A, Gogouvitis S, Varvarigou T (eds) Data intensive storage services for cloud environments. IGI Global Press

  19. Stender J, Kolbeck B, Högqvist M, Hupfeld F (2010) BabuDB: fast and efficient file system metadata storage. In: 2010 international workshop on storage network architecture and parallel I/Os (SNAPI ’10), Washington, DC, USA. IEEE Comput Soc, Los Alamitos, pp 51–58

  20. Kamel I, Faloutsos C (1993) On packing R-trees. In: Proceedings of the second international conference on information and knowledge management (CIKM ’93), pp 490–499

Acknowledgements

Part of the work on the RDBMS is funded by “Virtuelles Datenzentrum (VDZ)”, BMBF grant 05A09BAB. We thank K. Riebe and J. Klar (AIP) for critical discussions and suggestions.

The XtreemFS development was partly funded by the EU projects XtreemOS (2006–2010) and Contrail (2010–2013) and by the German BMBF projects MoSGrid (2009–2012) and VDZ (2010–2012). We thank the XtreemFS team for the design and implementation of XtreemFS which, with its unique feature set, became a perfect tool for research on distributed data management.

Author information

Corresponding author

Correspondence to Alexander Reinefeld.

About this article

Cite this article

Enke, H., Partl, A., Reinefeld, A. et al. Handling Big Data in Astronomy and Astrophysics: Rich Structured Queries on Replicated Cloud Data with XtreemFS. Datenbank Spektrum 12, 173–181 (2012). https://doi.org/10.1007/s13222-012-0099-1

  • DOI: https://doi.org/10.1007/s13222-012-0099-1
