Abstract
With recent observational instruments and survey campaigns in astrophysics, efficient analysis of big structured data becomes more and more relevant. While providing good query expressiveness and data analysis capabilities through SQL, off-the-shelf RDBMS are not yet well prepared to handle high-volume scientific data distributed across several nodes, neither for fast data ingest nor for fast spatial queries. Our SQL query parser and job manager perform query reformulation to spread queries to the data nodes, gathering the outputs on a head node and providing them again to the shards for subsequent processing steps. We combine this data analysis architecture with the cloud data storage component XtreemFS for automatic data replication to improve availability and access latency. With our solution we perform rich structured data analysis, expressed in SQL, on large amounts of structured astrophysical data distributed across numerous storage nodes in parallel. The cloud storage virtualization with XtreemFS provides elasticity and reproducibility of scientific analysis tasks through its snapshot capability.
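The scatter-and-gather pattern described in the abstract can be illustrated with a minimal sketch. It is not the authors' parser or job manager: in-memory SQLite databases stand in for the RDBMS data nodes, and the table `halos`, its columns `x` and `mass`, and the function names are hypothetical. The sketch shows one simple kind of query reformulation, in which the head node rewrites an `AVG` into per-shard `SUM` and `COUNT` queries, because per-shard averages cannot be merged directly.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory SQLite database standing in for one data node."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema, chosen only for illustration.
    conn.execute("CREATE TABLE halos (id INTEGER, x REAL, mass REAL)")
    conn.executemany("INSERT INTO halos VALUES (?, ?, ?)", rows)
    return conn


def scatter_gather_avg_mass(shards, x_min, x_max):
    """Evaluate AVG(mass) for x_min <= x <= x_max across all shards.

    The head node reformulates AVG into SUM and COUNT, which can be
    added up across shards, and divides only after gathering.
    """
    shard_query = (
        "SELECT COALESCE(SUM(mass), 0.0), COUNT(*) "
        "FROM halos WHERE x BETWEEN ? AND ?"
    )
    # Scatter: run the reformulated query on every shard.
    partials = [
        conn.execute(shard_query, (x_min, x_max)).fetchone()
        for conn in shards
    ]
    # Gather: combine the partial sums and counts on the head node.
    total_mass = sum(p[0] for p in partials)
    total_count = sum(p[1] for p in partials)
    return total_mass / total_count if total_count else None


shards = [
    make_shard([(1, 0.5, 10.0), (2, 1.5, 20.0)]),
    make_shard([(3, 2.5, 30.0), (4, 9.0, 100.0)]),
]
result = scatter_gather_avg_mass(shards, 0.0, 3.0)
print(result)
```

In a real deployment the per-shard queries would run in parallel on separate nodes, and intermediate results could be written back to the shards for later processing steps, as the abstract describes; the sequential loop here only keeps the sketch short.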
Notes
This process is also called ‘pipelining’ or data reduction.
A common file format in astronomy, the Flexible Image Transport System (FITS), was created in 1979 and is approved as a standard by the IAU [2].
A Millennium-like simulation requires several million CPU hours to produce the raw data (a.k.a. snapshots); see Table 2.
Note that ϵ is given by the system environment. Typically, ϵ is between 10 and 100 ms for WANs and approx. 1 ms in LANs.
References
Begeman K et al. (2011) LOFAR information system. Future generation computer systems, vol 27. Elsevier, Amsterdam, pp 319–328
A brief introduction to FITS. http://fits.gsfc.nasa.gov/fits_overview.html
Enke H, Wambsganss JK (2012) Astronomie und Astrophysik. In: Langzeitarchivierung von Forschungsdaten – eine Bestandsaufnahme. Verlag Werner Hülsbusch, Göttingen
Guidelines for participation, IVOA note 2010 July 7. Chaps. 1 and 3. http://www.ivoa.net/Documents/latest/IVOAParticipation.html
Hupfeld F, Cortes T, Kolbeck B, Stender J, Focht E, Hess M, Malo J, Martí J, Cesario E (2008) The XtreemFS architecture—a case for object-based file systems in grids. Concurr Comput 20:2049–2060
IEEE Std 1003.1-2008. POSIX.1-2008, The Open Group Base Specifications Issue 7. The Open Group, 2008
IVOA astronomical data query language. http://www.ivoa.net/Documents/cover/ADQL-20081030.html
Kolbeck B, Högqvist M, Stender J, Hupfeld F (2011) Flease—lease coordination without a lock server. In: 25th IEEE international symposium on parallel and distributed processing (IPDPS 2011), pp 978–988
Lamport L (1978) Time, clocks, and the ordering of events in a distributed system. Commun ACM 21(7):558–565
Lamport L (1998) The part-time parliament. ACM Trans Comput Syst 16(2):133–169
Lemson G, Budavari T (2011) Implementing a general spatial indexing library for relational databases of large numerical simulations. In: 23rd international conference on scientific and statistical database management. Springer, Berlin
Lemson G, Virgo Consortium (2006) Halo and galaxy formation histories from the millennium simulation: public release of a VO-oriented and SQL-queryable database for studying the evolution of galaxies in the LambdaCDM cosmogony. arXiv:astro-ph/0608019
Mattern F (1988) Virtual time and global states of distributed systems. In: Cosnard M (ed) Proc workshop on parallel and distributed algorithms. Elsevier, Amsterdam, pp 215–226
O’Mullane W (2011) Blue skies and clouds, archives of the future. GAIA-TN-PL-ESAC-WOM-057-0
O’Neil P, Cheng E, Gawlick D, O’Neil E (1996) The log-structured merge-tree (LSM-tree). Acta Inform 33(4):351–385
Prisco RD, Lampson BW, Lynch NA (2000) Revisiting the PAXOS algorithm. Theor Comput Sci 243(1–2):35–91
Riebe K et al (2011) The MultiDark database: release of the Bolshoi and MultiDark cosmological simulations. arXiv:1109.0003v2
Stender J, Berlin M, Reinefeld A (2012, to appear) XtreemFS—a file system for the cloud. In: Kyriazis D, Voulodimos A, Gogouvitis S, Varvarigou T (eds) Data intensive storage services for cloud environments. IGI Global Press
Stender J, Kolbeck B, Högqvist M, Hupfeld F (2010) BabuDB: fast and efficient file system metadata storage. In: 2010 international workshop on storage network architecture and parallel I/Os (SNAPI ’10), Washington, DC, USA. IEEE Comput Soc, Los Alamitos, pp 51–58
Kamel I, Faloutsos C (1993) On packing R-trees. In: Proceedings of the second international conference on information and knowledge management (CIKM ’93), pp 490–499
Acknowledgements
Part of the work on the RDBMS is funded by “Virtuelles Datenzentrum (VDZ)”, BMBF grant 05A09BAB. We thank K. Riebe and J. Klar (AIP) for critical discussions and suggestions.
The XtreemFS development was partly funded by the EU projects XtreemOS (2006–2010) and Contrail (2010–2013) and by the German BMBF projects MoSGrid (2009–2012) and VDZ (2010–2012). We thank the XtreemFS team for the design and implementation of XtreemFS which, with its unique feature set, became a perfect tool for research on distributed data management.
Cite this article
Enke, H., Partl, A., Reinefeld, A. et al. Handling Big Data in Astronomy and Astrophysics: Rich Structured Queries on Replicated Cloud Data with XtreemFS. Datenbank Spektrum 12, 173–181 (2012). https://doi.org/10.1007/s13222-012-0099-1
DOI: https://doi.org/10.1007/s13222-012-0099-1