Handling Big Data in Astronomy and Astrophysics: Rich Structured Queries on Replicated Cloud Data with XtreemFS
With recent observational instruments and survey campaigns in astrophysics, efficient analysis of big structured data is becoming increasingly relevant. While off-the-shelf RDBMS provide good query expressiveness and data analysis capabilities through SQL, they are not yet well prepared to handle high-volume scientific data distributed across several nodes, neither for fast data ingest nor for fast spatial queries. Our SQL query parser and job manager perform query reformulation to spread queries to the data nodes (shards), gather the outputs on a head node, and provide them to the shards again for subsequent processing steps. We combine this data analysis architecture with the cloud data storage component XtreemFS for automatic data replication to improve availability and access latency. With our solution, we perform rich structured data analysis, expressed in SQL, in parallel on large amounts of structured astrophysical data distributed across numerous storage nodes. The cloud storage virtualization with XtreemFS provides elasticity and, through its snapshot capability, reproducibility of scientific analysis tasks.
Keywords: File System, Cloud Storage, Head Node, Dark Matter Halo, Spatial Query
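The scatter-gather flow described in the abstract (reformulate a query, run it on each shard, combine the partial outputs on a head node) can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the `halos` table and its columns, the helper names `make_shard` and `average_mass`, and the use of in-memory SQLite databases as stand-ins for data nodes. It is not the authors' parser or job manager, and it runs the shards sequentially rather than in parallel.

```python
import sqlite3


def make_shard(rows):
    """Create an in-memory SQLite database standing in for one data node (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE halos (id INTEGER, x REAL, mass REAL)")
    conn.executemany("INSERT INTO halos VALUES (?, ?, ?)", rows)
    return conn


# A global AVG(mass) cannot be merged from per-shard averages, so the query is
# reformulated: each shard returns a partial COUNT and SUM instead.
SHARD_SQL = (
    "SELECT COUNT(*), COALESCE(SUM(mass), 0.0) "
    "FROM halos WHERE x BETWEEN ? AND ?"
)


def average_mass(shards, x_min, x_max):
    """Head-node step: gather per-shard partials and combine them into one average."""
    partials = [conn.execute(SHARD_SQL, (x_min, x_max)).fetchone() for conn in shards]
    total_count = sum(count for count, _ in partials)
    total_mass = sum(mass_sum for _, mass_sum in partials)
    # No matching rows on any shard: there is no average to report.
    return total_mass / total_count if total_count else None


# Two hypothetical shards holding disjoint parts of the same table.
shards = [
    make_shard([(1, 0.5, 10.0), (2, 3.0, 20.0)]),
    make_shard([(3, 1.5, 30.0), (4, 9.0, 40.0)]),
]
result = average_mass(shards, 0.0, 5.0)
```

The rewrite from `AVG` to `COUNT` plus `SUM` is the point of the sketch: the head node can only combine outputs that are mergeable across shards. Feeding the gathered result back to the shards for a subsequent processing step, as the abstract mentions, is not covered here.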
Part of the work on the RDBMS is funded by “Virtuelles Datenzentrum (VDZ)”, BMBF grant 05A09BAB. We thank K. Riebe and J. Klar (AIP) for critical discussions and suggestions.
The XtreemFS development was partly funded by the EU projects XtreemOS (2006–2010) and Contrail (2010–2013) and by the German BMBF projects MoSGrid (2009–2012) and VDZ (2010–2012). We thank the XtreemFS team for the design and implementation of XtreemFS which, with its unique feature set, became a perfect tool for research on distributed data management.