Comment

Theoretical approaches to science aim to reduce the impact of serendipity on the advance of our knowledge. However, serendipity often guides the evolution of theoretical sciences. The Ascona B-DNA Consortium (ABC) is an excellent example of how new and unpredicted research objectives emerge when looking for very specific information. The first ABC round aimed to study the tetramer-dependent properties of B-DNA (Beveridge et al. 2004; Dixit et al. 2005), but unexpectedly the project also helped to improve force fields (FFs; Pérez et al. 2007). The second round (Pasi et al. 2014) helped to describe bimodality in certain steps, but also resulted in the development of latest-generation FFs (Ivani et al. 2016; Zgarbová et al. 2015). The third round aimed to reproduce DNA polymorphism (Dans et al. 2019), but as a by-product, the analysis of the data led to the description of the kinetics of DNA transitions and to the development of a myriad of coarse-grained and mesoscopic models (Walther et al. 2020; López-Güell et al. 2023), which made it possible to move to the chromatin scale (Buitrago et al. 2021). We cannot predict what the impact of the ongoing HexABC project will be beyond characterizing the properties of the 2080 unique hexamers of DNA. The only clear statement that we can make is that the stored ABC data will be crucial for it.
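The figure of 2080 unique hexamers follows from counting sequences up to reverse complementarity: of the 4^6 = 4096 possible hexamers, 4^3 = 64 are their own reverse complement, so the number of distinct duplexes is (4096 + 64)/2 = 2080. The short Python sketch below checks this by enumeration. It is only an illustration, not consortium code, and the function names are hypothetical.

```python
from itertools import product

# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    return seq.translate(COMPLEMENT)[::-1]


def unique_kmers(k: int) -> set:
    """Return one canonical representative per k-mer / reverse-complement pair.

    A sequence and its reverse complement describe the same duplex, so we keep
    only the lexicographically smaller of the two.
    """
    all_kmers = ("".join(bases) for bases in product("ACGT", repeat=k))
    return {min(seq, reverse_complement(seq)) for seq in all_kmers}


n_hexamers = len(unique_kmers(6))
n_tetramers = len(unique_kmers(4))
print(n_hexamers, n_tetramers)  # prints: 2080 136
```

The same count applied to tetramers gives 136, the number of unique tetranucleotides.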

Theoretical science is moving from an algorithm-based to a data-driven paradigm. Artificial intelligence methods eagerly await high-quality data from which to derive predictive models (Barissi et al. 2022). In this new scenario, we should take care to report molecular dynamics (MD) data according to FAIR (findable, accessible, interoperable, and reusable) standards, documenting the provenance of trajectories that were obtained using community-accepted simulation standards and stored only after passing stringent quality controls (Hospital et al. 2020). We should be prepared to solve computational challenges that go beyond mere CPU usage and are closer to those faced by data-intensive sciences. ABC is pioneering the field: the HexABC consortium has generated 380 validated trajectories, obtained using community-accepted standards, that cover the 2080 unique DNA hexamers. Performing these simulations has been a major effort for the 13 groups involved, but the greatest challenge has been moving around 200 TB of data from the production sites to the data centers in Utah and Barcelona, checking the integrity of the trajectories, and detecting potential simulation artefacts that require human inspection. Analyzing 500,000 files (200 TB) and storing all the information in a NoSQL database with remote programmatic access represent an effort comparable to that of obtaining the trajectories. However, the final result, a validated database of B-DNA simulations, will be the best legacy of the ABC consortium (Fig. 1).
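As an illustration of the kind of integrity check that such a transfer requires, the sketch below compares SHA-256 digests of received files against a manifest written at the production site. This is an assumption about how such a check could be done, not a description of the HexABC pipeline; the manifest layout and the function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks so that
    large trajectory files never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_corrupted(manifest: dict, root: Path) -> list:
    """Return the names of files that are missing under `root` or whose digest
    differs from the one recorded in `manifest` (hypothetical layout:
    relative file name -> expected hex digest)."""
    bad = []
    for name, expected in manifest.items():
        target = root / name
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(name)
    return bad
```

A check of this kind only detects transfer or storage corruption; artefacts in the simulations themselves still call for the human inspection mentioned above.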

Fig. 1 The main flow of HexABC data production and storage