The Special Collection “Biomedical Data Analyses Facilitated by Open Cheminformatics Workflows” (https://www.biomedcentral.com/collections/BDAOCW) aimed to collect and publish cheminformatics workflows for the curation and analysis of diverse life science data sets. Especially at a time when the reproducibility of results is significantly challenged in many areas of science [1, 2], it is important to encourage the publication of workflows for data curation, including data extraction, integration, annotation, cleaning/filtering, standardization and analysis. However, this reproducibility “crisis” is not only a challenge but can also become an opportunity for change and better publication practice in the future [3].

For many scientific workflows, curation of data is essential and takes a significant amount of time. Many scientific disciplines and data types depend on data standardization and preprocessing, which is nicely exemplified by the different areas covered in this special issue, from small molecules to metabolomics and drug-protein interactions. However, choices made during data curation can be quite subjective, e.g. when they involve user-defined cut-offs, and also depend on the problem at hand. Thus, published workflows should enable comparability and reproducibility of results in line with the FAIR (findable, accessible, interoperable, and reusable) principles, both for data [4] and for software [5]. Another advantage of using existing workflows is that they help avoid mistakes and challenges that others have already faced and overcome.
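To make the point about user-defined cut-offs concrete, the following is a minimal sketch of how a curation step might keep its subjective choices explicit and reportable. The class `CurationConfig`, the field names, the threshold values and the example records are all hypothetical and chosen purely for illustration; they are not taken from any workflow in this collection.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CurationConfig:
    # Hypothetical cut-offs; the values are illustrative, not recommendations.
    max_molecular_weight: float = 700.0
    active_threshold_pic50: float = 6.0


def curate(records, config):
    """Drop records above the weight cut-off and label the rest as active/inactive."""
    curated = []
    for rec in records:
        if rec["mol_weight"] > config.max_molecular_weight:
            continue  # excluded by the user-defined size cut-off
        is_active = rec["pic50"] >= config.active_threshold_pic50
        curated.append({**rec, "active": is_active})
    return curated


def config_fingerprint(config):
    """Return a short, stable hash of the cut-offs to report alongside results."""
    payload = json.dumps(asdict(config), sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


# Made-up example records (not real measurements).
records = [
    {"id": "cpd-1", "mol_weight": 320.4, "pic50": 7.2},
    {"id": "cpd-2", "mol_weight": 910.0, "pic50": 8.1},
    {"id": "cpd-3", "mol_weight": 410.5, "pic50": 5.1},
]
config = CurationConfig()
curated = curate(records, config)

print(json.dumps(asdict(config), sort_keys=True), config_fingerprint(config))
print([(r["id"], r["active"]) for r in curated])
```

Publishing the configuration (or its fingerprint) together with the curated data would let a reader see which cut-offs produced a given result and rerun the step with different ones.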

Many scientific studies in the fields of cheminformatics and computational chemistry aim to extract and connect knowledge from (experimental) data. One fundamental assumption is the correctness of input data from experimental resources. However, both systematic errors, e.g. from translation between 1D, 2D, and 3D structure representations, and random errors, such as incorrect human input, do occur. Reported error rates range from an average of two errors per (medicinal chemistry) publication to 0.1–3.4% for different databases [6,7,8]. In addition to errors in experimental resources, the correct representation and standardization of molecules, including their tautomers and protonation states, can be highly challenging and time-consuming. Molecules are often represented by the Simplified Molecular Input Line Entry System (SMILES) [9] or InChI [10, 11] representation. However, there is no universal standard for SMILES, and different programs can produce different representations of the same molecule. The importance of (automated) chemical structure curation is demonstrated by the publication of structure standardization workflows by major bioactivity data resources like ChEMBL [12], PubChem [13], or canSAR [14].

In the field of machine learning (ML) and artificial intelligence (AI), publication of code is more common. Due to the increasing number of published methods in that area, more publications providing guidelines on reproducibility as well as on model comparison itself have become available [15,16,17]. This practice could serve as an example for other areas of cheminformatics and computational chemistry in which open source code and workflows are not yet commonly published with manuscripts.

Publication of data, on the other hand, is an even more challenging topic, especially when considering proprietary data. Public bioactivity data often contain fewer negative or inactive data points than proprietary data, thus displaying a higher ratio of actives to inactives than commonly seen in, e.g., high-throughput screening (HTS) runs [18,19,20,21]. Thus, public and proprietary data sets complement each other in terms of chemical space coverage. Another advantage of proprietary data is that experimental uncertainty can be estimated, since a more homogeneous curation pipeline and assay setup is often applied and multiple measurements for the same compounds are available. In order to satisfy the request for reproducibility and data sharing without violating intellectual property (IP) rights, applying the developed methods and workflows to public and private data in the same manner is a good solution. This process has been encouraged for research papers submitted to J. Cheminf., as demonstrated by these examples [22, 23].
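As a rough illustration of the two quantities mentioned above, the sketch below estimates per-compound variability from replicate measurements and computes the fraction of actives in a data set. The function names, the activity threshold of 6.0 and the pIC50-like values are assumptions made up for this example; the sample standard deviation is only one of several possible ways to express experimental uncertainty.

```python
from collections import defaultdict
from statistics import mean, stdev


def replicate_summary(measurements):
    """Group (compound_id, value) pairs and return mean, sample SD and n per compound."""
    grouped = defaultdict(list)
    for compound_id, value in measurements:
        grouped[compound_id].append(value)
    summary = {}
    for compound_id, values in grouped.items():
        # A standard deviation needs at least two replicates.
        sd = stdev(values) if len(values) > 1 else None
        summary[compound_id] = {"mean": mean(values), "sd": sd, "n": len(values)}
    return summary


def active_fraction(summary, threshold):
    """Fraction of compounds whose mean value reaches the activity threshold."""
    if not summary:
        return 0.0
    n_active = sum(1 for stats in summary.values() if stats["mean"] >= threshold)
    return n_active / len(summary)


# Made-up replicate measurements (pIC50-like values, not real data).
measurements = [
    ("cpd-1", 6.0), ("cpd-1", 6.4), ("cpd-1", 6.2),
    ("cpd-2", 4.9), ("cpd-2", 5.1),
    ("cpd-3", 7.0),
]
summary = replicate_summary(measurements)
for compound_id, stats in summary.items():
    print(compound_id, stats)
print("fraction of actives:", active_fraction(summary, threshold=6.0))
```

Running the same script unchanged on a public and on a proprietary data set, and reporting only the resulting summary statistics, is one way such a workflow could be shared without disclosing the underlying structures.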

The workflows submitted for this special issue include KNIME workflows [24], Galaxy [25], and Jupyter notebooks [26]. In addition to these workflow tools, platforms for publishing and sharing code, such as GitHub [27] or GitLab, are available and allow sharing with, and enhancement by, peers. Larger data that require more storage space, such as model input data or machine learning models themselves, can be stored in open repositories such as Zenodo [28]. Docker [29] has become popular as a way to avoid issues with cross-platform installation. With all these resources available, ideal conditions for improvement and for meeting reproducibility requirements are at hand.

Ultimately, publication of workflows does not mean that these workflows can no longer be changed. Rather, they serve as a basis and starting point for further research and can also support teaching, as shown by existing initiatives such as TeachOpenCADD [30, 31]. Thus, we strongly encourage open access publication of workflows to help drive research in the best way possible.

In this special issue, diverse topics were covered, ranging from data analysis (nonadditivity analysis, thermal shift assay analysis, or MS/MS analysis for metabolomics) and structural analysis (drug-protein interactions, fragment-based virtual screening) to machine learning (retraining of ML models, ML for off-target predictions, MMPA and QSAR). This nicely illustrates how important data workflows and analysis are across different scientific fields.

As mentioned already, data availability is still a great challenge, especially when it comes to high-quality data. Many of today’s influential researchers grew up in a culture in which data and knowledge sharing was not yet appreciated, but was rather seen as potentially limiting their chances of securing one of the rare tenured academic positions. Here, a reward system to encourage data sharing could be a first incentive. Additionally, data sharing initiatives, such as federated learning in the MELLODDY project [32], have been conducted to share proprietary data and enhance machine learning models across companies. In the future, it would be great to see more initiatives to share data across companies, but also between academia and industry, to advance method development.