The integration workflow we established starts when a new dataset is identified for inclusion and ends with the output of RDF named graphs representing the (annotated) information extracted from the identified dataset (see Fig. 2). The graphs are then published as Linked Open Data resources. We note that we explicitly tackle the task of integrating datasets with different formats into a single repository with a common data model. That is, we enforce (both manually and automatically) syntactic data quality, but we do not attempt to fix data quality issues in the content of the data we integrate. This is intentional, since our goal is to collect and store multiple datasets as they are. Inspecting and resolving data quality issues is an orthogonal task that domain experts can carry out only when they have open access to the different datasets to compare and cross-reference. Without our open database, ensuring the quality of the data used in LCSA would therefore be difficult, if feasible at all. In Sect. 7, we provide an example of such a case.
Integration of Multiple Classifications. Different datasets might have distinct classifications for the same concept. To align such datasets, correspondence tables systematically encode the semantic correspondence between their concepts and the BONSAI classification. Correspondence tables hence constitute a reference taxonomy, developed by BONSAI, that keeps track of the conceptual linkage between the various datasets. For example, the Exiobase dataset introduces 163 different instances of Activity Types, 200 Flow Objects, and 43 Locations. One of these instances is the Activity Type cultivation of paddy rice. In this case, the new concept is added to the BONSAI classification (Fig. 2, top dashed arrow), recording that cultivation of paddy rice is an Activity Type in the BONSAI classification extracted from the Exiobase dataset.
Moreover, Exiobase contains a special Flow Object labeled “Other emissions”. Within the BONSAI classification, this concept is also linked to a set of more specific emissions listed by the United States Environmental Protection Agency (US EPA). This correspondence is recorded via the partOf relation to make the data within the two classifications interoperable. Establishing semantic equivalence requires domain knowledge, so correspondence tables are created manually. Create Correspondence Table is the first process in the data workflow (Fig. 2). We then perform the Correspondence Mapping process, which produces a new, enhanced dataset containing the updated correspondence information (labeled Correspondence Mapped Dataset in Fig. 2).
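To make the role of a correspondence table concrete, the following minimal sketch (in Python) shows how such a table could be encoded as a CSV file. The column layout, relation labels, and URIs are illustrative assumptions, not the actual BONSAI vocabulary.

import csv

# Illustrative correspondence-table rows: each links a label from a source
# dataset to a (hypothetical) BONSAI classification concept via a relation
# such as "exactMatch" or "partOf".
rows = [
    ("EXIOBASE", "Cultivation of paddy rice", "exactMatch",
     "http://example.org/bonsai/activitytype/paddy_rice"),
    # An illustrative, more specific US EPA-listed emission recorded as part
    # of the broader "Other emissions" concept.
    ("US_EPA", "Some specific emission", "partOf",
     "http://example.org/bonsai/flowobject/other_emissions"),
]

with open("correspondence_exiobase.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["source_dataset", "source_label", "relation", "bonsai_concept"])
    writer.writerows(rows)

Each row states that a label from a source dataset either matches or is part of a concept in the BONSAI classification; this is the information consumed by the Correspondence Mapping process.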
Intermediate Data Transformation. In the process of integrating new LCSA datasets, we faced the technical issue that many LCSA datasets are shared in various non-normative formats. For example, the Exiobase dataset is shared as a set of spreadsheets without an associated ontology. Similarly, the YSTAFDB datasets are provided as plain CSV files. Moreover, even within the same file format (e.g., CSV files), the data structure might differ from dataset to dataset due to the lack of standardization between LCSA datasets. To allow automatic transformation and integration of new datasets by a common set of data converters, we defined a common intermediate CSV format. The Formalization Transformation activity represents the conversion from the dataset-specific formats to the common one (with output Formalized Dataset in Fig. 2). The formalized datasets contain separate lists of Flows, Flow Objects, Activity Types, and Locations. This formalization task could also be carried out by any data provider who wants to include their dataset in the BONSAI database.
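As an illustration of the formalization step, the sketch below writes formalized CSV files with one file per concept list. The file names, column layout, and example codes are assumptions made for illustration, not the normative intermediate format.

import csv

# Hypothetical layout of the common intermediate CSV format: one file per
# concept list, each with a code and a human-readable name.
def write_formalized(path, header, rows):
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)

# A few illustrative rows as they might be extracted from a source spreadsheet.
write_formalized("activitytypes.csv", ["code", "name"],
                 [("paddy_rice", "Cultivation of paddy rice")])
write_formalized("flowobjects.csv", ["code", "name"],
                 [("paddy_rice_product", "Paddy rice")])
write_formalized("locations.csv", ["code", "name"],
                 [("DK", "Denmark")])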
RDF Data Extraction. The final step in the integration of a new dataset is the actual conversion of the formalized data into an RDF graph coherent with the BONSAI Ontology. Custom scripts are used in this process (called Data Extraction) to create named graphs from the formalized data. The result is one or more named graphs with instances of Flow Objects, Activity Types, Locations, and Flows (Named Graphs in Fig. 2). Our convention is to create a named graph for each class of instances. Thus, if a new dataset presents Locations, Flow Objects, Activities, and Activity Types, we create four new named graphs, one for each of the four classes. This convention also avoids duplicating concepts by storing each of them only once in its dedicated named graph. Since the same information usually appears in several datasets, the other datasets, when integrated, simply reference the information already in the predefined named graph, avoiding redundancies. The newly generated graphs can then be published via a SPARQL endpoint. Moreover, while the BONSAI classification is expanded as new named graphs are produced and integrated into the database, the intermediate resources (in the dashed ovals) can be discarded. Finally, since the conversion script is automatic (thanks to the formalization step), we can ensure its conformity to the proposed ontology and also identify missing information. In future work, we aim to also adopt shape expressions for syntactic validation of the extracted information.
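A minimal sketch of such an extraction script, assuming the hypothetical intermediate file from above and using rdflib, could create one named graph per class of instances as follows. The namespaces, graph URI, and ontology terms are illustrative, not the actual BONSAI identifiers.

import csv
from rdflib import Dataset, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

# Hypothetical namespaces and graph URI; the actual BONSAI identifiers differ.
BONT = Namespace("http://example.org/bonsai/ontology/")
AT = Namespace("http://example.org/bonsai/activitytype/")

ds = Dataset()
# One named graph per class of instances, here for Activity Types.
g_at = ds.graph(URIRef("http://example.org/bonsai/graph/activitytype"))

with open("activitytypes.csv") as f:
    for row in csv.DictReader(f):
        uri = AT[row["code"]]
        g_at.add((uri, RDF.type, BONT.ActivityType))
        g_at.add((uri, RDFS.label, Literal(row["name"])))

# Serialize all named graphs, e.g. as TriG, ready for publication.
ds.serialize("activitytypes.trig", format="trig")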
Integration of New Models. After a new dataset is integrated and published, the database is used as a source of information to compute new or updated IO models. The development of IO models from MR SUTs varies depending on the algorithm used for IO Modeling [7, 18]. Nonetheless, users of the BONSAI database can apply their own or predefined IO Modeling Algorithms to all or part of the data published in the database by querying only the required data. For instance, given that both Exiobase and YSTAFDB comply with the flow-activity model encoded in the ontology, data from both can be processed together, or a user can select a portion of them for IO Modeling in a specific sector. This step is illustrated in Fig. 2 as the process IO Modeling, which uses the named graphs in the database along with an IO Modeling Algorithm. The result of this process is a new named graph representing the Flows and the corresponding information in the IO model. This means that the IO models themselves can also be inserted into the database (illustrated with a dashed line between the IO Models and the Named Graphs).
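For instance, a user preparing input for an IO Modeling Algorithm might retrieve only the required Flows with a SPARQL query against the published endpoint. The sketch below uses SPARQLWrapper; the endpoint URL, graph URI, and property names are illustrative assumptions rather than the actual BONSAI terms.

from SPARQLWrapper import SPARQLWrapper, JSON

# Hypothetical endpoint, graph URI, and ontology terms, for illustration only.
sparql = SPARQLWrapper("http://example.org/bonsai/sparql")
sparql.setQuery("""
    PREFIX bont: <http://example.org/bonsai/ontology/>
    SELECT ?flow ?value WHERE {
      GRAPH <http://example.org/bonsai/graph/flow> {
        ?flow a bont:Flow ;
              bont:value ?value .
      }
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["flow"]["value"], binding["value"]["value"])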
Metadata Annotation. For any system that incorporates data from multiple diverse sources, keeping provenance information about individual pieces of data is crucial. For new datasets, this corresponds to information about their origin, in particular the organization that produced them and the time at which they were produced. For IO models, it also includes the portion of the dataset used to compute them and metadata about how they were computed. Therefore, during the integration processes described above, the output datasets are also annotated with provenance information, as described in Sect. 5.
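A minimal sketch of such an annotation, assuming a PROV-O-style vocabulary is used for provenance (the graph, dataset, organization URIs, and timestamp below are hypothetical), could look as follows.

from rdflib import Graph, Literal, URIRef
from rdflib.namespace import PROV, XSD

# Hypothetical URIs for the named graph, the source dataset, and the
# responsible organization; the timestamp is illustrative.
g = Graph()
graph_uri = URIRef("http://example.org/bonsai/graph/activitytype")
g.add((graph_uri, PROV.wasDerivedFrom,
       URIRef("http://example.org/datasets/exiobase")))
g.add((graph_uri, PROV.wasAttributedTo,
       URIRef("http://example.org/agents/exiobase-consortium")))
g.add((graph_uri, PROV.generatedAtTime,
       Literal("2019-11-01T00:00:00", datatype=XSD.dateTime)))
print(g.serialize(format="turtle"))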
Handling Updates. The pipeline is rerun whenever a new dataset is integrated or a new version of an already integrated dataset becomes available. All steps of the pipeline must be rerun for the integration of new datasets, but changes to existing datasets often do not require the initial manual step of Create Correspondence Table, since the schema rarely changes between versions of a dataset.