
1 Introduction

The wide use, availability, affordable cost, interoperability, and analytical exploitation of financial data are essential to the European data strategy. Graphs, or linked data, are crucial to innovation, competition, and prosperity, and represent a strategic investment in technical processing and ecosystem enablers. Graphs are universal abstractions that capture, combine, model, analyze, and process knowledge about the real and digital worlds into actionable insights through the representation of items and their interconnections. In this context, graphs are extreme data enablers that require further technological innovation to meet the needs of the European data economy. A study by IBM [1] revealed that the world generates nearly 2.5 quintillion bytes of data daily, posing extreme business analytics challenges. Graph-based technologies help pursue the United Nations Sustainable Development Goals [2] by enabling better value chains, products, and services for green financial investments and by deriving trustworthy insights for creating sustainable communities.

The improvement and optimization of green investments and trading face significant barriers. Historical securities data, particularly environmental, social, and governance (ESG) data (available only since the early 2010s), is insufficient for in-depth testing, for derisking financial algorithms, and for training AI models. Moreover, financial data is often difficult and expensive to access for training AI-driven financial algorithms. Typically, only one historical record per security is available for optimizing a financial strategy, which can lead to deficiencies in training data and to losses during live trading.

The Graph-Massivizer project [1] aims, among other goals, to remove the limitations of financial market data (limited volume, reduced accessibility, price barriers) by enabling fast, semi-automated creation of realistic and affordable synthetic extreme financial datasets, unlimited in size and accessibility. The extreme synthetic data goes one order of magnitude beyond current big financial data, aiming at petabyte volumes and affordable prices. The project researches and develops a high-performance, scalable, gender-neutral, secure, and sustainable platform based on the massive knowledge graph (KG) representation of extreme financial data. It delivers the Graph-Massivizer toolkit of five open-source software tools and findable, accessible, interoperable, and reusable (FAIR) graph datasets that cover the sustainable lifecycle of processing extreme data as massive graphs, as shown in Fig. 1.

Fig. 1. Sustainable massive KG operation lifecycle [1].

This paper provides a comprehensive introductory overview of the green and sustainable finance pilot use case researched in the Graph-Massivizer project. Section 2 outlines related work, structured around functional gaps in historical financial data offerings that motivate synthetic financial data as an alternative to actual data, and concludes with a short list of relevant financial synthetic data companies that highlights the market's innovation potential. Section 3 presents the green and sustainable finance use case, comprising its conceptual architecture, objectives, and scientific challenges. Section 4 outlines the conceptual architecture of the Graph-Massivizer platform and toolkit, followed by a summary of the planned financial use case integration in Sect. 5. Section 6 concludes the paper.

2 State-of-the-Art in Green and Sustainable Finance

Intensive testing and data accuracy, quality, and quantity are paramount to investment and trading. Strict statistical relevance is necessary, such as conducting back-tests of financial algorithms on a minimum of 10,000 out-of-sample data points. Testing becomes even more challenging when various machine learning (ML) models are employed or benchmarked against each other to enhance existing financial algorithms. While sourcing and preparing data has become more accessible for companies in recent years, numerous challenges persist. We identify several gaps in commercial financial data.
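To illustrate why such large out-of-sample counts matter, the following minimal Python sketch (with illustrative, hypothetical figures for a strategy's daily edge and volatility) computes the t-statistic of the mean daily return at several sample sizes; only the larger samples separate a small genuine edge from luck.

```python
import numpy as np

def backtest_t_stat(daily_returns: np.ndarray) -> float:
    """t-statistic of the mean daily return; larger absolute values
    indicate the observed edge is less likely to be luck."""
    n = len(daily_returns)
    return daily_returns.mean() / (daily_returns.std(ddof=1) / np.sqrt(n))

rng = np.random.default_rng(42)
edge, vol = 0.0002, 0.01  # hypothetical 2 bps daily edge, 1% volatility
for n in (250, 2_500, 10_000):  # roughly 1, 10, and 40 years of trading days
    t = backtest_t_stat(rng.normal(edge, vol, n))
    print(f"{n:>6} out-of-sample points -> t-stat {t:.2f}")
```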

2.1 Functional Gaps in Historical Financial Data

According to Appen’s 2022 report [4], 42% of technologists find data sourcing challenging, 34% data preparation, and 38% model testing and deployment. Furthermore, 51% consider data accuracy critical for artificial intelligence (AI) use cases, yet 78% report that training data accuracy varies widely, between 1% and 80%. Training and testing AI models with low-quality data makes the model predictions and results inaccurate and inapplicable to financial transactions.

Regarding data volumes, Capital Fund Management [5] used all historical data from 1800 onwards for back-testing to achieve statistical significance. Time-series data for futures and equities extends back at least to the 1960s and 1970s and, where possible, as far back as 1800 (e.g., monthly data for many indices, commodities, bonds, and various interest rates). To illustrate the scale of their current data acquisition, they operate over 1,500 servers that collect and present over three terabytes of information daily. However, historical financial data may be irrelevant to financial model testing for several reasons, discussed in the following paragraphs.

Changes in the Business Environment.

The business environment can change rapidly, and old financial data may not reflect the current market conditions. Economic factors such as inflation, interest rates, and trade policies can significantly impact a company’s financial performance, and these factors can change over time.

Changes in the Company’s Operations.

Companies can change operations over time, affecting their financial performance. For example, a manufacturing company may shift to a service-oriented business model, impacting its financial statements.

Changes in Accounting Standards.

Accounting standards can change over time, making reporting, recording, and comparing from different periods difficult, as the accounting methods may differ.

Outdated Technology.

The methods used to collect and process financial data can become outdated. For example, financial data collected manually and recorded on paper may be less reliable than data collected electronically.

Data Quality.

Financial data can be subject to errors and inconsistencies, which become more prevalent as the data ages, making it challenging to use for accurate testing of financial models.

Statistical significance and long back-test periods are critical. However, going back in time, even to the 1960s and 1970s, data richness decreases dramatically compared to today. Furthermore, market conditions were different back then, and algorithms designed for today’s markets do not necessarily fit the conditions of the 1960s and 1970s.

Data purchase and storage costs are another critical barrier for all but the largest and wealthiest financial institutions, which can theoretically afford to purchase any volume needed. As big financial data refers to terabytes of structured and unstructured data, sourcing, storing, preparing, and testing models on it is costly for any other financial player. Therefore, most companies use smaller, more affordable datasets that are insufficient, incomplete, or biased, affecting the models’ results and performance.

Another critical issue is the lack of accuracy and auditing of some real-world financial data, particularly ESG data, briefly detailed in Table 1. Unlike accounting data, which undergoes auditing to ensure its accuracy and integrity, there are no clear regulations for verifying ESG data accuracy [6]. Nevertheless, regulatory efforts are underway, as evident from the shift of ESG issues from voluntary disclosure to regulatory obligation. This development has significant implications for organizations collecting, verifying, and utilizing ESG information. According to Thomson Reuters, while current regulations are still incomplete, the direction is towards greater regulation of ESG issues (see Table 1).

Table 1. Environmental, social, and governance parameters.

One significant challenge associated with using limited or reused real datasets is the overfitting of financial algorithms. This phenomenon occurs when algorithms are designed to fit the real datasets too closely, resulting in high performance on test data but poor performance on previously unseen data. Overfitting is a common problem in ML and data mining, mitigated by regularization, cross-validation, and ensemble learning techniques. Nevertheless, it remains a critical issue that needs careful consideration when working with real datasets in algorithm development (Table 2).

Table 2. Environmental, social, and governance data challenges.
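To make the overfitting mitigation mentioned above concrete, the following sketch shows walk-forward cross-validation with scikit-learn's TimeSeriesSplit, one standard way to curb overfitting on time series; the random features, returns, and Ridge regularizer are hypothetical stand-ins, not techniques prescribed by the project.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
from sklearn.linear_model import Ridge  # regularization also curbs overfitting

# Hypothetical features and next-day returns; stand-ins for real market data.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(2_000, 10)), rng.normal(scale=0.01, size=2_000)

# Walk-forward splits never train on data that follows the test window,
# avoiding the look-ahead bias of naive k-fold splits on time series.
scores = []
for train_idx, test_idx in TimeSeriesSplit(n_splits=5).split(X):
    model = Ridge(alpha=1.0).fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[test_idx], y[test_idx]))
print("out-of-sample R^2 per fold:", np.round(scores, 4))
```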

Although historical financial data represents quantified actual events, history never repeats itself. Even for recurring market booms and crashes [16], the underlying economic factors are distinct and unique each time, exposing models and algorithms to new, unencountered financial situations. As a result, linking historical data with future performance can be challenging. The disclaimer commonly attached to mutual funds, warning investors that past performance is not a reliable indicator of future success, exemplifies this challenge.

2.2 Synthetic Financial Data

Given the limitations of historical data, synthetic data, which closely mimics real-world data, emerges as an alternative and complementary solution for testing financial models and algorithms. According to Jonathan Kinlay, head of quantitative trading at Systematic Strategies LLC [17], synthetic data addresses one primary concern about using real data series for modeling purposes: models designed to fit the historical data produce test results that one is unlikely to replicate [18]. Such models are not robust to the changes likely to occur in any dynamic statistical process and will perform poorly out of sample. Synthetic data exposes and stress-tests financial models against new situations, validating or invalidating their assumptions and revealing their strengths and weaknesses.

Linden [19] raised many interesting points about the “usability and future of synthetic data, increasing the accuracy of ML models.” Real-world data is happenstance and does not contain all permutations of conditions or events possible in the real world. Synthetic data can help generate data at the edges or for unseen conditions.

According to a Gartner analyst, data and analytics leaders who deploy synthetic data correctly can create more efficient AI models, taking their organizations’ applications to the next level [20]. Gartner further estimates that by 2030, synthetic data will overshadow actual data in a wide range of AI models and will help organizations understand the technology’s potential [21]. Nevertheless, risks are also present, as the quality of synthetic data relies on the quality of the model that created it and of the resulting dataset. Therefore, synthetic data requires additional verification steps, such as comparison with human-annotated real-world data, to ensure its validity. The widespread use of synthetic data also raises questions about the transparency and explainability of the techniques used to generate it.

According to Datanami [28], hedge funds and banks deploy KGs as a powerful option to meet growing functional data management challenges, and KGs are gaining traction in the financial space.

2.3 Green and Sustainable Finance Market

Statice.ai [22] presents a comprehensive list of 56 synthetic data vendors, a thriving ecosystem covering many industries. However, few vendors target financial markets with dedicated KG applications. Table 3 summarizes several relevant synthetic data companies focused on finance and banking, selected from an extensive directory [23]. We identified a large ecosystem of synthetic data producers and users but very few players in the green financial market. The analysis shows a strong drive for synthetic data and some competition, but enough space for innovative newcomers.

Table 3. Relevant financial synthetic data companies.

3 Graph Processing for Green Finance

Green finance targets financial products and services that direct investments into green-oriented enterprises. It aims for economic growth while reducing waste, pollution, and greenhouse gas emissions and improving overall efficiency. Sustainable finance considers ESG factors in investment decisions for long-term sustainable economic activities.

Financial KG Use Cases.

Deloitte [29] lists several use cases that apply KGs to the financial services sector, summarized in Table 4. Linked data standards such as the hypertext transfer protocol (HTTP), uniform resource identifiers (URIs), and the Resource Description Framework (RDF) [30] represent data in a single interchangeable format that both machines and humans understand. Additionally, multi-model graph databases support multiple data models on a single, integrated backend.

Table 4. Financial KG use case examples.
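As a brief illustration of the linked data standards mentioned above, the following sketch uses the rdflib Python library to express a hypothetical green bond, its issuer, and an ESG score as RDF triples; the fin namespace, class names, and values are invented for illustration and do not reflect the project's actual financial ontology.

```python
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

# Hypothetical namespace; the project's actual financial vocabulary may differ.
FIN = Namespace("http://example.org/fin#")

g = Graph()
g.bind("fin", FIN)

# A security, its issuer, and one ESG score, all as linked data triples.
g.add((FIN.ACME_2030_bond, RDF.type, FIN.GreenBond))
g.add((FIN.ACME_2030_bond, FIN.issuedBy, FIN.ACME_Corp))
g.add((FIN.ACME_Corp, FIN.esgScore, Literal(72.5, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```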

Financial KG.

Graph-Massivizer aims to remove the limitations of financial market data providers (limited volume, reduced accessibility, high costs) by enabling fast, semi-automated creation of realistic and affordable synthetic extreme financial datasets, unlimited in size and accessibility. The extreme financial datasets will enable improved ML-based green investment and trading simulations, free of critical biases such as prior knowledge, overfitting, and indirect contamination due to present data scarcity. We plan to research a financial knowledge graph as a fundamental data structure representing a hybrid, graph-based financial metadata structure (time series, values, booleans, monetary data, securities taxonomies, statistical factors, rules) that helps research improved financial algorithms operating in five high-level steps (a minimal sketch of the synthetic data generation step follows the list):

  • Historical financial data structure mapping into a financial KG;

  • Synthetic data generation by preserving the original historical statistical features;

  • Missing data interpolation using ML inference and reasoning methods;

  • Green financial investments and trading simulation;

  • Recommendation of the “greenest” investments and trading opportunities.
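As one minimal sketch of the synthetic data generation step, the snippet below fits the drift and volatility of hypothetical historical log-returns and simulates geometric Brownian motion paths that preserve those two moments; GBM is a classical illustration only, not necessarily the generator researched in the project.

```python
import numpy as np

def synthetic_paths(prices: np.ndarray, n_paths: int, horizon: int,
                    seed: int = 7) -> np.ndarray:
    """Fit the drift and volatility of historical log-returns, then simulate
    geometric Brownian motion paths preserving those two moments."""
    log_ret = np.diff(np.log(prices))
    mu, sigma = log_ret.mean(), log_ret.std(ddof=1)
    shocks = np.random.default_rng(seed).normal(mu, sigma, size=(n_paths, horizon))
    return prices[-1] * np.exp(np.cumsum(shocks, axis=1))

# Hypothetical historical prices; real inputs would come from the financial KG.
history = 100 * np.exp(np.cumsum(np.random.default_rng(1).normal(3e-4, 0.012, 1_000)))
paths = synthetic_paths(history, n_paths=5, horizon=250)
print(paths.shape)  # (5, 250): five one-year synthetic price paths
```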

Scientific Challenges.

To define our goals and methodology, we first conducted a comprehensive analysis to identify the scientific challenges of using KGs for green and sustainable finance, as summarized in Table 5.

Table 5. Scientific challenges of KGs for green and sustainable finance.

Green Financial Data Multiverse.

Graph-Massivizer targets energy-efficient synthetic financial data generation (Fig. 2) in the 1–75 petabyte range, validated by standard green financial investment and trading algorithms. The developed technology promises 90% energy consumption accountability for extreme data creation streamed to clients. Samples will be available as open data for internal testing. The availability of cheaper synthetic financial data for testing in extreme quantities allows more fintech companies, funds, and investors to test and derisk investment models.

Fig. 2. Green financial data multiverse in Graph-Massivizer [1].

Greener Financial Algorithms and Better Investments.

Peracton Ltd. aims to use the financial multiverse for improved, green, AI-enhanced financial algorithms with reduced bias and risk and an investment return increased by a realistic 2%–4%. It further targets an increase in excess return (alpha) of 1%–2% with a quick ratio higher than 1.5, reflecting healthy investments with lower risk and higher returns.

4 Graph-Massivizer Toolkit

The Graph-Massivizer toolkit architecture consists of five tools, depicted as a simplified C4 container diagram in Fig. 3 and published in [1]: Graph-Inceptor, Graph-Scrutinizer, Graph-Optimizer, Graph-Greenifier, and Graph-Choreographer.

Fig. 3. Graph-Massivizer conceptual architecture using a simplified C4-context diagram [1].

Operational graph layer generates, transforms, and manipulates extreme data through basic graph operations (BGOs), which comprise graph creation, enrichment, query, and analytics.

Graph creation, implemented by the Graph-Inceptor tool, translates extreme data from various static and event streams, or follows heuristics to generate synthetic data, and persists or publishes it within a graph structure.

Graph enrichment, graph query, and graph analytics are three operation types implemented by the Graph-Scrutinizer tool. They analyze and expand extreme datasets using probabilistic reasoning and ML algorithms for graph pattern discovery, low-memory-footprint graph generation, and low-latency, error-bounded query responses. The output of this phase is a new graph, a query result, or an enriched structured dataset.

Graph processing layer provides sustainable, energy-aware, serverless graph analytics on the heterogeneous HPC infrastructure.

Graph workload modeling and optimization, represented by the Graph-Optimizer tool, analyzes and expresses a given graph processing workload as a workflow of BGOs. It further combines parametric BGO performance and energy models with hardware models to generate accurate performance and energy consumption predictions for the workload running on a given multi-node, heterogeneous infrastructure of CPUs, GPUs, and FPGAs. The predictions indicate the most promising combinations of BGO optimizations and infrastructure, representing a co-designed solution for the given workload while guaranteeing its performance and energy consumption bounds.
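The following toy sketch illustrates the kind of prediction described above: a parametric throughput-and-power model per hardware target yields runtime and energy estimates for a BGO workload. All throughput, power, and workload figures are hypothetical; Graph-Optimizer's actual models are far more detailed.

```python
from dataclasses import dataclass

@dataclass
class Hardware:
    name: str
    edges_per_sec: float  # hypothetical sustained BGO throughput
    watts: float          # hypothetical average power draw under load

def predict(edges: float, hw: Hardware) -> tuple[float, float]:
    """Toy parametric model: runtime from throughput, energy = power x time."""
    seconds = edges / hw.edges_per_sec
    return seconds, hw.watts * seconds / 3.6e6  # joules -> kWh

targets = [Hardware("CPU node", 2e8, 350),
           Hardware("GPU node", 3e9, 700),
           Hardware("FPGA card", 1e9, 75)]
workload_edges = 5e11  # a hypothetical BGO over half a trillion edges
for hw in targets:
    t, kwh = predict(workload_edges, hw)
    print(f"{hw.name:>9}: {t / 3600:6.2f} h, {kwh:6.3f} kWh")
```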

Sustainability analysis, implemented by the Graph-Greenifier tool, collects, studies, and archives performance and sustainability data from operational data centers and national energy suppliers on a large scale. It simulates multi-objective infrastructure sustainability profiles for operating graph analytics workloads, trading off performance and energy (e.g., consumption, CO2, methane, GHG emissions) metrics. Its purpose is to model the impact of specific graph analytics workloads on the environment for evidence-based decision-making.
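As a minimal illustration of such evidence-based decision-making, the snippet below converts a predicted energy figure into CO2-equivalent emissions under different grid carbon intensities; the intensity values are illustrative assumptions, not project data.

```python
# Hypothetical grid carbon intensities in kg CO2e per kWh (illustrative only).
carbon_intensity = {"Hydro-heavy grid": 0.03,
                    "EU average grid": 0.25,
                    "Coal-heavy grid": 0.80}

energy_kwh = 120.0  # energy predicted for a hypothetical graph analytics run
for grid, kg_per_kwh in carbon_intensity.items():
    print(f"{grid:>16}: {energy_kwh * kg_per_kwh:6.1f} kg CO2e")
```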

Serverless BGO processing, implemented by the Graph-Choreographer tool, uses performance and sustainability models and data to deploy serverless graph analytics on the computing continuum. It relies on novel scheduling heuristics, infrastructure partitioning, and environment-aware processing for scalable orchestration of serverless graph analytics with accountable performance and energy tradeoffs.
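A minimal sketch of an environment-aware scheduling heuristic follows: among targets whose predicted runtime meets a deadline, pick the lowest predicted energy. The targets and figures are hypothetical, and the actual Graph-Choreographer heuristics are a research subject of the project.

```python
def choose_target(predictions: list[tuple[str, float, float]],
                  deadline_s: float) -> str:
    """Greedy heuristic: among targets meeting the deadline, pick the one
    with the lowest predicted energy; otherwise fall back to the fastest."""
    feasible = [p for p in predictions if p[1] <= deadline_s]
    pool = feasible or predictions
    key = (lambda p: p[2]) if feasible else (lambda p: p[1])
    return min(pool, key=key)[0]

# (target, predicted seconds, predicted kWh) -- hypothetical numbers.
preds = [("cloud-gpu", 600, 0.9), ("edge-fpga", 3_000, 0.2), ("fog-cpu", 9_000, 0.6)]
print(choose_target(preds, deadline_s=3_600))  # -> edge-fpga
```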

Hardware infrastructure layer considered by Graph-Massivizer consists of geographically distributed data centers across the Cloud HPC, mid-range Fog, and low-end Edge computing continuum.

Table 6. Bias and ethical concerns in green and sustainable finance.

5 Green and Sustainable Finance in Graph-Massivizer

The container diagram in Fig. 4 depicts the Graph-Massivizer tool pipeline implementing the green and sustainable finance use case for the fast, semi-automated creation of realistic and affordable synthetic financial datasets of extreme size. Peracton Ltd. will test the produced extreme synthetic financial data on its platform to evaluate its quality and its impact on the training and testing results of green financial algorithms. The extreme training datasets allow, by design, various data distributions and market scenarios, including extreme variations that never occurred in the past. They will expose financial algorithms and models to new conditions and tune them through appropriate stress tests.

Fig. 4. Green and sustainable finance architectural component diagram in Graph-Massivizer.

Data Preprocessing and KG Creation.

The financial data sources component provides historical data samples purchased by Peracton Ltd. (PER) as input to the platform in XML and CSV formats. Then, the Graph-Inceptor tool extracts the historical financial data, maps it onto a massive financial KG structure using its graph creation component, and stores it in the graph database for later use, which can also support auditing of these processes.
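A minimal sketch of what such a mapping step might look like, using the rdflib library: rows of a hypothetical quotes CSV become triples in a financial KG. The column names, namespace, and vocabulary are invented for illustration and do not reflect Graph-Inceptor's actual implementation.

```python
import csv
from rdflib import Graph, Literal, Namespace, RDF
from rdflib.namespace import XSD

FIN = Namespace("http://example.org/fin#")  # hypothetical ontology namespace

def csv_to_kg(path: str) -> Graph:
    """Map rows of a hypothetical quotes CSV (ticker,date,close) to triples."""
    g = Graph()
    g.bind("fin", FIN)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            quote = FIN[f"{row['ticker']}_{row['date']}"]  # one node per quote
            g.add((quote, RDF.type, FIN.Quote))
            g.add((quote, FIN.ofSecurity, FIN[row["ticker"]]))
            g.add((quote, FIN.closePrice,
                   Literal(row["close"], datatype=XSD.decimal)))
    return g
```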

Relationships and Pattern Detection.

The Graph-Scrutinizer tool runs probabilistic reasoning on the financial KG to identify financial data patterns and correlations that are hard to detect in the raw data. Then, the inter-company/product graph (ICG) creation component, driven by a graph convolutional network engine, records the identified patterns and correlations.
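As a simple stand-in for this component, the sketch below builds an inter-company graph by connecting companies whose return series are strongly correlated; thresholded correlation is an illustrative simplification of the GCN-driven pattern recording described above.

```python
import numpy as np
import networkx as nx

def correlation_graph(returns: dict[str, np.ndarray], threshold: float = 0.6):
    """Connect companies whose return series are strongly correlated;
    a simple stand-in for the learned inter-company/product graph."""
    g = nx.Graph()
    names = list(returns)
    g.add_nodes_from(names)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            rho = np.corrcoef(returns[a], returns[b])[0, 1]
            if abs(rho) >= threshold:
                g.add_edge(a, b, weight=round(float(rho), 3))
    return g

rng = np.random.default_rng(3)
base = rng.normal(size=500)  # a shared hypothetical market factor
rets = {"A": base + rng.normal(scale=0.5, size=500),
        "B": base + rng.normal(scale=0.5, size=500),
        "C": rng.normal(size=500)}
print(correlation_graph(rets).edges(data=True))  # likely only (A, B)
```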

Synthetic Data Generation.

The missing data interpolation component handles incomplete data and gaps encountered in the historical data and generates synthetic data ranges to fill them. Then, the synthetic graph generation component uses the ICG to generate a synthetic financial KG as a template for generating further synthetic data.
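A minimal sketch of gap filling on a hypothetical daily close series, using time-aware linear interpolation from pandas as a simple stand-in for the ML inference and reasoning methods the component would apply:

```python
import numpy as np
import pandas as pd

# A hypothetical daily close series with gaps (NaNs) in the historical data.
idx = pd.date_range("2023-01-02", periods=8, freq="B")
close = pd.Series([100.0, 101.2, np.nan, np.nan, 103.0, 102.4, np.nan, 104.1],
                  index=idx)

# Time-aware linear interpolation fills the gaps; an ML-based component
# would instead infer plausible values from the surrounding KG context.
filled = close.interpolate(method="time")
print(filled.round(2))
```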

Quality Rules Implementation.

The Graph-Optimizer tool executes BGOs and applies synthetic data quality rules to the generated synthetic financial KG to instantiate further synthetic data.
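The rule set below is a hypothetical illustration of what such quality rules could check: that a synthetic return series preserves the volatility, lag-1 autocorrelation, and mean of the real series within stated tolerances. The thresholds and statistics are assumptions, not the project's actual rules.

```python
import numpy as np

def passes_quality_rules(real: np.ndarray, synth: np.ndarray) -> bool:
    """Hypothetical rule set: synthetic returns must preserve volatility
    (within 10%), lag-1 autocorrelation (within 0.05), and a mean within
    two standard errors of the real series' mean."""
    ac = lambda x: np.corrcoef(x[:-1], x[1:])[0, 1]
    se = real.std(ddof=1) / np.sqrt(len(real))
    return (abs(synth.std(ddof=1) - real.std(ddof=1)) <= 0.10 * real.std(ddof=1)
            and abs(ac(synth) - ac(real)) <= 0.05
            and abs(synth.mean() - real.mean()) <= 2 * se)

rng = np.random.default_rng(5)
real = rng.normal(3e-4, 0.012, 5_000)
# A shuffled (bootstrap-style) surrogate preserves the marginal distribution.
print(passes_quality_rules(real, rng.permutation(real)))  # expected: True
```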

Synthetic Data Storage.

The financial storage and the continuum infrastructure host the generated synthetic data for further use in financial model simulations.

6 Conclusions

Graph-Massivizer allows European green financial investors to avail of a massive synthetic financial data multiverse and a proven competitive, sustainable advantage. While other forms of analysis rely on assumptions about “what happened” or “what happens,” correctly building and employing financial graphs to generate synthetic data can further reveal patterns suggesting what “might happen,” with clear evidence for each connection or inference step. Graph processing facilitates problem-solving driven by metrics related to costs or inefficiencies. The large-scale financial graph analytics market is still in a developing phase, hampered by the lack of technology research and use case adoption. Graph-Massivizer provides these missing links for Europe.

Finally, we constantly monitor and address biases and ethical concerns arising from the use of various AI and IT technologies and data, as summarized in Table 6.