
1 Introduction

Drawing on large-scale online and commercial data to construct insights into social development offers immense potential benefits, for example, revealing trends in economic and business development, emerging patterns in people's daily life choices, the comparative technological progress of competing regions, and population sentiment toward social events. These insights are valuable, sometimes critical, in scenarios such as helping government agencies make more objective policy, aiding the decision-making of investors before putting money into a certain business in a certain region, or even helping individuals who simply want to check the outlook of cities and companies before deciding among several job offers.

Recent years have seen many articles investigating various aspects of social activities and development based on data and models. Bonaventura et al. [23] construct a worldwide professional network of start-ups; the time-varying network connects start-ups that share one or more individuals who have played a professional role in both, and the authors suggest that such a network has predictive power for assessing the potential of early-stage companies. The report in [26] investigates foreign interference on Twitter during the 2020 US presidential election: natural language processing models are used to classify troll accounts, and network flow statistics are leveraged to reveal super-connectors. Building on the results of these models, the report quantifies the prevalence of troll and super-connector accounts in various politically inclined communities and the influence these accounts have within them. Jia et al. [24] devise a risk model of COVID-19 based on aggregate population flow data; the model forecasts the distribution of confirmed cases and identifies high-risk regions threatened by virus transmission. One such model is built and verified using major-carrier mobile phone geolocation data from individuals leaving or transiting through Wuhan between 1 January and 24 January 2020. The authors suggest the methodology can be used by policy-makers in any nation to build similar models for risk assessment.

To realize many of the aforementioned applications, a large amount of data needs to be acquired, stored, and processed; a scalable and efficient big data processing platform is therefore key. In our company, we have built such a data platform. We argue that the data stack of our platform provides enough flexibility to incorporate a variety of modern data component implementations and products from different vendors and bring them together to enable data applications for our use cases. The design of the data stack enables mainly two categories of applications: analytics-oriented applications and real-time transactional applications (usually customer-facing). These two categories suit different use cases when developing data applications for extracting insights into social development. We showcase two concrete applications: one is a notebook-like analytics tool for analysts to examine research publications of a country with the world's biggest population; the other is a customer-facing search application, one of whose functions is to retrieve and summarize companies' patent statistics over the past 20 years in the same country.

This paper's main contribution is not advancing the techniques of individual data components; rather, it is a practical study of how to incorporate appropriate data techniques under the flexible stack framework we propose, to enable real-world data-oriented use cases with minimum time-to-market. We document the technical trade-offs we made in choosing the right set of components and technologies from the many existing ones, and we use these components to compose a cohesive platform that suits our use cases.

In the remainder of this paper, Sect. 2 presents the architecture of the big data stack and then dives into the technical reasoning behind choosing concrete techniques for several key components. Section 3 shows two example applications and explains how the big data stack enables their swift development, followed by the conclusion in Sect. 4.

2 The Big Data Stack

Figure 1 depicts a high-level view of the big data stack, which is composed of five key components. The past decade has seen a blossoming of technologies that could be used to implement the components of the proposed stack. Having too many techniques to choose from often brings no help; on the contrary, it poses considerable challenges for a system architect, who needs to carefully compare and make trade-offs between several technologies and eventually decide on the right one to incorporate into a single cohesive stack.

The applications we want these techniques to enable fall mainly into two categories: analytics-oriented and real-time customer-facing. To enable these two categories, we set out with a number of goals for choosing the techniques that implement the data stack. The first major goal is flexibility: we strive to keep our options open so that we can switch to a different technique in the future if needed, and avoid being locked into a particular set of techniques. Scalability and agility are two goals for analytics-oriented applications. Responsiveness is a goal for real-time applications, where "real-time" means the processing time is on the order of a sub-second.

Below we dive into the technical reasoning behind the choice of concrete techniques for each component of our stack. Note that the index numbers of the list items correspond to the labels of the components in Fig. 1.

Fig. 1. The conceptual view of the big data stack of our data intelligence platform. Each component in this figure has a corresponding text paragraph with the same label giving more detail.

  1. Data Governance

    The social development data can be structured, e.g., files with a clearly defined schema such as CSV or Parquet [28] files; semi-structured, such as XML and JSON; or unstructured, e.g., pictures, audio files, and videos. Existing and emerging storage technologies to choose from include: traditional databases for structured data only, which aim to store key operational data; data warehouses, which are designed to store all of your data but mainly structured data, with Snowflake [17] and Oracle [10] as example providers; recent data lake technologies [25], which promise to store huge amounts of structured and unstructured data; and the data lakehouse [21], another recent storage paradigm attempting to unify the data warehouse and the data lake. We keep an open mind in choosing storage technologies, since we believe no single existing technology is currently mature enough to cover all cases. For our stack, we decide on data lake storage for the raw data of analytics-oriented applications, which meets the goal of scalability. For real-time applications, we integrate traditional relational databases for their optimized transaction handling, which supports responsiveness.

  2. Cloud Service

    Fifteen years after the launch of AWS, we now enjoy a competitive cloud service provider market. There are global leading providers such as AWS [2] and Azure [8], as well as regional challengers such as AliCloud [1] and OVHCloud [11]; the cloud services offered by different providers largely overlap and are gradually converging. The choice of provider often relies more on business factors, such as the availability of that provider in the region of the target market. We build internal software tools to abstract the native cloud services away from our applications as much as possible, and we invest in Kubernetes [27] as the application runtime environment so that we keep the option open to later evolve the stack towards hybrid or multi-cloud deployments if needed. Using cloud services enables scalability in both storage and computation.

  3. Algorithmic Computation

    A distributed data computation engine that provides parallel-processing capabilities is key to analytics-oriented applications processing massive datasets. Spark [4] and Flink [3] are two leading techniques. Flink has been a streaming-oriented data processing engine from the beginning, while Spark is the more popular engine for batch processing and is catching up in streaming. We choose Spark as our stack's compute engine because we consider Spark better positioned in the overall data processing ecosystem. Many technologies come with existing solutions for integrating with Spark, so we enjoy more flexibility when choosing other techniques and know they will integrate well with the compute engine. This computation component is related to, and interleaved with, the data analytics component described below; a minimal sketch of how the two fit together on our stack is given after this list.

  4. Data Analytics

    Many open source tools are available for data analysis; single-machine tools include Pandas [12], SciPy, and sklearn [15]. We prioritize supporting tools in the stack that can run on multiple machines, in order to harvest the distributed computing power provided by the cloud and to support agility for analytics-oriented applications. Spark is our chosen technique providing the desired distributed computation capability; additionally, Spark provides APIs with SQL semantics that are already familiar to many data-analysis specialists.

    TensorFlow [18] and PyTorch [13] are two machine learning tools that we aim to integrate into our platform.

    The design principle in this data analytics component is not to lose flexibility, so that we remain able to integrate more tools in the future if necessary. We try our best to avoid locking into a handful of tools prematurely. Tools with low learning curves are preferred, because agility is one of our main goals: we try to reduce as much as possible the unnecessary effort an analyst spends wrestling with unfamiliar tooling concepts or APIs.

  5. Data Applications

    We leverage the open-source frontend Jupyter [7] to build analytics-oriented applications. We also use data visualization tools directly from some cloud vendors, e.g., Power BI from Azure. When choosing such a vendor-specific data visualization tool, we usually examine whether it supports many data input/output techniques rather than only those from the same vendor. We decide on frontend frameworks such as Vue and ReactJS [14, 19], and backend frameworks such as NodeJS and Django [6, 9], to build customer-facing real-time applications. These techniques are mature and have been integrated and tested in cloud environments for many years. In addition, there are existing open source data connectors for the frameworks we choose, connecting them to different data storage techniques, so we keep our flexibility and avoid being locked into certain techniques. Another principle is to bias our choices towards techniques we can quickly prototype with and then iterate on with fast turn-around times; this helps achieve our agility goal.
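
To make the interplay between the components above more concrete, the following is a minimal PySpark sketch of how an analytics job might run on this stack. The Kubernetes master address, data lake path, dataset, and column names are illustrative assumptions, not our production configuration.

```python
# A hedged sketch: Spark (component 3) running on Kubernetes (component 2),
# reading semi-structured data from the data lake (component 1) and exposing
# both DataFrame and SQL APIs to analysts (component 4).
# All paths, addresses, and names below are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("stack-sketch")
    # Assumed Kubernetes API server address; in practice this is injected
    # by our internal tooling that abstracts away the cloud provider.
    .master("k8s://https://k8s.example.internal:6443")
    .getOrCreate()
)

# Raw JSON files stored in the data lake (hypothetical bucket layout).
events = spark.read.json("s3a://datalake/raw/sample_dataset/")

# The same dataset can be processed through the DataFrame API ...
per_category = events.groupBy("category").count()

# ... or through SQL, which most of our analysts already know.
events.createOrReplaceTempView("events")
per_category_sql = spark.sql(
    "SELECT category, COUNT(*) AS n FROM events GROUP BY category"
)
per_category_sql.show()
```

Because both APIs resolve to the same execution plan in Spark, analysts can mix them freely within a single session, which keeps the learning curve low while preserving the flexibility goal.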

3 Two Example Applications Enabled by the Stack

In this section, we showcase two example applications built on top of our data stack.

One analytics-oriented application, shown in Fig. 2a, investigates academic paper publication trends in each major city of China, for assessing cities' research activity levels. The research publication data we collected contains around 8 million entries, organized as JSON files totalling \(\sim \)80 GB. We use a cluster of 30 nodes, each with 14 GB of memory and 8 CPU cores, for data processing. Spark, the compute engine running on this cluster, orchestrates the distributed computation tasks of the analysis code. We choose a browser-based Jupyter [7] notebook environment for analysts to write analysis code; the analysis results returned by the compute engine are shown in the same UI. The programming API is a combination of SQL and DataFrame [22], both familiar to experienced analysts; in fact our analysts put these new tools to use after only a few hours of learning. Figure 2a shows the UI of this browser-based programming tool, backed by a powerful distributed cluster underneath. After loading the data into cluster memory within minutes, an analyst can use familiar APIs to program and then execute analytics tasks on the cluster. One example task is to group the publications by city and then sort the cities by publication count; this particular analysis task takes less than one minute on the whole 80 GB dataset. With such swift turn-around times across many analytics tasks, analysts feel enabled and motivated to explore more analysis questions and to experiment with more approaches to the same questions.
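
As an illustration, the city-ranking task described above could be expressed in a few lines of PySpark. The data lake path and the column names ("city", etc.) are hypothetical stand-ins; the real schema of the publication dataset is not reproduced here.

```python
# A minimal sketch of the example analysis task: group publications by city
# and rank cities by publication count. Paths and column names are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("city-publication-trends").getOrCreate()

# Load the collected publication entries (JSON) and keep them in cluster memory.
pubs = spark.read.json("s3a://datalake/raw/publications/").cache()

city_ranking = (
    pubs.groupBy("city")
        .agg(F.count("*").alias("n_publications"))
        .orderBy(F.desc("n_publications"))
)
city_ranking.show(20)
```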

Another application, depicted in Fig. 2b, is a customer-facing information-search web application. It provides a search function for companies and their patent statistics in different regions of China. We leverage a cluster of 60 nodes, each with 14 GB of memory and 4 CPU cores, to run the routine batch jobs that calculate the key metrics powering the search. One of the most expensive tasks in these batch jobs is calculating region-specific company patent metrics, which requires an expensive SQL JOIN of two large datasets: one is the company registration data of the past 20 years, consisting of \(\sim \)36 million entries; the other is the patent data of the past 20 years, which includes \(\sim \)28 million patent entries. The task first performs a SQL JOIN of these two large datasets and then a SQL GROUP BY to group companies with patents by region. In total this task takes around 12 min with the Spark engine on this cluster. The resulting metrics are then inserted into a PostgreSQL relational database, which in turn powers the web search application. The search portal responds to users' searches with results within a few seconds. Figure 2b shows one such search result page: a geographical map of all regions is on the left side, where each region is colored according to the number of companies holding patents, and the top 10 regions are listed on the right side. We were able to build this data application, from ingesting raw data, to setting up batch jobs for analysis, to eventually having the web search application powered by a relational database, in a matter of weeks. The cohesive data stack connects a number of data storage and compute technologies, enabling this swift development.
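
A condensed sketch of this batch job is shown below. The table locations, join key, column names, and PostgreSQL connection details are placeholders assumed for illustration, not our actual schema or credentials.

```python
# Sketch of the most expensive routine batch task: JOIN company registrations
# with patents, GROUP BY region, then publish the metrics to PostgreSQL.
# Paths, keys, and connection settings are illustrative placeholders.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("region-patent-metrics").getOrCreate()

companies = spark.read.parquet("s3a://datalake/curated/company_registrations/")
patents = spark.read.parquet("s3a://datalake/curated/patents/")

region_metrics = (
    companies.join(patents, on="company_id", how="inner")    # SQL JOIN
             .groupBy("region")                               # SQL GROUP BY
             .agg(
                 F.countDistinct("company_id").alias("companies_with_patents"),
                 F.count("patent_id").alias("n_patents"),
             )
)

# Write the aggregated metrics to the relational database that backs the
# customer-facing search application.
(
    region_metrics.write.format("jdbc")
    .option("url", "jdbc:postgresql://db.example.internal:5432/search")
    .option("dbtable", "region_patent_metrics")
    .option("user", "etl_user")
    .option("password", "***")
    .mode("overwrite")
    .save()
)
```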

Fig. 2. Two applications built on top of the big data stack.

4 Conclusion

We present the design of a big data stack whose components collectively function as a data intelligence platform for swiftly deriving social development insights from huge amounts of data. We present the concrete techniques used to implement this stack, as well as the underlying reasoning for choosing them among the many alternatives. The two showcases exemplify the two categories of applications this data stack enables: analytics-oriented applications and real-time applications.

We hope to spur discussions on related topics in the community that would also benefit the future development of our stack. The better the stack, the better it serves the purpose of providing insights and intelligence to aid informed decision-making in society.

For future development, one direction we are looking at is a data-mesh-like architectural paradigm [5, 16], whose purpose is to unlock access to a growing number of domain-specific datasets located within different organizations. Another direction is to ingest and process streaming data in near real time, for example, extracting information from real-time news feeds. We consider this a great technical challenge for our data stack, and we need to bring in new techniques carefully. Should it be implemented in our data stack, many interesting applications would become feasible. We believe the impact, particularly of presenting decision-makers with near-real-time insights from data, would be huge.