This special issue includes 9 papers selected, after a thorough reviewing process, from 46 submissions to the call. The special issue received significant attention from both the international research and industrial communities: 170 co-authors submitted their research contributions. They are geographically widely distributed, coming from 21 countries all over the world. Submitted papers have, on average, 3.5 authors each, and most of them present the outcome of international cooperation.
The characteristics of big data and new business needs to extract insights from data induce new research and technological challenges. The papers presented in this special issue cover a wide range of hot topics related to:
efficient data models, data processing pipelines and architectures to integrate standard and big data sources (Jovanovic et al. 2020) as well as to improve resource utilization and aggregate performance in shared environments (Michiardi et al. 2020);
predictive analytics to forecast product demand in the fashion industry (Gardino et al. 2020) and techniques to deal with the lack of annotated data for sensor-based human activity recognition (Prabono et al. 2020);
text data processing to assess the performance of text storage systems through a generic benchmark (Truică et al. 2020) and innovative solutions to deal with specific use cases such as the legal domain (Bordino et al. 2020);
novel approaches for mining social media to support intelligent transportation systems (Vallejos et al. 2020) and to dig deep into the IoT scenario (Ustek-Spilda et al. 2020);
solutions to deal with privacy issues in distance learning systems (Preuveneers et al. 2020).
The wordcloud in Fig. 1, generated through a traditional text-mining pipeline by analyzing the titles, abstracts, and keywords of all accepted papers, provides an overview of the special issue content. By showing the top-60 most frequent words characterizing the accepted papers, Fig. 1 highlights the macro-topics addressed (e.g., data, processing, model, learning, system), offers a detailed characterization of the data management and data analytics techniques (e.g., natural language, query, view management, machine learning), and shows the scenarios where they have been exploited (e.g., social network, text).
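The frequency-counting step at the core of such a pipeline can be sketched in a few lines. This is a minimal illustration, not the actual pipeline used for Fig. 1; the sample texts and the stopword list are invented for the example.

```python
import re
from collections import Counter

# Toy stopword list; a real pipeline would use a standard one (e.g. NLTK's).
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "for", "on", "with"}

def top_k_words(texts, k=60):
    """Return the k most frequent non-stopword tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())   # crude tokenizer
        counts.update(t for t in tokens if t not in STOPWORDS)
    return counts.most_common(k)

# Illustrative "abstracts", not the special issue's data:
abstracts = [
    "Efficient data processing pipelines for big data integration",
    "Machine learning models for data analytics and query processing",
]
print(top_k_words(abstracts, k=5))
```

The resulting (word, count) pairs are exactly what a wordcloud renderer consumes, with font size proportional to count.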
The content of the special issue is briefly introduced in the following subsections.
Efficient Data Models, Data Processing Pipelines and Architectures
One of the main challenges in managing big data is to build architectures for integrating big data sources as well as for ingesting and transforming big data into formats suitable for OLAP processing and machine learning algorithms. It is now widely agreed that the currently available architectures for integrating big data are mainly data lakes (Ceravolo et al. 2018; Nargesian et al. 2019) and polystores (Alotaibi et al. 2020; Gadepally et al. 2016; Tan et al. 2017). Both apply physical and virtual integration, and in both architectures metadata are crucial to build an interoperable data integration layer. Executing queries in data integration architectures is challenging for the following reasons. First, the information is constructed by crossing data from numerous sources, schemas, and formats. Second, a query on a global schema needs to be decomposed into queries on particular data sources. Third, such a query needs to be executed efficiently in a distributed and heterogeneous environment where multiple users submit queries concurrently.
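The query-decomposition step mentioned above can be illustrated with a deliberately simplified mediator: a query over a global schema is split into sub-queries against the sources that can answer each attribute. The source names and schemas below are invented for the example.

```python
# Hypothetical source schemas registered with a mediator.
SOURCES = {
    "sales_db": {"product_id", "quantity", "date"},
    "crm_db":   {"product_id", "customer", "region"},
}

def decompose(query_attrs):
    """Map each requested global attribute to the sources that provide it."""
    unanswerable = query_attrs - set().union(*SOURCES.values())
    if unanswerable:
        raise ValueError(f"no source provides: {sorted(unanswerable)}")
    plan = {}
    for src, schema in SOURCES.items():
        needed = query_attrs & schema
        if needed:
            plan[src] = sorted(needed)   # sub-query pushed to this source
    return plan

print(decompose({"quantity", "region", "product_id"}))
```

A real mediator would additionally rewrite the query through the schema mappings (e.g. LAV views) and plan the cross-source join; here the shared `product_id` attribute hints at where that join would occur.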
In this special issue, paper (Jovanovic et al. 2020) contributes an integration platform, called Quarry, for standard and big data sources. The platform supports the semi-automated construction of an end-to-end data processing pipeline. It adopts the concept of a mediated architecture, where a global schema is mapped to local schemas by means of the Local-as-View (LAV) approach. Quarry heavily relies on metadata, which are organized as a hypergraph to facilitate searching and the integration of new metadata. Quarry has been evaluated in an international project for the World Health Organization.
To achieve more efficient resource utilization and better aggregate performance in shared environments, where queries are concurrently submitted by multiple users, Multi-Query Optimization (MQO) techniques are adopted in paper (Michiardi et al. 2020). The proposed system extends the SparkSQL Catalyst optimizer to provide a general approach to MQO for distributed computing frameworks that support a relational API. The system demonstrates significant improvements in aggregate query execution times while staying within the memory budget given to the MQO problem.
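The core idea behind MQO is that concurrent queries often share sub-expressions, which can be evaluated once and reused. The toy evaluator below sketches this with relational plans represented as nested tuples; it is an illustration of the principle, not of the Catalyst-based system in the paper.

```python
CACHE = {}   # subplan -> materialized result (the shared-work store)
EVALS = []   # records which subplans were actually computed

def evaluate(plan, tables):
    """Evaluate a tuple-encoded plan, reusing any cached shared subplan."""
    if plan in CACHE:
        return CACHE[plan]
    op = plan[0]
    if op == "scan":
        result = tables[plan[1]]
    elif op == "filter":                       # toy predicate: column > 0
        rows = evaluate(plan[2], tables)
        result = [r for r in rows if r[plan[1]] > 0]
    EVALS.append(plan)
    CACHE[plan] = result
    return result

tables = {"t": [{"x": 1}, {"x": -1}]}
q1 = ("filter", "x", ("scan", "t"))
q2 = ("filter", "x", ("scan", "t"))   # identical plan submitted concurrently

evaluate(q1, tables)
evaluate(q2, tables)
# The scan and the filter each ran only once across both queries.
```

Detecting *partially* overlapping plans and deciding which shared results fit in a memory budget is where the actual optimization problem lies.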
Predictive analytics refers to a variety of statistical and analytical techniques used to develop models that predict future events or behaviors (Tsai et al. 2015). Classification models are common predictive models that aim at categorizing information based on historical data, whereas regression analysis is used for forecasting, time-series modelling, and finding the causal relationship between a dependent (target) variable and one or more independent (predictor) variables. Predictive analytics finds applications in a wide range of domains. Although a variety of predictive algorithms and models are available in the literature, applying these solutions in specific real domains always raises challenging questions related to aspects such as data distribution, data dimensionality, or the availability of a sufficient amount of labeled data for training a model.
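The contrast between the two model families can be made concrete with a minimal sketch: a simple least-squares regression forecasts a continuous value, and a trivial classifier assigns that value to a category. The data and the threshold are illustrative only.

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a + b*x (simple regression)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b          # intercept, slope

def classify(x, threshold):
    """A trivial one-feature classifier: label by thresholding."""
    return "high-demand" if x >= threshold else "low-demand"

# Regression: forecast the next value of a (noise-free) toy series.
a, b = fit_linear([1, 2, 3, 4], [10, 20, 30, 40])
forecast = a + b * 5               # -> 50.0

# Classification: categorize the forecast against a historical cutoff.
label = classify(forecast, threshold=35)   # -> "high-demand"
```

Real demand-forecasting models are of course far richer, but the division of labor is the same: regression produces the quantity, classification the category.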
In this context, paper (Gardino et al. 2020) proposes a method for predicting product demand in the fashion industry. The proposed prediction method, called multi-VIew Bridge Estimation (VIBE), takes advantage of the existence of multiple views on items, i.e., sets of homogeneous features. Such views compensate for missing data by learning the interactions (or dependencies) between common latent features. The authors show how their method reconstructs the missing views for products whose sales are unknown, with the goal of forecasting future sales from the reconstructed views.
Classification models are also used for sensor-based human activity recognition (HAR), i.e., to predict well-defined human activities from sensor data. However, HAR needs a sufficiently large amount of annotated data to build an accurate classification model. Since the performance of these models degrades when they face new samples coming from unseen distributions, several domain adaptation methods have been proposed to minimize the need for labeled data. To address this issue, paper (Prabono et al. 2020) proposes a novel two-phase autoencoder-based approach to domain adaptation for sensor-based HAR. The proposed approach learns a latent representation that minimizes the discrepancy between domains by reducing their statistical distance. The effectiveness of the approach has been tested in cross-domain sensor-based HAR.
As reported in Grohe (2019) and Marr (2019), the volume of unstructured data increases much faster than the volume of structured data, and text data constitute a substantial part of it. For text analytics and Natural Language Processing (NLP), text data are processed by complex pipelines, which typically include noise removal, case folding, resolving abbreviations, acronyms, and synonyms, stemming, lemmatization, stopword removal, and (optionally) tagging. However, several challenging research issues are still open in text mining. Although the adoption of NLP technologies brings attractive benefits to processing and understanding documents at large scale, the specific setting of each domain often prevents the adoption of general-purpose NLP solutions or the adaptation of domain-specific NLP pipelines. Moreover, to test the quality of the texts produced by such pipelines, one needs ground-truth data, i.e., benchmarks. Another open issue is how to efficiently store and access text data.
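Several of the preprocessing stages listed above (case folding, abbreviation expansion, stopword removal, stemming) can be sketched with standard-library code. The abbreviation map, stopword list, and suffix-stripping "stemmer" below are toy placeholders for what a real pipeline (e.g. one built on NLTK or spaCy) would provide.

```python
import re

ABBREVIATIONS = {"nlp": "natural language processing", "db": "database"}
STOPWORDS = {"the", "a", "an", "is", "are", "of", "and"}

def crude_stem(token):
    """Strip a few common English suffixes (not a real stemmer)."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Noise removal + case folding, abbreviation expansion, stopword
    removal, and crude stemming, in that order."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    tokens = []
    for tok in text.split():
        tokens.extend(ABBREVIATIONS.get(tok, tok).split())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("NLP pipelines are processing the documents!"))
```

The domain-adaptation problem the paragraph mentions shows up exactly in these components: each domain needs its own abbreviation map, stopword list, and often its own tokenizer.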
Paper (Bordino et al. 2020) focuses on designing a novel NLP-based solution for garnishment, a specific use case in the legal domain so far overlooked in the NLP literature. The proposed GarNLP framework automatically analyzes and annotates the various kinds of documents exchanged in a garnishment process, classifying documents into a predefined taxonomy of categories and automatically extracting information on the garnishment procedure from the text. This work was motivated by a request from a big pan-European commercial bank.
Paper (Truică et al. 2020) proposes a generic benchmark, called TextBenDS, for assessing the performance of text storage systems. At the conceptual level, the benchmark models text data (documents) as a cube. This model is then transformed into three different logical models, implemented in distributed Hive, Spark, and MongoDB. TextBenDS includes a number of aggregation queries of different complexities, selectivities, and selection patterns to compute term weighting schemes applied to extracting top-k keywords and documents. TextBenDS has been applied in practice to evaluate the performance of Hive, Spark, and MongoDB on a text corpus of 2,500,000 tweets.
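To make the workload concrete, the kind of aggregation such queries compute can be sketched as a TF-IDF weighting over a corpus followed by top-k keyword extraction. This is a minimal in-memory illustration of the computation pattern, not the benchmark's implementation, and the corpus is invented.

```python
import math
from collections import Counter

def top_k_keywords(docs, k=3):
    """Score terms by TF-IDF summed over all documents; return the top k."""
    n = len(docs)
    doc_tokens = [doc.lower().split() for doc in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter(t for tokens in doc_tokens for t in set(tokens))
    scores = Counter()
    for tokens in doc_tokens:
        tf = Counter(tokens)
        for term, count in tf.items():
            scores[term] += (count / len(tokens)) * math.log(n / df[term])
    return [term for term, _ in scores.most_common(k)]

docs = [
    "spark processes tweets",
    "hive stores tweets",
    "mongodb stores documents",
]
print(top_k_keywords(docs, k=2))
```

At benchmark scale the same computation becomes a group-by aggregation with joins between term counts and document frequencies, which is what stresses the storage systems differently.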
Social Media Mining
Social media include a set of tools, such as blogs, social networking sites, and forums, that enable communication and cooperation. They facilitate relationship forming between users of diverse backgrounds, resulting in a rich social structure and a valuable source of information to capture user perception and to support decision-making processes in a variety of scenarios (Kapoor et al. 2018). However, social media and their data represent complex ecosystems characterized by challenging issues from the data science perspective. Information is exchanged using heterogeneous data types (e.g., pictures, videos, text), thus requiring proper data processing techniques to turn social data into useful insights. Another remarkable aspect is the risk associated with the use of social media, due to malicious behavior such as bots, sock puppets, the creation and dissemination of fake news, Sybil attacks, and actors hiding behind multiple identities.
Social networks are analyzed as a valuable source of information to support business and management. One such application, addressed in paper (Vallejos et al. 2020), is intelligent transportation systems in large cities. ML and NLP techniques are combined in a novel approach, named Manwë, for detecting, interpreting, geo-locating, and disseminating traffic incidents reported in natural language on social networks. Manwë is complementary to traffic-specific social networks such as Waze, acquired by Google in 2013.
Social media are exploited to dig deep into the Internet of Things (IoT) scenario in paper (Ustek-Spilda et al. 2020). Nowadays, sensing technologies allow collecting and transmitting data for algorithmic processing and ML in various domains. However, concerns about the potential consequences of pervasive connectivity, along with issues of security, privacy, and trust, are abundant and still widely debated. The paper proposes a method, developed within the VIRT-EU project, that integrates social media data with data from qualitative research and network analysis. The authors aim at: (1) analyzing the main actors in IoT-related discussions in Europe, and (2) identifying geographical clusters and specific topics that emerge within IoT, as well as the matters of concern discussed by IoT technology developers, designers, and entrepreneurs on Twitter.
Privacy Issues in Distance Learning Solutions
Distance learning solutions are becoming increasingly popular, as they enrich traditional teaching and reach students spread over vast geographical areas, and they also represent a valuable alternative when face-to-face teaching is not feasible (for example, due to critical events such as the Covid-19 pandemic). In such systems, adequate monitoring of student engagement requires a considerable amount of computational resources and network capacity to process audiovisual, interaction, and physiological data of the audience in near-real time. In this context, critical privacy concerns arise, since the continuous tracking and centralized analysis of sensitive personal data may invade the privacy of remote students.
To address the aforementioned concerns, paper (Preuveneers et al. 2020) proposes a solution based on well-known and proven enabling technologies for decentralized data analytics as well as best practices for preserving privacy. A multi-modal engagement model is proposed that runs in the cloud and at the edge to scale easily with a growing number of participants. Federated processing of behavioral data and the application of secure multi-party computation enable privacy-enhanced analysis of student engagement.
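One classic building block of secure multi-party computation is additive secret sharing: each participant splits a private value into random shares whose sum reveals only the aggregate. The sketch below illustrates this building block in general; it is not the paper's protocol, and the "engagement scores" are invented.

```python
import random

MODULUS = 2 ** 31

def share(value, n_parties):
    """Split value into n_parties random shares summing to value mod MODULUS."""
    shares = [random.randrange(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares

scores = [70, 85, 90]                        # private per-student scores
all_shares = [share(s, 3) for s in scores]   # each student shares to 3 aggregators

# Each aggregator sums the one share it holds per student; no single
# aggregator ever sees a raw score.
partials = [sum(col) % MODULUS for col in zip(*all_shares)]
total = sum(partials) % MODULUS
print(total)  # equals sum(scores) = 245
```

The aggregate (here, total engagement) is recoverable, while any individual score stays hidden unless all aggregators collude.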