
1 Introduction

The evolution of the Internet of Things (IoT) has enabled the generation of massive volumes of data from devices such as sensors. This has led to a critical question: "How can this information be used significantly?" [1]. Processing this vast volume of data has been a complicated matter, and the underlying reason stems from the format of the data, more specifically the variety of data [2].

As a result, the speed at which data is collected has also created a critical need for a mechanism that can promptly process large amounts of data arriving at different rates and with extremely volatile sizes [2]. Since concepts such as the volume, variety and speed of data are being referred to, it is essential to highlight the concept of Big Data. Big Data can be defined as a volume of data available in different degrees of complexity, generated at different speeds and with different degrees of ambiguity, which cannot be processed using traditional technologies, processing methods, algorithms, or any solutions available on the market [3].

As data is collected daily at high speeds, it becomes imperative for organisations to be able to analyse it in real time to obtain crucial business information and consequently reach better decisions in good time [4]. Although several technologies exist that deal in isolation with the huge volume of data, with data processing and with real-time analysis, a mechanism is needed that can deal with all these problems together. The research question "How can the huge volumes of data generated be processed, regardless of the type of format they have?" is answered through the developed refresh mechanism, which allows data to be synchronised regardless of its source or typology. Furthermore, the loading of the dimensions into the MySQL database is carried out according to the type of Slowly Changing Dimension (SCD) to be applied. This means that the actions performed when loading a dimension into the database differ depending on the SCD type; one example of such an action is whether or not the history of changes that occur in the data sources is kept.

This document begins with this introduction, in which the objective and the problem associated with this article are presented. The background section follows, in which the areas of Data Science and Pervasive Computing are addressed and the results of a comparative analysis of existing solutions are included. After that, the proposed high-level architecture for the data refresh mechanism is presented, followed by an explanation of the developed mechanism. Finally, the results of this entire process, as well as an indication of future work, are presented.

2 Background

According to Loukides [5], the future belongs to companies that know how to collect and use data successfully. With that in mind, one of the fields that has become very important in this context is Data Science. It is a field associated with the collection, preparation, analysis, visualisation, management and preservation of large amounts of information [6, 7]. According to Steele et al. [8], the origin of Data Science lies in the technological advances that generate almost inconceivable volumes of data, and the rate at which new data is being produced will certainly not decrease. With so much data being generated, an increase in storage capacity becomes essential, which in turn requires greater sophistication in the analysis and use of data, which is the basis of Data Science [9]. As a result of this increase in data, the need for a faster and more conscious decision-making process arose. Based on this, it is essential to understand the concept of Pervasive Data Science; however, it is necessary first to understand the concept of Pervasive Computing (PC).

According to Kurkovsky [10], PC describes ubiquitous computing environments that provide access to information services anytime, anywhere, and at the same time make the presence of the system invisible to the user. Pervasive Computing can be defined as the idea that "computing equipment will grow smaller and gain more power; this would allow small devices to be ubiquitously and invisibly embedded in the everyday human surroundings and therefore provide easy and omnipresent access to a computing environment" [11]. According to Reddy [12], Pervasive Computing can also be called Ubiquitous Computing, and it deals with the idea of making "computing power" available anywhere, anytime and in a uniform way, so that it can be exploited to meet society's challenges. Pervasive Data Science (PDS) can be defined as the intersection of the two areas of study already mentioned, namely Pervasive Computing and Data Science. PDS is characterised by a focus on the collection, analysis and use of data to achieve a ubiquitous computing vision [13].

All data is stored in databases, as mentioned above; however, it is important to understand that this storage is supported by a multidimensional model. A multidimensional model applies several techniques or principles to denormalise the structure of the database in order to create schemas that support the decision-making process. There are two types of tables in multidimensional models, namely fact and dimension tables [14]. A dimension table contains a single primary key, and this key is incorporated as a foreign key in the fact tables associated with that same dimension table. In turn, a fact table always contains foreign keys coming from each of the dimension tables to which it is associated [15].
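As a minimal illustration of these two table types, the sketch below creates one hypothetical dimension table and one fact table that references it; the table and column names (dim_sensor, fact_reading, and so on) are illustrative and do not come from the model described later in this article.

```python
# Minimal star-schema sketch: one dimension table and one fact table.
# Table and column names are illustrative only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension table: a single surrogate primary key plus descriptive attributes.
    CREATE TABLE dim_sensor (
        sensor_id   INTEGER PRIMARY KEY,
        sensor_name TEXT,
        location    TEXT
    );

    -- Fact table: a foreign key for every associated dimension plus the measures.
    CREATE TABLE fact_reading (
        reading_id  INTEGER PRIMARY KEY,
        sensor_id   INTEGER REFERENCES dim_sensor(sensor_id),
        reading_ts  TEXT,
        value       REAL
    );
""")
conn.close()
```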

2.1 Modular Solution for Data Science

According to Fernandes, Portela, & Santos [16], due to the technological progress of recent years and the actual implementation of the Internet of Things (IoT) concept, a considerable amount of data can be observed being collected every minute. This brings with it the problem of how such large amounts of data can be processed to obtain relevant and timely information. The work done by Fernandes et al. [16] focused on a possible solution to this problem, one that should be able to deal with all types of data (structured, semi-structured and unstructured), retrieve information and display it in a comprehensive way to end-users. The solution presented by Fernandes et al. [16] addresses the still existing gap in dealing with unstructured data, which originates from a NoSQL (Not Only SQL) document repository, while performing timely analysis.

2.2 Semi-structured Data

With technological developments, large amounts of data are being stored from various sources at an enormous speed. The data from these sources is sometimes semi-structured, which has exposed limitations in the relational databases used to manage structured data [17]. In this work, the authors Lin, Jun, Hongyan, Zhongwei, & Zhanfang [17] tried to solve this problem, namely that traditional technologies are not suitable for processing unstructured data. To this end, they developed a method for extracting rules from semi-structured data, with the aim of working on them to gain knowledge.

2.3 Real-Time Consultation

With the rapid pace of technological advances in smart device technologies, the Internet of Things (IoT) domain has received significant attention. These technological advances have brought the vision of ubiquitous computing closer [18]. Most applications use clouds and Big Data technologies to deal with scalability. However, the responsiveness of IoT applications is rather limited due to research-based queries, so a problem has arisen: How to process several continuous queries in real-time and how to provide answers to users on time? To solve the problem identified, the authors Gokalp et al. [18] proposed a distributed infrastructure, which is centrally managed and uses state-of-the-art Big Data technologies, for real-time data processing, storage and data sharing.

2.4 Analysis of Related Work

After analysing the work related to this article, it should be noted that there are many works in this area, but only one paper was found that tries to solve the problem associated with the retrieval of information from unstructured or semi-structured data, as well as work that tries to solve the problem of real-time data analysis when using data that does not have a structured format. The proposal of this article is a mechanism that can deal with the variety of data and subsequently process it, regardless of its format (structured, semi-structured and unstructured). Of the works analysed, the solution that most closely resembles the objective of this article is the one by Fernandes et al. [16]. This is because the solution previously defined by those authors addresses the same challenges. The remaining works analysed address the same problems presented in this article and point out solutions to the existing issues; however, they only solve these problems in isolation and not as a whole. In this sense, the development of this refresh mechanism presents great potential in this area because it shows a solution capable of dealing with the variety of data and processing it in real time.

3 Architecture

The proposed high-level architecture for the data refresh mechanism is presented in this section to provide a complete explanation of how its different phases add value to the final solution. In addition, understanding the existing architecture allows a better understanding of the mechanism developed.

In Fig. 1, the architecture and the sequence implicit in its different phases are represented, where four layers can be identified: the database(s) that feed the system, the API that processes the data to retrieve information, and the staging area and the Data Warehouse that store the data.

Fig. 1. Solution architecture

The layers represented in Fig. 1 are the following:

a) Data Sources - This layer represents the databases and other sources that provide data for the entire system. These data sources can be SQL or NoSQL databases, JSON files, CSV files, among others. Consequently, they may provide structured, semi-structured or unstructured data. The data sources used consist of NoSQL databases, as can be seen in Fig. 1.

b) RESTful API - In this layer data processing takes place, where the raw data stored in the data sources is transformed into relevant information. It is at this stage that mechanisms are developed to deal with the variety of data in order to make it useful and feasible. To do so, the data must first be extracted and then processed according to previously detected faults, while an extensive analysis of the data is carried out, adequately addressing each situation and taking the context of the data into account. After that, it is necessary to retrieve information from the processed data and to seek the best way to obtain this relevant information.

c) Staging Area - This layer was created to solve the problems of speed and storage; it is at this stage that the multidimensional model is populated, in the final stage of the API phase.

d) Data Warehouse - This layer was also created to solve the problems of speed and storage, and it is at this stage that the multidimensional model is completed. The Data Warehouse stores the data that has been loaded into the staging area.

It should be noted that both the staging area and the Data Warehouse consist of MySQL databases that store the data. The staging area stores the data in an intermediate phase, before the update, and the Data Warehouse receives the updated data from the previous step, i.e., after all the data has been loaded into the staging area.
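To make the interplay between these layers more concrete, the following sketch illustrates a highly simplified pass through the pipeline: documents are extracted from a NoSQL source, flattened in the API layer and loaded into the staging area. The connection strings, collection and table names are placeholders rather than the actual configuration of the solution.

```python
# Simplified pass through the layers: NoSQL data source -> API-side processing
# -> MySQL staging area (the Data Warehouse refresh is described in Sect. 4).
# Connection strings, collection and table names are placeholders.
import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

mongo = MongoClient("mongodb://localhost:27017")                        # data sources
staging = create_engine("mysql+pymysql://user:pwd@localhost/staging")   # staging area

# Extract the raw documents from a source collection.
raw_docs = list(mongo["source_db"]["readings"].find())

# Transform: flatten the semi-structured documents into a tabular dataframe
# (the real API layer also handles faults, context and information retrieval).
df = pd.json_normalize(raw_docs).drop(columns=["_id"], errors="ignore")

# Load the processed rows into the corresponding staging-area table.
df.to_sql("stg_fact_reading", staging, if_exists="append", index=False)
```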

4 Data Refreshing Mechanism

This section explains the need for the development of a mechanism to carry out the refresh of the Data Warehouse. Normally, after the whole extraction and transformation process, the data is loaded directly into the Data Warehouse; however, this approach is not advantageous because, when the tables that make up the Data Warehouse contain a very high amount of data, loading the data into the MySQL database becomes a very slow process with a consequently high refresh time. Besides, there is a need to provide real-time answers, which cannot be achieved with a mechanism built in this way.

The Data Warehouse, as already mentioned, represents the layer responsible for filling in the multidimensional model. A DW is characterised as an integrated, non-volatile and time-variant data set. The variation over time refers to the changes that occur, during a given period, in the operational systems that are the source of data in a DW extraction process. To address the fact that data can change over time, Ralph Kimball introduced the concept of Slowly Changing Dimensions (SCD) [19].

Table 1 provides a brief description of the four types of Slowly Changing Dimension (SCD) applied to the dimensions when developing the data refresh mechanism. SCD type 0 occurs when data is stored in the dimension tables exactly as it was extracted the first time and is never modified over time, even if it changes in the source database. SCD type 1 occurs when data from the data sources changes and, in this case, the old value is simply replaced, thus losing the historical data. SCD type 2 allows the storage of unlimited changes, inserting new rows in the dimension table whenever changes occur. Finally, SCD type 3 adds a new column with the new value to the table whenever a data change occurs.

Table 1. Types of Slowly Changing Dimensions applied to the mechanism.
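To make the behaviours in Table 1 more tangible, the sketch below shows the kind of SQL statement each SCD type would issue when a changed record arrives for a dimension row; the dimension table (dim_sensor), its natural key and its attributes are hypothetical names used only for illustration.

```python
# Sketch of what each SCD type does when a changed record arrives for a dimension row.
# dim_sensor, natural_key and location are hypothetical names used for illustration.
def apply_scd(cursor, scd_type, natural_key, new_location):
    if scd_type == 0:
        # Type 0: keep the value stored on the first extraction; ignore the change.
        return
    if scd_type == 1:
        # Type 1: replace the old value, losing the historical data.
        cursor.execute(
            "UPDATE dim_sensor SET location = %s WHERE natural_key = %s",
            (new_location, natural_key))
    elif scd_type == 2:
        # Type 2: store unlimited changes by inserting a new row for every change.
        cursor.execute(
            "INSERT INTO dim_sensor (natural_key, location) VALUES (%s, %s)",
            (natural_key, new_location))
    elif scd_type == 3:
        # Type 3: record the new value in an additional column of the same row.
        cursor.execute(
            "UPDATE dim_sensor SET new_location = %s WHERE natural_key = %s",
            (new_location, natural_key))
```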

After understanding the different types of SCD, the first step to solve this problem was the creation of a staging area, which corresponds to a MySQL database and contains the same tables as the Data Warehouse. Now, after the whole process of extraction and transformation, the data is loaded directly to the staging area and not straight to the Data Warehouse. The next step was the addition of a new column, called last_date, to all dimensions present in both the staging area and the Data Warehouse; it stores the date of the last modification of all rows that constitute the different tables, in DateTime format (YYYY-MM-DD HH:MI:SS). Considering the architecture that supports the refresh mechanism, it was divided into two phases: the delta refresh and the Data Warehouse refresh. The delta refresh runs from the data sources to the staging area, and the Data Warehouse refresh runs from the staging area to the DW.
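A statement of the following form, repeated for every dimension of both MySQL databases, is enough to add such a column; the dimension names listed here are placeholders, not the actual dimensions of the model.

```python
# Adding the last_date column to every dimension of the staging area and of the
# Data Warehouse; the dimension names listed here are placeholders.
DIMENSIONS = ["dim_sensor", "dim_time", "dim_location"]

def add_last_date_column(cursor):
    for dim in DIMENSIONS:
        # DATETIME keeps the modification date in the YYYY-MM-DD HH:MI:SS format.
        cursor.execute(f"ALTER TABLE {dim} ADD COLUMN last_date DATETIME")
```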

4.1 Delta Refreshment (Data Sources → Staging Area)

The delta refresh was developed to obtain, as the name indicates, the refresh of the staging area, which consists of an intermediate MySQL database. The data from the different data sources passes through the layer responsible for all data processing, where the raw data stored in the data sources is transformed into relevant information and then loaded into the staging area. Therefore, this refresh runs from the different MongoDB data sources to the staging area. It should be noted that MongoDB is the NoSQL database technology chosen. The following list explains the steps required to refresh the data in the staging area:

1. Verify the existence of new elements inserted in the data sources. For this purpose, in the RESTful API, the date of the last element inserted in the MongoDB data source, obtained through its ObjectID, is compared with the most recent date of the corresponding dimension in the staging area. If the date from the data source is more recent than the one in the MySQL database, the new rows are inserted into the respective dimension in the MySQL database (a condensed sketch of these steps is shown after this list);

2. Creation of a field containing the update date in some objects of the different MongoDB data sources, which by default is initially filled with the null value;

3. Check, for each dimension, whether there are objects with an update date in the MongoDB collections. If so, a dataframe with all the necessary fields is created, only for these rows;

4. For dimensions that are loaded with a Slowly Changing Dimension (SCD) type other than zero, a search for the natural key is carried out in the staging area. If the natural key from the MongoDB data source already exists in the intermediate database, the SCD function is called to apply the necessary changes to the dimension, and it returns the IDs of the rows that have changed. Otherwise, if the natural key does not exist, the new row is added to the dimension in the MySQL database and the ID of the newly inserted row is returned;

5. For dimensions with SCD type zero or that are not loaded according to any SCD type, a validation is made as to whether there is any row in the staging area with the same data set as the one from the data source. If the data set already exists in the MySQL database, its ID is returned; otherwise, the loading function of the respective dimension is called to insert the new row, and the ID is returned in this case as well;

6. The initial dataframe now contains the columns with the IDs of each dimension where data updates occurred. The columns for the calculated facts are then added to the dataframe and filled in;

7. All dataframe rows are traversed and inserted into the fact table.
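A condensed sketch of the steps above is given below; the collection, table and field names (source_db, dim_sensor, update_date, natural_key) are placeholders, and the SCD and loading routines of each dimension are only indicated.

```python
# Condensed sketch of the delta refresh (data sources -> staging area).
# Collection, table and field names are placeholders; the SCD and loading
# routines of each dimension are only indicated.
import pandas as pd
from pymongo import MongoClient
from sqlalchemy import create_engine

mongo = MongoClient("mongodb://localhost:27017")["source_db"]
staging = create_engine("mysql+pymysql://user:pwd@localhost/staging")


def refresh_dimension(collection_name, dim_table):
    collection = mongo[collection_name]

    # Step 1: compare the creation date of the newest MongoDB document, obtained
    # from its ObjectId, with the most recent last_date of the staging dimension.
    newest_doc = collection.find_one(sort=[("_id", -1)])
    newest_source_date = newest_doc["_id"].generation_time.replace(tzinfo=None)
    newest_staged_date = pd.read_sql(
        f"SELECT MAX(last_date) AS last_date FROM {dim_table}", staging
    )["last_date"].iloc[0]

    if pd.isna(newest_staged_date) or newest_source_date > newest_staged_date:
        # New documents exist at the source: flatten them and insert the new rows
        # into the corresponding staging dimension.
        new_docs = pd.json_normalize(list(collection.find()))
        # ... keep only the documents newer than newest_staged_date and load them ...

    # Steps 3-5: documents with a non-null update_date were changed at the source;
    # each one is matched by its natural key and handled by the SCD type of the
    # dimension, which returns the surrogate ID used to fill the fact table (6-7).
    changed = pd.json_normalize(list(collection.find({"update_date": {"$ne": None}})))
    for _, row in changed.iterrows():
        ...  # look up row["natural_key"] in the staging area and call the SCD routine
```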

4.2 Refreshment of the Data Warehouse

After explaining the mechanism for refreshing the staging area, it becomes important to understand how the Data Warehouse is refreshed. The data loaded in the Data Warehouse is only updated when any of the dimensions of the staging area changes, thus maintaining the reliability and accessibility of the data in it. This verification is carried out through the attribute created and mentioned above, called last_date, which stores the date of the last modification of all the rows that make up the different tables in both the intermediate database and the Data Warehouse.

This way, if the modification date saved in the last_date attribute of some dimension present in the staging area is more recent than the modification date of the respective dimension in the Data Warehouse, that dimension is loaded into the Data Warehouse with the data present in the staging area, thus ensuring that only the data of the dimensions that have changed is refreshed, reducing the refresh time. This has made it possible to overcome some of the problems mentioned above, namely the fact that the data is always available and correctly loaded into the Data Warehouse, as well as the decrease in the data refresh time. Below is a representation of the pseudocode developed for the Data Warehouse refresh mechanism.

Figure a. Pseudocode for the Data Warehouse refresh mechanism
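A minimal Python sketch consistent with this logic, assuming one MySQL connection per database and placeholder dimension names, is the following:

```python
# Minimal sketch of the Data Warehouse refresh: a dimension is only reloaded when
# its staging-area copy holds a more recent last_date than the copy in the DW.
# Connection strings and dimension names are placeholders.
import pandas as pd
from sqlalchemy import create_engine

staging = create_engine("mysql+pymysql://user:pwd@localhost/staging")
warehouse = create_engine("mysql+pymysql://user:pwd@localhost/warehouse")

DIMENSIONS = ["dim_sensor", "dim_time", "dim_location"]


def newest_change(engine, table):
    return pd.read_sql(f"SELECT MAX(last_date) AS d FROM {table}", engine)["d"].iloc[0]


def refresh_warehouse():
    for dim in DIMENSIONS:
        staged = newest_change(staging, dim)
        stored = newest_change(warehouse, dim)
        if pd.isna(staged):
            continue  # nothing has been loaded into the staging area for this dimension
        if pd.isna(stored) or staged > stored:
            # Only the dimensions that changed are copied, reducing the refresh time;
            # if_exists="replace" is used here purely for brevity.
            rows = pd.read_sql(f"SELECT * FROM {dim}", staging)
            rows.to_sql(dim, warehouse, if_exists="replace", index=False)
```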

4.3 Discussion

The mechanism of data refresh by deltas is quite advantageous as far as the temporal aspect is concerned. It checks for the existence of new elements in the collections, as well as for alterations to the objects that make up those elements, so it only processes the ones that have undergone some kind of change. However, if this verification were done individually for each dimension, the dimension would be loaded without this being necessary. With the Data Warehouse's data refresh mechanism, it is possible to overcome some problems, such as ensuring that the data is always available and correctly loaded into the Data Warehouse, as well as decreasing the refresh time. Still, the process requires the stability of several components for proper functioning.

5 Conclusion

The main contribution of this article is to answer the main research question "How can the huge volumes of data generated be processed, regardless of the type of format they have and the speed at which they are received?" To answer this research question, the main objective was defined as the development of a refresh mechanism that allows the data to be synchronised, regardless of its source or typology.

The mechanism developed can be applied to different types of data sources and covers the still existing lack of processing of unstructured data, in this particular case NoSQL documents. The mechanism for refreshing the data in the staging area is quite advantageous as far as the temporal aspect is concerned, since the time it takes for the mechanism to go through all the dimensions and subsequently make the changes for a specific dimension is 176.2938 s. By contrast, if this verification is done individually for only five of the eleven dimensions present in the MySQL database, it takes 48552.02875 s. Converting the time it takes to go through these five dimensions individually gives a total of approximately 13 h; dividing this total by the five dimensions, each dimension takes, on average, approximately 2.6 h. The time it takes for the mechanism to go through all the dimensions and make the changes for a specific dimension does not total one hour, which gives a time difference of approximately 96%.

At an early stage, the time taken for the dimensions and fact tables to be loaded into the Data Warehouse was measured as a point of comparison for the final times obtained after the mechanism had been developed. In the future, new metrics will be included.

Some improvements are already being considered for the future, for example: creating new perspectives with different types of dashboards and exploring augmented reality mechanisms for the representation of indicators.