1 Introduction

Data Wrangling (DW) is widely considered a tedious and complex process, with recent industry reports indicating that data scientists spend around 38% of their time on data preparation and cleansing activities (Anaconda, 2022). DW describes the activities associated with transforming raw data into an asset ready for analysis, including data profiling, matching, mapping, format transformation, and repair (Fernandes et al., 2023). The practice of DW has been made accessible to professionals lacking data science and engineering skills by the availability of self-service tools that offer flexibility for custom, ad-hoc, and quick DW solutions, often through user-friendly Graphical User Interfaces (GUIs) (Hellerstein et al., 2018; Kandel et al., 2011).

While DW tools share common functionality, there is a lack of standards in the conceptualization, representation, and semantics of the DW operations they support (Hameed & Naumann, 2020), placing a burden on users to discover suitable operations and understand how to use them in practice. The steep learning curve often involves a time-consuming trial-and-error process to develop a DW solution and can be exacerbated by low levels of skill and experience. As a result, users often build custom data preparation solutions from scratch for similar problems, rather than reusing existing ones, because finding and understanding existing solution designs is difficult. Recent literature on the automation of data wrangling provides clues to the evidence underpinning these reuse difficulties, stating that “data wrangling remains hard to automate, because it often involves special cases that require human insight” (Petricek et al., 2022). Examples of human insight include identifying common patterns in solutions with similar functionality, understanding idiosyncratic issues with data, and distinguishing outliers from noise. Tackling these issues in data preparation scripts often requires tailor-made code, which limits reuse opportunities. Moreover, our investigation revealed that potentially reusable DW scripts available in popular repositories often suffer from redundancies and inefficiencies due to poor choices in operation selection.

Finding ways to help users build good-quality DW pipelines without burdening them is essential to adopting a systematic approach to data preparation/wrangling. This raises questions such as:

  1. How to identify DW design patterns in open access repositories to enable the reuse of existing data preparation solutions?

  2. How to assess if effective solutions to common data preparation/DW problems can be developed through the use of DW design patterns?

  3. How to organize and present DW design patterns to facilitate their use in the data analytics lifecycle?

In search of answers to these questions, we conducted an investigation into existing DW workflow pipelines. This enabled us to identify how analysts develop DW solutions in practice. More specifically, we articulate the following contributions in this paper:

  • The discovery of patterns of DW activities/operations from Web-available repositories of DW pipelines.

  • A conceptualization of common DW operations included in DW design patterns and the creation of a DW Taxonomy, presented as a Dictionary of Operations and Patterns.

  • The specification of a DW Design Patterns Handbook and an evaluation of its usefulness.

The remainder of this paper is organized as follows: Section 2 presents a brief background and related work. Section 3 describes an exploration of DW in workflow environments and articulates the notion of DW constructs and patterns in data analysis workflows. The systematic pattern mining approach used to identify DW patterns is presented in Section 4 with its findings and discussion in Section 5. Section 6 presents samples of the DW design pattern specifications included in our DW Handbook and an evaluation of the usefulness of the Handbook. Section 7 summarizes our research contributions, limitations, and future work.

2 Background and Related Work

At a time when data is often analogized to gold, diamonds, and oil (Van der Aalst, 2014), DW is a critical process for determining the quality of the outcome of data analysis tasks (Muller et al., 2019). To facilitate DW, a variety of tools have been developed that vary in their capabilities and the level of technical skill required from users (Hameed & Naumann, 2020; Convertino & Echenique, 2017) (see Appendix A for a comparison of DW tools). These tools range from programming languages, such as R and Python, which require a high level of technical expertise, to user-friendly visual tools that require less technical knowledge, a category into which Talend Data Preparation (Talend, 2023) and Trifacta/Alteryx Designer Cloud (Trifacta, 2023) fall. Popular tools in the first category also include Spark-based DW solutions, such as Databricks (2023), which generally provide users with an environment for the interactive development of DW recipes, often through notebooks into which Python code can be written. However, these solutions largely lack guidance through the difficult process of DW, guidance that could not only ease and expedite the DW process but also inform users on how to build “good” DW recipes.

Both categories of tools are, in fact, limited in their support for mechanisms that facilitate the reuse of solution designs (e.g., design patterns), as our assessment of the tools revealed: they lack high-level descriptions of the rationale behind design decisions, of their association with data properties, and of the purposes of the target analytical task. This is a shortcoming that software engineers have worked hard to overcome through the design patterns found in software design, i.e., as a way of standardizing and passing down the knowledge and experience of expert designers to non-experts, in the form of readily available design constructs for reuse (Gamma et al., 1994). While the factors motivating the creation of design patterns in SE are similar to those in DW, there is a dearth of research on identifying DW patterns.

The semi-automation of DW via Machine Learning (ML) techniques is a topic of growing interest (Petricek et al., 2022; Jaimovitch-López et al., 2022), with works such as He et al. (2018) showing that suggestions of applicable operations can be made based on user-provided examples, and Sutton et al. (2018), which performs data transformations inspired by the UNIX diff command. Although tool functionality for direct and fully automated reuse of existing solutions is not yet supported, semi-automated approaches applying ML to narrow DW tasks, such as record matching, have been successfully developed. Our work complements previous research by focusing on identifying and documenting design patterns in available DW pipelines, addressing the problem of finding and describing constructs that can make DW more systematic and structured, paving the way to collaboration, standardization, and more extensive automation. This will, ultimately, make DW a more disciplined and efficient process.

Fig. 1 Interpretation of Tosta et al. (2015) approach for DW

2.1 Workflow Mining and Pattern Discovery

Workflow mining focuses on business process mining from event logs for the purpose of process redesign and optimization (Van der Aalst et al., 2003; Hammori et al., 2006). Event-log-based workflow mining, however, has limitations when applied to mining patterns in data pipelines. Previous work on identifying patterns in workflows modeled workflows logically as graphs to enable their mining; graphs are one of the most frequently used representations for big data processing (Darmont et al., 2022). Tosta et al. (2015) apply a path-based approach to identifying patterns in scientific workflows (Fig. 1), while Theodorou et al. (2017) use Frequent Subgraph Mining (FSM) to identify frequent patterns of operations in ETL workflows. Whilst these approaches are applicable to DW workflows, the approach of Tosta et al. (2015) would identify a set of short frequent paths of operations, which may lead to misleading conclusions, e.g., tasks C, D, and E in Fig. 1 appear 4 times (once in each path) as opposed to once in the original representation. The approach of Theodorou et al. (2017) was successful in identifying patterns in ETL; however, the repository used was created from well-formed, standard ETL workflows, which have a limited range of data transformation operations compared to DW. To take advantage of previous work on pattern identification designed for ETL processes, similarities between DW and ETL were explored, i.e., shared subsets of activities, resulting in our adaptation of existing ETL-based approaches for our research purposes (described later in this paper). Nonetheless, the key distinctions between ETL and DW are outlined below, highlighting why existing ETL mining approaches cannot be directly applied to DW pipelines.

  • ETL’s main focus is on operations that perform data integration from multiple data sources into a unified schema, and data storage in a consolidated data repository (e.g., data warehouse) whilst DW involves tailored data transformations to prepare a dataset for an analytical task such as machine learning-based prediction.

  • ETL involves a smaller subset of operations/data transformations applied in a quasi-sequential structure, whilst DW involves a wider range of operations with potentially iterative cycles and interactive tasks (e.g., visual data profiling).

  • ETL is typically performed using specialized software tools (e.g., Talend and Oracle Data Integrator) that automate the majority of the ETL process steps, whilst DW combines GUI-based tools (such as Trifacta) and programming languages such as Python and R to code user-defined functions.

These differences give rise to patterns for solving data problems faced by analysts that are not traditionally common in ETL workflows; the similarities between the two processes and their required outcomes, however, are the reasons for our adaptation of ETL-based approaches for our research purposes.

Table 1 Summary of the components of the proposed design principle (DPR) schema by Gregor et al. (2020)
Table 2 Summary of the framework of the proposed minimum reusability evaluation of design principles (DPRs) by Iivari et al. (2021)

2.2 Design Patterns Specification

Design Patterns (DPs) are a form of design knowledge, used to capture and communicate design experience and expertise in solving a problem in a certain context (Riehle & Züllighoven, 1996). Design patterns can be expressed in multiple ways (refer to Riehle and Züllighoven (1996) for details), each with a particular specification template. For patterns to be effective and helpful in the reuse of DW solutions, they need to be formulated, described, and communicated to users and developers in the domain of self-service data preparation. Design patterns can be abstracted in the form of more general design principles, which provide general guidelines on how patterns can be applied in various scenarios. Design Principles (DPRs) support the capture and communication of design knowledge by abstracting from implementation details and providing generalized solution descriptions. This facilitates the development of a more comprehensive understanding of different designs (Chandra et al., 2015). The work by Gregor et al. (2020) presents three main categories of DPRs, as follows: 1) DPRs about user activity, which state what a designed artifact should allow the user to do; 2) DPRs about an artifact, which state what features the designed artifact should have; and 3) DPRs about user activity and an artifact, which combine features from both 1) and 2), describing the characteristics that an artifact should have and what the user should be able to do with it.

In the research reported in this paper, we adopt the notion of a Design Principle as a specification template for the abstract/conceptual representation and communication of Design Patterns. We use the DPR schema/template proposed by Gregor et al. (2020) to organize our Design Pattern Handbook, with components such as the DPR’s aim, implementer, user, context, mechanisms, enactors, and rationale, which are summarized in Table 1. We have adopted this DPR template due to its generality and completeness, which make it suitable for specifying DW Design Patterns.

Iivari et al. (2021) argue that one of the weaknesses of Design Science Research is the evaluation of DPRs. In their view, DPRs are often stated without proper evaluation, or with incomplete evaluation that only addresses certain criteria. Alternatively, they may be evaluated based on the IT product that results from their application. To address this issue, the authors propose a reusability evaluation for DPRs based on five criteria: accessibility, importance, novelty and insightfulness, actability and guidance, and effectiveness, summarized in Table 2. It is worth noting that, for each criterion, the authors suggest a set of questions, which we adapt for evaluating our proposed DW Design Patterns.

2.3 Data Wrangling in a Workflow Setting

Workflow management systems are usually adopted for their orchestration capabilities, and the visual representation available in some of these tools makes a workflow easier to inspect, follow, and, potentially, reuse. KNIME (Berthold et al., 2007; Knime, 2023a), in particular, covers a wide range of data science and DW functionality and is aimed at multiple data analysis applications. KNIME provides a visual interface for designing and running workflows, in which operation nodes are made available to the user through a searchable, drag-and-drop tree structure, and supplies documentation for each of the included nodes. Additionally, users can access a number of example workflows created to demonstrate different use cases, available through a searchable interface. Due to its ease of use and rich functionality, large repositories of KNIME workflows are available for a variety of domain applications. These workflows, generated by users with varying levels of experience, were used in this study. Our chosen approach is not limited to KNIME; however, we opted for KNIME because it offers various built-in operations covering a wide range of DW functionality, and several pre-built data transformation workflow pipelines that can be mined for pattern discovery.

2.4 Terminology Used in this Paper

In this research, several terms are used that require clarification regarding their contextual use; they are summarized in Table 3. Generally, a workflow “can be thought of as a set of instructions that describe how a process should be performed, including the order in which tasks are to be completed” (Coalition, 2023). In the context of KNIME, when we refer to a workflow, we mean a workflow design document; when the design document includes a single sequence of activities, we use the term pipeline to refer to it.

Table 3 A Glossary of terms used in this research
Table 4 Number of KNIME workflow files from each repository

A workflow activity is typically the smallest unit of work in a workflow. KNIME refers to this as a node, while other tools use the term operation for the same concept. In this paper, in addition to the terms DW activity and DW operation, we may also adopt the general term DW construct to refer to these units of work. Constructs represent DW functionality. A specific DW activity could be implemented in different tools using a single DW construct or multiple DW constructs.

3 Workflows, DW Constructs and Patterns

3.1 Selection of Repositories and DW Pattern Exploration

Despite the availability of numerous repositories of data manipulation workflows, the selection was narrowed down to the workflows found in De Roure et al. (2009); NodePit (2023); Knime (2023b), shown in Table 4, due to their focus on DW. Following a thorough check, the 76 MyExperiment workflows (De Roure et al., 2009) were selected for an initial, mostly manual, exploration, due to their high levels of completeness and complexity. The initial exploration yielded the following observations:

Table 5 Application of general flow control patterns in DW workflows
Table 6 DW workflow activities classification

a) The KNIME workflows use over 4,500 KNIME nodes, with each workflow comprising multiple DW activities, including data structuring, formatting, manipulation, and integration.

b) There is a variety of input file types, creating a requirement for complex data transformations from non-tabular formats into tables, a requirement imposed by most of the functionality within KNIME and similar tools.

c) There is a considerable number of community-created/custom nodes, accounting for 10% of the nodes. This is attributed to a number of reasons, including limitations in the available functionality for data formatting, handling of web service calls, data comparison for cleaning purposes, and data enrichment through integration, as well as difficulties understanding the tool’s functionality and the workflows composed of its built-in nodes; the latter is largely due to a lack of detailed documentation, as well as of clarity and uniqueness in function naming schemes.

d) DW activities are interleaved with domain-specific and data analysis activities and are, at times, placed at the end of the pipeline, for instance, when non-tabular data is transformed into a table for output.

e) The workflow control patterns defined by Van der Aalst et al. (2003) were present in all workflows. Table 5 provides adaptations of the general definition of each of these patterns, taking DW processes as the main context.

f) A set of recurring DW activities and associated flow patterns were identified and extracted from the workflows; examples include the Join of two data sets, the application of a Filter over the records of a table, and the use of a For-loop to repeat the application of a given operation. Each such activity and flow pattern was named according to its role in the workflows, as suggested in the first column of Table 7, such that different types of activities emerged. A classification of these activity types, based on the interrelationship between activity input and output and taking the classification of ETL activities by Vassiliadis et al. (2009) as an example, is described in Table 6.

g) Redundancies and other forms of inefficiency are frequently found, as illustrated in the following examples and in the sketch after the list:

  • Repeated use of nodes, such as ‘row filter’, where a single ‘rule-based row filter’ could be used, resulting in computational overhead due to multiple full data scans.

  • Employment of a ‘split row’ node in a single-output pipeline, where a ‘row filter’ node could be used.

  • Removal of unneeded attributes at the end of a pipeline that contains computationally expensive operations, e.g., data aggregation and integration; in such cases, computational costs could be reduced by removing these attributes early. It is worth pointing out that computational cost reduction becomes crucial when financial costs are incurred from renting cloud resources to orchestrate a pipeline.
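To make these inefficiencies concrete, the following pandas sketch contrasts the two styles; the files, column names, and predicates are hypothetical and serve only to illustrate the single-pass combined filter and the early removal of unneeded attributes.

```python
import pandas as pd

# Hypothetical inputs: an 'orders' table and a large 'customers' table to join.
orders = pd.read_csv("orders.csv")        # assumed file, for illustration only
customers = pd.read_csv("customers.csv")  # assumed file, for illustration only

# Inefficient: two separate 'row filter' passes, late attribute removal after the join.
step1 = orders[orders["status"] == "shipped"]       # full scan 1
step2 = step1[step1["amount"] > 100]                # full scan 2
joined = step2.merge(customers, on="customer_id")   # join carries every column
result_slow = joined[["customer_id", "amount", "region"]]

# Better: one combined rule-based filter and early removal of unneeded
# attributes, so the expensive join processes fewer rows and narrower tables.
filtered = orders.loc[
    (orders["status"] == "shipped") & (orders["amount"] > 100),
    ["customer_id", "amount"],                      # project early
]
result_fast = filtered.merge(
    customers[["customer_id", "region"]], on="customer_id"
)
```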

Table 7 Data wrangling workflow activities and associated flow patterns mapped into general workflow patterns

3.2 Mapping Workflows and Abstracting DW Activities

Based on the common workflow control patterns defined in Table 5, and using the DW workflow activities classification shown in Table 6, a mapping between the recurring DW activities and associated flow patterns, extracted from the selected workflows, and the common workflow control patterns is drawn, which is presented in Table 7. This mapping associates each activity and flow pattern with its main role in DW workflows, abstracted from any construct representation and implementation details. Note that the activities act as logical models, each representative of a greater number of constructs that can be utilized in DW; for example, the Create Attribute activity can be performed using multiple constructs that perform this same function, such as merge attributes or split attribute, found in KNIME. It is also noteworthy that the most prevalent workflow pattern in the context of DW appears to be the sequence pattern.

Fig. 2 The conceptual model of a data wrangling pipeline

Fig. 3 The mapping of DW activities into stages in a pipeline and the simplified version of the transformed pipeline

Fig. 4 The extended conceptual model of a data wrangling pipeline, including how patterns relate to it

3.3 Conceptual DW Constructs and DW Patterns

Constructs that perform DW activities can be combined in various permutations within a data pipeline. These permutations accomplish specific functionalities that may or may not be achievable by other permutations, making them unique structures or patterns. Given the different structures that can be formed using DW constructs and their permutations, as well as the type and function of each construct, these structures are considered candidate DW patterns. However, to qualify as a DW pattern, a structure must occur frequently, consist solely of DW constructs, and have distinguishing characteristics that enable its classification.

A DW construct c is an atomic processing unit belonging to a class of activity ac, responsible for a single transformation over its input(s), with distinct processing semantics dps leading to the production of its output data set(s) and having a specifically defined branch structure that connects with preceding and succeeding nodes (i.e. its neighboring constructs). DW pipelines DWP that are part of a data processing workflow can be logically modeled as directed acyclic graphs (DAGs) consisting of nodes, representing DW constructs (C). The edges of the graph (E) represent the directed data or control flow among nodes \((c_{1} \prec c_{2})\). Formally:

c = dps

 

DWP = (C, E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (c_{1},\ c_{2}), \ c_{1} \in {\textbf {C}} \ \wedge \ c_{2} \in {\textbf {C}} \ \wedge \ (c_{1} \prec c_{2}) \)
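As a minimal illustration of this model, the following Python sketch (with hypothetical construct names) encodes a DWP as a set of constructs C and directed edges E, and checks the edge condition from the definition above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    """A DW construct c: an atomic processing unit with its own
    distinct processing semantics (dps)."""
    name: str  # illustrative label, e.g. "row_filter"
    dps: str   # the semantics, kept here as a plain-text description

# DWP = (C, E): a set of constructs plus directed edges c1 -> c2,
# meaning c1 precedes c2 in the pipeline.
C = {
    "load":   Construct("load", "read a CSV file into a table"),
    "filter": Construct("filter", "keep rows matching a predicate"),
    "join":   Construct("join", "combine two tables on a key"),
}
E = [("load", "filter"), ("filter", "join")]

# Every edge must connect two constructs that belong to C,
# mirroring the formal edge condition above.
assert all(c1 in C and c2 in C for (c1, c2) in E)
```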

 

Based on the characteristics of each DW construct c, including the transformation it performs, its classification, and its semantics, it can be mapped to one DW activity ac from a predefined set \(\mathbb {A}\) through a surjective function activity. Moreover, every activity ac performs a function that can be mapped to one DW stage s from the predefined set \(\mathbb {S}\) through the surjective function stage. Formally:

activity: C\(\ \rightarrow \ \pmb {\mathbb {A}}\)

 

stage: \(\pmb {\mathbb {A}}\ \rightarrow \ \pmb {\mathbb {S}}\)

 

As such, the DW pipeline can be defined as a set of connected DW activities and, consequently, can be mapped further to a connected set of stages in DW. Formally:

DWP = (\(\pmb {\mathbb {A}}\), E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (ac_{1},\ ac_{2}), \ ac_{1} \in \pmb {\mathbb {A}} \ \wedge \ ac_{2} \in \pmb {\mathbb {A}} \ \wedge \ (ac_{1} \prec ac_{2}) \)

 

DWP = (\(\pmb {\mathbb {S}}\), E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (s_{1},\ s_{2}), \ s_{1} \in \pmb {\mathbb {S}} \ \wedge \ s_{2} \in \pmb {\mathbb {S}} \ \wedge \ (s_{1} \prec s_{2}) \)

 

This abstract definition of a DW pipeline and its constructs, illustrated in Fig. 2, can be applied to the DW process (see also Figs. 3, 4, 5 and 6) in any type of tool used for its implementation, whether in workflow or other forms. This enables the mining of DW patterns from various types of pipelines, and specifically from workflows, which map easily to the presented model.
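The surjective activity and stage mappings can be sketched as simple lookup tables; the labels below are illustrative stand-ins for the taxonomy of Section 4.1, not the actual dictionary entries.

```python
# activity: C -> A and stage: A -> S as simple lookup tables.
activity = {
    "CSV Reader": "load_data",
    "Row Filter": "filter_rows",
    "Rule-based Row Filter": "filter_rows",  # distinct constructs, one activity
    "Joiner": "join_datasets",
}
stage = {
    "load_data": "Loading",
    "filter_rows": "Cleaning",
    "join_datasets": "Integration",
}

def to_stage_sequence(nodes):
    """Abstract a pipeline of tool nodes into its sequence of DW stages."""
    return [stage[activity[n]] for n in nodes]

print(to_stage_sequence(["CSV Reader", "Row Filter", "Joiner"]))
# -> ['Loading', 'Cleaning', 'Integration']
```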

Fig. 5 KNIME Node Repository interface, which provides nodes (operations) to users, with a sample of categories

Fig. 6 Sample of ‘row filter’ nodes (operations)

Therefore, a DW pattern WP would also be a DAG and each of its nodes PA represents a specific activity ac. It would also have specific characteristics, including its branch structure (i.e. the way it is connected to its neighbors). Formally:

pa = ac

 

WP = (PA, E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (pa_{1},\ pa_{2}), \ pa_{1} \in {\textbf {PA}} \ \wedge \ pa_{2} \in {\textbf {PA}} \ \wedge \ (pa_{1} \prec pa_{2}) \)

 

Only coherent structures are considered in the analysis performed; thus, DW patterns are connected graphs (i.e., there is a path between any two nodes in the graph). In Fig. 4, the conceptual model of a DW pipeline is extended to include the DW pattern and how it relates to the pipeline. The mapping of constructs to labels, which represent the activity they perform as well as the stage they are part of, is further described in the context of the mined workflows in Section 4.1. An example of the mapping from constructs into activity labels is presented in Fig. 8, where two workflows, WF1 and WF2, are mapped into a graph of activity labels. An example of the mapping of activity labels into stages is shown in Fig. 3. The model is also discussed later in Section 4. The next section presents our mining approach.
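The coherence requirement can be checked mechanically; the following sketch tests whether a candidate pattern's underlying graph is connected, ignoring edge direction.

```python
from collections import defaultdict, deque

def is_coherent(edges):
    """Return True if a candidate pattern WP = (PA, E) is a connected
    graph, i.e. a path exists between any two of its nodes when edge
    direction is ignored."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)  # ignore direction for the coherence test
    nodes = set(adj)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == nodes

print(is_coherent([("filter", "join"), ("join", "aggregate")]))  # True
print(is_coherent([("filter", "join"), ("sort", "publish")]))    # False: two islands
```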

4 Data Wrangling Pattern Discovery Approach

After conducting a manual exploration and analysis of a large sample of the selected scientific data analysis workflows, as described in Section 3, we now present a systematic approach to identifying DW patterns in all of the selected workflows. This approach includes the presentation of a DW activity taxonomy aimed at standardizing DW operations across tools (Section 4.1). We also describe the workflow pre-processing stage that takes place prior to mining (Section 4.2), the mining process (Section 4.3), and the results (Section 5). Table 8 describes the main steps in our DW pattern discovery methodology, distinguishing between the steps that were performed manually and those that were executed automatically.

Table 8 Data wrangling pattern discovery approach
Table 9 Data wrangling stages description
Fig. 7 Process flow for mapping KNIME nodes to activity labels

4.1 Creation of a Taxonomy of Data Wrangling Constructs

The lack of standards in implementing DW constructs makes the identification of patterns in DW pipelines hard. To overcome this difficulty, a process of conceptualization and unification of DW constructs was devised, described as follows. A DW pipeline is composed of several stages, each encompassing various activities that can be executed using different constructs within tools, in the form of operations. To help conceptualize and unify these stages across data pipelines, a dictionary of typical stages and activities in a DW pipeline is useful. Such a dictionary was collated from the literature, despite the lack of consensus on stage and activity names: precise identifications and concise descriptions of DW stages and activities were obtained from sources such as Rattenbury et al. (2017), Hellerstein et al. (2018) and Foundation (2013), and were used in the development of the dictionary. Table 9 describes the main DW stages in the dictionary.

Fig. 8 Two workflow snippets with unique signatures and their representation after applying the taxonomy

Table 10 Description of workflow manipulations, mapped to their purpose

To develop the intended taxonomy for unifying constructs in a DW pipeline, an initial set of conceptual activity labels was created. These labels were based on the data transformation operations presented by Raman and Hellerstein (2001) and the operations of Relational Algebra (RA); duplicates between the two sources were eliminated to ensure the accuracy and consistency of the taxonomy. Further, for each label, a DW activity flag was created, as presented in Raman and Hellerstein (2001); Convertino and Echenique (2017); Hellerstein et al. (2018); Foundation (2013); Rattenbury et al. (2017); Hameed and Naumann (2020). The factors summarized in Fig. 7, which consider the workflow patterns, classes, and the logical model presented in Section 3, were used to map each operation (i.e., workflow node) into a label, or to create a new label for an activity if needed, in the following order. It is worth noting that the dictionary of stages helps map operations and activity labels into their most generalized form.

  1. The (KNIME, in this case) node repository classification tree illustrated in Fig. 5.

  2. The node’s functionality (e.g., Row Splitter is classified as a row filter operation according to its functionality).

  3. The node’s description (e.g., Substructure Search is classified as a row filter operation since it implements domain-specific ‘row split’ functionality).

  4. The node’s prevalent trait (e.g., Date & Time Difference calculates the difference between two dates in a row and creates a new attribute with the results; as such, it can be classed under the ‘append attribute’ activity).

An example of the benefits of applying the taxonomy, which can also be extended to unify pipelines from multiple tools, is illustrated in Fig. 8. In the example, the repository initially contains 95k KNIME nodes with 1.8k unique signatures, which are reduced to 385 signatures by using the 90 activity labels in the taxonomy. The 90 created labels include 37 DW activities spanning 60% of the node instances.
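The signature-reduction effect can be illustrated with a toy version of the mapping; the node names, labels, and pipelines below are illustrative, not the actual 90-label taxonomy.

```python
# Many tool-specific node names collapse onto one activity label,
# shrinking the set of unique pipeline signatures.
taxonomy = {
    "Row Filter": "filter_rows",
    "Rule-based Row Filter": "filter_rows",
    "Row Splitter": "filter_rows",   # classified by functionality (factor 2)
    "Joiner": "join_datasets",
}

pipelines = [
    ["Row Filter", "Joiner"],
    ["Rule-based Row Filter", "Joiner"],
    ["Row Splitter", "Joiner"],
]

raw_signatures = {tuple(p) for p in pipelines}
labeled_signatures = {tuple(taxonomy[n] for n in p) for p in pipelines}
print(len(raw_signatures), "->", len(labeled_signatures))  # 3 -> 1
```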

4.2 Pre-processing: Parsing of Repository Workflows into Graphs

The preparation of workflows for mining with the PAFI algorithm, developed by Kuramochi and Karypis (2004), is summarized in Table 10. It is worth pointing out that this parser was tailored to parse KNIME-generated workflows, but it could easily be generalized to support the parsing of other workflow formats. Also note that workflows are grouped based on some of their representational characteristics, i.e., (A) Node ID, (B) Node name, and (C) Workflow structure, at different steps of the preparation process. The process was designed to meet the workflow format constraints imposed by the PAFI algorithm, which, in some cases, required changes to the original workflow representation, i.e., mapping the workflows into Directed Acyclic Graphs (DAGs) (Berthold et al., 2007). In addition, the application of the taxonomy (Section 4.1) and the disposal of workflows that did not contain any DW constructs/activities reduced the size of the set of workflows over the course of the preparation process.

Furthermore, a concurrent parsing process was performed to produce stage graphs using the stage dictionary described in Section 4.1. During this process, each node in every repository workflow was mapped to its stage representation. Duplicate stage occurrences (e.g., Fig. 3) and branching were eliminated to produce a linear graph, which was then used to identify frequent stage patterns. By eliminating duplicate stage occurrences and branching, abstraction from details such as the number of activities associated with the same stage and the specific transformation performed by the user is achieved. This abstraction is desirable, given that these details are not relevant to the identification of frequent stage patterns.
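A minimal sketch of the graph serialization step follows. The transaction-style line syntax ('t', 'v id label', 'u src dst label') reflects our understanding of the PAFI/FSG input convention and should be treated as an assumption to verify against the PAFI documentation.

```python
def to_transaction_format(graphs):
    """Serialize labeled DAGs into a PAFI-style graph-transaction file.
    NOTE: the exact line syntax below is an assumption about the
    PAFI/FSG input convention; check it against the PAFI manual."""
    lines = []
    for gid, (nodes, edges) in enumerate(graphs):
        index = {name: i for i, name in enumerate(nodes)}  # integer vertex ids
        lines.append(f"t # {gid}")
        for name, label in nodes.items():
            lines.append(f"v {index[name]} {label}")
        for src, dst in edges:
            lines.append(f"u {index[src]} {index[dst]} 1")  # single edge label
    return "\n".join(lines)

# One toy workflow: two labeled nodes and the edge between them.
wf = ({"n0": "filter_rows", "n1": "join_datasets"}, [("n0", "n1")])
print(to_transaction_format([wf]))
```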

Table 11 Modifications to stage representations in workflows, mapped to their purpose and benefit

In this process, the manipulations performed were as follows:

  • Removal of scripting nodes, represented by the User-Defined Functions (UDFs) activity, because mapping them to a stage would require an individual analysis of each instance.

  • Elimination of duplicate stages after the mapping, keeping the first occurrence of each stage in the pipeline.

  • Discarding of workflows that did not contain at least two of the primary wrangling stages described in Section 4.1 (excluding the loading and publishing stages).

  • Mapping of all activities related to analysis or output generation under the publishing stage (because these all mark the end of a DW process).

Table 11 maps each modification to the reason for performing it and the benefit gained from doing so.
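A compact sketch of these manipulations follows; the stage names are illustrative placeholders for the dictionary of Section 4.1, not its actual entries.

```python
CORE_STAGES = {"Structuring", "Cleaning", "Enrichment", "Integration"}  # illustrative

def to_linear_stage_graph(stages):
    """Apply the Table 11 manipulations: drop UDF nodes, keep only the
    first occurrence of each stage, and discard pipelines with fewer
    than two core wrangling stages."""
    seen, linear = set(), []
    for s in stages:
        if s == "UDF":       # scripting nodes removed
            continue
        if s not in seen:    # duplicates eliminated, first occurrence kept
            seen.add(s)
            linear.append(s)
    if len(CORE_STAGES & set(linear)) < 2:
        return None          # workflow discarded
    return linear

print(to_linear_stage_graph(
    ["Loading", "Cleaning", "UDF", "Cleaning", "Integration", "Publishing"]))
# -> ['Loading', 'Cleaning', 'Integration', 'Publishing']
```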

4.3 Mining of Workflows and the Most-Commonly Traversed DW Paths

Following workflow pre-processing, the generated graphs were divided into buckets based on the percentage of DW activities contained in them, e.g., \((\ge 80\%, \ge 60\%, \ge 40\%, \ge 20\%, \text{ and } >0\%)\). This process was repeated over multiple iterations with different support (sup) thresholds. Despite the aims of the workflow mining, which include (i) discovering patterns of DW stages in data pipelines and (ii) identifying the frequency of DW activities within the pipelines, the results did not meet expectations: mining the workflows revealed that a significant fraction of the discovered patterns had low frequencies, despite their similarities. The only differences were a few extra analysis or domain-specific operations appearing in some patterns, which we informally refer to as intruder operations, and slight variations in the order in which DW activities appeared.

These observations brought about the idea of modifying our original approach, which considered the obvious patterns appearing in the workflows (i.e., those with high frequencies), as suggested by Theodorou et al. (2017). The modified approach aims at identifying low-frequency patterns found in the mining results that could potentially represent higher-frequency ones if exceptions are dealt with differently. In essence, in the new approach, variations in operation ordering within operation groupings that appear with a certain frequency are disregarded, allowing the identification of frequently traversed paths of DW activities within these groupings, which we call the Most-commonly Traversed Paths (MTPs).
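The following sketch illustrates the intuition behind MTPs under assumed activity labels: consecutive repetitions are collapsed, and activity-to-activity steps are counted across subgraphs irrespective of their position in each path.

```python
from collections import Counter

def collapse(path):
    """Collapse consecutive repetitions of the same activity (e.g. two
    chained row filters count once), a first step towards an MTP."""
    out = []
    for act in path:
        if not out or out[-1] != act:
            out.append(act)
    return out

def traversal_counts(subgraph_paths):
    """Count how often each activity-to-activity step is traversed
    across the frequent subgraphs, regardless of where it occurs."""
    counts = Counter()
    for p in subgraph_paths:
        c = collapse(p)
        counts.update(zip(c, c[1:]))
    return counts

mined = [
    ["row_filter", "row_filter", "join", "aggregate"],
    ["row_filter", "join", "join", "aggregate"],
    ["row_filter", "join", "sort"],
]
print(traversal_counts(mined).most_common(2))
# ('row_filter', 'join') is traversed in all three subgraphs
```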

5 Mining Approach Findings and Discussion

5.1 Findings

The findings from the mining process (Section 4) are categorized based on their relevance to either: i) DW stage patterns, ii) frequent DW activity patterns, or iii) the MTP approach (as outlined in Section 4.3).

In the investigation of i), 31% of the 1,787 pipelines investigated were omitted because they included fewer than two DW stages. The remaining pipelines yielded multi-usage patterns, of which the 7 most interesting are presented in Fig. 9. The presented patterns are divided into two groups based on their inputs: a) single-input patterns (e.g., Ps1), and b) multi-input patterns (e.g., Pm1). The most frequently occurring pattern, Pm3, was implemented in 32% of the workflow instances, followed by Ps2 at 16%. The stage patterns are chosen by users based on factors such as the number of inputs, the structure of the input(s), and subsequent analysis requirements, as suggested in Table 12; e.g., if a data set requires handling null values through imputation or removal, as well as attribute-based operations, such as the creation of new attributes, the Ps1 pattern would be used. It is also observed from this investigation that the lack of guidance for operation placement was not merely a matter of interchanging adjacent activities, which would not have been apparent when generalized to the level of stages; rather, it was a more critical issue that persisted even at this level of generalization. An example of the high variation in operation ordering can be seen in Pm3, which has 5 core wrangling stages and appeared in 200 different order permutations.

Fig. 9 Most interesting stage patterns. A parallelogram representation was used because the exploration (D) stage is not a transformation stage and is not always captured by the workflows. The double circle was used because the cleaning (C) stage is both optional and data-dependent

Table 12 Stage patterns and descriptions ordered by their occurrence percentage within the repository
Fig. 10 Results of mining the graph databases with different sup values

Table 13 Breakdown of types of repetition in subgraphs

Relevant to ii), the highest sup value that yielded results in the activity-level pattern mining was 50%, obtained when the database of Source 1 (Table 4) was used; this source also produced the largest number of patterns at other sup values. The highest sup value producing patterns for all databases was 15%, as illustrated in Fig. 10. The option to retrieve only maximal subgraphs was used when running the algorithm, but the inconsistent order of operations required further manual processing to remove duplicate patterns and patterns that never appear independently in workflows. The DW activities were present in 575 subgraphs, of which 63% contained a repetition of activities (Table 13), and 50% of the remaining subgraphs consisted of 6 activities (not all of them DW activities). Although the produced subgraphs were unique to the graph mining algorithm, they were not unique in terms of functionality when analyzed visually.

Fig. 11 The most interesting DW activity patterns

Table 14 List of patterns of DW activities, including their description and number of input/output data sets
Fig. 12 Map of combined candidate activity patterns

We defined a DW pattern as a frequently occurring combination of DW activities with distinguishing behavior leading to its classification. The 15 most interesting DW activity patterns extracted are presented in Fig. 11 and described in Table 14. Other patterns in the results were either subsumed within the 15 patterns or had no clear distinguishing behavior that could lead to their definitive classification. The issues faced in the mining of patterns mainly resulted from the varied order of activities, as well as from the “intruding” activities appearing in frequent combinations, which hinder the identification of a pattern or result in it appearing with a low frequency.

The results also offered interesting insights, confirmed by the appearance of highly repeated operations (Table 15), which, as well as producing irrelevant frequent subgraphs, indicated issues related to the incorrect utilization of operations, e.g., using a Split Rows node while only processing a single output branch, which could be achieved with a simple Row Filter operation.

Regarding iii), combining subgraphs as paths to create DW traversed paths (Fig. 12), obtained by removing operation repetitions and disregarding operation orderings in frequent subgraphs, revealed significant DW patterns. Figure 13 shows an MTP departing from Row Filter, with insights not found in the subgraphs. It contains 5 nodes and three paths, A, B, and C, which appeared in 45, 44, and 27 subgraphs, respectively. Despite these numbers, these paths were missed, or returned with lower frequency, by the mining algorithm due to varying placement orders and operation repetitions. Had a consistent arrangement been followed, the lower-frequency patterns in Fig. 11 would have had higher sup values. For example, Path B represents the “Join with Summary” and “Join Summaries” patterns. Path C represents a valid DW scenario but did not clearly appear in the mining results.

5.2 Discussion

From our findings, we are able to conclude that widely used tools for DW, such as KNIME, provide functionality based on collected statistics and user feedback but, to a large extent, fail to address the issue of facilitating the reuse of available workflows. Reuse requires users to understand previously constructed workflows before being able to select the most suitable one(s) for their purposes, as well as to modify or adapt them. Considerable effort is invested by tool users in developing UDFs from scratch, which are heavily used in workflows, despite the tools’ native, ready-to-use, and, to a significant extent, similar functionality. Extreme cases have been observed where tools are used merely for their UDF orchestration capability.

Table 15 Most repeated operations in a single subgraph and the number of graphs they appear in

Despite the availability of user-friendly GUIs, which attempt to bring the practice of DW within the capabilities of a wider end-user population, difficulties in the efficient and effective use of the tools are observed. This is particularly obvious from the poor design choices made by users when constructing their workflows, resulting in redundancies and other forms of inefficiency. Hence, the tools’ visual interfaces cannot fully compensate for a lack of technical data preparation knowledge and experience, given that users demonstrate gaps not only in knowledge of the tools’ capabilities but also in planning how to apply them to solve a problem. More specifically, the interactive tool design tempts a user to use an operation only to fulfill the requirement of a subsequent operation, including, but not limited to, loading and wrangling additional dataset(s) for integration.

Fig. 13 Most traversed paths (MTPs) in the map of paths departing from the row filter

The creation of the DW taxonomy, described in Section 4.1, allowed us to overcome major challenges in the identification of patterns in workflows, particularly the wide variation in operation placement and the multitude of possible permutations of DW stages and activities. It is worth pointing out that, although the patterns presented in Fig. 9 first appeared with numerous operation permutations, the operation order presented in the figure is influenced by the most used permutation, the testing of the candidate permutations, and Relational Algebra heuristics (e.g., operations with lower cost and the highest impact on data size reduction are to be applied as early as possible, before more expensive operations such as integration). When testing the various permutations, the commutative properties of operations, which determine whether they can be rearranged, were taken into consideration.
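A simplified version of this reordering heuristic can be sketched as follows; the cost ranks and the set of commuting operation pairs are illustrative assumptions rather than a complete commutativity analysis.

```python
# Illustrative costs: a lower rank means cheaper and more size-reducing,
# so the operation should run earlier when a swap is semantically safe.
RANK = {"filter_rows": 0, "remove_attributes": 1, "aggregate": 2, "integrate": 3}
COMMUTES = {("integrate", "filter_rows"), ("aggregate", "remove_attributes")}  # assumed pairs

def reorder(pipeline):
    """Bubble cheaper, size-reducing activities ahead of expensive ones,
    swapping only adjacent pairs known to commute."""
    ops = list(pipeline)
    changed = True
    while changed:
        changed = False
        for i in range(len(ops) - 1):
            a, b = ops[i], ops[i + 1]
            if RANK[b] < RANK[a] and (a, b) in COMMUTES:
                ops[i], ops[i + 1] = b, a
                changed = True
    return ops

print(reorder(["integrate", "filter_rows", "aggregate"]))
# -> ['filter_rows', 'integrate', 'aggregate']
```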

The mined workflows were not necessarily created by experienced “wranglers” or ETL engineers; rather, they represent a subset of real data pipelines containing DW stages in a workflow setting, in addition to other data analysis functionality. While the subset of DW operations used is not large, identical combinations of operations were not produced by the mining algorithm, indicating the lack of rules of thumb in DW and also explaining the absence of the significantly high-frequency patterns of operations found in the work of Theodorou et al. (2017). That study mined ETL workflows built by experts using the TPC-DI benchmark for Data Integration (Poess et al., 2014). We also found that the sources with the highest concentration of the discovered patterns (e.g., Source 1 in Table 4) were the ones with the highest number of workflows authored by the same groups of users preparing data in similar domains.

The above observations resemble the motivation for the work of the Gang of Four (GoF) in SE when they created their patterns catalogue (Gamma et al., 1994); however, the different mindsets of SE and DW users need to be considered. While the mined set of paths can solve a wide range of DW problems, including data filtering, integration, structural transformations, and value conversions, a considerable part of it includes computationally expensive operations, as the example in Fig. 13 suggests. These could be better arranged, through the use of query optimization techniques, to produce semantically equivalent solutions that are more computationally efficient and more widely applicable. The application of SE Design Patterns (DPs) and IS Design Principles (DPRs) to basic DW activities and MTPs can pave the way towards the creation of standards and patterns in DW, composing a catalogue to support the design of DW solutions.

In summary, one of the main challenges in reducing the burden of DW on users is the lack of standardization in DW operations. Contrary to relational database management systems where SQL is both a de facto and de jure standard with its key operations such as SELECT and JOIN having the same syntax and semantics across different relational DBMS products, there are no universal standards for data preparation/wrangling operations across DW tools and libraries. This creates significant compatibility, consistency, reusability, and productivity challenges in relation to data preparation. The development of standards for data preparation/wrangling operations can increase reusability, facilitate the understanding, verifiability, and auditing of DW pipelines, allow the emergence of widely used DW design patterns and best practices, and increase the efficiency and productivity of the data wrangling process.

The lack of standards also creates barriers to applying optimization techniques from other data engineering domains and to creating optimized pipelines based on DPs. By standardizing DW operations and formulating DW design patterns methodically, it is possible to increase the reuse of pipeline strategies and transform DW into an engineering discipline. This would ultimately lead to more efficient and effective DW processes (Mall, 2018).

Table 16 Mapping of DW activity and stage design patterns to the Design Principles schema from Gregor et al. (2020)
Table 17 Selective attribute value design pattern specification

6 Data Wrangling Design Pattern Specification

In this section, we use the specification template/schema proposed by Gregor et al. (2020), with a slight modification for ease of use, to define the DW Design Patterns (DPs) in a form that is understandable by both the user and the implementer of DW tools. In addition to the DP specification schema, we include a notion of cost-benefit in terms of computational cost, effect on the data set, and subsumption. Table 16 maps the items defined in the DW DPs to their most relevant section of the schema. All DW DPs are compiled and curated in the form of a digital handbook, accessible at https://almasaud-mcr.github.io/.
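To illustrate how a Handbook entry could be represented programmatically, the following sketch encodes the schema components as a Python data class; the field names follow the DPR components summarized in Table 1, and the example contents are illustrative paraphrases, not the Handbook's actual wording.

```python
from dataclasses import dataclass

@dataclass
class DWDesignPattern:
    """One Handbook entry, mirroring the DPR schema components of
    Gregor et al. (2020) mapped in Table 16; contents below are
    illustrative, not the Handbook's actual wording."""
    name: str
    aim: str           # what the pattern helps achieve
    implementer: str   # who builds the artifact
    user: str          # who the pattern is for
    context: str       # boundary conditions for applying the pattern
    mechanisms: str    # the acts/activities that realize the aim
    enactors: str      # who or what performs the mechanisms
    rationale: str     # why the pattern works
    cost_benefit: str  # our addition: cost, data-set effect, subsumption

example = DWDesignPattern(
    name="Selective Attribute Value",
    aim="process two parts of a single data set selectively",
    implementer="DW tool developer",
    user="data wrangler",
    context="a single input data set requiring conditional processing",
    mechanisms="split rows, transform each part, recombine the results",
    enactors="the DW pipeline executing the pattern's activities",
    rationale="keeps conditional logic explicit and auditable",
    cost_benefit="extra scan to split; subsumed by multi-input variants",
)
```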

The DW activity pattern specifications depicted in this section are derived from the patterns identified in Table 14, extended and presented as design pattern specifications to facilitate usability. Table 17 presents a sample DW activity pattern specification associated with selective processing on two parts of a single data set.

Table 18 PS1 - Enrich dataset DP specification
Table 19 Summary of the survey questions covering the different evaluation dimensions

The DW stage pattern specifications were derived from the extracted patterns shown in Fig. 9 in their computationally optimal arrangement, based on Data Engineering rules of thumb. However, in our representation of the DPs, we associate all possible order permutations in our illustrations. Table 18 presents the DP specification of PS1, the simplest of the stage patterns, using a single input to perform enrichment of a data set by creating new specialized attributes. Note that, as shown in Table 18, this pattern is represented as a single graph containing its various arrangements of stages, i.e., its multiple stage permutations were merged into a single graph, unlike its representation in the digital handbook, where only the most cost-efficient permutation is shown. The specified DP indicates which other stage patterns subsume the given pattern in their DP. For example, PM2 includes the same stages but for two data sets with an additional integration stage to combine them.

6.1 Evaluation

We evaluate our contribution to DW DPs by performing a user reusability evaluation survey, using the framework presented by Iivari et al. (2021). This framework has the advantage of employing a rich set of reusability facets, rather than focusing on an artifact alone, and is thus suitable for the purposes of this work.

6.1.1 Method and Data Collection

Data collection for the DW Pattern Handbook evaluation was conducted from August to September 2022, using an expert purposive sampling strategy. This ensured that the selected group had the necessary experience and skill set to provide valuable feedback on the discovered design patterns. We contacted 31 potential respondents, all of whom were data scientists and/or data analysts, via email to participate in our study. The questionnaire used is summarized in Table 19; please refer to Appendix A for the PDF version of the Qualtrics implementation of the questionnaire and the email used to contact respondents. Out of the 31 potential respondents, 26 responded positively, resulting in a response rate of 83.8%. The raw data collected was securely stored in the University of Manchester Qualtrics instance and analyzed using both Qualtrics reports and Microsoft Excel. There were no ethical concerns regarding data collection, as the survey did not gather any personally identifiable information and was within the respondents’ domain of expertise.

Table 20 Summary of the respondents’ roles and self-reported technical and wrangling ability (scale from 0 to 100)
Fig. 14 Evaluation results of the DW activity patterns for the accessibility, importance and novelty factors, where (a) is the figure, and (b) presents the averages obtained in each answer

Ethical Approval Declarations

  1. Approval: Internally, we followed the ethical compliance required by the University of Manchester (https://www.manchester.ac.uk/research/environment/governance/ethics/approval/). The University of Manchester does not normally require formal ethical review for the research activities reported in this paper, provided the following criteria are met:

    (a) The data is completely anonymous, with no personal information being collected (apart from their name, their publicly available contact details, and a record of consent);

    (b) The data is not considered to be sensitive or confidential in nature;

    (c) The issues being researched are not likely to upset or disturb participants;

    (d) Vulnerable or dependent groups are not included;

    (e) There is no risk of possible disclosures or reporting obligations;

    (f) The subject matter is limited to topics that are strictly within the professional competence of the participants.

  2. Accordance: The methods were carried out in accordance with the relevant ethical guidelines and regulations (https://www.manchester.ac.uk/research/environment/governance/ethics/approval/).

  3. Informed consent: Informed consent was obtained from all participants prior to conducting the evaluation survey included in this research.

The survey consisted of questions covering the evaluation framework, with Likert-scale responses ranging from ‘strongly disagree’ to ‘strongly agree’. The following reusability factors were considered in the evaluation: (1) Accessibility, (2) Importance, (3) Novelty and Insightfulness, (4) Actability, (5) Guidance, (6) Effectiveness, and (7) how the patterns stand Compared to the Current Situation. The specific questions associated with each factor are described in Table 19, illustrating their meaning. Note that additional free-text questions appeared in the survey based on the users’ responses to some questions, to assist in understanding the user’s response and to aid in improving the handbook.

6.1.2 Survey Results and Analysis

The responses to the self-assessment questions regarding the users’ technical and wrangling abilities are summarized in Table 20. The results show a range of abilities, providing an adequate representation of stakeholders that are typically involved in the DW process. Next, we present the obtained results from each section of the survey.

The DW activity patterns are particularly useful for practitioners who want to directly apply the identified DW activity design patterns in their wrangling work. The evaluation survey results for factors (1), (2), and (3) are illustrated in Fig. 14, and those for factors (4), (5), (6), and (7) in Table 21. Almost all respondents found the presented DPs to be clear, understandable, important, and capable of addressing immediate issues in building DW pipelines. Only 20% of respondents did not think that the patterns conveyed new ideas, which is understandable given the technical expertise of some respondents. 25 out of 26 respondents agreed that the DW DPs would be useful in their own practice. Three users suggested that there may be missing parts in the patterns, but only two submitted their suggestions, as shown in Table 22. These suggestions provide an opportunity to extend the coverage of the patterns to specific domain use cases in the future.

According to the survey respondents, the DW activity patterns can be easily used in practice, provide sufficient design freedom, and are not restrictive when used in designing DW pipelines. While one respondent reported insufficient guidance for designing DW pipelines, over 65% believed that the provided guidance was sufficient. All respondents agreed that the DPs of the activity patterns could help non-experts build DW pipelines and assist in DW pipeline optimization. In the Compared with the Current Situation section, over half of the respondents provided positive responses to the questions.

Table 21 The evaluation results of the DW activity patterns for the factors: actability, guidance, effectiveness and Compared to the current situation
Table 22 User responses to “Can you inform us of what DW activity patterns you believe weren’t included and would make it more complete?”

The evaluation of the DW activity patterns yielded promising feedback on their potential to improve the work of data wranglers in building pipelines and to reduce their burden. While users with high technical and wrangling abilities did not see as much benefit from the patterns as other users, they still agreed that the Handbook would be helpful for non-experts in performing wrangling activities. Several suggestions were received for improving the content of the Handbook, including creating a learning environment for executing the patterns, including ETL patterns, and adding specific activities used in certain domains. Although implementing the Handbook in a technology-specific learning environment would restrict its use to a particular platform or tool, and the additional activities and patterns are not commonly found in DW pipelines that use tabular data, these suggestions are valid and will be taken into consideration in future work.

7 Conclusions and Future Work

Data wrangling (DW) or data preparation is a key process enabling the creation of business analytics and machine learning-based predictive analytics models. The skillful application of data preparation methods has been shown to enhance prediction model accuracy in business applications (Coussement et al., 2017), to serve as a mitigating lever in the reduction of machine learning bias (Vokinger et al., 2021), and to act as a data quality enhancement factor that firms can use to improve the quality of their decisions (Ghasemaghaei & Calic, 2019). Despite its importance, DW is often performed as an ad hoc craft rather than a systematic engineering discipline. In this paper, we investigate how to make the DW process more systematic and reuse-driven by discovering, conceptualizing, and specifying DW design patterns. Our work applied data mining techniques to DW workflow repositories to identify reusable design patterns and organized the findings by compiling a Data Wrangling Design Patterns Handbook. The Handbook articulates guidance and principles for applying DW design patterns. Our findings are useful for increasing reusability and for optimizing and systematizing the process of developing data preparation solutions. They also highlight important theoretical questions about the distinctive nature of the DW process and how data analysts/data scientists can be better supported by novel DW tools and development methods.

Self-service DW tools use Graphical User Interfaces, menu-driven tools, and spreadsheets to simplify the programming of DW scripts with easy-to-use interfaces. However, despite alleviating some of the challenges of building a DW pipeline, they can make it harder for non-technical analysts to choose from the variety of applicable DW operations with different implementations, which increases the decision space of possible DW design solutions. Moreover, the lack of standards in DW operations and their outcomes adds overheads (and costs) to data preparation efforts. We argue that research on developing widely used DW patterns and standardized DW operations, building on foundations such as the formal DW operation constructs introduced in Raman and Hellerstein (2001), along with research on DW cost models, would enable more extensive reuse and optimization in DW pipelines.

Whilst our findings contribute to the data science and information systems body of knowledge, our work can be extended in several directions. For example, the DW patterns can be incorporated into existing tools in the form of decision guidance mechanisms (Morana et al., 2019) and automatic code generation methods (Budinsky et al., 1996) to increase reusability and automation, and empirical studies need to be carried out to quantify the productivity gains from applying the DW design patterns. Obtaining larger datasets of DW pipelines created by leading experts in data engineering and wrangling would allow the discovery of additional DW design patterns. Quantitative evidence of the reusability and cost-saving potential of the proposed design patterns is also an important area for future research.