1 Introduction

Data Wrangling (DW) is widely considered a tedious and complex process, with recent industry reports indicating that data scientists spend around 38% of their time on data preparation and cleansing activities (Anaconda, 2022). DW describes the activities associated with transforming raw data into an asset ready for analysis, including data profiling, matching, mapping, format transformation, and repair (Fernandes et al., 2023). The practice of DW has been made accessible to professionals lacking data science and engineering skills by the availability of self-service tools that offer flexibility for custom, ad-hoc, and quick DW solutions, often through user-friendly Graphical User Interfaces (GUIs) (Hellerstein et al., 2018; Kandel et al., 2011).

While DW tools share common functionality, there is a lack of standards in the conceptualization, representation, and semantics of the DW operations they support (Hameed & Naumann, 2020), placing a burden on users to discover suitable operations and understand how to use them in practice. The steep learning curve often involves a time-consuming trial-and-error process to develop a DW solution and can be exacerbated by low levels of skill and experience. As a result, users often build custom data preparation solutions from scratch for similar problems, rather than reusing existing ones, because finding and understanding existing solution designs is difficult. Recent literature on the automation of data wrangling provides clues to the evidence underpinning these reuse difficulties, stating that “data wrangling remains hard to automate, because it often involves special cases that require human insight” (Petricek et al., 2022). Examples of human insight include identifying common patterns in solutions with similar functionality, understanding idiosyncratic issues with data, and distinguishing outliers from noise. Tackling these issues in data preparation scripts often requires tailor-made code, which limits reuse opportunities. Moreover, our investigation revealed that potentially reusable DW scripts available in popular repositories often suffer from redundancies and inefficiencies due to poor choices in operation selection.

Finding ways to help users build good-quality DW pipelines without burdening them is essential to adopting a systematic approach to data preparation/wrangling. This raises questions such as:

  1. How to identify DW design patterns in open access repositories to enable the reuse of existing data preparation solutions?

  2. How to assess if effective solutions to common data preparation/DW problems can be developed through the use of DW design patterns?

  3. How to organize and present DW design patterns to facilitate their use in the data analytics lifecycle?

In search of answers to these questions, we conducted an investigation into existing DW workflow pipelines. This enabled us to identify how analysts develop DW solutions in practice. More specifically, we articulate the following contributions in this paper:

  • The discovery of patterns of DW activities/operations from Web-available repositories of DW pipelines.

  • A conceptualization of common DW operations included in DW design patterns and the creation of a DW Taxonomy, presented as a Dictionary of Operations and Patterns.

  • The specification of a DW Design Patterns Handbook and an evaluation of its usefulness.

The remainder of this paper is organized as follows: Section 2 presents a brief background and related work. Section 3 describes an exploration of DW in workflow environments and articulates the notion of DW constructs and patterns in data analysis workflows. The systematic pattern mining approach used to identify DW patterns is presented in Section 4 with its findings and discussion in Section 5. Section 6 presents samples of the DW design pattern specifications included in our DW Handbook and an evaluation of the usefulness of the Handbook. Section 7 summarizes our research contributions, limitations, and future work.

2 Background and Related Work

At a time when data is often analogized to gold, diamonds, and oil (Van der Aalst, 2014), DW is a critical process for determining the quality of the outcome of data analysis tasks (Muller et al., 2019). To facilitate DW, a variety of tools have been developed that vary in their capabilities and the level of technical skill required from users (Hameed & Naumann, 2020; Convertino & Echenique, 2017) (see Appendix A for a comparison of DW tools). These tools range from programming languages, such as R and Python, which require a high level of technical expertise, to user-friendly visual tools that require less technical knowledge, a category into which Talend Data Preparation (Talend, 2023) and Trifacta/Alteryx Designer Cloud (Trifacta, 2023) fall. Popular tools in the first category also include Spark-based DW solutions, such as Databricks (2023), which generally provide users with an environment for the interactive development of DW recipes, often through notebooks into which Python code can be written. However, these solutions largely lack guidance through the difficult process of DW, guidance that could not only ease and expedite the DW process but also inform users on how to build “good” DW recipes.

Both categories of tools are, in fact, limited in their support for mechanisms that facilitate the reuse of solution designs (e.g., design patterns), as our assessment of the tools revealed: they lack high-level descriptions of the rationale behind design decisions, of their association with data properties, and of the purposes of the target analytical task. This is a shortcoming that software engineers have worked hard to overcome through the design patterns found in software design, i.e., as a way of standardizing and passing down the knowledge and experience of expert designers to non-experts, in the form of readily available design constructs for reuse (Gamma et al., 1994). While the factors motivating the creation of design patterns in SE are similar to those in DW, there is a dearth of research on identifying DW patterns.

The semi-automation of DW via Machine Learning (ML) techniques is a topic of growing interest (Petricek et al., 2022; Jaimovitch-López et al., 2022), with works such as He et al. (2018) showing that suggestions of applicable operations can be made based on user-provided examples, and Sutton et al. (2018), which performs data transformations inspired by the UNIX diff command. Although tool functionality for direct and fully automated reuse of existing solutions is not yet supported, semi-automated approaches applying ML to narrow DW tasks, such as record matching, have been successfully developed. Our work complements previous research by focusing on identifying and documenting design patterns in available DW pipelines, addressing the problem of finding and describing constructs that can make DW more systematic and structured, paving the way to collaboration, standardization, and more extensive automation. This will, ultimately, make DW a more disciplined and efficient process.

Fig. 1 Interpretation of Tosta et al. (2015) approach for DW

2.1 Workflow Mining and Pattern Discovery

Workflow mining focuses on business process mining from event logs for the purpose of process redesign and optimization (Van der Aalst et al., 2003; Hammori et al., 2006). Event-log-based workflow mining, however, has limitations when applied to mining patterns in data pipelines. Previous work on identifying patterns in workflows modeled workflows logically as graphs to enable their mining; graphs are one of the most frequently used representations for big data processing (Darmont et al., 2022). Tosta et al. (2015) apply a path-based approach to identifying patterns in scientific workflows (Fig. 1), while Theodorou et al. (2017) use Frequent Subgraph Mining (FSM) to identify frequent patterns of operations in ETL workflows. Whilst these approaches are applicable to DW workflows, the approach of Tosta et al. (2015) would identify a set of short frequent paths of operations, which may lead to misleading conclusions, e.g., tasks C, D, and E in Fig. 1 appear 4 times (once in each path) as opposed to once in the original representation. The approach of Theodorou et al. (2017) was successful in identifying patterns in ETL; however, the repository used was created from well-formed, standard ETL workflows, which have a limited range of data transformation operations compared to DW. To take advantage of previous work on pattern identification designed for ETL processes, similarities between DW and ETL were explored, i.e., shared subsets of activities, resulting in our adaptation of existing ETL-based approaches for our research purposes (described later in this paper). Nonetheless, the key distinctions between ETL and DW are outlined below, highlighting why existing ETL mining approaches cannot be directly applied to DW pipelines.

  • ETL’s main focus is on operations that perform data integration from multiple data sources into a unified schema, and data storage in a consolidated data repository (e.g., data warehouse) whilst DW involves tailored data transformations to prepare a dataset for an analytical task such as machine learning-based prediction.

  • ETL involves a smaller subset of operations/data transformations applied in a quasi-sequential structure, whilst DW involves a wider range of operations with potentially iterative cycles and interactive tasks (e.g., visual data profiling).

  • ETL is typically performed using specialized software tools (e.g., Talend and Oracle Data Integrator) that automate the majority of the ETL process steps, whilst DW combines GUI-based tools (such as Trifacta) and programming languages such as Python and R to code user-defined functions.

These differences give rise to patterns for solving data problems faced by analysts that are not traditionally common in ETL workflows; the similarities between the two processes and their required outcomes, however, are the reasons for our adaptation of ETL-based approaches for our research purposes.

Table 1 Summary of the components of the proposed design principle (DPR) schema by Gregor et al. (2020)
Table 2 Summary of the framework of the proposed minimum reusability evaluation of design principles (DPRs) by Iivari et al. (2021)

2.2 Design Patterns Specification

Design Patterns (DPs) are a form of design knowledge, used to capture and communicate design experience and expertise in solving a problem in a certain context (Riehle & Züllighoven, 1996). Design patterns can be expressed in multiple ways (refer to Riehle and Züllighoven (1996) for details), each with a particular specification template. For patterns to be effective and helpful in the reuse of DW solutions, they need to be formulated, described, and communicated to users and developers in the domain of self-service data preparation. Design patterns can be abstracted in the form of more general design principles, which provide general guidelines on how patterns can be applied in various scenarios. Design Principles (DPRs) support the capture and communication of design knowledge by abstracting from implementation details and providing generalized solution descriptions. This facilitates the development of a more comprehensive understanding of different designs (Chandra et al., 2015). The work by Gregor et al. (2020) presents three main categories of DPRs, as follows: 1) DPRs about user activity, which state what a designed artifact should allow the user to do; 2) DPRs about an artifact, which state what features the designed artifact should have; and 3) DPRs about user activity and an artifact, which combine features from both 1) and 2), describing the characteristics that an artifact should have and what the user should be able to do with it.

In the research reported in this paper, we adopt the notion of a Design Principle as a specification template for the abstract/conceptual representation and communication of Design Patterns. We use the DPR schema/template proposed by Gregor et al. (2020) to organize our Design Pattern Handbook, with components such as the DPR’s aim, implementer, user, context, mechanisms, enactors, and rationale, which are summarized in Table 1. We have adopted this DPR template due to its generality and completeness, which make it suitable for specifying DW Design Patterns.

Iivari et al. (2021) argue that one of the weaknesses of Design Science Research is the evaluation of DPRs. In their view, DPRs are often stated without proper evaluation, or with incomplete evaluation that only addresses certain criteria. Alternatively, they may be evaluated based on the IT product that results from their application. To address this issue, the authors propose a reusability evaluation for DPRs based on five criteria: accessibility, importance, novelty and insightfulness, actability and guidance, and effectiveness, summarized in Table 2. It is worth noting that, for each criterion, the authors suggest a set of questions, which we adapt for evaluating our proposed DW Design Patterns.

2.3 Data Wrangling in a Workflow Setting

Workflow management systems are usually adopted for their orchestration capabilities, and the visual representation available in some of these tools makes a workflow easier to inspect, follow, and, potentially, reuse. KNIME (Berthold et al., 2007; Knime, 2023a), in particular, covers a wide range of data science and DW functionality and is aimed at multiple data analysis applications. KNIME provides a visual interface for designing and running workflows, in which operation nodes are made available to the user through a searchable, drag-and-drop tree structure, and supplies documentation for each of the included nodes. Additionally, users can access a number of example workflows created to demonstrate different use cases, available through a searchable interface. Due to its ease of use and rich functionality, large repositories of KNIME workflows are available for a variety of domain applications. These workflows, generated by users with varying levels of experience, were used in this study. Our chosen approach is not limited to KNIME; however, we opted for KNIME because it offers various built-in operations covering a wide range of DW functionality, and several pre-built data transformation workflow pipelines that can be mined for pattern discovery.

2.4 Terminology Used in this Paper

In this research, several terms are used that require clarification regarding their contextual use; they are summarized in Table 3. Generally, a workflow “can be thought of as a set of instructions that describe how a process should be performed, including the order in which tasks are to be completed” (Coalition, 2023). In the context of KNIME, when we refer to a workflow, we mean a workflow design document; when the design document includes a single sequence of activities, we use the term pipeline to refer to it.

Table 3 A Glossary of terms used in this research
Table 4 Number of KNIME workflow files from each repository

A workflow activity is typically the smallest unit of work in a workflow. KNIME refers to this as a node, while other tools use the term operation for the same concept. In this paper, in addition to the terms DW activity and DW operation, we may also adopt the general term DW construct to refer to these units of work. Constructs represent DW functionality. A specific DW activity could be implemented in different tools using a single DW construct or multiple DW constructs.

3 Workflows, DW Constructs and Patterns

3.1 Selection of Repositories and DW Pattern Exploration

Despite the availability of numerous repositories of data manipulation workflows, the selection was narrowed down to the workflows found in De Roure et al. (2009); NodePit (2023); Knime (2023b), shown in Table 4, due to their focus on DW. Following a thorough check, the 76 MyExperiment workflows (De Roure et al., 2009) were selected for an initial, mostly manual, exploration, due to their high levels of completeness and complexity. The initial exploration yielded the following observations:

Table 5 Application of general flow control patterns in DW workflows
Table 6 DW workflow activities classification

a) The KNIME workflows use over 4,500 KNIME nodes, with each workflow comprising multiple DW activities, including data structuring, formatting, manipulation, and integration.

b) There is a variety of input file types, creating a requirement for complex data transformations from non-tabular formats into tables, a requirement imposed by most of the functionality within KNIME and similar tools.

c) There is a considerable number of community-created/custom nodes, accounting for 10% of the nodes. This is attributed to a number of reasons, including limitations in the available functionality for data formatting, handling of web service calls, data comparison for cleaning purposes, and data enrichment through integration, as well as difficulties understanding the tool’s functionality and the workflows composed of its built-in nodes; the latter is largely due to a lack of detailed documentation, as well as of clarity and uniqueness in function naming schemes.

d) DW activities are interleaved with domain-specific and data analysis activities and are, at times, placed at the end of the pipeline, for instance, when non-tabular data is transformed into a table for output.

e) The workflow control patterns defined by Van der Aalst et al. (2003) were present in all workflows. Table 5 provides adaptations of the general definition of each of these patterns, taking DW processes as the main context.

f) A set of recurring DW activities and associated flow patterns were identified and extracted from the workflows; examples include the Join of two data sets, the application of a Filter over the records of a table, and the use of a For-loop to repeat the application of a given operation. Each such activity and flow pattern was named according to its role in the workflows, as suggested in the first column of Table 7, such that different types of activities emerged. A classification of these activity types, based on the interrelationship between activity input and output and taking the classification of ETL activities by Vassiliadis et al. (2009) as an example, is described in Table 6.

g) Redundancies and other forms of inefficiency are frequently found, as illustrated in the following examples and in the sketch after the list:

  • Repeated use of nodes, such as ‘row filter’, where a single ‘rule-based row filter’ could be used, resulting in computational overhead due to multiple full data scans.

  • Employment of a ‘split row’ node in a single-output pipeline, where a ‘row filter’ node could be used.

  • Removal of unneeded attributes at the end of a pipeline that contains computationally expensive operations, e.g., data aggregation and integration; in such cases, computational costs could be reduced by removing these attributes early. It is worth pointing out that computational cost reduction becomes crucial when financial costs are incurred from renting cloud resources to orchestrate a pipeline.
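To make these inefficiencies concrete, the following pandas sketch contrasts the two styles; the files, column names, and predicates are hypothetical and serve only to illustrate the single-pass combined filter and the early removal of unneeded attributes.

```python
import pandas as pd

# Hypothetical inputs: an 'orders' table and a large 'customers' table to join.
orders = pd.read_csv("orders.csv")        # assumed file, for illustration only
customers = pd.read_csv("customers.csv")  # assumed file, for illustration only

# Inefficient: two separate 'row filter' passes, late attribute removal after the join.
step1 = orders[orders["status"] == "shipped"]       # full scan 1
step2 = step1[step1["amount"] > 100]                # full scan 2
joined = step2.merge(customers, on="customer_id")   # join carries every column
result_slow = joined[["customer_id", "amount", "region"]]

# Better: one combined rule-based filter and early removal of unneeded
# attributes, so the expensive join processes fewer rows and narrower tables.
filtered = orders.loc[
    (orders["status"] == "shipped") & (orders["amount"] > 100),
    ["customer_id", "amount"],                      # project early
]
result_fast = filtered.merge(
    customers[["customer_id", "region"]], on="customer_id"
)
```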

Table 7 Data wrangling workflow activities and associated flow patterns mapped into general workflow patterns

3.2 Mapping Workflows and Abstracting DW Activities

Based on the common workflow control patterns defined in Table 5, and using the DW workflow activities classification shown in Table 6, a mapping between the recurring DW activities and associated flow patterns, extracted from the selected workflows, and the common workflow control patterns is drawn, which is presented in Table 7. This mapping associates each activity and flow pattern with its main role in DW workflows, abstracted from any construct representation and implementation details. Note that the activities act as logical models, each representative of a greater number of constructs that can be utilized in DW; for example, the Create Attribute activity can be performed using multiple constructs that perform this same function, such as merge attributes or split attribute, found in KNIME. It is also noteworthy that the most prevalent workflow pattern in the context of DW appears to be the sequence pattern.

Fig. 2 The conceptual model of a data wrangling pipeline

Fig. 3 The mapping of DW activities into stages in a pipeline and the simplified version of the transformed pipeline

Fig. 4 The extended conceptual model of a data wrangling pipeline, including how patterns relate to it

3.3 Conceptual DW Constructs and DW Patterns

Constructs that perform DW activities can be combined in various permutations within a data pipeline. These permutations accomplish specific functionalities that may or may not be achievable by other permutations, making them unique structures or patterns. Given the different structures that can be formed using DW constructs and their permutations, as well as the type and function of each construct, these structures are considered candidate DW patterns. However, to qualify as a DW pattern, a structure must occur frequently, consist solely of DW constructs, and have distinguishing characteristics that enable its classification.

A DW construct c is an atomic processing unit belonging to a class of activity ac, responsible for a single transformation over its input(s), with distinct processing semantics dps leading to the production of its output data set(s) and having a specifically defined branch structure that connects with preceding and succeeding nodes (i.e. its neighboring constructs). DW pipelines DWP that are part of a data processing workflow can be logically modeled as directed acyclic graphs (DAGs) consisting of nodes, representing DW constructs (C). The edges of the graph (E) represent the directed data or control flow among nodes \((c_{1} \prec c_{2})\). Formally:

c = dps

 

DWP = (C, E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (c_{1},\ c_{2}), \ c_{1} \in {\textbf {C}} \ \wedge \ c_{2} \in {\textbf {C}} \ \wedge \ (c_{1} \prec c_{2}) \)
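As a minimal illustration of this model, the following Python sketch (with hypothetical construct names) encodes a DWP as a set of constructs C and directed edges E, and checks the edge condition from the definition above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Construct:
    """A DW construct c: an atomic processing unit with its own
    distinct processing semantics (dps)."""
    name: str  # illustrative label, e.g. "row_filter"
    dps: str   # the semantics, kept here as a plain-text description

# DWP = (C, E): a set of constructs plus directed edges c1 -> c2,
# meaning c1 precedes c2 in the pipeline.
C = {
    "load":   Construct("load", "read a CSV file into a table"),
    "filter": Construct("filter", "keep rows matching a predicate"),
    "join":   Construct("join", "combine two tables on a key"),
}
E = [("load", "filter"), ("filter", "join")]

# Every edge must connect two constructs that belong to C,
# mirroring the formal edge condition above.
assert all(c1 in C and c2 in C for (c1, c2) in E)
```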

 

Based on the characteristics of each DW construct c, including the transformation it performs, its classification, and its semantics, it can be mapped to one DW activity ac from a predefined set \(\mathbb {A}\) through a surjective function activity. Moreover, every activity ac performs a function that can be mapped to one DW stage s from the predefined set \(\mathbb {S}\) through the surjective function stage. Formally:

activity: C\(\ \rightarrow \ \pmb {\mathbb {A}}\)

 

stage: \(\pmb {\mathbb {A}}\ \rightarrow \ \pmb {\mathbb {S}}\)

 

As such, the DW pipeline can be defined as a set of connected DW activities and, consequently, can be mapped further to a connected set of stages in DW. Formally:

DWP = (\(\pmb {\mathbb {A}}\), E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (ac_{1},\ ac_{2}), \ ac_{1} \in \pmb {\mathbb {A}} \ \wedge \ ac_{2} \in \pmb {\mathbb {A}} \ \wedge \ (ac_{1} \prec ac_{2}) \)

 

DWP = (\(\pmb {\mathbb {S}}\), E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (s_{1},\ s_{2}), \ s_{1} \in \pmb {\mathbb {S}} \ \wedge \ s_{2} \in \pmb {\mathbb {S}} \ \wedge \ (s_{1} \prec s_{2}) \)

 

This abstract definition of a DW pipeline and its constructs, illustrated in Fig. 2, can be applied to the DW process (see also Figs. 3, 4, 5 and 6) in any type of tool used for its implementation, whether in workflow or other forms. This enables the mining of DW patterns from various types of pipelines, and specifically from workflows, which map easily to the presented model.
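The surjective activity and stage mappings can be sketched as simple lookup tables; the labels below are illustrative stand-ins for the taxonomy of Section 4.1, not the actual dictionary entries.

```python
# activity: C -> A and stage: A -> S as simple lookup tables.
activity = {
    "CSV Reader": "load_data",
    "Row Filter": "filter_rows",
    "Rule-based Row Filter": "filter_rows",  # distinct constructs, one activity
    "Joiner": "join_datasets",
}
stage = {
    "load_data": "Loading",
    "filter_rows": "Cleaning",
    "join_datasets": "Integration",
}

def to_stage_sequence(nodes):
    """Abstract a pipeline of tool nodes into its sequence of DW stages."""
    return [stage[activity[n]] for n in nodes]

print(to_stage_sequence(["CSV Reader", "Row Filter", "Joiner"]))
# -> ['Loading', 'Cleaning', 'Integration']
```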

Fig. 5 KNIME Node Repository interface, which provides nodes (operations) to users, with a sample of categories

Fig. 6 Sample of ‘row filter’ nodes (operations)

Therefore, a DW pattern WP would also be a DAG and each of its nodes PA represents a specific activity ac. It would also have specific characteristics, including its branch structure (i.e. the way it is connected to its neighbors). Formally:

pa = ac

 

WP = (PA, E), such that:

 

\( \forall \ e \in {\textbf {E}}: \ \exists \ (pa_{1},\ pa_{2}), \ pa_{1} \in {\textbf {PA}} \ \wedge \ pa_{2} \in {\textbf {PA}} \ \wedge \ (pa_{1} \prec pa_{2}) \)

 

Only coherent structures are considered in the analysis performed; thus, DW patterns are connected graphs (i.e., there is a path between any two nodes in the graph). In Fig. 4, the conceptual model of a DW pipeline is extended to include the DW pattern and how it relates to the pipeline. The mapping of constructs to labels, which represent the activity they perform as well as the stage they are part of, is further described in the context of the mined workflows in Section 4.1. An example of the mapping from constructs into activity labels is presented in Fig. 8, where two workflows, WF1 and WF2, are mapped into a graph of activity labels. An example of the mapping of activity labels into stages is shown in Fig. 3. The model is also discussed later in Section 4. The next section presents our mining approach.
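The coherence requirement can be checked mechanically; the following sketch tests whether a candidate pattern's underlying graph is connected, ignoring edge direction.

```python
from collections import defaultdict, deque

def is_coherent(edges):
    """Return True if a candidate pattern WP = (PA, E) is a connected
    graph, i.e. a path exists between any two of its nodes when edge
    direction is ignored."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)  # ignore direction for the coherence test
    nodes = set(adj)
    if not nodes:
        return False
    start = next(iter(nodes))
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in adj[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen == nodes

print(is_coherent([("filter", "join"), ("join", "aggregate")]))  # True
print(is_coherent([("filter", "join"), ("sort", "publish")]))    # False: two islands
```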

4 Data Wrangling Pattern Discovery Approach

After conducting a manual exploration and analysis of a large sample of the selected scientific data analysis workflows, as described in Section 3, we now present a systematic approach to identifying DW patterns in all of the selected workflows. This approach includes the presentation of a DW activity taxonomy aimed at standardizing DW operations across tools (Section 4.1). We also describe the workflow pre-processing stage that takes place prior to mining (Section 4.2), the mining process (Section 4.3), and the results (Section 5). Table 8 describes the main steps in our DW pattern discovery methodology, distinguishing between the steps that were performed manually and those that were executed automatically.

Table 8 Data wrangling pattern discovery approach
Table 9 Data wrangling stages description
Fig. 7 Process flow for mapping KNIME nodes to activity labels

4.1 Creation of a Taxonomy of Data Wrangling Constructs

The lack of standards in implementing DW constructs makes the identification of patterns in DW pipelines hard. To overcome this difficulty, a process of conceptualization and unification of DW constructs was devised, described as follows. A DW pipeline is composed of several stages, each encompassing various activities that can be executed using different constructs within tools, in the form of operations. To help conceptualize and unify these stages across data pipelines, a dictionary of typical stages and activities in a DW pipeline is useful. Such a dictionary was collated from the literature, despite the lack of consensus on stage and activity names: precise identifications and concise descriptions of DW stages and activities were obtained from sources such as Rattenbury et al. (2017), Hellerstein et al. (2018) and Foundation (2013), and were used in the development of the dictionary. Table 9 describes the main DW stages in the dictionary.

Fig. 8 Two workflow snippets with unique signatures and their representation after applying the taxonomy

Table 10 Description of workflow manipulations, mapped to their purpose

To develop the intended taxonomy for unifying constructs in a DW pipeline, an initial set of conceptual activity labels was created. These labels were based on the data transformation operations presented by Raman and Hellerstein (2001) and the operations of Relational Algebra (RA); duplicates between the two sources were eliminated to ensure the accuracy and consistency of the taxonomy. Further, for each label, a DW activity flag was created, as presented in Raman and Hellerstein (2001); Convertino and Echenique (2017); Hellerstein et al. (2018); Foundation (2013); Rattenbury et al. (2017); Hameed and Naumann (2020). The factors summarized in Fig. 7, which consider the workflow patterns, classes, and the logical model presented in Section 3, were used to map each operation (i.e., workflow node) into a label, or to create a new label for an activity if needed, in the following order. It is worth noting that the dictionary of stages helps map operations and activity labels into their most generalized form.

  1. The (KNIME, in this case) node repository classification tree illustrated in Fig. 5.

  2. The node’s functionality (e.g., Row Splitter is classified as a row filter operation according to its functionality).

  3. The node’s description (e.g., Substructure Search is classified as a row filter operation since it implements domain-specific ‘row split’ functionality).

  4. The node’s prevalent trait (e.g., Date & Time Difference calculates the difference between two dates in a row and creates a new attribute with the results; as such, it can be classed under the ‘append attribute’ activity).

An example of the benefits of applying the taxonomy, which can also be extended to unify pipelines from multiple tools, is illustrated in Fig. 8. In the example, the repository initially contains 95k KNIME nodes with 1.8k unique signatures, which are reduced to 385 signatures by using the 90 activity labels in the taxonomy. The 90 created labels include 37 DW activities spanning 60% of the node instances.
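The signature-reduction effect can be illustrated with a toy version of the mapping; the node names, labels, and pipelines below are illustrative, not the actual 90-label taxonomy.

```python
# Many tool-specific node names collapse onto one activity label,
# shrinking the set of unique pipeline signatures.
taxonomy = {
    "Row Filter": "filter_rows",
    "Rule-based Row Filter": "filter_rows",
    "Row Splitter": "filter_rows",   # classified by functionality (factor 2)
    "Joiner": "join_datasets",
}

pipelines = [
    ["Row Filter", "Joiner"],
    ["Rule-based Row Filter", "Joiner"],
    ["Row Splitter", "Joiner"],
]

raw_signatures = {tuple(p) for p in pipelines}
labeled_signatures = {tuple(taxonomy[n] for n in p) for p in pipelines}
print(len(raw_signatures), "->", len(labeled_signatures))  # 3 -> 1
```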

4.2 Pre-processing: Parsing of Repository Workflows into Graphs

The preparation of workflows for mining with the PAFI algorithm, developed by Kuramochi and Karypis (2004), is summarized in Table 10. It is worth pointing out that this parser was tailored to parse KNIME-generated workflows, but it could easily be generalized to support the parsing of other workflow formats. Also note that workflows are grouped based on some of their representational characteristics, i.e., (A) Node ID, (B) Node name, and (C) Workflow structure, at different steps of the preparation process. The process was designed to meet the workflow format constraints imposed by the PAFI algorithm, which, in some cases, required changes to the original workflow representation, i.e., mapping the workflows into Directed Acyclic Graphs (DAGs) (Berthold et al., 2007). In addition, the application of the taxonomy (Section 4.1) and the disposal of workflows that did not contain any DW constructs/activities reduced the size of the set of workflows over the course of the preparation process.

Furthermore, a concurrent parsing process was performed to produce stage graphs using the stage dictionary described in Section 4.1. During this process, each node in every repository workflow was mapped to its stage representation. Duplicate stage occurrences (e.g., Fig. 3) and branching were eliminated to produce a linear graph, which was then used to identify frequent stage patterns. By eliminating duplicate stage occurrences and branching, abstraction from details such as the number of activities associated with the same stage and the specific transformation performed by the user is achieved. This abstraction is desirable, given that these details are not relevant to the identification of frequent stage patterns.
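A minimal sketch of the graph serialization step follows. The transaction-style line syntax ('t', 'v id label', 'u src dst label') reflects our understanding of the PAFI/FSG input convention and should be treated as an assumption to verify against the PAFI documentation.

```python
def to_transaction_format(graphs):
    """Serialize labeled DAGs into a PAFI-style graph-transaction file.
    NOTE: the exact line syntax below is an assumption about the
    PAFI/FSG input convention; check it against the PAFI manual."""
    lines = []
    for gid, (nodes, edges) in enumerate(graphs):
        index = {name: i for i, name in enumerate(nodes)}  # integer vertex ids
        lines.append(f"t # {gid}")
        for name, label in nodes.items():
            lines.append(f"v {index[name]} {label}")
        for src, dst in edges:
            lines.append(f"u {index[src]} {index[dst]} 1")  # single edge label
    return "\n".join(lines)

# One toy workflow: two labeled nodes and the edge between them.
wf = ({"n0": "filter_rows", "n1": "join_datasets"}, [("n0", "n1")])
print(to_transaction_format([wf]))
```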

Table 11 Modifications to stage representations in workflows, mapped to their purpose and benefit

In this process, the manipulations performed were as follows:

  • Removal of scripting nodes, represented by the User-Defined Functions (UDFs) activity, because mapping them to a stage would require an individual analysis of each instance.

  • Elimination of duplicate stages after the mapping, keeping the first occurrence of each stage in the pipeline.

  • Discarding of workflows that did not contain at least two of the primary wrangling stages described in Section 4.1 (excluding the loading and publishing stages).

  • Mapping of all activities related to analysis or output generation under the publishing stage (because these all mark the end of a DW process).

Table 11 maps each modification to the reason for performing it and the benefit gained from doing so.
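A compact sketch of these manipulations follows; the stage names are illustrative placeholders for the dictionary of Section 4.1, not its actual entries.

```python
CORE_STAGES = {"Structuring", "Cleaning", "Enrichment", "Integration"}  # illustrative

def to_linear_stage_graph(stages):
    """Apply the Table 11 manipulations: drop UDF nodes, keep only the
    first occurrence of each stage, and discard pipelines with fewer
    than two core wrangling stages."""
    seen, linear = set(), []
    for s in stages:
        if s == "UDF":       # scripting nodes removed
            continue
        if s not in seen:    # duplicates eliminated, first occurrence kept
            seen.add(s)
            linear.append(s)
    if len(CORE_STAGES & set(linear)) < 2:
        return None          # workflow discarded
    return linear

print(to_linear_stage_graph(
    ["Loading", "Cleaning", "UDF", "Cleaning", "Integration", "Publishing"]))
# -> ['Loading', 'Cleaning', 'Integration', 'Publishing']
```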

4.3 Mining of Workflows and the Most-Commonly Traversed DW Paths

Following workflow pre-processing, the generated graphs were divided into buckets based on the percentage of DW activities contained in them, e.g., \((\ge 80\%, \ge 60\%, \ge 40\%, \ge 20\%, \text{ and } >0\%)\). This process was repeated over multiple iterations with different support (sup) thresholds. Despite the aims of the workflow mining, which include (i) discovering patterns of DW stages in data pipelines and (ii) identifying the frequency of DW activities within the pipelines, the results did not meet expectations: mining the workflows revealed that a significant fraction of the discovered patterns had low frequencies, despite their similarities. The only differences were a few extra analysis or domain-specific operations appearing in some patterns, which we informally refer to as intruder operations, and slight variations in the order in which DW activities appeared.

These observations brought about the idea of modifying our original approach, which considered the obvious patterns appearing in the workflows (i.e., those with high frequencies), as suggested by Theodorou et al. (2017). The modified approach aims at identifying low-frequency patterns found in the mining results that could potentially represent higher-frequency ones if exceptions are dealt with differently. In essence, in the new approach, variations in operation ordering within operation groupings that appear with a certain frequency are disregarded, allowing the identification of frequently traversed paths of DW activities within these groupings, which we call the Most-commonly Traversed Paths (MTPs).
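The following sketch illustrates the intuition behind MTPs under assumed activity labels: consecutive repetitions are collapsed, and activity-to-activity steps are counted across subgraphs irrespective of their position in each path.

```python
from collections import Counter

def collapse(path):
    """Collapse consecutive repetitions of the same activity (e.g. two
    chained row filters count once), a first step towards an MTP."""
    out = []
    for act in path:
        if not out or out[-1] != act:
            out.append(act)
    return out

def traversal_counts(subgraph_paths):
    """Count how often each activity-to-activity step is traversed
    across the frequent subgraphs, regardless of where it occurs."""
    counts = Counter()
    for p in subgraph_paths:
        c = collapse(p)
        counts.update(zip(c, c[1:]))
    return counts

mined = [
    ["row_filter", "row_filter", "join", "aggregate"],
    ["row_filter", "join", "join", "aggregate"],
    ["row_filter", "join", "sort"],
]
print(traversal_counts(mined).most_common(2))
# ('row_filter', 'join') is traversed in all three subgraphs
```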

5 Mining Approach Findings and Discussion

5.1 Findings

The findings from the mining process (Section 4) are categorized based on their relevance to either: i) DW stage patterns, ii) frequent DW activity patterns, or iii) the MTP approach (as outlined in Section 4.3).

In the investigation of i), 31% of the 1,787 pipelines investigated were omitted because they included fewer than two DW stages. The remaining pipelines yielded multi-usage patterns, of which the 7 most interesting are presented in Fig. 9. The presented patterns are divided into two groups based on their inputs: a) single-input patterns (e.g., Ps1), and b) multi-input patterns (e.g., Pm1). The most frequently occurring pattern, Pm3, was implemented in 32% of the workflow instances, followed by Ps2 at 16%. The stage patterns are chosen by users based on factors such as the number of inputs, the structure of the input(s), and subsequent analysis requirements, as suggested in Table 12; e.g., if a data set requires handling null values through imputation or removal, as well as attribute-based operations, such as the creation of new attributes, the Ps1 pattern would be used. It is also observed from this investigation that the lack of guidance for operation placement was not merely a matter of interchanging adjacent activities, which would not have been apparent when generalized to the level of stages; rather, it was a more critical issue that persisted even at this level of generalization. An example of the high variation in operation ordering can be seen in Pm3, which has 5 core wrangling stages and appeared in 200 different order permutations.

Fig. 9 Most interesting stage patterns. A parallelogram representation was used because the exploration (D) stage is not a transformation stage and is not always captured by the workflows. The double circle was used because the cleaning (C) stage is both optional and data-dependent

Table 12 Stage patterns and descriptions ordered by their occurrence percentage within the repository
Fig. 10 Results of mining the graph databases with different sup values

Table 13 Breakdown of types of repetition in subgraphs

Relevant to ii), the highest sup value that yielded results in the activity-level pattern mining was 50%, obtained when the database of Source 1 (Table 4) was used; this source also produced the largest number of patterns at other sup values. The highest sup value producing patterns for all databases was 15%, as illustrated in Fig. 10. The option to retrieve only maximal subgraphs was used when running the algorithm, but the inconsistent order of operations required further manual processing to remove duplicate patterns and patterns that never appear independently in workflows. The DW activities were present in 575 subgraphs, of which 63% contained a repetition of activities (Table 13), and 50% of the remaining subgraphs consisted of 6 activities (not all of them DW activities). Although the produced subgraphs were unique to the graph mining algorithm, they were not unique in terms of functionality when analyzed visually.

Fig. 11 The most interesting DW activity patterns

Table 14 List of patterns of DW activities, including their description and number of input/output data sets
Fig. 12 Map of combined candidate activity patterns

We defined a DW pattern as a frequently occurring combination of DW activities with distinguishing behavior leading to its classification. The 15 most interesting DW activity patterns extracted are presented in Fig. 11 and described in Table 14. Other patterns in the results were either subsumed within the 15 patterns or had no clear distinguishing behavior that could lead to their definitive classification. The issues faced in the mining of patterns mainly resulted from the varied order of activities, as well as from the “intruding” activities appearing in frequent combinations, which hinder the identification of a pattern or result in it appearing with a low frequency.

The results also offered interesting insights, confirmed by the appearance of highly repeated operations (Table 15), which, as well as producing irrelevant frequent subgraphs, indicated issues related to the incorrect utilization of operations, e.g., using a Split Rows node while only processing a single output branch, which could be achieved with a simple Row Filter operation.

Regarding iii), combining subgraphs as paths to create DW traversed paths (Fig. 12), obtained by removing operation repetitions and disregarding operation orderings in frequent subgraphs, revealed significant DW patterns. Figure 13 shows an MTP departing from Row Filter, with insights not found in the subgraphs. It contains 5 nodes and three paths, A, B, and C, which appeared in 45, 44, and 27 subgraphs, respectively. Despite these numbers, these paths were missed, or returned with lower frequency, by the mining algorithm due to varying placement orders and operation repetitions. Had a consistent arrangement been followed, the lower-frequency patterns in Fig. 11 would have had higher sup values. For example, Path B represents the “Join with Summary” and “Join Summaries” patterns. Path C represents a valid DW scenario but did not clearly appear in the mining results.

5.2 Discussion

From our findings, we are able to conclude that widely used tools for DW, such as KNIME, provide functionality based on collected statistics and user feedback but, to a large extent, fail to address the issue of facilitating the reuse of available workflows. Reuse requires users to understand previously constructed workflows before being able to select the most suitable one(s) for their purposes, as well as to modify or adapt them. Considerable effort is invested by tool users in developing UDFs from scratch, which are heavily used in workflows, despite the tools’ native, ready-to-use, and, to a significant extent, similar functionality. Extreme cases have been observed where tools are used merely for their UDF orchestration capability.

Table 15 Most repeated operations in a single subgraph and the number of graphs they appear in

Despite the availability of user-friendly GUIs, which attempt to bring the practice of DW within the capabilities of a wider end-user population, difficulties in the efficient and effective use of the tools are observed. This is particularly obvious from the poor design choices made by users when constructing their workflows, resulting in redundancies and other forms of inefficiency. Hence, the tools’ visual interfaces cannot fully compensate for a lack of technical data preparation knowledge and experience, given that users demonstrate gaps not only in knowledge of the tools’ capabilities but also in planning how to apply them to solve a problem. More specifically, the interactive tool design tempts a user to use an operation only to fulfill the requirement of a subsequent operation, including, but not limited to, loading and wrangling additional dataset(s) for integration.

Fig. 13 Most traversed paths (MTPs) in the map of paths departing from the row filter

The creation of the DW taxonomy, described in Section 4.1, allowed us to overcome major challenges in the identification of patterns in workflows, particularly the wide variation in operation placement and the multitude of possible permutations of DW stages and activities. It is worth pointing out that, although the patterns presented in Fig. 9 first appeared with numerous operation permutations, the operation order presented in the figure is influenced by the most used permutation, the testing of the candidate permutations, and Relational Algebra heuristics (e.g., operations with lower cost and the highest impact on data size reduction are to be applied as early as possible, before more expensive operations such as integration). When testing the various permutations, the commutative properties of operations, which determine whether they can be rearranged, were taken into consideration.
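A simplified version of this reordering heuristic can be sketched as follows; the cost ranks and the set of commuting operation pairs are illustrative assumptions rather than a complete commutativity analysis.

```python
# Illustrative costs: a lower rank means cheaper and more size-reducing,
# so the operation should run earlier when a swap is semantically safe.
RANK = {"filter_rows": 0, "remove_attributes": 1, "aggregate": 2, "integrate": 3}
COMMUTES = {("integrate", "filter_rows"), ("aggregate", "remove_attributes")}  # assumed pairs

def reorder(pipeline):
    """Bubble cheaper, size-reducing activities ahead of expensive ones,
    swapping only adjacent pairs known to commute."""
    ops = list(pipeline)
    changed = True
    while changed:
        changed = False
        for i in range(len(ops) - 1):
            a, b = ops[i], ops[i + 1]
            if RANK[b] < RANK[a] and (a, b) in COMMUTES:
                ops[i], ops[i + 1] = b, a
                changed = True
    return ops

print(reorder(["integrate", "filter_rows", "aggregate"]))
# -> ['filter_rows', 'integrate', 'aggregate']
```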

The mined workflows were not necessarily created by experienced “wranglers” or ETL engineers; rather, they represent a subset of real data pipelines containing DW stages in a workflow setting, in addition to other data analysis functionality. While the subset of DW operations used is not large, identical combinations of operations were not produced by the mining algorithm, indicating the lack of rules of thumb in DW and also explaining the absence of the significantly high-frequency patterns of operations found in the work of Theodorou et al. (2017). That study mined ETL workflows built by experts using the TPC-DI benchmark for Data Integration (Poess et al., 2014). We also found that the sources with the highest concentration of the discovered patterns (e.g., Source 1 in Table 4) were the ones with the highest number of workflows authored by the same groups of users preparing data in similar domains.

The above observations resemble the motivation for the work of the Gang of Four (GoF) in SE when they created their patterns catalogue (Gamma et al., 1994); however, the different mindsets of SE and DW users need to be considered. While the mined set of paths can solve a wide range of DW problems, including data filtering, integration, structural transformations, and value conversions, a considerable part of it includes computationally expensive operations, as the example in Fig. 13 suggests. These could be better arranged, through the use of query optimization techniques, to produce semantically equivalent solutions that are more computationally efficient and more widely applicable. The application of SE Design Patterns (DPs) and IS Design Principles (DPRs) to basic DW activities and MTPs can pave the way towards the creation of standards and patterns in DW, composing a catalogue to support the design of DW solutions.

In summary, one of the main challenges in reducing the burden of DW on users is the lack of standardization in DW operations. Contrary to relational database management systems where SQL is both a de facto and de jure standard with its key operations such as SELECT and JOIN having the same syntax and semantics across different relational DBMS products, there are no universal standards for data preparation/wrangling operations across DW tools and libraries. This creates significant compatibility, consistency, reusability, and productivity challenges in relation to data preparation. The development of standards for data preparation/wrangling operations can increase reusability, facilitate the understanding, verifiability, and auditing of DW pipelines, allow the emergence of widely used DW design patterns and best practices, and increase the efficiency and productivity of the data wrangling process.

The lack of standards also creates barriers to applying optimization techniques from other data engineering domains and to creating optimized pipelines based on DPs. By standardizing DW operations and formulating DW design patterns methodically, it is possible to increase the reuse of pipeline strategies and transform DW into an engineering discipline. This would ultimately lead to more efficient and effective DW processes (Mall, 2018).

Table 16 Mapping of DW activity and stage design patterns to the Design Principles schema from Gregor et al. (2020)
Table 17 Selective attribute value design pattern specification

6 Data Wrangling Design Pattern Specification

In this section, we use the specification template/schema proposed by Gregor et al. (2020), with a slight modification for ease of use, to define the DW Design Patterns (DPs) in a form that is understandable by both the user and the implementer of DW tools. In addition to the DP specification schema, we include a notion of cost-benefit in terms of computational cost, effect on the data set, and subsumption. Table 16 maps the items defined in the DW DPs to their most relevant section of the schema. All DW DPs are compiled and curated in the form of a digital handbook, accessible at https://almasaud-mcr.github.io/.
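To illustrate how a Handbook entry could be represented programmatically, the following sketch encodes the schema components as a Python data class; the field names follow the DPR components summarized in Table 1, and the example contents are illustrative paraphrases, not the Handbook's actual wording.

```python
from dataclasses import dataclass

@dataclass
class DWDesignPattern:
    """One Handbook entry, mirroring the DPR schema components of
    Gregor et al. (2020) mapped in Table 16; contents below are
    illustrative, not the Handbook's actual wording."""
    name: str
    aim: str           # what the pattern helps achieve
    implementer: str   # who builds the artifact
    user: str          # who the pattern is for
    context: str       # boundary conditions for applying the pattern
    mechanisms: str    # the acts/activities that realize the aim
    enactors: str      # who or what performs the mechanisms
    rationale: str     # why the pattern works
    cost_benefit: str  # our addition: cost, data-set effect, subsumption

example = DWDesignPattern(
    name="Selective Attribute Value",
    aim="process two parts of a single data set selectively",
    implementer="DW tool developer",
    user="data wrangler",
    context="a single input data set requiring conditional processing",
    mechanisms="split rows, transform each part, recombine the results",
    enactors="the DW pipeline executing the pattern's activities",
    rationale="keeps conditional logic explicit and auditable",
    cost_benefit="extra scan to split; subsumed by multi-input variants",
)
```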

The DW activity pattern specifications depicted in this section are derived from the patterns identified in Table 14, extended and presented as design pattern specifications to facilitate usability. Table 17 presents a sample DW activity pattern specification associated with selective processing on two parts of a single data set.

Table 18 PS1 - Enrich dataset DP specification
Table 19 Summary of the survey questions covering the different evaluation dimensions

The DW stage pattern specifications were derived from the extracted patterns shown in Fig. 9 in their computationally optimal arrangement, based on Data Engineering rules of thumb. However, in our representation of the DPs, we associate all possible order permutations in our illustrations. Table 18 presents the DP specification of PS1, the simplest of the stage patterns, using a single input to perform enrichment of a data set by creating new specialized attributes. Note that, as shown in Table 18, this pattern is represented as a single graph containing its various arrangements of stages, i.e., its multiple stage permutations were merged into a single graph, unlike its representation in the digital handbook, where only the most cost-efficient permutation is shown. The specified DP indicates which other stage patterns subsume the given pattern in their DP. For example, PM2 includes the same stages but for two data sets with an additional integration stage to combine them.

6.1 Evaluation

We evaluate our contribution to DW DPs by performing a user reusability evaluation survey, using the framework presented by Iivari et al. (2021). This framework has the advantage of employing a rich set of reusability facets, rather than focusing on an artifact alone, and is thus suitable for the purposes of this work.

6.1.1 Method and Data Collection

Data collection for the DW Pattern Handbook evaluation was conducted from August to September 2022, using an expert purposive sampling strategy. This ensured that the selected group had the necessary experience and skill set to provide valuable feedback on the discovered design patterns. We contacted 31 potential respondents, all of whom were data scientists and/or data analysts, via email to participate in our study. The questionnaire used is summarized in Table 19; please refer to Appendix A for the PDF version of the Qualtrics implementation of the questionnaire and the email used to contact respondents. Out of the 31 potential respondents, 26 responded positively, resulting in a response rate of 83.8%. The raw data collected was securely stored in the University of Manchester Qualtrics instance and analyzed using both Qualtrics reports and Microsoft Excel. There were no ethical concerns regarding data collection, as the survey did not gather any personally identifiable information and was within the respondents’ domain of expertise.

Table 20 Summary of the respondents’ roles and self-reported technical and wrangling ability (scale from 0 to 100)
Fig. 14 Evaluation results of the DW activity patterns for the accessibility, importance and novelty factors, where (a) is the figure, and (b) presents the averages obtained in each answer

Ethical Approval Declarations

  1. Approval: Internally, we followed the ethical compliance required by the University of Manchester (https://www.manchester.ac.uk/research/environment/governance/ethics/approval/). The University of Manchester does not normally require formal ethical review for the research activities reported in this paper, provided the following criteria are met:

    (a) The data is completely anonymous, with no personal information being collected (apart from their name, their publicly available contact details, and a record of consent);

    (b) The data is not considered to be sensitive or confidential in nature;

    (c) The issues being researched are not likely to upset or disturb participants;

    (d) Vulnerable or dependent groups are not included;

    (e) There is no risk of possible disclosures or reporting obligations;

    (f) The subject matter is limited to topics that are strictly within the professional competence of the participants.

  2. Accordance: The methods were carried out in accordance with the relevant ethical guidelines and regulations (https://www.manchester.ac.uk/research/environment/governance/ethics/approval/).

  3. Informed consent: Informed consent was obtained from all participants prior to conducting the evaluation survey included in this research.

The survey consisted of questions covering the evaluation framework, with Likert-scale responses ranging from ‘strongly disagree’ to ‘strongly agree’. The following reusability factors were considered in the evaluation: (1) Accessibility, (2) Importance, (3) Novelty and Insightfulness, (4) Actability, (5) Guidance, (6) Effectiveness, and (7) how the patterns stand Compared to the Current Situation. The specific questions associated with each factor are described in Table 19, illustrating their meaning. Note that additional free-text questions appeared in the survey based on the users’ responses to some questions, to assist in understanding the user’s response and to aid in improving the handbook.

6.1.2 Survey Results and Analysis

The responses to the self-assessment questions regarding the users’ technical and wrangling abilities are summarized in Table 20. The results show a range of abilities, providing an adequate representation of stakeholders that are typically involved in the DW process. Next, we present the obtained results from each section of the survey.

The DW activity patterns are particularly useful for practitioners who want to directly apply the identified DW activity design patterns in their wrangling work. The evaluation survey results for factors (1), (2), and (3) are illustrated in Fig. 14, and those for factors (4), (5), (6), and (7) in Table 21. Almost all respondents found the presented DPs to be clear, understandable, important, and capable of addressing immediate issues in building DW pipelines. Only 20% of respondents did not think that the patterns conveyed new ideas, which is understandable given the technical expertise of some respondents. 25 out of 26 respondents agreed that the DW DPs would be useful in their own practice. Three users suggested that there may be missing parts in the patterns, but only two submitted their suggestions, as shown in Table 22. These suggestions provide an opportunity to extend the coverage of the patterns to specific domain use cases in the future.

According to the survey respondents, the DW activity patterns can be easily used in practice, provide sufficient design freedom, and are not restrictive when used in designing DW pipelines. While one respondent reported insufficient guidance for designing DW pipelines, over 65% believed that the provided guidance was sufficient. All respondents agreed that the DPs of the activity patterns could help non-experts build DW pipelines and assist in DW pipeline optimization. In the Compared with the Current Situation section, over half of the respondents provided positive responses to the questions.

Table 21 The evaluation results of the DW activity patterns for the factors: actability, guidance, effectiveness and Compared to the current situation
Table 22 User responses to “Can you inform us of what DW activity patterns you believe weren’t included and would make it more complete?”

The evaluation of the DW activity patterns yielded promising feedback on their potential to improve the work of data wranglers in building pipelines and to reduce their burden. While users with high technical and wrangling abilities did not see as much benefit from the patterns as other users, they still agreed that the Handbook would be helpful for non-experts in performing wrangling activities. Several suggestions were received for improving the content of the Handbook, including creating a learning environment for executing the patterns, including ETL patterns, and adding specific activities used in certain domains. Although implementing the Handbook in a technology-specific learning environment would restrict its use to a particular platform or tool, and the additional activities and patterns are not commonly found in DW pipelines that use tabular data, these suggestions are valid and will be taken into consideration in future work.

7 Conclusions and Future Work

Data wrangling (DW) or data preparation is a key process enabling the creation of business analytics and machine learning-based predictive analytics models. The skillful application of data preparation methods has been shown to enhance prediction model accuracy in business applications (Coussement et al., 2017), to serve as a mitigating lever in the reduction of machine learning bias (Vokinger et al., 2021), and to act as a data quality enhancement factor that firms can use to improve the quality of their decisions (Ghasemaghaei & Calic, 2019). Despite its importance, DW is often performed as an ad hoc craft rather than a systematic engineering discipline. In this paper, we investigate how to make the DW process more systematic and reuse-driven by discovering, conceptualizing, and specifying DW design patterns. Our work applied data mining techniques to DW workflow repositories to identify reusable design patterns and organized the findings by compiling a Data Wrangling Design Patterns Handbook. The Handbook articulates guidance and principles for applying DW design patterns. Our findings are useful for increasing reusability and for optimizing and systematizing the process of developing data preparation solutions. They also highlight important theoretical questions about the distinctive nature of the DW process and how data analysts/data scientists can be better supported by novel DW tools and development methods.

Self-service DW tools use Graphical User Interfaces, menu-driven tools, and spreadsheets to simplify the programming of DW scripts with easy-to-use interfaces. However, despite alleviating some of the challenges of building a DW pipeline, they can make it harder for non-technical analysts to choose from the variety of applicable DW operations with different implementations, which increases the decision space of possible DW design solutions. Moreover, the lack of standards in DW operations and their outcomes adds overheads (and costs) to data preparation efforts. We argue that research on developing widely used DW patterns and standardized DW operations, building on foundations such as the formal DW operation constructs introduced in Raman and Hellerstein (2001), along with research on DW cost models, would enable more extensive reuse and optimization in DW pipelines.

Whilst our findings contribute to the data science and information systems body of knowledge, our work can be extended in several directions. For example, the DW patterns can be incorporated into existing tools in the form of decision guidance mechanisms (Morana et al., 2019) and automatic code generation methods (Budinsky et al., 1996) to increase reusability and automation, and empirical studies need to be carried out to quantify the productivity gains from applying the DW design patterns. Obtaining larger datasets of DW pipelines created by leading experts in data engineering and wrangling would allow the discovery of additional DW design patterns. Quantitative evidence of the reusability and cost-saving potential of the proposed design patterns is also an important area for future research.