Can language models automate data wrangling?

The automation of data science and other data manipulation processes depends on the integration and formatting of 'messy' data. Data wrangling is an umbrella term for these tedious and time-consuming tasks. Tasks such as transforming dates, units or names expressed in different formats have been challenging for machine learning because (1) users expect to solve them with short cues or few examples, and (2) the problems depend heavily on domain knowledge. Interestingly, large language models today (1) can infer from very few examples or even a short clue in natural language, and (2) can integrate vast amounts of domain knowledge. It is therefore an important research question to analyse whether language models are a promising approach for data wrangling, especially as their capabilities continue to grow. In this paper we apply different variants of the language model Generative Pre-trained Transformer (GPT) to five batteries covering a wide range of data wrangling problems. We compare the effect of prompts and few-shot regimes on their results, and how they compare with specialised data wrangling systems and other tools. Our major finding is that they appear to be a powerful tool for a wide range of data wrangling tasks. We provide some guidelines about how they can be integrated into data processing pipelines, provided the users can take advantage of their flexibility and the diversity of tasks to be addressed. However, reliability is still an important issue to overcome.


Introduction
Data wrangling refers to repetitive and time-consuming data preparation tasks, including transforming data presented in different formats into a standardised form for easy access, understanding and analysis. The (semi-)automation of these manual and non-systematic tasks can reduce the costs of data preparation significantly. If language models (on their own or integrated within other systems) are able to solve a significant proportion of these problems in the coming years, the transformative effect on society and the marketplace would be huge, given how widespread these formatting chores are (from spreadsheet manipulation to data science projects) [12].
One key difficulty of some data wrangling problems, such as standardising a field into a single format, stems from the context of interaction [30]. For automation to be really useful, the tool should be able to infer the transformation pattern from very few examples and complete the rest automatically. The second challenge for data wrangling, and especially for data transformation into a common format, lies in the myriad of different transformations and formats we may find depending on the domain of the data. For instance, in a date field, the day can be the first, second or third number, and these numbers can be delimited by different symbols. An AI system based only on basic string transformations may never find the right solution given just one example without domain constraints or background knowledge, since the transformations needed for dates are very different from those used for addresses or emails.
There seems to be great potential in language models [2] for data wrangling precisely because they compress huge amounts of human knowledge about many different domains, and have recently shown reasonably good performance in contextualising this knowledge for few-shot inference [23,25,5,13]. It is then very important to determine whether language models could be used in the future for data wrangling tasks, and whether they get better as the number of parameters increases, a question subject to recent debate [1,29]. The applicability of language models for the automation of other parts of data science may also be affected by the progress in data wrangling, especially as we move towards more domain-dependent and more open-ended tasks, as shown in the quadrants of figure 1 in [9].
In this paper we test experimentally whether language models can be used to solve typical problems in data wrangling, using prompts that contain input-output examples and end with a single input, for which the language model has to provide the output as a continuation of the prompt (e.g. Input: 'marshap@gmail.com'\nOutput: 'marshap'\n\nInput: 'alant@hotmail.com'\nOutput:). Concretely, we compare the inference power of GPT-3 with other specialised tools on a benchmark of simple data wrangling problems. To our knowledge, this is the first paper analysing the potential of language models for data wrangling systematically, studying the influence of the size of the model and the number of examples.
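As an illustration, the prompt style above can be assembled mechanically from example pairs; the following is a minimal sketch (the helper name and the email instances are just illustrative, not part of the benchmark):

```python
def build_prompt(examples, query):
    """Concatenate input-output example pairs, ending with a final query
    input whose output the language model must provide as a continuation."""
    parts = [f"Input: '{inp}'\nOutput: '{out}'" for inp, out in examples]
    parts.append(f"Input: '{query}'\nOutput:")
    return "\n\n".join(parts)

prompt = build_prompt([("marshap@gmail.com", "marshap")], "alant@hotmail.com")
print(prompt)
# Input: 'marshap@gmail.com'
# Output: 'marshap'
#
# Input: 'alant@hotmail.com'
# Output:
```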

Related work
One of the challenges for the automation of data wrangling tasks is how the solutions can be built or selected from a vast space of transformations when only a few examples are provided by a user [4]. For this reason many data wrangling tasks are approached by combining the available information in the examples with some domain knowledge ("any information the learner has about the unknown transformation before seeing the examples" [27]), in an attempt to reduce the hypothesis space. Inductive Programming [15] has been a common paradigm to learn transformations from very few examples by incorporating prior knowledge about the domain in a declarative way. As this approach suffers from intractability when background knowledge becomes large, the use of ad-hoc domain-specific languages (DSLs) (see [8,32]) restricts the search space, and has led to the first commercial products such as Microsoft Excel with FlashFill [15]. Even with domain-specific languages, many constraints on the transformations, or very specific collections of built-in facilities or functions, are added to make things work. For instance, Amazon SageMaker Data Wrangler * contains over 300 built-in data transformations, and tools like Trifacta Wrangler [20] allow the user to define their own transformations. Many systems combine some of these ideas or apply ad-hoc optimisations [16,3,11,22,14,28,27]. On the other hand, [7,6] show that general-purpose inductive programming systems can still be used by selecting or ranking different sets of domain-specific background knowledge based on contextual information or meta-features of the examples to be transformed.
Language models are conceptually simple systems: they estimate the probability p(y|x) of a given sequence of characters or tokens y following another sequence x, in the spirit of efficient coding [26]. Today, these models are usually based on large deep learning architectures such as transformers (attention-based architectures, [31]), but they still estimate this same probability. They are trained over massive natural language corpora and hence exploit the extrinsic patterns borrowed from humans. However, beyond making plausible continuations following the inputs (the so-called 'prompts'), or as part of this capability, recent systems such as BERT [10], GPT-2 [24], GPT-3 [5], and PanGu-α [34] can also be employed as 'few-shot learners', trying to exploit intrinsic patterns in the prompt. Few-shot inference happens when the models are able to extrapolate from previous examples in the prompt, without being retrained or fine-tuned. Extensive experimental research is showing remarkable extrapolations [18,17,33,19] from small prompts. The state of the art of language models suggests they can be a promising tool for data wrangling precisely because (1) they capture a wide range of domain background knowledge, and contextualise it to the problem quite effectively, without the need for extra knowledge (e.g., we do not have to tell them that '23/12/2021' is a date), and (2) they not only infer from very few examples (e.g., pairs of date transformations "Input: 23/12/2021, Output: 12-23-2021"), but we can also add hints to the prompt to make few-shot learning more effective, or even zero-shot learning possible (e.g., "The conversion of 23/12/2021 into US format is:").
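To make the contrast between the two regimes concrete, both can be written down as plain prompt strings (the exact wording of the zero-shot hint below is only an assumption for illustration; the prompts actually used are described in the methodology section):

```python
# Few-shot: the transformation must be induced from demonstration pairs alone.
few_shot = (
    "Input: '23/12/2021'\nOutput: '12-23-2021'\n\n"
    "Input: '01/05/2022'\nOutput:"
)

# Zero-shot: a natural-language hint replaces the demonstrations entirely.
zero_shot = "The conversion of 23/12/2021 into US format is:"

print(few_shot)
print(zero_shot)
```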

Methodology: goals and experimental design
Our experimental goals are: (1) to determine to what extent a state-of-the-art language model can obtain good results on these data wrangling problems under the few-shot setting, (2) to analyse the effect of the number of instances given in the few-shot setting, (3) to explore the effect of the number of parameters of the language model to better understand its future potential, (4) to study the variation of performance across different domains, and (5) to compare the results with other systems specifically designed for data wrangling.
For the experimental setting, we employ the Data Wrangling Dataset Repository †, a benchmark for data transformation problems. This repository includes many of the data wrangling tasks used in the literature (see, e.g., [11]) as well as new manually gathered tasks [7]. Overall, the repository contains 117 different tasks divided into 7 different domains (dates, emails, freetext, names, phones, times and units). For every task there are 6 examples, each composed of an input string and an output string. The output string corresponds to a corrected or modified version of the input string. In the appendix, we provide further details about the tasks in each domain in Table 2 and some illustrative examples in Table 3.
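The benchmark structure just described can be captured by a small data type; the sketch below is only a representation we assume for illustration (the class name, task name and email instances are hypothetical, and the repository's actual on-disk format is not specified here):

```python
from dataclasses import dataclass, field

@dataclass
class WranglingTask:
    """One benchmark task: a domain, a task name, and (input, output)
    string pairs, where the output is a corrected or modified version
    of the input. In the repository every task has 6 such pairs."""
    domain: str    # one of: dates, emails, freetext, names, phones, times, units
    name: str
    examples: list = field(default_factory=list)  # list of (input, output) tuples

# A hypothetical task instance in the 'emails' domain.
task = WranglingTask(
    domain="emails",
    name="get-username",
    examples=[("marshap@gmail.com", "marshap"),
              ("alant@hotmail.com", "alant")],
)
assert all(len(pair) == 2 for pair in task.examples)
```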
We use four versions of GPT-3 (a language model built and trained by OpenAI) of increasing capabilities: Ada, Babbage, Curie and DaVinci, which line up closely with 350M, 1.3B, 6.7B and 175B parameters, respectively ‡. First, we analysed several possible prompts based on the recommendations stated in the OpenAI API §. As a result, the final prompt used in this work follows an input-output style, where the string "Input:" is used to indicate the start of the input, and the string "Output:" is used to indicate the start of the output. The line break \n separates the input from the output of an example, as well as the examples in the prompt. The instance will have one (one-shot) or more (few-shot) input-output pairs of the same problem and domain, randomly selected (without considering the possible order sensitivity of GPT-3 [21]), and one single input ending the prompt. The language model will have to provide the output by continuing the prompt.
† http://dmip.webs.upv.es/datawrangling/
‡ For the sake of replicability and reproducibility, all the code and results can be found in https://github.com/gonzalojaimovitch/lm-dw
§ https://openai.com/blog/openai-api/
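The n-shot prompt construction just described can be sketched as follows; this is a minimal illustration under our own assumptions (the helper names and the date pairs are hypothetical, the actual completion call to the API is elided, and demonstrations are sampled in random order, with no attempt to control order sensitivity):

```python
import random

def make_shots(pairs, query_idx, n_shots, rng):
    """Randomly pick n_shots demonstration pairs from a task's examples,
    excluding the held-out query instance."""
    pool = [p for i, p in enumerate(pairs) if i != query_idx]
    return rng.sample(pool, n_shots)  # random order, no order control

def prompt_for(pairs, query_idx, n_shots, rng):
    """Build an n-shot prompt ending with the query input."""
    shots = make_shots(pairs, query_idx, n_shots, rng)
    body = "\n\n".join(f"Input: '{i}'\nOutput: '{o}'" for i, o in shots)
    return body + f"\n\nInput: '{pairs[query_idx][0]}'\nOutput:"

# Hypothetical date-transformation task with four example pairs.
date_task = [("03/29/86", "29-03-1986"), ("11/02/96", "02-11-1996"),
             ("04/05/99", "05-04-1999"), ("12/01/01", "01-12-2001")]
print(prompt_for(date_task, 3, 2, random.Random(0)))
```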

Results and Discussion
The results obtained by the four different models (Ada, Babbage, Curie and DaVinci) in the four learning settings analysed (from 1-shot to 4-shot) are depicted in Figure 1 (complete details in Table 4, Figure 4, Table 5 and Table 6 in the supplementary material). In general, the results show that language models can be employed to learn simple transformations from few examples, and, as expected, the accuracy improves when we provide more instances. We also see that, in general, the most powerful engine is DaVinci. Nevertheless, the performance is not uniform across the analysed domains. We observe that the domain emails is the one where the GPT-3 models obtain the highest performance, whereas units is the domain with the lowest performance.
With the intention of getting more insight into how the models fail, we perform a fine-grained analysis of the 'units' domain. Table 1 includes examples of some of these tasks to better understand the differences in performance shown in Figure 2. The problems in tasks getUnits-i and getValue-i (see Table 2 for details) can be translated as "extracting a part of the string", a transformation that the GPT-3 models can solve. Hence, we see that GPT-3 presents good results in domains where tasks can be solved by simple string transformations. However, getSystem-i and convert-i are much more complex tasks: getSystem-i requires the identification of the unit acronym (e.g., 'cl' for centilitres) and relating it to its unit system (e.g., volume), while convert-i needs to perform an arithmetic operation (e.g., a division), in addition to the identification of the conversion coefficient to the target unit (e.g., a coefficient of 1000 to convert milligrams into grams).
Table 1. Examples of problems in the domain 'units'.
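The contrast between the two kinds of tasks can be made concrete with hypothetical reference implementations (these are not the benchmark's ground-truth functions, and the conversion coefficients are our own assumption): a getValue-style task is pure string extraction, whereas a convert-style task additionally needs a coefficient and an arithmetic operation.

```python
import re

def get_value(s):
    """getValue-style task: extract the numeric value, e.g. '250 mg' -> '250'.
    A simple string operation, with no arithmetic involved."""
    return re.match(r"([\d.]+)", s).group(1)

# Assumed coefficients to grams, for illustration only.
COEFF_TO_G = {"mg": 0.001, "g": 1.0, "kg": 1000.0}

def convert_to_grams(s):
    """convert-style task: '250 mg' -> '0.25 g'. Requires identifying the
    coefficient and performing arithmetic, not just slicing the string."""
    value, unit = s.split()
    return f"{float(value) * COEFF_TO_G[unit]:g} g"

print(get_value("250 mg"))         # '250'
print(convert_to_grams("250 mg"))  # '0.25 g'
```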
Finally, in order to compare the performance of GPT-3 with other data wrangling systems, we employ the subset of 26 problems for which there are results in the literature, under a 1-shot setting, which is the same setting used by the other systems. We compare GPT-3 DaVinci with other data wrangling tools: FlashFill [15], Trifacta Wrangler [22] and DBK [7]. The results (displayed in Figure 3, with complete details in Table 4 and Figure 4 in the supplementary material) show that language models are competitive with first-generation data wrangling tools such as FlashFill, and are getting closer in performance to more sophisticated tools such as DBK. Again, we see that the performance of the compared systems is related to the types involved in the target functions. The best results are obtained in domains where the problems are solved by simple string operations, while in other domains like units, where some functions incorporate arithmetic, the results are much worse. The exception is DBK, which can induce the domain of the problem and then select proper base functions to address it.

Conclusions
Language models have recently disrupted artificial intelligence thanks to an unexpected abstraction capacity that has expanded their applicability to fields and problems not originally anticipated in their design. In this work, we have analysed different configurations and prompts, as well as the effect of the number of examples provided, to assess their performance on data wrangling problems. To our knowledge, this paper is the first one that explores the possibilities of language models for data wrangling problems. The results show the capacity of these systems to learn transformation functions from few examples. The performance of the studied language models is comparable to well-known systems specialised in data wrangling. These results open a promising research direction to explore the possible applications of language models as APIs and specialised tools for data wrangling. This is not limited to data wrangling, but could well be used for other tasks in data science, especially those that can be learnt from very few examples and require extensive domain knowledge.
[Figure 3 caption: Comparison with FlashFill [15], Trifacta Wrangler and DBK [7] for a 1-shot learning setting. Results of the compared systems are obtained from [7,6]. The tasks addressed are a subset of those in Table 4. Coloured, horizontal lines show the average results per system across domains.]
As future work, we are interested in exploring other scenarios where, in addition to the examples, we also give some textual hints about the problem, as recommended in the OpenAI API documentation. We also plan to explore alternative prompts that could increase the learning capacity of language models. Finally, we have also detected some problems related to the reliability of the hypotheses learned by the language models. For instance, consistent hypotheses learned with few examples are later ignored when we provide more examples to the models.

Table 2 (excerpt). Tasks in the 'units' domain:
Convert     The value transformed to a different magnitude
Get System  The system represented by the magnitude
Get Units   The units of the system
Get Value   The numeric value without any magnitude