We emphasise several characteristics of this problem: (1) the combination of a few functional primitives can achieve very complex transformations; (2) the arity of the primitives is usually low (one in many cases), so the snippet becomes a pipeline, where the output of one primitive becomes the input of the next; (3) we work from a single example; and (4) we want the shortest transformation with the given primitives, as built-in primitives usually lead to more efficient transformations. All these characteristics suggest that the problem can be addressed by exploring all combinations of primitive sequences under a strong simplicity bias (the number of primitives used). This strategy is common in other inductive programming scenarios (Mitchell et al. 1991; Katayama 2005; Menon et al. 2013; Mitchell et al. 2018), but it must always be coupled with some constraints (e.g., types, schemata) or strong heuristics. In our case, we use the dimensions of the matrices as the main constraint for reducing the combinatorial explosion, together with priors on the frequency of each primitive and, optionally, posteriors derived from text hints, in order to guide a tree-based search where each combination of functions is sorted and selected based on its assigned probability.
Dimensional constraints
We consider the background knowledge as a set of primitives G. The number of primitives |G| taken into account for the search is known as the breadth (b) of the problem, while the minimum number of such primitives that have to be combined in one solution is known as the depth (d). Clearly, both depth and breadth strongly influence the hardness of the problem, usually exponentially, \(O(b^d)\) (Ferri-Ramírez et al. 2001), affecting the time and resources needed to find the right solution. This expression is exact if we consider unary primitives, so that solutions become matrix operation pipelines, i.e., strings of primitives \(c= g_1 g_2 \ldots g_d\).
The first optimisation to this search comes from the constraints on the dimensions of the primitives and the input/output matrices. For each matrix primitive g we take into account the dimension of the input and output at any point of the composition, as well as some other constraints on the minimum dimension (for instance, calculating correlations with the function cor requires at least two rows, i.e., \(m>1\)). More formally, for each primitive g we define a tuple \(\langle m_{min}, n_{min}, \tau \rangle\), where \(m_{min}\) and \(n_{min}\) are the minimum numbers of rows and columns (respectively) for the input (by default \(m_{min}=1\) and \(n_{min}=1\)), and \(\tau : \mathbb {N}^2 \rightarrow \mathbb {N}^2\) is a type function, which maps the dimension of the input matrix to the dimension of the output matrix. For instance, for \(g=\) colSums, \(m_{min}=1\) and \(n_{min}=1\), as at least one row and one column are needed for the primitive to work, and \(\tau (m,n) = (1,n)\), because g takes a matrix of size \(m\times n\) and returns a matrix of size \(1 \times n\). Similarly, for \(g=\) cor, \(m_{min}=2\) and \(n_{min}=1\), as we need at least two rows to calculate a correlation, and \(\tau (m,n) = (n,n)\), because g takes a matrix of size \(m \times n\) and returns a matrix of size \(n \times n\).
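As an illustration, these tuples can be encoded directly as data. The following R sketch (the names prims and propagate_dim are ours, for illustration only) stores \(\langle m_{min}, n_{min}, \tau \rangle\) for a few primitives and propagates an input dimension through a candidate pipeline:

```r
## Dimension tuples <m_min, n_min, tau> for a few R matrix primitives.
prims <- list(
  colSums = list(m_min = 1, n_min = 1, tau = function(m, n) c(1, n)),
  cor     = list(m_min = 2, n_min = 1, tau = function(m, n) c(n, n)),
  t       = list(m_min = 1, n_min = 1, tau = function(m, n) c(n, m))
)

## Propagate an input dimension through a sequence of primitive names,
## returning NULL as soon as a minimum-dimension constraint is violated.
propagate_dim <- function(seq_names, dim_in) {
  d <- dim_in
  for (g in seq_names) {
    p <- prims[[g]]
    if (d[1] < p$m_min || d[2] < p$n_min) return(NULL)  # constraint violated
    d <- p$tau(d[1], d[2])
  }
  d
}

propagate_dim(c("colSums", "t"), c(5, 3))  # 5x3 -> 1x3 -> 3x1: returns c(3, 1)
```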
Probabilistic model
Now, during exploration we can consider that not all primitives, and consistent sequences of primitives, are equally likely. Given our inputs (the hint text T, an input matrix A and a partially filled output matrix B), we estimate the probability of a sequence of primitives as follows:
$$\begin{aligned} p(g_1 g_2 \ldots g_d|T,A,B) = \prod _{i=1}^{d} p(g_{i}|g_{i-1} g_{i-2} \ldots g_1 , T,A,B) \end{aligned}$$
(1)
The expression on the right can be evaluated incrementally as candidate primitives are added during the search procedure.
In order to estimate this probability we consider the a priori probability p(g) of each g, which we can derive from the frequency of use of the primitives in the library, as we will see in the following section. When T is available, we will use a frequency model that compares the TF-IDF values (Salton and Buckley 1988) of the primitive's R help documentation with the TF-IDF values of T. This model produces the conditional probability \(p_0(g|T)\) \(\forall g,T\). We combine these probabilities as follows:
$$\begin{aligned} p(g|T,A,B) = \gamma p(g) + (1-\gamma ) p_0(g|T) \end{aligned}$$
(2)
with \(\gamma \in [0,1]\). Clearly, if T is not available then \(\gamma =1\). Basically, \(\gamma\) gauges how much relevance we give to the primitive prior (valid for all problems) over the hint given by the user.
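A minimal R sketch of this combination (the function name combine_probs is ours; p_prior and p0_hint are assumed to be named numeric vectors over the primitives in G):

```r
## Eq. 2: blend the prior p(g) with the hint-based model p0(g|T).
combine_probs <- function(p_prior, p0_hint = NULL, gamma = 0.5) {
  if (is.null(p0_hint)) return(p_prior)       # no hint text T: gamma = 1
  gamma * p_prior + (1 - gamma) * p0_hint     # Eq. 2
}
```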
Finally, we have the intuition that the probability of a primitive may depend on the previous primitives. In this paper, we explore a very simple model for sequential dependencies, limiting the effect to trigrams and checking whether the same primitive is repeated in any of the three previous operations. We use a parameter \(\beta \in [0,1]\), where low \(\beta\) values imply that repetitions are penalised more heavily. More formally,
$$\begin{aligned} p(g_i|g_{i-1} \ldots g_1,T,A,B) = \beta p(g_i|T,A,B) + (1-\beta ) p(g_i|g_{i-1}g_{i-2}g_{i-3},T,A,B) \end{aligned}$$
(3)
and the repetition part is simply:
$$\begin{aligned} p(g_i|g_{i-1}g_{i-2}g_{i-3},T,A,B) = {\left\{ \begin{array}{ll} 0 &{} \text {if } \, g_i \in \{g_{i-1}, g_{i-2}, g_{i-3}\} \\ p(g_i|T,A,B) &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
which means that if the primitive appears in any of the three previous operations, the trigram term is 0, and this penalty becomes more relevant the lower \(\beta\) is. We will explore whether this repetition intuition has an important effect on the results.
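A minimal R sketch of Eq. 3 (the function name next_prob is ours), assuming the probability \(p(g_i|T,A,B)\) from Eq. 2 is given as p_gi:

```r
## Eq. 3: probability of the next primitive g_i given the previous ones,
## penalising repetitions within the last three operations.
next_prob <- function(g_i, prev, p_gi, beta = 0.5) {
  rep_term <- if (g_i %in% tail(prev, 3)) 0 else p_gi  # trigram repetition term
  beta * p_gi + (1 - beta) * rep_term                  # Eq. 3
}

next_prob("colSums", c("t", "colSums"), p_gi = 0.1, beta = 0.5)  # gives 0.05
```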
Algorithm
With Eq. 1 using the expansion of Eq. 3, we can recalculate the probability each time a primitive is introduced in a tree-based search. Note that every combination of primitives whose dimensions do not match has probability 0 and is ruled out. However, for those that are valid, can we use extra heuristics to determine whether we are getting closer to the solution? One idea is to check whether we are approaching the final size of the matrix. For instance, if the result has size (m, 1) and an operation takes the dimension to exactly that, it may be more promising than another that leads to a size \((2m,n^3)\) (which would require further operations to be reduced, at least in size).
In particular, each node in the tree where functions \(g_1g_2\ldots g_d\) have been introduced is assigned the following priority:
$$\begin{aligned} p^*(g_1g_2 \ldots g_d) = (1+\alpha m)\prod _{i=1}^{d} p(g_{i}|g_{i-1} g_{i-2} \ldots g_1 , T,A,B) \end{aligned}$$
(4)
where \(m=1\) if the final dimensions match the size of the output matrix B, i.e., \(\tau _d(\tau _{d-1}(\ldots \tau _1(m_{input},n_{input})\ldots ))\) \(=(m_{output}, n_{output})\), and \(m=0\) otherwise. For those ongoing transformations where the output size matches (even if the values are not yet equal), the priority will be higher than if the dimensions do not match. In other words, it is simply an estimate of whether “we may already be there”. The parameter \(\alpha \in [0,1]\) gives weight to this bonus. If \(\alpha =1\), the priority of a situation where the final size matches is doubled over one where it does not. For \(\alpha =0\), the priority is not affected by whether the final size matches.
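As a sketch, the priority of Eq. 4 can be computed in one line of R (the function name priority is ours; seq_prob is the product of the Eq. 3 terms for the sequence):

```r
## Eq. 4: boost the sequence probability when the propagated output
## dimension already matches dim(B).
priority <- function(seq_prob, dims_match, alpha = 1) {
  (1 + alpha * as.numeric(dims_match)) * seq_prob
}

priority(0.02, TRUE)   # 0.04: a dimension match doubles the priority (alpha = 1)
priority(0.02, FALSE)  # 0.02
```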
Now we can use Eq. 4 in the tree-based search. The search algorithm works as follows (see Algorithm 1; a code sketch is given after the enumeration):
1. The system can be configured to use a set of primitive functions (G), each of them including the minimum values for the size of the input (\(m_{min}\), \(n_{min}\)) and the type function \(\tau\).
2. For each particular problem to solve, we take the input matrix A and the partially filled matrix B. Optionally, we take a text hint T describing the problem to solve.
3. Let \(d_{max}\) be the maximum number of functions allowed in the solution; the procedure evaluates sequences of primitives \(g_1g_2 \ldots g_d\), with \(0 \le d \le d_{max}\), where each \(g_i \in G\). The parameter \(s_{max}\) determines the maximum number of solutions (when reached, the algorithm stops).
4. We start with a set of candidate solutions \(C=G\).
5. We extract the \(c= g_1g_2 \ldots g_d \in C\) with the highest \(p^*(c)\). We use \(\tau\) on A and all primitives in c to check whether the combination is feasible according to the dimension constraints and, in that case, we calculate the output size. If the dimensions of any composition in c do not match, we delete the node from C. If the dimension of the output matches the dimension of B, we effectively execute the combination on A, i.e., c(A), and check whether the result covers S, as defined in the previous section. In the positive case, we add c as a solution and delete it from C. In any other case, if \(d < d_{max}\), we expand c into \(c \cdot g_{d+1}\) for all \(g_{d+1} \in G\), calculate \(p^*\) for each of them and add them to C. We remove c from C.
6. We repeat step 5 until \(s_{max}\) is reached or C is exhausted.
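The following R sketch puts the previous pieces together (propagate_dim, next_prob and priority from the sketches above; the names search_pipeline and apply_prim are ours). It is a simplified version of Algorithm 1: the candidate set is a plain list rather than a priority queue, and vector results are reshaped as row vectors, which holds for the primitives sketched earlier:

```r
## Execute one primitive by name; R drops the dim attribute for vector
## results (e.g. colSums), so we reshape them back to a 1 x n matrix.
apply_prim <- function(x, g) {
  r <- get(g)(x)
  if (is.null(dim(r))) r <- matrix(r, nrow = 1)
  r
}

search_pipeline <- function(A, B, p_g, d_max = 4, s_max = 1,
                            alpha = 1, beta = 0.5, eps = 1e-6) {
  G <- names(prims)
  C <- lapply(G, function(g) list(seq = g, prob = p_g[[g]]))  # step 4: C = G
  solutions <- list()
  while (length(C) > 0 && length(solutions) < s_max) {
    ## Step 5: extract the candidate with the highest priority p* (Eq. 4).
    prios <- vapply(C, function(cand) {
      d_out <- propagate_dim(cand$seq, dim(A))
      if (is.null(d_out)) return(0)                 # dimensions infeasible
      priority(cand$prob, all(d_out == dim(B)), alpha)
    }, numeric(1))
    i <- which.max(prios); cand <- C[[i]]; C <- C[-i]
    d_out <- propagate_dim(cand$seq, dim(A))
    if (is.null(d_out)) next                        # prune infeasible node
    if (all(d_out == dim(B))) {
      out <- Reduce(apply_prim, cand$seq, A)        # execute c(A)
      ## Check coverage of the known cells of B up to precision eps;
      ## NA cells of B are the ones still to be filled.
      if (all(abs(out - B) < eps, na.rm = TRUE)) {
        solutions[[length(solutions) + 1]] <- cand$seq
        next
      }
    }
    if (length(cand$seq) < d_max) {                 # expand with every g in G
      for (g in G) {
        p <- next_prob(g, cand$seq, p_g[[g]], beta)
        C[[length(C) + 1]] <- list(seq = c(cand$seq, g), prob = cand$prob * p)
      }
    }
  }
  solutions
}

A <- matrix(1:6, nrow = 2)                    # 2 x 3 input
B <- matrix(c(3, 7, 11), nrow = 1)            # colSums(A) as a 1 x 3 matrix
p_g <- c(colSums = 0.4, cor = 0.3, t = 0.3)   # assumed primitive priors p(g)
search_pipeline(A, B, p_g)                    # finds the pipeline "colSums"
```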
As mentioned in the problem formulation, we allow for a small precision error \(\epsilon\) between the cells in S (generated by \(\hat{f}\)) and the cells that are already present in B.
Use of text hints
In some cases the user may provide a few words describing what she wants to do. This can be very helpful for giving more relevance to those primitives that may be involved in the solution. For instance, if we consider a problem like “compute the correlation of a matrix”, the primitive cor will probably appear in the solution. In our model, this is what we denoted \(p_0(g|T)\). We now explain how we estimate this value.
First, we consider the set of primitives G and, for each of them, we download the text description from the corresponding R package help documentation. For instance, help(“det”) gives the following description for the function det: “det calculates the determinant of a matrix. determinant is a generic function that returns separately the modulus of the determinant, optionally on the logarithm scale, and the sign of the determinant”. In the same way, the description of diag is: “Extract or replace the diagonal of a matrix, or construct a diagonal matrix”. Each of these help texts \(H_g\) is converted into an array by applying a bag-of-words transformation, after removing uninformative words (included in a list of stop words) and performing stemming (reducing inflected words to their word stems).
Secondly, given a short description T of the task we want to solve, we also apply the bag-of-words transformation, remove the stop words and do stemming. Now we have the processed text chunks \(H_g\) for each \(g \in G\) and the processed text chunk T. We extract the vocabulary V from all these text chunks.
Thirdly, we apply the TF-IDF conversion (Salton and Buckley 1988) to all vectors \(H_g\) and T using the same vocabulary V. TF-IDF gives more relevance to more informative words. This leads to a word vector \(\mathbf {h}_g\) for each \(g \in G\) and a word vector \(\mathbf {t}\). As an example, in Fig. 2 we can see the frequent terms for these two functions, as represented by their TF-IDF values. For instance, for the function det, it is clear that when the word determinant (or its stemmed form determin) appears in a text hint, the function should have a higher probability of being required for the solution than other functions, such as diag.
Finally, for each g we calculate the cosine similarity \(s(H_g,T)\) between \(\mathbf {h}_g\) and \(\mathbf {t}\). We normalise the |G| similarities to sum up to one as follows:
$$\begin{aligned} p_0(g|T) = \frac{s(H_g,T)}{\sum _{g \in G} s(H_g,T)} \end{aligned}$$
This estimate is used for Eq. 2.
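The whole pipeline can be sketched in a few lines of R. This is a simplified, self-contained version: the function names (tokens, p0_hint) are ours, the stop-word list is a toy one, and stemming is omitted (in practice a stemmer such as the one in the SnowballC package would be applied):

```r
## Lower-case, tokenise and drop stop words (stemming omitted in this sketch).
tokens <- function(txt, stop_words = c("a", "the", "of", "or", "and", "is", "that")) {
  w <- strsplit(tolower(txt), "[^a-z]+")[[1]]
  w[nchar(w) > 0 & !(w %in% stop_words)]
}

## Estimate p0(g|T): TF-IDF over the help texts H_g and the hint T,
## cosine similarity per primitive, normalised to sum to one.
p0_hint <- function(helps, T_hint) {         # helps: named list of help texts H_g
  docs <- c(lapply(helps, tokens), list(T = tokens(T_hint)))
  V   <- unique(unlist(docs))                # shared vocabulary
  tf  <- sapply(docs, function(d) sapply(V, function(v) sum(d == v) / length(d)))
  idf <- log(length(docs) / rowSums(tf > 0))
  w   <- tf * idf                            # TF-IDF weights, one column per text
  cos <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
  s   <- sapply(names(helps), function(g) cos(w[, g], w[, "T"]))
  s / sum(s)                                 # normalised similarities, Eq. above
}

p0_hint(list(det  = "det calculates the determinant of a matrix",
             diag = "extract or replace the diagonal of a matrix"),
        "compute the determinant")           # det gets all the probability mass
```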