Introduction

The estimation of voter shifts (stayers and switchers) between elections is an active area of research that has attracted the interest of many scholars for decades, both during the twentieth century (e.g., Vangrevelinghe 1961; Hawkes 1969; Irwin and Meeter 1969; McCarthy and Ryan 1977; Upton 1978; Brown and Payne 1986; Tziafetas 1986; Thomsen 1987; Johnston and Pattie 1991, 1993; Füle 1994) and the twenty-first century (e.g., Wellhofer 2001; Antweiler 2007; Park 2008; Andreadis and Chadjipadelis 2009; Forcina et al. 2012; Park et al. 2014; Russo 2014; Puig and Ginebra 2014, 2015; Klima et al. 2016, 2019; Plescia and De Sio 2018; Klein 2019; Abou-Chadi and Stoetzer 2020; Pavía and Aybar 2020; Romero et al. 2020; Thurner et al. 2020; Pavía and Romero 2022a, 2022c; Sandoval and Ojeda 2022; Thurner et al. 2022; Vizcaino and Pavía 2022).

The voter transitions among the \(J\) election options available in an election E1 held at time \(t\) and the \(K\) election options of another election E2, usually held at a later time \(t+1\), are typically summarised in a \(J\times K\) row-standardised proportion (probability) matrix \({\varvec{P}}=\left[{p}_{jk}\right]\), where \({p}_{jk}\) represents the proportion of electors in the entire electoral space who chose (are classified in) option \(k\) in E2 among those who chose (are classified in) option \(j\) in E1. This matrix is usually unknown. Therefore, given its relevance to multiple agents (among others, party teams, the media and political scientists), it is routinely approximated using models and/or from polls (e.g., Klima et al. 2019; Abou-Chadi and Stoetzer 2020; Thurner et al. 2020).

When surveys are used to approximate this matrix, it is not uncommon for the estimated matrix to be inconsistent and even incomplete (e.g., Park et al. 2014; Russo 2014; Abou-Chadi and Stoetzer 2020). Inconsistency means that discrepancies emerge between the actual results recorded in E2 and the outcomes attained after applying the estimated probabilities to the results registered in E1; this is evidenced by examining the differences between estimated and real percentages. Incompleteness arises when estimates are unavailable for some proportions, either because of small sample sizes or because they are impossible to derive even from surveys. When this happens, analysts are usually interested in correcting these flaws to achieve a new estimated matrix that is both consistent and complete.

Two routes have mainly been followed to solve this problem: adjusting initial transfer probabilities using the Iterative Proportional Fitting (IPF) algorithm (e.g., Park 2008; Pavía and Aybar 2020; Thurner et al. 2020) or combining aggregate and a priori (survey) estimates within a hierarchical Bayesian statistical model (Greiner and Quinn 2010; Klima et al. 2019; Thurner et al. 2022). The first approach is quite simple, but it can only solve the inconsistency problem and is not free of weaknesses. The main limitation of IPF is its inability to move initial zero estimates, an issue that sometimes prevents the algorithm from converging (Thurner et al. 2020). In contrast, the second approach can solve both problems (inconsistency and incompleteness), but it is significantly more complex and data-demanding. Furthermore, even when the data are available, analysts skilled in Bayesian methods are still needed to properly tune the models.
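To make the zero-cell limitation concrete, the following minimal R sketch (our own illustration, not code from any of the cited packages) implements textbook IPF: it alternately rescales the rows and columns of a seed matrix towards the target margins. Because every update multiplies existing entries, a zero in the seed can never become positive, and when such structural zeros make the margins unattainable the loop simply exhausts its iterations without converging.

# Textbook IPF: alternate row and column rescaling of a seed matrix
# (assumes strictly positive row and column sums in the seed).
ipf <- function(seed, row_targets, col_targets, tol = 1e-10, max_iter = 1000) {
  M <- seed
  for (it in 1:max_iter) {
    M <- M * (row_targets / rowSums(M))              # scale rows to row targets
    M <- sweep(M, 2, col_targets / colSums(M), `*`)  # scale columns to column targets
    if (max(abs(rowSums(M) - row_targets)) < tol) break  # columns already match exactly
  }
  M  # any zero cell of 'seed' is still zero here
}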

This paper develops, within a linear programming framework, a family of models to achieve consistency and completeness in a fairly simple way. All that is required is a (possibly incomplete) vote transfer matrix of initial estimates and two vectors of (row and column) constraints; these vectors usually correspond to the actual results recorded in elections E1 and E2. The new models, which solve the incompleteness problem and are as simple to use as IPF, cover all the scenarios concerning the available a priori information (in terms of a priori proportions and their confidence) that can reasonably be considered. Interested analysts can easily use these new models with the help of the function lp_apriori of the R-package lphom (Pavía and Romero 2022b).
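As an orientation, a call could look like the following sketch; the toy data and the argument names are assumptions made for illustration, and the exact interface should be checked in the lp_apriori documentation.

library(lphom)

# Toy aggregates for a single unit (I = 1): two parties plus abstention in
# both elections (illustrative figures; interface details are assumptions).
x  <- matrix(c(520, 390, 90), 1, 3, dimnames = list(NULL, c("PA", "PB", "Abst")))
y  <- matrix(c(480, 430, 90), 1, 3, dimnames = list(NULL, c("PA", "PB", "Abst")))
P0 <- matrix(c(0.85, 0.10, 0.05,   # a priori row-standardised transfers (poll)
               0.05, 0.90, 0.05,
               0.10, 0.10, 0.80), 3, 3, byrow = TRUE)

adjusted <- lp_apriori(x, y, P0, weights = "constant")
str(adjusted)  # inspect the components of the returned solution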

The data

Without loss of generality, we assume that the aggregated electoral outcomes of a partition into \(I\) territorial units of the electoral space are known and that \(J\) and \(K\) are the total number of possible options/outcomes/situations in E1 and E2, respectively. In both cases, we consider abstention as a possible voting option as well as other situations, such as new entries or exits in the census. Although, in general, having the aggregate data of the entire electoral space (i.e., dealing with the case \(I=1\)) is enough to solve the problem (see the examples in the supplementary material), having electoral outcomes of more units can be useful in the process of approximating (net) figures for entries and exits, when these are not available.

To solve the problem, we need to handle two sources of data: aggregate recorded/estimated figures in each unit in both elections, \({\varvec{X}}=[{x}_{ij}]\) and \({\varvec{Y}}=[{y}_{ik}]\), and a priori information, \({{\varvec{P}}}^{0}=\left[{p}_{jk}^{0}\right]\).

On the one hand, for each of the \(i = 1,\dots , I\) territorial units we need the (available) recorded/estimated statistics of voters/electors \({x}_{ij}\) (for \(j = 1,\dots ,{ J}_{x}\), with \({J}_{x}\le J\)) corresponding to the election options in E1 and, equally, similar (available) figures \({y}_{ik}\) (for \(k = 1,\dots ,{K}_{y}\), with \({K}_{y}\le K\)) corresponding to E2. Note that \({J}_{x}\le J\) and/or \({K}_{y}\le K\) because, sometimes, we will not have data on some collectives, such as census ins and outs.

On the other hand, we need a matrix \({{\varvec{P}}}^{0}=\left[{p}_{jk}^{0}\right]\) of order \({J}_{0}\times {K}_{0}\) (where \({J}_{0}\le J\) and \({K}_{0}\le K\)) of a priori transfers between E1 and E2. This matrix can come from a poll or from another source. In this matrix, the \({p}_{jk}^{0}\) transfers are usually expressed in the form of row-standardised proportions, although a cross-classified matrix of counts, probably derived from a poll, is also allowed in our specifications. This matrix can contain missing values.

The basic unknowns are the quantities \({p}_{jk}\). The goal is to derive a consistent and complete row-standardised matrix of proportions, \(\widehat{{\varvec{P}}}\), as close as possible to \({{\varvec{P}}}^{0}\). To do this, we consider eight different scenarios (some of them with several variants) depending on the information available in \({\varvec{X}}\) and \({\varvec{Y}}\). To the five scenarios defined in Romero et al. (2020)—simultaneous, raw, regular, full and gold—we add three additional scenarios: ordinary, enriched and semifull.

As in Romero et al. (2020), we consider that entries and exits of the census can each have two different sources. On the one hand, entries in each territorial unit \(i\) (\({E}_{i}\)) are the sum of two groups: young electors newly entitled to vote (\({N}_{i}\)) and new residents (immigrants, \({I}_{i}\)) that have the right to vote. On the other hand, exits (\({X}_{i}\)) are also made up of two groups: voters registered in E1 who have died before E2 (\({D}_{i}\)) and people who have emigrated during the inter-election period (\({M}_{i}\)). The scenarios (and their variants) basically differ regarding the information available for these groups. In the raw, regular, ordinary and enriched scenarios, the row-aggregations of \({\varvec{X}}\) and \({\varvec{Y}}\) will typically differ, and data about net entries and/or net exits should be derived from the available information to guarantee congruence.

In general, denoting the total censuses corresponding to unit \(i\) in both elections by \({C1}_{i}\) and \({C2}_{i}\), the following accounting equalities can be expressed:

$${C1}_{i}+{E}_{i}-{X}_{i}={C2}_{i} \iff {C1}_{i}+\left({N}_{i}+{I}_{i}\right)-{(D}_{i}+{M}_{i})={C2}_{i}$$

Hence, given that \({C1}_{i}\) and \({C2}_{i}\) are known, depending on the level of information available regarding the remaining components, we can estimate: \({b}_{i}={E}_{i}-{X}_{i}\), whose absolute value corresponds to either total net entries, if \({b}_{i}>0\), or total net exits, if \({b}_{i}<0\); \({c}_{i}={N}_{i}-{D}_{i}\), whose absolute value corresponds to either net entries other than immigrants, if \({c}_{i}>0\), or net exits other than emigrants, if \({c}_{i}<0\); and \({d}_{i}={I}_{i}-{M}_{i}\), whose absolute value corresponds to either net entries other than new voters by age, if \({d}_{i}>0\), or net exits other than deaths, if \({d}_{i}<0\). Aggregating these quantities over units yields estimates of the sizes of these collectives in the entire population. In general, we recommend working with small units to avoid underestimations (in absolute value) of the \({b}_{i}\)'s and \({c}_{i}\)'s and with large units to avoid overestimations of the \({d}_{i}\)'s. In any case, since, as a rule, (i) these groups tend to be marginal, (ii) the focus is not usually on their electoral behaviour and (iii) it is usually simple to obtain accurate estimates of \({D}_{i}\) and \({M}_{i}\) from demographic statistics (see, e.g., Pavía and Veres-Ferrer 2016a, b), it is advisable to work with large units. This also has the advantage of reducing the costs of data wrangling (Klima et al. 2016).
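As a worked illustration of these identities (with toy figures), note that once \({C1}_{i}\) and \({C2}_{i}\) are known and, say, \({N}_{i}\) and \({D}_{i}\) are taken from demographic statistics, the remaining net migration component is recovered as a residual:

# Toy data for I = 2 units (illustrative only)
C1 <- c(10500, 8200); C2 <- c(10620, 8150)  # censuses in E1 and E2
N  <- c(300, 250);    D  <- c(280, 320)     # new-age electors and deaths

b_i <- C2 - C1    # E - X: net entries (b > 0) or net exits (b < 0)
c_i <- N - D      # N - D: net entries/exits other than migration
d_i <- b_i - c_i  # I - M: residual net migration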

The scenarios

In this section, we describe the characteristics of the different scenarios and detail how the data should be arranged in each of them. In all cases, we assume that the rows of \({{\varvec{P}}}^{0}\) and the columns of \({\varvec{X}}\) are sorted in the same order for the first \(\min({J}_{0},{J}_{x})\) options and that the same holds between the columns of \({{\varvec{P}}}^{0}\) and the columns of \({\varvec{Y}}\) for the first \(\min({K}_{0},{K}_{y})\) options. These arrangements are transferred to \(\widehat{{\varvec{P}}}\).

So as to keep the complex discussion of scenarios simple and to avoid a case-by-case treatment of \({{\varvec{P}}}^{0}\), we assume without loss of generality that (i) the matrix \({{\varvec{P}}}^{0}\) has been derived from a poll (i.e., \({{\varvec{P}}}^{0}\) encloses no information about exits and therefore \({K}_{0}\le {K}_{y}\)) and (ii) minor voting options are grouped in \({\varvec{X}}\), \({\varvec{Y}}\) and \({{\varvec{P}}}^{0}\) under an 'others' option. Although no hypothesis is stated regarding the relationship between \({J}_{0}\) and \({J}_{x}\), in general \({J}_{0}\ge {J}_{x}\) when information about the electoral behaviour of new voters is available.

Simultaneous scenario

This is the simplest scenario. In this case, the same electors are entitled to vote in E1 and E2, usually because both elections are held at the same time (Pavía and Romero 2022c). In this scenario, the sums by rows of \({\varvec{X}}\) and \({\varvec{Y}}\) must coincide and the special constraints on the coefficients of the matrix (such as the ones stated in Eqs. (6)–(15), introduced in Sect. 5) do not apply. Here \(J={J}_{x}\) and \({K=K}_{y}\), and the figures \({N}_{i}\), \({I}_{i}\), \({D}_{i}\) and \({M}_{i}\) are null by definition, so \({b}_{i}={c}_{i}={d}_{i}=0\). This defines a basic scenario (type I), which can be solved with the basic model defined by Eqs. (1)–(5), a model with no specific constraints on the proportions.

Raw scenario

This scenario accounts for the most common situation: two elections separated in time, where only the raw election data recorded in the \(I\) territorial units into which the area under study is divided are available. In this scenario, net exits (deaths and emigrants) and net entries (new young voters and immigrants) are estimated using the \({b}_{i}\) coefficients. Four variants can occur, depending on the values of the \({b}_{i}\)'s. If \({b}_{i}=0 \;\; \forall i\), this case collapses to a simultaneous scenario. If \({b}_{i}\ge 0 \;\; \forall i\), with at least one \({b}_{i}\) strictly positive, \(J={J}_{x}+1\) and \({K=K}_{y}\) and the resulting problem can also be solved using the basic model. If \({b}_{i}\le 0 \;\; \forall i\), with at least one \({b}_{i}\) strictly negative, \(J={J}_{x}\) and \({K=K}_{y}+1\). In this case, it seems reasonable to follow Romero et al. (2020) and assume that exits (mainly a consequence of mortality) impact all the political options uniformly: \({p}_{1,K}={p}_{2,K}=\dots = {p}_{J,K}\). We call this kind of scenario type II. If the uniform hypothesis is not assumed, the scenario collapses to a type I scenario. Finally, if both strictly positive and strictly negative \({b}_{i}\)'s are obtained, then \(J={J}_{x}+1\) and \({K=K}_{y}+1\). In this case, the \({J}^{th}\) option of E1 will correspond to (net) new entries and the \({K}^{th}\) option of E2 to (net) exits. Since new entries cannot be exits, the logical constraint \({p}_{J,K}=0\) applies. This gives rise to two new types of scenarios. If the uniform hypothesis is assumed (\({p}_{1,K}={p}_{2,K}=\dots = {p}_{J-1,K}\)), the type III scenario appears; otherwise, this gives a type IV scenario.

Regular scenario

Initially, this scenario looks similar to the raw scenarios; it only differs in the information provided regarding E1, which is enlarged. In this case, the data contained in the last column of \({\varvec{X}}\) refer to, without loss of generality, new young electors who have the right to vote for the first time (formally, the discussion does not change if they refer to immigrants instead of new young electors). Here, again, new variants arise depending on the information we can derive for net entries (other than new young electors) and net exits. Without net entries and net exits, we are formally in a type I scenario, with \(J={J}_{x}\) and \({K=K}_{y}\). With net exits but not net entries, we have \(J={J}_{x}\) and \({K=K}_{y}+1\) and we are in either a type III scenario if the uniform hypothesis is assumed or a type IV scenario if it is not. With net entries but not net exits, \(J={J}_{x}+1\) and \({K=K}_{y}\) and we are again in a type I scenario. Finally, if we have both net entries and net exits, the constraints \({p}_{J-1,K}={p}_{J,K}=0\) must be imposed by logic. In this last case, where \(J={J}_{x}+1\) and \({K=K}_{y}+1\), we can consider two variants depending on whether or not the uniform hypothesis (which this time corresponds to \({p}_{1,K}={p}_{2,K}=\dots = {p}_{J-2,K}\)) is imposed. This last case defines two new basic types of scenarios: if the uniform hypothesis is imposed, we are in a type V scenario; if it is not, in a type VI scenario.

Ordinary scenario

This scenario is also an extension of a raw scenario, but this time with more information available regarding E2. In this case, the data contained in the last column of \({\varvec{Y}}\) refer to, without loss of generality, exits from the census due to death (formally, the discussion does not change if they refer to emigrants). As usual, new variants arise depending on the information we can derive for net entries and net exits (other than deaths). Without net entries and net exits, \(J={J}_{x}\) and \({K=K}_{y}\) and we are formally in either a type II scenario if the uniform hypothesis about exits is imposed or a type I scenario without this hypothesis. With net entries but not net exits, \(J={J}_{x}+1\) and \({K=K}_{y}\) and we are in either a type III scenario with the uniform hypothesis or a type IV scenario without it. With net exits but not net entries, we have \(J={J}_{x}\) and \({K=K}_{y}+1\) and we are in either a type I scenario if the uniform hypothesis is not assumed or a new type of scenario if it is; we call this scenario type VII. In type VII scenarios, we have two possible types of exits (death and emigration) and the uniform hypothesis applies to both of them: \({p}_{1,K-1}={p}_{2,K-1}=\dots = {p}_{J,K-1}\) and \({p}_{1,K}={p}_{2,K}=\dots = {p}_{J,K}\). Finally, if we have both net entries and net exits, the constraints \({p}_{J,K-1}={p}_{J,K}=0\) need to be imposed, \(J={J}_{x}+1\) and \({K=K}_{y}+1\), and two new basic types of scenarios arise, depending on whether or not the uniform hypothesis (which this time crystallises into \({p}_{1,K-1}={p}_{2,K-1}=\dots = {p}_{J-1,K-1}\) and \({p}_{1,K}={p}_{2,K}=\dots = {p}_{J-1,K}\)) is assumed. If this hypothesis is assumed, we are in a type VIII scenario; if it is not, in a type IX scenario.

Enriched scenario

This scenario extends raw scenarios in the two directions analysed in the regular and ordinary scenarios. Here, without loss of generality, we have data on new young electors in the last column of \({\varvec{X}}\) and on deaths in the last column of \({\varvec{Y}}\). Again, different variants arise depending on whether net additional entries (due to immigration) and/or net additional exits (due to emigration) are estimated. With neither net entries nor net exits, we are in a type III scenario if the uniform hypothesis is assumed and in a type IV scenario without it. With net entries but not net exits, the scenarios that emerge are type V and type VI, with and without the uniform hypothesis, respectively. With net exits but not net entries, the scenarios that arise are of types VIII and IX, with and without the uniform hypothesis, respectively. Finally, with both net entries and net exits, two new basic scenarios emerge. In this case, it is necessary to include by logic \({{p}_{J-1,K-1}={p}_{J-1,K}=p}_{J,K-1}={p}_{J,K}=0\) and, if the uniform hypothesis is assumed, also \({p}_{1,K-1}={p}_{2,K-1}=\dots = {p}_{J-2,K-1}\) and \({p}_{1,K}={p}_{2,K}=\dots = {p}_{J-2,K}\). We call the scenario with all the above constraints type X and the corresponding scenario containing only the logical constraints type XI.

Semifull scenario

In this scenario, the analyst has aggregate information about both total entries to the census list (young electors and new immigrants) and total exits from it (deaths and emigrants). Total entries and exits are assumed to be in the last columns of \({\varvec{X}}\) and \({\varvec{Y}}\), respectively. Here, the sums by rows of \({\varvec{X}}\) and \({\varvec{Y}}\) must agree and there are only two variants: if the uniform hypothesis is assumed, we are in a type III scenario; otherwise, in a type IV scenario.

Full scenario

In this scenario, the analyst has detailed information about the totals of new young electors entitled to vote for the first time (penultimate column of \({\varvec{X}}\)) and of new immigrants with the right to vote (last column of \({\varvec{X}}\)), as well as aggregate information about total exits (due to death or emigration) from the census lists (last column of \({\varvec{Y}}\)). Here, again, the sums by rows of \({\varvec{X}}\) and \({\varvec{Y}}\) must agree and there are only two variants: if the uniform hypothesis is assumed, we are in a type V scenario; if not, in a type VI scenario.

Gold scenario

This scenario is similar to a full scenario, but here total exits are separated into exits due to emigration and exits due to death (penultimate and last columns of \({\varvec{Y}}\)). Again, the sums by rows of \({\varvec{X}}\) and \({\varvec{Y}}\) must agree and there are only two variants, depending on whether or not the uniform hypothesis is assumed: we are in a type X scenario under this hypothesis and in a type XI scenario otherwise.

Schematic representation of scenarios

The above discussion is quite dense. We have considered eight different scenarios regarding the information available for \({\varvec{X}}\) and \({\varvec{Y}}\) and, after considering their different variants, this gives thirty-five possibilities that collapse into eleven basic structures for the estimated transfer matrix. Table 1 schematically summarises the cases discussed.

Table 1 Summary of the scenarios included in the lp_apriori function

The model

Whatever the scenario and its linked constraints, our approach reaches its solutions by solving a linear programming model whose objective is to minimize a weighted sum of the absolute discrepancies between the pairs \({p}_{jk}\) and \({p}_{jk}^{0}\). The model imposes the consistency property (Eqs. (1)–(3)) on the solution and does not necessarily treat all the discrepancies, defined in Eq. (4), symmetrically: they are weighted in the objective function (see Eq. (5)). Because the level of confidence in the a priori proportions, \({p}_{jk}^{0}\), is not usually the same for all of them, we inform the model about this through weights, \({\omega }_{jk}>0\). In particular, the basic model is defined by the system of Eqs. (1)–(5).

$${p}_{jk}\ge 0\;\;for\;j=1,\dots ,J \; k=1,\dots ,K$$
(1)
$$\sum_{k=1}^{K}{p}_{jk}=1\,for\,j=1,\dots ,J$$
(2)
$$\sum_{j=1}^{J}{p}_{jk}{x}_{\cdot j}={y}_{\cdot k}\,for\,k=1,\dots ,K$$
(3)
$$\left({p}_{jk}^{0}-{p}_{jk}\right)={e}_{jk}^{+}-{e}_{jk}^{-}\;\;for\,j=1,\dots ,J \; k=1,\dots ,K$$
(4)
$$minimize\,Z=\sum_{j=1}^{J}\sum_{k=1}^{K}{ \omega }_{jk}({e}_{jk}^{+}+{e}_{jk}^{-})$$
(5)

where \({x}_{\cdot j}=\sum_{i=1}^{I}{x}_{ij}\) and \({y}_{\cdot k}=\sum_{i=1}^{I}{y}_{ik}\).

Although, for simplicity, the above mathematical representation of the problem states that Eqs. (4) and (5) range over all the values of \(j\) and \(k\), in practice they are only defined for the pairs of indexes \((j,k)\) for which \({p}_{jk}^{0}\) and \({\omega }_{jk}\) are available.
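For concreteness, the following R sketch (our own minimal translation of Eqs. (1)–(5), not the implementation used in lphom) sets the basic model up for the lpSolve solver. It assumes a complete \({{\varvec{P}}}^{0}\) and weight matrix; lp_apriori additionally handles the missing-entry bookkeeping that this sketch omits.

library(lpSolve)

# Basic model of Eqs. (1)-(5). Variables: p (J*K), e+ (J*K), e- (J*K), all
# nonnegative by lpSolve's default, which also enforces Eq. (1).
adjust_lp <- function(x, y, P0, W = matrix(1, length(x), length(y))) {
  J <- length(x); K <- length(y); n <- J * K
  idx <- function(j, k) (k - 1) * J + j      # column-major position of p_jk

  obj <- c(rep(0, n), as.vector(W), as.vector(W))  # Eq. (5): weights on e+, e-

  A <- matrix(0, J + K + n, 3 * n)
  for (j in 1:J) for (k in 1:K) A[j, idx(j, k)] <- 1         # Eq. (2): rows sum to 1
  for (k in 1:K) for (j in 1:J) A[J + k, idx(j, k)] <- x[j]  # Eq. (3): E2 totals
  for (v in 1:n)                                             # Eq. (4): p + e+ - e- = p0
    A[J + K + v, c(v, n + v, 2 * n + v)] <- c(1, 1, -1)
  rhs <- c(rep(1, J), y, as.vector(P0))

  sol <- lp("min", obj, A, rep("=", nrow(A)), rhs)
  matrix(sol$solution[1:n], J, K)  # the adjusted matrix
}

# Toy example with sum(x) == sum(y)
x  <- c(520, 390, 90); y <- c(480, 430, 90)
P0 <- matrix(c(0.85, 0.10, 0.05,
               0.05, 0.90, 0.05,
               0.10, 0.10, 0.80), 3, 3, byrow = TRUE)
round(adjust_lp(x, y, P0), 3)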

In addition to the above equations, depending on the type of scenario we are in, the system will also include (except in type I scenarios) some of the constraints defined in Eqs. (6)–(15). The interested reader can find in Table 1 the relationships between the scenarios and their associated additional constraints.

$${p}_{1,K}={p}_{2,K}=\dots = {p}_{J,K}$$
(6)
$${p}_{1,K-1}={p}_{2,K-1}=\dots = {p}_{J,K-1}$$
(7)
$${p}_{1,K}={p}_{2,K}=\dots = {p}_{J-1,K}$$
(8)
$${p}_{1,K-1}={p}_{2,K-1}=\dots = {p}_{J-1,K-1}$$
(9)
$${p}_{1,K-1}={p}_{2,K-1}=\dots = {p}_{J-2,K-1}$$
(10)
$${p}_{1,K}={p}_{2,K}=\dots = {p}_{J-2,K}$$
(11)
$${p}_{J-1,K-1}=0$$
(12)
$${p}_{J-1,K}=0$$
(13)
$${p}_{J,K-1}=0$$
(14)
$${p}_{J,K}=0$$
(15)

The weights

The weight \({\omega }_{jk}>0\) is the penalty assigned to the discrepancy \(\left|{p}_{jk}^{0}-{p}_{jk}\right|\), either \({e}_{jk}^{+}\) or \({e}_{jk}^{-}\), in Eq. (5). This coefficient measures the (relative) degree of confidence we have in the a priori value \({p}_{jk}^{0}\). Everything else being constant, the greater this weight, the closer the estimated \({p}_{jk}\) and a priori \({p}_{jk}^{0}\) proportions will be. This means that setting the weight \({\omega }_{jk}\) to an arbitrarily large number will be, in many practical situations, equivalent to including an additional restriction in the model: the constraint \({p}_{jk}={p}_{jk}^{0}\).

Although the weights can be stated exogenously by the user and introduced into the model through a matrix of weights, \({\varvec{W}}=[{ \omega }_{jk}]\), the lp_apriori function also offers programmed weightings derived from contextual information. In addition to the possibility of introducing personalised weights through a matrix, lp_apriori includes seven more ways of computing the weights, denoted by "constant", "x", "xy", "expected", "counts", "sqrt" and "sd".

When weights are set to "constant", the same credibility is attached to all the a priori proportions and, by default, all the weights are set equal to 1. In practice, however, analysts tend to have more information on the proportions related to the largest groups: for example, a survey is more likely to interview people who belong to the groups whose election options are more frequent. The "x", "xy" and "expected" strategies exploit this fact through different approaches. In particular, the "x", "xy" and "expected" weights are computed, respectively, as \({\omega }_{jk}\propto {x}_{\cdot j}\), \({\omega }_{jk}\propto {x}_{\cdot j}{y}_{\cdot k}\) and \({\omega }_{jk}\propto {x}_{\cdot j}{p}_{jk}^{0}\).

The a priori information is typically introduced to lp_apriori through a row-standardised matrix of proportions, but it can also be supplied as a matrix of counts, \([{n}_{jk}]\), probably derived from a poll. When this happens, lp_apriori internally transforms the a priori data into a row-standardised matrix of proportions, and the counts can additionally be used to define the weights. In this case, the weights are defined according to the (underlying) sampling properties of the counts. The "counts" and "sqrt" weights are computed proportional to, respectively, \({n}_{jk}+\frac{1}{2}\) and \(\sqrt{{n}_{jk}+0.5}\), with the definition of the "sd" weights being slightly more complex: \({\omega }_{jk}\propto \sqrt{\frac{{ n}_{jk}+0.5}{{p}_{jk}^{0}(1-{p}_{jk}^{0})}}\) if \({p}_{jk}^{0}\ne 0\) and \({\omega }_{jk}\propto \sqrt{2}\) if \({n}_{jk}={p}_{jk}^{0}=0\).
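In R terms, and using a toy count matrix for illustration, these weightings can be sketched as follows (the case \({p}_{jk}^{0}=1\) is not covered by the definitions above and is handled here like the zero case, which is our own assumption):

# Toy poll counts N0 and toy aggregate totals xj, yk (illustrative only)
N0 <- matrix(c(120, 10,  5,
                15, 90,  8,
                 3,  7, 60), 3, 3, byrow = TRUE)
P0 <- N0 / rowSums(N0)           # row-standardised a priori proportions
xj <- c(14000, 11000, 7000); yk <- c(13500, 10800, 7700)

w_x        <- matrix(xj, 3, 3)   # "x": proportional to x.j, constant within rows
w_xy       <- outer(xj, yk)      # "xy": proportional to x.j * y.k
w_expected <- xj * P0            # "expected": proportional to x.j * p0_jk
w_counts   <- N0 + 0.5           # "counts"
w_sqrt     <- sqrt(N0 + 0.5)     # "sqrt"
w_sd       <- ifelse(P0 > 0 & P0 < 1,        # "sd"
                     sqrt((N0 + 0.5) / (P0 * (1 - P0))),
                     sqrt(2))    # zero (and, by assumption, one) cells get sqrt(2)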

Assessing the models with real data

This section assesses, using real data, whether and to what extent the adjustment corrects the biases introduced by polls when estimating vote transfer matrices. The impact of sample size and of the different weight structures considered in the previous section on bias reduction is also analysed. To do this, a realistic random sample is simulated from each of the 565 elections available in the R-package ei.Datasets (Pavía 2022), and the corresponding sample-estimated and model-adjusted transition matrices are compared with the actual ones.

The main difficulty when assessing methods to estimate vote transfer matrices between elections lies in the fact that, due to the secret ballot, actual cross-distributions of votes are as a rule unknown. In simultaneous elections, however, when the same electors cast their votes in the same ballot for several elections, actual vote transfer matrices can be collected. The ei.Datasets database gathers the 565 real crosstabs of votes corresponding to the party-to-candidate cross-distributions recorded in the 492 electorates of the New Zealand general elections held between 2002 and 2020 and in the 73 constituencies of the 2007 Scottish Parliament election. Hence, the ei.Datasets crosstabs are exploited to answer the previous questions.

A random sample of size \(n\) (with \(n = 250, 500\, \mathrm{or} \, 1000\)) is simulated from each election, assuming that polls suffer from both non-response bias and response error. Given that the average size of the populations in ei.Datasets is ~ 33,192 voters, sample sizes of this order seem reasonable. Indeed, the minimum sample size recommended in NZ for electorate/regional polls is 250 (Research Association New Zealand 2020), with 500 being the standard sample size. Likewise, it is reasonable to assume that polls are impacted both by differential response rates (non-response bias), which depend on voters' preferences and characteristics, and by response error, due to social desirability issues or inaccurate vote recall. Indeed, there is a large stream of literature documenting this impact in real-world surveys all around the globe (see, e.g., Groves et al. 2002; Pavía et al. 2016; Cavari and Freedman 2022). The simulated samples are summarised in two-way contingency tables with parties in rows and candidates in columns, with small parties and candidates (those who gain less than 3% of the votes in the election) grouped under the "others" option. The row-standardised versions of these tables correspond to the sample-estimated transition matrices.
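One simple way to generate such samples (a sketch of the general idea, not necessarily the exact design used in the supplementary material) is to draw multinomial counts from the true crosstab after distorting the cell probabilities with response propensities; response error (misreported cells) would additionally move counts between cells and is omitted here.

# Draw a biased sample of size n from a true crosstab V: cells are selected
# with probability proportional to V times a cell-specific response propensity
# (here random, purely for illustration).
sample_crosstab <- function(V, n,
                            propensity = matrix(runif(length(V), 0.5, 1), nrow(V))) {
  probs <- as.vector(V * propensity)
  matrix(rmultinom(1, n, probs / sum(probs)), nrow(V), dimnames = dimnames(V))
}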

The estimated transition matrices are then adjusted to match the actual distributions of votes recorded in each election. The adjustments are made with the lp_apriori function of the lphom package, testing the seven precomputed weightings available in the function: "constant", "x", "xy", "expected", "counts", "sqrt" and "sd". At the end of this process, we have eight estimated vote transition matrices for each election: the one that comes directly from the poll and the seven attained after adjusting it. These matrices are compared with the actual transitions using three distance statistics, \(EI\), \(EPW\) and \(MAN\), given by Eqs. (16)–(18). The interested reader can replicate all the computations and find more in-depth information in the reproducible R code available in the supplementary material (see the Data availability statement).

$$EI=100\cdot \frac{0.5{\sum }_{j=1}^{J}{\sum }_{k=1}^{K}\left|{v}_{jk}-{\widehat{v}}_{jk}\right|}{{\sum }_{j=1}^{J}{\sum }_{k=1}^{K}{v}_{jk}}$$
(16)
$$EPW=100\cdot \frac{{\sum }_{j=1}^{J}{\sum }_{k=1}^{K}{v}_{jk}\left|{p}_{jk}-{\widehat{p}}_{jk}\right|}{{\sum }_{j=1}^{J}{\sum }_{k=1}^{K}{v}_{jk}}$$
(17)
$$MAN=100\cdot \frac{{\sum }_{j=1}^{J}{\sum }_{k=1}^{K}\left|{p}_{jk}-{\widehat{p}}_{jk}\right|}{JK}$$
(18)

where \({v}_{jk}\) and \({\widehat{v}}_{jk}\) stand for, respectively, the actual and estimated/adjusted numbers of voters voting simultaneously for party \(j\) and candidate \(k\); and \({p}_{jk}\) and \({\widehat{p}}_{jk}\) represent, respectively, the actual and estimated/adjusted proportions of voters who vote for candidate \(k\) among those who vote for party \(j\).

While \(EI\) measures distances between transfer matrices of votes, \(EPW\) and \(MAN\) deal with row-standardised transition matrices. \(EI\) accounts for the percentage of wrongly assigned votes in the estimated matrix: the minimum percentage of votes that should be moved among the cells of the table to reach a perfect fit. \(MAN\) measures the mean of the absolute differences between actual and estimated transition proportions, and \(EPW\) is a distance similar to \(MAN\), but with the individual differences weighted by the number of votes corresponding to the transfer between party \(j\) and candidate \(k\). The smaller these statistics, the closer the estimated/adjusted and actual matrices.
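Translated directly from Eqs. (16)–(18), the three statistics take a few lines of R; here V and V_hat denote \(J\times K\) matrices of actual and estimated vote counts, and P and P_hat their row-standardised versions.

# Distance statistics of Eqs. (16)-(18); all inputs are J x K matrices
EI  <- function(V, V_hat)    100 * 0.5 * sum(abs(V - V_hat)) / sum(V)
EPW <- function(V, P, P_hat) 100 * sum(V * abs(P - P_hat)) / sum(V)
MAN <- function(P, P_hat)    100 * sum(abs(P - P_hat)) / length(P)  # length(P) = J*K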

Before presenting and analysing the results of the full simulation exercise, the whole process described above is first illustrated through an example. Consider the electorate of Maungakiekie in the 2020 NZ general election, the actual outcomes of which are presented in Table 2. Note that the second panel of Table 2 shows evidence of strategic voting in this election and of a large number of switchers, an issue that emphasizes the importance of knowing the transition matrix (Abou-Chadi and Stoetzer 2020). For instance, 55.32% of the electors who voted for the Green Party also chose to vote in the ballot for the Labour Party candidate (Priyanca Radhakrishnan), and 74.49% of ACT voters chose to vote for the National Party candidate (Denise Lee).

Table 2 Actual Outcomes recorded in Maungakiekie during the 2020 NZ general election

Let us consider a random sample of size 500 selected from this electorate, as summarised in Table 3. By comparing the distributions in the last column and the last row of the upper panel of Table 3 with the equivalent distributions in the lower panels of Table 2, it can be seen that, although the sample captures the general trends of the election, its raw estimates miss the actual results by a wide margin. This sample shows a substantial level of bias.

Table 3 Simulated sample of size 500 in Maungakiekie for the 2020 NZ general election

Furthermore, although the estimates of the marginal distributions can be improved using post-stratification, that is, by using the sample party-to-candidate (candidate-to-party) transition matrix to approximate the candidate (party) distributions conditional on the actual party (candidate) distributions (as can be done in non-simultaneous elections), significant bias still remains in the marginal estimates. This can be confirmed by comparing the distributions in the last column and the last row of the lower panel of Table 3 with the equivalent distributions in the lower panels of Table 2.

If the focus is on the transition matrix, as in our case, its estimation can be improved by adjusting it, i.e., by making it consistent with the actual outcomes. This is evident since the transition matrix in Table 4, obtained after adjusting the data in Table 3 using lp_apriori with weights = "x", is closer to the actual transition matrix (displayed in the lower panel of Table 2) than the equivalent sample transition matrix (see the lower panel of Table 3). For instance, in this example the \(MAN\) distance is reduced from 6.83 to 5.08.

Table 4 Adjusted matrix with "x" weights corresponding to the sample in Table 3

The above example, which corresponds to one of the scenarios that can be replicated using the code in the supplementary material (see the Data availability statement), is quite representative of the full set of scenarios, since 90.44%, 93.81% and 88.50% of the matrix adjustments made with weights = "x" are closer (as measured by \(EI\), \(EPW\) and \(MAN\), respectively) to the actual matrix than the sample matrix. Indeed, as can be seen in Table 5, where the averages of the distance statistics recorded for the 565 elections are shown, grouped by the weights used in lp_apriori and by sample size, adjusting the raw sample matrix leads, as a rule, to more accurate matrix estimates.

Table 5 Average of EI, EPW and MAN distances for the simulated samples by sample size

On average, the results show that adjusting with weights = "x" generates the most accurate solutions, whereas adjusting with weights = "expected" worsens the sampling estimates. As weights = "expected" gives more credibility to the sampling estimates with the largest expected numbers of votes, \({v}_{jk}^{0}={x}_{\cdot j}{p}_{jk}^{0}\), this weighting exacerbates the sampling bias. Despite this result with weights = "expected", adjusting leads on average to more accurate solutions for five of the seven weightings ("constant", "x", "xy", "sqrt" and "sd"). The solutions achieved using the "x" and "sd" weights are the most accurate. These results, moreover, are consistent across sample sizes: the same conclusions are reached regardless of the sample size. The impact of the sample size is seen, as expected, in an improvement in the poll estimates and also in the adjustments. Indeed, as the sample size grows, both the sample and the adjusted estimates improve in the same proportion.

Conclusions

This paper describes a family of models to adjust initial estimates of row-standardised voter transition probabilities to reach consistency and completeness in thirty-five different situations. All of them can be solved with the function lp_apriori available in the R-package lphom (Pavía and Romero 2022b). This package provides several algorithms based on linear programming for estimating, under the homogeneity hypothesis, general \(J\times K\) ecological contingency tables and, in particular, vote transition matrices.

Although the lp_apriori function has been included in the lphom package, it should be noted that the solutions programmed in lp_apriori are conceptually different from the rest of the procedures available in the package. While the models in lp_apriori have been conceived to adjust (initial) estimates by modifying the available estimates as little as possible, making them consistent and complete, the rest of the algorithms of the package have been devised to generate estimates from aggregate data by employing the homogeneity hypothesis. They all, however, share the same mathematical approach to solving the problem (linear programming) and the aim of estimating a transfer matrix of votes.

The suggested models are not only valuable in their own right, as the previous section shows, but they also open the way to overcoming one of the limitations of the lphom-family algorithms. According to Greiner (2007, p. 120), "… a good [ecological inference] method should be flexible enough to incorporate information from surveys or exit polls". While this issue has recently been addressed within the Bayesian ecological inference framework (Greiner and Quinn 2010; Klima et al. 2019), it remains unsolved within the linear programming ecological inference approach, a matter that deserves prompt attention since linear programming methods are much easier to apply. The new models introduced in this paper will make it possible to develop new methods capable of integrating a priori information and aggregate results within the linear programming framework.