1 Introduction

Recent years have witnessed the rapid development of social media and their innovative applications in many fields [1]. For instance, it has been found that the volumes of tweets related to protests on Twitter are associated with real-life protest events [2]. Moreover, film mentions on Twitter can reflect box office revenues [1]. Additionally, public moods extracted from tweets can predict changes in stock markets [3, 4], and a real-time earthquake reporting system was developed by analyzing only tweets [5].

The unprecedented prevalence of social media has driven politicians to make use of this channel to propagate their ideas and political views [6–9] and to approach potential voters more directly. It is not unusual to see election candidates post their daily activities and political ideas on social media and even debate there before and during the campaign. These behaviors attract online discussion from massive numbers of netizens and, compared with traditional polls, offer an easier way to gather wide-ranging public opinions about the candidates. Some research has shown the predictability of election results based on social media information in various countries and regions, including the United States [10–12], the United Kingdom [13], Germany [14], the Netherlands [15], and Korea [16], where netizens’ behaviors and posts on social media were analyzed to infer the election results.

The existing research, however, usually exploits a single information source and uses simple descriptive statistics for election prediction, which easily results in hindsight bias and lacks generality. The way to ameliorate these issues is two-fold. On one hand, multiple sources should be included to obtain heterogeneous information for robust predictions. For instance, the keywords searched in Google represent the attention of the public, and the aggregated volumes can be used to predict trends in influenza [17], stock markets [18, 19], consumer behaviors [20], etc. On the other hand, massive heterogeneous data obtained in real time are often too chaotic to provide consistent predictions; therefore, a method that can fuse the data and deliver robust predictions is indispensable. Our work in this paper is a novel attempt on this front.

We take Taiwan’s 2016 general election as a real-life case. Taiwan adopted direct elections in 1996, and since then, the Kuomintang (KMT) and the Democratic Progressive Party (DPP) have been the two major competing political parties. The KMT pursues a “One China Policy” and the political legitimacy of the “Republic of China”, whereas the DPP takes “Taiwan Independence” as its party program. In 2016, three candidates ran in the general election: Eric Chu from the KMT, Tsai Ing-wen from the DPP, and James Soong from the People First Party (PFP). The election follows the “one man, one vote” principle and the majority rule [21].

This research leverages time series data collected from various mainstream online platforms (i.e., Facebook, Twitter and Google) and visitation traffic to candidates’ campaign pages. These heterogeneous signals represent public opinions and are fed into a Kalman filter [22] to estimate the vote shares of each candidate dynamically. The most efficient signals are then identified based on the signal strengths characterized by the Kalman gain. In addition to prediction, this research attempts to automatically identify the events that most influenced the election by leveraging the event study model that originated in the field of financial research [23].

The results show that the prediction errors for every candidate one day, one week, and one month before the election are no greater than 2.59%, 4.58% and 5.87%, respectively. The results include some interesting findings. First, online signals appear to be more accurate than traditional polls in election prediction, although the polls can still help mitigate the sample bias of netizens. In particular, a simple Facebook “Like” on a candidate’s post is the most significant predictor, whereas the seemingly more informative “Comment” function is much less important. Second, online signals show clear convergence as the final election day approaches. For example, Google keyword searches fluctuated initially but became a strong indicator in the final stage. Third, the bursty events most influential to the campaign are strongly related to cross-strait relations. For instance, while the Xi-Ma meeting reduced support for Tsai Ing-wen by 0.55%, the Chou Tzu-yu flag incident, followed by the apology video one day before the election, increased her votes by 3.66%.

2 Data and measurements

To identify the most popular Internet applications in Taiwan, we referred to professional Internet surveys and web traffic reports from Alexa, comScore and Digital Age (see Additional file 1, Table S1). We selected Facebook, Twitter, Google, and the candidates’ campaign homepages as the “online sensors” of public opinion towards the election and designed several daily updated measurements to characterize these signals over the period from Oct. 31, 2015 to Jan. 16, 2016. A 30-day moving average was applied to each measure to avoid excessive fluctuation. The data sets are available from: https://doi.org/10.6084/m9.figshare.6014159.

Facebook. Facebook is the most popular social platform in Taiwan and provides an easy way for candidates to reach out to a large audience. For each post by a candidate, users can click the “Like” tag to indicate a positive reaction. Hence, we can use the “daily average number of Likes per post” to measure a candidate’s popularity:

$$ s^{c}_{k, \mathrm{FAL}}=\frac{1}{m}\sum ^{m-1}_{j=0}\frac{\sum_{i} {like^{c}_{k-j,i}}/n^{c}_{k-j,\mathrm{FA}}}{\sum_{c}{\sum_{i}{like ^{c}_{k-j,i}}/n^{c}_{k-j,\mathrm{FA}}}}, $$
(1)

where \(like^{c}_{k,i}\) is the number of Likes of post i published by candidate c on day k, \(n^{c}_{k,\mathrm{FA}}\) is the total number of posts published by candidate c on day k, and m is the window length of the moving average. Analogously, we compute the “daily average number of Comments per post” for each candidate as another signal from Facebook:

$$ s^{c}_{k, \mathrm{FAC}} = \frac{1}{m}\sum ^{m-1}_{j=0}\frac{\sum_{i} {Comment^{c}_{k-j,i}}/n^{c}_{k-j,\mathrm{FA}}}{\sum_{c}{\sum_{i}{Comment ^{c}_{k-j,i}}/n^{c}_{k-j,\mathrm{FA}}}}, $$
(2)

where \(Comment^{c}_{k,i}\) is the number of comments on post i published by candidate c on day k.
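Each of the measurements in (1)–(5) follows the same pattern: a daily cross-candidate share smoothed with an m-day moving average. A minimal Python sketch of this computation is given below; the array layout and function name are illustrative assumptions, not taken from the paper’s code.

```python
import numpy as np

def signal_share(raw, m=30):
    """Share-normalized moving-average signal, as in Eqs. (1)-(5).

    raw : array of shape (num_days, num_candidates); entry [k, c] is the raw
          daily quantity for candidate c on day k (per-post average Likes or
          Comments for Facebook, or daily volume for tweets, searches, IP traffic).
    m   : moving-average window length (30 days in the paper).
    Returns an array of the same shape holding each candidate's smoothed daily share.
    """
    raw = np.asarray(raw, dtype=float)
    daily_share = raw / raw.sum(axis=1, keepdims=True)   # normalize across candidates per day
    smoothed = np.empty_like(daily_share)
    for k in range(len(daily_share)):
        lo = max(0, k - m + 1)                           # truncate the window at the start of the series
        smoothed[k] = daily_share[lo:k + 1].mean(axis=0)
    return smoothed
```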

Twitter. We use the three candidates’ names in both Simplified and Traditional Chinese as keywords (see Additional file 1, Table S2) to retrieve tweets from Twitter. The measure “number of tweets mentioning the candidate” is calculated as

$$ s^{c}_{k, \mathrm{TW}}= \frac{1}{m}\sum ^{m-1}_{j=0} \frac{tw^{c}_{k-j}}{\sum_{c}tw^{c}_{k-j}}, $$
(3)

where \(tw^{c}_{k}\) is the volume of tweets about candidate c on day k.

Search Engine. We also obtained search data from Google Trends to trace the evolution of a keyword’s search volume. We used the three candidates’ names in both Simplified and Traditional Chinese as keywords and restricted the search source to Taiwan. The measurement “search index ratio” is defined as

$$ s^{c}_{k,\mathrm{GO}} = \frac{1}{m}\sum ^{m-1}_{j=0}\frac{search^{c} _{k-j}}{\sum_{c}search^{c}_{k-j}}, $$
(4)

where \(search^{c}_{k}\) is the aggregated search indexes of keywords about candidate c on day k.

Campaign Homepages. We collected daily traffic data for the candidates’ campaign homepages from Alexa and used the “IP traffic ratio” as an opinion measure as follows:

$$ s^{c}_{k,\mathrm{IP}} = \frac{1}{m}\sum ^{m-1}_{j=0}\frac{\mathrm{IP}^{c}_{k-j}}{ \sum_{c}\mathrm{IP}^{c}_{k-j}}, $$
(5)

where \(\mathrm{IP}^{c}_{k}\) is the IP traffic volume to candidate c’s campaign homepage on day k.

The above measurements convey different signals for continuous election prediction. We also collected offline election polls published by nineteen authoritative pollsters during the period from Aug. 1, 2015 to Jan. 16, 2016 (see Additional file 1, Sect. 1.1) for comparison. These polls were published aperiodically and infrequently, so we assume the opinions from a poll remain unchanged until a new poll has been released.
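Because the polls arrive aperiodically, aligning them with the daily online signals amounts to a forward fill: the latest released poll is carried forward day by day until a new one appears. A small illustrative sketch follows; the dates and values are made up, not actual poll results.

```python
import pandas as pd

# Hypothetical poll releases (dates and values are illustrative only).
poll_dates = pd.to_datetime(["2015-10-31", "2015-12-05", "2016-01-10"])
polls = pd.DataFrame({"tsai": [0.45, 0.46, 0.45],
                      "chu": [0.21, 0.20, 0.23],
                      "soong": [0.09, 0.10, 0.11]}, index=poll_dates)

# Daily grid matching the online signal period; each poll persists until replaced.
daily_index = pd.date_range("2015-10-31", "2016-01-16", freq="D")
daily_polls = polls.reindex(daily_index).ffill()
```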

3 Vote prediction model

The goal of election prediction is to infer the underlying vote shares of the candidates based on heterogeneous noisy signals. A model is desired that can fuse the signals in a way that reduces the influence of noise and that makes dynamic predictions reflecting the evolution of public opinion. We exploit the Kalman filter, a linear dynamic model, for this purpose. The filter was adopted in [24–26] for election analysis, but previous studies were mostly based on polls and assumed only two candidates.

In general, a Kalman filter maps hidden states to observed variables with noise, and the current hidden states are assumed to transition from previous states with noise. That is,

$$\begin{aligned}& \mathbf{s}^{c}_{k} = \mathbf{h}_{k} x^{c}_{k}+\mathbf{r}^{c}_{k}, \quad \mathbf{r}^{c}_{k} \sim N \bigl(0,\mathbf{R}^{c}_{k} \bigr), \\& x^{c}_{k} = f_{k}x^{c}_{k-1}+q^{c}_{k}, \quad q^{c}_{k} \sim N \bigl(0,\sigma^{2}_{c,k} \bigr), \\& x^{c}_{0} \sim N \bigl(m^{c}_{0},p^{c}_{0} \bigr), \end{aligned}$$
(6)

where \(\mathbf{h}_{k}\) is a vector that maps the hidden state \(x^{c}_{k}\) of candidate c to observed multiple signals in \(\mathbf{s}^{c}_{k}\), \(f_{k}\) is the state transition coefficient, and \(x^{c}_{0}\) is the initial value of the hidden state. \(\mathbf{r}^{c} _{k}\) and \(q^{c}_{k}\) denote independent Gaussian random noise.

In our case, \(x^{c}_{k}\) is the genuine vote share of candidate c on day k, and \(\mathbf{s}^{c}_{k} = (s^{c}_{k,\mathrm{GO}},s^{c}_{k,\mathrm{FAL}}, s^{c} _{k,\mathrm{TW}},s^{c}_{k,\mathrm{IP}})^{\top }\) contains the observed signals. We set \(f_{k} = 1\) and \(\mathbf{h}_{k} = \mathbf{1}\) for scale equivalence of the variables. The initial vote \(m^{c}_{0}\) is set to the average value of the latest poll results, with \(p^{c}_{0}=1\) to allow fluctuation. Note that we also change the setting of the initial vote \(m^{c}_{0}\) to the mean value of each candidate’s signals and to an equal value \(m^{c}_{0}=1/3\), with state variances \(p^{c}_{0}=0\) and \(p^{c}_{0}=1\), respectively (see Additional file 1, Sect. 2.1). The final prediction turns out to be insensitive to the initial values when the time series is sufficiently long (see Additional file 1, Sect. 2.2 and Sect. 2.3). The logic behind this set of equations is that the online measures are noisy signals whose means reflect the true vote states. The goal of the model is to fuse the noisy signals to estimate the daily state and to carry the estimate forward to the next day as a prediction.
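To make the generative assumptions in (6) concrete, the following sketch simulates the model under the paper’s settings \(f_{k} = 1\) and \(\mathbf{h}_{k} = \mathbf{1}\); the noise levels, initial values, and variable names are illustrative assumptions, not the paper’s estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

num_days, num_signals = 78, 4          # Oct. 31, 2015 - Jan. 16, 2016; FAL, TW, GO, IP
f = 1.0                                 # state transition coefficient f_k
h = np.ones(num_signals)                # observation vector h_k (all ones)
m0, p0 = 0.45, 1.0                      # illustrative initial vote share and variance
sigma2 = 1e-4                           # assumed state-noise variance sigma^2
R = 1e-3 * np.eye(num_signals)          # assumed observation-noise covariance R

x = np.empty(num_days)                  # hidden vote-share states x_k
s = np.empty((num_days, num_signals))   # observed signal vectors s_k
x_prev = rng.normal(m0, np.sqrt(p0))    # x_0 ~ N(m0, p0)
for k in range(num_days):
    x[k] = f * x_prev + rng.normal(0.0, np.sqrt(sigma2))                 # state transition
    s[k] = h * x[k] + rng.multivariate_normal(np.zeros(num_signals), R)  # noisy observation
    x_prev = x[k]
```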

The next task is to estimate the noise parameters \(\mathbf{R}^{c}_{k}\) and \(\sigma^{2}_{c,k}\). To reduce the model complexity, we assume \(\mathbf{R}^{c}_{k} = \mathbf{R}_{k}\) and \(\sigma^{2}_{c,k}=\sigma ^{2}_{k}\), ∀c. The maximum a posteriori estimation can then be obtained by maximizing the conditional density function:

$$\begin{aligned} \mathcal{J} =& p \bigl(x^{tsai}_{1:k},x^{chu}_{1:k},x^{soong}_{1:k}, \sigma ^{2}_{k},\mathbf{R}_{k}| \mathbf{s}^{tsai}_{1:k}, \mathbf{s}^{chu}_{1:k}, \mathbf{s}^{soong}_{1:k} \bigr) \\ &{}\propto \prod_{c} p \bigl(x^{c}_{0} \bigr) \prod^{k}_{j=1}p \bigl( \mathbf{s}^{c}_{j}|x ^{c}_{j}, \mathbf{R}_{k} \bigr)p \bigl(x^{c}_{j}|x^{c}_{j-1}, \sigma^{2}_{k} \bigr)p \bigl( \sigma^{2}_{k}, \mathbf{R}_{k} \bigr), \end{aligned}$$
(7)

with \(\sum_{c} x^{c}_{k}=1\) and \(\sum_{c} \mathbf{s}^{c}_{k} = \mathbf{I}_{4 \times 1}\). We finally have (see Additional file 1, Sect. 2.1),

$$ \begin{aligned} &\widehat{\sigma ^{2}_{k}} = \frac{1}{3k}\sum_{c} \sum ^{k}_{j=1} \bigl( \hat{x}^{c}_{j|j}-f_{j} \hat{x}^{c}_{j-1|j-1} \bigr)^{2}, \\ &\widehat{\mathbf{R}}_{k} = \frac{1}{3k}\sum _{c} \sum^{k}_{j=1} \bigl( \bigl( \mathbf{s}^{c}_{j}-\mathbf{h}_{j} \hat{x}^{c}_{j|j-1} \bigr) \bigl(\mathbf{s}^{c} _{j}-\mathbf{h}_{j}\hat{x}^{c}_{j|j-1} \bigr)^{\top }-\mathbf{h_{j}} {p}^{c} _{j|j-1} \mathbf{h_{j}}^{\top } \bigr), \end{aligned} $$
(8)

where \(\hat{x}^{c}_{k|k-1}\) is the vote state prediction for candidate c at time k given the signals up to \(k-1\), and \(\hat{x}^{c}_{k|k}\) is the updated estimation of the vote state at time k given the signals up to k. \(p^{c}_{k|k-1}\) and \(p^{c}_{k|k}\) are the prediction covariance and updated estimation covariance, respectively.
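A minimal sketch of how the estimates in (8) could be computed once the filtered and predicted quantities are stored; the array layout and function name are assumptions made for illustration.

```python
import numpy as np

def estimate_noise(x_filt, x_pred, p_pred, s_obs, h, f=1.0):
    """MAP noise estimates of Eq. (8), pooled over candidates and days.

    Assumed shapes (C candidates, k days, d signals):
      x_filt : (C, k+1)  filtered states x_{j|j}; column 0 holds the initial values m0
      x_pred : (C, k)    one-step predictions x_{j|j-1}
      p_pred : (C, k)    prediction variances p_{j|j-1}
      s_obs  : (C, k, d) observed signal vectors s_j
      h      : (d,)      observation vector (all ones in the paper)
    """
    C, k = x_pred.shape
    # State-noise variance: mean squared one-step change of the filtered state.
    sigma2 = np.mean((x_filt[:, 1:] - f * x_filt[:, :-1]) ** 2)
    # Observation-noise covariance: innovation outer products minus the part
    # already explained by the prediction variance, averaged over c and j.
    innov = s_obs - x_pred[:, :, None] * h              # s_j - h * x_{j|j-1}
    R = (np.einsum('cki,ckj->ij', innov, innov)
         - np.outer(h, h) * p_pred.sum()) / (C * k)
    return sigma2, R
```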

To recursively estimate the daily vote state at time k, the prediction of vote shares \(\hat{x}^{c}_{k|k-1}\) is first derived by a variation of the state transition equation in (6):

$$ \begin{aligned} &\hat{x}^{c}_{k|k-1}=f_{k} \hat{x}^{c}_{k-1|k-1}, \\ &p^{c}_{k|k-1}=f^{2}_{k} p^{c}_{k-1|k-1}+\widehat{\sigma ^{2}_{k}}. \end{aligned} $$
(9)

Meanwhile, since the online signal \(\mathbf{s}^{c}_{k}\) is observed, it is feasible to update the state estimation \(\hat{x}^{c}_{k|k}\) by absorbing \(\mathbf{s}^{c}_{k}\) into the prediction \(\hat{x}^{c}_{k|k-1}\). We express the update as a weighted combination of the state prediction and the signals as follows:

$$ \begin{aligned} &\hat{x}^{c}_{k|k}=f_{k} \hat{x}^{c}_{k|k-1}+\mathbf{k}^{c}_{k} \bigl( \mathbf{s}^{c}_{k}-\mathbf{h}_{k} \hat{x}^{c}_{k|k-1} \bigr), \\ &p^{c}_{k|k}=p^{c}_{k|k-1}- \mathbf{k}^{c}_{k}\mathbf{h}_{k}p^{c}_{k|k-1}, \end{aligned} $$
(10)

where \(\mathbf{k}^{c}_{k}\) is the Kalman gain [27], which weights the state prediction against the observed signals. By minimizing the updated state estimation error \(x^{c}_{k}-\hat{x}^{c}_{k|k}\), we can derive the Kalman gain as

$$ \mathbf{k}^{c}_{k}=p^{c}_{k|k-1} \mathbf{h}^{\top }_{k} \bigl(\mathbf{h}_{k}p ^{c}_{k|k-1}\mathbf{h}^{\top }_{k}+\widehat{ \mathbf{R}}^{c}_{k} \bigr)^{-1}. $$
(11)

When the updated estimation is obtained, we can use (9) to predict the next-day vote share.
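The recursion in (9)–(11) can be sketched as a single predict/update step per candidate and day. The function below assumes the noise estimates of (8) are already available (for example from the sketch above) and uses the paper’s settings \(f_{k}=1\), \(\mathbf{h}_{k}=\mathbf{1}\) by default; the function name and return layout are illustrative.

```python
import numpy as np

def kalman_step(x_prev, p_prev, s_k, sigma2, R, h=None, f=1.0):
    """One predict/update step of Eqs. (9)-(11) for a single candidate.

    x_prev, p_prev : previous filtered state x_{k-1|k-1} and variance p_{k-1|k-1}
    s_k            : observed signal vector on day k
    sigma2, R      : estimated state and observation noise (Eq. (8))
    Returns (x_pred, x_filt, p_filt, gain).
    """
    s_k = np.asarray(s_k, dtype=float)
    if h is None:
        h = np.ones_like(s_k)                        # h_k = 1 in the paper
    # Prediction, Eq. (9)
    x_pred = f * x_prev
    p_pred = f ** 2 * p_prev + sigma2
    # Kalman gain, Eq. (11)
    gain = p_pred * h @ np.linalg.inv(np.outer(h, h) * p_pred + R)
    # Update, Eq. (10); with f_k = 1 this coincides with the weighted form in the paper.
    x_filt = x_pred + gain @ (s_k - h * x_pred)
    p_filt = p_pred - gain @ h * p_pred
    return x_pred, x_filt, p_filt, gain
```

Looping this step over days for each candidate, with x_pred used as the next-day vote-share prediction, reproduces the recursive scheme described above.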

According to the Internet usage report of Taiwan, more than 90% of Taiwan residents aged between 20 and 45 years have accessed the Internet since May 2015. This proportion is over 80% in the population aged between 45 and 55 years. By contrast, only 49.5% of residents aged over 55 years used the Internet during the same period. Thus, we take the online data fusion result as representative of the group aged between 20 and 50 years. Since the pollsters adopt an age-adjusted sampling method, we take the poll results for the 50 to 60 year-old, 60 to 70 year-old and over 70 year-old groups as the vote share estimations of the corresponding age groups. Therefore, the final daily vote share prediction \(y^{c}_{k}\) for candidate c at time k is the weighted combination

$$ \begin{aligned} y^{c}_{k}=w_{20\sim 50} \hat{x}^{c}_{k|k-1}+w_{50\sim 60}z^{c}_{50 \sim 60,k}+w_{60\sim 70}z^{c}_{60\sim 70,k}+w_{70}z^{c}_{>70,k}, \end{aligned} $$
(12)

where \(w_{i}\) is the population proportion of age group i, obtained from the Ministry of the Interior of Taiwan, and \(z^{c}_{i,k}\) is the most recent poll result of age group i for candidate c on day k.
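As a small illustration, the weighted combination in (12) for a single candidate on one day could look as follows; all weights and poll values here are made-up placeholders, not the paper’s data.

```python
# Illustrative age-group population weights w_i and latest poll results z_{i,k}.
weights = {"20-50": 0.55, "50-60": 0.18, "60-70": 0.15, ">70": 0.12}
x_online = 0.52                  # Kalman prediction for the 20-50 group, x_{k|k-1}
poll = {"50-60": 0.48, "60-70": 0.43, ">70": 0.40}

# Final daily prediction y_k, Eq. (12).
y_k = (weights["20-50"] * x_online
       + weights["50-60"] * poll["50-60"]
       + weights["60-70"] * poll["60-70"]
       + weights[">70"] * poll[">70"])
```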

4 Event detection method

Twitter, as an online plaza, aggregates information about the different candidates during an election campaign. By analyzing the sentiment of tweets in October 2015, we find that more than 80% of the retrieved tweets are news. Because most of Taiwan’s mainstream media maintain accounts on Twitter, the volatility of tweet volumes can signal influential events. A three-step detection method is designed as follows.

Step I is to detect events from the tweet volume. To this end, we track the statistic \(tw_{k}^{c}\), i.e., the number of tweets about candidate c on day k, and trace its volatility over the past m days by comparing the next day’s volume with an upper bound \(u^{c}_{k+1} = \bar{n}+\frac{s}{\sqrt{m}}t_{\alpha /2}(m-1)\), where \(\bar{n}\) is the average of \(tw_{k}^{c}\) over the m days and s is the corresponding standard deviation. Based on a t-test with significance level α, there exists an influential event if \(tw^{c}_{k+1}\) surpasses \(u^{c}_{k+1}\) (see Additional file 1, Fig. S9). We assume that only one new event is dominant in each burst, which is reasonable for political campaigns.
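A minimal sketch of this burst test (Step I); the function name and input layout are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def is_burst(tw_history, tw_next, m=30, alpha=0.05):
    """Flag day k+1 as a bursty day if its tweet volume exceeds the upper
    bound u_{k+1} built from the previous m days (Step I)."""
    window = np.asarray(tw_history[-m:], dtype=float)   # tw_{k-m+1}, ..., tw_k
    n_bar, s = window.mean(), window.std(ddof=1)
    t_crit = stats.t.ppf(1 - alpha / 2, df=m - 1)       # t_{alpha/2}(m-1)
    upper = n_bar + s / np.sqrt(m) * t_crit
    return tw_next > upper
```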

Step II is to estimate the event time window. The daily tweets about each candidate are first merged into a single document; then, the terms in the document are weighted by the tf-idf method. tf-idf is a numerical statistic intended to reflect how important a word is to a document within a collection of documents. The tf-idf value increases proportionally with the number of times a word appears in a document but is offset by the frequency of the word in the whole collection, which adjusts for the fact that some words appear more frequently in general. tf-idf is calculated as follows,

$$ \begin{aligned} &tf \bigl(t,d^{c}_{k} \bigr)= \frac{f_{t,d^{c}_{k}}}{\sum_{t} f_{t,d^{c}_{k}}}, \\ &idf \bigl(t,D^{c} \bigr)=\log \frac{N^{c}}{1+\vert d^{c}_{k} \in D^{c}:t \in d^{c}_{k}\vert }, \\ &tf\text{-}idf \bigl(t,d^{c}_{k},D^{c} \bigr)=tf \bigl(t,d^{c}_{k} \bigr)idf \bigl(t,D^{c} \bigr), \end{aligned} $$
(13)

where \(f_{t,d^{c}_{k}}\) is the count of term t in the daily document \(d^{c}_{k}\) of candidate c on day k, \(D^{c}\) is the collection of daily documents for candidate c, \(N^{c}=|D^{c}|\), and \(|d^{c}_{k} \in D^{c}:t \in d^{c}_{k}|\) is the number of documents in which term t appears. The top-30 terms with the highest weights on the burst day are selected as the typical words for that event. We then check the overlap of typical words within five days before and after the burst day. The first day with non-zero overlap is deemed to be the start day of the event, and the last day with non-zero overlap is the closing day, which defines the event time window (see Additional file 1, Table S9, Table S10, and Table S11). We discard suspicious events whose time window spans only one day.
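The sketch below illustrates Step II under the assumption that each day’s tweets have already been merged and tokenized into a list of terms; the data structures and helper names are illustrative, not the paper’s implementation.

```python
import math
from collections import Counter

def top_terms(daily_tokens, day, top_n=30):
    """Top-n tf-idf terms of one day's merged document, following Eq. (13).

    daily_tokens : dict mapping a day key to the token list of that day's document
    """
    doc = Counter(daily_tokens[day])
    total = sum(doc.values())
    N = len(daily_tokens)
    df = Counter()                                    # document frequency of each term
    for tokens in daily_tokens.values():
        df.update(set(tokens))
    scores = {t: (c / total) * math.log(N / (1 + df[t])) for t, c in doc.items()}
    return {t for t, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:top_n]}

def event_window(daily_tokens, days, burst_day, span=5, top_n=30):
    """Event time window around a burst day (Step II): the first and last day
    within +/- span days whose typical words overlap those of the burst day."""
    burst_terms = top_terms(daily_tokens, burst_day, top_n)
    i = days.index(burst_day)
    overlapping = [d for d in days[max(0, i - span): i + span + 1]
                   if top_terms(daily_tokens, d, top_n) & burst_terms]
    return overlapping[0], overlapping[-1]            # start day, closing day
```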

Step III is to measure the impact of events on public opinion. We denote the estimate of \(x_{k}^{c}\) transitioned from the previous day as \(\hat{x}_{k|k-1}^{c}\) (see equation (9)) and the final estimate of \(x_{k}^{c}\) calibrated with the multiple signals as \(\hat{x}_{k|k}^{c}\) (see equation (10)). Intuitively, \(\hat{x}_{k|k}^{c}\) has absorbed the information about all pertinent events on day k; hence, the change from \(\hat{x}_{k|k-1}^{c}\) (which equals \(\hat{x}_{k-1|k-1} ^{c}\) for \(f_{k}=1\) and \(\mathbb{E}(q^{c}_{k})=0\)) to \(\hat{x}_{k|k} ^{c}\) indicates the impact of an event. To measure the significance of the impact, we apply the event study model [28] from the field of finance as follows:

$$ \hat{x}^{c}_{k|k}=a+\hat{x}^{c}_{k-1|k-1}+ \sum^{J}_{j=1}\gamma_{j}D ^{c}_{j,k}+\varepsilon, $$
(14)

where \(D^{c}_{j,k}\) is a dummy variable equal to 1 if day k is within the time window of event j for candidate c and equal to 0 otherwise, J is the total number of detected events, and a is a regression constant. \(\gamma_{j}\) estimates the effect of event j, and its t-test indicates whether event j has a significant effect on public opinion. In this way, we can identify the events that actually influenced the election.
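Since the coefficient on \(\hat{x}^{c}_{k-1|k-1}\) in (14) is fixed to one, the regression is equivalent to regressing the daily change of the filtered state on a constant and the event dummies. A self-contained sketch of this estimation with ordinary least squares and manual t-statistics follows; the input layout is an assumption for illustration.

```python
import numpy as np
from scipy import stats

def event_study(x_filt, D):
    """Event-study regression of Eq. (14) for one candidate.

    x_filt : filtered states x_{k|k}, length K+1 (index 0 = initial value)
    D      : dummy matrix of shape (K, J); D[k-1, j] = 1 if day k lies in event j's window
    Returns the event effects gamma_j with their t statistics and p-values.
    """
    y = np.diff(x_filt)                               # x_{k|k} - x_{k-1|k-1}
    X = np.column_stack([np.ones(len(y)), D])         # constant a plus event dummies
    beta, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]
    cov = (resid @ resid / dof) * np.linalg.inv(X.T @ X)
    t_stat = beta / np.sqrt(np.diag(cov))
    p_val = 2 * (1 - stats.t.cdf(np.abs(t_stat), df=dof))
    return beta[1:], t_stat[1:], p_val[1:]            # drop the constant term
```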

5 Results

5.1 Prediction performance

Figures 1(a)–(c) show various online signals two months before election day. Intuitively, the user behavior in different channels is related to the public opinion towards a candidate, but the signals have vastly different volatilities. This justifies the value of information fusion for election prediction.

Figure 1

Online signals and time series of vote share predictions. (a)–(c) Signals of public opinions from all the online channels for the three candidates. (d) Time series of the vote share predictions. The dashed lines are the actual election outcomes. On election day, the errors are less than 2.59%

Figure 1(d) depicts the dynamic vote predictions after fusing the four types of online signals, i.e., \(s^{c}_{k,\mathrm{FAL}}\), \(s^{c}_{k,\mathrm{TW}}\), \(s^{c}_{k,\mathrm{GO}}\) and \(s^{c}_{k,\mathrm{IP}}\), with the Kalman filter. Although the four signals behave differently, the fused signal representing the predicted vote share for each candidate is relatively stable and exhibits a clear tendency, confirming the effectiveness of the prediction system for information aggregation. The final result is impressive: while Tsai’s win was easy to predict as early as October, the prediction errors for every candidate one day, one week, and one month before election day are no greater than 2.59%, 4.58% and 5.87%, respectively.

To further justify the predictive power of online signals, we also compare our results with offline polls. As shown in Fig. 2, during the last two weeks before the election, our predictions (M1) outperform most of the pollsters (P1–P10) and improve continuously by absorbing up-to-date information. This is possibly because the anonymity of the Internet enables individuals to express their opinions freely and voluntarily, which could reduce the bias relative to the telephone-interview setting of a traditional poll. Furthermore, news now usually breaks online first and then spreads rapidly from online to offline via physical social networks. Therefore, online information can also influence offline voting blocs during campaigns, which mitigates the bias of using only the netizen population in our method.

Figure 2

Timeline of the absolute prediction errors of final polls and data fusion methods. The bars on the left side of the timeline represent the prediction errors of the data fusion methods. In each interval between two gray dashed lines, there are two bars. The lower bar represents the absolute error of the online data fusion method, and the upper bar represents the absolute error of the online–offline data fusion method. The interval between two gray horizontal dashed lines indicates one day. The bars on the right side of the timeline show the prediction errors of the final polls from ten pollsters. Comparison of the bars on both sides shows that the absolute prediction errors of the signal fusion methods are smaller than those of the polls

We also try to reduce the sample bias by mixing the prediction results from online signals with those from offline pollsters for the older age groups. As shown in Fig. 2, the online–offline data fusion method (M2) indeed outperforms the online data fusion method (M1) in the early part of the final two weeks, which indicates the power of sample bias correction. However, this advantage gradually disappears as election day approaches, again exposing the drawback of offline polls in responding to newly emerging information.

5.2 Signal evaluation

We also explore the predictive power of various online signals via their daily Kalman gains \(\mathbf{k}^{c}_{k}\). As shown in Fig. 3, Facebook “Likes” are consistently the strongest indicator among all the signals. This demonstrates the power of social media in collecting public opinions via a simple mechanism, although it is vulnerable to shilling attacks. The predictive power of the Google index appears to be time-sensitive, contributing less initially and becoming the second best indicator one month before the election. One possible explanation is that the election might not be a focal topic in the early stage of the campaign, making Google searches rather random. However, as the election day approaches, the campaign becomes the central topic and drives the public to search for information about the candidates. The two remaining signals, i.e., tweet volumes and homepage traffic, appear to be of much weaker predictive value, which may be due to their lack of popularity in Taiwan (see Additional file 1, Table S1) and diverse attitudes about candidates.

Figure 3

Kalman gain for different online signals. The Kalman gain of the Facebook “Like” constitutes the highest proportion but gradually decreases while the Google signal continuously increases. Twitter and the campaign homepages are not indicative signals for the election. The Kalman gains of all the signals converge to steady states approximately one month before the election

We further explore the distinct value of the “Like” function on Facebook. We compare it with the “Comment” function by substituting \(s^{c}_{k,\mathrm{FAL}}\) with \(s^{c}_{k,\mathrm{FAC}}\) in the Kalman filter. The prediction outcomes become significantly worse: the one-day-ahead prediction errors for Tsai and Chu increase to 5.42% and 4.86%, respectively (see Additional file 1, Sect. 2.5). These results indicate the superiority of “Like” over “Comment”. To understand this result, we collect the Facebook users who have ever liked or commented on the candidates’ posts and identify the overlapping users who have both liked and commented on a candidate. Figure 4 shows that these users constitute only a small proportion of the “Like” users but a much larger proportion of the “Comment” users. Therefore, a considerable proportion of users who have commented on a post may also choose to like the post, but not vice versa. In other words, the “Like” signal represents the positive attitude of a much larger population than the “Comment” signal does, which may be attributed to the fact that a “Like” is a more direct and more widely used way for online users to express a positive opinion with little effort. Another disadvantage of “Comment” lies in its diversity of expression, which can be a blend of contradictory attitudes, including support, praise, opposition and even insult. We apply the Latent Dirichlet Allocation (LDA) model [29] to extract topics from the comments of the overlapping users and of the users who only commented on the candidates, as sketched below. The representative topics of the overlapping users are mainly supportive attitudes, while the topics of the users who only commented on candidates are mixed, with both positive and negative topics (see Additional file 1, Tables S3–S8).
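A hypothetical sketch of this topic extraction with scikit-learn; it assumes the comments have already been word-segmented (e.g., for Chinese text) and joined into whitespace-separated strings, and the parameter values and function name are illustrative.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def comment_topics(comments, n_topics=5, n_top_words=10):
    """Return the top words of each LDA topic for a list of comment strings."""
    vec = CountVectorizer(max_features=5000)
    doc_term = vec.fit_transform(comments)            # bag-of-words matrix
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    lda.fit(doc_term)
    vocab = vec.get_feature_names_out()
    return [[vocab[i] for i in topic.argsort()[-n_top_words:][::-1]]
            for topic in lda.components_]
```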

Figure 4

Proportions of overlapping users who have both liked and commented on the candidates’ posts among “Likers” and “Commentators”. The intermediate vertical axis is a timeline covering the whole period of the election. The bar on the left side of the timeline represents the daily proportion of overlapping users to users who have ever liked. The bar on the right side of the timeline represents the daily proportion of overlapping users to users who have ever commented. The number of overlapping users accounts for less than 1% of all the users who have “liked” on average, with the maximum proportions being 3.51%, 3.74%, and 9.25% for the three candidates. By contrast, the overlapping users constitute more than 37.16%, 14.90%, and 12.03% of all users who have commented, on average, for the three candidates, and the maximum ratios are 73.05%, 59.75%, and 83.01%

The overlapping users indeed constitute a group of firm supporters for each candidate, showing their support not only by clicking “Like” but also by making the effort to publish comments. By further tracking the changes in the overlap ratios during the election, as shown in Fig. 4, we find that the ratio for Tsai is relatively stable, indicating that Tsai has a firm group of supporters regardless of her behavior during the campaign. By contrast, for Chu and Soong, the overlap ratios remain small until election day approaches, suggesting that Tsai should partially attribute her success to her firm supporters rather than to swing voters. This also explains why we can predict Tsai’s victory two months before election day.

5.3 Influential events

We apply the event detection method to each candidate’s Twitter data to identify influential events. Figure 5 shows the results, and Table 1 gives the event descriptions. The most influential events detected, with p-values less than 0.05, include the meeting between Xi Jinping and Ma Ying-jeou (Xi-Ma Meeting), the emergence of negative comments on Tsai Ing-wen’s Facebook homepage, possibly by users from mainland China, and the Chou Tzu-yu flag incident. All these events share a common feature: they concern cross-strait relations, a topic that is always subtle and controversial in Taiwan’s political circles. Other seemingly important campaign events, such as the televised candidates’ debates and various local electioneering activities, have no significant influence on public opinion.

Figure 5

Detected events and their influence on potential vote shares. The numbers of detected events during the election are 8, 7, and 6 for Tsai, Chu, and Soong, respectively. The description of each event \(E_{i}\) is presented in Table 1. Each event spans a time window and influences the potential voting rates differently, as denoted by different colors. A light purple bar indicates that the detected event does not have a significant influence on the vote shares. A red bar indicates a positive effect of the event on vote shares, and a blue bar represents a negative effect. The influence of each significant event is marked on the curve, and the number below it in brackets is the p-value of the t-test. The typical words used to determine the event timespan are detailed in Additional file 1, Table S9, Table S10, and Table S11. In addition, the Twitter bursty days detected in Step I are marked in the Twitter volume time series. The red points represent events with a timespan longer than one day, which are fed into Step II for further analysis. The blue points are removed

Table 1 Detected Events

We further assess the influence level of the events, measured by the coefficient \(\gamma_{j}\) in (14). Table 2, Table 3 and Table 4 give the detailed results for the three candidates, respectively. The statistics for \(\gamma_{i}\), \(i \in \{1,\ldots,21\}\), correspond to the effects of the 21 events marked as \(E_{i}\), \(i \in \{1,\ldots,21\}\), in Table 1.

Table 2 Influential significance of events detected for Tsai Ing-wen
Table 3 Influential significance of events detected for Eric Chu
Table 4 Influential significance of events detected for James Soong

The Xi-Ma Meeting resulted in a 0.55% decrease in the vote share of Tsai Ing-wen. This result is not surprising because Tsai was believed to favor Taiwan independence over the “One China Policy”, and the meeting thus prompted the public to doubt Tsai’s ability to handle cross-strait relations. This same event increased Eric Chu’s vote share by 0.58% because he was thought to be more able to develop cross-strait peace after the meeting.

Despite the abundance of events during the campaign, the Chou Tzu-yu flag incident, originating in the entertainment domain, is the most influential. Chou Tzu-yu, a 16-year-old Taiwanese singer, sparked huge controversy on social media for displaying the Taiwan (Republic of China) flag. As the uproar intensified online, Chou’s company released a video in which Chou apologized for her behavior, stating that “there is only one China” and identifying herself as Chinese. The most subtle point is that the video was released the day before the election; it was described as a humiliation to Taiwan and spread quickly through Taiwan’s online social media. As a consequence, this incident increased the vote share of Tsai Ing-wen by approximately 3.66% and lowered the vote share of Eric Chu by approximately 2.62%.

6 Discussion

The accurate prediction of Taiwan’s 2016 general election suggests an interesting viewpoint: public opinion towards political campaigns can be gauged via online user-generated content. This coincides with recent studies reporting that social media such as Facebook [6, 10], Twitter [2, 6, 7, 11, 13–16] and YouTube [6] are able to aggregate public opinions about political matters. Donald Trump’s victory in the 2016 US Presidential Election was also partly attributed to his heavy use of social media such as Twitter [30]. Nevertheless, this finding remains controversial in academia, and the above studies have often been criticized for the unreliability of single-source information [31] and/or the unrepresentativeness of online user populations [32, 33]. Our study attempts to address these concerns.

First, we introduce multiple online channels as different types of signals to produce more robust predictions. These signals, while all reflecting latent public opinion to some extent, have varied fluctuations due to their different sensitivities to campaign dynamics and to possible fake responses from the Internet “water army” (see Fig. 1). The fusion of these signals helps to filter out some noise through consensus learning and to highlight the underlying tendencies. Moreover, although one signal might contribute more to a specific election prediction, such as the Facebook “Like” for the Taiwan election, no single signal is likely to be omnipotent across different elections. The fusion of these signals could help to mitigate the risk of selection bias. This information fusion scheme gives our study important extensibility: the four channels, namely, Facebook, Twitter, Google Trends and campaign homepages, can be considered fundamental, readily available online information sources for other elections.

We also find that although selection bias of the online voting population exists, its influence on the prediction results is limited. Prediction based on purely online information is much more accurate than the polls released by Taiwan’s mainstream pollsters (see Fig. 2). The reason behind this may be two-fold. On one hand, online users who pay close attention to election campaigns are likely to become active voters and constitute a large share of the voting population on election day [34, 35]. On the other hand, we should not underestimate the information exchange between online social networks and offline physical networks [36, 37]. Older people who seldom interact with the Internet still have access to online information via ordinary family communication or traditional media reports on Internet opinions. This communication contributes to the conformance of opinions across online and offline networks and further improves the representativeness of the online voting population. In fact, compared with traditional polls, which are susceptible to questionnaire wording [38], reporting error [39], ballot order [40], and social desirability bias [39, 41], online big data enable a much larger sample and are thus more resistant to human manipulation. The real-time availability of online data, which enables dynamic predictions based on continuously incoming information, is another major advantage relative to polls.

Our study also suggests that the Kalman filter with the event detection model could be packaged as a fundamental kit for political vote analytics. Specifically, the Kalman filter is responsible for the dynamic prediction of vote shares given multi-source time-varying signals and multiple candidates. Meanwhile, the event detection model is responsible for the automatic identification of influential events during the campaign, which provides a causal explanation for the predictions. In other words, the two models together could provide interpretable predictions to political vote analytics, which is deemed particularly valuable for a big-data-driven research paradigm [42].

The Kalman filter has been adopted in previous studies, but either for backward review given the final result or for forward prediction given data from multiple historical elections. Our study shows that although we cannot obtain the true vote shares until election day, we can still fine-tune the model parameters by using up-to-date time series of signals for the current election, which solves the problem of leveraging the Kalman filter for election prediction. Moreover, given the sum-to-one constraint in the statistical learning framework (see (8)), the Kalman filter is capable of modeling more than two election candidates. One may consider including other relatively stable factors, such as globalization trends, economic status, the technology environment, etc., in the prediction model, which can be achieved by setting appropriate initial values of the Kalman filter. Nevertheless, our study shows that the Kalman filter is insensitive to the initial values as long as the prediction is based on a sufficiently long time series (see Additional file 1, Sect. 2.2 and Sect. 2.3). In this case, the signals should have fully “absorbed” the influences of these macro factors.

Our study provides some political insight into the Taiwan general election. It is interesting that the simple “Like” function on Facebook effectively captures public opinion about the candidates (see Signal Evaluation in Results), although it has been reported to be vulnerable to shilling attacks in electronic commerce [43]. The “Like” function is more informative than the “Comment” function, although the latter expresses more complex sentiments and richer opinions. This difference is attributed to the widespread use of Facebook in Taiwan (see Additional file 1, Table S1) and to the ease of use and emotional unambiguity of the “Like” function. Another interesting finding is that the most influential events during the Taiwan election campaign are all closely related to cross-strait relations (see Influential Events in Results). In particular, in line with the findings in [44], events more closely associated with public sentiment (such as the Chou Tzu-yu flag incident) appear to have a greater impact than those with merely political meaning (such as the Xi-Ma Meeting).

We provide accurate prediction and automatic causal analysis of the 2016 Taiwan general election, which illustrates the feasibility of applying a data-driven paradigm for political vote analytics. Although our focus is on Taiwan, the proposed signal fusion approach and the event detection model can be applied to other elections or referendums, especially those using majority rule. Considering the different Internet applications used across countries and areas, we may need to adjust the input online information sources and design new measurements for the new signals. Furthermore, we should consider how the election systems of particular countries or areas differ and require adjustment of the prediction model. For example, the US election system is not a direct election but relies on the Electoral College system with 538 electoral votes. Hence, we have to incorporate information about the states and locations of online users into the prediction. However, this information is often unavailable. Nevertheless, we can still consider online users as the voters for a “virtual” direct election and obtain the predictive results as the popular votes for the candidates, which could still indicate the winner if there is a large difference in vote share among candidates. The recent 2016 US Presidential Election demonstrates the power of voices on social media.