1 Introduction

Event labeling is recognized as a basic function in surveillance and monitoring systems. Labels are essential for evaluating algorithms and for incorporation into real-time systems. However, event labeling is an expensive and time-consuming task that requires the effort of one or more human experts. Several solutions have been developed to avoid human-based labeling. The first group of methods relies on the creation of artificial and simulated data [31, 38, 42, 44], so that both normal and abnormal instances are generated via simulation. In the second group, events are injected into real background data [8, 28]. However, the ideal scenario is to have access to ground truth data [35] in which both normal and abnormal instances are labeled without simulation. The first and second solutions suffer from two issues: firstly, they do not reflect reality [41], and secondly, it is extremely difficult to develop a simulator that generates data close to the ground truth [19]. Besides, the availability of ground truth data is limited, and existing sets have attracted some criticism (e.g. [37, 50]).

Regardless of the learning methodology, the evaluation of event detectors is still highly dependent on human effort. In supervised event detection, both normal and abnormal instances must be labeled by human experts. In semi-supervised approaches, normal instances must be labeled by humans. In unsupervised methods, detected events must be verified by human experts. In practice, however, labeling or verification of events by human experts can be extremely time-consuming and expensive [41]. To alleviate this problem, some efforts have been made to help users label data more efficiently via a graphical user interface [41]. However, such methodologies still depend on humans to a great extent.

An automatic event detection system ought to operate without direct dependency on human resources, neither for providing labeled data nor for verifying alarms. One alternative to human knowledge is computer-based knowledge resources. Although many non-human knowledge sources exist, a large number of event detection systems still rely on human-based knowledge. For instance, the knowledge available inside search engines, meta-image and meta-video databases, data archives and data warehouses is rarely incorporated into event detection problems. Only a few studies have addressed this issue. For instance, [3] use background knowledge from reliable sources of information (e.g. CNN, BBC) for matching and validation of detected events on Twitter streams. Collier et al. [12] use an event ontology, called BioCaster, as background knowledge for the detection of infectious disease outbreaks from linguistic signals on the web. SanMiguel et al. [43] present an approach for creating a domain ontology as background knowledge for the detection of events in surveillance video. Xu et al. [55] use web cast text as a background knowledge source for the detection of events in sport videos. However, in none of these works has the target been the elimination of the human role from the detection cycle; rather, background knowledge plays a supportive role.

The essential requirement for eliminating the human role from the detection cycle is the possession of a highly reliable detector. However, different event detectors are not equally capable of detecting events, and consequently different detectors perform differently in different environments [2, 48, 49, 52]. This is due to the inconsistency of detector performance [22, 52]. How can we overcome this difficulty? When we want to make an important decision in our daily routine, we usually ask for recommendations from different people with different perspectives. A similar idea has already been explored in machine learning under the name of ensemble learning. Ensemble learning is a robust solution for more accurate and relatively domain-independent classification and clustering [2, 16]. It can also be embedded in a parallel computing paradigm to improve efficiency [47]. However, the application of ensemble methods to event detection has received little attention in the research literature [20, 23, 30, 45, 53], although it is theoretically believed that combining different detectors should provide better coverage of the anomalous space [20, 49].

There are few works in the literature [2] that have adapted ensemble methods to event detection. The first work [4] applies multiple classifiers for anomaly detection on real network traffic data. The authors showed that a few judiciously selected classifiers outperform many diverse classifiers. They propose a method called standard-deviation-normalized entropy of accuracy as a strategy for combining the classifiers. In another work [20], the authors combine four diverse anomaly detectors for automated event labeling of network traffic data and create a data set with ground truth. The strength of their approach relies on the synergy between detectors with different granularity. However, it is not specified how the data can be considered ground truth when it has not been validated by an external knowledge source or a human expert. Besides, the role of randomness and chance is not considered in the combination of the detectors' outputs.

In this work we aim to develop an approach for labeling and detecting events in unlabeled data by combining the ideas of ensemble learning and background knowledge. This approach has two main applications. Firstly, it can be used for creating benchmark data sets for the evaluation of event detection algorithms; secondly, it can be used in real-world event detection problems when the nature of the data is unknown and there is no access to human experts for labeling data or verifying alarms.

In parallel to our main contribution, we perform a comparative study on several important issues in event detection: learning strategy (unsupervised vs. semi-supervised), scale of analysis, denoising approaches, offline vs. online regression models, and distance functions for time series similarity estimation.

The rest of the paper is organized as follows. The next section identifies the main concepts of event detection and introduces the proposed model. Section 3 presents a case study using a real data set, discusses the obtained results and presents a sensitivity analysis. The last section concludes with final remarks.

2 Proposed solution

2.1 Definitions

The central concept in this paper is the event, for which there is as yet no formally agreed definition in the literature. It is sometimes interpreted as a sub-category of anomaly [11], or in some circumstances as equivalent to anomaly [29] or change [24]. Several definitions exist in different contexts. However, the definition that best distinguishes an event from an anomaly, outlier, change or other equivalent terms is the one that emphasizes the spatial-temporal dimension of an event [13, 36]. We note that in this paper, even though we do not conduct a spatiotemporal analysis of the data, each event of interest implies an occurrence of something in a specific place (e.g., Washington, D.C.) and time period (e.g., 2012/05/16). In the following, we define and distinguish some of the concepts used in the paper.

Definition 1

An Event is something that happens in space and time and creates change in the environment under study.

Definition 2

Event labeling is the process of marking events in unlabeled data.

Definition 3

An Event Detector (or Detector) is a method or algorithm that discriminates events from non-event instances.

Definition 4

An Ensemble Detector is a group of event detection algorithms that assign a score to each instance of data. The score usually represents the chance of that instance not being an event.

Definition 5

Background Knowledge is knowledge that cannot be used directly in the training phase due to privacy, computational complexity or competitive reasons, but can be queried directly or indirectly. Some examples of background knowledge sources are as follows.

  • Homogeneous sources This category may include data archives, data warehouses, domain ontologies and other homogeneous sources. The assumption in this category is that we have computational limitations for dealing with big data sets; however, it is possible to query the larger-scale data set through an efficient DBMS gateway.

  • Heterogeneous sources Heterogeneous sources differ in nature from the train and test sets. The best-known example is the World Wide Web. A huge amount of heterogeneous information is available on the web that cannot be integrated into the learning process because of both volume and competitive issues, but a direct query or a query via an API is possible over these sources. Our work concentrates on this type of knowledge source. We use the knowledge available in Google™ web and image search and YouTube™ for the verification of detected events.

  • Confidential sources Sometimes, due to privacy or security matters, it is not possible to have access to the whole database. However, the third party provides a secure gateway to perform a limited set of queries over the database.

2.2 Event labeling model: a proposal

There are two classic event labeling models that rely on human-based knowledge. In the first model (Fig. 1a), a chosen detector is applied to the data and the detected events are then verified by one or more domain experts [7, 15, 32, 34]. In this model, checking all instances is not required; only a limited number of candidates are finally verified by the expert. This model has two main drawbacks: on one hand, there is no guarantee that the detection algorithm works well on that particular data set and detects all potential events; on the other hand, it is not possible to measure the accuracy of the detector in the evaluation phase.

Fig. 1
figure 1

Event labeling models. a Domain expert(s) verify the alarms raised by a single detector. b Domain expert(s) label the instances by manual inspection. c Events are labeled by applying an ensemble of detectors. d Proposed model: an extension of the previous model in which alarms are verified against background knowledge before labeling

In the second model (Fig. 1b), all instances are checked individually by knowledge expert(s) and events are labeled manually [5]. This model has three main drawbacks: firstly, checking instances individually is infeasible for large databases. Secondly, the opinion of a single expert may not be sufficient and affects the labeling quality. Finally, different experts have different perspectives, and it is therefore hard to assume that they would agree on the event labels [54].

A more recent, automatic model proposed in [20] does not rely on human-based knowledge. As depicted in Fig. 1c, the outputs of an ensemble of detectors are combined based on their similarity. The drawback of this method, however, is that the result is highly dependent on the selection of detectors and no knowledge source (human or machine) validates their outputs. Therefore, more false alarms than expected might be raised, because chance and randomness are not taken into account.

We extend the third model to a new model (Fig. 1d) that uses available knowledge resources for the verification of alarms. It has no dependency on human-based knowledge, is less dependent on the choice of methods and is less affected by randomness. Since it is based on ensemble detectors, it can also operate in a parallel computing framework and thus can be computationally efficient. Based on these explanations, we define our research hypotheses as follows.

Hypothesis 1

Ensemble detectors improve detection performance compared to individual detectors.

Hypothesis 2

Background knowledge along with ensemble detectors improves the performance of event detection systems.

In the following we examine and validate the above hypotheses through a comprehensive experimental study and evaluation tasks.

3 Experimental evaluation

There are several public data sets for outlier and anomaly detection. However, it is difficult to find a real data set for which, for each instance, the corresponding environmental data and background knowledge are available. The most challenging part is the background knowledge, which can hardly be found as an open access source. For this reason, many data sets used in the outlier and anomaly detection literature are not useful for our research goals. We were, however, able to find a data set with the potential to address both of the above-mentioned issues. In the following we first describe this data set and then explore the concepts related to the method.

3.1 Data set

The data set under study is the two-year usage log of a bike sharing system, namely Capital Bike Sharing (CBS), in Washington, D.C., USA. There are three reasons why this data set fits our research goal. Firstly, it spans at least two full life-cycles of the system and therefore seems suitable for supervised and semi-supervised learning. Secondly, there exist external sources from which the corresponding historical environmental values, such as weather conditions, weekdays and holidays, can be extracted. Finally, the alarms are verifiable through open access knowledge sources (search engines, meta-image and meta-video sources).

Bike sharing systems are a new generation of traditional bike rentals in which the whole process, from membership through rental to return, has become automatic. Through these systems, a user is able to easily rent a bike from a particular position and return it at another position. Currently, there are over 500 bike sharing programs around the world, comprising over 500 thousand bicycles [40]. Today, there exists great interest in these systems due to their important role in traffic, environmental and health issues. Presently, the top three bike-friendly countries are Spain (132 programs), Italy (104 programs) and China (79 programs) [40]. The number of major cities becoming bike-friendly is growing day by day, and it is expected that in the near future most major cities will provide this service alongside their other public transport services.

Apart from interesting real-world applications of bike sharing systems, the characteristics of the data generated by these systems make them attractive for research. As opposed to other transport services such as bus or subway, the duration of travel and the departure and arrival positions are explicitly recorded in these systems. This feature turns a bike sharing system into a virtual sensor network that can be used for sensing mobility in the city. Hence, it is expected that most important events in the city could be detected by monitoring these data. A few studies have already addressed bike sharing data analysis [6, 51], mostly via spatiotemporal analysis to aid operation-oriented decisions. Our work, however, differs from such works. In this paper, our main concentration is not specifically on bike sharing data; rather, we use bike sharing data as a supportive source for examining our event labeling model.

In the CBS system, when a rental occurs, the operation software collects basic data about the trip such as duration, start date, end date, start station, end station, bike number and member type. The historical data set of such trip transactions is available online via [9]. To avoid trend issues, we select only the data corresponding to the years 2011 and 2012, consisting of 3,807,587 records. We then aggregate the data into two scales, hourly and daily. The hourly time series includes 17,379 hours and the daily time series includes 731 days. Next, we divide both the daily and the hourly time series into two sets: 2011 (train) and 2012 (test). The test set is illustrated in both scales in Fig. 2 (daily scale) and Fig. 3 (hourly scale).
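As an illustration only (not the authors' code), the aggregation and split described above can be sketched in Python with pandas; the file name and the "Start date" column are assumptions about the raw CBS trip log.

```python
import pandas as pd

# Load the raw CBS trip records (file and column names are assumed here).
trips = pd.read_csv("cbs_trips_2011_2012.csv", parse_dates=["Start date"])

# Aggregate trip records into hourly and daily rental counts.
hourly = trips.set_index("Start date").resample("H").size().rename("cnt")
daily = trips.set_index("Start date").resample("D").size().rename("cnt")

# Split both scales into a 2011 train set and a 2012 test set.
hourly_train, hourly_test = hourly.loc["2011"], hourly.loc["2012"]
daily_train, daily_test = daily.loc["2011"], daily.loc["2012"]
```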

Fig. 2
figure 2

The number of rented bikes in 2012 in daily scale

Fig. 3
figure 3

The number of rented bikes in 2012 in hourly scale

As we discuss later, if we applied a regular anomaly detection algorithm to the daily or hourly time series we would not be able to detect all events; we could only detect severe events, because the bike rental process is probably affected by seasonality and environmental settings such as weekday, holiday, temperature, precipitation, wind speed, humidity, etc. Therefore, event signatures cannot be directly observed in these time series. In order to study such effects we need to extract weather data. Several weather data sources exist; however, most of them provide only forecasting data and do not contain historical weather reports. Another group of forecasting sources contains historical weather reports only for a limited number of past days (e.g. 14 days), and yet another group contains historical weather reports but only at daily scale. We were, however, able to find a source that provides hourly historical data [21]. We therefore extract from this source attributes such as temperature, apparent temperature, wind speed, wind gust, humidity, pressure, dew point and visibility for each hour from 1 January 2011 to 31 December 2012 for Washington, D.C., USA. Next, we map each hour in the bike rental time series to the corresponding weather report. Weather reports are missing for some hours; in these cases we map the closest available report to that hour. The maximum temporal difference is 292 min, with a mean of 3 min and a standard deviation of 14 min. We also extract the official holidays of Washington, D.C. from [14] and map them to the corresponding dates. Afterward, holidays are combined with weekends such that each day is finally classified as a working day or a non-working day. Additionally, according to the conditions provided in the weather data, we mark each hour with one of four weather grades: good, cloudy, bad and very bad.
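A minimal sketch of the nearest-report mapping, reusing `hourly` from the earlier sketch and assuming the weather reports have been downloaded to a CSV with a timestamp column `ts` (both file and column names are hypothetical); `pd.merge_asof` performs the closest-timestamp join.

```python
import pandas as pd

weather = pd.read_csv("dc_weather_2011_2012.csv", parse_dates=["ts"]).sort_values("ts")
weather["report_ts"] = weather["ts"]          # keep the report time to measure the gap

rentals = hourly.reset_index().rename(columns={"Start date": "ts"}).sort_values("ts")

# Attach the closest weather report (in time) to each rental hour.
merged = pd.merge_asof(rentals, weather, on="ts", direction="nearest")

# Temporal gap between each hour and its matched report, in minutes.
gap = (merged["ts"] - merged["report_ts"]).abs().dt.total_seconds() / 60
print(gap.max(), gap.mean(), gap.std())
```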

As a result we create two sets at two scales. In the hourly scale set, each record includes hour, month, working day, season, weather grade, temperature, feeling temperature, humidity and wind speed as variables, and the hourly aggregated count of rented bikes as the target value. In the daily scale set, each record consists of month, working day, season, daily average weather grade, daily average temperature, daily average feeling temperature, daily average humidity and daily average wind speed as variables, and the daily aggregated count of rented bikes as the target value.

We then perform a feature selection step on the data set to identify the most significant features. As a result, month, hour, working day and temperature are selected as the most important features for the hourly scale, and month, working day and temperature are selected as the final features for the daily scale. The final processed data set is available online via [17].

3.2 The proposed method

Our event labeling system is depicted in Fig. 4. We first apply ensemble detectors (see Sect. 3.2.1 for details) chosen to have the highest possible disagreement rate. To assess the degree of disagreement among the detectors, we perform Fleiss' kappa test (see Sect. 3.2.3). More disagreement on alarms results in more coverage of the anomalous space. At the same time, more alarms increase the chance of false alarms, and thus the alarms must be validated by an external knowledge source. We run the detectors and combine all alarms, adding the distinct instances together to make a candidate event list. This list is much smaller than the whole set of instances in the data set and therefore imposes a lower verification cost. In the next step we verify each candidate via Google web and image search and YouTube (we choose Google due to its prominent coverage). The verification phase works as follows: a spatiotemporal query is submitted to Google (Fig. 5). If an important event is identified from the results, we mark the date with that event. For instance, by querying "2012-10-30 Washington, D.C." we notice that the storm Sandy happened on this date, so this date is marked as "Sandy". If we cannot find any significant event in the search results, we try Google images and YouTube, or try another query that includes the keyword "weather" (e.g. "2012-10-30 weather Washington, D.C."). Note that, due to the relatively small volume of our data set, we did not perform some text processing steps; in a fully automatic system, however, the retrieved textual results could be processed and the most repeated terms counted.

Fig. 4
figure 4

Our event labeling system

Fig. 5
figure 5

A query in the format date + place is submitted to Google web/image search and YouTube to identify the event that occurred

In the next phase we compute the weight of the event with a method similar to [56], where the search result count is used as a criterion for extracting the correlation between two terms (e.g. food + shopping vs. food + drink). The search result count by itself is meaningless; however, it can be considered an appropriate criterion for comparative purposes. For instance, if the query "food + shopping" returns one million pages and "food + drink" returns five million pages, this indicates that food is more correlated with drink than with shopping. We adapt this idea to measure the weight of candidate events. To this end, as Fig. 6 shows, we add the event title (e.g. "sandy") to the previous query and then extract the count of retrieved results. For instance, as depicted in the figure, 6,920,000 results are returned for this query. This count can be used as a criterion to measure the weight of the event. After obtaining this weight for each event candidate, we transform all weights to their corresponding z-scores. Suppose \(x=(w_1, w_2,\ldots , w_n)\) is the vector of weights obtained from the Google result counts. The z-score corresponding to each weight is obtained by the following equation.

$$\begin{aligned} z\text {-}\mathrm{score} = \frac{w-\mu }{\sigma } \end{aligned}$$
(1)

where \(w\) is the obtained weight, \(\mu \) is the mean and \(\sigma \) is the standard deviation of the vector \(x\).

Fig. 6
figure 6

After the event is identified, a query in the format date + place + event is submitted to Google to measure the weight of the event

Then we remove from the candidate list those events whose z-score is lower than 2. In other words, we keep only those events that have a low probability of being produced by chance. After this filtering step, the final list contains the event labels.
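A small illustrative sketch of this weighting and filtering step (Eq. 1): the result counts below are hypothetical, except the 6,920,000 figure mentioned for the "Sandy" query, and the z-score \(\ge 2\) filter keeps only events unlikely to be produced by chance.

```python
import numpy as np

# Hypothetical result counts w_i for the "date + place + event" queries
# (6,920,000 is the Sandy figure from the text; the rest are made up).
counts = np.array([6_920_000, 3_000_000, 410_000, 260_000, 180_000,
                   150_000, 300_000, 220_000, 120_000, 90_000], dtype=float)
titles = ["Sandy"] + [f"candidate {i}" for i in range(2, 11)]

z_scores = (counts - counts.mean()) / counts.std(ddof=1)        # Eq. (1)
final_labels = [t for t, z in zip(titles, z_scores) if z >= 2]  # keep low-chance-of-randomness events
print(final_labels)                                             # -> ['Sandy'] for these numbers
```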

3.2.1 Event detectors

At first glance, the data looks like a time series. However, based on our prior knowledge we can argue that it is rare for someone to ride a bicycle in certain circumstances, such as at midnight, in heavy rain, or in very cold or very hot weather. Conversely, it is very likely that people rent more bikes during peak working hours or in good weather conditions at weekends. In short, it seems that the rental count has a close relationship to the environmental setting. To validate this hypothesis, we design our detectors in such a way that they support both perspectives: in some detectors we assume that the data is a time series with auto-correlated instances (unsupervised detectors), and in other detectors we assume that instances are temporally independent and correlated with the environmental setting (semi-supervised detectors). However, we give more weight to the latter detectors since they are more reasonable based on our prior knowledge. Please note that the term semi-supervised should not be confused with its equivalent in classification or anomaly detection, where class or anomaly labels are specified in the train data. Here we deal with a count time series; therefore, when we use the term semi-supervised we refer to a scenario in which we have access to each instance's corresponding environmental setting in the train set, and not to class labels.

Ten detectors are designed in this study, each with its own distinctive ability. Different techniques are involved, such as regression trees, control charts, and hierarchical clustering [25, p. 520] with two different distance functions, Euclidean and Dynamic Time Warping (DTW) [46]. We also employ Principal Component Analysis (PCA) [1] and Multi-channel Singular Spectrum Analysis (MSSA) [39] for denoising purposes in some detectors. A schematic representation of the detectors is presented in Fig. 7. For semi-supervised detectors, we build a predictive model from the train set based on the environmental and periodicity settings (to ease further explanations, from now on, when we refer to the environmental setting we mean both environmental and periodicity settings), make a forecast on the test set, compare the predicted bike rental count with the actual count in the test set, and monitor the residuals to detect events. In some detectors we also apply a filter to denoise the data. For unsupervised detectors, we monitor the test time series irrespective of the environmental setting.

Fig. 7
figure 7

A general architecture of the ensemble detectors. See Table 1 for more details

Table 1 Event detectors settings

As already mentioned, the data set is created at two scales: hourly and daily. If we perform the analysis only at the daily scale, we will not be able to detect events that affect the city only at specific hours during the day. Such events are also interesting and need to be detected. For instance, suppose that on 12/05/15 a severe event happens at 8 a.m. and that the rest of the day is calm. A daily scale analysis would probably not be able to detect this kind of event, because some events manifest themselves only at the hourly scale.

In order to provide a uniform output, alarms at the hourly scale are upgraded to the daily scale (e.g. detector 10). For instance, in the above example, 12/05/15, 8 a.m. is transformed to its corresponding higher scale, 12/05/15. In this way all detectors, despite their different input scales, generate the same kind of output, and their outputs can be combined. Each method finally returns the corresponding p value for each day. This p value indicates the probability of that day not being an event, so if we set a threshold such as 0.05, each instant with a p value lower than or equal to 0.05 is reported as an event.
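A minimal sketch of the scale upgrading and thresholding just described; the exact z-score-to-p-value mapping is not spelled out in the text, so a standard-normal tail probability is assumed here.

```python
import numpy as np
import pandas as pd
from scipy.stats import norm

# Hourly z-scores produced by some detector (random numbers stand in for real scores).
hourly_z = pd.Series(np.random.randn(48),
                     index=pd.date_range("2012-05-15", periods=48, freq="H"))

# Upgrade hourly alarms to the daily scale: one score per day (here the daily maximum,
# as in detector 10), then convert z-scores to p-values (probability of NOT being an event).
daily_z = hourly_z.resample("D").max()
daily_p = pd.Series(norm.sf(daily_z.abs()), index=daily_z.index)

# Report as events the days whose p-value falls at or below the 0.05 threshold.
alarms = daily_p[daily_p <= 0.05].index
```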

In the following, each individual detector is described in detail; a code sketch of one representative detector is given after the list. Note that the methods selected for the detectors are interchangeable and can be replaced with any other desired methods. However, we take two factors into account in our ensemble architecture (Fig. 7): firstly, well-known but different techniques should be involved to maximize diversity; secondly, the techniques should be appropriate for automatic settings.

  • Detector 1: The predictive model predicts the expected value on the hourly test set according to the corresponding environmental setting. The residuals between the hourly expected values and the hourly actual values on the test set are then transformed to z-scores. Next, we compute the daily mean of the z-scores for each day and again transform the obtained daily means to z-scores and consequently to p-values.

  • Detector 2: The predictive model predicts the hourly expected value on the hourly test set according to the corresponding environmental setting. We then compute the mean of the hourly residuals for each day. Afterward, the computed daily means are transformed to z-scores and consequently to p-values for each day.

  • Detector 3: The predictive model makes a forecast for the daily test set according to the corresponding environmental setting. The daily residuals are then computed as the difference between the daily predicted counts and the daily actual counts, and are transformed to z-scores and consequently to p-values.

  • Detector 4: This method does not need the train data; it operates directly on the daily test set. The count corresponding to each day is first transformed to a z-score and consequently to a p-value.

  • Detectors 5 and 7: These detectors operate as follows. First, the predictive model makes a forecast for the hourly test set according to the corresponding environmental setting, and we compute the residuals as the difference between the hourly predicted count and the hourly actual count. Next, a \(Days \times Hours\) matrix is built such that each cell represents the residual corresponding to that day and hour. MSSA (detector 5) or PCA (detector 7) is then applied to this matrix, producing a reconstructed matrix. Finally, the residual between the original and reconstructed matrices corresponding to each day is transformed to z-scores and consequently to p-values.

  • Detector 6: This method does not need train data. The hourly test data is converted to a \(Days \times Hours\) matrix. Afterward, MSSA is applied to this matrix, and the residuals between the original and reconstructed matrices corresponding to each day are transformed to z-scores and consequently to p-values.

  • Detectors 8 and 9: The predictive model predicts the expected value on the hourly test set according to the corresponding environmental setting. The residuals between the hourly expected values and the hourly actual values on the test set are then clustered using an agglomerative hierarchical clustering algorithm, once with the Euclidean distance and once with the DTW distance. Outliers are then chosen by manual inspection and reported as events. Note that this kind of approach is not appropriate for automatic detection and is only provided here for comparison with the other approaches.

  • Detector 10: The predictive model predicts the expected value on the hourly test set according to the corresponding environmental setting. The residuals between the hourly expected values and the hourly actual counts on the test set are then transformed to z-scores. These z-scores are still at the hourly scale, so we select the maximum z-score obtained for each day. The z-score corresponding to each day is then transformed to a p-value and reported.
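As a hedged illustration of how a semi-supervised detector of this kind could be implemented (roughly in the spirit of detector 2), the sketch below uses a scikit-learn regression tree in place of Weka's REPTree; the column names follow the features selected in Sect. 3.1 and a standard-normal tail is assumed for the z-score-to-p-value step.

```python
import pandas as pd
from scipy.stats import norm
from sklearn.tree import DecisionTreeRegressor

FEATURES = ["hour", "month", "workingday", "temp"]   # selected hourly features (Sect. 3.1)

def detector_2(train: pd.DataFrame, test: pd.DataFrame) -> pd.Series:
    """Return one p-value per day: the probability of that day not being an event.

    Both frames are expected to contain the FEATURES columns, the hourly count
    'cnt' and a 'date' column giving the day each hour belongs to (assumed layout).
    """
    model = DecisionTreeRegressor(min_samples_leaf=20, random_state=0)
    model.fit(train[FEATURES], train["cnt"])

    residuals = test["cnt"] - model.predict(test[FEATURES])     # hourly residuals
    daily_mean = residuals.groupby(test["date"]).mean()         # mean residual per day
    z = (daily_mean - daily_mean.mean()) / daily_mean.std()     # daily z-scores
    return pd.Series(norm.sf(z.abs()), index=daily_mean.index)  # z -> p-values
```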

3.2.2 Detectors settings

Table 1 shows the settings used for each detector. All detectors except detectors 4 and 6 are semi-supervised. For the semi-supervised methods we apply the REPTree regression tree as our predictive model (see Sect. 3.2.4 for justification). As can be seen in Table 1, even though some detectors receive the hourly train set as input, they score events at the daily scale. Three detectors (5, 6 and 7) also use a filtering strategy (MSSA or PCA) for data denoising. The two detectors based on agglomerative clustering (8 and 9) can only detect events and are not able to score each instant; detector 8 uses the Euclidean distance and detector 9 the DTW distance. Note that these clustering-based detectors are generally not an appropriate choice for automatic settings, since they need some prior knowledge for parameter setup. However, we include them in the ensemble to obtain more diverse detectors.

3.2.3 Fleiss’ Kappa test

The benefit of combining different detectors relies on the diversity among the ensemble's detectors [20]. Hence, an ideal ensemble is required to include diverse and different detectors. In order to confirm that the detectors were chosen appropriately, we apply an agreement test to the detectors' outputs to measure their disagreement rate (the more disagreement, the better).

Since we want to evaluate the overall agreement rate between all detectors, and not individual agreements between pairs of detectors, we cannot use the common Cohen's kappa [10]. Instead we use Fleiss' kappa [18], a statistic that measures the level of agreement between multiple raters when assigning categorical ratings to a number of items. It can be considered an extension of Cohen's kappa statistic to multiple raters. If a fixed number of raters assign categorical ratings to a number of items, the kappa reveals the consistency of the ratings. Fleiss' kappa is at most 1; values at or below zero indicate no agreement beyond chance. Table 2 shows how \(K\) values can be interpreted [33].

Table 2 Fleiss’es Kappa interpretation

So far, Fleiss’s kappa has been used in psychology and bio-informatics for measurement of agreement of different human agents on a subject. Here we use it for measuring the rate of agreement between multiple event detectors. Opposed to the psychology and bio-informatics that a k closer to 1 is more desired, we want a closer value to zero. Because we look for a group of non-similar detectors that could detect more range of events. If all detectors for instance agree and have k equal to 1 then it means that use of multiple detectors is meaningless and one detector is enough. In contrast, if k could approach zero it means that idea of using ensemble detector is helpful and result in better coverage of discovery of unknown events.

After the experiment we obtain a Fleiss' kappa equal to 0.0034, which means that the ensemble of detectors exhibits only very slight agreement. In other words, if we define the null hypothesis H0 as "the observed agreement is accidental", we cannot reject it given the observed low agreement. This is the desired result, since it reveals that the selected detectors exhibit a high degree of diversity.

3.2.4 Predictive model selection

One of the main components of the introduced semi-supervised detectors is the predictive model. We apply several regression and classification algorithms from Weka [26] to the train data to measure the accuracy of the built models. Table 3 compares the models in terms of correlation coefficient, relative absolute error (RAE), root relative squared error (RRSE), train time in seconds and test time in seconds. Train time is the time taken to build the model on the train set, and test time is the time taken to evaluate the model through tenfold cross-validation. As can be seen, REPTree [26] has the best performance in terms of the trade-off between accuracy, train time and test time. It belongs to the regression tree family and thus provides an interpretable model. IBk and the decision table both provide roughly the same accuracy but with a higher test time. Therefore, we select REPTree as the predictive model in the detectors. The RRSE, RAE and correlation coefficient in Table 3 are calculated using the following equations [26, p. 180].

$$\begin{aligned}&\mathrm{RAE} =\frac{|p_1-a_1|+\cdots +|p_n-a_n|}{|\bar{a}-a_1|+\cdots +|\bar{a}-a_n|}\end{aligned}$$
(2)
$$\begin{aligned}&\mathrm{RRSE} = \sqrt{\frac{(p_1-a_1)^{2}+\cdots +(p_n-a_n)^{2}}{(\bar{a}-a_1)^{2}+\cdots +(\bar{a}-a_n)^{2}}}\end{aligned}$$
(3)
$$\begin{aligned}&\mathrm{Correlation\ coefficient} = \frac{S_{PA}}{\sqrt{S_{P}S_{A}}} \end{aligned}$$
(4)

where:

$$\begin{aligned}&S_{PA}=\frac{\sum (p_{i}-\bar{p})(a_{i}-\bar{a})}{n-1},\end{aligned}$$
(5)
$$\begin{aligned}&S_{P}=\frac{\sum (p_{i}-\bar{p})^{2}}{n-1},\end{aligned}$$
(6)
$$\begin{aligned}&S_{A}=\frac{\sum (a_{i}-\bar{a})^{2}}{n-1} \end{aligned}$$
(7)

In the above equations, \(a_i\) denotes the actual target values, \(p_i\) the predicted target values, \(\bar{a}\) the average of the actual target values, \(\bar{p}\) the average of the predicted target values and \(n\) the sample size.
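For concreteness, Eqs. (2)-(7) can be computed as in the following sketch (the square root in RRSE follows the definition of root relative squared error).

```python
import numpy as np

def evaluate(p: np.ndarray, a: np.ndarray) -> dict:
    """RAE, RRSE and correlation coefficient of predictions p against actuals a."""
    n = len(a)
    rae = np.abs(p - a).sum() / np.abs(a.mean() - a).sum()                 # Eq. (2)
    rrse = np.sqrt(np.square(p - a).sum() / np.square(a.mean() - a).sum()) # Eq. (3)
    s_pa = ((p - p.mean()) * (a - a.mean())).sum() / (n - 1)               # Eq. (5)
    s_p = np.square(p - p.mean()).sum() / (n - 1)                          # Eq. (6)
    s_a = np.square(a - a.mean()).sum() / (n - 1)                          # Eq. (7)
    return {"RAE": rae, "RRSE": rrse, "corr": s_pa / np.sqrt(s_p * s_a)}   # Eq. (4)
```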

Table 3 Predictive models tested

3.3 Results

3.3.1 Event labels

Table 4 shows the events detected by our system that passed the verification phase using background knowledge. Bold items are those events that meet the condition z-score \(\ge 2\). The primary candidate list before the verification phase contains 69 events, and as can be seen this number is reduced to 30 events. To validate this result, we asked a domain specialist to rate the impact of the date corresponding to each detected event from 0 to 5. The third column in Table 4 indicates the impact rate specified by the specialist for that date. To evaluate the effectiveness of the condition z-score \(\ge 2\), we define a cut line on the impact rates given by the specialist and then examine how the detected events match the events above the cut line. Since the given impact rates are between 1 and 5, we define four cut points of 2, 3, 4 and 5 and compute the true and false alarm rates once for all items of Table 4 and once for the bold items only. The result is shown in Fig. 8. As can be seen, applying the condition z-score \(\ge 2\) decreases the false alarm rate by an average of up to 20 %.

Fig. 8
figure 8

Effect of condition z-score \(\ge 2\) on false alarm rate

Table 4 Detected events after the verification phase using background knowledge

Looking at Table 4, we notice that five alarms with impact rates of 5 and 4 assigned by the human domain specialist do not appear in the verified alarm list. This can be explained in two ways. On the one hand, the specialist admitted that our verification method, based on investigating Google web and image search and YouTube, is more reliable than his own evaluation. This is to some extent logical, because the knowledge existing in these sources represents the collaborative knowledge and intelligence of thousands of people with different insights and perspectives, which may well outperform a single knowledge source with a limited view of the subject. For this reason, one explanation of the observed gap is the effectiveness of our approach compared with the specialist's knowledge. On the other hand, the gap can be due to problems in the details of our methodology. For instance, due to the naming complexity of the events, our submitted query may not have been good enough to measure the weight of the event. Four of these events are weather-related, which indicates that the query "time + place + weather" might not be a good choice and that a more appropriate alternative query should be found. The reason may be that the weather events on these days are referred to by different terms, so the weight corresponding to these events is distributed over different terms. For instance, some sources might refer to a "Tornado" using different terms such as high-speed wind, storm, severe weather, etc. However, as can be seen, only one non-weather event is missing from our final event labels. This reveals that the condition z-score \(\ge 2\) is reasonably able to filter out non-significant alarms and consequently to avoid false alarms more effectively.

3.3.2 Event labeling in the absence of background knowledge

Here we study an event labeling model that relies only on ensemble detectors and has no access to any external knowledge (Fig. 1c). To compare this model with our proposed model (Fig. 1d), we need to compare the outputs of both models. If we take model d and its result as the reference, we can formulate the problem as an information retrieval problem: we run model c and then measure the similarity of the retrieved events to the reference detected events. In other words, we want to know to what extent model c can reproduce the same result as model d. We consider our model as the reference model because its output has already been checked with a human domain expert.

To this end we use a voting strategy to combine the ensemble detectors' alarms. We first define a confidence threshold equal to a p value of 0.05 and compute the total number of detector votes corresponding to each day. For detectors 8 and 9, which are based on clustering and do not return a p value, we assume that the detectors are confident enough (p value \(\le 0.05\)) and thus include them in the voting process.

We count the votes of the detectors for each instant and then compute the similarity of the detected list with Table 4. The results are presented in Table 5. \(N\) in this table denotes the number of detector votes. \(N>1\), for instance, indicates that at least two detectors agree that a hypothetical instant should be recognized as an event; \(N>2\), \(N>3\) and \(N>4\) mean that at least three, four or five detectors, respectively, are suspicious of a particular instant. As can be seen, the event signals corresponding to \(N>1\) are 72 % similar to the events marked in bold in Table 4. In other words, if at least two detectors vote for an instant, the alarms are over 70 % similar to the final output of our proposed model. However, as the required number of votes increases, the F-measure decreases to 0.58 for \(N>2\) and 0.23 for \(N>3\) and \(N>4\).
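A minimal sketch of the voting combination and its comparison against the reference labels; the handling of the clustering-based detectors (entered as p = 0 on flagged days) and the F-measure helper are illustrative.

```python
import pandas as pd

def vote_events(p_values: pd.DataFrame, alpha: float = 0.05, min_votes: int = 2) -> set:
    """Days flagged by at least `min_votes` detectors (min_votes=2 corresponds to N > 1).

    p_values: rows = days, columns = detectors; clustering-based detectors (8, 9)
    are assumed to enter with p = 0 on the days they flag.
    """
    votes = (p_values <= alpha).sum(axis=1)
    return set(votes[votes >= min_votes].index)

def f_measure(detected: set, reference: set) -> float:
    tp = len(detected & reference)
    precision = tp / len(detected) if detected else 0.0
    recall = tp / len(reference) if reference else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```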

Table 5 Ensemble detectors retrieval performance in the absence of background knowledge

We repeat the same procedure for the individual detectors to compare their performance. Comparing the F-measures of the ensemble detectors in Table 5 with those of the individual detectors in Table 6 reveals the effectiveness of the ensemble over the individual detectors. The maximum F-measure obtained by an individual detector is 0.55, for detector 1, which is about 20 % lower than the ensemble detectors under the \(N>1\) condition.

Table 6 Individual detectors retrieval performance in the absence of background knowledge

3.3.3 Evaluation of individual detectors with ROC analysis

ROC curves are robust tools for the evaluation of classifiers as well as event and anomaly detection algorithms. The vertical axis of a ROC curve corresponds to the true positive rate and the horizontal axis to the false positive rate.

As already mentioned, our detectors (except detectors 8 and 9) return a p value for each instant, indicating the chance that the instant is not an event. In most event detection systems, a pre-defined threshold (0.05) is usually set by the user as a confidence level; if the p value returned by the detector is lower than the threshold, an alarm is raised.

In order to plot the ROC curve of each detector, we first define the validated bold items in Table 4 as the target set. We then vary the threshold from 0 to 1 in steps of 0.001 and compute the true positive and false positive rates of the detected set against the target set. Next, in order to compare the accuracy of the individual detectors, we compute the area under the obtained ROC curves. The result is presented in Table 7. As can be seen, detector 10 provides the best accuracy, and detectors 7, 1, 2 and 5 are ranked in the subsequent places, all providing over 70 % accuracy. The values in this table do not indicate the strength of the detectors in general; rather, they reveal their specific performance under this particular condition. In other words, if we did not consider ensemble detectors and had no access to background knowledge, detector 10 could reproduce the same result as our system with an accuracy of 75 %, but there is no guarantee that this detector would perform as well in other circumstances.
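A minimal sketch of this ROC construction: sweep the alarm threshold over a detector's p-values in steps of 0.001, compute TPR/FPR against the verified (bold) event days, and integrate the curve.

```python
import numpy as np

def roc_auc(p_values: np.ndarray, is_event: np.ndarray) -> float:
    """Area under the ROC curve of one detector; is_event marks the verified target days."""
    tpr, fpr = [], []
    for t in np.arange(0.0, 1.001, 0.001):
        alarm = p_values <= t
        tpr.append((alarm & is_event).sum() / max(is_event.sum(), 1))
        fpr.append((alarm & ~is_event).sum() / max((~is_event).sum(), 1))
    order = np.argsort(fpr)                                   # ensure increasing FPR
    return np.trapz(np.array(tpr)[order], np.array(fpr)[order])
```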

Table 7 Area under the ROC curve for each individual detector

3.4 Sensitivity analysis

Here we present the results of our comparative study. Note that the following results are domain specific and may not generalize to other problems and settings. Due to the difficulty of obtaining data with the required characteristics, we were not able to perform experiments on different domains and applications. Hence, the following findings apply to bike sharing data and similar regression problems and domains, and need to be validated for other domains.

3.4.1 Learning: semi-supervised vs. unsupervised

Table 7 shows the area under the ROC curve for the individual detectors. As can be seen, semi-supervised detectors outperform unsupervised detectors: the area under the curve for the unsupervised detectors (detectors 4 and 6) is below 0.6. This shows that the bike sharing data is highly non-sequential and that each instant is temporally independent of the others; instead, instances depend on the environmental setting. Unsupervised detectors are able to detect only severe events, while semi-supervised detectors are able to detect more meaningful events.

3.4.2 Scale analysis: hour vs. day

We designed the detectors in such a way as to be able to compare the performance of the analysis at the daily scale with that at the hourly scale. Among the semi-supervised detectors, detector 3 is the only approach that operates at the daily scale: it builds a model from the train set of daily aggregated counts (Fig. 2) and then makes a forecast on the test set at the daily scale. The other approaches build a model from the train set at the hourly scale. The area under the curve of detector 3 compared to the other methods is presented in Table 7. The low AUC value of this detector reveals that, for detecting events at the daily scale, it is not always best to build a predictive model at the same scale; sometimes it is better to build the model at a smaller scale. The detectors that operate at the hourly scale show at least 10 % higher accuracy than detector 3. This provides evidence that, for event detection at a desired scale, training at a smaller scale is also worth considering.

3.4.3 PCA vs. MSSA

The idea of MSSA is to adapt PCA to time series. PCA looks at the instances independently, while MSSA takes into account the auto-correlation between temporal instances; capturing this auto-correlation is not considered in PCA. We compared the performance of PCA (detector 7) vs. MSSA (detector 5). We considered only three principal components for both and did not check other settings; they might perform differently under other settings. However, under the same conditions, as Table 7 shows, PCA outperforms MSSA. The reason is fairly clear: the bike sharing data is highly non-sequential and there is poor correlation between consecutive instances. If there were strong auto-correlation, MSSA would perform better. The point is that PCA and MSSA must be chosen according to the nature of the data: PCA is recommended for independent instances and MSSA for auto-correlated instances.
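A hedged sketch of the PCA-based denoising used by the PCA variant: the \(Days \times Hours\) residual matrix is reconstructed from its first three principal components and each day is scored by its reconstruction error (the per-day error measure is an assumption; the paper does not detail it).

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_day_scores(residual_matrix: np.ndarray, n_components: int = 3) -> np.ndarray:
    """residual_matrix: rows = days, columns = 24 hourly residuals."""
    pca = PCA(n_components=n_components)
    reconstructed = pca.inverse_transform(pca.fit_transform(residual_matrix))
    day_error = np.abs(residual_matrix - reconstructed).sum(axis=1)   # per-day reconstruction error
    return (day_error - day_error.mean()) / day_error.std()           # daily z-scores
```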

3.4.4 Predictive model: online vs. offline

In Table 3 we compared the performance of 12 different algorithms. The last algorithm in this table is the FIMT-DD algorithm [27], an online regression tree model. This is a streaming algorithm that scans the training data only once, using little memory and CPU. As can be seen, this model is less accurate than REPTree; however, if computational complexity is an issue, it presents a relatively reasonable performance. The correlation of its predicted counts with the original counts is 68.29 %, versus 91.57 % for REPTree. This difference seems a reasonable trade-off for large data sets or streaming settings where REPTree fails.

3.4.5 Distance: euclidean vs. DTW

A comparison of the performance obtained using the Euclidean (detector 8) and DTW (detector 9) distances is presented in Table 7. Although there is no significant difference, DTW exhibits the worse results. This reveals that DTW does not necessarily outperform the Euclidean distance in time series similarity measurement, and is further evidence that environmental attributes play an important role in the bike sharing process.

4 Conclusion

We proposed a novel event labeling model based on ensemble learning and background knowledge, and provided evidence for the effectiveness of the proposed model through a set of tasks on a real-world data set.

Our research findings can be summarized as follows: (1) when there is no access to human experts, background knowledge (if available) can be an appropriate alternative; (2) the train and test data sets do not necessarily have to be at the same scale, and we demonstrated that in some particular settings this assumption can be violated; (3) the regression tree model REPTree is promising on data sets like the bike sharing data set, and we believe that this model will probably work well on data sets of a similar nature, i.e. count time series affected by environmental and periodicity settings; (4) MSSA and DTW are recognized as robust tools in time series analysis, but as we demonstrated they can behave in the opposite way when the data is under seasonal effects; (5) in the absence of background knowledge, ensemble detectors can produce a result 70 % similar to the condition in which background knowledge is available for verification; (6) we offered evidence that ensemble detectors with at least two votes provide a 20 % better result than the best individual detector; (7) we showed that bike rental data is highly correlated with environmental and periodicity settings such as temperature, hour of the day, month and type of day (weekend, weekday, holiday); a regression tree model can use these environmental attributes to produce predictions very close to the actual counts, which shows that bike rental count time series should not be analyzed without taking the environmental attributes into account; (8) the web, and the knowledge existing in it, can be a useful source for aiding event detection systems.

Event detection on bike sharing data also has two potential applications. First, it can be incorporated into a decision support system for better planning and management of the system; second, it can be employed in a recommender system for alarm or suggestion purposes, for instance suggesting that people not go out due to severe weather conditions, or encouraging them to go out to participate in an ongoing event in town.

Further research will include the following directions: (1) Verification of the proposed model performance on other data sets and with other knowledge sources apart from online sources; (2) testing different ensemble designs; (3) studying different combination techniques in ensemble detectors; (4) spatiotemporal analysis on the data to discover localized events; (5) real time detection; (6) development of some text processing methods for automated capturing of knowledge from the Web.