Introduction

In the wastewater treatment area, there is constant pressure to improve treatment plant efficiency, including shortening the response time for corrective actions, to meet increasingly stringent disposal guidelines [1]. Using a UV–Vis spectrophotometer as a water quality monitoring device has gained attention in both the water and wastewater fields, as it is a simple analytical method with a wide range of monitoring applications [2,3,4]. The absorption spectrum contains useful analytical information about the character of the wastewater [5, 6]. Applying this measurement in online/real-time monitoring mode permits better optimisation of the treatment train [7,8,9,10].

Another factor that can adversely impact the wastewater treatment process is shock loads of contaminants, such as trade wastes, entering the system unexpectedly. Online monitoring has the potential to give a rapid indication of changing incoming water quality, allow "real-time" treatment responses and support existing and proposed management changes at an early stage in the process [11,12,13]. For a sustainable future, the monitoring of wastewater treatment and disposal will minimise the risk of environmental incidents. Rapid response to problems within water and wastewater systems will also minimise the cost of corrective actions, because management protocols and plans can quickly be put in place to contain the situation [10, 14,15,16].

When these monitoring devices operate online, the data are immediately available for operating systems more efficiently and reliably; however, the devices also generate enormous amounts of data (including spectra and other calculated parameters). As online analytical instruments have been increasingly deployed across the water industry, vast amounts of data have been generated, and operators and process analysts are challenged to access and assess large volumes of high-dimensional data and make quick responses/corrective actions [13,14,15,16,17]. Another issue is that the data from online monitoring devices are not always used to their full potential; one of the main barriers is that techniques for managing large amounts of data are not well developed in the water industry [13]. In operations, the data are often used only for basic alarm generation, and many opportunities to extract valuable information from this data, such as monitoring instrument performance and identifying significant trends, have been missed.

To improve the way users handle online monitoring data, operations managers in the water industry need a tool to access data that reveals the status of the systems they operate in "real time", enabling process optimisation and correct decision-making. Commonly used data storage/analysis software packages, such as Microsoft Excel, can provide basic time-series graphing for the static display of data from a historical database; however, they often lack the specialised functionality to cope with the unique datasets generated by online monitoring devices over an extended period, particularly the real-time streaming and processing of online monitoring data. Our earlier experience in developing process control applications with various analytical instruments, such as a UV–Vis spectrophotometer, a photometric dispersion analyser and a custom-built ammonia analyser, confirmed that in-house custom-built data processing software, built on commonly used platforms such as LabVIEW (National Instruments), R (R Core Team) and Visual Basic (Microsoft), requires operators with programming experience, can typically serve only one application and has proven rather inflexible to adapt [4, 8,9,10,11,12, 18,19,20,21]. Our new approach is to investigate and develop an effective framework for data integration for water quality monitoring and process control applications. Making correct decisions in managing a treatment plant requires multiple sources of information, including operational, water monitoring and weather data; these data are often stored in different locations and in different formats. To make the data useful, tools with effective data sharing and integration capabilities need to be developed. There were two objectives in this study: (1) to develop a data visualisation platform that handles multi-dimensional data and lets users explore data through visualisation and comparison, using the case of an online spectrophotometer measuring at the raw wastewater inlet of the selected Water Reclamation Plant, and (2) to develop an integrated software solution, including anomaly detection, that makes this information available together with other data so that operations personnel can make informed decisions, as a technological solution to support the water and wastewater industry.

Materials and Methods

Whyalla Water Reclamation Plant and Monitoring Equipment

Whyalla is a regional centre of South Australia (SA) located 396 km northwest of Adelaide, the capital city of SA. The Whyalla Water Reclamation Plant (WRP) uses activated sludge treatment via sequencing batch reactors and treats domestic sewage to produce 'Class B' recycled water according to the South Australian Reclaimed Water Guidelines [1]. The monitoring instrument, a s::can spectro::lyser™, is a UV–Vis spectrophotometer (200 to 750 nm) with a 5 mm optical path length and was set up in the inlet channel after the screens. Automatic air cleaning with compressed air was set up to clean the lens at regular intervals.

Online Data Collection

The online data stored in the con::sole was downloaded using the mobile network. The full spectral data (200 to 750 nm in 256 steps) was saved into a fingerprint (FP) file, with the data acquisition time interval set at 1 min. The FP file contains the spectral data from 200 to 730 nm, organised in rows with one spectrum (absorbance value against wavelength) per time point; Fig. 1 shows data collected every 1 min, with a section of the spectral data from 200 nm (Cell C2) to 220 nm (Cell K2) visible. The spectral data collected from this case study was used for further spectral analysis; details of the data analytics study have been reported in Chow et al. [10].

Fig. 1

An example of a s::can spectro::lyser™ FP file opened using Excel
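For illustration, a minimal R sketch of loading an FP file and plotting one spectrum is shown below (R being one of the platforms used in this work). The file name, the column positions and the 2.5 nm wavelength step (implied by 200 nm in Cell C2 and 220 nm in Cell K2) are assumptions to be checked against the actual export.

```r
# Minimal sketch: load an FP file and plot the first spectrum.
# "fp_export.csv", the column positions and the 2.5 nm step are assumptions.
fp <- read.csv("fp_export.csv", stringsAsFactors = FALSE)

# Columns C to K hold absorbance at 200 to 220 nm (one spectrum per row)
spectrum    <- as.numeric(unlist(fp[1, 3:11]))
wavelengths <- seq(200, 220, by = 2.5)

plot(wavelengths, spectrum, type = "l",
     xlab = "Wavelength (nm)", ylab = "Absorbance")
```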

Software Development

As part of this investigation into periodic biological process failures, a web-based prototype portal with data integration, visualisation, prediction and anomaly detection functions for complex online datasets was developed to access the data produced by the spectrophotometer. This tool has the potential to inform operational decisions. For the integration modules, an agile development approach was applied to gradually build the following components:

  • Data visualisation and comparison tool

  • Integration connectors to remote data sources

  • Data extraction tool for physical integration

  • Data mining tool

This web-based prototype system, shown in Fig. 2, was developed with a client/server system architecture.

Fig. 2

Software system architecture

The prototype system is divided into two parts: the client and the server. The client can be any commonly used web browser, such as Microsoft Internet Explorer or Google Chrome, provided as part of the computer operating system or downloadable from the internet as freeware. The client is responsible for collecting user requests (visualisation or analysis) and parameters (time duration or parameter names) and sending this information to the server. The server is a website supported by specially developed software components. It receives the request information from the client and communicates with the database and the data mining engine to retrieve data, conduct analysis, generate the required visualisation and send the graphs back to the client. In addition to the Apache web server, the server side also has a data receiver/uploader responsible for data collection, a MySQL [22] database for data storage, the JavaScript visualisation package Highcharts [23], the R data mining packages [24] for general data processing and models, and the Artificial Intelligence (AI) functions developed by our research team for profiling, anomaly detection and prediction. Data directly from the spectro::lyser, together with aggregated data of different granularity, were stored to increase the speed of access. When the visualisation function is called and the number of data items exceeds a limit, values at a higher aggregation level are automatically retrieved; if details of a specific period are needed, detailed data can be retrieved.
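To illustrate the granularity selection, the sketch below picks the minute, hourly or daily table according to how many points the requested window would return. The table names, the DBI connection con and the 5000-point limit are assumptions rather than the prototype's actual MySQL schema.

```r
# Minimal sketch of automatic granularity selection (schema names assumed).
library(DBI)

choose_table <- function(from, to, max_points = 5000) {
  mins <- as.numeric(difftime(to, from, units = "mins"))
  if (mins <= max_points)           "fp_minute"   # raw 1-min data
  else if (mins / 60 <= max_points) "fp_hourly"   # hourly aggregates
  else                              "fp_daily"    # daily aggregates
}

fetch_series <- function(con, from, to, wavelength = 200) {
  dbGetQuery(con, sprintf(
    "SELECT ts, absorbance FROM %s WHERE wavelength = %s AND ts BETWEEN '%s' AND '%s'",
    choose_table(from, to), wavelength, from, to))
}
```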

The prototype provides various functions for time-series visualisation, data analysis, dynamic online monitoring, 3-D visualisation, and manual and automatic data uploading. The prototype aims to support the inspection of the s::can spectro::lyser™ data and other types of data, and to support the investigation of patterns and the visualisation of anomalies and predictions. The Highcharts js libraries [23] allow the visualisation to be displayed in any browser without installing add-ons or apps. Client devices, including computers and mobile phones, can be used to access the tool at any time and place where an internet connection is available. The choice of an internet browser as the client eliminates the installation of client-side software and reduces operator training.

Data Management and Artificial Intelligence (AI)

The spectral data collected from this case study was used in spectral analysis, one of the key aims of this project. The files of data produced by the s::can spectro::lyser™ were imported into a relational database so that the seams between the files were removed. An error-correction procedure was then applied to identify and correct problematic data (negative values, null values and extremely large values). Extremely large values were defined as values several fold larger than the other large values in the dataset; extremely large positive values were replaced by three times the standard deviation of the hourly data. Negative values and null values were replaced by the mean value of the hourly data.
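A minimal R sketch of these rules is given below for a single wavelength series x with timestamps ts; the variable names and the 3-sigma cut-off used to flag extremely large values are assumptions, since the text defines such values only loosely.

```r
# Minimal sketch of the error-correction procedure (3-sigma cut-off assumed).
sanitise <- function(x, ts) {
  hour <- format(ts, "%Y-%m-%d %H")                  # hourly grouping key
  h_mu <- ave(x, hour, FUN = function(v) mean(v[!is.na(v) & v >= 0]))
  h_sd <- ave(x, hour, FUN = function(v) sd(v, na.rm = TRUE))

  bad    <- is.na(x) | x < 0                # null and negative values
  x[bad] <- h_mu[bad]                       # -> hourly mean

  extreme    <- !is.na(h_sd) & x > h_mu + 3 * h_sd   # 'extremely large' values
  x[extreme] <- 3 * h_sd[extreme]           # -> three times hourly std. dev.
  x
}
```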

Per-minute spectral data were also aggregated into hourly average and daily average spectra with their respective variances. The purpose of this operation was twofold. The first purpose was to increase access efficiency. If the period of data to be accessed was large, such as 2 years, using minute data would not be practical and could cause heavy loads when transporting the data over the network/internet. With the aggregated data, an automatic calculation decided whether minute, hourly or daily data should be used when plots were made, so the efficiency of access could be guaranteed. The second purpose of the aggregation was to help the detection of patterns by the AI components. Minute data contains many peaks (spikes), making it very hard to observe the average trend; the aggregation operation filters the peaks to reveal the underlying trend. The AI component of the tool used the following techniques.

  • Clustering-based profile learning

    • Segments of data that were labelled as normal by experts were used to learn hourly profiles for some major wavelengths for different day types, such as weekdays, weekend days and public holidays. The method used in this component is the well-known k-means clustering algorithm [25]. We used multiple values for the parameter k and chose the value with the best clustering effect. After clustering, cluster centres, mean radius and standard deviation are derived (a sketch of this step follows the list).

  • Event prediction using learnt profiles

    • In the detection stage, if the live data exceeds the confidence interval of the profile, warnings are generated with scores indicating how far the live data is from the centre of the profile. Our method follows the work of [26, 27].

  • Autoregression-based on-the-fly event prediction

    • In addition to the profile-based anomaly detection above, a different way of producing warnings was developed using autoregression. Autoregression uses the value vectors (for multiple wavelengths) of the preceding t time periods to predict the values at the next time point. Differences are identified by comparing the predicted values with the actual live values, and trained thresholds are used to decide whether a live value is an outlier. Autoregression was also used to detect outliers in [28].

  • Grouping events into densities for more reliable detection

    • To increase the reliability of detection, the density of the detected events was used to identify major anomaly events. Density-based outlier detection [29,30,31] has been widely used, and our design is similar to [32].
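A minimal sketch of the first two techniques is shown below. It assumes a matrix profiles of expert-labelled normal days (one row per day, 24 hourly means of one major wavelength), k = 3, and a score above about 2 as leaving the confidence band; none of these names or settings come from the prototype itself.

```r
# Minimal sketch: k-means profile learning and profile-based scoring.
# `profiles` (normal days x 24 hourly means) and k = 3 are assumptions.
set.seed(1)
k   <- 3
fit <- kmeans(profiles, centers = k, nstart = 25)

# Per-cluster mean radius and standard deviation of member distances
radius <- sapply(seq_len(k), function(cl) {
  members <- profiles[fit$cluster == cl, , drop = FALSE]
  d <- sqrt(rowSums(sweep(members, 2, fit$centers[cl, ])^2))
  c(mean = mean(d), sd = sd(d))
})

# Score a live day against its nearest profile; > 2 suggests an anomaly
score_day <- function(day) {
  d  <- sqrt(colSums((t(fit$centers) - day)^2))     # distance to each centre
  cl <- which.min(d)
  (d[cl] - radius["mean", cl]) / radius["sd", cl]
}
```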

In addition, appropriate access controls and a secured prototype were used to ensure the safe running of the system.

Results and Discussion

There are commercially available data management packages, such as PI DataLink (OSIsoft), for process engineers to assess process performance. These packages provide data collection, historising, finding, analysing, delivering and visualising functions. They are particularly useful for managing basic monitoring and day-to-day operations but offer limited functionality for advanced online monitoring systems and troubleshooting/investigation in a treatment plant. Our customised package was therefore developed to address the need for deeper analytical investigation, such as anomaly detection, comprehensive comparison of trends, and clustering and classification analysis, during non-routine operations (such as water quality investigations) and process troubleshooting.

The Usability of the Prototype

The prototype works in two modes: the query mode and the analysis mode. The query mode supports data visualisation. The web browser collects parameters and sends data requests to the web server. After receiving the requests from browser clients, the web server passes the queries to the database server and then, once the query results are returned, sends the data back to the browser client. This process routine is shown as the blue line in Fig. 2. The analysis mode supports data interrogation and analysis. After analysis requests are received from browser clients, the web server calls the data analysis functions in the Data Mining Engine (DME). The DME extracts the necessary data from the database server, analyses/mines the data, generates patterns and plots, and sends the plots back to the browser clients through the web server. This process routine is shown as the red lines in Fig. 2.

The prototype was then used to extract information about the studied water in terms of water profiles and to investigate representations of those profiles. In this architecture, the clients are only in charge of request collection and plot display. The server is responsible for all tasks of data retrieval, manipulation and analysis. This guarantees faster responses and better user support, both locally and remotely, even when the client computers/devices have limited memory and computing power.

The use of the prototype, especially the analysis mode, produces knowledge in terms of pattern/trend comparison. This knowledge is valuable in supporting intelligent decision making in plant operations and long-term planning, leading to smart information use and informed decision support.

Data Visualisation

Data visualisation, an important step for all types of analysis, is difficult to manage without a suitable software package. Although most monitoring devices have some simple visualisation functions, functions such as summarisation, comparison and operations over multiple files are missing from the onboard processors of these devices. Secondly, data from multiple devices and sources cannot be integrated and compared [33]. There is also a lack of support for deep interrogation and comprehensive analysis of data [13]. An industry research project, "Optimisation of Existing Instrumentation to Achieve Better Process Performance", supported by Water Research Australia, was conducted because of the issues reported [13], and the full project development and findings are detailed in the project report [34].

The spectrum measured by the spectrophotometer (s::can spectro::lyser™) combines contributions from both water quality changes and the state of the instrument (signal drift and noise). To understand the nature of the signal through advanced data analysis and build an intelligence base for the monitoring application, the following data visualisation, inspection and sanitation steps were conducted:

  • Visualise and compare data sequences

  • Investigate suitable ways to detect device drifting and learn the patterns of device drifting

  • Detect and filter noise introduced to the data generated by sensors

  • Detect and handle missing values (missing values have significant impact on learning the properties of water)

  • Verify data for correctness

  • Segment data sequences for the application of different approaches

A demonstrative prototype portal with data integration, visualisation, prediction and anomaly detection functions was developed to make better use of the data for early warning of possible problems and to assist decision-making (Fig. 3). The developed tools are not specific to the s::can spectro::lyser™; they can be applied to similar monitoring devices. The prototype provides various functions for time-series display, data analysis, dynamic online monitoring, 3-D visualisation and manual and automatic data uploading (Fig. 4).

Fig. 3

An example of the display of a selected single wavelength, such as 200 nm, extracted from the spectrum using a Google Chrome browser

Fig. 4

An example of displaying multiple spectra over the specified period

The example shown in Fig. 5a demonstrates the advantages of various plot functions, including Basic Plot, Time Series of Different Types, Time Series Comparison, and Differences of Two Time Series. These functions are useful for plant operators, as decisions are often based on several environmental parameters, such as temperature and rainfall, in addition to the water quality data.

Fig. 5

Examples of (a) the display of multiple databases and (b) statistical analysis including hourly average and 95% confidence range

Figure 5b shows the statistics plot display: the time series of the selected feature and the corresponding statistics over the same time period on one chart. Features of the statistics plots are listed below (a sketch of the band computation follows the list):

  • Plot a selected data series which can be fingerprint readings of a selected wavelength, parameter data, or lab data over a period

  • Display the hourly means of the selected data

  • Display the 99% confidence Bollinger Bands

  • Display data details for the point under the mouse cursor

  • Select date period with a sliding bar

  • Support plot export in formats of PNG/JPG/PDF/SVG
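For illustration, the hourly means and a Bollinger-style band can be computed as below, where the band is the mean plus or minus z standard deviations with z of about 2.58 for 99% coverage; the series name x and timestamp vector ts are assumptions.

```r
# Minimal sketch of the hourly means and 99% Bollinger-style bands.
# `x` (readings) and `ts` (POSIXct timestamps) are assumed names.
hour <- cut(ts, breaks = "hour")
mu   <- tapply(x, hour, mean, na.rm = TRUE)
s    <- tapply(x, hour, sd,   na.rm = TRUE)

z     <- qnorm(0.995)          # ~2.58 for a 99% band
upper <- mu + z * s
lower <- mu - z * s
```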

This portal is particularly useful for visualising data from multiple sources (data linkage and integration), such as weather data and laboratory data. The functionality provided by the portal includes time-series visualisation and comparison for multiple data sources and data analytics for various statistics plots, which support the following:

  • Collect environmental data such as temperature and rainfall data, and possibly laboratory derived data

  • Link the collected data to the sensor data and perform integration

  • Detect the impact of environmental events on the sensor data and model the impact

  • Separate the impacts of water quality itself from those of environmental factors

  • Design models to represent water properties (spectral character) and the impacts of environmental events

Prediction and Anomaly Detection

From the analysis of statistical patterns and anomalies, an early-stage predictive model was developed using multiple time series. As described earlier, four techniques were used to support this function: clustering-based profile learning, event prediction using learnt profiles, autoregression-based on-the-fly event prediction and density-based anomaly detection. The technology developed in the project is generic and can be adapted to other business areas, including catchment, drinking water treatment and distribution system management. The prediction analysis aims to build a function that estimates future values. When predicted values go beyond the range of normal patterns, an alarm can be triggered and further action taken to guarantee the smooth running of the plant, giving the operator more time to react and correct any operational issues. The model predicts the future value of a time series based on past values of that series and of neighbouring time series.

Assume that \({x}_{i\left(n+1\right)}\) is to be predicted from the following data

$${x}_{(i-d)1}, {x}_{(i-d)2}, \dots , {x}_{(i-d)n}$$
$$\dots$$
$${x}_{i1}, {x}_{i2}, \dots , {x}_{in}$$
$$\dots$$
$${x}_{(i+d)1}, {x}_{(i+d)2}, \dots , {x}_{(i+d)n}$$

where \({x}_{kj}\) is the reading for wavelength \(k\) at time point \(j\), \(2d+1\) is the number of time series considered in the prediction model, and \(n\) is the number of readings considered for each prediction. The coefficients \({a}_{kj}\) of the regression expression below are trained during the training phase:

$${x}_{i\left(n+1\right)}={\sum }_{j=1}^{n}{\sum }_{k=i-d}^{i+d}{a}_{kj}{x}_{kj}$$

The formula is applied at each moment to predict the value for the next moment.
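Under this notation, a minimal least-squares sketch of the training step is shown below. X is assumed to be a (2d+1) by N matrix of readings for wavelengths i-d to i+d over N time points, and n = 10 is an assumed window length; each training example flattens a window of n columns, and the target is the centre wavelength at the following time point.

```r
# Minimal sketch of fitting the coefficients a_kj by least squares.
# `X` ((2d+1) wavelengths x N time points) and n = 10 are assumptions.
n      <- 10
centre <- (nrow(X) + 1) / 2                 # row of wavelength i (odd row count)

windows <- t(sapply(seq_len(ncol(X) - n), function(s)
  as.vector(X[, s:(s + n - 1)])))           # one flattened window per row
target  <- X[centre, (n + 1):ncol(X)]       # value at the next time point

fit <- lm(target ~ windows - 1)             # coefficients a_kj, no intercept

# One-step-ahead prediction from the most recent window
latest <- as.vector(X[, (ncol(X) - n + 1):ncol(X)])
pred   <- sum(coef(fit) * latest)
```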

The prediction function integrates time-series display, model learning, model storage and the application of the learnt model for prediction. An example of prediction visualisation is shown in Fig. 6a. The user selects a data series and date from the control panel, and the data are displayed in the plot area (blue line). When the prediction button is clicked, the software applies a previously built multivariate autoregression model [35] to calculate predictions and displays them in the plot area (black line). If no model for the selected feature was previously built, a warning message appears: 'No model was learnt for this feature, please train first'. In this case, the user is prompted to train a model by clicking the training button; after training, predictions using the newly trained model appear in the plot area. If the user is satisfied with the new model, he/she presses the Update button to store it in the database. The user can retrain the model with the latest data and overwrite the model stored in the system to obtain better prediction performance. In a live monitoring situation, the user can thus forecast upcoming changes. The accuracy of the predictions (black line) against the actual points is shown in Fig. 6. The predicted line is consistent with the actual line while being less perturbed than it.

Fig. 6

Visualisation of prediction/training model, prediction (black) against actual (blue)

The prediction functions aim to actively foresee the values of upcoming time points and raise alarms if the predicted values are beyond the range of normal patterns. This enables early actions to be taken to ensure that the plant is running under optimum conditions. Anomaly detection is another important function in the prototype. To support this function, normal patterns in the spectrum data are trained based on statistical properties of the data and k-means clustering [25] over a normal period verified by the processing plant log. Full details of the data analytics component of this study have been reported in Chow et al. [10]. Anomalies are then detected using the Pearson product-moment correlation [36]. Following the Pearson correlation, the correlation coefficient ρ between a spectral vector x and a normal pattern y is calculated by:

$$\rho =\frac{\mathrm{cov}\left(x,y\right)}{{\sigma }_{x}{\sigma }_{y}}=\frac{{\sum }_{i=1}^{n}\left({x}_{i}-\overline{x}\right)\left({y}_{i}-\overline{y}\right)}{\sqrt{{\sum }_{i=1}^{n}{\left({x}_{i}-\overline{x}\right)}^{2}}\sqrt{{\sum }_{i=1}^{n}{\left({y}_{i}-\overline{y}\right)}^{2}}}$$
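In R, this check reduces to a few lines, as in the sketch below: each live spectrum is correlated against a learnt normal pattern and weakly correlated spectra are flagged. The matrix name spectra (time points by wavelengths), the pattern construction and the 0.95 threshold are illustrative assumptions; the prototype trains its own thresholds.

```r
# Minimal sketch of Pearson-based anomaly flagging (threshold assumed).
pattern <- colMeans(spectra_normal)             # a learnt normal pattern

rho     <- apply(spectra, 1, cor, y = pattern)  # one coefficient per spectrum
anomaly <- rho < 0.95                           # flag weakly correlated spectra
```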

For illustration, 1440 spectral vectors (one per minute) from 2014-08-02 were extracted. For each time point, the correlation coefficients between the spectral vector and the two patterns were calculated. To visualise anomalies in the data, an anomaly detection model has been incorporated into the prototype. This model can be trained on the statistical properties of the data, and anomalies can be detected and displayed in the visualisation area. The model is for a user-selected wavelength. After the Detection button is clicked, the points detected as anomalies are displayed in red, as shown in Fig. 7.

Fig. 7

Visualisation of anomaly detection (red dots)

The detected anomalies were compared against the logs of the water treatment plant, and the results matched very well.

Real-Time Detection

In the real-time monitoring module, a function is provided for real-time sensor data to be appended to the existing time-series visualisation. At the development stage, this function runs only in demonstration mode; in a real application, an online sensor would be 'connected' to the software package and live data would be fed in. Combined with the prediction function described earlier, this function forms the basis for 'on the fly' operation. Currently, the function is a demonstration (Fig. 8) capturing data from a database to simulate the process.

Fig. 8

Real-time monitoring example

Although this function is currently in demonstration/simulation mode, it has great potential to improve the usability of the whole real-time monitoring package and assist treatment plant operation.
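A minimal sketch of such a simulation loop is given below: it polls a database table for rows newer than the last one seen, as the demonstration mode does, and would hand them to the visualisation. The connection con, the table/column names and the 60-s poll interval (matching the 1-min acquisition interval) are assumptions, not the prototype's actual code.

```r
# Minimal sketch of the demonstration mode: poll for new rows and append.
# `con`, the table/column names and the poll interval are assumptions.
library(DBI)

last_ts <- Sys.time() - 3600                 # start one hour back
repeat {
  new <- dbGetQuery(con, sprintf(
    "SELECT ts, absorbance FROM fp_minute WHERE ts > '%s' ORDER BY ts",
    format(last_ts, "%Y-%m-%d %H:%M:%S")))
  if (nrow(new) > 0) {
    last_ts <- max(as.POSIXct(new$ts))
    # append `new` to the displayed series / push to the browser here
  }
  Sys.sleep(60)                              # matches the 1-min data interval
}
```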

Conclusion

A smart data processing system that handles real-time data streams and supports the operator has become a necessary tool for managing the vast amount of data generated by online instruments [10, 13, 37,38,39]. The prototype demonstrated that these techniques can be applied in this setting and, through integration into treatment plant operations, indirectly verified that the methods are effective. This application is specific and has, so far, not been seen in any reports on detecting events at this scale with data from the s::can spectro::lyser™ and laboratory data. It provides the benefit of pre-filtering useful data to be transferred from the remote monitoring system and stored for reference purposes. Much of this data is stored but never accessed, and the potential capability of expensive instruments often goes unrealised due to a lack of understanding and the difficulty of extracting useful information from massive databases. Information obtained from various workshops suggests that this is a widespread problem and that there is a development need for utilising the obtained data to improve process performance [13, 34]. In other words, the water industry requires better tools to manage online data to achieve cost-effective improvements in system performance. This multidisciplinary research demonstrated the potential of real-time display and visualisation of online water quality data and the potential of prediction tools to provide early warning of possible process upsets. This innovative monitoring and data visualisation approach not only provides a tool to support operations but also forms part of the risk assessment system.