This section provides an overview of the current state of the art in big data usage, addressing briefly the main aspects of the technology stacks employed and the subfields of
visualization, and more technical aspects of
data stream processing. Future requirements and emerging trends related to big data usage will be addressed in Sect. 8.6.
4.1 Big Data Usage Technology Stacks
Big data applications rely on the complete data value chain that is covered in the BIG project, starting at data acquisition and continuing through curation, storage, and analysis to data usage. On the technology side, a big data usage application relies on a whole stack of technologies, ranging from data stores and their access mechanisms to processing execution engines that are used by query interfaces and languages.
It should be stressed that the complete big data technology stack can be seen as much broader, i.e., encompassing the hardware infrastructure, such as storage systems,
datacentre networking infrastructure, corresponding data organization and management software, as well as a whole range of services ranging from consulting and outsourcing to support and training on the business side as well as the technology side.
Actual user access to data usage is provided through specific tools and, in turn, through query and scripting languages that typically depend on the underlying data stores, their execution engines, APIs, and programming models. Some examples include SQL for classical relational database management systems, Jaql for Hadoop-based approaches (whose storage layer, HDFS, is modelled on Google's file system, GFS), Scope for Microsoft's Dryad and CosmosFS, and many other offerings, e.g. Stratosphere'sFootnote 1 Meteor/Sopremo and ASTERIX's AQL/Algebricks.
Analytics tools that are relevant for data usage include SystemT (IBM) for data mining and information extraction, R and Matlab (U. Auckland and MathWorks, resp.) for mathematical and statistical analysis, tools for business intelligence and analytics (SAS Analytics (SAS), SPSS (IBM)), tools for search and indexing (Solr (Apache)), and specific tools for visualization (Tableau (Tableau Software)). Each of these tools has its specific area of application and covers different aspects of big data.
The tools for big data usage support business activities that can be grouped into three categories: lookup, learning, and investigating. The boundaries are sometimes fuzzy and learning and investigating might be grouped as examples of exploratory search.
Decision support needs access to data in many ways, and as big data increasingly allows the detection of previously unknown correlations, data access must increasingly happen through interfaces that enable exploratory search rather than mere access to predefined reports.
4.1.1 Trade-offs in Big Data Usage Technologies
An in-depth case study analysis of a complete big data application was performed to determine the decisions involved in weighing the advantages and disadvantages of the various available components of a big data technology stack. Figure 8.2 shows the infrastructure used for Google’s
YouTube Data Warehouse (YTDW) as detailed in Chattopadhyay (2011). Some of the core lessons learned by the YouTube team include an acceptable trade-off in functionality when giving priority to low-latency queries. This justified the decision to stick with the Dremel tool (for querying large datasets), which has acceptable drawbacks in expressive power (when compared to SQL-based tools), yet provides low-latency results and scales to what Google considers "medium" scale. Note, however, that Google is querying "trillions of rows in seconds", running on "thousands of CPUs and petabytes of data", and processing "quadrillions of records per month". While Google regards this as medium scale, it might be sufficient for many applications that are clearly in the realms of big data. Table 8.1 shows a comparison of various data usage technology components used in the YTDW, where latency refers to the time the system needs to answer a request; scalability to the ease of using ever larger datasets;
SQL refers to the (often preferred) ability to use SQL (or similar) queries; and power refers to the expressive power of search queries.
4.2 Decision Support
Classical decision support systems—as far as they rely on static reports—use these techniques but do not allow sufficiently dynamic usage to reap the full potential of exploratory search. However, in increasing order of complexity, these groups encompass the following business goals:
Lookup: On the lowest level of complexity, data is merely retrieved for various purposes. These include fact retrieval and searches for known items, e.g. for verification purposes. Additional functionalities include navigation through datasets and transactions.
Learning: On the next level, these functionalities can support
knowledge acquisition and interpretation of data, enabling comprehension. Supporting functionalities include comparison, aggregation, and
integration of data. Additional components might support social functions for
data exchange. Examples for learning include simple searches for a particular item (knowledge acquisition), e.g. a celebrity and their use in
advertising (retail). A big data search application would be expected to find all related data and present an integrated view.
Investigation: On the highest level of decision support systems, data can be analysed, accreted, and synthesized. This includes tool support for exclusion, negation, and evaluation. At this level of analysis, true discoveries are supported and the tools influence
planning and forecasting. Higher levels of investigation (discovery) will attempt to find important correlations, say the influence of seasons and/or weather on sales of specific products at specific events. More examples, in particular of big data usage for high-level strategic business decisions, are given in Sect. 8.6 on future requirements.
At an even higher level, these functionalities might be (partially) automated to provide predictive and even
normative analyses. The latter refers to automatically derived and implemented decisions based on the results of automatic (or manual) analysis. However, such functions are beyond the scope of typical decision support systems and are more likely to be included in
complex event processing (CEP) environments, where the low latency of automated decisions is valued more highly than the additional safety of the human-in-the-loop provided by decision support systems.
4.3 Predictive Analysis
A prime example of predictive analysis is
predictive maintenance based on big data usage. Maintenance intervals are typically determined as a balance between a costly, high frequency of maintenance and an equally costly danger of failure before maintenance. Depending on the application scenario, safety issues often mandate frequent maintenance, e.g., in the aerospace industry. However, in other cases the cost of machine failures is not catastrophic and determining maintenance intervals becomes a purely economic exercise.
The assumption underlying
predictive analysis is that given sufficient sensor information from a specific machine and a sufficiently large database of sensor and failure data from this machine or the general machine type, the specific time to failure of the machine can be predicted more accurately. This approach promises to lower costs due to:
Longer maintenance intervals as “unnecessary” interruptions of production (or employment) can be avoided when the regular time for maintenance is reached. A
predictive model allows for an extension of the maintenance interval, based on current sensor data.
Lower number of failures as the number of failures occurring earlier than scheduled maintenance can be reduced based on sensor data and
predictive maintenance calling for earlier maintenance work.
Lower costs for failures as potential failures can be predicted by predictive maintenance with a certain advance warning time, allowing for scheduling maintenance/exchange work, lowering outage times.
4.3.1 New Business Models
The application of
predictive analytics requires the availability of sensor data for a specific machine (where “machine” is used as a fairly generic term) as well as a comprehensive dataset of sensor data combined with failure data.
Equipping existing machinery with additional sensors, adding communication pathways from sensors to the predictive maintenance services, etc., can be a costly proposition. Based on experiencing reluctance from their customers in such investments, a number of companies (mainly manufacturers of machines) have developed new business models addressing these issues.
Prime examples are
GE wind turbines and
Rolls Royce airplane engines. Rolls Royce engines are increasingly offered for rent, with full-service contracts including maintenance, allowing the manufacturer to reap the benefits of applying predictive maintenance. By correlating the operational context with engine sensor data, failures can be predicted early, reducing (the costs of) replacements and allowing for planned maintenance rather than merely scheduled maintenance. GE OnPoint solutions offer similar service packages that are sold in conjunction with GE engines.Footnote 2
4.4 Exploration
Big datasets and the corresponding analytics results can be distributed across multiple sources and formats (e.g. news portals, travel blogs, social networks, web services, etc.). To answer complex questions—e.g. "Which astronauts have been on the moon?", "Where is the next Italian restaurant with high ratings?", "Which sights should I visit in what order?"—users have to start multiple requests to multiple, heterogeneous sources and media. Finally, the results have to be combined manually.
Support for the human trial-and-error approach can add
value by providing intelligent methods for automatic
information extraction and
aggregation to answer complex questions. Such methods can transform the data analysis process to become explorative and iterative. In a first phase, relevant data is identified; in a second learning phase, context is added to this data. A third exploration phase allows various operations for deriving decisions from the data or for transforming and enriching it.
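The three phases can be sketched with the restaurant question from above; the sources, field names, and ratings below are entirely hypothetical stand-ins for real heterogeneous services.

```python
# Phase 1: identify relevant data from heterogeneous (hypothetical) sources.
directory = [  # e.g. from a business directory
    {"name": "Trattoria Roma", "cuisine": "italian", "city": "Berlin"},
    {"name": "Sushi Ko", "cuisine": "japanese", "city": "Berlin"},
    {"name": "Bella Napoli", "cuisine": "italian", "city": "Berlin"},
]
reviews = [  # e.g. from a separate review portal
    {"restaurant": "Trattoria Roma", "rating": 4.6},
    {"restaurant": "Bella Napoli", "rating": 3.2},
    {"restaurant": "Sushi Ko", "rating": 4.8},
]

candidates = [d for d in directory if d["cuisine"] == "italian"]

# Phase 2 (learning): add context by joining in ratings from the second source.
ratings = {r["restaurant"]: r["rating"] for r in reviews}
for c in candidates:
    c["rating"] = ratings.get(c["name"])

# Phase 3 (exploration): derive a decision -- highly rated Italian restaurants.
answer = sorted((c for c in candidates if c["rating"] and c["rating"] >= 4.0),
                key=lambda c: -c["rating"])
print([c["name"] for c in answer])
```

The manual effort the text describes is exactly what is automated here: source selection, the join that adds context, and the final filter that turns merged data into an answer.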
Given the new complexity of data and
data analysis available for exploration, there are a number of emerging trends in explorative interfaces that are discussed in Sect. 8.6 on complex exploration.
4.5 Iterative Analysis
An efficient, parallel
processing of iterative data streams brings a number of technical challenges. Iterative data analysis processes typically compute analysis results in a sequence of steps. In every step, a new intermediate result or state is computed and updated. Given the high volumes in big data applications, computations are executed in parallel, distributing, storing, and managing the state efficiently across multiple machines. Many algorithms need a high number of iterations to compute the final results, requiring low latency iterations to minimize overall response times. However, in some applications, the computational effort is reduced significantly between the first and the last iterations. Batch-based systems such as
Map/Reduce (Dean and Ghemawat 2008) and
Spark (Apache 2014) repeat all computations in every iteration even when the (partial) results do not change. Truly iterative dataflow systems like
Stratosphere (Stratosphere 2014) or specialized graph systems like
GraphLab (Low et al. 2012) and
Google Pregel (Malewicz et al. 2010) exploit such properties and reduce the computational cost in every iteration.
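The saving such systems exploit can be illustrated with a minimal "delta" iteration: only elements whose state changed in the previous step are reprocessed, so late iterations touch only a small fraction of the data. The graph below is a toy example; real systems distribute both state and worklist across machines.

```python
# Label propagation for connected components with a worklist ("delta" iteration):
# each vertex starts as its own component, and only vertices whose label
# changed in the previous round are reprocessed in the next one.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
label = {v: v for v in edges}  # start: every vertex is its own component

worklist = set(edges)          # initially, all vertices count as "changed"
iterations = 0
while worklist:
    iterations += 1
    changed = set()
    for v in worklist:
        for n in edges[v]:
            if label[v] < label[n]:  # propagate the smaller component id
                label[n] = label[v]
                changed.add(n)
    worklist = changed         # next round touches only changed vertices

print(label, "after", iterations, "iterations")
```

A bulk system in the style of plain Map/Reduce would rescan every vertex in every round; the worklist version converges to the same fixed point while shrinking the per-iteration work as labels stabilize.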
Future requirements on technologies and their applications in big data usage are described in Sect. 8.6, covering aspects of
pipelines versus materialization and error tolerance.
4.6 Visualization
Visualizing the results of an analysis, including a presentation of trends and other predictions by adequate visualization
tools is an important aspect of big data usage. The selection of relevant parameters, subsets, and features is a crucial element of
data mining and
machine learning with many cycles needed for testing various settings. As the settings are evaluated on the basis of the presented analysis results, a high-quality visualization allows for a fast and precise evaluation of the quality of results, e.g., in validating the predictive quality of a model by comparing the results against a test dataset. Without supportive visualization, this can be a costly and slow process, making visualization an important factor in data analysis.
For using the results of data analytics in later steps of a data usage scenario, for example, allowing
data scientists and business decision-makers to draw conclusions from the analysis, a well-selected visual presentation can be crucial for making large result sets manageable and effective. Depending on their complexity, visualizations can be computationally costly, which may hinder their interactive use.
However, explorative search in analytics results is essential for many cases of big data usage. In some cases, the results of a big data analysis will be applied only to a single instance, say an airplane engine. In many cases, though, the analysis dataset will be as complex as the underlying data, reaching the limits of classical statistical visualization techniques and requiring interactive exploration and analysis (Spence 2006; Ward et al. 2010). In Shneiderman’s seminal work on visualization (Shneiderman 1996), he identifies seven types of tasks: overview, zoom, filter, details-on-demand, relate, history, and extract.
Yet another area of visualization applies to data models that are used in many machine-learning algorithms and differ from traditional data mining and reporting applications. Where such data models are used for
clustering, recommendations, and
predictions, their quality is tested with well-understood datasets. Visualization supports such validation and the configuration of the models and their parameters.
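As a minimal stand-in for richer visualization tooling, even a text histogram of prediction residuals supports the kind of quick model validation described above; the predicted and actual values here are invented for illustration.

```python
from collections import Counter

# Hypothetical test-set values and model predictions.
actual = [10, 12, 15, 20, 22, 30, 31, 40]
predicted = [11, 12, 14, 23, 21, 29, 35, 38]

# Residuals (prediction errors); a good model clusters these around zero.
residuals = [p - a for p, a in zip(predicted, actual)]

# Bucket residuals into integer-wide bins and draw one '#' per observation,
# giving a crude picture of the error distribution.
bins = Counter(residuals)
for value in range(min(bins), max(bins) + 1):
    print(f"{value:+3d} | {'#' * bins.get(value, 0)}")
```

A skewed or wide histogram is immediately visible, whereas the same judgement from raw numbers is slow and error-prone, which is the point the text makes about visualization in the validation loop.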
Finally, the sheer size of datasets is a continuous challenge for visualization tools that is driven by technological advances in GPUs, displays, and the slow adoption of immersive visualization environments such as caves, VR, and AR. These aspects are covered in the fields of scientific and information visualization.
The following section elaborates the application of visualization for big data usage, known as
visual analytics. Section 8.6 presents a number of research challenges related to visualization in general.
4.6.1 Visual Analytics
A definition of visual analytics, taken from Keim et al. (2010), recalls first mentions of the term in 2004. More recently, the term has been used in a wider context, describing a new multidisciplinary field that combines various research areas including visualization,
data analysis, data management,
geo-spatial and temporal data processing, spatial
decision support and statistics.
The “Vs” of big data affect visual analytics in a number of ways. The
volume of big data creates the need to visualize high dimensional data and their analyses and to display multiple data types such as
linked graphs. In many cases interactive visualization and analysis environments are needed that include dynamically linked visualizations. Data
velocity and the dynamic nature of big data calls for correspondingly
dynamic visualizations that are updated much more often than previous, static reporting tools. Data
variety presents new challenges for visualization tools, which must integrate and display many heterogeneous data types and formats.
The main new aspects and trends are:
Visual queries, (visual) exploration, and multi-modal interaction (touchscreen, input devices, AR/VR)
User adaptivity (personalization)
Semi-automation and alerting, CEP (
complex event processing), and BRE (
business rule engines)
Large variety in data types, including graphs, animations,
microcharts (Tufte), gauges (cockpit-like)
Spatiotemporal datasets and big data applications addressing geographic information systems (GIS)
Near real-time visualization in sectors such as finance (trading), manufacturing (dashboards), and oil/gas, using CEP and BAM (business activity monitoring)
Data granularity varies widely
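The semi-automation and CEP items above can be made concrete with a minimal rule sketch: an alert fires when a pattern is detected in an event stream. The threshold, window length, and readings are hypothetical; real CEP engines support far richer temporal patterns.

```python
# Minimal CEP-style rule: fire an alert when WINDOW consecutive readings
# exceed THRESHOLD. The event stream is toy data.
THRESHOLD = 100
WINDOW = 3

def alerts(stream):
    """Yield the index at which each alert fires."""
    over = 0
    for i, value in enumerate(stream):
        over = over + 1 if value > THRESHOLD else 0
        if over == WINDOW:
            yield i
            over = 0  # reset after firing to avoid duplicate alerts

readings = [90, 120, 130, 150, 80, 110, 115, 140, 105]
fired = list(alerts(readings))
print("alerts fired at indices:", fired)
```

Coupling such rules to a dashboard—so the alert both updates a visual indicator and optionally triggers an automated action—is the combination of visualization, alerting, and CEP the list describes.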
Use cases for visual analytics include multiple sectors, e.g. marketing, manufacturing, healthcare, media, energy, transportation (see also the use cases in Sect. 8.6), but also additional market segments such as software engineering.
A special case of visual analytics that is spearheaded by the US intelligence community is visualization for cyber security. Due to the nature of this market segment, details can be difficult to obtain; however, publications are available, e.g. from the VizSec conferences.Footnote 3