Introduction

Recent technical developments enable the collection and analysis of huge amounts of data from cyber-physical systems (CPS) in their use phase. These data can be fed back into product planning and development, where data analytics reveals valuable insights about product performance and usage patterns. This is already a common procedure in the software domain (software usage analytics) (Menzies & Zimmermann, 2013). However, it seems not to be common for CPS.

Data analytics is a highly interdisciplinary field that has adopted aspects from disciplines such as statistics, machine learning (ML), pattern recognition, system theory, operations research, and artificial intelligence (Runkler, 2020). By integrating analytical insights into decision-making processes, existing and future products can be optimized. These concepts form the research area of data-driven product planning (Meyer et al., 2021).

However, the integration of data analytics into the decision-making processes of product planning and product development poses major challenges for companies (Hou & Jiao, 2020; Wilberg et al., 2017). When implementing data analytics, small- and medium-sized enterprises (SMEs), in particular, face problems and challenges, such as a lack of know-how, a shortage of qualified employees, the dominance of domain specialists, and a lower awareness of topics such as data analytics and AI (Coleman et al., 2016).

This also complicates the effective use of machine learning in product planning and development. The data analytics process (e.g., CRISP-DM) requires comprehensive knowledge and understanding at several points. After understanding the problem, an essential step is to identify and understand the data and analyze the data sources to evaluate the relevant data for the defined analysis problem (Reinhart et al., 2017). One challenge arises during data discovery and collection: Often, companies or data analysts do not have an overview of the existing data. There is a lack of knowledge about where to start and where to get the right information from, as it is scattered throughout the company (Kayser et al., 2019; Menon et al., 2005). This quickly becomes a first hurdle in the application of data analytics.

In order to identify the data that is relevant to the defined use case or to better understand and use the existing data, those involved must also have knowledge and understanding of the data. “The most important knowledge a data mining engineer uses to judge workflows and models' usefulness: understanding the meaning of the data” (Kietz et al., 2010). A certain level of data competence is, therefore, required, although this is often lacking, as well as a systematic recording of metadata (Sternkopf & Mueller, 2018). These are challenges that must be overcome, especially in data-driven product planning, since numerous, heterogeneous data and an often complex system landscape must be taken into account in the use phase. Examples of data in the use phase of the product life cycle are maintenance data, fault messages, service data, and log and measurement data (Li et al., 2015).

After understanding the data, preprocessing and modeling can start. Here, different components or methods are used, e.g., cleaning data sets, preprocessing data further, extracting domain-specific features from the data, modeling them appropriately, and post-processing the model output (Reinhart et al., 2017; Shabestari et al., 2019). Together, these components form a specific data analytics workflow.

In order to determine and select the appropriate workflows, characteristics of the data and the analytical methods must always be taken into account in addition to the objectives of the use cases (Nalchigar & Yu, 2018).

The versatility of data and the additional variety of machine learning methods used in data-driven product planning make the setup of a workflow an extensive task, which requires expert knowledge. Therefore, a simplification for companies is required to be able to implement solutions more resource efficiently or with fewer specialists. Our aim is to facilitate an easy entry to data analysis and workflow design for data-driven product planning by providing a pre-selection of relevant methods via sample workflows. The prerequisite for this is comprehensive knowledge of relevant data sources in an aggregated way.

The fundamental research question in this context is as follows: Are there commonalities in data generated in the use phase or in product planning practice that suggest similar analytics processing, so that the data can be grouped together?

To answer this question and to build such data classifications, which can be mapped to specific analytics workflows, we use a design science research methodology approach. We first build upon an existing classification of use phase data from a product planning view to build a knowledge base of typical usage data. We then present a classification of usage data from the data analytics view, which delivers relevant data characteristics that can be used to describe the knowledge base data. Building on these artifacts, we propose a joint classification of data for data-driven product planning, whose artifacts can be assigned to appropriate workflows. Figure 1 summarizes these contributions and their connections.

Fig. 1 Contributions

The application in the context of data analytics in product planning is shown for one exemplary use case by means of a sample workflow.

Foundations and State of the Art

In the following, we present the foundations for the data classifications and existing approaches.

Data from Product Planning View

Definitions

Various definitions for data exist (Awad & Ghaziri, 2007; Bourdreau & Couillard, 1999; International Organization for Standardization, 1993; International Dama, 2017; Koohang et al., 2008; Morgenstern, 1997). In the following, we understand data as recorded interpretable signs and signals, which potentially provide information in a given context or for a specific purpose. In an industrial context, we speak of industrial data. Such data can be classified according to various properties.

Classifications of Industrial Data

Data in a production-oriented company can be divided into organizational and technical operational data (Kurbel, 2005). The organizational operational data includes order data and personnel data. Technical operational data are machine data, tool data, and material data. Machine data is differentiated into product and process data. The latter includes all data that is generated during the operation of a machine. Product data describes the condition of the manufactured part. In combination with process data, they encompass information about the production process as a whole.

According to Schäfer et al., data sources can be roughly divided into three groups according to the origin of the data: machine-generated content, human-generated content, and business data (Schäfer et al., 2012). Raffeiner proposes a classification that distinguishes between created, received, paid, and public data (Raffeiner, 2019).

An additional subdivision of data, which is made in computer science as well as in management science, is a distinction regarding the time reference. With regard to this data constancy, a distinction can be made between “master data” and “transaction data.” The term master data refers to data that remains constant over a long period of time. This includes, for example, company data such as buildings or plants. In contrast to master data, transaction data is time related and changes according to known or unknown processes. Transaction data and movement data are usually related to master data (Spitta & Bick, 2008).

Another classification is offered by the automation pyramid, which represents the different levels of automation in a factory and allows the structuring of technologies into different functional layers of industrial manufacturing (Dumitrescu et al., 2015). Along these layers, data sources in the form of IT systems, such as sensors, PLC, SCADA, MES, and ERP, are categorized.

In addition, industrial data can be classified in terms of their occurrence in the functional areas service, marketing, work preparation, development, purchasing, production, quality assurance, and IT (Gausemeier et al., 2009). A comprehensive product perspective is provided by categorizing data based on the product lifecycle phases product planning, design and development, production planning, production, use and support, and reuse and recycling (Kassner et al., 2015). Li et al. arrange data into the three main phases of product lifecycle management (PLM): beginning of life (BOL), middle of life (MOL), and end of life (EOL) (Li et al., 2015). Tao et al. propose another classification into management data, equipment data, user data, product data, and public data (Tao et al., 2018b). Table 1 summarizes the presented data classification approaches.

In data-driven product planning, the focus is often on the usage phase of the product life cycle and its data, often called usage data or field data (Kammerl et al., 2016; Kreutzer, 2019). Kreutzer refers to field data as data generated during the product or system usage phase after the point of sale (PoS) (Kreutzer, 2019). Edler defines field data as “[…] data that is generated in connection with the use of a product in the field or the use of a service by the customer. This includes, in addition to errors, malfunctions, defects or failures, usage information such as machine running times, consumption of operating materials […], and the requirements expressed by the user for the next product generation.” (Edler, 2001). With regard to the sources of field data, Kreutzer proposes the following classification for cyber-physical systems: sensors and actuators, user data, and system data. Sensors are divided into shape and material measures, functional and process variables, and environmental interaction variables. For use in product planning, this classification is not sufficient, since data related to the CPS or the product, such as service and customer data, is missing.

Data from the Data Analytics View

From a data analytics perspective, it is important to understand the (intrinsic) characteristics of the data in order to infer necessary or appropriate processing methods.

Definitions

“Data characterization describes the data in ways useful to the miner and begins the process of understanding what is in the data—that is, is it reliable and suitable for the purpose?” (Pyle, 1999). To describe the nature of data, characteristics are needed (Kitchin & McArdle, 2016). In this context, there is also often a reference to meta data. Metadata (“data about data”) refer to structured data that can be used to describe and specify facts about an information object (Dippold et al., 2005). Metadata are used to define data characteristics. This idea is common to the field of meta learning, where attributes relevant to the problem are of particular interest.

 

Table 1 Classification approaches for industrial data

Classifications

In general, the following types of data characteristics can be distinguished: general measures (general information to the dataset at hand, such as number of instances and dimensionality) as well as statistical and information-theoretic measures (attribute statistics and class distributions, such as mean and standard deviation) (Bilalli et al., 2016).

Another approach to characterization uses the characteristics of Big Data, such as quantity, variety, and speed (Zhang, 2016). According to Hildebrand et al., data can be described using six criteria: format, structure, content, stability, processing, and the business object (Hildebrand et al., 2015). An important criterion in data analysis is the structure of the data, which is also hidden behind the term data variety in big data terminology. The degree of structuring determines the further processing. Quality differences and problems form another dimension (Corrales et al., 2015). In order to successfully prepare data for analysis, a large number of criteria must be taken into account. These include completeness (often a problem especially with textual data), consistency, and accuracy.

Existing classifications are usually not or only partially aligned with the data analytics requirements. Ziegenbein et al. provide a list of data set characteristics, which are related to machine learning procedures (Ziegenbein et al., 2018). Since this is not an exact fit for the requirements in data-driven product planning, a new classification is needed.

Research Methodology

In the last section, we motivated the need for structuring and concretization of data sources of the use phase, which we call usage data in the following, as well as suitable characteristics to describe them with the goal of data analysis. This is also the first activity in the design science research methodology (DSRM) presented by Peffers et al. that we followed to develop the classifications (Peffers et al., 2007). The research process is summarized in Table 2. In the following, we describe design and development of the research process for conceptualization of the classification from the product planning view, from the data analytics view, and the joint classification for data-driven product planning in more detail (see Fig. 2).

Table 2 The DSRM process
Fig. 2 Design and development of classification artifacts

Data from a Product Planning View

In the context of the “Data from Product Planning View” section, different ways of classifying industrial and field data sources were introduced. For the usage data knowledge base, we used the classification according to Meyer et al. (2022), which introduces five categories of use phase data: 1. usage data (describe how a product is used by its customers and users), 2. user behavior data (summarize how users behave when utilizing the product), 3. service data (data dealing with problems and the quality of the product), 4. product behavior data (show how the product behaves and performs during operation) and 5. status data (describe the status and “health” of the product).

In order to extend the classification with further data objects, an intensive literature search was carried out, e.g., Li et al. (2015), Menon et al. (2005), Kassner et al. (2015), Kreutzer (2019), and Tao et al. (2018a). The results were enriched and validated by experts from industry and research within the research project DizRuPt.

Data from a Data Analytics View

The quality of insights for product planning generated from usage data highly depends on the correct usage of analytics techniques, which, in turn, is highly dependent on a smart classification of the data characteristics. The type of data determines which tools and techniques can be used to analyze the data (Tan et al., 2016). So, in the following, we will attempt to answer the question “What are the key characteristics/what is the nature of usage data?”. The characteristics are identified and organized using the method for taxonomy development suggested by Nickerson et al. (2013). Often used synonymously with terms such as framework, typology, or classification, taxonomies are empirically and/or conceptually derived groupings in terms of dimensions and characteristics (Puschel et al., 2020). Nickerson et al.’s method includes the following steps: determination of a meta-characteristic, determination of objective and subjective ending conditions, and the iterative choice of approach until all ending conditions are met. For the choice of approach, Nickerson et al. propose empirical-to-conceptual and conceptual-to-empirical approaches. In the empirical-to-conceptual approach, real-life objects are selected, characteristics are induced, given conceptual labels, and assigned to dimensions. In the conceptual-to-empirical approach, researchers first propose dimensions and characteristics before dimensions and characteristics are examined by classifying objects. This leads to an initial or revised taxonomy. Figure 3 summarizes the taxonomy development method suggested by Nickerson et al. (2013).

Fig. 3 Taxonomy development method according to Nickerson et al. (2013)

In line with our research question, our meta-characteristic was analytics-relevant characteristics of usage data. We distinguished between general data set-describing characteristics, which we assume are similarly pronounced for usage data, and very individual characteristics, which are company and infrastructure dependent. We used the objective ending conditions proposed by Nickerson et al.: every characteristic is unique in its dimension, every dimension is unique and not repeated, at least one object is classified under each characteristic of each dimension, and no new dimensions or characteristics have been added in the last iteration. Subjectively, the method ends when the taxonomy is determined by all the authors to be concise, robust, comprehensive, extendible, and explanatory. In Table 3, details of all iterations are shown. In the first iteration, we chose the conceptual-to-empirical approach to conceptualize dimensions and characteristics based on standard literature and the expert knowledge of the authors. As a starting point, we chose the popular big data characteristics, which we filtered with respect to our meta-characteristic. To evaluate the initial taxonomy, we used the first 22 data objects of the knowledge base (see Fig. 4). In the next iterations, we applied the empirical-to-conceptual approach. In summary, we used an additional 22 data objects from the knowledge base to infer new characteristics or other constellations and 10 real usage data sets from industry to challenge the individual dimensions and characteristics. To cover the perspective of the analytics side even better, we used descriptions of 20 algorithms from the literature in the last iteration to find out if the taxonomy was final.
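The four objective ending conditions listed above lend themselves to a mechanical check after each iteration. The following is a minimal sketch under our own assumptions: the representation of a taxonomy as a list of (dimension, characteristics) pairs, the function name `objective_conditions`, and the example object assignments are all hypothetical and not taken from the taxonomy development process itself. The subjective conditions still require the judgment of the researchers.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representations (assumptions for this sketch only).
Taxonomy = List[Tuple[str, List[str]]]         # (dimension, characteristics)
Assignments = Dict[str, Dict[str, Set[str]]]   # object -> dimension -> characteristics


def objective_conditions(current: Taxonomy, previous: Taxonomy,
                         assignments: Assignments) -> Dict[str, bool]:
    """Evaluate the four objective ending conditions for one iteration."""
    dims = [dim for dim, _ in current]
    prev_dims = {dim for dim, _ in previous}
    all_pairs = {(dim, c) for dim, chars in current for c in chars}
    prev_pairs = {(dim, c) for dim, chars in previous for c in chars}
    # Every (dimension, characteristic) pair that at least one object uses.
    used = {(dim, c) for obj in assignments.values()
            for dim, chars in obj.items() for c in chars}
    return {
        "dimensions_unique": len(dims) == len(set(dims)),
        "characteristics_unique": all(len(ch) == len(set(ch)) for _, ch in current),
        "every_characteristic_used": all_pairs <= used,
        "nothing_added_last_iteration": set(dims) <= prev_dims and all_pairs <= prev_pairs,
    }


# Illustrative data: the second iteration adds the characteristic "big".
taxonomy_v1 = [("dimensionality", ["low", "high"]), ("volume", ["small", "middle"])]
taxonomy_v2 = [("dimensionality", ["low", "high"]), ("volume", ["small", "middle", "big"])]
objects = {
    "sensor data": {"dimensionality": {"low"}, "volume": {"small"}},
    "log files": {"dimensionality": {"high"}, "volume": {"middle"}},
}
result = objective_conditions(taxonomy_v2, taxonomy_v1, objects)
print(result)
```

In this example, the new characteristic "big" is not yet backed by any object and was added in the last iteration, so two conditions fail and a further iteration would be required.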

Table 3 Iterations of the taxonomy development process

Joint Data Classification

The goal of this research step was to identify usage data with similar general characteristics and narrow down possible combinations of characteristics to a fixed set of artifacts. For this purpose, we combined the product planning and analytics view by using the classified data objects of the knowledge base according to the taxonomy. The assignments were again challenged with experts from research and industry who frequently work with usage data and, therefore, know their characteristics well. In the end, we obtained binary vectors that acted as input for the automated clustering. We chose a prototype-based algorithm, the well-known and most widely used clustering algorithm k-means, which determines a prototype for each cluster and forms clusters by assigning data objects to the closest cluster prototype (Wu, 2012). To determine the optimal number of clusters k, we used the graphical “elbow” method. That resulted in five clusters. Their interpretation showed that each cluster could be reasonably interpreted both on its own and in relation to the other clusters.
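To make the clustering step more tangible, the following sketch runs a plain k-means (Lloyd's algorithm) on binary characteristic vectors and computes the within-cluster sum of squares (inertia) for several values of k, which is the curve inspected in the elbow method. It is an illustration under stated assumptions, not the implementation used in this work: the data are synthetic stand-ins, and the function names, the number of features, the bit-flip rate, and the number of restarts are hypothetical choices.

```python
import numpy as np


def kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm; returns labels, prototypes, and inertia."""
    rng = np.random.default_rng(seed)
    # Initialise the prototypes with k randomly chosen data objects.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared Euclidean distance of every object to every prototype.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each prototype to the mean of its objects (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    return labels, centers, float(d2.min(axis=1).sum())


def elbow_curve(X, k_max, n_init=5):
    """Best inertia over n_init restarts for k = 1..k_max."""
    return [min(kmeans(X, k, seed=s)[2] for s in range(n_init))
            for k in range(1, k_max + 1)]


# Synthetic stand-in data (NOT the data of this study): 44 objects with
# 12 binary characteristics, drawn from 5 hypothetical patterns plus bit flips.
rng = np.random.default_rng(42)
prototypes = rng.integers(0, 2, size=(5, 12))
X = prototypes[rng.integers(0, 5, size=44)].astype(float)
flips = rng.random(X.shape) < 0.05
X[flips] = 1.0 - X[flips]

inertias = elbow_curve(X, k_max=8)
for k, value in enumerate(inertias, start=1):
    print(f"k={k}: inertia={value:.2f}")
```

Plotting the printed inertia values against k and looking for the point where the curve flattens corresponds to the graphical elbow criterion; the choice of k remains a visual judgment.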

In the second step, the generated clusters were combined with possible forms of the individual characteristics to obtain a comprehensive list of usage data classes.

Approach of Data Classification for Workflow Assignment

Classification for Usage Data from a Product Planning View

Figure 4 presents the knowledge base relying on the classification by Meyer et al. (see the “Data from a Product Planning View” section). It lists 44 relevant data sources or data objects of the use phase.

Fig. 4 Usage Data Knowledge Base

Classification for Usage Data from a Data Analytics View

Figure 5a, b show the taxonomy for general and individual usage data characteristics and possible indicators for easier classification of data objects. In the following, all dimensions and characteristics are described in more detail.

Fig. 5 a General data characteristics; b individual data characteristics

General Dimensions and Characteristics

  • The data set group (variety): This dimension examines data in terms of its variety, i.e., data set type and degree of structuring. Characteristics on the first layer are tabular data (structured) and text data. These can be broken down further. Record data assumes a dataset as a collection of records with a fixed set of data fields (variables). Table or matrix form is common. Generally, there is no explicit relationship among records, and every record has the same set of variables. Graph-based data considers data with relationships among objects or data with objects that are graphs (if objects contain sub objects that have relationships). For ordered data, the attributes have relationships with a temporal or spatial order. Ordered data can be grouped further into sequential transaction data (each transaction has a time associated with it), sequence data (the dataset that is a sequence of individual entities–positions instead of time stamps), time series data (each record is a series of measurements taken over time) with signals and no signals, and spatial data (spatial attributes, such as positions or areas). For text, a distinction can be made between structured and semi-structured text data. Image-, audio-, and graph-based data are grayed out because the procedure in the “Data from a Data Analytics View” section showed that they are not relevant as a characteristic for usage data. However, since these formats may well play a greater role in the future, they are also listed.

  • Dimensionality: Dimensionality is another important factor that can play a crucial role for the selection of an adequate analytics technique, e.g., too many dimensions cause every observation in a dataset to appear equidistant from all the others (curse of dimensionality), which is a big problem for clustering algorithms. Hence, the characteristics are small dimensional and high dimensional.

  • Distribution: Some general aspects of distributions often have a strong impact, which can make modeling difficult. Sparsity is such a special case, where most attributes of an object have values of 0. Some data mining algorithms, such as the association rule mining algorithms, work well only for sparse data (Tan et al., 2016). On the other hand, some algorithms such as random forests work best on dense data.

  • Complexity: Complexity in data can be expressed by, e.g., (auto-)correlation. This is important to know because one of the assumptions of regression analysis, for example, is that the data has no autocorrelation; if it is present, other methods may have to be used. Correlation and multicollinearity in data may have an impact on the performance of the model, too. Algorithms such as logistic regression or linear regression are not well suited in that case, so the problem should be fixed before training.

  • Real-time behavior (velocity): In data analytics or machine learning, real-time or online ML (training of a model by running live data through it to continuously improve the model) can be distinguished from traditional training, where a batch of historical data is used. The former requires different procedures than the latter.

  • Volume: Regarding the volume, a data object or dataset can have small, middle, or big size. To evaluate this, the amount of data generated per day is certainly important. The volume affects the analysis to the extent that some methods are better able to handle few training samples, e.g., support vector machines, or some algorithms are better suited to process large volumes of data.
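Several of the general dimensions above can be estimated directly from a numeric dataset. The sketch below computes a few such indicators (number of records and attributes, share of zero entries, lag-1 autocorrelation per attribute, and the strongest pairwise correlation). It is only an illustration: the function name `profile_general`, the choice of indicators, and the example signals are our assumptions, and no thresholds for labels such as "sparse" or "high dimensional" are implied.

```python
import numpy as np


def profile_general(X):
    """Summarise general characteristics of a numeric dataset.

    X: 2-D array, rows are records in temporal order, columns are attributes.
    """
    X = np.asarray(X, dtype=float)
    n_records, n_attributes = X.shape
    # Distribution: share of entries that are exactly zero.
    sparsity = float(np.mean(X == 0))
    # Complexity: lag-1 autocorrelation of every attribute.
    centered = X - X.mean(axis=0)
    denom = (centered ** 2).sum(axis=0)
    num = (centered[1:] * centered[:-1]).sum(axis=0)
    lag1 = np.divide(num, denom, out=np.zeros(n_attributes), where=denom > 0)
    # Complexity: strongest absolute correlation between two attributes.
    max_corr = 0.0
    if n_attributes > 1:
        with np.errstate(invalid="ignore", divide="ignore"):
            corr = np.corrcoef(X, rowvar=False)
        off_diag = np.abs(corr[~np.eye(n_attributes, dtype=bool)])
        if np.any(~np.isnan(off_diag)):
            max_corr = float(np.nanmax(off_diag))
    return {
        "n_records": n_records,          # volume indicator
        "n_attributes": n_attributes,    # dimensionality indicator
        "sparsity": sparsity,
        "lag1_autocorr": lag1,
        "max_abs_correlation": max_corr,
    }


# Illustrative data: a smooth periodic signal and a sparse event indicator.
t = np.arange(200)
signal = np.sin(2 * np.pi * t / 50)
events = np.zeros(200)
events[::20] = 1.0
profile = profile_general(np.column_stack([signal, events]))
print(profile)
```

For the periodic signal, the lag-1 autocorrelation is close to 1, which would point to the time series characteristics discussed under complexity, whereas the event column is what drives the high share of zero entries.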

Individual Dimensions and Characteristics

These individual characteristics are important mainly for the selection of the right preprocessing techniques, but they also play a role in the choice of modeling algorithms (Banimustafa & Hardy, 2012).

  • Data quality problems: Data quality has a major impact on data analysis; for example, some techniques are more tolerant of missing values, outliers, and unusual data distributions than others. Some data pre-processing procedures (e.g., outlier elimination, normalization, phasing, data reduction) may be necessary to address the quality issues and tailor the data for modeling. Characteristics are random noise, systematic errors, outliers, inconsistency, missing values, and duplicate data.

  • Variable type: To describe individual data objects, the variable type is suitable. Basically, categorical (qualitative) and numerical (quantitative) attributes are distinguished here. Qualitative attributes lack most of the properties of numbers and should be treated more like symbols. Here, again, nominal and ordinal types can be distinguished. Quantitative attributes are represented by numbers and have most of the properties of numbers. Binary and date variables can be both categorical and numerical and are sub characteristics of special/hybrid form.

To evaluate or determine the quality characteristics and to better estimate preprocessing actions, we propose the use of a three-level scale: “to be neglected,” “to consider,” and “dominant.” Quality constraints in the context of measurement quality may be negligible, for example, if the dataset contains constant systematic errors but only relations are of interest, or if random errors are present that are too rare to have an impact. Systematic errors that can be corrected would be rated “to consider” in the context of preprocessing. “Dominant” applies to intolerable sensor failures or random errors that dominate the data. The final evaluation, of course, also needs to consider the use case.
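As a rough illustration of how the three-level scale could be supported by simple indicators, the following sketch rates the share of missing values and of outliers in a single numeric variable. The thresholds (1% and 20%), the interquartile-range outlier rule, and the function names are purely hypothetical assumptions for this example; the rating proposed here remains an expert judgment that also depends on the use case.

```python
from statistics import quantiles
from typing import Dict, Optional, Sequence


def rate(share: float, consider: float = 0.01, dominant: float = 0.20) -> str:
    """Map a problem share to the three-level scale (thresholds are assumptions)."""
    if share >= dominant:
        return "dominant"
    if share >= consider:
        return "to consider"
    return "to be neglected"


def quality_profile(values: Sequence[Optional[float]]) -> Dict[str, str]:
    """Rate missing values and outliers of one numeric variable.

    Needs at least two non-missing values to compute quartiles.
    """
    present = [v for v in values if v is not None]
    missing_share = (len(values) - len(present)) / len(values)
    # Outliers via the 1.5 * IQR rule (one common heuristic among many).
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
    outlier_share = sum(v < low or v > high for v in present) / len(present)
    return {
        "missing values": rate(missing_share),
        "outliers": rate(outlier_share),
    }


# Illustrative readings with three gaps and one implausible spike.
readings = [10.0, 10.2, 9.9, None, 10.1, 55.0, 10.0, 9.8, None, 10.3, None]
ratings = quality_profile(readings)
print(ratings)
```

Such indicators can only pre-fill the assessment; whether a rated problem actually matters still has to be decided against the analysis goal.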

Joint Data Classification

As mentioned in the “Joint Data Classification” section of the research methodology, the data basis for the clustering is the expert assignment of the data objects from the knowledge base to the general characteristics of the taxonomy (see Fig. 6). We inferred five clusters or categories covering combinations of general usage data characteristics that typically occur together. The clusters are illustrated in Table 4, where we highlighted the most frequent characteristics per dimension. The names of the clusters are shaped by the most distinctive characteristics.

Fig. 6 The data basis for clustering (excerpt)

Table 4 Usage data clusters

Cluster 1: Sequential Sparse Real-Time Data

This cluster is characterized by the dataset group ordered data, more specifically time series and sequential transaction data. Data objects in this cluster are, in most cases, generated in real time; data size is small to middle, and, mostly, there is no obvious correlation. Furthermore, the cluster is marked by low dimensionality and sparsity. Sensor data, in some cases also actuator data, hardware, and software states as well as warning and error messages, can often be classified here.

Cluster 2: Highly Structured Historical Data

This cluster predominantly contains structured data, which can be stored in relational databases. The data volumes are rather small, also because the data are rather sparse. Examples of data objects are hardware configurations, factory settings, warnings, ratings, and login data.

Cluster 3: Mixed-Structured, High-Dimensional Real-Time Data

This cluster includes data from different data set groups. Objects that can be classified here often have a semi-structured format, but can just as well be sequential or structured data. Other characteristics of this data are its real-time behavior and middle to big data sizes. Often, they are also high dimensional and dense.

Cluster 4: Real-Time Time Series Data

Real-time time series data are characterized by a time series format or even signal characteristics. They are generated in real time and are mostly small in volume. Signal data such as vibration, on the other hand, often appear in large datasets. Since the focus is on time series, they are often characterized by autocorrelation, but tend to be low dimensional and dense. Vibration data, hardware and software status, runtime, and energy consumption can be classified into this cluster.

Cluster 5: Text Data

The last cluster is characterized by an unstructured or structured text format. The amount of data is rather small. Sparsity is given by the format. Examples are licenses and various protocols.

The resulting classes can be combined with the respective data quality assessment and variable type feature (see Fig. 5b). Since this results in more than 12,000 possible combinations, these must be reduced to a few representative classes. To obtain classes that are as relevant in practice as possible, we asked six industry and research institutes in a workshop into which of the five clusters they would classify their usage data and which quality characteristics their usage data exhibit. Some key results are summarized in Table 5 and the resulting final data classes in Table 6.

Table 5 Workshop results
Table 6 Final data classes

Toward Data-Driven Product Planning

An exemplary use case from data-driven product planning shall illustrate the application of the joint data classification for linking to a suitable sample workflow. A popular application to improve products is failure detection and diagnosis or root cause analysis. For example, a company wants to detect frequently occurring errors on its production machine and uncover possible causes. To do this, it can use machine sensor data, such as pressure, speed, and motor current on the one hand, and service reports on the other hand, which contain error information and possible causes for some processes. Machine data can be categorized into data class 4 from Table 6 (real-time time series data with systematic errors and random noise to be considered). Service reports belong to data class 5 (text data with many, partly dominant, missing values). A possible data analytics workflow for data with these characteristics is shown in Fig. 7 and could look as follows:

  1. Selection: For detecting failures, all machine information is helpful (pressure can indicate valve damage, RPM can indicate motor damage or bearing damage, and motor current can indicate bearing damage or blockage). Since service reports have very few failure cases documented and causes are usually missing, they are more suitable for validating failure detection.

  2. Integration: The three sensor measurements can be combined for multivariate analysis. For this, it is important that the time stamps and the sampling rates match. Here, if necessary, down- or upsampling can be used.

  3. Cleaning: Due to the data class, the data suffers from measurement inaccuracies and random noise. These can be addressed by calibration or setting an offset and by applying a filter such as a Kalman filter.

  4. Transformation: For time series, it is often worth transforming to the frequency domain to get a different perspective on the data. With respect to the detection method selected in the next step (LSTM autoencoder), the data require normalization between 0 and 1 and must be reshaped into a three-dimensional tensor.

  5. Detection: Since we want to detect failures, we can frame the problem as an anomaly detection task. Since numerical time series are involved and mainly normal states are known, statistical approaches or unsupervised or semi-supervised models come into question. Methods that are able to monitor several features or time signals in parallel (multivariate) include, e.g., clustering methods like DBSCAN or k-means, ARIMA, or autoencoders. We propose an LSTM autoencoder due to its suitability for temporal data.
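The integration, cleaning, and transformation steps of this workflow can be sketched in a few lines. The example below is a simplified illustration under several assumptions: the signals are synthetic, the sampling rates (10 Hz and 5 Hz), the sensor offset of 0.3, and the window length of 20 time steps are hypothetical, and a moving average is used as a simple stand-in for the Kalman filter mentioned in the cleaning step. The detection model itself (LSTM autoencoder) is deliberately left out; the sketch only prepares the three-dimensional input tensor (samples, time steps, features) such a model typically expects.

```python
import numpy as np


def align(t_target, t_source, x_source):
    """Integration: resample one signal onto a common time grid."""
    return np.interp(t_target, t_source, x_source)


def smooth(x, width=5):
    """Cleaning: moving average (odd width) as a stand-in for a Kalman filter."""
    kernel = np.ones(width) / width
    padded = np.pad(x, width // 2, mode="edge")  # avoid damped edges
    return np.convolve(padded, kernel, mode="valid")


def minmax(X):
    """Transformation: scale every column to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)


def windows(X, timesteps):
    """Transformation: stack sliding windows into (samples, timesteps, features)."""
    n = X.shape[0] - timesteps + 1
    return np.stack([X[i:i + timesteps] for i in range(n)])


# Synthetic stand-in signals; rates, offset, and window length are assumptions.
rng = np.random.default_rng(0)
t_fast = np.arange(100) / 10.0   # 10 Hz grid over 10 s
t_slow = np.arange(50) / 5.0     # 5 Hz grid over 10 s
pressure = 5.0 + 0.3 + 0.1 * rng.standard_normal(100)   # 0.3 = assumed known offset
speed = 1500.0 + 20.0 * np.sin(t_slow) + rng.standard_normal(50)
current = 2.0 + 0.05 * rng.standard_normal(100)

# Integration (upsample speed) and offset correction for pressure.
merged = np.column_stack([pressure - 0.3, align(t_fast, t_slow, speed), current])
# Cleaning: smooth every signal separately.
cleaned = np.column_stack([smooth(merged[:, j]) for j in range(merged.shape[1])])
# Transformation: frequency-domain view of one signal and model input tensor.
spectrum = np.abs(np.fft.rfft(cleaned[:, 0]))
tensor = windows(minmax(cleaned), timesteps=20)
print(tensor.shape)  # (81, 20, 3)
```

In practice, the scaling parameters would be fitted on data representing normal operation only and then reused for new data, so that anomalies are not hidden by the normalization.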

Fig. 7 Sample workflow for exemplary data class

After detection, the diagnosis part would start. Suitable techniques can again be provided for this task.

Conclusion and Future Research

We have presented three classification schemes for data in data-driven product planning. The first classification looks at usage data from a product planning view. The resulting knowledge base lists typical data of the usage phase and offers an overview of relevant data for data-driven product planning use cases. The second classification looks at data from a data analytics view by summarizing characteristics that are relevant to preprocessing and data analytics algorithm selection. Finally, these two approaches were combined by assigning the characteristics to the typical data sources and performing a cluster analysis on the result. The resulting data classes can be used in data-driven product planning to match data to sample workflows. This simplifies the task of understanding data and selecting appropriate analytics techniques. Using an application example, we illustrated how the data classes can be used to derive abstracted, generally valid sample analytics workflows for data-driven product planning. The development of such workflows requires future work. This can only take place for selected data classes. For this purpose, the important classes and their most frequent quality ratings must be identified. Furthermore, these workflows can only provide initial impetus, since not all factors to be considered, such as concrete domain knowledge, can be covered.