Simple-ML: Towards a Framework for Semantic Data Analytics Workflows

In this paper we present the Simple-ML framework, which we are developing to support the efficient configuration, robustness and reusability of data analytics workflows through the adoption of semantic technologies. We present the semantic data models that lay the foundation for the framework and discuss the data analytics workflows based on these models. Furthermore, we present an example instantiation of the Simple-ML data models for a real-world use case in the mobility domain.


Introduction
The creation of a Data Analytics Workflow (DAW) demands significant data science expertise. This expertise is required to integrate data from heterogeneous sources, to extract features for machine learning (ML) tasks, to configure the DAW and to optimize its parameters. The Simple-ML framework, which we are currently developing to address these challenges, aims to enable a robust, efficient and reusable DAW configuration through seamless integration of semantic information in all typical DAW components, making it a Semantic Data Analytics Workflow (SDAW). The adoption of semantic information, such as a domain model and semantic dataset profiles, substantially differentiates Simple-ML from existing data science frameworks such as RapidMiner or Microsoft Azure.
In this paper we present Simple-ML and illustrate its application to data analytics for urban mobility. Popular problems in this domain include short-term road traffic forecasting [5], the prediction of congestion patterns [7] and impact prediction of planned special events [8]. The corresponding SDAWs require a variety of heterogeneous data sources, including but not limited to traffic and mobility data streams, map data (e.g. OpenStreetMap), knowledge graphs containing events and spatial entities (e.g. EventKG [3] and Wikidata), as well as traffic warnings, accidents, weather conditions and event calendars [5,8].
Our contributions are as follows: (i) We propose the Simple-ML framework for SDAWs: a semantics-driven approach that aims at increasing the efficiency of the workflow configuration, as well as the robustness and reusability of DAWs, using semantic technologies. (ii) We introduce a domain-specific semantic data model that provides semantic descriptions of the application domain and of the relevant domain-specific datasets (i.e. dataset profiles). (iii) We illustrate an application of the Simple-ML framework to a real-world use case in the mobility domain.

Semantic Models for SDAWs
The goals of Simple-ML are realized through a domain model (Fig. 1), semantic dataset profiles and the SDAW. We conduct the modeling in RDF, reusing existing vocabularies (e.g. dcat) where possible. The terms specific to Simple-ML are defined in the Simple-ML vocabulary, denoted using the sml prefix.
Dataset Profiles: A dataset profile is a formal representation of dataset characteristics, referred to as dataset profile features. Such features can belong to the general, qualitative, provenance, statistical, licensing and dynamics categories [1]. In Simple-ML, the goal of the dataset profiles is to define the dataset characteristics required to facilitate SDAWs, including the information required for data materialization.
Dataset profile: A dataset profile is modeled as an instance of dcat:Dataset. General dataset profile features as well as provenance and licensing features are described using the DCMI Vocabulary (dcterms). Statistical dataset profile features (e.g. the number of instances) can be provided at the dataset and the attribute levels.
Dataset attributes: The attributes of the dcat:Dataset are modeled as instances of sml:Attribute. An attribute is described through its statistical characteristics at the instance level (e.g. the mean value sml:meanValue), along with the access information to the underlying data source (e.g. the column name in a relational database) to facilitate data access and materialization.
Dataset access: Simple-ML supports access to datasets through dedicated attributes that represent the physical storage location and the data format (e.g. sml:fileLocation and csvw:separator). Currently, relational databases (sml:Database) and text files (sml:TextFile) are supported.
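To make the profile structure more concrete, the following minimal sketch represents a dataset profile as plain subject-predicate-object triples in Python and looks up the access information. It is not the Simple-ML implementation: the ex: identifiers, the linking properties sml:hasAttribute and sml:hasFile, and all literal values are hypothetical placeholders; only dcat:Dataset, sml:Attribute, sml:meanValue, sml:TextFile, sml:fileLocation and csvw:separator are taken from the description above.

```python
# Illustrative sketch only: a dataset profile as (subject, predicate, object) triples.
# Assumptions: the ex: identifiers, sml:hasAttribute, sml:hasFile and all literal
# values are hypothetical; they are not defined in the paper.

PROFILE = [
    ("ex:fcd", "rdf:type", "dcat:Dataset"),
    ("ex:fcd", "dcterms:title", "Floating Car Data"),
    ("ex:fcd", "sml:hasAttribute", "ex:fcd_speed"),      # hypothetical property
    ("ex:fcd_speed", "rdf:type", "sml:Attribute"),
    ("ex:fcd_speed", "sml:meanValue", 42.5),             # placeholder value
    ("ex:fcd", "sml:hasFile", "ex:fcd_file"),            # hypothetical property
    ("ex:fcd_file", "rdf:type", "sml:TextFile"),
    ("ex:fcd_file", "sml:fileLocation", "data/fcd.csv"),  # placeholder path
    ("ex:fcd_file", "csvw:separator", ","),
]


def objects(triples, subject, predicate):
    """Return all objects of triples matching the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def access_info(triples, dataset):
    """Collect storage location and separator for each text file of a dataset."""
    info = []
    for file_node in objects(triples, dataset, "sml:hasFile"):
        info.append({
            "location": objects(triples, file_node, "sml:fileLocation")[0],
            "separator": objects(triples, file_node, "csvw:separator")[0],
        })
    return info


fcd_access = access_info(PROFILE, "ex:fcd")
```

In an actual deployment the triples would live in an RDF store and be queried with SPARQL; the list-based lookup here only mirrors that idea with the standard library.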
Mapping between the Dataset Profile and the Domain Model: Dataset attributes are mapped to the concepts in the domain model (sml:DomainClass) through the sml:Mapping class, as illustrated in Fig. 1. This mapping adds domain-specific semantic description to the dataset attributes and facilitates their use in the SDAWs. The class sml:Mapping provides two properties: sml:mapsToProperty to map a dataset attribute to a property in the domain model, and sml:mapsToDomain to specify the rdfs:domain of this property, which is an instance of sml:DomainClass.
Data Catalog: Dataset profiles are organized in a domain-specific data catalog. The extensible Simple-ML data catalog is modeled as an instance of dcat:Catalog. The data catalog schema, including representations of dataset profiles and the mapping to the domain model, is illustrated in Fig. 2.

Semantic Data Analytics Workflow (SDAW)
Iterative Generation of a Semantic Data Specification: In this first step, the user defines the semantic specification of the data to be used in the workflow. The input in this step is the data catalog. The specification is defined through the selection of the operations to be applied to the dataset(s) in the data catalog and their attributes. Possible operations include dataset selection, sampling, feature selection, feature extraction and data integration. These operations can be applied iteratively in a user-defined order. The semantic data specification is defined at the metadata level using dataset profiles and does not require any physical data access. The specification can be stored to facilitate reusability.
Data Materialization: The data specification configured during the previous step is applied to the physical datasets to materialize the integrated data.
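As a rough illustration of a data specification that is defined purely at the metadata level and can be stored for reuse, the sketch below records an ordered list of operations and serializes it to JSON. The class names, operation names, parameter names and ex: identifiers are assumptions made for this example and are not part of the Simple-ML vocabulary.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Operation:
    kind: str     # hypothetical operation name, e.g. "feature_selection"
    params: dict  # operation parameters, referring to profile metadata only


@dataclass
class DataSpecification:
    operations: list = field(default_factory=list)

    def add(self, kind, **params):
        """Append an operation; the user-defined order is preserved."""
        self.operations.append(Operation(kind, params))
        return self

    def to_json(self):
        """Serialize the specification so that it can be stored and reused."""
        return json.dumps(asdict(self), sort_keys=True)

    @classmethod
    def from_json(cls, text):
        """Restore a stored specification; no physical data is accessed."""
        raw = json.loads(text)
        return cls([Operation(**op) for op in raw["operations"]])


# Hypothetical usage: identifiers below are placeholders.
spec = (
    DataSpecification()
    .add("dataset_selection", datasets=["ex:fcd", "ex:osm"])
    .add("feature_selection", dataset="ex:osm", properties=["sml:maxSpeed"])
    .add("feature_extraction", dataset="ex:fcd", features=["week_day", "hour_of_day"])
)
restored = DataSpecification.from_json(spec.to_json())
```

Because the specification refers only to dataset and property identifiers, it can be created, edited and restored without touching the underlying files or databases; the physical access happens only during materialization.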
Semantic Machine Learning Workflow (SMLW): The domain model is complemented with an ML domain model that captures the essential properties of ML concepts and their implementation in specific frameworks. A domain-specific language (DSL) for SDAWs and SMLWs will include an advanced type system that uses metadata from the application domain to describe datasets and the intermediate results of data processing on the one hand, and metadata from the ML domain to describe the ML processing steps on the other hand. This will enable static checking of whether a particular ML method can be correctly applied to particular data. To this end, we will build upon previous approaches that aim to integrate ontologies into existing type systems (see e.g. [4]). We will go one step further by designing a language dedicated to the data analytics and ML domain that includes data models both for the data and for the ML processes.
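The intended kind of check can be approximated with a small sketch. The DSL and its type system are not specified here, so the following Python code only mimics the idea with a check that runs on metadata before any data is processed; the type names ("numeric", "categorical"), the MethodSignature class and the example method are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodSignature:
    name: str
    feature_types: frozenset  # value types accepted for input features
    target_type: str          # value type required for the prediction target


def check_applicable(method, schema, features, target):
    """Return a list of type errors; an empty list means the method is applicable.

    `schema` maps attribute names to value types and stands in for the
    metadata that dataset profiles and the domain model would provide.
    """
    errors = []
    for name in features:
        if schema[name] not in method.feature_types:
            errors.append(f"{name}: {schema[name]} not accepted by {method.name}")
    if schema[target] != method.target_type:
        errors.append(f"{target}: expected {method.target_type}, got {schema[target]}")
    return errors


# Hypothetical metadata and method description.
schema = {"max_speed": "numeric", "street_type": "categorical", "speed": "numeric"}
regression = MethodSignature("linear_regression", frozenset({"numeric"}), "numeric")

ok = check_applicable(regression, schema, ["max_speed"], "speed")
bad = check_applicable(regression, schema, ["max_speed", "street_type"], "speed")
```

Here `ok` is empty, while `bad` reports that the categorical attribute is not accepted. A DSL type system would reject the second configuration at compile time rather than at run time.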
Result Visualization: The domain model can be used to automatically suggest suitable visualizations for specific data types.

Domain Model for Mobility
sml:WeatherRecord: Temperature and rainfall at location and time.
dcterms:Location: Spatial information with geographical coordinates.
sml:SpeedLimit, sml:AccidentType, sml:VehicleType: Classes that represent categorical values for speed limits, accident types and vehicle types. These classes are subclasses of sml:MobilityClass, which is a subclass of sml:DomainClass and thus allows the use of sml:Mapping as shown in Fig. 2.
Fig. 5 provides an excerpt of an example Simple-ML mobility data catalog.

Simple-ML Application to Traffic Speed Prediction
We illustrate the iterative generation of a semantic data specification for the problem of traffic speed prediction for a specific road segment at a given time.
Dataset Selection: The user selects the Floating Car Data (F) and OpenStreetMap (O) datasets. Fig. 6 shows the SPARQL query to retrieve the profile of F.
Data Specification: (i) Feature Selection: The user selects four features based on the domain model: sml:maxSpeed and sml:hasTime from F (class sml:FloatingCarDataPoint), and rdf:type and sml:maxSpeed from O (class sml:StreetSegment). (ii) Feature Extraction: The user selects the following temporal features suggested by the system: week day and hour of day from F. (iii) Data Integration: A mapping between the vehicle positions in F and the street segment coordinates in O is suggested by the system and chosen by the user.
Data Materialization: Using the data specification, relevant features are materialized, with example instances shown in Table 1. The resulting data can then be used in the SMLW to train a supervised traffic speed prediction model.
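The feature extraction and integration steps can be sketched as follows. This is a simplified illustration and not the Simple-ML materialization component: it assumes that the spatial mapping between vehicle positions and street segments has already been resolved to a segment identifier, and all field names and sample values are hypothetical placeholders that do not reproduce Table 1.

```python
from datetime import datetime

WEEK_DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def temporal_features(timestamp):
    """Extract week day and hour of day from an ISO 8601 timestamp string."""
    t = datetime.fromisoformat(timestamp)
    return {"week_day": WEEK_DAYS[t.weekday()], "hour_of_day": t.hour}


def materialize(fcd_points, segments):
    """Join floating car data points with street segment features.

    Assumption: each point already carries the id of its matched segment.
    """
    rows = []
    for point in fcd_points:
        segment = segments[point["segment_id"]]
        row = {
            "speed": point["speed"],
            "street_type": segment["type"],
            "max_speed": segment["max_speed"],
        }
        row.update(temporal_features(point["time"]))
        rows.append(row)
    return rows


# Hypothetical sample data.
segments = {"s1": {"type": "residential", "max_speed": 30}}
fcd_points = [{"segment_id": "s1", "speed": 27.0, "time": "2019-03-04T08:15:00"}]
rows = materialize(fcd_points, segments)
```

Each resulting row combines the observed speed with the street type, the speed limit and the extracted temporal features, which is the shape of input a supervised speed prediction model would typically consume.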

Related Work
Recent works [2,4,6] aim to combine semantics and ML to address a variety of real-world problems. Simple-ML goes one step further and makes use of semantics in the entire DAW. To this end, Simple-ML employs dataset profiles and domain-specific data models. The survey [1] provides a comprehensive overview of RDF dataset profiling methods, tools, vocabularies and features, some of which are utilized by Simple-ML. We illustrate the use of Simple-ML in the mobility domain, which has given rise to many challenges and use cases for data analytics [5,7,8]. In Simple-ML, the mobility domain is modeled in a lightweight, data-driven manner that facilitates compatibility and reusability of the SDAWs across use cases and datasets.

Conclusion
In this paper we presented our current development towards the Simple-ML framework. Simple-ML adopts semantic technologies to support the efficient creation, configuration and reusability of robust data analytics workflows. We illustrated an application of the framework to a real-world use case in the mobility domain.