1 Introduction: The State of Practice Regarding the Use of Data in the Insurance Sector

The personalization of insurance products is a pressing necessity for insurance companies. Until now, however, the scarce availability of client information has made such personalization impossible. Furthermore, there is little or no automation in data entry, which leads to a significant number of errors, low-quality records, duplicated entries, and other data issues.

Despite these problems, insurance companies base their entire activity on data management, so data-related processes are key to their operation. As a result, these entities invest significant resources in carrying out part of these tasks semi-manually, relying on obsolete data or data sent by the client and on the experience of highly qualified personnel. However, if they really want to compete at an advantage in today’s world, these processes must be automated and improved.

In the following paragraphs, we analyze the state of the art of different technological areas, along with related innovations.

1.1 Predictive Underwriting and Automation of Underwriting Processes

Strangely enough, the finance and insurance industries still gather and complete information using manual or quasi-manual procedures. This leads to great inefficiencies, reduced economies of scale, and poorer customer service. As an example, there are cases of insurance pricing for SMEs in the USA where the time from product application to authorization and issuance averages between 30 min and 1 h.

In recent years, robotic information gathering techniques, automated ingestors, and machine learning have held the promise of significant improvements in this area. The use of such techniques has many advantages, including:

  1. The automatic underwriting system can be put into operation more quickly if it deals with only a subset of all possible instances.

  2. The error rate of the automatic underwriting system on these easy cases will be lower than the error rate of non-automated processing.

  3. Deployment issues (e.g., integration with other systems, obtaining user feedback and acceptance) can be addressed at an early stage of development.

1.2 Customization of Financial Product Offerings and Recommendation According to the SME’s Context

Product recommendation systems determine the likelihood that a particular product will be of interest to a new or existing customer. The rationale behind the recommendations is to enable targeted marketing actions addressed only to those business customers who might be interested, thus avoiding the traditional bombardment of untargeted advertising.

Recommendation systems are basically of three types:

  • Characteristic-based: In this case, the system learns to select potential customers by studying the differentiating characteristics of existing customers.

  • Grouping-based: Customers are divided into groups based on their characteristics using unsupervised learning algorithms, and the system recommends products oriented to each group.

  • Based on collaborative filtering: Such systems recommend products that are of interest to customers with similar profiles, based exclusively on the preferences expressed by users.

Collaborative filtering systems are the most widely used. However, these systems have been developed for cases where the number of products to be recommended is very large, for example, the case of Netflix movies. This is not typically the case with insurance products, where the number of customers is much larger than the number of products.

There are not many documented applications of these techniques to the insurance sector. For example, Qazi [1] proposes the use of Bayesian networks, while Rokach [2] uses simple collaborative filtering to obtain apparently good practical results. Mitra [3] uses hybrid methods combining feature-based recommendation with preference matching, and Gupta [4] uses another mixed approach, which however includes clustering algorithms.

1.3 Business Continuity Risk Assessment

A correct assessment of the business continuity risk assumed when offering a financial or insurance product is a foundation for doing business in the financial and insurance sector. Hence, all financial institutions have systems for estimating this risk.

This is another area where the use of machine learning can offer significant advantages. However, although the use of these techniques is common in credit scoring activities, their use in insurance remains quite limited. As an example, in [5], Guelman compares traditional linear models with decision tree algorithms, obtaining good results.

To date, financial and insurance companies have leveraged machine learning models based on internal and traditional financial data. Hence, the use of alternative data sources to improve machine learning models in insurance applications is completely new. Despite the implementation of personalization features in predictive underwriting and process automation, open data and data from alternative sources are not currently used, and their innovation potential remains underexploited.

2 Open Data in Commercial Insurance Services

Nowadays, the most common business data sources available include information about companies’ financial statements, property data, criminality levels, catastrophic risks, news, client opinions, information about the managers, number of locations, etc. While some of these sources are useful to update our current data, others can help automate processes and increase risk assessment accuracy.

2.1 Improving Current Processes

Data quality is the first aspect that must be improved. It is important to know the reality of the customer or potential customer to improve the results of daily tasks, and understanding the customers’ context is a prerequisite to deploying more sophisticated processes. The problem is rather common: according to our projects, on average 45% of an insurer’s commercial customers have inaccuracies in their business activities and/or their addresses, as illustrated in Fig. 18.1.

Fig. 18.1 Information inaccuracy example

Figure 18.1 illustrates the data sent to the insurer when a business insurance policy is purchased. However, this business information changes over time, and insurers tend to renew policies without updating the business data as required. As a result, the stored data are likely to become inaccurate or obsolete. To alleviate this issue, open data need to be linked to the insurer’s actual data sources, so that the information is properly updated and remains consistent with reality (Fig. 18.2).

Fig. 18.2 Linking open data to insurer’s database
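To make the idea concrete, the following minimal Python sketch shows how an insurer record could be refreshed from a linked open-data record when the latter is more recent. The record fields, the tax identifier used as the link key, and the update rule are illustrative assumptions, not the platform’s actual schema.

```python
from datetime import date

# Hypothetical insurer record and open-data record for the same SME,
# linked here through a fictitious tax identifier.
insurer_record = {
    "tax_id": "B12345678",
    "address": "12 Old Street",
    "activity": "Bakery",
    "last_updated": date(2018, 3, 1),
}

open_data_record = {
    "tax_id": "B12345678",
    "address": "34 New Avenue",
    "activity": "Bakery and cafe",
    "published": date(2021, 6, 15),
}

def refresh(insurer: dict, open_rec: dict) -> dict:
    """Return an updated copy of the insurer record when the linked
    open-data record is more recent (simplified update rule)."""
    if open_rec["tax_id"] != insurer["tax_id"]:
        return insurer  # records are not linked; leave untouched
    if open_rec["published"] <= insurer["last_updated"]:
        return insurer  # insurer data is already the most recent
    updated = dict(insurer)
    updated["address"] = open_rec["address"]
    updated["activity"] = open_rec["activity"]
    updated["last_updated"] = open_rec["published"]
    return updated

print(refresh(insurer_record, open_data_record))
```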

2.2 Process Automation

Once the data are adjusted to the customers’ reality, process automation becomes feasible. By establishing a connection between our systems and open data sources, automation can be applied to underwriting and renewals to reduce turnaround times and increase efficiency. Figure 18.3 illustrates the change from a traditional underwriting process for business property insurance to an automated one. The traditional process requires the potential customer to collect an extensive list of data points. In the case of Fig. 18.3, our system is connected to a property register and can perform the data collection instead of the customer, which greatly simplifies and accelerates the underwriting process. Specifically, a two-question underwriting process can be offered to the customer.

Fig. 18.3 Traditional vs. automated underwriting
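As an illustration of this idea, the Python sketch below prefills a property risk profile from a (stubbed) property-register lookup, so that only two questions remain for the customer. The function names, the cadastral reference, and the returned fields are hypothetical; a real deployment would call the register’s API instead of a stub.

```python
# Hypothetical property-register lookup; in the real platform this would be
# a call to the register's API rather than a hard-coded stub.
def lookup_property_register(cadastral_ref: str) -> dict:
    return {
        "construction_year": 1998,
        "surface_m2": 240,
        "building_use": "commercial",
        "construction_type": "brick",
    }

def automated_underwriting(cadastral_ref: str, answers: dict) -> dict:
    """Combine register data with the two remaining customer answers."""
    risk_profile = lookup_property_register(cadastral_ref)
    risk_profile.update(answers)  # only two questions are asked to the customer
    return risk_profile

# The customer only answers two questions; everything else is pre-filled.
profile = automated_underwriting(
    "1234567AB8901C0001DE",
    {"business_activity": "restaurant", "stock_value_eur": 30000},
)
print(profile)
```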

2.3 Ensuring Data Quality

Data quality is a major concern, as insurers base risks and premiums on the data they receive. The information that can be found online about a business is typically shared by the business itself, its clients, or public institutions’ business databases. Hence, the quality of the data must be assessed differently depending on the source and type of data.

Businesses share information on their website, on their social media profiles, and on their Google Business Card. Data quality can be audited by checking when the information was published and whether it matches across the different data sources. For example, if the business address found on Google Maps, on the business website, and on its social media profiles is the same, it can be safely assumed that this data point is correct and up to date.
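A simple way to audit such cross-source consistency is sketched below in Python: the addresses collected from the different sources are normalized and compared pairwise, and the data point is accepted only when all sources agree. The normalization rules and the similarity threshold are illustrative assumptions.

```python
from difflib import SequenceMatcher

def normalize(address: str) -> str:
    """Lower-case, strip punctuation and extra spaces before comparison."""
    cleaned = "".join(ch for ch in address.lower() if ch.isalnum() or ch.isspace())
    return " ".join(cleaned.split())

def sources_agree(addresses: list[str], threshold: float = 0.9) -> bool:
    """All pairwise similarities must exceed the threshold."""
    norm = [normalize(a) for a in addresses]
    return all(
        SequenceMatcher(None, a, b).ratio() >= threshold
        for i, a in enumerate(norm) for b in norm[i + 1:]
    )

# Addresses as collected from Google Maps, the website, and a social profile.
print(sources_agree(["12 Old Street, 2nd floor",
                     "12 old street - 2nd Floor",
                     "12 Old Street, 2nd floor."]))
```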

When the data is found in public registers or reviews, it often comes with the date on which the information was published. This makes it possible to assess whether the insurer’s data is more recent or not.

Another problem is illustrated in Fig. 18.4, which presents the ratings of two different businesses. Business 1 has a higher rating than Business 2. However, it only received seven opinions, and all of them were published on a single day. Therefore, the number of opinions and their dates must be taken into account, to avoid misleading conclusions that can lead to customer dissatisfaction and complaints. In this case, the opinions for Business 1 must be considered with caution, as it seems that either it is a new business or its clients do not tend to share opinions about it. On the other hand, Business 2 is strong and popular and has a good rating.

Fig. 18.4 Ratings for two different businesses
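A possible heuristic for this check is sketched below in Python: a rating is only trusted when it is supported by a minimum number of reviews spread over a sufficiently long period. The thresholds are illustrative assumptions rather than values used by the platform.

```python
from datetime import date

def reliable_rating(review_dates: list[date],
                    min_reviews: int = 20, min_span_days: int = 90) -> bool:
    """A rating is only trusted when it is backed by enough reviews
    spread over a sufficiently long period (illustrative thresholds)."""
    if len(review_dates) < min_reviews:
        return False
    span = (max(review_dates) - min(review_dates)).days
    return span >= min_span_days

# Business 1: seven reviews, all published on the same day -> not trusted.
business_1 = [date(2021, 5, 3)] * 7
# Business 2: many reviews spread over roughly two years -> trusted.
business_2 = [date(2019 + i % 2, 1 + i % 12, 1) for i in range(60)]
print(reliable_rating(business_1), reliable_rating(business_2))
```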

In the future, commercial insurance processes will become more automated. The automation of underwriting and renewal processes will provide significant benefits and increased profitability, especially in the case of small-medium enterprises (SMEs) that usually have small premiums. SMEs represent a market segment where competition among insurers is high. Hence, insurance companies that will manage to offer SMEs lower prices will be more competitive in this market.

Automation is, however, more challenging for bigger risks and other segments of the insurance market. Insurers that handle large premiums can invest in human resources to perform the underwriting process with the human in the loop. Most importantly, they can do so without any substantial increase in the premium. Also, automated systems are sometimes prone to errors or inefficiencies that cannot be tolerated for certain types of risks. In this context, machine learning techniques will be used in conjunction with human supervision. Human supervision is not only useful to prevent these mistakes but also to find new ways to apply open data.

Overall, increasing data quality is a key prerequisite for leveraging open data in insurance. With quality data at hand, one can consider implementing and deploying more complex processes, including the development of more automated and efficient data pipelines for insurance companies.

3 Open Data for Personalization and Risk Assessment in Insurance

To enable users to leverage open data for personalization and risk assessment in insurance applications, a new big data platform has been developed. The main functional elements of the platform are outlined in the following paragraphs.

3.1 User Interface Application

One of the main modules of the platform is its user interface, which enables users to manage their information. The interface is developed with the IONIC v3 framework using technologies such as Angular 6 with TypeScript, HTML5, and CSS3 with Flexbox. This setup enables the delivery of the user interface in Web format while also facilitating its operation on mobile platforms (i.e., Android, iOS). This client environment is hosted on the Amazon EC2 (Amazon Elastic Compute Cloud) service behind an Nginx server, and server capacity can be scaled according to client load.

3.2 Information Upload

The platform enables insurance and financial entities to upload their company’s information to the system. The information can be uploaded either via an appropriate API (application programming interface) or in CSV format using a panel on the platform. The latter option is important, as many SME companies are not familiar with open API interfaces. Once uploaded, the information is structured and automatically converted to JSON (JavaScript Object Notation) format. The information is then stored in a specific “bucket” for that company in Amazon S3, which includes separate JSON files for each client of the company that uses the platform.
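The following Python sketch illustrates this upload path using boto3: each CSV row is converted to JSON and stored as a separate object under the company’s prefix in Amazon S3. The bucket name, key layout, and column names are assumptions made for illustration; the platform’s actual storage layout may differ.

```python
import csv
import io
import json

import boto3

def upload_clients(csv_text: str, company: str, bucket: str = "platform-data") -> None:
    """Convert each CSV row to JSON and store one object per client
    under the company's own prefix in Amazon S3 (layout is illustrative)."""
    s3 = boto3.client("s3")
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = f"{company}/clients/{row['client_id']}.json"
        s3.put_object(Bucket=bucket, Key=key,
                      Body=json.dumps(row).encode("utf-8"),
                      ContentType="application/json")

upload_clients(
    "client_id,name,address\n001,Bakery Ltd,12 Old Street\n",
    company="acme-insurance",
)
```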

3.3 Target Identification: Matching Targets with Sources

As soon as the information is uploaded to Amazon S3, triggers are automatically activated. Triggers are a series of algorithms that analyze the information uploaded by the client and identify the different possible sources from which, according to the uploaded data, the user can be identified on the Internet. The identified targets are uploaded to a non-relational database using the AWS (Amazon Web Services) DynamoDB module.
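A minimal sketch of such a trigger, written as an AWS Lambda-style handler in Python, is shown below: when a client JSON file lands in S3, candidate Internet sources are derived from its fields and written as a target item to DynamoDB. The table name, source types, and field names are assumptions made for illustration.

```python
import json

import boto3

dynamodb = boto3.resource("dynamodb")
targets_table = dynamodb.Table("targets")  # table name is an assumption

def handler(event, context):
    """Triggered when a client JSON file lands in S3: derive the candidate
    Internet sources for that client and store them as a target item."""
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        client = json.loads(s3.get_object(Bucket=bucket, Key=key)["Body"].read())

        sources = []
        if client.get("website"):
            sources.append({"type": "website", "value": client["website"]})
        if client.get("name"):
            sources.append({"type": "business_directory", "value": client["name"]})
        if client.get("address"):
            sources.append({"type": "maps", "value": client["address"]})

        targets_table.put_item(Item={"client_id": client["client_id"],
                                     "sources": sources})
```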

Once the target is loaded, a set of matching robots is activated. These robots go through all the possible sources identified in the loaded data and associate the client information with the Internet sources that contain the same pieces of information about the company. The matching is performed with good accuracy, yet there is never absolute 100% success. Once matched, the relevant information is stored to identify the customer in the source.
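The following Python sketch illustrates the matching idea with a simple field-by-field similarity score and a confidence threshold, reflecting the fact that matching is never guaranteed to be perfect. The fields, threshold, and scoring function are illustrative assumptions; the production robots are more elaborate.

```python
from difflib import SequenceMatcher

def match_score(client: dict, candidate: dict) -> float:
    """Average similarity over the fields both records share."""
    fields = [f for f in ("name", "address", "phone") if f in client and f in candidate]
    if not fields:
        return 0.0
    return sum(
        SequenceMatcher(None, str(client[f]).lower(), str(candidate[f]).lower()).ratio()
        for f in fields
    ) / len(fields)

client = {"name": "Old Street Bakery SL", "address": "12 Old Street"}
candidates = [
    {"name": "Old Street Bakery", "address": "12 Old St."},
    {"name": "New Avenue Cafe", "address": "34 New Avenue"},
]
# Keep the best candidate only if it clears a confidence threshold,
# since matching is never guaranteed to be 100% correct.
best = max(candidates, key=lambda c: match_score(client, c))
if match_score(client, best) >= 0.75:
    print("matched:", best)
else:
    print("no confident match")
```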

3.4 Information Gathering, Collection, and Processing: ML Algorithms

The collection and processing of information from open sources are carried out using cognitive algorithms technology, namely, WENALYZE Cognitive Algorithms (WCA). These algorithms process in real time only the information that belongs to the users or targets without storing any data as it is found in the source of origin. Hence, only the result of the analysis is retained or stored.

Once the matching of the target with the different Internet sources has been completed, the platform unifies all the information about the company, leveraging all the different sources in a single node per user. Analytical algorithms are applied to the stored information toward associating the clients with the information obtained from external sources. This association is based on a hash map search algorithm, which forms an associative matrix. The latter relates catalogued words from a dictionary with those of the previously stored client information. This relationship generates a series of results that are stored to facilitate processes like fraud detection, risk assessment, and product recommendation.
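A minimal Python sketch of this idea is given below: a hash map (dictionary) relates catalogued terms to risk-relevant concepts, and the client’s texts are scanned to build the associative counts. The catalogue entries and concepts are invented for illustration and do not reproduce the WCA implementation.

```python
# A small catalogue of terms, each mapped to the concept it signals.
dictionary = {
    "bakery": "food_service",
    "restaurant": "food_service",
    "warehouse": "storage",
    "flood": "water_damage_risk",
}

def associate(client_texts: list[str]) -> dict:
    """Build an associative map: concept -> number of mentions across
    the client's stored information and the external sources."""
    counts: dict[str, int] = {}
    for text in client_texts:
        for word in text.lower().split():
            concept = dictionary.get(word.strip(".,"))
            if concept:
                counts[concept] = counts.get(concept, 0) + 1
    return counts

print(associate(["Family bakery near the river, previous flood damage reported.",
                 "Listed as restaurant in the local business directory."]))
```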

For the automation of the underwriting processes, classification algorithms are configured and trained, notably algorithms applying binary classification such as SVM (support vector machines).
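A minimal example of such a binary classifier, using scikit-learn’s SVM implementation, is sketched below. The features and labels are invented for illustration; in the platform the model is trained on the characteristics extracted from the unified client information.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative features per application: [years_in_business, surface_m2, claims_count]
X = [[12, 240, 0], [1, 80, 3], [8, 150, 1], [2, 60, 4], [15, 300, 0], [3, 90, 2]]
y = [1, 0, 1, 0, 1, 0]  # 1 = accept automatically, 0 = refer to a human underwriter

model = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
model.fit(X, y)
print(model.predict([[10, 200, 1]]))
```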

For the product recommendation system, a hybrid algorithm between feature-based filtering and collaborative filtering has been designed to maximize the use of all available information.
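The sketch below illustrates one simple way to blend the two approaches in Python: collaborative scores derived from the product-holding matrix are combined with feature-based scores through a weighting parameter. The interaction matrix, feature encoding, and weighting scheme are illustrative assumptions, not the platform’s actual recommender.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Rows = SME customers, columns = insurance products (1 = holds the product).
interactions = np.array([[1, 0, 1, 0],
                         [1, 1, 0, 0],
                         [0, 0, 1, 1]])
# Customer feature vectors (e.g., sector, size, turnover band), already encoded.
features = np.array([[1, 0, 0.8],
                     [1, 0, 0.6],
                     [0, 1, 0.9]])

def hybrid_scores(target: int, alpha: float = 0.5) -> np.ndarray:
    """Blend collaborative scores (products of similar customers) with
    feature-based scores, weighted by alpha."""
    sim_cf = cosine_similarity(interactions[target:target + 1], interactions)[0]
    sim_feat = cosine_similarity(features[target:target + 1], features)[0]
    collab = sim_cf @ interactions   # products held by behaviourally similar customers
    content = sim_feat @ interactions  # products held by feature-wise peers
    scores = alpha * collab + (1 - alpha) * content
    scores[interactions[target] == 1] = -np.inf  # do not re-recommend owned products
    return scores

print(hybrid_scores(0))
```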

For risk assessment, regression algorithms are used to predict a numerical value corresponding to the risk of the transaction. This regressor is based on the characteristics obtained from all the data sources mentioned above.
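A minimal sketch of such a regressor, here using scikit-learn’s gradient boosting implementation, is shown below. The feature set and risk scores are invented for illustration; the actual model choice and input characteristics may differ.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Illustrative features: [surface_m2, construction_year, crime_index, rating, n_reviews]
X = [[240, 1998, 0.2, 4.3, 120],
     [ 80, 1975, 0.6, 3.1,  15],
     [150, 2005, 0.3, 4.8,  60],
     [ 60, 1960, 0.8, 2.9,   8],
     [300, 2012, 0.1, 4.5, 200]]
y = [0.12, 0.45, 0.18, 0.60, 0.08]  # numerical risk score of each transaction

regressor = GradientBoostingRegressor(n_estimators=100, random_state=0)
regressor.fit(X, y)
print(regressor.predict([[180, 1990, 0.4, 4.0, 40]]))
```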

4 Alternative and Automated Insurance Risk Selection and Insurance Product Recommendation for SMEs

The platform is installed, deployed, and operated in one of the data centers and sandboxes provided by the H2020 INFINITECH project. It is intended to operate as a software-as-a-service (SaaS) tool used by different customers, with a reasonable additional customization and parameterization effort for each new use case. It also offers a visual front end that allows the creation of a demonstrator, as well as an API for connectivity.

The platform automates the processes of data extraction from different sources (i.e., from insurance companies, from banks, and from public data records or social networks) while boosting the ingestion and the preprocessing of this data. Moreover, it facilitates the storage and processing of distributed data, along with the execution of analytical techniques over large amounts of data toward improving business processes.

The data pipelines of the platform are illustrated in Fig. 18.5. The platform comprises the following main modules (a simplified pipeline skeleton is sketched after the list):

  • A cloud infrastructure, which provides scalability, distribution, and fault tolerance

  • Data extractors and data ingestion modules for the extraction and loading of data from insurance companies and banks

  • Data extractors from external sources, for the extraction and loading of data from external sources

  • Multi-language and multicountry data processing modules, which enable adaptive processing and extraction of different structured and unstructured data sources toward extraction of relevant characteristics

  • Storage modules that facilitate the creation of a distributed and secure data lake

  • Data processing and analysis modules that perform anonymization and transformation functions, including functions that identify the relationship between different sources and extract relevant characteristics

  • Microservices, which enable the scalable implementation of the business processes of the platform
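The following Python skeleton sketches how these modules could be chained into a single pipeline run; all functions are stubs standing in for the platform’s actual extractors, processing modules, and data lake.

```python
def extract_internal(company: str) -> list[dict]:
    """Stub for the data extractors that load insurer/bank records."""
    return [{"client_id": "001", "name": "Old Street Bakery SL"}]

def extract_external(client: dict) -> dict:
    """Stub for the external-source extractors (registers, maps, reviews)."""
    return {"rating": 4.3, "n_reviews": 120}

def process(client: dict, external: dict) -> dict:
    """Stand-in for the multi-language processing and feature extraction."""
    return {**client, **external, "features_ready": True}

def store(record: dict) -> None:
    """Stand-in for writing to the distributed, secure data lake."""
    print("stored:", record)

def run_pipeline(company: str) -> None:
    for client in extract_internal(company):
        store(process(client, extract_external(client)))

run_pipeline("acme-insurance")
```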

Fig. 18.5 Elements of the personalized insurance product platform in line with the INFINITECH-RA

Moreover, the platform provides a dashboard with different views for the end customer and for internal use by the company. Integration via an API and through online browser access is also supported, to ensure connectivity with client financial institutions.

5 Conclusions

This chapter addresses one of the fundamental problems of the insurance industry: the use of data beyond traditional data sources, toward broadening the spectrum of data collection in a scalable way and with open-source processing. The implementation and use of these results within the INFINITECH project represent a quantum leap in improving the quality and variety of data. The presented solution improves the efficiency of insurance processes and provides significant benefits to the insurance industry. In this direction, advanced technologies in the areas of robotic processing, cloud computing, and machine learning are combined, integrated, and used.

Leveraging machine learning techniques, the presented solution can support the following use cases: (i) automation of (banking) financial product underwriting, (ii) risk estimation and product recommendation, and (iii) prediction of the company’s business continuity.