1 Introduction

Over recent years, more and more consumers have given up individual ownership in favor of demand-based access to goods (Giesel and Nobis, 2016; Klein and Smart, 2017; Le Vine and Polak, 2019; Oakil et al. 2016). Accordingly, business models have emerged in which consumers simply pay for on-demand access to goods (Wilhelms et al. 2017; Owyang et al. 2013). Benefiting from this development, carsharing has attracted public attention for several years now (Münzel et al. 2018), with 15 million registered users worldwide in 2016 (Shaheen and Cohen, 2020). In addition to commercial carsharing, peer-to-peer (P2P) platforms such as Turo or Getaround allow private car owners to share their car with a previously unknown person (Wilhelms et al. 2017).

On such platforms, transactions are mostly carried out online, which lowers costs but increases anonymity. A lack of trust in other “buyers” or “sellers” is stated as one of the most frequent reasons for rejecting P2P sharing platforms (Bossauer et al. 2020; Pakusch et al. 2018). As a result, transparency, reputation, and trust are considered essential for the success of the P2P sharing economy (Belk, 2007; Botsman and Rogers, 2011; Hawlitschek et al. 2016). Trust is typically built through mechanisms such as displaying self-provided information (e.g. pictures or personal details) and peer-provided information (e.g. consumer reviews and ratings) (Bente et al. 2012). In addition, Internet of Things (IoT) information could serve as a third trust-building mechanism (Gandhi and Gandhi, 2018; Handel et al. 2014; Stevens et al. 2018; Wahlström, 2017; Wahlström et al. 2017; Stevens and Bossauer, 2020). For instance, driving behavior can be monitored by car telematics, which consists of onboard communication services and applications that transmit and record sensor information while driving. This information is then algorithmically processed into a score. Such insights into driving behavior could increase trust in the driving skills of a potential rentee. To date, such information has been used and analyzed by car insurance companies for risk assessment in individual insurance tariffs (Merzinger and Ulbrich, 2017).

In the context of P2P carsharing, algorithm-based reputation systems could create added value compared to peer ratings, since they are based on objective measures (Stevens and Bossauer, 2020). Although peer ratings are the current standard among reputation mechanisms in the sharing economy (Hawlitschek et al. 2016; Teubner et al. 2016), they have been criticized for various biases (Edelman et al. 2017; Carol et al. 2019; Tjaden et al. 2018). However, it is unclear whether users would trust algorithm-based scores and whether these would create added value compared to peer ratings.

To close this research gap, this design case study (Wulf et al. 2011) examines whether algorithm-based reputation systems have the potential to improve trust-building in P2P carsharing. To this end, we conducted a pre-study with 16 problem-centered interviews to find out how people understand algorithm-based scoring (section 3), co-designed a P2P carsharing app prototype with an algorithm-based reputation system (section 4), and finally evaluated it with 12 participants (section 5). Our findings show that algorithm-based reputation systems can support trust-building in P2P carsharing and provide insights into how they should be designed. Our work contributes to the discourse on how algorithm-based reputation systems in P2P carsharing can be designed to support trust-building and decision-making for car owners, and especially how the interaction with such systems should be designed to support trustworthy coordination between the user groups and to reduce social biases and discrimination.

Our design case study builds primarily on the work of Bossauer et al. (2020) and Stevens and Bossauer (2020). The study by Bossauer et al. (2020) focused on trust-building and the willingness to share data from the rentee perspective and showed that this willingness exists in the form of an area of negotiation. This paper puts a larger focus on the use case of selecting a driver based on an algorithm-based score that is computed from rentee data, but centers on the interaction with car owners. This is intended to address in particular the information asymmetry faced by car owners and to support a better interaction with the algorithm-based scoring.

2 Related Work

2.1 Trust-Building in Peer-to-Peer Carsharing

As a part of the Sharing Economy (Richter et al. 2015), the basic idea of Peer-to-Peer Carsharing is joint consumption following the principle of “sharing rather than owning”. Consumers do not acquire products; they only obtain a temporary right to use a service or good, normally for a fee, e.g., based on the kilometers driven (Belk, 2007). For many sharing economy participants, economic motivations such as cost saving, reduced burden of ownership, or increased access to resources play an important role (Hamari et al. 2016). Moreover, against the background of climate change, sharing vehicles becomes increasingly important (Hampshire and Gaites, 2011). According to a study by the Ford Motor Company (Ford Motor Company, 2016) with over 10,000 participants, 55% of respondents in Europe would share their car for a fee; however, actual adoption is very low (Bossauer et al. 2020; Pakusch et al. 2018). Several studies in the context of P2P carsharing (Wilhelms et al. 2017; Ballús-Armet et al. 2014; Lewis and Simmons, 2012; Nobis, 2006) already report findings on user characteristics and motivations: people participate in P2P carsharing for economic reasons (reducing mobility and vehicle costs) and for situational-practical reasons (availability, convenience, and flexibility) (Ballús-Armet et al. 2014; Nobis, 2006). But there are also hurdles, as sharing a private car via a P2P platform includes the effort of entering the availability of the car, arranging handover dates, and checking afterwards whether the car has been damaged. In addition, there is the general fear of sharing a car with strangers (Wilhelms et al. 2017; Shaheen and Cohen, 2013). In particular, people often have a personal and emotional bond with their cars (Gatersleben, 2007), which increases fear of loss (Belk, 1988), because others might not treat the rented car with care, cause an accident, or return it late or dirty (Bossauer et al. 2020). P2P carsharing platforms therefore play an important role, because they have a coordinating and trust-building function between the respective user groups and can digitally support cooperation between car owners and rentees.

Trust is generally acknowledged to be a multi-dimensional, socio-psychological construct (Hawlitschek et al. 2016; Ter Huurne et al. 2017). Our work relies on the following definition:

“[Trust is] the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other will perform a particular action important to the trustor, irrespective of the ability to monitor or control that other party.” (Mayer et al. 1995)

Trust is of particular importance in potentially risky and uncertain situations where parties are interdependent; such situations are typical of the sharing context, especially as most P2P transactions are executed online. Transparency, reputation, and trust are therefore seen as essential requirements (Belk, 2007; Botsman and Rogers, 2011; Hawlitschek et al. 2016) that reduce transaction costs efficiently in social exchanges (Ter Huurne et al. 2017). However, it is difficult to build and sustain trust in online interactions (Hawlitschek et al. 2016; Möhlmann, 2015). For this reason, trust mechanisms have been investigated in different areas, e.g., social media (Ridings et al. 2002), online shopping (Gefen, 2002), and the sharing economy (Hawlitschek et al. 2016). Regarding e-commerce, there are two dimensions of trust: trust in the seller and trust in the offered goods. In particular, buyers must trust in the integrity, quality, and competence of the seller (Gefen, 2002). As products or services are usually offered by private individuals in the P2P sharing economy, users have to trust other peers, the platform, and the offered products and services (Hawlitschek et al. 2016).

2.2 Algorithm-Based Reputation Systems via Car Telematics

To address the above-mentioned lack of trust in online transactions, reputation systems emerged as a trust mechanism (Ter Huurne et al. 2017; Ert et al. 2016). The basic idea is that the parties rate each other, e.g., after concluding a transaction, and a trust or reputation score is derived from the aggregated ratings. The resulting score can help other users decide whether to interact with that party in the future. Thus, reputation systems incentivize good behavior and therefore tend to have a positive effect on market quality. Reputation systems are related to collaborative filtering systems (Schafer et al. 1999), as they use the opinions of a community to help individuals identify relevant content more effectively from a potentially overwhelming set of choices. Resnick and Zeckhauser (Resnick and Zeckhauser, 2002) give a functional definition of a reputation system: it must (1) provide information that allows peers to distinguish between trustworthy and non-trustworthy peers, (2) encourage peers to be trustworthy, and (3) discourage participation from those who are not. We can distinguish three different types of information that contribute to trust-building:

Self-Provided Information

Peers can provide information about themselves as well as about the goods or services they offer. Repschläger et al. (Repschläger et al. 2015) pinpoint that personal attributes such as name, age, and address as well as pictures serve as trust factors. However, Hanrahan et al. (Hanrahan et al. 2018) observed that self-provided information can often lead to discrimination against peers based on specific characteristics, such as gender or photo. Aggregated ratings are intended to overcome this problem (Stevens and Bossauer, 2020).

Peer-Provided Information

As self-provided information is often suspected of being sugarcoated, peer ratings have become increasingly common and serve as a substitute for word-of-mouth recommendations (Zhu and Zhang, 2010). Consumer ratings and reviews are usually written either to recommend a product or service or to warn others about it (Hennig-Thurau and Walsh, 2003; Sen and Lerman, 2007). One common problem is that users tend to write reviews mostly for products that they perceive as exceptionally good or exceptionally bad (Dellarocas and Narayan, 2006). Another problem is that peer-provided information such as user ratings is susceptible to biases (Rogers, 2015; Hanrahan et al. 2018).

Computational-Provided Information

This category includes all information that is not provided by oneself or other peers, but collected by electronic devices, e.g., sensors in smart home environments (Ter Huurne et al. 2017). One benefit of computational-provided information is that it is not collected manually, so there is no additional effort for users. Secondly, it is based on automatically generated measures and is therefore more difficult for users to manipulate than self- and peer-provided information (or at least manipulation requires additional effort). Additionally, computational-provided information could mitigate the problem that only products or services perceived as very good or very bad get rated.

Telematics can enable new forms of trust-building (Bossauer et al. 2020), e.g., by enabling assessments based on quantitative real-time data on driving behavior. Such systems can be described as the use of technical devices for the identification, storage, and/or processing of computational-provided information, interconnected by means of telecommunication systems (Wahlström et al. 2017). The term is often used synonymously with car telematics and thus Connected Car technologies. Connected Car technology enables numerous driving and environmental data to be captured by sensors and processed into valuable information via the Internet (Stevens et al. 2017). This offers the potential to protect reputation systems from manipulation and to stabilize user trust in evaluations and transactions (Olakanmi and Oluwaseun, 2018; Ribeiro et al. 2016; Wiegand et al. 2019; Stevens and Bossauer, 2020). In addition to integrated sensors, telematics solutions can also be implemented in vehicles via telematics boxes or dongles, or provided by smartphone apps (Häberle et al. 2015; Lawson et al. 2015; Mikusz et al. 2015; Spada, 2018; Hong et al. 2014; Handel et al. 2014). In P2P carsharing, such technologies could enable a condensation of driving characteristics into a score that can lead to an overall rating. The rating represents the potential risk of a rentee using the vehicle and can therefore increase trust towards rentees with a better score (Ter Huurne et al. 2017; Ert et al. 2016; Olakanmi and Oluwaseun, 2018; Teigland et al. 2019). Looking at the specific domain of car insurance, the basic idea is to record acceleration and position with the help of sensors while driving to detect, e.g., speed violations, braking, mileage, and traveling direction, and to use this information to assess the risk of a person’s driving behavior (Ma et al. 2018). The fee then expresses the insurance company’s confidence, based on the risk assessment, that the policyholder will not suffer an accident (Desyllas and Sako, 2013; Roel et al. 2017).
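To illustrate the kind of processing involved, the following sketch derives simple driving events from a stream of speed samples. It is a minimal illustration, not a description of any concrete telematics product; the sampling format, thresholds, and function name are our own assumptions.

```python
# Hypothetical telematics samples: (time in s, speed in km/h, speed limit in km/h).
samples = [(0, 48, 50), (1, 55, 50), (2, 63, 50), (3, 30, 50)]

def detect_events(samples, speeding_margin=10, harsh_brake_kmh_per_s=20):
    """Flag speeding and harsh-braking events from consecutive samples (illustrative thresholds)."""
    events = []
    for (t0, v0, _), (t1, v1, limit) in zip(samples, samples[1:]):
        if v1 > limit + speeding_margin:
            events.append((t1, "speeding", v1 - limit))            # km/h above the limit
        if (v0 - v1) / (t1 - t0) > harsh_brake_kmh_per_s:
            events.append((t1, "harsh_braking", (v0 - v1) / (t1 - t0)))  # deceleration in km/h per s
    return events

print(detect_events(samples))
# [(2, 'speeding', 13), (3, 'harsh_braking', 33.0)]
```

Events of this kind would then be aggregated, e.g., per trip or per kilometer, before being condensed into a score.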

However, an algorithmic bias can also occur here, penalizing users whose driving behavior does not correspond to the “algorithmic imaginations”. Since it is almost impossible for sensors to gain an overall understanding of specific traffic situations, computational information should be analyzed more flexibly to reduce the “practice of bias” (Jackson Jr, John L, 2013; Alkhatib, 2021). Rouvroy calls this phenomenon algorithmic governmentality, which operates with infra-individual data and supra-individual patterns without, at any moment, calling the subject to account. Furthermore, Rouvroy introduced the term “data-behaviourism”, describing the widest possible zone of indistinction between reality and the world. Because the world is represented on the basis of data, it becomes difficult to criticize conclusions drawn by algorithms, as data are usually described as objective and neutral. This bias may disadvantage individuals who, for good reasons, violate or are unaware of the algorithm’s rules (Rouvroy, 2013).

2.3 System Intelligibility and Accountability

The relevance of comprehensible systems can be seen in the increasing amount of work in the area of system intelligibility and Explainable AI (Abdul et al. 2018; Madumal et al. 2018; Strobel, 2018; Wiegand et al. 2019). Due to recent advances in Machine Learning (ML) and Artificial Intelligence (AI), we increasingly see automated algorithm-based decision-making in digital technologies. Nevertheless, it is a fundamental challenge to design these technologies to be intelligible and accountable such that people can understand the information and feel empowered (Abdul et al. 2018). There is already related work on interpretable, fair, accountable, and transparent algorithms (Ribeiro et al. 2016; Datta et al. 2016; Lawo et al. 2021) in the AI and ML as well as HCI communities. The European Union approved a data protection law (Goodman and Flaxman, 2017; EUG Portal, 2017) that includes a “right to explanation”. Users are often not explicitly informed about the information they see, and algorithmic decision-making systems typically do not provide visibility into how the technology works (Nagulendra and Vassileva, 2016). While transparency is necessary to improve collaboration between humans and algorithms, missing transparency is a risk if humans have difficulties figuring out how the systems reached their decisions (Brynjolfsson and Mcafee, 2017). Therefore, explainability is needed for collaboration and is related to interpretability and justification (Biran and Cotton, 2017). Systems are interpretable if a human is able to understand the reasons for a decision, whereas “justification explains why a decision is a good one, but it may or may not do so by explaining how it was made” (Biran and Cotton, 2017). An explanation is important to help users understand what outputs the system is supposed to produce and to recognize mistakes or errors (Rader et al. 2018). While empirical studies showed “the importance of explanation to users, in various fields, consistently [...] that explanations significantly increase users’ confidence and trust”, current AI and ML systems are weak in this area (Biran and Cotton, 2017; Brynjolfsson and Mitchell, 2017). Positive examples are rule-based models such as decision trees, which are easy to understand in contrast to most types of complex neural networks. In particular, the reduction of complexity plays a crucial role here (Setnes et al. 1998; Lawo et al. 2021). Since rule-based algorithms are more comprehensible than complex black-box models (Keneni et al. 2019; Holzinger, 2018; Arrieta et al. 2020), the specification of individual preferences and norms regarding reputation systems can support Human-Algorithm Interaction (Wolf and Blomberg, 2019; Schmit and Riquelme, 2018).

3 Empirical Pre-Study: Understanding Algorithm-Based Reputation Systems in P2P Carsharing

The aim of the pre-study was to 1) understand how people imagine the functioning and derivation of an algorithm-based score on driving behavior enabled by telematics systems (Driving Score), in terms of what information is needed and how it is processed into a score, and 2) derive first design implications for an artifact that integrates users’ needs and perceptions in the context of P2P carsharing.

To address our research goal, we chose a qualitative research approach. The semi-structured interviews lasted about 45 minutes on average. An interview guideline was used to narrow down the subject area and to structure the interview process. The interviews started with a description of P2P carsharing and the principle of telematics solutions as an introduction. Altogether, the interview questions can be divided into three categories. The first category consisted of questions on which data should be collected for scoring driving behavior. Furthermore, the interviewees were asked to explain their understanding of a Driving Score and how they think it works. In the second category, design aspects of an algorithm-based score were examined. We presented four different profiles of people who would like to borrow a car (translated from German) (see Figure 1). However, the order in which an interviewee responded to the questions was flexible, which allowed the interviewees to deal with topics that seemed more important to them. This procedure enabled the interviewer to lead the conversation by asking specific questions without interrupting the flow of the conversation (Kohlbacher, 2006).

Fig. 1: Different Driver Profiles presented to the Participants – A Driver Profile consists of a 1) Picture of the Person, 2) Name, 3) Age, 4) Driving Score, 5) Kilometers driven, 6) User Rating, and 7) Number of User Ratings

The study was conducted in Germany with a sample of 16 participants (P1-P16), recruited through personal contacts. Seven participants were female and nine were male. The mean age is comparatively young (33.8 years, range 22-64 years), since the young generation is the main target group for carsharing platforms (Loose, 2010). Moreover, our sample covers a broad variety with regard to Telematics Experience, Tech Affinity, and experience with car- or ridesharing. For the analysis, all interviews were recorded and transcribed. The analysis itself followed the procedure of thematic analysis (Braun and Clarke, 2006). Two authors coded the transcriptions independently of each other using MAXQDA and combined the resulting code systems collaboratively (Berends and Johnston, 2005).

3.1 Understanding of an Algorithm-Based Reputation System

At the beginning of each interview, we asked how people understand the computational logic of an algorithm-based reputation system. Since many participants found it difficult to put themselves in the position of an algorithm, we asked them to take the perspective of a co-driver and evaluate driving behavior from there. This led to a list of measurement criteria associated with the term driving behavior. The discussion also revealed different levels that could be used to classify the measurement criteria (see Table 1). While some of the mentioned measurement criteria are on a technical level and comparatively easy to capture and process into a score (e.g. braking), other criteria are more norm-based and thus require an aggregation of several factors to enable an evaluation.

Table 1 List of categories of measurement criteria of a driving score

Often the interviewees oriented themselves to the road traffic regulations without thinking about technical feasibility. This can be seen in quotes such as: “Do you put the turn signal on properly?” –[P16]. The first part of the quote can be answered with yes or no. The second part, however, contains the word “properly”, which presupposes knowing when, where, how long, and how often one must indicate a turn. Here the term “properly” has obviously already been interpreted as correct or in conformity with the law. The quote can be related to the following factual context: the turn signal must be set several seconds before changing direction or lane, and setting the signal is accompanied by a shoulder check and a look at the mirror in the direction to be taken. These activities are prescribed by the road traffic regulations. Most participants created a fictitious situation in which they gave an example of how they imagine an evaluation while driving. In most cases, the participants chose the criterion speed for their use cases:

“Well, I would pay attention to how fast they are and what speed is allowed on the respective road or motorway and then I would pay attention to whether that was adhered to. So ± 10 kilometers per hour are okay, but I would make sure that the driver follows the traffic signs.” –[P4]

As previously stated, most situations referred to already interpreted factual contexts. One connection, e.g., was that there are certain rules to which speed must be aligned; if the driver doesn’t follow the rules, a consequence can be expected. Additionally, it is interesting that the interviewees also specified tolerance ranges for rule-based reference values. In the above example, a tolerance of ± 10 kilometers per hour was specified despite compliance with traffic signs. Consequently, it can be assumed that reference values in conjunction with the mentioned categories of measurement criteria (according to human understanding) serve the algorithm as a basis for evaluation.

The measurement of driving behavior in terms of speed was mostly described with words such as “above” or “below”, “good” or “bad”, without explicitly naming the reference values. The traffic regulations, e.g., contain no reference values for the ideal number of changes of direction while driving. Trying to represent this in a score would favor drivers who drive straight a lot over drivers who must turn more often; whether this supports trust in carsharing and results in a realistic Driving Score remains to be discussed. In order to be able to calculate the values of the deviation, the factors must have a uniform unit of measurement. Most respondents stated that the resulting score should be an average value of driving behavior [P7]. Further, a weighting of the criteria was desired by our participants.

“But I would also like the weighting to be shown to me. What is actually weighted the most or is all weighted equally? Or what the most relevant criterion is. Then I could understand what the score tells me at the end.” –[P1]

The statements on the weighting of the criteria for the driver evaluation varied in the interviews. Each participant had their own preferences for weighting the criteria. Some participants also mentioned a kind of knockout variable, which alone caused the score to drop considerably. Participant 3, e.g., stated that “everything that exceeds 10 % above the speed limit is not okay”. Only when the composition and the weighting are transparent does the score become understandable and able to build trust. In addition to knowing which criteria are included in a Driving Score, participants questioned how strongly these criteria affect the final score. The participants wanted to know how each individual value is considered in the overall score.

“I would like to break the score down and see the individual parameters.” –[P1]
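Taken together, these statements suggest a simple per-criterion rating logic. The following sketch combines P4’s tolerance range with P3’s knockout rule; the 0-10 scale, thresholds, and function name are hypothetical assumptions for illustration, not a system our participants used.

```python
def rate_speed(speed: float, limit: float, tolerance_kmh: float = 10,
               knockout_pct: float = 0.10) -> float:
    """Rate one speed observation on a 0-10 scale.

    Knockout (P3): more than 10% above the speed limit drops the rating to 0.
    Tolerance (P4): up to +/-10 km/h around the limit is fully okay.
    """
    if speed > limit * (1 + knockout_pct):
        return 0.0                                    # knockout variable
    over = max(0.0, speed - (limit + tolerance_kmh))
    return max(0.0, 10.0 - over)                      # linear penalty beyond the tolerance

print(rate_speed(108, limit=100))  # 10.0: inside the +/-10 km/h tolerance
print(rate_speed(115, limit=100))  # 0.0: exceeds the 10% knockout threshold
```

Per-criterion ratings of this kind would then feed into a weighted overall score, as described for the prototype in section 4.1.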

3.2 Perceived Potentials of Algorithm-Based Reputation Systems for Trust-building

In the following, we aim to show what potentials of algorithm-based reputation systems our participants perceive and how these can support trust in P2P carsharing. As already mentioned, we presented four different profiles of fictitious car rentees (see Fig. 1) to our participants and asked them to choose a profile and explain their decision. What was striking about this question was that the majority based their selection on the Driving Score and only paid secondary attention to the user ratings. In total, 8 out of 16 participants stated that they consider both the Driving Score and the user ratings in their decision, with a tendency towards a higher weighting of the Driving Score. Six participants based their decision only on the Driving Score, and only two participants relied on the user ratings. It was possible to get an impression of when a score was considered “positive” or “negative”. Basically, we found many different perceptions of “trustworthy” scores; often, the values were compared with the ratings of hotels. The participants didn’t state an explicit score from which a positive or negative impression of the rentee arises, but “5.2/10” was called “not decent” –[P10]. When selecting the profiles, the participants often paid attention to how many kilometers the person had already driven. On this basis, respondents judged how reliable the Driving Score appears [P1]. One participant found 500 kilometers driven to be a good basis for the validity of the Driving Score [P1], whereas the indication of 177 kilometers was often perceived as not much, but still sufficient [P10].

Various benefits were identified for the use of algorithm-based scoring to evaluate driving behavior. For example, algorithm-based evaluations were perceived as less manipulable [P2]. The respondents described them as more accurate and neutral [P10, P15]. According to Participant 5, personal evaluations are difficult to quantify and do not have a fixed evaluation scheme. Some of the respondents argued for a combination of both types of information:

“I think a general score should perhaps consist of both things and not be divided generally [...]. Perhaps, the user rating can contribute to a certain extent to completing the Driving Score.” –[P3]

In this context, the reference to reality was emphasized repeatedly, which is why the respondents demanded personal evaluations in addition to the algorithm-based scoring. An example of this was a situation where a driver is on the acceleration lane and has to accelerate quickly to merge onto the motorway.

“If the Driving Score encourages you to drive safely, I think it’s really cool. But if you had to accelerate because you’re on the acceleration lane to drive towards the highway and two trucks are on the lane and you’re afraid to accelerate because you just don’t want to risk your 10/10 score, that’s not cool. So, if you get too involved into keeping good scores, it could have a negative effect.” –[P3]

Since fast acceleration would be assessed negatively, the Driving Score could incentivize unwanted driving behavior in some situations [P3]; e.g., making way for ambulances would possibly falsify the assessment [P1]. Almost one third (5/16) of all respondents stated that the Driving Score must function reliably and be extensively tested [P10]. The participants expressed security concerns in the sense that the underlying technology must be mature and tamper-proof [P1]. Under the premise that the technology is tested and found to be safe, a score calculated by this technology was described as “objective”, “trustworthy” –[P9], and “fair” –[P15].

“I personally would set the weighting to 70% Driving Score and 30% user rating. Personal rating says it all. There is a subjective perception behind it. And this Driving Score is actually trustworthy and objective [...] and I can say, okay, these are data-based ratings, which I can accept without doubt.” –[P9]

As a part of the selection decision, potential biases arose both explicitly and implicitly. The majority of participants chose the profile with the highest Driving Score and the most kilometers driven (cf. Figure 1, Christina). However, especially for those who relied on the User Rating, the profile picture and age would probably play a major role.

“So the picture, I think that would be good if that would be in there. So I would know who is driving the car. I think that’s not completely unimportant, and also the age, whether the person has had a driver’s license for a long time.” –[P5]

Overall, it can be said that many of the participants see potential in the Driving Score for their decision-making. However, the score must meet certain criteria in order to build trust: it must be tamper-proof, rest on enough data based on many kilometers driven, and be comprehensible for the users. Some participants advocated the combination of Driving Score and User Rating. How a Driving Score can be made more comprehensible is discussed in the following section.

3.3 Designing a Driving Score

The results addressed many aspects that influence the design of a Driving Score. One aspect is its visualization. During the interviews, the participants mentioned various visualization options, including 1) scales, 2) diagrams, 3) ratios, and 4) other visualizations. In our qualitative thematic analysis, we counted the mentions of these visualization types, and 10 of our participants liked the idea of a 10-point scale for the Driving Score.

“10 is great and 0 is bad. I’d like that, because there’s such a benchmark, where you can say if a driver gets a 7, [...] you know, this person drove not perfect, but well enough.” –[P6]

In addition to a scale from 0 to 10, other intervals such as 1 to 5 or 1 to 6, based on the German school grading system, were mentioned [P7, P8]. As already indicated in the quote, scales provide good orientation because they are easy to interpret due to the clear graduation between the best and worst rating.

Diagrams

were mentioned a total of 8 times in various forms. One idea was to create radar charts with all dimensions that impact the Driving Score. Each dimension represents a part of the net and fills up in the corresponding direction; the more points a driver reaches, the more the chart fills out [P8]. Another form of representation was a bar chart with green, yellow, and red areas [P2], representing good, acceptable, and bad driving behavior, respectively. The ideas of bar and line diagrams were very similar. A scatter diagram was also proposed: it has a horizontal line on which the average speed could be plotted; above the line, points at which the speed was above average can be displayed, and below the line, braking maneuvers would be recorded. On axis-based diagrams, target and actual lines could be displayed to allow a direct comparison [P14].

Ratios

were mentioned 7 times in total. According to their statements, participants seem to prefer percentages over star ratings [P1, P15]. This is considered an advantage for a drill-down function to obtain more detailed results than a star rating provides.

“Of course, you can do a bit more with percentages than with stars. In particular, I would like to tap it again and see where I didn’t have the right speed. So just more drill-down options so I can see more. I would like that with all data.” –[P1]

A general requirement for the Driving Score was that it should be well structured and comprehensible [P5]. All 16 participants explained that it must be possible to break down the score that is finally calculated.

“I’d want to know exactly and transparently all the criteria that go into this score.” –[P3]

Participants proposed different forms of visualization for different criteria. For example, when assessing speed, deviations from the average speed were desired in the form of line diagrams, while for other criteria, such as smartphone use, people mentioned an indication in minutes. The respondents did not want an overly complex score, as too many factors would reduce the benefit of a score [P12]. They therefore assign importance to a comprehensible listing of the measurement criteria, which must, nevertheless, be limited.

4 Prototyping

Based on the results of the pre-study, an interactive prototype (Schmidt et al. 2020) was developed in a co-creation workshop. The workshop lasted about 2.5 hours and included 4 male and 3 female participants aged between 25 and 65 who are interested in car- and ridesharing. The content and the developed paper prototypes were documented.

The co-creation workshop aimed to embed the findings of the pre-study into a P2P carsharing app prototype according to the participants’ perceptions. After introducing the participants to the problem of reputation mechanisms in P2P carsharing, they received an in-depth briefing on the results of the pre-study. In the following creative phase, the participants were asked to sketch the app UI and discuss its computational logic from an end-user perspective. This phase was supported pictorially by the exemplary profiles of Figure 1 and by the prepared results in the form of printed PowerPoint slides. Furthermore, printed app icons, diagrams, and materials such as pencils, paper, and scissors were available. Finally, additional factors were critically reflected upon.

The final clickable prototype (see Fig. 2) was then designed by the authors and implemented in Figma. The individual app components and their functions are presented below.

Fig. 2: Clickable Prototype – a) exemplary preference weighting, b) individual norm settings, c) home screen with requests, d) exemplary driver profile, e) exemplary Driving Score drill-down, f) exemplary Trust Score weighting

4.1 Preference and Individual Norm Settings

Our pre-study participants perceive algorithm-based driving scores as an “interplay” of various measurement criteria [P1, P10]. However, rule-based systems have weaknesses in dealing with the uncertainties of the real world (Holzinger, 2018), such as rapid acceleration, which in some real situations is necessary to merge onto the motorway quickly. Furthermore, according to Rouvroy (2013), the assumed objectivity and neutrality of sensor data create additional hurdles for an algorithm: if, for example, the sensor used to measure the following distance reports the objective information “close object”, but this reading is caused by dirt on the sensor rather than by an actual object, the evaluation may be objective with respect to the given data but still leads to an incorrect evaluation of the driving behavior. In the co-creation workshop, our participants therefore designed a customizable Driving Score to address these problems, offering the opportunity to assign a lower weight to such situations and to define individual norms, as already mentioned in the pre-study interviews, in order to become more aware of the algorithm’s rules.

To allow an adjustment of the Driving Score and thus its customization, the participants of the co-creation workshop sketched weighting sliders for the measurement criteria, accompanied by brief explanations, that represent the preferred weighting (0-100%) (see Fig. 2 a). In addition, the workshop participants were in favor of including an example from the pre-study for the individual norms to show how such a norm could look (see Fig. 2 b).

In line with the pre-study, the Driving Score is calculated as an average value of driving behavior [P7]. Each influence factor is measured on a scale from 0 to 10 and weighted according to the individual preference settings [P15]. Individual norms are modeled as exclusion criteria, e.g., every speeding violation of more than 10 kilometers per hour sets the corresponding factor to 0. The Driving Score thus corresponds to the sum of all weighted factors divided by their number.
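A minimal sketch of this calculation, following the description above; the criterion names, example values, and the specific speeding norm are illustrative assumptions rather than the prototype’s actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    factor: float   # measured influence factor, 0..10
    weight: float   # preference weight set by the car owner, 0.0..1.0 (0-100%)

def driving_score(criteria: list[Criterion], speeding_kmh: float) -> float:
    """Sum of all weighted factors divided by their number (see text)."""
    for c in criteria:
        # Individual norm modeled as an exclusion criterion:
        # speeding of more than 10 km/h sets the speed factor to 0.
        if c.name == "speed" and speeding_kmh > 10:
            c.factor = 0.0
    return sum(c.factor * c.weight for c in criteria) / len(criteria)

criteria = [
    Criterion("speed", 8.0, 1.0),
    Criterion("braking", 9.0, 0.5),
    Criterion("acceleration", 7.0, 0.5),
]
print(driving_score(criteria, speeding_kmh=5))   # (8.0 + 4.5 + 3.5) / 3 ≈ 5.33
print(driving_score(criteria, speeding_kmh=15))  # speed factor zeroed: (0 + 4.5 + 3.5) / 3 ≈ 2.67
```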

4.2 Home Screen

After participants have set their preferences and individual norms, they are redirected to the home screen (see Fig. 2 c). The workshop participants mainly designed their sketches based on the exemplary driver profiles provided, as these closely resembled a real app. The home screen shows four exemplary driver profiles, which clearly display the profile information of the potential rentees. It is designed to simulate a real decision-making situation in which the participants of the evaluation interviews had to choose a profile. To obtain more detailed information about a potential rentee, they were to outline their motivations independently and click on the driver profiles.

4.3 Driver Profiles and Drill-Down

To improve intelligibility, the participants of the pre-study as well as those of the co-creation workshop discussed various types of visualizations for the score, such as scales, radar charts, or ratios. Radar visualizations were chosen for a deeper understanding of the composition of the algorithm-based Driving Score. With regard to the pre-study, a Driving Score should therefore offer not only the initial score information but also further visualizations for a deeper analysis. In the sketches of our co-creation workshop, as well as in our prototype, this was realized by displaying the score within the Driver Profile (see Fig. 2 d) and visualizing the measurement criteria of the score in the form of a radar chart (see Fig. 2 e). In addition, an overall score was desired as a composition of the algorithm-based score (Driving Score) and the (average) peer-provided rating [P9]. For reasons of differentiation, we named this combined score the Trust Score, a combination of the Driving Score and the User Rating (see Fig. 2 f). The participants of the co-creation workshop stated that the drill-down of the Driving Score in the form of the radar chart (see Fig. 2 e) should be optionally accessible via an info button. This was intended to reveal whether the participants of the subsequent evaluation would show interest in it or ignore this information from the beginning.
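Read as a formula, the Trust Score can be sketched as a weighted combination of the two components. The 70/30 split below mirrors the preference voiced by P9 in the pre-study; the function itself is an illustrative assumption, since the prototype lets each owner set the weighting via a slider (see Fig. 2 f).

```python
def trust_score(driving_score: float, user_rating: float, w_driving: float = 0.7) -> float:
    """Weighted combination of the algorithm-based score and the peer rating (both 0-10)."""
    return w_driving * driving_score + (1 - w_driving) * user_rating

print(trust_score(8.6, 7.0))  # 0.7 * 8.6 + 0.3 * 7.0 = 8.12
```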

5 Evaluation: Impact of an Algorithm-Based Reputation System in P2P Carsharing

The evaluation aimed to discuss the design and suitability of the algorithm-based reputation system in the form of the Driving Score in a scenario-based decision-making situation. The main focus was on the setting of individual preferences and norms for interacting with the algorithm of the Driving Score, the perception of the Driving Score and the User Rating when selecting a rentee, and the relevance of drilling down the Driving Score and Trust Score. Furthermore, it became clear in the pre-study that discrimination due to social bias may occur based on profile information (name, age, gender). In line with prior research (Hanrahan et al. 2018; Ge et al. 2016), racial, gender, and age biases were simulated to enable a discussion on whether the Driving Score can contribute to a reduction here. For this purpose, the names in the exemplary profiles (see Fig. 1) were adapted accordingly (see Fig. 2 c).

We conducted interviews with 12 participants (E01-E12) aged between 18 and 56 years, which lasted 47 minutes on average. There were 6 female and 6 male interview participants with different nationalities and carsharing experiences, recruited through previous contacts from research projects. The interviews followed a semi-structured guide in combination with the use of the clickable prototype. They addressed the establishment of preferences and individual norms, the selection of the rentee, and the role and relevance of the User Rating, Driving Score, and Trust Score. Finally, with regard to the investigation of potential discrimination, the (expected) influence of profile picture, name, age, and gender was addressed. After transcribing the recorded interviews, we analyzed them following the inductive approach of thematic analysis (Braun and Clarke, 2006). The authors undertook the coding of the interview material collaboratively (Berends and Johnston, 2005).

5.1 Preferences and Individual Norms Contribute to the Understanding of Algorithm-Based Scores

The setting of preferences for the calculation of the Driving Score is perceived as positive by all participants and contributes to a deeper understanding and transparency. Respondents indicate that they can incorporate their own preferences into the Driving Score, which helps to build trust and acceptance. The setting of high values can essentially be related to the perception of a high risk of accidents and a high level of car wear and tear. The motivation behind this is primarily based on a lack of trust in potential rentees as well as participants’ own negative experiences.

“I would rate him 100%. If he doesn’t obey the traffic signs, that’s a danger to my car [...]. It is a matter of trust when I rent my car to someone.” – [E05]

However, it becomes clear that some respondents already differentiate here according to rules and norms, which is consistent with the results from our pre-study. These norms result from the reference to their own driving behavior or from the perception that an algorithm-based score should not be too restrictive in some situations. These situations are categorized by some participants as speed-related or weather-related.

“He is supposed to stick to the speed limit but he can also speed 10 to 15 km/h too fast but only on the highway. In a 30 zone he should drive 35 at most.” – [E05]

The setting of low values is exclusively based on participants’ orientation to their own driving behavior or on the assumption that the algorithm-based score does not recognize certain situations fairly. This finding supports the result from our pre-study that most people allow a tolerance range as a norm-based approach.

“That’s okay, if it stays within a certain range, e.g. 10 km/h faster.” – [E08]

Based on our prototype, participants state novel ideas for a better understanding of how the algorithm processes high or low values. For example, this can be achieved by a clear description of the adjustable levels as feedback on how the algorithm will interpret the respective setting.

“I would like a little bit more explanation about the impact when I change between 0% and 100%. E.g. for speed and 0% it could say ’Speeding violations are not taken into account at all’ and for 100% ’From 5 km/h faster there is a penalty, the higher the violation is.”’ – [E12]

The participants argued that the definition of individual norms should be as intuitive as possible. Using the example of speeding violations, they discussed whether a percentage or an absolute speed deviation, divided into speed zones, should be specified. According to the participants, a percentage figure offers too much room for higher violations.

“I would say that 5-8 km/h is still okay, between 8-10 km/h is borderline and above that it is negative in town. There would have to be a certain realism here. At some point, however, it becomes too cumbersome for the user to configure.” – [E10]

In addition to the setting of individual limits for speeding, the specification of a maximum speed, a maximum consumption, and a maximum number of previous accidents is also desired. Overall, two participants mentioned a final summary of the preferences set in the form of a persona, which is characterized by comprehensible descriptions and could serve as a filter for pre-selecting requests.

“I would like to have a kind of persona at the end, which represents the characteristics of the driver corresponding to the defined preferences. So as a summary: ’He always keeps to the speed limits’, ’Almost always uses the turn indicators’, etc. You could then use it as a filter or see if there is something to adjust.” – [E12]
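E12’s idea can be read as a simple mapping from preference settings to persona-style sentences. The following sketch is purely hypothetical; the criteria, threshold, and phrasings are our own illustrative choices.

```python
def persona_summary(preferences: dict[str, float]) -> list[str]:
    """Translate 0..1 preference weights into persona-style sentences (E12's suggestion)."""
    phrasing = {
        "speed": ("He always keeps to the speed limits",
                  "Occasional speeding is tolerated"),
        "indicators": ("Almost always uses the turn indicators",
                       "Turn indicator use is barely considered"),
    }
    summary = []
    for criterion, weight in preferences.items():
        strict, lenient = phrasing[criterion]
        summary.append(strict if weight >= 0.5 else lenient)  # strict phrasing for high weights
    return summary

print(persona_summary({"speed": 1.0, "indicators": 0.8}))
# ['He always keeps to the speed limits', 'Almost always uses the turn indicators']
```

Such a summary could then double as a filter for pre-selecting requests, as suggested in the quote above.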

5.2 Support for Decision-Making and Trust-Building

Before participants look at the ratings, some already make a pre-selection based on age or gender. In a second step, they take the Driving Score and the User Rating into account, which they mostly assess in combination with the kilometers driven and the number of ratings as supporting parameters. At this point, some of the participants deviate from their pre-selection by age or gender. This is mainly due to the Driving Score, which diverts the participants from their initial judgment and makes them reflect on the relevance of social factors versus driving safety.

“Okay, then I would actually rather, [...] funny, I actually didn’t look at the women at all, although Fatima would fit best, at least in terms of kilometers driven and the driving score. First I went through the men and when I look at the women now, I would actually rather choose Fatima at that point because of her Driving Score.” – [E02]

Consistent with the pre-study, it becomes apparent that the Driving Score comes to the fore especially among participants who pay attention to driving safety. The User Rating is predominantly preferred by respondents who pay more attention to the condition of their vehicle as well as to social factors, which are reflected in cleanliness and reliability.

“At the end of the day, it’s business what I do [...] it’s very important to me that he drives safely with my car. How he looks or if he is unsympathetic doesn’t interest me that much.” – [E06]

However, decision-making based on just one score is not enough for most people. For the proponents of the Driving Score, the Trust Score usually has a confirming effect on their decision. A prorated weighting of the User Rating is used to ensure that the social component is not completely disregarded. Still, the Driving Score is perceived as fairer and is therefore considered more relevant by them.

“I think that’s cool that you can weight the preference between driving score and user ratings again. It hasn’t had too much of an impact on me now. It had something confirming, because my decision for Fatima was clear quite quickly. First, because of the Driving Score and then the Trust Score actually confirmed it again” – [E01].

Interestingly, the Trust Score leads some of those who use the User Ratings as a key decision-making parameter to become unsure about their decision when setting the Trust Score weighting and, after looking at the drill-down of the Driving Score, to reconsider and select the rentee with the highest Driving Score. Because of that, some respondents also state that they first need to build up trust in the Driving Score to determine whether it matches their individual expectations.

“I’d like to see the radar chart on him again, though. Oh, that’s significantly worse. I think it’s good to see the individual points. Now I realize that the Driving Score is more important to me than the user rating. Han scores worse than Martin in the driving score factors that are important to me, even though he is older.” – [E07]

5.3 Contribution of Algorithm-Based Reputation Systems to the Reduction of Discrimination

As seen in the previous section, when selecting a potential rentee on the home screen, some of the respondents already expressed that information such as age or gender played an important role for them. Here it also became clear that the Driving Score can lead to a change of opinion.

“I think at first glance you can already see all the important info. Of course, male, female. I don’t lend my car to a woman then.” – [E02]

When asked explicitly what influence the name had on their decision, most participants stated that it did not play a major role for them. Nevertheless, three participants said that they were prejudiced against foreign-sounding names. However, two of them finally adjusted their decision based on the Driving Score and chose Fatima.

“Yes, with Han I already paid attention to the age and Fatima, sounds bad now, the name first struck me negatively. And with Han, too, that’s why I was looking at Martin first. But I have now chosen Fatima. For decision making after analyzing the Driving Score, it wasn’t so relevant.” – [E08]

When asked whether the participants could generally imagine that a (foreign-sounding) name could play a role for other car owners, all referred to its discriminatory effect. Mostly, antipathy, stereotypes, and language problems were mentioned as reasons. The influence of a profile picture also stood out here.

“I think with Fatima, many have an image in mind. An older lady, 55 years old with a headscarf and reserved. But if you have a profile picture now of Fatima standing in the middle of life, without a headscarf, who you find likeable. People would certainly not pay so much attention to the name if the picture is appropriate.” – [E09]

Opinions are divided on the suitability of the Driving Score for reducing discrimination and social bias. On the one hand, respondents find the Driving Score objective, which they see as an advantage over user ratings. In addition, the Driving Score’s assumed objectivity may minimize discrimination against potential rentees by age or gender, as it creates transparency regarding driving ability. This was also reflected in the decisions made, which were predominantly motivated by the Driving Score.

“[...] with an older woman, where you now assume that she doesn’t drive so well anymore, but then you see the Driving Score, which can also be weighted and you can look at the assessments of the specific parameters, that definitely convinces you otherwise.” – [E04]

On the other hand, the Driving Score reaches its limits where discrimination based on name or profile picture is racially motivated. In the case of mere prejudice, however, looking at the Driving Score or Trust Score could possibly reduce it.

Some participants also made suggestions on how to design the driver profiles to counteract discrimination. It was suggested that the important information, such as the Driving Score and the User Rating, should be visualized larger and that the discriminatory factors, such as name, age, and gender, should be made smaller so as not to draw attention to them. Furthermore, mentioning only the first name is sufficient, and a profile picture can be omitted if the rentee is verified on the platform by depositing driver’s license data.

“So if you don’t show a picture, the Driving Score can definitely reduce discrimination. [...], so if you only have the first name and [...] verify the profile through the driver’s license. Then that’s perfectly sufficient and draws attention to the Driving Score.” – [E12]

6 Discussion

Based on our results from the pre-study and the evaluation, we want to discuss four main findings regarding an algorithm-based score: how rules and norms can be considered in an algorithm-based reputation system, why explainability is important for understanding and trust-building, why such reputation systems could reduce social biases and discrimination, and finally why they are no magic bullet for trust-building.

6.1 Algorithmization of Rules and Norms

The findings of our pre-study show that the understanding of a Driving Score and its interpretation are strongly aligned with the road traffic regulations. Good driving behavior seems to be the practical compliance with these regulations. Since every driver must always adhere to them, this standard seems to automatically serve as a general rule for the evaluation of driving behavior. Within the framework of sociological institutional theory (Carvalho et al. 2017), algorithms are perceived as elements of social ordering processes and considered on four levels: 1) the regulative level serves the implementation and execution of formal rules, while 2) the normative level evaluates alternatives for action regarding their legitimacy. An interpretation and perception of the action alternatives and social contexts takes place on 3) the cognitive level. Furthermore, this concept is extended by 4) a technological level, which ensures that social expectations and rules are reflected, transformed, and further embedded in the algorithm (Geels, 2004; 2005; Scott, 2008). Such a differentiation can be found in our pre-study results as well as in our evaluation. From the users’ point of view, the regulative and normative levels stand out in particular. For example, many participants judge regulatively, so that rule violations in the form of exceeding or undercutting limits, usually measured against the road traffic regulations, should give a negative impulse to the Driving Score. Some participants also mentioned various subjective legitimations for deviations from a regulatory (limit) value, so that a legitimating normative level can also be identified here. This is characterized by the addition of subjective scope to an objective evaluation and, in this sense, of alternative actions for the algorithm, e.g., the tolerance in speed of + 10 km/h [P3] (see Fig. 2 b).

The evaluation confirms the necessity of differentiating between regulative and normative levels. However, it adds a cognitive level through the desire to subdivide individual norms into different categories due to their dependence on the (socially) perceived context. Similarly, the technological level is addressed by providing weighting options and explanations, ensuring that these parameters are consistent with users’ expectations. Further, the cognitive level is complemented by extending the weighting of preferences to include an accident probability and the criticality of the situation within a social context. As driving behavior may be forced in specific situations, like braking abruptly and hard or accelerating too fast, it should be assessed in the context of the situation. For example, the following algorithmic scoring could be performed to assess cornering behavior considering the situation: 1) if someone drives too fast, 2) under bad weather conditions, 3) through a tight curve, 4) with a high probability of an accident, he or she should be punished harder than if some parameters are assessed as less critical.
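A sketch of such context-sensitive scoring, in which the penalty for the same speed violation scales with the criticality of the situation; all weights, factors, and the normalization are illustrative assumptions.

```python
def cornering_penalty(speed_over_limit_kmh: float, bad_weather: bool,
                      curve_tightness: float, accident_probability: float) -> float:
    """Scale the penalty for driving too fast by the criticality of the situation.

    curve_tightness and accident_probability are normalized to 0..1.
    """
    base = max(0.0, speed_over_limit_kmh) / 10.0  # penalty for the violation itself
    criticality = 1.0 + curve_tightness + accident_probability + (0.5 if bad_weather else 0.0)
    return base * criticality

# The same violation is punished harder in a critical situation ...
print(cornering_penalty(15, bad_weather=True, curve_tightness=0.9, accident_probability=0.8))   # ≈ 4.8
# ... than under uncritical conditions.
print(cornering_penalty(15, bad_weather=False, curve_tightness=0.1, accident_probability=0.1))  # ≈ 1.8
```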

6.2 Need for Understanding to Build Trust

Furthermore, the pre-study shows that it is essential to make algorithm-based ratings transparent and explainable, which is confirmed by the results of the evaluation. The principle of transparency and explainability seems to rely on the participants’ attitude that “what you don’t understand right away may not be understood at all”. The clearer and more comprehensible the scoring, the more trustworthy it seems. A clear visualization is of great importance for comprehensibility and therefore for the trust-building process. It should be possible to break down the Driving Score to support understanding by increasing transparency of how the score is composed. According to Aigner et al., the visualization must consider 1) what kind of data needs to be visualized for 2) what task, in order to 3) choose a suitable visual representation (Aigner et al. 2011). Edwards and Veale (Edwards and Veale, 2017) discuss, e.g., the right to explanation in the context of automated decision-making. They argue that this right to explanation is helpful but may amount to a transparency fallacy (Edwards and Veale, 2017; Team IGP, 2020). Often, users have little use for a precise breakdown of such scores due to their high complexity. Therefore, it is important to design a Driving Score that informs and supports a decision-making process instead of replacing it. According to Harper, it may be necessary to understand the algorithm in its entirety when designing the interaction between a computer system and a user, but the resulting design should be such that an understanding of the computer system is no longer necessary (Harper, 2019). That means that explanations only matter when they are relevant to the user’s purposes. The Driving Score should therefore provide explanations to a degree that makes the score interpretable for all parties on P2P sharing platforms, but it need not necessarily explain how the algorithm works.

Here, the evaluation interviews gave some interesting insights into the individual relevance of the specific Driving Score parameters, each of which should be accompanied by a description. A comprehensible description of the different possible configurations of Driving Score parameters could support a better understanding of the algorithm. Furthermore, participants desired a clear and understandable summary of the preference settings and individual norms in the form of personas, which should also act as a “personification” of the algorithm, allowing comparison and indicating a possible need for adaptation in case the specification is too restrictive or too lax. This is in line with prior research discussing the generation and incorporation of explanation sentences in the context of recommender systems (Zhang et al. 2014; Chen et al. 2021). At the same time, such personas can serve as a filter to pre-select the potential rentees displayed. According to the evaluation interviews, the drill-down of the Driving Score in the form of a radar chart further supports understanding and, in some cases where the User Rating actually dominated decision-making, even contributed to rethinking a decision. It supports an individual reflection on the driving behavior based on specific parameters.
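The following minimal sketch illustrates how such a drill-down could expose per-parameter sub-scores that feed a radar chart; the parameter names and example values are hypothetical and chosen only to match the 1–10 scale used in our interviews.

```python
# A minimal sketch of a Driving Score drill-down; the parameters and
# example sub-scores are hypothetical. A real system would derive
# these values from telematics data.

from dataclasses import dataclass, field

@dataclass
class DrivingScore:
    # Per-parameter sub-scores on the 1-10 scale used in our interviews.
    sub_scores: dict = field(default_factory=lambda: {
        "speed": 8.5, "acceleration": 7.0, "braking": 6.5,
        "cornering": 9.0, "smoothness": 8.0,
    })

    def overall(self) -> float:
        return sum(self.sub_scores.values()) / len(self.sub_scores)

    def drill_down(self) -> dict:
        # The per-parameter values that would feed a radar chart,
        # making the composition of the overall score transparent.
        return dict(self.sub_scores)

score = DrivingScore()
print(round(score.overall(), 1))  # 7.8
print(score.drill_down())
```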

6.3 Potentials of Algorithm-Based Reputation Systems to Reduce Biases and Discrimination

It becomes apparent that the Driving Score can reduce potential biases against vehicle rentees, but not completely eliminate them. While participants of our pre-study and evaluation consistently valued the objective nature of the Driving Score, for some this is not enough to place sufficient trust in the driver. Regarding trust in the Driving Score, individualizability, e.g., in the form of individual norms, can make the system more flexible (Alkhatib, 2021) and thereby also reduce algorithmic bias. As trust is composed of interpersonal and technological trust, the combination of Driving Score and User Rating proposed by the participants could have a positive effect on the trust-building process (Luhmann, 1979; Hawlitschek et al. 2016; Rotter, 1967; Stevens and Bossauer, 2020). Furthermore, the Trust Score could counteract users’ bias arising from their trust attitude towards both User Ratings and algorithm-based scores (Stevens and Bossauer, 2020). The possibility of individualization and its visualization can increase the comprehensibility of the Driving Score, which is likely to benefit technology-skeptical users in particular (Shin, 2021). By combining both within the Trust Score, ambivalent users can specify how their preferences are composed across user ratings and algorithm-based scores. This improves decision support and allows better cooperation between the different user groups. Especially the findings of the evaluation show that an algorithm-based approach can also contribute to the reduction of discrimination. In several responses, discrimination in terms of age, gender, or nationality could be identified. Complementing prior literature (Calo and Rosenblat, 2017; Edelman et al. 2017; Carol et al. 2019; Tjaden et al. 2018), we often observed participants who had first judged based on characteristics such as name, age, or gender rethink their decision because of the algorithm-based reputation system: after a closer look at the Driving Score drill-down, decisions were adjusted and justified with the good results in the scores. Furthermore, the profile design should be adapted to draw attention to the relevant factors such as the Driving Score. The profile information could be displayed smaller than the Driving Score and User Rating so as not to draw attention to it. Highly discriminatory factors such as the profile picture could be removed entirely by verifying rentees through the platform instead.
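A minimal sketch of such a user-weighted combination is shown below; purely for illustration, we assume that both scores are normalized to the same 1–10 scale, and the function and parameter names are our own assumptions rather than part of our prototype.

```python
# A minimal sketch of the Trust Score idea, assuming both inputs share
# a 1-10 scale; the weighting parameter is the user's own preference
# between technological and interpersonal trust.

def trust_score(driving_score: float, user_rating: float,
                weight_algorithmic: float = 0.5) -> float:
    """Combine the algorithm-based Driving Score with the peer-based
    User Rating according to a user-chosen weight, so ambivalent users
    can decide how much each source of trust should count."""
    if not 0.0 <= weight_algorithmic <= 1.0:
        raise ValueError("weight must be between 0 and 1")
    return (weight_algorithmic * driving_score
            + (1.0 - weight_algorithmic) * user_rating)

# A technology-skeptical owner may down-weight the algorithmic part:
print(trust_score(8.0, 6.0, weight_algorithmic=0.3))  # 6.6
```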

Nonetheless, algorithmic bias can occur here as well and may disadvantage some users or driving behaviors in specific situations (Jackson, 2013; Alkhatib, 2021). Such shortcomings are discussed in the following section.

6.4 Shortcomings of Algorithm-Based Scores

Most of the participants perceive algorithm-based scoring solutions as correct and fair. Nevertheless, such solutions also reach their limits at some point. Often, maneuvers that are necessary can be evaluated negatively. An example is the acceleration lane on motorways: abrupt acceleration is perceived as bad driving behavior, but it is necessary in some situations. Even more serious, however, is the occurrence of unpredictable influences in road traffic. The definition of rule-compliant behavior therefore varies from situation to situation, and the algorithm does not know whether abrupt braking or strong acceleration was necessary or not. This confirms the findings of Rouvroy and Alkhatib, whose work draws attention to bias in algorithm-based decisions and emphasizes the importance of the subjective perception of individuals (Rouvroy, 2013; Alkhatib, 2021). This problem can also be seen in the context of autonomous driving (Hengstler et al. 2016). The participants do not want to be disadvantaged by unexpected maneuvers that may have been necessary. Another problem is unfair algorithm-based scoring: just because some drivers drive more sportily does not necessarily mean they drive unsafely. This could lead to a structural tendency of algorithmic systems to specifically misread marginalized groups (Alkhatib, 2021). Comparable problems can be found in the literature on credit scoring, where the place of residence can have a negative effect on creditworthiness and thus discriminate against people in selected regions (Andreeva et al. 2004; Fernandes and Artes, 2016; Havard, 2010; Marron, 2007). Moreover, the algorithms can be undermined through so-called data-behaviorism (Rouvroy, 2013), i.e., by adjusting behavior to obtain the best possible algorithmic evaluation. In some cases, this can lead to behavior that is bad for society as a whole. In P2P carsharing, for example, it is conceivable that some rentees avoid certain streets because there are many traffic lights and they would have to accelerate and brake more frequently there. This could lead to congestion on some roads and a higher traffic volume overall.

Further, there is a need for sufficient data to calculate an algorithm-based score. As Bossauer et al. (2020) pointed out, there is a trade-off between trust-building and privacy when using car telematics. From a privacy perspective, it is not self-evident that rentees will disclose all possible information about their driving behavior. Nevertheless, rentees can be motivated by certain added values and financial incentives to disclose their driving data. Such a score can only be applied if it provides an added value, if the utilization of the data is comprehensible, or at least if drivers cannot be disadvantaged (Athey et al. 2019; Bossauer et al. 2020). It is important here that the users, as data suppliers, are involved in the technological processes and that these processes are transparent and comprehensible. It is up to HCI designers to design such systems so that, on the one hand, better transparency increases the acceptance of providing data and, on the other hand, an incentive is created to interact with the algorithms (Harper, 2019). This applies to both car owners and rentees.

At the same time, a regular comparison of the algorithm-based scores with the real world is desirable, because only then can algorithmic biases be counteracted. For example, an option for the driver to comment on the driving behavior in certain situations would be important to ensure a fair evaluation. This could also increase the acceptance of such scores (Rouvroy, 2013).
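The following sketch illustrates one way such a commenting option could be represented as a data structure; the event fields and the review flag are hypothetical design assumptions and were not part of our prototype.

```python
# A minimal sketch of the commenting mechanism suggested above,
# assuming a hypothetical event structure: drivers can attach a
# statement to a negatively scored maneuver so it can be reviewed
# instead of being applied unquestioned.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ScoredEvent:
    maneuver: str                          # e.g., "hard_braking"
    penalty: float                         # penalty applied by the algorithm
    driver_comment: Optional[str] = None   # the driver's perspective
    under_review: bool = False             # flag for a fairness review

    def annotate(self, comment: str) -> None:
        """Record the driver's statement and flag the event, so the
        penalty can be reconsidered in light of the situation."""
        self.driver_comment = comment
        self.under_review = True

event = ScoredEvent("hard_braking", penalty=0.4)
event.annotate("Braked abruptly to avoid a rear-end collision.")
print(event.under_review)  # True
```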

6.5 Limitations

Regarding the limitations, we have tried our best to exclude subjective influences, e.g., by using multiple coders (Berends and Johnston, 2005). In qualitative studies, however, all statements are assigned to certain categories and are thus subject to personal interpretation. The selection of participants also allows only limited conclusions to be drawn about the understanding of algorithm-based reputation systems among the general population. The heterogeneity of the sample makes it possible to give a rough impression of how a small section of society understands scoring algorithms. Nevertheless, the results give a first impression of how people understand the functioning of an algorithm for evaluating driving behavior and how such algorithm-based reputation systems may support trust-building and decision-making in the context of P2P carsharing.

Further, only one form of presentation of a score (a number between 1 and 10) was used in the interviews for orientation purposes. This representation was chosen because such ratings are the current standard for reputation systems in many areas of the sharing economy (Stevens and Bossauer, 2020; Ert et al. 2016; Ye et al. 2009). Regarding our prototype, our work is limited by the fact that a prototype evaluation is just a study of ‘what might be’ (Salovaara et al. 2017). Accordingly, there is a need to validate our findings in a real-world application scenario.

Developing a model for an AI was not the goal of this work, even though conjectures for a possible category system were made. Since no concrete model is used in this paper, no model for measuring and evaluating driving behavior with artificial intelligence is given. The participants assumed a normative evaluation, so that the categories correspond to normative assessments. In the context of a classification model of an artificial intelligence, the question therefore arises of which classes are generated, how they are generated, and how the data flow into these classes. This can lead to conflicts of interest between humans and artificial intelligence, which need to be examined in more detail in future research.

7 Conclusion

The sharing economy has experienced a veritable boom in recent years, and many sharing models have emerged as a consequence. In our study, we focused on P2P carsharing, which depends on the trust that needs to be built between peers. Sharing platform providers therefore make use of reputation systems, which address trust by sharing information between peers. Algorithm-based reputation systems can be a new form of trust-building on P2P sharing platforms. For people to appreciate their benefits, however, they need to gain an intuitive understanding of their functional principle.

Based on 16 interviews in our pre-study, we gave insights into how people think such scores work and how they should be designed for a better understanding. In a second step, we developed a prototype of an algorithm-based reputation system for P2P carsharing within a co-creation workshop. The evaluation with 12 participants shows that people mostly understand an algorithm-based reputation system (the Driving Score) in the context of P2P carsharing as a kind of digital monitoring of driving behavior. The results of the pre-study as well as the evaluation indicate that algorithm-based reputation systems can indeed support trust-building. The evaluation participants also confirmed that a drill-down of a score considerably increases its benefit and thus promotes trust. Further, the possibility of weighting preferences within the scoring parameters as well as feedback on the implications of different parameter values would support system intelligibility and therefore also trust in technology (Stevens and Bossauer, 2020). Nevertheless, the existing literature on biases caused by algorithms and a correspondingly biased reality is also confirmed: algorithm-based reputation systems are not free of biases, especially in the case of unpredictable events such as strong braking maneuvers to prevent rear-end collisions. Here, our study confirmed that certain tolerance limits and also the inclusion of the driver’s perspective, e.g., through an individual statement in certain situations, are necessary for a reasonable evaluation. Such scenarios have to be considered in the design of the interaction between the users and the algorithm-based systems, and fair mechanisms have to be incorporated (Harper, 2019; Rouvroy, 2013). In future work, we want to explore algorithm-based reputation systems in a real-world application and in other scenarios, e.g., selecting an Uber driver for a ride. Furthermore, there is a need to extend our understanding of how algorithm-based reputation systems can reduce discrimination in real-world situations.