1 Introduction

The rapid growth of the tourism industry in recent years has provided online travel booking services and other entities with a massive amount of traveler data. As a result, travel sites that wish to provide useful recommendations are faced with the overwhelming task of mining through information. There is a great need for an accurate internal representation of users as a foundation for recommender services.

Recommender systems (RS) address the information overload problem by categorizing users’ interests and providing personalized recommendations. In the past, the predominant focus was placed on rating structures as a way of understanding users, which makes accuracy dependent on how engaged users are when providing ratings. A more recent approach is to employ personality in the user modeling process. Personality-Based Recommender Systems (PBRS) have served multiple purposes, including increasing recommender accuracy [12], providing a user-centric experience [9, 13], and eliminating the cold-start problem [12]. Despite the growing use of personality in RS, however, we still observe decreased user engagement during the personality acquisition process [9, 11, 15].

Another recent trend is the inclusion of context in recommender systems. Authors Adomavicius and Tuzhilin in [1] have provided a detailed guide to Context-Aware Recommender Systems (CARS), while also emphasizing that the definition of context varies amongst disciplines. However, a widely used explanation is found in the works of Dey [8]: “Context is any information that can be used to characterize the situation of an entity,” that is, the user whom the context influences. Because personalities do not exist in a vacuum, context is crucial to understanding real users’ preferences.

We propose a gamified personality acquisition method that incorporates travel-related questions to include context and to increase accuracy and user engagement. We expect a possible gain in accuracy based on the work of, among others, Mischel et al., who discuss the role of the situation as a locus of control when assessing personality. During interaction with the personality assessment system (in psychology jargon often referred to as a “scale”), users are asked “what-if” questions: If you were in situation x while traveling, how would you respond?

In the next section we discuss related work and highlight current methods of personality acquisition. Section 3 details our approach to scale construction, the Item Response Theory (IRT) model we used as an evaluation metric, and our methodology for building the Gamified Personality Acquisition (GPA) system. Section 4 presents and discusses our results, and Sect. 5 concludes with future directions.

2 Background and Related Work

2.1 Personality Model

Psychologists have constructed several instruments to measure personality. In contemporary science, one of the most widely accepted trait theories is the Five Factor Model (FFM), also known as the Big Five. This psychological construct divides personality traits into five dimensions: Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism (OCEAN) [5]. The revised NEO Personality Inventory (NEO PI-R) [5] consists of 240 items, which take participants approximately 45 minutes to complete. Shorter instruments, such as the NEO Five-Factor Inventory (NEO-FFI) [5] and the Ten-Item Personality Inventory (TIPI) [10], consist of 60 and 10 items, respectively. Though lengthier instruments demonstrate better psychometric quality, as noted in [10], they may be impractical in certain research settings. To achieve a user-centric personality acquisition experience, in which a participant’s engagement level and accuracy are maintained or even increased, we decided to modify the TIPI [10]. The TIPI consists of 10 bipolar items, each representing the high or low pole of one Five Factor Model dimension. The traits and their adjectival descriptors can be found in Table 1.

Table 1. Ten-item personality inventory (TIPI)
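As a concrete illustration of how a ten-item inventory of this kind is scored, the sketch below applies the standard TIPI scoring rule: each trait is the mean of one positively keyed and one reverse-keyed 7-point item. The item-number pairings follow the published TIPI key but should be checked against Table 1; the code itself is only an illustrative sketch, not part of our system.

```python
# Sketch of TIPI scoring: each Big Five trait is the mean of one
# positively keyed item and one reverse-keyed item, both on a 7-point
# scale. Reverse keying maps a response r to 8 - r.

# (positively keyed item, reverse-keyed item), items numbered 1-10
# following the published TIPI key. Note that the TIPI scores the
# Emotional Stability pole of Neuroticism.
TIPI_KEY = {
    "Extraversion":        (1, 6),
    "Agreeableness":       (7, 2),
    "Conscientiousness":   (3, 8),
    "Emotional Stability": (9, 4),
    "Openness":            (5, 10),
}

def score_tipi(responses):
    """responses: dict mapping item number (1-10) to a rating in 1..7."""
    return {trait: (responses[pos] + (8 - responses[rev])) / 2.0
            for trait, (pos, rev) in TIPI_KEY.items()}

neutral = {i: 4 for i in range(1, 11)}  # all-neutral answers
print(score_tipi(neutral))              # every trait scores 4.0
```

Reverse keying is what makes the two items per trait bipolar: a strong endorsement of the low-pole item pulls the trait score down symmetrically.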

2.2 Personality Acquisition

Various branches of the computing field, such as Artificial Intelligence (AI), have focused on finding ways to extract users’ personalities. Broadly, these approaches can be divided into explicit and implicit methods. Explicit methods ask users to answer question sets from validated personality inventories used in the psychology domain. Implicit methods investigate users’ personalities by observing a person’s interactions with an interface or by analyzing digital footprints such as social networks, blogs, comments, etc. [12]. For instance, Walker et al. used personality markers in language to detect a user’s personality; user personality can thus be inferred by analyzing text (e.g. written essays) or conversations. Their application [16] was built using WEKA machine learning models trained on the essay corpus collected by Pennebaker and King. The system offers a choice of four algorithms: linear regression, M5’ model trees, M5’ regression trees, and support vector machines for regression. Wright and Chin [4] have developed a similar tool; however, to our knowledge, it is not freely available. Additionally, Yokoyama et al. successfully used Egograms together with a multinomial Naïve Bayes classifier to evaluate the personalities of bloggers, and more recently Shen, Brdiczka, and Liu created a system that analyzes personality from e-mail messages. Recent research further investigates the feasibility of both methods: in a study by Hu et al. [13] comparing personality quiz questionnaires to rating-based preference elicitation, participants preferred the explicit method. The results were analyzed in terms of several criteria: perceived accuracy, user effort, and user loyalty.

Even though the inclusion of personality in the recommendation process has led to increased accuracy, researchers have also uncovered the need for an “entertaining personality quiz” [11] or “interactive and engaging” interface [9]. Thus far there have only been a few works focused on making this process more user-centered. In [7] the authors had partial success developing stories to assess the FFM. However, this is only one approach to keeping users engaged.

3 Methodology

We divide this section into two parts: (i) the approach to creating our contextualized scale and GPA, and (ii) the approach to building the personalized TravelRecommender based on GPA.

3.1 Gamified Personality Acquisition (GPA) System

Personality Scale Construction.

Our methodology comprises four steps.

Step 1: Articulate construct and context.

In the first step, we chose the TIPI as the base for our modified scale and therefore mapped the underlying construct (the Big Five), assuming the dimensions listed in Table 1. The scale is designed to assess personality in the context of travel recommendations. BestTripChoice, for example, draws on over 30 years of industry knowledge and academic research in a 15-question quiz that sorts travelers into “travel personalities” and generates a destination recommendation. Destination personality is a concept derived from brand personality. The traveler behaviors described by Plog served as inspiration for applying context to both the items and the response options.

Step 2: Choose response format and initial item pool.

The TIPI response options take the form of a 7-point Likert scale (disagree strongly, disagree moderately, disagree a little, neither agree nor disagree, agree a little, agree moderately, and agree strongly). We therefore also mapped seven response options to each item. Each item (corresponding to a trait, e.g. extraversion) is presented as a contextualized situation (e.g. “You are sitting outside at the hotel bar having a drink; suddenly a crowd of people joins you because it is happy hour”). The response options are action-based reactions to that particular travel situation, graded by the level of the target trait (e.g. a response indicating a high degree of extraversion would be “I put down my book and join the crowd”). A partial contextualized scale can be found in Table 2.

Table 2. Partial Contextualized Scale for Extraversion
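One way to represent such an item in code is as a situation plus seven ordered, action-based responses standing in for the Likert categories. In this sketch the situation is paraphrased from the extraversion example above; all response texts except the highest-category one are hypothetical placeholders, not items from our scale.

```python
# Sketch of one contextualized item: a travel situation with seven
# ordered action responses, each standing in for one Likert category
# (1 = lowest level of the target trait, 7 = highest).
from dataclasses import dataclass

@dataclass
class ContextualizedItem:
    trait: str
    situation: str
    responses: tuple  # ordered low-to-high on the target trait

extra1 = ContextualizedItem(
    trait="Extraversion",
    situation=("You are sitting outside at the hotel bar having a drink; "
               "suddenly a crowd of people joins you because it is happy hour."),
    responses=(
        "I pay and go back to my room immediately.",       # 1
        "I move to a quieter table away from the crowd.",  # 2
        "I keep reading and ignore the crowd.",            # 3
        "I stay put but do not interact.",                 # 4
        "I exchange a few words if someone talks to me.",  # 5
        "I chat with the people next to me.",              # 6
        "I put down my book and join the crowd.",          # 7
    ),
)

def likert_score(item, chosen_index):
    """Map a chosen response (0-based index) to its 1-7 Likert category."""
    return chosen_index + 1

assert likert_score(extra1, 6) == 7  # joining the crowd = highest extraversion
```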

Step 3: Collect data from respondents.

We collected data using Amazon’s Mechanical Turk workforce, a crowdsourcing platform that brings researchers and participants together efficiently and has proven effective and accurate in the social sciences [2, 3]. To ensure data reliability, participants went through a rigorous qualification process that included an English proficiency test designed by Cambridge English.Footnote 1 To eliminate potential misunderstandings due to language and cultural differences, we only accepted adult workers from the United States. According to the test makers, native proficiency corresponds to a test score of 23 out of 25. Because this standard is based on British rather than American English, we set the threshold slightly lower, at 20 out of 25. Additionally, attention-check questions were included to ensure participants were reading the questions. In total, responses from 549 participants were used to evaluate the psychometric quality of our scale.

Fig. 1.
figure 1

GPA start interface and sample question interface (from left to right)

Step 4: Examine psychometric properties and quality.

We used Item Response Theory (IRT) to evaluate the psychometric quality of our contextualized scale. The Graded Response Model (GRM) of IRT, shown in Eq. 1 below, has two main parameters. Item discrimination (a) indicates the degree to which an item distinguishes between examinees with different levels of the target trait. An item’s difficulty parameter (b) is on the same metric as the trait level (depicted as theta, \( \theta \), on an arbitrary scale); it marks the trait level at which examinees have a 0.50 probability of endorsing a given response category or higher versus any lower response category. Thus, according to the analysis in Table 5, an examinee must have an extraversion (extra1) level of −1.289 to have a 50/50 chance of endorsing the first response category versus response categories 2−7.

$$ P\left( Y \ge j \mid \theta \right) = \frac{\exp \left( a_{i} \left( \theta - b_{ij} \right) \right)}{1 + \exp \left( a_{i} \left( \theta - b_{ij} \right) \right)} $$
(1)
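Equation 1 can be checked numerically. The sketch below implements the GRM cumulative category probability and verifies the 0.50-endorsement interpretation of the difficulty parameter, using the extra1 estimates quoted in the text:

```python
import math

def grm_prob(theta, a, b):
    """Graded response model: P(Y >= j | theta), the probability of
    endorsing response category j or higher given trait level theta,
    item discrimination a, and category difficulty b."""
    z = a * (theta - b)
    return math.exp(z) / (1.0 + math.exp(z))

# At theta == b an examinee has exactly a 50/50 chance of endorsing
# category j or higher -- the interpretation given above for extra1,
# whose first difficulty estimate is b = -1.289.
a, b = 3.907, -1.289
print(grm_prob(-1.289, a, b))     # 0.5
print(grm_prob(0.0, a, b) > 0.5)  # True: higher trait level, higher probability
```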

These parameters are graphically represented by Item Characteristic Curves (ICC) and Item Information Curves (IIC) in Fig. 3. Information indicates the degree to which the test estimates a person’s trait level at each point of the ability continuum; the more information present (i.e., the more the IIC peaks over a broader range of ability levels), the more reliable the item or scale. The scale construction process yielded items that reliably measure 60 % of the FFM; the remaining 40 % was covered by four items from the original TIPI.

GPA.

We implemented the GPA system using the Unity3d game engine.Footnote 2 We chose a game engine so that we could mimic game-like features to the extent possible without introducing too much noise: a user more focused on “winning” than on truthfully answering questions could compromise the integrity of the personality assessment. Unity3d is a cross-platform game engine and is currently one of the most popular engines, owing to its large developer community and extensive documentation. A sample interface of GPA is shown in Fig. 1 above.

Fig. 2.
figure 2

GPA TIPI interface and rating-based interface (from left to right)

3.2 Personalized TravelRecommender Based on GPA

We evaluated the performance of GPA by integrating it into a collaborative-filtering-based TravelRecommender using real-life data from Trip Advisor, the world’s largest platform for travel-related reviews. We measured the effect of GPA on recommender performance in terms of Mean Absolute Error (MAE) and Receiver Operating Characteristic (ROC) sensitivity. We additionally built a rating-based as well as a TIPI-based preference elicitation interface to serve as baselines for comparison (see Fig. 2). The recommendation process consists of two stages: the first estimates similarity between users based on predetermined criteria, and the second predicts a rating for a given item. We employed the similarity and prediction algorithms of [12], which use the Pearson correlation coefficient, one of the most commonly used similarity measures in RS research. While those authors compared rating-based to personality-based recommender systems, we modified their approach to fit our setting, adding a GPA-based case and using a different FFM personality inventory (see Table 1). Lastly, we collected user preference data.
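A minimal sketch of the two stages follows, assuming the usual Pearson-correlation neighborhood scheme: similarity is computed between users' FFM trait vectors (stage one), and a rating is predicted as the user's mean rating plus a similarity-weighted average of neighbors' mean-centered ratings (stage two). The variable names and data layout are illustrative, not the actual implementation of [12].

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric vectors,
    e.g. two users' five FFM trait scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def predict(user, item, ratings, profiles):
    """Stage 2: predict user's rating for item as the user's mean rating
    plus a similarity-weighted average of neighbors' mean offsets.
    ratings: {user: {item: rating}}, profiles: {user: trait vector}."""
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, theirs in ratings.items():
        if other == user or item not in theirs:
            continue
        sim = pearson(profiles[user], profiles[other])  # stage 1
        other_mean = sum(theirs.values()) / len(theirs)
        num += sim * (theirs[item] - other_mean)
        den += abs(sim)
    if den == 0.0:
        return user_mean
    return user_mean + num / den

profiles = {"u": [5, 4, 3, 2, 6], "v": [5, 4, 3, 2, 6]}
ratings = {"u": {"a": 4}, "v": {"a": 5, "b": 3}}
print(predict("u", "b", ratings, profiles))  # 3.0: v rates b one star below its mean
```

Mean-centering the neighbors' ratings compensates for users who rate systematically high or low before their opinions are aggregated.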

Data Collection.

The personality-based TravelRecommender uses a dataset obtained from Trip Advisor. We chose Trip Advisor as a data source because it is the world’s largest travel site and covers a wide variety of travel-related points of interest and reviews. According to Trip Advisor, the site has 315 million unique visitors monthly and provides 190 million reviews covering over 4.4 million accommodations, restaurants, and attractions. Aiming for a broad spectrum of points of interest, we used Trip Advisor’s “Best of 2014” guide and selected the top 10 cities for every continent. Due to time limitations, we further restricted the collection to the top 100 attractions per destination and the first 200 reviews. A summary of the data statistics can be found in Table 3 below.

Table 3. Trip Advisor Data Statistics

Personality Scoring Using Authors’ Reviews.

Given the novelty of using personality in RS, there is a corresponding lack of available data mapping user preferences to personality scores. We therefore employed the Personality Recognizer [16] mentioned in Sect. 2 to calculate personality scores for the authors of the reviews in our Trip Advisor dataset. Because the personality-based TravelRecommender used static data for research purposes, personality scores were calculated offline and stored in the database as well. The software employs the Linguistic Inquiry and Word Count program (LIWC)Footnote 3 and the MRC Psycholinguistic Database Machine Usable Dictionary (MRC).Footnote 4 LIWC is a widely used text analysis tool: it counts words and assigns them to the psychological categories defined in the LIWC dictionaries. MRC is a machine-usable dictionary designed by the University of Oxford for researchers in artificial intelligence and computer science who need psychological and linguistic word descriptions. The Personality Recognizer outputs personality scores based on the FFM, ranging from 1 to 7 for each domain. To ensure a reasonable word count for every author’s personality assessment, we selected all reviewers with at least five reviews and computed their FFM personality scores using the Personality Recognizer. We also found that several reviews were not written in English and thus excluded any authors whose user names were written in a non-Latin alphabet. Table 4 summarizes the updated dataset.
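The author filtering just described can be sketched as follows. The five-review threshold and the Latin-alphabet user-name check mirror the text; the review data layout and the exact character class are assumptions for illustration.

```python
import re

# Keep authors with at least five reviews whose user names contain only
# Latin-alphabet characters (a rough proxy for English-language authors,
# as described in the text). The character class is an assumption.
LATIN_NAME = re.compile(r"^[A-Za-z0-9_.\- ]+$")

def eligible_authors(reviews, min_reviews=5):
    """reviews: list of (author, review_text) pairs."""
    counts = {}
    for author, _ in reviews:
        counts[author] = counts.get(author, 0) + 1
    return {a for a, n in counts.items()
            if n >= min_reviews and LATIN_NAME.match(a)}

sample = ([("alice", "...")] * 6      # 6 reviews, Latin name: kept
          + [("боб", "...")] * 6      # Cyrillic name: excluded
          + [("carol", "...")] * 2)   # only 2 reviews: excluded
print(eligible_authors(sample))       # {'alice'}
```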

Table 4. Data Containing Personality Scores

Participants.

To evaluate the effects of GPA in our TravelRecommender, we invited the same (already screened) participants to experience our system. A total of 246 of the 549 participants responded. After further filtering, i.e. selecting users who completed the GPA, TIPI, and Rating-Based (RB) interfaces, we obtained results for 66 unique participants: 39 female and 27 male, with ages ranging from 22 to 66 (mean = 42.33, st. dev. = 14.54). Forty-three percent of participants described themselves as traveling “a few times per year,” while 28.9 % claimed to travel “at least once per year.” A vast majority, 68.18 %, stated they had traveled abroad at least once.

4 Results

4.1 Item Response Theory

The ltm package in R was used to calibrate the items with the graded response model; parameter estimates are shown in Table 5. The analysis for the extraversion domain shows difficulty scores ranging from −1.289 to 1.54, covering a relatively wide spectrum of the ability continuum. According to [6], a scale targeted at assessing a broad range of ability levels should span from −2 to 2 units.

Overall, both items of the domain contribute to the reliability of the test. The peaks of the Test Information Curve (TIC), where the test is most reliable, reach approximately 4 for ability levels from −2 to 2, indicating that the scale provides a reasonably good amount of information for the extraversion domain and can be considered reliable over that trait-level range. We also see high discrimination values for extra1 and extra66: 3.907 and 1.436, respectively. Thus, for both items in this dimension, the scale discriminates well between low- and high-level examinees of the trait. For reasons of space, the other domains are not discussed in depth; Table 5 can be consulted for their estimates. We must note that some items did not have sufficiently high discrimination values, resulting in an inability to assess the broad spectrum of the trait level. We therefore constructed the final scale with six story-based items and four items from the original TIPI (shown in Table 5 as items starting with “t”) to ensure a solid measure of the FFM.

Table 5. Difficulty and Discrimination Parameters
Fig. 3.
figure 3

Extraversion domain IRT

4.2 GPA-Based TravelRecommender

In the subsequent analysis we look at a sample of the collected data, chosen based on whether a participant completed all three cases and thus could provide a direct comparison. A total of 66 unique participants out of the 246 respondents finished all stages.

Mean Absolute Error (MAE).

Figure 4 displays the MAE values for the 66 unique participants in the sample described above. In this sample, GPA (MAE = 1.116) outperformed TIPI (MAE = 1.186), a 5.93 % decrease in error. The RB case performed significantly worse, showing an average absolute error of 1.84 stars in rating prediction. Both GPA and TIPI exhibited significantly lower prediction error than the traditional RB system, with reductions of 39.36 % and 35.54 %, respectively.
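For reference, MAE and the relative-improvement percentages used in these comparisons are computed as follows (the numbers in this sketch are illustrative, not the study's):

```python
def mae(predicted, actual):
    """Mean absolute error between predicted and observed ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

def relative_improvement(mae_new, mae_baseline):
    """Percentage decrease in MAE relative to a baseline system."""
    return 100.0 * (mae_baseline - mae_new) / mae_baseline

print(mae([4, 3, 5], [5, 3, 3]))       # 1.0
print(relative_improvement(1.5, 2.0))  # 25.0
```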

Fig. 4.
figure 4

MAE (left) and ROC sensitivity (right) of travel recommender sample data

Receiver Operating Characteristic (ROC) Sensitivity.

ROC sensitivity measures the sample’s performance with respect to decision support: more specifically, the system’s ability to retrieve true positives. Figure 4 displays the results with respect to ROC sensitivity. They clearly indicate that GPA (sensitivity = 0.616) outperforms TIPI (sensitivity = 0.576) and RB (sensitivity = 0.53), a significant difference of 6.94 % and 16.23 %, respectively. Overall, GPA displayed a significantly better ability to retrieve relevant items than the TIPI- and RB-based systems.
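Concretely, sensitivity is the true-positive rate over binarized relevance judgments. The sketch below binarizes ratings with a relevance threshold; the threshold of 4 stars is an assumption for illustration, not a value stated in the text.

```python
def sensitivity(predicted, actual, threshold=4):
    """True-positive rate: of the items the user actually found relevant
    (actual >= threshold), the fraction the system also predicted as
    relevant (predicted >= threshold). The 4-star cutoff is assumed."""
    tp = sum(1 for p, a in zip(predicted, actual)
             if a >= threshold and p >= threshold)
    fn = sum(1 for p, a in zip(predicted, actual)
             if a >= threshold and p < threshold)
    return tp / (tp + fn) if (tp + fn) else 0.0

print(sensitivity([5, 2, 4, 3], [4, 5, 4, 2]))  # 2 of 3 relevant items retrieved
```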

5 Conclusion and Future Work

In this paper we have demonstrated a novel approach to designing and statistically validating a personality scale that provides an interactive experience, and we have integrated it into a personality-based TravelRecommender. Our empirical data indicate that an interactive, story-based preference elicitation system can match, and in some cases outperform, existing systems based on ratings or traditional personality questionnaires in terms of accuracy metrics.

We intend to further explore the effects of a scale composed entirely of story-based items, versus the present 60−40 % split. Based on the current results, we believe that usability can be improved by using only contextualized personality assessment items. Another avenue we plan to explore is the use of machine learning algorithms that adapt subsequent personality assessment questions based on earlier responses, yielding an adaptive system that more accurately assesses the various facets of a given domain. Finally, since a majority of participants in the first survey (nearly 60 %) preferred the interactive questions, it would be interesting to explore personality acquisition in a more gamified manner that depends less on text and more on graphical cues.

Recommendation systems are designed for the user, and thus we must find ways to increase system likability and at the same time increase the likelihood that a user will consult it in the future.