The diversity rank-score function for combining human visual perception systems.

There are many situations in which a joint decision, based on the observations or decisions of multiple individuals, is desired. The challenge is determining when a combined decision is better than each of the individual systems, along with choosing the best way to perform the combination. It has been shown that the diversity between systems plays a role in the performance of their fusion. This study involved several pairs of people, each viewing an event and reporting an observation, along with their confidence level. Each observer is treated as a visual perception system, and hence an associated scoring system is created based on the observer's confidence. A diversity rank-score function on a set of observation pairs is calculated using the notion of cognitive diversity between two scoring systems in the combinatorial fusion analysis framework. The resulting diversity rank-score function graph provides a powerful visualization tool for the diversity variation among a set of system pairs, helping to identify which system pairs are most likely to show improved performance with combination.


Introduction
The concept of multiple scoring systems has been applied to a variety of domains [1,2]. In situations where multiple scoring systems are constructed, we are interested in conducting a meta-analysis to gain an understanding of the relationship between the systems, specifically the diversity between them. It has been shown that the combination of two scoring systems can outperform individual systems when there is some diversity between the systems, and they are of relatively good performance [1,3]. To this end, quantitative measures of diversity can be used to generate diversity scores for pairs of systems, which can then be analyzed within the combinatorial fusion analysis (CFA) framework [1].
Human beings are constantly and naturally performing fusion of information within and among the senses. There is extensive research in this area on the neurological level pertaining to how fusion in the sensory system works [4][5][6], how visual information is combined with information from other senses [7][8][9][10][11], and how visual systems are combined [12,9,13]. In this study, however, we are focused on the inter-human level of information and fusion of the information at the decision level.
There are many situations in which two people's observations are considered for a decision, such as referees in a football or tennis match, physicians examining a patient, co-pilots navigating a plane, and so on. For example, when two physicians are examining a new patient, each may observe different symptoms that can indicate different diseases; interactive consultation may lead to a final diagnosis. When two people are interactively making a decision based on visual input, research by Bahrami et al [12], Ernst and Banks [7], and Kepecs et al [13] suggests that these decisions are improved when two people are interactively making the decision, rather than an individual. The question then becomes, if we have two people making visual observations of an event, how do we integrate these observations or decisions? Do we choose one of the observer's results, or create a combination of the two? Koriat [14] emphasizes the importance of confidence, and that it may be a good option to take the decision of the more confident person. The approach taken in our study is to combine the observations or decisions made by two people in an attempt to outperform the individual decisions. The visual observations tested in this project involve pairs of volunteers that are asked to give the location of a small object they observe being tossed in a field.
In order to perform the desired combination, by score or by rank, a scoring system must first be constructed for each participant in a trial. Each participant's observation, or perception system, is represented as a scoring system, which is made up of a score function and a rank function. Given this multiple scoring system scenario, we then analyze the cognitive diversity between the scoring systems of a trial. A quantitative diversity measure, the distance between two rank-score functions, is used to represent the cognitive diversity between two scoring systems [1,2]. Examining the relative diversities between the system pairs, together with the performance of their combinations, can give us insight into how diversity variation may play a role in the performance of system combinations. The diversities between systems are analyzed using the diversity rank-score functions, which are then visualized in diversity rank-score graphs. This visualization of diversity variation is beneficial in situations where there are a large number of scoring system pairs (hundreds or thousands). Interactive data visualization [15][16][17] is a dynamic field in which data are visualized with the intent to facilitate an end user in a particular task. The diversity rank-score function graph is such a tool that has potential to be integrated into various data analytics and software systems.
Information fusion can be applied to many situations where there are multiple scoring systems, or multiple classifiers. For example, the CFA framework [18,1,2] has been applied to information retrieval [19], text categorization [20], target tracking [21], sensor feature selection and combination [22], and image skeleton pruning [23]. Combinatorial fusion has also been used for enhancing the analysis of various biomedical datasets including virtual screening for molecular compounds [3], protein structure prediction [24], and ChIP-seq peak detection [25]. When combining multiple models (performing information fusion), it would be useful to know in advance whether the fusion will outperform the best model. Ng and Kantor [26] identify system features that can help predict whether fusion will be beneficial. Combination of multiple classifiers has also been shown to improve results in the area of pattern recognition. [27,28] The content of this paper is organized as follows: Sect. 2 describes the concept of multiple visual perception systems, along with the corresponding multiple scoring systems, which are considered a generalization of multiple classifier systems. The CFA framework, which establishes each visual perception system as a scoring system and combines two such systems, is also described. The diversity rank-score function can be used as a guiding light to combine pairs of visual perception systems based on the diversity variation across a set of trials. In Sect. 3, we describe the visual perception dataset, present the results of scoring system combinations, and examine the role of the diversity rank-score function graph in the context of diversity variation and visualization. Concluding remarks and discussion are included in Sect. 4. In many domains, such as biomedical informatics, finance, security, information retrieval, among others, classification models are created in order to generate class predictions for new data. Binary classifiers attempt to categorize items into one of two classes (or labels). For example, determining whether a webpage is relevant to a search term or not, or whether a patient tests positive or negative for a disease. Some binary classification problems are asymmetric, meaning one class occurs much less frequently than the other. Multiclass classifiers involve more than two classes. The output of a classification system includes a class prediction, along with an associated probability. Treating these probabilities as scores, and sorting the results by score to generate rankings, enables us to consider classification systems as a scoring system that have a score function and a rank function.
In an effort to improve classification accuracy, it is often desired to incorporate the results from multiple classifiers that are varied in terms of their approach or algorithm. The element of variety, or diversity, is essential since different classifiers may contribute various perspectives, results, or predictions, on the data. Generally, the results from multiple classifier systems are combined using ensemble methods such as majority voting (bagging) or weighted voting (boosting). Table 1a, b contains a snapshot from a classification example in which the class label of a sample document is predicted in each of the following two cases: (a) 3 class labels, and (b) 6 class labels (Table 1a, b). The document is analyzed by 4 different classifiers, each of which output the probability that the document belongs to class A, B, or C, in the case of 3 class labels. In Table 1b, each document belongs to one of the 6 class labels: A, B, C, D, E, or F. For each classifier, the class label with the highest probability is considered the predicated class label and is assigned rank 1. Likewise, the next highest probability is assigned rank 2, and so on. The ensemble approach of majority voting is used to combine the results of the individual classifiers. For each class label, we count the number of times that class is ranked 1 (has the highest probability) by a classifier. Then, the class label with the highest number of votes is considered the predicted class for the document.
If we consider the classifiers as scoring systems (see Table 2), we can apply score and rank combinations as an alternative ensemble approach. Here, the probabilities are treated as scores, which are then ranked. Score combination (SC), in this example, is the average of the scores for a class label across the 4 classifiers. The class label with the highest average score is chosen as the result. The rank combination (RC) is computed as the average rank for a class label for all classifiers. The class label with the lowest average rank is then selected. Weighted averages can be used if the past performance of the classifiers is known. In this example, we can see that combining by score or rank may produce different results. Table 1b is a classification Table 1 Combination of multiple classifier systems with (a) 3 class labels and (b) 6 class labels using majority voting (C MAJ ), score combination (C SC ), and rank combination (C RC ) a Classifier   The diversity rank-score function for combining human 65 problem that involves more possible class labels. In this example, we see that classifiers can be viewed as scoring systems, where the scores are the class label probabilities. The concept of multiple classifier systems with multiple class labels (the case in Table 1b) is then generalized to multiple scoring systems with multiple choices (items or options) (as is the case in Table 2). When constructing an ensemble, it is desired to have diversity among the component classifiers or scoring systems. Several techniques for measuring diversity have been proposed for regression and classification [29,30]. It is more challenging to measure diversity between classifiers if we just consider the output class labels, without their associated probabilities [29].
Viewing classification systems as scoring systems enables us to apply the concept of diversity that has been defined for multiple scoring systems [1,2,26,3].

Establishing each visual perception system as a scoring system
In situations where we have a set of documents (webpages, genes, customers, etc.) that are assigned scores or probabilities by an algorithm or classifier, creating a scoring system is straightforward. However, in cases where we do not have a set of scores to work with, a score function needs to be generated based on the value(s) given. In this experiment, when an observer is deciding on the proposed landing point of the object based on the visual input, he/she is selecting from several locations within a range. Intervals within this visual range will be considered as the items (or options) that will be scored and ranked. Since there are two subjects within each trial, the corresponding score functions must score the same set of intervals. To this end, a common visual space is created, as described in previous work [18]. First, the mean of the decisions (points) for the two observers P and Q is computed in three different versions, varying the weight given to the confidence radius r. M 0 ; M 1 , and M 2 , are computed as The scoring system analysis is performed for each version of M i . Specifically, the M i values are used as a foundation point from which to create a common visual space. The M i points are always located between the P and Q original points. The visual space is also extended on both sides of P and Q. The common visual space is divided into 63 intervals. The interval scores are computed using a normal distribution around M i , using the confidence radius (0.5r) for the standard deviation. The performance of each M i is measured as the distance from M i to the actual location of the object [31]. The scores, created for the intervals for P and Q, give us the score functions s P and s Q . Given a set of intervals d 1 ; d 2 ; . . .; d n , the scoring system P consists of a score function s P , rank function r P , and rank-score characteristic (RSC) function f P (see Fig. 1). The rank function for the scoring systems P and Q are obtained by sorting s P and s Q and assigning ranks to create the rank functions r P and r Q . The Rank-Score Characteristic (RSC) function, as defined by Hsu et al [1,2], is the composite function of s p and the inverse of r P . Rank-score functions map ranks to scores, and are independent of the data items. Here, the rank-score characteristic (RSC) function for the scoring system P, f P : N ! R, is computed as Similarly, f Q is computed for scoring system Q.

Combining two visual perception systems
Within the CFA framework [1,2], system combination is performed either by score or rank combination. A score combination is computed as the average of the score functions, s p and s Q for each interval, d i , giving us the score function of the score combination s SC . The rank function of the score combination, r SC , is achieved by sorting s SC in descending order and obtaining ranks for each d i . In addition, we compute the rank combination by averaging the rank functions r P and r Q , to give us the score function of the rank combination, s RC . We sort this function in ascending order and assign ranks to get its associated rank function, r RC (see the example in Table 2). The performance of these combined results is measured by the distance of the newly computed points to the actual x,y coordinates where the object landed in the field.

Cognitive diversity between two scoring systems
In cases where multiple scoring systems, algorithms, or approaches exist, it is beneficial to know under what circumstances combining pairs of these systems could result (a) score function s P , (b) rank function r P , and (c) rank-score characteristic (RSC) function f P in improved performance. Diversity between two scoring systems A and B can be measured in a few different ways, such as the distance between score or rank functions using covariance (between s A and s B ) or Kendalls tau (using r A and r B ), respectively. Another method to measure the diversity between two scoring systems, which is used here and called cognitive diversity, is to measure the distance between the rank-score functions (f A and f B ) of the two systems [1,2] (see formula (2) and Fig. 1). Figure 2 illustrates two RSC functions, f A and f B , for two arbitrary scoring systems A and B. One distance measurement is the area between the two RSC functions. We note that the cognitive diversity between scoring systems A and B, as seen in Fig. 2, provides a powerful visualization tool on the similarity or dissimilarity between these two visual perception systems, A and B, in the context of the current study.
In this analysis, the concept of cognitive diversity is applied to the trials and scoring systems P and Q, which represent the 2 participants in a given trial pair. Therefore, the cognitive diversity of the two observers P and Q, d(P,Q), defined as the distance between the rank-score functions of two systems P and Q, f P and f Q , is computed as follows: 2.3 Diversity rank-score function across a set of trials Let T ¼ fðp 1 ; q 1 Þ; ðp 2 ; q 2 Þ; . . .; ðp n ; q n Þg represent a set of n trials, each consisting of an ordered pair of participants and let R ¼ fdðp 1 ; q 1 Þ; dðp 2 ; q 2 Þ; . . .; dðp n ; q n Þg represent the diversity scores for each pair in T, where N ¼ 1; 2; . . .; n.
The cognitive diversity between each pair of scoring systems, P and Q, is measured by the diversity function d(P,Q), as shown in equation (3), where m is the number of items (intervals) to be scored; in this case m is 63, indicating the number of intervals in the common visual space. The set of diversity values itself can be treated as a scoring system, making the diversity function into a diversity score function. For this purpose, the number d(P,Q), which is the diversity between scoring systems P and Q, is considered as the diversity score function value of the trial (p,q) and is denoted as s ðp;qÞ . The diversity rank function is attained by sorting the score function and generating ranks, giving r ðp;qÞ . A diversity rank-score function, f ðp;qÞ , is computed as f ðp;qÞ ðiÞ ¼ ðs ðp;qÞ r À1 ðp;qÞ ÞðiÞ ¼ s ðp;qÞ ðr À1 ðp;qÞ ðiÞÞ ð4Þ The diversity rank-score function is a mapping from diversity ranks to diversity scores. The relationship between s ðp;qÞ , r ðp;qÞ , and f ðp;qÞ is shown in Fig. 3.
3 Case analysis using diversity rank-score graph

Visual perception dataset
The setting for the data collection was in a grassy field in NYC's scenic Central Park. A lab member was tasked with recruiting pairs of participants for the experiment. The pairs of subjects varied in terms of gender and relationship between the individuals. The subject pairs were randomly chosen and could be friends, siblings, husband and wife, colleagues, or acquaintances. A small metal object that was made of metal plates, nuts, and a bolt, and of size 1.5 by 1.5 inches was used for the experiment, since it was possible to throw it far distances, small enough to be hidden in the grass, and would not roll from its position once landed. The subject pairs stood 40 feet from a marked square of size 250 by 250 inches, and the individuals stood a distance of 10 feet away from each other. A member of our group tossed the metal object into the designated square. Each participant is asked individually to walk and point to where he/she believed the object landed. A marker is placed at these locations. The participants are also asked to give a measure of their confidence of his or her guess in the form of a confidence radius around the specified mark. Lab members helped the participants gauge their confidence The diversity rank-score function for combining human 67 radius by using tool consisting of 2 poles of length 36 by 36 inches to represent the x and y coordinates. Smaller radius values indicate higher confidence of the subject. A lab member measures the distance from the actual position where the object landed and the guess positions of the subjects.
The subjects are given feedback as to how far off their guess is from the actual landing point of the object. The values collected are: x,y coordinates for subject P and Q from each experiment, a confidence radius for each participant, along with the actual landing x,y coordinate of the object. All measurements are in inches. The values for the trials in this most recent experiment are shown in Table 3. Our group has conducted previous data collection activities of this type, the data of which can be found in [18].
The distribution functions for P and Q for a sample trial are shown in Fig. 4a. Sample rank-score functions for a trial are shown in Fig. 4b.

Analysis results of combinations
The experimental results are presented in Fig. 5. The performances of P and Q, shown in column (a), are the distances to the actual landing point of the object. The confidence radii are included in column (b), in which a shaded cell indicates that the more confident participant leads to the best performance. The performance of the weighted means M 0 , M 1 , and M 2 is listed in column (c). C represents the score combination and D represents the rank combination. The last column, (d), presents information for the results using each of the weighted means, along with the score and rank combinations (C and D). For each i ¼ f0; 1; and 2g, P, Q, M i , C, and D are ranked in descending order of performance; repeated ranks indicate tied performance. Rank 1 showed the best performance, meaning the closest interval to the actual location of the object. Cases where the score (C) or rank (D) combinations either outperformed or tied the best individual system are highlighted.
3.3 The role of diversity rank-score graphs After performing the score and rank combinations for the three different computations of M ( M 0 , M 1 , and M 2 ), we can summarize the results as follows: Using M 0 , the score and/or rank combination for 14/16 trials showed either tied or improved performance compared to the best individual system; using M 1 , 9/16 trials; and using M 2 , 7/16 trials. The diversity rank-score functions for the scoring systems created according to the three different computations of the mean: M 0 , M 1 , and M 2 , are depicted in Fig. 6. Examination of these graphs, along with the performance of the corresponding system pair combinations, can help us understand the role of cognitive diversity in system combinations by score and rank. To make the connection with the trials, Table 4 is included to show the ranking of trials according to the diversity of their component scoring systems, for each case of M 0 , M 1 , and M 2 . When comparing with the performances of the system combinations, we detect a tendency for pairs of systems with relatively high diversity to have more improved performance. In this We observe that the diversity rank-score graphs are good indicators for the combination outcome. For example, trials d 5 and d 16 appear at the very end of the graph in M 0 , M 1 , and M 2 (see Figure 6 and Table 4). In these two trials, neither rank nor score combination helps improve the outcome. However, even though trial d 9 has a very low diversity (Table 4), its combination of scoring systems P and Q is better than or equal to the best of P and Q since P has a relatively high performance.

Conclusion and further work
In this paper, we studied the combination of multiple visual perception systems using the CFA framework and the diversity rank-score function. By establishing each visual perception system as a scoring system on a set of options (possible locations, in our context) in a common visual space, the problem of combining multiple visual perception systems is treated as a problem of combining multiple  scoring systems. Using a dataset of an experiment with sixteen trials where each trial consists of a pair of two observers, we studied various issues as to how the diversity between these two observers (and their individual perception systems) affects the performance of the combined system.
At the individual trial level, we illustrated that the rankscore characteristic (RSC) function graphs of the two scoring systems (perception systems) can provide a useful visualization tool on the similarity or dissimilarity between these two visual perception systems (see Fig. 2 and Sect. 2.2.3). At the population level, the diversity rank-score graphs on three common visual space definitions, M 0 , M 1 , and M 2 , respectively provide a powerful visualization comparison, not only among all (sixteen) trials in an experiment, but also among all (three) analytic methods based on M 0 , M 1 , and M 2 , respectively (see Fig. 5 and Sect. 2.3). Our current study suggests a few issues which are worthy of further investigation. We list three here: 1. With the diversity rank-score function defined in formula (4) and the diversity rank-score graphs based on M 0 , M 1 , and M 2 , extend the study to include higher order of M i , i = 4, 5, and so on (refer to formula 1). 2. Establish a CFA framework to study the combination of more than two visual perception systems. In this regard, the notion of diversity among more than two systems would have to be defined differently. 3. Apply the visualization tool illustrated in current work to combination of multiple sensing systems, multiple robotics systems, and multi-modal physiological imaging systems such as MRI, EEG, and EKG.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://crea tivecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The diversity rank-score function for combining human 71