
1 Introduction

In the context of Augmented Cognition, this work is intended to show how specially designed information theoretic and visualization tools can provide insight into complex data sets, where the information binding is unknown and the relative information content is low. The base empirical problem selected to demonstrate the tools is the prediction of community crime rates using gross sociographic factors. As such, communities will be the example used in the background.

1.1 First Data Set

Two different data sets were used. The first was a combination of two separate data sets: the Part I Uniform Crime Reports (UCR) and the Healthy People 2010 (HP) data set. The UCR are compiled annually by the U.S. Federal Bureau of Investigation (FBI) from crime reports submitted by over 18,000 law enforcement agencies nationwide; Part I includes reports of criminal homicide, forcible rape, robbery, aggravated assault, theft, burglary, motor vehicle theft, and arson. For this experiment, crime reports from county law enforcement agencies were used. The HP data set is collected under the Centers for Disease Control and Prevention (CDC) as part of the U.S. Department of Health and Human Services’ Healthy People agenda. It comprises health information from over 3,000 counties across the United States, including demographics, obesity rates, birth and death rates, preventative health care access, mental health measures, number of physicians, and a plethora of other health related information. The UCR and HP data sets were combined to create a data set with features describing community demographics, health, and crime. The Feature Analysis Engine (FAE, described in Sect. 3) was used to predict crime using the demographic and health information.

1.2 Second Data Set

The second data set was the Communities and Crime (CC) data set created by Redmond (2009) by combining socioeconomic and law enforcement data from the 1990 U.S. Census and the 1990 Law Enforcement Management and Administrative Survey, respectively. The CC data set is available on the University of California Irvine (UCI) Machine Learning Repository website. It includes community features such as the percent of the population considered urban, family income, percent of two-parent families, unemployment rates, racial and ethnic distributions, number of illegal immigrants, number of police officers, and many more. The CC data set also contains per capita violent crime rates calculated using population and counts of violent crimes in the United States (murder, rape, robbery, and assault). As with the first data set, the FAE was used to predict violent crime using the community features. The tables and results reported in this paper are derived from the second data set, which was determined to contain more latent information for this study.

2 Background

2.1 Features and Feature Selection

Features are measurable attributes of entities of interest. For example, a community may be an entity of interest, with features such as crime rate, the percent of the population that is unemployed or that has a high school diploma, the number of primary care physicians, and many others. Features are used to detect, characterize, and classify these entities: for example, a community with a crime rate above 1,000 per 100,000 population may be classified as “high crime”.

Additionally, features are data elements that can be nominal or numeric. For instance, a community may be given the nominal measure of “highly educated” or have the numeric measure of a crime rate of 200 crimes per 100,000 persons. Nominal features are often coded as numeric values so that analytic methods can be applied to quantify and transform them. Thus, “poorly educated” might be coded as 1, “moderately educated” might be coded as 2, and “highly educated” might be 3.
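As a small illustration, such a coding scheme might be implemented as follows; the category names and integer codes are the hypothetical ones from the text:

```python
# Hypothetical coding of a nominal education feature as ordinal integers,
# using the example categories from the text.
EDUCATION_CODES = {"poorly educated": 1, "moderately educated": 2, "highly educated": 3}

community = {"education": "highly educated", "crime_rate_per_100k": 200}

# Replace the nominal value with its numeric code so analytic methods apply.
coded = dict(community, education=EDUCATION_CODES[community["education"]])
# coded["education"] is now the numeric code 3
```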

As in the case above of the community, it is often the case that entities of interest will have multiple, sometimes many, features. This means that data modelers must determine which of these features to use, in what combinations, to create the “best” models for a given application. Indeed, whether for legal reasons, grant requirements, or organizational functioning, there is a wealth of information about community features that government, nonprofit, research, and other agencies collect with regard to communities. Some of this information might be helpful in solving problems, while other information might be superfluous or even harmful.

Not all features are created equal. Features can be thought of as symbolic evidence for or against some conclusion about the corresponding entities. Some features are more salient than others, that is, they provide more readily usable and reliable information for addressing the application at hand. Some features are “toxic” in the sense that including them in a feature set might mislead or dilute the information content of otherwise salient features. It is important, then, to determine which features, used together as a suite, provide the “best” view of the problem space.

For a simple example, consider policy makers in a community which has been dealing with an alarming number of drownings. The policy makers might see that their community’s ice cream consumption is highly correlated with drowning deaths. Subsequently, this information might lead them to make policies aimed at reducing ice cream sales when, in fact, ice cream consumption is a misleading or “toxic” feature for the problem of reducing drownings. A tool that can identify “season” or “temperature” as highly salient and “ice cream consumption” as toxic would lead to better policy decisions, such as hiring more lifeguards in the summer.

Features can be selected using statistical, probabilistic, analytic, and information theoretic methods. While various methods are available for selecting features, the authors will not review them here. The purpose of this paper is to consider the special characteristics of an information theoretic sampling methodology created to automate feature extraction from large feature sets. First, though, it is important to understand the two major reasons that feature extraction is difficult: (1) the enigmatic nature of the evidentiary power of features and (2) the complexity of feature selection.

The Enigmatic Nature of the Evidentiary Power of Features.

While the toxicity of ice cream consumption as a feature might be obvious in the drowning example above, for many problems, the usefulness of the features is not so clear. In fact, the evidentiary power of features is rarely obvious; in such cases, special methods are required to quantify them. The next section explains why automation is essential to effective feature selection.

The Complexity of Feature Selection.

Given N features, there are \( 2^{N} \) possible feature sets, making feature selection exponentially complex. This means a data set with a mere 10 features would have over 1,000 possible feature sets, while 20 features would result in over 1 million. Data sets in the real world frequently have far more features than this.
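This growth can be checked directly; a quick sketch in plain Python (the feature names are illustrative only):

```python
from itertools import combinations

def num_feature_sets(n):
    """Number of possible feature subsets of an n-feature data set
    (2**n, counting the empty set)."""
    return 2 ** n

# 10 features -> just over 1,000 subsets; 20 features -> over 1 million.
assert num_feature_sets(10) == 1024
assert num_feature_sets(20) == 1_048_576

# Explicit enumeration is only feasible for tiny n:
features = ["unemployment", "income", "education"]
all_sets = [list(c) for k in range(len(features) + 1)
            for c in combinations(features, k)]
assert len(all_sets) == 2 ** len(features)  # 8 subsets for 3 features
```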

To further illustrate this complexity, one can compare two problems. First, select the best feature set for a classification problem, where: (1) the entire data set consists of 250 features for each object and (2) using a super-computer, one can evaluate 1 trillion feature sets per second. The goal is to determine the best subset of features by checking all subsets. Second, wear down an Earth-sized sphere of solid iron, where once every 10 thousand years a tiny moth flies by and brushes the globe with its wing, knocking off a single atom. The goal is to wear away the entire Earth-sized iron sphere. The moth will have completely worn away eight of the Earth-sized spheres centuries before all the feature sets have been evaluated. Clearly, complex problem spaces can create an overwhelming number of solution sets; an efficient way of selecting the best feature set in a limited amount of time is necessary.

3 Method

3.1 Feature Analysis Engine

The authors have developed a Feature Analysis Engine (FAE) that compares the information content of features, in context, for a specific classification task. The strategy is similar to forming a sampling distribution for feature performance. The FAE uses uniform random samples of features to inform a classifier. When the classifier for a particular random selection of features is run, its performance is added to the bin for each feature in that sample. “Good” features will be those that have participated in many good outcomes when they are present, and lower quality outcomes when they are absent. In this way, over many feature suite samples, the feature selector indicates the average contribution each feature makes to the performance of random feature suites of which it is a member.

3.2 The Feature Evaluation Algorithm

Let the data set consist of L feature vectors, each having N features (i.e., N coordinates per vector). The analysis consists of sampling trials, with three initial steps:

  • Select a proportion of vectors to be used for training, with the remainder to constitute the blind set.

  • Determine a maximum sample size, \( 1 \le M \le N \), for the number of features to select on a trial.

  • Segment the set of feature vectors into a training set and a blind set, having the same proportion of every ground truth class in both.
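The stratified split in the last step can be sketched as follows; this is a minimal illustration in plain Python (the paper does not specify an implementation):

```python
import random
from collections import defaultdict

def stratified_split(labels, train_proportion, seed=0):
    """Split example indices into a training set and a blind set while
    preserving the proportion of each ground truth class in both."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train, blind = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        cut = int(round(train_proportion * len(idxs)))
        train.extend(idxs[:cut])
        blind.extend(idxs[cut:])
    return train, blind

labels = ["low"] * 8 + ["high"] * 4
train, blind = stratified_split(labels, 0.75)
# 6 of 8 "low" and 3 of 4 "high" land in training: both sets keep the 2:1 mix
```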

Each trial proceeds as follows:

  1. Uniform-randomly select from 1 to M features for use on this trial. (Note: since the sample size is uniform on 1 to M, any particular feature is selected for a trial with probability (M + 1)/(2N).)

  2. Train the classifier on the training set to recognize the ground truth assignment of a vector having the selected features.

  3. Apply the trained classifier to the blind set (which also uses only the selected features).

  4. Add the performance of the classifier on the blind set into the corresponding histogram bin for each feature in the sample selected for this trial.

Over many trials, this procedure empirically computes the expected value of the performance of a classifier for each feature, when used as part of randomly selected feature suites of various sizes.
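The trial loop above can be sketched as follows. This is a minimal illustration in plain Python: the nearest-centroid classifier is a stand-in (the paper does not fix a particular classifier), and only blind accuracy is binned here, whereas the FAE also bins Precision and Recall:

```python
import random
from collections import defaultdict

def centroid_classifier(train_rows, train_labels, cols):
    """Toy nearest-centroid classifier trained on the selected feature
    columns. (Stand-in for whatever classifier the FAE wraps.)"""
    sums = defaultdict(lambda: [0.0] * len(cols))
    counts = defaultdict(int)
    for row, y in zip(train_rows, train_labels):
        counts[y] += 1
        for k, c in enumerate(cols):
            sums[y][k] += row[c]
    centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}
    def predict(row):
        return min(centroids, key=lambda y: sum(
            (row[c] - m) ** 2 for c, m in zip(cols, centroids[y])))
    return predict

def fae_trials(rows, labels, train_idx, blind_idx, n_features, max_m, trials, seed=0):
    """Accumulate blind-set accuracy into a histogram bin for every
    feature that participated in each randomly sampled feature suite."""
    rng = random.Random(seed)
    bins = defaultdict(list)                      # feature index -> accuracies
    for _ in range(trials):
        m = rng.randint(1, max_m)                 # step 1: suite size
        cols = rng.sample(range(n_features), m)   # step 1: suite members
        predict = centroid_classifier(            # step 2: train
            [rows[i] for i in train_idx], [labels[i] for i in train_idx], cols)
        hits = sum(predict(rows[i]) == labels[i] for i in blind_idx)
        for c in cols:                            # steps 3-4: bin blind accuracy
            bins[c].append(hits / len(blind_idx))
    return {c: sum(v) / len(v) for c, v in bins.items()}

# Tiny synthetic set: feature 0 separates the classes, feature 1 is noise.
rows = [(0.10, 0.9), (0.20, 0.1), (0.00, 0.5), (0.15, 0.3),
        (0.90, 0.8), (1.00, 0.2), (0.85, 0.6), (0.95, 0.4)]
labels = ["a"] * 4 + ["b"] * 4
avg_acc = fae_trials(rows, labels, train_idx=[0, 1, 4, 5],
                     blind_idx=[2, 3, 6, 7], n_features=2, max_m=2, trials=200)
```

Each bin average estimates the expected blind performance of random suites containing that feature; here the informative feature 0 accumulates a higher average than the noise feature 1.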

In the work described in this paper, three performance measures were aggregated for each feature: the proportion of the blind vectors correctly classified (range [0, 1]); the geometric mean of the class Precisions (range [0, 1]); and the geometric mean of the class Recalls (range [0, 1]). Precision and Recall are the components of the important two-class statistical decision metric, the F-Measure. The corresponding performances for the “feature absent” cases are computed in the same way, using all the bins but leaving out each feature in turn.
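The per-class measures can be computed from a blind-set confusion matrix as follows; a minimal sketch, assuming rows are true classes and columns are predicted classes (the counts are invented for illustration):

```python
import math

def class_precision_recall(confusion):
    """Per-class Precision and Recall from a square confusion matrix
    (rows = true class, columns = predicted class)."""
    n = len(confusion)
    col_tot = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    row_tot = [sum(confusion[r]) for r in range(n)]
    precision = [confusion[c][c] / col_tot[c] if col_tot[c] else 0.0 for c in range(n)]
    recall = [confusion[r][r] / row_tot[r] if row_tot[r] else 0.0 for r in range(n)]
    return precision, recall

def geo_mean(xs):
    return math.prod(xs) ** (1.0 / len(xs))

cm = [[40, 10],   # true "Lower":  40 classified correctly, 10 not
      [5, 45]]    # true "Higher":  5 classified incorrectly, 45 correctly
precision, recall = class_precision_recall(cm)
accuracy = (40 + 45) / 100                      # proportion correct: 0.85
gm_precision, gm_recall = geo_mean(precision), geo_mean(recall)
```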

From the “aggregate” Blind Accuracy, Precision, and Recall for each feature (as described above) is formed what we call the K-metric. According to this metric, the relative value of feature j in the dataset is given by:

$$ K(j) = (A\left( {j,1} \right) - A\left( {j,2} \right))(B\left( {j,1} \right) - B\left( {j,2} \right))(C\left( {j,1} \right) - C\left( {j,2} \right)) $$

where, for features j = 1, 2, 3, … N:

  • A(j, 1) = Aggregate Accuracy on blind set when feature j is used

  • A(j, 2) = Aggregate Accuracy on blind set when feature j is not used

  • B(j, 1) = Aggregate Precision on blind set when feature j is used

  • B(j, 2) = Aggregate Precision on blind set when feature j is not used

  • C(j, 1) = Aggregate Recall on blind set when feature j is used

  • C(j, 2) = Aggregate Recall on blind set when feature j is not used
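Given these six aggregates, the K-metric is a three-factor product; a direct transcription (the numbers below are invented for illustration):

```python
def k_metric(acc_pair, prec_pair, rec_pair):
    """K(j) = (A(j,1) - A(j,2)) * (B(j,1) - B(j,2)) * (C(j,1) - C(j,2)),
    where each pair is (feature-present aggregate, feature-absent aggregate)."""
    return ((acc_pair[0] - acc_pair[1])
            * (prec_pair[0] - prec_pair[1])
            * (rec_pair[0] - rec_pair[1]))

# A feature whose presence lifts all three measures gets a positive K:
k_good = k_metric((0.85, 0.80), (0.82, 0.78), (0.84, 0.79))   # 0.05 * 0.04 * 0.05

# A "noise" feature can also yield a positive K when two of the three
# factors are negative, which is why, as the text notes, very small
# factors are floored at a minimum non-negative value:
k_noise = k_metric((0.80, 0.81), (0.79, 0.80), (0.83, 0.82))
```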

Notice that for “noise” features, one or more of the factors of K could be negative. This will happen in instances for which the classifier is not able to detect and ignore “noise” features; two negative factors could then combine to yield a spuriously positive K. This can be handled by setting very small factors in the expression for K to some minimum non-negative value. This situation is rare but does occur.

It was also seen that the convergence of the histogram bins is fairly stable, rapidly settling down after a few hundred trials (i.e., a “large sample”). Figure 1 presents two typical examples taken from a run of over 4,000 trials:

Fig. 1. Rapid convergence of two feature histograms to stable values in two typical runs

3.3 Prediction of Crime

The Feature Analysis Engine was applied to the data sets collected; the ground truth used was a measured crime rate value (annual violent crimes/population). Initially the ground truth values were quantized into 8 integer values, conceptually representing extremely low, very low, low, low nominal, high nominal, high, very high, and extremely high. Several experiments, statistical analyses, and visualizations indicated that the data did not support estimation at this quantization level; only a blind accuracy of 44% could be achieved. Figure 2 contains a confusion matrix showing that the transition from low crime rate to high crime rate spans several classes, making them impossible to distinguish in a manner that generalizes:

Fig. 2. A classification accuracy of 85% is achieved when there are only two quantization levels

For this reason, the ground truth was re-quantized into two classes: Lower and Higher. With two quantization levels, a classification accuracy of 85% was achieved (Fig. 2).
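This two-level re-quantization amounts to thresholding the measured rate; a minimal sketch (the threshold value is hypothetical, as the paper does not state the cut point used):

```python
def quantize_two_level(crime_rates, threshold):
    """Collapse a continuous per capita crime rate into the two ground
    truth classes, Lower and Higher. The threshold is a hypothetical
    choice for illustration only."""
    return ["Higher" if rate >= threshold else "Lower" for rate in crime_rates]

rates = [120, 480, 950, 1500, 2200]          # illustrative rates per 100,000
two_level = quantize_two_level(rates, 1000)
# -> ["Lower", "Lower", "Lower", "Higher", "Higher"]
```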

Using the two-level ground truth values, metric measurements were tabulated over 4,000 trials. The resulting K-metric histogram bin averages are listed in Table 1 and plotted in Fig. 3.

Fig. 3. The taller peaks indicate the most informative features.

As stated above, the base empirical problem selected to test the FAE is the prediction of local crime rates using gross sociographic factors. Crime prediction, of course, is a problem that has been studied for a long time and is of importance to lawmakers, schools, businesses, non-governmental organizations (NGOs), and individuals. Indeed, a wide variety of sociodemographic, health, and economic factors have been tied to crime and delinquency. For example, mental health issues (Lipsey et al. 2010), male gender (Bayer and Pozen 2005), age (Moffitt 1994), and minority race (Hartney and Vuong 2009; Reisig et al. 2007) have all been found to predict higher rates of offending. In addition, the provision of adequate healthcare services has been tied to lower levels of reoffending after release from confinement (Hancock 2017; Kim et al. 1997). Furthermore, urbanicity, region of the country, temperature, and social class have been tied to crime in numerous studies (Siegel and Worrall 2013). In addition, family variables have been identified as risk factors for delinquency: these include having parents or siblings with criminal histories, coming from single parent homes, inconsistent parental discipline, having siblings, not being first-born, having abusive or neglectful parents, and separation from parents (Shader 2001). Moreover, a wide variety of school (e.g., harsh disciplinary policies) and neighborhood (e.g., high poverty, high crime) variables have been noted as risk factors for delinquency (Mack et al. 2015; Shader 2001; Taylor and Fritsch 2015). See Table 3.

Clearly there is a plethora of features that have been used to predict crime, yet there is also disagreement about whether and how these factors are related to crime. Thus, it is evident that crime presents a complex problem space with less than obvious relationships between the various features and the crime rate. It is important for policy makers to focus on the right features in order to address the crime problem effectively and efficiently.

4 Experimental Results

4.1 Feature Ranking

Figure 4 presents two confusion matrices. The top one gives the results using all the features, while the bottom one gives the results using just the 10 “Best” features designated by the FAE.

Fig. 4. The top table is the confusion matrix for the classifier when all features are used. The bottom table is the confusion matrix when only the 10 best features, as determined by the FAE, are used. This demonstrates that little information is lost when features identified as less informative are removed.

Table 2 presents the features sorted in descending order of their K-metric.

Table 2. Features ranked by how much they contribute to blind classification Accuracy (%), Precision, and Recall.
Table 3. Correlation coefficients among the 10 most informative features (according to the FAE) and crime rate. Crime rate is in the bottom row (GT-Class); it has 7 levels, with 1 being the lowest crime rate and 7 the highest. Positive correlations in the bottom row therefore indicate factors that, when larger, correlate with higher crime rates; negative correlations in the bottom row indicate factors that, when larger, correlate with lower crime rates.

5 Conclusions

The three most informative predictors of crime are the percent of children with two parents, the percent of illegal immigrants, and the percent of families with two parents. Indeed, 5 of the 10 top features involve family structure. Thus, the FAE indicates that measures of family structure and stability are the most informative features in the set, superior to measures of educational attainment, economic and employment factors, age distributions, and racial and ethnic distributions. This information can be invaluable to policy makers, pointing to policies and programs aimed at strengthening families.

Beyond these specific implications, the results also suggest the benefit of using the FAE to inform the policy process. Indeed, the importance of family factors in the current study conforms to prior research and theory on crime and delinquency, underscoring the utility of the FAE for identifying important decision factors. Furthermore, the usefulness of the FAE could be extended to a broad range of important problems: expanding access to health care, improving community health, bolstering order maintenance in institutional settings, optimizing judicial and parole decision-making, facilitating the creation of case plans for troubled individuals and families, refining hospital efficiency, maintaining safety in schools, and many more. Identifying the most salient factors in these and other problems can improve response effectiveness, save money, and properly allocate resources to things that matter rather than to “toxic” factors.

6 Future Work

Given that this work suggests the relative superiority of family structure in the prediction of violent crime, future work will naturally evaluate the validity of this finding in other ways. This would include the use of data sets that express family structure in a variety of ways, and at different scales (e.g., multi-family neighborhoods). Further, the question of whether these factors are causal or merely correlative, and the precise characterization of any underlying mechanism(s), warrant investigation.

Another possible avenue for future work would be to approach the crime issue from another angle: rather than identifying factors that predict high crime rates, or "risk factors," the FAE could be used to identify factors that predict low crime rates, or "protective factors." Protective factors can lessen the impact of risk factors, effectively buffering individuals and communities from negative elements in their environment (Pollard et al. 1999). As with factors that predict crime, there is disagreement about what actually constitutes a salient protective factor: scholars tend to view these factors either as the absence of a risk factor (e.g., broken home versus unbroken home) or as factors that interact with risk factors (e.g., strong parental supervision mitigating poor attitudes toward rules) (Shader 2001). It may be that these lenses, and the debate surrounding these factors, have left important protective factors unidentified, as the focus is primarily on identified risk factors. The FAE may be a first step in identifying them; as such, policy makers can work to reduce the most salient risk factors while also trying to increase the presence of the most salient protective factors. Indeed, prior research has indicated that addressing both risk and protective factors is necessary for crime reduction to be effective (Pollard et al. 1999). Future research can also investigate the use of the FAE with problem spaces outside of crime and delinquency, as outlined in the conclusion above.