1 Introduction

1.1 Inevitability of Cognitive Augmentation

There exists an “ecosystem” that will serve as a significant catalyst of change in the human-computer experience [12]. The impending change may be comparable to the impact of the World Wide Web during the tech boom of the 1990s. Consumers are adopting these systems now, and companies will soon follow suit.

  • 11 million Alexa devices had been sold as of January 2017 [1].

  • 1.5 billion smartphones have been sold with cognitive augmentation apps (Siri, Google, Cortana) [2].

  • Investment in AI technology was approximately $600 million in 2016 and is expected to reach $37.8 billion by 2025 [4].

  • SAP Ariba to use Watson AI with procurement data to produce “predictive insights” for supply chains [3].

The influx of AI will have organizational behavioral implications with regard to cognitive systems, particularly cognitive augmentation of human operators. Such implications can be measured with metrics that have yet to be established. These metrics should evaluate behavioral characteristics of human-cog and cog-cog interactions. Consequently, there is potential to apply those metrics to situations where effective personal cognitive augmentation is required.

1.2 Related Problems

“Brittleness” - When vocally interacting with a personal cognitive agent:

  • The device does not understand your phrasing.

  • The device misunderstands your intent.

  • The device cannot find an answer, when you know one exists.

  • The device offers an answer that is technically correct but lacks sufficient detail.

“Directive Contention” - When more than one device (on the same or different platforms) is in use:

  • Devices may answer differently or begin researching at different times.

  • The human operator cannot delegate question priority among the devices.

  • A device cannot extract the directives from a multi-part question that align with its responsibility domain.

Developers of personal cognitive augmentation agents (PCAs) and platforms such as IBM Watson internally measure the quality of interaction and of information transmitted from human operator to cog. One approach is to measure brittleness, an anomalous result caused by comprehension gaps between what the operator says and what the cog understands. Cog platform application programming interfaces (APIs) provide supervised machine learning mechanisms that establish continuity between what is spoken (the utterance) and what is intended (the intent), producing an utterance/intent relationship. However, the evaluation methodologies applied by these APIs are proprietary and differ between platforms. As such, there should exist publicly available, standardized evaluation practices that assess cognitive augmentation interactivity. This paper explores tools that can provide a foundation for such standards.
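
To make the utterance/intent relationship concrete, the following is a minimal sketch that scores an utterance against each intent’s training utterances using bag-of-words cosine similarity. This is a hypothetical stand-in for the proprietary scoring inside commercial cog platforms, not any vendor’s actual algorithm; all names and data are illustrative.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words vector for a lowercase, whitespace-tokenized utterance."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def score_intents(utterance, intents):
    """Return each intent's best similarity to the utterance (a toy UIR score)."""
    return {name: max(cosine(bow(utterance), bow(u)) for u in train)
            for name, train in intents.items()}

# Hypothetical training data for illustration only.
intents = {
    "getRequiredMaterials": ["Do I need a textbook for this course?",
                             "What materials does this class require?"],
    "getSchedule": ["When does this class meet?"],
}
print(score_intents("What book do I need for this class?", intents))
```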

1.3 Practical Contribution

With the emergence of big data analytics, it will be necessary to discriminate among numerous and varied potential answers to business questions. Cognitive augmentation would be a mechanism used to process the volumes of results offered by big data and similar platforms. Moreover, a business entity will need the ability to evaluate communication between its stakeholders, specifically when some stakeholders exist in the form of a cognitive system or agent. With a standardized set of metrics, managers may be able to evaluate communication between stakeholders in the enterprise as well as the efficacy of human/cog augmentation. Measuring utterance/intent relationships is a step toward realizing communication assessment in this domain.

2 Literature

This study assesses interrelationships between information theory, information science, representational information theory and human-robot interaction. Efforts are already under way in the field of human-robot interaction (HRI) [23]. Researchers continue to explore a practical symbiotic relationship between humans and computers.

2.1 Humans and Computers

The idea of artificial intelligence and human task support has been explored for decades. The early works of Newell, Engelbart and Licklider in the 1960s reveal a desire for human-computer symbiosis [20] and frameworks [10] that improve the efficiency of tasks performed by humans. Almost 60 years later, advancements in technology have strengthened the relationship between humans and computers, specifically by shifting mental processing capacity and physical tasks to machines. Weizenbaum’s ELIZA was able to converse in English on any topic [32]. Ted Shortliffe developed an expert system for medical diagnosis [27]. Hans Moravec developed an autonomous vehicle with collision avoidance in 1979 [22]. That same year, BKG, a backgammon program, defeated the world champion [6]. The Chinook program beat checkers world champion Tinsley in 1994 [5]. Google introduced a self-driving car in 2009 [29]. IBM’s Watson AI agent defeated Ken Jennings to become Jeopardy! champion [16].

2.2 A Cognitive Era Emerges

The work continues as a confluence of technologies enables the cog ecosystem. Dr. Ron Fulbright identifies six classes of technology working together to provide a backbone for large-scale interconnected cognitive entities [12].

  • Deep Learning: Multi-layered supervised machine learning algorithms utilizing convolutional neural networks [17]. Cognitive systems tap into deep learning algorithms to develop systems with human-expert-like performance [12].

  • Big Data: Almost limitless datasets of granular data derived from multiple sources [14].

  • Internet of Things: A global network of machines [18] producing ambient data without human input, evaluated by deep learning algorithms [12].

  • Open Source AI: ROS, an open-source Robot Operating System [21], is one of many open source projects that allow many developers to work on the same project from the comfort of their garages, basements, attics and pajamas.

  • NLI: Natural language interfaces are application libraries used to facilitate person-machine communication [15].

  • Connected Age: The adoption of smartphones, tablets, wearable technologies, Internet, Cloud services by millions of users globally provide a market for cog-enabled applications like Siri, Alexa and Cortana [12].

Tapping into this ecosystem are companies like IBM, Amazon, Google, Apple and Facebook. They are investing billions of dollars into artificial intelligence architectures [22].

2.3 Brittleness

Brittleness is an unstable system behavior brought on by data validation failures or degradation in some other foundational process [7]. The term was used to describe software subject to disruption during the transition to the year 2000 in the Y2K crisis of the late 1990s. Brittle system behavior is also a term applied to expert systems architecture [19]. To avoid brittleness in cognitive systems, I look to understand and measure utterance/intent relationships as a root cause of this phenomenon.

2.4 Utterance Intent Relationships

There is a relationship between an utterance and an intent. In the literature, phrasing is covered under a metric called situation-specific vocal register [9], more explicitly defined here as an utterance (Uh): an articulated utterance originating from a human operator (h). After accepting an utterance Uh, the PCA evaluates its phraseology with one or more cognitive system platform APIs (Cogx) for utterance/intent relationship (UIRPCA) quality. UIRPCA quality is defined by the degree to which Uh matches predefined intents (IPCA), and it is scored differently in each platform. Cogx engines typically apply natural language interface (NLI) logic to Uh, evaluating it against predefined IPCA linked to predefined training utterances (Utrain). Increased UIRPCA scores result in a better outcome for the operator/cog interaction. The IBM Watson API (CogIBM) applies a metric called weighted evidence scores (WES) to evaluate a confidence relationship between utterance and intent; in this scenario, the WES confidence score derived from CogIBM equates to UIRPCA. New Utrain examples are introduced to a set of objects that systematically train CogIBM. Machine learning categorizes and ranks the clauses/words in each phrase when applied to Natural Language Understanding (NLU) in CogIBM. Figure 1 describes an utterance path to response effectiveness. Methods in this paper will reveal a connection between the quality of an utterance and its influence on the UIR as it follows the path.

Fig. 1. Utterance path to response effectiveness. This paper addresses everything but response effectiveness (Rh); Rh will be part of a larger experiment in a future paper.
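
As a concrete illustration of the training mechanism described above, the sketch below registers training utterances against an intent and retrieves a confidence score. It assumes the ibm-watson Python SDK’s AssistantV1 interface, which postdates this work; the API key, service URL and workspace ID are placeholders, and the actual CogIBM calls used in this study may differ.

```python
# Minimal sketch, assuming the ibm-watson Python SDK (AssistantV1).
# API key, service URL and workspace ID are hypothetical placeholders.
from ibm_watson import AssistantV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

assistant = AssistantV1(version="2021-06-14",
                        authenticator=IAMAuthenticator("YOUR_API_KEY"))
assistant.set_service_url("https://api.us-south.assistant.watson.cloud.ibm.com")

WORKSPACE = "YOUR_WORKSPACE_ID"

# Register training utterances (Utrain) under one intent (IPCA).
assistant.create_intent(
    workspace_id=WORKSPACE,
    intent="getRequiredMaterials",
    examples=[{"text": "Do I need a textbook for this course?"},
              {"text": "What materials does this class require?"}])

# Submit an articulated utterance (Uh) and read the WES-style confidence,
# which stands in for UIRPCA in this paper.
result = assistant.message(
    workspace_id=WORKSPACE,
    input={"text": "What book do I need for this class?"}).get_result()
for intent in result["intents"]:
    print(intent["intent"], intent["confidence"])
```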

Figure 2 illustrates an architectural view of utterance/intent modeling. The model allows for interoperability between heterogeneous Cogx workspaces. Any PCAx may tie its skill, action or bot to any combination of Cogx platforms. A platform typically uses JavaScript Object Notation (JSON) to manage the data structures that make up an interaction model (UIR model) for evaluating incoming Uh. The agent parses Uh and compares the result with a specific intent domain’s Utrain. Intent domains build context around entities, or slots (E). Each E can have synonyms (S) applied to it. Synonyms aid in fine-tuning UIRs so they stand apart from other very similar UIRs. Consider the following example.

Fig. 2. UIR modeling architecture.

$$ U_{h} = \text{``What book do I need for this class?''} $$
(1)
$$ U_{train} = \text{``Do I need a \{material\} for this \{course\}?''} $$
(2)
$$ I_{PCA} = \{\text{getRequiredMaterials}\} $$
(3)
$$ I_{PCA}.E.\{\text{type}\} = \{\text{material, course}\} $$
(4)
$$ I_{PCA}.E.\{\text{material}\} = \{\text{headphones, textbook, notebook, tablet}\} $$
(5)
$$ I_{PCA}.S.\{\text{textbook}\} = \{\text{book, publication, ISBN number}\} $$
(6)

The path to the intent is as follows:

$$ U_{h}.\{\text{book}\} \to S.\{\text{textbook}\} \to E.\{\text{material}\} \to U_{train}.\{\text{material}\} \to I_{PCA}.\{\text{getRequiredMaterials}\} $$
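
A minimal sketch of this resolution path follows; the dictionary contents and helper names are my own invention for illustration, not taken from any particular Cogx platform.

```python
# Hypothetical interaction model mirroring Eqs. (1)-(6).
SYNONYMS = {"book": "textbook", "publication": "textbook"}
ENTITIES = {"textbook": "material", "headphones": "material",
            "notebook": "material", "tablet": "material"}
INTENTS = {"material": "getRequiredMaterials"}  # entity type -> intent

def resolve_intent(utterance):
    """Walk Uh -> S -> E -> Utrain slot -> IPCA for each keyword."""
    for word in utterance.rstrip("?").lower().split():
        lemma = SYNONYMS.get(word, word)   # Uh.{book} -> S.{textbook}
        entity = ENTITIES.get(lemma)       # S.{textbook} -> E.{material}
        if entity in INTENTS:              # slot match -> IPCA
            return INTENTS[entity]
    return None

print(resolve_intent("What book do I need for this class?"))
# -> getRequiredMaterials
```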

3 Methods

3.1 Research Question

The following research question establishes a two-part goal:

  • Determine a set of measures (potentially metrics) to evaluate brittleness quantitatively.

  • Evaluate the brittleness effect based on application of the measures from goal 1. A strong relationship between operator utterances and training utterances implies a strong utterance/intent relationship (UIR). Strong utterance/intent relationships should lead to an improved response from the PCA, thereby reducing the brittleness effect. Future research will address phrasing quality and response quality from a human operator’s perspective.

RQ1: How can brittleness be measured and reduced in personal cognitive agents?

3.2 Hypothesis

I evaluate three hypotheses in this paper, linking the quality of training utterances (QUtrain) to an improved UIR score while applying a static set of articulated utterances (Uh). Furthermore, I assess training utterance quality by calculating cognitive value with an assessment algorithm called CogMetrix.

  • H1: As the number of unqualified training utterances \( |\vec{U}_{train}| \) in a set increases, UIRPCA will also increase, thereby improving the confidence scores. H0: adj. R2 < .8 and p-value > .05.

  • H2: As the quality of training utterances (QUtrain|Utrain) in a set increases, UIRPCA will also increase, thereby improving the confidence scores. H0: \( \Upsilon(\mathrm{UIR}(QU_{train}) \mid \mathrm{UIR}(U_{train})) < \tau_{UIR} = -.2 \).

  • H3: As the number of qualified training utterances \( |\overrightarrow{QU}_{train}| \) in a set increases, UIRPCA will also increase, thereby improving the confidence scores. H0: adj. R2 < .8 and p-value > .05.

Variables used in the preceding hypotheses are included in Table 1.

Table 1. Variables

3.3 Cognitive Value

Cognitive value, or cognitive gain, is an emergent measure developed by Fulbright that utilizes representational information theory [13]. He builds on Vigo’s theory quantifying the structural complexity of information [32]. Structural complexity in turn serves as the foundation for a key component of cognitive value (\( \hbar \)). \( \hbar \) identifies the amount of informative value an object offers to its representational concept. As it relates to this paper, the representational concept is the intent (IPCA). As training utterances are collected, the relative effect on conceptual understanding trends in either a positive or negative direction. Any utterance that positively complements concept understanding is included in a subset of qualified utterances. As such, an optimization effect will emerge, offering a set of well-defined utterances that yields a best case for an efficient rule-based machine learning process. It is necessary to evaluate a master set of unqualified utterance candidates because multiple intents may exist in a Cogx application. This set is defined as a universe of unqualified utterances.

I apply cognitive value (\( \hbar \)) as a quality measure compared against a discrimination threshold τUtrain used to determine qualified utterances (QUtrain). The value of τUtrain is arbitrary and set to 1. When cognitive value is less than τUtrain, the training utterance is included in a new set of qualified training utterances. Cognitive value assesses change in structural complexity between attribute values in a set of objects called categorical stimuli. While there are many potential attributes one could use to evaluate speech in natural language understanding, I chose three for this exemplar: parts-of-speech model (POSModel), dominant entity and statement type. See an example JSON object for a set of utterances in Fig. 3.

Fig. 3. JSON object with training utterances.
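
Figure 3 is not reproduced here; the following is a hypothetical reconstruction of such a JSON object, expressed as Python data, based on the three attributes named above. The field names (including UtterancePOSModel) follow the text, but the exact schema is an assumption.

```python
# Hypothetical reconstruction of the Fig. 3 training-utterance object.
# Attribute names follow the text; the exact schema is an assumption.
training_utterances = [
    {
        "text": "Do you require a charger for this class?",
        "UtterancePOSModel": "VBP PRP VB DT NN IN DT NN .",  # approximate tags
        "dominantEntity": "material",
        "statementType": 3,  # interrogative
    },
    {
        "text": "Do I need a textbook for this course?",
        "UtterancePOSModel": "VBP PRP VB DT NN IN DT NN .",
        "dominantEntity": "material",
        "statementType": 3,
    },
]
```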

The POSModel is a string of tags produced by the Stanford University POS tagging utility [33]. See the example attribute called UtterancePOSModel in Fig. 3.
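
The paper uses the Stanford tagger; as a stand-in, the sketch below derives an equivalent POSModel string with NLTK’s default tagger, which emits the same Penn Treebank tag set.

```python
# Sketch: derive a POSModel string from an utterance.
# Uses NLTK's default tagger as a stand-in for the Stanford POS tagger;
# both emit Penn Treebank tags. Requires nltk.download("punkt") and
# nltk.download("averaged_perceptron_tagger").
import nltk

def pos_model(utterance):
    tokens = nltk.word_tokenize(utterance)
    return " ".join(tag for _, tag in nltk.pos_tag(tokens))

print(pos_model("Do I need a textbook for this course?"))
# e.g. "VBP PRP VB DT NN IN DT NN ."
```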

Furthermore, I extract a dominant entity based on keywords in the phrase. Dominant entities ultimately lead to intent resolution. The application compares keywords against an entity dictionary, part of the interaction model common to Cogx applications. If a keyword is present in the dictionary, its entity lemma is returned and assessed for fitness to be assigned the dominant entity attribute. Assessing lemma fitness as a dominant entity goes beyond the scope of this paper and will be included in future research. A sample dictionary can be found in Appendix 1.

Statement type is one of four possible values: declarative (1), imperative (2), interrogative (3) or exclamatory (4).

Next, I calculate structural complexity using Vigo’s Generalized Invariance Structure Theory (GIST) algorithm. GIST is an invariance extraction mechanism applied to a set of categorical stimuli in a concept [32]. Invariance is a measure of similarity among attribute values of categorical stimuli. Structural complexity is established by determining the amount of invariance in a set of objects. In this exemplar, the categorical stimuli are the training utterances. The dimensions within the categorical stimuli are the POSModel, statement type and dominant entity. Examples of attribute values can be found in Fig. 3. The GIST algorithm itself goes beyond the scope of this paper, but I will include a generalized abstraction. The structural complexity equation is given in Eq. (7), where \( p \) is the number of objects/utterances in the set and \( v \) is the amount of similarity/invariance of values in an object’s dimension.

$$ \psi(\vec{F}) = p\,e^{-\sqrt{\left(\frac{v_{1}}{p}\right)^{2} + \left(\frac{v_{2}}{p}\right)^{2} + \cdots + \left(\frac{v_{D}}{p}\right)^{2}}} $$
(7)

GIST calculates a Euclidean distance between the values of the free dimensions of an object after removing one bound dimension. Similar objects are adopted by comparing the distances to a discrimination threshold \( \tau_{d} = 0 \), where \( d \) is the bound (removed) dimension. Object dimension value distances are measured with a similarity function \( e^{-\Delta^{r}_{[d]}(\vec{obj_{i}}, \vec{obj_{j}})} \). A distance of 0 returns 1 when applied to the similarity function: \( e^{-1 \cdot 0} = 1 \). I sum the 1s and divide the result by the total number of objects \( (|\vec{F}| = p) \). The process yields an invariance measure per dimension, whose values are plugged into the structural complexity equation in Eq. (7).
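
A minimal sketch of this abstraction follows. It counts, per dimension, the objects whose remaining attribute values exactly match another object when that dimension is suppressed; the exact-equality matching rule is my simplifying reading of the abstraction above, not Vigo’s full GIST algorithm.

```python
import math

def invariance(objects, d):
    """v_d: count of objects that match some other object on all
    dimensions except the bound (removed) dimension d."""
    count = 0
    for i, a in enumerate(objects):
        reduced_a = a[:d] + a[d + 1:]
        if any(reduced_a == b[:d] + b[d + 1:]
               for j, b in enumerate(objects) if j != i):
            count += 1  # zero distance -> e^0 = 1
    return count

def structural_complexity(objects):
    """psi(F) = p * exp(-sqrt(sum((v_d / p)^2))), per Eq. (7)."""
    p = len(objects)
    dims = len(objects[0])
    return p * math.exp(-math.sqrt(sum((invariance(objects, d) / p) ** 2
                                       for d in range(dims))))

# Each utterance as (POSModel, statementType, dominantEntity); toy values.
F = [("VBP PRP VB DT NN IN DT NN .", 3, "material"),
     ("VBP PRP VB DT NN IN DT NN .", 3, "material"),
     ("WP NN VBP PRP VB IN DT NN .", 3, "material")]
print(structural_complexity(F))
```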

Consider the following concepts \( \vec{F} \) and \( \vec{G} \). \( \mathbf{R} \) is an element of set \( \vec{F} \), and \( \vec{G} \) is the subset of \( \vec{F} \) without \( \mathbf{R} \). I use Eq. (7) to calculate structural complexity for both sets.

$$ \vec{F} = \vec{U}_{train} = \{\text{Master set of unqualified training utterances listed in Appendix 2}\} $$
(8)
$$ \mathbf{R} = \vec{U}_{train}(1) = \{\text{``Do you require a charger for this class?''}\} $$
(9)
$$ \vec{G} = \vec{F} - \mathbf{R} $$
(10)
$$ \psi(\vec{F}) = 1.558 $$
(11)
$$ \psi(\vec{G}) = 1.773 $$
(12)
$$ \tau_{Utrain} = 1 $$
(13)

Next, I calculate the structural complexity of \( \vec{G} \) as it relates to \( \vec{F} \) and assess the outcome of \( \mathbf{R} \) for its fitness as a qualified training utterance.

$$ \overrightarrow{QU}_{train}(\mathbf{R}) = \hbar(\mathbf{R} \mid \vec{F}) < \tau_{Utrain} $$
(14)
$$ \hbar(\mathbf{R} \mid \vec{F}) = \frac{\psi(\vec{G}) - \psi(\vec{F})}{\psi(\vec{F})} $$
(15)
$$ \frac{\psi(\vec{G}) - \psi(\vec{F})}{\psi(\vec{F})} = \frac{1.773 - 1.558}{1.558} $$
(16)
$$ \hbar_{1} = 0.138 $$
(17)
$$ 0.138 < 1 $$
(18)
$$ \mathbf{R}\ \text{is adopted and added to}\ \overrightarrow{QU}_{train} $$
(19)
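
The worked example in Eqs. (15)-(18) is easy to verify numerically; a small check (names mine) applying the same test:

```python
# Numeric check of Eqs. (15)-(18); psi values taken from Eqs. (11)-(12).
psi_F, psi_G = 1.558, 1.773
hbar = (psi_G - psi_F) / psi_F   # Eq. (15): relative change in complexity
print(round(hbar, 3))            # 0.138, matching Eq. (17)
print(hbar < 1)                  # True -> R is adopted into QU_train
```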

A selection table is found in Table 3.

3.4 Applications

I wrote two applications to evaluate UIR:

  • WatsonAskSirDexConversationAPI – connector between CogMetrix and CogIBM

  • CogMetrix – application of the Cognitive Agreement algorithm

3.5 Procedure

I capture the change in UIRPCA with respect to both unqualified and qualified training utterances by applying a static set of articulated utterances, as text, to CogIBM.

First, I define a set of twenty (|Uh|) random articulated utterances \( (\vec{U}_{h}) \), found in Table 3, followed by a random set of thirty-eight (|Utrain|) unqualified training utterances \( (\vec{U}_{train}) \), found in Appendix 2. I add Utrain examples to CogIBM in stepwise fashion until I reach |Utrain|. At each step, I apply all \( \vec{U}_{h} \) to CogIBM and record the results.

Next, I build a subset of \( \vec{U}_{train} \) called \( \overrightarrow{QU}_{train} \) by calculating \( \hbar \) with CogMetrix for each Utrain. CogMetrix tests each candidate for \( \hbar \); an element is discarded when \( \hbar \geq \tau_{Utrain} = 1 \), leaving the final set of qualified training utterances found in Appendix 3.
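
In CogMetrix terms, the selection step amounts to a leave-one-out filter over the master set; the sketch below is my rendering of that logic, not CogMetrix source code. The `psi` argument is any structural-complexity function, such as the Eq. (7) sketch earlier.

```python
def qualify(utterances, psi, tau=1.0):
    """Keep each utterance R whose cognitive value hbar(R | F) is below tau."""
    psi_F = psi(utterances)                        # complexity of master set F
    qualified = []
    for r in utterances:
        G = [u for u in utterances if u is not r]  # G = F - R, Eq. (10)
        hbar = (psi(G) - psi_F) / psi_F            # Eq. (15)
        if hbar < tau:                             # Eq. (14)
            qualified.append(r)
    return qualified
```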

Having created the set of qualified training utterances, I can assess the quality impact on UIRPCA by first replacing all Utrain with QUtrain in CogIBM. I then apply all \( \vec{U}_{h} \) to CogIBM in stepwise fashion and record the UIRPCA results for each step.

Finally, I compare the results of applying \( \vec{U}_{h} \) to both \( \vec{U}_{train} \) and \( \overrightarrow{QU}_{train} \) and assess the direction of change in UIRPCA with respect to each set, to satisfy H1 and H3 respectively. The desired ANOVA R2 ≥ .8 and F-test p-value < .05 should indicate a relative degree of confidence in UIRPCA trends. A rejection of H0: \( \Upsilon(\mathrm{UIR}_{PCA}(QU_{train}) \mid \mathrm{UIR}(U_{train})) < \tau_{UIR} = -.2 \) should show a positive quality outcome for UIRPCA.
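
As a sketch of the H1/H3 trend test, a simple linear fit over the recorded steps is shown below. The data values are placeholders, the fit uses plain R2 rather than the adjusted R2 the paper reports, and the paper’s actual analysis used ANOVA fit plots.

```python
# Sketch of the H1/H3 trend test: regress average UIR/WES change on the
# number of training utterances. Data values are placeholders.
from scipy import stats

n_utterances = [5, 10, 15, 20, 25, 30, 35, 38]           # |U_train| per step
avg_uir = [.42, .48, .51, .55, .61, .63, .68, .71]        # placeholder scores

fit = stats.linregress(n_utterances, avg_uir)
r_squared = fit.rvalue ** 2

# Reject H0 (adj. R^2 < .8 and p > .05) only if both criteria are met.
print(f"R^2 = {r_squared:.3f}, p = {fit.pvalue:.4f}")
print("H0 rejected" if r_squared >= .8 and fit.pvalue < .05 else "inconclusive")
```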

4 Results and Discussion

Results are inconclusive for the test of H1. There is a low degree of confidence in a positive direction of UIRPCA with respect to Utrain, despite a p-value < .05. Increasing the number of unqualified random training utterances does not seem to fully explain the change in UIRPCA. Figure 4 shows the average change in WES/UIR scores; Table 2 presents the data.

Fig. 4. Fit plot for average change in UIR/WES as it relates to \( |\vec{U}_{train}| \).

Table 2. Data for average change in UIR/WES as it relates to \( |\vec{U}_{train}| \).

Conversely, results are better for H2. When testing the change in UIRPCA for each Uh, more intent resolution instances occur with fewer, targeted training utterances. Table 3 shows this behavior.

Table 3. Data for average change in UIR/WES as it relates to \( |\overrightarrow{QU}_{train}| \). The tolerance level for the change is .2.

Finally, I satisfy H3, indicating an upward trend in UIRPCA with an R2 value of .93 and a p-value < .05, concluding that adding more targeted, quality training utterances does explain the change in UIRPCA. Figure 5 and Table 4 illustrate the result.

Fig. 5. Fit plot for average change in UIR/WES as it relates to \( |\overrightarrow{QU}_{train}| \).

Table 4. Data for average change in UIR/WES as it relates to \( |\overrightarrow{QU}_{train}| \).

5 Final Thoughts and Future Research

There are clear opportunities to improve outcomes using representational information theory (RIT) as a mechanism to assess fitness between training utterances and intent resolution. A more rigorous process for selecting utterance attributes should bolster results for all three hypotheses. Expanding the study to include human participants using mixed-method instruments will add a favorable degree of randomness missing from the method applied in this exercise. Finally, I would assess an entire Cogx application that employs more than two intents.

Measuring utterance-intent relationships should improve rule-based machine learning algorithms used to prepare Cogx applications. As such, employing UIR discrimination should mitigate interaction brittleness in personal cognitive augmentation.