Introduction

Assessment of hearing sensitivity has become increasingly important in research in cognitive psychology and neuroscience. Individual differences in hearing sensitivity are known to affect language, cognitive, and academic outcomes in children (Kronenberger & Pisoni, 2020; McCreery & Walker, 2021; Tomblin et al., 2015), even in cases with very subtle differences in sensitivity from the typical range (McCreery et al., 2020; Moore et al., 2020). For adults, hearing loss has been identified as a risk factor for dementia (Lin et al., 2011; Livingston et al., 2020; Thomson et al., 2017), depression (Lawrence et al., 2020) and as a contributing factor to social isolation (Dawes et al., 2015; Maharani et al., 2019). Even when hearing sensitivity is not a primary research question, there is a need for researchers to document the hearing status of participants when studying related domains of language, cognition, and behavior (Füllgrabe, 2020). Diagnostic assessment of hearing often is expensive (Abu-Ghanem et al., 2016; Yousuf Hussein et al., 2016), as it requires specialized equipment, space, and personnel. Though some researchers have access to these clinical assessments, they are not widely available in many research settings. Barriers to documenting hearing sensitivity in research studies have led many scientists to rely on self-reported hearing difficulty. Self-reported hearing assessments have limitations for predicting the magnitude of hearing loss, have yet to be validated with children, and have poorer sensitivity and specificity than pure-tone screening methods (Clark et al., 1991; Louw et al., 2018; Nondahl et al., 1998; Tsimpida et al., 2020). The purpose of this study, as part of the Advancing Reliable Measurement in Alzheimer’s Disease and Cognitive Aging (ARMADA) study (Weintraub et al., 2021), was to evaluate the acoustic performance of and behavioral responses to a tablet-based measure of hearing sensitivity used with consumer-grade headphones. Compared to standard pure-tone audiometry completed in an audiology clinic, the tablet-based hearing assessment paired with consumer-grade headphones could address the need for more accessible hearing assessments in a broad range of research settings.

There are several types of hearing assessments that vary in specificity (screening, diagnostic) and methodology (self-report, pure-tone audiometry) of measurement. The audiological diagnosis of hearing loss relies, in part, on hearing assessment via pure-tone audiometry (Roeser, 2013). Pure-tone audiometry is a process that establishes absolute sensitivity for pure-tone signals across the range of frequencies important for understanding speech (ASHA, 2005; Brender et al., 2006). There are rigorous standards for room acoustics of the spaces where audiometric tests are conducted (Frank et al., 1993; Frank, 2000) and the specific equipment (e.g., diagnostic audiometers) and transducers needed to present the signals (ANSI S3.6-2018), all of which need precise calibration (ANSI/ASA S3.1-1999 R2018; Champlin & Letowski, 2014). In addition to equipment, specialized healthcare professionals are needed to conduct diagnostic audiologic assessments. Audiologists, who diagnose disorders of hearing and balance, have graduate- or doctoral-level training, are licensed to perform diagnostic audiometry, and typically work in medical, industrial, or educational settings (Martin & Clark, 2019). Because of these equipment and personnel requirements, pure-tone audiometry is often not feasible for research studies in neuro- and behavioral sciences outside of specialized hospitals or research centers (e.g., Amieva et al., 2015; Amieva et al., 2018; Vaccaro et al., 2019; Vassilaki et al., 2019; Zekveld et al., 2013).

The limited availability of pure-tone audiometry means that many scientists have relied on measures of self-reported hearing status when examining the effects of hearing loss on cognitive or behavioral outcomes or determining if a participant meets inclusion or exclusion criteria (Reilly et al., 2007). While self-reported hearing assessments are important for characterizing an individual’s subjective, functional hearing, particularly related to the impact of their disability (e.g., limitations in activity and participation; ASHA, n.d.; World Health Organization, 2001), they are not objective measures of hearing sensitivity and serve only as proxy measures for an individual’s impairment (e.g., physiological limitations; ASHA, n.d.; World Health Organization, 2001).

Measures of self-reported hearing loss have been shown to predict the presence or absence of hearing loss in healthy adults with reasonable accuracy (e.g., Clark et al., 1991; Nondahl et al., 1998; Salonen et al., 2011). However, the relationship between self-reported hearing status and the magnitude or degree of hearing loss for adults is less precise. In a study that pooled results from several large, public-health datasets for over 23,000 adults, Kiely et al. (2012) found that the accuracy of self-reported hearing difficulty for predicting at least a mild degree of hearing loss depended on the age of the participants. The authors found that self-report generally underestimated the presence of hearing loss for adults who were over 75 years of age. They reported self-reported hearing loss (as compared to pure-tone average of better hearing ear) had sensitivity that ranged from 62 to 85% and specificity that ranged from 41 to 85% across datasets. Likewise, accuracy of self-reported hearing loss varies across demographic characteristics like race, ethnicity, and listener sex (Kamil et al., 2015; Schnohr et al., 2019).

Several technological solutions have been developed recently to mitigate barriers to hearing assessment for researchers, using advances in computer-tablet technology. These solutions fall into two broad categories: hearing screening applications and diagnostic audiometry applications. Hearing screening applications assess pure-tone hearing thresholds and provide an overall pass or fail result for each ear based on established criteria for normal or near-normal hearing (Bright & Pallawela, 2016; Corona et al., 2020; Swanepoel et al., 2014). Like measures of self-reported hearing loss, hearing screening only provides an indication of presence or absence of hearing loss and does not quantify the magnitude of hearing loss, if present. Tablet- or smartphone-based diagnostic audiometric applications have been developed and validated to meet established diagnostic standards as an alternative to standard diagnostic pure-tone audiometry (Bright & Pallawela, 2016; Irace et al., 2021; Sandström et al., 2020; Szudek et al., 2012; Thompson et al., 2015; Yeung et al., 2015). For instance, Thompson et al. (2015) found that one diagnostic application provided hearing threshold estimates that were within 10 decibels (dB) of standard diagnostic pure-tone audiometry for adults. However, these commercial diagnostic applications currently cost over $5000 United States dollars and, similar to diagnostic pure-tone audiometry equipment, still require annual calibration by an expert. These costs may limit their utility for scientists who want an affordable way to obtain a valid and reliable estimate of hearing sensitivity for their research.

Until the last decade, scientists often faced similar cost and access barriers when completing standardized assessments of vision, cognition, language, and emotional functioning for their research. The National Institutes of Health Toolbox for the Assessment of Neurological and Behavioral Function, or NIH Toolbox®, was developed to address these barriers and provide researchers with access to a set of standardized tools that assess sensory and cognitive abilities (Gershon et al., 2010). The NIH Toolbox application (https://apps.apple.com/us/app/nih-toolbox/id1002228307) is available through the App Store of the Apple iPad (NIH & Northwestern University, 2021), and researchers can access these assessments with the purchase of a subscription through the app. The NIH Toolbox includes a validated assessment of speech recognition in noise using the “Words-In-Noise Test” (Zecker et al., 2013). Speech recognition in noise is an important predictor of auditory ability that often relates to communication difficulties faced by people with hearing loss in real-world environments (Eckert et al., 2017), but it does not provide an estimate of the presence or degree of hearing loss. The NIH Toolbox Hearing Threshold Test (HTT) was recently developed to provide an automated, cost-effective assessment of hearing sensitivity (air conduction thresholds) that can be conducted with a commercially available computer tablet and consumer-grade headphones (i.e., available commercially, more cost-effective than healthcare-grade equipment). This application helps fill the gap between self-report or screening of hearing loss and standard diagnostic audiologic assessment by an audiologist. While this tool has been previously described in the literature (Zecker et al., 2013), the HTT has not been validated for use in children and adults with a variety of hearing levels.

There are several methods of validation used to determine whether an assessment measures its intended construct. Criterion validity refers to a tool’s performance compared to the standard tool used to measure the intended construct. To demonstrate good criterion validity, the HTT should be compared to the standard diagnostic audiometry testing commonly used to diagnose hearing loss in a clinical setting. Tablet-based hearing assessments currently on the market have demonstrated 80.1–95.8% accuracy within 10 dB of a standard clinical audiometer (Saliba et al., 2017; Sandström et al., 2020; Thompson et al., 2015), suggesting precedent for the validity of tools similar to the HTT. If the HTT produces estimates of hearing sensitivity that approximate those of the standard pure-tone audiologic assessment, this would indicate the HTT could be used to provide ear-specific information about hearing sensitivity that could be used to account for hearing in behavioral and neuroscience research. In addition to examining validity of the HTT, it is important to examine whether accuracy differs systematically by listener characteristics. Studies of self-reported hearing loss suggest that the accuracy of a single-question measure of self-reported hearing loss differs by listener age for adolescents (Schnohr et al., 2019) and adults (Choi et al., 2016). Accuracy of self-reported hearing loss also differs by degree of hearing loss, with accuracy increasing as degree of hearing loss worsens (Feltner et al., 2021; Fredriksson et al., 2016). However, no studies of tablet-based hearing assessments have examined the impact of age and degree of hearing loss on the accuracy of these assessments relative to standard pure-tone audiometry.

Research on the accuracy of new assessments should address their sensitivity and specificity and their test–retest reliability. Sensitivity and specificity are often reported for self-report measures of hearing loss (sensitivity ranging from 62 to 85% and specificity ranging from 41 to 85% across studies) and other tablet-based measures (sensitivity ranging from 88 to 100% and specificity ranging from 58 to 96% across studies), suggesting varied ability of these tools to accurately identify individuals with hearing loss (Bright & Pallawela, 2016; Kiely et al., 2012; Saliba et al., 2017; Sandström et al., 2020). However, test–retest reliability is rarely reported in feasibility and validation studies of tablet-based hearing threshold assessments (Sandström et al., 2020). Demonstrating good test–retest reliability of the HTT would provide additional evidence of the utility of this measure for research purposes. The purpose of this study was to assess the criterion validity and reliability of the HTT for a sample of children and adults who had varying levels of hearing sensitivity to determine if this assessment could be used to estimate hearing levels for individuals who participate in behavioral research studies. The following research questions (RQs) were addressed:

  • RQ1a: How do thresholds measured via the HTT compare to thresholds measured via standard pure-tone air-conduction audiometry?

  • Hypothesis: The HTT will produce hearing thresholds that are highly correlated to thresholds measured via standard pure-tone audiometry.

  • RQ1b: Does validity of the HTT vary depending on the age or hearing level of the listener?

  • Hypothesis: There may be differences in accuracy by listener age or hearing level, but these differences will not impact the overall validity of the HTT.

  • RQ2: What is the test–retest reliability of the HTT?

  • Hypothesis: The HTT will have good test–retest reliability (Cronbach’s α > .8).

  • RQ3: What are the specificity and sensitivity of the HTT?

  • Hypothesis: We predict good specificity and sensitivity (i.e., greater than 90%).

Methods

Participants

Ninety participants were recruited using a variety of methods (i.e., research database, email campaign, outpatient audiology clinic). Participants included in this study met the following criteria: a) were 6 years of age or older and b) had hearing within normal limits (< 20 dB hearing level [HL] better-ear pure-tone average [BEPTA] at 500, 1000, 2000, and 4000 Hz; ANSI S3.6-2018 R-2018) or indicated mild-to-severe hearing loss (20–70 dB HL BEPTA) based on their clinical audiogram results (World Health Organization, 1991). Exclusion criteria included any otologic condition requiring medical management (e.g., otitis media, sudden or recent changes in hearing, vestibular complaints) or that required developmental modifications to the test protocol, including visual reinforcement or conditioned response per medical record and/or patient or caregiver report. Table 1 shows the age and hearing loss characteristics of the participants.

Table 1 Participant characteristics

The final sample included 27 children (mean age = 9.2 years, standard deviation = 2.2, range = 6–15) and 63 adults (mean age = 42.1 years, standard deviation = 17.4, range = 18–76). The child group included 23 children with normal hearing and 4 children with hearing loss (mean BEPTA = 33.8 dB HL). The adult group included 44 adults with normal hearing and 19 adults with hearing loss (mean BEPTA = 29.1 dB HL). All participants categorized in the hearing loss group had bilateral hearing loss. Six participants had mild hearing loss (i.e., PTA of 20–40 dB HL) in one ear, and normal hearing (i.e., PTA < 20 dB HL) in the other, but were categorized as having normal hearing due to their better hearing ear. Two participants in the hearing loss group had asymmetry between ears (PTA difference of > 15 dB). All participants consented or assented to participate, and parents of children completed informed consent. Participants were compensated $15/h for their participation. The Institutional Review Board of Boys Town National Research Hospital approved the study procedures prior to data collection.
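The grouping rule described above can be illustrated with a short sketch. It assumes thresholds are stored per ear as a mapping from frequency (Hz) to dB HL; the function names and the example listener are hypothetical and are not taken from the study's analysis code.

```python
from statistics import mean

# Frequencies (Hz) that enter the pure-tone average, as stated in the text.
FREQS_HZ = (500, 1000, 2000, 4000)


def pure_tone_average(thresholds_db_hl):
    """Mean threshold (dB HL) across the four test frequencies for one ear."""
    return mean(thresholds_db_hl[f] for f in FREQS_HZ)


def better_ear_pta(left, right):
    """Better-ear pure-tone average (BEPTA): the lower of the two ear PTAs."""
    return min(pure_tone_average(left), pure_tone_average(right))


def hearing_group(bepta, cutoff_db_hl=20):
    """Classify as normal hearing (BEPTA < 20 dB HL) or hearing loss."""
    return "normal hearing" if bepta < cutoff_db_hl else "hearing loss"


# Hypothetical listener with mild loss in the left ear only.
left = {500: 25, 1000: 30, 2000: 35, 4000: 40}    # PTA = 32.5 dB HL
right = {500: 10, 1000: 10, 2000: 15, 4000: 20}   # PTA = 13.75 dB HL
bepta = better_ear_pta(left, right)
group = hearing_group(bepta)
```

Because the classification uses the better ear, this hypothetical listener falls in the normal-hearing group, which mirrors how the six participants with mild loss in one ear were categorized.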

Equipment

The study consisted of a standard pure-tone air conduction audiometric assessment and the experimental tablet-based hearing threshold assessment (i.e., HTT). Standard pure-tone audiometric assessments were completed on a Grason Stadler Inc. GSI-61 2-channel diagnostic audiometer (Eden Prairie, MN) using either Telephonics TDH-49 circumaural headphones (Farmingdale, NY) or Etymotic Research ER-3A insert earphones (Elk Grove Village, IL). The experimental tablet-based hearing threshold assessment was completed with a 5th generation Apple iPad (Cupertino, CA) with Sennheiser HD 280 Pro circumaural headphones (Wedemark, Germany). Sennheiser HD 280 Pro headphones were selected for the behavioral study because of their widespread commercial availability, relatively low cost, and output and frequency responses that were adequate for the HTT. The HTT assessment was administered through the NIH Toolbox app available through the App Store for iPad (NIH & Northwestern University, 2021). Acoustic calibration for both hearing assessments (clinical audiometric assessment and HTT) was completed using a Larson Davis 831 sound-level meter (Depew, NY) with a Larson Davis 2559 ½” microphone in a 6 cm³ flat-plate coupler for circumaural headphones.

Procedure

Prior to the administration of the hearing assessment, an electroacoustic validation of the experimental tablet-based HTT was completed to ensure the accuracy of the intensity and frequency of pure-tone signals. The electroacoustic validation was initially completed with several different iPad models to ensure consistency across different models of the same generation, including iPad and iPad Air. The output of the iPad was set to 50% volume. Test frequencies of 500, 1000, 2000, and 4000 Hz were chosen because of their importance for hearing and communication based on published standards (ANSI S3.5-1997). The intensity range of the tablet-based assessment was verified from 10 dB sound pressure level (SPL) to 90 dB SPL. Based on American National Standards Institute (ANSI) Calibration Standards for Audiometers, the output of the device was verified to be within 3 dB across the intensity range at each test frequency to ensure accuracy and linearity of the audio output and headphones (i.e., Sennheiser HD 280 Pro headphones).

The behavioral validation of the experimental tablet-based hearing threshold assessment consisted of a diagnostic pure-tone air conduction audiogram and the HTT for each participant. For both the clinical audiogram and the HTT, each participant was seated in a large, sound-treated audiometric test booth. The diagnostic clinical audiogram was completed first to ensure that each participant had thresholds within the range that could be assessed by the experimental assessment. For both hearing tests, participants were instructed to respond when they heard the tone, even if it was perceived to be soft or barely audible. Response feedback was not provided for either the diagnostic or experimental hearing test.

For the standard diagnostic audiogram, circumaural headphones or insert earphones were placed on the participant’s ears by the research assistant, and the participant was given a response button to press when they heard the sound. Air conduction testing was performed using pure-tone signals via a modified method of limits (Hughson & Westlake, 1944), where the audiologist decreased the level of the signal by 10 dB when the participant responded to the signal and increased the level of the signal by 5 dB when the participant failed to respond to the signal. Threshold was defined as the lowest level in dB HL where the participant responded on two-thirds of consecutive ascending trials (Carhart & Jerger, 1959; Hughson & Westlake, 1944; Jerger, 2018). As in standard clinical practice, thresholds could be tested within the limits of the audiometer down to –10 dB HL and up to 110 dB HL. Thresholds to pure tones were measured via air conduction at octave frequencies from 500 to 4000 Hz in each ear. To match the HTT protocol, bone conduction testing was not performed. If the participant had completed a standard diagnostic audiogram in the clinic within the past 6 months, those pure-tone threshold results were used in this study, as previous data show limited changes in sensorineural hearing loss for children (McCreery et al., 2015) and adults (Cohn, 1999) within this timeframe.

For the experimental, tablet-based HTT, consumer-grade Sennheiser HD 280 Pro headphones were placed on the participant’s ears by the research assistant, and the tablet was placed in front of the participant on a stand, with the screen facing the participant. The participant was instructed to press a target area/button on the screen of the tablet when they heard a sound. A single practice tone (randomized; 500, 1000, 2000, or 4000 Hz) was administered at 70 dB SPL in either right or left ear (randomized) to familiarize the participants with the test stimuli. The experimental, tablet-based HTT used an automated, adaptive tracking procedure that followed a similar method of limits as the diagnostic audiometric assessment conducted by the audiologist. The experimental assessment started by presenting a signal at a specific frequency in one ear, randomized by frequency and ear for each participant. The initial presentation level began at 60 dB SPL for each frequency and ear. If the participant responded to the signal at that level, the level of the signal decreased on the next trial by 20 dB. If the participant failed to respond to the initial trial, the level of the signal increased by 5 dB. After the initial presentations, the level of the signal was increased or decreased in the same manner as in a diagnostic audiogram based on each participant’s response. The trials were administered in blocks by frequency and ear so that thresholds could be established within a block of trials (Harrell, 2002). The order of frequencies tested and test ear for each block of trials was randomly selected for each participant from 500, 1000, 2000, and 4000 Hz. If the tracking procedure reached the limits of the intensity range for a specific frequency, three consecutive responses (at the lower limit of 0 dB SPL) or three consecutive non-responses (at the upper limit of 90 dB SPL) led to the limit being established as the threshold level. The end of a block occurred after five reversals or, in the cases where a limit was reached, after the participant responded to the final presentation of the block. The threshold was calculated as the arithmetic mean of the last five reversals and converted from dB SPL into dB HL by the application. A subset of 76 participants (53 adults, 23 children) completed the experimental test a second time after a short break to assess the test–retest reliability of the experimental test.
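The tracking rule for one block (one frequency in one ear) can be sketched in a few lines. This is a simplified illustration and not the NIH Toolbox implementation. It assumes that the 20 dB step applies only after a response on the very first trial, that later steps are 10 dB down after a response and 5 dB up after a non-response, and that a reversal is recorded at the level where the response changes. The function name run_block and the simulated listener are hypothetical, and the conversion from dB SPL to dB HL is omitted.

```python
def run_block(responds, start=60, lower=0, upper=90, n_reversals=5):
    """Track one frequency/ear block and return a threshold estimate in dB SPL.

    responds: callable taking a level (dB SPL) and returning True if heard.
    """
    level = start
    reversal_levels = []   # levels at which the response direction changed
    previous = None        # response on the previous trial
    at_limit = 0           # consecutive trials confirming a range limit
    first_trial = True

    while len(reversal_levels) < n_reversals:
        heard = responds(level)
        if previous is not None and heard != previous:
            reversal_levels.append(level)

        # Limit rule: three consecutive responses at the lower limit, or three
        # consecutive non-responses at the upper limit, fix the threshold there.
        if (heard and level == lower) or (not heard and level == upper):
            at_limit += 1
            if at_limit == 3:
                return float(level)
        else:
            at_limit = 0

        if heard:
            step = -20 if first_trial else -10   # assumed step sizes (see lead-in)
        else:
            step = 5
        level = min(upper, max(lower, level + step))
        previous = heard
        first_trial = False

    # Threshold is the arithmetic mean of the five reversal levels.
    return sum(reversal_levels) / len(reversal_levels)


# Hypothetical deterministic listener who hears any tone at or above 32 dB SPL.
estimate = run_block(lambda level: level >= 32)
```

With this idealized listener the track visits 60, 40, 30, 35, 25, 30, 35, and 25 dB SPL, and the five reversal levels average to 30 dB SPL. A real listener responds probabilistically near threshold, so actual tracks are noisier than this example.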

Statistical method

All statistical analyses and data visualizations were completed using the R Statistical Computing Language (version 4.0.3, R Core Team, 2021; packages: ggplot2, lme4, BlandAltmanLeh). To conduct electroacoustic validation, acoustic output of the HTT was compared to the indicated level in the HTT application for each frequency. Descriptive statistics were calculated for each frequency, and Pearson correlations were calculated between measured output and the indicated level. First, descriptive statistics were calculated for each test method and test frequency. To address RQ1a, we quantified the difference in thresholds by test methods using root-mean-square (RMS) error. RMS error was calculated by taking the square root of the mean of the squared differences between the HTT and the audiometer threshold at each frequency for each individual. In addition, thresholds from the standard diagnostic audiogram and tablet-based hearing assessment were compared using a Bland–Altman analysis (Bland & Altman, 1999) that quantified the magnitude of errors at each frequency, with the diagnostic audiogram as the clinical standard and the experimental test as the comparison.
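As an illustration of the two agreement measures, the sketch below computes an RMS error (the square root of the mean squared difference) and Bland–Altman limits for paired thresholds. It is written in Python for illustration only; the study's analyses were run in R. The sign convention (HTT minus audiometer), the use of 2 standard deviations for the limits, and the example thresholds are assumptions.

```python
from math import sqrt
from statistics import mean, stdev


def rms_error(htt, audiometer):
    """Root-mean-square of the paired differences (dB)."""
    diffs = [h - a for h, a in zip(htt, audiometer)]
    return sqrt(mean(d * d for d in diffs))


def bland_altman(htt, audiometer, k=2.0):
    """Return (bias, lower limit, upper limit) with limits at bias +/- k SD."""
    diffs = [h - a for h, a in zip(htt, audiometer)]
    bias = mean(diffs)
    spread = k * stdev(diffs)   # sample standard deviation of the differences
    return bias, bias - spread, bias + spread


# Hypothetical paired thresholds (dB HL) at one frequency for four listeners.
htt_thresholds = [10, 20, 30, 45]
audiometer_thresholds = [5, 20, 35, 40]

rms = rms_error(htt_thresholds, audiometer_thresholds)
bias, lower, upper = bland_altman(htt_thresholds, audiometer_thresholds)
```

The RMS error summarizes the size of the disagreement regardless of direction, whereas the Bland–Altman bias keeps the sign and therefore shows whether one method reads systematically higher than the other.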

To address RQ1b, a linear mixed effects model with a random intercept for each participant was used to assess the effects of the participant’s age and threshold level (dB HL) on the magnitude of the difference between the diagnostic audiogram and the HTT with test frequency and ear as a repeated measure within each participant. We included data from both the right and left ear of each participant in this model because it provides a larger dataset; to account for the dependence between ears, the random intercept for each participant signals to the model that multiple measurements were taken from the same person.
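One way to write the model described above is given below. The notation is an assumption introduced here for illustration: e is the error for participant i at frequency f in ear k, PTA stands for the hearing-level predictor, and the age-by-PTA interaction term reflects the interaction reported in the Results. The exact coding of the predictors in the fitted model is not specified in the text.

```latex
\[
e_{ifk} = \beta_0 + \beta_1\,\mathrm{age}_i + \beta_2\,\mathrm{PTA}_i
        + \beta_3\,(\mathrm{age}_i \times \mathrm{PTA}_i)
        + \gamma_f + \delta_k + u_i + \varepsilon_{ifk},
\qquad
u_i \sim \mathcal{N}(0, \sigma_u^2), \quad
\varepsilon_{ifk} \sim \mathcal{N}(0, \sigma^2)
\]
```

Here \(\gamma_f\) and \(\delta_k\) are fixed effects of test frequency and ear, and \(u_i\) is the participant-level random intercept that absorbs the correlation among the repeated measurements from the same person.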

To address RQ2, the test–retest reliability of the experimental test was assessed using Cronbach’s α. The clinical standards for test–retest reliability for behavioral audiometry of ± 7 dB for adults and children over 13 years of age (Mahomed et al., 2013) and ± 10 dB for children under 13 years of age (Beahan et al., 2012) were used as boundaries for clinically significant differences (RMS error) at each frequency. To address RQ3, test sensitivity and specificity of the HTT relative to the diagnostic audiological assessment were calculated for adults and children using a 2x2 confusion matrix of predicted versus true outcomes. This was calculated based on the classification of either normal hearing (BEPTA < 20 dB HL) or hearing loss (BEPTA ≥ 20 dB HL), based on each participant’s BEPTA measured via the HTT compared to the BEPTA from the diagnostic audiometric assessment. We calculated sensitivity by dividing the number of true positives (BEPTA ≥ 20 dB HL per diagnostic audiometric assessment and HTT) by the number of participants with hearing loss per diagnostic audiometric assessment. We calculated specificity by dividing the number of true negatives (BEPTA < 20 dB HL per diagnostic audiometric assessment and HTT) by the number of participants with normal hearing per diagnostic audiometric assessment.
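The two summary statistics named above can be sketched as follows. The code is illustrative Python and not the study's R code; the function names and all data values are hypothetical. Cronbach’s α is computed with the standard formula, treating the two HTT runs as the items, and sensitivity and specificity follow the BEPTA cutoff of 20 dB HL, with the diagnostic audiogram as the reference.

```python
from statistics import variance


def cronbach_alpha(runs):
    """Cronbach's alpha for k runs, each a list of thresholds per listener."""
    k = len(runs)
    item_variance = sum(variance(r) for r in runs)
    totals = [sum(values) for values in zip(*runs)]   # per-listener sums
    return k / (k - 1) * (1 - item_variance / variance(totals))


def sensitivity_specificity(reference_bepta, htt_bepta, cutoff=20):
    """Classify each listener by BEPTA and compare HTT to the reference."""
    tp = fn = tn = fp = 0
    for ref, test in zip(reference_bepta, htt_bepta):
        ref_loss, test_loss = ref >= cutoff, test >= cutoff
        if ref_loss and test_loss:
            tp += 1
        elif ref_loss:
            fn += 1
        elif test_loss:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical thresholds (dB HL) from two HTT runs for four listeners.
run1 = [10, 20, 30, 40]
run2 = [12, 18, 33, 41]
alpha = cronbach_alpha([run1, run2])

# Hypothetical BEPTA values (dB HL) for six listeners.
audiogram_bepta = [10, 15, 25, 40, 18, 30]
htt_bepta = [12, 22, 27, 38, 15, 19]
sens, spec = sensitivity_specificity(audiogram_bepta, htt_bepta)
```

In this toy example two of the three listeners with hearing loss and two of the three listeners with normal hearing are classified correctly, so both sensitivity and specificity equal 2/3.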

Results

Electroacoustic analysis

The acoustic output of the HTT for each test frequency measured by the sound-level meter in dB SPL was compared to the indicated level in the HTT application at each test frequency for a range of inputs from 35 dB SPL to 90 dB SPL. The accuracy and linearity of the calibration were assessed by analyzing the absolute difference in dB and Pearson correlation between the measured levels and the levels indicated in the HTT application. The mean difference across frequencies was 0.49 dB (means: 500 Hz = 0.57 dB, 1000 Hz = 0.32 dB, 2000 Hz = 0.5 dB, 4000 Hz = 0.66 dB). The maximum difference at any frequency was 1.5 dB. The correlation between the HTT-indicated level and measured level was r = 0.99 across frequencies, indicating near perfect correspondence and consistent with the small differences between the indicated and measured levels in the application.
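The calibration check described above reduces to three quantities: the mean and maximum absolute difference between indicated and measured levels, and their Pearson correlation. The sketch below shows one way to compute them for a single frequency. The 3 dB tolerance is the value cited in the text; the level pairs are invented for illustration and are not the study's measurements.

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ss_x = sum((a - mx) ** 2 for a in x)
    ss_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)


def calibration_summary(indicated, measured, tolerance_db=3.0):
    """Summarize agreement between indicated and measured levels (dB SPL)."""
    abs_diffs = [abs(m - i) for i, m in zip(indicated, measured)]
    return {
        "mean_abs_diff": mean(abs_diffs),
        "max_abs_diff": max(abs_diffs),
        "within_tolerance": max(abs_diffs) <= tolerance_db,
        "r": pearson_r(indicated, measured),
    }


# Hypothetical levels (dB SPL) at one test frequency.
indicated = [40, 50, 60, 70, 80, 90]
measured = [40.5, 49.5, 60.5, 71.0, 80.0, 89.5]
summary = calibration_summary(indicated, measured)
```

A high correlation alone would not establish accuracy, because a constant offset leaves r unchanged; the absolute differences are what tie the output to the tolerance.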

Behavioral validation

The average thresholds for each ear of children and adults grouped by hearing status for the standard diagnostic audiometric assessment and the HTT are shown in Table 2. The average test time for the HTT was 4.9 min (SD = 1.3) for children and 5.0 min (SD = 1.4) for adults. Individuals with hearing loss (M = 6.0 min, SD = 2.8) took 1.4 min longer to complete the test than participants with normal hearing (M = 4.6 min, SD = 1.2) on average.

Table 2 Hearing thresholds (dB HL) for audiometry and HTT by frequency and group

Figure 1 shows the HTT thresholds plotted against those from the standard diagnostic audiometric assessment, with a different panel for each test frequency. For individuals with two HTT threshold measurements (for reliability), the HTT threshold closest to the diagnostic audiometric threshold (i.e., smallest difference) was taken for comparison to their audiometric threshold. The closest threshold was taken because the correspondence between HTT tests was high. The Pearson correlation between the best HTT threshold and diagnostic audiometric thresholds ranged from .80 to .93 across frequencies, suggesting strong correspondence between the two measures.

Fig. 1

Relationship between thresholds measured via audiometer and the Hearing Threshold Test app. ***p < .001; HTT = Hearing Threshold Test; thresholds from the HTT were repeated (i.e., two runs). HTT thresholds plotted here represent the closest thresholds (i.e., smallest difference) to the audiometer thresholds. The blue line represents the linear relationship between the variables. Individual thresholds are represented by gray points (circles for adults and triangles for children). Darker points indicate that multiple participants had the same thresholds

Table 3 shows the RMS error between standard diagnostic audiometry and HTT by age group and hearing status. The proportion of each group with a clinically significant difference (± 7 dB for adults; ± 10 dB for children) between the two measures is also shown. The mean differences and RMS errors for adults were smaller than for children, but all mean values were within current clinical standards for accuracy for diagnostic audiometry for each age group. These results suggest that the classification of degree of hearing loss for the HTT would be similar to that provided by diagnostic audiometric assessment.

Table 3 Root-mean-square (RMS) error in dB between audiometer and HTT thresholds by group and frequency

Bland–Altman analysis was used to assess the agreement between the HTT and the diagnostic audiometric assessment. Figure 2 shows the difference between the HTT and the diagnostic audiometric assessment as a function of the average of the two measures for each participant. A difference of zero indicates agreement between the HTT and diagnostic audiometric assessment. The errors were within acceptable limits and did not vary as a function of the average of the two measures across frequency.

Fig. 2

Bland–Altman analysis of difference between audiometer and HTT thresholds by frequency. dB HL = Decibel hearing level; HTT = Hearing Threshold Test; x-axis represents mean of the audiometer and HTT measurements, and the y-axis is the difference between them. The three dotted lines represent the mean of the differences (center line) and 2 standard deviations above and below that. Points are scaled by number of occurrences of the value

To address RQ1b, we examined whether the validity of the HTT varies by listener characteristics, using a linear mixed effects model (Table 4) to examine the effects of listener age and degree of hearing loss on the RMS error for each frequency for the HTT. The main effect of age indicated that younger participants had larger errors, as expected. The effect of pure-tone average was not significant and had a small coefficient, suggesting that the RMS error did not systematically vary as a function of degree of hearing loss. The main effect of age and nonsignificant effect of pure-tone average should be interpreted considering the significant age by pure-tone average interaction, which suggested that the RMS error increase with pure-tone average was slightly greater in the children than in the adults. There were no significant differences in RMS error between ears of the same participants. The frequency-specific effects suggest that while the RMS error was acceptably low across all test frequencies, the RMS error was significantly lower at 500 Hz than at 1000 Hz and 2000 Hz. The statistically significant differences in RMS error across frequency were not clinically significant.

Table 4 Linear mixed effects model examining effects of age and hearing status on RMS error by frequency

To address RQ2, we examined the test–retest reliability of the HTT using a subset of participants (n = 76) who completed two HTT assessments at the same visit. Figure 3 shows the 1st HTT plotted against the 2nd HTT for each participant. A reliability run for one participant was removed due to observed inattention during the task. The Cronbach’s α for each frequency was high, ranging from .87 to .97. These values indicate that the results were consistent across both HTT assessments for each participant. Table 5 shows the RMS error between the two HTT assessments for each participant, the mean difference, and the proportion of each group with a clinically significant difference across HTT. Children had larger errors and a higher proportion of clinically significant differences than adults, but the RMS error and mean differences were within acceptable limits for audiometric diagnostic standards for test–retest reliability. Examining RQ3, the overall sensitivity for the HTT was 96% (22/23 participants with hearing loss correctly identified as hearing loss) and the specificity was 96% (64/67 participants with normal hearing correctly identified as having normal hearing).

Fig. 3

Relationship between repeated HTT thresholds (run 1 vs. run 2). ***p < .001; dB HL = Decibel hearing level; HTT = Hearing Threshold Test; The blue line represents the linear relationship between the variables. Individual thresholds are represented by gray points (circles for adults and triangles for children). Darker points indicate that multiple participants had the same thresholds

Table 5 Test–retest differences for the HTT by frequency and age group

Discussion

This study evaluated the validity and reliability of a new assessment of hearing sensitivity in the NIH Toolbox, the Hearing Threshold Test (HTT). An electroacoustic validation compared the measured and intended output of the HTT, followed by an examination of the criterion validity of the HTT against the standard diagnostic pure-tone audiological assessment. The electroacoustic validation indicated that the application, coupled to consumer-grade headphones, met the standards for accuracy and linearity that are established for diagnostic audiological assessment (ANSI S3.6-2018). The maximum deviation of the HTT output from the indicated level at any frequency was 1.5 dB, meeting the ANSI standard for audiometric calibration, which requires the recorded level to be within 3 dB of the indicated level. These results suggest that the application is sufficiently calibrated for use with Sennheiser HD 280 Pro circumaural headphones and an iPad. It is recommended that users perform regular biological checks with this test and recalibrate if irregularities are noticed.
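As a rough illustration of the calibration criterion described above, the following sketch checks whether a set of recorded output levels stays within 3 dB of the indicated levels. The measurement values and function names are hypothetical and chosen only for illustration; they are not the study's electroacoustic data.

```python
TOLERANCE_DB = 3.0  # recorded level must be within 3 dB of the indicated level


def max_deviation(indicated_db, recorded_db):
    """Largest absolute difference between indicated and recorded levels."""
    return max(abs(r - i) for i, r in zip(indicated_db, recorded_db))


def meets_tolerance(indicated_db, recorded_db, tolerance_db=TOLERANCE_DB):
    """True if every recorded level is within the tolerance of its indicated level."""
    return max_deviation(indicated_db, recorded_db) <= tolerance_db


# Hypothetical measurements at one test frequency (not data from the study).
indicated = [30.0, 50.0, 70.0]
recorded = [30.5, 48.5, 70.0]

worst = max_deviation(indicated, recorded)
ok = meets_tolerance(indicated, recorded)
print(f"max deviation = {worst} dB, within tolerance: {ok}")
```

A biological check by the end user would typically repeat this kind of comparison informally, for example by confirming that a listener with known thresholds obtains similar results over time.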

The behavioral validation study indicated that the HTT produces results comparable to standard diagnostic audiometry, as differences between the HTT and diagnostic audiometric assessment were generally within test–retest reliability standards for adults (± 7 dB; Mahomed et al., 2013) and children (± 10 dB; Beahan et al., 2012). These results provide evidence of good criterion validity of the HTT. The high sensitivity (96%) and specificity (96%) of the HTT demonstrate its ability to accurately identify the presence of hearing loss in an individual. Overall, these results suggest that the HTT of the NIH Toolbox can provide researchers with efficient, accurate estimates of hearing sensitivity for research purposes at a considerably lower cost than a diagnostic audiometric assessment. At the time of this study, the cost of the application, iPad, and headphones was under $1000, making it more financially accessible than other currently available audiometers. The tablet and headphones can be used for additional research tasks if needed, making the equipment more efficient for use in a research environment. The HTT addresses the current limitations of screening and diagnostic tablet-based applications (e.g., need for trained personnel, lack of validation with children) and has sufficient accuracy to assess the magnitude of hearing loss (e.g., degree of hearing loss, PTA), if present.

Validity and test–retest reliability of HTT

This study tested whether the HTT produced estimates of hearing thresholds comparable to those obtained using standard diagnostic audiometry in a sample of children and adults who had either normal hearing or varying degrees of hearing loss. We found that the thresholds reported by the HTT were, on average, within 5 dB of diagnostic audiometry for adults and within 9 dB for children. The current study showed comparable accuracy of threshold measurements relative to other validated tablet-based diagnostic applications currently on the market. These diagnostic applications have demonstrated 80.1–95.8% accuracy within 10 dB of a standard clinical audiometer (Saliba et al., 2017; Sandström et al., 2020; Thompson et al., 2015). While the current study used a stricter definition of accuracy (± 7 dB for adults) to reflect current clinical standards for diagnostic audiological assessment, we found similar accuracy (87.5%) to other applications when using a 10 dB definition.
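The effect of the accuracy definition can be illustrated with a small sketch that computes the proportion of thresholds falling within a given tolerance of the reference audiogram. The threshold values below are hypothetical and serve only to show how the same data yield different accuracy figures under a 7 dB and a 10 dB criterion.

```python
def proportion_within(test_db, reference_db, tolerance_db):
    """Share of thresholds whose absolute difference is at most tolerance_db."""
    diffs = [abs(t - r) for t, r in zip(test_db, reference_db)]
    return sum(d <= tolerance_db for d in diffs) / len(diffs)


# Hypothetical thresholds in dB HL (not data from the study).
htt_thresholds = [10, 25, 40, 15, 60, 35, 20, 55]
audiometer_thresholds = [5, 20, 45, 25, 60, 30, 35, 40]

strict = proportion_within(htt_thresholds, audiometer_thresholds, 7)
lenient = proportion_within(htt_thresholds, audiometer_thresholds, 10)
print(f"within 7 dB: {strict:.1%}, within 10 dB: {lenient:.1%}")
```

Because the stricter criterion can only exclude more comparisons, accuracy figures reported under different tolerances are not directly comparable across studies.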

Additional findings from the current study differentiate the HTT from other applications. For instance, the accuracy of the HTT is notable given the inclusion of children and adults with and without hearing loss. Most other validated tablet-based diagnostic applications on the market have not been validated with children, though there are a few exceptions (Thompson et al., 2015; Yeung et al., 2015). Validation studies that include children either focus on applications specifically designed for younger children using conditioned play audiometry methodologies (Yeung et al., 2015) or include few children and do not perform separate analyses by age group (i.e., children and adults; Thompson et al., 2015). The HTT was shown to be a valid measure for individuals aged 6 years or older, making it suitable for testing hearing sensitivity in both adults and children. Furthermore, the accuracy of the HTT is particularly noteworthy given that the application uses a commercially available tablet and consumer-grade headphones, rather than the tightly controlled equipment and transducers of clinical assessments (ANSI/ASA S3.1-1999 R2018; ANSI S3.6-2018; Champlin & Letowski, 2014).

The sensitivity and specificity of the HTT for classifying the presence or absence of hearing loss were high. Other studies of commercially available tablet-based systems have shown mixed results with respect to sensitivity and specificity. For instance, validation studies of one application demonstrate good sensitivity (> 98% across studies), but variable specificity (ranging from 60.0 to 82.1% across studies). While these results support the validity of that application for hearing loss screening, a few studies concluded that it should not be used to obtain threshold-specific estimates of hearing sensitivity (Abu-Ghanem et al., 2016; Bright & Pallawela, 2016; Peer & Fagan, 2015; Szudek et al., 2012).

This study also examined the test–retest reliability of the HTT by comparing two administrations of the tablet-based assessment within the same individual. The high reliability (Cronbach’s α > .86), high correlations (r > .80), and small differences between test administrations (mean < 3 dB) suggest that this measure provides consistent, stable estimates of hearing from one test administration to the next. These results suggest that any discrepancies in magnitude of hearing loss over time most likely reflect changes in hearing rather than quality of the measurement. Most tablet-based, threshold-specific apps that are currently on the market have not reported their test–retest reliability. In one exception, Sandström et al. (2020) reported good test–retest reliability of a tablet-based application, but they only retested a single frequency (1000 Hz) and the retest occurred within the same test administration. The good test–retest reliability of the HTT provides additional evidence of the utility of this measure for research purposes.
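For readers unfamiliar with these reliability indices, the sketch below shows one common way to compute Cronbach’s α and the RMS difference for two test runs. It is a minimal illustration under stated assumptions: the thresholds are hypothetical, the function names are illustrative, and the study's own analysis may have used different software or options.

```python
from math import sqrt
from statistics import variance


def cronbach_alpha(runs):
    """Cronbach's alpha, treating each run (a list of scores) as one item."""
    k = len(runs)
    item_variance_sum = sum(variance(run) for run in runs)
    totals = [sum(scores) for scores in zip(*runs)]  # per-participant sums
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


def rms_difference(run1, run2):
    """Root-mean-square difference between paired thresholds."""
    squared = [(a - b) ** 2 for a, b in zip(run1, run2)]
    return sqrt(sum(squared) / len(squared))


# Hypothetical thresholds in dB HL for five participants at one frequency.
run1 = [10, 20, 30, 40, 50]
run2 = [15, 20, 25, 45, 50]

alpha = cronbach_alpha([run1, run2])
rms = rms_difference(run1, run2)
print(f"alpha = {alpha:.3f}, RMS difference = {rms:.2f} dB")
```

With only two runs, α depends heavily on the spread of thresholds in the sample, so it is usefully read alongside the RMS and mean differences, which are expressed in dB.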

Overall, evidence of strong validity, high sensitivity and specificity, and good reliability of the HTT suggests that this test can be used by researchers as an alternative to diagnostic audiological assessment to determine the presence and/or degree of hearing loss without specialized equipment or personnel. The accurate categorization of participants by hearing status (i.e., normal hearing versus hearing loss) and quantification of degree of hearing loss (e.g., BEPTA) can allow researchers to account for the participants’ hearing levels when researching systems complementary to audition, such as language, cognition, memory, and psychosocial functioning. For instance, previous studies suggest that participants with hearing loss may have greater difficulty hearing and comprehending stimuli in tasks involving spoken language (e.g., word learning; Pittman et al., 2005; Stiles et al., 2013) or understanding verbal task instructions (Füllgrabe, 2020). Tools like the HTT allow researchers to account for hearing difficulties as they relate to their area of interest (e.g., language development, memory). Characterizing a participant’s hearing status also allows researchers a more controlled approach to confirming eligibility for a study, if the presence of hearing loss is a criterion for inclusion or exclusion.

Factors related to differences by test method

This study examined whether the listener’s age or degree of hearing loss had an impact on the accuracy of the HTT, a question that has yet to be examined in studies of tablet-based assessments. Any effects of these factors on the accuracy of the HTT could have implications for the range of ages or degrees of hearing loss for which the test could be used. Larger differences between the HTT and diagnostic audiometric assessment were observed for children than for adults, and for people with better hearing thresholds than for individuals with poorer hearing thresholds. Children are known to have greater variability in their estimates of hearing thresholds than adults (Beahan et al., 2012) due to a range of factors, including greater likelihood of lapses in attention and the inability to suppress self-generated noise during the assessment (Buss et al., 2016). These inattentive or noisy trials are often caught and discarded when testing is conducted by a clinician, but this is not the case for the automated HTT, which may help explain these larger differences in children. The effect of age on HTT accuracy showed an interaction with degree of hearing loss, suggesting that discrepancies between the HTT and the standard diagnostic audiogram are more likely to be larger in younger children with normal hearing, relative to other individuals. The larger differences between the HTT and diagnostic audiometric assessment in young individuals with better hearing thresholds were likely related to the fact that diagnostic assessments test individuals with normal hearing at much lower sound levels than does the HTT, and such low-level sounds are more easily masked by self-generated noise. As a result, some individuals with normal hearing had higher thresholds on the HTT, but these differences did not have an impact on the accuracy of the application.

Limitations

Although the HTT provided highly accurate estimates of hearing sensitivity for children and adults in this study, there are several limitations to the application and the study results that should be considered. The HTT is not designed as a replacement for diagnostic audiometry. The HTT provided estimates of hearing sensitivity that were within established clinical standards for test–retest reliability and accuracy, but it is not intended to diagnose specific hearing problems or replace diagnostic audiological assessment by an audiologist. It is meant to be used as a research tool. Medical diagnosis of hearing problems requires a full diagnostic audiological assessment that includes pure-tone testing via air conduction through headphones and via bone conduction to confirm potential sites of lesion in the auditory pathway. The HTT does not include bone conduction testing and as a result is not able to differentiate between types of hearing loss (conductive, sensorineural, or mixed). Additionally, the HTT cannot test hearing levels ≥ 90 dB SPL due to equipment constraints, limiting its utility in assessing profound degrees of hearing loss. The HTT does not require professional calibration, but the current data cannot demonstrate the stability of this system over time or substantiate the need for periodic calibration checks beyond biological calibrations and listener checks by end users.

This application has limitations in measuring thresholds in individuals with unilateral or asymmetric hearing loss, as it does not use masking to prevent crossover of loud stimuli to the contralateral ear. The HTT also does not test at higher frequencies (e.g., 6000, 8000 Hz) and is therefore limited in identifying listeners with high-frequency hearing loss, which is more commonly found in older individuals and individuals with noise-induced hearing loss. The sensitivity and specificity for the HTT were high and support the application of these results for establishing the presence or absence and degree of hearing loss in children and adults, but the number of participants in this study with hearing loss was relatively small, and further study of the HTT could provide a larger normative sample with hearing loss of varying characteristics.

Further, the results suggest that the HTT is an effective assessment for children, but this study did not assess children younger than school age. This assessment may not be applicable to young children, such as those who require alternative testing procedures (e.g., conditioned play audiometry) to complete an audiological assessment. Some children may require specific adaptations or closer monitoring by the test administrator to maintain their interest and attention in the HTT. Another limitation of the current study was that the test environment was a quiet audiometric test booth in an audiology clinic. While previous work suggests minimal differences in hearing thresholds measured in a sound booth compared to a quiet room (Foulad et al., 2013), assessment in a less acoustically controlled context might limit the ability of the test to differentiate normal hearing from mild degrees of hearing loss in realistic settings. Any hearing assessment is only as effective as the acoustic environment where the test can be completed, and the results of the HTT obtained outside of the controlled environment tested here should be interpreted in the context of the ambient noise levels in the space. Headphones that cover the ears (e.g., circumaural) can provide some sound isolation but are not a replacement for attempts to minimize sources of ambient noise in any environment where hearing assessment will be conducted. Test administrators should monitor the ambient noise in the testing environment (e.g., with a sound level meter) to ensure quiet conditions and adjust the environment if noise will interfere with testing. Future development of this application could include features that would be useful for testing in non-standard environments with higher levels of ambient noise, such as documentation of noise level via the application, and testing capabilities at higher frequencies, such as 6000 and 8000 Hz.

Conclusions

The accuracy and test–retest reliability of a new component of the NIH Toolbox, the HTT, were assessed using electroacoustic analyses and behavioral validation involving a clinical sample of children and adults who had varying levels of hearing sensitivity. The HTT is available as part of the NIH Toolbox App via the Apple iPad. The electroacoustic analysis indicated that the HTT met current conformity guidelines for output and linearity for clinical devices that are used to measure hearing. The behavioral validation indicated that the magnitude of differences between the HTT and the gold-standard diagnostic audiological assessment was within current clinical guidelines for children and adults. The sensitivity and specificity of the HTT were high for both children and adults, indicating the results can provide a reliable estimate of the presence or absence of hearing loss and the level of hearing loss within approximately 10 dB. In approximately 5 min, the HTT is able to provide researchers with a validated tool to assess hearing sensitivity across a range of clinical populations and can help to expand the inclusion of people with hearing loss in research in fields that study development, behavior, cognition, and language.