Natural observations have the potential to reveal human behaviors that are not evident in a controlled laboratory setting. Recognizing this, but constrained by considerations such as the investment of human labor, financial cost, and computational limitations, social scientists have historically collected behavioral data in very small samples or under conditions that have presented threats to ecological validity (Narayanan & Georgiou, 2013; Neisser, 1967). Many well-documented deleterious or undesirable consequences have followed from this methodological approach, including the Hawthorne effect, in which study participants change behavior due to their own knowledge of being observed (McCambridge, Witton, & Elbourne, 2014); observer biases related to biosocial factors (e.g., a physically larger male researcher interacting with a female participant) and psychosocial factors (e.g., differential effects of undiagnosed disorders in males versus females; Rosenthal & Rosnow, 1991); demand characteristics, in which participants’ assumptions about the goals of the research affect their behavior (Orne, 2009); and potential questions about the generalizability of results (Brewer & Crano, 2000; Cook & Campbell, 1979). Even in the best of studies, the sample sizes have been very small.

A well-known example from the literature is representative of these criticisms. In the mid-1990s, Hart and Risley (1995) argued that the number of hours that parents converse with children is the single strongest predictor of academic success. Hart and Risley’s work was constrained by factors familiar to any social scientist: namely, the labor-intensive nature of collecting, transcribing, and tagging audio-recorded speech and language samples. They studied 42 children for one hour per month over a two-and-a-half-year period, averaging about 28 total observations per family. This produced about 1,200 h of total collected audio, resulting in over 30,000 pages of transcripts. It took “six years of painstaking effort before we saw the first results” and “none of us took a vacation for more than 3 years” (Hart & Risley, 1995, pp. 46, 41). To get a sense of how things have changed in just two decades, the authors of the present article published a study of child-directed speech (De Palma & VanDam, 2017) whose conclusions were based on an analysis of 7,000 h of recorded speech, the overwhelming majority of which was processed automatically by a computer. Similarly, we currently have over 20,000 h of in situ family conversations awaiting analysis.

Current advances in micro-electronics and the ability to store, process, and interpret vast amounts of data afford scientists new avenues of investigation and resolve some of the earlier problems, even problems that were unseen or unconsidered in previous eras of the research enterprise. These advances include (a) commercially available, wearable biotechnology capable of capturing rich data on human behavior, (b) archival databases of raw and processed data, and (c) computational techniques, such as advances in automatic speech processing and recognition (ASP/R). They have been further enhanced by the discovery and rediscovery of computational and statistical models for interpreting the rich, complex data that are generated, including Bayesian inference, so-called “robust statistics” (Box, 1953; Gelman & Hennig, 2017; Huber, 1981), and the recognition that assumptions such as normality and homoscedasticity may not be appropriate for very large datasets. Also important are the easy availability and general familiarity of statistical and numerical software packages such as R (R Core Team, 2013), Python’s SciPy (Jones, Oliphant, & Peterson, 2014), commercial and third-party quantitative software packages for Matlab (MathWorks, 2018), and user-interface-driven applications such as SPSS (IBM Corp., 2013) or RStudio (RStudio Team, 2015) that increase the accessibility of analytic techniques (McGrayne, 2012; Wilcox, 2016; Wilcox & Keselman, 2003). The importance of these discoveries and tools cannot be overestimated.

With the rapid rise in the abilities of modern computing, the collection and storage of vast amounts of data are fairly straightforward and, in many applications, pose few new technological challenges. Commercial computing began handling large datasets in a straightforward manner within a decade after IBM researcher E. F. Codd proposed separating the data model from the underlying hardware (Codd, 1970). But in the era of very large datasets, a number of extratechnological challenges remain. For example, wearable sensors for a disabled or disordered population may pose significant challenges to obtaining data or dealing with noisy, messy, and missing data (Milliken & Johnson, 2001, 2009). It is one thing to record and store everything that is heard (for example), and quite another to assess its validity. An especially thorny problem in the social, behavioral, and health sciences is a variant of the issue that accompanied the rise of the personal computer in the 1980s: the administration of complex systems is itself complex. The administrative costs of hardware may exceed the cost of the hardware itself, perhaps by as much as a factor of 5, as the Gartner Group has been pointing out for more than a decade (Cearley, Burke, Searle, & Walker, 2017; van der Meulen, 2008). Who does this administration is contested territory. A research team may pay for it out of an already-stretched budget, or it becomes the de facto territory of graduate students, postdocs, or, in small research teams, the researcher herself—all amateur information technology (IT) administrators whose time, in a well-funded world, might be better spent doing the science at hand. Our colleagues in the physical sciences share the same issue, leading to an early recommendation for “science data centers” that “would curate one or more massive datasets, . . . the applications that provide access to that dataset, and . . . staff that understands the data” (Gray et al., 2005). Among the examples cited are CERN and Fermilab.

The structure of and access to very large datasets have of course received much attention since the earliest uses of vacuum-tube computers and monolithic databases stored on tape drives. But massive increases in the use and storage of computing technology throughout the second half of the 20th century have resulted in new and far larger datasets. Early massive data archives were described in terms of “warehousing” (Jarke, Jeusfeld, Quix, & Vassiliadis, 1999), but this approach was shown to have the deleterious effect of isolating, or siloing, data: warehoused data are to some degree decontextualized data. One approach to addressing this problem is the concept of a data lake (Dixon, 2010; O’Leary, 2014). A data lake is a large body of data filled, piped, or ingested from many sources, in which “users of the lake can come to examine, dive in, or take samples” (Dixon, 2010). There are also detailed methods (e.g., Constance [Hai, Geisler, & Quix, 2016]) and technologies (e.g., Hadoop, Hadoop + Spark, Splunk, Storm [Mohanty, Bhuhan, & Chenthati, 2015]) for managing the resulting data lake. There has been substantial interest and analytic activity (particularly using Hadoop) from Microsoft, Amazon, Google, Oracle, Facebook, Yahoo!, and others (Pasupuleti & Purra, 2015). The approach and resources discussed here, namely the HomeBank resource detailed below, are consistent with a data lake.

In addition to the general data storage and processing described immediately above, a number of projects or tools seek to aggregate or serve as a repository for data, often in the form of actively ongoing, open data submission archives in which users continually submit new data to the repository or data lake. These also provide (open or restricted) access to that data for the purpose of facilitating new empirical discovery for researchers who may not have collected the data. The advantages of this approach are that the costs, broadly construed, are reduced and data permanence increases; the disadvantages include new management responsibilities for the curators and potential breaches of personal or other data unintended for public consumption. Of particular interest to behavioral scientists are the Open Science Framework (Nosek et al., 2015), CLARIN (Hinrichs & Krauwer, 2014), Databrary (Adolph, Gilmore, Freeman, Sanderson, & Millman, 2012), and TalkBank (MacWhinney, 2007). In addition, there are projects whose details are only partially available to the public or to the research community, including massive ongoing behavioral data collection and archiving within Google, Amazon, Microsoft, Netflix, and other tech-based enterprises. At least some data in the private sector are specifically oriented toward language and speech analysis, detection, perception, or other similar areas of interest that generate massive datasets. Not least of these are the Amazon Alexa, Apple Siri, Microsoft Cortana, and Google Assistant speech recognition and language-processing projects (Chung, Iorga, Voas, & Lee, 2017).

Along with large datasets comes the related issue of data permanence, an issue recognized nearly two decades ago in the context of digital libraries (Rothenberg, 1999). In this case, the rapidity of software and hardware change becomes a liability. The researcher who does not maintain updates is condemned to live on a technological island whose landmass will shrink to zero as hardware and software become unusable, through obsolescence, reduced interoperability, or simple mechanical failure. Recognizing the problem, of course, points to a relatively straightforward solution: maintain software updates and data backups, and copy data to new hardware periodically. These tasks will be performed as a matter of course by professional IT administrators, and perhaps not so routinely by (social, behavioral, and health) researchers who, with their postdocs and graduate students, frequently maintain their own hardware and software.

As technology changes, it is not at all clear what the best practices will be. To offer a simple example, the fairly recent introduction of Amazon Web Services has the capacity to put social scientists out of the hardware business altogether. But technology comes with unintended consequences, and as Edward Tenner and others have observed in the context of very-large-database computing, not the least of these relate to data privacy (Raghupathi & Raghupathi, 2014; Tenner, 1996). Though the problems are widely recognized and potential solutions exist, one might be forgiven for claiming that we have simply substituted a software problem (how to maintain secure data stored remotely) for a hardware problem (how to maintain large, fast computing clusters and data repositories locally while sharing with other researchers).

At a level beyond the collection, storage, and maintenance of data is the interpretation of the data, which with very large datasets has taken on an increasing urgency. That is to say, data collection and all its attendant processes are often less challenging than generalizing meaningful trends from the data. Without substantial tools to organize and analyze the collected data, the advantages of ecological validity, the ability to look longitudinally at the time course of human behavior, and the ability to generalize would be diminished. Algorithms that perform efficiently on small datasets can require days or weeks to run on very large datasets, or may never complete at all due to hardware limitations such as available memory. Matrix multiplication is a simple example: the number of computations grows as the cube of the matrix dimension, so multiplying two n × n matrices by the textbook method requires on the order of n³ operations. Furthermore, as anyone who has tried to send a large dataset over the Internet surely knows, network bandwidth has not kept pace with storage capacity.
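As a concrete illustration of that cubic growth, consider a minimal, self-contained sketch of the textbook triple-loop multiplication (our example, not code from any study discussed here):

```python
import numpy as np

def naive_matmul(a, b):
    """Multiply two n x n matrices with the textbook triple loop.

    The innermost statement runs n**3 times, so doubling n multiplies
    the work by eight: the cubic growth described above.
    """
    n = a.shape[0]
    out = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for k in range(n):
                out[i, j] += a[i, k] * b[k, j]
    return out

# n = 100 implies 10**6 inner steps; n = 10,000 implies 10**12, a
# million-fold increase in work for a hundred-fold increase in size.
```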

A related theoretical problem is the necessity of automation in the processing of very large datasets. The demand for automaticity is in fact one of the hallmark definitions of very large datasets, or so-called big data: a dataset that is larger than traditional means can accommodate (De Mauro, Greco, & Grimaldi, 2016). In any analysis that depends on automatic, deterministic, or algorithmic methods of computation, there is the possibility of systematic error being introduced into the analysis. Systematic errors or biases in machine learning and artificial intelligence (of which ASP/R is a subtype) have been demonstrated in large datasets, with biases documented in such disparate domains as insects, flowers, race, gender, careers, and first names (Caliskan, Bryson, & Narayanan, 2017); clearly, the potential for bias is great. There are traditional methods to confirm the performance of the automation by quantifying precision, accuracy, sensitivity, and specificity, but those methods can be challenging to implement reliably in large datasets. Importantly, the structural correspondence between the measure and the description, the validity, must also be considered. In practice, for sufficiently small datasets, a common assessment method is to examine some subset (or a holdout set) of the data by hand, often 10%–20% of the overall quantity of data, and compare that hand-derived gold standard to the results obtained from the automatic method. If the two results converge, they are considered valid, the results are generally trusted, and the automatic method is accepted (possibly with some conditions, such as a tolerated higher error rate).

But with very large datasets, validation is more challenging, due to the inability to compute by hand an appreciably large gold standard set against which to compare the automatic methods. The generality of the subsample increases as its proportion of the whole increases, but given the samples discussed in the present work, no reasonable effort could attain a hand-derived subsample of even 1%, when 20%–40% is common in the literature (see Kohavi, 1995, and Kim, 2009). For example, in some of our own work we have performed automated analyses on more than 10,000 h of audio collected in home environments. Hand transcription to the level of detail required for these analyses (here, a careful orthographic transcription—but a phonetic transcription would be substantially more labor-intensive) is roughly a ten-to-one time investment for an experienced transcriber. So, to transcribe all 10,000 h by hand would take approximately 100,000 h of labor, or roughly 50 person-years; a holdout or subsample of even 1% would consume an estimated six months of full-time labor from an experienced transcriber. Instead, to assess the validity of automatic methods, we rely not only on gold-standard comparison, but also on content, criterion, and construct validity—all theoretical measures used to argue that the empirical quantities reflect the things we claim they reflect.
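To make the gold-standard comparison concrete, the following sketch scores automatic labels against a hand-coded subsample on the traditional measures named above; the label strings are hypothetical, and a real validation effort would also report reliability for the human coders themselves:

```python
from collections import Counter

def label_agreement(gold, auto, target):
    """Score automatic labels against hand-coded gold labels for one
    label of interest (e.g., 'adult_female')."""
    counts = Counter((g == target, a == target) for g, a in zip(gold, auto))
    tp, fp = counts[(True, True)], counts[(False, True)]
    fn, tn = counts[(True, False)], counts[(False, False)]
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": tp / (tp + fp) if tp + fp else float("nan"),
        "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
        "specificity": tn / (tn + fp) if tn + fp else float("nan"),
    }

# gold: hand labels for a small subsample; auto: machine labels for
# the same segments.
print(label_agreement(
    ["adult_female", "target_child", "adult_female", "noise"],
    ["adult_female", "adult_female", "adult_female", "noise"],
    "adult_female"))
# accuracy 0.75, precision ~0.67, sensitivity 1.0, specificity 0.5
```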

Yet another issue in working with very large datasets is that, of necessity, some processing software is probabilistic. For example, much of our work relies upon speech-processing software from the LENA Foundation (LENA, 2018). The advances in automatic speech recognition over the past two decades have been due to the introduction of probabilistic machine-learning models, and, like the humans they are intended to simulate, those models make mistakes. Google’s use of very large datasets, along with yet another machine-learning technique, neural networks, has brought ASP/R to near-human performance (Hinton et al., 2012; Kepuska & Bouthota, 2017); LENA, however, is a commercial product, so its details are opaque. Our approach has been to test the accuracy of LENA against the gold standard in automatic speech recognition—namely, human subjects (Silbert, Linck, & VanDam, 2013; VanDam & Silbert, 2016). Though LENA performs well, it does not perform perfectly, and the errors that any computer or automatic method makes are not the same kinds of errors that humans make in language perception. The fact that computers and humans are known to use different storage, analysis, and production mechanisms (Woods, Dekker, Cook, Johannesen, & Sarter, 2017; Woods & Roth, 1988) may introduce problems, both in fine-tuning the output and in developing a deeper understanding of the mechanisms themselves, whether those mechanisms are the product of human cognition or of the often opaque models used to specify them (such as hidden Markov models or computational neural networks).

Outline of the present work

What follows in this report is a description of the technology we use, followed by a series of case studies of several recent empirical projects undertaken by our research group. We use the methodology to explore new scientific questions by capitalizing on the confluence of the technological advances and capabilities set forth above, using our own empirical work as an example and proof of concept. We demonstrate how the modularity, flexibility, and extensibility of the modern landscape accommodate new areas of scientific inquiry in a wide range of disciplines with both theoretical and practical importance, by implementing techniques that can be shifted or substituted to accommodate different problems. For example, we demonstrate the use of a particular spectral feature from acoustic speech processing, but the same general procedure could be adapted to other spectral or temporal features or, indeed, to any raw, high-density data, including images and data from sensors or sensor networks. We conclude with a short discussion of the extensibility of the framework described here.

Description of the technology, its extension, and its application

The LENA system was developed in the early 2000s in an attempt to use automatic methods to capture the linguistic input to children and their auditory environment (Gilkerson & Richards, 2009). The commercially available system consists of a small, body-worn audio recorder to collect audio from the child’s perspective and software to process the collected daylong audio. Audio is collected in units of one day, ideally recorded continuously from when the child wakes in the morning until the child goes to sleep in the evening. After a raw recording is collected, the daylong audio file is transferred to a computer, where it is processed using the proprietary LENA software. The goal of the processing is to segment the audio and assign one of about 60 a priori labels to each segment. The segment size depends on the auditory signal and is not of a predetermined duration. Segments ultimately bearing labels corresponding to live human vocal events are generally the size of an uninterrupted utterance, not individual words or syllables.

The goal of many automatic speech recognition systems is to map audio onto an orthographic representation of lexical items, the output being a readable transcript of the audio event. This fairly specific, granular goal has obvious utility, but performance declines substantially as noise increases, a serious problem for daylong recordings in which the environment is overtly uncontrolled: in the context of naturalistic audio recordings, ecological validity and environmental noise are proportional to one another. To address this problem, the LENA algorithms assign labels at a relatively coarse level, such as adult male, adult female, TV/electronic media, silence, noise, target-child, and so on. The output of the processing is a raw audio file (16-bit, 16-kHz, lossless pulse-code-modulated WAV format) and an XML-coded record of the onset–offset times, at centisecond resolution, bearing the label of each segment. The output also includes some additional information, such as the mean amplitude of the signal for the duration of each segment. The time-aligned record of segment labels depends on the content of the recording, but typically runs to roughly 25,000 lines of output.

The audio files are as long as 16 h (the limit of the internal memory of the recorder) but are typically shorter. In a database maintained by the present authors, the average daylong recording is just under 11 h; that database consists of roughly 2,500 recordings from about 120 families. Some families have contributed one recording, and one family has contributed more than 600 recordings from three children, with one child wearing the recorder on about 300 of her first 400 days of life. The original design of the database was to collect one recording per month from each family for one year in early, preschool development. This goal was largely realized, but many factors have influenced the database since its inception in the mid-2000s. For example, families enrolled in the data collection study with a target preschooler might have a school-age sibling who volunteered to contribute recordings. Similarly, families with certain disorders not targeted in the original design might be included (and documented) if it were convenient to collect such recordings. In principle, we have taken the viewpoint that recordings have a fairly high cost to collect, but we have an established protocol and mechanism in place, so unplanned opportunities to collect recordings are generally accommodated in the hope that those recordings may facilitate future scientific discovery.
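Downstream analysis begins by reading that time-aligned record. A minimal sketch of such a reader is shown below; because the LENA format is proprietary, the element and attribute names here are illustrative stand-ins rather than the actual schema:

```python
import xml.etree.ElementTree as ET

def read_segments(its_path):
    """Collect (label, onset_s, offset_s, mean_dB) tuples from a
    LENA-style XML record.

    The element and attribute names below are illustrative: the real
    proprietary schema differs in its details.
    """
    segments = []
    for seg in ET.parse(its_path).getroot().iter("Segment"):
        segments.append((
            seg.get("spkr"),                  # e.g., an adult-female label
            float(seg.get("startTime")),      # onset, seconds
            float(seg.get("endTime")),        # offset, seconds
            float(seg.get("average_dB", "nan")),
        ))
    return segments
```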

The output of the LENA system can be interpreted directly, in a limited fashion, via the interface software provided with the processing module. Summary reports quantifying child and adult utterances, conversational exchanges, TV and electronic media exposure, noise, and silence are available via the included software, as are options to output the data into various interpreted formats, such as CLAN/CHAT or Transcriber files. Combining the time-aligned labels produced by the ASP/R software with the raw audio can be accomplished with third-party or custom-written software, such as the collection of user scripts at the HomeBankCode repository (https://github.com/homebankcode/) or the ADEX parser provided by the LENA Research Foundation. For example, a researcher may want to quantify certain spectral characteristics of all audio segments labeled noise by the ASP/R routines. The researcher might use custom scripts to extract the onset and offset times of those segments, read and process the corresponding raw audio from the daylong file, and return the acoustic features of interest.
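A minimal sketch of that last workflow might look like the following, assuming segments have already been read from the time-aligned record as tuples like those in the parsing sketch above; the RMS level here is a stand-in for whatever spectral feature is actually of interest:

```python
import numpy as np
from scipy.io import wavfile

def feature_for_segments(wav_path, segments, label="noise"):
    """Compute an example acoustic feature (RMS level) for every
    segment carrying the given label.

    `segments` holds tuples beginning (label, onset_s, offset_s), as in
    the parsing sketch above; extra fields are ignored.
    """
    rate, audio = wavfile.read(wav_path)      # 16-bit PCM, 16 kHz
    audio = audio.astype(np.float64)
    results = []
    for lab, onset, offset, *_ in segments:
        if lab != label:
            continue
        chunk = audio[int(onset * rate):int(offset * rate)]
        if chunk.size:
            results.append((onset, offset, np.sqrt(np.mean(chunk ** 2))))
    return results
```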

Our approach to collecting and analyzing large amounts of ecologically valid data can be decomposed into four principal modules. We describe each in terms of our own work, but argue later that the modules admit wide flexibility and extensibility. First, ecologically valid recordings of human behavior are collected in real time through a body-worn recorder; in our case, we collect daylong audio recordings with the LENA system described above. Second, the raw data are processed by commercially or publicly available software: for us, the daylong audio is processed with LENA ASP/R, which takes the daylong audio file as input and outputs a tagged, time-aligned record of segment boundaries and labels known to be important for speech and for the development of speech and language in children. Third, we use custom software to process the results for features of interest. We use Matlab, C, and Python code to extract speech and language features such as utterance duration or the fundamental frequencies of child and adult speech. The ability to use the output of commercially available processes as the input for specialized analyses affords the unique possibility of interdisciplinary applications and specializations not previously considered or not available directly within the LENA software. Fourth, the massive data are subjected to structural data organization and computational modeling techniques, to gain insight into the structure of the data. The goal of some empirical work is to describe; a loftier goal is to use the data to draw larger conclusions and to generalize from the sample to the population with some degree of confidence. In the case of the massive data described here, we are ultimately interested in shedding light on human behavior, the structure of the brain, social cognition, or other complex systems at play, in such a way that the conclusions carry forward, the observations generalize, and the results guide predictive hypotheses.

The general structure of these modules, issues important to consider, and examples at each stage are given in Fig. 1. An important feature of this design and implementation is its extensibility. In particular, each module and the processes contained therein are easily substituted, rearranged, manipulated, added to, or otherwise tailored to a wide array of raw input data, so as to gain empirical insight. For example, at present there are no real alternatives to the LENA system for processing daylong audio files, but efforts are underway to achieve a similar level of ASP/R, including the open source Virtual Speech Kitchen project (Metze, Fosler-Lussier, & Bates, 2013). Furthermore, any given database ingested into such a system could be subjected to many alternative investigations by tailoring the method to the questions of interest. Several examples of such projects from our own work are given in the following section.
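The following skeleton sketches the four modules as interchangeable functions; every name and stub body is invented for illustration, and the point is only that each stage can be substituted independently:

```python
# Every function body below is a stand-in; only the shape of the
# pipeline, four separable stages, is the point.

def collect(recorder_output_path):            # Module 1: body-worn recorder
    return recorder_output_path               # daylong WAV already on disk

def asp_r(wav_path):                          # Module 2: ASP/R (e.g., LENA)
    return [("adult_female", 0.00, 1.42),     # tagged, time-aligned segments
            ("target_child", 1.42, 2.10)]

def extract_features(segments):               # Module 3: custom processing
    return [offset - onset for _, onset, offset in segments]  # durations

def model(features):                          # Module 4: modeling/inference
    return sum(features) / len(features)      # e.g., mean utterance duration

print(model(extract_features(asp_r(collect("child_day_01.wav")))))
```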

Fig. 1

Flow chart, issues, and examples for the collection and processing of behavioral data from a very large database. ASP/R, automatic speech processing and recognition; CDS, child-directed speech

Detailed examples of recent work

Just as the development of core memory, the transistor, and high-level programming languages in the 1950s ushered in the cognitive revolution, the formalist approach to the study of language has made available to researchers the use of computers to transcribe speech, now known as ASP/R. ASP/R, along with current developments in computing and computational techniques, has permitted the investigations we have undertaken. These developments include not only inexpensive storage and recording devices, but also the rediscovery of Bayesian inference, the application of probabilistic machine-learning techniques—the hidden Markov model, in particular—to speech processing, and the expansion of Internet bandwidth, which has allowed both world-wide collaboration and remotely accessible data repositories. In the following four subsections, we describe the use of this collection of methodologies, tools, and techniques to explore child-directed speech; the application of automatic methods to clinical populations, including children with hearing loss; quality control and assessment of validity; and the advantage of sharing data in a public database.

Child-directed speech

In two studies we used the phenomenon of child-directed speech (CDS) as a proof of concept—namely, as evidence that very large speech corpora could be analyzed in a reasonable time to produce results unavailable to other techniques (De Palma & VanDam, 2017; VanDam & De Palma, 2014). CDS is the manner in which adults speak to their infants and toddlers; its characteristics include shorter utterances, hyperarticulation, decreased syntactic complexity, and raised fundamental frequency (f0). The phenomenon has been attested in several European languages, Mexican Spanish, and Japanese, leading some researchers to suggest not only that CDS might be a language universal, but also that it may be implicated in language acquisition (Masataka, 1999). Though CDS has been investigated since the 1970s, the availability of automatically processed daylong recordings has afforded a deeper look at the phenomenon.

LENA, as we noted above, is similar to a conventional automatic speech recognizer of the Siri variety. The system transforms a time-stamped audio stream into a collection of feature vectors that it segments and labels. In essence, LENA, like any other recognizer, is a classifier, but instead of mapping audio segments to word classes, it maps them to the origin of the sound; the possibilities include target-child, other child, male adult, female adult, TV/electronic sound, and others. These labels permit automated judgments of CDS: a speech segment labeled adult and found adjacent to another adult segment is defined as non-CDS, and an adult segment adjacent to a segment labeled target-child is defined as CDS. Both studies described below used raised f0 as a proxy for raised pitch, since pitch and f0 are correlated (Hess, 1983). Raised pitch is one of the most recognizable features of CDS; more importantly, f0 can be extracted from WAV files and analyzed computationally. Interestingly, the automated methods also allowed us to explore whether CDS differs when produced by mothers versus fathers. This was a straightforward extension of the software: all that was required was for mothers and fathers to be identified separately by the automatic methods.
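In code, the adjacency rule can be sketched as follows; the label strings are illustrative, and resolving a segment adjacent to both an adult and the target child in favor of CDS is one of several defensible choices, not necessarily the one used in the published studies:

```python
ADULT = {"male_adult", "female_adult"}

def classify_cds(labels):
    """Assign CDS/non-CDS to each adult segment using only the label
    sequence; non-adult segments and unresolvable cases come back None."""
    out = []
    for i, lab in enumerate(labels):
        if lab not in ADULT:
            out.append(None)
            continue
        neighbors = [labels[j] for j in (i - 1, i + 1) if 0 <= j < len(labels)]
        if "target_child" in neighbors:
            out.append("CDS")                 # adult next to the target child
        elif any(n in ADULT for n in neighbors):
            out.append("non-CDS")             # adult next to another adult
        else:
            out.append(None)
    return out

print(classify_cds(["female_adult", "target_child", "female_adult", "male_adult"]))
# ['CDS', None, 'CDS', 'non-CDS']
```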

In one study we used over 490 h of daylong audio from 33 families as a preliminary investigation based on a small subset of our corpus (VanDam & De Palma, 2014). The study was done before we had developed software capable of traversing a file system with thousands of hours of recorded speech. There were two hypotheses:

  • Mothers and fathers will produce a higher mean f0 during CDS than during non-CDS.

  • Mothers and fathers will produce comparable f0 regardless of the hearing status of children; for this analysis, families with typically developing (TD) children were compared to families with a child who is hard of hearing (HH).

The results were surprising in two ways. First, during the CDS condition, mothers significantly increased f0. Fathers, on the other hand, did not show an increase in f0 in CDS as compared with non-CDS speech. Second, when comparing between the TD and HH groups, the fathers of HH children had a higher overall f0, indicating that fathers might be sensitive to the hearing status of their children. By contrast, mothers’ speech in this study did not differentiate whether they had an HH or TD child.

The second study used over 7,500 h of a corpus that has now grown to over 10,000 h (De Palma & VanDam, 2017). The number of families studied grew from 33 in the first study, to 62 in the second. This time we had developed software capable of traversing a complex Linux file system, building nearly four million 1- to 2-s WAV files from the much larger daylong recordings, and extracting f0 from files designated as CDS and non-CDS. Software constructed for this study is available in the HomeBankCode repository (VanDam et al., 2016), about which we will say more below.

Curious about the strong finding from the 2014 study that fathers do not raise f0 when speaking to TD children, we began with a single hypothesis: Mothers and fathers will produce a higher mean f0 during CDS than during non-CDS. To test this, we extracted the acoustic record of those segments automatically labeled as the mother or father by the LENA software, segregated into CDS and non-CDS. We processed those acoustic files with a pitch determination algorithm and obtained mean fundamental frequency estimates for each talker in each social condition. We then performed a paired t-test for within-talker differences between social conditions, along with an ordinary least squares regression to examine the trend. This time mothers remained sensitive to the CDS condition, and fathers also raised f0 significantly, though not as much as mothers, showing a weaker effect.

The results are interesting in two ways. First, we demonstrated not only that it is possible to analyze a very large corpus, but also that the larger corpus corrected an incorrect inference drawn from a smaller corpus, namely that mothers but not fathers alter speech to their children in specific ways. Second, the differential speech behavior of fathers in the presence of children and adults has not been thoroughly addressed in the literature; we found that fathers’ speech patterns were similar to those of mothers, but far from identical, over and above the general differences between male and female speech. Follow-up work will consider TD and HH populations separately, look at the potential role of siblings in the family communicative environment, consider other acoustic–phonetic features known to be important for speech and language, and examine the role of linguistic complexity in language development. The overarching goal here was to demonstrate that very large quantities of in situ speech could be recorded, stored systematically, analyzed through several layers of software (automatic speech processing, finely grained segmentation of WAV files by timestamp, f0 extraction, and statistical analysis), and yield generalizable inferences.
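A compressed sketch of the final statistical step described above, using SciPy with wholly invented f0 values in place of the measured ones, might read:

```python
import numpy as np
from scipy import stats

# Hypothetical per-talker mean f0 (Hz) in the two social conditions; the
# real values came from the pitch-extraction step described above.
f0_cds     = np.array([212.0, 198.5, 231.2, 205.7, 219.9])
f0_non_cds = np.array([201.3, 190.1, 224.8, 199.2, 210.4])

# Within-talker comparison of the two conditions.
t, p = stats.ttest_rel(f0_cds, f0_non_cds)
print(f"paired t = {t:.2f}, p = {p:.4f}")

# Ordinary least squares fit to examine the trend across talkers.
slope, intercept, r, p_ols, se = stats.linregress(f0_non_cds, f0_cds)
print(f"slope = {slope:.2f}, r = {r:.2f}")
```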

Application to clinical populations

Several studies have used the automatic methods described here to investigate speech production and language development and to describe the auditory environment of children with hearing loss. In addition to our own work looking at children with hearing loss, researchers have used automatic methods and the daylong recordings to look at preterm infants (Caskey, Stephens, Tucker, & Vohr, 2011; Caskey & Vohr, 2013), autism spectrum disorder (Rankine et al., 2017; Warren et al., 2010), Down syndrome (Thiemann-Bourque, Warren, Brady, Gilkerson, & Richards, 2014), language delay (Oller et al., 2010), and others. Here we will briefly describe three studies, to demonstrate the utility and productive nature of using automatic methods to describe and better understand hearing loss in children.

In one study we asked whether two-year-olds who were hard of hearing were exposed to the same quantity of talk from their parents as typically developing children from our own database and from a published sample (VanDam, Ambrose, & Moeller, 2012). We used daylong recordings processed by the LENA software to estimate adult words and conversational turns (normalized by the total hours in the recorded day). We also looked at possible covariates, such as hearing level, sex, socioeconomic status, child language skills, and child age. We found that child hearing status had a limited influence on the quantity of parents’ talk to these toddlers, but within families with a child with hearing loss, parents produced more talk when the child’s hearing was better.

In another study we looked at the role that exposure to television and electronic media has on the language development of children with hearing loss (Ambrose, VanDam, & Moeller, 2014). Here we showed that toddlers’ exposure to more electronic media was negatively associated with receptive language abilities, but that increasing conversational turns might mitigate that disadvantage.

Finally, in another study we compared children in four groups of known diagnoses: typically developing, autism spectrum disorder, language delayed, and hard of hearing (VanDam et al., 2015). Here we analyzed 1,913 daylong recordings from 273 children across the four groups. Using automatic methods based on LENA recordings, we collected parameters from the acoustic signal known to be important for speech and language development and built statistical models to classify children into groups and examine their vocal development. We were able to show successful classification of children using automated vocal analysis. This result is a step toward using large-sample automated methods to distinguish among disorders and improve treatment for a differentiated population.

The three empirical research projects described above demonstrate a concrete approach to using objective, automated methods and very large sampling to gain insight into at-risk or clinical populations that are currently not well understood.

Quality of automatic procedures

One of the issues raised above in the context of investigating data with probabilistic machine-learning techniques is that these techniques are just that—probabilistic. The conditions under which our data were recorded pose some of the greatest problems for ASP/R: an open-set vocabulary and continuous speech uttered by many speakers. The consequence of these factors is that the data are both complex and noisy. A natural question to ask, then, is how accurately the automatic methods perform. Said more precisely, to what degree do the segment labels—target-child, adult male, and so on—conform to those of human judges? In previous empirical studies of children learning English, Spanish, and French, the agreement between human coders and LENA has ranged from 64% to 82%, depending on the label (Canault, Le Normand, Foudil, Loundon, & Thai-Van, 2015; Soderstrom & Wittebolle, 2013; Weisleder & Fernald, 2013).

A more recent study looked at the fidelity of the automatic LENA methods using many excised tokens of specific labels and a cadre of judges (VanDam & Silbert, 2016). In this study, 26 daylong recordings were used, one from each of 26 families; the ages of the participating children averaged 29.1 months. After the daylong recordings had been collected using the LENA system, segments bearing specific talker labels (male adult, female adult, and target-child) were excised, labeled, cataloged, and saved to files. The files were sampled evenly throughout the day from the original recordings, using a custom Matlab script that computed the total number of utterances, n, bearing one category label (e.g., father) and collected every (n/30)th segment with that label, yielding exactly 30 samples evenly spaced across the day. A sampling procedure emphasizing randomness, sparser sampling, or other techniques would be trivial to produce. These excised segments then served as the stimuli for 23 human judges, who listened to a total of 53,820 stimulus presentations, 17,940 from each of the categories of interest. The judges performed a four-alternative forced choice task, choosing mother, father, child, or other for each audio presentation. In a tenfold cross-validation procedure, the error rate stood at .353, consistent with the performance of ASP/R in a large-vocabulary continuous-speech setting (VanDam & Silbert, 2016). Others have investigated the quality of the tools in typically developing (Xu, Richards, & Gilkerson, 2014) and special (Woynaroski et al., 2017) populations. This is an example of using the commercially available LENA system as a first step in data collection, with custom software used to postprocess for features of interest.
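The original sampling script was written in Matlab; an equivalent even-spacing scheme can be sketched in a few lines of Python (the segment identifiers below are hypothetical):

```python
def evenly_spaced_sample(items, k=30):
    """Return k items evenly spaced across `items` (all of them if
    fewer than k exist), mirroring the sampling scheme described above."""
    n = len(items)
    if n <= k:
        return list(items)
    step = n / k
    return [items[int(i * step)] for i in range(k)]

father_segments = [f"father_seg_{i:04d}" for i in range(847)]  # hypothetical
sample = evenly_spaced_sample(father_segments)
assert len(sample) == 30
```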

Since this study was completed, researchers in ASP/R have been reporting spectacular decreases in error rate through the introduction of neural networks in their acoustic and language models, and some have looked specifically at the LENA system (Richards et al., 2017). Xiong and colleagues (2017), for example, report an error rate of .062. There are two caveats. First, Xiong and colleagues at Microsoft tested on the Switchboard corpus (Godfrey, Holliman, & McDaniel, 1992), a corpus of telephone conversations, which are considerably more constrained than our own daylong recordings of in situ child speech. Second, the error rate posted by the Microsoft team is the word error rate, whereas the study cited above reported labeling error. The two can be compared to get a rough measure, but, of course, they are not identical.

Database management and shared data

The question of how to catalog and store thousands of hours of recorded speech in such a way that the speech can be made available to research groups other than our own, while protecting the privacy of the subjects, is a nontrivial problem. After a corpus reaches a certain size, it becomes unreasonable for investigators to serve as their own systems administrators and, in the case where software is to be shared, part-time librarians. This prompted the initiation of HomeBank (https://homebank.talkbank.org), a permanent, online database of daylong audio recordings made in naturalistic settings, along with an associated GitHub code repository (https://github.com/homebankcode/) for the software used to analyze, process, and store the data. The HomeBank database currently has eight corpora representing more than 100 talkers and more than 1,000 recordings, and it is administered by a professional technology team. The recordings consist of vetted and unvetted data. In the vetted portion, the recordings and associated metadata have been examined by trained listeners to ensure that no private or personally identifying material is present; the vetted database is unrestricted and available for public download and analysis. The unvetted database has not been examined by human listeners for content; it is password-protected and available only to registered HomeBank members who have agreed to confidentiality and who have passed through recognized ethics training on how to handle human data (VanDam et al., 2016).

Extensions, adaptations, and future applications

The characteristic feature of the work detailed here is the high quantity of behavioral data available to the researcher. We believe that this is a recent advance and fertile ground for research: big data in general have been considered and collected for much longer than behavioral big data have been available. Advances in computing, not least in storage and miniaturization, have afforded interested researchers the ability to collect behavioral data from biological organisms (i.e., humans) as never before. The work shown here is particular to research programs primarily interested in using automatic methods to explore a wide variety of open questions about speech production, language development, and disorders with associated risk of communication dysfunction, but the general data acquisition and analytic approach can be, and has been, straightforwardly extended or adapted.

One area of current interest is the development of an alternative to the LENA system (Metze et al., 2013; Schuller et al., 2017). An alternative presents several challenges and several advantages. Among the challenges are standardizing hardware, cost, coordination of effort, lack of validation, lack of continuity in the literature for a new, undescribed system, and the varying interests of research teams leading to different or conflicting paths of development. Among the advantages, an alternative may afford the research community transparent processing (which is not a feature of the LENA system); flexibility in coding and applications, such as models that apply directly to specific populations of interest; and interoperability with other hardware and software systems. These advantages would allow an alternative to be research-driven, with development undertaken to address problems of interest within the research community. We expect and hope that the development of this alternative will continue to receive generous support in proportion to the increasing interest of the international scientific community. Researchers around the world using daylong audio recording technology have made discoveries in science and engineering, anthropology, human health disorders, language development, and many disparate fields. With this broad scientific attention and demonstrated results, the development of an alternative to the LENA system is fast approaching viability. Finally, an alternative may accelerate the course of scientific discovery, expressly because of the extensibility of such a system. Compared with the present landscape, dominated by a single proprietary provider, a collaborative enterprise (as is currently being undertaken) will most likely be transparent, open source, and accessible to a wider range of interested scientists, who can take advantage of this flexibility and extensibility to gain new insights and promote new discoveries. There is also potential for additional collaboration between the scientific community and industry as more participants enter the research space.

In the behavioral and health sciences domain, researchers are exploring a breathtaking array of parameters. Just to name a few from disparate fields, there is extant research on processing radiological images (Shin et al., 2015), exploring the genotypology of disease (Nalls et al., 2014), dermatological diagnoses (Esteva et al., 2017), the onset and time course of disease epidemics (Brownstein, Freifeld, & Madoff, 2009; Young, 2015), neuroimaging (Beaton, Dunlop, & Abdi, 2016; Biswal et al., 2010), understanding personality traits (Bleidorn, Hopwood, & Wright, 2017), economic behavior and trends (Einav & Levin, 2014), the nature of human long-term memory (Stanley & Byrne, 2016), theoretical developmental phonology (Bergmann, Tsuji, & Cristia, 2017), and many more.

The framework described above takes in large quantities of behavioral data and uses modern computer hardware and software to store, organize, and analyze those data. Instead of acoustic sensors such as the microphones described above, researchers are using piezoelectric sensors, gyroscopes, implanted electrodes, accelerometers, and pressure devices (among many others) to obtain data about human behavior in domains from the physiological to the cognitive (Gowers et al., 2015; Imani et al., 2016; Klucken et al., 2013; Soh, Vandenbosch, Mercuri, & Schreurs, 2015; Staudenmayer, He, Hickey, Sasaki, & Freedson, 2015). In principle, sensors that collect behavioral data in another domain—such as the electrochemical response to a stimulus, or in a veterinary setting—can be treated in largely the same way as the data we have collected from microphones. These data can be collected at a particular sampling frequency, stored in a database, processed (or mined) for features of interest, and (possibly) shared for future or collaborative analysis.

This modular approach also has certain drawbacks. First, the computational requirements of a third-party provider are likely to be fairly substantial, and it is not clear who will fund offsite data storage, how it will be managed, who will make management decisions, and how the data will be maintained in perpetuity. Second, access and throughput are continuing concerns: as data volume increases, so does the need for capacity, which might restrict overall access to the data and thereby undermine the very reason the resource was put in place. Third, access is a double-edged sword that is particularly risky for behavioral data, since such data critically provide details about the humans they represent. Despite a robust appreciation for the responsible handling of human data in the academic research setting, it is unclear what potential harm to individuals, other ethical violations, or alternative uses might arise from these data being accessible in ways they have not been previously (Zhou et al., 2012; Krutz & Vines, 2010).

Conclusions

This article describes the use of large-scale, daylong, ecologically valid audio recordings to examine human behavioral variables. The daylong audio recordings are processed with ASP/R software, then postprocessed to focus on variables of interest. The resulting data are rich and complex and afford the opportunity to generalize observations beyond a descriptive case alone. This practical approach lends itself to a great variety of extensions, expansions, and cross-discipline collaborations. Here we have focused on our own research interest, acoustic–phonetic analyses to gain insight into human speech and language development, but the extensibility of the methodological approach is ripe for a wide array of other scientific questions.