Introduction

Conversation is the most natural setting for language learning and use and is key to understanding what makes us uniquely human. Before they utter their first words, infants communicate through proto-conversations in face-to-face interactions (Bateson, 1979; Che et al., 2018). Such exchanges establish strong communicative bonds between caregivers and infants and form the foundation of conversational turn-taking, which supports language development throughout the toddler and preschool years (Bornstein et al., 1999; Conboy et al., 2015; Hirsh-Pasek et al., 2015; Romeo et al., 2018a,b; Tamis-LeMonda et al., 2014). A large and growing body of research has proposed several mechanisms through which turn-taking supports children’s language development. Turn-taking provides opportunities for temporal contiguity and contingency, fluency, connectedness, and joint engagement between caregivers and children, processes that are critical to word learning (Bornstein et al., 1999; Conboy et al., 2015; Hirsh-Pasek et al., 2015; Tamis-LeMonda et al., 2014). Turn-taking also helps caregivers constantly adjust their language input through observation of their child’s linguistic abilities, hovering within the zone of proximal development (Vygotsky, 1978). Infants, in turn, have been observed to adjust their vocalizations in response to parental input, thereby creating a social feedback loop (Braarud & Stormark, 2008; Goldstein & Schwade, 2008; Smith & Trainor, 2008; Warlaumont et al., 2014). In addition to supporting language development in infancy and toddlerhood, turn-taking between caregivers and children has recently been linked to children’s socio-emotional, cognitive, and brain development (Gómez & Strasser, 2021; Huber et al., 2023; Romeo et al., 2018a,b; Song et al., 2014). Furthermore, associations have been found between turn-taking and children’s emergent reading skills (Merz et al., 2020; Weiss et al., 2022), executive functioning (Romeo et al., 2021), reasoning scores (Romeo et al., 2021), and IQ scores in middle school (Gilkerson et al., 2018).

Measuring conversational turns

Considering the importance that current developmental theories give to turn-taking, its accurate quantification is critical for evaluating questions about the nature of child development. Conversational turns have so far been measured in a variety of ways, ranging from assessing the extent to which a conversation is equally participatory between two speakers (Adamson et al., 2012) to quantifying the number of utterances that are contingent upon a previous speaker’s turn (Dunn & Brophy, 2005; Dunn & Cutting, 1999). In recent years, one of the main approaches for measuring turn-taking across many studies has been the Language Environment Analysis system (LENA), which facilitates daylong audio recordings in children’s natural environments. An important advantage of LENA is that it provides an ecologically valid option for recording the speech of caregivers and children on a daylong timescale, supplemented by automated speech analyses. The three primary measures provided by LENA are (1) the number of adult words heard by the child (adult word count, AWC), (2) the number of the child’s language-related vocalizations (child vocalization count, CVC), and (3) the number of adult–child back-and-forth exchanges (conversational turn count, CTC). Of these, CTC has recently received the most attention, as it is interpreted as a proxy for high-quality serve-and-return exchanges, engagement, reciprocity, and adult responsiveness, and thus a key component of high-quality language environments that has been linked to language, reading, cognitive, executive functioning, socio-emotional, and brain development (see http://lena.org/conversational-turns for the company’s explanation of how LENA-measured conversational turns are related to development).

Over the last decade, LENA’s CTC has been used across a number of different languages, for the purposes of basic research and intervention, in children’s homes and in schools, and to examine typical development as well as developmental delays and disorders. Given the importance that current theories give to LENA’s CTC (Gilkerson et al., 2018; Gómez & Strasser, 2021; Merz et al., 2020; Romeo et al., 2018a,b, 2021), a critical question to consider is how LENA estimates CTC (i.e., what is actually measured, and how). The answer to this question may seem quite straightforward: the algorithm simply looks for adult and child speech in close temporal proximity. That is, the system looks for adult speech closely followed by child speech, or vice versa, and counts such instances in discrete pairs, with pauses of 5 seconds or more constituting the end of a conversation. Critically, however, the system counts such exchanges without “knowing” whether the adult and the child are talking to each other. This means that a portion of CTCs are identified in error, such as when a parent is talking to another parent, and the infant is babbling to herself nearby. We refer to such “erroneous” cases as examples of accidental contiguity.

Accidental contiguity

Accidental contiguity refers to a situation in which adult and child speech occurs in close temporal proximity, but the two parties are not talking to one another. To illustrate how accidental contiguity is distinguished from real turn-taking, we consider two exchanges between a mother and her 14-month-old baby (Scenario 1 and Scenario 2), both drawn from the Ferjan Ramírez, Hippe, and Kuhl (2021) dataset.

Scenario 1:

[3.71]Infant: Ba!

[5.05]Mom: Yes, sweetie, that’s a ball! Ball! Ball!

[7.82]Infant: Ball!

[6 s pause].

[13.61]Mom: Can you roll the ball to mama?

[15.32]Infant: Ball!

[17.08]Mom: Good job, sweetie, you rolled the ball!

Scenario 2:

[1.45]Infant: Mamamama

[2.92]Mom (on the phone): ...and then we went out for some sushi and wine...didn’t make it home till 2 a.m. I’m exhausted!

[6 s pause]. (Mom still on the phone)

[9.32]Infant: Mamamamama

[11.14]Mom (still on the phone): I know, right?! We’re super stressed out about that too!

Assuming that LENA correctly tags adult and child speech, both of the above scenarios would be tagged with the same number of conversational turns, two. Note that assigning two conversational turns to Scenario 2 is not a measurement error, but a consequence of the fact that the software does not differentiate between child-directed and overheard speech, or “know” whether parental speech is semantically contingent on the child’s speech. Of course, from the standpoint of describing and evaluating this child’s language environment, equating Scenarios 1 and 2 is highly problematic: Scenario 1 contains a number of qualitative language input features that are known to promote language learning (for example, the parent is “following in” and producing talk that is not only temporally but also semantically contingent; see McGillion et al., 2013). In Scenario 2, on the other hand, the child is merely overhearing speech that is semantically unrelated to the child’s focus of attention, and is directed at someone else (in this case, the mother’s interlocutor is not even physically present in the room). While the function and use of overheard speech are still under debate and some studies show that children can and do learn from overheard speech (for example, see Akhtar, 2005; Foushee et al., 2021), the LENA CTC estimate is marketed as a proxy for “quality serve-and-return interactions” (see description on the company’s website: http://lena.org/conversational-turns). With this in mind, an important question to consider is: how frequent are examples such as Scenario 2 in young children’s day-to-day lives?
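To make the counting rule concrete, the following Python sketch implements a simplified version of the pairing logic described above: utterances are grouped into conversations separated by pauses of 5 seconds or more, consecutive same-speaker utterances are merged, and alternations are tallied in discrete pairs. This is our approximation based on LENA’s published descriptions, not the proprietary algorithm itself (which operates on classified audio segments with durations rather than point timestamps); the timestamps are taken from Scenarios 1 and 2.

```python
def count_ctc(utterances, max_pause=5.0):
    """Count adult-child conversational turns in discrete pairs.

    utterances: time-ordered list of (onset_seconds, speaker) tuples.
    A silence of >= max_pause seconds ends the current conversation.
    Simplification: utterance onsets stand in for full segment timing.
    """
    def tally(block):
        # Merge consecutive utterances by the same speaker, then count pairs:
        # child-adult = 1 turn, child-adult-child = 1, child-adult-child-adult = 2, ...
        merged = [spk for i, (_, spk) in enumerate(block)
                  if i == 0 or spk != block[i - 1][1]]
        return len(merged) // 2

    turns, block = 0, []
    for onset, speaker in utterances:
        if block and onset - block[-1][0] >= max_pause:
            turns += tally(block)
            block = []
        block.append((onset, speaker))
    return turns + tally(block)

# Scenario 1: genuine turn-taking about a ball
scenario_1 = [(3.71, "child"), (5.05, "adult"), (7.82, "child"),
              (13.61, "adult"), (15.32, "child"), (17.08, "adult")]
# Scenario 2: mom on the phone, infant babbling nearby (accidental contiguity)
scenario_2 = [(1.45, "child"), (2.92, "adult"), (9.32, "child"), (11.14, "adult")]

print(count_ctc(scenario_1), count_ctc(scenario_2))  # 2 2 -- indistinguishable
```

As the final line shows, a purely temporal pairing rule assigns both scenarios the same CTC of two; nothing in the computation distinguishes responsive turn-taking from accidental contiguity.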

The answer to this question is currently unknown. In fact, only some published LENA studies acknowledge that LENA’s CTC is a composite of intentional responses (such as Scenario 1) and accidental contiguity (such as Scenario 2). One recent study that directly compared manually tagged and LENA-estimated CTs reports that examples such as Scenario 2 are ubiquitous in infants’ typical days (Ferjan Ramírez et al., 2021). Specifically, the study reports that 15–23% of segments that contained LENA-tagged CTs contained no child-directed speech (Ferjan Ramírez et al., 2021). Interestingly, the same study reports that the rate of accidental contiguity decreased with infant age between 6 and 24 months, perhaps as a result of infants’ increased mobility (i.e., the infants’ increasing ability to crawl or walk away from a talking adult when their speech is directed to someone else).

Speaker tagging (diarization)

Unfortunately, accidental contiguity is not the only potential source of error in LENA’s CTC estimate: In Scenarios 1 and 2, the assumption was that the speakers were labeled correctly. However, LENA’s CTC is a derived metric, meaning that its accuracy depends on an earlier step: an accurate classification of audio as human communicative vocalizations. The initial steps of LENA’s algorithms involve classifying stretches of audio into the following speech classes: female adult (FAN), male adult (MAN), key child (CHN), and other child (CXN). The non-speech classes include silence, TV and electronic noise, undefined noise, and overlap. Categories other than silence are then divided into “near-field” or “far-field” sounds based on the energy in the acoustic signal. Finally, stretches of audio categorized as “near” speech-like vocalizations by an adult or child that are temporally close to one another are grouped together into units called “conversational blocks” (Xu, Yapanel, Gray, & Baer, 2008a; Xu, Yapanel, Gray, Gilkerson, et al., 2008b). Taken together, this means that CTC is the end result of multiple, hierarchically dependent signal-processing steps for classifying audio sound sources, and errors at any stage can compound and degrade the accuracy of CTC.
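A back-of-the-envelope calculation illustrates why this hierarchical dependence matters. If we assume, purely for illustration, that each processing stage succeeds independently with some fixed probability (the numbers below are our own placeholders, not LENA’s published accuracies), the probability that a whole turn pair survives the pipeline intact shrinks multiplicatively:

```python
# Illustrative only: placeholder per-stage accuracies, assumed independent.
p_detect  = 0.92   # stage 1: vocalization detected as near-field speech
p_speaker = 0.85   # stage 2: detected speech assigned the correct speaker class

p_utterance = p_detect * p_speaker   # one utterance survives both stages
p_turn_pair = p_utterance ** 2       # a turn pair needs BOTH utterances intact

print(f"per-utterance: {p_utterance:.2f}")  # 0.78
print(f"per-turn-pair: {p_turn_pair:.2f}")  # 0.61
```

In practice the errors are unlikely to be independent, and misclassifications can create spurious turns as well as delete real ones, so this sketch conveys only the direction of the problem: per-stage accuracies that look respectable in isolation can translate into substantially lower end-to-end fidelity for a derived metric like CTC.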

While LENA has provided data on segment classification accuracy, most of the reported statistics were not published in peer-reviewed journals. For example, Xu et al. (2009) is an unpublished study that reports classification accuracy of 82%, 76%, and 76% for adult, child, and other segments, and has been cited many times in support of LENA’s reliability. More recently, research laboratories outside of the LENA organization have investigated its classification accuracy. While a thorough review of their findings is beyond the scope of the present manuscript, this growing body of research agrees that whether LENA’s automatic classification is sufficiently accurate depends on the research question and specifics of the study at hand (Cristia et al., 2020; see also Cristia et al., 2021). For example, a recent systematic review of 33 studies reports that LENA’s adult and child vocalization counts generally correlate well with human transcription (i.e., Pearson’s r is typically quite high), with adult words having a tendency to be slightly overestimated and child vocalizations having a tendency to be underestimated (Cristia et al., 2020). However, independent groups have also noted large and systematic errors in specific situations, such as when female speakers (for example, mothers) address children in infant-directed speech (IDS; Lehet, Arjmandi, Houston, & Dilley, 2021). Under such circumstances, maternal voices are often tagged as child voices, likely because LENA diarization appears to rely heavily on fundamental frequency (F0; see Lehet et al., 2021), which varies as a function of multiple factors, including speech register (i.e., adult- vs. infant-directed speech). These results suggest that systematic biases in the diarization of maternal IDS may substantially distort counts of mothers’ speech turns, and consequently LENA’s CTC. Similar issues have been noted with regard to child vocalizations. For example, one study reports that while LENA performed fairly well in identifying the voice of the child wearing the recorder (60% of frames were correct), the system performed poorly in identifying the voices of “other children,” such as siblings (less than half of the frames were tagged correctly, with siblings often being classified either as “adults” or as the “target child”; see Cristia et al., 2021). Finally, some studies report that the presence of electronic media has the potential to inflate LENA’s adult speaker counts, likely because speech emanating from electronic devices can be erroneously classified as coming from nearby live adults (Lehet et al., 2021; Xu et al., 2009). Note that the above-described types of diarization errors have the potential to affect the accuracy of LENA’s CTC in either direction (overestimating it or underestimating it), depending on who is present and/or talking within the target child’s environment.

Participant factors and segment selection

Another important consideration is whether LENA’s CTC accuracy is impacted by participant factors, such as the language spoken by the family or the age of the child wearing the recorder. It is important to note that the LENA algorithms were trained on data from a sample of North American English-speaking families with typically developing children aged 1–42 months (Gilkerson et al., 2008). However, the LENA technology has been and continues to be used with samples that go far beyond the kind of data in the original training set (including its use across a number of different languages, with children spanning a much wider range of ages, children diagnosed with or at risk for autism spectrum disorder, children with hearing loss, bilingually raised children, and children born preterm). Only some of these studies conduct validation analyses within their datasets, and when they do, they typically only assess whether LENA’s accuracy is “good enough” with regard to a specific study question, as validation analyses are typically not the primary focus of such work (see Cristia et al., 2020, for further discussion). Based on a recent review, there is currently little evidence that LENA’s accuracy for adult and child vocalization counts varies systematically with child age; however, the authors caution that larger datasets are needed to confirm these findings, as they were based on data from only 10 children per corpus whose ages varied widely (Cristia et al., 2021). With regard to CTC and child age, one recent study based on a longitudinal sample of English-speaking families recorded when the infants were 6, 10, 14, 18, and 24 months old reports that LENA’s CTC estimates improved in accuracy as a function of child age, likely due to an observed decline in accidental contiguity over time (Ferjan Ramírez et al., 2021). However, it is unknown whether these findings generalize to ages beyond 24 months, or to languages other than English. Furthermore, this study compared LENA and human-annotated CTs exclusively in “high adult talk” segments: that is, segments were selected based on high AWC to capture parts of the day particularly dense in conversational exchanges. It is unknown whether the findings generalize to segments selected in other ways, such as at random.

The present study

Taken together, LENA’s CTC, which is currently the most widely used measure of caregiver–child turn-taking in developmental psychology, has at least two distinct sources of error: accidental contiguity and erroneous diarization. The impact of each of these sources on the overall accuracy of CTC is poorly understood. An additional concern is that the LENA algorithms were originally trained on an English-speaking sample of children between 1 and 42 months of age (Gilkerson et al., 2008), but they are being used across a wide range of populations, often under an assumption that the reported levels of accuracy will generalize across environments and segment types. It is unknown whether such assumptions are valid. While these concerns apply to all three key LENA automatic metrics (AWC, CVC, and CTC), they are particularly problematic in the case of CTC, mainly because the validation research for this measure has been extremely limited (see Cristia et al., 2020). To our knowledge, only a handful of peer-reviewed studies have considered the relation between LENA’s CTC estimates and human-annotated CTCs, and the results have been inconsistent. One study reports a significant correlation after removing three outliers (Gilkerson et al., 2015), one reports no correlation unless five samples that contained considerable overlapping speech and crying are removed (Pae et al., 2016), and three report significant correlations, though in some cases they are weak (Busch et al., 2018; Ferjan Ramírez et al., 2021; Ganek & Eriks-Brophy, 2018). Only the studies by Busch and colleagues and Ferjan Ramírez and colleagues assessed the agreement, as opposed to simple correlations, which can mask systematic biases (see Methods). Surprisingly, these two studies found conflicting results, with LENA overestimating the CTC in the case of infants (Ferjan Ramírez et al., 2021) and underestimating the CTC in the case of preschoolers (Busch et al., 2018), suggesting that accuracy may be, at least in part, related to child age. As developmental researchers, we are concerned about these inconsistencies for theoretical and practical reasons. For example, if the LENA technology incorrectly and systematically over- or underestimates CTCs in specific kinds of environments, then invalid research, theoretical, or clinical inferences could be drawn.

With these issues in mind, the present study asks how LENA’s CTC compares to the manual (human) annotation of turn-taking, through two specific aims. In Aim 1, we use previously described procedures (Ferjan Ramírez et al., 2021) to examine correlation and agreement between LENA’s CTC and manual measurement of adult–child turn-taking in two corpora: a bilingual corpus of Spanish–English-speaking families with infants aged between 4 and 22 months, and a corpus of monolingual families with English-speaking 5-year-olds. In each corpus, for each child, 100 30-second segments were extracted from daylong LENA recordings in each of two ways: (1) via random selection, and (2) based on highest AWC. This yields a total of 2×100 30-second segments per child for manual analyses, or a total of 9300 minutes of manually annotated audio, which is then compared with LENA’s automatic CTC estimates. Following previously described procedures, annotators identified adult speech that followed child speech or vice versa, and tallied the CTs in each segment. Critically, cases in which the child and the adult vocalized in close temporal proximity but did not talk to each other (accidental contiguity) were not counted as CTs by the human coders. LENA’s CTC estimate for the same segments was obtained through LENA’s Advanced Data Extractor Tool (ADEX; LENA, 2011), which allows researchers to pre-filter output data by time interval or to specify the time resolution of the output dataset (i.e., export statistics at the level of 30-second intervals).

Importantly, and unlike in the previous study that used the same procedures for counting CTs (Ferjan Ramírez et al., 2021), neither of our samples matches the sample on which the LENA algorithms were trained (one is mismatched in the main language spoken by the families, and one in the age of children). This is intentional, given that LENA’s CTC estimates are often used in published studies that draw conclusions about participant samples that are also mismatched with the original LENA training samples (see Ganek & Eriks-Brophy, 2018; Gilkerson et al., 2008). Based on previous results from Ferjan Ramírez et al. (2021), we hypothesized that we would see wide discrepancies between the automatic and manual methods. We had no a priori hypotheses as to which of the two samples (bilingual infants or monolingual 5-year-olds) would have higher reliability. However, we hypothesized that random selection of segments for analyses would have higher reliability than selection according to high AWC, based on previous findings that environments with multiple speakers (likely high in AWC) tend to be particularly problematic for LENA’s automatic counts (Ferjan Ramírez et al., 2021).

The second aim of this study is to conduct an exploratory segment-level analysis to evaluate the relative contributions to the CTC error. Specifically, we hypothesized that the two contributors to the CTC error are related to (1) erroneous speaker tagging (i.e., LENA mistakes one or more speakers as the key child or adult) and (2) accidental contiguity (i.e., both the key child and the adult are talking in close temporal proximity, but not to each other). We had no a priori hypotheses as to which of these factors would contribute more to the errors in LENA’s CTC estimates.

Method

Participants

Participants in the present sample consisted of two groups: a group of families with infants between 4 and 22 months of age, raised bilingually with English and Spanish in Washington State and Florida, USA (bilingual infant group), and a group of families with 5-year-old children raised monolingually, using English, in Washington State, USA (monolingual 5-year group).

Participants in the bilingual infant group were recruited through the University of Washington Participant Pool and advertisements through flyers, social media, and email listservs, and were assessed for eligibility and inclusion in this particular study via video chat. The criteria for inclusion in the present study were as follows: infant age between 1 and 24 months; parents identify infant as being of Latinx descent (i.e., at least one parent self-identifies as Latinx); infant resides with their mother and father; infant is exposed to English and Spanish via direct interactions by native speakers (self-characterized) at home; infant was born full-term (within ±14 days of due date), of normal birth weight (6–10 lbs.), and had no major birth or postnatal complications. Thirty-nine families met the eligibility criteria and were enrolled in the present study. One family withdrew from the study prior to completing the LENA recordings. Upon review of the families’ background information, another family’s data were excluded from analyses because of a reported diagnosis of language delay. Therefore, the final sample in the bilingual infant group includes 37 families (18 girls; mean age: 13.3 months, range: 3.9–21.9 months) who completed the LENA recordings. Socioeconomic status (SES) was measured with the Hollingshead Index (Hollingshead, 1975, 2011), a widely used measure that codes parent educational attainment and occupational prestige to generate a number between 8 and 66. Participants were socioeconomically diverse, ranging from lower-class to upper-middle-class families, with a mean Hollingshead Index of 47.6 (e.g., both parents have completed some college or have a college degree, and work as a restaurant manager, fitness instructor, preschool teacher, or marketing assistant), and a range from 19 (e.g., neither parent has completed high school, both working in unskilled labor) to 66 (e.g., both parents with advanced degrees, working as professionals).

Participants in the monolingual 5-year group consisted of families who participated in previous studies in our laboratory and agreed to be re-contacted, and additional families recruited through the University of Washington Participant Pool. Seventy-five families completed a phone screening interview to determine whether their child met the following criteria: (1) child not yet enrolled in kindergarten, aged between 5 years and 5 years 4 months; (2) English was the primary language of communication in the home; (3) child had no apparent congenital, neurological, or physical abnormalities. Exclusion criteria included any brain injury or medications that impact cognition; intellectual disability; autism spectrum disorder; mood disorders; and significant and permanent hearing impairments. After the initial screening process, 59 eligible participants were invited to take part in the study, and 56 completed the LENA recordings (27 girls, mean age: 5 years 2 months). SES was again measured with the Hollingshead Index and ranged from 30 to 66 in the final analyzed sample (M = 51.9, SD = 11.2) (i.e., working- to upper-middle-class families).

Experimental procedures were approved by the institutional review board of the University of Washington, and informed consent was obtained from parents. The study conforms to the US Federal Policy for the Protection of Human Subjects.

Data collection, preparation, and annotation

Families in both participant groups received two LENA recorders in the mail and were instructed to use one recorder on each day of a typical weekend. Parents were asked to start each recording in the morning when the child woke up, go about their day as usual, and turn off the recorder at night when the child went to sleep. The average duration of the recordings across both participant groups was 13 hours and 13 minutes (range: 10 hours 12 minutes to 16 hours for the bilingual infant cohort, 10 hours 41 minutes to 16 hours for the monolingual 5-year cohort).

Following previously described procedures (Ferjan Ramírez et al., 2018, 2020, 2021; Ramírez-Esparza et al., 2017a, 2017b), the files were processed using LENA’s ADEX to automatically identify segments for manual analyses of CTs. Each participant’s two daily recordings were segmented into 30-second intervals. For each of the two recording days, 50 intervals per day were selected in two different ways: (1) based on the highest AWC, ensuring that the selected segments were at least 3 minutes apart, and (2) at random, after excluding all segments that contained only silence (with no requirement that segments be at least 3 minutes apart or excluded from the high AWC batch). This yielded, for each participant, two sets (one set with high AWC segments, one with randomly selected segments) of 100 30-second coding intervals, or a total of 9300 minutes of audio data for the study. Of the 18,600 30-second segments that were analyzed, 447 (2.5%) occurred in both batches.
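As an illustration of this two-way sampling, the sketch below selects one day’s intervals from an ADEX-style export. The column names and the greedy spacing rule are our assumptions for illustration; the text above specifies only the constraints (top AWC with at least 3 minutes between selected segments; random selection excluding all-silent segments).

```python
import pandas as pd

# Hypothetical ADEX-style export: one row per 30-second interval.
# Assumed columns: onset_s (interval onset in seconds), awc, silent (bool).
df = pd.read_csv("adex_intervals_day1.csv")

# (1) High-AWC batch: take the highest-AWC intervals, greedily skipping any
#     interval that falls within 3 minutes of an already selected one.
selected, onsets = [], []
for _, row in df.sort_values("awc", ascending=False).iterrows():
    if all(abs(row["onset_s"] - t) >= 180 for t in onsets):
        selected.append(row)
        onsets.append(row["onset_s"])
    if len(selected) == 50:
        break
high_awc = pd.DataFrame(selected)

# (2) Random batch: exclude all-silent intervals; no spacing constraint,
#     and overlap with the high-AWC batch is allowed.
random_batch = df[~df["silent"]].sample(n=50, random_state=1)
```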

Seven research assistants were trained to manually identify and tally the CTs in the segments. Although all annotators had some experience coding adult, infant, and child vocalizations, they took part in additional training and a reliability assessment. The same training and reliability assessment procedures were used as previously described (Ferjan Ramírez et al., 2018, 2020, 2021; Ramírez-Esparza et al., 2014, 2017a, 2017b). During training, coders listened to examples of conversational turns, defined as segments of adult speech followed by a response from the target child, or vice versa (Gilkerson et al., 2017). The LENA algorithm segmentation was not used to determine the relevant speech segments (i.e., coders simply listened to files and identified CTs “from scratch,” without opening the LENA segmentation codes), and tallied the CTs that they heard in each interval, without transcribing the conversations. As with the LENA algorithm, CTs were counted in discrete pairs (i.e., child–adult or adult–child = 1 turn; child–adult–child or adult–child–adult = 1 turn; child–adult–child–adult or adult–child–adult–child = 2 turns and so on; see Coder Manual for details), and pauses of 5 seconds or more constituted the end of a conversation. Unlike with the LENA algorithm, cases of accidental contiguity between adult and child speech (such as two adults talking to each other but not to the child, while the child was babbling to herself) were not included in the manual CTC. In order to be considered as part of conversational turns, children’s vocalizations had to be babbles (at least vowels), word attempts, words, or word combinations (see Coder Manual in Supplemental Materials and Ferjan Ramírez et al., 2018, 2020, 2021, for definitions of babble, words, word attempts, and word combinations). Annotators only counted conversational turns when both the child and the adult were talking to each other; therefore, overlap in speakers was relatively infrequent. In cases where annotators detected overlapping speech during conversational exchanges (e.g., another speaker in the background), they were instructed to count the turns as long as they could discern that the child and the adult were speaking to one another. Cases where it was unclear whether the child and/or the adult were talking to one another were relatively rare. When they did occur, coders were instructed to count them as turns in the absence of evidence against it, provided that all other criteria were satisfied (see Coder Manual for details). To assess inter-coder reliability, intraclass correlation coefficients (ICC; see Shrout & Fleiss, 1979) were calculated using a training file of 100 intervals independently coded by all annotators (see also Ramírez-Esparza et al., 2014, 2017a, 2017b; Ferjan Ramírez et al., 2018, 2020, 2021, which use the same procedures). The ICC for CT was 0.95, indicating effective training and reliable coding, based on a two-way random effects model (ICC [2, k]; Shrout & Fleiss, 1979). LENA’s estimate of CTs for the same intervals was obtained through ADEX.

Power analysis

The cohorts included 37 bilingual infants and 56 monolingual 5-year-olds. These cohorts were originally recruited for other studies with different experimental questions (see Ferjan Ramírez, Hippe, Correa, Andert, & Baralt, 2022, and Ferjan Ramírez, Weiss, Sheth, & Kuhl, 2023). The sample sizes for the present study were not preplanned prior to enrollment, but power was calculated after enrollment was concluded to determine whether the sample sizes would be sufficient. A recent study comparing automatic and manual CTCs in 70 infants between 6 and 24 months of age reported a mean difference (automatic minus manual) of 113, with a standard deviation of 83 and a coefficient of variation of 74% (Ferjan Ramírez et al., 2021). Based on 37 infants and the reported standard deviation, we expected to have 90% power to detect a mean difference of 45 using a paired t-test with α = .05. Actual power was likely to be > 90% because a difference of 45 is 60% smaller than the difference observed in the prior study; we expected the true difference to be similar to that of the prior study, although it could vary to some degree with age. The corresponding 95% confidence intervals (CIs) for the mean difference were expected to be ± 28 (counts) or ± 25%. These margins of error are much smaller than the differences expected based on the prior study, so the available sample sizes were determined to be sufficient for the present study.
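For reference, the stated power figure can be reproduced with a standard paired t-test power calculation, treating the targeted mean difference of 45 and the SD of differences of 83 from Ferjan Ramírez et al. (2021) as planning inputs (a sketch using statsmodels; the software actually used for the calculation is not stated above):

```python
from statsmodels.stats.power import TTestPower

d = 45 / 83  # standardized effect: targeted mean difference / SD of differences
power = TTestPower().power(effect_size=d, nobs=37, alpha=0.05,
                           alternative="two-sided")
print(f"power = {power:.2f}")  # approximately 0.90 for n = 37
```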

Statistical analyses

For the first aim of the study, automatically and manually measured CTC values obtained from 30-second intervals were summed for each participant and sampling method (random and high AWC) to produce participant-level CTC values. Each cohort (bilingual infants and monolingual 5-year-olds) and sampling method were summarized separately (four combinations). CTC values were summarized using the arithmetic mean (also referred to simply as “the mean”) and the geometric mean, as well as the corresponding standard deviation and geometric standard deviation. Agreement between participant-level automatic and manual CTC values was analyzed using the techniques of Bland and Altman (Bland & Altman, 1986, 1999). Bland–Altman analysis helps characterize differences between two methods (henceforth referred to as “errors”) in multiple ways. Errors are decomposed into systematic biases and random errors in either direction around the bias (limits of agreement [LoA]). The bias in the automatic CTC was estimated as the mean difference between automatic and manual CTC and was tested against 0 using the paired t-test. The LoA was estimated as the mean difference ± 2 × the standard deviation of the differences. The LoA is an interval that is expected to contain 95% of the differences that might be observed between automatic and manual CTCs. Bland–Altman plots were generated to display the differences (automatic minus manual CTC) versus the average of the two. This provides a way to visualize how the magnitudes of the errors vary across the range of CTC values (i.e., whether there are fewer errors when there are fewer CTCs but more errors when there are more CTCs, or whether the amount of error is similar regardless of underlying CTC values). A random scattering of points that is centered around 0 on the y-axis (appearing “flat” across the range of CTC values) would indicate a lack of apparent bias. Differences were analyzed on the original scale (absolute differences) and on the log-scale (percent differences) (Bland & Altman, 1996). For percent differences, CTC values were log-transformed (log-CTC), the mean difference (automatic log-CTC minus manual log-CTC) and corresponding 95% CI were calculated, and the resulting mean estimate and CI were exponentiated to invert the log transformation. The result of these calculations corresponds to the ratio of the geometric means of automatic CTC over manual CTC, which was then converted to a percent difference = 100% × (ratio − 1).
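A minimal sketch of these calculations, assuming positive participant-level counts (zero counts would require special handling before the log transform, which is not specified above):

```python
import numpy as np
from scipy import stats

def bland_altman(auto, manual):
    """Bias, limits of agreement, and log-scale percent difference with 95% CI."""
    auto, manual = np.asarray(auto, float), np.asarray(manual, float)
    d = auto - manual
    bias = d.mean()
    loa = (bias - 2 * d.std(ddof=1), bias + 2 * d.std(ddof=1))

    # Log scale: mean log-difference -> ratio of geometric means -> percent
    ld = np.log(auto) - np.log(manual)
    se = ld.std(ddof=1) / np.sqrt(len(ld))
    t_crit = stats.t.ppf(0.975, df=len(ld) - 1)
    ratio_ci = np.exp([ld.mean() - t_crit * se, ld.mean() + t_crit * se])
    pct = 100 * (np.exp(ld.mean()) - 1)
    pct_ci = 100 * (ratio_ci - 1)
    return bias, loa, pct, pct_ci
```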

Overall agreement between automatic and manual CTC estimates was summarized using Pearson’s correlation coefficient (r) and the intraclass correlation coefficient (ICC). Pearson’s r indicates the scatter of values around the line of best fit and quantifies random error, but not the systematic biases that may exist between two different measurements. By contrast, ICC considers the absolute agreement between the two methods, and is sensitive to systematic shifts or biases. The ICC ranges from 0 (no agreement) to 1 (perfect agreement). The presence of a systematic shift in one of the measurements, all else being equal, would decrease the ICC but would not affect Pearson’s r.
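The distinction matters in practice. In the simulated example below (illustrative numbers only), two methods share the same scatter around the truth, but one carries a constant upward shift of 40 counts: Pearson’s r stays high while ICC(2,1), computed from the Shrout and Fleiss (1979) two-way random-effects formulas, drops sharply.

```python
import numpy as np

def icc_2_1(y):
    """ICC(2,1): two-way random effects, absolute agreement (Shrout & Fleiss, 1979).
    y is an (n subjects) x (k methods) array."""
    n, k = y.shape
    gm = y.mean()
    bms = k * ((y.mean(axis=1) - gm) ** 2).sum() / (n - 1)   # between-subjects MS
    jms = n * ((y.mean(axis=0) - gm) ** 2).sum() / (k - 1)   # between-methods MS
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0, keepdims=True) + gm
    ems = (resid ** 2).sum() / ((n - 1) * (k - 1))           # residual MS
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

rng = np.random.default_rng(0)
truth = rng.uniform(20, 120, size=200)
manual = truth + rng.normal(0, 5, size=200)
auto = truth + rng.normal(0, 5, size=200) + 40   # same noise, constant shift

r = np.corrcoef(manual, auto)[0, 1]
icc = icc_2_1(np.column_stack([manual, auto]))
print(f"r = {r:.2f}, ICC(2,1) = {icc:.2f}")      # r stays ~.97; ICC falls to ~.5
```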

For the second aim of this study, an exploratory segment-level analysis of automatic CTC error was conducted to evaluate the relative contributions of different conditions related to the wrong speaker and accidental contiguities. Automatic CTC error was defined as automatic CTC minus manual CTC per segment, so positive values indicate that CTC was overestimated and negative values indicate that CTC was underestimated. The segment-level conditions evaluated included other child talking, multiple adults talking, electronic media playing, and accidental contiguity. These conditions were coded as “present” if they occurred at any time during each 30-second segment and were derived from the manual coding of segments. Multiple adults talking was defined as present if more than one adult spoke at any time during a 30-second segment, regardless of whom they were speaking to or their relative timing. Accidental contiguity was defined as present if both the child and an adult spoke at any time during the 30-second segment and the manual CTC was zero. This definition is only an approximation for accidental contiguities, limited by the available manually coded data. The definition may have missed some segments with accidental contiguities if they also contained actual CTs in another part of the segment. On the other hand, the definition may also misclassify some segments as accidental contiguities where both child and adult spoke but not with sufficient temporal proximity to be truly contiguous.
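In code, these flags amount to simple boolean derivations over the manually coded segment table (column names below are hypothetical):

```python
import pandas as pd

# Hypothetical manual coding: one row per 30-second segment.
seg = pd.DataFrame({
    "child_spoke": [True, True, False, True],
    "adult_spoke": [True, True, True, True],
    "n_adults":    [1, 2, 2, 1],
    "other_child": [False, False, True, False],
    "media_on":    [False, True, False, False],
    "manual_ctc":  [3, 0, 0, 2],
})

seg["multiple_adults"] = seg["n_adults"] > 1
# Approximation described above: both parties spoke, yet no real turns were coded.
seg["accidental_contiguity"] = (seg["child_spoke"] & seg["adult_spoke"]
                                & (seg["manual_ctc"] == 0))
seg["none_identified"] = ~(seg["multiple_adults"] | seg["other_child"]
                           | seg["media_on"] | seg["accidental_contiguity"])
```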

Multivariable linear regression was used to estimate associations of conditions related to wrong speakers and accidental contiguities with mean automatic CTC error (outcome variable). Confidence intervals (CIs) and p-values were calculated for the regression coefficients using the nonparametric bootstrap with resampling done by participant rather than by segment (Huang, 2018). The use of the bootstrap avoids distributional assumptions about the CTC error, and the participant-level resampling accounts for the non-independence of the segments from the same participant. The intercept term from each model was also interpreted as the mean automatic CTC error in the absence of all four conditions considered, referred to as “none identified.”
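A sketch of this participant-level (cluster) bootstrap for the regression coefficients, assuming a long-format segment table with a participant identifier (the variable names are ours):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

FORMULA = ("ctc_error ~ multiple_adults + other_child "
           "+ media_on + accidental_contiguity")

def cluster_bootstrap_ci(df, formula=FORMULA, cluster="participant",
                         n_boot=2000, seed=0):
    """Percentile 95% CIs for OLS coefficients, resampling whole participants
    with replacement to respect the non-independence of segments."""
    rng = np.random.default_rng(seed)
    ids = df[cluster].unique()
    draws = []
    for _ in range(n_boot):
        sampled = rng.choice(ids, size=len(ids), replace=True)
        boot = pd.concat([df[df[cluster] == i] for i in sampled],
                         ignore_index=True)
        draws.append(smf.ols(formula, data=boot).fit().params)
    return pd.DataFrame(draws).quantile([0.025, 0.975])
```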

Throughout, two-tailed statistical tests were used. The 95% confidence level was used for all confidence intervals, and statistical significance was defined as p < .05. P-values were not adjusted for the number of comparisons. In Aim 1, our focus was on estimating bias and agreement between automatic and manual CTC rather than null hypothesis significance testing. Aim 2 was considered exploratory. P-values were always fully reported rather than dichotomized (except for p < .001) so they could also be interpreted using stricter significance thresholds based on the Bonferroni correction, such as p < .0125 (.05 / 4) for tests within each of four cohort and sampling groups and p < .003125 (.05 / 16) for tests of four regression coefficients per cohort and sampling group.

Results

The datasets analyzed during the current study are available in the Open Science Framework (OSF) repository: https://osf.io/k5yqc/?view_only=2c747c33165f44f68a245acd5597c73e.

Aim 1: Correlation and agreement between automatic and manual measures of CTC

Automatically and manually measured CTs are shown in Fig. 1, and a comparison of their average values is shown in Table 1. On average, automatic CTC values were higher than the corresponding manual CTC values (all ps < .001), except for the random segments from the monolingual 5-year cohort, where the pattern was reversed (p < .001). For the bilingual cohort, CTC estimates were 110 (95% CI [91, 129]) counts and 700% (95% CI [409, 1156]) higher by the automatic method than the manual method in the high AWC sample. In the same cohort, but for segments selected at random, the absolute and percent differences were 14 (95% CI [8, 20]) and 210% (95% CI [96, 391]), respectively. For the monolingual 5-year-old cohort, the average CTC estimates were also higher by the automatic method than the manual method in the high AWC sample (absolute difference: 163, 95% CI [132, 194]; percent difference: 95%, 95% CI [71, 122]) while the pattern was reversed in the random sample (absolute difference: −29, 95% CI [−42, −17]; percent difference: −28%, 95% CI [−38, −16]).

Fig. 1

Distributions of conversational turn counts (CTC) per child in 100 30-second segments in the bilingual infant corpus (n = 37) and in the monolingual 5-year corpus (n = 56), as estimated by LENA (orange) and as assessed by a human coder (blue). In each corpus, segments for analyses were selected in two ways: at random, or based on high AWC. The left panel shows boxplots with individual data points. The thick horizontal line indicates the median, and the top and bottom of the boxes indicate the 75th and 25th percentiles. The right panel shows the mean and standard deviation for the same data. Error bars represent one standard deviation

Table 1 Comparisons of automatic and manual conversational turn counts (CTC)

In both cohorts, the absolute and relative bias between automatic and manual CTC was smaller in the random segments than in the high AWC segments (all ps < .001). In terms of absolute CTC, the bias in the high AWC segments was larger in the monolingual 5-year cohort than in the bilingual infant cohort (mean difference: 163 vs. 110, p = .005), but the percentage bias was smaller in the monolingual 5-year cohort than in the bilingual infant cohort (mean % difference: 95% vs. 700%, p < .001). That is, the absolute bias seems to be larger in samples with a higher number of conversational turns (for example, in samples with older children), while the percent bias is larger in samples with fewer conversational turns.

The LoA estimates in Table 1 provide ranges of likely differences between the two methods in both participant cohorts and for both sampling methods. These ranges are expected to include approximately 95% of the differences that could be observed between the methods if the experiment were repeated. For the bilingual infants, LoA was −80% to 4776% in randomly sampled segments while it was −47% to 11,908% for segments high in AWC. For the monolingual 5-year-olds, LoA was −76% to 116% in randomly sampled segments, and −26% to 417% in high AWC segments. Together, these LoA estimates indicate that in individual instances, the disagreement in CTC between automatic and manual methods can be quite substantial, much more than implied by the average differences. Individual absolute and percent differences are summarized in Bland–Altman plots shown in Supplemental Figs. 1 and 2, respectively.

Table 2 presents the overall agreement between the automatically and manually measured CTs, as measured by both Pearson’s r and the ICC. Considering Pearson’s r, the two measures of CTC had low correlations for the monolingual 5-year-olds sampled in both ways (r = .23 and .28) but had higher correlations for both bilingual samples (r = .62 and .56, p < .001). The ICC, which is sensitive to systematic biases (summarized in Table 1), demonstrates low absolute agreement for both cohorts when considering the high AWC segments (ICC = .15 for the bilingual sample and .09 for the monolingual sample, ps > .17) as well as for the monolingual cohort when considering randomly selected segments (ICC = .17, p = .074). The ICC was somewhat higher for the bilingual cohort when considering the randomly selected segments (ICC = .49, p = .009).

Table 2 Correlation between automatic and manual CTC

As can be seen in Fig. 2, the linear trend between manual and automatic CTC is apparent in the bilingual infant cohort (as evidenced by the higher Pearson’s r). However, in the high AWC segments the automatic CTC values have an obvious upward bias relative to manual CTC, resulting in a lower ICC (mean absolute difference: 110, Table 1). In the monolingual 5-year cohort, by contrast, the linear trends between manual and automatic CTC were relatively weak (as evidenced by the lower Pearson’s r). Similar to the bilingual infant cohort, the ICC for the high AWC segments was low, due to an obvious upward bias in automatic CTC values relative to manual CTC (mean absolute difference: 163, Table 1) as well as the weak linear relationship.

Fig. 2

Scatterplots of automatic and manual CTC for each cohort and segment sample. The dashed line indicates the line of equality. The solid lines indicate the least-squares linear regression trend lines, and the shaded area represents the 95% confidence intervals of the regression lines

Taken together, there were significant biases (absolute and relative) in automatic CTC relative to manual CTC in both cohorts and for both segment samples, and the limits of agreement were wide in all cases. These patterns of results are well aligned with the previous findings by Ferjan Ramírez et al. (2021), and highlight potential problems with automatic measurement of parent–infant verbal interactions in two new cohorts of participants and across two different segment samples.

Aim 2: Segment-level analyses

The second goal of the present study was to explore the relative contributions to the error in CTC estimation by the automatic method relative to the manual method (automatic CTC minus manual CTC). Recall that we hypothesized that the two main contributions to CTC error were erroneous speaker labeling and accidental contiguity; to capture these two conditions, the following variables were manually assessed in each segment: multiple adults speaking, another child speaking, electronic media playing, accidental contiguity, or none identified (absence of the four aforementioned conditions studied). Figure 3 demonstrates the percentage of segments affected by each of these conditions. Interestingly, accidental contiguity had similar prevalence in both participant cohorts and for both segment sampling methods, affecting between 12% and 17% of all segments. Between 20% and 29% of segments had none of the assessed conditions across the two cohorts and two segment sampling methods. There were also some notable differences between the two participant cohorts. The most common conditions for the bilingual infants were speech from multiple adults (44–53% of segments vs. 31–35% of segments in the monolingual 5-year-old cohort, p < .001) and electronic media (33–42% vs. 19% of segments, p < .001). On the other hand, the most common condition for the monolingual 5-year-old cohort was speech from another child (37–40% of segments vs. 16–19% of segments in the bilingual infants, p < .001), followed by speech from multiple adults (31–35% of segments).

Fig. 3

The percentage of segments affected by each condition (related to the wrong speaker or accidental contiguity). The absolute number of segments affected by each condition is shown at the bottom of each bar. Note that the conditions are not mutually exclusive, so the numbers of segments shown add up to more than the total number. The error bars correspond to 95% confidence intervals. *The “none identified” category corresponds to segments where none of the four conditions assessed was detected, but does not rule out the possibility of other relevant conditions that affect error rates being present

Figure 4 demonstrates the impact of each condition on the automatic CTC error per segment, using multivariable regression. Recall that the variables are generally not mutually exclusive, with the exception of the category “none identified.” That is, any combination of the “wrong speaker”-related variables and the “accidental contiguity” variable could be present in the same segment. Of all the conditions, accidental contiguity had the largest individual impact on the CTC error per segment for both participant cohorts and segment sampling methods, associated with the automatic method overestimating CTC relative to the manual method in all cases (difference in mean automatic CTC error: .41–1.0 in the bilingual infants and 1.0–1.5 in the 5-year-old monolingual cohort per 30-second segment, all p < .001). In the bilingual infants, the total automatic CTC across all participants and segments was 1374 under random sampling and 5408 under high AWC sampling. The percentage of this total that was counted in segments with accidental contiguities was 17% (238/1374) under random sampling and 22% (1195/5408) under high AWC sampling. The corresponding rate in the monolingual 5-year-olds was 14% (598/4369) under random sampling and 11% (1961/18,591) under high AWC sampling.

Fig. 4

The impact of each condition on automatic CTC error relative to the manual method per 30-second segment. Error was defined as automatic CTC minus manual CTC. Each bar corresponds to a regression coefficient in a multivariable model with all conditions included as factors potentially associated with CTC error (see Methods). The error bars correspond to 95% confidence intervals. The intercept corresponds to the mean automatic CTC error in the absence of the four conditions assessed (“none identified”). The “none identified” category does not rule out the possibility of other relevant conditions outside of the four studied being present. The other coefficients correspond to the mean difference in CTC error between segments with and without the corresponding condition present, controlling for the presence/absence of the other conditions. The asterisks indicate the p-values for the regression coefficients (null hypothesis: coefficient = 0), excluding the intercept (“none identified”). * p < .05, ** p < .0125 (.05/4), and *** p < .003125 (.05/16)

Speech from another child had the next largest impact on the automatic method overestimating CTC in the 5-year-old monolingual cohort (difference in mean automatic CTC error: .70–.72 per 30-second segment, all p < .001) and the bilingual infants under random sampling (difference in mean automatic CTC error: .14 per 30-second segment, 95% CI: .04–.23, p = .003). However, speech from another child was not statistically significantly associated with automatic CTC error in the bilingual infants under high AWC sampling (difference in mean automatic CTC error: −.01 per 30-second segment, 95% CI: −.25 to .24, p = .95).

The other two conditions did not appear to be consistently associated with automatic CTC error, with each statistically significantly associated with automatic CTC error in only one of four cohort/sampling method combinations: speech from multiple adults was associated with overestimating CTC only in the bilingual infants under random sampling (difference in mean automatic CTC error: .12 per 30-second segment, 95% CI: .06–.19, p < .001) and the presence of electronic media was associated with overestimating CTC only for the monolingual 5-year cohort under random sampling (difference in mean automatic CTC error: .38 per 30-second segment, 95% CI: .22–.54, p < .001).

Other patterns in Fig. 4 are more challenging to explain: for example, in the high AWC samples, the “none identified” category is particularly large in both participant cohorts (mean automatic CTC error: 1.1–1.4 per 30-second segment), suggesting that a large portion of the CTC overestimation remains unexplained by the assessed conditions. Perhaps even more surprising is the finding that the monolingual 5-year cohort under random sampling demonstrates overall underestimation of CTC in the “none identified” condition. Paradoxically, because the presence of every other condition except multiple adults speaking added positively to the automatic CTC error, those conditions could have reduced the total apparent error (e.g., when segments with underestimated CTC are added to segments with overestimated CTC, the resulting average CTC error is closer to zero). Thus, the error in this case may largely cancel out, not because the system is accurate, but because different conditions pull the error in opposite directions. On a more positive note, the “none identified” category was extremely small in the bilingual infant cohort under random sampling. That is, if there are not multiple adult or child voices present, no electronic media playing, and no cases of accidental contiguity, the automatic CTC estimate comes close to the manual CTC on average in this cohort, provided that segment sampling is random.

Discussion

The present study considered the correlation and agreement between LENA’s automatic CTC estimate and manual annotation of parent–child turn-taking in two corpora of audio recordings: a bilingual corpus of Spanish–English-speaking families with infants aged between 4 and 22 months, and a corpus of monolingual families with English-speaking 5-year-olds. In each corpus and for each child, audio samples were extracted from daylong LENA recordings via random selection and based on the highest AWC, in order to compare manually annotated conversational turns to LENA’s automatic CTC estimates. Confirming our hypothesis, we found wide discrepancies between the automatic and manual methods, in both participant cohorts and for audio segments selected in both ways. The two measures of CTC had low correlations for the monolingual 5-year-olds sampled in both ways, and somewhat higher correlations for the bilingual infant samples. The ICC, which is sensitive to systematic biases, showed low absolute agreement for both cohorts when considering the high AWC segments, as well as for the monolingual 5-year-old cohort when considering randomly selected segments, but was somewhat higher for the bilingual cohort when considering the randomly selected segments. On average, automatic CTC values were higher than the corresponding manual CTC values, except for the randomly selected segments from the monolingual 5-year cohort, where the pattern was reversed. In both participant cohorts, the absolute and relative bias between automatic and manual CTC was smaller in the randomly selected segments than in segments high in adult speech. The absolute bias was larger in the monolingual 5-year cohort than in the bilingual infant cohort, but the percentage bias was smaller in the monolingual 5-year cohort than in the bilingual infant cohort. Taken together, these data confirm previous reports of wide discrepancies between the two measurements of CTC in two new cohorts of participants and in segments selected in two different ways, suggesting that the automatic and manual CTC measures are not identical and cannot be interchanged (see also Busch et al., 2018; Ferjan Ramírez et al., 2021).

The present study also evaluated the relative contributions of various factors to the CTC error. We found that accidental contiguity had the largest individual impact on the CTC error per segment for both participant cohorts and for both segment sampling methods—affecting 12–17% of all segments across both cohorts and sampling methods, and 17–22% of automatically counted CTs in the bilingual infants and 11–14% of automatically counted CTs in the monolingual 5-year-olds—and was associated with the automatic method significantly overestimating CTC relative to the manual method in all cases. Together, this means that on average, 11–22% of LENA’s CTCs are erroneously counted due to accidental contiguities and do not represent the reciprocal serve-and-return exchanges that the technology is intended to measure. We also show that LENA can overestimate CTCs by over 100%. These findings should be considered and appropriately acknowledged in future studies that rely on LENA’s CTC estimates when drawing research, theoretical, or clinical inferences.

Other factors that significantly affected the CTC error were speech from other children, the presence of multiple adults, and the presence of electronic media. Of these, speech from another child had the largest impact on the error, significantly overestimating CTC in three out of four studied conditions. The impacts of the other factors on the CTC error were more varied, and depended on the participant sample and/or segment selection method. For example, in the 5-year-old cohort, the largest impact after adjusting for the total number of segments came from the speech of another child (likely a sibling, given that the recordings were collected in children’s homes). By contrast, in the bilingual infant cohort, accidental contiguity had the largest impact on CTC in the high AWC sample, and speech from multiple adults had the largest impact for this cohort under random sampling. Together, this suggests that the magnitude and direction of LENA’s CTC error vary with environmental, participant, and segment selection factors, which can compromise the comparability of LENA measures across conditions, studies, participant samples, or developmental time points. As an example, a research setting such as a multigenerational home is likely to contain higher rates of adult and/or child speech. The present results suggest that LENA’s CTC may be less accurate in such an environment. Thus, when comparing children from diverse family structures, differences that may be attributed to distinct language environments could simply be the result of LENA’s poorer performance in certain environments.

Our data also demonstrate that different factors that contribute to the CTC error can pull the error in opposite directions. This can sometimes result in CTC estimates that appear to be close to real CTC values, but not because the system is accurate, but because a positive and a negative error can cancel each other out. It is clear that even small inaccuracies and biases detected at the 30-second level can potentially accumulate into large absolute differences, and if their direction or magnitude is determined by participant characteristics or properties of the recorded situations, this could introduce confounds when comparing CTCs between subjects, conditions, or developmental time points. Nevertheless, it is important to acknowledge that LENA was originally developed to study children’s language environments at the daylong level, and that non-peer-reviewed LENA user guides suggest that AWC errors cancel out over the course of an entire day (Gilkerson et al., 2008; Xu et al., 2009). For example, Xu et al. (2009) suggest that LENA-based AWC estimates initially differ by more than 40%; however, this variation decreases logarithmically as a function of time. It is theoretically possible that a similar trend would be observed for CTC. Unfortunately, it is not clear from our data how the observed disagreements and biases would accumulate in a regular full-day LENA recording (i.e., would the positive and negative errors cancel each other out after a full recording day?). Of note, if the CTC error is to cancel out with increased duration, LENA should both over- and underestimate CTs by comparable amounts across different segments. The Bland–Altman plots presented here show that underestimation over 50 minutes of noncontiguous segments was quite uncommon except for the monolingual 5-year-olds, making the possibility of the error canceling out with extended duration seem less likely. From a more practical standpoint, it is also important to acknowledge that daylong recordings are often not possible or desirable. For example, families may choose to turn off the recorder for a portion of the day due to privacy concerns. Furthermore, in clinical, educational, or intervention settings, limitations typically exist that prevent data collection at the daylong level. As a result, LENA users often rely on recordings much shorter than the recommended full day. Oftentimes, researchers extract and analyze shorter portions of recordings (recording segments or snippets), based on criteria aligned with the specific goals of the study (see for example Cychosz et al., 2021; Orena et al., 2020; Weisleder & Fernald, 2013). Future peer-reviewed studies should consider whether the accuracy of CTC does, in fact, increase with recording length.

Another potential area of investigation is the precise timing of turns between children and adults, in relation to the rate of accidental contiguities. Within the turn-taking literature, the optimal time frame for an appropriate response is typically shorter than LENA’s choice of 5 seconds, at around 2 seconds (for example, Elmlinger et al., 2019; Hilbrink et al., 2015; see also Nguyen et al., 2022, for a meta-analysis). One might argue that a 5-second window is unrealistically long to accurately capture turns; further, one might hypothesize that the length of this window contributed to the high rates of accidental contiguities observed in the present study (i.e., that narrowing the window for what is considered a valid turn would reduce the number of occurrences in which an adult and a child are speaking in temporal proximity but not to one another). Unfortunately, due to the time-consuming nature of manual annotation, our annotators were not asked to tag the precise timing of each turn. We are therefore hesitant to make recommendations about shortening the CT time window based on the present results, but we think this is an interesting area for future research. Particularly informative will be studies in which annotators are asked to transcribe parental and child speech, tag the timing of each turn, and take detailed notes on other environmental factors that may be modulating the accuracy of CT estimates.
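
As a purely illustrative approximation (LENA’s actual algorithm is proprietary; the function and the timestamps below are our own invention), the following sketch shows how the choice of response window changes the number of counted turns over the same utterances:

```python
# Purely illustrative turn counter (LENA's actual algorithm is proprietary);
# the timestamped utterances below are invented for the example.
from typing import List, Tuple

Utterance = Tuple[str, float, float]  # (speaker, start_s, end_s)

def count_turns(utts: List[Utterance], max_gap_s: float) -> int:
    """Count adult<->child alternations whose silence gap is <= max_gap_s."""
    utts = sorted(utts, key=lambda u: u[1])
    turns = 0
    for (spk_a, _, end_a), (spk_b, start_b, _) in zip(utts, utts[1:]):
        alternating = {spk_a, spk_b} == {"adult", "child"}
        if alternating and 0 <= start_b - end_a <= max_gap_s:
            turns += 1
    return turns

utterances = [
    ("adult", 0.0, 1.2), ("child", 2.0, 2.8),  # 0.8-s gap: prompt reply
    ("adult", 6.5, 7.4),                       # 3.7-s gap after the child
    ("child", 11.0, 11.6),                     # 3.6-s gap after the adult
]
print(count_turns(utterances, max_gap_s=5.0))  # 3 under a LENA-like 5-s window
print(count_turns(utterances, max_gap_s=2.0))  # 1 under a stricter 2-s window
```

In this toy example, two exchanges that qualify under a 5-second window drop out under a 2-second window, illustrating the mechanism by which a shorter window could reduce accidental contiguities.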

A somewhat surprising pattern in the present data was that the monolingual 5-year-old cohort under random sampling showed an overall underestimation of CTC by the LENA software. This pattern is the opposite of what we observed in the other three conditions studied here, and the opposite of what was previously reported for English-speaking infants between 6 and 24 months of age (Ferjan Ramírez et al., 2021); however, an underestimation of CTC by the LENA software had previously been reported by Busch et al. (2018), who studied six Dutch children between the ages of 2 and 5 years. Interestingly, Busch et al. (2018) report that LENA’s CTCs were overestimated in samples with few real conversational turns and underestimated in samples with many real conversational turns, a pattern that was partially replicated in the present dataset. However, the present study also shows that, while LENA underestimated CTs in 5-year-old monolingual English-speaking children when segments were selected at random, CTs were overestimated in the same participants when segments were selected based on high AWCs. In the present dataset, this overestimation was partially explained by the presence of another child (i.e., a sibling) in the recording, but remained partially unexplained (that is, it was not accounted for by any of the conditions that we identified as potentially contributing to the CTC error). In fact, between 22% and 30% of segments with errors in CTCs were not affected by any of the conditions explored in the present study.

The LENA algorithms are proprietary, preventing us from fully describing the origins of these discrepancies. However, one potential explanation is a difference in how LENA and human coders count child vocalizations (CVs). Specifically, LENA operationalizes CVs as “breath groups,” whereby a 300-ms pause ends a vocalization, and vegetative sounds (e.g., cries, burps) are not counted. This allows LENA to deal with the high variability of speech in the absence of semantic or lexical “knowledge.” Human coders, on the other hand, also consider semantic boundaries. It is also possible that LENA’s and human CV counts are better aligned in infants, as opposed to 5-year-old children who produce full sentences, or in certain (perhaps less chaotic) environments. However, it remains unclear exactly how environmental noise, or developmental changes in children’s language production between birth and 5 years, affect the relation between LENA’s CVC estimate and human child-vocalization counts, and what this may mean for the corresponding alignment of CTCs, given that other factors affect this relation simultaneously. Future studies would benefit from full transcription of parent–child speech and/or detailed annotator notes on other environmental factors that may be at play within each coded segment.
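
A minimal sketch of the breath-group rule, as we understand it from LENA’s public description (the voiced intervals below are hypothetical), may help clarify how the two counting schemes can diverge:

```python
# Sketch of the breath-group rule as we understand LENA's public description:
# voiced stretches from the same child are merged into one vocalization unless
# separated by a pause of at least 300 ms. The intervals are hypothetical.
PAUSE_MS = 300

def count_vocalizations(voiced_ms):
    """voiced_ms: time-sorted (start_ms, end_ms) voiced intervals of one child.
    Returns the number of vocalizations under the 300-ms pause rule."""
    if not voiced_ms:
        return 0
    count = 1
    prev_end = voiced_ms[0][1]
    for start, end in voiced_ms[1:]:
        if start - prev_end >= PAUSE_MS:  # pause long enough: new vocalization
            count += 1
        prev_end = max(prev_end, end)
    return count

# Four voiced stretches, but only two pauses reach 300 ms -> 3 vocalizations.
print(count_vocalizations([(0, 400), (550, 900), (1400, 1800), (2400, 2700)]))
```

A human coder attending to semantic boundaries might merge or split these stretches differently, so the two schemes can disagree even over identical audio.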

Another important issue for future research will be the treatment of overlapping speech. In counting CTs, the LENA automated algorithm discards speech that overlaps with other sounds. By contrast, our annotators were instructed to count CTs in cases of overlap, as long as they could reliably discern that the two speakers were indeed talking to one another. While this could explain some of the observed discrepancies, the annotators’ more permissive definition of what “counts” as a conversational exchange should lead LENA to underestimate turns relative to human annotation. This pattern was observed in only one of the four conditions studied here; the other three showed the opposite pattern. Whether overlapping speech should “count” as input to children is an interesting question that deserves further investigation. Future studies will need to consider the conditions under which overlapping speech is more or less likely to occur (e.g., large families, school or daycare settings), the conditions under which children are more or less likely to learn from it, and cultural variation in how acceptable overlapping speech is.
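
The direction of this effect can be illustrated with a small sketch (the speaker sequence and overlap flags are hypothetical, and the counter deliberately ignores timing), contrasting an overlap-discarding policy, as LENA is described as using, with the overlap-retaining policy our annotators followed:

```python
# Contrast of the two overlap policies: dropping utterances that overlap other
# sound (as LENA is described as doing) versus retaining overlapped utterances
# judged to be genuinely addressed to the other speaker (as our annotators did).
# The speaker sequence and overlap flags are hypothetical.
utts = [
    ("adult", False),
    ("child", True),   # child replies while the TV is playing
    ("adult", False),
    ("child", False),
]

def alternations(seq):
    """Count adult<->child alternations in an ordered speaker sequence."""
    return sum({a, b} == {"adult", "child"}
               for (a, _), (b, _) in zip(seq, seq[1:]))

overlap_discarded = alternations([u for u in utts if not u[1]])  # LENA-style
overlap_retained = alternations(utts)                            # annotator-style
print(overlap_discarded, overlap_retained)  # 1 3
```

Under this simplified counter, discarding overlapped utterances can only remove alternations, which is why a more permissive human definition should, other things being equal, yield counts at or above LENA’s.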

An exciting finding in the present study was that in the bilingual infant cohort under random sampling, we were able to attribute the sources of systematic overestimation almost entirely to the conditions that we hypothesized would be potentially problematic for LENA: accidental contiguity, multiple adult or child voices, and electronic media playing. The present results therefore suggest that when none of these conditions is present, LENA’s CTC estimate comes close, on average, to manually counted CTs in the bilingual infant cohort, provided that segment sampling is random. Although an environment without any of the abovementioned interfering factors may be difficult to achieve in the typical day-to-day life of a busy toddler, these results nevertheless show that better agreement between the two methods is possible under specific circumstances. At the same time, however, our results also suggest that there are certain environments to which LENA’s reported levels of accuracy simply cannot and should not be generalized. For example, in a childcare setting where multiple adult and child voices are likely to be consistently present near the child wearing the recorder, accidental contiguity is likely to be frequent and the potential for error is very high. We therefore urge LENA users to refrain from assuming that the reported LENA levels of accuracy generalize across environments and segment types, including, but not limited to, (pre)school settings, multigenerational homes, and clinical settings. The present findings suggest that current methods of automatically assessing caregiver–child interaction are limited, and that human annotation remains necessary across all environments, especially those with multiple adult and/or child speakers.

Finally, we would like to clarify that it is not our goal to dismiss LENA on the basis of the results presented here. On the contrary, we find LENA to be an indispensable tool for collecting naturalistic, daylong audio samples. In the time it took our team to annotate 9,300 minutes of audio data, LENA could have provided automatic statistics for the language environments of thousands of children. We also recognize the potential of the LENA technology to capture powerful learning moments as they occur “in the wild,” such as Scenario 1, described in the Introduction of the present study. Nevertheless, our concern about LENA’s automatic CTC estimates continues to grow as we listen to the LENA recordings and consistently observe that adult–child speech in close temporal proximity is often not reciprocal. The bulk of our laboratories’ LENA analyses require careful listening to audio snippets. Note that this is not the case for the majority of LENA users, which include schools, nonprofits, hearing and speech schools, home visitors, early intervention programs, and others. Most of these organizations do not have access to the equipment, financial resources, or trained researchers needed to compare LENA’s estimates against human annotation. In fact, in several LENA products (LENA Grow, LENA Home, and LENA Start), the audio is deleted immediately and automatically after processing into data (i.e., adult words, conversational turns). The only two versions of LENA that allow for retention of audio are LENA Sp and LENA Pro, both of which are intended for research purposes. Deleting the audio is perfectly understandable from a confidentiality perspective; as a consequence, however, LENA users outside of the research community have no opportunity to think critically about the audio and how it relates to the automatic CT estimates. Those estimates are, as we demonstrate here, often inconsistent with what humans consider turn-taking between caregivers and children. In the present study, the differences between LENA’s CT estimates and manual CTCs varied widely (as indicated by the limits of agreement), as did the sources of CTC error. Together, this calls into question the comparability of the CTC measure across participants, conditions, and developmental time points, which is critical for most studies that use the LENA CTC metric. Until systematic reliability estimates of turn-taking across different contexts are available, LENA users should validate their conclusions and theoretical proposals through manual analyses.