Introduction

The LENA™ system (Language Environmental Analysis) is a research tool that was developed based on American English to automatically map the linguistic environments of young children and assess their vocal development (Ford et al., 2008; Oller et al., 2010). Studies conducted with LENA in the English language have reported quantitative gaps between children from low and high socioeconomic status [Footnote 1] (LSES and HSES, respectively) in exposure to language, dyadic interaction, and vocal development (e.g., Gilkerson et al., 2017). These gaps were already evident in infancy (Dwyer et al., 2019) and were found to have a negative impact on children’s later language and brain development (Romeo, Leonard, et al., 2018a). Of all the LENA variables, the best predictor of language development was the count of adult-child conversational turns (CTC), which reflects the degree of the child’s engagement in dyadic interaction (Romeo et al., 2021). The findings from these studies have led to the development of dedicated intervention programs aimed at increasing CTC in LSES children (e.g., Suskind et al., 2015). In these programs, LENA is also used to monitor the impact of the intervention on the children’s linguistic environment and vocal development. While the findings generally support the efficacy of LENA in intervention programs that aim to reduce linguistic gaps (e.g., Suskind et al., 2013), expanding its use to infants learning a language other than English is not straightforward. In a recent study, LENA’s CTC counts for English-speaking infants were found to be inaccurate relative to manual counts, with the greatest difference in the youngest infants (6 and 10 months) (Ramírez et al., 2021).

It has also been argued that LENA’s counting algorithm may be influenced by the acoustic, morphosyntactic, and prosodic characteristics of the spoken language (e.g., Ganek et al., 2018). To date, studies using LENA in a few non-English languages have mainly reported on reliability outcomes by correlating and comparing the counts of LENA and those of human transcribers (e.g., Canault et al., 2015). While the reliability was found to be quite good, there is a lack of data in non-English languages with respect to LENA’s ability to (1) identify the differences in linguistic environments and vocal development between children from low and high SES (i.e., construct validity), and (2) predict their language development (i.e., criterion validity). The fact that the Semitic languages of Hebrew and Arabic differ significantly from American English (e.g., Kishon-Rabin et al., 2005b; Kishon-Rabin & Rosenhouse, 2000) emphasizes the need to test LENA’s suitability for these languages. Hence, the aim of the present study was to test the applicability of the LENA system in the first year of life to Hebrew and Arabic, using the three measures of reliability, construct validity, and criterion validity.

The LENA™ system has two components. The first is a small wearable recording device (called the Digital Language Processor or DLP) which is worn by the child and can record up to 16 hours of the sounds in the child’s immediate environment. The second component is dedicated speech recognition software (LENA Pro) that analyzes the recordings offline for three main quantitative variables. These are the number of words the child is exposed to by adults within a radius of 1.2–1.8 meters (adult word count, AWC), the number of conversational turns between an adult and the child (CTC), and the number of vocalizations produced by the child (CVC). These parameters are extracted by algorithms that are capable of locating the acoustic signal in the recordings, segmenting the speech stream into utterances and defining its acoustic features, determining the gender of the speaker and their proximity to the recording device, and filtering out overlapping and nonspeech sounds (Ford et al., 2008; Oller et al., 2010). The algorithms rely on pitch level, the signal-to-noise ratio, the duration of silent pauses, changes in acoustic energy, formant transitions, changes in loudness, and statistical models (Cristia, Lavechin, et al., 2020b; Gilkerson & Richards, 2009). The additional features of the software include automatic estimation of the child’s vocal age (AVA) (Richards et al., 2009; Richards et al., 2017), assessment of developmental age (DA) based on a dedicated parental questionnaire (SNAPSHOT) which has large normative data for the English language (Gilkerson & Richards, 2008), and a measure of the extent to which the child is exposed to electronic sounds, such as television (TV).

Using the LENA system to assess linguistic environment and vocal functioning has several advantages compared to traditional tools such as manual transcriptions. First, the LENA system is equipped to efficiently store and analyze data obtained over 16 hours (a whole day), while transcription-based studies are limited to short periods of time (e.g., Hart & Risley, 1995). As a result, LENA can provide information not only about the child’s home environment, but also about other linguistic environments the child is exposed to during the day, such as kindergarten, the playground, relatives’ homes, etc. Furthermore, the long recording hours that LENA analyzes provide data on fluctuations in speech density throughout the day, whereas transcription-based studies rely on speech density at a particular hour (e.g., dinnertime) to assess the linguistic input during the day (e.g., Hart & Risley, 2003a). It has been argued that this extrapolation may lead to inaccuracies because it does not take the changes in the amount of speech between different hours of the day and/or different environmental contexts into account (Bergelson et al., 2019; Soderstrom & Wittebolle, 2013). Another advantage of the technology over human transcription is its savings in terms of time, money and manpower since the analysis is performed automatically by LENA Pro, which can complete the analysis of 16-hour recordings in just 3 hours (Oetting et al., 2009). Manual transcription, on the other hand, requires the training of transcribers as well as long and tedious hours of work, which also makes the entire process costly. Moreover, the possibility that human transcribers are influenced by the child’s linguistic environment and/or her development cannot be ruled out.

LENA’s reliability for English was assessed mainly by measuring the accuracy of the system in identifying the speaker and distinguishing speech from noise, as well as by correlating the automatic measurements of AWC with manual transcriptions (Xu et al., 2009). These showed good agreement ranging from 71% to 82% between the software and the transcribers in the identification of the speaker (adult vs. child), and the sounds recorded (speech vs. noise). The analysis also found strong correlations between LENA’s automatic counts of adult words and the manual counts of the transcribers (r = .92). In general, the automatic count was 2% lower than the manual count. This was explained by the fact that in contrast to the transcribers, LENA excludes sections with overlapping speech or noise (Gilkerson et al., 2008; Gilkerson & Richards, 2020; Xu et al., 2009). Due to the lack of reliability tests for CTC for English and its importance to language development, researchers have recently examined this variable in infants. In comparison to manual counting by transcribers, LENA’s CTC for infants aged 6–24 months was significantly higher (Ramírez et al., 2021). This finding suggests that CTC should be validated in this population before using LENA.

The construct validity of LENA was determined for English based on the sensitivity of LENA’s measures to distinguish between linguistic environments and vocalizations that characterize different populations (for a review see Ganek & Eriks-Brophy, 2018a; Wang et al., 2017). These studies had two main foci. The first was to characterize the linguistic environments and vocal functioning of children with organic language disorders compared to children evidencing typical development. These include, for example, children with autism (Warren et al., 2010; Yoder et al., 2013), language delay (Oller et al., 2010), hearing loss (Lehet et al., 2018; Sangen et al., 2018; Vandam et al., 2012; VanDam & Silbert, 2016; Vohr et al., 2014), and children who were born prematurely (Caskey et al., 2011). LENA has been shown to be sensitive to the differences between these populations and has provided insights into their developmental trajectory (e.g., Caskey & Vohr, 2013).

The second focus of research involved characterizing the linguistic environments and vocal functioning of children from low socioeconomic (LSES) (or low maternal education) backgrounds (Dwyer et al., 2019; Weisleder & Fernald, 2013). The findings showed that LSES children were less exposed to linguistic input and that their language achievements were significantly lower compared to high SES children (HSES) (Piot et al., 2021). The rising awareness of the developmental gaps of LSES children has led to the application of LENA’s technology in large scale intervention programs whose goal is to reduce the linguistic gaps between low and high SES children (e.g., Providence Talks: http://www.r2lp.org/early-childhood-development/providence-talks/; Thirty Million Words: https://tmwcenter.uchicago.edu/). These programs are usually based on parental guidance to increase the quantity and improve the quality of linguistic input. LENA’s role in these programs is mainly to monitor the changes in language input (AWC and CTC) and children’s progress (CVC) after the intervention.

The criterion validity was determined based on the associations between LENA’s measures and the assessments scores of standard tools for the English language such as PLS-4 and MB-CDI (Gilkerson et al., 2017; Wang et al., 2020). Specifically, a recent meta-analysis of 13 studies found moderate mean effect sizes for the correlation between CTC and CVC and language outcomes (r = 0.31, 0.32, respectively p < 0.0001); however, the effect size for AWC was low (r = 0.21, p < 0.0001; Wang et al., 2020). A large-scale study (over 300 participants) with ages ranging from 2 to 48 months revealed that LENA’s measures of CVC and CTC accounted for 7% to 16% of the variance in standardized language tests while the AWC accounted for up to 8% (Gilkerson et al., 2017). While these studies tested LENA’s three basic indices, other studies have focused on CTC. Their findings showed that CTC was associated with brain function and structure in children aged four to six. Specifically, a correlation was found between CTC and neural language processing (Romeo, Leonard, et al., 2018a) and between CTC and white matter connectivity in Broca’s area (Romeo, Segaran, et al., 2018b). Moreover, these correlations explained the association between CTC and the children’s language achievement (Romeo, Leonard, et al., 2018a; Romeo, Segaran, et al., 2018b). Given these and other findings, CTC is considered a valuable variable for language intervention programs (Romeo et al., 2021).

While these findings point to the efficacy of using LENA to monitor the effects of interventions as well as predict their outcomes in terms of language development, caution should be exercised when the target population speaks a language other than English. This is because LENA was developed for the American English language and was validated in most cases in this language (Wang et al., 2020). American English, however, is a West Germanic language that differs in its characteristics from other languages, which may affect the accuracy of LENA’s counting algorithm. To date, studies using LENA in seven non-English languages have mainly reported on reliability outcomes, as shown in Table 1. In these studies, sections were sampled from day-long recordings and the automatic counts were correlated with the manual transcriptions of LENA’s measures. However, as shown in Table 1, while three of the studies tested all of LENA’s measures (Busch et al., 2017; Gilkerson et al., 2015; Pae et al., 2016), the other four only tested some of them (AWC and/or CTC, CVC, TV) (Canault et al., 2015; Elo, 2016; Ganek & Eriks-Brophy, 2018b; Weisleder & Fernald, 2013). Moreover, the correlation between the automatic and manual counts differed across studies and ranged from 0.36 to 0.99. Further, the seven studies differed in their protocols, including the number of participants (ranging from 2 to 63) and their ages (ranging from 3 months to 5 years), the number of the sections that were transcribed (ranging from 10 to 324), and their duration (ranging from 5 minutes to 1 hour). A recent meta-analysis that examined LENA’s reliability in English and other languages found high correlations between the automatic and manual counts of AWC and CVC measures (0.79 and 0.77, respectively), but a low correlation for the CTC measure (0.36) (Cristia, Bulgarelli, & Bergelson, 2020a).
Overall, these findings indicate considerable variability in the procedures and outcomes across studies (for more information, see Table S1 in the Supplement).

Table 1 The reliability of LENA in seven non-English languages (Dutch, Korean, Mandarin, Spanish, French, Vietnamese, Finnish)

While overall, reliability in non-English languages was found to be quite good, there is a lack of data regarding LENA’s construct validity. One study compared the counts of CVC and CTC in a group of bilingual English- and Spanish-speaking LSES children and higher SES English-speaking children. Although significantly lower counts were found in the LSES group, the ability to draw conclusions from the findings was limited because the groups differed on additional variables (such as the duration of recording and the dominant language) (Wood et al., 2016). Similarly, there is scant information on the association between the variables measured by LENA and language development in non-English languages (i.e., criterion validity). To the best of our knowledge, only two non-English studies have addressed this issue. The first is a Chinese study that reported a positive association between LENA’s CTC and performance on the Bayley Scales of Infant Development (Xu et al., 2012). The second is a Finnish study that found an unexpected negative correlation (r = −0.47, p < 0.05) between LENA’s AWC and the children’s expressive vocabulary performance on the MB-CDI subtest (Elo, 2016). Due to this paucity of information, it is currently impossible to determine whether the LENA system is sensitive enough to monitor changes in linguistic environments of different populations and whether it has predictive value for later language development in languages other than English.

The issue of the suitability of LENA for non-English intervention programs is particularly relevant to the Semitic languages of Hebrew and Arabic, since they differ from English in several respects. In terms of acoustics, English and Semitic languages differ in phonemic inventory, voice onset time (VOT, the elapsed time between the release of a stop consonant and the onset of voicing), and prosodic patterns. Specifically, Hebrew and Arabic have fewer vowels than English (five in Hebrew, six in Arabic, and 12 in English) and their respective consonant systems include more sounds that are produced in the back of the oral cavity (also termed guttural sounds), such as /ɤ/, /ħ/, /ʕ/, /q/, /x/, in comparison to English (e.g., Hillenbrand et al., 1995; Kishon-Rabin et al., 1999; Most et al., 2000; Newman & Verhoeven, 2002; Satori et al., 2007). In terms of VOT, voiced stops in Hebrew and Arabic have negative values; i.e., voicing is initiated before the plosive’s release (voicing lead), whereas voiceless stops are characterized by positive VOT values (i.e., voicing is initiated after the release of the plosive or voicing lag) (e.g., Kishon-Rabin, Kochva, et al., 2005b; Segal et al., 2016a; Taitelbaum-Swead et al., 2003). In comparison, both voiced and voiceless stops in English are characterized by positive VOT (short and long lag, respectively) (e.g., Borden et al., 2003). For example, the VOT values of /b/ are (−92) in Hebrew, (−54) in Arabic, and (+11) in English (Segal et al., 2016a). In terms of prosody, most bisyllabic English words (~80%) have a trochaic stress pattern (i.e., the initial syllable is stressed followed by an unstressed syllable) (e.g., Clopper, 2002) compared to Hebrew for which the majority (~70%) of all bisyllabic words are iambic (i.e., final syllables are stressed) and the remainder (~30%) are trochaic (Segal et al., 2009; Segal & Kishon-Rabin, 2012).
In Arabic, in many cases, stress is assigned to the super-heavy syllable in a multisyllabic word (e.g., Segal & Kishon-Rabin, 2018). There is also evidence suggesting that the Semitic languages may differ from English in speaking rate. Studies have shown that Hebrew speakers talk at a faster rate than English speakers (e.g., Amir & Grinfeld, 2011; Robb et al., 2004).

The Semitic languages also differ considerably from English in their morpho-syntactic structure. Specifically, while Hebrew and Arabic have a synthetic structure with a rich bound morphology, English has a more analytical structure with fewer bound morphemes (Ravid, 2012b). For example, the three-word sentence “I saw him” in English can be expressed in one word in both Hebrew (/reiItihu/) and Arabic (/raIaytuhu/). The verb inflection system is also different. In Hebrew and Arabic prefixes, infixes and suffixes are regularly added to the verb to mark the pronoun and the tense. In English, for example, the utterance “I will go” is composed of the pronoun, the time, and the action in three free morphemes. This utterance is expressed in one word with bound morphemes in both Hebrew (/eIlex/) and Arabic (/saIathhabu/). Similarly, the English past tense phrase “I went” is expressed in one word in both Hebrew (/haIlaxti/) and Arabic (/thaIhabt/). Finally, the synthetic structure of Semitic languages is also expressed by replacing function words in English which are separate free morphemes, such as “of,” “that,” “the,” “and,” “as,” “to,” “in” with prefixes to the nouns /me/, /ʃe/, /ha/, /ve/, /ke/, /le/, /ba/ in Hebrew, respectively (Ravid, 2012b). In Arabic, prefixes such as /al/, /wa/, and /bil/ replace the English free words “the,” “and,” and “in.” Thus, the utterance “in the house” is produced as one word in Hebrew (babait). In Arabic it can be produced as one word (bil’beit) or two words (fi l’beit) (Hetzron & Kaye, 2009). In addition, the English language has grammatical articles (“a” and “an”) that are counted as separate words, compared to no grammatical articles in Hebrew and Arabic.

Thus, overall, the unique characteristics of the Hebrew and Arabic languages in comparison to the English language may have a significant impact on language-dependent measurements such as LENA’s automatic word count. Moreover, the reliability of the LENA CTC variable is questionable in the infant population. Therefore, the suitability of the LENA technology for Hebrew- and Arabic-speaking infants needs to be assessed prior to its implementation in language intervention programs.

The aim of the present study was to test the validity of the LENA system for Hebrew and Arabic in the first year of life using the following measures: (1) reliability (accuracy) by comparing the automatic counts of LENA recordings with manual counts by transcribers; (2) construct validity by assessing the sensitivity of the LENA count to differences between linguistic environments of infants with mothers with an academic degree compared to mothers with a high school education; and (3) criterion validity by associating LENA outcomes with a standardized language test.

Method

Participants

Thirty-two infants aged 3 to 11 months, with a mean age of 25.20 weeks (SD = 9.31), were included in the study with their mothers (Mean age = 32.7, SD = 5.59). Sixteen infants were Hebrew learners and 16 were learners of Levantine Arabic [Footnote 2]. In order to test LENA’s sensitivity to differences between linguistic environments, in each language, eight of the mothers had an academic degree (at least a BA or BSc) from an institution recognized by the Israel Council for Higher Education, and eight had a nonacademic education [Footnote 3]. The groups with higher and lower maternal education were termed HME and LME, respectively. The descriptive statistics for age and maternal education in each language are presented in Table 2. Infants’ gender was distributed equally in each language.

Table 2 The means, SD, minimum and maximum of infants’ age in weeks in each group of maternal education in each language. Low (no academic degree); high (with academic degree)

The participants were recruited through websites and Facebook groups for mothers after birth and through advertising in centers that provide developmental guidance for mothers. The inclusion criteria were as follows: infants born at full-term with no known risk factors for hearing loss or other developmental delays (JCIH, 2000), normal postnatal course, normal hearing based on passing neonatal hearing screening tests and within +1 standard deviation of the mean on the Infant-Toddler Meaningful Auditory Integration Scale for the assessment of auditory perception at the time of recording (IT-MAIS parental questionnaire: Kishon-Rabin et al., 2015; Segal et al., 2016b), regular visits to well-baby clinics, age appropriate development according to well-baby clinics, no health problems involving hospitalization or recurrent infections and/or fever, and no family history of developmental disability such as autism, deafness, etc. This information was collected via a self-reported detailed background questionnaire.

Apparatus

LENA recordings

Sixteen hours of recording for each of the 32 infants were obtained using the LENA DLP. After the recording, the DLP was collected and brought to the lab where the audio files were transferred to LENA Pro for the analysis of AWC, CTC, CVC, TV, and AVA.

Manual transcriptions

For each participant, we identified six sections of 10-minute recordings (6 × 32 = 192 sections in total). Sampling took place in two stages. Initially, we examined the AWC at an hourly level and chose the hours from which to extract samples so that they represent different speech densities (low to high) at different times of the day (morning to night) (see Table S2 and Figure S1 in the Supplement). Then, within those selected hours, we identified six sections of 10 minutes (each composed of two consecutive sections of 5 minutes each) by using LENA’s composite view (which includes AWC, CTC, and CVC). These sections were chosen to represent a wide range of AWC, CTC, and CVC values (see Table S3 in the Supplement). By doing so, we ensured that diverse samples were selected to test the reliability of LENA. In total, 1920 minutes were transcribed (6 × 10 × 32) and manually counted for the AWC, CTC, and CVC measures. The Hebrew and Arabic transcriptions followed the same protocol. The transcribers listened to each section while blind to the output obtained from LENA. That is, they were not provided with LENA’s segmentation or talker tags while annotating (Figure S2 in the Supplement provides an example of LENA’s annotation). The annotations were made using Microsoft Word software. For AWC, all adult speech was transcribed orthographically using the English alphabet to increase accuracy (the Hebrew keyboard has no niqqud marks) and then counted using the word count feature of Word. Specifically, adult words included productions of at least one syllable (e.g., Canault et al., 2015) whether meaningful (i.e., content or function words) or meaningless (i.e., gibberish) (e.g., Busch et al., 2017). Bound morphemes were not counted separately but rather were included in the count of the word to which they were bound (Ravid, 2012a). For example, the word habait (“the house”) was counted as one word.
The counting of CTC and CVC implemented the criteria in the LENA algorithm (Gilkerson & Richards, 2009). For CTC, the transcribers coded and then counted all cases in which the key child responded vocally to adult speech (or vice-versa) within 5 seconds (Ford et al., 2008). For CVC, all the speech-like vocalizations of the key child were coded as perceived by the transcribers using Oller’s infrastructural approach (2000). For example, a syllable-like vocalization perceived as /ba/ was transcribed as such even when the infant did not produce it in a canonical manner. We chose this method over others (e.g., phonetic transcription) because the aim was to identify the number of productions rather than their characteristics. A silence of 300 milliseconds or more was considered the boundary between different vocalizations. For example, if the silence pause between two /ba/ productions lasted 300 milliseconds or more, they were counted as two separate vocalizations.
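
The two counting criteria described above can be illustrated with a minimal Python sketch. It assumes that each vocalization or utterance has already been annotated with onset and offset times in seconds. The function names, the data layout, and the choice to count every adult-child speaker alternation within the 5-second window as one turn are illustrative assumptions; they are not the transcribers' actual procedure or LENA's implementation.

```python
from typing import List, Tuple

# Hypothetical data layout (an assumption for illustration):
# an interval is (onset_s, offset_s); a segment is (speaker, onset_s, offset_s),
# where speaker is either "adult" or "child".
Interval = Tuple[float, float]
Segment = Tuple[str, float, float]


def count_vocalizations(child_intervals: List[Interval], min_gap_s: float = 0.3) -> int:
    """Count child vocalizations; sounds separated by less than
    min_gap_s of silence are treated as one vocalization."""
    count = 0
    last_offset = None
    for onset, offset in sorted(child_intervals):
        # A new vocalization starts after a silence of at least min_gap_s.
        if last_offset is None or onset - last_offset >= min_gap_s:
            count += 1
        last_offset = offset if last_offset is None else max(last_offset, offset)
    return count


def count_turns(segments: List[Segment], max_gap_s: float = 5.0) -> int:
    """Count speaker alternations (adult to child or child to adult) in which
    the response begins no more than max_gap_s after the previous segment ends."""
    ordered = sorted(segments, key=lambda seg: seg[1])
    turns = 0
    for prev, cur in zip(ordered, ordered[1:]):
        speaker_changed = prev[0] != cur[0]
        gap = cur[1] - prev[2]
        if speaker_changed and gap <= max_gap_s:
            turns += 1
    return turns


# Toy example: the first two /ba/-like sounds are 0.1 s apart (one vocalization),
# the third follows a 0.4 s silence (a second vocalization).
demo_child = [(0.0, 0.4), (0.5, 0.9), (1.3, 1.6)]
demo_segments = [
    ("adult", 0.0, 2.0),
    ("child", 3.0, 4.0),    # responds 1 s after the adult: one turn
    ("adult", 12.0, 13.0),  # 8 s after the child: too late, no turn
    ("adult", 14.0, 15.0),  # same speaker: no turn
    ("child", 15.5, 16.0),  # responds 0.5 s after the adult: one turn
]
print(count_vocalizations(demo_child))  # 2
print(count_turns(demo_segments))       # 2
```

How chained alternations (for example adult, child, adult) are tallied is not specified in the text, so the sketch simply counts each qualifying alternation once.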

Two transcribers, a Hebrew native speaker and an Arabic native speaker, transcribed all recordings orthographically in Hebrew and Arabic, respectively. For transcriber reliability, four other transcribers, two native Hebrew speakers and two Arabic speakers who were blind to the LENA count, transcribed 30% of the recordings in their language. All transcribers underwent a short training period. More details on the transcription process are provided in the Supplement (Table S4). Inter-transcriber reliability was measured using intra-class correlation (ICC), with a two-way random model for computing absolute agreement (Koo & Li, 2016). The results of the ICC in Hebrew were 0.91 for AWC, 0.94 for CVC, and 0.89 for CTC. The results of the ICC in Arabic were 0.93 for AWC, 0.87 for CVC, and 0.93 for CTC (see Table S5 in the Supplement for the scripts of the ICC analysis).
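
For readers unfamiliar with the statistic, a two-way random-effects, absolute-agreement ICC can be computed from ANOVA mean squares, as in the sketch below. The single-measure form, often written ICC(2,1), is assumed here because the text does not state whether single or average measures were used. The function name and the toy ratings are hypothetical; the scripts actually used are those referenced in Table S5 of the Supplement.

```python
from typing import List


def icc_two_way_random_absolute(ratings: List[List[float]]) -> float:
    """Single-measure ICC for a two-way random model with absolute agreement.

    ratings has one row per rated sample and one column per transcriber.
    """
    n = len(ratings)       # number of rated samples
    k = len(ratings[0])    # number of transcribers
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for samples (rows), transcribers (columns), and residual.
    ss_rows = k * sum((m - grand_mean) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand_mean) ** 2 for m in col_means)
    ss_total = sum((x - grand_mean) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Toy data: the second transcriber always counts one more than the first.
# The counts are perfectly correlated, yet absolute agreement is penalized.
toy_counts = [[1, 2], [2, 3], [3, 4]]
print(round(icc_two_way_random_absolute(toy_counts), 3))  # 0.667
```

The toy data show why an absolute-agreement model is a stricter criterion than a correlation: a constant offset between two transcribers leaves the Pearson correlation at 1 but lowers the ICC.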

The standardized language test

The standard language questionnaire chosen for testing LENA’s criterion validity was the Production Infant Scale Evaluation (PRISE) questionnaire (Kishon-Rabin & Segal, 2016; Kishon-Rabin et al., 2005a; Segal et al., 2016a). The PRISE questionnaire is composed of 11 questions rating the frequency of occurrence of infants’ vocal behaviors according to the developmental stages of preverbal vocalization (i.e. phonation, cooing, expansion and so forth). Each question describes a vocal behavior and the parents are asked to report its frequency of occurrence on a scale from 0 (lowest) to 4 (highest) for evaluating the target behavior. The questions are such that as the child develops, the score increases.

The PRISE was chosen for the present study because it is one of the few standard language production tests that has been shown to be appropriate for assessing preverbal productions of infants in their first year of life, until and including the production of single words (Kishon-Rabin et al., 2009). In addition, it has a version in both Hebrew and Arabic (the PRISE was developed in Hebrew and was later validated for Levantine Arabic) (Kishon-Rabin et al., 2005a). PRISE [Footnote 4] data from 304 normal-developing infants between the ages of 2 weeks and 38 months (median 9.0) reported in Kishon-Rabin et al. (2015) showed a monotonic increase in performance with age (best fitted by an exponential function) until a maximum of 100% was reached at approximately 18 months. Most of the variance (87%) was explained by age with an average increase rate of 10% per 1.5 months for the linear part of the function (ages 6–13 months). In addition, the PRISE was found to be sensitive to changes in vocal behavior after cochlear implantation in infants with hearing loss (e.g., Kishon-Rabin et al., 2005a, 2010) and was sensitive to early developmental delays in infants with unilateral hearing impairments (Kishon-Rabin et al., 2015). In previous studies, the validity of the PRISE was confirmed against the Infant-Toddler Meaningful Auditory Integration Scale which evaluates auditory functioning of young infants through parents’ reports (Kishon-Rabin et al., 2015; Segal et al., 2016b) (the association between the IT-MAIS and the PRISE questionnaires in the present data is shown in Figure S3 in the Supplement).

In addition to the PRISE, a translation of the LENA SNAPSHOT questionnaire into Hebrew and Levantine Arabic was used to evaluate the language development of the infants. The SNAPSHOT questionnaire was designed for English-speaking children aged 2 to 36 months and has 52 questions on communicative and linguistic behaviors (Gilkerson & Richards, 2008). The questions are answered “yes” if the infant is already performing the described behavior or “not yet” if not. The LENA Pro software automatically scores the results (abbreviated DA).

Procedure

Two meetings were held within a week with each of the mothers. During the first meeting the mothers signed the informed consent form, received a folder with four questionnaires (medical background, SNAPSHOT, IT-MAIS, and PRISE) and were instructed to read the questions on the IT-MAIS and the PRISE and observe their infants’ behavior at home during the week and until the next meeting. In addition, the mothers received the LENA DLP, two dedicated garments with a pocket to place the DLP and an operating instruction sheet in Hebrew or Arabic. Then the mothers were given a demonstration of how to use the LENA DLP and were guided to choose a typical day for recording and not to alter their ordinary behavior on that day. During the second meeting, which took place after the recording, the experimenter collected the DLP from the mother and filled out the questionnaires by interviewing her. The experimenter then brought the DLP to the lab, where it was connected to a computer for automatic processing by the LENA Pro software.

Statistical analysis

The reliability between automatic and manual counts of AWC, CTC, and CVC was tested in each language separately using three types of analysis: (1) t-tests for paired samples to compare the means of the counts; (2) Pearson correlation coefficients to calculate the association between the counts; and (3) intra-class correlation coefficients (ICC) with a one-way random model to compute the absolute agreement between the counts. In order to determine whether the correlations between variables and between languages differ, Fisher’s r-to-Z transformation was used. The Pearson correlation coefficient was used to determine whether the participants’ age affected the outcomes. Construct validity for AWC, CTC, and CVC in the two maternal education groups (LME and HME) was tested in each language separately using a t-test for independent samples. In addition, the developmental measures (AVA and DA) and age were compared using t-tests and correlated using the Pearson correlation coefficient. Concurrent criterion validity for all of LENA’s measures (AWC, CTC, CVC, TV, AVA, and DA) was tested in the two languages together by associating them with the PRISE questionnaire using Spearman’s rank correlation coefficient.

Results

Reliability of LENA automatic counts

The reliability of the LENA system was measured by comparing (accuracy) and correlating the manual counts of six 10-minute samples from each recording (child) to the automatic counts (the raw data are shown in Tables S6 and S7 of the Supplement, and the scripts of the analyses are shown in Tables S8–S12 of the Supplement). The analysis included 32 infants, and each value entered into the analysis was the sum of that infant’s six sections (i.e., a total of 1 hour of recording). The means and standard deviations (in brackets) of the automatic and manual counts in Hebrew and Arabic are shown in Table 3. Table 3 also shows the results of the three analyses of the two counts (comparison, correlation, and agreement). Table 3 demonstrates that the manual counts did not differ from the automatic counts for CTC and CVC, for either language (p > 0.05). Furthermore, the two counts were significantly correlated (p < 0.05). Fisher r-to-Z transformations of independent samples showed that while the correlation values in Arabic were higher than in Hebrew (CTC: 0.82 vs. 0.57; CVC: 0.84 vs. 0.63), the differences were not statistically significant (CTC: z = −1.30, p = 0.10; CVC: z = −1.22, p = 0.11). The ICC values for CTC and CVC varied from fair to excellent with Arabic showing higher values than Hebrew (CTC: 0.81 vs. 0.47; CVC: 0.83 vs. 0.64). However, the overlap between their confidence intervals indicates these differences are not statistically significant either. For AWC, the automatic count was found to be significantly higher than the manual count in both languages (a difference of 24% in Hebrew and 32% in Arabic), and the effect sizes supported the statistical significance. The correlation between the counts was significant (p < 0.05). A Fisher r-to-Z transformation of independent samples showed no statistical difference between the correlation values of Arabic and Hebrew (0.76 vs. 0.65, respectively; z = −0.56, p = 0.29).
Similarly, the overlap between the ICCs’ confidence intervals suggests that the difference between the ICC values in Arabic and Hebrew was not significant (0.39 vs. 0.45, respectively).
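The Fisher r-to-Z comparisons above can be sketched as follows. The sketch assumes 16 infants per language (the group sizes given with the Mann-Whitney test in the Discussion) and a one-tailed p-value; neither assumption is stated alongside the reported values, but under them the textbook formula gives results close to the reported z and p. The function name is hypothetical.

```python
from math import atanh, erf, sqrt

def fisher_z_independent(r1, n1, r2, n2):
    """Compare two independent Pearson correlations via Fisher's r-to-Z.

    Returns (z, one-tailed p) under the standard normal approximation.
    """
    z = (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p_one_tailed = 0.5 * (1 + erf(-abs(z) / sqrt(2)))
    return z, p_one_tailed

# Hebrew vs. Arabic correlations reported in the text; n = 16 per language is an assumption
for label, r_hebrew, r_arabic in [("CTC", 0.57, 0.82), ("CVC", 0.63, 0.84), ("AWC", 0.65, 0.76)]:
    z, p = fisher_z_independent(r_hebrew, 16, r_arabic, 16)
    print(f"{label}: z = {z:.2f}, one-tailed p = {p:.2f}")
```

With only 13 effective degrees of freedom per group, even a gap as large as 0.57 vs. 0.82 does not reach significance, which is why the sample size is a plausible reason for the nonsignificant language differences.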

Table 3 Means and standard deviation (in brackets) of the automatic and manual counts in a total of 1 hour of recording (6 sections of 10 minutes for each infant) in Hebrew and Arabic and the results of the three analyses: (1) t-test which compared the automatic and manual counts for each LENA variable and language, (2) Pearson correlation coefficient for the correlation between manual and automatic counts in Hebrew and Arabic, and (3) ICC to test the agreement between the counts

To determine whether the participants’ age affected the outcomes, we correlated each variable (the transcribers’ and LENA’s counts of AWC, CTC, and CVC) with the child’s age in weeks. No significant correlation was found; hence, age was ruled out as a potential intervening variable (the results of the analysis are provided in Table S13 of the Supplement).

One possible explanation for the difference between the automatic and manual counts of AWC may be related to the fact that many function words in the Semitic languages are bound to content words and therefore were not counted as independent words as in English (Hetzron & Kaye, 2009). This results in overall fewer words in Hebrew and Arabic compared to English (Ravid, 2012b). To test this hypothesis, we recounted the Hebrew data by separating the function words from the content words. The new mean manual count in Hebrew was 1965.88 (SD = 643.45) which reduced the difference between the automatic and manual counts to 6.7% whereas the level of agreement between the counts remained the same (ICC: 0.44, −.042–.76). The t-test of the recounted data revealed no significant difference between the manual and automatic counts [t(15) = −0.83, p > 0.05, Cohen’s d = 0.21]. However, this reanalysis reduced the correlation coefficient between the manual and the automatic counts (r = .44, p > 0.05).
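The effect size reported for the recount is consistent with one common convention for paired designs, in which Cohen’s d equals the absolute t statistic divided by the square root of the number of pairs. The sketch below assumes this convention (the text does not state which formula was used) and takes n = 16 from the reported 15 degrees of freedom; the function name is hypothetical.

```python
from math import sqrt

def cohens_d_from_paired_t(t, n_pairs):
    """Effect size for a paired t-test: |t| / sqrt(number of pairs)."""
    return abs(t) / sqrt(n_pairs)

# t(15) = -0.83 implies 16 paired recordings
print(round(cohens_d_from_paired_t(-0.83, 16), 2))
```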

Construct validity

Construct validity was measured by testing the sensitivity of LENA’s measures to quantitative differences in linguistic environments and by correlating and comparing LENA’s developmental measures, AVA and DA, with chronological age (the raw data is shown in Tables S14 and S15 of the Supplement, and the scripts of the analyses are shown in Tables S16–S19 of the Supplement). First, we compared infants in the low and high maternal education groups (LME and HME, respectively). The comparison was based on the assumption that the two groups should differ in the quantity of exposure to speech (AWC) and dyadic interaction (CTC) as well as in their linguistic development (CVC) (e.g., Dwyer et al., 2019; Hart & Risley, 2003a, 2003b). To control for the length of the recordings, we used the 12-hour projected counts that are automatically provided by LENA Pro (e.g., Ganek, Smyth, Nixon, & Eriks-Brophy, 2018). The means and standard deviations of LENA’s measures in each maternal education group (across the two languages) are shown in Table 4. The t-test analysis revealed a significant difference between the two groups for all three variables (AWC, CTC, and CVC), reflecting higher counts for HME than LME. The effect sizes indicated that the magnitude of the difference between the two groups was very large (an analysis of the impact of maternal education on 1-hour samples of manual and automatic counts appears in Table S20 of the Supplement). A nonsignificant correlation between infants’ age in weeks and the 12-hour variables (AWC, CTC, and CVC) confirmed that age was not a confounding factor in the results (the results of the analysis are provided in Table S13 in the Supplement).

Table 4 Means and standard deviations (in brackets) of AWC, CTC, and CVC in 12-hour projected counts in the two groups of maternal education: LME (low maternal education) and HME (high maternal education) in the two languages together (Hebrew and Arabic)

There were significant positive correlations between the two developmental measures and age (DA: r = 0.80, p < 0.01, N = 30; AVA: r = 0.68, p < 0.01, N = 32). The t-test comparison showed no difference between chronological age (mean = 6.47, SD = 2.23) and the predictive measures DA [t(29) = −0.497, p > 0.05, mean = 6.33, SD = 2.39] and AVA [t(31) = 0.927, p > 0.05, mean = 6.59, SD = 2.38] (the scripts and results of the t-tests and correlation analyses are presented in Tables S17–S19 of the Supplement).

Concurrent criterion validity

Criterion validity was measured by correlating LENA’s measures (AWC, CTC, CVC, AVA, DA) with the scores of the PRISE questionnaire. The correlation coefficients are shown in Table 5, and a visual representation of the relationship between LENA’s CTC and AVA and the PRISE, by language and maternal education, is provided in Figs. 1 and 2, respectively. As shown, with the exception of AWC, all the LENA measures were positively correlated with the PRISE questionnaire (the raw data is shown in Tables S14 and S15 of the Supplement, and the correlations are shown in Figures S4–S7 and in Table S21 of the Supplement).

Table 5 Correlation coefficients between the LENA measures (AWC, CTC, CVC, AVA) in 12-hour projected counts, LENA’s SNAPSHOT questionnaire (DA), and the scores on the PRISE questionnaire. N = 32 for all measures except DA (N = 30)
Fig. 1 The relationship between the scores in the PRISE questionnaire and LENA’s Conversational Turns by language and maternal education, p < 0.05. HME = high maternal education, LME = low maternal education, dotted lines: N = 8, continuous line and equation: N = 32

Fig. 2 The relationship between the scores in the PRISE questionnaire and LENA’s Assessment of Vocal Age (AVA) by language and maternal education, p < 0.01. HME = high maternal education, LME = low maternal education, dotted lines: N = 8, continuous line and equation: N = 32

Discussion

This study is the first to report the validity of the LENA system in Hebrew and Arabic for reliability, construct validity, and criterion validity. The findings show (1) good reliability for the LENA’s automatic count on AWC, CTC, and CVC based on the positive associations and fair to excellent agreement between the manual and automatic counts; (2) good construct validity based on higher counts for HME vs. LME and positive associations between AVA and DA and age; and (3) good concurrent criterion validity based on the positive associations between the LENA counts for CTC, CVC, AVA, and DA and the scores on the preverbal parent questionnaire (PRISE).

Reliability of LENA automatic counts

The automatic count of the LENA system in Hebrew and Arabic was found reliable based on the good association between the automatic and manual counts for AWC, CTC, and CVC. This is consistent with studies that have tested LENA in languages other than English (Busch et al., 2017; Canault et al., 2015; Elo, 2016; Ganek & Eriks-Brophy, 2018b; Gilkerson et al., 2015; Pae et al., 2016; Weisleder & Fernald, 2013). The good agreement (ICC) found in this study further supports this conclusion.

The present study also compared the manual and automatic counts as an additional measure of reliability. No differences were found between the counts for CTC and CVC; however, higher counts were found for the LENA automatic count compared to manual counts for AWC in both languages. For the CVC and CTC variables, our findings are in close agreement with published data in Vietnamese and Mandarin for CTC (Ganek & Eriks-Brophy, 2018b; Gilkerson et al., 2015), and in Dutch for CVC (Busch et al., 2017). In French, although LENA underestimated the number of child vocalizations compared to the manual count, the researchers concluded that LENA was reliable for human counts in French (Canault et al., 2015). By contrast, for the AWC variable, the automatic count in our study overestimated this variable by 24% in Hebrew and 32% in Arabic, whereas it underestimated the counts (compared to manual counts) by 20% in Dutch (Busch et al., 2017), by 33% in French (Canault et al., 2015), and by 18% in Mandarin (Gilkerson et al., 2015). For Dutch, the authors argued that the gap between automatic and manual counting was influenced by the amount of speech in the sections that were transcribed (Busch et al., 2017), whereas in French, it was influenced by the amount of background noise (Canault et al., 2015).

While the data here support the previously published discrepancy between automatic and manual counts for AWC, our findings of overestimation of the automatic count disagree with the underestimation found in other languages (Dutch, French, and Mandarin). There are several possible explanations for the higher automatic counts on AWC in the present study. The first is that LENA’s algorithm applied different criteria to define a word compared to the human transcribers. Specifically, LENA’s algorithm for AWC relies on statistical models and acoustic information of segmental and prosodic features which could not be implemented in human coding. In contrast, the criteria applied by the transcribers to code adult speech were based on linguistic meaning in the case of content or function words, and on phonological structure in the case of gibberish (e.g., Busch et al., 2017). Furthermore, LENA’s algorithm is based on the acoustic characteristics of the English language, which differ from those of Semitic languages. In English, the differences in criteria between automatic and manual transcriptions did not have much impact (e.g., Xu et al., 2009); however, in Hebrew and Arabic, the differences may have been more consequential. For example, the AWC algorithm detects changes in loudness (Sphinx decoder) to identify the end of a word (a word’s loudness decreases between its beginning and end). However, a change in loudness may also occur following a stressed syllable (Sluijter et al., 1997). Thus, in Hebrew and Arabic, where stress occurs in the middle of multisyllabic words (Segal et al., 2009), LENA may count such a word as more than one word, which may lead to the overestimation of the automated LENA count for AWC in Semitic languages.

The second explanation for the overestimation of the LENA automatic count compared to the manual count may be related to the morpho-syntactic structure of Semitic languages. As noted above, function words in the Semitic languages are bound to the content words, thus lowering the number of manually counted words per utterance compared to the English language (Ravid, 2012b). When the data were recounted so that each bound word was counted as two words, the difference between the manual and automated counts became nonsignificant. However, the increase in the number of words in the recount was inconsistent across recordings. In some of the recordings, adults produced a relatively large number of function words, which increased the number of words relative to the automatic count, whereas in others, the adults produced few function words. Nonetheless, LENA counted more words than the transcribers. This inconsistency probably led to the lack of association between the automatic and manual counts, even though their means became more similar (accurate). Differences in morpho-syntactic structure between languages also include prefixes and suffixes to verbs in Semitic languages that mark the pronoun and the tense. This may also have influenced the count of adult words. Future studies should explore the influence of the morpho-syntax of Semitic languages on LENA’s counts, especially for adults and children beyond the two-word stage.

The third explanation for the discrepancies between the manual and automatic count of AWC may be related to LENA’s error in identifying the mother or the infant’s caregiver, possibly due to the specific characteristics of the speech directed at the infants. Most of the infant-directed speech (IDS) of adults is assumed to be characterized by a slower speaking rate, wider intonation ranges, and frequent use of high pitch (e.g., Farran et al., 2016). It has been reported that the LENA system is affected by the acoustic changes occurring in speech prosody (Gilkerson et al., 2008; Gilkerson & Richards, 2009; Richards et al., 2009). Thus, it is possible that the high pitch of IDS may have influenced LENA’s ability to correctly recognize the speaker in the recording. For example, a woman using IDS may have been identified as a child (CHN) so that her words were not counted in the AWC (Kim Coulter, personal communication). As an illustrative example, we tested LENA’s speaker identification in events that were labeled as a female speaking near the DLP (FAN) and as the key child (CHN), in two recordings, one from each language (Subject 7 from the Hebrew group and Subject 15 from the Arabic group). We chose the first 100 consecutive speech events from each recording (for details see Tables S22–S24 in the Supplement). In Arabic, the results showed that LENA labeled FAN correctly 75% of the time; however, 25% of the time, CHN was mistakenly identified as FAN. CHN was correctly identified 97% of the time, but in 3% of the events, FAN was mistakenly identified as CHN. In Hebrew, however, LENA only labeled FAN accurately 29% of the time because of incorrect identification of CHN (8%) and CXN (63%) which refers to another child speaking near the recording device (probably the infant’s sibling). CHN was tagged correctly 100% of the time. 
These results support the hypothesis that the discrepancies between manual and automatic coding in AWC may have stemmed from LENA’s misidentification of the adult who directed the speech toward the infant.

The higher values of the correlations between the manual and automatic count for CVC and CTC compared to AWC may be related to the fact that the manual transcription of the two variables (CVC and CTC) was conducted according to the published criteria of the count performed by the LENA algorithm. Specifically, CVC was coded based on the length of the silent pauses (> 300 milliseconds) that elapsed between infants’ vocal productions, and CTC was coded when an adult responded vocally within 5 seconds after the infant vocalized and vice-versa (Gilkerson & Richards, 2020). It is possible that since these criteria are time-based rather than language-based, the correlation between the automatic and manual counts was good. Another possible explanation for the good reliability of the CTC and CVC counts may be related to the fact that infants’ vocalizations as well as the interactions with them usually occurred near the recording device (because the infant was wearing a garment with the device). In contrast, a conversation that was not directed toward the infant could have taken place away from the device and was likely to be filtered out by LENA, which does not analyze distant speech (Ford et al., 2008). However, in the present study the transcribers were unaware of LENA’s labels of distant speech. Thus, it is possible that in these cases (characterized mainly by AWC) there was more of a mismatch between the transcribers and LENA compared to events in which the talking occurred near the device (characterized mainly by CVC and CTC). It is, however, noteworthy that the differences in the correlation values among the variables AWC, CTC, and CVC between the manual and automatic counts were not statistically significant, neither in Hebrew nor in Arabic.
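The time-based criteria described above (a silent pause longer than 300 milliseconds between child vocalizations, and a response within 5 seconds for a conversational turn) can be sketched as a simple counting procedure. This is only a simplified illustration of the manual coding rules, not LENA’s algorithm: the `Segment` structure, the speaker labels, and the decision to count every child-adult alternation as one turn are assumptions made for the example.

```python
from dataclasses import dataclass

PAUSE_S = 0.3  # silent pause (seconds) that separates two child vocalizations
TURN_S = 5.0   # maximum response latency (seconds) for a conversational turn

@dataclass
class Segment:
    speaker: str   # "child" or "adult" (hypothetical labels)
    onset: float   # seconds from the start of the sample
    offset: float

def count_child_vocalizations(segments):
    """Count child vocalizations, merging productions separated by <= 300 ms."""
    child = sorted((s for s in segments if s.speaker == "child"), key=lambda s: s.onset)
    count, last_offset = 0, None
    for s in child:
        if last_offset is None or s.onset - last_offset > PAUSE_S:
            count += 1
        last_offset = s.offset if last_offset is None else max(last_offset, s.offset)
    return count

def count_turns(segments):
    """Count child-adult alternations whose response begins within 5 seconds.

    Assumption: each alternation between adjacent segments counts as one turn.
    """
    ordered = sorted(segments, key=lambda s: s.onset)
    turns = 0
    for prev, nxt in zip(ordered, ordered[1:]):
        if prev.speaker != nxt.speaker and nxt.onset - prev.offset <= TURN_S:
            turns += 1
    return turns

# Hypothetical 12-second excerpt
sample = [
    Segment("child", 0.0, 0.5),
    Segment("child", 0.6, 1.0),    # 100 ms pause: same vocalization as above
    Segment("adult", 2.0, 3.0),    # responds within 5 s: one turn
    Segment("child", 10.0, 10.4),  # 7 s after the adult: no turn, new vocalization
    Segment("adult", 11.0, 12.0),  # responds within 5 s: one turn
]
print(count_child_vocalizations(sample), count_turns(sample))
```

Because both rules depend only on timing and speaker identity, they can be applied identically to any language, which fits the explanation offered above for the good agreement on CVC and CTC.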

While the values of correlation and agreement between the manual and automatic counts were higher in Arabic than in Hebrew, particularly for CTC and CVC, the differences between them were not statistically significant. To ensure that no confounding factors influenced the results differently in the two languages, several acoustic environmental factors in the recordings were compared between the two languages for the one-hour transcription sections. First, LENA’s subcategories of segment identification provided as part of the system’s analysis reports were examined. These include the duration of sections that LENA identified as MEANINGFUL speech (the part of the recording that is actually analyzed), the duration of sections in which adults spoke at a distance from the child (termed DISTANT), the percent of time in which the key child was exposed to electronic media (termed TV), the duration of sections in which LENA identified background noise, and the duration of sections in which LENA identified silence. We also compared the values of AWC, CTC, and CVC, the percent of IDS within AWC, and the percentage of male and female speech (the scripts and outcomes of all analyses are provided in the Supplement, Tables S25–S34). Hebrew and Arabic did not differ significantly in any of the variables, although the DISTANT variable showed a marginally significant difference between Hebrew and Arabic [0.27 vs. 0.22 hours; SD = 0.1 and 0.09, respectively; U(n1 = 16, n2 = 16) = 77, z = −1.92, p = 0.06]. This stemmed from the longer duration of speech at a distance from the key child in the Hebrew recordings than in the Arabic recordings. Nevertheless, repeating the correlations while controlling for DISTANT increased the gap between Hebrew and Arabic (the results of the analysis can be found in Table S35 in the Supplement).
While the results of the comparison indicate an acoustic similarity between the samples of the two languages, they suggest that any nonsignificant difference in agreement between the two languages may stem from some other explanation, such as the limitation of the sample size. This hypothesis should be tested in a future study with a larger sample.
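The z value reported for the DISTANT comparison can be reproduced from the U statistic with the standard normal approximation, as in the following minimal sketch. It assumes no tie or continuity correction, which the text does not specify; the function name is hypothetical.

```python
from math import erf, sqrt

def mann_whitney_z(u, n1, n2):
    """Normal approximation to the Mann-Whitney U statistic (no tie or continuity correction)."""
    mean_u = n1 * n2 / 2
    sd_u = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return (u - mean_u) / sd_u

z = mann_whitney_z(77, 16, 16)
# Two-sided p-value from the standard normal distribution
p_two_sided = 1 + erf(-abs(z) / sqrt(2))
print(f"z = {z:.2f}, two-sided p = {p_two_sided:.3f}")
```

Under these assumptions the approximation gives a z that matches the reported value and a two-sided p-value just above 0.05; a small departure from the reported p = 0.06 could arise from tie handling, a continuity correction, or an exact test.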

Construct validity

The present study reports on the sensitivity of LENA to the effect of maternal education in non-English languages. The larger counts for HME compared to LME support good construct validity in these Semitic languages. Studies have shown that maternal education affects some or all of LENA’s quantitative measures in the English language. Dwyer et al. (2019), for example, found higher counts for AWC and CTC but not for CVC in mothers having 16–18 years of schooling compared to 10–14 years of schooling. In another study, Greenwood, Thiemann-Bourque, Walker, Buzhardt, and Gilkerson (2010) divided the mothers into high school/General Educational Diploma (GED) or higher education and found a trend of higher counts in all LENA’s measures in the more highly educated mothers. In a third study, Gilkerson et al. (2017) divided the participants into four levels of maternal education (less than high school, high school/GED, some college, and BA degree or higher) and found that the higher the mother’s level of education, the greater the counts in LENA’s measures (AWC, CTC, and CVC); however, the differences in counts between the four groups were not always significant.

The association of AVA and DA with chronological age strengthens the construct validity of LENA’s measures by suggesting that the LENA system is capable of correctly assessing the vocal age of infants speaking Hebrew and Arabic during the first year of life and that the Hebrew and Arabic versions of the SNAPSHOT questionnaire reflect the course of language development in this period. AVA’s ability to properly assess the vocal age of Hebrew- and Arabic-speaking infants in the first year of life opens up the possibility of an automatic assessment of a measure which is currently only collected through parent questionnaires (e.g., the PRISE). It should be noted, however, that the AVA score is calculated based on age-dependent regression models (Kim Coulter, personal communication, July 2021), which probably contributed to the association between AVA and age. The correlation between DA and age suggests that the translation of the SNAPSHOT questionnaire into Hebrew and Arabic maintained the stages in language development. The fact that the calculation of DA is done independently of age (Kim Coulter, personal communication, July 2021) strengthens its validation and suggests it could be used among Hebrew- and Arabic-speaking infants in the first year of life.

Criterion validity

This is the first study to link the PRISE questionnaire with LENA variables. PRISE is one of the few standard language production tests that assesses infant preverbal productions during their first year. Specifically, the questionnaire rates the frequency of occurrence of infants’ vocal behaviors according to the developmental stages of babbling (i.e. phonation, cooing, expansion and so forth) (Kishon-Rabin et al., 2009). The LENA variables that were found to be associated with the PRISE were CVC, CTC, AVA, and DA. Moreover, LENA’s measures of CTC and CVC explained 16–17% of the variance of the PRISE score; i.e., values that are similar to what was found in English when using the Preschool Language Scale, Fourth Edition (PLS-4), Receptive-Expressive Emergent Language (REEL-3), and MacArthur-Bates Communicative Development Inventories (MB-CDI) (Gilkerson et al., 2017) and in the recently published meta-analysis (Wang et al., 2020). While the PRISE measures share some characteristics with LENA’s assessment of vocal age (AVA) and developmental age (DA), the association with CVC suggests that the number of vocalizations in the infant is related to the infant’s babbling stage. In addition, the association with CTC suggests that infants’ engagement in dyadic interaction could significantly impact their development of language as early as the first year of life. The lack of association between PRISE and AWC supports the notion that LENA accurately reflects infant vocalization. Overall, the positive association between the automatic counts of the LENA and PRISE contributes to the validity of the LENA system. Importantly, the findings of criterion validity alongside construct validity in AVA and DA support the applicability of these measures to Semitic languages in the first year of life.

Limitations

This study has several limitations. One limitation relates to the fact that the present study tested the validity of the LENA system for a specific age range (3 to 11 months) in which the productions of infants are universal in part (e.g., Oller, 2000). Thus, the validity of the automatic counts needs to be assessed at older ages, when children begin to talk and use language-specific morphology. Furthermore, the SNAPSHOT items pertaining to young infants describe universal behaviors such as the social smile, vocal play, and imitation (Gilkerson & Richards, 2008). However, at later ages, the items relate to language-specific behaviors that characterize the English language (e.g., adding “-ing” to a verb in order to indicate the progressive tense or using the grammatical articles “a” and “an”). Thus, the adaptation of the SNAPSHOT to Hebrew and Arabic should be assessed at older ages. Another limitation of this study is that the separation of the function words from the content words was only carried out for Hebrew. The replication of the separation process in Arabic may shed more light on the results of the reliability for AWC in Semitic languages. Finally, LENA’s segmentation and talker tags were not systematically tested in the present study. The small sample that was tested revealed some misidentifications of the speaker, which reinforces the need for such an examination.

Summary and implications

The present study supports the validity of LENA technology for assessing the linguistic environment and interactions of infants learning Hebrew and Arabic in the first year of life. Specifically, the automatic counts of AWC, CTC, and CVC and the assessment of AVA and DA were found to be applicable to Semitic languages. The findings emphasize the importance of obtaining several reliability measures (association and agreement) when assessing the validity of the LENA system for languages other than English, to help identify possible confounding factors.

In terms of the importance of early language intervention for reducing linguistic gaps in children from LSES, the outcomes of the present study support the use of the LENA system in early intervention programs for infants from Hebrew- or Arabic-speaking families. Such programs could use the LENA to monitor their effectiveness, as well as provide feedback to parents on the amount of language experience their children have in everyday life, and on the progress of their vocal productions. Since the LENA was found to be able to detect the relative difference between linguistic environments independent of language, its use in early intervention programs can help a wide range of populations of immigrants around the world who speak Semitic languages such as different dialects of Arabic.

The present outcomes also indicate the potential utility of LENA for assessing the linguistic environment and interactions in Hebrew- and Arabic-speaking infants from special populations, such as the hearing-impaired, children with cerebral palsy, and those with other developmental disorders. Nevertheless, to expand research possibilities, it is important to test LENA’s validity in children older than one year of age. If found suitable for children at different stages of language acquisition, it will encourage Semitic language researchers to use the LENA system to advance the theory and practice of language development and improve linguistic abilities in at-risk populations.