Participants
Our sample comprises 32 bilingual (mean age = 9 years and 1 month, SD = 2 years and 2 months; 18 females and 14 males) and 38 monolingual (mean age = 9 years and 9 months, SD = 1 year and 8 months; 22 females and 16 males) children, a total of 70.Footnote 6 The bilingual children are competent in both Greek and English to varying degrees and were recruited if at least one of their parents spoke Greek with them. The mean age of acquisition is 7 months (SD = 1 year and 2 months) for Greek and 2 years and 6 months (SD = 2 years and 9 months) for English. We excluded any trilingual participants.Footnote 7 Children were included in the study only if their non-verbal intelligence score was at least 80; all children met this criterion. Based on parental and teacher reports, the children did not have any hearing, behavioural, emotional, or mental impairment. More information is included in Table 4 and the “Descriptive statistics” section below.
Bilingual Greek–English children were recruited from a Greek supplementary school in the north-west of England. The school offered a Greek-speaking supplementary program for 2.5 to 3.5 hours a week to enhance the reading and writing of the Greek language. This program is supplementary to the mainstream English school that these children attend. Eight of the bilingual children were born in Greece and had lived in the UK for more than 2 years at the time of the study. The Greek monolingual control group consisted of children born and based in Greece.
Ethical approval was granted by the College of Arts and Humanities Research Ethics Committee at Bangor University. Information sheets were sent to the head teachers of schools and to the parents, and informed consent was obtained before data collection. Teachers, parents, and children were given enough time to ask any questions about the nature of the study. Parents and children were informed that they could withdraw at any time, and they were debriefed after the study.
Materials
Parental questionnaire
The children’s language experience was assessed through the Language and Social Background Questionnaire for Children (LSBQ) (Luk & Bialystok, 2013). The LSBQ was forward and backward translated into Greek and was completed by at least one of the parents/guardians in their preferred language (Greek or English). It collected information about the child’s age, grade, date of birth, country of birth, age of onset of all languages, whether the child played a musical instrument, and length of exposure to different educational mediums. The questionnaire also collected information about the parents’ language backgrounds. SES was measured as the mean of the highest attained educational level of both parents, rated on an eight-point scale. Parental education is the most commonly used index of SES background, is highly predictive of other SES indicators (e.g., income, occupation), and is a better predictor of cognitive performance than other SES indicators (Calvo & Bialystok, 2014). The child’s understanding and speaking of each of their languages was rated on a five-point scale ranging from Poor to Excellent. Language use with parents, siblings, grandparents, neighbours, friends, and caregivers in various situations was measured on a seven-point scale ranging from 1 (only English) to 7 (only Greek/or other language).
Non-verbal intelligence
Non-verbal intelligence was assessed using the Kaufman Brief Intelligence Test, Second Edition (KBIT-2) (Kaufman, 2004). It consists of 46 items including a series of abstract images, such as designs and symbols, and visual stimuli, such as pictures of people and objects. Participants were required to understand the relationships among the presented stimuli and complete visual analogies, indicating the relationship between the images either by pointing to the answer or by saying its letter. Each item offers at least five answer options, reducing the chance of guessing. The Matrices non-verbal subtest was individually administered and scored according to the KBIT-2 manual, and percentage scores for the Matrices subtest were obtained for all participants.
Language measures
English language measures
The British Picture Vocabulary Scale, Third Edition - BPVS3 (Dunn & Dunn, 2009) was used to assess the receptive vocabulary of the bilingual and monolingual children in the English language. It is an individually administered, standardised test of Standard English receptive vocabulary for children ranging from 3 years to 16 years and 11 months. In this task, children are asked to select, out of four coloured items in a 2 by 2 matrix, the picture that best corresponds to an English word read out by the researcher. The assessment consists of 14 sets of 12 words of increasing difficulty (e.g., ball, island, fictional). The administration is discontinued when a minimum of eight errors is produced in a single set.
The Clinical Evaluation of Language Fundamentals, Fourth UK Edition - CELF-4UK (Semel et al., 2006) is an individually administered standardised language measure used for the comprehensive assessment of a student’s language skills by combining core subtests with supplementary subtests. It is designed for children and adolescents from 5 to 16 years of age. The Expressive Vocabulary subtest was used here to assess the participants’ expressive vocabulary in the English language. Children were asked to look at a picture and name what they see or what is happening in it (e.g., for a picture of a girl drawing, the targeted response ‘colouring’ or ‘drawing’ scores 2 points and the response ‘doing homework’ scores 1 point). The administration is discontinued after seven consecutive zero scores.
The Test for Reception of Grammar, Version 2 - TROG-2 (Bishop, 2003) was used to assess receptive grammar. It is an individually administered standardised test for children and adults comprising 80 items of increasing difficulty, each with four picture choices. Children are asked to select the picture that corresponds to the target sentence read out by the researcher. For each grammatical element, there is a block of four target sentences; a block is passed only if the child responds correctly to all four of its items. The sentences use simple nouns, verbs, and adjectives. If a child fails five consecutive blocks, the administration is terminated.
Greek language measures
A standard Modern Greek version of the Peabody Picture Vocabulary Task-PPVT (Dunn, 1981) was adapted and used based on the Greek adaptation by Simos et al. (2011). The children clicked on the image, out of four possible choices, that best corresponded to the target word they heard (a noun, verb, or adjective). There were 173 items of increasing difficulty. If eight incorrect responses were given within ten consecutive items, the task was stopped. The answers were scored as correct (1) or incorrect (0).
The Picture Word Finding Test-PWFT (Vogindroukas et al., 2009) is an individually administered standardised measure used to assess standard Modern Greek expressive vocabulary. It is a tool norm-referenced for Greek, adapted from the English Word Finding Vocabulary Test - 4th Edition (Renfrew, 1995). The children are presented with 50 black-and-white images of nouns in developmental order. The words depict objects, categories of objects, and television programs and fairy-tales very familiar to children. The children are asked to name each object and, when ready, move to the next one. Responses are recorded on a score sheet during testing and later scored as correct (1) or incorrect (0). The assessment is discontinued after five consecutive wrong replies.
The Developmental Verbal Intelligence Quotient-DVIQ (Stavrakaki & Tsimpli, 2000) was used to assess Greek receptive grammar. It consists of five subtests measuring children’s language abilities in expressive vocabulary, understanding of metalinguistic concepts, comprehension and production of morphosyntax, and sentence repetition. The assessment measures language development in standard Modern Greek and was administered individually. For this study, only the subtest measuring comprehension of morphosyntax was used, for both Greek monolingual and Greek–English bilingual children. Each child was given a booklet with 31 pages, each including three images. The researcher read out a sentence and the child was asked to point to the picture that best represented the situation in the sentence. For example, for the sentence “μην καπνίζετε” (“do not smoke”), the correct answer depicted a “No Smoking” sign. An answer sheet was used to record the child’s answers (as A, B, or C) during testing, which were later scored as correct (1) or incorrect (0).
For each of the background language measures, we define the percentage score as the number of correct responses divided by the total number of correct and incorrect responses. Bilinguals were assessed on each of these background measures using one test in each language. Percentages were used in order to create a comparable scale across tests, which allows us to produce a composite measure.
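As a minimal illustration of this scoring (with made-up response counts, not data from the study):

```python
def percentage_score(n_correct, n_incorrect):
    """Percentage of correct responses out of all scored responses."""
    return 100.0 * n_correct / (n_correct + n_incorrect)

# e.g., 45 correct and 15 incorrect responses
score = percentage_score(45, 15)
```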
Executive function tasks
In this section, we present the administration details for the five executive function tasks, which span attention, working memory, inhibition, and shifting. All cognitive tasks were administered on a laptop using the experimental software E-Prime 2.0 (Schneider et al., 2002), a behavioural experiment package that provides an environment for computerised experiment design and data collection with millisecond-precision timing. We discuss each of these tasks in turn below.
Attention task
The Attentional Network Task (ANT) (Fan et al., 2005) was designed to evaluate three different attentional networks: i) alerting; ii) orienting; and iii) executive control (Posner & Petersen, 1990). Participants are asked to indicate the direction (left or right) in which the target stimulus (a fish appearing at the centre of the screen) points. The distance between the participant’s head and the centre of the screen was approximately 50 cm. The child’s task was to press the left or right mouse button (with the left or right index finger) corresponding to the direction in which the middle fish is swimming. The child was presented with a training block of 16 trials followed by 128 trials distributed across four experimental blocks, with breaks in between. During both the training and experimental blocks, auditory feedback was provided to the child.
Working memory tasks
The first task was a Counting recall task, an adaptation of the Automated Working Memory Assessment (Alloway, 2007). The children were presented on the laptop screen with images containing a varying number, from four to seven, of red circles and blue triangles, and had to remember the number of red circles in each image. The number of images to be recalled increases from one to seven, and each experimental block of a given length consists of four trials. If the child failed to correctly recall three trials in a block, the task stopped.
The second task was a Backward digit span task (BDST), adapted from Huizinga et al. (2006). The children began with two training trials in order to understand the task: typing the presented numbers in reverse order. For example, a child who hears the numbers 7 and 4 should type 4 and 7. Sequences begin at two numbers and gradually increase to eight, with four trials at each length. As in the previous task, if the child failed to correctly recall three trials in a block, the task stopped.
Both tasks were administered in the preferred language of the child. In all cases the preferred language was English for the bilingual children.
Inhibition task
The Nonverbal Stroop task was adapted from Lukács et al. (2016) and consisted of arrows pointing up, down, left, and right. Three experimental blocks of 60 trials each were presented to the children. The aim was to select, using the arrow keys on the laptop’s keyboard, the direction that the arrow indicated regardless of its position on the screen. The first, control block presented arrows in the middle of the screen (Stroop base). In the second, congruent block, the direction of the arrows matched their position on the screen (e.g., an arrow pointing upwards was presented at the top of the screen) (Stroop congruent). In the third, incongruent block, the direction of the arrows was the opposite of their position on the screen (e.g., an arrow pointing upwards was presented at the bottom of the screen) (Stroop incongruent). During administration, the second and third blocks were randomly mixed to enhance the conflict effect.
For accuracy, the number of correct answers on the incongruent items was subtracted from the number of correct answers on the congruent items. For speed, the difference in reaction times between congruent and incongruent trials represents the inhibition cost.
Shifting task
All children were also administered one shifting task, the Colour-shape task. This task included three blocks, in which children were presented with two shapes (triangle, circle) coloured either red or blue. The same buttons, one for the left hand and one for the right, corresponded to one of the choices (circle–triangle, red–blue). In the first two experimental blocks, the children’s task was to identify the shape of the stimulus and ignore its colour, or the reverse. The stimuli were presented in the top half and bottom half of the screen, respectively. In the third block, they were required to alternate between colour and shape depending on the stimulus’ location on the screen. Cues directing the participant to the relevant dimension were presented simultaneously with the stimuli on all trials, in all blocks. The first two blocks contained 32 trials each, while the third block contained 64. The number of shifting and non-shifting sequences within the third block was balanced. The difference in reaction times between the first two (non-shifting) blocks and the third (shifting) block represents the shifting cost.
Procedures
A pilot study with four children was conducted before the actual data collection. As a result of the pilot study, the fixed order of tasks described below was chosen so that the children did not become tired or lose interest. After the end of each session, the researcher thanked the child for their participation. All children participated enthusiastically.
The children were tested individually in a quiet school classroom setting, during one session in Greek and one session in English, each lasting 40 min on average. The second session was conducted within 1 month of the first. The researcher informed the children that they would play some games. Parents were administered the questionnaire (LSBQ) and returned it to the researcher, the classroom teacher, or the school’s head teacher.
The first session was the Greek session for the bilingual participants. Each child completed the tasks in the following fixed order: i) Greek adapted PPVT, ii) ANT, iii) PWFT, iv) Colour shape task, v) Nonverbal Stroop task, and vi) DVIQ. The second session was the English session for the bilingual participants. Each child completed the tasks in the following fixed order: i) KBIT-2, ii) BDST, iii) BPVS, iv) counting recall task, v) CELF-4, and vi) TROG-2.
The Greek monolingual children completed the tasks in the following fixed order: i) Greek adapted PPVT, ii) ANT, iii) PWFT, iv) Colour shape task, v) Nonverbal Stroop task, vi) DVIQ, vii) KBIT-2, viii) BDST, ix) Counting recall task.
Technical efficiency
In this section, we introduce the concept of technical efficiency, which may be viewed as a special case of a performance ratio. We use a random sample from our dataset and assume that each participant is a decision-making unit (DMU) that produces two outputs from two inputs. The outputs are the accuracy scores on two executive function tasks; the BDST and the Counting recall. The inputs are a measure of the non-verbal intellectual ability (KBIT-2) and a measure of the grammar skill (DVIQ). Ultimately, we are interested in comparing the performance of the DMUs. We illustrate three cases; case A considers one Output and one Input; case B uses two Outputs and one Input; case C uses two Outputs and two Inputs.
Table 2, Panel A, presents the output and input values for each of the ten participants of the random sample. Panel B calculates an array of performance measures associated with each of the three cases outlined above.
Table 2 Performance ratios and technical efficiency

In case A, the ratio BDST / KBIT-2 may be viewed as a performance measure where higher values denote superior performance, i.e., a higher BDST accuracy score achieved with a lower KBIT-2 score. Participant F has the highest value (1.278) and may therefore be viewed as the best performer, or the most efficient: s/he produces the most BDST accuracy per unit of KBIT-2 score. A graphical representation of the ten participants is given in Fig. 1a. The line that connects the axis origin (black line) to point D (the left-most in the graph) is the efficient frontier and envelops all the other points. By contrast, a regression line (orange line) goes through the middle of these points, a direct consequence of the estimation technique used. As such, while the regression line takes the “average” unit as the benchmark, allowing some units to over-perform and others to under-perform, frontier analysts take the efficient (i.e., best-practice) unit as the benchmark, so that all others under-perform relative to it.
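The case A ratio can be illustrated with a few lines of code. The scores below are made up for the example and are not the values in Table 2:

```python
# Hypothetical (participant, BDST accuracy, KBIT-2 score) triples
participants = [("A", 0.70, 95), ("B", 0.85, 90), ("C", 0.60, 105)]

# Case A performance ratio: single output divided by single input
ratios = {name: bdst / kbit for name, bdst, kbit in participants}

# The best performer (the efficient unit) has the highest ratio
best = max(ratios, key=ratios.get)
```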
In case B, the ratios BDST / KBIT-2 and Counting recall / KBIT-2 are defined. Points F, D, and E are of special interest as they are the furthest from the axis origin, hence they represent the best performers (i.e., the efficient ones). The participants represented by these three points are efficient in the sense that they produce the maximum output for a given level of input. Contrary to case A, the efficient frontier here is piecewise linear, made up of the efficient DMUs, and envelops all the inefficient combinations. For example, point I lies inside the frontier and has an efficiency score of Oy/Oy′, meaning that participant I’s performance could improve by Oy′ − Oy (the distance between point I and the efficient frontier).
Case C would require the ratios BDST / KBIT-2, Counting recall / KBIT-2, BDST / DVIQ, and Counting recall / DVIQ to be computed. However, in this case the visual representation would have to be multidimensional. A particular challenge, already apparent in case B, is that the points (F, D, E) are all efficient but have a different output/input mix. For example, point F is superior in terms of BDST, while point E is superior in terms of Counting recall. Variation in the output/input mix across DMUs becomes more pronounced as the number of outputs and inputs increases. Consequently, it is difficult to identify the participant with the overall best performance unless we assign some “desirability” to the outputs (and similarly to the inputs). For example, this could take the form of higher accuracy in the BDST being valued more than in the Counting recall.
To address the issue, Charnes et al. (1978) introduced the concept of technical efficiency in the form of a linear optimisation model – the CCR model. The novelty lies in the use of weighted outputs and weighted inputs to form a performance measure, known as technical efficiency. Technical efficiency may be viewed as a ratio where, in the numerator (denominator), each output (input) is assigned a weight. Each weight is non-negative and common to all DMUs, and may be viewed as a measure of the relative desirability of the outputs and inputs.
A linear optimisation technique that maximises the overall technical efficiency of the system is used to estimate the weights (Charnes et al., 1978). Hence, the weights, and consequently any implied ranking of outputs and inputs, are determined by the data themselves without any a priori information or assumptions.
Mathematically, and starting from the case of two outputs and two inputs (i.e., Case C), the technical efficiency ratio for a single DMU is given as:
$$ TE=\frac{u_1{y}_1+{u}_2{y}_2}{v_1{x}_1+{v}_2{x}_2} $$
(1)
where y1 and y2 are the BDST and Counting recall accuracy scores (outputs); x1 and x2 are the KBIT-2 and DVIQ scores (inputs); and u1, u2, v1, and v2 are the output and input weights, respectively.
We can generalise this to the case of R outputs and M inputs as follows:
$$ {TE}_j=\frac{{\overset{\sim }{u}}_1{y}_{1,j}+{\overset{\sim }{u}}_2{y}_{2,j}+\dots +{\overset{\sim }{u}}_R{y}_{R,j}}{{\overset{\sim }{v}}_1{x}_{1,j}+{\overset{\sim }{v}}_2{x}_{2,j}+\dots +{\overset{\sim }{v}}_M{x}_{M,j}} $$
(2)
Here we also add the subscript j which denotes the DMU with j = 1, 2, …, N as well as the tilde on top of the weights to denote that these are estimated through linear optimisation. Note that as the weights are common across all DMUs, they do not carry the j subscript.
The linear optimisation works by maximising the sum of TEj across all DMUs subject to the TEj being bounded between 0 and 1 (where 1 is assigned to the efficient DMUs) for each DMU, and to the weights being non-negative.Footnote 8 Mathematically:
$$ \underset{u,v}{\max}\sum \limits_{j=1}^N{TE}_j $$
(3)
$$ \mathrm{subject}\ \mathrm{to}:\left\{\begin{array}{c}0\le {TE}_j\le 1\\ {}{\overset{\sim }{u}}_1,{\overset{\sim }{u}}_2,\dots, {\overset{\sim }{u}}_R\ge 0\\ {}{\overset{\sim }{v}}_1,{\overset{\sim }{v}}_2,\dots, {\overset{\sim }{v}}_M\ge 0\end{array}\right. $$
(4)
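To make Eqs. (2)–(4) concrete, the sketch below computes TE_j for a handful of hypothetical DMUs and selects common weights by a crude grid search in place of a proper linear-programming solver. All values are made up; a real analysis would use an LP/DEA package:

```python
from itertools import product

def technical_efficiency(y, x, u, v):
    """TE_j = (u1*y1 + ... + uR*yR) / (v1*x1 + ... + vM*xM), as in Eq. (2)."""
    return sum(ui * yi for ui, yi in zip(u, y)) / sum(vi * xi for vi, xi in zip(v, x))

# Hypothetical DMUs: outputs (y1, y2) = (BDST, Counting recall) accuracies,
# inputs (x1, x2) = (KBIT-2, DVIQ) scores. Not real study data.
dmus = [((0.90, 0.80), (100, 30)),
        ((0.70, 0.90), (95, 28)),
        ((0.60, 0.50), (110, 35))]

# Grid search over common weights, maximising the sum of TE_j
# subject to 0 <= TE_j <= 1 for every DMU (Eqs. (3)-(4))
grid = [0.001, 0.005, 0.01, 0.02, 0.05]
best_total, best_tes = -1.0, None
for u1, u2, v1, v2 in product(grid, repeat=4):
    tes = [technical_efficiency(y, x, (u1, u2), (v1, v2)) for y, x in dmus]
    if all(0.0 <= te <= 1.0 for te in tes) and sum(tes) > best_total:
        best_total, best_tes = sum(tes), tes
```

Because the weights are common across DMUs, the resulting TE_j values are directly comparable, with efficient DMUs scoring 1.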
Data transformations
In our case, each child produces certain outputs while receiving certain inputs. We consider the output to be the executive function score, which may be viewed as a proxy for brain performance.
As per the Miyake et al. (2000) classification, three distinct but interrelated components of executive function are defined. These relate to an individual’s ability to switch between various tasks (switching/shifting), to maintain and process information in mind (working memory), and to suppress irrelevant information at any given moment (inhibition). Performance in each of these categories is assessed via the following tasks: i) BDST, ii) Counting recall, iii) Colour shape, iv) Non-verbal Stroop (Stroop), v) ANT. All of these tasks and their administration procedures have been explained in an earlier section.
In each task we record i) the accuracy (ACC) and ii) the response time (RT) of the child, which form our two outputs. The accuracy for each task and each child is calculated as the average accuracy over the task’s trials and ranges theoretically between 0 and 1. For tasks with congruent and incongruent trials, we use the average accuracy across both. Empirically, floor and ceiling scores are not observed, suggesting the tasks are appropriate for the children’s age. The higher the accuracy, the better the child’s performance.
The response time is measured in milliseconds and is only considered for correct answers to test questions. The lower the response time, the faster the response. Consistent with the literature, we exclude any response time below 200 ms (Antoniou et al., 2016). We also carry out an outlier treatment in line with Purić et al. (2017), trimming response times that lie outside a 3-standard-deviation bound.Footnote 9 As accuracy and response time are coded in opposite directions (higher accuracy but lower response time indicate better performance), we take the inverse of response time and dub this response speed (1/RT).Footnote 10 Hence, the two outputs in our case are: i) accuracy (y1); ii) response speed (y2). The inputs are: i) non-verbal intellectual ability (x1); ii) grammar skill (x2); iii) expressive vocabulary skill (x3); iv) receptive vocabulary skill (x4); v) age (x5).
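The response-time treatment (200 ms floor, ±3 SD trimming, inversion to response speed) can be sketched as below. The exact order of operations is our assumption, and the data are made up:

```python
from statistics import mean, stdev

def clean_response_times(rts_ms):
    """Sketch of the RT treatment described above:
    1) drop RTs below 200 ms,
    2) trim RTs outside mean +/- 3 SD of the remaining values,
    3) return response speeds (1/RT) for the surviving trials."""
    rts = [rt for rt in rts_ms if rt >= 200]
    m, s = mean(rts), stdev(rts)
    trimmed = [rt for rt in rts if abs(rt - m) <= 3 * s]
    return [1.0 / rt for rt in trimmed]

# Hypothetical trial RTs: one anticipatory response (150 ms) and one
# extreme outlier (5000 ms) among otherwise typical 400 ms responses
speeds = clean_response_times([150] + [400] * 15 + [5000])
```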
The grammar, expressive vocabulary, and receptive vocabulary skills of monolingual children are assessed in Greek using the DVIQ, the PWFT, and the Greek receptive vocabulary test, respectively. The same skills of bilingual participants are assessed in Greek using the same measures as for the monolinguals, and in English using the equivalent English tests, namely the TROG-2, CELF-4, and BPVS, respectively. With regard to intellectual ability, we used the Matrices subtest, the non-verbal component of the KBIT-2. Table 3 presents the mapping of the tasks for each group of participants.
Table 3 Task mapping per group

To arrive at comparable estimates of grammar, expressive vocabulary, and receptive vocabulary skills, we standardise the scores of the monolinguals and bilinguals. As the bilinguals have two measures for each skill, one in Greek and one in English, we follow three strategies to arrive at a composite measure of the respective skill. In the most naïve and easiest-to-implement strategy, we assume that all bilinguals are balanced between English and Greek, so the composite score is a simple (equal-weight) average of the respective tasks; this is Composite Score 1 (CS1). As the balanced-bilingual assumption may be strong, we introduce a second, more realistic composite score (CS2) that allows bilinguals to be more competent in a particular language. Under CS2, the composite measure is a weighted average of the individual tasks, with the weights calculated from the relative performance of the participants in the Greek and English versions of the test. Composite Score 3 (CS3) is similar to CS2, with the only difference that the relative weights are derived from the parental questionnaire; hence the relative competency level is self-declared. In the following analysis, we present the results based on CS2 and compare with the results of CS1 in the robustness section.Footnote 11
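The composite scores can be illustrated for a single skill. The CS2 weighting below (weights proportional to the scores themselves) is one plausible reading of the description above rather than the authors' exact formula, and the inputs are made-up scores:

```python
def composite_scores(greek, english):
    """CS1: equal-weight average, assuming balanced bilinguals.
    CS2: weighted average, with weights derived from relative performance
    (here assumed proportional to each language's score)."""
    cs1 = 0.5 * greek + 0.5 * english
    w_greek = greek / (greek + english)          # relative-performance weight
    cs2 = w_greek * greek + (1 - w_greek) * english
    return cs1, cs2

# A child scoring 0.80 on the Greek test and 0.60 on the English one
cs1, cs2 = composite_scores(0.80, 0.60)
```

Note that CS2 tilts the composite towards the child's stronger language, so it is never below CS1 under this weighting.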
Similar to regression models, a DEA analysis needs to be “well specified” in the sense that relevant variables should be included in the specification. In the case of regression, a minimum number of observations is required for estimation; statistical inference (e.g., hypothesis testing) requires additional observations and/or bootstrap techniques for small samples. Due to DEA’s non-parametric nature, the minimum sample size has no formal statistical basis. However, DEA’s discriminatory power depends on the relative numbers of inputs, outputs, and DMUs in the sample. As a rule of thumb, the number of DMUs should be at least 2–3 times the combined number of inputs and outputs (Banker et al., 1989; Golany & Roll, 1989). In our case, the number of DMUs is at least seven times the combined number of inputs and outputs.
Second-stage analysis
The technical efficiency estimate from the previous step may be used as the dependent variable in subsequent analysis. We investigate differences in the technical efficiency of monolingual and bilingual children in a second-stage analysis. We use three estimation methods: i) an ANCOVA, which is widely used in the literature; ii) a regression with bootstrap-corrected standard errors that corrects for potential small-sample bias (Cameron & Trivedi, 2005); and iii) a k-nearest-neighbour matching technique. We opt for k-nearest-neighbour matching as it is a non-linear, non-parametric technique that matches observations with similar characteristics. Its advantage is that it does not rely on a formal model (as propensity score matching does) and is thus more flexible. Like the propensity-score approach, it can match observations on both categorical and continuous variables. However, when matching on continuous variables, a bias-corrected nearest-neighbour matching estimator is necessary (Abadie & Imbens, 2006, 2011). More information is provided in Technical Appendix 2.
We allow for three formulations in each estimation method, hereafter referred to as Specifications A to C, respectively. These specifications are progressively less restrictive as they allow for decreasing similarities between the participants. In particular, specification A controls for differences with respect to non-verbal intellectual ability, grammar skill, expressive vocabulary skill, receptive vocabulary skill and age. Specification B further adds SES to specification A, while specification C further adds language use to specification B.
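The matching idea can be sketched in a few lines: pair each bilingual child with the monolingual child whose covariates are closest, then average the differences in technical efficiency. This is a bare one-nearest-neighbour illustration with made-up data, without the bias correction of Abadie & Imbens (2006, 2011) that a real continuous-covariate analysis would require:

```python
def nearest_neighbour_effect(treated, controls):
    """Average treated-minus-matched-control difference in the outcome.
    Each unit is a (covariates, outcome) pair; here the outcome is the
    technical efficiency score from the first stage."""
    def dist(a, b):
        # Euclidean distance in covariate space
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) ** 0.5

    diffs = []
    for cov_t, te_t in treated:
        _, te_c = min(controls, key=lambda c: dist(cov_t, c[0]))
        diffs.append(te_t - te_c)
    return sum(diffs) / len(diffs)

# Hypothetical units: covariates = (standardised grammar score, age), outcome = TE
bilinguals = [((0.1, 9.0), 0.80), ((0.5, 10.0), 0.90)]
monolinguals = [((0.0, 9.1), 0.75), ((0.6, 9.9), 0.85), ((1.5, 12.0), 0.60)]
effect = nearest_neighbour_effect(bilinguals, monolinguals)
```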