Behavior Genetics, 36:845

Heritability and Reliability of P300, P50 and Duration Mismatch Negativity


  • M. H. Hall
    • Social, Genetic and Developmental Psychiatry Research Centre, Institute of Psychiatry, King’s College London
    • Division of Psychological Medicine, Institute of Psychiatry, King’s College London
  • Katja Schulze
    • Division of Psychological Medicine, Institute of Psychiatry, King’s College London
  • Frühling Rijsdijk
    • Social, Genetic and Developmental Psychiatry Research Centre, Institute of Psychiatry, King’s College London
  • Marco Picchioni
    • Division of Psychological Medicine, Institute of Psychiatry, King’s College London
  • Ulrich Ettinger
    • Neuroimaging Research Group, Institute of Psychiatry, King’s College London
  • Elvira Bramon
    • Division of Psychological Medicine, Institute of Psychiatry, King’s College London
  • Robert Freedman
    • Departments of Psychiatry and Pharmacology, University of Colorado Health Sciences Center
  • Robin M. Murray
    • Division of Psychological Medicine, Institute of Psychiatry, King’s College London
  • Pak Sham
    • Social, Genetic and Developmental Psychiatry Research Centre, Institute of Psychiatry, King’s College London
    • Department of Psychiatry, University of Hong Kong
Original Paper

DOI: 10.1007/s10519-006-9091-6

Cite this article as:
Hall, M.H., Schulze, K., Rijsdijk, F. et al. Behav Genet (2006) 36: 845. doi:10.1007/s10519-006-9091-6



Event-related potentials (ERPs) have been suggested as possible endophenotypes of schizophrenia. We investigated the test–retest reliabilities and heritabilities of three ERP components in healthy monozygotic and dizygotic twin pairs.


ERP components (P300, P50 and MMN) were recorded using a 19-channel electroencephalogram (EEG) in 40 healthy monozygotic twin pairs, 19 of them on two separate occasions, and 30 dizygotic twin pairs. Zygosity was determined using DNA genotyping.


High reliabilities were found for P300 amplitude and latency, MMN amplitude, and the P50 suppression ratio. Intra-class correlations (ICCs) were 0.86 and 0.88 for P300 amplitude and P300 latency, respectively. The reliabilities of MMN peak amplitude and mean amplitude were 0.67 and 0.66, respectively, and that of the P50 T/C ratio was 0.66. Model-fitting analyses indicated a substantial heritable or familial component of variance for these ERP measures. Heritability estimates were 63% and 68% for MMN peak amplitude and mean amplitude, respectively, and 68% for the P50 T/C ratio. P300 amplitude heritability was estimated at 69%; for P300 latency, a significant familial effect was found, but there was insufficient power to distinguish between shared environmental and genetic factors.


The high reliability and heritability of the P300 amplitude, MMN amplitude and P50 suppression ratio components support their use as candidate endophenotypes for psychiatric research.


Keywords: Reliability, ERPs, Twins, Genetic analysis, Heritability

Introduction

The heritability of schizophrenia has been estimated at about 80% (Cardno and Gottesman 2000). Despite this, few if any susceptibility genes for schizophrenia have been definitively identified. This probably reflects schizophrenia’s polygenic aetiology, locus and allelic heterogeneity, epistasis, pleiotropy and environmental modification, all of which reduce the effect size of any single genetic variant on the clinical disorder (Weinberger et al. 2001). One strategy to help identify susceptibility genes for schizophrenia is to study neurophysiological processes that may lie on the pathway from genetic predisposition to clinical illness (Gottesman and Gould 2003). Such neurophysiological processes can be considered biological markers or ‘endophenotypes’. An ideal endophenotype should have a simpler genetic architecture than the clinical syndrome itself and would therefore provide more power in gene-finding approaches (Adler et al. 1999; de Geus 2002; Gottesman and Gould 2003). Good endophenotypes should be (1) meaningfully associated with the disorder, (2) stable over time and (3) under significant genetic control (de Geus 2002).

Event-related potentials (ERPs) have been widely studied as potential endophenotypes in psychiatry, and three ERP components in particular have been found to be associated with schizophrenia. Reduced P300 amplitude and delayed latency (Bramon et al. 2004; Jeon and Polich 2001), abnormalities of the P50 inhibitory response (Bramon et al. 2004) and reduced MMN amplitude (Umbricht and Krljes 2005) have been demonstrated in schizophrenic patients and also in their unaffected biological relatives (Adler et al. 1992; Blackwood et al. 1991; Cadenhead et al. 2005; Frangou et al. 1997; Freedman et al. 1994; Jessen et al. 2001; Michie et al. 2002). In addition, evidence from twins discordant for schizophrenia further supports P300 as a genetically transmitted vulnerability marker (Weisbrod et al. 1999). P300 amplitude is believed to index the updating of working memory in response to changes in the environment (Donchin 1981), whereas latency corresponds to stimulus evaluation time (Kutas et al. 1977). The P50 gating response is related to an individual’s ability to filter out repetitive stimuli to minimize information overload (Freedman et al. 1996). The generation of MMN is hypothesized to reflect a memory-related change-detection mechanism (Näätänen 1992); when the MMN exceeds a threshold, involuntary attention may be directed towards the stimulus change (Escera et al. 2002). In short, these ERP components are associated with information processing and the allocation of attentional resources, which are hypothesized to be impaired in schizophrenic patients (Braff 1993). Although abnormalities of P300, P50 and MMN are not specific to schizophrenia, the reproducibility of the findings in schizophrenic patients and their relatives suggests significant associations between these ERP parameters and the disorder, supporting them as promising candidate endophenotypes in line with the first criterion outlined above.

The main focus of this paper is on the second and third criteria for endophenotypes, i.e., the temporal stability of each ERP component and the degree of genetic contribution to its variance. ‘Temporal stability’ includes not just average levels of a characteristic but also its rate of change during development or in response to stimuli, and its variability over time and across conditions. For example, Shirley Hill and her colleagues have reported that P300 amplitude growth-model parameters differ in relation to familial risk for alcoholism: although the rate of change in the P300 differs between high- and low-risk groups during childhood and adolescence, the trajectories converge by young adulthood (Hill and Shen 2002). Segalowitz and Barnes (1993) suggested that the total phenotypic variance of an ERP measure can be expressed as a function of (a) trait variance, which represents stable subject characteristics such as gender; (b) stimulus variance, which is systematically manipulated in the experimental paradigm, such as target frequency or intensity; (c) variance due to the subject’s psychological or physiological state independent of stimulus properties, such as arousal or nicotine and caffeine use; and (d) measurement error. Most electrophysiological designs aim to maximize the variance due to trait, to standardize stimulus factors through systematic manipulation, and to minimize the uncontrolled variance due to state and measurement error. However, state variance and measurement error are critical in determining the basic reliability of the ERP measure itself and set an upper bound on its validity as an endophenotype (Segalowitz and Barnes 1993). When individuals are tested on multiple occasions, test–retest reliability reflects the extent to which subjects resemble themselves on retest (within-individual variation) compared with the variation among subjects (between-individual variation). Thus, high within-individual variance results in lower reliability; this variance is a non-valid portion of the ERP variance and reduces the statistical power of the research paradigm.
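The link between within-individual variance and reliability can be illustrated with a small simulation. The Python sketch below assumes a purely additive model in which each observed score is a stable subject (trait) value plus occasion-specific noise that lumps together state variance and measurement error, with stimulus variance held fixed by the paradigm. The function name and variance values are illustrative assumptions, not estimates from this study. Under this model the expected test–retest correlation is the between-subject variance divided by the total variance.

```python
import numpy as np

def simulate_retest_reliability(var_trait, var_within, n_subjects=20000, seed=0):
    """Simulate two recording occasions; return (observed r, expected reliability).

    Assumed (illustrative) model: score = trait + occasion-specific noise, where
    the noise variance lumps together state effects and measurement error.
    """
    rng = np.random.default_rng(seed)
    trait = rng.normal(0.0, np.sqrt(var_trait), n_subjects)   # stable across occasions
    occ1 = trait + rng.normal(0.0, np.sqrt(var_within), n_subjects)
    occ2 = trait + rng.normal(0.0, np.sqrt(var_within), n_subjects)
    observed_r = np.corrcoef(occ1, occ2)[0, 1]
    expected = var_trait / (var_trait + var_within)  # between / (between + within)
    return observed_r, expected

r, expected = simulate_retest_reliability(var_trait=3.0, var_within=1.0)
print(f"simulated r = {r:.3f}, expected = {expected:.3f}")
```

Raising `var_within` while keeping `var_trait` fixed lowers both numbers, which is the sense in which within-individual variance is a non-valid portion of the ERP variance.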

The third endophenotype criterion concerns the extent to which individual differences in ERP measures are determined by genetic and environmental factors, which can be studied in twin samples. In the classical twin design, comparing the resemblance of monozygotic (MZ) and dizygotic (DZ) twin pairs on a trait provides an index of the relative contributions of genetic (heritability) and environmental effects to the total phenotypic variance. Although ERP heritabilities have been examined in several family and twin studies (van Beijsterveldt and van Baal 2002), the majority of studies have considered only the P300 component. Only two papers have examined the heritability of the P50 index (Myles-Worsley et al. 1996; Young 1996), and, to our knowledge, there are no published reports on the heritability of MMN. These twin studies tested individuals on only one occasion and therefore could not separate variance due to measurement error from variance due to random individual-specific effects.
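For a rough first look at the classical twin logic, MZ and DZ correlations can be converted into variance proportions with Falconer's formulas. The Python sketch below is a swapped-in textbook heuristic for illustration only, not the maximum-likelihood structural equation modelling used in this study, and it ignores measurement error. The function names and example correlations are hypothetical.

```python
def falconer_ace(r_mz, r_dz):
    """Falconer approximations under an ACE model (no dominance).

    h2 = 2 (r_MZ - r_DZ); c2 = 2 r_DZ - r_MZ; e2 = 1 - r_MZ.
    A negative c2 (r_DZ < r_MZ / 2) hints at dominance rather than shared environment.
    """
    return {"h2": 2.0 * (r_mz - r_dz), "c2": 2.0 * r_dz - r_mz, "e2": 1.0 - r_mz}

def falconer_ade(r_mz, r_dz):
    """Solve r_MZ = a2 + d2 and r_DZ = a2/2 + d2/4 for an ADE model."""
    a2 = 4.0 * r_dz - r_mz
    d2 = 2.0 * r_mz - 4.0 * r_dz
    return {"a2": a2, "d2": d2, "e2": 1.0 - r_mz}

# Hypothetical correlations, not values from this study
print(falconer_ace(0.60, 0.30))   # h2 ≈ 0.6, c2 ≈ 0.0, e2 ≈ 0.4
print(falconer_ade(0.60, 0.20))   # a2 ≈ 0.2, d2 ≈ 0.4, e2 ≈ 0.4
```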

In this paper, we report a twin study involving repeated ERP recordings. Our aims are to assess the test–retest reliability of three ERP components (P300, P50 and MMN) and to estimate their respective heritabilities. All subjects were adults, and given the short test–retest interval in the present study, the temporal stability assessed here mostly reflects the test–retest reliability of the measurements. We applied structural equation modelling (SEM) in our genetic analyses: maximum likelihood estimates of the genetic, shared environmental, specific environmental and measurement error components were obtained by decomposing the observed phenotypic variances and covariances of the twin data.

Methods

Subjects

Forty monozygotic (14 male, 26 female; mean age 34.68 years, range 19–56 years) and 30 dizygotic twin pairs (4 male, 26 female; mean age 40.81 years, range 20–58 years) participated in the study. Twins were recruited from the Volunteer Twin Register, Institute of Psychiatry, London. A subset of 19 MZ twin pairs (8 male, 11 female; mean age 36.76 years, range 19–55 years) was tested on two occasions, with an average inter-test interval of 17.8 days (range 7–56 days), so that variance due to true individual-specific effects could be separated from measurement error. No significant sex or age differences were observed between the retested MZ subset and either the remaining MZ twin pairs (P = 0.4 and 0.43, respectively) or the rest of the twin sample (P = 0.06 and 0.58, respectively). MZ pairs were selected for retest at random (i.e., selection was not based on any trait or characteristic of the twins). In the model fitting, retest data for twins who were not retested were coded as missing and were assumed to be missing completely at random (MCAR; Little and Rubin 2000).
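Because non-retested twins simply lack the occasion-2 scores, a raw-data (full-information) maximum likelihood fit can evaluate each family's likelihood on its observed variables only. The Python sketch below shows that idea for one family under an assumed multivariate normal model; the function name, the NaN coding and the example covariance matrix are illustrative assumptions and do not describe Mx's actual interface.

```python
import numpy as np

def family_loglik(x, mean, cov):
    """Multivariate-normal log-likelihood of one family's scores using only the
    observed (non-NaN) entries; valid under the MCAR assumption stated above."""
    x = np.asarray(x, dtype=float)
    obs = ~np.isnan(x)
    resid = x[obs] - np.asarray(mean, dtype=float)[obs]
    sub_cov = np.asarray(cov, dtype=float)[np.ix_(obs, obs)]  # observed block only
    sign, logdet = np.linalg.slogdet(sub_cov)
    if sign <= 0:
        raise ValueError("covariance sub-matrix is not positive definite")
    quad = resid @ np.linalg.solve(sub_cov, resid)
    return -0.5 * (resid.size * np.log(2.0 * np.pi) + logdet + quad)

# Assumed variable order: twin 1 occ. 1, twin 1 occ. 2, twin 2 occ. 1, twin 2 occ. 2
cov = np.array([[1.0, 0.8, 0.6, 0.6],
                [0.8, 1.0, 0.6, 0.6],
                [0.6, 0.6, 1.0, 0.8],
                [0.6, 0.6, 0.8, 1.0]])
retested = family_loglik([0.2, 0.1, -0.3, -0.1], np.zeros(4), cov)
not_retested = family_loglik([0.2, np.nan, -0.3, np.nan], np.zeros(4), cov)
total = retested + not_retested  # independent families contribute additively
```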

All participants were screened using the Schedule for Affective Disorders and Schizophrenia-Lifetime Version (SADS-L; Spitzer 1978) or the Structured Clinical Interview for DSM-IV (SCID; First et al. 1997). Exclusion criteria were the presence of any psychotic disorder, any neurological disorder, hearing impairment or active substance abuse. One female DZ twin pair was excluded because one member had a history of psychosis. Zygosity was determined by genotyping 12 highly polymorphic DNA markers, except for 18 twin pairs, for whom zygosity was determined using a structured physical likeness questionnaire with greater than 90% accuracy (Nichols and Bilbro 1966). The study was approved by the UK Multi-centre Research Ethics Committees (MRECs). Written informed consent was obtained from all participants.

Procedure and Task

Each measurement session lasted 2 h and consisted of three separate recordings (P50, MMN and P300) carried out in a fixed order. Data were collected using Neuroscan software (Scan 4.0). Electroencephalogram (EEG) data were recorded continuously from 17 scalp sites placed according to the International 10/20 System (Jasper 1958), using silver/silver chloride electrodes on a Nihon Kohden PV-441A machine. The ground electrode was placed at the middle of the forehead and the reference on the left ear lobe. Eye movements were recorded from four locations (the outer canthus of each eye, and above and below the left eye). Electrode impedances were below 6 kΩ. Subjects were seated in a reclining chair and wore bilateral intra-aural earphones (Neuroscan, 10 Ω; Herndon, Virginia 22070) for auditory stimulus presentation. EEG activity was amplified 10,000 times with 0.03 Hz high-pass and 120 Hz low-pass filters and digitized at 500 Hz. All stimuli were generated and presented using the STIM system. Continuous EEG data and participants’ responses were stored for later off-line analysis.

MMN

MMN was elicited by an auditory duration oddball task using 4 blocks of 400 binaural 80 dB, 1000 Hz tones (ISI: 0.3 s). Of these tones, 85% were ‘standards’ (25 ms duration, 5 ms rise/fall time) and 15% were ‘deviants’ (50 ms duration, 5 ms rise/fall time). Subjects were instructed to keep their eyes open during the recording, to ignore the sounds and to rest their gaze on a picture placed in front of them.

Continuous data were epoched from 100 ms prestimulus to 300 ms poststimulus, filtered using a zero-phase-shift digital 0.1–30 Hz bandpass filter (24 dB/octave roll-off) and baseline corrected to the pre-stimulus interval. Individual epochs were rejected if the EEG amplitude exceeded 100 µV in any channel. Eye-blink artefacts were corrected using regression-based weighting coefficients (Semlitsch et al. 1986). Epochs were then averaged separately for the standard and deviant tones, excluding the first 20 stimuli of the first block. Mismatch negativity was extracted by subtracting the standard from the deviant waveform on a point-by-point basis for each individual. MMN amplitude was measured in two ways at FZ: as the peak amplitude and as the mean amplitude in the 50–200 ms latency range. MMN latency was defined as the latency of the maximal (peak) MMN amplitude.
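The difference-wave measures described above can be sketched in Python as follows, for a single channel such as FZ. The sketch assumes epochs that have already been filtered, artefact-screened and eye-corrected, sampled at the 500 Hz rate given earlier; the function names are hypothetical, and treating the MMN peak as the most negative point of the difference wave is our reading of “maximal MMN amplitude”.

```python
import numpy as np

FS_HZ = 500                                      # sampling rate stated in the text
TIMES_MS = np.arange(-100, 300, 1000 / FS_HZ)    # -100 to +298 ms, 200 samples

def baseline_correct(epochs, times_ms):
    """Subtract each epoch's mean over the pre-stimulus interval (t < 0)."""
    pre = times_ms < 0
    return epochs - epochs[:, pre].mean(axis=1, keepdims=True)

def mmn_measures(std_epochs, dev_epochs, times_ms=TIMES_MS, window_ms=(50, 200)):
    """Return (peak amplitude, peak latency, mean amplitude) of the
    deviant-minus-standard difference wave.

    std_epochs, dev_epochs: arrays of shape (n_trials, n_samples), one channel.
    """
    std_avg = baseline_correct(std_epochs, times_ms).mean(axis=0)
    dev_avg = baseline_correct(dev_epochs, times_ms).mean(axis=0)
    diff = dev_avg - std_avg                     # point-by-point subtraction
    win = (times_ms >= window_ms[0]) & (times_ms <= window_ms[1])
    i_peak = np.argmin(diff[win])                # MMN is a negativity
    return diff[win][i_peak], times_ms[win][i_peak], diff[win].mean()
```

With noise-free synthetic input, for example a deviant average containing a −3 µV Gaussian dip at 150 ms and a flat standard, the function returns a peak of about −3 µV at 150 ms and a smaller negative mean amplitude.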

P50

P50 was recorded with a conditioning–testing paradigm. Stimuli were 120 pairs of clicks (conditioning, C, and test, T) of 1 ms duration, with 500 ms between the two clicks of a pair and 10 s between successive conditioning clicks, presented in 4 blocks. Blocks were separated by 1-min breaks. Each individual’s hearing threshold was tested before the recording, and stimulus intensity was adjusted to 43 dB above that threshold. Data acquisition time was from 100 ms before to 400 ms after each click. Subjects were asked to fix their gaze on a target and to avoid all eye movements and blinks during the presentation of clicks.

Signal processing was performed using NEUROSCAN (4.3) software. A linked eye channel was created off-line by re-referencing the left vertical EOG to the left horizontal EOG using the software’s linear derivation function. The signal of each channel was epoched, creating 240 sweeps, and a 1 Hz high-pass filter (24 dB/oct) was applied. Epochs were then baseline corrected using the pre-stimulus interval. An automatic ocular artefact rejection procedure identified and rejected any sweep with activity greater than 20 µV at CZ or in the linked eye channel between 0 and 75 ms poststimulus (to capture blinks and other slow-wave activity).
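The linked eye channel and the 20 µV rejection rule can be sketched in Python as below. Two details are assumptions on our part: “activity greater than 20 µV” is read as an absolute-voltage criterion, and the sweeps are taken to be sampled at the 500 Hz rate given for the EEG. The function names are hypothetical and are not Neuroscan functions.

```python
import numpy as np

FS_HZ = 500                                      # assumed: rate stated for the EEG
TIMES_MS = np.arange(-100, 400, 1000 / FS_HZ)    # P50 sweep: -100 to +398 ms

def linked_eye_channel(veog_left, heog_left):
    """Re-reference the left vertical EOG to the left horizontal EOG
    (a linear derivation: VEOG minus HEOG)."""
    return np.asarray(veog_left) - np.asarray(heog_left)

def accepted_sweeps(cz, eye, times_ms=TIMES_MS, limit_uv=20.0, window_ms=(0, 75)):
    """Boolean mask of sweeps to keep.

    cz, eye: arrays (n_sweeps, n_samples), baseline corrected, in µV. A sweep is
    rejected if |voltage| exceeds limit_uv at Cz or in the eye channel anywhere
    in the 0-75 ms window (absolute-value criterion assumed).
    """
    win = (times_ms >= window_ms[0]) & (times_ms <= window_ms[1])
    bad_cz = (np.abs(cz[:, win]) > limit_uv).any(axis=1)
    bad_eye = (np.abs(eye[:, win]) > limit_uv).any(axis=1)
    return ~(bad_cz | bad_eye)
```

Activity outside the 0–75 ms window, such as a large deflection at 200 ms, does not trigger rejection under this rule.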

Accepted sweeps were averaged for each of the 4 blocks of 30 trials for C and T separately. Ideally each individual would have 4 average waveforms for C and T, one from each block. However, if the number of accepted sweeps per block was too small (<50% of trials), due to excessive eye movement or other artefacts, trials from consecutive blocks were combined by averaging. Average waveforms were then digitally filtered using a 10 Hz high pass filter (24 dB/oct) with zero phase shift and a 7-point moving average applied twice. The resulting bandwidth (10–100 Hz) isolates beta and gamma EEG frequencies, which have been previously shown to have the largest changes during habituation to repeated stimuli (Clementz et al. 1997a; Haenschel et al. 2000).
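The 7-point moving average applied twice is itself a linear filter. The Python sketch below, which covers only that smoothing step and not the 10 Hz high-pass filter, shows that two passes of a centred 7-point boxcar equal one pass of a 13-point triangular kernel; because the kernel is symmetric and centred, it introduces no phase shift. The helper names are hypothetical.

```python
import numpy as np

def moving_average(x, n=7):
    """Centred n-point moving average (zero phase for odd n; zero-padded edges)."""
    return np.convolve(x, np.ones(n) / n, mode="same")

def smooth_twice(x, n=7):
    """Apply the n-point moving average twice, as described for the P50 averages."""
    return moving_average(moving_average(x, n), n)

# Equivalent single kernel: a 13-point triangle whose central weight is 7/49 = 1/7
tri = np.convolve(np.ones(7) / 7, np.ones(7) / 7)
rng = np.random.default_rng(1)
x = rng.normal(size=200)
direct = np.convolve(x, tri, mode="same")
# Identical away from the edges, where zero padding treats the two routes differently
print(np.allclose(smooth_twice(x)[6:-6], direct[6:-6]))  # True
```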

P50 Peak Selection

We report P50 ERPs from CZ. The averaged waveforms for the C and T responses in each block were presented simultaneously on a high-resolution colour monitor for visual inspection. For the C response, the most prominent peak in the 40–75 ms post-stimulus window was selected as the P50 peak. The preceding negative trough was used to calculate the P50 amplitude (Clementz et al. 1997b; Nagamoto et al. 1989). If there was no clear trough on CZ, we used at least one other site (FZ or PZ) to help identify the trough. For the T response, the positive peak with latency closest to that of the C P50 peak was selected as the P50 T response. The T P50 wave amplitude was determined in the same way as for the conditioning response.
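The peak-selection rules above were applied by visual inspection; the Python sketch below is only an automated approximation of them under stated assumptions. “Most prominent peak” is taken to mean the largest local maximum in the 40–75 ms window, and the “preceding negative trough” is taken to be the nearest local minimum before the peak. The function names are hypothetical, and the sketch does not reproduce the fallback to FZ or PZ when the CZ trough is unclear.

```python
import numpy as np

def local_maxima(x):
    """Indices i with x[i-1] < x[i] >= x[i+1] (interior samples only)."""
    x = np.asarray(x)
    return np.where((x[1:-1] > x[:-2]) & (x[1:-1] >= x[2:]))[0] + 1

def preceding_trough(x, i_peak):
    """Walk back from a peak to the nearest preceding local minimum."""
    i = i_peak
    while i > 0 and x[i - 1] <= x[i]:
        i -= 1
    return i

def score_p50(cond, test, times_ms, window_ms=(40, 75)):
    """Return C and T P50 amplitudes (peak minus preceding trough) and latencies,
    or None if no P50 can be identified."""
    cond, test = np.asarray(cond), np.asarray(test)
    peaks = local_maxima(cond)
    in_win = peaks[(times_ms[peaks] >= window_ms[0]) & (times_ms[peaks] <= window_ms[1])]
    t_peaks = local_maxima(test)
    if in_win.size == 0 or t_peaks.size == 0:
        return None
    c_peak = in_win[np.argmax(cond[in_win])]      # largest peak in the window
    # T peak: the positive peak whose latency is closest to the C peak latency
    t_peak = t_peaks[np.argmin(np.abs(times_ms[t_peaks] - times_ms[c_peak]))]
    return {"c_amp": cond[c_peak] - cond[preceding_trough(cond, c_peak)],
            "c_lat": times_ms[c_peak],
            "t_amp": test[t_peak] - test[preceding_trough(test, t_peak)],
            "t_lat": times_ms[t_peak]}

# Synthetic example: T is a 30% copy of C shifted by 2 ms
t = np.arange(-100, 400, 2.0)
g = lambda mu, sd: np.exp(-0.5 * ((t - mu) / sd) ** 2)
res = score_p50(2.0 * g(56, 6) - g(36, 6), 0.6 * g(58, 6) - 0.3 * g(38, 6), t)
print(res["c_lat"], res["t_lat"], 100 * res["t_amp"] / res["c_amp"])
```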

Blocks without an identifiable P50 response were excluded from the grand average, because such blocks contain no peak within the defined 40–75 ms selection window and including them could shift the grand-averaged latency outside that window. Blocks containing EOG activity in the 40–75 ms post-stimulus window that exceeded the P50 wave, or containing a large negative–positive P30 complex (1.5 times larger than the P50 wave), were also excluded from the grand average (Nagamoto et al. 1989; Waldo et al. 1992). Data from all remaining blocks were included in the grand average. The average percentage of rejected trials was 13.9%. Grand averages for the C and T responses were compiled separately, and P50 peaks were determined in the same way as for individual block averages (Fig. 1). The P50 ratio (T/C) was calculated as the ratio of the T amplitude to the C amplitude, multiplied by 100. This ratio is a measure of P50 gating, with lower values indicating stronger auditory sensory gating. To avoid the readings being influenced by the physical similarity of the twins, the evaluator was blind to the true zygosity of the twins, data files were analyzed 4 months after collection, and files were stored in alphabetical order of first names rather than family names or pair-wise order. Two MZ twin pairs did not have clear P50 waves on both occasions owing to excessive artefacts, and one MZ twin pair did not complete the P50 recording on the second occasion; reliability analyses of P50 data were therefore available for only 16 of the 19 pairs.
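The gating indices follow directly from the C and T amplitudes. In the Python sketch below (hypothetical function name), the ratio is T/C × 100, percentage suppression is its complement, and the T–C difference is T minus C, following the table footnote that defines it as the difference between test and conditioning amplitude.

```python
def p50_gating(c_amp, t_amp):
    """P50 gating indices from C and T amplitudes (same units, e.g. µV).

    ratio: T/C x 100 (lower = stronger suppression);
    suppression: 100 - ratio, the percent reduction of T relative to C;
    difference: T - C.
    """
    if c_amp <= 0:
        raise ValueError("conditioning amplitude must be positive")
    ratio = 100.0 * t_amp / c_amp
    return {"ratio": ratio, "suppression": 100.0 - ratio, "difference": t_amp - c_amp}

print(p50_gating(c_amp=3.0, t_amp=0.9))
```

Applied to the sample mean ratio of 31.72 reported in the Results, the suppression is 100 − 31.72 ≈ 68%.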
Fig. 1

Example of a grand average for the P50 conditioning (a) and test (b) responses and the eye-movement channel (c)

P300

P300 was assessed using an auditory oddball paradigm (400 binaural 80 dB, 20 ms stimuli). Twenty percent of the stimuli were targets (1500 Hz) and 80% were standards (1000 Hz). Participants were instructed to press a button every time they heard a target tone. The inter-stimulus interval (ISI) varied between 1.8 and 2.2 s. Continuous data were epoched from 400 ms prestimulus to 1000 ms poststimulus and then filtered using a zero-phase-shift digital bandpass filter from 0.15 Hz (12 dB/octave roll-off) to 40 Hz (48 dB/octave roll-off). A zero-phase-shift 8.5 Hz low-pass digital filter was then applied, and the data were baseline corrected to the pre-stimulus interval. Eye-blink artefacts were corrected using regression-based weighting coefficients (Semlitsch et al. 1986). Subsequently, individual epochs were rejected if the voltage exceeded 50 µV at F7, F8, Fp1 or Fp2. Accepted target trials were examined and rejected if residual horizontal eye movements were present in the −100 to 800 ms period or if the post-stimulus waveform did not cross the baseline in the 0–800 ms range. Two separate averages, one for targets and one for standards, were then calculated from trials with a correct response. P300 amplitude and latency were reported at PZ; P300 amplitude was measured as the largest positive value at PZ between 280 and 600 ms. Two MZ twin pairs were excluded from the P300 reliability analyses because of excessive artefacts on the second occasion.
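The frontal rejection rule and the P300 peak measure can be sketched in Python as below. The sketch assumes ocular correction has already been applied, that the epochs are sampled at the 500 Hz rate given earlier, and that P300 latency is the time of the largest positive value; the function names are hypothetical, and the manual checks for residual horizontal eye movements and baseline crossing are not modelled.

```python
import numpy as np

FS_HZ = 500                                       # assumed: rate stated for the EEG
TIMES_MS = np.arange(-400, 1000, 1000 / FS_HZ)    # epoch: -400 to +998 ms

def frontal_ok(epochs_frontal, limit_uv=50.0):
    """True for epochs whose |voltage| stays within limit_uv on every frontal
    channel; epochs_frontal has shape (n_trials, 4, n_samples) for F7, F8, Fp1, Fp2."""
    return ~(np.abs(epochs_frontal) > limit_uv).any(axis=(1, 2))

def p300_peak(avg_pz, times_ms=TIMES_MS, window_ms=(280, 600)):
    """Amplitude and latency of the largest positive value at Pz in 280-600 ms."""
    win = (times_ms >= window_ms[0]) & (times_ms <= window_ms[1])
    i = np.argmax(avg_pz[win])
    return avg_pz[win][i], times_ms[win][i]

# Usage, assuming pz_epochs, frontal_epochs and a boolean `correct` array exist:
# keep = frontal_ok(frontal_epochs) & correct
# amplitude, latency = p300_peak(pz_epochs[keep].mean(axis=0))
```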

Statistical Analysis

Reliability and Descriptive Statistics

Reliability of each ERP variable was assessed by calculating intra-class correlations (ICCs) between occasions 1 and 2 using the Statistical Package for the Social Sciences (SPSS) version 11, selecting Scale, Reliability Analysis and then ICC from the menu. Maximum likelihood estimates of MZ/DZ twin correlations were computed using the SEM program Mx (Neale et al. 1999). Twin correlations were estimated from transformed sex- and age-regressed residuals. We calculated these values using Mx for several reasons: (i) by fitting models to pairs, the non-independence of observations could be accounted for; (ii) differences in means and SDs (variances) between MZ and DZ pairs, between the first and second MZ occasions, and between twin one and twin two could be tested by likelihood ratio; and (iii) MZ test–retest correlations, as well as MZ and DZ correlations with 95% confidence intervals (CI), could be estimated simultaneously. We used a model in which the 4 × 4 MZ covariance matrix (twin 1 occasion 1, twin 1 occasion 2, twin 2 occasion 1, twin 2 occasion 2) and the 2 × 2 DZ covariance matrix (twin 1 occasion 1, twin 2 occasion 1) were specified as \(S R S'\), where S is a diagonal matrix of standard deviations and R is a correlation matrix. Two additional constraints were applied to the MZ correlation matrix: (i) the correlations between occasions 1 and 2 were equated for twin 1 and twin 2, and (ii) the cross-twin correlations within and across occasions were equated, yielding one MZ twin correlation and one test–retest correlation. Using this model, we systematically tested differences in means and SDs between (a) MZ co-twins within occasion, (b) MZ co-twins across occasions, (c) DZ co-twins and (d) MZ and DZ pairs. In addition to the maximum likelihood estimates, observed Pearson twin correlations were computed on double-entered data for comparison.
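The descriptive quantities in this paragraph can be sketched in Python as follows. The text does not state which SPSS ICC model was selected, so the one-way random-effects, single-measure ICC below is an assumption; the residualization uses ordinary least squares on an intercept, sex and age, and the MZ covariance applies the two constraints described above. All function names are hypothetical, and this is not how Mx computes its maximum likelihood estimates.

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects, single-measure ICC for an (n_subjects, k) array:
    (MS_between - MS_within) / (MS_between + (k - 1) * MS_within).
    The ICC model used in the study is not stated; this choice is an assumption."""
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    subj_means = x.mean(axis=1)
    ms_between = k * ((subj_means - x.mean()) ** 2).sum() / (n - 1)
    ms_within = ((x - subj_means[:, None]) ** 2).sum() / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

def residualize(y, covariates):
    """Residuals of y after OLS regression on an intercept plus covariates (e.g. sex, age)."""
    X = np.column_stack([np.ones(len(y)), covariates])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def double_entry_r(twin1, twin2):
    """Pearson r on double-entered data: each pair enters as (twin1, twin2)
    and (twin2, twin1), so the arbitrary twin order does not matter."""
    a = np.concatenate([twin1, twin2])
    b = np.concatenate([twin2, twin1])
    return np.corrcoef(a, b)[0, 1]

def mz_retest_cov(sds, r_retest, r_twin):
    """MZ covariance S R S' for (T1 occ1, T1 occ2, T2 occ1, T2 occ2) with one
    test-retest correlation and one cross-twin correlation shared by the
    within- and cross-occasion twin pairings."""
    R = np.array([[1.0, r_retest, r_twin, r_twin],
                  [r_retest, 1.0, r_twin, r_twin],
                  [r_twin, r_twin, 1.0, r_retest],
                  [r_twin, r_twin, r_retest, 1.0]])
    S = np.diag(sds)
    return S @ R @ S.T

print(icc_oneway([[1, 2], [3, 4], [5, 6]]))       # 7.5 / 8.5 ≈ 0.882
```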

Genetic Model Fitting

To estimate the relative contributions of genetic and environmental influences to individual differences in MMN, P50 and P300, we employed genetic model fitting using Mx. We applied a model for repeated measurements in which the total phenotypic variance \(V_P\) is expressed as a linear function of the underlying genetic, environmental and measurement error factors: \(V_P = V_A + (V_D \text{ or } V_C) + V_E + V_M\), where \(V_A\), \(V_D\), \(V_C\), \(V_E\) and \(V_M\) refer to the variance associated with additive genetic, dominant genetic, shared environmental, individual-specific environmental and measurement error factors, respectively. The expected covariances for measurements across MZ and DZ twins are \(\mathrm{Cov}_{MZ} = V_A + (V_D \text{ or } V_C)\) and \(\mathrm{Cov}_{DZ} = \tfrac{1}{2}V_A + (\tfrac{1}{4}V_D \text{ or } V_C)\), respectively (Neale and Cardon 1992). Dominant genetic effects are indicated if the DZ correlation is less than half the MZ correlation, whereas a shared environmental effect is indicated if the DZ correlation is more than half the MZ correlation. However, dominant genetic and shared environmental effects cannot be estimated simultaneously using only twins reared together. Thus, either an ACEM or an ADEM model was considered, based on the twin correlations. A path diagram of this AC/DEM model is depicted in Fig. 2. The influence of each factor on the phenotype is given by the parameters a, d, c, e and m, which are equivalent to the regression coefficients of the phenotype on the latent factors. The variance due to each source is the square of the corresponding parameter, standardized by dividing by \(V_P\). For the ACEM model, heritability is calculated as \(V_A/V_P\); for the ADEM model, it is \((V_A + V_D)/V_P\).
Fig. 2

Path diagram of the genetic model for test–retest data. A: additive genetic factor; D: dominant genetic factor; C: shared environmental factor; E: specific environmental factor; M: measurement error factor. Parameters a, d, c, e and m are the regression coefficients of the observed phenotypes on the latent factors. Paths across occasions are constrained to be equal. Note: M is estimated from the test–retest data of a subsample of MZ pairs (not shown here)
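The expected covariance matrices implied by this repeated-measures model can be written out explicitly. The Python sketch below reflects our reading of the path diagram: A, C or D, and E load equally on both occasions, while M is occasion-specific, so only M separates a twin's test score from their retest score. The function name and the generic path `s` (standing for c or d) are hypothetical.

```python
import numpy as np

def expected_covs(a, s, e, m, model="ACE"):
    """Expected MZ (4 x 4) and DZ (2 x 2) covariances plus heritability.

    a, e, m: paths for A, E and M; s: path for C (model="ACE") or D (model="ADE").
    MZ order: (T1 occ1, T1 occ2, T2 occ1, T2 occ2); DZ order: (T1 occ1, T2 occ1).
    """
    if model not in ("ACE", "ADE"):
        raise ValueError("model must be 'ACE' or 'ADE'")
    va, vs, ve, vm = a ** 2, s ** 2, e ** 2, m ** 2
    vp = va + vs + ve + vm                 # total phenotypic variance
    retest = va + vs + ve                  # same twin, different occasion
    mz = va + vs                           # MZ cross-twin, any occasion
    dz = 0.5 * va + (vs if model == "ACE" else 0.25 * vs)
    cov_mz = np.array([[vp, retest, mz, mz],
                       [retest, vp, mz, mz],
                       [mz, mz, vp, retest],
                       [mz, mz, retest, vp]])
    cov_dz = np.array([[vp, dz], [dz, vp]])
    h2 = va / vp if model == "ACE" else (va + vs) / vp
    return cov_mz, cov_dz, h2

# P300 amplitude ADEM estimates from Table 1: a2 = 0.10, d2 = 0.59, e2 = 0.14, m2 = 0.17
cov_mz, cov_dz, h2 = expected_covs(*np.sqrt([0.10, 0.59, 0.14, 0.17]), model="ADE")
print(round(h2, 2), round(cov_mz[0, 1], 2))  # heritability and implied retest covariance
```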

Maximum likelihood estimates of the parameters (variance components) were obtained with Mx, which fits models to raw data and thus allows for missing data. A series of models differing in which variance components were freely estimated (as opposed to fixed at 0) was fitted to obtain maximum likelihood estimates and minus twice the log-likelihood (−2LL) of each model. The goodness-of-fit of all models was measured relative to a saturated model, in which the maximum number of parameters was estimated to describe the covariances, with all means equated: 10 and 3 variance/covariance parameters for the MZ and DZ twins, respectively, plus one overall mean. The overall goodness-of-fit of a model was assessed by the χ2 statistic and Akaike’s Information Criterion (AIC). A small χ2 with a non-significant P value indicates a good fit between data and model. The AIC assesses the fit of the model (χ2 statistic) relative to the number of parameters, providing an index for comparing nested models (Akaike 1987). The fit of sub-models (CEM, AEM, ACM, DEM, ADM, EM, AM and CM) was assessed by χ2 difference tests against the full ACEM or ADEM model (Neale and Cardon 1992). Confidence intervals of parameter estimates (and their explained variance) were obtained by maximum likelihood rather than from standard errors (Neale and Miller 1997).
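The bookkeeping behind these comparisons can be checked by hand. In the Python sketch below, degrees of freedom are the 14 saturated parameters (10 + 3 + 1 mean) minus the free parameters of a model, AIC is computed in the χ2-based form χ2 − 2df commonly reported by Mx, and nested models are compared with a χ2 difference test. The chi-square tail function is a small closed-form helper written for integer df; all function names are hypothetical.

```python
import math

SATURATED_PARAMS = 10 + 3 + 1  # MZ 4x4 (co)variances + DZ 2x2 + one common mean

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square statistic with integer df."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:                         # closed form for even df
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= h / i
            total += term
        return math.exp(-h) * total
    total = math.erfc(math.sqrt(h))         # df = 1 term, then odd-df corrections
    term = math.sqrt(2.0 * x / math.pi) * math.exp(-h)
    for i in range(1, (df - 1) // 2 + 1):
        if i > 1:
            term *= x / (2 * i - 1)
        total += term
    return total

def model_df(n_free_params):
    return SATURATED_PARAMS - n_free_params  # ACEM: a, c, e, m + mean -> 9

def aic(chi2, df):
    """Chi-square-based AIC (chi2 - 2 df); smaller is better."""
    return chi2 - 2 * df

def compare_nested(chi2_full, df_full, chi2_sub, df_sub):
    d_chi2, d_df = chi2_sub - chi2_full, df_sub - df_full
    return d_chi2, d_df, chi2_sf(d_chi2, d_df)

# MMN peak amplitude (Table 1): full ACEM 14.7 (9) vs AM 17.9 (11)
d_chi2, d_df, p = compare_nested(14.7, 9, 17.9, 11)
print(f"delta chi2 = {d_chi2:.1f}, df = {d_df}, p = {p:.3f}")  # delta chi2 = 3.2, df = 2, p = 0.202
print(f"AIC full = {aic(14.7, 9):.1f}, AIC AM = {aic(17.9, 11):.1f}")  # AIC full = -3.3, AIC AM = -4.1
```

The lower AIC of the AM model, together with the non-significant χ2 difference, matches the model selection reported for MMN peak amplitude.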


Owing to the small sample size, the study did not have sufficient statistical power to distinguish between alternative sub-models for many variables, so the ‘best-fitting model’ for these variables could not be unambiguously identified. For these variables, only the full model is presented in Table 1.
Table 1

Standardized estimates of full model and best-fitting model for each ERP variable
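
ACE models (% variance accounted for, with 95% CI: a² additive genetic, c² shared environment, e² specific environment, m² measurement error; fit index: χ² (df))

Variable | Model | a² | c² | e² | m² | χ² (df)
MMN peak amplitude | ACEM (full) | 0.60 (0.06–0.73) | 0.00 (0–0.46) | 0.13 (0–0.33) | 0.27 (0.16–0.45) | 14.7 (9)
MMN peak amplitude | AM (best fitting) | 0.63 (0.46–0.75) | – | – | 0.37 (0.25–0.54) | 17.9 (11)
MMN mean amplitude | ACEM (full) | 0.68 (0.31–0.79) | 0.00 (0–0.33) | 0.00 (0.0–0.11) | 0.32 (0.21–0.47) | 16.6 (9)
MMN mean amplitude | AM (best fitting) | 0.68 (0.53–0.79) | – | – | 0.32 (0.21–0.47) | 16.6 (11)
P50 C amplitude | ACEM (full) | 0.35 (0–0.55) | 0.00 (0–0.45) | 0.15 (0–0.47) | 0.49 (0.3–0.77) | 15.6 (9)
P50 T amplitude | ACEM (full) | 0.66 (0.24–0.79) | 0.00 (0–0.31) | 0.07 (0–0.30) | 0.26 (0.15–0.48) | 25.1 (9)
P300 latency | ACEM (full) | 0.48 (0.00–0.76) | 0.17 (0–0.66) | 0.06 (0.00–0.21) | 0.29 (0.19–0.45) | 45.4 (9)

ADE models (d² dominant genetic)

Variable | Model | a² | d² | e² | m² | χ² (df)
MMN latency | ADEM (full) | 0.00 (0–0.48) | 0.31 (0–0.53) | 0.14 (0–0.45) | 0.55 (0.34–0.85) | 13.7 (9)
P50 C latency | ADEM (full) | 0.00 (0–0.59) | 0.47 (0–0.65) | 0.00 (0–0.33) | 0.53 (0.30–0.75) | 22.15 (9)
P50 T latency | ADEM (full) | 0.00 (0–0.58) | 0.42 (0–0.61) | 0.28 (0.02–0.57) | 0.30 (0.17–0.54) | 13.94 (9)
P50 T/C ratio | ADEM (full) | 0.00 (0–0.65) | 0.68 (0–0.81) | 0.11 (0–0.34) | 0.20 (0.11–0.38) | 16.17 (9)
P50 T–C difference | ADEM (full) | 0.00 (0–0.57) | 0.40 (0–0.59) | 0.14 (0–0.43) | 0.46 (0.28–0.73) | 11.69 (9)
P300 amplitude | ADEM (full) | 0.10 (0–0.79) | 0.59 (0–0.80) | 0.14 (0.03–0.30) | 0.17 (0.11–0.29) | 19.86 (9)
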
Estimates were obtained with means constrained to be equal across twins and occasions. The lower and upper 95% likelihood-based confidence intervals (CI) are reported in brackets; a CI including 0 indicates a non-significant influence of the parameter

Results

MMN

MMN variables were normally distributed. Reliability of MMN amplitude was high and similar for the peak and mean amplitude measures (Table 2). For MMN latency, much lower reliability and MZ correlation were observed. Differences in means were found only for MMN latency, between DZ co-twins. Differences in variance were found between MZ co-twins within and across occasions for both peak and mean amplitude, and between zygosity groups for mean amplitude. DZ correlations were about half the MZ correlations for both mean and peak amplitude, suggesting additive genetic effects, whereas for latency the MZ correlation was more than twice the DZ correlation, indicating dominance (Tables 3 and 4). Therefore, in the genetic model fitting, we fitted an ACEM model for mean and peak amplitude and an ADEM model for latency.
Table 2

Means, SD and test–retest reliability (ICC) with 95% CI of MMN, P50 and P300 indexes

Index | Test–retest ICC (95% CI)
MMN peak amplitude | 0.67 (0.44–0.81)
MMN mean amplitude | 0.66 (0.44–0.81)
MMN latency | 0.34 (0.03–0.59)
P50 C amplitude | 0.56 (0.26–0.76)
P50 C latency | 0.42 (0.10–0.67)
P50 T amplitude | 0.57 (0.27–0.77)
P50 T latency | 0.75 (0.54–0.87)
P50 T/C ratio | 0.66 (0.40–0.82)
P50 T–C difference | 0.59 (0.29–0.78)
P300 amplitude | 0.86 (0.74–0.93)
P300 latency | 0.88 (0.78–0.94)

SD = Standard deviation; ICC = Intra-class correlation. Amplitude reported in microvolt; latency reported in millisecond; P50 ratio reported in percentage

Table 3

Maximum likelihood estimates of MZ and DZ twin correlations (& 95% CI) of MMN, P50 and P300 indices

Index | MZ | DZ
MMN peak AMP | 0.61 (0.39–0.76) | 0.25 (−0.12–0.55)
MMN mean AMP | 0.66 (0.50–0.79) | 0.26 (−0.10–0.56)
MMN LAT | 0.27 (0.02–0.50) | −0.05 (−0.40–0.31)
P50 C AMP | 0.35 (0.11–0.56) | 0.19 (−0.18–0.52)
P50 C LAT | 0.46 (0.23–0.65) | −0.04 (−0.42–0.35)
P50 T AMP | 0.50 (0.26–0.68) | 0.30 (−0.07–0.59)
P50 T LAT | 0.50 (0.25–0.69) | −0.36 (−0.64–0)
P50 T/C ratio | 0.52 (0.28–0.70) | 0.04 (−0.33–0.39)
P50 T−C DIFF | 0.37 (0.12–0.58) | 0.06 (−0.31–0.42)
P300 AMP | 0.72 (0.55–0.84) | 0.18 (−0.19–0.50)
P300 LAT | 0.68 (0.50–0.80) | 0.38 (−0.01–0.66)

Correlations are estimated from sex- and age-regressed transformed scores. AMP=amplitude; LAT=latency; C=condition; T=test; T−C DIFF=difference between test and condition amplitude

Table 4

Observed MZ and DZ twin correlations of MMN, P50 and P300 indices

N (pairs) | MMN | P50 | P300
MZ, 1st occasion | 38 | 37 | 39
MZ, 2nd occasion | 19 | 16 | 17
DZ | 29 | 27 | 27

Correlations are estimated from sex- and age-regressed transformed scores on twice entered data

In the genetic model fitting, the AM model was selected as the best-fitting model for both peak and mean amplitude, as its χ2 was not significantly different from that of the full ACEM model and its AIC was smaller (Table 1). The genetic effect could not be dropped from the model without a significant deterioration in fit. Heritabilities for MMN peak and mean amplitude were 0.63 and 0.68, respectively; measurement error accounted for the rest of the variance. For MMN latency, every parameter could be dropped individually, and A and D could also be dropped together. Over half of the variance (55%) was due to measurement error, suggesting that this index is rather unreliable.

P50

P50 indices (except C latency) were highly reliable (Table 2). The mean P50 T/C ratio for the entire sample was 31.72, indicating 68% suppression of the P50 wave. This value is similar to those reported for normal adult subjects in previous studies (Clementz et al. 1998; Myles-Worsley et al. 1996; Nagamoto et al. 1989; Waldo et al. 1992). No differences in means or variances were found. MZ correlations were more than twice the DZ correlations for most P50 variables, the exceptions being C amplitude and T amplitude (Tables 3 and 4). Therefore, an ADEM model was fitted for C latency, T latency, T/C ratio and T–C difference, and an ACEM model for C amplitude and T amplitude.

P50 variables were normally distributed except the T/C ratio, which was square-root transformed prior to model fitting. Compared with the saturated model, the χ2 of the full model for C latency and T amplitude was significant, suggesting a poor fit of the data to the genetic model; we therefore report the full model for both C latency and T amplitude. For C amplitude, significant familial effects were found (χ2 = 7.56; df = 2; P  <  0.02), but there was not enough power to distinguish between genetic (A) and shared environmental (C) effects, since either component could be dropped independently, but not both simultaneously. The full ACEM model is reported, with standardized estimates of a2 = 0.35, c2 = 0, e2 = 0.15 and m2 = 0.49 (Table 1). For all other variables (C latency, T latency, T/C ratio and T–C difference), significant genetic effects (additive + dominant) were found. However, there was insufficient power to distinguish between A and D, i.e., either component could be dropped independently, but not both simultaneously. Therefore, the full ADEM model is reported in Table 1. Heritability estimates are a2 = 0 and d2 = 0.68 for the T/C ratio, a2 = 0 and d2 = 0.47 for C latency, a2 = 0 and d2 = 0.42 for T latency, and a2 = 0 and d2 = 0.40 for the T–C difference.


Both P300 amplitude and latency were highly reliable. Differences in SD were found in the MZ cross-twin, cross-occasion comparisons for amplitude and latency, and between zygosity groups for P300 latency. DZ correlations were about half the MZ correlations for P300 latency, so an ACEM model was fitted. For amplitude, MZ correlations were more than twice the DZ correlations, so an ADEM model was fitted (Tables 3 and 4).
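The rule used to choose between the two full models can be written down directly. The sketch below assumes the simple comparison stated in the text (MZ correlation more than twice the DZ correlation implies non-additive genetic effects, so D is modelled; otherwise C is modelled). The function name is hypothetical, and in practice this choice is a starting point for model fitting rather than a test.

```python
def choose_twin_model(r_mz: float, r_dz: float) -> str:
    """Pick ACEM or ADEM from MZ and DZ twin correlations.

    Under purely additive genetic effects r_DZ is about half of r_MZ. When r_MZ
    exceeds twice r_DZ, dominance (D) is modelled; otherwise shared environment
    (C) is modelled. M (measurement error) is included in both, estimated from
    the retest data.
    """
    return "ADEM" if r_mz > 2 * r_dz else "ACEM"
```

Applied to the P50 T/C ratio correlations reported in the Discussion (MZ 0.52, DZ 0.04), the rule selects the ADEM model, consistent with the model fitted above.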

P300 amplitude was normally distributed, but latency was skewed and was square-root transformed prior to model fitting. For P300 amplitude, significant heritability (additive + dominant genetic effects) was found: A and D could not be dropped simultaneously (χ2 = 32.2; df = 2; P < 0.0001), but there was insufficient power to distinguish between these two components. A and D estimates were 0.10 and 0.59, respectively, and heritability was computed as the sum of these two components (0.69). For P300 latency, a substantial familial effect was found, but dropping either A or C produced a non-significant deterioration in goodness-of-fit. Therefore, the ACEM model is reported (a2 = 0.48, c2 = 0.17, e2 = 0.06, m2 = 0.29). The significant difference in likelihood between the saturated and full ACEM models (χ2 = 45.35, df = 9, P < 0.001) indicated that the data did not fit the genetic model well. The reason for this poor fit is not entirely clear; it may reflect the relatively small sample size and the failure of the transformation to remove the skewness fully.
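The component sums quoted above can be checked directly. For P300 amplitude (ADEM) heritability is the sum of the additive and dominant components; for P300 latency (ACEM) the familial variance is the sum of the additive and shared-environment components, and the four standardized components sum to one:

```latex
\begin{aligned}
h^2_{\text{P300 amp}} &= a^2 + d^2 = 0.10 + 0.59 = 0.69,\\
a^2 + c^2 &= 0.48 + 0.17 = 0.65 \quad (\text{familial variance, P300 latency}),\\
a^2 + c^2 + e^2 + m^2 &= 0.48 + 0.17 + 0.06 + 0.29 = 1.00.
\end{aligned}
```

The 0.65 total is the familiality of 65% cited for P300 latency in the Discussion.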


This study uses a repeated-measures twin design, which enables the heritabilities of ERP indices and their test–retest reliabilities to be estimated simultaneously. We found significant reliabilities for the ERP measures P300, P50 and MMN, with P300 amplitude showing the lowest measurement error. The best-fitting models for all three ERP measures indicated a substantial familial (or genetic) component of variance (Fig. 3). In contrast to the studies of van Baal et al. (1998) and van Beijsterveldt et al. (2001), heritability estimates in the present study were computed from the total variance, which includes measurement error, rather than from the stable part of the variance. As a consequence, the heritability estimates reported here are more conservative.
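Why the total-variance estimate is the more conservative one can be seen by writing both definitions side by side. This sketch assumes that measurement error (M) is the only part of the total variance not shared across occasions, with σ²_G = σ²_A + σ²_D the genetic variance (σ²_C = 0 in an ADEM model, σ²_D = 0 in an ACEM model):

```latex
h^2_{\text{total}} = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_C + \sigma^2_E + \sigma^2_M},
\qquad
h^2_{\text{stable}} = \frac{\sigma^2_G}{\sigma^2_G + \sigma^2_C + \sigma^2_E}
= \frac{h^2_{\text{total}}}{1 - m^2}.
```

For example, the additive component for P300 latency reported above (a² = 0.48 with m² = 0.29) would correspond to 0.48 / (1 − 0.29) ≈ 0.68 of the stable variance, so the larger the measurement error, the larger the gap between the two definitions.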
Fig. 3

Estimated heritability (h2), shared environment (c2), specific environment (e2) and measurement error (m2) components based on the best-fitting model. Note: Amp: amplitude; Lat: latency. Heritability estimates for the variables shown are the sum of the A and D components


Reliability values of MMN measures vary considerably across deviant types, scalp sites, ISIs and attentional conditions. Reliability has been reported to improve when duration MMN is used rather than frequency or intensity MMN, and shorter ISIs and frontal sites provide the best signal-to-noise ratio (Escera and Grau 1996; Escera et al. 2000; Kathmann et al. 1999; Pekkonen et al. 1995; Tervaniemi et al. 1999). The present study examined duration MMN at FZ using a 0.3-s ISI and found reliabilities of 0.67 for peak amplitude and 0.66 for mean amplitude. Kathmann et al. (1999) found reliability to be higher when the task carried attentional demands (r = 0.83), so our results could be explained by the lack of attentional demands and by increased, non-standardized intra-individual variability. Reliability of MMN amplitude has also been suggested to improve when it is measured as the mean amplitude in a latency window (100–200 ms) around its peak (Escera et al. 2000). The similar reliabilities of peak and mean amplitude in this study suggest that both ways of measuring MMN amplitude have comparable validity.
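The two ways of scoring MMN amplitude compared above can be sketched on a deviant-minus-standard difference wave. This is an illustration under stated assumptions, not the study's processing pipeline: waveforms are plain lists sampled on a shared time axis in ms, the peak is taken as the most negative sample, and both peak and mean are taken within the same 100–200 ms window quoted from Escera et al. (2000). The function name and the synthetic waveform are hypothetical.

```python
import math


def mmn_amplitudes(times_ms, deviant_uv, standard_uv, window_ms=(100.0, 200.0)):
    """Return (peak amplitude, peak latency, mean amplitude) of the MMN.

    The MMN is read from the deviant-minus-standard difference wave. Because it
    is negative-going, the peak is the most negative sample in the window; the
    mean amplitude averages every sample in the same window.
    """
    if not (len(times_ms) == len(deviant_uv) == len(standard_uv)):
        raise ValueError("time axis and waveforms must have equal length")
    lo, hi = window_ms
    window = [(t, d - s) for t, d, s in zip(times_ms, deviant_uv, standard_uv)
              if lo <= t <= hi]
    if not window:
        raise ValueError("no samples fall inside the analysis window")
    peak_latency, peak_amp = min(window, key=lambda pair: pair[1])
    mean_amp = sum(v for _, v in window) / len(window)
    return peak_amp, peak_latency, mean_amp


if __name__ == "__main__":
    # Synthetic 1-kHz waveforms: flat standard, Gaussian negativity peaking at 150 ms.
    times = list(range(0, 301))
    standard = [0.0] * len(times)
    deviant = [-2.0 * math.exp(-((t - 150) / 25.0) ** 2) for t in times]
    print(mmn_amplitudes(times, deviant, standard))
```

The peak score depends on a single sample, whereas the mean score averages over the window; this is the usual argument for why the mean may be less sensitive to noise.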

In contrast to MMN amplitude, MMN latency was not reliably measured. Consistent with the low ICC value, measurement error accounted for more than half of the variance, so its heritability is likely to be underestimated. This may explain why studies conventionally compare only amplitudes between groups.
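For reference, a test–retest ICC can be computed from a subjects-by-sessions table. This section does not state which ICC form was used, so the sketch below assumes the one-way random-effects form, often written ICC(1,1), as one common choice; the function name and example scores are hypothetical. A high within-subject mean square (session-to-session disagreement, i.e. measurement error) pulls the ICC down.

```python
def icc_oneway(data):
    """One-way random-effects intraclass correlation, ICC(1,1).

    `data` holds one row per subject and one column per session, e.g.
    [[session1, session2], ...]. Uses the between-subject (MSB) and
    within-subject (MSW) mean squares of a one-way ANOVA.
    """
    n = len(data)
    if n < 2:
        raise ValueError("need at least two subjects")
    k = len(data[0])
    if k < 2 or any(len(row) != k for row in data):
        raise ValueError("need the same number (>= 2) of sessions for every subject")
    grand_mean = sum(sum(row) for row in data) / (n * k)
    subject_means = [sum(row) / k for row in data]
    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((x - m) ** 2 for row, m in zip(data, subject_means) for x in row)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    denominator = ms_between + (k - 1) * ms_within
    if denominator == 0:
        raise ValueError("no variance in the data")
    return (ms_between - ms_within) / denominator


if __name__ == "__main__":
    # Hypothetical test-retest scores for three subjects.
    print(icc_oneway([[1.0, 2.0], [3.0, 3.0], [5.0, 6.0]]))  # about 0.92
```

In the variance-components framework used in this study, 1 − m² plays a comparable role: it is the share of total variance that is stable across occasions.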

Our model fitting results support the use of both mean and peak amplitude as endophenotypes. No environmental influences were observed for either measure, suggesting that genetic effects are the main source of individual differences in MMN amplitude. Several studies have reported reduced MMN amplitudes in schizophrenic patients (Catts et al. 1995; Michie et al. 2002; Shelley et al. 1991; Umbricht et al. 2003), although there have also been negative reports (Jessen et al. 2001; O’Donnell et al. 1994). Negative findings may be due to the use of less sensitive paradigms with higher measurement error and lower statistical power.


Previous P50 T/C ratio studies found low to moderate reliability estimates (see Smith et al. 1994 for a review). A reliable T/C ratio is difficult to obtain because of its inherent mathematical properties: T and C are not independent, so the shared variance between T and C cannot be completely eliminated (Adler et al. 1999; Smith et al. 1994). All cerebral evoked potential measurements include both a signal and a noise component, and a ratio that includes noise in both its numerator and its denominator introduces even greater variability (Smith et al. 1994).
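The variance inflation described above can be made concrete with the standard first-order (delta-method) approximation for the variability of a ratio, writing CV(X) = σ_X/μ_X for the coefficient of variation and ρ_TC for the correlation between T and C. This is a textbook approximation offered as illustration, not an analysis from this study:

```latex
\operatorname{CV}^2\!\left(\frac{T}{C}\right) \approx \operatorname{CV}^2(T) + \operatorname{CV}^2(C) - 2\,\rho_{TC}\,\operatorname{CV}(T)\,\operatorname{CV}(C)
```

Independent noise in T and C enlarges both CV² terms but adds nothing to the covariance term, so it passes into the ratio undiminished; meanwhile, a positive correlation between the true T and C amplitudes subtracts from the ratio's variance, shrinking the true-score part against which that noise is compared.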

The present study suggests that a reliable T/C ratio can be achieved (ICC = 0.66) by adopting a rather stringent trial-selection criterion (i.e., trials in which CZ or eye channels exceeded 20 µV were rejected). The eye movements and muscle artefacts observed in the P50 recording were reflex responses that formed part of the response to the stimulus itself. To remove these artefacts, most P50 studies have used 35 µV as the cut-off, but we applied a more conservative cut-off (20 µV). In addition, the averaged response from a block was excluded from the grand average if eye movements overlapped with and exceeded the P50 waves. This method was chosen over the regression coefficient method (as applied to P300 and MMN) because there were few or no blinks during the conditioning and testing clicks, as subjects were instructed to keep their eyes still and focus on a target when stimuli were presented. During P300 and MMN recordings, eye blinks occur randomly with respect to the stimuli, and the regression method is the better option for regressing out obvious eye blinks (>100 µV) while retaining the maximum number of trials for averaging. Moreover, regression coefficient methods generally produce a slight reduction in amplitude, and since P50 waves are usually very small, even minor reductions could distort the ratio.
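The trial-selection criterion can be sketched as a simple amplitude threshold. The source states only that trials in which CZ or eye channels exceeded 20 µV were rejected; this sketch assumes an absolute-amplitude criterion applied to every sample, a dict-of-lists trial layout, and hypothetical channel labels ("Cz", "VEOG", "HEOG"). The block-level exclusion described above involved judging overlap with the P50 waves and is not modelled here.

```python
def reject_trials(trials, channels=("Cz", "VEOG", "HEOG"), threshold_uv=20.0):
    """Keep only trials in which every sample on the checked channels stays
    within +/- threshold_uv microvolts."""
    return [
        trial for trial in trials
        if all(abs(v) <= threshold_uv for ch in channels for v in trial[ch])
    ]


def average_channel(trials, channel):
    """Point-by-point average of one channel across the retained trials."""
    if not trials:
        raise ValueError("no trials survived rejection")
    n = len(trials)
    return [sum(samples) / n for samples in zip(*(trial[channel] for trial in trials))]


if __name__ == "__main__":
    # Two hypothetical 3-sample trials; the second has a 25 uV excursion at Cz.
    trials = [
        {"Cz": [1.0, 2.0, 1.5], "VEOG": [0.5, -3.0, 1.0], "HEOG": [0.0, 0.2, -0.1]},
        {"Cz": [1.0, 25.0, 1.5], "VEOG": [0.5, -3.0, 1.0], "HEOG": [0.0, 0.2, -0.1]},
    ]
    kept = reject_trials(trials)
    print(len(kept), average_channel(kept, "Cz"))  # 1 [1.0, 2.0, 1.5]
```

Lowering `threshold_uv` from 35 to 20 trades trial count for cleaner averages, which is the trade-off the text describes for the small P50 waves.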

Genetic model fitting revealed that genetic effects (additive + dominant) were the main source of variance for T/C ratio, C latency, T latency and T–C difference. However, given the relatively small sample size, there was not enough power to distinguish between A and D, since either component could be dropped independently, but not both simultaneously. Heritability estimates for these variables should therefore be viewed as the sum of these two components. Both T/C ratio and T–C difference are secondary measures derived from T and C amplitudes, and both are heritable traits. A secondary measure can be more heritable than its constituents when the cross-twin cross-trait covariances are much smaller than the within-twin cross-trait covariance, as happens when the phenotypic correlation between the traits is due to non-shared (between twins) environmental factors. For C amplitude and T amplitude, significant familial effects were found. Our results also suggest that the T/C ratio is preferable as an endophenotype to other P50 indices (Smith et al. 1994): it is more reliable than the T–C difference, and its variance is mainly due to genetic effects (68%).
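The point about secondary measures can be illustrated for the linear T–C difference; the T/C ratio is itself a difference on the log scale, since log(T/C) = log T − log C. The sketch below assumes equal variances in both twins and symmetric cross-trait covariances, and the standardized covariances in the test are hypothetical, not estimates from this study.

```python
def cross_twin_corr_difference(var_t, var_c, cov_tc_within,
                               cov_t_cross, cov_c_cross, cov_tc_cross):
    """Cross-twin correlation of D = T - C from bivariate twin covariances.

    Assumes equal variances in both twins and symmetric cross-trait
    covariances, i.e. Cov(T1, C2) = Cov(C1, T2) = cov_tc_cross.
    """
    var_d = var_t + var_c - 2 * cov_tc_within             # Var(T - C) within a twin
    cov_d = cov_t_cross + cov_c_cross - 2 * cov_tc_cross  # Cov(T1 - C1, T2 - C2)
    if var_d <= 0:
        raise ValueError("Var(T - C) must be positive")
    return cov_d / var_d
```

With unit variances, an MZ cross-twin covariance of 0.5 for each trait and a cross-twin cross-trait covariance of 0.2, the difference score has a cross-twin correlation of 0.75 when the within-twin covariance is 0.6, but only 0.375 when it is 0.2. A large within-twin covariance that is not shared across twins removes variance from the denominator while leaving most of the cross-twin covariance intact.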

Twin studies examining P50 sensory gating are rare. One small study reported an ICC of 0.57 for MZ twins and 0 for DZ twins (Young 1996). A second, slightly larger study (26 MZ and 13 DZ pairs) found a significantly higher ICC for MZ (0.50) than for DZ twins (0.13) (Myles-Worsley et al. 1996). Both studies suggested substantial genetic influences, but the extent of heritability was not formally calculated. We found similar results, with an MZ correlation of 0.52 and a DZ correlation of 0.04, and a heritability estimate of 68%. The reason for the poor fit of the genetic model to T amplitude and C latency is not entirely clear but may be due to the small sample size. Variance component estimates for both T amplitude and C latency should therefore be interpreted with caution.


Most studies found a higher reliability for P300 amplitude than latency (for reviews, see Segalowitz and Barnes 1993; Walhovd and Fjell 2002). However, we found both the amplitude and latency to be highly reliable, and, as our model fitting analysis indicated, this was due to low measurement error.

A number of twin studies have yielded substantial evidence of P300 amplitude heritability (Katsanis et al. 1997; O’Connor et al. 1994; Polich and Burns 1987; Rogers and Deary 1991; van Baal et al. 1998). A recent meta-analysis of five twin studies estimated P300 amplitude heritability to be 60% and the specific environmental variance to be 40% (van Beijsterveldt and van Baal 2002). Our results correspond well with these figures and further show that around half of the specific environmental variance may be due to measurement error.

In that meta-analysis, P300 latency heritability was reported to be 51% and specific environmental variance to be 49%, with no significant influence from the shared environment. Our results show evidence for familiality (65%) but lack the power to distinguish between genetic and shared environmental influences. The poor fit of the genetic model to the P300 latency data, however, suggests that the heritability estimate obtained from our model fitting may not reflect the true value, and these results should therefore be interpreted with caution.

The assumption of equal variances between MZ and DZ groups was not met for MMN mean amplitude and P300 latency. Higher DZ variance was observed for MMN, suggesting negative sibling interaction, whereas higher MZ variance was found for P300 latency, suggesting positive sibling interaction. However, these differences most likely reflect the small sample size.

Although ERP components are non-specific (e.g. P300 amplitude reduction has been proposed as an endophenotype for other disorders such as alcoholism), their validity as putative schizophrenia endophenotypes (the first criterion) can be tested with a study design that includes both control and schizophrenic twin pairs (concordant and discordant). Data of this type will enable researchers to examine the extent to which the endophenotypes overlap with the disorder and whether this overlap is due to genetic or environmental factors.

Some limitations of the study need to be addressed. This study tested only a sub-sample of MZ pairs on two occasions instead of the whole sample. However, since this sub-sample was representative of the total sample in terms of gender composition and mean age, the estimates of measurement error are likely to be unbiased. Also, our model (Fig. 2) fitted the A, C/D and E components as common factors across occasions 1 and 2, with path coefficients equated across occasions, and therefore represented the ‘stable’ A, C/D and E variance across time. Using a small subset of MZ twin pairs to estimate these stable variances might produce slightly biased estimates. However, the estimates obtained in the present study are consistent with those from other ERP studies in the literature. More importantly, if a bias had occurred, the heritability, c2 and e2 estimates from the AC/DEM models would not be comparable to those derived from analyzing the occasion 1 data only. These analyses show little evidence of such bias (details available on request). Incorporating retest data on a subset of MZ twins assumes that test–retest reliability is the same for MZ and DZ twins. Although this assumption cannot be tested in the current study because DZ twins were not retested, it is probably not unreasonable, and it is indeed a hidden assumption of all twin analyses that do not include retest data. Moreover, the small sample size did not allow formal sex-limitation testing (i.e., testing for sex differences in genetic architecture), and there was an overall excess of females in the sample, especially among DZ twins. Finally, owing to the small sample size, the study did not have sufficient statistical power to distinguish between alternative sub-models for many variables, particularly between the A and D components.

In conclusion, ERPs have been suggested as possible endophenotypes of schizophrenia. Indices at a more elementary neurobiological level can be very useful in the search for susceptibility genes for complex psychiatric disorders and in understanding their function. Two fundamental requirements, however, are that a substantial portion of the variance in these indices must be of genetic origin and that the indices must be stable over time. Our investigation demonstrated that MMN peak and mean amplitude, P50 T/C ratio and P300 amplitude meet these requirements. Our results are inconclusive for P300 latency because of the lack of power to discriminate genetic from shared environmental influences. The genetic and environmental relationships between these components will be the subject of a future analysis using multivariate modelling.


We thank all twin pairs who participated in the study and Mr. Y. Nguyen and Mr. L. Drummond for technical help with EEG equipment.

Copyright information

© Springer Science+Business Media, Inc. 2006