Introduction

Interest in the role of epigenetic processes in common complex diseases continues to increase [1, 2]. Epigenetics is a potentially major mechanism by which environmental factors can affect physiological function and disease risk. Research into epigenetics promises to reveal many of the causes that remain undiscovered after extensive investigation of common genetic variation [3].

Epidemiological approaches can be used to identify whether epigenetic processes are involved in mediating the association between risk factors (environmental, genetic, lifestyle, socioeconomic and so on) and common complex disease [4, 5]. For example, longitudinal cohort studies have been a cornerstone of observational epidemiology for many years. Long-term follow-up of adult cohorts has identified important risk factors for cardiovascular disease, chronic bronchitis, and cancers, and follow-up of cohorts from birth or childhood has been equally successful at identifying the importance of early exposures (especially the childhood social environment) and developmental characteristics for adult health (for example, [6-10]). Longitudinal studies, particularly those that start in early life, can contribute to our understanding of how the epigenome changes over time, as a result of varying environmental exposures, and how disease phenotypes evolve. Longitudinal studies are costly to instigate and maintain, and cross-sectional studies (a less expensive alternative study design) have more often been used to assess the relationship between exposures and the epigenome and/or the epigenome and disease. However, cross-sectional studies cannot capture the dynamic nature of epigenetic mechanisms [11], making it difficult to identify the influences of the environment and/or disease state (or sub-clinical features of disease) on the epigenome and thus establish the direction of causality. As a result, study designs that make use of multiple time points are increasingly recognized as the most suitable for analyzing the epigenetics of common complex diseases. Because longitudinal studies track the same individuals at multiple time points throughout their lifetimes, enabling the temporal relationship between exposure and disease to be established, they are ideally placed for exploitation in epigenetic investigations.

Advances in genomic technologies have opened up the possibility of large-scale population-based assessment of epigenetic patterns to help understand their influence on disease. How should such studies be conducted to maximize their impact and what can epigenetics researchers learn from previous approaches to population-based studies? Here we focus on how epidemiological approaches, including the design of cohort studies, can help investigate the role of epigenetic variation in common complex disease. Furthermore, the dynamic nature of epigenetic patterns means that they can be altered by disease-related factors (a process called 'reverse causation') as well as a host of confounding factors (such as age, sex, socioeconomic position, diet, or smoking). Many relevant approaches have been developed in the context of both genetic and life course epidemiology that could be fruitfully applied to epigenetics; examples are methods for dealing with biases, confounding, and reverse causation and also longitudinal statistical modeling techniques [12, 13]. We first assess what epigenetic markers have been measured within existing life course studies before discussing how the epidemiologist's toolkit can be applied to epigenomics.

Epigenetic studies within longitudinal cohorts

Since 2010, 34 life course studies have included measurements of DNA methylation, and just four of these have included analysis of epigenetic features at more than one time point (Table 1). In line with the vast majority of other epigenetic studies, the focus is on DNA methylation as this is the most straightforward form of epigenetic modification to measure, and the only currently feasible option in archived DNA samples. Prospective sample collection will permit the analysis of chromatin modifications and microRNA. Three of the studies analyzing more than one time point (Table 1) report findings relating specifically to age-related changes in childhood [14] or adulthood [15, 16], and all three focus on gene-specific DNA methylation of a small panel of (different) loci and report differences that were modest in size (generally <5%). A further study considers changes in DNA methylation over a relatively short time period (28 to 180 days) in relation to air pollution exposure [17]. Although there was some indication of lower global DNA methylation in repetitive elements across the genome in this study [17] at 90 days of exposure, there was no evidence of a dose response, casting doubt on the biological importance of this association. In summary, very few life course studies have so far examined DNA methylation at more than one time point.

Table 1 Epigenetic studies in longitudinal cohorts: a summary of recent literature (2010 to 2012)

Table 2 summarizes additional examples in which case-control studies of DNA methylation have been nested within existing large-scale longitudinal cohorts; this approach has been applied so far exclusively in the context of cancer. Analyses in this instance have been limited to gene panels (generally established tumor suppressor or oncogenes) and have been undertaken either (i) to assess the utility of epigenetic signatures as early biomarkers of cancer risk [18-20] or (ii) to consider the determinants of a perturbed methylation state (methylator phenotype), which has been implicated in numerous cancers [21-25]. With improved knowledge of methylation variable regions associated with diseases other than cancer (for example, cardiovascular disease, dementia, and rheumatoid arthritis), the same approach could be adopted in the context of longitudinal cohort studies.

Table 2 Nested case-control epigenetic studies in longitudinal cohorts: a summary of recent literature (2010 to 2012)

The paucity of DNA methylation measurements undertaken in cohorts that have collected serial samples from the same individuals is clear, indicating that the potential richness of longitudinal data and sampling in these studies has yet to be fully exploited. Few studies have routinely collected serial samples from the same individuals at multiple points in the life course (for example, the Avon Longitudinal Study of Parents and Children (ALSPAC) [26, 27], and the Normative Aging Study [17, 28-32]), but others are planning serial sampling in light of the interest in epigenetics (such as the Medical Research Council National Survey of Health and Development [33] and the Southall And Brent REvisited (SABRE) cohort [34]). Given the temporal variation in epigenetic patterns, serial sampling of any longitudinal cohort would be advised where possible.

In the studies published so far, the tissues analyzed are limited mainly to easily accessible peripheral blood, cord blood or buccal cells; the studies are modest in size compared with those used for genetic research; and the range of different methods used to quantify DNA methylation has led to an overall lack of comparability between studies. It is clear from these observations that more can be done with respect to the collection and analysis of biological samples from longitudinal cohorts so that they are optimal for epigenetic studies.

Attributes of longitudinal cohort studies

Ideally, longitudinal epigenetic studies should include extensive, prospectively collected data and biological samples at multiple time points across the life course. Many existing longitudinal cohort studies are population-based, although some focus on a specific sub-group of the general population. For example, the SABRE cohort focuses on groups that are first or second generation migrants to the UK of non-European ethnicity to examine particular health issues, in this case the marked discordance in disease risk observed in migrant groups compared with Europeans living in the UK [34]. Longitudinal epigenetic studies can add value to existing resources, such as data from genome-wide association studies - for example, ALSPAC [26, 27] and the Relationship between Insulin Sensitivity and Cardiovascular disease (RISC) cohort [35]. Exposures commonly captured in longitudinal studies include lifestyle factors, such as smoking, alcohol intake, diet, and physical activity patterns, and also socioeconomic measures across the life course. Common phenotypes on which longitudinal studies tend to focus include physical and anthropometric measures, cognitive, cardiovascular, metabolic, respiratory, and musculoskeletal function, and a range of blood-based intermediate biomarkers. Of particular value are birth cohorts with trans-generational and across-life samples from birth onwards, allowing an appraisal of epigenetic changes associated with in utero and early life exposures, a period when the epigenome is believed to be particularly plastic.

The epidemiological toolkit

Applying principles of life course epidemiology to epigenetic research

Research in life course epidemiology investigates developmental, aging, and risk factor trajectories and how dynamic relationships unfold over time, and takes into account potential confounding, mediating, or interactive effects of lifetime biological, psychological, and social risk factors [36]. This conceptual framework is relevant for epigeneticists investigating long-term associations that may be biased, confounded or due to reverse causation. Life course epidemiologists have investigated various methods for modeling risk factor trajectories (particularly growth trajectories) in relation to later health outcomes and have developed a novel structured approach [37] to distinguish critical, sensitive, and accumulation life course models [38]. They use a range of approaches for modeling repeat continuous and binary outcome measures, such as generalized estimating equations or mixed models that consider correlated data such as repeat measures from the same individuals over time, and for modeling time to an event, such as survival and event history analysis. This toolkit is relevant to epigeneticists, whether studying lifetime environmental exposures that promote particular epigenetic signatures over time or how these signatures themselves may affect not just the level (intercept) of function (such as blood pressure) at a point in time but also its rate of change (slope) over time. Such statistical approaches have not been widely applied to epigenetic data, although examples can be found in Madrigano et al. [16, 17], who illustrate the use of mixed models to analyze changes in methylation over time while accounting for the correlation among measurements within the same individual. Further discussion of this subject is provided below in the section on data analysis considerations.
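The distinction between level (intercept) and rate of change (slope) can be made concrete with a deliberately simplified sketch. It fits a separate least-squares line to each individual's repeated measures and then averages the results. This two-stage summary is only a rough stand-in for the mixed models mentioned above, which estimate individual and population trajectories jointly; it is not the method used in the cited studies. The participant identifiers, ages, and methylation percentages are invented for illustration.

```python
from statistics import mean

def ols_line(times, values):
    """Return (intercept, slope) of the least-squares line through the points."""
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    slope = sxy / sxx
    return v_bar - slope * t_bar, slope

# Hypothetical repeated measures per participant: (age in years, % methylation).
cohort = {
    "id01": [(0, 40.0), (7, 41.4), (17, 43.4)],
    "id02": [(0, 55.0), (7, 54.3), (17, 53.3)],
    "id03": [(0, 47.0), (7, 47.7), (17, 48.7)],
}

# Stage 1: one intercept and one slope per individual.
trajectories = {}
for person, visits in cohort.items():
    ages, methylation = zip(*visits)
    trajectories[person] = ols_line(ages, methylation)

# Stage 2: summarize level and rate of change across individuals.
mean_intercept = mean(intercept for intercept, _ in trajectories.values())
mean_slope = mean(slope for _, slope in trajectories.values())

for person, (intercept, slope) in trajectories.items():
    print(f"{person}: level at birth {intercept:.1f}%, change {slope:+.2f}% per year")
print(f"cohort: mean level {mean_intercept:.1f}%, mean change {mean_slope:+.3f}% per year")
```

Individuals with similar levels can have quite different slopes, which is why three or more time points per person are needed before the shape of change can be described. A full mixed model would additionally borrow strength across individuals and handle unbalanced visit schedules.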

Several research collaborations involving cohort studies, such as HALCyon (Healthy Aging across the Life Course) [39], FALCon (Function Across the Life Course) [40] and GEoCoDE (Genomic and Epigenomic Complex Disease Epidemiology) [41] have been formed. These have increased the sample size and power to investigate lifetime risk factors on longitudinal phenotypes and to test whether findings are replicated across cohorts in a systematic way, and they will be useful to epigenetics research. The collaborations have developed experience in data harmonization to derive comparable phenotypes across the cohorts, and in cross-cohort methods (for example, [42]). Those running epigenetic studies may want to make use of these collaborations for similar reasons, and a coordinated approach is likely to advance the science and be appealing to funders. Coordinating the cohorts has led to more effective ways of gaining knowledge of the various datasets and metadata as well as facilitating data sharing and encouraging good practice in data management.

From genetic to epigenetic epidemiology

Incorporating epigenetic measures into epidemiological studies is often done in the context of genetic epidemiology resources. However, studying epigenetic factors - which are, partly at least, phenotypic - is more similar to conventional epidemiology than it is to genetic epidemiology. Several aspects of germline genetic variation lead to special-case conditions that allow relaxation of usual epidemiological principles: reverse causation (disease influencing the variable being measured rather than vice versa) is clearly not an issue in genetic epidemiology, and confounding - which often vitiates conventional epidemiology - generally relates only to ancestry in genetic epidemiology [43], and this can be accounted for by using principal components from genome-wide data as control variables. Germline genetic variation can be assessed on samples taken at any stage of life, does not change over time, and can be assayed with high precision and low measurement error. Effect sizes for the influence of common genetic variants on common complex diseases tend to be small, which means that very large sample sizes are required. Given these circumstances, the genetic epidemiology study design of choice became large case-control studies, with the controls not being carefully selected to represent the source population - and sometimes (as in the case of the landmark Wellcome Trust Case Control Consortium (WTCCC) [44]) control groups shared for comparison with several disease groups. For example, in the WTCCC the common control groups consisted of blood donors (who are very unrepresentative in terms of factors that would be important confounders in conventional epidemiological studies, such as health-related behaviors and social class) and participants in the 1958 birth cohort - all of the same age, which in some cases barely overlapped with the age of the cases.

However, such study designs are not appropriate for epigenetic epidemiology, as confounding, bias, and reverse causation are all serious problems when studying phenotypic exposures. It is important that the successes of genetic epidemiology are not translated into failures for epigenetic epidemiology [1, 5, 45]. Prospective studies are the ideal type of study, including documented exposure (epigenetic) measures collected before the outcomes and temporal changes, detailed assessment of confounding factors, and consideration of measurement error. Currently, the effect sizes of associations in epigenetic studies are poorly delineated, but it is likely that, unlike the situation in the early days of molecular genetic epidemiology, the problem will not be one of relatively few robust associations, but rather many real observational associations will exist and the issue will be the separation of causal associations from those generated by confounding and bias. Various methods that have been developed to strengthen causality in conventional epidemiology - including collaborative analysis of multiple cohorts in which confounding structures differ [46], comparisons of plausible and implausible associations [47, 48], and the use of instrumental variables [47] - can be applied to epigenetic epidemiology studies.

An instrumental variables method that uses germline genetic variants as the instruments - Mendelian randomization - is increasingly used to strengthen causality with respect to environmentally modifiable exposures for which genetic variants can serve as proxy measures [49-51]. Mendelian randomization can be extended to the investigation of epigenetic profiles as the potentially modifiable exposure. This method - 'two step epigenetic Mendelian randomization' - is currently under development, and details can be found elsewhere [5, 52].
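The logic of using a genetic variant as an instrument can be sketched with a small simulation. The example below illustrates only the generic single-instrument ratio (Wald) estimator, not the two-step method cited above. All quantities are simulated under assumptions chosen for illustration: a variant with allele frequency 0.3 that influences methylation, an unmeasured confounder that affects both methylation and outcome, and a true causal effect of 0.5.

```python
import random

def covariance(a, b):
    """Sample covariance of two equal-length sequences."""
    a_bar, b_bar = sum(a) / len(a), sum(b) / len(b)
    return sum((x - a_bar) * (y - b_bar) for x, y in zip(a, b)) / (len(a) - 1)

def ols_slope(x, y):
    """Slope from a simple linear regression of y on x."""
    return covariance(x, y) / covariance(x, x)

def wald_ratio(instrument, exposure, outcome):
    """Ratio estimator: (instrument-outcome slope) / (instrument-exposure slope)."""
    return ols_slope(instrument, outcome) / ols_slope(instrument, exposure)

rng = random.Random(1)   # fixed seed so the sketch is reproducible
TRUE_EFFECT = 0.5        # assumed causal effect of methylation on the outcome

genotype, methylation, outcome = [], [], []
for _ in range(20000):
    g = sum(rng.random() < 0.3 for _ in range(2))   # 0, 1 or 2 copies of the allele
    u = rng.gauss(0, 1)                             # unmeasured confounder
    m = 1.0 * g + u + rng.gauss(0, 1)               # methylation depends on g and u
    y = TRUE_EFFECT * m + u + rng.gauss(0, 1)       # outcome depends on m and u
    genotype.append(g)
    methylation.append(m)
    outcome.append(y)

observational = ols_slope(methylation, outcome)            # distorted by u
instrumental = wald_ratio(genotype, methylation, outcome)  # uses g as instrument

print(f"true effect:            {TRUE_EFFECT:.2f}")
print(f"observational estimate: {observational:.2f}")
print(f"instrumental estimate:  {instrumental:.2f}")
```

Because the simulated genotype is independent of the confounder, the ratio estimate recovers the assumed effect while the ordinary regression slope is inflated. In real data the instrument assumptions (relevance, independence from confounders, no alternative pathway to the outcome) cannot be taken for granted and need to be examined.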

A further complexity of epigenetic studies is the tissue-specific nature of epigenetic patterns. Given that they are integrally involved in the process of cell and tissue differentiation, it is no surprise that epigenetic patterns differ between tissue sources. Genetic comparisons within and between studies can be made using a variety of sources of DNA to generate genotype data; however, this is not the case in an epigenetic context. Population-based studies often have to rely on easily accessible DNA sources (such as blood, saliva, buccal cells; Table 1). These serve as a surrogate for the target tissue involved in the disease of interest, but there is inevitable heterogeneity in both the specific cell types represented and sample processing, which may bias epigenetic measurement (see the section below on data analysis considerations). Despite these limitations, epigenetic epidemiological studies are emerging and include strategies such as Mendelian randomization approaches [53] or inter-tissue comparisons [15] to interrogate the functional relevance and causal nature of observations.

Inter-generational epigenetic studies

Family-based sampling of both siblings and multiple generations can have particular value in epigenetic studies. The fact that epigenetic states are often established in early (in particular antenatal) development makes birth cohorts with recruitment and sample collection from pregnant women and sample collection on offspring from birth onwards of particular value [26, 27]. There is considerable interest in the role of epigenetic mechanisms in the developmental origins of adult disease, to which longitudinal cohort studies are making a valuable contribution [4, 53-59].

Data analysis considerations

Most research undertaking longitudinal analysis of molecular biomarker data assumes that there are predictable biological changes over time associated with a given exposure or disease process. However, in the context of epigenetic studies, change over time can be due to technical [60] or genetic factors [61], tissue type [62, 63], changes with normal aging, and stochastic changes [64]. These sources of data 'noise' threaten the detection of the biological signal of interest. Thus, as is often the case, the first and most critical step to performing longitudinal DNA methylation analysis is careful study design and data collection with meticulous recording of technical factors and factors that vary between people. Given that data collection may occur months, years or even decades apart, the awareness and/or control of such sources of variability are paramount to making valid conclusions regarding within-individual changes over time as it may be impossible to account for these factors after the fact. Pre-processing of data is often necessary to generate comparable data from samples between and within individuals over time. International initiatives to address and reach consensus on such issues are in progress [65]. Equally important is that many of these methods seek to optimize the signal-to-noise ratio. These two considerations are critical to generating valid and reproducible results. Prudent use of pre-processing that matches the study design and data, and experimentation with several different methods are strongly encouraged. In addition, the threat of time-varying artifacts masquerading as biological signal is constantly present in longitudinal studies. This possibility should be formally tested as an automatic addition to the primary study hypothesis.

An example of a 'noise' source that is just beginning to be understood is the role of genetic factors in determining the degree of variability in DNA methylation over time. This is suggested by familial clustering of DNA methylation variability over time [61]. From the perspective of individual loci, there is also evidence of CpG site-dependent differential stability [15]. This indicates that loci should be carefully selected that demonstrate greater inter- than intra-individual variation over time. The mechanisms underlying this are unknown but could reasonably be related to overlying genetic architecture (for example, interaction with other epigenetic marks and possibly even the DNA itself) or the cellular milieu, as suggested by tissue-specific difference in stability in the same loci [63]. With the success of next-generation sequencing and its falling costs, we can look forward to a clearer view of the effect of genetic factors on DNA methylation and time-dependent variability.
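One conventional way to quantify whether a locus shows greater inter- than intra-individual variation is the intraclass correlation coefficient (ICC). This is offered as an illustration rather than as the criterion used in the cited studies. The sketch below computes a one-way random-effects ICC for a balanced design, in which every individual is measured the same number of times. The two loci and their methylation proportions are invented.

```python
from statistics import mean

def icc_oneway(subject_series):
    """One-way random-effects ICC for a balanced design.

    subject_series: one list per individual, each holding k repeated
    methylation measurements at the same locus.
    """
    n = len(subject_series)        # number of individuals
    k = len(subject_series[0])     # measurements per individual
    grand_mean = mean(v for series in subject_series for v in series)
    subject_means = [mean(series) for series in subject_series]

    # Between-individual and within-individual mean squares.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (v - m) ** 2
        for series, m in zip(subject_series, subject_means)
        for v in series
    ) / (n * (k - 1))

    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical methylation proportions: three individuals, three time points each.
stable_locus = [[0.20, 0.21, 0.19], [0.50, 0.51, 0.49], [0.80, 0.79, 0.81]]
labile_locus = [[0.20, 0.80, 0.50], [0.50, 0.20, 0.80], [0.80, 0.50, 0.20]]

icc_stable = icc_oneway(stable_locus)
icc_labile = icc_oneway(labile_locus)

print(f"stable locus ICC: {icc_stable:.3f}")
print(f"labile locus ICC: {icc_labile:.3f}")
```

A value near 1 indicates that most variation lies between individuals, whereas a value near or below 0 indicates that within-individual fluctuation dominates. Any threshold used to select loci would be a study-specific choice, and real data would also need technical replicates to separate measurement error from genuine temporal change.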

As alluded to earlier, the vast majority of longitudinal cohort studies that are in a position to consider including epigenetic assessment have used biological specimens collected from peripheral blood. Reliance on leukocyte DNA extracted from peripheral blood introduces a potential source of measurement error [66]. Given the labile nature of leukocyte subtype populations over time, this variation may make an important contribution to intra-individual changes in DNA methylation. For instance, shifts in leukocyte populations can occur as a result of normal development and aging, inflammation from infectious, rheumatological, or oncological diseases, or normal response to medications (such as non-steroidal anti-inflammatory drugs). The most definitive solution is to isolate cell types (for example, through magnetic-activated or fluorescence-activated cell sorting), so as to perform comparisons within relatively homogenous leukocyte populations. However, this is possible only with freshly collected samples; one of the advantages of prospective longitudinal studies is the potential to collect appropriate samples relevant for epigenetic studies.

When analysis of relatively homogeneous cell types was unavailable, Zhu and colleagues [67] used total and differential leukocyte count (from a sample drawn concurrent with the methylation sample) to control for this variation in regression models. These researchers found that the proportion of leukocyte cell types correlated with levels of LINE-1 methylation. Importantly though, statistical adjustment for this did not alter the association between LINE-1 and Alu methylation levels and individual characteristics (age, gender, smoking habits, alcohol intake, and body mass index). Candidate gene studies of methylation have reached similar conclusions [15, 16]. This could mean that leukocyte populations contribute a negligible amount of variance relative to the specified model factors. Alternatively, it may be that controlling for leukocyte population in this manner inadequately captures the effect of this noise. The possibility that using the direct measure of an unwanted variable in a regression equation may sub-optimally reduce noise was explored by Teschendorff and colleagues [60]. Using Illumina HumanMethylation27 BeadChip data, they proposed a variation of surrogate variable analysis in which confounders are modeled as statistically independent components. Using these components instead of the original measures in regression analysis, they found a stronger association between methylation of Polycomb-family gene loci and their phenotype of interest, age. From this, they concluded that the effect of confounders on the DNA methylation data was better represented by independent components than the original covariates.
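What it means to control for leukocyte composition in a regression model can be illustrated with simulated data. The sketch obtains the exposure coefficient adjusted for one covariate by first regressing both exposure and outcome on the covariate and then regressing residual on residual, which gives the same coefficient as a regression containing both terms. The simulation deliberately builds in confounding by a hypothetical neutrophil proportion, so it does not reproduce the finding reported above, where adjustment left the associations unchanged. All effect sizes are assumptions.

```python
import random

def mean(values):
    return sum(values) / len(values)

def slope_and_residuals(x, y):
    """Simple regression of y on x: return the slope and the residuals of y."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / sxx
    residuals = [(b - y_bar) - slope * (a - x_bar) for a, b in zip(x, y)]
    return slope, residuals

def adjusted_slope(exposure, outcome, covariate):
    """Coefficient of exposure in outcome ~ exposure + covariate,
    obtained by partialling the covariate out of both variables."""
    _, exposure_resid = slope_and_residuals(covariate, exposure)
    _, outcome_resid = slope_and_residuals(covariate, outcome)
    slope, _ = slope_and_residuals(exposure_resid, outcome_resid)
    return slope

rng = random.Random(7)   # fixed seed so the sketch is reproducible
TRUE_EFFECT = 0.02       # assumed effect of smoking on methylation proportion

smoking, neutrophils, methylation = [], [], []
for _ in range(5000):
    z = rng.gauss(0.6, 0.1)                             # neutrophil proportion
    x = 1.0 if z + rng.gauss(0, 0.1) > 0.6 else 0.0     # smoking tracks cell mix
    y = 0.3 + TRUE_EFFECT * x + 0.5 * z + rng.gauss(0, 0.05)
    smoking.append(x)
    neutrophils.append(z)
    methylation.append(y)

crude, _ = slope_and_residuals(smoking, methylation)
adjusted = adjusted_slope(smoking, methylation, neutrophils)

print(f"assumed effect:    {TRUE_EFFECT:.3f}")
print(f"crude estimate:    {crude:.3f}")
print(f"adjusted estimate: {adjusted:.3f}")
```

Comparing crude and adjusted estimates in this way is how one would judge whether cell composition matters for a given association. When the two agree, as in the study described above, either composition contributes little or the measured counts capture it poorly, and the comparison alone cannot distinguish those explanations.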

Lastly, in cases where no information on cell counts is available, a potential solution may arise from the DNA methylation data itself. Such a possibility is presented by Houseman and colleagues through their software methylSpectrum [68]. The authors propose an algorithm to infer the contribution of different leukocyte sub-populations to whole blood DNA methylation patterns. This software is not designed to examine changes over time and requires a suitable reference sample from which to make inferences, which would reasonably require multiple age-appropriate references in a longitudinal study setting.

In summary, we need formal comparisons of these methods in heterogeneous and homogeneous samples from the same specimen. International efforts to create reference epigenomes from homogeneous cell samples will be highly beneficial [65]. However, variation due to cellular and tissue heterogeneity is just one example of the wide breadth of issues regarding noise that require detailed and systematic study.

Modeling epigenetic change over time

There are several issues that need to be considered when analyzing epigenetic change over time, such as the unit of DNA methylation change under examination (Box 1) and the analytic technique. The unit of analysis must consider several issues. For example, how is DNA methylation measured? What is the question under investigation? Is the research focused on testing site-specific changes in DNA methylation related to exposures and/or outcomes or is it seeking to explore a network of gene regulation? What type of a priori information is available? How does this information contribute to understanding of error or covariance of methylation measurements? Are individuals compared using categorical or continuous variables?

Guided by the selected unit of DNA methylation change, we now turn to examples of modeling intra-individual variation over time that is due to disease and/or environmental factors. The selection of an appropriate modeling technique has important implications for study power and calculations of statistical significance. We limit this discussion to longitudinal studies with three or more time points, as two time points can at most infer a difference rather than the nature of change. Much of this work is borrowed from other fields, particularly gene expression studies, and uses data-driven or knowledge-driven techniques, or combinations of both.

Several techniques use comparisons between two groups (such as controls versus cases) to determine differential time courses [69, 70]. Some of these methods can be extended to comparisons between more than two groups (for example, [71]). An alternative to this individual-based approach is to find time course patterns that distinguish one group of individuals from another (for example, [72, 73]). Methods that capitalize on other biological knowledge (such as genomes, transcriptomes, or nucleosomes) may allow us to better infer the nature of methylation in the context of how functional regulation of the genome relates to exposures or disease processes. This is especially powerful for detecting signals that are expected to be subtle but consistent among jointly regulated loci [74]. An example is longitudinal gene set analysis [75] using annotations from databases such as Gene Ontology. The parallel analysis of different sources of high-throughput data has so far only been explored in cross-sectional methylation studies but could in theory be applied to longitudinal analysis. However, such longitudinal analysis will require advanced multi-dimensional techniques (Box 2). These techniques require pre-processed data that are relatively free of noise. Another approach may use data reduction techniques to extract meaningful features from data noise while simultaneously considering the time-varying nature of DNA methylation. For example, group-independent component analysis with temporal concatenation of microarray data would assume that there are common sites of epigenetic activity but that the course of change may be different for each individual. Most experience in this type of technique comes from the analysis of neuroimaging data, where the goal is to uncover areas of the brain that are activated similarly among individuals in an experimental group over time [76]. The translation of such ideas to molecular data, which often have far lower temporal resolution but higher 'spatial' resolution (gene loci as opposed to areas of the brain), would be a challenging but also potentially promising avenue.

The promise of epigenetic studies of longitudinal cohorts

Future longitudinal epigenetic studies will undoubtedly integrate greater levels of genomic, biologic and/or phenomic information. For example, our expanding knowledge of factors influencing chromatin architecture may soon allow the analysis of methylation marks within the context of the broader chromatin state. Examples of such data are nucleosome mapping [77], histone modifications [78], and chromosome conformation capture [79]. The influence of the underlying and overlying chromatin architecture (interaction with protein, RNA, and DNA primary and secondary 'structure' [80]) on differential locus stability over time remains to be elucidated. Analysis of DNA methylation is clearly only scratching the surface of the epigenetic information that regulates gene expression, but longitudinal cohort studies provide a tractable opportunity to contribute to our knowledge base in this area and, as our understanding of the wider epigenome improves, additional epigenetic features may also be added to such studies.

Increasingly, studies are pushing to provide a broader mechanistic picture of cellular function and regulation by juxtaposing data from two or more kinds of high-throughput data [81, 82]. So far, these data are often extracted from different materials or individuals (such as DNA methylation from whole blood and RNA from cell culture). This limits interpretation of functional relevance. However, advances in biotechnology that reduce the amount of specimen required and increase automation, in conjunction with falling costs, are likely to overcome this problem. Biobanked samples, such as plasma, DNA, and RNA from longitudinal cohorts, could make a valuable contribution to developments in this area. Furthermore, the development of nested recall studies for intensive phenotyping within established cohorts will greatly enhance research opportunities in this area.

As multi-dimensional datasets evolve and the ability to mine the information within them improves, it will be imperative that this information is made as accessible as possible to the wider scientific community. Although it is currently possible to access some information relating epigenetic data to common genetic variation and gene expression, providing an integrative approach, this is not available at multiple time points. Longitudinal studies can offer considerable added value in these settings and profiling using a comprehensive range of high-throughput methods can be overlaid on a wealth of exposure and phenotypic data, allowing researchers to explore specific hypotheses in silico and thus helping to prioritize resources for more detailed investigations.

In summary, longitudinal cohorts can offer a great deal in the context of epigenetic epidemiology, including identification of the major determinants of epigenetic variation in populations and a better understanding of the relationship between genetic and epigenetic variation. They provide an unprecedented opportunity to increase our understanding of the dynamic nature of epigenetic patterns and how changes occur in response to a wide range of environmental, lifestyle, and behavioral factors. Population-based studies will improve our knowledge of the extent and topography of inter-individual variation in epigenetic patterns and permit assessment of effect sizes of shifts in epigenetic patterns on health-related outcomes. A wealth of statistical approaches can be borrowed and adapted from related fields and be applied to longitudinal epigenetic analysis - an area of biostatistics that is likely to grow exponentially as high-throughput datasets become increasingly multi-dimensional. Insights into the temporal relationship between changes in epigenetic patterns and functional and health-related outcomes that can be gleaned from longitudinal studies will assist in defining causality. This, and other epidemiological methods to strengthen causal inference, will contribute to the identification of predictive epigenetic biomarkers and modifiable targets for intervention.

The ultimate goal of observational data generated in epidemiological investigations is to feed forward into clinical practice or public health. There is already evidence of translation of longitudinal biological data to clinical applications [83]. The incorporation of epigenetic biomarkers to enhance clinical tools for prediction and prognosis is beginning to emerge [5] (Table 2), and longitudinal cohorts will undoubtedly help in this domain.

Box 1: Potential units of change to examine epigenetic mechanisms

  • A single gene or gene region of interest

  • Single gene loci that have different temporal patterns between biological groups

  • A family of genes of known biological or clinical importance (such as those previously known to show exposure-related differential methylation)

  • A group of functionally related genes (for example, as identified by Gene Ontology or Kyoto Encyclopedia of Genes and Genomes (KEGG) terms)

  • A network of co-regulated genes (for example, using intersection with concurrent gene expression data or from previous literature)

  • Genes related by their linear proximity on the DNA strand (such as regional grouping, as done to examine differential methylation between and within individuals [70])

  • Genes related to the overlying chromatin architecture (such as knowledge of nucleosome position or histone modifications)

  • Genes that show similar patterns of change (for example, gene curve [71])

Box 2: Longitudinal modeling strategies for high-dimensional data

Many techniques determine differential time courses based on comparison of two groups of variables (for example, [69, 70, 84-86]). When there are more than two groups, Yuan and colleagues [71] have demonstrated the utility of hidden Markov models to classify genes based upon their temporal expression patterns, an approach that takes advantage of the information contained in time course data rather than ignoring it. If no groups are present, an alternative is to group genes that show similar temporal patterns (for example, [72]). Another approach is to group genes using a priori knowledge of biological similarities, which reduces the number of multiple comparisons. Using Gene Ontology annotation to group 'functionally' related genes, Zhang et al. [75] developed a non-parametric longitudinal gene set analysis of gene expression data to detect time-exposure interaction effects. This method is suitable for unbalanced data with missing time points. It is also appropriate for heteroscedastic data (where variance is uneven across a given data distribution) and non-normal data distributions.
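As a purely illustrative sketch of grouping loci that show similar temporal patterns when no predefined groups exist, the Python fragment below greedily groups trajectories by Pearson correlation. This is not the method of any cited study; the function names, the 0.9 threshold, and the example values are all hypothetical assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a flat trajectory carries no shape information
    return sxy / sqrt(sxx * syy)


def group_by_trajectory(trajectories, threshold=0.9):
    """Greedily group loci whose trajectories correlate with a group's seed.

    trajectories: dict mapping locus name -> list of values, one per time point.
    Returns a list of groups, each a list of locus names.
    """
    groups = []  # each entry is (seed trajectory, member names)
    for name, values in trajectories.items():
        for seed, members in groups:
            if pearson(seed, values) >= threshold:
                members.append(name)
                break
        else:
            # No existing group is similar enough: start a new one
            groups.append((values, [name]))
    return [members for _, members in groups]


# Hypothetical methylation proportions at four time points (assumed values)
example = {
    "locus_a": [0.10, 0.20, 0.30, 0.40],
    "locus_b": [0.50, 0.60, 0.72, 0.80],
    "locus_c": [0.90, 0.70, 0.50, 0.30],
    "locus_d": [0.45, 0.35, 0.26, 0.15],
}
groups = group_by_trajectory(example)
```

Here the two rising loci fall into one group and the two falling loci into another. A greedy pass depends on input order, so a real analysis would more likely use an established clustering procedure.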

Another consideration is the anticipated type of time course. If a cyclical pattern is expected - for instance, in the study of circadian rhythms or cell cycles - Li et al. [73] propose functional clustering using an autoregressive moving-average process. If the goal is to identify groups of co-expressed genes showing gradual changes over time that may be linked to disease progression, Qiu et al. [87] have developed a method to study gene expression in cancer tissue at various stages of malignant transformation, which may be applicable to epigenetic data.
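To illustrate what a cyclical time course looks like numerically, the sketch below computes the sample autocorrelation of a series at several lags, the quantity that autoregressive moving-average models describe. It is only a building block under assumed data, not the functional clustering procedure of Li et al.; the function name and the example series are hypothetical.

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at a given positive lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0:
        return 0.0  # a constant series has no temporal structure
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


# Hypothetical rhythm with a period of four sampling intervals, three cycles
series = [1.0, 2.0, 1.0, 0.0] * 3

# The lag with the highest autocorrelation suggests the period
best_period = max(range(1, 5), key=lambda k: autocorrelation(series, k))
```

For this assumed series the autocorrelation peaks at a lag of four sampling intervals, matching the period used to construct it.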

Units that consider genes as groups or networks may require a transition from viewing DNA methylation data as a two-dimensional entity (such as disease group and time) to a three-dimensional one (such as disease group, gene locus and time), or even data 'blocks' with greater dimensions. Matrix and tensor decompositions (such as independent component analysis, canonical correlation analysis, non-negative tensor factorization, and canonical-decomposition/parallel factor analysis), used in areas such as psychometrics and chemometrics, have been proposed as powerful representations of biological multi-dimensional data [88, 89]. Translation of such methods to DNA methylation is likely to follow.
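The move from a two-dimensional to a three-dimensional view can be made concrete with a small sketch. The Python fragment below stores hypothetical data as a group-by-locus-by-time array and 'unfolds' it into a matrix along one mode, a common preparatory step for tensor decompositions. It does not perform a decomposition itself, and the function name, layout, and values are assumptions made for illustration.

```python
def unfold(tensor, mode):
    """Unfold a 3-way nested list into a matrix along `mode` (0, 1 or 2).

    Rows index the chosen mode; columns enumerate the other two modes.
    """
    n0, n1, n2 = len(tensor), len(tensor[0]), len(tensor[0][0])
    if mode == 0:
        return [[tensor[i][j][k] for j in range(n1) for k in range(n2)]
                for i in range(n0)]
    if mode == 1:
        return [[tensor[i][j][k] for i in range(n0) for k in range(n2)]
                for j in range(n1)]
    if mode == 2:
        return [[tensor[i][j][k] for i in range(n0) for j in range(n1)]
                for k in range(n2)]
    raise ValueError("mode must be 0, 1 or 2")


# Hypothetical data: 2 disease groups x 3 gene loci x 2 time points
tensor = [
    [[0.10, 0.15], [0.40, 0.42], [0.80, 0.70]],  # group 0
    [[0.12, 0.30], [0.41, 0.43], [0.79, 0.50]],  # group 1
]

# One row per locus, with columns running over (group, time) combinations
by_locus = unfold(tensor, mode=1)
```

Unfolding along the locus mode gives a 3 x 4 matrix in which each row collects one locus across all groups and time points; unfolding along the other modes gives the corresponding group-wise and time-wise views.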

Although having multiple time points is advantageous for several reasons, a complication is that similar patterns of change in any group of people can start at different times (such as onset of puberty). This may obscure detection of meaningful but overlapping patterns. Such patterns can be unraveled using methods that account for lag between individuals, such as parallel factor analysis-related models [90] or spline-based models [91].
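The idea of accounting for lag between individuals can be sketched very simply. The Python fragment below searches for the integer shift that best aligns one individual's trajectory with a reference trajectory by maximizing their correlation over the overlapping time points. This is a deliberately simplified stand-in, not the parallel factor analysis-related or spline-based models cited above; all names and values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # flat segments cannot be aligned by shape
    return sxy / sqrt(sxx * syy)


def best_lag(reference, series, max_lag):
    """Return the integer shift (in time points) that best aligns the series.

    A positive lag means the pattern in `series` starts later than in
    `reference`.
    """
    best, best_r = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        # Pair reference[t] with series[t + lag] wherever both exist
        pairs = [(reference[t], series[t + lag])
                 for t in range(len(reference))
                 if 0 <= t + lag < len(series)]
        if len(pairs) < 3:
            continue  # too few overlapping points to judge similarity
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r > best_r:
            best, best_r = lag, r
    return best


# Hypothetical trajectories: the same pulse, starting two time points later
reference = [0, 0, 1, 3, 1, 0, 0, 0]
late = [0, 0, 0, 0, 1, 3, 1, 0]
lag = best_lag(reference, late, max_lag=3)
```

In this assumed example the estimated lag is two time points. Once each individual's lag is estimated, trajectories could be shifted onto a common time scale before patterns are compared across individuals.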