Introduction

“Pseudopod epidemiology” is the expansion of longitudinal epidemiological studies either vertically, by adding new independent variables to evaluate the same dependent measures, e.g. new measures of subclinical cardiovascular disease (CVD) to evaluate the risk of coronary heart disease (CHD), or horizontally, by adding new dependent variables, such as cancer, osteoporosis, or dementia, to an original study designed to evaluate CHD. In the second (horizontal) approach, an investigator can apply the same independent variables used to evaluate the original dependent endpoint, e.g. CHD, or new independent variables, to the study of dementia, cancer, etc.

The rapidly evolving genomic technologies, proteomics, and metabolomics have been a strong impetus for the continued support of longitudinal studies and their pseudopod additions. The issues that will be reviewed in this paper relate to the successes and problems of pseudopod epidemiology.

Many modern era epidemiologists were trained that longitudinal epidemiological studies, such as the Framingham Heart Study beginning in the 1940s, were the primary or only epidemiology research methodology, especially for long incubation period diseases [1]. Many of the new era epidemiologists assumed that in the 1940s a group of investigators found a town outside of Boston, MA (Framingham) and sampled a group of individuals within a specific age range to participate in a longitudinal study that would determine the cause of heart disease. This is not true. First, the study was not a random sample; it included both volunteers and household members. Second, the variables selected for study were based on “preconceived hypotheses” derived from clinical observations, pathology, animal experimental studies, and smaller case–control studies [2, 3].

Bill Kannel, one of the great epidemiologists and leaders of the Framingham Heart Study, coined the term “risk factors” for variables which were determinants of CVD in the Framingham Heart Study, e.g. elevated cholesterol, smoking, and high blood pressure (BP) [4]. Later, in the 1960s and 1970s, groups of biostatisticians, initially led by Jerry Cornfield at the National Heart Institute and the University of Pittsburgh, developed methods for combining these risk factors into multivariate models and, later, longitudinal analysis including time to the event [5].

The development of computer technologies to handle large data sets made it possible to search data files from these longitudinal studies for many independent variables and relate them to the dependent variable, e.g. CVD. Therefore, by the 1970s and 1980s, the successful modern era of epidemiology had begun. Subsequently, the introduction of genetic technologies to study the host, especially Genome-Wide Association Studies (GWAS) [6] gave further impetus to this movement in epidemiology to large longitudinal studies with many pseudopods, both horizontal and vertical [7].

The genomic era and GWAS also greatly impacted epidemiological methods, as a GWAS did not require any specific hypotheses or knowledge of biologically plausible mechanisms of the relationship between dependent and independent variables. The GWAS and subsequent approaches to whole genome analysis and epigenetics changed the focus of many longitudinal studies to the host (genetics) and then molecular biology, proteomics, metabolomics, etc. [8–17].

It is important to distinguish studies whose main focus is understanding population biology from epidemiology. The goal in population biology studies is to expand knowledge of specific phenotypes, e.g. deep phenotyping [18], such as blood glucose and lipid levels (metabolomics) and circulating proteins (proteomics), and of the host genomic determinants of the levels of the phenotypes [19]. The population biology studies have made good use of large sample sizes and availability of specimens, e.g. blood, urine, etc. Such studies are important in improving understanding of pathophysiology of disease and new therapeutic approaches. These studies are not epidemiology.

Recently, epidemiology studies have utilized new technologies such as proteomics and metabolomics to evaluate numerous biomarkers as “risk factors” for diseases without any prior hypotheses or knowledge of the agent and environmental factors that determine, in part, the level of these metabolites, proteins, etc. [20, 21].

The concept of “risk factors” as initially conceived by Kannel et al. came to include almost any variable that had a significant “p value” for the relationship between dependent and independent variable [22]. Epidemiology studies were loaded with numerous “p values” (the peppered p value syndrome). Whether these variables were in the causal pathway, and what the possible pathophysiological basis of the statistically significant associations might be, was not necessarily a high priority [23, 24].
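
The arithmetic behind the “peppered p value syndrome” can be sketched as follows. This is a minimal illustration, assuming that every tested variable is truly unrelated to the outcome and that the tests are independent; the function names and the figure of 1,000 screened variables are hypothetical and do not come from the studies cited.

```python
def expected_false_positives(n_tests: int, alpha: float = 0.05) -> float:
    """Expected count of 'significant' results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of one or more false positives across independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the family-wise error rate at or below alpha."""
    return alpha / n_tests


if __name__ == "__main__":
    n_variables = 1000  # hypothetical number of biomarkers screened
    print(expected_false_positives(n_variables))
    print(prob_at_least_one_false_positive(n_variables))
    print(bonferroni_threshold(n_variables))
```

Under these assumptions, screening 1,000 unrelated variables at p ≤ .05 would be expected to yield about 50 nominally significant associations by chance alone.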

At face value, the use of existing longitudinal epidemiological studies for both vertical and horizontal pseudopod investigations makes excellent sense and should be a cost-effective approach for obtaining new important data. This may not be the case. What are the issues? First, there may be a mistaken concept that the identification of many new variables, from traditional biochemistry to metabolomics, proteomics, microRNA, etc., represents the identification of new risk factors for disease, i.e. independent variables. The vast majority of these newly identified variables may not be risk factors for the dependent variable (disease, pathology, etc.) of interest but, rather: (1) subclinical measurements of the disease [25]; (2) variables whose levels are determined by the extent of subclinical disease [26], i.e. reverse causality; (3) variables highly correlated with previously identified independent variables [27]; or (4) variables related to another dependent variable, e.g. a disease with similar characteristics (pleiotropy) [28]. Only in a fifth and most important category (5) are they variables in the causal pathway of the dependent variable, e.g. disease, i.e. a true “risk factor” [29].

If the independent variable is in the causal pathway then the next important questions should be the identification of the pathophysiology or cellular, molecular basis for the risk factor association and effects of potential therapeutic, e.g. preventive and public health, interventions. Epidemiology is not only data collection but a basic science of public health and preventive medicine [30].

To date, many pseudopod epidemiological studies have successfully generated observations in 1–4 above and have had little success in identifying “new risk factors in the causal pathway,” especially of sufficient relative and attributable risk to lead to important changes in prevention and treatment of disease. Have we just generated a lot of new data, many publications, many grants that have had relatively little impact? Colleagues will argue that the acquisition of new data, even in 1–4 above, is of considerable importance and could ultimately lead to the identification of specific risk factors and interventions over time. Why have these longitudinal epidemiological studies, and especially the pseudopods, not been able to generate new important ‘risk factors’ that have unique pathophysiology and ‘impact’ on the disease and, most important, resulted in changes in either preventative or clinical therapies [31, 32]? Why have these studies had limited impact?

If we accept epidemiology as a basic science of public health and preventive medicine, then impact could be defined as the identification of new risk factors whose modification, i.e. an increase or decrease in exposure to the risk factor or in the level of the risk factor, results in prevention of disease, e.g. changes in incidence or prevalence, decreased disability, an increase in active life expectancy, or a reduction in mortality. Under this definition, a new risk factor has impact only if its identification is followed by a successful intervention that reduces the magnitude of the disease. The intervention could be drug therapy, surgery, changes in lifestyle, removal of a specific agent from the social or physical environment, modification of host susceptibility by gene therapy, precision medicine, etc. [33, 34]. This concept closely links observational epidemiology with experimental epidemiology, e.g. clinical trials.

The cigarette smoking epidemic, for example, began in the early 1900s. Excess risk of cancers, especially lung cancer, was recognized long before the US Surgeon General’s Report in 1964 on the risk of cigarette smoking and lung cancer. There have been no clinical trials that have shown that smoking cessation significantly reduced the risk of lung cancer, i.e. p ≤ .05 [35, 36]. Therefore, clinical trial evidence that modification of a new “risk factor” reduced morbidity and mortality is the accepted gold standard but may not always be obtainable, especially for behavioral and community-based interventions. The epidemiological observations from these longitudinal studies can identify new risk factors many years before their pathophysiology is understood. The identification of dietary risk factors, e.g. a low protein diet, primarily from corn, as a risk factor in the causal pathway of pellagra preceded by many years the understanding and identification of the specific vitamin, i.e. nicotinic acid, in the pathophysiology of pellagra [37].

The impact of identification and modification of a new risk factor in these longitudinal epidemiology studies and their pseudopods could also be based on observational changes in the new risk factor(s) and in incidence or morbidity and mortality over time or within a specific geographic area, e.g. modification of an “epidemic”. For example, smoking cessation had a substantial effect on individual risk of lung cancer and was associated with a substantial decline in lung cancer mortality rates. The changes in blood cholesterol, BP, and smoking have been linked repeatedly to the decline in CHD incidence and mortality rates. A community-based effort to change the type of oils used in food preparation, i.e. replacement of saturated fat by polyunsaturated fat, resulted in a decline in blood cholesterol and CHD [38].

A further argument that is often used to quell the concern of lack of impact of “new risk factors” is that most long incubation period diseases are multifactorial, i.e. many risk factors each with a relatively small effect size but together they contribute a substantial component of the attributable risk of the disease. Therefore, if we keep working at it, our pseudopod studies will identify the 150 risk factors for “CHD”, an extremely dubious proposition.
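
The multifactorial argument can be made concrete with a rough calculation. The sketch below uses Levin's formula for the population attributable fraction and combines factors under the strong assumptions that they are all truly causal, act independently, and do not interact; the prevalence and relative risk values are hypothetical, chosen only for illustration.

```python
from math import prod
from typing import Iterable


def attributable_fraction(prevalence: float, relative_risk: float) -> float:
    """Levin's formula: share of cases attributable to a single exposure."""
    excess = prevalence * (relative_risk - 1.0)
    return excess / (1.0 + excess)


def combined_attributable_fraction(fractions: Iterable[float]) -> float:
    """Combine fractions, assuming independent and non-interacting exposures."""
    return 1.0 - prod(1.0 - f for f in fractions)


if __name__ == "__main__":
    # Hypothetical inputs: 150 factors, each with 20% prevalence and RR = 1.1.
    single = attributable_fraction(prevalence=0.2, relative_risk=1.1)
    combined = combined_attributable_fraction([single] * 150)
    print(round(single, 3), round(combined, 3))
```

With these hypothetical inputs, a single factor accounts for about 2 % of cases, while 150 of them together would account for roughly 95 %. The calculation holds only if every one of the 150 is a true, independent causal factor, which is exactly the premise questioned here.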

Recently, Mendelian randomization was developed to separate these “independent variables” with small effects (true risk factors?), i.e. part of multifactorial risk, from the numerous variables not in the causal pathway [39–41].
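
The simplest form of Mendelian randomization, the Wald ratio for a single genetic variant, can be sketched as below. It assumes the variant is a valid instrument, i.e. it is associated with the exposure and affects the outcome only through that exposure; the function name and the input numbers are illustrative and are not taken from the cited studies.

```python
from typing import Tuple


def wald_ratio(beta_exposure: float, beta_outcome: float,
               se_outcome: float) -> Tuple[float, float]:
    """Causal effect estimate of exposure on outcome from one genetic variant.

    beta_exposure: variant-exposure association (per allele)
    beta_outcome:  variant-outcome association (per allele)
    se_outcome:    standard error of beta_outcome
    """
    if beta_exposure == 0:
        raise ValueError("variant must be associated with the exposure")
    estimate = beta_outcome / beta_exposure
    # First-order approximation: ignores uncertainty in beta_exposure.
    standard_error = se_outcome / abs(beta_exposure)
    return estimate, standard_error


if __name__ == "__main__":
    # Hypothetical per-allele associations, for illustration only.
    estimate, se = wald_ratio(beta_exposure=0.5, beta_outcome=0.1, se_outcome=0.05)
    print(estimate, se)
```

A variable that is associated with disease observationally but whose genetically predicted level shows a ratio estimate near zero is less likely to lie in the causal pathway, provided the instrument assumptions hold.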

The concern about issues of risk factors and causality in chronic disease (long incubation period) epidemiology is not new. In the 1950s, much as today, chronic disease epidemiology was questioned as having little relevance, especially as compared to short incubation period or infectious disease epidemiology, in which the specific etiological agent was either identified or not identified, e.g. HIV, polio, Zika virus, etc., i.e. the risk was either 0 or 100 % [40]. The explanation given in the 1960s by Lilienfeld and others to account for the low relative risks in chronic diseases was that many of the variables were vectors of disease in a chain from broad descriptors, such as social class, poor neighborhoods, poverty, low education, that were determinants of the extent of exposure to the true causal factors [1, 42–44]. For example, in Flint, MI, changes in the water supply resulted in acidic water which leached lead out of pipes from older homes, resulting in high lead levels in the water [45]. The consumption of this water by children resulted in high blood lead levels and poor cognition [46]. In this model, all of the variables leading to the high blood lead levels would be classified as vectors and not as risk factors. Modern epidemiology would classify all of the markers from the acidic water supply to ultimate blood lead levels as risk factors, a multifactorial model caused by poverty, acidic water supply, lead in the water pipes, elevated blood lead levels, and brain damage. Furthermore, host susceptibility, e.g. genomics, clearly played an important role since not all children who have elevated blood lead levels will develop cognitive changes [46]. An intervention that modified the water supply so that lead did not leach out of the pipes would be a successful public health effort.

Traditional longitudinal studies and their pseudopods could utilize a great deal of resources describing the vectors of a specific disease as “risk factors” but fail to identify the specific causal agents, i.e. the true risk factor(s). Modification of “vectors” can have a big impact on disease, e.g. pellagra [37], as above, yet not identify the causal “risk factor”. The big payoff in reducing morbidity and mortality, however, likely occurs by identifying the specific agent(s).

The concept of “multifactorial causes of disease” is probably a misnomer. Most diseases to-date are determined by a single agent or a small number of agents, broadly defined to include biological agents, e.g. viruses, as well as environmental particulate pollution and psychosocial factors [47, 48]. Multiple agents can cause the same or similar disease. For example, exposure to rubella and to Zika virus in the first trimester are both associated with an increase in microcephaly and other developmental abnormalities. High ambient temperature and stagnant water are both vectors, e.g. the environment, that lead to an increase in specific types of mosquitoes and Zika virus exposure in the population, and are risk factors for infection of a pregnant woman. Not every woman exposed to the Zika virus in the first trimester will have a baby with microcephaly. Other environmental factors, such as prevalence of Dengue virus infection, as well as immunological responses, placental characteristics, and host susceptibility (genetics), contribute to the risk of Zika virus infection and developmental abnormalities.

Host susceptibility (genomics) after exposure to specific agents is not a new concept. The best example of susceptibility was exposure to poliomyelitis. Only a few of those infected with polio virus developed severe paralysis, maybe 1 in 1000 [49]. Part of the difference in risk is clearly genetic or host susceptibility. Other vectors or epi-genetic effects may be important. Risk of clinical paralytic polio is related to age at time of first infection.

Improved nutrition, especially an increase in protein intake, played a major role in reducing the risk of clinical tuberculosis given infection with Mycobacterium tuberculosis. The agent of disease is the tuberculosis mycobacterium, and susceptibility is likely related to genetic, epi-genetic and lifestyle factors, e.g. higher protein intake, etc. Many modern epidemiologists would call this a “multifactorial etiology of disease” [50]. It is likely that for many diseases, the environmental and other cofactors are being measured and attributed as the agent(s) of the disease [50–52].

Discussion

The horizontal and vertical pseudopod studies may not be reaching their goal of impact on public health for the following reasons:

They may not be hypothesis-driven but rather data collection exercises, primarily descriptive, e.g. categories 1–4 above

The goal may be number of publications, grants, and successful identification of “risk factors” with significant p values [53].

They are attached to study designs and populations that are not appropriate for identifying “risk factors”

For example, many pseudopod epidemiological studies are attempting to study inflammation. They define “inflammation” by levels of cytokines [54] or acute phase proteins (APP) [55]. There are numerous cytokines and splice variants of the same cytokine; similarly, many APP are produced in the liver and will be increased or decreased by activation of liver protein synthesis. Some APPs may also decline with progressive liver injury. It may be better to study inflammation in inflammatory diseases, such as rheumatoid arthritis [56, 57], HIV [58, 59], etc., in which the association of inflammation, specific cytokines, and T and B cell function is known and can be modified by drug therapy as part of the treatment armamentarium, rather than in pseudopod population longitudinal epidemiological studies [60].

Atherosclerosis is the underlying pathology for clinical CHD, a long incubation period disease driven primarily by levels of low density lipoprotein cholesterol (LDL-C), apolipoprotein-B, and LDL particles, and secondarily by cigarette smoking, elevated BP, insulin resistance, and diabetes. This has been known for at least 60 years [61]. The important unanswered question is: what are the risk factors for the acute precipitating events that modify atherosclerosis and result in heart attack, especially sudden cardiac death? Such pathophysiology has an incubation period of seconds to minutes. How best to identify risk factors for thrombosis, coagulation, ruptured or fissured plaques, arrhythmias? The study designs to identify such “agents” require both frequent contact with participants and better methods of identification of possible “agents,” e.g. air pollution, psychosocial stressors, infections, and their relationship to “vulnerable plaques” and thrombosis [62–65]. If inflammation plays an important role in conversion from atherosclerosis to clinical disease, then epidemiology studies need better measures of “inflammation,” e.g. innate and adaptive immune responses and their relationship to coagulation, thrombosis, etc. [18, 66, 67]. Most large epidemiology studies and their pseudopods may not be the best study population or design to answer the specific questions of short term determinants of incident CVD events.

The phenotype or the dependent variable disease is not adequately measured

Many long term large epidemiological studies include aging cohorts, leading to new interest in studying dementia and other age-related disorders. The definition of dementia is a change in cognition over time which impacts on function. Dementia, like heart disease, is not one disease but is “caused” by different pathologies. The Framingham Study would not have been successful if it had studied “heart disease” rather than “coronary heart disease”; similarly, failure to have specificity as to the pathology, i.e. amyloid, tau, vascular disease, neurodegeneration, may limit the epidemiological study of dementia [68]. New techniques, such as MRI, PET, cerebrospinal fluid examination, and cerebral blood flow measurements, have substantially changed the methods of identifying “risk factors” and host genetic susceptibility for the specific pathology of dementia [69].

Modern era cancer epidemiology studies are based not on site-specific cancer diagnosis only, e.g. breast or colon cancer, but rather on molecular phenotyping of the cancer based on somatic mutations. Mutations may occur many years prior to initial clinical diagnosis, which limits the success of identifying etiological agent(s) [31, 70]. A key issue is whether these new technologies can be integrated into ongoing longitudinal studies and their pseudopods, or whether new study designs are more appropriate [71]. Larger and larger population studies that cannot measure the agent(s) in the causal pathway for a specific cancer are not the answer; the answer is rather the application of new technologies to measure “agents” in at risk populations.

The independent variables are not accurate or repeatable

The within individual variability over time may be as large as the between individual differences [53, 72, 73]. For long incubation period disease, the level of the risk factors, e.g. the independent variables measured many years before, may be more important than proximate measured variables. For example, blood cholesterol level at younger ages may be a better measure of risk for clinical coronary artery disease. Duration of exposure is an important variable. Possibly epi-genetic markers could provide a window on exposure to an agent, e.g. a risk factor, that was important many years in the past. Study designs, such as long term life course epidemiology studies, may be a better approach to evaluating “risk factors” that likely change over time, especially if such change is a function of evolving subclinical disease, e.g. inflammatory markers [74, 75]. A potential resource for long incubation period epidemiology studies could be data collected many years ago, for example, in prior studies of children and young adults.
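
The consequence of within individual variability can be illustrated with the classical regression dilution calculation. This is a sketch only, assuming a single measurement per person with purely random, non-differential error; the variance values and function names are hypothetical.

```python
def reliability_ratio(between_var: float, within_var: float) -> float:
    """Share of observed variance that reflects true between-person differences."""
    total = between_var + within_var
    if total <= 0:
        raise ValueError("total variance must be positive")
    return between_var / total


def attenuated_slope(true_slope: float, between_var: float,
                     within_var: float) -> float:
    """Expected regression slope when the exposure is measured once, with error."""
    return true_slope * reliability_ratio(between_var, within_var)


if __name__ == "__main__":
    # Hypothetical case: within-person variance equals between-person variance.
    print(reliability_ratio(between_var=1.0, within_var=1.0))
    print(attenuated_slope(true_slope=0.4, between_var=1.0, within_var=1.0))
```

When the within individual variance equals the between individual variance, the reliability ratio is 0.5, so an association estimated from a single measurement would be expected to be about half its true size.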

Many independent variables are measured poorly in longitudinal epidemiological studies, not necessarily through the fault of the investigator, especially in relatively homogeneous population studies and their pseudopods, leading to spurious associations. For example, measurements of diet in many longitudinal pseudopod studies using food frequency questionnaires or single day dietary recall fail to discriminate, in homogeneous populations, the within individual versus between individual variation or the bias associated with eating behaviors [76]. Much of the knowledge of nutrition and CVD, cancer, etc. has been derived from studying unusual populations, e.g. those consuming a vegan diet [77], populations in southern Europe consuming a Mediterranean-type diet [78, 79] with low risks of CHD, and populations in Japan with very high intake of omega-3 fatty acids and low CHD and other diseases [80], or from well controlled feeding studies and diet trials [81, 82].

Epidemiology studies thrive on studying differences. An epidemic is the rapid increase in frequency of disease, e.g. AIDS, which led quickly to the identification of HIV. Other examples are the recognition of a subpopulation within longitudinal studies with very low cholesterol levels, which led to the identification of the genetic variant PCSK9 and subsequently a new class of drug therapies to treat hyperlipidemia, and the recent observations of very high levels of high density lipoprotein cholesterol (HDL-C), attributed to genetic disorders that led to inability to clear HDL-C in the liver, with an increase rather than decrease in the risk of coronary artery disease (CAD) [83, 84].

Similarly, the recognition of much higher incidence and prevalence and mortality from kidney disease among blacks as compared to whites, especially at younger ages, originally attributed in most epidemiological studies to the higher prevalence of hypertension and diabetes among blacks, was in part due to striking differences in the frequency of ApoL, a gene related to susceptibility to trypanosomiasis that is much higher in the black as compared to white population [85, 86]. Similarly, studies of Greenlandic Inuit have led to interesting interrelationships of diet and genetic polymorphisms and possible diseases [87].

Pseudopod studies which focus on evaluation of determinants of unusual attributes using new measurement technologies within these large longitudinal studies likely have the best chance of identifying the agents and understanding of the pathophysiology of the disease, a combination of population biology and epidemiology [88].

The primary study is too short given the incubation period for the disease

The study may only be capable of measuring intermediate endpoints and short term events, i.e. the beginning of the incubation period. One of the most important aspects of current longitudinal studies is the long term follow up of their cohort(s) and the potential to investigate diseases with very long incubation periods, especially to older ages, for example, dementia, sarcopenia, Parkinson’s disease, macular degeneration. Horizontal and vertical pseudopod studies have successfully utilized these long term longitudinal studies to investigate these diseases with very long incubation periods. Unfortunately, there has been a tendency to stop or severely pare down the data collection in longitudinal studies over long periods of time before the cohort has reached the time of highest incidence and uniqueness for the pseudopod studies, and especially before it is possible to distinguish the susceptibles from the nonsusceptibles at very old ages [89].

In summary, large epidemiological studies and their pseudopods have subsumed the entire repertoire of long incubation period epidemiological research. These studies have provided interesting and important papers, generated collaborative projects across studies, including internationally, and have been a very important resource for genetic studies of host susceptibility and evaluation of new measurement techniques, such as imaging, proteomics, metabolomics, epigenetics, microRNA. Pseudopod studies, as noted, both horizontal and vertical, have become the norm for “new epidemiologists”. Their impact to-date has not been as great as it should be.

Recommendations that can greatly enhance the future of pseudopod epidemiological studies include:

  1. First, the most important tenet is that all or nearly all diseases are caused by a single agent or a small number of agents. Host susceptibility in part determines response of individuals to exposure to the agent. Failure to find “new agents” is not, as some have suggested, due to prior identification of all important “agents”, i.e. the low hanging fruit, but rather to poor study designs, population selection and lack of utilization of new technologies [31]. Finding these agent(s) should be the cornerstone of pseudopod and new longitudinal epidemiology studies. Rheumatoid arthritis (RA) is one of many autoimmune diseases. The pathobiology of RA and the role of innate and adaptive immunity are well-defined and very effective drug therapies are available. The task for epidemiologists is to determine if there are specific etiological agents for RA in ‘susceptible hosts’. The hypothesis is that, similar to infectious diseases, there are one or two important risk factors that are the specific agents of disease and that their distribution and exposure to the individual is determined by physical, social, environmental factors, e.g. vectors. The risk of disease is related to the interaction of these specific agents with host susceptibility factors, as measured by genomics, epi-genetics, RNA analysis, etc. [90].

  2. Pseudopod epidemiology studies should be driven by specific well-defined hypotheses identified prior to and independent of the data analysis. Publications, especially in epidemiological journals, should require a statement of the hypothesis.

  3. Epidemiological studies should not only define the hypothesis but also present the possible basis in pathophysiology. In some studies, the hypothesis may be based on clinical observations, smaller studies and unique populations, time trends, geographic variations [91–94].

  4. Epidemiological studies thrive on the unusual, innovative ideas and hypotheses and not the same old same old. The most successful longitudinal studies and their pseudopod studies have identified the unusual, as previously noted. Thus, the identification of the unique subpopulations within these longitudinal studies may provide the best application for both vertical and horizontal pseudopod studies. Summary analyses, such as age-, race-, and sex-adjustment, or multivariate analysis without evaluation of variables separately likely have missed important observations. Adjustment for sex probably requires a surgical procedure. For example, several reports have suggested that very physically active athletes may have an increased risk of amyotrophic lateral sclerosis. We further tested this hypothesis in a pseudopod study within the large Women’s Health Initiative, which collected excellent data on strenuous physical activity in a large cohort of women. The unique population was the very small number of women who were very physically active within this large sample of women. Studying all physically active women without restriction to the very physically active would have missed the initial observation [95]. Innovative ideas and unique studies should receive the highest priority.

  5. Commitment to improved research technologies as a “new tool” for pseudopod studies, e.g. imaging, proteomics, metabolomics, whole genome analysis [88, 96, 97], will enhance these pseudopod studies. New technologies should be used to answer important hypotheses and not just to collect new measures of the same independent or dependent variables continually, i.e. circular epidemiology [98]. The hypothesis drives the use of new technologies, not technologies driving the study design. There is a real risk that continued focus on new technologies and “finding new risk factors” can delay implementation of proven preventive approaches, such as in CVD [29, 33, 61, 99].

  6. The study of proteomics, metabolomics, genomics, etc. without specific hypotheses is an important research agenda, especially to identify new or better therapies. This is not epidemiological research.

  7. In the past, for both infectious and chronic long incubation period disease, epidemiology-type studies that found the specific “agents” have been the big winners. If the specific agents of disease or proximate environmental vectors are not identified, then continued high incidence and prevalence of disease in the population, with high morbidity and mortality compared to those who do not have the disease, is very likely, i.e. primary prevention is the big winner. The future of high quality epidemiology studies likely will depend on the ability to develop and utilize new approaches for the measurement of “agents”, whether living organisms, environmental toxins, specific nutrients in food, energy expenditure, or psychosocial variables. Especially important for long incubation period epidemiology studies will be methods to evaluate exposures prior to or during the incubation period, both at the population and individual level. Studies of unique populations will likely provide the best opportunities for future epidemiology research.

Epidemiology has been remarkably successful in the past in identifying the important agents of disease, the impact of the environment, both physical and social, and interrelationship of host susceptibility. Most important, epidemiologists and others have utilized this new knowledge to impact on the incidence of disease, disability, and ultimately the increase in active life expectancy.

There is certainly nothing wrong with identifying large populations and measuring every known or unknown independent variable in the hope of finding a magic bullet that will lead to better drug or surgical therapies. It is not epidemiology, however, and based on past experience it is unlikely to be very successful, especially compared to epidemiological approaches [100, 101]. Therefore, in the future, as in the past, smart epidemiologists using epidemiology methods, good study designs, and new technologies to identify the “risk factors” for the higher “hanging fruit”, and the application of such knowledge, will remain the cornerstone of improving the health of the population [102].