
Data Mining, Data Analytics, and Bioinformatics

Leveraging Big Data to Identify the Most Vulnerable and to Reduce Health Disparities


Bioinformatics big data hold the potential to improve healthcare delivery to those who are most economically and socially disadvantaged and historically underserved. This chapter presents a bioinformatics big data life cycle model with three segments: Segment 1, data acquisition, with four steps (data collection, data aggregation, data evaluation, and data tagging); Segment 2, data processing, with four steps (data processing, data transformation, data mining, and data analysis); and Segment 3, data preservation, with three steps (data classification, data archive, and data distribution). Data mining is often used to predict outcomes or future behavior. It is essential in research to track and identify patterns, such as health status disparities. Once these patterns are identified, big data analytics is used to generate insights.

Bioinformatics can change and save lives – by weakening a human trafficking chain, diagnosing infectious disease, and bringing attention to the resource-constrained Navajo Nation. However, sloppy data analysis and review can lead to disastrous repercussions, as exemplified by a COVID-19 data integrity scandal. An algorithm meant to provide healthcare to those with the greatest need can impose unintentional biases. Big data have helped elucidate health inequities, health crises, and institutional health discrimination in the Navajo Nation. Unfortunately, the potential of big data analytics in public health is not yet fully realized.

As the world of bioinformatics matures, big data research must intensify the focus on marginalized and underserved groups, health disparities, and health equity without compromising data integrity.





American Indians and Alaska Natives


Acute myocardial infarction


Best Babies Zone initiative


Coronavirus Aid, Relief, and Economic Security Act


Census block groups


US Centers for Disease Control and Prevention


Centers for Medicare & Medicaid Services


Chronic obstructive pulmonary disease


Clinical research informatics


Department of Defense


Electronic health records


Fiscal year


Human-centered design

Health FFRDC:

Alliance to Modernize Healthcare Federally Funded Research and Development Center


US Department of Health and Human Services


Health information technology


Healthcare Information Technology Standards Panel


Internet of Medical Things


Information technology


Learning Health Community




Mobile health


Mobility-based responsive index


National Center for Health Statistics


National Health Interview Survey


National Institute on Minority Health and Health Disparities


Navajo Nation Health Survey


Oregon Community Health Information Network


Office of the National Coordinator, HHS


Social determinants of health


Social Interventions Research and Evaluation Network at the University of California, San Francisco


Social network analysis


Translational bioinformatics


Department of Veteran Affairs


The VA’s legacy EHR system


World Health Organization



Author information

Authors and Affiliations


Corresponding author

Correspondence to Jean E. Garcia.




Best Babies Zone initiative

An initiative launched in Oakland, CA, to reduce infant mortality rates by focusing on economic development, community systems, health, and education

Big data analytics

Data mining uses algorithmic models to identify previously undiscovered patterns in a given dataset. Once these patterns are identified, they can be examined to generate insights. This examination is big data analytics (Bresnick 2017)


An umbrella term that encompasses the computerized management of biological, medical, and healthcare data from the initial data collection through the entire data life cycle

Bioinformatics big data

The massive amounts of healthcare data accumulated from patients and populations and the analytics that can give meaning to such data

Bioinformatics big data life cycle

All activities that annotate, integrate, and present data output in a format that facilitates interpretation and discovery


The Coronavirus Aid, Relief, and Economic Security Act, also known as the CARES Act, is a $2.2 trillion economic stimulus bill passed by the 116th US Congress and signed into law March 27, 2020, in response to the economic fallout of the COVID-19 pandemic in the United States


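CBGs, or census block groups, are clusters of census blocks and the smallest geographic unit for which the Census Bureau publishes sample data. Census blocks themselves, the smallest unit of census geography, vary widely in size: the largest block is over 8,500 square miles in Alaska, but half of the blocks are smaller than a tenth of a square mile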

Clinical informatics

The application of informatics and information technology to deliver healthcare services

Clinical research informatics

The sub-domain of biomedical informatics concerned with the development, application, and evaluation of theories, methods, and systems to optimize the design and conduct of clinical research and the analysis, interpretation, and dissemination of the information generated

Cloud computing

The practice of using a network of remote servers hosted on the internet to store, manage, and process data rather than on a local server or a personal computer

Consumer health informatics

The informatics specialization that focuses on the use of medical and health informatics to disseminate information to patients and providers. The field includes patient-focused informatics, health literacy programs, and consumer education


A highly contagious respiratory disease caused by the SARS-CoV-2 virus that was first identified in Wuhan, China, in December 2019



Cyclical translational model

During cyclical translational research, the proposed interventions are systematically tested, evaluated, and revised through multidisciplinary collaboration prior to implementation. A major component of the model is gathering input from the target population to determine their health needs

Data acquisition

The process of collecting and digitizing data. The four steps of the data acquisition segment are data collection, data aggregation, data evaluation, and data tagging

Data aggregation

The conversion of the data from its atomic form – the lowest level of detail – to a dataset expressed in summary form. Data aggregation can be performed manually or with aggregation software
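As a rough illustration of moving from atomic records to a summary form, the sketch below groups records and reports a count and mean per group. The record fields (`region`, `systolic_bp`) and the helper name `aggregate_by` are hypothetical and chosen only for this example; they do not come from the chapter.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical atomic records: one row per patient visit.
visits = [
    {"region": "north", "systolic_bp": 120},
    {"region": "north", "systolic_bp": 140},
    {"region": "south", "systolic_bp": 130},
]


def aggregate_by(records, key, value):
    """Group atomic records by `key` and summarize `value` per group."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record[value])
    return {
        group: {"count": len(values), "mean": mean(values)}
        for group, values in groups.items()
    }


summary = aggregate_by(visits, "region", "systolic_bp")
```

Here `summary` holds one entry per region rather than one per visit, which is the sense in which the aggregated dataset is expressed in summary form.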

Data analysis

The examination of data for errors, duplications, and ambiguities throughout the entire data processing segment of the big data life cycle

Data archive

The placement of data, either manually or with software, into permanent or temporary storage resources

Data classification

The organization and preparation of data for efficient storage by applying optimization, such as classification, arrangement, and compression

Data collection

The process of gathering and measuring information on variables of interest in an established systematic fashion

Data dissemination

The preparation of archived data for private or public end user access according to local permission procedures

Data evaluation

The examination of the data for quality. The data review includes discarding or repairing low-quality data, monitoring the data flow, and addressing and correcting detected process failures
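A minimal sketch of the "discard or repair" idea follows. It assumes a hypothetical `age` field and an assumed valid range of 0 to 120; real quality rules would depend on the dataset and are not specified in the chapter.

```python
def evaluate(records, valid_range=(0, 120)):
    """Return (kept, discarded) records after a simple quality check on `age`."""
    low, high = valid_range
    kept, discarded = [], []
    for record in records:
        age = record.get("age")
        # Repair: numeric strings such as " 42 " are converted to integers.
        if isinstance(age, str) and age.strip().isdigit():
            age = int(age)
        # Discard: missing or out-of-range values are not repairable here.
        if isinstance(age, int) and low <= age <= high:
            kept.append({**record, "age": age})
        else:
            discarded.append(record)
    return kept, discarded


# Hypothetical input records for illustration.
kept, discarded = evaluate([
    {"id": 1, "age": " 42 "},
    {"id": 2, "age": 250},
    {"id": 3, "age": None},
    {"id": 4, "age": 35},
])
```

Keeping the discarded records in a separate list, rather than dropping them silently, makes it possible to monitor how much data fails the check over time.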

Data mining

A set of techniques using automated tools to discover patterns within a large dataset. Data mining is often used to predict outcomes or future behavior and often reveals interesting or important patterns within the data. The process is essential in research to track and identify patterns, such as health status disparities

Data preservation

A series of managed activities to ensure continued access to data for as long as processing and manipulation are required. The computation result is in an encrypted form that can be decrypted by an authorized individual. The data preservation segment of the big data life cycle consists of the following four steps, namely, data classification, data quality assessment, data archive, and data dissemination

Data processing

The use of sophisticated analysis techniques to extract knowledge or generate additional value from the data. The three steps of the data processing segment are data transformation, data mining, and data analysis. Depending on the software configuration, the data processing steps may be performed simultaneously or vary in the order of execution

Data quality assessment

The evaluation of the quality level of classified data prior to storage

Data tagging

The manual or automated assignment of descriptors to data that define the data type, the confidentiality classification, and any other predefined criteria

Data transformation

The conversion of aggregate data into meaningful information for interpretation by computers and users. This step enables the data to be read, altered, and executed in an external application or database while maintaining the integrity of the data and embedded structures


A dataset is a structured collection of values recorded for each variable of interest, such as the height, weight, temperature, and volume of an object or the values of random numbers

Differential privacy

A system for publicly sharing information about a dataset by describing the patterns of groups within the dataset while withholding information about individuals in the dataset
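One common way to achieve this in practice is the Laplace mechanism, sketched below for a counting query. The function names and the example data are assumptions made for illustration, and the chapter does not say which mechanism it has in mind.

```python
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the values that satisfy `predicate`."""
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for value in values if predicate(value))
    # A counting query has sensitivity 1, so the noise scale is 1 / epsilon.
    return true_count + laplace_noise(1 / epsilon, rng)


# Hypothetical ages; the query asks how many people are 65 or older.
ages = [34, 71, 68, 45, 80, 52]
noisy = private_count(ages, lambda age: age >= 65, epsilon=1.0)
```

A smaller `epsilon` adds more noise and so gives stronger privacy at the cost of accuracy. The released value describes the group while masking whether any one individual is in it.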

Food desert

Geographic areas where residents’ access to affordable, healthy food options (especially fresh fruits and vegetables) is restricted or nonexistent due to the absence of grocery stores within convenient traveling distance

Garbled circuit

A constant-round secure computation protocol that allows parties to jointly compute any function while hiding their inputs from each other. Secure hardware enables a user to separate confidential data and code from the dataset and perform secure computation in a protected enclave inside the processor. The protected portion of the data or the code is inaccessible to the rest of the execution environment, which is enforced by the processor itself


Human-centered design is a framework that empowers the teams and individuals designing products, services, systems, and experiences to address the core needs of those who experience a problem.

Health disparities

Preventable differences in the burden of disease, injury, violence, or opportunities to achieve optimal health that are experienced by socially disadvantaged populations.

Health inequities

Systemic differences in the health status of different population groups that result in significant social and economic costs both to individuals and societies

Health information exchange

A system that enables healthcare providers and patients to access and securely share medical information electronically

Homomorphic encryption

The cryptographic process of holding the data in a concealed state while it is being processed and manipulated. The encrypted result can be decrypted later by an authorized individual
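To make the idea concrete, the sketch below uses a toy version of the Paillier scheme, which is additively homomorphic only: multiplying two ciphertexts yields an encryption of the sum of the plaintexts. The tiny primes, the function names, and the example values are assumptions for illustration and offer no real security. The chapter does not name a specific scheme.

```python
import math
import random


def keygen(p=293, q=433):
    """Build a toy Paillier key pair from two small primes (insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to g = n + 1
    return n, (lam, mu)


def encrypt(n, message, rng=random):
    """Encrypt an integer 0 <= message < n under public key n."""
    n_sq = n * n
    while True:
        r = rng.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, message, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(n, private_key, ciphertext):
    lam, mu = private_key
    x = pow(ciphertext, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(n, c1, c2):
    """Combine two ciphertexts so the result decrypts to the plaintext sum."""
    return (c1 * c2) % (n * n)


n, private_key = keygen()
c_total = add_encrypted(n, encrypt(n, 120), encrypt(n, 85))
total = decrypt(n, private_key, c_total)  # the sum, computed without decrypting the inputs
```

The party that calls `add_encrypted` needs only the public value `n`, so it can compute on the data without ever seeing the individual values.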

Human trafficking

The recruitment, transportation, transfer, harboring, or receipt of people through force, fraud, or deception with the aim of exploiting them for profit. The traffickers often use violence or fraudulent employment agencies and fake promises of education and job opportunities to trick and coerce their victims

Indian Health Service

The Indian Health Service, or IHS, is the federal agency responsible for providing medical and other health-related services to enrolled Native American tribal members. The mission and vision of the IHS are to raise the physical, mental, social, and spiritual health of AI/AN to the highest level while building healthy communities and quality healthcare systems through strong partnerships and culturally responsive practices

Intergenerational inheritance

The theory that epigenetic marks can be transmitted from one generation to the next due to the exposure of the parent generation to environmental factor(s) that leads to an epigenetic alteration


The ability of health information systems to work together within and across organizational boundaries in order to advance the effective delivery of healthcare for individuals and communities


The devices, software, hardware, and services that compile, process, and analyze health indicators

Long Walk of the Navajo

The Long Walk of the Navajo, also called the Long Walk to Bosque Redondo (Navajo: Hwéeldi), refers to the 1864 deportation and attempted ethnic cleansing of the Navajo people by the US federal government. Navajos were forced to walk from their land in what is now Arizona to eastern New Mexico. Some 53 different forced marches occurred between August 1864 and the end of 1866

Marginalized communities

Communities excluded from mainstream social, economic, educational, and/or cultural life due to unequal power relationships between social groups. Examples of marginalized populations include, but are not limited to, groups excluded due to race, gender identity, sexual orientation, age, physical ability, language, and/or immigration status


The practice of medicine and public health supported by mobile devices such as mobile phones, tablets, personal digital assistants, and the wireless infrastructure


The National Health Interview Survey, a nationally representative multipurpose health survey of the civilian noninstitutionalized US population conducted continuously throughout the year by the National Center for Health Statistics (NCHS)

Public health informatics

The development and use of interoperable information systems for outbreak management, biosurveillance, disease prevention, and electronic laboratory reporting. Public health informaticists include and engage targeted populations in their living and working environments to determine the most appropriate methods for informatics delivery

Qualitative data

Data that is descriptive and conceptual and is typically categorized based on traits and characteristics

Quantitative data

Data that can be counted, measured, and expressed using numbers

SAS-callable SUDAAN

SUDAAN is a proprietary statistical software package for the analysis of correlated data. SAS-callable SUDAAN runs within SAS, a popular statistical software package


Social determinants of health

The conditions in which people are born, grow, live, work, and age, which are shaped by the distribution of money, power, and resources at global, national, and local levels


Social network analysis is a field of data analytics that uses networks and graph theory to understand social structures. SNA techniques can also be applied to networks outside of the societal realm
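One of the simplest graph measures used in SNA is degree centrality, the share of other nodes to which a node is directly connected. The sketch below computes it for a small, hypothetical contact network; the node labels and the helper name are assumptions for illustration.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return each node's degree divided by the number of other nodes."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    others = len(neighbors) - 1
    if others < 1:
        return {node: 0.0 for node in neighbors}
    return {node: len(adjacent) / others for node, adjacent in neighbors.items()}


# Hypothetical undirected contact network.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("D", "E")]
centrality = degree_centrality(edges)
most_connected = max(centrality, key=centrality.get)
```

Nodes with high centrality are candidates for closer study, since removing or supporting them changes the structure of the network the most.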

Structural poverty

When the fabric of organizations, institutions, governments, or social networks contains an embedded bias which provides advantages for some members and marginalizes or produces disadvantages for other members

Three Vs of big data

The three Vs of big data are volume, velocity, and variety and help explain the nature of the data and key data management challenges associated with big data

Tribal data sovereignty

Indigenous data sovereignty is the right of a nation to govern the collection, ownership, and application of its own data. Indigenous data sovereignty accords with international declarations and covenants to which the United States has become a signatory, such as the United Nations Declaration on the Rights of Indigenous Peoples (UNDRIP).

Tribal self-determination

Native American self-determination refers to the social movements, legislation, and beliefs by which the Native American tribes in the United States exercise self-governance and decision-making on issues that affect their own people. In 1975, the US Congress enacted the Indian Self-Determination and Education Assistance Act, Public Law 93-638. The Act allowed for Indian tribes to have greater autonomy and to have the opportunity to assume the responsibility for programs and services administered to them on behalf of the Secretary of the Interior through contractual agreements


Twitter is a microblogging system that allows the user to send and receive short posts called tweets and follow other users

US Census

The US Census is a national survey conducted in the 50 states, the District of Columbia, and five US territories every 10 years to enumerate the population for taxation and political representation. The US Census is mandated by Article 1, Section 2 of the US Constitution. Refusal to take the Census is punishable by a fine of up to $100. The information for the US Census is aggregated for statistical analysis and is intended to remain confidential


Copyright information

© 2021 Springer Nature Switzerland AG


Cite this entry

Cullen, T., Garcia, J.E. (2021). Data Mining, Data Analytics, and Bioinformatics. In: Okpaku, S.O. (eds) Innovations in Global Mental Health. Springer, Cham.


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-70134-9

  • Online ISBN: 978-3-319-70134-9

  • eBook Packages: Springer Reference Medicine, Reference Module Medicine