  • Textbook
  • Open Access
  • © 2016

Secondary Analysis of Electronic Health Records


  • Written with the aim of promoting an inter-disciplinary and ethical approach to health data analytics
  • Teaches how all clinicians, with the help of data scientists, will share the responsibility of growing the knowledge base and transforming practice to improve care
  • Shows how Big Data in Healthcare is ushering in the era of precision medicine and ethically sound decision making in healthcare


Buying options

Softcover Book USD 59.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info
Hardcover Book USD 59.99
Price excludes VAT (USA)
  • Durable hardcover edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout


Table of contents (30 chapters)

  1. Front Matter

    Pages i-xxi
  2. Setting the Stage: Rationale Behind and Challenges to Health Data Analysis

    1. Front Matter

      Pages 1-2
    2. Objectives of the Secondary Analysis of Electronic Health Record Data

      • Sharukh Lokhandwala, Barret Rush
      Pages 3-7 (Open Access)
    3. Review of Clinical Databases

      • Jeff Marshall, Abdullah Chahin, Barret Rush
      Pages 9-16 (Open Access)
    4. Challenges and Opportunities in Secondary Analyses of Electronic Health Record Data

      • Sunil Nair, Douglas Hsu, Leo Anthony Celi
      Pages 17-26 (Open Access)
    5. Pulling It All Together: Envisioning a Data-Driven, Ideal Care System

      • David Stone, Justin Rousseau, Yuan Lai
      Pages 27-42 (Open Access)
    6. The Story of MIMIC

      • Roger Mark
      Pages 43-49 (Open Access)
    7. Integrating Non-clinical Data with EHRs

      • Yuan Lai, Edward Moseley, Francisco Salgueiro, David Stone
      Pages 51-60 (Open Access)
    8. Using EHR to Conduct Outcome and Health Services Research

      • Laura Myers, Jennifer Stevens
      Pages 61-70 (Open Access)
    9. Residual Confounding Lurking in Big Data: A Source of Error

      • John Danziger, Andrew J. Zimolzak
      Pages 71-78 (Open Access)
  3. A Cookbook: From Research Question Formulation to Validation of Findings

    1. Front Matter

      Pages 79-80
    2. Formulating the Research Question

      • Anuj Mehta, Brian Malley, Allan Walkey
      Pages 81-92 (Open Access)
    3. Defining the Patient Cohort

      • Ari Moskowitz, Kenneth Chen
      Pages 93-100 (Open Access)
    4. Data Preparation

      • Tom Pollard, Franck Dernoncourt, Samuel Finlayson, Adrian Velasquez
      Pages 101-114 (Open Access)
    5. Data Pre-processing

      • Brian Malley, Daniele Ramazzotti, Joy Tzung-yu Wu
      Pages 115-141 (Open Access)
    6. Missing Data

      • Cátia M. Salgado, Carlos Azevedo, Hugo Proença, Susana M. Vieira
      Pages 143-162 (Open Access)
    7. Noise Versus Outliers

      • Cátia M. Salgado, Carlos Azevedo, Hugo Proença, Susana M. Vieira
      Pages 163-183 (Open Access)
    8. Exploratory Data Analysis

      • Matthieu Komorowski, Dominic C. Marshall, Justin D. Salciccioli, Yves Crutain
      Pages 185-203 (Open Access)
    9. Data Analysis

      • Jesse D. Raffa, Marzyeh Ghassemi, Tristan Naumann, Mengling Feng, Douglas Hsu
      Pages 205-261 (Open Access)
    10. Sensitivity Analysis and Model Validation

      • Justin D. Salciccioli, Yves Crutain, Matthieu Komorowski, Dominic C. Marshall
      Pages 263-271 (Open Access)

About this book

This book trains the next generation of scientists representing different disciplines to leverage the data generated during routine patient care. It formulates a more complete lexicon of evidence-based recommendations and supports shared, ethical decision making by doctors with their patients.

Diagnostic and therapeutic technologies continue to evolve rapidly, and both individual practitioners and clinical teams face increasingly complex ethical decisions. Unfortunately, the current state of medical knowledge does not provide the guidance to make the majority of clinical decisions on the basis of evidence.

The present research infrastructure is inefficient and frequently produces unreliable results that cannot be replicated. Even randomized controlled trials (RCTs), the traditional gold standard of the research reliability hierarchy, are not without limitations. They can be costly, labor intensive, and slow, and can return results that are seldom generalizable to every patient population. Furthermore, many pertinent but unresolved clinical and medical systems issues do not seem to have attracted the interest of the research enterprise, which has come to focus instead on cellular and molecular investigations and single-agent (e.g., a drug or device) effects. For clinicians, the end result is a bit of a “data desert” when it comes to making decisions. The new research infrastructure proposed in this book will help the medical profession to make ethically sound and well-informed decisions for their patients.

Authors and Affiliations

  • Massachusetts Institute of Technology, Cambridge, USA

    MIT Critical Data

About the author

MIT Critical Data

MIT Critical Data consists of data scientists and clinicians from around the globe brought together by a vision to engender a data-driven healthcare system supported by clinical informatics without walls. In this ecosystem, the creation of evidence and clinical decision support tools is initiated, updated, honed and enhanced by scaling the access to and meaningful use of clinical data.

Leo Anthony Celi

Leo has practiced medicine on three continents, giving him broad perspectives in healthcare delivery. His research is on secondary analysis of electronic health records and global health informatics. He founded and co-directs Sana at the Institute for Medical Engineering and Science at the Massachusetts Institute of Technology. He also holds a faculty position at Harvard Medical School as an intensivist at the Beth Israel Deaconess Medical Center and is the clinical research director for the Laboratory of Computational Physiology at MIT. Finally, he is one of the course directors for HST.936 at MIT – innovations in global health informatics and HST.953 – secondary analysis of electronic health records.


Peter Charlton

Peter gained the degree of MEng in Engineering Science in 2010 from the University of Oxford. Since then he has held a research position, working jointly with Guy's and St Thomas' NHS Foundation Trust and King's College London. Peter’s research focuses on physiological monitoring of hospital patients, divided into three areas. The first area concerns the development of signal processing techniques to estimate clinical parameters from physiological signals. He has focused on unobtrusive estimation of respiratory rate for use in ambulatory settings, invasive estimation of cardiac output for use in critical care, and novel techniques for analysis of the pulse oximetry (photoplethysmogram) signal. Secondly, he is investigating the effectiveness of technologies for the acquisition of continuous and intermittent physiological measurements in ambulatory and intensive care settings. Thirdly, he is developing techniques to transform continuous monitoring data into measurements that are appropriate for real-time alerting of patient deteriorations.

Mohammad Ghassemi

Mohammad is a doctoral candidate at the Massachusetts Institute of Technology. As an undergraduate, he studied Electrical Engineering and graduated as both a Goldwater scholar and the University's “Outstanding Engineer”. In 2011, Mohammad received an MPhil in Information Engineering from the University of Cambridge where he was also a recipient of the Gates-Cambridge Scholarship. Since arriving at MIT, he has pursued research at the interface of machine learning and medical informatics. Mohammad's doctoral focus is on signal processing and machine learning techniques in the context of multi-modal, multi-scale datasets. He has helped put together the largest collection of post-anoxic coma EEGs in the world. In addition to his thesis work, Mohammad has worked with the Samsung corporation and several entities across campus building “smart devices”, including a multi-sensor wearable that passively monitors the physiological, audio and video activity of a user to estimate a latent emotional state.

Alistair Johnson

Alistair joined the Laboratory for Computational Physiology as a postdoctoral associate in 2015. He received his B.Eng in Biomedical and Electrical Engineering at McMaster University, Canada, and subsequently read for a D.Phil in Healthcare Innovation at the University of Oxford. His thesis was titled “Mortality and acuity assessment in critical care”, and its focus included using machine learning techniques to predict mortality and develop new severity of illness scores for patients admitted to intensive care units. Before joining the LCP, Alistair spent a year as a research assistant at the John Radcliffe hospital in Oxford, where he worked on building early alerting models for patients post-ICU discharge. Alistair’s research interests revolve around the use of data collected during routine clinical practice to improve patient care.

Matthieu Komorowski

Matthieu holds board certification in anesthesiology and critical care in both France and the UK. A former medical research fellow at the European Space Agency, he completed a Master of Research in Biomedical Engineering at Imperial College London focusing on machine learning. Dr Komorowski now pursues a PhD at Imperial College and a research fellowship in intensive care at Charing Cross Hospital in London. In his research, he combines his expertise in machine learning and critical care to generate new clinical evidence and build the next generation of clinical tools such as decision support systems, with a particular interest in septic shock, the number one killer in intensive care and the single most expensive condition treated in hospitals. 


Dominic Marshall

Dominic is an Academic Foundation doctor in Oxford, United Kingdom. Dominic read Molecular and Cellular Biology at the University of Bath and worked at Eli Lilly in their Alzheimer’s disease drug hunting research program. He pursued his medical training at Imperial College London where he was awarded the Santander Undergraduate scholarship for academic performance and ranked first overall in his graduating class. His research interests range from molecular biology to analysis of large clinical data sets and he has received non-industry grant funding to pursue the development of novel antibiotics and chemotherapeutic agents. Alongside clinical training, he is involved in a number of research projects focusing on analysis of electronic health care records.

Tristan Naumann

Tristan Naumann is a PhD candidate in Electrical Engineering and Computer Science at MIT working with Dr. Peter Szolovits in CSAIL’s Clinical Decision Making group. His research includes exploring relationships in complex, unstructured data using data-informed unsupervised learning techniques, and the application of natural language processing techniques in healthcare data. He has been an organizer for workshops and “datathon” events, which bring together participants with diverse backgrounds in order to address biomedical and clinical questions in a manner that is reliable and reproducible.

Kenneth Paik

Kenneth is a clinical informatician driving quality improvement and democratizing access through technology innovation, combining a multidisciplinary background in medicine, artificial intelligence, business management, and technology strategy. He is a research scientist at the MIT Laboratory for Computational Physiology investigating the secondary analysis of health data and building intelligent decision support systems. As the co-director of Sana, he leads programs and projects driving quality improvement and building capacity in global health. He received his MD and MBA degrees from Georgetown University and completed fellowship training in biomedical informatics at Harvard Medical School and the Massachusetts General Hospital Laboratory for Computer Science.

Tom Joseph Pollard

Tom is a Postdoctoral Associate at the MIT Laboratory for Computational Physiology. Most recently he has been working with colleagues to release MIMIC-III, an openly-accessible critical care database. Prior to joining MIT in 2015, Tom completed his PhD at University College London, UK, where he explored models of health in critical care patients in an interdisciplinary project between the Mullard Space Science Laboratory and University College Hospital. Tom has a broad interest in how we can improve the way that critical care data is managed, shared, and analyzed for the benefit of patients. He is a Fellow of the Software Sustainability Institute.

Jesse Raffa

Jesse is a research scientist in the Lab for Computational Physiology at the Massachusetts Institute of Technology in Cambridge, USA. He received his PhD in biostatistics from the University of Waterloo (Canada) in 2013. His primary methodological interests are related to the modeling of complex longitudinal data, latent variable models and reproducible research. In addition to his methodological contributions, he has collaborated on and published over 20 academic articles with colleagues in a diverse set of areas including infectious diseases, addiction and critical care, among others. Jesse was the recipient of the distinguished student paper award at the Eastern North American Region International Biometric Society conference in 2013, and the new investigator of the year for the Canadian Association of HIV/AIDS Research in 2004.

Justin Salciccioli

Justin is an Academic Foundation doctor in London, United Kingdom. Originally from Toronto, Canada, Justin completed his undergraduate and graduate studies in the United States before pursuing his medical studies at Imperial College London. His research pursuits started as an undergraduate student while completing a biochemistry degree. Subsequently, he worked on clinical trials in emergency medicine and intensive care medicine at Beth Israel Deaconess Medical Center in Boston and completed a Master's degree with his thesis on Vitamin D deficiency in critically ill patients with sepsis. During this time he developed a keen interest in statistical methods and programming, particularly in SAS and R. He has co-authored more than 30 peer-reviewed manuscripts and, in addition to his current clinical training, continues with his research interests on analytical methods for observational and clinical trial data as well as education in analytics for medical students and clinicians.

Bibliographic Information

  • Book Title: Secondary Analysis of Electronic Health Records

  • Authors: MIT Critical Data

  • DOI:

  • Publisher: Springer Cham

  • eBook Packages: Medicine, Medicine (R0)

  • Copyright Information: The Editor(s) (if applicable) and The Author(s) 2016

  • Hardcover ISBN: 978-3-319-43740-8 (Published: 13 September 2016)

  • Softcover ISBN: 978-3-319-82899-2 (Published: 06 July 2018)

  • eBook ISBN: 978-3-319-43742-2 (Published: 09 September 2016)

  • Edition Number: 1

  • Number of Pages: XXI, 427

  • Number of Illustrations: 8 b/w illustrations, 100 illustrations in colour

  • Topics: Health Informatics, Ethics, Data Mining and Knowledge Discovery, Statistics for Life Sciences, Medicine, Health Sciences
