“Nothing in life is to be feared, it is only to be understood. Now is the time to understand more, so that we may fear less.” (Marie Curie)

Artificial intelligence (AI) in medicine comes with enormous promise but also potential pitfalls. Liu et al. [1] highlight the need for a cautious and critical approach when evaluating machine learning tools, which, like any diagnostic tool, must be supported by clinical judgment: “…clinical gestalt plays a crucial role in evaluating whether the results are believable. Results that substantially exceed what even such a hypothetical expert is capable of should be scrutinized and validated carefully.”

The transition from the physician’s handwritten notes to electronic health records and a plethora of digital data ushered in the era of Big Data in medicine. Classical hypothesis-driven research is giving way to data-driven research, with opportunities to pursue novel questions and directions raised from the data itself. The statistical approaches currently used to explore this expanding data universe are often drawn from the field of artificial intelligence. The principle of AI is to mimic the thinking and decision-making capabilities of humans using a variety of algorithmic tools.

Clinical decision-making, as much art as science, is the final outcome of a complex process that rests on scientific knowledge and clinical experience gained through years of training and practice. Evidence-based medicine (EBM) and clinical trials represent the pinnacle of scientific decision-making. The ability to exploit Big Data with AI offers the potential to greatly accelerate the experience-based component of the decision-making process. This interplay of EBM and AI can ultimately enhance the physician’s performance.

Within the broad discipline of AI, the subfields of machine learning (ML) and deep learning (DL) are currently of greatest relevance to medical practice (Fig. 1) [1,2,3,4].

Fig. 1 Hierarchical classification with examples of artificial intelligence, machine learning, and deep learning

The two main approaches to ML are supervised and unsupervised learning [1, 2]. Supervised learning uses a given set of input features and one or more outcomes (labels) as the basis for model training. The model is iteratively trained to minimize prediction error when comparing samples drawn from the data with a target reference standard, also called ground truth. Supervised ML for predicting a known outcome is the most widely used approach at present. Unsupervised learning does not use any labeling information and aims to group data by shared properties. This helps to discover structure in the data, such as identifying clusters of patients at similar risk or selecting variables most strongly correlated with an outcome. DL, a specific group of ML methods, uses multi-layered arithmetic operations (sometimes hundreds of layers containing many millions of individual calculations) to model the complex non-linear relationships between data inputs and outputs.
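To make this distinction concrete, the short sketch below (in Python with scikit-learn) trains a supervised classifier against outcome labels and then clusters the same data while ignoring those labels; the synthetic “cohort” and the particular models are our own illustrative assumptions, not methods drawn from the studies cited here.

# Minimal sketch contrasting supervised and unsupervised learning; the
# synthetic "cohort" and model choices are illustrative assumptions only.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans
from sklearn.metrics import roc_auc_score

# Synthetic data: 1000 "patients", 10 input features, 1 binary outcome label
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Supervised learning: fit the model against the labels (the ground truth)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("Supervised AUC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))

# Unsupervised learning: ignore the labels and group patients by similarity
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster sizes:", np.bincount(clusters))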

While AI/ML is designed to output a simple answer, the underlying process to get there is extremely complex and requires attention to numerous technical details. This is analogous to the diagnostic process in medicine. The final diagnosis conceals a non-linear reasoning pathway that incorporates medical knowledge and experience with clinical clues from the history, physical examination, and investigations. The classical pipeline for developing and implementing a supervised ML model follows the steps shown in Fig. 2.

Fig. 2 Simplified model development flowchart for supervised learning
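To give a flavor of the pipeline summarized in Fig. 2, the sketch below chains data preparation, model training, internal validation, and a single held-out test; the preprocessing steps and the model are illustrative assumptions rather than a prescribed recipe.

# Sketch of a supervised learning pipeline: data preparation, training,
# internal validation, and a held-out test; all choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

# Hold out a test set before any model development to limit optimistic bias
X_dev, X_test, y_dev, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),    # data preparation
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=0)),
])

# Internal validation on the development set, then a single final test
print("Cross-validated AUC:",
      cross_val_score(pipe, X_dev, y_dev, scoring="roc_auc", cv=5).mean())
pipe.fit(X_dev, y_dev)
print("Held-out test AUC:", roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1]))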

Although each of these steps is important, data preparation deserves special attention given the data-driven nature of ML. Garbage-in-garbage-out (GIGO) describes the importance of data quality. Access to Big Data is not enough—we must ensure High Quality Big Data. Failure to adhere to this principle can lead to biased or even erroneous results. Companies are rushing to provide off-the-shelf platforms using pre-defined algorithms to democratize AI access. In theory, one only needs to load the data and specify a few parameters, voilà—a fully trained convolutional neural network (CNN)! However, caveat emptor. Any biases in the data collection or labeling (e.g., in establishing the ground truth) would automatically generate systematic errors in the predictions, which the machine would then repeat over and over. In contrast to carefully collected and adjudicated research data, Big Data comes from “real-world” sources, which are comparatively “dirty.” Nothing is free, and the cost of data quantity is questionable quality, which can affect the reliability of the derived ML products. In summary, any deviation from the eight steps described previously can lead to overly optimistic (or more rarely pessimistic) results, thereby threatening their clinical reliability. Accordingly, it is important not to under-report model details and clinical information, as any lack of reporting transparency impedes effective comparisons, model reproducibility, and clinical use [4, 5].
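As a small illustration of what “high quality” entails in practice, the sketch below runs a few routine checks (missingness, implausible values, label balance) on a hypothetical extract before any model is trained; the column names and plausibility thresholds are assumptions made purely for illustration.

# Routine pre-training data-quality checks on a hypothetical EHR extract;
# column names and plausibility thresholds are illustrative assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [67, 54, 230, 71, np.nan],             # 230 is an obvious entry error
    "femoral_neck_tscore": [-2.6, -1.1, -2.9, np.nan, -3.4],
    "fracture_within_2y": [1, 0, 0, 1, 1],        # the outcome label
})

# Missingness per column: high rates may bias the model or force exclusions
print(df.isna().mean())

# Implausible values that should be corrected or removed before training
print(df.loc[(df["age"] < 18) | (df["age"] > 110)])

# Label balance: severe imbalance changes how performance should be judged
print(df["fracture_within_2y"].value_counts(normalize=True))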

Many of the early successes of AI in medicine have been in image-intensive specialties, such as radiology, pathology, ophthalmology, and cardiology [6]. Clinical risk prediction, diagnostics, and therapeutics are more challenging. Hence, AI is still relatively novel in the osteoporosis field. A query on PubMed indicates an exponential increase in AI publications since 2010, totaling more than 38,000 articles, over 10,000 of them in the last year alone. In contrast, fewer than 100 of these were in the field of osteoporosis, although this subset is following the same exponential trajectory, with the majority of studies published during the last 2–3 years. Efforts have been made in osteoporosis diagnosis and classification, bone mineral density assessment, fracture detection, fracture risk estimation, and bone image segmentation [7,8,9,10,11,12,13,14]. The majority of these articles used opportunistic data, particularly in imaging.

Accurate fracture risk estimation is crucial in osteoporosis management and is the first step in the clinical evaluation of bone health. Widely used fracture risk assessment tools (e.g., FRAX®) are based on classical statistical approaches informed by clinical expertise in osteoporosis [15]. For instance, FRAX was developed and validated in various large population-based datasets. Each individual clinical risk factor (twelve in total) was incorporated into FRAX based on a solid scientific rationale and a supporting meta-analysis. Anticipating the rise of Big Data, FRAX is an example of how evidence-based hypotheses can drive data analysis and find their way into clinical utility.

In this issue of Osteoporosis International, De Vries et al. [16] compared fracture risk prediction from classical techniques (Cox regression) with AI-/ML-based survival models (a random survival forest, RSF, and an artificial neural network, ANN-DeepSurv). Their study was conducted in a sample of 7578 post-fracture individuals, relatively large by conventional measures though not by current ML standards. Their data-driven, hypothesis-free investigation aimed to compare the performance of the models and to identify, if possible, novel risk factors. This study reminds us of the wealth of electronic data that is increasingly available to researchers from electronic health record databases. Although FRAX performs well in clinical practice, it still ignores large amounts of potentially valuable patient information. The use of these additional data sources, as in De Vries et al., can suggest novel hypotheses, risk factors, and disease/health determinants. Although their predictive models examined more than 40 clinical and laboratory variables, and contrary to expectation, Cox regression outperformed the AI/ML models. In part, this may reflect the use of more sophisticated approaches with the Cox regression (LASSO variable selection, non-linear transforms including restricted cubic splines) and rather simplistic ML architectures (only 2 layers in ANN-DeepSurv). One would anticipate that ML performance would improve with more complex architectures and sufficient data to avoid overfitting. The identification of overlapping predictors (e.g., age, hip T-score, time since menopause) provides face validity that the approaches are responsive to similar signals in the data. ML was also able to identify plausible risk factors that were omitted from the Cox model (e.g., vertebral fracture, lumbar spine T-score) and to propose some novel risk factors (e.g., plasma albumin, breastfeeding), both worthy of future study.
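For readers unfamiliar with these model families, the sketch below fits a Cox proportional hazards model and a random survival forest to synthetic time-to-fracture data and compares them with the concordance index, the usual discrimination metric for survival models; the data, features, and settings are illustrative assumptions and do not reproduce the LASSO selection, spline transforms, or DeepSurv architecture used by De Vries et al.

# Minimal sketch comparing a Cox model with a random survival forest using
# scikit-survival; synthetic data and settings are illustrative assumptions.
import numpy as np
from sksurv.linear_model import CoxPHSurvivalAnalysis
from sksurv.ensemble import RandomSurvivalForest
from sksurv.metrics import concordance_index_censored

rng = np.random.default_rng(0)
n = 1000
X = rng.normal(size=(n, 5))                      # e.g., age, T-scores, lab values
risk = 0.8 * X[:, 0] - 0.5 * X[:, 1]             # true (unknown) linear risk
time = rng.exponential(scale=np.exp(-risk))      # time to subsequent fracture
censor = rng.exponential(scale=1.5, size=n)      # censoring times
event = time <= censor                           # fracture observed yes/no
y = np.array(list(zip(event, np.minimum(time, censor))),
             dtype=[("event", bool), ("time", float)])

train, test = np.arange(n) < 700, np.arange(n) >= 700

cox = CoxPHSurvivalAnalysis().fit(X[train], y[train])
rsf = RandomSurvivalForest(n_estimators=200, random_state=0).fit(X[train], y[train])

# Higher predicted risk should correspond to earlier fractures (concordance)
for name, model in [("Cox", cox), ("RSF", rsf)]:
    c_index = concordance_index_censored(y[test]["event"], y[test]["time"],
                                         model.predict(X[test]))[0]
    print(name, "c-index:", round(c_index, 3))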

We are optimistic about the future of medical AI in general and for the osteoporosis field in particular. In 2018, the Food and Drug Administration (FDA) approved OsteoDetect (Imagen), AI software based on DL that identifies and highlights distal radius fractures during the review of posterior-anterior and lateral radiographs of adult wrists, as an adjunct to the clinician’s review and clinical judgment [17]. Thus, AI has already found its way into important tasks in overall osteoporosis clinical management. Nevertheless, healthy skepticism should balance our zeal to see this science move forward and temper our enthusiasm to see AI integrated into clinical applications. A recent systematic review of medical imaging DL algorithms published between 2010 and June 2019 found that almost all were retrospective, non-randomized, at high risk of bias, and deviated from existing reporting standards [18]. Moreover, data and code availability were lacking in most studies, yet only a minority stated that further prospective studies or trials were required. Aside from technical issues related to model reliability and reporting transparency, AI/ML raises prickly new questions that have yet to be answered: Who owns the data used to initially train the algorithm, and what are the rights of the patient to control their personal information? How are individual privacy concerns balanced against making the dataset (which can be highly detailed in “real-world” data) available for independent validation? Once approved, how can one provide “stewardship” over changes to the AI/ML algorithm (mostly cloud based) without stifling its unique ability to evolve and improve?

To conclude, we believe healthcare will see increasing synergy between human and artificial intelligences, where the latter will enhance a physician’s performance and support well-informed clinical decision-making—namely augmented intelligence. AI should be seen as yet another tool for improving the quality of patient care. Not to be feared but to be understood, as we explore AI’s unique strengths and limitations.