Medical imaging has its earliest roots in 1895, when Wilhelm Roentgen discovered X-rays, providing physicians with the first means of imaging the internal conditions of the human body [1]. Subsequently, multiple imaging methods based on various imaging principles were developed and optimized in succession, such as computed tomography (CT) [2], magnetic resonance imaging (MRI) [3], and positron emission tomography (PET) [4]. The advent of these imaging techniques has made medical imaging a crucial pillar of clinical practice and a fundamental domain for the realization of precision medicine.

With ongoing advances in biological and instrumental science, medical imaging technologies have made remarkable progress in recent decades. The overall structural, functional, and molecular alterations of individuals can be obtained non-invasively through multiple imaging methods [5]. In particular, with the development of molecular imaging, pathophysiological processes at the cellular and molecular levels can be precisely visualized, characterized, and quantified [6,7,8]. The continuous advancement of imaging equipment and probes has further enhanced the capacity of molecular imaging to evaluate pathophysiological alterations non-invasively, bringing its diagnostic capabilities increasingly close to the level of pathological practice. Recently, a novel pattern of pathological practice termed “transpathology,” which can comprehensively depict pathophysiological events in vivo from a multiscale perspective, holds great potential to facilitate translation from the bench to the bedside and to drive traditional medicine towards precision medicine [9].

In parallel with the advancement of medical imaging technology, medical image analysis methods have also developed rapidly, with an increasing focus on quantification and intelligence. In 2012, “radiomics” was proposed as an innovative approach to image analysis that uses automated high-throughput extraction of large numbers of quantitative features from standard-of-care medical images [10]. With the assistance of artificial intelligence (AI), radiomics and other medical image analysis approaches could potentially aid more complex decision-making tasks, such as disease prognostication, prediction of response to different treatment modalities, recognition of treatment-related changes, and discovery of imaging representations of phenotypic and genotypic features associated with prognosis [11]. However, existing AI-based methodologies for medical image analysis encounter various obstacles. The dominant research paradigm builds task-specific models from a large number of annotated training samples and therefore relies heavily on extensive medical imaging datasets [12]. The scarcity of annotated medical data consequently restricts model generalizability, impeding robust transferability across diverse tasks and diseases [13]. Moreover, most existing medical image intelligence models rely predominantly on image data, with limited incorporation of textual data. In clinical practice, however, radiologists often rely on extensive textual information when diagnosing from medical images, which contrasts starkly with the design of these models. This incongruity hampers the models’ ability to perform certain image-text tasks, such as the automated generation of diagnostic reports for images.
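As an illustration of what "quantitative features" means in radiomics, the following minimal Python sketch computes a few first-order intensity features inside a region of interest. It is not the pipeline of reference [10]; the function name `first_order_features`, the choice of features, the bin count, and the synthetic volume are all assumptions made for this example.

```python
import numpy as np


def first_order_features(image, mask, n_bins=32):
    """Compute a few first-order intensity features inside a region of interest.

    `image` and `mask` are arrays of the same shape; voxels with mask > 0
    belong to the region. Feature set and bin count are illustrative choices.
    """
    roi = image[mask > 0].astype(np.float64)
    if roi.size == 0:
        raise ValueError("mask selects no voxels")

    mean = roi.mean()
    std = roi.std()
    # Skewness: third standardized moment (defined as 0 for a constant region).
    skewness = 0.0 if std == 0 else np.mean(((roi - mean) / std) ** 3)

    # Shannon entropy (in bits) of the discretized intensity histogram.
    counts, _ = np.histogram(roi, bins=n_bins)
    p = counts[counts > 0] / roi.size
    entropy = -np.sum(p * np.log2(p))

    return {
        "mean": float(mean),
        "std": float(std),
        "skewness": float(skewness),
        "entropy": float(entropy),
        "energy": float(np.sum(roi ** 2)),
    }


# Synthetic volume standing in for an image; the box-shaped mask is hypothetical.
rng = np.random.default_rng(0)
image = rng.normal(loc=100.0, scale=15.0, size=(16, 64, 64))
mask = np.zeros(image.shape, dtype=np.uint8)
mask[4:12, 20:40, 20:40] = 1

features = first_order_features(image, mask)
for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

In practice, radiomics pipelines extract hundreds of such features (shape, texture, and filtered-image statistics) per lesion; the sketch shows only the simplest intensity-based family.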

Recently emerged large language models (LLMs), especially Chat Generative Pre-Trained Transformer (ChatGPT) developed by OpenAI [14], offer a ray of hope for addressing the above issues. ChatGPT is trained on large textual corpora, acquiring extensive knowledge that can be used for various natural language processing tasks, such as language understanding, text generation, and machine translation. It can receive user input and generate coherent natural language responses, thereby sustaining fluent and articulate conversations. Recent studies indicate that ChatGPT has diverse application scenarios within the domain of medical imaging, including automated reporting, patient communication, addressing specific technical inquiries [15], and educational purposes [16]. However, the limited availability of high-quality medical data in the pre-training dataset of GPT-3.5 constrains its accuracy when responding to medical inquiries. Furthermore, its inability to handle image inputs hinders its applicability in the field of medical imaging. Although the updated GPT-4.0 can process image inputs, it still demonstrates relatively limited proficiency in medical image recognition [17].

Visual-Linguistic Pre-training (VLP) models can acquire transferable visual and linguistic attributes by pre-training on extensive multimodal data that encompass both language and vision [18]. Within the field of medicine, the BiomedCLIP model [19], which is based on the Contrastive Language-Image Pre-training (CLIP) framework [20], has exhibited improved zero-shot predictive abilities, making it well-suited for medical image recognition tasks. Additionally, PubMedCLIP has demonstrated exceptional performance in tasks involving reciprocal retrieval of information between textual and visual modalities [21]. These VLP models have broadened the range of tasks applicable to medical imaging, enabling the seamless integration of textual and visual data. Nevertheless, the accuracy with which these models perform such tasks still leaves room for improvement.
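The zero-shot prediction step used by CLIP-style models can be sketched as follows: the image embedding is compared with the embeddings of candidate text prompts by cosine similarity, and a softmax over the scaled similarities gives class probabilities. This is a minimal sketch, not the BiomedCLIP or PubMedCLIP API. The function name `zero_shot_probabilities`, the temperature value, the prompts, and the hand-written three-dimensional vectors (stand-ins for the outputs of trained image and text encoders) are all assumptions for illustration.

```python
import numpy as np


def zero_shot_probabilities(image_emb, text_embs, temperature=0.01):
    """Softmax over scaled cosine similarities between one image and N prompts.

    `image_emb` has shape (d,), `text_embs` has shape (N, d). The temperature
    is an assumed fixed value; CLIP-style models learn it during training.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = text_embs @ image_emb / temperature
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()


# Hypothetical class prompts; a real system would encode them with a text encoder.
prompts = [
    "a chest X-ray showing pneumonia",
    "a chest X-ray with no abnormal findings",
]

# Toy embeddings standing in for encoder outputs (not produced by any real model).
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0]])
image_emb = np.array([0.9, 0.1, 0.2])

probs = zero_shot_probabilities(image_emb, text_embs)
predicted = prompts[int(np.argmax(probs))]
print(predicted)
```

Because the class set is defined only by the text prompts, new categories can be added without retraining, which is what makes this family of models attractive when annotated medical data are scarce.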

Herein, we propose the concept of medical image GPT (MI-GPT), a pre-trained foundation model that uses medical imaging as its primary data source while also integrating multi-omics data and electronic health records. Such a model might represent the future direction of foundation models for medical imaging in clinical practice (Fig. 1). The data used for MI-GPT can take the form of pure image data, pure text data, or a combination of image and text information.

Fig. 1

The development of medical imaging modalities and image analysis approaches. With continuous advances in the biological and instrumental sciences, medical imaging technologies have progressed from unimodal structural imaging towards multimodal structural–functional imaging. Simultaneously, image analysis methodologies are becoming increasingly intelligent and automated, shifting from subjective evaluations to more accurate quantitative assessments. Considering the continuous progress of foundation models in contemporary medical research, we believe that the future integration of medical foundation models customized for specific pathophysiological conditions, such as medical image Generative Pre-Trained Transformer (MI-GPT), will substantially drive the advancement of precision medicine.

To enhance the interpretability and generalizability of MI-GPT models in clinical practice, it is crucial to foster inter-institutional and multi-disciplinary research collaborations by training models on extensive datasets obtained from various medical centers, scanners, and protocols, with a focus on disease detection, segmentation, and classification tasks in specific application scenarios. Furthermore, through the integration of diverse data types (e.g., text, images, and videos) along with multidimensional data (e.g., genomics, proteomics, transcriptomics, and phenomics), future multi-modality MI-GPT models hold enormous promise for acquiring a more comprehensive understanding of patients’ conditions, thereby facilitating more precise disease diagnoses and the formulation of individualized therapeutic strategies [22, 23].

The progression of MI-GPT models holds potential for the advancement of clinical applications that cater to diverse user bases and disciplines (Fig. 2). One prominent application is aiding radiologists in their workflow by automating the generation of structured radiology reports that describe abnormalities and findings while also taking the patient’s history into account. Clinicians can receive additional support from MI-GPT through the combination of text reports and interactive visualizations, which may include highlighting the image region corresponding to each phrase. Additionally, MI-GPT can assist clinicians by integrating image, language, and audio modalities, enabling real-time decision-making in clinical practice (e.g., pre-treatment comprehensive evaluation, adjustment of surgical alternatives during surgery, monitoring in vivo drug delivery and therapeutic response), leading to more efficient and effective patient management and healthcare. Furthermore, MI-GPT is expected to predict the future risk of a given disease based on the patient’s previous and current conditions. By extracting meaningful information from a patient’s time series data (e.g., imaging, vital laboratory parameters, and clinical notes), MI-GPT could provide a comprehensive summary of the patient’s current clinical state, while also projecting potential future states and offering treatment recommendations. We believe that MI-GPT can also be utilized as a chatbot that leverages multimodal data to construct a holistic understanding of a patient’s condition. Such a chatbot could decipher diverse data formats and engage in interactive conversations with patients to provide detailed medical advice and explanations, which will be crucial for comfortable and precise medicine in the future.

Fig. 2

MI-GPT in clinical practice. By integrating multimodal data including imaging, omics, and electronic health records, MI-GPT holds potential for the advancement of clinical applications that cater to diverse user bases and disciplines, thereby facilitating more precise disease diagnoses and individualized therapeutic decision-making.