1 Introduction

1.1 Background and significance of machine learning (ML) and deep learning (DL)

The evolution of ML and DL has been punctuated by a series of landmark models and technologies, each representing a significant leap in the field. The perceptron, developed in 1958 [1], laid the early groundwork for deep neural networks (DNNs) in pattern recognition. In the 1990s, support vector machines (SVMs) [2] gained prominence for their ability to handle high-dimensional data in classification and regression tasks. Long short-term memory (LSTM) [3], introduced in 1997, became essential for sequential data processing in language modeling and speech recognition. LeNet-5, introduced in 1998 [4], was one of the first convolutional neural networks (CNNs), pioneering digit recognition and setting the stage for future CNN developments. In the early 2000s, ensemble methods such as random forest emerged [5, 6], enhancing the capabilities of classification and regression. The deep belief network (DBN), unveiled in 2006 [7], reignited interest in DNNs, ushering in the modern era of DL. A major milestone was achieved with the advent of AlexNet in 2012 [8], a DNN that dominated the ImageNet challenge and brought DL into the AI spotlight. The development of generative adversarial networks (GANs) in 2014 [9] introduced a novel generative modeling approach, impacting unsupervised learning and image generation. The introduction of the transformer model in 2017 [10], and subsequently bidirectional encoder representations from transformers (BERT) in 2018 [11], revolutionized natural language processing (NLP), setting new performance benchmarks and highlighting the significance of context in language understanding. These milestones not only mark critical points in AI but also showcase the diverse methodologies and increasing sophistication in ML and DL.

DL offers numerous practical advantages and has revolutionized various fields. One of the primary benefits is its ability to automatically extract features from raw data, significantly reducing the need for manual feature engineering. This capability is particularly impactful in domains with complex data structures, such as image and speech recognition, where traditional methods struggle to achieve high accuracy [7, 8]. In healthcare, DL models are used for analyzing medical images to detect diseases like cancer, providing early and accurate diagnoses that are crucial for effective treatment [8]. For instance, CNNs have been successfully applied to mammography images to identify breast cancer with higher precision than traditional methods [8]. In industrial applications, DL enhances quality control processes by detecting defects in products on assembly lines, thus improving efficiency and reducing waste. Additionally, in the automotive industry, DL is a cornerstone of autonomous driving technology, enabling vehicles to interpret and respond to their environment in real-time. Beyond these specialized applications, DL also impacts the daily lives of citizens through various consumer technologies. Mobile phone applications, such as virtual assistants (e.g., Siri, Google Assistant), rely on DL to understand and process natural language commands, providing users with convenient and hands-free interaction with their devices [12]. Furthermore, facial recognition technology, powered by DL, is used for secure authentication in smartphones, enhancing both security and user experience [12]. Personalized recommendations on platforms like Netflix, Amazon, and Spotify also utilize DL algorithms to analyze user behavior and preferences, delivering tailored content and improving user satisfaction [13]. These practical examples highlight how DL not only pushes the boundaries of AI but also provides significant improvements and solutions to real-world problems across various sectors, making everyday life more efficient and convenient.

The expanding frontiers of ML and DL expose a paradoxical combination of advancement and limitation [14, 15]. The exponential growth in training compute for large-scale ML and DL models since the early 2010s marks a significant evolution in computational technology [16]. Surpassing the traditional bounds of Moore’s law, training compute has doubled approximately every six months, introducing the large-scale era around late 2015 [17]. This era, marked by the need for 10 to 100 times more compute power for ML models, has significantly increased the demand for computational resources and expertise in advanced ML systems [14,15,16, 18]. One of the most noted manifestations of this growth is the expansion of the largest dense models in DL. Entering the 2020s, these models have expanded from one hundred million parameters to over one hundred billion, primarily due to advancements in system technology, including model parallelism, pipeline parallelism, and zero redundancy optimizer (ZeRO) optimization techniques [17]. These advances have made it possible to train larger and more capable models, reshaping how ML workloads are approached. As computational capabilities continue to expand, the increase in graphics processing unit (GPU) memory, from 16 GB to 80 GB, struggles to keep pace with the exponential growth in the computational demands of ML models [14]. This gap tests the limits of current hardware and magnifies the importance of more efficient utilization of available resources. The integration of ML with high-performance computing (HPC), or high-performance data analytics (HPDA), has been pivotal in this context, enabling faster ML algorithms and facilitating the resolution of more complex problems [14, 16, 18]. Advanced techniques like DeepSpeed [15] and ZeRO-Infinity [17] further demonstrate how innovative system optimizations can push the boundaries of DL model training.

Nevertheless, the continued increase in model size and complexity underscores the need for model optimization [19,20,21,22,23]. Compressing ML models emerges as a vital approach, reducing the disparity between escalating computational demands and inadequate memory expansion by shrinking models without significantly affecting their performance. This approach encompasses several techniques, such as pruning, quantization, and knowledge distillation [24,25,26,27,28,29]. Model compression not only addresses the challenge of deploying AI systems in resource-constrained environments, such as mobile devices and embedded systems, but also improves the efficiency and speed of these models, making them more accessible and scalable. For instance, in mobile applications, compressed models enable faster inference times and lower power consumption, which are critical for enhancing user experience and extending battery life. Additionally, in edge computing scenarios, where computational resources are limited, compressed models facilitate real-time data processing and decision-making, enabling a wide range of applications from smart home devices to autonomous drones. In essence, model compression becomes not just a beneficial strategy, but a necessity for the practical deployment of AI systems, particularly in environments where resources are inherently limited. By optimizing the balance between model complexity and computational efficiency, model compression ensures that the advancements in AI technology remain sustainable and widely applicable across various domains and industries [24,25,26,27,28,29].

In conclusion, the rapid advancement in training compute for DL models marks a remarkable era of technological progress, juxtaposed with significant challenges. The stark contrast between the exponential demands of these models and the more modest growth in GPU memory capacity underscores a pivotal issue in the field of ML. It is this imbalance that necessitates innovative approaches like model compression and system optimization. As we proceed, this paper will delve deeper into these challenges, exploring the intricacies of model compression techniques and their critical role in optimizing large-scale ML and DL models. We will examine how these techniques not only address the limitations of current hardware but also open new avenues for efficient, practical deployment of AI systems in various real-world scenarios.

1.2 Main contributions and novelty

This paper makes significant contributions to the field of model compression techniques in ML, focusing on their applicability and effectiveness in resource-constrained environments such as mobile devices, edge computing, and internet of things (IoT) systems. The main contributions of this paper are as follows:

1. Comprehensive review of model compression techniques: we provide an in-depth review of various model compression strategies, including pruning, quantization, low-rank factorization, knowledge distillation, transfer learning, and lightweight design architectures. This review not only covers the theoretical underpinnings of these techniques but also evaluates their practical implementations and effectiveness in different operational contexts.

2. Highlighting the balance between performance and computational demand: our synthesis reveals the dynamic interplay between model performance and computational requirements. We emphasize the necessity for a balanced approach that optimizes both aspects, crucial for the sustainable development of AI.

3. Identification of research gaps: by examining the current state of model compression research, we identify critical gaps, highlighting the need for more research on integrating digital twins, physics-informed residual networks (PIResNet), advanced data-driven methods like gated recurrent units for better predictive maintenance of industrial components, predictive maintenance using DL in smart manufacturing, and reinforcement learning (RL) in supply chain optimization.

4. Future research directions: the paper advocates for future studies to focus on hybrid compression methods that combine multiple techniques for enhanced efficiency. Additionally, we suggest the development of autonomous selection frameworks that can intelligently choose the most suitable compression strategy based on the specific requirements of the application.

5. Practical examples and applications: to bridge the gap between theory and practice, we provide practical examples and case studies demonstrating the application of model compression techniques in real-world scenarios. These examples illustrate how model compression can lead to significant improvements in computational efficiency without compromising model accuracy.

6. Innovative solutions for efficient ML: we propose innovative solutions for improving the efficiency and effectiveness of ML models in resource-constrained environments. This includes the development of lightweight model architectures and the integration of advanced compression techniques to facilitate the deployment of ML models in practical, real-world applications.

The novelty of this paper lies in its approach to understanding and advancing model compression techniques. By synthesizing existing knowledge and identifying critical research gaps, we provide a comprehensive roadmap for future research in this domain. Our focus on practical applications and innovative solutions further enhances the relevance and impact of this work, making it a valuable resource for both researchers and practitioners in the field of ML.

1.3 Emerging areas and research gaps

The existing literature provides a comprehensive overview of various model compression techniques and their applications across different domains. However, there is a noticeable gap in addressing the specific challenges and advancements in many ML applications. Key areas where further research is necessary and emerging areas that leverage the latest developments in ML and model compression techniques to enhance performance, efficiency, and reliability include:

1. Digital twin-driven intelligent systems: digital twins, virtual replicas of physical systems, offer significant potential for real-time monitoring and predictive maintenance. Current research lacks a thorough exploration of how digital twins can be integrated with advanced ML models. Integrating model compression techniques can further enhance their efficiency by reducing the computational burden during real-time monitoring [30,31,32,33].

2. PIResNet: traditional ML models have been extensively studied, but the incorporation of physical laws into these models, such as in PIResNet, remains underexplored. This approach can enhance model accuracy and reliability by embedding domain-specific knowledge. Applying model compression techniques can optimize PIResNet for deployment in resource-constrained environments without sacrificing diagnostic accuracy [34,35,36].

3. Gated recurrent units (GRU): there is a need for innovative data-driven approaches that leverage multi-scale fused features and advanced recurrent units, like GRUs. Existing studies often focus on conventional methods, missing the potential benefits of these sophisticated techniques. Incorporating model compression techniques can further enhance the applicability of these approaches by reducing memory and computational requirements [37,38,39,40].

4. Predictive maintenance using DL in smart manufacturing: predictive maintenance involves using ML models to predict equipment failures before they occur, allowing for timely maintenance and reducing downtime. Current research gaps include optimizing DL models for deployment in smart factories by integrating them with IoT devices for continuous monitoring and real-time analysis. Applying model compression techniques can make these models more efficient for real-time data processing [41, 42].

5. RL in supply chain optimization: RL algorithms learn optimal policies through interactions with the environment, making them well-suited for dynamic and complex systems like supply chains. Current research gaps include optimizing various aspects such as inventory management, demand forecasting, and logistics by simulating different scenarios and learning from outcomes. To make RL models more feasible for real-time application in supply chain operations, model compression techniques can be utilized to reduce the model’s complexity and enhance operational efficiency, facilitating faster decision-making processes [43,44,45].

1.4 Material and methods

A systematic literature search was conducted across several databases, including IEEE Xplore [46], ScienceDirect [47], and Google Scholar [48]. Keywords related to model compression techniques such as pruning, quantization, knowledge distillation, transfer learning, and lightweight model design were used. The search focused on papers published in the last decade to ensure relevance and innovation in the fields of ML and AI, while seminal classical papers were also included. Studies were included based on the following criteria: detailed discussion on model compression techniques; empirical evaluation of compression methods on ML models; availability of performance metrics like compression ratio, speedup, and accuracy retention; and relevant real-world applications. Exclusion criteria involved: papers not in English, reviews without original research, and studies focusing solely on theoretical aspects without empirical validation.

Data extracted from the selected studies included author names, publication year, compression technique evaluated, model used, datasets, performance metrics (e.g., compression ratio, inference speedup, accuracy), and key findings. A thematic synthesis approach was used to categorize the compression techniques and summarize their effectiveness across different applications and model architectures. The synthesis involved comparing and contrasting the effectiveness of different model compression techniques, highlighting their advantages and limitations. The impact of these techniques on computational efficiency, model size reduction, and performance metrics was analyzed to identify trends and potential areas for future research.

1.5 Paper organization

The structure of this paper has been systematically designed to guide the reader through a comprehensive exploration of model compression techniques in ML. The sections are organized as follows:

Section 1. Introduction: the significance of model compression in enhancing the efficiency of ML models, especially in resource-constrained environments, is introduced. An overview of the main contributions and the novelty of this paper is provided.

Section 2. Challenges in machine learning (ML) and deep learning (DL): the historical context and evolution of ML and DL are discussed, highlighting key milestones and the exponential growth in computational demands.

Section 3. Common model compression approaches: key model compression techniques such as pruning, quantization, low-rank factorization, knowledge distillation, and transfer learning are delved into. Detailed explorations of each technique, including theoretical foundations, practical implementation considerations, and their impact on model performance, are provided.

Section 4. Lightweight model design and synergy with model compression techniques: an overview of lightweight model architectures, such as SqueezeNet, MobileNet, and EfficientNet, is presented. The design principles and the synergy with model compression techniques to achieve enhanced efficiency and performance are discussed.

Section 5. Performance evaluation criteria: the criteria for evaluating the performance of compressed models, including metrics like compression ratio, speed-up rate, and robustness metrics, are discussed. The importance of balancing model performance with computational demand is emphasized.

Section 6. Model compression in various domains: recent innovations in model compression are highlighted, and case studies demonstrating the application of these techniques in various domains are presented. The significant improvements in computational efficiency achieved by compressed models without compromising performance are illustrated.

Section 7. Innovations in model compression and performance enhancement: The applications of model compression techniques across various fields are explored, demonstrating their implementation in real-world scenarios such as mobile devices, edge computing, IoT systems, autonomous vehicles, and healthcare. Specific examples illustrate the practical benefits and challenges of deploying compressed models in these environments.

Section 8. Challenges, strategies, and future directions: future research directions in model compression are outlined. Potential advancements and innovations that could enhance the efficiency and applicability of model compression techniques are discussed, including hybrid methods and autonomous selection frameworks. This section aims to inspire further research to address existing gaps.

Section 9. Discussion: the findings from the comprehensive review of model compression techniques are synthesized. The implications for future research and practical applications are evaluated, research gaps are identified, and future directions are suggested.

Section 10. Conclusion: the paper concludes with an exploration of recent innovations in model compression and performance enhancement. The ongoing advancements in the field and the potential for future research to optimize ML models are underscored.

Appendix A. Comprehensive summary of the references used in this paper: the references are summarized and categorized by their specific application areas. This table provides a comprehensive overview of the key publications that have been referenced throughout the study, offering insights into the foundational and recent advancements in each area.

This organization ensures a logical progression from introducing the importance of model compression to exploring specific techniques, discussing their applications, and concluding with future research directions. The structure provides a clear roadmap for readers, facilitating a deeper understanding of the topic.

2 Challenges in machine learning (ML) and deep learning (DL)

2.1 Computational demands vs. computational memory growth

A model serves as a mathematical construct that represents a system or process. This construct is primarily used for the purpose of prediction or decision-making based on the analysis of input data. Typically, DL models are DNNs, which consist of numerous interconnected nodes or neurons. These nodes collectively process incoming data to produce output predictions or decisions, as depicted in Fig. 1. DL models can be implemented using a variety of programming frameworks, including TensorFlow [49] and PyTorch [50].

The training of a DL model is a critical process in which substantial datasets are employed to iteratively adjust the model’s internal parameters, such as weights and biases, refining its predictive accuracy and enhancing its generalization to unseen data. This adjustment process, known as backpropagation, involves the computation of the gradient of the loss function - a measure of model error - relative to the network’s parameters. The optimization of these parameters is executed through algorithms like stochastic gradient descent (SGD), aiming to minimize the loss function and thereby improve the model’s performance. Various programming environments, such as TensorFlow and PyTorch, provide sophisticated application programming interfaces (APIs) that support the development and training of complex DNN architectures. These environments also offer access to a range of pre-trained models, which can be directly applied or further fine-tuned for tasks in diverse domains, including image recognition, NLP, and beyond.
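To make this loop concrete, the following minimal PyTorch sketch shows how backpropagation and SGD interact with the loss function; the small feedforward network, layer sizes, and synthetic data are assumptions chosen purely for illustration.

```python
import torch
import torch.nn as nn

# Assumed toy feedforward network and synthetic data, for illustration only.
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

x = torch.randn(256, 20)             # synthetic inputs
y = torch.randint(0, 2, (256,))      # synthetic labels

for epoch in range(10):
    optimizer.zero_grad()            # reset accumulated gradients
    loss = loss_fn(model(x), y)      # forward pass and loss computation
    loss.backward()                  # backpropagation: compute gradients of the loss
    optimizer.step()                 # SGD update of weights and biases
```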

Fig. 1 A NN commonly used in DL scenarios. The illustration showcases the network’s architecture, highlighting the input layer, hidden layers, and output layer. Each node represents a neuron, and the connections between them indicate the pathways through which data flows and weights are adjusted during training

2.2 Model size and complexity

Model compression in DL is a technique aimed at reducing the size of a model without significantly compromising its predictive accuracy. This process is vital in the context of deploying DL models on resource-constrained devices, such as mobile phones or IoT devices. By compressing a model, it becomes feasible to utilize advanced DL capabilities in environments where computational power and storage are limited.

The performance of a DL model is fundamentally defined by its ability to make accurate predictions or decisions when confronted with new, unseen data. This performance is quantitatively measured through various metrics, including accuracy, precision, recall, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC). The selection of these metrics is contingent upon the specific nature of the problem being addressed and the type of model in use, ensuring a comprehensive evaluation of the model’s effectiveness in real-world applications.
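As a brief illustration of how such metrics are computed in practice, the sketch below uses scikit-learn on hypothetical labels and scores; the values are illustrative and not drawn from any model discussed in this paper.

```python
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Hypothetical ground-truth labels, predicted labels, and predicted scores.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]
y_score = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1-score :", f1_score(y_true, y_pred))
print("AUC-ROC  :", roc_auc_score(y_true, y_score))
```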

Table 1 Various model compression methods are evaluated, detailing their compression ratios, performance retention, computational efficiency, key strengths, and potential drawbacks, providing insights into their suitability for different applications

We have conducted a comprehensive analysis that delineates the performance retention, model size reduction, and other critical dimensions across different compression techniques. This comparison elucidates the nuanced distinctions between the methods, counteracting the impression that all techniques yield similar outcomes in performance maintenance and size reduction. Table 1 encapsulates the comparative analysis of these methods, addressing the strengths and drawbacks of each. This table provides a nuanced view of how each model compression method balances model size reduction against performance retention, along with their computational efficiency and application suitability. This comparison should clarify the unique attributes and trade-offs of each model compression technique, offering a more refined understanding of their individual and comparative impacts [51,52,53,54,55].

Table 2 presents an overview of model compression approaches applied across various ML application domains. It summarizes the most suitable techniques for specific fields, such as image and speech analysis, highlighting the benefits and limitations of each approach. This comprehensive comparison aims to illustrate the effectiveness of pruning, quantization, low-rank factorization, knowledge distillation, and transfer learning in reducing model size, retaining performance, and enhancing computational efficiency.

Table 2 A comprehensive overview of model compression techniques applied across various application domains in ML

2.3 Resource allocation and efficiency

Balancing model performance with computational demand is a critical consideration in the development and deployment of ML models, especially in resource-constrained environments such as mobile devices, edge computing, and IoT systems. This balance ensures that models are not only accurate but also efficient enough to be deployed in real-world applications.

Achieving this balance involves making trade-offs between the size, speed, and accuracy of the models. Techniques such as pruning, quantization, low-rank factorization, and knowledge distillation are pivotal in this regard. For instance, pruning reduces the number of parameters by eliminating less significant ones, which can decrease computational requirements while maintaining performance [89, 90]. Quantization further enhances efficiency by reducing the precision of model parameters, thereby decreasing memory usage and accelerating computation [23, 25]. Low-rank factorization decomposes large weight matrices into smaller matrices, which can capture essential information with fewer parameters [91, 92]. Knowledge distillation involves training a smaller model to replicate the behavior of a larger, well-trained model, effectively transferring knowledge while reducing computational complexity [93, 94]. This technique is particularly useful for deploying models in environments with limited resources without significantly sacrificing accuracy.

For instance, lightweight models like MobileNet and SqueezeNet are designed to operate efficiently on mobile devices, with MobileNet using depthwise separable convolutions to reduce computational load while maintaining accuracy [95], and SqueezeNet achieving AlexNet-level accuracy with significantly fewer parameters through the use of fire modules [96]. In edge computing scenarios, models must balance performance with the limited computational capacity of edge devices, utilizing techniques such as quantization and pruning to ensure real-time inference without excessive latency [97]. For IoT applications, model compression is crucial for deploying intelligent analytics on devices with stringent power and memory constraints, with techniques like low-rank factorization and knowledge distillation creating compact models that can operate efficiently in such environments [86].

In conclusion, the interplay between model performance and computational demand is a dynamic challenge that necessitates a balanced approach. By leveraging various model compression techniques, it is possible to develop efficient models that are suitable for deployment in a variety of resource-constrained environments, thus advancing the practical application of AI technologies.

3 Common model compression approaches

This section delves into key model compression techniques in DNNs, guiding the reader through their theoretical foundations, practical implementation considerations, and direct impact on model performance, including accuracy, inference speed, and memory utilization. Each technique addresses the challenge of deploying advanced DNNs in scenarios with limited computational power, such as mobile devices and edge computing platforms, highlighting the trade-offs between model size reduction and performance retention. This exploration underlines the importance of innovative approaches to model compression, essential for the practical application of DNNs across various domains.

Pruning systematically removes less significant parameters from a DNN to reduce size and computational complexity while maintaining performance. Quantization reduces the precision of model parameters to lower-bit representations, decreasing memory usage and speeding up computation, which is ideal for constrained devices. Low-rank factorization decomposes large weight matrices into smaller, low-rank matrices, capturing essential information and reducing model size and computational demands. Knowledge distillation transfers knowledge from a larger, well-trained teacher model to a smaller student model, retaining high accuracy with fewer parameters. Transfer learning leverages pre-trained models on extensive datasets to adapt to new tasks, minimizing the need for extensive data collection and training. Lightweight design architectures, such as SqueezeNet and MobileNet, are engineered with fewer parameters and lower computational requirements without significantly compromising accuracy. Collectively, these techniques address the challenge of deploying advanced ML models in resource-constrained environments, balancing model performance with computational demand, and highlighting their importance in efficient and sustainable AI development.

3.1 Pruning

Pruning is a process for enhancing ML model efficiency and effectiveness. By systematically removing less significant parameters of a DNN, pruning reduces the model’s size and computational complexity without substantially compromising its performance [19,20,21,22, 98, 99]. This practice is especially vital in contexts where storage and computational resources are limited. Figure 2 depicts an illustration of a pruned DNN.

Pruning involves the selective removal of network parameters (weights and neurons) that contribute the least to the network’s output [100]. This process leads to a compressed and efficient model, facilitating faster inference times and reduced energy consumption [101]. The common types of pruning include neuron pruning, which removes entire neurons or filters from the network [102]. It is commonly used in CNNs and targets neurons that contribute less to the network’s ability to model the problem. By removing these neurons, the network’s complexity is reduced, potentially leading to faster inference times [59]. Weight pruning, focused on eliminating individual weights within a DNN [103], involves identifying weights with minimal impact (typically those with the smallest absolute values) and setting them to zero. This process creates a sparse weight matrix, which can significantly reduce the model’s size and computational requirements [104]. Structured pruning focuses on removing larger structural components of a network, such as entire layers or channels [105]. It is aligned with hardware constraints and optimizes for computational efficiency and regular memory access patterns.

The parameters of a DNN are selected for pruning based on their impact on the output. Techniques like sensitivity analysis or heuristics are often used to identify these parameters [106]. Various algorithms, like magnitude-based pruning or gradient-based approaches, are employed to determine and execute the removal of parameters. These methods frequently involve iteratively pruning and testing the network to find an optimal balance between size and performance [107]. After pruning, it is essential to evaluate the pruned model’s performance to ensure that accuracy or other performance metrics are not significantly compromised [108]. Re-training or fine-tuning the pruned network is typically required to recover any loss in accuracy [56]. Post-pruning, it is crucial to validate the model on a relevant dataset to ensure that its accuracy and efficiency meet the required standards [109].
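As an illustrative sketch, PyTorch's built-in pruning utilities can express both magnitude-based weight pruning and structured neuron pruning; the toy model and pruning amounts below are assumptions chosen only for demonstration, and in practice such a step would be followed by the fine-tuning and validation described above.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# Assumed toy model, for illustration only.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Magnitude-based weight pruning: zero out the 30% of weights with the
# smallest absolute values in the first linear layer.
prune.l1_unstructured(model[0], name="weight", amount=0.3)

# Structured pruning: remove 25% of the output neurons (rows) of the
# second linear layer, ranked by their L2 norm.
prune.ln_structured(model[2], name="weight", amount=0.25, n=2, dim=0)

# Make the pruning permanent by removing the re-parametrization masks.
prune.remove(model[0], "weight")
prune.remove(model[2], "weight")

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of first layer after pruning: {sparsity:.2f}")
```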

Pruning is emphasized as a vital technique for removing excess in oversized models [90, 110]. However, the main challenge arises from over-pruning, which can result in the loss of crucial information, adversely affecting the model’s performance [90, 110]. Researchers have argued for the necessity of optimized DNN approaches that meticulously avoid the negative consequences of over-pruning [111, 112]. The conversation extends to the impact of over-pruning on cloud-edge collaborative inference, with suggestions for a more conservative approach to network pruning to maintain model effectiveness [97, 113]. This reflects a consensus on the need to preserve essential information while streamlining models for efficiency. Moreover, the optimization challenges of pruning a distributed CNN for IoT performance enhancement are illustrated through a case study, emphasizing the complexity of achieving optimal pruning without compromising model integrity [114]. These discussions collectively underscore the importance of research focusing on developing pruning methodologies that reduce model size and computational demands while safeguarding against the loss of essential information.

Pruning allows for the creation of more efficient and compressed ML models. While it involves a trade-off between model size and performance, with careful implementation, it can significantly enhance computational efficiency. Ongoing research in this field continues to refine and develop new pruning techniques, making it a dynamic and essential aspect of DNN optimization.

Fig. 2 The process of weight pruning in a DNN. (a) Shows the original DNN with all nodes and connections intact. (b) Highlights the nodes that have been pruned, indicating the parts of the network identified as non-essential. (c) Displays the pruned connections with dashed lines, illustrating the streamlined network structure after the less significant weights have been removed

3.2 Quantization

Quantization serves as a pivotal technique for model compression, playing a key role in enhancing computational efficiency and reducing storage requirements [23,24,25,26,27, 69]. This process is particularly critical in deploying DNN models on devices with limited resources. For example, many modern DNNs are made up of billions of parameters; even a comparatively small large language model (LLM) has around 7B parameters [115]. If every parameter is stored as a 32-bit value, then \((7\times 10^{9})\times 32 = 2.24\times 10^{11}\) bits, or roughly 224 Gbit (28 GB), are needed just to store the parameters on disk. This implies that large models are not readily accessible on a conventional computer or on an edge device. Quantization refers to the process of reducing the precision of the DNN’s parameters (weights and activations), simplifying the model and leading to decreased memory usage and faster computation, without significantly compromising model performance. Quantization aims to reduce the total number of bits required to represent each parameter, usually converting floating-point numbers into integers [116].

Uniform quantization involves mapping input values to equally spaced levels. It typically converts floating-point representations into lower-bit representations, like 8-bit integers. Uniform quantization simplifies computations and reduces model size, but it must be carefully managed to avoid significant loss in model accuracy [117]. Non-uniform quantization uses unevenly spaced levels, which are often optimized for the specific distribution of the data. Techniques like logarithmic or exponential scaling are used to allocate more levels where the data is denser. Non-uniform quantization can be more efficient in representing complex data distributions, potentially leading to better preservation of model accuracy [118]. Post-training quantization involves applying quantization to a model after it has been fully trained. It simplifies the process as it doesn’t require retraining; however, it may require calibration on a subset of the dataset to maintain accuracy [119].
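The sketch below illustrates the basic arithmetic of asymmetric uniform quantization, computing a scale and zero-point and mapping 32-bit floats to 8-bit integers; the weight tensor and bit width are arbitrary assumptions for demonstration.

```python
import torch

def uniform_quantize(x: torch.Tensor, num_bits: int = 8):
    """Asymmetric uniform quantization of a float tensor to unsigned integers."""
    qmin, qmax = 0, 2 ** num_bits - 1
    scale = (x.max() - x.min()) / (qmax - qmin)        # spacing between quantization levels
    zero_point = qmin - torch.round(x.min() / scale)   # integer that represents the real value 0
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax).to(torch.uint8)
    return q, scale, zero_point

def dequantize(q: torch.Tensor, scale, zero_point) -> torch.Tensor:
    return (q.float() - zero_point) * scale            # approximate reconstruction of the floats

w = torch.randn(4, 4)                                  # hypothetical 32-bit weight tensor
q, scale, zp = uniform_quantize(w)
w_hat = dequantize(q, scale, zp)
print("maximum quantization error:", (w - w_hat).abs().max().item())
```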

Selecting the right parameters (like weights and activations) to quantize is crucial. The selection is based on their impact on output and the potential for computational savings. Techniques include linear quantization, which maintains a linear relationship between the quantized and original values, and non-linear quantization, which can better adapt to the data distribution [70]. These methods often require additional consideration to ensure minimal impact on the model’s performance. It is crucial to assess the model after quantization to ensure that there is no significant loss in accuracy or efficiency. In some cases, fine-tuning the quantized model can help regain any lost accuracy. Methods include retraining the model with a lower learning rate or using techniques like knowledge distillation [120]. It is also important to validate the quantized model on a representative dataset to confirm that accuracy and speed requirements are met [121]. Quantization effectively compresses DNN models, enabling their deployment in resource-constrained environments. It makes DNNs simpler and faster by reducing their computational requirements [122]. Advancements in quantization methods continue to focus on maintaining model performance while maximizing compression [71].

Quantization plays a significant role in model size reduction and inference speed, making it suitable for mobile and edge computing and real-time applications [121, 123,124,125]. Despite its benefits, it carries the risk of introducing quantization errors that can significantly impair model accuracy, especially in complex DL models [121, 123]. This concern is well-documented in the literature, with several studies addressing the impact of quantization on model accuracy and proposing various methods to mitigate these effects. For instance, research has highlighted the effectiveness of higher-bits integer quantization, thereby achieving a balance between reduced model size and maintained accuracy [126]. Additionally, the risks associated with quantization errors have been explicitly discussed, emphasizing the negative impact these errors can have on model accuracy [127]. Moreover, the development of methodologies such as sharpness- and quantization-aware training (SQuAT) has been shown to mitigate the challenges of quantization and enhance model performance [128]. These studies collectively underscore the critical need for continued innovation in quantization techniques, aiming to minimize the adverse effects of quantization errors while leveraging the efficiency gains offered by this approach in the field of DL.

3.3 Low-rank factorization

Low-rank factorization is a technique for making NNs smaller and simpler without making them substantially less effective. This method focuses on decomposing large, dense weight matrices found in DNNs into two smaller, lower-rank matrices [28, 29, 129]. The essence of low-rank factorization is its ability to combine these two resulting matrices to approximate the original, thereby achieving compression. This method reduces the model size and data processing demands, making CNNs more suitable for applications in resource-limited environments [91]. This process aims to capture the most significant information in the network’s weights, allowing for a more compact representation with minimal loss in performance.

Matrix decomposition and tensor decomposition are commonly applied techniques. Matrix decomposition in low-rank factorization involves breaking down large weight matrices into simpler matrix forms. Singular value decomposition (SVD) is a common method, where a matrix is decomposed into three smaller matrices, capturing the essential features of the original matrix. This reduces the number of parameters in DNN models, leading to lower storage and computational requirements, while striving to maintain model performance [130]. Extending beyond matrix decomposition, tensor decomposition deals with multidimensional arrays (tensors) in DNNs. Techniques like canonical polyadic decomposition (CPD) or Tucker decomposition are used, which factorize a tensor into a set of smaller tensors [131]. Tensor decomposition is particularly effective for compressing CNNs, often achieving higher compression rates compared to matrix decomposition.
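A minimal sketch of SVD-based matrix decomposition is shown below: a dense linear layer is approximated by two smaller layers whose product has the chosen rank. The layer sizes and rank are assumptions for illustration only.

```python
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    """Approximate a dense linear layer with two smaller low-rank layers via SVD."""
    W = layer.weight.data                      # shape (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]               # absorb the top singular values into U
    V_r = Vh[:rank, :]
    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=True)
    first.weight.data.copy_(V_r)               # (rank, in_features)
    second.weight.data.copy_(U_r)              # (out_features, rank)
    second.bias.data.copy_(layer.bias.data)
    return nn.Sequential(first, second)

dense = nn.Linear(512, 512)                    # 262,144 weights
compact = factorize_linear(dense, rank=64)     # 2 * 512 * 64 = 65,536 weights
x = torch.randn(1, 512)
print("approximation error:", (dense(x) - compact(x)).abs().max().item())
```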

Layers with larger weight matrices or those contributing less to the output variance are prime candidates for factorization. Techniques like sensitivity analysis can help identify these layers [132]. The model’s accuracy and size need to be assessed after the factorization process. Metrics like accuracy, inference time, and model size are key considerations. Fine-tuning the factorized network can help recover any loss in accuracy due to the compression [133]. This might involve continued training with a reduced learning rate or applying techniques like knowledge distillation. It is recommended to validate the factorized model on a relevant dataset to ensure that it still meets the required performance standards.

Low-rank factorization efficiently reduces redundancies in models, particularly in fully connected layers [28, 29, 134]. Despite this effectiveness in large-scale models, it faces challenges in its broad applicability. Its suitability is somewhat limited to scenarios with significant redundant information in fully connected layers, indicating a constraint in its versatility across different types of models [28, 29]. For instance, the effectiveness of approaches like Kronecker tensor decomposition in compressing weight matrices and reducing the parameter dimension in CNNs highlights the potential of low-rank factorization techniques in specific contexts [135]. However, the literature indicates that the applicability of low-rank factorization may be somewhat constrained, reflecting limitations, especially in models with varying architectural complexities or those not characterized by significant redundancy in their fully connected layers [136, 137]. Such challenges underscore the necessity for ongoing research to expand the scope and efficacy of low-rank factorization methods, potentially through approaches that can be applied to a broader spectrum of DL architectures without compromising model performance or efficiency.

Low-rank factorization is an effective approach for compressing DNNs, particularly useful in environments with limited computational resources. A compromise between model size and precision is inevitable, but careful optimization can mitigate performance dips. Ongoing research in this area continues to explore more efficient factorization techniques and their applications in various types of NNs [138].

3.4 Knowledge distillation

Knowledge distillation is primarily utilized in the domain of DNNs for model compression and optimization [63, 93, 94, 105]. It works by transferring knowledge from a large-scale model (the teacher) to a smaller-scale model (the student), enhancing the latter’s performance without the computational intensity of the former [139]. At its core, knowledge distillation is about extracting the informative aspects of a large model’s behavior and instilling this knowledge into a smaller model. This approach allows for the retention of high accuracy in the student model while significantly reducing its size and complexity.

In the teacher-student framework, a large-scale, well-trained teacher model guides the training of a smaller-scale student model [116]. The student aims to mimic the teacher’s output while having fewer parameters and lower computational complexity. The student is optimized both to infer the correct categorization and to replicate the teacher’s output (predictions or intermediate features); it can be trained to match the softmax output of the teacher, or to match its feature representations. Common loss functions measure how closely the student’s outputs match those of the teacher [140]. The distillation loss, for example, helps the student learn the behavior of the teacher, going beyond mere categorical inference. Knowledge distillation is especially effective at simplifying model complexity, making it suited for applications in limited-resource systems. The student’s performance is similar to that of the teacher, but it requires fewer computational resources.
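A common formulation of the distillation loss combines a temperature-softened KL-divergence term with the standard hard-label loss. The sketch below is one such formulation; the temperature, weighting factor, and random logits are arbitrary assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature: float = 4.0, alpha: float = 0.7):
    """Combine the soft-target (teacher) loss with the standard hard-label loss."""
    # Soften both distributions with the temperature, then match them via KL divergence.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits / temperature, dim=1),
        reduction="batchmean",
    ) * (temperature ** 2)            # scale so gradients are comparable to the hard-label term
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss

# Hypothetical logits for a batch of 8 samples and 10 classes.
teacher_logits = torch.randn(8, 10)
student_logits = torch.randn(8, 10, requires_grad=True)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```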

While it may seem that the overall performance of small-scale models would necessarily fall short of that of large-scale models, the primary goal of knowledge distillation is not to achieve identical performance across all tasks but to maintain comparable performance on specific tasks while reducing model size and computational complexity. Large models often come with significant computational costs and memory requirements, making them impractical for deployment in resource-constrained environments or real-time applications. By distilling the knowledge of a large-scale model into a smaller counterpart, the essential information and decision-making capabilities needed for the targeted tasks are retained while the model’s size and computational demands are reduced. The aim is thus not to replicate the exact performance of the large model but to strike a balance between model size, computational efficiency, and task-specific performance, making knowledge distillation a valuable technique for model compression and optimization in practical applications.

Distillation techniques can significantly enhance the performance of smaller models, often outperforming models trained in standard ways [141]. Knowledge distillation has been successfully adopted in areas like computer vision, NLP, and speech recognition, demonstrating its versatility and effectiveness. The choice of the large- and small-scale models is crucial: too complex a teacher can make the distillation process less effective, and the architectures of the teacher and student also have a significant impact on the distillation process’s success. Tuning hyperparameters such as the softmax temperature and the weight of the distillation loss is vital for achieving optimal performance in the student model [142].

Knowledge distillation, well-known for its capacity to encapsulate the functionalities of larger models into forms that are suited to deployment in resource-restricted settings, faces a series of intricate challenges and drawbacks. A notable issue is the deployment of extensive pre-trained language models (PLM) on devices with limited memory, which necessitates a delicate balance to optimize performance without overwhelming system resources [143]. Furthermore, the generalization capacity of distilled models may be compromised when utilizing public datasets that differ from the training datasets, diluting the model’s relevance and accuracy [144]. The constraints of existing knowledge distillation-based approaches in federated learning underscore the need for innovative solutions to address the scarcity of cross-lingual alignments for knowledge transfer and the potential for unreliable temporal knowledge discrepancies [145, 146]. Additionally, the sparsity, randomness, and varying density of point cloud data in light detection and ranging (LiDAR) semantic segmentation present challenges that can yield inferior results when traditional distillation approaches are directly applied [147]. These challenges highlight the necessity for continuous exploration and refinement of knowledge distillation techniques to ensure they can effectively reduce model size and complexity while maintaining or even enhancing performance across a broad spectrum of applications.

Knowledge distillation stands as a powerful tool in the realm of CNNs, offering an efficient way to compress models and enhance the performance of smaller networks. As the demand for deploying sophisticated models in resource-limited environments grows, knowledge distillation will continue to be a fundamental area of research and development, paving the way for more efficient and accessible AI applications [148].

3.5 Transfer learning

Transfer learning is a technique in the DNN domain that enables models to leverage pre-existing knowledge for new, often related tasks. This methodology significantly reduces the need for extensive data collection and training from scratch. In essence, transfer learning involves taking a model established for one purpose and repurposing it for a different but related task. The assumption behind this strategy is that the knowledge gained by a model in learning one task can be beneficial in learning another, especially when the tasks are similar [149,150,151,152,153,154,155].

In the feature extractor approach, a model pre-trained on a large dataset is used as a fixed feature extractor. The pre-trained layers capture general features that are applicable to a diverse set of tasks. Common in image and speech recognition tasks, this method is beneficial when there is limited training data for the new task [156]. It allows for leveraging complex features learned by the model without extensive retraining. Fine-tuning involves adjusting a pre-trained model by continuing the learning process on a different dataset. This approach often involves modifying the model design to better suit the new task, and then training these layers (or the entire model) on the new data. Fine-tuning can lead to more tailored and accurate models for specific tasks.
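The sketch below illustrates the feature-extractor approach with a torchvision ResNet-18 backbone pre-trained on ImageNet; it assumes a recent torchvision version and a hypothetical five-class target task.

```python
import torch
import torch.nn as nn
from torchvision import models

# Load a backbone pre-trained on ImageNet (assumes a recent torchvision release).
backbone = models.resnet18(weights="IMAGENET1K_V1")

# Feature-extractor approach: freeze all pre-trained layers...
for param in backbone.parameters():
    param.requires_grad = False

# ...and replace the final classification head for a hypothetical 5-class task.
backbone.fc = nn.Linear(backbone.fc.in_features, 5)

# Only the new head is optimized; for fine-tuning, some or all backbone layers
# would instead be unfrozen and trained with a smaller learning rate.
optimizer = torch.optim.SGD(backbone.fc.parameters(), lr=0.001, momentum=0.9)
```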

Transfer learning drastically reduces the time and resources required to develop effective models, as the initial learning phase is bypassed. Models can achieve higher accuracy, especially in tasks where training data is scarce, by building upon pre-learned patterns and features [157]. Transfer learning has seen successful applications in areas like medical image analysis [158], NLP [159], and autonomous vehicles [160], showcasing its versatility. The selection of the pre-trained model should reflect the nature of the target task; factors like the similarity of the datasets and the complexity of the model need consideration. Careful adjustment of the model is required to avoid overfitting to the new task or underfitting due to insufficient training. Regularization techniques and data augmentation can be helpful in this regard [161].

Transfer learning is not without its challenges and drawbacks. One such challenge is the need for large and diverse datasets to effectively train models, coupled with the limited interpretability of DL models [162]. In the context of face recognition with masks, the reduction in visible features due to masks poses a significant challenge to maintaining model performance, highlighting the complexity of adapting transfer learning to new and evolving scenarios [163]. Furthermore, the application of transfer learning in breast cancer classification underscores the technique’s dependency on domain-specific data to achieve state-of-the-art (SOTA) performance, suggesting limitations in its versatility across different domains [164]. Moreover, scenarios with limited resources emphasize the need for optimized transfer learning models [165]. The selection of appropriate transfer learning algorithms for practical applications in industrial scenarios presents another layer of complexity, underscoring the challenge of applying transfer learning to varied real-world applications [166]. Additionally, hypothesis transfer learning in binary classification highlights the balance required between leveraging existing knowledge and adapting to new tasks, further complicating the deployment of transfer learning in applications reliant on big data [167]. These references collectively underscore some challenges associated with transfer learning, from dataset and interpretability issues to computational constraints and the risk of negative transfer, highlighting the need for research and development to expand it across more comprehensive applications.

Transfer learning represents a significant leap in training DNNs, offering a practical and efficient pathway to model development and deployment. It accelerates the training process and opens up possibilities for tasks with limited data availability. As AI continues to evolve, transfer learning is poised to play an increasingly vital role [168].

4 Lightweight model design and synergy with model compression techniques

The quest for efficient and effective NN architectures is paramount. Two critical approaches emerge in this pursuit: lightweight model design and model compression. Both methodologies aim to enhance the ease of deployment and performance of DNNs, especially in resource-constrained environments [57, 169]. This section delves into the concept of lightweight model design, exemplified by groundbreaking architectures, and draws connections to model compression, illustrating how these strategies collectively drive advancements in the ML domain.

Lightweight model design focuses on constructing DNNs from the ground up, with an emphasis on minimalism and efficiency. This approach often involves innovating architectural elements, such as the fire modules in SqueezeNet and the low-rank separable convolutions in SqueezeNext, to reduce the model’s scale and computational needs without significantly compromising its performance. The objective is to create inherently efficient models that can operate effectively on devices with reduced computational capacity and memory, such as smartphones, IoT appliances, and embedded systems [169]. Model compression methodologies, by contrast, are applied to pre-existing, often more complex, DNN models after training. The goal is to make these models smaller and easier to deploy in limited-resource settings, balancing efficiency and performance against the reduction in size [170].

4.1 Overview of lightweight model architectures

4.1.1 SqueezeNet architecture

SqueezeNet represents a significant advancement in designing NN models [96, 171]. Developed with an emphasis on minimizing model size without compromising accuracy, SqueezeNet stands as an example of lightweight model design in ML.

At the heart of SqueezeNet is the use of fire modules, which are compact, carefully designed convolutional building blocks that drastically reduce the number of parameters without affecting performance [21, 172]. This design aligns with the growing need for deployable DL models in limited-resource applications, such as smartphones and embedded systems. The compact nature of SqueezeNet also offers significant benefits in terms of reduced memory requirements and faster computational speeds, making it ideal for real-time applications [96]. SqueezeNet’s architecture has also been influential in the realm of model compression. Its highly efficient design makes it an excellent baseline for applying further compression techniques. These methods enhance SqueezeNet’s ease of deployment, particularly in scenarios where computational resources are limited. The adaptability of SqueezeNet to various compression techniques exemplifies its versatility and robustness as a DL model [89].

The application of SqueezeNet extends beyond theoretical research, finding practical use in areas including media analysis and mobile applications. Its influence has also paved the way for future research in lightweight NN design, inspiring the development of subsequent architectures like MobileNet and SqueezeNext. These models build on the foundational principles established by SqueezeNet, further pushing the boundaries of efficiency in NN design [95, 173].

4.1.2 SqueezeNext architecture

SqueezeNext is an advanced lightweight CNN architecture [173]. Building upon the principles of SqueezeNet, SqueezeNext integrates novel design elements to achieve even greater efficiency in model size and computation. SqueezeNext stands out for its innovative architectural choices, which include low-rank separable convolutions and optimized layer configurations. These features enable it to maintain high accuracy while drastically reducing the model’s size and computational demands. This efficiency is particularly beneficial for deployment in environments with stringent memory and processing constraints, such as mobile devices and edge computing platforms. The design of SqueezeNext demonstrates the progress made in crafting models that are both lightweight and capable [173].

SqueezeNext’s design also contributes significantly to the field of model compression. Its inherent efficiency provides one foundation for applying additional compression techniques. These methods further enhance the model’s suitability for deployment in resource-limited settings, showcasing SqueezeNext’s versatility in various application scenarios. The architecture serves as a benchmark in the study of model compression, providing insights into achieving an optimal balance between model size, speed, and accuracy [21]. The impact of SqueezeNext extends to practical applications in areas like image processing, real-time analytics, and IoT devices.

4.1.3 MobileNetV1 architecture

MobileNetV1, introduced by researchers at Google, marks a significant milestone in the development of efficient DL architectures [95]. It is specifically engineered for mobile and embedded vision applications, offering a perfect blend of compactness, speed, and accuracy. The core innovation of MobileNetV1 lies in its use of depth-wise separable convolutions. This design reduces the computational cost and model size compared to conventional CNNs. Depthwise separable convolutions split the standard convolution into two layers - a depthwise convolution and a pointwise convolution - which substantially decreases the number of parameters and operations required. This architectural choice makes MobileNetV1 exceptionally suited for mobile devices, where computational resources and power are limited [21, 95].
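A minimal sketch of such a block is given below, following a common MobileNetV1-style conv-BN-ReLU layout; channel counts in the usage example are illustrative. For a 3×3 kernel, the factorization needs roughly (3·3·Cin + Cin·Cout) weights instead of the 3·3·Cin·Cout required by a standard convolution.

```python
import torch
import torch.nn as nn

def depthwise_separable_conv(in_ch, out_ch, stride=1):
    """Sketch of a MobileNetV1-style block: depthwise conv followed by pointwise conv."""
    return nn.Sequential(
        # Depthwise: one 3x3 filter per input channel (groups=in_ch).
        nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride, padding=1,
                  groups=in_ch, bias=False),
        nn.BatchNorm2d(in_ch),
        nn.ReLU(inplace=True),
        # Pointwise: 1x1 convolution mixes information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

block = depthwise_separable_conv(32, 64)
y = block(torch.randn(1, 32, 112, 112))  # -> (1, 64, 112, 112)
```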

MobileNetV1’s efficient design has significantly impacted the deployment of DL models on mobile and edge devices. Its ability to deliver high performance with low latency and power consumption has enabled a wide range of applications, from real-time image and video processing to complex ML tasks on handheld devices. This breakthrough has opened up new possibilities in the field of mobile computing, where the demand for powerful yet efficient AI models is constantly growing [174].

MobileNetV1 not only stands as a remarkable achievement in its own right, but also lays the groundwork for future advancements in lightweight DL models. It has inspired a series of subsequent architectures, like MobileNetV2 and MobileNetV3, each iterating on the initial design to achieve even greater efficiency and performance. The principles established by MobileNetV1 continue to influence the design of NNs aimed at edge computing and IoT devices [175].

4.1.4 MobileNetV2 architecture

MobileNetV2, an evolution of its predecessor MobileNetV1, further refines the concept of efficient NN design for mobile and edge devices. Introduced by Google researchers, MobileNetV2 incorporates novel architectural features to enhance performance and efficiency, making it a standout choice in the landscape of lightweight DL models [174]. MobileNetV2 introduces the concept of inverted residuals and linear bottlenecks, which are key to its improved efficiency and accuracy. These innovations involve using lightweight, depth-wise separable convolutions to filter features in the intermediate expansion layer, and then projecting them back to a low-dimensional space. This approach reduces the computational burden and preserves important information flowing through the network. The result is a model that offers higher accuracy and efficiency, particularly in applications where latency and power consumption are critical considerations [174].
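The sketch below illustrates an inverted residual block with a linear bottleneck in PyTorch. It is a simplified rendering rather than the exact MobileNetV2 block, and the expansion ratio of 6 is assumed as a typical value.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """Sketch of a MobileNetV2-style inverted residual with a linear bottleneck."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=6):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            # 1x1 expansion to a higher-dimensional space.
            nn.Conv2d(in_ch, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 3x3 depthwise filtering in the expanded space.
            nn.Conv2d(hidden, hidden, 3, stride, 1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.ReLU6(inplace=True),
            # 1x1 linear projection back to a low-dimensional space (no activation).
            nn.Conv2d(hidden, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```

The projection layer deliberately omits the non-linearity, which is what "linear bottleneck" refers to: applying an activation in the low-dimensional space would discard information.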

MobileNetV2’s enhanced efficiency has significant implications for mobile and edge computing. Its ability to deliver high-performance ML with minimal resource usage has broadened the scope of applications possible on mobile devices. This includes advanced image and video processing tasks, real-time object detection, and augmented/virtual reality (AR/VR) - all on devices with limited computational capabilities. MobileNetV2’s architecture has set a new benchmark for developing AI models that are both powerful and resource-efficient [175]. Its architectural innovations have been foundational in the creation of more advanced models like MobileNetV3 and beyond, which continue to push the boundaries of efficiency and performance in NN design. The legacy of MobileNetV2 is evident in the ongoing efforts to optimize DL models for the increasingly diverse requirements of mobile and edge AI [176].

4.1.5 MobileNetV3 architecture

MobileNetV3 represents a further refinement in the development of efficient and compact DL models tailored for mobile and edge devices. Developed by Google, MobileNetV3 builds upon the foundations laid by its predecessors, MobileNetV1 and MobileNetV2, incorporating several novel architectural innovations to enhance performance while maintaining efficiency [177]. One of the key innovations in MobileNetV3 is the use of a neural architecture search (NAS) to optimize the network structure. This automated design process identifies the most efficient network configurations, balancing the trade-offs between latency, accuracy, and computational cost. Additionally, MobileNetV3 introduces squeeze-and-excitation modules, which adaptively recalibrate channel-wise feature responses by explicitly modeling interdependencies between channels. This improves the model’s representational power without a significant increase in computational burden [177].

MobileNetV3 also incorporates a combination of hard swish (h-swish) activation functions and new efficient building blocks, such as the MobileNetV3 blocks, which include lightweight depthwise convolutions and linear bottleneck structures. These architectural features collectively reduce the computational load and enhance the model’s performance on mobile and edge devices [177]. The efficiency and high performance of MobileNetV3 make it particularly suitable for real-time applications, such as image classification, object detection, and other vision-related tasks on resource-constrained devices. Its compact design ensures low latency and reduced power consumption, enabling deployment in diverse environments, from smartphones to IoT devices [177].
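The following sketch illustrates two of these ingredients, the h-swish activation and a squeeze-and-excitation block, in PyTorch; the reduction ratio is illustrative, and PyTorch also ships nn.Hardswish as a built-in equivalent of h-swish.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HSwish(nn.Module):
    """Hard-swish activation used in MobileNetV3: x * ReLU6(x + 3) / 6."""
    def forward(self, x):
        return x * F.relu6(x + 3.0) / 6.0

class SqueezeExcite(nn.Module):
    """Squeeze-and-excitation: recalibrate channels from globally pooled statistics."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # squeeze: global average pool
            nn.Conv2d(channels, channels // reduction, 1), # bottleneck FC (as 1x1 conv)
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Hardsigmoid(),                              # excitation: per-channel gates in [0, 1]
        )

    def forward(self, x):
        # Reweight each channel of the input by its learned gate.
        return x * self.gate(x)
```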

The principles and techniques introduced in MobileNetV3 have been adopted and extended in various new architectures, further advancing the SOTA in lightweight and efficient model design. These developments continue to push the boundaries of what is achievable in the context of mobile and edge AI applications, ensuring that high-performance DL models remain accessible and practical for real-world use [178, 179].

4.1.6 ShuffleNetV1 architecture

ShuffleNetV1 marks a significant advancement in the field of efficient NN architectures. Developed to cater to the increasing demand for computational efficiency in mobile and edge computing, ShuffleNetV1 introduces an innovative approach to designing lightweight DL models. The defining feature of ShuffleNetV1 is its use of pointwise group convolutions and channel shuffle operations. These techniques dramatically reduce computational costs while maintaining model accuracy. Point-wise group convolutions divide the input channels into groups, reducing the number of parameters and computations. The channel shuffle operation then allows for the cross-group information flow, ensuring that the grouped convolutions do not weaken the network’s representational capabilities. This unique combination of features enables ShuffleNetV1 to offer a highly efficient network architecture, particularly suitable for scenarios where computational resources are limited [180].
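The channel shuffle operation itself is a simple reshape-transpose-reshape, as the sketch below shows; the toy example interleaves the channels of two groups so that the next grouped convolution sees channels originating from both.

```python
import torch

def channel_shuffle(x, groups):
    """Sketch of the ShuffleNet channel shuffle: interleave channels across groups
    so that subsequent grouped convolutions can exchange information."""
    n, c, h, w = x.shape
    assert c % groups == 0
    # Reshape to (batch, groups, channels_per_group, H, W), swap the two group axes,
    # then flatten back so channels from different groups are interleaved.
    x = x.view(n, groups, c // groups, h, w)
    x = x.transpose(1, 2).contiguous()
    return x.view(n, c, h, w)

x = torch.arange(8, dtype=torch.float32).view(1, 8, 1, 1)
print(channel_shuffle(x, groups=2).flatten())  # tensor([0., 4., 1., 5., 2., 6., 3., 7.])
```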

ShuffleNetV1’s efficiency and high performance make it a valuable asset in mobile and edge computing applications. Its design addresses the challenges of running complex DL models on devices with constrained processing power and memory, such as smartphones and IoT devices. The architecture has been widely adopted for tasks like real-time image classification and object detection, offering a practical solution for deploying advanced AI capabilities in resource-limited environments [181].

The introduction of ShuffleNetV1 has had a significant impact on the research and development of efficient NN models. Its approach to reducing computational demands without compromising accuracy has contributed to subsequent architectures, including ShuffleNetV2. These developments continue to explore and expand the opportunities of what is possible in the realm of lightweight and efficient DL models [180].

4.1.7 ShuffleNetV2 architecture

ShuffleNetV2 represents a progression in the evolution of efficient NN architectures, building upon the foundations laid by its predecessor, ShuffleNetV1. The ShuffleNetV2 architecture was specifically designed to address the limitations and challenges observed in previous lightweight models, particularly in the context of computational efficiency and practical deployment on mobile and edge devices. By introducing a series of novel design principles and techniques, ShuffleNetV2 achieves a superior balance between speed and accuracy, making it highly effective for real-world applications [181].

The core innovation of ShuffleNetV2 lies in its strategy to optimize the network’s computational graph through a more refined use of channel operations. Unlike its predecessor, ShuffleNetV2 focuses on addressing the issues of memory access cost and network fragmentation. The architecture introduces a channel split operation, where each layer’s input is split into two branches: one passes through a short stack of pointwise and depthwise convolutions, while the other is left unchanged as an identity path, significantly reducing computation and memory footprint. Additionally, ShuffleNetV2 employs an improved channel shuffle operation that ensures an even and efficient mixing of information across feature maps, thereby enhancing the network’s representational power without introducing substantial computational overhead [181]. ShuffleNetV2 outperforms its predecessor and other contemporary lightweight models in terms of speed and accuracy on various benchmarks. It achieves a favorable trade-off between model size and computational efficiency, making it particularly well-suited for deployment in scenarios with stringent resource constraints, such as mobile and edge AI applications [181].

The impact of ShuffleNetV2 extends beyond its immediate performance benefits. Its introduction has influenced the broader field of efficient NN design, inspiring subsequent research and development efforts aimed at further optimizing lightweight architectures. By addressing the critical bottlenecks in mobile and edge AI deployment, ShuffleNetV2 has set a new standard for what is achievable in terms of balancing efficiency and accuracy in DL models. This has paved the way for more sophisticated applications in real-time image processing, object detection, and other AI-driven tasks, ensuring that high-performance DL remains accessible and practical for a wide range of real-world uses [181].

4.1.8 EfficientNet architecture

EfficientNet, a groundbreaking series of CNNs, represents a significant advancement in the efficient scaling of DL models. Developed with a focus on balanced scaling of network dimensions, EfficientNet has set new standards for achieving SOTA accuracy with remarkably efficient resource utilization. The key innovation of EfficientNet is its systematic approach to scaling, called compound scaling. Unlike traditional methods that independently scale network dimensions (depth, width, or resolution), EfficientNet uses a compound coefficient to uniformly scale these dimensions in a principled manner. This balanced scaling method allows EfficientNet to achieve higher accuracy without an exponential increase in computational complexity. The network efficiently utilizes resources, making it highly effective for both high-end and resource-constrained environments [176].
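As a rough illustration, the snippet below applies compound scaling with the coefficients reported for EfficientNet (\(\alpha \approx 1.2\) for depth, \(\beta \approx 1.1\) for width, \(\gamma \approx 1.15\) for resolution, chosen so that \(\alpha \cdot \beta ^2 \cdot \gamma ^2 \approx 2\)); the baseline depth, width, and resolution values are purely illustrative.

```python
# Minimal sketch of EfficientNet-style compound scaling.
# alpha, beta, gamma are the coefficients reported for EfficientNet (depth, width,
# resolution); phi is the user-chosen compound coefficient.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15   # alpha * beta**2 * gamma**2 is roughly 2

def compound_scale(base_depth, base_width, base_resolution, phi):
    depth = round(base_depth * ALPHA ** phi)             # number of layers
    width = round(base_width * BETA ** phi)              # number of channels
    resolution = round(base_resolution * GAMMA ** phi)   # input image size
    return depth, width, resolution

# Doubling phi roughly doubles the FLOPs budget while keeping the dimensions balanced.
for phi in range(4):
    print(phi, compound_scale(base_depth=18, base_width=32, base_resolution=224, phi=phi))
```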

EfficientNet’s performance sets a benchmark for various ML challenges, especially in image classification tasks. The network’s ability to scale efficiently across different computational budgets makes it adaptable for a wide range of applications, from mobile devices to cloud-based servers. EfficientNet has demonstrated superior performance in tasks requiring high accuracy and efficiency, such as object detection, image segmentation, and transfer learning across different domains [176]. Its principles have been adopted and adapted in subsequent research, pushing the limits of what is possible in terms of efficiency and performance in NNs [182, 183].

4.1.9 EfficientNetV2 architecture

EfficientNetV2 represents a significant advancement in the field of efficient DL models, building upon the success of the original EfficientNet architecture. Developed by researchers at Google, EfficientNetV2 introduces several novel techniques to further enhance performance and efficiency, making it one of the leading models for mobile and edge device applications [184].

EfficientNetV2 incorporates a new scaling method called progressive learning, which adjusts the size of the model during training to improve both accuracy and efficiency. This technique begins training with smaller resolutions and simpler augmentations, progressively increasing the resolution and complexity as training progresses. This method not only speeds up the training process but also helps the model achieve higher accuracy. Another key innovation in EfficientNetV2 is the use of fused convolutional blocks, which combine the efficiency of depthwise convolutions with the accuracy benefits of regular convolutions. These blocks help reduce the overall computational cost while maintaining high performance. Additionally, EfficientNetV2 employs various training-aware optimizations, such as improved data augmentations and regularization techniques, which contribute to its superior performance [184].
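A minimal sketch of such a fused block is shown below, assuming the Fused-MBConv layout in which a single regular 3×3 convolution replaces the separate 1×1 expansion and 3×3 depthwise convolution of the standard MBConv block; channel counts and the expansion ratio are illustrative.

```python
import torch
import torch.nn as nn

class FusedMBConv(nn.Module):
    """Sketch of an EfficientNetV2-style fused block: one regular 3x3 convolution
    performs both channel expansion and spatial filtering."""
    def __init__(self, in_ch, out_ch, stride=1, expand_ratio=4):
        super().__init__()
        hidden = in_ch * expand_ratio
        self.use_residual = (stride == 1 and in_ch == out_ch)
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, stride, 1, bias=False),  # fused expand + 3x3 conv
            nn.BatchNorm2d(hidden), nn.SiLU(inplace=True),
            nn.Conv2d(hidden, out_ch, 1, bias=False),            # 1x1 linear projection
            nn.BatchNorm2d(out_ch),
        )

    def forward(self, x):
        out = self.block(x)
        return x + out if self.use_residual else out
```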

The architecture of EfficientNetV2 is designed to be versatile, performing well across a wide range of tasks, including image classification, object detection, and segmentation. Its balanced approach to scaling and optimization allows it to deliver SOTA accuracy with significantly reduced computational resources, making it ideal for deployment in environments with limited processing power and memory, such as mobile devices and IoT platforms [184]. EfficientNetV2 has set new benchmarks in the field of DL, influencing subsequent research and inspiring new directions in the development of efficient NN models. The principles and techniques introduced in EfficientNetV2 have been adopted and further refined in various other architectures, pushing the boundaries of what is possible in efficient model design for real-world applications [185].

4.1.10 Summary of lightweight model architectures

Some of the key lightweight model architectures are summarized in Table 3, together with their year of launch, key features, and impact on various applications.

Table 3 Summary of lightweight model architectures: year of launch, key features, impact, and applications

4.2 Integration with compression techniques

The concept of lightweight design focuses on architecturally optimizing DNNs to minimize their demand on computational resources without significantly undermining their efficacy. Innovations such as efficient convolutional layers [95, 174] introduce structural efficiencies that lower the parameter count and computational load. These innovations are crucial for enabling the deployment of high-performing DNNs on devices with limited computational capacity, like smartphones and IoT devices. The fusion of lightweight design and model compression in DNNs represents a crucial advancement for deploying advanced ML models under computational and resource constraints [186,187,188].

The synergy between lightweight model design and model compression represents a comprehensive approach to optimizing CNNs. While the former approach is proactive, building efficiency into the model’s architecture, the latter is reactive, refining and streamlining models that have already been developed. Together, they address the diverse challenges in deploying advanced ML models, from the initial design phase through to post-training optimization [170]. This section will explore how SqueezeNet, SqueezeNext, and similar architectures embody the principles of lightweight design and how their integration with model compression techniques exemplifies the broader strategy of NN optimization in ML.

Lightweight model design and model compression, though related, represent distinct approaches in DL. Lightweight model design focuses on creating architectures that are inherently optimized for performance and low resource consumption, while maintaining satisfactory accuracy. This involves techniques like employing smaller convolutional filters and depthwise separable convolutions to reduce the number of parameters and computational intensity of each layer [189]. In contrast, model compression is the process of downsizing an existing model to diminish its size and computational demands without significantly compromising accuracy. The objective here is to adapt a pre-trained model for more efficient deployment on specific hardware platforms [190]. Both methods aim to produce models that are well-suited for deployment on devices with limited resources. The following subsections highlight some prominent lightweight models designed to have fewer parameters and lower computational requirements compared to traditional DNNs [191, 192].

Recent studies have contributed to the field by proposing novel approaches [69, 193, 194], exploring various compression techniques for DNNs, including compact models, tensor decomposition [131, 195], data quantization [122, 196], and network sparsification [197, 198]. These methods are instrumental in the design of NN accelerators, facilitating the deployment of efficient ML models on constrained devices. A noteworthy application in video coding [199] suggested a lightweight model achieving up to 6.4% average bit reduction compared to high efficiency video coding (HEVC), showcasing the potential of architecturally optimized DNNs in real-world applications. Additionally, a novel lightweight model for efficient traffic classification [200] was developed, utilizing thin modules and multi-head attention to significantly reduce parameter count and running time, demonstrating the practical utility of lightweight designs in enhancing running efficiency. A pruning algorithm proposed to decrease the computational cost and improve the accuracy of action recognition in CNNs [201] reduces the model size and also decreases overfitting, leading to enhanced performance on large-scale datasets. Finally, a hardware/software co-design approach for a NN accelerator focuses on model compression and efficient execution [202]: a two-phase filter pruning framework was proposed for model compression, optimizing the execution of DNNs. This co-design approach exemplifies how the integration of hardware and software can enhance the performance and efficiency of DNNs in practical applications.

4.2.1 Combined impact on performance and efficiency

Innovative techniques, such as pruning depthwise separable convolution networks [203], highlight the potential for improving speed and maintaining accuracy, emphasizing the importance of structural efficiency in lightweight design. Meanwhile, the work on adaptive tensor-train decomposition [204] showcases the significant reduction in parameters and computation, further underscoring the advancements in model compactness and efficiency for mobile devices. Cyclic sparsely connected (CSC) architectures propose structurally sparse designs for both fully connected and convolutional layers in CNNs [205]. Unlike traditional pruning methods that require indexing, CSC architectures are designed to be inherently sparse, reducing memory and computation complexity to \(\mathcal {O} (N \log N)\), where N denotes the number of connections present in a layer. This technique demonstrates an innovative way to achieve model compactness and computational efficiency without the overhead associated with conventional sparsity methods. An efficient evolutionary algorithm was introduced for NAS [206]. This method enhances the search efficiency for task-specific NN models, illustrating how evolutionary strategies can automate the design of efficient and effective CNN architectures for various tasks. These examples collectively underscore the diverse and innovative strategies being explored to make DNNs more efficient and adaptable, reflecting the ongoing commitment within the research community to push the boundaries of what is possible in ML efficiency.

The combination of model compression techniques and their impacts on model performance reveals a complex landscape. Integrating various model compression techniques without compromising the original model’s effectiveness is a well-acknowledged challenge in the field [207, 208]. Combining different compression methods can indeed lead to increased efficiency. However, it presents the challenge of balancing improvements in memory usage and computational efficiency against the potential for accuracy reduction and the introduction of noise. This variability underscores the need for application-specific evaluation and adaptation [56, 60]. Moreover, the complexity of optimizing these methods for specific applications highlights an ongoing research area, necessitating innovation to address factors like fairness and bias and to explore hardware advancements for further enhancement. This includes developing strategies that can effectively leverage the strengths of each compression approach while mitigating their drawbacks, ensuring that the resulting models are efficient, suitable for deployment in limited-resource settings, and capable of performing close to the standards set by their uncompressed counterparts [123, 209].

4.2.2 Synergies between model compression and explainable artificial intelligence (XAI)

When discussing model compression, it is crucial to also consider the role of explainable artificial intelligence (XAI) as a complementary tool in the process [210, 211]. XAI provides insights into how ML models make decisions, which is particularly beneficial during the compression process. By understanding which parts of the model are most important for making accurate predictions, developers can make more informed decisions about which components to prune or quantize. This targeted approach can help maintain the model’s performance while reducing its size. Furthermore, XAI can help identify potential biases or errors introduced during compression, ensuring that the compressed model remains robust and reliable [212,213,214]. Integrating XAI with model compression techniques not only enhances the interpretability of the compressed models but also aids in fine-tuning the balance between model size and performance. This synergy is essential for developing efficient, scalable, and trustworthy AI systems capable of operating effectively in diverse and resource-limited environments.

In looking towards future directions for the advancement of CNNs, a multidisciplinary approach emerges across various domains. The potential of automated ML (AutoML) [215] lies in streamlining model optimization by simplifying the search for efficient architectures, thus easing the model design process. Meanwhile, the imperative of energy efficiency takes center stage [216], with calls for greener practices in CNN development and for the adoption of energy-efficient models to mitigate environmental impact.

The exploration of lightweight design and model compression techniques underscores a significant stride in optimizing DNN architectures for efficient deployment on devices with constrained resources. Lightweight design approaches proactively embed efficiency into the model’s architecture, while model compression methods reactively refine existing models to reduce their size and computational demands. This dual strategy addresses the diverse challenges encountered from the initial design phase to post-training optimization. The integration of these techniques exemplifies a comprehensive approach to NN optimization, balancing performance and resource efficiency. Studies have demonstrated the practical utility of these approaches in various applications, including video coding, traffic classification, and action recognition, highlighting their impact on enhancing model performance and efficiency. The ongoing research and innovations in this field continue to push the boundaries of what is achievable in ML efficiency, ensuring that advanced models can be effectively deployed in real-world scenarios with limited computational capacity.

5 Performance evaluation criteria

This section delves into the methodologies and metrics used to assess the efficacy of model compression techniques. Key aspects of performance evaluation, such as accuracy, model size, computational speed, and energy efficiency, are discussed. The trade-offs between maintaining high accuracy and achieving significant compression rates are explored, highlighting the challenges and breakthroughs in this domain. Additionally, this section discusses the practical implications of model compression in real-world applications, emphasizing the need for robust and efficient models that can operate under computational constraints.

5.1 Compression ratio

The compression ratio \(\alpha \) can be determined by calculating the ratio between the original and compressed model’s size [21, 90]. For example, if the original model size is 100 MB and the compressed model size is 10 MB, then the compression ratio would be 10:1 (100:10). Alternatively, it can be expressed as the ratio of the total number of parameters in the original model to that in the simplified model [217], as in the following expression:

$$\alpha (M, M^*) = \frac{a}{a^*}$$

where a is the number of parameters in the initial model M and \(a^*\) is the number of parameters in the simplified model \(M^*\). The compression ratio \(\alpha (M, M^*)\) of \(M^*\) over M is the ratio of the total number of parameters in M to the total number of parameters in \(M^*\). In addition, a commonly used benchmark is the space-saving index \(\beta \), defined as:

$$\beta (M, M^*) = \frac{a-a^*}{a^*}$$

where \(\beta (M, M^*)\) is the space-saving rate of \(M^*\) relative to M.
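A small numerical sketch of both metrics, using illustrative parameter counts, is given below.

```python
def compression_ratio(num_params_original, num_params_compressed):
    """alpha(M, M*) = a / a*: how many times fewer parameters the compressed model has."""
    return num_params_original / num_params_compressed

def space_saving(num_params_original, num_params_compressed):
    """beta(M, M*) = (a - a*) / a*: the space-saving index defined above."""
    return (num_params_original - num_params_compressed) / num_params_compressed

# Illustrative numbers: 60 million parameters pruned down to 6 million.
a, a_star = 60_000_000, 6_000_000
print(compression_ratio(a, a_star))  # 10.0, i.e. a 10:1 compression ratio
print(space_saving(a, a_star))       # 9.0
```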

5.2 Speed up rate

The speed-up rate focuses on quantifying how much faster a compressed model performs compared to the original model. It provides a clear measurement of the reduction in computational time achieved by compression.

To calculate the speed-up rate, we compare the inference time of the original model with that of the compressed model [217]. For example, if the original model takes 10 s to perform inference and the compressed model takes 3 s, the speed-up achieved is 10:3. The speed-up rate \(\delta (M, M^*)\) is defined as:

$$\delta (M, M^*) = \frac{s}{s^*}$$

where \(s\) is the running time of the original model \(M\), and \(s^*\) is the running time of the compressed model \(M^*\).

Most studies use the average training duration per epoch or the average testing duration to measure running time. The speed-up rate is a crucial metric for understanding the efficiency gains from model compression, especially as smaller-scale models typically result in faster computation for both training and testing stages, closely linked to the compression rate.

5.3 Inference latency

Inference latency measures the absolute time required for a model to process an input and produce an output [217]. While the speed-up rate captures the relative improvement in computational efficiency, inference latency is expressed directly in time units (e.g., seconds, milliseconds). This distinction is crucial for applications that demand real-time processing, where minimizing delay is paramount, such as autonomous driving, video processing, and interactive systems.

Reducing inference latency involves several techniques beyond model compression: model parallelism, which splits the model across multiple devices to reduce the time taken for each layer; data parallelism, which distributes input data across multiple devices to reduce batch processing time; weight sharing, which shares model parameters across multiple instances to minimize memory footprint and computation; and hardware acceleration, which utilizes specialized hardware like GPUs or tensor processing units (TPUs) to speed up inference computations.
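A rough way to measure inference latency in practice is to average the wall-clock time of repeated forward passes after a warm-up phase, as in the sketch below (CPU timing only; GPU timing would additionally require device synchronization). The model name small_net in the usage comment is hypothetical.

```python
import time
import torch

@torch.no_grad()
def average_latency_ms(model, input_shape=(1, 3, 224, 224), warmup=10, runs=100):
    """Rough CPU latency measurement: average wall-clock time per forward pass."""
    model.eval()
    x = torch.randn(*input_shape)
    for _ in range(warmup):          # warm-up iterations are excluded from timing
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs * 1000.0

# Example with a hypothetical compressed model `small_net`:
# print(average_latency_ms(small_net))
```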

5.4 Label loyalty

Label loyalty measures how closely a compressed model predicts the same labels as the original model. It is computed by comparing the labels predicted by the compressed (small-scale) model with those predicted by the original (large-scale) model over the dataset being evaluated, where N is the total number of samples. The label loyalty score is the fraction of instances where the compressed model predicts the same label as the original model [207].

$$\text {Label Loyalty} = \frac{\text {Samples with matching predicted labels}}{N}$$
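A direct implementation of this definition, using illustrative predictions, could look as follows.

```python
import numpy as np

def label_loyalty(labels_original, labels_compressed):
    """Fraction of samples on which the compressed model predicts the same label
    as the original model."""
    labels_original = np.asarray(labels_original)
    labels_compressed = np.asarray(labels_compressed)
    return float(np.mean(labels_original == labels_compressed))

# Illustrative predictions over N = 5 samples.
print(label_loyalty([0, 1, 2, 1, 0], [0, 1, 2, 0, 0]))  # 0.8
```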

5.5 Probability loyalty

Probability loyalty measures how closely the probability distribution of the compressed model matches that of the original model [207]. It is calculated using the Jensen-Shannon (JS) divergence between the predicted probability distributions of the compressed model and the original model. The JS divergence is a symmetric and finite distance-like metric that measures the difference between two probability distributions. It is defined by the following expression:

$$D_{JS}(P,Q)=\frac{1}{2}D_{KL}(P,M)+\frac{1}{2}D_{KL}(Q,M)$$

where P and Q are the probability distributions of the original and compressed models, respectively, and M is the average of P and Q.

The probability loyalty score is then defined as:

$$L_p(P,Q) = 1 - \sqrt{D_{JS}(P,Q)}$$

where \(L_p\) is the probability loyalty score, P is the predicted probability distribution of the original model, Q is the probability distribution of the compressed model, and \(D_{JS}\) is the JS divergence between P and Q.
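The sketch below computes probability loyalty from these definitions, assuming base-2 logarithms so that the JS divergence is bounded by 1; the softmax outputs are illustrative.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence between two discrete distributions (base-2 logs,
    so the value lies in [0, 1])."""
    p, q = np.asarray(p, float) + eps, np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log2(a / b))   # Kullback-Leibler divergence
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def probability_loyalty(p_original, q_compressed):
    """L_p(P, Q) = 1 - sqrt(D_JS(P, Q)), as defined above."""
    return 1.0 - np.sqrt(js_divergence(p_original, q_compressed))

# Illustrative softmax outputs of the original and compressed models for one sample.
print(probability_loyalty([0.7, 0.2, 0.1], [0.6, 0.3, 0.1]))  # close to 1 -> high loyalty
```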

5.6 Robustness

A model’s robustness can be measured using different metrics, depending on the types of perturbations the model is expected to be resistant to [217]. Some common metrics for calculating robustness include the following:

Adversarial accuracy: This measures how accurate the model is on inputs that have been deliberately changed to cause the model to misclassify them.

Robustness radius: This measures the maximum amount of perturbation that the model can tolerate while still correctly classifying an input.

Worst-case error: This measures the worst-case error of the model over a set of perturbations.

Sensitivity analysis: This measures the sensitivity of the model’s output to minor modifications in the input.

The specific method for calculating robustness will depend on the type of perturbation that the model is expected to be robust against and the specific application.

5.7 Computation reduction

To calculate computation reduction, the number of operations required to perform inference on the original model is compared with that of the compressed model [217]. For example, if the original model requires 1000 operations to perform inference and the compressed model requires 100, then the computation reduction would be 10:1 (1000:100).

6 Model compression in various domains

This section delves into the cutting-edge advancements in model compression and performance optimization, highlighting various strategies such as pruning, quantization, knowledge distillation, and transfer learning. These techniques aim to reduce the model size and computational demands without significantly compromising accuracy, thereby enabling faster, more energy-efficient, and cost-effective ML solutions. Through detailed analysis and comparison of methods like SqueezeNet, model pruning, and innovative compression tactics, this discourse sheds light on the potential and challenges of optimizing DL models for real-world applications.

This section provides case studies detailing the performance of model compression techniques in different scenarios. Developing new techniques for compressed models is a challenging and important area of research that requires a careful balance between model complexity, accuracy, and practical considerations. Researchers need to address various challenges, including maintaining accuracy while compressing the model, optimizing performance in resource-constrained environments like mobile devices, and integrating compressed models effectively into real-world applications [96, 110, 113, 123, 208].

Furthermore, compressed models must overcome obstacles related to their configuration and hardware constraints. Current SOTA approaches often rely on well-designed CNN models, limiting flexibility in changing configurations for more complex tasks. Additionally, the extension of CNNs to platforms such as smartphones, robots, or self-navigating vehicles is impacted by hardware bottlenecks. In the context of big data challenges, compressed models face issues related to prediction, data cleansing, dimensionality reduction, and other tasks. Researchers must develop robust models and optimization techniques, including parallel and decentralized methods, to effectively handle big data challenges.

6.1 Model compression in image classification

Pruning and quantization are widely recognized for their effectiveness in reducing the computational demands of DNNs without significantly compromising accuracy. A method proposed in one study utilizes both techniques to compress DNNs, enabling their deployment on embedded platforms for image classification tasks. This approach demonstrated that strategic model compression could retain performance levels while significantly reducing model size and computational requirements [56]. In real-time image processing, such as synthetic aperture radar (SAR) ship detection, the need for rapid inference and minimal model size is paramount. Research in this area has shown that model compression can maintain high accuracy in image classification tasks while substantially reducing model size and inference time. This balance is crucial for applications where timely processing of large image datasets is essential [57]. Studies have also explored the impact of model compression techniques on improving the efficiency of image classifiers on platforms with limited computational capabilities. By employing various compression strategies, researchers have been able to enhance the performance of image classification models, demonstrating the potential for efficient image analysis in resource-constrained environments [53]. Further research has focused on developing tailored compression techniques that not only reduce model size but also improve accuracy in image classification. These techniques are designed to optimize widely used models, demonstrating that model compression can lead to better efficiency and performance in image classification tasks [58].

In summary, model compression techniques such as pruning and quantization are instrumental in optimizing image classification models for various applications, including real-time image processing and efficient operation on resource-constrained devices. Tailored compression strategies further illustrate the potential to enhance both the efficiency and accuracy of these models, underlining the significance of model compression in the ongoing advancement of image classification technologies.

6.1.1 SqueezeNet

SqueezeNet, a CNN architecture [96], achieves AlexNet-level performance on the ImageNet database with 50\(\times \) fewer parameters. The primary goal of the paper is to find a model with very few parameters that is still accurate. Smaller CNN architectures offer several advantages, such as more effective ML training, less processing time when deploying new models, and cost-effective field programmable gate array (FPGA) and embedded deployment.

SqueezeNet reduces model size by a factor of 50 compared to AlexNet while matching or exceeding its top-1 and top-5 accuracy. When deep compression with 8-bit quantization is applied, SqueezeNet becomes 363 times smaller than 32-bit AlexNet with similar performance. Furthermore, applying compression with 6-bit quantization makes SqueezeNet 510 times smaller than 32-bit AlexNet, again with similar performance.

For instance, it is observed that SqueezeNet achieves a 1.5\(\times \) speed up over AlexNet on a central processing unit (CPU) and a 4\(\times \) speed up on a GPU. Additionally, it uses \(3\!-\!4\times \) less memory than AlexNet during inference, making it well suited to resource-limited applications. Moreover, the developers of SqueezeNet have successfully implemented a hardware accelerator known as the efficient inference engine (EIE), which can work directly on the compressed model, achieving significant acceleration and energy efficiency gains. This highlights the potential for model compression techniques to improve not only model size but also energy efficiency, which is critical for mobile and embedded devices.

6.1.2 Model pruning for image classification

In this case [110], a pruning scheme for remote optical image analysis is proposed that reduces the computational cost of CNNs, while maintaining high accuracy. The proposed pruning technique was compared to other pruning techniques. The article also aims to show that refinement of the pruned models can further improve their efficiency. The preliminary results indicated that the presented method achieves comparable or even greater performance than the original models while reducing the number of parameters and computation costs.

The presented technique was applied to the UC Merced land-use dataset, which comprises 21 land-use scene classes. Subsets of the images are divided for training, fine-tuning, validation, and testing. The models are trained with SGD, a batch size of 64, and an initial learning rate of 0.001. The study’s outcome reveals that the proposed method achieves up to a 50% floating-point operations (FLOPs) pruning ratio for visual geometry group (VGG)-16 and up to a 47.62% FLOPs pruning ratio for residual neural network (ResNet)-50 while maintaining high accuracy. This indicates that the suggested technique can reduce the computational cost of these models by up to 50% while maintaining their accuracy. It also achieved an overall accuracy of 92.50% with a pruning ratio of 60%. The effectiveness of the proposed methodology is largely unaffected by the training ratio, meaning the method is robust and can achieve high pruning ratios regardless of the training data used. On the NWPU-RESISC45 dataset, the proposed method prunes up to 50.68% of FLOPs for VGG-16 and up to 44.98% of FLOPs for ResNet-50 while maintaining high accuracy. It also achieved an overall accuracy of 93.50% with a pruning ratio of 70%.

6.1.3 Compression based on pruning and quantification

In this case [123], a novel DL model optimization technique is proposed, focused on the use of filter-stripe combination pruning and data quantification. The proposed technique achieves a high compression ratio while maintaining model accuracy and is also suitable for mobile and embedded devices. The authors conducted experiments on two DL models, VGG-16 and ResNet-56, using the Canadian institute for advanced research (CIFAR)-10 dataset. They trained the original models to convergence and then applied the proposed pruning and data quantification techniques to obtain a fully compressed model. The observations suggested that the proposed model performs better than existing DL compression techniques in terms of performance and compression percentage. The performance of the resized models is very similar to that of the originals, while the compression ratio is significantly improved. The inference speed, memory utilization, and energy efficiency of the compressed models are also improved compared to the original models. Based on the research results, on ResNet-56 this technique can reduce the number of parameters by 4:1 and the amount of computation by 5:1, with a loss of model performance of only 0.01%. On VGG-16, the number of parameters is reduced by 14:1, the amount of computation is scaled down by 3:1, and the accuracy loss is 0.5%.

6.1.4 Facial expression recognition

In this case [218], the effects of model compression methods on the performance and fairness of facial expression recognition models are investigated. The research encompasses three compression tactics - pruning, weight clustering, and post-training quantization - and examines their combinations, specifically pruning with quantization and weight clustering with quantization. The findings indicate that model size can be substantially reduced through compression and quantization without sacrificing accuracy. However, these processes might negatively affect the fairness of the algorithms. Additionally, the study conducts a comparative analysis of the baseline models versus the compressed models and delves into three research questions concerning the efficacy and fairness of model compression methods.

The study used two datasets, the extended Cohn-Kanade (CK+) dataset and the real-world affective faces database (RAF-DB). The assessment focused on three key aspects: model size, accuracy, and fairness. The study found that compression and quantization can significantly reduce model size without compromising accuracy, but may adversely affect algorithmic fairness. The baseline model reached an overall accuracy of 67.96% on the CK+ dataset, while the baseline RAF-DB classifier attained an overall test accuracy of 82.46%. The compressed MobileNetV2 model attained a higher accuracy than the full model on the CK+ dataset, while the compressed ShuffleNetV2 model achieved a slightly lower accuracy than the full model. However, the study also found that model compression and quantization can adversely impact algorithmic fairness, particularly in terms of gender and race accuracy. The compressed models showed a larger discrepancy between male and female accuracy metrics compared to the full models, indicating potential algorithmic bias. Similarly, the compressed models showed a larger discrepancy between accuracy metrics for different race groups compared to the full models.

6.1.5 Detecting stress on a person’s health through 2D images

In this case [219], the adverse effects of stress on health and its high prevalence in American society are addressed. A new algorithm leveraging ML techniques is introduced to detect stress from 2D electrocardiogram (ECG) images, bypassing the need for intricate feature extraction. Pruning, quantization, and knowledge distillation are identified as effective techniques for compressing the model.

The VGG-16 algorithm was optimized to enhance its learning rate and overall performance. Additionally, the efficacy of various other algorithms, including VGG-19, InceptionV3, ResNet-50, and DenseNet-169, in stress prediction was evaluated. The research methodology included leave-one-out cross-validation (LOOCV) and 10-fold cross-validation. Findings showed that frequency domain images exhibited greater complexity and variability compared to spatial images, which, despite their reduced variation, were simpler and more adaptable.

Through model pruning, the total trainable parameters were reduced from about 55 MM to less than 1 MM, which resulted in a processing time of 148 ms/step and accuracy of 86.25%. The computation cost was reduced by 4 times. Additionally, quantization was applied to lower the precision of the model’s weights and activations. This approach achieved a classification accuracy of 90.62% in the stress detection approach. Knowledge distillation outperformed others in terms of the balance between performance and processing power, attaining an accuracy of 88.75% with a loss of 0.0066 and a processing speed of 65 ms/step.

6.1.6 Model compression in medical image analysis

Research has explored the effects of image compression on the classification performance of DL models for medical images, such as mammograms. The findings indicate that model compression can be applied effectively in medical imaging without compromising the accuracy of diagnosis, which is paramount in clinical settings [82]. A novel approach to medical image compression involves the use of variational autoencoders combined with ResNet. This method addresses common issues in CNN training and aims to optimize the balance between image quality and compression rate, thus preserving the critical details necessary for accurate medical diagnosis [83, 220]. Further studies have introduced model compression techniques to enhance the efficiency of DNNs in medical image analysis. These techniques streamline the model architecture, reducing its complexity while maintaining diagnostic accuracy. Such advancements facilitate quicker and more resource-efficient analysis, which is essential for real-time medical decision-making [84]. In the context of medical imaging, transfer learning has been leveraged to adapt existing models to specific medical tasks efficiently. This approach allows for the optimization of models for precise diagnostic analysis without sacrificing accuracy, demonstrating the potential of model compression in enhancing the applicability and performance of DL in medical image recognition [85].

In conclusion, model compression in medical image analysis is a growing field that addresses the need for high-performing, efficient diagnostic tools in healthcare. Through techniques like image compression, variational autoencoding, efficiency optimization, and transfer learning, researchers are able to refine DL models to operate effectively in the demanding environment of medical diagnostics, ensuring that critical healthcare applications benefit from the advancements in AI and ML.

6.2 Model compression in speech recognition

In the context of speech recognition, a variety of model compression techniques have been employed to enhance processing efficiency. These include pruning, quantization, knowledge distillation, low-rank factorization, and transfer learning. These methods aim to reduce model size and computational complexity while retaining the accuracy necessary for reliable speech recognition [59]. Lossy compression techniques, particularly for CNN-based GANs in speech recognition, have shown promise in reducing numerical precision and encoding, thus minimizing model size and computation requirements. Such approaches facilitate the deployment of efficient speech recognition systems that can operate effectively in real-time environments [60]. Further exploration of model compression in speech recognition has led to the development of methods that encompass pruning, quantization, knowledge distillation, low-rank factorization, and transfer learning. These comprehensive strategies are designed to enhance the performance of speech recognition models, ensuring high accuracy and efficiency on platforms with constrained computational capabilities [61]. Innovative methods like self-distillation have been introduced to improve model efficiency in speech recognition. This technique allows high-accuracy models to be obtained directly without the need for an assistive large-scale model, thus simplifying the training process and enhancing model performance. Self-distillation represents a significant advance in model compression, enabling the deployment of highly efficient speech recognition systems [62].

In conclusion, model compression has emerged as a crucial technology in the advancement of speech recognition systems, enabling the development of efficient and accurate models suitable for deployment on resource-constrained devices. Through techniques such as pruning, quantization, knowledge distillation, low-rank factorization, and self-distillation, researchers have been able to optimize speech recognition models to meet the demands of real-time processing and limited computational capacity.

6.3 Model compression in natural language processing (NLP)

In NLP, model compression encompasses a range of techniques including pruning, quantization, knowledge distillation, low-rank factorization, and transfer learning. These methods are employed to manage the extensive computational requirements of NLP models without significantly impacting their performance. A study discusses various model compression strategies in NLP, highlighting their effectiveness in maintaining high accuracy while reducing computational load [63]. Compression of word embeddings using low-rank matrix decomposition and knowledge distillation presents a significant advancement in NLP model efficiency. This approach not only reduces the size of the model but also retains the semantic richness of the embeddings, which is essential for tasks like translation and sentiment analysis. The technique demonstrates that it is possible to achieve substantial compression without compromising the quality of language representations [64]. The slow inference speed and high computational demands of pre-trained deep models, such as BERT, pose significant challenges in NLP. Knowledge distillation has been proposed as a solution to compress these models effectively, thereby enhancing their practicality for real-time applications. This method allows for the retention of essential language understanding capabilities while significantly reducing the model’s size and computational requirements [65].

In summary, model compression in NLP is a dynamic field that addresses the critical need for efficient and effective language processing models. Through various compression techniques, researchers have made significant strides in optimizing NLP models for improved deployment on devices with limited computational resources, ensuring that advanced language processing capabilities remain accessible and practical for a wide range of applications.

6.3.1 Compressing sparse pre-trained language models (PLM)

In this particular instance [110], a novel approach for implementing sparse PLM training is presented. The method uses weight pruning and knowledge distillation to create pre-trained models that can be used in downstream tasks with minimal accuracy loss. The method is applied to BERT-base, BERT-large and DistilBERT, and the sparse models are fine-tuned on downstream tasks such as the Stanford question answering dataset (SQuAD) and the general language understanding evaluation (GLUE) benchmark. The results indicate that the compressed sparse pre-trained models achieve SOTA compression-to-accuracy ratios and can even be further compressed to 8-bit precision using quantization-aware training.

The experimental setup involved using the English Wikipedia dataset, which contains 2500 MM words. The data was divided into two groups: train (95%) and validation (5%) sets. The pre-trained models were evaluated on a range of benchmarks for transfer learning, including SQuAD and text classification tasks from the GLUE benchmark – multi-genre natural language inference (MNLI), Quora question pairs (QQP), question-answering natural language inference (QNLI) and Stanford sentiment treebank (SST-2).

The method performed better than others when trained at higher levels of sparsity. When comparing the presented approach with other approaches, it was found to yield superior results at 85% and 95% sparsity ratios. Moreover, the accuracy loss is less than 3% when the compressed sparse models are compared with their dense counterparts. It was also demonstrated that the models can be further compressed using quantization-aware training in order to achieve SOTA results in terms of compression-to-accuracy ratio.

6.3.2 Knowledge distillation for bidirectional encoder representations from transformers (BERT)

In this case [221], a new approach called patient knowledge distillation for compressing large PLMs like BERT into equally effective lightweight models is proposed. It uses two patient distillation schemes to enable the exploitation of relevant information in the large-scale model’s hidden layers. It also encourages the small-scale model to gain knowledge from and mimic the large-scale model through a multi-layer distillation process.

The research methodology focused on applying the proposed method to four different NLP tasks: sentiment analysis, paraphrase similarity matching, natural language inference, and machine reading comprehension. The datasets used for these tasks were SST-2, QQP, MNLI, QNLI, the Microsoft research paraphrase corpus (MRPC) and recognizing textual entailment (RTE). The research focused on comparing the effects of the suggested approach with standard knowledge distillation techniques. The preliminary findings suggest that the presented approach results in superior efficiency and better predictive power than traditional knowledge distillation methodologies, with notable gains in training efficiency and space reduction, while still maintaining comparable model performance to the original model. The method achieved SOTA results on various benchmark data sources and reduced the number of required parameters by up to 90%.

The compressed models achieved up to 4.3\(\times \) speed up in inference time and up to 4.7\(\times \) reduction in model size while maintaining similar accuracy to the original model. In general, the tests and results show that using compression techniques can make LLMs easier to use in practice by reducing the number of redundant parameters. The authors concluded that the selection of hyperparameters had a critical effect on the performance of the approach, and that careful tuning was necessary to achieve optimal results.

6.4 Model compression in autonomous vehicles

In autonomous vehicle applications, real-time inference is essential for timely decision-making and response. A study introduced a compiler-aware neural pruning search framework, optimizing 2D and 3D object detection. This approach enabled real-time inference speeds with minimal accuracy loss on mobile devices, illustrating the effectiveness of model compression in maintaining performance while reducing computational demands [73]. Another critical aspect in autonomous vehicles is the optimization of energy consumption while adhering to real-time latency constraints. Research in this area has proposed strategies to achieve a balance between edge and cloud computing for autonomous systems, highlighting the importance of model compression in enhancing energy efficiency and reducing latency for real-time processing [74]. For vehicle-to-vehicle (V2V) communication, lightweight CNN designs inspired by MobileNet and enhanced through knowledge distillation have proven effective. These models facilitate automatic scenario recognition, demonstrating that compressed models can achieve high performance while being suitable for the stringent computational limitations of autonomous vehicles [75]. Addressing the challenge of limited communication bandwidth in connected vehicles, ternary quantization-based model compression methods have been explored. These methods aim to reduce the model parameter size, demonstrating that effective model compression can lead to more efficient data transmission and processing in the network of autonomous vehicles [76].

In conclusion, model compression in autonomous vehicles is essential for achieving the necessary balance between performance and computational efficiency. Through techniques like neural pruning, energy optimization, lightweight CNN designs, and quantization, researchers are able to develop advanced systems that meet the real-time, energy-efficient, and accurate processing requirements crucial for the safety and functionality of autonomous vehicles.

6.5 Model compression in recommender systems

Self-distillation techniques have been utilized to reduce model size and computational demands, which is particularly beneficial for recommender systems that must process large volumes of data swiftly to generate timely recommendations. Such approaches enable the creation of efficient and compact models that maintain or even improve recommendation quality [62]. The development of compressed frameworks for sequential recommender systems addresses the challenge of deploying these systems on resource-constrained devices. Research in this area has shown that it is possible to maintain or improve accuracy compared to uncompressed models, thus demonstrating the effectiveness of model compression in enhancing the scalability and responsiveness of recommender systems [80]. Matrix factorization algorithms, a core component of many recommender systems, have been optimized through model compression to improve efficiency, speed, and simplicity. These advancements allow recommender systems to deliver high-quality recommendations more efficiently, reducing the computational load and enabling smoother operation on platforms with limited processing power [81].

In summary, model compression in recommender systems is essential for handling the vast data volumes and real-time processing requirements inherent in personalized recommendation tasks. Through innovative techniques like self-distillation, compressed frameworks, and efficient matrix factorization, researchers are able to significantly improve the performance and efficiency of recommender systems, ensuring they can operate effectively even in resource-limited environments.

6.6 Model compression in Internet of Things (IoT) and non-application specific domains

In the IoT domain, model compression strategies are tailored to meet the unique requirements of resource-constrained devices. One study introduces a model compression approach in IoT that optimizes for low-end devices by combining quantization and pruning, significantly reducing computational demands and enabling efficient deployment [86]. Another study presents a CNN model compression framework tailored for intelligent inspection within power IoT systems, focusing on enhancing performance through pruning and quantization [87]. The demand for low-power solutions in IoT has led to the development of specialized accelerators for CNNs, aimed at improving inference capabilities on low-end devices. These solutions incorporate hybrid quantization schemes and binary activation functions, using SVM techniques to streamline model execution while conserving energy [88]. Beyond IoT-specific considerations, model compression addresses broader challenges across various domains. A significant concern is the computational and energy efficiency constraints faced by embedded general-purpose processors in handling advanced NN-based ML techniques. Research in this area highlights the importance of developing effective compression strategies, such as those combining pruning and quantization, to make CNN inferences more efficient on resource-constrained devices [70].

In summary, model compression in IoT and non-application-specific domains focuses on creating efficient, compact models suitable for deployment in environments with limited computational and energy resources. By implementing tailored compression techniques like pruning, quantization, and specialized accelerators, researchers are able to significantly improve the operational efficiency of models across a wide range of devices and platforms, demonstrating the versatility and necessity of model compression in the expanding landscape of smart devices and general-purpose computing.

6.6.1 Memory- and communication-aware compression for the Internet of Things (IoT)

In [208], a transfer-learning-oriented model compression technique called network of neural networks (NoNN) is introduced; it yields higher performance than other baselines and accuracy similar to that of the associated large-scale model, while requiring less communication among the small-scale models.

The experimental setup involved compressing various DNNs for five image classification tasks: CIFAR-10, CIFAR-100, Scene, Caltech-UCSD Birds (CUB), and ImageNet. The Scene and CUB datasets belong to the transfer learning domain, where a pre-trained NN is fine-tuned on a particular problem. NoNN was also compared, in terms of parameter counts, against standard knowledge distillation and attention transfer with knowledge distillation (ATKD) models, and the authors report that it outperformed earlier approaches such as SplitNet.

The results indicated that NoNN achieved accuracy close to that of the large-scale model with significantly lower memory (\(2.5\!-\!24\times \) gain) and computation (\(2\!-\!15\times \) fewer FLOPs). They also reported that NoNN outperformed the other baselines, despite there being no communication among the small-scale models until the final layer. The proposed NoNN compresses a pre-trained large-scale model into many disjoint and highly compressed small-scale modules without loss of performance, which enables faster and more cost-effective computation on IoT devices.

For instance, NoNN reports an overall accuracy of 91.5% on CIFAR-10, higher than that of baselines such as MobileNet and ShuffleNet. NoNN also achieves an average inference speed of 0.5 ms per image, faster than baselines such as MobileNet and ResNet-18. Additionally, it requires an average of 430 kB of memory per student, which is within the memory budget of most IoT devices, and consumes an average of 0.5 mJ per image, lower than baselines such as MobileNet and ResNet-18.

6.7 Summary of model compression applications

Table 4 provides a concise summary of how model compression techniques are applied across various vertical applications, highlighting the unique challenges and requirements in each case.

Table 4 Summary of model compression case studies and their application peculiarities

7 Innovations in model compression and performance enhancement

The rapid growth in model size and complexity in ML has prompted significant advancements in model compression and performance enhancement techniques. These innovations are crucial for reducing computational demands and easing pressure on limited GPU memory, thereby facilitating the integration of ML with HPC.

  • GPU memory expansion and computational demands: with the increasing complexity of DL models, GPU memory limitations have become a significant bottleneck. Techniques like memory swapping and tensor rematerialization are employed to alleviate this issue, enabling larger models to fit within the limited GPU memory. Advanced software libraries, such as NVIDIA’s Apex and PyTorch’s memory management utilities, play a pivotal role in optimizing memory usage (a rematerialization sketch follows this list) [222,223,224,225].

  • Integration of ML with HPC: integrating ML with HPC environments leverages the massive parallel processing capabilities of supercomputers, enhancing model training efficiency. Frameworks like TensorFlow and PyTorch now support distributed training across multiple nodes, significantly speeding up the training process for large-scale models [226,227,228,229].

  • Advanced techniques (DeepSpeed, ZeRO-Infinity): DeepSpeed, an open-source DL optimization library, introduces ZeRO to reduce memory redundancy by partitioning model states across data-parallel processes. This approach enables training of models with up to a trillion parameters, significantly enhancing computational efficiency and model scalability. The ZeRO-Infinity extension further optimizes memory usage, making it feasible to train even larger models [230, 231].

  • Advanced pruning techniques: pruning techniques aim to remove redundant or less critical neurons and connections in NNs, thereby reducing model size and computational load without compromising performance. Structured pruning and the lottery ticket hypothesis are notable advancements, offering systematic approaches to identify and eliminate unnecessary network components (a magnitude-pruning sketch follows this list) [232,233,234,235].

  • Innovative quantization methods: quantization reduces the precision of model parameters from floating-point to lower bit-width representations (e.g., 8-bit integers), significantly decreasing memory and computational requirements. Techniques such as post-training quantization and quantization-aware training ensure that performance degradation is minimal while achieving substantial efficiency gains (a post-training quantization sketch follows this list) [236,237,238].

  • Enhanced low-rank factorization approaches: low-rank factorization decomposes weight matrices into products of smaller matrices, effectively reducing the number of parameters and computational complexity. This method maintains model accuracy while enabling more efficient inference and training processes. Applications in transformer models and CNNs have demonstrated significant improvements in performance (a truncated-SVD sketch follows this list) [239,240,241].

  • Advanced knowledge distillation techniques: knowledge distillation transfers the knowledge from a large, complex model (teacher) to a smaller, more efficient model (student). Advanced techniques focus on optimizing the distillation process to maximize the transfer of knowledge, resulting in student models that achieve comparable performance with reduced computational demands (a distillation-loss sketch follows this list) [242,243,244,245,246].

  • Hybrid compression methods: combining multiple compression techniques, such as pruning, quantization, and low-rank factorization, creates hybrid methods that leverage the strengths of each approach. These methods offer superior compression rates and performance enhancements, making them highly effective for deploying large-scale models on resource-constrained devices (a combined pruning-and-quantization sketch follows this list) [247,248,249,250,251,252].
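
To make the tensor rematerialization idea concrete, the following minimal PyTorch sketch checkpoints one block of a toy model so that its intermediate activations are recomputed during the backward pass rather than stored; the layer sizes and single-block layout are illustrative assumptions, not a configuration taken from the cited works.

```python
# Minimal sketch of tensor rematerialization (activation checkpointing) in
# PyTorch: activations inside the checkpointed block are not stored during the
# forward pass and are recomputed during backpropagation, trading extra
# compute for lower memory. Layer sizes are illustrative.
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024))
head = nn.Linear(1024, 10)

x = torch.randn(16, 1024, requires_grad=True)
hidden = checkpoint(block, x, use_reentrant=False)  # recomputed in backward
loss = head(hidden).sum()
loss.backward()
```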
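
The pruning bullet above can be illustrated with PyTorch's built-in torch.nn.utils.prune utilities; the toy architecture and the 30% sparsity target below are arbitrary illustrative choices, not values prescribed by the cited studies.

```python
# Minimal sketch of unstructured magnitude pruning with torch.nn.utils.prune.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

for module in model.modules():
    if isinstance(module, nn.Linear):
        # Zero out the 30% smallest-magnitude weights of this layer.
        prune.l1_unstructured(module, name="weight", amount=0.3)
        prune.remove(module, "weight")  # make the pruning permanent

sparsity = (model[0].weight == 0).float().mean().item()
print(f"Sparsity of first layer: {sparsity:.2%}")
```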
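
Post-training dynamic quantization can likewise be sketched in a few lines; the toy model below stands in for a trained network, and the conversion quantizes only the Linear layers' weights to 8-bit integers.

```python
# Minimal sketch of post-training dynamic quantization in PyTorch: the weights
# of Linear layers are stored as 8-bit integers, shrinking weight storage
# roughly fourfold relative to 32-bit floats while keeping float inputs.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 784)
print(quantized(x).shape)  # same interface, smaller weight storage
```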
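
For low-rank factorization, a minimal sketch replaces the weight matrix of a single linear layer with the product of two thinner matrices obtained from a truncated SVD; the helper function factorize_linear and the rank of 32 are illustrative assumptions rather than a standard API.

```python
# Minimal sketch of low-rank factorization: the weight matrix W (out x in) of
# a Linear layer is approximated by two thinner matrices from a truncated SVD.
import torch
import torch.nn as nn

def factorize_linear(layer: nn.Linear, rank: int) -> nn.Sequential:
    W = layer.weight.data          # shape: (out_features, in_features)
    U, S, Vh = torch.linalg.svd(W, full_matrices=False)
    U_r = U[:, :rank] * S[:rank]   # (out, r), singular values folded into U
    V_r = Vh[:rank, :]             # (r, in)

    first = nn.Linear(layer.in_features, rank, bias=False)
    second = nn.Linear(rank, layer.out_features, bias=layer.bias is not None)
    first.weight.data = V_r
    second.weight.data = U_r
    if layer.bias is not None:
        second.bias.data = layer.bias.data
    return nn.Sequential(first, second)

layer = nn.Linear(512, 512)                    # 512*512 = 262,144 weights
compressed = factorize_linear(layer, rank=32)  # 2*32*512 = 32,768 weights
x = torch.randn(4, 512)
print((layer(x) - compressed(x)).abs().max())  # approximation error
```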
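
A common formulation of knowledge distillation combines a softened KL-divergence term against the teacher's logits with a hard-label cross-entropy term; the sketch below uses illustrative temperature and weighting hyperparameters and random logits in place of real model outputs.

```python
# Minimal sketch of a knowledge-distillation loss: a weighted sum of the
# KL divergence to the teacher's temperature-softened distribution and the
# standard cross-entropy on ground-truth labels.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      T: float = 4.0, alpha: float = 0.7):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                       # rescale gradients for the temperature
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Example with random logits for a 10-class problem.
student_logits = torch.randn(8, 10, requires_grad=True)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = distillation_loss(student_logits, teacher_logits, labels)
loss.backward()
```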
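
Finally, a simple hybrid sketch chains two of the ideas above, magnitude pruning followed by post-training dynamic quantization; the 50% sparsity level is again an arbitrary illustrative choice.

```python
# Minimal sketch of a hybrid pipeline: prune the Linear layers, bake the
# sparsity into the weights, then apply dynamic quantization.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))

for m in model.modules():
    if isinstance(m, nn.Linear):
        prune.l1_unstructured(m, name="weight", amount=0.5)
        prune.remove(m, "weight")  # make the sparsity permanent

model.eval()
hybrid = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
print(hybrid(torch.randn(1, 784)).shape)
```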

These innovations in model compression and performance enhancement are pivotal in addressing the challenges posed by the growing complexity of ML models. They enable more efficient use of computational resources, facilitating the deployment of sophisticated models in real-world applications.

8 Challenges, strategies, and future directions

8.1 Computational overhead and suitability

Compressing ML models often requires additional computational resources, particularly when gradient-based optimization is used during compression. Although these techniques introduce extra computational overhead at compression time, the efficiency gains they deliver at inference time make the trade-off reasonable for most applications. Techniques like pruning and quantization reduce model size, significantly lowering the computational burden during inference and leading to faster predictions and reduced memory requirements. This optimization is crucial for deploying models in resource-constrained environments [19]. By focusing on essential features and reducing redundancy, pruning and knowledge distillation help prevent overfitting and enhance generalization to unseen data, ultimately improving real-world performance [111].

While compressed models may require more computational resources during training, the long-term benefits of faster inference, reduced memory footprint, and enhanced performance outweigh the initial costs, making the investment in extra computation justified [69]. For instance, pruning works by eliminating unnecessary neurons and connections in NNs, which reduces the overall complexity of the model without significantly impacting its accuracy. An example of this is Google’s use of pruning in its neural machine translation system, which resulted in a 92% reduction in model size while maintaining translation quality [253]. Similarly, quantization converts the weights and activations of a NN from higher precision (such as 32-bit floats) to lower precision (such as 8-bit integers), which can drastically reduce the memory footprint and speed up inference times. This approach was successfully applied in the MobileNet architecture, making it highly efficient for mobile and embedded applications [217]. Knowledge distillation not only reduces the size of the model but also often results in a model that performs better on specific tasks due to the distilled knowledge from the teacher. An example of this is the use of knowledge distillation in training compact BERT models, which maintain high accuracy while being significantly smaller and faster than the original model [9]. These methods collectively ensure that the models remain robust and effective even when deployed on devices with limited computational capabilities, such as mobile phones and edge devices.

Industries where speed, memory efficiency, and accuracy are critical, such as healthcare, finance, and autonomous driving, greatly benefit from the deployment of these efficient and optimized ML models. In healthcare, for example, faster inference times can lead to quicker diagnostics and treatment decisions, which are vital for patient care. A notable example is the use of compressed ML models in real-time magnetic resonance imaging (MRI) reconstruction, reducing scan times from minutes to seconds. In finance, real-time analysis and predictions can improve trading strategies and risk management. High-frequency trading algorithms often rely on compressed models to process vast amounts of data rapidly and make split-second decisions. Autonomous driving systems require real-time data processing to ensure safety and accuracy in navigation and obstacle detection, where models like you only look once (YOLO) have been pruned and quantized to run efficiently on automotive-grade hardware [7].

Moreover, the initial computational overhead incurred during the training phase can often be mitigated by leveraging advanced hardware accelerators such as GPUs and TPUs, which are designed to handle large-scale computations more efficiently. This makes the overall process of compressing models not only feasible but also cost-effective in the long run [6]. Consequently, the extra computational resources required for model compression during training are a worthwhile investment for the significant performance improvements they bring during deployment.

8.2 Over-pruning and regularization techniques

Over-pruning represents a critical challenge, often leading to compromised model performance and generalization capabilities. This phenomenon occurs when an excessive number of parameters are eliminated during the compression process, which, although beneficial for reducing the model’s size and computational demands, can inadvertently strip away vital information necessary for accurate predictions. The balance between model size reduction and performance retention is therefore a pivotal concern in the development and optimization of model compression techniques. Mitigating the risks associated with over-pruning necessitates a well-planned and informed strategy for the implementation of model compression.

Over-pruning, characterized by the excessive removal of model parameters, can severely degrade the model’s performance and its ability to generalize. To counteract these adverse effects, sophisticated methodologies have been developed and refined within the ML community. One such approach is iterative pruning, a process that methodically eliminates less critical parameters over multiple cycles, interspersed with phases of retraining to restore and enhance model performance. The iterative nature of this technique allows for a more gradual reduction in model size, minimizing the risk of removing essential information; this approach has been shown to achieve substantial reductions in model size while maintaining, and sometimes even improving, model accuracy [254]. An example of iterative pruning’s effectiveness can be seen in the ResNet architecture. Researchers applied iterative pruning to ResNet-50, achieving a 90% reduction in parameters with only a 1% drop in accuracy on the ImageNet dataset. This highlights the potential of iterative pruning to maintain high performance even with significant compression [255].
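
To make the prune-retrain cycle explicit, the following minimal PyTorch sketch alternates magnitude pruning of the linear layers with a fine-tuning pass; the synthetic data, number of rounds, learning rate, and per-round pruning fraction are placeholders that would be tuned in practice.

```python
# Minimal sketch of iterative magnitude pruning with retraining between rounds.
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

def train_one_epoch(model, data, optimizer, loss_fn):
    for x, y in data:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()

def iterative_prune(model, data, rounds=3, amount_per_round=0.2):
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(rounds):
        # Prune a fraction of the smallest-magnitude weights in each layer ...
        for m in model.modules():
            if isinstance(m, nn.Linear):
                prune.l1_unstructured(m, name="weight", amount=amount_per_round)
        # ... then fine-tune to recover the accuracy lost to pruning.
        optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
        train_one_epoch(model, data, optimizer, loss_fn)
    return model

# Synthetic data for demonstration purposes only.
data = [(torch.randn(32, 100), torch.randint(0, 10, (32,))) for _ in range(10)]
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 10))
iterative_prune(model, data)
```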

Regularization techniques serve as another critical tool in model compression. These methods introduce additional constraints into the training process, often in the form of penalties on the magnitude of parameters, to encourage the model to maintain only those weights that are genuinely influential in determining the output. L1 and L2 regularization are prominent examples of such techniques, where L1 promotes sparsity in the model weights, thereby facilitating their subsequent removal during compression. Structured sparsity learning elucidates how regularization can be effectively employed to enhance the compressibility of NNs, paving the way for more efficient and compact models [256].
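
The following sketch shows one way to add an L1 penalty to the training objective so that weights are driven toward zero and become easier to prune later; the model, synthetic batch, and penalty coefficient are illustrative.

```python
# Minimal sketch of a single training step with an L1 penalty on the weights,
# encouraging sparsity prior to pruning.
import torch
import torch.nn as nn

model = nn.Linear(100, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
l1_lambda = 1e-4  # illustrative penalty coefficient

x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))

optimizer.zero_grad()
task_loss = loss_fn(model(x), y)
l1_penalty = sum(p.abs().sum() for p in model.parameters())
(task_loss + l1_lambda * l1_penalty).backward()
optimizer.step()
```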

Moreover, recent advancements in model compression have led to the development of more sophisticated approaches, such as the utilization of sparsity-inducing norms and sparsity-aware algorithms. These methods aim not only to reduce the model’s size but also to retain the model’s capacity for accurate predictions on unseen data [257, 258]. Pruning CNNs using Taylor expansion provides insights into how sparsity-aware techniques can be employed to identify and eliminate redundant parameters with minimal impact on model performance [259]. An example of this is the use of Taylor expansion-based pruning in VGG-16, where the model’s size was reduced by over 80% while maintaining accuracy within 1% of the original model. This approach demonstrated that even complex architectures could be significantly compressed without substantial performance degradation.
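
As a rough illustration of the first-order Taylor criterion, the sketch below scores each weight by |w · ∂L/∂w|, i.e., the estimated change in loss from setting that weight to zero, and zeroes out the lowest-scoring 80% of weights (mirroring the VGG-16 example above); the toy linear model and single-batch gradient are simplifications of the channel-level, multi-batch ranking used in the cited work.

```python
# Rough sketch of first-order Taylor importance scoring for weights:
# importance(w) ~ |w * dL/dw|, then the least important weights are zeroed.
import torch
import torch.nn as nn

model = nn.Linear(100, 10)
x, y = torch.randn(32, 100), torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()

w = model.weight
importance = (w * w.grad).abs()          # first-order Taylor criterion
k = int(0.8 * importance.numel())
threshold = importance.flatten().kthvalue(k).values
mask = (importance > threshold).float()  # keep roughly the top 20% of weights
with torch.no_grad():
    w.mul_(mask)
```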

8.3 Trade-offs and impact on model architecture

The primary goal of model compression is to reduce computational demands while retaining performance. Our expanded discussion contrasts various methods, such as pruning, quantization, and knowledge distillation, by examining their impact on performance retention, computational efficiency, and their suitability for different ML tasks and scenarios. Each of these methods offers a pathway to reduce the computational footprint of models, yet they must be applied judiciously to avoid compromising the integrity and efficacy of the models. In practice, the process of model compression involves trade-offs.

For instance, while pruning and quantization effectively reduce model size and accelerate inference times, they may introduce artifacts or errors that could degrade model performance [19, 59, 116]. Similarly, low-rank factorization aims to streamline models by reducing redundancy in weight matrices, but if applied excessively, it might eliminate essential features, leading to poor performance on complex tasks. Knowledge distillation and transfer learning present innovative ways to leverage existing models to train more compact and efficient versions, yet they depend heavily on the quality and relevance of the original models and the alignment between tasks [130, 136].

The intricacies of model compression also extend to the interaction between compression techniques and model architecture. Certain architectures may be more amenable to specific compression methods, with variations in their susceptibility to information loss or performance degradation post-compression. Therefore, a comprehensive understanding of both the model’s structural nuances and the operational principles of compression methods is imperative for successful model compression [70, 87, 116, 186]. Furthermore, the dynamic nature of technological advancement and the evolving landscape of ML applications necessitate continuous research and adaptation of model compression strategies. Innovations in hardware and software, along with advancements in ML algorithms, constantly reshape the boundaries and possibilities of model compression [69, 100, 173].

8.4 Future works and recommendations

In the realm of future works and directions within model compression, several pivotal recommendations emerge. A primary focus should be on the evolution of more sophisticated compression algorithms, which can dynamically modulate efficiency and performance tailored to the diverse needs of applications. This approach underscores the significance of enhancing quantization-aware training and knowledge distillation methods, aimed at refining model compression capabilities without forfeiting crucial data, thus safeguarding model integrity across varying compression extents [207,208,209, 260]. Moreover, the development of advanced pruning methodologies that accurately pinpoint and eliminate superfluous or non-critical model components, without undermining the model’s decision-making proficiency, is anticipated to substantially advance the domain [90, 110]. The exploration into the synergistic effects of diverse compression strategies could unveil novel pathways to strike an optimal balance between model size, speed, and accuracy, fostering innovations in model efficiency [56, 60]. Tackling the performance variability across different tasks and environments remains a crucial endeavor, highlighting the necessity for adaptive models that can fine-tune their compression strategies in alignment with the deployment context.

Regarding emerging areas, the integration of model compression techniques across various industrial applications can lead to significant improvements in efficiency, speed, and accuracy. Whether through the deployment of digital twins, advanced NNs, data-driven health management systems, predictive maintenance strategies, or reinforcement learning (RL) for supply chain optimization, the benefits of model compression are evident. By reducing computational demands and enhancing model performance, these techniques pave the way for more effective and scalable intelligent systems in industrial settings. Model compression techniques can significantly enhance the performance of digital twins by reducing the computational burden required for real-time analysis. This is particularly important for resource-constrained environments where computational power is limited. Compressed models can facilitate faster data processing and more efficient anomaly detection, ultimately leading to more timely and accurate maintenance decisions [261, 262]. Model compression can further improve PIResNet’s efficiency by reducing the computational complexity without compromising the model’s diagnostic capabilities. By pruning redundant parameters and employing quantization techniques, compressed PIResNet models can provide rapid and reliable fault detection, making them more suitable for real-time industrial applications [263, 264]. Data-driven approaches to bearing health management rely on extensive data collection and analysis to predict bearing failures. ML models, particularly those that are heavily parameterized, can benefit from compression techniques to manage large datasets effectively [265, 266]. Compressed models can process data more efficiently, leading to faster and more accurate health assessments. This is crucial for industries where downtime due to bearing failures can be costly. By enhancing the speed and accuracy of data-driven health management systems, model compression contributes to more reliable and cost-effective maintenance strategies.

The application of model compression to predictive maintenance in a smart manufacturing context can significantly reduce the computational resources required for training and inference [267, 268]. Techniques such as pruning and quantization can make DL models more efficient, enabling their deployment on edge devices with limited processing power. This can lead to more widespread adoption of predictive maintenance solutions, enhancing operational efficiency and reducing maintenance costs across the manufacturing sector. RL has shown great promise in optimizing supply chain operations by learning optimal policies through interaction with the environment [269, 270]. However, RL models can be computationally intensive and require substantial resources for training. Model compression techniques can alleviate this by streamlining the RL models, making them more suitable for real-time decision-making. Compressed RL models can operate more efficiently, enabling faster responses to dynamic supply chain conditions and improving overall supply chain performance.

Future research would need to pivot towards the development of hybrid compression techniques that synergistically combine the strengths of existing methods to achieve superior compression rates without compromising model performance. Additionally, there is a pressing need for more robust frameworks that can automatically select and apply the most appropriate compression technique based on the model’s characteristics and the computational environment. The exploration of model compression techniques presented in this review sets the stage for several key areas of future research and development in the field of ML and AI. Building upon the insights gained from the current state of model compression, the following recommendations and directions can guide future works:

Sophisticated compression algorithms: there is a growing need to prioritize the development of more advanced compression algorithms that can dynamically balance efficiency and performance based on the specific requirements of different applications. By focusing on creating algorithms that can adapt to varying computational environments and application scenarios, researchers can enhance the versatility and effectiveness of model compression techniques.

Hybrid compression techniques: future research should explore the potential of hybrid compression techniques that combine the strengths of existing methods to achieve superior compression rates without compromising model performance. By integrating multiple compression approaches in a synergistic manner, researchers can push the boundaries of model efficiency and effectiveness, paving the way for more optimized AI systems.

Automated compression frameworks: there is a pressing need for the development of robust frameworks that can automatically select and apply the most suitable compression technique based on the characteristics of the model and the computational environment. By creating automated systems that can intelligently adapt compression strategies to specific contexts, researchers can streamline the model compression process and enhance its scalability and applicability.

Enhanced real-world applications: future works should focus on expanding the application of model compression techniques to a wider range of real-world scenarios and domains. By exploring how compression methods can be tailored to specific industries and use cases, researchers can demonstrate the practical value and impact of efficient AI deployment in diverse settings.

Ethical considerations and transparency: as model compression techniques continue to evolve, it is essential for researchers to prioritize ethical considerations and transparency in their work. Future studies should emphasize the ethical implications of model compression, ensuring that AI systems developed through these techniques uphold principles of fairness, accountability, and transparency.

9 Discussion

To actualize these advancements, ensuing research must concentrate on formulating algorithms that can adeptly compress models by recognizing their distinct traits and the specific demands of their respective applications. The envisioned outcome is models that self-optimize in terms of size, speed, and accuracy; such algorithms are likely to leverage ML techniques to ascertain the most efficacious compression strategy, potentially integrating reinforcement learning or meta-learning to perpetually refine their compression tactics based on performance feedback [271, 272]. This segues into the importance of embedding compression considerations within the initial design phase of models, fostering the emergence of inherently efficient architectures. Such an approach encourages the creation of high-performance models that are naturally predisposed to compression, thereby circumventing the typical trade-offs associated with post-development compression [69, 105, 273].

As computational sustainability garners increasing attention, model compression techniques aimed at reducing the energy demand of ML operations will become paramount. These energy-efficient compression methods, which are crucial for lowering operational expenses and fulfilling sustainability objectives, will necessitate innovations that optimize computational pathways and minimize energy-intensive processes [274]. In the context of federated learning, where model training and data are disseminated across numerous devices, the imperative for efficient model compression is underscored. Future research endeavors should be dedicated to crafting compression methodologies that enable swift and effective model updates and sharing across devices, thus alleviating bandwidth and storage constraints while preserving model integrity and performance [86, 145, 275].

With the advent of multi-modal ML, the demand for compression techniques proficient across varied data types, such as text, images, and audio, will intensify. These techniques must adeptly navigate the complexities inherent to different data modalities, ensuring the effectiveness of the compressed model across a spectrum of tasks, and paving the way for more adaptable compression techniques that broaden the utility of ML models in a myriad of settings [276]. Additionally, the potential of quantum computing to transform model compression is on the horizon. Probing into how quantum algorithms could enhance the efficiency of large dataset and model compression processes may herald unprecedented breakthroughs, offering superior compression capabilities beyond the scope of classical algorithms [277, 278].

Future models with self-healing attributes represent an exciting frontier, where models could autonomously detect performance declines due to compression and adapt proactively. Employing mechanisms to self-adjust their structure or parameters would ensure sustained optimal performance, even when faced with compression-related challenges [279, 280]. Lastly, the establishment of definitive benchmarks and standards for model compression is imperative for a coherent and meaningful evaluation of various techniques. The formulation of standardized metrics that accurately delineate the interplay between compression rate, model performance, and computational efficiency is essential for a more objective assessment of compression methodologies, propelling the advancement of more efficacious techniques [281, 282]. These future directions in model compression not only highlight the field’s potential for significant advancements but also underscore the interdisciplinary effort required to optimize ML models for the next generation of applications. By pushing the boundaries of current methodologies and exploring new paradigms, the research community can develop more sophisticated, efficient, and versatile model compression techniques.

10 Conclusion

In this comprehensive review, we explored various model compression techniques within ML environments, addressing the critical challenges faced, and the strategies employed to overcome them. Our investigation highlighted significant advancements in model compression methods, including quantization, pruning, knowledge distillation, transfer learning, and lightweight architectural designs. We found that each technique offers unique benefits and limitations, contingent upon the specific application and ML model requirements.

The significance of this review extends beyond the summarization of model compression techniques; it underscores the need for efficient computational models in the era of big data and AI. This exploration aids in the understanding of how each technique can be optimized and applied to different model architectures and tasks, broadening the scope of model compression research and application. The paper also underscores the significance of performance evaluation criteria in assessing the efficacy of compressed models. Moreover, through detailed case studies, it illustrates the practical implications and successes of model compression techniques in real-world scenarios. These examples not only highlight the effectiveness of model compression in enhancing computational efficiency and ease of deployment but also underscore the potential for further innovation in the field.

In conclusion, as the frontiers of ML and DL continue to expand, the role of model compression techniques becomes increasingly pivotal. This paper’s exploration of model compression strategies, their applications, and implications for future research serves as a foundation for ongoing and future work in the field. By addressing the challenges posed by the paradox of progress and limitation, it points the way toward a future where advanced ML models are not only technically feasible but also easy to deploy across diverse and resource-limited environments. The journey of model compression is far from complete, and the insights generated from this research will inspire further innovations, driving the evolution of efficient, scalable, and accessible AI technologies.