Strategic Data Navigation: Information Value-based Sample Selection

Artificial Intelligence represents a rapidly expanding domain, with several industrial applications demonstrating its superiority over traditional techniques. Despite numerous advancements within the subfield of Machine Learning, it encounters persistent challenges, highlighting the importance of ongoing research efforts. Among its primary branches, this study delves into two categories, Supervised and Reinforcement Learning, particularly addressing the common issue of data selection for training. The inherent variability in informational content among data points is apparent, wherein certain samples offer more valuable information to the neural network than others. However, evaluating the significance of various data points remains a non-trivial task, generating the need for a robust method to effectively prioritize samples. Drawing inspiration from Reinforcement Learning principles, this paper introduces a novel sample prioritization approach, applied to Supervised Learning scenarios, aimed at enhancing classification accuracy through strategic data navigation, while exploring the boundary between Reinforcement and Supervised Learning techniques. We provide a comprehensive description of our methodology, while revealing the identification of an optimal prioritization balance and demonstrating its beneficial impact on model performance. Although classification accuracy serves as the primary validation metric, the concept of information density-based sampling is generalizable to a broader range of applications.


Introduction
Machine Learning has emerged as a rapidly evolving field, characterized by continual advancements and innovations. However, despite the abundance of state-of-the-art data-driven solutions addressing industrial challenges, a persistent difficulty exists: the inefficiency of data sampling. This inefficiency arises from the inherent difficulty in determining the informational value of specific data points during the training process. Consequently, the optimization of sampling efficiency remains an unresolved issue, presenting a significant challenge for further advancement in Machine Learning techniques.
The inherent complexity in determining the relative importance of individual data samples merits further exploration. These samples vary in their contribution to the network's training process, with some holding more substantial pieces of new information or being more helpful in tuning the network weights. The core challenge is the difficulty of determining which samples are more informative for the neural network. While prior determination, perhaps with the help of classic feature extraction, could offer preliminary insights into the significance of samples within the training dataset, it would ignore an important reality: the value of a sample is not solely reliant on the information within the raw data; it is also significantly influenced by the current state of the neural network during training. Therefore, a dynamic approach to assessing the value of samples is essential, one that continuously evaluates their significance in the context of the evolving state of the network. This paper introduces an innovative approach for dynamic prioritization of training data to enhance sampling efficiency in Machine Learning models. The primary objective is to develop a reliable metric capable of quantifying the informational value gained by a neural network from a specific sample. This approach aims to utilize samples rich in valuable information more frequently, while reducing reliance on less informative ones.
Our methodology draws inspiration from existing prioritization strategies employed in Reinforcement Learning. In Reinforcement Learning, one often faces a vast number of potential experiences (training data points), making it impractical to retain all of them in the memory buffer. Thus, a decision has to be made about which experiences to retain and which to discard, revealing a more apparent need for prioritization, and as such, several methods have already been established in this domain. Nevertheless, data prioritization remains relatively unexplored in Supervised Learning contexts. Although data points are typically static in these scenarios, hence removing the challenge of selection, sampling is often performed stochastically, relying on a uniform distribution without questioning the adequacy of assigning equal value to every data point. Since the informational value of different data points can vary significantly, a well-informed approach to prioritization could potentially yield superior outcomes compared to uniform sampling methods. Additionally, efficient sampling may improve convergence characteristics or final performance metrics, such as the accuracy of the network. This paper explores this premise, seeking to bridge the gap in data prioritization strategies between Reinforcement and Supervised Learning paradigms.

Related work
Deep Learning proved to be an outstanding tool for computer vision applications, starting with the MNIST dataset (LeCun et al, 1998), which contains 60,000 grayscale images of 28 × 28 pixels (Baldominos et al, 2019). Nowadays, NN models are able to handle complex real-life visual information, achieving high accuracy on various well-known benchmark datasets (Voulodimos et al, 2018), such as CIFAR100 or ImageNet, or handle special domain-specific datasets (Aldoski and Koren, 2023).
Oftentimes, the innovation lies within the network architecture (Wu et al, 2021; Ridnik et al, 2021), where the two major competing categories are Convolutional Neural Networks (CNN) (Li et al, 2022) and Vision Transformers (Khan et al, 2022). Another branch of improvements concentrates on the definition of the loss function. For instance, Sharpness-Aware Minimization (SAM) enhances a model's ability to perform well on new, unseen data by focusing on two goals at once: reducing the loss value, while also minimizing the sharpness of the loss landscape. It aims to identify parameters that exhibit optimal performance not only at specific points, but also demonstrate consistently low loss across a broader area. This approach thereby generates a min-max trade-off in the optimization problem, which can be effectively tackled using gradient descent methods, as demonstrated in (Foret et al, 2020). Another such example has been introduced in (Yu et al, 2023), where the authors have investigated how altering entropy affects Deep Learning systems by introducing noise at various stages, such as in the latent space and directly into the input image. The models developed using this technique are referred to as Noisy Neural Networks (NoisyNN). Traditionally, noise is seen as a disruptive factor in many Deep Learning frameworks. However, this study reveals that noise can be a beneficial tool for modifying the entropy of the learning system. It has been shown that, under certain conditions, carefully chosen noise can enhance the performance of a range of Deep Learning architectures.
An alternative approach involves experimenting with label information. In (Lee et al, 2013), the network has been trained using a combination of labeled and unlabeled data concurrently. So-called pseudo-labels are applied to unlabeled data, generated by selecting the class with the highest predicted probability and treating it as the genuine label. Essentially, this method acts as Entropy Regularization, encouraging class separation with minimal overlap, a commonly acknowledged assumption in Semi-Supervised Learning.
Data augmentation undoubtedly plays a crucial role in enhancing performance as well. Data mix augmentation techniques, such as CutMix (Yun et al, 2019) or AutoMix (Liu et al, 2022), aim to prevent overfitting by blending different images together. This involves overlaying patches of images onto the current image, hence reducing the network's likelihood of incorrectly focusing on local image features instead of recognizing higher-level patterns within the picture.
Several different approaches exist; however, sampling efficiency is rarely scrutinized. In (Nguyen et al, 2011), a medical application is presented, where auxiliary label information is utilized to augment the information content of training data. The authors have proposed a strategy that involves leveraging supplementary probabilistic data indicating the confidence level associated with each label. Several techniques have been suggested in the paper to utilize this data effectively, thereby improving learning efficiency with fewer samples.
Furthermore, the concept of prioritized sampling holds significance in scenarios marked by class imbalance, for example, the DeepSMOTE architecture shown in Fig. 1 and further elaborated in (Dablain et al, 2023). Such imbalances are characterized by a disproportionate representation of various labels within the training dataset. To mitigate this issue, techniques such as oversampling of underrepresented data have been demonstrated to boost performance, alongside modifications to the loss function and other strategies (Wang et al, 2016; Yan et al, 2015). These approaches highlight the potential benefits of strategic sampling in enhancing the performance of Machine Learning models, particularly in contexts where data distribution is uneven. However, the static nature of oversampling persists, as addressing class imbalance typically involves solely considering the number of data points in different classes.

Fig. 1 The DeepSMOTE architecture for oversampling unbalanced datasets (Dablain et al, 2023).
The data prioritization method presented in this paper bears resemblance to the AdaBoost methodologies, initially introduced in (Freund and Schapire, 2002). These methods aim to construct an arbitrarily accurate strong predictor by combining multiple weak learners, each slightly outperforming random guessing. This process involves a re-weighting of training samples following the training of a weak learner, as visualized in Fig. 2. For more details, refer to (Hastie et al, 2009; Cao et al, 2013; Schapire, 2013).
To contextualize the methodology proposed in this study, it is essential to delve into the realm of Reinforcement Learning. Historically, Reinforcement Learning originated as an on-policy technique, where the learning agent primarily assimilates knowledge only from its most recent experiences, thereby necessitating the exploration of alternative techniques for sample prioritization. An early advancement in this field is the introduction of the prioritized sweeping algorithm, detailed in (Moore and Atkeson, 1993). This algorithm innovatively utilizes variations in absorption probabilities, denoted as γ parameters, to assess the significance of a transition. Transitions with higher γ values are deemed to be more significant for updates due to their potential to influence the absorption probabilities of preceding states. In practical scenarios, transitions are prioritized between successive real-world observations, sorted based on their assigned priority until reaching a predefined limit of processing steps. This mechanism effectively ensures that states of greater importance are allocated preferential positions within the priority queue.
The idea of an experience replay memory came around later with the emergence of Deep Q-Networks (DQN), as detailed in (Mnih et al, 2015). This article introduced DQN, an innovative artificial agent that integrates Reinforcement Learning with Deep Neural Networks, more specifically deep Convolutional Neural Networks, to process high-dimensional sensory inputs for decision-making. Notably, the DQN is capable of learning successful strategies directly from raw sensory data, as demonstrated by its performance on classic Atari 2600 games, where it outperformed previous algorithms and even rivaled professional human gameplay. The design of DQN agents addresses the instability commonly associated with Reinforcement Learning in neural networks by employing two key novelties: experience replay and periodic updates to action-values. Experience replay randomizes data to reduce correlations in observation sequences, while periodic updates reduce target correlations. Unlike previous methods, the DQN efficiently utilizes large neural networks without the need for repeated training, making it well-suited for complex sequential decision-making tasks. Moreover, the introduction of experience replay memory helps to bridge the gap between Supervised and Reinforcement Learning approaches in the sense that training samples in a Supervised Learning scenario are typically chosen from a static database, whereas in Reinforcement Learning, samples within the memory are replaced throughout the entire training.
The development of DQNs initiated a revolution in the field of Reinforcement Learning, with a pivotal moment being the victory of DeepMind's AlphaGo program over Lee Sedol, the World Go champion, in March 2016 (Silver et al, 2016). This landmark event highlighted the advanced capabilities of AI systems, which integrate deep neural networks with Reinforcement Learning techniques, in mastering intricate challenges once considered unattainable.
In terms of training data prioritization, Prioritized Experience Replay (PER) (Schaul et al, 2016) stands as a notable breakthrough. Its novel approach lies in the method of selecting experiences for replay based on their expected learning progress, measured by the magnitude of their temporal-difference (TD) error. This strategy differs from traditional experience replay methods, which uniformly sample experiences without regard to their learning potential, by aiming to enhance sampling efficiency through focusing on transitions that seem more likely to contribute to the agent's training progress. The paper also addresses potential issues of this method, such as reduced diversity and introduction of bias, by incorporating stochastic prioritization and importance sampling weights. This innovation yields expedited learning and improved performance across the Atari 2600 benchmark suite, marking a significant advancement in Reinforcement Learning. Even today, PER remains the most prevalent prioritization technique in the literature.
Nonetheless, numerous adaptations of the original PER exist. For instance, (Horgan et al, 2018) introduced a distributed architecture that separates the processes of acting and learning. Multiple agents (actors) interact with their environments independently, gathering experiences in a shared memory. Meanwhile, a central learner selectively replays the most impactful experiences for neural network updates. This innovation enables learning from a significantly larger pool of data, thereby improving efficiency and reducing training duration, consequently leading to enhanced performance. Similarly, in (Hou et al, 2017), the same experience prioritization technique is adapted to complement the Deep Deterministic Policy Gradient workflow. Another study (Brittain et al, 2020) introduced Prioritized Sequence Experience Replay (PSER), an extension of the conventional Prioritized Experience Replay. Unlike PER, which prioritizes individual experiences solely based on their latest observed temporal-difference error, PSER takes into account the entire sequence of transitions, propagating priorities backward through the chain of actions. This approach leverages trajectory information embedded within the sequence. Experimentation with the 'Blind Cliffwalk' environment revealed that PSER outperformed PER, showcasing faster convergence and superior performance by encouraging sampling of states leading to high-reward outcomes.
Various alternative strategies have been explored in the literature, as elucidated in (Yu, 2018). Another study proposes enhancing the efficiency of sampling in policy gradient methods by employing approximate gradients derived from fixed-size batches of trajectories and incorporating softmax parametrization. This enhancement facilitates more effective and practical global convergence for the REINFORCE algorithm, as detailed in (Zhang et al, 2021). Additionally, (Buckman et al, 2018) proposes the Stochastic Ensemble Value Expansion (STEVE), a model-based method aimed at improving sampling efficiency. STEVE achieves this by dynamically adjusting between model rollouts over varying horizon lengths, ensuring error minimization in the model's application. Furthermore, the authors of (Hester and Stone, 2013) present TEXPLORE, a model-based Reinforcement Learning algorithm, which significantly improves sampling efficiency in robotic control tasks. By learning a random forest model that effectively generalizes dynamics to new states, TEXPLORE prioritizes exploration of states crucial for developing the final policy, thereby reducing the need to sample less relevant states.
Another approach to address sampling inefficiency in model-free Reinforcement Learning with high-dimensional images is to integrate auxiliary losses, such as image reconstruction, for more efficient latent representation learning. The authors of (Yarats et al, 2021) identify and rectify training instabilities caused by variational autoencoders in off-policy learning algorithms, leading to an approach that is both sample-efficient and robust, especially in the presence of observational noise. Additionally, (Grande et al, 2014) demonstrates the sampling efficiency of Gaussian Processes in Reinforcement Learning, illustrating that GPs can be KWIK-learned and establishing the sampling efficiency of GP-Rmax, a model-based RL method. They also introduce a new model-free RL algorithm using GPs, named DGPQ. In continuous state and action spaces, sampling efficiency can be improved by focusing on learning a Q-function parametrically defined by its rank. The key advancement is an iterative algorithm that significantly reduces sample complexity, especially when the optimal Q-function has a low rank and the discount factor is below a certain threshold, leading to improved learning efficiency (Shah et al, 2020).
This study focuses on the prioritization technique of the PER algorithm, more specifically an enhanced version of PER proposed by (Kővári et al, 2023). The demonstrated methodology improves convergence speed and final state values by introducing an exploration element in the prioritization metric. PER inherently suffers from overfitting in certain situations, and the exploration term mitigates this risk, while containing a tuning constant that allows fine-tuning of the exploration-exploitation trade-off of the prioritization process. In a Supervised Learning context, this balance would correspond to the overfitting-underfitting problem. The question of an optimal balance between exploration and exploitation arises in several fields (e.g., (Cuevas et al, 2014)). For a more detailed discussion of analogies, similarities and differences between prioritization in Supervised and Reinforcement Learning, see Section 2.

Contribution
Undoubtedly, recent years have witnessed significant progress in various aspects of Supervised Learning, including advancements in loss functions, neural network architectures, training algorithms and data augmentation techniques. However, training sample prioritization, a fundamental challenge in Reinforcement Learning rooted in the exploration-exploitation trade-off, remains a research gap in this domain. Consequently, stochastic sampling via a uniform distribution persists as the predominant technique for handling training data in this field.
In order to address this gap, this paper presents a novel approach for sampling prioritization. Our strategy efficiently navigates the training dataset to identify samples rich in new information by integrating insights from Reinforcement Learning into Supervised Learning applications. Moreover, the proposed methodology incorporates a sophisticated approach to modulate the inherent risk of overfitting associated with prioritization attempts, utilizing a dual-component metric. Additionally, the effectiveness of our approach is demonstrated through experiments in image classification scenarios, utilizing benchmark datasets and neural network architectures popular in the literature.

Methodology
The three main Machine Learning techniques are depicted in Fig. 3. The first, Unsupervised Learning, involves a creative approach, where the network attempts to discern patterns in an unlabeled dataset without explicit guidance, although this aspect of learning falls outside the scope of this article. The second method, Supervised Learning, employs labeled data to identify, understand, and learn the fundamental characteristics of the dataset. It can be considered the scholar type of learning, since the neural network receives guidance and relies on the ground truth provided by labeled data. The third approach, namely Reinforcement Learning, serves as inspiration for this research. RL realizes the explorer type of learning, wherein the agent maps out environmental states and learns from experiences through reward values provided by the environment, which indicate the quality of the agent's decisions.

Fig. 3 The three main Machine Learning categories and their application fields.

Although not directly applied in this research, understanding the principles of RL is crucial for discerning how the prioritization concept proposed in (Kővári et al, 2023) can be adapted and tailored to Supervised Learning scenarios.

Supervised Learning
Supervised Learning, a principal methodology in the field of Machine Learning and Artificial Intelligence, involves leveraging labeled datasets to train algorithms for accurate data classification or outcome prediction. This technique relies on input-output pairs, where input data is associated with known outputs, i.e., labels. During training, the algorithm iteratively adjusts the weights of a neural network based on the disparities between model predictions and actual labels, evaluated using a loss function. This iterative procedure, analogous to the Reinforcement Learning workflow detailed in Section 2.2, continues until the model achieves optimal accuracy, typically assessed through cross-validation techniques. Supervised Learning encompasses two primary categories: classification, which involves categorizing data into predefined classes, and regression, which establishes relationships between dependent and independent variables, often to make predictions or projections. In essence, Supervised Learning empowers machines to learn from historical data and make informed predictions or decisions. From the widespread set of applications, this paper focuses on the computer vision domain.
The continuous rapid advancement in computational power and memory storage has elevated the relevance of Machine Learning, and particularly Deep Learning, to new heights. A major breakthrough in computer vision occurred with the introduction of Convolutional Neural Networks. Certain architectures, such as LeNet (Lecun et al, 1998) and AlexNet (Krizhevsky et al, 2012), demonstrated considerable improvement in accuracy across various computer vision tasks, while modern structures like ResNet (He et al, 2015) have further pushed the boundaries. The Residual Network framework, and hence ResNet as well, addresses the degradation problem commonly encountered with increasing network depth by introducing so-called residual blocks. These blocks enable layers to learn incremental adjustments or refinements to input feature maps, rather than generating entirely new feature representations. ResNet incorporates residual connections, implementing identity mapping by projecting the input through a 1 × 1 convolution to align with the dimensions of subsequent blocks, thereby facilitating effective gradient propagation during training. Moreover, ResNet organizes layers into uniform-dimension blocks, utilizing strided convolutions for periodic downsampling and channel depth augmentation, a strategic design choice aimed at preserving per-layer computational efficiency.
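To make the residual idea concrete, a minimal sketch of such a block in PyTorch-style Python is shown below; the layer sizes and structure are illustrative rather than the exact configuration used in ResNet50:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Minimal residual block: the layers learn a refinement F(x) and the block
    outputs F(x) + x; a 1x1 convolution projects the input when dimensions change."""
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)
        # Projection shortcut (1x1 conv) when the spatial size or channel count changes.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + self.shortcut(x))
```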
Following the evolution of Convolutional Neural Networks, MobileNetV3 (Howard et al, 2019) represents another significant advancement, specifically designed for efficiency on mobile and edge devices. This architecture, building upon the principles of MobileNetV1 and V2, integrates advanced techniques such as lightweight depthwise separable convolutions, which minimize computational load while maintaining performance. MobileNetV3 utilizes complementary search techniques and a unique architecture design, ensuring a balance between low latency and high accuracy, making it ideal for real-time applications on resource-constrained devices.

Reinforcement Learning
Reinforcement Learning has emerged as a prominent approach for addressing complex control problems and optimization challenges. Its efficiency in maneuvering through nonlinear and intricate problem spaces has been showcased in a diverse set of applications, from vehicle control problems (Kővári et al, 2020) through traffic control systems (Koh et al, 2020) to advancements in robotics (Yan et al, 2020) and tracking control (Luo et al, 2016).
Unlike other Machine Learning paradigms, RL operates without predefined labels for training data. Instead, it dynamically generates training data during the learning phase. RL is conceptually characterized by a sequence of interactions between two entities: the learning agent and the environment, as can be seen in Fig. 4. At each step, the agent evaluates the current state, executes a decision in the form of a specific action, and observes how the environment responds. This continuous interaction forms the core of RL's learning mechanism, allowing the agent to develop strategies that optimize outcomes based on environmental feedback. This feedback, known as the reward, quantitatively measures the impact of an action on progressing towards the desired objective. Essentially, the reward reflects the effectiveness of an action in achieving predefined goals, guiding the learning process by offering insights into the appropriateness of actions.
The reward can also be described as a 'procrastinated label'. Contrary to Supervised Learning, where labels are immediately associated with data points to indicate desired outcomes, RL operates on a system of delayed feedback. Here, rewards are not instantly received but emerge as consequences of a series of actions taken over time. This delayed feedback necessitates that the agent learns to associate its actions with outcomes that may occur much later. Essentially, rewards serve as a delayed indicator of action effectiveness, akin to labels provided after a sequence of events rather than immediately. This aspect poses a significant challenge in RL, requiring the agent to comprehend and strategize around both the immediate and long-term consequences of actions. Consequently, the learning process involves understanding these delayed rewards and adjusting decision-making strategies accordingly. Viewing rewards as procrastinated labels highlights the complexity of learning in an environment where the consequences of actions are often not immediately evident.

Markov Decision Process
The mathematical framework of Reinforcement Learning is formed by Markov Decision Processes (MDPs), primarily used for modeling sequential decision-making problems. An MDP encompasses four key components: a set of states S, a set of actions A, a state-transition function describing the dynamics of the environment, and a reward function R. In Reinforcement Learning, the comprehensive objective is for the agent to acquire a policy, denoted as π, a decision-making strategy aimed at maximizing cumulative rewards over time. This iterative process involves the agent interacting with the environment and refining its policy based on the outcomes of actions, as informed by the received rewards. The Markov property, which underlies MDPs, states that the future state depends solely on the current state and the action taken, independent of the historical sequence of states. This property simplifies the decision-making process for the agent, and as such, each training data point encapsulates an experience within a ⟨S, A, R, S′⟩ quartet, comprising the current state, chosen action, received reward, and the next state, respectively.
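As a simple illustration, the experience quartet could be represented in Python as a small record type; the field names below are illustrative, not taken from any particular implementation:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Experience:
    """One RL training data point: the <S, A, R, S'> quartet."""
    state: np.ndarray       # current state S
    action: int             # chosen action A
    reward: float           # received reward R
    next_state: np.ndarray  # resulting state S'
    done: bool = False      # whether the episode terminated after this step
```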

Deep Q-Learning
Reinforcement Learning algorithms can be categorized into two main types, value-based and policy-based methods, each comprising various subcategories, or even combinations in actor-critic methods (see, for example, (Liu and Wei, 2014)). To delve deeper into their mathematical formalization, elucidate the underlying structures, and draw parallels between Reinforcement and Supervised Learning methodologies, this paper only presents one of the most prominent branches of RL: Deep Q-Learning.
While rewards offer immediate feedback on the consequences of actions taken at a particular time step, providing short-term insights to the agent, the loss function necessitates a perspective on long-term benefit. This perspective is manifested through the aggregate rewards collected by the agent over the course of an episode. Given the indefinite duration (or step count) without theoretical time constraints, a formulation is required to prevent simple reward summation from diverging. This mathematical consideration is expressed in Eq. 1 as:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}, \tag{1}$$

where G_t is the weighted cumulative reward at time step t, R is the immediate reward, k indexes states in the future after time step t, and γ is the so-called discount factor. This formulation also raises the question of prioritization: the discount factor γ, ranging from zero to one, modulates the balance between immediate versus future rewards, with a value of zero assigning sole importance to the immediate reward. By employing Eq. 1, one can derive the so-called value function V^π(s), which represents the expected cumulative reward in state s, at time step t, following policy π, expressed as the expected value of G_t given the current state, formally described in Eq. 2 as:

$$V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right]. \tag{2}$$

The value function serves as a comprehensive long-term metric. However, the policy, being the state-action mapping, also requires estimating the utility of different actions at specific states. Similarly to the value function, the action-value function, alias Q-function, is shown in Eq. 3 as:

$$Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right], \tag{3}$$

where the only difference to the value function is that the expected value of G_t is calculated as a function of state-action (s, a) pairs, instead of solely states. The Q-function guides the agent towards optimal state-action pairs, facilitating informed decision-making. Although value-based methods do not inherently yield an explicit policy, one can obtain it by simply selecting the best value in each state in a greedy manner.
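As a worked illustration of Eq. 1, the discounted return of a finite episode can be computed by iterating backwards over the rewards; this is a minimal Python sketch, not tied to any specific implementation:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute G_t for every time step t of an episode (Eq. 1, truncated to the
    episode length) by accumulating rewards from the end towards the start."""
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Example: rewards 1, 0, 2 with gamma = 0.9 give
# G_2 = 2.0, G_1 = 0 + 0.9 * 2.0 = 1.8, G_0 = 1 + 0.9 * 1.8 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```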
The Deep Q-Network algorithm employs a dual-network architecture comprising an action network and a target network to enhance training stability. The action network is responsible for the action selection in case the agent decides to exploit, based on derived Q-values, while the target network calculates target Q-values against which the action network's predictions are calibrated. To ensure coherence between these networks and facilitate effective learning, the weights of the action network are periodically transferred to the target network. The Q-values are continuously tuned using the Bellman equation, as shown in Eq. 4:

$$Q(s_t, a_t) \leftarrow r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-}), \tag{4}$$

where Q(s_t, a_t) represents the action-value function for a given state-action pair (s_t, a_t), r_{t+1} denotes the reward received after taking action a_t in state s_t, γ is the discount factor and max_{a'} Q(s_{t+1}, a'; θ^−) signifies the maximum predicted Q-value for the next state s_{t+1} over all legal actions a', using the weights of the target network θ^−. This formulation lies at the core of the DQN algorithm, encapsulating the process of value iteration within a Deep Learning framework.
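A minimal sketch of one DQN update step in PyTorch-style Python, combining the target of Eq. 4 with a mean-squared-error loss (Eq. 5 below); `online_net`, `target_net` and the batch layout are hypothetical placeholders rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

def dqn_loss(online_net, target_net, batch, gamma=0.99):
    """One DQN loss computation: y = r + gamma * max_a' Q(s', a'; theta^-).
    `batch` is assumed to hold tensors (states, actions, rewards, next_states, dones),
    with `dones` as 0/1 floats marking terminal transitions."""
    states, actions, rewards, next_states, dones = batch
    # Q(s_t, a_t) from the action (online) network, for the actions actually taken.
    q_values = online_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # max_a' Q(s_{t+1}, a'; theta^-) from the periodically updated target network.
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)
    # Mean-squared error between predictions and bootstrapped targets.
    return F.mse_loss(q_values, targets)
```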
The loss is commonly computed with the mean-squared-error formulation represented by Eq. 5 as:

$$L = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2, \tag{5}$$

where ŷ_i signifies the predicted value, while y_i denotes the target value. In contrast to supervised methods, the target value is provided by the target network and may not align precisely with the ground truth. Therein lies a major difference between the learning techniques. Initially, the weights of the target network are as arbitrary as those of the action network, implying that the initial adjustments are akin to making an informed guess based on another guess. Assuming a well-designed reward system, it is the values of these rewards that guide the Q-values, and subsequently the estimates of the target network, towards approximating the ground truth or, in this context, the optimal policy. In conclusion, rewards serve a function analogous to labels.
In gaming scenarios, agents frequently do not encounter immediate rewards; instead, the game must progress to a certain stage before the value of a particular action can be assessed. This phenomenon, in extreme situations referred to as sparse rewarding, mandates multiple steps before a reward is issued (for specific examples, see (Hare, 2019) or (Vecerik et al, 2018)). Given the temporal gap inherent in this process, rewards can be considered as delayed labels, focusing only on functionality, as the reward system is responsible for reducing the difference between the ground truth (optimal policy) and the current state of the model (current policy). The significance of these delayed rewards becomes evident in the Bellman equation, described in Eq. 4, which facilitates the propagation of information step by step toward earlier states, ensuring that the learning process accounts for these delayed outcomes, as illustrated in Fig. 5, considering a lane keeping scenario, where the agent has to choose the optimal steering action to achieve the best possible trajectory.

Replay memory
The concept of replay memory has evolved into a key element, diverging from the initial online learning methods, where the agent's learning was limited to immediate experiences at the current step. Replay memory allows the agent to store, retrieve, and thus retrospectively learn from a batch of past experiences, akin to the labeled datasets in Supervised Learning frameworks. However, the finite nature of replay memory presents a challenge, particularly in expansive state-action spaces, potentially bordering on infinite. Consequently, it becomes impractical to retain all experiences within the memory buffer. To address this, experiences in the replay memory are typically replaced randomly upon the arrival of new experiences.
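A minimal replay memory along these lines, with random replacement of old experiences and uniform sampling, might look as follows in Python; the class layout is illustrative only:

```python
import random

class ReplayMemory:
    """Fixed-capacity experience buffer: once full, a randomly chosen old
    experience is overwritten by the new one, and training batches are drawn
    uniformly at random."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.buffer = []

    def push(self, experience):
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            # Replace a random old experience, as described above.
            self.buffer[random.randrange(self.capacity)] = experience

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```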
Despite this difference in memory utilization, a common challenge persists across both the RL and Supervised Learning paradigms. At each training step, it is imperative to select specific data points for training. Random selection overlooks the informational value of individual samples; notably, some samples may inherently pose greater complexity or difficulty for classification. Consequently, certain samples contain intricate patterns crucial for the network's learning process. This disparity results in an uneven utilization of data, with simpler samples potentially overrepresented in training iterations, while more intricate and informative samples are underutilized. The root of the issue lies in the challenge of identifying which samples offer greater learning value, hence establishing a reliable metric for prioritizing data samples is essential. Such a metric would enable more effective navigation through the available data, ensuring accurate representation and utilization of each sample's importance in the learning process.

Sample prioritization
The methodology outlined in this paper draws inspiration from principles of Reinforcement Learning, a connection elaborated further in the following section. Our aim is to optimize the selection of training data to maximize information gain. This refinement is achieved by formulating a probability distribution for sample selection, which represents the weighting of information gain. Two distinct metrics are laid out, both yielding a sampling probability distribution reflective of the information gain anticipated from utilizing the specified samples.

Temporal Difference error
The selection of data points requires a robust metric capable of assessing and predicting their potential utility. A prime example of such a metric is the temporal difference (TD) error, a concept introduced by (Sutton, 1988). This approach allowed agents to derive learning directly from experiences without the need for an environmental model, playing a pivotal role in value-based learning methodologies. The essence of the TD error lies in quantifying the gap between the expected value of a state (or state-action pair) and the actual observed value, calculated from the immediate reward and the anticipated value of the next state (or state-action pair), all adjusted by a discount factor. This discrepancy is mathematically expressed in Eq. 6 for states and in Eq. 7 for state-action pairs, as follows:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t), \tag{6}$$

$$\delta_t = R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a') - Q(S_t, A_t), \tag{7}$$

where δ_t is the TD error of an experience at time step t, R_{t+1} is the immediate reward received, γ is the discount factor, and V(S_{t+1}) and Q(S_{t+1}, a') are the state and state-action values, respectively. This concept may be recognizable at this point, arising from the application of the Bellman equation within an iterative learning scenario. The temporal difference error acts as an indicator of the mismatch between the reward observed during a transition between states and the initially predicted value for the starting state. The magnitude of the TD error reflects the potential for learning progress, indicating the value of information contained within a specific training example. A higher TD error highlights a more notable amount of useful information, suggesting greater significance for improving the agent's performance.

Prioritized Experience Replay
The central idea introduced in (Schaul et al, 2016) involves utilizing the TD error as the basis for weighted sampling, prioritizing samples with higher TD error and consequently greater potential for information gain. During each training iteration, TD error values can be updated for the training samples. The straightforward approach employs these values, with a small constant added to prevent the exclusion of samples, as shown in Eq. 8:

$$p(i) = |\delta_i| + \epsilon, \tag{8}$$

where p(i) represents the priority value of sample i, δ_i denotes the TD error associated with sample i and ϵ signifies the small constant described earlier.
Hence, computation of the probability distribution for stochastic sampling over the entire experience replay memory is enabled by Eq. 9, expressed as:

$$P(i) = \frac{p(i)^{\alpha}}{\sum_{k} p(k)^{\alpha}}, \tag{9}$$

where P(i) is the probability of selecting sample i, p(i) denotes the priority value assigned to sample i and α serves as an exponent parameter controlling the degree of prioritization. For further insights into bias mitigation and annealing, see the original paper (Schaul et al, 2016). The paper clearly demonstrates the beneficial effects of prioritization, not only on model performance, but also on the rate of convergence, depicted in Fig. 6.
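The following minimal Python sketch turns TD errors into PER sampling probabilities along the lines of Eqs. 8 and 9; the parameter values are illustrative, not those of (Schaul et al, 2016):

```python
import numpy as np

def per_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Priorities p_i = |delta_i| + eps (Eq. 8) and sampling probabilities
    P(i) = p_i^alpha / sum_k p_k^alpha (Eq. 9)."""
    priorities = np.abs(td_errors) + eps
    scaled = priorities ** alpha
    return scaled / scaled.sum()

# Example: the sample with the largest TD error is drawn most often.
probs = per_probabilities(np.array([0.1, 2.0, 0.5]))
batch_indices = np.random.choice(len(probs), size=2, p=probs, replace=False)
```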

Upper Confidence Bound
The exploration-exploitation trade-off in Reinforcement Learning mirrors the challenges of underfitting and overfitting in Supervised Learning. Excessive exploration without sufficient refinement of learned states may lead to underfitting, where the agent fails to gather adequate information for effective policy development. Conversely, excessive exploitation can result in overfitting to a limited set of experiences, hindering the agent's ability to grasp the broader problem structure. Prioritization inherently tends toward exploitation by favoring certain samples over others. The Upper Confidence Bound (UCB) strategy, proposed in (Kővári et al, 2023), presents a sophisticated approach to finely balance exploitation and exploration, ensuring a more efficient learning process. The mathematical formulation of the UCB method is shown in Eq. 10 as:

$$\mathrm{UCB}_i = P(i) + c_p \, \frac{\max_k n_k}{n_i + \epsilon}, \tag{10}$$

where the UCB value of sample i is given as a summation of the exploration and the exploitation components. Exploitation remains the normalized priority value, consistent with the formulation introduced in Eq. 9. Exploration is quantified by n_i, the fit count indicating how many times sample i has been utilized for network updates, relative to n_k, the maximum fit count among all experiences stored in the memory buffer. The constant ϵ is introduced to prevent division by zero, while c_p acts as a parameter, constrained within the interval [0, 1], designed to finely tune the equilibrium between exploration and exploitation. This delicate balance empowers the agent to navigate the learning environment's state space more efficiently, leveraging the strengths of both strategies to achieve optimal performance.
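A sketch of such a dual-component weight in Python is given below. It does not reproduce the exact functional form of (Kővári et al, 2023); it only mirrors the ingredients described above (the normalized priority P(i), the fit counts n_i, the maximum fit count, ϵ and c_p) and should be read as an illustration rather than a reference implementation:

```python
import numpy as np

def ucb_weights(exploit_probs, fit_counts, c_p=0.5, eps=1e-6):
    """Illustrative dual-component weight: exploitation is the normalized priority
    P(i) (Eq. 9), exploration favours samples with low fit counts relative to the
    maximum fit count."""
    max_count = np.max(fit_counts)
    if max_count == 0:
        # Nothing has been fitted yet: fall back to the exploitation term alone.
        return exploit_probs / exploit_probs.sum()
    exploration = max_count / (fit_counts + eps)
    exploration = exploration / exploration.sum()   # keep both terms on a comparable scale
    weights = exploit_probs + c_p * exploration
    return weights / weights.sum()                  # renormalize into a distribution
```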

Probability-based approach
The concept of exploration is relatively straightforward; utilizing the frequency of a sample's usage in training as a metric helps to mitigate overfitting by avoiding excessive reliance on a limited set of samples, as seen in previous discussions. This principle is consistent across both the Reinforcement Learning and Supervised Learning domains.
Nevertheless, delving into exploitation requires deeper investigation. Here, the counterparts of the temporal difference, or of its constituent elements, within Supervised Learning must be considered. Both methodologies involve a predicted value: one generated by the action network in the case of RL, and the other by the sole network in SL. However, unlike Reinforcement Learning, where a dedicated target value is employed, Supervised Learning relies on the ground truth.
In this context, the closest parallel to the temporal difference lies in the disparity between the ground truth probability and the network's confidence in its prediction, as expressed in Eq. 11:

$$\mathrm{PE}_i = \left| \, y_{i,j} - \sigma\big(\hat{y}_i(x_i; l, \theta)\big)_j \, \right|, \qquad j = \arg\max \hat{y}_i(x_i; l, \theta), \tag{11}$$

where ŷ_i represents the probabilistic model outputs for sample i, l denotes the set of labels, θ denotes the parameters of the neural network, σ signifies the softmax function, y_{i,j} denotes the ground truth for the j-th class of sample i, assuming a binary value of either 0 or 1, and x_i represents the input sample i. Essentially, the Probability Error metric quantifies the difference between the highest predicted probability value and the ground truth of the corresponding label.
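A minimal Python sketch of this Probability Error computation for a batch of network outputs; the tensor shapes and helper name are assumptions made for illustration:

```python
import torch.nn.functional as F

def probability_error(logits, targets):
    """Probability Error in the spirit of Eq. 11: the absolute difference between
    the highest predicted class probability and the ground truth of that class.
    logits: (batch, num_classes) raw outputs; targets: (batch,) class indices."""
    probs = F.softmax(logits, dim=1)
    top_prob, top_class = probs.max(dim=1)
    # Ground truth of the predicted class: 1 if the prediction is correct, 0 otherwise.
    ground_truth = (top_class == targets).float()
    return (ground_truth - top_prob).abs()
```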
This disparity in probabilities highlights the informational value of a training sample, explicitly indicating the extent of divergence between the network's prediction and the actual ground truth. The objective of the training process is to minimize this discrepancy between predicted outcomes and ground truth, with this metric designed precisely for this purpose. Further details on this metric are provided in Section 3.1.
In the proposed methodology, illustrated in Fig. 7, the training process initiates by establishing a uniform probability distribution, which forms the basis for stochastic sampling from the training dataset. Hence, the selection mechanism ensures a balanced representation from the onset. Following the step of sample selection, loss computation is performed for the currently selected batch of data. The subsequent step involves model fitting based on the calculated loss, enabling the acquisition of both the exploration and exploitation metrics. Thereafter, the weights, i.e., the UCB values, are computed. Finally, upon the assignment of a weight to each sample, the probability distribution undergoes an update based on these recalibrated weights. This update assigns a higher probability of selection to samples anticipated to possess greater information density at a given time, thereby establishing a loop for dynamic weighting in the sampling procedure.
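Putting the pieces together, the loop of Fig. 7 could look roughly as follows in Python; this sketch reuses the illustrative helpers defined earlier (per_probabilities, ucb_weights, probability_error) and is an assumption-laden outline, not the exact training code used in the experiments:

```python
import numpy as np
import torch

def prioritized_training_loop(model, optimizer, loss_fn, dataset, steps, batch_size, c_p=0.5):
    """Outline of the loop in Fig. 7: start from a uniform distribution, then after
    every model update recompute per-sample weights and refresh the sampling
    distribution. (CPU tensors are assumed for brevity.)"""
    n = len(dataset)
    probs = np.full(n, 1.0 / n)   # uniform sampling distribution at the start
    fit_counts = np.zeros(n)      # exploration statistic: how often each sample was used
    errors = np.zeros(n)          # exploitation statistic: latest Probability Error per sample

    for _ in range(steps):
        idx = np.random.choice(n, size=batch_size, p=probs, replace=False)
        x = torch.stack([dataset[i][0] for i in idx])
        y = torch.tensor([dataset[i][1] for i in idx])

        optimizer.zero_grad()
        logits = model(x)
        loss = loss_fn(logits, y)
        loss.backward()
        optimizer.step()

        # Update the statistics for the samples in this batch only.
        errors[idx] = probability_error(logits.detach(), y).numpy()
        fit_counts[idx] += 1

        # Recompute the UCB weights and use them as the new sampling distribution.
        probs = ucb_weights(per_probabilities(errors), fit_counts, c_p=c_p)
    return model
```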

Abstract formalism
As previously highlighted, while the exploitation strategy requires adjustments compared to the RL framework, the fundamental aim remains the development of a metric that evaluates the expected information gain from specific samples. The PB error, derived directly from the TD error, embodies a perspective rooted in Reinforcement Learning. In contrast, the Label Change Error offers a measure tailored for a Supervised Learning context. It is reasonable to argue that the frequency of class changes for a training point indicates the network's challenge in identifying the content of the image; such data points could offer valuable learning opportunities for the network due to their complexity. In this case, the exploitation element is replaced with a counter c_l registering how many times each sample changes its predicted label. Subsequently, this label change count is utilized to compute the sample priorities, as shown in Eq. 12:

$$p(i) = c_{l,i} + \epsilon. \tag{12}$$

The Label Change Error provides a simpler and more refined measure of the level of surprise, while maintaining consistency with the main philosophy. Further details and advantages of this metric are demonstrated in Section 3.
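An illustrative Python sketch of such a label-change counter is given below; the bookkeeping details (and the reconstructed priority p(i) = c_l,i + ϵ) are assumptions made for demonstration:

```python
import numpy as np

class LabelChangeTracker:
    """Illustrative label-change counter: c_l grows every time a sample's predicted
    class flips between consecutive uses, and priorities follow from the counter."""
    def __init__(self, num_samples, eps=1e-6):
        self.last_prediction = np.full(num_samples, -1)   # -1 marks "not predicted yet"
        self.change_count = np.zeros(num_samples)
        self.eps = eps

    def update(self, indices, predicted_classes):
        previous = self.last_prediction[indices]
        changed = (previous != -1) & (previous != predicted_classes)
        self.change_count[indices] += changed.astype(float)
        self.last_prediction[indices] = predicted_classes

    def priorities(self):
        return self.change_count + self.eps
```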

Probability Error
Both proposed metrics introduce an additional hyperparameter, c_p, crucial for maintaining an appropriate balance between underfitting and overfitting. Optimal tuning of c_p is vital for effective sample prioritization. The Probability Error, described in Section 2.4, shows promise in enhancing classification accuracy. However, despite meticulous selection of c_p values, the improvement in accuracy is relatively modest. A fundamental limitation arises from the debatable effectiveness of employing probability differences as a metric for error evaluation. This issue is exemplified by the CIFAR dataset, where images of willow, oak and maple trees exhibit either indistinguishable characteristics or pose significant identification challenges, even to trained observers. To convey the difficulty of the task, this phenomenon is illustrated in Fig. 8. In such cases, conventional labeling approaches dictate that an image of a willow tree is classified as 0% oak, 0% maple and 100% willow, an oversimplification that disregards shared characteristics among these tree species. As a result, a proficiently trained model might correctly identify a willow tree as such, although with low confidence due to the close resemblance to other tree classes, resulting in a substantial Probability Error. Expecting the model to predict with 100% certainty in favor of the willow tree to mitigate the error is unrealistic under these circumstances. Therefore, reliance on the Probability Error as a metric prioritizes samples near class boundaries within the feature space, which ideally should be recognized as closely related. This tendency slightly encourages the network to overfit a few samples lacking unique informational content. Although the exploration term can partially neutralize this tendency, the underlying issue remains unresolved.

Fig. 8 Examples of cross-class similarity in the CIFAR100 dataset (graphical quality is inherent to the CIFAR dataset).
Another potential source of bias within the metric arises from instances where identical images are assigned disparate labels. This phenomenon is exemplified by certain examples within the CIFAR100 dataset, as illustrated in Fig. 9. Given the close relationship between these categories, one might argue, based on prior reasoning, that classifying these images into different categories does not constitute a significant error. However, this scenario presents the network with conflicting information, suggesting that a single image could belong to multiple classes. While such an assertion is plausible in specific contexts (e.g., an image of a baby girl might reasonably fit into more than one category, invoking multi-label classification as described by (Fürnkranz et al, 2008)), it still poses challenges. From this perspective, it remains problematic that the probability-based error metric for these images is elevated due to their positioning at the intersection of two classes. This situation underscores a limitation of the PB error metric: it penalizes the network for ambiguity inherent in the dataset itself, rather than for inadequacies in the network's classification capabilities. This issue affects a small minority of all training data, but it is present nonetheless and, without efficient smoothing (i.e., tuning of the c_p parameter), it can be significant.

Label Change Error
Despite the limitations associated with the probability-based error metric, it has shown a modest improvement in accuracy, indicating that the core principle of prioritizing information gain remains valid. To address the shortcomings of the PB error metric, we investigated an alternative metric, termed Label Change Error. This novel approach successfully addresses the drawbacks of the PB error, while adhering to the original concept of emphasizing information gain. The label change metric leverages the behaviour of data points that frequently alter their class affiliation, typically residing at class boundaries. By focusing on these critical data points, the metric aims to train the network to discern intricate patterns more effectively.
In contrast, the challenge of inherently low probabilities for certain data points, a notable issue with the PB error metric, becomes irrelevant under the label change approach. This method operates on a more abstract level, bypassing direct probability assessment. Consequently, data points firmly categorized within a specific class, even if with lower confidence due to their resemblance to other classes in the dataset, are not disproportionately emphasized. Instead, attention is directed towards data points presenting classification challenges, allowing the network to examine and learn from these cases with greater intensity. This strategic shift ensures that learning is focused on areas where the network can achieve the most significant gains in understanding the dataset's complexity, thereby enhancing overall model performance. It is worth noting that maintaining a balance against overfitting remains crucial, emphasizing the importance of selecting the correct c_p value.

Results
Throughout the experiments, the methodology was restricted to data augmentation and hyperparameter optimization, with no advanced modifications applied to the base neural network architectures. The principal objective of this study is to demonstrate the effectiveness of information gain-based prioritization in enhancing model efficiency, and to showcase that the proposed metric successfully embodies this concept. The impact of prioritization is illustrated via the task of image classification. Table 1 presents the obtained validation accuracy gains, demonstrating the effectiveness of the approach. These accuracy metrics represent mean values derived from 6 distinct random seeds. This setup ensures that the study focuses on evaluating the influence of information gain-based prioritization on model performance, rather than the potential benefits of novel network architectures. By leveraging common data augmentation techniques such as Gaussian blur, color jitter, random resized crop, and random flips, we aim to simulate real-world variations in the dataset, thus providing a robust testing ground for our prioritization methodology.
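The augmentation pipeline named above could be assembled with torchvision as sketched below; the concrete parameter values are assumptions for illustration, not the settings used in the experiments:

```python
from torchvision import transforms

# Illustrative pipeline with the augmentations named above; the parameter
# values are assumptions, not the experimental settings.
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(32, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0)),
    transforms.ToTensor(),
])
```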
The convergence of validation accuracy is depicted in Fig. 10 for ResNet50 and MobileNetV3. The noticeable increase in accuracy gain observed on the more complex CIFAR100 dataset indicates scalability. Furthermore, the consistent accuracy improvements demonstrate that the proposed methodology reliably supports model performance across different network architectures and random seeds. In the context of Reinforcement Learning, prioritization benefits from extended periods to influence outcomes, as agents often undergo thousands of episodes. It is notable that, in comparison, the impact of prioritization becomes evident within a considerably shorter time frame. This difference can be partly attributed to the dynamic nature of the Reinforcement Learning buffer, as opposed to the static nature of supervised datasets. Upon examining the images assigned the highest weights, the issue of duplicate data is still observed, as outlined in Section 3.1, albeit to a lesser extent. Despite the occurrence of low probability values, labels often converge to a resting point. This convergence prevents the prioritization method from targeting these samples for focused learning. This observation indicates a reduction in the incidence of duplicate data problems and explains the success of the label change metric.
The observed effects of prioritization are significant, though not overwhelming, due to the exploration component of the proposed metric. For instance, in 100 epochs using the ResNet50 architecture, the most notable difference in the utilization frequency (fit count) of a sample reached 44, indicating that while samples with lower priorities are still extensively used, thus ensuring their informational content is leveraged, the mechanism also effectively emphasizes the selection of more crucial samples. To better understand this phenomenon, the t-SNE (t-Distributed Stochastic Neighbor Embedding) algorithm was employed. t-SNE is known for its ability to reduce the dimensionality of high-dimensional data, making it especially useful for visualizing such data in a low-dimensional space, by transforming similarities between data points into joint probabilities and then minimizing the Kullback-Leibler divergence between these probabilities across both high-dimensional and low-dimensional spaces. This process effectively groups similar data points together while separating dissimilar ones. The visualization provided in Fig. 11 demonstrates how t-SNE distinguishes the ten classes of the CIFAR10 dataset, represented with various colors; the size of the points correlates with the frequency of a sample's use in model training, with larger points indicating a higher utilization rate. A critical conclusion from our study highlights the significance of the c_p parameter, which balances exploration and exploitation. Selecting an appropriate value for this parameter is crucial, as suboptimal choices may result in minimal to no beneficial effect. While the choice of parameter value appears to be slightly influenced by the neural network architecture, it is significantly affected by the dataset. Interestingly, our findings also suggest that prioritization does not notably increase computational demands or training runtime.
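For reference, a visualization of this kind could be produced with scikit-learn's t-SNE implementation as sketched below; the function and argument choices are illustrative assumptions, not the exact plotting code behind Fig. 11:

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_usage_tsne(features, labels, fit_counts):
    """Project per-sample feature vectors to 2D with t-SNE and scale marker size
    by how often each sample was used for training (cf. Fig. 11).
    features, labels and fit_counts are assumed to be NumPy arrays."""
    embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
    spread = fit_counts.max() - fit_counts.min() + 1e-9
    sizes = 5 + 50 * (fit_counts - fit_counts.min()) / spread
    plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, s=sizes, cmap="tab10", alpha=0.7)
    plt.title("Sample usage by class (t-SNE projection)")
    plt.show()
```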

Conclusion
This research addresses a fundamental challenge in the field of Machine Learning: the inefficiency of data sampling. Despite continuous advancements and innovations in the domain, the optimization of sampling efficiency remains an unresolved issue. The essence of the problem lies in the inherent difficulty of determining the relative importance of individual training samples during the training process. Traditional approaches fall short in providing a comprehensive solution, as they overlook the dynamic nature of a sample's value due to the evolving state of the neural network over training time.
To address this challenge, this paper proposes an innovative approach for dynamic prioritization of training data to enhance sampling efficiency in Supervised Learning models, while exploring the boundary between Reinforcement and Supervised Learning techniques in Machine Learning.
With the task of classification serving as a primary demonstration tool for our methodology's effectiveness, and RL being a source of inspiration, a vast number of existing prioritization methods are presented in Section 1.1, followed by a deeper dive into the RL framework in Section 2.2, highlighting the similarities and differences of these methods. Section 2.4 and Section 2.5 elucidate the formalization of the proposed metrics, while Section 3 describes practical considerations, conceptual risks and benefits. Although the same formalism cannot be directly applied due to the different nature of the learning frameworks, the general utility of prioritization is demonstrated, after the appropriate modifications, through classification results in Section 4.
The widely recognized CIFAR100 and CIFAR10 datasets have been used for benchmarks, aiming to assess the overall impact of our methodology without employing special augmentations. Similarly, the original ResNet50 and MobileNetV3 architectures are utilized without alterations. This study endeavors to emphasize the parallelism between different Machine Learning techniques through strategic data navigation, leveraging a concept that is generalizable across a diverse set of applications.
In our future endeavors, based on the potential of the developed sample prioritization method, our aim is to expand its application, as this study has opened up numerous promising opportunities for future research. A natural extension involves experimenting with a broader array of datasets. This expansion would illuminate any dataset-specific nuances affecting the efficacy of prioritization. In particular, exploring datasets with varying levels of complexity, diversity and size would deepen our understanding of how prioritization performs across different dataset characteristics.
Additionally, extending the application of prioritization to other tasks beyond image classification, such as object detection or semantic segmentation, is an exciting frontier. Object detection, with its unique challenges and requirements, could significantly benefit from prioritization strategies, especially in handling imbalanced datasets or focusing on rare but critical objects. This exploration would necessitate adapting prioritization metrics and strategies to account for bounding boxes and multiple objects per image, for instance. Another promising avenue for future work involves exploring the integration of prioritization with different backends.
Furthermore, a deeper integration of Reinforcement Learning and Supervised Learning presents another intriguing field for investigation. Specifically, Reinforcement Learning may presumably be utilized directly for the prioritization process through prediction of sample weights, which would introduce the possibility of a synergistic interaction between the two approaches. An RL agent could be developed to operate alongside the supervised learner, treating the assignment of sample weights as its actions.

Fig. 4 Training loop of the Deep-Q Network agent with Experience Replay Memory.

Fig. 5 Nature of rewards interpreted as delayed labels according to the Bellman equation.

Fig. 7 Illustration of the prioritized sampling methodology. The sampling probability distribution is updated at each iteration of the training process.

Fig. 9 Examples of duplicate labels with different classes assigned in the CIFAR100 dataset.

Fig. 10 Convergence curve of validation accuracy with standard deviation bound from 6 distinct random seeds.

Fig. 11 Variability in the usage of data points, categorized by class through the t-SNE algorithm.

Table 1 Comparison of validation accuracy across different configurations.