Context-adaptable radar-based people counting via few-shot learning

In many industrial or healthcare contexts, keeping track of the number of people is essential. Radar systems, with their low overall cost and power consumption, enable privacy-friendly monitoring in many use cases. Yet, radar data are hard to interpret and incompatible with most computer vision strategies. Many current deep learning-based systems achieve high monitoring performance but are strongly context-dependent. In this work, we show how context generalization approaches can let the monitoring system fit unseen radar scenarios without adaptation steps. We collect data via a 60 GHz frequency-modulated continuous wave (FMCW) radar in three office rooms with up to three people and preprocess them in the frequency domain. Then, using meta learning, specifically the Weighting-Injection Net, we generate relationship scores between the few training datasets and query data. We further present an optimization-based approach coupled with weighting networks that can increase the training stability when only very few training examples are available. Finally, we use pool-based sampling active learning to fine-tune the model in new scenarios, labeling only the most uncertain data. Without any adaptation step, we achieve over 80% and 70% accuracy when testing the meta learning algorithms in new radar positions and a new office, respectively.


Gianfranco Mauro: gianfranco.mauro@infineon.com; Manuel P. Cuellar: manupc@ugr.es

Introduction
Counting the number of people in an environment can be a crucial task not only in industrial settings but also in medical and safety scenarios. In difficult times, such as during a pandemic, keeping track of the occupancy of an environment can greatly reduce the risk of spreading a pathogen [1,2]. Estimating the presence of people can bring other advantages, such as enabling energy management plans in places with frequent turnover of people, such as hospitals, by smartly activating equipment and heating systems [3]. A non-automated measure may be challenging or impossible in many contexts, such as for pedestrian crowds in public areas [4]. The majority of solutions designed for people monitoring rely on images captured by cameras and thermal sensors [5]. Most camera-based solutions use RGB or time of flight (ToF) sensors, and occupancy information is estimated using computer vision [6,7] or machine learning [8-10]. Camera systems that use cross techniques for image segmentation and edge detection, such as convolutional neural networks (CNNs), achieve high performance even in crowded environments, but suffer from the inherent problem of a lack of privacy [11]. Thermal sensors, on the other hand, are much less privacy-invasive because of the usage of infrared frequencies and often lower image resolution [12]. Thermal sensors also have the advantage of being usable in the dark, but they can be affected by thermal noise, caused, for example, by heaters and sunlight. In addition, the lack of depth information generally does not allow distinguishing between people moving in the same direction. In contrast to visual solutions, many other systems exploit the measurement of environmental quantities. Radio-frequency (RF) and laser technologies are typically classified as non-image-based approaches [13]. The CO2 sensors, for
example, can be used to estimate the occupancy of a room by the concentration of carbon dioxide produced by individuals. Such systems are frequently low-power but must account for venting systems and are practically unusable in open spaces [14]. LiDARs often represent another privacy-friendly solution for people counting and tracking. Through the use of pulsed lasers and a scanner, a LiDAR generates 2-D or 3-D maps of the surrounding space [15,16]. Such systems frequently have high spatial resolution and frame rates, but they can be costly and power-consuming. RF-based systems have the advantage of raising almost no privacy concerns and depending little on light and weather conditions. These characteristics make them appropriate for monitoring several people. Wi-Fi technology, for example, can enable the recognition and segmentation of people even through walls and obstructions [17,18]. Wi-Fi modules, however, require high output power in the RF range (≈ W) and continuous operation to exploit their functionalities. In contrast, radar sensors are more versatile in many applications thanks to lower power consumption (≈ mW) and optimized system power management. Among radar modulations, frequency-modulated continuous wave (FMCW) is particularly suited to people monitoring, allowing accurate estimation of the range and velocity of both dynamic and static targets located within the device's field of view (FoV) [19,20]. Specifically, 60 GHz technology is particularly suitable for short-range people monitoring applications [21]. Radars transmitting around this frequency are cost-effective and versatile compared to other solutions such as cameras or LiDAR. Further, the 60 GHz frequency is much less susceptible to interference with other radio-frequency signals or Bluetooth devices. Image-based or high-resolution RF systems often implement a vision-based pipeline to predict the number of people in a given context. This approach can lead to
high classification performance even in the challenging task of tracking through image segmentation, edge detection, and skeleton pose extraction [6]. On the other hand, radar data are hard to interpret through classical computer vision approaches. In this case, deep learning (DL) techniques are commonly used to process the information [22].
DL is nowadays applied to a wide variety of tasks and used to speed up processes. Over the years, classes of DL models have been developed to extract valuable information from the available data for given tasks. Examples are CNNs for feature map generation or recurrent neural networks (RNNs) for processing time series. Multiple neural network topologies, such as Inception [23] and VGGNet [24], have been designed to solve specific tasks with successful outcomes. Yet, such topologies have the inherent need to be trained on a large amount of data to achieve robust performance across new contexts. Commonly, these models are adapted to new tasks by leveraging transfer learning [25], tailoring parameters to newly collected data. However, the limited availability of data and the need for rapid adaptation to new contexts make transfer learning hard to use for certain types of tasks. To deal with these challenges, a specific branch of DL called few-shot learning has gained momentum in recent years [26]. The goal of few-shot learning is to exploit the little available information and data patterns, leveraging previous experience to adapt to new contexts or solve tasks that have not been tackled before. Few-shot learning is approached from different perspectives by specific DL sub-branches such as meta learning and active learning [27,28].
Meta learning, or learning to learn, comprises the set of algorithms whose primary goal is to learn how to approach new tasks given some past experience, or metadata [29,30]. This process not only encourages context generalization but also accelerates the fine-tuning of already observed tasks when new data are available. If the meta learning is optimization-based, an iterative learning process called episodic learning, based on the available training data, is generally used. For a task defined as N-way, i.e., with N classes, the few available samples are called shots. To assess generalization performance, C support samples and J query samples are fed to the defined model for each class. Algorithms commonly used for meta learning are model-agnostic meta learning (MAML) [31] and Reptile [32], which, thanks to their very general formulation, enable the episodic adaptation of most of the common topologies defined in DL. Frameworks based on optimization-based meta learning are highly effective and perform well in several data-poor tasks [33,34]. However, they have an inherent need for training on a set of representative data for each new, unseen task to learn to generalize. A specific kind of method, called relation network [35], was created to obviate this need by exploiting the ability of the model to compare the features of different examples and learn to distinguish them. The comparison is possible by properly shaping the model topology and regressing a relation score between 0 and 1, comparing individual support and query examples. The relation scores are unconventionally regressed by minimizing the mean squared error (MSE) to the ground truth of query instances. This approach assumes that all available support instances are mutually independent. Intuitively, the model relies on a one-to-one comparison rather than comparing the new query examples with all the available support samples. Such issues are addressed by the weighting network [36]. In this adapted
topology, the relation between support and query is propagated through two modules: a first comparison module that extracts the similarity between the samples, and a second weighting module that compresses the information into a one-dimensional vector representing the relation scores. This method leverages all available support sample features for query prediction. Further, the weighting network supports the use of traditional classification cost functions such as cross-entropy during episodic optimization.
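The N-way episodic setup described above, with C support and J query samples per class, can be illustrated with a minimal sampling sketch. The function name `sample_episode`, the dictionary layout of the data, and the toy sample names are assumptions made for illustration only; they are not taken from the original work.

```python
import random

def sample_episode(data_by_class, n_way, c_support, j_query, rng=None):
    """Draw one N-way episode with C support and J query samples per class.

    data_by_class maps a class label to a list of samples (hypothetical layout).
    """
    rng = rng or random.Random()
    classes = rng.sample(sorted(data_by_class), n_way)
    support, query = [], []
    for label in classes:
        # Draw C + J distinct samples so that support and query never overlap.
        picked = rng.sample(data_by_class[label], c_support + j_query)
        support += [(x, label) for x in picked[:c_support]]
        query += [(x, label) for x in picked[c_support:]]
    return support, query

# Toy data: 4 classes (0 to 3 people), 20 placeholder samples each.
toy = {k: [f"rdi_{k}_{i}" for i in range(20)] for k in range(4)}
support, query = sample_episode(toy, n_way=4, c_support=1, j_query=5,
                                rng=random.Random(0))
```

With `c_support=1` this corresponds to a 4-way 1-shot episode; raising `c_support` to 2, 5, or 10 gives the other shot settings mentioned in the text.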
Active learning, on the other hand, aims to optimize the model's performance with as few labeled instances as possible [37,38]. To accomplish this, the algorithm has control over the inputs on which it trains and requests labels or additional information for the data it deems most useful for learning. A common strategy is to assign a priority score to the unlabeled data pool, exploiting, for example, the probability distribution generated by the model. Only the instances identified as most uncertain are then labeled and used during training. This procedure, called pool-based sampling, is normally repeated multiple times, increasing the amount of labeled training data, until satisfactory performance for a given task is achieved.
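One round of pool-based sampling can be sketched as follows. Using the Shannon entropy of the predicted class probabilities as the priority score is an assumption chosen for illustration (the text only states that the model's probability distribution may be exploited), and the function names and the toy probabilities are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def select_most_uncertain(pool_probs, budget):
    """Return the indices of the `budget` pool items with the highest entropy."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: entropy(pool_probs[i]),
                    reverse=True)
    return ranked[:budget]

# Toy model outputs for four unlabeled instances and four classes.
pool_probs = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.60, 0.30, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.10],
]
to_label = select_most_uncertain(pool_probs, budget=2)
```

In a full loop, the selected instances would be labeled, added to the training set, and the model retrained before the next selection round.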
In this paper, we show how few-shot learning techniques can grant generalization across scenarios (environments and locations) for an FMCW radar-based algorithm designed for people counting. The system is intended for uncrowded areas or rooms where only a few people need to be counted. For this work, a specific dataset was collected using a 60 GHz radar that was set up for the task of counting people. The information was gathered in three different offices with at least four different in-room locations. Per location, 0 to 3 people took part in the data recording for at least 60 seconds per session. The data were preprocessed in the frequency domain to extract range and Doppler information from the people in the scene. Meta learning is then used for the monitoring use case, estimating the number of people from radar data. Instead of using all the available data in a single training, we propose a few-shot episodic approach to foster and speed up adaptation. To meet the learning needs, we introduce both a new relation topology, which we call the Weighting-Injection Net, and an algorithm, which we call model-agnostic meta-weighting (MAMW). The Weighting-Injection Net represents a modification to the traditional weighting network presented in [36]. Instead of an embedding module that reduces the dimensionality of the support samples for the next comparison step, the proposed one uses an injection module. This module increases the dimensionality of input data, generating a feature-enriched representation of support and query samples for the next relational phase. The overall network scheme is shown in Fig.
1. The MAMW, on the other hand, combines the query relation strategy of the weighting network with the two-step optimization-based approach of MAML. This is meant to improve the stability of the few-shot episodic training, especially when only very few instances are available for training. Experiments with 1-, 2-, 5-, and 10-shot have been performed and analyzed for the proposed methods. The achieved generalization results have been compared with those of other state-of-the-art approaches. State-of-the-art comparisons are also conducted for up to five-person counting, to test the limitations of the radar-based episodic approach.
We also show how pool-based sampling active learning can be efficiently employed to fine-tune the performance of a relational model by exploiting the most uncertain data. In particular, we show that, for adaptation to new contexts, using the generalization information learned from episodic adaptation leads to a better fit than starting from random initialization. The active learning strategy has been used to fit the 1-shot-pre-trained model on data from an office room used as a test set, which is therefore unseen in the meta-training phase.
For the meta learning algorithms, we also conducted experiments on a publicly available dataset for few-shot learning, reported in Appendix A. The main contributions of this paper are as follows:
1. Implementation, to the best of our knowledge, of the first context-adaptable radar-based solution for counting people that does not require adaptation training.
2. Design and implementation of the Weighting-Injection Net. This network represents a variation of the weighting network with an injection module. The injection operation increases the dimension of support and queries to ease feature matching in the subsequent comparison module.

Related works
In this section, we first investigate state-of-the-art solutions for people counting that offer similar features to radar-based systems, such as privacy preservation and low frame resolution. We then focus on the specific approaches aimed at context generalization and active learning.
When low frame resolution and privacy are system needs, traditional image segmentation and detection methods are often replaced or aided by deep learning. Neural networks can also be used to process time series or generate density maps for crowd monitoring.
Massa et al. [39] presented a recurrent neural network (RNN) architecture called LRCN-RetailNet (long-term recurrent convolutional network) that takes as input sequences of low-resolution RGB frames and analyzes their spatiotemporal content for people counting. The strategy outperforms other state-of-the-art single-image-based approaches. However, a system based on temporal sequences may be unusable in low frame rate scenarios or with hardware implementation constraints. Gomez et al. [40] developed a system using long-wave infrared imaging and a CNN implementation on the NXP® LPC54102 microcontroller. The classification approach is binary, exploiting a small detection window on image sections to predict the presence or absence of heads. Because all weights fit in a 512 KB flash memory, the CNN can be easily deployed on the microcontroller. The counting algorithm using the embedded version of the model achieves an accuracy of 53.7% on test images with up to six people. This solution is very low-power and privacy-friendly, but the presence of heat sources in the environment could cause counting issues due to the low resolution of the thermal sensor.
The most common types of RF-based systems used for monitoring are Wi-Fi and radars that use impulse radio ultra-wideband (IR-UWB) or FMCW technology. Most of these solutions are inherently characterized by privacy preservation and low sensor resolution. Kianoush et al. [41] presented a people counting system via Wi-Fi radio infrastructure that uses an ensemble of models to leverage the space-frequency features of various transmission and reception channels. The ensemble exploits Bayesian techniques based on signal propagation statistics from RX to TX, a feed-forward neural network (FF-NN), and long short-term memory (LSTM). Some of the constructed ensembles achieve an accuracy of over 95% in the test setup. However, a network of Wi-Fi terminals is employed for this purpose, which results in higher power consumption and limits usability in other environments. Bao et al. [42] presented a CNN-based algorithm for people counting that focuses on extracting multiscale range-time maps from IR-UWB radar data. Sequences of radar frames are preprocessed to extract the peak information and remove the background. The single frames are then stacked together to form range-time maps. The method proved robust in counting up to 10 people in the selected environment. However, the time dependency and lack of velocity information may make the system unsuitable for real-time applications where multiple people may be at the same distance. Stephan et al.
[43] proposed a people counting solution via the BGT60TR13 radar system (60 GHz FMCW) that makes use of knowledge distillation from synchronized camera data during the model generation. The suggested architecture first processes the camera RGB data, exploiting an OpenPose network that extracts the people's poses through pre-trained layers of the VGG-16 network and a multi-stage CNN. The extracted information is then fed to a triplet network with a 32-D embedding layer to generate clusters for each person count class. Radar information is first preprocessed in the form of range Doppler images (RDI) and fed to an encoder with fully connected final layers that learn through knowledge distillation from camera embeddings. Information transfer is possible by minimizing the Kullback-Leibler (KL) divergence between radar and camera embeddings. The method is robust and leads, in the test phase, to an accuracy of up to 71% for six people with another radar sensor with different positions and orientations. What is learned through knowledge distillation, however, could significantly affect the capabilities of the architecture in new environments where morphological and light conditions would directly influence the camera data.
A few cutting-edge works attempt to solve the people counting problem through active learning or aim at context generalization.
Vandoni et al. [44] presented a solution that uses active learning, coupled with SVMs, to improve training on subareas of crowd images via head count. Samples that are more dissimilar from those already tagged are estimated in terms of their uncertainty via a metric that accounts for crowd density, called maximum excess over subarrays (MESA). Zhao et al. [45] also proposed an active learning solution for head counting in camera-based density maps. In this case, in the iterative process of sampling instances to be labeled, both crowd density information and dissimilarity from previous selections are employed. The sampling technique is a context-appropriate version of partition-based sample selection with weights (PSSW). The number of people is then regressed through mean absolute error (MAE) and MSE. Both methods presented in [44] and [45] are effective in improving the people count through uncertainty sampling in crowded scenes but are very dependent on the 2D RGB nature of the images. Zhang Yingying et al. [46] proposed a multi-column convolutional neural network (MCNN) to estimate crowd head counts from single images without temporal dependence. Even with a sparse number of people, the method outperforms other cutting-edge solutions on a variety of public datasets. The model, trained on a large dataset with various density map sizes, can be easily tuned for new datasets and contexts via transfer learning. The required resolution is nonetheless high and could create context-specific privacy issues. Reddy et al. [47] and Zan et al. [48] designed adaptive algorithms to generate crowd density maps using MAML with episodic training. In [47], a backbone consisting of the first layers of VGG-16 and a density map estimator are trained on various RGB sequences collected in different environments. These pioneering approaches show how meta learning can be effectively employed for people counting. Hou X. et al.
[49] presented a cross-domain solution for the estimation of density maps by episodic learning. In this case, a domain-invariant feature representation module is exploited, where synthetic and real camera data are used as source and target domains, respectively. The density maps are then generated using a pre-trained CNN and an algorithm called β-MAML, where β represents the generalization step's learning rate. The parameter β is dynamically adapted in the episodes by exploiting the gradient information of parts of the images. The number of people is finally estimated from the density maps. The meta learning approach yields more robust performance than other state-of-the-art methods for density map generation. However, the need for a camera sensor rules out low-resolution uses and contexts where privacy is a requirement. Some cutting-edge RF-based works also propose adaptive context generalization solutions. Hou H. et al. [50] illustrated a few-shot learning solution for indoor crowd counting using Wi-Fi technology. The solution consists of a two-stage framework called domain-agnostic and sample-efficient wireless indoor crowd (DaseCount). In a first stage of meta-training, two separate CNNs learn to extract human activity information from wireless channel state information (CSI) measurements. Generalization performance is improved at this stage by knowledge distillation. In the meta-testing phase, the features extracted via CNNs from the CSI data are fed to a few-shot regression algorithm for the people counting task. The presented framework achieves, on average, over 96% accuracy for counting up to eight people in various domain setups. Yet, the solution is computationally expensive for classifier retraining and may not be suitable for frequent Wi-Fi transceiver location changes. Zhang Yong et al.
[51] proposed a Wi-Fi-based few-shot learning solution for activity recognition that makes use of graph neural networks. The method uses a graph convolutional block attention module to extract activity-related information from CSI data. A final classification layer is used to classify the graph features and recognize the activity. The approach achieves a robust 99.74% accuracy in the 5-way 5-shot experiment for new environments and activities. Yet, much computation and memory are required for model adaptations.

System setup and radar preprocessing
In this section, we give a general overview of the system, discuss the data acquisition setup, and provide information about the employed radar board, its configuration, and the main preprocessing steps.

General overview of the system
Figure 2 depicts the overall framework. First, rooms for data gathering are chosen for the few-shot learning approach. The radar data are then gathered from various in-room locations with varying numbers of people. Preprocessing is performed to extract range and Doppler information about the people in the FoV of the device. The sequences of preprocessed frames are smoothed with a moving average to generate the individual instances of the meta-dataset. The data are then saved and labeled in session-specific folders. The folder names denote the label encoding, from 0 to 3, of the number of people who attended the session. In most of the proposed experiments, the information recorded in two rooms is used as input data for the episodic training of the meta learning model. The third room is instead utilized for testing. Model fine-tuning can be performed via active learning on the test data, using the meta learning model as a baseline.

Radar board
All radar data in this work were collected using the BGT60TR13C FMCW sensor [21]. With a center frequency $f_0$ of 60 GHz and a bandwidth of about 6 GHz, this radar represents a miniaturized and low-power solution. This $f_0$ and bandwidth are especially suitable for short-distance and indoor applications, resulting in low susceptibility to interference with other signals such as Wi-Fi or Bluetooth. Thanks to an operation-optimized duty cycle, the power consumption for sensing within 5 m is minimized to only 5 mW. The BGT60TR13C has one transmit (TX) and three receive (RX) channels built into the package. The RX antennas are placed orthogonally to each other to enable the reconstruction of azimuth and elevation angles of arrival (AoA) for the targets placed in the FoV. The information collected from the RX channels is mixed with the TX signal and digitized with 12-bit resolution via the board connected to the radar sensor (Fig. 3).

Radar configuration
The BGT60TR13C transmits a series of linearly frequency-modulated signals called chirps in a defined bandwidth $B_w$ around the central frequency $f_0$. Each chirp, of duration $t_c$, normally consists of a fixed number of samples $n_s$. During use, the information reflected in the RX channels is mixed with a transmitted signal reference and digitized, thus generating an output signal called intermediate frequency (IF).
Normally, for further preprocessing, the radar information is packed into frames, each containing the IF relative to a sequence of $N_c$ chirps. The theoretical maximum detection range $R_{max}$ and range resolution $\Delta r$ of an FMCW modulation are calculated using the following formulas:

$$R_{max} = \frac{c \, n_s}{4 B_w} \quad (1)$$
$$\Delta r = \frac{c}{2 B_w} \quad (2)$$

where $c$ stands for the speed of light in air. A narrow $B_w$ of 0.48 GHz was chosen to achieve an $R_{max}$ of about 10 m, which covers the entire size of the chosen environments. A resolution $\Delta r$ of about 31 cm was chosen to let several targets placed in front of the radar be distinguished even at a considerable distance. An $n_s$ of 64 samples per chirp has been selected. The maximum discernible velocity of the targets $V_{max}$ in one direction and the velocity resolution $\Delta v$ can instead be calculated with the following formulas:

$$V_{max} = \frac{c}{4 f_0 t_c} \quad (3)$$
$$\Delta v = \frac{c}{2 f_0 N_c t_c} \quad (4)$$

The average human walking speed is about 1.42 m/s. To allow detecting even faster motions, we opted for a $V_{max}$ of 3.5 m/s and a $\Delta v$ of about 11 cm/s. As a result, we set $t_c$ to 351 μs and $N_c$ to 64. To collect approximately seven frames every half second, a frame repetition time $f_{ps}$ of 75 ms was chosen. Furthermore, an analog-to-digital converter (ADC) sampling rate $F_s$ of 2 MHz was chosen. The parameters used to configure the BGT60TR13C for the people counting recordings in all the selected rooms are listed in Table 1.
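The configuration values quoted above can be cross-checked numerically with the textbook FMCW relations for real-valued sampling. The sketch below assumes that these standard relations are the ones intended in the text and that the center frequency is exactly 60 GHz; the function name `fmcw_limits` is hypothetical.

```python
# Speed of light in m/s (vacuum value, used as an approximation for air).
C_LIGHT = 299_792_458.0

def fmcw_limits(bandwidth_hz, n_samples, f0_hz, chirp_time_s, n_chirps):
    """Textbook FMCW range and velocity limits for real-valued (non-IQ) sampling."""
    wavelength = C_LIGHT / f0_hz
    range_res = C_LIGHT / (2.0 * bandwidth_hz)
    max_range = range_res * n_samples / 2.0
    max_velocity = wavelength / (4.0 * chirp_time_s)
    velocity_res = wavelength / (2.0 * n_chirps * chirp_time_s)
    return max_range, range_res, max_velocity, velocity_res

# Values from the radar configuration: B_w = 0.48 GHz, n_s = 64,
# f_0 = 60 GHz (assumed), t_c = 351 us, N_c = 64.
r_max, d_r, v_max, d_v = fmcw_limits(0.48e9, 64, 60e9, 351e-6, 64)
print(f"R_max = {r_max:.2f} m, range resolution = {d_r * 100:.1f} cm")
print(f"V_max = {v_max:.2f} m/s, velocity resolution = {d_v * 100:.1f} cm/s")
```

Changing `bandwidth_hz` or `chirp_time_s` in the call shows the trade-off between resolution and maximum unambiguous range or velocity.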

Recording setup
The BGT60TR13C radar system was mounted on a tripod for the people counting recordings, and the data were collected using a Raspberry Pi 4. The raw radar data were then processed and labeled offline at a later time on an eighth-generation Intel Core i5 processor (4 cores). Figure 4 depicts the setup used. Three different rooms of various sizes were chosen for data collection: an office of approximately 26 m² and two meeting rooms of about 20 and 39 m², respectively. Only a portion of the office has been used, with walls separating the other two areas. Various types of furniture, such as cabinets, desks, tables, and chairs, were left in the rooms and were not moved from their locations. The reflection of such objects represents the so-called clutter that characterizes the FMCW radar data. A graphical illustration of the three environments, indicated with the letters S, M, and B, standing for small, medium, and big, is provided in Fig. 5. Data were gathered in each room from at least the four corners. Data were also collected in three additional locations in the office room. At every location, the tripod was set up at a height ranging from 1.65 to 1.75 meters. Four sessions have been carried out per location, each lasting approximately 60 seconds for the meeting rooms and 90 seconds for the office. Each session contains data from 0 up to a maximum of 3 people in the room at the same time. Ten different people with heights ranging from 1.60 to 1.78 meters took part in the recordings. Some data with up to 5 people have been gathered in the big room to further test the performance of the developed algorithm. Before collecting data, user consent was obtained, and as much privacy and data anonymization as possible were maintained during the recordings. The collected data have not been made publicly available.

Radar preprocessing
Raw radar frames are difficult to interpret and label. The information to be fed to a DL model for learning purposes can be too noisy and highly context-dependent due to clutter.
In this work, we propose to preprocess the raw data collected for people counting by removing the clutter and extracting the Doppler and range information of the targets through frequency analysis with the fast Fourier transform (FFT).
We then perform two averages to reduce the noise in the data for the next model generation step. The first is computed for each frame, averaging the IF signal $Ch_{IF}(i)$ generated for each of the three RX channels ($i \in I_{RX}$), and the second over each recorded series of seven frames. The whole process, given the $f_{ps}$ of 75 ms, leads to the generation of about two RDIs per second. The main preprocessing steps are shown in Fig. 6.
The preprocessing steps performed for each RX-generated IF signal are as follows:
1. For each chirp (slow time), the average value of the samples (fast time) is calculated and then subtracted.
2. The IF signal is then multiplied in fast time with a Hanning window to reduce spectral leakage effects.
3. A 1-D FFT is performed on the samples to derive the range information of the targets.
4. A multiplication with a Hanning window is also applied in slow time.
5. A 1-D FFT is performed along the slow time to obtain the velocity information.
6. To drop the information of static objects, i.e., clutter, moving target indication (MTI) is applied (5), where $\mu \in [0, 1]$ is set to 0.9 and weights the importance of the current frame against the average of the previous ones of $Ch_{IF}(i)$.
7. For each $Ch_{IF}(i)$, a constant false alarm rate (CFAR) algorithm is used to locally select range and Doppler peaks in frequency and discard the surrounding information, thus increasing the signal-to-noise ratio (SNR).
8. To further improve the SNR, the $RDI(v)$ for each frame $v \in V$ is computed as the absolute value of the average of $Ch_{IF}(i)$ (6).
9. The RDIs thus generated are stored in a seven-frame buffer ($N_v$), which corresponds to roughly half the frame rate, i.e., about half a second of data. A moving average is performed on the buffer to further reduce the noise in the RDIs. These RDIs represent the individual instances of the people counting dataset that get labeled (7).

Even in the same environment, RDIs from classes 1 to 3 are difficult to distinguish from one another. Figure 8 shows a t-distributed stochastic neighbor embedding (t-SNE) with a 2-D component representation of all instances in the S room. The t-SNE succeeds in correctly clustering only data with zero people in the environment. A t-SNE representation of all collected data is shown in Fig. 9 according to the B-Test-Dataset split. Even with a larger amount of data, only the zero-person instances are easily clustered. In this case, it can also be observed that the test data, which represent the B room, have different features than the rest of the points. This is an important indication of the dependence of radar data on the location in which they are collected. Algorithms trained in a single location may be difficult to use in other environments and usually require adaptation. Euclidean distance was used as a metric, and Barnes-Hut was used as an optimization algorithm to generate the t-SNE representation.
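The nine preprocessing steps listed earlier in this section can be sketched with NumPy as below. This is a simplified illustration under several assumptions: the MTI update is written as an exponential running average (the exact form of Eq. 5 is not reproduced here), the CFAR peak selection of step 7 is omitted, and the seven-frame buffer is averaged in non-overlapping blocks, which matches the stated rate of about two RDIs per second. The function names and the random input are hypothetical.

```python
import numpy as np

def range_doppler(frame):
    """Steps 1-5 for one RX channel; frame has shape (n_chirps, n_samples)."""
    n_chirps, n_samples = frame.shape
    x = frame - frame.mean(axis=1, keepdims=True)      # step 1: remove per-chirp mean
    x = x * np.hanning(n_samples)[None, :]             # step 2: fast-time window
    x = np.fft.fft(x, axis=1)[:, : n_samples // 2]     # step 3: range FFT, positive bins
    x = x * np.hanning(n_chirps)[:, None]              # step 4: slow-time window
    return np.fft.fftshift(np.fft.fft(x, axis=0), axes=0)  # step 5: Doppler FFT

def mti(rd, background, mu=0.9):
    """Step 6 (assumed form): subtract the running average, then update it."""
    filtered = rd - background
    background = mu * rd + (1.0 - mu) * background
    return filtered, background

def process_sequence(frames, n_buffer=7, mu=0.9):
    """frames: (n_frames, n_rx, n_chirps, n_samples) -> list of averaged RDIs."""
    n_frames, n_rx = frames.shape[:2]
    background = [0.0] * n_rx
    buffer, rdis = [], []
    for v in range(n_frames):
        channels = []
        for i in range(n_rx):
            rd, background[i] = mti(range_doppler(frames[v, i]), background[i], mu)
            channels.append(rd)
        # Step 8: absolute value of the channel average (CFAR of step 7 omitted).
        buffer.append(np.abs(np.mean(channels, axis=0)))
        if len(buffer) == n_buffer:
            rdis.append(np.mean(buffer, axis=0))        # step 9: average the buffer
            buffer = []
    return rdis

# Random stand-in for 14 raw frames, 3 RX channels, 64 chirps, 64 samples.
demo = process_sequence(np.random.default_rng(0).standard_normal((14, 3, 64, 64)))
print(len(demo), demo[0].shape)
```

A real pipeline would insert a CFAR detector between the MTI and the channel averaging, and could use a sliding rather than block-wise buffer if a higher RDI rate were needed.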

Proposed approach
In this section, we present our solutions for context generalization and adaptation in radar-based people counting. We first introduce the Weighting-Injection Net, a modification of the weighting network [36]. We then propose an algorithm that makes use of optimization-based meta learning features from MAML [31], which we call MAMW. This modified version aims at increasing training stability when only a very limited number of shots per class are available. Then, we propose an active learning strategy tailored for weighting networks to allow fine-tuning in a new environment while minimizing the amount of required labeled data.

Meta learning
In episodic meta learning, K tasks are sampled from a distribution p(T_r) defined over D_{m-train}. As the episodes progress, the goal is to improve the performance of the model on tasks sampled from p(T_s) defined on D_{m-test}. In DL, task-based learning is often achieved via the gradient method, which involves training the parameters θ by minimizing a cost function L_{T_r}(f_θ), where f_θ represents the relation between the input x and the predicted output ŷ. In the relation networks [35], generalization among tasks is directly achieved thanks to the intrinsic comparison of instances enabled by the topology. In optimization-based meta learning, such as in MAML [31], the information learned for tasks T_r and encoded in the parameters θ′ is transferred to a base model f_θ with parameters θ, minimizing an outer cost function L_{T_r}(f_θ′). In this case, the task-specific cost function depends on the parameters θ of the base model, L_{T_r}(f_θ).
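The episodic sampling described above can be sketched as follows. This is a simplified illustration under stated assumptions: the helper `sample_task`, the dictionary layout of the dataset, and the dummy instance names are hypothetical, and the sketch ignores the room and location structure that defines tasks in the people counting meta-datasets.

```python
import random


def sample_task(dataset, n_way, c_shot, n_query, rng):
    """Draw one N-way C-shot task with disjoint support and query sets per class.

    `dataset` maps a class label to the list of its instances.
    """
    classes = rng.sample(sorted(dataset), n_way)
    support, query = {}, {}
    for label in classes:
        # Draw without replacement so that no instance is both support and query.
        picked = rng.sample(dataset[label], c_shot + n_query)
        support[label] = picked[:c_shot]
        query[label] = picked[c_shot:]
    return support, query


# Toy stand-in for a meta-training split: 4 classes (0 to 3 people), 20 dummy items each.
toy_dataset = {label: [f"rdi_{label}_{i}" for i in range(20)] for label in range(4)}
rng = random.Random(0)
support, query = sample_task(toy_dataset, n_way=4, c_shot=5, n_query=10, rng=rng)
```

Each call returns the supports and queries of a single episode; repeating the call K times yields the K tasks of an episodic training run.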

Weighting-injection net
The Weighting-Injection Net aims to compare the features of arbitrary query examples q with those of the reference support classes s for each task k ∈ K. The Weighting-Injection Net, as shown in Fig. 1, is based on three main modules: injection, comparison, and weighting. During training, the gradient information is propagated through all modules in both forward and backpropagation steps. For an N-way 1-shot task, the idea is to map the relationship between support examples s_n, where n ∈ ℕ: [1, 2, ..., N], to each query example q_j, where j is the index of the j-th example of the set.
The injection module e_θ generates a higher-dimensional representation of the input x to enhance the extraction and matching of features in the subsequent comparison step. Gradient information for the injection module is only propagated as e_θ(s_n) through the support instances. For the query, only the feature representation e_θ(q_j) is generated.
The comparison module c_θ takes as input the concatenation along N channels of e_θ(q_j) with each of the n support samples. The number of channels N corresponds to the number of ways of the task. The features are extracted in the module using convolution layer sequences, yielding a comparison vector z. The vector z is generated as in (8), where the concatenation is performed along the N channels.
Lastly, the weighting module w_θ is designed to generate a probability density from the concatenated N channels in the z vector. Each z_{n,j} is the output of the comparison module between the query q_j and a support s_n. The predicted output ŷ_j for the sample q_j can be expressed as in (9), where the sequence of concatenations is performed over the N channels of z.
In the case of an N-way C-shot task, where c ∈ ℕ: [1, 2, ..., C], the supports per class can be denoted as s_{n,c}. The Weighting-Injection Net can be leveraged in this case to create a more robust representation of the comparison vector z_{n,j}. This can be done by arithmetic averaging over C sets of N-channel concatenations, given by the embedded representations of q_j with each of the support sets s_{n,c}. Such a more robust representation yields the query class estimation with less bias than in the single support shot scenario. The mathematical expression for a single q_j is given in (10). The Weighting-Injection Net, trained on p(T_r), can be tested, thanks to its inherent structure, on tasks from p(T_s) without further training. Given a support set with elements s_{n,c} for an N-way C-shot task T ∼ p(T_s), the class probability density of the j-th query sample q_j is directly estimated by inference.
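The averaging over shots and the final mapping to class probabilities can be illustrated with a deliberately simplified sketch. The assumptions are substantial: the learned comparison module is replaced by a hypothetical stand-in `compare` (a negative squared distance returning a scalar rather than a vector), the learned weighting module is replaced by a plain softmax, and the inputs are taken to be already-extracted feature vectors. Only the structure of the computation (compare the query with every support, average over the C shots of each class, normalize over the N classes) follows the description above.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def compare(support_feat, query_feat):
    """Hypothetical stand-in for the comparison module: negative squared distance."""
    return -sum((s - q) ** 2 for s, q in zip(support_feat, query_feat))


def predict(supports, query_feat):
    """Return class probabilities for one query.

    supports[n] holds the C feature vectors available for class n.
    """
    # One averaged comparison score per class (arithmetic mean over the C shots).
    z = [
        sum(compare(s, query_feat) for s in shots) / len(shots)
        for shots in supports
    ]
    return softmax(z)


# Two classes, two shots each, 2-D toy features (illustrative values).
supports = [
    [[0.0, 0.0], [0.2, 0.0]],   # class 0
    [[1.0, 1.0], [1.0, 0.8]],   # class 1
]
probs = predict(supports, [0.9, 0.9])
```

Because no parameter is updated inside `predict`, the same function applies unchanged to supports from a new task, which mirrors the adaptation-free inference described in the text.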

Model-agnostic meta-weighting
The weighting network [36] represents a robust episodic learning algorithm thanks to the inherent feature of instance comparison. Yet, this method can be characterized by learning instability when only a few shots per class are available. Especially in 1-shot learning, this is due to the comparison of the query with the individual support instances, which may not be sufficiently descriptive of a class for a given task. Hence, we present a method called model-agnostic meta-weighting (MAMW), which tries to incorporate within the weighting network some features of optimization-based meta learning to enhance the stability and robustness of prediction in this setting. Specifically, in the MAMW, we propose to divide episodic learning into inner and outer steps. Given an N-way C-shot task:
1. In the inner step, the support instances are compared with a noisy version of themselves of Gaussian type via a function e_θ(φ(s_{n,c})). This noise is generated at random from the N(0, σ²) distribution in the interval [−σ, σ]. Defining s_h as the h-th support example, the computation of z_{n,h} can be expressed as in (11), where θ represents the parameters of the base model f_θ. Such operations can also be carried out in batches. An example of people counting instances compared with their noisy version is shown in Fig. 10.
2. In the outer step, the comparison between the support examples s_{n,c} and each query q_j is performed, starting from the weights θ′ learned in the inner loop. In this case, the comparison vectors z are computed with (10) and the predicted output ŷ_j with (9).
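The noise function φ used in the inner step can be sketched as below. The function name `add_bounded_gaussian_noise` and the toy input are hypothetical. Keeping the noise inside [−σ, σ] by clipping is an assumption of this sketch; redrawing out-of-range values would be another way to satisfy the same interval.

```python
import math
import random


def add_bounded_gaussian_noise(instance, variance, rng):
    """Return a noisy copy of `instance` (a flat list of floats).

    Noise is drawn from N(0, variance) and clipped to [-sigma, sigma].
    Clipping is an assumption; the text only states the interval.
    """
    sigma = math.sqrt(variance)
    noisy = []
    for value in instance:
        noise = rng.gauss(0.0, sigma)
        noise = max(-sigma, min(sigma, noise))
        noisy.append(value + noise)
    return noisy


rng = random.Random(42)
support = [0.0, 0.25, 0.5, 0.75, 1.0]   # toy stand-in for a flattened RDI
noisy_support = add_bounded_gaussian_noise(support, variance=0.005, rng=rng)
```

With a variance of 0.005, each element is perturbed by at most about 0.07, so the main structure of the support instance is preserved while the comparison target is no longer identical to the input.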
The main steps of the MAMW, in the case of few-shot, supervised learning with outer updates after every task, are defined in Algorithm 1.
The presented Weighting-Injection Net topology can be trained via the MAMW algorithm. With MAMW episodic learning as well, the Weighting-Injection Net can tackle new test tasks without the need for adaptation training.

Algorithm 1
5: for all s_h do
6: Compute z_{n,h} in (11)
7: Compute ŷ_h in (12)
8: Evaluate ∇_θ L_{T_k}(ŷ_h) by L_{T_k} for s_h
9: Compute adapted parameters with gradient descent: θ′ = θ − α∇_θ L_{T_k}(ŷ_h)
10: end for
11: Sample J query instances q_j from T_k
12: Compute z_{n,j} in (10)
13: Compute ŷ_j in (9)
14: Update θ ← θ − β∇_θ L_{T_k}(ŷ_j) for q_j
15: end for

Active learning
Active learning can also be used on top of a meta learning model to perform fine-tuning on a given task, leveraging the most uncertain queries during adaptation. We propose to use pool-based sampling active learning to fine-tune the Weighting-Injection Net on p(T_s), starting from what has been learned on p(T_r). We chose an uncertainty sampling strategy to let the algorithm decide at each training epoch which new examples to label. We test the approach with three different priority scores: least confidence (LC), margin sampling (MS), and entropy (E). For the instances q_j = {x_j, y_j} representing the input/output pairs of queries sampled from T, the priority scores S_p are defined in (13), (14), and (15), where P_θ(ŷ_max) is the highest posterior probability predicted by the model with parameters θ for x_j, and N is the number of classes.
Algorithm 2 defines the main steps of the proposed pool-based sampling on a task T. In general, Algorithm 2 represents a generalization of the pool-based sampling approach for relational models. For a given task, a set of class-related support examples is initially labeled. As the number of iterations increases, the uncertainty of the query examples is evaluated, and those with the highest priority score are added to the labeled dataset. A maximum number of support instances per class per iteration is also chosen. Instead of starting with random weights, parameters learned during episodic learning on training tasks can be used as the model initialization. The active learning procedure is therefore performed on unseen test tasks.
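For orientation, the three priority scores can be sketched with their common textbook definitions. These forms are assumptions and may differ in detail from the expressions used in this work; in particular, the margin score is written here as one minus the gap between the two most likely classes, so that a higher value always means a more uncertain query.

```python
import math


def least_confidence(probs):
    """1 - P(most likely class); higher means more uncertain."""
    return 1.0 - max(probs)


def margin_score(probs):
    """1 - (top1 - top2), so that a small margin yields a high priority."""
    top1, top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top1 - top2)


def entropy_score(probs):
    """Shannon entropy in nats; maximal for a uniform prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Illustrative softmax outputs over N = 4 classes.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]
scores = {
    name: (fn(confident), fn(uncertain))
    for name, fn in [
        ("LC", least_confidence),
        ("MS", margin_score),
        ("E", entropy_score),
    ]
}
```

Least confidence and margin sampling only look at the one or two largest probabilities, whereas entropy takes the whole distribution over the N classes into account.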
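The query selection at the core of the pool-based procedure can be sketched as follows. The function `select_queries`, the made-up prediction values, and the use of entropy as the priority score are assumptions for illustration; the model update and the resampling of supports and queries are omitted.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_queries(predictions, budget):
    """Return the indices of the `budget` most uncertain predictions."""
    ranked = sorted(
        range(len(predictions)),
        key=lambda j: entropy(predictions[j]),
        reverse=True,
    )
    return ranked[:budget]


# Softmax outputs for J = 4 queries over 4 classes (made-up numbers).
predictions = [
    [0.90, 0.05, 0.03, 0.02],
    [0.30, 0.30, 0.20, 0.20],
    [0.60, 0.30, 0.05, 0.05],
    [0.25, 0.25, 0.25, 0.25],
]
labeled_pool = []                       # plays the role of the labeled pool
for j in select_queries(predictions, budget=2):
    labeled_pool.append(j)              # an oracle would attach the true label here
```

Only the selected queries require a label, which is what keeps the annotation effort low during fine-tuning in a new environment.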

Experimental setup
In this section, we present all the results achieved on meta learning episodic experiments and active learning fine-tuning on the people counting meta-datasets (Section 3.6). The algorithms have been written in the Python programming language, using the TensorFlow™ module to implement the DL models. Further experiments on a public dataset have been performed and discussed in Appendix A. The codes related to the algorithms and topologies used for the meta learning experiments are available online¹. As the processing unit, we used an Nvidia® Tesla® P4 GPU and CUDA® Toolkit v11.1.0 for parallel computing.

Algorithm 2
q_j = {x_j, y_j}
6: while not done do
7: Compute z_{n,j} in (10)
8: Compute ŷ_j in (9)
9: Compute S_p of q_j with (13), (14) or (15)
10: With S_p of q_j, select A queries q_ja and ŷ_ja
11: Add all q_ja in D_p
12: Update
13: Sample in D_p support instances:
14: Sample in D_p, J query instances: q_j = {x_j, y_j}
15: end while

Meta learning experiments
All the episodic experiments have been performed with the topology presented in Section 4.1.1 and Fig. 1. Specifically, 4-way experiments with 1-, 2-, 5-, and 10-shot have been performed. The topology has been trained with two different algorithms. First, with the classical episodic few-shot training of weighting networks, as defined in [36], using the Weighting-Injection Net equations (Section 4.1.1). Further, the topology has been trained in episodic sequences of inner and outer steps, following the steps of the MAMW algorithm proposed in Section 4.1.2. All the results presented in this section refer to the two algorithms and are consistently called Weighting-Injection Net and MAMW. Comparison results of the two algorithms with the state of the art are presented in Section 5.1.1. The state-of-the-art comparison also features some application limit experiments for indoor people counting up to five individuals in a room.
A graphical representation of the model modules and respective layers is shown in Fig. 11. The model consists of 283,379 trainable parameters in its entire module sequence. Of the total, the injection module consists of 239,680 parameters, the comparison module of 39,936, and the weighting module of the remaining 4,180. To rescale feature size, max pooling is used in cascade to the 2D convolution (Conv2D) for the two modules e_θ and g_θ. In addition, batch normalization is used to increase the stability of training. All batch normalization layers are followed by a rectified linear unit (ReLU) activation function. To map the output vector into a probability distribution over the classes, the softmax is used as an activation function for w_θ. The cost function chosen for the query classification is categorical cross-entropy, and the optimization algorithm is Adam. β_1 and β_2 for Adam have been set to 0 and 0.5, respectively. A learning rate of 5e-4 has been chosen for the Weighting-Injection Net. A learning rate of 5e-4 has also been chosen for both the inner and outer steps of MAMW. For the Gaussian noise statistic on the MAMW inner step, a value of σ² equal to 0.005 has been chosen. This value represents an empirical choice, noting that larger values led to the loss of the main information in the support instances, while smaller values were less effective for the performance of the experiments.
Regardless of the number of shots, every meta-training experiment is performed over 22,000 episodes, each of a single training epoch. The episodic learning is carried out on D_{m-train}. The validation and testing have been performed at the end of each episode on 10-shot per class (40 samples) on tasks sampled from D_{m-train} and D_{m-test}, respectively.
All experiments have been carried out with an embedding size g of 64. Smaller embedding sizes resulted in non-convergent experiments, whereas larger sizes resulted in meta-overfitting on D_{m-train}. For the injection module, an output representation of 14 · 14 · g has been chosen (feature size). This led to a representation per image of 12,544 units (Table 2). On the Nvidia® Tesla® P4 GPU, the number of floating-point operations per second (FLOPS) for the injection module with this configuration is 108 megaFLOPS. The size in bytes of the weights of the model when saved in ".h5" format, regardless of the chosen episodic training algorithm and the number of shots, is 1,148 KB. Some experiments at varying feature sizes are also presented later in this section to test the benefits of the injection module over the standard embedding module.
The obtained values of prediction accuracy, model size, and single-sample prediction latency are compared to state-of-the-art values obtained by training other algorithms on the people counting dataset employed in this work. The accuracy results for the Weighting-Injection Net are reported for varying numbers of shots. Each experiment by algorithm,  and D_{m-test} is constructed every 2,200 episodes.

Table 2 The indices n and c represent the class and shot number, respectively. The index of the j-th query shot is represented by j. The g represents the embedding size, which was set to 64 in the experiments. ¹ For the Conv2D layers, the filter shape dimensions are, respectively, kernel height and width and input and output channels

In the following plots and paragraphs, statistical insights from one of the experiments performed are analyzed. Specifically, a MAMW 10-shot experiment on Mixed-Dataset is chosen thanks to the good generalization performance achieved. Figure 12 shows the set of box plots generated as the training episodes advance for the considered experiment. As the episodes progress, the mean and median values of the distributions rise while the quartiles and whiskers narrow. With episodes progressing, even the outliers move closer to the upper limit of accuracy. The described behavior demonstrates how, thanks to previously acquired experience, the model can generalize better on newly sampled tasks. This means that newly learned parameters θ generalize better in new contexts, i.e., new locations and test rooms, resulting in higher performance under the same learning conditions. Discrete accuracy density histograms can be used to represent the distribution underlying individual box plots. Graphical evidence of how the distribution tends to shift towards higher generalization accuracy can be observed by comparing the first and last histograms of the episodic optimization. Such density histograms can also be compared to a Gaussian probability distribution, thus showing what percentage of the achieved accuracy lies between the first and third quartiles. Figure 13 depicts a comparison of accuracy statistics for the examined experiment at the beginning and end of the episodic training. Even for tasks sampled only from D_{m-test}, the probability density tends, as the episodes progress, to take on a negative skew towards the upper limit of accuracy. The actual distributions underlying the box plots are not Gaussian but multi-modal, with density peaks due to the variable complexity of the sampled tasks.

The generalization capability can be addressed at the level of individual classes by constructing cumulative confusion matrices on task sequences. Labels 0 to 3 represent the real and predicted number of people for the two dataset splits. Figure 14 depicts the confusion matrices underlying the first and last box plots of Fig. 12 for both D_{m-train} and D_{m-test}. Figure 15 shows another example of cumulative confusion matrices for a Weighting-Injection Net 5-shot experiment on S-Test-Dataset. It is noticeable in both Figs. 14 and 15 that the model learns to generalize better as episodes progress for both unseen locations and rooms. Most misclassifications, especially at the end of episodic learning, lie around the main diagonal. This means that the models, in most cases, count ±1 person compared to the actual number of individuals in the environment. Moreover, the majority of the misclassifications happen for the classes of 1 to 3 persons, while the model easily succeeds in distinguishing the case 0, which corresponds to no people detected in the sensor's FoV. The per-class accuracy of the test confusion matrices in Fig. 15 turns out to be lower than that in Fig. 14. This is due not only to the use of 5-shot instead of 10-shot in the experiment but also to the higher complexity of the test tasks. In fact, the Fig. 15 experiment sampled all test tasks from a room not included in the training (S).
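The observation that most errors lie next to the main diagonal can be quantified directly from a cumulative confusion matrix. The sketch below is illustrative only: the helper `accuracy_within` is hypothetical and the matrix values are made up, not taken from Figs. 14 or 15.

```python
def accuracy_within(confusion, tolerance):
    """Fraction of samples whose predicted count is within `tolerance` of the true count.

    confusion[t][p] is the number of samples with true label t and predicted label p.
    """
    total = sum(sum(row) for row in confusion)
    hits = sum(
        count
        for t, row in enumerate(confusion)
        for p, count in enumerate(row)
        if abs(t - p) <= tolerance
    )
    return hits / total


# Made-up 4-class matrix (0 to 3 people); rows are true labels, columns predictions.
confusion = [
    [98, 2, 0, 0],
    [3, 70, 22, 5],
    [1, 20, 65, 14],
    [0, 6, 24, 70],
]
exact = accuracy_within(confusion, 0)        # standard accuracy (main diagonal only)
off_by_one = accuracy_within(confusion, 1)   # also accepts errors of one person
```

A large gap between the two values indicates that the model mostly confuses neighboring people counts rather than making arbitrary errors.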
The prediction accuracy values obtained as an average of the post-training tests for each experiment type are listed in Tables 3, 4, 5 for the three defined meta-datasets.
As can be observed from Tables 3, 4 and 5, regardless of the used meta-dataset, the 1- or 2-shot experiments performed with the MAMW lead to higher average accuracy values than the Weighting-Injection Net. In these specific cases, in episodic learning, the few supports per class make the prediction given by the Weighting-Injection Net less robust, as the learning depends solely on the comparison with the query. MAMW instead supplies more information to the model thanks to the initial comparison with a noisy version of the support samples, thus emphasizing the potential intrinsic noise of the query data. For the 5- and 10-shot experiments, the two episodic approaches lead to different performances with respect to the used meta-dataset. The MAMW outperforms the Weighting-Injection Net on the Mixed-Dataset, regardless of the number of shots. The Mixed-Dataset contains, in fact, recordings from all rooms, but with different locations and numbers

For relation-based topologies, there is no need to perform adaptation training for new tasks as a result of the direct comparison of features between the newly available support samples and the query. Therefore, the adaptation time to a new task is null. Instead, the inference time on a single sample (query) can be computed as a function of the number of shots. It corresponds to the time required by the model to predict the query class given the available supports. The time required to compute the z comparison vectors for all available supports is thus included in the inference time for single queries. As both the proposed algorithms share the same inference procedure, these values are independent of the employed approach. The single-sample inference time is also independent of the selected counting meta-dataset, given the same input size. Average inference values on a single query are listed in Table 6.
As can be seen from Table 6, the inference time for a single query increases as the number of shots increases. Multiple supports available per class enable a more robust prediction of the query class, as shown in (10). However, this requires the generation of multiple z comparison vectors, which, in proportion to the number of shots, lead to a progressive increase in inference time on a single query.
Classification accuracy is also dependent on the chosen feature representation dimension in the feature extraction module e_θ. In specific experimental settings, the injection can counter episodic overfitting effects by increasing feature size as opposed to the standard embedding. The 14 · 14 feature size chosen for all the other experiments is compared with two representations of 4 · 4 and 9 · 9, respectively. Given the size of an RDI example of 32 · 64 = 2,048, a feature representation of 4 · 4 · 64 = 1,024 converts the injection module into an embedding module. Compared with the 108 megaFLOPS required by the feature size of 14 · 14, the size 4 · 4 requires only 0.28 megaFLOPS. Overall, the injection operation, compared to embedding, results in the GPU performing significantly more FLOPS. This is due to the larger size of the extracted features in the convolutional layers. Table 7 features the results on the S-Test-Dataset, obtained with the Weighting-Injection Net as the feature size and the number of shots vary. The 1-shot experiment seems to benefit more from embedding than from an injection module. The squeezed representation of features in such experiments leads to a more compact representation. The entire weighting network can succeed in extracting key features from the few samples available per class in each episode, bringing benefits of generalized learning. On the other hand, as the number of shots increases, a larger representation of features seems to lead to greater benefits in training. With 5- or 10-shot per class, a larger feature space upstream of the comparison module facilitates feature extraction from the available support samples and yields better generalization results. The effect of overfitting on individual tasks is clearly visible by comparing the accuracy obtained with the 4 · 4 feature size between the 5- and 10-shot experiments. Contrary to the common scenario, the performance of the model worsens as the number of shots doubles. Without tuning the other hyperparameters, the small feature size favors single-task adaptation rather than generalized learning, thus reducing the overall performance.
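The distinction between injection and embedding follows from simple arithmetic on the representation sizes quoted above, as the short sketch below shows. The helper name `representation_units` and the labeling rule (injection when the representation is larger than the input RDI, embedding otherwise) are a reading of the text, not a definition taken from the implementation.

```python
def representation_units(height, width, channels):
    """Number of units in a feature map of the given shape."""
    return height * width * channels


RDI_UNITS = 32 * 64        # size of one input RDI
EMBEDDING_SIZE = 64        # g in the text

summary = {}
for side in (4, 9, 14):
    units = representation_units(side, side, EMBEDDING_SIZE)
    # A representation larger than the input expands the data (injection);
    # a smaller one compresses it (embedding).
    kind = "injection" if units > RDI_UNITS else "embedding"
    summary[side] = (units, kind)
```

Under this reading, only the 4 · 4 feature size compresses the input, while the 9 · 9 and 14 · 14 sizes both expand it.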

Comparison with the state-of-the-art and limitations
In this section, the results of Weighting-Injection Net and MAMW are compared to the results of other state-of-the-art meta learning methods for the task of people counting. Reptile [32] is used as a baseline algorithm. MAML 2nd [31] and a more stabilized version of MAML presented in Antoniou et al. [52] are the other algorithms  8. As can be observed, the MAMW turns out to be the best-performing method in all experiments apart from the 10-shot experiment, where, as commented in Section 5.1, the Weighting-Injection Net achieves a higher average accuracy. The accuracy values obtained with the proposed methods are better despite using 30% fewer trainable parameters. As the number of shots increases, relation-based models show an even larger accuracy gap than optimization-based ones due to the more robust prediction given by averaging the comparison vectors computed for the available support samples. Because of the direct mapping between sample and label in the learning process, the single-sample inference time for Reptile, MAML 2nd, and MAML+ is independent of the number of shots. Across all the experiments, on an average of 10,000 final tasks, the overall estimated inference time has been 33.47 ms. In comparison to the results in Table 6, only for the 10-shot experiments do the pure optimization-based methods turn out to be 25% faster for single inference, whereas they turn out to be slower in the other configurations.
The task adaptation time needed for the various algorithms is provided in Table 9. The considered state-of-the-art methods require an adaptation time per task that rises considerably as the number of shots increases. On the contrary, relation-based models, thanks to their comparison-based topology, do not require adaptation for new tasks and therefore lead to a null adaptation time. This results in a great advantage for relational topologies over traditional optimization-based topologies.
To test the application limits of the episodic learning approach for radar-based people counting, experiments were The extension of the counting approach to up to five people and the limitation of radar resolution for close targets in this scenario make generalization more complex. The increased complexity is reflected in the RDI input instances and features across the different recording locations. For this reason, with a larger number of shots, MAMW performs less well, favoring noise filtering in support samples rather than classification of query instances. Weighting-Injection Net, on the other hand, focuses directly on learning the query class and performs better in this scenario.
In general, although the proposed algorithms outperform the state of the art, they lead to an average accuracy of less than 60% over the six classes with 10 shots. This unfortunately shows that the purely episodic generalization approach with a few shots is limited to scenarios with a very small number of people. Adaptations to larger and more varied datasets or the use of radar sensors with higher resolution could obviate the current limitations. The weights of the counting model up to 5 people need an in-memory size of 1,156 KB. This value is slightly larger than that of the approach of up to 3 people. More information on a single experiment for the adaptation of up to five people is provided in Appendix B.

Active learning experiments
Active learning experiments with Algorithm 2 are intended to demonstrate how meta learning-driven model initialization benefits task fine-tuning. All the experiments have been carried out on the task of radar-based people counting, using 75% and 25% of the data collected in the S room as training and testing, respectively. This means that active learning aims to boost the estimation performance in counting people in the entire small room, given all the locations in which the RDIs were collected. Since all the in-room locations are considered at once, the adaptation in this case is more complex than during episodic training. The uncertainty-based experiments used priority scores S_p defined in (13), (14), and (15). As initialization, the parameters θ obtained after the 1-shot episodic learning of Weighting-Injection Net and MAMW on the remaining two environments (M and B) have been used. As D_p grows larger, the experiments are limited to a maximum of five supports per class. The selected number of epochs for the active learning training is 6,000. For each epoch, 4 queries (J) are to be sampled, with A of them labeled using the uncertainty-based approach. Table 11 compares the average results from three experiments for each defined S_p score to the random initialization of θ. As can be seen from the table, the results for initialization based on MAMW and Weighting-Injection Net vary very little as the chosen priority score differs. Such initialization, however, leads to a great performance gap compared to the random one, which also features training instability over repetitions. The Weighting-Injection Net also seems to achieve slightly better performance than the MAMW. This is most likely related to the large availability of labeled data, which, for a test room setup, makes this method more performant than MAMW (Section 5.1). In the case of random initialization, however, the model succeeds in learning almost exclusively when entropy S_e is used as the scoring function. This may be due to the entropy formulation itself, which results in a more balanced query selection by taking into account the distribution over all classes for the score computation. The accuracy learning curve for the entropy-based experiments is depicted in Fig. 16. Adaptation starting with Weighting-Injection Net and MAMW weights exhibits similar accuracy profiles as training epochs progress. Random initialization, on the other hand, not only leads to lower-performing learning but also to instability and experiment failure, collapsing to a 25% accuracy over the four classes. In this case, the algorithm encounters difficulties in generalizing to all locations with only a few learning data at a time. Fluctuations in accuracy curves are due to adaptation to new labeled data sampled from different S room locations, which normally display different features. This behavior can be observed in the t-SNE representations of the data in Section 3.6.

Table 9 The values, computed on Nvidia® Tesla® P4 GPU, are averaged over three repetitions of each experiment for 10,000 tasks. ¹ For MAMW and Weighting-Injection Net, considering only the need to compare the query with the available supports, the adaptation time is null (0 ms)

Table 11 All the S room data have been used for the adaptation. The results are averaged over three experiment repetitions of 6,000 iterations each. The initialization consists of meta-learned weights for the M and B rooms

Conclusion
This paper features how meta learning and active learning can be effectively employed for radar-based people counting using real-world data. For such a use case, multiple meta-datasets are generated based on different combinations of rooms and radar orientations. Episodic learning for few-shot adaptation is carried out through a comparative approach. The model learns task-wise to map features of query examples to representative support instances belonging to the same class. In this way, the class of a radar instance is predicted by comparing it with representative support examples of classes zero to three people. With respect to the traditional weighting network, an injection module increases the input data dimensionality before the comparison step. This process facilitates the comparison of query and support features, reducing episodic task overfitting and aiding generalization. The overall topology with an injection module is called the Weighting-Injection Net.
An episodic adaptation algorithm called model-agnostic meta-weighting is then presented for specific adaptations to very few shots per task. This two-step training algorithm combines the weighting network topology and the optimization-based meta learning approach to enhance the feature extraction capabilities of the model. The approach features an inner step task adaptation that compares support instances with a noisy version of themselves, leading to more stable generalization training, especially in the 1-shot training. Finally, a pool-based active learning approach designed specifically for relation-based methods is presented. Using only the available samples with the highest prediction uncertainty, this algorithm seeks to minimize the number of examples needed for learning.
The presented meta learning achieves state-of-the-art accuracy in people counting while also yielding other performance advantages. The relation-based topology requires no training time for adaptation at new radar test locations. Furthermore, the availability of multiple support examples per class allows for more robust averaged query estimation. Both the presented algorithms are up to 15% more accurate than the state of the art for 1- and 10-shot. They are also found to be up to 50% faster for computing single-sample inference when the model is tested on a new task. The active learning algorithm performs better and is more stable when the initialization is set to the episodically learned weights rather than at random. Non-random initialization improves radar adaptation accuracy by 30% on test room radar instances.
Despite the great benefits shown, the work presented is only tested offline on previously collected data. In the future, it will be important to test such a system in a real-time setting. The monitoring approach with more than three people leads to accuracy performance which may be insufficient in several practical contexts. Future work will focus on using relation-based topologies and sensor fusion to counter the current limitations. The use of an unconventional injection module for the relational networks could bring additional benefits for feature representation in episodic learning. In-depth studies will therefore be conducted on the possible applications and limitations of such a module. Research on the injection module will also be carried out in the field of the interpretability of neural networks and training complexity. Also, further active learning and uncertainty sampling strategies that focus on episodic learning with relation-based approaches will be investigated.
comparison have been adopted. The accuracy of the tasks is not calculated on a single query sample per class, as in Reptile [32], but on ten test instances per class in a step following the learning step. This allows a fairer comparison with relational algorithms, where the query example is not used in a step subsequent to the support ones. In addition, no data augmentation or scaling is performed on single inputs, in contrast to the MAML methods presented in [31,52]. For the state-of-the-art methods, the same CNN topology and configurations presented in Section 5.1.1 for radar-based people counting have been used on Omniglot.
All experiments have been performed on an Nvidia® Tesla® P4 GPU and CUDA® Toolkit v11.1.0 for parallel computing.
Similarly to what was observed in Section 5.1 for the radar-based people counting dataset, MAMW appears to perform better than the Weighting-Injection Net in the 1-shot and 10-way scenarios (Table 13). For the 5-shot 5-way experiment, the two relation-based algorithms achieved similar accuracy, comparable to that of MAML 2nd. This may stem from the fact that Omniglot, unlike radar data, has no intrinsic background noise in the input instances. Consequently, the introduction of noise in the comparison between supports in MAMW does not promote generalization when many shots are fed to the network. Conversely, the MAMW inner step may divert attention away from the learning goal of single tasks. Even for Omniglot,

Training and validation are performed on tasks sampled from locations A and C in the room, while testing is done on tasks sampled from locations B and D. The experiment is 6-way, since zero individuals in the room is also considered a class. Figures 18, 19 and 20 show different statistical insights of a 10-shot Weighting-Injection Net experiment. Figure 18 displays the trend of box plots built on accuracy as episodes increase. Compared to the training with up to three people (Fig. 12), the adaptation with up to five people shows a less pronounced trend of improvement. In this case, test accuracy stops improving from 15,000 episodes onward and saturates at around 55%. Figure 19 shows the density histograms underlying the first and last box plots constructed on the test in episodic learning. In comparison to the adaptation with up to three people (Fig. 13), no marked reduction in whiskers or negative skew in the last histogram is noticeable. Yet, the average accuracy increases from 37% to 55% (an improvement of 18 percentage points in generalization). Further insight comes from the accuracy on individual classes, obtained by generating the cumulative confusion matrices shown in Fig. 20. As in the confusion matrices generated for the 4-way approach (Figs. 14 and 15), the model easily succeeds in classifying the absence of people in the environment, reaching a solid 98% class accuracy in the test at the end of episodic learning. Further, as the episodes progress, the generalization approach yields higher accuracy in counting more than one person. Moreover, most of the misclassifications lie next to the main diagonal of the confusion matrix, which corresponds to an error of ±1. This means that most of the classification errors under- or overestimate the number of people in the room by only one unit.
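The observation that most errors fall next to the main diagonal can be quantified as a "within ±1" accuracy. The following is a minimal sketch, not the authors' evaluation code: the function name and the toy matrix are hypothetical, and the counts are made up purely for illustration.

```python
def exact_and_within_one_accuracy(confusion):
    """Return (exact, within ±1) accuracy for a square confusion matrix.

    confusion[i][j] counts instances whose true people count is i
    and whose predicted people count is j.
    """
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix is empty")
    # Exact accuracy: mass on the main diagonal.
    exact = sum(confusion[i][i] for i in range(len(confusion)))
    # Within ±1 accuracy: main diagonal plus the two adjacent diagonals.
    within_one = sum(
        count
        for i, row in enumerate(confusion)
        for j, count in enumerate(row)
        if abs(i - j) <= 1
    )
    return exact / total, within_one / total


# Made-up 3-class counts, for illustration only (not results from the paper).
toy = [
    [9, 1, 0],
    [2, 6, 2],
    [1, 3, 6],
]
exact_acc, pm1_acc = exact_and_within_one_accuracy(toy)
```

A large gap between the two values indicates that the classifier is usually off by at most one person when it is wrong.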

Fig. 1 Weighting network with an injection module (Weighting-Injection Net). At least one instance per class, represented in the figure with a different marker color and a label, is used as support. A query example belonging to one of the classes is what is to be associated with a label by the classification algorithm. An injection module trained on the support images enables the concatenation of a query with an increased-

Fig. 3 BGT60TR13 Radar System. The board filters, mixes, and digitizes data from each RX channel, located on top of the radar sensor

Fig. 4 Data recording setup. A Raspberry Pi 4 (a) is used for data storage. For data collection, the BGT60TR13C radar system is mounted on the tripod (b). The tripod is moved between sessions in the various rooms and locations (c)

Fig. 6 Flow diagram representing the main preprocessing steps. The yellow blocks represent the main time-domain steps. The orange ones instead represent the frequency-domain steps

In general, for each of the three generated meta-datasets, the training and test instances are part of the respective training D_{m-train} and test D_{m-test} meta-dataset splits. Three different averaged RDI examples per class, sampled from the different recordings in all rooms and locations, are shown in Fig. 7.

Fig. 7 Example RDI instances from the people counting dataset. Every row shows three examples per class, chosen from a random combination of rooms and locations. The axes indicate the relative motion velocity of people in m/s and the distance from the radar sensor in cm

Fig. 8 2-D t-SNE representation of all S room data. This t-SNE was obtained with a perplexity of 40 over 6,000 optimization iterations

Fig. 9 2-D t-SNE representation of the B-Test-Dataset, for all the recorded data. The B room data are represented by the "x" marker, while the rest of the data (rooms S and M) are represented by the "o" marker. This representation was obtained with a perplexity of 30 over 7,000 optimization iterations

Fig. 10 Examples of RDI without (a) and with added Gaussian noise (b) used in the inner step training of the MAMW

Fig. 11 Representation of the topology modules and respective layers used in the relational experiments. The injection module (e_θ) increases the data dimensionality via a sequence of convolutional layers. The query sample is compared with all the available support samples.

Fig. 12 Accuracy statistics box plots vs. episodes for a MAMW 10-shot Mixed-Dataset experiment. The red box plots are generated on validation tasks (a), whereas the blue ones (b) are generated on test

Fig. 13 MAMW 10-shot experiment, first (a) and last (b) box plot underlying distributions, generated on test tasks sampled from Mixed-Dataset. The q1 and q2 values on the Gaussians indicate the first and third quartiles, respectively. The probability density histograms show

Fig. 16 Entropy pool-based active learning accuracy across epochs. The thicker lines highlight the best experiments by type of initialization. Accuracy values are averaged per trial every 20 epochs. Random initialization (green) experiments are more unstable and collapse to the 25% chance level of a 4-class problem
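As a rough illustration of entropy-based pool sampling, the sketch below ranks unlabeled pool instances by the Shannon entropy of their predicted class probabilities and returns the most uncertain ones for labeling. It is a generic sketch under stated assumptions, not the authors' implementation: the function names, the `budget` parameter, and the toy probabilities are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_most_uncertain(pool_probs, budget):
    """Return the indices of the `budget` pool instances with highest entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:budget]


# Made-up softmax outputs for four unlabeled instances and 4 classes.
pool = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction, low entropy
    [0.25, 0.25, 0.25, 0.25],  # uniform prediction, maximum entropy
    [0.60, 0.30, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.10],
]
to_label = select_most_uncertain(pool, budget=2)
```

In an active learning loop, the selected instances would be labeled, added to the training set, and the model fine-tuned before the pool is ranked again.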

Fig. 18 Accuracy statistics box plots vs. episodes for a Weighting-Injection Net 10-shot 6-way experiment on radar-based people counting (B room). The red box plots are constructed on validation tasks (a),

Fig. 19 Weighting-Injection Net 10-shot 6-way, first (a) and last (b) box plot underlying distributions, generated on people counting test tasks. The q1 and q2 values on the Gaussians indicate the first and third quartiles, respectively. The probability density histograms show

from Infineon Technologies AG. With a center frequency of f_0 of 60 GHz

Proposed Framework. The setup is mounted in three rooms. Data sessions with a number of people from 0 to 3 in the scenario are collected and processed (orange). The frequency analysis is performed via the fast Fourier transform (FFT). Instances are generated via a moving average over frame sequences. A meta-dataset is then generated, and one room is used as the test dataset. A classifier is then episodically trained and tested. Active learning is used to fine-tune the model to a new environment (yellow)

Table 1
Radar Sensor Parameters Configuration

With an average duration of 60 seconds across all recordings in rooms S and B, a total of 1,677 and 1,702 examples were created, respectively. For M, a total of 4,290 examples were built with six available locations. With all the available instances, the following three meta-datasets have been generated:

the number of people (P_m), and the location, L ∈ [A, H].

• B-Test-Dataset: all the sub-folders (B, P_m, L) were used as test, while all the others ([S, M], P_m, L) were used as training. The numbers of training and test instances are 5,967 and 1,702, respectively.
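The room-held-out split described for the B-Test-Dataset can be sketched as follows. This is only an illustration under assumptions: instances are represented as hypothetical (room, people count, location) tuples, and the people count and location fields are left as placeholders because the per-folder breakdown is not given in the text. The per-room totals are the ones reported above.

```python
# Instance counts per room as reported in the text.
INSTANCES_PER_ROOM = {"S": 1677, "B": 1702, "M": 4290}


def split_by_test_room(instances, test_room):
    """Hold out every instance recorded in `test_room` as the test split."""
    train = [inst for inst in instances if inst[0] != test_room]
    test = [inst for inst in instances if inst[0] == test_room]
    return train, test


# Placeholder instances: (room, people_count, location), with the last two
# fields unknown here and therefore set to None.
instances = [
    (room, None, None)
    for room, count in INSTANCES_PER_ROOM.items()
    for _ in range(count)
]

train, test = split_by_test_room(instances, test_room="B")
print(len(train), len(test))  # 5967 1702
```

The printed sizes match the reported split, since 1,677 + 4,290 = 5,967 training instances remain once room B is held out.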

Table 2
Network Layers Configuration -People Counting

Table 3
Accuracy of the two meta learning approaches on people counting (4 classes): Mixed-Dataset

Table 4
Accuracy of the two meta learning approaches on people counting (4 classes): S-Test-Dataset

Table 5
Accuracy of the two meta learning approaches on people counting (4 classes): B-Test-Dataset

Table 6
Average single-sample inference time, computed as the average over all MAMW and Weighting-Injection Net experiments on all defined meta-datasets, as a function of the number of shots. Every experiment was run over 10,000 final tasks on an Nvidia® Tesla® P4 GPU

The latter, labeled MAML+, leverages the contributions of multi-step loss optimization (MSL), derivative-order annealing (DA), and cosine annealing of the meta-optimizer learning rate (CA). The model chosen for the state-of-the-art algorithms is a CNN suitable for the generalization goal, consisting of four main blocks. The first three blocks consist of a Conv2D layer with 64, 128, and 256 filters, respectively, followed by batch normalization and the ReLU activation function. The last block consists of a dense layer with 4 neurons, corresponding to the number of classes. This topology has 403,332 trainable parameters, compared to the 283,379 of MAMW and the Weighting-Injection Net. The adaptation training was done with Adam as the optimizer, with learning rates of 8e-3 and 7e-3 in the inner and outer cycles, respectively. Likewise, in this case, the values of β_1 and β_2 for Adam were set to 0 and 0.5, respectively. The model was trained for 22,000 episodes with a batch size of 2 and 4 epochs per task. The comparison was performed on 10,000 final tasks on S-Test-Dataset for 1-, 2-, 5- and 10-shot over 3 repetitions of each experiment. For each task, 10 test samples per class were randomly selected, resulting in 40 test instances in total. The computed mean classification accuracy values are listed in Table

Table 7
Section 3.6). In this case, five sessions of one minute each per location and number of people were collected and used. Locations A and C were used to generate training tasks, and locations B and D were used for testing tasks. Table 10 presents the results obtained on test data, averaged over three experiments and 10,000 final tasks. The results for this meta-dataset show similar characteristics to those where an entire room is used exclusively as a test. In general, the two proposed approaches outperform the state of the art regardless of the number of shots. MAMW proves more stable and performs better in experiments with very few shots (1- and 2-shot). The Weighting-Injection Net, on the other hand, outperforms MAMW in the 5- and 10-shot settings.
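The location-based task generation described above can be sketched as an episodic sampler that draws support and query instances per class from a restricted set of locations. This is a minimal sketch, not the authors' pipeline: the `sample_task` helper, the dictionary layout, the placeholder instance names, the 4-class setup (0 to 3 people), and the number of queries per class are all illustrative assumptions.

```python
import random


def sample_task(data, locations, classes, k_shot, n_query, rng):
    """Sample one N-way K-shot task restricted to the given locations.

    `data` maps (location, people_count) to a list of instances.
    Returns (support, query) lists of (instance, label) pairs.
    """
    support, query = [], []
    for label in classes:
        # Pool all instances of this class recorded at the allowed locations.
        pool = [x for loc in locations for x in data[(loc, label)]]
        # Sampling without replacement keeps support and query disjoint.
        picked = rng.sample(pool, k_shot + n_query)
        support += [(x, label) for x in picked[:k_shot]]
        query += [(x, label) for x in picked[k_shot:]]
    return support, query


rng = random.Random(0)
classes = list(range(4))  # assumed: 0 to 3 people in the room
# Placeholder instance identifiers standing in for RDI examples.
data = {
    (loc, c): [f"{loc}-{c}-{i}" for i in range(30)]
    for loc in ("A", "B", "C", "D")
    for c in classes
}

# Training tasks come from locations A and C, test tasks from B and D.
train_support, train_query = sample_task(data, ("A", "C"), classes, 10, 10, rng)
test_support, test_query = sample_task(data, ("B", "D"), classes, 10, 10, rng)
```

Keeping the location sets disjoint between training and test tasks is what makes the test accuracy a measure of generalization to unseen radar positions.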