Introduction

Counting the number of people in an environment can be a crucial task not only in industrial settings but also in medical and safety scenarios. In difficult times, such as during a pandemic, keeping track of the occupancy of an environment can greatly reduce the risk of spreading a pathogen [1, 2]. Estimating the presence of people can lead to other advantages, such as enabling energy management plans in places with frequent turnover of people, such as hospitals, by smartly activating equipment and heating systems [3]. A non-automated measure may be challenging or impossible in many contexts, such as for pedestrian crowds in public areas [4]. The majority of solutions designed for people monitoring rely on images captured by cameras and thermal sensors [5]. Most camera-based solutions use RGB or time of flight (ToF) sensors, and occupancy information is estimated using computer vision [6, 7] or machine learning [8, 9, 10]. Camera systems that use cross techniques for image segmentation and edge detection, such as convolutional neural networks (CNNs), achieve high performance even in crowded environments, but suffer from the inherent problem of a lack of privacy [11]. Thermal sensors, on the other hand, are much less privacy-invasive because of the usage of infrared frequencies and often lower image resolution [12]. Thermal sensors also have the advantage of being usable in the dark, but they can be affected by thermal noise, caused, for example, by heaters and sunlight. In addition, the lack of depth information generally does not allow distinguishing between people moving in the same direction. In contrast to visual solutions, many other systems exploit the measurement of environmental quantities. Radio-frequency (RF) and laser technologies are typically classified as non-image-based approaches [13]. The CO2 sensors, for example, can be used to estimate the occupancy of a room by the concentration of carbon dioxide produced by individuals. 
Such systems are frequently low-power but must account for venting systems and are practically unusable in open spaces [14]. LiDARs are often another privacy-friendly solution for people counting and tracking. Through the use of pulsed lasers and a scanner, a LiDAR generates 2-D or 3-D maps of the surrounding space [15, 16]. Such systems frequently have high spatial resolution and frame rates, but they can be costly and power-consuming. RF-based systems have the advantage of having almost no privacy concerns and little dependence on light and weather conditions. These characteristics make them appropriate for monitoring several people. Wi-Fi technology, for example, can enable the recognition and segmentation of people even through walls and obstructions [17, 18]. Wi-Fi modules, however, require high output power in the RF range (\(\approx \) W) and continuous operation to exploit their functionalities. In contrast, radar sensors are more versatile in many applications thanks to lower power consumption (\(\approx \) mW) and optimized system power management. Among radar modulations, frequency-modulated continuous wave (FMCW) is particularly suited to people monitoring, allowing accurate estimation of the range and velocity of both dynamic and static targets located within the device’s field of view (FoV) [19, 20]. Specifically, 60 GHz technology is particularly suitable for short-range people monitoring applications [21]. Radars transmitting around this frequency are cost-effective and versatile compared to other solutions such as cameras or LiDAR. Further, the 60 GHz frequency is much less susceptible to interference from other radio-frequency signals, such as those of Bluetooth devices. Image-based or high-resolution RF systems often implement a vision-based pipeline to predict the number of people in a given context. 
This approach can lead to high classification performance even in the challenging task of tracking through image segmentation, edge detection, and skeleton-pose extraction [6]. On the other hand, radar data are hardly interpretable through classical computer vision approaches. In this case, deep learning (DL) techniques are commonly used to process the information [22].

DL is nowadays applied to a wide variety of tasks and is used to speed up many processes. Over the years, classes of DL models have been developed to extract valuable information from the available data for given tasks. Examples are CNNs for feature map generation or recurrent neural networks (RNNs) for processing time series. Multiple neural network topologies, such as Inception [23] and VGGNet [24], have been designed to solve specific tasks with successful outcomes. Yet, such topologies have the inherent need to be trained on a large amount of data to achieve robust performance across new contexts. Commonly, these models are adaptable to new tasks by leveraging transfer learning [25], tailoring parameters to newly collected data. However, the limited availability of data and the need for rapid adaptation to new contexts make transfer learning hard to use for certain types of tasks. To deal with these challenges, a specific branch of DL called few-shot learning has gained momentum in recent years [26]. The goal of few-shot learning is to exploit the little available information and data patterns, leveraging previous experience to adapt to new contexts or solve tasks that have not been tackled before. Few-shot learning is approached from different perspectives by specific DL sub-branches such as meta learning and active learning [27, 28].

Meta learning, or learning to learn, accounts for the set of algorithms where the primary goal is to learn how to approach new tasks given some past experience, or meta-data [29, 30]. This process not only encourages context generalization but also accelerates the fine-tuning of already observed tasks when new data are available. If the meta learning is optimization-based, an iterative learning process called episodic learning based on available training data is generally used. For a task defined in N–way, i.e., N classes, the few available samples are called shots. To assess generalization performance, C samples of support and J samples of query are fed to the defined model for each class. Algorithms commonly used for meta learning are model agnostic meta learning (MAML) [31] and Reptile [32] which, thanks to their very general conceptualization, enable the episodic adaptation of most of the common topologies defined in DL. Frameworks based on optimization-based meta learning are highly effective and perform well in several data-poor tasks [33, 34]. However, they have an inherent need for training on a set of representative data for each new, unseen task to learn to generalize. A specific kind of method, called relation network [35], was created to obviate this need by exploiting the ability of the model to compare the features of different examples and learn to distinguish them. The comparison is possible by properly shaping the model topology and regressing a relation score between 0 and 1, comparing individual support and query examples. The relation scores are unconventionally regressed by minimizing the mean squared error (MSE) to the ground truth of query instances. This approach assumes that all available support instances are mutually independent of each other. Intuitively, the model relies on a one-to-one comparison rather than comparing the new query examples with all the available support samples. Such issues are addressed by the weighting network [36]. 
In this adapted topology, the relation between support and query is propagated through two modules: a first comparison module that extracts the similarity between the samples, and a second weighting module that compresses the information into a one-dimensional vector representing the relation scores. This method leverages all available support sample features for query prediction. Further, the weighting network allows the use of traditional classification cost functions, such as cross-entropy, during episodic optimization.
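The episodic setup described above (an N–way task with C support and J query samples per class) can be illustrated with a minimal sketch. The function name `sample_episode` and the toy label vector are hypothetical and only serve the illustration; the paper does not specify how episodes are drawn in practice.

```python
import numpy as np

def sample_episode(labels, n_way, c_support, j_query, rng):
    """Draw one N-way episode: C support and J query indices per class."""
    # Pick the N classes of this episode without repetition.
    classes = rng.choice(np.unique(labels), size=n_way, replace=False)
    support, query = [], []
    for cls in classes:
        # Shuffle the indices of this class, then split them disjointly.
        idx = rng.permutation(np.flatnonzero(labels == cls))
        if idx.size < c_support + j_query:
            raise ValueError(f"class {cls} has too few samples")
        support.extend(idx[:c_support])
        query.extend(idx[c_support:c_support + j_query])
    return np.array(support), np.array(query)

# Toy label vector (assumption): 4 classes (0 to 3 people), 20 samples each.
labels = np.repeat(np.arange(4), 20)
rng = np.random.default_rng(0)
support_idx, query_idx = sample_episode(
    labels, n_way=4, c_support=1, j_query=5, rng=rng
)
```

With `c_support=1` this corresponds to a 4–way 1–shot episode; support and query indices never overlap, so the query set measures generalization within the episode.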

Active learning, on the other hand, aims to optimize the model’s performance with as few labeled instances as possible [37, 38]. To accomplish this, the algorithm has control over the inputs on which it trains, labels, or requests additional information about the data it deems most useful for learning. A common strategy is to assign a priority score to the unlabeled data pool, exploiting, for example, the probability distribution generated by the model. Only the instances identified as most uncertain are then labeled and used during training. This procedure, called pool-based sampling, is normally repeated multiple times, increasing the amount of labeled training data, until satisfactory performance for a given task is achieved.
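The selection step of pool-based sampling can be sketched as follows. This is a minimal illustration that assumes the priority score is the Shannon entropy of the predicted class probabilities; the text above only states that the model's probability distribution can be exploited, so the entropy criterion and the function names are assumptions.

```python
import numpy as np

def entropy_scores(probs):
    """Shannon entropy of each row of class probabilities (higher = more uncertain)."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def select_most_uncertain(probs, k):
    """Return the indices of the k pool instances with the highest entropy."""
    scores = entropy_scores(probs)
    return np.argsort(scores)[::-1][:k]

# Hypothetical model outputs for four unlabeled pool instances (4 classes).
pool_probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.60, 0.30, 0.05, 0.05],
    [0.40, 0.40, 0.10, 0.10],
])
to_label = select_most_uncertain(pool_probs, k=2)
print(to_label)  # indices of the instances to send for labeling
```

In a full loop, the selected instances would be labeled, moved from the pool to the training set, and the model retrained before the next selection round.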

Fig. 1

Weighting network with an injection module (Weighting-Injection Net). At least one instance per class, represented in the figure with a different marker color and a label, is used as support. A query example belonging to one of the classes is what is to be associated with a label by the classification algorithm. An injection module trained on the support images enables the concatenation of a query with an increased-dimensionality representation of each support. A comparison module merges support and query information by mapping the relation into a one-dimensional vector. Finally, a weighting module composed of fully connected layers maps the relational information to the query label. The model parameters are represented by \(\theta \)

In this paper, we show how few-shot learning techniques can grant generalization of scenarios (environments and locations) for an FMCW radar-based algorithm designed for people counting. The application of this system is intended for uncrowded areas or rooms where there is a need to count the presence of a few people. For this work, a specific dataset was collected using a 60 GHz radar that was set up for the task of counting people. The information was gathered in three different rooms with at least four different in-room locations. Per location, 0 to 3 people took part in the data recording for at least 60 seconds per session. The data were preprocessed in frequency to extract range and Doppler information from the people in the scene. Meta learning is then used for the monitoring use case, estimating the number of people from radar data. Instead of using all the available data in a single training, we propose a few-shot episodic approach to foster and speed up adaptation. To meet the learning needs, we introduce both a new relation topology, which we call the Weighting-Injection Net, and an algorithm, which we call model-agnostic meta-weighting (MAMW). The Weighting-Injection Net represents a modification to the traditional weighting network presented in [36]. Instead of an embedding module that reduces the dimensionality of the support samples for the next comparison step, the proposed one uses an injection module. This module increases the dimensionality of input data, generating a feature-enriched representation of support and query samples for the next relational phase. The overall network scheme is shown in Fig. 1. The MAMW, on the other hand, combines the query relation strategy of the weighting network with the two-step optimization-based approach of MAML. This is meant to improve the stability of the few-shot episodic training, especially when only very few instances are available for training. 
Experiments with 1–, 2–, 5–, and 10–shot have been performed and analyzed for the proposed methods. The achieved generalization results have been compared with those of other state-of-the-art approaches. State-of-the-art comparisons are also conducted up to five-person counting, to test the limitations of the radar-based episodic approach.

We also show how pool-based sampling active learning can be efficiently employed to fine-tune the performance of a relational model by exploiting the most uncertain data. In particular, we show that, for adaptation to new contexts, starting from the generalization information learned through episodic adaptation leads to a better fit than starting from random initialization. The active learning strategy has been used to fit the 1–shot pre-trained model on data from an office room that is used as a test set and is therefore unseen in the meta-training phase.

For the meta learning algorithms, we also conducted experiments on a publicly available dataset for few-shot learning, reported in Appendix A. The main contributions of this paper are as follows:

  1. Implementation, to the best of our knowledge, of the first context-adaptable radar-based solution for counting people without a necessary adaptation training.

  2. Design and implementation of the Weighting-Injection Net. This network represents a variation of the weighting network with an injection module. The injection operation increases the dimension of support and queries to ease feature matching in the subsequent comparison module.

  3. Design of a cross-algorithm between MAML and the weighting network, called MAMW, to increase the training stability of 1– and 2–shot experiments.

  4. Development of a pool-based sampling active learning algorithm compatible with weighting network topologies.

1 Related works

In this section, we first investigate state-of-the-art solutions for people counting that offer similar features to radar-based systems, such as privacy preservation and low frame resolution. We then focus on the specific approaches aimed at context generalization and active learning.

When low frame resolution and privacy are system needs, traditional image segmentation and detection methods are often replaced or aided by deep learning. Neural networks can also be used to process time series or generate density maps for crowd monitoring.

Massa et al. [39] presented a recurrent neural network (RNN) architecture called LRCN-RetailNet (Long-term recurrent convolutional network) that takes as input sequences of low-resolution RGB frames and analyzes their spatiotemporal content for people counting. The strategy outperforms other state-of-the-art single-image-based approaches. The system based on temporal sequences may be unusable in low frame rate scenarios or with hardware implementation constraints. Gomez et al. [40] developed a system using long-wave infrared imaging and a CNN implementation on the NXP® LPC54102 microcontroller. The classification approach is binary, exploiting a small detection window on image sections to predict the presence or absence of heads. Because all weights fit in a 512 KB flash memory, the CNN can be easily deployed on the microcontroller. The counting algorithm using the embedded version of the model achieves an accuracy of 53.7% on test images and up to six people. This solution is very low-power and privacy-friendly, but the presence of heat sources in the environment could cause counting issues due to the low resolution of the thermal sensor.

The most common types of RF-based systems used for monitoring are Wi-Fi and radars that use impulse radio ultra-wide band (IR-UWB) or FMCW technology. Most of these solutions are inherently characterized by privacy preservation and low sensor resolution. Kianoush et al. [41] presented a people counting system via Wi-Fi radio infrastructure that uses an ensemble of models to leverage the space-frequency features of various transmission and reception channels. The ensemble exploits Bayesian techniques based on signal propagation statistics from RX to TX, a feed-forward neural network (FF-NN), and long-short-term memory (LSTM). Some of the constructed ensembles achieve an accuracy of over 95% in the test setup. However, a network of Wi-Fi terminals is employed for this purpose, which results in higher power consumption and challenges usability in other environments. Bao et al. [42] featured a CNN-based algorithm for people counting focusing on extracting multi-scale range-time maps from IR-UWB radar data. Sequences of radar frames are preprocessed to extract the peak information and remove the background. The single frames are then stacked together to form range-time maps. The method proved robust in counting up to 10 people in the selected environment. However, the time dependency and lack of velocity information may make the system unsuitable for real-time applications where multiple people may be at the same distance. Stephan et al. [43] proposed a people counting solution via the BGT60TR13 radar system (60 GHz FMCW) that makes use of knowledge distillation from synchronized camera data during the model generation. The suggested architecture first processes the camera RGB data, exploiting an OpenPose network that extracts the people’s poses through pre-trained layers of the VGG-16 network and a multi-stage CNN. The extracted information is then fed to a triplet network with a 32-D embedding layer to generate clusters for each person count class. 
Radar information is first preprocessed in the form of range Doppler images (RDI) and fed to an encoder with fully connected final layers that learn through knowledge distillation from camera embeddings. Information transfer is possible by minimizing the Kullback-Leibler (KL) divergence between radar and camera embeddings. The method is robust and leads, in the test phase, to an accuracy of up to 71% for six people with another radar sensor with different positions and orientations. What is learned through knowledge distillation, however, could significantly affect the capabilities of the architecture in new environments where morphological and light conditions would directly influence the camera data.

A few cutting-edge works attempt to solve the people counting problem through active learning or aim at context generalization.

Vandoni et al. [44] featured a solution that uses active learning, coupled with SVMs, to improve training on subareas of crowd images via head count. Samples that are more dissimilar than those already tagged are estimated in terms of their uncertainty via a metric that accounts for crowd density, called maximum excess over subarrays (MESA). Zhao et al. [45] also proposed an active learning solution for head counting in camera-based density maps. In this case, both crowd density information and dissimilarity from previous selections are employed in the iterative process of sampling the instances to be labeled. The sampling technique is a context-appropriate version of partition-based sample selection with weights (PSSW). The number of people is then regressed through mean absolute error (MAE) and MSE. Both methods presented in [44] and [45] prove effective in improving the people count through uncertainty sampling in crowded scenes but are very dependent on the 2D RGB nature of the images. Zhang Yingying et al. [46] proposed a multi-column convolutional neural network (MCNN) to estimate crowd head counts from single images without temporal dependence. Even with a sparse number of people, the method outperforms other cutting-edge solutions on a variety of public datasets. The model, trained on a large dataset with various density map sizes, can be easily tuned for new datasets and contexts via transfer learning. The required resolution is nonetheless high and could create context-specific privacy issues. Reddy et al. [47] and Zan et al. [48] designed an adaptive algorithm to generate crowd density maps using MAML with episodic training. In [47], a backbone consisting of the first layers of VGG-16 and a density map estimator are trained on various RGB sequences collected in different environments. These pioneering approaches show how meta learning can be effectively employed for people counting. Hou X. et al. 
[49] presented a cross-domain solution for the estimation of density maps by episodic learning. In this case, a domain-invariant feature representation module is exploited, where synthetic and real camera data are used as source and target domains, respectively. The density maps are then generated using a pre-trained CNN network and an algorithm called \(\beta \)-MAML, where \(\beta \) represents the generalization step’s learning rate. The parameter \(\beta \) is dynamically adapted in the episodes by exploiting the gradient information of parts of the images. The number of people is finally estimated from the density maps. The meta learning approach presents more robust performance for the algorithm than other state-of-the-art methods for density map generation. However, the need for a sensor camera does not allow for low-resolution uses or where privacy is a requirement.

Some cutting-edge RF-based works also propose adaptive context generalization solutions. Hou H. et al. [50] illustrated a few-shot learning solution for indoor crowd counting using Wi-Fi technology. The solution consists of a two-stage framework called domain-agnostic and sample-efficient wireless indoor crowd (DaseCount). In a first stage of meta-training, two separate CNNs learn to extract human activity information from wireless channel state information (CSI) measurements. Generalization performance is improved at this stage by knowledge distillation. In the meta-testing phase, the features extracted via CNNs from the CSI data are fed to a few-shot regression algorithm for the people counting task. The presented framework achieves, on average, over 96% accuracy for counting up to eight people in various domain setups. Yet, the solution is computationally expensive for classifier retraining and may not be suitable for frequent Wi-Fi transceiver location changes. Zhang Yong et al. [51] proposed a WI-FI-based few-shot learning solution for activity recognition that makes use of graph neural networks. The method uses a graph convolutional block attention module to extract activity-related information from CSI data. A final classification layer is used to classify the graph features and recognize the activity. The approach presents a robust 99.74% accuracy in the 5–way 5–shot experiment for new environments and activities. Yet, much computation and memory are required for model adaptations.

2 System setup and radar preprocessing

In this section, we propose a general overview of the system, discuss the data acquisition setup, and provide information about the employed radar board, its configuration, and the main preprocessing steps.

2.1 General overview of the system

Figure 2 depicts the overall framework. First, rooms for data gathering are chosen for the few-shot learning approach. The radar data are then gathered from various in-room locations with varying numbers of people. Preprocessing is performed to extract range and Doppler information about the people in the FoV of the device. The sequences of preprocessed frames are smoothed with a moving average to generate the individual instances of the meta-dataset. The data are then saved and labeled in session-specific folders. The folder names denote the label encoding, from 0 to 3, of the number of people present during the session. In most of the proposed experiments, the information recorded in two rooms is used as input data for the episodic training of the meta learning model. The third room is instead utilized for testing. Model fine-tuning can be performed via active learning on the test data, using the meta learning model as a baseline.

2.2 Radar board

All radar data in this work were collected using the BGT60TR13C FMCW sensor [21] from Infineon Technologies AG. With a center frequency \(f_{0}\) of 60 GHz and a bandwidth of about 6 GHz, this radar represents a miniaturized and low-power solution. This \(f_{0}\) and bandwidth are especially suitable for short-distance and indoor applications, resulting in low susceptibility to interference from other signals such as Wi-Fi or Bluetooth. Thanks to an operation-optimized duty cycle, the power consumption for sensing within 5 m is minimized to only 5 mW. The BGT60TR13C has one transmit (TX) and three receive (RX) channels built into the package. The RX antennas are placed orthogonally to each other to enable the reconstruction of azimuth and elevation angles of arrival (AoA) for the targets placed in the FoV. The information collected from the RX channels is mixed with the TX signal and digitized with 12-bit resolution via the board connected to the radar sensor (Fig. 3).

Fig. 2

Proposed Framework. The setup is mounted in three rooms. Data sessions with a number of people from 0 to 3 in the scenario are collected and processed (orange). The frequency analysis is performed via the fast Fourier transform (FFT). Instances are generated via a moving average over frame sequences. A meta-dataset is then generated, and one room is used as the test dataset. A classifier is then episodically trained and tested. Active learning is used to fine-tune the model to a new environment (yellow)

Fig. 3

BGT60TR13C Radar System. The board filters, mixes, and digitizes data from each RX channel, located on top of the radar sensor

2.3 Radar configuration

The BGT60TR13C transmits a series of linearly frequency-modulated signals, called chirps, in a defined bandwidth \(B_{w}\) around the central frequency \(f_{0}\). Each chirp, of duration \(t_{c}\), consists of a fixed number of samples \(n_{s}\). During use, the signal received on the RX channels is mixed with a reference of the transmitted signal and digitized, thus generating an output signal called intermediate frequency (IF). For further preprocessing, the radar information is packed into frames, each containing the IF signal of a sequence of \(N_{c}\) chirps. The theoretical maximum detection range \(R_{max}\) and range resolution \(\Delta r\) of an FMCW modulation are calculated using the following formulas:

$$\begin{aligned} \Delta r = \frac{c}{2B_{w}} \ , \end{aligned}$$
(1)
$$\begin{aligned} R_{\max } = \frac{\Delta r}{2} n_{s} \ , \end{aligned}$$
(2)

where c stands for the speed of light in air. A narrow \(B_{w}\) of 0.48 GHz was chosen to achieve an \(R_{max}\) of about 10 m, which covers the entire size of the chosen environments. The resulting resolution \(\Delta r\) of about 31 cm lets several targets placed in front of the radar be distinguished even at a considerable distance. To this end, an \(n_{s}\) of 64 samples per chirp was selected. The maximum discernible velocity of the targets \(V_{max}\) in one direction and the resolution \(\Delta v\) can instead be calculated with the following formulas:

$$\begin{aligned} V_{\max } = \frac{c}{4f_{0}t_{c}} \ , \end{aligned}$$
(3)
$$\begin{aligned} \Delta v = \frac{2 V_{\max }}{N_{c}} . \end{aligned}$$
(4)

The average human walking speed is about 1.42 m/s. To allow detecting even faster motions, we opted for a \(V_{max}\) of about 3.5 m/s and a \(\Delta v\) of about 11 cm/s. As a result, we set \(t_{c}\) to 351 \(\mu s\) and \(N_{c}\) to 64. To collect approximately seven frames every half second, a frame repetition time (denoted fps in the following) of 75 ms was chosen. Furthermore, an analog-to-digital converter (ADC) sampling rate \(F_{s}\) of 2 MHz was chosen. The parameters used to configure the BGT60TR13C for the people counting recordings in all the selected rooms are listed in Table 1.
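As a plausibility check, the configuration values quoted in this section can be plugged into Eqs. (1) to (4). The sketch below is only illustrative: the function name is hypothetical, and the vacuum speed of light is used as an approximation for the speed of light in air.

```python
C_LIGHT = 299_792_458.0  # m/s, vacuum value used as an approximation for air

def fmcw_parameters(bandwidth_hz, f0_hz, t_c_s, n_samples, n_chirps):
    """Evaluate Eqs. (1)-(4) for a given FMCW configuration."""
    delta_r = C_LIGHT / (2.0 * bandwidth_hz)   # Eq. (1): range resolution
    r_max = delta_r / 2.0 * n_samples          # Eq. (2): maximum range
    v_max = C_LIGHT / (4.0 * f0_hz * t_c_s)    # Eq. (3): maximum velocity
    delta_v = 2.0 * v_max / n_chirps           # Eq. (4): velocity resolution
    return delta_r, r_max, v_max, delta_v

# Values taken from the text: B_w = 0.48 GHz, f_0 = 60 GHz,
# t_c = 351 us, n_s = 64 samples per chirp, N_c = 64 chirps per frame.
delta_r, r_max, v_max, delta_v = fmcw_parameters(0.48e9, 60e9, 351e-6, 64, 64)
print(f"delta_r = {delta_r * 100:.1f} cm, R_max = {r_max:.2f} m")
print(f"V_max = {v_max:.2f} m/s, delta_v = {delta_v * 100:.1f} cm/s")
```

Changing `bandwidth_hz` or `t_c_s` in the call shows the usual trade-off: a wider bandwidth sharpens the range resolution but shortens the maximum range for a fixed number of samples, and a shorter chirp time raises the maximum unambiguous velocity.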

Table 1 Radar Sensor Parameters Configuration
Fig. 4

Data recording setup. A Raspberry\(^{\circledR }\) Pi4 (a) is used for data storage. For data collection, the BGT60TR13C radar system is mounted on the tripod (b). The tripod is moved between sessions in the various rooms and locations (c)

2.4 Recording setup

The BGT60TR13C radar system was mounted on a tripod for the people counting recordings, and the data were collected using a Raspberry\(^{\circledR }\) Pi 4. The raw radar data were then processed and labeled offline on an eighth-generation Intel\(^{\circledR }\) Core\(^{\textrm{TM}}\) i5 processor (4 cores). Figure 4 depicts the setup used. Three different rooms of various sizes were chosen for data collection: an office of approximately 26 m\(^{2}\) and two meeting rooms of about 20 and 39 m\(^{2}\), respectively. Only a portion of the office has been used, with walls separating the other two areas. Various types of furniture, such as cabinets, desks, tables, and chairs, were left in the rooms and were not moved from their locations. The reflection of such objects represents the so-called clutter that characterizes the FMCW radar data. A graphical illustration of the three environments, indicated with the letters S, M, and B, standing for small, medium, and big, is provided in Fig. 5. Data were gathered in each room from at least the four corners. Data were also collected in three additional locations in the office room. At every location, the tripod was set up at a height ranging from 1.65 to 1.75 meters. Four sessions have been carried out per location, each lasting approximately 60 seconds for the meeting rooms and 90 seconds for the office. Each session contains data from 0 up to a maximum of 3 people in the room at the same time. Ten different people with heights ranging from 1.60 to 1.78 meters took part in the recordings. Some data with up to 5 people have been gathered in the big room to further test the performance of the developed algorithm. Before collecting data, user consent was obtained, and as much privacy and data anonymization as possible were maintained during the recordings. The collected data have not been made publicly available.

2.5 Radar preprocessing

Raw radar frames are difficult to interpret and label. The information to be fed to a DL model for learning purposes can be too noisy and highly context-dependent due to clutter. In this work, we propose to preprocess the raw data collected for people counting by removing the clutter and extracting the Doppler and range information of the targets through frequency analysis with the fast Fourier transform (FFT). We then perform two averages to reduce the noise in the data for the next model generation step: one for each frame, averaging the IF signal \(Ch_{IF}(i)\) generated for each of the three RX channels (\(i \in I_{RX}\)), and another over each recorded series of seven frames. The whole process, given the fps of 75 ms, leads to the generation of about 2 RDIs per second. The main preprocessing steps are shown in Fig. 6.
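The stated output rate can be verified with a short calculation, under the assumption (not stated explicitly in the text) that consecutive seven-frame series do not overlap:

$$\begin{aligned} T_{RDI} = 7 \cdot 75\ \text {ms} = 525\ \text {ms} \ , \qquad \frac{1}{T_{RDI}} \approx 1.9\ \text {RDI/s} \ . \end{aligned}$$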

Fig. 5

A graphic illustration of the environments chosen for data collection. Data from 0 to 3 people were collected from the four corners of the rooms. For the office M, data were also gathered at three other locations (C, E, and H, respectively). For M, data could not be collected from location B due to the presence of the front door

Fig. 6

Flow diagram representing the main preprocessing steps. The yellow blocks represent the main time-domain steps. The orange ones instead represent the frequency domain steps

The preprocessing steps performed for each RX-generated IF signal are as follows:

  1. For each chirp (slow time), the average value of the samples (fast time) is calculated and then subtracted.

  2. The IF signal is then multiplied in fast time with a Hanning window to reduce spectral leakage.

  3. A 1-D FFT is performed on the samples to derive the range information of the targets.

  4. A Hanning window is also applied along the slow time.

  5. A 1-D FFT is performed along the slow time to obtain the velocity information.

  6. To drop the information of static objects, i.e., clutter, moving target indication (MTI) is applied (5).

    $$\begin{aligned} Ch_{IF}(i) = \mu Ch_{IF}(i) + (1 - \mu ) \overline{Ch_{IF}}(i) \ , \end{aligned}$$
    (5)

    where \(\mu \) \(\in \) [0, 1] is set to 0.9, and weights the importance of the current frame against the average of the previous ones \(\overline{Ch_{IF}}(i)\).

  7. For each \(Ch_{IF}(i)\), a constant false alarm rate (CFAR) algorithm is used to locally select range and Doppler peaks in frequency and discard the surrounding information, thus increasing the signal-to-noise ratio (SNR).

  8. To further improve the SNR, the RDI(v) for each frame \(v \in V\) is computed as the absolute value of the average of \(Ch_{IF}(i)\) over the RX channels (6).

    $$\begin{aligned} RDI(v) = \displaystyle \left|\frac{1}{\vert I_{RX} \vert } \sum _{i \in I_{RX}} Ch_{IF}(i) \right|\ , \end{aligned}$$
    (6)

    where \(\vert I_{RX} \vert = 3\) is the number of RX channels.

  9. The RDIs thus generated are stored in a buffer of seven frames (\(N_{v} = 7\)), which corresponds to roughly half of the frames recorded in one second. A moving average is performed on the buffer to further reduce the noise in the RDIs (7). The resulting averaged RDIs are the individual labeled instances of the people counting dataset.

    $$\begin{aligned} RDI = \displaystyle \left|\frac{1}{N_{v}} \sum _{v=1}^{N_{v}} RDI(v) \right|\ . \end{aligned}$$
    (7)
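The frequency-domain part of this chain can be illustrated with a minimal NumPy sketch. It covers only steps 1 to 5 and the two averages of (6) and (7); the MTI and CFAR steps are deliberately left out. The frame layout (64 chirps of 64 samples each), the function names, and the random input are assumptions chosen so that the output matches the 32 × 64 RDI size used in this work, not the actual implementation.

```python
import numpy as np

def range_doppler_map(frame):
    """Steps 1-5 for one RX channel.

    frame: real array of shape (n_chirps, n_samples), slow time x fast time.
    Returns a complex array of shape (n_samples // 2, n_chirps), range x Doppler.
    """
    n_chirps, n_samples = frame.shape
    # Step 1: remove the per-chirp mean along fast time.
    x = frame - frame.mean(axis=1, keepdims=True)
    # Step 2: Hann window along fast time to limit spectral leakage.
    x = x * np.hanning(n_samples)[np.newaxis, :]
    # Step 3: range FFT; the spectrum of a real signal is symmetric, keep half.
    rng_fft = np.fft.fft(x, axis=1)[:, : n_samples // 2]
    # Step 4: Hann window along slow time.
    rng_fft = rng_fft * np.hanning(n_chirps)[:, np.newaxis]
    # Step 5: Doppler FFT along slow time, zero velocity shifted to the centre.
    rd = np.fft.fftshift(np.fft.fft(rng_fft, axis=0), axes=0)
    return rd.T

def frame_rdi(channels):
    """Eq. (6): magnitude of the complex average over the RX channels."""
    return np.abs(np.mean([range_doppler_map(c) for c in channels], axis=0))

def averaged_rdi(frames):
    """Eq. (7): average of the per-frame RDIs stored in the frame buffer."""
    return np.abs(np.mean([frame_rdi(f) for f in frames], axis=0))

# Hypothetical buffer: 7 frames x 3 RX channels x 64 chirps x 64 samples.
buffer = np.random.default_rng(0).normal(size=(7, 3, 64, 64))
rdi = averaged_rdi(buffer)
print(rdi.shape)  # (32, 64)
```

A constant input frame maps to an all-zero output, since the mean removal of step 1 cancels it before any transform is applied.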
Fig. 7
figure 7

Example RDI instances from the people counting dataset. Every row shows three examples per class, chosen from a random combination of rooms and locations. The axes indicate the relative velocity of the people in m/s and the distance from the radar sensor in cm

2.6 People counting dataset

For people counting, three different meta-datasets have been generated from the collected data of up to three people. Given a frame timing of 75 ms and the frame averaging performed on a seven-frame buffer, a total of 7,669 labeled samples have been created. Each sample has a size of \(32 \times 64\) pixels. The width of 64 pixels represents the velocity span, corresponding to the number of chirps per frame. The height of 32 pixels represents the range span, corresponding to half of the bin samples per frame. Independently of the recording room, labels represent the number of people \(P_{m}\) in the recording, with m \(\in \) [0, 3]. As shown in Fig. 5, the data has been divided into sub-folders identified by the tuple (R, \(P_{m}\), L). The tuple components are the room’s name R (S, M, or B), the number of people (\(P_{m}\)), and the location, L \(\in \) [A-H]. With an average duration of 60 seconds across all recordings in rooms S and B, a total of 1,677 and 1,702 examples were created, respectively. For M, a total of 4,290 examples were built with six available locations. With all the available instances, the following three meta-datasets have been generated:

  • Mixed-Dataset: the data from the sub-folders (R, \(P_{m}\), L) were randomly split so that approximately 75% of the instances were used for training and 25% for testing. The numbers of training and test instances in this case are 5,803 and 1,866, respectively.

  • S-Test-Dataset: in this case, all sub-folders (S, \(P_{m}\), L) were used for testing, while all others ([MB], \(P_{m}\), L) were used for training. In total, for this meta-dataset, there are 5,992 training examples (the 4,290 from room M plus the 1,702 from room B) and 1,677 test examples.

  • B-Test-Dataset: all the sub-folders (B, \(P_{m}\), L) were used for testing, while all the others ([SM], \(P_{m}\), L) were used for training. The numbers of training and test instances are 5,967 and 1,702, respectively.

In general, for each of the three generated meta-datasets, the training and test instances are part of the respective training \(\mathcal {D}^{m-train}\) and test \(\mathcal {D}^{m-test}\) meta-dataset splits. Three different averaged RDI examples per class, sampled from the different recordings in all rooms and locations, are shown in Fig. 7.
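The two kinds of split can be sketched in a few lines of Python. This is an illustrative sketch rather than the procedure actually used: the sample tuples, the function names, and the choice of splitting each sub-folder separately for the Mixed-Dataset are assumptions, and the toy data below only mimics the (R, \(P_{m}\), L) folder structure.

```python
import random
from collections import defaultdict

def mixed_split(samples, train_fraction=0.75, seed=0):
    """Random split inside every (room, people, location) sub-folder."""
    groups = defaultdict(list)
    for s in samples:
        groups[s[:3]].append(s)
    rnd = random.Random(seed)
    train, test = [], []
    for key in sorted(groups):
        items = groups[key][:]
        rnd.shuffle(items)
        cut = round(train_fraction * len(items))
        train += items[:cut]
        test += items[cut:]
    return train, test

def room_holdout_split(samples, test_room):
    """All sub-folders of one room form the test split (S-Test, B-Test)."""
    train = [s for s in samples if s[0] != test_room]
    test = [s for s in samples if s[0] == test_room]
    return train, test

# Toy samples (room, people, location, index); sizes are hypothetical.
samples = [(room, p, loc, k)
           for room in "SMB" for p in range(4) for loc in "AD" for k in range(8)]

train_mixed, test_mixed = mixed_split(samples)
train_s, test_s = room_holdout_split(samples, "S")
print(len(train_mixed), len(test_mixed))  # 144 48
print(len(train_s), len(test_s))          # 128 64
```

With a room held out, no location of the test room is ever seen during training, which is what makes the S-Test and B-Test splits harder than the mixed one.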

Even in the same environment, RDIs from classes 1 to 3 are difficult to distinguish from one another. Figure 8 shows a two-component (2-D) t-distributed stochastic neighbor embedding (t-SNE) representation of all instances in the S room. The t-SNE succeeds in correctly clustering only data with zero people in the environment. A t-SNE representation of all collected data is shown in Fig. 9 according to the B-Test-Dataset split. Even with a larger amount of data, only the zero-person instances are easily clustered. In this case, it can also be observed that the test data, which represent the B room, have different features than the rest of the points. This is an important indication of the dependence of radar data on the location in which they are collected. Algorithms trained in a single location may be difficult to use in other environments and usually require adaptation. Euclidean distance was used as a metric, and Barnes-Hut was used as an optimization algorithm to generate the t-SNE representation.

Fig. 8
figure 8

2-D t-SNE representation of all S room data. This t-SNE was obtained with a perplexity of 40 over 6,000 optimization iterations

Fig. 9
figure 9

2-D t-SNE representation of the B-Test-Dataset, for all the recorded data. The B room data are represented by the “x” marker, while the rest of the data (rooms S and M) are represented by the “o” marker. This representation was obtained with a perplexity of 30 over 7,000 optimization iterations

3 Proposed approach

In this section, we present our solutions for generalization learning. We begin by proposing a new network topology called the Weighting-Injection Net, which is inspired by the weighting network [36]. We then propose an algorithm that makes use of optimization-based meta learning features from MAML [31], which we call MAMW. This modified version aims at increasing training stability when only a very limited number of shots per class are available. Then, we propose an active learning strategy tailored for weighting networks to allow fine-tuning in a new environment while minimizing the amount of required labeled data.

3.1 Meta learning

In episodic meta learning, K tasks are sampled from a distribution \(p(\mathcal {T}_{r})\) defined over \(\mathcal {D}^{m-train}\). As the episodes progress, the goal is to improve the performance of the model on tasks sampled from \(p(\mathcal {T}_{s})\) defined on \(\mathcal {D}^{m-test}\). In DL, task-based learning is often achieved via the gradient method, which involves training the parameters \(\theta ^{\prime }\) by minimizing a cost function \(\mathcal {L}_{\mathcal {T}_{r}}(f_{\theta ^{\prime }})\), where \(f_{\theta ^{\prime }}\) represents the relation between the input x and the predicted output \(\hat{y}\). In the relation networks [35], generalization among tasks is directly achieved thanks to the intrinsic comparison of instances enabled by the topology. In optimization-based meta learning, such as in MAML [31], the information learned for tasks \(\mathcal {T}_{r}\) and encoded in the parameters \(\theta ^{\prime }\), is transferred to a base model \(f_{\theta }\) with parameters \(\theta \), minimizing an outer cost function \(\mathcal {L}_{\mathcal {T}_{r}}(f_{\theta ^{\prime }})\). In this case, the task-specific cost function depends on the parameters \(\theta \) of the base model \(\mathcal {L}_{\mathcal {T}_{r}}(f_{\theta })\).

3.1.1 Weighting-injection net

The Weighting-Injection Net aims to compare the features of arbitrary query examples q with those of the reference examples of the support classes s for each task \(k \in K\). The Weighting-Injection Net, as shown in Fig. 1, is based on three main modules: injection, comparison, and weighting. During training, the gradient information is propagated through all modules in both forward and backpropagation steps. For an N–way 1–shot task, the idea is to map the relationship between support examples \(s_{n}\), where n \(\in \) \(\mathbb {N}\): [1, 2, ..., N], to each query example \(q_{j}\), where j is the index of the j-th example of the set.

The injection module \(e_{\theta }\) generates a higher dimension representation of the input x to enhance the extraction and matching of features in the subsequent comparison step. Gradient information for the injection module is only propagated as \(e_{\theta ^{\prime }}(s_{n})\) through the support instances. For the query, only the feature representation \(e_{\theta ^{\prime }}(q_{j})\) is generated.

The comparison module \(g_{\theta }\) takes as input the concatenation along N channels of \(e_{\theta ^{\prime }}(q_{j})\) with each of the N support samples. The number of channels N corresponds to the number of ways of the task. The features are extracted in the module using convolution layer sequences, yielding a comparison vector z. The vector z is generated in the following way:

$$\begin{aligned} z_{n,j} = g_{\theta ^{\prime }}(e_{\theta ^{\prime }}(s_{n}) \mathbin {\Vert }e_{\theta ^{\prime }}(q_{j})) \ , \end{aligned}$$
(8)

where \(\mathbin {\Vert }\) denotes the operation of concatenation along the N channels.

Lastly, the weighting module \(w_{\theta }\) is designed to generate a probability density from the concatenated N channels in the z vector. Each \(z_{n,j}\) is the output of the comparison module, between the query \(q_{j}\) and a support \(s_{n}\). The predicted output \(\hat{y}_{j}\) for the sample \(q_{j}\) can be expressed as follows:

$$\begin{aligned} \hat{y}_{j} = w_{\theta ^{\prime }}(\Vert _{n=1}^N z_{n,j}) = w_{\theta ^{\prime }}(z_{1,j} \mathbin {\Vert }z_{2,j} \cdots \mathbin {\Vert }z_{N,j}) \ , \end{aligned}$$
(9)

where \(\Vert \) represents the sequence of concatenations performed over the channels N of z.

In the case of an N–way C–shot task, where c \(\in \) \(\mathbb {N}\): [1, 2, ..., C], the supports per class can be denoted as \(s_{n, c}\). The Weighting-Injection Net can be leveraged in this case to create a more robust representation of the comparison vector \(z_{n, j}\). This can be done by arithmetic averaging over C sets of N-channel concatenations, given by the embedded representations of \(q_{j}\) with each of the support sets \(s_{n, c}\). Such a more robust representation yields a query class estimation with less bias than in the single-shot scenario. The mathematical expression for a single \(q_{j}\) is as follows:

$$\begin{aligned} z_{n,j} = \frac{1}{C} \sum _{c=1}^C g_{\theta ^{\prime }}(e_{\theta ^{\prime }}(s_{n,c}) \mathbin {\Vert }e_{\theta ^{\prime }}(q_{j})) \ . \end{aligned}$$
(10)

The Weighting-Injection Net, trained on \(p(\mathcal {T}_{r})\), can be tested, thanks to its inherent structure, on tasks from \(p(\mathcal {T}_{s})\) without further training. Given a support set with elements \(s_{n,c}\) for an N–way C–shot task \(\mathcal {T} \sim p(\mathcal {T}_{s})\), the class probability density of the j-th query sample \(q_{j}\) is directly estimated by inference.
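The data flow of (8), (9), and (10) can be made concrete with a small NumPy sketch. The three modules are replaced here by random linear maps with simple nonlinearities, which are only stand-ins for the convolutional and dense layers of the real topology; the toy feature length F, the weight matrices, and the function names are all hypothetical. Only the structure of the computation (embed, concatenate, average over shots, concatenate over classes, weight) follows the equations.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C, F = 4, 5, 16                     # ways, shots, toy feature length
W_e = rng.normal(size=(32 * 64, F))    # stand-in for the injection module e
W_g = rng.normal(size=(2 * F, 8))      # stand-in for the comparison module g
W_w = rng.normal(size=(N * 8, N))      # stand-in for the weighting module w

def e(x):
    """Feature representation of one 32 x 64 input."""
    return np.tanh(x.reshape(-1) @ W_e / 45.0)

def g(pair):
    """Comparison of one concatenated (support, query) feature pair."""
    return np.maximum(pair @ W_g, 0.0)

def w(z):
    """Softmax over the concatenated comparison vectors."""
    logits = z @ W_w
    p = np.exp(logits - logits.max())
    return p / p.sum()

def predict(support, query):
    """support: (N, C, 32, 64); query: (32, 64). Returns N class probabilities."""
    eq = e(query)
    # Eq. (10): average the comparison output over the C shots of each class.
    z = [np.mean([g(np.concatenate([e(s), eq])) for s in support[n]], axis=0)
         for n in range(N)]
    # Eq. (9): concatenate the N comparison vectors and weight them.
    return w(np.concatenate(z))

support = rng.normal(size=(N, C, 32, 64))
query = rng.normal(size=(32, 64))
probs = predict(support, query)
print(probs.shape)  # (4,)
```

With C = 1 the average in `predict` runs over a single element, so the sketch reduces to the 1–shot case of (8).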

3.1.2 Model-agnostic meta-weighting

The weighting network [36] represents a robust episodic learning algorithm thanks to the inherent feature of instance comparison. Yet, this method can suffer from learning instability when only a few shots per class are available. Especially in 1–shot learning, this is due to the comparison of the query with the individual support instances, which may not be sufficiently descriptive of a class for a given task. Hence, we present a method called model-agnostic meta-weighting (MAMW), which tries to incorporate within the weighting network some features of optimization-based meta learning to enhance the stability and robustness of prediction in this setting. Specifically, in the MAMW, we propose to divide episodic learning into inner and outer steps. Given an N–way C–shot task:

  1. In the inner step, the support instances are compared with a noisy version of themselves, obtained by adding Gaussian noise, via a function \(e_{\theta }(\phi (s_{n,c}))\). This noise is generated at random from the \(\mathcal {N}(0, \sigma ^{2})\) distribution in the interval [\(-\sigma \), \(\sigma \)]. Defining \(s_{h}\) as the h-th support example, where \(H = N \cdot C \implies h \in \) \(\mathbb {N}\): [1, 2, ..., H], the computation of \(z_{n,h}\) can be expressed as follows:

    $$\begin{aligned} z_{n,h} = \frac{1}{C} \sum _{c=1}^C g_{\theta }(e_{\theta }(s_{n,c}) \mathbin {\Vert }e_{\theta } (\phi (s_{h}))) \ , \end{aligned}$$
    (11)
    $$\begin{aligned} \hat{y}_{h} = w_{\theta }(\Vert _{n=1}^N z_{n,h}) \ , \end{aligned}$$
    (12)

    where \(\theta \) represents the parameters of the base model \(f_{\theta }\). Such operations can also be carried out in batches. An example of people counting instances compared with their noisy version is shown in Fig. 10.

  2. In the outer step, the comparison between the support examples \(s_{n,c}\) and each query \(q_{j}\) is performed, starting from the weights \(\theta ^{\prime }\) learned in the inner loop. In this case, the comparison vectors z are computed with (10) and the predicted output \(\hat{y}_{j}\) with (9).
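The noise function \(\phi \) of the inner step can be sketched as follows. The sketch interprets "in the interval [\(-\sigma \), \(\sigma \)]" as clipping the Gaussian samples to that range, which is an assumption; the function name, the random inputs, and the default variance (taken from the value reported for the experiments) are likewise illustrative.

```python
import numpy as np

def phi(x, sigma2=0.005, rng=None):
    """Return x plus zero-mean Gaussian noise clipped to [-sigma, sigma]."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.sqrt(sigma2)
    noise = np.clip(rng.normal(0.0, sigma, size=x.shape), -sigma, sigma)
    return x + noise

# Inner-step pairs for a hypothetical 4-way 2-shot task: every support
# example s_h is compared against its own noisy copy phi(s_h), and the
# target of the comparison is the class of s_h itself.
gen = np.random.default_rng(0)
supports = gen.random(size=(4, 2, 32, 64))            # (N, C, height, width)
noisy_queries = phi(supports.reshape(-1, 32, 64), rng=gen)
inner_labels = np.repeat(np.arange(4), 2)             # class index of each s_h
print(noisy_queries.shape, inner_labels)
```

Because the inner step needs no query labels beyond those of the supports, it can be run on the support set alone before the outer comparison with the real queries.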

Fig. 10
figure 10

Examples of RDI without (a) and with added Gaussian noise (b) used in the inner step training of the MAMW

The main steps of the MAMW, in the case of few-shot, supervised learning with outer updates after every task, are defined in Algorithm 1.

The presented Weighting-Injection Net topology can be trained via the MAMW algorithm. Also with the MAMW episodic learning, the Weighting-Injection Net can tackle new test tasks without requiring adaptation training. MAMW does not need algorithmic modifications when an embedding module is used instead of the injection module.

Algorithm 1
figure h

MAMW for N–way C–shot Supervised Learning

3.2 Active learning

Active learning can also be used on top of a meta learning model to perform fine-tuning on a given task, leveraging the most uncertain queries during adaptation. We propose to use pool-based sampling active learning to fine-tune the Weighting-Injection Net on \(p(\mathcal {T}_{s})\), starting from what has been learned on \(p(\mathcal {T}_{r})\). We chose an uncertainty sampling strategy to let the algorithm decide at each training epoch which new examples to label. We test the approach with three different priority scores: least confidence (LC), margin sampling (MS), and entropy (E), respectively. For the instances \(q_{j} = \{x_{j}, \ y_{j}\}\) representing the input/output pairs on queries sampled by \(\mathcal {T}\), the priority scores \(S_{p}\) can be defined as follows:

$$\begin{aligned} S_{LC} = \mathop {\textrm{argmax}}_{x_{j}} \ (1 - P_{\theta }(\hat{y}_{max} \mid x_{j})) \ , \end{aligned}$$
(13)
$$\begin{aligned} S_{MS} = \mathop {\textrm{argmin}}_{x_{j}} \ (P_{\theta }(\hat{y}_{max} \mid x_{j}) - P_{\theta }(\hat{y}_{max-1} \mid x_{j})) \ , \end{aligned}$$
(14)
$$\begin{aligned} S_{E} = \mathop {\textrm{argmax}}_{x_{j}} \ (- \sum _{n=1}^N P_{\theta }(\hat{y}_{n} \mid x_{j}) \log P_{\theta }(\hat{y}_{n} \mid x_{j})) \ , \end{aligned}$$
(15)

where \(P_{\theta }(\hat{y}_{max} \mid x_{j})\) is the highest posterior probability predicted by the model with parameters \(\theta \) for \(x_{j}\), \(P_{\theta }(\hat{y}_{max-1} \mid x_{j})\) is the second highest, and N is the number of classes.
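As a worked example, the three scores of (13), (14), and (15) can be evaluated on a small pool of predicted class distributions. The pool values and the helper names below are made up for illustration; the natural logarithm is used for the entropy, which is an assumption since the base is not specified.

```python
import math

def least_confidence(p):
    """Eq. (13): one minus the highest class probability."""
    return 1.0 - max(p)

def margin(p):
    """Eq. (14): gap between the two highest class probabilities."""
    top, second = sorted(p, reverse=True)[:2]
    return top - second

def entropy(p):
    """Eq. (15): Shannon entropy of the predicted distribution."""
    return -sum(q * math.log(q) for q in p if q > 0.0)

def select(pool, score, largest=True):
    """Index of the query that the given score would send for labeling."""
    indices = range(len(pool))
    key = lambda i: score(pool[i])
    return max(indices, key=key) if largest else min(indices, key=key)

# Hypothetical softmax outputs for three unlabeled queries (4 classes).
pool = [
    [0.97, 0.01, 0.01, 0.01],   # confident prediction
    [0.40, 0.35, 0.15, 0.10],   # probability spread over many classes
    [0.50, 0.48, 0.01, 0.01],   # two classes almost tied
]
print(select(pool, least_confidence))       # 1
print(select(pool, margin, largest=False))  # 2
print(select(pool, entropy))                # 1
```

The example shows that the criteria need not agree: margin sampling picks the query with two nearly tied classes, while least confidence and entropy pick the one whose probability mass is most spread out.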

Algorithm 2 defines the main steps of the proposed pool-based sampling on a task \(\mathcal {T}\). In general, Algorithm 2 represents a generalization of the pool-based sampling approach for relational models. For a given task, a set of class-related support examples is initially labeled. As the number of iterations increases, the uncertainty of the query examples is evaluated, and those with the highest priority score are added to the labeled dataset. A maximum number of support instances per class per iteration is also chosen. Instead of starting with random weights, parameters learned during episodic learning on training tasks can be used as the model initialization. The active learning procedure is therefore performed on unseen test tasks.

Algorithm 2
figure i

Pool-based Sampling Active Learning for N–way C–shot Supervised Learning on Weighting-Injection Net.

4 Experimental setup

In this section, we present all the results achieved on meta learning episodic experiments and active learning fine-tuning on the people counting meta-datasets (Section 2.6). The algorithms have been written in the Python programming language, using the TensorFlow module to implement the DL models. Further experiments on a public dataset have been performed and are discussed in Appendix A. The code for the algorithms and topologies used in the meta learning experiments is available online (Footnote 1). As a processing unit, we used an Nvidia® Tesla® P4 GPU with CUDA® Toolkit v11.1.0 for parallel computing.

4.1 Meta learning experiments

All the episodic experiments have been performed with the topology presented in Section 3.1.1 and Fig. 1. Specifically, 4–way experiments with 1–, 2–, 5–, and 10–shot have been performed. The topology has been trained with two different algorithms: first, with the classical episodic few-shot training of weighting networks, as defined in [36], using the Weighting-Injection Net equations (Section 3.1.1); second, in episodic sequences of inner and outer steps, following the MAMW algorithm proposed in Section 3.1.2. All the results presented in this section refer to the two algorithms and are consistently called Weighting-Injection Net and MAMW. Comparison results of the two algorithms with the state-of-the-art are presented in Section 4.1.1. The state-of-the-art comparison also features some application limit experiments for indoor people counting up to five individuals in a room.

Fig. 11
figure 11

Representation of the topology modules and respective layers used in the relational experiments. The injection module (\(e_{\theta }\)) increases the data dimensionality via a sequence of convolutional layers. The query sample is compared with all the available support samples. To combine relevant features, the comparison module (\(g_{\theta }\)) employs convolution and global average pooling. The weighting module (\(w_{\theta }\)) generates a feature matching probability density using dense layers and softmax activation

A graphical representation of the model modules and respective layers is shown in Fig. 11. The model consists of 283,379 trainable parameters in its entire module sequence. Of the total, the injection module consists of 239,680 parameters, the comparison module of 39,936, and the weighting module of the remaining 4,180. To rescale feature size, max pooling is used in cascade to the 2D convolution (Conv2D) for the two modules \(e_{\theta }\) and \(g_{\theta }\). In addition, batch normalization is used to increase the stability of training. All batch normalization layers are followed by a rectified linear unit (ReLU) activation function. To map the output vector into a probability distribution over the classes, the softmax is used as an activation function for \(w_{\theta }\). The cost function chosen for the query classification is categorical cross-entropy, and the optimization algorithm is Adam. The Adam parameters \(\beta _{1}\) and \(\beta _{2}\) have been set to 0 and 0.5, respectively. A learning rate of \(5 \times 10^{-4}\) has been chosen for the Weighting-Injection Net as well as for both the inner and outer steps of MAMW. For the Gaussian noise statistic on the MAMW inner step, a value of \(\sigma ^{2}\) equal to 0.005 has been chosen. This value represents an empirical choice, noting that larger values led to the loss of the main information in the support instances, while smaller values were less effective for the performance of the experiments.

Regardless of the number of shots, every meta-training experiment is performed over 22,000 episodes, each of a single training epoch. The episodic learning is carried out on \(\mathcal {D}^{m-train}\). The validation and testing have been performed at the end of each episode with 10 shots per class (40 samples) on tasks sampled from \(\mathcal {D}^{m-train}\) and \(\mathcal {D}^{m-test}\), respectively.

All experiments have been carried out with an embedding size g of 64. Smaller embedding sizes resulted in non-convergent experiments, whereas larger sizes resulted in meta-overfitting on \(\mathcal {D}^{m-train}\). For the injection module, an output representation of \(14 \ \cdot 14 \ \cdot g\) has been chosen (feature size). This led to a representation per image of 12,544 units (Table 2). On the Nvidia® Tesla® P4 GPU, the number of floating-point operations per second (FLOPS) for the injection module with this configuration is 108 megaFLOPS. The size of the model weights when saved in ".h5" format, regardless of the chosen episodic training algorithm and the number of shots, is 1,148 KB. Some experiments at varying feature sizes are also presented later in this section to test the benefits of the injection module over the standard embedding module.

The obtained values of prediction accuracy, model size, and single-sample prediction latency are compared to state-of-the-art values obtained by training other algorithms on the people counting dataset employed in this work. The accuracy results for the Weighting-Injection Net are reported for varying numbers of shots. Each experiment by algorithm, meta-dataset, and number of shots has been performed three times and tested on 10,000 final tasks sampled by \(\mathcal {D}^{m-test}\). All presented results include the 95% confidence interval in addition to the average accuracy value.
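How the confidence interval is obtained is not detailed here. A common choice, shown below as a hedged sketch rather than as the procedure actually used, is the normal approximation over the per-task accuracies, i.e., the mean plus or minus 1.96 standard errors. The function name and the toy accuracy values are hypothetical.

```python
import math
import statistics

def mean_ci95(accuracies):
    """Mean and half-width of a normal-approximation 95% interval."""
    n = len(accuracies)
    mean = statistics.fmean(accuracies)
    half_width = 1.96 * statistics.stdev(accuracies) / math.sqrt(n)
    return mean, half_width

# Toy per-task accuracies; a real evaluation would use 10,000 test tasks.
task_accuracies = [0.5, 0.7, 0.9]
mean, half_width = mean_ci95(task_accuracies)
print(f"{mean:.3f} +/- {half_width:.3f}")
```

Since the half-width shrinks with the square root of the number of tasks, evaluating on many final tasks gives tight intervals even when single-task accuracies vary widely.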

Table 2 Network Layers Configuration - People Counting

The performance evaluation of each individual experiment is measured according to the validation and test accuracy values obtained by the model as the number of episodes increases. For every experiment, a box plot on the validation and testing accuracy statistics of tasks sampled from \(\mathcal {D}^{m-train}\) and \(\mathcal {D}^{m-test}\) is constructed every 2,200 episodes. In the following plots and paragraphs, statistical insights from one of the experiments performed are analyzed. Specifically, a MAMW 10–shot experiment on Mixed-Dataset is chosen because of the good generalization performance it achieved. Figure 12 shows the set of box plots generated as the training episodes advance for the considered experiment. As the episodes progress, the mean and median values of the distributions rise while the quartiles and whiskers narrow. With episodes progressing, even the outliers move closer to the upper limit of accuracy. The described behavior demonstrates how, thanks to previously acquired experience, the model can generalize better on newly sampled tasks. This means that newly learned parameters \(\theta \) generalize better in new contexts, i.e., new locations and test rooms, resulting in higher performance under the same learning conditions.

Fig. 12
figure 12

Accuracy statistics box plots vs. episodes for a MAMW 10–shot Mixed-Dataset experiment. The red box plots (a) are generated on validation tasks, whereas the blue ones (b) are generated on test tasks. The median and mean values are represented by a horizontal line and a green triangle in each box plot. The small circles represent the box plot outliers

Discrete accuracy density histograms can be used to represent the distribution underlying individual box plots. Graphical evidence of how the distribution tends to shift towards higher generalization accuracy can be observed by comparing the first and last histograms of the episodic optimization. Such density histograms can also be compared to a Gaussian probability distribution, thus showing what percentage of the achieved accuracy lies between the first and third quartiles. Figure 13 depicts a comparison of accuracy statistics for the examined experiment at the beginning and end of the episodic training. Even for tasks sampled only by \(\mathcal {D}^{m-test}\), the probability density tends, as the episodes progress, to take on a negative skew towards the upper limit of accuracy. The actual distributions underlying the box plots are not Gaussian but multi-modal with density peaks due to the variable complexity of the sampled tasks.

Fig. 13
figure 13

MAMW 10–shot experiment, first (a) and last (b) box plot underlying distributions, generated on test tasks sampled from Mixed-Dataset. The q1 and q2 values on the Gaussians indicate the first and third quartiles, respectively. The probability density histograms show the actual non-Gaussian nature of the distribution. The accuracy probability density for the last box plot (b) exhibits a negative skew as a result of the generalization learning

The generalization capability can be addressed at the level of individual classes by constructing cumulative confusion matrices on task sequences. Labels 0 to 3 represent the real and predicted number of people for the two dataset splits. Figure 14 depicts the confusion matrices underlying the first and last box plots of Fig. 12 for both \(\mathcal {D}^{m-train}\) and \(\mathcal {D}^{m-test}\).

Fig. 14
figure 14

Cumulative confusion matrices for a 10–shot MAMW Mixed-Dataset experiment. Confusion matrices are obtained on the first (a) and last (b) 5,550 meta-iterations in the validation phase for both \(\mathcal {D}^{m-train}\) and \(\mathcal {D}^{m-test}\) sampled tasks

Figure 15 shows another example of cumulative confusion matrices for a Weighting-Injection Net 5–shot experiment on S-Test-Dataset. It is noticeable in both Figs. 14 and 15 that the model learns to generalize better as episodes progress for both unseen locations and rooms. Most misclassifications, especially at the end of episodic learning, lie around the main diagonal. This means that the models, in most cases, miscount by ±1 person compared to the actual number of individuals in the environment. Moreover, the majority of the misclassifications happen for the classes of 1 to 3 persons, while the model easily succeeds in distinguishing the case 0 that corresponds to no people detected in the sensor’s FoV. The per-class accuracy of the test confusion matrices in Fig. 15 turns out to be lower than that in Fig. 14. This is due not only to the use of 5–shot instead of 10–shot in the experiment but also to the higher complexity of the test tasks. In fact, the Fig. 15 experiment sampled all test tasks from a room not included in the training (S).
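The statement that errors concentrate around the main diagonal can be quantified directly from a cumulative confusion matrix. The following NumPy sketch accumulates per-task predictions and measures the share of predictions that are off by at most one person; the function names and the toy task results are hypothetical and do not reproduce the values of Figs. 14 and 15.

```python
import numpy as np

def cumulative_confusion(task_results, n_classes=4):
    """Sum the confusion matrices of a sequence of (y_true, y_pred) tasks."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for y_true, y_pred in task_results:
        # Rows index the real class, columns the predicted class.
        np.add.at(cm, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return cm

def within_one(cm):
    """Fraction of predictions whose count error is at most one person."""
    i, j = np.indices(cm.shape)
    return cm[np.abs(i - j) <= 1].sum() / cm.sum()

# Two toy tasks with one query per class.
tasks = [
    ([0, 1, 2, 3], [0, 2, 2, 3]),
    ([0, 1, 2, 3], [0, 1, 3, 1]),
]
cm = cumulative_confusion(tasks)
print(cm)
print(np.trace(cm) / cm.sum())  # 0.625
print(within_one(cm))           # 0.875
```

In this toy case the exact accuracy is 5/8, while 7 of the 8 predictions are within one person of the true count.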

Fig. 15
figure 15

Cumulative confusion matrices for a 5–shot Weighting-Injection Net S-Test-Dataset experiment. Confusion matrices are obtained on the first (a) and last (b) 5,550 meta-iterations in the validation phase for both \(\mathcal {D}^{m-train}\) and \(\mathcal {D}^{m-test}\) sampled tasks. In this case, the entire S room is utilized as the test set

The prediction accuracy values obtained as an average of the post-training tests for each experiment type are listed in Tables 3, 4, 5 for the three defined meta-datasets.

As can be observed from Tables 3, 4 and 5, regardless of the used meta-dataset, the 1– or 2–shot experiments performed with the MAMW lead to higher average accuracy values than the Weighting-Injection Net. In these specific cases, in episodic learning, the few supports per class make the prediction given by the Weighting-Injection Net less robust, where the learning depends solely and exclusively on the comparison with the query. MAMW instead supplies more information to the model thanks to the initial comparison with a noisy version of the support samples, thus emphasizing the potential intrinsic noise of the query data. For the 5– and 10–shot experiments, the two episodic approaches lead to different performances with respect to the used meta-dataset. The MAMW outperforms the Weighting-Injection Net on the Mixed-Dataset, regardless of the number of shots. The Mixed-Dataset contains, in fact, recordings from all rooms, but with different locations and numbers of people. In this case, the MAMW goal of capturing noise similarity between support and query also aids query class recognition. This is thanks to the intrinsic features of the RDIs collected in the same room, which are thus influenced by the properties of that environment. On S-Test-Dataset and B-Test-Dataset instead, the Weighting-Injection Net outperforms MAMW in most 5– and 10–shot experiments. In these cases, given the relevant difference in context for the test room, the MAMW comparison with the noisy version of supports might shift the learning objective towards detecting noise rather than the class of query samples.

Table 3 Accuracy of the two meta learning approaches on people counting (4 classes): Mixed-Dataset

For relation-based topologies, there is no need to perform adaptation training for new tasks as a result of the direct comparison of features between the newly available support samples and the query. Therefore, the adaptation time to a new task is null. Instead, the inference time on a single sample (query) can be computed as a function of the number of shots. It corresponds to the time required by the model to predict the query class given the available supports. The time required to compute the z comparison vectors for all available supports is thus included in the inference time for single queries. As both the proposed algorithms share the same inference procedure, these values are independent of the employed approach. The single sample inference time is also independent of the selected counting meta-dataset, given the same input size. Average inference values on a single query are listed in Table 6.

Table 4 Accuracy of the two meta learning approaches on people counting (4 classes): S-Test-Dataset

As can be seen from Table 6, the inference time for a single query increases as the number of shots increases. Multiple supports available per class enable a more robust prediction of the query class, as shown in (10). However, this requires the generation of multiple z comparison vectors, which, in proportion to the number of shots, lead to a progressive increase in inference time on a single query.

Classification accuracy is also dependent on the chosen feature representation dimension in the feature extraction module \(e_{\theta }\). In specific experimental settings, the injection can counter episodic overfitting effects by increasing feature size as opposed to the standard embedding. The \(14 \cdot 14\) feature size chosen for all the other experiments is compared with two representations of \(4 \cdot 4\) and \(9 \cdot 9\) respectively. Given the size of an RDI example of \(32 \cdot 64 = 2,048\), a feature representation of \(4 \cdot 4 \cdot 64 = 1,024\) converts the injection module into an embedding module. Compared with the 108 MegaFLOPS required by the feature size of \(14 \cdot 14\), the size \(4 \cdot 4\) requires only 0.28 MegaFLOPS. Overall, the injection operation, compared to embedding, results in the GPU performing significantly more FLOPS. This is due to the larger size of the extracted features in the convolutional layers.
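The feature-size arithmetic above can be checked with a few lines of Python. The rule used here to tell injection from embedding (the representation has more units than the 2,048 input pixels, or fewer) is a reading of the text, and the function name is hypothetical.

```python
def feature_units(side, embedding_size=64):
    """Number of units of a side x side x embedding_size representation."""
    return side * side * embedding_size

rdi_pixels = 32 * 64  # size of one RDI example

for side in (4, 9, 14):
    units = feature_units(side)
    kind = "injection" if units > rdi_pixels else "embedding"
    print(side, units, kind)
# 4 1024 embedding
# 9 5184 injection
# 14 12544 injection
```

Only the \(4 \cdot 4\) setting compresses the input; both \(9 \cdot 9\) and \(14 \cdot 14\) expand it, the latter by a factor of about six.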

Table 5 Accuracy of the two meta learning approaches on people counting (4 classes): B-Test-Dataset
Table 6 Average single-sample inference time computed as the average of all MAMW and Weighting-Injection Net experiments on all defined meta-datasets, as a function of the number of shots. Every experiment has been run over 10,000 final tasks on Nvidia® Tesla® P4 GPU

Table 7 reports the results on the S-Test-Dataset obtained with the Weighting-Injection Net as the feature size and the number of shots vary. The 1–shot experiment seems to benefit more from embedding than from an injection module. The squeezed representation of features in such experiments leads to a more compact representation. The entire weighting network can succeed in extracting key features from the few samples available per class in each episode, bringing benefits for generalized learning. On the other hand, as the number of shots increases, a larger representation of features seems to lead to greater benefits in training. With 5– or 10–shot per class, a larger feature space upstream of the comparison module facilitates feature extraction from the available support samples and yields better generalization results. The effect of overfitting on individual tasks is clearly visible by comparing the accuracy obtained with the \(4 \cdot 4\) feature size between the 5– and 10–shot experiments. Contrary to the common scenario, the performance of the model worsens as the number of shots doubles. Without tuning the other hyperparameters, the small feature size favors single-task adaptation rather than generalized learning, thus reducing the overall performance.

Table 7 Accuracy achieved for the Weighting-Injection Net with varying feature size on people counting (4 classes): S-Test-Dataset. The chosen embedding size g is 64
Table 8 Mean classification accuracy achieved by the various algorithms, for experiments on people counting (4 classes): S-Test-Dataset

4.1.1 Comparison with the state-of-the-art and limitations

In this section, the results of Weighting-Injection Net and MAMW are compared to the results of other state-of-the-art meta learning methods for the task of people counting. Reptile [32] is used as a baseline algorithm. The other algorithms used for comparison are MAML 2nd [31] and a more stable and better-performing version of MAML presented by Antoniou et al. [52]. The latter, labeled MAML+, leverages the contributions of multi-step loss optimization (MSL), derivative-order annealing (DA), and cosine annealing of the meta-optimizer learning rate (CA). The model chosen for the state-of-the-art algorithms is a CNN suitable for the generalization goal, consisting of four main blocks. Each of the first three blocks consists of a Conv2D layer, with 64, 128, and 256 filters respectively, followed by batch normalization and the ReLU activation function. The last block consists of a dense layer with 4 neurons, corresponding to the number of classes. This topology has 403,332 trainable parameters, compared to the 283,379 of MAMW and the Weighting-Injection Net. The adaptation training was done with Adam as the optimizer, with learning rates of \(8 \times 10^{-3}\) and \(7 \times 10^{-3}\) in the inner and outer cycles, respectively. In this case as well, the values of \(\beta _{1}\) and \(\beta _{2}\) for Adam were set to 0 and 0.5, respectively. The model was trained on 22,000 episodes with a batch size of 2 and 4 epochs per task. The comparison was performed on 10,000 final tasks on the S-Test-Dataset for 1–, 2–, 5– and 10–shot over 3 repetitions of each experiment. For each task, 10 test samples per class were randomly selected, resulting in 40 test instances in total. The computed mean classification accuracy values are listed in Table 8. As can be observed, MAMW is the best-performing method in all experiments apart from the 10–shot experiment, where, as commented in Section 5.1, the Weighting-Injection Net achieves a higher average accuracy.
The accuracy values obtained with the proposed methods are better despite using 30% fewer trainable parameters. As the number of shots increases, the accuracy gap of relation-based models over optimization-based ones widens further, owing to the more robust prediction obtained by averaging the comparison vectors computed for the available support samples.
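The parameter counts quoted for the comparison CNN can be checked with a short calculation. The sketch below rests on assumptions that are not stated in the text: a single-channel \(32 \cdot 64\) input, \(3 \cdot 3\) kernels with bias, trainable scale and shift in each batch normalization layer, and a \(2 \cdot 2\) pooling step after each convolutional block. Under these assumptions the total equals the reported 403,332, which is consistent with, but does not confirm, this configuration.

```python
def conv_bn_params(c_in, c_out, k=3):
    """Trainable parameters of a k x k Conv2D (with bias) plus batch norm (gamma, beta)."""
    conv = k * k * c_in * c_out + c_out
    bn = 2 * c_out
    return conv + bn

# Assumed (not given in the text): 1-channel 32 x 64 input, 3 x 3 kernels,
# size-preserving padding, and 2 x 2 pooling after each of the three blocks.
height, width, channels = 32, 64, 1
total = 0
for filters in (64, 128, 256):
    total += conv_bn_params(channels, filters)
    channels = filters
    height, width = height // 2, width // 2

flat = height * width * channels   # 4 * 8 * 256 = 8,192 flattened features
total += flat * 4 + 4              # dense layer with 4 output neurons

proposed = 283_379                 # reported for MAMW and the Weighting-Injection Net
reduction = 1 - proposed / total
print(total, f"{reduction:.1%}")
```

The resulting reduction of about 29.7% corresponds to the roughly 30% fewer trainable parameters stated above.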

Because of the direct mapping between sample and label in the learning process, the single-sample inference time for Reptile, MAML 2nd and MAML+ is independent of the number of shots. Across all the experiments, averaged over 10,000 final tasks, the estimated inference time was 33.47 ms. Compared with the results in Table 6, the pure optimization-based methods are 25% faster for single inference only in the 10–shot experiments, whereas they are slower in the other configurations.

The task adaptation time needed by the various algorithms is provided in Table 9. The considered state-of-the-art methods require an adaptation time per task that rises considerably as the number of shots increases. In contrast, relation-based models, thanks to their comparison-based topology, do not require adaptation for new tasks and therefore have zero adaptation time. This is a great advantage of relational topologies over traditional optimization-based topologies.
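Why a comparison-based model needs no adaptation step can be illustrated with a toy example: a new task only supplies new support samples, and the query is classified by comparing it with them. The sketch below is a simplification and not the authors' network. Cosine similarity stands in for the learned comparison module, the scores (rather than comparison vectors) are averaged over the shots of each class, and the feature vectors and function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity, used here as a stand-in for a learned comparison module."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def predict(query, support):
    """Classify a query by its average similarity to each class's support samples.

    support maps a class label to the list of support feature vectors (the shots).
    No weights are updated: a new task only changes the contents of `support`.
    """
    scores = {
        label: sum(cosine(query, s) for s in shots) / len(shots)
        for label, shots in support.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical 2-shot support set for two classes.
support = {
    0: [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    1: [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
}
label, scores = predict([0.8, 0.2, 0.0], support)
print(label, scores)
```

Adding shots only lengthens the lists being averaged, which is also why the per-sample inference cost of such models grows with the number of shots while their adaptation cost stays at zero.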

To test the application limits of the episodic learning approach for radar-based people counting, experiments were also conducted with up to five people per session in the big room B (Section 3.6). In this case, five sessions of one minute each per location and number of people were collected and used. Locations A and C were used to generate training tasks, and locations B and D were used for testing tasks. Table 10 presents the results obtained on test data, averaged over three experiments and 10,000 final tasks. The results for this meta-dataset show similar characteristics to those where an entire room is used exclusively for testing. In general, the two proposed approaches outperform the state of the art regardless of the number of shots. MAMW proves more stable and performs better in experiments with very few shots (1– and 2–shot). The Weighting-Injection Net, on the other hand, outperforms MAMW in the 5– and 10–shot experiments. The extension of the counting approach to up to five people and the limited radar resolution for close targets in this scenario make generalization more complex. The increased complexity is reflected in the RDI input instances and in the features across the different recording locations. For this reason, with a larger number of shots, MAMW performs worse, since it favors noise filtering in support samples over the classification of query instances. The Weighting-Injection Net, on the other hand, focuses directly on learning the query class and performs better in this scenario.

In general, although the proposed algorithms outperform the state of the art, they reach an average accuracy of less than 60% over the six classes in the 10–shot setting. This shows that the purely episodic generalization approach with a few shots is limited to scenarios with a very small number of people. Adaptation to larger and more varied datasets or the use of radar sensors with higher resolution could overcome the current limitations. The weights of the counting model for up to five people need an in-memory size of 1,156 KB. This value is slightly larger than that of the model for up to three people. More information on a single experiment for the adaptation of up to five people is provided in Appendix B.

Table 9 Adaptation time per new task by algorithm and number of shots
Table 10 Mean classification accuracy achieved by the various algorithms, for people counting (6 classes): B room, B and D locations
Table 11 Accuracy on people counting (4 classes), obtained through pool-based sampling active learning
Fig. 16 Entropy pool-based active learning accuracy across epochs. The thicker lines highlight the best experiments by type of initialization. Accuracy values are averaged per trial every 20 epochs. Random initialization (green) experiments are more unstable and collapse to 25% random learning on 4 classes

4.2 Active learning experiments

Active learning experiments with Algorithm 2 are intended to demonstrate how meta learning-driven model initialization benefits task fine-tuning. All the experiments have been carried out on the task of radar-based people counting, using 75% and 25% of the data collected in the S room for training and testing, respectively. This means that active learning aims to boost the estimation performance in counting people in the entire small room, given all the locations in which the RDIs were collected. Since all the in-room locations are considered at once, the adaptation in this case is more complex than during episodic training. The uncertainty-based experiments used the priority scores \(S_{p}\) defined in (13), (14) and (15). As initialization, the parameters \(\theta \) obtained after the 1–shot episodic learning of Weighting-Injection Net and MAMW on the remaining two environments (M and B) have been used. As \(D_{p}\) grows larger, the experiments are limited to a maximum of five supports per class. The selected number of epochs for the active learning training is 6,000. For each epoch, 4 queries (J) are to be sampled, with A of them labeled using the uncertainty-based approach. Table 11 compares the average results from three experiments for each defined \(S_{p}\) score to the random initialization of \(\theta \). As can be seen from the table, the results for initialization based on MAMW and Weighting-Injection Net vary very little with the chosen priority score. Such initialization, however, leads to a large performance gain over the random one, which is also unstable in training across repetitions. The Weighting-Injection Net also seems to achieve slightly better performance than MAMW. This is most likely related to the large availability of labeled data, which, for a test room setup, makes this method perform better than MAMW (Section 5.1).
In the case of random initialization, however, the model succeeds in learning almost exclusively when entropy \(S_{e}\) is used as the scoring function. This may be due to the entropy formulation itself, which results in a more balanced query selection by taking into account the distribution over all classes for the score computation.
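Uncertainty-based query selection with an entropy score can be sketched as follows. Equations (13) to (15) are not reproduced in this section, so the example assumes the standard Shannon entropy of the predicted class probabilities as the priority score; the function names and the toy pool are hypothetical and do not reproduce Algorithm 2.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_queries(pool_probs, k):
    """Return the indices of the k pool samples with the highest predictive entropy."""
    ranked = sorted(
        range(len(pool_probs)),
        key=lambda i: entropy(pool_probs[i]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical predicted probabilities over 4 classes for a pool of 4 unlabeled samples.
pool = [
    [0.97, 0.01, 0.01, 0.01],  # confident prediction
    [0.25, 0.25, 0.25, 0.25],  # maximally uncertain
    [0.50, 0.50, 0.00, 0.00],  # split between two classes
    [0.70, 0.10, 0.10, 0.10],  # residual mass spread over three classes
]
chosen = select_queries(pool, k=2)
print(chosen)  # [1, 3]
```

In this toy pool the fourth sample ranks above the third even though its top probability is higher, because entropy accounts for how the remaining probability mass is spread over all classes, which is the property referred to above.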

The accuracy learning curve for the entropy-based experiments is depicted in Fig. 16. Adaptation starting from the Weighting-Injection Net and MAMW weights exhibits similar accuracy profiles as the training epochs progress. Random initialization, on the other hand, leads not only to lower accuracy but also to instability and experiment failure, collapsing to 25% accuracy over the four classes. In this case, the algorithm has difficulty generalizing to all locations from only a few training samples at a time. Fluctuations in the accuracy curves are due to adaptation to newly labeled data sampled from different S room locations, which normally display different features. This behavior can be observed in the t-SNE representations of the data in Section 3.6.

5 Conclusion

This paper shows how meta learning and active learning can be effectively employed for radar-based people counting using real-world data. For this use case, multiple meta-datasets are generated based on different combinations of rooms and radar orientations. Episodic learning for few-shot adaptation is carried out through a comparative approach. The model learns task-wise to map features of query examples to representative support instances belonging to the same class. In this way, the class of a radar instance is predicted by comparing it with representative support examples of the classes from zero to three people. Compared with the traditional weighting network, an injection module increases the input data dimensionality before the comparison step. This process facilitates the comparison of query and support features, reducing episodic task overfitting and aiding generalization. The overall topology with an injection module is called the Weighting-Injection Net.

An episodic adaptation algorithm called model-agnostic meta-weighting (MAMW) is then presented for adaptation with very few shots per task. This two-step training algorithm combines the weighting network topology and the optimization-based meta learning approach to enhance the feature extraction capabilities of the model. The approach features an inner-step task adaptation that compares support instances with a noisy version of themselves, leading to more stable generalization training, especially in the 1–shot setting. Finally, a pool-based active learning approach designed specifically for relation-based methods is presented. Using only the available samples with the highest prediction uncertainty, this algorithm seeks to minimize the number of examples needed for learning.

The presented meta learning approach achieves state-of-the-art accuracy in people counting while also yielding other performance advantages. The relation-based topology requires no training time for adaptation at new radar test locations. Furthermore, the availability of multiple support examples per class allows for more robust averaged query estimation. Both presented algorithms are up to 15% more accurate than the state of the art for 1– and 10–shot. They are also found to be up to 50% faster at single-sample inference when the model is tested on a new task. The active learning algorithm performs better and is more stable when the initialization is set to the episodically learned weights rather than at random. Nonrandom initialization improves radar adaptation accuracy by 30% on test room radar instances.

Despite the benefits shown, the presented work has only been tested offline on previously collected data. In the future, it will be important to test such a system in a real-time setting. Monitoring more than three people yields an accuracy that may be insufficient in several practical contexts. Future work will focus on using relation-based topologies and sensor fusion to counter the current limitations. The use of an unconventional injection module for relational networks could bring additional benefits for feature representation in episodic learning. In-depth studies will therefore be conducted on the possible applications and limitations of such a module. Research on the injection module will also be carried out with regard to the interpretability of neural networks and training complexity. In addition, further active learning and uncertainty sampling strategies that focus on episodic learning with relation-based approaches will be investigated.