1 Introduction

The increasing prevalence of smart technology in our daily lives, including self-learning and automated decision-making systems, can be attributed to advancements in Machine Learning (ML) [3, 9]. ML algorithms are utilized in applications such as Gmail’s spam filtering and YouTube’s video recommendations, enhancing their functionality. However, along with the benefits, ML also brings potential vulnerabilities that attackers can exploit for malicious purposes, jeopardizing system reliability.

One notable privacy attack, the Membership Inference Attack introduced by Shokri et al. in 2017 [13], and subsequent variants such as the Label-Only Membership Inference Attack [2] aim to distinguish records used during the training phase of an ML model from those that were not, operating under different assumptions. These attacks pose risks to privacy and confidentiality. Moreover, the reconstruction of training data can conflict with trade secrets, since some training data may derive from successful corporate experience and thus provide a competitive advantage; organizations holding such data are therefore reluctant to disclose it to competitors.

In this paper, we present Aloa, an improved variant of LabelOnly, which achieves high performance and stable prediction metrics. Unlike LabelOnly, Aloa calculates a data-agnostic robustness score without exploiting knowledge of the training data’s feature distribution. This score determines membership. The experimental results highlight that our attack achieves better stability and improves accuracy in predicting record membership by up to 3 percentage points. Even if this enhancement may seem small, in the privacy setting it is significant, since it means the adversary has a higher probability of re-identifying people in the dataset. In addition, we relax the assumption that the attacker needs a dataset following the same distribution as the original training dataset, making the attack easier to perform than its competitors.

The remainder of the paper is organized as follows. Section 2 discusses prior work related to ours; Sect. 3 introduces some preliminary notions useful for understanding the details of our attack. In Sect. 4 we describe how to learn and apply Aloa, while in Sect. 5 we present the experimental results on its performance. Lastly, Sect. 6 concludes the paper.

2 Related Work

In this Section, we contextualize our work with respect to the current literature. The issue of privacy has been addressed in several fields, both to assess privacy risk and to protect information systems from the dangerous disclosure of sensitive information. Disclosure of sensitive information may derive from directly accessing data [15] or from accessing ML models [1, 5, 13]. Indeed, ML models learn from data, and even if the data are not exposed but simply used during training, querying the model may still leak sensitive information about the people in the training dataset. In the context of data privacy, the first goal is to assess the privacy risk of the users represented in a dataset by using a privacy risk assessment methodology. Then, depending on the results of this assessment, a privacy protection technique can be applied to the data or the ML model to protect the users from malicious adversaries. Such protection techniques are based on well-defined privacy models, such as randomization, differential privacy, and k-anonymity [4, 12, 15], and transform the data or the ML model so as to guarantee certain thresholds on the risk of privacy leaks. In this paper, we focus on privacy risk assessment; therefore, in the following, we discuss the literature in this context.

The process of assessing privacy risk can be applied to either data or ML models. Pratesi et al. [11] proposed PRUDEnce, a framework enabling a systematic assessment of empirical privacy risk with respect to specific privacy attacks on data. In practice, it simulates an adversary that, for each individual, possesses the knowledge that maximizes the privacy risk of that individual: the framework generates all the possible background knowledge the adversary may have and assesses the risk with respect to the worst case. Similar approaches have been studied in recent years to detect privacy attacks against ML models. Shokri et al. [13] proposed the Membership Inference Attack (Mia), whose aim is to infer the membership of a given record in the training set of a classification model. Fredrikson et al. [5, 6] designed the so-called reconstruction attacks, in which the attacker’s objective is to reconstruct one or more training samples and their respective training labels. Another type of attack is the property inference attack [7], which extracts unintentionally learned information that is not explicitly encoded as features in the model; for instance, property inference attacks can uncover the gender ratio in the training dataset. Such attacks can be used in tandem with Mia or reconstruction attacks to enhance the adversary’s knowledge. Recently, Choquette-Choo et al. [2] proposed a variant of the original Membership Inference Attack, called the LabelOnly attack, in which some of the assumptions of Shokri’s attack are relaxed: Mia needs the probability vector to infer the membership of a record, while LabelOnly exploits only the hard labels. In our paper, we present Aloa, a variant of the LabelOnly attack that is completely agnostic with respect to both the data and the model.

3 Background

Before providing the details of our privacy attack against classification models, in this Section we introduce some basic notions that are fundamental for understanding our approach.

Classifier. A classifier is a function \(b:\mathcal {X}^{(m)} \rightarrow \mathcal {Y}\) which maps instances (tuples) x from a feature space \(\mathcal {X}^{(m)}\) with m input features to a decision y in a target space \(\mathcal {Y}\) of size \(L = |\mathcal {Y}|\), i.e., y can assume one of L different labels (\(L=2\) for binary classification). We use \(b(x) = y\) to denote the decision y taken by b, and we denote by \(\overline{y}_b\) the probability vector of size L whose values sum to one. An instance x consists of a set of m attribute-value pairs \((a_i, v_i)\), where \(a_i\) is a feature (or attribute) and \(v_i\) is a value from the domain of \(a_i\). The domain of a feature can be continuous or categorical. We assume the classifier is available as a function that can be queried at will.

Membership Inference Attack (Mia). Shokri et al. [13] assume that a ML algorithm is used to train a classifier b that captures the relationship between data records and their labels. To attack b trained on \(D_{b}^{train}\), Mia defines an attack model \(A(\cdot )\): a ML model able to discern whether a record was part of the training dataset \(D_{b}^{train}\) or not. Note that \(D_{b}^{train}\) is composed of pairs \((x^i,y^i_o)_b\), where \(y^i_o\) is the true label associated with \(x^i_b\). In practice, the attack \(A(\cdot )\) is a binary classifier that predicts in if the record was part of the training set and out otherwise. \(A(\cdot )\) is trained on a dataset \(D_{a}^{train}\): \((x^i, y^i)_a\), where each \(x^i_a\) is composed of the label predicted by the classifier b for the record under analysis and its probability vector \(\overline{y^i}\) of length L, obtained by querying a shadow model \(s^i(\cdot )\) mimicking b, while \(y^i_a\) is the correct membership label, which can be in or out. The attack model \(A(\cdot )\) is a voting model composed of L ML models: one for each output class of the classifier under attack. The key factor in this attack is the knowledge of the probability vector: given how the probabilities in \(\overline{y}_b\) are distributed around the true value of the record, the attack model computes the membership probability \(\text {Pr}\{ (x, y) \in D_b^{train} \}\), i.e., the probability that x belongs to the in class, meaning that it is part of the training set.

To obtain the dataset \((x^i, y^i)_a\), on which the Mia model \(A(\cdot )\) is trained, the authors used shadow models. In the original paper, the authors assume a black-box setting in which there is no knowledge about the type of classifier to be attacked or the training dataset used to train it. In the following, we use the term black-box model to indicate the classifier to be attacked. To overcome the absence of knowledge about the data and the model, they employed a set of k shadow models \(s^i(\cdot )\): ML models trained to mimic the decisions of the black-box model \(b(\cdot )\) we would like to attack. These shadow models are trained on \(D_{s}^{train}\): \((x^i, y^i)_s\), in which \(x^i_s\) has the same format and a similar distribution as the dataset employed to train the black-box model, while \(y^i_s\) is the predicted class obtained by querying the black-box model \(b(\cdot )\). After training, we know, for each shadow model, which records were part of its training dataset (class in) and which were part of its test set (class out). Hence, we can exploit this information to create the supervised dataset \(D_{a}^{train}\) on which the attack model \(A(\cdot )\) is trained.
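To make the construction of \(D_{a}^{train}\) concrete, the following is a minimal Python sketch under our reading of [13]. The helper name, the use of scikit-learn random forests as shadow models, and the data layout are illustrative assumptions, not part of the original attack.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def build_mia_attack_dataset(shadow_splits):
    """Assemble the attack training set D_a^train from a set of shadow models.

    shadow_splits: list of (X_tr, y_tr, X_te, y_te) tuples, one per shadow model,
    where the labels y_* come from querying the black-box b.
    Returns attack features (probability vector + predicted label) and the
    membership labels (1 = in, 0 = out).
    """
    feats, membership = [], []
    for X_tr, y_tr, X_te, y_te in shadow_splits:
        shadow = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
        for X, member in ((X_tr, 1), (X_te, 0)):
            proba = shadow.predict_proba(X)              # probability vectors of length L
            pred = proba.argmax(axis=1).reshape(-1, 1)   # predicted class
            feats.append(np.hstack([proba, pred]))
            membership.append(np.full(len(X), member))
    # In the original Mia, one attack model per output class is then trained on
    # the records whose predicted class matches that model's class (voting scheme).
    return np.vstack(feats), np.concatenate(membership)
```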

The datasets employed for training the shadow models are disjoint from the unknown dataset used to train the black-box model. Shokri et al. [13] tested different kinds of training data for the shadow models: (i) a random dataset, with randomly generated records labeled by querying the black-box model; (ii) a statistical one, with synthetic data generated exploiting the original statistical distribution; (iii) a noisy one, in which the attacker knows a noisy portion of data from the same distribution as the original training dataset. The different types of data for the shadow models allow for privacy attacks of different strengths.

Label-Only Membership Inference Attack (LabelOnly). Choquette-Choo et al. [2] designed a variant of Mia which relaxes some requirements of the original attack. Given a black-box model b, LabelOnly \(A_{LO}(\cdot )\) targets it by exploiting only the hard labels, i.e. the output predictions of the model under analysis. Hence, the probability vector \(\overline{y^i}\), employed by Mia, is not exploited in LabelOnly. The attack relies on a procedure that derives the robustness of a model to perturbations and uses it as a proxy for the model’s confidence in its predictions. The basic intuition is that records exhibiting high robustness belong to the training dataset. \(A_{LO}(\cdot )\) exploits a dataset \(D_{s}^{train}\) for training a single shadow model \(s(\cdot )\), i.e., a ML model mimicking the decisions of the black-box model b. The dataset \(D_{s}^{train}\): \((x^i, y^i)_s\) is composed of records with the same format and a similar distribution as the dataset used to train the black-box b, and is labeled with the predicted class obtained by querying b. With the shadow model, we know which records were part of its training dataset (in) and which were part of its test set (out). For each tuple \(x_s^i\), the algorithm generates a set of records by perturbing it and labels the generated records using the trained shadow model. By analyzing the percentage of generated records having the same predicted class as \(x_s^i\), the algorithm computes the robustness score of the black-box with respect to the classification of \(x_s^i\). Then, the attack uses an iterative thresholding procedure on the robustness scores, assigned to each record of the training and test datasets of the shadow model, to find a threshold on the scores that separates the records into in and out. The attack uses this threshold to classify new records as part of the training set of the black-box or not.
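For intuition, the following is a rough, hypothetical sketch of a LabelOnly-style perturbation: Gaussian noise on continuous features and Bernoulli flips on binary (one-hot) features. The split of features and the per-feature treatment are our assumptions; the default values loosely mirror the hyper-parameters reported later in Sect. 5, not the exact procedure of [2].

```python
import numpy as np

def labelonly_perturb(x, binary_mask, n=1000, flip_prob=0.006, sigma=0.04, rng=None):
    """Generate n perturbed copies of x: Gaussian noise on continuous features,
    Bernoulli flips on binary features (assumed 0/1 encoded)."""
    rng = rng or np.random.default_rng()
    X = np.tile(np.asarray(x, dtype=float), (n, 1))
    cont = ~binary_mask
    X[:, cont] += rng.normal(0.0, sigma, size=(n, int(cont.sum())))
    flips = rng.random((n, int(binary_mask.sum()))) < flip_prob
    X[:, binary_mask] = np.where(flips, 1.0 - X[:, binary_mask], X[:, binary_mask])
    return X
```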

4 Agnostic Label-Only Membership Inference Attack

[Algorithm 1: pseudo-code of the Aloa learning procedure]

In this paper, we present Aloa (Agnostic Label-Only membership inference Attack), a variant of the LabelOnly attack presented in Sect. 3. The LabelOnly attack assumes knowledge of the statistical distributions and the domains of the features in the training data. This knowledge is exploited to apply a perturbation to each feature tailored to its type and its statistical distribution. We propose a variant that is completely agnostic with respect to both the training data and the type of classifier to be attacked.

Threat Model. The objective of this kind of attack is to determine whether or not a given data record belongs to the training dataset of a specific classification model. To conduct an attack, the adversary can exploit specific prior knowledge that can be accessed. In this paper, we assume an adversary having black-box access to the classifier b: the adversary can only query the model to obtain a prediction and, as in [2], the model returns only hard labels to queries. The adversary does not know the model architecture, e.g., the type of classifier, but knows the total number of classes, the class labels, and the input format. In contrast to LabelOnly, Aloa does not require knowledge of the distributions of the original training dataset, neither for training the shadow model nor in the perturbation mechanism.

Learning ALOA. Given a black-box b, trained on \(D_{b}^{train}\), Aloa targets it by exploiting only the hard labels, i.e. \(b(x) = \widehat{y}\), and by deriving a robustness score through an agnostic data perturbation. This score enables Aloa to determine whether a record x belongs to the training data \(D_{b}^{train}\) of the black-box model under attack. The algorithm’s pseudo-code is reported in Algorithm 1. The process to create the Aloa model requires as input a dataset \(D_{s}\): \((x^i, y^i)_s\), in which \(x_s^i\) has the same format as the training data of b and \(y_s^i\) is the predicted class obtained by querying the black-box model b. Given its agnostic nature, Aloa does not rely on any assumptions about \(D_s\), which may even be a completely random dataset.

After querying the black-box model to label each \(x^i_s\), Aloa splits the dataset \(D_s\) into training and test sets, obtaining \(D_s^{train}\) and \(D_s^{test}\) respectively, and then trains one or more shadow models \(s^i(\cdot )\) on sub-samples of \(D_s^{train}\) (lines 6–7). The goal is to mimic the behavior of b while knowing which records are part of the shadow training set and which are not. In particular, as reported in Algorithm 1, Aloa constructs a dataset D in which it assigns the label in to each record in the training data of the shadow models and the label out to those belonging to their test data (lines 8–10).
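A minimal sketch of these steps (lines 6–10 of Algorithm 1) under our reading, using a single scikit-learn random forest as the shadow model; the function name and the model choice are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def build_inout_dataset(X_s, black_box, test_size=0.5, random_state=0):
    """Label X_s with the black-box's hard predictions, train a shadow model on
    one half, and tag each record as in (shadow train) or out (shadow test)."""
    y_s = black_box.predict(X_s)                       # hard labels only
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_s, y_s, test_size=test_size, random_state=random_state)
    shadow = RandomForestClassifier(n_estimators=100).fit(X_tr, y_tr)
    D = np.vstack([X_tr, X_te])
    membership = np.concatenate([np.ones(len(X_tr)), np.zeros(len(X_te))])  # 1=in, 0=out
    return shadow, D, membership
```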

At this point, Aloa performs its core process: the agnostic perturbation of the data used for training and testing a given shadow model (line 12, Algorithm 1). We call this procedure Noisy Neighborhood Generation and report its pseudo-code in Algorithm 2. For each data record \(x^i_s\) of the shadow dataset \(D_s\), it generates a neighborhood of n records obtained by perturbing the values of its attributes. Since the goal is to perturb each data record in its local vicinity without using any knowledge of the dataset’s domain or attribute distributions, making the algorithm completely domain agnostic, our perturbation mechanism adds noise to each attribute of the record under analysis. Given an instance \(x^i_s\) composed of m attribute-value pairs \((a_j, v_j)\), to perturb \(v_j\) Aloa adds or subtracts a noise value \(\nu = p \times v_j\) (lines 10–14, Algorithm 2). The value p is a percentage randomly drawn from a uniform distribution in the range \([p_{min}, p_{max}]\) (line 5, Algorithm 2). The noise value \(\nu \) is added or subtracted with equal probability (i.e., following a Bernoulli process with probability 50%).

[Algorithm 2: pseudo-code of the Noisy Neighborhood Generation procedure]
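The following is a minimal Python sketch of the Noisy Neighborhood Generation described above, assuming numeric (or already one-hot encoded) feature vectors; whether p is drawn once per record or once per attribute is our reading of the pseudo-code, and the function name is illustrative.

```python
import numpy as np

def noisy_neighborhood(x, n=1000, p_min=0.10, p_max=0.50, rng=None):
    """Agnostic perturbation: each attribute value v_j is perturbed by +/- p * v_j,
    with p drawn uniformly from [p_min, p_max] and the sign chosen with
    probability 0.5, using no knowledge of feature distributions or domains."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    p = rng.uniform(p_min, p_max, size=(n, x.size))    # noise percentages
    sign = rng.choice([-1.0, 1.0], size=(n, x.size))   # add or subtract, Bernoulli(0.5)
    return x + sign * p * x                            # n perturbed copies of x
```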

After this perturbation, Aloa computes, for each record in the shadow dataset, the robustness score, which estimates the confidence of the shadow model s in predicting the record’s label (line 13, Algorithm 1). This score is formally defined as follows:

$$rScore_{x_s^i}(N_{x_s^i}) = \begin{cases} 0 & \text{if } s(x_s^i) \ne b(x_s^i)\\ \dfrac{\sum_{x' \in N_{x_s^i}} F(s(x'),\, s(x_s^i))}{|N_{x_s^i}|} & \text{otherwise} \end{cases}$$
(1)

where \(F(s(x'),s(x_s^i))\) is a function returning 1 if the label the shadow model predicts for the neighbor \(x'\) is coherent with the label predicted for \(x_s^i\), and 0 otherwise. In other words, if the shadow model is faithful to the black-box model on \(x_s^i\), the robustness score of this record is the fraction of perturbed records whose labels are coherent with that of \(x_s^i\). The score takes values in the range [0, 1]: values close to 1 mean that the classifier is robust to perturbations, i.e., the model is confident in its prediction for the record; values close to 0 indicate low confidence, since in this case several neighbors have a class label different from that of the record under analysis, meaning that the record lies very close to the decision boundary.
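A direct Python sketch of Eq. (1), assuming scikit-learn-style shadow and black-box models and a neighborhood generated as above; the function name is illustrative.

```python
import numpy as np

def robustness_score(x, neighborhood, shadow, black_box):
    """Robustness score of Eq. (1): 0 if the shadow model disagrees with the
    black-box on x, otherwise the fraction of neighbors whose shadow-model
    prediction matches the shadow-model prediction for x."""
    x = np.asarray(x).reshape(1, -1)
    y_shadow = shadow.predict(x)[0]
    if y_shadow != black_box.predict(x)[0]:
        return 0.0
    return float(np.mean(shadow.predict(neighborhood) == y_shadow))
```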

Once each record of the shadow dataset has its robustness score, we obtain a dataset in which each record \(x_s^i\) is associated with its score \(rScore_{x_s^i}\) and with the label in, if \(x_s^i\) belongs to the training dataset of the shadow model, or out, if it belongs to the test dataset. Using an iterative thresholding procedure, Aloa finds the threshold value on the score that maximizes the accuracy in separating records with class labels in and out (line 15, Algorithm 1).
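The exact iterative thresholding procedure is not spelled out here, so the following sketch simply scans the observed score values for the threshold that maximizes accuracy; it should be read as an assumption-laden stand-in, not the paper's implementation.

```python
import numpy as np

def best_threshold(scores, membership):
    """Search the observed robustness scores for the threshold that maximizes
    the accuracy of the rule 'rScore >= threshold  =>  in (member)'."""
    scores = np.asarray(scores)
    membership = np.asarray(membership)        # 1 = in, 0 = out
    best_t, best_acc = 0.0, 0.0
    for t in np.unique(scores):
        acc = np.mean((scores >= t).astype(int) == membership)
        if acc > best_acc:
            best_t, best_acc = float(t), float(acc)
    return best_t, best_acc
```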

ALOA Application. Once Aloa has been trained, an adversary can use it to determine whether a given record belongs to the training dataset of the black-box model b. Given a record x with the same shape as the records in \(D_{b}^{train}\) on which the black-box was trained, our attack performs the following steps (a compact sketch combining them is given after the list):

1. Aloa applies the Noisy Neighborhood Generation procedure, presented in Algorithm 2, to the record x. The result is a set of synthetic neighbors \(N_{x}\) perturbed through our agnostic procedure;

2. Exploiting the neighborhood \(N_{x}\), Aloa computes the robustness score rScore of the record x by applying Eq. (1);

3. The best threshold value \(threshold_{best}\), found during the training of Aloa, is used to decide whether the record x is part of the training set: if \(rScore \ge threshold_{best}\), x is predicted as part of the training set, otherwise not.
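As referenced above, a compact, self-contained Python sketch composing the three steps; it reuses the same assumed perturbation as the earlier sketch, and the fitted shadow and black-box models are passed in as arguments.

```python
import numpy as np

def aloa_is_member(x, shadow, black_box, threshold_best,
                   n=1000, p_min=0.10, p_max=0.50, rng=None):
    """Apply a trained Aloa attack to a single record x."""
    rng = rng or np.random.default_rng()
    x = np.asarray(x, dtype=float)
    p = rng.uniform(p_min, p_max, size=(n, x.size))
    sign = rng.choice([-1.0, 1.0], size=(n, x.size))
    neighbors = x + sign * p * x                           # step 1: noisy neighborhood N_x
    y_shadow = shadow.predict(x.reshape(1, -1))[0]
    if y_shadow != black_box.predict(x.reshape(1, -1))[0]:
        score = 0.0                                        # step 2: robustness score, Eq. (1)
    else:
        score = float(np.mean(shadow.predict(neighbors) == y_shadow))
    return score >= threshold_best                         # step 3: threshold decision
```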

5 Experiments

In this Section, we report the results obtained by testing the Aloa attack presented in Sect. 4; the code is developed in Python 3.8 and is publicly available [10]. We organize the experiments as follows: first, we present the datasets used and their pre-processing; then, we describe the trained black-box models on which we tested the validity of our attack. Lastly, in Sect. 5.1 we present the results of the Aloa attacks on all the ML models, comparing their performance with the original Mia and LabelOnly attacks and discussing the privacy risk posed by each of them.

Datasets. We use three classification datasets, each with different characteristics.

We utilize the Adult dataset, a benchmark dataset with 48,842 records and 15 numerical or categorical variables. It contains information about employees, such as job, capital loss, marital status, etc. The labels are “\(\le \)50K” and “>50K”, referred to as Class 0 and Class 1 respectively; they indicate whether the individual’s annual income is below or above 50,000. Next, we exploit the Bank dataset, which contains information about a bank’s customers. It has 150,000 records and 10 numerical variables; the objective is to classify individuals as either good or bad creditors. Finally, we include the Synth dataset, a synthetic dataset generated using a Gaussian mixture model. It has 30,000 records, 30 numerical variables, and 15 classes. We chose this dataset to address the multi-class classification problem and to evaluate the attack’s behavior in a controlled environment based on synthetic data.

For Adult, we removed records with null values and performed a Pearson correlation analysis among the variables, dropping those with a correlation of at least 80%. For the categorical variables, we applied one-hot encoding. For the Bank dataset, null values were also removed and the same correlation analysis was performed. For the Synth dataset, since it was synthetically generated, no pre-processing was needed.

Following the pre-processing stage, we partitioned each dataset into two subsets: (i) \(70\%\) of the original dataset (\(D_b\)) was utilized for training and testing the black-box models, and (ii) the remaining \(30\%\) of the pre-processed data (\(D_s\)) was designated for training the attack models.
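A minimal pandas sketch of this pre-processing and splitting pipeline under our reading; the target column name, the exact rule for dropping correlated variables, and the function name are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

def preprocess_and_split(df, target, corr_threshold=0.80, random_state=0):
    """Drop null values, drop one of each pair of highly correlated numeric
    variables (|Pearson| >= corr_threshold), one-hot encode categoricals,
    then split 70% / 30% into D_b (black-box side) and D_s (attack side)."""
    df = df.dropna()
    num = df.drop(columns=[target]).select_dtypes("number")
    corr = num.corr(method="pearson").abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] >= corr_threshold).any()]
    df = df.drop(columns=to_drop)
    cat_cols = [c for c in df.columns if c != target and df[c].dtype == object]
    df = pd.get_dummies(df, columns=cat_cols)
    D_b, D_s = train_test_split(df, test_size=0.30, random_state=random_state)
    return D_b, D_s
```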

Black-Boxes.

Given each pre-processed dataset \(D_{b}\), we split it into \(D^{train}_{b}\) (\(70\%\) of it) and \(D^{test}_{b}\) (\(30\%\) of it). We use \(D^{train}_{b}\) for training the black-box models. The ML models selected are described in the following:

1. Decision Tree (dt), selected for its simplicity, but prone to overfitting and sensitive to noisy data;

2. Random Forest (rf), an ensemble model composed of multiple decision trees, with better performance than the dt;

3. Neural Network (nn), a feed-forward network with a number of hidden layers varying from 1 to 3, depending on the input data.

For all the models we trained two variants: a regularized one, with very good performance and a good level of generalization, and one overfitted on purpose, specific to the input training dataset and with poor generalization capabilities. This choice is motivated by the fact that Mia has been shown to lead to higher privacy risk when attacking overfitted models [14, 16]. For this reason, we also want to evaluate how privacy exposure changes with the level of overfitting of the black-boxes. We report the classification performance of these models in Table 1. The results in this table show that all the black-box models have an overall good performance, with comparable performance for the rf and nn models and a slightly worse prediction performance for the dt, as expected. The model performance reported in the table also shows a different behavior of the regularized models w.r.t. the overfitted ones.
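A sketch of how such paired variants could be obtained with scikit-learn, restricted to the tree-based models; the hyper-parameter values are placeholders chosen to illustrate regularization versus overfitting, not the settings used in the paper, and the nn variants are omitted.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

def train_blackbox_variants(X_train, y_train, X_test, y_test):
    """Train a regularized and a deliberately overfitted variant of each model."""
    models = {
        "dt":   DecisionTreeClassifier(max_depth=6, min_samples_leaf=20),
        "dt-o": DecisionTreeClassifier(max_depth=None, min_samples_leaf=1),
        "rf":   RandomForestClassifier(n_estimators=200, max_depth=8, min_samples_leaf=20),
        "rf-o": RandomForestClassifier(n_estimators=200, max_depth=None, min_samples_leaf=1),
    }
    for name, model in models.items():
        model.fit(X_train, y_train)
        # a large train-test accuracy gap signals overfitting (and, as discussed,
        # a higher expected privacy risk)
        print(name, model.score(X_train, y_train), model.score(X_test, y_test))
    return models
```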

Table 1. Prediction performance of the black-box models for all the datasets. We report the accuracy on both the training and test sets to appreciate the difference in generalization capability between the regularized and overfitted models. All the models achieve good performance.

5.1 Evaluation of ALOA and Comparison Against Competitors

In this Section, we present the privacy threats obtained by applying Aloa, LabelOnly and the original Mia to the trained black-boxes. To train all the attacks, we need a shadow dataset \(D_s\) having the same format as the data used for training the black-box model. We employed two variants of this dataset in our experiments, denoted \(D_s^{\text {stat}}\) and \(D_s^{\text {rand}}\): the former was designed to have the same statistical distribution as the original training dataset, whereas the latter was generated randomly. We used \(D_s^{\text {stat}}\) for learning the LabelOnly attack because the procedure described in [2] requires training the shadow models on a dataset with a distribution similar to that of the black-box training data, and it also exploits this distribution knowledge in the computation of the robustness score. Although Aloa does not require \(D_s^{\text {stat}}\), as it is agnostic to the training data distributions, we conducted experiments with both \(D_s^{\text {stat}}\) and \(D_s^{\text {rand}}\) to evaluate the effectiveness of Aloa and to have a more complete comparison with LabelOnly. To ensure a clear understanding of the attack performance, we balanced the \(D_s\) used for creating the attack models: 50% of the rows belong to class in and 50% to class out. This is the same setting used in [2], and it allows a clear comparison between our proposal and the attacks in the literature: in a balanced setting the attack accuracy cannot be inflated by the under- or over-representation of one class with respect to the other. In this way, if the attack achieves an accuracy above 50%, it performs better than random guessing and thus constitutes an actual membership leak.

The results of the attacks are reported in Table 2 for Adult, Bank and Synth. Aloa was run three times for each black-box, with \(n=1000\) perturbations for each record of \(D_s\) (the same n is used for training the LabelOnly attack), \(p_{min}\) and \(p_{max}\) set to 0.10 and 0.50 respectively, and a Bernoulli probability of 0.50 for adding or subtracting the noise value. Mia was created with 8 shadow models and nn as the final attack models. For LabelOnly we applied the same hyper-parameters as in [2]: \(n=1000\) perturbations, with a Bernoulli flip probability of \(0.60\%\) and Gaussian noise with \(\sigma = 0.04\). We remark that Mia and LabelOnly were tested on \(D_s^{\text {stat}}\), due to the assumptions they require, while Aloa was tested on both \(D_s^{\text {stat}}\) and \(D_s^{\text {rand}}\).

Regarding the Adult dataset, the performance of the Mia and LabelOnly attacks is consistent with that reported in their original papers. For Mia, the attack against regularized models is overall not effective, apart from the decision tree, with 51% accuracy. On the other hand, the overfitted models are easily attacked, in particular rf-o and nn-o. However, the attack on dt-o does not pose a privacy threat. This result may be due to the poor prediction performance of dt-o on Adult: the overall accuracy of the model is 48%, suggesting that the model is not able to learn patterns in the data, and hence the attack cannot extract sufficient information from its confidence. The LabelOnly attack is ineffective for all the regularized models, while it poses privacy threats for all the overfitted ones. Analyzing Aloa in both experimental settings, we obtain the same performance as LabelOnly on the overfitted models, with 54% for dt, 55% for rf, and 60% for nn. On the regularized models, instead, we generally obtain better performance: the attack gains 1–3 percentage points compared to LabelOnly. With Aloa based on \(D_s^{\text {stat}}\) we are always better than LabelOnly except for the regularized nn, for which we have the same performance. Hence, for Adult, Aloa poses the most severe privacy threat for both the overfitted and the regularized models. Among the ML models, the attack shows more privacy leakage for rf and nn. This finding is reasonable because, as highlighted in prior works, more complex models learn more information.

For the Bank dataset, the results are in line with those described for Adult, even if they are overall slightly lower. Interestingly, the improvement in the privacy threat posed by Aloa is more significant for the rf-o model (+3%) and smaller for nn-o (+1%). This result may be due to the different structure of this dataset, which is composed of only a few numerical variables.

On the Synth dataset we can better appreciate the effectiveness of Aloa: the trend is again that the attacks undermine privacy more in the case of overfitted models, while regularized ones remain at risk, but with a lower privacy exposure. Both Aloa and LabelOnly pose stronger privacy threats than Mia. Moreover, Aloa in both settings shows better or comparable performance with respect to LabelOnly, with an improvement for rf-o and nn-o. Comparing the two experimental settings of Aloa, our results indicate that the performance of our attack is consistent across \(D_s^{\text {stat}}\) and \(D_s^{\text {rand}}\), with a discrepancy of at most 1% in accuracy. More importantly, they also demonstrate that even if our attack assumes an adversary with weaker knowledge than LabelOnly, it achieves higher or comparable privacy risks. These findings have significant implications for privacy protection in ML models.

Overall, the experiments show that Aloa poses a worrying privacy risk, especially if the model is overfitted: the more complex a model is, the easier it is to overfit and the higher the privacy leakage. Comparing Aloa against the LabelOnly attack, for the overfitted models we obtain comparable or better performance. This behavior may be the result of the agnostic perturbation we perform, which is independent of the distributions of the input variables; hence Aloa is not affected by slight changes in the data. We remark that this property holds both when we use \(D_s^{\text {stat}}\) and when we use \(D_s^{\text {rand}}\), since the perturbation mechanism always remains agnostic. Comparing Aloa against the original Mia, the performance of our method is overall better, with the exception of rf-o. For this model, in fact, the accuracy of the Mia attack is always higher than that of both LabelOnly and Aloa, highlighting that in the case of an overfitted rf the added knowledge of the prediction probability has a greater impact. However, for the regularized rf and nn, Mia shows higher accuracy and precision for the in class but an extremely low recall, and hence F1-score, showing that this attack is unstable.

Aloa performs overall better than LabelOnly, with an improvement of up to 3%. This is a significant improvement in the context of privacy assessment, where every gain in performance can shed light on the privacy leakage of a model. Aloa is more stable, and the perturbation it performs is data agnostic, requiring no knowledge of the feature distributions. Importantly, our attack also shows better results than its competitors in attacking regularized models.

Table 2. Results of the attacks on the three datasets for all the selected black-box models. The highest privacy risks are highlighted in bold. We remark that for Mia and LabelOnly we exploit the statistical dataset \(D_s^{\text {stat}}\), while Aloa was tested on both \(D_s^{\text {stat}}\) and \(D_s^{\text {rand}}\), showing that it is completely agnostic w.r.t. the data. Overall, Aloa poses the highest privacy threats and exhibits good stability, since it achieves similar performance across all the datasets.

Comparison Between Regularized and Overfitted Models. Recently, several works have empirically shown that if the model being attacked is overfitted, the attack is much more damaging to the users in the training set [14, 16]. For this reason, we study the behavior of both models that generalize well and models that overfit. From the results in Table 2, all the overfitted models exhibit a higher degree of privacy leakage than the regularized models, as evidenced in all three datasets, particularly in the third one. This dataset highlights the vulnerability of models that are not properly regularized and exhibit a gap between training and test accuracy. As outlined in [8], the gap between training and test accuracy is directly proportional to the effectiveness of an attack: the larger the gap, the more effective the attack. To better analyze this aspect, we took advantage of the Synth dataset, which allows a controlled study in which ML models achieve excellent performance and are easy to overfit. In Fig. 1 we examine the difference in the performance of Aloa for the nn and nn-o models trained on the Synth dataset. In particular, we present a box plot of the robustness scores which shows that the overfitted model exhibits a larger difference between the average in and out robustness scores, potentially enabling an attacker to distinguish between the two classes more easily. In this way, we empirically confirm the link between model overfitting, privacy risk, and the train-test gap [8].

Fig. 1. Box plots of the robustness scores for the overfitted and regularized nn on Synth. The overfitted model exhibits a larger difference between the average in and out robustness scores, which could enable an attacker to distinguish between the two classes more easily; this confirms the link between model overfitting, privacy risk, and the train-test gap [8]. The regularized model, on the other hand, displays a smaller gap between the two classes, making them more difficult to separate.

Analysis on the Number of Shadow Models. There are conflicting practices in the literature about the use of shadow models, i.e., models that mimic the behavior of the original black-box: in the first publication of Mia [13] the authors used a large number of shadow models, while LabelOnly [2] uses only one. In our paper, we present the results obtained with only one shadow model, after having analyzed the effect of using different numbers of shadow models: our results highlight that for Aloa using one or k shadow models does not lead to any improvement. This behavior can be seen in Fig. 2, in which the performance of the attack on Adult is the same whether one or 10 shadow models are used. Given this finding, our experiments were conducted with just one shadow model to limit the computational cost.

Fig. 2. Performance of Aloa when varying the number of shadow models from 1 to 10 for the nn-o trained on the Adult dataset. The performance of the attack is clearly not affected by the number of shadow models.

6 Conclusion

We presented Aloa, a variant of the LabelOnly attack. Our proposed attack is completely data agnostic, both in the shadow model training and in the perturbation mechanism. In particular, the perturbation does not exploit knowledge of the statistical distributions and domains of the features in the training data. Our results demonstrate that Aloa outperforms the traditional LabelOnly attack with an improvement of up to 3% in terms of attack accuracy, although it assumes an adversary with weaker prior knowledge. This improvement is significant in the context of privacy assessment, where every gain in performance can provide valuable, sensitive insights into the people represented in the data. The agnostic nature of our attack raises concerns regarding privacy protection, as it can be executed without any specific assumptions. Moreover, Aloa exhibits excellent stability in terms of prediction performance, outperforming standard Mia and other attacks when targeting regularized models. In summary, Aloa offers a robust and effective approach for assessing the privacy of ML models.